CV Paper Reading Roundup

2026/5/6 1:04:37

Year	Name	Area	Model	Description	Drawback
2021 ICML	CLIP (Contrastive Language-Image Pre-training)	contrastive learning, zero-shot learning, multimodal		Uses text as the supervision signal to train transferable visual models	Zero-shot CLIP is comparable to a supervised ResNet-50 but is not yet SOTA; the authors estimate that reaching SOTA would take roughly 1000x more compute, which is infeasible; zero-shot performance is poor on certain datasets, such as fine-grained classification and abstract tasks; CLIP is robust to natural distribution shift but still generalizes poorly out of domain, i.e., it performs poorly when the test distribution differs significantly from the training set; CLIP does not address the data inefficiency of deep learning, and training it requires a large amount of data
2021 ICLR	ViT (Vision Transformer)			Applies the Transformer to vision: simple, efficient, scalable	With enough pre-training data, ViT outperforms CNNs, overcoming the Transformer's lack of inductive bias, and transfers well to downstream tasks
2022	DALL-E		Text-based generative model
2021 ICCV	Swin Transformer			Uses shifted windows and a hierarchical structure to address the high computational cost of Transformers; essentially a CNN in Transformer clothing
2021	MAE (Masked Autoencoders)	self-supervised		The CV counterpart of BERT	Scalable; enables very high-capacity models that generalize well
	TransMed: Transformers Advance Multi-modal Medical Image Classification
	I3D
2021	Pathway
2021 ICML	ViLT			Vision-and-language multimodal Transformer	Modest performance; fast inference, but very slow training
	ALBEF			Align before Fuse: to clean up noisy data, proposes using a momentum model to generate pseudo-targets
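
To make the CLIP row's "text as the supervision signal" concrete, here is a minimal pure-Python sketch of a symmetric image-text contrastive loss. It is an illustration under assumptions, not the authors' code: the function names (`l2_normalize`, `cross_entropy`, `clip_style_loss`) are hypothetical, the embeddings are plain lists of floats standing in for encoder outputs, and the temperature is fixed at 0.07 rather than learned.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    # Assumption: image_embs[k] and text_embs[k] form the k-th matched pair.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text: each image should pick its own caption among all captions.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each caption should pick its own image (column-wise).
    loss_t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss falls as each image moves closer to its own caption and away from the other captions in the batch.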
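
The momentum-model idea in the ALBEF row can be sketched as follows. This is a hedged illustration, not ALBEF's implementation: the names (`ema_update`, `softmax`, `distilled_targets`) are hypothetical, parameters are flat lists of floats, and the values `m=0.995` and `alpha=0.4` are illustrative defaults. The sketch assumes the momentum model is an exponential moving average of the online model and that its softmax output is blended with the noisy one-hot label to form a softer target.

```python
import math

def ema_update(momentum_params, online_params, m=0.995):
    # The momentum model trails the online model as an exponential moving average.
    return [m * pm + (1.0 - m) * po
            for pm, po in zip(momentum_params, online_params)]

def softmax(logits):
    # Numerically stable softmax over a list of scores.
    mx = max(logits)
    exps = [math.exp(x - mx) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distilled_targets(one_hot, momentum_logits, alpha=0.4):
    # Pseudo-targets are the momentum model's predicted distribution.
    pseudo = softmax(momentum_logits)
    # Blend the (possibly noisy) one-hot label with the pseudo-targets.
    return [(1.0 - alpha) * y + alpha * q for y, q in zip(one_hot, pseudo)]
```

Because the blended target is still a probability distribution, it can be used in a standard cross-entropy loss in place of the one-hot label, which softens the penalty when a web-crawled caption only loosely matches its image.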