Preface: The last time I read one of Kaiming He's papers was two years ago, when the design of ResNet completely won me over. Two years on, the new waves of the field have already left me stranded on the beach.
Momentum Contrast for Unsupervised Visual Representation Learning
Abstract
We present MoCo, a method for unsupervised representation learning. Viewing contrastive learning as dictionary look-up, we build a dynamic dictionary as a queue together with a moving-averaged encoder. This makes it possible to build a large and consistent dictionary on the fly that facilitates unsupervised learning. Under the common linear protocol (a linear classification head), MoCo achieves strong classification results on ImageNet, and the learned features transfer well to downstream tasks: MoCo obtains the best results on 7 detection/segmentation tasks, in some cases by large margins. This suggests that the gap between unsupervised and supervised learning in vision tasks is closing.
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
Introduction
Unsupervised learning has been very successful in NLP, but in vision, supervised pre-training is still the mainstream and unsupervised methods lag behind.
The main reason is likely the difference between the corresponding signal spaces. Language tasks have discrete signal spaces made up of individual words, so tokenized dictionaries can be built and unsupervised learning can be based on them. For vision tasks, however, dictionary building has to deal with raw signals in a continuous, high-dimensional space that is not structured the way human communication is.
Unsupervised representation learning is highly successful in natural language processing, e.g., as shown by GPT and BERT [12]. But supervised pre-training is still dominant in computer vision, where unsupervised methods generally lag behind. The reason may stem from differences in their respective signal spaces. Language tasks have discrete signal spaces (words, sub-word units, etc.) for building tokenized dictionaries, on which unsupervised learning can be based. Computer vision, in contrast, further concerns dictionary building [54, 9, 5], as the raw signal is in a continuous, high-dimensional space and is not structured for human communication (e.g., unlike words).
Several studies have also produced promising results on unsupervised learning using contrastive approaches. Whatever their motivations, these methods can all be viewed as building dynamic dictionaries. The keys in the dictionary are sampled from the data, and their features are extracted by an encoder network. Unsupervised learning trains this encoder to perform dictionary look-up: an encoded query should be similar to its matching key and dissimilar to the non-matching keys. This learning can be formulated as minimizing a contrastive loss.
Several recent studies [61, 46, 36, 66, 35, 56, 2] present promising results on unsupervised visual representation learning using approaches related to the contrastive loss [29]. Though driven by various motivations, these methods can be thought of as building dynamic dictionaries. The “keys” (tokens) in the dictionary are sampled from data (e.g., images or patches) and are represented by an encoder network. Unsupervised learning trains encoders to perform dictionary look-up: an encoded “query” should be similar to its matching key and dissimilar to others. Learning is formulated as minimizing a contrastive loss [29].
From this perspective, we hypothesize that such a dictionary should be (i) large and (ii) consistent. Intuitively, a larger dictionary can better sample the continuous, high-dimensional visual space, while the keys in the dictionary should all be encoded by the same (or a similar) encoder so that their comparisons to the query are consistent.
However, existing methods are limited in one or the other of these two respects.
From this perspective, we hypothesize that it is desirable to build dictionaries that are: (i) large and (ii) consistent as they evolve during training. Intuitively, a larger dictionary may better sample the underlying continuous, high dimensional visual space, while the keys in the dictionary should be represented by the same or similar encoder so that their comparisons to the query are consistent. However, existing methods that use contrastive losses can be limited in one of these two aspects (discussed later in context).
We propose MoCo, Momentum Contrast, as shown in the figure. We maintain the dictionary samples with a queue: the representations of the most recently encoded keys are enqueued, while the oldest keys are pushed out. The queue decouples the dictionary size from the batch size, so a much larger dictionary can be built without being constrained by the limited batch size a machine can hold.
Second, the keys in the dictionary come from recent mini-batches, and the key encoder that produces them is updated slowly and gradually, which we achieve with a momentum update. This keeps the whole queue consistent.
We present Momentum Contrast (MoCo) as a way of building large and consistent dictionaries for unsupervised learning with a contrastive loss (Figure 1). We maintain the dictionary as a queue of data samples: the encoded representations of the current mini-batch are enqueued, and the oldest are dequeued. The queue decouples the dictionary size from the mini-batch size, allowing it to be large. Moreover, as the dictionary keys come from the preceding several mini-batches, a slowly progressing key encoder, implemented as a momentum-based moving average of the query encoder, is proposed to maintain consistency.
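A minimal PyTorch-style sketch of the two ideas above: a queue that holds encoded keys from preceding mini-batches, and a key encoder updated as a momentum-based moving average of the query encoder. The names and hyper-parameter values (`K`, `m`, `feature_dim`) are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

K = 65536            # dictionary (queue) size, assumed value
m = 0.999            # momentum coefficient for the key encoder, assumed value
feature_dim = 128    # dimensionality of the encoded keys, assumed value

# the queue stores L2-normalized keys column-wise: shape (feature_dim, K)
queue = F.normalize(torch.randn(feature_dim, K), dim=0)
queue_ptr = 0

@torch.no_grad()
def momentum_update(encoder_q, encoder_k):
    """Key encoder slowly tracks the query encoder (no gradients flow here)."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def dequeue_and_enqueue(keys):
    """Enqueue the newest mini-batch of keys, overwriting the oldest ones.
    Assumes K is divisible by the batch size."""
    global queue_ptr
    batch_size = keys.shape[0]
    queue[:, queue_ptr:queue_ptr + batch_size] = keys.T
    queue_ptr = (queue_ptr + batch_size) % K
```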
MoCo is a mechanism for building dynamic dictionaries for contrastive learning, and it can be used with various pretext tasks. In this paper we follow a simple instance discrimination task: a query matches a key if they are encoded from the same image. With this pretext task, MoCo shows very competitive results.
MoCo is a mechanism for building dynamic dictionaries for contrastive learning, and can be used with various pretext tasks. In this paper, we follow a simple instance discrimination task [61, 63, 2]: a query matches a key if they are encoded views (e.g., different crops) of the same image. Using this pretext task, MoCo shows competitive results under the common protocol of linear classification in the ImageNet dataset [11].
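A rough sketch of the instance discrimination pretext task described above: two random augmentations ("views") of the same image form a query/positive-key pair, while views of other images act as negative keys. The specific augmentations here are assumptions for illustration.

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

def make_query_key_pair(pil_image):
    # two independent augmentations of the same image -> a positive pair
    return augment(pil_image), augment(pil_image)
```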
A main purpose of unsupervised learning is to pre-train representations for downstream tasks. We show results on 7 downstream detection and segmentation tasks; MoCo unsupervised pre-training performs well on all of these datasets and surpasses the supervised counterpart by a clear margin on some of them. In the experiments we also pre-train on a billion-scale image set, showing that MoCo works well in the more realistic setting of billions of unlabeled, real-world images. All of this confirms that for vision tasks, unsupervised learning can serve as an alternative to supervised pre-training.
A main purpose of unsupervised learning is to pre-train representations (i.e., features) that can be transferred to downstream tasks by fine-tuning. We show that in 7 downstream tasks related to detection or segmentation, MoCo unsupervised pre-training can surpass its ImageNet supervised counterpart, in some cases by nontrivial margins. In these experiments, we explore MoCo pre-trained on ImageNet or on a one-billion Instagram image set, demonstrating that MoCo can work well in a more real-world, billion image scale, and relatively uncurated scenario. These results show that MoCo largely closes the gap between unsupervised and supervised representation learning in many computer vision tasks, and can serve as an alternative to ImageNet supervised pre-training in several applications.
Related Work
Unsupervised/self-supervised learning involves two aspects: pretext tasks and loss functions. "Pretext" means that solving the task is not the real goal; what we actually want is the good data representation that emerges while solving it.
Loss functions, in turn, can be studied independently of pretext tasks. MoCo focuses mainly on the loss-function aspect. We discuss related work from these two aspects below.
Unsupervised/self-supervised learning methods generally involve two aspects: pretext tasks and loss functions. The term “pretext” implies that the task being solved is not of genuine interest, but is solved only for the true purpose of learning a good data representation. Loss functions can often be investigated independently of pretext tasks. MoCo focuses on the loss function aspect. Next we discuss related studies with respect to these two aspects.
Loss functions
A common way to define a loss function is to measure the difference between a prediction and a fixed target, e.g., with an L1 or L2 loss, or to classify the input into certain pre-defined categories and use a cross-entropy or margin-based loss. Other alternatives, discussed next, are also possible.
Loss functions. A common way of defining a loss function is to measure the difference between a model’s prediction and a fixed target, such as reconstructing the input pixels (e.g., auto-encoders) by L1 or L2 losses, or classifying the input into pre-defined categories (e.g., eight positions [13], color bins [64]) by cross-entropy or margin-based losses. Other alternatives, as described next, are also possible.
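A tiny illustration (with assumed tensor shapes) of the "fixed target" losses mentioned above: an L2 reconstruction loss where the target is the input itself, and a cross-entropy loss over pre-defined categories.

```python
import torch
import torch.nn.functional as F

x = torch.rand(8, 3, 32, 32)              # a batch of input images
x_rec = torch.rand(8, 3, 32, 32)          # e.g., an auto-encoder's reconstruction
l2_loss = F.mse_loss(x_rec, x)            # fixed target: the input pixels

logits = torch.randn(8, 10)               # predictions over 10 pre-defined classes
labels = torch.randint(0, 10, (8,))       # fixed categorical targets
ce_loss = F.cross_entropy(logits, labels)
```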
Contrastive losses measure the similarity of matched pairs in a representation space. Instead of matching the input to a fixed target, a contrastive loss lets the target vary on the fly during training, defined by the representation computed by the network. Contrastive learning is at the core of several recent works on unsupervised learning, and it is also what we use in Section 3.
Contrastive losses [29] measure the similarities of sample pairs in a representation space. Instead of matching an input to a fixed target, in contrastive loss formulations the target can vary on-the-fly during training and can be defined in terms of the data representation computed by a network [29]. Contrastive learning is at the core of several recent works on unsupervised learning [61, 46, 36, 66, 35, 56, 2], which we elaborate on later in context (Sec. 3.1).
Adversarial losses measure the difference between probability distributions and have been a successful technique for unsupervised data generation. Adversarial approaches to representation learning have also been studied, and connections between generative adversarial networks and noise-contrastive estimation have been discussed in the literature.
Adversarial losses [24] measure the difference between probability distributions. It is a widely successful technique for unsupervised data generation. Adversarial methods for representation learning are explored in [15, 16]. There are relations (see [24]) between generative adversarial networks and noise-contrastive estimation (NCE) [28].
Pretext tasks. A wide range of pretext tasks have been proposed, e.g., recovering the input under some corruption (denoising, context, or cross-channel auto-encoders).
Other pretext tasks form pseudo-labels in some way, e.g., from transformations of a single image, patch orderings, tracking or segmenting objects in videos, or clustering features.
Pretext tasks. A wide range of pretext tasks have been proposed. Examples include recovering the input under some corruption, e.g., denoising auto-encoders [58], context autoencoders [48], or cross-channel auto-encoders (colorization) [64, 65]. Some pretext tasks form pseudo-labels by, e.g., transformations of a single ("exemplar") image [17], patch orderings [13, 45], tracking [59] or segmenting objects [47] in videos, or clustering features [3, 4].
Contrastive learning vs. pretext tasks. Various pretext tasks can be based on some form of contrastive loss. For example, the instance discrimination method is related to the exemplar-based task [17] and NCE, the pretext task in contrastive predictive coding (CPC) is a form of context auto-encoding [48], and in contrastive multiview coding (CMC) [56] it is related to colorization [64].
Contrastive learning vs. pretext tasks. Various pretext tasks can be based on some form of contrastive loss functions. The instance discrimination method [61] is related to the exemplar-based task [17] and NCE [28]. The pretext task in contrastive predictive coding (CPC) [46] is a form of context auto-encoding [48], and in contrastive multiview coding (CMC) [56] it is related to colorization [64].
Method
3.1 Contrastive Learning as Dictionary Look-up
Contrastive learning and its recent developments can be thought of as training an encoder for a dictionary look-up task, described in detail next.
Contrastive learning [29], and its recent developments, can be thought of as training an encoder for a dictionary look-up task, as described next.
Consider an encoded query q and a set of encoded samples {k0, k1, k2, ...} that form the keys of a dictionary. Assume there is a single key k+ in the dictionary that q matches. A contrastive loss is a function whose value is low when q is similar to its positive key k+ and dissimilar to all other (negative) keys. Measuring similarity by dot product, this paper adopts a form of contrastive loss called InfoNCE, given below.
Consider an encoded query q and a set of encoded samples {k0, k1, k2, …} that are the keys of a dictionary. Assume that there is a single key (denoted as k+) in the dictionary that q matches. A contrastive loss [29] is a function whose value is low when q is similar to its positive key k+ and dissimilar to all other keys (considered negative keys for q). With similarity measured by dot product, a form of a contrastive loss function, called InfoNCE [46], is considered in this paper:
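For reference, the InfoNCE loss referred to above can be written as follows, where $\tau$ is a temperature hyper-parameter, $k_+$ is the single positive key, and the sum runs over one positive and $K$ negative keys:

$$
\mathcal{L}_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i=0}^{K} \exp(q \cdot k_i / \tau)}
$$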