{"title": "Contrastive Learning for Image Captioning", "book": "Advances in Neural Information Processing Systems", "page_first": 898, "page_last": 907, "abstract": "Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learning (CL), for image captioning. Specifically, via two constraints formulated on top of a reference model, the proposed method can encourage distinctiveness, while maintaining the overall quality of the generated captions. We tested our method on two challenging datasets, where it improves the baseline model by significant margins. We also showed in our studies that the proposed method is generic and can be used for models with various structures.", "full_text": "Contrastive Learning for Image Captioning\n\nBo Dai\n\nDahua Lin\n\nDepartment of Information Engineering, The Chinese University of Hong Kong\n\ndb014@ie.cuhk.edu.hk\n\ndhlin@ie.cuhk.edu.hk\n\nAbstract\n\nImage captioning, a popular topic in computer vision, has achieved substantial\nprogress in recent years. However, the distinctiveness of natural descriptions is\noften overlooked in previous work. It is closely related to the quality of captions,\nas distinctive captions are more likely to describe images with their unique aspects.\nIn this work, we propose a new learning method, Contrastive Learning (CL), for\nimage captioning. Speci\ufb01cally, via two constraints formulated on top of a reference\nmodel, the proposed method can encourage distinctiveness, while maintaining the\noverall quality of the generated captions. We tested our method on two challenging\ndatasets, where it improves the baseline model by signi\ufb01cant margins. 
We also showed in our studies that the proposed method is generic and can be used for models with various structures.

1 Introduction

Image captioning, a task to generate natural descriptions of images, has been an active research topic in computer vision and machine learning. Thanks to the advances in deep neural networks, especially the wide adoption of RNNs and LSTMs, there has been substantial progress on this topic in recent years [23, 24, 15, 19]. However, studies [1, 3, 2, 10] have shown that even the captions generated by state-of-the-art models still leave a lot to be desired. Compared to human descriptions, machine-generated captions are often quite rigid and tend to favor a "safe" (i.e. matching parts of the training captions in a word-by-word manner) but restrictive way. As a consequence, captions generated for different images, especially those that contain objects of the same categories, are sometimes very similar [1], despite their differences in other aspects.

We argue that distinctiveness, a property often overlooked in previous work, is significant in natural language descriptions. To be more specific, when people describe an image, they often mention or even emphasize the distinctive aspects of the image that distinguish it from others. With a distinctive description, someone can easily identify the image it refers to among a number of similar images. In this work, we performed a self-retrieval study (see Section 4.1), which reveals that the lack of distinctiveness affects the quality of descriptions.

From a technical standpoint, the lack of distinctiveness is partly related to the way the captioning model is learned. A majority of image captioning models are learned by Maximum Likelihood Estimation (MLE), where the probabilities of training captions conditioned on corresponding images are maximized. While well grounded in statistics, this approach does not explicitly promote distinctiveness.
Specifically, the differences among the captions of different images are not explicitly taken into account. We found empirically that the resultant captions highly resemble the training set in a word-by-word manner, but are not distinctive.

In this paper, we propose Contrastive Learning (CL), a new learning method for image captioning, which explicitly encourages distinctiveness while maintaining the overall quality of the generated captions. Specifically, it employs a baseline, e.g. a state-of-the-art model, as a reference. During learning, in addition to true image-caption pairs, denoted as (I, c), this method also takes as input mismatched pairs, denoted as (I, c/), where c/ is a caption describing another image. Then, the target model is learned to meet two goals, namely (1) giving higher probabilities p(c|I) to positive pairs, and (2) giving lower probabilities p(c/|I) to negative pairs, compared to the reference model. The former ensures that the overall performance of the target model is not inferior to the reference, while the latter encourages distinctiveness.

It is noteworthy that the proposed learning method (CL) is generic. While in this paper we focus on models based on recurrent neural networks [23, 15], the proposed method can also generalize well to models based on other formulations, e.g. probabilistic graphical models [4, 9]. Also, by choosing the state-of-the-art model as the reference model in CL, one can build on top of the latest advancement in image captioning to obtain improved performance.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Related Work

Models for Image Captioning  The history of image captioning dates back decades. Early attempts were mostly based on detections, which first detect visual concepts (e.g.
objects and their attributes) [9, 4], followed by template filling [9] or nearest-neighbor retrieval [2, 4] for caption generation. With the development of neural networks, a more powerful paradigm, encoder-and-decoder, was proposed by [23], which has since become the core of most state-of-the-art image captioning models. It uses a CNN [20] to represent the input image with a feature vector, and applies an LSTM net [6] upon the feature to generate words one by one.

Based on the encoder-and-decoder, many variants have been proposed, where the attention mechanism [24] appears to be the most effective add-on. Specifically, the attention mechanism replaces the single feature vector with a set of feature vectors, such as features from different regions [24] or features under different conditions [27]. It also uses the LSTM net to generate words one by one; the difference is that at each step, a mixed guiding feature over the whole feature set is dynamically computed. In recent years, there have also been approaches combining attention and detection. Instead of attending to features, they attend to a set of detected visual concepts, such as attributes [25] and objects [26].

Regardless of its specific structure, any image captioning model is able to give p(c|I), the probability of a caption conditioned on an image. Therefore, all image captioning models can be used as the target or the reference in the CL method.

Learning Methods for Image Captioning  Many state-of-the-art image captioning models adopt Maximum Likelihood Estimation (MLE) as their learning method, which maximizes the conditional log-likelihood of the training samples:

    \sum_{(c_i, I_i) \in \mathcal{D}} \sum_{t=1}^{T_i} \ln p(w_i^{(t)} \mid I_i, w_i^{(t-1)}, \ldots, w_i^{(1)}, \theta),    (1)

where \theta is the parameter vector, and I_i and c_i = (w_i^{(1)}, w_i^{(2)}, \ldots, w_i^{(T_i)}) are a training image and its caption.
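For concreteness, the MLE objective of Eq (1) amounts to a plain summation of per-word conditional log-probabilities. A minimal sketch, where `step_log_probs` is a hypothetical callable standing in for a model's per-step predictions:

```python
def mle_log_likelihood(captions, step_log_probs):
    """Conditional log-likelihood of Eq (1), summed over a dataset.

    captions:       list of captions, each a list of word ids
                    (caption i plays the role of c_i, paired with image I_i).
    step_log_probs: hypothetical callable; step_log_probs(i, t) returns a
                    dict word_id -> ln p(word | I_i, w_i^(1..t-1), theta).
    """
    total = 0.0
    for i, caption in enumerate(captions):
        for t, word in enumerate(caption):
            # ln p(w_i^(t) | I_i, prefix, theta)
            total += step_log_probs(i, t)[word]
    return total
```

Training by MLE maximizes this quantity with respect to theta; note that nothing in the sum compares captions across different images, which is the gap CL targets.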
Although effective, some issues, including high resemblance among model-generated captions, have been observed [1] on models learned by MLE.

Facing these issues, alternative learning methods have been proposed in recent years. Techniques of reinforcement learning (RL) were introduced to image captioning by [19] and [14]. RL sees the procedure of caption generation as sequentially sampling actions (words) in a policy space (vocabulary). The rewards in RL are defined to be evaluation scores of sampled captions. Note that distinctiveness has not been considered in either approach, RL or MLE.

Prior to this work, some relevant ideas have been explored [21, 16, 1]. Specifically, [21, 16] proposed an introspective learning (IL) approach that learns the target model by comparing its outputs on (I, c) and (I/, c). Note that IL uses the target model itself as a reference. By contrast, the reference model in CL provides more independent and stable indications about distinctiveness. In addition, (I/, c) in IL is pre-defined and fixed across the learning procedure, while the negative sample in CL, i.e. (I, c/), is dynamically sampled, making it more diverse and random. Recently, Generative Adversarial Networks (GANs) were also adopted for image captioning [1], which involves an evaluator that may help promote distinctiveness.
However, this evaluator is learned to directly measure the distinctiveness as a parameterized approximation, and the approximation accuracy is not ensured in GAN. In CL, the fixed reference provides stable bounds on the distinctiveness, and the bounds are supported by the model's performance on image captioning. Besides that, [1] is specifically designed for models that generate captions word by word, while CL is more generic.

Figure 1: This figure illustrates a nondistinctive and a distinctive caption of an image, where the nondistinctive one fails to retrieve back the original image in the self-retrieval task.

Method | Top-1 | Top-5 | Top-50 | Top-500 | ROUGE_L | CIDEr
Neuraltalk2 [8] | 0.02 | 0.32 | 3.02 | 27.50 | 0.652 | 0.827
AdaptiveAttention [15] | 0.10 | 0.96 | 11.76 | 78.46 | 0.689 | 1.004
AdaptiveAttention + CL | 0.32 | 1.18 | 11.84 | 80.96 | 0.695 | 1.029

Table 1: This table lists results of self retrieval (top-K recall) and captioning of different models. The results are reported on the standard MSCOCO test set. See Sec 4.1 for more details.

3 Background

Our formulation is partly inspired by Noise Contrastive Estimation (NCE) [5]. NCE was originally introduced for estimating probability distributions whose partition functions can be difficult or even infeasible to compute. To estimate a parametric distribution pm(.; θ), which we refer to as the target distribution, NCE employs not only the observed samples X = (x_1, x_2, ..., x_{Tm}), but also samples drawn from a reference distribution pn, denoted as Y = (y_1, y_2, ..., y_{Tn}). Instead of estimating pm(.; θ) directly, NCE estimates the density ratio pm/pn by training a classifier based on logistic regression.

Specifically, let U = (u_1, ..., u_{Tm+Tn}) be the union of X and Y. A binary class label C_t is assigned to each u_t, where C_t = 1 if u_t ∈ X and C_t = 0 if u_t ∈ Y.
The posterior probabilities for the class labels are therefore

    P(C = 1 \mid u, \theta) = \frac{p_m(u; \theta)}{p_m(u; \theta) + \nu p_n(u)}, \quad P(C = 0 \mid u, \theta) = \frac{\nu p_n(u)}{p_m(u; \theta) + \nu p_n(u)},    (2)

where \nu = T_n / T_m. Let G(u; \theta) = \ln p_m(u; \theta) - \ln p_n(u) and h(u; \theta) = P(C = 1 \mid u, \theta); then we can write

    h(u; \theta) = r_\nu(G(u; \theta)), \quad \text{with} \quad r_\nu(z) = \frac{1}{1 + \nu \exp(-z)}.    (3)

The objective function of NCE is the joint conditional log-probability of the C_t given the samples U, which can be written as

    L(\theta; X, Y) = \sum_{t=1}^{T_m} \ln[h(x_t; \theta)] + \sum_{t=1}^{T_n} \ln[1 - h(y_t; \theta)].    (4)

Maximizing this objective with respect to \theta leads to an estimation of G(·; \theta), the logarithm of the density ratio p_m/p_n. As p_n is a known distribution, p_m(·; \theta) can be readily derived.

4 Contrastive Learning for Image Captioning

Learning a model by characterizing desired properties relative to a strong baseline is a convenient and often quite effective approach in situations where it is hard to describe these properties directly. Specifically, in image captioning, it is difficult to characterize the distinctiveness of natural image descriptions via a set of rules without running the risk that some subtle but significant points are missed. Our idea in this work is to introduce a baseline model as a reference, and try to enhance the distinctiveness on top of it, while maintaining the overall quality of the generated captions.

In the following, we first present an empirical study on the correlation between the distinctiveness of a model's generated captions and its overall captioning performance.
Subsequently, we introduce the main framework of Contrastive Learning in detail.

4.1 Empirical Study: Self Retrieval

In most existing learning methods for image captioning, models are asked to generate a caption that best describes the semantics of a given image. Meanwhile, the distinctiveness of the caption, which in turn requires the image to be the best match for the caption among all images, has not been explored. However, distinctiveness is crucial for high-quality captions. A study by Jas and Parikh [7] showed that specificity is common in human descriptions, which implies that image descriptions often involve distinctive aspects. Intuitively, a caption satisfying this property is very likely to contain key and unique content of the image, so that the original image can easily be retrieved when the caption is presented.

To verify this intuition, we conducted an empirical study which we refer to as self retrieval. In this experiment, we try to retrieve the original image given its model-generated caption, and investigate top-k recalls, as illustrated in Figure 1. Specifically, we randomly sampled 5,000 images (I_1, I_2, ..., I_5000) from the standard MSCOCO [13] test set as the experiment benchmark. For an image captioning model p_m(·; θ), we first ran it on the benchmark to get corresponding captions (c_1, c_2, ..., c_5000) for the images. After that, using each caption c_t as a query, we computed the conditional probabilities (p_m(c_t|I_1), p_m(c_t|I_2), ..., p_m(c_t|I_5000)), which were used to obtain a ranked list of images, denoted by r_t. Based on all ranked lists, we can compute top-k recalls, i.e. the fraction of images ranked within the top k positions of their corresponding ranked lists.
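The self-retrieval protocol just described can be sketched as follows; this is a minimal sketch, where `probs` is a hypothetical matrix with `probs[t][i]` standing for p_m(c_t | I_i):

```python
def topk_recalls(probs, ks=(1, 5, 50, 500)):
    """Self-retrieval top-k recalls.

    probs[t][i]: conditional probability p_m(c_t | I_i); caption c_t was
    generated for image I_t, so position t is the ground-truth match.
    Returns {k: fraction of captions whose own image ranks within top k}.
    """
    n = len(probs)
    recalls = {}
    for k in ks:
        hits = 0
        for t, row in enumerate(probs):
            # rank images by p(c_t | I_i), highest first
            ranked = sorted(range(n), key=lambda i: row[i], reverse=True)
            if t in ranked[:k]:
                hits += 1
        recalls[k] = hits / n
    return recalls
```

A distinctive captioner concentrates probability mass on the matching image, so its recalls rise at every k.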
The top-k recalls are good indicators of how well a model captures the distinctiveness of descriptions.

In this experiment, we compared three different models: Neuraltalk2 [8] and AdaptiveAttention [15], both learned by MLE, as well as AdaptiveAttention learned by our method. The top-k recalls are listed in Table 1, along with the overall performances of these models in terms of Rouge [12] and Cider [22]. These results clearly show that the recalls of self retrieval are positively correlated with the performances of image captioning models on classical captioning metrics. Although most of the models are not explicitly learned to promote distinctiveness, the one with better self-retrieval recalls, i.e. more distinctive generated captions, performs better in the image captioning evaluation. This positive correlation clearly demonstrates the significance of distinctiveness to captioning performance.

4.2 Contrastive Learning

In Contrastive Learning (CL), we learn a target image captioning model p_m(·; θ) with parameter θ by constraining its behaviors relative to a reference model p_n(·; φ) with parameter φ. The learning procedure requires two sets of data: (1) the observed data X, a set of ground-truth image-caption pairs ((c_1, I_1), (c_2, I_2), ..., (c_{Tm}, I_{Tm})), which is readily available in any image captioning dataset; and (2) the noise set Y, which contains mismatched pairs ((c/_1, I_1), (c/_2, I_2), ..., (c/_{Tn}, I_{Tn})), and can be generated by randomly sampling c/_t ∈ C_{/I_t} for each image I_t, where C_{/I_t} is the set of all ground-truth captions except those of image I_t. We refer to X as positive pairs and to Y as negative pairs.

For any pair (c, I), the target model and the reference model respectively give their estimated conditional probabilities p_m(c|I, θ) and p_n(c|I, φ).
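The construction of the noise set Y described above can be sketched as follows (a minimal sketch; `gt_captions` is a hypothetical mapping from image id to its ground-truth captions):

```python
import random

def sample_negative_pairs(gt_captions, rng=None):
    """Draw one mismatched pair (c/_t, I_t) per image, with c/_t in C_{/I_t}.

    gt_captions: dict image_id -> list of ground-truth captions.
    Returns a list of (caption, image_id) negative pairs.
    """
    rng = rng or random.Random()
    negatives = []
    for img in gt_captions:
        # C_{/I_t}: all ground-truth captions except those of image I_t
        pool = [c for other, caps in gt_captions.items() if other != img
                for c in caps]
        negatives.append((rng.choice(pool), img))
    return negatives
```

Because the pool is resampled on each call, the negative pairs vary across the learning procedure, in contrast to the fixed negatives of IL.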
We wish that p_m(c_t|I_t, θ) is greater than p_n(c_t|I_t, φ) for any positive pair (c_t, I_t), and vice versa for any negative pair (c/_t, I_t). Following this intuition, our initial attempt was to define D((c, I); θ, φ), the difference between p_m(c|I, θ) and p_n(c|I, φ), as

    D((c, I); \theta, \phi) = p_m(c|I, \theta) - p_n(c|I, \phi),    (5)

and set the loss function to be

    L'(\theta; X, Y, \phi) = \sum_{t=1}^{T_m} D((c_t, I_t); \theta, \phi) - \sum_{t=1}^{T_n} D((c/_t, I_t); \theta, \phi).    (6)

In practice, this formulation meets several difficulties. First, p_m(c|I, θ) and p_n(c|I, φ) are very small (∼1e-8), which may result in numerical problems. Second, Eq (6) treats easy samples, hard samples, and mistaken samples equally, which is not the most effective way. For example, when D((c_t, I_t); θ, φ) ≫ 0 for some positive pair, further increasing D((c_t, I_t); θ, φ) is probably not as effective as updating D((c_{t'}, I_{t'}); θ, φ) for another positive pair for which D((c_{t'}, I_{t'}); θ, φ) is much smaller.

To resolve these issues, we adopted an alternative formulation inspired by NCE (Sec 3), where we replace the difference function D((c, I); θ, φ) with a log-ratio function G((c, I); θ, φ):

    G((c, I); \theta, \phi) = \ln p_m(c|I, \theta) - \ln p_n(c|I, \phi),    (7)

and further apply the logistic function r_\nu (Eq (3)) on top of G((c, I); θ, φ) to saturate the influence of easy samples. Following the notation of NCE, we let \nu = T_n / T_m and turn D((c, I); θ, φ) into

    h((c, I); \theta, \phi) = r_\nu(G((c, I); \theta, \phi)).    (8)

Note that h((c, I); θ, φ) ∈ (0, 1). Then, we define our updated loss function as

    L(\theta; X, Y, \phi) = \sum_{t=1}^{T_m} \ln[h((c_t, I_t); \theta, \phi)] + \sum_{t=1}^{T_n} \ln[1 - h((c/_t, I_t); \theta, \phi)].    (9)

For the setting of \nu = T_n / T_m, we choose \nu = 1, i.e. T_n = T_m, to ensure balanced influence from both positive and negative pairs. This setting consistently yields good performance in our experiments. Furthermore, we copy X for K times and sample K different Ys, in order to involve more diverse negative pairs without overfitting to them. In practice we found K = 5 sufficient to make the learning stable. Finally, our objective function is defined to be

    J(\theta) = \frac{1}{K} \frac{1}{T_m} \sum_{k=1}^{K} L(\theta; X, Y_k, \phi).    (10)

Note that J(θ) approaches its upper bound 0 when positive and negative pairs are perfectly distinguished, namely, for all t, h((c_t, I_t); θ, φ) → 1 and h((c/_t, I_t); θ, φ) → 0. In this case, G((c_t, I_t); θ, φ) → ∞ and G((c/_t, I_t); θ, φ) → −∞, which indicates that the target model gives higher probability p(c_t|I_t) and lower probability p(c/_t|I_t) than the reference model. Towards this goal, the learning process encourages distinctiveness by suppressing negative pairs, while maintaining the overall performance by maximizing the probability values on positive pairs.

4.3 Discussion

Maximum Likelihood Estimation (MLE) is a popular learning method in the area of image captioning [23, 24, 15].
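Numerically, the loss of Eqs (7)-(9) with ν = 1 can be sketched as follows; a minimal sketch, where `log_pm_*` and `log_pn_*` are hypothetical lists of ln p(c|I) values from the target and the reference model:

```python
import math

def cl_loss(log_pm_pos, log_pn_pos, log_pm_neg, log_pn_neg, nu=1.0):
    """Contrastive loss L(theta; X, Y, phi) of Eq (9) (unnormalized).

    Positive lists hold ln p(c_t | I_t) for matched pairs; negative lists
    hold ln p(c/_t | I_t) for mismatched pairs.
    """
    def h(log_pm, log_pn):
        g = log_pm - log_pn                      # Eq (7): G = ln p_m - ln p_n
        return 1.0 / (1.0 + nu * math.exp(-g))   # Eq (8): h = r_nu(G)

    loss = sum(math.log(h(m, n)) for m, n in zip(log_pm_pos, log_pn_pos))
    loss += sum(math.log(1.0 - h(m, n)) for m, n in zip(log_pm_neg, log_pn_neg))
    return loss
```

When the target and the reference agree everywhere and ν = 1, every h equals 0.5 and each pair contributes ln 0.5; pushing positives above and negatives below the reference moves the loss toward its upper bound 0.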
The objective of MLE is to maximize only the probabilities of ground-truth image-caption pairs, which may lead to some issues [1], including high resemblance among generated captions. In CL, by contrast, the probabilities of ground-truth pairs are indirectly ensured by the positive constraint (the first term in Eq (9)), while the negative constraint (the second term in Eq (9)) suppresses the probabilities of mismatched pairs, forcing the target model to also learn from distinctiveness.

Generative Adversarial Networks (GAN) [1] is a similar learning method that involves an auxiliary model. However, in GAN the auxiliary model and the target model follow two opposite goals, while in CL the auxiliary model and the target model are models on the same track. Moreover, in CL the auxiliary model is stable across the learning procedure, while in GAN it itself needs careful learning.

It is worth noting that although our CL method bears a certain level of resemblance to Noise Contrastive Estimation (NCE) [5], the motivation and the actual technical formulation of CL and NCE are essentially different. For example, in NCE the logistic function results from computing posterior probabilities, while in CL it is explicitly introduced to saturate the influence of easy samples.

As CL requires only p_m(c|I) and p_n(c|I), the choices of the target model and the reference model can range from models based on LSTMs [6] to models in other formats, such as MRFs [4] and memory networks [18]. On the other hand, although in CL the reference model is usually fixed across the learning procedure, one can replace the reference model with the latest target model periodically.
The reasons are that (1) ∇J(θ) ≠ 0 when the target model and the reference model are identical, (2) the latest target model is usually stronger than the reference model, and (3) a stronger reference model can provide stronger bounds and lead to a stronger target model.

COCO Online Testing Server C5
Method | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE_L | CIDEr
Google NIC [23] | 0.713 | 0.542 | 0.407 | 0.309 | 0.254 | 0.530 | 0.943
Hard-Attention [24] | 0.705 | 0.528 | 0.383 | 0.277 | 0.241 | 0.516 | 0.865
AdaptiveAttention [15] | 0.735 | 0.569 | 0.429 | 0.323 | 0.258 | 0.541 | 1.001
AdaptiveAttention + CL (Ours) | 0.742 | 0.577 | 0.436 | 0.326 | 0.260 | 0.544 | 1.010
PG-BCMR [14] | 0.754 | 0.591 | 0.445 | 0.332 | 0.257 | 0.550 | 1.013
ATT-FCN† [26] | 0.731 | 0.565 | 0.424 | 0.316 | 0.250 | 0.535 | 0.943
MSM† [25] | 0.739 | 0.575 | 0.436 | 0.330 | 0.256 | 0.542 | 0.984
AdaptiveAttention† [15] | 0.746 | 0.582 | 0.443 | 0.335 | 0.264 | 0.550 | 1.037
Att2in† [19] | - | - | - | 0.344 | 0.268 | 0.559 | 1.123

COCO Online Testing Server C40
Method | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE_L | CIDEr
Google NIC [23] | 0.895 | 0.802 | 0.694 | 0.587 | 0.346 | 0.682 | 0.946
Hard-Attention [24] | 0.881 | 0.779 | 0.658 | 0.537 | 0.322 | 0.654 | 0.893
AdaptiveAttention [15] | 0.906 | 0.823 | 0.717 | 0.607 | 0.347 | 0.689 | 1.004
AdaptiveAttention + CL (Ours) | 0.910 | 0.831 | 0.728 | 0.617 | 0.350 | 0.695 | 1.029
PG-BCMR [14] | - | - | - | - | - | - | -
ATT-FCN† [26] | 0.900 | 0.815 | 0.709 | 0.599 | 0.335 | 0.682 | 0.958
MSM† [25] | 0.919 | 0.842 | 0.740 | 0.632 | 0.350 | 0.700 | 1.003
AdaptiveAttention† [15] | 0.918 | 0.842 | 0.740 | 0.633 | 0.359 | 0.706 | 1.051
Att2in† [19] | - | - | - | - | - | - | -

Table 2: This table lists published results of state-of-the-art image captioning models on the online COCO testing server. † indicates an ensemble model. "-" indicates not reported.
In this table, CL improves the base model (AdaptiveAttention [15]) to attain the best results among all single models on C40.

5 Experiment

5.1 Datasets

We use two large-scale datasets to test our contrastive learning method. The first dataset is MSCOCO [13], which contains 122,585 images for training and validation. Each image in MSCOCO has 5 human-annotated captions. Following the splits in [15], we reserved 2,000 images for validation. A more challenging dataset, InstaPIC-1.1M [18], is used as the second dataset; it contains 648,761 images for training and 5,000 images for testing. The images and their ground-truth captions were acquired from Instagram, where people post images with related descriptions. Each image in InstaPIC-1.1M is paired with 1 caption. This dataset is challenging, as its captions are natural posts with varying formats. In practice, we reserved 2,000 images from the training set for validation.

On both datasets, non-alphabet characters except emojis are removed, and alphabet characters are converted to lowercase. Words and emojis that appear fewer than 5 times are replaced with UNK, and all captions are truncated to at most 18 words and emojis. As a result, we obtained a vocabulary of size 9,567 on MSCOCO, and of size 22,886 on InstaPIC-1.1M.

5.2 Settings

To study the generalization ability of the proposed CL method, we tested it on two different image captioning models, namely Neuraltalk2 [8] and AdaptiveAttention [15]. Both models are based on encoder-and-decoder [23]; no attention mechanism is used in the former, and an adaptive attention component is used in the latter.

For both models, we pretrained them by MLE, and use the pretrained checkpoints as initializations. In all experiments except for the experiment on model choices, we choose the same model and use the same initialization for the target model and the reference model.
In all our experiments, we fixed the learning rate to 1e-6 for all components, and used the Adam optimizer. Seven evaluation metrics have been selected to compare the performances of different models: Bleu-1,2,3,4 [17], Meteor [11], Rouge [12] and Cider [22]. All experiments for ablation studies were conducted on the validation set of MSCOCO.

Figure 2: This figure illustrates several images with captions generated by different models, where AA represents AdaptiveAttention [15] learned by MLE, and AA + CL represents the same model learned by CL. Compared to AA, AA + CL generated more distinctive captions for these images.

Method | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE_L | CIDEr
Google NIC [23] | 0.055 | 0.019 | 0.007 | 0.003 | 0.038 | 0.081 | 0.004
Hard-Attention [24] | 0.106 | 0.015 | 0.000 | 0.000 | 0.026 | 0.140 | 0.049
CSMN [18] | 0.079 | 0.032 | 0.015 | 0.008 | 0.037 | 0.120 | 0.133
AdaptiveAttention [15] | 0.065 | 0.026 | 0.011 | 0.005 | 0.029 | 0.093 | 0.126
AdaptiveAttention + CL (Ours) | 0.072 | 0.028 | 0.013 | 0.006 | 0.032 | 0.101 | 0.144

Table 3: This table lists results of different models on the test split of InstaPIC-1.1M [18], where CL improves the base model (AdaptiveAttention [15]) by significant margins, achieving the best result on Cider.

5.3 Results

Overall Results  We compared our best model (AdaptiveAttention [15] learned by CL) with state-of-the-art models on two datasets. On MSCOCO, we submitted the results to the online COCO testing server. The results, along with other published results, are listed in Table 2. Compared to MLE-learned AdaptiveAttention, CL improves its performance by significant margins across all metrics. While most state-of-the-art results are achieved by ensembling multiple models, our improved AdaptiveAttention attains competitive results as a single model.
Specifically, on Cider, CL improves AdaptiveAttention from 1.003 to 1.029, which is the best single-model result on C40 among all published ones. In terms of Cider, if we use MLE, we need to combine 5 models to get a 4.5% boost on C40 for AdaptiveAttention; using CL, we improve the performance by 2.5% with just a single model. On InstaPIC-1.1M, CL improves the performance of AdaptiveAttention by 14% in terms of Cider, which is the state-of-the-art. Some qualitative results are shown in Figure 2. It is worth noting that the proposed learning method can be used with stronger base models to obtain better results without any modification.

Compare Learning Methods  Using AdaptiveAttention learned by MLE as the base model and initialization, we compared our CL with similar learning methods, including CL(P) and CL(N) that

Method | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE_L | CIDEr
AdaptiveAttention [15] (Base) | 0.733 | 0.572 | 0.433 | 0.327 | 0.260 | 0.540 | 1.042
Base + IL [21] | 0.706 | 0.544 | 0.408 | 0.307 | 0.253 | 0.530 | 1.004
Base + GAN [1] | 0.629 | 0.437 | 0.290 | 0.190 | 0.212 | 0.458 | 0.700
Base + CL(P) | 0.735 | 0.573 | 0.437 | 0.334 | 0.262 | 0.545 | 1.059
Base + CL(N) | 0.539 | 0.411 | 0.299 | 0.212 | 0.246 | 0.479 | 0.603
Base + CL(Full) | 0.755 | 0.598 | 0.460 | 0.353 | 0.271 | 0.559 | 1.142

Table 4: This table lists results of a model learned by different methods.
The best result is obtained by the one learned with full CL, containing both the positive constraint and the negative constraint.

[Figure 2 examples — AA vs. AA + CL captions: "Three clocks are mounted to the side of a building" vs. "Three clocks with three different time zones"; "Two people on a yellow yellow and yellow motorcycle" vs. "Two people riding a yellow motorcycle in a forest"; "A baseball player pitching a ball on top of a field" vs. "A baseball game in progress with pitcher throwing the ball"; "A bunch of lights hanging from a ceiling" vs. "A bunch of baseball bats hanging from a ceiling"; "Two people on a tennis court playing tennis" vs. "Two tennis players shaking hands on a tennis court"; "A fighter jet flying through a blue sky" vs. "A fighter jet flying over a lush green field"; "A row of boats on a river near a river" vs. "A row of boats docked in a river"; "A bathroom with a toilet and a sink" vs. "A bathroom with a red toilet and red walls"]

Target Model | Reference Model | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE_L | CIDEr
NT | - | 0.697 | 0.525 | 0.389 | 0.291 | 0.238 | 0.516 | 0.882
NT | NT | 0.708 | 0.536 | 0.399 | 0.300 | 0.242 | 0.524 | 0.905
NT | AA | 0.716 | 0.547 | 0.411 | 0.311 | 0.249 | 0.533 | 0.956
AA | - | 0.733 | 0.572 | 0.433 | 0.327 | 0.260 | 0.540 | 1.042
AA | AA | 0.755 | 0.598 | 0.460 | 0.353 | 0.271 | 0.559 | 1.142

Table 5: This table lists results of different model choices on MSCOCO. In this table, NT represents Neuraltalk2 [8], and AA represents AdaptiveAttention [15]. "-" indicates the target model is learned using MLE.

Run | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE_L | CIDEr
0 | 0.733 | 0.572 | 0.433 | 0.327 | 0.260 | 0.540 | 1.042
1 | 0.755 | 0.598 | 0.460 | 0.353 | 0.271 | 0.559 | 1.142
2 | 0.756 | 0.598 | 0.460 | 0.353 | 0.272 | 0.559 | 1.142

Table 6: This table lists results of periodical replacement of the reference in CL. In run 0, the model is learned by MLE, and it is used as both the target and the reference in run 1.
In run 2, the reference is replaced with the best target in run 1.

respectively contain only the positive constraint and the negative constraint of CL. We also compared with IL [21] and GAN [1]. The results on MSCOCO are listed in Table 4, where (1) among IL, GAN and CL, only CL improves the performance of the base model, while both IL and GAN decrease the results. This indicates that the trade-off between learning distinctiveness and maintaining overall performance is not well settled in IL and GAN. (2) Comparing models learned by CL(P), CL(N) and CL, we found that using the positive constraint or the negative constraint alone is not sufficient, as only one source of guidance is provided. While CL(P) gives the base model a smaller improvement than full CL, CL(N) downgrades the base model, indicating overfitting to distinctiveness. Combining CL(P) and CL(N), CL is able to encourage distinctiveness while also emphasizing overall performance, resulting in the largest improvements on all metrics.

Compare Model Choices  To study the generalization ability of CL, AdaptiveAttention and Neuraltalk2 were each chosen as both the target and the reference in CL. In addition, AdaptiveAttention learned by MLE, as a better model, was chosen as the reference for Neuraltalk2. The results are listed in Table 5: compared to models learned by MLE, both AdaptiveAttention and Neuraltalk2 are improved after learning with CL. For example, on Cider, AdaptiveAttention improves from 1.042 to 1.142, and Neuraltalk2 improves from 0.882 to 0.905. Moreover, by using a stronger model, AdaptiveAttention, as the reference, Neuraltalk2 further improves from 0.905 to 0.956, which indicates that stronger references empirically provide tighter bounds on both the positive constraint and the negative constraint.

Reference Replacement  As discussed in Sec 4.3, one can periodically replace the reference with the latest best target model, to further improve the performance.
In our study, using AdaptiveAttention learned by MLE as a starting point, in each run we fix the reference model until the target's performance saturates on the validation set; we then replace the reference with the latest best target model and rerun the learning. As listed in Table 6, in the second run the relative improvement of the target model is marginal compared to its improvement in the first run. Therefore, when learning a model using CL with a sufficiently strong reference, the improvement usually saturates in the first run, and there is no need, in terms of overall performance, to replace the reference multiple times.

6 Conclusion

In this paper, we propose Contrastive Learning, a new learning method for image captioning. By employing a state-of-the-art model as a reference, the proposed method is able to maintain the optimality of the target model while encouraging it to learn distinctiveness, an important property of high-quality captions. On two challenging datasets, namely MSCOCO and InstaPIC-1.1M, the proposed method improves the target model by significant margins and achieves state-of-the-art results across multiple metrics. In comparative studies, the proposed method extends well to models with different structures, which clearly shows its generalization ability.

Acknowledgment   This work is partially supported by the Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626), the General Research Fund (GRF) of Hong Kong (No. 14236516) and the Early Career Scheme (ECS) of Hong Kong (No. 24204215).

References

[1] Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional GAN. In Proceedings of the IEEE International Conference on Computer Vision, 2017.

[2] Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, and C. Lawrence Zitnick. Exploring nearest neighbor approaches for image captioning.
arXiv preprint arXiv:1505.04467, 2015.

[3] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K. Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482, 2015.

[4] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision, pages 15–29. Springer, 2010.

[5] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361, 2012.

[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[7] Mainak Jas and Devi Parikh. Image specificity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2727–2736, 2015.

[8] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.

[9] Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, and Tamara L. Berg. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903, 2013.

[10] Polina Kuznetsova, Vicente Ordonez, Tamara L. Berg, and Yejin Choi. Treetalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2(10):351–362, 2014.

[11] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language.
In Proceedings of the Ninth Workshop on Statistical Machine Translation (ACL 2014), pages 376–380, 2014.

[12] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8, Barcelona, Spain, 2004.

[13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

[14] Siqi Liu, Zhenhai Zhu, Ning Ye, Sergio Guadarrama, and Kevin Murphy. Optimization of image description metrics using policy gradient methods. arXiv preprint arXiv:1612.00370, 2016.

[15] Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. arXiv preprint arXiv:1612.01887, 2016.

[16] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L. Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016.

[17] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[18] Cesc Chunseong Park, Byeongchang Kim, and Gunhee Kim. Attend to you: Personalized image captioning with context sequence memory networks. In CVPR, 2017.

[19] Steven J. Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.

[20] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556, 2014.

[21] Ramakrishna Vedantam, Samy Bengio, Kevin Murphy, Devi Parikh, and Gal Chechik. Context-aware captions from context-agnostic supervision. arXiv preprint arXiv:1701.02870, 2017.

[22] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.

[23] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

[24] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, volume 14, pages 77–81, 2015.

[25] Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, and Tao Mei. Boosting image captioning with attributes. arXiv preprint arXiv:1611.01646, 2016.

[26] Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, and Jiebo Luo. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4651–4659, 2016.

[27] Luowei Zhou, Chenliang Xu, Parker Koch, and Jason J. Corso. Image caption generation with text-conditional semantic attention. arXiv preprint arXiv:1606.04621, 2016.