author:
- 'Yuewei Yang, Mingxi Cheng'
title: 'Reinforcement Learning in Image Captioning'
Image captioning is a challenging task in computer vision. It is
attracting increasing attention because describing the complicated
content of an image in natural language is a core human ability, and a
machine can now be trained to do it nearly as well as a human. In
previous work, advanced algorithms have already overcome the
difficulties of extracting visual information from an image, deriving
textual content and sequence, and combining the two to compute a
sequence of words describing the image. The most popular algorithm
consists of a convolutional neural network that encodes visual
information and a recurrent neural network that decodes a sequence of
words from the encoded visual information [@show; @and; @tell].
Reinforcement learning has been applied in many domains such as gaming,
robotics control, and finance. Its core idea, an agent maximising total
reward by interacting with its environment, forms a general framework
for reward-based learning that can be applied to many learning problems.
The structure of the model and the algorithm used in this paper are
based on [@RL; @image; @captioning]. The model applied in this paper is
simpler than the one proposed in [@RL; @image; @captioning], so a
difference in performance is expected and will be explained in detail.
An encoder-decoder image captioning model [@merge; @model] is
implemented as a baseline. Based on [@merge; @model],
[@show; @and; @tell; @attention], and [@RL; @image; @captioning], a
policy network is computed and used to update the policy with a policy
gradient algorithm. A value network is further added to the model to
implement an actor-critic model, so that both the value and the policy
can be optimized using their corresponding gradients.
The experiment is conducted on the Flickr8k dataset, and the
performances are compared using a standard evaluation metric,
BLEU [@BLEU]. The aim of this paper is to study how to apply
reinforcement learning to image captioning and to learn the basics of
reinforcement learning using image captioning as an application.
Early work such as [@bottom-up] proposes a bottom-up model that
generates words from object recognition and attribute prediction, then
reconstructs a meaningful description using a language model. More
recent studies propose encoder-decoder models built from multiple
neural networks, which have been shown to improve the accuracy of the
automated descriptions. The basic idea is to encode visual information
with a convolutional neural network and then, together with a text
embedding model, decode a sequence of words with a recurrent neural
network. Most modern applications consist of this encoder-decoder
learning model, and more advanced algorithms combine both
models [@deep; @visual-semantic] or add an attention mechanism
[@semantic; @attention] [@show; @and; @tell; @attention] to improve the
performance further.
In the past two years, more papers have discussed the validity of using
a reward-based learning algorithm to caption an image.
[@policy; @RL; @1] and [@policy; @RL; @2] apply policy gradient search
to update the transition matrix/policy network, and their results show
better performance. A more recent paper [@RL; @image; @captioning]
proposes a value network based on the current total reward, a new
visual-semantic embedding reward, and an inference mechanism. By using
actor-critic reinforcement learning, that model outperforms most
state-of-the-art approaches consistently on all scoring metrics.
We formulate image captioning as a decision-making process. In
decision-making, an agent interacts with the environment and executes a
series of actions so as to optimize a goal. In image captioning, the
goal is, given an image, to generate a description of it in natural
language.
In this section, three different models are presented: an encoder-decoder model (the baseline), a policy network model, and a policy+value network model. The ultimate goal of this project is to achieve a simple policy+value network model with a simple reward representation.
In this model, image features are extracted with a convolutional neural network (VGG16) [@VGG16], and text embeddings are extracted with a recurrent neural network (LSTM) [@LSTM]. Figure 3.1 illustrates the outline of the model. Since this model serves as a baseline, refer to [@merge; @model] for more details.
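As a rough illustration of one decoding step of this merge architecture, the sketch below uses NumPy with toy dimensions and random weights standing in for the trained VGG16 features, embedding, and decoder layers (all names and sizes here are illustrative assumptions, not the actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the real model uses VGG16 features and an LSTM decoder.
VOCAB, EMBED, FEAT, HIDDEN = 50, 16, 32, 24

# Hypothetical parameters standing in for trained weights.
W_embed = rng.normal(size=(VOCAB, EMBED))    # text-embedding branch
W_img   = rng.normal(size=(FEAT, HIDDEN))    # visual-encoder projection
W_txt   = rng.normal(size=(EMBED, HIDDEN))   # text projection
W_out   = rng.normal(size=(HIDDEN, VOCAB))   # decoder output layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def next_word_probs(image_feat, word_id):
    """One decoding step of the merge model: embed the current word,
    project the image feature, merge the two branches, and decode a
    probability distribution over the next word."""
    text = W_embed[word_id] @ W_txt   # text branch
    img = image_feat @ W_img          # visual branch
    merged = np.tanh(text + img)      # joined decoder input
    return softmax(merged @ W_out)

probs = next_word_probs(rng.normal(size=FEAT), word_id=3)
```

In the real model, `image_feat` would be a VGG16 feature vector and the text branch an LSTM rather than a single projection.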
The network on the top-left branch is a text embedding network. Its
input is a word and its output is the corresponding embedding. The
branch on the top right is a visual information encoder. Its input is
an image and its output is a vector representation of the image. The
two branches are then joined into a decoder network, a feed-forward
layer, whose final output is the next word. The definitions of the RNN
and LSTM used in this model are well explained in [@RNN] and [@LSTM].
To train the model, the cross-entropy loss is minimized:
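In standard form (the notation below is an assumption, following the usual formulation of [@merge; @model]), with ground-truth caption words $w_1,\dots,w_T$ for image $I$, this is the negative log-likelihood of each word given the image and the preceding words:

$$L(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(w_t \mid w_1,\dots,w_{t-1}, I\right)$$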
In this model, the decoder part of the previous model is modified to include an additional LSTM stage. With this additional stage, the image and text information is fed back into the LSTM together at every time step to update its hidden states, so the model can "infer" the next most probable word. The output of the model is a probability distribution over the next possible words.
The visual information is fed into the initial input node of the LSTM.
The model is first trained with supervised learning, where the
cross-entropy loss is minimised.
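After this supervised pretraining, the policy network is updated with the policy gradient. A minimal REINFORCE-style sketch of one such update, assuming a single output layer `W` stands in for the policy network's final layer (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, HIDDEN = 20, 8

# Hypothetical output-layer weights of the policy network.
W = rng.normal(scale=0.1, size=(HIDDEN, VOCAB))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_step(hidden_state, reward, lr=0.01):
    """One REINFORCE update: sample a word from the policy, then move
    the weights along the gradient of log pi(word) scaled by the
    reward, so rewarded words become more probable."""
    global W
    probs = softmax(hidden_state @ W)
    word = rng.choice(VOCAB, p=probs)
    # Gradient of log-softmax w.r.t. the logits: one_hot(word) - probs.
    grad_logits = -probs
    grad_logits[word] += 1.0
    W += lr * reward * np.outer(hidden_state, grad_logits)
    return word, probs[word]

h = rng.normal(size=HIDDEN)
word, p_before = reinforce_step(h, reward=1.0)
```

With a positive reward, the update strictly increases the probability of the sampled word at that state.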
The policy network in this model is exactly the same as the one in Section 4.2. The value network is added in this model to implement an actor-critic model. The structure of the value network is shown below:
The value network is an approximation of the estimated total reward at
time step $t$.
Here the value network
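As a sketch, assuming the state is a single vector combining the image feature and a partial-caption embedding, the value network can be a small MLP (sizes and weights below are illustrative stand-ins for the trained network in the figure):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical sizes; the real state combines image and text features.
STATE, HIDDEN = 12, 8

# Hypothetical MLP weights of the value network.
W1, b1 = rng.normal(scale=0.1, size=(STATE, HIDDEN)), np.zeros(HIDDEN)
W2, b2 = rng.normal(scale=0.1, size=(HIDDEN, 1)), np.zeros(1)

def value(state):
    """Estimate the expected total reward from the current state with
    a one-hidden-layer MLP (ReLU hidden layer, scalar output)."""
    hid = np.maximum(0.0, state @ W1 + b1)
    return float(hid @ W2 + b2)

v = value(rng.normal(size=STATE))
```

In the actor-critic update, the advantage `reward - value(state)` then replaces the raw reward when scaling the policy gradient.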
All three models are trained and tested on the Flickr8k dataset, which contains 8,097 images (6,000 training, 1,000 validation, and 1,000 test images). This small dataset is chosen to keep training time under control, as the COCO dataset has 123,287 images and every additional image increases training time considerably. The performance of each model is compared in terms of BLEU scores.
This section presents the performance of the different models in terms of BLEU scores and discusses the effect of different reward representations on performance.
| Methods                  | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|--------------------------|--------|--------|--------|--------|
| Encoder-Decoder          | 0.518  | 0.334  | 0.206  | 0.132  |
| Policy Network           | 0.544  | 0.362  | 0.231  | 0.151  |
| Policy and Value Network | 0.551  | 0.378  | 0.251  | 0.169  |
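For reference, a simplified single-reference BLEU-1 (clipped unigram precision times a brevity penalty; the full metric of [@BLEU] also combines higher-order n-grams and multiple references) can be computed as:

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """Simplified BLEU-1 for one candidate and one reference: clipped
    unigram precision multiplied by a brevity penalty."""
    cand, ref = Counter(candidate), Counter(reference)
    # Each candidate word counts at most as often as it appears in the reference.
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    precision = clipped / max(len(candidate), 1)
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * precision

score = bleu1("a dog runs on the grass".split(),
              "a dog is running on the grass".split())
```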
The encoder-decoder model is a supervised learning model; its
performance is optimized by tuning the feature length and the layer
parameters shown in Figure 1. The policy network model updates the
transition probability matrix using the policy gradient algorithm.
The descriptions generated using supervised learning often contain errors. Those generated using reinforcement learning are much better, though the policy network and the policy-and-value network produce similar descriptions. However, even my best models make errors (Figure 6(d)). This is because the best achievable score is about 0.551 in BLEU-1, and the BLEU score measures the accuracy of the generated text compared with the original text. Dog pictures receive good captions from our model, but other pictures show poor results. Furthermore, as shown in Figure 6(a), some objects in the original text are not recognized by our model, and supervised learning seems to miss more objects.
| Representations    | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|--------------------|--------|--------|--------|--------|
| Loss               | 0.551  | 0.378  | 0.251  | 0.169  |
| Euclidean Distance | 0.492  | 0.325  | 0.198  | 0.137  |
The model used in this experiment is the policy-and-value network
model. Different reward representations do make a difference: the
Euclidean distance between the generated text and the original text
does not perform as well as the loss-based reward. The explanation
would be that our model cannot recognise as many objects as appear in
the original text, so the distance between the two is large, which
could introduce some instability into the system.
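A minimal sketch of how such a distance-based reward might be computed, assuming embedding vectors of the generated and ground-truth captions are available (the negation makes larger distances yield lower reward; all values here are invented for illustration):

```python
import numpy as np

def distance_reward(gen_embedding, ref_embedding):
    """Reward from the Euclidean distance between the generated
    caption's embedding and the ground-truth caption's embedding:
    the larger the distance, the more negative the reward."""
    return -float(np.linalg.norm(gen_embedding - ref_embedding))

e_ref = np.array([1.0, 0.0, 2.0])
r_close = distance_reward(np.array([1.0, 0.1, 2.0]), e_ref)  # near the reference
r_far = distance_reward(np.array([5.0, 3.0, 0.0]), e_ref)    # far from it
```

When many reference objects are missing from the generated caption, this distance grows for every sample, which matches the instability observed above.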
The best performance of our model does not reach that of other state-of-the-art methods. One issue with the experiments in this project is that they train on Flickr8k, and the limited number of samples causes inaccuracy in the measured performance. Another defect of our model is that there is no inference mechanism such as the beam search used in most methods; our models take only the single best choice at each time step. In [@RL; @image; @captioning], besides beam search, a lookahead mechanism is used as another inference mechanism: the policy network serves as a global guide and the value network as a local guide. By applying different weights to these two guidances, the combined beam can have different orders, which enables the model to include good words that would have a low probability of being drawn by the policy network alone. Other ways to improve the model include investigating other reward representations and a better model structure.
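Our models decode greedily; beam search instead keeps several candidate sequences per step and can recover a globally better caption. A generic toy sketch (the transition table and vocabulary below are invented for illustration, not from any trained model):

```python
import math

def beam_search(next_probs, start, steps, beam_width=3):
    """Generic beam search: at every step, keep the beam_width partial
    sequences with the highest cumulative log-probability instead of
    only the single locally best word (greedy decoding)."""
    beams = [([start], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for word, p in next_probs(seq).items():
                candidates.append((seq + [word], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]

# Toy next-word distribution where greedy decoding is suboptimal:
# "cat" looks best locally but leads to weak continuations.
table = {
    ("a",):        {"cat": 0.6, "dog": 0.4},
    ("a", "cat"):  {"meows": 0.3, "sits": 0.3, "<end>": 0.4},
    ("a", "dog"):  {"runs": 0.9, "<end>": 0.1},
}

def next_probs(seq):
    return table[tuple(seq)]

greedy = beam_search(next_probs, "a", steps=2, beam_width=1)
best = beam_search(next_probs, "a", steps=2, beam_width=3)
```

Here greedy decoding commits to "a cat ..." (joint probability 0.24), while a width-3 beam finds "a dog runs" (0.36).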
In this project, reinforcement learning is applied to image captioning. Reward-based learning is added to a supervised learning model to improve the transition matrix. In the reinforcement learning models, a policy network and a value network are first trained with supervised learning and then updated using a gradient method. The effect of the policy network is more significant than that of the value network. Through this project, the reinforcement learning based model is compared with the supervised learning model, the effect of different reward representations is studied, and how reinforcement learning can be applied on top of a supervised learning model is learned and discussed.
The complete code for our model can be found on GitHub 1. In this project, the code is implemented based on [@RL; @image; @captioning] and [@show; @and; @tell]. The reinforcement learning code is modified from [@xinping] and [@tsenghuangchen]. My consultant, Mingxi Cheng, provided great assistance in constructing the model and writing the code.
Vinyals, Oriol, et al. “Show and tell: A neural image caption generator.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
Ren, Zhou, et al. “Deep Reinforcement Learning-based Image Captioning with Embedding Reward.” arXiv preprint arXiv:1704.03899 (2017).
Tanti, Marc, Albert Gatt, and Kenneth P. Camilleri. “Where to put the Image in an Image Caption Generator.” arXiv preprint arXiv:1703.09137 (2017).
Xu, Kelvin, et al. “Show, attend and tell: Neural image caption generation with visual attention.” International Conference on Machine Learning. 2015.
Papineni, Kishore, et al. “BLEU: a method for automatic evaluation of machine translation.” Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002.
Farhadi, Ali, et al. “Every picture tells a story: Generating sentences from images.” European conference on computer vision. Springer, Berlin, Heidelberg, 2010.
Karpathy, Andrej, and Li Fei-Fei. “Deep visual-semantic alignments for generating image descriptions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
You, Quanzeng, et al. “Image captioning with semantic attention.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Liu, Siqi, et al. “Optimization of image description metrics using policy gradient methods.” arXiv preprint arXiv:1612.00370 (2016).
Liu, Siqi, et al. “Improved Image Captioning via Policy Gradient optimization of SPIDEr.” arXiv preprint arXiv:1612.00370 (2016).
Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.”arXiv preprint arXiv:1409.1556 (2014).
Graves, Alex, and Jürgen Schmidhuber. “Framewise phoneme classification with bidirectional LSTM and other neural network architectures.” Neural Networks 18.5 (2005): 602-610.
Mao, Junhua, et al. “Deep captioning with multimodal recurrent neural networks (m-rnn).” arXiv preprint arXiv:1412.6632 (2014).
Cheng, Xinping, “Optimization of image description metrics using policy gradient methods”, https://github.com/chenxinpeng/Optimization_of_image_description_metrics_using_policy_gradient_methods
Chen, Tseng-Huang, “Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner”, https://github.com/tsenghungchen/show-adapt-and-tell
![Encoder-Decoder model structure[]{data-label="fig:framework"}](https://github.com/yueweiyang/RL-cmps590/raw/master/model.png)
![Policy network structure[]{data-label="fig:policy network"}](https://github.com/yueweiyang/RL-cmps590/raw/master/model_policy.png)
![Illustration of policy network flow [@RL; @image; @captioning][]{data-label="fig:policy illustration"}](https://github.com/yueweiyang/RL-cmps590/raw/master/policy_model.png)
![Value network structure[]{data-label="fig:value network"}](https://github.com/yueweiyang/RL-cmps590/raw/master/MLP.png)
![Illustration of value network flow [@RL; @image; @captioning][]{data-label="fig:value illustration"}](https://github.com/yueweiyang/RL-cmps590/raw/master/valuenetwork.png)
![Examples of generated descriptions of images[]{data-label="fig:examples"}](https://github.com/yueweiyang/RL-cmps590/raw/master/pic.png)