List

Last Edited: January 2017

Video Description is the task of generating a textual description of a video in the form of one or more sentences. It has numerous applications for video cataloguing, sort and search, as well as for aiding the visually impaired. Recently, interest in image and video classification and description has increased significantly, stemming from the introduction of Neural Networks for solving these problems. Convolutional Neural Networks and Recurrent Neural Networks have become standard practice for visual classification and description problems respectively. In this post I explain the proposed datasets, methodologies, metrics and results for the problem of video description.

# Datasets

All of the approaches towards solving visual classification and description tasks are based on data-driven machine learning. This means that the methods learn their parameters by learning patterns in data related to each problem. For video description many methods use a supervised learning approach where the models learn from many examples containing an input video and a corresponding ground-truth (correct) description. The problem is that in order to learn generalised patterns and relationships between the input and output, datasets need to be big and cover as many of the possible states as possible. This has been a particular problem for the task of video description.

The negative effect of this ‘lack of data’ is to some degree lessened with the use of Transfer Learning. Transfer Learning is an idea where you can use data to learn a particular model which can then be applied to a similar but different problem. For example, in many video description works authors re-use image classification CNNs as a basis for their models, which were trained on still images and single words rather than videos and sequences of words.

As time goes on not only do methodologies improve, but so does the quality and quantity of the data, which is currently very important given the models' reliance on learning from many diverse examples. Below is a list of the current video description datasets, which does not include datasets used in transfer learning (they will be linked where necessary on an approach-by-approach basis).

### “Collecting Highly Parallel Data for Paraphrase Evaluation”

#### David Chen, William Dolan

MSVD is the first main generalised video description dataset, containing 1,970 video clips from YouTube. Each clip has approximately 41 descriptions (most in English), making 80,839 video-description pairs. The set is split based on this paper with 1,200 training, 100 validation and 670 test videos.

### “A Dataset for Movie Description”

#### Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele

The MPII-MD dataset…

### “Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research”

#### Atousa Torabi, Christopher Pal, Hugo Larochelle, Aaron Courville

The M-VAD dataset…

### “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”

#### Jun Xu, Tao Mei, Ting Yao, Yong Rui

The MSR-VTT dataset is the biggest general video description benchmark to date. It contains 10,000 web clips covering a broad range of topics. Each clip is annotated with about 20 natural sentences, resulting in 200,000 video-caption pairs. The dataset is split to have 6,513 training, 497 validation and 2,990 test clips.

The table below gives a brief summary of all of the above datasets.

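| Dataset | Context | Sentence Source | # Videos | # Clips | # Sentences | # Words | Vocabulary | # Hours |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YouCook | Cooking | Labelled | 88 | - | 2,668 | 42,457 | 2,711 | 2.3 |
| TACoS | Cooking | AMT | 123 | 7,206 | 18,227 | - | - | - |
| TACoS M-L | Cooking | AMT | 185 | 14,105 | 52,593 | - | - | - |
| MSVD | General (YouTube) | AMT | - | 1,970 | 70,028 | 607,339 | 13,010 | 5.3 |
| MPII-MD | Movie | DVS + Script | 94 | 68,337 | 68,375 | 653,467 | 24,549 | 73.6 |
| M-VAD | Movie | DVS | 92 | 48,986 | 55,905 | 519,933 | 18,269 | 84.6 |
| MSR-VTT | General | AMT | 7,180 | 10,000 | 200,000 | 1,856,523 | 29,316 | 41.2 |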

# Dataset Studies

Following are some studies of the above and related datasets.

Yao et al.

# Approaches

In this section I will go through the main approaches towards solving the video description problem. I will describe how each model works, what it offers and improves upon, and also any interesting findings.

# Plain Convolutional and Recurrent Neural Networks

In … the paper … was introduced which showed that neural networks … Convolutional neural networks… recurrent neural networks..

# 0

### “Long-term Recurrent Convolutional Networks for Visual Recognition and Description”

Donahue et al. were the first to utilise deep models for the video description problem. In their paper they focus not only on the video description problem but also image description and activity recognition.

Task-specific instantiations of their LRCN model for activity recognition, image description, and video description

For the video description task, the authors utilise a conditional random field (CRF) framework using video features as unaries. They attribute the use of such a traditional approach to the lack of video description datasets. The CRF outputs a phrase of words related to objects, subjects, and verbs, which they use to form a sentence with an LSTM RNN.

# 1

### “Translating Videos to Natural Language Using Deep Recurrent Neural Networks”

#### Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko

Frames from an input video are passed through a CNN with dense layer activations mean pooled across time to create a single high level vector descriptor for the entire video. This descriptor vector is then passed into a two-layer LSTM RNN, which produces a description word-by-word.

This work is the first to introduce a fully end-to-end trainable neural network architecture for the problem of video description.

The framework samples every 10th frame of a video and passes it through a CNN (the ImageNet-trained Caffe reference model). The fc7 layer activations are temporally mean pooled across the video, resulting in a single 4096-dimensional vector representing the entire video.
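The pooling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `video_descriptor` is hypothetical, and random numbers stand in for real fc7 activations.

```python
import numpy as np

def video_descriptor(frame_features: np.ndarray, step: int = 10) -> np.ndarray:
    """Mean-pool per-frame CNN activations over time.

    frame_features: array of shape (num_frames, feature_dim), e.g. fc7
        activations with feature_dim = 4096.
    step: keep every `step`-th frame before pooling.
    """
    sampled = frame_features[::step]   # every 10th frame by default
    return sampled.mean(axis=0)        # one vector of shape (feature_dim,)

# Random stand-in for the fc7 activations of a 95-frame video.
rng = np.random.default_rng(0)
fc7 = rng.standard_normal((95, 4096))
descriptor = video_descriptor(fc7)
print(descriptor.shape)  # (4096,)
```

Note that the result is the same regardless of frame order, which is exactly the loss of temporal information that later works try to address.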

This video descriptor vector is then decoded into a sentence using a 2-layer LSTM RNN. Due to the lack of sentences for the video description task (approx. 52,000 from YouTube2Text), the authors utilise transfer learning. That is, they first train on the larger image description datasets Flickr30k (approx. 150,000) and COCO2014 (approx. 616,000) and then fine-tune on the YouTube dataset.

# 2

### “Sequence to Sequence – Video to Text”

#### Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

A video is processed frame by frame by an object-pretrained RGB CNN and an action-pretrained optical flow CNN. The activations are then used as input to a sequence-to-sequence 2-layer RNN, which processes the CNN outputs temporally and then generates a sentence word-by-word.

This work introduces the idea of using an optical flow CNN for capturing localised temporal information and the use of a sequence-to-sequence RNN model for capturing higher level temporal information.

This work extends the previous work by Venugopalan et al. discussed above. In this work they no longer mean pool the CNN activation outputs, as doing so removes practically all of the temporal information of the video. Instead they use a sequence-to-sequence RNN model. The sequence-to-sequence model (introduced for machine translation problems) can transform one sequence into another, in this case video frames represented by vectors into the words of a sentence, which are also represented as vectors. The encoding stage involves taking the CNN outputs across multiple timesteps, while the decoding stage involves generating a sentence with a single word being produced per timestep. In this work the authors use the same RNN for the encoding and decoding stages so that parameter sharing can occur.

Compared to the previous work, the authors experiment with using not only the Caffe reference CNN, but also the newer VGG16 CNN (also trained on ImageNet) for frame processing. They also remove the fc7 dense layers and replace them with an embedding space of size 500, which serves as the input to the RNN (the embeddings are learnt jointly with the RNN in training).

Lastly, this work also uses an activity-trained CNN (trained on UCF101) on optical flow frames to better capture low-level motion information. In this case the fc6 layer is replaced with an embedding layer, also of size 500. When combining RGB and optical flow inputs into the RNN the authors use a shallow fusion technique, weighing each type of input based on the candidate word hypotheses.
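The fusion rule is not spelled out above, so the following is only a sketch under an assumption: that the two networks' per-word probabilities are combined by a weighted average at each decoding step. The function name `fuse_word_scores` and the weight `alpha` are hypothetical.

```python
import numpy as np

def fuse_word_scores(p_rgb: np.ndarray, p_flow: np.ndarray, alpha: float = 0.5):
    """Weighted average of two per-word probability vectors (assumed rule).

    p_rgb, p_flow: probabilities over the vocabulary from the RGB and
        optical-flow networks for the current decoding step.
    alpha: hypothetical weight given to the RGB network.
    Returns the index of the chosen word and the fused scores.
    """
    fused = alpha * p_rgb + (1.0 - alpha) * p_flow
    return int(np.argmax(fused)), fused

# Toy three-word vocabulary.
p_rgb = np.array([0.6, 0.3, 0.1])
p_flow = np.array([0.1, 0.7, 0.2])
word_index, fused = fuse_word_scores(p_rgb, p_flow, alpha=0.5)
print(word_index)  # 1
```

With `alpha=1.0` only the RGB network is used and the first word would be chosen instead.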

# 3

### “Jointly Modeling Embedding and Translation to Bridge Video and Language”

#### Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui

On the left, video features are generated with CNNs and mean-pooled across the sequence. These features as well as the sentence features are embedded into a common space. A relevance loss is then calculated based on the distance between these embeddings. A coherence loss is also calculated for each newly generated word. The entire model is learnt jointly by taking both losses into consideration.

This work introduces a new relevance loss which is based on video and sentence embeddings, that is jointly learnt alongside the coherence loss (same loss as previous works just termed differently).

Pan et al. believe that previous works optimise a new word too locally, unable to exploit the relationships between the entire sentence and video, resulting in contextually correct sentences that are incorrect semantically (subjects, verbs, objects). They therefore introduce a model which learns by jointly calculating a relevance loss which focuses on the semantic relationship between the entire video and entire sentence and a coherence loss which focuses on the contextual relationships among words given an entire video.

The relevance loss $E_r$ is calculated as the squared L2-norm distance between a video and sentence embedding, and is the main contribution of this work.

$E_r(\textbf{v},\textbf{s}) = \left \| \textbf{T}_v\textbf{v} - \textbf{T}_s\textbf{s} \right \|_2^2$

Video features $\textbf{v}$ are generated by passing frames/clips through a 2D/3D CNN and mean-pooling temporally across each video. Sentence features $\textbf{s}$ are represented as a vector with the length of the vocabulary, where entries corresponding to a word in the sentence are 1, and all others are 0. The video features and sentence features are then each embedded into a common space (same dimensionality) using transformation matrices $\textbf{T}_v, \textbf{T}_s$ (which are learnt in training).

The coherence loss $E_c$ is the same as that seen in past works, and is calculated as the negative sum of the log probabilities of the words in the sentence ($N_s$ is the length of the sentence).

$E_c(\textbf{v},\textbf{W}) = - \textup{log}\; p(\textbf{W}|\textbf{v}) = - \sum_{t=0}^{N_s}\textup{log}\; p(\textbf{w}_t|\textbf{v},\textbf{w}_0,...,\textbf{w}_{t-1})$

The model is learnt jointly over these two losses, minimising the following energy function, where $\lambda$ is an explicitly set weight, and $\theta$ are LSTM parameters.

$E(V,S) = (1-\lambda)\times E_r(\textbf{v},\textbf{s}) \; + \; \lambda \times E_c(\textbf{v},\textbf{W})$
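To make the two terms concrete, here is a small NumPy sketch of the relevance loss, the coherence loss and their weighted combination. All dimensions, the random matrices, the word probabilities and the value of `lam` are illustrative assumptions; in the real model $\textbf{T}_v$, $\textbf{T}_s$ are learnt and the word probabilities come from the LSTM.

```python
import numpy as np

def relevance_loss(v, s, T_v, T_s):
    """E_r: squared L2 distance between the embedded video and sentence."""
    diff = T_v @ v - T_s @ s
    return float(diff @ diff)

def coherence_loss(word_probs):
    """E_c: negative sum of log p(w_t | v, w_0..w_{t-1}) over the sentence."""
    return float(-np.sum(np.log(word_probs)))

def joint_energy(v, s, T_v, T_s, word_probs, lam):
    """E = (1 - lam) * E_r + lam * E_c."""
    return (1.0 - lam) * relevance_loss(v, s, T_v, T_s) + lam * coherence_loss(word_probs)

rng = np.random.default_rng(0)
v = rng.standard_normal(8)                 # mean-pooled video feature (toy size)
s = np.array([1.0, 0.0, 1.0, 1.0, 0.0])    # binary bag-of-words over a 5-word vocabulary
T_v = rng.standard_normal((4, 8))          # video -> common space
T_s = rng.standard_normal((4, 5))          # sentence -> common space
word_probs = np.array([0.5, 0.25])         # stand-in for the LSTM's per-word probabilities

energy = joint_energy(v, s, T_v, T_s, word_probs, lam=0.7)
```

Setting `lam=1` recovers the coherence-only objective used by earlier works, while `lam=0` trains only the embedding.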

# 4

### “Video Captioning with Transferred Semantic Attributes”

#### Yingwei Pan, Ting Yao, Houqiang Li, Tao Mei

This work introduces the use of semantic attributes from not only video but also still images to generate more semantically correct sentences.

The authors highlight that previous sequence-to-sequence approaches that translate directly from video to language miss important high-level semantic information. They therefore introduce a model which generates a description based on the video representation as well as semantic attributes gathered from both images and videos. They explore how using attributes from both images, which are generally more object and subject centric, and video, which is generally more verb centric, affect the output sentence.

A probability distribution over all attributes is calculated from images and video separately, and then joined with what the authors call a ‘transfer unit’. Following papers [1, 2], the authors utilise the weakly-supervised approach Multiple Instance Learning (MIL) to learn attribute detectors for images. Specifically, an image is considered a positive bag for an attribute if its ground-truth sentence contains the attribute. For videos a similar approach is taken on a per-frame basis; however, different spatial regions of the video across all frames are considered as one bag, to reduce semantic shift and noise.

The transfer unit considers contextual information and controls how much impact each of the attribute sources has at a particular word generation step, and is one of the main contributions of this work. More precisely, it produces a fixed-length weight vector $\textbf{g}^t$.

$\textbf{g}^t = \sigma (\textbf{G}_s\textbf{w}_t + \textbf{G}_h\textbf{h}^{t-1} + \textbf{G}_i\textbf{A}_i + \textbf{G}_v\textbf{A}_v)$

$\textbf{G}_s, \textbf{G}_h, \textbf{G}_i, \textbf{G}_v$ are the transformation matrices, $\sigma$ is the sigmoid function, $\textbf{w}_t$ is the previous word, $\textbf{h}^{t-1}$ is the hidden state of the LSTM, and $\textbf{A}_i, \textbf{A}_v$ are the attribute distributions for images and videos respectively.
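The gate equation can be sketched directly in NumPy. The sizes below and the function name `transfer_gate` are assumptions for illustration only; how $\textbf{g}^t$ is then used to mix the two attribute sources is not covered here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def transfer_gate(w_t, h_prev, A_i, A_v, G_s, G_h, G_i, G_v):
    """g^t = sigmoid(G_s w_t + G_h h^{t-1} + G_i A_i + G_v A_v)."""
    return sigmoid(G_s @ w_t + G_h @ h_prev + G_i @ A_i + G_v @ A_v)

# Toy sizes (assumed): word embedding 6, hidden state 5, 7 attributes, gate size 3.
rng = np.random.default_rng(0)
w_t, h_prev = rng.standard_normal(6), rng.standard_normal(5)
A_i, A_v = rng.random(7), rng.random(7)     # attribute probabilities in [0, 1)
G_s, G_h = rng.standard_normal((3, 6)), rng.standard_normal((3, 5))
G_i, G_v = rng.standard_normal((3, 7)), rng.standard_normal((3, 7))

g_t = transfer_gate(w_t, h_prev, A_i, A_v, G_s, G_h, G_i, G_v)
print(g_t.shape)  # (3,)
```

Because of the sigmoid every entry of `g_t` lies strictly between 0 and 1, so it can act as a soft weighting.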

# The Addition of Attention

Following similar trends in image classification and description it soon became apparent that the use of attention mechanisms within the models achieved better results. Attention for video can be considered in two ways, spatially (where in the frame to focus) and temporally (when in the video to focus).

# 5

### “Describing Videos by Exploiting Temporal Structure”

#### Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville

Temporal Attention Mechanism

This work introduces a temporal attention mechanism allowing the text generation RNN to choose the most relevant video segments. The model also utilises a 3D CNN for capturing more localised spatio-temporal information.

Yao et al. argue that temporal structure in video can be broken into localised structure, referring to fine-grained action-based motion, and global structure, referring to the sequence in which objects, actions and scenes appear in a video. The authors hence propose a temporal attention mechanism to exploit global temporal structure, and an action-based 3D CNN to capture local temporal structure.

To ensure good extraction of local temporal structure and to reduce computation, the 3D CNN takes motion features (histograms of oriented gradients, oriented flow, and motion boundaries) as input, rather than the raw pixel data. After being passed through the 3D CNN, the features are max-pooled spatially and concatenated with a standard image feature from a single frame from a similar time point.

The features and 3D CNN architecture

The temporal attention mechanism calculates a weight $\alpha_i^{(t)}$ for each feature vector $\textbf{v}_i$ at each timestep $t$, conditioned on the previously generated words. The weights are calculated by normalising a relevance score $e_i^{(t)}$ across time, with each $e_i^{(t)}$ being calculated in the LSTM decoder used for word generation.

$e_i^{(t)} = \textbf{w}^\top \textup{tanh}(\textbf{W}_a \textbf{h}_{t-1}+\textbf{U}_a\textbf{v}_i+\textbf{b}_a)$
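As a rough sketch of this mechanism, the code below computes the relevance scores, normalises them with a softmax across time, and forms a weighted sum of the feature vectors. The softmax and the weighted sum are the usual soft-attention steps and are assumed here; all sizes and the random parameters are illustrative.

```python
import numpy as np

def temporal_attention(V, h_prev, W_a, U_a, b_a, w):
    """Soft attention over n temporal feature vectors.

    V: (n, d_v) feature vectors v_i; h_prev: (d_h,) decoder state h_{t-1}.
    W_a: (d_a, d_h), U_a: (d_a, d_v), b_a: (d_a,), w: (d_a,).
    Returns the weights alpha (n,) and the attended feature (d_v,).
    """
    # e_i = w^T tanh(W_a h_{t-1} + U_a v_i + b_a), computed for all i at once.
    scores = np.tanh(h_prev @ W_a.T + V @ U_a.T + b_a) @ w
    # Normalise across time (softmax), shifted for numerical stability.
    exp_scores = np.exp(scores - scores.max())
    alpha = exp_scores / exp_scores.sum()
    context = alpha @ V                     # weighted sum of the v_i
    return alpha, context

rng = np.random.default_rng(0)
n, d_v, d_h, d_a = 6, 10, 8, 5              # toy sizes (assumed)
V, h_prev = rng.standard_normal((n, d_v)), rng.standard_normal(d_h)
W_a, U_a = rng.standard_normal((d_a, d_h)), rng.standard_normal((d_a, d_v))
b_a, w = rng.standard_normal(d_a), rng.standard_normal(d_a)

alpha, context = temporal_attention(V, h_prev, W_a, U_a, b_a, w)
```

The weights are recomputed at every decoding step, since $\textbf{h}_{t-1}$ changes as words are generated.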

They find that local and global structure are complementary, with the best models exploiting both.

# 6

### “Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks”

#### Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu


This work introduces a hierarchical-RNN model which is able to create video captions of multiple sentences, where later sentences are dependent on previous ones. It incorporates soft temporal attention and spatial attention (for TACoS-ML only) by extracting multiple smaller patches per frame.

The key contribution of this work is the addition of a paragraph generation process overseeing the sentence generation process, which is similar to past works. The sentence generator is a Gated Recurrent Unit (GRU, a simplified version of an LSTM unit) RNN which considers word embeddings and the paragraph state (provided by the paragraph generator), and informs the next paragraph state and the attention models.

Following the work of Yao et al., attention models calculate and assign weights to video features $\textbf{v}_{KM}$ for $K$ patches in $M$ frames. Spatial attention is performed only on the TACoS-ML dataset as objects are small and difficult to localise. Patches are extracted from the lower border of a bounding box around the person, who is detected using optical flow. The authors find a small $K$ value (3-5) is enough. Video features are sourced from both the ImageNet-trained VGG network, and the Sports-1M-trained C3D (for MSVD) or Dense Trajectories encoded with a Fisher vector (for TACoS-ML).


# 7

### “Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning”

#### Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang

A comparison between a normal stacked RNN (a), and the proposed hierarchical model (b) for the two-layer case. The red line shows the steps taken from first input to last input, with the proposed model having fewer steps, leading to less computational overhead.

This paper introduces a Hierarchical Recurrent Neural Encoder (HRNE) which has greater non-linearity by stacking recurrent layers, but reduced computational cost due to the hierarchical structure. This structure also better represents temporal actions and events, representing them at different hierarchical levels and specificity.

The model is composed of short stacks of LSTM RNN chains (used as temporal filters of length $n$) which are applied to chunks of the input sequence. So provided an input sequence $(\textbf{x}_1, \textbf{x}_2, ..., \textbf{x}_T)$, it is split into chunks the same length as the temporal filters ($n$): $(\textbf{x}_1, \textbf{x}_2, ..., \textbf{x}_n), (\textbf{x}_{1+s}, \textbf{x}_{2+s}, ..., \textbf{x}_{n+s}), ..., (\textbf{x}_{T-n+1}, \textbf{x}_{T-n+2}, ..., \textbf{x}_T)$, where $s$ is the stride denoting the number of temporal units adjacent chunks are apart. The mean of the hidden states $(\textbf{h}_1, \textbf{h}_2, ..., \textbf{h}_n)$ is then considered as the filter output for each chunk. These outputs are then passed to the next RNN layer as inputs, where the same process occurs.
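The chunking itself is easy to illustrate. The sketch below only shows how a sequence is split given a filter length `n` and stride `s`; the helper name `split_into_chunks` is hypothetical, and the LSTM that would run over each chunk is deliberately left out.

```python
def split_into_chunks(sequence, n, s):
    """Split a sequence into chunks of length n whose starts are s steps apart.

    Chunks that would run past the end of the sequence are dropped, so the
    last chunk ends exactly at x_T only when (T - n) is a multiple of s.
    """
    T = len(sequence)
    return [sequence[start:start + n] for start in range(0, T - n + 1, s)]

# Ten inputs x_1..x_10, filter length n = 4, stride s = 2.
frames = list(range(1, 11))
chunks = split_into_chunks(frames, n=4, s=2)
print(chunks)
# [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8], [7, 8, 9, 10]]
```

In the model each chunk would be fed through the LSTM filter and the mean of its hidden states passed upwards, so the second layer here would see 4 inputs instead of 10.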

Left: Spatial convolutional operation of a CNN. Right: Temporal operation of HRNE

The HRNE compacts the input over multiple layers into a single output vector which is considered the entire video representation (I am not fully certain of this, as the paper does not make it clear). This is then passed into a standard sentence generation LSTM RNN as seen in past works (Yao et al. [5], Venugopalan et al. [2]).

The model also incorporates the soft temporal attention model introduced by Yao et al., applying it in three stages: on the input, on the output of the first layer (input to the second layer), and on the output of the second layer (input to the sentence generation RNN).


Ballas et al. …

# 9

### “Spatio-Temporal Attention Models for Grounded Video Captioning”

#### Mihai Zanfir, Elisabeta Marinoiu, Cristian Sminchisescu

Provided a video (left), spatio-temporal proposals are chosen and filtered (top row). They are then each scored using an object detector and/or an image classifier and temporally pooled (avg) (bottom row). Also an SVO representation is generated (middle row). The scored spatio-temporal proposals and the SVO representation are passed into an attention LSTM which decides the next word for the caption (right).

This model attempts to ground each of the words of the description as being somewhere spatio-temporally in the video.

Firstly, the authors obtain object proposals across multiple frames (using the method from “Spatio-Temporal Object Detection Proposals”) and score them with a 1000-class image classification CNN (ImageNet VGG19) and a 20-class object detector (PASCAL VOC Fast R-CNN). For each proposal, scores are averaged temporally across frames, and across the classification and detection class outputs. The best 20 proposals are taken and represented with the fc7 activations from the image classification model.

Secondly the authors generate a semantic SVO (Subject, Verb, Object) representation for each video. Considering S, V and O as separate classification problems, the authors use temporally mean pooled fc7 layer activations (VGG19) for the S and O classes, and mean pooled trajectory features and motion-CNN features for the V class. They use Least Squares Support Vector Machines (LS-SVM) as classifiers using a one-vs-all methodology. Videos are then represented as a classifier response vector for all classes in the combined SVO vocabulary.

Thirdly, the authors generate a semantic vector representation for each video. They do this by running the same 1000-class image classifier and 20-class object detector on the entire frames rather than just the subset proposals. For the image classification responses they average pool the 1000-dimensional classification response vector. For the object detector responses they perform temporal max pooling across a window of 25 frames to stabilise the detections, with the final scores representing the confidence of having seen that particular object at some point in the video.

Lastly they combine the object proposals (P) and the semantic (s) SVO, classification and object detection representations using a temporal soft-attention LSTM RNN. They try two methods, firstly just using a single layer LSTM RNN with soft-attention which takes both P and s, and secondly by using two layers where s is passed as input to the earlier layer which has no attention mechanism (see image below).

(a) The single layer soft-attention LSTM RNN and (b) the two layer LSTM RNN which processes the semantic representations first with no temporal attention.

Their best results were achieved using the two layer LSTM RNN with only the SVO representation for s. This result is listed in the table in the results section.

# 10

### “Hierarchical Boundary-Aware Neural Encoder for Video Captioning”

#### Lorenzo Baraldi, Costantino Grana, Rita Cucchiara

The addition of the Boundary Detection module enables re-initialisation of LSTM parameters.

This work proposes a hierarchical LSTM RNN approach similar to that of Pan et al. [7]; however, instead of using fixed filter sizes, they enable the RNN to learn when to end a chunk, pass its output to the next RNN layer, and reset the underlying layer's state. To do this they incorporate a boundary detector, which informs the RNN what action to take (stay with the current sequence, or output and start a new one).

The main contribution of this work is the time boundary-aware recurrent unit, which modifies RNN layer connectivity through time. The unit is built on top of an LSTM unit and utilises the past memory $\textbf{c}_{t-1}$, hidden state $\textbf{h}_{t-1}$ and current input $\textbf{x}_{t}$ to modulate the LSTM unit’s gates.

The boundary detection unit and its influences on the LSTM unit

The boundary detector $\textbf{s}_{t}$ is calculated as:

$\textbf{s}_t = \tau (\textbf{v}_s^\textit{T}\cdot (\textit{W}_{si}\textbf{x}_t + \textit{W}_{sh}\textbf{h}_{t-1} + \textbf{b}_s))$

$\tau (x) = \left\{\begin{matrix} 1, \; \textrm{if} \: \sigma (x) > 0.5 \\ 0, \; \; \; \textrm{otherwise} \end{matrix}\right.$

where $\textbf{v}_s^\textit{T}$ is a learnable row vector. Given $\textbf{s}_t$, the following substitutions are applied to the previous hidden state and memory, either passing them along ($s_t = 0$) or reinitialising them ($s_t = 1$) for the next time step.

$\textbf{h}_{t-1} \leftarrow \textbf{h}_{t-1} \cdot (1-s_t)$

$\textbf{c}_{t-1} \leftarrow \textbf{c}_{t-1} \cdot (1-s_t)$

If $s_t = 1$ then $\textbf{h}_{t-1}$ is passed to the next RNN layer, which summarises each chunk into a single video representation.
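The boundary logic can be sketched in a few lines of NumPy. This covers only the detector and the reset of the previous state, not the LSTM update that follows; the function name `boundary_step`, the sizes and the hand-picked parameters are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def boundary_step(x_t, h_prev, c_prev, W_si, W_sh, b_s, v_s):
    """Compute s_t and apply the resets h <- h * (1 - s_t), c <- c * (1 - s_t).

    Returns (s_t, h_prev, c_prev, chunk_summary). chunk_summary is the
    un-reset h_{t-1} handed to the next layer when a boundary is detected,
    and None otherwise.
    """
    pre_activation = v_s @ (W_si @ x_t + W_sh @ h_prev + b_s)
    s_t = 1.0 if sigmoid(pre_activation) > 0.5 else 0.0   # tau(.)
    chunk_summary = h_prev.copy() if s_t == 1.0 else None
    return s_t, h_prev * (1.0 - s_t), c_prev * (1.0 - s_t), chunk_summary

# Toy sizes (assumed): input 3, hidden 4, detector 2.
x_t, h_prev, c_prev = np.ones(3), np.full(4, 0.5), np.full(4, 2.0)
W_si, W_sh, b_s = np.zeros((2, 3)), np.zeros((2, 4)), np.ones(2)

# v_s chosen so the pre-activation is positive: a boundary is detected.
s_on, h_on, c_on, summary = boundary_step(x_t, h_prev, c_prev, W_si, W_sh, b_s, np.ones(2))
# v_s chosen so the pre-activation is negative: the state is passed along.
s_off, h_off, c_off, no_summary = boundary_step(x_t, h_prev, c_prev, W_si, W_sh, b_s, -np.ones(2))
```

Note that the hard threshold in $\tau$ is not differentiable, so training such a unit needs some workaround for the gradient; that detail is outside this sketch.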

# 11

### “Video Captioning with Multi-Faceted Attention”

#### Xiang Long, Chuang Gan, Gerard de Melo

The proposed approach showing the different inputs on the left being passed into the framework, containing multi-faceted attention units, which are detailed on the right.

This work presents a model which incorporates a multi-faceted attention framework which jointly considers multiple different sources of input, including temporal features, motion features and semantic attributes.

The main contribution of this work is the multi-faceted attention units, which take as input temporal features (the usual high-level CNN frame vectors) $v_i$ [denoted $t_i$ in the diagram], motion features $f_i$, and vocabulary-embedded semantic attribute features $s_i = E[a_i]$, and either an input vocabulary-embedded word $x_t = E[w_t]$ or the LSTM unit's hidden state $h_t$. The temporal, motion, and semantic features are each processed by their own soft attention model, and combined using a multimodal layer:

$m_t^x = \phi (W^x[x_t,s_t^x,w_v^x\odot v_t^x,w_f^x\odot f_t^x]+b^{m,x})$

where $s_t^x$, $v_t^x$, and $f_t^x$ are the semantic, temporal and motion features with attention weighting applied. This $m_t^x$ is the output of the multi-faceted attention model using the input $x_t$, but the authors utilise the model both before the LSTM and after, in which case they take the hidden state $h_t$ in place of $x_t$.
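A minimal sketch of the multimodal layer follows, taking the attended features as given. The choice of `tanh` for $\phi$ is an assumption, as are all sizes and the function name `multimodal_layer`.

```python
import numpy as np

def multimodal_layer(x_t, s_att, v_att, f_att, w_v, w_f, W, b):
    """m_t = phi(W [x_t, s_t, w_v * v_t, w_f * f_t] + b), with phi = tanh (assumed).

    s_att, v_att, f_att are the attention-weighted semantic, temporal and
    motion features; w_v and w_f rescale the temporal and motion features
    element-wise before concatenation.
    """
    fused = np.concatenate([x_t, s_att, w_v * v_att, w_f * f_att])
    return np.tanh(W @ fused + b)

rng = np.random.default_rng(0)
d_x, d_s, d_v, d_f, d_m = 4, 4, 6, 5, 3     # toy sizes (assumed)
x_t, s_att = rng.standard_normal(d_x), rng.standard_normal(d_s)
v_att, f_att = rng.standard_normal(d_v), rng.standard_normal(d_f)
w_v, w_f = rng.standard_normal(d_v), rng.standard_normal(d_f)
W = rng.standard_normal((d_m, d_x + d_s + d_v + d_f))
b = rng.standard_normal(d_m)

m_t = multimodal_layer(x_t, s_att, v_att, f_att, w_v, w_f, W, b)
print(m_t.shape)  # (3,)
```

The same function would be reused after the LSTM by passing the hidden state in place of `x_t`, with its own set of parameters.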

# The Addition of Memory Networks

RNNs lack the ability to capture long term information over many time steps. This is somewhat reduced by the use of LSTM and GRU units, however even these can suffer. Therefore .. A good powerpoint presentation about Memory Networks can be found here.

# 12

### “Memory-augmented Attention Modelling for Videos”

#### Rasool Fakoor, Abdel-rahman Mohamed, Margaret Mitchell, Sing Bing Kang, Pushmeet Kohli

Their architecture with the Temporal Model (TEM), Hierarchical Attention Memory (HAM) and language decoder.

Fakoor et al. present a memory-based attention sequence-to-sequence model which learns where to look and what to look for in a video for use in the description. Their three component model uses:

1. A Temporal Model to capture temporal structure and track motion. Instead of using features extracted from the latter dense layers of a CNN, they extract intermediate convolutional maps, to gain a better representation of the earlier (lower level) CNN activations, which are especially useful for frame-to-frame motion. They let the network focus on particular locations of the frame using Location Attention [Bahdanau et al. 2015];
2. A Hierarchical Attention / Memory which learns where to attend to in a video given the past states and generated words. This representation of the video contains more information than a single vector (as seen in other sequence-to-sequence models), and is available to the decoder, enabling it to generate better descriptions;
3. A Decoder which is similar to those seen in past sequence-to-sequence models, however it has access to the Memory.

# Results

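| Approach | METEOR | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | CIDEr |
| --- | --- | --- | --- | --- | --- | --- |
| [1] Venugopalan et al. | 27.7 | - | - | - | 31.2 | - |
| [2] Venugopalan et al. | 29.2 | - | - | - | - | - |
| [3] Pan et al. | 29.5 | 74.9 | 60.9 | 50.6 | 40.2 | - |
| [4] Pan et al. | 33.5 | 82.8 | 72.0 | 62.8 | 52.8 | 74.0 |
| [5] Yao et al. | 29.60 | 80.00 | 64.70 | 52.60 | 41.92 (42.2) | 51.67 |
| [6] Yu et al. | 31.1 | 77.3 | 64.5 | 54.6 | 44.3 | - |
| [7] Pan et al. | 33.1 | 79.2 | 66.3 | 55.1 | 43.8 | - |
| [8] Ballas et al. | 31.70 | - | - | - | 49.63 | 68.01 |
| [9] Zanfir & Marinoiu et al. | 32.3 | 82.4 | 71.8 | 62.5 | 52.0 | - |
| [10] Baraldi et al. | 32.4 | - | - | - | 42.5 | 63.5 |
| [11] Long et al. | 42.9 | 91.8 | 87.2 | 82.5 | 76.4 | 139.3 |
| [12] Fakoor et al. | 31.80 | 79.4 | 67.1 | 56.8 | 46.1 | 62.7 |
| Human | 43.6 | 89.1 | 77.6 | 68.1 | 58.3 | 132.2 |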