Paper: Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild

ACL ID: C14-1115
Title: Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
Venue: International Conference on Computational Linguistics
Session: Main Conference
Year: 2014
Authors

This paper integrates techniques in natural language processing and computer vision to improve recognition and description of entities and activities in real-world videos. We propose a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics. We use state-of-the-art visual recognition systems to obtain confidences on entities, activities, and scenes present in the video. Our factor graph model combines these detection confidences with probabilistic knowledge mined from text corpora to estimate the most likely subject, verb, object, and place. Results on YouTube videos show that our approach improves both the joint detection of these latent, diverse sentence components and the detection of some individual componen...
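
The idea of combining detection confidences with language statistics to pick the most likely subject, verb, object, and place can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the candidate words, all numeric scores, the choice of pairwise factors (subject-verb, verb-object, verb-place), and the brute-force search over assignments. It only shows how unary "vision" factors and pairwise "language" factors might be multiplied together and maximized jointly.

```python
import math
from itertools import product

SLOTS = ("subject", "verb", "object", "place")

# Hypothetical detector confidences per slot (illustrative numbers only).
vision = {
    "subject": {"person": 0.60, "dog": 0.40},
    "verb": {"eat": 0.55, "ride": 0.45},
    "object": {"bicycle": 0.55, "pizza": 0.45},
    "place": {"road": 0.70, "kitchen": 0.30},
}

# Hypothetical co-occurrence scores standing in for statistics mined from text.
# The set of slot pairs connected here is an assumption, not the paper's graph.
language = {
    ("subject", "verb"): {
        ("person", "ride"): 0.40, ("person", "eat"): 0.40,
        ("dog", "ride"): 0.05, ("dog", "eat"): 0.15,
    },
    ("verb", "object"): {
        ("ride", "bicycle"): 0.60, ("ride", "pizza"): 0.01,
        ("eat", "bicycle"): 0.01, ("eat", "pizza"): 0.38,
    },
    ("verb", "place"): {
        ("ride", "road"): 0.50, ("ride", "kitchen"): 0.05,
        ("eat", "road"): 0.10, ("eat", "kitchen"): 0.35,
    },
}


def log_score(assignment):
    """Sum of log unary (vision) and log pairwise (language) factors."""
    score = sum(math.log(vision[slot][assignment[slot]]) for slot in SLOTS)
    for (a, b), table in language.items():
        score += math.log(table[(assignment[a], assignment[b])])
    return score


def map_assignment():
    """Exhaustively search all slot assignments and return the best one."""
    candidates = (
        dict(zip(SLOTS, values))
        for values in product(*(vision[slot] for slot in SLOTS))
    )
    return max(candidates, key=log_score)


def vision_only_assignment():
    """Pick the top-scoring word per slot, ignoring language factors."""
    return {slot: max(vision[slot], key=vision[slot].get) for slot in SLOTS}


if __name__ == "__main__":
    print("vision only:", vision_only_assignment())
    print("joint MAP:  ", map_assignment())
```

With these made-up numbers, the per-slot vision maximum selects the verb "eat" together with the object "bicycle", while the joint maximization prefers "ride", because the hypothetical language factors make "eat bicycle" very unlikely. Exhaustive search is workable only for tiny candidate sets; a real system would need a proper inference procedure over the factor graph.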