This paper investigates unique video captioning. We introduce a method, Captioning by Discriminative Prompting (CDP), along with challenging unique captioning benchmarks on egocentric video and timeloop movies.
The following figures highlight a shortcoming of current captioning approaches: they caption each clip independently, giving similar captions for similar clips. First, in a timeloop movie:
Second, in a long egocentric video:
Given a set of similar video clips, our goal is to generate a concise caption for each that focuses on what is unique to that clip. This should give a one-to-one relationship between clips and captions, where each caption can retrieve its corresponding video clip. We visualise our approach here:
(a) Standard captioning can generate the same caption for multiple clips.
(b) We compare clips that share the same caption to find a property that captions each uniquely. Here, what the person is "holding" uniquely identifies the pink clip.
(c) If we cannot find a unique property, we explore subsequent clips to obtain an extended unique caption.
Captioning by Discriminative Prompting (CDP) is built around predicting discriminative prompts for a set of similar clips. Our intuition is that it is easier to spot where a difference occurs than to fully caption multiple clips with a captioner all at once.
This allows us to use a single frozen, pre-trained single-clip captioner; we only need to learn a lightweight, scalable network, CDPNet, which predicts the similarity between a video clip and a prompted caption in a shared embedding space. These similarities are then searched over to find the most discriminative prompt, and hence the most unique caption, for each clip. CDPNet is small enough to be conditioned on all the clips we want unique captions for.
(a) First, CDPNet is used to predict similarities, which are searched over to predict a discriminative prompt for each video clip (see the sketch below).
(b) Each video clip is then passed with its predicted prompt to a frozen single-clip captioner.
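To make step (a) concrete, here is a minimal sketch of how a discriminative prompt could be selected from CDPNet-style similarities. The tensor layout, the margin criterion and the function name are illustrative assumptions for this sketch, not the released implementation.

```python
import torch

def select_discriminative_prompts(sim: torch.Tensor) -> torch.Tensor:
    """Pick one prompt per clip from predicted clip / prompted-caption similarities.

    Assumed layout: sim[i, j, p] is the predicted similarity between clip j and the
    caption that prompt p would produce for clip i, in a shared embedding space.
    Returns the index of the most discriminative prompt for each clip.
    """
    n = sim.shape[0]
    idx = torch.arange(n)
    own = sim[idx, idx]                    # [n, num_prompts]: each clip vs. its own prompted caption
    others = sim.clone()
    others[idx, idx] = float("-inf")       # exclude the clip itself when looking for confusers
    hardest = others.max(dim=1).values     # [n, num_prompts]: strongest competing clip per prompt
    margin = own - hardest                 # larger margin = the prompted caption is more unique to its clip
    return margin.argmax(dim=1)            # best prompt index per clip
```

Each selected prompt is then used to condition the frozen captioner on its clip, as in step (b).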
We also acknowledge that all captioners have limitations. When CDPNet determines that the captioner cannot generate a unique caption for a given clip, we exploit the long-term nature of the video to advance temporally until it can.
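As a hedged sketch of this fallback (the step size, threshold and helper callables below are placeholders, not the paper's exact procedure):

```python
from typing import Callable

def caption_with_temporal_extension(
    clip_end: float,
    max_extra_time: float,
    uniqueness: Callable[[float], float],  # e.g. a CDPNet-based uniqueness score when the clip is extended to time t
    caption: Callable[[float], str],       # frozen single-clip captioner run on the extended clip
    step: float = 1.0,
    threshold: float = 0.0,
) -> str:
    """Advance through the video until the caption is predicted to be unique,
    or until the extra-time budget (cf. T in the results table) runs out."""
    t = clip_end
    while uniqueness(t) < threshold and t < clip_end + max_extra_time:
        t += step  # extend the clip temporally and try again
    return caption(t)
```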
We curate sets of similar clips from two domains: egocentric videos from Ego4D and timeloop movies. Methods are evaluated by the average of video-to-text and text-to-video R@1. We apply CDP to the state-of-the-art captioner for each domain, with significant improvements in both.
Here are results for the LaViLa VCLM captioner on egocentric sets of 10 clips (average R@1). Additional time (T) indicates the model is allowed to advance through the video.
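For reference, average R@1 over a set of clips and their generated captions can be computed along these lines, assuming a square similarity matrix produced by some off-the-shelf text-video retrieval model (the scoring model itself is not shown, and the names here are ours):

```python
import torch

def average_r_at_1(sim: torch.Tensor) -> float:
    """sim[i, j]: retrieval similarity between caption i and clip j, where
    caption i was generated for clip i. Returns the mean of t2v and v2t R@1."""
    n = sim.shape[0]
    t2v = (sim.argmax(dim=1) == torch.arange(n)).float().mean()  # each caption retrieves its clip
    v2t = (sim.argmax(dim=0) == torch.arange(n)).float().mean()  # each clip retrieves its caption
    return (0.5 * (t2v + v2t)).item()
```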
| Method | T=0 | T=5 | T=10 | T=30 |
|---|---|---|---|---|
| LaViLa VCLM | 37 | 38 | 41 | 43 |
| LaViLa VCLM + CDP | 45 | 57 | 65 | 76 |
Here are three examples of unique captioning on three timeloop movies: "Groundhog Day", "Edge of Tomorrow" and "The Map of Tiny Perfect Things", followed by three egocentric examples. In all cases, note how every caption uniquely identifies the video clip it was generated from.
@inproceedings{perrett2024unique,
  title={It's Just Another Day: Unique Video Captioning by Discriminative Prompting},
  author={Perrett, Toby and Han, Tengda and Damen, Dima and Zisserman, Andrew},
  booktitle={Asian Conference on Computer Vision},
  year={2024}
}