PromptonomyViT:
Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data

*Denotes Equal Contribution
1Tel-Aviv University, 2Reichman University, 3UC Berkeley, 4IBM Research, 5MIT-IBM Watson AI Lab

PromptonomyViT (PViT) adds a set of task-specific prompts to a video transformer to capture inter-task structure and solve a downstream task. We consider the setting where automatically generated synthetic scene data for scene-level tasks (e.g., depth, semantic segmentation) is used to improve an action recognition model trained on real data. Our PViT model uses a multi-task prompt learning approach for video transformers, in which a shared transformer backbone is enhanced with task-specific prompts (colored squares). The task prompts predict the synthetic labels for each task, and a CLS token (blue square) predicts the action recognition label. The use of task-specific prompts allows the model to benefit from task-related information.

Abstract

Action recognition models have achieved impressive results by incorporating scene-level annotations, such as objects, their relations, 3D structure, and more. However, scene structure annotations for videos require significant effort to gather and label, making these methods expensive to train. In contrast, synthetic datasets generated by graphics engines provide a powerful alternative for producing scene-level annotations across multiple tasks.

In this work, we propose an approach to leverage synthetic scene data for improving video understanding. We present a multi-task prompt learning approach for video transformers, where a shared video transformer backbone is enhanced by a small set of specialized parameters for each task. Specifically, we add a set of ``task prompts'', each corresponding to a different task, and let each prompt predict task-related annotations. This design allows the model to capture information shared among synthetic scene tasks as well as information shared between synthetic scene tasks and a real video downstream task throughout the entire network. We refer to this approach as ``Promptonomy'', since the prompts model task-related structure.

We propose the PromptonomyViT model (PViT), a video transformer that incorporates various types of scene-level information from synthetic data using the ``Promptonomy'' approach. PViT shows strong performance improvements on multiple video understanding tasks and datasets.
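To make the ``Promptonomy'' design more concrete, below is a minimal sketch (our own illustration, not the authors' released code) of a video transformer whose input sequence carries a CLS token, one learnable prompt per task, and the video patch tokens, so that all of them interact through the shared self-attention layers. Class names, layer sizes, and head shapes (PViTSketch, n_tasks, embed_dim, etc.) are placeholder assumptions.

import torch
import torch.nn as nn

class PViTSketch(nn.Module):
    """Minimal sketch: task prompts and a CLS token share the transformer with patch tokens."""
    def __init__(self, embed_dim=768, n_tasks=4, n_classes=174, depth=12, n_heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.task_prompts = nn.Parameter(torch.randn(1, n_tasks, embed_dim) * 0.02)  # one prompt per task
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.action_head = nn.Linear(embed_dim, n_classes)                 # reads the CLS token
        self.task_heads = nn.ModuleList(
            nn.Linear(embed_dim, embed_dim) for _ in range(n_tasks))       # read the task prompts

    def forward(self, patch_tokens):            # patch_tokens: (B, N, D) video patch embeddings
        B = patch_tokens.size(0)
        tokens = torch.cat([self.cls_token.expand(B, -1, -1),
                            self.task_prompts.expand(B, -1, -1),
                            patch_tokens], dim=1)
        tokens = self.backbone(tokens)          # prompts, CLS, and patches all attend to each other
        action_logits = self.action_head(tokens[:, 0])
        task_outputs = [h(tokens[:, 1 + i]) for i, h in enumerate(self.task_heads)]
        return action_logits, task_outputs

The CLS output drives the action-recognition head, while each task prompt's output can be routed to a task-specific prediction head, as sketched further below.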


PViT Architecture

We extend a transformer with a set of ``task prompts'', p_i, each designed to capture information about its task as well as the inter-task structure. The prompts are supervised by synthetic scene auxiliary tasks (depth, semantic segmentation, surface normals, and 3D pose), available only during training, in order to enhance performance on a downstream video task (e.g., predicting ``put-down cereal''). Within each attention block, every task prompt interacts with the patch tokens, the CLS token, and the other task prompts.
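As a rough illustration of this training setup (a sketch under our own assumptions, not the paper's exact losses or weights), the combined objective could mix a cross-entropy loss on the real action label with per-task losses on the synthetic labels. The model is assumed to follow the interface of the sketch above; the per-task loss choice and the weights are placeholders.

import torch.nn.functional as F

def promptonomy_loss(model, real_clip, action_label, synth_clip, synth_labels,
                     task_weights=(1.0, 1.0, 1.0, 1.0)):
    # Real video: only the action-recognition (CLS) head receives supervision.
    action_logits, _ = model(real_clip)
    loss = F.cross_entropy(action_logits, action_label)

    # Synthetic video: each task prompt is supervised by its scene-level label
    # (depth, semantic segmentation, surface normals, 3D pose).
    _, task_preds = model(synth_clip)
    for weight, pred, target in zip(task_weights, task_preds, synth_labels):
        loss = loss + weight * F.mse_loss(pred, target)  # placeholder per-task loss
    return loss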

``Task Prompts'' Visualization

Visualization of the outputs of the ``task prompts'' prediction heads on frames from the SSv2, Diving48, and Ego4D datasets. The model was trained with Something-Else as the action recognition dataset. Shown are the prediction head outputs (i.e., H_i) for depth, surface normals, and semantic segmentation. The task prompts produce meaningful maps despite never receiving such labels for real videos.
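For illustration, a prediction head H_i could be realized as below (a hypothetical sketch, not the paper's exact head design): the task prompt's output modulates the patch tokens, which are then projected and reshaped into a coarse per-patch map such as depth. The modulation step, shapes, and names are assumptions.

import torch
import torch.nn as nn

class DensePromptHead(nn.Module):
    """Hypothetical H_i: combines one task prompt with the patch tokens into a coarse dense map."""
    def __init__(self, embed_dim=768, out_channels=1, patch_grid=(14, 14)):
        super().__init__()
        self.patch_grid = patch_grid                      # patches per frame (H_p, W_p)
        self.proj = nn.Linear(embed_dim, out_channels)

    def forward(self, prompt_token, patch_tokens):
        # prompt_token: (B, D); patch_tokens: (B, N, D) for one frame, with N = H_p * W_p
        fused = patch_tokens * prompt_token.unsqueeze(1)  # modulate patches by the prompt (assumed)
        dense = self.proj(fused)                          # (B, N, out_channels)
        h, w = self.patch_grid
        return dense.transpose(1, 2).reshape(dense.size(0), -1, h, w)  # coarse per-patch map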

Dataset-Task Agreement

Each polygon represents a real video dataset; the closer a vertex is to the circle's border, the greater the gain from applying that synthetic scene task. The gains are scaled for comparison.




BibTeX

@misc{herzig2022promptonomyvit,
      title={PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data},
      author={Roei Herzig and Ofir Abramovich and Elad Ben-Avraham and Assaf Arbelle and Leonid Karlinsky and Ariel Shamir and Trevor Darrell and Amir Globerson},
      year={2022},
      eprint={2212.04821},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
  }

Acknowledgements

This project received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell's group was supported in part by DoD, including DARPA's XAI and LwLL programs, as well as BAIR's industrial alliance programs.