VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

1National Key Laboratory of General Artificial Intelligence, Beijing Institute for General Artificial Intelligence (BIGAI) 2School of Computer Science, Peking University 3School of Intelligence Science and Technology, Peking University


Figure 1. With a unified memory as a structured representation of videos, VideoAgent can utilize a curated set of tools to perform sophisticated queries over the memory and respond with the correct answer.

Abstract

We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-source models and private counterparts such as Gemini 1.5 Pro. The code will be released to the public.

TL;DR: We propose VideoAgent, an LLM agent that understands videos by using a structured memory and four tools.


Figure 2. An overview of VideoAgent. Left: We first translate an input video into structured representations: a temporal memory and an object memory. Right: The LLM within VideoAgent is prompted to solve the given task by interactively invoking tools. Our tools primarily work with the memory (e.g., Caption Retrieval interacts with the caption part of the temporal memory, while Object Memory Querying looks up the object memory).

Method

Given a video and a question, VideoAgent operates in two phases: a memory construction phase and an inference phase. During the memory construction phase, structured information is extracted from the video and stored in the memory. During the inference phase, an LLM is prompted to use a set of tools that interact with the memory to answer the question.
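To make the two-phase design concrete, the sketch below shows the top-level control flow; the function names (`video_agent`, `build_memory`, `answer_with_tools`) are illustrative placeholders rather than the released interface.

```python
from typing import Any, Callable

def video_agent(video_path: str, question: str,
                build_memory: Callable[[str], Any],
                answer_with_tools: Callable[[str, Any], str]) -> str:
    """Illustrative top-level flow: build the memory once, then answer against it."""
    memory = build_memory(video_path)            # phase 1: temporal + object memory
    return answer_with_tools(question, memory)   # phase 2: LLM-driven tool use
```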

Memory Construction

VideoAgent maintains a temporal memory and an object memory. The video is sliced into 2-second segments, and the temporal memory stores the event descriptions of these segments generated by a video captioning model. In addition, the textual and visual features of these segments are stored in the temporal memory for similarity-based segment localization in the inference stage. The object memory stores all objects and their information in an SQL database, including their categories, CLIP features, and the segments in which they appear. The object information is obtained by object tracking with a re-identification method proposed in this paper.
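As a concrete illustration of what the two memories might contain, the sketch below builds a toy temporal memory and an object-memory table. The exact layout and column names are our assumptions for exposition; the paper only specifies that the object memory is an SQL database holding categories, CLIP features, and appearing segments.

```python
import sqlite3
import numpy as np

# --- Temporal memory (assumed layout): one entry per 2-second segment. ---
# Each entry keeps the generated caption plus textual/visual features so that
# segments can later be localized by similarity search.
temporal_memory = []  # list of dicts: {segment_id, caption, text_feat, vis_feat}

def add_segment(segment_id: int, caption: str,
                text_feat: np.ndarray, vis_feat: np.ndarray) -> None:
    temporal_memory.append({
        "segment_id": segment_id,   # index of the 2-second segment
        "caption": caption,         # event description from the captioning model
        "text_feat": text_feat,     # feature of the caption text
        "vis_feat": vis_feat,       # visual feature of the segment
    })

# --- Object memory (assumed schema): tracked objects in an SQL database. ---
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE objects (
        object_id INTEGER PRIMARY KEY,  -- unique ID after re-identification
        category  TEXT,                 -- e.g. 'person', 'cup'
        clip_feat BLOB,                 -- serialized CLIP feature
        segments  TEXT                  -- IDs of segments in which the object appears
    )
""")

def add_object(object_id: int, category: str,
               clip_feat: np.ndarray, segments: list[int]) -> None:
    db.execute(
        "INSERT INTO objects VALUES (?, ?, ?, ?)",
        (object_id, category,
         clip_feat.astype(np.float32).tobytes(),
         ",".join(map(str, segments))),
    )
    db.commit()
```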


Figure 3. Illustration of object tracking with our re-identification method.

Inference

During the inference stage, the LLM can use four tools to gather the video information required to answer the question. The four tools are:

  • Caption Retrieval: given a start segment and an end segment, retrieve all captions (at most 15) of the segments between them from the temporal memory.
  • Segment Localization: given a text query, locate the relevant segments according to the similarities between the query feature and the segment features stored in the temporal memory (see the sketch after this list).
  • Visual Question Answering: given a question and a target video segment, use a video LLM to describe what happens in this short segment and answer the question.
  • Object Memory Querying: given an object- or person-related question, retrieve the relevant information from the object memory to answer it.
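
As an example of how a tool operates on the temporal memory, the following sketch implements similarity-based Segment Localization over the stored features; the use of cosine similarity, the max over textual and visual features, and the top-k cutoff are our assumptions rather than the exact scoring used by VideoAgent.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def localize_segments(query_feat: np.ndarray, temporal_memory: list[dict],
                      top_k: int = 5) -> list[int]:
    """Return the IDs of the segments most similar to a query feature.

    The query feature is assumed to come from the same encoder used when
    building the temporal memory, so the features are directly comparable.
    """
    scored = [
        (entry["segment_id"],
         max(cosine_sim(query_feat, entry["text_feat"]),
             cosine_sim(query_feat, entry["vis_feat"])))
        for entry in temporal_memory
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [segment_id for segment_id, _ in scored[:top_k]]
```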

The LLM agent performs multiple steps toward the final answer. In each step, it invokes a tool to gather information based on its reasoning and the results of the previous steps. An example is shown in Figure 4 below.
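
The multi-step loop can be sketched as follows; the decision format returned by the LLM, the stopping criterion, and the step limit are assumptions about the control flow, not VideoAgent's actual prompts.

```python
from typing import Callable

def run_agent(question: str,
              call_llm: Callable[[str, list], dict],
              tools: dict[str, Callable],
              max_steps: int = 10) -> str:
    """Minimal tool-use loop: the LLM repeatedly picks a tool until it answers.

    `call_llm` is assumed to return a dict such as
    {"thought": ..., "action": <tool name or "final_answer">,
     "arguments": {...}, "content": ...}; this schema is illustrative.
    """
    history: list[dict] = []  # accumulated thoughts, actions, and tool results
    for _ in range(max_steps):
        decision = call_llm(question, history)
        if decision["action"] == "final_answer":
            return decision["content"]
        # Invoke one of the four tools (Caption Retrieval, Segment Localization,
        # Visual Question Answering, Object Memory Querying) with the LLM's arguments.
        result = tools[decision["action"]](**decision.get("arguments", {}))
        history.append({"thought": decision.get("thought", ""),
                        "action": decision["action"],
                        "result": result})
    return "No answer found within the step limit."
```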


Figure 4. Given a question, VideoAgent executes multiple tool-use steps until it reaches the answer. The yellow, red, and blue blocks in each step denote the chain of thought, the action to be taken, and the tool-use results, respectively.

Video Demo