<div align="center">
  <!-- Project Title -->
  <h1>
    MotionAgent: Fine-grained Controllable Video Generation via<br>
    Motion Field Agent
  </h1>
  <!-- Conference Info -->
  <p><em>International Conference on Computer Vision, ICCV 2025.</em></p>
  <!-- Project Badges -->
  <p>
    <a href="https://arxiv.org/abs/2502.03207">
      <img src="https://img.shields.io/badge/arXiv-2502.03207-b31b1b.svg" alt="arXiv"/>
    </a>
    <a href="https://huggingface.co/leoisufa/MotionAgent">
      <img src="https://img.shields.io/badge/HuggingFace-Model-yellow.svg" alt="HuggingFace"/>
    </a>
  </p>
</div>


<div align="center">
  <strong>Xinyao Liao<sup>1,2</sup></strong>,
  <strong>Xianfang Zeng<sup>2</sup></strong>,
  <strong>Liao Wang<sup>2</sup></strong>,
  <strong>Gang Yu<sup>2*</sup></strong>,
  <strong>Guosheng Lin<sup>1*</sup></strong>,
  <strong>Chi Zhang<sup>3</sup></strong>
  <br><br>
  <b>
    <sup>1</sup> Nanyang Technological University&emsp;
    <sup>2</sup> StepFun&emsp;
    <sup>3</sup> Westlake University
  </b>
</div>

## 🧩 Overview
<p align="center">
  <img src="assets/agent.jpg" alt="Pipeline of Motion Field Agent" width="100%">
</p>

MotionAgent is a novel framework that enables **fine-grained motion control** for text-guided image-to-video generation. At its core is a **motion field agent** that parses motion information in text prompts and converts it into explicit *object trajectories* and *camera extrinsics*. These motion representations are analytically integrated into a unified optical flow, which conditions a diffusion-based image-to-video model to generate videos with precise and flexible motion control. An optional rethinking step further refines motion alignment by iteratively correcting the agent's previous actions.
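At a glance, the pipeline can be summarized in pseudocode. Every name below (`MotionFieldAgent`, `compose_flow`, `sample_video`, and so on) is an illustrative placeholder, not this repo's actual API:

```python
# Pseudocode sketch of the MotionAgent pipeline -- all names are hypothetical.

def generate(image, prompt):
    agent = MotionFieldAgent()  # VLM-based agent that parses motion from text

    # 1. Convert the prompt's motion description into explicit representations.
    trajectories = agent.object_trajectories(image, prompt)  # per-object 2D paths
    extrinsics = agent.camera_extrinsics(prompt)             # per-frame camera poses

    # 2. Analytically fuse both into one dense optical-flow field
    #    (projecting camera motion into flow requires depth, hence Metric3D).
    depth = estimate_depth(image)
    flow = compose_flow(trajectories, extrinsics, depth)

    # 3. Condition the image-to-video diffusion model (SVD + flow ControlNet).
    video = sample_video(image, flow)

    # 4. Optional rethinking: inspect the result and revise the agent's actions.
    while not agent.motion_aligned(video, prompt):
        trajectories, extrinsics = agent.revise(video, trajectories, extrinsics)
        flow = compose_flow(trajectories, extrinsics, depth)
        video = sample_video(image, flow)

    return video
```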

## 🎥 Demo
<p align="center">
  <a href="https://www.youtube.com/watch?v=O9WW2UpXsAI" target="_blank">
    <img src="https://img.youtube.com/vi/O9WW2UpXsAI/maxresdefault.jpg" 
         alt="MotionAgent Demo Video" 
         width="80%" 
         style="max-width:900px; border-radius:10px; box-shadow:0 0 10px rgba(0,0,0,0.15);">
  </a>
  <br>
  <em>Click the image above to watch the full video on YouTube 🎬</em>
</p>

## ๐Ÿ› ๏ธ Dependencies and Installation  
Follow the steps below to set up **MotionAgent** and run the demo smoothly ๐Ÿ’ซ
### ๐Ÿ”น 1. Clone the Repository  
Clone the official GitHub repository and enter the project directory:  
```bash
git clone https://github.com/leoisufa/MotionAgent.git
cd MotionAgent
```
### 🔹 2. Environment Setup
```bash
# Create and activate conda environment
conda create -n motionagent python=3.10 -y
conda activate motionagent

# Install PyTorch with CUDA 12.4 support
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124

# Install project dependencies
pip install -r requirements.txt
```
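To confirm the environment is set up correctly, a quick sanity check (a minimal sketch; the expected values follow from the pinned versions above):

```python
# check_env.py -- minimal environment sanity check.
import torch

print(torch.__version__)          # expect 2.4.1+cu124
print(torch.version.cuda)         # expect 12.4
print(torch.cuda.is_available())  # True once the GPU driver is visible
```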
### 🔹 3. Install Grounded-Segment-Anything Dependencies
MotionAgent relies on external segmentation and grounding models.
Follow the steps below to install [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything):
```bash
# Navigate to models directory
cd models

# Clone the Grounded-Segment-Anything repository
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git

# Enter the cloned directory
cd Grounded-Segment-Anything

# Install Segment Anything
python -m pip install -e segment_anything

# Install Grounding DINO
pip install --no-build-isolation -e GroundingDINO

# Return to the repository root (the next step starts from there)
cd ../..
```
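If the editable installs succeeded, both packages should import cleanly. A minimal check using the standard `segment_anything` and `groundingdino` imports:

```python
# Verify that Grounding DINO and Segment Anything are importable.
import groundingdino
from segment_anything import sam_model_registry

print(groundingdino.__file__)
print(sorted(sam_model_registry))  # expect ['default', 'vit_b', 'vit_h', 'vit_l']
```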

### 🔹 4. Install Metric3D Dependencies
MotionAgent relies on an external monocular depth estimation model.
Follow the steps below to install [Metric3D](https://github.com/YvanYin/Metric3D):
```bash
# Navigate to models directory
cd models

# Clone the Metric3D repository
git clone https://github.com/YvanYin/Metric3D.git
```

## 🧱 Download Models
To run **MotionAgent**, please download all pretrained and auxiliary models listed below, and organize them under the `ckpts/` directory as shown in the example structure.  

### 1๏ธโƒฃ **Optical Flow ControlNet Weights**  
Download from ๐Ÿ‘‰ [Hugging Face (MotionAgent)](https://huggingface.co/leoisufa/MotionAgent) and place the files in `ckpts`.

### 2๏ธโƒฃ **Stable Video Diffusion**  
Download from ๐Ÿ‘‰ [Hugging Face (MOFA-Video-Hybrid/stable-video-diffusion-img2vid-xt-1-1)](https://huggingface.co/MyNiuuu/MOFA-Video-Hybrid/tree/main/ckpts/mofa/stable-video-diffusion-img2vid-xt-1-1) and save the model to `ckpts`.
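If you prefer scripting these two downloads, `huggingface_hub` can fetch them; the snippet below is a sketch, and the `allow_patterns` path mirrors the MOFA-Video-Hybrid repo layout linked above (verify it before relying on it):

```python
# Sketch: fetch the checkpoints from steps 1-2 with huggingface_hub.
from huggingface_hub import snapshot_download

# 1. Optical Flow ControlNet weights -> ckpts/
snapshot_download(repo_id="leoisufa/MotionAgent", local_dir="ckpts")

# 2. Stable Video Diffusion, vendored inside MOFA-Video-Hybrid.
#    This lands under ./ckpts/mofa/...; move the inner folder to
#    ckpts/stable-video-diffusion-img2vid-xt-1-1/ afterwards.
snapshot_download(
    repo_id="MyNiuuu/MOFA-Video-Hybrid",
    allow_patterns=["ckpts/mofa/stable-video-diffusion-img2vid-xt-1-1/*"],
    local_dir=".",
)
```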

### 3๏ธโƒฃ **Grounding DINO**  
Download the grounding model checkpoint using the command below:  
```bash
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```
Then place it directly under `ckpts`.

### 4๏ธโƒฃ **Segment Anything**
Download the segmentation model using:
```bash
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```
Then place it under `ckpts`.
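Once the checkpoint is in `ckpts/`, you can smoke-test that the SAM weights load; the registry call below is the standard `segment_anything` API:

```python
# Load the ViT-H SAM checkpoint to confirm the download is intact.
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="ckpts/sam_vit_h_4b8939.pth")
print(f"{sum(p.numel() for p in sam.parameters()) / 1e6:.0f}M parameters")
```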

### 5๏ธโƒฃ **Metric Depth Estimator** 
Download from ๐Ÿ‘‰ [Hugging Face (Metric3d)](https://drive.google.com/file/d/1YfmvXwpWmhLg3jSxnhT7LvY0yawlXcr_/view?usp=drive_link) and place the files in `ckpts`.

### 6๏ธโƒฃ **CMP**
Download from ๐Ÿ‘‰ [Hugging Face (MOFA-Video-Hybrid/cmp)](https://huggingface.co/MyNiuuu/MOFA-Video-Hybrid/resolve/main/models/cmp/experiments/semiauto_annot/resnet50_vip%2Bmpii_liteflow/checkpoints/ckpt_iter_42000.pth.tar) and save the model to `models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints`.

After all downloads and installations, your `ckpts/` directory should look like this:

```shell
ckpts/
├── controlnet/
├── stable-video-diffusion-img2vid-xt-1-1/
├── groundingdino_swint_ogc.pth
├── metric_depth_vit_small_800k.pth
└── sam_vit_h_4b8939.pth
```
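A small script can confirm nothing is missing before running the demo; the paths below are taken from the tree above plus the CMP location from step 6:

```python
# verify_ckpts.py -- check that every expected checkpoint is in place.
from pathlib import Path

expected = [
    "ckpts/controlnet",
    "ckpts/stable-video-diffusion-img2vid-xt-1-1",
    "ckpts/groundingdino_swint_ogc.pth",
    "ckpts/metric_depth_vit_small_800k.pth",
    "ckpts/sam_vit_h_4b8939.pth",
    "models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints/ckpt_iter_42000.pth.tar",
]
missing = [p for p in expected if not Path(p).exists()]
print("All checkpoints found." if not missing else f"Missing: {missing}")
```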

## 🚀 Running the Demos
```bash
python run_agent.py
```

## 🔗 BibTeX
If you find [MotionAgent](https://arxiv.org/abs/2502.03207) useful for your research and applications, please cite using this BibTeX:
```BibTeX
@article{liao2025motionagent,
  title={MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent},
  author={Liao, Xinyao and Zeng, Xianfang and Wang, Liao and Yu, Gang and Lin, Guosheng and Zhang, Chi},
  journal={arXiv preprint arXiv:2502.03207},
  year={2025}
}
```

## ๐Ÿ™ Acknowledgements
We thank the following projects for their excellent open-source work:
- [MOFA-Video](https://github.com/MyNiuuu/MOFA-Video)
- [AppAgent](https://github.com/TencentQQGYLab/AppAgent)
- [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)
- [Metric3D](https://github.com/YvanYin/Metric3D)