<div align="center">
<!-- Project Title -->
<h1>
MotionAgent: Fine-grained Controllable Video Generation via<br>
Motion Field Agent
</h1>
<!-- Conference Info -->
<p><em>International Conference on Computer Vision, ICCV 2025.</em></p>
<!-- Project Badges -->
<p>
<a href="https://arxiv.org/abs/2502.03207">
<img src="https://img.shields.io/badge/arXiv-2502.03207-b31b1b.svg" alt="arXiv"/>
</a>
<a href="https://huggingface.co/leoisufa/MotionAgent">
<img src="https://img.shields.io/badge/HuggingFace-Model-yellow.svg" alt="HuggingFace"/>
</a>
</p>
</div>
<div align="center">
<strong>Xinyao Liao<sup>1,2</sup></strong>,
<strong>Xianfang Zeng<sup>2</sup></strong>,
<strong>Liao Wang<sup>2</sup></strong>,
<strong>Gang Yu<sup>2*</sup></strong>,
<strong>Guosheng Lin<sup>1*</sup></strong>,
<strong>Chi Zhang<sup>3</sup></strong>
<br><br>
<b>
<sup>1</sup> Nanyang Technological University&emsp;
<sup>2</sup> StepFun&emsp;
<sup>3</sup> Westlake University
</b>
</div>
## 🧩 Overview
<p align="center">
<img src="assets/agent.jpg" alt="Pipeline of Motion Field Agent" width="100%">
</p>
MotionAgent is a novel framework that enables **fine-grained motion control** for text-guided image-to-video generation. At its core is a **motion field agent** that parses motion information in text prompts and converts it into explicit *object trajectories* and *camera extrinsics*. These motion representations are analytically integrated into a unified optical flow, which conditions a diffusion-based image-to-video model to generate videos with precise and flexible motion control. An optional rethinking step further refines motion alignment by iteratively correcting the agent's previous actions.
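To make the "analytical integration" step concrete, the sketch below shows how a camera-induced flow can be derived from a depth map and a relative camera pose, and how object-trajectory flow can then be overlaid. This is a minimal NumPy illustration of the general idea, assuming pinhole intrinsics `K` and relative pose `(R, t)`; it is not the repository's actual implementation, and the fusion rule shown is just one plausible choice.

```python
import numpy as np

def camera_flow(depth, K, R, t):
    """Camera-induced optical flow from a depth map and a relative pose.

    Minimal pinhole-camera sketch: back-project each pixel with its depth,
    apply the relative motion (R, t), and re-project.
    """
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1).astype(float)
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)  # 3-D points, camera frame
    proj = K @ (R @ pts + t.reshape(3, 1))               # move camera, re-project
    flow = proj[:2] / proj[2:] - pix[:2]                 # per-pixel displacement
    return flow.reshape(2, H, W).transpose(1, 2, 0)      # H x W x 2

def unified_flow(cam_flow, obj_flow, obj_mask):
    """Fuse camera flow with object-trajectory flow (one plausible rule:
    the object's own motion takes precedence inside its mask)."""
    flow = cam_flow.copy()
    flow[obj_mask] = obj_flow[obj_mask]
    return flow

# Toy usage: identity intrinsics, small sideways camera translation
K = np.eye(3)
depth = np.full((4, 4), 2.0)
R, t = np.eye(3), np.array([0.1, 0.0, 0.0])
print(camera_flow(depth, K, R, t)[0, 0])  # uniform 0.05-pixel shift
```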
## 🎥 Demo
<p align="center">
<a href="https://www.youtube.com/watch?v=O9WW2UpXsAI" target="_blank">
<img src="https://img.youtube.com/vi/O9WW2UpXsAI/maxresdefault.jpg"
alt="MotionAgent Demo Video"
width="80%"
style="max-width:900px; border-radius:10px; box-shadow:0 0 10px rgba(0,0,0,0.15);">
</a>
<br>
<em>Click the image above to watch the full video on YouTube 🎬</em>
</p>
## 🛠️ Dependencies and Installation
Follow the steps below to set up **MotionAgent** and run the demo smoothly 💫
### 🔹 1. Clone the Repository
Clone the official GitHub repository and enter the project directory:
```bash
git clone https://github.com/leoisufa/MotionAgent.git
cd MotionAgent
```
### 🔹 2. Environment Setup
```bash
# Create and activate conda environment
conda create -n motionagent python=3.10 -y
conda activate motionagent
# Install PyTorch with CUDA 12.4 support
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
# Install project dependencies
pip install -r requirements.txt
```
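Optionally, verify that the pinned PyTorch build can see your GPU before continuing:

```bash
# Optional: confirm the CUDA-enabled PyTorch install works
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```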
### 🔹 3. Install Grounded-Segment-Anything Dependencies
MotionAgent relies on external segmentation and grounding models.
Follow the steps below to install [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything):
```bash
# Navigate to models directory
cd models
# Clone the Grounded-Segment-Anything repository
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git
# Enter the cloned directory
cd Grounded-Segment-Anything
# Install Segment Anything
python -m pip install -e segment_anything
# Install Grounding DINO
pip install --no-build-isolation -e GroundingDINO
```
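If both editable installs succeeded, the packages should import without errors (optional check):

```bash
# Optional: verify the Segment Anything and Grounding DINO installs
python -c "import segment_anything, groundingdino; print('ok')"
```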
### 🔹 4. Install Metric3D Dependencies
MotionAgent relies on an external monocular depth estimation model.
Follow the steps below to install [Metric3D](https://github.com/YvanYin/Metric3D):
```bash
# Navigate to the models directory (from the repository root)
cd models
# Clone the Metric3D repository
git clone https://github.com/YvanYin/Metric3D.git
```
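Metric3D is used here as a source checkout rather than a pip package. If you want to sanity-check the dependency on its own, upstream Metric3D also publishes its models through `torch.hub`; note that this downloads upstream weights and is independent of the checkpoint in step 5️⃣ below:

```python
import torch

# Optional sanity check of the Metric3D dependency via torch.hub.
# 'metric3d_vit_small' corresponds to the metric_depth_vit_small_800k.pth
# checkpoint used by MotionAgent.
model = torch.hub.load('yvanyin/metric3d', 'metric3d_vit_small', pretrain=True)
```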
## 🧱 Download Models
To run **MotionAgent**, please download all pretrained and auxiliary models listed below, and organize them under the `ckpts/` directory as shown in the example structure.
### 1️⃣ **Optical Flow ControlNet Weights**
Download from [Hugging Face (MotionAgent)](https://huggingface.co/leoisufa/MotionAgent) and place the files in `ckpts`.
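If you prefer the command line, `huggingface-cli` (installed with `huggingface_hub`) can fetch the repo in one step; this assumes the repo's file layout matches the `ckpts/` structure shown at the end of this section:

```bash
# Alternative command-line download (assumes repo layout matches ckpts/)
huggingface-cli download leoisufa/MotionAgent --local-dir ckpts
```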
### 2️⃣ **Stable Video Diffusion**
Download from [Hugging Face (MOFA-Video-Hybrid/stable-video-diffusion-img2vid-xt-1-1)](https://huggingface.co/MyNiuuu/MOFA-Video-Hybrid/tree/main/ckpts/mofa/stable-video-diffusion-img2vid-xt-1-1) and save the model to `ckpts`.
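On the command line you can pull just this subfolder; note that `--local-dir` preserves repo-relative paths, so the files need one move afterwards:

```bash
# Download only the SVD weights from the MOFA-Video-Hybrid repo
huggingface-cli download MyNiuuu/MOFA-Video-Hybrid \
  --include "ckpts/mofa/stable-video-diffusion-img2vid-xt-1-1/*" \
  --local-dir .
# Files arrive under ./ckpts/mofa/...; move them to the expected path
mv ckpts/mofa/stable-video-diffusion-img2vid-xt-1-1 ckpts/
rm -rf ckpts/mofa
```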
### 3️⃣ **Grounding DINO**
Download the grounding model checkpoint using the command below:
```bash
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
```
Then place it directly under `ckpts`.
### 4️⃣ **Segment Anything**
Download the segmentation model using:
```bash
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
```
Then place it under `ckpts`.
### 5️⃣ **Metric Depth Estimator**
Download the `metric_depth_vit_small_800k.pth` checkpoint from [Google Drive (Metric3D)](https://drive.google.com/file/d/1YfmvXwpWmhLg3jSxnhT7LvY0yawlXcr_/view?usp=drive_link) and place it in `ckpts`.
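If you prefer the command line, recent versions of the `gdown` utility accept the Google Drive file ID directly (the ID below is taken from the link above; the output filename matches the `ckpts/` tree at the end of this section):

```bash
# Optional command-line download of the Metric3D checkpoint
pip install gdown
gdown 1YfmvXwpWmhLg3jSxnhT7LvY0yawlXcr_ -O ckpts/metric_depth_vit_small_800k.pth
```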
### 6️⃣ **CMP**
Download from [Hugging Face (MOFA-Video-Hybrid/cmp)](https://huggingface.co/MyNiuuu/MOFA-Video-Hybrid/resolve/main/models/cmp/experiments/semiauto_annot/resnet50_vip%2Bmpii_liteflow/checkpoints/ckpt_iter_42000.pth.tar) and save the model to `models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints`.
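Equivalently, from the repository root:

```bash
# Fetch the CMP checkpoint straight into its expected directory
mkdir -p "models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints"
wget -P "models/cmp/experiments/semiauto_annot/resnet50_vip+mpii_liteflow/checkpoints" \
  "https://huggingface.co/MyNiuuu/MOFA-Video-Hybrid/resolve/main/models/cmp/experiments/semiauto_annot/resnet50_vip%2Bmpii_liteflow/checkpoints/ckpt_iter_42000.pth.tar"
```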
After all downloads and installations, your ckpts folder should look like this:
```shell
ckpts/
├── controlnet/
├── stable-video-diffusion-img2vid-xt-1-1/
├── groundingdino_swint_ogc.pth
├── metric_depth_vit_small_800k.pth
└── sam_vit_h_4b8939.pth
```
## 🚀 Running the Demos
```bash
python run_agent.py
```
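Before launching, you can optionally confirm that every checkpoint from the tree above is in place (a small hypothetical helper, not part of the repository):

```python
# Hypothetical pre-flight check; paths mirror the ckpts/ tree above.
from pathlib import Path

required = [
    "ckpts/controlnet",
    "ckpts/stable-video-diffusion-img2vid-xt-1-1",
    "ckpts/groundingdino_swint_ogc.pth",
    "ckpts/metric_depth_vit_small_800k.pth",
    "ckpts/sam_vit_h_4b8939.pth",
]
missing = [p for p in required if not Path(p).exists()]
if missing:
    raise SystemExit(f"Missing checkpoints: {missing}")
print("All checkpoints found; ready to run run_agent.py.")
```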
## 📖 BibTeX
If you find [MotionAgent](https://arxiv.org/abs/2502.03207) useful for your research and applications, please cite using this BibTeX:
```BibTeX
@article{liao2025motionagent,
title={MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent},
author={Liao, Xinyao and Zeng, Xianfang and Wang, Liao and Yu, Gang and Lin, Guosheng and Zhang, Chi},
journal={arXiv preprint arXiv:2502.03207},
year={2025}
}
```
## 🙏 Acknowledgements
We thank the authors of the following excellent open-source projects:
- [MOFA-Video](https://github.com/MyNiuuu/MOFA-Video)
- [AppAgent](https://github.com/TencentQQGYLab/AppAgent)
- [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything)
- [Metric3D](https://github.com/YvanYin/Metric3D) |