---
license: apache-2.0
tags:
- video-classification
- action-recognition
- ucf101
- pytorch
- computer-vision
datasets:
- UCF-101
metrics:
- accuracy
- f1
model-index:
- name: mc3-18-ucf101
  results:
  - task:
      type: video-classification
      name: Action Recognition
    dataset:
      name: UCF-101
      type: ucf101
      split: test
    metrics:
    - type: accuracy
      value: 87.05
      name: Top-1 Accuracy
    - type: f1
      value: 85.69
      name: F1 Score
language:
- en
---

[![🐙 GitHub](https://img.shields.io/badge/GitHub-Repository-181717?logo=github&logoColor=white&style=for-the-badge)](https://github.com/dronefreak/human-action-classification)
[![📄 Paper: MC3](https://img.shields.io/badge/Paper-MC3-2EA44F?logo=arxiv&logoColor=white&style=for-the-badge)](https://arxiv.org/abs/1711.11248)
[![💽 Dataset: UCF-101](https://img.shields.io/badge/Dataset-UCF--101-34aa44?logo=database&logoColor=white&style=for-the-badge)](https://www.crcv.ucf.edu/data/UCF101.php)
[![⚖️ License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green?logo=open-source-initiative&logoColor=white&style=for-the-badge)](https://opensource.org/licenses/Apache-2.0)


![Demo](demo.gif)

# MC3-18 for UCF-101 Action Recognition

## Model Summary

This model is an MC3-18 (Mixed 3D Convolutions) network fine-tuned on the UCF-101 dataset for human action recognition. The architecture uses 3D convolutions in its early layers and 2D convolutions in the later ones, so it learns spatiotemporal features with far fewer parameters than a fully 3D network of the same depth.

- **Architecture:** MC3-18 (3D CNN with mixed convolutions)  
- **Pretraining:** Kinetics-400  
- **Parameter Count:** ~11.7M  
- **Input Format:** 16-frame clips, 112×112 spatial resolution  
- **Number of Classes:** 101  

---

## Intended Use

**Primary use case:** Action classification in short, trimmed videos similar in distribution to UCF-101.  
**Users:** Researchers, practitioners, and engineers working on video-understanding pipelines.  
**Tasks:**  
- Action recognition  
- Clip-level human activity tagging  
- Baseline modeling for low-compute video applications  

The model is not suitable for long-horizon temporal reasoning or for action detection in untrimmed video without adaptation.

---

## Performance

### Quantitative Results (UCF-101 Split 1, Test Set)

| Metric      | Value    |
|-------------|----------|
| Accuracy    | 87.05%   |
| F1 Score    | 0.857    |
| Precision   | 0.868    |

### Comparison to Published Baseline

- **Original MC3-18 (Kinetics-400 → UCF-101):** 85.0%  
- **This model:** **87.05%** (+2.05 percentage points)

---

## How to Use

### Inference Example (PyTorch)

```python
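import torch
from huggingface_hub import hf_hub_download
from torchvision.transforms import Compose, Resize, CenterCrop, Normalize, ToTensor

# Download the checkpoint from the Hugging Face Hub
model_path = hf_hub_download(repo_id="dronefreak/mc3-18-ucf101", filename="mc318-ufc101-split-1.pth")

# Assumes the file stores the full pickled model, hence weights_only=False
# (only do this for checkpoints you trust). If it stores a state dict instead,
# build the architecture first and call load_state_dict().
model = torch.load(model_path, map_location="cpu", weights_only=False)
model.eval()  # put batch-norm layers in inference mode

# Per-frame preprocessing: resize, center-crop to 112×112, normalize
transform = Compose([
    Resize((128, 171)),
    CenterCrop(112),
    ToTensor(),
    Normalize(mean=[0.43216, 0.394666, 0.37645],
              std=[0.22803, 0.22145, 0.216989])
])

# `frames` is a placeholder: a list of 16 RGB PIL images sampled from your video
clip = torch.stack([transform(frame) for frame in frames], dim=1)  # C×T×H×W
video_tensor = clip.unsqueeze(0)  # add batch dimension: 1×C×T×H×W

# Inference
with torch.no_grad():
    output = model(video_tensor)        # class logits, one row per clip
    prediction = output.argmax(dim=1)   # index of the predicted class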
```
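
The inference example expects a 16-frame clip, but the card does not say how those frames are picked from a longer video. One common choice, shown here purely as an assumption, is to split the video into 16 equal segments and take the middle frame of each. The helper name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(num_frames, clip_len=16):
    """Return clip_len evenly spaced frame indices for a video of num_frames frames.

    Videos shorter than clip_len repeat frames so the clip length stays fixed.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    # Middle of each of the clip_len equal-length segments, clamped to the last frame
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / clip_len))
        for i in range(clip_len)
    ]

print(sample_frame_indices(160))  # [5, 15, 25, ..., 155]
print(sample_frame_indices(8))    # each frame appears twice
```

The frames at these indices can then be decoded with any video reader and passed through the per-frame transform.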

## Training

- **Dataset:** UCF-101 Split 1 (9,537 train / 3,783 test videos)
- **Epochs:** 200
- **Batch Size:** 32
- **Optimizer:** SGD (lr=0.001, momentum=0.9, weight_decay=1e-4)
- **Augmentation:** ColorJitter, RandomHorizontalFlip, RandomCrop
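
To give a sense of the training budget these settings imply, the arithmetic below derives the number of optimizer steps. It assumes one clip is drawn per training video per epoch and that the final partial batch is kept; neither detail is stated in the card.

```python
import math

train_videos, test_videos = 9_537, 3_783
batch_size, epochs = 32, 200

# Assumption: one clip per training video per epoch, last partial batch kept
steps_per_epoch = math.ceil(train_videos / batch_size)
total_steps = steps_per_epoch * epochs
train_fraction = train_videos / (train_videos + test_videos)

print(steps_per_epoch)           # 299
print(total_steps)               # 59800
print(f"{train_fraction:.1%}")   # 71.6%
```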

## Limitations

- Trained only on UCF-101 (limited to 101 action classes)
- Requires 16-frame clips (not suitable for real-time single-frame inference)
- Performs best on action types similar to those in UCF-101

## Citation

```bibtex
@misc{mc3_18_ucf101,
  author = {Saumya Saksena},
  title = {MC3-18 for UCF-101 Action Recognition},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/dronefreak/mc3-18-ucf101}}
}
```

## License

Apache-2.0