---
language: en
license: apache-2.0
library_name: tensorflow
tags:
- tensorflow
- keras
- tflite
- emotion-recognition
- transformer
- lstm
- mediapipe
- computer-vision
- deep-learning
- facial-expression
- affective-computing
- sequential-data
model-index:
- name: emotion_landmark_lstm_model
  results:
  - task:
      type: sequence-classification
    dataset:
      type: dataset
      name: Optimized 478-Point 3D Facial Landmark Dataset
    metrics:
    - name: accuracy
      type: accuracy
      value: 0.7289
inference: "Supports TensorFlow and TensorFlow Lite real-time inference"
---

# Emotion Sequence Transformer (TensorFlow) – Mediapipe 478 Landmarks (Seq256)

**Version:** v1.0

**Framework:** TensorFlow 2.x

**Optimized format:** TensorFlow Lite

**Input:** 478 Mediapipe Face Mesh landmarks per frame (up to 300 frames)

**Output:** 6-class emotion prediction (`Angry`, `Disgust`, `Fear`, `Happy`, `Neutral`, `Sad`)

---

## 🧠 Model Overview

The **Emotion Sequence Transformer** is a deep learning model built with TensorFlow that recognizes **human emotions** from continuous **video clips**.
It uses **478 Mediapipe facial landmarks per frame** to capture the spatiotemporal patterns of facial movement over time.
The model predicts one of six basic emotions by analyzing both facial geometry and temporal variation within sequences of up to **300 frames**.

This model is suitable for **real-time video-based emotion detection**, **affective computing**, **human-computer interaction**, and **emotion-aware AI systems**.
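
The per-frame landmarks can be produced with MediaPipe Face Mesh, which returns 478 points per face when `refine_landmarks=True` (468 mesh points plus 10 iris points). The snippet below is a minimal, illustrative extraction loop, not the pipeline used to build the dataset: it assumes the legacy `mp.solutions.face_mesh` API and OpenCV, and the function name `extract_landmark_sequence` is hypothetical. How the resulting `(frames, 478, 3)` coordinates are reduced to the `(300, 478)` feature matrix the model expects follows the dataset's own preprocessing, which is not reproduced here.

```python
# Hypothetical extraction sketch: collect 478 Face Mesh landmarks per video frame.
# Assumes the legacy MediaPipe "solutions" API and OpenCV; not the exact training pipeline.
import cv2
import mediapipe as mp
import numpy as np

mp_face_mesh = mp.solutions.face_mesh

def extract_landmark_sequence(video_path, max_frames=300):
    frames = []
    cap = cv2.VideoCapture(video_path)
    with mp_face_mesh.FaceMesh(
        static_image_mode=False,
        max_num_faces=1,
        refine_landmarks=True,   # 468 mesh points + 10 iris points = 478
    ) as face_mesh:
        while len(frames) < max_frames:
            ok, frame = cap.read()
            if not ok:
                break
            result = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not result.multi_face_landmarks:
                continue  # skip frames where no face is detected
            pts = result.multi_face_landmarks[0].landmark
            # (478, 3) array of normalized x, y, z coordinates for this frame
            frames.append(np.array([[p.x, p.y, p.z] for p in pts], dtype=np.float32))
    cap.release()
    return np.stack(frames) if frames else np.empty((0, 478, 3), dtype=np.float32)
```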

---

## Dataset

This model was trained on the **[Optimized 478-Point 3D Facial Landmark Dataset](https://www.kaggle.com/datasets/psewmuthu/optimized-video-facial-landmarks)**, a dataset derived from the **Video Emotion Dataset** and optimized for emotion recognition using Mediapipe's 3D face mesh landmarks.

Each sample in the dataset includes:

- Up to **300 frames per clip**
- **478 facial landmarks per frame**
- Corresponding **emotion label**
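
Because clips can be shorter or longer than 300 frames, sequences must be brought to a fixed length before inference. The sketch below is a minimal example assuming simple truncation and zero-padding; the dataset's actual padding scheme is not documented here and may differ.

```python
# Hypothetical sketch: fit a variable-length landmark sequence into the fixed
# (300, 478) input expected by the model. Zero-padding is an assumption here.
import numpy as np

MAX_FRAMES = 300
NUM_FEATURES = 478

def to_fixed_length(seq: np.ndarray) -> np.ndarray:
    """seq: (num_frames, 478) per-frame feature matrix."""
    seq = seq[:MAX_FRAMES]                    # truncate long clips
    padded = np.zeros((MAX_FRAMES, NUM_FEATURES), dtype=np.float32)
    padded[: len(seq)] = seq                  # pad short clips with zeros
    return padded
```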

---

## 🧩 Model Architecture

The architecture is based on a **Transformer encoder** design that processes sequences of facial landmarks; a minimal sketch of this pipeline appears after the component list below.

**Pipeline:**

1. Input normalization using the precomputed global mean and standard deviation
2. Sequence embedding via positional encodings
3. Transformer encoder blocks to capture temporal and spatial dependencies
4. Dense layers for emotion classification (6 output neurons with softmax)

**Core Components:**

- Transformer Encoder Layers (Multi-Head Self-Attention)
- Layer Normalization and Dropout
- Dense classification head
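
For illustration, here is a minimal Keras sketch of such an encoder-style pipeline. It is not the exact released architecture: the hyperparameters (`D_MODEL`, `NUM_HEADS`, `FF_DIM`, `NUM_BLOCKS`), the sinusoidal positional encoding, and the global-average-pooling head are assumptions for demonstration only.

```python
# Illustrative Keras sketch of the encoder-style pipeline described above.
# Hyperparameters and the sinusoidal positional encoding are assumptions,
# not the exact configuration of the released model.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

MAX_FRAMES, NUM_FEATURES, NUM_CLASSES = 300, 478, 6
D_MODEL, NUM_HEADS, FF_DIM, NUM_BLOCKS = 128, 4, 256, 2

def encoder_block(x):
    # Multi-head self-attention with residual connection and layer normalization
    attn = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=D_MODEL // NUM_HEADS)(x, x)
    x = layers.LayerNormalization()(x + layers.Dropout(0.1)(attn))
    # Position-wise feed-forward network, again with residual + layer norm
    ff = layers.Dense(FF_DIM, activation="relu")(x)
    ff = layers.Dense(D_MODEL)(ff)
    return layers.LayerNormalization()(x + layers.Dropout(0.1)(ff))

# Sinusoidal positional encoding over the 300 frame positions
pos = np.arange(MAX_FRAMES)[:, None]
i = np.arange(D_MODEL)[None, :]
angle = pos / np.power(10000.0, (2 * (i // 2)) / D_MODEL)
pos_encoding = np.where(i % 2 == 0, np.sin(angle), np.cos(angle)).astype("float32")

inputs = layers.Input(shape=(MAX_FRAMES, NUM_FEATURES))
x = layers.Dense(D_MODEL)(inputs)          # project per-frame landmark features to d_model
x = x + pos_encoding                       # inject frame-order information

for _ in range(NUM_BLOCKS):
    x = encoder_block(x)

x = layers.GlobalAveragePooling1D()(x)     # pool over the time dimension
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
```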

---

## Performance

| Metric                | Value      |
| --------------------- | ---------- |
| **Test Accuracy**     | 0.7289     |
| **Test Loss**         | 1.1336     |
| **Macro F1-Score**    | 0.73       |
| **Weighted F1-Score** | 0.73       |
| **Max Clip Length**   | 300 frames |
| **Input Shape**       | (300, 478) |

### 🧾 Classification Report

| Emotion              | Precision | Recall | F1-score            | Support |
| -------------------- | --------- | ------ | ------------------- | ------- |
| Angry                | 0.75      | 0.73   | 0.74                | 139     |
| Disgust              | 0.88      | 0.70   | 0.78                | 128     |
| Fear                 | 0.52      | 0.60   | 0.55                | 114     |
| Happy                | 0.88      | 0.97   | 0.92                | 129     |
| Neutral              | 0.66      | 0.79   | 0.72                | 101     |
| Sad                  | 0.70      | 0.58   | 0.64                | 134     |
| **Overall Accuracy** | **0.73**  |        | **Macro Avg: 0.73** | 745     |

---

## Visualizations

### 🔹 Training Accuracy and Loss

![Training Metrics](./assets/accuracy_loss_plot.png)

### 🔹 Confusion Matrix

![Confusion Matrix](./assets/confusion_matrix.png)

### 🔹 ROC Curves (Per Class)

![ROC Curves](./assets/roc_curves.png)

---

## Repository Structure

```
TF-Emotion-Sequence-Transformer/
├── tf_emotion_sequence_transformer_mp478_seq256.h5
├── tf_emotion_sequence_transformer_mp478_seq256_optimized.tflite
├── tf_emotion-sequence-transformer-bilstm-usage.ipynb
├── assets/
│   ├── global_mean.npy
│   ├── global_std.npy
│   ├── label_encoder.pkl
│   └── metadata.json
└── README.md
```

### File Descriptions

| File                                                             | Description                                                                                          |
| ---------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------- |
| `tf_emotion_sequence_transformer_mp478_seq256.h5`               | Main TensorFlow model trained on 478 landmarks (300 frames max).                                     |
| `tf_emotion_sequence_transformer_mp478_seq256_optimized.tflite` | Optimized TensorFlow Lite version for deployment (mobile, edge).                                     |
| `tf_emotion-sequence-transformer-bilstm-usage.ipynb`            | Example notebook demonstrating how to use the model for emotion prediction from Mediapipe landmarks. |
| `assets/global_mean.npy`                                         | Precomputed global mean for normalization.                                                           |
| `assets/global_std.npy`                                          | Precomputed global standard deviation for normalization.                                             |
| `assets/label_encoder.pkl`                                       | Encoder mapping integer labels to emotion names.                                                     |
| `assets/metadata.json`                                           | Model metadata and configuration details.                                                            |
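
Before running inference, it can be useful to inspect the bundled assets. The snippet below is a small sanity check; it assumes `label_encoder.pkl` is a scikit-learn `LabelEncoder` (consistent with the `inverse_transform` calls in the usage examples) and does not assume any particular keys inside `metadata.json`.

```python
# Quick sanity check of the bundled assets. Assumes label_encoder.pkl is a
# scikit-learn LabelEncoder; the exact keys inside metadata.json are not listed here.
import json
import joblib

with open("assets/metadata.json") as f:
    metadata = json.load(f)
print(metadata)                      # model configuration details

label_encoder = joblib.load("assets/label_encoder.pkl")
print(list(label_encoder.classes_))  # expected: the six emotion names
```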

---

## Example Usage

### 🔸 TensorFlow (.h5) Model

```python
import numpy as np
import tensorflow as tf
import joblib

# Load Model
model = tf.keras.models.load_model("tf_emotion_sequence_transformer_mp478_seq256.h5")

# Load assets
mean = np.load("assets/global_mean.npy")
std = np.load("assets/global_std.npy")
label_encoder = joblib.load("assets/label_encoder.pkl")

# Preprocess input
input_seq = np.load("example_input.npy")  # shape: (300, 478)
input_seq = (input_seq - mean) / std
input_seq = np.expand_dims(input_seq, axis=0)  # add batch dimension -> (1, 300, 478)

# Predict
pred = model.predict(input_seq)
emotion = label_encoder.inverse_transform([np.argmax(pred)])[0]
print("Predicted Emotion:", emotion)
```

---

### 🔸 TensorFlow Lite (Optimized) Model

```python
import numpy as np
import tensorflow as tf
import joblib

# Load TFLite model
interpreter = tf.lite.Interpreter(model_path="tf_emotion_sequence_transformer_mp478_seq256_optimized.tflite")
interpreter.allocate_tensors()

# Get input and output tensors
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Load preprocessing assets
mean = np.load("assets/global_mean.npy")
std = np.load("assets/global_std.npy")
label_encoder = joblib.load("assets/label_encoder.pkl")

# Prepare input
input_seq = np.load("example_input.npy")  # shape: (300, 478)
input_seq = (input_seq - mean) / std
input_seq = np.expand_dims(input_seq, axis=0).astype(np.float32)  # (1, 300, 478)

# Inference
interpreter.set_tensor(input_details[0]['index'], input_seq)
interpreter.invoke()
pred = interpreter.get_tensor(output_details[0]['index'])

# Decode emotion
emotion = label_encoder.inverse_transform([np.argmax(pred)])[0]
print("Predicted Emotion:", emotion)
```
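
Continuing from the snippet above, you can confirm the tensor layout the converted model expects before preparing data; the shapes in the comments are what the tables above suggest, not values read from the file here.

```python
# Optional sanity check (continues from the TFLite example above)
print("Input shape :", input_details[0]['shape'])   # expected: [1, 300, 478]
print("Input dtype :", input_details[0]['dtype'])   # expected: float32
print("Output shape:", output_details[0]['shape'])  # expected: [1, 6]
```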

---

## Version Information

**Version:** v1.0

**Date:** November 2025

**Author:** [P.S. Abewickrama Singhe](https://www.kaggle.com/psewmuthu)

**Framework:** TensorFlow 2.x

**Exported Models:** `.h5`, `.tflite`

**Landmarks per frame:** 478

**Max frames per clip:** 300

---

## 🏷️ Tags

`tensorflow` • `emotion-recognition` • `mediapipe` • `transformer` • `sequence-model` • `facial-landmarks` • `video-analysis` • `tflite` • `human-emotion-ai` • `affective-computing` • `computer-vision` • `deep-learning`

---

## Citation

If you use this model in your research, please cite it as:

```bibtex
@misc{pasindu_sewmuthu_abewickrama_singhe_2025,
  author    = {Pasindu Sewmuthu Abewickrama Singhe},
  title     = {EmotionFormer-BiLSTM (Revision f329517)},
  year      = 2025,
  url       = {https://huggingface.co/PSewmuthu/EmotionFormer-BiLSTM},
  doi       = {10.57967/hf/6899},
  publisher = {Hugging Face}
}
```

---

## 🪪 License

This model is released under the **Apache 2.0 License** – free for academic and commercial use with attribution.

---