# Llama-3.2-1B-Instruct-GGUF (Benchmarked & Verified)
## Model Description
This repository contains manually benchmarked GGUF quantized versions of the Meta Llama 3.2 1B Instruct model.
These models are optimized for Edge AI deployment (Mobile, Raspberry Pi, Laptops) using llama.cpp. Unlike auto-generated quants, these weights have been tested against WikiText-2 to ensure the best balance between speed and accuracy.
## Key Features
- **Hyper-fast:** the Q4_K_M version reaches 42+ tokens/sec generation speed on CPU.
- **Ultra-low memory:** runs comfortably on devices with under 1 GB of RAM (measured: ~639 MiB).
- **Verified quality:** perplexity (PPL) tested on WikiText-2.
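As a back-of-envelope sanity check on that footprint (the parameter count below is an approximation, not a measurement): Q4_K_M K-quant mixes average roughly 4.5–5 bits per weight, which lines up with the file size reported in the table further down.

```python
# Back-of-envelope check of the Q4_K_M footprint.
# Assumptions: Llama 3.2 1B has ~1.24e9 parameters; 762 MB is the
# Q4_K_M file size from the benchmark table below.
N_PARAMS = 1.24e9
FILE_SIZE_BYTES = 762e6

bits_per_weight = FILE_SIZE_BYTES * 8 / N_PARAMS
print(f"~{bits_per_weight:.1f} bits per weight")  # ~4.9, typical for Q4_K_M
```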
## Benchmark Results
Tests were conducted using llama.cpp on a standard CPU setup (8 threads).
| Model Version | Size | Perplexity (PPL) | Quality Loss | Gen Speed (CPU) | Memory Usage |
|---|---|---|---|---|---|
| F16 (Original) | 2.30 GB | 13.99 | Baseline | 15.73 t/s | ~2.4 GB |
| Q8_0 | 1.22 GB | 14.01 | ~0.1% (Negligible) | 28.43 t/s | ~1.3 GB |
| Q4_K_M | 762 MB | 14.49 | ~3.6% (Acceptable) | 42.60 t/s | ~640 MB |
Conclusion: The Q4_K_M model offers the best trade-off, running 2.7x faster than the original with minimal quality loss.
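The headline numbers in that conclusion can be recomputed directly from the table:

```python
# Recompute the speedup and quality loss from the benchmark table above.
f16_speed, q4_speed = 15.73, 42.60   # tokens/sec
f16_ppl, q4_ppl = 13.99, 14.49       # WikiText-2 perplexity

speedup = q4_speed / f16_speed
quality_loss_pct = (q4_ppl - f16_ppl) / f16_ppl * 100
print(f"{speedup:.1f}x faster, {quality_loss_pct:.1f}% higher perplexity")
# -> 2.7x faster, 3.6% higher perplexity
```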
## Which File to Download?
| Filename | Description | Use Case |
|---|---|---|
| `llama-3.2-1b-q4_k_m.gguf` | **Recommended.** Balanced speed and accuracy. | Chatbots, Android/iOS apps, RAG |
| `llama-3.2-1b-q8_0.gguf` | High precision, larger size. | Research, creative writing |
| `llama-3.2-1b-f16.gguf` | Uncompressed weights. | Fine-tuning, conversion |
## Quick Usage
### Python (Google Colab / Local)

```python
# pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the recommended quant from the Hub
model_path = hf_hub_download(
    repo_id="Habibur2/Llama-3.2-1B-Instruct-GGUF",
    filename="llama-3.2-1b-q4_k_m.gguf",
)

llm = Llama(
    model_path=model_path,
    n_ctx=2048,      # context window
    verbose=False,
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello! Explain AI in one sentence."}]
)
print(response["choices"][0]["message"]["content"])
```
Uploaded by Habibur2 | Benchmarked with WikiText-2 & llama-bench
## Detailed Benchmark Results (WikiText-2)
Tests were conducted on llama.cpp (CPU backend). The results show that quantization has negligible impact on model quality while significantly reducing memory usage.
| Model Version | RAM Usage | Perplexity (Lower is Better) | Accuracy Loss |
|---|---|---|---|
| F16 (Original) | 2,357 MB | 13.99 | Baseline (0%) |
| Q8_0 | 1,252 MB | 14.01 | +0.02 (Negligible) |
Analysis: The Q8_0 version retains over 99.8% of the original model's quality while using 47% less memory.
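Those percentages follow directly from the two table rows:

```python
# Derive the Q8_0 memory saving and quality delta from the table above.
f16_mem, q8_mem = 2357, 1252         # MB
f16_ppl, q8_ppl = 13.99, 14.01       # WikiText-2 perplexity

mem_saved_pct = (f16_mem - q8_mem) / f16_mem * 100
ppl_increase_pct = (q8_ppl - f16_ppl) / f16_ppl * 100
print(f"{mem_saved_pct:.0f}% less memory, {ppl_increase_pct:.2f}% higher perplexity")
# -> 47% less memory, 0.14% higher perplexity
```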
Note: Speed may vary depending on your hardware. GPU offloading will significantly increase these numbers.
## Running with llama.cpp (CLI)
1. Clone and build llama.cpp (standard CMake build):

```shell
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release
```

2. Chat from the command line (`-cnv` enables conversation mode):

```shell
./build/bin/llama-cli -m llama-3.2-1b-q4_k_m.gguf -cnv -p "You are a helpful assistant."
```
### Python with GPU Offload (llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.2-1b-q4_k_m.gguf",
    chat_format="llama-3",   # apply the Llama 3 chat template
    n_gpu_layers=-1,         # offload all layers to GPU; set to 0 if no GPU
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, explain quantum physics in simple terms."}]
)
print(response["choices"][0]["message"]["content"])
```
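`chat_format="llama-3"` tells `llama-cpp-python` to wrap the messages in the Llama 3 instruct template before inference. A simplified sketch of that template follows (the real formatter inside `llama_cpp` additionally handles system prompts, tools, and edge cases):

```python
def llama3_prompt(messages):
    """Render chat messages into the Llama 3 instruct template.
    Simplified sketch of what chat_format="llama-3" does internally."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += (
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    # Leave the assistant header open so the model generates the reply
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out

print(llama3_prompt([{"role": "user", "content": "Hi"}]))
```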
## Model Tree
Base model: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)