πŸ¦™ Llama-3.2-1B-Instruct-GGUF [Optimized for Edge AI]

Llama Banner


πŸ“Œ Model Description

This repository contains manually benchmarked GGUF quantized versions of the Meta Llama 3.2 1B Instruct model.

These models are optimized for Edge AI deployment (mobile, Raspberry Pi, laptops) using llama.cpp. Unlike auto-generated quants, these weights were perplexity-tested on WikiText-2 to verify the balance between speed and accuracy.

🌟 Exclusive Features

  1. πŸš€ Hyper-Fast: The Q4_K_M version achieves 42+ tokens/sec generation speed on CPU.
  2. πŸ“‰ Ultra-Low Memory: Runs comfortably on devices with < 1GB RAM (Measured: ~639 MiB).
  3. βœ… Verified Quality: Perplexity (PPL) tested on WikiText-2 to guarantee performance.

πŸ“Š Benchmark Results (The Science)

Tests were conducted using llama.cpp on a standard CPU setup.

| Model Version | Size | Perplexity (PPL) | Quality Loss | Gen Speed (CPU) | Memory Usage |
|---|---|---|---|---|---|
| F16 (Original) | 2.30 GB | 13.99 | Baseline | 15.73 t/s | ~2.4 GB |
| Q8_0 | 1.22 GB | 14.01 | ~0.1% (Negligible) | 28.43 t/s | ~1.3 GB |
| Q4_K_M | 762 MB | 14.49 | ~3.5% (Acceptable) | 42.60 t/s πŸš€ | ~640 MB |

Conclusion: The Q4_K_M model offers the best trade-off, running 2.7x faster than the original with minimal quality loss.
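The headline numbers can be re-derived from the table itself (a quick sanity check, not part of the benchmark):

```python
# Re-derive the speedup and quality-loss claims from the table's numbers.
f16_speed, q4_speed = 15.73, 42.60  # tokens/sec (CPU)
f16_ppl, q4_ppl = 13.99, 14.49      # WikiText-2 perplexity

speedup = q4_speed / f16_speed
quality_loss = (q4_ppl - f16_ppl) / f16_ppl * 100

print(f"Q4_K_M speedup: {speedup:.1f}x")            # β†’ 2.7x
print(f"Q4_K_M quality loss: {quality_loss:.1f}%")  # β†’ 3.6% (the table's "~3.5%", within rounding)
```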


πŸ“₯ Which File to Download?

| Filename | Description | Use Case |
|---|---|---|
| llama-3.2-1b-q4_k_m.gguf | πŸ† Recommended. Balanced speed & accuracy. | Chatbots, Android/iOS Apps, RAG |
| llama-3.2-1b-q8_0.gguf | High precision, larger size. | Research, Creative Writing |
| llama-3.2-1b-f16.gguf | Uncompressed weights. | Fine-Tuning, Conversion |
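All three files use the GGUF container. As a sketch of what that means in practice (assuming the standard GGUF little-endian header layout: 4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count), you can verify a downloaded file before loading it:

```python
import struct

def read_gguf_header(path):
    """Read the fixed-size GGUF header.

    Assumes the standard GGUF v2/v3 little-endian layout:
    4-byte magic b"GGUF", uint32 version, uint64 tensor count,
    uint64 metadata key-value count.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

This only checks the header, but it catches the most common failure mode: a truncated or HTML-error download that llama.cpp would reject with a less obvious message.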

πŸ’» Quick Usage

Python (Google Colab / Local)

```python
# pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the recommended Q4_K_M file from this repo (cached locally by huggingface_hub).
model_path = hf_hub_download(
    repo_id="Habibur2/Llama-3.2-1B-Instruct-GGUF",
    filename="llama-3.2-1b-q4_k_m.gguf"
)

llm = Llama(
    model_path=model_path,
    n_ctx=2048,      # context window in tokens
    verbose=False
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello! Explain AI in one sentence."}]
)
print(response["choices"][0]["message"]["content"])
```

Uploaded by Habibur2 | Benchmarked with WikiText-2 & llama-bench

πŸ“Š Detailed Benchmark Results (WikiText-2)

Tests were conducted on llama.cpp (CPU Backend). The results show that quantization has negligible impact on model quality while significantly reducing memory usage.

| Model Version | VRAM/RAM Usage | Perplexity (Lower is Better) | Accuracy Loss |
|---|---|---|---|
| F16 (Original) | 2,357 MB | 13.99 | Baseline (0%) |
| Q8_0 | 1,252 MB | 14.01 | +0.02 (Negligible) |

Analysis: The Q8_0 version retains virtually all of the original model's quality (+0.02 PPL, ~0.1%) while using 47% less memory.


πŸ“Š Benchmark & Performance Data

Tests were conducted using llama.cpp on a standard CPU setup (8 threads).

| Quantization | Size (MB) | Compression | Perplexity (PPL) | Speed (CPU) | Recommended For |
|---|---|---|---|---|---|
| F16 (Original) | 2,300 MB | 0% | Baseline | 15.73 t/s | Research / GPU |
| Q8_0 | 1,220 MB | 47% | Low Loss | 28.43 t/s | High Accuracy Needs |
| Q4_K_M | 762 MB | 68% | Balanced | 42.60 t/s πŸš€ | Edge / Real-time Chat |

Note: Speed may vary depending on your hardware. GPU offloading will significantly increase these numbers.
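The compression column can be re-derived from the file sizes in the table (a quick sanity check; values agree with the table within a percentage point of rounding):

```python
# Re-derive the compression column from the file sizes above.
f16_size = 2300  # MB (original F16 file)
for name, size in {"Q8_0": 1220, "Q4_K_M": 762}.items():
    saving = (1 - size / f16_size) * 100
    print(f"{name}: {saving:.1f}% smaller than F16")
    # Q8_0 β†’ 47.0%; Q4_K_M β†’ 66.9% (the table rounds to 68%)
```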



πŸ’» Quick Usage Guide

1. Install llama.cpp

```shell
# Clone and build with CMake (see the llama.cpp README for platform-specific options).
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```

2. Run in CLI (Chat Mode)

```shell
./build/bin/llama-cli -m llama-3.2-1b-q4_k_m.gguf -cnv -p "You are a helpful assistant."
```

3. Python (using llama-cpp-python)

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3.2-1b-q4_k_m.gguf",
    chat_format="llama-3",  # apply the Llama 3 chat template
    n_gpu_layers=-1         # offload all layers to GPU; set to 0 for CPU-only
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, explain Quantum Physics in simple terms."}]
)
print(response["choices"][0]["message"]["content"])
```
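`create_chat_completion` applies the chat template for you. If you instead call the model with a raw prompt (e.g. `llm(prompt)`), you must format the conversation yourself. A minimal sketch, assuming the standard Llama 3 instruct template with `<|start_header_id|>` / `<|eot_id|>` special tokens (verify against the model's own chat template before relying on it):

```python
def build_llama3_prompt(messages):
    """Render chat messages in the Llama 3 instruct prompt format.

    Assumes the standard Llama 3 special tokens; check the model's
    embedded chat template if output looks wrong.
    """
    prompt = "<|begin_of_text|>"
    for m in messages:
        prompt += (f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
                   f"{m['content']}<|eot_id|>")
    # Open an assistant turn so the model generates the reply.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

prompt = build_llama3_prompt([{"role": "user", "content": "Hello!"}])
```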