# MiniMax-M2.7 — 100 GB (MLX)
Mixed-precision MLX build of MiniMaxAI/MiniMax-M2.7, prepared by baa.ai.
## Metrics
| Metric | Value |
|---|---|
| Size on disk | 100.1 GB (20 shards) |
| Group size | 64 |
| Framework | MLX (Apple Silicon) |
## Benchmarks
| Benchmark | Score | Notes |
|---|---|---|
| HumanEval pass@1 (single-shot) | 87.2% (143/164) | 164/164 completed, 0 skipped |
| HumanEval pass@1 (best-of-2) | 94.5% (155/164) | Retry of the 21 single-shot failures recovered 12 |
| Decode throughput (Apple Silicon) | 36.4 tok/s (wall-gen) / 36.8 tok/s (task-mean) | 296,683 tokens generated over 136.1 min |
Both runs used the recommended inference settings below.
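The best-of-2 figure in the table follows directly from the single-shot counts: 143 first-try passes plus the 12 recovered on retry, out of 164 tasks. A quick arithmetic check:

```python
TOTAL = 164                 # HumanEval tasks
SINGLE_SHOT_PASSES = 143    # pass@1, single shot
RECOVERED_ON_RETRY = 12     # of the 21 single-shot failures

single_shot_pct = round(SINGLE_SHOT_PASSES / TOTAL * 100, 1)
best_of_2_pct = round((SINGLE_SHOT_PASSES + RECOVERED_ON_RETRY) / TOTAL * 100, 1)

print(single_shot_pct)  # 87.2
print(best_of_2_pct)    # 94.5
```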
## Recommended inference settings

```python
sampler_params = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "max_tokens": 8192,
}
```
## Chat template — thinking mode

MiniMax-M2.7 uses a `<think>…</think>` reasoning block. Important: the base chat template appends `<think>\n` to the end of the prompt before generation, so the model's output begins inside the reasoning block with no opening tag. Strip everything up to and including the first `</think>`:
```python
def strip_thinking(text: str) -> str:
    """Remove the reasoning block, keeping only the visible answer."""
    if "</think>" in text:
        return text.split("</think>", 1)[1].strip()
    return text.strip()
```
Give the model enough token budget to finish reasoning and emit the closing `</think>` tag — we recommend at least 4096 `max_tokens`, and 8192 for harder problems.
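If the budget runs out mid-reasoning, the closing tag never appears and naive stripping returns the raw chain of thought. A defensive variant (an illustrative sketch, not part of the released tooling) can signal that case so callers know to retry with a larger budget:

```python
def strip_thinking_safe(text: str) -> tuple[str, bool]:
    """Return (visible_text, finished).

    finished is False when the </think> tag is missing, i.e. generation
    stopped before the model exited its reasoning block.
    """
    marker = "</think>"
    if marker in text:
        return text.split(marker, 1)[1].strip(), True
    # No closing tag: everything produced so far is unfinished reasoning.
    return "", False
```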
## Usage

```python
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("baa-ai/MiniMax-M2.7-RAM-100GB-MLX")

sampler = make_sampler(temp=1.0, top_p=0.95, top_k=40)
logits_processors = make_logits_processors(repetition_penalty=1.1)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a Python function that reverses a string."}],
    tokenize=False,
    add_generation_prompt=True,
)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=8192,
    sampler=sampler,
    logits_processors=logits_processors,
)

# Strip the reasoning block (see "Chat template" above).
if "</think>" in response:
    response = response.split("</think>", 1)[1].strip()
print(response)
```
## Hardware
- Apple Silicon Mac with ~112 GB unified memory recommended for comfortable inference.
- Runs on less with swap, at substantially reduced throughput.
## Variants
| Variant | Size | Link |
|---|---|---|
| 100 GB | 100.1 GB | baa-ai/MiniMax-M2.7-RAM-100GB-MLX |
| 111 GB | 110.9 GB | baa-ai/MiniMax-M2.7-RAM-111GB-MLX |
| 116 GB | 116.0 GB | baa-ai/MiniMax-M2.7-RAM-116GB-MLX |
| 120 GB | 120.1 GB | baa-ai/MiniMax-M2.7-RAM-120GB-MLX |
## License
Inherited from the upstream MiniMax-M2.7 license: non-commercial use permitted; commercial use requires written authorization from MiniMax.
Quantized by baa.ai