AI & ML interests

Connecting individuals with innovation: Emancipating and Truly Federalizing Private Intelligence

Recent Activity

fuzzy-mittenz updated a collection about 1 month ago: SwarmModels
fuzzy-mittenz updated a collection 2 months ago: SotA-GGUF

takarajordan posted an update about 18 hours ago
csabakecskemeti posted an update 1 day ago
csabakecskemeti posted an update 2 days ago
Looking for some help to test an INT8 DeepSeek 3.2:
SGLang supports channel-wise INT8 quants on CPUs with AMX instructions (Xeon 5 and above, AFAIK):
https://lmsys.org/blog/2025-07-14-intel-xeon-optimization/

Currently uploading an INT8 version of DeepSeek 3.2 Speciale:
DevQuasar/deepseek-ai.DeepSeek-V3.2-Speciale-Channel-INT8

I cannot test this myself since I'm on AMD:
"AssertionError: W8A8Int8LinearMethod on CPU requires that CPU has AMX support"
(I assumed it could fall back to some non-optimized kernel, but apparently not.)

If anyone with the required resources (Intel Xeon 5/6 + ~768GB-1TB RAM) can help test this, that would be awesome.

If you have hints on how to make this work on an AMD Threadripper 7000 Pro series, please guide me.

Thanks all!
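
For anyone picking this up, a minimal smoke-test sketch, assuming the model is already being served through SGLang's OpenAI-compatible endpoint (the port and the prompt below are assumptions):

```python
# Minimal smoke test, assuming SGLang is serving the INT8 model on an
# OpenAI-compatible endpoint; port 30000 and the prompt are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="DevQuasar/deepseek-ai.DeepSeek-V3.2-Speciale-Channel-INT8",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```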
upgraedd posted an update 3 days ago
I know it doesn't seem likely, but I am literally starving on the streets. If anyone can help me, please just inspect my repository and you'll see what I'm doing. What I'm doing is for YOU.
bc1qga8njlqm9x76327jsap4wqq0e5kfczg8pc7p39
#CONSCIOUSNESS
takarajordan posted an update 11 days ago
Two weeks ago I had an engaging discussion with locals in Cockermouth about AI and the broader industry, a reminder that hearing candid perspectives beyond our professional circles is invaluable and something anyone working full-time in this field should make time for.

Thank you!
mitkox posted an update 13 days ago
I run 20 AI coding agents locally on my desktop workstation at 400+ tokens/sec with MiniMax-M2. It's a Sonnet drop-in replacement in my Cursor, Claude Code, Droid, Kilo, and Cline setups; it peaks at 11k tok/sec input and 433 tok/sec output, and can generate 1B+ tok/m, all with a 196k context window. I've been running it with this config for 6 days now.

Today's max performance was stable at 490.2 tokens/sec across 48 concurrent clients with MiniMax M2.

Z8 Fury G5, Xeon 3455, 4xA6K, AIBrix 0.5.0, vLLM 0.11.2.
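
For anyone trying to reproduce numbers like these, a rough concurrent-client throughput sketch, assuming a local OpenAI-compatible endpoint (the port, model id, and prompt are all assumptions):

```python
# Crude concurrency benchmark, assuming vLLM/AIBrix exposes an
# OpenAI-compatible endpoint; port, model id, and prompt are assumptions.
import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request():
    resp = await client.chat.completions.create(
        model="MiniMaxAI/MiniMax-M2",
        messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def main(n=48):
    # Fire n requests at once and measure aggregate output throughput.
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(n)))
    elapsed = time.perf_counter() - start
    print(f"{sum(tokens) / elapsed:.1f} output tokens/sec across {n} clients")

asyncio.run(main())
```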
csabakecskemeti posted an update 25 days ago
Recently there has been so much activity around token-efficient formats that I've built a package of my own (inspired by TOON):

Deep-TOON

My goal was to handle JSON structures with complex embeddings in a token-efficient way.

This is what I built over the weekend. Feel free to try it:

https://pypi.org/project/deep-toon/0.1.0/
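
To illustrate the general idea (a hypothetical sketch of TOON-style encoding, not deep-toon's actual API): uniform lists of objects collapse into one header line plus data rows, saving the tokens otherwise spent on repeated keys and braces.

```python
# Hypothetical TOON-style encoder, for illustration only; deep-toon's
# real API may look completely different.
import json

def toonify(records, name="items"):
    # One header line declares the count and the shared keys...
    keys = list(records[0].keys())
    header = f"{name}[{len(records)}]{{{','.join(keys)}}}:"
    # ...then each record becomes a bare comma-separated row.
    rows = ["  " + ",".join(str(rec[k]) for k in keys) for rec in records]
    return "\n".join([header, *rows])

data = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]
print(toonify(data))     # compact tabular form
print(json.dumps(data))  # baseline JSON for comparison
```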

takarajordan posted an update 26 days ago
🌞 LOVABLE IS CRACKED

Built a golden hour tracker in under 15 minutes with Lovable: it uses your phone's Geolocation API and the SunCalc library, and runs fully client-side with no servers. https://goldenhour.404missing.link
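
The app itself does this in the browser with SunCalc, but the same calculation is easy to sketch in Python with the astral library; treating golden hour as the hour before sunset is an approximation, and the coordinates below are placeholders:

```python
# Rough golden-hour estimate with astral; the real app uses SunCalc
# client-side. Coordinates and the hour-before-sunset rule are assumptions.
import datetime
from zoneinfo import ZoneInfo
from astral import LocationInfo
from astral.sun import sun

loc = LocationInfo("Sydney", "Australia", "Australia/Sydney", -33.87, 151.21)
times = sun(loc.observer, date=datetime.date.today(),
            tzinfo=ZoneInfo(loc.timezone))
sunset = times["sunset"]
start = sunset - datetime.timedelta(hours=1)
print(f"Golden hour: {start:%H:%M} - {sunset:%H:%M}")
```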
mitkox posted an update 27 days ago
I just threw Qwen3-0.6B in BF16 into an on-device AI drag race on AMD Strix Halo with vLLM:

564 tokens/sec on short 100-token sprints
96 tokens/sec on 8K-token marathons

TL;DR: You don't just run AI on AMD. You negotiate with it.

The hardware absolutely delivers. Spoiler alert: there is exactly ONE configuration where vLLM + ROCm + Triton + PyTorch + drivers + the Ubuntu kernel all work at the same time. Finding it required the patience of a saint.

Consumer AMD for AI inference is the ultimate "budget warrior" play: insane performance-per-euro, but you need hardcore technical skills that would make a senior sysadmin nod in quiet respect.
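
A minimal version of that drag race using vLLM's offline Python API, assuming a working install (the prompt and batch size are arbitrary):

```python
# Crude throughput probe with vLLM's offline API; assumes vLLM is
# installed and working (ROCm here, but the code is backend-agnostic).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", dtype="bfloat16")
params = SamplingParams(max_tokens=100)  # the "100-token sprint"
prompts = ["Explain KV caching in one paragraph."] * 8

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} output tokens/sec")
```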
mitkox posted an update about 1 month ago
I have just vibe coded a feature for ODA on-device AI with MiniMax M2, running locally on my Z8 Fury, and holy silicon, this thing SLAPS!

TL;DR, the nerd stuff:

Specialized in coding and agentic work
60 tokens/sec
Ryzen AI is getting some serious ROCm 7.0.2 brain implants
One extra script to rule them all and bind them to my GPU
A vibe-coded feature implementation that actually worked on the first try. I know, I'm scared too.
mitkox posted an update about 1 month ago
I was just reading that the Ryzen AI 395 is supposed to be 30% slower than DGX Spark at LLM inference… and have only 96GB of GPU RAM… good thing I hadn't RTFM'd upfront, so I made the AMD faster, with 128GB of unified RAM 🫡
The Z2 Mini G1a can run Qwen3 Coder 30B in BF16 at 26.8 tok/sec in ~60GB of GPU RAM.
mitkox posted an update about 1 month ago
Say hello to my little friends! I just unboxed this trio of HP Z2 G1a!

Three is always better than one!
3x AMD Ryzen AI Max+ Pro 395
384GB RAM
24TB of RAID storage
Ubuntu 24.04
ROCm 7.0.2
llama.cpp, vLLM, and AIBrix

Small, cheap GPUs are about to become the Raspberry Pi of edge AI inference. Sprinkle some kubectl fairy dust on top, and suddenly it's a high-availability, self-healing, cloud-native, enterprise-grade AI cluster camping in a closet.

Make sure you own your AI. AI in the cloud is not aligned with you; it’s aligned with the company that owns it.
mitkox posted an update about 2 months ago
I see all the Chinese labs are turning TL;DR into TL;DRGB.

Problem: 1M text tokens == 1M opportunities for your GPU to file workers' comp
Solution: don't feed the model War & Peace; feed it the movie poster.

This is Glyph, Z.ai's new visual-text compression voodoo:
• 10k words → 3 PNGs ≈ 3k visual tokens
• Compression ratio: 4.3×
• Throughput: 40-60 tok/s, i.e. your context window now finishes before my coffee does

So I did the only reasonable thing: asked GLM-4.6 to port Glyph for Qwen3-VL-8B-Thinking.
Translation: I made one model compress a novel into a comic strip, then made another model read the comic strip and still ace QA.
It’s basically passing notes in class, except the note is a 1920×1080 meme and the teacher is a transformer.

We've gone from "Attention is All You Need" to "Attention is Too Expensive, Just Use Your Eyes." Remember kids: in 2025 literacy is optional, but JPEG is forever.
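
A back-of-the-napkin sketch of the render-then-read idea, not Glyph's actual pipeline: rasterize a chunk of text into a PNG and hand it to a vision model over an OpenAI-compatible endpoint (the font, port, and model id are all assumptions):

```python
# Hypothetical render-text-to-pixels sketch in the spirit of Glyph;
# not the actual Glyph code. Endpoint, port, and model id are assumed.
import base64, io, textwrap
from PIL import Image, ImageDraw, ImageFont
from openai import OpenAI

def render_page(text, size=(1920, 1080)):
    # Rasterize the text onto a white page and return it base64-encoded.
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    wrapped = "\n".join(textwrap.wrap(text, width=160))
    draw.multiline_text((16, 16), wrapped, fill="black",
                        font=ImageFont.load_default())
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
page = render_page("...a long chunk of the novel goes here...")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Thinking",  # assumed model id
    messages=[{"role": "user", "content": [
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{page}"}},
        {"type": "text", "text": "Summarize what this page says."},
    ]}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```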
csabakecskemeti posted an update about 2 months ago
Christmas came early this year
mitkox posted an update about 2 months ago
Friday evening. KAT-Dev-72B-Exp is spinning in AIBrix on K8s. The GPUs in the Z8 are fired up. It's a LAN party for one. After 6 months on a diet of MoEs, I'd forgotten the main-course feeling of a dense 72B model.