When models get too large for a single GPU, simply stacking layers vertically (Pipeline Parallelism) isn't always the answer. Sometimes, you need to slice the matrices themselves.
My latest guide breaks down the hardware mechanics of Tensor Parallelism (TP). We look at how to shard individual operations across devices to make a cluster function as one massive accelerator.
This isn't high-level theory; it is a look at the bare-metal implementation.
Here is what is covered in the deep dive:
The Strategies: Column vs. Row Parallelism
We analyze how to split the weight matrices (W) and the inputs (X):
- Column-Linear: splits the weights by columns. Requires an All-Gather to reconstruct the full output.
- Row-Linear: splits the weights by rows. Requires an All-Reduce to sum the partial results.

The "Megatron-LM" Optimization
Efficiency comes from minimizing communication. By sandwiching the non-linearity (GeLU) between a Column-Parallel layer and a Row-Parallel layer, we can skip synchronization entirely during the activation phase, cutting communication events by 50% per block (see the sketch below).
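To make the sharding concrete, here is a minimal sketch of the column-parallel/row-parallel pair using raw torch.distributed primitives. It is not the article's implementation: the class names, dimensions, and the assumption that the process group is already initialized are all illustrative, and bias terms are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Splits W along the output (column) dimension.
    Each rank computes a slice of the output; no communication is needed
    here if the next layer is row-parallel (the Megatron-LM trick)."""
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        assert out_features % world_size == 0
        self.weight = nn.Parameter(torch.empty(out_features // world_size, in_features))
        nn.init.kaiming_uniform_(self.weight)

    def forward(self, x):
        # x: [batch, in_features], replicated on every rank
        return x @ self.weight.t()           # [batch, out_features / world_size]

class RowParallelLinear(nn.Module):
    """Splits W along the input (row) dimension.
    Each rank holds partial sums, so one all-reduce is required."""
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        assert in_features % world_size == 0
        self.weight = nn.Parameter(torch.empty(out_features, in_features // world_size))
        nn.init.kaiming_uniform_(self.weight)

    def forward(self, x_shard):
        # x_shard: [batch, in_features / world_size] — the column-parallel output
        partial = x_shard @ self.weight.t()   # partial sums of the full matmul
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)  # single sync per MLP block
        return partial

def mlp_block(x, col: ColumnParallelLinear, row: RowParallelLinear):
    # GeLU is applied to the local shard only — no all-gather is needed,
    # because the row-parallel layer consumes sharded inputs directly.
    return row(F.gelu(col(x)))
```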
The Hardware Reality: The Bandwidth Wall
In TP, the dist.all_reduce operation sits on the critical path: the CUDA cores effectively stall while waiting for the ring-reduce to finish.
- Intra-Node: works well, because NVLink provides enough bandwidth to hide this latency.
- Inter-Node: fails at scale. Standard networking (Ethernet/InfiniBand) is too slow for the high-frequency syncs TP requires (see the rough estimate below).
The article includes a raw PyTorch implementation using torch.distributed primitives to show exactly where the data moves and where the bottlenecks sit.
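As a back-of-the-envelope illustration of that wall, you can estimate the per-sync cost of a ring all-reduce for a given message size and link speed. The bandwidth figures and activation size below are assumptions for illustration, not measured specs from the article:

```python
# Rough ring all-reduce estimate: each rank moves ~2*(N-1)/N of the message
# over its link, so time ~ 2*(N-1)/N * message_bytes / link_bandwidth.
# Bandwidth figures below are illustrative assumptions, not measured specs.
def allreduce_ms(message_mb, bw_gb_per_s, world_size=8):
    bytes_moved = 2 * (world_size - 1) / world_size * message_mb * 1e6
    return bytes_moved / (bw_gb_per_s * 1e9) * 1e3

activation_mb = 64  # assumed activation size: batch * seq * hidden * 2 bytes

for link, bw in [("NVLink (intra-node, ~450 GB/s)", 450),
                 ("InfiniBand (inter-node, ~25 GB/s per GPU)", 25),
                 ("100 GbE (inter-node, ~12 GB/s)", 12)]:
    print(f"{link}: {allreduce_ms(activation_mb, bw):.2f} ms per all-reduce")
```

TP issues these syncs several times per layer per step, so a jump from fractions of a millisecond to several milliseconds per sync compounds quickly.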
Running large language models efficiently takes more than raw GPU power. The latest guide breaks down the essential math to determine whether your LLM workload is compute-bound or memory-bound.
We apply these principles to a real-world example: Qwen's 32B parameter model on the new NVIDIA RTX PRO 6000 Blackwell Edition.
In this guide, you will learn how to:
- Calculate your GPU's operational intensity (ops:byte ratio)
- Determine your model's arithmetic intensity
- Identify whether your workload is memory-bound or compute-bound (a worked sketch follows below)
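Here is a minimal sketch of that roofline-style comparison. The peak throughput and memory bandwidth are placeholder assumptions, not the guide's figures for the RTX PRO 6000; only the 32B parameter count comes from the post.

```python
# Roofline check: if the model's arithmetic intensity (FLOPs per byte moved
# from memory) is below the GPU's ops:byte ratio, decoding is memory-bound;
# above it, compute-bound.
# The hardware figures below are illustrative assumptions, not official specs.
peak_flops = 500e12      # assumed dense FP16/BF16 throughput, FLOP/s
mem_bw     = 1.5e12      # assumed memory bandwidth, bytes/s
gpu_ops_byte = peak_flops / mem_bw
print(f"GPU ops:byte ratio ~ {gpu_ops_byte:.0f} FLOPs per byte")

# Decoding one token touches every parameter (~2 FLOPs per parameter) and
# must stream all weights from memory (2 bytes each in FP16/BF16).
params     = 32e9        # Qwen 32B
batch_size = 1           # sequences decoded in parallel
flops_per_step = 2 * params * batch_size
bytes_per_step = 2 * params   # weights are read once regardless of batch size
model_intensity = flops_per_step / bytes_per_step
print(f"Model arithmetic intensity ~ {model_intensity:.0f} FLOPs per byte")

print("memory-bound" if model_intensity < gpu_ops_byte else "compute-bound")
```

At batch size 1 the model's intensity sits far below the GPU's ops:byte ratio, which is why single-stream decoding is typically memory-bound; larger batches push the workload toward the compute-bound side.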
Precision matters in AI because it shapes both model accuracy and efficiency. It controls how finely numbers are represented, approximating real-world values with formats such as fixed-point and floating-point. A recent BF16 → FP16 study has renewed attention on the impact of precision. Here are the main precision types used in AI, from full precision for training to ultra-low precision for inference:
1. FP32 (Float32): Standard full-precision float used in most training: 1 sign bit, 8 exponent bits, 23 mantissa bits. The default for backward-compatible training and baseline numerical stability.
2. FP16 (Float16) → https://arxiv.org/abs/2305.10947v6: Half-precision float that balances accuracy and efficiency: 1 sign bit, 5 exponent bits, 10 mantissa bits. Common on NVIDIA Tensor Cores and in mixed-precision setups. There’s now a new wave of interest in using it for reinforcement learning: https://www.turingpost.com/p/fp16
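If you want to inspect these formats yourself, here is a small sketch using torch.finfo (independent of the linked study); BF16 is included because it is the other half of the BF16 → FP16 comparison mentioned above.

```python
import torch

# Compare the numeric range and resolution of the formats discussed above.
# Bit layout: 1 sign + exponent + mantissa (FP32: 8+23, FP16: 5+10, BF16: 8+7).
for name, dtype in [("FP32", torch.float32),
                    ("FP16", torch.float16),
                    ("BF16", torch.bfloat16)]:
    info = torch.finfo(dtype)
    print(f"{name}: max={info.max:.3e}  smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")
```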
While LLMs are getting a lot of attention, I believe in the power of narrow AI/ML to solve everyday problems.
That's why I've created "Obesity Risk Predictor", a tool designed to be a preventive measure, helping to identify health risks based on lifestyle habits.
It’s a clear example of AI/ML built for a specific and impactful task.
The Gradio app lets you compare the performance of three different models (Random Forest, LightGBM, and XGBoost) on the same dataset.
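For readers curious how such a comparison app can be wired up, here is a minimal sketch. It is not the actual app's code: the synthetic dataset, feature names, and default hyperparameters are placeholders, whereas the real tool is trained on lifestyle-habit features.

```python
import gradio as gr
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Stand-in data: the real app uses lifestyle features (diet, activity, etc.).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Fit one instance of each model family so the user can switch between them.
models = {
    "Random Forest": RandomForestClassifier(random_state=0).fit(X, y),
    "LightGBM": LGBMClassifier(random_state=0).fit(X, y),
    "XGBoost": XGBClassifier(random_state=0).fit(X, y),
}

def predict(model_name, f1, f2, f3, f4):
    # Return the positive-class probability from the selected model.
    proba = models[model_name].predict_proba([[f1, f2, f3, f4]])[0, 1]
    return f"Predicted obesity risk: {proba:.1%}"

demo = gr.Interface(
    fn=predict,
    inputs=[gr.Dropdown(list(models.keys()), label="Model")]
           + [gr.Number(label=f"Feature {i + 1}") for i in range(4)],
    outputs="text",
)

if __name__ == "__main__":
    demo.launch()
```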