# Twill + GauS: Optimal GPU Kernel Scheduling

Implementation of two complementary papers for GPU kernel scheduling optimization:

- **Twill** – "Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs" (Soi et al., NVIDIA Research, 2024). An exact ILP+SMT solver that finds provably optimal schedules.
- **GauS** – "Differentiable Scheduling Optimization via Gaussian Reparameterization" (Cai et al., 2026). A scalable gradient-based solver using Gaussian distributions and the Augmented Lagrangian Method.
## When to Use Which

| Property | Twill (ILP+SMT) | GauS (Differentiable) |
|---|---|---|
| Optimality | Provably optimal | Approximate (Pareto-optimal) |
| Speed on small graphs | Fast (< 1 s for 3-7 nodes) | Slower (~10 s for 3-7 nodes) |
| Scalability | Exponential in graph size | Scales to large graphs (2\|V\| parameters) |
| Warp specialization | Full joint SWP+WS | Schedule only (no WS) |
| Best for | Kernels with < 50 ops | Large graphs with 100+ ops |
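The rule of thumb in the table can be encoded as a thin dispatch helper. `choose_solver` and its thresholds are illustrative, not part of the package API:

```python
def choose_solver(num_ops: int, need_warp_specialization: bool) -> str:
    """Pick a backend using the rule of thumb from the comparison table.

    Hypothetical helper, not part of the twill package.
    """
    # Twill's ILP+SMT search is exponential in graph size, but it is the
    # only backend that jointly solves SWP + warp specialization.
    if need_warp_specialization or num_ops < 50:
        return "twill"
    # GauS trades provable optimality for gradient-based scalability.
    return "gaus"
```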
## Architecture

```
┌──────────────────────────────────────────────────────────┐
│                     DependenceGraph                      │
│          G = (V, E), Machine Description, RRTs           │
├─────────────────────┬────────────────────────────────────┤
│ Twill Solver        │ GauS Solver                        │
│                     │                                    │
│ Phase 1: ILP        │ Gaussian Reparameterization        │
│ (CBC/PuLP)          │   X_i ~ N(μ_i, σ_i²)               │
│ → Modulo Schedule   │   P_i^d = Φ((d+0.5-μ)/σ)           │
│                     │         - Φ((d-0.5-μ)/σ)           │
│ Phase 2: SMT        │                                    │
│ (Z3/QFLIA)          │ Augmented Lagrangian Method        │
│ → Joint SWP + WS    │ → Adam optimizer on (μ, σ)         │
│                     │ → Dependency + Resource +          │
│ Cost Norm (§5.2)    │   Modulo + Recurrence losses       │
│ → Ratio-preserving  │                                    │
│   cycle count       │ Legalization Heuristics            │
│   reduction         │ → Topological pass (regular)       │
│                     │ → Fixed-point iteration (modulo)   │
├─────────────────────┴────────────────────────────────────┤
│                     Code Generation                      │
│     Pseudocode · CUDA Skeleton · Pipelined Schedule      │
└──────────────────────────────────────────────────────────┘
```
## Quick Start

```bash
pip install pulp z3-solver numpy matplotlib torch
```

### Twill (exact solver)

```python
from twill.kernels import flash_attention_forward_simplified
from twill.twill_solver import twill_solve

graph = flash_attention_forward_simplified()
result = twill_solve(graph, max_I=5, verbose=True)
# → I=2, schedule: S@0, P@2, O@3 (proves FA3 optimal)
```
### GauS (differentiable solver)

```python
from twill.kernels import flash_attention_forward_hopper
from twill.gaus_solver import gaus_solve_twill_graph

graph = flash_attention_forward_hopper()
result = gaus_solve_twill_graph(graph, target_II=4, D=20, verbose=True)
# → II=4, feasible schedule found via gradient descent
```
### GauS on custom large graphs

```python
from twill.gaus_solver import GauSSolver, generate_random_dag

graph = generate_random_dag(num_nodes=1000, edge_density=0.01, num_back_edges=50)
solver = GauSSolver(graph, D=500, lr=0.01)
result = solver.solve_modulo(II=10, R_cap=50.0, max_iters=2000)
```
## Pre-built Kernel Descriptions

| Kernel | Architecture | Twill Result | GauS Result |
|---|---|---|---|
| `flash_attention_forward_simplified()` | Hopper | I=2 ✓ (optimal) | II=2 ✓ (feasible) |
| `flash_attention_forward_hopper()` | Hopper (H100) | I=4, FA3 pipeline ✓ | II=4 ✓ (feasible) |
| `flash_attention_forward_blackwell()` | Blackwell (B200) | I=2, FA4 strategy ✓ | II=2 ✓ (feasible) |
| `simple_gemm_pipeline()` | Hopper | I=2, TMA on producer ✓ | II=2 ✓ (feasible) |
## Module Structure

```
twill/
├── __init__.py            # Package exports (v0.2.0)
├── graph.py               # DependenceGraph, Instruction, RRT, MachineDescription
├── cost_normalization.py  # §5.2: ILP-based cycle count normalization
├── modulo_scheduler.py    # Twill Phase 1: ILP modulo scheduling (CBC)
├── smt_joint.py           # Twill Phase 2: SMT joint SWP+WS (Z3)
├── twill_solver.py        # Twill Algorithm 1: Main search procedure
├── gaus_solver.py         # GauS: Differentiable scheduling (PyTorch)
├── codegen.py             # Code generation (pseudocode, CUDA skeleton)
├── visualization.py       # Schedule visualization (text + matplotlib)
└── kernels.py             # Pre-built kernel descriptions (FMHA, GEMM)
```
## GauS: Technical Details

### Gaussian Reparameterization (§3.1)

Each operator v_i is modeled as X_i ~ N(μ_i, σ_i²) with only 2|V| parameters (vs. D·|V| for categorical approaches). The probability of scheduling at step d is:

P_i^d = Φ((d + 0.5 - μ_i)/σ_i) - Φ((d - 0.5 - μ_i)/σ_i)
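As a sketch of this discretization (helper name and tensor shapes are illustrative, not the package's actual API), the per-step probabilities can be computed with PyTorch's standard-normal CDF:

```python
import torch

def step_probs(mu: torch.Tensor, sigma: torch.Tensor, D: int) -> torch.Tensor:
    """Discretize X_i ~ N(mu_i, sigma_i^2) into per-step probabilities:
    P_i^d = Phi((d + 0.5 - mu_i)/sigma_i) - Phi((d - 0.5 - mu_i)/sigma_i).

    Illustrative sketch, not the package's exact implementation.
    """
    d = torch.arange(D, dtype=mu.dtype)          # schedule steps 0..D-1
    normal = torch.distributions.Normal(0.0, 1.0)
    upper = normal.cdf((d[None, :] + 0.5 - mu[:, None]) / sigma[:, None])
    lower = normal.cdf((d[None, :] - 0.5 - mu[:, None]) / sigma[:, None])
    return upper - lower                         # shape (|V|, D)

mu = torch.tensor([2.0, 7.5])      # continuous schedule positions
sigma = torch.tensor([1.0, 0.5])   # uncertainty per operator
P = step_probs(mu, sigma, D=20)
```

Because `P` is a smooth function of (μ, σ), gradients flow through it, which is what makes the loss terms below differentiable.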
### Differentiable Loss Functions

- **Dependency**: expected topological violations via CDF products
- **Resource**: LogSumExp smooth-max of per-step usage + ReLU violations
- **Modulo Resource**: wrapped-Gaussian probabilities in an II-slot reservation table
- **Recurrence**: expected back-edge constraint violations
- **Memory**: CDF-based active-lifetime estimation
- **Communication**: expected edge lengths (compactness)
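To give the flavor of these terms, here is a minimal sketch of a LogSumExp-smoothed resource loss; the function name, signature, and temperature are assumptions, not the package's actual loss:

```python
import torch

def resource_loss(P: torch.Tensor, cost: torch.Tensor, R_cap: float,
                  tau: float = 0.1) -> torch.Tensor:
    """Smooth penalty on peak expected resource usage.

    P:    (|V|, D) per-step scheduling probabilities.
    cost: (|V|,)   resource cost of each operator.
    Illustrative sketch, not the package's exact loss.
    """
    usage = (cost[:, None] * P).sum(dim=0)            # expected usage per step
    peak = tau * torch.logsumexp(usage / tau, dim=0)  # smooth max over steps
    return torch.relu(peak - R_cap)                   # penalize cap violations

P = torch.full((4, 10), 0.1)   # 4 ops spread uniformly over 10 steps
cost = torch.ones(4)
loss = resource_loss(P, cost, R_cap=1.0)
```

The LogSumExp keeps the "max over steps" differentiable, while the ReLU makes the loss exactly zero once the expected peak fits under the cap.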
### Augmented Lagrangian Method (ALM)

L_total = L_primary + Σ_i (λ_i · V_i + ρ/2 · ‖V_i‖²)
λ_i ← λ_i + ρ · V_i   (dual update)
Hyperparameters (from the paper): lr=0.01, ρ=1e-4, σ=1e-2, κ=1/6
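The ALM penalty and dual update can be illustrated on a toy constrained problem. This numpy-free sketch (step sizes and iteration counts are illustrative, not the paper's) minimizes (x - 2)² subject to x ≤ 1, with violation V(x) = max(x - 1, 0):

```python
def alm_solve(rho: float = 1.0, outer: int = 20,
              inner: int = 200, lr: float = 0.05) -> float:
    """Toy ALM loop mirroring L_total and the dual update above.

    Illustrative sketch; constants are not the paper's hyperparameters.
    """
    x, lam = 0.0, 0.0
    for _ in range(outer):
        for _ in range(inner):
            v = max(x - 1.0, 0.0)                 # constraint violation V(x)
            # dL/dx = 2(x-2) + (lam + rho*V) * dV/dx
            grad = 2.0 * (x - 2.0) + (lam + rho * v) * (1.0 if x > 1.0 else 0.0)
            x -= lr * grad                        # primal gradient step
        lam += rho * max(x - 1.0, 0.0)            # dual update: λ ← λ + ρ·V
    return x

x = alm_solve()
```

As λ grows across outer iterations, the unconstrained minimizer of L_total is pushed from x = 5/3 toward the constrained optimum x = 1; GauS applies the same schedule to its scheduling losses with Adam instead of plain gradient descent.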
## Tests

```bash
python test_twill.py   # Twill: 6/6 pass (~5 s)
python test_gaus.py    # GauS: 7/7 pass (~30 s)
```
## Solvers Used

| Component | Solver | Theory | Paper |
|---|---|---|---|
| Twill Cost Normalization | CBC (PuLP) | ILP | Soi et al. §5.2 |
| Twill Modulo Scheduling | CBC (PuLP) | ILP | Soi et al. §3.1 |
| Twill Joint SWP+WS | Z3 | QFLIA (SMT) | Soi et al. §4 |
| GauS Scheduling | PyTorch (Adam) | Differentiable | Cai et al. §3 |
## Citations

```bibtex
@article{soi2024twill,
  title={Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs},
  author={Soi, Rupanshu and others},
  journal={arXiv preprint arXiv:2512.18134},
  year={2024}
}

@article{cai2026gaus,
  title={GauS: Differentiable Scheduling Optimization via Gaussian Reparameterization},
  author={Cai, Yaohui and others},
  journal={arXiv preprint arXiv:2602.20427},
  year={2026}
}
```
## Related Work

- FlashAttention-3 – Hopper FMHA schedule that Twill rediscovers
- FlashAttention-4 – Blackwell FMHA schedule that Twill rediscovers
- ThunderKittens – warp-level kernel framework
- CUTLASS 3.x – NVIDIA GEMM templates with WS
- Tawa – automatic WS compiler (downstream of Twill)
- Cypress – task-based GPU programming model
- Nautilus – auto-scheduling tensor compiler (fills Twill's tile-size gap)
- MPK – cross-kernel software pipelining
- GS-Schedule – GauS predecessor (categorical approach)