PIE0.1 — 0.2B Dense LLM from Scratch
[GitHub]
- Designed and trained a 0.2B-parameter Transformer: custom Rust BPE tokenizer (PyO3 bindings), GQA, SwiGLU, RoPE, FlashAttention, bf16 mixed precision.
- Trained on ~7B tokens across 4 × RTX 4090 via DDP. Data mix: SkyPile 45%, FineWeb 35%, StarCoder 14%, NuminaMath 6%.
- Disabled MoE after tracing a CPU-sync bottleneck in expert routing at this scale; documented the decision process across 23 public video episodes.
- Built custom Triton GPU kernels; studied the KV cache quantization literature (KIVI, QJL, PolarQuant, TurboQuant) for inference optimization.