Tianyu Zhou — Resume

Projects

PIE0.1 — 0.2B Dense LLM from Scratch [GitHub] open-source · Apache 2.0

Designed and trained a 0.2B Transformer: custom Rust BPE tokenizer (PyO3), GQA, SwiGLU, RoPE, Flash Attention, bf16 mixed precision.
Trained on ~7B tokens across 4 × RTX 4090 via DDP. Data mix: SkyPile 45%, FineWeb 35%, StarCoder 14%, NuminaMath 6%.
Disabled MoE after diagnosing a CPU-sync routing bottleneck at this scale; documented the decision process across 23 public video episodes.
Built custom Triton GPU kernels; studied KV cache quantization literature (KIVI, QJL, PolarQuant, TurboQuant) for inference optimization.

Rust · PyTorch DDP · Flash Attention · Triton · GQA · RoPE

passionie.uk — Interactive AI Textbook ↗

Full-stack platform: Next.js frontend, FastAPI backend, Supabase vector DB, RAG pipeline with local llama.cpp inference (Qwen3.5 27B, Q4_K_M).
Bilingual (EN/ZH) interactive textbook covering LLM theory; accessible to international readers without authentication.
Prototype toward a long-term goal: a teaching agent that models each learner's knowledge state and adapts content presentation dynamically, addressing the bottleneck that written material is updated far more slowly than new knowledge is produced.

Next.js · FastAPI · RAG · llama.cpp · Docker

Research

General Vision-Language Infrastructure & Interpretability 2025.09 — present

SJTU Biomedical Engineering · Advisor: Suncheng Xiang

Built Med-Vision-Agent: 5-module pipeline (data collection → annotation → YOLO → inference → Qwen3-VL-8B report generation) for automated polyp reporting; being converted into a patent application.
Developed a domain-agnostic pipeline where medical polyp datasets serve as a high-precision validation benchmark; framework is designed for seamless adaptation to any vision-to-text task.
Contributed to manuscript writing and experimental reproduction; preprint available at arXiv:2512.10750.
Designed a controlled mechanistic interpretability (MI) experiment on a self-trained 30M dense model to isolate when memorization transitions to compositional generalization, a question that is hard to study at large scale without sacrificing interpretability.

Decoupled Architecture · LoRA · DPO · Vision-Language Alignment · Mechanistic Interp.

Teaching & Outreach

Douyin / Bilibili — "Handmaking LLMs" Series 2026.01 — present

23-episode bilingual series covering Transformer internals, BPE tokenization, DDP training, and inference — documented alongside real development of PIE0.1.
~11k followers, 1M+ total views; 800+ member technical community; code cross-referenced with episodes in inline comments.

Teaching Assistant — Deep Learning (High School) 2025.11 — present

Qingpu High School · co-taught with Suncheng Xiang

Designed GUI-based YOLO experiment workflow enabling students with no programming background to run object detection without writing code.

By the numbers

9 mo · zero → 0.2B LLM
1,000,000 · video views
23 · teaching episodes
11,000+ · followers

Education

Shanghai Jiao Tong Univ. 2023 — 2027

B.Eng., Environmental Science & Engineering
Score: 83.9 / 100 · Rank: 6 / 33 (top 18%)
Cross-enrolled researcher, School of Biomedical Engineering · Advisor: Suncheng Xiang

TOEFL 2026.03

110 / 120 — 5.5 / 6.0

Technical skills

Languages

Python · Rust (PyO3) · C++ basics · JavaScript / TypeScript

ML / DL

PyTorch · Flash Attention · Triton · DDP · llama.cpp · Transformers

Systems

Linux · Docker · Nginx · SSH tunnelling · GPU cluster ops

Web / Infra

Next.js · FastAPI · Supabase · RAG pipelines · Alibaba Cloud

Research interests

mechanistic interpretability · efficient architectures · positional encoding · KV cache compression