my resources for learning how llms work and how to optimize them
writing
- jeff dean, sanjay ghemawat - performance hints
- chris olah - visual information theory
- jochen görtler - a visual exploration of gaussian processes
- gregory gundersen - a history of large language models
- kipply - transformer inference arithmetic
- christopher fleetwood - domain specific architectures for ai inference
- damek davis - basic facts about gpus
- modal gpu glossary - performance
- horace he - making deep learning go brrrr from first principles
- writing high-performance matrix multiplication kernels for blackwell
- simon boehm - how to optimize a cuda matmul kernel for cublas-like performance: a worklog
- pranjal shankhdhar - outperforming cublas on h100: a worklog
- jacob austin et al. - how to scale your model
- hugging face - the ultra-scale playbook: training llms on gpu clusters
- horace he - defeating nondeterminism in llm inference
- arkar min aung - turboquant: a first-principles walkthrough
- sankalp shubham - how prompt caching works: paged attention and automatic prefix caching, plus practical tips
- sam rose - prompt caching: 10x cheaper llm tokens, but how?
- max mynter - becoming a research engineer at a big llm lab: 18 months of strategic career development