When legendary game developer John Carmack asked how to catch up on modern artificial intelligence, Ilya Sutskever, OpenAI's co-founder and former Chief Scientist, handed him a curated list of research papers. Sutskever famously remarked that these papers contained "90% of what matters today" in AI.
The list is not merely a bibliography; it is a worldview. It traces the arc of deep learning from the empirical breakthroughs of AlexNet to the scaling laws that define modern LLMs, while grounding these advances in the theoretical bedrock of compression and complexity theory.
"I plowed through all those things, and it all started sorting out [AI] in my head... not extreme black-magic mathematical wizardries, but simple techniques that make perfect sense." β John Carmack
Ilya's Worldview
01. Don't Bet Against Deep Learning
A rejection of hybrid approaches. The list focuses almost exclusively on supervised and self-supervised deep learning, omitting symbolic AI and classical planning.
02. Engineering Pragmatism
Genuine progress arises from refining ideas at scale rather than from pure theoretical novelty. Scaling simple architectures often beats complex, handcrafted solutions.
03. Do More with Less at Scale
Complexity is a liability. The most enduring architectures (ResNet, Transformers) are often the simplest. Leverage brute-force scale to push the frontier.
04. Emergence via Compression
Intelligence is viewed as a compression process. Better compression of data leads to better generalization and emergent reasoning.
I. Foundational Architectures
Breakthroughs that proved neural networks could scale and outperform handcrafted features.
- AlexNet: The "Big Bang" of modern deep learning. It ended the era of handcrafted features by nearly halving the error rate on ImageNet using ReLU, Dropout, and GPU training.
- ResNet: Introduced "residual connections", allowing gradients to flow through networks of unprecedented depth; still a standard component of modern Transformers (see the residual-block sketch after this list).
- RNN Regularization: Solved overfitting in large LSTMs by applying dropout only to the non-recurrent connections, unlocking large-scale sequence modeling.
- Attention Mechanism: Allowed models to focus on specific parts of the input sequence, removing the bottleneck of compressing a sentence into a fixed-length vector in translation.
- Identity Mappings in ResNets: A refinement of the original ResNet (v2) that analyzed skip connections, showing that identity mappings make extremely deep networks easier to train.
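The central idea of ResNet fits in a few lines. Below is a minimal residual block in PyTorch; the channel count and the BatchNorm/ReLU ordering are illustrative assumptions rather than the paper's exact configuration. The block learns a correction F(x) and adds the input back, so gradients can always flow through the identity path.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # The skip connection: the block only learns a correction to x,
        # and the identity path keeps gradients flowing in deep stacks.
        return self.relu(x + residual)

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```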
II. The Transformer Era & Scaling
The shift to pure attention and the realization that scale is the primary driver of performance.
- Attention Is All You Need: Proposed the Transformer architecture, discarding recurrence in favor of self-attention; the backbone of almost all modern LLMs (see the attention sketch after this list).
- The Annotated Transformer: A "literate programming" implementation of the Transformer paper in PyTorch, demystifying the architecture for practitioners.
- Scaling Laws: Demonstrated that performance improves as a power law in compute, data, and parameters; the scientific justification for massive LLM investment (see the power-law sketch after this list).
- GPipe: Introduced pipeline parallelism, essential for training models that exceed the memory of a single GPU.
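The Transformer's core operation, scaled dot-product self-attention, is compact enough to write out. A minimal sketch in PyTorch; using a single head, no masking, and x itself as queries/keys/values are simplifications, since the full architecture adds learned projections and multiple heads.

```python
import math
import torch

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention.

    x: (batch, seq_len, d_model). In the full Transformer, queries, keys,
    and values come from learned linear projections of x; here x is used
    directly to keep the sketch minimal.
    """
    q, k, v = x, x, x
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v                                 # weighted mix of values

x = torch.randn(2, 5, 16)
print(self_attention(x).shape)  # torch.Size([2, 5, 16])
```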
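The headline result of the scaling-laws work is a simple functional form: loss falls as a power law in model size (and analogously in data and compute). A sketch of the parameter-count law; the constants below are the approximate fits reported in the paper and should be treated as illustrative rather than exact.

```python
def loss_vs_params(n_params: float,
                   n_c: float = 8.8e13,
                   alpha_n: float = 0.076) -> float:
    """Power-law form L(N) = (N_c / N)^alpha_N for loss vs. parameter count.

    n_c and alpha_n are approximate published fits; treat them as
    illustrative constants, not exact values.
    """
    return (n_c / n_params) ** alpha_n

# Each 10x increase in parameters buys a predictable, diminishing drop in loss:
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_vs_params(n):.2f}")
```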
III. Memory & Reasoning
- The Unreasonable Effectiveness of RNNs: Demonstrated the "magic" of sequence generation (Shakespeare, code) using simple character-level RNNs.
- Understanding LSTM Networks: The definitive visual guide to LSTMs, breaking the gating mechanism down into intuitive diagrams (see the cell-update sketch after this list).
- Pointer Networks: Uses attention to "point" to input elements as output, allowing models to solve geometric problems and generalize to unseen sequence lengths.
- Neural Turing Machines: Coupled neural networks with external memory, an early attempt at giving AI an algorithmic "working memory".
- Deep Speech 2: Proved that end-to-end deep learning could replace complex handcrafted pipelines in speech recognition at scale.
- Order Matters: Investigates how the order of inputs affects training, showing that Seq2Seq models can learn to process sets and sort numbers.
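The gating mechanism that the LSTM guide illustrates can be written out directly. A minimal single-step LSTM cell in plain NumPy; the gate ordering, shapes, and random initialization here are assumptions made for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    x: input (d_in,); h_prev, c_prev: previous hidden/cell state (d_h,).
    W: weights of shape (4*d_h, d_in + d_h); b: bias (4*d_h,).
    Gate blocks are ordered [forget, input, candidate, output].
    """
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0*d_h:1*d_h])   # forget gate: what to erase from the cell
    i = sigmoid(z[1*d_h:2*d_h])   # input gate: how much new content to write
    g = np.tanh(z[2*d_h:3*d_h])   # candidate values to write
    o = sigmoid(z[3*d_h:4*d_h])   # output gate: what to expose as hidden state
    c = f * c_prev + i * g        # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c

d_in, d_h = 8, 16
h, c = np.zeros(d_h), np.zeros(d_h)
W = np.random.randn(4 * d_h, d_in + d_h) * 0.1
b = np.zeros(4 * d_h)
h, c = lstm_step(np.random.randn(d_in), h, c, W, b)
print(h.shape, c.shape)  # (16,) (16,)
```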
IV. Complexity & Compression
The philosophical bedrock: intelligence as the discovery of patterns through compression.
- Minimum Description Length (MDL): The principle that the best explanation of the data is the one that yields the shortest description, i.e., the best compression (see the two-part code sketch after this list).
- Keeping Neural Networks Simple: An early application of MDL to neural networks, forcing the network to find generalizable patterns by limiting the information carried in its weights.
- Kolmogorov Complexity: A foundational text on algorithmic information theory, exploring the absolute limits of compression and learning.
- The Coffee Automaton: Explores how complexity rises and then falls as entropy increases in a closed physical system, hinting at the physics of intelligence.
- First Law of Complexodynamics: A blog post discussing theoretical definitions of complexity, further exploring themes of entropy and information.
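MDL becomes concrete as a two-part code: the cost of a hypothesis is the bits needed to describe the model plus the bits needed to encode the data given the model, and the best hypothesis minimizes the sum. A toy sketch in Python; the Bernoulli model class and the 16-bit parameter cost are illustrative assumptions.

```python
import math

def bernoulli_code_length(data, p, param_bits):
    """Two-part MDL code: bits to state the model + bits to encode data under it."""
    eps = 1e-9
    p = min(max(p, eps), 1 - eps)
    data_bits = sum(-math.log2(p if x == 1 else 1.0 - p) for x in data)
    return param_bits + data_bits

# A heavily biased binary sequence: 90 ones, 10 zeros.
data = [1] * 90 + [0] * 10

# Hypothesis A: "fair coin" -- nothing extra to transmit about the model.
fair = bernoulli_code_length(data, p=0.5, param_bits=0)

# Hypothesis B: fitted p = 0.9, paying an assumed 16 bits to transmit the parameter.
fitted = bernoulli_code_length(data, p=0.9, param_bits=16)

print(f"fair coin: {fair:.1f} bits")   # 100.0 bits
print(f"fitted p : {fitted:.1f} bits") # ~62.9 bits -- shorter, so preferred by MDL
```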
V. Additional Key Readings
- Dilated Convolutions: Introduced dilated ("atrous") convolutions to expand the receptive field without losing resolution, crucial for segmentation (see the sketch after this list).
- Neural Message Passing: Unified several graph neural network (GNN) approaches into a single message-passing framework for molecular property prediction.
- Variational Lossy Autoencoder: Combined VAEs with autoregressive models to separate global structure from local detail, motivated by MDL-style arguments.
- Relational Reasoning: Proposed Relation Networks (RNs), a simple module that confers relational reasoning abilities on neural networks.
- Machine Super Intelligence: A PhD thesis providing formal definitions of intelligence and analyzing the feasibility of superintelligence.
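A dilated convolution is an ordinary convolution with gaps inserted between kernel taps, so the receptive field grows while spatial resolution is preserved. A minimal PyTorch comparison; tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)  # (batch, channels, height, width)

# Standard 3x3 convolution: each output pixel sees a 3x3 neighborhood.
standard = nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Dilated 3x3 convolution (dilation=2): the same 9 weights, spread out so
# each output pixel sees a 5x5 neighborhood -- a larger receptive field
# with no pooling and no loss of spatial resolution.
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape)  # torch.Size([1, 1, 64, 64])
print(dilated(x).shape)   # torch.Size([1, 1, 64, 64]) -- resolution preserved
```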