When legendary game developer John Carmack asked how to catch up on modern artificial intelligence, Ilya Sutskever, OpenAI's co-founder and former Chief Scientist, handed him a curated list of research papers. Sutskever famously remarked that these papers contained "90% of what matters today" in AI.
The list is not merely a bibliography; it is a worldview. It traces the arc of deep learning from the empirical breakthroughs of AlexNet to the scaling laws that define modern LLMs, while grounding these advances in the theoretical bedrock of compression and complexity theory.
"I plowed through all those things, and it all started sorting out [AI] in my head... not extreme black-magic mathematical wizardries, but simple techniques that make perfect sense." β John Carmack
Ilya's Worldview
01. Don't Bet Against Deep Learning
A rejection of hybrid approaches. The list focuses almost exclusively on supervised and self-supervised deep learning, omitting symbolic AI and classical planning.
02. Engineering Pragmatism
Genuine progress arises from refining ideas at scale rather than from pure theoretical novelty. Scaling simple architectures often beats complex, handcrafted solutions.
03. Do More with Less at Scale
Complexity is a liability. The most enduring architectures (ResNet, Transformers) are often the simplest. Leverage brute-force scale to push the frontier.
04. Emergence via Compression
Intelligence is viewed as a compression process. Better compression of data leads to better generalization and emergent reasoning.
I. Foundational Architectures
Breakthroughs that proved neural networks could scale and outperform handcrafted features.
- AlexNet: The "Big Bang" of modern deep learning. It ended the era of handcrafted features by nearly halving the error rate on ImageNet using ReLU, Dropout, and GPU training.
- ResNet: Introduced "residual connections", allowing gradients to flow through networks of unprecedented depth; still a standard component of modern Transformers (see the residual-block sketch after this list).
- RNN Regularization: Solved overfitting in large LSTMs by applying dropout only to the non-recurrent connections, unlocking large-scale sequence modeling.
- Attention Mechanism: Allowed models to focus on specific parts of the input sequence, removing the bottleneck of compressing a sentence into a fixed-length vector in translation.
- Identity Mappings in ResNets: A refinement of the original ResNet (v2) that analyzed skip connections, showing that identity mappings make extremely deep networks easier to train.
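The central idea of ResNet fits in a few lines. Below is a minimal residual block in PyTorch; the channel count and the BatchNorm/ReLU ordering are illustrative assumptions rather than the paper's exact configuration. The block learns a correction F(x) and adds the input back, so gradients can always flow through the identity path.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # The skip connection: the block only learns a correction to x,
        # and the identity path keeps gradients flowing in deep stacks.
        return self.relu(x + residual)

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```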
II. The Transformer Era & Scaling
The shift to pure attention and the realization that scale is the primary driver of performance.
- Attention Is All You Need: Proposed the Transformer architecture, discarding recurrence in favor of self-attention; the backbone of almost all modern LLMs (see the attention sketch after this list).
- The Annotated Transformer: A "literate programming" implementation of the Transformer paper in PyTorch, demystifying the architecture for practitioners.
- Scaling Laws: Demonstrated that performance improves as a power law in compute, data, and parameters; the scientific justification for massive LLM investment (see the power-law sketch after this list).
- GPipe: Introduced pipeline parallelism, essential for training models that exceed the memory of a single GPU.
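The Transformer's core operation, scaled dot-product self-attention, is compact enough to write out. A minimal sketch in PyTorch; using a single head, no masking, and x itself as queries/keys/values are simplifications, since the full architecture adds learned projections and multiple heads.

```python
import math
import torch

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention.

    x: (batch, seq_len, d_model). In the full Transformer, queries, keys,
    and values come from learned linear projections of x; here x is used
    directly to keep the sketch minimal.
    """
    q, k, v = x, x, x
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ v                                 # weighted mix of values

x = torch.randn(2, 5, 16)
print(self_attention(x).shape)  # torch.Size([2, 5, 16])
```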
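The headline result of the scaling-laws work is a simple functional form: loss falls as a power law in model size (and analogously in data and compute). A sketch of the parameter-count law; the constants below are the approximate fits reported in the paper and should be treated as illustrative rather than exact.

```python
def loss_vs_params(n_params: float,
                   n_c: float = 8.8e13,
                   alpha_n: float = 0.076) -> float:
    """Power-law form L(N) = (N_c / N)^alpha_N for loss vs. parameter count.

    n_c and alpha_n are approximate published fits; treat them as
    illustrative constants, not exact values.
    """
    return (n_c / n_params) ** alpha_n

# Each 10x increase in parameters buys a predictable, diminishing drop in loss:
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_vs_params(n):.2f}")
```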
III. Memory & Reasoning
- The Unreasonable Effectiveness of RNNs: Demonstrated the "magic" of sequence generation (Shakespeare, code) using simple character-level RNNs.
- Understanding LSTM Networks: The definitive visual guide to LSTMs, breaking the gating mechanism down into intuitive diagrams (see the cell-update sketch after this list).
- Pointer Networks: Uses attention to "point" to input elements as output, allowing models to solve geometric problems and generalize to unseen sequence lengths.
- Neural Turing Machines: Coupled neural networks with external memory, an early attempt at giving AI an algorithmic "working memory".
- Deep Speech 2: Proved that end-to-end deep learning could replace complex handcrafted pipelines in speech recognition at scale.
- Order Matters: Investigates how the order of inputs affects training, showing that Seq2Seq models can learn to process sets and sort numbers.
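The gating mechanism that the LSTM guide illustrates can be written out directly. A minimal single-step LSTM cell in plain NumPy; the gate ordering, shapes, and random initialization here are assumptions made for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    x: input (d_in,); h_prev, c_prev: previous hidden/cell state (d_h,).
    W: weights of shape (4*d_h, d_in + d_h); b: bias (4*d_h,).
    Gate blocks are ordered [forget, input, candidate, output].
    """
    d_h = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0*d_h:1*d_h])   # forget gate: what to erase from the cell
    i = sigmoid(z[1*d_h:2*d_h])   # input gate: how much new content to write
    g = np.tanh(z[2*d_h:3*d_h])   # candidate values to write
    o = sigmoid(z[3*d_h:4*d_h])   # output gate: what to expose as hidden state
    c = f * c_prev + i * g        # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c

d_in, d_h = 8, 16
h, c = np.zeros(d_h), np.zeros(d_h)
W = np.random.randn(4 * d_h, d_in + d_h) * 0.1
b = np.zeros(4 * d_h)
h, c = lstm_step(np.random.randn(d_in), h, c, W, b)
print(h.shape, c.shape)  # (16,) (16,)
```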
IV. Complexity & Compression
The philosophical bedrock: intelligence as the discovery of patterns through compression.
- Minimum Description Length (MDL): The principle that the best explanation of the data is the one that yields the shortest description, i.e., the best compression (see the two-part code sketch after this list).
- Keeping Neural Networks Simple: An early application of MDL to neural networks, forcing the network to find generalizable patterns by limiting the information carried in its weights.
- Kolmogorov Complexity: A foundational text on algorithmic information theory, exploring the absolute limits of compression and learning.
- The Coffee Automaton: Explores how complexity rises and then falls as entropy increases in a closed physical system, hinting at the physics of intelligence.
- First Law of Complexodynamics: A blog post discussing theoretical definitions of complexity, further exploring themes of entropy and information.
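MDL becomes concrete as a two-part code: the cost of a hypothesis is the bits needed to describe the model plus the bits needed to encode the data given the model, and the best hypothesis minimizes the sum. A toy sketch in Python; the Bernoulli model class and the 16-bit parameter cost are illustrative assumptions.

```python
import math

def bernoulli_code_length(data, p, param_bits):
    """Two-part MDL code: bits to state the model + bits to encode data under it."""
    eps = 1e-9
    p = min(max(p, eps), 1 - eps)
    data_bits = sum(-math.log2(p if x == 1 else 1.0 - p) for x in data)
    return param_bits + data_bits

# A heavily biased binary sequence: 90 ones, 10 zeros.
data = [1] * 90 + [0] * 10

# Hypothesis A: "fair coin" -- nothing extra to transmit about the model.
fair = bernoulli_code_length(data, p=0.5, param_bits=0)

# Hypothesis B: fitted p = 0.9, paying an assumed 16 bits to transmit the parameter.
fitted = bernoulli_code_length(data, p=0.9, param_bits=16)

print(f"fair coin: {fair:.1f} bits")   # 100.0 bits
print(f"fitted p : {fitted:.1f} bits") # ~62.9 bits -- shorter, so preferred by MDL
```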
V. Additional Key Readings
- Dilated Convolutions: Introduced dilated ("atrous") convolutions to expand the receptive field without losing resolution, crucial for segmentation (see the sketch after this list).
- Neural Message Passing: Unified several graph neural network (GNN) approaches into a single message-passing framework for molecular property prediction.
- Variational Lossy Autoencoder: Combined VAEs with autoregressive models to separate global structure from local detail, motivated by MDL-style arguments.
- Relational Reasoning: Proposed Relation Networks (RNs), a simple module that confers relational reasoning abilities on neural networks.
- Machine Super Intelligence: A PhD thesis providing formal definitions of intelligence and analyzing the feasibility of superintelligence.
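A dilated convolution is an ordinary convolution with gaps inserted between kernel taps, so the receptive field grows while spatial resolution is preserved. A minimal PyTorch comparison; tensor sizes are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)  # (batch, channels, height, width)

# Standard 3x3 convolution: each output pixel sees a 3x3 neighborhood.
standard = nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Dilated 3x3 convolution (dilation=2): the same 9 weights, spread out so
# each output pixel sees a 5x5 neighborhood -- a larger receptive field
# with no pooling and no loss of spatial resolution.
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape)  # torch.Size([1, 1, 64, 64])
print(dilated(x).shape)   # torch.Size([1, 1, 64, 64]) -- resolution preserved
```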