Hacker's Guide to Neural Networks

Attribution: This guide is based on Andrej Karpathy's original "Hacker's Guide to Neural Networks". Karpathy is a renowned researcher and a founding member of OpenAI. The original work remains a masterpiece in making neural networks accessible through code-first explanations rather than heavy mathematics.

Introduction

Hi there, I'm Nilesh Sarkar, an AI Research Intern at Moog Controls working on safety-critical AI systems and generative design. I've been deeply inspired by Andrej Karpathy's work on neural networks and deep learning. Karpathy has a remarkable ability to explain complex concepts through code and intuition rather than dense mathematics—an approach that fundamentally changed how I understand neural networks.

This guide is my attempt to expand on that philosophy and create a resource that bridges theory and practice. It's on the web instead of in PDF form because all educational materials should be interactive and accessible, and it will hopefully grow to include interactive visualizations and demos.

My personal experience with Neural Networks is that everything became much clearer when I started ignoring full-page, dense derivations of backpropagation equations and just started writing code. This guide will contain very little math (I don't believe it is necessary and it can sometimes even obfuscate simple concepts). I will develop the topic from what I call a hacker's perspective—focusing on code and physical intuitions instead of mathematical derivations. My goal is to present the algorithms in a way that I wish I had come across when I was starting out, inspired by the clarity and elegance of Karpathy's original work.

"…everything became much clearer when I started writing code."

You might be eager to jump right in and learn about Neural Networks, backpropagation, how they can be applied to datasets in practice, etc. But before we get there, I'd like us to first forget about all that. Let's take a step back and understand what is really going on at the core. Let's first talk about real-valued circuits.

Update note: Karpathy suspended work on the original guide and redirected his energy to teaching CS231n (Convolutional Neural Networks for Visual Recognition) at Stanford. The notes are available at cs231n.github.io, along with the course slides. Those materials are highly related to this guide, but more comprehensive and sometimes more polished.

Chapter 1: Real-valued Circuits

In my opinion, the best way to think of Neural Networks is as real-valued circuits, where real values (instead of boolean values {0,1}) "flow" along edges and interact in gates. However, instead of gates such as AND, OR, NOT, etc., we have binary gates such as * (multiply), + (add), or max, and unary gates such as exp. Unlike ordinary boolean circuits, we will eventually also have gradients flowing on the same edges of the circuit, but in the opposite direction. But we're getting ahead of ourselves. Let's focus and start out simple.

Base Case: Single Gate in the Circuit

Let's first consider a single, simple circuit with one gate. The circuit takes two real-valued inputs x and y and computes x * y with the * gate. Here's what this looks like in code, first in JavaScript:

var forwardMultiplyGate = function(x, y) { return x * y; };
forwardMultiplyGate(-2, 3); // returns -6. Exciting.

And the same in Python:

def forward_multiply_gate(x, y):
    return x * y

forward_multiply_gate(-2, 3)  # returns -6. Exciting.

And in math form we can think of this gate as implementing the real-valued function:

$$f(x,y) = xy$$

As with this example, all of our gates will take one or two inputs and produce a single output value.

The Goal

The problem we are interested in studying looks as follows:

  • We provide a given circuit some specific input values (e.g. x = -2, y = 3)
  • The circuit computes an output value (e.g. -6)
  • The core question then becomes: How should one tweak the input slightly to increase the output?

In this case, in what direction should we change x,y to get a number larger than -6? Note that, for example, x = -1.99 and y = 2.99 gives x * y = -5.95, which is higher than -6.0. Don't get confused by this: -5.95 is better (higher) than -6.0. It's an improvement of 0.05, even though the magnitude of -5.95 (the distance from zero) happens to be lower.

Strategy #1: Random Local Search

Okay. So wait, we have a circuit, we have some inputs and we just want to tweak them slightly to increase the output value? Why is this hard? We can easily "forward" the circuit to compute the output for any given x and y. So isn't this trivial? Why don't we tweak x and y randomly and keep track of the tweak that works best:

// circuit with single gate for now
var forwardMultiplyGate = function(x, y) { return x * y; };
var x = -2, y = 3; // some input values

// try changing x,y randomly by small amounts and keep track of what works best
var tweak_amount = 0.01;
var best_out = -Infinity;
var best_x = x, best_y = y;
for(var k = 0; k < 100; k++) {
  var x_try = x + tweak_amount * (Math.random() - 0.5) * 2;
  var y_try = y + tweak_amount * (Math.random() - 0.5) * 2;
  var out = forwardMultiplyGate(x_try, y_try);
  if(out > best_out) {
    best_out = out;
    best_x = x_try;
    best_y = y_try;
  }
}

And in Python:

import random

def forward_multiply_gate(x, y):
    return x * y

x = -2
y = 3

# try changing x,y randomly and keep track of what works best
tweak_amount = 0.01
best_out = float('-inf')
best_x = x
best_y = y

for k in range(100):
    x_try = x + tweak_amount * (random.random() - 0.5) * 2
    y_try = y + tweak_amount * (random.random() - 0.5) * 2
    out = forward_multiply_gate(x_try, y_try)
    if out > best_out:
        best_out = out
        best_x = x_try
        best_y = y_try

When I run this, I get best_x = -1.9928, best_y = 2.9901, and best_out = -5.9588. Again, -5.9588 is higher than -6.0. So, we're done, right? Not quite: This is a perfectly fine strategy for tiny problems with a few gates if you can afford the compute time, but it won't do if we want to eventually consider huge circuits with millions of inputs. It turns out that we can do much better.

Strategy #2: Numerical Gradient

Here's a better way. Remember again that in our setup we are given a circuit (e.g. our circuit with a single * gate) and some particular input (e.g. x = -2, y = 3). The gate computes the output (-6) and now we'd like to tweak x and y to make the output higher.

A nice intuition for what we're about to do is as follows: Imagine taking the output value that comes out from the circuit and tugging on it in the positive direction. This positive tension will in turn translate through the gate and induce forces on the inputs x and y. Forces that tell us how x and y should change to increase the output value.

What might those forces look like in our specific example? Thinking through it, we can intuit that the force on x should also be positive, because making x slightly larger improves the circuit's output. For example, increasing x from x = -2 to x = -1 would give us an output of -3, much larger than -6. On the other hand, we'd expect a negative force induced on y that pushes it to become lower (since a lower y, such as y = 2, down from the original y = 3, would make the output higher: -2 * 2 = -4, again larger than -6).

The derivative can be thought of as a force on each input as we pull on the output to become higher.

So how do we exactly evaluate this force (derivative)? It turns out that there is a very simple procedure for this. We will work backwards: instead of pulling on the circuit's output, we'll iterate over every input one by one, increase it very slightly, and look at what happens to the output value. The change in output, normalized by the size of the tweak, is the derivative. Let's look at the mathematical definition:

$$\frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h}$$

Where h is small - it's the tweak amount. Here's the code:

var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // -6
var h = 0.0001;

// compute derivative with respect to x
var xph = x + h; // -1.9999
var out2 = forwardMultiplyGate(xph, y); // -5.9997
var x_derivative = (out2 - out) / h; // 3.0

// compute derivative with respect to y
var yph = y + h; // 3.0001
var out3 = forwardMultiplyGate(x, yph); // -6.0002
var y_derivative = (out3 - out) / h; // -2.0

And in Python:

x = -2
y = 3
out = forward_multiply_gate(x, y)  # -6
h = 0.0001

# compute derivative with respect to x
xph = x + h  # -1.9999
out2 = forward_multiply_gate(xph, y)  # -5.9997
x_derivative = (out2 - out) / h  # 3.0

# compute derivative with respect to y
yph = y + h  # 3.0001
out3 = forward_multiply_gate(x, yph)  # -6.0002
y_derivative = (out3 - out) / h  # -2.0

Let's walk through x for example. We turned the knob from x to x + h and the circuit responded by giving a higher value (note again that yes, -5.9997 is higher than -6: -5.9997 > -6). The division by h is there to normalize the circuit's response by the (arbitrary) value of h we chose to use here. Technically, you want the value of h to be infinitesimal (the precise mathematical definition of the gradient is defined as the limit of the expression as h goes to zero), but in practice h = 0.0001 or so works fine in most cases to get a good approximation. Now, we see that the derivative w.r.t. x is +3. I'm making the positive sign explicit, because it indicates that the circuit is tugging on x to become higher. The actual value, 3, can be interpreted as the force of that tug.

The derivative with respect to some input can be computed by tweaking that input by a small amount and observing the change on the output value.

By the way, we usually talk about the derivative with respect to a single input, or about a gradient with respect to all the inputs. The gradient is just made up of the derivatives of all the inputs concatenated in a vector (i.e. a list). Crucially, notice that if we let the inputs respond to the tug by following the gradient a tiny amount (i.e. we just add the derivative on top of every input), we can see that the value increases, as expected:

var step_size = 0.01;
var out = forwardMultiplyGate(x, y); // before: -6
x = x + step_size * x_derivative; // x becomes -1.97
y = y + step_size * y_derivative; // y becomes 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87! exciting.

And in Python:

step_size = 0.01
out = forward_multiply_gate(x, y)  # before: -6
x = x + step_size * x_derivative  # x becomes -1.97
y = y + step_size * y_derivative  # y becomes 2.98
out_new = forward_multiply_gate(x, y)  # -5.87! exciting.

As expected, we changed the inputs by the gradient and the circuit now gives a slightly higher value (-5.87 > -6.0). That was much simpler than trying random changes to x and y, right? A fact to appreciate here is that if you take calculus you can prove that the gradient is, in fact, the direction of the steepest increase of the function. There is no need to monkey around trying out random perturbations as done in Strategy #1. Evaluating the gradient requires just three evaluations of the forward pass of our circuit instead of hundreds, and gives the best tug you can hope for (locally) if you are interested in increasing the value of the output.

Bigger step is not always better

Let me expand on this point a bit. It is important to note that in this very simple example, using a bigger step_size than 0.01 will always work better. For example, step_size = 1.0 gives output -1 (higher, better!), and indeed an infinite step size would give infinitely good results. The crucial thing to realize is that once our circuits get much more complex (e.g. entire neural networks), the function from inputs to the output value will be more chaotic and wiggly. The gradient guarantees that if you have a very small (indeed, infinitesimally small) step size, then you will definitely get a higher number when you follow its direction, and for that infinitesimally small step size there is no other direction that would have worked better. But if you use a bigger step size (e.g. step_size = 0.01) all bets are off. The reason we can get away with a larger step size than infinitesimally small is that our functions are usually relatively smooth. But really, we're crossing our fingers and hoping for the best.

Hill-climbing analogy: One analogy I've heard before is that the output value of our circuit is like the height of a hill, and we are blindfolded and trying to climb upwards. We can sense the steepness of the hill at our feet (the gradient), so when we shuffle our feet a bit we will go upwards. But if we took a big, overconfident step, we could have stepped right into a hole.
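To make the overshooting failure concrete, here is a tiny sketch of my own (the wiggly function sin(x) is just a stand-in for a complex circuit, not something from the original guide): following the gradient with a small step improves the output, while a huge step sails right past the peak.

import math

def f(x):   # a toy "wiggly" circuit with one input
    return math.sin(x)

def df(x):  # its analytic gradient
    return math.cos(x)

x = 1.0                     # f(x) ~ 0.841, gradient ~ +0.540
x_small = x + 0.01 * df(x)  # tiny step along the gradient
x_big = x + 5.0 * df(x)     # big, overconfident step

print(f(x_small) > f(x))  # True: ~0.844, a guaranteed small improvement
print(f(x_big) > f(x))    # False: ~ -0.53, we stepped right into a "hole"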

Great, I hope I've convinced you that the numerical gradient is indeed a very useful thing to evaluate, and that it is cheap. But. It turns out that we can do even better.

Strategy #3: Analytic Gradient

In the previous section we evaluated the gradient by probing the circuit's output value, independently for every input. This procedure gives you what we call a numerical gradient. This approach, however, is still expensive because we need to compute the circuit's output as we tweak every input value independently by a small amount. So the complexity of evaluating the gradient is linear in the number of inputs. But in practice we will have hundreds, thousands or (for neural networks) even tens to hundreds of millions of inputs, and the circuits aren't just one multiply gate but huge expressions that can be expensive to compute. We need something better.

Luckily, there is an easier and much faster way to compute the gradient: we can use calculus to derive a direct expression for it that will be as simple to evaluate as the circuit's output value. We call this an analytic gradient, and there will be no need for tweaking anything. You may have seen other people who teach Neural Networks derive the gradient in huge and, frankly, scary and confusing mathematical equations (if you're not well-versed in maths). But it's unnecessary. I've written plenty of Neural Nets code and I rarely have to do a mathematical derivation longer than two lines, and 95% of the time it can be done without writing anything at all. That is because we will only ever derive the gradient for very small and simple expressions (think of it as the base case), and then I will show you how we can compose these very simply with the chain rule to evaluate the full gradient (think of it as the inductive/recursive case).

The analytic derivative requires no tweaking of the inputs. It can be derived using mathematics (calculus).

If you remember your product rules, power rules, quotient rules, etc., it's very easy to write down the derivative with respect to both x and y for a small expression such as x * y. But suppose you don't remember your calculus rules. We can go back to the definition. For example, here's the expression for the derivative w.r.t x:

$$\frac{\partial f(x,y)}{\partial x} = \frac{f(x+h,y) - f(x,y)}{h}$$

Okay and let's plug in our function (f(x,y) = xy) into the expression:

$$\frac{\partial f(x,y)}{\partial x} = \frac{(x+h)y - xy}{h} = \frac{xy + hy - xy}{h} = \frac{hy}{h} = y$$

That's interesting. The derivative with respect to x is just equal to y. Did you notice the coincidence in the previous section? We tweaked x to x+h and calculated x_derivative = 3.0, which exactly happens to be the value of y in that example. It turns out that wasn't a coincidence at all because that's just what the analytic gradient tells us the x derivative should be for f(x,y) = x * y. The derivative with respect to y, by the way, turns out to be x, unsurprisingly by symmetry. So there is no need for any tweaking! We invoked powerful mathematics and can now transform our derivative calculation into the following code:

var x = -2, y = 3;
var out = forwardMultiplyGate(x, y); // before: -6
var x_gradient = y; // by our complex mathematical derivation above
var y_gradient = x;

var step_size = 0.01;
x += step_size * x_gradient; // -1.97
y += step_size * y_gradient; // 2.98
var out_new = forwardMultiplyGate(x, y); // -5.87. Higher output! Nice.

And in Python:

x = -2
y = 3
out = forward_multiply_gate(x, y)  # before: -6
x_gradient = y  # analytical derivative
y_gradient = x

step_size = 0.01
x += step_size * x_gradient  # -1.97
y += step_size * y_gradient  # 2.98
out_new = forward_multiply_gate(x, y)  # -5.87. Higher output! Nice.

To compute the gradient we went from forwarding the circuit hundreds of times (Strategy #1), to forwarding it only on the order of the number of inputs times (Strategy #2), to forwarding it a single time! And it gets EVEN better, since the more expensive strategies (#1 and #2) only give an approximation of the gradient, while #3 (the fastest one by far) gives you the exact gradient. No approximations. The only downside is that you should be comfortable with some calculus 101.

Recap

  • INPUT: We are given a circuit, some inputs and compute an output value.
  • OUTPUT: We are then interested in finding small changes to each input (independently) that would make the output higher.
  • Strategy #1: One silly way is to randomly search for small perturbations of the inputs and keep track of what gives the highest increase in output.
  • Strategy #2: We saw we can do much better by computing the gradient. Regardless of how complicated the circuit is, the numerical gradient is very simple (but relatively expensive) to compute. We compute it by probing the circuit's output value as we tweak the inputs one at a time.
  • Strategy #3: In the end, we saw that we can be even more clever and analytically derive a direct expression to get the analytic gradient. It is exact (unlike the numerical gradient, which is an approximation), it is fastest by far, and there is no need for any tweaking.

In practice by the way (and we will get to this once again later), all Neural Network libraries always compute the analytic gradient, but the correctness of the implementation is verified by comparing it to the numerical gradient. That's because the numerical gradient is very easy to implement (but can be a bit expensive to compute), while the analytic gradient can contain bugs at times, but is usually extremely efficient to compute. As we will see, evaluating the gradient (i.e. while doing backprop, or the backward pass) will turn out to cost about as much as evaluating the forward pass.
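Such a gradient check is easy to write as a small helper. Here's a minimal sketch in Python (the name numerical_gradient is my own, not a standard API), used to verify the analytic gradient [y, x] of our multiply gate:

def numerical_gradient(f, inputs, h=0.0001):
    # probe f by tweaking each input in turn, exactly as in Strategy #2
    base = f(*inputs)
    grads = []
    for i in range(len(inputs)):
        tweaked = list(inputs)
        tweaked[i] += h
        grads.append((f(*tweaked) - base) / h)
    return grads

# the analytic gradient of x*y at (-2, 3) is [y, x] = [3, -2]
print(numerical_gradient(lambda x, y: x * y, [-2, 3]))  # ~[3.0, -2.0]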

Recursive Case: Circuits with Multiple Gates

But hold on, you say: "The analytic gradient was trivial to derive for your super-simple expression. This is useless. What do I do when the expressions are much larger? Don't the equations get huge and complex very fast?". Good question. Yes the expressions get much more complex. No, this doesn't make it much harder. As we will see, every gate will be hanging out by itself, completely unaware of any details of the huge and complex circuit that it could be part of. It will only worry about its inputs and it will compute its local derivatives as seen in the previous section, except now there will be a single extra multiplication it will have to do.

A single extra multiplication will turn a single (and useless) gate into a cog in the complex machine that is an entire neural network.

I should stop hyping it up now. I hope I've piqued your interest! Let's drill down into the details and get two gates involved with this next example. The expression we are computing now is f(x,y,z) = (x+y)*z. Let's structure the code as follows to make the gates explicit as functions:

var forwardAddGate = function(a, b) { return a + b; };
var forwardMultiplyGate = function(a, b) { return a * b; };

var forwardCircuit = function(x, y, z) {
  var q = forwardAddGate(x, y); // q is x + y
  var f = forwardMultiplyGate(q, z); // f is q * z
  return f;
};

var x = -2, y = 5, z = -4;
var f = forwardCircuit(x, y, z); // output is -12

And in Python:

def forward_add_gate(a, b):
    return a + b

def forward_multiply_gate(a, b):
    return a * b

def forward_circuit(x, y, z):
    q = forward_add_gate(x, y)  # q is x + y
    f = forward_multiply_gate(q, z)  # f is q * z
    return f

x = -2
y = 5
z = -4
f = forward_circuit(x, y, z)  # output is -12

In the above, I am using a and b as the local variables in the gate functions so that we don't get these confused with our circuit inputs x,y,z. As before, we are interested in finding the derivatives with respect to the three inputs x,y,z. But how do we compute them now that there are multiple gates involved? First, let's pretend that the + gate is not there and that we only have two variables in the circuit: q,z and a single * gate. Note that q is the output of the + gate. If we don't worry about x and y but only about q and z, then we are back to having only a single gate, and as far as that single * gate is concerned, we know what the (analytic) derivatives are from the previous section:

$$f(q,z) = qz \implies \frac{\partial f}{\partial q} = z, \quad \frac{\partial f}{\partial z} = q$$

Simple enough: these are the expressions for the gradient with respect to q and z. But wait, we don't want the gradient with respect to q, but with respect to the inputs x and y. Luckily, q is computed as a function of x and y (by addition in our example). We can write down the gradient for the addition gate as well; it's even simpler:

$$q(x,y) = x + y \implies \frac{\partial q}{\partial x} = 1, \quad \frac{\partial q}{\partial y} = 1$$

That's right, the derivatives are just 1, regardless of the actual values of x and y. If you think about it, this makes sense because to make the output of a single addition gate higher, we expect a positive tug on both x and y, regardless of their values.

Backpropagation

We are finally ready to invoke the Chain Rule: We know how to compute the gradient of q with respect to x and y (that's a single gate case with + as the gate). And we know how to compute the gradient of our final output with respect to q. The chain rule tells us how to combine these to get the gradient of the final output with respect to x and y, which is what we're ultimately interested in. Best of all, the chain rule very simply states that the right thing to do is to simply multiply the gradients together to chain them. For example, the final derivative for x will be:

$$\frac{\partial f}{\partial x} = \frac{\partial q}{\partial x} \cdot \frac{\partial f}{\partial q}$$

There are many symbols there so maybe this is confusing again, but it's really just two numbers being multiplied together. Here is the code:

// initial conditions
var x = -2, y = 5, z = -4;
var q = forwardAddGate(x, y); // q is 3
var f = forwardMultiplyGate(q, z); // output is -12

// gradient of the MULTIPLY gate with respect to its inputs
// wrt is short for "with respect to"
var derivative_f_wrt_z = q; // 3
var derivative_f_wrt_q = z; // -4

// derivative of the ADD gate with respect to its inputs
var derivative_q_wrt_x = 1.0;
var derivative_q_wrt_y = 1.0;

// chain rule
var derivative_f_wrt_x = derivative_q_wrt_x * derivative_f_wrt_q; // -4
var derivative_f_wrt_y = derivative_q_wrt_y * derivative_f_wrt_q; // -4

And in Python:

# initial conditions
x = -2
y = 5
z = -4
q = forward_add_gate(x, y)  # q is 3
f = forward_multiply_gate(q, z)  # output is -12

# gradient of the MULTIPLY gate with respect to its inputs
# wrt is short for "with respect to"
derivative_f_wrt_z = q  # 3
derivative_f_wrt_q = z  # -4

# derivative of the ADD gate with respect to its inputs
derivative_q_wrt_x = 1.0
derivative_q_wrt_y = 1.0

# chain rule
derivative_f_wrt_x = derivative_q_wrt_x * derivative_f_wrt_q  # -4
derivative_f_wrt_y = derivative_q_wrt_y * derivative_f_wrt_q  # -4

That's it. We computed the gradient (the forces) and now we can let our inputs respond to it by a bit. Let's add the gradients on top of the inputs. The output value of the circuit had better increase, up from -12!

// final gradient, from above: [-4, -4, 3]
var gradient_f_wrt_xyz = [derivative_f_wrt_x, derivative_f_wrt_y, derivative_f_wrt_z]

// let the inputs respond to the force/tug:
var step_size = 0.01;
x = x + step_size * derivative_f_wrt_x; // -2.04
y = y + step_size * derivative_f_wrt_y; // 4.96
z = z + step_size * derivative_f_wrt_z; // -3.97

// Our circuit now better give higher output:
var q = forwardAddGate(x, y); // q becomes 2.92
var f = forwardMultiplyGate(q, z); // output is -11.59, up from -12! Nice!

And in Python:

# final gradient, from above: [-4, -4, 3]
gradient_f_wrt_xyz = [derivative_f_wrt_x, derivative_f_wrt_y, derivative_f_wrt_z]

# let the inputs respond to the force/tug:
step_size = 0.01
x = x + step_size * derivative_f_wrt_x  # -2.04
y = y + step_size * derivative_f_wrt_y  # 4.96
z = z + step_size * derivative_f_wrt_z  # -3.97

# Our circuit now better give higher output:
q = forward_add_gate(x, y)  # q becomes 2.92
f = forward_multiply_gate(q, z)  # output is -11.59, up from -12! Nice!

Looks like that worked! Let's now try to interpret intuitively what just happened. The circuit wants to output higher values. The last gate saw inputs q = 3, z = -4 and computed output -12. "Pulling" upwards on this output value induced a force on both q and z: to increase the output value, the circuit "wants" z to increase, as can be seen by the positive value of the derivative (derivative_f_wrt_z = +3). Again, the size of this derivative can be interpreted as the magnitude of the force. On the other hand, q felt a stronger, downward force, since derivative_f_wrt_q = -4. In other words, the circuit wants q to decrease, with a force of 4.

Now we get to the second, + gate, which outputs q. By default, the + gate computes its derivatives, which tell us how to change x and y to make q higher. BUT! Here is the crucial point: the gradient on q was computed as negative (derivative_f_wrt_q = -4), so the circuit wants q to decrease, and with a force of 4! So if the + gate wants to contribute to making the final output value larger, it needs to listen to the gradient signal coming from the top. In this particular case, it needs to apply tugs on x,y opposite of what it would normally apply, and with a force of 4, so to speak. The multiplication by -4 seen in the chain rule achieves exactly this: instead of applying a positive force of +1 on both x and y (the local derivative), the full circuit's gradient on both x and y becomes 1 * -4 = -4. This makes sense: the circuit wants both x and y to get smaller because this will make q smaller, which in turn will make f larger.

If this makes sense, you understand backpropagation.

Recap Once Again

  • In the previous chapter we saw that in the case of a single gate (or a single expression), we can derive the analytic gradient using simple calculus. We interpreted the gradient as a force, or a tug on the inputs that pulls them in a direction which would make this gate's output higher.
  • In the case of multiple gates everything stays pretty much the same: every gate is hanging out by itself, completely unaware of the circuit it is embedded in. Some inputs come in and the gate computes its output and the derivative with respect to the inputs. The only difference now is that suddenly, something can pull on this gate from above. That's the gradient of the final circuit output value with respect to the output this gate computed. It is the circuit asking the gate to output higher or lower numbers, and with some force. The gate simply takes this force and multiplies it to all the forces it computed for its inputs before (chain rule).
  • This has the desired effect: If a gate experiences a strong positive pull from above, it will also pull harder on its own inputs, scaled by the force it is experiencing from above. And if it experiences a negative tug, this means that the circuit wants its value to decrease, not increase, so it will flip the force of the pull on its inputs to make its own output value smaller.
  • A nice picture to have in mind is that as we pull on the circuit's output value at the end, this induces pulls downward through the entire circuit, all the way down to the inputs.

Isn't it beautiful? The only difference between the case of a single gate and multiple interacting gates that compute arbitrarily complex expressions is this additional multiply operation that now happens in each gate.

Patterns in the "Backward" Flow

After a while you start to notice patterns in how the gradients flow backward in the circuits. For example, the + gate always takes the gradient on top and simply passes it on to all of its inputs. This is because its own derivative for the inputs is just +1, regardless of what the actual values of the inputs are, so in the chain rule, the gradient from above is just multiplied by 1 and stays the same. Similar intuitions apply to, for example, a max(x,y) gate. Since the gradient of max(x,y) with respect to its input is +1 for whichever one of x, y is larger and 0 for the other, this gate is during backprop effectively just a gradient "switch": it will take the gradient from above and "route" it to the input that had a higher value during the forward pass.
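Here's a tiny sketch of that switching behavior (my own illustration, in Python):

a, b = 2.0, -3.0
x = max(a, b)  # forward pass: x = 2.0, the "a" wire won
dx = 5.0       # suppose a gradient of 5 flows in from above

da = dx if a == x else 0.0  # a receives the full gradient: 5.0
db = dx if b == x else 0.0  # b receives nothing: 0.0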

Numerical Gradient Check

Before we finish with this section, let's just make sure that the (analytic) gradient we computed by backprop above is correct, as a sanity check. Remember that we can do this simply by computing the numerical gradient and making sure that we get [-4, -4, 3] for x,y,z. Here's the code:

// initial conditions
var x = -2, y = 5, z = -4;

// numerical gradient check
var h = 0.0001;
var x_derivative = (forwardCircuit(x+h,y,z) - forwardCircuit(x,y,z)) / h; // -4
var y_derivative = (forwardCircuit(x,y+h,z) - forwardCircuit(x,y,z)) / h; // -4
var z_derivative = (forwardCircuit(x,y,z+h) - forwardCircuit(x,y,z)) / h; // 3
// We get [-4, -4, 3], as computed with backprop. phew!

And in Python:

# initial conditions
x = -2
y = 5
z = -4

# numerical gradient check
h = 0.0001
x_derivative = (forward_circuit(x+h,y,z) - forward_circuit(x,y,z)) / h  # -4
y_derivative = (forward_circuit(x,y+h,z) - forward_circuit(x,y,z)) / h  # -4
z_derivative = (forward_circuit(x,y,z+h) - forward_circuit(x,y,z)) / h  # 3
# We get [-4, -4, 3], as computed with backprop. phew!

Example: Single Neuron

In the previous section you hopefully got the basic intuition behind backpropagation. Let's now look at an even more complicated and borderline practical example. We will consider a 2-dimensional neuron that computes the following function:

$$f(x,y,a,b,c) = \sigma(ax + by + c)$$

In this expression, $\sigma$ is the sigmoid function. It's best thought of as a "squashing function", because it takes the input and squashes it to be between zero and one: Very negative values are squashed towards zero and positive values get squashed towards one. For example, we have $\sigma(-5) = 0.006$, $\sigma(0) = 0.5$, $\sigma(5) = 0.993$. The sigmoid function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

The gradient with respect to its single input is given by this expression:

$$\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x))$$

That's all we need to use this gate: we know how to take an input and forward it through the sigmoid gate, and we also have the expression for the gradient with respect to its input, so we can also backprop through it.
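If the sigmoid gradient expression looks like magic, note that you can always fall back to Strategy #2 and check it numerically. A quick sketch (my own, not part of the original guide):

import math

def sig(x):
    return 1 / (1 + math.exp(-x))

x, h = 2.0, 0.0001
numerical = (sig(x + h) - sig(x)) / h  # probe the gate, Strategy #2 style
analytic = sig(x) * (1 - sig(x))       # the expression above
print(numerical, analytic)  # both ~0.105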

Let's take this opportunity to carefully structure the associated code in a nice and modular way. First, I'd like you to note that every wire in our circuits has two numbers associated with it:

  1. The value it carries during the forward pass
  2. The gradient (i.e. the pull) that flows back through it in the backward pass

Let's create a simple Unit structure that will store these two values on every wire. Our gates will now operate over Units: they will take them as inputs and create them as outputs.

// every Unit corresponds to a wire in the diagrams
var Unit = function(value, grad) {
  // value computed in the forward pass
  this.value = value; 
  // the derivative of circuit output w.r.t this unit, computed in backward pass
  this.grad = grad; 
}

var multiplyGate = function(){ };
multiplyGate.prototype = {
  forward: function(u0, u1) {
    // store pointers to input Units u0 and u1 and output unit utop
    this.u0 = u0; 
    this.u1 = u1; 
    this.utop = new Unit(u0.value * u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // take the gradient in output unit and chain it with the
    // local gradients, which we derived for multiply gate before
    // then write those gradients to those Units.
    this.u0.grad += this.u1.value * this.utop.grad;
    this.u1.grad += this.u0.value * this.utop.grad;
  }
}

var addGate = function(){ };
addGate.prototype = {
  forward: function(u0, u1) {
    this.u0 = u0; 
    this.u1 = u1; // store pointers to input units
    this.utop = new Unit(u0.value + u1.value, 0.0);
    return this.utop;
  },
  backward: function() {
    // add gate. derivative wrt both inputs is 1
    this.u0.grad += 1 * this.utop.grad;
    this.u1.grad += 1 * this.utop.grad;
  }
}

var sigmoidGate = function() { 
  // helper function
  this.sig = function(x) { return 1 / (1 + Math.exp(-x)); };
};
sigmoidGate.prototype = {
  forward: function(u0) {
    this.u0 = u0;
    this.utop = new Unit(this.sig(this.u0.value), 0.0);
    return this.utop;
  },
  backward: function() {
    var s = this.sig(this.u0.value);
    this.u0.grad += (s * (1 - s)) * this.utop.grad;
  }
}

And in Python:

import math

class Unit:
    def __init__(self, value, grad):
        self.value = value
        self.grad = grad

class MultiplyGate:
    def forward(self, u0, u1):
        self.u0 = u0
        self.u1 = u1
        self.utop = Unit(u0.value * u1.value, 0.0)
        return self.utop
    
    def backward(self):
        self.u0.grad += self.u1.value * self.utop.grad
        self.u1.grad += self.u0.value * self.utop.grad

class AddGate:
    def forward(self, u0, u1):
        self.u0 = u0
        self.u1 = u1
        self.utop = Unit(u0.value + u1.value, 0.0)
        return self.utop
    
    def backward(self):
        self.u0.grad += 1 * self.utop.grad
        self.u1.grad += 1 * self.utop.grad

class SigmoidGate:
    def sig(self, x):
        return 1 / (1 + math.exp(-x))
    
    def forward(self, u0):
        self.u0 = u0
        self.utop = Unit(self.sig(self.u0.value), 0.0)
        return self.utop
    
    def backward(self):
        s = self.sig(self.u0.value)
        self.u0.grad += (s * (1 - s)) * self.utop.grad

To fully specify everything, let's finally write out the forward and backward flow for our 2-dimensional neuron with some example values:

// create input units
var a = new Unit(1.0, 0.0);
var b = new Unit(2.0, 0.0);
var c = new Unit(-3.0, 0.0);
var x = new Unit(-1.0, 0.0);
var y = new Unit(3.0, 0.0);

// create the gates
var mulg0 = new multiplyGate();
var mulg1 = new multiplyGate();
var addg0 = new addGate();
var addg1 = new addGate();
var sg0 = new sigmoidGate();

// do the forward pass
var forwardNeuron = function() {
  ax = mulg0.forward(a, x); // a*x = -1
  by = mulg1.forward(b, y); // b*y = 6
  axpby = addg0.forward(ax, by); // a*x + b*y = 5
  axpbypc = addg1.forward(axpby, c); // a*x + b*y + c = 2
  s = sg0.forward(axpbypc); // sig(a*x + b*y + c) = 0.8808
};
forwardNeuron();

console.log('circuit output: ' + s.value); // prints 0.8808

// And now lets compute the gradient: Simply iterate in reverse order and call the backward function!
s.grad = 1.0;
sg0.backward(); // writes gradient into axpbypc
addg1.backward(); // writes gradients into axpby and c
addg0.backward(); // writes gradients into ax and by
mulg1.backward(); // writes gradients into b and y
mulg0.backward(); // writes gradients into a and x

// Finally, lets make the inputs respond to the computed gradients
var step_size = 0.01;
a.value += step_size * a.grad; // a.grad is -0.105
b.value += step_size * b.grad; // b.grad is 0.315
c.value += step_size * c.grad; // c.grad is 0.105
x.value += step_size * x.grad; // x.grad is 0.105
y.value += step_size * y.grad; // y.grad is 0.210

forwardNeuron();
console.log('circuit output after one backprop: ' + s.value); // prints 0.8825

And in Python:

# create input units
a = Unit(1.0, 0.0)
b = Unit(2.0, 0.0)
c = Unit(-3.0, 0.0)
x = Unit(-1.0, 0.0)
y = Unit(3.0, 0.0)

# create the gates
mulg0 = MultiplyGate()
mulg1 = MultiplyGate()
addg0 = AddGate()
addg1 = AddGate()
sg0 = SigmoidGate()

# do the forward pass
def forward_neuron():
    global ax, by, axpby, axpbypc, s
    ax = mulg0.forward(a, x)  # a*x = -1
    by = mulg1.forward(b, y)  # b*y = 6
    axpby = addg0.forward(ax, by)  # a*x + b*y = 5
    axpbypc = addg1.forward(axpby, c)  # a*x + b*y + c = 2
    s = sg0.forward(axpbypc)  # sig(a*x + b*y + c) = 0.8808

forward_neuron()
print('circuit output:', s.value)  # prints 0.8808

# And now lets compute the gradient: Simply iterate in reverse order and call the backward function!
s.grad = 1.0
sg0.backward()  # writes gradient into axpbypc
addg1.backward()  # writes gradients into axpby and c
addg0.backward()  # writes gradients into ax and by
mulg1.backward()  # writes gradients into b and y
mulg0.backward()  # writes gradients into a and x

# Finally, lets make the inputs respond to the computed gradients
step_size = 0.01
a.value += step_size * a.grad  # a.grad is -0.105
b.value += step_size * b.grad  # b.grad is 0.315
c.value += step_size * c.grad  # c.grad is 0.105
x.value += step_size * x.grad  # x.grad is 0.105
y.value += step_size * y.grad  # y.grad is 0.210

forward_neuron()
print('circuit output after one backprop:', s.value)  # prints 0.8825

Success! 0.8825 is higher than the previous value, 0.8808.
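Just as before, it's worth verifying these backprop results against the numerical gradient. Here's a sketch (the helper name forward_fast is mine; it collapses the whole neuron into one cheap expression for probing):

import math

def forward_fast(a, b, c, x, y):
    # the entire neuron as a single expression, cheap to probe
    return 1 / (1 + math.exp(-(a * x + b * y + c)))

a, b, c, x, y = 1.0, 2.0, -3.0, -1.0, 3.0
h = 0.0001
base = forward_fast(a, b, c, x, y)
a_grad = (forward_fast(a + h, b, c, x, y) - base) / h  # ~ -0.105
b_grad = (forward_fast(a, b + h, c, x, y) - base) / h  # ~ 0.315
c_grad = (forward_fast(a, b, c + h, x, y) - base) / h  # ~ 0.105
# these match the a.grad, b.grad, c.grad values computed by backprop above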

Becoming a Backprop Ninja

Over time you will become much more efficient in writing the backward pass, even for complicated circuits and all at once. Let's practice backprop a bit with a few examples. In what follows, let's not worry about the Unit and gate classes, because they obfuscate things a bit, and let's just use variables such as a,b,c,x, and refer to their gradients as da,db,dc,dx respectively.

Our first example was the * gate:

var x = a * b;
// and given gradient on x (dx), we saw that in backprop we would compute:
var da = b * dx;
var db = a * dx;

And in Python:

x = a * b
# and given gradient on x (dx), we saw that in backprop we would compute:
da = b * dx
db = a * dx

Here's the + gate in this condensed form:

var x = a + b;
// ->
var da = 1.0 * dx;
var db = 1.0 * dx;

And in Python:

x = a + b
# ->
da = 1.0 * dx
db = 1.0 * dx

What about adding three numbers?

var x = a + b + c;
var da = 1.0 * dx; var db = 1.0 * dx; var dc = 1.0 * dx;

And in Python:

x = a + b + c
da = 1.0 * dx; db = 1.0 * dx; dc = 1.0 * dx

Okay, how about combining gates?

var x = a * b + c;
// given dx, backprop in-one-sweep would be =>
da = b * dx;
db = a * dx;
dc = 1.0 * dx;

And in Python:

x = a * b + c
# given dx, backprop in-one-sweep would be =>
da = b * dx
db = a * dx
dc = 1.0 * dx

Now how about this:

var x = a * a;
var da = 2 * a * dx;

And in Python:

x = a * a
da = 2 * a * dx

You can think of this as the value a flowing to the * gate, where the wire gets split and becomes both inputs. This is actually simple because when a wire splits like this, the gradients flowing back along each branch simply add up.
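To see the "adds up" part explicitly, here is the same x = a * a case with the two branches accumulated one at a time (my own expansion):

a = 3.0
x = a * a  # the a wire splits and feeds both inputs of the * gate
dx = 1.0

da = 0.0
da += a * dx  # gradient flowing back along the first input wire
da += a * dx  # gradient flowing back along the second input wire
# da is now 6.0, i.e. 2 * a * dx, matching the formula above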

Here are a few more functions and their gradients that are useful in practice:

// Division
var x = 1.0/a; 
var da = (-1.0/(a*a)) * dx; // chain the local gradient with dx, as before

// Max
var x = Math.max(a, b);
var da = a === x ? 1.0 * dx : 0.0;
var db = b === x ? 1.0 * dx : 0.0;

// ReLU (Rectified Linear Unit)
var x = Math.max(a, 0);
var da = a > 0 ? 1.0 * dx : 0.0;

And in Python:

# Division
x = 1.0 / a
da = (-1.0 / (a * a)) * dx  # chain the local gradient with dx, as before

# Max
x = max(a, b)
da = 1.0 * dx if a == x else 0.0
db = 1.0 * dx if b == x else 0.0

# ReLU (Rectified Linear Unit)
x = max(a, 0)
da = 1.0 * dx if a > 0 else 0.0
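And as one final practice rep, here is a slightly larger example of my own in the same condensed style, backpropped in one sweep:

# forward pass for f = (a*b + c) * d
a, b, c, d = -1.0, 2.0, 3.0, -2.0
q = a * b + c  # q = 1.0
f = q * d      # f = -2.0

# backward pass, given df = 1.0 (we pull directly on f)
df = 1.0
dq = d * df    # * gate: -2.0
dd = q * df    # * gate: 1.0
da = b * dq    # chain rule through the + and * gates: -4.0
db = a * dq    # 2.0
dc = 1.0 * dq  # -2.0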

Everything we've done in this chapter comes down to this: We saw that we can feed some input through an arbitrarily complex real-valued circuit, tug at the end of the circuit with some force, and backpropagation distributes that tug through the entire circuit all the way back to the inputs. If the inputs respond slightly along the final direction of their tug, the circuit will "give" a bit along the original pull direction. Maybe this is not immediately obvious, but this machinery is a powerful hammer for Machine Learning.

Chapter 2: Machine Learning

In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their inputs (the forward pass), and also we could compute the gradients of these expressions on the original inputs (backward pass). In this chapter we will see how useful this extremely simple mechanism is in Machine Learning.

Binary Classification

Let's start out simple. The simplest, most common and yet very practical problem in Machine Learning is binary classification. A lot of very interesting and important problems can be reduced to it. The setup is as follows: We are given a dataset of N vectors and every one of them is labeled with a +1 or a -1. For example, in two dimensions our dataset could look as simple as:

Vector Label
[1.2, 0.7] +1
[-0.3, 0.5] -1
[-3, -1] +1
[0.1, 1.0] -1
[3.0, 1.1] -1
[2.1, -3] +1

Here, we have N = 6 datapoints, where every datapoint has two features (D = 2). Three of the datapoints have label +1 and the other three have label -1. This is a silly toy example, but in practice a +1/-1 dataset could be very useful indeed: for example spam/no spam emails, where the vectors somehow measure various features of the content of the email, such as the number of times certain enhancement drugs are mentioned.
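The training loops later in this chapter assume this dataset lives in two parallel arrays called data and labels. In Python, that might look like:

# the toy dataset from the table above
data = [[1.2, 0.7], [-0.3, 0.5], [-3, -1], [0.1, 1.0], [3.0, 1.1], [2.1, -3]]
labels = [1, -1, 1, -1, -1, 1]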

Goal

Our goal in binary classification is to learn a function that takes a 2-dimensional vector and predicts the label. This function is usually parameterized by a certain set of parameters, and we will want to tune the parameters of the function so that its outputs are consistent with the labeling in the provided dataset. In the end we can discard the dataset and use the learned parameters to predict labels for previously unseen vectors.

Training Protocol

We will eventually build up to entire neural networks and complex expressions, but let's start out simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1. Let's use a simple linear function:

$$f(x,y) = ax + by + c$$

In this expression we think of x and y as the inputs (the 2D vectors) and a,b,c as the parameters of the function that we will want to learn. Here is how the training will work:

  1. We select a random datapoint and feed it through the circuit
  2. We will interpret the output of the circuit as a confidence that the datapoint has class +1. (i.e. very high values = circuit is very certain datapoint has class +1 and very low values = circuit is certain this datapoint has class -1.)
  3. We will measure how well the prediction aligns with the provided labels. Intuitively, for example, if a positive example scores very low, we will want to tug in the positive direction on the circuit, demanding that it should output higher value for this datapoint.
  4. The circuit will take the tug and backpropagate it to compute tugs on the inputs a,b,c,x,y
  5. Since we think of x,y as (fixed) datapoints, we will ignore the pull on x,y. If you're a fan of my physical analogies, think of these inputs as pegs, fixed in the ground.
  6. On the other hand, we will take the parameters a,b,c and make them respond to their tug (i.e. we'll perform what we call a parameter update). This, of course, will make it so that the circuit will output a slightly higher score on this particular datapoint in the future.
  7. Iterate! Go back to step 1.

The training scheme I described above is commonly referred to as Stochastic Gradient Descent. The interesting part I'd like to reiterate is that a,b,c,x,y are all made up of the same stuff as far as the circuit is concerned: they are inputs to the circuit and the circuit will tug on all of them in some direction. It doesn't know the difference between parameters and datapoints. However, after the backward pass is complete we ignore all tugs on the datapoints (x,y) and keep swapping them in and out as we iterate over examples in the dataset. On the other hand, we keep the parameters (a,b,c) around and keep tugging on them every time we sample a datapoint. Over time, the pulls on these parameters will tune them in such a way that the function outputs high scores for positive examples and low scores for negative examples.

Learning a Support Vector Machine

As a concrete example, let's learn a Support Vector Machine. The SVM is a very popular linear classifier; its functional form is exactly as described in the previous section, $f(x,y) = ax + by + c$. Instead of defining loss functions, I would like to base the explanation on the force specification of a Support Vector Machine, which I personally find much more intuitive.

Support Vector Machine "Force Specification":

  • If we feed a positive datapoint through the SVM circuit and the output value is less than 1, pull on the circuit with force +1. This is a positive example so we want the score to be higher for it.
  • Conversely, if we feed a negative datapoint through the SVM and the output is greater than -1, then the circuit is giving this datapoint dangerously high score: Pull on the circuit downwards with force -1.
  • In addition to the pulls above, always add a small amount of pull on the parameters a,b (notice, not on c!) that pulls them towards zero. This pull is something we call regularization, and it ensures that neither of our parameters a or b gets disproportionately large.

Let's write the SVM code and take advantage of the circuit machinery we have from Chapter 1:

// A circuit: it takes 5 Units (x,y,a,b,c) and outputs a single Unit
var Circuit = function() {
  this.mulg0 = new multiplyGate();
  this.mulg1 = new multiplyGate();
  this.addg0 = new addGate();
  this.addg1 = new addGate();
};
Circuit.prototype = {
  forward: function(x,y,a,b,c) {
    this.ax = this.mulg0.forward(a, x); // a*x
    this.by = this.mulg1.forward(b, y); // b*y
    this.axpby = this.addg0.forward(this.ax, this.by); // a*x + b*y
    this.axpbypc = this.addg1.forward(this.axpby, c); // a*x + b*y + c
    return this.axpbypc;
  },
  backward: function(gradient_top) { // takes pull from above
    this.axpbypc.grad = gradient_top;
    this.addg1.backward(); // sets gradient in axpby and c
    this.addg0.backward(); // sets gradient in ax and by
    this.mulg1.backward(); // sets gradient in b and y
    this.mulg0.backward(); // sets gradient in a and x
  }
}

// SVM class
var SVM = function() {
  this.a = new Unit(1.0, 0.0); 
  this.b = new Unit(-2.0, 0.0);
  this.c = new Unit(-1.0, 0.0);
  this.circuit = new Circuit();
};
SVM.prototype = {
  forward: function(x, y) { // assume x and y are Units
    this.unit_out = this.circuit.forward(x, y, this.a, this.b, this.c);
    return this.unit_out;
  },
  backward: function(label) { // label is +1 or -1
    this.a.grad = 0.0; 
    this.b.grad = 0.0; 
    this.c.grad = 0.0;

    var pull = 0.0;
    if(label === 1 && this.unit_out.value < 1) { 
      pull = 1; // the score was too low: pull up
    }
    if(label === -1 && this.unit_out.value > -1) {
      pull = -1; // the score was too high for a negative example: pull down
    }
    this.circuit.backward(pull); // writes gradient into x,y,a,b,c
    
    // regularization pull
    this.a.grad += -this.a.value;
    this.b.grad += -this.b.value;
  },
  learnFrom: function(x, y, label) {
    this.forward(x, y); // forward pass (set .value in all Units)
    this.backward(label); // backward pass (set .grad in all Units)
    this.parameterUpdate(); // parameters respond to tug
  },
  parameterUpdate: function() {
    var step_size = 0.01;
    this.a.value += step_size * this.a.grad;
    this.b.value += step_size * this.b.grad;
    this.c.value += step_size * this.c.grad;
  }
};

And in Python:

class Circuit:
    def __init__(self):
        self.mulg0 = MultiplyGate()
        self.mulg1 = MultiplyGate()
        self.addg0 = AddGate()
        self.addg1 = AddGate()
        
    def forward(self, x, y, a, b, c):
        self.ax = self.mulg0.forward(a, x)
        self.by = self.mulg1.forward(b, y)
        self.axpby = self.addg0.forward(self.ax, self.by)
        self.axpbypc = self.addg1.forward(self.axpby, c)
        return self.axpbypc
        
    def backward(self, gradient_top):
        self.axpbypc.grad = gradient_top
        self.addg1.backward()
        self.addg0.backward()
        self.mulg1.backward()
        self.mulg0.backward()

class SVM:
    def __init__(self):
        self.a = Unit(1.0, 0.0)
        self.b = Unit(-2.0, 0.0)
        self.c = Unit(-1.0, 0.0)
        self.circuit = Circuit()
        
    def forward(self, x, y):
        self.unit_out = self.circuit.forward(x, y, self.a, self.b, self.c)
        return self.unit_out
        
    def backward(self, label):
        self.a.grad = 0.0
        self.b.grad = 0.0
        self.c.grad = 0.0
        
        pull = 0.0
        if label == 1 and self.unit_out.value < 1:
            pull = 1
        if label == -1 and self.unit_out.value > -1:
            pull = -1
            
        self.circuit.backward(pull)
        
        self.a.grad += -self.a.value
        self.b.grad += -self.b.value
        
    def learn_from(self, x, y, label):
        self.forward(x, y)
        self.backward(label)
        self.parameter_update()
        
    def parameter_update(self):
        step_size = 0.01
        self.a.value += step_size * self.a.grad
        self.b.value += step_size * self.b.grad
        self.c.value += step_size * self.c.grad

By the way, I intentionally structured the code in a modular way, but we could have trained an SVM with much simpler code. Here is really what all of these classes and computations boil down to:

var a = 1, b = -2, c = -1; // initial parameters
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute pull
  var score = a*x + b*y + c;
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1;
  if(label === -1 && score > -1) pull = -1;

  // compute gradient and update parameters
  var step_size = 0.01;
  a += step_size * (x * pull - a); // -a is from the regularization
  b += step_size * (y * pull - b); // -b is from the regularization
  c += step_size * (1 * pull);
}

And in Python:

import random
import math

a = 1; b = -2; c = -1 # initial parameters
for iter in range(400):
    # pick a random data point
    i = math.floor(random.random() * len(data))
    x = data[i][0]
    y = data[i][1]
    label = labels[i]

    # compute pull
    score = a*x + b*y + c
    pull = 0.0
    if label == 1 and score < 1: pull = 1
    if label == -1 and score > -1: pull = -1

    # compute gradient and update parameters
    step_size = 0.01
    a += step_size * (x * pull - a) # -a is from the regularization
    b += step_size * (y * pull - b) # -b is from the regularization
    c += step_size * (1 * pull)

Generalizing the SVM into a Neural Network

Of interest is the fact that an SVM is just a particular type of a very simple circuit. This can be easily extended to more complicated functions. For example, let's write a 2-layer Neural Network that does binary classification. The forward pass will look like this:

// assume inputs x,y
var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

And in Python:

# assume inputs x,y
n1 = max(0, a1*x + b1*y + c1) # activation of 1st hidden neuron
n2 = max(0, a2*x + b2*y + c2) # 2nd neuron
n3 = max(0, a3*x + b3*y + c3) # 3rd neuron
score = a4*n1 + b4*n2 + c4*n3 + d4 # the score

The specification above is a 2-layer Neural Network with 3 hidden neurons (n1, n2, n3) that uses a Rectified Linear Unit (ReLU) non-linearity on each hidden neuron. As you can see, there are now many more parameters involved, which means that our classifier is more complex and can represent more intricate decision boundaries than a simple linear decision rule such as an SVM.

Here is the full training loop for this Neural Network:

// random initial parameters
var a1 = Math.random() - 0.5; // a random number between -0.5 and 0.5
// ... similarly initialize all other parameters to randoms
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute forward pass
  var n1 = Math.max(0, a1*x + b1*y + c1); // activation of 1st hidden neuron
  var n2 = Math.max(0, a2*x + b2*y + c2); // 2nd neuron
  var n3 = Math.max(0, a3*x + b3*y + c3); // 3rd neuron
  var score = a4*n1 + b4*n2 + c4*n3 + d4; // the score

  // compute the pull on top
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1; // we want higher output! Pull up.
  if(label === -1 && score > -1) pull = -1; // we want lower output! Pull down.

  // now compute backward pass to all parameters of the model

  // backprop through the last "score" neuron
  var dscore = pull;
  var da4 = n1 * dscore;
  var dn1 = a4 * dscore;
  var db4 = n2 * dscore;
  var dn2 = b4 * dscore;
  var dc4 = n3 * dscore;
  var dn3 = c4 * dscore;
  var dd4 = 1.0 * dscore; // phew

  // backprop the ReLU non-linearities, in place
  // i.e. just set gradients to zero if the neurons did not "fire"
  var dn3 = n3 === 0 ? 0 : dn3;
  var dn2 = n2 === 0 ? 0 : dn2;
  var dn1 = n1 === 0 ? 0 : dn1;

  // backprop to parameters of neuron 1
  var da1 = x * dn1;
  var db1 = y * dn1;
  var dc1 = 1.0 * dn1;
  
  // backprop to parameters of neuron 2
  var da2 = x * dn2;
  var db2 = y * dn2;
  var dc2 = 1.0 * dn2;

  // backprop to parameters of neuron 3
  var da3 = x * dn3;
  var db3 = y * dn3;
  var dc3 = 1.0 * dn3;

  // add the pulls from the regularization
  da1 += -a1; da2 += -a2; da3 += -a3;
  db1 += -b1; db2 += -b2; db3 += -b3;
  da4 += -a4; db4 += -b4; dc4 += -c4;

  // finally, do the parameter update
  var step_size = 0.01;
  a1 += step_size * da1; 
  b1 += step_size * db1; 
  c1 += step_size * dc1;
  a2 += step_size * da2; 
  b2 += step_size * db2;
  c2 += step_size * dc2;
  a3 += step_size * da3; 
  b3 += step_size * db3; 
  c3 += step_size * dc3;
  a4 += step_size * da4; 
  b4 += step_size * db4; 
  c4 += step_size * dc4; 
  d4 += step_size * dd4;
}

And in Python:

import random
import math

# random initial parameters
a1 = random.random() - 0.5
# ... similarly initialize all other parameters to randoms ...

for iter in range(400):
    # pick a random data point
    i = math.floor(random.random() * len(data))
    x = data[i][0]
    y = data[i][1]
    label = labels[i]

    # compute forward pass
    n1 = max(0, a1*x + b1*y + c1) # activation of 1st hidden neuron
    n2 = max(0, a2*x + b2*y + c2) # 2nd neuron
    n3 = max(0, a3*x + b3*y + c3) # 3rd neuron
    score = a4*n1 + b4*n2 + c4*n3 + d4 # the score

    # compute the pull on top
    pull = 0.0
    if label == 1 and score < 1: pull = 1 # we want higher output! Pull up.
    if label == -1 and score > -1: pull = -1 # we want lower output! Pull down.

    # now compute backward pass to all parameters of the model

    # backprop through the last "score" neuron
    dscore = pull
    da4 = n1 * dscore
    dn1 = a4 * dscore
    db4 = n2 * dscore
    dn2 = b4 * dscore
    dc4 = n3 * dscore
    dn3 = c4 * dscore
    dd4 = 1.0 * dscore

    # backprop the ReLU non-linearities, in place
    dn3 = 0 if n3 == 0 else dn3
    dn2 = 0 if n2 == 0 else dn2
    dn1 = 0 if n1 == 0 else dn1

    # backprop to parameters of neuron 1
    da1 = x * dn1
    db1 = y * dn1
    dc1 = 1.0 * dn1
    
    # backprop to parameters of neuron 2
    da2 = x * dn2
    db2 = y * dn2
    dc2 = 1.0 * dn2

    # backprop to parameters of neuron 3
    da3 = x * dn3
    db3 = y * dn3
    dc3 = 1.0 * dn3

    # add the pulls from the regularization
    da1 += -a1; da2 += -a2; da3 += -a3
    db1 += -b1; db2 += -b2; db3 += -b3
    da4 += -a4; db4 += -b4; dc4 += -c4

    # finally, do the parameter update
    step_size = 0.01
    a1 += step_size * da1 
    b1 += step_size * db1 
    c1 += step_size * dc1
    a2 += step_size * da2 
    b2 += step_size * db2
    c2 += step_size * dc2
    a3 += step_size * da3 
    b3 += step_size * db3 
    c3 += step_size * dc3
    a4 += step_size * da4 
    b4 += step_size * db4 
    c4 += step_size * dc4 
    d4 += step_size * dd4

A more Conventional Approach: Loss Functions

Now that we understand the basics of how these circuits function with data, let's adopt a more conventional approach that you might see elsewhere on the internet and in other guides and books. You won't see people talking too much about force specifications. Instead, Machine Learning algorithms are specified in terms of loss functions (or cost functions, or objectives).

Let's start with an example of a 2-dimensional SVM. We are given a dataset of N examples $(x_{i0}, x_{i1})$ and their corresponding labels $y_i$, which can be either +1 or -1. The SVM loss function is then defined as follows:

$$L = \left[ \sum_{i=1}^N \max(0, -y_i(w_0x_{i0} + w_1x_{i1} + w_2) + 1) \right] + \alpha[w_0^2 + w_1^2]$$

Notice that this expression is never negative, due to the thresholding at zero in the first term and the squaring in the regularization. The idea is that we will want this expression to be as small as possible.

var X = [ [1.2, 0.7], [-0.3, 0.5], [3, 2.5] ] // array of 2-dimensional data
var y = [1, -1, 1] // array of labels
var w = [0.1, 0.2, 0.3] // example: random numbers
var alpha = 0.1; // regularization strength

function cost(X, y, w) {
  
  var total_cost = 0.0; // L, in SVM loss function above
  var N = X.length;
  for(var i = 0; i < N; i++) {
    // loop over all data points and compute their score
    var xi = X[i];
    var score = w[0] * xi[0] + w[1] * xi[1] + w[2];

    // accumulate cost based on how compatible the score is with the label
    var yi = y[i]; // label
    var costi = Math.max(0, - yi * score + 1);
    console.log('example ' + i + ': xi = (' + xi + ') and label = ' + yi);
    console.log('  score computed to be ' + score.toFixed(3));
    console.log('  => cost computed to be ' + costi.toFixed(3));
    total_cost += costi;
  }

  // regularization cost: we want small weights
  reg_cost = alpha * (w[0]*w[0] + w[1]*w[1])
  console.log('regularization cost for current model is ' + reg_cost.toFixed(3));
  total_cost += reg_cost;

  console.log('total cost is ' + total_cost.toFixed(3));
  return total_cost;
}

And in Python:

X = [ [1.2, 0.7], [-0.3, 0.5], [3, 2.5] ] # array of 2-dimensional data
y = [1, -1, 1] # array of labels
w = [0.1, 0.2, 0.3] # example: random numbers
alpha = 0.1 # regularization strength

def cost(X, y, w):
    total_cost = 0.0 # L, in SVM loss function above
    N = len(X)
    for i in range(N):
        # loop over all data points and compute their score
        xi = X[i]
        score = w[0] * xi[0] + w[1] * xi[1] + w[2]
        
        # accumulate cost based on how compatible the score is with the label
        yi = y[i] # label
        costi = max(0, - yi * score + 1)
        print(f'example {i}: xi = ({xi}) and label = {yi}')
        print(f'  score computed to be {score:.3f}')
        print(f'  => cost computed to be {costi:.3f}')
        total_cost += costi

    # regularization cost: we want small weights
    reg_cost = alpha * (w[0]*w[0] + w[1]*w[1])
    print(f'regularization cost for current model is {reg_cost:.3f}')
    total_cost += reg_cost

    print(f'total cost is {total_cost:.3f}')
    return total_cost

A cost function is an expression that measures how bad your classifier is. When the training set is perfectly classified, the cost (ignoring the regularization) will be zero. The majority of cost functions in Machine Learning consist of two parts: 1. A part that measures how well a model fits the data, and 2: Regularization, which measures some notion of how complex or likely a model is.
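If you run the Python version above on the example X, y, w values, you should see output along these lines (numbers computed by hand for this particular w):

cost(X, y, w)
# example 0: score 0.560 => cost 0.440 (positive example, but short of the +1 margin)
# example 1: score 0.370 => cost 1.370 (negative example scored dangerously high)
# example 2: score 1.100 => cost 0.000 (comfortably past the margin)
# regularization cost is 0.005, so the total cost is 1.815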

Conclusion

We've covered the essential foundations of how neural networks learn:

  • Real-valued circuits: The abstraction of neural networks as computational graphs
  • Gradients: Understanding forces that guide learning
  • Backpropagation: The efficient algorithm for computing these forces
  • Learning: How to update parameters to improve predictions

The beauty of this approach is its universality. Whether you're building a simple linear classifier or a deep neural network with millions of parameters, the same principles apply. Each gate computes local gradients, the chain rule connects them, and backpropagation efficiently propagates learning signals through the entire system.

Next Steps: To truly master these concepts, implement them from scratch. Build circuits with different gate types, experiment with different learning rates, and apply them to real datasets. The journey from theory to practice is where deep understanding is forged.

Gratitude: This guide stands on the shoulders of Andrej Karpathy's brilliant exposition. His ability to make complex concepts accessible through code and intuition has inspired countless researchers and practitioners. If you found this guide helpful, I encourage you to explore his other works and the broader field of neural networks and deep learning.