Paper Walkthrough — Matrix Calculus for Deep Learning (Part 2 / 2)
In the last post, we learned about Gradients and Jacobians. In this post, we will go through:
- The Vector Chain Rule.
- The gradient of the neuron activation function.
A small aside:
Do go through the original paper by Terence Parr and Jeremy Howard. While the posts are pretty comprehensive, they do not go into all the details that the paper does. The posts explore concepts that I personally found difficult to understand and remember, in the hope that someone else finds them useful.
Specifically, go through sections 4.2–5 in the paper. The concepts mentioned in these sections were pretty easy to grasp from the paper itself. The comment section is your kingdom and Google is at your disposal, in case you have questions.
The Vector Chain Rule
While I assume that you are already aware of the chain rule for differentiation, the vector chain rule deserves VIP status.
The golden nugget to take away from this whole ordeal is that the vector chain rule is expressible as a product of Jacobians. This matters because the same rule will be used to find the gradient of the neuron activation function.
Let’s consider the following function:
Introduce two intermediate variables, such that we can replace x² and 4x with those variables. This will make the function look more palatable and tasty. I am told that the function gets closer to a Michelin star with every substitution.
On substituting for taste, we get:
To see the vector chain rule in action, look at the 3rd matrix above. Think of it like this: If we have to differentiate a function and we have introduced two intermediate functions to make our life easier, then:
In this case we have 2 intermediate functions. We differentiate our original function with respect to these and then we multiply that by the derivative of the intermediate function. We do this over all the intermediate functions. To sum up:
Refer to section 4.5.2 in the paper for a more detailed treatment of the vector chain rule. Going back to our original problem, we can express our vector chain rule matrix as the product of 2 Jacobians, as follows:
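As a quick sanity check, here is a minimal Python sketch of that Jacobian product. Only the inner pieces x² and 4x are named above, so the outer functions ln and sin are my assumption for illustration, and every helper name below is hypothetical. The sketch multiplies the two Jacobians and compares the result against a finite-difference estimate.

```python
import math

def g(x):
    # Intermediate variables: g1 = x^2 and g2 = 4x.
    return [x ** 2, 4 * x]

def f(gv):
    # Assumed outer functions (illustration only): ln(g1) and sin(g2).
    return [math.log(gv[0]), math.sin(gv[1])]

def jacobian_f_wrt_g(gv):
    # 2 x 2 Jacobian of f with respect to the intermediate variables.
    return [[1.0 / gv[0], 0.0],
            [0.0, math.cos(gv[1])]]

def jacobian_g_wrt_x(x):
    # 2 x 1 Jacobian of the intermediate variables with respect to x.
    return [[2.0 * x],
            [4.0]]

def matmul(a, b):
    # Plain matrix product for small nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def chain_rule_derivative(x):
    # Vector chain rule: (df/dg) times (dg/dx).
    return matmul(jacobian_f_wrt_g(g(x)), jacobian_g_wrt_x(x))

def numeric_derivative(x, eps=1e-6):
    # Central finite difference of f(g(x)), used only as a check.
    hi, lo = f(g(x + eps)), f(g(x - eps))
    return [[(h - l) / (2 * eps)] for h, l in zip(hi, lo)]
```

Calling chain_rule_derivative(1.5) and numeric_derivative(1.5) should give two 2 x 1 results that agree to several decimal places.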
Now we are primed to find the gradient of neuron activation.
The gradient of neuron activation
We can now find the derivative of the neuron activation, max(0, w.x + b).
Let’s find the derivative of w.x first:
We can express the dot product of w and x as the sum of their elementwise products. The elementwise product itself (before summing) is also called the Hadamard product.
We can use the chain rule to find the derivative of this function:
The derivative of y according to the chain rule will be:
This follows from our work on the Jacobian. Just for demonstration purposes, consider the case where w and x are both vectors with 3 elements.
It’s pretty straightforward to see that the derivative of the elementwise product with respect to w will be a diagonal matrix with the elements x₁, x₂ and x₃:
Multiplying a 1 x 3 vector by a 3 x 3 matrix gives us a 1 x 3 vector:
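The same multiplication can be spelled out in a few lines of Python. This is only a sketch with hypothetical helper names: the row of ones is the derivative of the sum with respect to each elementwise product, and diag(x) is the Jacobian of the elementwise product with respect to w.

```python
def hadamard(w, x):
    # Elementwise (Hadamard) product of two equal-length vectors.
    return [wi * xi for wi, xi in zip(w, x)]

def dot(w, x):
    # The dot product is the sum of the elementwise products.
    return sum(hadamard(w, x))

def diag(v):
    # Square matrix with v on the diagonal and zeros elsewhere.
    n = len(v)
    return [[v[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def row_times_matrix(row, matrix):
    # Multiply a 1 x n row vector by an n x m matrix.
    return [sum(row[k] * matrix[k][j] for k in range(len(row)))
            for j in range(len(matrix[0]))]

def grad_dot_wrt_w(w, x):
    dy_du = [1.0] * len(w)   # derivative of sum(u) with respect to u
    du_dw = diag(x)          # Jacobian of u = hadamard(w, x) with respect to w
    return row_times_matrix(dy_du, du_dw)
```

For w = [1.0, 2.0, 3.0] and x = [4.0, 5.0, 6.0], grad_dot_wrt_w returns the elements of x unchanged, which is the 1 x 3 result described above.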
Now, we have found the derivative of y with respect to w. We also need the derivative with respect to the bias b:
At this point, we have the derivative of part of the activation function:
We know the derivative of w.x + b with respect to both w and b. We now need to find the derivative of the max() component of the activation function. For a thorough mathematical explanation, refer to the paper by Terence and Jeremy.
However, let’s try to treat this intuitively. The function max(0, t) has 2 arguments, 0 and t. If t is less than 0, the output is the constant 0, so the derivative is 0. If t is greater than 0, the output is t itself, so we differentiate t.
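To sum up, the derivative of the activation with respect to w is the transpose of x when w.x + b is greater than 0 and a vector of zeros otherwise. With respect to b, it is 1 or 0 under the same conditions.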
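Here is that piecewise rule as a minimal Python sketch. The function names are hypothetical, and returning 0 at exactly w.x + b = 0 is a convention I am assuming, since the derivative of max() is not defined at that point.

```python
def pre_activation(w, b, x):
    # The affine part: w.x + b.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def activation(w, b, x):
    # Neuron activation: max(0, w.x + b).
    return max(0.0, pre_activation(w, b, x))

def activation_gradient(w, b, x):
    # Returns (gradient with respect to w, derivative with respect to b).
    if pre_activation(w, b, x) > 0:
        # max() passes t through, so we differentiate w.x + b itself.
        return list(x), 1.0
    # max() outputs the constant 0 (assumed convention at exactly 0 as well).
    return [0.0] * len(x), 0.0
```

With w = [1.0, -2.0] and x = [3.0, 1.0], a bias of 0.5 keeps the neuron active and the gradient with respect to w is x, while a bias of -2.0 switches it off and both derivatives are zero.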
The gradient of the neural network loss function
Say we have the following inputs to our neural network:
The target elements like target(x₁) and target(x₂) are scalars.
This gives us the following cost function:
We then have the following intermediate variables:
The gradient with respect to the weights
In the previous section, we solved for the derivative of the activation function with respect to the weights and the biases:
Looking at our intermediate variables, we see that u is the same as our activation function from the last section. So, let’s move on to finding the derivative of v.
The derivative of y with respect to the weights is 0, since y here is the target, which is a constant. We just substituted the derivative of u with respect to w, from our work in the previous section.
Now we are left with differentiating C with respect to w:
Substituting for the partial derivative of v:
After some mathematical acrobatics, we arrive at:
Above, the term in the brackets is the difference between the predicted and the target value. Terming this the error seems more than apt.
We can similarly compute the derivative with respect to the bias:
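To check both gradients numerically, here is a minimal Python sketch. It assumes the cost is the mean squared error over N samples, with u = max(0, w.x + b) and v = y - u as the intermediate variables, following the paper's setup. All function names are hypothetical. Samples whose activation is switched off (w.x + b not greater than 0) contribute nothing to either gradient.

```python
def pre_activation(w, b, x):
    # The affine part: w.x + b.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def cost(w, b, xs, ys):
    # Assumed cost: mean of (y - max(0, w.x + b))^2 over all samples.
    n = len(xs)
    return sum((y - max(0.0, pre_activation(w, b, x))) ** 2
               for x, y in zip(xs, ys)) / n

def cost_gradient(w, b, xs, ys):
    # Analytic gradient: (2/N) * sum of error * x for w, (2/N) * sum of error for b,
    # where error = predicted - target, summed over active samples only.
    n = len(xs)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        z = pre_activation(w, b, x)
        if z > 0:
            error = z - y
            for j, xj in enumerate(x):
                grad_w[j] += 2.0 * error * xj / n
            grad_b += 2.0 * error / n
    return grad_w, grad_b

def numeric_cost_gradient(w, b, xs, ys, eps=1e-6):
    # Central finite differences, used only to check the analytic result.
    grad_w = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + eps] + w[j + 1:]
        down = w[:j] + [w[j] - eps] + w[j + 1:]
        grad_w.append((cost(up, b, xs, ys) - cost(down, b, xs, ys)) / (2 * eps))
    grad_b = (cost(w, b + eps, xs, ys) - cost(w, b - eps, xs, ys)) / (2 * eps)
    return grad_w, grad_b
```

Comparing cost_gradient with numeric_cost_gradient on a few hand-picked samples is a cheap way to confirm the error-times-input form of the result before trusting it in a training loop.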
With this, we come to the end of this series on the matrix calculus required for Deep Learning. I plan to continue writing on the math required for Deep Learning. The next post will probably have more code along with the mathematical concepts.
Do reach out if you have any comments, questions or suggestions.