In the last post, we learned about Gradients and Jacobians. In this post, we will go through:

  1. The Vector Chain Rule.
  2. The gradient of the neuron activation function.

A small aside:

Do go through the original paper by Terence Parr and Jeremy Howard. While the posts are pretty comprehensive, they do not go into all the details that the paper does. The posts explore concepts that were difficult for me to understand and remember personally, in the hope that someone else finds them useful.

Specifically, go through sections 4.2–5 in the paper. The concepts mentioned in these sections were pretty easy to grasp from the paper itself. The comment section is your kingdom and Google is at your disposal, in case you have questions.

The Vector Chain Rule

While I assume that you are already aware of the chain rule for differentiation, the vector chain rule deserves VIP status.

The golden nugget to take away from this whole ordeal is that the vector chain rule is expressible as a product of Jacobians. This matters because the same rule will be used to find the gradient of the neuron activation function.

Let’s consider the following function:

Eq.1 : Vector function

Introduce two intermediate variables, such that we can replace x² and 4x with those variables. This will make the function look more palatable and tasty. I am told that the function gets closer to a Michelin star with every substitution.

Eq.2 : Variables used for substitution

On substituting for taste, we get:

Eq.3 : Substituted new variables in the Vector function
Eq.4 : Vector chain rule for differentiation

To see the vector chain rule in action, look at the 3rd matrix above. Think of it like this: If we have to differentiate a function and we have introduced two intermediate functions to make our life easier, then:

Eq.5 : Chain rule applied to the first vector function

In this case we have 2 intermediate functions. We differentiate our original function with respect to these and then we multiply that by the derivative of the intermediate function. We do this over all the intermediate functions. To sum up:

Eq.6 : Generalized vector chain rule

Refer to section 4.5.2 in the paper for a more detailed treatment of the vector chain rule. Going back to our original problem, we can express our vector chain rule matrix as the product of 2 Jacobians, as follows:

Eq.7 : Vector chain rule as product of Jacobians
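To make the product of Jacobians concrete, here is a minimal sketch in plain Python. Only the intermediate variables x² and 4x come from the text above. The outer functions (ln and sin) and all the function names are assumptions chosen purely for illustration, and they may not match the function in Eq.1. The sketch multiplies the two Jacobians and compares the result with a finite-difference estimate.

```python
import math

def g(x):
    # Intermediate variables from the text: g1 = x^2 and g2 = 4x.
    return [x ** 2, 4 * x]

def f(gv):
    # Assumed outer functions (illustrative only): f1 = ln(g1), f2 = sin(g2).
    return [math.log(gv[0]), math.sin(gv[1])]

def jacobian_g(x):
    # dg/dx is a 2 x 1 matrix because x is a scalar.
    return [[2 * x], [4.0]]

def jacobian_f(gv):
    # df/dg is 2 x 2 and diagonal because f_i depends only on g_i.
    return [[1 / gv[0], 0.0],
            [0.0, math.cos(gv[1])]]

def matmul(a, b):
    # Plain nested-list matrix product.
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def chain_rule_derivative(x):
    # Vector chain rule: dy/dx = (df/dg) times (dg/dx).
    return matmul(jacobian_f(g(x)), jacobian_g(x))

def numeric_derivative(x, h=1e-6):
    # Central finite difference, used only as a sanity check.
    upper, lower = f(g(x + h)), f(g(x - h))
    return [[(u - l) / (2 * h)] for u, l in zip(upper, lower)]

print(chain_rule_derivative(1.5))
print(numeric_derivative(1.5))
```

The two printed results should agree to several decimal places, which is a quick way to convince yourself that the product of the two Jacobians really is the derivative of the composed function.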

Now we are primed to find the gradient of neuron activation.

The gradient of neuron activation

We can now find the derivative of-

Eq.8 : Activation function

Let’s find the derivative of w.x first-

Eq.9 : Function to differentiate

We can express the dot product of w and x as the sum of their elementwise products. The elementwise product itself is also called the Hadamard product at times.

Eq.10 : Elementwise product operator

We can use the chain rule to find the derivative of this function -

Eq. 11 : Derivative of w.x

The derivative of y according to the chain rule will be:

Eq.12 : Chain rule to find the derivative of y

This follows from our work on the Jacobian. Just for demonstration purposes, consider the case where w and x are both vectors with 3 elements.

Eq.13 : Derivative of w.x

It’s pretty straightforward to surmise that the derivative will be a diagonal matrix with the elements x₁, x₂ and x₃ on the diagonal -

Eq.14 : Solving for the derivative of y

Multiplying a 1 x 3 vector by a 3 x 3 matrix will give us a 1 x 3 vector-

Eq.15 : Multiplying the vectors in Equation 14.
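The steps above can be sketched in a few lines of plain Python. The function names and the sample values are hypothetical. The sketch assumes w and x are 3-element vectors, builds the diagonal Jacobian of the elementwise product with respect to w, and multiplies it by the row vector of ones that comes from differentiating the sum.

```python
def hadamard(w, x):
    # Elementwise (Hadamard) product of two equal-length vectors.
    return [wi * xi for wi, xi in zip(w, x)]

def dot(w, x):
    # The dot product is the sum of the elementwise products.
    return sum(hadamard(w, x))

def jacobian_hadamard_wrt_w(x):
    # d(w * x)/dw is a diagonal matrix with x on the diagonal.
    n = len(x)
    return [[x[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def grad_dot_wrt_w(x):
    # d(sum(u))/du is a row of ones; multiply it by diag(x).
    n = len(x)
    ones = [1.0] * n
    diag_x = jacobian_hadamard_wrt_w(x)
    return [sum(ones[i] * diag_x[i][j] for i in range(n)) for j in range(n)]

w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
print(dot(w, x))
print(grad_dot_wrt_w(x))
```

The gradient that comes out is simply x laid out as a row, which is why the weights drop out of the derivative of w.x with respect to w.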

Now, we have found the derivative of y with respect to w. We also need to optimize b:

Eq.16 : Derivative with respect to b.

At this point, we have the derivative of part of the activation function :

Eq.17 : Activation function again

We know the derivative of w.x + b with respect to both w and b. We now need to find the derivative of the max() component of the activation function. For a thorough mathematical explanation, refer to the paper by Terence and Jeremy.

However, let’s try to treat this intuitively. Consider the function max(0, t), which takes 2 arguments, 0 and t. If t is less than or equal to 0, the output is the constant 0, so we take the derivative to be 0. If t is greater than 0, the output is t itself, so we differentiate t.

To sum up,

Eq.18 : Derivative of the activation function with respect to w.
Eq.19 : Derivative of the activation function with respect to b.
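As a rough check of this result, the sketch below writes the activation as max(0, w.x + b) and returns its gradient with respect to w and b. The function names and sample values are hypothetical, and the derivative at exactly w.x + b = 0 is taken to be 0 by convention, which is an assumption of this sketch.

```python
def dot(w, x):
    # Sum of elementwise products.
    return sum(wi * xi for wi, xi in zip(w, x))

def activation(w, x, b):
    # Neuron activation: max(0, w.x + b).
    return max(0.0, dot(w, x) + b)

def grad_activation(w, x, b):
    # Returns (d activation / dw, d activation / db).
    z = dot(w, x) + b
    if z > 0:
        # max(0, z) equals z here, so the derivatives are x and 1.
        return list(x), 1.0
    # max(0, z) is the constant 0 here, so both derivatives vanish.
    return [0.0] * len(x), 0.0

w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
print(grad_activation(w, x, 0.1))    # neuron is active
print(grad_activation(w, x, -10.0))  # neuron is inactive
```

When the neuron is active the gradient with respect to w is just the input x, and when it is inactive no gradient flows through it at all.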

The gradient of the neural network loss function

Say we have the following inputs to our neural network:

Eq.20 : Inputs and targets to the neural network

The target elements like target(x₁) and target(x₂) are scalars.

This gives us the following cost function:

Eq.21 : Cost function

We then have the following intermediate variables:

Eq.22 : Intermediate variables

The gradient with respect to the weights

In the previous section, we solved for the derivative of the activation function with respect to the weights and the biases:

Eq.23 : Derivative of the activation function with respect to the weights and the biases

Looking at our intermediate variables, we see that u is the same as our activation function from the last section. So, let’s move on to finding the derivative of v.

Eq.24 : Derivative of v with respect to the weights.

The derivative of y with respect to the weights is 0, since y here is the target, which is a constant. We just substituted the derivative of u with respect to w, from our work in the previous section.

Now, we are left to differentiate C with respect to w-

Eq.25 : Differentiating C with respect to w.

Substituting for the partial derivative of v:

Eq.26 : Substituting for the partial derivative of v

After some mathematical acrobatics, we arrive at:

Eq.27 : Derivative of the cost function with respect to the weights

Above, the term in the brackets is the difference between the predicted and the target value. Terming this the error seems more than apt.

Eq.28 : Substituting the difference between the actual and target values as the error

We can similarly compute the derivative with respect to the bias-

Eq.29 : Derivative of the cost function with respect to the bias
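To tie the two gradients together, here is a minimal sketch in plain Python. It assumes the cost is the mean of the squared differences between the target and max(0, w.x + b) over N samples, as in the paper. The function names and the sample data are hypothetical. The analytic gradient is compared against a finite-difference estimate as a sanity check.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def activation(w, x, b):
    return max(0.0, dot(w, x) + b)

def cost(w, b, xs, ys):
    # Assumed cost: mean squared difference between target and activation.
    n = len(xs)
    return sum((y - activation(w, x, b)) ** 2 for x, y in zip(xs, ys)) / n

def cost_gradient(w, b, xs, ys):
    # Analytic gradient: (2/N) * sum of error * x over the active samples.
    n = len(xs)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in zip(xs, ys):
        z = dot(w, x) + b
        if z <= 0:
            continue  # inactive neuron: this sample contributes nothing
        error = z - y  # predicted value minus target value
        for j, xj in enumerate(x):
            grad_w[j] += 2 * error * xj / n
        grad_b += 2 * error / n
    return grad_w, grad_b

def numeric_gradient(w, b, xs, ys, h=1e-6):
    # Central finite differences, used only to check the analytic result.
    grad_w = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + h] + w[j + 1:]
        down = w[:j] + [w[j] - h] + w[j + 1:]
        grad_w.append((cost(up, b, xs, ys) - cost(down, b, xs, ys)) / (2 * h))
    grad_b = (cost(w, b + h, xs, ys) - cost(w, b - h, xs, ys)) / (2 * h)
    return grad_w, grad_b

xs = [[1.0, 2.0], [0.5, -1.0], [-2.0, -1.0]]
ys = [1.0, 0.0, 2.0]
w, b = [0.3, 0.4], 0.1
print(cost_gradient(w, b, xs, ys))
print(numeric_gradient(w, b, xs, ys))
```

Only samples where the neuron is active contribute to the gradient, and each contribution is the error scaled by the input (for the weights) or by 1 (for the bias).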

With this, we come to the end of this series on the matrix calculus required for Deep Learning. I plan to continue writing on the math required for Deep Learning. The next post will probably have more code along with mathematical concepts.

Do reach out if you have any comments, questions or suggestions.

Kaushik Moudgalya

Computer Science Master’s graduate from the University of Montreal, specializing in Machine Learning.