Answer to Question 1-1


The activation function in a deep architecture needs to be non-linear because a network built only from linear operations can represent only linear input-output mappings. The composition of linear maps is itself linear: $W_2(W_1 x) = (W_2 W_1) x$, so stacking any number of linear layers collapses to a single linear layer. Real-world data, by contrast, often contains non-linear relationships that such a model cannot capture.

For example, consider a simple regression model that predicts the price of a house from its size. A linear model can capture this relationship only if the price grows linearly with the size; in reality, the relationship is likely non-linear, with diminishing returns as the size of the house increases.

By using a non-linear activation function, such as a sigmoid or a tanh function, the deep architecture can capture these non-linear relationships and learn more complex patterns in the data. This allows the model to make more accurate predictions and generalize better to new, unseen data. 
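
To make the collapse argument concrete, here is a minimal numpy sketch (all shapes and values are illustrative): two stacked linear layers equal a single linear layer, while inserting a tanh in between breaks the equivalence.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 4))    # 5 samples, 4 features
    W1 = rng.normal(size=(4, 6))   # first linear layer
    W2 = rng.normal(size=(6, 2))   # second linear layer

    # Without a non-linearity, two layers collapse to one linear map:
    print(np.allclose((x @ W1) @ W2, x @ (W1 @ W2)))   # True

    # Inserting tanh between the layers breaks the collapse:
    print(np.allclose(np.tanh(x @ W1) @ W2, x @ (W1 @ W2)))   # False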





****************************************************************************************
****************************************************************************************




Answer to Question 1-2


Layer Normalization (LN), Batch Normalization (BN), and Instance Normalization (IN) are all normalization techniques used in neural networks to stabilize training and improve the performance of the model.

1. Layer Normalization (LN):
In LN, the inputs are normalized per sample: for each individual example, the mean and variance are computed across the feature dimension (for images, across all channels and spatial positions; for sequences, typically across the features of each position), the inputs are standardized by subtracting the mean and dividing by the standard deviation, and the result is passed through a learnable affine transformation (a scale and a shift). Because the statistics are computed within a single example, LN does not depend on the mini-batch and behaves identically at training and test time.

2. Batch Normalization (BN):
In BN, the inputs are normalized across the mini-batch: for each feature (channel), the mean and variance are computed over all examples in the mini-batch (and, for images, over the spatial positions as well), the inputs are standardized, and the result is passed through a learnable affine transformation. BN therefore requires a sufficiently large mini-batch to estimate reliable statistics during training, and uses running averages of those statistics at test time.

3. Instance Normalization (IN):
In IN, the inputs are normalized per sample and per channel: the mean and variance are computed over the spatial positions of each individual feature map, the inputs are standardized, and the result is passed through a learnable affine transformation. Like LN, IN does not depend on the mini-batch; it is commonly used in image stylization tasks.

These normalization techniques help make training more stable by reducing internal covariate shift: the change in the distribution of a layer's inputs during training, which can make the network unstable and difficult to train. By normalizing the inputs, these techniques stabilize the training process and improve the performance of the model.
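
The three techniques differ only in the axes over which the mean and variance are computed. A minimal numpy sketch, assuming activations of shape (N, C, H, W) and omitting the learnable affine transformation:

    import numpy as np

    x = np.random.default_rng(0).normal(size=(8, 4, 16, 16))  # (N, C, H, W)

    def normalize(x, axes, eps=1e-5):
        mean = x.mean(axis=axes, keepdims=True)
        var = x.var(axis=axes, keepdims=True)
        return (x - mean) / np.sqrt(var + eps)  # affine scale/shift omitted

    bn = normalize(x, (0, 2, 3))   # BN: per channel, across batch and space
    ln = normalize(x, (1, 2, 3))   # LN: per sample, across channels and space
    inn = normalize(x, (2, 3))     # IN: per sample and channel, across space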





****************************************************************************************
****************************************************************************************




Answer to Question 1-3
 The image provided is a diagram of a recurrent neural network (RNN) with a single input unit and a single output unit. The network has a single hidden unit and uses a sigmoid activation function. The input sequence is of even length, and all biases are set to zero.

At each time step, the input unit receives an integer scalar input, and the hidden unit combines this input with its own activation from the previous time step; the output unit reads the hidden unit's activation. Because the output unit uses a sigmoid activation, the network produces a value between 0 and 1 at the final time step, which is interpreted as the binary output (0 or 1).

The weights of the network are not specified in the image, so we cannot determine the exact function computed by the output unit at the final time step. However, we can make some general observations about the network.

With zero biases and an initial hidden state of zero, the computation unrolls as $h_t = \sigma(w_x x_t + w_h h_{t-1})$ with $h_0 = 0$, and the final output is $y_T = \sigma(w_o h_T)$. Note that a zero pre-activation does not yield a zero activation: $\sigma(0) = 0.5$, so the hidden state is generally non-zero from the first step onward.

In the degenerate case where all weights are zero, every pre-activation is zero, so $h_t = 0.5$ at every step and the final output is the constant $\sigma(0) = 0.5$, independent of the input sequence: the output unit then computes a constant function. For any other weight setting, the function computed at the final time step depends on the specific weight values.
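
To make the recurrence concrete, here is a minimal Python sketch; the weight values are placeholders, since the question does not specify them.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # w_x, w_h, w_o are illustrative placeholders, not values from the question.
    def run_rnn(inputs, w_x=1.0, w_h=1.0, w_o=1.0):
        h = 0.0                             # hidden state before the first step
        for x in inputs:
            h = sigmoid(w_x * x + w_h * h)  # biases are zero per the problem
        return sigmoid(w_o * h)             # output at the final time step

    print(run_rnn([1, -1, 1, -1]))          # an even-length integer sequence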





****************************************************************************************
****************************************************************************************




Answer to Question 1-4


a. At the training phase, the inputs to the network at each time step are the tokens of the input sequence, and the targets are the same sequence shifted by one position, so that at every step the network is trained to predict the next token (teacher forcing). For example, if the input sequence is "the cat sat on the mat", the inputs at successive time steps are ("the", "cat", "sat", "on", "the") and the corresponding targets are ("cat", "sat", "on", "the", "mat").
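
A minimal sketch of this input/target shifting (the tokenization is illustrative):

    # Next-token targets for language-model training.
    tokens = ["the", "cat", "sat", "on", "the", "mat"]
    inputs = tokens[:-1]   # what the network sees at each step
    targets = tokens[1:]   # what it should predict at each step
    for x, y in zip(inputs, targets):
        print(f"input: {x!r:8} target: {y!r}")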





****************************************************************************************
****************************************************************************************




Answer to Question 1-5


The convolutional neural network / 2D time-delay neural network defined by the layers in the left column of the table produces the following output shapes, assuming stride-1, zero-padded 3 x 3 convolutions and stride-2, 2 x 2 pooling:

1. Input: (32, 32, 3)
2. CONV3-8: (32, 32, 8)
3. Leaky ReLU: (32, 32, 8)
4. Pool-2: (16, 16, 8)
5. BatchNorm: (16, 16, 8)
6. CONV3-16: (16, 16, 16)
7. Leaky ReLU: (16, 16, 16)
8. Pool-2: (8, 8, 16)
9. Flatten: (1024)
10. FC-10: (10)

The number of parameters at each layer (weights plus biases) is as follows:

1. Input: 0
2. CONV3-8: 3 x 3 x 3 x 8 + 8 = 224
3. Leaky ReLU: 0
4. Pool-2: 0
5. BatchNorm: 2 x 8 = 16 (one scale and one shift per channel)
6. CONV3-16: 3 x 3 x 8 x 16 + 16 = 1,168
7. Leaky ReLU: 0
8. Pool-2: 0
9. Flatten: 0
10. FC-10: 1,024 x 10 + 10 = 10,250

The receptive field at each layer, measured with respect to the input and computed with the recurrence r_out = r_in + (k - 1) x j_in (where j_in is the product of the strides of all preceding layers), is as follows:

1. Input: 1 x 1
2. CONV3-8: 3 x 3
3. Leaky ReLU: 3 x 3 (element-wise layers leave the receptive field unchanged)
4. Pool-2: 4 x 4
5. BatchNorm: 4 x 4
6. CONV3-16: 8 x 8
7. Leaky ReLU: 8 x 8
8. Pool-2: 10 x 10
9. Flatten: the entire 32 x 32 input
10. FC-10: the entire 32 x 32 input
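
As a cross-check of the arithmetic above, here is a short sketch that recomputes the parameter counts and receptive fields under the same assumptions (stride-1, zero-padded 3 x 3 convolutions; stride-2, 2 x 2 pooling; activations omitted since they change neither count):

    layers = [
        ("CONV3-8",   {"kind": "conv", "k": 3, "s": 1, "c_in": 3, "c_out": 8}),
        ("Pool-2",    {"kind": "pool", "k": 2, "s": 2}),
        ("BatchNorm", {"kind": "bn", "c": 8}),
        ("CONV3-16",  {"kind": "conv", "k": 3, "s": 1, "c_in": 8, "c_out": 16}),
        ("Pool-2",    {"kind": "pool", "k": 2, "s": 2}),
    ]

    rf, jump = 1, 1  # receptive field and effective stride at the input
    for name, cfg in layers:
        if cfg["kind"] == "conv":
            params = cfg["k"] ** 2 * cfg["c_in"] * cfg["c_out"] + cfg["c_out"]
        elif cfg["kind"] == "bn":
            params = 2 * cfg["c"]        # one scale and one shift per channel
        else:
            params = 0
        if cfg["kind"] in ("conv", "pool"):
            rf += (cfg["k"] - 1) * jump  # standard receptive-field recurrence
            jump *= cfg["s"]
        print(f"{name:10} params={params:6d} rf={rf}x{rf}")

    print("FC-10      params=", 8 * 8 * 16 * 10 + 10)  # 10,250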





****************************************************************************************
****************************************************************************************




Answer to Question 2-1


The correct answer to the question is:

[3] Vanishing gradient causes deeper layers to learn more slowly than earlier layers

Explanation:

The vanishing gradient problem occurs when the gradients of the loss with respect to the weights become very small as they are propagated backward through the network, effectively preventing the affected layers from learning. Because backpropagation multiplies one local derivative per layer, the shrinkage compounds geometrically with depth: with sigmoid activations, for instance, each layer contributes a factor of at most $0.25$, so ten such layers can scale a gradient by up to $0.25^{10} \approx 10^{-6}$. The other options provided in the question are either false or not directly related to the vanishing gradient problem.





****************************************************************************************
****************************************************************************************




Answer to Question 2-2


The correct answer to the question is:

[0] The size of every convolutional kernel

Explanation:

In a convolutional neural network, the receptive field of a unit is the region of the input that can influence its activation. A larger convolutional kernel lets each output unit see a larger neighborhood of the layer below, and this effect compounds as layers are stacked, so the kernel size directly increases the size of the receptive field. The number of channels of the convolutional kernels and the activation function of each layer have no effect on the spatial extent of the receptive field.





****************************************************************************************
****************************************************************************************




Answer to Question 2-3


The answer to the question is:

(True)

Explanation:

Assuming there is no bias, dividing the weight vector $W$ by 2 won't change the test accuracy. The model's prediction is obtained by thresholding $\sigma(W \cdot x)$ at 0.5, which is equivalent to checking the sign of the logit $W \cdot x$. Halving $W$ halves every logit, which changes the predicted probabilities, but never their sign; every example therefore keeps the same predicted class, and the test accuracy is unchanged.
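
A quick numeric check of this claim on synthetic data (no bias term):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))   # 100 synthetic examples, 5 features
    w = rng.normal(size=5)

    def predict(X, w):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # sigmoid of the logit
        return (p >= 0.5).astype(int)       # threshold 0.5 <=> logit >= 0

    print(np.array_equal(predict(X, w), predict(X, w / 2)))  # True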





****************************************************************************************
****************************************************************************************




Answer to Question 2-4


The correct answer to the question is:

[3] f(x) = min(x, 0.5x) if x < 0; f(x) =  min(x, 0.5x) if x >= 0

This function simplifies to $f(x) = \min(x, 0.5x)$: it equals $x$ for $x < 0$ and $0.5x$ for $x \geq 0$, a leaky-ReLU-style piecewise-linear function. It is non-linear, monotonically increasing, and differentiable everywhere except at $x = 0$, which is sufficient for gradient-based training in practice. The other options are not valid activation functions.
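
A minimal sketch of this activation (the sample inputs are illustrative):

    import numpy as np

    # f(x) = min(x, 0.5x): slope 1 for x < 0, slope 0.5 for x >= 0.
    def f(x):
        return np.minimum(x, 0.5 * x)

    print(f(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))  # [-2. -0.5 0. 0.25 1.]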





****************************************************************************************
****************************************************************************************




Answer to Question 2-5


To reduce model overfitting, the following methods can be used:

1. Data augmentation: This method involves generating new training data by applying various transformations to the existing data. This can help to increase the diversity of the training data and prevent the model from memorizing the training data.
2. Dropout: This method randomly "drops out" a subset of the neurons during training, which prevents the model from relying too heavily on any one neuron and reduces overfitting (a minimal sketch follows after this list).
3. Batch normalization: This method involves normalizing the inputs to each layer of the model. This can help to stabilize the learning process and prevent the model from overfitting to the training data.
4. Using Adam instead of SGD: Adam is an optimization algorithm that adapts per-parameter learning rates during training. It can speed up convergence, but it is an optimizer rather than a regularizer, and it does not by itself reduce overfitting.

Therefore, the correct answer is:

[true] Data augmentation
[true] Dropout
[true] Batch normalization
[false] Using Adam instead of SGD 
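
To make the dropout mechanism concrete, here is a minimal sketch of inverted dropout; the rate, shapes, and seed are illustrative assumptions:

    import numpy as np

    def dropout(x, rate=0.5, training=True, rng=np.random.default_rng(0)):
        if not training:
            return x                        # no-op at test time
        mask = rng.random(x.shape) >= rate  # keep each unit with prob 1 - rate
        return x * mask / (1.0 - rate)      # rescale to preserve expectation

    h = np.ones((2, 4))
    print(dropout(h))                   # some units zeroed, survivors doubled
    print(dropout(h, training=False))   # unchanged at inference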





****************************************************************************************
****************************************************************************************




Answer to Question 2-6


The question asks: during backpropagation, as the gradient flows backward through a sigmoid, what will the gradient always do? (Hint: the derivative of the sigmoid $y = f(x)$ is $y(1-y)$.) The options are: increase in magnitude, maintain polarity; increase in magnitude, reverse polarity; decrease in magnitude, maintain polarity; decrease in magnitude, reverse polarity.

The correct answer is:

Decrease in magnitude, maintain polarity

Explanation:

During backpropagation, the upstream gradient is multiplied by the local derivative $y(1-y)$. Since $y \in (0, 1)$, this factor is always positive, so the sign (polarity) of the gradient is preserved; and it is at most $0.25$ (attained at $y = 0.5$), so the magnitude of the gradient can only decrease.
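
A quick numeric check of the bound on the local gradient:

    import numpy as np

    # The local derivative y * (1 - y) is positive and at most 0.25, so the
    # upstream gradient keeps its sign and shrinks in magnitude.
    y = 1.0 / (1.0 + np.exp(-np.linspace(-10, 10, 10001)))
    local_grad = y * (1 - y)
    print(local_grad.min() > 0)  # True: polarity is maintained
    print(local_grad.max())      # ~0.25: magnitude can only decrease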





****************************************************************************************
****************************************************************************************




Answer to Question 3-1


a. For attention layer $l$ head $h$ to attend to the previous token position $n$ the most at decoding step $n+1$, the attention weight that the head assigns to position $n$ must be the largest entry of its attention distribution $a^l_{h, n+1}$. Since the attention weights are a softmax over query-key dot products, this is equivalent to requiring that the query at step $n+1$ has a strictly larger dot product with the key at position $n$ than with the key at any other position $m$: $q^{l,h}_{n+1} \cdot k^{l,h}_n > q^{l,h}_{n+1} \cdot k^{l,h}_m$ for all $m \neq n$.

b. To fulfill this condition for arbitrary sequences, the queries and keys, which are computed from the layer inputs $X^l$, must encode token position: attending to "the previous position" is a purely positional pattern, independent of token identity. If $X^l$ carries positional information (e.g., from positional embeddings), the head can form queries and keys whose dot product is maximized exactly when the key position is one less than the query position, regardless of which tokens appear in the sequence.
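
As an illustration of the condition in (a), here is a toy sketch (the key and query vectors are made up) showing that the position whose key has the largest dot product with the query receives the largest softmax attention weight:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8
    keys = rng.normal(size=(5, d))                       # keys, positions 0-4
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
    q = 10.0 * keys[3]                # a query aligned with the key at n = 3

    scores = keys @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax attention distribution
    print(weights.argmax())           # 3: the aligned position dominates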





****************************************************************************************
****************************************************************************************




Answer to Question 3-2
The question concerns a Transformer language model with $L$ layers, where at decoding step $n+1$ some attention head $h$ attends most strongly to position $k+1$, with $k < n$ an earlier position holding the same token as position $n$ ($t_k = t_n$). A Transformer with only a single attention layer cannot fulfill the condition in (\ref{part:attentioncondition2}) for arbitrary sequences: to match $t_k = t_n$, the query at position $n+1$ would need to know the identity of the token at position $n$, and a single layer has no mechanism to first move that information into position $n+1$'s representation. Two attention heads in different layers can communicate through the residual stream: the earlier head writes information into a position's representation, and the later head reads it back out. In a two-layer model, a first-layer head can copy each token's identity into the representation of the following position; a layer $l > 1$ head at position $n+1$ can then use this "previous token is $t_n$" information as its query to match keys derived from token identities, attending to position $k+1$ for arbitrary sequences.





****************************************************************************************
****************************************************************************************




Answer to Question 4-1


The question describes a neural network with a vocabulary $V$ and a word embedding layer $E: V \rightarrow \mathbb{R}^d$. The word embedding matrix $W_E \in \mathbb{R}^{|V| \times d}$ contains a vectorial representation for each vocabulary word $w \in V$. For an input sequence $S = (s_1, \dots, s_n)$, the embedding layer independently maps each input token $s_k$ to its vectorial representation: $x_k = E(s_k)$.

When $s_k$ is the $i$-th vocabulary word, its embedding is the $i$-th row of $W_E$: rows index vocabulary words, columns index the $d$ embedding dimensions.

The lookup can be written as a matrix-vector multiplication: with $e_i \in \mathbb{R}^{|V|}$ the one-hot vector for the $i$-th vocabulary word, $x_k = W_E^\top e_i$, which selects exactly the $i$-th row of $W_E$.
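
A minimal sketch of this equivalence (sizes and seed are arbitrary):

    import numpy as np

    V, d, i = 6, 4, 2
    W_E = np.random.default_rng(0).normal(size=(V, d))  # |V| x d embeddings

    e_i = np.zeros(V)   # one-hot vector for the i-th vocabulary word
    e_i[i] = 1.0

    print(np.allclose(W_E.T @ e_i, W_E[i]))  # True: the product selects row i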





****************************************************************************************
****************************************************************************************




Answer to Question 4-2


a. For training via gradient descent, the function $g$ must be differentiable. This is because gradient descent relies on the gradient of the loss function with respect to the parameters of the model. If $g$ is not differentiable, it would be impossible to compute the gradient and update the parameters accordingly.

b. The gradient of the loss function with respect to the word embedding matrix $W_E$ can be expressed as:

$\frac{\partial L(f(w), t)}{\partial W_E} = \frac{\partial L(g(E(w)), t)}{\partial g(E(w))} \cdot \frac{\partial g(E(w))}{\partial E(w)} \cdot \frac{\partial E(w)}{\partial W_E}$

The gradient of $g$ with respect to $E(w)$ is $\frac{\partial g}{\partial E(w)}$, which is the only unresolved gradient term in the expression.

c. The gradient of the loss function with respect to the $i$-th element of the $j$-th column of $W_E$ is:

$\frac{\partial L(f(w), t)}{\partial (W_E)_{ij}} = \frac{\partial L(g(E(w)), t)}{\partial g(E(w))} \cdot \frac{\partial g(E(w))}{\partial E(w)} \cdot \frac{\partial E(w)}{\partial (W_E)_{ij}}$

For $i \neq k$, where $k$ is the vocabulary index of the input word $w$, the gradient $\frac{\partial E(w)}{\partial (W_E)_{ij}}$ is zero: entry $(i, j)$ of $W_E$ belongs to the embedding of the $i$-th vocabulary word, and the lookup $E(w)$ only reads row $k$. Therefore, the gradient of the loss with respect to every entry outside row $k$ is also zero.

d. The insight from part (c) signifies that the embedding layer is cheap despite the size of $W_E$. In the forward pass, mapping a token to its embedding is an $O(d)$ row lookup, independent of the vocabulary size; the one-hot matrix-vector product never needs to be materialized. The only cost that scales with $|V|$ is storing the matrix itself, which takes $O(|V| \cdot d)$ memory.

During the backward pass, only the rows of $W_E$ corresponding to words that actually occur in the batch receive a non-zero gradient. The gradient can therefore be computed and applied as a sparse update, costing $O(d)$ per input token rather than $O(|V| \cdot d)$ for a dense gradient, and it can be stored sparsely as well, so the memory overhead of the backward pass scales with the number of distinct tokens in the batch rather than with the vocabulary size.
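
A minimal sketch of the resulting sparse update (the shapes and the upstream gradient are stand-ins):

    import numpy as np

    V, d, k = 6, 4, 2   # k: vocabulary index of the input word
    W_E = np.random.default_rng(0).normal(size=(V, d))
    grad_wrt_embedding = np.ones(d)   # stand-in for dL/dE(w) from backprop

    grad_W_E = np.zeros_like(W_E)
    grad_W_E[k] = grad_wrt_embedding  # scatter into the looked-up row only

    print(np.count_nonzero(grad_W_E.sum(axis=1)))  # 1: a single row is updated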





****************************************************************************************
****************************************************************************************




