Answer to Question 1-1


The activation function in a deep architecture needs to be non-linear because if it were linear, then multiple layers of neurons would not provide any additional computational power. This is because the composition of linear functions is also a linear function. Therefore, if the activation function is linear, then the output of the entire deep architecture would still be a linear function of its inputs, and it would not be able to model complex, non-linear relationships between the inputs and outputs.

To illustrate this point, consider the following figure, which shows a simple deep architecture with two layers of neurons:

![Simple deep architecture](simple_deep_architecture.png)

If the activation function in each neuron is linear, the first layer computes a linear function of the input, and the second layer computes a linear function of the first layer's output. The composition of these two maps is again linear, so the entire deep architecture collapses to a single linear function of its inputs and cannot model complex, non-linear relationships between the inputs and outputs.

On the other hand, if the activation function is non-linear, then the output of each layer of neurons would be a non-linear function of its inputs, and the deep architecture would be able to model complex, non-linear relationships between the inputs and outputs. This is why the activation function in a deep architecture needs to be non-linear.
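This collapse can be verified numerically. In the sketch below, W1 and W2 are arbitrary placeholder weight matrices, not taken from any particular network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with a *linear* (identity) activation; W1 and W2 are
# placeholder weight matrices for illustration only.
W1 = rng.standard_normal((4, 3))   # layer 1: 3 inputs -> 4 units
W2 = rng.standard_normal((2, 4))   # layer 2: 4 units -> 2 outputs
x = rng.standard_normal(3)

two_layer_out = W2 @ (W1 @ x)      # forward pass through both layers
collapsed_out = (W2 @ W1) @ x      # a single equivalent linear layer

# The outputs are identical: the extra layer added no expressive power.
assert np.allclose(two_layer_out, collapsed_out)
```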





****************************************************************************************
****************************************************************************************




Answer to Question 1-2


LayerNorm, BatchNorm, and InstanceNorm are all types of normalization layers used in neural networks to help improve the stability and performance of the training process.

LayerNorm (Layer Normalization) normalizes the activations of a layer for each sample independently: the mean and variance are computed across all of the features of a single sample, and those statistics are used to normalize that sample's activations. Because it does not depend on the batch, LayerNorm works with any batch size and is well suited to both feedforward and recurrent neural networks.

BatchNorm (Batch Normalization) normalizes each feature (channel) using the mean and variance computed across the entire batch of samples. This helps to reduce the "internal covariate shift" that can occur during training, which is the change in the distribution of the inputs to each layer as the weights of the network are updated. BatchNorm can also help to regularize the network and reduce overfitting.

InstanceNorm (Instance Normalization) normalizes each channel of each individual sample, using the mean and variance computed over that sample's spatial dimensions only. Because it removes per-sample, per-channel contrast information, it helps to factor out the "style" of an input image, and it is often used in style transfer and image generation tasks.

All of these normalization layers make training more stable by reducing internal covariate shift, the change in the distribution of each layer's inputs as the network's weights are updated, which can make optimization unstable and difficult. By keeping the distribution of each layer's inputs stable, normalization makes training easier to optimize; it can additionally regularize the network and reduce overfitting.

In summary, the three techniques differ mainly in which axes the normalization statistics are computed over: BatchNorm normalizes each channel across the whole batch, LayerNorm normalizes all features of each sample, and InstanceNorm normalizes each channel of each sample. All three reduce internal covariate shift and can regularize the network, making training more stable and easier to optimize.
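The difference between the three techniques comes down to which axes the statistics are computed over. A minimal numpy sketch, assuming an activation tensor of shape (N, C, H, W) and omitting the learnable scale and shift parameters that the real layers apply after normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3, 4, 4))  # (batch N, channels C, height H, width W)

def normalize(x, axes):
    # Normalize to zero mean / unit variance over the given axes.
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + 1e-5)

batch_norm = normalize(x, axes=(0, 2, 3))     # per channel, across the batch
layer_norm = normalize(x, axes=(1, 2, 3))     # per sample, across all features
instance_norm = normalize(x, axes=(2, 3))     # per sample and per channel
```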





****************************************************************************************
****************************************************************************************




Answer to Question 1-3


The recurrent network in question is a simple recurrent neural network (RNN) with one hidden layer and one output layer. The hidden layer has 3 units and uses a hyperbolic tangent (tanh) activation function. The output layer has 1 unit and uses a sigmoid activation function. There are no biases in the network. The input sequence is of even length and the inputs are integer scalars.

To determine the function computed by the output unit at the final time step, we need to analyze the forward pass of the RNN. At each time step, the hidden layer receives the input and the previous hidden state as inputs. The output of the hidden layer is then passed through a weight matrix to produce the input to the output layer. The output layer then applies the sigmoid activation function to produce the final output.

Let's denote the input sequence as x = [x1, x2, ..., xn], where n is the length of the sequence. Let's denote the weights connecting the input to the hidden layer as Wih, the weights connecting the hidden layer to itself as Whh, and the weights connecting the hidden layer to the output layer as Why. The hidden state at time t is denoted as ht, and the output at time t is denoted as yt.

At the first time step, the hidden state h0 is initialized as a zero vector. The input x1 is then fed into the network, and the hidden state h1 is computed as:

h1 = tanh(Wih * x1 + Whh * h0)

The output y1 is then computed as:

y1 = Sigmoid(Why * h1)

At the second time step, the hidden state h2 is computed as:

h2 = tanh(Wih * x2 + Whh * h1)

The output y2 is then computed as:

y2 = Sigmoid(Why * h2)

This process is repeated for all time steps in the input sequence. At the final time step, the hidden state hn is computed as:

hn = tanh(Wih * xn + Whh * hn-1)

The output yn is then computed as:

yn = Sigmoid(Why * hn)
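The forward pass can be sketched in a few lines of numpy. The weights below are random placeholders; only the shapes match the described network (3 tanh hidden units, 1 sigmoid output unit, no biases):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder weights with the shapes described in the question.
W_ih = rng.standard_normal((3, 1))  # input -> hidden
W_hh = rng.standard_normal((3, 3))  # hidden -> hidden
W_hy = rng.standard_normal((1, 3))  # hidden -> output

def rnn_forward(xs):
    h = np.zeros((3, 1))             # h_0 initialized to zero
    for x in xs:                     # h_t = tanh(Wih * x_t + Whh * h_{t-1})
        h = np.tanh(W_ih * x + W_hh @ h)
    return sigmoid(W_hy @ h).item()  # y_n at the final time step

y = rnn_forward([1, 2, 3, 4])        # an even-length integer sequence
assert 0.0 < y < 1.0                 # the sigmoid output lies in (0, 1)
```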

Since we are interested in the output at the final time step, we can focus on the computation of yn. The sigmoid activation function maps any real-valued number to a value between 0 and 1. Therefore, the output of the network is a probability value between 0 and 1.

The input sequence x is a sequence of integer scalars, so each input can be interpreted as a categorical feature of the sequence, for example the index of a word in a sentence. Under such an interpretation, the output of the network is the probability of some event related to the input sequence, for example the probability that the sentence expresses positive sentiment.

In summary, the recurrent network computes a probability value between 0 and 1 from the input sequence, obtained by repeatedly updating a tanh hidden state and passing the final hidden state through the sigmoid output unit. The specific interpretation of that probability depends on the context in which the network is used.





****************************************************************************************
****************************************************************************************




Answer to Question 1-4


a. During training, the inputs to the network are the actual (ground-truth) words of the sentence; feeding the ground truth at each step is known as teacher forcing.

b. At test time, the ground truth is unavailable, so the input at each time step is the word predicted at the previous time step.





****************************************************************************************
****************************************************************************************




Answer to Question 1-5


Answer:

Layer | Output volume dimension | Number of Parameters | Size of receptive field
Input | (32, 32, 3) | 0 | 1
CONV3-8 | (32, 32, 8) | 3*3*3*8 + 8 = 224 | 3
Leaky ReLU | (32, 32, 8) | 0 | 3
POOL-2 | (16, 16, 8) | 0 | 4
BATCHNORM | (16, 16, 8) | 8*2 = 16 | 4
CONV3-16 | (16, 16, 16) | 3*3*8*16 + 16 = 1168 | 8
Leaky ReLU | (16, 16, 16) | 0 | 8
POOL-2 | (8, 8, 16) | 0 | 10
FLATTEN | (1024) | 0 | 10
FC-10 | (10) | 1024*10 + 10 = 10250 | 32 (entire input)

Explanation:

1. Input layer: The input has a shape of (32, 32, 3) and no parameters. The receptive field is 1, a single input pixel.
2. CONV3-8 layer: 8 filters of size 3x3x3. The number of parameters is (3*3*3)*8 + 8 = 224, where 3*3*3 is the number of weights per filter and 8 is one bias per filter. A 3x3 filter on the input grows the receptive field from 1 to 3.
3. Leaky ReLU layer: elementwise, no parameters; the receptive field stays 3.
4. POOL-2 layer: 2x2 pooling with stride 2, output (16, 16, 8), no parameters. The receptive field grows to 3 + (2-1)*1 = 4, and the effective stride doubles to 2.
5. BATCHNORM layer: 8*2 = 16 parameters (a learnable scale and shift per channel). Elementwise, so the receptive field stays 4.
6. CONV3-16 layer: 16 filters of size 3x3x8. The number of parameters is (3*3*8)*16 + 16 = 1168. With an effective stride of 2, the receptive field grows to 4 + (3-1)*2 = 8.
7. Leaky ReLU layer: no parameters; the receptive field stays 8.
8. POOL-2 layer: output (8, 8, 16), no parameters. The receptive field grows to 8 + (2-1)*2 = 10, and the effective stride becomes 4.
9. FLATTEN layer: no parameters; the receptive field stays 10.
10. FC-10 layer: 10 neurons, each connected to all 8*8*16 = 1024 inputs. The number of parameters is 1024*10 + 10 = 10250. Each output unit sees the entire 32x32 input.

Note that the receptive field is cumulative: r_out = r_in + (k - 1) * j_in, where k is the kernel size of the current layer and j_in is the product of all earlier strides.
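Using the standard cumulative recurrence r_out = r_in + (k - 1) * j_in, where j_in is the product of all earlier strides, the receptive field of this stack can be computed with a short script (the layer list mirrors the architecture above; kernel None marks elementwise layers):

```python
# Each layer as (name, kernel size, stride); kernel None = elementwise op.
layers = [
    ("CONV3-8", 3, 1), ("LeakyReLU", None, 1), ("POOL-2", 2, 2),
    ("BATCHNORM", None, 1), ("CONV3-16", 3, 1), ("LeakyReLU", None, 1),
    ("POOL-2", 2, 2),
]

r, j = 1, 1  # receptive field and cumulative stride ("jump") at the input
trace = []
for name, k, s in layers:
    if k is not None:
        r += (k - 1) * j  # a k-wide window grows the field by (k - 1) * jump
    j *= s
    trace.append((name, r))

# After the second pooling layer each unit sees a 10x10 patch of the input;
# the fully connected layer then sees the entire image.
```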





****************************************************************************************
****************************************************************************************




Answer to Question 2-1


The following is my answer:

[
"Vanishing gradient causes earlier layers (those closer to the input) to learn more slowly than later layers",
"Leaky ReLU is less likely to suffer from vanishing gradients than sigmoid",
"Xavier initialization can help prevent the vanishing gradient problem"
]





****************************************************************************************
****************************************************************************************




Answer to Question 2-2


The size of every convolutional kernel can increase the size of the receptive field in a convolutional neural network. This is because a larger convolutional kernel will cover a larger area of the input image, resulting in a larger receptive field. Therefore, the answer to the first option is (true).

The number of channels in a convolutional kernel does not affect the size of the receptive field. The channel count determines how many features are extracted and how many parameters the layer has, but the spatial extent of the input that each output unit sees depends only on kernel sizes and strides. Therefore, the answer to the second option is (false).

The activation function of each layer does not directly affect the size of the receptive field. The activation function is used to introduce non-linearity into the model and to help the model learn more complex relationships between the input and output. Therefore, the answer to the third option is (false).

The size of the pooling layer can also increase the size of the receptive field. This is because a larger pooling layer will cover a larger area of the input image, resulting in a larger receptive field. Therefore, the answer to the fourth option is (true).

In summary, the size of every convolutional kernel and the size of the pooling layer can increase the size of the receptive field in a convolutional neural network, while the number of channels and the activation function cannot. Therefore, the correct answers are:

* The size of every convolutional kernel (true)
* The number of channels of every convolutional kernel (false)
* The activation function of each layer (false)
* The size of the pooling layer (true)





****************************************************************************************
****************************************************************************************




Answer to Question 2-3


Answer:

True

Explanation:

The weight vector W in logistic regression produces the logit z = W·x, which the sigmoid maps to a probability. Dividing W by 2 halves every logit, not the probabilities themselves, but the sign of the logit is unchanged: W·x > 0 if and only if (W/2)·x > 0. Since the predicted label depends only on whether the probability exceeds 0.5, i.e. on the sign of the logit, every prediction stays the same, and the test accuracy Acc does not change, assuming there is no bias term.
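This invariance is easy to check numerically. X, y, and W below are random placeholders standing in for a test set and learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))       # placeholder test features
y = rng.integers(0, 2, size=100)        # placeholder test labels
W = rng.standard_normal(5)              # placeholder learned weights, no bias

def predict(W, X):
    probs = 1.0 / (1.0 + np.exp(-X @ W))  # sigmoid of the logits X @ W
    return (probs > 0.5).astype(int)      # thresholding = sign of the logit

def acc(W):
    return (predict(W, X) == y).mean()

# sigmoid(z) > 0.5 iff z > 0, and z > 0 iff z / 2 > 0, so halving W
# changes the probabilities but never flips a predicted label.
assert np.array_equal(predict(W, X), predict(W / 2, X))
assert acc(W) == acc(W / 2)
```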





****************************************************************************************
****************************************************************************************




Answer to Question 2-4


Evaluating each candidate piecewise:

1. f(x) = min(2, x): linear below 2 and constant above, so it is non-linear (it has a kink at x = 2) and can be used in practice.
2. f(x) = 3x + 1: affine. Stacking layers with an affine activation collapses to a single affine map (see Question 1-1), so it is not a valid activation.
3. f(x) = min(x, 0.5x): for x < 0, 0.5x > x, so the minimum is x; for x >= 0 it is 0.5x. The function is therefore f(x) = x for x < 0 and f(x) = 0.5x for x >= 0, a leaky-linear unit with a kink at 0, which is non-linear and valid.
4. f(x) = min(x, 0.1x) if x < 0; f(x) = max(x, 0.1x) if x >= 0: for x < 0, min(x, 0.1x) = x, and for x >= 0, max(x, 0.1x) = x, so as written this reduces to the identity f(x) = x, which is linear and not valid. (The likely intended function, the leaky ReLU f(x) = 0.1x for x < 0 and f(x) = x for x >= 0, would be valid.)

Therefore, the correct options are:

* f(x) = min(2, x) (true)
* f(x) = 3x + 1 (false)
* f(x) = min(x, 0.5x) (true)
* f(x) = min(x, 0.1x) if x < 0; max(x, 0.1x) if x >= 0 (false as written; true if the leaky ReLU was intended)
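A quick numerical check separates the genuinely non-linear candidates from the affine ones (an affine activation adds no expressive power when layers are stacked), using the fact that f is affine iff f((a+b)/2) = (f(a)+f(b))/2 for all a, b. The fourth option is entered in its literal piecewise form:

```python
import numpy as np

def is_affine(f, seed=0):
    # Midpoint test: f is affine iff f((a+b)/2) == (f(a)+f(b))/2 for all a, b.
    rng = np.random.default_rng(seed)
    a, b = rng.uniform(-10, 10, size=(2, 1000))
    return np.allclose(f((a + b) / 2), (f(a) + f(b)) / 2)

candidates = {
    "min(2, x)":          lambda x: np.minimum(2, x),
    "3x + 1":             lambda x: 3 * x + 1,
    "min(x, 0.5x)":       lambda x: np.minimum(x, 0.5 * x),
    "option 4 (literal)": lambda x: np.where(x < 0, np.minimum(x, 0.1 * x),
                                             np.maximum(x, 0.1 * x)),
}

# A candidate is usable as an activation only if it is NOT affine.
usable_as_activation = {name: not is_affine(f) for name, f in candidates.items()}
```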





****************************************************************************************
****************************************************************************************




Answer to Question 2-5


The answer is:

[true, true, true, false]

Explanation:

* Data augmentation: This method can reduce overfitting by increasing the amount of training data. It works by applying random transformations to the existing data, such as rotation, scaling, or flipping, to create new synthetic data. This new data can help the model generalize better and reduce overfitting.
* Dropout: This method can reduce overfitting by preventing co-adaptation of feature detectors. It works by randomly dropping out a proportion of the neurons in each layer during training. This forces the remaining neurons to learn more robust features and reduces overfitting.
* Batch normalization: This method can reduce overfitting by normalizing the inputs of each layer. It works by computing the mean and variance of the inputs for each mini-batch and scaling them to have zero mean and unit variance. This can improve the stability and convergence of the training process and reduce overfitting.
* Using Adam instead of SGD: This method may or may not reduce overfitting, depending on the specific situation. Adam is an adaptive learning rate optimization algorithm that can converge faster and more reliably than SGD. However, it may also lead to overfitting if the learning rate is too high or the model is too complex. Therefore, using Adam instead of SGD is not a guaranteed way to reduce overfitting.
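To make the dropout bullet concrete, here is a minimal sketch of inverted dropout, the variant used in practice so that no rescaling is needed at test time:

```python
import numpy as np

def dropout(x, p, rng):
    # Inverted dropout: drop each unit with probability p during training
    # and rescale the survivors so the expected activation is unchanged.
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(100_000)
out = dropout(x, p=0.5, rng=rng)

# Roughly half the units are zeroed, while the mean stays close to 1.
```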





****************************************************************************************
****************************************************************************************




Answer to Question 2-6


The answer is:

[
  "Decrease in magnitude, maintain polarity"
]





****************************************************************************************
****************************************************************************************




Answer to Question 3-1


a) The condition is that the query at the current position $n+1$ places its largest attention weight on position $n$:

$$\text{argmax}_{j \in \{1, \dots, n+1\}} \text{Attention}(j) = n$$

where $\text{Attention}(j)$ is the normalized attention weight on position $j$:

$$\text{Attention}(j) = \frac{\exp(e(j))}{\sum_{j'=1}^{n+1} \exp(e(j'))}$$

and $e(j)$ is the unnormalized attention score:

$$e(j) = \frac{q_{n+1} k_j^T}{\sqrt{d_k}}$$

where $q_{n+1}$ is the query vector of the current position $n+1$, $k_j$ is the key vector of position $j$, and $d_k$ is the dimension of the key vectors.

b) Positional encodings are required for the self-attention mechanism to fulfill this condition for arbitrary sequences. They are added to the input embeddings before the self-attention layers and give the model information about where each token sits in the sequence; only with this positional information can the query at the current position systematically place the highest attention weight on the previous position $n$. Without positional encodings, self-attention is permutation-invariant: it cannot distinguish position $n$ from any other position, and so cannot attend to the previous token position the most.
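The quantities in (a) can be made concrete with a small numpy sketch. Q and K are random placeholders here, so for these untrained vectors the condition need not actually hold:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # numerical stability
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(0)
n, d_k = 5, 8
Q = rng.standard_normal((n + 1, d_k))     # placeholder query vectors
K = rng.standard_normal((n + 1, d_k))     # placeholder key vectors

# Scores of the query at the current position n+1 against every key.
e = Q[-1] @ K.T / np.sqrt(d_k)
attn = softmax(e)                         # attention over positions 1..n+1

# The condition: position n (0-based index n-1) gets the largest weight.
# Trained weights plus positional encodings must make this True.
condition_holds = attn.argmax() == n - 1
```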





****************************************************************************************
****************************************************************************************




Answer to Question 3-2


a) The condition for $t_{k+1}$ to be predicted as the next token when using greedy decoding is that the probability of $t_{k+1}$ must be the highest among all possible tokens. This can be written as:

$$
P(t_{k+1} \mid T) > P(t' \mid T) \quad \forall t' \in V \setminus \{t_{k+1}\}
$$

where $V$ is the vocabulary of the language model.

b) The condition for the layer $l$ attention head $h$ to attend from position $n+1$ to position $k+1$ the most is that the attention weight $a^{l, h}_{n+1, k+1}$ must be the highest among all possible positions. This can be written as:

$$
a^{l, h}_{n+1, k+1} > a^{l, h}_{n+1, j} \quad \forall j \in [1, n+1],\; j \neq k+1
$$

where $a^{l, h}_{i, j}$ is the attention weight from position $i$ to position $j$ in layer $l$ attention head $h$.

c) No, a Transformer model with only a single attention layer cannot fulfill the condition in (b) for arbitrary sequences and any $k < n$ where $t_k = t_n$. To attend from position $n+1$ to position $k+1$, the model must first detect that $t_k$ matches the current token $t_n$ and then shift attention one position forward. In a single layer, each attention score depends only on the (position-encoded) embeddings at the two positions being compared, so position $k+1$ carries no information about its predecessor $t_k$; the match-then-shift computation requires information aggregated by an earlier attention step, and therefore at least two layers.

d) The means of communication between two attention heads in a Transformer model is through the residual stream.

* Attention heads of the same layer all read the same residual stream in parallel, and each adds its output back into it; within a layer, heads cannot see each other's outputs.
* Attention heads of successive layers communicate through the residual stream: whatever a head in layer $l$ adds to the stream becomes part of the input that the heads in layer $l+1$ read.

The part of the self-attention that decides what to write to this communication channel is the value-output side: the value vectors, passed through the output projection, are added to the residual stream. The part that decides what to read is the query-key side, which determines from which positions of the residual stream information is gathered.

e) One possible sequence of self-attention operations for a two-layer Transformer model to make the layer $l > 1$ attention head $h$ attend to position $k+1$ for arbitrary sequences (given $t_k = t_n$) is the induction-head mechanism:

1. In the first layer, a "previous-token" head at every position $i$ attends to position $i-1$ and writes a linear mapping of the embedding of token $t_{i-1}$ into the residual stream at position $i$. After this layer, the residual stream at position $k+1$ contains information about $t_k$.
2. In the second layer, head $h$ forms its query at position $n+1$ from the embedding of the current token $t_n$, and forms its keys from the previous-token information written in step 1.
3. The query at position $n+1$ (representing $t_n$) matches the key at position $k+1$ (representing $t_k$) exactly when $t_k = t_n$, so the attention weight $a^{l,h}_{n+1,k+1}$ is the largest and head $h$ attends to position $k+1$, from which it can copy $t_{k+1}$ as the predicted next token.





****************************************************************************************
****************************************************************************************




Answer to Question 4-1


The layer $E$ can be described as a matrix-vector multiplication where the input vector is multiplied with the $i$-th column of the word embedding matrix $W_E$. If $s_k$ is the $i$-th vocabulary word, then the input vector is a one-hot encoded vector with a 1 at the $i$-th index and 0's everywhere else. This input vector is then multiplied with the $i$-th column of $W_E$ to produce the vectorial representation of the input token $s_k$.
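This can be verified directly; V, d, and i below are arbitrary placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                       # placeholder vocabulary size, embedding dim
W_E = rng.standard_normal((d, V))  # word embedding matrix, one column per word

i = 3                              # s_k is the i-th vocabulary word
one_hot = np.zeros(V)
one_hot[i] = 1.0

# The matrix-vector product selects exactly the i-th column of W_E.
assert np.array_equal(W_E @ one_hot, W_E[:, i])
```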





****************************************************************************************
****************************************************************************************




Answer to Question 4-2


a) The function $g$ must be differentiable if we want to perform training via gradient descent.

b) The gradient $\nabla l$ with respect to $W_E$ is given by:

$\nabla l = \frac{\partial l}{\partial g} \frac{\partial g}{\partial E(w)} \frac{\partial E(w)}{\partial W_E}$

c) The gradient $\frac{\partial l}{\partial w_{ij}}$ for $i \neq k$ is 0, since the forward pass only touches the embedding vector $E(w)$ of the input word $w$; all other entries of $W_E$ do not influence the loss and therefore receive zero gradient.

d) The insight from part (c) signifies that the computational and memory cost of the embedding layer during the forward pass is independent of the vocabulary size $V$, since only the single embedding vector $E(w)$ enters the computation. During the backward pass, however, a naive (dense) implementation allocates and updates a gradient for every entry of $W_E$, so the cost scales with $V$ even though only one embedding vector has a non-zero gradient; this is why frameworks offer sparse gradient updates for embedding layers.





****************************************************************************************
****************************************************************************************




