Answer to Question 1-1


Answer:

a) Word embeddings like word2vec take context into account by training a neural network to predict a target word based on its surrounding context words. The network learns to associate similar words with similar vector representations, capturing semantic relationships between words. In contrast, TF-IDF represents words as numerical vectors based on their frequency in a collection of documents, but it does not inherently capture semantic relationships or context. The main difference is that word embeddings learn to capture semantic meaning and context from data, while TF-IDF relies on explicit statistical information about word frequencies.
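As a concrete illustration, here is a minimal TF-IDF computation in pure Python (the toy corpus is invented); note that the resulting weights depend only on frequency statistics, never on word order or context.

```python
import math

# Toy corpus (invented for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are animals".split(),
]

def tf_idf(term, doc, docs):
    """TF-IDF with raw term frequency and log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

# "the" occurs in 2 of 3 documents, so its IDF (and weight) is low;
# "cat" appears in only one document, so it gets a higher weight there.
w_the = tf_idf("the", docs[0], docs)
w_cat = tf_idf("cat", docs[0], docs)
print(w_the, w_cat)
```

Nothing in this computation looks at neighboring words, which is exactly the contextual information a word2vec model is trained to exploit.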





****************************************************************************************
****************************************************************************************




Answer to Question 1-2


Answer:

a. I, love, NLP, a, lot

Explanation:

The sentence "I love NLP a lot" can be segmented into subwords using the given BPE codes as follows:

* Each word is first split into its individual characters.
* The given BPE merge operations are then applied greedily, highest-priority merge first, until no further merge applies.
* For the single-character words "I" and "a", no merges are needed; they are already complete tokens.
* For "love", "NLP", and "lot", the given merges successively combine the characters back into a single subword each, so none of the words ends up split into smaller pieces.

Therefore, the sentence "I love NLP a lot" can be segmented into the subwords "I", "love", "NLP", "a", and "lot".
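The segmentation procedure can be sketched in a few lines of Python. The merge table below is hypothetical (the exam's actual BPE codes are not reproduced here); real BPE first splits a word into characters and then applies the learned merges.

```python
def apply_bpe(word, merges):
    """Greedily apply BPE merges (highest-priority first) to one word."""
    symbols = list(word)
    while True:
        # Find the applicable merge with the highest priority (lowest index).
        best = None
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            if pair in merges:
                rank = merges.index(pair)
                if best is None or rank < best[0]:
                    best = (rank, i)
        if best is None:
            return symbols
        _, i = best
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

# Hypothetical merge table for illustration only.
merges = [("l", "o"), ("lo", "v"), ("lov", "e")]
print(apply_bpe("love", merges))  # all three merges apply: one token
print(apply_bpe("lot", merges))   # only ("l", "o") applies
```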





****************************************************************************************
****************************************************************************************




Answer to Question 1-3


a) The output projection is a matrix of size (10000, 300).
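A quick NumPy check of the shape: the output projection maps the 300-dimensional hidden vector to one score per word in the 10,000-word vocabulary (random values for illustration).

```python
import numpy as np

V, d = 10_000, 300                     # vocabulary size and embedding dimension
rng = np.random.default_rng(0)

hidden = rng.standard_normal(d)        # 300-dim hidden/context vector
W_out = rng.standard_normal((V, d))    # output projection, shape (10000, 300)

logits = W_out @ hidden                # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax over the vocabulary
print(logits.shape, probs.shape)
```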

b) I do not agree with Bart. The context window size affects the trained word vectors. It determines how many words before and after the target word are used for prediction, and changing it changes which co-occurrence statistics the model learns: small windows tend to produce vectors that capture syntactic or functional similarity, while large windows emphasize broader topical similarity. Therefore, even with the same training data, preprocessing steps, and all other hyperparameters held fixed, a different context window size will generally yield different word vectors.





****************************************************************************************
****************************************************************************************




Answer to Question 1-4


a) True. Using subwords can help to process morphologically-rich languages by capturing the morphological structure of words, which can lead to better generalization and fewer out-of-vocabulary words.

b) True. Given the number of occurrences of each word in a corpus, we can derive a unigram language model by estimating the probability of each word as the ratio of its number of occurrences to the total number of words in the corpus.

c) False. One-hot word representations cannot be used to measure the semantic difference between two words, as they only capture whether a word is present or not, and do not provide any information about the meaning or context of the word.

d) True. In latent Dirichlet allocation (LDA), a document is modeled as a distribution over topics, where each topic is a distribution over words.

e) True. In term frequency-inverse document frequency (TF-IDF), the term frequency increases the importance of words that occur frequently in a document, while the inverse document frequency decreases the importance of words that occur in many documents, such as stopwords.

f) False. When hidden Markov models (HMMs) are used for part-of-speech tagging, the hidden states are the part-of-speech tags, not the words. The observable sequence is the sequence of words, and the goal is to infer the most likely sequence of hidden states (i.e. part-of-speech tags) that generated the observed sequence of words.
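The unigram estimate from (b) can be written out directly (toy counts invented for illustration):

```python
from collections import Counter

# Toy corpus statistics (invented counts for illustration).
counts = Counter({"the": 50, "cat": 10, "sat": 5, "mat": 5})
total = sum(counts.values())           # 70 tokens in total

def unigram_prob(word):
    """Maximum-likelihood unigram probability: count(w) / N."""
    return counts[word] / total

def sentence_prob(words):
    """Unigram LM probability of a sentence: product of word probabilities."""
    p = 1.0
    for w in words:
        p *= unigram_prob(w)
    return p

print(unigram_prob("the"))             # 50/70
print(sentence_prob(["the", "cat"]))   # (50/70) * (10/70)
```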





****************************************************************************************
****************************************************************************************




Answer to Question 2-1


a) The model is a simple feedforward neural network. The input is a 300-dimensional word embedding vector. The intermediate operations are a fully connected layer with 100 hidden units and a ReLU activation function, followed by a second fully connected layer with 3 output units (one per classification label) and a softmax activation function. The output is a probability distribution over the 3 classification labels. The first fully connected layer's weight matrix has shape (300, 100) with a bias vector of shape (100,); the second fully connected layer's weight matrix has shape (100, 3) with a bias vector of shape (3,).
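A NumPy sketch of this model, with randomly initialized weights for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameter shapes as described: (300, 100) + (100,), then (100, 3) + (3,).
W1, b1 = rng.standard_normal((300, 100)), np.zeros(100)
W2, b2 = rng.standard_normal((100, 3)), np.zeros(3)

def classify(x):
    """x: 300-dim word embedding -> probability distribution over 3 labels."""
    h = np.maximum(0.0, x @ W1 + b1)       # fully connected + ReLU
    logits = h @ W2 + b2                   # fully connected, 3 outputs
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # softmax

probs = classify(rng.standard_normal(300))
print(probs.shape, probs.sum())
```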

b) The model from the previous subquestion is not suitable for the audio utterance classification task for two reasons. First, its input is a single 300-dimensional word embedding vector, whereas the audio task's input is a spectrogram: a variable-length sequence of 80-dimensional frames. Second, the model is designed to classify individual words in isolation, but the audio task requires classifying entire utterances.

c) An improved model for the audio utterance classification task is a convolutional neural network (CNN). The input is a spectrogram of shape (time, 80), i.e. a sequence of 80-dimensional frames. The intermediate operations consist of several convolutional layers with small filters (e.g., 3x3) and a modest number of channels (e.g., 8 or 16), each followed by max pooling to reduce the time-frequency dimensions of the feature maps. Because utterances vary in length, the final feature maps are aggregated with a global pooling step (or flattened after padding to a fixed length) and fed into a fully connected layer with 100 hidden units and a ReLU activation function, followed by a second fully connected layer with 3 output units and a softmax activation function. The output is a probability distribution over the 3 classification labels. The convolutional layers' weight shapes depend on the chosen filter size, channel count, and stride; the final fully connected layer's weight matrix has shape (100, 3) with a bias vector of shape (3,).





****************************************************************************************
****************************************************************************************




Answer to Question 2-2


a) The sequence classification approach is not optimal because it ignores the context of the utterance. For example, in the given dialog the patient says "For an entire week, my knee has been in pain." followed by "And some swelling during this time too.". Looking at the second utterance in isolation, it is unclear what is swelling or what time period is meant; only the preceding utterance supplies that context. A per-utterance classifier therefore cannot reliably label such context-dependent utterances.

A case where the model will certainly make a mistake is when the same utterance appears in two different contexts with two different labels. For example, a short reply such as "Yes." might describe a symptom in one context and confirm a medication in another. Since a context-free sequence classifier must assign the same label to identical inputs, it is guaranteed to mislabel at least one of the two occurrences.

b) I would model the task as a sequence labeling problem because the output label for each utterance depends on the previous utterances. In other words, the label for each utterance is conditioned on the labels of the previous utterances. This is because the dialog act of an utterance can depend on the dialog acts of the previous utterances.

The input is a matrix of dimension number\_of\_utterances $\times$ d, where each row represents an utterance and each column represents a feature of the utterance. The intermediate operations involve encoding the input matrix into a sequence of hidden states using a recurrent neural network (RNN) or a transformer. The output is a sequence of labels, where each label corresponds to the dialog act of the corresponding utterance.

c) A model for the dialog act identification task is as follows:

* Input: A matrix of dimension number\_of\_utterances $\times$ d, where each row represents an utterance and each column represents a feature of the utterance.
* Intermediate operations:
	+ Use an RNN or a transformer to encode the input matrix into a sequence of hidden states.
	+ Apply a softmax activation function to the hidden states to obtain a probability distribution over the labels for each utterance.
* Output: A sequence of labels, where each label corresponds to the dialog act of the corresponding utterance.

The input matrix can be obtained by applying a sentence encoder to each utterance in the dialog. The sentence encoder can be a pre-trained model such as BERT or RoBERTa. The RNN or transformer can be a long short-term memory (LSTM) network or a transformer encoder. The softmax activation function converts the hidden states into a probability distribution over the labels. The label with the highest probability is chosen as the predicted label for the utterance.
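As a minimal sketch of this design, the code below runs a plain NumPy RNN over pre-encoded utterance vectors and emits one label distribution per utterance. All sizes are illustrative, and a real system would use a trained LSTM or transformer encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_hid, L = 4, 16, 8, 5   # utterances, encoder dim, hidden dim, labels

X = rng.standard_normal((T, d))          # one row per utterance (e.g. from a sentence encoder)
Wx = rng.standard_normal((d, n_hid)) * 0.1
Wh = rng.standard_normal((n_hid, n_hid)) * 0.1
Wy = rng.standard_normal((n_hid, L)) * 0.1

def tag(X):
    """Simple (Elman) RNN over utterances; softmax label distribution per step."""
    state = np.zeros(n_hid)
    out = []
    for x in X:
        state = np.tanh(x @ Wx + state @ Wh)   # state carries dialog context forward
        logits = state @ Wy
        e = np.exp(logits - logits.max())
        out.append(e / e.sum())
    return np.array(out)

probs = tag(X)
print(probs.shape)   # (T, L): one label distribution per utterance
```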





****************************************************************************************
****************************************************************************************




Answer to Question 3-1


a. Autoregressive means that the model generates the output sequence one element at a time, using the previously generated elements as context. In the case of the Transformer decoder, this means that it generates the output sequence one token at a time, using the previously generated tokens as context.

b. The Transformer decoder self-attention must be partially masked during training to prevent the model from attending to future tokens when predicting the current one. This is necessary for the model to remain autoregressive: at inference time the future tokens do not exist yet, so the model must not be allowed to use them during training. The masking is typically done by setting the attention scores (the pre-softmax logits) for future positions to negative infinity, which drives their softmax weights to zero.

c. The weights that should be masked out in the decoder self-attention weight matrix are indicated with "x" in the table. These are the weights corresponding to the attention query at position i and the attention key at position j, where j > i. In other words, the weights corresponding to future tokens should be masked out.
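This masking pattern can be verified with a small NumPy sketch (illustrative 4-token sequence, random scores):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4
scores = rng.standard_normal((T, T))       # raw query-key dot products

# Causal mask: position i may not attend to positions j > i.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)   # -inf logits before the softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # strictly upper-triangular entries are exactly 0
```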

d. Let q\_i and k\_i denote the attention query and key at position i, and let α\_Mary denote the attention weights when querying from the word "Mary". Without positional encodings (as assumed here), each query and key depends only on the word's embedding, not on its position. The sequences "John loves Mary" and "Mary loves John" contain the same words, so the set of keys is identical; the dot products between q\_Mary and each key are therefore the same in both sequences, only computed at different positions. Since the softmax is applied over the same set of scores, the weight that "Mary" assigns to each word is unchanged; only its position within α\_Mary is permuted.
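This can be checked numerically. The sketch below uses random word embeddings and no positional encodings; the projection matrices Wq and Wk are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
emb = {w: rng.standard_normal(d) for w in ["John", "loves", "Mary"]}
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d, d))

def attn_from(word, sentence):
    """Attention weights for `word`'s query over the sentence's keys
    (no positional encodings)."""
    q = emb[word] @ Wq
    scores = np.array([q @ (emb[w] @ Wk) for w in sentence])
    e = np.exp(scores - scores.max())
    return dict(zip(sentence, e / e.sum()))

a = attn_from("Mary", ["John", "loves", "Mary"])
b = attn_from("Mary", ["Mary", "loves", "John"])
print(a, b)   # the weight per word is the same in both orders
```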





****************************************************************************************
****************************************************************************************




Answer to Question 3-2


a) Two solutions to the problem of unknown words in the medical domain are:

1. Pre-training (or continuing to pre-train) the model on a large medical corpus, so that it learns representations of medical terms and can generate terms that were rare or absent in the original training data.

2. Using a dictionary or ontology of medical terms to map unknown words to their corresponding medical concepts, so that terms absent from the training data can still be handled as long as they appear in the dictionary or ontology.

b) ROUGE-n is based on n-grams, which are sequences of n words. For example, ROUGE-1 is based on unigrams (single words), ROUGE-2 is based on bigrams (sequences of two words), and so on.

c) The model receives high ROUGE-2 scores despite the non-grammatical output because ROUGE-2 only measures bigram overlap with the reference: the model can generate many reference bigrams, and even repeat them, without producing a fluent sentence. ROUGE is recall-oriented, so repetition is not penalized. To better detect this problem, we can complement ROUGE with precision-oriented metrics such as BLEU, whose higher-order n-grams and brevity penalty punish degenerate output more strongly, or with human evaluation of fluency. To reduce the amount of repetition in the output, we can use decoding techniques such as n-gram blocking (forbidding the same n-gram from being generated twice) or coverage mechanisms, which discourage the model from generating the same sequence of words multiple times in a row.
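ROUGE-n recall can be computed in a few lines of pure Python (the reference and candidate sentences below are invented); note how the repetitive, ungrammatical candidate still recovers half of the reference bigrams.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """ROUGE-n recall: overlapping n-grams / n-grams in the reference."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())   # clipped multiset intersection
    return overlap / max(sum(ref.values()), 1)

ref = "the patient reported knee pain and swelling"
hyp = "knee pain knee pain and swelling reported"   # repetitive, ungrammatical
print(rouge_n(hyp, ref, 2))   # 3 of the 6 reference bigrams are matched
```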





****************************************************************************************
****************************************************************************************




Answer to Question 3-3


a)

An advantage of using BERT with CTC for machine translation is that BERT is a pretrained text encoder that has been shown to create good representations for text inputs. This means that it can potentially generate high-quality representations of the source language text, which can then be used to translate it into the target language.

A disadvantage of this approach is that CTC assumes a monotonic alignment between input and output and cannot produce an output longer than the input. These assumptions are reasonable for automatic speech recognition, where CTC is typically used, but they are violated in machine translation, where words are frequently reordered and the target sentence can be longer than the source.

b)

One way to improve the performance of the model using BERT as the encoder is to replace the CTC loss function with a loss function that is more appropriate for machine translation, such as the cross-entropy loss. This would allow the model to better learn the relationships between the source and target languages, and potentially improve its translation performance. Another way to improve the model is to fine-tune the pretrained BERT model on a machine translation task, which would allow it to better adapt to the specific requirements of the translation task. Additionally, using a sequence-to-sequence architecture with an attention mechanism could also help the model to better model the relationships between the source and target languages.





****************************************************************************************
****************************************************************************************




Answer to Question 3-4


a) A model for the text-to-SQL task can be designed using a sequence-to-sequence architecture with attention. The input to the model would be the natural language question, and the output would be the corresponding SQL query. The model can be trained on the 30,000 examples provided, with the goal of minimizing the cross-entropy loss between the predicted SQL query and the ground truth SQL query.

The encoder can be a recurrent neural network (RNN) such as a long short-term memory (LSTM) or a gated recurrent unit (GRU), which takes the sequence of words in the question as input and generates a sequence of hidden states. The decoder can also be an RNN, which takes the sequence of hidden states generated by the encoder as input and generates the sequence of tokens in the SQL query. At each time step, the decoder can attend to the sequence of hidden states generated by the encoder, allowing it to focus on different parts of the question as it generates the SQL query.


b) To adapt the model to handle unanswerable questions, we can add an additional output token, such as "INVALID", to the decoder vocabulary. During training, if the input question is unanswerable, the target SQL query can be set to this token. During inference, if the model predicts the "INVALID" token, it can indicate that the question is unanswerable.

Additionally, we can add an auxiliary classification head trained with a binary cross-entropy loss, where the target label is 1 if the question is answerable and 0 if it is not. Trained alongside the sequence generation loss, this head teaches the model to distinguish answerable from unanswerable questions explicitly.

Finally, we can add a validation step during inference, where the predicted SQL query is executed against the database. If the query fails to execute (for example, because it references a non-existent table or column), the model can fall back to reporting that the question is unanswerable.





****************************************************************************************
****************************************************************************************




Answer to Question 4-1


Answer:

a) The advantage of this approach is that it allows the BERT model to be adapted to a small dataset with far fewer trainable parameters. Because the pre-trained BERT weights stay frozen, the risk of overfitting, and of diverging catastrophically from the pre-trained solution, is reduced, while the small adapter modules can still learn features specific to the task at hand.

b) The adapters should be inserted after the output of the self-attention sublayer (and, in the standard adapter design, also after the feed-forward sublayer) in each BERT layer, before the residual addition. The self-attention sublayer captures the context and dependencies between the input tokens, so an adapter placed directly after it can learn task-specific transformations of those contextual representations without modifying the frozen pre-trained weights.

c) To add an adapter, we first project the output of the self-attention layer down to 256 dimensions with a linear projection, apply a non-linear activation function, and then project back up to the original hidden dimension d with a second linear projection. Each linear projection contributes (input size × output size) weights plus one bias per output unit, and the activation function itself has no parameters. The down-projection therefore adds d × 256 + 256 parameters, the up-projection adds 256 × d + d, and each adapter adds 2 × 256 × d + 256 + d parameters in total.
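The count can be checked with a quick calculation; the hidden size d = 768 below is an assumption (BERT-base), not given in the question.

```python
def adapter_params(d_model, bottleneck=256):
    """Parameters in one bottleneck adapter:
    down-projection (d x b weights + b biases) followed by
    up-projection (b x d weights + d biases).
    The nonlinearity between them has no parameters."""
    down = d_model * bottleneck + bottleneck
    up = bottleneck * d_model + d_model
    return down + up

# With BERT-base's hidden size of 768 and the 256-dim bottleneck:
print(adapter_params(768))   # 768*256 + 256 + 256*768 + 768 = 394240
```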





****************************************************************************************
****************************************************************************************




Answer to Question 4-2


Answer:

a) The approach of using sentence representations from the BERT CLS token and pooling (e.g. meanpool or maxpool) the word vectors in the passages or questions as the sentence representation are different in the following ways:

* BERT CLS token: This approach uses the representation of the special [CLS] token as the sentence representation. The [CLS] token is added at the beginning of the input sequence and is used by BERT to generate a single vector that summarizes the entire sequence. This vector is then used as the sentence representation.
* Pooling: In this approach, the word vectors in the passages or questions are pooled together to generate the sentence representation. This can be done using different pooling techniques such as mean pooling or max pooling.

The advantages of using the BERT CLS token as the sentence representation are:

* Trained aggregate: The [CLS] representation is produced by attending over the entire sequence and is trained (through the next-sentence prediction objective in pre-training, and further during fine-tuning) to act as a single-vector summary of the whole input, whereas mean or max pooling aggregates the word vectors with a fixed, untrained rule.
* Task adaptability: During fine-tuning, the [CLS] vector can learn to emphasize whatever aspects of the sequence matter for the task, while pooling treats every token position the same way regardless of the task. (Note that both representations are contextualized, since the pooled word vectors also come from BERT; the difference lies in how the sequence is aggregated.)

b) We need to include irrelevant/negative pairs in the training objective so that the model learns to distinguish relevant from irrelevant question-passage pairs. If it were trained only on positive pairs, the model could trivially minimize the loss by mapping every question and every passage to nearly identical representations, since nothing is ever pushed apart. Such a model would retrieve irrelevant passages as potential answers, reducing retrieval performance.
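A common instantiation of this idea is in-batch negatives with a softmax cross-entropy (contrastive) loss. The NumPy sketch below uses random vectors and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 4, 16                      # batch size, embedding dimension
Q = rng.standard_normal((B, d))   # question embeddings
P = rng.standard_normal((B, d))   # passage embeddings; P[i] is relevant to Q[i]

# Similarity of every question with every passage; the off-diagonal
# entries act as in-batch negatives.
sims = Q @ P.T
e = np.exp(sims - sims.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

# Contrastive loss: push each question toward its own passage (the diagonal)
# and away from every other passage in the batch.
loss = -np.log(probs[np.arange(B), np.arange(B)]).mean()
print(round(float(loss), 4))
```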





****************************************************************************************
****************************************************************************************




