Answer to Question 1-1
a. Word embeddings and TF-IDF both provide representations of words, but they take context into account differently:

Word embeddings (like word2vec): These models learn dense vector representations of words based on their surrounding context in a large corpus of text. The key idea is that words appearing in similar contexts will have similar vector representations. For example, the embeddings of "king" and "queen" would be close in the vector space because they often appear in similar contexts related to royalty, while "king" and "car" would be farther apart. During training, the context window (e.g., the surrounding 5 words) slides over the text corpus, allowing the model to capture the contextual similarities between words.

TF-IDF: This approach represents words in the context of a specific document within a larger collection (corpus) of documents. TF (term frequency) measures how frequently a word appears in a given document, while IDF (inverse document frequency) measures the importance of a word across the entire corpus. Words that are frequent in a specific document but rare in the overall corpus receive high TF-IDF scores. So the context here is the document itself and how unique the word is in the corpus.

Main difference: Word embeddings capture the semantic similarity between words based on their general contextual usage in a large text corpus, allowing words with similar meanings to have similar vector representations. TF-IDF represents the importance of words within the specific context of each document, relative to the full corpus. It doesn't capture semantic similarities, but rather the uniqueness and relevance of words to each document.
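As an illustration of the TF-IDF side, here is a minimal sketch on a toy corpus (the corpus and the exact IDF variant, log(N/df) without smoothing, are assumptions made for this example):

```python
import math

def tf_idf(term, doc, corpus):
    # TF: relative frequency of the term within this document
    tf = doc.count(term) / len(doc)
    # IDF: log of (number of documents / documents containing the term)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "car", "is", "fast"],
    ["the", "queen", "and", "the", "king"],
]

# "the" appears in every document, so its IDF (and hence TF-IDF) is 0
print(tf_idf("the", corpus[0], corpus))              # 0.0
# "king" is frequent in doc 0 but appears in only 2 of 3 documents
print(round(tf_idf("king", corpus[0], corpus), 3))   # 0.081
```

Note how the document-level context drives the score: the same word gets a different TF-IDF value in each document, whereas a word embedding assigns one vector per word for the whole corpus.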





****************************************************************************************
****************************************************************************************




Answer to Question 1-2
Here are the answers to the question:

a. Using the given byte-pair-encoding (BPE) codes, the sentence "I love NLP a lot" would be segmented into the following subwords:

I lo ve N L P a lo t

The subwords are:
I
lo
ve
N
L
P
a
lo
t
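Since the exam's actual BPE merge codes are not reproduced here, the following sketch applies a hypothetical merge table (containing the merges l+o and v+e implied by the segmentation above) greedily, in order, starting from individual characters:

```python
def apply_bpe(word, merges):
    # Start from individual characters and apply learned merges in order.
    symbols = list(word)
    for a, b in merges:
        i = 0
        merged = []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table consistent with the segmentation above
merges = [("l", "o"), ("v", "e")]
print(apply_bpe("love", merges))  # ['lo', 've']
print(apply_bpe("lot", merges))   # ['lo', 't']
print(apply_bpe("NLP", merges))   # ['N', 'L', 'P']
```

"NLP" stays as single characters because no merge in the table applies to it, which matches the segmentation N L P above.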





****************************************************************************************
****************************************************************************************




Answer to Question 1-3
a. The shape of the output projection is (10000, 300). This is because the output projection matrix maps the hidden layer representation back to the vocabulary space. Since the vocabulary size is 10,000 and each word vector has 300 dimensions, the output projection matrix will have a shape of (10000, 300).
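A quick way to verify the shapes is a toy forward pass in NumPy, storing the output projection in the (vocabulary, dimension) convention used above (the random parameters are placeholders, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 300                 # vocabulary size, embedding dimension

W_in = rng.normal(size=(V, d))     # input embeddings (one row per word)
W_out = rng.normal(size=(V, d))    # output projection, shape (10000, 300)

target = 42                        # index of the centre word
h = W_in[target]                   # hidden representation, shape (300,)
scores = W_out @ h                 # one score per vocabulary word, shape (10000,)
probs = np.exp(scores - scores.max())
probs /= probs.sum()               # softmax over the vocabulary

print(W_out.shape)   # (10000, 300)
print(scores.shape)  # (10000,)
```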

b. I do not agree that Bart's training pipeline is necessarily broken. The most likely explanation for finding no difference between context window sizes of 30, 40, and 50 words is that all three windows are larger than most sentences: if the window is clipped at sentence boundaries, windows of 30, 40, and 50 words cover essentially the same context and generate essentially the same training pairs, so the resulting word vectors are indistinguishable.

In the skip-gram model, the context window size determines how many words before and after the target word are considered as its context. Typically, smaller window sizes (e.g., 5-10 words) are used because they capture more meaningful and syntactically related words.

When using very large context window sizes like 30, 40, or 50 words, the model may capture more noise and less relevant information. The distant words in the window may not have a strong semantic relationship with the target word, leading to less informative word embeddings.

Therefore, the lack of difference in the trained word vectors with large window sizes does not necessarily indicate a broken training pipeline. Instead, it suggests that Bart should experiment with smaller context window sizes to obtain more meaningful and informative word embeddings.





****************************************************************************************
****************************************************************************************




Answer to Question 1-4
Here are my answers to the exam question:

a. True. Using subwords instead of whole words can better handle out-of-vocabulary words and capture meaningful subword patterns in morphologically-rich languages.

b. True. A unigram language model is simply the probability distribution over individual words, which can be estimated from their counts in the corpus.

c. False. One-hot representations only encode the identity of words and do not capture any semantic information that could be used to measure semantic differences.

d. False. In LDA, a document is modeled as a distribution over topics, and each topic is a distribution over words.

e. False. In TF-IDF, the inverse document frequency (IDF) decreases the importance of words that occur in many documents, not the term frequency (TF).

f. False. In an HMM for part-of-speech tagging, the hidden states are the part-of-speech tags and the observed states (emissions) are the words.





****************************************************************************************
****************************************************************************************




Answer to Question 2-1
a. Given the requirements, I would propose a simple feedforward neural network model for sentence classification. Here's the model description:

Input: The input is a fixed-length sentence representation obtained by averaging the 300-dimensional pretrained word embedding vectors of the words in the sentence. Because the average is taken over all N words, the input shape is (300,) regardless of the sentence length N.

Intermediate operations:
- The input is passed through a single fully connected (dense) layer with a small number of units, e.g., 32 or 64. The shape of the weight matrix would be (300, 32) or (300, 64), respectively.
- The output of the dense layer is passed through a ReLU activation function for non-linearity.

Output: The output of the dense layer is passed through a final fully connected layer with 3 units, corresponding to the three classification labels (happy, neutral, sad). The shape of the weight matrix would be (32, 3) or (64, 3), depending on the number of units in the previous layer. The output is then passed through a softmax activation function to obtain the predicted probabilities for each class.
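The model above can be sketched as a NumPy forward pass (the 32 hidden units, the random untrained parameters, and the 7-word example sentence are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def average_embedding(word_vectors):
    # Fixed-length sentence representation: mean of the word embeddings
    return word_vectors.mean(axis=0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical parameters: 300-d embeddings, 32 hidden units, 3 classes
W1, b1 = rng.normal(size=(300, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 3)), np.zeros(3)

sentence = rng.normal(size=(7, 300))   # 7 words, 300-d pretrained vectors
x = average_embedding(sentence)        # shape (300,)
h = relu(x @ W1 + b1)                  # shape (32,)
probs = softmax(h @ W2 + b2)           # shape (3,): happy / neutral / sad

print(probs.shape, round(probs.sum(), 6))  # (3,) 1.0
```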

b. The model from the previous subquestion is not suitable for audio utterance classification for the following reasons:

1. Input representation: The previous model expects a fixed-length sentence representation obtained by averaging word embeddings. However, in the audio utterance classification task, the input is a spectrogram, which is a 2D representation of the audio signal. The model needs to handle this different input format.

2. Temporal information: The previous model does not consider the temporal nature of audio data. Spectrograms capture the time-frequency representation of the audio signal, and the model should be able to process and learn from the temporal patterns present in the spectrogram.

c. For the audio utterance classification task, I would propose a convolutional neural network (CNN) model. Here's the model description:

Input: The input is a spectrogram of the audio utterance with 80 frequency bins per frame. Let's assume the spectrogram has a fixed length of T time steps. The input shape would be (T, 80).

Intermediate operations:
- The input spectrogram is passed through one or more convolutional layers to learn local patterns and extract features. For example, we can use a 1D convolutional layer with 64 filters, a kernel size of 5, and a stride of 1. The shape of the weight matrix would be (5, 80, 64).
- The output of the convolutional layer is passed through a ReLU activation function.
- Max pooling is applied to downsample the feature maps and reduce the temporal dimension. For example, we can use a max pooling layer with a pool size of 2 and a stride of 2.
- The output of the max pooling layer is flattened to obtain a 1D feature vector.
- The flattened feature vector is passed through one or more fully connected (dense) layers to learn higher-level representations. For example, we can use a dense layer with 64 units. The shape of the weight matrix would depend on the output size of the flattened feature vector.
- The output of the dense layer is passed through a ReLU activation function.

Output: The output of the dense layer is passed through a final fully connected layer with 3 units, corresponding to the three classification labels (happy, neutral, sad). The output is then passed through a softmax activation function to obtain the predicted probabilities for each class.

This CNN-based model is more suitable for audio utterance classification as it can learn from the spectrogram's time-frequency representation and capture the temporal patterns present in the audio data.
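To check the shapes of the convolution and pooling steps above, here is a minimal NumPy sketch (T = 100 time steps is an arbitrary choice; the convolution uses valid padding and stride 1, and the random kernel stands in for trained filters):

```python
import numpy as np

rng = np.random.default_rng(0)
T, F = 100, 80                    # time steps, frequency bins

spec = rng.normal(size=(T, F))    # input spectrogram, shape (T, 80)
W = rng.normal(size=(5, F, 64))   # conv kernel: width 5, 80 in-channels, 64 filters

# 1D convolution over time (valid padding, stride 1): each output frame is
# the full contraction of a 5-frame window with the kernel
out = np.stack([np.tensordot(spec[t:t + 5], W, axes=([0, 1], [0, 1]))
                for t in range(T - 4)])   # shape (96, 64)
out = np.maximum(0.0, out)                # ReLU

# Max pooling with pool size 2 and stride 2 halves the temporal dimension
pooled = out[: (len(out) // 2) * 2].reshape(-1, 2, 64).max(axis=1)

print(out.shape, pooled.shape)  # (96, 64) (48, 64)
```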





****************************************************************************************
****************************************************************************************




Answer to Question 2-2
Here are my answers to the exam question:

a. The sequence classification approach of using only the utterance itself as input is not optimal because the label of an utterance often depends on the context and prior utterances in the dialog. The model needs this context to disambiguate and correctly label many utterances. A clear case where the model will make a mistake is the "For a week, right?" utterance by the patient. In isolation, this looks like it is referring to the symptom duration. But in the context of the dialog, it is actually asking for clarification about the medication duration that the doctor just mentioned. Without the prior context as input, the model will very likely mislabel this utterance.

b. I would model this as a sequence labeling problem rather than sequence generation. In sequence labeling, the output has the same length as the input, and we assign a label to each input element. This fits the dialog act identification task well, since we want to assign exactly one label to each utterance in the input dialog. Sequence generation is less suitable, because we are not trying to generate a free-form output sequence, but rather to pick a label for each input utterance from a fixed set of options. A generation model could produce an output of the wrong length or tokens outside the label set, whereas sequence labeling guarantees one valid label per utterance and (e.g., with a CRF output layer) can still jointly pick a coherent assignment of labels over the entire dialog.

c. I would design an encoder-tagger architecture for dialog act identification:

Input: The dialog represented as a matrix of dimension number_of_utterances × d, where each row is the d-dimensional embedding of one utterance.

Encoder: Feed the input matrix into a bidirectional RNN or Transformer encoder. This will contextualize the representation of each utterance based on the surrounding utterances in the dialog. The output is a matrix of contextualized utterance representations with the same dimensions as the input.

Tagger: Feed the contextualized representations into a feed-forward network that maps each utterance representation to a vector of label scores, with length equal to the number of possible labels. Apply softmax to get a distribution over labels for each utterance.

Output: A sequence of labels, one for each utterance, corresponding to the highest scoring label from the tagger.

The key aspects are: (1) Utilizing the sentence embeddings to get a fixed-size vector per utterance. (2) Encoding the dialog with a bidirectional RNN or Transformer to contextualize utterance representations based on the dialog context. (3) Tagging each utterance representation separately to get a label sequence output.
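A minimal sketch of this encoder-tagger, using a toy bidirectional vanilla RNN as the encoder (all sizes and the random parameters are placeholders; a trained model would use an LSTM/GRU or Transformer encoder and real sentence embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
n_utt, d, h, n_labels = 5, 16, 8, 4   # dialog length, embedding dim, hidden size, label set

X = rng.normal(size=(n_utt, d))       # one sentence embedding per utterance

def rnn(inputs, Wx, Wh):
    # Simple tanh RNN over the utterance sequence
    state = np.zeros(Wh.shape[0])
    states = []
    for x in inputs:
        state = np.tanh(x @ Wx + state @ Wh)
        states.append(state)
    return np.stack(states)

Wx_f, Wh_f = rng.normal(size=(d, h)), rng.normal(size=(h, h))
Wx_b, Wh_b = rng.normal(size=(d, h)), rng.normal(size=(h, h))

fwd = rnn(X, Wx_f, Wh_f)                  # left-to-right dialog context
bwd = rnn(X[::-1], Wx_b, Wh_b)[::-1]      # right-to-left dialog context
ctx = np.concatenate([fwd, bwd], axis=1)  # contextualized utterances, (n_utt, 2h)

W_tag = rng.normal(size=(2 * h, n_labels))
scores = ctx @ W_tag                      # label scores per utterance
labels = scores.argmax(axis=1)            # one label per utterance

print(ctx.shape, labels.shape)  # (5, 16) (5,)
```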





****************************************************************************************
****************************************************************************************




Answer to Question 3-1
Here are the answers to the subquestions:

a. "Autoregressive" in the context of the Transformer decoder means that the model generates the output sequence token by token, and each generated token is conditioned on the previously generated tokens. In other words, the decoder predicts the next token based on the tokens it has already produced.

b. The Transformer decoder self-attention must be partially masked out during training to prevent the decoder from attending to future positions in the output sequence. This is necessary because during training, the decoder has access to the entire target sequence, but during inference, it generates the output sequence token by token. Masking ensures that at each position, the decoder can only attend to the positions up to and including the current position, mimicking the behavior during inference.

c. The masked out weights are indicated with "x" below, where columns are query positions and rows are key positions (each query may only attend to keys at its own or earlier positions):

        BoS   E    F    G
BoS      -    -    -    -
E        x    -    -    -
F        x    x    -    -
G        x    x    x    -

d. Let's denote the attention weights when querying from the word "Mary" as $\bm{\alpha}_{\texttt{Mary}} = [\alpha_{\texttt{Mary},\texttt{John}}, \alpha_{\texttt{Mary},\texttt{loves}}, \alpha_{\texttt{Mary},\texttt{Mary}}]$.

For the sequence "John loves Mary", the attention weights are computed as:
$\alpha_{\texttt{Mary},i} = \frac{\exp(\mathbf{q}_{\texttt{Mary}}^T \mathbf{k}_i)}{\sum_{j \in \{\texttt{John}, \texttt{loves}, \texttt{Mary}\}} \exp(\mathbf{q}_{\texttt{Mary}}^T \mathbf{k}_j)}$

For the sequence "Mary loves John", the attention weights are computed as:
$\alpha_{\texttt{Mary},i} = \frac{\exp(\mathbf{q}_{\texttt{Mary}}^T \mathbf{k}_i)}{\sum_{j \in \{\texttt{Mary}, \texttt{loves}, \texttt{John}\}} \exp(\mathbf{q}_{\texttt{Mary}}^T \mathbf{k}_j)}$

The numerators in both cases are the same because the query vector $\mathbf{q}_{\texttt{Mary}}$ and the key vectors $\mathbf{k}_i$ depend only on the word identities, not on their positions (assuming no positional encodings are added to the inputs). The denominators are also the same because they sum over the same set of terms, just in a different order. Therefore, the attention weights $\bm{\alpha}_{\texttt{Mary}}$ are the same for both sequences.
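This can also be checked numerically: with arbitrarily chosen word vectors and projections (placeholders for trained parameters), and no positional encodings added, the per-word attention weights come out identical for both orderings:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Word vectors only -- no positional encodings, as the argument assumes
vecs = {w: rng.normal(size=d) for w in ["John", "loves", "Mary"]}
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def attn_weights(query_word, sequence):
    # Softmax-normalized dot-product attention from one query word
    q = vecs[query_word] @ Wq
    scores = np.array([q @ (vecs[w] @ Wk) for w in sequence])
    e = np.exp(scores - scores.max())
    return dict(zip(sequence, e / e.sum()))

a1 = attn_weights("Mary", ["John", "loves", "Mary"])
a2 = attn_weights("Mary", ["Mary", "loves", "John"])
# Per-word weights are identical regardless of word order
print(all(abs(a1[w] - a2[w]) < 1e-12 for w in a1))  # True
```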





****************************************************************************************
****************************************************************************************




Answer to Question 3-2
Here are my answers to the exam question on summarization:

a. Two solutions for adapting a summarization model trained on news to the medical domain with many unknown words:
1. Retrain the model's word embeddings on a large corpus of medical text, so that it can learn vector representations for the unknown medical words. This will allow the model to better represent and generate the medical terminology.
2. Augment the model with a copy mechanism, which allows it to directly copy words from the input text to the output summary. This is useful for handling rare or unknown words, as the model can simply copy them from the input rather than having to generate them from scratch.

b. ROUGE-n is based on n-gram overlap between the generated summary and reference summaries. It measures the percentage of n-grams (sequences of n words) from the reference summaries that are also present in the generated summary.
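A minimal sketch of the ROUGE-n recall computation (with clipping of repeated n-gram counts to their reference counts; the example sentences are invented):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=2):
    # Recall-oriented: fraction of reference n-grams found in the candidate,
    # with repeated candidate n-grams clipped to their reference counts
    ref, cand = Counter(ngrams(reference, n)), Counter(ngrams(candidate, n))
    overlap = sum(min(c, ref[g]) for g, c in cand.items() if g in ref)
    return overlap / max(sum(ref.values()), 1)

ref = "the patient has cerebral amyloid angiopathy".split()
cand = "amyloid angiopathy amyloid angiopathy amyloid angiopathy".split()
print(round(rouge_n(cand, ref), 2))  # 0.2
```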

c. The model receives high ROUGE-2 scores despite the non-grammatical output with repetitions because ROUGE only measures n-gram overlap, not the grammaticality or fluency of the output. The repeated phrases like "amyloid angiopathy" will contribute many matching bigrams, artificially inflating the ROUGE-2 score.

To avoid this problem, we can use a metric that considers more than just n-gram matching, such as METEOR which also takes into account synonyms, paraphrases, and the order of words. We could also use a language model to assess the fluency and grammaticality of the generated summaries.

When generating the output, we can reduce repetition by using techniques like coverage mechanisms or intra-attention to keep track of what has been generated so far and avoid repeating the same content. We can also apply n-gram repetition penalties at decoding time to discourage the model from generating the same n-grams over and over.





****************************************************************************************
****************************************************************************************




Answer to Question 3-3
a. Advantage: BERT is a powerful pretrained text encoder that has been shown to create rich representations of text inputs. Using BERT as the encoder in a machine translation model could potentially capture meaningful features and semantics from the source language.

Disadvantage: The CTC loss, which is commonly used in speech recognition, may not be the most suitable choice for machine translation. CTC is designed to handle monotonic alignments between the input and output sequences, which is appropriate for speech recognition where the output sequence (text) aligns with the input sequence (audio) in a monotonic manner. However, in machine translation, the alignment between the source and target languages is often non-monotonic, making CTC less effective.

b. To improve the model while still using BERT as the encoder, one approach is to replace the CTC loss with a more suitable loss function for machine translation, such as the cross-entropy loss. Additionally, instead of directly outputting the translated text, the model can be modified to generate the target language sequence using an autoregressive decoder, such as an LSTM or Transformer decoder.

Here's how the improved model could work:
1. Use BERT as the encoder to generate contextualized representations of the source language tokens.
2. Pass the BERT representations through an autoregressive decoder, such as an LSTM or Transformer decoder, which attends to the encoder representations and generates the target language tokens one at a time.
3. Train the model using the cross-entropy loss, which measures the difference between the predicted target language tokens and the actual target language tokens.

This approach leverages the power of BERT as a pretrained encoder while using a more suitable decoder and loss function for machine translation. The autoregressive decoder allows for non-monotonic alignments and can generate fluent translations by considering the previously generated tokens. The cross-entropy loss provides a more appropriate training signal for the language generation task.
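As an illustration of step 3, the following sketch computes the cross-entropy loss under teacher forcing; the decoder hidden states are random placeholders, and all sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, h = 100, 6, 16                  # target vocab size, target length, decoder size

# Decoder hidden states at each target position (teacher forcing would
# produce these by feeding in the gold previous tokens)
dec_states = rng.normal(size=(T, h))
W_out = rng.normal(size=(h, V))
logits = dec_states @ W_out           # one score per vocabulary word per step

target = rng.integers(0, V, size=T)   # gold target-language token ids

# Cross-entropy: mean negative log-probability of the gold token at each step
e = np.exp(logits - logits.max(axis=1, keepdims=True))
log_probs = np.log(e / e.sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(T), target].mean()
print(float(loss) > 0.0)  # True
```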





****************************************************************************************
****************************************************************************************




Answer to Question 3-4
Here are my answers to the exam question:

a. To train a text-to-SQL model with the given 30,000 training examples, I would use a sequence-to-sequence model with attention, such as a transformer-based model like T5 or BART. The input to the model would be the concatenation of the table name, column names, and the natural language question, separated by special tokens. The output would be the corresponding SQL query.

During training, the model learns to attend to relevant parts of the input (table information and question) to generate the appropriate SQL query tokens. Attention allows the model to learn alignments between the question words and the table schema to understand which columns and values are relevant for the given question.

The model would be trained using teacher forcing, where the ground truth SQL query is provided at each decoding step. A cross-entropy loss on the generated tokens would be used to train the model to maximize the likelihood of the correct SQL query given the input.

To handle different table schemas, the model should learn to generalize to unseen tables and columns. This can be achieved by using a diverse set of tables and questions in the training data.
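A possible input serialization is sketched below; the [TABLE], [COL] and [Q] separator tokens are hypothetical choices for illustration, not a fixed standard:

```python
def serialize_input(table_name, columns, question):
    # Concatenate table name, column names, and the question,
    # separated by (hypothetical) special tokens
    cols = " [COL] ".join(columns)
    return f"[TABLE] {table_name} [COL] {cols} [Q] {question}"

print(serialize_input(
    "employees",
    ["name", "department", "salary"],
    "Which employees in sales earn more than 50000?",
))
```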

b. To adapt the model to handle unanswerable questions, I would add a special "unanswerable" token to the output vocabulary. During training, for questions that cannot be answered by the given table, the model should learn to generate this special token instead of an SQL query.

To encourage the model to predict the "unanswerable" token when appropriate, unanswerable questions should be included in the training data, paired with the special token as the target output. The model would learn to identify when a question cannot be answered based on the given table information.

During inference, if the model generates the "unanswerable" token, it indicates that the question cannot be answered using the provided table. This allows the model to avoid generating invalid SQL queries for questions that are out of scope for the given table.

Additionally, a separate binary classifier could be trained to predict the answerability of a question given a table. This classifier could be used as a filter before passing the question to the text-to-SQL model, to avoid generating SQL queries for unanswerable questions.





****************************************************************************************
****************************************************************************************




Answer to Question 4-1
a. The advantage of keeping the BERT parameters frozen and adding adapters is that it allows the model to adapt to the specific named-entity recognition task without overfitting to the small dataset. By freezing the BERT parameters, we preserve the pre-trained knowledge and only train the adapter layers, which have fewer parameters compared to the entire BERT model. This approach can help prevent divergence during training on a small dataset.

b. I would insert the adapters after the BertSelfOutput module in each BertLayer. Specifically, the adapters should be placed between the LayerNorm of the BertSelfOutput and the BertIntermediate module. This way, the adapters can learn task-specific transformations of the self-attention outputs before passing them to the intermediate and output layers.

c. To calculate the number of parameters added by inserting the adapters, we need to consider the following steps:
1. Each adapter consists of two linear projections: one projecting down to 256 dimensions and another projecting up to the original dimension (768 in this case).
2. The number of parameters in a linear projection is (input_dim * output_dim) + output_dim, accounting for the weights and biases.
3. For the first (down-)projection, the number of parameters is (768 × 256) + 256 = 196,864.
4. For the second (up-)projection, the number of parameters is (256 × 768) + 768 = 197,376.
5. The total number of parameters for one adapter is 196,864 + 197,376 = 394,240.
6. Since we are adding an adapter to each of the 12 BERT layers, the total number of parameters added to the existing model is 12 × 394,240 = 4,730,880.
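The steps above can be carried out in a few lines (bias terms included, as in step 2):

```python
def linear_params(d_in, d_out):
    # weight matrix plus bias vector
    return d_in * d_out + d_out

d_model, d_adapter, n_layers = 768, 256, 12

down = linear_params(d_model, d_adapter)   # 768*256 + 256 = 196_864
up = linear_params(d_adapter, d_model)     # 256*768 + 768 = 197_376
per_adapter = down + up                    # 394_240
total = n_layers * per_adapter             # one adapter per BERT layer

print(per_adapter, total)  # 394240 4730880
```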





****************************************************************************************
****************************************************************************************




Answer to Question 4-2
a. The BERT CLS token is a special token added to the beginning of each input sequence during BERT pre-training. It is used to capture the overall sentence-level representation. The main differences and advantages of using the CLS token representation compared to pooling the word vectors are:

1. The CLS token is trained to specifically capture the sentence-level information during BERT pre-training, while pooling methods like mean or max pooling are generic aggregation techniques applied after the fact.

2. The CLS token representation can capture more complex and non-linear relationships among the words in the sentence, as it is learned through the deep transformer layers of BERT. Pooling methods, on the other hand, perform a simpler aggregation of the word vectors.

3. Using the CLS token allows the sentence representation to be learned in the context of the specific task (e.g., question-answering), as the BERT model can be fine-tuned with the CLS token for the downstream task. Pooling methods do not have this adaptability.

4. The CLS token representation is available directly from the encoder output, with no additional aggregation step needed at inference time. (Note that mean or max pooling also yields a single vector of the same dimension, so the difference lies in how the vector is computed, not in its size.)

b. Including irrelevant/negative pairs in the training objective of DPR is crucial for the following reasons:

1. Contrastive learning: By pushing irrelevant pairs away from each other, the model learns to distinguish between relevant and irrelevant passages for a given question. This contrastive learning helps the model to better understand the semantic similarities and differences between questions and passages.

2. Preventing overfitting: If only relevant pairs were used during training, the model might simply learn to match exact phrases or keywords without understanding the underlying semantics. Including irrelevant pairs forces the model to learn more robust and generalized representations.

3. Improved retrieval performance: At inference time, the model needs to retrieve the most relevant passages from a large pool of candidates. By training with irrelevant pairs, the model becomes better at ranking and retrieving the most relevant passages while pushing down the irrelevant ones.

If we leave out the irrelevant/negative pairs in the training objective, the model might suffer from the following issues:

1. Poor generalization: The model might not learn to distinguish between relevant and irrelevant passages effectively, leading to poor performance on unseen data.

2. Retrieval of irrelevant passages: Without learning to push away irrelevant pairs, the model might retrieve many irrelevant passages along with the relevant ones, reducing the overall quality of the retrieved results.

3. Overfitting: The model might memorize the exact phrasing of the relevant pairs seen during training, rather than learning the underlying semantic relationships between questions and passages.
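A minimal sketch of this contrastive objective with in-batch negatives (random encodings stand in for the BERT question/passage encoders; batch size 4 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 4, 8                      # batch size, embedding dimension

q = rng.normal(size=(B, d))      # question encodings
p = rng.normal(size=(B, d))      # passage encodings; p[i] is relevant to q[i]

# Similarity of every question to every passage in the batch;
# off-diagonal entries act as the irrelevant/negative pairs
sims = q @ p.T                   # shape (B, B)

# Softmax over passages: each question should score its own passage highest
e = np.exp(sims - sims.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

# Negative log-likelihood of the relevant (diagonal) passage.
# Without the negatives in the denominator, the model could trivially
# minimize the loss by mapping everything to the same vector.
loss = -np.log(probs[np.arange(B), np.arange(B)]).mean()
print(float(loss) > 0.0)  # True
```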





****************************************************************************************
****************************************************************************************




