Answer to Question 1-1
a) Word2vec, in both its Continuous Bag of Words (CBOW) and Skip-gram variants, takes context into account through its training objective: in CBOW, the model predicts a target word from the average of its context words' embeddings, while in Skip-gram, it predicts the context words given a target word. Context is thus directly incorporated into the learning process.

On the other hand, TF-IDF does not explicitly consider context during representation creation. It is a statistical measure that reflects how important a word is to a document in a collection or corpus. TF-IDF calculates the frequency of a term (word) in a document and discounts its frequency across the entire corpus to give more weight to rare terms.

The main difference between the two approaches lies in their objectives: Word2vec aims to learn dense vector representations that capture semantic relationships between words based on their contexts, while TF-IDF is a simple weighting scheme focused on term importance within a document relative to a collection. Word2vec focuses on context-dependent meaning, whereas TF-IDF focuses on term frequency.
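To make the TF-IDF side concrete, here is a minimal sketch in Python. The tiny corpus and the particular TF and IDF variants (raw relative frequency, unsmoothed log IDF) are illustrative choices; several formulations exist:

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """TF-IDF with relative term frequency and log inverse document frequency."""
    tf = Counter(doc)[term] / len(doc)               # frequency of the term in this document
    df = sum(1 for d in corpus if term in d)         # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0  # rarer terms get a larger idf
    return tf * idf

# A tiny toy corpus: "the" appears everywhere, "banks" only once.
corpus = [
    ["the", "river", "banks", "flooded"],
    ["the", "bank", "raised", "the", "rates"],
    ["the", "weather", "was", "mild"],
]
doc = corpus[0]
# "the" occurs in every document, so its idf -- and hence its TF-IDF weight -- is 0.
print(tf_idf("the", doc, corpus))        # 0.0
print(tf_idf("banks", doc, corpus) > 0)  # True
```

This shows the key property discussed above: a term's weight grows with its in-document frequency but is discounted toward zero as it becomes common across the corpus.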





****************************************************************************************
****************************************************************************************




Answer to Question 1-2
a. The sentence "I love NLP a lot" would be segmented into subwords using the provided BPE codes as follows: "I", "love", "NLP", "a", "lot".
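For illustration, a simplified single-pass BPE segmenter in Python; the merge rules below are hypothetical stand-ins, not the codes given in the question:

```python
def bpe_segment(word, merges):
    """Simplified single-pass BPE: start from characters and apply each merge rule in order.
    (Real BPE re-scans for the highest-priority pair after every merge; for these toy rules
    the single pass gives the same result.)"""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # merge the adjacent pair into one symbol
            else:
                i += 1
    return symbols

# Hypothetical merge rules -- NOT the BPE codes from the question.
merges = [("l", "o"), ("lo", "v"), ("lov", "e")]
print(bpe_segment("love", merges))   # ['love']
print(bpe_segment("lot", merges))    # ['lo', 't']
```

A word segments into a single token only when the merge list builds it up completely; otherwise it falls back to smaller subword pieces, which is how BPE handles words it has never fully seen.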





****************************************************************************************
****************************************************************************************




Answer to Question 1-3
a. The shape of the output projection is 10,000 (vocabulary size) by 300 (each word vector's dimension).

b. I don't agree that Bart's training pipeline is necessarily broken. Increasing the context window size may not change the trained word vectors much if the data contains many overlapping contexts, or if the relationships between words are already well captured with smaller windows; beyond a certain point, larger windows often add little information about co-occurrence patterns. Other factors, such as training time, embedding dimensionality, and dataset size, can play a larger role in determining the quality of the embeddings. Bart should check these aspects before concluding that the pipeline is broken.





****************************************************************************************
****************************************************************************************




Answer to Question 1-4
a. True. Using subwords can handle morphologically-rich languages by capturing word affixes and breaking down unknown words.

b. False. A unigram language model ignores context entirely: it estimates each word's probability from that word's frequency in the corpus alone.

c. False. One-hot encoding represents each word as a unique basis vector, so all word vectors are mutually orthogonal and equidistant; it cannot capture semantic similarity between words.

d. True. LDA models documents as a mixture of topics, where each topic is a distribution over words.

e. False. TF-IDF actually assigns higher importance to terms that occur frequently in a document but infrequently across the corpus, thus reducing stopwords' impact.

f. False. In HMMs for POS tagging, hidden states represent the part-of-speech tags, while observed states are the words.





****************************************************************************************
****************************************************************************************




Answer to Question 2-1
a. For a lightweight sentence classifier, we can use a simple feedforward neural network with one hidden layer. The input is the sentence, represented as a sequence of word embeddings, each word being a 300-dimensional vector. We combine these vectors into a single fixed-length representation for the entire sentence, e.g. by averaging them (an attention-weighted sum would also work). Assuming averaging for simplicity, the input to the classifier has shape `(batch_size, 300)`. The sentence embedding then passes through a fully connected layer with 64 hidden units (an arbitrarily chosen small number to keep the model lightweight), followed by an activation function such as ReLU. Finally, a fully connected output layer with 3 units (one per label: happy, neutral, sad) produces the predicted probability of each class via softmax.
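The architecture above can be sketched in plain Python with untrained, randomly initialised weights; this only demonstrates the shapes and the forward pass, not a real training setup:

```python
import random

random.seed(0)
EMB, HIDDEN, CLASSES = 300, 64, 3   # dimensions from the description above

def linear(x, weights, biases):
    """y = Wx + b for a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.01) for _ in range(cols)] for _ in range(rows)]

w1, b1 = rand_matrix(HIDDEN, EMB), [0.0] * HIDDEN
w2, b2 = rand_matrix(CLASSES, HIDDEN), [0.0] * CLASSES

def classify(word_vectors):
    """Mean-pool the 300-d word vectors, apply the hidden ReLU layer, return 3 class scores."""
    n = len(word_vectors)
    pooled = [sum(v[i] for v in word_vectors) / n for i in range(EMB)]  # shape (300,)
    return linear(relu(linear(pooled, w1, b1)), w2, b2)                 # shape (3,)

sentence = [[random.gauss(0, 1) for _ in range(EMB)] for _ in range(5)]  # a 5-word sentence
scores = classify(sentence)
print(len(scores))   # 3 logits: happy / neutral / sad
```

Mean pooling makes the classifier independent of sentence length: any number of word vectors collapses to one 300-dimensional input.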

b. The model from subquestion a is not suitable for audio utterance classification because:
1. The input dimensionality is different: Instead of word embeddings, we have 80-dimensional spectrograms, which require a different type of processing (e.g., convolutional or recurrent layers) to capture temporal patterns in the audio.
2. The model architecture is designed for text data and does not directly handle time-series data like spectrograms. A feedforward network would not be able to effectively learn from the temporal structure present in the audio data.

c. For an improved model for audio utterance classification, we can use a 1D Convolutional Neural Network (CNN) followed by one or more Recurrent Neural Networks (RNN), such as LSTM or GRU units. The input would be the 80-dimensional spectrograms with shape `(batch_size, time_steps, frequency_bins)`. First, a 1D CNN layer can capture local patterns in the temporal dimension, followed by one or more RNN layers to model the sequential information. Finally, there's a fully connected output layer with 3 units for classification. The parameter shapes would vary depending on the number of filters, kernel sizes, and hidden units chosen for the CNN and RNN layers.
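As a rough illustration of how the parameter counts depend on these choices, the arithmetic below uses assumed hyperparameters (64 filters, kernel size 5, GRU hidden size 128) that are not specified in the question:

```python
# Assumed hyperparameters (illustrative only, not given in the question):
FREQ_BINS = 80      # spectrogram frequency bins (input channels)
FILTERS = 64        # number of 1D conv filters
KERNEL = 5          # conv kernel size along the time axis
GRU_HIDDEN = 128    # GRU hidden units
CLASSES = 3         # happy / neutral / sad

# 1D conv over time: each filter spans (kernel * input channels), plus one bias per filter.
conv_params = FILTERS * KERNEL * FREQ_BINS + FILTERS

# GRU: 3 gates, each with input-to-hidden and hidden-to-hidden weights plus a bias vector.
gru_params = 3 * (FILTERS * GRU_HIDDEN + GRU_HIDDEN * GRU_HIDDEN + GRU_HIDDEN)

# Final linear layer from the last GRU state to the class logits.
out_params = GRU_HIDDEN * CLASSES + CLASSES

print(conv_params, gru_params, out_params)   # 25664 74112 387
```

Note that none of these counts depend on `time_steps`: both the convolution and the GRU share their weights across time, which is exactly why they handle variable-length spectrograms.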





****************************************************************************************
****************************************************************************************




Answer to Question 2-2
a. The sequence classification approach is not optimal for dialog act identification in this case because it treats each utterance independently, ignoring the context and flow of the conversation. In the example dialog, when the doctor asks "For a week, right?" twice, the intent behind these questions is to confirm the patient's symptoms based on previous statements. The model might misclassify these follow-up questions as separate instances of the same symptom (e.g., knee swelling) instead of recognizing them as clarification or confirmation.

b. I would model the task as a sequence labeling problem. This approach considers the context and dependencies between utterances in the conversation, which is crucial for identifying dialog acts accurately. In contrast, sequence generation would add unnecessary complexity by predicting variable-length output sequences, whereas here each utterance receives exactly one label from a fixed set of dialog acts.

c. For a sequence labeling model, the input would be the matrix of sentence embeddings, where each row is an utterance's embedding vector. The intermediate operations could use a recurrent neural network (RNN), such as an LSTM or GRU, to capture the context and temporal dependencies between utterances; the hidden state at each time step encodes the information from previous utterances. Another option is a Transformer-based encoder in the style of BERT, which models contextual relationships through self-attention (with positional information) rather than step-by-step sequential processing.

The output layer would be a sequence of labels, with one label for each utterance in the dialog. This could be realized using a linear layer followed by softmax activation, where the softmax function produces probabilities over all possible dialog act classes. The model is trained to minimize a loss function (e.g., cross-entropy) between its predicted label sequences and the ground truth labels.
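A minimal sketch of such a sequence labeler in plain Python, using a simple tanh RNN with small illustrative dimensions (real sentence embeddings and label sets would be larger, and a trained model would use learned weights):

```python
import math
import random

random.seed(1)
EMB, HIDDEN, ACTS = 8, 4, 5   # small illustrative sizes

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.3) for _ in range(cols)] for _ in range(rows)]

w_in, w_rec, w_out = rand_matrix(HIDDEN, EMB), rand_matrix(HIDDEN, HIDDEN), rand_matrix(ACTS, HIDDEN)

def label_dialog(utterance_embeddings):
    """Simple tanh RNN: one hidden state per utterance, one dialog-act score vector per utterance."""
    h = [0.0] * HIDDEN
    scores = []
    for x in utterance_embeddings:
        # the new hidden state mixes the current utterance with the dialog context so far
        h = [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_rec, h))]
        scores.append(matvec(w_out, h))
    return scores

dialog = [[random.gauss(0, 1) for _ in range(EMB)] for _ in range(6)]  # 6 utterance embeddings
out = label_dialog(dialog)
print(len(out), len(out[0]))   # 6 utterances, 5 act scores each
```

The recurrent hidden state is what lets the label for "For a week, right?" depend on the utterances that preceded it, unlike the independent per-utterance classifier criticized in subquestion a.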





****************************************************************************************
****************************************************************************************




Answer to Question 3-1
a. In the context of a Transformer decoder, "autoregressive" means that the model predicts each output token based on the previous tokens in the sequence. The decoder generates the output sequence one token at a time, with each prediction conditioned on the previously generated tokens.

b. The Transformer decoder self-attention must be partially masked out during training to prevent the current position from attending to future positions in the sequence. This is necessary because in autoregressive models, the generation of each token should not have access to information about tokens that come after it. Masking ensures that the model learns to predict a token based only on the available context and not on information from the future.

c. The self-attention weights that must be masked out are those where the key position lies after the query position: since the decoder generates the sequence left-to-right, the token at position $i$ may attend only to positions $j \le i$. For the sequence `BoS, E, F, G`, the masked weights (marked "x", with queries as rows and keys as columns) form the strictly upper-triangular part of the matrix:

```
      BoS  E   F   G
BoS    .   x   x   x
E      .   .   x   x
F      .   .   .   x
G      .   .   .   .
```
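The causal mask used in decoder self-attention can be constructed programmatically; a minimal sketch:

```python
def causal_mask(n):
    """mask[i][j] is True where attention must be blocked: key position j after query position i."""
    return [[j > i for j in range(n)] for i in range(n)]

tokens = ["BoS", "E", "F", "G"]
mask = causal_mask(len(tokens))
for token, row in zip(tokens, mask):
    print(f"{token:>3}", ["x" if blocked else "." for blocked in row])
```

In practice this boolean mask is applied by setting the blocked attention scores to a large negative value before the softmax, so the corresponding weights become (effectively) zero.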

d. To show that $\bm{\alpha}_{\texttt{Mary}}$ is the same for both sequences, consider how the self-attention mechanism computes it. The attention score from position $i$ to position $j$ is the dot product of the corresponding query and key vectors, normalized by a scaling factor, and the attention weights are the softmax of these scores:

$$
s_{ij} = \frac{\mathbf{q}_i^\top \mathbf{k}_j}{\sqrt{d}},
\qquad
\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_{j'} \exp(s_{ij'})}
$$

For the word "Mary", the scores are computed between its query vector $\mathbf{q}_{\texttt{Mary}}$ and the key vectors of all words in the sequence. Without positional encodings, each query and key vector depends only on the word's embedding, and the embeddings of "John", "loves", and "Mary" are identical in both sequences; the dot products between $\mathbf{q}_{\texttt{Mary}}$ and these keys are therefore identical as well, and the scaling factor $\sqrt{d}$ is constant.

Since the softmax is then computed over the same set of scores in both cases, the weight that "Mary" assigns to each word is unchanged: $\bm{\alpha}_{\texttt{Mary}}$ depends only on the interaction between Mary's query vector and the key vectors of the words present, not on the order in which those words appear in "John loves Mary" versus "Mary loves John".
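This argument can be checked numerically. The sketch below uses toy 2-dimensional embeddings and identity query/key projections (both assumptions chosen for brevity) and omits positional encodings, which is exactly what the argument relies on:

```python
import math

d = 2
# Toy word embeddings; identity query/key projections keep the sketch minimal
# (any shared W_q and W_k would lead to the same conclusion).
emb = {"John": [1.0, 0.0], "loves": [0.0, 1.0], "Mary": [1.0, 1.0]}

def mary_attention(sequence):
    """Softmax attention weights of the query 'Mary' over all words, keyed by word.
    No positional encodings are used."""
    q_mary = emb["Mary"]
    scores = [sum(a * b for a, b in zip(q_mary, emb[w])) / math.sqrt(d) for w in sequence]
    z = sum(math.exp(s) for s in scores)
    return {w: math.exp(s) / z for w, s in zip(sequence, scores)}

a1 = mary_attention(["John", "loves", "Mary"])
a2 = mary_attention(["Mary", "loves", "John"])
print(all(abs(a1[w] - a2[w]) < 1e-12 for w in a1))   # True: same weight per word
```

Reordering the sequence permutes where each weight sits in the vector, but the weight assigned to each individual word is unchanged.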





****************************************************************************************
****************************************************************************************




Answer to Question 3-2
a. Two solutions to the problem of unknown words in a summarization model adapted to the medical domain are:
1. **Substitution or Expansion**: Replace or expand unknown words with synonyms, broader terms, or their definitions from medical dictionaries or ontologies like UMLS (Unified Medical Language System). This helps the model understand and generate contextually relevant content.
2. **Data Augmentation**: Introduce more medical text data into the training set to expose the model to a wider range of vocabulary specific to the domain. This can be done by using domain-specific datasets, pre-processing techniques like named entity recognition, or even through synthetic data generation.

b. ROUGE-n (Recall-Oriented Understudy for Gisting Evaluation) is based on n-gram overlap, with an emphasis on recall. It measures the overlap between a summary and a reference text by counting the sequences of n consecutive words (n-grams) that co-occur in both.
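A minimal sketch of ROUGE-n recall (here n = 2), with candidate matches clipped at the reference counts as in the standard metric; the example sentences are illustrative:

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=2):
    """Clipped n-gram recall: matched reference n-grams / total reference n-grams."""
    ngrams = lambda toks: Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clip at reference counts
    return overlap / sum(ref.values())

ref = "the patient shows signs of amyloid angiopathy".split()
cand = "patient shows amyloid angiopathy".split()
print(round(rouge_n_recall(cand, ref), 3))   # 0.333 (2 of the 6 reference bigrams matched)
```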

c. The model receives high ROUGE-2 scores because the metric is based on bigram matching, i.e. counting pairs of consecutive words that also appear in the reference. If "amyloid angiopathy" is a frequent bigram in the reference summaries, an output that keeps repeating it still matches many reference bigrams and scores well despite being repetitive and non-grammatical. To mitigate this problem, we can complement ROUGE with **BLEU (Bilingual Evaluation Understudy)**, which is precision-oriented and includes a brevity penalty, or **METEOR (Metric for Evaluation of Translation with Explicit ORdering)**, which combines unigram precision and recall with explicit alignment, providing a more comprehensive evaluation.

To reduce the amount of repetition in the output, one could implement:
1. **Length Constraint**: Set a maximum limit on the number of times a specific term can appear in the summary.
2. **Repetition Penalty**: Introduce a penalty function within the model's loss calculation that discourages excessive repetitions.
3. **Diversity Promoting Techniques**: Use techniques like beam search with diversity or sampling strategies to encourage variety in generated sequences.





****************************************************************************************
****************************************************************************************




Answer to Question 3-3
a. An advantage of using BERT with CTC for machine translation is that it leverages the pre-trained knowledge from BERT, which has been extensively trained on large text corpora, to create meaningful representations for the input text. This can potentially lead to better performance compared to training a model from scratch.

A disadvantage of this approach is that BERT was designed and optimized as an encoder for text understanding tasks, not for sequence-to-sequence tasks like machine translation. The CTC loss is also a questionable fit: it assumes a monotonic alignment between input and output and cannot produce outputs longer than the input, whereas translation frequently requires word reordering and changes in length, so it may not capture the nuances of translation effectively.

b. To improve the model while keeping BERT as the encoder, one could consider using an additional decoder layer specifically designed for machine translation tasks, such as an attention mechanism (e.g., Transformer decoder). This would allow the model to learn more complex relationships between input and output sequences by focusing on relevant parts of the input during the decoding process. Another strategy could be fine-tuning BERT on a larger, task-specific dataset for machine translation or incorporating language-specific knowledge, such as dictionaries or parallel corpora. Additionally, applying data augmentation techniques, optimizing hyperparameters, or using ensemble methods may also contribute to performance improvement.





****************************************************************************************
****************************************************************************************




Answer to Question 3-4
a. A potential model for the text-to-SQL task could be a sequence-to-sequence model with attention mechanism, such as an encoder-decoder architecture using Transformer or LSTM networks. The encoder would take the natural language question as input and encode it into a hidden representation. The decoder would then generate the SQL query step by step, attending to relevant parts of the question encoding. To incorporate table metadata, we could include the table name and column names as additional inputs to the decoder. During training, the model would be optimized to minimize the difference between its generated SQL queries and the ground truth SQL queries for each question.

b. To handle unanswerable questions like "Who is the Chancellor of Germany?", the model should be trained with a mechanism to detect whether the question can be answered using the given table schema. This could involve adding an output layer that predicts if a question is answerable or not, and modifying the decoder to generate a special "unanswerable" token when appropriate. During inference, if this output layer indicates that the question cannot be answered with the provided table, the model would generate no SQL query or an invalid SQL query marker (e.g., a NULL or empty string). This way, the model signals that it is unable to provide a relevant answer based on the given data.
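At inference time, the answerability head can act as a simple gate in front of the decoder; a toy sketch, where the score, threshold, and decoder function all stand in for real model components:

```python
def generate_sql(question, answerability_score, decode_fn, threshold=0.5):
    """Emit SQL only when the answerability head is confident enough;
    otherwise return None to signal 'unanswerable'."""
    if answerability_score < threshold:
        return None
    return decode_fn(question)

# Toy stand-in for the trained seq2seq decoder.
toy_decode = lambda q: "SELECT name FROM players WHERE team = 'Chelsea'"

print(generate_sql("Which players are in Chelsea?", 0.9, toy_decode))
print(generate_sql("Who is the Chancellor of Germany?", 0.1, toy_decode))   # None
```

The threshold trades off coverage against precision: raising it makes the system refuse more often but reduces the risk of hallucinated queries.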





****************************************************************************************
****************************************************************************************




Answer to Question 4-1
a. The advantage of using adapters in this scenario is that it allows the fine-tuning process to be more efficient and less prone to divergence. By freezing the BERT parameters, the large pre-trained model's knowledge is preserved, preventing overfitting on the small dataset. Adding adapters, which are smaller sets of learnable parameters, introduces fewer new parameters to tune, reducing the risk of overfitting and potentially improving generalization.

b. The adapters should be inserted inside each of the 12 BERT layers (including layer 0), after a sublayer's output; a common choice is one adapter at the end of each layer, following the feed-forward sublayer. Each adapter is a bottleneck: a linear projection down from 768 to 256 dimensions, a nonlinearity, and a linear projection back up from 256 to 768, whose output is added to the sublayer output via a residual connection.

c. To calculate the number of added parameters, consider the adapter's two linear projections in each of the 12 layers:

1. Down-projection from 768 to 256 dimensions: 768 * 256 = 196,608 parameters.
2. Up-projection from 256 back to 768 dimensions: 256 * 768 = 196,608 parameters.

With both projections in each of the 12 layers, the total number of added parameters (ignoring bias terms) is:
(768 * 256 + 256 * 768) * 12 = 393,216 * 12 = 4,718,592.
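As a quick sanity check of this arithmetic (bias terms, if used, would add a further (256 + 768) * 12 = 12,288 parameters):

```python
D_MODEL, BOTTLENECK, LAYERS = 768, 256, 12

down = D_MODEL * BOTTLENECK    # down-projection weights per adapter
up = BOTTLENECK * D_MODEL      # up-projection weights per adapter
total = (down + up) * LAYERS   # one adapter (both projections) in each of the 12 layers

print(down, up, total)   # 196608 196608 4718592
```

That is roughly 4.7M trainable parameters, against the ~110M frozen parameters of BERT-base.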





****************************************************************************************
****************************************************************************************




Answer to Question 4-2
a. The approach of using the BERT CLS token for sentence representation in DPR differs from pooling (meanpool or maxpool) word vectors because the CLS token is specifically designed to capture the overall meaning and context of a sentence. It is pre-trained to represent the entire input sequence, integrating information from all the words. In contrast, meanpooling takes the average of all word vectors, which can dilute the importance of certain words, while maxpooling selects the most salient vector, potentially discarding important contextual information.

The advantage of using the BERT CLS token is that it provides a single, fixed-size representation for the entire sentence, which simplifies downstream tasks like retrieval. It also captures the overall context better than pooling methods, as it has been trained to do so explicitly.

b. Including irrelevant/negative pairs in DPR's training objective is crucial because it helps the model learn to distinguish between relevant and non-relevant passages. By contrasting positive (relevant) and negative (irrelevant) examples, the model can learn a stronger discriminative representation that enables more accurate retrieval at inference time.

If we were to leave out the irrelevant/negative pairs in the training objective, the model would lack a clear benchmark for what does not constitute a relevant passage. It might still be able to learn some degree of relevance, but it would struggle to establish an effective boundary between relevant and non-relevant passages, leading to less precise retrieval results. The absence of negative examples would make it difficult for the model to optimize its ability to distinguish between true matches and false positives.
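The in-batch negatives objective described above can be sketched in plain Python: each question's positive passage is scored against every passage in the batch, and the loss is the softmax cross-entropy on the positive. The toy 2-dimensional vectors stand in for real BERT CLS embeddings:

```python
import math

def dpr_loss(question_vecs, passage_vecs):
    """In-batch negative log-likelihood: passage i is the positive for question i,
    and every other passage in the batch serves as a negative."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    total = 0.0
    for i, q in enumerate(question_vecs):
        scores = [dot(q, p) for p in passage_vecs]    # similarity to every passage in the batch
        z = sum(math.exp(s) for s in scores)
        total += -math.log(math.exp(scores[i]) / z)   # softmax cross-entropy on the positive
    return total / len(question_vecs)

# Toy embeddings: each question is close to its own (positive) passage.
qs = [[1.0, 0.0], [0.0, 1.0]]
ps = [[0.9, 0.1], [0.1, 0.9]]
good = dpr_loss(qs, ps)
bad = dpr_loss(qs, list(reversed(ps)))   # positives swapped -> should give a higher loss
print(good < bad)   # True
```

The negatives enter through the normalizer $z$: without the other passages in the denominator, the model could lower the loss simply by inflating all similarities, instead of learning to separate relevant from irrelevant passages.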





****************************************************************************************
****************************************************************************************




