Answer to Question 1-1
a) Word embeddings, such as word2vec, take context into account: each word is represented as a vector in a continuous vector space, learned from the words that frequently appear near it in the training corpus, so words used in similar contexts end up with similar vectors. TF-IDF, on the other hand, does not use the surrounding context at all: it scores a word by its frequency within a document (term frequency) weighted against its frequency across the entire document corpus (inverse document frequency). The main difference is that word embeddings capture semantic relationships between words based on context, while TF-IDF only measures how important a word is to a specific document based on its frequency and rarity in the corpus.
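The distinction can be made concrete with a minimal TF-IDF computation. This is a sketch on a hypothetical two-document toy corpus; real libraries (e.g. scikit-learn) apply slightly different smoothing:

```python
import math

# Toy corpus: two "documents" as token lists.
docs = [
    ["nlp", "is", "fun", "nlp"],
    ["word", "embeddings", "are", "fun"],
]

def tf_idf(term, doc, docs):
    # Term frequency: how often the term occurs in this document.
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms rare across the corpus score higher.
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "nlp" appears only in doc 0, so it gets a non-zero weight there;
# "fun" appears in every document, so its IDF (and hence TF-IDF) is 0.
print(tf_idf("nlp", docs[0], docs))  # 0.5 * log(2) ≈ 0.3466
print(tf_idf("fun", docs[0], docs))  # 0.0
```

Note that no surrounding-word information enters the score, only counts.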





****************************************************************************************
****************************************************************************************




Answer to Question 1-2
To segment the sentence "I love NLP a lot" into subwords using the given byte-pair-encoding (BPE) codes, we apply the learned merge operations to each word: every word starts as a sequence of characters, and adjacent symbols are merged whenever their concatenation appears in the merge list, until no further merges apply.

Here are the steps for segmenting the sentence "I love NLP a lot" with BPE:

1. Start from characters: "I", "l o v e", "N L P", "a", "l o t"
2. Available subwords: all single characters plus the learned merges {la, le, li, lo}
3. Apply the merge "lo" in "love": "l o v e" → "lo v e"
4. Apply the merge "lo" in "lot": "l o t" → "lo t"
5. Repeat the process; no remaining adjacent pair ("lo v", "v e", "N L", "L P", ...) is in the merge list, so segmentation stops.
6. Final segmented subwords: "I", "lo", "v", "e", "N", "L", "P", "a", "lo", "t"

Therefore, the sentence "I love NLP a lot" is segmented into subwords as follows: "I" | "lo v e" | "N L P" | "a" | "lo t"
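A minimal sketch of applying a fixed merge list to each word, assuming the merges {la, le, li, lo} given in the question (real BPE implementations apply merges in the order they were learned; with this merge list the order does not matter):

```python
def bpe_segment(word, merges):
    # Start from characters, then repeatedly merge the first adjacent
    # pair whose concatenation appears in the learned merge list.
    tokens = list(word)
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = tokens[i] + tokens[i + 1]
            if pair in merges:
                tokens = tokens[:i] + [pair] + tokens[i + 2:]
                changed = True
                break
    return tokens

merges = {"la", "le", "li", "lo"}  # BPE codes given in the question
for w in "I love NLP a lot".split():
    print(w, "->", bpe_segment(w, merges))
# love -> ['lo', 'v', 'e'], lot -> ['lo', 't'], the rest stay as characters
```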





****************************************************************************************
****************************************************************************************




Answer to Question 1-3
**Answers:**
a. The shape of the output projection in the skip-gram model for word embeddings would be 10,000 x 300, where 10,000 represents the vocabulary size and 300 represents the dimensions of each word vector.
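The shapes can be sanity-checked with a dependency-free sketch; the zero-filled matrices are placeholders purely to demonstrate dimensions, not trained weights:

```python
vocab_size, embed_dim = 10_000, 300

# Input embedding matrix: one 300-d vector per vocabulary word.
W_in = [[0.0] * embed_dim for _ in range(vocab_size)]
# Output projection: 10,000 x 300. Multiplying it with the 300-d hidden
# vector yields one logit per vocabulary word (the context-word scores).
W_out = [[0.0] * embed_dim for _ in range(vocab_size)]

hidden = W_in[42]                                   # center word's vector, length 300
logits = [sum(w * h for w, h in zip(row, hidden))   # length 10,000
          for row in W_out]
print(len(hidden), len(logits))  # 300 10000
```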

b. Bart's observation that there is no difference in the trained word vectors when using context window sizes of 30, 40, and 50 words may not necessarily indicate a broken training pipeline. In situations where the context window size is significantly larger than the typical sentence or document length in the dataset, increasing the window size may not have a significant impact on the word embeddings. This is because the model can already capture enough context with smaller window sizes, and extending it further does not provide additional useful information for training the vectors. Therefore, Bart's observation may be valid if the dataset consists of short textual segments where longer context windows do not provide additional benefits.

Figure paths: N/A





****************************************************************************************
****************************************************************************************




Answer to Question 1-4
a. False. Using subwords is better for morphologically-rich languages because it helps capture morphological variations and improves generalization.

b. True. A unigram language model can be derived from the number of occurrences of each word in a corpus by calculating the probability of each word in isolation.

c. False. One-hot word representations cannot measure the semantic difference between words as they do not capture any semantic information.

d. True. In latent Dirichlet allocation (LDA), a document is modeled as a distribution over latent topics (its topic proportions), and each topic is in turn a distribution over words.

e. False. It is the inverse document frequency (IDF) component, not the term frequency, that decreases the importance of common words (e.g. stopwords): words that appear in many documents receive a low IDF weight, whereas their term frequency is typically high.

f. False. In hidden Markov models (HMMs) for part-of-speech tagging, the hidden states represent the part of speech categories, not the words themselves.
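Statement (b) can be illustrated with a toy corpus: a unigram language model is simply each word's relative frequency among all tokens:

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
counts = Counter(corpus)
total = sum(counts.values())

# Unigram probability: count of the word divided by the total token count.
unigram_p = {w: c / total for w, c in counts.items()}
print(unigram_p["the"])  # 2/6 ≈ 0.333
```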





****************************************************************************************
****************************************************************************************




Answer to Question 2-1
a) To build a lightweight sentence classifier with pretrained word embedding vectors of 300 dimensions and 3 classification labels {happy, neutral, sad}, a simple model can be designed as follows:

Input: 
- Input will be a sequence of word indices representing the words in the sentence.
- Shape of input: (max_sentence_length,) where max_sentence_length is the maximum number of words in a sentence.

Intermediate Operations:
1. Embedding Layer:
- Maps the input word indices to their corresponding word embedding vectors.
- Shape of the embedding layer output: (max_sentence_length, embedding_dim) where embedding_dim is 300 in this case.

2. Global Average Pooling:
- Computes the average of all word vectors to get a fixed-length representation of the sentence.
- Shape after pooling: (embedding_dim,)

3. Dense Layer:
- A fully connected layer for classification with 3 output units (for happy, neutral, sad).
- Shape of the weight matrix: (embedding_dim, 3)

Output:
- Softmax activation to obtain the predicted probabilities for each class.
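An end-to-end sketch of this architecture in plain Python, with a hypothetical five-word vocabulary and random (untrained) weights standing in for the pretrained embeddings and the dense layer:

```python
import math, random

random.seed(0)
embed_dim, num_classes = 300, 3

# Frozen "pretrained" embeddings for a tiny illustrative vocabulary.
embeddings = {w: [random.gauss(0, 1) for _ in range(embed_dim)]
              for w in ["i", "love", "nlp", "a", "lot"]}
# Dense layer: (embed_dim x 3) weight matrix plus 3 biases.
W = [[random.gauss(0, 0.1) for _ in range(num_classes)] for _ in range(embed_dim)]
b = [0.0] * num_classes

def classify(sentence):
    vecs = [embeddings[w] for w in sentence.lower().split()]
    # Global average pooling over word vectors -> fixed length (300,)
    pooled = [sum(col) / len(vecs) for col in zip(*vecs)]
    # Dense projection -> 3 logits, then softmax -> class probabilities
    logits = [sum(p * W[i][j] for i, p in enumerate(pooled)) + b[j]
              for j in range(num_classes)]
    exps = [math.exp(z - max(logits)) for z in logits]
    return [e / sum(exps) for e in exps]

probs = classify("I love NLP a lot")
print(probs)  # three probabilities (happy, neutral, sad) summing to 1
```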

b) Two reasons why the model from the previous subquestion may not be suitable for audio utterance classification with 80-dimensional spectrograms are:
1. Dimensionality Mismatch:
- The input shape of the spectrograms (80-dimensional) does not match the shape expected by the embedding layer (for word indices).
2. Context Preservation:
- The word embedding approach may not effectively capture the temporal relationships present in audio data, as it is designed for sequential word data rather than spectrograms.

c) For the improved audio utterance classification model with 80-dimensional spectrograms and the same 3 classification labels {happy, neutral, sad}, a suitable model can be designed as follows:

Input:
- Input will be the spectrogram data representing the audio utterance.
- Shape of input: (time_steps, frequency_bins) where time_steps is the number of time steps in the spectrogram and frequency_bins is the number of frequency bins.

Intermediate Operations:
1. Convolutional Layer:
- 1D convolutional layers can capture local patterns in the spectrogram data effectively.
- Shape of the convolutional layer output will depend on the chosen filters and kernel sizes.

2. Global Pooling:
- Global pooling (e.g., Global Max Pooling) to aggregate the convolutional features into a fixed-length representation irrespective of input length.
- Shape after pooling: (num_filters,)

3. Dense Layer:
- A fully connected layer for classification with 3 output units (for happy, neutral, sad).
- Shape of the weight matrix: (num_filters, 3)

Output:
- Softmax activation to obtain the predicted probabilities for each class.
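The key property of part (c), a fixed-size output regardless of utterance length, can be sketched with toy random 1-D convolution filters followed by global max pooling (dimensions and weights are illustrative only):

```python
import random

random.seed(0)
freq_bins, num_filters, kernel = 80, 4, 3

# Each 1-D convolution filter spans `kernel` time steps over all 80 bins.
filters = [[[random.gauss(0, 0.1) for _ in range(freq_bins)]
            for _ in range(kernel)] for _ in range(num_filters)]

def conv_and_pool(spectrogram):
    """spectrogram: list of time steps, each a list of 80 bin values."""
    T = len(spectrogram)
    pooled = []
    for f in filters:
        # Valid 1-D convolution along time: output length T - kernel + 1.
        acts = [sum(f[k][j] * spectrogram[t + k][j]
                    for k in range(kernel) for j in range(freq_bins))
                for t in range(T - kernel + 1)]
        # Global max pooling makes the result independent of T.
        pooled.append(max(acts))
    return pooled  # shape (num_filters,), regardless of utterance length

short = [[0.1] * freq_bins for _ in range(10)]
long_ = [[0.1] * freq_bins for _ in range(50)]
print(len(conv_and_pool(short)), len(conv_and_pool(long_)))  # 4 4
```

A dense layer with softmax over the pooled vector, as in part (a), would then produce the three class probabilities.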





****************************************************************************************
****************************************************************************************




Answer to Question 2-2
a. The sequence classification approach suggested by your boss may not be optimal for dialog act identification because it does not take into account the contextual information provided by previous utterances in the dialog. In the given dialog example, a mistake that the model will certainly make with this approach is in assigning the label "symptom_kneeSwelling" to the doctor's utterance "For a week, right?" which should actually be labeled as "medication". This mistake occurs because the model only looks at the current utterance without considering the overall context of the conversation.

b. I would model the task as a sequence labeling problem rather than a sequence generation problem. In sequence labeling, each token (or in this case, each utterance) is assigned a label, whereas in sequence generation, the model generates a sequence of tokens or labels from scratch. For dialog act identification, sequence labeling is preferred because we have a fixed set of predefined dialog act labels and we want to classify each utterance into one of these labels. This aligns with the task of assigning semantic labels to individual utterances in a dialog.

c. To model the dialog act identification task as a sequence labeling problem, we can use a bidirectional Long Short-Term Memory (LSTM) neural network. 

- Input: Each utterance in the dialog is encoded using a pre-trained sentence encoder to obtain fixed-size vectors representing the utterances. These vectors are then arranged into a matrix of dimensions (number_of_utterances) x (d), where d is the sentence embedding dimension.
- Intermediate operations: The bidirectional LSTM processes the matrix of utterance embeddings to capture the contextual information from both past and future utterances in the dialog.
- Output: The output layer of the model predicts the dialog act label for each utterance, performing sequence labeling by assigning a label to each utterance in the dialog based on the contextual information captured by the LSTM.
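A dependency-free sketch of this design, using a vanilla bidirectional RNN in place of the LSTM (an LSTM adds gating on top of the same recurrence); the dimensions, label set, and random weights are illustrative only:

```python
import math, random

random.seed(0)
d, h = 4, 3                                    # toy embedding / hidden sizes
labels = ["question", "symptom", "medication"]  # hypothetical dialog acts

def rand_mat(r, c):
    return [[random.gauss(0, 0.3) for _ in range(c)] for _ in range(r)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn(utter_vecs, W_x, W_h):
    # One direction of a simple recurrent pass over utterance embeddings.
    hidden, states = [0.0] * h, []
    for v in utter_vecs:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_x, v), matvec(W_h, hidden))]
        states.append(hidden)
    return states

W_xf, W_hf = rand_mat(h, d), rand_mat(h, h)   # forward direction
W_xb, W_hb = rand_mat(h, d), rand_mat(h, h)   # backward direction
W_out = rand_mat(len(labels), 2 * h)          # per-utterance label scores

def tag_dialog(utter_vecs):
    fwd = rnn(utter_vecs, W_xf, W_hf)
    bwd = list(reversed(rnn(list(reversed(utter_vecs)), W_xb, W_hb)))
    # Each utterance sees past (fwd) and future (bwd) dialog context.
    return [labels[max(range(len(labels)),
                       key=lambda j: matvec(W_out, f + b)[j])]
            for f, b in zip(fwd, bwd)]

dialog = [[0.1 * i] * d for i in range(5)]    # 5 utterance embeddings
print(tag_dialog(dialog))                      # one label per utterance
```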





****************************************************************************************
****************************************************************************************




Answer to Question 3-1
**Question:**

**Transformer, Self-Attention**

**a. The Transformer decoder is autoregressive. Explain what "autoregressive" means in this case.**

In the context of the Transformer decoder, "autoregressive" means that during the decoding process, the model predicts one output token at a time based on the previously generated tokens. It generates the output sequentially, considering the preceding tokens in the output sequence. This helps in capturing dependencies and maintaining the order of the generated tokens.

**b. Explain why the Transformer decoder self-attention must be partially masked out during training.**

The Transformer decoder self-attention needs to be partially masked out during training to prevent the model from attending to future tokens in the sequence. By masking out future tokens in the self-attention mechanism, the model ensures that it only considers information from tokens that have already been generated. This helps in maintaining the autoregressive property and ensures that the model generates tokens based only on past information, as intended.

**c.**

The decoder self-attention mask for the target sequence (BoS, E, F, G) is causal: each query position may attend only to itself and earlier positions.

- BoS → BoS
- E → BoS, E
- F → BoS, E, F
- G → BoS, E, F, G

**Figure Path:** Decoder_Self-Attention_Weights.jpg

**d.** 

Without positional encodings, the attention weights depend only on the word embeddings, not on their positions.

Given:
- Sequences: "John loves Mary" and "Mary loves John"
- Word embedding vectors: $[\mathbf{x}_{\text{John}}; \mathbf{x}_{\text{loves}}; \mathbf{x}_{\text{Mary}}] \in \mathbb{R}^{3{\times}d}$
- Attention query and key at position $i$: $\mathbf{q}_i = \mathbf{x}_i\mathbf{W}^Q$ and $\mathbf{k}_i = \mathbf{x}_i\mathbf{W}^K$
- Attention weights for querying from the word "Mary": $\bm{\alpha}_{\text{Mary}}$

The unnormalized score between "Mary" and any word $w$ is $s_w = \mathbf{q}_{\text{Mary}}\cdot\mathbf{k}_w / \sqrt{d}$, which is a function of the two word embeddings alone. Reordering the sequence from "John loves Mary" to "Mary loves John" only changes which position each score sits at; the set of scores $\{s_{\text{John}}, s_{\text{loves}}, s_{\text{Mary}}\}$ is unchanged, and the softmax normalizes over the same values. Hence $\bm{\alpha}_{\text{Mary}}$ contains the same weights (merely permuted to follow the words), and the attention output for "Mary", $\sum_w \alpha_{\text{Mary},w}\,\mathbf{v}_w$, is identical for both orderings. This shows that self-attention without positional information is permutation-invariant, which is precisely why Transformers need positional encodings.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2
a. When adapting a summarization model trained on news to the medical domain and encountering many unknown words that the model cannot represent or generate, there are several solutions to consider:
1. **Medical Terminology Expansion**: Incorporate a medical dictionary or ontology to expand the model's vocabulary with domain-specific terms.
2. **Data Augmentation**: Augment the training data with medical text to expose the model to a wider range of vocabulary.
3. **Subword Tokenization**: Utilize subword tokenization techniques like Byte Pair Encoding (BPE) or SentencePiece to handle out-of-vocabulary words by breaking them down into subword units.

b. ROUGE-n is based on the calculation of n-gram overlap between the generated summary and the reference summary. It is a widely used metric for evaluating the quality of summaries by comparing sequences of words.
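ROUGE-n recall can be sketched in a few lines (real implementations also report precision and F1, and may apply stemming; the example sentences are illustrative):

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    """Recall-oriented ROUGE-n: clipped n-gram overlap / reference n-gram count."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

ref = "the patient was discharged after surgery"
print(rouge_n("the patient was discharged", ref))  # 3 of 5 reference bigrams -> 0.6
print(rouge_n("the the the the", ref))             # no matching bigrams -> 0.0
```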

c. The model may receive high ROUGE-2 scores despite non-grammatical output like repeated phrases because ROUGE-2 only measures bigram overlap, not grammaticality or fluency: a repeated phrase that matches the reference keeps contributing overlapping bigrams. One alternative metric is CIDEr (Consensus-based Image Description Evaluation), whose TF-IDF weighting and normalized n-gram similarity give diminishing returns to repeating the same n-grams, so repetitive output is rewarded less.

To reduce repetition in the generated output, the following techniques can be used:
1. **Diversity Promotion**: Implement techniques such as diversity-promoting objective functions during training to encourage the model to produce varied and non-repetitive summaries.
2. **Beam Search Modification**: Adjust the beam search decoding algorithm to penalize repetitive n-grams, encouraging the model to explore more diverse summary options.
3. **Deduplication**: Post-processing step to remove duplicate or excessively repeated phrases in the generated summary.
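Point 2 (beam search modification) is often implemented as n-gram blocking: a hypothesis is pruned if its next token would recreate an n-gram it already contains. A minimal sketch of the check, on hypothetical tokens:

```python
def violates_block(tokens, next_token, n=2):
    """True if appending next_token would repeat an n-gram already generated."""
    if len(tokens) < n - 1:
        return False
    cand = tuple(tokens[-(n - 1):] + [next_token])
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return cand in seen

out = ["the", "scan", "showed", "the"]
print(violates_block(out, "scan"))      # True: "the scan" was already generated
print(violates_block(out, "fracture"))  # False: "the fracture" is new
```

During decoding, a beam candidate failing this check is either discarded or heavily penalized.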

Figure path: N/A





****************************************************************************************
****************************************************************************************




Answer to Question 3-3
a. 
Advantage: By using BERT as a pretrained text encoder with CTC for machine translation, the model can benefit from the rich semantic representations learned from a large corpus of text data. This can potentially improve the quality of the translations by capturing complex linguistic patterns and structures.

Disadvantage: One potential disadvantage is that BERT may not be specifically designed or optimized for the task of machine translation. As a result, the model may struggle with capturing the necessary nuances and context required for accurately translating between languages.

b. If the performance of the BERT-with-CTC model for machine translation is weak, one way to improve it while keeping BERT as the encoder is to fine-tune BERT on a parallel corpus of translated text. Fine-tuning on the translation task helps BERT learn contextual representations suited to translation: by training on parallel data, the model learns the relationships between words in the source and target languages and can generate more accurate translations.
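For reference, CTC decoding reduces a frame-level (here, per-source-position) label sequence to an output sequence by merging consecutive repeats and then removing blanks; a minimal sketch with "-" as the blank symbol:

```python
def ctc_collapse(frames, blank="-"):
    """Collapse a frame-level CTC output: merge repeats, then drop blanks."""
    out, prev = [], None
    for f in frames:
        if f != prev and f != blank:
            out.append(f)
        prev = f
    return out

# Position-wise argmax predictions collapse to the final output tokens:
print(ctc_collapse(["h", "h", "-", "e", "e", "-", "-", "y"]))  # ['h', 'e', 'y']
```

The blank symbol is also what lets CTC emit the same token twice in a row, as in the second test below.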

Figure: N/A





****************************************************************************************
****************************************************************************************




Answer to Question 3-4
a) To create a model for the text-to-SQL task using the 30,000 training examples, you can follow these steps:

1. **Input Representation**: Convert the input data (table name, column names, question) into a format that the model can understand (e.g., tokenization, embedding).

2. **SQL Query Generation**: Design the model to output the SQL query structure ("SELECT column_name FROM table_name") based on the input question.

3. **Training Data**: Use the 30,000 training examples to train the model on how to map questions to SQL queries accurately.

4. **Natural Language Understanding**: Incorporate methods for natural language understanding to accurately interpret the meaning of the questions.

5. **Error Handling**: Implement mechanisms to detect and handle errors, such as invalid queries for non-existing tables or columns.

6. **Validation**: Validate the model using a separate set of validation examples to ensure its performance accuracy.

7. **Fine-tuning**: Fine-tune the model based on feedback and evaluation results to improve its accuracy and generalization.
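The mapping in step 2 can be caricatured with a toy heuristic, a hypothetical stand-in for the learned column pointer (a real model would score columns with a neural network trained on the 30,000 examples):

```python
def to_sql(table, columns, question):
    """Toy column selection: pick the column whose name shares the most
    words with the question, then fill the fixed SELECT template."""
    q_words = set(question.lower().replace("?", "").split())
    best = max(columns, key=lambda c: len(set(c.lower().split("_")) & q_words))
    return f"SELECT {best} FROM {table}"

print(to_sql("employees", ["name", "salary", "hire_date"],
             "What is the salary of John?"))
# SELECT salary FROM employees
```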

b) In order to handle unanswerable questions like "Who is the Chancellor of Germany?" in the text-to-SQL model, you can implement the following adaptation:

1. **Question Classification**: Integrate a question classifier that can determine if a question is answerable based on the available table and column information.

2. **Condition Based Generation**: Modify the model to generate a default response or handle such unanswerable questions by returning a predefined message instead of attempting to create an SQL query.

3. **Thresholding**: Set a threshold for question relevance where if the model's confidence in generating a valid query falls below it, then it should identify the question as unanswerable and respond accordingly.

4. **Error Logging**: Log unanswerable questions during inference to analyze the model performance and understand the types of questions it struggles to handle.

By incorporating these strategies, the text-to-SQL model can effectively adapt to handle unanswerable questions and improve its overall performance in generating SQL queries.





****************************************************************************************
****************************************************************************************




Answer to Question 4-1
a. The advantage of adding adapters in each BERT layer is that it allows for fine-tuning specific task-related information without affecting the pre-trained BERT parameters. This can help prevent catastrophic forgetting and overfitting on the small dataset by only adapting a small number of parameters.

b. To insert the adapters in the BERT architecture, we would insert them after the self-attention mechanism and before the intermediate feedforward layer in each BertLayer block. Specifically, the adapters would be added right after the BertSelfOutput layer and before the BertIntermediate layer in each of the BertLayer blocks (0-11).

c. Each adapter contains two linear projections: a down-projection from 768 to 256 dimensions (768 × 256 weights + 256 biases = 196,864 parameters) and an up-projection from 256 back to 768 dimensions (256 × 768 weights + 768 biases = 197,376 parameters), i.e. 394,240 parameters per adapter. With one adapter in each of the 12 BertLayer blocks, the adapters add 12 × 394,240 = 4,730,880 parameters in total (about 4.7M, compared to roughly 110M in BERT-base).
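Carrying out the count (assuming one adapter per layer as described in part b, a 256-dimensional bottleneck, hidden size 768, and bias terms included):

```python
layers, hidden, bottleneck = 12, 768, 256

# Each adapter: down-projection (768 -> 256) and up-projection (256 -> 768),
# both with bias terms.
down = hidden * bottleneck + bottleneck   # 196,864
up = bottleneck * hidden + hidden         # 197,376
per_adapter = down + up                   # 394,240
total = layers * per_adapter
print(total)  # 4730880 (~4.7M added parameters)
```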





****************************************************************************************
****************************************************************************************




Answer to Question 4-2
a. When creating fixed-size vectors for passages and questions, DPR uses sentence representations from the BERT CLS token. The main differences between this approach and pooling (e.g. meanpool or maxpool) the word vectors in the passages or questions as the sentence representation are:

1. **BERT CLS Token Approach**:
   - DPR uses the output of the BERT model corresponding to the [CLS] token as the sentence representation for both passages and questions.
   - This representation captures contextual information from the entire input sequence by considering the interactions between words in the passage or question.
   - The [CLS] token representation is learned during pre-training of the BERT model on a large corpus, which helps in capturing complex patterns and relationships in the text.

2. **Pooling (e.g. meanpool or maxpool) Approach**:
   - Pooling techniques like meanpooling or maxpooling calculate a fixed-size vector by aggregating the word embeddings in the passage or question.
   - Meanpooling takes the average of the word embeddings, while maxpooling takes the maximum value along each dimension from the word embeddings.
   - These pooling techniques do not capture the contextual dependencies between words and may lose important sequential information present in the text.

**Advantages of BERT CLS Token Approach**:
- The BERT CLS token approach considers the contextual information of the entire text, capturing nuanced relationships between words.
- It benefits from pre-training on a large corpus, which helps in learning rich representations.
- The [CLS] token representation has been shown to be effective in various NLP tasks due to its ability to capture semantic information.

Figure: figures/bert_class_bw.png 

b. In DPR, the training objective includes relevant question-passage pairs as well as irrelevant/negative pairs for the following reasons:

- **Importance of Negative Pairs**:
  Including irrelevant or negative pairs in the training objective helps the model distinguish between relevant and irrelevant passages for a given question.
  It enables the model to learn a clear decision boundary by pushing relevant pairs closer together and pushing irrelevant pairs farther apart in the vector space.

- **Consequences of Leaving Out Irrelevant/Negative Pairs**:
  If we leave out the irrelevant/negative pairs in the training objective:
  - The model may not effectively learn to distinguish between relevant and irrelevant passages.
  - It could result in a less robust retrieval system where the model is unable to differentiate between correct and incorrect answers.
  - Without negative examples, the model may struggle to generalize to unseen data and may be prone to retrieving incorrect passages as potential answers.
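DPR's training objective can be sketched as a softmax over similarity scores, where the other passages act as in-batch negatives (the scores below are toy numbers; real training uses dot products of the BERT CLS vectors for the question and each passage):

```python
import math

def nll_in_batch(sim_row, pos_index):
    """Negative log-likelihood of the positive passage under a softmax over
    one question's similarity scores to all passages in the batch."""
    exps = [math.exp(s) for s in sim_row]
    return -math.log(exps[pos_index] / sum(exps))

# Dot products between one question vector and 3 passage vectors;
# passage 0 is the relevant one, the others are in-batch negatives.
print(nll_in_batch([5.0, 1.0, 0.5], pos_index=0))  # small loss: positive clearly wins
print(nll_in_batch([1.0, 1.0, 1.0], pos_index=0))  # log(3) ≈ 1.10: no separation
```

Minimizing this loss simultaneously pulls the relevant pair together and pushes the negatives away, which is exactly the decision boundary described above.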





****************************************************************************************
****************************************************************************************




