Answer to Question 1-1


Answer:

a) In word embeddings, context is captured through the co-occurrence of words in a large corpus: words that appear in similar contexts are placed close to each other in the embedding space. For example, "king" and "queen" end up near each other because they often occur in similar contexts. The main difference from TF-IDF is that word embeddings are dense vector representations that capture semantic relationships between words, whereas TF-IDF is a numerical statistic reflecting how important a word is to a document within a collection. TF-IDF does not represent a word's meaning; it only scores the relevance of a word to a particular document.
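The contrast can be made concrete with a tiny TF-IDF computation; the toy corpus below is a made-up example:

```python
import math

# Toy corpus; documents are pre-tokenized lists of words (hypothetical example).
docs = [
    ["the", "king", "rules", "the", "land"],
    ["the", "queen", "rules", "the", "court"],
    ["the", "cat", "sat"],
]

def tf_idf(term, doc, docs):
    """TF-IDF score of `term` in `doc`: term frequency times
    inverse document frequency over the whole collection."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "the" appears in every document, so its IDF (and hence TF-IDF) is zero,
# while the rarer "king" gets a positive score. Neither score says anything
# about what the words mean -- that is exactly what embeddings add.
print(tf_idf("the", docs[0], docs))       # 0.0
print(tf_idf("king", docs[0], docs) > 0)  # True
```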





****************************************************************************************
****************************************************************************************




Answer to Question 1-2


Answer:

For the given sentence "I love NLP a lot", the segmentation into subwords using byte-pair encoding (BPE) would involve splitting the sentence into individual words and then applying the BPE encoding to each word. Since the provided BPE codes do not include any space or punctuation symbols, we first need to tokenize the sentence.

a. Segmentation of the sentence "I love NLP a lot" into subwords:

1. I
2. love
3. NLP
4. a
5. lot

Now, we apply the BPE merges to each word. A valid BPE segmentation must partition the word into non-overlapping units, so "NLP" cannot be split into overlapping pairs; assuming the provided codes contain the merge (N, L) but no merge covering the whole word, greedy merging gives:

1. I -> I
2. love -> love
3. NLP -> NL P
4. a -> a
5. lot -> lot

Therefore, the sentence "I love NLP a lot" segmented into subwords with byte-pair encoding (BPE) is: ["I", "love", "NL", "P", "a", "lot"]

Note: the exact split of "NLP" depends on which pairs appear in the provided BPE codes (e.g. the merge (L, P) would instead give "N LP"); the key point is that a word not covered by the merges falls back to smaller subword units, while frequent words such as "love" remain intact.
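The greedy merge procedure can be sketched in a few lines of Python; the merge table below is hypothetical, standing in for the codes given in the question:

```python
def bpe_segment(word, merges):
    """Greedy BPE: start from single characters and repeatedly apply the
    highest-priority merge (lowest index in `merges`) that occurs."""
    symbols = list(word)
    while True:
        best, best_i = None, None
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            if pair in merges:
                if best is None or merges.index(pair) < merges.index(best):
                    best, best_i = pair, i
        if best is None:
            return symbols
        # Replace the adjacent pair with its merged symbol.
        symbols[best_i:best_i + 2] = [best[0] + best[1]]

# Hypothetical merge table (the exam's actual BPE codes are not shown here).
merges = [("N", "L"), ("l", "o"), ("lo", "v"), ("lov", "e")]
print(bpe_segment("NLP", merges))   # ['NL', 'P']
print(bpe_segment("love", merges))  # ['love']
```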





****************************************************************************************
****************************************************************************************




Answer to Question 1-3


Answer:

a) In the skip-gram model, each word has a 300-dimensional output ("context") vector, so the output projection for each word vector has shape 300x1. The score for a (centre word, context word) pair is the dot product of the centre word's input vector and the context word's output vector, a single scalar, which is then normalized over the vocabulary with a softmax.

b) No, Bart's training pipeline is not necessarily broken. News headlines are short, typically well under 30 words, so a context window of 30, 40, or 50 words is clipped at the sentence boundary and always covers the entire headline. All three settings therefore produce exactly the same (centre word, context word) training pairs, and identical word vectors are the expected outcome. To actually see an effect of the window size, Bart would need to vary it below the typical headline length (e.g. 2-10 words) or train on longer documents.
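A quick sketch makes the clipping effect visible; the headline is a made-up example:

```python
def context_indices(tokens, center, window):
    """Indices of skip-gram context words for a given window size;
    the window is clipped at the sentence boundaries."""
    lo = max(0, center - window)
    hi = min(len(tokens), center + window + 1)
    return [i for i in range(lo, hi) if i != center]

headline = "stocks rally as markets react to surprise rate cut".split()  # 9 tokens
for w in (30, 40, 50):
    # Every window larger than the headline yields the same context set.
    print(w, context_indices(headline, 4, w))
```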





****************************************************************************************
****************************************************************************************




Answer to Question 1-4


a. To process morphologically-rich languages, using subwords is better than whole words.
Answer:
True, because morphologically-rich languages often have complex inflectional and derivational forms, and subword units can capture the shared morphological features and relationships between these forms more effectively than whole words.

b. If we have the number of occurrences of each word in a corpus, we can derive a unigram language model.
Answer:
True. A unigram language model estimates the probability of each word from its relative frequency in the corpus, P(w) = count(w) / N, so it can be derived directly from the word counts.
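The derivation takes only a few lines; the toy corpus is a made-up example:

```python
from collections import Counter

tokens = "the cat sat on the mat the cat".split()
counts = Counter(tokens)
total = sum(counts.values())

# Maximum-likelihood unigram model: P(w) = count(w) / N.
unigram = {w: c / total for w, c in counts.items()}
print(unigram["the"])  # 3/8 = 0.375

# A sentence probability under the model is the product of its word probabilities.
p = 1.0
for w in "the cat".split():
    p *= unigram[w]
```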

c. One-hot word representations can be used to measure the semantic difference between two words.
Answer:
False, because a one-hot representation only encodes the identity of a word: any two distinct words have a dot product of 0 and are equidistant from each other, so the representation carries no information about meaning or similarity between words.

d. In latent Dirichlet allocation (LDA), a document is modeled as a distribution over words.
Answer:
True. In LDA, each document is represented as a probability distribution over a set of latent topics, and each topic is a probability distribution over words; marginalizing over the topics, a document is therefore modeled as a distribution over words.

e. In term frequency-inverse document frequency (TF-IDF), the term frequency decreases the importance of words that occur in many documents (e.g. stopwords).
Answer:
False. It is the inverse document frequency (IDF) component, not the term frequency, that decreases the importance of words occurring in many documents: IDF is low for words that appear in most documents (e.g. stopwords) and high for rare, discriminative words. The term frequency only measures how often a word occurs within a single document.

f. When hidden Markov models (HMMs) are used for part-of-speech tagging, the hidden states are the words.
Answer:
False, because in HMMs for part-of-speech tagging, the hidden states typically represent the underlying part-of-speech tags or categories, and the observations are the words or sequences of words that are emitted from the hidden states.





****************************************************************************************
****************************************************************************************




Answer to Question 2-1


Answer:

a) For a lightweight sentence classifier with 3 labels (happy, neutral, sad), a simple multi-layer perceptron (MLP) with one hidden layer is sufficient. The word embeddings of the sentence are first pooled (e.g. averaged) into a single 300-dimensional sentence vector. The intermediate operation is an affine projection followed by a ReLU activation; the output layer is an affine projection followed by a softmax over the three classes. The parameter shapes (ignoring biases) are:
- Hidden layer weights: (300, hidden_units), e.g. (300, 64)
- Output layer weights: (hidden_units, 3)
The activations have shapes (batch_size, 300) at the input, (batch_size, hidden_units) after the hidden layer, and (batch_size, 3) at the output.
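A minimal forward-pass sketch of such a classifier (pure Python, with a hypothetical hidden size of 64 and random weights standing in for trained ones):

```python
import math, random

random.seed(0)

def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

D, H, C = 300, 64, 3   # embedding dim, hidden units (hypothetical), classes
W1 = [[random.gauss(0, 0.01) for _ in range(D)] for _ in range(H)]
W2 = [[random.gauss(0, 0.01) for _ in range(H)] for _ in range(C)]

x = [random.gauss(0, 1) for _ in range(D)]   # pooled 300-d sentence vector
h = [max(0.0, v) for v in matvec(W1, x)]     # ReLU(W1 x), shape (64,)
probs = softmax(matvec(W2, h))               # shape (3,), sums to 1
print(len(h), len(probs), round(sum(probs), 6))  # 64 3 1.0
```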

b) The model from subquestion a) is not suitable for the audio utterance classification task for the following reasons:
1. The input dimensionality differs: each spectrogram frame is 80-dimensional, not 300-dimensional, so the input projection no longer matches.
2. Unlike word embeddings, the spectrogram frames are fixed acoustic features rather than trainable parameters, and an utterance is a variable-length sequence of such frames; simply pooling the frames discards the temporal structure that distinguishes utterances.

c) For the audio utterance classification task, a suitable model is a convolutional neural network (CNN): one or more convolutional layers over the spectrogram, each followed by a pooling operation, then global pooling over time and one or more fully connected layers ending in a softmax over the three classes. The intermediate operations are the convolutions, which detect local spectro-temporal patterns, and the pooling steps, which reduce the resolution and add robustness to small shifts. With "valid" convolutions and non-overlapping pooling, the activation shapes are:
- Input: (batch_size, time_steps, 80)
- After a convolution with num_filters filters of size filter_size: (batch_size, time_steps - filter_size + 1, 80 - filter_size + 1, num_filters)
- After pooling with window pool_size: (batch_size, (time_steps - filter_size + 1) // pool_size, (80 - filter_size + 1) // pool_size, num_filters)
- Output of the fully connected classifier: (batch_size, 3)
A global pooling step over the time axis before the fully connected layers makes the model independent of the utterance length.
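The shape arithmetic can be checked with a small helper; the frame count, filter size, and pooling window below are hypothetical choices:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one axis of a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical sizes: 120 spectrogram frames, 5-frame filters, 2x pooling.
t = conv_out(120, kernel=5)           # 116 frames after the "valid" convolution
t = conv_out(t, kernel=2, stride=2)   # 58 frames after non-overlapping 2x pooling
print(t)  # 58
```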





****************************************************************************************
****************************************************************************************




Answer to Question 2-2


Answer:

a) A pure sentence classification approach, where the input is a single utterance and the output is its label, ignores the dialog context, and the same surface form can carry different dialog acts depending on where it occurs. In the example dialog, knee pain and knee swelling are each mentioned more than once: the first mention of knee pain is an informative statement (symptom_kneePain), whereas a later repetition in response to the doctor's question serves a different function (e.g. a confirmation), and the repeated mention of knee swelling is a correction of the doctor's misunderstanding rather than a new symptom report. A classifier that sees each utterance in isolation cannot distinguish these cases: it will assign the same label to identical utterances regardless of context, leading to incorrect labels.

b) Given a good sentence encoder that represents each utterance as a fixed-size vector, dialog act identification is best modeled as sequence labeling: the dialog is a sequence of utterance vectors, and the model assigns exactly one label to each position. This is preferred over sequence generation because the output is guaranteed to align one-to-one with the input utterances, whereas a generation model must additionally learn this alignment and may emit too few or too many labels. Sequence labeling still captures the dependencies between consecutive dialog acts (e.g. a correction typically follows a misunderstanding) while keeping the output space small and the training signal direct.

c) A suitable model is a recurrent neural network (RNN) with long short-term memory (LSTM) cells used as a sequence labeler. The input is the sequence of sentence embeddings, one per utterance in the dialog. At each time step, the LSTM computes its hidden state from the current utterance embedding and the previous hidden state; this recurrence is the intermediate operation that carries context across turns and lets the model capture dependencies between dialog acts. A linear layer followed by a softmax maps each hidden state to a probability distribution over the dialog act labels, and the label with the highest probability is assigned to that utterance. The output is thus a sequence of dialog act labels, one per utterance. (A bidirectional LSTM can additionally condition each label on later turns.)
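The recurrence can be sketched with a plain Elman-style step (a simplification of the LSTM cell; sizes and weights below are hypothetical):

```python
import math, random

random.seed(0)

def step(x, h, Wx, Wh):
    """One simple RNN step: h' = tanh(Wx x + Wh h) (biases omitted)."""
    return [math.tanh(sum(wx * xi for wx, xi in zip(Wx[j], x)) +
                      sum(wh * hi for wh, hi in zip(Wh[j], h)))
            for j in range(len(h))]

D, H = 4, 3  # toy utterance-embedding and hidden sizes (hypothetical)
Wx = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
Wh = [[random.gauss(0, 0.5) for _ in range(H)] for _ in range(H)]

dialog = [[random.gauss(0, 1) for _ in range(D)] for _ in range(5)]  # 5 utterances
h = [0.0] * H
states = []
for x in dialog:
    h = step(x, h, Wx, Wh)   # hidden state carries context across turns
    states.append(h)
# One hidden state per utterance -> one label per utterance after a softmax layer.
print(len(states), len(states[0]))  # 5 3
```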





****************************************************************************************
****************************************************************************************




Answer to Question 3-1


Answer:

a) In the context of the Transformer decoder, autoregressive means that the decoder generates each output token based on the previously generated tokens and the input sequence, rather than generating all output tokens in parallel based on the entire input sequence. This allows the model to condition its predictions on the previously generated tokens, which can improve the model's ability to generate coherent and contextually appropriate sequences.

b) The Transformer decoder self-attention must be partially masked out during training to prevent the model from attending to future tokens in the input sequence, which would result in the model having access to information that it should not yet have. This is important for maintaining the autoregressive nature of the decoder and ensuring that the model generates each token based only on the previously generated tokens and the input sequence.

c) The weights that must be masked out are those where the attended (key) position comes after the current (query) position, i.e. the entries above the main diagonal of the attention-weight table: each row may only attend to its own position and earlier ones. The remaining weights, on and below the diagonal, are used to compute the self-attention during training.
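The causal mask for a short sequence can be built directly, which makes the keep/mask pattern explicit:

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j,
    i.e. only present and past positions (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

# For a 4-token sequence: row i keeps positions 0..i and masks the rest.
for row in causal_mask(4):
    print(["keep" if m else "mask" for m in row])
```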

d) To show that $\bm{\alpha}_{\texttt{Mary}}$ is the same for the sequences "John loves Mary" and "Mary loves John", note that without positional encodings, self-attention depends only on the word embeddings and not on their positions. Let $\bm{Q}$, $\bm{K}$, and $\bm{V}$ be the query, key, and value matrices, each obtained by applying a fixed linear map to the matrix of word embeddings, and let $\bm{A}$ be the attention weight matrix:

$$\bm{A} = \mathrm{softmax}\left(\frac{\bm{Q}\bm{K}^{\top}}{\sqrt{d}}\right), \qquad \bm{\alpha}_{\texttt{Mary}} = \bm{A}_{\texttt{Mary},:}$$

where $\bm{A}_{\texttt{Mary},:}$ is the row of $\bm{A}$ belonging to the position of "Mary". Swapping "John" and "Mary" permutes the rows of the embedding matrix, and hence permutes the rows of $\bm{Q}$, $\bm{K}$, and $\bm{V}$ in exactly the same way. Since the softmax is applied row-wise, the row of $\bm{A}$ at the position of "Mary" contains the same weights as before, merely reordered along with the permuted key positions: the attention that "Mary" pays to "John", "loves", and itself is unchanged. Because the word embeddings themselves do not depend on word order, $\bm{\alpha}_{\texttt{Mary}}$ is therefore identical for both sequences.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2


Answer:

a) Two solutions to the problem of unknown words in a summarization model adapted from news to the medical domain are:
1. Pre-processing: Before feeding the text into the model, perform extensive pre-processing to expand the vocabulary. This can be done by using a medical thesaurus or a large medical corpus to map the unknown words to their synonyms or medical terms. This will help the model to understand the context and generate appropriate summaries.
2. Fine-tuning: Fine-tune the model on a large medical corpus to learn the medical domain-specific terminology and context. This will help the model to generate more accurate and relevant summaries, even with unknown words.

b) ROUGE-n (Recall-Oriented Understudy for Gisting Evaluation) evaluates a generated summary by its n-gram overlap with one or more reference summaries: it counts how many n-grams of the reference also appear in the candidate (primarily as recall, with precision and F1 variants also reported). For example, ROUGE-2 measures the overlap of bigrams (pairs of consecutive words) between the generated summary and the reference summary.

c) The model can receive high ROUGE-2 scores despite non-grammatical output because ROUGE-2 only counts bigram overlap: as long as the repeated phrase contains bigrams from the reference, the repetitions keep those matched bigrams in the candidate, and the score does not penalize redundancy or disfluency. Metrics with clipped n-gram counting expose this problem: BLEU's modified precision counts each reference n-gram at most as many times as it occurs in the reference, so repetition lowers the score, and METEOR adds stem/synonym matching together with a fragmentation penalty for scattered matches. To reduce the repetition itself, decoding-time techniques help: blocking already-generated n-grams, adding a coverage mechanism or diversity term to the objective, or using beam search with a repetition penalty or top-k sampling.
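A minimal sketch of BLEU-style clipped bigram counting, showing how repetition is penalized (sentences are made-up examples):

```python
from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def clipped_bigram_precision(candidate, reference):
    """Clipped precision: each reference bigram is matched at most
    as many times as it occurs in the reference."""
    cand, ref = Counter(bigrams(candidate)), Counter(bigrams(reference))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(1, sum(cand.values()))

ref = "the patient has knee pain".split()
degenerate = "knee pain knee pain knee pain".split()

# 5 candidate bigrams, but ("knee", "pain") is only credited once -> 1/5.
print(clipped_bigram_precision(degenerate, ref))  # 0.2
```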





****************************************************************************************
****************************************************************************************




Answer to Question 3-3


Answer:

a) An advantage of using BERT as the encoder is that it has been pre-trained on a large corpus of text, so it captures rich contextual information and semantics of the input, which can lead to more accurate translations. The disadvantage lies with the CTC loss: CTC assumes a monotonic alignment between input and output and can emit at most one output token per input position, whereas translation requires word reordering and may produce target sentences longer than the source, so CTC is a poor fit for the output side of this task.

b) One way to improve the model while keeping BERT as the encoder is to replace the CTC output layer with an autoregressive sequence-to-sequence (Seq2Seq) decoder attached through cross-attention. The decoder can generate a target sequence of any length and in any order relative to the source, removing CTC's monotonic-alignment and length constraints, and the attention mechanism lets each generated word focus on the most relevant parts of the input text, leading to more fluent, idiomatic, and contextually appropriate translations.





****************************************************************************************
****************************************************************************************




Answer to Question 3-4


Answer:

a) For the text-to-SQL task, an encoder-decoder model is suitable. The input is the natural language question concatenated with a linearized form of the table schema (table name and column names), so the encoder can relate question words to columns. The decoder then generates the SQL query, either token by token or by filling the slots of a query sketch (SELECT column, aggregation, WHERE conditions), with decoding constrained to the SQL grammar and the given schema so that only valid queries can be produced. The model is trained on a large dataset of (question, schema, query) triples to learn the mapping between natural language questions and SQL queries.

b) To handle unanswerable questions, the model's output space can be extended with a special symbol (e.g. a [NO_ANSWER] token, a hypothetical name) that the decoder emits instead of an SQL query whenever the question cannot be answered from the given table; the training data is augmented with unanswerable (question, schema) pairs labeled with this symbol. For the question "Who is the Chancellor of Germany?", which refers to information absent from the table, the model would output the special symbol rather than fabricating an invalid query (such as one filtering on a non-existent Position = 'Chancellor' value). Alternatively, a small answerability classifier can first decide whether the question is answerable from the schema before the SQL decoder is invoked.





****************************************************************************************
****************************************************************************************




Answer to Question 4-1


Answer:

a) The advantage of adding adapters is that they allow the model to learn task-specific parameters while keeping the BERT parameters frozen. This can help improve the model's performance on a specific task, such as named-entity recognition, without affecting the general understanding of language that BERT has learned.

b) The adapters should be inserted after the self-attention sub-layer in each BERT layer: directly after the self-attention output projection (the Linear layer with in_features=768 and out_features=768) and before the feed-forward network, so the adapter transforms the full 768-dimensional sub-layer output. (In the standard adapter design, a second adapter is also placed after the output projection of the feed-forward sub-layer.)

c) Each adapter consists of two linear projections: a down-projection with in_features=768 and out_features=256, and an up-projection with in_features=256 and out_features=768. The weight matrices therefore contribute 2 x (768 x 256) = 393,216 parameters per adapter (plus 256 + 768 = 1,024 bias parameters if biases are counted). With one adapter in each of the 12 BERT layers, the total is 12 x 393,216 = 4,718,592 weight parameters, or 12 x 394,240 = 4,730,880 including biases. (768 x 768 would be the cost of a full-rank projection; the point of the 256-dimensional bottleneck is precisely to avoid it.)
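The count is easy to verify in a few lines, assuming one bottleneck adapter (768 -> 256 -> 768, with biases) per layer:

```python
d_model, d_bottleneck, n_layers = 768, 256, 12

# Down- and up-projection weight matrices plus their bias vectors.
down = d_model * d_bottleneck + d_bottleneck   # 768*256 + 256
up   = d_bottleneck * d_model + d_model        # 256*768 + 768
per_adapter = down + up

print(per_adapter)              # 394240 per adapter (393216 without biases)
print(per_adapter * n_layers)   # 4730880 for one adapter in each of 12 layers
```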





****************************************************************************************
****************************************************************************************




Answer to Question 4-2


Answer:

a) DPR uses the BERT [CLS] token representation as the fixed-size vector for both passages and questions. This differs from pooling the word vectors (mean-pooling or max-pooling) in two ways. First, the [CLS] representation is produced by self-attention over the whole input and is trained end-to-end for the retrieval objective, so it can learn to aggregate exactly the information that matters for matching questions to passages, whereas mean- or max-pooling combines the token vectors with a fixed, untrained rule. Second, pooling treats all tokens symmetrically, while the [CLS] token can attend more heavily to the salient tokens. The advantage is therefore a learned, task-adapted sentence representation rather than a heuristic aggregate.

b) In DPR, including irrelevant/negative question-passage pairs in the training objective is essential: the contrastive loss maximizes the similarity of a question to its relevant passage relative to the negatives. Without negatives, the model could trivially minimize the loss by mapping every question and every passage to nearly the same vector, making all similarities high; it would never learn to separate relevant from irrelevant passages, and retrieval performance would collapse.
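The in-batch objective can be sketched with toy 2-dimensional vectors (hypothetical values; real DPR uses 768-dimensional BERT [CLS] outputs):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_loss(q, passages, positive_index):
    """DPR-style objective: negative log-softmax of the similarity to the
    positive passage; every other passage in the batch acts as a negative."""
    sims = [dot(q, p) for p in passages]
    m = max(sims)  # subtract max for numerical stability
    denom = sum(math.exp(s - m) for s in sims)
    return -math.log(math.exp(sims[positive_index] - m) / denom)

q = [1.0, 0.0]
passages = [[0.9, 0.1], [-0.8, 0.2], [0.0, -1.0]]  # index 0 is the relevant one
loss = in_batch_loss(q, passages, 0)
# Loss is below log(3), i.e. better than ignoring the negatives entirely.
print(round(loss, 4))
```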





****************************************************************************************
****************************************************************************************




