Answer to Question 1-1


a. False. One-hot representations encode each word as a binary vector with a single 1 in the dimension reserved for that word. Because every pair of one-hot vectors is orthogonal and equidistant, the representation captures no semantic relationships between words, such as synonymy.

b. True. German has a more complex morphology than English: nouns, adjectives, and determiners are inflected for case, gender, and number, and productive compounding creates long novel words, so a single lemma gives rise to many more surface forms than in English.

c. True. Syntax is concerned with the arrangement of words and phrases in a sentence, while semantics is concerned with the meaning of words and phrases. Syntax operates on a lower level of language than semantics.

d. False. Word2Vec is trained on local context windows, learning to predict words from their nearby neighbors (or vice versa), rather than on a global word co-occurrence matrix; count-based methods such as GloVe and LSA are the ones built from global co-occurrence statistics.

e. True. When byte-pair encoding (BPE) is applied for subword segmentation, less frequent words tend to be split into several subword units, because BPE only merges symbol pairs that occur frequently in the training corpus; very frequent words survive as single word-level tokens.

f. True. Conditional Random Fields (CRFs) allow for easier integration of new features than Hidden Markov Models (HMMs) because they are discriminative: they model the conditional probability of the label sequence given the whole observation sequence, so arbitrary, overlapping input features can be added without modeling how those features are generated. HMMs are generative and must model the joint distribution of observations and labels, which forces strong independence assumptions between features.





****************************************************************************************
****************************************************************************************




Answer to Question 1-2


1. Dense word embeddings capture semantic information: words are represented as real-valued vectors in a continuous space in which semantically related words lie close together, so the model can generalize across related words. Sparse one-hot or indicator features treat every word as unrelated to every other. This matters for tasks such as sentiment analysis, text classification, and machine translation.
2. Dense word embeddings are more compact and computationally efficient: sparse representations need one dimension per vocabulary entry (often 10^5 or more), whereas dense embeddings typically use only a few hundred dimensions. The entire vocabulary fits in a single |V| x d embedding matrix, and the resulting small dense vectors feed directly into neural network layers. This is especially important for large-scale NLP tasks, where memory usage is a significant concern.
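As a rough illustration of the memory argument (toy sizes and random vectors, purely for illustration):

```python
import numpy as np

# Toy sizes (assumed): a 10k-word vocabulary with 100-dimensional embeddings.
V, d = 10_000, 100
rng = np.random.default_rng(0)

# Dense: the whole vocabulary fits in one V x d matrix.
E = rng.normal(size=(V, d)).astype(np.float32)

# A full co-occurrence matrix over the same vocabulary needs V x V entries.
dense_floats = V * d
cooccurrence_floats = V * V
ratio = cooccurrence_floats / dense_floats  # how much larger the sparse view is

# Semantic similarity falls out of the geometry: cosine of two word vectors.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(E[0], E[1])
```

With these toy numbers the dense matrix is 100x smaller than the co-occurrence table, and similarity queries reduce to cheap dot products.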





****************************************************************************************
****************************************************************************************




Answer to Question 1-3


a. To create representations for the products using ideas similar to learning word representations, we can use techniques such as Singular Value Decomposition (SVD) or Non-negative Matrix Factorization (NMF). These techniques can help us reduce the dimensionality of the co-purchase matrix and extract latent features that capture the underlying relationships between the products.

For SVD, we can decompose the co-purchase matrix $X$ (products as rows, users as columns) as $X = U \Sigma V^T$, where $U$ and $V$ have orthonormal columns and $\Sigma$ is a diagonal matrix of singular values. Keeping only the $k$ largest singular values yields a low-rank approximation; each row of $U_k \Sigma_k$ then serves as a $k$-dimensional representation of one product (and the rows of $V_k \Sigma_k$ represent users).

For NMF, we can factorize the co-purchase matrix as $X \approx WH$, where $W$ (products $\times$ $k$) and $H$ ($k$ $\times$ users) are non-negative. Each row of $W$ is the representation of one product, and each column of $H$ represents one user.
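A minimal sketch of the SVD route with a toy co-purchase matrix (products as rows, users as columns; the numbers are invented):

```python
import numpy as np

# Toy co-purchase matrix: X[i, j] = how often user j bought product i (assumed).
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
], dtype=float)

# Full SVD, then truncate to the k largest singular values.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
product_repr = U[:, :k] * S[:k]  # one k-dimensional vector per product (row)
```

Products that are bought by similar sets of users end up with similar rows in `product_repr`, which is exactly the property the recommendation step in part b needs.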

b. To recommend similar products to a user who has shown interest in one product, we can use the product representations derived in part a: compute the cosine similarity between the representation of the product of interest and the representations of all other products, and recommend the products with the highest similarity scores.
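The recommendation step can be sketched as follows, assuming the product representations from part a are stacked row-wise (the vectors below are toy values):

```python
import numpy as np

# Toy product representations: one row per product (assumed to come from SVD/NMF).
P = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
], dtype=float)

def recommend(query_idx, reprs, top_n=1):
    """Return indices of the top_n products most cosine-similar to query_idx."""
    norms = np.linalg.norm(reprs, axis=1)
    sims = (reprs @ reprs[query_idx]) / (norms * norms[query_idx])
    sims[query_idx] = -np.inf  # never recommend the product itself
    return np.argsort(sims)[::-1][:top_n].tolist()

# Product 1 points almost the same way as product 0, so it is recommended first.
best = recommend(0, P)
```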





****************************************************************************************
****************************************************************************************




Answer to Question 1-4


a. One property of CNNs that is beneficial for the spam detection task is that their convolutional filters detect local patterns, short word n-grams, anywhere in the input. Spam is often signalled by characteristic phrases ("free offer", "click here") whose absolute position in the email is irrelevant, and max-pooling makes their detection position-invariant. CNNs also process all positions in parallel, so they train faster than RNNs, which consume the sequence step by step and may dilute such local evidence across many time steps.

b. A CNN-based model for spam detection could be designed as follows:

1. Input: The input to the model is a sequence of words from the email, where each word is mapped to a vector, either a one-hot vector with a single 1 at the word's vocabulary index or, better, a learned word embedding.
2. Convolutional layers: A convolutional layer applies a set of filters, each spanning a few consecutive words, across all positions of the input to detect local n-gram features. The size of the resulting feature map depends on the filter width and the input length.
3. Pooling layers: One or more pooling layers (typically max-over-time pooling) reduce each filter's feature map to its strongest response, making the detection position-invariant.
4. Fully connected layers: The pooled features are concatenated and passed through one or more fully connected layers that combine them into a prediction.
5. Output: A single sigmoid unit outputs the probability that the email is spam.
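The steps above can be sketched in numpy with toy sizes and random, untrained weights (all dimensions are assumptions for illustration, not a real trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, emb_dim, filt_width, n_filters = 10, 8, 3, 4  # toy sizes (assumed)
X = rng.normal(size=(seq_len, emb_dim))                # one embedded email
W = rng.normal(size=(n_filters, filt_width, emb_dim))  # convolution filters
w_out, b_out = rng.normal(size=n_filters), 0.0         # output layer

def forward(X):
    # 1) Convolution: slide each filter over every window of filt_width words.
    windows = np.stack([X[i:i + filt_width]
                        for i in range(seq_len - filt_width + 1)])
    feature_map = np.einsum('twd,fwd->tf', windows, W)  # (positions, filters)
    # 2) Max-over-time pooling: keep the strongest response per filter.
    pooled = feature_map.max(axis=0)                    # (filters,)
    # 3) Output layer + sigmoid: probability that the email is spam.
    return 1.0 / (1.0 + np.exp(-(pooled @ w_out + b_out)))

p_spam = forward(X)
```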

c. When evaluating the model, accuracy is a poor metric: since the vast majority of emails are not spam, a model that classifies every email as non-spam already achieves high accuracy. Better metrics are precision and recall. Precision is the proportion of true positives (correctly identified spam emails) among all emails the model labels as spam; recall is the proportion of true positives among all actual spam emails; their harmonic mean is the F1 score. These metrics show Tom how well the model performs on the minority class that actually matters, the spam emails, rather than the overall hit rate.
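A minimal illustration of why accuracy misleads on imbalanced data (toy labels, 90% ham, values invented):

```python
def precision_recall(y_true, y_pred):
    """y_true/y_pred: lists of 0 (ham) and 1 (spam)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A degenerate model that calls everything ham scores 90% accuracy
# on a 90%-ham dataset, yet catches zero spam.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
p, r = precision_recall(y_true, y_pred)
```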





****************************************************************************************
****************************************************************************************




Answer to Question 1-5


a. The input to the model is the text of the medical documents. The intermediate operations are: tokenize the text, map each token to its GloVe embedding, apply convolutional layers over the embedding sequence to extract contextual features, and classify each token (e.g., with BIO tags) as part of a disease name or not. The output is the set of predicted disease-name spans in each document.

b. A challenge of using GloVe to initialize the word embeddings is that GloVe is trained on a general corpus and may not capture the specific terminology and jargon commonly used in medical documents. To resolve this, one could fine-tune the GloVe embeddings on a medical corpus or use a specialized medical word embedding model, such as Med2Vec or BioBERT, to better capture the relevant terminology in the medical documents. 





****************************************************************************************
****************************************************************************************




Answer to Question 2-1


For Subquestion (a):

The simple unigram model counts the occurrences of each candidate word in the corpus and always predicts the more frequent one. With a count of 110 for "there" versus 50 for "their", the model predicts "there" for both sentence (1) and sentence (2), regardless of which word actually fits.

This model is not a good solution because it ignores context entirely: whichever word is more frequent overall is predicted every time, so one of the two sentences is necessarily completed incorrectly. The two words also play different grammatical roles, "their" is a possessive determiner while "there" is a locative adverb or expletive, and the unigram model cannot exploit this difference.

For Subquestion (b):

The simple bigram model takes context into account by conditioning on the immediately preceding word: it predicts the candidate $w$ that maximizes $P(w \mid w_{\text{prev}})$, estimated from bigram counts. If the two gaps follow different words, the model can now predict "their" in sentence (1) and "there" in sentence (2). Note, however, that if both gaps follow the same word (e.g., both sentences continue "He saw ..."), a left-context bigram still assigns both gaps the same prediction; disambiguation would then require the following word ("their" is typically followed by a noun, "there" by a verb or punctuation) or a longer context.

This model can be better than the unigram model because it uses local context. In practice, however, bigram counts are far sparser than unigram counts, so reliable estimation needs a large corpus, and unseen bigrams or out-of-vocabulary words receive zero probability unless smoothing is applied.
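Both models can be sketched on a toy corpus (the word counts below are invented and only illustrate the mechanics):

```python
from collections import Counter

# Toy corpus (assumed); in it "there" is more frequent overall,
# but "their" is the more likely word right after "saw".
corpus = ("he saw their dog there . she saw their cat there . "
          "he saw there dogs there .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def predict_unigram(candidates):
    # Ignores context entirely: always the globally more frequent word.
    return max(candidates, key=lambda w: unigrams[w])

def predict_bigram(prev, candidates):
    # Conditions on the previous word: argmax of count(prev, w).
    return max(candidates, key=lambda w: bigrams[(prev, w)])

same = predict_unigram(("their", "there"))        # context-blind answer
after_saw = predict_bigram("saw", ("their", "there"))
```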





****************************************************************************************
****************************************************************************************




Answer to Question 2-2


a. Under masked language model (MLM) training, the mask is applied to the input sequence: selected input tokens are replaced by a special [MASK] token, and the model is trained to recover them. In Figure "figures/Mask_under_MLM.pdf", the second word of the sequence is masked.

b. Under MLM, the model typically needs more iterations over the training data than under CLM. In MLM only a small fraction of tokens (e.g., 15% in BERT) is masked in each training example, so each pass provides a learning signal at only those positions; under CLM, every position predicts its next token, so every token contributes to the loss in every pass.

c. Under MLM, the input sequences do not need to be shifted one position to the right as in CLM. An MLM predicts the identities of masked tokens at the very positions where they occur, so inputs and targets are aligned. A CLM predicts, at each position, the next token, so the targets are the inputs shifted by one position, and causal attention masking prevents the model from simply copying the answer.

d. Under PrefixLM, the model often performs better than under CLM. In a PrefixLM, the prefix (e.g., the context or source part of the input) is processed with full bidirectional attention, and only the continuation is generated left-to-right. The richer bidirectional encoding of the prefix gives the model more informative context than the strictly causal encoding a CLM applies to every token.





****************************************************************************************
****************************************************************************************




Answer to Question 2-3


a. No, the contextual embeddings of the two occurrences of "left" will not be the same. BERT adds positional encodings to the input embeddings, so the two occurrences already enter the first layer with different vectors; the self-attention layers then compute different query, key, and value vectors for them and aggregate the surrounding context differently at each position. Without positional information the two embeddings would coincide, since self-attention by itself is permutation-equivariant.

b. No, plain dot-product attention cannot be used here: the dot product is only defined between vectors of equal dimensionality, and the queries have 1024 dimensions while the keys have 512. One would first need to bring them into a common space, e.g., with a learned projection of the queries down to 512 dimensions, with bilinear (general) attention, $\text{score}(q, k) = q^\top W k$ where $W \in \mathbb{R}^{1024 \times 512}$, or with an additive attention mechanism.

c. The positional encoding function defined in Eq. 1 enables the model to treat different positions differently despite having no trainable parameters. Each pair of feature dimensions uses a sine and a cosine of the position at a different frequency, determined by a scaling factor that depends on the dimension index, so every position receives a unique pattern of values across the 512 dimensions. Because sine and cosine are bounded in $[-1, 1]$, the encodings stay in the same range as the (scaled) input embeddings, and adding them to the token embeddings makes position information available to the otherwise position-invariant attention layers.

To obtain trainable positional encodings, we can replace the fixed function with a learned embedding table: one trainable 512-dimensional vector per position (a max-length $\times$ 512 matrix), initialized randomly and updated by backpropagation like any other embedding, as is done in BERT. The model can then adapt the positional encodings to the task and data it is processing.
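A minimal sketch of the fixed encoding, assuming Eq. 1 is the standard Transformer sinusoidal formulation ($\text{PE}(pos, 2i) = \sin(pos / 10000^{2i/d})$, cosine for odd dimensions):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Sinusoidal positional encodings: sin/cos at a different
    frequency for each pair of feature dimensions."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_pe(max_len=128, d_model=512)
```

Every row is a distinct, bounded pattern, which is what lets the attention layers distinguish positions without any learned parameters.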





****************************************************************************************
****************************************************************************************




Answer to Question 2-4


a. False. Greedy decoding is generally less memory-intensive than beam search because it only keeps one candidate at a time, while beam search keeps a fixed number of candidates.

b. True. Text generation models with different vocabularies cannot be directly ensembled: their per-step output distributions are defined over different token sets (and they may segment the same text differently), so the probabilities cannot simply be averaged token by token.

c. True. Every generated token multiplies the sequence probability by a factor less than 1, so longer sequences systematically accumulate lower probabilities. If sentence probabilities are not normalized by sequence length, search therefore prefers shorter outputs.

d. True. With top-k sampling, a higher value of k leads to higher variability in the generated output because it allows for a wider range of possible sequences to be considered. 





****************************************************************************************
****************************************************************************************




Answer to Question 2-5


The question asks about the evaluation of English-to-German translation systems and the impact of different wording on BLEU and COMET scores.

BLEU (Bilingual Evaluation Understudy) and COMET (Crosslingual Optimized Metric for Evaluation of Translation) are two commonly used metrics for evaluating the quality of machine translation.

BLEU is based on n-gram precision: it rewards translations whose n-grams overlap with those of the reference, combined with a brevity penalty. Because it matches surface forms exactly, it penalizes correct translations that simply use different wording than the reference.
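The core of BLEU, clipped n-gram precision, can be illustrated with a toy reference and two hypotheses (the example sentences are invented):

```python
from collections import Counter

def ngram_precision(hypothesis, reference, n):
    """Clipped n-gram precision, the core quantity inside BLEU."""
    hyp = list(zip(*[hypothesis[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    hits = sum(min(c, ref[g]) for g, c in Counter(hyp).items())
    return hits / len(hyp) if hyp else 0.0

ref = "the cat sat on the mat".split()
hyp_close = "the cat sat on a mat".split()          # close to reference wording
hyp_reworded = "a cat was sitting on a mat".split()  # same meaning, reworded

close = ngram_precision(hyp_close, ref, 2)
reworded = ngram_precision(hyp_reworded, ref, 2)
```

Both hypotheses are acceptable translations of the same idea, yet the reworded one shares no bigrams with the reference and scores zero, which is exactly the surface-form sensitivity discussed here.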

COMET, on the other hand, is a learned metric: it embeds the source sentence, the system output, and the reference with a pretrained multilingual encoder and is trained to predict human quality judgments, so it compares meaning rather than exact surface wording.

In the example provided, both System 1 and System 2's outputs are correct, but System 1's output is more similar to the reference translation in terms of wording. Therefore, System 1's output will likely receive a higher BLEU score.

COMET, however, is less affected by the wording difference: because it scores semantic similarity via embeddings rather than exact n-gram overlap, System 2's equally correct but differently worded output should receive a score comparable to System 1's.

In summary, BLEU is more likely to be impacted by the different wording in the example, while COMET may not be as affected. 





****************************************************************************************
****************************************************************************************




Answer to Question 3-1


a. The approaches can be ranked by the number of trained parameters in the task-specific adaptation stage, from fewest to most, as follows:

1. Direct Prompting: no parameters are trained at all. The task is specified purely through the prompt, so the count is zero.

2. In-Context Learning: also zero trained parameters. The demonstrations are placed in the model's context window at inference time; no gradient updates are performed. It ties with Direct Prompting on trained parameters, differing only in what goes into the context.

3. (Promptless) Finetuning: all of the model's parameters (or a chosen subset) are updated on task data, so this approach trains by far the most parameters.

b. The approaches can be ranked by the amount of memory needed for inference (decoding), from least to most, as follows:

1. (Promptless) Finetuning: the input is just the task example itself, so the sequence is shortest and the activation memory is smallest.

2. Direct Prompting: the instruction text adds some tokens to every input, slightly increasing the sequence length.

3. In-Context Learning: the demonstrations must be included in the context on every call, which can lengthen the input by hundreds or thousands of tokens; since attention memory grows with sequence length, this approach needs the most.

c. The choice depends on the available labeled data and compute. With no labeled examples, Direct Prompting is the only option. With a handful of examples and no training budget, In-Context Learning is attractive, at the cost of longer and more memory-hungry inference. With a sizeable labeled dataset and the compute to train, (Promptless) Finetuning usually gives the best task performance and the cheapest inference, at the cost of storing a task-specific copy of the model.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2


a. To calculate the number of parameters trained in the finetuning stage with adapters inserted after every layer, we count the parameters the adapters introduce. Each adapter consists of a down-projection from 1024 to 256 dimensions and an up-projection from 256 back to 1024, with a ReLU in between, so (ignoring biases) it adds 2 \* 1024 \* 256 = 524288 parameters. With one adapter after each of the 12 layers, the total number of parameters trained is 12 \* 524288 = 6291456 (about 6.3M).

b. With prompt tuning, the only trainable parameters are the embeddings of the 50 reserved prompt tokens, each of dimension 1024. The total number of parameters trained in the finetuning stage is therefore 50 \* 1024 = 51200; all pretrained model parameters remain frozen.
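A quick sanity check of the arithmetic (biases ignored):

```python
d_model, n_layers = 1024, 12       # model width and depth from the question
bottleneck, prompt_len = 256, 50   # adapter bottleneck and reserved prompt tokens

# Adapters: one down-projection (d_model x bottleneck) and one
# up-projection (bottleneck x d_model) per layer.
per_adapter = 2 * d_model * bottleneck
adapter_total = n_layers * per_adapter

# Prompt tuning: one trainable d_model-dimensional embedding per prompt token.
prompt_total = prompt_len * d_model
```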

c. Although prompt tuning trains far fewer parameters, it prepends 50 extra tokens to every input sequence. Activation memory during training grows with sequence length, and the self-attention computation grows quadratically in it, so the longer sequences can exhaust GPU memory even though the trainable parameter count is small. Adapters, in contrast, leave the sequence length unchanged.

d. The main difference is where the trainable vectors are inserted: prompt tuning learns continuous prompt embeddings only at the input layer, while prefix tuning learns prefix vectors (typically keys and values) that are prepended at every layer of the model. An advantage of prompt tuning is that it is simpler and trains fewer parameters; a disadvantage is that it is less expressive, since the learned signal must propagate from the input layer alone, and it tends to need larger models to match finetuning performance.





****************************************************************************************
****************************************************************************************




Answer to Question 3-3


a. To adapt the pretrained model to use information from the object detection model, we can concatenate the detected object labels with the input sentence and feed the combined sequence into the translation model; the output remains the translated sentence. For object labels that are not in the vocabulary of the pretrained translation model, we can rely on subword segmentation (e.g., BPE) to break the label into in-vocabulary units, or map the label to its closest in-vocabulary synonym.

b. To analyze whether the model makes use of the object detection input, we can compare its outputs with and without that input, or with deliberately shuffled or incongruent object labels. If the translations change significantly, for example if an ambiguous source word is resolved differently depending on the detected objects, the model is using the visual information.

c. To adapt the pretrained translation model to additionally use the encoded image, we can prepend the image encoding to the embedded input sentence and feed the combined sequence into the model; the output remains the translated sentence. When the size of the encoded image does not match the embedding dimension of the translation model, we can add a trainable linear projection layer that maps the image encoding into the model's embedding space, and then treat the projected vector as an extra pseudo-token of the source sequence.
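A minimal sketch of the projection idea in part c, with assumed sizes (a 2048-dimensional image feature vector, a 512-dimensional model embedding space, random untrained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

img_dim, d_model = 2048, 512  # assumed: CNN feature size vs model embedding size

W_proj = rng.normal(size=(img_dim, d_model)) * 0.02  # trainable projection
image_feat = rng.normal(size=(img_dim,))             # image encoder output

# Project the image vector into the translation model's embedding space,
# then prepend it to the embedded source sentence as a pseudo-token.
img_token = image_feat @ W_proj                      # (d_model,)
token_embs = rng.normal(size=(7, d_model))           # embedded 7-token sentence
model_input = np.vstack([img_token, token_embs])     # (8, d_model)
```

In training, `W_proj` would be learned jointly with (or instead of) finetuning the translation model, so the projection can place image information where the model can use it.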





****************************************************************************************
****************************************************************************************




Answer to Question 3-4


a. Retrieval-augmented generation (RAG) combines language model generation with retrieval from a large text corpus: relevant passages are first retrieved (e.g., with a dense retriever), and the language model then generates conditioned on both the query and the retrieved passages. This differs from plain generation, which relies solely on knowledge stored in the model's parameters. RAG can improve the faithfulness of large language models by grounding generation in retrieved evidence, which can also be cited and checked.

b. I agree that hallucination in machine translation is easier to detect than in general-purpose text generation. In translation, the source sentence fully specifies the content the output should express, so hallucinated content can be detected by checking the output against the source, for instance for additions the source does not support. Open-ended generation has no such ground-truth input: detecting hallucination there requires verifying claims against world knowledge, which is much harder.

c. Truncating long documents during training can contribute to hallucination because it separates statements from the context and evidence that support them, so the model learns to produce assertions without their grounding. To mitigate this, one can train with sliding windows so that every token is seen together with sufficient surrounding context, or use long-context architectures (e.g., efficient attention variants) that make it feasible to process longer documents in their entirety.





****************************************************************************************
****************************************************************************************




