Answer to Question 1-1


a) False. One-hot word representations cannot be used to find synonyms, because they do not contain any information about the meaning of the words.

b) False. English is morphologically poorer than German. English has a more limited set of inflectional endings than German, and it often uses separate words (e.g. "to" in "to go") where German uses inflectional endings.

c) True. Syntax is on a lower level than semantics in the hierarchy of language. Syntax deals with the structure of sentences, while semantics deals with the meaning of sentences.

d) False. Word2Vec is not trained on the global word co-occurrence matrix (that is the approach of count-based methods such as GloVe). Instead, it is trained on the local context windows of each word in a large corpus of text.

e) True. When byte-pair encoding (BPE) is applied for subword segmentation, less frequent words tend to become more subword- than word-based. This is because BPE starts by segmenting words into individual characters, and then iteratively merges the most frequent adjacent pair of symbols into a single new symbol. Merges involving frequent words are learned early, so frequent words tend to end up as single tokens, while less frequent words remain broken down into smaller subword units.
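This merge dynamic can be illustrated with a minimal BPE sketch in Python (the toy vocabulary of a frequent word "lower" and a rare word "lowest", and their counts, are invented for illustration):

```python
from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(vocab, pair):
    """Replace each occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Frequent word "lower" (50 occurrences) vs. rare word "lowest" (2 occurrences).
vocab = {tuple("lower"): 50, tuple("lowest"): 2}
for _ in range(4):
    vocab = merge_pair(vocab, most_frequent_pair(vocab))

print(vocab)  # the frequent word is one token; the rare word stays in pieces
```

After four merges the frequent word "lower" has been fused into a single symbol, while the rare word "lowest" is still segmented as "lowe" + "s" + "t", exactly the behavior described above.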

f) True. Compared to HMMs, CRFs allow easier integration of new features. This is because CRFs use a log-linear model, which allows for the addition of new features by simply adding new terms to the model. In contrast, HMMs use a generative model, which requires the modification of the underlying probability distributions in order to add new features.





****************************************************************************************
****************************************************************************************




Answer to Question 1-2


1. Dense word embeddings can capture semantic relationships between words, such as similarity and relatedness, which is not possible with sparse features.

2. Dense word embeddings can capture the context in which a word appears, allowing for better understanding of the word's meaning in different contexts. This is not possible with sparse features, which only capture the presence or absence of a word.





****************************************************************************************
****************************************************************************************




Answer to Question 1-3


a) To create representations for the products using ideas similar to learning word representations, we can apply Singular Value Decomposition (SVD) to the co-purchase matrix, analogous to count-based word embedding methods such as LSA. SVD is a matrix factorization technique that decomposes the co-purchase matrix $M$ into three matrices $U$, $\Sigma$, and $V^T$, such that $M = U \Sigma V^T$.

The matrix $U$ contains the left singular vectors, which can be thought of as the representations for the products. The matrix $\Sigma$ contains the singular values, which represent the strength of the relationships between the products. The matrix $V^T$ contains the right singular vectors, which can be thought of as the representations for the purchases.

To derive the product representations, we take the rows of the matrix $U$: the number of rows of $U$ equals the number of products, and each row represents one product. In practice, we keep only the columns corresponding to the top $k$ singular values (optionally scaled by $\Sigma$), which yields a compact $k$-dimensional embedding for each product, analogous to how LSA derives word vectors from a co-occurrence matrix.
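As a minimal sketch in NumPy (the 5x5 co-purchase counts and the choice of $k$ are invented for illustration):

```python
import numpy as np

# Hypothetical symmetric co-purchase matrix: M[i, j] = number of baskets
# in which products i and j were bought together.
M = np.array([
    [0, 4, 1, 0, 2],
    [4, 0, 3, 1, 0],
    [1, 3, 0, 5, 1],
    [0, 1, 5, 0, 2],
    [2, 0, 1, 2, 0],
], dtype=float)

# Full SVD: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M)

# Keep the top-k singular values to get k-dimensional product embeddings,
# scaling the left singular vectors by the singular values.
k = 2
product_embeddings = U[:, :k] * s[:k]   # one row per product

print(product_embeddings.shape)   # (5, 2): 5 products, 2 dimensions each
```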

b) To recommend similar products to users who have shown interest in one of the products, we can use the product representations derived from the SVD. Specifically, we can calculate the cosine similarity between the representation of the product of interest and the representations of all other products. The cosine similarity is a measure of similarity between two vectors, and it is calculated as the dot product of the two vectors divided by the product of their magnitudes.

Once we have calculated the cosine similarities between the product of interest and all other products, we can recommend the top $k$ products with the highest similarity scores. These products will be similar to the product of interest, and they will be recommended to the user who has shown interest in the product of interest.

For example, suppose that the product of interest is represented by the vector $[0.1, 0.2, 0.3, 0.4]$, and the representations of all other products are as follows:

* Product 2: $[0.2, 0.3, 0.4, 0.5]$
* Product 3: $[0.3, 0.2, 0.5, 0.1]$
* Product 4: $[0.4, 0.5, 0.1, 0.2]$
* Product 5: $[0.5, 0.1, 0.2, 0.3]$

To calculate the cosine similarity between the product of interest and Product 2, we would calculate the dot product of the two vectors and divide it by the product of their magnitudes:

* Dot product: $0.1*0.2 + 0.2*0.3 + 0.3*0.4 + 0.4*0.5 = 0.40$
* Magnitude of product of interest: $\sqrt{0.1^2 + 0.2^2 + 0.3^2 + 0.4^2} = \sqrt{0.30} \approx 0.548$
* Magnitude of Product 2: $\sqrt{0.2^2 + 0.3^2 + 0.4^2 + 0.5^2} = \sqrt{0.54} \approx 0.735$
* Cosine similarity: $0.40 / (0.548 * 0.735) \approx 0.99$

We would repeat this process for all other products, and we would recommend the top $k$ products with the highest cosine similarity scores. In this example, with $k = 2$ we would recommend Products 2 and 3, since they have the highest cosine similarity scores (approximately 0.99 and 0.76, respectively).
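The hand calculation above can be double-checked with a few lines of NumPy, using the example vectors:

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

interest = np.array([0.1, 0.2, 0.3, 0.4])
products = {
    "Product 2": np.array([0.2, 0.3, 0.4, 0.5]),
    "Product 3": np.array([0.3, 0.2, 0.5, 0.1]),
    "Product 4": np.array([0.4, 0.5, 0.1, 0.2]),
    "Product 5": np.array([0.5, 0.1, 0.2, 0.3]),
}

scores = {name: cosine_sim(interest, vec) for name, vec in products.items()}
top_k = sorted(scores, key=scores.get, reverse=True)[:2]
print(top_k)   # ['Product 2', 'Product 3']
```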





****************************************************************************************
****************************************************************************************




Answer to Question 1-4


a) A property of CNNs that is beneficial for the spam detection task, which is not the case with RNNs, is that CNNs are more suitable for parallel processing of input data. This is because CNNs use convolutional layers, which apply the same filter to different parts of the input simultaneously, while RNNs process the input sequentially. This property of CNNs allows them to handle longer input sequences more efficiently, which is important for spam detection as emails can be quite long.

b) A CNN-based model for spam detection can be designed as follows:

* Input: The input to the model is a sequence of words in an email. Each word is represented as a one-hot vector, where the size of the vector is equal to the size of the vocabulary.
* Intermediate operations: The input is first passed through an embedding layer, which maps each one-hot vector to a dense vector of fixed size. The output of the embedding layer is then passed through a series of convolutional layers, each followed by a max pooling layer. The convolutional layers apply filters of different sizes to the input, allowing the model to capture features at different scales. The max pooling layers select the maximum value from each filter's output, reducing the dimensionality of the feature map.
* Output: The output of the convolutional layers is flattened and passed through a fully connected layer, which produces the final classification result.
* Size of the feature map: The size of the feature map depends on the size of the input, the size of the filters, and the stride of the convolutional layers. For example, if the input is a sequence of 100 words, and we use filters of size 3 and stride 1, then the size of the feature map will be (100-3+1)/1 = 98.
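The feature-map size formula can be sanity-checked with a toy sliding-window convolution (the embedding dimension and random values below are arbitrary):

```python
import numpy as np

def conv1d_output_length(n, filter_size, stride=1):
    """Number of positions a size-`filter_size` filter can occupy on a
    length-`n` sequence with the given stride: floor((n - f) / s) + 1."""
    return (n - filter_size) // stride + 1

x = np.random.randn(100, 8)   # 100 words, 8-dimensional embeddings (toy sizes)
w = np.random.randn(3, 8)     # one filter spanning 3 consecutive words

# Direct sliding-window computation of the feature map.
feature_map = np.array([np.sum(x[i:i + 3] * w) for i in range(100 - 3 + 1)])

print(len(feature_map))                        # 98
print(conv1d_output_length(100, 3, stride=1))  # 98
```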

c) When evaluating the model, Tom can use the F1 score instead of accuracy. The F1 score is the harmonic mean of precision and recall, and it takes into account both false positives and false negatives. This is important for spam detection, as a high accuracy can be achieved by simply classifying all emails as non-spam. The F1 score is a more robust metric, as it penalizes models that have a high number of false positives or false negatives. Additionally, the F1 score can be computed separately for the positive and negative classes, allowing for a more detailed evaluation of the model's performance.
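A small worked example of why accuracy is misleading here (the class counts are invented):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Imbalanced test set: 990 non-spam emails, 10 spam emails. A model that
# labels everything as non-spam gets 99% accuracy but F1 = 0 on the spam class.
accuracy = 990 / 1000
f1_spam = f1_score(tp=0, fp=0, fn=10)

print(accuracy, f1_spam)   # 0.99 0.0
```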





****************************************************************************************
****************************************************************************************




Answer to Question 1-5


a) The input of the model is a sequence of words, where each word is represented by a vector of real numbers (word embedding). The intermediate operations include:

1. Applying a convolutional layer with a filter size of 3 to extract n-gram features.
2. Applying a max-pooling layer to select the most important n-gram features.
3. Applying a fully connected layer to learn the relationship between the n-gram features and the disease names.
4. Applying a conditional random field (CRF) layer to model the dependencies between the labels of adjacent words.

The output of the model is a sequence of labels, where each label indicates whether the corresponding word is a disease name or not.

b) A challenge of using pretrained word embeddings, such as GloVe, is that the word embeddings may not capture the specific medical vocabulary used in the documents. To resolve this, we can fine-tune the word embeddings during training by allowing the word embeddings to be updated along with the other model parameters. Additionally, we can use a medical vocabulary list to initialize the word embeddings for words that are not present in the pretrained word embeddings. This will ensure that the word embeddings for medical terms are learned effectively during training.





****************************************************************************************
****************************************************************************************




Answer to Question 2-1


a) The rule for the unigram model would be:
If p("He saw their was a football") > p("He saw there was a football")
Then Return "their"
Else Return "there"

To calculate the probabilities, we can use the maximum likelihood estimate for each word, which is the count of the word divided by the total number of words. So, we get:
p("He saw their was a football") = count("He")/N * count("saw")/N * count("their")/N * count("was")/N * count("a")/N * count("football")/N
p("He saw there was a football") = count("He")/N * count("saw")/N * count("there")/N * count("was")/N * count("a")/N * count("football")/N

Assuming count("their") > count("there") in the corpus, the first probability will be larger than the second, and the rule will return "their". However, this is not a good solution, because the unigram model does not take the context of the word into account: in the sentence "He saw their was a football", the word "their" is still incorrect even if it has a higher unigram probability than "there".

b) The bigram model might be better than the unigram model because it takes into account the context of the word by considering the previous word. For example, in the sentence "He saw their was a football", the previous word is "saw", which is more likely to be followed by "there" than "their". Therefore, the bigram model might be able to correct the misspelling in this case. However, the bigram model might have problems in practice because it requires a large amount of data to estimate the probabilities accurately. If the corpus is not large enough, the model might not have enough examples of each bigram to make accurate predictions. Additionally, the bigram model might not be able to handle rare words or out-of-vocabulary words, which can occur frequently in natural language text.
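The difference between the two models can be sketched on a toy corpus (the corpus is invented; maximum likelihood estimates are used without smoothing):

```python
from collections import Counter

corpus = ("he saw there was a football . she said there was a game . "
          "they lost their ball .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_unigram(w):
    """MLE unigram probability: count(w) / N."""
    return unigrams[w] / N

def p_bigram(w, prev):
    """MLE bigram probability: count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

# The unigram model only compares raw frequencies and ignores context...
print(p_unigram("there"), p_unigram("their"))
# ...while the bigram model conditions on the previous word "saw".
print(p_bigram("there", "saw"), p_bigram("their", "saw"))
```

In this toy corpus, "saw" is always followed by "there" and never by "their", so the bigram model strongly prefers "there" after "saw". It also shows the sparsity problem mentioned above: any bigram unseen in the corpus receives probability zero unless smoothing is applied.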





****************************************************************************************
****************************************************************************************




Answer to Question 2-2


a) In Figure "figures/Mask_under_MLM.pdf", if we apply a masking ratio of 20%, we would randomly select 20% of the tokens in the input sequence to mask. For example, if the input sequence is "start-token the quick brown fox jumps over the lazy dog end-token", we might select "quick" and "jumps" to mask, resulting in the sequence "start-token the [MASK] brown fox [MASK] over the lazy dog end-token".
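A minimal sketch of the masking step, assuming the boundary tokens are never masked and the positions are drawn uniformly at random:

```python
import random

def mask_tokens(tokens, ratio=0.2, seed=0):
    """Replace `ratio` of the maskable tokens with [MASK], leaving the
    special boundary tokens untouched."""
    rng = random.Random(seed)
    maskable = [i for i, t in enumerate(tokens)
                if t not in ("start-token", "end-token")]
    n_mask = max(1, round(len(maskable) * ratio))
    positions = set(rng.sample(maskable, n_mask))
    return [("[MASK]" if i in positions else t) for i, t in enumerate(tokens)]

sentence = "start-token the quick brown fox jumps over the lazy dog end-token"
masked = mask_tokens(sentence.split(), ratio=0.2)
print(" ".join(masked))   # two of the nine content tokens are masked
```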

b) MLM and CLM both require iterations over the training data to learn the language model. However, MLM typically needs more iterations than CLM. In CLM, every token in the sequence contributes a training signal, since the model predicts each token from the ones before it. In MLM, only the masked tokens (typically around 15-20% of the sequence) contribute to the loss in each pass, so the model extracts less signal per training example and usually needs more iterations over the data to converge.

c) MLM does not require the input sequences to be shifted one position to the right like in CLM because the goal of MLM is to predict a masked token given the surrounding context. In MLM, the model is trained to predict the original token that was masked, so there is no need to shift the input sequence. In contrast, in CLM, the model must predict the next token in the sequence given all the previous tokens, so the input sequence must be shifted one position to the right to provide the correct context for the prediction.

d) PrefixLM often has better performance than CLM because it can build richer representations of the context. In CLM, every token attends only to the tokens before it (causal attention). In PrefixLM, the tokens in the prefix attend to each other bidirectionally, so the representation of the prefix is built from both left and right context, and only the continuation is generated causally. As shown in Figure "figures/Illustration_of_language_model_training.png", when the model is given the prefix "the quick brown fox" and must predict the continuation, PrefixLM can encode the whole prefix with full attention, while CLM encodes each prefix token using only the tokens to its left. This richer encoding of the prefix allows PrefixLM to make more accurate predictions.





****************************************************************************************
****************************************************************************************




Answer to Question 2-3


a) The contextual embeddings for the two occurrences of "left" in the sentence will not be the same. The BERT architecture uses self-attention, which computes an attention score for each word in the sentence with respect to every other word. The query, key, and value vectors are all derived from the sum of the input word embedding and the positional encoding, and the positional encoding differs for the two positions. Moreover, the output of self-attention is a weighted mixture over the whole sentence, so each occurrence of "left" also absorbs a different surrounding context. Therefore, even though the input word embeddings for the two occurrences of "left" are identical, their contextual embeddings will differ.

b) No, we cannot use the dot product attention because the attention query has 1024 dimensions and the key has 512 dimensions. The dot product attention is calculated as the dot product of the query and key vectors, followed by a softmax operation. In order for the dot product to be well-defined, the query and key vectors must have the same number of dimensions. Therefore, we cannot use the dot product attention in this case.
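This can be checked numerically; the sketch below uses random stand-in vectors, and the projection matrix at the end is a hypothetical trainable fix, not part of the original setup:

```python
import numpy as np

def dot_product_attention(q, K, V):
    """Scaled dot-product attention for a single query vector.
    Requires q and the rows of K to have the same dimensionality."""
    scores = K @ q / np.sqrt(q.shape[0])   # fails if dimensions differ
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

q = np.random.randn(1024)     # query with 1024 dimensions
K = np.random.randn(7, 512)   # 7 keys with 512 dimensions each
V = np.random.randn(7, 512)

try:
    dot_product_attention(q, K, V)
except ValueError as err:
    print("dot product undefined:", err)

# One possible fix: a (trainable) projection mapping the query into key space.
W = np.random.randn(512, 1024) * 0.01
out = dot_product_attention(W @ q, K, V)
print(out.shape)   # (512,)
```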

c) The positional encoding enables the model to treat different positions differently despite having no trainable parameters because it provides information about the position of the word in the sentence. The positional encoding is added to the input word embedding before it is passed through the self-attention mechanism. The self-attention mechanism calculates the attention score for each word in the sentence with respect to every other word, and the positional encoding is used in this calculation. Therefore, even though the positional encoding has no trainable parameters, it still provides valuable information about the position of the word in the sentence, which allows the model to treat different positions differently.

One way to have trainable positional encoding is to add a trainable vector to the fixed positional encoding. This trainable vector can be learned during the training process and can provide additional information about the position of the word in the sentence. The trainable vector can be added to the fixed positional encoding before it is added to the input word embedding. This allows the model to learn a position-specific representation that can be used in the self-attention mechanism.





****************************************************************************************
****************************************************************************************




Answer to Question 2-4


a) False. Greedy decoding only keeps track of the single most likely hypothesis, while beam search keeps track of the top $k$ most likely partial hypotheses at each step. Therefore, beam search is more memory-intensive than greedy decoding.

b) True. When ensembling text generation models, the models must share the same output vocabulary, so that their per-token probability distributions are defined over the same set of tokens and can be averaged. If the vocabularies differ, the distributions cannot be directly combined, and the models cannot be straightforwardly ensembled.

c) True. The sentence probability is the product of the probabilities of each token in the sequence. If we do not normalize the sentence probability by sequence length, shorter sequences will have a higher product of probabilities, and will therefore be preferred.
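A toy numeric illustration (the per-token log-probabilities are invented):

```python
# Hypothetical per-token log-probabilities of two candidate outputs.
short_seq = [-1.0, -1.2]                     # 2 tokens
long_seq = [-0.8, -0.7, -0.9, -0.6, -0.8]    # 5 tokens

def log_prob(seq):
    """Log of the sentence probability: sum of per-token log-probs."""
    return sum(seq)

def normalized(seq):
    """Length-normalized score: average per-token log-prob."""
    return sum(seq) / len(seq)

# The raw score prefers the shorter sequence simply because it is shorter...
print(log_prob(short_seq), log_prob(long_seq))       # about -2.2 vs -3.8
# ...while the normalized score prefers the stronger long sequence.
print(normalized(short_seq), normalized(long_seq))   # about -1.1 vs -0.76
```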

d) True. Top-k sampling only considers the top k most likely next tokens at each step. A higher value of k leads to a larger set of possible next tokens, and therefore higher variability in the generated output.





****************************************************************************************
****************************************************************************************




Answer to Question 2-5


Both BLEU and COMET are impacted by the different wording in the example above, but in different ways.

BLEU (Bilingual Evaluation Understudy) is a metric for evaluating the quality of machine-generated translations by comparing them to reference translations. It is based on n-gram precision, which measures the proportion of n-grams (contiguous sequences of n items) in the machine-generated translation that also appear in the reference translation. BLEU is a simple and widely-used metric, but it has some limitations. One of these limitations is that it does not take into account the semantic meaning of the words or the context in which they are used.

In the example above, System 2's output conveys the same meaning as the reference but uses different wording, so it shares fewer n-grams with the reference translation. BLEU would therefore assign System 2 a lower score than System 1, even though System 2's output is more colloquial and natural-sounding in German.

COMET (Crosslingual Optimized Metric for Evaluation of Translation), on the other hand, is a more recent metric that aims to overcome some of the limitations of BLEU. It is built on a pretrained multilingual language model that is fine-tuned on human judgments of translation quality, and it encodes the source sentence, the system output, and the reference translation. Because it compares sentences in a learned semantic representation space rather than by surface n-gram overlap, it can capture more nuanced aspects of translation quality, such as fluency, coherence, and naturalness.

In the example above, COMET would likely assign a higher score to System 2's output, because it is more colloquial and natural-sounding in German. COMET would be able to capture the fact that System 2's output is more similar to how a human would translate the source sentence into German, even though it uses a different wording than the reference translation.

In summary, both metrics are affected by the different wording, but in opposite directions: BLEU penalizes System 2 because its surface form matches the reference less closely, while COMET would likely assign a higher score to the more colloquial and natural-sounding output. This is because COMET takes into account the semantic meaning of the words and the context in which they are used, while BLEU only measures surface n-gram overlap.





****************************************************************************************
****************************************************************************************




Answer to Question 3-1


a. Ranking the approaches by the number of trained parameters in the task-specific adaptation stage:

1. Promptless Finetuning: This approach requires training all the parameters of the model on the task-specific data, which can be a large number depending on the size of the model.
2. Direct Prompting: This approach does not require any training of the model parameters, as it only involves constructing a prompt that guides the model to perform the desired task.
3. In-Context Learning: This approach also does not require any training of the model parameters, as it involves conditioning the model on a few input-output examples before providing the actual input to be processed.

b. Ranking the approaches by the amount of memory needed for inference (decoding):

1. Promptless Finetuning: This approach requires the least memory at inference time, since the model processes only the actual input, without any prompt or examples added to it.
2. Direct Prompting: This approach requires somewhat more memory, since the prompt is prepended to the input and lengthens the sequence over which the model must attend.
3. In-Context Learning: This approach requires the most memory, since the input-output examples are all placed in the context window, and the cost of attention grows with the length of the processed sequence.

Note that all three approaches keep the same model weights in memory; the differences come from the length of the input sequence that must be processed.

c. For a specific task with 8 input-output pairs, I would choose the In-Context Learning approach. With only 8 examples, there is far too little data to finetune the model reliably, whereas the examples can simply be placed in the context to condition the model on the task. Although the examples lengthen the input, 8 pairs fit comfortably in the context window, and no training or storage of a separately finetuned model is needed. Additionally, this approach allows for flexibility in prompt construction, as the input-output examples can be presented in different formats to guide the model's behavior.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2


a) To calculate the number of parameters trained in the finetuning stage with adapters, we calculate the number of parameters in each adapter and multiply by the number of layers. Each adapter consists of a down-projection from 1024 to 256 dimensions and an up-projection from 256 back to 1024 dimensions, with a ReLU in between. Each projection is a matrix multiplication with a weight matrix and a bias term. The down-projection therefore has (1024 \* 256) + 256 = 262,400 parameters, and the up-projection has (256 \* 1024) + 1024 = 263,168 parameters, giving 525,568 parameters per adapter. With one adapter in each of the 12 layers, the total number of parameters trained in the finetuning stage is 12 \* 525,568 = 6,306,816.

b) In prompt tuning, only the parameters of the prompt are trained in the finetuning stage. The prompt is a sequence of trainable token embeddings that is prepended to the input sequence. In this case, 50 tokens are reserved for the prompt, so the number of trained parameters is the number of prompt tokens multiplied by the embedding dimension: with the embedding dimension of 1024 from part (a), this is 50 \* 1024 = 51,200 parameters.
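Both parameter counts can be checked with a few lines of arithmetic (assuming one adapter per layer and a bias term on each projection):

```python
# Sizes from the exercise: hidden size 1024, adapter bottleneck 256,
# 12 layers, 50 prompt tokens.
hidden, bottleneck, layers, prompt_tokens = 1024, 256, 12, 50

# Adapter: down-projection 1024 -> 256 plus up-projection 256 -> 1024.
down = hidden * bottleneck + bottleneck
up = bottleneck * hidden + hidden
adapter_params = layers * (down + up)

# Prompt tuning: one trainable embedding vector per prompt token.
prompt_params = prompt_tokens * hidden

print(adapter_params)   # 6306816
print(prompt_params)    # 51200
```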

c) Despite having less total parameters than the model with adapters, the model with prompt tuning runs out of memory during decoding. This could be due to the fact that the prompt is added to the input sequence, increasing the length of the sequence and therefore the memory requirements during decoding. In contrast, the adapters are inserted after each layer and do not increase the length of the sequence.

d) The main difference is that in prompt tuning, trainable continuous prompt embeddings are prepended only at the input (embedding) layer, while in prefix tuning, trainable prefix vectors are prepended to the activations (the keys and values) at every layer of the model. An advantage of prompt tuning compared to prefix tuning is that it is simpler and trains fewer parameters, since only the input-level prompt embeddings are learned. A disadvantage is that it is less expressive, because it can influence the model only through the input layer, and it tends to underperform prefix tuning, especially for smaller models.





****************************************************************************************
****************************************************************************************




Answer to Question 3-3


a) To adapt the pretrained model to use information from the object detection model, we can modify the input of the translation model to include the list of objects detected in the image. Specifically, we can concatenate the list of objects to the input sentence, separated by a special token such as '[SEP]'. For example, given an input sentence "Two people are sitting." and the object list [PERSON, PERSON, RIVER, BOAT], the modified input would be "Two people are sitting. [SEP] PERSON PERSON RIVER BOAT". To handle the case when the object label is not in the vocabulary of the pretrained translation model, we can add the object labels to the vocabulary of the translation model.

b) To analyze whether the model makes use of information from the object detection model, we can compare the translation performance of the model with and without the object detection information. Specifically, we can train two models: one with the object detection information and one without. Then, we can evaluate the translation performance of both models on a test set and compare the results. If the model with the object detection information performs significantly better than the model without, we can conclude that the model makes use of the object detection information.

c) To adapt the pretrained translation model to additionally use the encoded image, we can modify the input of the translation model to include the encoded image. Specifically, we can concatenate the encoded image to the input sentence, separated by a special token such as '[SEP]'. For example, given an input sentence "Two people are sitting." and an encoded image [0.1, 0.2, ..., 1.0], the modified input would be "Two people are sitting. [SEP] 0.1 0.2 ... 1.0". To handle the case when the size of the encoded image does not match with the the embedding dimensions of the translation model, we can project the encoded image to the embedding dimensions of the translation model using a linear layer. Specifically, we can add a linear layer at the beginning of the translation model that takes the encoded image as input and outputs a tensor with the same shape as the embedding dimensions of the translation model.
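A minimal sketch of the projection step, with invented dimensions and random stand-in weights (in a real model, W and b would be trained jointly with the rest of the network):

```python
import numpy as np

rng = np.random.default_rng(0)

d_image, d_model = 2048, 512   # hypothetical encoder and translation-model sizes

# Linear layer mapping the image encoding into the embedding space.
W = rng.normal(scale=0.02, size=(d_model, d_image))
b = np.zeros(d_model)

image_encoding = rng.normal(size=d_image)
image_token = W @ image_encoding + b   # now behaves like one extra embedding

# Append it to the (toy) token embeddings of the input sentence.
token_embeddings = rng.normal(size=(5, d_model))   # "Two people are sitting ."
model_input = np.vstack([token_embeddings, image_token])

print(model_input.shape)   # (6, 512): 5 sentence tokens plus one image token
```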





****************************************************************************************
****************************************************************************************




Answer to Question 3-4


a) Retrieval-augmented generation (RAG) differs from traditional generation in that it incorporates an additional retrieval step before generation. In traditional generation, the model generates text solely based on its parameters and the given prompt. In contrast, RAG models first retrieve relevant documents from an external corpus and then generate text based on the prompt and the retrieved documents. This approach could potentially improve the faithfulness of large language models because the retrieved documents provide additional context that helps the model generate more accurate and reliable responses.

b) I agree that hallucination in machine translation may be easier to detect than in general-purpose text generation with large language models. In machine translation, the input and output are well-defined, and the model's task is to translate the input text into the target language while preserving its meaning. If the model hallucinates and generates text that is not a faithful translation of the input, it is likely to produce grammatical or semantic errors that can be easily detected. In contrast, in general-purpose text generation, the model's task is to generate coherent and contextually appropriate text based on the given prompt, and the absence of a well-defined input-output mapping makes it harder to detect hallucination.

c) Truncating long documents during the training of large language models could potentially cause issues with model hallucination because it may lead to the loss of important context that helps the model generate accurate and reliable responses. To mitigate the problem, one approach is to use a sliding window technique that divides the document into smaller chunks and trains the model on each chunk separately while maintaining the context between them. Another approach is to use a memory-efficient attention mechanism that can handle longer sequences without truncation. Additionally, one can use a more diverse and representative training corpus that includes a variety of document lengths and structures, which can help the model generalize better and reduce the likelihood of hallucination.





****************************************************************************************
****************************************************************************************




