Answer to Question 1-1
a. False. One-hot word representations encode each word as a unique vector, which doesn't capture semantic similarity between words, so they cannot be used to find synonyms directly.

b. False. German is generally considered to be morphologically richer than English, with more inflections and compound words.

c. True. Syntax refers to the rules governing how words are combined to form phrases and sentences, while semantics deals with the meaning of those structures, placing semantics at a higher level in the hierarchy.

d. False. Word2Vec is trained on local context windows, not the global word occurrence matrix.

e. True. BPE divides words into subwords based on their frequency, so less frequent words, which might not have enough data for their own subword representation, tend to be represented by more subwords.

f. True. CRFs allow for the inclusion of arbitrary features in the model, which can be more easily incorporated than in HMMs where the state transitions are more rigidly defined.





****************************************************************************************
****************************************************************************************




Answer to Question 1-2
1. Dense word embeddings capture semantic relationships between words by representing words in a continuous vector space, allowing for measures of similarity and vector operations that can reveal meaningful connections. This is in contrast to sparse features, which typically represent each word as a binary indicator of its presence in a document, losing information about the relationships between words.

2. Dense embeddings generalize better than sparse features. Because similar words receive similar vectors, a model can transfer what it learns about one word to related words it has rarely or never seen in training, whereas sparse features treat every word type as an independent, unrelated dimension. This matters in NLP tasks where labeled data is small relative to the vocabulary, such as sentiment analysis or named entity recognition, and dense vectors are also far more compact than vocabulary-sized sparse ones.





****************************************************************************************
****************************************************************************************




Answer to Question 1-3
a. To create representations for the products using ideas similar to learning word representations, we can employ a technique called matrix factorization. A popular method for this is Singular Value Decomposition (SVD). SVD decomposes the co-purchase matrix into three matrices: $U$, $\Sigma$, and $V^T$, where $U$ and $V^T$ are orthogonal matrices and $\Sigma$ is a diagonal matrix containing singular values. The product representations can be obtained from the columns of $V^T$.

Here's the step-by-step process:
1. Perform a truncated SVD on the co-purchase matrix $M$, keeping the top $k$ singular values: $M \approx U_k\Sigma_k V_k^T$.
2. The product representations are the columns of $V_k^T$, denoted as $V_k^T = [v_1, v_2, ..., v_N]$. Each column $v_i$ is a $k$-dimensional vector that captures the co-purchase patterns of product $x_i$ with all other products.

b. To recommend similar products to users who have shown interest in one of the products, we can follow these steps:
1. Identify the user's interested product, say $x_k$, and its corresponding vector $v_k$ from $V^T$.
2. Compute the dot product (or cosine similarity) between $v_k$ and every other product vector $v_i$. Because the vectors summarize co-purchase patterns, a high score indicates that the two products tend to be bought together with similar sets of other products.
3. Sort the computed dot products in descending order to get a list of most similar products.
4. Return the top $K$ products from this list as recommendations, where $K$ is the number of recommendations to provide.

This approach effectively uses the derived product representations to find products that have similar co-purchase patterns with the user's interested product, thus making relevant recommendations.
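As a concrete sketch of the two parts, here is a truncated SVD and a dot-product ranking on a small hypothetical co-purchase matrix (the counts and $k = 2$ are made up for illustration):

```python
import numpy as np

# Hypothetical co-purchase counts for 4 products (diagonal = total purchases).
# Products 0/1 and 2/3 form two strongly co-purchased pairs.
M = np.array([
    [10, 8, 1, 0],
    [8, 9, 2, 1],
    [1, 2, 12, 9],
    [0, 1, 9, 10],
], dtype=float)

# Truncated SVD: keep only the top-k singular values/vectors.
U, S, Vt = np.linalg.svd(M)
k = 2
product_vecs = Vt[:k].T  # row i is the k-dim representation of product x_i

# Recommend products similar to product x_0 via dot-product similarity.
query = product_vecs[0]
scores = product_vecs @ query
scores[0] = -np.inf  # exclude the query product itself
top = np.argsort(scores)[::-1]
print(top[0])  # index of the product most similar to x_0
```

With these counts the top recommendation for product 0 is product 1, matching the strong co-purchase pattern in the matrix.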





****************************************************************************************
****************************************************************************************




Answer to Question 1-4
a. One property of CNNs that is beneficial for the spam detection task, and which RNNs lack, is their ability to detect fixed-length local patterns (e.g., characteristic spam phrases) anywhere in the input, with all positions processed in parallel. An RNN must process the email token by token sequentially, which is slower and can suffer from vanishing gradients on long emails.

b. A CNN-based model for spam detection could have the following structure:
1. **Input**: The input would be the email text, represented as a sequence of word embeddings. Its size is (maximum sequence length) × (embedding dimension), with shorter emails padded and longer ones truncated; the vocabulary size only affects the embedding table, not the input tensor.
2. **Convolutional Layer**: This layer would consist of multiple filters (kernels) with a fixed width (e.g., 3-5 words) that slide across the input, performing convolutions. The output would be a feature map, where each position represents the activation for a specific pattern in the input. Let's assume we have 64 filters with a kernel size of 3 and a stride of 1, resulting in a feature map with a width equal to the input width minus the kernel size plus 1 (if padding is not used).
3. **Pooling Layer**: A 1-dimensional max-pooling layer would follow, reducing the width of the feature map while retaining the strongest activations. For example, a pooling window of 2 with a stride of 2 halves the width. (A common alternative for text is global max-pooling, which keeps a single maximum per filter.)
4. **Flattening**: The pooled feature map would be flattened into a 1-dimensional vector for further processing.
5. **Fully Connected Layers**: The flattened vector would be passed through one or more fully connected (dense) layers with ReLU activations, and the final layer would use a sigmoid activation (equivalently, a 2-way softmax) for the binary spam/not-spam decision.
6. **Output**: The output would be a single value representing the probability that the email is spam.

The width of the feature map after the convolutional layer depends on the input sequence length, kernel size, stride, and padding; the pooling parameters then determine the width of the pooled map that is flattened.
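The size arithmetic can be checked with a small NumPy sketch; the sequence length (100), embedding dimension (50), and filter counts below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, n_filters, kernel = 100, 50, 64, 3

x = rng.normal(size=(seq_len, emb_dim))            # embedded email
W = rng.normal(size=(n_filters, kernel, emb_dim))  # 64 filters of width 3

# Valid 1D convolution: output width = seq_len - kernel + 1 = 98.
conv = np.stack([
    [np.sum(W[f] * x[i:i + kernel]) for i in range(seq_len - kernel + 1)]
    for f in range(n_filters)
])
print(conv.shape)  # (64, 98)

# 1D max-pooling with window 2 and stride 2 halves the width: 98 -> 49.
pooled = conv[:, :(conv.shape[1] // 2) * 2].reshape(n_filters, -1, 2).max(axis=2)
print(pooled.shape)  # (64, 49)

flat = pooled.reshape(-1)  # flattened vector fed to the dense layers
print(flat.shape)  # (3136,)
```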

c. Instead of classification accuracy, Tom could use the **F1 Score** or the **Area Under the Receiver Operating Characteristic Curve (AUC-ROC)** as evaluation metrics. These metrics are more informative in imbalanced datasets because they consider both the precision and recall of the model, which is crucial when dealing with a minority class like spam emails. The F1 Score gives a single value representing the balance between precision and recall, while the AUC-ROC provides a measure of the model's ability to distinguish between spam and non-spam emails.





****************************************************************************************
****************************************************************************************




Answer to Question 1-5
a. The model Tom could consider for extracting disease names from medical documents is a Transformer-based architecture, specifically a variation called a Transformer Encoder. The input to this model would be a sequence of tokens, where each token is a word or subword in the document. The input is first passed through an embedding layer that converts each token into a dense vector representation. Instead of using RNNs, this model relies on self-attention mechanisms to capture contextual information.

Intermediate operations involve the following steps:
1. **Embedding Layer**: The input tokens are transformed into fixed-size vectors using pre-trained word embeddings like GloVe.
2. **Positional Encoding**: To incorporate the order of the words, positional encodings are added to the token embeddings.
3. **Multi-Head Self-Attention**: This layer computes self-attention weights for each token, allowing the model to focus on different parts of the input based on their relevance.
4. **Feed-Forward Network**: A feed-forward neural network is applied to each position's output from the self-attention layer, further processing the information.
5. **Normalization and Residual Connections**: Layer normalization and residual connections are used after each self-attention and feed-forward layer to stabilize the training process.

The output of the model would be a sequence of hidden representations for each input token. A classification layer, such as a linear layer followed by a softmax activation, can be added on top to predict whether a token represents a disease name or not.

b. One challenge of using pre-trained word embeddings like GloVe is that they might not capture the specialized medical vocabulary present in the medical documents. These embeddings are typically trained on general-domain text, which might not include many disease names or medical terms.

To resolve this issue, you can follow these steps:
1. **Extending the Vocabulary**: If the disease names are not present in the pre-trained embedding, you can create a new embedding matrix by adding these specialized terms and initializing them with random values or zeros. The model will then be trained to update these embeddings alongside the rest of the network, allowing them to learn domain-specific representations.
2. **Fine-Tuning the Embeddings**: After loading the pre-trained GloVe vectors, you can fine-tune them during the training of the NER model. This allows the embeddings to adapt to the medical context and improve the representation of disease names and related terms.
3. **Integration of Domain-Specific Corpora**: If available, you can incorporate additional medical corpora into the pre-training process of the embeddings, or use pre-trained embeddings specifically trained on medical texts. This would ensure that the embeddings are more representative of the medical domain.
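The vocabulary-extension step can be sketched as follows; the medical terms, vocabulary, and embedding dimension are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 100

# Hypothetical pre-trained vocabulary and GloVe-style embedding matrix.
vocab = {"patient": 0, "fever": 1}
emb = rng.normal(size=(len(vocab), emb_dim))

# Add out-of-vocabulary disease names with small random initialization;
# these new rows are then updated during NER training like any other weight.
for term in ["erythema", "myocarditis"]:
    if term not in vocab:
        vocab[term] = len(vocab)
        emb = np.vstack([emb, rng.normal(scale=0.02, size=(1, emb_dim))])

print(emb.shape)          # (4, 100)
print(vocab["myocarditis"])  # 3
```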





****************************************************************************************
****************************************************************************************




Answer to Question 2-1
a. For the first sentence "He saw their football in the park", the unigram model would predict the more frequent word, which is "there" since count("there") = 110 and count("their") = 50. The probability of "there" would be higher than "their" based on the unigram model.

For the second sentence "He saw their was a football", the unigram model would again predict "there" for the same reason as above.

This might not be a good solution because the unigram model only considers the frequency of each word in isolation, without considering the context. It doesn't account for the fact that "their" is more likely to follow "saw" in the given context.

b. The bigram model might be better because it considers the context by looking at the probability of a word following another word. It would predict the more likely combination of words, such as "saw there" or "saw their", based on the co-occurrence frequencies in the corpus. This could lead to more accurate predictions, as it captures some level of sequential information.

However, a bigram model might have problems in practice because it still doesn't account for longer-term dependencies or context beyond just the previous word. It may also require a large amount of data to accurately estimate the bigram probabilities, and it can be computationally expensive to calculate these probabilities for all possible word pairs. Additionally, if the corpus doesn't have enough examples of specific word pairs, the estimated probabilities might be unreliable.
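A minimal sketch of maximum-likelihood bigram estimation on a hypothetical mini-corpus (the sentences and counts are made up, not the exam's) shows how the previous word changes the prediction:

```python
from collections import Counter

# Hypothetical mini-corpus; counts are illustrative only.
corpus = [
    "he saw their football",
    "he saw their dog",
    "he knew there was a game",
    "there was a park",
]

tokens = [w for s in corpus for w in s.split()]
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

def p_bigram(w2, w1):
    """MLE estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# Unlike a unigram model, the bigram model prefers "their" after "saw".
print(p_bigram("their", "saw"))  # 1.0
print(p_bigram("there", "saw"))  # 0.0
```

Note the zero probability for the unseen pair ("saw", "there"): this is exactly the data-sparsity problem mentioned above, which smoothing methods are designed to fix.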





****************************************************************************************
****************************************************************************************




Answer to Question 2-2
a. Under MLM with a 20% masking ratio, the mask is applied randomly to 20% of the tokens in the sequence. Assuming a sequence of 10 tokens, two tokens would be masked. For example, the masked positions could be the 3rd and 6th tokens, represented by [MASK] in the figure "figures/Mask_under_MLM.pdf".
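The masking step can be sketched as follows (the 10-token sentence is hypothetical):

```python
import random

random.seed(1)
tokens = ["the", "cat", "sat", "on", "the", "mat", "by", "the", "front", "door"]

# Mask 20% of positions (2 of 10), chosen uniformly at random.
n_mask = int(0.2 * len(tokens))
masked_positions = sorted(random.sample(range(len(tokens)), n_mask))
masked = [("[MASK]" if i in masked_positions else t) for i, t in enumerate(tokens)]

print(masked.count("[MASK]"))  # 2
```

The model is then trained to predict the original tokens at the masked positions, given the full (partially masked) sequence.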

b. MLM generally needs more iterations over the training data than CLM. With a 20% masking ratio, MLM receives a learning signal from only 20% of the tokens in each pass, whereas CLM predicts (and receives a loss signal from) every token in the sequence. Because MLM uses less of each sequence per pass, it typically needs more passes to converge.

c. MLM does not require input sequences to be shifted because it predicts masked tokens based on the entire input sequence, including both the known and masked tokens. In contrast, CLM predicts the next token based only on the known tokens to its left, which requires shifting the input to maintain the context.

d. PrefixLM often performs better than CLM on conditional generation because it attends bidirectionally within the prefix (the given context) while still decoding the continuation left-to-right, whereas CLM is restricted to strictly left-to-right attention everywhere. The richer encoding of the context can lead to more accurate predictions; Figure "figures/Illustration_of_language_model_training.png" illustrates the difference in attention patterns between the two objectives.





****************************************************************************************
****************************************************************************************




Answer to Question 2-3
a. Without positional encoding, the contextual embeddings of the two "left" words will be the same. Self-attention without position information is permutation-invariant: both occurrences of "left" have identical input embeddings, so they produce identical query vectors, and they attend over exactly the same set of key and value vectors (the other words in the sentence). Their attention weights, and therefore their output representations, are identical, so the model cannot distinguish the verb "left" from the adjective "left". Positional encoding is precisely what breaks this symmetry.

b. No. Dot-product attention computes the inner product between the query and each key, which requires the two vectors to have the same dimensionality; a 1024-dimensional query cannot be dotted with a 512-dimensional key. To combine them, one vector must first be mapped to the other's dimension with a learned linear projection, or a general bilinear score $q^T W k$ with $W \in \mathbb{R}^{1024 \times 512}$ can be used instead.

c. The positional encoding in Eq. \ref{eq:posEncoding} assigns different values to each position and feature dimension, creating a unique pattern for each position. The sinusoidal functions ensure that nearby positions have similar but not identical values, allowing the model to differentiate between positions. Since the values are fixed, the model learns to attend to position information without any trainable parameters.

To introduce trainable positional encoding, we could add learnable parameters on top of the fixed pattern. For each feature dimension $d$, we add a trainable weight $W_d$ and a bias $b_d$ that rescale and shift the sinusoidal value:

\begin{equation}
\textsc{pos}'(i, d) = W_d \cdot \textsc{pos}(i, d) + b_d
\end{equation}

Alternatively, the sinusoids can be replaced entirely by a learned embedding matrix $P \in \mathbb{R}^{L \times 512}$, where row $P_i$ is added to the token embedding at position $i$.

During training, these weights and biases would be updated alongside other model parameters, allowing the model to learn its own positional encoding strategy.
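A sketch of the fixed sinusoidal encoding, plus one simple trainable variant (a per-dimension scale and shift of the fixed pattern), assuming a model dimension of 512:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Fixed sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = sinusoidal_pe(max_len=16, d_model=512)
print(pe.shape)  # (16, 512)

# Each position gets a distinct pattern: no two rows coincide.
assert not np.allclose(pe[3], pe[7])

# Trainable variant: pos'(i, d) = W_d * pos(i, d) + b_d, with W, b learned.
W = np.ones(512)   # initialized so training starts from the fixed encoding
b = np.zeros(512)
pe_trainable = W * pe + b
```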





****************************************************************************************
****************************************************************************************




Answer to Question 2-4
a. This statement is false. Greedy decoding is less memory-intensive than beam search because it only considers the single most likely token at each step, whereas beam search maintains a set of the most likely sequences, which requires more memory to store multiple paths.

b. This statement is true. When decoding, if models have different vocabularies, it is not straightforward to ensemble them directly, as each model may use different tokens or have different token representations. An additional step, such as mapping all vocabularies to a shared one, would be needed.

c. This statement is true. If sentence probability is not normalized by sequence length, shorter sequences will tend to score higher, because every additional token multiplies the sequence probability by a factor less than one, so the product shrinks with length. Normalizing (e.g., comparing average log-probability per token) makes sequences of different lengths comparable.
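A small numeric sketch (with made-up token log-probabilities) illustrates the bias:

```python
# Hypothetical per-token log-probabilities for two candidate outputs.
short_seq = [-0.9, -0.8]   # 2 tokens, poor per-token quality (avg -0.85)
long_seq = [-0.4] * 5      # 5 tokens, better per-token quality (avg -0.40)

sum_short, sum_long = sum(short_seq), sum(long_seq)  # -1.7 vs -2.0
print(sum_short > sum_long)  # True: the raw sum favors the short sequence

# Length normalization compares average per-token quality instead.
avg_short = sum_short / len(short_seq)
avg_long = sum_long / len(long_seq)
print(avg_long > avg_short)  # True: normalization favors the better sequence
```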

d. This statement is true. With top-k sampling, increasing the value of k allows for more variety in the considered tokens at each step, leading to a higher variability in the generated output, as more possibilities are included in the sampling process.





****************************************************************************************
****************************************************************************************




Answer to Question 2-5
BLEU is more impacted by the different wording in the example above. BLEU is a metric that calculates the overlap of n-grams between the system output and the reference translation. It does not consider the semantic equivalence of the sentences, only the exact word matches. Since System 1's output matches the reference translation exactly, it would score higher in BLEU. However, System 2's use of "du" instead of "Sie" changes the formality but not the meaning of the sentence.

On the other hand, COMET (or other embedding-based metrics such as BERTScore or MoverScore) is designed to capture semantic equivalence. It would be less impacted by the difference in wording, as it compares learned representations of the sentences rather than exact word matches. COMET would likely score System 1 and System 2 similarly, as both translations convey the same essential meaning, even though the pronoun formality differs from the reference.
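A toy computation of clipped unigram precision, the building block BLEU aggregates over n-gram orders (brevity penalty omitted), shows how a single-word substitution is punished; the German sentences below are hypothetical stand-ins for the example:

```python
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision: overlap with the reference / hyp n-gram count."""
    hyp_ngrams = Counter(zip(*[hyp[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / max(1, sum(hyp_ngrams.values()))

ref = "wie heissen Sie".split()
sys1 = "wie heissen Sie".split()   # exact match with the reference
sys2 = "wie heisst du".split()     # same meaning, informal pronoun

print(ngram_precision(sys1, ref, 1))  # 1.0
print(ngram_precision(sys2, ref, 1))  # ~0.33: the surface mismatch is punished
```

Note that switching to the informal "du" also changes the verb form in German, so the surface penalty is even larger than a single substituted token would suggest.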





****************************************************************************************
****************************************************************************************




Answer to Question 3-1
a) Ranking the approaches by the number of trained parameters in the task-specific adaptation stage:
1. Direct Prompting: In this approach, the original large language model is not modified, and only a few task-specific tokens are added to the input. Therefore, no additional parameters are trained, and it has the fewest parameters involved in the adaptation.

2. In-Context Learning: This method involves providing a few example input-output pairs to the model as context, without any finetuning. The model uses this context to generate outputs for new inputs, but its weights are never updated, so, like direct prompting, zero parameters are trained in the adaptation stage; the two differ only in what goes into the prompt.

3. (Promptless) Finetuning: This approach involves updating the weights of the entire language model to optimize it for a specific task. This results in the largest number of trained parameters, as the entire model is adjusted.

b) Ranking the approaches by the amount of memory needed for inference (decoding):
1. (Promptless) Finetuning: Finetuning updates the existing weights without adding new ones, so the model at inference is exactly the size of the original and processes only the task input itself; its memory use is the smallest.

2. Direct Prompting: The model is unchanged, but the task instruction is prepended to the input, slightly lengthening the sequence (and thus the attention computation) compared to the finetuned model.

3. In-Context Learning: The model is unchanged, but the prompt now contains several demonstration input-output pairs, so the sequence is much longer. Since attention memory grows with sequence length, this approach needs the most memory at inference.

c) For a specific task with only 8 input-output pairs, the best approach would be In-Context Learning. Finetuning the entire model on 8 examples would almost certainly overfit, and direct prompting ignores the labeled pairs entirely. With in-context learning, all 8 pairs fit comfortably in the prompt as demonstrations, letting the model leverage its pre-trained knowledge while still being guided by the available examples, with no risk of overfitting through weight updates.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2
a. To calculate the number of parameters trained in the finetuning stage with adapters, we need to consider the parameters in the adapter layers. Each adapter consists of a down-projection from 1024 to 256, a ReLU, and an up-projection from 256 back to 1024. Ignoring biases, the down-projection has 1024 × 256 parameters and the up-projection has 256 × 1024, i.e., 2 × (1024 × 256) = 524,288 parameters per adapter. With one adapter after each of the 12 layers, the total is 12 × 524,288 = 6,291,456 trained parameters.

b. For prompt tuning, we have 50 reserved tokens. Each of these tokens has a trainable embedding of dimension 1024, so there are 50 × 1024 = 51,200 trained parameters. Since the rest of the model remains frozen, no additional parameters are trained in the finetuning stage.
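The arithmetic for both parts (ignoring bias terms) can be checked with a short script:

```python
# Model and adaptation hyperparameters from the question.
d_model, bottleneck, n_layers = 1024, 256, 12
prompt_len = 50

# Adapters: down-projection (1024x256) + up-projection (256x1024) per layer.
adapter_params = n_layers * (d_model * bottleneck + bottleneck * d_model)
print(adapter_params)  # 6291456

# Prompt tuning: one trainable 1024-dim embedding per reserved token.
prompt_params = prompt_len * d_model
print(prompt_params)  # 51200

print(adapter_params / prompt_params)  # 122.88: adapters train ~123x more
```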

c. The model with prompt tuning might run out of memory because the prompt is added to the input sequence, increasing the sequence length. This can lead to a larger memory footprint during the decoding process, as the model has to process a longer sequence compared to the model with adapters, where the additional parameters are not part of the sequence length.

d. The main difference between prompt tuning and prefix tuning is where the tunable parameters enter the model. In prompt tuning, the trained parameters are soft-token embeddings prepended only at the input (embedding) layer; in prefix tuning, trainable prefix vectors are injected into every layer, typically as additional key and value vectors in each attention block.

An advantage of prompt tuning is its simplicity and very small parameter count, since only input-layer embeddings are trained. A disadvantage is lower expressiveness: because prefix tuning conditions every layer directly, it can steer the model's internal computation more strongly and often performs better on harder tasks, at the cost of more trained parameters.





****************************************************************************************
****************************************************************************************




Answer to Question 3-3
a. To adapt the pretrained translation model to use information from the object detection model, we can modify the input of the encoder. The input now consists of two parts: the source sentence and the list of detected object labels, e.g., appended to the sentence after a separator token, so that both are embedded and passed through the encoder together. If an object label is not in the vocabulary of the pretrained translation model, we can split it into subwords with the model's existing tokenizer, map it to a special "unknown" token, or extend the vocabulary with new embeddings that are learned during fine-tuning.

b. To analyze whether the model makes use of information from the object detection model, we can perform ablation studies. We can create two versions of the model: one with the image information and one without. By comparing the translation performance of both models, if there is a significant difference, it indicates that the image information is being utilized. Additionally, we can inspect the attention weights of the model, focusing on whether the attention patterns change when the object detection information is present.

c. To adapt the pretrained translation model to use the encoded image, we can concatenate the encoded image vector with the text input's encoder output. This can be done before the decoder's self-attention layer or at the input of the decoder. If the size of the encoded image does not match the embedding dimensions of the translation model, we can use a linear projection layer (i.e., a fully connected layer) to match the dimensions. This projection layer can be trained alongside the rest of the model during fine-tuning.
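A sketch of the projection-and-concatenation step, with hypothetical dimensions (a 2048-dim image encoding and a 512-dim translation model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: image encoder output vs. translation model width.
img_dim, d_model = 2048, 512
image_vec = rng.normal(size=img_dim)

# Trainable linear projection mapping the image vector into the
# encoder's embedding space; trained jointly during fine-tuning.
W = rng.normal(scale=0.02, size=(d_model, img_dim))
b = np.zeros(d_model)
projected = W @ image_vec + b
print(projected.shape)  # (512,)

# Concatenate with the encoder output (hypothetical: 20 source tokens),
# so the decoder can attend to the image as one extra "token".
encoder_out = rng.normal(size=(20, d_model))
augmented = np.vstack([encoder_out, projected[None, :]])
print(augmented.shape)  # (21, 512)
```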





****************************************************************************************
****************************************************************************************




Answer to Question 3-4
a. Retrieval-augmented generation (RAG) combines the strengths of retrieval-based and generative models. In traditional generation, a language model generates text purely based on its learned parameters and context, which can lead to hallucination, i.e., producing factually incorrect information. RAG, on the other hand, supplements the model's knowledge by retrieving relevant information from a structured knowledge source. This retrieval component can help improve the faithfulness of the generated text, as it ensures that the model is grounded in factual information.

b. I partially agree with the claim. In machine translation, hallucination may be easier to detect because the source sentence defines exactly what content the output should contain. If the generated translation includes information that is not present in the source text, it can readily be identified as hallucination. In open-ended text generation, however, there is no such ground truth, so hallucinated text may look perfectly plausible while being factually incorrect, making it much harder to detect.

c. Truncating long documents during the training of large language models can lead to hallucination issues because the model might lose crucial context that is necessary for generating accurate responses. By not having access to the complete information, the model may generate text based on incomplete or incorrect understanding. To mitigate this problem, one approach is to use techniques like document-level context modeling, which tries to capture the long-range dependencies in the input. Another approach is to use a sliding window or a more efficient memory management technique to handle longer sequences without sacrificing context. Additionally, incorporating external knowledge sources, as in RAG, can also help provide context when the original input is truncated.





****************************************************************************************
****************************************************************************************




