Answer to Question 1-1
Here are my answers to the exam question:

a. False. One-hot word representations cannot be used to find synonyms because they do not capture any semantic similarity between words.

b. False. German is morphologically richer than English, as it has more inflectional forms for nouns, adjectives and verbs.

c. False. Syntax is on a lower level than semantics in the hierarchy of language, as it deals with structure and grammar rather than with meaning.

d. False. Word2Vec is not trained on the global word occurrence matrix, but rather on local context windows in a large corpus.

e. True. When byte-pair encoding (BPE) is applied for subword segmentation, less frequent words tend to be broken into more subword units, whereas frequent words tend to be kept intact as single units.

f. True. Compared to HMMs, CRFs allow easier integration of new features because they are discriminative models that can incorporate arbitrary features of the input.





****************************************************************************************
****************************************************************************************




Answer to Question 1-2
Here are two reasons why dense word embeddings are preferred over sparse features in NLP:

1. Dense word embeddings capture semantic and syntactic relationships between words. Words that are similar in meaning or function end up having similar vector representations in the embedding space. This allows the embeddings to encode analogical relationships and enables the model to generalize better to words it hasn't seen before. In contrast, sparse features like one-hot encodings treat each word as completely independent and do not capture any notion of similarity or relatedness between words.

2. Dense word embeddings are more computationally efficient and scalable compared to sparse features. With sparse encodings like one-hot vectors, the dimensionality grows with the size of the vocabulary, leading to very high-dimensional, sparse representations. This becomes intractable for large vocabularies. Dense embeddings map words to much lower dimensional dense vectors (usually a few hundred dimensions) which is more efficient in terms of computation and memory. The dense representations also work better as input to neural networks compared to extremely sparse vectors.

In summary, the key advantages of dense word embeddings are that they capture meaningful semantic and syntactic relationships between words in a low-dimensional space, and they are computationally efficient to work with, especially in neural network models. This has led to their widespread adoption and preference over traditional sparse word representations in modern NLP.
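The contrast in point 1 can be sketched with a few lines of Python; the dense vectors below are invented toy values for illustration only:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# One-hot vectors: any two distinct words are orthogonal.
onehot_cat = [1, 0, 0]
onehot_kitten = [0, 1, 0]

# Toy dense embeddings (values invented): related words get nearby vectors.
dense_cat = [0.8, 0.1, 0.3]
dense_kitten = [0.75, 0.2, 0.25]
dense_car = [-0.5, 0.9, -0.1]

onehot_sim = cosine(onehot_cat, onehot_kitten)   # 0.0 for any distinct pair
related_sim = cosine(dense_cat, dense_kitten)    # high
unrelated_sim = cosine(dense_cat, dense_car)     # low
```

One-hot similarity is zero for every pair of distinct words, so no notion of relatedness can be recovered, while the dense vectors rank "kitten" closer to "cat" than "car" is.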





****************************************************************************************
****************************************************************************************




Answer to Question 1-3
Here are my answers to the exam question:

a. To create representations for the products using ideas similar to learning word representations, I would perform the following operations on the co-purchase matrix:

1. Treat the co-purchase matrix as a "product context" matrix, similar to a word-context matrix used in learning word representations. Each row represents a product, and each column represents the context (co-purchased products).

2. Apply a dimensionality reduction technique, such as Singular Value Decomposition (SVD) or Non-Negative Matrix Factorization (NMF), to the co-purchase matrix. This will decompose the matrix into lower-dimensional product representations and context representations.

3. The resulting lower-dimensional product representations (e.g., the left singular vectors from SVD or the product factor matrix from NMF) will capture the latent features and relationships among the products based on their co-purchase patterns.

4. These product representations can be used as dense vector embeddings for each product, similar to word embeddings in natural language processing tasks.
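The steps above can be sketched on a toy co-purchase matrix (values invented for illustration), using truncated SVD:

```python
import numpy as np

# Toy co-purchase matrix: rows and columns are products,
# entry [i, j] = how often products i and j were bought together.
C = np.array([
    [0, 8, 1, 0],
    [8, 0, 2, 0],
    [1, 2, 0, 6],
    [0, 0, 6, 0],
], dtype=float)

# Truncated SVD: keep the top-k singular directions as dense embeddings.
k = 2
U, S, Vt = np.linalg.svd(C)
product_embeddings = U[:, :k] * S[:k]  # one k-dimensional vector per product
```

Each row of `product_embeddings` is a dense representation of one product, analogous to an SVD-based (LSA-style) word embedding from a word-context matrix.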

b. To recommend similar products to users who have shown interest in one of the products, I would follow these steps:

1. Given a product of interest, retrieve its product representation vector obtained from the previous subquestion.

2. Compute the similarity between the representation vector of the product of interest and the representation vectors of all other products. This can be done using cosine similarity or other similarity measures.

3. Rank the products based on their similarity scores, with higher scores indicating more similar products.

4. Recommend the top-k most similar products to the user, where k is a predefined number of recommendations to provide.

5. Optionally, filter out products that the user has already purchased or interacted with to avoid redundant recommendations.
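A minimal sketch of these steps, with invented 2-dimensional product embeddings standing in for the vectors from part (a):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def recommend(product_id, embeddings, purchased, k=2):
    """Rank all other products by cosine similarity to the query product,
    skipping products the user has already bought (step 5)."""
    query = embeddings[product_id]
    scores = [
        (cosine(query, vec), pid)
        for pid, vec in embeddings.items()
        if pid != product_id and pid not in purchased
    ]
    scores.sort(reverse=True)
    return [pid for _, pid in scores[:k]]

# Toy product embeddings (values invented for illustration).
embeddings = {
    "laptop":   [0.9, 0.1],
    "mouse":    [0.8, 0.3],
    "keyboard": [0.85, 0.2],
    "blender":  [0.1, 0.95],
}

recs = recommend("laptop", embeddings, purchased={"mouse"}, k=2)
```

For a user interested in "laptop" who already owns a mouse, "keyboard" ranks first because its vector points in nearly the same direction.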

By leveraging the product representations derived from the co-purchase matrix, this approach allows for recommending products that are frequently purchased together or have similar co-purchase patterns to the product of interest. The dimensionality reduction step captures the underlying structure and relationships among the products, enabling effective similarity-based recommendations.





****************************************************************************************
****************************************************************************************




Answer to Question 1-4
Here are my answers to the exam question:

a. A beneficial property of CNNs for spam detection compared to RNNs is that CNNs can effectively extract local features and patterns regardless of their position in the input sequence. Certain keywords or phrases indicative of spam can occur anywhere in the email. CNNs can detect these patterns well without being sensitive to the position, whereas RNNs process the input sequentially and may struggle with long-range dependencies, especially for longer emails.

b. For a CNN-based spam detection model:
Input: Represent each email as a matrix, where each row corresponds to a word (or character) in the email, represented by a word embedding vector. 
Intermediate operations: Apply 1D convolutions of different filter sizes (e.g., 3, 4, 5) over the input matrix to capture local patterns of varying lengths. Each filter slides over the input, producing a feature map that indicates the presence of the pattern. Apply max-pooling over each feature map to downsample it and keep the most salient features. Concatenate the pooled features and pass them through fully-connected layers.
Output: The final layer outputs the probability of the email being spam or not (binary classification).
Feature map size: The size depends on the input length and filter size. With an input of L words, a filter of size K, stride 1, and no padding, the feature map size is (L - K + 1).
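The feature-map arithmetic (assuming stride 1 and no padding) can be checked directly:

```python
def feature_map_size(input_len, filter_size, stride=1, padding=0):
    """Output length of a 1D convolution (no dilation)."""
    return (input_len + 2 * padding - filter_size) // stride + 1

# An email of L = 100 word embeddings with filters of size 3, 4, 5:
sizes = {K: feature_map_size(100, K) for K in (3, 4, 5)}
# Max-pooling over each feature map then yields one value per filter,
# so the final representation is independent of the email length.
```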

c. Instead of accuracy, a better metric for imbalanced datasets like in spam detection is the F1 score. The F1 score is the harmonic mean of precision and recall. Precision measures the proportion of true spam emails among those classified as spam, while recall measures the proportion of spam emails that were correctly identified. The F1 score provides a balanced measure of the model's performance, considering both false positives and false negatives. It is more informative than accuracy, especially when the classes are imbalanced, as it focuses on the performance of the minority (spam) class.





****************************************************************************************
****************************************************************************************




Answer to Question 1-5
Here are my answers to the exam question:

a. For the named entity recognition task of extracting disease names from medical documents, I propose using a transformer-based model such as BERT. The input would be the tokenized text of each medical document. The tokens are passed through the pretrained BERT layers which perform self-attention to capture contextual information. On top of the BERT output, we add a token classification head consisting of a linear layer that predicts for each token whether it is part of a disease name entity or not (e.g. using BIO tagging). The intermediate operations are the self-attention layers of BERT that generate contextualized embeddings for each token. The output is the predicted entity label for each token, which we can use to extract the disease name spans from the document.

b. One challenge with using pretrained GloVe embeddings to initialize the word embeddings of the BERT-based model is that the BERT tokenizer uses subword tokenization, while GloVe provides embeddings for whole words. There will be a mismatch where some tokens from BERT's tokenizer will not have a corresponding GloVe embedding. 

To resolve this, we can try the following:
1) For BERT tokens that correspond to whole words, initialize them with the matching GloVe embedding if it exists, otherwise initialize randomly. 
2) For BERT tokens that are subwords, initialize them randomly, or by averaging the GloVe embeddings of the whole words that contain that subword.
3) Allow the embeddings to be fine-tuned during training on the named entity recognition task, so they can adapt to the domain.

This way, we can leverage the semantic information from GloVe where possible, while still accounting for BERT's subword tokenization scheme and allowing the embeddings to adapt to the medical domain through fine-tuning.
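A minimal sketch of the initialization scheme (whole-word tokens reuse GloVe vectors where available, everything else is random; the averaging variant is omitted for brevity). The vocabulary and GloVe entries below are invented, and subword pieces are marked with BERT's "##" convention:

```python
import random

def init_embeddings(vocab, glove, dim=4, seed=0):
    """Whole-word tokens reuse GloVe vectors when available;
    subword pieces (BERT's '##' prefix) and OOV words get random vectors.
    All vectors are then fine-tuned during NER training."""
    rng = random.Random(seed)
    table = {}
    for token in vocab:
        if not token.startswith("##") and token in glove:
            table[token] = glove[token]
        else:
            table[token] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table

# Toy vocabulary and GloVe table (entries invented for illustration).
glove = {"infect": [0.1, 0.2, 0.3, 0.4]}
table = init_embeddings(["infect", "##ion", "fever"], glove)
```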





****************************************************************************************
****************************************************************************************




Answer to Question 2-1
a. For the unigram model, the rule will return "there" for both sentences. This is because:
p("there") = count("there") / N = 110 / 10,000 = 0.011
p("their") = count("their") / N = 50 / 10,000 = 0.005
Since p("there") > p("their"), the model will always predict "there" regardless of the context.
This is not a good solution because it does not take into account the context of the sentence. The unigram model only considers the probability of each word independently, which is not sufficient for determining the correct spelling of "there" vs "their".

b. The bigram model might be better than the unigram model because it takes into account the context of the previous word. For example, the probability of "their" given "saw" (i.e., p("their" | "saw")) might be higher than the probability of "there" given "saw" (i.e., p("there" | "saw")), which would allow the model to correctly predict "their" in sentence (1).
However, this model might still have problems in practice:
1. Data sparsity: Many bigrams might not appear in the training corpus, leading to zero probabilities. This can be mitigated with smoothing techniques.
2. Limited context: The model only considers the immediately preceding word, which may not be enough to determine the correct spelling in all cases. For example, in "He saw there was a football", the model might still predict "their" if p("their" | "saw") > p("there" | "saw").
3. Inability to handle long-range dependencies: The correct spelling might depend on words that are further away in the sentence, which the bigram model cannot capture.
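The bigram decision rule can be sketched on invented counts; since the denominator count(prev) is shared across candidates, comparing raw bigram counts is equivalent to comparing conditional probabilities:

```python
from collections import Counter

# Toy bigram counts from a training corpus (values invented for illustration).
bigram_counts = Counter({
    ("saw", "their"): 12,
    ("saw", "there"): 3,
    ("there", "was"): 20,
})

def choose(prev_word, candidates):
    """Pick the candidate maximizing count(prev_word, w), which is
    proportional to p(w | prev_word) for a fixed prev_word."""
    return max(candidates, key=lambda w: bigram_counts[(prev_word, w)])

# Sentence (1): "He saw ___ ..." -> condition on the previous word "saw".
pick = choose("saw", ["there", "their"])
```

Unlike the unigram rule, the prediction now depends on the preceding word, though it still fails whenever one preceding word is insufficient context.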





****************************************************************************************
****************************************************************************************




Answer to Question 2-2
Here are the answers to the language modeling training questions:

a. To mark where the mask is applied under MLM with a 20% masking ratio, I would randomly select one of the five input tokens (w1, w2, w3, w4, w5) to be masked. For example, if w3 was chosen, the input sequence would become [w1, w2, [MASK], w4, w5].
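The masking step in (a) can be sketched as follows, assuming masked positions are chosen uniformly at random (the full BERT recipe also sometimes replaces a selected token with a random token or leaves it unchanged, which is omitted here):

```python
import random

def mask_tokens(tokens, ratio=0.2, seed=0):
    """Replace a random `ratio` fraction of tokens with [MASK];
    the masked positions become the prediction targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, positions

# With 5 tokens and a 20% ratio, exactly one token is masked.
masked, targets = mask_tokens(["w1", "w2", "w3", "w4", "w5"], ratio=0.2)
```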

b. MLM typically requires more iterations over the training data compared to CLM. This is because in MLM, only a subset of the input tokens are masked and used for training on each iteration, whereas CLM trains on predicting all tokens in the sequence. MLM needs more passes over the data to adequately train on predicting all possible masked tokens.

c. MLM does not require shifting the input sequence to the right like CLM because the objective is different. In CLM, the model predicts the next token given all previous tokens, so shifting allows each token to be a target. MLM randomly masks some tokens and predicts those masked tokens based on the surrounding context, so no shifting is needed. The mask tokens serve as the prediction targets in MLM.

d. PrefixLM often performs better than standard CLM because it attends bidirectionally over the prefix while predicting the continuation autoregressively, instead of restricting every position to left-to-right attention as CLM does. This richer conditioning on the prefix lets PrefixLM build more robust representations of the given context before generating. The illustration shows how PrefixLM has full visibility within the prefix at each prediction step, while CLM only conditions on the tokens to the left.

In summary, PrefixLM's bidirectional attention over the prefix gives it richer conditioning than CLM's strictly left-to-right view, which tends to translate into better language understanding and generation performance.





****************************************************************************************
****************************************************************************************




Answer to Question 2-3
Here are my answers to the exam question:

a. Yes, the contextual embeddings for the two occurrences of "left" in the sentence "I left my phone in my left pocket" will be identical if the model does not have positional encoding.

In the self-attention mechanism, the query, key, and value are computed from the input embeddings. Without positional encoding, the input embeddings for the same word will be identical regardless of position. As a result, the query, key and value will also be the same for the two "left" tokens.

However, the two "left" words have different meanings in the sentence - the first is a verb and the second is an adjective. Self-attention allows a word to attend to other relevant words to build contextualized representations. But without a notion of position, self-attention cannot distinguish the two "left" words to build different representations for them based on their different contexts. Positional encoding is needed to make the input embeddings position-dependent before computing the query, key and value.

b. No, we cannot use dot product attention if the query has 1024 dimensions and the key has 512 dimensions. 

In dot product attention, the query and key vectors are multiplied together via dot product to compute the attention scores. This requires the two vectors to have the same dimensionality so that the dot product is mathematically defined.

With the query and key having different dimensions, their dot product cannot be computed. We would need to project one or both of them to the same dimensionality first, e.g. by multiplying the key by a learned weight matrix to increase its dimension from 512 to 1024 before taking the dot product with the 1024-dimensional query.
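A minimal sketch of the fix, using toy dimensions (4 and 2 standing in for 1024 and 512) and an invented projection matrix:

```python
def matvec(W, v):
    """Multiply matrix W (one row per output dimension) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

query = [1.0, 0.5, -0.5, 2.0]   # d_q = 4 (stands in for 1024)
key = [0.3, -0.2]               # d_k = 2 (stands in for 512)

# dot(query, key) is undefined (length mismatch), so we first project
# the key up to d_q with a learned matrix W of shape (d_q, d_k).
W = [
    [0.1, 0.0],
    [0.0, 0.1],
    [0.2, -0.1],
    [0.0, 0.3],
]
key_projected = matvec(W, key)   # now 4-dimensional
score = dot(query, key_projected)
```

In a real model, W would be learned jointly with the rest of the attention parameters.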

c. The positional encoding function POS(⋅) enables the model to treat different positions differently by making the input embedding position-dependent before any learned transformations are applied.

Even though POS(⋅) itself has no trainable parameters, it maps each position to a unique vector that gets added to the input embedding. So the same word at different positions will have different position-augmented embeddings. 

The uniqueness comes from POS(⋅) being defined as a combination of sine and cosine functions of the position, with different periods for each feature dimension. For any two positions, their positional encoding vectors will not be the same.

When the position-augmented embeddings are fed through the model's learned linear layers, words at different positions will undergo different transformations and produce different hidden representations, despite the model parameters being shared across positions. The model can thus learn position-dependent computations.
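A minimal sketch of the sinusoidal encoding in the form used by the original Transformer (even dimensions use sine, odd dimensions cosine, with the wavelength growing across dimension pairs):

```python
import math

def pos_encoding(pos, d_model=8):
    """Sinusoidal positional encoding for a single position."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

# The two occurrences of "left" (say positions 1 and 5) receive
# different vectors, so their position-augmented inputs differ.
pe_1 = pos_encoding(1)
pe_5 = pos_encoding(5)
```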

To make the positional encoding trainable, we can define it as a learned lookup table, with embeddings for each position that are optimized during training, instead of using a fixed sinusoidal function. The embeddings would be initialized randomly and updated via backpropagation like other model parameters.





****************************************************************************************
****************************************************************************************




Answer to Question 2-4
Here are my answers to the exam question:

a. False. Beam search is more memory-intensive than greedy decoding because beam search keeps track of multiple candidate sequences at each decoding step, while greedy decoding only keeps the single best candidate.

b. True. Models with different vocabularies cannot be directly ensembled during decoding because their output probability distributions are over different sets of tokens, making the probabilities incomparable.

c. True. Not normalizing the sentence probability by length will bias the decoding towards preferring shorter sequences, because the probability of a sequence is the product of its token probabilities which are all less than 1, so longer sequences will have lower probabilities.

d. True. A higher value of k in top-k sampling means more candidate tokens are considered at each decoding step, allowing for more variability and diversity in the generated sequences.
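The length bias in (c) can be seen with a short calculation on invented per-token probabilities:

```python
import math

short_seq = [0.5, 0.5]                  # 2 tokens
long_seq = [0.5, 0.5, 0.5, 0.5]         # 4 tokens, same per-token quality

raw_short = math.prod(short_seq)        # 0.25
raw_long = math.prod(long_seq)          # 0.0625: raw probability favors short
norm_short = sum(math.log(p) for p in short_seq) / len(short_seq)
norm_long = sum(math.log(p) for p in long_seq) / len(long_seq)
# After length normalization both candidates score log(0.5) per token,
# removing the bias toward shorter sequences.
```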





****************************************************************************************
****************************************************************************************




Answer to Question 2-5
BLEU is more impacted by the different wording in the example translations compared to COMET. Here's why:

BLEU (Bilingual Evaluation Understudy) is a metric that measures the similarity between the machine-generated translation and one or more reference translations. It calculates the precision of n-gram matches between the candidate and reference translations. BLEU relies heavily on exact word matches and word order.

In the given example, System 1's output perfectly matches the reference translation, while System 2's output differs in the pronoun used ("du" instead of "Sie"). Even though both translations are correct, BLEU would give a higher score to System 1 because it has an exact match with the reference. The difference in wording penalizes System 2's BLEU score.

On the other hand, COMET (Crosslingual Optimized Metric for Evaluation of Translation) is a learned metric that uses cross-lingual representations to assess translation quality. COMET is trained on human judgments and considers the semantic similarity between the candidate and reference translations.

In the example, both System 1 and System 2 convey the same meaning as the reference, despite the difference in pronouns. COMET's cross-lingual representations would capture the semantic similarity between the translations and the reference. As a result, COMET would be less impacted by the wording difference and would likely give similar scores to both systems.

In summary, BLEU is more sensitive to exact wording and word order, while COMET focuses on semantic similarity. Therefore, in cases like the given example where the translations are semantically equivalent but differ in wording, BLEU scores are more affected than COMET scores.





****************************************************************************************
****************************************************************************************




Answer to Question 3-1
a. Ranking by the number of trained parameters in the task-specific adaptation stage, from most to least:
1. (Promptless) Finetuning: In finetuning, all or a subset of the model's parameters are updated during the adaptation stage, requiring the most trained parameters.
2. Direct Prompting and In-Context Learning: Neither approach updates the model's parameters during the adaptation stage, so both have the fewest (zero) trained parameters.

b. Ranking by the amount of memory needed for inference (decoding), from most to least:
1. In-Context Learning: This approach requires the input-output pairs (context) to be provided as part of the input during inference, increasing the memory requirements.
2. Direct Prompting: Although prompts are used, they are typically shorter than the context used in in-context learning, resulting in lower memory requirements compared to in-context learning.
3. (Promptless) Finetuning: Once finetuned, the model can directly process the input without additional prompts or context, requiring the least memory during inference.

c. If I have only 8 input-output pairs for a specific task, I would choose In-Context Learning for the following reasons:
1. Limited data: With only 8 examples, finetuning the model may lead to overfitting and poor generalization. In-context learning allows the model to learn from the examples without updating its parameters, making it more suitable for low-data scenarios.
2. No training required: In-context learning does not require a separate training stage, saving time and computational resources compared to finetuning.
3. Flexibility: In-context learning allows for quick experimentation and adaptation to new tasks by simply providing relevant examples, without the need to retrain the model for each task.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2
a. To calculate the number of parameters trained when finetuning with adapters:
1. Each adapter consists of two linear projections. The first projects from 1024 to 256 dimensions, and the second projects from 256 back to 1024 dimensions.
2. The number of parameters in each linear projection is (input_dim * output_dim), ignoring bias terms.
3. For the first projection, it's (1024 * 256), and for the second, it's (256 * 1024).
4. The total number of parameters per adapter is (1024 * 256) + (256 * 1024).
5. There are 12 layers, so the total number of parameters trained is 12 * ((1024 * 256) + (256 * 1024)).

b. To calculate the number of parameters trained when finetuning with prompt tuning:
1. In prompt tuning, only the embeddings of the reserved tokens are trained.
2. The number of parameters trained is equal to (number of reserved tokens * embedding dimension).
3. In this case, it's 50 * 1024.
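Both counts can be checked directly (bias terms ignored, matching the calculation above):

```python
# Adapter finetuning: two linear projections per adapter, one adapter
# per layer, biases ignored.
d_model, d_bottleneck, n_layers = 1024, 256, 12
adapter_params = n_layers * (d_model * d_bottleneck + d_bottleneck * d_model)

# Prompt tuning: only the reserved token embeddings are trained.
n_reserved_tokens = 50
prompt_params = n_reserved_tokens * d_model

print(adapter_params, prompt_params)  # 6291456 51200
```

So adapter tuning trains roughly 120x more parameters than prompt tuning in this configuration.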

c. The model with prompt tuning runs out of memory during decoding, despite having fewer trained parameters, possibly because:
- In prompt tuning, the 50 reserved tokens are prepended to the input, lengthening every sequence; since self-attention memory grows with the number of attended positions, this increases memory usage at every layer and every decoding step.
- Adapters, by contrast, are small feed-forward modules that do not change the sequence length, so they leave the memory footprint of the attention computation unchanged.

d. The main difference between prompt tuning and prefix tuning is:
- In prompt tuning, the trained vectors are added only at the input embedding layer and are then processed by the frozen model like ordinary tokens.
- In prefix tuning, trainable vectors are prepended to the keys and values of every attention layer, giving the task direct influence at each layer.

Advantage of prompt tuning:
- It is simpler and trains fewer parameters, since only input-level embeddings are added rather than a separate prefix per layer.

Disadvantage of prompt tuning:
- It is less expressive, because the adaptation enters only at the input layer; it typically needs larger models or longer prompts to match the performance of prefix tuning.





****************************************************************************************
****************************************************************************************




Answer to Question 3-3
Here are my answers to the exam question:

a. To adapt the pretrained translation model to use information from the object detection model:
- Modify the input to the translation model to be a concatenation of the source sentence and the list of detected objects. For example: "Two people sitting by a river [SEP] PERSON PERSON RIVER BOAT", where [SEP] is a special separator token.
- Keep the output of the translation model the same (the translated sentence in the target language).
- For handling object labels not in the translation model's vocabulary, add the set of possible object labels to the vocabulary during fine-tuning. Alternatively, replace object labels not in the vocabulary with a special [UNK] token.

b. To analyze whether the model uses information from the object detection model:
- Create a test set of image-caption pairs.
- For each pair, create adversarial examples by modifying the list of detected objects to be incorrect (e.g. replacing RIVER with MOUNTAIN). 
- Compare the translations generated by the model for the original input and adversarial input.
- If the translations differ significantly when the detected objects are changed, this suggests the model is utilizing the object information. If the translations remain largely unchanged, the model may be ignoring the object information.

c. To adapt the pretrained translation model to use the 1024-dimensional encoded image vectors:
- Add a linear projection layer to the translation model that maps the 1024-dimensional image vector to the dimension of the word embeddings used by the model. 
- Concatenate the projected image vector to the source sentence word embeddings before feeding it to the encoder. The encoder and decoder can remain unchanged.
- The projection handles any mismatch in dimensionality: whether the word embedding dimension is smaller or larger than 1024, the linear layer maps the 1024-dimensional image vector to the word embedding size, so it can be prepended to the source token embeddings as one extra position.





****************************************************************************************
****************************************************************************************




Answer to Question 3-4
Here are my answers to the exam question on trustworthiness:

a. Retrieval-augmented generation (RAG) differs from traditional generation in that it incorporates an explicit retrieval step to find relevant information from an external knowledge source, which is then used to inform the generation process. In traditional generation, the model relies solely on the knowledge captured in its parameters during pretraining. RAG could potentially improve the faithfulness of large language models by grounding the generation in retrieved factual information. This reduces the risk of the model hallucinating incorrect statements, since it has access to verified information relevant to the input prompt.

b. I somewhat agree that hallucination in machine translation is easier to detect than in general-purpose text generation with large language models. In machine translation, the source text provides a clear reference that the translation can be compared against to identify potential hallucinations. There are also well-defined accuracy metrics like BLEU that can help detect low-quality translations. In contrast, for open-ended text generation, it is often difficult to determine whether a generated statement is factual or not, especially if it is plausible-sounding. However, hallucinations can still be tricky to identify in machine translation for more open-ended tasks like document-level translation or in low-resource language pairs.

c. Truncating long documents during training of large language models could lead to issues with hallucination because the model does not see the full context of the document. Important contextual information might get cut off, leading to an incomplete understanding. This could cause the model to make up information to fill in the gaps.

To mitigate this problem, a few techniques can be applied: 
(1) Use a sliding window approach to split the document into multiple overlapping segments that fit within the memory constraints. This preserves more context from the original document.
(2) Compress the document with a summarization or key information extraction step before feeding it to the model. This distills the important content while removing less essential details.
(3) Incorporate explicit retrieval like in RAG to fetch relevant background information that provides additional context to the truncated document.
(4) Use sparse attention mechanisms or other memory-efficient architectures that can process longer sequences without running out of memory.





****************************************************************************************
****************************************************************************************




