Answer to Question 1-1
a: False. One-hot representations assign a unique vector to each word but do not encode any semantic similarity or relationships between words.

b: False. German is generally considered morphologically richer than English due to its use of cases, genders, and more complex word formations.

c: False. In the hierarchy of language structure (morphology, then syntax, then semantics, then pragmatics), semantics sits above syntax: syntax concerns the arrangement of words and phrases into well-formed sentences, while semantics concerns the meaning built on top of that syntactic structure.

d: False. Word2Vec is not trained on the global word occurrence matrix; instead, it uses local word usage context to learn representations via shallow neural networks.

e: True. BPE initially merges the most frequent pairs of bytes (or characters) and continues merging iteratively. Thus, rare words are less likely to have frequent pairs and will be split into subwords more often than common words.
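The BPE behavior described above can be illustrated with a minimal sketch of the merge loop, using a made-up toy corpus where "low" is frequent and "lowest" is rare:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a toy corpus of
    symbol sequences, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: frequent word "low" vs. rare word "lowest".
corpus = {("l", "o", "w"): 50, ("l", "o", "w", "e", "s", "t"): 2}
for _ in range(2):
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
```

After two merges the frequent word "low" has become a single token, while the rare "lowest" remains split into the subwords "low", "e", "s", "t", exactly the effect described above.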

f: True. Conditional Random Fields (CRFs) are designed to allow for the easy insertion of arbitrary features into the model, making the integration of new features more straightforward compared to Hidden Markov Models (HMMs).





****************************************************************************************
****************************************************************************************




Answer to Question 1-2
1. Dimensionality: Dense word embeddings are preferred because they have a much smaller, fixed-dimensional vector space compared to the high-dimensional space of sparse features. Sparse features, especially one-hot encoded vectors, can become very high dimensional as the size of the vocabulary increases, leading to issues like the curse of dimensionality and computational inefficiency.

2. Semantics: Dense embeddings capture semantic meaning and relationships between words better than sparse features. Words with similar meanings tend to have similar embeddings, allowing models to generalize better from the data. Sparse features, on the other hand, treat each word as an independent feature, failing to capture these semantic relationships.
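Both points can be illustrated with a toy comparison (the dense vectors below are hand-picked for illustration, not learned):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot: every pair of distinct words is orthogonal (similarity 0),
# and the vector length grows with the vocabulary size.
vocab = ["cat", "dog", "car"]
one_hot = {w: [1 if i == j else 0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense (toy, hand-picked values): similar words get similar vectors
# in a small fixed-dimensional space.
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}
```

With one-hot vectors, cosine("cat", "dog") is exactly 0, no more similar than "cat" and "car"; with the dense vectors, "cat" is closer to "dog" than to "car", which is the semantic generalization the answer refers to.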





****************************************************************************************
****************************************************************************************




Answer to Question 1-3
a: To create representations for products using ideas similar to learning word representations, I would apply techniques like those used in natural language processing (NLP), specifically those akin to the Word2Vec model, to the co-purchase matrix. Here's how I would do it step-by-step:

1. Treat each product as a "word" and each co-purchase instance as a "sentence" where products that are purchased together are considered as being in the same context.
2. Use the co-purchase matrix to construct "contexts" for each product by considering surrounding products as context. The number of products considered as context can be analogous to the window size in Word2Vec.
3. Apply a technique similar to the Skip-gram or Continuous Bag of Words (CBOW) model from Word2Vec. For Skip-gram, I would predict surrounding products from the target product, whereas for CBOW, I would predict the target product from its surrounding products.
4. Utilize a neural network architecture that learns to perform this prediction task. The input to the network would be the products, and the output would be the probability distribution of a product being in the context of the input product.
5. Train the model using pairs of products derived from the co-purchase matrix. During training, adjust the weights of the neural network which effectively adjusts the product representations in the hidden layer.
6. After training, use the weights (product vectors) from the hidden layer as the product representations.
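Steps 1-3 can be sketched as a small pair-generation routine; the basket contents below are hypothetical product IDs, and in practice these (target, context) pairs would feed a shallow Skip-gram-style network whose hidden-layer weights become the product vectors:

```python
def skipgram_pairs(baskets, window=2):
    """Generate (target, context) training pairs as in Skip-gram,
    treating each co-purchase basket as a 'sentence' of products."""
    pairs = []
    for basket in baskets:
        for i, target in enumerate(basket):
            lo, hi = max(0, i - window), min(len(basket), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the target itself
                    pairs.append((target, basket[j]))
    return pairs

# Hypothetical baskets of products purchased together.
baskets = [["laptop", "mouse", "usb_hub"], ["laptop", "mouse"]]
pairs = skipgram_pairs(baskets, window=2)
```

Products that co-occur in many baskets (here "laptop" and "mouse") generate more training pairs, so their learned vectors end up closer together.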

b: To recommend similar products to users who have shown interest in a given product using the derived product representations:

1. Identify the vector representation of the product the user is interested in from the product representations derived in the previous subquestion.
2. Calculate the cosine similarity (or another similarity measure) between the vector of the interested product and the vectors of all other products.
3. Identify products with the highest similarity scores as these products are likely to be in similar contexts - purchased together or show similar purchase patterns.
4. Recommend the top-N similar products to the user, where N can be decided based on the system's requirements or after tuning for the best user experience.
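The four steps above can be sketched as follows; the product vectors are toy values standing in for the learned representations from part (a):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recommend(product, vectors, n=2):
    """Rank all other products by cosine similarity to the query product
    and return the top-n names."""
    scores = [(other, cosine(vectors[product], vec))
              for other, vec in vectors.items() if other != product]
    scores.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in scores[:n]]

# Toy vectors standing in for the learned product representations.
vectors = {
    "laptop":  [0.9, 0.1, 0.0],
    "mouse":   [0.8, 0.2, 0.1],
    "blender": [0.0, 0.9, 0.4],
    "toaster": [0.1, 0.8, 0.5],
}
```

With these toy values, a user interested in "laptop" would be recommended "mouse" first, since its vector points in nearly the same direction.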





****************************************************************************************
****************************************************************************************




Answer to Question 1-4
a: A property of CNNs that is beneficial for spam detection is their ability to capture local patterns within the input data through the use of filters (kernels). These local patterns can be specific keywords or phrases indicative of spam, and unlike RNNs, CNNs can identify and learn these patterns regardless of their position in the input sequence thanks to their translation invariance. RNNs, in contrast, process data sequentially, which can make it harder to capture such local patterns, especially in longer sequences where important signals may get diluted over many time steps.

b: A CNN-based model for spam detection would begin with the input which is the text of the email. This text would first be pre-processed by converting it into a numerical format, such as using one-hot encoding or word embeddings, to create a matrix representation of the email. The model would then apply convolutional layers with various filter sizes to extract features from the text. For example, small filters could be used to detect individual keywords, while larger filters could detect phrases. Each convolutional layer would be followed by a non-linear activation function, typically ReLU, and a pooling layer to reduce the dimensionality and to retain only the most significant features. 

The intermediate operations would also include dropout layers to prevent overfitting and fully connected layers to combine features from different parts of the email. The sizes of the feature maps depend on the design of the network, but they generally shrink after each pooling operation. For example, starting from a 128x128 input matrix (a 128-dimensional embedding for each of 128 tokens), pooling would successively reduce the feature maps along the sequence dimension, e.g. from 128 to 64 to 32 positions.

The output would be a dense layer with a single neuron and a sigmoid activation function to output a probability of the email being spam. The email is classified as spam if the probability is above a certain threshold, usually 0.5.
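The feature-map sizes in the description above can be sanity-checked with simple shape arithmetic; the filter width, pooling size, and 128-token input length below are assumptions for illustration:

```python
def conv1d_out_len(seq_len, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution over the token dimension."""
    return (seq_len + 2 * padding - kernel) // stride + 1

def pool_out_len(seq_len, pool, stride=None):
    """Output length of a 1-D pooling layer (stride defaults to pool size)."""
    stride = stride or pool
    return (seq_len - pool) // stride + 1

# Assumed setup: 128 tokens, a filter of width 3 over the sequence,
# followed by max pooling with window 2.
seq = 128
after_conv3 = conv1d_out_len(seq, kernel=3)     # positions after convolution
after_pool = pool_out_len(after_conv3, pool=2)  # positions after pooling
```

This kind of bookkeeping is useful when stacking layers: the sequence dimension shrinks at each pooling step until the final dense layer with its single sigmoid output.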

c: Given that the dataset is imbalanced, with a large majority of non-spam emails, accuracy is not a good measure because it can be misleading: a classifier that labels everything as non-spam already achieves high accuracy. Instead, I would suggest that Tom use the F1 score, which is the harmonic mean of precision and recall and gives a much better picture of the misclassified cases than the accuracy metric. Precision shows how many of the emails identified as spam are actually spam, while recall shows how many of the actual spam emails were identified. The F1 score is particularly useful when the cost of misclassifying spam as non-spam, and vice versa, is high. Other useful metrics include ROC-AUC, which summarizes performance across threshold levels, or a confusion matrix to visualize true positives, false positives, true negatives, and false negatives.





****************************************************************************************
****************************************************************************************




Answer to Question 1-5
a: For the named entity recognition task for disease names in medical documents, a non-RNN-based model could be a BERT (Bidirectional Encoder Representations from Transformers) based model.

Input: The input would be the raw text of the medical documents. Before feeding it to the model, it should be preprocessed which includes tokenization and encoding the tokens as input IDs understandable by the model. 

Intermediate Operations: The core of the BERT model involves transformer architecture which has multiple layers of self-attention and feedforward neural networks. Since BERT is designed to understand the context of a word in a sentence, it will process each token in relation to the rest in the document. The pre-trained BERT model can be fine-tuned with the 10,000 documents data wherein the disease names are marked. During fine-tuning, the model's parameters are updated to perform the task of named entity recognition for disease names. 

Output: The output of the model will be the input tokens labeled with tags indicating whether a token is part of a disease name or not (typically using tags like B-Disease for the beginning of a disease entity, I-Disease for the inside of an entity, and O for outside any entity).
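The BIO tagging scheme mentioned above can be sketched as a small span-to-tag conversion; the example sentence and the disease span are hypothetical:

```python
def bio_tags(tokens, entity_spans):
    """Assign B-Disease / I-Disease / O tags given disease entity spans
    as (start, end) token indices, with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B-Disease"          # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = "I-Disease"          # inside the entity
    return tags

# Hypothetical tokenized sentence with "type 2 diabetes" marked as a disease.
tokens = ["Patient", "shows", "signs", "of", "type", "2", "diabetes", "."]
tags = bio_tags(tokens, [(4, 7)])
```

During fine-tuning, the model's per-token classification head is trained against exactly these tag sequences.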

b: A challenge of using pretrained GloVe embeddings is that these embeddings are static and do not take the context of the words into account, which can be crucial in the medical field where the meaning of terms can heavily depend on the context. 

To resolve this challenge, a more advanced strategy could be employed - using contextualized embeddings like those from ELMo, BERT, or GPT which generate embeddings based on the context of the word in a sentence. This is significant in a specialized field such as medicine where the context can completely alter the meaning. Moreover, these models can be further fine-tuned on the specific medical corpus to understand better and recognize the disease names accurately in the medical documents.





****************************************************************************************
****************************************************************************************




Answer to Question 2-1
a: Based on the given unigram model and the examples, let's calculate the probability for each sentence.

For sentence (1), "He saw their football in the park":
- Probability(there) = count("there") / N = 110 / 10000 = 0.011
- Probability(their) = count("their") / N = 50 / 10000 = 0.005

Since Probability(there) > Probability(their), the rule would return "there", which is incorrect here: the sentence requires the possessive "their".

For sentence (2), "He saw their was a football": using the same probabilities, the rule again returns "there", which in this case happens to be correct.

However, this is not a good solution because the unigram model ignores context entirely: it always returns the more frequent word ("there") regardless of the sentence, so it can never produce "their". In many cases, the correct form depends on the surrounding words. For instance, "there" is often followed by a verb such as "is" or "was", while "their" is typically followed by a noun.

b: A simple bigram model might be better than a unigram model because it takes into account the probability of a word given the previous word. This would allow the model to use context, predicting "their" or "there" based on the previous word in the sentence. This could potentially improve the accuracy since "their" is more likely to be preceded by a word that possessive nouns follow (e.g., "their football"), and "there" would be more likely with words that fit with its use as an adverb or a pronoun (e.g., "there is").

Potential problems with this model in practice include data sparsity and overfitting. Since the bigram model relies on seeing pairs of words together, it might not perform well if the training corpus does not have enough examples of the word pairs it encounters, leading to less reliable probability estimates. Overfitting can occur if the model learns to predict perfectly for the training data but fails to generalize to unseen data.
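The contrast between the two models can be sketched with the unigram counts from the question and made-up bigram counts over the confusable word and the word that follows it (one way to operationalize the context argument above; the bigram values are toy numbers, not from the question):

```python
# Unigram counts from the question.
N = 10000
unigram = {"there": 110, "their": 50}

# Hypothetical bigram counts over (word, following word):
bigram = {
    ("their", "football"): 9, ("there", "football"): 0,
    ("there", "was"): 40,     ("their", "was"): 0,
}

def unigram_choice():
    # Ignores context entirely: always picks the more frequent word.
    return max(unigram, key=lambda w: unigram[w] / N)

def bigram_choice(next_word):
    # Picks the word that co-occurs more often with the following word.
    return max(["their", "there"], key=lambda w: bigram.get((w, next_word), 0))
```

The unigram rule returns "there" no matter what, while the bigram rule can choose "their" before "football" and "there" before "was". The sparsity problem described above shows up directly in `bigram.get(..., 0)`: any unseen pair gets count zero.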





****************************************************************************************
****************************************************************************************




Answer to Question 2-2
a. In the figure "figures/Mask_under_MLM.pdf", assuming a masking ratio of 20% and five tokens (w1 to w5), exactly one token should be masked. Choosing one at random, say `w3`, I would mark this on the figure by drawing a solid black square over the token `w3` in the input row to indicate that it has been replaced by the [MASK] token.

b. MLM usually needs fewer iterations over the training data compared to CLM. This is because MLM can learn from the entire context on both left and right sides for each masked token in a single pass, and it typically masks around 15-20% of tokens in the input sequence. CLM, on the other hand, predicts each token based only on the preceding tokens and doesn't benefit from the bidirectional context. Since MLM allows the model to see and learn from more of the input data in a single iteration, it may require fewer iterations to achieve similar performance levels.

c. MLM does not require input sequences to be shifted because it is designed to predict masked out tokens given their context. In MLM, a certain percentage of words are masked and the model aims to predict the masked word given both the preceding and following contexts. Since prediction is targeted only at the masked positions, there is no need to use the shift-right technique that is employed in CLM where each token is predicted from the previous tokens sequentially.
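The masking procedure from (a) and the no-shift point from (c) can be sketched together; the 20% ratio matches the figure, and the loss would be computed only at the recorded target positions, with no shifted copy of the input:

```python
import random

def mask_tokens(tokens, ratio=0.15, seed=0):
    """Replace a random subset of positions with [MASK]; the original
    tokens at those positions become the prediction targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = masked[i]   # remember what to predict
        masked[i] = "[MASK]"
    return masked, targets

tokens = ["w1", "w2", "w3", "w4", "w5"]
masked, targets = mask_tokens(tokens, ratio=0.20)
```

With five tokens and a 20% ratio exactly one position is masked; the model sees the full bidirectional context of that position, which is why no right-shifting of the sequence is needed.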

d. PrefixLM often has better performance than CLM because by conditioning on a prefix, it allows future tokens to have some dependence on the preceding tokens, providing more context than in standard CLM. This additional context can help the model make more accurate predictions. The supplementary figure "figures/Illustration_of_language_model_training.png" shows that PrefixLM, unlike standard CLM, allows for some degree of interaction between the prefix and the generated sequence, which can help in capturing longer-range dependencies and improving the coherence and quality of the generated text.





****************************************************************************************
****************************************************************************************




Answer to Question 2-3
a: Without positional encoding, the contextual embeddings for the two instances of the word "left" in "I left my phone in my left pocket" would be identical. In self-attention, each token's output is computed from its own embedding (which produces the query) and the embeddings of all tokens in the sentence (which produce the keys and values). Since both occurrences of "left" have the same input embedding, they produce the same query vector and therefore the same attention distribution over the same set of keys and values, so they receive identical outputs, and this symmetry persists through every layer. Positional encoding is exactly what breaks this symmetry: it makes the input representations of the two occurrences differ by position, which allows the model to distinguish the verb "left" from the adjective "left" based on their different surroundings.

b: No, we cannot directly use the dot product attention if the attention query has 1024 dimensions and the key has 512 dimensions. The dot product attention mechanism calculates the attention score by taking the dot product of the query and the key. To do so, the query and the key must be of the same dimensionality. To rectify this, either the query has to be projected down to 512 dimensions or the key has to be projected up to 1024 dimensions to match their dimensionality before the dot product can be computed.
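A minimal sketch of the fix for (b), using a randomly initialized projection matrix purely for illustration (in a real model this projection would be learned):

```python
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a cols-dimensional vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)
d_q, d_k = 1024, 512
query = [rng.gauss(0, 1) for _ in range(d_q)]
key = [rng.gauss(0, 1) for _ in range(d_k)]

# A direct dot product is ill-defined because the dimensions differ.
# Down-project the query with a 512 x 1024 matrix W (random here):
W = [[rng.gauss(0, 0.01) for _ in range(d_q)] for _ in range(d_k)]
projected_query = matvec(W, query)
score = dot(projected_query, key)  # now well-defined
```

Equivalently, the key could be up-projected to 1024 dimensions; either way the two vectors must share a dimensionality before the attention score can be computed.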

c: The positional encoding enables the model to treat different positions differently by explicitly adding a unique value to the embedding of each token based on its position in the sequence. Since different positions have different sinusoidal functions (sine for even dimensions and cosine for odd dimensions), it results in a different positional encoding for each position. The encoding varies smoothly with position i and with respect to each dimension d, which allows the model to easily learn relative positions by considering the differences between the encodings. This, in turn, allows the transformer to infer sequence order and word positions even though it has no recurrence or convolution layers to provide this information inherently.

To have trainable positional encodings, one could initialize positional encodings similar to how other model parameters are initialized (with random values, for instance) and then allow these parameters to be updated during the training process. Alternatively, a model could include a separate neural network that takes the position as an input and outputs a positional encoding. This neural network would learn the optimal positional encoding during the training process.
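The sinusoidal scheme described above can be sketched directly from the original Transformer formula; a trainable variant would simply replace this fixed table with randomly initialized parameters updated by gradient descent:

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding: sine on even dimensions,
    cosine on odd dimensions, with wavelengths growing with i."""
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

pe0 = positional_encoding(0, 8)  # encoding for position 0
pe1 = positional_encoding(1, 8)  # encoding for position 1
```

Position 0 always encodes to alternating 0s and 1s (sin 0 and cos 0), and every other position gets a distinct vector, which is what lets the model treat different positions differently.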





****************************************************************************************
****************************************************************************************




Answer to Question 2-4
a) False. Greedy decoding is less memory-intensive than beam search as it keeps only one hypothesis at each step while beam search keeps multiple hypotheses.

b) False. It is possible to ensemble text generation models with different vocabularies by mapping the vocabularies to a common space or by converting the output probabilities accordingly.

c) True. Without sequence length normalization, the model tends to favor shorter sequences because they multiply fewer probability values, leading to a higher overall sequence probability.

d) True. A higher value of k in top-k sampling means considering more potential next words, thereby increasing the diversity and variability in the generated output.
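The top-k mechanism from (d) can be sketched as truncate-and-renormalize over a hypothetical next-token distribution:

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {w: p / total for w, p in top}

def sample(probs, seed=0):
    """Draw one token from the (renormalized) distribution."""
    rng = random.Random(seed)
    r, acc = rng.random(), 0.0
    for w, p in probs.items():
        acc += p
        if r <= acc:
            return w
    return w  # numerical-edge fallback

# Hypothetical next-token distribution:
probs = {"the": 0.5, "a": 0.3, "banana": 0.15, "quark": 0.05}
filtered = top_k_filter(probs, k=2)
```

With k=2 only "the" and "a" survive; raising k admits "banana" and "quark" back into the candidate pool, increasing the variability of the output, which is exactly why larger k means more diverse generations.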





****************************************************************************************
****************************************************************************************




Answer to Question 2-5
1. BLEU is more impacted by different wording, as in the example above.
2. This is because BLEU focuses on the exact match of n-grams between the candidate translation and the reference translation. Since System 1's output ("Was möchten Sie trinken?") is an exact match with the reference translation, it would get a higher BLEU score. On the other hand, System 2's output ("Was möchtest du trinken?") uses a different pronoun and verb conjugation ("du" instead of "Sie" and "möchtest" instead of "möchten"), which would not be an exact match, leading to a lower BLEU score. COMET, however, is based on machine learning models that can capture the meaning and fluency of the translation more holistically and is less sensitive to such wording differences as long as the overall semantics are preserved.





****************************************************************************************
****************************************************************************************




Answer to Question 3-1
a. The ranking of the approaches by the number of trained parameters in the task-specific adaptation stage is as follows:
1. (Promptless) Finetuning
2. In-Context Learning (tied with Direct Prompting at zero)
3. Direct Prompting (tied with In-Context Learning at zero)

Explanation: In (promptless) finetuning, all of the model's parameters are updated for the specific task. In-context learning trains no parameters at all: the model is given input-output pairs as examples within the prompt and must generalize from them. Direct prompting likewise trains nothing; we only feed the model an instruction and let the pretrained model's parameters do the work. The last two are therefore tied at zero trained parameters.

b. The ranking of the approaches by the amount of memory needed for inference (decoding) is as follows:
1. (Promptless) Finetuning
2. In-Context Learning
3. Direct Prompting

Explanation: Finetuning requires keeping a separate, task-specific copy of all model parameters, which consumes the most memory overall. In-context learning uses the original frozen model, but its prompts are much longer because they contain the demonstration examples, which increases activation and attention memory at inference time. Direct prompting typically needs the least memory: it relies on the original pretrained model with only a concise instruction as input.

c. With only 8 input-output pairs, I would choose In-Context Learning. 

Explanation: This approach helps models generalize a task from only a few examples, which is exactly the situation with 8 pairs. Finetuning usually requires a larger dataset to adapt the model effectively without overfitting. In-context learning can leverage the few examples by including them directly in the prompt, guiding the model to the specifics of the task without the risk of overfitting. Direct prompting might also work, but without the examples being explicitly provided, the model has less guidance on the task specifics and might not perform it correctly.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2
a. The number of parameters trained in the finetuning stage with adapters can be calculated with the following steps:
- Each adapter consists of two linear projections: a down-projection from 1024 to 256 dimensions and an up-projection from 256 back to 1024 dimensions.
- The down-projection has 1024 * 256 = 262,144 weights and the up-projection has 256 * 1024 = 262,144 weights, i.e. 524,288 weights per adapter.
- An adapter is added after each of the 12 layers, so the weights total 12 * 524,288 = 6,291,456.
- We also need the bias terms for both projections in each adapter: 256 biases for the down-projection and 1024 for the up-projection, giving 12 * (256 + 1024) = 15,360 biases in total.
- The total number of trained parameters is therefore 6,291,456 + 15,360 = 6,306,816.

b. For prompt tuning, only the embeddings of the reserved prompt tokens are trained:
- Each token has an embedding size of 1024.
- There are 50 reserved tokens for the prompt.
- The total number of trained parameters is 50 * 1024 = 51,200.
- Since these are the only parameters being trained (the rest of the model remains frozen), this is the full count of trained parameters.
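The arithmetic for both parts can be verified with a few lines:

```python
# (a) Adapters: down- and up-projection weights plus biases, times 12 layers.
d_model, d_bottleneck, layers = 1024, 256, 12
weights_per_adapter = d_model * d_bottleneck + d_bottleneck * d_model
biases_per_adapter = d_bottleneck + d_model
adapter_total = layers * (weights_per_adapter + biases_per_adapter)

# (b) Prompt tuning: only the 50 prompt-token embeddings are trained.
prompt_tokens = 50
prompt_total = prompt_tokens * d_model
```

This confirms the contrast the question is after: the adapter setup trains roughly 6.3 million parameters, while prompt tuning trains only 51,200.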

c. A potential explanation for the model with prompt tuning running out of memory, despite having fewer trained parameters, is the handling of the extra prompt tokens:
- The prompt tokens increase the sequence length, and the memory used by the self-attention mechanism grows quadratically with sequence length, so the longer inputs can push a near-capacity setup over the limit.
- The model must also store activations (and, during training, gradients) for the prompt positions in every layer, adding further overhead in the forward and backward passes.

d. The main difference between prompt tuning and prefix tuning is:
- In prompt tuning, trainable tokens are added to the input sequence and optimized during finetuning. The model treats these tokens as part of the input and learns to associate them with the desired output through the attention mechanism.
- In prefix tuning, the prefixes are not part of the input token sequence; they are additional trainable parameters prepended to the sequences of keys and values in the self-attention calculation, typically at every layer of the model.

Advantage of prompt tuning compared to prefix tuning:
- It can be more interpretable because the prompt tokens can often be formulated in a way that is readable and understandable to humans.

Disadvantage of prompt tuning compared to prefix tuning:
- It increases the sequence length, which can lead to higher memory consumption and potentially slower processing during inference due to the self-attention mechanism's complexity, which grows quadratically with sequence length.





****************************************************************************************
****************************************************************************************




Answer to Question 3-3
a: To adapt the pretrained model to use information from the object detection model, we can append the list of object labels detected in the image to the text input before feeding it into the translation model. The model input would then be a concatenated string of the original text caption and the list of object labels (e.g., "Two people are sitting by a river. PERSON PERSON RIVER BOAT"). The output of the model remains the translated text. 

To handle the case when the object label is not in the vocabulary of the pretrained translation model, we can employ a fallback strategy such as:
1. Ignoring the object label that is not in the vocabulary.
2. Using a nearest neighbor approach to map the unrecognized label to the closest label in the model's vocabulary based on semantic similarity.
3. Adding a special token (e.g., UNK for unknown) in place of the unrecognized label to indicate to the model that there is an object that is not part of its learned vocabulary.

b: To analyze whether the model makes use of information from the object detection model, we could:
1. Compare the performance of the translation model with and without the appended object labels on a validation set.
2. Perform an ablation study where the object labels are systematically removed or replaced with random labels to observe the impact on translation quality.
3. Apply attention mechanism visualizations to see if the translation model focuses on the appended object labels during the translation process.

c: To adapt the pretrained translation model to additionally use the encoded image, we could introduce a multimodal fusion layer that combines the encoded image vector with the text embedding from the translation model. This fusion layer could be a fully connected layer or some form of attention mechanism that learns to weigh the contribution of image and text features appropriately.

To handle the case when the size of the encoded image does not match the embedding dimensions of the translation model, we can:
1. Use a linear transformation layer to project the image vector to the same dimensionality as the text embeddings.
2. Utilize pooling operations (e.g., average or max pooling) to reduce the dimensionality of the image encoding to match the text embeddings.
3. Expand the smaller vector through padding or replication to match the dimensionality of the larger vector.





****************************************************************************************
****************************************************************************************




Answer to Question 3-4
a: Retrieval-augmented generation (RAG) differs from traditional generation in that RAG combines the language generation capabilities of a model like GPT with a retrieval component that can pull in information from a large corpus of documents. Traditional generation models rely solely on the knowledge that they have been pre-trained on, and they generate text based on patterns they have learned during this training. RAG, however, can enhance this by searching for and utilizing additional information relevant to the query, which was not necessarily part of the training set.

RAG could potentially improve the faithfulness of large language models by providing them with access to factual and up-to-date information outside of their training data. By grounding the responses in real-world information that is retrieved on-the-fly, RAG can help mitigate issues with hallucinations (i.e., making up facts or being inconsistent with reality) that traditional models may have due to their limited pre-training knowledge base.

b: I agree that hallucination in machine translation is easier to detect than in general-purpose text generation with large language models. Machine translation has a source text that the output can be directly compared against for semantic and factual consistency, so hallucinations can be spotted relatively easily by checking that the translation is a faithful representation of the original content. With general-purpose text generation, there is often no reference text to compare against, which makes it much harder to determine whether a piece of generated information is hallucinated.

c: Truncating long documents during the training of large language models can cause issues with model hallucination because truncation may result in the model missing out on critical context that could help it understand and generate more accurate and relevant content. Without this context, the model may be more prone to filling gaps in understanding with fabricated information, leading to hallucination.

To mitigate the problem of hallucination due to document truncation, one approach could be to improve the model's ability to summarize and comprehend the most important parts of a document without needing the entire content. Another solution could be to employ techniques that allow the model to process longer sequences more efficiently, such as memory compression techniques or training with more powerful hardware that can handle longer documents. Lastly, models could be designed to reference external databases or texts during both training and inference to supplement truncated information.





****************************************************************************************
****************************************************************************************




