Answer to Question 1-1
a. False. One-hot representations are mutually orthogonal and equidistant, so they cannot capture similarities between words based on meaning or context.

b. False. German has a richer morphology than English: it retains a case system, grammatical gender agreement, and more extensive inflection, whereas English morphology is comparatively simple.

c. False. Syntax and semantics are separate but intertwined levels of language processing, with neither being strictly lower nor higher than the other.

d. False. Word2Vec is trained on local context windows of words in a corpus, not a global word occurrence matrix.

e. True. Byte-pair encoding merges frequent character sequences, so frequent words tend to remain single tokens while less frequent words are split into smaller subword units.

f. True. As discriminative models, Conditional Random Fields (CRFs) can condition on arbitrary, overlapping input features, whereas Hidden Markov Models (HMMs), being generative, must model the distribution of the features themselves, which makes integrating new feature types harder.





****************************************************************************************
****************************************************************************************




Answer to Question 1-2
1. Dense word embeddings capture semantic relationships better: Dense embeddings encode semantic information in a continuous vector space, allowing similar words to have similar representations. This is beneficial for various NLP tasks such as word similarity, analogy completion, and machine translation.

2. Dimensionality reduction: Dense embeddings typically have a lower dimensionality compared to sparse features, making them more computationally efficient and easier to work with in machine learning models. This reduction in dimensionality helps in mitigating the curse of dimensionality and reducing overfitting.
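As a toy illustration of the similarity point, dense vectors support graded comparison via cosine similarity (all vector values below are invented, not trained embeddings):

```python
import numpy as np

# Made-up 4-dimensional "embeddings": semantically related words are
# given nearby vectors, so cosine similarity reflects relatedness.
emb = {
    "king":  np.array([0.8, 0.65, 0.1, 0.05]),
    "queen": np.array([0.75, 0.7, 0.15, 0.1]),
    "apple": np.array([0.05, 0.1, 0.9, 0.8]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["king"], emb["queen"]))  # high: related words
print(cosine(emb["king"], emb["apple"]))  # low: unrelated words
```

With one-hot vectors, every such comparison would return 0, which is exactly the limitation dense embeddings remove.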







****************************************************************************************
****************************************************************************************




Answer to Question 1-3
a) To create representations for the products similar to learning word representations, we can use techniques like Word2Vec or GloVe. We can treat each product as a "word" and the co-purchase matrix as a "corpus" of co-purchases. By applying operations similar to the Skip-gram or Continuous Bag of Words (CBOW) models used in Word2Vec, we can derive low-dimensional vector representations for each product based on the patterns of co-purchases in the matrix. For example, we can train a neural network to predict co-purchases of products based on their vector representations, updating the product vectors to minimize prediction errors.

b) With the product representations derived in the previous step, we can recommend similar products to users interested in a particular product by computing the similarity between the vector representation of the given product and other product vectors. Products with closer vector representations in the learned embedding space are considered more similar. Therefore, we can recommend products that are closest (in terms of cosine similarity or Euclidean distance) to the product the user has shown interest in. This approach leverages the learned relationships between products from the co-purchase matrix to make relevant recommendations to users.
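The recommendation step can be sketched as a nearest-neighbor lookup in the embedding space (product names and embedding values below are made up for illustration):

```python
import numpy as np

# Rows of E are learned product embeddings; we recommend the k products
# whose vectors have the highest cosine similarity to the query product.
products = ["tent", "sleeping_bag", "camp_stove", "laptop", "mouse"]
E = np.array([
    [0.90, 0.10, 0.00],  # tent
    [0.85, 0.15, 0.05],  # sleeping_bag
    [0.70, 0.20, 0.10],  # camp_stove
    [0.05, 0.90, 0.40],  # laptop
    [0.10, 0.85, 0.50],  # mouse
])

def recommend(query_idx, k=2):
    q = E[query_idx]
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
    sims[query_idx] = -np.inf          # exclude the query product itself
    top = np.argsort(sims)[::-1][:k]   # indices of the k most similar products
    return [products[i] for i in top]

print(recommend(products.index("tent")))
```

In practice E would come from the Skip-gram-style training described in part a), and the lookup would be served with an approximate nearest-neighbor index for large catalogs.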






****************************************************************************************
****************************************************************************************




Answer to Question 1-4
a. One property of CNNs that benefits spam detection, and is not shared by RNNs, is that convolutional filters with shared weights detect local patterns (e.g., character or word n-grams) at any position in the input, and do so in parallel rather than sequentially. This is especially useful when specific keywords or short phrases are indicative of spam regardless of where they occur in the email.

b. For a CNN-based model for spam detection, the input would be the email content represented as a sequence of word embeddings or one-hot encodings. The intermediate operations would involve convolutional layers with activation functions (e.g., ReLU) followed by pooling layers (e.g., max pooling) to extract important features from the input data. Finally, the output would be a classification layer (e.g., softmax) that predicts whether the email is spam or not. The size of the feature map would depend on the design of the convolutional layers and can be adjusted based on the complexity of the task.
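The convolution-plus-pooling pipeline can be sketched in plain NumPy (embedding and filter values are random placeholders, not a trained model):

```python
import numpy as np

# A 1-D convolution over word embeddings acts as an n-gram detector;
# max pooling then keeps the strongest response anywhere in the email.
rng = np.random.default_rng(0)
seq_len, d = 10, 8
X = rng.standard_normal((seq_len, d))   # email as a sequence of word embeddings

width = 3                               # trigram-sized filter
F = rng.standard_normal((width, d))     # one convolutional filter (shared weights)

# Slide the filter over every trigram window ("valid" convolution).
feature_map = np.array(
    [np.sum(X[i:i + width] * F) for i in range(seq_len - width + 1)]
)
pooled = feature_map.max()              # max pooling over positions
print(feature_map.shape, pooled)
```

A real model would use many such filters of several widths, a nonlinearity, and a final classification layer on the pooled features.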

c. To address the issue of imbalanced class distribution where the majority of emails are not spam, a better metric to evaluate model performance would be the F1 score. The F1 score takes into account both precision and recall, making it suitable for binary classification tasks with imbalanced classes. It provides a balanced assessment of the model's ability to correctly identify spam emails while minimizing false positives.
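A small worked example of the metric, with invented confusion-matrix counts for an imbalanced test set:

```python
# Precision, recall, and F1 for the spam (positive) class.
# Counts are made up: 60 spam emails among 1000 total.
tp, fp, fn, tn = 40, 10, 20, 930

precision = tp / (tp + fp)                       # 0.8
recall = tp / (tp + fn)                          # 2/3
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# Accuracy is 0.97 even though a third of the spam is missed;
# F1 (~0.73) gives a more honest picture of spam detection quality.
print(precision, recall, f1, accuracy)
```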





****************************************************************************************
****************************************************************************************




Answer to Question 1-5
a. For the named entity recognition task of extracting disease names in medical documents without using a Recurrent Neural Network (RNN)-based model, note that BiLSTMs are themselves RNNs and are therefore excluded. A suitable alternative is a Transformer encoder combined with a Conditional Random Field (CRF) layer for sequence labeling:

Input:
- The input to the model would consist of tokenized words (or subwords) from the medical documents.
- Each token would be represented as a word embedding vector, with positional encodings added.

Intermediate Operations:
- The embeddings would be fed through the self-attention layers of a Transformer encoder, which capture contextual information from the whole sequence in both directions.
- The encoder outputs would be passed to a CRF layer to jointly decode the optimal label sequence (e.g., BIO tags marking disease mentions).

Output:
- The output of the model would be the predicted disease names along with their corresponding positions in the input medical documents.
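As a sketch of the final decoding step, assuming the sequence labeler emits BIO tags (the tokens and tags below are invented):

```python
# Turn BIO tags (typical CRF sequence-labeling output) into disease spans.
def extract_entities(tokens, tags):
    """Collect maximal B-DISEASE/I-DISEASE runs as entity strings."""
    entities, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-DISEASE":
            if current:                      # close the previous entity
                entities.append(" ".join(current))
            current = [tok]
        elif tag == "I-DISEASE" and current:
            current.append(tok)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:                              # entity running to end of sentence
        entities.append(" ".join(current))
    return entities

tokens = ["Patient", "shows", "signs", "of", "type", "2", "diabetes",
          "and", "hypertension", "."]
tags   = ["O", "O", "O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE",
          "O", "B-DISEASE", "O"]
print(extract_entities(tokens, tags))
```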

b. One challenge of using GloVe pretrained word embeddings for initializing the word embedding layer in the proposed model is the domain mismatch between the general corpus used to train GloVe embeddings and the specific domain of medical texts containing disease names. 

To resolve this challenge, domain-specific embeddings or fine-tuning the pretrained GloVe embeddings on a medical text corpus can be considered:
- Domain-Specific Embeddings: Train word embeddings specifically on a medical text corpus where the embeddings are more aligned with the vocabulary and semantic expressions present in medical documents.
- Fine-tuning GloVe Embeddings: Fine-tune the pretrained GloVe embeddings on the acquired medical documents related to diseases to adapt the word representations to the specific domain.

By using domain-specific embeddings or fine-tuning pretrained embeddings on domain-specific data, the model can better capture the intricacies and nuances of disease-related language present in the medical documents, improving the overall performance of the named entity recognition task.






****************************************************************************************
****************************************************************************************




Answer to Question 2-1
a) 
Under a unigram model, every word is scored independently, so the probabilities of the words shared by both candidate sentences cancel out and each comparison reduces to \( p(\text{"their"}) \) versus \( p(\text{"there"}) \). The rule therefore returns the same word for both sentence (1) "He saw their football in the park" and sentence (2) "He saw there was a football" — whichever of "their"/"their" variants is more frequent in the corpus — and so cannot be correct for both sentences at once.

This solution may not be a good choice because the unigram model considers each word independently without taking into account the context or the sequence of words in the sentence. In this case, the decision is solely based on the frequency of individual words "there" and "their" in the corpus, which may not capture the correct prediction based on the sentence context.

b) 
The bigram model might be better than the unigram model as it considers the conditional probabilities of words given their preceding words. This allows for capturing some level of context in the language model, which can be beneficial in distinguishing between "there" and "their" based on the words surrounding them.

However, a bigram model might face problems in practice when dealing with sparse data or unseen word sequences. If the corpus does not contain certain bigrams, the model would assign zero probability to those sequences, leading to issues in prediction accuracy and generalization to new or rare word combinations. Additionally, the bigram model may still not capture long-distance dependencies or complex syntactic structures in the language, which could limit its effectiveness in certain cases. 
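A toy sketch of the unigram/bigram contrast, including the zero-probability problem for unseen bigrams (the corpus is made up):

```python
from collections import Counter

# Tiny invented corpus in which "their" and "there" are equally frequent,
# so a unigram rule cannot decide between them -- but the preceding word can.
corpus = ("he saw their football . he saw their dog . "
          "there was a football . there was a dog .").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    # Unsmoothed conditional probability p(w | prev); zero for unseen bigrams,
    # which is exactly the sparsity problem discussed above.
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

print(unigrams["their"], unigrams["there"])          # tied unigram counts
print(p_bigram("their", "saw"), p_bigram("there", "saw"))
```

Smoothing (e.g., add-one or Kneser-Ney) is the standard remedy for the unseen-bigram zeros.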

Figure paths: 
- Figure 1: figures/unigram_vs_bigram.png





****************************************************************************************
****************************************************************************************




Answer to Question 2-2
a. In Figure "figures/Mask_under_MLM.pdf", with a masking ratio of 20%, one in every five tokens — chosen uniformly at random — would be replaced with the [MASK] symbol, and the model would be trained to predict the original tokens at those positions.
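A minimal sketch of the masking step (the token list, mask symbol, and seed are arbitrary choices):

```python
import random

# MLM masking at a 20% ratio: randomly pick 20% of the positions
# and replace those tokens with a [MASK] symbol.
def mask_tokens(tokens, ratio=0.2, seed=0):
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = [("[MASK]" if i in positions else tok)
              for i, tok in enumerate(tokens)]
    return masked, sorted(positions)

tokens = "the patient was given a new dose of the drug".split()
masked, positions = mask_tokens(tokens)
print(masked, positions)
```

(BERT additionally keeps or randomizes a fraction of the selected positions instead of always inserting [MASK]; that refinement is omitted here.)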

b. Holding other conditions unchanged, MLM typically needs more iterations over the training data than CLM. In each pass, MLM receives a learning signal only from the masked positions (here 20% of the tokens), whereas CLM predicts — and therefore learns from — every token in the sequence.

c. In training, MLM does not require the input sequence to be shifted one position to the right as CLM does. The shift in CLM aligns each input position with its target, the *next* token; in MLM the target at each masked position is the original token at that *same* position, so inputs and targets are already aligned and no shift is needed.

d. Holding other conditions unchanged, PrefixLM often performs better than CLM because it applies full bidirectional attention within the prefix and uses causal attention only for the continuation. The prefix tokens can attend to each other in both directions, yielding richer representations of the conditioning context than CLM's strictly left-to-right attention, which can translate into better performance on conditional generation tasks.

Figure paths:
- "figures/Mask_under_MLM.pdf"
- "figures/Illustration_of_language_model_training.png"





****************************************************************************************
****************************************************************************************




Answer to Question 2-3
a. The contextual embeddings for the two occurrences of the word "left" in the sentence "I left my phone in my left pocket" will be the same in a BERT model without positional encoding. The self-attention mechanism relies on three components — query, key, and value — all computed as linear transformations of the input embeddings. Each output is a weighted average of the value vectors, with weights given by the softmaxed, scaled dot products between the token's query and all keys.

Without positional encoding, both occurrences of "left" have identical input embeddings and hence identical query vectors. They therefore produce identical attention distributions over the same set of value vectors and receive identical contextual embeddings: self-attention without positional information is permutation-equivariant and cannot distinguish tokens by position.
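This can be checked numerically with a single self-attention head and random weights (all values below are placeholders):

```python
import numpy as np

# Single-head self-attention without positional encoding: two identical
# input embeddings yield identical queries, hence identical outputs.
rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Embeddings for "I left my phone in my left pocket" (made up);
# positions 1 and 6 ("left") share the same embedding row.
X = rng.standard_normal((8, d))
X[6] = X[1]

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row softmax
out = A @ V

print(np.allclose(out[1], out[6]))  # True: identical contextual embeddings
```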

b. With a 1024-dimensional attention query and a 512-dimensional key, plain dot-product attention cannot be applied directly: the dot product is only defined for vectors of the same dimensionality. One of the two must first be mapped into the other's space, for example with a learned linear projection from 1024 to 512 dimensions — or, equivalently, by using a bilinear ("general") attention score \( q^\top W k \) with a \( 1024 \times 512 \) matrix \( W \). After such a projection, attending over the sequence proceeds as usual.

c. The positional encoding function in the Transformer model allows the model to differentiate between tokens based on their position in the sequence even without any trainable parameters for the encoding itself. This is achieved by adding fixed sine and cosine functions of geometrically spaced frequencies to the input embeddings, so that every position receives a distinct, deterministic pattern.

To have trainable positional encoding, one approach is to make the positional encoding vectors learnable by including them as part of the model's parameters. This can be done by adding an additional learnable positional encoding matrix that gets updated during the training process along with other parameters. By making positional encoding trainable, the model can adapt and learn the best positional representations specific to the task at hand.
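A sketch of the fixed sinusoidal encoding described above (the dimensions are chosen arbitrarily); a trainable variant would simply replace the returned matrix with a learned parameter matrix of the same shape, updated by backpropagation:

```python
import numpy as np

# Sinusoidal positional encoding in the style of "Attention Is All You
# Need": sine on even dimensions, cosine on odd dimensions.
def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dims
    pe[:, 1::2] = np.cos(angle)                  # odd dims
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # one distinct vector per position
```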






****************************************************************************************
****************************************************************************************




Answer to Question 2-4
a) False. Greedy decoding is less memory-intensive than beam search as it only keeps track of the current best hypothesis.
b) True. Different vocabularies would lead to inconsistencies in the generated output if directly ensembled during decoding.
c) True. Without normalizing by sequence length, shorter sequences will have higher probabilities due to having fewer terms in the product.
d) True. A higher value of k in top-k sampling increases randomness and therefore variability in the generated outputs.
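Item (c) can be illustrated with a small numeric sketch (the per-token probabilities are made up):

```python
import math

# Without length normalization, the product of token probabilities
# systematically favors shorter hypotheses; dividing the log-probability
# by the length removes this bias.
short = [0.5, 0.5]            # 2-token hypothesis
long_ = [0.7] * 5             # 5-token hypothesis with better per-token fit

def score(probs):
    return sum(math.log(p) for p in probs)      # raw log-probability

def normalized(probs):
    return score(probs) / len(probs)            # per-token log-probability

print(score(short), score(long_))               # raw: the short one wins
print(normalized(short), normalized(long_))     # normalized: the long one wins
```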






****************************************************************************************
****************************************************************************************




Answer to Question 2-5
In this case, BLEU would be more impacted by the different wording in the translations. BLEU scores a hypothesis by the precision of its surface n-grams against the reference. Since the reference translation "Was möchten Sie trinken?" matches System 1's output exactly, System 1 receives a perfect n-gram match, while System 2's equally correct but informal phrasing ("du" instead of "Sie") shares fewer n-grams with the reference and is penalized accordingly.

COMET, by contrast, is a learned metric that compares source, hypothesis, and reference through contextual embeddings, so it evaluates meaning rather than exact wording. Because both outputs are adequate translations that differ mainly in formality, COMET would assign them much closer scores.

Therefore, in this scenario, BLEU is more sensitive to the difference in wording than COMET.
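The surface-overlap effect can be sketched with a simple unigram precision (System 2's exact output is an assumption based on the "Sie"/"du" contrast; real BLEU also uses higher-order n-grams and a brevity penalty):

```python
# Unigram precision against the single reference: an exact-match output
# scores 1.0, while the informal rewording loses credit for every
# changed surface token, even though both translations are correct.
reference = "was möchten sie trinken ?".split()
system1 = "was möchten sie trinken ?".split()
system2 = "was möchtest du trinken ?".split()   # assumed informal variant

def unigram_precision(hyp, ref):
    ref_counts = {}
    for tok in ref:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    matched = 0
    for tok in hyp:                  # clipped matching, as in BLEU
        if ref_counts.get(tok, 0) > 0:
            matched += 1
            ref_counts[tok] -= 1
    return matched / len(hyp)

print(unigram_precision(system1, reference))
print(unigram_precision(system2, reference))
```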






****************************************************************************************
****************************************************************************************




Answer to Question 3-1
a) Ranking the approaches by the number of trained parameters in the task-specific adaptation stage (most to fewest):

1. (Promptless) Finetuning
2. In-Context Learning
3. Direct Prompting

Explanation:
- (Promptless) Finetuning updates most or all of the pre-trained model's parameters, so it trains by far the most.
- In-Context Learning trains no parameters at all: the demonstrations are simply placed in the input context and the model's weights stay frozen.
- Direct Prompting likewise trains no parameters; the model is used as-is with only an instruction in the prompt. In-Context Learning and Direct Prompting are therefore effectively tied at zero trained parameters.

b) Ranking the approaches by the amount of memory needed for inference (most to least):

1. In-Context Learning
2. Direct Prompting
3. (Promptless) Finetuning

Explanation:
- In-Context Learning requires the most memory because every demonstration must be kept in the context window, and attention memory grows with the (much longer) input sequence.
- Direct Prompting adds only a short instruction to the input, so its sequences are slightly longer than the bare input.
- A (promptless) finetuned model processes just the raw input with no added prompt, so it needs the least inference memory; the model weights themselves are the same size in all three cases.

c) For a specific task with 8 input-output pairs, I would choose In-Context Learning: all 8 pairs fit comfortably in the context window as demonstrations, no weights need to be updated, and finetuning on so few examples would risk severe overfitting. Direct Prompting would discard the labeled examples entirely, giving the model less task signal.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2
a. Each adapter consists of a down-projection from the hidden size \( d \) to 256 dimensions and an up-projection from 256 back to \( d \), with a parameter-free ReLU in between. Including biases, one adapter therefore has \( (d \times 256 + 256) + (256 \times d + d) = 2 \cdot 256\,d + 256 + d \) parameters. With one adapter after each of the 12 layers, the number of trained parameters is \( 12 \times (2 \cdot 256\,d + 256 + d) \); all pretrained weights remain frozen.

b. With prompt tuning, the only trained parameters are the continuous embeddings of the tokens reserved for the prompt. The count is simply (number of reserved prompt tokens) \( \times \) (embedding dimension); no other parameters are introduced or updated.
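The two counts can be made concrete under assumed sizes (a hidden size of 768 and 20 reserved prompt tokens are assumptions for illustration; the 12 layers and 256-dimensional bottleneck come from the question):

```python
# Parameter counting sketch. d = 768 and prompt_tokens = 20 are assumed
# values; substitute the sizes given in the actual question.
d, bottleneck, layers = 768, 256, 12

# (a) one adapter: down-projection (d*256 weights + 256 biases)
#     plus up-projection (256*d weights + d biases); ReLU has no parameters.
per_adapter = d * bottleneck + bottleneck + bottleneck * d + d
adapter_total = layers * per_adapter

# (b) prompt tuning: only the prompt-token embeddings are trained.
prompt_tokens = 20
prompt_total = prompt_tokens * d

print(adapter_total, prompt_total)
```

Under these assumptions prompt tuning trains orders of magnitude fewer parameters than the adapters, which is the comparison part c builds on.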

c. One possible explanation is that the trained parameter count is not the same as the memory footprint. The reserved prompt tokens lengthen every input sequence, and both the stored activations and the attention computation (which scales quadratically with sequence length) grow accordingly; in addition, gradients must be backpropagated through the entire frozen network to reach the prompt embeddings at the input layer. This extra activation memory can exceed what the adapter model needs, even though far fewer parameters are trained.

d. The main difference is where the trained vectors are inserted. Prompt tuning prepends trainable continuous ("soft prompt") embeddings only at the input layer, while prefix tuning inserts trainable prefix vectors into every transformer layer, typically as additional keys and values for attention.
   - An advantage of prompt tuning is its simplicity and much smaller parameter count: only (prompt length) × (embedding dimension) values are trained.
   - A disadvantage is that it is less expressive, since it can influence the model only through the input layer; prefix tuning's per-layer parameters tend to close the gap to full finetuning faster, especially for smaller models.






****************************************************************************************
****************************************************************************************




Answer to Question 3-3
a) To adapt the pretrained translation model to use information from the object detection model, we can concatenate the list of objects detected in the image to the input text sentence. For example, for the input sentence "Two people sitting next to a river," and the object detection output [PERSON, PERSON, RIVER, BOAT], we can create a new input that combines both sources of information, such as "Two people sitting next to a river [PERSON, PERSON, RIVER, BOAT]." This combined input is then fed into the adapted translation model. 

In cases where the object label is not in the vocabulary of the pretrained translation model, we can either map unknown object labels to a generic token like "UNKNOWN_OBJECT" or simply ignore them during the input concatenation process. By doing so, we ensure that the input maintains a consistent format for the translation model to process.

b) To analyze whether the model is effectively utilizing information from the object detection model, one approach could be to use attention mechanisms. By examining the attention weights produced during the translation process, we can observe which parts of the input (text + object labels) the model is focusing on. If the attention weights consistently highlight the object labels when generating corresponding parts of the translation, it indicates that the model is incorporating information from the object detection model.

c) When adapting the pretrained translation model to use the encoded image alongside text inputs, we can concatenate the encoded image vector to the text embedding before feeding it into the translation model. For example, if the encoded image is a 1024-dimensional vector and the text embedding is, say, 512-dimensional, we can concatenate these two vectors to create a combined input of 1536 dimensions. This combined input allows the translation model to jointly process information from both modalities.

In cases where the size of the encoded image does not match the embedding dimensions of the translation model, we can insert a learned linear projection that maps the image vector to the model's embedding dimension (or, conversely, maps the embeddings up to the image size). Unlike PCA, which can only reduce dimensionality, a linear layer can map to any target size and can be trained jointly with the rest of the model. This ensures compatibility between the dimensions of the encoded image and the embedding dimensions of the translation model for seamless joint processing.
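The dimension-matching step can be sketched as follows (the 1024/512 sizes come from the example above; the weights here are random stand-ins for a projection that would be trained jointly with the model):

```python
import numpy as np

# Learned linear projection mapping a 1024-dim image vector into a
# 512-dim text embedding space, then prepending it as an extra "token".
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 512)) * 0.02   # projection weights (trainable)
b = np.zeros(512)                             # projection bias

image_vec = rng.standard_normal(1024)         # image encoder output (placeholder)
projected = image_vec @ W + b                 # now matches the embedding size

text_embeddings = rng.standard_normal((7, 512))  # 7 input tokens (placeholder)
combined = np.vstack([projected[None, :], text_embeddings])
print(combined.shape)                         # image token + 7 text tokens
```

The same projection idea also covers the feature-concatenation variant: project first, then concatenate or prepend, whichever interface the translation model expects.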





****************************************************************************************
****************************************************************************************




Answer to Question 3-4
a. Retrieval-augmented generation (RAG) differs from traditional generation in that it incorporates a retrieval mechanism to bring relevant information from external sources into the generation process. This allows RAG to produce more contextually relevant and factually accurate outputs compared to traditional generation models, which rely solely on the internal knowledge encoded in their parameters. RAG could potentially improve the faithfulness of large language models because it can reduce the reliance on the model's internal knowledge and biases, incorporating real-world information from the retrieved sources. This can lead to more accurate and trustworthy outputs, as the model can verify and ground its generations in external knowledge.

b. I agree that hallucination in machine translation is easier to detect than in general-purpose text generation with large language models. Hallucination refers to generating content that is not supported by the input or context. In machine translation, the source sentence serves as an explicit ground truth: the output is expected to align closely with it, so any content with no counterpart in the source can be flagged, and quality-estimation metrics can exploit this alignment automatically. In open-ended generation there is often no reference at all; outputs can be long, diverse, and only loosely constrained by the prompt, so verifying whether a fluent claim is fabricated requires external knowledge rather than a simple comparison against the input.

c. During the training of large language models, long documents often get truncated due to memory limitations, leading to a lack of exposure to the full context of the document. This can potentially cause issues with model hallucination because the model may learn to fill in gaps or generate content based on partial information, leading to hallucinated or incorrect outputs. To mitigate this problem, researchers can explore techniques such as hierarchical processing of long documents, where the document is processed in chunks to ensure that the model captures the full context. Additionally, incorporating mechanisms for retaining important information or key phrases from truncated parts of the document can help the model generate more coherent and contextually accurate outputs.





****************************************************************************************
****************************************************************************************




