Answer to Question 1-1


a) False. One-hot word representations are binary vectors that indicate the presence or absence of a word in a given context. They cannot capture semantic relationships between words, such as synonymy.

b) True. German has a more complex morphology than English, with grammatical case, gender agreement, and richer verb and noun inflection.

c) True. Syntax deals with the arrangement of words in a sentence, while semantics deals with the meaning of words and sentences.

d) False. Word2Vec is trained on local context windows, not the global word occurrence matrix.

e) True. Less frequent words tend to be split into more and smaller subword units by BPE, while frequent words are usually kept as single tokens, making rare words more subword-based than word-based in BPE segmentation.

f) True. CRFs (Conditional Random Fields) allow for more complex feature dependencies than HMMs (Hidden Markov Models), making it easier to integrate new features into the model.





****************************************************************************************
****************************************************************************************




Answer to Question 1-2


Reason 1:
Dense word embeddings capture semantic and syntactic relationships between words more effectively than sparse features. Dense embeddings represent words as low-dimensional vectors in a continuous vector space, where distances and directions can encode subtle nuances and complex relationships between words. In contrast, sparse features (such as one-hot vectors) only record the presence or absence of a particular feature, so every pair of distinct words looks equally unrelated.

Reason 2:
Dense word embeddings enable efficient handling of large vocabularies and support analogy and similarity-based tasks. With sparse features, the dimensionality grows with the vocabulary size, leading to a large and unwieldy feature matrix. With dense embeddings, semantically similar words lie close to each other in the embedding space, which keeps the representation compact and makes tasks such as word analogies and similarity-based retrieval straightforward.

Therefore, the answers to the question and subquestions are:

Question: Give two reasons why dense word embeddings are preferred over sparse features in NLP.
Answer:
1. Dense word embeddings capture semantic and syntactic relationships between words more effectively than sparse features due to their ability to represent words as low-dimensional, dense vectors in a continuous vector space where geometric closeness reflects relatedness.
2. Dense word embeddings enable efficient handling of large vocabularies and the ability to perform analogies and similarity-based tasks by allowing words that are semantically similar to be represented by vectors that are close to each other in the embedding space.





****************************************************************************************
****************************************************************************************




Answer to Question 1-3


To create product representations using ideas similar to learning word representations, we can apply techniques such as Singular Value Decomposition (SVD) or Matrix Factorization on the co-purchase matrix. These methods aim to find latent factors that explain the relationships between products.

a. To apply SVD to the co-purchase matrix, we can first normalize it, for example by subtracting row and column means or by re-weighting the raw counts, as is commonly done for word co-occurrence matrices. We then compute the decomposition $C = U \Sigma V^T$. For a product-by-product co-purchase matrix, the rows of $U$ (scaled by the top singular values in $\Sigma$) give dense latent representations of the products, analogous to SVD-based word embeddings obtained from a word co-occurrence matrix. Truncating to the top $k$ singular values yields compact $k$-dimensional product vectors.

b. To recommend similar products to a user who has shown interest in one product, we compare the product representations directly. Using cosine similarity, we compute the cosine of the angle between the representation of the product of interest and every other product representation, and recommend the products with the highest similarity scores. (Jaccard similarity could instead be applied to the raw co-purchase sets, but cosine similarity is the natural choice for the dense SVD representations.)
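The two steps above can be sketched in a few lines of NumPy. The product names and co-purchase counts below are invented toy data, and the normalization step is omitted for brevity:

```python
import numpy as np

# Toy co-purchase matrix: entry (i, j) counts how often products i and j were
# bought together.  Product names and counts are invented for illustration.
products = ["laptop", "mouse", "keyboard", "apple", "banana"]
C = np.array([
    [0, 8, 7, 0, 1],
    [8, 0, 6, 1, 0],
    [7, 6, 0, 0, 1],
    [0, 1, 0, 0, 9],
    [1, 0, 1, 9, 0],
], dtype=float)

# SVD of the co-purchase matrix (normalization omitted for brevity).
U, S, Vt = np.linalg.svd(C)

# Keep the top-k latent factors as dense product representations.
k = 2
embeddings = U[:, :k] * S[:k]

def most_similar(idx, embeddings):
    """Rank all other products by cosine similarity to product `idx`."""
    v = embeddings[idx]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v)
    sims = embeddings @ v / np.maximum(norms, 1e-12)
    return [i for i in np.argsort(-sims) if i != idx]

ranked = most_similar(products.index("mouse"), embeddings)
```

On this toy matrix, the nearest neighbours of "mouse" are the other electronics products, since they share a strong co-purchase cluster.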





****************************************************************************************
****************************************************************************************




Answer to Question 1-4


Answer:

a) One property of CNNs that is beneficial for the spam detection task, which is not the case with RNNs, is their ability to extract local features from the input. CNNs can learn to identify specific patterns or features in the input, such as certain combinations of words or phrases, that are indicative of spam. This is particularly useful for longer input sequences, where RNNs may struggle to capture the relevant context.

b) A CNN-based model for spam detection could be implemented as follows:

- Input: The input to the model would be a sequence of tokenized words from an email. Each word would be represented as a dense vector obtained from a pre-trained word embedding model, such as Word2Vec or GloVe. The sequence of word vectors would be padded to a fixed length and arranged as a 3D tensor with dimensions [batch_size, sequence_length, embedding_dim].
- Convolutional layer: The first layer would be a 1D convolutional layer with, for example, 100 filters of width 3 (or another odd number) and a stride of 1. Each filter spans the full embedding dimension and slides along the sequence dimension, producing a feature map of shape [batch_size, sequence_length - filter_size + 1, num_filters].
- Max pooling layer: After the convolution (and a non-linearity such as ReLU), global max pooling over the sequence dimension keeps the strongest activation of each filter, reducing the feature map to [batch_size, num_filters].
- Fully connected layer: The pooled vector of shape [batch_size, num_filters] would then be passed through a fully connected layer (followed by a sigmoid) to obtain the final output.
- Output: The output of the model would be a binary classification result indicating whether the email is spam or not.
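The shape bookkeeping of these layers can be verified with a small NumPy sketch; the dimensions below are arbitrary illustrative choices, and the data is random:

```python
import numpy as np

rng = np.random.default_rng(0)

batch, seq_len, emb_dim = 4, 10, 8      # illustrative sizes
num_filters, width = 6, 3               # 6 filters of width 3

x = rng.normal(size=(batch, seq_len, emb_dim))      # embedded emails
W = rng.normal(size=(num_filters, width, emb_dim))  # convolution filters
b = np.zeros(num_filters)

# "Valid" 1D convolution along the sequence dimension.
out_len = seq_len - width + 1
conv = np.empty((batch, out_len, num_filters))
for t in range(out_len):
    window = x[:, t:t + width, :]                    # [batch, width, emb_dim]
    conv[:, t, :] = np.tensordot(window, W, axes=([1, 2], [1, 2])) + b

relu = np.maximum(conv, 0.0)
pooled = relu.max(axis=1)   # global max pooling over time -> [batch, num_filters]
```

Each filter produces one activation per window position; the global max pool then keeps only the strongest match of each filter, regardless of where in the email it occurred.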

c) When evaluating the model, Tom could consider using precision and recall instead of accuracy. These metrics are more informative for imbalanced datasets, where the majority of examples belong to the negative class (non-spam emails) and a trivial "always non-spam" classifier already achieves high accuracy. Precision measures the proportion of true positive predictions among all positive predictions and thus penalizes false positives (legitimate emails flagged as spam), while recall measures the proportion of true positives among all actual positive examples and penalizes false negatives (spam that slips through). Reporting both, or their harmonic mean (F1), gives a much clearer picture of the model's ability to identify spam than accuracy does.
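As a concrete illustration of why accuracy misleads on imbalanced data, consider this tiny hand-made example (the labels are hypothetical):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (spam = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 2 spam emails among 10.  A classifier that always says
# "not spam" would reach 80% accuracy while catching zero spam.
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]

p, r = precision_recall(y_true, y_pred)
baseline_accuracy = sum(1 for t in y_true if t == 0) / len(y_true)
```

Here the shown classifier catches one of two spam emails with one false alarm (precision and recall both 0.5), while the trivial all-negative baseline still scores 80% accuracy.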





****************************************************************************************
****************************************************************************************




Answer to Question 1-5


Answer:

a) Input: The input to the model would be a sequence of tokens extracted from the medical documents. Each token can be a single word or a sequence of words, depending on the tokenization method used. For instance, if we use word-level tokenization, then the input would be a sequence of words. If we use character-level tokenization, then the input would be a sequence of characters.

Intermediate operations: Each token is first represented as a vector using pretrained word embeddings. Since Tom has asked for a model that is not RNN-based, the tagging itself can be done, for example, with a Conditional Random Field (CRF) over these features, or with a CNN that classifies each token from a fixed window of surrounding embeddings. A linear-chain CRF scores each candidate tag sequence using feature functions over the current token's representation and the transition between adjacent tags. An HMM alternative would model tag-transition probabilities (possibly second-order, i.e. trigram transitions over tags) together with emission probabilities, but a CRF integrates the embedding features more naturally.

Output: The output of the model would be a sequence of tags, one per input token, where each tag marks whether the token is part of a named entity and of which type (commonly in BIO format). For example, for the input sequence "diabetes mellitus type 2", the output sequence would be ["B-DISEASE", "I-DISEASE", "I-DISEASE", "I-DISEASE"].
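A small helper, written under the assumption of BIO-formatted tags, shows how such a tag sequence is turned back into entity spans:

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) spans."""
    spans, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == etype:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        spans.append((" ".join(current), etype))
    return spans

tokens = ["diabetes", "mellitus", "type", "2", "was", "diagnosed"]
tags = ["B-DISEASE", "I-DISEASE", "I-DISEASE", "I-DISEASE", "O", "O"]
spans = bio_to_spans(tokens, tags)
```

For the example above this recovers the single entity ("diabetes mellitus type 2", "DISEASE").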

b) Challenge: One challenge of using pretrained word embeddings, such as GloVe, is that they may not capture the context-specific meaning of medical terms. For instance, the word "diabetes" may have a different meaning in a medical document compared to its meaning in a general text corpus. To address this challenge, we can fine-tune the pretrained word embeddings on a medical text corpus. This would allow the model to learn the medical-specific meaning of each word. Fine-tuning can be done using techniques such as transfer learning or multi-task learning. In transfer learning, we would initialize the word embeddings with the pretrained values and then fine-tune them on the medical text corpus. In multi-task learning, we would train the model to recognize named entities and also to predict other medical concepts, such as symptoms or treatments, using the same set of word embeddings. This would help the model learn the medical-specific meaning of each word in the context of the medical documents.





****************************************************************************************
****************************************************************************************




Answer to Question 2-1


Answer:

a) For sentence (1): "He saw their football in the park"
The probability of this sentence under the unigram model is calculated as the product of probabilities of each word in the sentence.

p("He saw their football in the park") = p("He") * p("saw") * p("their") * p("football") * p("in") * p("the") * p("park")

Since we know that count("their") > count("there"), the unigram model will predict "their" for this sentence.

For sentence (2), the correct spelling is "there": "He saw there was a football"
p("He saw there was a football") = p("He") * p("saw") * p("there") * p("was") * p("a") * p("football")

Since count("their") > count("there"), the unigram model will nevertheless assign the higher probability to the variant with "their" and predict "their" here as well, even though "there" is correct in this sentence.

This is not a good solution because the unigram model does not take the context of the words into account. It scores each word independently of its neighbours, so whichever of "their"/"there" is more frequent in the training corpus is preferred in every sentence, and the model can never choose the less frequent variant where the context demands it.

b) For sentence (1): "He saw their football in the park"
The probability of this sentence under the bigram model is the product of the conditional probabilities of each word given its predecessor:

p("He saw their football in the park") = p("He" | &lt;s&gt;) * p("saw" | "He") * p("their" | "saw") * p("football" | "their") * p("in" | "football") * p("the" | "in") * p("park" | "the")

The bigram model might be better than the unigram model because it takes into account the context of the words in the sentence. It considers the probability of a word based on the previous word in the sentence, which can help in correctly predicting the spelling of words like "there" and "their".

However, the bigram model has problems in practice. Bigram counts are much sparser than unigram counts: many valid word pairs never occur in the training corpus and would receive zero probability, so a larger corpus and smoothing techniques are required to obtain reliable estimates. Additionally, a bigram model only conditions on the immediately preceding word and cannot capture longer-range dependencies in the sentence, which limits its accuracy in some cases.
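The unigram/bigram contrast can be made concrete with a toy corpus. The sentences below are invented so that "their" is more frequent overall while "there" is the only word attested after "saw":

```python
from collections import Counter

# Invented corpus: "their" is more frequent overall, but after "saw" only
# "there" occurs.  (Joining sentences this way creates a few spurious
# cross-sentence bigrams, which is harmless for this illustration.)
corpus = [
    "he saw there was a dog",
    "she saw there was a ball",
    "they lost their football",
    "their team won",
    "their coach smiled",
]
tokens = " ".join(corpus).split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# Unigram model: pick the more frequent word, ignoring context.
unigram_pick = max(["their", "there"], key=lambda w: unigrams[w])

# Bigram model: pick the word that is more likely after the previous word.
bigram_pick = max(["their", "there"], key=lambda w: bigrams[("saw", w)])
```

The unigram model picks "their" in every sentence, while the bigram model correctly prefers "there" after "saw".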





****************************************************************************************
****************************************************************************************




Answer to Question 2-2


Answer:

a) Under masked language modeling (MLM) with a masking ratio of 20%, the masks would be applied to approximately 20% of the tokens in a given input sequence. In Figure "figures/Mask_under_MLM.pdf", the masks would be applied to the tokens marked with an 'X'. These tokens would be masked, and the model would be trained to predict their correct values based on the context of the surrounding tokens.
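Concretely, selecting the mask positions for one training sequence might look like this (the sentence is hypothetical; with 10 tokens and a 20% ratio, 2 positions are masked):

```python
import random

random.seed(1)

# Hypothetical 10-token input; with a 20% masking ratio, 2 positions are masked.
tokens = "the patient was given insulin after the diagnosis was confirmed".split()
ratio = 0.2
n_mask = max(1, round(ratio * len(tokens)))
positions = set(random.sample(range(len(tokens)), n_mask))

masked = [("[MASK]" if i in positions else tok) for i, tok in enumerate(tokens)]
```

The model is then trained to predict the original tokens at the masked positions from the unmasked context.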

b) MLM generally requires more iterations over the training data than CLM. In MLM, only the masked tokens (here 20% of each sequence) contribute to the training loss in a given pass, so the model receives a learning signal for only a small fraction of the tokens it reads. In CLM, every position is trained to predict the next token, so a single pass over the data provides a loss term for every token. MLM therefore needs more passes over the data to see a comparable number of prediction targets.

c) MLM does not require the input sequences to be shifted one position to the right as in CLM because in MLM the model predicts each masked token at its own position from the surrounding (unmasked) context, so inputs and targets are aligned position by position. In CLM, the target at position t is the token at position t+1, which is implemented by shifting the input sequence one position relative to the targets.

d) PrefixLM often performs better than CLM on conditional tasks because it uses bidirectional (full) attention within the prefix, i.e. the input part of the sequence, and causal attention only over the generated continuation. The model can therefore condition every prefix token's representation on the entire input, while CLM restricts every position to attend only to the past, which weakens the representation of the input context. The illustration in Figure "figures/Illustration_of_language_model_training.png" shows this difference in the attention patterns of the two objectives.





****************************************************************************************
****************************************************************************************




Answer to Question 2-3


Answer:

a) No, the contextual embeddings for the two "left" will not be the same. Their token embeddings are identical, but they are combined with different positional encodings, and the self-attention mechanism in BERT lets each occurrence aggregate information from its own context: attention scores are computed from the dot products of query and key vectors, and since the two "left" tokens sit at different positions with different surrounding words, they attend to (and are attended by) different parts of the sentence. Consequently, they end up with distinct contextual embeddings, which is exactly what allows the model to disambiguate the verb "left" from the direction "left".

b) Yes. Dot-product attention computes the attention scores as $QK^T$, which only requires the last dimension of the query and key matrices to match. With queries of shape $B \times T \times d_q$ (batch size $B$, sequence length $T$, query dimension $d_q$) and keys of shape $B \times T \times d_k$, the product is defined whenever $d_q = d_k$; since $d_q = d_k = 1024$ here, dot-product attention can be applied directly (the value dimension only affects the output size).

c) The positional encoding enables the model to treat different positions differently, despite having no trainable parameters, by adding fixed sinusoidal components to the input embeddings: each position receives a unique pattern of sine and cosine values whose frequencies vary across the feature dimensions, as defined in Eq. (\ref{eq:posEncoding}). Because every position gets a distinct, deterministic signature, subsequent layers can learn position-dependent behaviour from these signals even though the encoding itself is fixed. To make the positional encoding trainable, one can replace the fixed sinusoids with a learned position-embedding table containing one trainable vector per position (as in BERT); the drawbacks are the extra parameters and the inability to generalize to sequence lengths beyond those seen in training.
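The fixed encoding itself is only a few lines of NumPy, assuming the standard sinusoidal formulation from "Attention Is All You Need" with an even model dimension:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Fixed sinusoidal positional encoding (d_model assumed even)."""
    pos = np.arange(max_len)[:, None]                 # [max_len, 1]
    i = np.arange(d_model // 2)[None, :]              # [1, d_model // 2]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even feature dims
    pe[:, 1::2] = np.cos(angles)                      # odd feature dims
    return pe

pe = positional_encoding(max_len=50, d_model=16)
```

Every row (position) is a distinct vector, which is what lets the otherwise permutation-invariant attention layers distinguish positions.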





****************************************************************************************
****************************************************************************************




Answer to Question 2-4


Answer:

a) False: Greedy decoding is generally less memory-intensive than beam search since it only keeps the best hypothesis at each step, while beam search keeps multiple hypotheses.
b) True: Text generation models with different vocabularies cannot be directly ensembled during decoding because they generate sequences using different symbols.
c) True: If we do not normalize the sentence probability by sequence length during decoding, shorter sequences will be preferred, because every additional token multiplies in a probability of at most 1, so longer hypotheses accumulate lower total probability.
d) True: With top-k sampling, a higher value of k leads to higher variability in the generated output because more diverse options are considered at each step.
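For statement d), the effect of k can be seen directly in a small sampling sketch (toy logits, seeded random generator):

```python
import numpy as np

def top_k_sample(logits, k, rng):
    """Sample a token id from the k highest-scoring logits."""
    logits = np.asarray(logits, dtype=float)
    top = np.argsort(-logits)[:k]                 # ids of the k best tokens
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

rng = np.random.default_rng(0)
logits = [3.0, 1.0, 0.5, -2.0, -5.0]   # toy next-token scores

seen_k1 = {top_k_sample(logits, k=1, rng=rng) for _ in range(20)}
seen_k3 = {top_k_sample(logits, k=3, rng=rng) for _ in range(200)}
```

With k=1 the sampler degenerates to greedy decoding (always the argmax token), while k=3 spreads the samples over several plausible tokens.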





****************************************************************************************
****************************************************************************************




Answer to Question 2-5


Answer:

1. The example provided in the question is an English-to-German translation. The source sentence is "What would you like to drink?". Two different systems, System 1 and System 2, produced the following translations:
   - System 1: Was möchten Sie trinken?
   - System 2: Was möchtest du trinken?

2. The reference translation in German is also provided, which is "Was möchten Sie trinken?".

3. Both System 1's and System 2's outputs are grammatically correct and semantically equivalent to the reference translation (they differ only in the formal "Sie" versus the informal "du"). However, System 1's output matches the reference translation exactly, while System 2's differs from it in wording.

4. Regarding the impact of different wording on evaluation metrics like BLEU and COMET, both metrics are designed to measure the similarity between the system output and the reference translation. However, they do so in different ways.

5. BLEU (Bilingual Evaluation Understudy) measures the overlap of n-grams between the system output and the reference translation. It does not take the meaning of the words into account, only their surface form. In the example above, System 1 matches the reference exactly and receives a perfect score, while System 2 is penalized for "möchtest" and "du" even though its translation is equally correct. BLEU is therefore strongly impacted by a different but valid wording.

6. COMET (Crosslingual Optimized Metric for Evaluation of Translation), on the other hand, is a learned metric: it uses a pretrained multilingual encoder and is trained to predict human quality judgments from the source, the system output, and the reference. Because it compares the sentences in a semantic embedding space rather than by surface overlap, it can recognize that "Was möchtest du trinken?" and "Was möchten Sie trinken?" convey the same meaning (differing only in formality) and would score the two outputs much more similarly.

7. In conclusion, BLEU is more impacted by a different wording like in the example above, because it only counts n-gram overlap with the reference. COMET, which evaluates semantic similarity with a learned model, is far more robust to legitimate paraphrases.
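The contrast can be illustrated with BLEU's modified unigram precision on the example (a toy computation only; the full BLEU score additionally uses higher-order n-grams and a brevity penalty):

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """BLEU-style modified unigram precision (toy version, single reference)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum(min(count, ref[word]) for word, count in hyp.items())
    return overlap / sum(hyp.values())

reference = "Was möchten Sie trinken ?"
system1 = "Was möchten Sie trinken ?"
system2 = "Was möchtest du trinken ?"

p1 = unigram_precision(system1, reference)  # exact surface match
p2 = unigram_precision(system2, reference)  # penalized despite equal meaning
```

System 1 scores a perfect 1.0, while System 2 drops to 0.6 (3 of its 5 tokens match the reference), even though both translations are correct.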





****************************************************************************************
****************************************************************************************




Answer to Question 3-1


Answer:

a) The number of trained parameters in the task-specific adaptation stage for each approach is as follows:
- Direct Prompting: No additional parameters are trained for this approach. The language model remains unchanged.
- (Promptless) Finetuning: In this approach, all parameters of the pretrained model are updated on the task data, so the number of trained parameters equals the full parameter count of the language model, regardless of the size of the finetuning dataset or the learning rate.
- In-Context Learning: No additional parameters are trained for this approach. The language model remains unchanged.

b) The memory requirements for inference (decoding) for each approach are as follows:
- Direct Prompting: The full language model, plus activations for the input prompt and the output.
- (Promptless) Finetuning: A full task-specific copy of the model, plus activations for the input and output; the model is the same size as the original, but a separate copy must be stored per task.
- In-Context Learning: The full language model, plus activations for a longer input, since the demonstration examples are prepended to the prompt; this increases activation and attention (KV-cache) memory at inference time.

All three approaches therefore require roughly the same model-sized memory footprint at inference time; finetuning multiplies the storage cost across tasks, while in-context learning pays extra activation memory for the longer context.

c) For a specific task with only 8 input-output pairs, I would choose In-Context Learning: all 8 pairs fit comfortably in the prompt as demonstrations, no training is required, and the base model remains unchanged and reusable for other tasks. Finetuning on only 8 examples would risk severe overfitting, and direct prompting would ignore the available examples entirely.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2


Answer:

a) Each adapter consists of two linear projections with a ReLU in between: a down-projection from the embedding dimension 1024 to the bottleneck dimension 256, and an up-projection from 256 back to 1024. A linear projection from dimension i to dimension j has i * j weight parameters (ignoring biases), so each adapter has 1024 * 256 + 256 * 1024 = 524288 parameters. With one adapter after each of the 12 layers, the total number of parameters trained in the finetuning stage is 12 * 524288 = 6291456 (about 6.3M).

b) In prompt tuning, only the embeddings of the 50 reserved prompt tokens are trained; all other model parameters stay frozen. With an embedding dimension of 1024, this gives 50 * 1024 = 51200 trainable parameters in total, far fewer than with the adapters.
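The counts from parts a) and b) can be double-checked with a short calculation (bias terms ignored, as above):

```python
embed_dim, bottleneck, num_layers = 1024, 256, 12

# a) Adapters: a down- and an up-projection per layer (bias terms ignored).
adapter_params = num_layers * (embed_dim * bottleneck + bottleneck * embed_dim)

# b) Prompt tuning: only the 50 prompt-token embeddings are trained.
prompt_tokens = 50
prompt_params = prompt_tokens * embed_dim
```

This gives 6,291,456 trainable parameters for the adapters versus 51,200 for prompt tuning.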

c) The model with prompt tuning can run out of memory despite having fewer trainable parameters because the trainable-parameter count is not the same as memory consumption. The 50 prompt tokens lengthen every input sequence, which increases the activations that must be stored at all 12 layers and the cost of self-attention, which grows quadratically with sequence length. Adapters add parameters but do not lengthen the sequence, so their activation overhead during training is much smaller.

d) The main difference is where the trainable vectors are inserted: in prompt tuning, trainable embeddings are prepended only at the input layer, while in prefix tuning, trainable key/value vectors are prepended to the attention inputs at every layer of the model. An advantage of prompt tuning is that it trains fewer parameters and is simpler to implement. A disadvantage is that it is less expressive: since it can only steer the model through the input layer, it tends to need larger base models to match the performance of prefix tuning, which can influence the computation at every layer at the cost of more parameters.





****************************************************************************************
****************************************************************************************




Answer to Question 3-3


To adapt the pretrained model to use information from the object detection model in a pipeline approach, we can concatenate the object labels output by the object detection model with the input sentence and feed it as a single input to the pretrained machine translation model. The model input would be a concatenated string of the form "[CLS] sentence [SEP] object1_label [SEP] object2_label ..." where "[CLS]" is a special token indicating the start of the input sequence, "sentence" is the original input sentence, and "object1_label" and "object2_label" are the labels of the first and second objects detected in the image, respectively. The model output would be the translated sentence.

If an object label is not in the vocabulary of the pretrained translation model, we can handle it in several ways. If the model uses subword segmentation (e.g. BPE), the label can simply be split into known subword units. Otherwise, we can replace it with a special out-of-vocabulary (OOV/UNK) token, which still signals that an object is present but without its name, or drop the label entirely, in which case the model falls back to translating from the text alone.

To analyze whether the model makes use of information from the object detection model in the pipeline approach, we can compare the translations produced by the model when given the same input sentence but different object labels. If the translations differ significantly, it suggests that the model is making use of the object detection information.

To adapt the pretrained translation model to additionally use the encoded image in a parallel approach, we can feed the encoded image vector into the encoder alongside the text: the image vector is treated as one extra input position and prepended to the sequence of text token embeddings, so the model input becomes the sequence [image_embedding, text_embedding_1, ..., text_embedding_T]. The encoder's self-attention then lets every text position attend to the image representation. The model output would be the translated sentence.

If the size of the encoded image does not match the embedding dimension of the translation model, the standard solution is to insert a trainable linear projection layer that maps the image encoding to the embedding dimension; this layer can be learned during finetuning. Cruder alternatives, such as truncating or zero-padding the image vector to the right length, would also make the dimensions match, but they either discard information or waste capacity, so a learned projection is preferable.
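A minimal NumPy sketch of the parallel approach's projection step, assuming made-up dimensions of 2048 for the image encoder and 512 for the translation model's embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: image encoder output 2048-d, translation model embeddings 512-d.
img_dim, embed_dim, seq_len = 2048, 512, 7

image_vec = rng.normal(size=(img_dim,))             # encoded image
token_embs = rng.normal(size=(seq_len, embed_dim))  # embedded source tokens

# Trainable linear projection into the translation model's embedding space.
W_proj = rng.normal(size=(img_dim, embed_dim)) * 0.01
b_proj = np.zeros(embed_dim)
image_token = image_vec @ W_proj + b_proj

# Prepend the projected image as one extra "token" of the encoder input.
encoder_input = np.vstack([image_token[None, :], token_embs])
```

In a real system, W_proj and b_proj would be learned jointly with (or on top of) the frozen translation model during finetuning.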





****************************************************************************************
****************************************************************************************




Answer to Question 3-4


Answer:

a) Retrieval-augmented generation (RAG) and traditional generation differ in the way they generate responses. In traditional generation, the model generates responses based solely on its internal knowledge and parameters. In contrast, RAG uses an external knowledge base to retrieve relevant information and incorporates it into the generated response. RAG could potentially improve the faithfulness of large language models by reducing the need for the model to hallucinate information. By providing the model with accurate and relevant information from the knowledge base, RAG can help ensure that the generated response is grounded in factual information.

b) I agree that hallucination in machine translation is easier to detect than in general-purpose text generation with large language models. In machine translation, the source sentence provides an explicit ground truth: any content in the output that is not supported by the source can be flagged as hallucination, and the limited scope of the task plus the availability of parallel data make such inconsistencies comparatively easy to identify. In open-ended text generation, there is no single reference to compare against; the context can be complex and ambiguous, and large language models can produce fluent, plausible-sounding responses that are factually incorrect, making hallucinations much harder to detect.

c) During the training of large language models, long documents often get truncated to the model's maximum context length. This can contribute to hallucination because the model learns to continue text without access to the full context of the document, and at inference time it may likewise be conditioned on incomplete context. To mitigate this, documents can be split into overlapping chunks so that less information is lost at the boundaries, longer-context architectures or efficient attention variants can be used to raise the maximum length, or retrieval (as in part a) can supply the missing context at generation time.





****************************************************************************************
****************************************************************************************




