Answer to Question 1-1
Two challenges in modeling the perception of text are:

1. **Ambiguity**: 
   - Example: The word "bank" can refer to a financial institution or the side of a river.
  
2. **Subjectivity**:
   - Example: The interpretation of sentiment in text can vary depending on the reader's perspective and emotions.





****************************************************************************************
****************************************************************************************




Answer to Question 1-2
a. The assumption of the N-gram language model is that the probability of a word depends only on the previous n-1 words (where n is the order of the N-gram).

b. To calculate the probability of the sentence "This is the exam of Advanced AI." from a tri-gram language model, we factorize it into a product of tri-gram probabilities:
P(w_1, ..., w_N) = ∏_n P(w_n | w_(n-2), w_(n-1))

The first two contexts are padded with start-of-sentence tokens <s>, so the calculation multiplies the probability of each word given the two words that came before it:
P(This | <s>, <s>) · P(is | <s>, This) · P(the | This, is) · P(exam | is, the) · P(of | the, exam) · P(Advanced | exam, of) · P(AI | of, Advanced) · P(. | Advanced, AI)
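A minimal sketch of this factorization with maximum-likelihood tri-gram estimates; the lowercasing and the `<s>`/`</s>` padding tokens are assumptions for illustration, not given in the question:

```python
from collections import Counter

# Toy training corpus: the exam sentence itself, lowercased,
# with <s> tokens padding the first two contexts
corpus = "<s> <s> this is the exam of advanced ai </s>".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_trigram(w, u, v):
    # Maximum-likelihood estimate: P(w | u, v) = count(u, v, w) / count(u, v)
    return trigrams[(u, v, w)] / bigrams[(u, v)]

# P(sentence) = product over all positions of P(w_n | w_{n-2}, w_{n-1})
p = 1.0
for u, v, w in zip(corpus, corpus[1:], corpus[2:]):
    p *= p_trigram(w, u, v)
print(p)  # 1.0 — every tri-gram occurs exactly once in this one-sentence corpus
```

With a single training sentence every conditional probability is 1, so the sentence probability is 1; on a realistic corpus the counts would produce probabilities below 1.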





****************************************************************************************
****************************************************************************************




Answer to Question 1-3
a. Assuming preprocessing lowercases the text and removes punctuation, the corpus "I study in KIT. I like AI and NLP." becomes "i study in kit i like ai and nlp". The BPE vocabulary of size 15 is built as follows:

1. Initialize the vocabulary with all characters plus the end-of-word marker '</w>', with their frequencies:
   {'i': 6, 's': 1, 't': 2, 'u': 1, 'd': 2, 'y': 1, 'n': 3, 'k': 2, 'l': 2, 'e': 1, 'a': 2, 'p': 1, '</w>': 9}
   This gives 13 base symbols, so two merges are needed to reach a vocabulary size of 15.

2. Iteration 1: Merge the most frequent adjacent pair of tokens. The pair ('i', '</w>') occurs 3 times (twice in "i", once in "ai"), so it is merged into the new subword token 'i</w>'.

3. Iteration 2: All remaining adjacent pairs now occur exactly once, so any of them may be merged; for example ('s', 't') -> 'st'.

The resulting BPE vocabulary with a size of 15 is:
['i', 's', 't', 'u', 'd', 'y', 'n', 'k', 'l', 'e', 'a', 'p', '</w>', 'i</w>', 'st']

b. Tokenizing the sentence "I like KIT." (preprocessed to "i like kit") using the generated BPE vocabulary:
"i like kit" -> ["i</w>", "l", "i", "k", "e", "</w>", "k", "i", "t", "</w>"]
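The merge procedure above can be sketched as follows; the lowercasing/punctuation-stripping preprocessing is an assumption, and the `</w>` end-of-word marker is the usual BPE convention:

```python
import collections

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a list of preprocessed words."""
    # Represent each word as a tuple of symbols ending with the marker </w>
    vocab = collections.Counter(tuple(w) + ('</w>',) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = collections.Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary
        merged = collections.Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges

corpus = "i study in kit i like ai and nlp".split()
print(learn_bpe(corpus, 2)[0])  # ('i', '</w>') — this pair occurs 3 times
```

The second merge is a tie among pairs of frequency 1, so its outcome depends on the tie-breaking rule.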





****************************************************************************************
****************************************************************************************




Answer to Question 1-4
a. The label sequence for the sentence using BILOU labeling approach would look like this: 

["O", "O", "O", "B-University", "L-University", "O", "O", "B-Course", "L-Course", "U-Lab", "O", "B-Lab", "L-Lab", "B-Lab", "L-Lab"]

b. With the three entity types University, Course, and Lab, the sequence labeling model needs 13 output classes: the shared "O" (Outside) plus, for each type, "B-" (Beginning), "I-" (Inside), "L-" (Last), and "U-" (Unit, a single-token entity) — i.e. "B-University", "I-University", "L-University", "U-University", and likewise for Course and Lab.
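In BILOU, each entity type contributes four labels (B-, I-, L-, and U-, where U marks a single-token entity) plus one shared "O", so three types give 3 × 4 + 1 = 13 output classes. A quick enumeration:

```python
entity_types = ["University", "Course", "Lab"]
# One shared "O" plus B/I/L/U for each entity type
labels = ["O"] + [f"{prefix}-{t}" for t in entity_types for prefix in "BILU"]
print(len(labels))  # 13
```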





****************************************************************************************
****************************************************************************************




Answer to Question 2-1
a) For the center word "is" with a window size of 2, exemplary training samples are:
CBOW: Input: [Human, smarter, large, model] (the context words), Output: is
Skip-gram: Input: is, Output: one context word per training pair, i.e. (is, Human), (is, smarter), (is, large), (is, model)
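Generating such training samples for both models can be sketched as follows; the sentence below is illustrative, not the one from the question:

```python
def training_samples(tokens, window=2):
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        # Context: up to `window` words on each side of the center word
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, center))               # CBOW: context -> center word
        skipgram += [(center, c) for c in context]   # Skip-gram: one pair per context word
    return cbow, skipgram

cbow, sg = training_samples("the human is smarter than the model".split())
print(cbow[2])  # (['the', 'human', 'smarter', 'than'], 'is')
```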

b) The challenge faced by the Skip-gram model in implementation is the high computational cost of the output softmax, which must be normalized over the entire vocabulary for every training sample. One solution is hierarchical softmax, which arranges the vocabulary as a binary tree so that computing a word's probability requires only O(log |V|) operations instead of |V|. Another is negative sampling, which replaces the full softmax with a small number of binary classifications against randomly sampled negative words.

Figure Path: N/A





****************************************************************************************
****************************************************************************************




Answer to Question 2-2
a. The problem with replacing the encoder with word embeddings and directly using the decoder in a neural machine translation system is that the encoder part of the Transformer model plays a crucial role in capturing the semantic and contextual information of the input sentence. By only using word embeddings, important syntactic and positional information may be lost, leading to inaccurate translations.

b. Example of two sentences where one will definitely be translated incorrectly (assuming the word embeddings carry no positional information):

Sentence 1: "The dog bit the man."
Sentence 2: "The man bit the dog."

Both sentences consist of exactly the same words, so without the encoder they are presented to the decoder as the same set of word embeddings. The system cannot distinguish them and must produce the same translation for both, which means at least one of the two translations is necessarily wrong.

Figure path: N/A





****************************************************************************************
****************************************************************************************




Answer to Question 2-3
a. To encourage learning contextualized representations, wav2vec 2.0 masks spans of the latent speech representations Z produced by the feature encoder before they enter the Transformer context network. The model is then trained with a contrastive task: for each masked time step, it must identify the true quantized representation Q of that step among a set of distractors sampled from other masked positions, based on the context representation C at that position.

In pre-training, this strategy is effective because the masked latent itself is hidden from the Transformer, so the contrastive task can only be solved by inferring the masked content from the surrounding, unmasked frames. This forces the context network to build representations C that integrate information across time — i.e., contextualized representations — which is what makes the pre-trained model useful for downstream fine-tuning, e.g., speech recognition with a CTC loss.

b. Besides the contrastive loss, the objective of pre-training wav2vec 2.0 includes a diversity loss. The diversity loss encourages equal use of all entries in the quantization codebooks by maximizing the entropy of the averaged softmax distribution over codebook entries.

The diversity loss is essential because the contrastive loss alone would allow the model to collapse onto a few codebook entries, which would make the quantized targets uninformative and the contrastive task trivially easy. By pushing probability mass across many codewords, the diversity loss keeps the discrete targets Q expressive, so the contrastive objective remains meaningful and the learned representations capture fine-grained distinctions in the speech signal.

Figure path: imgs/wav2vec2 illustration.png





****************************************************************************************
****************************************************************************************




Answer to Question 3-1
I would disagree with my friend. Generating a text description from an image is an autoregressive task: the model emits the sentence one token at a time, and at each step the future tokens do not yet exist, so there is no right-hand context for a bidirectional model to condition on. A bidirectional model is suited to encoding a complete, already-given sequence, not to generating one. The appropriate design is therefore a unidirectional (causal) decoder for the text, conditioned on the image features; bidirectionality is useful on the encoder side (e.g., over the image representation), but not for the generation itself.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2
To address the issue of handling out-of-vocabulary words in a machine translation model without additional training resources, one potential approach is to use subword tokenization techniques such as Byte Pair Encoding (BPE) or WordPiece. These techniques segment words into subword units, allowing the model to handle unseen words by breaking them down into smaller components that are likely to be in the vocabulary.

One potential problem of using subword tokenization for handling out-of-vocabulary words is that rare or unseen words get split into many short subword units, which makes the token sequences longer. Longer sequences increase computational cost during training and inference (quadratically so for attention) and raise memory requirements. Moreover, the subword segments of an unseen word may not correspond to meaningful morphemes, so the model can still struggle to compose a correct translation from them.







****************************************************************************************
****************************************************************************************




Answer to Question 3-3
a. In self-attention, the concept of multi-head refers to splitting the self-attention mechanism into multiple heads or sets. Each head has its own set of weights for query, key, and value transformations. By having multiple heads, the model can jointly attend to information from different representation subspaces at different positions. This allows the model to learn more complex patterns and relationships within the sequence, improving its ability to capture long-range dependencies and enhance performance on tasks.

b. To illustrate the masking of weights in the self-attention weight matrix on the solution sheet:
- First, identify the cells in the matrix that correspond to positions where the padding tokens are located.
- Next, mark those cells with 'X' to indicate that the weights in those positions should be masked out, as they do not contribute to the actual content of the sequence.

Figure Path: N/A





****************************************************************************************
****************************************************************************************




Answer to Question 3-4
a) To fill in the confusion matrix with True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN), you would typically look at the predicted and actual classes for the data points in your dataset. TP represents the cases where the model correctly predicted the positive class, FP represents the cases where the model incorrectly predicted the positive class, TN represents the cases where the model correctly predicted the negative class, and FN represents the cases where the model incorrectly predicted the negative class.

b) Precision and Recall are two important metrics in evaluating classification models. The equations for precision and recall are as follows:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
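As a small sketch, both metrics computed from hypothetical confusion-matrix counts:

```python
def precision_recall(tp, fp, tn, fn):
    precision = tp / (tp + fp)  # of all predicted positives, how many are correct
    recall = tp / (tp + fn)     # of all actual positives, how many are found
    return precision, recall

# Hypothetical confusion-matrix counts
p, r = precision_recall(tp=8, fp=2, tn=85, fn=5)
print(p)  # 0.8
print(r)  # 8/13 ≈ 0.615
```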

c) Relying solely on precision or recall for evaluation introduces bias in the assessment of the model performance.

- Bias of using only precision:
If you only consider precision, it may lead to a bias towards models that prioritize minimizing false positives. This could be problematic in scenarios where false negatives are more critical than false positives. For example, in a medical diagnosis task, if the model focuses only on precision, it may classify a serious condition as not present (false negative) to maintain a high precision score, which can be harmful to the patients.

- Bias of using only recall:
On the other hand, if you only consider recall, it may lead to a bias towards models that prioritize capturing all positive instances, even at the cost of higher false positives. This could be problematic in scenarios where false positives are more costly. For instance, in a spam email detection system, if the model focuses solely on recall, it may classify many legitimate emails as spam (false positive) to increase the recall score, which could annoy users and lead to important emails being missed.





****************************************************************************************
****************************************************************************************




Answer to Question 4-1
To determine the continuous convolution $f(t) = (g*h)(t) = \int_{-\infty}^{\infty} g(\tau)\,h(t-\tau)\,d\tau$ of the two continuous functions $g(t)$ and $h(t)$ graphically, we need to follow the steps of convolution:

1. Rewrite both functions in the integration variable $\tau$: $g(\tau)$ and $h(\tau)$.
2. Invert one of the functions (flip it horizontally). Let's flip $h(\tau)$ to get $h(-\tau)$.
3. Shift the inverted function by $t$ to obtain $h(t-\tau)$.
4. Multiply the two functions pointwise in $\tau$: $g(\tau)\,h(t-\tau)$.
5. Integrate the product over $\tau$; the resulting area is the value $f(t)$. Sliding $h(t-\tau)$ across $g(\tau)$ for all $t$ traces out the whole convolution.

Applying these steps to the functions in the provided figure, I mark the important points of $f(t)$: breakpoints occur at the values of $t$ where the shifted function $h(t-\tau)$ starts or stops overlapping the support of $g(\tau)$, and the value of $f(t)$ at each $t$ is the area under the product $g(\tau)\,h(t-\tau)$.

The path to the figure is "imgs/graph.png".





****************************************************************************************
****************************************************************************************




Answer to Question 4-2
To determine the discrete convolution \( u * v \) of the two given discrete functions \( u[t] \) and \( v[t] \), we need to perform the following steps:

1. Write down the functions \( u[t] \) and \( v[t] \) with their respective values.

Given:
\( u[t] = \{1, 3, 0.5, 1, 0.5, 0\} \) where \( t = \{0, 1, 2, 3, 4, 5\} \)
\( v[t] = \{0, 1, 2, 0, 3, 0\} \) where \( t = \{0, 1, 2, 3, 4, 5\} \)

2. Reverse the function \( v[t] \) to get \( v[-t] \).

\( v[-t] = \{0, 3, 0, 2, 1, 0\} \) (the values of \( v[t] \) in reverse order)

3. Slide \( v[-t] \) across \( u[t] \), multiplying the overlapping samples and summing at each shift:

\( (u * v)[t] = \sum_{k} u[k]\, v[t-k] \)

For \( t = 0 \):
\( u[0]v[0] = 1 \cdot 0 = 0 \)

For \( t = 1 \):
\( u[0]v[1] + u[1]v[0] = 1 \cdot 1 + 3 \cdot 0 = 1 \)

For \( t = 2 \):
\( u[0]v[2] + u[1]v[1] + u[2]v[0] = 1 \cdot 2 + 3 \cdot 1 + 0.5 \cdot 0 = 5 \)

For \( t = 3 \):
\( u[0]v[3] + u[1]v[2] + u[2]v[1] + u[3]v[0] = 0 + 6 + 0.5 + 0 = 6.5 \)

For \( t = 4 \):
\( u[0]v[4] + u[1]v[3] + u[2]v[2] + u[3]v[1] + u[4]v[0] = 3 + 0 + 1 + 1 + 0 = 5 \)

For \( t = 5 \):
\( u[0]v[5] + u[1]v[4] + u[2]v[3] + u[3]v[2] + u[4]v[1] + u[5]v[0] = 0 + 9 + 0 + 2 + 0.5 + 0 = 11.5 \)

The remaining shifts \( t = 6, \dots, 10 \), where \( v[-t] \) slides past the end of \( u[t] \), give \( 2.5, 3, 1.5, 0, 0 \).

Therefore, the full discrete convolution \( u * v \) is:
\( \{0, 1, 5, 6.5, 5, 11.5, 2.5, 3, 1.5, 0, 0\} \)
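The convolution sum can also be evaluated directly in code (a quick sketch):

```python
def conv(u, v):
    # Full discrete convolution: (u*v)[t] = sum_k u[k] * v[t-k]
    n = len(u) + len(v) - 1
    return [sum(u[k] * v[t - k]
                for k in range(len(u)) if 0 <= t - k < len(v))
            for t in range(n)]

u = [1, 3, 0.5, 1, 0.5, 0]
v = [0, 1, 2, 0, 3, 0]
print(conv(u, v))  # [0, 1, 5, 6.5, 5, 11.5, 2.5, 3, 1.5, 0, 0]
```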

Path:
- Figures not provided in the question.





****************************************************************************************
****************************************************************************************




Answer to Question 4-3 
a. The sampling theorem states that in order to accurately reconstruct a continuous signal from its samples, the sampling rate must be at least twice the highest frequency present in the signal. This is also known as the Nyquist-Shannon sampling theorem.

b. When the sampling theorem is not fulfilled, aliasing occurs. Aliasing is a phenomenon where high-frequency components of the signal get aliased or folded back into lower frequencies, causing distortion in the reconstructed signal.

c. To sketch this phenomenon, draw a function f(t) in the time domain with a frequency higher than half of the sampling rate. The function should have oscillations that are not captured properly due to undersampling. Label the original function f(t) and indicate the sampled points at a rate lower than the Nyquist rate. Describe in the sketch how the high-frequency components get folded back into lower frequencies, leading to distortion in the reconstructed signal.
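The folding effect can also be demonstrated numerically (the frequencies below are illustrative): a 7 Hz sine sampled at 10 Hz violates the Nyquist condition (10 < 2 · 7) and yields exactly the same samples as a 3 Hz sine of opposite sign, i.e., the 7 Hz component is folded back to 3 Hz:

```python
import math

fs = 10.0      # sampling rate in Hz — too low for a 7 Hz signal
n = range(20)  # 20 sample indices

# Samples of the 7 Hz signal and of its alias (folded back: 7 - 10 = -3 Hz)
high = [math.sin(2 * math.pi * 7 * k / fs) for k in n]
alias = [-math.sin(2 * math.pi * 3 * k / fs) for k in n]

# The two sample sequences are numerically identical
print(all(abs(a - b) < 1e-9 for a, b in zip(high, alias)))  # True
```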

Figure path: ./sampling_theorem_sketch.png





****************************************************************************************
****************************************************************************************




Answer to Question 4-4
To calculate the accuracy ACC, we first need to calculate the Word Error Rate (WER) between the reference (REF) and the hypothesis (HYP).

REF: I need to book a flight to New York for next week
HYP: I need to cook light in Newark four next weeks

Now, we calculate the WER as the minimum number of word-level editing operations (substitutions, deletions, insertions) required to change the hypothesis into the reference, divided by the number of reference words.

One minimal alignment:
1. Substitution of "book" to "cook"
2. Deletion of "a" (missing in HYP)
3. Substitution of "flight" to "light"
4. Substitution of "to" to "in"
5. Substitution of "New" to "Newark"
6. Deletion of "York" (missing in HYP)
7. Substitution of "for" to "four"
8. Substitution of "week" to "weeks"

Total operations = 6 substitutions + 2 deletions = 8

The total number of words in the reference is 12.

WER = 8 / 12 ≈ 0.6667

Accuracy ACC = 1 - WER = 1 - 0.6667 ≈ 0.3333

Accuracy ACC as a percentage ≈ 33.33%

Therefore, the accuracy of the hypothesis compared to the reference is about 33.33%.
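The WER can be computed with the standard word-level edit-distance dynamic program (a sketch):

```python
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn h[:j] into r[:i]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i reference words missing -> i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j extra hypothesis words -> j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # match / substitution
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(r)][len(h)] / len(r)

ref = "I need to book a flight to New York for next week"
hyp = "I need to cook light in Newark four next weeks"
print(round(wer(ref, hyp), 4))  # 8 edits / 12 reference words ≈ 0.6667
```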

Figure: None





****************************************************************************************
****************************************************************************************




Answer to Question 5-1
One image segmentation method that can be used to detect each object instance in the scene is Mask R-CNN (Mask Region-based Convolutional Neural Network). 

Mask R-CNN works by extending Faster R-CNN, a popular object detection model, to also predict segmentation masks for each instance. It operates in two stages:  
1. **Region Proposal Network (RPN):** This stage proposes regions in the image that may contain an object along with bounding box coordinates.
2. **Mask prediction:** For each Region of Interest (RoI) proposed by the RPN, a second stage performs classification and bounding-box refinement and, in a parallel branch, predicts a binary segmentation mask for the detected instance.

By leveraging these two stages, Mask R-CNN is able to not only detect objects in an image but also accurately segment them, making it suitable for detecting object instances in a scene from RGB-D videos.





****************************************************************************************
****************************************************************************************




Answer to Question 5-2
A perturbation force term is needed in the DMP formulation for the following reasons:
1. **To account for external disturbances**: The perturbation force term helps the robot adapt to unforeseen external forces or disturbances that may affect the pouring action. By incorporating this term, the learned motion can be adjusted in real-time to maintain the desired pouring trajectory despite external influences.

2. **To improve robustness**: Including a perturbation force term enhances the robustness of the learned pouring action. It allows the robot to better handle variations in the environment or unexpected conditions without deviating significantly from the desired pouring trajectory.

3. **To handle uncertainties**: The perturbation force term provides flexibility in the DMP formulation, allowing the robot to deal with uncertainties in the pouring task. It helps in maintaining the smoothness and accuracy of the pouring action even when there are uncertainties or variations in the demonstration data.

4. **To achieve more natural motion**: By incorporating a perturbation force term, the learned pouring action can exhibit more natural and human-like movement. The perturbation force term adds subtle variations to the motion, making it look more realistic and similar to how a human would perform the pouring action.

In summary, the perturbation force term in the DMP formulation is essential for adapting to disturbances, improving robustness, handling uncertainties, and achieving natural and human-like motion in the learned pouring action.

Figure: N/A





****************************************************************************************
****************************************************************************************




Answer to Question 5-3
Equation of Locally Weighted Regression (LWR) with Radial Basis Functions (RBF), in its normalized form:
\[ f(x) = \frac{\sum_{i=1}^{n} \psi_i(x)\, y_i}{\sum_{i=1}^{n} \psi_i(x)}, \qquad \psi_i(x) = \exp\!\left(-h_i\,(x - c_i)^2\right) \]

Where:
1. \( f(x) \) is the output value to be predicted at input point \( x \).
2. \( \psi_i(x) \) is the Radial Basis Function centered at \( c_i \) with width parameter \( h_i \); the normalized ratio \( \psi_i(x) / \sum_j \psi_j(x) \) acts as the local weight of training example \( i \).
3. \( y_i \) are the target values associated with the training examples.
4. \( n \) is the total number of basis functions (training examples).

In this case, the input point \( x \) represents the features extracted from the RGB-D videos of the human demonstrations, and the target values \( y_i \) are the perturbation force terms associated with each demonstration. The RBF weighting gives higher influence to training examples whose centers \( c_i \) lie closer to the input point \( x \), so the prediction is a locally weighted average of nearby demonstrations.
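A minimal sketch of this normalized RBF prediction; the centers, width parameter, and data values below are hypothetical:

```python
import math

def rbf_predict(x, centers, targets, h=1.0):
    # psi_i(x) = exp(-h * (x - c_i)^2); prediction is the normalized weighted average
    psi = [math.exp(-h * (x - c) ** 2) for c in centers]
    s = sum(psi)
    return sum(p / s * y for p, y in zip(psi, targets))

# Hypothetical demonstration data: feature value -> perturbation-force sample
centers = [0.0, 1.0, 2.0, 3.0]
targets = [0.0, 0.5, 0.4, 0.1]

# The prediction at a query point is a weighted average of nearby targets
print(rbf_predict(1.0, centers, targets))
```

With a narrow kernel (large `h`), the prediction at a training point approaches that point's own target; with a wide kernel it blends all demonstrations.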





****************************************************************************************
****************************************************************************************




Answer to Question 5-4
Yes, a Dynamic Movement Primitive (DMP) can be learned from five human demonstrations for a specific motion. DMPs are a type of trajectory representation method used in robotics and motor control that can generalize and learn from multiple demonstrations of a task. By observing the demonstrations provided in the RGB-D videos, the robot can extract the essential features and characteristics of the pouring water action and use them to learn a DMP for that specific motion. The DMP will then allow the robot to reproduce the pouring water action with appropriate adaptation and generalization capabilities based on the learned demonstrations. 

In summary, DMPs are suitable for learning from multiple demonstrations and can be effectively applied to learn the pouring water action from the five human demonstrations in this scenario.





****************************************************************************************
****************************************************************************************




Answer to Question 5-5
I would choose a Gaussian Mixture Model (GMM) as the movement primitive to model the demonstrated pouring action. 

GMM can capture the multi-modal nature of the human demonstrations, which is important in this scenario because different humans may have slightly different ways of pouring water. By using GMM, the robot can learn a distribution of pouring actions from the demonstrations, allowing for more flexible and adaptive reproduction of the pouring action. 

Additionally, GMM can also be used to plan the via-point that the robot needs to pass through in order to avoid the obstacle. By introducing a via-point that is far away from the distribution of demonstrated trajectories, the robot can learn to detour around the obstacle while still effectively pouring water. 

Therefore, GMM is a suitable choice for modeling and reproducing the pouring action while avoiding obstacles in this scenario.





****************************************************************************************
****************************************************************************************




Answer to Question 5-6
Cognitivist cognitive architectures focus on symbolic representations and manipulation of information, often based on rules or algorithms. These architectures aim to simulate human cognition by using a set of rules to process and manipulate information in a step-by-step manner. Examples of cognitivist architectures include Soar and ACT-R.

On the other hand, emergent cognitive architectures model cognition as the emergence of complex behavior from the interactions of simpler processes. Instead of relying on explicit rules and symbols, emergent architectures emphasize the interactions of simpler computational units to give rise to cognitive abilities. Examples of emergent architectures include neural networks and connectionist models.

A hybrid cognitive architecture combines aspects of both cognitivist and emergent approaches. It leverages the strengths of symbolic processing and the emergent properties of connectionist systems to create a more comprehensive and flexible model of human cognition.

Therefore, in summary, cognitivist architectures rely on symbolic representations and rules, emergent architectures focus on interactions and emergence of behavior, and hybrid architectures combine elements from both paradigms to create a more comprehensive cognitive model.





****************************************************************************************
****************************************************************************************




Answer to Question 5-7
a) A forgetting mechanism given by $\alpha_i(t)$ is a time-based decay method. 

In the equation provided, $\alpha_i(t)$ represents the activation level of item $i$ in the robot's memory at time $t$. The parameter $\beta_i$ can be seen as the baseline strength of the memory item $i$, affecting how quickly the activation level decays over time. The parameter $d$ corresponds to the variance of the normal distribution used in the decay process, influencing how fast the forgetting occurs.

b) 
Given the scenario described:

At $t = 1$: $i_1$ is received.

At $t = 2$: $i_2$ is received.

At $t = 3$: $i_3$ is received, and $i_1$ and $i_2$ are recalled.

The equations for calculating $\alpha_{i_1}$, $\alpha_{i_2}$, and $\alpha_{i_3}$ at $t=3$ can be provided considering the recall and decay processes after each time step.

For $i_1$:
$\alpha_{i_1}(3) = \beta_{i_1} \cdot [r_{i_1,1} + r_{i_1,3} \cdot \mathcal{N}(\mu = 3, \sigma^2 = d)(3)]$

For $i_2$:
$\alpha_{i_2}(3) = \beta_{i_2} \cdot [r_{i_2,2} \cdot \mathcal{N}(\mu = 2, \sigma^2 = d)(3)]$

For $i_3$:
$\alpha_{i_3}(3) = \beta_{i_3} \cdot r_{i_3,3}$

Ordering the data activation according to their magnitude involves calculating the activation levels using the equations above and comparing the resulting values for $\alpha_{i_1}$, $\alpha_{i_2}$, and $\alpha_{i_3}$.





****************************************************************************************
****************************************************************************************




