Answer to Question 1-1
Here are two challenges in modelling the perception of text, along with an example for each:

1. Ambiguity and multiple interpretations
Example: The sentence "I saw a man on a hill with a telescope" can be interpreted in different ways. It could mean the man on the hill had a telescope, or that I used a telescope to see the man on the hill. Modelling how humans resolve such ambiguities based on context and world knowledge is challenging.

2. Figurative language and non-literal meanings  
Example: The phrase "it's raining cats and dogs" should not be interpreted literally as animals falling from the sky. It is an idiom that figuratively means it is raining very heavily. Modelling the human ability to understand metaphors, idioms, sarcasm, and other non-literal language is difficult, as it requires going beyond the surface meaning of the words.





****************************************************************************************
****************************************************************************************




Answer to Question 1-2
a. The assumption of the N-gram language model is that the probability of a word depends only on the previous N-1 words. In other words, it assumes that the probability of a word given all the previous words can be approximated by the probability of the word given only the previous N-1 words.

b. The probability equation of the sentence "This is the exam of Advanced AI." from a tri-gram language model is:

P(This is the exam of Advanced AI.) = P(This | <s> <s>) * P(is | <s> This) * P(the | This is) * P(exam | is the) * P(of | the exam) * P(Advanced | exam of) * P(AI. | of Advanced) * P(</s> | Advanced AI.)

Where <s> represents the start of the sentence, and </s> represents the end of the sentence.
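As a sanity check, this factorization can be computed from maximum-likelihood tri-gram counts. The sketch below builds the counts from a toy corpus consisting of the sentence itself, so every conditional probability is 1 and the sentence probability comes out as 1.0:

```python
from collections import Counter

def trigram_sentence_prob(tokens, tri_counts, bi_counts):
    """P(sentence) under an MLE tri-gram model with <s> <s> padding."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    prob = 1.0
    for i in range(2, len(padded)):
        history = (padded[i - 2], padded[i - 1])
        tri = tri_counts[history + (padded[i],)]
        if bi_counts[history] == 0:
            return 0.0
        prob *= tri / bi_counts[history]
    return prob

# Toy corpus: the sentence itself, seen once.
sent = "This is the exam of Advanced AI.".split()
padded = ["<s>", "<s>"] + sent + ["</s>"]
tri_counts = Counter(tuple(padded[i:i + 3]) for i in range(len(padded) - 2))
bi_counts = Counter(tuple(padded[i:i + 2]) for i in range(len(padded) - 1))
print(trigram_sentence_prob(sent, tri_counts, bi_counts))  # 1.0
```

With counts estimated from a larger corpus, the same function returns the product of the eight conditional probabilities listed above.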





****************************************************************************************
****************************************************************************************




Answer to Question 1-3

a. Building a BPE vocabulary with a size of 15 from the given sentences:
Preprocessing: "i study in kit i like ai and nlp"

Step 1: Split each word into characters and append the end-of-word marker /w.
Segmented words (with frequencies): i /w (x2), s t u d y /w, i n /w, k i t /w, l i k e /w, a i /w, a n d /w, n l p /w
Initial vocabulary (13 symbols): {'i', 's', 't', 'u', 'd', 'y', 'n', 'k', 'l', 'e', 'a', 'p', '/w'}

Step 2: The most frequent pair is ('i', '/w') with frequency 3 (from "i" twice and from "ai"). Merge into 'i/w'.
Vocabulary (14 symbols): {'i', 's', 't', 'u', 'd', 'y', 'n', 'k', 'l', 'e', 'a', 'p', '/w', 'i/w'}

Step 3: All remaining pairs now occur exactly once. Breaking the tie by choosing ('n', 'd') (from "and"), merge into 'nd'.
Final vocabulary (size 15): {'i', 's', 't', 'u', 'd', 'y', 'n', 'k', 'l', 'e', 'a', 'p', '/w', 'i/w', 'nd'}

The vocabulary has reached the target size of 15, so the merging stops here.

b. Tokenizing the sentence "I like KIT." using the generated BPE vocabulary:
"I like KIT." -> "i like kit" (after preprocessing)
"i" -> i/w; "like" -> l i k e /w (the merge ('i', '/w') does not apply, since 'i' is not followed by '/w'); "kit" -> k i t /w
Tokenized: "i/w l i k e /w k i t /w"
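The construction can be sketched in code. This is a minimal BPE sketch; ties between frequency-1 pairs are broken by first occurrence, so the second merge may differ from a hand derivation, but that does not affect the tokenization of "i like kit":

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges; each word is a tuple of symbols ending in '/w'."""
    words = Counter(tuple(w) + ("/w",) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair (ties: first seen)
        merges.append(best)
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a single word."""
    seq = list(word) + ["/w"]
    for a, b in merges:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq

merges = learn_bpe("i study in kit i like ai and nlp", num_merges=2)
print(merges[0])                 # ('i', '/w') -- the most frequent pair
print(tokenize("kit", merges))   # ['k', 'i', 't', '/w']
```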





****************************************************************************************
****************************************************************************************




Answer to Question 1-4
a. The label sequence for the sentence using the BILOU labeling approach would look like this:

O O O O B-University I-University I-University L-University O O B-Course I-Course L-Course O U-Lab O U-Lab O O U-Lab O

Under BILOU the last token of a multi-token entity is labeled L- rather than I-, and single-token entities are labeled U- (Unit) rather than B-.

b. In this case, the sequence labeling model would have 13 output classes:

1. O (Outside)
2. B-University
3. I-University
4. L-University
5. U-University
6. B-Course
7. I-Course
8. L-Course
9. U-Course
10. B-Lab
11. I-Lab
12. L-Lab
13. U-Lab

Even though there are only 3 named entity types (University, Course, and Lab), the BILOU labeling scheme requires separate Beginning, Inside, Last, and Unit labels for each entity type, resulting in 3 x 4 + 1 = 13 output classes (including the Outside class).
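The mapping from entity spans to BILOU labels can be sketched in a few lines; the example sentence and span below are hypothetical:

```python
def bilou_labels(n_tokens, entities):
    """entities: list of (start, end_exclusive, type) spans over the tokens."""
    labels = ["O"] * n_tokens
    for start, end, etype in entities:
        if end - start == 1:
            labels[start] = f"U-{etype}"          # single-token entity
        else:
            labels[start] = f"B-{etype}"          # first token
            for i in range(start + 1, end - 1):
                labels[i] = f"I-{etype}"          # middle tokens
            labels[end - 1] = f"L-{etype}"        # last token
    return labels

# Hypothetical toy sentence: "I study at Karlsruhe Institute of Technology"
print(bilou_labels(7, [(3, 7, "University")]))
# ['O', 'O', 'O', 'B-University', 'I-University', 'I-University', 'L-University']
```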





****************************************************************************************
****************************************************************************************




Answer to Question 2-1
a. From the sentence "Human is smarter than large language model", interpreting the window size of 2 as the total number of context words (one word on each side of the target), we can derive the following training samples. Words at the sentence boundaries have truncated contexts but still produce samples:

CBOW:
- Input: [is], Output: Human
- Input: [Human, smarter], Output: is
- Input: [is, than], Output: smarter
- Input: [smarter, large], Output: than
- Input: [than, language], Output: large
- Input: [large, model], Output: language
- Input: [language], Output: model

Skip-gram:
- Input: Human, Output: [is]
- Input: is, Output: [Human, smarter]
- Input: smarter, Output: [is, than]
- Input: than, Output: [smarter, large]
- Input: large, Output: [than, language]
- Input: language, Output: [large, model]
- Input: model, Output: [language]

b. The Skip-gram model faces a significant challenge in implementation due to the large output layer size. For each input word, the model needs to predict the probability of all words in the vocabulary being in the context window. This becomes computationally expensive, especially for large vocabularies.

A solution to this challenge is using negative sampling. Instead of predicting the probability of all words in the vocabulary, negative sampling selects a small number of "negative" words that are not in the context window and trains the model to distinguish between the true context words and these negative samples.

For example, let's consider the input word "smarter" from the given sentence. The true context words are "is" and "than". Negative sampling might select "book" and "car" as negative samples. The model is then trained to predict high probabilities for "is" and "than", and low probabilities for "book" and "car". This approach significantly reduces the computational cost while still allowing the model to learn meaningful word embeddings.
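The pair generation and negative sampling can be sketched as follows. For self-containedness the negatives are drawn from the sentence's own vocabulary; in practice they are sampled from the full corpus vocabulary (words like "book" and "car" above):

```python
import random

def skipgram_pairs(tokens, window=1):
    """(center, context) training pairs with `window` words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def negative_samples(vocab, context, k, rng):
    """Draw k 'negative' words that are not true context words."""
    candidates = sorted(w for w in vocab if w not in context)
    return rng.sample(candidates, k)

tokens = "Human is smarter than large language model".split()
pairs = skipgram_pairs(tokens)
negs = negative_samples(set(tokens), context={"is", "than"}, k=2,
                        rng=random.Random(0))
print(pairs[:2])  # [('Human', 'is'), ('is', 'Human')]
```

The model is then trained to score the true (center, context) pairs high and the (center, negative) pairs low, which replaces the full-vocabulary softmax with a handful of binary decisions.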





****************************************************************************************
****************************************************************************************




Answer to Question 2-2

a. The main problem with this model is that it loses the ability to capture context and long-range dependencies in the input sequence. By replacing the encoder with just word embeddings, the model can no longer learn relationships and interactions between the input words. The encoder's self-attention mechanism is crucial for understanding the context and building meaningful representations of the input. Without the encoder, the decoder will only have access to individual word embeddings without any contextual information, leading to poor translation quality.

b. Here's an example of two sentences where one will definitely be translated incorrectly by this model:

Sentence 1: "The bank is closed on Sundays."
Sentence 2: "We sat on the river bank to have a picnic."

In these sentences, the word "bank" has different meanings based on the context. In Sentence 1, "bank" refers to a financial institution, while in Sentence 2, it refers to the land alongside a river. Without the encoder to capture the context, the model with only word embeddings will likely translate "bank" incorrectly in one of the sentences. It might translate "bank" as a financial institution in both cases, leading to an incorrect translation for Sentence 2.





****************************************************************************************
****************************************************************************************




Answer to Question 2-3
a. The strategy implemented on the feature encoder outputs in wav2vec2.0 is masking. Portions of the feature encoder outputs Z are randomly masked before being fed into the context encoder. This forces the context encoder to learn contextualized representations that can fill in the missing information. The contrastive loss then measures the similarity between these contextualized representations C and the quantized representations Q of the original unmasked Z. By trying to match C with Q, the model learns to infer the masked information from surrounding context.

b. Besides the contrastive loss, the other loss in the pre-training objective of wav2vec2.0 is the diversity loss. The diversity loss encourages the quantized representations Q to utilize the codebook efficiently by maximizing the entropy of the averaged softmax distribution over the codebook entries. In other words, it promotes using the codebook entries in a diverse manner rather than collapsing to only a few representations. This diversity loss is necessary to prevent the trivial solution where the quantization module always outputs the same single code entry, which would not be useful for learning meaningful representations.





****************************************************************************************
****************************************************************************************




Answer to Question 3-1
I would respectfully disagree with my friend's suggestion of using a Bidirectional model as the decoder for generating text descriptions from image representations. Here's why:

In the task of generating text descriptions for images, we want the decoder to generate the text sequentially, word by word, based on the image representation from the encoder and the words generated so far in the sequence. This is a unidirectional process, where the generation of the next word depends only on the previously generated words and the image representation.

A Bidirectional model, such as a Bidirectional LSTM or a Bidirectional Transformer, considers both the past and future context when processing each word in the sequence. While this is beneficial for tasks like text classification or named entity recognition, where the entire input sequence is available, it is not suitable for text generation.

In text generation, we don't have access to the future words that are yet to be generated. We can only condition the generation on the words generated so far. Using a Bidirectional model as the decoder would introduce a mismatch between training and inference, as during inference, we can't provide the future context since the words are generated sequentially.

Therefore, a Unidirectional model, such as a Unidirectional LSTM or a Unidirectional Transformer (e.g., GPT), is more appropriate for the decoder in this task. It can generate the text sequentially, conditioning each word on the previously generated words and the image representation, without relying on the future context.

In summary, while Bidirectional models are powerful for tasks that have access to the entire input sequence, a Unidirectional model is more suitable as the decoder for generating text descriptions from image representations, as it aligns with the sequential nature of text generation.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2
To handle out-of-vocabulary (OOV) words in the Encoder-Decoder model for machine translation, given that there are no additional training resources, one approach is to use subword tokenization techniques. Here's how it can be done:

1. Subword tokenization:
Instead of using a fixed vocabulary of whole words, the input and output sentences can be tokenized into smaller units called subwords. Common subword tokenization methods include:
a. Byte-Pair Encoding (BPE): BPE iteratively merges the most frequent pair of characters or character sequences to form subword units.
b. WordPiece: WordPiece is similar to BPE but uses a language model to determine the likelihood of subword units.
c. Unigram Language Model: This method uses a probabilistic model to determine the optimal subword segmentation based on the likelihood of each subword.

By using subword tokenization, the model can handle OOV words by breaking them down into smaller, more frequent subword units that are present in the vocabulary. During inference, the subword units generated by the decoder can be concatenated to form the final output words, even if they were not seen during training.

2. Potential problem:
One potential problem with using subword tokenization to handle OOV words is that it may introduce ambiguity in the generated translations. Since OOV words are broken down into subword units, the model may generate subwords that, when combined, result in words that are different from the intended meaning. This can happen especially for rare or domain-specific words that have unique spellings or morphological structures.

For example, if an OOV word is split into subwords that are shared with other more common words, the model might generate a translation that uses those common words instead of the intended rare word. This can lead to a loss of precision and coherence in the translated output.

To mitigate this issue, it may be necessary to fine-tune the subword tokenization algorithm or incorporate additional techniques such as copying mechanisms or post-processing steps to handle OOV words more effectively. However, these approaches may require additional training resources or domain-specific knowledge.





****************************************************************************************
****************************************************************************************




Answer to Question 3-3
a. Multi-head in self-attention refers to using multiple attention mechanisms in parallel. Each head attends to different aspects or representations of the input. This allows the model to jointly attend to information from different representation subspaces at different positions. Multi-head attention plays an important role because it expands the model's ability to focus on different parts of the input sequence, enabling it to capture more diverse and fine-grained dependencies between elements.

b. In the provided figure representing the self-attention weight matrix in a decoder, the weights that should be masked out (indicated with 'X') are:
- The entire row corresponding to "BoS" on the vertical axis
- The upper triangular part of the matrix excluding the main diagonal (i.e. the cells at positions B-A, C-A, C-B, D-A, D-B, D-C)

This masking pattern prevents the decoder from attending to future positions which have not been generated yet at each decoding step, preserving the auto-regressive property.
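The masking can be illustrated with a small NumPy sketch, assuming (as is conventional) that queries index the rows and keys the columns; the figure in the question may arrange the axes differently:

```python
import numpy as np

tokens = ["BoS", "A", "B", "C", "D"]
n = len(tokens)
# True marks weights that are masked out (set to -inf before the softmax),
# so each position attends only to itself and to earlier positions.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly upper triangle
print(mask.astype(int))
```

For 5 positions this masks n(n-1)/2 = 10 cells, one for every (query, key) pair where the key lies in the future of the query.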





****************************************************************************************
****************************************************************************************




Answer to Question 3-4
a. The confusion matrix should be filled in as follows:
True Positive (TP): Actual condition Positive, Predicted condition Positive
False Positive (FP): Actual condition Negative, Predicted condition Positive 
True Negative (TN): Actual condition Negative, Predicted condition Negative
False Negative (FN): Actual condition Positive, Predicted condition Negative

b. The equations for precision and recall are:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

c. Examples illustrating bias:

Using only precision: Consider a model predicting a rare disease. If it only predicts positive cases when it is extremely confident, it may have high precision (most of its positive predictions are correct). However, it could miss many actual positive cases, leading to low recall. Focusing only on precision would bias the evaluation to favor overly conservative models.

Using only recall: Imagine a spam email classifier. If it classifies most emails as spam to avoid missing any actual spam, it will have high recall (it catches most real spam cases). However, this comes at the cost of many false positives, where normal emails are incorrectly flagged as spam. Focusing only on recall would bias the evaluation to favor overly aggressive models.
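The two scenarios can be made concrete with hypothetical counts (the numbers below are illustrative):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Conservative rare-disease model: almost all positive predictions are
# correct, but most actual positives are missed.
p1, r1 = precision_recall(tp=9, fp=1, fn=91)
# Aggressive spam filter: almost all spam is caught, but many normal
# emails are flagged.
p2, r2 = precision_recall(tp=95, fp=80, fn=5)
print(round(p1, 2), round(r1, 2))  # 0.9 0.09
print(round(p2, 2), round(r2, 2))  # 0.54 0.95
```

Reporting both metrics (or their harmonic mean, the F1 score) exposes both failure modes.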





****************************************************************************************
****************************************************************************************




Answer to Question 4-1
To graphically determine the continuous convolution (g * h)(t) of the given functions g(t) and h(t):

1. Flip h(t) horizontally about the vertical axis to obtain h(-t).
2. Slide h(-t) across g(t) from left to right, starting with the right edge of h(-t) aligned with t=0 on g(t).
3. At each t value, multiply the overlapping portions of g(t) and h(-t) and find the area under the resulting curve. This area is the value of the convolution (g * h)(t) at that t value.
4. Plot these convolution values for each t to obtain the full convolution function.

Key steps and coordinates:

- At t=0, the convolution value is 0 since the functions do not overlap.

- As h(-t) slides right, the convolution increases linearly from t=0 to t=1: the overlap region grows, so the overlap area (and hence the convolution value) rises linearly. At t=1, the convolution reaches a value of 1.

- From t=1 to t=2, the convolution remains constant at 1 since the overlap forms a rectangle with area 1.

- From t=2 to t=3, the convolution decreases linearly back to 0 as the overlap region shrinks. At t=3, the convolution reaches 0 again.

- From t=3 onward, the convolution remains 0 as the functions no longer overlap.

In summary, the key coordinates for the convolution (g * h)(t) are:
(0, 0), (1, 1), (2, 1), (3, 0)

With linear segments connecting these points from t=0 to t=3, and the function remaining 0 elsewhere.
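This result can be checked numerically. The sketch below assumes, consistent with the key coordinates above, that g and h are unit-height rectangular pulses of widths 2 and 1 starting at t=0 (the actual pulse shapes come from the figure in the question):

```python
import numpy as np

dt = 0.001
g = np.ones(2000)               # assumed: unit-height pulse of width 2
h = np.ones(1000)               # assumed: unit-height pulse of width 1
conv = np.convolve(g, h) * dt   # Riemann-sum approximation of (g*h)(t)

for t0 in (0.5, 1.0, 1.5, 2.0, 2.5):
    print(t0, round(conv[round(t0 / dt) - 1], 3))
# 0.5 1.0 1.0 1.0 0.5 at t = 0.5, 1.0, 1.5, 2.0, 2.5
```

The numerical values trace exactly the ramp-plateau-ramp shape through (0, 0), (1, 1), (2, 1), (3, 0).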





****************************************************************************************
****************************************************************************************




Answer to Question 4-2
To determine the discrete convolution u*v of the given discrete functions u[t] and v[t], we can use the following formula:

(u*v)[n] = Σ u[k] · v[n-k], where the sum is taken over all possible values of k.

Given:
u[t] = 1 if t=0, 3 if t=1, 0.5 if t=2, 1 if t=3, 0.5 if t=4, 0 else
v[t] = 1 if t=1, 2 if t=2, 3 if t=4, 0 else

Let's calculate the convolution for different values of n:

For n = 0:
(u*v)[0] = u[0] · v[0] = 1 · 0 = 0

For n = 1:
(u*v)[1] = u[0] · v[1] + u[1] · v[0] = 1 · 1 + 3 · 0 = 1

For n = 2:
(u*v)[2] = u[0] · v[2] + u[1] · v[1] + u[2] · v[0] = 1 · 2 + 3 · 1 + 0.5 · 0 = 5

For n = 3:
(u*v)[3] = u[0] · v[3] + u[1] · v[2] + u[2] · v[1] + u[3] · v[0] = 1 · 0 + 3 · 2 + 0.5 · 1 + 1 · 0 = 6.5

For n = 4:
(u*v)[4] = u[0] · v[4] + u[1] · v[3] + u[2] · v[2] + u[3] · v[1] + u[4] · v[0] = 1 · 3 + 3 · 0 + 0.5 · 2 + 1 · 1 + 0.5 · 0 = 5

For n = 5:
(u*v)[5] = u[1] · v[4] + u[2] · v[3] + u[3] · v[2] + u[4] · v[1] = 3 · 3 + 0.5 · 0 + 1 · 2 + 0.5 · 1 = 11.5

For n = 6:
(u*v)[6] = u[2] · v[4] + u[3] · v[3] + u[4] · v[2] = 0.5 · 3 + 1 · 0 + 0.5 · 2 = 2.5

For n = 7:
(u*v)[7] = u[3] · v[4] + u[4] · v[3] = 1 · 3 + 0.5 · 0 = 3

For n = 8:
(u*v)[8] = u[4] · v[4] = 0.5 · 3 = 1.5

Therefore, the discrete convolution (u*v)[n] is:

(u*v)[n] = 0 if n=0, 1 if n=1, 5 if n=2, 6.5 if n=3, 5 if n=4, 11.5 if n=5, 2.5 if n=6, 3 if n=7, 1.5 if n=8, 0 else
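The hand calculation can be verified with NumPy's `np.convolve`, representing u and v as value arrays over t = 0…4:

```python
import numpy as np

u = np.array([1, 3, 0.5, 1, 0.5])  # u[0..4]
v = np.array([0, 1, 2, 0, 3])      # v[0..4]
print(np.convolve(u, v))           # matches the values computed above
```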





****************************************************************************************
****************************************************************************************




Answer to Question 4-3
a. The sampling theorem, also known as the Nyquist-Shannon sampling theorem, states that a continuous-time signal can be perfectly reconstructed from its samples if the sampling frequency is at least twice the highest frequency component present in the signal. In other words, the sampling rate must be greater than or equal to twice the bandwidth of the signal to avoid loss of information during the sampling process.

b. When the sampling theorem is not fulfilled, i.e., when the sampling frequency is less than twice the highest frequency component in the signal, the phenomenon that occurs is called aliasing.

c. To explain aliasing using a sketch in the time domain, consider the following:

Suppose we have a continuous-time signal f(t) = sin(2π * 10t) + sin(2π * 20t), which is a sum of two sinusoidal components with frequencies 10 Hz and 20 Hz, respectively.

If we sample this signal at a sampling frequency fs = 30 Hz, which is less than twice the highest frequency component (20 Hz), aliasing will occur.

The sketch would show:
1. The original continuous-time signal f(t) with its two sinusoidal components.
2. The sampled points of f(t) at intervals of 1/fs = 1/30 seconds.
3. The reconstructed signal from the sampled points, which will have a distorted shape due to aliasing.

In the reconstructed signal, the 20 Hz component appears as a lower-frequency component due to aliasing. The aliased frequency is fa = |fs - 20| = 30 - 20 = 10 Hz, and for a sine the aliased component also arrives with inverted phase. It therefore superimposes on the true 10 Hz component, so the reconstruction is a distorted version of the original signal rather than a faithful copy.

The sketch would clearly show that the original signal cannot be accurately reconstructed from the sampled points when the sampling frequency is insufficient, leading to aliasing and distortion in the reconstructed signal.
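The frequency folding can be demonstrated numerically: at fs = 30 Hz, the samples of a 20 Hz sine are exactly the samples of a phase-inverted 10 Hz sine, so the two are indistinguishable after sampling:

```python
import numpy as np

fs = 30.0                                   # sampling frequency (Hz)
n = np.arange(12)                           # sample indices
t = n / fs

s20 = np.sin(2 * np.pi * 20 * t)            # 20 Hz component, sampled
s10_flipped = -np.sin(2 * np.pi * 10 * t)   # phase-inverted 10 Hz sinusoid
print(np.allclose(s20, s10_flipped))        # True: identical sample values
```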





****************************************************************************************
****************************************************************************************




Answer to Question 4-4
To calculate the recognition accuracy (ACC), we first need to calculate the word error rate (WER). The WER is the minimum number of word substitutions, deletions, and insertions required to transform the hypothesis (HYP) into the reference (REF), divided by the number of words in the reference.

REF: I need to book a flight to New York for next week
HYP: I need to cook light in Newark four next weeks

Alignment and error counting:
Correct (C): 4 words ("I", "need", "to", "next")
Substitutions (S): 6 words ("cook" for "book", "light" for "flight", "in" for "to", "Newark" for "New", "four" for "for", "weeks" for "week")
Deletions (D): 2 words ("a", "York")
Insertions (I): 0 words

Sanity check: C + S + D = 4 + 6 + 2 = 12 (the length of REF), and C + S + I = 4 + 6 + 0 = 10 (the length of HYP).

Total errors = S + D + I = 6 + 2 + 0 = 8
Number of words in REF = 12

WER = (S + D + I) / Number of words in REF
    = 8 / 12
    ≈ 0.6667

ACC = 1 - WER
    = 1 - 0.6667
    ≈ 0.3333

The recognition accuracy (ACC), expressed as a rounded percentage, is approximately 33%.
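The minimum error count can be verified with a standard word-level edit-distance computation (a small dynamic-programming sketch):

```python
def word_errors(ref, hyp):
    """Minimum substitutions + deletions + insertions (word-level Levenshtein)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                  # delete all remaining REF words
    for j in range(len(h) + 1):
        d[0][j] = j                  # insert all remaining HYP words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # match/sub
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return d[len(r)][len(h)], len(r)

ref = "I need to book a flight to New York for next week"
hyp = "I need to cook light in Newark four next weeks"
errors, n = word_errors(ref, hyp)
print(errors, n, 1 - errors / n)  # 8 12 0.333...
```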





****************************************************************************************
****************************************************************************************




Answer to Question 5-1
To detect each object instance in the scene from the given RGB-D videos of human demonstrations for learning the pouring water action, the image segmentation method that can be used is Instance Segmentation.

Instance Segmentation is a computer vision technique that involves detecting, segmenting, and classifying each distinct object of interest within an image or video frame. Here's how it works:

1. Object Detection: The first step is to identify and localize each object in the image using bounding boxes. This is typically done using deep learning object detection models like Faster R-CNN, YOLO, or SSD. These models are trained on large datasets to recognize and localize objects of different classes.

2. Pixel-wise Segmentation: After detecting the objects, the next step is to perform pixel-wise segmentation within each bounding box. This involves classifying each pixel as belonging to a specific object instance or the background. Convolutional Neural Networks (CNNs) like Mask R-CNN or DeepLab are commonly used for this purpose. These networks are trained to predict a binary mask for each object instance, indicating which pixels belong to that object.

3. Instance Classification: In addition to segmenting the objects, instance segmentation also assigns a class label to each object instance. This is done by extending the object detection and segmentation models with an additional classification head. The classification head predicts the class probabilities for each detected object instance.

4. Post-processing: The final step involves post-processing the segmented instances to refine the object boundaries and handle overlapping or occluded objects. Techniques like non-maximum suppression (NMS) can be applied to remove duplicate or overlapping detections.

In the context of learning the pouring water action from human demonstrations, instance segmentation can be applied to each frame of the RGB-D videos. The depth information from the RGB-D data can provide additional cues for object segmentation. By detecting and segmenting each object instance (e.g., the container, the cup, the water) in the scene, the robot can learn the spatial relationships and interactions between the objects involved in the pouring action.

Instance segmentation enables the robot to understand the scene at a fine-grained level, identifying individual objects and their precise boundaries. This information can be used to track the objects across frames, analyze their movements and interactions, and learn the necessary steps and parameters for executing the pouring water action accurately.
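The non-maximum suppression step mentioned above can be sketched as a minimal NumPy routine operating on [x1, y1, x2, y2] boxes:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        if rest.size == 0:
            break
        # Intersection-over-union of the kept box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_thresh]       # drop heavily overlapping boxes
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the second box is suppressed
```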





****************************************************************************************
****************************************************************************************




Answer to Question 5-2
The perturbation force term is needed in the Dynamic Movement Primitives (DMP) formulation for the following reasons:

1. Shaping the trajectory: Without the perturbation force term, the DMP reduces to a simple linear spring-damper system that converges from the start position to the goal along a stereotyped path. The perturbation force term, learned from the demonstrations, deforms this basic attractor behaviour so that the generated trajectory reproduces the shape of the demonstrated pouring motion. It is therefore the component that actually encodes the demonstrated skill.

2. Generalization: The perturbation force term allows the robot to generalize the learned pouring action to new situations. Because the term modulates an underlying goal-directed system, the learned trajectory shape transfers to different initial conditions and target positions.

3. Robustness to perturbations: In real-world scenarios, the robot may encounter external disturbances while executing the pouring action. Since the DMP is a dynamical system, deviations from the expected state are absorbed, and the trajectory is adapted on-the-fly while the overall shape is preserved.

4. Obstacle avoidance: Additional coupling terms can be added to the perturbation force to incorporate obstacle avoidance. Repulsive forces from nearby obstacles adjust the trajectory to avoid collisions while still following the general shape of the learned skill.

5. Refinement and adaptation: The perturbation force term can be fine-tuned based on additional feedback or reinforcement learning. By modifying its parameters based on the observed outcomes or rewards, the robot can gradually improve its performance and adapt to new task variations.

6. Coupling with sensory feedback: The force term can be modulated based on sensory signals from the environment, for example the perceived water level in the container or tactile feedback from contact with it, so that the robot adjusts its actions to the current state of the task.

In summary, the perturbation force term is what turns the plain point-to-point attractor dynamics of the DMP into the demonstrated pouring motion, while also enabling generalization, robustness, obstacle avoidance, refinement, and sensory coupling.





****************************************************************************************
****************************************************************************************




Answer to Question 5-3
The perturbation force term of the DMP can be approximated with locally weighted regression (LWR) using radial basis functions (RBFs) as a normalized weighted sum of local models:

f(x) = (∑ⱼ ψⱼ(x) wⱼ / ∑ⱼ ψⱼ(x)) · x

where:
- x is the phase variable of the canonical system, which decays from 1 towards 0 over the course of the movement
- ψⱼ(x) = exp(−hⱼ (x − cⱼ)²) is the j-th Gaussian radial basis function with center cⱼ and bandwidth hⱼ
- wⱼ is the weight of the j-th basis function
- the multiplication by x makes the force term vanish as the movement converges, so the system is guaranteed to reach the goal

The target forces f_target(xᵢ) are computed from the five demonstrated trajectories (positions, velocities, and accelerations extracted from the RGB-D videos) by inverting the DMP transformation system. Each weight wⱼ is then obtained from a separate locally weighted least-squares problem:

minimize ∑ᵢ ψⱼ(xᵢ) (f_target(xᵢ) − wⱼ xᵢ)²

which has the closed-form solution:

wⱼ = ∑ᵢ ψⱼ(xᵢ) xᵢ f_target(xᵢ) / ∑ᵢ ψⱼ(xᵢ) xᵢ²

In summary, LWR with RBFs approximates the perturbation force term as a normalized, phase-dependent mixture of local linear models, with one weight per basis function fitted to the data from the five demonstrations.
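A minimal numerical sketch of this fit, with a synthetic target force standing in for the forces extracted from the demonstrations (the basis-function count and bandwidth are illustrative choices):

```python
import numpy as np

x = np.linspace(1.0, 0.01, 200)        # phase variable samples
f_target = np.sin(2 * np.pi * x) * x   # synthetic stand-in for the target forces

centers = np.linspace(0.0, 1.0, 30)    # RBF centers c_j (illustrative)
h = 2000.0                             # shared bandwidth h_j (illustrative)
psi = np.exp(-h * (x[:, None] - centers[None, :]) ** 2)  # psi_j(x_i), (200, 30)

# Closed-form LWR solution: one independent weight per basis function
w = (psi * (x * f_target)[:, None]).sum(axis=0) / \
    (psi * (x ** 2)[:, None]).sum(axis=0)

# Normalized weighted sum of local models, multiplied by the phase variable
f_hat = (psi @ w) / psi.sum(axis=1) * x
print(np.abs(f_hat - f_target).max())  # small approximation error
```

Each weight is fitted independently and in closed form, which is what makes LWR cheap enough to run per degree of freedom of the demonstrated motion.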





****************************************************************************************
****************************************************************************************




Answer to Question 5-4
Yes, a Dynamic Movement Primitive (DMP) for a specific motion, such as pouring water, can be learned from five human demonstrations given as RGB-D videos. Here's the explanation:

1. DMPs are a framework for learning and generating complex motor skills from demonstrations. They encode a movement as a set of differential equations that can be adjusted to reproduce the demonstrated motion.

2. To learn a DMP from human demonstrations, the following steps are typically involved:
   a. Data collection: The RGB-D videos of the five human demonstrations provide the necessary data for learning the DMP. The RGB data captures the visual appearance, while the depth (D) data provides information about the 3D structure of the scene.
   b. Trajectory extraction: From the RGB-D videos, the 3D trajectories of the relevant body parts (e.g., hand, arm) involved in the pouring water action are extracted. This can be done using computer vision techniques like object tracking and pose estimation.
   c. Temporal alignment: The extracted trajectories from the five demonstrations may have different durations. To learn a consistent DMP, the trajectories need to be temporally aligned using techniques such as dynamic time warping (DTW).
   d. DMP learning: The aligned trajectories are used to learn the parameters of the DMP. This involves fitting the DMP equations to the demonstrated trajectories, typically using optimization techniques like locally weighted regression (LWR).

3. Five human demonstrations provide a reasonable amount of data for learning a DMP for a specific motion like pouring water. The demonstrations capture the essential characteristics and variations of the motion, allowing the DMP to generalize and reproduce the action.

4. The learned DMP can then be used to generate new trajectories for the robot to execute the pouring water action. The DMP provides a compact and adaptable representation of the motion, allowing the robot to adjust the movement based on different initial conditions or goals.

In summary, a DMP for the pouring water action can be learned from five human demonstrations given as RGB-D videos. The RGB-D data provides the necessary information to extract the motion trajectories, which are then used to learn the parameters of the DMP. The learned DMP enables the robot to reproduce and generalize the pouring water action based on the demonstrated examples.
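Steps b–d above can be sketched for one extracted 1-D trajectory. This uses a common Ijspeert-style discrete DMP formulation with a per-basis weighted least-squares (LWR) fit; the function names, gain values, and basis heuristic are illustrative assumptions:

```python
import numpy as np

def learn_dmp_forcing(y_demo, dt, n_basis=20, a=25.0, b=6.25, a_x=3.0):
    """Fit RBF weights of a 1-D discrete DMP forcing term to one
    demonstrated trajectory."""
    yd = np.gradient(y_demo, dt)            # demonstrated velocity
    ydd = np.gradient(yd, dt)               # demonstrated acceleration
    y0, g = y_demo[0], y_demo[-1]
    t = np.arange(len(y_demo)) * dt
    tau = t[-1]

    # Canonical system: phase x decays from 1 toward 0.
    x = np.exp(-a_x * t / tau)

    # Invert the transformation system
    # tau^2*ydd = a*(b*(g - y) - tau*yd) + f  to get the target forcing term.
    f_target = tau**2 * ydd - a * (b * (g - y_demo) - tau * yd)

    # Gaussian basis functions spread over the phase variable.
    c = np.exp(-a_x * np.linspace(0, 1, n_basis))   # centers
    h = n_basis**1.5 / c                            # widths (common heuristic)
    psi = np.exp(-h * (x[:, None] - c) ** 2)        # shape (T, n_basis)

    # One weighted least-squares solve per basis function (LWR).
    s = x * (g - y0)
    w = (psi * (s * f_target)[:, None]).sum(axis=0) / (
        (psi * (s**2)[:, None]).sum(axis=0) + 1e-10)
    return w, c, h
```

Fitting one such model per tracked degree of freedom (hand position coordinates, wrist tilt) yields the full pouring DMP.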





****************************************************************************************
****************************************************************************************




Answer to Question 5-5
To model the demonstrated pouring action for the robot, I would choose Dynamic Movement Primitives (DMPs) as the movement primitive. Here's why:

1. DMPs are well-suited for learning from demonstrations: DMPs can be trained using human demonstrations provided as RGB-D videos. The trajectories can be extracted from the videos and used to learn the parameters of the DMP model.

2. DMPs allow for generalization and adaptation: Once trained, DMPs can generate trajectories that generalize well to new starting and goal positions. This is important for the robot to adapt the learned pouring action to different scenarios and object locations.

3. DMPs can incorporate via-points: DMPs can be extended with via-points, intermediate points that the generated trajectory should pass through. In this case, a via-point can be added to ensure that the robot avoids the obstacle while reproducing the pouring action: placing the via-point away from the obstacle, outside the region occupied by the demonstrated trajectories, steers the generated motion around it.

4. DMPs provide smooth and stable motion: DMPs generate smooth and stable trajectories, which is crucial for a task like pouring water. The generated trajectories will be temporally scaled and maintain the overall shape of the demonstrated motion, ensuring a controlled and precise pouring action.

5. DMPs are computationally efficient: DMPs have a compact representation and can be computed efficiently in real-time. This is important for the robot to quickly generate and execute the adapted pouring action based on the current situation and obstacle location.

In summary, Dynamic Movement Primitives (DMPs) are a suitable choice for modeling the demonstrated pouring action because they can learn from human demonstrations, allow for generalization and adaptation, incorporate via-points for obstacle avoidance, provide smooth and stable motion, and are computationally efficient.
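The generalization point (2) can be illustrated by integrating the DMP equations toward a new goal, e.g. a different cup position. The formulation and all parameter values below are illustrative assumptions; with zero forcing weights the system reduces to a critically damped spring, which still converges to whatever goal it is given:

```python
import numpy as np

def rollout_dmp(w, c, h, y0, g, tau=1.0, dt=0.01, a=25.0, b=6.25, a_x=3.0):
    """Integrate a 1-D discrete DMP (Euler steps) toward goal g."""
    y, yd, x = y0, 0.0, 1.0
    traj = []
    for _ in range(int(tau / dt)):
        psi = np.exp(-h * (x - c) ** 2)
        # Forcing term: normalized RBF mixture, gated by phase and amplitude.
        f = (psi @ w) / (psi.sum() + 1e-10) * x * (g - y0)
        ydd = (a * (b * (g - y) - tau * yd) + f) / tau**2
        yd += ydd * dt
        y += yd * dt
        x += -a_x * x / tau * dt            # canonical-system decay
        traj.append(y)
    return np.array(traj)

# Zero weights: the trajectory smoothly converges to the new goal 0.5.
traj = rollout_dmp(np.zeros(10), np.linspace(1, 0.05, 10),
                   np.full(10, 10.0), y0=0.0, g=0.5)
```

Changing `tau` rescales the motion in time while preserving its shape, which is the temporal-scaling property mentioned in point 4.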





****************************************************************************************
****************************************************************************************




Answer to Question 5-6
Here are the answers to the exam question:

1. The main difference between cognitivist and emergent cognitive architectures is in how they model and explain cognitive processes:

Cognitivist architectures, such as ACT-R and SOAR, view cognition as symbolic information processing. They propose that the mind operates on symbolic representations using explicit rules and algorithms. Cognition is seen as a sequential, step-by-step process that manipulates symbols. 

In contrast, emergent architectures, such as connectionist networks, view cognition as emerging from the interactions of simple processing units (neurons). They emphasize parallel distributed processing and the emergent properties that arise from the connectivity patterns between units. Cognition is not explicitly programmed but learned from experience by modifying connection strengths.

In summary, cognitivist architectures use symbolic representations and explicit rules, while emergent architectures rely on distributed representations and learning from experience.

2. A hybrid cognitive architecture combines elements of both cognitivist and emergent approaches. It integrates symbolic processing with neural network-like components.

In a hybrid architecture, some cognitive processes may be modeled using explicit symbolic representations and rules, while others emerge from the interactions of connectionist networks. The symbolic and emergent components interact and complement each other.

The advantage of hybrid architectures is that they can leverage the strengths of both approaches - the interpretability and reasoning capabilities of symbolic systems, and the robustness, learning, and pattern recognition abilities of emergent systems. Hybrid architectures aim to provide a more comprehensive and biologically plausible model of cognition.

Examples of hybrid cognitive architectures include ACT-R/S, CLARION, and Sigma. These architectures combine symbolic rule-based modules with connectionist networks to model different aspects of cognition such as memory, learning, perception, and action.





****************************************************************************************
****************************************************************************************




Answer to Question 5-7
a) The forgetting mechanism given by $\alpha_i(t)$ is a time-based decay method. The parameter $\beta_i$ represents the importance or salience of item $i$: it scales the item's activation, so a higher $\beta_i$ means the item keeps a higher activation at every point in time (the decay rate itself is not affected by $\beta_i$). The parameter $d$ is the variance of the Gaussian used in the decay function: a larger $d$ spreads the activation over a wider time window, so the item's activation decays more slowly.

b) Given the assumptions:
- Initially the robot's memory is empty
- At $t = 1$, the memory receives $i_1$
- At $t = 2$, the memory receives $i_2$
- At $t = 3$, the memory receives $i_3$ and $i_1$ and $i_2$ are recalled
- $\beta_{i_1} = \beta_{i_2} = \beta_{i_3} = \beta$

The equations for calculating the activations at $t=3$ are:

$\alpha_{i_1}(3) = \beta \cdot (r_{i_1,1} \cdot \mathcal{N}(\mu=1, \sigma^2=d)(3) + r_{i_1,3} \cdot \mathcal{N}(\mu=3, \sigma^2=d)(3))$
$= \beta \cdot (\mathcal{N}(\mu=1, \sigma^2=d)(3) + \mathcal{N}(\mu=3, \sigma^2=d)(3))$

$\alpha_{i_2}(3) = \beta \cdot (r_{i_2,2} \cdot \mathcal{N}(\mu=2, \sigma^2=d)(3) + r_{i_2,3} \cdot \mathcal{N}(\mu=3, \sigma^2=d)(3))$
$= \beta \cdot (\mathcal{N}(\mu=2, \sigma^2=d)(3) + \mathcal{N}(\mu=3, \sigma^2=d)(3))$

$\alpha_{i_3}(3) = \beta \cdot r_{i_3,3} \cdot \mathcal{N}(\mu=3, \sigma^2=d)(3)$
$= \beta \cdot \mathcal{N}(\mu=3, \sigma^2=d)(3)$

The order of activation magnitudes at $t=3$ is:
$\alpha_{i_2}(3) > \alpha_{i_1}(3) > \alpha_{i_3}(3)$

All three items contribute a Gaussian centered at $t=3$ (since $i_3$ was just received and $i_1$, $i_2$ were just recalled), so the term $\mathcal{N}(\mu=3, \sigma^2=d)(3)$ is common to all of them. On top of it, $i_1$ and $i_2$ retain residual activation from their original reception; $i_2$ was received more recently ($t=2$ vs. $t=1$), so its residual term $\mathcal{N}(\mu=2, \sigma^2=d)(3)$ is larger than $i_1$'s term $\mathcal{N}(\mu=1, \sigma^2=d)(3)$.
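The activations can be evaluated numerically from the equations above. The values $\beta = 1$ and $d = 1$ are illustrative; any positive values give the same ordering, since $\beta$ only scales all three activations:

```python
import math

def activation(beta, d, event_times, t):
    """alpha_i(t) = beta * sum over reception/recall times mu of N(mu, d)(t)."""
    return beta * sum(
        math.exp(-(t - mu) ** 2 / (2 * d)) / math.sqrt(2 * math.pi * d)
        for mu in event_times
    )

beta, d = 1.0, 1.0                       # illustrative values
a1 = activation(beta, d, [1, 3], t=3)    # i1: received at t=1, recalled at t=3
a2 = activation(beta, d, [2, 3], t=3)    # i2: received at t=2, recalled at t=3
a3 = activation(beta, d, [3], t=3)       # i3: received at t=3
print(a1, a2, a3)
```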





****************************************************************************************
****************************************************************************************




