Answer to Question 1-1


The challenges in modelling the perception of text are:

1. Ambiguity:
   - Example: Homographs are words that are spelled identically but have different meanings. For example, the word "bat" can refer to a piece of sports equipment or a nocturnal flying mammal. This ambiguity can make it difficult for models to accurately understand and interpret text.

2. Contextual Understanding:
   - Example: The meaning of a word can change depending on the context in which it is used. For example, the word "bank" can refer to a financial institution or the side of a river. Understanding the context in which a word is used is crucial for accurately interpreting text, but it can be challenging for models to do so.





****************************************************************************************
****************************************************************************************




Answer to Question 1-2



a) The N-gram language model makes the Markov assumption: the probability of a word depends only on the previous n-1 words, rather than on the entire preceding history.

b) The probability equation of the sentence "This is the exam of Advanced AI" from a tri-gram language model is:

P(This is the exam of Advanced AI) = P(This) · P(is | This) · P(the | This is) · P(exam | is the) · P(of | the exam) · P(Advanced | exam of) · P(AI | of Advanced)

(The first two factors condition on fewer than two words because no earlier context exists at the sentence start; equivalently, sentence-start padding symbols can be used.)

The probabilities are estimated from the training data using maximum likelihood estimation.
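This estimation can be sketched in a few lines. The corpus, the `<s>`/`</s>` padding convention, and the helper names below are illustrative assumptions, not part of the question:

```python
from collections import Counter

def train_trigram(corpus):
    """MLE trigram model: P(w | u, v) = count(u v w) / count(u v)."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        # pad so that every word, including the first, has a two-word history
        toks = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for u, v, w in zip(toks, toks[1:], toks[2:]):
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1

    def prob(w, u, v):
        # relative frequency; 0 for unseen histories (no smoothing)
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

    return prob

p = train_trigram(["This is the exam", "This is the lecture"])
# P(exam | is, the) = count("is the exam") / count("is the") = 1/2
```

Note that unsmoothed MLE assigns probability zero to any trigram unseen in training, which is why smoothing or back-off is used in practice.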





****************************************************************************************
****************************************************************************************




Answer to Question 1-3


a)

1. The first step is to count the frequency of each adjacent pair of characters in the given sentences.
2. The most frequent pair is merged into a new symbol that is added to the vocabulary, the pair counts are updated, and the process repeats. In order, the merges are: "in", "it", "ai", "ki", "st", "ty", "te", "li", "np", "pl", "ke".
3. After these eleven merges, the most frequent remaining candidate pairs ("it", "in", "ai") correspond to symbols already in the vocabulary, so no new symbols are created and the procedure stops.

b) The tokenized sentence is: ["I", "like", "KIT", "</w>"]
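The merge loop described in a) can be sketched as follows. The toy corpus and frequencies here are illustrative, not the exam's actual sentences:

```python
from collections import Counter

def get_pair_counts(vocab):
    # count adjacent symbol pairs across all words, weighted by word frequency
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    # replace every occurrence of the pair with a single merged symbol
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# hypothetical corpus: word -> frequency, words split into characters
vocab = {tuple("kit"): 3, tuple("king"): 2, tuple("ai"): 4}
merges = []
for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(vocab, best)
```

On this toy corpus the first merge is ("k", "i") with count 5, then ("a", "i"), then ("ki", "t"); merged symbols can themselves take part in later merges.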





****************************************************************************************
****************************************************************************************




Answer to Question 1-4


a) The label sequence for the sentence consists entirely of O tags: every token is labeled O, since no token is part of a named entity.





****************************************************************************************
****************************************************************************************




Answer to Question 2-1


a)

For CBOW, the training sample would be:

Context words: ["Human", "than", "large", "language", "model"]
Target word: "smarter"

For Skip-gram, the training sample would be:

Context word: "smarter"
Target words: ["Human", "than", "large", "language", "model"]

b)

The main challenge in training the Skip-gram model is the output softmax: to compute the probability of a context word given the center word, the model must normalize over the entire vocabulary, which is computationally very expensive for large vocabularies.

One solution to this challenge is negative sampling. Instead of computing the full softmax, the model is trained to distinguish the true context word (the positive sample) from a small number k of randomly drawn words (the negative samples), turning each prediction into a handful of binary classifications. Only the vectors of the positive word and the k negative words are updated per training step, instead of computing scores for every word in the vocabulary.

For example, for the training pair (center: "smarter", context: "Human") from the sentence "Human is smarter than large language model", with k = 2 the model would update only the vectors of "Human" and two randomly sampled words, rather than normalizing over the whole vocabulary. This makes each update cheap and nearly independent of the vocabulary size.
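A minimal numpy sketch of one skip-gram negative-sampling update follows. The vocabulary, embedding dimension, learning rate, and sampling scheme are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["human", "is", "smarter", "than", "large", "language", "model"]
V, D = len(vocab), 8
W_in = rng.normal(scale=0.1, size=(V, D))   # center-word ("input") vectors
W_out = rng.normal(scale=0.1, size=(V, D))  # context-word ("output") vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_step(center, context, k=2, lr=0.1):
    # one update: pull the true (center, context) pair together,
    # push k randomly sampled "negative" words away
    v_c = W_in[center].copy()
    grad_c = np.zeros(D)
    g = sigmoid(v_c @ W_out[context]) - 1.0   # positive pair, label 1
    grad_c += g * W_out[context]
    W_out[context] -= lr * g * v_c
    for n in rng.integers(0, V, size=k):      # negative pairs, label 0
        g = sigmoid(v_c @ W_out[n])
        grad_c += g * W_out[n]
        W_out[n] -= lr * g * v_c
    W_in[center] -= lr * grad_c

# one training pair from the example: center "smarter", context "than"
w_before = W_in[vocab.index("smarter")].copy()
sgns_step(vocab.index("smarter"), vocab.index("than"))
```

Each step touches only 1 + k output vectors, which is the point of the technique; a full softmax step would touch all V of them.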





****************************************************************************************
****************************************************************************************




Answer to Question 2-2



a. The problem with this model is that the Transformer model relies on the encoder to generate a high-level representation of the input sequence, which is then used by the decoder to generate the output sequence. By replacing the encoder with simple word embeddings, the model loses the ability to capture the context and dependencies between words in the input sequence. This will result in poor translations, especially for longer and more complex sentences.

b. Consider the following two example sentences:

1. "The cat chases the mouse."
2. "The mouse chases the cat."

In this case, the model will likely translate both sentences to the same output, since it is only attending to the input word embeddings and not considering the order of the words in the input sequence. This will result in an incorrect translation for the second sentence.
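This order-insensitivity can be checked directly: with attention over bare word embeddings (no positional encoding), any permutation of the input words yields the same attention output. A toy numpy sketch, with random embeddings and a hypothetical decoder query:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in ["the", "cat", "chases", "mouse"]}

def attend(query, words):
    # dot-product attention over unordered word embeddings (no positions)
    keys = np.stack([emb[w] for w in words])
    scores = keys @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ keys

q = rng.normal(size=4)  # hypothetical decoder query vector
out1 = attend(q, "the cat chases the mouse".split())
out2 = attend(q, "the mouse chases the cat".split())
# out1 == out2: both sentences contain the same multiset of words
```

Because the attention output is a weighted sum over an unordered set of vectors, the two sentences are indistinguishable to the decoder, which is exactly the failure described above.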





****************************************************************************************
****************************************************************************************




Answer to Question 2-3



a) The strategy wav2vec 2.0 uses to encourage learning contextualized representations on the feature encoder outputs is masking: spans of the latent speech representations Z produced by the feature encoder are masked before being passed to the Transformer context network, similar to masked language modelling in BERT. The context network outputs context representations C over the whole sequence, and the model must identify, for each masked time step, the true quantized latent representation among a set of distractors. This strategy is directly tied to the contrastive loss in pre-training: the positive sample is the quantized latent of the true masked frame, and the negative samples are quantized latents drawn from other masked time steps of the same utterance. The contrastive loss maximizes the similarity between the context representation at a masked position and the true quantized latent while minimizing the similarity to the distractors, so the model can only succeed by inferring the masked content from the surrounding context, which forces it to learn contextualized representations.

b) Besides the contrastive loss, the pre-training objective contains a diversity loss. It is needed to ensure that the quantized representations Q make use of the full codebook: the loss is based on the entropy of the averaged softmax distribution over the codebook entries, and it is minimized when all entries are used equally often. Without it, the quantizer can collapse to using only a few codebook entries; the negatives in the contrastive loss would then be nearly indistinguishable from the positive, the contrastive task would become trivial, and the model would fail to learn useful contextualized representations.
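The codebook-usage idea can be illustrated with a small numpy sketch; the shapes and the exact normalization are simplified relative to the paper:

```python
import numpy as np

def diversity_loss(logits):
    # logits: [batch, n_codes] scores over codebook entries.
    # Average the softmax distributions over the batch, then penalize
    # low entropy: the loss is ~0 when all entries are used uniformly
    # and approaches 1 when the quantizer collapses to a single entry.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p_avg = p.mean(axis=0)
    entropy = -(p_avg * np.log(p_avg + 1e-12)).sum()
    return 1.0 - entropy / np.log(logits.shape[1])

uniform = np.zeros((4, 8))      # every codebook entry equally likely
collapsed = np.zeros((4, 8))
collapsed[:, 0] = 50.0          # every frame strongly prefers entry 0
```

`diversity_loss(uniform)` is near zero while `diversity_loss(collapsed)` is near one, matching the intuition that the loss penalizes codebook collapse.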





****************************************************************************************
****************************************************************************************




Answer to Question 3-1


No, I would not agree with my friend. For the task of generating text descriptions for a set of images, a Unidirectional model is more suitable than a Bidirectional model. The reason is that during the generation of the text description, the model should generate words based on the previous words and the image representation. A Bidirectional model, which processes the input sequence in both directions, would not be appropriate in this case because it would have access to future words, which is not desirable for generating a sequence of words. A Unidirectional model, on the other hand, processes the input sequence in one direction, which is suitable for generating a sequence of words based on the previous words and the image representation.





****************************************************************************************
****************************************************************************************




Answer to Question 3-2


To address the issue of handling out-of-vocabulary words in the Encoder-Decoder model for Neural Machine Translation, one potential approach is to use character-level tokenization instead of word-level tokenization. Instead of breaking the input sentence into individual words, we break it into individual characters. Since the character inventory is small and closed, every word can be represented, so even words never seen during training can still be encoded and translated character by character.

One potential problem with character-level tokenization is that it significantly increases the length of the input sequence, leading to longer training and inference times. It also makes it harder for the model to capture long-range dependencies, since a fixed-size context now spans far fewer words than it would with word-level tokens.





****************************************************************************************
****************************************************************************************




Answer to Question 3-3


a. Multi-head self-attention means that the self-attention mechanism is applied multiple times with different learned linear projections of the input. This allows the model to capture different types of dependencies between elements in the sequence. It plays an important role because it enables the model to focus on different aspects of the input, making it more expressive and capable of handling more complex dependencies.

b. In the decoder's self-attention weight matrix, the weights corresponding to future positions must be masked out: for each attention query (row), all attention weights whose key position (column) is greater than the query position are masked. This is done by setting these scores to -inf before the softmax, which gives them exactly zero attention weight.

Therefore, with queries as rows and keys as columns, the masked weights are the ones strictly above the main diagonal, which runs from the top left to the bottom right of the matrix. In the figure, one would draw this diagonal and mask out all entries above it (the upper triangle), leaving each position able to attend only to itself and earlier positions.
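A small numpy sketch of this causal masking, with illustrative sizes:

```python
import numpy as np

T = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(T, T))  # raw attention scores: rows = queries, cols = keys
# mask keys at positions greater than the query position (strict upper triangle)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf
# softmax over keys: masked entries get exactly zero attention weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

After the softmax, each row still sums to one, but every entry above the diagonal is exactly zero, so position i attends only to positions 0..i.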





****************************************************************************************
****************************************************************************************




Answer to Question 3-4


a)

|                | Predicted: Yes | Predicted: No |
|----------------|----------------|--------------|
| Actual: Yes    | True Positive (TP) | False Negative (FN) |
| Actual: No     | False Positive (FP) | True Negative (TN) |

b)

Precision: TP / (TP + FP)

Recall: TP / (TP + FN)

c)

Precision-only bias: a model can achieve near-perfect precision simply by making very few, very confident positive predictions. For example, a classifier that predicts "positive" for only a single instance it is certain about can reach a precision of 1.0 while missing almost all actual positive cases, i.e. with near-zero recall.

Recall-only bias: a model can achieve perfect recall simply by predicting every instance as positive. For example, a medical model that labels every patient as having the disease attains a recall of 1.0 for the disease class, but its precision is very poor, since most of its positive predictions are false positives.

This is why precision and recall must be considered together (e.g. via the F1 score): each metric in isolation can be gamed by a degenerate predictor.
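The two degenerate strategies can be checked numerically. The labels below are a hypothetical 4-positive / 6-negative toy split:

```python
def precision_recall(y_true, y_pred):
    # counts from the confusion matrix in a)
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# cautious model: one confident positive prediction -> perfect precision, low recall
p1, r1 = precision_recall(y_true, [1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# predict-everything model: all positives found -> perfect recall, low precision
p2, r2 = precision_recall(y_true, [1] * 10)
# p1 = 1.0, r1 = 0.25; p2 = 0.4, r2 = 1.0
```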





****************************************************************************************
****************************************************************************************




Answer to Question 4-1


To determine the graphical convolution of the two continuous functions $g(t)$ and $h(t)$, we first need to understand what convolution is. In this context, convolution is an operation that takes two functions, $g(t)$ and $h(t)$, and produces a third function, $f(t) = (g * h)(t)$, which represents the area under the product of the two functions at each point in time.

To graphically determine the convolution of $g(t)$ and $h(t)$, we will follow these steps:

1. Flip the $h(t)$ function in time (mirror it about the vertical axis, i.e. form $h(t-\tau)$ as a function of $\tau$) and shift it horizontally by some value of $t$.
2. Multiply the flipped and shifted $h(t)$ function by the $g(t)$ function.
3. Calculate the area under the product of the two functions at each point in time.
4. Repeat steps 1-3 for all possible horizontal shifts of $h(t)$.
5. The resulting function, $f(t) = (g * h)(t)$, is the convolution of $g(t)$ and $h(t)$.

Let's apply these steps to the functions $g(t)$ and $h(t)$ shown in the figure "imgs/graph.png". Assume the figure shows $g(t)$ as a triangle with its base on the x-axis from $t = -2$ to $t = 2$ and its peak of height 2 at $t = 0$, and $h(t)$ as a rectangle with its base on the x-axis from $t = -1$ to $t = 1$ and height 1.

Now, let's perform the convolution steps:

1. Time-reverse the $h(t)$ function: since the rectangle is symmetric about $t = 0$, $h(-\tau)$ is identical to $h(\tau)$.
2. Shift the time-reversed $h$ horizontally: slide the window $[t-1, t+1]$ across the axis for all values of $t$.
3. Multiply the shifted window by $g(\tau)$: the product is nonzero only where the window overlaps the triangle.
4. Integrate the product for each shift $t$: this gives $f(t) = (g * h)(t) = \int_{t-1}^{t+1} g(\tau)\,d\tau$.
5. Repeating this for all shifts traces out the convolution $f(t)$.

The resulting $f(t) = (g * h)(t)$ is a smooth, piecewise-quadratic bump supported on $[-3, 3]$: it rises from zero at $t = -3$ to its peak at $t = 0$ and falls back to zero at $t = 3$. The peak value equals the area of $g$ inside the window $[-1, 1]$: $\int_{-1}^{1} (2 - |\tau|)\,d\tau = 3$.

The important points on the $f(t) = (g * h)(t)$ function are:

* The peak at $t = 0$ with height 3.
* The zeros at $t = -3$ and $t = 3$ (the function is identically zero for $|t| \ge 3$).
* The transition points at $t = \pm 1$ and $t = \pm 2$, where the quadratic pieces change (for example, $f(\pm 2) = 0.5$).

These points can be clearly marked on the graph to represent the convolution of $g(t)$ and $h(t)$.
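Under the assumed shapes (triangle of height 2 on [-2, 2], unit rectangle on [-1, 1]), the result can be checked numerically with a discrete approximation of the integral:

```python
import numpy as np

dt = 0.001
t_g = np.linspace(-2, 2, 4001)           # grid step dt = 0.001
g = np.maximum(0.0, 2.0 - np.abs(t_g))   # triangle: base [-2, 2], peak 2 at t = 0
t_h = np.linspace(-1, 1, 2001)
h = np.ones_like(t_h)                    # rectangle: [-1, 1], height 1
f = np.convolve(g, h) * dt               # Riemann approximation of (g*h)(t)
t_f = -3.0 + dt * np.arange(len(f))      # result supported on [-3, 3]
```

The numerical result peaks at value 3 at t = 0 and vanishes at t = ±3, matching the graphical analysis.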





****************************************************************************************
****************************************************************************************




Answer to Question 4-2


The discrete convolution u*v of the two discrete functions u[t] and v[t] is given by:

u*v = sum(u[tau] * v[t-tau]) for tau = -infinity to infinity

The original figure with the sample values is not reproduced here, so assume (consistent with the factors appearing in the working) that the nonzero samples are:

u[0..4] = [1, 3, 0.5, 1, 0.5]
v[0..4] = [0, 1, 2, 3, 3]

with both functions zero elsewhere. Since u is nonzero on t = 0..4 and v on t = 0..4, the convolution is nonzero on t = 0..8:

t = 0: u[0]v[0] = 1·0 = 0
t = 1: u[0]v[1] + u[1]v[0] = 1·1 + 3·0 = 1
t = 2: u[0]v[2] + u[1]v[1] + u[2]v[0] = 2 + 3 + 0 = 5
t = 3: u[0]v[3] + u[1]v[2] + u[2]v[1] + u[3]v[0] = 3 + 6 + 0.5 + 0 = 9.5
t = 4: u[0]v[4] + u[1]v[3] + u[2]v[2] + u[3]v[1] + u[4]v[0] = 3 + 9 + 1 + 1 + 0 = 14
t = 5: u[1]v[4] + u[2]v[3] + u[3]v[2] + u[4]v[1] = 9 + 1.5 + 2 + 0.5 = 13
t = 6: u[2]v[4] + u[3]v[3] + u[4]v[2] = 1.5 + 3 + 1 = 5.5
t = 7: u[3]v[4] + u[4]v[3] = 3 + 1.5 = 4.5
t = 8: u[4]v[4] = 0.5·3 = 1.5

So u*v = [0, 1, 5, 9.5, 14, 13, 5.5, 4.5, 1.5] for t = 0..8.
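As a numerical check, assuming the hypothetical sample values u = [1, 3, 0.5, 1, 0.5] and v = [0, 1, 2, 3, 3] (the figure itself is unavailable), `np.convolve` reproduces the full sequence:

```python
import numpy as np

u = np.array([1.0, 3.0, 0.5, 1.0, 0.5])  # assumed samples u[0..4]
v = np.array([0.0, 1.0, 2.0, 3.0, 3.0])  # assumed samples v[0..4]
f = np.convolve(u, v)                    # (u*v)[t] for t = 0..8
```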





****************************************************************************************
****************************************************************************************




Answer to Question 4-3


a) The sampling theorem states that a continuous-time signal can be perfectly reconstructed from its samples if the signal is bandlimited and the sampling rate is greater than twice the highest frequency of the signal.

b) When the sampling theorem is not fulfilled, an effect called aliasing occurs. This phenomenon is when higher frequencies in the signal are incorrectly interpreted as lower frequencies in the sampled signal, leading to distortion and inaccuracies in the reconstructed signal.

c) To illustrate aliasing, consider a sine wave with a frequency of 5 Hz, f(t) = sin(2π·5t). The Nyquist rate is twice the highest frequency, i.e. 10 Hz. If we sample at only 8 Hz, the samples are taken at t = n/8 and have the values sin(5πn/4): 0, -0.71, 1, -0.71, 0, 0.71, -1, 0.71, and so on. These are exactly the samples of an inverted 3 Hz sine, -sin(2π·3t), at the same instants. In other words, the 5 Hz signal aliases to |fs - f| = |8 - 5| = 3 Hz: the high-frequency signal is indistinguishable from a low-frequency one because the sampling rate is below the Nyquist rate.

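The alias can be verified numerically: the 8 Hz samples of the 5 Hz sine are identical to the samples of an inverted 3 Hz sine:

```python
import numpy as np

fs = 8.0                               # sampling rate (Hz), below the 10 Hz Nyquist rate
n = np.arange(16)                      # sample indices
x5 = np.sin(2 * np.pi * 5 * n / fs)    # samples of the 5 Hz sine
x3 = -np.sin(2 * np.pi * 3 * n / fs)   # samples of an inverted 3 Hz sine
# x5 and x3 agree sample-for-sample: 5 Hz aliases to |8 - 5| = 3 Hz
```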





****************************************************************************************
****************************************************************************************




Answer to Question 4-4


To calculate the accuracy ACC, we first need the word error rate (WER). The WER is the minimum number of word insertions, deletions, and substitutions required to transform the hypothesis into the reference, divided by the number of words in the reference.

To calculate the WER for this problem:

1. Align the hypothesis with the reference using the minimum edit distance at the word level (computed with dynamic programming, as in the Levenshtein algorithm); a naive word-by-word positional comparison can overestimate the number of errors.
2. Count the word insertions, deletions, and substitutions in this optimal alignment.
3. Divide their total by the number of words in the reference.

For the given pair, a minimum-cost alignment yields the following errors:

* The word "book" in the reference is substituted with the word "cook" in the hypothesis.
* The word "a" in the reference is deleted in the hypothesis.
* The word "flight" in the reference is substituted with the word "light" in the hypothesis.
* The word "to" in the reference is substituted with the word "in" in the hypothesis.
* The word "New" in the reference is substituted with the word "Newark" in the hypothesis.
* The word "York" in the reference is substituted with the word "four" in the hypothesis.
* The word "for" in the reference is substituted with the word "next" in the hypothesis.
* The word "next" in the reference is substituted with the word "weeks" in the hypothesis.

In total there are 8 edits (7 substitutions and 1 deletion), and the reference contains 9 words. Therefore, the WER is 8/9.

The accuracy ACC is defined as 1-WER, so the ACC is 1 - 8/9 = 1/9 ≈ 0.1111.

Therefore, the accuracy ACC for the given reference-hypothesis-pair is approximately 11.11%.
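The computation can be reproduced with a word-level minimum edit distance. The reference and hypothesis strings below are reconstructed from the listed errors and are assumptions, not the exact exam wording:

```python
def wer(reference, hypothesis):
    """Word error rate via minimum edit distance (dynamic programming)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits to turn hyp[:j] into ref[:i]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = i                               # all deletions
    for j in range(1, len(hyp) + 1):
        d[0][j] = j                               # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[-1][-1] / len(ref)

reference = "book a flight to New York for next week"   # assumed wording
hypothesis = "cook light in Newark four next weeks"     # assumed wording
# wer(reference, hypothesis) = 8/9
```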





****************************************************************************************
****************************************************************************************




Answer to Question 5-1


The image segmentation method that can be used to detect each object instance in the scene is called Instance Segmentation. Instance Segmentation is a computer vision technique that combines object detection and semantic segmentation to detect and segment each object instance in an image or video frame. It labels each pixel in the image or video frame with the class label of the object it belongs to and also distinguishes between different instances of the same object class.

Instance Segmentation can be achieved using deep learning-based approaches such as Mask R-CNN, YOLACT, and SOLO. Mask R-CNN is a popular method that extends Faster R-CNN by adding a branch for predicting object masks in parallel with the existing branch for bounding-box recognition. A backbone network first extracts feature maps from the image, a region proposal network (RPN) generates candidate object regions, and each proposal is mapped to a fixed-size feature map via RoIAlign. This feature map is then passed through fully connected layers to predict the class label and bounding-box offsets, and through a small convolutional branch to predict the per-instance segmentation mask.

To apply Instance Segmentation to the robot learning pouring water action from human demonstrations, we can first preprocess the RGB-D videos to extract keyframes and perform data augmentation to increase the diversity of the data. We can then train an Instance Segmentation model on the preprocessed data to detect and segment objects such as the water container, the pouring target, and the robot arm. The segmentation masks can then be used to extract the 3D geometry of the objects and perform spatial reasoning to learn the pouring action.

In summary, Instance Segmentation labels each pixel with its object class while distinguishing between instances of the same class, and models such as Mask R-CNN provide the per-instance masks needed to extract the 3D geometry of the objects and learn the pouring action from the demonstrations.





****************************************************************************************
****************************************************************************************




Answer to Question 5-2


The perturbation (forcing) term is needed because the basic DMP dynamics are only a linear spring-damper system, which on its own can produce nothing more expressive than a smooth point-to-point motion toward the goal. The forcing term is a learnable nonlinear function that deforms this attractor motion so the DMP can reproduce the arbitrary shape of the demonstrated pouring trajectory. Because the forcing term is driven by a phase variable that decays to zero, its influence vanishes toward the end of the movement, so the system is still guaranteed to converge to the goal. Learning the forcing term from the human demonstrations thus lets the robot capture how the pouring is performed (speed profile, tilt angle, path shape) while keeping the robustness and generalization of the underlying attractor dynamics.





****************************************************************************************
****************************************************************************************




Answer to Question 5-3


The perturbation (forcing) term of the DMP is approximated by locally weighted regression (LWR) with Gaussian radial basis functions (RBFs) of the phase variable x:

1. The first equation defines the activation of each basis function:

psi_i(x) = exp(-h_i * (x - c_i)^2)

where:

* psi_i(x) is the activation of the i-th RBF at the current phase x
* c_i is the center of the i-th RBF
* h_i determines the width of the i-th RBF

2. The second equation gives the forcing term as the normalized, phase-modulated weighted sum of the basis functions:

f(x) = (sum_i psi_i(x) * w_i / sum_i psi_i(x)) * x

where:

* w_i is the learnable weight of the i-th basis function
* x is the phase variable, which decays from 1 to 0 over the movement, so the forcing term vanishes at the goal

3. The weights are fitted by locally weighted regression: each w_i minimizes the locally weighted squared error between the model and the target forcing values f_target(t) computed from the demonstrated trajectory,

J_i = sum_t psi_i(x_t) * (f_target(t) - w_i * x_t)^2

which has the closed-form solution w_i = (sum_t psi_i(x_t) x_t f_target(t)) / (sum_t psi_i(x_t) x_t^2).

Note that the centers c_i are typically spread evenly along the phase variable and the widths h_i are chosen so that neighboring basis functions overlap; both can also be tuned, e.g. by cross-validation. The target forcing values f_target(t) are obtained from the demonstrated trajectory (here, extracted from the RGB-D video) by solving the DMP equation for the forcing term given the recorded positions, velocities, and accelerations.
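A numpy sketch of fitting the forcing-term weights by LWR follows. The phase constant, basis count, width heuristic, and target profile are illustrative assumptions:

```python
import numpy as np

def rbf(x, centers, widths):
    # Gaussian basis activations psi_i(x); shape [T, n_basis]
    return np.exp(-widths * (x[:, None] - centers[None, :]) ** 2)

def fit_weights(x, f_target, centers, widths):
    # per-basis closed-form LWR solution of
    #   min_w  sum_t psi_i(x_t) * (f_target(t) - w_i * x_t)^2
    psi = rbf(x, centers, widths)
    num = psi.T @ (x * f_target)
    den = psi.T @ (x ** 2) + 1e-12
    return num / den

def forcing(x, w, centers, widths):
    # normalized, phase-modulated mixture of basis functions
    psi = rbf(x, centers, widths)
    return (psi @ w) / (psi.sum(axis=1) + 1e-12) * x

T, n_basis, alpha = 500, 20, 3.0
t = np.linspace(0.0, 1.0, T)
x = np.exp(-alpha * t)                        # phase decays from 1 toward 0
centers = np.exp(-alpha * np.linspace(0, 1, n_basis))
widths = n_basis ** 1.5 / centers             # narrower bases where centers are dense

f_target = (1.0 - x) * x                      # hypothetical target forcing profile
w = fit_weights(x, f_target, centers, widths)
f_fit = forcing(x, w, centers, widths)
```

The fitted forcing term closely tracks the target profile; each weight is computed independently in closed form, which is what makes LWR cheap compared with a joint least-squares fit.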





****************************************************************************************
****************************************************************************************




Answer to Question 5-4


A standard DMP (Dynamic Movement Primitive) is typically learned from a single demonstration: the forcing term is fitted (e.g. by locally weighted regression) to reproduce one trajectory, and the resulting attractor dynamics then generalize to new start and goal positions, velocities, and durations (Ijspeert et al., 2013). A single deterministic DMP fitted to five demonstrations can therefore only capture something like their average trajectory: because the demonstrations will generally differ in speed, path shape, and timing, one set of forcing-term weights cannot represent all of them, and the information about the variability across demonstrations is lost.

To exploit multiple demonstrations properly, a probabilistic representation is needed. One approach is to learn the forcing-term parameters for each demonstration and model their distribution, for example with a Gaussian mixture model, from which a representative DMP can be derived. Another is to use an inherently probabilistic trajectory representation that captures the mean and covariance of the demonstrated trajectories and can be conditioned on new situations.

In summary, a classical DMP is designed to be learned from one demonstration; five demonstrations can be reduced to a single DMP only by averaging, while preserving the demonstrated variability requires a probabilistic extension.

References:

1. Ijspeert, A. J., Nakanishi, J., Hoffmann, H., Pastor, P., & Schaal, S. (2013). Dynamical movement primitives: Learning attractor models for motor behaviors. Neural Computation, 25(2), 328-373.





****************************************************************************************
****************************************************************************************




Answer to Question 5-5


To answer this question, I would choose the Dynamic Movement Primitives (DMPs) model to learn the demonstrated pouring action. DMPs are a type of movement primitive that can model complex movements and adapt to new situations, making them suitable for the task at hand.

A key consideration here is the via-point. Because it lies far away from the distribution of the demonstrated trajectories, a model that can only interpolate within that distribution would struggle to reach it; a DMP, however, can be adapted at execution time, for example by inserting the via-point as an intermediate goal or by adding a coupling term, so the robot can move around the obstacle while still reproducing the pouring action.

DMPs consist of a set of differential equations that describe the movement of the robot's end-effector over time: a stable spring-damper system driven by a learned non-linear forcing term. The forcing term can be learned from the demonstrated trajectories using techniques such as locally weighted regression or Gaussian mixture models. Once the DMP has been learned, it can generate new trajectories that reproduce the demonstrated movement while avoiding obstacles or satisfying other constraints.
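The differential equations above can be integrated numerically as in the following sketch. It assumes a single degree of freedom, illustrative gain values, and Euler integration; none of these specifics come from the question:

```python
import numpy as np

def integrate_dmp(y0, g, w, centers, widths, tau=1.0, alpha=25.0, beta=6.25,
                  alpha_x=3.0, dt=0.001, T=1.0):
    """Integrate a single-DoF discrete DMP with Euler steps and return y(t)."""
    y, dy = float(y0), 0.0
    x = 1.0  # canonical phase variable, decays from 1 towards 0
    traj = []
    for _ in range(int(T / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)          # Gaussian basis activations
        f = x * (g - y0) * (psi @ w) / (psi.sum() + 1e-10)  # learned forcing term
        ddy = (alpha * (beta * (g - y) - dy) + f) / tau     # transformation system
        dy += ddy * dt
        y += dy * dt
        x += -alpha_x * x / tau * dt                        # canonical system
        traj.append(y)
    return np.array(traj)
```

With all weights set to zero the forcing term vanishes and the system is a pure critically damped spring-damper, so the trajectory converges to the goal `g`; the learned weights shape the transient so that it reproduces the demonstration.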

In summary, I would choose DMPs to model the demonstrated pouring action because of their ability to handle via-points and adapt to new situations. This would allow the robot to avoid obstacles while still accurately reproducing the pouring action.





****************************************************************************************
****************************************************************************************




Answer to Question 5-6


Cognitivist cognitive architectures are based on the idea that cognition can be understood as computations on symbolic representations. These architectures typically have a modular structure, with separate components for perception, memory, attention, and action.

In contrast, emergent cognitive architectures are based on the idea that cognition emerges from the interactions of simple, distributed processes. These architectures typically do not have a modular structure, and instead rely on the emergence of global patterns of activity from local interactions.

A hybrid cognitive architecture is one that combines elements of both cognitivist and emergent approaches. For example, a hybrid architecture might have a modular structure with separate components for perception, memory, attention, and action, but also incorporate distributed, emergent processes within those components.

An example of a cognitivist cognitive architecture is ACT-R (Adaptive Control of Thought-Rational), which is a production rule system that uses symbolic representations to model cognition. An example of an emergent cognitive architecture is the Subsumption Architecture, which is a reactive control system that uses simple, distributed processes to control behavior. An example of a hybrid cognitive architecture is the LIDA (Learning Intelligent Distribution Agent) architecture, which combines a modular structure with emergent processes to model cognition.

Schematically, a cognitivist architecture can be drawn as separate boxes (the modules) with arrows representing the flow of information between them; an emergent architecture as a network of nodes whose edges represent the local interactions between processes; and a hybrid architecture as modular boxes connected by edges that represent the emergent processes within and between them.


In summary, cognitivist cognitive architectures are based on the idea that cognition can be understood as computations on symbolic representations, emergent cognitive architectures are based on the idea that cognition emerges from the interactions of simple, distributed processes, and hybrid cognitive architectures combine elements of both cognitivist and emergent approaches.





****************************************************************************************
****************************************************************************************




Answer to Question 5-7


a) The forgetting mechanism given by $\alpha_i(t)$ is a time-based decay: each rehearsal of item $i$ at time $t_k$ contributes a Gaussian kernel centred at $t_k$, so an item's activation rises when it is rehearsed and decays as time passes. The parameter $\beta_i$ weights the base importance of item $i$ in the robot's memory, and $d$ is the variance of the Gaussian kernels, which controls how quickly the activation decays.

b) The equations for calculating $\alpha_{i_1}$, $\alpha_{i_2}$ and $\alpha_{i_3}$ at $t=3$ are as follows:

$\alpha_{i_1}(3) = \beta_{i_1} \cdot (r_{i_1,1} \cdot \mathcal{N}(\mu = 1, \sigma^2 = d)(3) + r_{i_1,3} \cdot \mathcal{N}(\mu = 3, \sigma^2 = d)(3))$

$\alpha_{i_2}(3) = \beta_{i_2} \cdot (r_{i_2,2} \cdot \mathcal{N}(\mu = 2, \sigma^2 = d)(3) + r_{i_2,3} \cdot \mathcal{N}(\mu = 3, \sigma^2 = d)(3))$

$\alpha_{i_3}(3) = \beta_{i_3} \cdot (r_{i_3,3} \cdot \mathcal{N}(\mu = 3, \sigma^2 = d)(3))$

Assuming equal importance weights $\beta$ and equal rehearsal weights $r$, the activations at $t=3$ are ordered $\alpha_{i_2}(3) \geq \alpha_{i_1}(3) \geq \alpha_{i_3}(3)$: all three items share the peak contribution from their rehearsal at $t=3$, item $i_2$ gains an additional contribution from the nearby rehearsal at $t=2$, while item $i_1$ gains only a smaller one from the more distant rehearsal at $t=1$, and item $i_3$ has no additional contribution at all.
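The ordering can be checked numerically. The values $\beta = 1$, $r = 1$, and $d = 1$ below are illustrative assumptions, since the question's actual weights are not restated in this answer:

```python
import math

def normal_pdf(x, mu, var):
    """Density of a Gaussian N(mu, var) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Assumed values (not given in this answer): beta = 1, r = 1, variance d = 1.
d = 1.0
alpha_1 = normal_pdf(3, 1, d) + normal_pdf(3, 3, d)  # rehearsals at t=1 and t=3
alpha_2 = normal_pdf(3, 2, d) + normal_pdf(3, 3, d)  # rehearsals at t=2 and t=3
alpha_3 = normal_pdf(3, 3, d)                        # single rehearsal at t=3
```

Under these assumptions the computation yields $\alpha_{i_2}(3) > \alpha_{i_1}(3) > \alpha_{i_3}(3)$, since the kernel centred at $t=2$ contributes more at $t=3$ than the one centred at $t=1$.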





****************************************************************************************
****************************************************************************************




