Hi, this is Elena, and I'm going to be presenting our work, Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling.
So we're going to be covering what lexical borrowing is, the task that we proposed, the data set that we have released and some models that we explored. 
But to begin with, what is lexical borrowing, and why does it matter as an NLP task?
Well, lexical borrowing is basically the incorporation of words from one language into another language. 
For instance, in Spanish we use words that come from English. 
And here you have a few examples: words such as podcast or app. All of these are English words that we sometimes use in Spanish.
Lexical borrowing is a type of linguistic borrowing, which is basically reproducing in one language the patterns of another language.
Borrowing and code-switching have sometimes been compared and described as a continuum, code-switching being the thing that bilinguals do when they mix two languages at the same time.
There are however some differences between lexical borrowing and code switching. 
We're going to be focusing on lexical borrowing.
Code-switching is something that is done by bilinguals, and by definition the code-switches are not integrated into either of the languages in use, whereas lexical borrowing is something that is also done by monolinguals.
Borrowings will comply with the grammar of the recipient language,
and borrowings can eventually be integrated into the recipient language.
But why is borrowing an interesting phenomenon? 
Well, from the point of view of linguistics, borrowing is a manifestation of how languages change and how they interact.
And also lexical borrowings are a source of new words. 
Here you have some examples of lexical borrowings that have been incorporated into the Spanish language as new words. 
In terms of NLP, borrowings are a common source of out-of-vocabulary words, and in fact, automatically detecting lexical borrowings has proven to be useful for NLP downstream tasks such as parsing, text-to-speech synthesis, or machine translation.
There has been a growing interest in the influence of English on other languages, particularly related to English lexical borrowings, which have sometimes been called anglicisms,
and here you have some examples of work on automatic detection of borrowings in some of these languages. 
The task that we propose is to detect unassimilated lexical borrowings in Spanish newswire,
which means that we are interested in extracting words borrowed from other languages that are being used in Spanish newspapers but that have not been integrated or assimilated into the recipient language.
That is, they are not yet integrated into Spanish.
Here you have an example. 
This is a sentence in Spanish: "Las prendas bestsellers: estampados de motivos florales, animal print y patchwork",
and as you can see, there are three spans of text which are actually English words: bestsellers, animal print, and patchwork.
These are the type of spans that we are interested in extracting and detecting. 
There has been previous work on anglicism detection, which consisted of a CRF model for anglicism detection on Spanish newswire.
That model achieved an F1 score of 86.
But there were some limitations both in the data set and the modeling approach. 
The dataset focused exclusively on one source of news and consisted only of headlines,
and there was also an overlap between the borrowings that appeared in the training set and the test set.
This prevented assessing whether the modeling approach could actually generalize to previously unseen borrowings.
So our aim is to tackle some of these limitations of the task.
So, to begin with, we created a new dataset annotated with lexical borrowings, and the aim was to create a test set that was as difficult as possible,
so that there would be minimal overlap in words and topics between the training set and the test set.
As a result, the test set comes from sources and dates that were not seen in the training set.
Here you can see that there is no overlap in the time periods.
The test set is also very borrowing-dense:
just to give you some numbers, while the training set contains 6 borrowings per thousand tokens, the test set contains 20 borrowings per thousand tokens.
The test set contained as many out of vocabulary words as possible. 
In fact, 92% of the borrowings in the test set are OOV,
so they were not seen during training. 
The corpus consists basically of a collection of texts that came from different sources of Spanish newspapers.
It was annotated by hand using two tags:
one for English lexical borrowings, which are the majority of lexical borrowings in Spanish, and the label OTHER for borrowings from other languages.
We used the CoNLL format and BIO encoding, so that we could encode single-token borrowings such as app as well as multi-token borrowings such as machine learning.
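For readers unfamiliar with this annotation scheme, here is a minimal sketch of BIO encoding with the two labels. The sentence and the spans are invented for illustration; they are not taken from the released corpus.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled spans into BIO tags.

    tokens: list of tokens in the sentence.
    spans:  list of (start, end, label) with end exclusive,
            e.g. label "ENG" for English borrowings, "OTHER" otherwise.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"               # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"               # continuation tokens
    return tags

# Single-token borrowing ("app") and multi-token borrowing ("machine learning").
tokens = ["Descargue", "la", "app", "de", "machine", "learning"]
tags = spans_to_bio(tokens, [(2, 3, "ENG"), (4, 6, "ENG")])
# tags == ["O", "O", "B-ENG", "O", "B-ENG", "I-ENG"]
```

The B/I distinction is what lets a tagger recover span boundaries even when two borrowings are adjacent.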
These are the numbers of the corpus.
As you can see, it amounts to roughly 370,000 tokens,
and here you have the number of spans that were labeled as English, the spans that were labeled as other borrowings, and how many of them were unique.
Here you have a couple of examples from the dataset.
As you can see, in the first example we have the borrowing batch cooking, which is a multi-word borrowing,
and we have annotated it using the BIO encoding.
The O tag was used for words that were not borrowed, that is, regular Spanish words.
And here in this second example you have benching and crash which are also labeled as borrowings from English. 
So once we had the data set, we explored several models for the task of extracting and detecting these lexical borrowings. 
The first one that we tried was a conditional random field (CRF) model.
This was the model that had been used in previous work,
and we used the same handcrafted features from that work.
As you can see, these are the features. 
These are binary features, such as: is the word or token in uppercase?
Is it in title case?
Is it a quotation mark?
Things like that; the type of features that one would expect in a named entity recognition task.
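As an illustration, binary surface features of this kind could be extracted as below. This is a minimal sketch: the feature names are mine, and the exact feature set is the one from the prior work.

```python
def token_features(tokens, i):
    """Handcrafted binary features for token i, in the style used by
    CRF sequence taggers (feature names here are illustrative)."""
    tok = tokens[i]
    return {
        "is_upper": tok.isupper(),                 # e.g. "NBA"
        "is_title": tok.istitle(),                 # e.g. "Netflix"
        "is_quote": tok in {'"', "'", "«", "»"},   # quotation marks
        "has_digit": any(c.isdigit() for c in tok),
        "BOS": i == 0,                             # beginning of sentence
        "EOS": i == len(tokens) - 1,               # end of sentence
    }

tokens = ["Las", "prendas", "bestsellers"]
print(token_features(tokens, 2))
```

A CRF toolkit such as sklearn-crfsuite consumes exactly this kind of per-token feature dictionary.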
These are the results that we got. 
We obtained an F1 score of 55 using the CRF model with handcrafted features,
which is a huge difference compared to the reported F1 score of 86, the result obtained with the same CRF model and the same features but on a different dataset, also for Spanish lexical borrowing detection.
This proves that the dataset that we created is more difficult, and that we needed to explore more sophisticated models for this task.
So we tested two transformer-based models.
We used BETO, which is a monolingual BERT model trained for Spanish, and also multilingual BERT.
We used both models through the Transformers library by Hugging Face.
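One practical detail when fine-tuning BERT-style models on this task is aligning the word-level BIO labels with the subword pieces produced by the tokenizer. A minimal sketch, using an invented subword split rather than a real tokenizer:

```python
def align_labels(word_labels, word_pieces):
    """Spread word-level BIO labels over subword pieces: the first piece
    keeps the word's tag, continuation pieces of a B-X word get I-X."""
    piece_labels = []
    for label, pieces in zip(word_labels, word_pieces):
        piece_labels.append(label)
        cont = "I-" + label[2:] if label != "O" else "O"
        piece_labels.extend([cont] * (len(pieces) - 1))
    return piece_labels

# "bestsellers" hypothetically split into three wordpieces.
word_pieces = [["las"], ["prendas"], ["best", "##sell", "##ers"]]
labels = align_labels(["O", "O", "B-ENG"], word_pieces)
# labels == ["O", "O", "B-ENG", "I-ENG", "I-ENG"]
```

In a real pipeline the split would come from the model's own tokenizer, and continuation pieces are often instead masked out of the loss.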
These are the results that we got. 
As you can see, multilingual BERT performed better than BETO, both on the development set and on the test set, and across all metrics.
Just to give an idea of the difference: the CRF model obtained an F1 score of 55, whereas multilingual BERT obtained 82, which is a big difference.
Once we had those results, we asked ourselves another question, which is: could we take a BiLSTM-CRF model, feed it with different types of embeddings, embeddings that encode different types of linguistic information, and outperform the results obtained by the transformer-based models?
In order to do so, we ran some preliminary experiments with this BiLSTM-CRF model using the Flair library,
and we experimented with different types of embeddings, such as transformer-based embeddings but also fastText, character embeddings, and so on.
What we found was that transformer-based embeddings performed better than non-contextualized embeddings, that the combination of English BERT and Spanish BETO embeddings outperformed multilingual BERT embeddings,
and that BPE embeddings produced better F1 while character embeddings produced better recall.
With that in mind, these were the best performing results that we got. 
Both models were BiLSTM-CRF models using Flair.
One was fed with BETO and BERT embeddings plus BPE embeddings, and the other with BETO and BERT embeddings, BPE embeddings, and also character embeddings.
This last one was the one that produced the highest F1 score on the test set, although the highest score on the development set was obtained by the one without character embeddings.
Bear in mind that the best result that we got with multilingual BERT was an F1 of 76 on the development set and 82 on the test set,
So this is an improvement compared to those results. 
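The embedding stacking behind these BiLSTM-CRF variants amounts to concatenating each token's vectors from the different sources before they enter the BiLSTM. A dimensionality-only sketch; the sizes below are illustrative, not the ones from the experiments:

```python
def stack_embeddings(*per_source_vectors):
    """Concatenate one token's vectors from several embedding sources
    (e.g. BETO, BERT, BPE, character embeddings) into a single vector."""
    stacked = []
    for vec in per_source_vectors:
        stacked.extend(vec)
    return stacked

beto_vec = [0.1] * 768   # monolingual Spanish BERT (BETO)
bert_vec = [0.2] * 768   # English BERT
bpe_vec  = [0.3] * 300   # BPE subword embeddings
char_vec = [0.4] * 50    # character-level embeddings

token_vec = stack_embeddings(beto_vec, bert_vec, bpe_vec, char_vec)
# one 1886-dimensional input vector per token for the BiLSTM
```

Flair's StackedEmbeddings class performs exactly this concatenation over its component embedding objects.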
Finally, we asked ourselves another question, which was: can lexical borrowing detection be framed as transfer learning from language identification in code-switching?
So we ran the same BiLSTM-CRF model that we had run using Flair, but instead of using the unadapted transformer-based BETO and BERT embeddings, we used code-switching embeddings.
What are code-switching embeddings?
Well, these are transformer-based embeddings that have been fine-tuned for language identification on the Spanish-English section of the LinCE code-switching dataset.
LinCE is a benchmark on code-switching that has a section on Spanish-English code-switching.
So we fed our BiLSTM-CRF with code-switching embeddings and optionally character embeddings, BPE embeddings, and so on.
The best result that we got was 84.22, which is the highest across all the models that we tried on the test set. 
although the best F1 score that we got on the development set, which was 79, was lower than the best result obtained by the BiLSTM-CRF fed with unadapted embeddings.
So some conclusions from our work. 
We have produced a new dataset of Spanish newswire that is annotated with unassimilated lexical borrowings.
This data set is more borrowing dense and rich than previous resources. 
We have explored four types of models for lexical borrowing detection. 
In terms of error analysis, well recall was a weak point for all models 
As you can see, some frequent false negatives include uppercase borrowings and words that exist both in English and Spanish, for instance.
Also, interestingly, BPE embeddings seem to improve the F1 score
and character embeddings seem to improve recall,
which is an interesting finding that perhaps we can explore in future work.
Well, this is everything that I have. 
Thank you so much for listening. 
My name is Antoine Nzeyimana.
I'm a PhD student at the University of Massachusetts Amherst. 
I am presenting our paper, KinyaBERT: a Morphology-aware Kinyarwanda Language Model.
Today I'll talk about the motivation for this research. 
Then I'll present the KinyaBERT model architecture in detail.
I'll then talk about our experimental results, then finish with some conclusions. 
We all know that recent natural language processing advances have been made possible by the use of pretrained language models such as BERT.
However there are still a number of limitations. 
Due to the complex morphology expressed by most morphologically rich languages, the ubiquitous byte pair encoding (BPE) tokenization algorithm that is used cannot extract the exact subword lexical units, meaning the morphemes, which are needed for effective representation.
For example, here we have three Kinyarwanda words that have several morphemes in them, but the BPE algorithm cannot extract them.
This is because some morphological rules produce different surface forms that hide the exact lexical information, and BPE, which is based solely on the surface forms, does not have access to this underlying lexical information.
The second challenge is that even if one had access to an Oracle morphological analyzer, replacing BPE tokens with morphemes is not enough to express the morphological compositionality. 
A third gap in research is that new pretrained language models are most often evaluated on high-resource languages.
And we need to assess their applicability on low resources and diverse languages as well. 
Therefore, we present KinyaBERT, which is a simple but effective adaptation of the BERT architecture that is meant to more effectively handle morphologically rich languages.
We evaluate KinyaBERT on Kinyarwanda, a low-resource, morphologically rich language which is spoken by more than 12 million people across Eastern and Central Africa.
The input to the model is either a sentence or a document. 
For example, here we have a Kinyarwanda sentence which means "We were surprised to find John there."
As you can see, each Kinyarwanda word contains several morphemes that carry different pieces of information.
Therefore, in our model, we pass this sentence or a document to a morphological analyzer. 
which then generates the morphemes contained in each of the words.
The morphemes usually are made of a stem and zero or more affixes. 
The affixes may indicate tense, aspect, subjects, or objects in verbs, and most often relate to the Bantu noun classes for subjects and objects.
The morphological analyzer also produces a part of speech tag for each of the words. 
After this step, we make embeddings for the part-of-speech tags,
embeddings for the affixes,
and embeddings for the stem.
These are the morphology-level embeddings.
We then pass these embeddings through a morphology encoder, which is a small transformer encoder that is applied to each word independently.
The outputs are vectors that are contextualized with the morphological information at each word.
Now we perform composition where the morphological embeddings corresponding to part of speech and stem are concatenated together. 
We further concatenate them with another stem embedding at the sentence level.
This forms the input to the main sentence or document encoder.
The final outputs are contextualized embeddings that can be used for downstream NLP tasks.
For the morphological analyzer, we use finite-state two-level morphology principles, with a custom implementation that is tailored to the Kinyarwanda language.
We effectively model the morphology of all Kinyarwanda words, including verbs, nouns, demonstrative and possessive pronouns, numerals, and others.
We use an unsupervised part of speech tagging algorithm. 
A first-order factored model is used to account for the morphological probability, basically the probability that is assigned by the morphological analyzer.
We also take into consideration the part of speech tag precedence as well as the syntactic agreements that are present in the input words. 
The part-of-speech tagger uses a bidirectional inference, which improves upon the more often used Viterbi algorithm for decoding.
A few remarks here on positional encoding.
First, the morphology encoder does not use any positional encoding.
This is because each of the morphemes occupies a known slot in the morphological model. 
Therefore, positional information is inherent when the morphemes are given. 
Second, the sentence encoder uses the so-called untied relative positional embeddings, which have been recently published at the ICLR conference.
These positional embeddings essentially disentangle positional correlations from the token-to-token attention computation.
Similar to BERT, we use a masked language model pretraining objective.
Essentially, we have to predict both the stem and the affixes that are associated with the words. 
During pretraining, 15% of all words are considered for prediction, of which 80% are masked, 10% are swapped with random words, and 10% are left unchanged.
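The 15%/80/10/10 scheme above is the standard BERT masking recipe; here is a word-level sketch of it. A real implementation would operate on token IDs, and the corruption choices here simply use Python's random module.

```python
import random

def mask_words(words, vocab, rng):
    """Select ~15% of words for prediction; of those, 80% become [MASK],
    10% become a random word, and 10% are left unchanged."""
    corrupted, targets = list(words), {}
    for i, w in enumerate(words):
        if rng.random() < 0.15:
            targets[i] = w                      # model must predict this word
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original word unchanged
    return corrupted, targets

rng = random.Random(0)
words = ["we", "were", "surprised", "to", "find", "john", "there"] * 100
corrupted, targets = mask_words(words, vocab=["cat", "dog"], rng=rng)
print(len(targets) / len(words))  # roughly 0.15
```

In KinyaBERT the prediction targets are richer than a single token: both the stem and the affixes of each selected word must be predicted.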
For affix prediction, we face a multi-label classification problem.
For this, we either group affixes together into a fixed number of sets and predict the set as a class label,
or we predict the affix probability vector.
We evaluate both of these approaches in our experiments. 
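The two affix-prediction variants can be sketched as follows; the affix names and the normalization are my own illustration, not the paper's exact formulation:

```python
AFFIX_VOCAB = ["tense:past", "aspect:perf", "subj:cl1", "obj:cl2"]

def affix_set_label(affixes, set_to_id):
    """Variant 1: predict the whole affix set as a single class label."""
    key = frozenset(affixes)
    if key not in set_to_id:          # grow the inventory of observed sets
        set_to_id[key] = len(set_to_id)
    return set_to_id[key]

def affix_prob_vector(affixes):
    """Variant 2: regress a probability vector over individual affixes."""
    hot = [1.0 if a in affixes else 0.0 for a in AFFIX_VOCAB]
    total = sum(hot) or 1.0
    return [h / total for h in hot]   # normalized, distribution-like target

sets = {}
label = affix_set_label(["tense:past", "subj:cl1"], sets)
vec = affix_prob_vector(["tense:past", "subj:cl1"])
# vec == [0.5, 0.0, 0.5, 0.0]
```

Variant 1 turns the problem into ordinary softmax classification over sets, while variant 2 keeps per-affix granularity at the cost of a regression-style objective.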
We pretrain KinyaBERT on about 2.5 gigabytes of Kinyarwanda text and compare it to three baseline models.
One is a multilingual model, XLM-R, that is trained on a large text corpus made of multiple languages.
The other two baselines are pretrained on the same text, using either the byte pair encoding algorithm or morphological analysis without the two-tier transformer encoder architecture.
All models are configured in the base architecture, which is between about 100 and 110 million parameters, with KinyaBERT using the least number of parameters.
All models except the multilingual one are pretrained for 32,000 gradient updates with a batch size of 2,560 sequences.
We evaluate the pretrained models on three sets of tasks.
One is the GLUE benchmark, which has often been used for evaluating the effectiveness of pretrained language models.
We obtain our GLUE benchmark data by translating the original benchmark data into Kinyarwanda using Google Translate.
The second task is a Kinyarwanda named entity recognition benchmark, which is a high-quality dataset that was annotated by trained native speakers.
The third one is a news categorization task, where we pulled news articles from several websites, collected the categorization tags that were assigned by the authors, and then essentially try to predict the same categories.
And now we go to the results. 
For the GLUE benchmark, we find that KinyaBERT consistently outperforms the baseline models.
Here we show the average performance for 10 fine tuning runs. 
We also ran a user evaluation of the translations that are produced by Google Translate.
Essentially, users rated about 6,000 examples, assigning scores on a scale from one to four, assessing the quality of the translations.
The result is that many translations were noisy. 
But all models had to cope with the same translation noise, and the relative performance between the models is still important to notice.
For the named entity recognition task, we also find that KinyaBERT gives the best performance, with the affix distribution regression variant performing best.
These results are also averages of 10 fine tuning runs. 
For the news categorization task, we find mixed results.
Previous work on text classification for Kinyarwanda had found that simple keyword detection is mostly enough for solving this specific task.
Therefore, there is less gain from using pretrained language models on this particular task of news categorization.
We also conducted an ablation study to see if there are alternative structures that improve performance. 
For the GLUE benchmark, we find that using affix sets consistently performs better, while the affix probability regression objective yields the best performance on named entity recognition.
Also, by looking at the loss curves during fine-tuning, we find that KinyaBERT has better convergence in most cases.
So, to conclude, this work has demonstrated the effectiveness of explicitly using morphological information in pretrained language models.
The proposed two-tier transformer encoder architecture enables capturing morphological compositionality, which is an important aspect of morphologically rich languages.
These findings should motivate further research into morphology-aware pretrained language models.
Hello, my name is Michał Pietruszka, and it is my pleasure to present to you the paper titled Sparsifying Transformer Models with Trainable Representation Pooling.
This is work done at Applica in cooperation with Łukasz Borchmann and Łukasz Garncarek.
Let me start with the problems our work targets.
Our method works well for the cases where long inputs are considered. 
Roughly speaking, it is meant for tasks where an input of over 2,000 tokens is considered and the targets are shorter than the provided inputs.
This has some specific applications in NLP. 
For example, one can imagine that given a long document, there is a need to summarize it, classify it, answer a question about it, or extract information or some key phrases.
We all know the vanilla transformer and the issue of its attention complexity, which depends on the square of the input length.
In a vanilla transformer, the full attention connectivity, the relations of each token to every other token, has to be calculated.
The computational complexity of attention thus depends on the number of layers L, the sequence length n, another factor of the sequence length n, and the dimensionality of the representations.
Similarly for the decoder's cross-attention, see this picture on the right side; the only difference here is that the target tokens are attending to the input tokens,
which can also be seen in this formula.
The blue square represents the relations that have to be calculated.
In the case of full attention, we need to calculate every relation within the input sequence.
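To make the quadratic dependence concrete, here is a back-of-the-envelope comparison of full attention against blockwise attention; the numbers and the cost functions are purely illustrative.

```python
def full_attention_cost(L, n, d):
    """Full self-attention: every token attends to every token,
    so the cost scales as L * n^2 * d."""
    return L * n * n * d

def blockwise_attention_cost(L, n, d, m):
    """Blockwise self-attention: tokens only see their block of size m,
    replacing one factor of n with the constant m: L * n * m * d."""
    return L * n * m * d

L, n, d, m = 12, 8000, 64, 512
print(full_attention_cost(L, n, d) / blockwise_attention_cost(L, n, d, m))
# ratio is n / m: blockwise needs 15.625x fewer operations here
```

Doubling the input length doubles the blockwise cost but quadruples the full-attention cost, which is exactly why long inputs are problematic.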
Now we see what happens when we have a blockwise encoder that works by limiting the tokens connectivity so that they can only see other nearby tokens. 
The text is read in chunks, which reduces the number of computations on the encoder side, but does not improve the decoder's cross-attention, as every input token is passed to the decoder anyway.
This method is often referred to as fusion-in-decoder.
The improvement here can be interpreted as changing one of the dependencies on n to a constant m representing the block size.
Our key observation is that most tokens are irrelevant for a wide variety of tasks and can be almost completely disregarded, as exemplified on this slide.
Only some parts of the input are relevant to the desired output.
For example, one can read an article once, marking the most important parts with a highlighter, and then produce a summary based only on the parts from this middle stage.
The cost of highlighting, of deciding whether the current token is essential for producing the summary, is thus cheap and depends only on the token's representation.
The pooling of the highlighted tokens is possible thanks to our top-k operator, and its cost is negligible.
The cost of producing a summary from a shortened input is also much lower than in the vanilla model, where the whole input is considered.
But here's a question. 
How to select important tokens and back propagate gradients to that selection? 
The essential underlying problem that we solve is to propose the trainable selection mechanism, 
one that can allow for gradient to be back propagated during the training so that the network can learn to select the most important tokens. 
More precisely: given some embeddings and scores obtained from a simple linear layer, the task is to return the highest-scoring embeddings.
First, the sequence is permuted and pairs are prepared so that a higher-scoring vector is paired with a lower-scoring one.
Next, weights are calculated using a softmax over the scores.
After each tournament round, new vectors and scores are composed as a linear combination of those pairs with the obtained weights.
So in short, we combine them linearly by performing a softmax over their scores.
While combining two tokens, some noise is introduced,
but this also allows the gradients to be propagated to all input embeddings.
In short, the trainable top-k we propose is based on performing a tournament-like soft selection at each step.
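Under my reading of the slides, the tournament-style soft selection can be sketched as follows. This is a simplified plain-Python version that assumes the sequence length is a power of two; the real operator works on batched tensors with learned scores.

```python
import math

def soft_topk(vectors, scores, k, temp=1.0):
    """Tournament-like soft top-k: pair the i-th highest-scoring vector
    with the i-th lowest, merge each pair using softmax weights over the
    pair's scores, and repeat until only k vectors remain."""
    vecs, scs = list(vectors), list(scores)
    while len(vecs) > k:
        order = sorted(range(len(vecs)), key=lambda i: scs[i], reverse=True)
        half = len(vecs) // 2
        new_vecs, new_scs = [], []
        for hi, lo in zip(order[:half], reversed(order[half:])):
            m = max(scs[hi], scs[lo])                 # numerical stability
            wa = math.exp((scs[hi] - m) / temp)
            wb = math.exp((scs[lo] - m) / temp)
            wa, wb = wa / (wa + wb), wb / (wa + wb)   # softmax over the pair
            new_vecs.append([wa * a + wb * b for a, b in zip(vecs[hi], vecs[lo])])
            new_scs.append(wa * scs[hi] + wb * scs[lo])
        vecs, scs = new_vecs, new_scs
    return vecs

# With a sharp (low) temperature the soft selection approaches hard top-k,
# while gradients can still flow to every input embedding.
vectors = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 1.0]]
out = soft_topk(vectors, scores=[4.0, 3.0, 2.0, 1.0], k=2, temp=1e-3)
# out is approximately [[1, 0, 0, 0], [0, 1, 0, 0]]
```

Because every output is a convex combination of inputs, the selection is differentiable end to end, unlike a hard argmax.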
From a different perspective, the representation pooling follows the encoder layer.
First each representation is scored and then only those with the highest scores are passed to the next layer. 
Encoding can be performed as in the standard transformer architecture, on the full-length input.
It is, however, possible to process the text in blocks of fixed length and globally select the best representations.
Here is an example of the representation pooling introduced after the encoder.
This directly influences the cost of cross-attention, which depends not on the input length n but on the constant k representing the pooled length.
This constant informs how many representations are selected and passed to the decoder.
Producing a summary from a shorter text is significantly cheaper than in previous solutions.
The sequence length can be shortened by a large factor:
for example, we successfully used a k that was 16 or even 64 times smaller than the value of n in our experiments.
Please note that the beneficial impact of blockwise encoding on self-attention is sustained.
Remember that the computational cost of attention depends on the square of the input length.
Shortening the input earlier during the encoding process can thus significantly lower the costs.
For the Pyramidion model, we narrow down the size of the representation at the output of each chosen layer, leading to an exponential reduction of the computational cost as the encoding proceeds.
As you can see, the total computational cost of the full encoder here is less than two times the cost of the full-size first layer, when pooling is introduced early.
The sum of all purple squares is thus bounded not by something that grows with the number of layers L,
but by a constant C, which can be influenced by the placement of the pooling layers within the network.
Our improvements were benchmarked on 8,000-token-long inputs,
and the figure shows that when pooling is engaged, the best scalability for the network is achieved.
Here one can note that training a Pyramidion of 24 layers can be cheaper than training a two-layer vanilla transformer on such long inputs,
not to mention how easily a vanilla transformer can go out of memory for such a long input.
The qualitative comparison of our trained Pyramidion to other baselines is performed on the long-document summarization task, where, given the body of an article from arXiv or PubMed, the task is to generate its abstract.
As one can see, Blockwise, which is our baseline, performs on the level of recent state-of-the-art models, while the Pyramidion retains or improves on the performance of this competitive baseline.
At the same time, our model is 80% faster to train and over 450% faster at inference when compared to the Blockwise baseline.
Our models have a much lower parameter count and were trained from scratch on the chosen tasks.
Previous approaches, in contrast, had to use more parameters and leverage pretrained foundation models and additional language pretraining objectives to achieve a similar performance.
We invite you to read our full paper and use our GitHub code. 
Thank you for watching.
Hello, this is Jiawei Zhou from Harvard University.
I am very glad to present our work on online semantic parsing for latency reduction in task oriented dialogue. 
This is joint work with Jason, Michael, Anthony, and Sam from Microsoft Semantic Machines.
In task-oriented dialogue, a user interacts with a system that handles requests from user utterances, usually in speech.
From the end of the user utterance to the system response, there is often a noticeable delay.
Under the hood, the user utterance is translated into an executable program, 
which is then executed so that the system can respond properly. 
The program is represented as a semantic graph that outlines the computation, where a node represents a function invocation and its children are the arguments.
The gray nodes mark instantaneous operations, but the others are slow to execute.
Here we show a simple example; these programs can often be more complicated graphs beyond tree structures.
In this talk we ask the question, can we start generating the program and executing it before the user even finishes the utterance so that the faster response can be achieved by the system. 
This is an online prediction and decision problem. 
There are a lot of other problems in this realm.
Examples include simultaneous translation, where a live interpreter translates one language to another in real time; smart text auto-completion that guesses the user's intent; and Uber Pool, where drivers are sent to where they might be needed based on the predicted demand.
All of these scenarios have one thing in common, 
that is, it is beneficial to make decisions before seeing all the input. 
In our case, we are going to deal with online semantic parsing, which could be expected to be challenging as we have to guess what the user might say, 
and it is also underexplored with no formal evaluation metric. 
First, let's look at how an ordinary system works. 
It is operating offline by parsing to the program only at the end of the user utterance. 
Here the character graph is predicted after seeing all the information. 
In contrast, we propose an online system that parses at every utterance prefix.
For example, each time we see a new token, we predict a new graph. 
Notice that there could be errors:
at the prefix "at the pool party with Barack Obama", we get a graph with the red nodes for the person and the event subject, but with the wrong timing information.
This process goes on until we receive the full user utterance. 
How would this affect the execution timeline? In the offline system, 
we get the program graph at the end, so the system can only start execution at that point.
Remember that the gray nodes are fast operations, so we only consider the execution timeline of the colored slow functions.
First, these two FindPerson functions can be executed in parallel, highlighted in white in the pink box, as they have no dependency on other functions.
Next, the node CreateEvent can get executed after obtaining results from the lower-level nodes, and then the top function Yield, so the whole program is finished.
The execution process is strictly constrained by the program dependency structure, where some operations cannot be parallelized, which induces a noticeable delay.
In our online system where we predict as we go, the program execution can start earlier. 
Here, at the prefix ending after "Obama", we predict confidently that the FindPerson function should be in the program, but the rest may contain errors, as shown grayed out.
The execution of that node can be immediately started.
Then with more tokens we predict a totally new graph, but part of it is already being executed, 
so we only need to consider the rest of the nodes that we are confident about as well. 
Here, another FindPerson can be executed in parallel.
Again, we may have wrong predictions. 
With more text we have more ability to make it right, 
such as the event time here where AM is also anticipated correctly. 
Then we can start executing the rest following the program dependency structure. 
By overlapping the execution timeline with the utterance timeline, we save a big amount of time.
So we proposed the task of online semantic parsing. 
One underlying assumption is that the execution time dominates the model prediction time,
so we could only gain time by predicting earlier. 
Another assumption is that the prediction and execution happen in the background, invisible to users,
so we do not need to maintain a consistent parsing history,
and we will parse from scratch after each token.
In particular, we propose a two-step approach:
a propose step that predicts a graph with complete structure, and a select step that selects the nodes that are worth executing at this time.
We have two variants of the propose step.
The first approach combines language model completion with full-utterance-to-graph parsing.
In particular, the prefix ending after "Obama" is first completed through a fine-tuned BART language model and then translated into a program with the full offline parser.
The second approach directly predicts the program from user utterance prefixes. 
This is achieved by training a single online parser to translate each prefix into the gold graph. 
This encourages the model to learn the right anticipation. 
In a bit more detail, how do we generate these graphs? 
We formulate the problem as generating a serialized version of the graph. 
Each node or edge is represented by an action. 
Here we start from the first node. 
The number below records the absolute index in action history. 
Then we got the second node. 
Next is the edge between them. 
It contains a pointer to the index of a previous node and the edge label. 
"0" here means connecting the most recent node with the node generated by the zeroth action; then come the next node and the next edge. 
This process goes on until we generate the full graph. 
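The action-based serialization above can be sketched by replaying an action list into a graph: node actions create nodes, and each edge action carries a pointer (an absolute index into the action history) plus an edge label, connecting the most recent node to the node created by the pointed-to action. The function and edge names here are made up for illustration:

```python
# Replay a node/edge action sequence into (node labels, edges).
NODE, EDGE = "node", "edge"

def build_graph(actions):
    nodes, edges = [], []                    # nodes[i] = (action_index, label)
    for idx, act in enumerate(actions):
        if act[0] == NODE:
            nodes.append((idx, act[1]))
        else:                                # act = (EDGE, pointer, label)
            _, pointer, label = act
            # Resolve the pointer to the node created by that action.
            target = next(lbl for i, lbl in nodes if i == pointer)
            edges.append((nodes[-1][1], label, target))
    return [lbl for _, lbl in nodes], edges

actions = [
    (NODE, "FindPerson"),       # action 0: first node
    (NODE, "FindEvent"),        # action 1: second node
    (EDGE, 0, "attendee"),      # action 2: connect the most recent node
                                #           back to the node from action 0
]
print(build_graph(actions))
```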
The underlying model is based on a Transformer with a self-pointer mechanism, similar to a previous transition-based parser. 
After generating a complete graph, we obtain the action-level probabilities that correspond to different parts of the graph. 
We select confident subgraphs to be executed later on, based on a thresholding heuristic. 
We then vary the threshold to achieve different tradeoffs between the latency reduction and the execution cost. 
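The select step can be sketched as follows: a node is selected only if its probability clears the threshold and all of its dependencies are also selected, so the result is an executable subgraph. The node names, probabilities, and dependency structure are invented, and `deps` is assumed to be listed in topological order:

```python
# Threshold-based selection of an executable, confident subgraph.
def select_confident(node_probs, deps, threshold):
    selected = set()
    for node, parents in deps.items():       # assumed topological order
        if node_probs[node] >= threshold and all(p in selected for p in parents):
            selected.add(node)
    return selected

probs = {"FindPerson": 0.97, "FindEvent": 0.6, "CreateEvent": 0.9}
deps = {"FindPerson": [], "FindEvent": ["FindPerson"],
        "CreateEvent": ["FindEvent"]}
print(select_confident(probs, deps, 0.8))  # conservative: only FindPerson
print(select_confident(probs, deps, 0.5))  # aggressive: all three nodes
```

Note that with the high threshold, CreateEvent is excluded despite its 0.9 probability, because its parent FindEvent is not confident enough to execute.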
To formally evaluate the online methods, we propose the final latency reduction, or FLR, metric. 
Here's a recap of how an offline system finishes the execution timeline. 
In online systems, execution overlaps with the utterance timeline, so it ends earlier 
FLR is defined as the reduction in time compared to the offline system, marked by the end of the execution. 
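The idea behind the metric can be shown with a toy calculation: the offline system starts executing only after the full utterance, while the online system has already executed part of the program, so it finishes earlier. All the numbers below (token counts, per-call execution time, a simple chain of three dependent calls) are made up for illustration:

```python
# Toy final-latency-reduction (FLR) calculation, time measured in tokens.
def chain_finish(start, n_calls, exec_time):
    """Finish time of n_calls sequentially dependent calls starting at `start`."""
    return start + n_calls * exec_time

n_tokens, exec_time = 10, 4          # utterance length and per-call cost

# Offline: all 3 calls run after the utterance ends.
offline_finish = chain_finish(n_tokens, 3, exec_time)          # 22

# Online: 2 calls were anticipated and already finished at token 11,
# so only the last call remains once the utterance ends.
online_finish = chain_finish(max(n_tokens, 11), 1, exec_time)  # 15

flr = offline_finish - online_finish
print(flr)  # 7 tokens of latency saved
```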
We conduct experiments on two large conversational semantic parsing datasets, SMCalFlow and TreeDST. 
Our graph-based parser, when operating offline, achieves state-of-the-art parsing performance on both datasets. 
The language-model completion model also achieves a nontrivial BLEU gain compared with the simple baseline of no completion. 
Now let's look at the prediction accuracy of our prefix-to-graph parser. 
We measure the match F1 score between the generated graph and the gold graph on validation data, shown on the Y axis, for each prefix length on the X axis, represented as percentages of the full utterance. 
Each of these curves represents a different model, with the only difference being the training data. 
The bottom curve is the offline parser, and we mix in prefix data of different lengths to transition the model to an online 
parser. For example, the legend "prefix 80%+" means the model is trained with prefix data whose prefix length is larger than 80% of the full utterance length. 
The upper left corner is the desired area. 
As we can see, the offline parser (the black curve) does not do well on the prefix data. 
As we mix in more prefixes during training, the curve lifts up and to the left, performing better on all the prefix lengths. 
Meanwhile, the full-utterance parsing performance is not affected, as shown by the upper-right dot. 
Based on these strong results, how much latency do we reduce? 
We measure the time by the number of source tokens and simulate different function execution times. 
The curves show the tradeoff between the FLR metric and the execution cost, measured by the number of extra function executions that are not correct. 
This is achieved by varying the subgraph selection threshold. 
A higher threshold selects fewer functions by mistake but obtains a smaller FLR, whereas a lower threshold more aggressively selects and executes programs. 
We compare the two approaches we propose against a baseline that does nothing but directly apply the offline parser for online use. 
The upper-left region has the best FLR-and-cost tradeoff. 
We see both of our methods beat the baseline by a large margin, and they perform similarly on 
TreeDST. When individual function execution is faster, there tend to be more wrong executions and less room for latency reduction. 
When individual function execution is slower, there is more room for FLR improvement. 
Our two approaches achieve better performance in different tradeoff regions. 
Overall, we achieve 30 to 63% relative latency reduction depending on execution time and allowed cost. 
Finally, we have a breakdown of the average latency reduction in tokens for each type of function node, when the allowed cost is 3 wrong executions. 
As we can see there are gains all over the board. 
There are also some functions on which we gain impressive latency reduction, where the red bar is much longer, such as FindManager and Recipient. 
These are low level functions that do not have much dependency on others. 
In conclusion, we propose online semantic parsing as a new task to explore, with a rigorous latency reduction metric. 
With a strong graph-based semantic parser, we achieve considerable latency reduction, either through our pipeline approach with language-model completion and a full parser, or directly through a parser learned on the prefixes. 
Moreover, our approach can serve as a general framework and can be applied to other executable semantic representations. 
Future work could explore smarter methods for integrating prediction and execution. 
Thanks for listening. 
Hi, 
I'm going to discuss our work on generating retrieval-augmented counterfactuals for question answering tasks. 
This is work done during my internship at Google Research, where I was mentored by Matthew Lamm and Ian Tenney. 
To motivate the task, let me begin by defining a counterfactual. 
In this work, we define a counterfactual as a perturbation of the input text that differs in some meaningful, controlled way from the original text 
and allows us to reason about the changes in the outcome or the task label, 
for instance, changing the words "fascinating" and "captivating" to "as expected" and "mind-numbing" changes the sentiment of this movie review. 
Similarly, adding the qualifier women's to the question changes the answer to the question in the example below. 
Humans are typically robust to such perturbations, compared to NLP models trained on the task. 
Why is that? 
The data set may be sampled with systematic biases that lead to a simple decision boundary that is violated by the counterfactual, 
as shown in this 2D classification problem. 
But prior work has found that adding counterfactual examples to the training data can make the model robust to such perturbations. 
So if counterfactuals are valuable, how can we generate them? 
This task is especially hard for NLP; here are three examples from three different NLP tasks. 
As you will see, examples that violate the decision boundary between outcomes need to be very carefully crafted by perturbing some attributes of the text that are underlined here. 
This could be done by human annotation, but this is expensive and biased. 
Some prior work has focused on using syntax trees or semantic role labeling, 
but the set of perturbations generated by these techniques are limited by the semantic framework. 
More recent work has used masked language models to fill in masked portions of the text to change labels, 
but finding which parts of the text to perturb can be challenging. 
There are more challenges to generating counterfactuals for question answering. Specifically, 
this task requires background knowledge. 
For instance, to perturb the original question "Is Indiana Jones and the Temple of Doom a prequel?", 
we need to be aware of the other movies in the franchise to get to a question like "Is Indiana Jones: Raiders of the Lost Ark a prequel?". 
Furthermore, random perturbations can lead to questions that are not answerable with the available evidence or have false premises. 
Moreover, some question perturbations can lead to significant semantic drift from the original input, 
for instance this question: "Is Indiana Jones practicing child slavery in Temple of Doom?". 
We propose a very simple yet effective technique called Retrieve-Generate-Filter, or RGF, to tackle counterfactual perturbations of questions; it also aims to address all the other aforementioned challenges. 
The core intuition behind RGF is that the necessary background information that is needed to generate perturbations may be present in the near misses made by a question answering model. 
For instance, the state-of-the-art model REALM produces the following top-k answers to the question "Who is the captain of the Richmond Football Club?". 
While it does recover the original reference passage and answer, Trent Cotchin, as the topmost choice, 
It also retrieves additional passages and answers which can be used to guide question perturbation. 
For instance, it recovers 2 more answers corresponding to the captains of the reserve team and the women's team of the same club, and this can lead to interesting edits. 
To summarize, RGF first retrieves the top-k most relevant answers and contexts which don't match the reference answer and context. 
Following this step, a question generation model conditions on these alternate answers to generate questions that correspond to them, 
and finally we could filter the generated questions based on minimality or based on the type of semantic perturbation we are interested in introducing. 
Going over each step in greater detail: for retrieval, we use a retrieve-then-read model like REALM that takes as input the original question and a large corpus like Wikipedia. 
It consists of two modules. 
The retriever module performs similarity search over a dense index of passages to retrieve the topk most relevant passages to the question, 
and a reader module then extracts a span from each passage as a potential answer. 
REALM retrieves the gold passage and answer in most cases. 
However, in this work we are more interested in the answers and contexts that it retrieves further down the list. 
In the next step, question generation, we use these alternate answers and contexts to generate new questions that correspond to these alternatives. 
The question generation model is a pre-trained text-to-text Transformer that is fine-tuned on the Natural Questions (NQ) data to generate a question for an answer that is marked in context. 
During inference, we supply the question generation model with the alternative answer and context that we retrieved in the previous step. 
For example, for the query "Who is the captain of the Richmond Football Club?", REALM retrieves passages about the club's women's team, captained by Jess Kennedy, and the question generation model generates the query "Who captained Richmond Football Club's first ever women's team?", 
which is a specific semantic perturbation. 
In a similar fashion, we also get queries like "Who captained Richmond's VFL reserve team?" 
or "Who did Graham negate in the grand final last year?". 
Finally, we filter out a subset of the generated queries based on some desired characteristics. 
As motivated earlier, we would like to ensure that the new question is still semantically close to the original. 
As a filtering technique that doesn't require additional supervision, we simply retain new questions that have a small token-level edit distance from the original question. 
For example, we remove the question "Who did Graham negate in the grand final last year?" 
because it has a larger edit distance from the original question. 
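The minimality filter can be sketched with a standard token-level Levenshtein distance: keep only generated questions within a small edit distance of the original. The distance threshold (3 here) and the candidate questions are illustrative, not the paper's exact setting:

```python
# Keep generated questions close to the original in token edit distance.
def token_edit_distance(a, b):
    """Levenshtein distance over whitespace tokens, via dynamic programming."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ta
                           cur[j - 1] + 1,               # insert tb
                           prev[j - 1] + (ta != tb)))    # substitute
        prev = cur
    return prev[-1]

orig = "who is the captain of the richmond football club"
candidates = [
    "who is the captain of the richmond women's team",    # 2 edits: kept
    "who did graham negate in the grand final last year", # far away: dropped
]
kept = [q for q in candidates if token_edit_distance(orig, q) <= 3]
print(kept)
```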
In experiments, we demonstrate that this simple heuristic can be used to augment NQ training data. 
We also experiment with a filtering strategy that is based on the type of semantic perturbation. 
To this end, we use a general purpose query decomposition framework called QED. 
QED identifies two parts of the question: a predicate and references. 
References are noun phrases in the question that correspond to entities in the context. 
A predicate is basically the remaining portion of the question. 
For example, we are able to decompose the query "Who captained Richmond's first ever women's team?" into two references, "Richmond Football Club" and "women's team", and the predicate "who captained X". 
A model trained on reference-predicate annotations for NQ gives us this question decomposition. 
Decomposing both the original and the generated question with QED allows us to categorize our generated counterfactuals for evaluation. 
Specifically, we obtain two groups of questions, 
those that undergo a reference change while retaining predicates, and those that undergo a predicate change and optionally add references. 
For instance, "Who captained Richmond's VFL reserve team?" is a reference change, 
while "Who wears #9 for the club?" is a predicate change. 
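This categorization can be sketched from QED-style decompositions: the same predicate with different references is a reference change, and a different predicate is a predicate change. The decompositions below are hand-written stand-ins for the QED model's output:

```python
# Classify a perturbation by comparing (references, predicate) decompositions.
def categorize(original, generated):
    """Each argument is a (frozenset_of_references, predicate) pair."""
    (o_refs, o_pred), (g_refs, g_pred) = original, generated
    if g_pred != o_pred:
        return "predicate change"
    if g_refs != o_refs:
        return "reference change"
    return "no change"

orig = (frozenset({"Richmond Football Club"}), "who captained X")
ref_change = (frozenset({"Richmond Football Club", "women's team"}),
              "who captained X")
pred_change = (frozenset({"Richmond Football Club"}), "who wears #9 for X")
print(categorize(orig, ref_change))   # reference change
print(categorize(orig, pred_change))  # predicate change
```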
We now evaluate the effectiveness of RGF perturbations when augmented to training data. 
To evaluate the effectiveness of counterfactual augmentation in particular, we experiment with two strong data augmentation baselines. 
The first baseline, called random answer and question generation, adds data that has no relation with the original question. 
That is, passages and answers are simply randomly sampled from Wikipedia. 
This baseline basically adds more data that looks like NQ. 
In the second baseline, gold answer and question generation, we specifically ablate the retrieval portion of our method. 
Here alternate answers are just chosen from the same passage that contained the gold answer. 
How do the baselines and RGF augmentation perform on reading comprehension, where the model has access to both question and context? 
We experiment with six out-of-domain datasets and present results here, where the training data is doubled through augmentation. 
We find that both data augmentation baselines are unable to improve out-of-domain generalization. 
In fact, an ensemble of six models trained on the original data seems to be the most competitive baseline. 
Comparing against that baseline, we find that RGF counterfactuals are able to improve out of domain performance while maintaining in domain performance. 
This suggests that filling in the reasoning gaps of the model via counterfactual augmentation is more effective than adding more data from the training distribution. 
Furthermore, we find that using retrieval to sample alternative outcomes or answers is important for effective CDA. 
We also experiment with open domain QA setting where the model only sees the question and once again we evaluate on 4 out of domain datasets. 
We find that baseline models are not as effective for out-of-domain generalization. 
However, data augmentation with RGF shows more significant improvements. 
We even improve on the in-domain NQ dataset. 
We hypothesize that the counterfactual data augmentation aids the model in learning better query encodings for very similar queries. 
Finally, we also evaluate on the model's ability to improve consistency in the local neighborhood of the original question. 
Consistency measures the proportion of question pairs where both the original and the counterfactual query are correctly answered by the model. 
This explicitly helps us to measure the model's robustness to small perturbations in the neighborhood of the original input. 
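The consistency metric can be sketched as the fraction of (original, counterfactual) pairs where a model answers both questions correctly. The toy model predictions and gold answers below are invented:

```python
# Pairwise consistency: both questions in a pair must be answered correctly.
def consistency(pairs, predict):
    """pairs: list of ((question, gold), (cf_question, cf_gold)) tuples."""
    both_right = sum(predict(q1) == a1 and predict(q2) == a2
                     for (q1, a1), (q2, a2) in pairs)
    return both_right / len(pairs)

model = {"q1": "a", "q1'": "b", "q2": "c", "q2'": "wrong"}.get
pairs = [(("q1", "a"), ("q1'", "b")),   # both answered correctly
         (("q2", "c"), ("q2'", "d"))]   # counterfactual answered wrong
print(consistency(pairs, model))        # 0.5
```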
We experiment with five datasets which contain pairs of questions that are semantically close to each other. 
Apart from the three datasets that are already available, Aqua, AmbigQA, and the coref contrast sets, we also evaluate on RGF counterfactuals that are paired with the original NQ questions, based on whether they underwent a predicate change or a reference change. 
These subsets were annotated in-house to eliminate noise and are provided as a resource. 
All baselines are unable to significantly improve consistency, with the ensemble model improving consistency only by a small margin. 
However, RGF counterfactual augmentation has impressive gains in consistency both on prior data sets and the two subsets we curated for reference and predicate perturbations. 
Note that the augmented RGF data is not biased by perturbation type, only the evaluation sets are. 
In fact, a qualitative inspection of the kinds of counterfactuals generated shows that the generated questions contain several diverse perturbations. 
For instance, this original question on the population of Walnut Grove, MN is perturbed along different dimensions like town, state, and country, and along different predicates like location, poverty, and number of schools. 
RGF perturbations are context-specific: 
for example, for this other question about the Wimbledon singles tournament, the perturbation is along the type of game, the type of tournament, or the game outcome. 
As final takeaways: we tackle the task of counterfactual data augmentation and perturbation for information-seeking queries, and address its unique challenges via a retrieval-guided generation approach, overgenerating using near misses of the model and filtering based on perturbation type or minimality. 
We find that this technique requires no additional supervision, and the examples come labeled for augmentation; 
augmentation improves out of domain generalization and neighborhood consistency. 
And we find that RGF counterfactuals are semantically diverse without introducing bias during augmentation. 
Thank you. 
