Hi, this is Elena and I'm going to be presenting our work: Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling.
So we're going to be covering what lexical borrowing is, and the task that we proposed, the data set that we have released and some models that we explored.
But to begin with, what is lexical borrowing and why does it matter as an NLP task?
Well, lexical borrowing is basically the incorporation of words from one language into another language.
For instance, in Spanish we use words that come from English.
And here you have a few examples: words such as podcast, app, online, crowdfunding.
All these are English words that we sometimes use in Spanish.
Lexical borrowing is a type of linguistic borrowing, which is basically reproducing in one language the patterns of other languages.
Borrowing and code-switching have sometimes been compared and described as a continuum, code-switching being the thing that bilinguals do where they mix two languages at the same time.
There are however some differences between lexical borrowing and code-switching.
We're going to be focusing on lexical borrowing.
Code-switching is something that is done by bilinguals, and by definition the code-switches are not integrated into any of the languages in use, whereas lexical borrowing is something that is also done by monolinguals.
The borrowings will comply with the grammar of the recipient language.
And borrowings can eventually be integrated into the recipient language.
So why is borrowing an interesting phenomenon?
Well, from the point of view of linguistics, borrowing is a manifestation of how languages change and how they interact.
And also lexical borrowings are a source of new words.
Here you have some examples of lexical borrowings that have been incorporated into the Spanish language as new words.
In terms of NLP, borrowings are a common source of out-of-vocabulary words.
And in fact, automatically detecting lexical borrowings has proven to be useful for NLP downstream tasks such as parsing, text-to-speech synthesis or machine translation.
There has been a growing interest in the influence of English on other languages, particularly related to English lexical borrowings, which have sometimes been called Anglicisms.
And here you have some examples of work on automatic detection of borrowings in some of these languages.
So the task that we propose is to detect unassimilated lexical borrowings in Spanish newswire.
Which means that we are interested in extracting words borrowed from other languages that are being used in Spanish newspapers but that have not been integrated or assimilated into the recipient language.
So not yet integrated into Spanish.
Here you have an example.
This is a sentence in Spanish: Las prendas bestsellers se estampan con motivos florales, animal print o a retales tipo patchwork.
And as you can see, there are three spans of text which are actually English words: bestsellers, animal print and patchwork.
These are the type of spans that we are interested in extracting and detecting.
There has been previous work on Anglicism detection, which consisted of a CRF model for Anglicism detection on Spanish newswire.
This model achieved an F-one score of eighty-six.
But there were some limitations, both in the dataset and in the modeling approach.
The dataset focused exclusively on one source of news and consisted only of headlines.
And also there was an overlap in the borrowings that appear in the training set and the test set.
So, this prevented the assessment of whether the modeling approach could actually generalize to previously unseen borrowings.
So what we aim to do is tackle some of these limitations in the task.
So to begin with, we created a new data set.
We aimed at a new dataset that was annotated with lexical borrowings, and the aim was to create a test set that was as difficult as possible.
So there would be minimal overlap in words and topics between the training set and the test set.
And as a result, the test set comes from sources and dates that were not seen in the training set.
Here you can see that there's no overlap in time.
The test set is also very borrowing-dense.
Just to give you some numbers, if the training set contains six borrowings per thousand tokens, the test set contains twenty borrowings per thousand tokens.
The test set also contains as many out-of-vocabulary words as possible.
In fact, ninety percent of the borrowings in the test set are OOV.
So, they were not seen during training.
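To make those two numbers concrete, here is a minimal sketch of how borrowing density and the OOV share could be computed. The function names, the toy word lists and the choice to compare borrowings in lowercase are assumptions for illustration, not details taken from the released corpus.

```python
def borrowings_per_thousand(num_borrowings, num_tokens):
    """Borrowing density expressed as borrowings per 1,000 tokens."""
    return 1000 * num_borrowings / num_tokens


def oov_rate(test_borrowings, train_borrowings):
    """Share of test borrowings never seen among the training borrowings.

    Assumption: borrowings are compared case-insensitively.
    """
    seen = {b.lower() for b in train_borrowings}
    unseen = [b for b in test_borrowings if b.lower() not in seen]
    return len(unseen) / len(test_borrowings)


# Toy data, invented for illustration only.
train = ["app", "online", "podcast"]
test = ["crowdfunding", "app", "patchwork", "animal print"]

print(borrowings_per_thousand(6, 1000))  # 6.0
print(oov_rate(test, train))             # 0.75
```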
And the corpus consisted basically of a collection of texts that came from different sources of Spanish newspapers.
And it was annotated by hand using two tags.
One for English lexical borrowings, which are the majority of lexical borrowings in Spanish, and then the label other for borrowings from other languages.
We used CoNLL format and BIO encoding so that we could encode single-token borrowings such as app or multi-token borrowings such as machine learning.
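As an illustration of the BIO encoding, the sketch below tags the example sentence from earlier in the talk and prints it one token per line, CoNLL style. The tag name ENG and the tokenization are assumptions made for this example.

```python
def spans_to_bio(tokens, spans):
    """Convert labeled token spans to BIO tags.

    spans: list of (start, end, label) in token indices, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the borrowing
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags


tokens = ["Las", "prendas", "bestsellers", "se", "estampan", "con", "motivos",
          "florales", ",", "animal", "print", "o", "a", "retales", "tipo",
          "patchwork", "."]
# (start, end, label); "ENG" is an assumed name for the English-borrowing tag.
spans = [(2, 3, "ENG"), (9, 11, "ENG"), (15, 16, "ENG")]

# CoNLL-style output: one token and its tag per line.
for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

A single-token borrowing such as patchwork gets only a B tag, while animal print gets a B tag followed by an I tag.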
These are the numbers of the corpus.
As you can see, it amounts to roughly three hundred seventy thousand tokens.
And here you have the number of spans that were labeled as English and the spans that were labeled as other borrowings and how many of them were unique.
And here you have a couple of examples from the data set.
As you can see, for instance, in the first example we have the borrowing batch cooking, which is a multi-word borrowing.
And we have annotated it using the BIO encoding.
So the O tag was used for words in Spanish, that is, for words that were not borrowed.
And here in this second example, you have benching and crash which are also labeled as borrowings from English.
So, once we had the data set, we explored several models for the task of extracting and detecting these lexical borrowings.
The first one that we tried was the conditional random field model.
This was the model that had been used in previous work.
And we used the same handcrafted features from that work.
As you can see, these are the features.
These are binary features such as: is the word or the token in uppercase?
Is it titlecase?
Is it a quotation mark?
Things like that, which are the type of features that one would expect in a named entity recognition task.
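A minimal sketch of what such binary token features might look like in code. It covers only the three features named in the talk; the function name, the feature keys and the set of quotation characters are assumptions, and the full feature set of the original CRF is not reproduced here.

```python
def token_features(token):
    """Binary surface features for one token (illustrative subset only)."""
    quotation_marks = {'"', "'", "«", "»", "“", "”"}  # assumed set of quote characters
    return {
        "is_upper": token.isupper(),        # e.g. "NBA"
        "is_title": token.istitle(),        # e.g. "Madrid"
        "is_quotation_mark": token in quotation_marks,
    }


for tok in ["NBA", "Madrid", "«", "app"]:
    print(tok, token_features(tok))
```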
These are the results that we got.
We obtained an F-one score of fifty-five using the CRF model with handcrafted features.
Which is a huge difference compared to the reported F-one score of eighty-six, which was the result obtained with the same CRF model and the same features, but on a different data set, also for Spanish lexical borrowing detection.
So, this proves that the data set that we created is more difficult and that we needed to explore more sophisticated models for this task.
So, we tested two transformer-based models.
We used BETO, which is a monolingual BERT model trained for Spanish, and also multilingual BERT.
We used both models through the Transformers library by Hugging Face.
These are the results that we got.
As you can see, multilingual BERT performs better than BETO both on the development set and on the test set and across all metrics.
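For reference, the metrics in question are precision, recall and F-one. Below is a minimal sketch of how they can be computed over BIO-tagged output, assuming exact span match, which is the usual convention for this kind of sequence labeling, although the talk does not spell out the matching criterion. The function names and the toy tag sequences are assumptions.

```python
def bio_to_spans(tags):
    """Extract (start, end, label) spans from a BIO sequence, end exclusive.

    A stray I- tag with no preceding B- tag is ignored.
    """
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and start is not None and tag[2:] == label
        if not inside:
            if start is not None:
                spans.add((start, i, label))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def span_prf(gold_tags, pred_tags):
    """Exact-match span precision, recall and F1."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = ["O", "B-ENG", "I-ENG", "O", "B-ENG"]
pred = ["O", "B-ENG", "I-ENG", "O", "O"]      # one borrowing missed
print(span_prf(gold, pred))
```

In this toy case every predicted span is correct but one gold span is missed, so precision is perfect while recall drops to one half.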
Just so we have an idea to compare: the CRF model obtained an F-one score of fifty-five, whereas multilingual BERT obtained eighty-two, which is a big difference.
So, once we had those results, we asked ourselves another question: could we take a BiLSTM-CRF model, feed it with different types of embeddings, embeddings that encode different types of linguistic information, and outperform the results obtained by the transformer-based models?
So in order to do so, we ran some preliminary experiments. We ran this BiLSTM-CRF model using the Flair library.
And we experimented with different types of embeddings: transformer-based, but also fastText, character embeddings and so on.
What we found out was that transformer-based embeddings performed better than non-contextualized embeddings, that the combination of English BERT and Spanish BETO embeddings outperformed multilingual BERT embeddings,
and that BPE embeddings produced better F-one and character embeddings produced better recall.
With that in mind, these were the best performing results that we got. Both models were BiLSTM-CRF models using Flair.
One was fed with BETO and BERT embeddings and BPE embeddings, and the other one with BETO and BERT embeddings, BPE embeddings and also character embeddings.
This last one was the one that produced the highest F-one score on the test set, although the highest score on the development set was obtained by the one without character embeddings.
Just bear in mind that the best result that we got with multilingual BERT was an F-one of seventy-six on the development set and eighty-two on the test set.
So this is an improvement compared to those results.
Finally, we asked ourselves another question: can lexical borrowing detection be framed as transfer learning from language identification in code-switching?
So, we ran the same BiLSTM-CRF model that we had run using Flair, but instead of using these unadapted transformer-based BETO and BERT embeddings, we used code-switch embeddings.
What are code-switch embeddings?
Well, these are fine-tuned transformer-based embeddings that have been pretrained for language identification on the Spanish-English section of the LinCE code-switching data set. LinCE is a data set on code-switching that has a section on Spanish-English code-switching.
So we fed our BiLSTM-CRF with code-switch embeddings and optionally character embeddings, BPE embeddings and so on.
The best result that we got was eighty four point twenty two, which is the highest across all the models that we tried on the test set.
Although the best result of F-one score that we got on the development set, which was seventy nine was lower than the best result obtained by the BiLSTM-CRF fed with unadapted embeddings.
So, some conclusions from our work.
We have produced a new data set of Spanish newswire that is annotated with unassimilated lexical borrowings.
This data set is more borrowing-dense and OOV-rich than previous resources.
We have explored four types of models for lexical borrowing detection.
In terms of error analysis, well, recall was a weak point for all models.
As you can see here, some frequent false negatives include uppercase borrowings and words that exist both in English and in Spanish, for instance.
Also, interestingly, BPE embeddings seem to improve F-one score.
And character embeddings seem to improve recall.
Which is an interesting finding that perhaps we can explore in future work.
Well, this is everything that I have.
Thank you so much for listening.
My name is Antoine. I'm a PhD student at the University of Massachusetts Amherst. I am presenting our paper KinyaBERT: a Morphology-aware Kinyarwanda Language Model.
Today, I'll talk about the motivation for this research. Then I'll present the KinyaBERT model architecture in detail. I'll then talk about our experimental results, then finish with some conclusions.
We all know that recent natural language processing advances have been made possible by the use of pretrained language models such as BERT. However, there are still a number of limitations.
Due to the complex morphology that is expressed by most morphologically rich languages, the ubiquitous byte pair encoding tokenization algorithms that are used cannot extract the exact subword lexical units, meaning the morphemes, which are needed for effective representation. For example, here we have three Kinyarwanda words that have several morphemes in them, but the BPE algorithms cannot extract them. This is because some morphological rules produce different surface forms that hide the exact lexical information, and BPE, which is solely based on the surface forms, does not have access to this lexical model.
The second challenge is that even if one had access to an oracle morphological analyzer, replacing BPE tokens with morphemes is not enough to express the morphological compositionality. A third gap in research is that new pretrained language models are most often evaluated on high-resource languages, and we need to assess their applicability on low-resource and diverse languages as well.
Therefore, we present KinyaBERT, which is a simple but effective adaptation of the BERT architecture that is meant to more effectively handle morphologically rich languages. We evaluate KinyaBERT on Kinyarwanda, a low-resource, morphologically rich language, which is spoken by more than twelve million people across Eastern and Central Africa.
The input to the model is either a sentence or a document. For example, here we have John twarahamubonye biradutangaza, which means we were surprised to find John there. As you can see, Kinyarwanda words contain several morphemes that carry different information.
Therefore, in our model, we pass this sentence or document to a morphological analyzer, which then generates the morphemes contained in each of the words.
The morphemes are usually made of the stem and zero or more affixes. The affixes may indicate tense, aspect, subject or object in verbs, and most often relate to the Bantu noun class for subjects and objects.
The morphological analyzer also produces a part of speech tag for each of the words.
After this step, we make embeddings for the part of speech tags, embeddings for the affixes, and embeddings for the stem.
These are the morphology-level embeddings.
We then pass these embeddings through a morphology encoder, which is a small transformer encoder that is applied to each word independently.
The outputs are vectors that are contextualized with the morphological information of each word.
Now, we perform composition, where the morphological embeddings corresponding to part of speech and stem are concatenated together. We further concatenate them with another stem embedding at the sentence level.
Then we form the input to the main sentence or document encoder.
The final outputs are contextualized embeddings that can be used for downstream NLP tasks.
For the morphological analyzer, we use finite-state two-level morphology principles with a custom implementation that is tailored to the Kinyarwanda language. We effectively model the morphology of all Kinyarwanda words, including verbs, nouns, demonstrative and possessive pronouns, numerals, and others.
We use an unsupervised part of speech tagging algorithm. A first-order factored model is used to account for morphology probability, basically the probability that is assigned by the morphological analyzer. We also take into consideration the part of speech tag precedence as well as the syntactic agreements that are present in the input words.
The part of speech tagger uses bidirectional inference, which improves upon the more often used Viterbi algorithm for decoding.
A few remarks here on positional encoding. One, the morphology encoder does not use any positional encoding. This is because each of the morphemes occupies a known slot in the morphological model. Therefore, positional information is inherent when the morphemes are given.
Second, the sentence encoder uses the so-called untied relative positional embeddings, which were recently published at the ICLR conference. These positional embeddings essentially disentangle positional correlations from token-to-token attention computation.
Similar to BERT, we use a masked language model pre-training objective. Essentially we have to predict both the stem and the affixes that are associated with the words.
During pre-training, fifteen percent of all words are considered for prediction, of which eighty percent are masked, ten percent are swapped with random words and ten percent are left unchanged.
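The selection recipe just described can be sketched as follows. This is only an illustration of the fifteen percent and eighty/ten/ten split at the word level; the function name, the `[MASK]` placeholder and the toy vocabulary are assumptions, and the real model predicts stems and affixes rather than whole word strings.

```python
import random


def choose_mlm_corruptions(words, vocab, rng, select_prob=0.15):
    """Return (corrupted_words, target_positions) for masked language modeling."""
    corrupted = list(words)
    targets = []
    for i in range(len(words)):
        if rng.random() >= select_prob:
            continue                          # word is not considered for prediction
        targets.append(i)
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: masked (placeholder name assumed)
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: swapped with a random word
        # remaining 10%: left unchanged, but still predicted
    return corrupted, targets


rng = random.Random(0)
words = [f"w{i}" for i in range(20)]          # stand-in words, not real Kinyarwanda
corrupted, targets = choose_mlm_corruptions(words, vocab=words, rng=rng)
print(targets)
print(corrupted)
```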
For affix prediction, we face a multi-label classification problem. For this, we can either group affixes together into a fixed number of sets and predict the set as a class label, or predict the affix probability vector. We evaluate both of these approaches in our experiments.
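To show the difference between the two options, here is a small sketch of how the training targets might be built in each case. The affix identifiers A1 to A3 are placeholders rather than real Kinyarwanda affixes, and the catch-all class for unseen sets and the multi-hot form of the probability target are assumptions.

```python
def affix_set_targets(word_affixes, top_sets):
    """Option 1: each word's whole affix set becomes a single class id."""
    index = {frozenset(s): i for i, s in enumerate(top_sets)}
    unknown = len(top_sets)  # hypothetical catch-all class for sets outside top_sets
    return [index.get(frozenset(a), unknown) for a in word_affixes]


def affix_probability_targets(word_affixes, affix_vocab):
    """Option 2: one vector per word with a slot per affix (multi-hot target)."""
    position = {a: i for i, a in enumerate(affix_vocab)}
    vectors = []
    for affixes in word_affixes:
        vec = [0.0] * len(affix_vocab)
        for a in affixes:
            vec[position[a]] = 1.0
        vectors.append(vec)
    return vectors


affix_vocab = ["A1", "A2", "A3"]              # placeholder affix ids
word_affixes = [["A1", "A3"], [], ["A2"]]     # affixes of three masked words
top_sets = [[], ["A1", "A3"]]                 # the fixed inventory of affix sets

print(affix_set_targets(word_affixes, top_sets))          # [1, 0, 2]
print(affix_probability_targets(word_affixes, affix_vocab))
```

The first option turns the problem into ordinary single-label classification at the price of a fixed set inventory; the second keeps every affix independent.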
We pre-train KinyaBERT on about two and a half gigabytes of Kinyarwanda text, and we compare it to three baseline models. One is a multilingual model called XLM-R, which is trained on a large text corpus made of multiple languages. The other two baselines are pretrained on the same Kinyarwanda text, using either the byte pair encoding algorithm or morphological analysis without the two-tier transformer encoder architecture.
All models are configured in the base architecture, which is between a hundred and a hundred and ten million parameters, with KinyaBERT using the least number of parameters.
All models except the multilingual one are pretrained for thirty-two thousand gradient updates with a batch size of two thousand five hundred and sixty sequences.
We evaluate the pretrained models on three sets of tasks. One is the GLUE benchmark, which has often been used for evaluating the effectiveness of pretrained language models.
We obtain our GLUE benchmark data by translating the original benchmark data into Kinyarwanda using Google Translate.
The second task is a Kinyarwanda named entity recognition benchmark, which is a high-quality dataset that was annotated by trained native speakers.
The third one is a news categorization task, where we pull news articles from several websites, collect the categorization tags that were assigned by the authors, and then essentially try to predict the same categories.
And now we go to the results.
For the GLUE benchmark, we find that KinyaBERT consistently outperforms baseline models. Here we show the average performance for ten fine-tuning runs.
We also ran a user evaluation of the translations that are produced by Google Translate.
Essentially, users rated about six thousand examples, assigning scores on a scale from one to four, assessing the quality of the translations.
The result is that many translations were noisy.
But all models had to cope with the same translation noise, and the relative performance between the models is still important to notice.
For the named entity recognition task, we also find that KinyaBERT gives the best performance, with the affix distribution regression variant performing best. These results are also averages of ten fine-tuning runs.
For the news categorization task, we find mixed results. Previous work on text classification for Kinyarwanda had found that simple keyword detection is mostly enough for solving this specific task. Therefore, there is less gain from using pretrained language models on this particular task of news categorization.
We also conducted an ablation study to see if there are alternative structures that improve performance. For the GLUE benchmark, we find that using affix sets consistently performs better, while the affix probability regression objective yields the best performance on named entity recognition.
Also, by looking at the loss curves for fine-tuning, we find that KinyaBERT has better convergence in most cases. So to conclude, this work has demonstrated the effectiveness of explicitly using morphological information in pretrained language models. The proposed two-tier transformer encoder architecture enables capturing morphological compositionality, which is an important aspect of morphologically rich languages. These findings should motivate further research into morphology-aware pretrained language models.
Hello, my name is Michal Pietruszka and it is my pleasure to present to you the paper titled Sparsifying Transformer Models with Trainable Representation Pooling, a work done at Applica.ai in cooperation with Lukasz Borchmann and Lukasz Garncarek.
Let me start with the problems our work targets.
Our method works well for cases where long inputs are considered. Roughly speaking, it is meant for tasks where the input is over two thousand tokens and the targets are shorter than the provided inputs.
This has some specific applications in NLP. For example, one can imagine that given a long document, there's a need to summarize it, classify it, answer a question about it, or extract information or some key phrases.
Let me recall the vanilla Transformer and the issue of its attention complexity, which depends on the square of the input length.
In the vanilla Transformer, with full attention connectivity, relations of each token to every other token have to be calculated. The computational complexity of attention thus depends on the number of layers L, the sequence length N, another sequence length, and the dimensionality of representations. Similarly, in the decoder's cross-attention, in the picture on the right side, the only difference is that the target tokens are attending to the input tokens in this case, which can also be seen in this formula.
The blue squares represent relations that have to be calculated. In the case of full attention, we need to calculate every relation within the input sequence.
Now, we'll see what happens when we have a blockwise encoder, which works by limiting the tokens' connectivity so that they can only see other nearby tokens.
The text is read in chunks, which can reduce the number of computations only on the encoder side, but does not improve the decoder's cross-attention, as every input token is passed to the decoder anyway.
This method is often referred to as Fusion-in-Decoder. The improvement here can be interpreted as changing one of the dependencies on N to another constant M representing the block size.
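The dependencies described so far can be written down as simple cost counts. The sketch below ignores constant factors and everything except the attention score computation; the function names and the concrete sizes are assumptions chosen only to make the ratio visible.

```python
def full_self_attention_cost(num_layers, n, d):
    """Every token attends to every token: L * N * N * d."""
    return num_layers * n * n * d


def blockwise_self_attention_cost(num_layers, n, block, d):
    """Each token attends only within its block: L * N * M * d."""
    return num_layers * n * block * d


def cross_attention_cost(num_layers, target_len, n, d):
    """Each target token attends to every encoder output: L * T * N * d."""
    return num_layers * target_len * n * d


# Assumed sizes for illustration only.
L, N, M, T, d = 6, 8192, 512, 256, 768
full = full_self_attention_cost(L, N, d)
blockwise = blockwise_self_attention_cost(L, N, M, d)
print(full / blockwise)                      # 16.0, i.e. N / M
print(cross_attention_cost(L, T, N, d))      # unchanged by blockwise encoding
```

Blockwise encoding shrinks the encoder term by a factor of N over M, while the cross-attention term still scales with the full input length N.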
Our key observation is that most tokens are irrelevant for a wide variety of tasks and can be almost completely disregarded. This is exemplified on the slide.
Only parts of the input are relevant to the desired output.
For example, one can read an article once, marking the most important parts with a highlighter, and then produce a summary based on the parts from that middle stage only. The cost of highlighting and deciding whether the current token is essential to produce the summary is thus cheap and depends only on the token's representation.
The pooling of the highlighted tokens is possible thanks to our top-k operator, and its cost is negligible.
The cost of producing a summary from a shortened input is also much lower than in the vanilla model, where the whole input is considered.
But here's a question: how do we select important tokens and backpropagate gradients to that selection?
The essential underlying problem that we solve is to propose a trainable selection mechanism, one that allows gradients to be backpropagated during training so that the network can learn to select the most important tokens.
More precisely, given some embeddings and scores obtained from a simple linear layer, the task is to return the highest-scoring embeddings. First, the sequence is permuted and pairs are prepared so that the higher-scoring vector is taken with the lower-scoring one.
Next, weights are calculated using a boosted softmax over scores.
After each tournament round, new vectors and scores are composed as a linear combination of those pairs with the obtained weights.
So in short, we combine them linearly by performing a softmax over their scores.
And while combining two tokens, some noise can be produced. But it also allows the gradients to be propagated to all input embeddings.
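The tournament can be sketched in plain Python as below. This is a rough illustration under stated assumptions, not the released implementation: the input length is assumed to be K times a power of two, the i-th highest score is paired with the i-th lowest in each round, and "boosted" is read as multiplying scores by a constant before the two-way softmax. It shows only the forward computation; in a deep learning framework the same arithmetic would be differentiable.

```python
import math


def soft_top_k(vectors, scores, k, boost=1.0):
    """Tournament-style soft selection of k vectors (forward pass only)."""
    ratio, rem = divmod(len(scores), k)
    if rem or ratio & (ratio - 1):
        raise ValueError("this sketch assumes len(scores) == k * 2**rounds")
    while len(scores) > k:
        # Permute: sort by score so the i-th best meets the i-th worst.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        new_vectors, new_scores = [], []
        for rank in range(len(order) // 2):
            hi, lo = order[rank], order[-1 - rank]
            # Two-way softmax over boosted scores (shifted by the larger score for stability).
            e_lo = math.exp(boost * (scores[lo] - scores[hi]))
            w_hi = 1.0 / (1.0 + e_lo)
            w_lo = 1.0 - w_hi
            # Winner and loser are blended, so both still influence the output.
            new_vectors.append([w_hi * a + w_lo * b
                                for a, b in zip(vectors[hi], vectors[lo])])
            new_scores.append(w_hi * scores[hi] + w_lo * scores[lo])
        vectors, scores = new_vectors, new_scores
    return vectors, scores


vectors = [[float(i)] for i in range(8)]   # toy one-dimensional embeddings
scores = [float(i) for i in range(8)]      # toy scores
print(soft_top_k(vectors, scores, k=2, boost=1.0))    # noticeably blended
print(soft_top_k(vectors, scores, k=2, boost=50.0))   # close to a hard top-2
```

With a small boost the outputs are visibly mixed, which is the noise mentioned above; with a large boost the result approaches a hard top-k while every input still contributes a nonzero weight.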
In short, the trainable top-k we propose is based on performing a tournament-like soft selection at each step. And from a different perspective, the presented representation pooling follows the encoder layer. First, each representation is scored, and then only those with the highest scores are passed to the next layer.
Encoding can be performed as in the standard Transformer architecture on the full-length input. It is, however, possible to process text in blocks of fixed length and globally select the best representations.
Here is an example of the representation pooling introduced after the encoder.
This directly influences the cost of cross-attention, which depends not on the input length N but on the constant K representing the pooled length.
This constant determines how many representations are selected and passed to the decoder.
Producing a summary from a shorter text is significantly cheaper than in previous solutions, as the sequence length can be shortened by a large factor. For example, we successfully used K sixteen or even sixty-four times smaller than the value of N in our experiments.
Please note that the beneficial impact of blockwise encoding and self-attention is sustained.
Remember that the computational cost of attention depends on the square of the input length.
Reducing the input earlier during the encoding process can significantly lower the costs. For the Pyramidion model, we narrow down the size of the representation on the output of each chosen layer, leading to an exponential reduction of computational cost as the encoding proceeds.
As you can see, the total computational cost of a full encoder here is less than two times the cost of the full-sized first layer.
When pooling is introduced earlier, the sum of all purple squares is thus bounded to a constant, not dependent on the number of layers L,
but on the constant C, which can be influenced by the placing of the pooling layers within the network.
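The bound follows from a geometric series, which a few lines of arithmetic can confirm. The sketch assumes that pooling halves the sequence length after every layer and that each layer uses blockwise attention, so its cost is linear in the current length; both are simplifying assumptions, as are the function name and the sizes.

```python
def pyramid_encoder_cost(n, num_layers, block, d, keep=0.5):
    """Total blockwise attention cost when the length shrinks by `keep` after each layer."""
    total, length = 0.0, float(n)
    for _ in range(num_layers):
        total += length * block * d   # cost of one layer at the current length
        length *= keep                # pooling keeps a fixed fraction of positions
    return total


n, block, d = 8192, 512, 768          # assumed sizes for illustration
first_layer = n * block * d
for layers in (2, 6, 24):
    ratio = pyramid_encoder_cost(n, layers, block, d) / first_layer
    print(layers, ratio)              # stays below 2 however deep the encoder is
```

With halving, the total is the first layer's cost times 1 + 1/2 + 1/4 and so on, which never reaches two; pooling less aggressively or at fewer layers changes the constant but not the fact that it is bounded.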
Our improvements were benchmarked on inputs eight thousand tokens long. The figure shows that when pooling is engaged, the best scalability with the network's depth is achieved. Here, one can note that training the Pyramidion of twenty-four layers can be cheaper than training a two-layer vanilla Transformer on inputs of such length.
Not to mention how easily the vanilla Transformer can go out of memory for such a long input. The qualitative comparison of our trained Pyramidion to other baselines is performed on the long-document summarization task, where, given the body of an article from arXiv or PubMed, the task is to generate its abstract.
One can see that blockwise, which is our baseline, performs on the level of recent state-of-the-art models, while the Pyramidion retains or improves the performance of this competitive baseline.
At the same time, our model is eighty percent faster to train and over four hundred fifty percent faster at inference when compared to the blockwise baseline.
Both models have much lower parameter counts and were trained from scratch on the chosen tasks.
Previous approaches had to use more parameters and leverage pretrained foundational models and additional language pretraining objectives to achieve similar performance.
We invite you to read our full paper and use our GitHub code.
Thank you for watching.
Hello, this is Jiawei Zhou from Harvard University.
I am very glad to present our work on Online Semantic Parsing for Latency Reduction in Task-Oriented Dialogue.
This is joint work with Jason, Michael, Anthony and Sam from Microsoft Semantic Machines.
In task-oriented dialogue, a user interacts with a system that handles requests from user utterances, usually spoken.
From the end of the user utterance to the system response, there is often a noticeable delay.
Under the hood, the user utterance is translated into an executable program.
Which is then executed so that the system can respond properly.
The program is represented as a semantic graph that outlines the computation, where each node represents a function invocation and its children are the arguments. The gray nodes mark instantaneous operations, but the others are slow to execute.
We show a simple example here, but these programs can often be more complicated graphs beyond tree structures.
In this talk, we ask the question, can we start generating the program and executing it before the user even finishes the utterance so that the faster response can be achieved by the system?
This is the online prediction and decision problem.
There are a lot of others in this realm.
Examples include simultaneous translation, where a live interpreter translates one language to another in real time; smart text auto-completion, which guesses the user's intent; and Uber Pool, where drivers are sent to where they might be needed based on the predicted demand.
All of these scenarios have one thing in common.
That is, it is beneficial to make decisions before seeing all the input.
In our case, we are going to deal with online semantic parsing, which could be expected to be challenging as we have to guess what the user might say.
And it is also underexplored with no formal evaluation metric.
First, let's look at how an ordinary system works.
It operates offline, parsing to the program only at the end of the user utterance.
Here, the program graph is predicted after seeing all the information.
In contrast, we are proposing an online system that parses at every utterance prefix.
For example, each time we see a new token, we predict a new graph.
Notice that there could be errors. At the prefix 'at the pool party with Barack Obama', we get a graph with the right nodes for the person and the event subject, but with the wrong timing information.
This process goes on until we receive the full user utterance.
How would this affect the execution timeline? In the offline system, we get the program graph at the end, so the system can start the execution only at this point.
Remember that the gray nodes are fast operations, so we only consider the execution timeline of the colored slow functions.
First, these two find person functions can be executed in parallel, highlighted in white from the pink box, as they have no dependency on other functions.
Next, the create event node can get executed after obtaining results from the lower-level nodes, and then the top function, yield, so the whole program is finished.
The execution process is restricted to the program dependency structure, where some operations cannot be parallelized, which induces a noticeable delay.
In our online system, where we predict as we go, the program execution can start earlier.
Here, at the prefix after Obama, we predict confidently that the find person function should be in the program, but the rest may contain errors, so they are grayed out.
The execution of the node can be started immediately as a step.
Then with more tokens, we predict a totally new graph, but part of it is already being executed.
So we only need to consider the rest of the nodes that we are confident about as well.
Here, another find person can be executed in parallel.
Again, we may have wrong predictions.
With more text, we have more ability to make it right, such as the event time here, where AM is also anticipated correctly.
Then, we can start executing the rest following the program dependency structure.
By overlapping the execution timeline with the utterance timeline, we save a large amount of time.
So we propose the task of online semantic parsing. One underlying assumption is that the execution time dominates the model prediction time,
so we can only gain time by predicting earlier.
Another assumption is that, as the prediction and execution happen in the background, they are not visible to users, so there is no need to maintain a consistent parsing history.
So we reparse from scratch after each token.
In particular, we propose a two step approach.
A proposed step that predicts a graph with complete structure and a select step that selects the nodes that are worth executing at this time.
We have two variants of the propose step.
The first approach combines language model completion with full utterance to graph parsing.
In particular, the prefix after "Obama" is first completed through a fine-tuned BART language model and then translated into a program with a full offline parser.
The second approach directly predicts the program from user utterance prefixes.
This is achieved by training a single online parser to translate to the goal graph from each prefix.
This helps the model learn the right anticipation.
In a bit more detail, how do we generate these graphs?
We formulate the problem by generating a serial version of the graph.
Each node or edge is represented by an action.
Here, we start from the first node.
The number below records the absolute index in action history.
Then, we get the second node.
Next is the edge between them.
It contains the pointer to the index of the previous node and the edge label.
Zero here means connecting the most recent node with the node generated by the zeroth action. And then the next node, the next edge.
This process goes on until we generate the full graph.
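To make the serialization concrete, here is a minimal sketch of decoding such an action sequence back into a graph. The action format is an assumption for illustration, not the parser's actual output: the names `NodeAction`, `EdgeAction`, `decode_graph`, the edge label `attendee`, and the edge direction (most recent node to pointed-at node) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class NodeAction:
    label: str  # hypothetical node label, e.g. a function name


@dataclass
class EdgeAction:
    pointer: int  # absolute index, in the action history, of an earlier node action
    label: str    # hypothetical edge label


Action = Union[NodeAction, EdgeAction]


def decode_graph(actions: List[Action]) -> Tuple[Dict[int, str], List[Tuple[int, str, int]]]:
    """Rebuild nodes and edges from a serialized action sequence."""
    nodes: Dict[int, str] = {}               # action index -> node label
    edges: List[Tuple[int, str, int]] = []   # (most recent node, edge label, pointed-at node)
    latest = None                            # index of the most recently generated node
    for index, action in enumerate(actions):
        if isinstance(action, NodeAction):
            nodes[index] = action.label
            latest = index
        else:
            if latest is None or action.pointer not in nodes:
                raise ValueError(f"edge at action {index} points to a missing node")
            edges.append((latest, action.label, action.pointer))
    return nodes, edges


# First node, second node, then an edge whose pointer 0 refers to the zeroth action.
actions = [NodeAction("FindPerson"), NodeAction("CreateEvent"), EdgeAction(0, "attendee")]
nodes, edges = decode_graph(actions)
```

Because every edge refers back to an absolute action index, the graph can be rebuilt in a single left-to-right pass over the actions.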
The underlying model is based on a transformer with a self-pointing mechanism, similar to a previous transition-based parser.
After generating a complete graph, we obtain the action level probabilities that correspond to different parts of the graph.
We select confident subgraphs to be executed based on a thresholding heuristic.
Later on, we're going to vary the threshold to achieve different tradeoffs between the latency reduction and the execution cost.
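One way the select step could look is sketched below. This is an assumption-laden illustration rather than the exact heuristic from the talk: it assumes each node already has a single probability, and it treats a node as executable only when it and every node it depends on clear the threshold. The function name `select_confident_nodes` and the node names are hypothetical.

```python
from typing import Dict, List, Set


def select_confident_nodes(
    node_prob: Dict[str, float],
    children: Dict[str, List[str]],
    threshold: float,
) -> Set[str]:
    """Return nodes whose own probability, and that of all their dependencies, is >= threshold."""
    memo: Dict[str, bool] = {}

    def confident(node: str) -> bool:
        # Assumes the graph is acyclic, as a program dependency structure would be.
        if node not in memo:
            memo[node] = node_prob[node] >= threshold and all(
                confident(child) for child in children.get(node, [])
            )
        return memo[node]

    return {node for node in node_prob if confident(node)}


# Toy probabilities for a predicted graph (made-up numbers).
probs = {"FindPerson_1": 0.95, "FindPerson_2": 0.40, "CreateEvent": 0.90, "Yield": 0.85}
deps = {"CreateEvent": ["FindPerson_1", "FindPerson_2"], "Yield": ["CreateEvent"]}

strict = select_confident_nodes(probs, deps, threshold=0.8)
loose = select_confident_nodes(probs, deps, threshold=0.3)
```

With the higher threshold only the first find person node is selected, while the lower threshold selects the whole graph, which is the tradeoff between cost and latency reduction described here.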
For formal evaluation of the online methods, we propose the final latency reduction, or FLR, metric.
Here's a recap of how an offline system finishes the execution timeline.
In online systems, execution overlaps with the utterance timeline, so it ends earlier.
FLR is defined as the reduction in time compared to the offline system, marked by the end of the execution.
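As a rough illustration of the metric, the sketch below simulates both timelines on a toy program, measuring time in tokens. Everything here is assumed for the example: a uniform execution time per node, the `ready_at` token positions, and the helper name `end_of_execution`. The offline system can only start once the whole utterance is available, while the online system can start a node at the token where it was selected.

```python
from typing import Dict, List


def end_of_execution(
    root: str,
    ready_at: Dict[str, int],
    children: Dict[str, List[str]],
    exec_time: int,
) -> int:
    """Time at which the root finishes, respecting the dependency structure."""
    memo: Dict[str, int] = {}

    def finish(node: str) -> int:
        if node not in memo:
            dependency_finishes = [finish(child) for child in children.get(node, [])]
            # A node starts once it has been selected and all of its inputs are done.
            start = max([ready_at[node]] + dependency_finishes)
            memo[node] = start + exec_time
        return memo[node]

    return finish(root)


children = {"Yield": ["CreateEvent"], "CreateEvent": ["FindPerson_1", "FindPerson_2"]}
all_nodes = ["Yield", "CreateEvent", "FindPerson_1", "FindPerson_2"]
utterance_length = 6  # tokens in the full utterance (toy value)
exec_time = 2         # assumed execution time per node, in token units

# Offline: nothing can start before the utterance ends.
offline_ready = {node: utterance_length for node in all_nodes}
# Online: nodes become available at the token where they were confidently predicted.
online_ready = {"FindPerson_1": 2, "FindPerson_2": 4, "CreateEvent": 6, "Yield": 6}

offline_end = end_of_execution("Yield", offline_ready, children, exec_time)
online_end = end_of_execution("Yield", online_ready, children, exec_time)
flr = offline_end - online_end
```

In this toy case the two find person calls finish before the utterance ends, so only the dependent nodes remain on the critical path.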
We conduct experiments on two large conversational semantic parsing datasets, SMCalFlow and TreeDST.
Our graph-based parser, when operating offline, achieves state-of-the-art parsing performance on both datasets.
The LM-complete model also achieves a nontrivial BLEU gain compared with the simple baseline of no completion.
Now, let's look at the prediction accuracy of our prefix to graph parser.
We test the match F1 score of graph tuples between the generated graph and the gold graph on validation data, on the y-axis, for each prefix length on the x-axis, represented as percentages.
Each of these curves represents a different model with the only difference in training data.
The bottom curve is the offline parser, and we mix in prefix data of different lengths to transition the model to an online parser.
For example, the legend prefix eighty percent plus means the model is trained with prefix data with prefix length larger than eighty percent of the full utterance length.
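A minimal sketch of how such prefix training pairs could be built is given below. It assumes that every prefix is paired with the graph of the full utterance, and that the cutoff is inclusive (at least the given percentage of the utterance length); the function name `make_prefix_examples`, the toy utterance and the graph placeholder are hypothetical.

```python
from typing import List, Tuple


def make_prefix_examples(
    tokens: List[str], graph: str, min_percent: int
) -> List[Tuple[List[str], str]]:
    """Pair every prefix covering at least min_percent of the utterance with the full graph."""
    n = len(tokens)
    # Integer ceiling of min_percent * n / 100, never shorter than one token.
    shortest = max(1, (min_percent * n + 99) // 100)
    return [(tokens[:length], graph) for length in range(shortest, n + 1)]


tokens = ["meet", "with", "Obama", "at", "noon"]
graph = "<full program graph>"  # placeholder for the target graph
examples = make_prefix_examples(tokens, graph, min_percent=80)
```

Lowering `min_percent` adds shorter prefixes to the training mix, which corresponds to moving from the offline parser toward a fully online one.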
The upper left corner is the desired area.
As we can see, the offline parser, the black curve, is not doing well on the prefix data.
As we mix more prefixes into training, the curve lifts up and to the left, performing better on all prefix lengths.
However, the full utterance parsing performance, at the upper right dot, is not affected.
Based on these strong results, how much latency do we reduce?
We measure the time by the number of source tokens and simulate different function execution times.
The curves show the tradeoff between the FLR metric and the execution cost, measured by the number of excessive function calls that are not correct.
This is achieved by varying the subgraph selection threshold.
A higher threshold selects fewer mistaken functions but obtains a smaller FLR, whereas a lower threshold more aggressively selects and executes programs.
We compare the two approaches we propose and a baseline that does nothing but directly apply the offline parser for online use.
The upper left region has the best FLR and cost tradeoff.
We see both of our methods beat the baseline by a large margin, and they perform more similarly on TreeDST.
When individual function execution is faster, there tend to be more run executions and less room for latency reduction.
When individual function execution is slower, there is more room for FLR improvement.
Our two approaches achieve better performance in different cost regions.
Overall, we achieve thirty to sixty three percent relative latency reduction depending on execution time and allowed cost.
Finally, we have a breakdown of average latency reduction in tokens for each type of function node when the allowed cost is three run executions.
As we can see, there are gains across the board.
There are also some functions on which we gain impressive latency reduction where the red bar is much longer, such as FindManager and Recipient.
These are low level functions that do not have much dependency on others.
In conclusion, we proposed online semantic parsing as a new task to explore, with a rigorous latency reduction metric.
With a strong graph-based semantic parser, we achieve relatively good latency reduction, either through our pipeline approach with LM completion and a full parser, or directly through a learned parser on the prefixes.
Moreover, our approach can be a general framework and can be applied to other executable semantic representations in different domains.
Future work could explore smarter methods for integrating prediction and execution.
Thanks for listening.
Hi.
I'm going to discuss our work on generating Retrieval Augmented Counterfactuals for question answering tasks.
This is work done during my internship at Google Research, where I was mentored by Matthew Lamm and Ian Tenney.
To motivate the task, let me begin by defining a counterfactual.
In this work, we define a counterfactual as a perturbation of the input text that differs in some meaningful, controlled way from the original text and allows us to reason about the changes in the outcome or the task label.
For instance, changing the word fascinating to captivating, or expected to mind-numbing, changes the sentiment for this movie review.
Similarly, adding the qualifier women's to the question changes the answer to the question in the example below.
Humans are typically robust to such perturbations compared to NLP models trained on the task.
Why is that?
The dataset may be sampled with systematic biases that lead to a simple decision boundary that is violated by the counterfactual, as shown in this 2D classification problem.
Prior work has found that adding counterfactual examples to the training data can make the model robust to such perturbations.
So, if counterfactuals are valuable, how can we generate them?
This task is especially hard for NLP. Here are three examples from three different NLP tasks.
As you can see, examples that violate the decision boundary between outcomes need to be very carefully crafted by perturbing some attributes of the text that are underlined here.
This could be done by human annotation, but this is expensive and biased.
Some prior work has focused on using syntax trees or semantic role labeling.
But the set of perturbations generated by these techniques is limited by the semantic framework.
More recent work has used masked language models to fill in masked portions of the text to change labels.
But finding what parts of the text to perturb can be challenging.
There are more challenges to generating counterfactuals for question answering specifically.
This task requires background knowledge.
For instance, to perturb the original question "Is Indiana Jones 'Temple of Doom' a prequel?", we need to be aware of the other movies in the franchise to get to a question like "Is Indiana Jones 'Raiders of the Lost Ark' a prequel?"
Furthermore, random perturbations can lead to questions that are not answerable with the available evidence or have false premises.
Moreover, some question perturbations can lead to significant semantic drift from the original input.
For instance, take this question: "Is Indiana Jones practicing child slavery in Temple of Doom?"
We propose a very simple yet effective technique called Retrieve-Generate-Filter, or RGF, to tackle counterfactual perturbations of questions, which also aims to tackle all the other aforementioned challenges.
The core intuition behind RGF is that the background information needed to generate perturbations may be present in the near misses made by a question answering model.
For instance, the state-of-the-art model REALM produces the following top-k answers to the question "Who is the captain of the Richmond Football Club?"
While it does recover the original reference passage and answer, Trent Cotchin, as the topmost choice, it also retrieves additional passages and answers which can be used to guide question perturbation.
For instance, it recovers two more answers corresponding to the captains of the reserve team and the women's team of the same club, and this can lead to interesting edits.
To summarize, RGF first retrieves the top-k most relevant answers and contexts which don't match the reference answer and context.
Following this step, the question generation model conditions on these alternate answers to generate a question that corresponds to them.
And finally, we can filter the generated questions based on minimality or based on the type of semantic perturbation we are interested in introducing.
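The three steps can be sketched as a small pipeline. This is only a skeleton under stated assumptions: `retrieve`, `generate_question` and `keep` are hypothetical callables standing in for the retrieve-then-read model, the question generation model and the filter, and the toy stubs below return hand-written values rather than model output.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (answer, context passage)


def rgf(
    question: str,
    reference_answer: str,
    retrieve: Callable[[str, int], List[Candidate]],
    generate_question: Callable[[str, str], str],
    keep: Callable[[str, str], bool],
    k: int = 5,
) -> List[Tuple[str, str]]:
    """Retrieve near-miss answers, generate a question for each, then filter."""
    counterfactuals = []
    for answer, context in retrieve(question, k):
        if answer == reference_answer:
            continue  # only alternatives that differ from the reference are useful
        new_question = generate_question(answer, context)
        if keep(question, new_question):
            counterfactuals.append((new_question, answer))
    return counterfactuals


# Toy stand-ins for the real models (hand-written, for illustration only).
def toy_retrieve(question: str, k: int) -> List[Candidate]:
    candidates = [
        ("Trent Cotchin", "passage about the senior team"),
        ("Jess Kennedy", "passage about the women's team"),
    ]
    return candidates[:k]


def toy_generate(answer: str, context: str) -> str:
    return "who captained Richmond Football Club's first ever women's team?"


def toy_keep(original: str, generated: str) -> bool:
    return True  # no filtering in this toy run


result = rgf(
    "who is the captain of the Richmond Football Club?",
    "Trent Cotchin",
    toy_retrieve,
    toy_generate,
    toy_keep,
)
```

Each output pair already carries its answer, so the generated questions come labeled.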
Going over each step in greater detail: for retrieval, we use a retrieve-then-read model like REALM that takes as input the original question and a large corpus like Wikipedia.
It consists of two modules.
The retriever module performs similarity search over a dense index of passages to retrieve the top K most relevant passages to the question.
And a reader module then extracts a span from each passage as a potential answer.
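The retriever's similarity search can be pictured with a tiny brute-force version. This sketch assumes inner-product scoring over precomputed embeddings and uses made-up two-dimensional vectors; a real system would use learned encoders and an approximate index, and the reader module is not shown. The function name `top_k_passages` is hypothetical.

```python
from typing import List, Sequence


def top_k_passages(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Return indices of the k passages with the highest inner product with the query."""
    scored = [
        (sum(q * p for q, p in zip(query_vec, vec)), index)
        for index, vec in enumerate(passage_vecs)
    ]
    # Highest score first; ties are broken by the lower passage index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in scored[:k]]


query = [1.0, 0.0]
passages = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]  # toy passage embeddings
top_two = top_k_passages(query, passages, k=2)
```

The passages ranked below the first one are the near misses that RGF is interested in.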
REALM retrieves the gold passage and answer in most cases.
However, in this work, we are more interested in the answers and context that it retrieves further down the line.
In the next step, question generation, we use these alternate answers and contexts to generate new questions that correspond to these alternatives.
The question generation model is a pre-trained text-to-text transformer that is fine-tuned on the NQ data to generate a question for an answer that's marked in context.
During inference, we supply the question generation model with the alternative answer and context that we retrieved in the previous step.
For example, for the query "Who is the captain of the Richmond Football Club?", REALM retrieves passages about the club's women's team, captained by Jess Kennedy, and the question generation model generates the query "Who captained Richmond Football Club's first ever women's team?", which has a specific semantic perturbation.
In a similar fashion, we also get queries like "Who captained Richmond's VFL reserve team?"
Or who did Graham negate in the Grand final last year?
Finally, we filter out a subset of the generated queries based on some desired characteristics.
As motivated earlier, we would like to ensure that the new question is still semantically close to the original.
As a filtering technique that doesn't require additional supervision, we simply retain new questions that have a small token-level edit distance from the original question.
For example, we remove the question who did Graham negate in the Grand final last year?
Because it has a larger edit distance from the original question.
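A minimal sketch of this filter is shown below, assuming whitespace tokenization, lowercasing, and a Levenshtein distance computed over tokens. The cutoff `max_distance=3` and the candidate questions are made up for the example; the talk does not state the threshold that was used.

```python
from typing import List


def token_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two questions, counted in whitespace tokens."""
    a_tokens, b_tokens = a.lower().split(), b.lower().split()
    previous = list(range(len(b_tokens) + 1))
    for i, token_a in enumerate(a_tokens, start=1):
        current = [i]
        for j, token_b in enumerate(b_tokens, start=1):
            current.append(
                min(
                    previous[j] + 1,                         # delete token_a
                    current[j - 1] + 1,                      # insert token_b
                    previous[j - 1] + (token_a != token_b),  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def keep_minimal(original: str, candidates: List[str], max_distance: int) -> List[str]:
    """Retain only candidates within max_distance token edits of the original."""
    return [c for c in candidates if token_edit_distance(original, c) <= max_distance]


original = "who is the captain of the richmond football club"
close = "who is the captain of the richmond reserve team"
far = "who won the grand final last year"
kept = keep_minimal(original, [close, far], max_distance=3)
```

Working at the token level keeps the check cheap and independent of any extra annotation.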
In our experiments, we demonstrate that this simple heuristic can be used to augment NQ training data.
We also experiment with a filtering strategy that is based on the type of semantic perturbation.
To this end, we use a general purpose query decomposition framework called QED.
QED identifies two parts to the question, a predicate and a reference.
References are noun phrases in the question that correspond to entities in the context.
A predicate is basically the remaining portion of the question.
For example, we are able to decompose the query "Who captained Richmond's first ever women's team?" into two references, Richmond Football Club and women's team, and the predicate "who captained X".
A model trained on reference predicate annotations for NQ gives us this question decomposition.
Decomposing both the original and the generated question based on QED allows us to categorize our generated counterfactuals for evaluation.
Specifically, we obtain two groups of questions.
Those that undergo a reference change while retaining predicates, and those that undergo a predicate change and optionally add references.
For instance, "Who captained Richmond's VFL reserve team?" is a reference change, while "Who wears number nine for the club?" is a predicate change.
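Given decompositions for both questions, the grouping itself is a simple comparison. The sketch below assumes the predicate and references have already been produced by some decomposition model and are passed in as plain strings and sets; the function name `perturbation_type` and the hand-written decompositions are hypothetical, not QED output.

```python
from typing import Set


def perturbation_type(
    orig_predicate: str,
    orig_references: Set[str],
    new_predicate: str,
    new_references: Set[str],
) -> str:
    """Classify a generated question relative to the original one."""
    if new_predicate != orig_predicate:
        # A predicate change may also add references, so references are not checked here.
        return "predicate change"
    if new_references != orig_references:
        return "reference change"
    return "no change"


# Hand-written toy decompositions (not model output).
original = ("who captained X", {"Richmond Football Club"})
reserve = ("who captained X", {"Richmond Football Club", "VFL reserve team"})
number_nine = ("who wears number nine for X", {"Richmond Football Club"})

reference_case = perturbation_type(*original, *reserve)
predicate_case = perturbation_type(*original, *number_nine)
```

In practice predicates rarely match as exact strings, so a real comparison would need some normalization; exact equality is used here only to keep the example short.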
We now evaluate the effectiveness of RGF perturbations when augmented to training data.
To evaluate the effectiveness of counterfactual augmentation in particular, we experiment with two strong data augmentation baselines.
The first baseline, called random answer and question generation, adds data that has no relation with the original question.
That is, passages and answers are simply randomly sampled from Wikipedia.
This baseline basically adds more data that looks like NQ.
With the second baseline, gold answer and question generation, we specifically ablate the retrieval portion of our method.
Here, alternate answers are just chosen from the same passage that contained the gold answer.
How do the baselines and RGF augmentation perform on reading comprehension, where the model has access to question and context?
We experiment with six out of domain datasets and present results here, where the training data is doubled in augmentation.
We find that both data augmentation baselines are not able to improve out of domain generalization.
In fact, an ensemble of six models trained on the original data seems to be the most competitive baseline.
Comparing against that baseline, we find that RGF counterfactuals are able to improve out of domain performance while maintaining in domain performance.
This suggests that filling in the reasoning gaps of the model via counterfactual augmentation is more effective than adding more data from the training distribution.
Furthermore, we find that using retrieval to sample alternative outcomes or answers is important for effective counterfactual data augmentation, or CDA.
We also experiment with the open domain QA setting, where the model only sees the question, and once again we evaluate on four out of domain datasets.
We find that baseline models are not as effective for out of domain generalization.
However, data augmentation with RGF shows more significant improvements.
We even improve on the in domain NQ dataset.
We hypothesize that the counterfactual data augmentation aids the model in learning better query encodings for very similar queries.
Finally, we also evaluate on the model's ability to improve consistency in the local neighborhood of the original question.
Consistency measures the proportion of question pairs for which the model answers both the original and the counterfactual query correctly.
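As a small worked example of this metric, the sketch below takes per-pair correctness flags and returns the fraction of pairs where both are correct. Using all pairs as the denominator is an assumption about the exact definition, and the flags are made up.

```python
from typing import List, Tuple


def consistency(pairs: List[Tuple[bool, bool]]) -> float:
    """Fraction of (original, counterfactual) pairs where both answers are correct."""
    if not pairs:
        raise ValueError("at least one question pair is required")
    both_correct = sum(1 for original_ok, counterfactual_ok in pairs if original_ok and counterfactual_ok)
    return both_correct / len(pairs)


# Each tuple: (original answered correctly, counterfactual answered correctly).
toy_pairs = [(True, True), (True, False), (False, True), (True, True)]
score = consistency(toy_pairs)
```

A model that answers the original correctly but fails on the nearby counterfactual gets no credit for that pair, which is what makes the metric sensitive to local robustness.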
This explicitly helps us measure the model's robustness to small perturbations in the neighborhood of the original input.
We experiment with five datasets which contain pairs of questions that are semantically close to each other.
Apart from the three datasets AQA, AmbigQA and the Quoref contrast set that are already available, we also evaluate on RGF counterfactuals that are paired with original NQ questions based on whether they underwent a predicate change or a reference change.
These subsets were annotated in-house to eliminate noise and are provided as a resource.
All baselines are unable to significantly improve consistency, with the ensemble model improving consistency by a small margin.
However, RGF counterfactual augmentation has impressive gains in consistency both on prior data sets and the two subsets we curated for reference and predicate perturbations.
Note that the augmented RGF data is not biased by perturbation type, only the evaluation sets are.
In fact, a qualitative inspection of the kinds of counterfactuals generated shows that the generated questions contain several diverse perturbations.
For instance, this original question on the population of Walnut Grove, Minnesota is perturbed along different dimensions like town, state, country, and along different predicates like location, poverty, number of schools.
Also, the perturbations are context specific.
For example, for this other question about the Armandon singles tournament, the perturbation is along type of game, type of tournament, or the game outcome.
Final takeaways: we tackle the task of counterfactual data augmentation and perturbations for information seeking queries, and tackle its unique challenges via a reversal of the generation approach: overgenerate using near misses of the model, and filter based on perturbation type or minimality.
We find that this technique requires no additional supervision, and the examples come labeled for augmentation.
Augmentation improves out of domain generalization and neighborhood consistency.
And we find that RGF counterfactuals are semantically diverse without introducing bias during augmentation.
Thank you.
