Hi, this is Elena and I'm going to be presenting our work, [Detecting] Unassimilated Borrowings in Spanish: An [Annotated Corpus] and Approaches to [Modeling].
So we're going to be covering what [lexical] borrowing is, the [task] that we proposed, the [dataset] that we have released and some [models] that we explored.
But to begin with, what is [lexical] borrowing and why does it matter as an [NLP task]?
Well, [lexical] borrowing is basically the incorporation of [words] from one [language] into another [language].
[For] instance, in Spanish we use [words] that come from [English].
And here you have a few examples, [words] such as podcast, app, online, and crowdfunding; all of these are [English] [words] that we sometimes use in Spanish.
[Lexical] borrowing is a type of [linguistic] borrowing, which is basically reproducing in one [language] patterns of other [languages].
Borrowing and code-switching have sometimes been [compared] and described as a continuum, code-switching being what bilinguals do when they mix two [languages] at the same time.
There are however some differences between [lexical] borrowing and code-switching.
We're going to be focusing on [lexical] borrowing.
Code switching is something that is done by bilinguals and by definition the code switches are not integrated into any of the [languages] in use, whereas [lexical] borrowing is something that is also done by monolinguals.
The borrowings will comply with the [grammar] of the recipient [language].
And borrowings can eventually be integrated into the recipient [language].
So why is borrowing an interesting phenomenon?
Well, from the point of view of [linguistics], borrowing is a manifestation of how [languages] change and how they interact.
And also [lexical] borrowings are a [source] of new [words].
Here you have some examples of [lexical] borrowings that have been incorporated into the Spanish [language] as new [words].
In terms of [NLP], borrowings are a common [source] of out-of-[vocabulary] [words].
And in fact, [automatically] [detecting] [lexical] borrowings has proven to be useful [for] [downstream] [NLP] [tasks] such as [parsing], [text]-to-[speech] synthesis or [machine translation].
There has been a growing interest in the influence of [English] on other [languages], particularly related to [English] [lexical] borrowings, which are sometimes called Anglicisms.
And here you have some examples of work on [automatic] [detection] of borrowings in some of these [languages].
So the [task] that we [propose] is to detect unassimilated [lexical] borrowings in Spanish [newswire].
Which means that we are interested in [extracting] [words] borrowed from other [languages] that are being used in Spanish newspapers but that have not been integrated or assimilated into the recipient [language].
So not yet integrated into Spanish.
Here you have an example.
This is a [sentence] in Spanish: Las prendas bestsellers se estampan con motivos florales, animal print o retales tipo patchwork.
And as you can see, there are three [spans] of [text] which are actually [English] [words]: bestsellers, animal print, and patchwork.
These are the type of [spans] that we are interested in [extracting] and [detecting].
There has been [previous] work on Anglicism [detection], which consisted of a [CRF] [model] [for] Anglicism [detection] on Spanish [Newswire].
This [model] achieved an F1 score of eighty six.
But there were some limitations, both in the [dataset] and in the [modeling] [approach].
The [dataset] focused exclusively on one [source] of [news] and consisted only of headlines.
And also there was an overlap in the borrowings that appear in the [training] set and the test set.
This prevented the assessment of whether the [modeling] [approach] could actually [generalize] to previously [unseen] borrowings.
So our aim is to tackle some of these limitations in the [task].
To begin with, we created a new [dataset] [annotated] with [lexical] borrowings, and the aim was to create a test set that was as difficult as possible.
So there would be minimal overlap in [words] and topics between the [training] set and test set.
And as a result, the test set comes from sources and dates that were not seen in the [training] set.
Here you can see that there's no overlap in time.
The test set is also very borrowing-dense.
Just to give you some numbers, where the [training] set contains six borrowings per thousand [tokens], the test set contains twenty borrowings per thousand [tokens].
The test set contained as many out of [vocabulary] [words] as possible.
In fact, ninety two percent of the borrowings in the test set are [OOV].
So, they were not seen during [training].
And the [corpus] consisted basically of a collection of [texts] that came from different sources of Spanish newspapers.
And it was [annotated] by hand using two tags.
One [for] [English] [lexical] borrowings, which are the majority of [lexical] borrowings in Spanish, and then the label other [for] borrowings from other [languages].
We used the [CONLL] format with [BIO] [encoding] so that we could [encode] single-[token] borrowings such as app or multi-[token] borrowings such as [machine learning].
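To make the format concrete, here is a small constructed snippet in CoNLL-style BIO encoding (a toy sentence with assumed tag names ENG and OTHER, not an excerpt from the corpus):

```
el          O
curso       O
de          O
machine     B-ENG
learning    I-ENG
es          O
online      B-ENG
.           O
```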
These are the numbers of the [corpus].
As you can see, it amounts to roughly three hundred seventy thousand [tokens].
And here you have the [number] of [spans] that were [labeled] as [English] and the [spans] that were [labeled] as other borrowings and how many of them were unique.
And here you have a couple of examples from the [dataset].
[For] instance, in the first example we have the borrowing batch cooking, which is a multi-[word] borrowing.
We have [annotated] it using the [BIO] [encoding].
The O tag was used [for] [words] in Spanish, that is, [for] [words] that were not borrowed.
And here in this second example, you have benching and crash which are also [labeled] as borrowings from [English].
So, once we had the [dataset], we explored several [models] [for] the [task] of [extracting] and [detecting] these [lexical] borrowings.
The first one that we tried was the conditional random field [model].
Ah, this was the [model] that had been used on [previous] work.
And we used the same handcrafted [features] from that work.
As you can see, these are the [features].
These are [binary] [features] such as: is the [word] or [token] in upper case?
Is it titlecase?
Is it a quotation mark?
Things like that, which are the type of [features] that one would expect in a [named entity recognition] [task].
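As a rough illustration of what such handcrafted features can look like, here is a minimal per-token feature function in the style commonly used with CRF taggers (the exact feature set and names here are assumptions, not the authors' implementation):

```python
def token_features(tokens, i):
    # Minimal sketch of binary, NER-style features for a CRF tagger;
    # the feature set used in the paper is richer than this.
    token = tokens[i]
    return {
        "bias": 1.0,
        "is_upper": token.isupper(),                # is the token all uppercase?
        "is_title": token.istitle(),                # is the token titlecase?
        "is_quote": token in {'"', "'", "«", "»"},  # is it a quotation mark?
        "is_digit": token.isdigit(),
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
        "prev_token": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_token": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }
```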
These are the results that we got.
We obtained an F1 score of fifty five using the [CRF] [model] with handcrafted [features].
That is a huge difference [compared] to the reported F1 score of eighty six, which was the result obtained with the same [CRF] [model] and the same [features], but on a different [dataset], also [for] Spanish [lexical] borrowing [detection].
So, this proves that the [dataset] we created is more difficult and that we needed to explore more sophisticated [models] [for] this [task].
So, we tested two [transformer] based [models].
We used [BETO] which is a [monolingual] [BERT model] trained [for] Spanish and also [multilingual BERT].
We used both [models] through the [transformers] library by HuggingFace.
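As a sketch of this setup (the BIO label set and the checkpoint names are assumptions on my side, not the exact configuration used), loading such a token classification model looks roughly like this:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# BIO label set for English and other-language borrowings (assumed naming).
labels = ["O", "B-ENG", "I-ENG", "B-OTHER", "I-OTHER"]

# Multilingual BERT; a Spanish BERT (BETO) checkpoint such as
# "dccuchile/bert-base-spanish-wwm-cased" could be swapped in for the
# monolingual variant.
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels)
)
# The model is then fine-tuned on the BIO-encoded corpus with a standard
# token-classification training loop.
```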
These are the results that we got.
As you can see, [multilingual BERT] performs better than [BETO] both on the development set and on the test set and across all [metrics].
Just to have a point of comparison: the [CRF] [model] obtained a fifty five F1 score, whereas [multilingual BERT] obtained eighty two, which is a big difference.
So, once we had those results, we asked ourselves another [question], which is: could we take a [BiLSTM-CRF] [model], feed it with different types of [embeddings], [embeddings] that [encode] different types of [linguistic] [information], and outperform the results obtained by [transformer]-based [models]?
In order to do so, we ran some preliminary experiments with this [BiLSTM-CRF] [model] using the Flair library.
We experimented with different types of [embeddings]: [transformer]-based, but also fastText, character [embeddings], and so on.
What we found was that [transformer]-based [embeddings] performed better than non-[contextualized] [embeddings], and that the combination of [English] [BERT] and Spanish [BETO] [embeddings] outperformed [multilingual BERT] [embeddings].
Also, [BPE] [embeddings] produced better F1 and character [embeddings] produced better recall.
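To give an idea of what stacking such embeddings looks like in practice, here is a minimal Flair sketch of a BiLSTM-CRF tagger over stacked embeddings (checkpoint names, corpus path, and hyperparameters are assumptions, not the exact configuration from the paper):

```python
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import (
    TransformerWordEmbeddings, BytePairEmbeddings,
    CharacterEmbeddings, StackedEmbeddings,
)
from flair.models import SequenceTagger

# CoNLL-style column corpus: column 0 = token, column 1 = BIO tag.
corpus: Corpus = ColumnCorpus("data/", {0: "text", 1: "ner"})
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

# Stack Spanish BETO, English BERT, subword (BPE) and character embeddings.
embeddings = StackedEmbeddings([
    TransformerWordEmbeddings("dccuchile/bert-base-spanish-wwm-cased"),
    TransformerWordEmbeddings("bert-base-cased"),
    BytePairEmbeddings("es"),
    CharacterEmbeddings(),
])

# BiLSTM-CRF sequence tagger over the stacked embeddings.
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,
)
```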
With that in mind, these were the best performing results that we got.
Both [models] were [BiLSTM-CRF] [models] using Flair.
One was fed with [BETO] and [BERT] [embeddings] plus [BPE] [embeddings], and the other with [BETO] and [BERT] [embeddings] plus [BPE] and character [embeddings].
This last one was the one that produced the highest F1 score on the test set, although the highest score on the development set was obtained by the one without character [embeddings].
Bear in mind that the best result we got with [multilingual BERT] was an F1 of seventy six on the development set and eighty two on the test set.
So this is an improvement [compared] to those results.
Finally, we asked ourselves another [question] which was can [lexical] borrowing [detection] be framed as [transfer learning] from [language identification] in code switching?
So, we ran the same [BiLSTM-CRF] [model] that we had run using Flair, but instead of using the unadapted [transformer]-based [BETO] and [BERT] [embeddings], we used code-switch [embeddings].
What are code switch [embeddings]?
Well, these are [transformer]-based [embeddings] that have been fine-tuned [for] [language identification] on the Spanish-[English] section of the [LinCE] code-switching [dataset].
[LinCE] is a code-switching benchmark that has a section on Spanish-[English] code-switching.
So we fed our [BiLSTM-CRF] with code switch [embeddings] and optionally character [embeddings], [BPE] [embeddings] and so on.
The best result that we got was eighty four point twenty two, which is the highest across all the [models] that we tried on the test set.
Although the best F1 score that we got on the development set, which was seventy nine, was lower than the best result obtained by the [BiLSTM-CRF] fed with unadapted [embeddings].
So, some conclusions from our work.
We have produced a new [dataset] of Spanish [newswire] that is [annotated] with unassimilated [lexical] borrowings.
This [dataset] is more borrowing dense and [OOV]-rich than [previous] [resources].
We have explored four types of [models] [for] [lexical] borrowing [detection].
In terms of error [analysis], recall was a weak point [for] all [models].
As you can see here, some frequent false negatives include uppercase borrowings and [words] that exist in both [English] and Spanish, [for] instance.
Also interestingly, [BPE] [embeddings] seem to improve F1 score.
And character [embeddings] seem to improve recall.
This is an interesting finding that perhaps we can explore in future work.
Well, this is everything that I have.
Thank you so much [for] listening.
My name is Antoine.
I'm a PhD student at the University of Massachusetts Amherst.
I am presenting our [paper] [KinyaBERT]: a [Morphology]-aware Kinyarwanda [Language Model].
Today, I'll talk about the motivation [for] this [research].
Then I'll present [KinyaBERT] [model] architecture in detail.
I'll then talk about our experimental results, then finish with some conclusions.
We all know that recent [natural language processing] advances have been made possible by the use of [pretrained language] [models] such as [BERT].
However, there are still a [number] of limitations.
Due to the complex [morphology] that is expressed by most [morphologically] rich [languages], the ubiquitous [byte pair encoding] [tokenization] [algorithm] that is used cannot extract the exact [subword] [lexical] units, [meaning] the [morphemes], which are needed [for] effective [representation].
[For] example, here we have three Kinyarwanda [words] that have several [morphemes] in them, but the [BPE] [algorithm] cannot extract them.
This is because some [morphological] rules produce different surface forms that hide the exact [lexical] [information], and [BPE], which is solely based on the surface forms, does not have access to this [lexical] [information].
The second challenge is that even if one had access to an [oracle] [morphological analyzer], replacing [BPE] [tokens] with [morphemes] is not enough to express the [morphological] [compositionality].
A third gap in the [research] is that new [pretrained language] [models] are most often evaluated on high resource [languages].
And we need to assess their applicability on [low resource] and diverse [languages] as well.
[Therefore], we present [KinyaBERT], which is a simple but effective adaptation of the [BERT] architecture that is meant to more effectively handle [morphologically] rich [languages].
We evaluate [KinyaBERT] on Kinyarwanda, a [low resource] [morphologically] rich [language], which is [spoken] by more than twelve million people across Eastern and Central Africa.
The [input] to the [model] is either a [sentence] or a [document].
[For] example here, we have John twarahamubonye biradutangaza, which means we were surprised to find John there.
As you can see, Kinyarwanda [words] contain several [morphemes] that carry different [information].
[Therefore], in our [model], we pass this [sentence] or a [document] to a [morphological analyzer].
Which then generates [morphemes] contained in each of the [words].
The [morphemes] are usually made of a stem and zero or more affixes.
The affixes may indicate tense, [aspect], subject or object in [verbs], and most often relate to the Bantu [noun] class of subjects and objects.
The [morphological analyzer] also produces a part of [speech] tag [for] each of the [words].
After this step, we make [embeddings] [for] the part of [speech] tags.
[Embeddings] [for] the affixes.
And [embeddings] [for] the stem.
These are the [morphology]-level [embeddings].
We then pass these [embeddings] through a [morphology] [encoder], which is a small [transformer encoder] that is applied to each [word] independently.
The outputs are [vectors] that are [contextualized] with the [morphological] [information] of each [word].
Now, we perform composition where the [morphological] [embeddings] [corresponding] to part of [speech] and stem are concatenated together.
We further concatenate them with another stem [embedding] at the [sentence] level.
Then we form an [input] to the main [sentence] or [document] [encoder].
The final outputs are [contextualized] [embeddings] that can be used [for] [downstream] [NLP] [tasks].
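To summarize the architecture in code, here is a heavily simplified PyTorch sketch of such a two-tier encoder; the vocabulary sizes, dimensions, pooling of the morphology outputs, and the omission of padding and masking are my own simplifying assumptions, and this is not the released KinyaBERT implementation:

```python
import torch
import torch.nn as nn


class TwoTierEncoder(nn.Module):
    """Sketch: a small per-word morphology transformer feeding a larger
    sentence-level transformer (simplified, illustrative only)."""

    def __init__(self, n_stems=30000, n_affixes=300, n_pos=20,
                 d_morph=128, d_model=768):
        super().__init__()
        self.stem_emb = nn.Embedding(n_stems, d_morph)
        self.affix_emb = nn.Embedding(n_affixes, d_morph)
        self.pos_emb = nn.Embedding(n_pos, d_morph)
        # Small transformer applied to each word's [POS, stem, affix...] sequence.
        morph_layer = nn.TransformerEncoderLayer(d_model=d_morph, nhead=4,
                                                 batch_first=True)
        self.morph_encoder = nn.TransformerEncoder(morph_layer, num_layers=1)
        # Sentence-level stem embedding, sized so the concatenation equals d_model.
        self.sent_stem_emb = nn.Embedding(n_stems, d_model - 2 * d_morph)
        sent_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                                batch_first=True)
        self.sentence_encoder = nn.TransformerEncoder(sent_layer, num_layers=12)

    def forward(self, pos_ids, stem_ids, affix_ids):
        # pos_ids, stem_ids: (batch, words); affix_ids: (batch, words, max_affixes)
        b, w, m = affix_ids.shape
        morph_in = torch.cat([self.pos_emb(pos_ids).unsqueeze(2),
                              self.stem_emb(stem_ids).unsqueeze(2),
                              self.affix_emb(affix_ids)], dim=2)
        morph_out = self.morph_encoder(
            morph_in.view(b * w, 2 + m, -1)).view(b, w, 2 + m, -1)
        # Concatenate the contextualized POS and stem slots with a
        # sentence-level stem embedding to form one vector per word.
        word_vecs = torch.cat([morph_out[:, :, 0], morph_out[:, :, 1],
                               self.sent_stem_emb(stem_ids)], dim=-1)
        return self.sentence_encoder(word_vecs)  # (batch, words, d_model)
```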
[For] the [morphological analyzer], we use finite-state two-level [morphology] principles with a custom implementation that is tailored to the Kinyarwanda [language].
We effectively [model] the [morphology] of all Kinyarwanda [words], including verbals, [nouns], demonstrative and possessive [pronouns], numerals, and others.
We use an [unsupervised] part of [speech] [tagging] [algorithm].
A first order factored [model] is used to account [for] [morphology] probability, basically the probability that is assigned by the [morphological analyzer].
We also take into consideration the part of [speech] tag precedence as well as the [syntactic] agreements that are present in the [input] [words].
The part of [speech] [tagger] uses [bidirectional] [inference], which improves upon the more often used Viterbi [algorithm] [for] [decoding].
A few remarks here [for] [positional encoding].
One, the [morphology] [encoder] does not use any [positional encoding].
This is because each of the [morphemes] occupies a known slot in the [morphological] [model].
[Therefore], positional [information] is inherent when the [morphemes] are given.
Second, the [sentence] [encoder] uses the so-called untied relative positional [embeddings], which have recently been published at the [ICLR] conference.
These positional [embeddings] essentially disentangle positional [correlations] from the [token]-to-[token] [attention] [computation].
[Similar] to [BERT], we use a [masked language model] [pre-training] objective.
Essentially we have to predict both the stem and the affixes that are associated with the [words].
During [pre-training], fifteen percent of all [words] are considered [for] [prediction], of which eighty percent are masked, ten percent are swapped with random [words], and ten percent are left unchanged.
[For] affix [prediction], we face a multi-label [classification] [problem].
[For] this, we either group affixes into a fixed [number] of sets and predict the set as a class label.
The other option is to predict the affix probability [vector].
We evaluate both of these approaches in our experiments.
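For concreteness, a minimal sketch of this BERT-style corruption scheme over word positions might look like the following (it glosses over how stems and affix sets are actually encoded and predicted, and the helper names are hypothetical):

```python
import random


def choose_masking(word_ids, vocab, mask_id, p_select=0.15):
    # Sketch of the corruption used for pretraining: 15% of words are selected
    # for prediction; of those, 80% are masked, 10% replaced with a random
    # word, and 10% left unchanged.
    targets = {}
    corrupted = list(word_ids)
    for i, w in enumerate(word_ids):
        if random.random() < p_select:
            targets[i] = w                    # predict the stem and affixes here
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_id        # mask
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)  # swap with a random word
            # else: leave the word unchanged
    return corrupted, targets
```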
We pre-train [KinyaBERT] on about two and a half gigabytes of Kinyarwanda [text], and compare it to three baseline [models].
One is a [multilingual] [model] called [XLM]-R, which is trained on [large] [text] [corpora] made up of multiple [languages].
The other two [baselines] are [pretrained] on the same Kinyarwanda [text] using either the [byte pair encoding] [algorithm] or using [morphological analysis] without using the two tier [transformer encoder] architecture.
All [models] are configured in the base architecture, which is between roughly one hundred and one hundred ten million parameters, with [KinyaBERT] using the fewest parameters.
All [models] except the [multilingual] one are [pretrained] [for] thirty two thousand [gradient] updates with a batch size of two thousand five hundred and sixty [sequences] in each batch.
We evaluate the [pretrained] [models] on three sets of [tasks].
One is the [GLUE] benchmark which has often been used [for] evaluating the effectiveness of [pretrained language] [models].
We obtain our [GLUE] benchmark [data] by translating the original benchmark [data] into Kinyarwanda using Google Translate.
The second [task] is Kinyarwanda [named entity recognition] benchmark, which is a high [quality] [dataset] that was [annotated] by trained native speakers.
The third one is a [news] categorization [task], where we pulled [news] articles from several websites, collected the categorization tags that were assigned by the authors, and then essentially tried to predict those same categories.
And now we go to the results.
[For] the [GLUE] benchmark, we find that [KinyaBERT] consistently outperforms baseline [models].
Here we show the average performance [for] ten [finetuning] runs.
We also ran a [user] [evaluation] of the [translations] that are produced by Google Translate.
Essentially, users rated about six thousand examples, assigning scores on a scale from one to four [assessing] the [quality] of the [translations].
The result is that many [translations] were noisy.
But, all [models] had to cope with the same [translation] noise, and the relative performance between the [models] is still important to notice.
[For] the [named entity recognition] [task], we also find that [KinyaBERT] gives the best performance with the affix distribution [regression] variant performing best.
These results are also averages of ten [finetuning] runs.
[For] the [news] categorization [task], we find mixed results.
[Previous] work on [text classification] [for] Kinyarwanda had found that simple keyword [detection] is mostly enough [for] solving this specific [task].
[Therefore], there is less gain from using [pretrained language] [models] on this particular [task] of [news] categorization.
We also conducted an [ablation] study to see if there are alternative structures that improve performance.
[For] the [GLUE] benchmark, we find that using affix sets consistently performs better, while affix probability [regression] objective yields the best performance on [named entity recognition].
Also, by looking at the [finetuning] loss curves, we find that [KinyaBERT] has better convergence in most cases.
So to conclude, this work has demonstrated the effectiveness of explicitly using [morphological] [information] in [pretrained language] [models].
The proposed two-tier [transformer encoder] architecture enables capturing [morphological] [compositionality], which is an important [aspect] of [morphologically] rich [languages].
These findings should motivate further [research] into [morphology]-aware [pretrained language] [models].
Hello, my name is Michał Pietruszka and it is my pleasure to present to you the [paper] titled Sparsifying [Transformer] [Models] with Trainable [Representation] Pooling.
This is work done at Applica [AI] in cooperation with Lukasz Borchmann and Lukasz Garncarek.
Let me start with the problems our work targets.
Our [method] works well [for] the cases where long inputs are considered.
Roughly speaking, it is meant [for] [tasks] where the [input] is over two thousand [tokens] long and the targets are shorter than the provided inputs.
This has some specific applications in [NLP].
[For] example, one can imagine that given a long [document], there's a need to summarize it, classify, [answer] the [question] about it, extract [information] or some key phrases.
Let me recall the vanilla [transformer] and its issue of [attention] complexity that depends on the square of the [input] length.
In the vanilla [transformer], with full [attention] connectivity, [relations] of each [token] to every other [token] have to be calculated.
The [computational] complexity of [attention] thus depends on the [number] of layers l, the [sequence] length n times the [sequence] length again, and the dimensionality of the [representations].
Similarly, in the [decoder]'s cross [attention], shown in this picture on the right side, the only difference is that the [target] [tokens] are attending to the [input] [tokens].
Which can be seen also in this formula.
The blue squares represent [relations] that have to be calculated.
In the case of full [attention], we need to calculate every [relation] within the [input] [sequence].
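Written out, the costs described above can be summarized roughly as follows (standard notation assumed on my part, with l layers, source length n, target length t, and hidden size d):

```latex
\underbrace{O(l \cdot n^{2} \cdot d)}_{\text{encoder self-attention}}
\qquad
\underbrace{O(l \cdot t \cdot n \cdot d)}_{\text{decoder cross-attention}}
```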
Now, we see what happens when we have a blockwise [encoder] that works by limiting the [tokens]' connectivity so that they can only see other nearby [tokens].
The [text] is read in chunks which can drastically reduce the [number] of computations on the [encoder] side, but does not improve the [decoder]'s cross [attention] as every [input] [token] is passed to the [decoder] anyway.
This [method] is often referred to as fusion in [decoder].
The improvement here can be interpreted as changing one of the [dependencies] on n to a constant m representing the block size.
Our key observation is that most [tokens] are irrelevant [for] a wide variety of [tasks] and can be almost completely disregarded. This is exemplified on the slide.
Only parts of the [input] are relevant to the desired output.
[For] example, one can read an article once, marking the most important parts with a highlighter, and then produce a summary based only on this intermediate selection.
The cost of highlighting and deciding if the current [token] is essential to produce the summary is thus cheap and depends only on the [token]'s [representation].
The pooling of the highlighted [tokens] is possible thanks to our top-k operator, and its cost is negligible.
The cost of producing a summary from a shortened [input] is also much lower than in the vanilla [model] when the whole [input] is considered.
But here's a [question].
How to select important [tokens] and backpropagate gradients to that selection?
The essential underlying [problem] that we solve is to [propose] the trainable selection mechanism.
One that allows gradients to be backpropagated during the [training] so that the network can learn to select the most important [tokens].
More precisely: given scores obtained from a simple [linear] layer, the [task] is to return the highest-scoring [embeddings]. First, the [sequence] is permuted and pairs are prepared so that a higher-scoring [vector] is taken together with a lower-scoring one.
Next, [weights] are calculated using a boosted [softmax] over the scores.
After each tournament round, new [vectors] and scores are composed as a [linear] combination of those pairs with the obtained [weights].
So in short, we combine them linearly by performing a [softmax] over their scores.
And while combining two [tokens], some noise can be produced.
But it also allows the gradients to be propagated to all [input] [embeddings].
In short, the trainable top-k we [propose] is based on performing a tournament-like soft selection at each step.
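To make the mechanism concrete, here is an illustrative PyTorch sketch of such a tournament-style soft top-k; it is my own simplification of the idea, not the authors' exact implementation, and it assumes the sequence length is k times a power of two so that each round exactly halves the sequence:

```python
import torch
import torch.nn.functional as F


def soft_topk(x, scores, k, temperature=0.1):
    # x: (n, d) token representations; scores: (n,) saliency scores from a
    # linear layer. Each round pairs the highest-scoring items with the
    # lowest-scoring ones and merges every pair as a softmax-weighted mix,
    # so gradients flow to both members of the pair.
    while x.size(0) > k:
        n = x.size(0)
        order = scores.argsort(descending=True)
        x, scores = x[order], scores[order]
        hi, lo = x[: n // 2], x[n // 2:].flip(0)            # best paired with worst
        s_hi, s_lo = scores[: n // 2], scores[n // 2:].flip(0)
        w = F.softmax(torch.stack([s_hi, s_lo], dim=-1) / temperature, dim=-1)
        x = w[:, :1] * hi + w[:, 1:] * lo                   # soft "winner" vector
        scores = w[:, 0] * s_hi + w[:, 1] * s_lo            # soft "winner" score
    return x, scores
```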
And from a different perspective, the [representation] pooling follows the [encoder] layer.
First, each [representation] is scored and then only those with the highest scores are passed to the next layer.
[Encoding] can be performed as in standard [transformer] architecture on the full length [input].
It is, however, possible to process [text] in blocks of fixed length and globally select the best [representations].
Here is an example of the [representation] pooling introduced after the [encoder].
This directly influences the cost of cross [attention], which depends not on the [input] length n, but on the constant k representing the pooled length.
This constant informs how many [representations] are selected and passed to the [decoder].
Producing a summary from a shorter [text] is significantly cheaper than in [previous] solutions, as the [sequence] length can be shortened by a [large] factor.
[For] example, we successfully used a k sixteen or even sixty-four times smaller than the value of n in our experiments.
Please note that the beneficial impact of blockwise [encoding] and self [attention] is sustained.
Remember that the [computational] cost of [attention] depends on the square of the [input] length.
Reducing the [input] earlier during the [encoding] process can significantly lower the costs.
[For] the Pyramidion [model], we narrow down the size of the [representation] at the output of each chosen layer, leading to an exponential reduction of the [computational] cost as the [encoding] proceeds.
As you can see, the total [computational] cost of a full [encoder] here is less than two times the cost of the full-sized first layer.
When pooling is introduced earlier, the sum of all purple squares is thus bounded by a constant that does not depend on the [number] of layers l, but on a constant c, which can be influenced by the placement of the pooling layers within the network.
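A sketch of the argument (my own shorthand): if each pooling stage shrinks the per-layer cost by a constant factor r < 1 relative to the previous layer, the total cost is a geometric series bounded by a constant that does not grow with the depth:

```latex
\sum_{i=0}^{l-1} c \, r^{i} \;<\; \frac{c}{1 - r}, \qquad 0 < r < 1
```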
Our improvements were benchmarked on eight thousand [tokens] long inputs.
And the figure shows that when pooling is engaged, the best scalability [for] the network's depth is achieved.
Here one can note that [training] the pyramidion of twenty four layers can be cheaper than [training] a two layer vanilla [transformer] on such long inputs.
Not to mention how easily vanilla [transformer] can go out of memory [for] such a long [input].
The [qualitative] comparison of our trained Pyramidion to other [baselines] is performed on the long [document] [summarization] [task], where, given the body of an article from arXiv or [PubMed], the [task] is to generate its abstract.
One can see that blockwise, which is our baseline, performs on the level of recent state-of-the-art [models], while the Pyramidion retains or improves the performance of this competitive baseline.
At the same time, our [model] is eighty percent faster to train and over four hundred fifty percent faster at [inference] when [compared] to the blockwise baseline.
Both [models] have much lower [parameter] counts and were trained from scratch on the chosen [tasks].
[Previous] approaches had to use more parameters and leverage [pretrained] foundation [models] with additional [language] [pretraining] objectives to achieve [similar] performance.
We invite you to read our full [paper] and use our GitHub code.
Thank you [for] watching.
Hello, this is Jiawei Zhou from Harvard University.
I am very glad to present our work on [Online] [Semantic Parsing] [for] Latency Reduction in [Task]-Oriented [Dialogue].
This is joint work with Jason, Michael, Anthony and Sam from Microsoft [Semantic] Machines.
In [task]-oriented [dialogue], a [user] interacts with a [system] that handles requests from [user] [utterances], usually spoken.
From the end of the [user] [utterance] to the [system] response, there is often a noticeable delay.
Under the hood, the [user] [utterance] is translated into an executable program.
Which is then executed so that the [system] can respond properly.
Here the program is represented as a [semantic] [graph] that outlines the [computation], where each node represents a function invocation and its children are its arguments.
The gray [nodes] mark instantaneous operations, but the others are slow to execute.
Although we show a simple example here, these programs can often be more complicated [graphs] beyond tree structures.
In this talk, we ask the [question]: can we start [generating] the program and executing it before the [user] even finishes the [utterance], so that a faster response can be achieved by the [system]?
This is the [online] [prediction] and decision [problem].
There are a lot of others in this realm.
Examples include [simultaneous] [translation] where a live interpreter translates one [language] to another in real time, smart [text] auto completion to guess the [user] intent, and Uber pool where the drivers are sent to where they might be needed based on the predicted demand.
All of these scenarios have one thing in common.
That is, it is beneficial to make decisions before seeing all the [input].
In our case, we are going to deal with [online] [semantic parsing], which could be expected to be challenging as we have to guess what the [user] might say.
And it is also underexplored with no formal [evaluation] metric.
First, let's look at how an ordinary [system] works.
It operates offline by [parsing] to the program only at the end of the [user] [utterance].
Here, the [graph] is predicted after seeing all the [information].
In contrast, we are proposing an [online] [system] that parses at every [utterance] prefix.
[For] example, each time we see a new [token], we predict a new [graph].
Notice that there could be errors.
At the prefix ending with at the pool party with Barack Obama, we get a [graph] with the right [nodes] [for] the person and the [event] subject, but guess the wrong timing [information].
This process goes on until we receive the full [user] [utterance].
How would this affect the execution timeline in the offline [system]?
In the offline [system], we get the program [graph] at the end, so the [system] can only start execution at that point.
Remember that the gray [nodes] are fast operations, so we only consider the execution timeline of the colored slow functions.
First, these two find person functions can be executed in [parallel], highlighted in white from the pink box as they have no [dependency] on other functions.
Next, the create [event] node can get executed after obtaining results from the lower-level [nodes], and then the top function yield, so the whole program is finished.
The execution process is restricted to the program [dependency] [structure], where some operations cannot be parallelized, which induces a noticeable delay.
In our [online] [system], where we predict as we go, the program execution can start earlier.
Here, at the prefix ending with Obama, we predict confidently that the find person function should be in the program, but the rest may contain errors, as they are grayed out.
The execution of that node can be immediately started at this step.
Then, with more [tokens], we predict a totally new [graph], but part of it has already been executed.
So, we only need to consider the rest of the [nodes] that we are confident about as well.
Here, another find person can be executed in [parallel].
Again, we may have wrong predictions.
With more [text], we are better able to get it right, such as the [event] time here, where the AM time is also anticipated correctly.
Then, we can start executing the rest following the program [dependency] [structure].
By overlapping the execution timeline with the [utterance] timeline, we save a big amount of time.
So we proposed the [task] of [online] [semantic parsing].
One underlying assumption is that the execution time dominates the [model] [prediction] time.
So we could only gain time by predicting earlier.
Another assumption is that, as the [prediction] and execution happen in the background and are not visible to users, it is not necessary to maintain a consistent [parsing] history.
So, we reparse from scratch after each [token].
In particular, we [propose] a two step [approach].
A propose step that predicts a [graph] with complete [structure], and a select step that selects the [nodes] that are worth executing at this time.
We had two variants of the proposed [method].
The first [approach] combines [language model] completion with full-[utterance]-to-[graph] [parsing].
In particular, the prefix ending with Obama is first completed through a finetuned [BART] [language model] and then translated into a program with a full offline [parser].
The second [approach] directly predicts the program from [user] [utterance] prefixes.
This is achieved by [training] a single [online] [parser] to translate each prefix to the gold [graph].
This helps the [model] learn the right anticipation.
In a bit more detail, how do we generate these [graphs]?
We formulate the [problem] by [generating] a serial version of the [graph].
Each node or edge is represented by an action.
Here, we start from the first node.
The [number] below records the absolute index in action history.
Then, we got the second node.
Next, is the edge between them.
It contains the pointer to the index of the [previous] node and the edge label.
Zero here means connecting the most recent node with the node [generated] by the zeroth action; then comes the next node and the next edge.
This process goes on until we generate the full [graph].
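As a toy illustration of this action-based serialization (with made-up function and edge names, not the dataset's actual inventory), the sequence might look like this:

```python
# Each entry is an action; its position in the list is the absolute index
# that later edge actions point back to.
actions = [
    ("NODE", "FindPerson"),        # action 0: generate a node
    ("NODE", "Name: Obama"),       # action 1: generate another node
    ("EDGE", 0, "arg:name"),       # action 2: connect the most recent node to node 0
    ("NODE", "CreateEvent"),       # action 3: generate a third node
    ("EDGE", 0, "arg:attendee"),   # action 4: connect CreateEvent to node 0
]
```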
The underlying [model] is based on a [transformer] with a pointer mechanism, [similar] to a [previous] transition-based [parser].
After [generating] a complete [graph], we obtained the action level probabilities that correspond to different parts of the [graph].
We select confident subgraphs to be executed based on a thresholding [heuristic].
Later on, we're going to vary the threshold to achieve different tradeoffs between the latency reduction and the execution cost.
[For] formal [evaluation] of the [online] [methods], we [propose] final latency reduction or [FLR] metric.
Here's a recap of how an offline [system] finishes the execution timeline.
In [online] [systems], execution overlaps with the [utterance] timeline, so it ends earlier.
[FLR] is defined as the reduction time [compared] to the offline [system], marked by the end of the execution.
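In other words, if we write down the time at which all execution finishes under each system, the metric is simply the difference (notation mine):

```latex
\mathrm{FLR} \;=\; t^{\text{finish}}_{\text{offline}} \;-\; t^{\text{finish}}_{\text{online}}
```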
We conduct experiments on two [large] [conversational] [semantic parsing] [datasets], [SMCalFlow] and [TreeDST].
Our [graph]-based [parser], when operating offline, [achieves] state-of-the-art [parsing] performance on both [datasets].
The LM-completion [model] also [achieves] a nontrivial [BLEU] gain [compared] with the simple baseline of no completion.
Now, let's look at the [prediction] accuracy of our prefix to [graph] [parser].
We test the match F1 score of [graph] tuples between the [generation] and the gold [graph] on validation [data], on the y axis, [for] each prefix length on the x axis, represented by percentages.
Each of these curves represents a different [model] with the only difference in [training data].
The bottom curve is the offline [parser], and we mix in prefix [data] of different lengths to transition the [model] to an [online] [parser].
[For] example, the legend prefix eighty percent plus means the [model] is trained with prefix [data] with prefix length larger than eighty percent of the full [utterance] length.
The upper left corner is the desired area.
As we can see, the offline [parser] in black curve is not doing well on the prefix [data].
As we're mixing more prefixes into [training], the curve lifts up and to the left, performing better on all prefix lengths.
Meanwhile, the full [utterance] [parsing] performance, the upper right dot, is not affected.
Based on these strong results, how much latency do we reduce?
We measure the time by the [number] of [source] [tokens] and simulate different function execution times.
The curves show the tradeoff between the [FLR] metric and the execution cost, measured by the [number] of excess function calls that are not correct.
This is achieved by varying the subgraph selection threshold.
A higher threshold selects fewer functions by mistake but obtains a smaller [FLR], whereas a lower threshold more aggressively selects and executes programs.
We compare the two approaches we [propose] and a baseline that does nothing but directly applying the offline [parser] [for] [online] use.
The upper left region has the best [FLR] and cost tradeoff.
We see both of our [methods] beat the baseline by a [large] margin, and they perform more similarly on [TreeDST].
When individual function execution is faster, there tend to be more wrong executions and less room [for] latency reduction.
When individual function execution is slower, there is more room [for] [FLR] improvement.
Our two approaches achieve better performance in different cost regions.
Overall, we achieve thirty to sixty three percent relative latency reduction depending on execution time and allowed cost.
Finally, we have a breakdown of average latency reduction in [tokens] [for] each type of function node when the allowed cost is three wrong executions.
As we can see, there are gains all over the board.
There are also some functions on which we gain impressive latency reduction where the red bar is much longer, such as find manager and recipient.
These are low level functions that do not have much [dependency] on others.
In conclusion, we proposed [online] [semantic parsing] as a new [task] to explore, with a rigorous latency reduction metric.
With a strong [graph] based [semantic] [parser], we achieve relatively good latency reduction either through our pipeline [approach] with LM completion and a full [parser] or directly through a learned [parser] on the prefixes.
[Moreover], our [approach] can be a general framework and can be applied to other executable [semantic] [representations] in different [domains].
Future work could explore smarter [prediction] and execution integration [methods].
Thanks [for] listening.
Hi.
I'm going to discuss our work on [generating] [retrieval] [augmented] counterfactuals [for] [question answering] [tasks].
This is work done during my internship at Google [Research], where I was mentored by Matthew Lamm and Ian Tenney.
To motivate the [task], let me begin by defining a [counterfactual].
In this work, we define a [counterfactual] as a perturbation of the [input] [text] that differs in some meaningful controlled way from the original [text].
And allows us to reason about the changes in the outcome or the [task] label.
[For] instance, changing the [words] fascinating to captivating or expected to mind-numbing changes the [sentiment] [for] this movie review.
Similarly, adding the qualifier women's to the [question] changes the [answer] to the [question] in the example below.
Humans are typically robust to such perturbations [compared] to [NLP] [models] trained on the [task].
Why is that?
The [dataset] may be sampled with systematic [biases] that lead to a simple decision boundary that is violated by the [counterfactual].
As shown in this 2D [classification] [problem].
Prior work has found that adding [counterfactual] examples to the [training data] can make the [model] robust to such perturbations.
So, if counterfactuals are valuable, how can we generate them?
This [task] is especially hard [for] [NLP]. Here are three examples from three different [NLP] [tasks].
As you can see, examples that violate the decision boundary between outcomes need to be very carefully crafted by perturbing some attributes of the [text] that are underlined here.
This could be done by [human] [annotation], but this is expensive and biased.
Some prior work has focused on using [syntax] trees or [semantic role labeling].
But the set of perturbations [generated] by these techniques are limited by the [semantic] framework.
More recent work has used masked [language models] to fill in masked portions of the [text] to change labels.
But finding what parts of the [text] to perturb can be challenging.
There are more challenges to [generating] counterfactuals [for] [question answering] specifically.
This [task] requires background [knowledge].
[For] instance, to perturb the original [question] is Indiana Jones Temple of Doom a prequel?
We need to be aware of the other movies in the franchise to get to a [question] like is Indiana Jones Raiders of the Lost Ark a prequel?
[Furthermore], random perturbations can lead to [questions] that are not answerable with the available evidence or have false premises.
[Moreover], some [question] perturbations can lead to significant [semantic] drift from the original [input].
[For] instance, this [question] is Indiana Jones practicing child slavery in Temple of Doom?
We [propose] a very simple yet effective technique called retrieve, generate, filter, or [RGF], to tackle [counterfactual] perturbations of [questions], which also aims to tackle all the other aforementioned challenges.
The core intuition behind [RGF] is that the necessary background [information] that is needed to generate perturbations may be present in the near misses made by a [question answering] [model].
[For] instance, the state-of-the-art [model] [REALM] produces the following top k answers to the [question] who is the captain of the Richmond Football Club?
While it does recover the original reference passage and the [answer] Trent Cotchin as the topmost choice, it also retrieves additional passages and answers which can be used to guide [question] perturbation.
[For] instance, it recovers two more answers [corresponding] to the captains of the reserve team and the women's team of the same club, and this can lead to interesting edits.
To summarize, [RGF] first retrieves the top k most relevant answers and [contexts] which don't match the reference [answer] and [context].
Following this step, the [question generation] [model] conditions on these alternate answers to generate a [question] that corresponds to them.
And finally, we can filter the [generated] [questions] based on minimality or based on the type of [semantic] perturbation we are interested in introducing.
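Putting the three steps together, the overall loop can be sketched as follows; the retriever_reader and question_generator callables and the passes_filter predicate are stand-ins for REALM-style retrieval, a fine-tuned question generator, and the filtering step described later, so this is a schematic rather than the authors' code:

```python
def rgf_counterfactuals(question, gold_answer, retriever_reader,
                        question_generator, passes_filter, top_k=10):
    # Retrieve: top-k passages and extracted answers for the original question.
    candidates = []
    for answer, passage in retriever_reader(question, top_k=top_k):
        # Keep only "near misses": answers that differ from the reference answer.
        if answer == gold_answer:
            continue
        # Generate: a new question conditioned on the alternate answer in context.
        new_question = question_generator(answer=answer, context=passage)
        # Filter: e.g., minimality (small edit distance) or perturbation type.
        if passes_filter(question, new_question):
            candidates.append((new_question, answer, passage))
    return candidates
```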
Going over each step in greater detail: [for] [retrieval], we use a retrieve-then-read [model] like [REALM] that takes as [input] the original [question] and a [large] [corpus] like [Wikipedia].
It consists of two modules.
The retriever module performs [similarity] [search] over a dense index of passages to retrieve the top k most relevant passages to the [question].
And a reader module then extracts a span from each passage as a potential [answer].
[REALM] retrieves the gold passage and [answer] in most cases.
However, in this work, we are more interested in the answers and [context] that it retrieves further down the line.
In the next step, [question generation], we use these alternate answers and [contexts] to regenerate new [questions] that correspond to these alternatives.
The [question generation] [model] is a pretrained [text]-to-[text] [transformer] that is fine-tuned on the NQ [data] to generate a [question] [for] an [answer] that's marked in [context].
During [inference], we supply the [question generation] [model] with the alternative [answer] and [context] that we [retrieved] in the [previous] step.
[For] example, [for] the [query] who is the captain of the Richmond Football Club? [REALM] retrieves passages about the club's women's team, captained by Jess Kennedy, and the [question generation] [model] generates the [query] who captained Richmond Football Club's first ever women's team?
Which has a specific [semantic] perturbation.
In a [similar] fashion, we also get [queries] like who captained Richmond's [VFL] Reserve team?
Or who did graham negate in the grand final last year?
Finally, we filter out a subset of the [generated] [queries] based on some desired characteristics.
As [motivated] earlier, we would like to ensure that the new [question] is still [semantically] close to the original.
[For] a filtering technique that doesn't require additional supervision, we simply retain new [questions] that have a small [token]-level [edit] distance from the original [question].
[For] example, we remove the [question] who did graham negate in the grand final last year?
Because it has a longer [edit] distance from the original [question].
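A minimal sketch of such a minimality filter, using a standard token-level Levenshtein distance (the threshold value here is an arbitrary placeholder, not the one used in the paper):

```python
def token_edit_distance(a_tokens, b_tokens):
    # Standard Levenshtein distance computed over tokens.
    n, m = len(a_tokens), len(b_tokens)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a_tokens[i - 1] == b_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[n][m]


def keep_minimal(original, generated, max_distance=4):
    # Retain only generated questions close to the original question.
    return token_edit_distance(original.split(), generated.split()) <= max_distance
```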
In our experiments, we demonstrate that this simple [heuristic] can be used to augment NQ [training data].
We also experiment with a filtering strategy that is based on the type of [semantic] perturbation.
To this end, we use a general purpose [query] decomposition framework called [QED].
[QED] identifies two parts to the [question], a [predicate] and a reference.
References are [noun] phrases in the [question] that correspond to [entities] in the [context].
A [predicate] is basically the remaining portion of the [question].
[For] example, we are able to decompose the [query] who captained Richmond's first ever women's team into a reference, Richmond Football Club women's team, and the [predicate], who captained X.
A [model] trained on reference [predicate] [annotations] [for] NQ gives us this [question] decomposition.
Decomposing both the original and [generated] [question] based on [QED] allows us to categorize our [generated] counterfactuals [for] [evaluation].
Specifically, we obtain two groups of [questions].
Those that undergo a reference change while retaining [predicates], and those that undergo a [predicate] change and optionally add references.
[For] instance, who captained Richmond's [VFL] reserve team is a reference change, while who wears [number] nine [for] the club is a [predicate] change.
We now evaluate the effectiveness of [RGF] perturbations when [augmented] to [training data].
To evaluate the effectiveness of [counterfactual] [augmentation] in particular, we experiment with two strong [data augmentation] [baselines].
The first baseline, called random [answer] and [question generation], adds [data] that has no [relation] with the original [question].
That is, passages and answers are simply randomly sampled from [Wikipedia].
This baseline basically adds more [data] that looks like NQ.
With the second baseline gold [answer] and [question generation], we specifically update the [retrieval] portion of our [method].
Here, alternate answers are just chosen from the same passage that contained the gold [answer].
How do the [baselines] and [RGF] [augmentation] perform on [reading comprehension], where the [model] has access to the [question] and [context]?
We experiment with six out-of-[domain] [datasets] and present results here, where the [training data] is doubled through [augmentation].
We find that both [data augmentation] [baselines] are not able to improve out-of-[domain] [generalization].
In fact, an ensemble of six [models] trained on the original [data] seems to be the most competitive baseline.
Comparing against that baseline, we find that [RGF] counterfactuals are able to improve out of [domain] performance while maintaining in [domain] performance.
This suggests that filling in the [reasoning] gaps of the [model] via [counterfactual] [augmentation] is more effective than adding more [data] from the [training] distribution.
[Furthermore], we find that using [retrieval] to sample alternative outcomes or answers is important [for] effective [CDA].
We also experiment with the open-[domain] [QA] setting, where the [model] only sees the [question], and once again we evaluate on four out-of-[domain] [datasets].
We find that baseline [models] are not as effective [for] out of [domain] [generalization].
However, [data augmentation] with [RGF] shows more significant improvements.
We even improve on the in-[domain] NQ [dataset].
We hypothesized that the [counterfactual] [data augmentation] aids the [model] in [learning] better [query] encodings [for] very [similar] [queries].
Finally, we also evaluate the [model]'s ability to improve consistency in the local neighborhood of the original [question].
Consistency measures the proportion of [question] pairs where both the original and the [counterfactual] [query] are correctly answered by the [model].
This explicitly helps us to measure the [model]'s [robustness] to small perturbations in the neighborhood of the original [input].
We experiment with five [datasets] which contain pairs of [questions] that are [semantically] close to each other.
Apart from the three [datasets] [AQA], [AmbigQA] and [QUOREF]-Contrast set that are already available, we also evaluate on [RGF] counterfactuals that are paired with original NQ [questions] based on whether they underwent a [predicate] change or reference change.
These subsets were [annotated] in-house to eliminate noise and are provided as a resource.
All [baselines] are unable to significantly improve consistency with the ensemble [model] improving consistency by a small margin.
However, [RGF] [counterfactual] [augmentation] has impressive gains in consistency both on prior [datasets] and the two subsets we curated [for] reference and [predicate] perturbations.
Note that the [augmented] [RGF] [data] is not biased by perturbation type, only the [evaluation] sets are.
In fact, a [qualitative] inspection of the kinds of counterfactuals [generated] show that the [generated] [questions] contain several diverse perturbations.
[For] instance, this original [question] on the population of Walnut Grove, Minnesota is perturbed along different dimensions like town, state, country, and along different [predicates] like location, poverty, [number] of schools.
A lot of the perturbations are [context]-specific.
[For] example, [for] this other [question] about the Wimbledon singles tournament, the perturbation is along the type of game, the type of tournament, or the game outcome.
Final takeaways: we tackle the [task] of [counterfactual] [data augmentation] and perturbation [for] [information]-seeking [queries] and address its unique challenges via a reversal of the [generation] [approach]: over-generate using near misses of the [model], and filter based on perturbation type or minimality.
We find that this technique requires no additional supervision, and the examples come [labeled] [for] [augmentation].
[Augmentation] improves out of [domain] [generalization] and neighborhood consistency.
And we find that [RGF] counterfactuals are [semantically] diverse without introducing bias during [augmentation].
Thank you.
