Hi everyone. Today I'm going to present our research work, Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction.
I'm Allan from ByteDance AI Lab, and this is joint work with Jierui Li from the University of Texas at Austin and Wei Lu from SUTD.
First, I'd like to talk about our motivation for reasoning.
Here we show an example where multi-step reasoning is helpful.
This figure is taken from the PaLM paper, where they perform prompting to solve math word problems in a few-shot learning scenario.
On the left-hand side, we can see that if we give some examples with just questions and answers, we might not be able to obtain the correct answers.
But if we also give a reasoning description, the model is able to predict the reasoning description and also make a correct prediction.
So it is good to have interpretable multi-step reasoning as output.
And we also think math word problem solving is a straightforward application for evaluating such reasoning abilities.
In our problem setup, given a question, we need to solve it and obtain the numerical answer.
In our datasets, we are also given the mathematical expression that leads to this particular answer.
Certain assumptions also apply, as in previous work.
We assume the positions of the quantities are known.
And we only consider basic operators: addition, subtraction, multiplication, division, and exponentiation.
More complicated operators can actually be decomposed into these basic operators.
Previous work on math word problem solving can be categorized into sequence-to-sequence and sequence-to-tree models.
Traditional sequence-to-sequence models convert the expression into a specific sequence for generation.
This is pretty easy to implement, and it can generalize to many different complicated problems.
But the drawbacks are that the performance is generally not better than structured models, and it lacks interpretability in its predictions.
Still, this direction is quite popular because of transformer models.
In tree-based models, we structure the expressions in tree form and follow a pre-order traversal during tree generation.
Here we keep generating operators until we reach the leaves, which are the quantities.
The good thing is that this gives us a binary tree structure, but it is quite counterintuitive, because we generate the operators first and only generate the quantities at the end.
The second issue is that it contains some repetitive computations.
If we look at this expression, eight times three plus three is actually generated twice, but in fact we should reuse the result.
In our proposed approach, we want to solve these problems in a step-by-step and interpretable manner.
For example, in the second step here, we obtain the divisor, which is twenty-seven, and we can refer back to the original question to find the relevant content that supports this step.
Then, at the third step, we actually get the quotient.
After these three steps, we can reuse the result from the second step to get the result of the fourth step, and finally we obtain the dividend.
Here we generate a whole expression directly, rather than generating a single operator or quantity.
This makes the process more accurate.
In our deductive system, we start with the quantities presented in the question, together with some constants, as our initial state.
An expression is represented by e_{i,j}^{op},
where we apply the operator op from q_i to q_j, and such an expression is directed.
We also have a reversed subtraction to represent the opposite direction.
This is quite similar to relation extraction.
So, formally, in our deductive system, at time step t we apply an operator to a pair (q_i, q_j), and we obtain a new expression.
We add it to the next state as a new quantity.
These slides visualize the evolution of the state, where we keep adding expressions to the current state.
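To put the deductive step in notation, here is a compact reconstruction from the talk (the paper's exact symbols may differ):

```latex
% State evolution of the deductive reasoner (reconstructed sketch)
S_0 = \{q_1, \dots, q_n\} \cup \mathcal{C}
      \quad \text{(question quantities plus constants)}
e_t = q_i \; \mathrm{op} \; q_j, \qquad q_i, q_j \in S_{t-1}
S_t = S_{t-1} \cup \{q_{n+t}\}, \qquad q_{n+t} := e_t
      \quad \text{(the new expression becomes a quantity)}
```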
In our model implementation, we first use a pretrained language model, which can be BERT or RoBERTa, to encode the sentence and obtain the quantity representations.
Once we get the quantity representations, we can start to do inference.
Here we show an example of obtaining the representation for q_1 divided by q_2, and then times q_3.
First we get the pair representation, which is basically just the concatenation of q_1 and q_2, and then we apply a feed-forward network that is parameterized by the operator.
Finally, we obtain the expression representation for q_1 divided by q_2.
But in practice, at the inference stage, we might get incorrect expressions as well.
Here, the number of all possible expressions equals three times the number of operators.
The nice thing is that we can easily add constraints to control this search space.
For example, if an expression is not allowed, we can simply remove it from the search space.
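As a rough sketch of this scoring step, with hypothetical module and function names (the actual implementation is in the paper's code release; this sketch also collapses the expression representation and its score into one feed-forward network):

```python
import itertools
import torch
import torch.nn as nn

class DeductiveStep(nn.Module):
    """Scores every candidate expression q_i op q_j at one deduction step."""

    def __init__(self, hidden: int, num_ops: int):
        super().__init__()
        # One feed-forward network per operator, applied to the
        # concatenated pair representation [q_i; q_j].
        self.op_ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
            for _ in range(num_ops))

    def forward(self, quantities, allowed=None):
        # quantities: (n, hidden) tensor of quantity representations.
        n = quantities.size(0)
        scores = {}
        for i, j in itertools.product(range(n), repeat=2):
            pair = torch.cat([quantities[i], quantities[j]], dim=-1)
            for op, ffn in enumerate(self.op_ffn):
                # Constraint hook: disallowed expressions are simply
                # removed from the search space.
                if allowed is not None and not allowed(i, j, op):
                    continue
                scores[(i, j, op)] = ffn(pair)
        return scores  # argmax over this dict gives the next expression
```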
In the second step, we do the same thing; the only difference is that there is one more quantity.
This quantity comes from the previously calculated expression.
Finally, we obtain the final expression q_3 times q_4.
We can also see that the number of all possible expressions is different from the previous step.
Such a difference makes it hard to apply beam search, because the probability distributions between these two steps are unbalanced.
The training procedure is similar to training a sequence-to-sequence model, where we optimize the loss at each time step.
Here we also use tau to represent when we should terminate the generation process.
The output space differs from sequence-to-sequence: the space is different at each time step, while in a traditional sequence-to-sequence model it is the vocabulary size.
And it also allows us to impose certain constraints from prior knowledge.
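Written as a formula, the per-step training objective described here would look roughly like this (our reconstruction of the notation, with tau_t the binary termination decision at step t; the paper's exact formulation may differ):

```latex
\mathcal{L} = -\sum_{t=1}^{T} \log P\left(e_t^{*}, \tau_t \mid S_{t-1}\right)
```

where S_{t-1} is the state of quantities available before step t and e_t^{*} is the gold expression at step t.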
We conduct experiments on the commonly used math word problem datasets MAWPS, Math23K, MathQA, and SVAMP.
Here we briefly show the results compared with the previous best approaches.
Our best-performing variant is Roberta-DeductiveReasoner.
And in fact we do not use beam search; in contrast, all previous approaches use beam search.
All right. The best previous approaches are often tree-based models.
Overall, our reasoner is able to significantly outperform these tree-based models.
But we can see that the absolute numbers on MathQA and SVAMP are not really high.
So we further investigate the results on SVAMP.
This dataset is challenging because the authors manually added distractors to confuse NLP models, such as irrelevant information and extra quantities.
In our predictions, we find that some of the intermediate values are actually negative.
For example, in this question we are asked how many apples Jake has.
But we have some extra information, like seventeen fewer pictures, and Steven has eight pictures, which is totally irrelevant.
Our model makes a prediction like this, producing negative values.
And we observe that these two expressions actually have similar scores.
So we can limit the search space by removing results that are negative, so that we get the correct answer.
We further find that such a constraint actually improves performance quite a lot for some models.
For example, for BERT we improve by seven points, and for the RoBERTa-base model we improve by two points.
A better language model has better language understanding abilities, which is why the number here is higher for RoBERTa and lower for BERT.
We also try to analyze the difficulty behind all these datasets.
We assume the number of unused quantities can be regarded as a measure of irrelevant information.
Here we show the percentage of samples with unused quantities, and the SVAMP dataset has the largest portion.
And here we also show the overall performance.
For samples without unused quantities, the performance is actually higher than the overall performance,
while on samples with unused quantities, it is way worse than the overall performance.
For MAWPS, we don't really have many such test cases, so we ignore this part.
Finally, we want to show the interpretability through a question perturbation example.
Here our model actually makes a wrong prediction at the first step.
We can correlate this expression with this sentence here.
We think this sentence might be misleading the model into an incorrect prediction.
Here, "planting another thirty-five" makes the model think it should be an addition operator.
So we revise the sentence to something like "the number of pear trees is thirty-five fewer than the apple trees".
We make it convey more accurate semantics, such that the model is able to make the correct prediction.
This study shows how interpretable predictions help us understand the model's behavior.
To conclude our work: first, our model is actually pretty efficient.
We are able to provide an interpretable solving procedure.
And we can easily incorporate prior knowledge as constraints, which can help improve the performance.
The last thing is that the underlying mechanism applies not only to math word problem solving but also to other tasks that involve multi-step reasoning.
We also have certain limitations.
If we have a large number of operators or constants, the memory consumption could be pretty high.
And second, as mentioned, because the probability distribution is unbalanced between different time steps, it is also pretty challenging to apply a beam search strategy.
So this is the end of the talk, and questions are welcome. Thank you.
Hi, my name is Antoine, and I'm from Maastricht University.
I will be presenting my joint work with Jerry, which is about a new dataset for statutory article retrieval.
Legal issues are an integral part of many people's lives.
But the majority of citizens have little to no knowledge of their rights and fundamental legal processes.
As a result, many vulnerable citizens who cannot afford the costly assistance of a legal expert are left unprotected or, worse, exploited.
Our work aims to bridge the gap between people and the law by developing an effective retrieval system for statutory articles.
Such a system could provide a free, professional legal help service for people without legal training.
Before diving into the main contribution of this work, let's first describe the problem of statutory article retrieval.
Given a simple question on a legal matter, such as "What do I risk if I violate professional confidentiality?",
a model is required to retrieve all relevant statutory articles from a large body of legislation.
This information retrieval task comes with its own set of challenges.
First, it deals with two types of language:
common natural language for the questions, and complex legal language for the statutes.
This difference in language distributions makes it harder for a system to retrieve relevant candidates, as it indirectly requires an interpretation step that can translate a natural question into a legal question matching the terminology of statutes.
Besides, statutory law is not a stack of independent articles that can be treated as a complete source of information on their own, unlike news or recipes, for example.
Instead, it is a structured collection of legal provisions that have their full meaning only when considered in the overall context, that is, together with the supplementary information from the neighboring articles, the fields and subfields they belong to, and their place in the structure of the law.
Lastly, statutory articles aren't the small paragraphs that are usually the typical retrieval unit in most retrieval work.
Here, they are long documents that may be up to six thousand words.
The recent advances in NLP have sparked huge interest in many legal tasks, such as legal judgment prediction or automated contract review.
But statutory article retrieval has remained mostly untouched due to the lack of large, high-quality labeled datasets.
In this work, we present a new French-native, citizen-centric dataset to study whether retrieval models can approximate the efficiency and reliability of a legal expert on the task of statutory article retrieval.
Our Belgian Statutory Article Retrieval Dataset, BSARD, consists of more than one thousand one hundred legal questions posed by Belgian citizens.
These questions cover a wide range of topics, from family, housing, and money to work and social security.
Each of them has been labeled by experienced jurists with references to relevant articles from a corpus of more than twenty-two thousand six hundred legal articles from Belgian codes of law.
Let's now talk about how we collected this dataset.
First, we started by compiling a large corpus of legal articles.
We considered thirty-two publicly available Belgian codes and extracted all their articles as well as the corresponding section headings.
Then we gathered legal questions with references to relevant statutes.
To do so, we partnered with a Belgian law firm that receives around four thousand emails each year from Belgian citizens who ask for advice on a personal legal issue.
We were lucky enough to get access to their website, where their team of experienced jurists addresses Belgians' most common legal issues.
We collected thousands of questions annotated with categories, subcategories, and legal references to relevant statutes.
Lastly, we parsed the legal references and filtered out the questions whose references were not articles in one of the codes of law we considered.
The remaining references were matched and converted to the corresponding article IDs from our corpus.
We eventually ended up with one thousand one hundred and eight questions, each carefully labeled with the IDs of the relevant articles from our large corpus of twenty-two thousand six hundred and thirty-three statutory articles.
In addition, each question comes with a main category and a concatenation of subcategories.
And each article comes with a concatenation of the successive headings above it in the structure of the law.
This extra information is not used in the present work, but might be of interest for future research on legal information retrieval or legal text classification.
Let's look at some characteristics of our dataset.
The questions are between five and forty-four words long, with a median of fourteen words.
The articles are much longer, with a median length of seventy-seven words, and one hundred and forty-two of them exceed one thousand words.
The lengthiest one is up to five thousand seven hundred and ninety words.
As previously mentioned, the questions cover a wide range of topics, with around eighty-five percent of them being about family, housing, money, or justice,
while the remaining fifteen percent concern social security, foreigners, or work.
The articles are also very diverse, as they come from thirty-two different Belgian codes that cover a large number of legal topics.
Here is the total number of articles collected from each of these Belgian codes.
Out of the twenty-two thousand six hundred and thirty-three articles, only one thousand six hundred and twelve are referred to as relevant to at least one question in the dataset.
And around eighty percent of these cited articles come from either the civil code, the judicial code, the code of criminal investigation, or the penal code.
Meanwhile, eighteen out of the thirty-two codes have fewer than five articles mentioned as relevant to at least one question,
which can be explained by the fact that those codes focus less on individuals and their concerns.
Overall, the median number of citations for these cited articles is two, and fewer than twenty-five percent of them are cited more than five times.
Using our dataset, we benchmarked several retrieval approaches, including lexical and dense architectures.
Given a query and an article, a lexical model assigns a score to the query-article pair by summing, over the query terms, the weight of each of these terms in that article.
We experiment with the standard TF-IDF and BM25 ranking functions.
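For reference, the standard BM25 function scores a query q against an article a as:

```latex
\mathrm{BM25}(q, a) = \sum_{t \in q} \mathrm{IDF}(t) \cdot
  \frac{f(t, a)\,(k_1 + 1)}
       {f(t, a) + k_1 \left(1 - b + b \cdot \frac{|a|}{\mathrm{avgdl}}\right)}
```

where f(t, a) is the frequency of term t in article a, |a| is the article length, avgdl is the average article length in the corpus, and k_1 and b are free parameters.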
The main problem with these approaches is that they can only retrieve articles that contain keywords present in the query.
To overcome this limitation, we experiment with neural architectures that can capture semantic relationships between queries and articles.
We use a bi-encoder model that maps queries and articles into dense vector representations, and calculates the relevance score of a query-article pair as the similarity of their embeddings.
These embeddings typically result from a pooling operation on the output of a word embedding model.
First, we study the effectiveness of Siamese bi-encoders in a zero-shot evaluation setup, meaning that pretrained word embedding models are applied out of the box without any additional finetuning.
We experiment with context-independent text encoders, namely word2vec and fastText, and context-dependent embedding models, namely RoBERTa, and more specifically CamemBERT, which is a French RoBERTa model.
Additionally, we train our own CamemBERT-based bi-encoders on our dataset.
Note that for training, we experiment with the two flavors of the bi-encoder architecture:
Siamese, which uses a single word embedding model that maps the query and article together into a shared dense vector space, and two-tower, which uses two independent word embedding models that encode the query and article separately into different embedding spaces.
We experiment with mean, max, and CLS pooling, as well as dot product and cosine for computing similarities.
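Here is a minimal sketch of the Siamese bi-encoder scoring, using mean pooling and cosine similarity from the options just mentioned (the checkpoint and example strings are illustrative; see the paper for the actual training setup):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Illustrative French checkpoint, matching the CamemBERT setup.
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
encoder = AutoModel.from_pretrained("camembert-base")

def embed(texts):
    """Mean-pool token embeddings into one dense vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state       # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Siamese setup: the same encoder embeds both queries and articles.
query = embed(["Que risque-t-on en violant le secret professionnel ?"])
article = embed(["Art. 458. Les médecins, chirurgiens, ... seront punis ..."])
score = F.cosine_similarity(query, article)  # relevance score for ranking
```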
Here are the results of our baselines on the test set,
with the lexical methods at the top, the Siamese bi-encoders evaluated in a zero-shot setup in the middle, and the finetuned bi-encoders at the bottom.
Overall, the finetuned bi-encoders significantly outperform all the other baselines.
The two-tower model improves over its Siamese variant on recall at one hundred, but performs similarly on the other metrics.
Although BM25 underperforms the trained bi-encoders significantly, its performance indicates that it is still a strong baseline for domain-specific retrieval.
Regarding the zero-shot evaluation of Siamese bi-encoders, we find that directly using the embeddings of a pretrained CamemBERT model without optimizing for the information retrieval task gives poor results, which is consistent with previous findings.
Furthermore, we observe that the word2vec-based bi-encoder significantly outperforms the fastText- and BERT-based models, suggesting that pretrained word-level embeddings may be more appropriate for the task than character-level or subword-level embeddings when used out of the box.
Although promising, these results leave ample room for improvement compared to a skilled legal expert, who could eventually retrieve all relevant articles for any question and thus obtain perfect scores.
Let's conclude by discussing two limitations of our dataset.
First, the corpus of articles is limited to those collected from the thirty-two considered Belgian codes, which does not cover the entire Belgian law, as articles from decrees, directives, and ordinances are missing.
During the dataset construction, all references to these uncollected articles are ignored, which causes some questions to end up with only a fraction of the initial number of relevant articles.
This implies that the answer contained in the remaining relevant articles might be incomplete, although still entirely appropriate.
Second, we should note that not all legal questions can be answered with statutes alone.
For instance, the question "Can I evict my tenants if they make too much noise?"
might not have a detailed answer within statutory law that quantifies a specific noise threshold at which eviction is allowed.
Instead, the landlord should probably rely more on case law and find precedents similar to their current situation,
for example, that the tenants throw two parties a week until two a.m.
Hence, some questions are better suited than others to the statutory article retrieval task, and the domain of the less suitable ones remains to be determined.
We hope that our work sparks interest in developing practical and reliable statutory article retrieval models
that can help improve access to justice for all.
You can check out our paper, dataset, and code at the following links. Thank you.
Hello, we are happy to present our work on VALSE, a task-independent benchmark for testing vision-and-language models on specific linguistic phenomena.
Why did we go to the trouble of setting up this benchmark?
Well, during the last years, we have seen an explosion of transformer-based vision-and-language models pretrained on large amounts of image-text pairs.
Each of these models pushes the state of the art on vision-and-language tasks such as visual question answering, visual commonsense reasoning, image retrieval, and phrase grounding.
So we get the message: the accuracies on these tasks and specific benchmarks are increasing steadily.
But do we know what the models have actually learned?
What is it that a vision-and-language transformer understood when assigning a high score for this image and this sentence to match,
and a low score for this one?
Do vision-and-language models focus on the right thing?
Or do they focus on biases, as shown by previous work?
To shed more light on this aspect, we propose a more task-agnostic direction and introduce VALSE, which tests the sensitivity of vision-and-language models to specific linguistic phenomena that affect both the linguistic and the visual modalities.
We target existence, plurality, counting, spatial relations, actions, and entity coreference.
But how do we test whether the vision-and-language models have captured these phenomena?
By foiling, a method previously applied to vision-and-language models only for noun phrases by Ravi Shekhar and collaborators, and for counting by us in previous work.
Foiling basically means that we take the caption of an image and produce a foil by altering the caption such that it no longer describes the image.
We make these alterations by focusing on six specific pieces, namely existence, plurality, counting, spatial relations, actions, and entity coreference, where each piece can consist of one or more instruments, in case we found more than one interesting way to create foil instances.
For example, in the case of the actions piece, we have two instruments: one in which the action verb is changed to a different action, and one in which the actants are swapped.
Counting and coreference are also pieces that have more than one instrument.
And we create these foils while making sure that they fail to describe the image, and that they are grammatical and otherwise valid sentences.
This is not easy to do, because a foiled caption may be less likely than the original caption.
For example, though it is not impossible, it is statistically less likely for plants to cut a man than for a man to cut plants, and large vision-and-language models could pick up on this.
Therefore, to obtain valid foils, we must take action.
First, we make use of strong language models to propose foils.
Second, we use natural language inference, or NLI for short, to filter out foils that could still be describing the image, since when constructing foils we need to ensure that they fail to describe the image.
To test this automatically, we apply natural language inference with the following rationale.
We consider an image to be the premise and its caption to be its entailed hypothesis.
In addition, we consider the caption to be the premise and the foil to be its hypothesis.
If an NLI model predicts the foil to contradict or to be neutral with respect to the caption, we take this as an indicator of a valid foil.
If an NLI model predicts the foil to be entailed by the caption, it cannot be a good foil, since by transitivity it would give a truthful description of the image, and we filter these foils out.
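A small sketch of this filtering rule (the NLI checkpoint here is an illustrative choice, not necessarily the one we used):

```python
from transformers import pipeline

# Any strong NLI model works here; this one is illustrative.
nli = pipeline("text-classification", model="roberta-large-mnli")

def is_valid_foil(caption: str, foil: str) -> bool:
    """Keep a foil only if the caption does NOT entail it."""
    pred = nli({"text": caption, "text_pair": foil})[0]["label"]
    # NEUTRAL or CONTRADICTION suggests a valid foil; ENTAILMENT
    # means the foil may still truthfully describe the image.
    return pred != "ENTAILMENT"

print(is_valid_foil("A man cuts plants in a garden.",
                    "Plants cut a man in a garden."))
```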
But this procedure is not perfect; it is just an indicator of valid foils.
Therefore, as a third measure for generating valid foils, we employ human annotators to validate the data used in VALSE.
So, after filtering and human evaluation, we have as many test instances as described in this table.
Note that VALSE does not deliver any training data, but only test data.
Since it is a zero-shot testing benchmark only, it is designed to probe the existing capabilities of vision-and-language models after pretraining.
Finetuning would only enable models to exploit artifacts or statistical biases in the data.
And we all know that these models like to cheat and take shortcuts.
And as we said, we are interested in assessing what capabilities the vision-and-language models have after pretraining.
We experiment with five vision-and-language models on VALSE, namely CLIP, LXMERT, ViLBERT, ViLBERT 12-in-1, and VisualBERT.
One of our most important evaluation metrics is the accuracy of the models in classifying image-sentence pairs into captions and foils.
Perhaps more relevant for this video, we will showcase our more permissive metric, the pairwise accuracy, which measures whether the image-sentence alignment score is greater for the correct image-text pair than for its foiled pair.
For more metrics and results on them, do check out our paper.
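In code, the pairwise accuracy is roughly the following (names are illustrative):

```python
def pairwise_accuracy(examples, score):
    """Fraction of instances where the caption outscores its foil.

    examples: list of (image, caption, foil) triples.
    score(image, text): the model's image-sentence alignment score.
    """
    wins = sum(score(img, caption) > score(img, foil)
               for img, caption, foil in examples)
    return wins / len(examples)
```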
The results with pairwise accuracy are shown here, and they are consistent with the results we got from the other metrics: the best zero-shot performance is achieved by ViLBERT 12-in-1, followed by ViLBERT, LXMERT, CLIP, and finally VisualBERT.
It is notable how instruments centered on individual objects, like existence and noun phrases, are almost solved by ViLBERT 12-in-1, highlighting that models are capable of identifying named objects and their presence in images.
However, none of the remaining pieces can be reliably solved in our adversarial foiling setting.
We see from the plurality and counting instruments that vision-and-language models have trouble distinguishing references to single versus multiple objects, or counting them in an image.
The relations piece shows that they have difficulties in correctly classifying a named spatial relation between objects in an image.
They also have trouble distinguishing actions and identifying their participants, even when supported by plausibility biases, as we see in the actions piece.
From the coreference piece, we find that tracing multiple references to the same object in an image by using pronouns is also difficult for vision-and-language models.
As a sanity check, and because it is an interesting experiment, we also benchmark two text-only models, GPT-1 and GPT-2, to assess whether VALSE is solvable by these unimodal models, by computing the perplexity of the correct and the foiled caption, with no image involved, and predicting the entry with the lower perplexity.
If the perplexity is higher for the foil, we take this as an indication that the foiled caption may suffer from plausibility bias or other linguistic biases.
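A minimal sketch of this unimodal check with GPT-2 (illustrative; GPT-1 is run the same way):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Mean next-token cross-entropy over the sequence, exponentiated.
        loss = lm(ids, labels=ids).loss
    return torch.exp(loss).item()

caption, foil = "A man cuts plants.", "Plants cut a man."
# The text-only model predicts the sentence with the lower perplexity.
prediction_is_caption = perplexity(caption) < perplexity(foil)
```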
And it is interesting to see that in some cases, the text-only GPT models have captured the plausibility of the world better than the vision-and-language models.
To sum up, VALSE is a benchmark that uses the lens of linguistic constructs to help the community improve vision-and-language models by stress-testing their visual grounding capabilities.
Our experiments show that vision-and-language models identify named objects and their presence in images well, as shown by the existence piece, but struggle to ground their interdependence and relationships in visual scenes when forced to respect linguistic indicators.
We would really like to encourage the community to use VALSE for measuring progress towards language grounding with vision-and-language models.
And even more, VALSE could be used as an indirect assessment of datasets, as models could be evaluated before and after training or finetuning to see whether a dataset helps models improve on any of the aspects tested by VALSE.
If you're interested, do check out the VALSE data on GitHub, and if you have any questions, do not hesitate to contact us.
Hello, my name is Kamezawa from the University of Tokyo.
I'll be presenting a paper entitled RNSum: A Large-Scale Dataset for Automatic Release Note Generation via Commit Logs Summarization.
I'll be explaining in this order.
First, I will introduce automatic release note generation, which we are working on in this research.
A release note is a technical document that summarizes the changes distributed with each release of a software product.
The image shows a release note for version two point six point four of the Vue.js library.
Release notes play an important role in open-source development, but they are time-consuming to prepare manually.
Therefore, it would be very useful to be able to automatically generate high-quality release notes.
I will refer to two previous studies on automatic release note generation.
The first is a system called ARENA, released in twenty fourteen.
It takes a rule-based approach, for example using a change extractor to extract code differences, library changes, and document changes from the differences between releases, and finally combining them.
The most notable feature of this system is the issue extractor in the upper right corner,
which relies on Jira, the issue tracker system, and can only be applied to projects that use Jira.
In other words, it cannot be used for many projects on GitHub.
The second is Glyph, recently announced in twenty twenty.
It is available on the internet and can be installed via pip.
This system has a simple learning-based text classification model and outputs one of five labels, such as features or bug fixes, for each input commit message.
This image is a sample usage that returns a corrective, or bug fixes, label.
Glyph's training data is fairly small, at about five thousand examples, and as will be shown in the experiments described below,
the performance of its text classification model is not high.
I presented two related studies, but their problems are limited applicability and scarce data resources.
Our paper solves these two problems and automatically generates high-quality release notes.
For the limited applicability problem, we propose a high-quality class-wise summarization method that uses only commit messages as input.
This proposed method can be used for all English repositories.
For the second problem of scarce data resources, we built our RNSum dataset, consisting of about eighty-two thousand examples, by collecting data from public GitHub repositories using the GitHub API.
Next, I'll describe our dataset.
Here is an example of the data.
The left side shows the commit messages, and the right side shows the release notes.
Release notes are labeled as improvements or fixes, et cetera.
We set up a task that takes the commit messages as input and outputs labeled release notes.
This can be regarded as a summarization task.
We have predefined four labels: features; improvements; bug fixes; and deprecations, removals, and breaking changes.
These were set based on previous research and other factors.
The release note on the bottom right is extracted from the release note on the bottom left.
At this point, it is necessary to detect the four labels that were set up in advance.
But the labels are not always consistent across repositories.
For example, the improvements label includes improvements, enhancements, optimizations, and so on.
We prepared a vocabulary list of about thirty label names covering these notational variations.
This list is used to detect the release note class, and we collect the text that follows as the release note sentences for that class.
Next are the commit messages.
Commit messages are not tied to each release.
As shown in the image below, if the current release is version two point five point nineteen, we need to identify the previous release, version two point five point eighteen, and get a diff.
This is a bit tedious, and it is not enough to just get a list of releases and look at the entries before and after.
We created a heuristic matching rule to get the previous and next versions.
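The exact rule isn't spelled out in this talk; as a rough illustration, one such heuristic could parse tags into version tuples and take the nearest smaller one:

```python
import re

def version_key(tag):
    """Turn a tag like 'v2.5.19' into a sortable tuple (2, 5, 19)."""
    return tuple(int(n) for n in re.findall(r"\d+", tag))

def previous_release(current, all_tags):
    """Nearest release strictly older than `current`, or None."""
    older = [t for t in all_tags if version_key(t) < version_key(current)]
    return max(older, key=version_key) if older else None

tags = ["v2.5.17", "v2.5.18", "v2.5.19", "v2.6.0"]
print(previous_release("v2.5.19", tags))  # -> v2.5.18
```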
Dataset analysis.
In the end, seven thousand two hundred repositories and eighty-two thousand examples were collected.
Also, the average number of release note tokens is sixty-three, which is quite long for a summarization task.
Also, the number of unique tokens is quite large, at eight hundred thirty thousand.
This is due to the large number of unique class and method names found in the repositories.
Next, I will explain the proposed method.
The class-wise extractive-then-abstractive summarization model, CEAS, consists of two neural modules:
a classifier using BERT or CodeBERT, and a generator using BART.
First, CEAS uses the classifier to classify each commit message into five release note classes: features, improvements, bug fixes, and deprecations/removals/breaking changes, plus an "other" class.
The commit messages classified as "other" are discarded.
Then CEAS applies the generator to the four labeled documents independently and generates release notes for each class.
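A sketch of the CEAS pipeline, assuming hypothetical classify and generate helpers standing in for the BERT/CodeBERT classifier and the BART generator:

```python
LABELS = ["features", "improvements", "bug fixes",
          "deprecations/removals/breaking changes"]

def ceas(commit_messages, classify, generate):
    """Class-wise extractive-then-abstractive summarization (sketch).

    classify(msg)  -> one of LABELS or "other"  (BERT/CodeBERT classifier)
    generate(text) -> summary string            (BART generator)
    """
    buckets = {label: [] for label in LABELS}
    for msg in commit_messages:
        label = classify(msg)
        if label != "other":      # commits labeled "other" are discarded
            buckets[label].append(msg)
    # The generator summarizes each class-specific document independently.
    return {label: generate("\n".join(msgs))
            for label, msgs in buckets.items() if msgs}
```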
In this task, the direct correspondences between commit messages and release notes are not known.
Therefore, to train the classifier, we assigned pseudo labels to each input commit message by matching on the first ten characters of each commit message.
We modeled the class-wise abstractive summarization approach, CAS, with two different methods.
The first model, which we call CAS-Single, consists of a single seq2seq network and generates a single release note text given a concatenation of the input commit messages.
The output text can be divided into class-wise segments based on special class-specific endpoint symbols.
The second method, which we call CAS-Multi, consists of four different seq2seq networks, each of which corresponds to one of the fixed release note classes.
Okay, let me explain the experiments.
Five methods were compared: CEAS, CAS-Single, CAS-Multi, Clustering, and the previous study, Glyph.
Regarding evaluation: in some cases, the release notes are output as multiple sentences.
Since it is difficult to score them as separate sentences, they are joined with spaces and treated as one long sentence.
BLEU penalizes the system when it outputs a short sentence.
This penalty results in lower BLEU values in the experimental results described next.
Finally, we also calculate the specificity, because ROUGE and BLEU cannot be calculated when the release notes are empty.
A higher specificity means that the model correctly outputs an empty text in cases where the reference release notes are empty.
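In other words, the specificity can be computed like this (variable names are illustrative):

```python
def specificity(references, outputs):
    """Fraction of empty-reference cases where the output is also empty."""
    empty_refs = [(ref, out) for ref, out in zip(references, outputs)
                  if not ref.strip()]
    if not empty_refs:
        return None  # undefined when no reference release note is empty
    return sum(not out.strip() for _, out in empty_refs) / len(empty_refs)
```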
Here are the results.
Since the dataset contains e-mail addresses, hash values, et cetera, we also evaluated on the cleaned dataset, which excludes them.
CEAS and CAS achieved ROUGE-L scores more than ten points higher than the baselines.
In particular, on the clean test set, the score gap between the proposed methods and the baselines jumped to more than twenty points.
These results indicate that CEAS and CAS are significantly more effective.
CEAS got a better ROUGE-L score than CAS, suggesting that combining a classifier and a generator, with the classifier trained on pseudo labels, is effective.
The high coverage of CEAS is probably achieved because the classifier can focus on selecting the relevant commit messages for each class.
CAS-Multi tended to yield higher ROUGE-L than CAS-Single,
suggesting that it is also effective to independently train different abstractive summarization models for each release note class.
Here is an error analysis.
The CAS methods tend to output shorter sentences than the human reference sentences.
In the figure on the right, the reference has three or four sentences, while CAS outputs only one.
The reason for this model's reluctance is that in the training data, sentences are present for only thirty-three percent of the features labels and forty percent of the improvements labels.
Furthermore, the CAS methods cannot generate accurate release notes without additional information.
The top example on the right shows a very messy commit message, where the complete sentence cannot be generated without reference to the corresponding pull request or issue.
The example below shows that the two commit messages in the input are related and should be combined into one sentence, but the model fails to do so.
Finally, the conclusion.
We have built a new dataset for automatic release note generation.
We have also formulated a task of taking commit messages as input and summarizing them, so that it is applicable to all projects written in English.
Our experiments show that the proposed methods generate less noisy release notes with higher coverage than the baselines.
Please check out our dataset on GitHub.
Thank you.
Hello. My name is Asaf Harari.
And I will present our paper, Few-Shot Tabular Data Enrichment Using Fine-Tuned Transformer Architectures.
Data scientists analyze data and mainly focus on manipulating the data's existing features.
But sometimes these features are limited.
Feature generation using another data source may add substantial information.
Our research goal is automatic tabular data enrichment using free text from external sources.
Assume we have a tabular dataset and a knowledge base.
We need an automatic process, involving entity linking and text analysis, to extract new features from the knowledge base's free text.
Our framework, FeSTE, is exactly this automatic process.
So let's see an example of a dataset fed into FeSTE.
In this example, the dataset is a university dataset,
where the goal is to classify universities into low-ranking and high-ranking universities.
As the knowledge base, we use Wikipedia.
The first phase of FeSTE is entity linking,
where each entity, in this example the university name, is linked to an entity within the knowledge base,
and the text of the knowledge base entity is extracted and added to the dataset.
In this example, the text is the Wikipedia page's abstract.
Now, we need to generate or extract features from the retrieved text.
So we need a feature extraction phase, which includes text analysis.
This is the main novelty of this paper, and I will dive into it in the next slides.
After the feature extraction phase, there is a feature generation phase, where we use the extracted features to generate a small number of new features.
FeSTE generates as many features as there are classes in the original dataset.
In this example, the original dataset has two classes,
so FeSTE generates two new features.
But if a dataset has five classes, FeSTE generates five new features.
Each feature represents the likelihood of each class.
To analyze the text, we use the current state of the art in text analysis, which is transformer-based language models such as BERT, GPT, XLNet, and so on.
But it is not likely that we can train language models using the input datasets alone.
So a naive approach would be target task finetuning.
In the feature extraction phase, we can download a pretrained language model and finetune it over the target dataset.
In this example, we finetune the language model to classify text, the abstracts, into the classes low or high.
We receive the language model's output, which is the likelihood for each class, and use it as new features.
The problem with this approach is that datasets may have few distinct entities and texts.
In our experiments, almost half of the datasets contain fewer than four hundred samples, and the smallest dataset contains thirty-five samples in its training set.
So finetuning a language model over such a dataset will be ineffective.
But we can use prior knowledge about previously analyzed datasets.
Because we apply FeSTE over multiple datasets, we can use n minus one of the datasets to gather information, and use this information when we analyze the nth dataset.
What we suggest is to add another finetuning phase:
a preliminary multitask finetuning phase,
in which we finetune the language model over the n minus one datasets.
Then we execute another finetuning phase, the target task finetuning, in which we finetune the language model over the nth, target, dataset.
The state of the art in multitask finetuning is called MTDNN.
MTDNN maintains as many heads as there are tasks in the training set.
In this example, there are four tasks in the training set, so MTDNN maintains four heads, as you can see in the image.
It samples a random batch from the training set.
If the random batch belongs to, for example, a single-sentence classification task, it executes the forward and backward passes through the first head.
And if the random batch belongs to a pairwise ranking task, it executes the forward and backward passes through the last head.
In our scenario, tabular datasets vary in the number of classes,
so there are many tasks.
MTDNN would need to maintain a head, that is an output layer, for each number of classes.
Additionally, MTDNN needs to initialize a new head for every new dataset with a new task.
In our approach, called task reformulation finetuning, instead of maintaining multiple heads, we reformulate each dataset into a sentence-pair classification problem, which is a two-class task.
So let's see an example.
Here is our input dataset, which consists of entities, features, text, and classes.
We reformulate the task from classifying the text into low or high, into classifying the abstract and the class into true or false.
Or in other words, we train the language model to decide whether the abstract belongs to the class or not.
So the label vector in this case always consists of two classes.
And this is the algorithm for our reformulated finetuning approach.
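As a rough sketch, the reformulation itself could look like this (names are illustrative):

```python
def reformulate(rows, class_names):
    """Turn a multi-class text dataset into sentence-pair true/false pairs.

    rows: iterable of (text, gold_class) pairs, e.g. (abstract, "high").
    Yields ((text, candidate_class), label) for a fixed two-class task.
    """
    for text, gold in rows:
        for cls in class_names:
            yield (text, cls), int(cls == gold)

rows = [("X University is a leading research university ...", "high")]
for pair, label in reformulate(rows, ["low", "high"]):
    print(pair, label)  # ('...', 'low') 0  /  ('...', 'high') 1
```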
So let's see the full framework.
A dataset is fed into FeSTE.
FeSTE then executes the entity linking phase.
It extracts the text from the knowledge base, which in this example is the abstract of the Wikipedia page.
Then it reformulates the task into a pairwise sentence classification task.
It applies the language model to the new task and outputs the likelihood for each class.
Note that the language model has already been finetuned over the n minus one datasets using the preliminary multitask finetuning.
Then we use the output vector of the language model as newly generated features, one per class.
To evaluate our framework, we use seventeen tabular classification datasets, which vary in size, features, balance, domain, and initial performance.
And as the knowledge base, we use Wikipedia.
We design our experiment as a leave-one-out evaluation, where we train FeSTE over sixteen datasets and apply it to the seventeenth dataset.
We also split each dataset into four folds and apply four-fold cross-validation.
Then we generate the new features and evaluate them using five evaluation classifiers.
In our experiments, we use the BERT-base architecture.
Here are the results of our experiments.
You can see that we compare our framework to target dataset finetuning, target task finetuning, and MTDNN preliminary finetuning.
Our reformulated finetuning achieves the best performance.
While MTDNN achieves a two percent improvement over target dataset finetuning,
our approach achieves a six percent improvement.
When we look at the small datasets, we can see that the performance of MTDNN decreases, and the improvement from the preliminary multitask finetuning phase drops to one point five percent,
while our improvement increases to eleven percent compared to target task finetuning alone.
Summing up, FeSTE enables few-shot enrichment, from as few as thirty-five samples in our experiments.
It uses one architecture for all tasks and datasets.
And it keeps the same head of the model across tasks.
But it adds a reformulation phase.
It augments the training set, and it needs a target value with semantic meaning, so that we can feed it into the language model and use it in the sentence-pair classification problem.
Thank you.
