Hi everyone. Today I'm going to present our research work Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction.
I'm Allan from ByteDance AI Lab, and this is a joint work with Jierui Li from the University of Texas at Austin and Wei Lu from SUTD. 
First, I'd like to talk about our motivation for reasoning. 
So here we show an example where multi-step reasoning is helpful.
This figure is taken from the PaLM paper, where they perform prompting to solve math word problems in the few-shot learning scenario.
So on the left hand side, we can see if we give some examples with just question and answers, we might not be able to obtain the correct answers. 
But if we give some more reasoning description, the model is able to predict the reasoning description and also make correct predictions here.
So it is good to have interpretable multi-step reasoning as output. 
And we also think math word problems are a straightforward application to evaluate such reasoning abilities.
So, here in our problem setup, given a question, we need to solve it and obtain the numerical answer.
In our datasets, we are also given the mathematical expression which leads to this particular answer.
Certain assumptions also apply, as in previous work.
We assume the precision of quantities is known.
And we only consider basic operators such as addition, subtraction, multiplication, division, and exponentiation.
Furthermore, complicated operators can actually be decomposed into these basic operators.
So, previous work in math word problem solving can be categorized into sequence-to-sequence and sequence-to-tree models.
Traditional sequence-to-sequence models convert the expression to a specific sequence for generation.
This is pretty easy to implement, and it can generalize to many different complicated problems.
But the drawbacks are that the performance is generally not better than the structured models, and it lacks interpretability for prediction.
But this direction is still quite popular because of the transformer model.
In tree-based models, we structure these expressions in tree form and follow a preorder traversal in tree generation.
So here we keep generating the operators until we reach the leaves, which are the quantities. 
So here the good thing is that it gives us this binary tree structure, but it is quite counterintuitive because we generate the operators first and only at the end generate the quantities.
And the second thing is that it also contains some repetitive computations. 
So here if we look at this expression, eight times three plus three is actually generated twice, but in fact we should reuse the result.
So, in our proposed approach, we want to solve those problems in a step-by-step and interpretable manner.
For example, here in the second step, we can obtain the divisor, which is twenty-seven.
And we can also refer back to the original questions to find the relevant contents. 
And in these steps we obtain the divisor.
And then at the third step we get the quotient.
Alright. And after these three steps, we can reuse the result from the second step to get the result of the fourth step, and then finally we can obtain the dividend.
So, here we generate the whole expression directly rather than generating a single operator or quantity.
So this makes the process more accurate. 
So, in our deductive system, we first start with a bunch of quantities presented in the question, also including some constants, as our initial state.
An expression is represented by e_{i,j,op},
where we apply the operator op from Qi to Qj, and such an expression is directed.
So, we also have subtraction reverse here to represent the opposite direction.
This is quite similar to relation extraction. 
So in a formal deductive system, at time step t, we apply the operator between the Qi and Qj pair, and then we obtain a new expression.
We add it to the next state to become a new quantity. 
So, this slide visualizes the evolution of the state, where we keep adding expressions to the current state.
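To make the state evolution concrete, here is a minimal sketch in Python. It is not the implementation from the talk: the operator names (including the `-_rev` and `/_rev` spellings for the reverse direction), the `deduce` helper, and the choice of initial quantities are assumptions for illustration. Only the intermediate value twenty-seven comes from the example above.

```python
import operator

# Basic operators; the "_rev" variants model the opposite direction (assumed names).
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "-_rev": lambda a, b: b - a,
    "/_rev": lambda a, b: b / a,
}

def deduce(state, i, j, op):
    """Apply `op` to quantities i and j and return the next state.

    The value of the new expression is appended as a new quantity,
    so later steps can reuse it instead of recomputing it.
    """
    value = OPS[op](state[i], state[j])
    return state + [value]

# Assumed initial state: quantities 8 and 3 from the question plus a constant 3.
state = [8, 3, 3]
state = deduce(state, 0, 1, "*")  # 8 * 3 = 24
state = deduce(state, 3, 2, "+")  # 24 + 3 = 27, the divisor
print(state)  # [8, 3, 3, 24, 27]
```

Because each result becomes a regular quantity in the state, a value such as twenty-seven is computed once and then referenced by index in later steps.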
So in our model implementation, we first use a pretrained language model, which can be BERT or RoBERTa, to encode the sentence and obtain the quantity representations.
So, once we get the quantity representations, we can start to do inference. 
Here we show an example of obtaining the representation for Q one divided by Q two, and then times Q three.
First we get the pair representation, which is basically just the concatenation of Q one and Q two, and then we apply a feedforward network which is parameterized by the operator,
and then finally we obtain the expression representation Q one divided by Q two.
But in practice, in the inference stage, we might get an incorrect expression as well.
So, here the number of all possible expressions equals three times the number of operators.
So the nice thing here is that we can easily add constraints to control this search space.
For example, if this expression is not allowed, we can simply remove this expression from our search space.
So in the second step, we do the same thing, but the only difference is that we have one more quantity.
This quantity comes from the previously calculated expression.
So finally we can obtain this final expression Q three times Q four. 
And we can also see that the number of all possible expressions is different from the previous step.
So, such a difference makes it hard to apply beam search because the probability distribution between these two steps is unbalanced.
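The change in search space size between steps can be sketched as follows. This is an illustration rather than the actual model code: the function name `candidate_expressions`, the six-operator list, and the `disallowed` set are assumptions. The pair count follows the description above, where three quantities give three pairs.

```python
from itertools import combinations

def candidate_expressions(num_quantities, operators, disallowed=frozenset()):
    """Enumerate (i, j, op) candidates over all quantity pairs, minus disallowed ones."""
    return [
        (i, j, op)
        for i, j in combinations(range(num_quantities), 2)
        for op in operators
        if (i, j, op) not in disallowed
    ]

# Assumed operator set: four basic operators plus two reverse-direction variants.
operators = ["+", "-", "-_rev", "*", "/", "/_rev"]

step1 = candidate_expressions(3, operators)  # three quantities in the state
step2 = candidate_expressions(4, operators)  # one more quantity after the first step
print(len(step1))  # 18: 3 pairs x 6 operators
print(len(step2))  # 36: 6 pairs x 6 operators

# A constraint is just a removal from the candidate list.
constrained = candidate_expressions(3, operators, disallowed={(0, 1, "/")})
print(len(constrained))  # 17
```

Since the candidate set doubles here from one step to the next, scores normalized within each step are not directly comparable across steps, which is the imbalance mentioned for beam search.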
So the training procedure is similar to training a sequence to sequence model where we optimize the loss at each time step. 
And here we also use this Tau to represent when we should terminate this generation process. 
And here the space is different from sequence-to-sequence because it is different at each time step, while in a traditional sequence-to-sequence model it is the size of the vocabulary.
And it also allows us to impose certain constraints from prior knowledge.
So we conduct experiments on the commonly used math word problem datasets: MAWPS, Math23K, MathQA, and SVAMP.
And here we briefly show the results compared with the previous best approaches.
So our best-performing variant is the RoBERTa deductive reasoner.
And in fact we do not use beam search; in contrast, all previous approaches use beam search.
All right. So, the best approaches are often tree-based models.
So, overall our reasoner is able to significantly outperform these tree-based models.
But we can see the absolute numbers on MathQA or SVAMP are not really high.
So we further investigate the results on SVAMP. 
And this dataset is challenging because the authors manually added something to confuse the NLP model, such as irrelevant information and extra quantities.
So, in our predictions we find some of the intermediate values are actually negative.
For example, in this question we are asking how many apples Jake has.
But we have some extra information like seventeen fewer pictures, and Steven has eight pictures, which is totally irrelevant. 
So, our model makes a prediction like this, which produces negative values.
And we observe these two expressions actually have similar scores. 
So, we can limit the search space by removing those results that are negative, so that we can make the answer correct.
So, we further find that such a constraint actually improves quite a lot for some models.
For example, for BERT, we improve by seven points, and for the RoBERTa-base model we improve by two points.
So a better language model has better language understanding abilities, so the number here is higher for RoBERTa and lower for BERT.
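The non-negativity constraint can be sketched like this. The candidate tuples, scores, and the helper name `best_non_negative` are hypothetical and only meant to show how two similarly scored expressions are separated once negative results are removed.

```python
def best_non_negative(candidates):
    """Pick the highest-scoring candidate whose computed value is not negative.

    `candidates` is a list of (expression, value, score) tuples.
    Returns None if every candidate is negative.
    """
    allowed = [c for c in candidates if c[1] >= 0]
    if not allowed:
        return None
    return max(allowed, key=lambda c: c[2])

# Hypothetical candidates with similar scores; the unconstrained argmax is negative.
candidates = [
    ("8 - 17", -9, 0.51),
    ("17 - 8", 9, 0.49),
]
print(best_non_negative(candidates))  # ('17 - 8', 9, 0.49)
```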
And we also try to analyze the difficulty behind all these datasets.
We assume the number of unused quantities can be regarded as irrelevant information here.
So, here we can see the percentage of samples with unused quantities, and the SVAMP dataset has the largest portion.
And here we also show the overall performance. 
For those samples without unused quantities, the performance is actually higher than the overall performance.
But for those samples with unused quantities, it is actually way worse than the overall performance.
For MAWPS, we don't really have too many test cases, so I just ignore this part.
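As a rough sketch of this analysis, the share of samples with unused quantities could be computed as below. The data layout (a list of question quantities plus the set of indices used in the gold expression) and both function names are assumptions, not the format of the actual datasets.

```python
def has_unused_quantity(question_quantities, used_indices):
    """True if some quantity in the question never appears in the gold expression."""
    return any(i not in used_indices for i in range(len(question_quantities)))

def unused_ratio(samples):
    """Fraction of samples with at least one unused quantity."""
    flagged = sum(has_unused_quantity(q, used) for q, used in samples)
    return flagged / len(samples)

# Hypothetical samples: (quantities in the question, indices used by the gold expression).
samples = [
    ([8, 3], {0, 1}),      # every quantity is used
    ([17, 8, 5], {0, 2}),  # the quantity at index 1 is irrelevant
]
print(unused_ratio(samples))  # 0.5
```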
So, finally we want to show the interpretability through a question perturbation example.
So here our model actually makes a wrong prediction at the first step.
So, we can actually correlate this expression with the sentence here. Alright. 
So, we think this sentence might be misleading the model into an incorrect prediction.
So here, planting another thirty-five makes the model think it should be an addition operator.
So we try to revise the sentence to be something like: the number of pear trees is thirty-five fewer than the apple trees.
So, we make it convey more accurate semantics, such that the model is able to make the prediction correct.
So, this study shows how the interpretable predictions help us understand the model behavior.
So to conclude our work: first, our model is actually pretty efficient.
And we are able to provide an interpretable solving procedure.
And we can easily incorporate some prior knowledge as constraints, which can help improve the performance.
And the last thing is that the underlying mechanism does not only apply to math word problem solving tasks but also to other tasks that involve multi-step reasoning.
And we also have certain limitations. 
If we have a large number of operators or constants, the memory consumption could be pretty high.
And the second thing is that, as mentioned, because the probability distribution is unbalanced between different time steps, it's also pretty challenging to apply a beam search strategy.
So this is the end of the talk, and questions are welcome. Thank you.
Hi, my name is Antoine and I'm from Maastricht University.
I will be presenting my joint work with Jerry which is about a New Dataset for Statutory Article Retrieval. 
Legal issues are an integral part of many people's lives.
But the majority of citizens have little to no knowledge about their rights and fundamental legal processes.
As a result, many vulnerable citizens who cannot afford the costly assistance of a legal expert are left unprotected or, worse, exploited.
Our work aims to bridge the gap between people and the law by developing an effective retrieval system for statutory articles.
Such a system could provide a free professional legal help service for unskilled humans.
Before diving into the main contribution of this work, let's first describe the problem of statutory article retrieval. 
Given a simple question on a legal matter such as what do I risk if I violate professional confidentiality? 
A model is required to retrieve all relevant statutory articles from a large body of legislation. 
This information retrieval task comes with its own set of challenges. 
First, it deals with two types of language:
common natural language for the questions and complex legal language for the statutes.
This difference in language distributions makes it harder for a system to retrieve relevant candidates, as it indirectly requires an inherent interpretation system that can translate a natural question to a legal question that matches the terminology of statutes.
Besides, statutory law is not a stack of independent articles that can be treated as a complete source of information on their own, unlike news or recipes, for example.
Instead, it's a structured collection of legal provisions that have a whole meaning only when considered in the overall context, that is, together with the supplementary information from the neighboring articles, the fields and subfields they belong to, and their place in the structure of the law.
Lastly, statutory articles aren't small paragraphs, which usually are the typical retrieval unit in most retrieval works.
Here, they are long documents that may be up to six thousand words.
The recent advances in NLP have sparked huge interest in many legal tasks, such as legal judgment prediction or automated contract review.
But statutory article retrieval has remained mainly untouched due to the lack of large and high quality labeled datasets. 
In this work, we present a new French-native, citizen-centric dataset to study whether retrieval models can approximate the efficiency and reliability of a legal expert for the task of statutory article retrieval.
Our Belgian Statutory Article Retrieval Dataset, BSARD, consists of more than one thousand one hundred legal questions posed by Belgian citizens.
These questions cover a wide range of topics from family, housing, money, to work and Social Security. 
Each of them has been labeled by experienced jurists with references to relevant articles from a corpus of more than twenty-two thousand six hundred legal articles from Belgian codes of law.
Let's now talk about how we collected this dataset.
First, we started by compiling a large corpus of legal articles. 
We considered thirty two publicly available Belgian codes and extracted all the articles as well as the corresponding section headings. 
Then we gathered legal questions with references to relevant statutes. 
To do so, we partnered with a Belgian law firm that receives each year around four thousand emails from Belgian citizens who ask for advice on a personal legal issue.
We were lucky enough to get access to their websites, where their team of experienced jurists addresses Belgians' most common legal issues.
We collected thousands of questions annotated with categories, subcategories and legal references to relevant statutes. 
Lastly, we parsed the legal references and filtered out the questions whose references were not articles in one of the codes of law we considered.
The remaining references were matched and converted to the corresponding article IDs from our corpus.
We eventually ended up with one thousand one hundred and eight questions, each carefully labeled with the IDs of the relevant articles from our large corpus of twenty-two thousand six hundred and thirty-three statutory articles.
In addition, each question comes with a main category and a concatenation of subcategories.
And each article comes with a concatenation of the subsequent headings in the structure of the law.
This extra information is not used in the present work, but might be of interest for future research on legal information retrieval or legal text classification. 
Let's look at some characteristics of our dataset.
The questions are between five and forty-four words long, with a median of fourteen words.
The articles are much longer, with a median length of seventy-seven words, with one hundred and forty-two of them exceeding one thousand words.
The lengthiest one is up to five thousand seven hundred and ninety words.
As previously mentioned, the questions cover a wide range of topics, with around eighty-five percent of them being either about family, housing, money or justice,
while the remaining fifteen percent concern either Social Security, foreigners or work.
The articles are also very diverse, as they come from thirty-two different Belgian codes that cover a large number of legal topics.
Here's the total number of articles collected from each of these Belgian codes. 
Out of the twenty-two thousand six hundred and thirty-three articles, only one thousand six hundred and twelve are referred to as relevant to at least one question in the dataset.
And around eighty percent of these cited articles come from either the civil code, judicial code, criminal investigation code or penal code.
Meanwhile, eighteen out of thirty-two codes have fewer than five articles mentioned as relevant to at least one question,
which can be explained by the fact that those codes focus less on individuals and their concerns.
Overall, the median number of citations for these cited articles is two, and less than twenty-five percent of them are cited more than five times.
Using our dataset, we benchmark several retrieval approaches, including lexical and dense architectures.
Given a query and an article, a lexical model assigns a score to the query article pair by computing the sum over the query terms of the weights of each of these terms in that article. 
We experiment with the standard TF-IDF and BM25 ranking functions.
The main problem with these approaches is that they can only retrieve articles that contain keywords present in the query.
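To illustrate how such a lexical score is a sum over query terms of per-term weights, here is a small BM25 sketch. It is not the benchmark's implementation: the function name `bm25_scores`, the `k1` and `b` defaults, the smoothed IDF variant, and the toy pre-tokenized articles are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, articles, k1=1.5, b=0.75):
    """Score each tokenized article against a tokenized query with Okapi BM25."""
    n_docs = len(articles)
    avg_len = sum(len(a) for a in articles) / n_docs
    # Document frequency: number of articles containing each term.
    df = Counter(term for a in articles for term in set(a))
    scores = []
    for article in articles:
        tf = Counter(article)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # a term absent from the article contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(article) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, already tokenized (hypothetical content).
articles = [
    ["tenant", "eviction", "noise"],
    ["professional", "confidentiality", "violation", "penalty"],
    ["marriage", "contract"],
]
query = ["professional", "confidentiality"]
scores = bm25_scores(query, articles)
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # 1
```

The `continue` branch is exactly the keyword limitation: an article that shares no term with the query scores zero, however relevant it is.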
To overcome this limitation, we experiment with a neural-based architecture that can capture semantic relationships between queries and articles.
We use a bi-encoder model that maps queries and articles into dense vector representations and calculates a relevance score between a query-article pair by the similarity of their embeddings.
These embeddings typically result from a pooling operation on the output of a word embedding model. 
First, we study the effectiveness of Siamese Bi-encoders in a zero shot evaluation setup, meaning that pre trained word embedding models are applied out-of-the-box without any additional fine tuning. 
We experiment with context-independent text encoders, namely word2vec and fastText, and context-dependent embedding models, namely RoBERTa and, more specifically, CamemBERT, which is a French RoBERTa model.
Additionally, we train our own CamemBERT-based bi-encoders on our dataset.
Note that for training, we experiment with the two flavors of the Bi-encoder architecture. 
Siamese, which uses a unique word embedding model that maps the query and article together in a shared dense vector space, and two-tower, which uses two independent word embedding models that encode the query and article separately into different embedding spaces.
We experiment with mean, max and CLS pooling, as well as dot product and cosine for computing similarities.
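The pooling and similarity choices can be sketched with plain Python lists. The token vectors below are made-up two-dimensional stand-ins for the output of a word embedding model, and the helper names are assumptions. CLS pooling, which would simply take the first token's vector, is left out for brevity.

```python
import math

def mean_pool(token_vectors):
    """Average the token vectors dimension by dimension."""
    dim = len(token_vectors[0])
    return [sum(v[d] for v in token_vectors) / len(token_vectors) for d in range(dim)]

def max_pool(token_vectors):
    """Take the maximum over tokens in each dimension."""
    dim = len(token_vectors[0])
    return [max(v[d] for v in token_vectors) for d in range(dim)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical token embeddings for one query and one article.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
article_tokens = [[2.0, 2.0], [4.0, 0.0]]

q = mean_pool(query_tokens)    # [0.5, 0.5]
a = mean_pool(article_tokens)  # [3.0, 1.0]
print(dot(q, a))               # 2.0
print(cosine(q, a))
```

In a Siamese setup both `query_tokens` and `article_tokens` would come from the same encoder; in a two-tower setup they would come from two separate encoders, while the pooling and similarity steps stay the same.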
Here are the results of our baselines on the test set,
with the lexical methods above, the Siamese bi-encoders evaluated in a zero-shot setup in the middle, and the fine-tuned bi-encoders below.
Overall, the fine-tuned bi-encoders significantly outperform all the other baselines.
The two-tower model improves over its Siamese variant on recall at one hundred, but performs similarly on the other metrics.
Although BM25 underperformed the trained bi-encoders significantly, its performance indicates that it's still a strong baseline for domain-specific retrieval.
Regarding the zero shot evaluation of Siamese bi-encoder, we find that directly using the embeddings of a pre trained CamemBERT model without optimizing for the information retrieval task gives poor results which is consistent with previous findings. 
Furthermore, we observe that the word2vec-based bi-encoder significantly outperformed the fastText and BERT-based models, suggesting that maybe pretrained word-level embeddings are more appropriate for the task than character-level or subword-level embeddings when used out-of-the-box.
Although promising, these results suggest ample opportunity for improvement compared to a skilled legal expert who can eventually retrieve all relevant articles to any question and thus get perfect scores.
Let's conclude by discussing two limitations of our dataset.
First, the corpus of articles is limited to those collected from the thirty-two considered Belgian codes, which does not cover the entire Belgian law, as articles from decrees, directives and ordinances are missing.
During the dataset construction, all references to these uncollected articles are ignored, which causes some questions to end up with only a fraction of the initial number of relevant articles.
This information loss implies that the answer contained in the remaining relevant articles might be incomplete, although it's still completely appropriate.
Second, we should note that not all legal questions can be answered with statutes alone. 
For instance, the question, can I evict my tenants if they make too much noise,
might not have a detailed answer within statutory law that quantifies a specific noise threshold at which eviction is allowed.
Instead, the landlord should probably rely more on case law and find precedents similar to their current situation. 
For example, the tenants make two parties a week until two AM.
Hence, some questions are better suited than others to the statutory article retrieval task, and the domain of the less suitable ones remains to be determined.
We hope that our work sparks interest in developing practical and reliable statutory article retrieval models
that can help improve access to justice for all.
You can check out our paper, dataset and code at the following links. Thank you. 
Hello, we are happy to present our work on VALSE: a Task-Independent Benchmark Meant for Testing Vision and Language Models with Specific Linguistic Phenomena.
Why did we go to the trouble of setting up this benchmark?
Well during the last years, we have seen an explosion of transformer based vision and language models pretrained on large amounts of image text pairs. 
Each one of these models pushes state-of-the-art on vision and language tasks such as visual question answering, visual common sense reasoning, image retrieval, phrase grounding. 
So we got the message: the accuracies on these task-specific benchmarks are increasing steadily.
But do we know what the models have actually learned? 
What is it that a vision and language transformer understood when assigning a high score for this image and this sentence to match,
and a low score for this one?
Do vision and language models focus on the right thing? 
Or do they focus on biases as shown by previous work? 
To shed more light on this aspect, we propose a more task-agnostic direction and introduce VALSE, which tests the sensitivity of vision and language models to specific linguistic phenomena that affect both the linguistic and the visual modalities.
We target existence, plurality, counting, spatial relations, actions and entity coreference. 
But how do we test whether the vision and language models have captured these phenomena?
By foiling, a method previously applied for vision and language models only for noun phrases by Ravi Shekhar and collaborators, and on counting by us in previous work.
Foiling basically means that we take the caption of an image and produce a foil by altering the caption such that it does not describe the image anymore. 
And we do these phrase alterations by focusing on six specific pieces such as existence, plurality, counting, spatial relations, actions and entity coreference, where each piece can consist of one or more instruments, in case we found more than one interesting way to create Foil instances. 
For example, in the case of the actions piece, we have two instruments, one in which the action verb is changed with a different action, and one in which actants are swapped. 
Counting and coreference also are pieces that have more than one instrument. 
And we create these Foils by making sure that they fail to describe the image, that they are grammatical, and otherwise valid sentences. 
This is not easy to do because a foiled caption may be less likely than the original caption. 
For example, though it's not impossible, it is statistically less likely for plants to cut a man than for a man to cut plants, and large vision and language models could pick up on this.
Therefore, to obtain valid foils, we must take action.
First, we make use of strong language models to propose foils. 
Second, we use natural language inference, or NLI for short, to filter out foils that could still be describing the image, since when constructing foils we need to ensure that they fail to describe the image.
To test this automatically, we apply natural language inference with the following rationale. 
We consider an image to be the premise and its caption its entailed hypothesis.
In addition, we consider the caption to be the premise and the foil its hypothesis.
If an NLI model predicts the foil to contradict or to be neutral with respect to the caption, we take this as an indicator of a valid foil. 
If an NLI model predicts the foil to be entailed by the caption, it cannot be a good foil, since by transitivity it will give a truthful description of the image, and we filter these foils out.
But this procedure is not perfect, it is just an indicator for valid foils. 
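The filtering rule just described can be written down as a short sketch. The function `filter_foils`, the label strings, and the lookup-table predictor are assumptions. A real setup would call an actual NLI model in place of `toy_nli`.

```python
def filter_foils(pairs, nli_predict):
    """Keep (caption, foil) pairs whose foil is not entailed by the caption.

    `nli_predict(premise, hypothesis)` is expected to return one of
    "entailment", "neutral" or "contradiction".
    """
    kept = []
    for caption, foil in pairs:
        label = nli_predict(premise=caption, hypothesis=foil)
        if label in ("contradiction", "neutral"):
            kept.append((caption, foil))
    return kept

# Hypothetical stand-in for an NLI model: a fixed table of made-up labels.
TOY_LABELS = {
    ("a man cuts plants", "a woman cuts plants"): "contradiction",
    ("a man cuts plants", "a person cuts plants"): "entailment",
    ("a man cuts plants", "a man cuts plants outside"): "neutral",
}

def toy_nli(premise, hypothesis):
    return TOY_LABELS[(premise, hypothesis)]

valid = filter_foils(list(TOY_LABELS), toy_nli)
print(len(valid))  # 2: the entailed foil is filtered out
```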
Therefore, as a third measure for generating valid foils, we employ human annotators to validate the data used in VALSE. 
So, after filtering and human evaluation, we have as many test instances as described in this table. 
Note that VALSE does not deliver any training data but only test data. 
Since it is a zero-shot testing benchmark only, it is designed to leverage the existing capabilities of vision and language models after pretraining.
Fine-tuning would only enable models to exploit artifacts or statistical biases in the data.
And we all know that these models like to cheat and take shortcuts. 
And as we said, we are interested in assessing what capabilities the vision and language models have after pre training. 
We experiment with five vision and language models on VALSE, namely with CLIP, LXMert, ViLBERT, ViLBERT twelve in one, and VisualBERT. 
Two of our most important evaluation metrics are the accuracy of the models in classifying image sentence pairs into captions and foils. 
Perhaps more relevant for this video, we will showcase our more permissive metric, the pairwise accuracy, which measures whether the image sentence alignment score is greater for the correct image text pair than for its foiled pair. 
For more metrics and results on them, do check out our paper. 
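As a small illustration of the pairwise accuracy described above, the sketch below counts how often the caption gets a strictly higher alignment score than its foil for the same image. The scores are invented and the function name is an assumption.

```python
def pairwise_accuracy(scores):
    """Share of instances where the caption outscores its foil.

    `scores` is a list of (caption_score, foil_score) pairs, one per image.
    Ties count as failures because the caption score must be strictly greater.
    """
    correct = sum(1 for caption_score, foil_score in scores if caption_score > foil_score)
    return correct / len(scores)

# Hypothetical image-sentence alignment scores.
example = [(0.9, 0.2), (0.4, 0.6), (0.7, 0.1), (0.5, 0.5)]
print(pairwise_accuracy(example))  # 0.5
```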
The results with pairwise accuracy are shown here, and they are consistent with the results we got from the other metrics: the best zero-shot performance is achieved by ViLBERT twelve in one, followed by ViLBERT, LXMert, CLIP, and finally VisualBERT.
It's notable how instruments centered on the individual objects like existence and noun phrases are almost solved by ViLBERT twelve in one, highlighting that models are capable of identifying named objects and their presence in images. 
However, none of the remaining pieces can be reliably solved in our adversarial foiling settings. 
We see from the plurality and counting instruments that vision and language models have trouble distinguishing references to single versus multiple objects, or counting them in an image. 
The relation piece shows that they have difficulties in correctly classifying a named spatial relation between objects in an image. 
They also have trouble distinguishing actions and identifying their participants, even if supported by plausibility biases as we see in the actions piece. 
From the Coreference piece, we find out that tracing multiple references to the same object in an image by using pronouns is also difficult for vision and language models. 
As a sanity check, and because it's an interesting experiment, we also benchmark two text-only models, GPT-1 and GPT-2, to assess whether VALSE is solvable by these unimodal models by computing the perplexity of the correct and the foiled caption, no image here, and predicting the entry with the lowest perplexity.
If the perplexity is higher for the foil, we take this as an indication that the foiled caption may suffer from plausibility bias or other linguistic biases. 
And it's interesting to see that in some cases, the text only GPT models have captured the plausibility of the world better than the vision and language models. 
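The text-only check can be sketched as a perplexity comparison. The per-token log-probabilities below are invented. In practice they would come from a language model such as GPT-2, and the function names are assumptions.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def pick_lower_perplexity(caption_log_probs, foil_log_probs):
    """Predict whichever sentence the language model finds less surprising."""
    if perplexity(caption_log_probs) < perplexity(foil_log_probs):
        return "caption"
    return "foil"

# Hypothetical log-probabilities: the foil is less plausible to the model.
caption_lp = [-1.0, -1.0, -1.0]
foil_lp = [-2.0, -2.0, -2.0]
print(pick_lower_perplexity(caption_lp, foil_lp))  # caption
```

If a text-only model picks the caption this way without seeing the image, the foil was distinguishable on plausibility alone, which is the bias the sanity check is meant to expose.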
So to sum up, VALSE is a benchmark that uses the lens of linguistic constructs to help the community improve vision and language models by hard testing their visual grounding capabilities. 
Our experiments show that vision and language models identify named objects and their presence in images well, as shown by the existence piece, but struggle to ground their interdependence and relationships in visual scenes when forced to respect linguistic indicators.
We would really like to encourage the community to use VALSE for measuring progress towards language grounding with vision and language models.
And even more, VALSE could be used as an indirect assessment of datasets, as models could be evaluated before and after training or fine tuning to see whether a data set helps models improve on any of the aspects tested by VALSE. 
If you're interested, do check out the VALSE data on GitHub, and if you have any questions do not hesitate to contact us. 
Hello, my name is Kamezawa from the University of Tokyo. 
I'll be presenting a paper entitled RNSum: A Large-Scale Dataset for Automatic Release Note Generation via Commit Logs Summarization. 
I'll be explaining in this order: 
First, I will introduce Automatic Release Note Generation that we are working on in this research. 
A release note is a technical document that summarizes the changes distributed with each release of a software product. 
The image shows the release notes for version 2.6.4 of the Vue.js library.
These notes play an important role in open source development but they're time consuming to prepare manually. 
Therefore, it would be very useful to be able to automatically generate high quality release notes. 
I will refer to two previous studies on automatic release note generation.
The first is a system called ARENA, released in twenty-fourteen.
It takes a rule-based approach, for example using the change extractor to extract all differences, library changes and document changes from the differences between releases, and finally combining them.
The most notable feature of this system is the issue extractor in the upper right corner,
which must be linked to Jira, the issue tracker system, and can only be applied to projects that use Jira.
In other words, it cannot be used for many projects on GitHub. 
The second is Glyph, recently announced in twenty-twenty. 
It is available on the Internet and can be installed via pip.
This system has a simple learning-based text classification model and outputs one of five labels, such as features or bug fixes, for each input commit message.
The image is a sample usage that returns a corrective or bug fixes label. 
Glyph's training data is fairly small, about five thousand, and as will be shown in the experiments described below,
the performance of the text classification model is not high.
I presented two related studies, but their problems are limited applicability and scarce data resources.
Our paper solves these two problems and automatically generates high quality release notes. 
For the limited applicability problem, we propose a high-quality classwise summarization method using only commit messages as input.
This proposed method can be used for all English repositories. 
For the second problem of scarce data resources, we built our RNSum dataset, consisting of about eighty-two thousand pieces of data, by collecting data from public GitHub repositories using the GitHub API.
Next, I'll describe our dataset. 
Here is an example of data. 
The left side is a commit message, and the right side is the release notes.
Release notes are labeled as improvements or fixes, etc. 
We have set up a task that takes up the commit messages as input and outputs and labeled these notes. 
This can be regarded as a summarization task. 
We have predefined four labels: features; improvements; bug fixes; and deprecations, removals and breaking changes. 
These were set based on previous research and other factors. 
The release notes shown on the bottom right are extracted from the release notes shown on the bottom left. 
At this time, it is necessary to detect the four labels that have been set up in advance. 
But the labels are not always consistent across repositories. 
For example, the improvements label includes improvements, enhancements, optimizations, and so on. 
We prepared a vocabulary list of labels to cover these notational variations. 
We use it to detect the release note class, and we collect the text of the list that follows as the release note sentence for that class. 
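As a rough illustration of the label detection just described, the following Python sketch maps heading variants to the four classes and collects the list items under each recognized heading. The vocabulary entries, the function names, and the assumption that list items start with "-" or "*" are illustrative only and are not taken from the authors' implementation.

```python
import re

# Hypothetical vocabulary: the keys are the four classes named in the talk;
# the variant spellings are illustrative, not the authors' actual list.
LABEL_VOCAB = {
    "features": {"features", "new features", "added"},
    "improvements": {"improvements", "enhancements", "optimizations"},
    "bug fixes": {"bug fixes", "fixes", "fixed"},
    "deprecations, removals and breaking changes": {
        "deprecations", "removals", "breaking changes",
    },
}


def normalize_heading(heading):
    """Map a release-note heading to one of the four classes, or None."""
    key = re.sub(r"[^a-z ]", "", heading.lower()).strip()
    for label, variants in LABEL_VOCAB.items():
        if key in variants:
            return label
    return None


def extract_release_notes(lines):
    """Collect the list items that follow each recognized heading."""
    notes = {label: [] for label in LABEL_VOCAB}
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(("-", "*")):
            # A list item belongs to the most recent recognized heading.
            if current is not None:
                notes[current].append(stripped.lstrip("-* ").strip())
        elif stripped:
            # Any other non-empty line is treated as a heading.
            current = normalize_heading(stripped)
    return notes
```

Items under an unrecognized heading are skipped in this sketch, because the heading resets the current class to None.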
Next is a commit message. 
Commit messages are not tied to each release. 
As shown in the image below, if the current release is version two point five point nineteen, we need to identify the previous release, version two point five point eighteen, and get the diff between them. 
This is a bit tedious and it is not enough to just get a list of releases and look at the before and after. 
We created a heuristic matching rule to get the previous and next versions. 
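The talk does not spell out the matching rule, so the following is only a minimal sketch of one plausible heuristic: parse the numbers out of each release tag and pick the greatest version strictly below the current one. The function names and the tag format are assumptions, and the authors' actual rule may differ.

```python
import re


def parse_version(tag):
    """Turn a tag such as 'v2.5.19' into (2, 5, 19); None if it has no numbers."""
    numbers = re.findall(r"\d+", tag)
    return tuple(int(n) for n in numbers) if numbers else None


def previous_release(tags, current):
    """Return the tag with the greatest version strictly below `current`."""
    cur = parse_version(current)
    if cur is None:
        return None
    # Keep only tags that contain a version number.
    parsed = [(parse_version(t), t) for t in tags if parse_version(t) is not None]
    older = [(v, t) for v, t in parsed if v < cur]
    return max(older)[1] if older else None
```

Comparing numeric tuples rather than strings keeps two point five point nine below two point five point eighteen, which a plain string sort would get wrong.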
Dataset Analysis. 
In the end, seven thousand two hundred repositories and eighty two thousand pieces of data were collected. 
Also, the average number of release note tokens is sixty-three, which is quite high for a summarization task. 
Also, the number of unique tokens is quite large, at eight million eight hundred thirty thousand. 
This is due to the large number of unique class or method names found in the repository. 
Next, I will explain the proposed method. 
The classwise extractive-then-abstractive summarization model, CEAS, consists of two neural modules: a classifier using BERT or CodeBERT and a generator using BART. 
First, CEAS uses a classifier to classify each commit message into five release-note classes: features, improvements, bug fixes, deprecations plus, and other. 
The commit messages classified as Other are discarded. 
Then CEAS applies the generator to the four labeled documents independently and generates release notes for each class. 
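The classify, discard, then generate flow described above can be sketched as follows. This is a skeleton under stated assumptions: `classify` and `generate` are stand-in callables for the BERT or CodeBERT classifier and the BART generator, and the toy versions shown here are hypothetical keyword rules, not the trained models.

```python
from collections import defaultdict

# The four classes that are kept; "other" is discarded.
CLASSES = ["features", "improvements", "bug fixes", "deprecations plus"]


def ceas(commit_messages, classify, generate):
    """Classify each commit, drop 'other', then summarize each class separately."""
    grouped = defaultdict(list)
    for message in commit_messages:
        label = classify(message)
        if label != "other":
            grouped[label].append(message)
    # The generator runs once per class; an empty class yields an empty note.
    return {
        label: generate(label, grouped[label]) if grouped[label] else ""
        for label in CLASSES
    }


def toy_classify(message):
    """Hypothetical stand-in for the trained classifier."""
    text = message.lower()
    if text.startswith("fix"):
        return "bug fixes"
    if text.startswith("add"):
        return "features"
    return "other"


def toy_generate(label, messages):
    """Hypothetical stand-in for the trained generator."""
    return " ".join(messages)
```

Calling `ceas(["Fix crash", "Add CLI flag", "Bump version"], toy_classify, toy_generate)` keeps the first two messages under their classes and drops the third.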
In this task, the direct correspondences between commit messages and release notes are not known. 
Therefore, to train the classifier, we assigned pseudo labels to each input commit message using the first ten characters of each commit message. 
We modeled the classwise abstractive summarization approach, CAS, by two different methods. 
The first model, which we call CAS-single, consists of a single seq2seq network and generates a single release note text given a concatenation of input commit messages. 
The output texts can be divided into classwise segments based on special class-specific endpoint symbols. 
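To make the segmentation step concrete, here is a minimal sketch of splitting one generated string into per-class segments. The marker tokens and the choice to place each marker before its segment are assumptions; the talk only says that special class-specific symbols delimit the segments.

```python
import re

# Hypothetical marker tokens, one per release-note class.
MARKERS = {
    "<features>": "features",
    "<improvements>": "improvements",
    "<fixes>": "bug fixes",
    "<deprecations>": "deprecations plus",
}


def split_by_class(output_text):
    """Split one generated string into per-class segments."""
    # The capture group makes re.split keep the markers in the result.
    pattern = "(" + "|".join(re.escape(m) for m in MARKERS) + ")"
    parts = re.split(pattern, output_text)
    segments = {}
    current = None
    for part in parts:
        if part in MARKERS:
            current = MARKERS[part]
        elif current is not None and part.strip():
            segments[current] = part.strip()
    return segments
```

Classes whose marker never appears in the output are simply absent from the returned dictionary.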
The second method, which we call CAS-Multi, consists of four different seq2seq networks, each of which corresponds to one of the fixed release-note classes. 
OK, let me explain the experiment. 
Five methods were compared: CEAS, CAS-single, CAS-multi, Clustering, and previous study Glyph. 
Regarding evaluation, in some cases, release notes are output in multiple sentences. 
Since it is difficult to calculate scores over multiple sentences as they are, they are combined with spaces and treated as one long sentence. 
BLEU is penalized when the system outputs a short sentence. 
This penalty results in a lower BLEU value in the experiment results described next. 
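For reference, the penalty in question is BLEU's brevity penalty, which in the standard BLEU definition is 1 when the candidate is at least as long as the reference and exp(1 - r/c) otherwise, where c and r are the candidate and reference lengths. The sketch below implements only that standard formula; it is not specific to this work, and the function name is my own.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """Standard BLEU brevity penalty for token counts c (candidate) and r (reference)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)
```

A ten-token output scored against a forty-token reference gets a multiplier of exp(-3), roughly 0.05, which shows how strongly short outputs are penalized.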
Finally, we also calculate the specificity because ROUGE and BLEU cannot be calculated if the release notes are empty. 
A higher specificity means that the model correctly outputs an empty text in cases where the reference release notes are empty. 
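Read as a formula, specificity here is the share of empty-reference cases in which the system also outputs nothing. A minimal sketch, assuming references and predictions are given as parallel lists of strings and that whitespace-only text counts as empty:

```python
def specificity(references, predictions):
    """Share of empty-reference cases where the prediction is empty too."""
    empty_cases = [(r, p) for r, p in zip(references, predictions) if not r.strip()]
    if not empty_cases:
        return None  # undefined when no reference is empty
    correct = sum(1 for _, p in empty_cases if not p.strip())
    return correct / len(empty_cases)
```

Cases with a non-empty reference do not enter this score at all; they are covered by ROUGE and BLEU.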
Here are the results. 
Since the dataset contains e-mail addresses, hash values, etc., we also evaluated on a cleaned dataset, which excludes them. 
CEAS and CAS achieved ROUGE-L scores more than ten points higher than the baselines. 
In particular, on the clean test set, the score gap between the proposed method and the baselines jumped to more than twenty points. 
These results indicate that CEAS and CAS are significantly effective. 
CEAS got a better ROUGE-L score than CAS, suggesting that combining a classifier and a generator, with the classifier trained using pseudo labels, is effective. 
High coverage of CEAS can be achieved probably because the classifier can focus on selecting relevant commit messages for each class. 
CAS-Multi tended to yield higher ROUGE-L than CAS-Single, suggesting that it is also effective to independently develop different abstractive summarization models for each release-note class. 
Here is an error analysis. 
CAS methods tend to output shorter sentences than human reference sentences. 
In the figure on the right, the reference sentence has three or four sentences, while CAS has only one. 
The reason for this reluctance of the model is that, in the training data, only thirty-three percent of the sentences are present in the features label and forty percent in the improvements label. 
Furthermore, CAS methods cannot generate accurate release notes without additional information. 
The top example on the right is an example of a very messy commit message, and the complete sentence cannot be generated without reference to the corresponding pull request or issue. 
The example below shows that the two commit messages in the input are related and should be combined into one sentence, but the model fails to do so. 
Finally, a conclusion. 
We have built a new dataset for automatic release-note generation. 
We have also formulated a task of taking commit messages as input and summarizing them, so that it is applicable to all projects written in English. 
Our experiments show that the proposed method generates less noisy release notes at higher coverage than the baselines. 
Please check out our dataset on GitHub. 
Thank you. 
Hello my name is Asaf Harari. 
And I will present our paper: Few-Shot Tabular Data Enrichment Using Fine-Tuned Transformers Architectures. 
Data scientists analyze data and mainly focus on manipulating the data's existing features. 
But sometimes, these features are limited. 
Feature generation using another data source may add substantial information. 
Our research goal is automatic tabular data enrichment using external sources' free text. 
Assume we have a tabular data set and a knowledge base. 
We need an automatic process which involves entity linking and text analysis to extract new features from the knowledge base's free text. 
Our framework FeSTE is exactly this automatic process. 
So let's see an example of a dataset fed into FeSTE. 
In this example, the dataset is a university dataset whose goal is to classify universities into low-ranking and high-ranking universities. 
As the knowledge base, we use Wikipedia. 
The first phase of FeSTE is entity linking, in which each entity, in this example the university name, is linked to an entity within the knowledge base. 
The text of the linked entity is then extracted from the knowledge base and added to the dataset. 
In this example, the text is the Wikipedia page abstract. 
Now, we need to generate or extract features from the retrieved text. 
So, we need a feature extraction phase, which includes text analysis. 
This is the main novelty of this paper, and I will dive deeper into it in the next slides. 
After the feature extraction phase, there is a feature generation phase, in which we use the extracted features to generate a small number of new features. 
FeSTE generates as many features as there are classes in the original dataset. 
In this example, the original data set has two classes. 
So, FeSTE generates two new features. 
But if the dataset has five classes, FeSTE generates five new features. 
Each feature represents the likelihood of one class. 
To analyze the text, we use the current state of the art in text analysis, which is transformer-based language models such as BERT, GPT, XLNet, etc. 
But it is not likely that we can train a language model using the input datasets. 
So a naive approach would be target task fine-tuning. 
So, in the feature extraction phase, we can download a pre-trained language model and fine-tune it over the target dataset. 
In this example, we fine-tune the language model to classify text into classes, that is, abstracts into the classes low or high. 
We then receive the language model output, which is the likelihood for each class, and use it as new features. 
The problem with this approach is that datasets may have few distinct entities and texts. 
In our experiments, almost half of the datasets contain fewer than four hundred samples, and the smallest dataset contains thirty-five samples in its training set. 
So fine-tuning a language model over this dataset would be ineffective. 
But we can use prior knowledge from previously analyzed datasets. 
Because we apply FeSTE over multiple datasets, we can use the n minus one datasets to gather information and use this information when we analyze the n-th dataset. 
What we suggest is to add another fine-tuning phase: a preliminary multitask fine-tuning phase, in which we fine-tune the language model over the n minus one datasets. 
Then we execute another fine-tuning phase, target task fine-tuning, in which we fine-tune the language model over the n-th target dataset. 
The state of the art in multitask fine-tuning is called MT-DNN. 
MT-DNN maintains as many heads as there are tasks in the training set. 
So, in this example there are four tasks in the training set, so MT-DNN maintains four heads, as you can see in the image. 
It samples a random batch from the training set. 
If the random batch belongs to, for example, a single-sentence classification task, it executes forward and backward passes through the first head. 
If the random batch belongs to a pairwise ranking task, it executes forward and backward passes through the last head. 
In our scenario, tabular datasets vary in the number of classes. 
So there are many tasks. 
MT-DNN would have to maintain one head, that is, one output layer, for each number of classes. 
Additionally, MT-DNN needs to initialize a new head for a new dataset with a new task. 
In our approach, called task reformulation fine-tuning, instead of maintaining multiple heads, we reformulate each dataset into a sentence-pair classification problem, which is a two-class task. 
So let's see an example. 
Here is our input dataset, which consists of entities, features, text and classes. 
We reformulate the task from classifying the text into low or high to classifying the text, that is, the abstract, together with the class into true or false. 
In other words, we train the language model to classify an abstract and a class according to whether the abstract belongs to the class or not. 
So the label vector in this case always consists of two classes. 
And this is the algorithm for our reformulated fine-tuning approach. 
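The algorithm on the slide is not reproduced in this transcript. As a minimal sketch of just the reformulation step described above, each text and label row can be expanded into one text and class pair per candidate class, labeled true only for the matching class. The function name and the tuple layout are assumptions.

```python
def reformulate(texts, labels, class_names):
    """Expand (text, label) rows into (text, class_name, is_match) rows."""
    pairs = []
    for text, label in zip(texts, labels):
        for name in class_names:
            # True only when the candidate class is the row's real class.
            pairs.append((text, name, label == name))
    return pairs
```

With two classes, two input rows become four pairs, so the target is always binary however many classes the original dataset has, and the training set grows by a factor equal to the number of classes.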
So let's see the full framework. 
A dataset is fed into FeSTE. 
Then FeSTE executes the entity linking phase. 
It extracts the text from the knowledge base, which in this example is the abstract of the Wikipedia page. 
Then it reformulates the task into a sentence-pair classification task, applies the language model to the new task, and outputs the likelihood for each class. 
Note that the language model has already been fine-tuned over the n minus one datasets using the preliminary multitask fine-tuning. 
Then we use the output vector of the language model as newly generated features, one per class. 
To evaluate our framework, we use seventeen tabular classification datasets, which vary in size, features, balance, domain and initial performance. 
As the knowledge base, we use Wikipedia. 
We design our experiment as a leave-one-out evaluation, in which we train FeSTE over sixteen datasets and apply it to the seventeenth dataset. 
We also split each dataset into four folds and apply four-fold cross-validation. 
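The evaluation protocol above can be sketched as two small generators: one that holds out each dataset in turn, and one that splits a single dataset into folds. The function names are my own, and the contiguous, unshuffled folds are a simplifying assumption; the talk does not say whether the folds were shuffled or stratified.

```python
def leave_one_dataset_out(dataset_names):
    """Yield (datasets for preliminary fine-tuning, target dataset) pairs."""
    for i, target in enumerate(dataset_names):
        yield dataset_names[:i] + dataset_names[i + 1:], target


def k_fold_indices(n_samples, k=4):
    """Yield (train indices, test indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

With seventeen dataset names, the first generator yields seventeen pairs, each with sixteen datasets for the preliminary phase and one target.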
Then, we generate the new features and evaluate them using five evaluation classifiers. 
In our experiments, we use a BERT-based architecture. 
Here are the results of our experiments. 
You can see that we compare our framework to target task fine-tuning and to MT-DNN preliminary fine-tuning. 
Our reformulated fine-tuning achieves the best performance. 
While MT-DNN achieved a two percent improvement over target task fine-tuning, our approach achieved a six percent improvement. 
When we look at the small datasets, we can see that the performance of MT-DNN decreases, and the improvement from the preliminary multitask fine-tuning phase decreases to one point five percent. 
But our improvement increases to eleven percent compared to target task fine-tuning alone. 
To sum up, FeSTE enables few-shot enrichment from as few as thirty-five samples in our experiments. 
It uses one architecture for all tasks and datasets. 
And it keeps the head of the model. 
But it adds a reformulation phase. 
It augments the training set, and it needs a target value with semantic meaning so that we can feed it into the language model and use it in the sentence-pair classification problem. 
Thank you. 
