Hi everyone. Today I'm going to present our research work Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction.
I'm Allan from ByteDance AI Lab, and this is a joint work with Jierui Li from the University of Texas at Austin and Wei Lu from SUTD.
First, I'd like to talk about our motivation for reasoning.
So here we show an example where multi-step reasoning is helpful.
So this figure is taken from the PaLM paper, where they perform prompting to solve a math word problem in the few-shot learning scenario.
So on the left hand side, we can see if we give some examples with just question and answers, we might not be able to obtain the correct answers.
But if we give some more reasoning description, the model is able to predict the reasoning description and also make a correct prediction here.
So it is good to have interpretable multi-step reasoning as output.
And we also think math word problems are a straightforward application to evaluate such reasoning abilities.
So, here in our problem setup, given a question, we need to solve it and obtain the numerical answer.
In our datasets, we are also given the mathematical expression that leads to this particular answer.
Certain assumptions also apply, as in previous work.
We assume the precision of quantities is known.
And we only consider basic operators such as addition, subtraction, multiplication, division, and exponentiation.
Furthermore, complicated operators can be actually decomposed into these basic operators.
So, previous work in math word problem solving can be categorized into sequence-to-sequence and sequence-to-tree models.
Traditional sequence-to-sequence models convert the expression to a specific sequence for generation.
This is pretty easy to implement, and it can generalize to many different complicated problems.
But the drawbacks are that the performance is generally not better than that of the structured models, and it lacks interpretability for its predictions.
But this direction is still quite popular because of the transformer model.
So, in tree-based models, we structure these expressions in tree form and follow a pre-order traversal in tree generation.
So here we keep generating the operators until we reach the leaves, which are the quantities.
So the good thing here is that it gives us this binary tree structure, but it is quite counterintuitive because we generate the operators first and only at the end generate the quantities.
And the second thing is that it also contains some repetitive computations.
So here if we look at this expression, eight times three plus three is actually generated twice, but in fact we should reuse the results.
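To make the two drawbacks concrete, here is a minimal sketch of pre-order generation. The tree below is a hypothetical expression built around the sub-expression 8 * 3 + 3 mentioned in the talk; it is not the exact expression from the slide.

```python
# Hypothetical expression tree: (operator, left, right) tuples with numeric leaves.
sub = ("+", ("*", 8, 3), 3)        # 8 * 3 + 3
tree = ("/", ("*", sub, 3), sub)   # the same sub-expression appears twice

def preorder(node):
    """Emit tokens in pre-order: the operator first, then left and right subtrees."""
    if not isinstance(node, tuple):
        return [str(node)]
    op, left, right = node
    return [op] + preorder(left) + preorder(right)

tokens = preorder(tree)
print(" ".join(tokens))
```

The operators come out before any quantity, and the tokens for 8 * 3 + 3 are emitted twice because a tree has no way to refer back to an earlier result.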
So, in our proposed approach, we want to solve those problems in a step-by-step and interpretable manner.
So for example, here in the second step, we can obtain this divisor, which is twenty seven.
And we can also refer back to the original questions to find the relevant contents.
And in these steps we obtain the divisor.
And then at the third step we get the quotient.
Alright. And after these three steps, we can reuse the result from the second step to get the result of the fourth step, and then finally we can obtain the dividend.
So, here we generate the whole expression directly rather than generating single operators or quantities.
So this makes the process more accurate.
So, in our deductive system, we start with a bunch of quantities presented in the question, along with some constants, as our initial state.
So, the expression is represented by e_{i,j,op},
where we apply the operator op from q_i to q_j, and such an expression is directed.
So, we also have subtraction-reverse here to represent the opposite direction.
This is quite similar to relation extraction.
So in a formal deductive system, at a time step t, we apply the operator between the q_i and q_j pair, and then we obtain this new expression.
We add it to the next state to become a new quantity.
So, these slides visualize the evolution of the state, where we keep adding expressions to the current state.
So in our model implementation, we first use a pretrained language model, which can be BERT or Roberta, to encode the sentence, and then we obtain these quantity representations.
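The state evolution can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the operator names (including the "_rev" variants for the opposite direction) and the starting quantities are assumptions chosen for the example.

```python
import operator

# Directed operators; the "_rev" names (an assumed naming) apply the
# operation in the opposite direction, from q_j to q_i.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "-_rev": lambda a, b: b - a,
    "*": operator.mul,
    "/": operator.truediv,
    "/_rev": lambda a, b: b / a,
}

def deduce(state, i, j, op):
    """Apply op to (q_i, q_j) and return the next state with the result appended."""
    return state + [OPS[op](state[i], state[j])]

state = [8.0, 3.0, 3.0]            # hypothetical initial quantities
state = deduce(state, 0, 1, "*")   # new quantity q_3 = 8 * 3
state = deduce(state, 3, 2, "+")   # new quantity q_4 = q_3 + 3, reusing q_3
print(state)
```

Because every result is appended to the state, a later step can reuse it by index instead of regenerating it.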
So, once we get the quantity representations, we can start to do inference.
Here we show an example of how to obtain the representation for q_1 divided by q_2 and then times q_3.
First we get the pair representation, which is basically just the concatenation of q_1 and q_2, and then we apply a feedforward network that is parameterized by the operator.
And then finally we obtain the expression representation q_1 divided by q_2.
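As a rough sketch of this step, the snippet below concatenates two quantity vectors and passes them through an operator-specific feedforward layer. The hidden size, the single linear layer with ReLU, and the random weights are all assumptions standing in for a trained network; the actual model may combine the pair differently.

```python
import random

random.seed(0)
DIM = 4  # hypothetical hidden size

def make_ffn(in_dim, out_dim):
    """One linear layer plus ReLU with random weights (a stand-in for a trained network)."""
    weights = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    def ffn(x):
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
    return ffn

# One feedforward network per operator, i.e. "parameterized by the operator".
ffn_by_op = {op: make_ffn(2 * DIM, DIM) for op in ["+", "-", "*", "/"]}

def expression_repr(q_i, q_j, op):
    pair = q_i + q_j              # list concatenation acts as vector concatenation
    return ffn_by_op[op](pair)

q1 = [0.1, 0.2, 0.3, 0.4]
q2 = [0.5, 0.6, 0.7, 0.8]
e = expression_repr(q1, q2, "/")  # representation for q_1 / q_2
print(e)
```

The output has the same size as a quantity representation, which is what lets it join the state as a new quantity in the next step.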
But in practice, at the inference stage, we might get an incorrect expression as well.
So, here the number of all possible expressions equals three times the number of operators.
So the nice thing here is that we can easily add constraints to control this search space.
For example, if this expression is not allowed, we can simply remove this expression in our search space.
So in the second step, we do the same thing, but the only difference is that we have one more quantity.
This quantity comes from the previously calculated expression.
So finally we can obtain this final expression q_3 times q_4.
And we can also see that the number of all possible expressions is different from the previous step.
So, such a difference makes it hard to apply beam search because the probability distribution between these two steps is unbalanced.
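The size of the search space at each step can be checked with a small enumeration. This sketch assumes six directed operators (the exact operator set is an assumption) and shows how a disallowed expression is simply dropped from the candidates.

```python
from itertools import combinations

def candidates(num_quantities, operators, banned=frozenset()):
    """All (i, j, op) expressions over unordered quantity pairs, minus banned ones."""
    return [(i, j, op)
            for i, j in combinations(range(num_quantities), 2)
            for op in operators
            if (i, j, op) not in banned]

ops = ["+", "-", "-_rev", "*", "/", "/_rev"]   # assumed operator set

step1 = candidates(3, ops)                     # 3 pairs times the number of operators
step2 = candidates(4, ops)                     # one more quantity gives 6 pairs
constrained = candidates(3, ops, banned={(0, 1, "/")})

print(len(step1), len(step2), len(constrained))
```

With three quantities there are three pairs, so the count is three times the number of operators; adding one quantity doubles the number of pairs, which is why the distributions at consecutive steps are defined over spaces of different sizes.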
So the training procedure is similar to training a sequence to sequence model where we optimize the loss at each time step.
And here we also use this tau to represent when we should terminate this generation process.
And here the output space is different from sequence-to-sequence because it differs at each time step, while in a traditional sequence-to-sequence model it is the vocabulary size.
And it also allows us to impose certain constraints from prior knowledge.
So we conduct experiments on the commonly used math word problem datasets, MAWPS, Math23K,  MathQA and SVAMP.
And here we briefly show the results compared with the previous best approaches.
So our best performing variant is Roberta-DeductiveReasoner.
And in fact we do not use beam search; in contrast, all previous approaches use beam search.
All right. So, the best previous approaches are often tree-based models.
So, overall our reasoner is able to significantly outperform these tree-based models.
But we can see the absolute numbers on MathQA or SVAMP are not really high.
So we further investigate the results on SVAMP.
And this dataset is challenging because the authors manually added things to confuse the NLP model, such as irrelevant information and extra quantities.
So, in our predictions we find some of the intermediate values are actually negative.
For example, this question asks how many apples Jake has.
But we have some extra information like seventeen fewer pictures, and Steven has eight pictures, which is totally irrelevant.
So, our model makes some predictions like this, which produce negative values.
And we observe these two expressions actually have similar scores.
So, we can limit this search space by removing those results that are negative so that we can make the answer correct.
We further find that such a constraint improves performance quite a lot for some models.
For example, for BERT, we improve by seven points, and for the Roberta base model we improve by two points.
A better language model has better language understanding abilities, so the number here is higher for Roberta and lower for BERT.
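The non-negativity constraint amounts to a filter applied before picking the best-scoring expression. The sketch below uses made-up expressions and scores (the quantities eight and seventeen come from the example, everything else is hypothetical).

```python
def best_non_negative(scored_candidates):
    """Drop candidates whose value is negative, then return the top-scoring one."""
    valid = [c for c in scored_candidates if c["value"] >= 0]
    return max(valid, key=lambda c: c["score"]) if valid else None

# Two candidates with similar (hypothetical) scores; the higher one is negative.
scored = [
    {"expr": "8 - 17", "value": 8 - 17, "score": 0.51},
    {"expr": "17 - 8", "value": 17 - 8, "score": 0.49},
]

best = best_non_negative(scored)
print(best["expr"])
```

Without the filter, the slightly higher-scoring negative expression would win; with it, the model falls back to the valid alternative.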
And we also try to analyze the difficulty behind all these datasets.
We assume the number of unused quantities can be regarded as irrelevant information here.
So here we can see the percentage of samples with unused quantities, and the SVAMP dataset has the largest portion.
And here we also show the overall performance.
For those samples without unused quantities, the performance is actually higher than the overall performance.
But for those samples with unused quantities, it is way worse than the overall performance.
For MAWPS, we don't really have too many test cases, so I just ignore this part.
So, finally we want to show the interpretability through a question perturbation example.
So here our model actually makes a wrong prediction at the first step.
So, we can actually correlate this expression with the sentence here. Alright.
So, we think this sentence might be misleading the model into an incorrect prediction.
So here, planting another thirty five makes the model think it should be an addition operator.
So we try to revise the sentence to be something like: the number of pear trees is thirty five fewer than the apple trees.
So, we make it convey more accurate semantics such that the model is able to make the correct prediction.
So, this study shows how the interpretable predictions help us understand the model behavior.
So to conclude our work: first, our model is actually pretty efficient.
And we are able to provide an interpretable solving procedure.
And we can easily incorporate some prior knowledge as constraints, which can help improve the performance.
And the last thing is that the underlying mechanism applies not only to math word problem solving tasks but also to other tasks that involve multi-step reasoning.
We also have certain limitations.
If we have a large number of operators or constants, the memory consumption could be pretty high.
And the second thing is that, as mentioned, because the probability distribution is unbalanced between different time steps, it's also pretty challenging to apply a beam search strategy.
So this is the end of the talk, and questions are welcome. Thank you.
Hi, my name is Antoine and I'm from Maastricht University.
I will be presenting my joint work with Jerry which is about a New Dataset for Statutory Article Retrieval.
Legal issues are an integral part of many people's lives.
But the majority of citizens have little to no knowledge about their rights and fundamental legal processes.
As a result, many vulnerable citizens who cannot afford the costly assistance of a legal expert are left unprotected or, worse, exploited.
Our work aims to bridge the gap between people and the law by developing an effective retrieval system for statutory articles.
Such a system could provide a free professional legal help service for unskilled humans.
Before diving into the main contribution of this work, let's first describe the problem of statutory article retrieval.
Given a simple question on a legal matter such as, what do I risk if I violate professional confidentiality?
A model is required to retrieve all relevant statutory articles from a large body of legislation.
This information retrieval task comes with its own set of challenges.
First, it deals with two types of language.
Common natural language for the questions and complex legal language for the statutes.
This difference in language distributions makes it harder for a system to retrieve relevant candidates, as it indirectly requires an inherent interpretation system that can translate a natural question to a legal question that matches the terminology of statutes.
Besides, statutory law is not a stack of independent articles that can be treated as a complete source of information on their own, unlike news or recipes, for example.
Instead, it's a structured collection of legal provisions that have a whole meaning only when considered in the overall context, that is, together with the supplementary information from the neighboring articles, the fields and subfields they belong to, and their place in the structure of the law.
Lastly, statutory articles aren't small paragraphs, which are usually the typical retrieval unit in most retrieval works.
Here, they are long documents that may be up to six thousand words.
The recent advances in NLP have sparked huge interest in many legal tasks, such as legal judgment prediction or automated contract review.
But statutory article retrieval has remained mainly untouched due to the lack of large and high quality labeled datasets.
In this work, we present a new French native citizen-centric dataset to study whether retrieval models can approximate the efficiency and reliability of a legal expert for the task of statutory article retrieval.
Our Belgian statutory article retrieval dataset BSARD consists of more than one thousand one hundred legal questions posed by Belgian citizens.
These questions cover a wide range of topics from family, housing, money, to work and social security.
Each of them has been labeled by experienced jurists with references to relevant articles from a corpus of more than twenty-two thousand six hundred legal articles from Belgian codes of law.
Let's now talk about how we collected this dataset.
First, we started by compiling a large corpus of legal articles.
We considered thirty two publicly available Belgian codes and extracted all the articles as well as the corresponding section headings.
Then we gathered legal questions with references to relevant statutes.
To do so, we partnered with a Belgian law firm that receives each year around four thousand emails from Belgian citizens who ask for advice on a personal legal issue.
We were lucky enough to get access to their websites, where their team of experienced jurists addresses Belgians' most common legal issues.
We collected thousands of questions annotated with categories, subcategories and legal references to relevant statutes.
Lastly, we parsed the legal references and filtered out the questions whose references were not articles in one of the codes of law we considered.
The remaining references were matched and converted to the corresponding article ids from our corpus.
We eventually ended up with one thousand one hundred and eight questions, each carefully labeled with the ids of the relevant articles from our large corpus of twenty two thousand six hundred and thirty three statutory articles.
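The filtering and matching step can be pictured as follows. This is only a sketch under assumed data structures: the reference strings, the field names, and the lookup table are hypothetical placeholders, not the format used in the actual dataset.

```python
def link_questions(questions, article_ids):
    """Map each reference to a corpus article id; drop questions with no match."""
    linked = []
    for q in questions:
        ids = [article_ids[ref] for ref in q["references"] if ref in article_ids]
        if ids:  # keep the question only if at least one reference is in the corpus
            linked.append({"question": q["question"], "article_ids": ids})
    return linked

# Hypothetical lookup from a parsed reference to its article id in the corpus.
article_ids = {"CODE_A art. 1": 0, "CODE_B art. 7": 1}

questions = [
    {"question": "Question citing one collected and one uncollected text",
     "references": ["CODE_B art. 7", "DECREE_X art. 3"]},
    {"question": "Question citing only an uncollected text",
     "references": ["ORDINANCE_Y art. 2"]},
]

linked = link_questions(questions, article_ids)
print(linked)
```

The first question survives with only part of its original references, and the second is filtered out entirely.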
In addition, each question comes with the main category and a concatenation of subcategories.
And each article comes with a concatenation of the subsequent headings in the structure of the law.
This extra information is not used in the present work, but might be of interest for future research on legal information retrieval or legal text classification.
Let's look at some characteristics of our dataset.
The questions are between five and forty four words long with a median of fourteen words.
The articles are much longer with a median length of seventy seven words, with one hundred and forty two of them exceeding one thousand words, the lengthiest one being up to five thousand seven hundred and ninety words.
As previously mentioned, the questions cover a wide range of topics, with around eighty five percent of them being either about family, housing, money or justice, while the remaining fifteen percent concern either social security, foreigners or work.
The articles are also very diverse as they come from thirty two different Belgian codes that cover a large number of legal topics.
Here's the total number of articles collected from each of these Belgian codes.
Out of the twenty two thousand six hundred and thirty three articles, only one thousand six hundred and twelve are referred to as relevant to at least one question in the dataset.
And around eighty percent of these cited articles come from either the civil code, judicial code, criminal investigation code or penal code.
Meanwhile, eighteen out of thirty two codes have fewer than five articles mentioned as relevant to at least one question, which can be explained by the fact that those codes focus less on individuals and their concerns.
Overall, the median number of citations for these cited articles is two, and less than twenty-five percent of them are cited more than five times.
Using our dataset, we benchmarked several retrieval approaches, including lexical and dense architectures.
Given a query and an article, a lexical model assigns a score to the query article pair by computing the sum over the query terms of the weights of each of these terms in that article.
We experiment with the standard TF-IDF and BM25 ranking functions.
The main problem with these approaches is that they can only retrieve articles that contain keywords present in the query.
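A lexical scorer of this kind can be sketched as a sum of per-term weights. The snippet uses one simple TF-IDF variant on a toy tokenized corpus (the corpus and query are made up); BM25 follows the same summation pattern but adds term-frequency saturation and document-length normalization.

```python
import math
from collections import Counter

def tfidf_score(query, article, corpus):
    """Sum, over the query terms, of each term's TF-IDF weight in the article."""
    tf = Counter(article)
    n_docs = len(corpus)
    score = 0.0
    for term in query:
        df = sum(1 for doc in corpus if term in doc)
        if df == 0 or tf[term] == 0:
            continue  # a term missing from the article contributes nothing
        score += (tf[term] / len(article)) * math.log(n_docs / df)
    return score

corpus = [
    ["tenant", "must", "pay", "rent"],
    ["landlord", "may", "evict", "tenant"],
    ["employer", "must", "pay", "salary"],
]
query = ["evict", "tenant"]

scores = [tfidf_score(query, doc, corpus) for doc in corpus]
print(scores)
```

The third article shares no term with the query and scores exactly zero, which illustrates the keyword-matching limitation just described.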
To overcome this limitation, we experiment with a neural-based architecture that can capture semantic relationships between queries and articles.
We use a bi-encoder model that maps queries and articles into dense vector representations and calculates a relevance score between a query article pair by the similarity of their embeddings.
These embeddings typically result from a pooling operation on the output of a word embedding model.
First, we study the effectiveness of Siamese bi-encoders in a zero shot evaluation setup, meaning that pretrained word embedding models are applied out-of-the-box without any additional finetuning.
We experiment with context-independent text encoders, namely word2vec and fastText, and context-dependent embedding models, namely Roberta and more specifically CamemBERT, which is a French Roberta model.
Additionally, we train our own CamemBERT-based bi-encoders on our dataset.
Note that for training, we experiment with the two flavors of the bi-encoder architecture.
Siamese, which uses a unique word embedding model that maps the query and article together in a shared dense vector space, and two-tower, which uses two independent word embedding models that encode the query and article separately into different embedding spaces.
We experiment with mean, max and CLS pooling as well as product and cosine for computing similarities.
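The pooling and similarity options can be written out in plain Python. The token vectors below are toy values, and treating the first token as the CLS token is an assumption about the encoder's output layout.

```python
import math

def mean_pool(token_vecs):
    return [sum(col) / len(token_vecs) for col in zip(*token_vecs)]

def max_pool(token_vecs):
    return [max(col) for col in zip(*token_vecs)]

def cls_pool(token_vecs):
    return token_vecs[0]  # assumes the first token is the special CLS token

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy token embeddings for a query and an article.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
article_tokens = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]

q = mean_pool(query_tokens)
a = mean_pool(article_tokens)
print(dot(q, a), cosine(q, a))
```

Here the two pooled vectors point in the same direction, so the cosine is 1 regardless of length, while the dot product also reflects the vectors' magnitudes.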
Here are the results of our baselines on the test set, with the lexical methods above, the Siamese bi-encoders evaluated in a zero-shot setup in the middle, and the finetuned bi-encoders below.
Overall, the finetuned bi-encoder significantly outperforms all the other baselines.
The two-tower model improves over its Siamese variant on recall at one hundred, but performs similarly on the other metrics.
Although BM25 underperforms the trained bi-encoder significantly, its performance indicates that it's still a strong baseline for domain-specific retrieval.
Regarding the zero shot evaluation of Siamese bi-encoder, we find that directly using the embeddings of a pretrained CamemBERT model without optimizing for the information retrieval task gives poor results, which is consistent with previous findings.
Furthermore, we observe that the word2vec-based bi-encoder significantly outperforms the fastText and BERT-based models, suggesting that maybe pretrained word-level embeddings are more appropriate for the task than character-level or subword-level embeddings when used out of the box.
Although promising, these results suggest ample opportunity for improvement compared to a skilled legal expert who can eventually retrieve all relevant articles to any question and thus get perfect scores.
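The recall-at-k metric mentioned in these results can be computed as below. This is the standard definition, shown on made-up article ids; a perfect retriever, like the expert in the comparison, reaches a recall of one once k covers all relevant articles.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant articles that appear in the top-k ranked results."""
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

ranked = [5, 3, 9, 1]     # hypothetical ranking returned by a retriever
relevant = [3, 1, 7]      # hypothetical gold article ids for the question

print(recall_at_k(ranked, relevant, 2), recall_at_k(ranked, relevant, 4))
```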
Let's conclude by discussing two limitations of our dataset.
First, the corpus of articles is limited to those collected from the thirty two considered Belgian codes, which does not cover the entire Belgian law as articles from decrees, directives and ordinances are missing.
During the dataset construction, all references to these uncollected articles are ignored, which causes some questions to end up with only a fraction of the initial number of relevant articles.
This information loss implies that the answer contained in the remaining relevant articles might be incomplete, although it's still completely appropriate.
Second, we should note that not all legal questions can be answered with statutes alone.
For instance, the question "Can I evict my tenants if they make too much noise?" might not have a detailed answer within statutory law that quantifies a specific noise threshold at which eviction is allowed.
Instead, the landlord should probably rely more on case law and find precedents similar to their current situation.
For example, the tenants throw two parties a week until two AM.
Hence, some questions are better suited than others to the statutory article retrieval task, and the domain of the less suitable ones remains to be determined.
We hope that our work sparks interest in developing practical and reliable statutory article retrieval models that can help improve access to justice for all.
You can check out our paper, dataset and code at the following links. Thank you.
Hello, we are happy to present our work on VALSE, a Task-Independent Benchmark meant for testing vision and language models with specific linguistic phenomena.
Why did we go to the trouble of setting up this benchmark?
Well, during the last years, we have seen an explosion of transformer based vision and language models pretrained on large amounts of image text pairs.
Each one of these models pushes state-of-the-art on vision and language tasks such as visual question answering, visual common sense reasoning, image retrieval, phrase grounding.
So we got the message: the accuracies on these tasks and specific benchmarks are increasing steadily.
But do we know what the models have actually learned?
What is it that a vision and language transformer understood when assigning a high score for this image and this sentence to match, and a low score for this one?
Do vision and language models focus on the right thing?
Or do they focus on biases as shown by previous work?
To shed more light on this aspect, we propose a more task agnostic direction and introduce VALSE that tests the sensitivity of vision and language models to specific linguistic phenomena that affect both the linguistic and the visual modalities.
We target existence, plurality, counting, spatial relations, actions and entity coreference.
But how do we test whether the vision and language models have captured these phenomena?
By foiling, a method previously applied for vision and language models only for noun phrases by Ravi Shekhar and collaborators, and on counting by us in previous work.
Foiling basically means that we take the caption of an image and produce a foil by altering the caption such that it does not describe the image anymore.
And we do these phrase alterations by focusing on six specific pieces such as existence, plurality, counting, spatial relations, actions and entity coreference, where each piece can consist of one or more instruments, in case we found more than one interesting way to create foil instances.
For example, in the case of the actions piece, we have two instruments, one in which the action verb is changed with a different action, and one in which actants are swapped.
Counting and coreference also are pieces that have more than one instrument.
And we create these foils by making sure that they fail to describe the image, that they are grammatical, and otherwise valid sentences.
This is not easy to do because a foiled caption may be less likely than the original caption.
For example, though it's not impossible, it is statistically less likely for plants to cut a man than a man to cut plants, and large vision and language models could pick up on this.
Therefore, to obtain valid foils, we must take action.
First, we make use of strong language models to propose foils.
Second, we use natural language inference, or NLI for short, to filter out foils that could still be describing the image, since when constructing foils we need to ensure that they fail to describe the image.
To test this automatically, we apply natural language inference with the following rationale.
We consider an image to be the premise and its caption its entailed hypothesis.
In addition, we consider the caption to be the premise, and the foil to be its hypothesis.
If an NLI model predicts the foil to contradict or to be neutral with respect to the caption, we take this as an indicator of a valid foil.
If an NLI model predicts the foil to be entailed by the caption, it cannot be a good foil, since by transitivity it will give a truthful description of the image, and we filter these foils out.
But this procedure is not perfect, it is just an indicator for valid foils.
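The filtering rule itself is a one-line condition on the NLI label. In this sketch, `nli_predict` is a hypothetical callable standing in for a real NLI model, and the toy label table exists only so the example runs on its own.

```python
def keep_valid_foils(pairs, nli_predict):
    """Keep (caption, foil) pairs whose foil is NOT entailed by the caption."""
    return [(caption, foil) for caption, foil in pairs
            if nli_predict(premise=caption, hypothesis=foil)
            in {"contradiction", "neutral"}]

# Toy stand-in for an NLI model (hypothetical labels, for illustration only).
TOY_LABELS = {
    ("A man cuts plants.", "Plants cut a man."): "contradiction",
    ("There are two dogs.", "There are animals."): "entailment",
}

def toy_nli(premise, hypothesis):
    return TOY_LABELS.get((premise, hypothesis), "neutral")

pairs = list(TOY_LABELS)
kept = keep_valid_foils(pairs, toy_nli)
print(kept)
```

The entailed candidate is dropped because it would still describe the image truthfully; as the talk notes, the surviving candidates are only likely to be valid, not guaranteed.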
Therefore, as a third measure for generating valid foils, we employ human annotators to validate the data used in VALSE.
So, after filtering and human evaluation, we have as many test instances as described in this table.
Note that VALSE does not deliver any training data but only test data.
Since it is a zero shot testing benchmark only, it is designed to leverage the existing capabilities of vision and language models after pretraining.
Finetuning would only enable models to exploit artifacts or statistical biases in the data.
And we all know that these models like to cheat and take shortcuts.
And as we said, we are interested in assessing what capabilities the vision and language models have after pretraining.
We experiment with five vision and language models on VALSE, namely with CLIP, LXMert, ViLBERT, ViLBERT twelve in one, and VisualBERT.
Two of our most important evaluation metrics are the accuracy of the models in classifying image sentence pairs into captions and foils.
Perhaps more relevant for this video, we will showcase our more permissive metric, the pairwise accuracy, which measures whether the image sentence alignment score is greater for the correct image text pair than for its foiled pair.
For more metrics and results on them, do check out our paper.
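Pairwise accuracy, as described above, only compares the two alignment scores within each instance. A minimal sketch with made-up scores:

```python
def pairwise_accuracy(score_pairs):
    """Share of instances where the caption scores strictly higher than its foil."""
    correct = sum(1 for caption_score, foil_score in score_pairs
                  if caption_score > foil_score)
    return correct / len(score_pairs)

# Hypothetical (caption_score, foil_score) pairs from an image-text alignment model.
score_pairs = [(0.9, 0.2), (0.4, 0.6), (0.7, 0.1), (0.5, 0.5)]
print(pairwise_accuracy(score_pairs))
```

No classification threshold is involved, which is what makes this metric more permissive than per-sentence accuracy. Counting a tie as incorrect is an assumption of this sketch.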
The results with pairwise accuracy are shown here, and they are consistent with the results we got from the other metrics: the best zero-shot performance is achieved by ViLBERT twelve in one, followed by ViLBERT, LXMert, CLIP, and finally VisualBERT.
It's notable how instruments centered on the individual objects like existence and noun phrases are almost solved by ViLBERT twelve in one, highlighting that models are capable of identifying named objects and their presence in images.
However, none of the remaining pieces can be reliably solved in our adversarial foiling settings.
We see from the plurality and counting instruments that vision and language models have trouble distinguishing references to single versus multiple objects, or counting them in an image.
The relation piece shows that they have difficulties in correctly classifying a named spatial relation between objects in an image.
They also have trouble distinguishing actions and identifying their participants, even if supported by plausibility biases as we see in the actions piece.
From the coreference piece, we find out that tracing multiple references to the same object in an image by using pronouns is also difficult for vision and language models.
As a sanity check, and because it's an interesting experiment, we also benchmark two text only models, GPT one and GPT two, to assess whether VALSE is solvable by these unimodal models by computing the perplexity of the correct and the foiled caption, no image here, and predicting the entry with the lowest perplexity.
If the perplexity is higher for the foil, we take this as an indication that the foiled caption may suffer from plausibility bias or other linguistic biases.
And it's interesting to see that in some cases, the text only GPT models have captured the plausibility of the world better than the vision and language models.
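The text-only baseline can be sketched as a perplexity comparison. The per-token log-probabilities below are invented; in practice they would come from a language model such as GPT-2.

```python
import math

def perplexity(token_logprobs):
    """Exponential of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_lowest_perplexity(caption_logprobs, foil_logprobs):
    """Predict whichever sentence the language model finds more likely."""
    if perplexity(caption_logprobs) < perplexity(foil_logprobs):
        return "caption"
    return "foil"

# Hypothetical natural-log token probabilities for a caption and its foil.
caption_lp = [math.log(0.5)] * 4
foil_lp = [math.log(0.5), math.log(0.5), math.log(0.1), math.log(0.5)]

print(perplexity(caption_lp), pick_lowest_perplexity(caption_lp, foil_lp))
```

If this image-free rule already picks the caption, the foil was less plausible as text, which is the bias signal described above.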
So to sum up, VALSE is a benchmark that uses the lens of linguistic constructs to help the community improve vision and language models by hard testing their visual grounding capabilities.
Our experiments show that vision and language models identify named objects and their presence in images well, as shown by the existence piece, but struggle to ground their interdependence and relationships in visual scenes when forced to respect linguistic indicators.
We would really like to encourage the community to use VALSE for measuring progress towards language grounding with vision and language models.
And even more, VALSE could be used as an indirect assessment of datasets, as models could be evaluated before and after training or finetuning to see whether a dataset helps models improve on any of the aspects tested by VALSE.
If you're interested, do check out the VALSE data on GitHub, and if you have any questions do not hesitate to contact us.
Hello, my name is Kamezawa from the University of Tokyo.
I'll be presenting a paper entitled RNSum: A Large-Scale Dataset for Automatic Release Note Generation via Commit Logs Summarization.
I'll be explaining in this order.
First, I will introduce automatic release note generation that we are working on in this research.
A release note is a technical document that summarizes the changes distributed with each release of a software product.
The image shows a release note for version two point six point four of the vuejs library.
Release notes play an important role in open source development but they're time consuming to prepare manually.
Therefore, it would be very useful to be able to automatically generate high quality release notes.
I will refer to two previous studies on automatic release note generation.
The first is a system called ARENA released in twenty fourteen.
It takes a rule-based approach, for example using the change extractor to extract all differences, library changes and document changes from the differences between releases, and finally combining them.
The most notable feature of this system is the issue extractor in the upper right corner, which must be linked to Jira, the issue tracker system, and can only be applied to projects that use Jira.
In other words, it cannot be used for many projects on GitHub.
The second is Glyph, recently announced in twenty twenty.
It is available on the internet and can be installed via pip.
This system has a simple learning based text classification model and outputs one of five labels such as features or bug fixes for each input commit message.
This image is a sample usage that returns a corrective or bug fixes label.
Glyph's training data is fairly small, about five thousand, and as will be shown in the experiments described below, the performance of the text classification model is not high.
I presented two related studies, but their problems are limited applicability and scarce data resources.
Our paper solves these two problems and automatically generates high quality release notes.
For the limited applicability problem, we propose a high quality classwise summarization method using only commit messages as input.
This proposed method can be used for all English repositories.
For the second problem of scarce data resources, we built our RNSum dataset consisting of about eighty two thousand pieces of data by collecting data from public GitHub repositories using the GitHub API.
Next, I'll describe our dataset.
Here is an example of data.
The left side is a commit message and the right side is the release notes.
Release notes are labeled as improvements or fixes, etc.
We have set up a task that takes the commit messages as input and outputs labeled release notes.
This can be regarded as a summarization task.
We have predefined four labels: features; improvements; bug fixes; and deprecations, removals and breaking changes.
These were set based on previous research and other factors.
The release note on the bottom right is extracted from the release note on the bottom left.
At this time, it is necessary to detect the four labels that have been set up in advance.
But the labels are not always consistent across repositories.
For example, the improvements label includes improvements, enhancements, optimizations, and so on.
We prepared a vocabulary list of about thirty labels for each of these notational variations.
We use this list to detect the release note class, and we collect the text of the release that follows as the release note sentences for that class.
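As a rough illustration of this detection step, the following Python sketch maps heading variants to the predefined classes and collects the lines that follow each recognized heading. The vocabulary entries and the helpers `normalize_label` and `collect_notes` are hypothetical; the actual list of about thirty labels is not given in the talk.

```python
import re

# Hypothetical vocabulary: heading variants -> one of the four predefined classes.
VOCAB = {
    "features": {"features", "new features", "added"},
    "improvements": {"improvements", "enhancements", "optimizations"},
    "bug fixes": {"bug fixes", "fixes", "fixed"},
    "deprecations, removals and breaking changes": {
        "deprecations", "removals", "breaking changes"},
}

def normalize_label(heading):
    """Return the canonical class for a heading, or None if it is unknown."""
    key = re.sub(r"[^a-z ]", "", heading.lower()).strip()
    for label, variants in VOCAB.items():
        if key in variants:
            return label
    return None

def collect_notes(release_text):
    """Group the lines that follow each recognized heading under its class."""
    notes, current = {}, None
    for line in release_text.splitlines():
        line = line.strip()
        if not line:
            continue
        label = normalize_label(line)
        if label is not None:
            # A known heading starts a new class segment.
            current = label
            notes.setdefault(current, [])
        elif line.startswith("#"):
            # An unknown heading ends the current segment.
            current = None
        elif current is not None:
            notes[current].append(line.lstrip("-* ").strip())
    return notes
```

With this sketch, a release body that uses "Enhancements" and "Fixed" as headings ends up under the improvements and bug fixes classes, and text under unrecognized headings is ignored.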
Next are the commit messages.
Commit messages are not tied to each release.
As shown in the image below, if the current release is version two point five point nineteen, we need to identify the previous release, version two point five point eighteen, and get a diff.
This is a bit tedious and it is not enough to just get a list of releases and look at the before and after.
We created a heuristic matching rule to get the previous and next versions.
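The talk does not spell out the matching rule, so the following is only a minimal sketch of one plausible heuristic: parse the numeric parts of each release tag and pick the largest version strictly below the current one. The function names and the tag format (such as `v2.5.19`) are assumptions.

```python
import re

def parse_version(tag):
    """Turn a tag such as 'v2.5.19' into a tuple of ints; None if it has no digits."""
    numbers = re.findall(r"\d+", tag)
    return tuple(int(n) for n in numbers) if numbers else None

def previous_release(current, tags):
    """Return the tag with the largest version strictly below `current`, or None."""
    cur = parse_version(current)
    if cur is None:
        return None
    older = []
    for tag in tags:
        version = parse_version(tag)
        if version is not None and version < cur:
            older.append((version, tag))
    return max(older)[1] if older else None
```

Sorting the raw tag strings would not be enough here, because "2.5.9" sorts after "2.5.18" as text; comparing integer tuples avoids that.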
Dataset analysis.
In the end, seven thousand two hundred repositories and eighty two thousand pieces of data were collected.
Also, the average number of release notes tokens is sixty three, which is quite high for a summarization task.
Also, the number of unique tokens is quite large at eight thousand eight hundred thirty thousand.
This is due to the large number of unique class or method names found in the repository.
Next, I will explain the proposed method.
The classwise extractive-then-abstractive summarization model, CEAS, consists of two neural modules: a classifier using BERT or CodeBERT and a generator using BART.
First, CEAS uses the classifier to classify each commit message into five release note classes: features, improvements, bug fixes, deprecations, plus other.
The commit messages classified as other are discarded.
Then CEAS applies the generator to the four labeled documents independently and generates release notes for each class.
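The classify, discard, and generate steps just described can be sketched as follows. This is only an outline under stated assumptions: `classify` and `generate` are placeholder callables standing in for the BERT or CodeBERT classifier and the BART generator, and the function name `ceas` and the short class names are chosen for illustration.

```python
from collections import defaultdict

# Short names for the four kept classes; "other" is the discarded fifth class.
CLASSES = ["features", "improvements", "bug fixes", "deprecations"]

def ceas(commit_messages, classify, generate):
    """Classify each commit message, drop 'other', and generate notes per class."""
    grouped = defaultdict(list)
    for message in commit_messages:
        label = classify(message)
        if label in CLASSES:  # messages classified as "other" are discarded
            grouped[label].append(message)
    # The generator is applied to each class's messages independently.
    return {label: generate(label, messages) for label, messages in grouped.items()}
```

Classes that receive no commit messages simply do not appear in the result, which corresponds to an empty release note for that class.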
In this task, the direct correspondences between commit messages and release notes are not known.
Therefore, to train the classifier, we assigned pseudo labels to each input commit message using the first ten characters of each commit message.
We modeled the classwise abstractive summarization approach by two different methods.
The first model, which we call CAS-Single, consists of a single seq2seq network and generates a single release note text given a concatenation of input commit messages.
The output texts can be divided into classwise segments based on special class-specific endpoint symbols.
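To make the segmentation concrete, here is a small sketch of splitting one generated text into classwise segments. The symbol strings in `END_SYMBOLS` are invented for illustration; the talk does not say what the actual class-specific symbols look like.

```python
import re

# Hypothetical end-of-segment symbols, one per class.
END_SYMBOLS = {
    "<e_feat>": "features",
    "<e_impr>": "improvements",
    "<e_fix>": "bug fixes",
    "<e_depr>": "deprecations",
}

def split_by_class(output):
    """Split generated text into {class: segment} using the end symbols."""
    pattern = "(" + "|".join(re.escape(s) for s in END_SYMBOLS) + ")"
    # With a capturing group, re.split returns text, symbol, text, symbol, ..., tail.
    parts = re.split(pattern, output)
    segments = {}
    for text, symbol in zip(parts[0::2], parts[1::2]):
        segments[END_SYMBOLS[symbol]] = text.strip()
    return segments
```

Any text after the last symbol is dropped in this sketch, since it does not belong to a closed segment.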
The second method, which we call CAS-Multi, consists of four different seq2seq networks, each of which corresponds to one of the fixed release note classes.
Okay, let me explain the experiments.
Five methods were compared: CEAS, CAS-Single, CAS-Multi, Clustering, and the previous study, Glyph.
Regarding evaluation, in some cases, release notes are output in multiple sentences.
Since it is difficult to calculate the number of sentences as they are, they are combined with spaces and treated as one long sentence.
The BLEU is penalized when the system outputs a short sentence.
This penalty results in a lower BLEU value in the experiment results described next.
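The penalty referred to here is BLEU's brevity penalty. The sketch below shows its standard definition, with `candidate_len` and `reference_len` as token counts; which BLEU implementation was used in the experiments is not stated in the talk.

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """Standard BLEU brevity penalty: 1 if the output is longer than the reference,
    otherwise exp(1 - reference_len / candidate_len)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)
```

Because the penalty decays exponentially in the length ratio, an output that is a quarter of the reference length has its BLEU score multiplied by roughly 0.05.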
Finally, we also calculate the specificity because ROUGE and BLEU cannot be calculated if the release notes are empty.
A higher specificity means that the model correctly outputs an empty text in cases where the reference release notes are empty.
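Read this way, specificity can be computed as the share of empty references for which the system output is also empty. The sketch below follows that reading; the function name and the choice to return `None` when there are no empty references are assumptions.

```python
def specificity(references, outputs):
    """Fraction of empty references whose system output is also empty."""
    outputs_for_empty_refs = [
        out for ref, out in zip(references, outputs) if not ref.strip()
    ]
    if not outputs_for_empty_refs:
        return None  # undefined when no reference is empty
    correct = sum(1 for out in outputs_for_empty_refs if not out.strip())
    return correct / len(outputs_for_empty_refs)
```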
Here are the results.
Since the dataset contains e-mail addresses, hash values, etc., we also evaluated on the cleaned dataset, which excludes them.
CEAS and CAS achieved ROUGE-L scores more than ten points higher than the baselines.
In particular, on the clean test set, the score gap between the proposed method and the baselines jumped to more than twenty points.
These results indicate that CEAS and CAS are significantly effective.
CEAS got a better ROUGE-L score than CAS, suggesting that combining a classifier and a generator is effective, as is training the classifier using pseudo labels.
High coverage of CEAS can be achieved probably because the classifier can focus on selecting relevant commit messages for each class.
CAS-Multi tended to yield higher ROUGE-L than CAS-Single, suggesting that it is also effective to independently develop different abstractive summarization models for each release note class.
Here is an error analysis.
CAS methods tend to output shorter sentences than human reference sentences.
In the figure on the right, the reference sentence has three or four sentences, while CAS has only one.
The reason for this model's reluctance is that in training data, only thirty three percent of the sentences are present in the features label and forty percent in the improvements label.
Furthermore, CAS methods cannot generate accurate release notes without additional information.
The top example on the right is an example of a very messy commit message, and the complete sentence cannot be generated without reference to the corresponding pull request or issue.
The example below shows that the two commit messages in the input are related and should be combined into one sentence, but it fails to do so.
Finally, a conclusion.
We have built a new dataset for automatic release note generation.
We have also formulated a task of taking commit messages as input and summarizing them, so that it is applicable to all projects written in English.
Our experiments show that the proposed method generates less noisy release notes at higher coverage than the baselines.
Please check out our dataset on GitHub.
Thank you.
Hello. My name is Asaf Harari.
And I will present our paper, Few-Shot Tabular Data Enrichment Using Fine-Tuned Transformers Architectures.
Data scientists analyze data and mainly focus on manipulating the data's existing features.
But sometimes, these features are limited.
Feature generation using another data source may add substantial information.
Our research goal is automatic tabular data enrichment using external sources' free text.
Assume we have a tabular dataset and a knowledge base.
We need an automatic process which involves entity linking and text analysis to extract new features from the knowledge base's free text.
Our framework FeSTE is exactly this automatic process.
So let's see an example of a dataset fed into FeSTE.
In this example, the dataset is a university dataset whose goal is to classify universities into low-ranking and high-ranking universities.
As the knowledge base, we use Wikipedia.
The first phase of FeSTE is entity linking, in which each entity, in this example the university name, is linked to an entity within the knowledge base.
The text of the linked entities in the knowledge base is then extracted and added to the dataset.
In this example, the text is the Wikipedia page's abstract.
Now, we need to generate or extract features from the retrieved text.
So, we need a feature extraction phase, which includes text analysis.
And this is the main novelty of this paper and I will deep dive into it in the next slides.
After the feature extraction phase, there is a feature generation phase, in which we use the extracted features to generate a small number of new features.
FeSTE generates as many new features as there are classes in the original dataset.
In this example, the original dataset has two classes.
So, FeSTE generates two new features.
But if the dataset has five classes, FeSTE generates five new features.
Each feature represents the likelihood for each class.
To analyze the text, we use the current state of the art in text analysis: transformer-based language models such as BERT, GPT, and XLNet.
But it is not likely that we can train language models using the input datasets.
So a naive approach would be target task finetuning.
So, in the feature extraction phase, we can download a pretrained language model and finetune it over the target dataset.
In this example, we finetune the language model to classify the text, that is, each abstract, into the classes low or high.
We then receive the language model output, which is the likelihood for each class, and use it as new features.
The problem with this approach is datasets may have few distinct entities / texts.
In our experiments, almost half of the datasets contain fewer than four hundred samples, and the smallest dataset contains thirty five samples in its training set.
So finetuning a language model over this dataset would be ineffective.
But we can use prior knowledge about pre-analyzed datasets.
Because we apply FeSTE over multiple datasets, we can use the other n minus one datasets to gather information, and use this information when we analyze the nth dataset.
What we suggest is to add another finetuning phase: a preliminary multitask finetuning phase, in which we finetune the language model over the n minus one datasets.
Then we execute another finetuning phase, the target task finetuning, in which we finetune the language model over the nth target dataset.
The state of the art in multitask finetuning is called MTDNN.
MTDNN maintains as many heads as there are tasks in the training set.
So, in this example there are four tasks in the training set, so MTDNN maintains four heads, as you can see in the image.
It samples a random batch from the training set.
If the random batch belongs to, for example, a single sentence classification task, it executes forward and backward passes through the first head.
And if the random batch belongs to a pairwise ranking task, it executes forward and backward passes through the last head.
In our scenario, tabular datasets vary in the number of classes.
So there are many tasks.
MTDNN would have to maintain a head, that is, an output layer, for each number of classes.
Additionally, MTDNN needs to initialize new heads for a new dataset with a new task.
In our approach, called task reformulation finetuning, instead of maintaining multiple heads, we reformulate each dataset into a sentence pair classification problem, which is a two-class task.
So let's see an example.
Here is our input dataset, which consists of entities, features, text, and classes.
We reformulate the task from classifying the text into low or high to classifying the pair of the text, that is, the abstract, and the class into true or false.
In other words, we train the language model to classify whether an abstract belongs to a class or not.
So the label vector in this case always consists of two classes.
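The reformulation can be illustrated with a short sketch. It assumes that every text is paired with every candidate class, which is one way to obtain true and false examples; the function name `reformulate` and the example rows are hypothetical.

```python
def reformulate(rows, classes):
    """Turn (text, class) rows into (text, candidate_class, label) pairs.

    The label is True when the candidate class is the row's actual class,
    so every dataset becomes a two-class sentence pair problem.
    """
    pairs = []
    for text, true_class in rows:
        for candidate in classes:
            pairs.append((text, candidate, candidate == true_class))
    return pairs
```

Under this assumption a dataset with k classes yields k pairs per original row, and the output layer has two classes regardless of k.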
And this is the algorithm for our reformulated finetuning approach.
So let's see the full framework.
A dataset is fed into FeSTE.
Then FeSTE executes the entity linking phase.
It extracts the text from the knowledge base, which in this example is the abstract of the Wikipedia page.
Then it reformulates the task into a sentence pair classification task.
It applies the language model to the new task and outputs the likelihood for each class.
Note that the language model has already been finetuned over the n minus one datasets using the preliminary multitask finetuning.
Then we use the output vector of the language model as newly generated features, one per class.
To evaluate our framework, we use seventeen tabular classification datasets, which vary in size, features, balance, domain, and initial performance.
As the knowledge base, we use Wikipedia.
We design our experiment as a leave-one-out evaluation, where we train FeSTE over sixteen datasets and apply it to the seventeenth dataset.
We also split each dataset into four folds and apply four-fold cross-validation.
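The evaluation protocol can be sketched as two nested splits. The helper names and the interleaved fold assignment (no shuffling or stratification) are assumptions made for illustration; the talk does not say how the folds were drawn.

```python
def leave_one_dataset_out(names):
    """Yield (training_datasets, target_dataset) for each dataset in turn."""
    for target in names:
        yield [name for name in names if name != target], target

def k_fold_indices(n_samples, k=4):
    """Yield (train_indices, test_indices) for a simple interleaved k-fold split."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]
```

With seventeen dataset names, the outer loop produces seventeen runs of sixteen training datasets each, and the inner split is applied to the held-out target dataset.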
Then, we generate the new features and evaluate them using five evaluation classifiers.
In our experiments, we use the BERT base architecture.
Here are the results for our experiments.
You can see that we compare our framework to target dataset finetuning, target task finetuning, and MTDNN preliminary finetuning.
And our reformulated finetuning achieves the best result, the best performance.
While MTDNN achieved a two percent improvement over the target dataset finetuning, our approach achieved a six percent improvement.
When we look at the small datasets, we can see that the performance of MTDNN decreases: the improvement from the preliminary multitask finetuning phase decreases to one point five percent.
But our performance increased to eleven percent compared to the target task finetuning alone.
To sum up, FeSTE enables few-shot enrichment from as few as thirty five samples in our experiments.
It uses one architecture for all tasks and datasets.
It keeps the head of the model, but it adds a reformulation phase.
It augments the train set, and it needs target values with semantic meaning so that we can feed them into the language model and use them in the sentence pair classification problem.
Thank you.
