Jianglong He
2024
Infrrd.ai at SemEval-2024 Task 7: RAG-based end-to-end training to generate headlines and numbers
Jianglong He | Saiteja Tallam | Srirama Nakshathri | Navaneeth Amarnath | Pratiba Kr | Deepak Kumar
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
We propose a training algorithm based on retrieval-augmented generation (RAG) to obtain the training samples most similar to a given input. The retrieved samples are used as references to perform contextual-learning-based fine-tuning of large language models (LLMs). We use the proposed method to generate headlines and extract numerical values from unstructured text. Models are made aware of the numbers present in the unstructured text through Extensible Markup Language (XML) tags specifically designed to capture them. The headlines of the unstructured text are preprocessed to wrap the numbers before being presented to the model. A set of mathematical operations is also passed as a reference to support a chain-of-thought (CoT) approach, so the model can compute the final value of a given mathematical operation. As a post-processing step, we validate whether the numerical value calculated by the model is correct. This automatic validation of numbers in the generated headline helped our system achieve the best results in human evaluation among the participating methods.
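The number-handling pipeline described above lends itself to a short illustration. Below is a minimal sketch, assuming a hypothetical `<num>` tag and a stand-in set of arithmetic operations; the paper's exact XML schema and operation list are not reproduced here. Numbers in the input are wrapped in XML-style tags, and numbers in a generated headline are validated against values derivable from the source text.

```python
import re
from itertools import permutations

NUM_TAG = "num"  # hypothetical tag name; the paper's exact schema is not given
NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def wrap_numbers(text: str) -> str:
    """Wrap each numeric token in XML-style tags so the model is
    explicitly made aware of the numbers present in the text."""
    return NUM_RE.sub(lambda m: f"<{NUM_TAG}>{m.group(0)}</{NUM_TAG}>", text)

def headline_numbers_valid(headline: str, source: str) -> bool:
    """Post-processing check: every number in the generated headline must
    either occur in the source text or be derivable from two source
    numbers via a basic arithmetic operation (a stand-in for the
    paper's set of reference operations)."""
    src = [float(n) for n in NUM_RE.findall(source)]
    derivable = set(src)
    for a, b in permutations(src, 2):
        derivable.update({a + b, a - b, a * b})
        if b != 0:
            derivable.add(a / b)
    return all(
        any(abs(float(n) - d) < 1e-6 for d in derivable)
        for n in NUM_RE.findall(headline)
    )

print(wrap_numbers("Sales rose 12.5 percent to 40 million"))
print(headline_numbers_valid("Sales up 12.5%", "Sales rose 12.5 percent"))
```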
2022
Infrrd.ai at SemEval-2022 Task 11: A system for named entity recognition using data augmentation, transformer-based sequence labeling model, and EnsembleCRF
Jianglong He | Akshay Uppal | Mamatha N | Shiv Vignesh | Deepak Kumar | Aditya Kumar Sarda
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
In low-resource languages, the amount of training data is limited, so the model must perform well on unseen sentences and syntax it has not been trained on. We propose a method that addresses this problem through an encoder and an ensemble of language models. A language-specific language model performed poorly compared to a multilingual language model, so the multilingual checkpoint is fine-tuned to a specific language. A novel one-hot encoder is introduced between the model outputs and the CRF to combine the results in an ensemble. Our team, Infrrd.ai, competed in the MultiCoNER competition, and the results are encouraging: the team placed within the top 10, with less than a 4% difference from the third position in most of the tracks we participated in. The proposed method shows that an ensemble of models built on a multilingual language model, with the help of an encoder, performs better than a single language-specific model.
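To make the one-hot ensemble idea concrete, here is a minimal sketch assuming a hypothetical label set and two toy models: each model's per-token prediction is one-hot encoded, and the encodings are concatenated into a feature vector that a CRF could then consume to arbitrate the ensemble. The label names, model outputs, and feature layout are illustrative, not taken from the paper.

```python
import numpy as np

# Hypothetical label inventory; the actual MultiCoNER tag set is larger.
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
LABEL_TO_ID = {label: i for i, label in enumerate(LABELS)}

def one_hot_ensemble_features(per_model_predictions):
    """Combine per-token label predictions from several sequence-labeling
    models into one feature vector per token: each model's prediction is
    one-hot encoded and the encodings are concatenated. The resulting
    matrix is the kind of input a CRF layer could consume to produce the
    final ensemble decision (a sketch of the encoder-plus-CRF idea, not
    the paper's exact implementation)."""
    n_models = len(per_model_predictions)
    seq_len = len(per_model_predictions[0])
    feats = np.zeros((seq_len, n_models * len(LABELS)))
    for m, preds in enumerate(per_model_predictions):
        for t, label in enumerate(preds):
            feats[t, m * len(LABELS) + LABEL_TO_ID[label]] = 1.0
    return feats

# The two models disagree on the second token; the CRF would arbitrate.
model_a = ["B-PER", "I-PER", "O"]
model_b = ["B-PER", "O", "O"]
print(one_hot_ensemble_features([model_a, model_b]))
```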