K. Vijay-Shanker

Also published as: K Vijay-Shanker, K. Vijay-Shankar, Vijay Shanker

2023

pdf abs
ArTrivia: Harvesting Arabic Wikipedia to Build A New Arabic Question Answering Dataset
Sultan Alrowili | K Vijay-Shanker
Proceedings of ArabicNLP 2023

We present ArTrivia, a new Arabic question-answering dataset consisting of more than 10,000 question-answer pairs along with relevant passages, covering a wide range of 18 diverse topics in Arabic. We created our dataset using a newly proposed pipeline that leverages diverse structured data sources from Arabic Wikipedia. Moreover, we conducted a comprehensive statistical analysis of ArTrivia and assessed the performance of each component in our pipeline. Additionally, we compared the performance of ArTrivia against the existing TyDi QA dataset using various experimental setups. Our analysis highlights the significance of often overlooked aspects in dataset creation, such as answer normalization, in enhancing the quality of QA datasets. Our evaluation also shows that ArTrivia presents more challenging and out-of-distribution questions to TyDi, raising questions about the feasibility of using ArTrivia as a complementary dataset to TyDi.

2022

In this paper, we present the results and findings of the Shared Task on Gender Rewriting, which was organized as part of the Seventh Arabic Natural Language Processing Workshop. The task of gender rewriting refers to generating alternatives of a given sentence to match different target user gender contexts (e.g., a female speaker with a male listener, a male speaker with a male listener, etc.). This requires changing the grammatical gender (masculine or feminine) of certain words referring to the users. In this task, we focus on Arabic, a gender-marking morphologically rich language. A total of five teams from four countries participated in the shared task.

pdf abs
Generative Approach for Gender-Rewriting Task with ArabicT5
Sultan Alrowili | Vijay Shanker
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Addressing the correct gender in generative tasks (e.g., Machine Translation) has been an overlooked issue in the Arabic NLP. However, the recent introduction of the Arabic Parallel Gender Corpus (APGC) dataset has established new baselines for the Arabic Gender Rewriting task. To address the Gender Rewriting task, we first pre-train our new Seq2Seq ArabicT5 model on a 17GB of Arabic Corpora. Then, we continue pre-training our ArabicT5 model on the APGC dataset using a newly proposed method. Our evaluation shows that our ArabicT5 model, when trained on the APGC dataset, achieved competitive results against existing state-of-the-art methods. In addition, our ArabicT5 model shows better results on the APGC dataset compared to other Arabic and multilingual T5 models.

2021

pdf bib abs
Improving BERT Model Using Contrastive Learning for Biomedical Relation Extraction
Peng Su | Yifan Peng | K. Vijay-Shanker
Proceedings of the 20th Workshop on Biomedical Language Processing

Contrastive learning has been used to learn a high-quality representation of the image in computer vision. However, contrastive learning is not widely utilized in natural language processing due to the lack of a general method of data augmentation for text data. In this work, we explore the method of employing contrastive learning to improve the text representation from the BERT model for relation extraction. The key knob of our framework is a unique contrastive pre-training step tailored for the relation extraction tasks by seamlessly integrating linguistic knowledge into the data augmentation. Furthermore, we investigate how large-scale data constructed from the external knowledge bases can enhance the generality of contrastive pre-training of BERT. The experimental results on three relation extraction benchmark datasets demonstrate that our method can improve the BERT model representation and achieve state-of-the-art performance. In addition, we explore the interpretability of models by showing that BERT with contrastive pre-training relies more on rationales for prediction. Our code and data are publicly available at: https://github.com/AnonymousForNow.

pdf abs
BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA
Sultan Alrowili | Vijay Shanker
Proceedings of the 20th Workshop on Biomedical Language Processing

The impact of design choices on the performance of biomedical language models recently has been a subject for investigation. In this paper, we empirically study biomedical domain adaptation with large transformer models using different design choices. We evaluate the performance of our pretrained models against other existing biomedical language models in the literature. Our results show that we achieve state-of-the-art results on several biomedical domain tasks despite using similar or less computational cost compared to other models in the literature. Our findings highlight the significant effect of design choices on improving the performance of biomedical language models.

pdf abs
ArabicTransformer: Efficient Large Arabic Language Model with Funnel Transformer and ELECTRA Objective
Sultan Alrowili | Vijay Shanker
Findings of the Association for Computational Linguistics: EMNLP 2021

Pre-training Transformer-based models such as BERT and ELECTRA on a collection of Arabic corpora, demonstrated by both AraBERT and AraELECTRA, shows an impressive result on downstream tasks. However, pre-training Transformer-based language models is computationally expensive, especially for large-scale models. Recently, Funnel Transformer has addressed the sequential redundancy inside Transformer architecture by compressing the sequence of hidden states, leading to a significant reduction in the pre-training cost. This paper empirically studies the performance and efficiency of building an Arabic language model with Funnel Transformer and ELECTRA objective. We find that our model achieves state-of-the-art results on several Arabic downstream tasks despite using less computational resources compared to other BERT-based models.

2017

pdf abs
Noise Reduction Methods for Distantly Supervised Biomedical Relation Extraction
Gang Li | Cathy Wu | K. Vijay-Shanker
BioNLP 2017

Distant supervision has been applied to automatically generate labeled data for biomedical relation extraction. Noise exists in both positively and negatively-labeled data and affects the performance of supervised machine learning methods. In this paper, we propose three novel heuristics based on the notion of proximity, trigger word and confidence of patterns to leverage lexical and syntactic information to reduce the level of noise in the distantly labeled data. Experiments on three different tasks, extraction of protein-protein-interaction, miRNA-gene regulation relation and protein-localization event, show that the proposed methods can improve the F-score over the baseline by 6, 10 and 14 points for the three tasks, respectively. We also show that when the models are configured to output high-confidence results, high precisions can be obtained using the proposed methods, making them promising for facilitating manual curation for databases.

pdf abs
Identifying Comparative Structures in Biomedical Text
Samir Gupta | A.S.M. Ashique Mahmood | Karen Ross | Cathy Wu | K. Vijay-Shanker
BioNLP 2017

Comparison sentences are very commonly used by authors in biomedical literature to report results of experiments. In such comparisons, authors typically make observations under two different scenarios. In this paper, we present a system to automatically identify such comparative sentences and their components i.e. the compared entities, the scale of the comparison and the aspect on which the entities are being compared. Our methodology is based on dependencies obtained by applying a parser to extract a wide range of comparison structures. We evaluated our system for its effectiveness in identifying comparisons and their components. The system achieved a F-score of 0.87 for comparison sentence identification and 0.77-0.81 for identifying its components.

The accuracy of statistical parsing models can be improved with the use of lexical information. Statistical parsing using Lexicalized tree adjoining grammar (LTAG), a kind of lexicalized grammar, has remained relatively unexplored. We believe that is largely in part due to the absence of large corpora accurately bracketed in terms of a perspicuous yet broad coverage LTAG. Our work attempts to alleviate this difficulty. We extract different LTAGs from the Penn Treebank. We show that certain strategies yield an improved extracted LTAG in terms of compactness, broad coverage, and supertagging accuracy. Furthermore, we perform a preliminary investigation in smoothing these grammars by means of an external linguistic resource, namely, the tree families of an XTAG grammar, a hand built grammar of English.

1998

pdf
Dialogue Act Tagging with Transformation-Based Learning
Ken Samuel | Sandra Carberry | K. Vijay-Shanker
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

pdf
Dialogue Act Tagging with Transformation-Based Learning
Ken Samuel | Sandra Carberry | K. Vijay-Shanker
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

pdf bib
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)
Anne Abeillé | Tilman Becker | Giorgio Satta | K. Vijay-Shanker
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)

pdf
Motion verbs and semantic features in TAG
Tonia Bleam | Martha Palmer | K. Vijay-Shanker
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)

pdf
TAG derivation as monotonic C-command
Robert Frank | K. Vijay-Shanker
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)

pdf
Wh-islands in TAG and related formalisms
Owen Rambow | K. Vijay-Shanker
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)

pdf
Consistent grammar development using partial-tree descriptions for Lexicalized Tree-Adjoining Grammars
Fei Xia | Martha Palmer | K. Vijay-Shanker | Joseph Rosenzweig
Proceedings of the Fourth International Workshop on Tree Adjoining Grammars and Related Frameworks (TAG+4)

1997

pdf abs
Towards a Reduced Commitment, D-Theory Style TAG Parser
John Chen | K. Vijay-Shankar
Proceedings of the Fifth International Workshop on Parsing Technologies

Many traditional TAG parsers handle ambiguity by considering all of the possible choices as they unfold during parsing. In contrast , D-theory parsers cope with ambiguity by using underspecified descriptions of trees. This paper introduces a novel approach to parsing TAG, namely one that explores how D-theoretic notions may be applied to TAG parsing. Combining the D-theoretic approach to TAG parsing as we do here raises new issues and problems. D-theoretic underspecification is used as a novel approach in the context of TAG parsing for delaying attachment decisions. Conversely, the use of TAG reveals the need for additional types of underspecification that have not been considered so far in the D-theoretic framework. These include combining sets of trees into their underspecified equivalents as well as underspecifying combinations of trees. In this paper, we examine various issues that arise in this new approach to TAG parsing and present solutions to some of the problems. We also describe other issues which need to be resolved for this method of parsing to be implemented.