Yunita Sari

2026

This paper introduces AnnoHID, a semi-automated annotation framework designed for medical texts in low-resource languages. The system integrates large language models (LLMs) for pre-annotation and human validation to support efficient and consistent annotation. We demonstrate its application to Bahasa Indonesia medical social media texts from Alodokter, a medical Q&A platform, for Named Entity Recognition (NER) and Medical Concept Normalization (MCN). We conducted a user study with 21 participants to demonstrate the effectiveness of AnnoHID. The results show that LLM-assisted annotation yields higher inter-annotator agreement for both NER (𝜅 = 0.76) and MCN (𝜅 = 0.63) and that human review improves raw LLM NER output, raising the F1 score from 0.39 to 0.45. However, LLM assistance did not reduce annotation time and may introduce normalization bias in MCN. The framework is multilingual, human-in-the-loop, and interoperable with standard medical terminologies, such as SNOMED-CT. Future work focuses on mitigating pre-annotation bias, reducing annotation overhead, and scaling evaluations to larger datasets and additional low-resource languages.

2024

Climate-NLI: A Model for Natural Language Inference and Zero-Shot Classification on Climate-Related Text
Faturahman Yudanto | Yunita Sari | Maeve Zahwa Adriana Crown Zaki
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

CIKMar: A Dual-Encoder Approach to Prompt-Based Reranking in Educational Dialogue Systems
Joanito Agili Lopo | Marina Indah Prasasti | Alma Permatasari | Yunita Sari
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

2018

Topic or Style? Exploring the Most Useful Features for Authorship Attribution
Yunita Sari | Mark Stevenson | Andreas Vlachos
Proceedings of the 27th International Conference on Computational Linguistics

Approaches to authorship attribution, the task of identifying the author of a document, are based on analysis of individuals’ writing style and/or preferred topics. Although the problem has been widely explored, no previous studies have analysed the relationship between dataset characteristics and effectiveness of different types of features. This study carries out an analysis of four widely used datasets to explore how different types of features affect authorship attribution accuracy under varying conditions. The results of the analysis are applied to authorship attribution models based on both discrete and continuous representations. We apply the conclusions from our analysis to an extension of an existing approach to authorship attribution and outperform the prior state-of-the-art on two out of the four datasets used.

2017

A Shallow Neural Network for Native Language Identification with Character N-grams
Yunita Sari | Muhammad Rifqi Fatchurrahman | Meisyarah Dwiastuti
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

This paper describes the systems submitted by the GadjahMada team to the Native Language Identification (NLI) Shared Task 2017. Our models use a continuous representation of character n-grams, which is learned jointly with a feed-forward neural network classifier. Character n-grams have proven effective for style-based identification tasks, including NLI. Results on the test set demonstrate that the proposed model performs well on the essay and fusion tracks, obtaining more than 0.8 on both F-macro score and accuracy.

Continuous N-gram Representations for Authorship Attribution
Yunita Sari | Andreas Vlachos | Mark Stevenson
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper presents work on using continuous representations for authorship attribution. In contrast to previous work, which uses discrete feature representations, our model learns continuous representations for n-gram features via a neural network jointly with the classification layer. Experimental results demonstrate that the proposed model outperforms the state-of-the-art on two datasets, while producing comparable results on the remaining two.
