2025
SMARTMiner: Extracting and Evaluating SMART Goals from Low-Resource Health Coaching Notes
Iva Bojic | Qi Chwen Ong | Stephanie Hilary Xinyi Ma | Lin Ai | Zheng Liu | Ziwei Gong | Julia Hirschberg | Andy Hau Yan Ho | Andy W. H. Khong
Findings of the Association for Computational Linguistics: EMNLP 2025
We present SMARTMiner, a framework for extracting and evaluating specific, measurable, attainable, relevant, time-bound (SMART) goals from unstructured health coaching (HC) notes. Developed in response to challenges observed during a clinical trial, SMARTMiner performs two tasks: (i) extracting behavior change goal spans and (ii) categorizing their SMARTness. We also introduce SMARTSpan, the first publicly available dataset of 173 HC notes annotated with 266 goals and SMART attributes. SMARTMiner couples an extractive goal retriever with a component-wise SMARTness classifier. Experimental results show that extractive models significantly outperformed their generative counterparts in low-resource settings, and that two-stage fine-tuning substantially boosted performance. The SMARTness classifier achieved up to 0.91 SMART F1 score, while the full SMARTMiner pipeline maintained high end-to-end accuracy. This work bridges healthcare, behavioral science, and natural language processing to support health coaches and clients with structured goal tracking, paving the way for automated weekly goal reviews between human-led HC sessions. Both the code and the dataset are available at: https://github.com/IvaBojic/SMARTMiner.
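
The following is a minimal sketch of the two-stage pipeline the abstract describes, written with off-the-shelf HuggingFace pipelines. The model checkpoints, the question prompt, and the zero-shot framing of the SMARTness classifier are illustrative assumptions, not the authors' released implementation (which is in the linked repository).

    # Illustrative two-stage goal-mining sketch (not the authors' released code):
    # (i) extract the goal span, (ii) rate each SMART component independently.
    from transformers import pipeline

    # Stage 1: extractive span retrieval, framed here as question answering.
    # The checkpoint is a generic extractive QA model chosen for illustration.
    goal_retriever = pipeline("question-answering",
                              model="deepset/roberta-base-squad2")

    # Stage 2: component-wise SMARTness classification via zero-shot NLI.
    smartness = pipeline("zero-shot-classification",
                         model="facebook/bart-large-mnli")

    SMART_COMPONENTS = ["specific", "measurable", "attainable",
                        "relevant", "time-bound"]

    def mine_smart_goal(note: str) -> dict:
        """Extract a goal span from a coaching note and score its SMARTness."""
        span = goal_retriever(question="What behavior change goal was set?",
                              context=note)["answer"]
        # Score each SMART attribute independently (component-wise).
        results = {c: smartness(span, candidate_labels=[c, f"not {c}"])
                   for c in SMART_COMPONENTS}
        return {"goal": span,
                "smartness": {c: r["labels"][0] == c
                              for c, r in results.items()}}

    note = "Client will walk 30 minutes daily for the next 4 weeks."
    print(mine_smart_goal(note))

Framing span extraction as extractive QA mirrors the abstract's finding that extractive models outperform generative ones in this low-resource setting.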
2023
Hierarchical Evaluation Framework: Best Practices for Human Evaluation
Iva Bojic | Jessica Chen | Si Yuan Chang | Qi Chwen Ong | Shafiq Joty | Josip Car
Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems
Human evaluation plays a crucial role in Natural Language Processing (NLP), as it assesses the quality and relevance of developed systems and thereby guides their improvement. However, the absence of widely accepted human evaluation metrics in NLP hampers fair comparison among different systems and the establishment of universal assessment standards. Through an extensive analysis of the existing literature on human evaluation metrics, we identified several gaps in NLP evaluation methodologies. These gaps motivated the development of our own hierarchical evaluation framework. The proposed framework offers notable advantages, particularly in providing a more comprehensive representation of an NLP system’s performance. We applied this framework to evaluate a Machine Reading Comprehension system that was used within a human-AI symbiosis model. The results highlighted the association between input quality and output quality, underscoring the need to evaluate both components rather than focusing solely on outputs. In future work, we will investigate the potential time-saving benefits of our proposed framework for evaluators assessing NLP systems.
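
As a toy illustration of the core idea, the sketch below scores inputs and outputs separately and rolls them up hierarchically into one system score. The criterion names, rating scale, and level weights are invented for illustration; they are not the paper's published framework.

    # Toy hierarchical aggregation of human ratings (criteria and weights
    # are illustrative assumptions, not the paper's framework).

    # Per-item ratings on a 1-5 scale, grouped by level of the hierarchy.
    ratings = {
        "input":  {"question_clarity": 4, "question_relevance": 5},
        "output": {"answer_correctness": 3, "answer_fluency": 4},
    }

    # Relative importance of each level when rolling scores up.
    level_weights = {"input": 0.4, "output": 0.6}

    def level_score(criteria: dict) -> float:
        """Average the criterion ratings within one level."""
        return sum(criteria.values()) / len(criteria)

    def system_score(ratings: dict, weights: dict) -> float:
        """Weighted roll-up of level scores into a single system score."""
        return sum(weights[lvl] * level_score(crit)
                   for lvl, crit in ratings.items())

    # Scoring both levels exposes cases where poor inputs, not the
    # model itself, drive poor outputs.
    print(f"system score: {system_score(ratings, level_weights):.2f}")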
A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets
Iva Bojic | Josef Halim | Verena Suharman | Sreeja Tar | Qi Chwen Ong | Duy Phung | Mathieu Ravaut | Shafiq Joty | Josip Car
Proceedings of the Fourth Workshop on Insights from Negative Results in NLP
Low-quality data can cause downstream problems in high-stakes applications. The data-centric approach emphasizes improving dataset quality to enhance model performance. High-quality datasets are needed both for training general-purpose Large Language Models (LLMs) and for domain-specific models, which are usually small because engaging a large number of domain experts to create them is costly. It is therefore vital to ensure high-quality domain-specific training data. In this paper, we propose a framework for enhancing the quality of existing datasets (code and dataset are available at https://github.com/IvaBojic/framework). We applied the proposed framework to four biomedical datasets and showed relative improvements of up to 33% and 40% when fine-tuning retrieval and reader models, respectively, on the BioASQ dataset, using back translation to enhance the original dataset quality.
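
The following is a minimal back-translation sketch using off-the-shelf MarianMT checkpoints. The pivot language and model choices are illustrative assumptions; the authors' actual augmentation code lives in the repository linked above.

    # Minimal back-translation sketch (illustrative; see the linked
    # repository for the authors' actual implementation).
    from transformers import pipeline

    # Pivot through German: English -> German -> English yields paraphrases
    # that can augment the original dataset.
    en_to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
    de_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

    def back_translate(text: str) -> str:
        """Return a paraphrase of `text` via round-trip translation."""
        german = en_to_de(text)[0]["translation_text"]
        return de_to_en(german)[0]["translation_text"]

    original = "Aspirin reduces the risk of heart attack in high-risk patients."
    print(back_translate(original))  # a paraphrased augmentation candidate

Round-trip translation preserves the meaning of a question or passage while varying its surface form, which is how the framework grows higher-quality training pairs from the original dataset.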