Kshitij Jadhav
2025
Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs
Sachin Pawar
|
Manoj Apte
|
Kshitij Jadhav
|
Girish Keshav Palshikar
|
Nitin Ramrakhiyani
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Tokenization is the first step in training any Large Language Model (LLM), where the text is split into a sequence of tokens as per the model’s fixed vocabulary. This tokenization in LLMs is different from the traditional tokenization in NLP where the text is split into a sequence of “natural” words. In LLMs, a natural word may also be broken into multiple tokens due to limited vocabulary size of the LLMs (e.g., Mistral’s tokenizer splits “martial” into “mart” and “ial”). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text for a specific LLM, indicating how “bad” the tokenization is. We establish statistical significance of our hypothesis on multiple NLP tasks for a set of different LLMs.
2023
Replace and Report: NLP Assisted Radiology Report Generation
Kaveri Kale
|
Pushpak Bhattacharyya
|
Kshitij Jadhav
Findings of the Association for Computational Linguistics: ACL 2023
Clinical practice frequently uses medical imaging for diagnosis and treatment. A significant challenge for automatic radiology report generation is that the radiology reports are long narratives consisting of multiple sentences for both abnormal and normal findings. Therefore, applying conventional image captioning approaches to generate the whole report proves to be insufficient, as these are designed to briefly describe images with short sentences. We propose a template-based approach to generate radiology reports from radiographs. Our approach involves the following: i) using a multilabel image classifier, produce the tags for the input radiograph; ii) using a transformer-based model, generate pathological descriptions (a description of abnormal findings seen on radiographs) from the tags generated in step (i); iii) using a BERT-based multi-label text classifier, find the spans in the normal report template to replace with the generated pathological descriptions; and iv) using a rule-based system, replace the identified span with the generated pathological description. We performed experiments with the two most popular radiology report datasets, IU Chest X-ray and MIMIC-CXR and demonstrated that the BLEU-1, ROUGE-L, METEOR, and CIDEr scores are better than the State-of-the-Art models by 25%, 36%, 44% and 48% respectively, on the IU X-RAY dataset. To the best of our knowledge, this is the first attempt to generate chest X-ray radiology reports by first creating small sentences for abnormal findings and then replacing them in the normal report template.
Search
Fix author
Co-authors
- Manoj Apte 1
- Pushpak Bhattacharyya 1
- Kaveri Kale 1
- Girish Keshav Palshikar 1
- Sachin Pawar 1
- show all...