Tomohiro Nishiyama

2026

The aim of the Social Media Mining for Health Applications and Health Real-World Data (#SMM4H-HeaRD) shared tasks is to foster the development and evaluation of natural language processing, machine learning, and artificial intelligence methods for analyzing health-related text from social media and other real-world data sources. For the 11th iteration, held online and co-located with ACL 2026, the workshop continued the expanded #SMM4H-HeaRD platform initiated in 2025, broadening its scope beyond social media to include additional health real-world data sources such as clinical narratives and biomedical literature. The 8 shared tasks covered diverse data sources, health domains (e.g., adverse drug events, insomnia, influenza vaccine effectiveness, cancer staging, substance use), and task formulations (e.g., classification, named entity recognition, span extraction, and text generation). In total, 110 teams registered, representing 31 countries. In this paper, we present an overview of the datasets, participant systems, and performance results, providing insights into current methods for mining social media and health real-world data for biomedical and clinical applications.


Exploring Novel Drug Research Area using Large Language Models Based on Research Trends in Biomedical Literature
Afnan Afnan | Michael Van Supranes | Tomohiro Nishiyama | Shoko Wakamiya | Eiji Aramaki
BioNLP 2026

The rapid expansion of biomedical literature makes manual identification of novel drug-disease relationships increasingly difficult. Existing approaches have leveraged LLMs to mine abstracts or construct knowledge graphs for drug repurposing. These approaches have two key limitations: finite context windows restrict their ability to capture macro-level research trends, and single-pass black-box pipelines make outputs difficult to verify. This paper proposes a pipeline for discovering new drug targets by combining disease and drug research trends using Large Language Models (LLMs). Our method extracts PICO components from PubMed abstracts, normalizing the Population and Intervention components to ICD and ATC codes, respectively. A temporal frequency delta matrix is constructed to capture shifts in publication counts from 2013 to 2022 and is then used to discover novel drug areas. Compared with the abstract-based baseline, our approach showed qualitative signs of generating combinations that were more closely aligned with observed research trends and, in some cases, more clinically plausible. These findings suggest the potential usefulness of structured trend information for LLM-based exploration, although the differences between the two methods were limited and the results remain preliminary. Future work will focus on validating the consistency and reliability of these candidates.
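As a rough illustration of what a temporal frequency delta matrix could look like, the sketch below counts publications per (ICD code, ATC code) pair in an earlier and a later period and takes the difference. The abstract does not specify how the delta is computed, so the two-period split, the `split_year` parameter, the function name `delta_matrix`, and the sample records are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter

def delta_matrix(records, split_year):
    """Return {(icd, atc): late_count - early_count}.

    records: iterable of (year, icd_code, atc_code) tuples, one per abstract.
    split_year: hypothetical boundary; years < split_year count as "early".
    """
    early, late = Counter(), Counter()
    for year, icd, atc in records:
        target = early if year < split_year else late
        target[(icd, atc)] += 1
    # Counter returns 0 for pairs missing from one period.
    pairs = set(early) | set(late)
    return {pair: late[pair] - early[pair] for pair in pairs}

# Illustrative records only; not taken from the paper's data.
records = [
    (2014, "E11", "A10B"),
    (2015, "E11", "A10B"),
    (2019, "E11", "A10B"),
    (2020, "E11", "A10B"),
    (2021, "E11", "A10B"),
    (2022, "E11", "A10B"),
    (2014, "I10", "C09A"),
    (2020, "C50", "L01F"),
]

deltas = delta_matrix(records, split_year=2018)
# Pair whose publication count grew the most between the two periods.
top_pair = max(deltas, key=deltas.get)
```

A positive entry marks a disease-drug pair with growing publication activity, and a negative entry marks a declining one. A real pipeline would likely normalize by total publication volume per period, which this sketch omits.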

2025


ARxHYOKA at TAQEEM2025: Comparative Approaches to Arabic Essay Trait Scoring
Mohamad Alnajjar | Ahmad Almoustafa | Tomohiro Nishiyama | Shoko Wakamiya | Eiji Aramaki | Takuya Matsuzaki
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

2024


User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.


Assessing Authenticity and Anonymity of Synthetic User-generated Content in the Medical Domain
Tomohiro Nishiyama | Lisa Raithel | Roland Roller | Pierre Zweigenbaum | Eiji Aramaki
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)

Since medical text cannot be shared easily due to privacy concerns, synthetic data bears much potential for natural language processing applications. In the context of social media and user-generated messages about drug intake and adverse drug effects, this work presents different methods to examine the authenticity of synthetic text. We conclude that the generated tweets are untraceable and show enough authenticity from the medical point of view to be used as a replacement for a real Twitter corpus. However, original data might still be the preferred choice as they contain much more diversity.
