Hibiki Nakatani
2026
A Large-Scale Dataset for Linking-Based Geocoding
Hibiki Nakatani | Yuichiro Yasui | Ryosuke Wakamoto | Masayuki Ishii | Tetsuhisa Suizu | Hiroki Ouchi | Taro Watanabe
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Linking-based geocoding is the task of linking location mentions in text to their corresponding entries in a geographic database (Geo-DB) and assigning precise coordinates. Although the task and its technology are essential for spatial information extraction, existing datasets are manually curated and lack sufficient data for training accurate models. To address this limitation, we automatically construct a large-scale dataset for linking-based geocoding by leveraging publicly available resources to generate data efficiently at scale. Specifically, we align location mentions in the first paragraphs of Japanese Wikipedia articles with their associated Wikidata entries containing geographic attributes. Wikipedia provides natural textual contexts, while Wikidata offers structured data such as coordinates, place types, and administrative divisions, which can serve as rich metadata for future extensions. Our experiments show that models trained on our dataset achieve strong performance not only on in-domain data, i.e., Wikipedia, but also on out-of-domain newspaper articles, and further confirm that hard negative mining substantially improves disambiguation among confusable candidates. Although the dataset focuses on Japanese, the construction method is language-agnostic and can be extended to other languages with sufficient Wikipedia and Wikidata coverage.
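The hard negative mining the abstract credits with improved disambiguation can be illustrated with a minimal sketch: given an embedding of a location mention and embeddings of geo-database entries, the most similar non-gold entries become extra negatives for contrastive training. All names here (`mine_hard_negatives`, the toy entry IDs) are illustrative, not taken from the paper.

```python
# Hypothetical sketch of hard negative mining for linking-based geocoding.
# Function and variable names are illustrative, not the paper's.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mine_hard_negatives(mention_vec, entries, gold_id, k=2):
    """Return the k non-gold entries most similar to the mention embedding.

    These "hard negatives" are confusable candidates (e.g. same-named places
    in different prefectures) used as extra negatives in contrastive training.
    """
    scored = [
        (cosine(mention_vec, vec), entry_id)
        for entry_id, vec in entries.items()
        if entry_id != gold_id
    ]
    scored.sort(reverse=True)
    return [entry_id for _, entry_id in scored[:k]]

# Toy example: three database entries, one gold.
entries = {
    "Q1": [1.0, 0.0],   # gold entry
    "Q2": [0.9, 0.1],   # very similar -> hard negative
    "Q3": [0.0, 1.0],   # dissimilar -> easy negative
}
print(mine_hard_negatives([1.0, 0.05], entries, gold_id="Q1", k=1))  # ['Q2']
```

In a real setup the embeddings would come from the trained text encoder and mining would be repeated as the model improves, but the candidate-selection logic is the same.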
2025
Reliability of Distribution Predictions by LLMs: Insights from Counterintuitive Pseudo-Distributions
Toma Suzuki | Ayuki Katayama | Seiji Gobara | Ryo Tsujimoto | Hibiki Nakatani | Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
The proportion of responses to a question and its options, known as the response distribution, enables detailed analysis of human society. Recent studies highlight the use of Large Language Models (LLMs) for predicting response distributions as a cost-effective survey method. However, the reliability of these predictions remains unclear. LLMs often generate answers by blindly following instructions rather than applying rational reasoning based on pretraining-acquired knowledge. This study investigates whether LLMs can rationally estimate distributions when presented with explanations of “artificially generated distributions” that run counter to common sense. Specifically, we assess whether LLMs recognize counterintuitive explanations and adjust their predictions, or simply follow these inconsistent explanations. Results indicate that smaller or less human-optimized LLMs tend to follow explanations uncritically, while larger or more optimized models are better at resisting counterintuitive explanations by leveraging their pretraining-acquired knowledge. These findings shed light on factors influencing distribution prediction performance in LLMs and are crucial for developing reliable distribution predictions using language models.
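The core measurement question here, whether a model's predicted distribution tracks the counterintuitive pseudo-distribution it was shown or stays near a commonsense prior, can be sketched with a simple distance comparison. The distributions, the use of total variation distance, and the decision rule are all illustrative assumptions, not the paper's exact protocol.

```python
# Hypothetical sketch: did the model follow a counterintuitive
# pseudo-distribution shown in the prompt, or stick to its prior?
# All numbers and the decision rule are illustrative assumptions.

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

commonsense_prior = [0.7, 0.2, 0.1]    # what pretraining knowledge suggests
pseudo_dist       = [0.1, 0.2, 0.7]    # counterintuitive distribution in prompt
prediction        = [0.15, 0.2, 0.65]  # hypothetical model output

# A prediction much closer to the pseudo-distribution than to the prior
# suggests the model followed the counterintuitive explanation uncritically.
followed = total_variation(prediction, pseudo_dist) < total_variation(prediction, commonsense_prior)
print(followed)  # True
```

Aggregating this comparison over many questions would give a per-model tendency score of the kind the abstract contrasts across model sizes.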
A Text Embedding Model with Contrastive Example Mining for Point-of-Interest Geocoding
Hibiki Nakatani | Hiroki Teranishi | Shohei Higashiyama | Yuya Sawada | Hiroki Ouchi | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics
Geocoding is a fundamental technique that links location mentions to their geographic positions, which is important for understanding texts in terms of where the described events occurred. Unlike most geocoding studies, which have targeted coarse-grained locations, we focus on geocoding at a fine-grained point-of-interest (POI) level. To address the challenge of finding appropriate geo-database entries from among many candidates with similar POI names, we develop a text embedding-based geocoding model and investigate (1) entry encoding representations and (2) hard negative mining approaches suitable for enhancing the model’s disambiguation ability. Our experiments show that the second factor significantly impacts the geocoding accuracy of the model.
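The first factor the abstract mentions, entry encoding representations, amounts to choosing how a geo-database entry is serialized into text before embedding. The sketch below shows one plausible scheme; the field names and format are illustrative assumptions, not the representation the paper actually uses.

```python
# Hypothetical sketch of an entry encoding representation for POI geocoding.
# The fields and formatting are illustrative, not the paper's actual scheme.

def encode_entry(entry):
    """Serialize a POI entry as text: name plus disambiguating attributes.

    Appending the place type and administrative division helps the embedding
    model separate entries that share the same POI name.
    """
    return f"{entry['name']} ({entry['type']}), {entry['prefecture']}"

poi = {"name": "Chuo Park", "type": "park", "prefecture": "Osaka"}
print(encode_entry(poi))  # Chuo Park (park), Osaka
```

The serialized string would then be embedded with the same text encoder used for mention contexts, so that mention-entry similarity can be computed in one vector space.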