Word frequency is a key variable in psycholinguistics, useful for modeling human familiarity with words even in the era of large language models (LLMs). Frequency in film subtitles has proved to be a particularly good approximation of everyday language exposure. For many languages, however, film subtitles are not easily available, or are overwhelmingly translated from English. We demonstrate that frequencies extracted from carefully processed YouTube subtitles provide an approximation comparable to, and often better than, the best currently available resources. Moreover, they are available for languages for which a high-quality subtitle or speech corpus does not exist. We use YouTube subtitles to construct frequency norms for five diverse languages, Chinese, English, Indonesian, Japanese, and Spanish, and evaluate their correlation with lexical decision time, word familiarity, and lexical complexity. In addition to being strongly correlated with two psycholinguistic variables, a simple linear regression on the new frequencies achieves a new high score on a lexical complexity prediction task in English and Japanese, surpassing both models trained on film subtitle frequencies and the LLM GPT-4. We publicly release our code, the frequency lists, fastText word embeddings, and statistical language models.
Non-English dialogue datasets are scarce, and models are often trained or evaluated on translations of English-language dialogues, an approach which can introduce artifacts that reduce their naturalness and cultural appropriateness. This work proposes Dialogue Act Script (DAS), a structured framework for encoding, localizing, and generating multilingual dialogues from abstract intent representations. Rather than translating dialogue utterances directly, DAS enables the generation of new dialogues in the target language that are culturally and contextually appropriate. By using structured dialogue act representations, DAS supports flexible localization across languages, mitigating translationese and enabling more fluent, naturalistic conversations. Human evaluations across Italian, German, and Chinese show that DAS-generated dialogues consistently outperform those produced by both machine and human translators on measures of cultural relevance, coherence, and situational appropriateness.
Research in low-resource language is often hampered due to the under-representation of how the language is being used in reality. This is particularly true for Indonesian language because there is a limited variety of textual datasets, and majority were acquired from official sources with formal writing style. All the more for the task of geoparsing, which could be implemented for navigation and travel planning applications, such datasets are rare, even in the high-resource languages, such as English. Being aware of the need for a new resource in both languages for this specific task, we constructed a new dataset comprising both Indonesian and English from personal travelogue articles. Our dataset consists of 88 articles, exactly half of them written in each language. We covered both named and nominal expressions of four entity types related to travel: location, facility, transportation, and line. We also conducted experiments by training classifiers to recognise named entities and their nominal expressions. The results of our experiments showed a promising future use of our dataset as we obtained F1-score above 0.9 for both languages.