Zhishen Yang


2026

In high-stakes domains such as medicine, ensuring transparency of the training corpus is essential, with careful consideration of local healthcare landscapes; however, the majority of existing medical large language models (LLMs) have not disclosed the details of their training corpora. Here, we introduce an open recipe for domain adaptation of LLMs to the Japanese medical domain. We employed fully open-source Japanese general-domain LLMs as base models, whose pre-training datasets are also disclosed. To establish effective corpora for domain adaptation through continued pre-training, we started with small-scale medical datasets and ultimately constructed a medical corpus of 79.6B tokens, incorporating local clinical guidelines, medical textbooks, and other domain-specific resources. The resulting LLM from continued pre-training, namely SIP-med-llm-8x13B, with an active parameter count of 22B, demonstrated favorable accuracy on benchmarks including the Japanese National Medical Examination. This performance was comparable to that of 70B-parameter open-weight models whose construction details remain undisclosed. This represents the first case in the Japanese medical field where complete corpus details have been disclosed for a model developed fully from scratch, providing important insights for future efforts to construct medical LLMs tailored to the specific characteristics of local contexts. The model is available publicly at this Hugging Face repository: https://huggingface.co/SIP-med-LLM/SIP-jmed-llm-2-8x13b-OP-instruct.

2020

This paper describes the emphasis selection system of the team TextLearner for SemEval 2020 Task 10: Emphasis Selection For Written Text in Visual Media. The system learns the emphasis selection distribution using contextual representations extracted from pre-trained language models and a two-stage ranking model. The experimental results demonstrate the strong contextual representation power of the recent advanced transformer-based language model RoBERTa, which can be exploited with a simple but effective architecture on top.
In this paper, we address the task of news-image captioning, which generates a description of an image given the image and its article body as input. This task is more challenging than conventional image captioning because it requires a joint understanding of image and text. We present a Transformer model that integrates the text and image modalities and attends to textual features from visual features when generating a caption. Experiments based on automatic evaluation metrics and human evaluation show that the article text provides the primary information needed to reproduce news-image captions written by journalists. The results also demonstrate that the proposed model outperforms the state-of-the-art model. In addition, we confirm that visual features contribute to improving the quality of news-image captions.

2019

This paper presents our contextual emotion detection system for the SemEval-2019 shared task 3: EmoContext: Contextual Emotion Detection in Text. The system combines an emotion detection neural network method (Poria et al., 2017), emoji2vec (Eisner et al., 2016) embeddings, word2vec embeddings (Mikolov et al., 2013), and our proposed emoticon and emoji preprocessing method. The experimental results demonstrate the usefulness of our emoticon and emoji preprocessing method, and show that representations of emoticons and emoji contribute to the model's emotion detection.