2023
Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension
Nuo Chen | Hongguang Li | Junqing He | Yinan Bao | Xinshi Lin | Qi Yang | Jianfeng Liu | Ruyi Gan | Jiaxing Zhang | Baoyuan Wang | Jia Li
Findings of the Association for Computational Linguistics: EMNLP 2023
The conversational machine reading comprehension (CMRC) task aims to answer questions in conversations and has been an active research topic in recent years because of its wide applications. However, existing CMRC benchmarks, in which each conversation is assigned a static passage, are inconsistent with real scenarios. As a result, a model’s comprehension ability in real scenarios is hard to evaluate reasonably. To this end, we propose Orca, the first Chinese CMRC benchmark, and further provide zero-shot and few-shot settings to evaluate a model’s ability to generalize across diverse domains. We collect 831 hot-topic-driven conversations with 4,742 turns in total. Each turn of a conversation is assigned a response-related passage, with the aim of evaluating a model’s comprehension ability more reasonably. The conversation topics are collected from a social media platform and cover 33 domains, so as to be consistent with real scenarios. Importantly, the answers in Orca are all well-annotated natural responses rather than the specific spans or short phrases used in previous datasets. In addition, we implement three strong baselines to tackle the challenge posed by Orca. The results indicate that our CMRC benchmark is highly challenging.
2018
Discriminating between Similar Languages on Imbalanced Conversational Texts
Junqing He | Xian Huang | Xuemin Zhao | Yan Zhang | Yonghong Yan
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
HCCL at SemEval-2017 Task 2: Combining Multilingual Word Embeddings and Transliteration Model for Semantic Similarity
Junqing He | Long Wu | Xuemin Zhao | Yonghong Yan
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
In this paper, we introduce an approach that combines word embeddings and machine translation for multilingual semantic word similarity, Task 2 of SemEval-2017. Thanks to the unsupervised transliteration model, our cross-lingual word embeddings encounter fewer out-of-vocabulary (OOV) words. Our results are produced using only monolingual Wikipedia corpora and a limited amount of sentence-aligned data. Although relatively few resources are used, our system ranked 3rd in the monolingual subtask and can rank 6th in the cross-lingual subtask.