Yu Chen

2022

基于情感增强非参数模型的社交媒体观点聚类(A Sentiment Enhanced Nonparametric Model for Social Media Opinion Clustering)
Kan Liu (刘勘) | Yu Chen (陈昱) | Jiarui He (何佳瑞)
Proceedings of the 21st Chinese National Conference on Computational Linguistics

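This paper aims to use text clustering techniques to group social media texts by the opinions their authors advocate, giving an intuitive picture of the different stances held by online communities. Because social media texts have complex patterns and rich sentiment, we propose a sentiment distribution enhancement method that improves an existing nonparametric short-text clustering algorithm. The method models text sentiment with Gaussian distributions, capturing sentiment features while automatically determining the number of clusters and performing opinion clustering. Experiments on public datasets show that the method outperforms existing models on several clustering metrics, with a more pronounced advantage on datasets that are more subjective.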

2021

Deep Learning on Graphs for Natural Language Processing
Lingfei Wu | Yu Chen | Heng Ji | Yunyao Li
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials

Due to their great power in modeling non-Euclidean data such as graphs or manifolds, deep learning on graph techniques (i.e., Graph Neural Networks (GNNs)) have opened a new door to solving challenging graph-related NLP problems. There has been a surge of interest in applying deep learning on graph techniques to NLP, and these techniques have achieved considerable success in many NLP tasks, ranging from classification tasks like sentence classification, semantic role labeling and relation extraction, to generation tasks like machine translation, question generation and summarization. Despite these successes, deep learning on graphs for NLP still faces many challenges, including automatically transforming original text sequence data into highly graph-structured data, and effectively modeling complex data that involves mapping between graph-based inputs and other highly structured output data such as sequences, trees, and graph data with multiple types of nodes and edges. This tutorial will cover relevant and interesting topics on applying deep learning on graph techniques to NLP, including automatic graph construction for NLP, graph representation learning for NLP, advanced GNN-based models (e.g., graph2seq, graph2tree, and graph2graph) for NLP, and the applications of GNNs in various NLP tasks (e.g., machine translation, natural language generation, information extraction and semantic parsing). In addition, hands-on demonstration sessions will be included to help the audience gain practical experience in applying GNNs to solve challenging NLP problems using our recently developed open source library, Graph4NLP, the first library designed to make GNNs easy to use for researchers and practitioners across various NLP tasks.

Are Language-Agnostic Sentence Representations Actually Language-Agnostic?
Yu Chen | Tania Avgustinova
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

With the emergence of pre-trained multilingual models, multilingual embeddings have been widely applied in various natural language processing tasks. Language-agnostic models provide a versatile way to convert linguistic units from different languages into a shared vector representation space. The relevant work on multilingual sentence embeddings has reportedly reached low error rates in cross-lingual similarity search tasks. In this paper, we apply the pre-trained embedding models and the cross-lingual similarity search task in diverse scenarios, and observe a large discrepancy in results compared to the original paper. Our findings on cross-lingual similarity search with different newly constructed multilingual datasets show not only correlation with observable language similarities but also strong influence from factors such as translation paths, which limits the interpretation of the language-agnostic property of the LASER model.

2019

Bidirectional Attentive Memory Networks for Question Answering over Knowledge Bases
Yu Chen | Lingfei Wu | Mohammed J. Zaki
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

When answering natural language questions over knowledge bases (KBs), different question components and KB aspects play different roles. However, most existing embedding-based methods for knowledge base question answering (KBQA) ignore the subtle inter-relationships between the question and the KB (e.g., entity types, relation paths and context). In this work, we propose to directly model the two-way flow of interactions between the questions and the KB via a novel Bidirectional Attentive Memory Network, called BAMnet. Requiring no external resources and only very few hand-crafted features, on the WebQuestions benchmark, our method significantly outperforms existing information-retrieval based methods, and remains competitive with (hand-crafted) semantic parsing based methods. Also, since we use attention mechanisms, our method offers better interpretability compared to other baselines.

Machine Translation from an Intercomprehension Perspective
Yu Chen | Tania Avgustinova
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Within the first shared task on machine translation between similar languages, we present our first attempts at Czech-to-Polish machine translation from an intercomprehension perspective. We propose methods based on the mutual intelligibility of the two languages, taking advantage of their orthographic and phonological similarity, in the hope of improving over our baselines. The translation results are evaluated using BLEU. On this metric, none of our proposals could outperform the baselines on the final test set. The current setups are rather preliminary, and there are several potential improvements we can try in the future.

2012

Joint Grammar and Treebank Development for Mandarin Chinese with HPSG
Yi Zhang | Rui Wang | Yu Chen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present the ongoing development of MCG, a linguistically deep and precise grammar for Mandarin Chinese together with its accompanying treebank, both based on the linguistic framework of HPSG, and using MRS as the semantic representation. We highlight some key features of our grammar design, and review a number of challenging phenomena, with comparisons to alternative linguistic treatments and implementations. One of the distinguishing characteristics of our approach is the tight integration of grammar and treebank development. The two-step treebank annotation procedure benefits from the efficiency of the discriminant-based annotation approach, while giving the annotators full freedom to produce extra-grammatical structures. This not only allows the creation of a precise and full-coverage treebank with an imperfect grammar, but also provides prompt feedback for grammarians to identify the errors in the grammar design and implementation. Preliminary evaluation and error analysis show that the grammar already covers most of the core phenomena for Mandarin Chinese, and the treebank annotation procedure reaches a stable speed of 35 sentences per hour with satisfactory quality.

MultiUN v2: UN Documents with Multilingual Alignments
Yu Chen | Andreas Eisele
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

MultiUN is a multilingual parallel corpus extracted from the official documents of the United Nations. It is available in the six official languages of the UN and a small portion of it is also available in German. This paper presents a major update to the first public version of the corpus released in 2010. This version 2 consists of over 513,091 documents, including more than 9% of new documents retrieved from the United Nations official document system. We applied several modifications to the corpus preparation method. In this paper, we describe the methods we used for processing the UN documents and aligning the sentences. The most significant improvement compared to the previous release is the newly added multilingual sentence alignment information. The alignment information is encoded together with the text in XML instead of in additional files. Our representation of the sentence alignment allows quick construction of aligned texts parallel in an arbitrary number of languages, which is essential for building machine translation systems.

Machine Learning for Hybrid Machine Translation
Sabine Hunsicker | Yu Chen | Christian Federmann
Proceedings of the Seventh Workshop on Statistical Machine Translation

Combining Social Cognitive Theories with Linguistic Features for Multi-genre Sentiment Analysis
Hao Li | Yu Chen | Heng Ji | Smaranda Muresan | Dequan Zheng
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

2011

Statistical Machine Transliteration with Multi-to-Multi Joint Source Channel Model
Yu Chen | Rui Wang | Yi Zhang
Proceedings of the 3rd Named Entities Workshop (NEWS 2011)

Engineering a Deep HPSG for Mandarin Chinese
Yi Zhang | Rui Wang | Yu Chen
Proceedings of the 9th Workshop on Asian Language Resources

2010

Using Deep Belief Nets for Chinese Named Entity Categorization
Yu Chen | You Ouyang | Wenjie Li | Dequan Zheng | Tiejun Zhao
Proceedings of the 2010 Named Entities Workshop

Exploring Deep Belief Network for Chinese Relation Extraction
Yu Chen | Wenjie Li | Yan Liu | Dequan Zheng | Tiejun Zhao
CIPS-SIGHAN Joint Conference on Chinese Language Processing

Hierarchical Hybrid Translation between English and German
Yu Chen | Andreas Eisele
Proceedings of the 14th Annual conference of the European Association for Machine Translation

MultiUN: A Multilingual Corpus from United Nation Documents
Andreas Eisele | Yu Chen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the acquisition, preparation and properties of a corpus extracted from the official documents of the United Nations (UN). This corpus is available in all 6 official languages of the UN, consisting of around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. This corpus also includes a common test set for machine translation. We present the results of a French-Chinese machine translation experiment performed on this corpus.

Integrating a Rule-based with a Hierarchical Translation System
Yu Chen | Andreas Eisele
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Recent developments in hybrid systems that combine rule-based machine translation (RBMT) systems with statistical machine translation (SMT) generally neglect the fact that RBMT systems tend to produce more syntactically well-formed translations than data-driven systems. This paper proposes a method that alleviates this issue by preserving more useful structures produced by RBMT systems and utilizing them in an SMT system that operates on hierarchical structures instead of flat phrases alone. For our experiments, we use Joshua as the decoder. It is the first attempt towards a tighter integration of MT systems from different paradigms that both support hierarchical analysis. Preliminary results show consistent improvements over the previous approach.

2009

Intersecting Multilingual Data for Faster and Better Statistical Translations
Yu Chen | Martin Kay | Andreas Eisele
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

Improving Statistical Machine Translation Efficiency by Triangulation
Yu Chen | Andreas Eisele | Martin Kay
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In current phrase-based Statistical Machine Translation systems, more training data is generally better than less. However, a larger data set eventually introduces a larger model that enlarges the search space for the decoder, and consequently requires more time and more resources to translate. This paper describes an attempt to reduce the model size by filtering out the less probable entries based on testing correlation using additional training data in an intermediate third language. The central idea behind the approach is triangulation, the process of incorporating multilingual knowledge in a single system, which eventually utilizes parallel corpora available in more than two languages. We conducted experiments using the Europarl corpus to evaluate our approach. The model size can be reduced by up to 70% while translation quality is preserved.

Using Moses to Integrate Multiple Rule-Based Machine Translation Engines into a Hybrid System
Andreas Eisele | Christian Federmann | Hervé Saint-Amand | Michael Jellinghaus | Teresa Herrmann | Yu Chen
Proceedings of the Third Workshop on Statistical Machine Translation