Yu Chen


2024

pdf
PKAD: Pretrained Knowledge is All You Need to Detect and Mitigate Textual Backdoor Attacks
Yu Chen | Qi Cao | Kaike Zhang | Xuchao Liu | Huawei Shen
Findings of the Association for Computational Linguistics: EMNLP 2024

In textual backdoor attacks, attackers insert poisoned samples with triggered inputs and target labels into training datasets to manipulate model behavior, threatening the model’s security and reliability. Current defense methods can generally be categorized into inference-time and training-time ones. The former often requires a part of clean samples to set detection thresholds, which may be hard to obtain in practical application scenarios, while the latter usually requires an additional retraining or unlearning process to get a clean model, significantly increasing training costs. To avoid these drawbacks, we focus on developing a practical defense method before model training without using any clean samples. Our analysis reveals that with the help of a pre-trained language model (PLM), poisoned samples, different from clean ones, exhibit mismatched relationship and shared characteristics. Based on these observations, we further propose a two-stage poison detection strategy solely leveraging insights from PLM before model training. Extensive experiments confirm our approach’s effectiveness, achieving better performance than current leading methods more swiftly. Our code is available at https://github.com/Ascian/PKAD.

pdf
Déplacement vertical du larynx dans la production des plosives en thaï
Paula Alejandra Cano Córdoba | Thi-Thuy-Hien Tran | Nathalie Vallée | Christophe Savariaux | Silvain Gerber | Nicha Yamlamai | Yu Chen
Actes des 35èmes Journées d'Études sur la Parole

Les plosives, généralement accompagnées d’un burst (relâchement audible) après la phase d’occlusion, sont néanmoins produites sans burst dans certaines langues d’Asie comme le thaï. Cette absence de bruit est attribuée au non relâchement brusque des articulateurs et est observée exclusivement lorsque les plosives sont en finale de syllabe, jamais en initiale. Nous formulons l’hypothèse qu’un mouvement d’abaissement du larynx pourrait provoquer une diminution de la pression intraorale pendant la tenue de l’occlusion induisant le non-relâchement articulatoire. Nous avons examiné le mouvement vertical du larynx chez deux locutrices natives lors de la production des plosives /p, t, k/ dans une tâche de lecture d’une liste de pseudo-mots de structure CVC. Les résultats montrent une grande variabilité dans le mouvement d’abaissement du larynx en fonction des segments consonantiques, vocaliques et du contexte tonal, suggérant que plusieurs facteurs pourraient être impliqués dans l’explication de la diminution de la pression intraorale.

pdf
Sandhi tonal en shanghaïen : une étude acoustique des contours dissyllabiques chez des locuteurs jeunes
Yu Chen | Nathalie Vallée | Thi-Thuy-Hien Tran | Silvain Gerber
Actes des 35èmes Journées d'Études sur la Parole

Le shanghaïen possède deux types de sandhi tonal : Left Dominant Sandhi (LDS) dans les composés sémantiques de type syntagme nominal (SN) et Right Dominant Sandhi (RDS) dans des phrases prosodiques de type syntagme verbal (SV). Cette étude examine les caractéristiques acoustiques du contour tonal dans des SN et SV dissyllabiques chez trois locutrices jeunes. Nos résultats montrent que les tons des SN subissent des changements phonologiques relevant du LDS, alors que les SV sont plutôt soumis aux effets phonétiques de la coarticulation tonale plutôt qu’au RDS. L’absence de différences significatives entre les SN et les SV ne permet pas de généraliser une distinction entre eux uniquement sur la base des réalisations tonales. Cette étude exploratoire ouvre des perspectives pour de futurs travaux intergénérationnels sur les productions tonales et la perception du sandhi tonal, en étendant le corpus à différentes positions au sein de la phrase et différentes classes d’âge.

pdf
LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models
Chi Han | Qifan Wang | Hao Peng | Wenhan Xiong | Yu Chen | Heng Ji | Sinong Wang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Today’s large language models (LLMs) typically train on short text segments (e.g., <4K tokens) due to the quadratic complexity of their Transformer architectures. As a result, their performance suffers drastically on inputs longer than those encountered during training, substantially limiting their applications in real-world tasks involving long contexts such as encod- ing scientific articles, code repositories, or long dialogues. Through both theoretical analysis and empirical investigation, this work identifies three major factors contributing to this length generalization failure. Our theoretical analysis reveals that commonly used techniques like using a sliding-window attention pattern or relative positional encodings are inadequate to address them. Answering these challenges, we propose LM-Infinite, a simple and effective method for enhancing LLMs’ capabilities of handling long contexts. LM-Infinite is highly flexible and can be used with most modern LLMs off-the-shelf. Without any parameter updates, it allows LLMs pre-trained with 2K or 4K-long segments to generalize to up to 200M length inputs while retaining perplexity. It also improves performance on downstream tasks such as Passkey Retrieval and Qasper in the zero-shot setting. LM-Infinite brings substantial efficiency improvements: it achieves 2.7× decoding speed up and 7.5× memory saving over the original model. Our code will be publicly available upon publication.

pdf
Ameli: Enhancing Multimodal Entity Linking with Fine-Grained Attributes
Barry Yao | Sijia Wang | Yu Chen | Qifan Wang | Minqian Liu | Zhiyang Xu | Licheng Yu | Lifu Huang
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose attribute-aware multimodal entity linking, where the input consists of a mention described with a text paragraph and images, and the goal is to predict the corresponding target entity from a multimodal knowledge base (KB) where each entity is also accompanied by a text description, visual images, and a collection of attributes that present the meta-information of the entity in a structured format. To facilitate this research endeavor, we construct Ameli, encompassing a new multimodal entity linking benchmark dataset that contains 16,735 mentions described in text and associated with 30,472 images, and a multimodal knowledge base that covers 34,690 entities along with 177,873 entity images and 798,216 attributes. To establish baseline performance on Ameli, we experiment with several state-of-the-art architectures for multimodal entity linking and further propose a new approach that incorporates attributes of entities into disambiguation. Experimental results and extensive qualitative analysis demonstrate that extracting and understanding the attributes of mentions from their text descriptions and visual images play a vital role in multimodal entity linking. To the best of our knowledge, we are the first to integrate attributes in the multimodal entity linking task. The programs, model checkpoints, and the dataset are publicly available at https://github.com/VT-NLP/Ameli.

2023

pdf
YNU-HPCC at SemEval-2023 Task 6: LEGAL-BERT Based Hierarchical BiLSTM with CRF for Rhetorical Roles Prediction
Yu Chen | You Zhang | Jin Wang | Xuejie Zhang
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

To understand a legal document for real-world applications, SemEval-2023 Task 6 proposes a shared Subtask A, rhetorical roles (RRs) prediction, which requires a system to automatically assign a RR label for each semantical segment in a legal text. In this paper, we propose a LEGAL-BERT based hierarchical BiLSTM model with conditional random field (CRF) for RR prediction, which primarily consists of two parts: word-level and sentence-level encoders. The word-level encoder first adopts a legal-domain pre-trained language model, LEGAL-BERT, initially word-embedding words in each sentence in a document and a word-level BiLSTM further encoding such sentence representation. The sentence-level encoder then uses an attentive pooling method for sentence embedding and a sentence-level BiLSTM for document modeling. Finally, a CRF is utilized to predict RRs for each sentence. The officially released results show that our method outperformed the baseline systems. Our team won 7th rank out of 27 participants in Subtask A.

pdf
MixPAVE: Mix-Prompt Tuning for Few-shot Product Attribute Value Extraction
Li Yang | Qifan Wang | Jingang Wang | Xiaojun Quan | Fuli Feng | Yu Chen | Madian Khabsa | Sinong Wang | Zenglin Xu | Dongfang Liu
Findings of the Association for Computational Linguistics: ACL 2023

The task of product attribute value extraction is to identify values of an attribute from product information. Product attributes are important features, which help improve online shopping experience of customers, such as product search, recommendation and comparison. Most existing works only focus on extracting values for a set of known attributes with sufficient training data. However, with the emerging nature of e-commerce, new products with their unique set of new attributes are constantly generated from different retailers and merchants. Collecting a large number of annotations for every new attribute is costly and time consuming. Therefore, it is an important research problem for product attribute value extraction with limited data. In this work, we propose a novel prompt tuning approach with Mixed Prompts for few-shot Attribute Value Extraction, namely MixPAVE. Specifically, MixPAVE introduces only a small amount (< 1%) of trainable parameters, i.e., a mixture of two learnable prompts, while keeping the existing extraction model frozen. In this way, MixPAVE not only benefits from parameter-efficient training, but also avoids model overfitting on limited training examples. Experimental results on two product benchmarks demonstrate the superior performance of the proposed approach over several state-of-the-art baselines. A comprehensive set of ablation studies validate the effectiveness of the prompt design, as well as the efficiency of our approach.

pdf
Joint Semantic and Strategy Matching for Persuasive Dialogue
Chuhao Jin | Yutao Zhu | Lingzhen Kong | Shijie Li | Xiao Zhang | Ruihua Song | Xu Chen | Huan Chen | Yuchong Sun | Yu Chen | Jun Xu
Findings of the Association for Computational Linguistics: EMNLP 2023

Persuasive dialogue aims to persuade users to achieve some targets by conversations. While previous persuasion models have achieved notable successes, they mostly base themselves on utterance semantic matching, and an important aspect has been ignored, that is, the strategy of the conversations, for example, the agent can choose an emotional-appeal strategy to impress users. Compared with utterance semantics, conversation strategies are high-level concepts, which can be informative and provide complementary information to achieve effective persuasions. In this paper, we propose to build a persuasion model by jointly modeling the conversation semantics and strategies, where we design a BERT-like module and an auto-regressive predictor to match the semantics and strategies, respectively. Experimental results indicate that our proposed approach can significantly improve the state-of-the-art baseline by 5% on a small dataset and 37% on a large dataset in terms of Recall@1. Detailed analyses show that the auto-regressive predictor contributes most to the final performance.

pdf
Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality
Harman Singh | Pengchuan Zhang | Qifan Wang | Mengjiao Wang | Wenhan Xiong | Jingfei Du | Yu Chen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning. However, recent research has highlighted severe limitations of these models in their ability to perform compositional reasoning over objects, attributes, and relations. Scene graphs have emerged as an effective way to understand images compositionally. These are graph-structured semantic representations of images that contain objects, their attributes, and relations with other objects in a scene. In this work, we consider the scene graph parsed from text as a proxy for the image scene graph and propose a graph decomposition and augmentation framework along with a coarse-to-fine contrastive learning objective between images and text that aligns sentences of various complexities to the same image. We also introduce novel negative mining techniques in the scene graph space for improving attribute binding and relation understanding. Through extensive experiments, we demonstrate the effectiveness of our approach that significantly improves attribute binding, relation understanding, systematic generalization, and productivity on multiple recently proposed benchmarks (For example, improvements up to 18% for systematic generalization, 16.5% for relation understanding over a strong baseline), while achieving similar or better performance than CLIP on various general multimodal tasks.

2022

pdf
基于情感增强非参数模型的社交媒体观点聚类(A Sentiment Enhanced Nonparametric Model for Social Media Opinion Clustering)
Kan Liu (刘勘) | Yu Chen (陈昱) | Jiarui He (何佳瑞)
Proceedings of the 21st Chinese National Conference on Computational Linguistics

“本文旨在使用文本聚类技术,将社交媒体文本根据用户主张的观点汇总,直观呈现网民群体所持有的不同立场。针对社交媒体文本模式复杂与情感丰富等特点,本文提出使用情感分布增强方法改进现有的非参数短文本聚类算法,以高斯分布建模文本情感,捕获文本情感特征的同时能够自动确定聚类簇数量并实现观点聚类。在公开数据集上的实验显示,该方法在多项聚类指标上取得了超越现有模型的聚类表现,并在主观性较强的数据集中具有更显著的优势。”

2021

pdf
Are Language-Agnostic Sentence Representations Actually Language-Agnostic?
Yu Chen | Tania Avgustinova
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

With the emergence of pre-trained multilingual models, multilingual embeddings have been widely applied in various natural language processing tasks. Language-agnostic models provide a versatile way to convert linguistic units from different languages into a shared vector representation space. The relevant work on multilingual sentence embeddings has reportedly reached low error rate in cross-lingual similarity search tasks. In this paper, we apply the pre-trained embedding models and the cross-lingual similarity search task in diverse scenarios, and observed large discrepancy in results in comparison to the original paper. Our findings on cross-lingual similarity search with different newly constructed multilingual datasets show not only correlation with observable language similarities but also strong influence from factors such as translation paths, which limits the interpretation of the language-agnostic property of the LASER model. %

pdf
Deep Learning on Graphs for Natural Language Processing
Lingfei Wu | Yu Chen | Heng Ji | Yunyao Li
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials

Due to its great power in modeling non-Euclidean data like graphs or manifolds, deep learning on graph techniques (i.e., Graph Neural Networks (GNNs)) have opened a new door to solving challenging graph-related NLP problems. There has seen a surge of interests in applying deep learning on graph techniques to NLP, and has achieved considerable success in many NLP tasks, ranging from classification tasks like sentence classification, semantic role labeling and relation extraction, to generation tasks like machine translation, question generation and summarization. Despite these successes, deep learning on graphs for NLP still face many challenges, including automatically transforming original text sequence data into highly graph-structured data, and effectively modeling complex data that involves mapping between graph-based inputs and other highly structured output data such as sequences, trees, and graph data with multi-types in both nodes and edges. This tutorial will cover relevant and interesting topics on applying deep learning on graph techniques to NLP, including automatic graph construction for NLP, graph representation learning for NLP, advanced GNN based models (e.g., graph2seq, graph2tree, and graph2graph) for NLP, and the applications of GNNs in various NLP tasks (e.g., machine translation, natural language generation, information extraction and semantic parsing). In addition, hands-on demonstration sessions will be included to help the audience gain practical experience on applying GNNs to solve challenging NLP problems using our recently developed open source library – Graph4NLP, the first library for researchers and practitioners for easy use of GNNs for various NLP tasks.

pdf
Energy-based Unknown Intent Detection with Data Manipulation
Yawen Ouyang | Jiasheng Ye | Yu Chen | Xinyu Dai | Shujian Huang | Jiajun Chen
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2019

pdf
Bidirectional Attentive Memory Networks for Question Answering over Knowledge Bases
Yu Chen | Lingfei Wu | Mohammed J. Zaki
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

When answering natural language questions over knowledge bases (KBs), different question components and KB aspects play different roles. However, most existing embedding-based methods for knowledge base question answering (KBQA) ignore the subtle inter-relationships between the question and the KB (e.g., entity types, relation paths and context). In this work, we propose to directly model the two-way flow of interactions between the questions and the KB via a novel Bidirectional Attentive Memory Network, called BAMnet. Requiring no external resources and only very few hand-crafted features, on the WebQuestions benchmark, our method significantly outperforms existing information-retrieval based methods, and remains competitive with (hand-crafted) semantic parsing based methods. Also, since we use attention mechanisms, our method offers better interpretability compared to other baselines.

pdf
Machine Translation from an Intercomprehension Perspective
Yu Chen | Tania Avgustinova
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Within the first shared task on machine translation between similar languages, we present our first attempts on Czech to Polish machine translation from an intercomprehension perspective. We propose methods based on the mutual intelligibility of the two languages, taking advantage of their orthographic and phonological similarity, in the hope to improve over our baselines. The translation results are evaluated using BLEU. On this metric, none of our proposals could outperform the baselines on the final test set. The current setups are rather preliminary, and there are several potential improvements we can try in the future.

2012

pdf
Combining Social Cognitive Theories with Linguistic Features for Multi-genre Sentiment Analysis
Hao Li | Yu Chen | Heng Ji | Smaranda Muresan | Dequan Zheng
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

pdf
Machine Learning for Hybrid Machine Translation
Sabine Hunsicker | Yu Chen | Christian Federmann
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf
Joint Grammar and Treebank Development for Mandarin Chinese with HPSG
Yi Zhang | Rui Wang | Yu Chen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present the ongoing development of MCG, a linguistically deep and precise grammar for Mandarin Chinese together with its accompanying treebank, both based on the linguistic framework of HPSG, and using MRS as the semantic representation. We highlight some key features of our grammar design, and review a number of challenging phenomena, with comparisons to alternative linguistic treatments and implementations. One of the distinguishing characteristics of our approach is the tight integration of grammar and treebank development. The two-step treebank annotation procedure benefits from the efficiency of the discriminant-based annotation approach, while giving the annotators full freedom of producing extra-grammatical structures. This not only allows the creation of a precise and full-coverage treebank with an imperfect grammar, but also provides prompt feedback for grammarians to identify the errors in the grammar design and implementation. Preliminary evaluation and error analysis shows that the grammar already covers most of the core phenomena for Mandarin Chinese, and the treebank annotation procedure reaches a stable speed of 35 sentences per hour with satisfying quality.

pdf
MultiUN v2: UN Documents with Multilingual Alignments
Yu Chen | Andreas Eisele
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

MultiUN is a multilingual parallel corpus extracted from the official documents of the United Nations. It is available in the six official languages of the UN and a small portion of it is also available in German. This paper presents a major update on the first public version of the corpus released in 2010. This version 2 consists of over 513,091 documents, including more than 9% of new documents retrieved from the United Nations official document system. We applied several modifications to the corpus preparation method. In this paper, we describe the methods we used for processing the UN documents and aligning the sentences. The most significant improvement compared to the previous release is the newly added multilingual sentence alignment information. The alignment information is encoded together with the text in XML instead of additional files. Our representation of the sentence alignment allows quick construction of aligned texts parallel in arbitrary number of languages, which is essential for building machine translation systems.

2011

pdf
Statistical Machine Transliteration with Multi-to-Multi Joint Source Channel Model
Yu Chen | Rui Wang | Yi Zhang
Proceedings of the 3rd Named Entities Workshop (NEWS 2011)

pdf
Engineering a Deep HPSG for Mandarin Chinese
Yi Zhang | Rui Wang | Yu Chen
Proceedings of the 9th Workshop on Asian Language Resources

2010

pdf
Further Experiments with Shallow Hybrid MT Systems
Christian Federmann | Andreas Eisele | Yu Chen | Sabine Hunsicker | Jia Xu | Hans Uszkoreit
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

pdf
Using Deep Belief Nets for Chinese Named Entity Categorization
Yu Chen | You Ouyang | Wenjie Li | Dequan Zheng | Tiejun Zhao
Proceedings of the 2010 Named Entities Workshop

pdf
Exploring Deep Belief Network for Chinese Relation Extraction
Yu Chen | Wenjie Li | Yan Liu | Dequan Zheng | Tiejun Zhao
CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf
Hierarchical Hybrid Translation between English and German
Yu Chen | Andreas Eisele
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

pdf
MultiUN: A Multilingual Corpus from United Nation Documents
Andreas Eisele | Yu Chen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the acquisition, preparation and properties of a corpus extracted from the official documents of the United Nations (UN). This corpus is available in all 6 official languages of the UN, consisting of around 300 million words per language. We describe the methods we used for crawling, document formatting, and sentence alignment. This corpus also includes a common test set for machine translation. We present the results of a French-Chinese machine translation experiment performed on this corpus.

pdf
Integrating a Rule-based with a Hierarchical Translation System
Yu Chen | Andreas Eisele
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Recent developments on hybrid systems that combine rule-based machine translation (RBMT) systems with statistical machine translation (SMT) generally neglect the fact that RBMT systems tend to produce more syntactically well-formed translations than data-driven systems. This paper proposes a method that alleviates this issue by preserving more useful structures produced by RBMT systems and utilizing them in a SMT system that operates on hierarchical structures instead of flat phrases alone. For our experiments, we use Joshua as the decoder. It is the first attempt towards a tighter integration of MT systems from different paradigms that both support hierarchical analysis. Preliminary results show consistent improvements over the previous approach.

2009

pdf
Intersecting Multilingual Data for Faster and Better Statistical Translations
Yu Chen | Martin Kay | Andreas Eisele
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf
Combining Multi-Engine Translations with Moses
Yu Chen | Michael Jellinghaus | Andreas Eisele | Yi Zhang | Sabine Hunsicker | Silke Theison | Christian Federmann | Hans Uszkoreit
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf
Translation Combination using Factored Word Substitution
Christian Federmann | Silke Theison | Andreas Eisele | Hans Uszkoreit | Yu Chen | Michael Jellinghaus | Sabine Hunsicker
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

pdf
Using Moses to Integrate Multiple Rule-Based Machine Translation Engines into a Hybrid System
Andreas Eisele | Christian Federmann | Hervé Saint-Amand | Michael Jellinghaus | Teresa Herrmann | Yu Chen
Proceedings of the Third Workshop on Statistical Machine Translation

pdf
Hybrid machine translation architectures within and beyond the EuroMatrix project
Andreas Eisele | Christian Federmann | Hans Uszkoreit | Hervé Saint-Amand | Martin Kay | Michael Jellinghaus | Sabine Hunsicker | Teresa Herrmann | Yu Chen
Proceedings of the 12th Annual Conference of the European Association for Machine Translation

pdf
Improving Statistical Machine Translation Efficiency by Triangulation
Yu Chen | Andreas Eisele | Martin Kay
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In current phrase-based Statistical Machine Translation systems, more training data is generally better than less. However, a larger data set eventually introduces a larger model that enlarges the search space for the decoder, and consequently requires more time and more resources to translate. This paper describes an attempt to reduce the model size by filtering out the less probable entries based on testing correlation using additional training data in an intermediate third language. The central idea behind the approach is triangulation, the process of incorporating multilingual knowledge in a single system, which eventually utilizes parallel corpora available in more than two languages. We conducted experiments using Europarl corpus to evaluate our approach. The reduction of the model size can be up to 70% while the translation quality is being preserved.

2007

pdf
Multi-Engine Machine Translation with an Open-Source SMT Decoder
Yu Chen | Andreas Eisele | Christian Federmann | Eva Hasler | Michael Jellinghaus | Silke Theison
Proceedings of the Second Workshop on Statistical Machine Translation