Yanjun Ma

2022

PaddleSpeech is an open-source all-in-one speech toolkit. It aims at facilitating the development and research of speech processing technologies by providing an easy-to-use command-line interface and a simple code structure. This paper describes the design philosophy and core architecture of PaddleSpeech to support several essential speech-to-text and text-to-speech tasks. PaddleSpeech achieves competitive or state-of-the-art performance on various speech datasets and implements the most popular methods. It also provides recipes and pretrained models to quickly reproduce the experimental results in this paper. PaddleSpeech is publicly avaiable at https://github.com/PaddlePaddle/PaddleSpeech.

pdf bib abs
A Gentle Introduction to Deep Nets and Opportunities for the Future
Kenneth Church | Valia Kordoni | Gary Marcus | Ernest Davis | Yanjun Ma | Zeyu Chen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

The first half of this tutorial will make deep nets more accessible to a broader audience, following “Deep Nets for Poets” and “A Gentle Introduction to Fine-Tuning.” We will also introduce GFT (general fine tuning), a little language for fine tuning deep nets with short (one line) programs that are as easy to code as regression in statistics packages such as R using glm (general linear models). Based on the success of these methods on a number of benchmarks, one might come away with the impression that deep nets are all we need. However, we believe the glass is half-full: while there is much that can be done with deep nets, there is always more to do. The second half of this tutorial will discuss some of these opportunities.

2021

pdf abs
SaGE: 基于句法感知图卷积神经网络和ELECTRA的中文隐喻识别模型(SaGE: Syntax-aware GCN with ELECTRA for Chinese Metaphor Detection)
Shenglong Zhang (张声龙) | Ying Liu (刘颖) | Yanjun Ma (马艳军)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

隐喻是人类语言中经常出现的一种特殊现象,隐喻识别对于自然语言处理各项任务来说具有十分基础和重要的意义。针对中文领域的隐喻识别任务,我们提出了一种基于句法感知图卷积神经网络和ELECTRA的隐喻识别模型(Syntax-aware GCN withELECTRA SaGE)。该模型从语言学出发,使用ELECTRA和Transformer编码器抽取句子的语义特征,将句子按照依存关系组织成一张图并使用图卷积神经网络抽取其句法特征,在此基础上对两类特征进行融合以进行隐喻识别。我们的模型在CCL2018中文隐喻识别评测数据集上以85.22%的宏平均F1分数超越了此前的最佳成绩,验证了融合语义信息和句法信息对于隐喻识别任务具有重要作用。

2018

In this paper, we focus on the problem of question generation (QG). Recent neural network-based approaches employ the sequence-to-sequence model which takes an answer and its context as input and generates a relevant question as output. However, we observe two major issues with these approaches: (1) The generated interrogative words (or question words) do not match the answer type. (2) The model copies the context words that are far from and irrelevant to the answer, instead of the words that are close and relevant to the answer. To address these two issues, we propose an answer-focused and position-aware neural question generation model. (1) By answer-focused, we mean that we explicitly model question word generation by incorporating the answer embedding, which can help generate an interrogative word matching the answer type. (2) By position-aware, we mean that we model the relative distance between the context words and the answer. Hence the model can be aware of the position of the context words when copying them to generate a question. We conduct extensive experiments to examine the effectiveness of our model. The experimental results show that our model significantly improves the baseline and outperforms the state-of-the-art system.

2012

pdf
An Evaluation of Statistical Post-Editing Systems Applied to RBMT and SMT Systems
Hanna Béchara | Raphaël Rubino | Yifan He | Yanjun Ma | Josef van Genabith
Proceedings of COLING 2012

2011

pdf
Preliminary Experiments on Using Users’ Post-Editions to Enhance a SMT System Oracle-based Training for Phrase-based Statistical Machine Translation
Ankit Srivastava | Yanjun Ma | Andy Way
Proceedings of the 15th Annual conference of the European Association for Machine Translation

pdf
Statistical Post-Editing for a Statistical MT System
Hanna Bechara | Yanjun Ma | Josef van Genabith
Proceedings of Machine Translation Summit XIII: Papers

pdf
Rich Linguistic Features for Translation Memory-Inspired Consistent Translation
Yifan He | Yanjun Ma | Andy Way | Josef van Genabith
Proceedings of Machine Translation Summit XIII: Papers

pdf bib abs
From the Confidence Estimation of Machine Translation to the Integration of MT and Translation Memory
Yanjun Ma | Yifan He | Josef van Genabith
Proceedings of Machine Translation Summit XIII: Tutorial Abstracts

In this tutorial, we cover techniques that facilitate the integration of Machine Translation (MT) and Translation Memory (TM), which can help the adoption of MT technology in localisation industry. The tutorial covers four parts: i) brief introduction of MT and TM systems, ii) MT confidence estimation measures tailored for the TM environment, iii) segment-level MT and MT integration, iv) sub-segment level MT and TM integration, and v) human evaluation of MT and TM integration. We will first briefly describe and compare how translations are generated in MT and TM systems, and suggest possible avenues to combines these two systems. We will also cover current quality / cost estimation measures applied in MT and TM systems, such as the fuzzy-match score in the TM, and the evaluation/confidence metrics used to judge MT outputs. We then move on to introduce the recent developments in the field of MT confidence estimation tailored towards predicting post-editing efforts. We will especially focus on the confidence metrics proposed by Specia et al., which is shown to have high correlation with human preference, as well as post-editing time. For segment-level MT and TM integration, we present translation recommendation and translation re-ranking models, where the integration happens at the 1-best or the N-best level, respectively. Given an input to be translated, MT-TM recommendation compares the output from the MT and the TM systems, and presents the better one to the post-editor. MT-TM re-ranking, on the other hand, combines k-best lists from both systems, and generates a new list according to estimated post-editing effort. We observe high precision of these models in automatic and human evaluations, indicating that they can be integrated into TM environments without the risk of deteriorating the quality of the post-editing candidate. For sub-segment level MT and TM integration, we try to reuse high quality TM chunks to improve the quality of MT systems. We can also predict whether phrase pairs derived from fuzzy matches should be used to constrain the translation of an input segment. Using a series of linguistically- motivated features, our constraints lead both to more consistent translation output, and to improved translation quality, as is measured by automatic evaluation scores. Finally, we present several methodologies that can be used to track post-editing effort, perform human evaluation of MT-TM integration, or help translators to access MT outputs in a TM environment.

pdf
Consistent Translation using Discriminative Learning - A Translation Memory-inspired Approach
Yanjun Ma | Yifan He | Andy Way | Josef van Genabith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf
Bridging SMT and TM with Translation Recommendation
Yifan He | Yanjun Ma | Josef van Genabith | Andy Way
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf
HMM Word-to-Phrase Alignment with Dependency Constraints
Yanjun Ma | Andy Way
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

pdf
Statistical Analysis of Alignment Characteristics for Phrase-based Machine Translation
Patrik Lambert | Simon Petitrenaud | Yanjun Ma | Andy Way
Proceedings of the 14th Annual conference of the European Association for Machine Translation

pdf abs
Improving the Post-Editing Experience using Translation Recommendation: A User Study
Yifan He | Yanjun Ma | Johann Roturier | Andy Way | Josef van Genabith
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

We report findings from a user study with professional post-editors using a translation recommendation framework (He et al., 2010) to integrate Statistical Machine Translation (SMT) output with Translation Memory (TM) systems. The framework recommends SMT outputs to a TM user when it predicts that SMT outputs are more suitable for post-editing than the hits provided by the TM. We analyze the effectiveness of the model as well as the reaction of potential users. Based on the performance statistics and the users’ comments, we find that translation recommendation can reduce the workload of professional post-editors and improve the acceptance of MT in the localization industry.

pdf
Integrating N-best SMT Outputs into a TM System
Yifan He | Yanjun Ma | Andy Way | Josef van Genabith
Coling 2010: Posters

2009

pdf abs
Low-resource machine translation using MaTrEx
Yanjun Ma | Tsuyoshi Okita | Özlem Çetinoğlu | Jinhua Du | Andy Way
Proceedings of the 6th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper, we give a description of the Machine Translation (MT) system developed at DCU that was used for our fourth participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2009). Two techniques are deployed in our system in order to improve the translation quality in a low-resource scenario. The first technique is to use multiple segmentations in MT training and to utilise word lattices in decoding stage. The second technique is used to select the optimal training data that can be used to build MT systems. In this year’s participation, we use three different prototype SMT systems, and the output from each system are combined using standard system combination method. Our system is the top system for Chinese–English CHALLENGE task in terms of BLEU score.

pdf
Source-Side Context-Informed Hypothesis Alignment for Combining Outputs from Machine Translation Systems
Jinhua Du | Yanjun Ma | Andy Way
Proceedings of Machine Translation Summit XII: Posters

pdf
Tracking Relevant Alignment Characteristics for Machine Translation
Patrik Lambert | Yanjun Ma | Sylwia Ozdowska | Andy Way
Proceedings of Machine Translation Summit XII: Posters

pdf
Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation
Yanjun Ma | Andy Way
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
Proceedings of the Student Research Workshop at EACL 2009
Vera Demberg | Yanjun Ma | Nils Reiter
Proceedings of the Student Research Workshop at EACL 2009

pdf
Using Supertags as Source Language Context in SMT
Rejwanul Haque | Sudip Kumar Naskar | Yanjun Ma | Andy Way
Proceedings of the 13th Annual conference of the European Association for Machine Translation

pdf
Tuning Syntactically Enhanced Word Alignment for Statistical Machine Translation
Yanjun Ma | Patrik Lambert | Andy Way
Proceedings of the 13th Annual conference of the European Association for Machine Translation

2008

pdf
MaTrEx: The DCU MT System for WMT 2008
John Tinsley | Yanjun Ma | Sylwia Ozdowska | Andy Way
Proceedings of the Third Workshop on Statistical Machine Translation

pdf
Improving Word Alignment Using Syntactic Dependencies
Yanjun Ma | Sylwia Ozdowska | Yanli Sun | Andy Way
Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2)

pdf bib abs
Exploiting alignment techniques in MATREX: the DCU machine translation system for IWSLT 2008.
Yanjun Ma | John Tinsley | Hany Hassan | Jinhua Du | Andy Way
Proceedings of the 5th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper, we give a description of the machine translation (MT) system developed at DCU that was used for our third participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2008). In this participation, we focus on various techniques for word and phrase alignment to improve system quality. Specifically, we try out our word packing and syntax-enhanced word alignment techniques for the Chinese–English task and for the English–Chinese task for the first time. For all translation tasks except Arabic–English, we exploit linguistically motivated bilingual phrase pairs extracted from parallel treebanks. We smooth our translation tables with out-of-domain word translations for the Arabic–English and Chinese–English tasks in order to solve the problem of the high number of out of vocabulary items. We also carried out experiments combining both in-domain and out-of-domain data to improve system performance and, finally, we deploy a majority voting procedure combining a language model-based method and a translation-based method for case and punctuation restoration. We participated in all the translation tasks and translated both the single-best ASR hypotheses and the correct recognition results. The translation results confirm that our new word and phrase alignment techniques are often helpful in improving translation quality, and the data combination method we proposed can significantly improve system performance.

2007

pdf abs
MaTrEx: the DCU machine translation system for IWSLT 2007
Hany Hassan | Yanjun Ma | Andy Way
Proceedings of the Fourth International Workshop on Spoken Language Translation

In this paper, we give a description of the machine translation system developed at DCU that was used for our second participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2007). In this participation, we focus on some new methods to improve system quality. Specifically, we try our word packing technique for different language pairs, we smooth our translation tables with out-of-domain word translations for the Arabic–English and Chinese–English tasks in order to solve the high number of out of vocabulary items, and finally we deploy a translation-based model for case and punctuation restoration. We participated in both the classical and challenge tasks for the following translation directions: Chinese–English, Japanese–English and Arabic–English. For the last two tasks, we translated both the single-best ASR hypotheses and the correct recognition results; for Chinese–English, we just translated the correct recognition results. We report the results of the system for the provided evaluation sets, together with some additional experiments carried out following identification of some simple tokenisation errors in the official runs.

pdf
Alignment-guided chunking
Yanjun Ma | Nicolas Stroppa | Andy Way
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

pdf
Bootstrapping Word Alignment via Word Packing
Yanjun Ma | Nicolas Stroppa | Andy Way
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics