Ahmet Cüneyd Tantuğ
Also published as:
A. Cüneyd Tantuǧ,
A. Cüneyd Tantuğ
With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-canonical text normalization provides a way to exploit this as a practical source of training data for language processing systems. The state of the art in Turkish text normalization is composed of a token-level pipeline of modules, heavily dependent on external linguistic resources and manually defined rules. Instead, we propose a fully automated, context-aware machine translation approach with fewer stages of processing. Experiments with various implementations of our approach show that we are able to surpass the current best-performing system by a large margin.
In this study, we present Morpheus, a joint contextual lemmatizer and morphological tagger. Morpheus is based on a neural sequential architecture where the inputs are the characters of the surface words in a sentence and the outputs are the minimum edit operations between surface words and their lemmata, as well as the morphological tags assigned to the words. Experiments on the datasets in nearly 100 languages provided by the SigMorphon 2019 Shared Task 2 organizers show that the performance of Morpheus is comparable to the state-of-the-art system in terms of lemmatization. In morphological tagging, on the other hand, Morpheus significantly outperforms the SigMorphon baseline. In our experiments, we also show that the neural encoder-decoder architecture trained to predict the minimum edit operations can produce considerably better results than the architecture trained to predict the characters in lemmata directly, as in previous studies. According to the SigMorphon 2019 Shared Task 2 results, Morpheus placed 3rd in lemmatization and 9th in morphological tagging among all participant teams.
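To make the notion of "minimum edit operations between surface words and their lemmata" concrete, the following is a minimal sketch, assuming a plain Levenshtein alignment with one label per step. The label names (`same`, `delete`, `replace_<c>`, `insert_<c>`) and both function names are hypothetical and chosen for illustration; the abstract does not specify the operation inventory Morpheus actually uses.

```python
def edit_operations(surface: str, lemma: str) -> list[str]:
    """Return one minimum-cost edit script that turns `surface` into `lemma`.

    Labels are illustrative assumptions, not the paper's inventory:
    "same", "delete", "replace_<char>", "insert_<char>".
    """
    n, m = len(surface), len(lemma)
    # dist[i][j] = Levenshtein distance between surface[:i] and lemma[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if surface[i - 1] == lemma[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a surface character
                dist[i][j - 1] + 1,         # insert a lemma character
                dist[i - 1][j - 1] + cost,  # keep or replace
            )

    # Walk back from the bottom-right corner to recover the operations.
    ops: list[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and surface[i - 1] == lemma[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            ops.append("same")
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(f"replace_{lemma[j - 1]}")
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append(f"insert_{lemma[j - 1]}")
            j -= 1
    ops.reverse()
    return ops


def apply_operations(surface: str, ops: list[str]) -> str:
    """Replay an edit script on `surface` to rebuild the lemma."""
    out: list[str] = []
    i = 0  # position in the surface word
    for op in ops:
        if op == "same":
            out.append(surface[i])
            i += 1
        elif op == "delete":
            i += 1
        elif op.startswith("replace_"):
            out.append(op[len("replace_"):])
            i += 1
        else:  # "insert_<char>" consumes no surface character
            out.append(op[len("insert_"):])
    return "".join(out)


ops = edit_operations("kitaplar", "kitap")
print(ops)
print(apply_operations("kitaplar", ops))
```

Under this scheme, a tagger that predicts the script rather than the lemma characters mostly emits `same` labels for the stem and a few edits near the affixes, which is one plausible reading of why predicting edit operations could be easier than generating the lemma character by character.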
We present a tool, BLEU+, which implements various extensions to BLEU computation to allow for a better understanding of translation performance, especially for morphologically complex languages. BLEU+ takes into account both closeness in morphological structure and closeness of the root words in the WordNet hierarchy when comparing tokens in the candidate and reference sentences. In addition to gauging performance at a finer level of granularity, BLEU+ also allows the computation of various upper-bound oracle scores: comparing all tokens considering only the roots allows us to get an upper bound when all errors due to morphological structure are fixed, while comparing tokens in an error-tolerant way, considering minor morpheme edit operations, allows us to get a (more realistic) upper bound when tokens that differ in morpheme insertions, deletions, and substitutions are fixed. We use BLEU+ in the fine-grained evaluation of the output of our English-to-Turkish statistical MT system.
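The root-only oracle idea can be illustrated with a small sketch. It is not the BLEU+ tool itself: it computes only clipped unigram precision (one ingredient of BLEU), and it assumes tokens are already segmented in a hypothetical `root+morpheme+morpheme` notation. The function names and the toy Turkish tokens are assumptions made for illustration.

```python
from collections import Counter


def clipped_precision(candidate: list[str], reference: list[str]) -> float:
    """Clipped unigram precision: each candidate token is credited at most
    as many times as it occurs in the reference."""
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    matched = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return matched / len(candidate)


def root_of(token: str) -> str:
    """Return the root under the assumed 'root+morph+morph' segmentation."""
    return token.split("+", 1)[0]


# Toy example: roots agree everywhere, but two tokens differ in morphology.
candidate = ["ev+ler+de", "kitap+lar", "oku+du"]
reference = ["ev+de", "kitap+lar", "oku+du+m"]

exact = clipped_precision(candidate, reference)
root_only = clipped_precision(
    [root_of(t) for t in candidate],
    [root_of(t) for t in reference],
)
print(f"exact match: {exact:.2f}, root-only oracle: {root_only:.2f}")
```

The gap between the two numbers indicates how much of the mismatch is attributable to morphology rather than to the choice of root word, which is the kind of diagnosis the oracle scores are meant to support.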