This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
Septina DianLarasati
Also published as:
Septina Larasati
Fixing paper assignments
Please select all papers that do not belong to this person.
Indicate below which author they should be assigned to.
The Votter Corpus is a new annotated corpus of social polling questions and answers. The Votter Corpus is novel in its use of the mobile application format and novel in its coverage of specific demographics. With over 26,000 polls and close to 1 millions votes, the Votter Corpus covers everyday question and answer language, primarily for users who are female and between the ages of 13-24. The corpus is annotated by topic and by popularity of particular answers. The corpus contains many unique characteristics such as emoticons, common mobile misspellings, and images associated with many of the questions. The corpus is a collection of questions and answers from The Votter App on the Android operating system. Data is created solely on this mobile platform which differs from most social media corpora. The Votter Corpus is being made available online in XML format for research and non-commercial use. The Votter android app can be downloaded for free in most android app stores.
This paper presents a method to improve a word alignment model in a phrase-based Statistical Machine Translation system for a low-resourced language using a string similarity approach. Our method captures similar words that can be seen as semi-monolingual across languages, such as numbers, named entities, and adapted/loan words. We use several string similarity metrics to measure the monolinguality of the words, such as Longest Common Subsequence Ratio (LCSR), Minimum Edit Distance Ratio (MEDR), and we also use a modified BLEU Score (modBLEU). Our approach is to add intersecting alignment points for word pairs that are orthographically similar, before applying a word alignment heuristic, to generate a better word alignment. We demonstrate this approach on Indonesian-to-English translation task, where the languages share many similar words that are poorly aligned given a limited training data. This approach gives a statistically significant improvement by up to 0.66 in terms of BLEU score.
This paper describes the creation process of an Indonesian-English parallel corpus (IDENTIC). The corpus contains 45,000 sentences collected from different sources in different genres. Several manual text preprocessing tasks, such as alignment and spelling correction, are applied to the corpus to assure its quality. We also apply language specific text processing such as tokenization on both sides and clitic normalization on the Indonesian side. The corpus is available in two different formats: plain', stored in text format and morphologically enriched', stored in CoNLL format. Some parts of the corpus are publicly available at the IDENTIC homepage.