Mridul Gupta


2011

Phrase alignment is a crucial step in phrase-based statistical machine translation. We explore a way of improving phrase alignment by adding syntactic information in the form of chunks as soft constraints guided by an in-depth and detailed analysis on a hand-aligned data set. We extend a probabilistic phrase alignment model that extracts phrase pairs by optimizing phrase pair boundaries over the sentence pair [1]. The boundaries of the target phrase are chosen such that the overall sentence alignment probability is optimal. Viterbi alignment information is also added in the extended model with a view of improving phrase alignment. We extract phrase pairs using a relatively larger number of features which are discriminatively trained using a large-margin online learning algorithm, i.e., Margin Infused Relaxed Algorithm (MIRA) and integrate it in our approach. Initial experiments show improvements in both phrase alignment and translation quality for Arabic-English on a moderate-size translation task.

2010

This paper describes the development of a hybrid tool for a semi-automated process for validation of treebank annotation at various levels. The tool is developed for error detection at the part-of-speech, chunk and dependency levels of a Hindi treebank, currently under development. The tool aims to identify as many errors as possible at these levels to achieve consistency in the task of annotation. Consistency in treebank annotation is a must for making data as error-free as possible and for providing quality assurance. The tool is aimed at ensuring consistency and to make manual validation cost effective. We discuss a rule based and a hybrid approach (statistical methods combined with rule-based methods) by which a high-recall system can be developed and used to identify errors in the treebank. We report some results of using the tool on a sample of data extracted from the Hindi treebank. We also argue how the tool can prove useful in improving the annotation guidelines which would in turn, better the quality of annotation in subsequent iterations.
The paper describes an approach to expedite the process of manual annotation of a Hindi dependency treebank which is currently under development. We propose a way by which consistency among a set of manual annotators could be improved. Furthermore, we show that our setup can also prove useful for evaluating when an inexperienced annotator is ready to start participating in the production of the treebank. We test our approach on sample sets of data obtained from an ongoing work on creation of this treebank. The results asserting our proposal are reported in this paper. We report results from a semi-automated approach of dependency annotation experiment. We find out the rate of agreement between annotators using Cohen’s Kappa. We also compare results with respect to the total time taken to annotate sample data-sets using a completely manual approach as opposed to a semi-automated approach. It is observed from the results that this semi-automated approach when carried out with experienced and trained human annotators improves the overall quality of treebank annotation and also speeds up the process.

2009