Masood Ghayoomi

2014

A Database of Freely Written Texts of German School Students for the Purpose of Automatic Spelling Error Classification
Kay Berkling | Johanna Fay | Masood Ghayoomi | Katrin Hein | Rémi Lavalley | Ludwig Linhuber | Sebastian Stüker
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The spelling competence of school students is best measured on freely written texts rather than on pre-determined, dictated texts. Since analyzing the error categories in such texts is labor-intensive and costly, we are working on an automatic system to perform this task. The modules of the system are derived from techniques in natural language processing and are learning systems that need large amounts of training data. To obtain the data necessary for training and evaluating the resulting system, we collected freely written German texts from school children. 1,730 students from grades 1 through 8 participated in this data collection. The texts were transcribed electronically and annotated with their corrected versions. This resulted in a total of 14,563 sentences that can now be used for research on spelling diagnostics. Additional metadata was collected on the writers' language biography, teaching methodology, age, gender, and school year. To support detailed manual annotation of the categories of spelling errors made by the students, we developed a tool specifically tailored to the task.

Converting an HPSG-based Treebank into its Parallel Dependency-based Treebank
Masood Ghayoomi | Jonas Kuhn
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

A treebank is an important language resource for supervised statistical parsers. The parser induces the grammatical properties of a language from this language resource and uses the model to parse unseen data automatically. Since developing such a resource is very time-consuming and tedious, one can take advantage of already extant resources by adapting them to a particular application. This reduces the amount of human effort required to develop a new language resource. In this paper, we introduce an algorithm to convert an HPSG-based treebank into its parallel dependency-based treebank. With this converter, we can automatically create a new language resource from an existing treebank developed based on a grammar formalism. Our proposed algorithm is able to create both projective and non-projective dependency trees.

Constituency Parsing of Bulgarian: Word- vs Class-based Parsing
Masood Ghayoomi | Kiril Simov | Petya Osenova
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we report the results of two constituency parsers trained on BulTreeBank, an HPSG-based treebank for Bulgarian. To reduce the data sparsity problem, we propose using Brown word clustering to perform an off-line clustering and map the words in the treebank to their classes, creating a class-based treebank. Our observations show that the results are better when the classes outnumber the POS tags. Since this approach adds another dimension of abstraction (in comparison to the lemma), its coarse-grained representation can be used further for training statistical parsers.

2012

From Grammar Rule Extraction to Treebanking: A Bootstrapping Approach
Masood Ghayoomi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Most reliable language resources are developed under human supervision. Developing supervised annotated data is hard and tedious, and it is very time-consuming when done entirely by hand; as a result, various types of annotated data, including treebanks, are not available for many languages. Considering that a portion of the language is regular, we can define regular expressions as grammar rules to recognize the strings that match them, and thereby reduce the human effort needed to annotate further unseen data. In this paper, we propose an incremental bootstrapping approach that begins by extracting grammar rules when no treebank is available. Since Persian suffers from a lack of available data sources, we have applied our method to develop a treebank for this language. Our experiment shows that this approach significantly decreases the amount of manual effort in the annotation process while enlarging the treebank.
