Alignment of Monolingual Corpus by Reduction of the Search Space

Prajol Shrestha


Abstract
Monolingual comparable corpora annotated with similarity-based alignments between text segments (paragraphs, sentences, etc.) are used in a wide range of natural language processing applications, such as plagiarism detection, information retrieval, and summarization. Their drawback is that few standard aligned corpora exist. As a result, such corpora are usually created manually, which is time-consuming and costly. In this paper, we propose a method that significantly reduces the search space for manual alignment of a monolingual comparable corpus, which in turn makes the alignment process faster and easier. The method can be used to make alignments at different levels of text segments. Using it, we create our own gold corpus aligned at the paragraph level, which will be used for building and testing our algorithms for automatic alignment. We also present experiments on reducing the search space on the basis of stem overlap, word overlap, and cosine similarity, which help us automate the process to some extent and reduce the human effort needed for alignment.
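To make the idea of search-space reduction concrete, the following is a minimal sketch, not the paper's implementation. It assumes a simple regex tokenizer, raw term-frequency vectors, and an arbitrary threshold of 0.2; the function names (`tokenize`, `word_overlap`, `cosine_similarity`, `candidate_pairs`) are hypothetical. Stem overlap is not shown because it would need a stemmer; under the same assumptions it would amount to applying `word_overlap` to stemmed tokens.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_overlap(a, b):
    """Count the distinct word types shared by two segments."""
    return len(set(tokenize(a)) & set(tokenize(b)))


def cosine_similarity(a, b):
    """Cosine similarity between raw term-frequency vectors of two segments."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def candidate_pairs(source_segments, target_segments,
                    score=cosine_similarity, threshold=0.2):
    """Keep only segment pairs whose score reaches the (assumed) threshold.

    The surviving pairs form the reduced search space that a human
    annotator would then inspect, instead of every possible pair.
    """
    pairs = []
    for i, s in enumerate(source_segments):
        for j, t in enumerate(target_segments):
            value = score(s, t)
            if value >= threshold:
                pairs.append((i, j, value))
    return pairs


if __name__ == "__main__":
    source = ["the storm hit the coast"]
    target = ["a storm hit coastal towns", "stock markets rose"]
    for i, j, value in candidate_pairs(source, target):
        print(i, j, round(value, 3))
```

A lower threshold keeps more candidate pairs (fewer missed alignments, more manual work), while a higher one prunes more aggressively; the appropriate value would have to be chosen empirically for a given corpus.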
Anthology ID:
2011.jeptalnrecital-recital.5
Volume:
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues
Month:
June
Year:
2011
Address:
Montpellier, France
Venue:
JEP/TALN/RECITAL
Publisher:
ATALA
Pages:
48–56
URL:
https://aclanthology.org/2011.jeptalnrecital-recital.5
Cite (ACL):
Prajol Shrestha. 2011. Alignment of Monolingual Corpus by Reduction of the Search Space. In Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues, pages 48–56, Montpellier, France. ATALA.
Cite (Informal):
Alignment of Monolingual Corpus by Reduction of the Search Space (Shrestha, JEP/TALN/RECITAL 2011)
PDF:
https://preview.aclanthology.org/update-css-js/2011.jeptalnrecital-recital.5.pdf