Supervised classification of end-of-lines in clinical text with no manual annotation

Pierre Zweigenbaum, Cyril Grouin, Thomas Lavergne


Abstract
In some plain text documents, end-of-line marks may or may not mark the boundary of a text unit (e.g., of a paragraph). This vexing problem is likely to impact subsequent natural language processing components, but is seldom addressed in the literature. We propose a method which uses no manual annotation to classify whether end-of-lines must actually be seen as simple spaces (soft line breaks) or as true text unit boundaries. This method, which includes self-training and co-training steps based on token and line length features, achieves 0.943 F-measure on a corpus of short e-books with controlled format, F=0.904 on a random sample of 24 clinical texts with soft line breaks, and F=0.898 on a larger set of mixed clinical texts which may or may not contain soft line breaks, a fairly high value for a method with no manual annotation.
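The role that token and line length features can play may be illustrated with a minimal sketch. It is not the authors' implementation: the function name `label_line_breaks`, the `width` parameter, and the rule itself are assumptions made for illustration. The assumed rule is that a line break is probably soft when the first token of the next line would not have fit on the current line, and probably hard otherwise. Noisy labels of this kind are one way a classifier could be bootstrapped without manual annotation.

```python
def label_line_breaks(text, width=None):
    """Heuristically label each end-of-line in `text` as 'soft' or 'hard'.

    Illustrative assumption, not the published method: a break is 'soft'
    when the next line's first token would not have fit on the current line.
    """
    lines = text.split("\n")
    if width is None:
        # Assumption: the wrap width equals the length of the longest line.
        width = max((len(line) for line in lines), default=0)

    labels = []
    for current, following in zip(lines, lines[1:]):
        tokens = following.split()
        if not current.strip() or not tokens:
            # A break next to an empty line is treated as a true boundary.
            labels.append("hard")
            continue
        # Would the next token (plus one space) have fit within the width?
        fits = len(current.rstrip()) + 1 + len(tokens[0]) <= width
        labels.append("hard" if fits else "soft")
    return labels


if __name__ == "__main__":
    sample = (
        "The patient was admitted for\n"
        "observation overnight.\n"
        "No fever.\n"
        "\n"
        "Discharged."
    )
    for label in label_line_breaks(sample):
        print(label)
```

In this sample the first break is labeled soft because "observation" could not have fit on the 28-character first line, while the remaining breaks are labeled hard. Real documents would need a more robust width estimate than the longest line.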
Anthology ID:
W16-5109
Volume:
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)
Month:
December
Year:
2016
Address:
Osaka, Japan
Venue:
WS
Publisher:
The COLING 2016 Organizing Committee
Pages:
80–88
URL:
https://aclanthology.org/W16-5109
Cite (ACL):
Pierre Zweigenbaum, Cyril Grouin, and Thomas Lavergne. 2016. Supervised classification of end-of-lines in clinical text with no manual annotation. In Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016), pages 80–88, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Supervised classification of end-of-lines in clinical text with no manual annotation (Zweigenbaum et al., 2016)
PDF:
https://preview.aclanthology.org/remove-xml-comments/W16-5109.pdf