Michael Arrigo
2022
CAMIO: A Corpus for OCR in Multiple Languages
Michael Arrigo | Stephanie Strassel | Nolan King | Thao Tran | Lisa Mason
Proceedings of the Thirteenth Language Resources and Evaluation Conference
CAMIO (Corpus of Annotated Multilingual Images for OCR) is a new corpus created by the Linguistic Data Consortium to support the development and evaluation of optical character recognition (OCR) and related technologies for 35 languages across 24 unique scripts. The corpus comprises nearly 70,000 images of machine-printed text, covering a wide variety of topics and styles, document domains, attributes and scanning/capture artifacts. Most images have been exhaustively annotated for text localization, resulting in over 2.3M line-level bounding boxes. For 13 of the 35 languages, 1,250 images per language have been further annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in a comprehensive XML output format defined for this corpus. The paper discusses corpus design and implementation, challenges encountered, baseline performance results obtained on the corpus for text localization and OCR decoding, and plans for corpus publication.
2019
Corpus Building for Low Resource Languages in the DARPA LORELEI Program
Jennifer Tracey | Stephanie Strassel | Ann Bies | Zhiyi Song | Michael Arrigo | Kira Griffitt | Dana Delgado | Dave Graff | Seth Kulick | Justin Mott | Neil Kuster
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages
2015
A New Dataset and Evaluation for Belief/Factuality
Vinodkumar Prabhakaran | Tomas By | Julia Hirschberg | Owen Rambow | Samira Shaikh | Tomek Strzalkowski | Jennifer Tracey | Michael Arrigo | Rupayan Basu | Micah Clark | Adam Dalton | Mona Diab | Louise Guthrie | Anna Prokofieva | Stephanie Strassel | Gregory Werner | Yorick Wilks | Janyce Wiebe
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics
Co-authors
- Stephanie Strassel 3
- Jennifer Tracey 2
- Rupayan Basu 1
- Ann Bies 1
- Tomas By 1
- Micah Clark 1
- Adam Dalton 1
- Dana Delgado 1
- Mona Diab 1
- Dave Graff 1
- Kira Griffitt 1
- Louise Guthrie 1
- Julia Hirschberg 1
- Nolan King 1
- Seth Kulick 1
- Neil Kuster 1
- Lisa Mason 1
- Justin Mott 1
- Vinodkumar Prabhakaran 1
- Anna Prokofieva 1
- Owen Rambow 1
- Samira Shaikh 1
- Zhiyi Song 1
- Tomek Strzalkowski 1
- Thao Tran 1
- Gregory Werner 1
- Janyce Wiebe 1
- Yorick Wilks 1