Named Entity Recognition in Historic Legal Text: A Transformer and State Machine Ensemble Method

Fernando Trias, Hongming Wang, Sylvain Jaume, Stratos Idreos


Abstract
Older legal texts are often scanned and digitized via Optical Character Recognition (OCR), which introduces numerous errors. Although spelling and grammar checkers can correct much of the scanned text automatically, Named Entity Recognition (NER) remains challenging, which makes names difficult to correct. To address this, we developed an ensemble language model that combines a transformer neural network architecture with a finite state machine to extract names from English-language legal text. We use the US-based, English-language Harvard Caselaw Access Project for training and testing. The extracted names are then subjected to heuristic textual analysis to identify errors, make corrections, and quantify the extent of problems. With this system, we are able to extract most names, automatically correct numerous errors, and flag potential mistakes for later manual review.
Anthology ID:
2021.nllp-1.18
Volume:
Proceedings of the Natural Legal Language Processing Workshop 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Venue:
NLLP
Publisher:
Association for Computational Linguistics
Pages:
172–179
URL:
https://aclanthology.org/2021.nllp-1.18
DOI:
10.18653/v1/2021.nllp-1.18
Cite (ACL):
Fernando Trias, Hongming Wang, Sylvain Jaume, and Stratos Idreos. 2021. Named Entity Recognition in Historic Legal Text: A Transformer and State Machine Ensemble Method. In Proceedings of the Natural Legal Language Processing Workshop 2021, pages 172–179, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Named Entity Recognition in Historic Legal Text: A Transformer and State Machine Ensemble Method (Trias et al., NLLP 2021)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2021.nllp-1.18.pdf
Data
CoNLL-2003