Sylvain Jaume


2021

Named Entity Recognition in Historic Legal Text: A Transformer and State Machine Ensemble Method
Fernando Trias | Hongming Wang | Sylvain Jaume | Stratos Idreos
Proceedings of the Natural Legal Language Processing Workshop 2021

Older legal texts are often scanned and digitized via Optical Character Recognition (OCR), which results in numerous errors. Although spelling and grammar checkers can correct much of the scanned text automatically, Named Entity Recognition (NER) is challenging, making correction of names difficult. To address this, we developed an ensemble language model using a transformer neural network architecture combined with a finite state machine to extract names from English-language legal text. We use the US-based, English-language Harvard Caselaw Access Project for training and testing. The extracted names are then subjected to heuristic textual analysis to identify errors, make corrections, and quantify the extent of problems. With this system, we are able to extract most names, automatically correct numerous errors, and identify potential mistakes that can later be reviewed for manual correction.

2020

News Aggregation with Diverse Viewpoint Identification Using Neural Embeddings and Semantic Understanding Models
Mark Carlebach | Ria Cheruvu | Brandon Walker | Cesar Ilharco Magalhaes | Sylvain Jaume
Proceedings of the 7th Workshop on Argument Mining

Today’s news volume makes it impractical for readers to get a diverse and comprehensive view of published articles written from opposing viewpoints. We introduce a transformer-based news aggregation system, composed of topic modeling, semantic clustering, claim extraction, and textual entailment, that identifies viewpoints presented in articles within a semantic cluster and classifies them into positive, neutral, and negative entailments. Our novel embedded topic model using BERT-based embeddings achieves an 11% relative improvement over baseline topic modeling algorithms. We compare recent semantic similarity models in the context of news aggregation, evaluate transformer-based models for claim extraction on news data, and demonstrate the use of textual entailment models for diverse viewpoint identification.