Abstract
Automatic dating of ancient documents is an important research area for digital humanities applications. Many documents available through digital libraries have no date, or only an uncertain one. Document dating is useful in itself, and it also helps in choosing the appropriate NLP tools (lemmatizer, POS tagger) for subsequent analysis. This paper provides a dataset of thousands of ancient documents in French and presents methods and evaluation metrics for this task. We compare character-level methods with token-level methods on two datasets covering two different time periods and two different text genres. Our results show that character-level models are more robust to noise than classical token-level models. The experiments presented in this article focus on documents written in French, but we believe that the ability of character-level models to handle noise would help achieve comparable results on other languages, and on more ancient languages in particular.
- Anthology ID:
- 2020.lt4hala-1.3
- Volume:
- Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Rachele Sprugnoli, Marco Passarotti
- Venue:
- LT4HALA
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 17–21
- Language:
- English
- URL:
- https://aclanthology.org/2020.lt4hala-1.3
- Cite (ACL):
- Anaëlle Baledent, Nicolas Hiebel, and Gaël Lejeune. 2020. Dating Ancient texts: an Approach for Noisy French Documents. In Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, pages 17–21, Marseille, France. European Language Resources Association (ELRA).
- Cite (Informal):
- Dating Ancient texts: an Approach for Noisy French Documents (Baledent et al., LT4HALA 2020)
- PDF:
- https://preview.aclanthology.org/ingest-bitext-workshop/2020.lt4hala-1.3.pdf