Identifying Cleartext in Historical Ciphers

Maria-Elena Gambardella, Beata Megyesi, Eva Pettersson


Abstract
In historical encrypted sources we can find encrypted text sequences, also called ciphertext, as well as non-encrypted cleartexts written in a known language. While most cryptanalysis focuses on the decryption of ciphertext, cleartext is often overlooked, although it can give us important clues about the historical interpretation and contextualisation of the manuscript. In this paper, we investigate to what extent we can automatically distinguish cleartext from ciphertext in historical ciphers and to what extent we are able to identify its language. The problem is challenging because cleartext sequences in ciphers are often short, up to a few words, and appear in different languages due to historical code-switching. To identify the sequences and the language(s), we chose a rule-based approach and ran 7 different models using historical language models on various ciphertexts.
Anthology ID:
2022.lt4hala-1.1
Volume:
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Rachele Sprugnoli, Marco Passarotti
Venue:
LT4HALA
Publisher:
European Language Resources Association
Pages:
1–9
URL:
https://aclanthology.org/2022.lt4hala-1.1
Cite (ACL):
Maria-Elena Gambardella, Beata Megyesi, and Eva Pettersson. 2022. Identifying Cleartext in Historical Ciphers. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, pages 1–9, Marseille, France. European Language Resources Association.
Cite (Informal):
Identifying Cleartext in Historical Ciphers (Gambardella et al., LT4HALA 2022)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2022.lt4hala-1.1.pdf