% This must be in the first 5 lines to tell arXiv to use pdfLaTeX, which is strongly recommended.
\pdfoutput=1
% In particular, the hyperref package requires pdfLaTeX in order to break URLs across lines.

\documentclass[11pt]{article}

% Remove the "review" option to generate the final version.
\usepackage[]{EACL2023}

% Standard package includes
\usepackage{times}
\usepackage{latexsym}

% For proper rendering and hyphenation of words containing Latin characters (including in bib files)
\usepackage[T1]{fontenc}
% For Vietnamese characters
% \usepackage[T5]{fontenc}
% See https://www.latex-project.org/help/documentation/encguide.pdf for other character sets

% This assumes your files are encoded as UTF8
\usepackage[utf8]{inputenc}

% This is not strictly necessary, and may be commented out.
% However, it will improve the layout of the manuscript,
% and will typically save some space.
\usepackage{microtype}

% This is also not strictly necessary, and may be commented out.
% However, it will improve the aesthetics of text in
% the typewriter font.
\usepackage{inconsolata}

% For horizontal table lines
\usepackage{booktabs}

% For table 1 extra formatting
\usepackage{multirow}


% If the title and author information does not fit in the area allocated, uncomment the following
%
%\setlength\titlebox{<dim>}
%
% and set <dim> to something 5cm or larger.

\title{Two Approaches to Diachronic Normalization of Polish Texts}

% Author information can be set in various styles:
% For several authors from the same institution:
% \author{Author 1 \and ... \and Author n \\
%         Address line \\ ... \\ Address line}
% if the names do not fit well on one line use
%         Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \And  ... \And
%         Author n \\ Address line \\ ... \\ Address line}
% To start a separate ``row'' of authors use \AND, as in
% \author{Author 1 \\ Address line \\  ... \\ Address line
%         \AND
%         Author 2 \\ Address line \\ ... \\ Address line \And
%         Author 3 \\ Address line \\ ... \\ Address line}

 %\author{First Author \\
   %Affiliation / Address line 1 \\
   %Affiliation / Address line 2 \\
   %Affiliation / Address line 3 \\
   %\texttt{email@domain} \\\And
   %Second Author \\
   %Affiliation / Address line 1 \\
   %Affiliation / Address line 2 \\
   %Affiliation / Address line 3 \\
   %\texttt{email@domain} \\}
   
\author{Kacper Dudzic \and
        Filip Graliński \and 
        Krzysztof Jassem \and 
        Marek Kubis \and 
        Piotr Wierzchoń \\ 
        Adam Mickiewicz University, Poznań, Poland \\      \texttt{\{firstname.lastname\}@amu.edu.pl}}
        
\begin{document}
\maketitle
\begin{abstract}
This paper discusses two approaches to the diachronic normalization of Polish texts: 
a rule-based solution that relies on a set of handcrafted patterns, and a neural normalization model based on the text-to-text transfer transformer architecture.
The training and evaluation data prepared for the task are discussed in detail, along with experiments conducted to compare the proposed normalization solutions. A quantitative and qualitative analysis is made. It is shown that at the current stage of inquiry into the problem, the rule-based solution outperforms the neural one on 3 out of 4 variants of the prepared dataset, although in practice both approaches have distinct advantages and disadvantages.
\end{abstract}

\section{Introduction}

This paper discusses two solutions to the problem of diachronic normalization, that is, the task
of determining contemporary spelling for a given historical text. Diachronic normalization may concern the writing of individual words, punctuation, hyphenation, or separation of tokens. We believe that the methods described in this paper may be useful for linguistic research on historical texts. A practical use case for our work is to facilitate full-text search in historical texts -- a query written in contemporary spelling may trigger a search for historical variants through the use of reversed-order diachronic normalization.

Similar experiments, concerning text normalization in a text-to-speech synthesis system,
were described in \cite{DBLP:journals/corr/SproatJ16}. Those authors claim that
text normalization remains one of the few tasks in the field of natural language processing where handcrafted rules may yield better results than machine learning. This is due to the following reasons:
\begin{itemize}
\item
Lack of training data; there is no economic motivation for creating
training data for text normalization -- unlike machine translation, for example, for which training data are created ``naturally'';
\item
Low data density of interesting cases, i.e. words that should be somehow changed -- unlike,
for example, phonemic transcription, where all words are converted to a new representation;
\item
Standard methods of evaluation which do not reward trivial cases (copying of input words), thus favoring human labor.
\end{itemize}

In our experiments, we compare the results of a rule-based approach with one based on machine learning. The rule-based approach relies on a set of handcrafted rules to normalize text.
In the ML approach, we train a supervised normalization model on the basis of
a corpus of Polish books for which both historical and current spellings are available.


\section{Related work}

The first attempts at rule-based diachronic normalization
used for historical text in English were described by \citet{rayson07}
and \citet{Baron09}. Similar
studies were conducted for German \cite{Archer06}. There, context rules operated at the level of letters instead of words. The normalization rules may be derived
from corpora, as \citet{Bollman11} showed for German. Diachronic normalization
may be also performed using a noisy channel
model, as described by \citet{oravecz10} using the example
of Old Hungarian texts. Research on diachronic
normalization has also been conducted for Portuguese \cite{reynaert12}, Swedish \cite{petterson12}, Slovene \cite{Scherrer13}, Spanish \cite{porta13},
 and Basque \cite{Etxeberia16}.


\citet{bollmann-2019-large} surveys historical spelling normalization methods for eight languages. 
He reports word-level accuracy for the evaluated systems.
He claims that using CER is not justified, because it strongly correlates with WER for systems showing reasonable accuracy.
%
\citet{bollmann-sogaard-2016-improving} use bi-directional LSTMs and multi-task learning to normalize texts in Early New High German. 
Their dataset consists of 44 texts from the Aselm corpus.
The model presented is evaluated with respect to word-level accuracy.
%
\citet{robertson-goldwater-2018-evaluating} discuss the problem of evaluating historical text normalization systems.
They emphasize the necessity of reporting accuracy for unobserved tokens and recommend confronting the normalization systems with a simple baseline that memorizes training samples.

\citet{jassem17pros,jassem18automatic} present an automatic method for diachronic normalization of Polish texts.
The proposed method uses a formal language to model diachronic changes.
\citet{GralinskiMining2020} introduce a method for finding spelling variants in a diachronic corpus using word2vec.

\section{Data}

Training and evaluation of a diachronic normalizer requires a corpus of texts that preserve
historical spelling along with their contemporized counterparts.
As our aim is the normalization of Polish prose, we decided to collect texts for our corpus
from two sources.
Texts that preserve historical spelling were drawn from the Polish edition of the Wikisource
project \cite{wikisource23}, which provides proof-read transcriptions of printed books that have fallen into the public domain,
encoded in the MediaWiki format.
For contemporized texts, we used Wolne Lektury \cite{wolnelektury23}, a digital library that aims to deliver
new editions of school readers, free of charge.
Although both sources encompass a wide variety of texts, ranging from poems and works of philosophy to dictionaries and historical
documents, we narrowed our attention to novels,
to facilitate the process of matching the original texts from Wikisource to their contemporized versions
in Wolne Lektury with the use of metadata information available for novels in both sources.
%
We initially sourced 308 novels from Wikisource and 279 from Wolne Lektury.


\subsection{Preprocessing}

All of the texts then underwent preprocessing.
First, we split the texts into paragraphs, with the use of markup information preserved in XML files sourced from Wolne Lektury, and MediaWiki content collected from Wikisource.
Next, regular expressions were used to remove leftover markup information, such as in-text metadata, formatting, or HTML tags, and to normalize some atypical characters. Accordingly, diacritical marks were removed from letter characters not belonging to the Polish alphabet, and non-ASCII variants of standard letter characters of the Latin alphabet were replaced by their ASCII counterparts. Finally, the same method was used to remove dialogue-specific text formatting and punctuation in paragraphs consisting of dialogue utterances, such as quotation dashes or character cues.

\subsection{Alignment}

To create aligned paragraph data, we first automatically matched all editions of novels existing across both data sources using fuzzy information similarity for author and title metadata. We then narrowed the matches to those that contained at least one edition in each of the sources.

Next, for each match of all editions of a novel, the oldest edition from Wikisource and the most recent edition from Wolne Lektury were identified using metadata information. Subsequently, the text paragraphs of both editions were extracted and aligned using the Hunalign tool, version 1.1 \cite{varga2005parallel}. Specifically, it was used to automatically create paragraph pairs consisting of a given text fragment with historical spelling from the oldest edition of a novel and the same text fragment but with contemporized spelling found in its newer edition, optionally automatically joining or splitting paragraphs where it was applicable. The paragraph alignment quality metric returned by Hunalign was consulted to provide additional filtering. The average alignment quality score across the entire text contents for each edition pair was used to identify and discard very low-scoring edition pairs, which turned out to be Polish translations of foreign novels made by different translators. In turn, per-paragraph alignment quality scores below 1.0 were used as an indicator to discard singular misaligned paragraphs.

\subsection{Dataset creation}
After completing all of the above steps and performing deduplication at the very end, we obtained a final corpus of 248,645 paragraph pairs originating from 87 eligible pairs of matched novel editions. Four dataset variants were created with this as the basis. All variants involve a training and test split, but they differ in the following two respects:

\begin{description}
%
\item[Pruning] was either applied or not. \emph{Pruned} versions of the dataset are reduced in size by removing samples in which the paragraphs of the pair are identical. Applying pruning leads to a 64.83\% decrease in the number of samples, a 47.34\% decrease in the number of words, and a 47.23\% decrease in the number of characters.
% 
\item[Separation] of novels prior to the train/test split was either performed or not.
In \emph{separated} variants of the dataset, train and test sets are created from separate pools of novels with no overlap, so that all paragraphs from a given novel are contained in only one of the sets.
Four novels were sampled from each of the quartiles determined with respect to the number of paragraphs contained in the corpus, to guarantee that each data subset contained a balanced volume of text.
In the case of \emph{non-separated} variants, the paragraphs are randomly sampled from the entire set of novels following the standard $80\%/20\%$ sampling ratio for train/test splits.


\end{description}


\begin{table*}
    \centering
    \begin{tabular}{llrrrr}
    \toprule
        \multirow{2}{*}{\textbf{Pruning}} & \multirow{2}{*}{\textbf{Separation}} & \multicolumn{2}{c}{\textbf{Split samples}} & \multirow{2}{*}{\textbf{Characters}} & \multirow{2}{*}{\textbf{Words}} \\
    \cmidrule(lr){3-4}
        & & \footnotesize{Train} & \footnotesize{Test} & & \\
    \midrule
        No  & No  & 198,916 & 49,729 & 92,306,901 & 14,438,223 \\
        Yes & No  & 69,952  & 17,488 & 48,710,393 & 7,603,573  \\
        No  & Yes & 199,004 & 49,641 & 92,306,901 & 14,438,223 \\
        Yes & Yes & 63,921  & 23,519 & 48,710,393 & 7,603,573  \\
    \bottomrule
    \end{tabular}
    \caption{Dataset statistics}
    \label{tab:dataset}
\end{table*}

\begin{table*}
    \centering
    \begin{tabular}{lllrr}
    \toprule
        \textbf{Method} & \textbf{Pruning} & \textbf{Separation} & \textbf{CER} & \textbf{WER} \\
    \midrule
        Transducers & No  & No   & \textbf{0.0164} & \textbf{0.0466} \\
        Neural & No  & No   & 0.0488 & 0.0654 \\
    \midrule
        Transducers & Yes & No   & \textbf{0.0319} & \textbf{0.0827} \\
        Neural & Yes & No   & 0.0728 & 0.1011 \\
    \midrule
        Transducers & No  & Yes  & \textbf{0.0182} & \textbf{0.0560} \\
        Neural & No  & Yes  & 0.0632 & 0.0932 \\
    \midrule
        Transducers & Yes & Yes  & \textbf{0.0281} & 0.0844 \\
        Neural & Yes & Yes  & 0.0398 & \textbf{0.0737} \\
    \bottomrule
    \end{tabular}
    \caption{Evaluation results}
    \label{tab:results}
\end{table*}


\section{Experiments}
\subsection{Rule-based model}
\label{sec:rule-based-model}
Our first solution to the problem of diachronic normalization relies on a set of deterministic rules.
Henceforth, we will refer to this solution as \textit{Transducers}.
The rules were handcrafted initially and then adjusted
semi-automatically. 
%
They were created mostly based on the expert literature describing changes in the Polish spelling system and by looking at a list of similar words having close embeddings. For most of the work on the rules, datasets for supervised learning were not consulted.
%
Originally, the rules were written using the Thrax
language \cite{tai11} for defining transducer grammars,
but more recently have been rewritten into a Java code base with
normalization rules encoded using regular expressions.
For instance, the rule:

\begin{verbatim}
Rule(
    "([cs]|(?:\\A|(?<![cdsr]))z)
     y([aąeęiou])",
    "$1j$2")
\end{verbatim}

\noindent handles normalization of \textit{y} into \textit{i} in some
circumstances (e.g. \textit{decyzya} into \textit{decyzja}).
%
The decision to switch to Java was motivated by the fact that such a
module can be easily incorporated, as a plugin, into Java-based
open-source search engines (Lucene and Solr).
%
When writing the rules, a conservative approach was taken: a rule was
added only when the probability of unwanted changes to texts was very
low.
%
Apart from regular expressions, the rule-based solution uses a
dictionary of transformations for specific words and dictionaries of
exceptions, based on the ideas outlined in \cite{GralinskiMining2020}.
%
The \textit{Transducers} module also handles some OCR errors, but the coverage
is rather low (only high-precision rules were applied).

Some further examples of the rules used are included in Appendix~\ref{sec:appendix}.

\subsection{Neural normalization models}
\label{sec:neural-models}
Diachronic normalization is an example of a language processing task that accepts text at the input and returns text at the output.
Therefore, we decided to use the
text-to-text transfer transformer architecture \cite[T5,][]{raffel20exploring} as a basis for
our supervised normalization models.
%
Initial weights were taken from the pre-trained plT5 model \cite{chrabrowa-etal-2022-evaluation},
an encoder-decoder model that follows the T5 architecture. The
plT5 model was initialized from its multilingual counterpart \cite[mT5,][]{xue-etal-2021-mt5} and further trained on Polish language corpora. It achieves better performance than mT5 on Polish language benchmark tasks with a smaller number of parameters.
For our experiments, we used the largest variant of this model available at Hugging Face.\footnote{\url{https://huggingface.co/allegro/plt5-large}}

We finetuned four neural diachronic normalization models, with one model for each variant of our
dataset.
The models were trained for three epochs, using Adam as the optimizer and a learning rate of 5e-05 with a linear scheduler. The batch size was kept at 1 due to the memory limitations of the GPU used for the experiments. Maximum input and output sequence token lengths followed the T5 model family's default of 512. Longer input sequences were split into chunks of maximum length, processed separately, and then joined.

Table~\ref{tab:results} reports the results of the evaluation of the neural normalization models,
and compares them with the rule-based model. One may observe that the rule-based model is
a strong baseline for the task, outperforming the neural models with respect to
character error rate (CER) and word error rate (WER). However, the supervised model surpasses the
rule-based solution in the case of a test set that consists of a separate set of novels (\emph{Separation=Yes})
and excludes samples that should remain unmodified in the
normalization process (\emph{Pruning=Yes}).




\section{Discussion}

After performing a qualitative analysis of the results obtained using rule-based and neural normalization models, we observed for the neural networks: (1) flexibility in context interpretation, i.e., the ability to adapt to various contexts and understand linguistic nuances; (2) recognition of irregular patterns, i.e., the ability to identify and process non-standard and complex language forms; (3) context-based changes, i.e., considering a broad context, which can lead to changes that go beyond simple spelling rules. 
%
On the other hand, for rule-based normalization, it was noted that: (1) relying on specific, defined rules, i.e., focusing on the strict application of established spelling rules, \emph{Transducers} are less flexible in interpretation, meaning that they have limited abilities to cope with irregularities and linguistic nuances; (2) \emph{Transducers} follow a literal interpretation of rules, which may not take into account the full context. 
%
The neural approach effectively normalizes examples of former single-word spelling, especially for conjunctions: \emph{przyczem} $\to$ \emph{przy czym} (Eng. \emph{at the same time}), \emph{poczem} $\to$ \emph{po czym} (Eng. \emph{thereafter}), \emph{napewno} $\to$ \emph{na pewno} (Eng. \emph{certainly}), \emph{niema} $\to$ \emph{nie ma} (Eng. \emph{there is no}). 
%
The rule-based approach, in turn, aptly converts regular orthographic phenomena: \emph{egzystencya} $\to$ \emph{egzystencja} (Eng. \emph{existence}), \emph{jenerał} $\to$ \emph{generał} (Eng. \emph{general}), \emph{teorya} $\to$ \emph{teoria} (Eng. \emph{theory}). It also accurately transforms proper nouns: \emph{Anglja} $\to$ \emph{Anglia}, \emph{Marjetka} $\to$ \emph{Marietka}. 
%
The spelling changes -- from \emph{egzystencya} to \emph{egzystencja}, \emph{Anglja} to \emph{Anglia}, etc. -- were part of the Polish orthographic reform of 1936. 
This reform was aimed at simplifying and standardizing the Polish language's spelling. It introduced several changes, including the replacement of the letter 'y' with 'j' or 'i' in certain contexts, and the introduction of the letter 'j' in place of 'i' in some cases to better reflect pronunciation. 
This reform significantly influenced the modern Polish language, aligning it more closely with its phonetics.

\section{Conclusion}
This paper has discussed two approaches to the diachronic normalization of Polish texts. We presented \emph{Transducers}, a rule-based solution that relies on a set of deterministic, handcrafted rules, and a family of neural normalization models based on a text-to-text transfer transformer architecture. The experiments that we conducted showed that the rule-based approach is effective in the diachronic normalization task. However, the neural model surpassed the rule-based solution in the case of a test set that consists of a separate set of novels and excludes samples that should remain unchanged in the normalization process.

As the presented research is preliminary in nature, there are several promising directions to explore, which we are committed to doing in the near future. Among other ideas, we want to test the performance of hybrid solutions combining both approaches in distinct ways. We are also considering testing different model architectures and conducting further work on improving the quality of the training data used for the neural approach, as we believe it has the potential to eventually surpass the rule-based solution in most typical scenarios.

\section*{Limitations}
We restrict our attention to the diachronic normalization of Polish texts.
Generalizing the proposed methods to new languages will require, firstly, a new, handcrafted set of normalization rules being developed for the rule-based model presented in section~\ref{sec:rule-based-model}; and secondly, a parallel corpus of texts that encompass both historical and current spelling, for the neural normalization models discussed in section~\ref{sec:neural-models}.

\section*{Acknowledgments}
The diachronic normalization methods presented in this work have been developed as part of the ``Tools for text normalization and diachronic analysis'' module of the Dariah.lab\footnote{\url{https://lab.dariah.pl/en/project/about-project/}} infrastructure hosted at the Faculty of Mathematics and Computer Science of Adam Mickiewicz University in Poznań, Poland. Construction of Dariah.lab was the aim of the project: \textit{Digital Research Infrastructure for the Arts and Humanities DARIAH-PL}, funded by the Intelligent Development Operational Programme, Polish National Centre for Research and Development, ID: POIR.04.02.00-00-D006/20.
% EACL 2023 requires all submissions to have a section titled ``Limitations'', for discussing the limitations of the paper as a complement to the discussion of strengths in the main text. This section should occur after the conclusion, but before the references. It will not count towards the page limit.

% The discussion of limitations is mandatory. Papers without a limitation section will be desk-rejected without review.
% ARR-reviewed papers that did not include ``Limitations'' section in their prior submission, should submit a PDF with such a section together with their EACL 2023 submission.

% While we are open to different types of limitations, just mentioning that a set of results have been shown for English only probably does not reflect what we expect.
% Mentioning that the method works mostly for languages with limited morphology, like English, is a much better alternative.
% In addition, limitations such as low scalability to long text, the requirement of large GPU resources, or other things that inspire crucial further investigation are welcome.

% Entries for the entire Anthology, followed by custom entries
\bibliography{anthology,custom}
\bibliographystyle{acl_natbib}

\appendix

\section{Appendix}
\label{sec:appendix}

Here we present further examples of rules used in the rule-based diachronic normalization solution.

\begin{verbatim}
Rule("izk", "isk")
Rule("yja\\b", "ja")
Rule("(le|ó)dz\\Z", "$1c")
Rule("\\Aanti-?", "anty")
Rule("iemi\\Z", "imi")
Rule("emi\\Z", "ymi")
Rule(
    "(ąc|owan|yjn|owat|jsz|tyczn|logiczn)
    em\\Z",
    "$1ym")
Rule(
    "([dfglmnprt])[jy]([aąeęiou])",
    "$1i$2")
\end{verbatim}

\end{document}
