I present a novel approach to the determination of recurrent sound correspondences in bilingual wordlists.
The idea is to relate correspondences between sounds in wordlists to translational equivalences between words in bitexts (bilingual corpora).
My method induces models of sound correspondence that are similar to models developed for statistical machine translation.
The experiments show that the method is able to determine recurrent sound correspondences in bilingual wordlists in which less than 30% of the pairs are cognates.
By employing the discovered correspondences, the method can identify cognates with higher accuracy than the previously reported algorithms.
1 Introduction
Genetically related languages often exhibit recurrent sound correspondences (henceforth referred to simply as correspondences) in words with similar meaning.
For example, t:d, θ:t, n:n, and other known correspondences between English and Latin are demonstrated by the word pairs in Table 1.
Word pairs that contain such correspondences are called cognates, because they originate from the same protoform in the ancestor language.
Correspondences in cognates are preserved over time thanks to the regularity of sound changes, which normally apply to sounds in a given phonological context across all words in the language.
The determination of correspondences is the principal step of the comparative method of language reconstruction.
Not only does it provide evidence for the relatedness of languages, but it also makes it possible to distinguish cognates from loan words and chance resemblances.
However, because manual determination of correspondences is an extremely time-consuming process, it has yet to be accomplished for many proposed language families.
Table 1: Examples of English-Latin cognates exhibiting correspondences.
The corresponding phonemes shown in boldface originate from a single proto-phoneme.
A system able to perform this task automatically from unprocessed bilingual wordlists could be of great assistance to historical linguists.
The Reconstruction Engine (Lowe and Mazaudon, 1994), a set of programs designed to be an aid in language reconstruction, requires a set of correspondences to be provided beforehand.
The determination of correspondences is closely related to another task that has been much studied in computational linguistics, the identification of cognates.
Cognates have been employed for sentence and word alignment in bitexts (Simard et al., 1992; Melamed, 1999), improving statistical machine translation models (Al-Onaizan et al., 1999), and inducing translation lexicons (Koehn and Knight, 2001).
Some of the proposed cognate identification algorithms implicitly determine and employ correspondences (Tiedemann, 1999; Mann and Yarowsky, 2001).
Although it may not be immediately apparent, there is a strong similarity between the task of matching phonetic segments in a pair of cognate words, and the task of matching words in two sentences that are mutual translations (Figure 1).
Snow lies on the ground
Nix iacet in terra
Figure 1: The similarity of word alignment in bitexts and phoneme alignment between cognates.
The consistency with which a word in one language is translated into a word in another language is mirrored by the consistency of sound correspondences.
The former is due to the semantic relation of synonymy, while the latter follows from the principle of the regularity of sound change.
Thus, as already asserted by Guy (1994), it should be possible to use similar techniques for both tasks.
The primary objective of the method proposed in this paper is the automatic determination of correspondences in bilingual wordlists, such as the one in Table 1.
The method exploits the idea of relating correspondences in bilingual wordlists to translational equivalence associations in bitexts through the employment of models developed in the context of statistical machine translation. The second task addressed in this paper is the identification of cognates on the basis of the discovered correspondences.
The experiments to be described in Section 6 show that the method is capable of determining correspondences in bilingual wordlists in which less than 30% of the pairs are cognates, and outperforms comparable algorithms on cognate identification.
Although the experiments focus on bilingual wordlists, the approach presented in this paper could potentially be applied to other bitext-related tasks.
2 Related work
In a schematic description of the comparative method, the two steps that precede the determination of correspondences are the identification of cognate pairs (Kondrak, 2001), and their phonetic alignment (Kondrak, 2000).
Indeed, if a comprehensive set of correctly aligned cognate pairs is available, the correspondences could be extracted by simply following the alignment links.
Unfortunately, in order to make reliable judgments of cognation, it is necessary to know in advance what the correspondences are.
Historical linguists solve this apparent circularity by guessing a small number of likely cognates and refining the set of correspondences and cognates in an iterative fashion.
Guy (1994) outlines an algorithm for identifying cognates in bilingual wordlists which is based on correspondences.
The algorithm estimates the probability of phoneme correspondences by employing a variant of the χ² statistic on a contingency table, which indicates how often two phonemes co-occur in words of the same meaning.
The probabilities are then converted into the estimates of cognation by means of some experimentation-based heuristics.
The paper does not contain any evaluation on authentic language data, but Guy's program COGNATE, which implements the algorithm, is publicly available.
An experimental evaluation of COGNATE is described in Section 6.
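As a rough illustration of the kind of co-occurrence statistic mentioned above, the following sketch computes the standard Pearson χ² value for a 2×2 contingency table of phoneme co-occurrence counts. It is not Guy's variant, whose details are not given here; the function name and the four count arguments are assumptions introduced only for this example.

```python
def chi_square_2x2(both, only_x, only_y, neither):
    """Pearson chi-square for a 2x2 contingency table.

    both    -- word pairs in which phoneme x and phoneme y co-occur
    only_x  -- pairs containing x but not y
    only_y  -- pairs containing y but not x
    neither -- pairs containing neither phoneme
    (Hypothetical argument names; standard statistic, not Guy's variant.)
    """
    n = both + only_x + only_y + neither
    row_x, row_not_x = both + only_x, only_y + neither
    col_y, col_not_y = both + only_y, only_x + neither
    denom = row_x * row_not_x * col_y * col_not_y
    if denom == 0:
        # A degenerate table carries no evidence of association.
        return 0.0
    return n * (both * neither - only_x * only_y) ** 2 / denom
```

A high value suggests that the two phonemes co-occur more (or less) often than independence would predict; it says nothing by itself about whether the association is a genuine correspondence.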
Oakes (2000) describes a set of programs that together perform several steps of the comparative method, from the determination of correspondences in wordlists to the actual reconstruction of the proto-forms.
Word pairs are considered cognate if their edit distance is below a certain threshold.
The edit operations cover a number of sound-change categories.
Sound correspondences are deemed to be regular if they are found to occur more than once in the data.
The paper describes experimental results of running the programs on a set of wordlists representing four Indonesian languages, and compares those to the reconstructions found in the linguistic literature.
Section 6 contains an evaluation of one of the programs in the set, JAKARTA, on the cognate identification task.
3 Models of translational equivalence
In statistical machine translation, a translation model approximates the probability that two sentences are mutual translations by computing the product of the probabilities that each word in the target sentence is a translation of some source language word.
A model of translational equivalence that determines the word translation probabilities can be induced from bitexts.
The difficulty lies in the fact that the mapping, or alignment, of words between two parts of a bitext is not known in advance.
Algorithms for word alignment in bitexts aim at discovering word pairs that are mutual translations.
A straightforward approach is to estimate the likelihood that words are mutual translations by computing a similarity function based on a co-occurrence statistic, such as mutual information, Dice coefficient, or the χ² test.
The underlying assumption is that the association scores for different word pairs are independent of each other.
Melamed (2000) shows that the assumption of independence leads to invalid word associations, and proposes an algorithm for inducing models of translational equivalence that outperform the models that are based solely on co-occurrence counts.
His models employ the one-to-one assumption, which formalizes the observation that most words in bitexts are translated to a single word in the corresponding sentence.
The algorithm, which is related to the expectation-maximization (EM) algorithm, iteratively re-estimates the likelihood scores which represent the probability that two word types are mutual translations.
In the first step, the scores are initialized according to the G² statistic (Dunning, 1993).
Next, the likelihood scores are used to induce a set of one-to-one links between word tokens in the bitext.
The links are determined by a greedy competitive linking algorithm, which proceeds to link pairs that have the highest likelihood scores.
After the linking is completed, the link counts are used to re-estimate the likelihood scores, which in turn are applied to find a new set of links.
The process is repeated until the translation model converges to the desired degree.
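The linking step of this loop can be sketched as follows. This is a minimal illustration of greedy one-to-one linking under stated assumptions, not Melamed's implementation: null links and the re-estimation step are omitted, and the function name, the score dictionary, and the score values for the sentence pair of Figure 1 are all hypothetical.

```python
def competitive_link(source, target, score):
    """Greedy one-to-one linking: repeatedly link the best-scoring free pair.

    score maps (source_word, target_word) to a likelihood score;
    pairs missing from the dictionary are never linked.
    """
    candidates = sorted(
        ((score.get((u, v), float("-inf")), i, j)
         for i, u in enumerate(source)
         for j, v in enumerate(target)),
        key=lambda c: (-c[0], c[1], c[2]),  # best score first, ties by position
    )
    used_src, used_tgt, links = set(), set(), []
    for s, i, j in candidates:
        if s == float("-inf"):
            break  # only unscored pairs remain
        if i not in used_src and j not in used_tgt:
            links.append((source[i], target[j]))
            used_src.add(i)
            used_tgt.add(j)
    return links


# Hypothetical scores for the sentence pair shown in Figure 1.
scores = {
    ("snow", "nix"): 3.0,
    ("lies", "iacet"): 2.5,
    ("ground", "terra"): 2.2,
    ("on", "in"): 2.0,
    ("the", "terra"): 0.5,
}
links = competitive_link("snow lies on the ground".split(),
                         "nix iacet in terra".split(), scores)
```

In this example "the" stays unlinked, because "terra" has already been claimed by the higher-scoring pair; in a full system the resulting link counts would then feed the re-estimation of the scores.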
Melamed presents three translation-model estimation methods.
Method A re-estimates the likelihood scores as the logarithm of the probability of jointly generating the pair of words u and v:

$$\mathrm{score}_A(u, v) = \log \frac{\mathrm{links}(u, v)}{\sum_{u', v'} \mathrm{links}(u', v')}$$

where $\mathrm{links}(u, v)$ denotes the number of links induced between u and v. Note that the co-occurrence counts of u and v are not used for the re-estimation.
In Method B, an explicit noise model with auxiliary parameters $\lambda^+$ and $\lambda^-$ is constructed in order to improve the estimation of likelihood scores.
$\lambda^+$ is a probability that a link is induced between two co-occurring words that are mutual translations, while $\lambda^-$ is a probability that a link is induced between two co-occurring words that are not mutual translations.
Ideally, $\lambda^+$ should be close to one and $\lambda^-$ should be close to zero.
The actual values of the two parameters are calculated by the maximum likelihood estimation.
Let $\mathrm{cooc}(u, v)$ be the number of co-occurrences of u and v. The score function is defined as:

$$\mathrm{score}_B(u, v) = \log \frac{B(\mathrm{links}(u, v) \mid \mathrm{cooc}(u, v), \lambda^+)}{B(\mathrm{links}(u, v) \mid \mathrm{cooc}(u, v), \lambda^-)}$$

where $B(k \mid n, p)$ denotes the probability of k being generated from a binomial distribution with parameters n and p.
In Method C, bitext tokens are divided into classes, such as content words, function words, punctuation, etc., with the aim of producing more accurate translation models.
The auxiliary parameters are estimated separately for each class:

$$\mathrm{score}_C(u, v \mid Z = \mathrm{class}(u, v)) = \log \frac{B(\mathrm{links}(u, v) \mid \mathrm{cooc}(u, v), \lambda_Z^+)}{B(\mathrm{links}(u, v) \mid \mathrm{cooc}(u, v), \lambda_Z^-)}$$
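The score functions of Methods A and B can be sketched in a few lines. The sketch assumes that the link and co-occurrence counts are already available and that both noise-model parameters lie strictly between 0 and 1; the function and argument names are hypothetical, and the maximum likelihood estimation of the two parameters is not shown.

```python
import math


def binom_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def score_a(links_uv, total_links):
    """Method A: log of the relative frequency of the link (u, v)."""
    return math.log(links_uv / total_links)


def score_b(links_uv, cooc_uv, lam_plus, lam_minus):
    """Method B: log-ratio of the binomial likelihoods of the observed
    link count under the 'true pair' and 'false pair' link rates."""
    return math.log(binom_pmf(links_uv, cooc_uv, lam_plus)
                    / binom_pmf(links_uv, cooc_uv, lam_minus))
```

A pair that is linked every time it co-occurs receives a large positive Method B score, while a pair that co-occurs often but is never linked receives a negative one. Method C would apply the same function with parameters estimated separately for each class.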
4 Models of sound correspondence
Thanks to its generality and symmetry, Melamed's parameter estimation process can be adapted to the problem of determining correspondences.
The main idea is to induce a model of sound correspondence in a bilingual wordlist, in the same way as one induces a model of translational equivalence among words in a parallel corpus.
After the model has converged, phoneme pairs with the highest likelihood scores represent the most likely correspondences.
While there are strong similarities between the task of estimating translational equivalence of words and the task of determining recurrent correspondences of sounds, a number of important modifications to Melamed's original algorithm are necessary in order to make it applicable to the latter task.
The modifications include the method of finding a good alignment, the handling of null links, and the method of computing the alignment score.
For the task at hand, I employ a different method of aligning the segments in two corresponding sequences.
In sentence translation, the alignment links frequently cross and it is not unusual for two words in different parts of sentences to correspond.
In contrast, the processes that lead to link intersection in diachronic phonology, such as metathesis, are quite sporadic.
The introduction of the no-crossing-links constraint on alignments not only leads to a dramatic reduction of the search space, but also makes it possible to replace the approximate competitive-linking algorithm of Melamed with a variant of the well-known dynamic programming algorithm (Wagner and Fischer, 1974; Kondrak, 2000), which computes the optimal alignment between two strings in polynomial time.
Null links in statistical machine translation are induced for words on one side of the bitext that have no clear counterparts on the other side of the bitext.
Melamed's algorithm explicitly calculates the likelihood scores of null links for every word type occurring in a bitext.
In diachronic phonology, phonological processes that lead to insertion or deletion of segments usually operate on individual words rather than on particular sounds across the language.
Therefore, I model insertion and deletion by employing a constant indel penalty for unlinked segments.
The alignment score between two words is computed by summing the number of induced links, and applying an indel penalty for each unlinked segment, with the exception of the segments beyond the rightmost link.
The exception reflects the relative instability of word endings in the course of linguistic evolution.
In order to avoid inducing links that are unlikely to represent recurrent sound correspondences, only pairs whose likelihood scores exceed a set threshold are linked.
All correspondences above the threshold are considered to be equally valid.
In the cases where more than one best alignment is found, each link is assigned a weight that is its average over the entire set of best alignments (for example, a link present in only one of two competing alignments receives the weight of 0.5).
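The scoring scheme described in this section can be sketched as a small dynamic program. This is an illustration under stated assumptions rather than the CORDI implementation: each link contributes +1, each unlinked segment before the rightmost link costs a constant `indel`, segments after the rightmost link are free, and an alignment with no links scores 0. The set `linkable` stands for the phoneme pairs whose likelihood scores exceed the threshold; the function name, the example strings, and the default penalty are hypothetical. The sketch returns only the best score and does not compute the fractional link weights used when several best alignments exist.

```python
def alignment_score(x, y, linkable, indel=0.15):
    """Best no-crossing alignment score between segment sequences x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    # s[i][j]: best score for x[:i] and y[:j] with every segment accounted for.
    s = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        s[i][0] = s[i - 1][0] - indel
    for j in range(1, cols):
        s[0][j] = s[0][j - 1] - indel
    best = 0.0  # score of the empty alignment (no links at all)
    for i in range(1, rows):
        for j in range(1, cols):
            # Leave x[i-1] or y[j-1] unlinked.
            cell = max(s[i - 1][j], s[i][j - 1]) - indel
            # Link the two segments if the pair is above the threshold.
            if (x[i - 1], y[j - 1]) in linkable:
                cell = max(cell, s[i - 1][j - 1] + 1.0)
            s[i][j] = cell
            # Taking the maximum over all cells makes everything after
            # the rightmost link penalty-free.
            best = max(best, cell)
    return best
```

For instance, with `linkable = {("t", "d"), ("n", "n")}`, the made-up strings "tan" and "dun" receive two links and two penalized unlinked vowels, for a score of 2 − 2 × 0.15 = 1.7, whereas "tax" and "d" score 1.0 because the segments following the single link are not penalized.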
5 Implementation
The method described above has been implemented as a C++ program, named CORDI, which will soon be made publicly available.
The program takes as input a bilingual wordlist and produces an ordered list of correspondences.
A model for a 200-pair list usually converges after 3-5 iterations, which takes only a few seconds on a Sparc workstation.
The user can choose between methods A, B, and C, described in Section 3, and an additional Method D. In Method C, phonemes are divided into two classes: non-syllabic (consonants and glides), and syllabic (vowels); links between phonemes belonging to different classes are not induced.
Method D differs from Method C in that the syllabic phonemes do not participate in any links.
Adjustable parameters include the indel penalty ratio d and the minimum-strength correspondence threshold t. The parameter d fixes the ratio between the negative indel weight and the positive weight assigned to every induced link.
(A lower ratio causes the program to be more adventurous in positing sparse links.)
The parameter t controls the tradeoff between reliability and the number of links.
In Method A, the value of t is the minimum number of phoneme links that have to be induced for the correspondence to be valid.
In methods B, C, and D, the value of t implies a likelihood score threshold of $t \cdot \log(\lambda^+ / \lambda^-)$, which is a score achieved by a pair of phonemes that have t links out of t co-occurrences.
In the experiments reported in Section 6, d was set to 0.15, and t was set to 1 (sufficient to reject all non-recurring correspondences).
In Method D, where the lack of vowel links causes the linking constraints to be weaker, a higher value of t = 3 was used.
These parameter values were optimized on the development set described below.
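The form of the likelihood score threshold used in methods B, C, and D can be checked against the Method B score function of Section 3, writing $\lambda^+$ and $\lambda^-$ for the two noise-model parameters. For a phoneme pair with t links out of t co-occurrences, the binomial probability reduces to $B(t \mid t, p) = p^t$, so that:

$$\mathrm{score}_B = \log \frac{B(t \mid t, \lambda^+)}{B(t \mid t, \lambda^-)} = \log \frac{(\lambda^+)^t}{(\lambda^-)^t} = t \cdot \log \frac{\lambda^+}{\lambda^-}$$

Since $\lambda^+$ is expected to exceed $\lambda^-$, the threshold grows linearly with t.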
6 Evaluation
The experiments in this section were performed using a well-known list of 200 basic meanings that are considered universal and relatively resistant to lexical replacement (Swadesh, 1952).
The Swadesh 200-word lists are widely used in linguistics and have been compiled for a large number of languages.
The development set consisted of three 200-word list pairs adapted from the Comparative Indoeuropean Data Corpus (Dyen et al., 1992).
The corpus contains the 200-word lists for a number of Indoeuropean languages together with cognation judgments made by the renowned historical linguist Isidore Dyen.
Unfortunately, the words are represented in the Roman alphabet without any diacritical marks, which makes them unsuitable for automatic phonetic analysis.
The Polish-Russian, Spanish-Romanian, and Italian-Serbocroatian pairs were selected because they represent three different levels of relatedness (73.5%, 58.5%, and 25.3% of cognate pairs, respectively), and also because they have relatively transparent grapheme-to-phoneme conversion rules.
They were transcribed into a phonetic notation by means of Perl scripts and then stemmed and corrected manually.
The test set consisted of five 200-word lists representing English, German, French, Latin, and Albanian, compiled by Kessler (2001). As the lists contain rich phonetic and morphological information, the stemmed forms were automatically converted from the XML format with virtually no extra processing.
The word pairs classified by Kessler as doubtful cognates were assumed to be unrelated.
6.2 Determination of correspondences in word pairs
Experiments show that CORDI has little difficulty in determining correspondences given a set of cognate pairs (Kondrak, 2002). However, the assumption that a set of identified cognates is already available as the input for the program is not very plausible.
The very existence of a reliable set of cognate pairs implies that the languages in question have already been thoroughly analyzed and that the sound correspondences are known.
A more realistic input requirement is a list of word pairs from two languages such that the corresponding words have the same, well-defined meaning.
Determining correspondences in a list of synonyms is clearly a more challenging task than extracting them from a list of reliable cognates because the non-cognate pairs introduce noise into the data.
Note that Melamed's original algorithm is designed to operate on aligned sentences that are guaranteed to be mutual translations.
Table 2: English-Latin correspondences discovered by CORDI in noisy synonym data.
In order to test CORDI's ability to determine correspondences in noisy data, Method D was applied to the 200-word lists for English and Latin.
Only 29% of word pairs are actually cognate; the remaining 71% of the pairs are unrelated lexemes.
The top ten correspondences discovered by the program are shown in Table 2.
Remarkably, all but one are valid.
In contrast, only four of the top ten phoneme matchings picked up by the χ² statistic are valid correspondences (the validity judgements are my own).
6.3 Identification of cognates in word pairs
The quality of correspondences produced by CORDI is difficult to validate, quantify, and compare with the results of alternative approaches.
However, it is possible to evaluate the correspondences indirectly by using them to identify cognates.
The likelihood of cognation of a pair of words increases with the number of correspondences that they contain.
Since CORDI explicitly posits correspondence links between words, the likelihood of cognation can be estimated by simply dividing the number of induced links by the length of the words that are being compared.
A minimum-length parameter can be set in order to avoid computing cognation estimates for very short words, which tend to be unreliable.
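A minimal sketch of this estimate is given below. The text does not specify how the lengths of the two words are combined, so the sketch assumes normalization by the length of the longer word and skips a pair when either word is shorter than the minimum length; both choices, like the function and parameter names, are assumptions made only for illustration.

```python
def cognation_estimate(num_links, word_x, word_y, min_length=1):
    """Estimate the likelihood of cognation as links per segment.

    num_links may be fractional when link weights are averaged over
    several best alignments. Returns None for pairs that are too short.
    Assumption: normalize by the longer of the two words.
    """
    if min(len(word_x), len(word_y)) < min_length:
        return None  # too short to yield a reliable estimate
    return num_links / max(len(word_x), len(word_y))
```

Under the one-to-one linking assumption the number of links cannot exceed the length of the shorter word, so the estimate stays between 0 and 1.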
Table 3: An example ranking of cognate pairs (columns: word pair, cognate?).
The evaluation method for cognate identification algorithms adopted in this section is to apply them to a bilingual wordlist and order the pairs according to their scores (refer to Table 3).
The ranking is then evaluated against a gold standard by computing the n-point average precision, a generalization of the 11-point average precision, where n is the total number of cognate pairs in the list.
The n-point average precision is obtained by taking the average of n precision values that are calculated for each point in the list where we find a cognate pair: $p_i = i / r_i$, $i = 1, \ldots, n$, where i is the number of the cognate pair counting from the top of the list produced by the algorithm, and $r_i$ is the rank of this cognate pair among all word pairs.
The n-point precision of the ranking in Table 3 is (1.0 + 0.66)/2 = 0.83.
The expected n-point precision of a program that randomly orders word pairs is close to the proportion of cognate pairs in the list.
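The measure can be sketched directly from its definition. The input is assumed to be the gold-standard cognation judgments listed in the order produced by the algorithm; the function name and the convention of returning 0.0 for a list with no cognates are assumptions of this sketch.

```python
def n_point_average_precision(is_cognate_ranked):
    """Average of i / r_i over the ranks r_i at which cognate pairs occur."""
    precisions = []
    found = 0
    for rank, is_cognate in enumerate(is_cognate_ranked, start=1):
        if is_cognate:
            found += 1                        # i: cognates seen so far
            precisions.append(found / rank)   # precision at this point
    if not precisions:
        return 0.0  # no cognates in the list (convention of this sketch)
    return sum(precisions) / len(precisions)
```

A ranking whose two cognate pairs sit at ranks 1 and 3 gives (1/1 + 2/3)/2, or roughly 0.83, which agrees with the value computed in the text for Table 3.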
Table 4: Average cognate identification precision on the development set for various methods.
Table 5: Average cognate identification precision on the test set for various methods.
Table 4 compares the average precision achieved by methods A, B, C, and D on the development set.
The cognation judgments from the Comparative Indoeuropean Data Corpus served as the gold standard.
All four methods proposed in this paper as well as other cognate identification programs were uniformly applied to the test set representing five Indoeuropean languages.
Apart from the English-German and the French-Latin pairs, all remaining language pairs are quite challenging for a cognate identification program.
In many cases, the gold-standard cognate judgments distill the findings of decades of linguistic research.
In fact, for some of those pairs, Kessler finds it difficult to show by statistical techniques that the surface regularities are unlikely to be due to chance.
Nevertheless, in order to avoid making subjective choices, CORDI was evaluated on all possible language pairs in Kessler's set.
Two programs mentioned in Section 2, COGNATE and JAKARTA, were also applied to the test set.
The source code of JAKARTA was obtained directly from the author and slightly modified according to his instructions in order to make it recognize additional phonemes.
Word pairs were ordered according to the confidence scores in the case of COGNATE, and according to the edit distances in the case of JAKARTA.
Since the other two programs do not impose any length constraints on words, the minimum-length parameter was not used in the experiments described here.
The results on the test set are shown in Table 5.
The best result for each language pair is underlined.
The performance of COGNATE and JAKARTA is quite similar, even though they represent two radically different approaches to cognate identification.
On average, methods B, C, and D outperform both comparison programs.
On closely related languages, Method B, with its relatively unconstrained linking, achieves the highest precision.
Method D, which considers only consonants, is the best on fairly remote languages, where vowel correspondences tend to be weak.
The only exception is the extremely difficult Albanian-English pair, where the relative ordering of methods seems to be accidental.
As expected, Method A is outperformed by methods that employ an explicit noise model.
However, in spite of its extra complexity, Method C is not consistently better than Method B, perhaps because of its inability to detect important vowel-consonant correspondences, such as the ones between French nasal vowels and Latin /n/.
7 Conclusions and future work
I have presented a novel approach to the determination of correspondences in bilingual wordlists.
The results of experiments indicate that the approach is robust enough to handle a substantial amount of noise that is introduced by unrelated word pairs.
CORDI does well even when the number of non-cognate pairs is more than double the number of cognate pairs.
When tested on the cognate-identification task, CORDI achieves substantially higher precision than comparable programs.
The correspondences are explicitly posited, which means that, unlike in some statistical approaches, they can be verified by examining individual cognate pairs.
In contrast with approaches that assume a rigid alignment based on the syllabic structure, the models presented here can link phonemes in any word position.
Currently, I am working on the incorporation of complex correspondences into the cognate identification algorithm by employing Melamed's (1997) algorithm for discovering non-compositional compounds in parallel data.
Such an extension would overcome the limitation of the one-to-one model, in which links are induced only between individual phonemes.
Other possible extensions include taking into account the phonological context of correspondences, combining the correspondence-based approach with phonetic-based approaches, and identifying correspondences and cognates directly in dictionary-type data.
The results presented here demonstrate that the techniques developed in the context of statistical machine translation can be successfully applied to a problem in diachronic phonology.
The transfer of methods and insights should also be possible in the other direction.
Acknowledgments
Thanks to Graeme Hirst, Radford Neal, and Suzanne Stevenson for helpful comments, to Michael Oakes for assistance with JAKARTA, and to Gemma Enriquez for helping with the experimental evaluation of COGNATE.
This research was supported by the Natural Sciences and Engineering Research Council of Canada.
