Distributed together with the paper "Automatic Detection of Cognates Using Orthographic Alignment".

This program implements the algorithm we proposed in our paper for detecting pairs of cognates.

The code requires Java 1.7 and the Weka toolkit jars. To compile and run the source files, first create the folder 'demo\lib' and add the required Weka jar file to it, then copy the input files from the 'data\txt' folder (compressed in the 'data.zip' archive) into the 'demo\files' folder. We provide a sample script for running the code on Windows: rename 'demo\run.bat.txt' to 'demo\run.bat' and execute it.

The program handles the feature extraction for our experiments. Using lists of cognates and non-cognates as input, we extract two types of orthography-based features:

	1) orthographic similarity metrics (edit distance, XDice, longest common subsequence ratio)
	2) orthographic alignment (the Needleman-Wunsch algorithm for global sequence alignment)
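
To make the two feature types concrete, here is a minimal sketch in Java. It is not the code shipped in this package: the class name 'OrthographicFeatures', the method names, and the scoring parameters (match, mismatch, gap) are assumptions chosen for illustration, and the scoring scheme used in the paper may differ. XDice is omitted for brevity.

```java
public class OrthographicFeatures {

    // Levenshtein edit distance with unit costs for insertion, deletion, substitution.
    public static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[a.length()][b.length()];
    }

    // Longest common subsequence ratio: |LCS(a, b)| / max(|a|, |b|).
    public static double lcsRatio(String a, String b) {
        int[][] l = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                l[i][j] = a.charAt(i - 1) == b.charAt(j - 1)
                        ? l[i - 1][j - 1] + 1
                        : Math.max(l[i - 1][j], l[i][j - 1]);
            }
        }
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : (double) l[a.length()][b.length()] / longest;
    }

    // Score of pairing two characters (assumed scheme: fixed match/mismatch values).
    private static int pair(char x, char y, int match, int mismatch) {
        return x == y ? match : mismatch;
    }

    // Needleman-Wunsch global alignment. Returns both words padded with '-' gaps.
    public static String[] align(String a, String b, int match, int mismatch, int gap) {
        int n = a.length(), m = b.length();
        int[][] s = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) s[i][0] = i * gap;
        for (int j = 1; j <= m; j++) s[0][j] = j * gap;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = s[i - 1][j - 1] + pair(a.charAt(i - 1), b.charAt(j - 1), match, mismatch);
                s[i][j] = Math.max(diag, Math.max(s[i - 1][j] + gap, s[i][j - 1] + gap));
            }
        }
        // Trace back from the bottom-right cell, preferring the diagonal move.
        StringBuilder x = new StringBuilder(), y = new StringBuilder();
        int i = n, j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && s[i][j] == s[i - 1][j - 1]
                    + pair(a.charAt(i - 1), b.charAt(j - 1), match, mismatch)) {
                x.append(a.charAt(--i));
                y.append(b.charAt(--j));
            } else if (i > 0 && s[i][j] == s[i - 1][j] + gap) {
                x.append(a.charAt(--i));
                y.append('-');
            } else {
                x.append('-');
                y.append(b.charAt(--j));
            }
        }
        return new String[] { x.reverse().toString(), y.reverse().toString() };
    }
}
```

With the assumed scores match = 1, mismatch = -1 and gap = -1, a call such as align("abc", "ac", 1, -1, -1) pads the shorter word with a gap, so both returned strings have the same length. Character n-grams can then be read from the aligned pair position by position.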
	
A demo input file is provided in the 'files' folder, together with the expected output (for 2-grams only). Update the file paths in the code as needed. The output of this code can be used as input for the Weka machine learning toolkit.
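
As an illustration of Weka-readable output, the sketch below serializes feature vectors in Weka's ARFF text format. It assumes the features are numeric and that each word pair carries a class label. The class name 'ArffSketch', the attribute names and the labels 'cognate' and 'noncognate' are hypothetical, and the files actually produced by this package may be laid out differently.

```java
import java.util.Locale;

public class ArffSketch {

    // Builds an ARFF document: a header declaring one numeric attribute per
    // feature plus a nominal class attribute, followed by one data row per pair.
    public static String toArff(String relation, String[] featureNames,
                                double[][] rows, String[] labels) {
        StringBuilder out = new StringBuilder();
        out.append("@relation ").append(relation).append("\n\n");
        for (String name : featureNames) {
            out.append("@attribute ").append(name).append(" numeric\n");
        }
        // Hypothetical label set; adjust to the labels used in the input lists.
        out.append("@attribute class {cognate,noncognate}\n\n");
        out.append("@data\n");
        for (int r = 0; r < rows.length; r++) {
            for (double v : rows[r]) {
                // Locale.ROOT keeps '.' as the decimal separator on every system.
                out.append(String.format(Locale.ROOT, "%.4f", v)).append(',');
            }
            out.append(labels[r]).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String arff = toArff("cognates",
                new String[] { "edit", "lcsr" },
                new double[][] { { 3.0, 0.5714 } },
                new String[] { "cognate" });
        System.out.print(arff);
    }
}
```

The resulting text can be saved with an '.arff' extension and opened in the Weka Explorer.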
 


