Distributed together with the paper "Automatic Detection of Cognates Using Orthographic Alignment"

We provide an archive of cognate pairs automatically extracted from the Romanian vocabulary of the dexonline machine-readable dictionary (http://dexonline.ro). We investigate four language pairs:

	1) Romanian - French
	2) Romanian - Italian
	3) Romanian - Portuguese
	4) Romanian - Spanish

The format of the data is:

	word1____word2____label
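
As an illustration, a line in this format could be read as in the minimal Python sketch below. It assumes the field separator is exactly four underscores, as in the template above, and that files are UTF-8 encoded. The names parse_pair and read_pairs, the sample line, and its label value are hypothetical and are not taken from the archive.

```python
SEP = "____"  # assumption: four underscores, as in the template above


def parse_pair(line):
    """Split one 'word1____word2____label' line into its three fields."""
    parts = line.rstrip("\n").split(SEP)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    word1, word2, label = parts
    return word1, word2, label


def read_pairs(path, encoding="utf-8"):
    """Yield (word1, word2, label) tuples from a file, skipping blank lines."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if line.strip():
                yield parse_pair(line)


# Made-up sample line; the label is kept as a string because its
# encoding is not described in this file.
sample = parse_pair("victorie____victoire____1")
```

The label is returned unconverted so that the sketch does not presuppose how cognates and non-cognates are encoded.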
	
Using lists of cognates and non-cognates as input, we extract two types of orthography-based features:

	1) orthographic similarity metrics (edit distance, XDice, longest common subsequence ratio) - located in the 'data\arff\metrics' folder
	2) orthographic alignment (Needleman-Wunsch algorithm for global sequence alignment) - located in the 'data\arff\grams' folder, which has two subfolders:
		i) strict - for a given n, we use n-grams as features
		ii) all - for a given n, we use i-gram features where i is in {1, 2, ..., n}
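
The Python sketch below illustrates two of the metrics named above and the difference between the strict and all n-gram settings. It is not the code used to build the archive: the edit distance is the plain, unnormalized Levenshtein distance, the ratio divides the longest common subsequence length by the length of the longer word, and the n-grams are taken from a raw string rather than from a Needleman-Wunsch alignment. XDice is omitted, and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs (unnormalized)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a, b):
    """LCS length divided by the length of the longer word."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def ngrams(s, n):
    """All contiguous character n-grams of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def strict_features(s, n):
    """'strict' setting: n-grams of size exactly n."""
    return ngrams(s, n)


def all_features(s, n):
    """'all' setting: i-grams for every i in {1, 2, ..., n}."""
    return [g for i in range(1, n + 1) for g in ngrams(s, i)]
```

For example, strict_features("abc", 2) yields the two bigrams of "abc", while all_features("abc", 2) yields its three unigrams followed by the same two bigrams.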

The ARFF files extracted from the lists of cognates and non-cognates are suitable as input for the Weka machine learning toolkit.
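
For readers unfamiliar with ARFF, the Python sketch below writes a small file in the general ARFF layout (a relation name, attribute declarations, then comma-separated data rows). The relation name, attribute names, class values, and the data row are hypothetical; the actual attributes in the archive's files may differ.

```python
def to_arff(relation, numeric_attrs, class_values, rows):
    """Render rows as ARFF text; the last item of each row is the class."""
    lines = [f"@relation {relation}", ""]
    for name in numeric_attrs:
        lines.append(f"@attribute {name} numeric")
    # Nominal class attribute listing the allowed label values.
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for *values, label in rows:
        lines.append(",".join(str(v) for v in values) + f",{label}")
    return "\n".join(lines) + "\n"


# Hypothetical example: one word pair described by two metric values.
text = to_arff(
    relation="cognates",
    numeric_attrs=["edit", "lcsr"],
    class_values=["0", "1"],
    rows=[(2, 0.875, "1")],
)
```

The resulting text can be saved with a .arff extension and opened in Weka.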