# Database

This is a database of borrowings and cognate pairs in five Romance languages: Italian, Spanish, Portuguese, French, and Romanian.

The package contains the following files:

Cognate Pairs
-------------
- Under `dataset`, files named `cognates_<lang1>_<lang2>.csv` contain the cognate pairs for the language pair (`<lang1>`, `<lang2>`). Each CSV file has the following columns:
	- `word_<lang1>`, `word_<lang2>`: the pair of cognate words in the two languages
	- `etymon`: the common etymon of the two cognate words
	- `normalized_etymon`: normalized etymon
	- `source_language`: language of common etymon
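
As an illustration of the column layout above, a cognate file could be read along these lines. This is a minimal sketch, not part of the package: it assumes the files are comma-separated, UTF-8 encoded, and start with a header row carrying the column names listed above. The function name `read_cognates` and the language codes passed to it are hypothetical; use whatever codes appear in the actual file names.

```python
import csv


def read_cognates(path, lang1, lang2):
    """Read one cognate file into a list of tuples.

    Assumes a header row with the columns described in this README
    (word_<lang1>, word_<lang2>, etymon, source_language).
    """
    pairs = []
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pairs.append((
                row[f"word_{lang1}"],
                row[f"word_{lang2}"],
                row["etymon"],
                row["source_language"],
            ))
    return pairs
```

If the files use a different delimiter or encoding, pass `delimiter=...` to `csv.DictReader` or change the `encoding` argument accordingly.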

Borrowing Pairs
---------------
- Under `dataset`, files named `borrowings_<lang1>_<lang2>.csv` contain pairs of words in which the `<lang1>` word was borrowed from the `<lang2>` word. Because borrowing is directional, there are twice as many borrowing files as cognate files: one file per direction for each language pair.
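
The difference in file counts can be sketched by enumerating unordered pairs (cognates) against ordered pairs (borrowings). The language codes below are hypothetical placeholders, and the sketch assumes every language pair is covered and that cognate files follow the order of the list; check the actual contents of `dataset` before relying on these names.

```python
from itertools import combinations, permutations

# Hypothetical language codes; the real file names may use different ones.
LANGS = ["it", "es", "pt", "fr", "ro"]


def cognate_files(langs):
    """One file per unordered language pair."""
    return [f"cognates_{a}_{b}.csv" for a, b in combinations(langs, 2)]


def borrowing_files(langs):
    """One file per ordered pair: <lang1> borrowed from <lang2>."""
    return [f"borrowings_{a}_{b}.csv" for a, b in permutations(langs, 2)]
```

Under these assumptions, five languages give 10 cognate files and 20 borrowing files.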

Unrelated Pairs
---------------
- Under `dataset`, files named `negative_<random | levenshtein>_<lang1>_<lang2>.csv` contain pairs of unrelated words for the language pair (`<lang1>`, `<lang2>`). These pairs were selected either randomly or based on Levenshtein distance (the paper describes the process in more detail).
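
For readers unfamiliar with the metric, the following is a standard dynamic-programming sketch of Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another). It only illustrates the metric itself; how the distance was used to select the negative pairs is described in the paper and is not reproduced here.

```python
def levenshtein(a, b):
    """Return the Levenshtein (edit) distance between strings a and b."""
    # Keep the shorter string in the inner loop to use less memory.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion from a
                curr[j - 1] + 1,     # insertion into a
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]
```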

