A FRAMEWORK FOR THE CONSTRUCTION OF MONOLINGUAL AND CROSS-LINGUAL WORD SIMILARITY DATASETS

This directory contains five files and a sub-directory:

----------------------------------------------------------------------------------------------------------------

- "Cross-lingual_datasets" (Directory): Contains fifteen text files, one cross-lingual word similarity dataset for each pair of the following
					six languages: English (EN), French (FR), German (DE), Spanish (ES), Portuguese (PT), and Farsi (FA).

- "rg65_spanish.txt": Monolingual Spanish Word Similarity dataset.

- "rg65_farsi.txt": Monolingual Farsi (Persian) Word Similarity dataset.

Each line in these seventeen files (the Spanish and Farsi monolingual datasets and the fifteen cross-lingual datasets) is formatted as follows:

word1<tab>word2<tab>similarity_score
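
As an illustration of this format, the following is a minimal Python sketch of a reader for these files. It is not part of the released code: the function name read_similarity_pairs is hypothetical, and the sample row is a placeholder, not data from the datasets.

```python
def read_similarity_pairs(lines):
    """Parse lines of the form word1<tab>word2<tab>similarity_score.

    Returns a list of (word1, word2, score) tuples; blank lines are skipped.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore empty lines, e.g. at the end of a file
        word1, word2, score = line.split("\t")
        pairs.append((word1, word2, float(score)))
    return pairs


# Placeholder row for illustration only (not taken from the datasets).
sample = ["word_a\tword_b\t3.5\n"]
for word1, word2, score in read_similarity_pairs(sample):
    print(word1, word2, score)
```

The function accepts any iterable of lines, so an open file object can be passed directly, for example: with open("rg65_spanish.txt", encoding="utf-8") as f: pairs = read_similarity_pairs(f). The UTF-8 encoding is an assumption; adjust it if the files are stored differently.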

----------------------------------------------------------------------------------------------------------------

- "cross-lingual_dataset_creation.py": the Python script for the automatic creation of cross-lingual datasets. 

Input: Two monolingual datasets previously aligned pair-wise (line by line), following the format indicated above.
Output: the cross-lingual dataset in the same format, saved in the same directory.
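
Because the script expects its two inputs to be aligned line by line, it can be useful to check the alignment before running it. The following is a minimal sketch of such a check, not part of the released script: the function name pair_aligned_rows is hypothetical, and it assumes that "aligned" means both files have the same number of non-empty lines, each with three tab-separated fields.

```python
def pair_aligned_rows(lines_1, lines_2):
    """Pair up the rows of two line-aligned datasets.

    Raises ValueError if the datasets differ in length or if any row
    does not have exactly three tab-separated fields.
    """
    rows_1 = [line.rstrip("\r\n").split("\t") for line in lines_1 if line.strip()]
    rows_2 = [line.rstrip("\r\n").split("\t") for line in lines_2 if line.strip()]
    if len(rows_1) != len(rows_2):
        raise ValueError(
            "datasets are not aligned: %d vs %d rows" % (len(rows_1), len(rows_2))
        )
    for row in rows_1 + rows_2:
        if len(row) != 3:
            raise ValueError("expected 3 tab-separated fields, got: %r" % (row,))
    # Row i of the first dataset is paired with row i of the second.
    return list(zip(rows_1, rows_2))
```

This only verifies the shape of the two files. It cannot verify that line i of one file is actually the translation of line i of the other.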

Instructions to run the script: 

The script takes the following parameters: path, file_1, file_2, and size_sim_scale.
	path -> Path of the directory that contains the monolingual datasets. By default, the cross-lingual dataset is created in this same directory.
	file_1 -> File name of the first dataset.
	file_2 -> File name of the second dataset.
	size_sim_scale -> Size of the similarity scale (for example, in RG-65 the size of the similarity scale is 4).

Run it in the terminal as follows: "python cross-lingual_dataset_creation.py path file_1 file_2 size_sim_scale"
It will create the new cross-lingual dataset in the same directory (path).

Example of usage: "python cross-lingual_dataset_creation.py /home/Resources/ rg65_spanish.txt rg65_farsi.txt 4"
In the example, the cross-lingual dataset will be created in "/home/Resources/" with the following name: "cross_rg65_spanish_rg65_farsi.txt"
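
The naming pattern in this example can be sketched as follows. This is inferred from the example above only, and the script itself may build the name differently; the function name cross_lingual_output_path is hypothetical.

```python
import os


def cross_lingual_output_path(path, file_1, file_2):
    """Build the output path suggested by the usage example:
    <path>/cross_<file_1 without extension>_<file_2 without extension>.txt
    """
    stem_1 = os.path.splitext(file_1)[0]
    stem_2 = os.path.splitext(file_2)[0]
    return os.path.join(path, "cross_{}_{}.txt".format(stem_1, stem_2))


print(cross_lingual_output_path("/home/Resources/", "rg65_spanish.txt", "rg65_farsi.txt"))
# /home/Resources/cross_rg65_spanish_rg65_farsi.txt
```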

----------------------------------------------------------------------------------------------------------------

- "Similarity_Guidelines_ES.pdf": Annotation guidelines used for the construction of the Spanish word similarity dataset. 

- "Similarity_Guidelines_FA.pdf": Annotation guidelines used for the construction of the Farsi word similarity dataset. 

----------------------------------------------------------------------------------------------------------------


