###############################################################
###############################################################
This archive contains the 3 evaluation datasets that we compiled, constructed, and used for evaluation in our EMNLP 2014 submission. 

The datasets comprise 3 language pairs: Spanish-to-English, Dutch-to-English and Italian-to-English.
Dataset specifications are provided in the paper; some more technical details follow:

- For each language pair, 360 sentences were extracted from Wikipedia (24 sentences for each of the 15 ambiguous Spanish/Italian/Dutch nouns), for a total of 3*360=1080 sentences

(sentences are in the files TestSentencesESEN.test, TestSentencesITEN.test and TestSentencesNLEN.test in the corresponding folders)

- For each sentence, the system has to suggest the single most likely translation of the given ambiguous word. The ground truth is provided for each sentence
(ground truth translations are provided in the files TestSentencesESEN.gold, TestSentencesITEN.gold and TestSentencesNLEN.gold)


- All files in the dataset are provided in the standardized XML format. In *.test files, the word for which a translation is sought is wrapped inside the <head> and </head> tags.

1) Here is a tiny extract from the TestSentencesITEN.test file (the beginning of the file):
<corpus lang="italian"> -- denotes the language of this test set, Italian in this case
	<lexelt item="accordo.n"> -- denotes the current lexical item, what follows are the sentences/instances of the lexical item
		<instance id="101"> -- here the ID of the instance/sentence is specified
			<context> -- the actual sentence is within the <context> and </context> tags
				In musica si definisce <head>accordo</head> la simultaneità di più suoni aventi un ' altezza definita .
			</context>
		</instance>
...
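
A minimal Python sketch of reading such a file is given here for illustration. It assumes that the real *.test files are well-formed XML, that they do not contain the "--" annotations shown in the extract above, and that every <lexelt> and <corpus> element is properly closed. The SAMPLE string and the read_instances function are hypothetical names introduced for this example; they are not part of the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mirrors the extract above, without the "--" notes.
# The closing </lexelt> and </corpus> tags are an assumption.
SAMPLE = """<corpus lang="italian">
  <lexelt item="accordo.n">
    <instance id="101">
      <context>
        In musica si definisce <head>accordo</head> la simultaneità di più suoni aventi un ' altezza definita .
      </context>
    </instance>
  </lexelt>
</corpus>"""


def read_instances(xml_text):
    """Return one dict per <instance> element found in the XML text."""
    root = ET.fromstring(xml_text)
    instances = []
    for lexelt in root.iter("lexelt"):
        for inst in lexelt.iter("instance"):
            context = inst.find("context")
            head = context.find("head")
            # itertext() yields the text before, inside, and after <head>;
            # split()/join() then normalizes the surrounding whitespace.
            sentence = " ".join("".join(context.itertext()).split())
            instances.append({
                "lang": root.get("lang"),
                "item": lexelt.get("item"),
                "id": inst.get("id"),
                "head": head.text,
                "sentence": sentence,
            })
    return instances


instances = read_instances(SAMPLE)
```

To read an actual file instead of a string, ET.parse(path).getroot() could replace ET.fromstring(xml_text), again assuming the file parses as XML.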

2) Here is a tiny extract from the TestSentencesITEN.gold file:
calcio.n en 306 :: calcium;football;stock :: calcium
calcio.n en 307 :: calcium;football;stock :: football
calcio.n en 308 :: calcium;football;stock :: stock
...

Explanation:
calcio.n -- the lexical item in Italian
en -- the language of translations (English in this case, as we are translating from Italian to English)
306/307/308 -- the instance ID, to link it to the actual sentence in the TestSentencesITEN.test file
calcium;football;stock -- possible translations of "calcio"
The last field contains the correct translation for the particular instance; for 306 it is calcium, for 307 football, and for 308 stock.
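
The fields explained above can be read with a few lines of Python. The sketch below assumes that every line in a *.gold file follows exactly the "::"-separated layout of the extract. GoldEntry, parse_gold_line and accuracy are hypothetical names introduced for illustration, the predictions are made up, and exact-match accuracy is an assumed scoring choice, not necessarily the measure used in the paper.

```python
from typing import NamedTuple


class GoldEntry(NamedTuple):
    item: str          # lexical item, e.g. "calcio.n"
    lang: str          # language of the translations, e.g. "en"
    instance_id: str   # links to <instance id="..."> in the *.test file
    candidates: list   # possible translations
    answer: str        # correct translation for this instance


def parse_gold_line(line):
    """Split one gold line into its three '::'-separated fields."""
    left, candidates, answer = (part.strip() for part in line.split("::"))
    item, lang, instance_id = left.split()
    return GoldEntry(item, lang, instance_id, candidates.split(";"), answer)


def accuracy(gold, predictions):
    """Fraction of gold instances whose predicted translation matches exactly."""
    correct = sum(1 for g in gold if predictions.get(g.instance_id) == g.answer)
    return correct / len(gold)


# The three lines from the extract above.
GOLD_LINES = [
    "calcio.n en 306 :: calcium;football;stock :: calcium",
    "calcio.n en 307 :: calcium;football;stock :: football",
    "calcio.n en 308 :: calcium;football;stock :: stock",
]
gold = [parse_gold_line(line) for line in GOLD_LINES]

# Made-up system output, keyed by instance ID (not real results).
predictions = {"306": "calcium", "307": "football", "308": "calcium"}
score = accuracy(gold, predictions)
```

Keying the predictions by instance ID keeps the link to the sentences in the corresponding *.test file explicit.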


One of the reasons we decided to construct this dataset is to provide test material with non-English text, that is, sentences/instances with polysemous words in languages other than English. The dataset may therefore be a useful evaluation set in many cross-lingual tasks.

The format of the dataset makes it easy to extend. In future versions, we plan to extract more sentences for the existing language pairs and to include other language pairs. Moreover, we plan to provide ground truth translations in languages other than English.





