This dataset accompanies the paper:

P. Malakasiotis and I. Androutsopoulos, "A Generate and Rank Approach to Sentence Paraphrasing". Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), Edinburgh, UK, 2011.

====================
Dataset Construction
====================

The dataset consists of sentence pairs. Each pair contains a source (input) sentence S and a 
generated candidate paraphrase C of S. The pairs were evaluated by human judges in terms of 
grammaticality, meaning preservation, and overall paraphrase quality. More information about 
the scores can be found in the paper.

The candidate paraphrases C were generated using the paraphrasing rules of S. Zhao, 
H. Wang, T. Liu, and S. Li. More information about the rules, how to obtain them, and how 
they were used can be found in the paper.

The source sentences S were drawn randomly from the AQUAINT corpus, which is covered by
licensing agreements. Please ensure that you are covered by appropriate licences,
described in: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002T31.

==============
Dataset Format
==============

The data are stored as XML in two files:

train_instances.xml:
	Contains 1500 sentence pairs that can be used to train the ranking component of a 
	generate-and-rank paraphrase generator.
test_instances.xml:
	Contains 1935 sentence pairs that can be used to evaluate the ranking component of
	a generate-and-rank paraphrase generator.

Each file has a <pairs> root element that contains one <pair> element per sentence pair. 
A <pair> element has 4 attributes: GR (grammaticality), MP (meaning preservation), PQ 
(overall paraphrase quality), and id (unique identifier). It also has two child elements: 
<sourceSentence> (which holds the source sentence S) and <candidateParaphrase> (which 
holds the generated candidate paraphrase C).

Below is a short example of how the sentence pairs are stored in XML.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<pairs size="1500">
   <pair GR="3" MP="4" PQ="3" id="1">
      <sourceSentence>The United States would lift economic sanctions on these zones 
         and use airpower to help defend them from Iraqi counterattacks.
      </sourceSentence>
      <candidateParaphrase>The United States would lift sanctions in economic on these 
         zones and exploit airpower to enable defend them from Iraqi counterattacks.
      </candidateParaphrase>
   </pair>
 ...
</pairs>

An XML Schema definition (XSD) is also provided (see paraphrase_corpus.xsd).
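
For readers who want to load the pairs programmatically, the following is a minimal 
sketch in Python (standard library only) that reads the format described above. It is 
not part of the original distribution. The names ParaphrasePair, parse_pairs and 
load_pairs are hypothetical, and the sketch assumes that the scores are integers (as in 
the example above) and that the files use no XML namespace. Line breaks and indentation 
inside a sentence are collapsed into single spaces.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class ParaphrasePair:
    """One <pair> element (hypothetical container, not part of the dataset)."""
    pair_id: str
    gr: int          # GR: grammaticality
    mp: int          # MP: meaning preservation
    pq: int          # PQ: overall paraphrase quality
    source: str      # text of <sourceSentence>
    candidate: str   # text of <candidateParaphrase>


def _clean(text):
    # Sentences may wrap across lines in the XML; collapse all runs of
    # whitespace (including newlines and indentation) into single spaces.
    return " ".join((text or "").split())


def parse_pairs(root):
    """Convert a parsed <pairs> root element into a list of ParaphrasePair."""
    pairs = []
    for el in root.iter("pair"):
        pairs.append(ParaphrasePair(
            pair_id=el.get("id"),
            # Assumption: scores are stored as integers, as in the README example.
            gr=int(el.get("GR")),
            mp=int(el.get("MP")),
            pq=int(el.get("PQ")),
            source=_clean(el.findtext("sourceSentence")),
            candidate=_clean(el.findtext("candidateParaphrase")),
        ))
    return pairs


def load_pairs(path):
    """Read all pairs from a file such as train_instances.xml."""
    return parse_pairs(ET.parse(path).getroot())


# The example pair from this README, used here as inline test data.
SAMPLE = """<pairs size="1">
   <pair GR="3" MP="4" PQ="3" id="1">
      <sourceSentence>The United States would lift economic sanctions on these zones
         and use airpower to help defend them from Iraqi counterattacks.
      </sourceSentence>
      <candidateParaphrase>The United States would lift sanctions in economic on these
         zones and exploit airpower to enable defend them from Iraqi counterattacks.
      </candidateParaphrase>
   </pair>
</pairs>"""

sample_pairs = parse_pairs(ET.fromstring(SAMPLE))
```

With the dataset files in the working directory, load_pairs("train_instances.xml") 
should return one ParaphrasePair per <pair> element (1500 according to the description 
above), provided the assumptions stated before the code hold for the full files.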


Prodromos Malakasiotis and Ion Androutsopoulos
Natural Language Processing Group, Department of Informatics, 
Athens University of Economics and Business, Greece
http://nlp.cs.aueb.gr/
