This archive contains ETP-gold, version 1.0 (04-30-2014)

ETP-gold is based on edits from English Wikipedia articles and turns from Wikipedia discussion pages. 
It is divided into 128 corresponding edit-turn-pairs (in the "corresponding" folder) and 508 non-corresponding pairs (in the "non-corresponding" folder).
Each edit-turn-pair has been manually annotated; the data shipped with this archive is the gold standard obtained by majority voting.
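
As a minimal sketch of how the folder layout maps to labels, the following Python snippet walks both folders and yields each file together with a boolean label. It assumes that the two folders sit directly below the directory the archive was extracted to and contain nothing but pair files. The names iter_labeled_files and FOLDER_LABELS are hypothetical and not part of the archive.

```python
from pathlib import Path

# Folder names as given in this README; True marks a corresponding pair.
FOLDER_LABELS = {"corresponding": True, "non-corresponding": False}


def iter_labeled_files(archive_root):
    """Yield (path, is_corresponding) for every file in the two folders.

    Assumption: both folders are direct children of archive_root.
    """
    root = Path(archive_root)
    for folder, label in FOLDER_LABELS.items():
        for path in sorted((root / folder).iterdir()):
            if path.is_file():
                yield path, label
```

Run on the complete archive, this should yield 128 files labeled True and 508 labeled False, matching the counts given above.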

More details can be found in the following paper:

Johannes Daxenberger, Iryna Gurevych: Automatically Detecting Corresponding Edit-Turn-Pairs in Wikipedia.
In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Short Papers.
June 2014. Baltimore, MD, USA.

All files use the following pseudo-XML format:

<article_title>ARTICLE TITLE</article_title>
<edit_user>EDIT USER NAME</edit_user>
<edit_time>EDIT TIMESTAMP</edit_time>
<edit_comment>EDIT COMMENT</edit_comment>
<edit_text>EDIT TEXT</edit_text>
<turn_user>TURN USER NAME</turn_user>
<turn_time>TURN TIMESTAMP</turn_time>
<turn_topicname>TURN TOPIC NAME</turn_topicname>
<turn_topictext>TURN TOPIC TEXT</turn_topictext>
<turn_text>TURN TEXT</turn_text>
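
Because the files have no root element and the edit text carries its own markup, a standard XML parser may reject them. The sketch below reads the ten fields with regular expressions instead. It assumes that each field occurs at most once per file and that field values never contain their own tag name. As a precaution, it accepts a closing tag with or without the slash. The names parse_pair and read_pair are hypothetical.

```python
import re

# Field names in the order listed in this README.
FIELDS = (
    "article_title", "edit_user", "edit_time", "edit_comment", "edit_text",
    "turn_user", "turn_time", "turn_topicname", "turn_topictext", "turn_text",
)


def parse_pair(raw):
    """Return a dict mapping each field name to its text (None if absent)."""
    record = {}
    for name in FIELDS:
        # Non-greedy match; DOTALL because a value may span several lines.
        match = re.search(r"<{0}>(.*?)</?{0}>".format(name), raw, re.DOTALL)
        record[name] = match.group(1).strip() if match else None
    return record


def read_pair(path):
    """Read one edit-turn-pair file (UTF-8) and parse its fields."""
    with open(path, encoding="utf-8") as handle:
        return parse_pair(handle.read())
```

The markup inside edit_text is returned unchanged, so it can be processed in a separate step.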

The EDIT TEXT is formatted like this:
<strong><strike>DELETED TEXT</strike></strong>
<strong>INSERTED TEXT</strong>
<strong><em>RELOCATED TEXT</em></strong>
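
The three kinds of spans can be separated with a single regular expression, as in the following sketch. It assumes that the tags appear literally (not HTML-escaped) and that marked spans are not nested inside each other. The more specific patterns are tried first, since every span starts with <strong>. The names EDIT_SPAN and classify_edit_spans are hypothetical.

```python
import re

# Deleted and relocated spans are tried before plain inserted spans,
# because all three start with the same <strong> tag.
EDIT_SPAN = re.compile(
    r"<strong><strike>(?P<deleted>.*?)</strike></strong>"
    r"|<strong><em>(?P<relocated>.*?)</em></strong>"
    r"|<strong>(?P<inserted>.*?)</strong>",
    re.DOTALL,
)


def classify_edit_spans(edit_text):
    """Collect deleted, inserted and relocated spans from an EDIT TEXT value."""
    spans = {"deleted": [], "inserted": [], "relocated": []}
    for match in EDIT_SPAN.finditer(edit_text):
        kind = match.lastgroup  # name of the alternative that matched
        spans[kind].append(match.group(kind))
    return spans
```

Text outside of any marked span is treated as unchanged context and ignored.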

The encoding of all files is UTF-8. The naming scheme of the files is [revisionId]_[editId]_[turn-timestamp]_[turnId].
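
The naming scheme can be taken apart as sketched below. The sketch assumes that the three IDs contain no underscores, so the first two components and the last one can be split off safely even if the timestamp itself contains underscores. It does not strip a file extension, since this README does not say whether the files have one. The name parse_pair_filename is hypothetical.

```python
import os


def parse_pair_filename(path):
    """Split [revisionId]_[editId]_[turn-timestamp]_[turnId] into its parts.

    Assumption: the IDs contain no underscores; the timestamp may.
    """
    name = os.path.basename(path)
    # The first two underscores separate revisionId and editId ...
    revision_id, edit_id, rest = name.split("_", 2)
    # ... and the last underscore separates the timestamp from turnId.
    turn_timestamp, turn_id = rest.rsplit("_", 1)
    return {
        "revision_id": revision_id,
        "edit_id": edit_id,
        "turn_timestamp": turn_timestamp,
        "turn_id": turn_id,
    }
```

All components are returned as strings, because the exact format of the IDs and the timestamp is not specified here.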

ETP-gold is licensed under the Creative Commons Attribution/Share-Alike License (CC-BY-SA), as it is based on data extracted from Wikipedia.
If you use ETP-gold, please cite the paper given above.

ETP-gold can also be found online at www.ukp.tu-darmstadt.de/data/edit-turn-pairs.

In case of questions, please contact: Johannes Daxenberger (daxenberger@ukp.informatik.tu-darmstadt.de)
UKP Lab, TU Darmstadt, www.ukp.tu-darmstadt.de
