AQMAR Arabic Wikipedia Supersense Corpus

This dataset contains text from a small corpus of Arabic Wikipedia articles, hand-annotated
for nominal supersenses. It is described in the paper

  Nathan Schneider, Behrang Mohit, Kemal Oflazer, and Noah A. Smith (2012),
  Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study. Proceedings of ACL.

and can be downloaded at 

  http://www.ark.cs.cmu.edu/AQMAR/

This dataset is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License (see LICENSE).


CONTENTS

== Data ==

articles.txt
  4-digit code, English title, and domain of each article in
  the data.

problem.sentences.txt
  Sentences that were removed from the dataset because one or 
  both annotators flagged them as problematic.

sentences.txt
  Annotated sentences, one per line. Tab-separated fields:
   * sentence ID: first 4 digits are the article code; 
     remaining digits are numbered sequentially in 
     the main text of the article
   * Arabic sentence (UTF-8, tokenized)
   * tags from Annotator A, if available
   * tags from Annotator B, if available
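
  A minimal sketch of reading one line of sentences.txt in Python. 
  The function name and dictionary keys are hypothetical, and the 
  sketch assumes that an unavailable annotator column is either 
  empty or absent; check this against the actual file.

```python
def parse_sentence_line(line):
    """Split one line of sentences.txt into its tab-separated fields."""
    fields = line.rstrip("\r\n").split("\t")
    # Assumption: a missing annotator column is empty or absent, so pad to 4.
    fields += [""] * (4 - len(fields))
    sent_id, sentence, tags_a, tags_b = fields[:4]
    return {
        "sentence_id": sent_id,
        "article": sent_id[:4],   # first 4 digits: article code
        "number": sent_id[4:],    # remaining digits: position in the article
        "sentence": sentence,
        "tags_a": tags_a or None,  # None when Annotator A's tags are unavailable
        "tags_b": tags_b or None,  # None when Annotator B's tags are unavailable
    }


if __name__ == "__main__":
    # Placeholder line, not taken from the corpus.
    print(parse_sentence_line("00090021\tw1 w2\tX _\n"))
```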

tokens.txt
  Data in the token-based format. Each line contains the 
  Arabic token, tag from Annotator A (if available), 
  tag from Annotator B (if available), and the sentence ID.
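
  A hedged sketch of regrouping tokens.txt into sentences. It 
  assumes the four columns are tab-separated and that a missing 
  tag leaves an empty column; neither detail is stated above, so 
  verify both against the file. The function names are hypothetical.

```python
from itertools import groupby


def read_token_rows(lines):
    """Yield (token, tag_a, tag_b, sentence_id) tuples from tokens.txt lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines, if the file contains any
        # Assumption: exactly four tab-separated columns per line.
        token, tag_a, tag_b, sent_id = line.split("\t")
        yield token, tag_a, tag_b, sent_id


def group_by_sentence(rows):
    """Group consecutive rows that share a sentence ID."""
    for sent_id, group in groupby(rows, key=lambda row: row[3]):
        yield sent_id, [row[:3] for row in group]


if __name__ == "__main__":
    # Placeholder rows, not taken from the corpus.
    sample = ["w1\tX\tX\t00130006\n", "w2\t<\t_\t00130006\n", "w3\t_\t\t00130007\n"]
    for sent_id, tokens in group_by_sentence(read_token_rows(sample)):
        print(sent_id, tokens)
```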

tokens.agreement.bio
  Data used to measure inter-annotator agreement.

tokens.annA.bio
tokens.annB.bio
  Data from individual annotators.


== Documentation ==

examples.html
  Short descriptions and examples for each supersense tag. 
  These were listed in a sidebar in the annotation interface.

guidelines.html
  Tagging guidelines used by annotators.

LICENSE
README
VERSION


== Scripts ==

agreementDataFilter.py
  Applied to sentences.txt, outputs sentences independently 
  annotated by both annotators.

counts.sh
  Counts sentences, tokens, and supersense mentions in each 
  domain and collectively.

sentences2tokens.py
  Converts sentences to a token-based format, with one token 
  per line.

extenderTagScheme2BIO.py
  Converts the tagging to a BIO scheme. Usage:
    ./extenderTagScheme2BIO.py "<" tokens.txt | sed 's/[BI]-[-_]/O/g'
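
  To illustrate the idea, here is a rough Python sketch of an 
  extender-to-BIO conversion. It is not the script's actual code: 
  it assumes that each non-extender tag opens a unit (B-), that an 
  extender continues the unit with the same label (I-), and it then 
  applies the same substitution as the sed command to turn blanks 
  into O. The function names and the tags X and Y are placeholders.

```python
import re


def extender_to_bio(tags, extender="<"):
    """Prefix each tag with B-, and replace extenders with I-<previous tag>."""
    bio = []
    current = None  # label of the unit that an extender would continue
    for tag in tags:
        if tag == extender and current is not None:
            bio.append("I-" + current)
        else:
            # A new unit starts here (also the fallback for a leading extender).
            current = tag
            bio.append("B-" + tag)
    return bio


def blank_to_o(bio_tags):
    """Same substitution as sed 's/[BI]-[-_]/O/g', applied to each tag."""
    return [re.sub(r"[BI]-[-_]", "O", tag) for tag in bio_tags]


if __name__ == "__main__":
    print(blank_to_o(extender_to_bio(["X", "<", "_", "-", "Y"])))
```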


NOTES

The supersense tag symbols are included in the tagset documentation. 
Other symbols are:
_ or - = blank (not part of a nominal supersense)
< = extender (continues a multiword unit)
? = unsure

Tokenization separated punctuation and the conjunction wa- from words.

For articles 0009-0012, the first 20 sentences (i.e., sentences numbered 
0001-0020) were annotated cooperatively by the two annotators. 
For the remaining articles, the first 5 sentences were annotated 
cooperatively.
All other sentences were annotated independently; those with tags from 
both annotators (not including articles 0001-0004, which were used in 
pilot annotation rounds) were used to compute inter-annotator agreement.
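
The rules above can be sketched in Python as follows. This is an 
illustration, not the logic of agreementDataFilter.py; the function and 
constant names are hypothetical, and it assumes sentence numbering within 
an article starts at 1.

```python
COOP_20_ARTICLES = {"0009", "0010", "0011", "0012"}  # first 20 sentences cooperative
PILOT_ARTICLES = {"0001", "0002", "0003", "0004"}    # pilot annotation rounds


def is_cooperative(sent_id):
    """True if the sentence falls in the cooperatively annotated prefix."""
    article, number = sent_id[:4], int(sent_id[4:])
    limit = 20 if article in COOP_20_ARTICLES else 5
    return number <= limit


def counts_toward_agreement(sent_id, has_tags_a, has_tags_b):
    """True if the sentence would be used for inter-annotator agreement."""
    return (
        has_tags_a
        and has_tags_b
        and sent_id[:4] not in PILOT_ARTICLES
        and not is_cooperative(sent_id)
    )


if __name__ == "__main__":
    print(is_cooperative("00090020"), counts_toward_agreement("00130006", True, True))
```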
