===============================================================================
==                             MORPH Dataset                                 ==
==           Annotation of ambiguous complex function words in French        ==
==                                                                           ==
==                          Release 0.2 - May 28 2015                        ==
==                  Resource associated with ACL 2015 paper:                 ==
==                                                                           ==
==     Joint Dependency Parsing and Multiword Expressions Tokenization       ==
==          Alexis Nasr, Carlos Ramisch, José Deulofeu, André Valli          ==
==                   Aix Marseille Université, CNRS, LIF                     ==
==                   FirstName.LastName@lif.univ-mrs.fr                      ==
===============================================================================

1) DESCRIPTION

The MORPH dataset was built to evaluate the parsing of complex function words
in French. It contains around 100 sentences for each target construction: 7
frequent ADV+que complex conjunctions and 4 de-DET complex determiners. Each
sentence contains a single instance of the target construction and comes with
an annotation that indicates whether it is used as a complex function word
(MORPH) or as a regular combination (OTHER).

2) DATA FORMAT

The sentences of each target construction are stored in a separate file, named
after the target construction. For instance, file "ainsi_que.txt" contains 
annotations for "ainsi que", a complex conjunction. The two classes are stored
in separate folders, "ADV-que" and "de-DET".

Each file is a tab-separated CSV encoded in UTF-8, with the following columns:
  - seg-annot: manual annotation of segmentation (MORPH or OTHER)
  - sentence-tok: tokenized sentence containing the target construction
  
The first column contains the annotation for the target construction in that
sentence. Since there is only one occurrence of the target construction in the
sentence, we do not indicate the position in the sentence to which this 
annotation corresponds. The two possible values are MORPH, for a complex
function word use (there is a MORPH dependency link between the words of the
construction) and OTHER (there is another, regular syntactic structure which
does not include a MORPH link).

The second column contains the sentence itself, with the target construction.
The sentence is tokenized because it was extracted from the POS-tagged frWaC
corpus. The sentences may also contain spelling errors, since frWaC was built
by automatically crawling websites.
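
As an illustration of the format described above, here is a minimal Python
sketch that reads (annotation, sentence) pairs from a tab-separated stream. It
is not part of the release. The function name read_annotations and the two
sample lines are invented for this example, and the sample labels are only
illustrative. The sketch assumes that quotes inside sentences should be read
literally, and it skips any line that does not start with MORPH or OTHER (for
instance, a header line, if the files have one).

```python
import csv
import io
from collections import Counter


def read_annotations(stream):
    """Yield (label, sentence) pairs from a tab-separated text stream."""
    # QUOTE_NONE: quote characters inside sentences are kept as plain text.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip header lines, blank lines and anything malformed.
        if len(row) != 2 or row[0] not in ("MORPH", "OTHER"):
            continue
        yield row[0], row[1]


# Invented sample lines (NOT taken from the dataset).
sample = io.StringIO(
    "MORPH\tIl parle anglais ainsi que français .\n"
    "OTHER\tC' est ainsi que tout a commencé .\n"
)
pairs = list(read_annotations(sample))
label_counts = Counter(label for label, _ in pairs)
print(label_counts)
```

To read a real file, pass an open file object instead of the sample, e.g.
open("ADV-que/ainsi_que.txt", encoding="utf-8", newline="").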

3) DATA COLLECTION AND ANNOTATION

The sentences were extracted from the frWaC corpus, a 1.6B-word corpus of texts
crawled from the web. We used the POS-tagged version of the corpus, made 
available on request on the WaCky website. We selected sentences based on
the following criteria:
  - The sentence should contain exactly one occurrence of the target 
    construction type (e.g., no more than one de-DET determiner, regardless of
    its type)
  - The sentence should contain between 10 and 20 words, in order to provide
    enough context for annotation without being too long. Overly long
    sentences may include irrelevant material that only slows annotation down
  - For de-DET constructions, we also required that a verb precede the "de"
    preposition. The verb may appear several words earlier; it is not
    necessarily adjacent to the target construction. This reduces the number of
    nominal complements, like "président de la république", and favors the
    occurrence of determiner/prepositional phrase ambiguity
  - Some sentences were manually removed during annotation because they 
    contained too much noise (typos, grammar errors) or because there was not
    enough context to decide on the correct annotation
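
The first two criteria above can be sketched in Python as follows. This is a
hypothetical reconstruction, not the script actually used to build the dataset.
It assumes that "words" are whitespace-separated tokens and that matching is
case-insensitive. The names count_occurrences and keep_sentence are invented
here. The verb criterion for de-DET is omitted because it needs the POS tags.

```python
def count_occurrences(tokens, targets):
    """Count how often any target token sequence occurs in the token list."""
    lowered = [t.lower() for t in tokens]
    total = 0
    for target in targets:
        n = len(target)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == target:
                total += 1
    return total


def keep_sentence(sentence, targets, min_len=10, max_len=20):
    """Apply the length criterion and the single-occurrence criterion."""
    tokens = sentence.split()
    if not min_len <= len(tokens) <= max_len:
        return False
    return count_occurrences(tokens, targets) == 1


# targets lists every construction of the same type, as lowercase tokens.
targets = [["ainsi", "que"]]
print(keep_sentence("a b c d e f ainsi que g h", targets))
```

Passing all constructions of one type in targets mirrors the requirement that
a sentence contain exactly one occurrence of the construction type.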
    
Annotation was performed by two experts in French syntax. They went through
each set of sentences independently. Each target construction occurrence was
judged as either MORPH or OTHER. The annotation class "OTHER" may cover
several readings, e.g. the construction "tant que" may represent a regular
adverb followed by a subordinate clause ("je voudrais tant que tu m'aimes") or
a comparative ("mange tant que tu veux"). This distinction was not made in this
annotation, since we are only interested in the MORPH/regular distinction.
After a first pass, annotators cross-checked each other's sentences.
Divergences were discussed and, if no consensus was reached, the sentence was
discarded.

4) EVALUATION

The released dataset also contains a script called eval-morph.sh, which
compares a parser's output in CONLL07 format with the annotated files. An
example of parsing output is provided in file ainsi_que.conll07. You can run
the evaluation script as follows:

./eval-morph.sh ainsi_que.conll07 ADV-que/ainsi_que.txt ainsi que

This will provide, in addition to precision and recall statistics, an error
analysis of the cases missed by the parser (differences with respect to the
annotation). To evaluate other parser output files, use the CONLL07 format as
in the example, where the words of a complex function word are linked by a
dependency called "MORPH".
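
To make the comparison concrete, here is a rough Python sketch of how
precision and recall for the MORPH class could be computed. It does not
reproduce eval-morph.sh, whose internals are not described here. It assumes the
usual 10-column, tab-separated layout with the word form in column 2 and the
dependency label in column 8, blank lines between sentences, and the same
sentence order as in the annotated file. A sentence counts as predicted MORPH
if one of the target words carries a "MORPH" label. The function names are
invented for this example.

```python
def predicted_labels(lines, target_words):
    """Return one predicted label (MORPH or OTHER) per parsed sentence."""
    labels, label, seen = [], "OTHER", False
    for line in lines:
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if seen:
                labels.append(label)
            label, seen = "OTHER", False
            continue
        seen = True
        cols = line.rstrip("\n").split("\t")
        # cols[1] is the word form, cols[7] the dependency label.
        if len(cols) >= 8 and cols[7] == "MORPH" and cols[1].lower() in target_words:
            label = "MORPH"
    if seen:
        labels.append(label)
    return labels


def precision_recall(gold, pred, positive="MORPH"):
    """Precision and recall of the positive class over aligned label lists."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    pred_pos = sum(p == positive for p in pred)
    gold_pos = sum(g == positive for g in gold)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    return precision, recall


# Two invented two-token "sentences" in the assumed column layout.
conll = (
    "1\tainsi\t_\t_\t_\t_\t0\tROOT\t_\t_\n"
    "2\tque\t_\t_\t_\t_\t1\tMORPH\t_\t_\n"
    "\n"
    "1\tainsi\t_\t_\t_\t_\t0\tROOT\t_\t_\n"
    "2\tque\t_\t_\t_\t_\t1\tOBJ\t_\t_\n"
)
pred = predicted_labels(conll.splitlines(), {"ainsi", "que"})
print(precision_recall(["MORPH", "MORPH"], pred))
```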
    
