# Data

This folder contains the data that forms the basis of the tests in the paper.
We consider three different types of data: synthetic data, semi-natural data and natural data.
The natural data differs per test and is directly extracted from natural corpora (e.g. it consists of natural sentences containing a particular idiom under consideration).
The other two data types are templated; you can find them in their respective folders.
The [vocabulary](vocabulary) folder contains the vocabulary items that we used to vary the templates.

# Synthetic data
For our synthetic test data, we take inspiration from the literature on probing hierarchical structure in language models: we start from the synthetic data generated by Lakretz et al. (2019), which contains a large number of sentences with a fixed syntactic structure and diverse lexical material.
We extend the set of templates in the dataset and the vocabulary used, resulting in the following ten templates:

| # | Template | Example sentence |
|:--|:---------|:-----------------|
| 1 | The N<sub>people</sub> V<sub>transitive</sub> the N<sup>sl</sup><sub>elite</sub> | _The poet criticises the king_ |
| 2 | The N<sub>people</sub> Adv V<sub>transitive</sub> the N<sup>sl</sup><sub>elite</sub> | _The victim carefully observes the queen_ |
| 3 | The N<sub>people</sub> P the N<sup>sl</sup><sub>vehicle</sub> V<sub>transitive</sub> the N<sup>sl</sup><sub>elite</sub> | _The athlete near the bike observes the leader_ |
| 4 | The N<sub>people</sub> and the N<sub>people</sub> V<sup>pl</sup><sub>transitive</sub> the N<sup>sl</sup><sub>elite</sub> | _The poet and the child understand the mayor_ |
| 5 | The N<sup>sl</sup><sub>quantity</sub> of N<sup>pl</sup><sub>people</sub> P the N<sup>sl</sup><sub>vehicle</sub> V<sup>sl</sup><sub>transitive</sub> the N<sup>sl</sup><sub>elite</sub> | _The group of friends beside the bike forgets the queen_ |
| 6 | The N<sub>people</sub> V<sub>transitive</sub> that the N<sub>people</sub> V<sup>pl</sup><sub>intransitive</sub> | _The farmer sees that the lawyers cry_ |
| 7 | The N<sub>people</sub> Adv V<sub>transitive</sub> that the  N<sub>people</sub> V<sup>pl</sup><sub>intransitive</sub> | _The mother probably thinks that the fathers scream_ |
| 8 | The N<sub>people</sub> V<sub>transitive</sub> that the N<sup>pl</sup><sub>people</sub> V<sup>pl</sup><sub>intransitive</sub> Adv | _The mother thinks that the fathers scream carefully_ |
| 9 | The N<sub>people</sub> that V<sub>intransitive</sub> V<sub>transitive</sub> the N<sup>sl</sup><sub>elite</sub>  | _The poets that sleep understand the queen_ |
| 10 | The N<sub>people</sub> that V<sub>transitive</sub> Pro V<sup>sl</sup><sub>transitive</sub> the N<sup>sl</sup><sub>elite</sub> | _The mother that criticises him recognises the queen_ |

For each of the templates, we generated 3000 sentences.
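
The template-filling idea can be illustrated with a minimal sketch. It is not the generation script used for the paper: the vocabulary lists, the slot names and the helper names (`fill_template`, `generate`) are hypothetical, and the real vocabulary items are the ones in the [vocabulary](vocabulary) folder. The sketch samples with replacement, so it may produce duplicate sentences; whether the released data was deduplicated is not stated here.

```python
import random

# Hypothetical vocabulary for illustration only; the real items live in the
# vocabulary folder.
VOCAB = {
    "N_people": ["poet", "victim", "athlete"],
    "V_transitive": ["criticises", "observes", "recognises"],
    "N_elite": ["king", "queen", "leader"],
}

# Template 1 from the table above: The N_people V_transitive the N_elite
TEMPLATE_1 = ["The", "N_people", "V_transitive", "the", "N_elite"]


def fill_template(template, vocab, rng):
    """Replace each slot name with a random vocabulary item; keep other tokens."""
    return " ".join(
        rng.choice(vocab[token]) if token in vocab else token
        for token in template
    )


def generate(template, vocab, n, seed=0):
    """Generate n sentences from one template, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [fill_template(template, vocab, rng) for _ in range(n)]


if __name__ == "__main__":
    for sentence in generate(TEMPLATE_1, VOCAB, n=5):
        print(sentence)
```

Templates with number agreement (for example V<sup>pl</sup> or N<sup>sl</sup>) would need separate singular and plural vocabulary lists per slot, which this sketch leaves out.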

# Semi-natural data

In the synthetic data, we have full control over the sentence structure and lexical items, but the sentences are shorter (9 tokens vs. 16 in OPUS) and simpler than is typical of NMT data.
To obtain more complex yet plausible test sentences, we employ a data-driven approach to generate semi-natural data: using the tree substitution grammar Double DOP (Van Cranenburgh et al., 2016), we obtain noun and verb phrases whose structures frequently occur in OPUS.

To generate the data, we use the following procedure:

1. Sample 100k English OPUS sentences.
2. Generate a treebank using the [disco-dop](https://github.com/andreasvc/disco-dop) library and the `discodop parser en_ptb` command. We used the library’s `--fmt bracket` option to turn off discontinuous parsing, which the library was originally developed for.
3. Compute tree fragments from the resulting treebank (`discodop fragments`). These tree fragments are the building blocks of a Tree-Substitution Grammar.
4. We assume the most frequent fragments to be common syntactic structures in English. To construct complex test sentences, we collect the 100 most frequent fragments containing at least 15 non-terminal nodes for NPs and VPs.
5. Select three VP and five NP fragments to be used in our final semi-natural templates. These structures are selected through qualitative analysis for their diversity.
6. Extract sentences matching the eight fragments (`discodop treesearch`).
7. Create semi-natural sentences by varying one lexical item and varying the matching NPs and VPs retrieved in the previous step.
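
The fragment filter in step 4 can be sketched as follows. This is an illustration under stated assumptions, not the code used for the paper: it assumes fragments are available as bracketed strings mapped to their frequencies, that "for NPs and VPs" means fragments whose root label is `NP` or `VP`, and that every labelled node (including POS tags and open substitution sites) counts as a non-terminal. How nodes were counted in the original pipeline is not specified here, and the function names are hypothetical.

```python
import re
from collections import Counter

# A labelled node is an opening bracket immediately followed by a label.
LABEL = re.compile(r"\(([^\s()]+)")


def count_nonterminals(fragment):
    """Count labelled nodes in a bracketed fragment (assumed counting rule)."""
    return len(LABEL.findall(fragment))


def root_label(fragment):
    """Return the label of the outermost node, or None if there is none."""
    match = LABEL.match(fragment.strip())
    return match.group(1) if match else None


def top_fragments(fragment_counts, roots=("NP", "VP"), min_nodes=15, k=100):
    """Return the k most frequent fragments with an allowed root and enough nodes."""
    kept = Counter({
        fragment: freq
        for fragment, freq in fragment_counts.items()
        if root_label(fragment) in roots
        and count_nonterminals(fragment) >= min_nodes
    })
    return kept.most_common(k)


if __name__ == "__main__":
    # Toy frequencies, invented for the example.
    counts = {
        "(NP (NP ) (PP (IN ) (NP )))": 120,
        "(VP (VB ) (NP ))": 300,
        "(S (NP ) (VP ))": 500,
        "(NP (DT ) (NN ))": 900,
    }
    print(top_fragments(counts, min_nodes=3, k=2))
```

The thresholds are parameters so that the toy example can use small values; the defaults mirror the numbers given in step 4.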

We then embed the extracted NPs and VPs in ten synthetic templates, resulting in the following ten semi-natural templates:

| # | Template | Example sentence |
|:--|:---------|:-----------------|
| 1 | The N<sub>people</sub> (VP (TO ) (VP (VB ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))))) | _The woman wants to use the Internet as a means of communication._ |
| 2 | The N<sub>people</sub> (VP (VBP ) (VP (VBG ) (S (VP (TO ) (VP (VB ) (S (VP (TO ) (VP ))))))))) | _The men are gonna have to move off-camera._ |
| 3 | The N<sub>people</sub> (VP (VB ) (NP (NP ) (PP (IN ) (NP ))) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))) | _The doctors retain 10 % of these amounts by way of collection costs._ |
| 4 | The N<sub>people</sub> reads an article about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))))) | _The friend reads an article about the development of ascites in rats with liver cirrhosis._ |
| 5 | The N<sub>people</sub> reads an article about (NP (NP (DT ) (NN )) (PP (IN ) (NP (NP ) (SBAR (S (WHNP (WDT )) (VP )))))) | _The teachers read an article about the degree of progress that can be achieved by the industry._ | 
| 6 | An article about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))))) is read by the N<sub>people</sub>.  | _An article about the inland transport of dangerous goods from a variety of Member States is read by the lawyer._ |
| 7 | An article about (NP (NP ) (PP (IN ) (NP (NP ) (, ,) (SBAR (S (WHNP (WDT )) (VP )))))) , is read by the N<sub>people</sub> .  | _An article about the criterion on price stability , which was 27 % , is read by the child._ |
| 8 | Did the N<sub>people</sub> hear about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP )))))))? | _Did the friend hear about an inhospitable fringe of land on the shores of the Dead Sea?_ |
| 9 | Did the N<sub>people</sub> hear about (NP (NP (DT ) (NN )) (PP (IN ) (NP (NP ) (SBAR (S (WHNP (WDT )) (VP ))))))? | _Did the teacher hear about the march on Employment which happened here on Sunday?_ |
| 10 | Did the N<sub>people</sub> hear about (NP (NP ) (SBAR (S (VP (TO ) (VP (VB ) (NP (NP ) (PP (IN ) (NP )))))))) ? | _Did the lawyers hear about a qualification procedure to examine the suitability of the applicants?_ |

As for the synthetic data, we generate 3000 samples for each template.
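
Embedding an extracted phrase in a template can be sketched as below. The sketch assumes the extracted phrases are available as bracketed trees with one word per pre-terminal node; the example tree and the helper names (`tree_yield`, `embed_np`) are hypothetical and only illustrate template 4 with a singular subject.

```python
import re

# A terminal is a word directly inside a pre-terminal node, e.g. "(NN rats)".
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def tree_yield(tree):
    """Return the words of a bracketed tree, left to right."""
    return " ".join(word for _label, word in LEAF.findall(tree))


def embed_np(noun, np_tree):
    """Fill template 4 (singular subject): The N_people reads an article about NP."""
    return f"The {noun} reads an article about {tree_yield(np_tree)}."


if __name__ == "__main__":
    # Hand-written tree for illustration, not taken from the treebank.
    np_tree = (
        "(NP (NP (DT the) (NN development)) "
        "(PP (IN of) (NP (NP (NNS ascites)) (PP (IN in) (NP (NNS rats))))))"
    )
    print(embed_np("friend", np_tree))
```

A plural subject would additionally require the verb form to agree ("read" instead of "reads"), which this sketch does not handle.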
