# Systematicity, Compositionality and Transitivity of Deep NLP Models: a Metamorphic Testing Perspective

Proof of concept code for the use of metamorphic relations to test systematicity, compositionality and transitivity 
of deep Natural Language Processing models.

### Abstract

Metamorphic testing has recently been used to check the safety of neural NLP models.
Its main advantage is that it does not rely on a ground truth to generate test cases.
However, existing studies are mostly concerned with robustness-like metamorphic relations, limiting the scope of linguistic properties that can be tested.
We propose three new classes of metamorphic relations, which address the properties of systematicity, compositionality and transitivity.
Unlike robustness, our relations are defined over more than one source input, thus increasing the number of test cases that we can produce by a polynomial factor.
With them, we test the internal consistency of state-of-the-art NLP models, and show that they do not always behave according to their expected linguistic properties.
Lastly, we introduce a novel graphical notation that efficiently summarises the inner structure of metamorphic relations.

### Environment requirements
* Python with the following packages:
```
numpy
scikit_learn
torch
transformers
sentencepiece
pandas
tqdm
```

### Installation and usage
1. From the root of the uncompressed package:

        pip install .

2. To run an experiment:

        python -m experiments.context_invariance
        python -m experiments.entailment_composition
        python -m experiments.sentiment_invariance
        python -m experiments.synonymy


### Description
We wish to test the internal consistency of an NLP model by checking whether it satisfies a necessary relation between its 
inputs and outputs ([Ribeiro et al., 2020](https://doi.org/10.18653/v1/2020.acl-main.442)).
This is in contrast with testing *robustness* relations, which require that the output of an NLP model remains stable in the 
face of small input perturbations ([Aspillaga et al., 2020](https://aclanthology.org/2020.lrec-1.232)).

We propose three new classes of metamorphic relations, and use them to test the systematicity, compositionality and 
transitivity of NLP models. These tests are implemented in the following experiments:

* **Pairwise systematicity of sentiment**

  To apply the pairwise-systematicity relation structure to a sentiment analysis task, we choose the following:

  - **Transformation** *T*. For each source input ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_i), 
    we create a follow-up input ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_i'=T(\mathbf{x}_i)) 
    by concatenating a short sentence to it.
    
  - **Output premise** ![](https://render.githubusercontent.com/render/math?math=P_{src}). 
    Let ![](https://render.githubusercontent.com/render/math?math=s(\mathbf{y}_1)) and 
    ![](https://render.githubusercontent.com/render/math?math=s(\mathbf{y}_2)) be the (positive) sentiment scores predicted 
    by model *f*. Define the baseline behaviour of model *f* as the order property ![](https://render.githubusercontent.com/render/math?math=P_{src}=P_{ord}) 
    between these two scores.
    
  - **Output hypothesis** ![](https://render.githubusercontent.com/render/math?math=P_{flw}). Let ![](https://render.githubusercontent.com/render/math?math=s(\mathbf{y}_1')) 
    and ![](https://render.githubusercontent.com/render/math?math=s(\mathbf{y}_2')) be the sentiment scores of the follow-up inputs. 
    We require that their order matches that of the source inputs. More formally: ![](https://render.githubusercontent.com/render/math?math=P_{flw}%20=%20P_{ord}) and 
    ![](https://render.githubusercontent.com/render/math?math=P_{src}%20\iff%20P_{flw}).
  <br/><br/>
  
  Our rationale is that the sentiment of any input shifts when we concatenate additional text. If we have ground-truth 
  information on the sentiment of the text we are adding, we can test whether our predictions shift in the expected direction. 
  For instance, concatenating *"I am very happy"* should make the score of any input more positive.

  However, if we do not have such ground truth, we can still test our model. We do so by considering a pair of inputs 
  ![](https://render.githubusercontent.com/render/math?math=(\mathbf{x}_1,\mathbf{x}_2)), and concatenating the same text 
  to both of them. Then, whenever ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_1) is predicted 
  more positive than ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_2), we require that its transformed 
  version ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_1') is also more positive than 
  ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_2') and vice versa. This is pairwise systematicity.
  <br/><br/>
  
  We select a fine-tuned version of RoBERTa ([Liu et al., 2019](https://arxiv.org/abs/1907.11692)) for sentiment analysis 
  from the [HuggingFace library](https://huggingface.co/siebert/sentiment-roberta-large-english). We choose 10,605 movie 
  reviews from ([Socher et al., 2013](https://aclanthology.org/P13-1045)) as our dataset ![](https://render.githubusercontent.com/render/math?math=\mathcal{D}). 
  From it, we generate all *112*M+ possible source input pairs.


* **Pairwise compositionality of NLI**

  In general, the input ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}=(\mathbf{x}_a,\mathbf{x}_b)) 
  of an NLI model is the concatenation of two pieces of text: the premise ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_a) 
  and the hypothesis ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_b). The model's goal is to predict 
  whether ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_b) logically follows from 
  ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_a), i.e. their *entailment*.

  To test whether the model's predictions exhibit compositional behaviour, we construct our test inputs according to 
  [Rozanova et al. (2021)](http://arxiv.org/abs/2105.08008). Namely, we first choose a prototypical sentence template 
  ![](https://render.githubusercontent.com/render/math?math=C(\ell)), which we call a *context*. Each context includes 
  a placeholder token ![](https://render.githubusercontent.com/render/math?math=\ell) that can be replaced with some 
  *insertion* text. Second, we construct each input ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}=(C(\ell_a),C(\ell_b))) 
  by copying the same context twice with different insertions.

  Finally, we choose the contexts ![](https://render.githubusercontent.com/render/math?math=C_i) and insertion pairs 
  ![](https://render.githubusercontent.com/render/math?math=(\ell_a,\ell_b)_j) in such a way that their composition 
  ![](https://render.githubusercontent.com/render/math?math=(C(\ell_a),C(\ell_b))_{ij}) has a well-defined entailment relation. 
  Namely, the insertion pairs are either hypernyms (![](https://render.githubusercontent.com/render/math?math=\supseteq)), 
  hyponyms (![](https://render.githubusercontent.com/render/math?math=\subseteq)), or unrelated (none). Similarly, the 
  contexts are either *upward monotone*, if they preserve the insertion relation, or *downward monotone*, if they invert it. 
  As a result, only the compositions ![](https://render.githubusercontent.com/render/math?math=\text{Up}(\subseteq)) and 
  ![](https://render.githubusercontent.com/render/math?math=\text{Down}(\supseteq)) are entailed, while the rest are not.
  <br/><br/>
  
  Now, assume that both inputs ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_1) and 
  ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_2) are based on the same context 
  ![](https://render.githubusercontent.com/render/math?math=C_i). We can test whether the NLI model builds its output by 
  reasoning over the monotonicity of ![](https://render.githubusercontent.com/render/math?math=C_i) and the lexical relation 
  of the insertion pairs ![](https://render.githubusercontent.com/render/math?math=(\ell_a,\ell_b)_j) as follows:

  - **Hidden premise** ![](https://render.githubusercontent.com/render/math?math=P_{hid}). Let 
    ![](https://render.githubusercontent.com/render/math?math=\mathbf{z}) be the embeddings of the second to last layer, 
    for the tokens corresponding to the insertions ![](https://render.githubusercontent.com/render/math?math=\ell_a) and 
    ![](https://render.githubusercontent.com/render/math?math=\ell_b). Train a linear probe 
    ![](https://render.githubusercontent.com/render/math?math=s_{hyp}) on ![](https://render.githubusercontent.com/render/math?math=\mathbf{z}) 
    ([Liu et al., 2019](https://doi.org/10.18653/v1/N19-1112)) to predict whether ![](https://render.githubusercontent.com/render/math?math=\ell_a) 
    is a hypernym of ![](https://render.githubusercontent.com/render/math?math=\ell_b). Define 
    ![](https://render.githubusercontent.com/render/math?math=P_{hid}\!=\!P_{ord}) as the order property over the hypernymy 
    scores ![](https://render.githubusercontent.com/render/math?math=s_{hyp}(\mathbf{z}_1)) and 
    ![](https://render.githubusercontent.com/render/math?math=s_{hyp}(\mathbf{z}_2)) of the two inputs.
    
  - **Output hypothesis** ![](https://render.githubusercontent.com/render/math?math=P_{out}). Let 
    ![](https://render.githubusercontent.com/render/math?math=s_{ent}(\mathbf{y})) be the entailment score produced by the 
    full neural model ![](https://render.githubusercontent.com/render/math?math=f\circ%20g). Moreover, define 
    ![](https://render.githubusercontent.com/render/math?math=P_{out}%20=%20P_{ord}) as the order of the two output scores 
    ![](https://render.githubusercontent.com/render/math?math=s_{ent}(\mathbf{y}_1)) and 
    ![](https://render.githubusercontent.com/render/math?math=s_{ent}(\mathbf{y}_2)). Then, consider the monotonicity of 
    the input context. If ![](https://render.githubusercontent.com/render/math?math=C_i) is downward monotone, let 
    ![](https://render.githubusercontent.com/render/math?math=P_{hid}%20\iff%20P_{out}), since more hypernymy means more entailment. 
    If ![](https://render.githubusercontent.com/render/math?math=C_i) is upward monotone, let 
    ![](https://render.githubusercontent.com/render/math?math=P_{hid}%20\iff%20\neg%20P_{out}), since more hypernymy means 
    less entailment.
    <br/><br/>

  If the NLI model ![](https://render.githubusercontent.com/render/math?math=f\circ%20g) behaves compositionally, 
  the order ![](https://render.githubusercontent.com/render/math?math=P_{hid}) of the hypernymy scores in the hidden layer 
  should be reflected in the order ![](https://render.githubusercontent.com/render/math?math=P_{out}) of the entailment 
  scores in the output.
  <br/><br/>

  We build a dataset ![](https://render.githubusercontent.com/render/math?math=\mathcal{D}) of *292* insertion pairs and 
  repeat our experiment with *211* contexts, for a total of about *9*M test cases. We choose a 
  [fine-tuned version of RoBERTa](https://huggingface.co/roberta-large-mnli) 
  for NLI as our model.
  <br/><br/>

* **Three-way transitivity of lexical relations**

    An NLP model that generalises correctly should exhibit *transitive* behaviour under the right circumstances 
    ([Yanaka et al., 2021](https://doi.org/10.18653/v1/2021.eacl-main.78)). That is, if the model predicts a transitive 
    linguistic property over the input pairs ![](https://render.githubusercontent.com/render/math?math=(x_1,%20x_2)) and 
    ![](https://render.githubusercontent.com/render/math?math=(x_2,%20x_3)) then it should also predict it for the pair 
    ![](https://render.githubusercontent.com/render/math?math=(x_1,%20x_3)). Here, we propose to test this behaviour in a metamorphic way.

    The three source inputs ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_1,\mathbf{x}_2,\mathbf{x}_3) 
    are combined to form all possible input pairs ![](https://render.githubusercontent.com/render/math?math=\mathbf{x}_{ij}=(\mathbf{x}_i,\mathbf{x}_j)). 
    Then, we can test whether their corresponding outputs are transitive with the following output property:

    ![](https://render.githubusercontent.com/render/math?math=P:%20\quad%20v(y_{12})%20\land%20v(y_{23})%20\Rightarrow%20v(y_{13}))

    where ![](https://render.githubusercontent.com/render/math?math=v(\cdot):\mathcal{Y}\to\{0,1\}) is the Boolean prediction of model *f*.
    <br/><br/>
    
    We apply this metamorphic formulation to test the transitivity of lexical semantic relations, e.g. synonymy and 
    hypernymy ([Santus et al., 2016](https://aclanthology.org/W16-5309)).
    We reproduce a state-of-the-art model for lexical relations ([Wachowiak et al., 2020](https://aclanthology.org/2020.cogalex-1.7)), which is a fine-tuned 
    version of the multi-lingual transformer model XLM-RoBERTa ([Conneau et al., 2020](https://doi.org/10.18653/v1/2020.acl-main.747)). 
    We extract the multi-lingual lexical relation test set from the CogALex-VI shared task ([Santus et al., 2016](https://aclanthology.org/W16-5309)) 
    and generate all possible source triplets from its corpus of words.

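The three-way transitivity property can be checked without any ground truth by enumerating triplets and counting how often the implication holds. The sketch below is only an illustration and not the code of the experiments: `transitivity_safety` and `toy_predict` are hypothetical names, the toy word pairs are made up, and we assume that triplets are ordered and made of distinct words. In practice, `predict` would wrap the Boolean prediction *v* of the lexical relation model.

```python
from itertools import permutations
from typing import Callable, Iterable


def transitivity_safety(words: Iterable[str],
                        predict: Callable[[str, str], bool]) -> float:
    """Fraction of ordered triplets where v(y12) and v(y23) implies v(y13)."""
    total = 0
    satisfied = 0
    for x1, x2, x3 in permutations(list(words), 3):
        total += 1
        premise = predict(x1, x2) and predict(x2, x3)
        # The implication fails only if the premise holds and the conclusion does not.
        if not premise or predict(x1, x3):
            satisfied += 1
    return satisfied / total if total else 1.0


# Hypothetical stand-in for the model: a fixed set of predicted "synonym" pairs.
toy_pairs = {("big", "large"), ("large", "huge"), ("big", "huge"),
             ("huge", "vast")}


def toy_predict(a: str, b: str) -> bool:
    return (a, b) in toy_pairs


safety = transitivity_safety(["big", "large", "huge", "vast"], toy_predict)
print(f"safety = {safety:.3f}")
```

In this toy example, two of the 24 triplets violate the property: the model predicts the pairs (*large*, *huge*) and (*huge*, *vast*) but not (*large*, *vast*), and likewise for (*big*, *huge*, *vast*).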

### Evaluation strategy
We evaluate a model by computing the ratio of inputs for which the applicable metamorphic relation holds. 
We call this ratio the ***safety*** of the evaluated model.

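For the two pairwise relations, this ratio can be computed from the scores alone. The following is a minimal sketch under stated assumptions, not the implementation used in the experiments: `pairwise_safety` is a hypothetical helper, the order property is assumed to be the strict comparison *a > b*, all ordered pairs of distinct inputs are counted, and the example scores are made up. For the sentiment experiment, the two arguments would be the source and follow-up sentiment scores. For the NLI experiment, they would be the hypernymy and entailment scores, with `invert=True` for upward monotone contexts.

```python
import numpy as np


def pairwise_safety(src_scores, flw_scores, invert=False):
    """Ratio of ordered pairs (i, j), i != j, for which the relation holds.

    The relation is P_src <=> P_flw, or P_src <=> not P_flw when invert=True,
    where P_ord(a, b) is assumed to mean a > b.
    """
    src = np.asarray(src_scores, dtype=float)
    flw = np.asarray(flw_scores, dtype=float)
    p_src = src[:, None] > src[None, :]  # order property on the first scores
    p_flw = flw[:, None] > flw[None, :]  # order property on the second scores
    if invert:
        p_flw = ~p_flw
    holds = p_src == p_flw
    off_diagonal = ~np.eye(len(src), dtype=bool)  # skip pairs of an input with itself
    return holds[off_diagonal].mean()


# Made-up scores for three inputs before and after the transformation.
source = [0.9, 0.2, 0.6]
follow_up = [0.8, 0.5, 0.4]
print(pairwise_safety(source, follow_up))               # relation P_src <=> P_flw
print(pairwise_safety(source, follow_up, invert=True))  # relation P_src <=> not P_flw
```

The comparison matrices grow quadratically with the number of inputs, so a dataset of the size used in the experiments would likely need to be processed in chunks rather than in a single call.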

### Computing infrastructure used for the experiments
* Pairwise systematicity of sentiment, Pairwise compositionality of NLI

      Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz   2.40 GHz
      16 GB RAM
      Windows 10 64 bit

* Three-way transitivity of lexical relations

      AMD Ryzen Threadripper 2950X
      128 GB RAM
      Nvidia GeForce RTX 3060 (12 GB VRAM) 
      Linux Mint 20.1





