
# Probing for idiomaticity in vector space models

Prerequisite: Python >= 3.7

Execute the command `pip3 install -r requirements.txt` to install all the required libraries.

Run the `download_models.sh` script in the `models/` folder to download the pre-trained weights for GloVe Portuguese, GloVe English, and SBERT Multilingual, which are not available through [transformers][transformer_python].

To generate the cosine similarities for a model, run:

```bash
python similarity.py <dataset_path> <columns to execute similarity> <model architecture> <pretrained weights name or path> <output>
        -l <layers to consider>
```
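As a reminder of what the script reports, the following is a minimal sketch of cosine similarity between two vectors, written in plain Python. It is not taken from `similarity.py`; the function name `cosine_similarity` and the toy vectors are illustrative assumptions, and the real script works on model embeddings rather than hand-written lists.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
original = [1.0, 2.0, 3.0]
variant = [2.0, 4.0, 6.0]
print(cosine_similarity(original, variant))
```

Vectors pointing in the same direction score close to 1, and orthogonal vectors score 0.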

For each input pair of columns, the output file contains two columns: one ending in `-mwe` with the NC similarities and one ending in `-sent` with the Sent similarities.
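The sketch below shows one way to derive the expected output column names from an input column pair. The `-cs-` infix is inferred from the column names passed to `correlation.py` in the Examples section, not from the script's source, and the helper `output_columns` is hypothetical.

```python
def output_columns(column_pair):
    """Map an input pair such as "a,b" to its two similarity column names.

    Assumption: names follow the "<left>-<right>-cs-<suffix>" pattern seen in
    the correlation examples of this README.
    """
    left, right = [name.strip() for name in column_pair.split(",")]
    base = f"{left}-{right}-cs"
    return [f"{base}-mwe", f"{base}-sent"]


for name in output_columns("neutral sentence,mwe synonym"):
    print(name)
```

These names are what you would then pass to the `-c` option of `correlation.py`.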

The following table lists the pre-trained weights to pass to `similarity.py`:

|    Model   |               English              |             Portuguese             |
|:----------:|:----------------------------------:|:----------------------------------:|
| GloVe      | models/glove.840B.300d             | models/glove_pt_s300               |
| ELMo       | small                              | pt                                 |
| BERT       | bert-base-multilingual-cased       | bert-base-multilingual-cased       |
| DistilBERT | distilbert-base-multilingual-cased | distilbert-base-multilingual-cased |
| SBERT      | models/sbert_ml/0_DistilBERT/      | models/sbert_ml/0_DistilBERT/      |

The next table lists the column pairs to pass as input for each probe:

| Probe |                           Neutral                          |                                  Naturalistics                                 |
|:-----:|:----------------------------------------------------------:|:------------------------------------------------------------------------------:|
| P1    | neutral sentence,mwe synonym                               | original sentence,synonym for compound                                         |
| P2    | neutral sentence,modifier only; neutral sentence,head only | original sentence,original head only; original sentence,original modifier only |
| P3    | neutral sentence,both synonyms                             | original sentence,synonym both                                                 |
| P4    | neutral sentence,compound noun                             | original sentence,compound noun                                                |

To measure the correlation with the gold standard, run:

```bash
python correlation.py <list of files containing cosine similarities> -c <list of the columns containing the similarities> -t <gold standard CSV file path> -tc <column in gold standard with idiomaticity score> -o <path to output result>
```
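This README does not state which coefficient `correlation.py` computes. As an illustration only, the sketch below computes a Spearman rank correlation between a list of similarities and a list of gold scores using the standard library. The choice of Spearman, the function names, and the toy values are all assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical similarities and gold idiomaticity scores for four compounds.
similarities = [0.91, 0.40, 0.75, 0.62]
gold_scores = [4.5, 1.2, 3.9, 2.0]
print(spearman(similarities, gold_scores))
```

A rank correlation only compares orderings, so it does not require the similarities and the gold scores to be on the same scale.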

# Examples

For example, to run the GloVe model on the Portuguese neutral dataset for probe 1:

```bash
python similarity.py ./datasets/pt/neutral.csv "neutral sentence,mwe synonym" glove models/glove_pt_s300 results/pt/P1/glove/results.csv

python correlation.py results/pt/P1/glove/results.csv -c "neutral sentence-mwe synonym-cs-mwe" -t datasets/pt/gold_standard.csv -tc compositionality -o results/pt/P1/glove/results_corr_nc.csv
```

The first command generates a CSV file with two columns, `neutral sentence-mwe synonym-cs-mwe` and `neutral sentence-mwe synonym-cs-sent`, containing the cosine similarities for each compound.

Another example, running BERT on the English naturalistic dataset for probe 2:

```bash
python similarity.py ./datasets/en/naturalistics/naturalistics_examplesent1.csv "original sentence,original head only" "original sentence,original modifier only" bert bert-base-multilingual-cased results/en/P2/bert/results_sent1.csv -l="-1,-2,-3,-4" -b 32 -U

python similarity.py ./datasets/en/naturalistics/naturalistics_examplesent2.csv "original sentence,original head only" "original sentence,original modifier only" bert bert-base-multilingual-cased results/en/P2/bert/results_sent2.csv -l="-1,-2,-3,-4" -b 32 -U

python similarity.py ./datasets/en/naturalistics/naturalistics_examplesent3.csv "original sentence,original head only" "original sentence,original modifier only" bert bert-base-multilingual-cased results/en/P2/bert/results_sent3.csv -l="-1,-2,-3,-4" -b 32 -U

python correlation.py results/en/P2/bert/results_sent1.csv results/en/P2/bert/results_sent2.csv results/en/P2/bert/results_sent3.csv -c "original sentence-original head only-cs-mwe" "original sentence-original modifier only-cs-mwe" -t datasets/en/gold_standard.csv -tc compositionality -o results/en/P2/bert/results_corr_nc.csv

python correlation.py results/en/P2/bert/results_sent1.csv results/en/P2/bert/results_sent2.csv results/en/P2/bert/results_sent3.csv -c "original sentence-original head only-cs-sent" "original sentence-original modifier only-cs-sent" -t datasets/en/gold_standard.csv -tc compositionality -o results/en/P2/bert/results_corr_sent.csv

```
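The three `similarity.py` calls above differ only in the sentence index, so they can be generated in a loop. The sketch below builds the argument lists and prints them; the helper `similarity_command` is hypothetical, and the paths and flags are copied from the example above. Uncomment the `subprocess.run` line to actually run the commands.

```python
import shlex
import subprocess  # only needed if the run line below is uncommented


def similarity_command(index):
    """Build the similarity.py argument list for one example-sentence file."""
    dataset = f"./datasets/en/naturalistics/naturalistics_examplesent{index}.csv"
    output = f"results/en/P2/bert/results_sent{index}.csv"
    return [
        "python", "similarity.py", dataset,
        "original sentence,original head only",
        "original sentence,original modifier only",
        "bert", "bert-base-multilingual-cased", output,
        # Same argument the shell produces from -l="-1,-2,-3,-4".
        "-l=-1,-2,-3,-4", "-b", "32", "-U",
    ]


commands = [similarity_command(i) for i in (1, 2, 3)]
for command in commands:
    # Print a shell-safe rendering of each command for inspection.
    print(" ".join(shlex.quote(part) for part in command))
    # subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with column names that contain spaces.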

[transformer_python]: https://github.com/huggingface/transformers

