<img src="img/icon.png" width=125 height=125 align="right">

# A Statutory Article Retrieval Dataset in French

This repository contains:

* The Belgian Statutory Article Retrieval Dataset (BSARD) v1.0.
* Code for training and evaluating the retrieval models.
* A web application for visualizing statistics about BSARD.

## Belgian Statutory Article Retrieval Dataset (BSARD)

### Dataset Nutrition Labels

We provide the *dataset nutrition labels* [(Holland et al., 2018)](https://arxiv.org/abs/1805.03677) for BSARD.

<p align="center"><img align="center" src="img/nutrition.png" width="70%"></p>

### Visualize

We provide a [Dash](https://plotly.com/dash/) web application that visualizes statistics about BSARD.

<p align="center"><img src="img/eda.gif" width="80%" height="auto"></p>

To explore the visualizations on your local machine, run:

```bash
python scripts/eda/visualise.py
```

## Experiments

### Setup

This repository is tested on Python 3.8+. To install all dependencies, make sure [conda](https://docs.conda.io/projects/conda/en/latest/index.html) is installed on your machine, then run:

```bash
conda env create -f environment.yml
conda activate bsard
```

In addition, please install spaCy's [fr_core_news_md](https://spacy.io/models/fr#fr_core_news_md) pipeline (needed for text processing) by running:

```bash
python -m spacy download fr_core_news_md
```

Lastly, download the pre-trained French [fastText](https://fasttext.cc/docs/en/crawl-vectors.html#models) and [word2vec](https://fauconnier.github.io/#data) embeddings by running:

```bash
bash scripts/experiments/utils/download_embeddings.sh
```

### Lexical Models

To evaluate the TF-IDF and BM25 models, run:

```bash
# Replace the <...> placeholders with real paths and choose one retriever model: tfidf or bm25.
python scripts/experiments/run_zeroshot_evaluation.py \
    --articles_path </path/to/articles.csv> \
    --questions_path </path/to/questions_test.csv> \
    --retriever_model {tfidf, bm25} \
    --lem \
    --output_dir </path/to/output>
```
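
For readers unfamiliar with lexical retrieval, the following is a minimal sketch of how BM25 ranks articles against a question. It is not the implementation used by `run_zeroshot_evaluation.py`: the function names, the whitespace tokenizer (a stand-in for lemmatization), the `k1` and `b` defaults, and the toy articles are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real lemmatization)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF, always positive.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy articles (hypothetical, not taken from BSARD).
articles = [
    "le bail prend fin à l'expiration du terme convenu",
    "le locataire doit payer le loyer aux termes convenus",
    "le mariage est dissous par le divorce",
]
scores = bm25_scores("payer le loyer", articles)
ranking = sorted(range(len(articles)), key=lambda i: scores[i], reverse=True)
print(ranking[0])  # index of the best-matching article: 1
```

The rare terms "payer" and "loyer" dominate the score, while the frequent word "le" contributes little because it appears in every article.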

### Dense Models

#### Zero-Shot Evaluation

To evaluate the bi-encoder models in a zero-shot setup, run:

```bash
# Choose one retriever model: word2vec, fasttext, or camembert.
# --lem (only for word2vec and fastText) lemmatizes both articles and questions as pre-processing.
python scripts/experiments/run_zeroshot_evaluation.py \
    --articles_path </path/to/articles.csv> \
    --questions_path </path/to/questions_test.csv> \
    --retriever_model {word2vec, fasttext, camembert} \
    --lem \
    --output_dir </path/to/output>
```
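
As a rough illustration of what a bi-encoder does at retrieval time, the sketch below encodes the question and each article independently and ranks articles by cosine similarity. It assumes mean pooling over word vectors, which is a common choice for word2vec and fastText but is not confirmed to be what this repository does; the tiny `WORD_VECTORS` table, its values, and all function names are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors; real embeddings have hundreds of dimensions.
WORD_VECTORS = {
    "loyer": [0.9, 0.1, 0.0],
    "bail": [0.8, 0.2, 0.1],
    "locataire": [0.7, 0.3, 0.0],
    "divorce": [0.0, 0.1, 0.9],
    "mariage": [0.1, 0.0, 0.8],
}


def encode(text):
    """Mean-pool the vectors of known words; unknown words are ignored."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(question, articles):
    """Return article indices sorted from most to least similar to the question."""
    q = encode(question)
    scores = [cosine(q, encode(a)) for a in articles]
    return sorted(range(len(articles)), key=lambda i: scores[i], reverse=True)


articles = ["bail locataire", "mariage divorce"]
print(rank("loyer", articles))    # [0, 1]
print(rank("divorce", articles))  # [1, 0]
```

Because questions and articles are encoded separately, article vectors can be computed once and reused for every question. Note that "loyer" retrieves the tenancy article even though the two share no word, which is something a purely lexical model cannot do.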

#### Training

To train a bi-encoder model, first set the model parameters and training hyperparameters in *scripts/experiments/train_biencoder.py*, then run:

```bash
python scripts/experiments/train_biencoder.py
```
