# telicity
Modeling aspectual classes of verbs with distributional semantics

### Installation

* The code requires Python 3.6 or later
* Ideally, the code is run from a virtual environment (e.g. using `venv` or `conda`)
* Setup with `venv` works as follows:
	```bash
	# Create virtual environment
	python3 -m venv /path/to/some/directory/

	# Activate virtualenv
	source /path/to/some/directory/bin/activate
	```
* Install the project requirements
	```bash
	pip install -r requirements.txt
	```
* Install the project itself in development mode (installing the project is the easiest way to avoid dealing with `PYTHONPATH` and related issues)
	```bash
	pip install -e .
	```

### Experiments

#### 1. Predict telicity from word embeddings

##### Download the fastText embeddings

Run the script `download_fasttext_embeddings.sh`.

The script downloads the unaligned and aligned fastText embeddings for the specified language and then processes them for further use.

```bash
./telicity/corpora/bash/download_fasttext_embeddings.sh -l <ISO-639-1 LANGUAGE CODE> -o <OUTPUT_PATH> -p <PYTHON_SCRIPT_PATH>
```

* `-l` specifies the 2-letter ISO 639-1 language code of the given language, e.g. `de` for German, `en` for English or `fa` for Farsi
* `-o` specifies the path where the fastText embeddings should be downloaded to
* `-p` specifies the path to the `telicity/corpora/preprocess.py` script.

Once the embeddings are downloaded and processed, they can be accessed through the thin `FastTextEmbedding` wrapper:

```python
from telicity.models.embedding import FastTextEmbedding
from scipy.spatial.distance import cosine
import numpy as np

# Load embeddings
emb = FastTextEmbedding(embedding_path='/path/to/fasttext_wiki-en_dim-300.kvec')

# Access embeddings (calculate cosine similarity between a few items)
print(1 - cosine(emb['dog'], emb['cat']))
# > 0.638044954320572

print(1 - cosine(emb['dog'], emb['car']))
# > 0.2908740896556926

print('aardvark' in emb)
# > True

# The embedding object also supports `get` instead of directly accessing keys
oov = np.zeros((emb.dimensionality(),))
vec = emb.get('guitar', oov)

# Individual embeddings are just `np.ndarray`s and can be added, subtracted, etc.
small_dog = emb['small'] + emb['dog']
print(1 - cosine(small_dog, emb['dog']))
# > 0.7833679789954273

print(1 - cosine(small_dog, emb['cat']))
# > 0.5253284443759026
```

##### Set up an experiment file

An experiment is configured by a flat CSV file that specifies some of the input parameters. One such file is at `telicity/experiments/resources/experiments/mono_lingual/experiment_lr_en.csv`.

The experiment file defines the following parameters:

* `vector_file`: The fastText embeddings to use for this experiment
* `dataset_file`: The dataset to use for this experiment
* `lowercase`: `True` if the input from the `dataset_file` should be lowercased, `False` otherwise
* `num_folds`: The number of folds for the cross-validation setup
* `random_seed`: An integer defining the random seed (for easy reproducibility)
* `dataset_load_function`: Fully qualified name to the function that loads the data
* `data_index`: Index (0-based) of the column in the `dataset_file` where the data is (e.g. full sentence or just the verb)
* `target_index`: Index (0-based) of the column where the target/class label is
* `skip_header`: `True` if the `dataset_file` contains a header, `False` otherwise
* `evaluation_mode`: `cross_validation` for performing k-fold cross-validation, `train_test` for a single train/test split

##### Create the `dataset_load_function`

This function is responsible for loading the dataset file and returning the data items and the target items. An example is `telicity.util.data_util.get_telicity_dataset_en`.

The function receives 4 input parameters (specified in the experiment file described above):

* `dataset_file`: The file to load
* `data_index`: The index at which to find the data 
* `target_index`: The index at which to find the targets/class labels
* `skip_header`: Whether or not the file contains a header line

The function should return a tuple `(data, targets)`, where the data and the targets are each a Python list.
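
As a rough illustration of this contract, a loader could look like the sketch below. The function name `load_dataset` is hypothetical, and the tab-separated file layout is an assumption made for the example; this is not the implementation of `get_telicity_dataset_en`.

```python
import csv


def load_dataset(dataset_file, data_index, target_index, skip_header):
    """Hypothetical loader: returns (data, targets) as two Python lists."""
    data, targets = [], []
    with open(dataset_file, newline='', encoding='utf-8') as f:
        # Assumption: the dataset is tab-separated; adjust the delimiter as needed
        reader = csv.reader(f, delimiter='\t')
        if skip_header:
            next(reader, None)  # Drop the header line
        for row in reader:
            if not row:
                continue  # Ignore blank lines
            data.append(row[data_index])
            targets.append(row[target_index])
    return data, targets
```

Such a function would then be referenced by its fully qualified name in the `dataset_load_function` field of the experiment file.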

##### Running the Experiment

The experiment works as follows:

* Load fastText embeddings
* Load dataset
* Vectorise data and labels - If data is a full sentence or a verb phrase (rather than just a verb), the corresponding embeddings will be averaged.
* Define a `StratifiedKFold` cross-validation setup (stratification ensures that the label distribution in each split is roughly the same as in the whole dataset) or a `train_test` split setup.
* Train a simple Logistic Regression classifier with embeddings as inputs to predict aspectuality as output.
* The script compares the performance of the Logistic Regression classifier with a majority class baseline.
* The script also stores some basic analysis resources such as a confusion matrix of the results and a table indicating which contexts are most/least useful for predicting telicity.
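
The core of these steps can be sketched with `numpy` and `scikit-learn` as follows. This is a minimal illustration under stated assumptions, not the code in `telicity.experiments.experiment_lr`: the helper names `vectorise` and `cross_validate` are hypothetical, a small dictionary of random vectors stands in for the fastText embeddings, and phrases without any known token are assumed to map to a zero vector.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def vectorise(phrase, emb, dim):
    """Average the vectors of all in-vocabulary tokens of a phrase."""
    vectors = [emb[tok] for tok in phrase.split() if tok in emb]
    # Assumption: phrases with no known token map to a zero vector
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


def cross_validate(X, y, num_folds=5, random_seed=42):
    """Return mean accuracy of logistic regression and of a majority baseline."""
    skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=random_seed)
    lr_scores, baseline_scores = [], []
    for train_idx, test_idx in skf.split(X, y):
        lr = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        baseline = DummyClassifier(strategy='most_frequent').fit(X[train_idx], y[train_idx])
        lr_scores.append(lr.score(X[test_idx], y[test_idx]))
        baseline_scores.append(baseline.score(X[test_idx], y[test_idx]))
    return float(np.mean(lr_scores)), float(np.mean(baseline_scores))


# Toy stand-in for the fastText embeddings (random vectors, illustration only)
rng = np.random.default_rng(0)
dim = 10
toy_emb = {w: rng.normal(size=dim) for w in ['she', 'ate', 'slept', 'the', 'apple', 'ran']}

# Toy dataset: lowercased phrases with a telicity label each
phrases = ['she ate the apple', 'she slept', 'she ran', 'ate the apple'] * 5
labels = np.array(['telic', 'atelic', 'atelic', 'telic'] * 5)
X = np.vstack([vectorise(p, toy_emb, dim) for p in phrases])

lr_acc, baseline_acc = cross_validate(X, labels, num_folds=5, random_seed=42)
print(f'LR accuracy: {lr_acc:.2f}, majority baseline: {baseline_acc:.2f}')
```

With real embeddings, `toy_emb` would be replaced by a `FastTextEmbedding` instance, which supports the same `in` and key-access operations.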

The experiment can be run as follows:
```bash
python -m telicity.experiments.experiment_lr -cn lr_en -ip path/to/fastText/embeddings/ -ip2 path/to/dataset/file -ef mono_lingual/experiment_lr_en.csv -op /path/to/_results/lr_en -obs file
```

* `-cn`: config_name of the experiment (only used for housekeeping the different experiment runs)
* `-ip`: input path to the fastText embeddings
* `-ip2`: input path to the dataset
* `-ef`: the experiment file to use for this experiment
* `-op`: output path for the analysis resources
* `-obs`: observer for the experiment (I am using `sacred` for recording experiments). The observer specifies how experiments should be tracked; the option `file` uses a basic file structure to store the parameterisations and results of the different experiments.

##### Analysing the Results

The Jupyter Notebook at `/path/to/telicity/telicity/experiments/resources/notebooks/telicity_analysis.ipynb` contains the code to load the confusion matrix from file and produce a heatmap as output.