# Relieving the Information Asymmetry Issue in Similarity-based Word Sense Disambiguation

This repository contains the open-source code for COE, a Context-Oriented Embedding technique for learning context embeddings in similarity-based word sense disambiguation (WSD). The framework is based on [BEM](https://github.com/facebookresearch/wsd-biencoders), and we also use some modules from [SREF](https://github.com/lwmlyy/SREF). We thank the authors for releasing their code.
For evaluation, we use sense embedding sets from SREF, [LMMS](https://github.com/danlou/LMMS) and [ARES](http://sensembert.org/ares). We also thank the authors for releasing their sense embeddings.


## Table of Contents
- [Requirements](#requirements)
- [Sense Embedding](#get-sense-embedding)
- [Context Embedding](#retrieve-context-embedding)
- [WSD Evaluation](#wsd-evaluation)


### Requirements

The project relies on Anaconda for the basic packages; the remaining dependencies are listed in `requirements.txt`. Install them with:

```bash
pip install -r requirements.txt
```

The NLTK WordNet data must be downloaded separately:

```bash
$ python -c "import nltk; nltk.download('wordnet')"
```

### Get Sense Embedding
We use the three sense embedding sets mentioned above (SREF, LMMS and ARES). Note that, before using the ARES sense embeddings, you need to separate them into their 1024-dim supervised and knowledge-based parts. Place the sense embeddings in the `./data/vectors` folder.
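The separation step can be sketched as follows. This is only an illustration, not the repository's own script: it assumes each vector is stored as one text line of the form `key v1 v2 ...` holding two concatenated 1024-dim parts, and the function names and file paths are hypothetical. Which half is the supervised part and which is the knowledge-based part is not stated here, so check it against the ARES release before naming the output files.

```python
def split_embedding_line(line, half_dim=1024):
    """Split one 'key v1 v2 ...' line into two lines of half_dim values each."""
    key, *values = line.split()
    if len(values) != 2 * half_dim:
        raise ValueError(
            f"expected {2 * half_dim} values for {key}, got {len(values)}"
        )
    first = " ".join([key] + values[:half_dim])
    second = " ".join([key] + values[half_dim:])
    return first, second


def split_embedding_file(src_path, first_path, second_path, half_dim=1024):
    """Write the first and second halves of every vector to separate files."""
    with open(src_path, encoding="utf-8") as src, \
         open(first_path, "w", encoding="utf-8") as out_a, \
         open(second_path, "w", encoding="utf-8") as out_b:
        for line in src:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if len(parts) == 2:
                # Assumed word2vec-style header "<count> <dim>": rewrite the dimension.
                header = f"{parts[0]} {half_dim}\n"
                out_a.write(header)
                out_b.write(header)
                continue
            first, second = split_embedding_line(line, half_dim)
            out_a.write(first + "\n")
            out_b.write(second + "\n")
```

For example, `split_embedding_file("ares_full.txt", "./data/vectors/ares_a.txt", "./data/vectors/ares_b.txt")` would produce two 1024-dim files; all three file names are placeholders.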

If you want to learn the SREF sense embeddings from scratch, run the modified `emb_glosses.py` to obtain the knowledge-based sense embeddings (download the augmented glosses from [SREF](https://github.com/lwmlyy/SREF) and put them in the `./` folder first). Then run `context-wsd.py` to obtain the SemCor sense embeddings.

```bash
$ python emb_glosses.py -emb_strategy aug_gloss+examples
$ python context-wsd.py --task emb-lmms
```

### Retrieve Context Embedding
To learn context embeddings, set `--task` to `context_vec`.

#### local context embedding
For local context embedding, we use the sentences surrounding the ambiguous sentence on its left and right. The number of sentences was tuned on SemEval2007; the best setting is 2 sentences on each side. Set the sentence number with the following parameters:

- `--context_lenw 2` (window)
- `--context_mode ['global']`
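The windowing idea can be pictured with a minimal sketch. It assumes the document is already split into a list of sentences; `local_context` and its arguments are hypothetical names, not functions from this repository.

```python
def local_context(sentences, target_idx, window=2):
    """Return the target sentence plus up to `window` sentences on each side."""
    start = max(0, target_idx - window)                 # clip at document start
    end = min(len(sentences), target_idx + window + 1)  # clip at document end
    return sentences[start:end]


doc = ["s0", "s1", "s2", "s3", "s4", "s5"]
context = local_context(doc, target_idx=3, window=2)
# context == ["s1", "s2", "s3", "s4", "s5"]
```

Near the start or end of a document the window is simply truncated, so the first sentence of `doc` gets only its two right-hand neighbours.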

#### global context embedding
For global context embedding, we devise three strategies (WO, tfidf-WO, and GeWO) to rank all other sentences in the same document as the ambiguous sentence. The number of sentences was again tuned on SemEval2007; the best setting is the top 2 ranked sentences. Set the sentence number with the following parameter:

- `--context_lens 2` (local)

To choose the scoring strategy, use the following parameter:

- `--context_mode ['wo-select', 'tfidfwo-select', 'gewo-select', 'global']`
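The ranking-and-selection step can be sketched as below. This README does not define the scoring functions, so the sketch assumes that WO is a plain word-overlap count between the ambiguous sentence and each other sentence; the tfidf-WO and GeWO variants are not covered. All function names are hypothetical and whitespace tokenization is a simplification.

```python
def word_overlap(sentence_a, sentence_b):
    """Count distinct lowercase tokens shared by two sentences (assumed WO score)."""
    return len(set(sentence_a.lower().split()) & set(sentence_b.lower().split()))


def top_ranked_sentences(sentences, target_idx, top_k=2):
    """Rank all other sentences by overlap with the target and keep the top_k."""
    target = sentences[target_idx]
    scored = [
        (word_overlap(target, sent), idx)
        for idx, sent in enumerate(sentences)
        if idx != target_idx
    ]
    # Highest score first; ties are broken by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sentences[idx] for _, idx in scored[:top_k]]


doc = [
    "the bank raised interest rates",
    "we sat on the river bank",
    "interest rates at the bank fell",
    "cats sleep",
]
selected = top_ranked_sentences(doc, target_idx=0, top_k=2)
# selected == ["interest rates at the bank fell", "we sat on the river bank"]
```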

Run the following commands to obtain both the local and global context embeddings before disambiguation:
```bash
$ python context-wsd.py --task context_vec --context_mode global --context_lens 2
$ python context-wsd.py --task context_vec --context_mode wo-select --context_lens 2
```

### WSD evaluation
For evaluation, download the framework from [wsdeval](http://lcl.uniroma1.it/wsdeval/home) and put it in the `./data` folder. Make sure to download the whole framework, including the training and test datasets and the `data_validation` file, which lists all potential senses for each lemma.
Once the context and sense embeddings are ready, run the following command to conduct WSD:

```bash
$ python context-wsd.py --task wsd-kb --sec_wsd
```

| Model | SE2 | SE3 | SE07 | SE13 | SE15 | ALL |
|-------|-----|-----|------|------|------|-----|
| COEkb | 76.0 | 74.2 | 69.2* | 78.2 | 80.9 | 76.3* |
| COEsup | 80.3 | 77.6 | 73.6* | 80.7 | 82.3 | 79.6* |
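For readers new to similarity-based WSD, the disambiguation step can be pictured as a nearest-neighbour lookup: the context embedding is compared with the embedding of each candidate sense, and the most similar sense is chosen. The sketch below assumes cosine similarity and uses toy 2-dim vectors with made-up sense keys; it is not the code in `context-wsd.py`.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(context_vec, candidate_senses):
    """Return the key of the candidate sense most similar to the context vector."""
    return max(
        candidate_senses,
        key=lambda sense_key: cosine(context_vec, candidate_senses[sense_key]),
    )


# Toy example: hypothetical sense keys and 2-dim embeddings.
senses = {"bank%money": [1.0, 0.0], "bank%river": [0.0, 1.0]}
prediction = disambiguate([0.9, 0.1], senses)
# prediction == "bank%money"
```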



