# Generalized Zero-Shot Intent Classification

## Data

### Datasets
1. Schema Guided Dataset (sgd)
   
    The preprocessed dataset is stored in `data/sgd`. The predefined split `data/sgd/split_original` corresponds to the train/dev/test subsets of the original dataset.

2. MultiWoZ (multiwoz)
   
    The preprocessed dataset is stored in `data/multiwoz`.
   
3. CLINC (clinc)
    
    The preprocessed dataset is stored in `data/clinc`.

### Data directory structure

Each dataset is stored unsplit in `all.csv`.

All predefined splits are stored in the dataset root directory.

All information about intents is stored in the `intent_info` folder.
`intent_info/descriptions` contains the intent descriptions and the different types of patterns.

Intent and utterance similarity data for negative sampling are stored in the `intent_info/intent_similarity` and `uttr_similarity` directories, respectively. The two sampling strategies use different formats because there are far fewer intents than utterances: for `intents`, the file holds a full similarity matrix; for `utterances`, it holds a vector of indices of similar out-of-class utterances for each sample in the dataset.
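
The difference between the two formats can be illustrated with a minimal sketch. It assumes the data has already been loaded into plain Python lists; the function names, the toy values, and the choice of taking the top-`k` entries are hypothetical and are not taken from the repository code.

```python
from typing import List, Sequence


def negative_intents(sim_matrix: Sequence[Sequence[float]],
                     intent_id: int, k: int) -> List[int]:
    """Return the k intents most similar to `intent_id`, excluding itself.

    `sim_matrix` is a square intent-by-intent similarity matrix.
    """
    scores = sim_matrix[intent_id]
    candidates = [j for j in range(len(scores)) if j != intent_id]
    # Most similar intents first: these make the hardest negatives.
    candidates.sort(key=lambda j: scores[j], reverse=True)
    return candidates[:k]


def negative_utterances(neighbour_ids: Sequence[Sequence[int]],
                        sample_id: int, k: int) -> List[int]:
    """Return the first k precomputed out-of-class neighbours of a sample.

    `neighbour_ids[i]` is the vector of indices stored for sample i.
    """
    return list(neighbour_ids[sample_id][:k])


# Toy data (assumed): 3 intents and 4 utterances.
intent_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
uttr_neighbours = [[2, 3], [3, 2], [0, 1], [1, 0]]

print(negative_intents(intent_sim, 0, k=2))
print(negative_utterances(uttr_neighbours, 2, k=1))
```

A full matrix is affordable for intents because its size grows with the square of a small number, while for utterances only a fixed-length list of neighbour indices per sample is kept.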

```
├── data
│   ├── clinc
│   │   ├── all.csv
│   │   ├── intent_info
│   │   │   ├── actions.json
│   │   │   ├── concepts.json
│   │   │   ├── descriptions
│   │   │   │   ├── d1_pattern.json
│   │   │   │   ├── d2_pattern.json
│   │   │   │   ├── d3_pattern.json
│   │   │   │   ├── d4_pattern.json
│   │   │   │   ├── names.json
│   │   │   │   ├── q1_pattern.json
│   │   │   │   ├── q2_pattern.json
│   │   │   └── intent_similarity
│   │   │       └── sentence_bert
│   │   │           ├── raws.json
│   │   │           └── similarity.txt
│   │   └── uttr_similarity
│   │       └── sentence_bert_100.txt
│   ├── multiwoz
│   │   ├── all.csv
│   │   ├── intent_info
│   │   │   ├── actions.json
│   │   │   ├── concepts.json
│   │   │   ├── descriptions
│   │   │   │   ├── d1_pattern.json
│   │   │   │   ├── d2_pattern.json
│   │   │   │   ├── d3_pattern.json
│   │   │   │   ├── d4_pattern.json
│   │   │   │   ├── names.json
│   │   │   │   ├── q1_pattern.json
│   │   │   │   ├── q2_pattern.json
│   │   │   └── intent_similarity
│   │   │   │   └── sentence_bert
│   │   │   │       ├── raws.json
│   │   │   │       └── similarity.txt
│   │   └── uttr_similarity
│   │       └── sentence_bert_100.txt
│   ├── sgd
│   │   ├── all.csv
│   │   ├── intent_info
│   │   │   ├── actions.json
│   │   │   ├── concepts.json
│   │   │   ├── descriptions
│   │   │   │   ├── d1_pattern.json
│   │   │   │   ├── d2_pattern.json
│   │   │   │   ├── d3_pattern.json
│   │   │   │   ├── d4_pattern.json
│   │   │   │   ├── names.json
│   │   │   │   ├── original.json
│   │   │   │   ├── q1_pattern.json
│   │   │   │   ├── q2_pattern.json
│   │   │   ├── intent_similarity
│   │   │   │   └── sentence_bert
│   │   │   │       ├── raws.json
│   │   │   │       └── similarity.txt
│   │   ├── split_composition
│   │   │   ├── dev.csv
│   │   │   ├── test.csv
│   │   │   ├── train.csv
│   │   │   └── zeroshot_intents.json
│   │   ├── split_hard
│   │   │   ├── dev.csv
│   │   │   ├── test.csv
│   │   │   ├── train.csv
│   │   │   └── zeroshot_intents.json
│   │   ├── split_original
│   │   │   ├── dev.csv
│   │   │   ├── test.csv
│   │   │   ├── train.csv
│   │   │   ├── uttr_similarity
│   │   │   │   └── sentence_bert_100.txt
│   │   │   └── zeroshot_intents.json
```
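As a rough illustration of how the layout above maps to file paths, the following sketch resolves the main files for one dataset. The helper `dataset_paths` and its arguments are hypothetical; only the directory and file names come from the tree.

```python
from pathlib import Path
from typing import Dict


def dataset_paths(root: str, dataset: str, split: str = "",
                  description_type: str = "names") -> Dict[str, Path]:
    """Collect paths for one dataset, following the tree shown above.

    With an empty `split`, only the unsplit `all.csv` is returned as data.
    """
    base = Path(root) / dataset
    info = base / "intent_info"
    paths = {
        "descriptions": info / "descriptions" / f"{description_type}.json",
        "intent_similarity": info / "intent_similarity" / "sentence_bert" / "similarity.txt",
    }
    if split:
        split_dir = base / split
        for part in ("train", "dev", "test"):
            paths[part] = split_dir / f"{part}.csv"
        paths["zeroshot_intents"] = split_dir / "zeroshot_intents.json"
    else:
        paths["all"] = base / "all.csv"
    return paths


sgd = dataset_paths("data", "sgd", split="split_original", description_type="d1_pattern")
for name, path in sgd.items():
    print(f"{name}: {path.as_posix()}")
```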

## How to run

### For training

```
# SGD base training
python classification/train.py dataset=sgd-origin experiment.name=/path/to/experiment/dir

# MultiWoZ base training
python classification/train.py dataset=multiwoz experiment.name=/path/to/experiment/dir

# CLINC base training
python classification/train.py dataset=clinc experiment.name=/path/to/experiment/dir
```


### For evaluation
```
# SGD evaluation
python classification/evaluate.py dataset=sgd-origin experiment.name=/path/to/experiment/dir

# MultiWoZ evaluation
python classification/evaluate.py dataset=multiwoz experiment.name=/path/to/experiment/dir

# CLINC evaluation
python classification/evaluate.py dataset=clinc experiment.name=/path/to/experiment/dir
```
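To run training and evaluation from a script, the commands above can be assembled programmatically. This is only a sketch: the helper `build_command` is hypothetical, and it assumes that further parameters can be overridden on the command line with the same `key=value` syntax used for `dataset` and `experiment.name`.

```python
import shlex
from typing import Dict, List, Optional


def build_command(script: str, dataset: str, experiment_name: str,
                  overrides: Optional[Dict[str, object]] = None) -> List[str]:
    """Build an argument list in the `key=value` style shown above."""
    cmd = ["python", script, f"dataset={dataset}", f"experiment.name={experiment_name}"]
    # Assumption: extra parameters are accepted as additional key=value pairs.
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


train_cmd = build_command(
    "classification/train.py", "clinc", "/path/to/experiment/dir",
    {"experiment.k_negative": 7, "dataset.description_type": "d1_pattern"},
)
eval_cmd = build_command("classification/evaluate.py", "clinc", "/path/to/experiment/dir")

# Print the commands; to launch them, pass each list to subprocess.run(cmd, check=True).
print(shlex.join(train_cmd))
print(shlex.join(eval_cmd))
```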


## Configs
### Config directory structure
```
├── classification
│   ├── conf
│   │   ├── config.yaml
│   │   └── dataset
│   │       ├── clinc.yaml
│   │       ├── multiwoz.yaml
│   │       └── sgd-origin.yaml

```
### Parameters
| Parameter                     |                Default               | Description                                                                                |
|------------------------------|:------------------------------------:|--------------------------------------------------------------------------------------------|
| dataset                      | All parameters specified for dataset |                                                                                            |
| dataset.name                 |                                      | Dataset and its config name                                                                |
| dataset.path                 |                                      | Relative path to split data or whole dataset                                               |
| dataset.intent_info_path     |                                      | Relative path to intent information data                                                   |
| dataset.description_type     |                                      | Type of intent description to use. Ex: `names`, `d1_pattern`, `q1_pattern`                 |
| dataset.uttr_len             |                                      | Max utterance length in tokens. Longer utterances are truncated.                           |
| dataset.desc_len             |                                      | Max intent description length in tokens. Longer descriptions are truncated.                |
| model                        |                                      |                                                                                            |
| model.base_model             |             roberta-base             | Contextualized encoder model name or path                                                  |
| model.dropout                |                  0.5                 | Linear classifier head dropout                                                             |
| model.embedding_dim          |                  768                 | Contextualized encoder embedding size                                                      |
| model.model_type             |              nli_strict              | Type of classifier and loss function. Default: sentence pair classifier and BCE loss       |
| experiment                   |                                      |                                                                                            |
| experiment.root_dir          |                  ./                  | Root path for experiments                                                                  |
| experiment.name              |                  ???                 | Experiment name (must be specified)                                                        |
| experiment.seed              |                   0                  | Random seed                                                                                |
| experiment.epochs            |        <specified for dataset>       | Epochs to train                                                                            |
| experiment.batch_size        |        <specified for dataset>       | Batch size                                                                                 |
| experiment.accum_steps       |        <specified for dataset>       | Number of gradient accumulation steps                                                      |
| experiment.sampling_strategy |                intents               | One of two sampling strategies: `intents` or `utterances`                                  |
| experiment.k_negative        |                   7                  | Number of examples for negative sampling                                                   |
| experiment.sim_matrix        |                 None                 | Name of similarity matrix for neg. sampling.                                               |
| experiment.train_only_seen   |                 True                 | Train only with seen intent descriptions or not.                                           |
| experiment.intent_desc_first |                 True                 | Whether the intent description precedes the utterance in the sentence-pair encoding.       |
| experiment.test_epoch        |                 None                 | Specify epoch for evaluation. Default: best loss epoch                                     |
| scheduler                    |                                      |                                                                                            |
| scheduler.lr                 |                 2e-5                 | Learning rate                                                                              |
| scheduler.warmup_steps       |                 0.15                 | Scheduler warmup iterations                                                                |
| checkpoint & log             |                                      |                                                                                            |
| checkpoint.save_from_epoch   |                 None                 | Epoch from which to start saving checkpoints. Default: save only best loss checkpoint      |
| checkpoint.saved_model       |                 None                 | Epoch to load model checkpoint from. Default: load from best loss checkpoint.              |
| log.print_every              |                 1000                 | Number of iterations to log loss.                                                          |

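The interplay of `dataset.uttr_len`, `dataset.desc_len` and `experiment.intent_desc_first` can be sketched as follows. This is an illustration only: the function `build_pair` is hypothetical, and whitespace tokens stand in for the subword tokens that the encoder's tokenizer would actually produce.

```python
from typing import List, Sequence, Tuple


def build_pair(utterance_tokens: Sequence[str],
               description_tokens: Sequence[str],
               uttr_len: int, desc_len: int,
               intent_desc_first: bool = True) -> Tuple[List[str], List[str]]:
    """Truncate both segments, then order them for sentence-pair encoding."""
    uttr = list(utterance_tokens[:uttr_len])     # drop tokens beyond uttr_len
    desc = list(description_tokens[:desc_len])   # drop tokens beyond desc_len
    return (desc, uttr) if intent_desc_first else (uttr, desc)


# Toy example with assumed inputs.
utterance = "book a table for two at seven".split()
description = "reserve restaurant".split()

first, second = build_pair(utterance, description, uttr_len=5, desc_len=5)
print(first, second)
```

With `intent_desc_first=False`, the same truncated segments are returned in the opposite order.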

### Reproducibility

Almost all hyperparameters needed to reproduce the experiments are specified in `classification/conf`.

**Specific setups**

For the MultiWoZ and CLINC datasets:
 *  all lexicalized intent description types (`dataset.description_type` from `[d1_pattern, d2_pattern, q1_pattern, q2_pattern]`) used `dataset.desc_len=15`
 *  intent label description types (`dataset.description_type` from `[d1_pattern, d2_pattern, q1_pattern, q2_pattern]`) used `dataset.desc_len=5` for MultiWoZ and `dataset.desc_len=7` for CLINC
 
For SGD, all setups fix `dataset.desc_len=19`, because in our experiments the intent labels for this dataset contain short natural-language descriptions.

