# The Semantic Transformer 

This repository contains the data and code for the paper:

> **Unsupervised Extractive Opinion Summarization Using Sparse Coding**

## Data

Download the SPACE corpus from this [link](https://github.com/stangelid/qt).
The Amazon dataset is publicly available [here](https://github.com/abrazinskas/Copycat-abstractive-opinion-summarizer/tree/master/gold_summs).

The Amazon data was preprocessed following the instructions [here](https://github.com/stangelid/qt/blob/main/custom.md).


## Using our model

### Setting up the environment

* __Python version:__ `python3.6`

* __Dependencies:__ Use the `requirements.txt` file and conda/pip to install all necessary dependencies. E.g., for pip:

		pip install -U pip
		pip install -U setuptools
		pip install -r requirements.txt 
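
Since the code targets Python 3.6, a quick interpreter check before installing dependencies can catch a mismatched environment early. The snippet below is only a minimal sketch; the helper name `is_expected_python` is illustrative and not part of this repository.

```python
import sys

# Version stated in this README; adjust if your setup differs.
EXPECTED = (3, 6)


def is_expected_python(version_info=None, expected=EXPECTED):
    """Return True if the major.minor version matches the expected one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    if not is_expected_python():
        # Only a warning: other versions may or may not work.
        print("Warning: running Python {}.{}, README lists {}.{}".format(
            sys.version_info[0], sys.version_info[1], *EXPECTED))
```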


### Train SentencePiece Tokenizer
Train a SentencePiece tokenizer on your data using the `train-spm.py` script:
```
cd ./src/utils/
python3 train-spm.py path/to/dataset_train.json spm_dataset
mv spm_dataset* ../../data/sentencepiece/
```
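
After the `mv` step, it can help to confirm that the tokenizer files landed where expected. The sketch below assumes that `train-spm.py` uses its second argument (`spm_dataset`) as the SentencePiece model prefix, so that the usual `.model` and `.vocab` files are produced; the helper `missing_spm_files` is hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List


def missing_spm_files(spm_dir: str, prefix: str = "spm_dataset") -> List[str]:
    """List expected tokenizer files that are absent from spm_dir."""
    # SentencePiece training normally writes <prefix>.model and <prefix>.vocab.
    expected = [prefix + ".model", prefix + ".vocab"]
    base = Path(spm_dir)
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    missing = missing_spm_files("data/sentencepiece")
    if missing:
        print("Missing tokenizer files:", ", ".join(missing))
    else:
        print("Tokenizer files found.")
```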

### Training SemTrans

To train SemTrans on a subset of the SPACE dataset using a GPU, go to the `./src`
directory and run the following:

    python3 train.py --max_num_entities 500 --run_id run1 --gpu 0

This trains a SemTrans model with default hyperparameters (for general
summarization), stores TensorBoard logs under `./logs`, and saves a
model snapshot after every epoch under `./models` (filename:
`run1_<epoch>_model.pt`).
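
Because one snapshot is written per epoch, picking a checkpoint for inference means selecting a file by its epoch number. The sketch below relies only on the `<run_id>_<epoch>_model.pt` naming described above; the helper `latest_snapshot` is hypothetical and not part of this repository, and the most recent snapshot is not necessarily the best-performing one.

```python
import re
from pathlib import Path
from typing import Optional


def latest_snapshot(models_dir: str, run_id: str) -> Optional[Path]:
    """Return the snapshot with the highest epoch for run_id, or None."""
    base = Path(models_dir)
    if not base.is_dir():
        return None
    # Match e.g. run1_20_model.pt and capture the epoch number.
    pattern = re.compile(r"^" + re.escape(run_id) + r"_(\d+)_model\.pt$")
    best_epoch, best_path = -1, None
    for path in base.iterdir():
        match = pattern.match(path.name)
        # Compare epochs numerically so that epoch 10 ranks above epoch 2.
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    # Assumes the script is run from ./src, as in the commands above.
    print(latest_snapshot("../models", "run1"))
```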

For training the full model on SPACE, run the following:
```
cd scripts/
chmod +x train_space.sh
./train_space.sh
```
To train the model on the full Amazon dataset, run the `scripts/train_amazon.sh` bash script in the same way.



### Summarization with SemTrans

To perform general opinion summarization with a trained SemTrans model, go to the `./src` directory and run the following:

	python3 inference.py --model ../models/run1_20_model.pt --run_id space_run1 --gpu 0

This will store the summaries under `./outputs/space_run1` and the ROUGE evaluation output in `./outputs/eval_space_run1.json`.
For aspect opinion summarization, run:

	python3 aspect_inference.py --model ../models/run1_20_model.pt --sample_sentences --run_id aspects_run1 --gpu 0

Summarization scripts for SPACE and Amazon are available under `scripts/evaluate_*.sh`.
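
The ROUGE evaluation is written as a JSON file, so it can be inspected with a few lines of Python. The sketch below makes no assumption about the file's internal structure: it simply loads the file and prints an indented outline of whatever keys and values it contains. The helper names `load_eval` and `describe` are hypothetical and not part of this repository.

```python
import json
import sys
from pathlib import Path


def load_eval(path: str):
    """Load an evaluation JSON file; return None if it does not exist."""
    p = Path(path)
    if not p.is_file():
        return None
    with p.open(encoding="utf-8") as f:
        return json.load(f)


def describe(results, indent=0):
    """Return an indented outline (list of lines) of a nested JSON object."""
    lines = []
    if isinstance(results, dict):
        for key, value in results.items():
            if isinstance(value, dict):
                lines.append(" " * indent + str(key) + ":")
                lines.extend(describe(value, indent + 2))
            else:
                lines.append(" " * indent + "{}: {}".format(key, value))
    return lines


if __name__ == "__main__":
    # Pass the path of the evaluation file as the first argument.
    if len(sys.argv) > 1:
        data = load_eval(sys.argv[1])
        print("\n".join(describe(data)) if data is not None else "File not found")
```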