Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License").
You may not use this file except in compliance with the License.
A copy of the License is located at

  http://www.apache.org/licenses/LICENSE-2.0

or in the "license" file accompanying this file. This file is distributed 
on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 
express or implied. See the License for the specific language governing 
permissions and limitations under the License.

  
## Deep Sentence Embedding Clustering (DSEC)
DSEC is a method for clustering sentences. It trains an end-to-end auto-encoder whose sentence embedding layer feeds a deep clustering structure. The sentence embedding layer pursues two goals simultaneously: 1. learn a feature representation powerful enough to recover the input utterance as closely as possible; 2. produce reasonable template clusters by applying a clustering-oriented loss to the embedding. To achieve both goals, DSEC minimizes the reconstruction loss and the clustering loss at the same time. Planned experiments combine different structures with word embeddings such as GloVe, ELMo, and BERT.
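
The joint objective described above can be illustrated with a minimal pure-Python sketch. It is not the code in `DSEC.py`. It assumes a DEC-style clustering loss (Student's t soft assignments and a KL divergence against sharpened targets) and a mean-squared-error reconstruction loss. The function names and the weight `gamma` are hypothetical, chosen only for illustration.

```python
import math


def soft_assign(z, centroids):
    """Student's t soft assignment of one embedding to each centroid (assumed, DEC-style)."""
    weights = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z, c))) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]


def target_distribution(q):
    """Sharpen assignments: square them, divide by cluster frequency, renormalize per row."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        w = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        s = sum(w)
        p.append([x / s for x in w])
    return p


def kl_divergence(p, q):
    """Mean over samples of KL(p_i || q_i)."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p_row, q_row) if pi > 0)
    return total / len(p)


def mse(x, x_hat):
    """Mean squared error over all elements of the batch."""
    n = sum(len(row) for row in x)
    return sum((a - b) ** 2 for xr, hr in zip(x, x_hat) for a, b in zip(xr, hr)) / n


def joint_loss(x, x_hat, z, centroids, gamma=0.1):
    """Reconstruction loss plus gamma times clustering loss (gamma is a hypothetical weight)."""
    q = [soft_assign(zi, centroids) for zi in z]
    p = target_distribution(q)
    return mse(x, x_hat) + gamma * kl_divergence(p, q)
```

Here `x` and `x_hat` stand for a batch of inputs and their reconstructions, `z` for the sentence embeddings, and `centroids` for the cluster centers. Minimizing the sum pushes the embedding to stay informative enough for reconstruction while pulling sentences toward cluster centers.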

## Usage
1. Install [Keras >=v2.0](https://github.com/fchollet/keras), tensorflow, scikit-learn and git   
`sudo pip install keras scikit-learn tensorflow-gpu`  
`sudo apt-get install git`


2. Run an experiment on the cancel prime data. The fitted model is saved to `results/glove_lstm_20190927_len_10_enc_10/models/dsec_model_{epoch}.h5`.

        python DSEC.py --train_file path_to_train.txt --emb_file path_to_embedding.txt --max_seq_len 10 --encoder_size 10 --save_dir results/glove_lstm_20190927_len_10_enc_10 --gpu '0' --n_clusters 50


3. Load an existing pre-trained model via `--sae_weights_path`, then fine-tune or train the clustering on it.

        python DSEC.py --train_file path_to_train.txt --emb_file path_to_embedding.txt --max_seq_len 10 --encoder_size 10 --save_dir results/glove_lstm_20190927_len_10_enc_10 --sae_weights_path results/glove_lstm_20190927_len_10_enc_10 --gpu '0' --n_clusters 50
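
Because step 2 writes one checkpoint per epoch under `models/`, a small helper can pick out the most recent one for inspection. The sketch below is not part of `DSEC.py`. The function name `latest_checkpoint` is hypothetical, and it assumes `{epoch}` in the file name is a plain integer.

```python
import re
from pathlib import Path

# Matches the checkpoint naming pattern from step 2; an integer epoch is an assumption.
CHECKPOINT_RE = re.compile(r"^dsec_model_(\d+)\.h5$")


def latest_checkpoint(save_dir):
    """Return the Path of the highest-epoch checkpoint in save_dir/models, or None."""
    models_dir = Path(save_dir) / "models"
    if not models_dir.is_dir():
        return None
    best_epoch, best_path = -1, None
    for path in models_dir.iterdir():
        match = CHECKPOINT_RE.match(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

For example, `latest_checkpoint("results/glove_lstm_20190927_len_10_enc_10")` returns the newest `dsec_model_{epoch}.h5` file if any exist, and `None` otherwise. Comparing epochs numerically avoids the pitfall of lexicographic sorting, where epoch 10 would sort before epoch 2.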



 

