This folder contains the code for the proposed BERTSeg and BERTSeg-Regularization methods.
    * characterBERT_embeddings.py: generate word embeddings for a corpus with the CharacterBERT model.
    * utils.py: small helper functions.
    * dataset.py: create the dataset used in training and inference.
    * model.py: the Transformer decoder
    * train.py: the DP (dynamic programming) algorithm that computes a word's probability as the sum over all subword segmentations of that word.
    * inference.py: another DP algorithm that generates the best N segmentations for each word in the corpus.
    * inference_corpus.py: generate the segmented corpus from the list of segmented words (provided in a separate supplementary material) and the raw corpus.
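
The two DP algorithms mentioned for train.py and inference.py can be illustrated with a minimal sketch. This is not the code from this folder: the names `TOY_LOGP`, `word_log_prob`, and `n_best_segmentations` are hypothetical, and a fixed toy table of subword probabilities stands in for whatever scores the model actually produces. It only shows the general shape of summing over all segmentations of a word and of keeping the N best ones.

```python
import math
from heapq import nlargest

# Hypothetical toy subword log-probabilities (an assumption for illustration).
# In the real code the scores would presumably come from the model instead.
TOY_LOGP = {
    "undo": math.log(0.1), "un": math.log(0.2), "do": math.log(0.2),
    "u": math.log(0.1), "n": math.log(0.1), "d": math.log(0.1), "o": math.log(0.1),
}

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def word_log_prob(word, logp=TOY_LOGP):
    """Log-probability of a word, summed over all of its subword segmentations."""
    # alpha[i] = log of the total probability of all segmentations of word[:i]
    alpha = [0.0] + [-math.inf] * len(word)
    for end in range(1, len(word) + 1):
        scores = [alpha[start] + logp[word[start:end]]
                  for start in range(end)
                  if word[start:end] in logp and alpha[start] > -math.inf]
        if scores:
            alpha[end] = logsumexp(scores)
    return alpha[len(word)]

def n_best_segmentations(word, n, logp=TOY_LOGP):
    """Up to n highest-scoring segmentations as (log-score, subword list) pairs."""
    # best[i] = up to n best (score, segmentation) pairs for the prefix word[:i]
    best = [[(0.0, [])]] + [[] for _ in word]
    for end in range(1, len(word) + 1):
        candidates = [(score + logp[word[start:end]], seg + [word[start:end]])
                      for start in range(end)
                      if word[start:end] in logp
                      for score, seg in best[start]]
        best[end] = nlargest(n, candidates, key=lambda c: c[0])
    return best[len(word)]

if __name__ == "__main__":
    print(math.exp(word_log_prob("undo")))
    for score, seg in n_best_segmentations("undo", 2):
        print(seg, math.exp(score))
```

With this toy table, the word "undo" has five valid segmentations whose probabilities sum to about 0.1441, and the two best segmentations are "undo" and "un do". Keeping only the top N candidates at every prefix is sufficient here because the score of a segmentation is a sum of independent subword scores; if subword scores depended on the preceding subwords, as they might with an autoregressive decoder, the N-best search would need a different formulation.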
    

