# WSD-Z-reweighting
  Code for "Rare and Zero-shot Word Sense Disambiguation using Z-Reweighting".

  Before running the experiments, download the SemCor 3.0 training data from http://web.eecs.umich.edu/~mihalcea/downloads.html#semcor and put it under ./data.
  Also download the unified evaluation framework for WSD (http://lcl.uniroma1.it/wsdeval/) into ./data:

  cd ./data
  wget http://lcl.uniroma1.it/wsdeval/data/WSD_Evaluation_Framework.zip
  unzip WSD_Evaluation_Framework.zip

# Environment
  python: 3.7.6
  pytorch: 1.2.0
  transformers: 4.1.1

# Analyze the SemCor dataset and polysemy distribution
First, convert the XML files to CSV:
  cd ./preprocess
  python transform.py

  This produces semcor.csv. Apply the same conversion to senseval2, senseval3, etc.
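
The conversion can be pictured with the minimal sketch below. It is not the code in transform.py; it assumes the layout used by the unified evaluation framework (a `*.data.xml` file whose `<sentence>` elements contain `<wf>` and `<instance>` tokens, plus a `*.gold.key.txt` file with one `instance_id sense_key ...` line per instance). The function names and CSV column names are hypothetical and chosen only for illustration.

```python
import csv
import xml.etree.ElementTree as ET


def load_gold_keys(key_path):
    """Map instance id -> list of gold sense keys (an instance may have several)."""
    gold = {}
    with open(key_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                gold[parts[0]] = parts[1:]
    return gold


def xml_to_rows(xml_path, gold):
    """Yield one row per token; only <instance> tokens carry an id and sense keys."""
    root = ET.parse(xml_path).getroot()
    for sentence in root.iter("sentence"):
        for token in sentence:
            inst_id = token.get("id", "")  # <wf> tokens have no id
            yield {
                "sentence_id": sentence.get("id"),
                "instance_id": inst_id,
                "word": token.text or "",
                "lemma": token.get("lemma"),
                "pos": token.get("pos"),
                "sense_keys": ";".join(gold.get(inst_id, [])),
            }


def write_csv(xml_path, key_path, csv_path):
    """Write all tokens of one dataset to a CSV file (hypothetical column names)."""
    fields = ["sentence_id", "instance_id", "word", "lemma", "pos", "sense_keys"]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(xml_to_rows(xml_path, load_gold_keys(key_path)))
```

Under these assumptions, the same `write_csv` call would work for SemCor and for each evaluation set, since they share the XML and gold-key layout.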

1) Sort words by frequency, compute the polysemy distribution and the instance counts for words and senses, then set K and calculate the smoothed polysemy distribution.

  The polysemy definition is taken from ./data/WSD_Evaluation_Framework/Data_Validation/candidatesWN30.txt.

  cd ./preprocess
  python poly_power.py

  This produces semcor_sense_count.json, semcor_synset_count.txt, and semcor\_polysemy\_K\_{}.npy, where K = 50, 100, 200, 300, 400.
 
2) Fit a power-law function to the polysemy distribution, set lambda, and assign weights to the training words in SemCor.
  
  python power_law_fit.py

  This first uses thresholds to group the words by the one-decimal score given by the fitted power-law curve.

  Based on these groups, set gamma to further adjust the weights of the training words and generate the weight file semcor_synset_weight_K_gamma.json, which is used later by the Z-reweighting strategy.
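
Steps 1 and 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in poly_power.py or power_law_fit.py: it assumes that smoothing means averaging polysemy over consecutive windows of K frequency-ranked words, and that the power law has the form y = a * x^b fitted by least squares in log-log space. All function names are hypothetical. The lambda and gamma weighting is not reproduced here, since its exact formula is defined in the repository scripts.

```python
import numpy as np


def smooth_polysemy(polysemy_by_rank, k):
    """Average polysemy over consecutive windows of k frequency-ranked words.

    Assumption: a trailing partial window (fewer than k words) is dropped.
    """
    values = np.asarray(polysemy_by_rank, dtype=float)
    n_windows = len(values) // k
    return values[: n_windows * k].reshape(n_windows, k).mean(axis=1)


def fit_power_law(y):
    """Fit y ~ a * x**b for x = 1..len(y) by least squares in log-log space."""
    x = np.arange(1, len(y) + 1, dtype=float)
    b, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(log_a)), float(b)


def group_by_score(a, b, n_points):
    """Group window positions by the fitted value rounded to one decimal."""
    x = np.arange(1, n_points + 1, dtype=float)
    scores = np.round(a * x ** b, 1)
    groups = {}
    for position, score in enumerate(scores):
        groups.setdefault(float(score), []).append(position)
    return groups
```

With these pieces, a per-word polysemy array sorted by word frequency would be passed through `smooth_polysemy`, then `fit_power_law`, and the resulting `(a, b)` would be given to `group_by_score` to obtain groups of windows that share the same one-decimal score.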
    

# Running experiments

  The training command for the Z-reweighting strategy is:

  CUDA_VISIBLE_DEVICES=0,1 python biencoder_Z_reweighting.py --data-path ./data --postprocess-data-path ./preprocess --K 300 --gamma 2 --ckpt bert_base_Z_reweighting_300_2 --encoder-name bert-base --multigpu

# Evaluation

  CUDA_VISIBLE_DEVICES=0,1 python biencoder_Z_reweighting.py --data-path ./data --postprocess-data-path ./preprocess --ckpt bert_base_Z_reweighting_300_2 --encoder-name bert-base --multigpu --split ALL --eval
 

# MCS/LCS analysis
  MCS and LCS follow the WordNet sense ranking: the top-ranked sense of a word is its MCS, and all other senses are LCS.
 
  To evaluate on MCS, LCS, and zero-shot senses:

  cd ./analysis

  Change the input path and run:

  python f1_mcs_lcs.py
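
The split behind this analysis can be illustrated with the sketch below. It is not the code in f1_mcs_lcs.py. It assumes one gold sense and one prediction per instance, takes the WordNet ranking as a plain dictionary from a word key to its ordered candidate senses, and treats a gold sense as zero-shot when it never occurs in the training set (so a zero-shot instance is also counted under MCS or LCS). The function names, the `lemma#pos` key format, and the sense labels are hypothetical.

```python
from collections import defaultdict


def subsets_of(gold_sense, ranked_senses, training_senses):
    """Return the analysis subsets that a gold-labelled instance belongs to."""
    is_mcs = bool(ranked_senses) and ranked_senses[0] == gold_sense
    subsets = ["MCS" if is_mcs else "LCS"]
    if gold_sense not in training_senses:
        subsets.append("zero-shot")
    return subsets


def f1_by_subset(instances, sense_ranking, training_senses):
    """Score each subset; instances are (word_key, gold_sense, predicted_sense)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for word_key, gold, predicted in instances:
        ranked = sense_ranking.get(word_key, [])
        for name in subsets_of(gold, ranked, training_senses):
            total[name] += 1
            correct[name] += int(predicted == gold)
    # With exactly one prediction per instance, precision = recall = F1 = accuracy.
    return {name: 100.0 * correct[name] / total[name] for name in total}
```

Reporting the three subsets separately makes it visible when a model scores well overall mainly because it predicts the top-ranked sense.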

