# BoOQA Construction Pipeline

## Overview
This directory contains the code and Bash scripts for building the BoOQA dataset from raw corpora.
Before starting, clone the [BERT-WSD](https://github.com/BPYap/BERT-WSD) project into the parent directory of this
directory. As a pre-processing step, also run the parsers to extract predicate-argument structures from the NewsCrawl / CLUE corpora.
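
The layout requirement above can be checked before running anything else. The snippet below is only a sketch: the helper name `check_bert_wsd` is hypothetical (it is not part of this repository), and it assumes the clone keeps the default directory name `BERT-WSD`.

```shell
# Hypothetical helper (not part of this repository): verify that BERT-WSD
# has been cloned into the parent directory, as the overview requires.
# $1 is the directory expected to contain the clone; it defaults to "..".
check_bert_wsd() {
    local parent_dir="${1:-..}"
    if [ -d "$parent_dir/BERT-WSD" ]; then
        echo "found: $parent_dir/BERT-WSD"
    else
        # Assumes the default clone directory name "BERT-WSD".
        echo "missing: run 'git clone https://github.com/BPYap/BERT-WSD $parent_dir/BERT-WSD'" >&2
        return 1
    fi
}
```

Calling `check_bert_wsd` with no argument from this directory checks `../BERT-WSD`; a non-zero return status means the clone is missing.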

## Select Positives

Use `select_positives_en_generic.sh` or `select_positives_zh_generic.sh`, depending on the language:

* EN: `bash select_positives_en_generic.sh 30 0 15 triple doc 40000 all "--disjoint_window"`
* ZH: `bash select_positives_zh_generic.sh 30 0 15 triple doc 40000 all "--disjoint_window"`

## Generate Negatives
Use the `generate_negatives_XXX.sh` scripts.
* EN
    1. Load the predicate sets (independent of the thresholds): `bash generate_negatives_loading_en.sh 15_30_0_triple_doc_40000_disjoint pred`
    2. Compute the most likely synsets for each predicate lemma, selected from all WordNet synsets that include the lemma (independent of the thresholds): `bash generate_negatives_synsets.sh 15_30_0_triple_doc_40000_disjoint potential pred 3`
    3. Load the vector representations of untyped predicates (independent of the thresholds): `bash generate_negatives_upredvec_en.sh 15_30_0_triple_doc_40000_disjoint pred`
    4. Generate WordNet negatives: `bash generate_negatives_wordnet_en.sh 15_30_0_triple_doc_40000_disjoint potential pred 30`
    5. Sample: `bash generate_negatives_sampling_freqmap_en.sh 15_30_0_triple_doc_40000_disjoint 30 0 2 wordnet 40000`

* ZH
    1. Load the predicate sets (independent of the thresholds): `bash generate_negatives_loading_zh.sh 15_30_0_triple_doc_40000_disjoint pred`
    2. Load the vector representations of untyped predicates (also independent of the thresholds): `bash generate_negatives_upredvec_zh.sh 15_30_0_triple_doc_40000_disjoint pred`
    3. Generate WordNet negatives: `bash generate_negatives_wordnet_zh.sh 15_30_0_triple_doc_40000_disjoint potential pred 30`
    4. Sample: `bash generate_negatives_sampling_freqmap_zh.sh 15_30_0_triple_doc_40000_disjoint 30 0 2 wordnet 40000`
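
Because the steps for each language must run in order, they can be chained in a small wrapper. The following is a minimal sketch, not a script shipped with this repository: the `run` and `generate_negatives` functions and the `DRY_RUN` switch are hypothetical, while the script names and arguments are copied verbatim from the lists above. By default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: list (or run) the negative-generation steps for one language, in order.

TAG="15_30_0_triple_doc_40000_disjoint"  # run tag copied from the commands above

# Print a command; execute it only when DRY_RUN=0.
# DRY_RUN is a hypothetical switch for this sketch; the default is a dry run.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Hypothetical wrapper; $1 is the language, "en" or "zh".
# Stops at the first step that fails.
generate_negatives() {
    lang="$1"
    run bash "generate_negatives_loading_${lang}.sh" "$TAG" pred || return 1
    if [ "$lang" = "en" ]; then
        # The synset step appears only in the EN list.
        run bash generate_negatives_synsets.sh "$TAG" potential pred 3 || return 1
    fi
    run bash "generate_negatives_upredvec_${lang}.sh" "$TAG" pred || return 1
    run bash "generate_negatives_wordnet_${lang}.sh" "$TAG" potential pred 30 || return 1
    run bash "generate_negatives_sampling_freqmap_${lang}.sh" "$TAG" 30 0 2 wordnet 40000 || return 1
}

generate_negatives en
```

Setting `DRY_RUN=0` in the environment executes each step instead of only printing it; `generate_negatives zh` does the same for the ZH steps.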