# MixGR: Enhancing Retriever Generalization for Scientific Domain through Complementary Granularity

This folder contains the code to reproduce the results in our submission. For convenience, the results are collected and visualized in `notebooks/result_collection.ipynb`. Please set the environment variable `ROOT_DIR` to the current directory.

Reproducing our results involves three main steps: indexing documents (and propositions), searching at multiple granularities, and fusing the search results across granularities.

## 1. Query and Document Decomposition
In our work, we decompose queries and documents into subqueries and propositions, respectively. We use an existing T5-Large model that distills the decomposition capability of GPT-4. Its Hugging Face checkpoint is `chentong00/propositionizer-wiki-flan-t5-large`. The query and document files, i.e., `QUERY_FILE` and `DOCUMENT_FILE`, should be in `jsonl` format, aligned with [`pyserini`](https://github.com/castorini/pyserini).
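As a rough illustration of the expected input, the sketch below writes a tiny corpus in the JSONL layout that `pyserini` commonly reads: one JSON object per line with an `id` and a `contents` field. Joining the title and the text with a newline is an assumption that mirrors the `--fields title text --delimiter "\n"` options used in the indexing command of Section 2. The file name and the records are hypothetical, and the exact schema expected by the scripts in `src/` should be checked against the repository.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line (the JSONL convention)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical records: "contents" holds "<title>\n<text>" (assumed layout).
docs = [
    {"id": "doc0", "contents": "A title\nBody text of the first document."},
    {"id": "doc1", "contents": "Another title\nBody text of the second document."},
]

# Hypothetical file name inside a temporary directory.
corpus_path = os.path.join(tempfile.mkdtemp(), "corpus.jsonl")
write_jsonl(docs, corpus_path)

# Read the file back to confirm that each line is a standalone JSON object.
with open(corpus_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Because `json.dumps` escapes the newline inside `contents`, each record still occupies exactly one line of the file.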

### 1.1. Query Decomposition
Decompose queries into subqueries:

```
python src/query_parse.py --query_file $QUERY_FILE --save_file $SAVE_FILE
```

### 1.2. Document Decomposition
Decompose documents into propositions:

```
python src/chunk.py --parse_file $DOCUMENT_FILE
```
Note: the command above generates two files, `corpus.chunk.jsonl` and `corpus.prop.jsonl`, in the same directory as `$DOCUMENT_FILE`. As mentioned in the paper, documents are first decomposed into chunks with a maximum length of 128.

For your convenience, we have already provided the decomposed documents and queries with the supplementary data. Please unzip the data, put it under the root directory, and make sure the folder is named `data`.
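The chunking step can be pictured with the minimal sketch below. It assumes that the length limit of 128 is counted in whitespace-separated words and that chunks are cut greedily; `src/chunk.py` may instead count tokens or respect sentence boundaries, so this is only an illustration. The function name `chunk_words` is hypothetical.

```python
def chunk_words(text, max_len=128):
    """Split text into consecutive chunks of at most max_len words.

    Assumption: length is measured in whitespace-separated words.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_len])
        for i in range(0, len(words), max_len)
    ]


# A synthetic 300-word document yields two full chunks and one remainder.
chunks = chunk_words("word " * 300)
chunk_lengths = [len(chunk.split()) for chunk in chunks]
```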

## 2. Index
Here, we use the dataset `scifact` as an example.

```
python -m pyserini.encode \
input   --corpus $ROOT_DIR/data/scifact/chunks/corpus.chunk.jsonl \
        --fields title text \
        --delimiter "\n" \
output  --embeddings $ROOT_DIR/indexes/scifact/chunks/scifact."$ENCODER".index.jsonl \
        --to-faiss \
encoder --encoder $ENCODER_MODEL \
        --fields title text \
        --batch 16 \
        --fp16 \
        --device cuda:0
```

## 3. Search multiple granularities
Search the queries/subqueries within the documents/propositions. Here, we calculate the three scores mentioned in the paper: $s_{q-d}$, $s_{q-p}$, and $s_{s-p}$.

### 3.1. Search queries within documents
Here, we will search for documents given the queries.
```
python -m pyserini.search.faiss \
--encoder $ENCODER_MODEL \
--index $DOC_INDEX \
--topics $QUERY_FILE \
--output $QUERY_DOC_SEARCH_RESULT \
--batch-size 64 --threads 32 \
--hits 500
```
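To inspect a run produced by the command above, a small parser such as the following may help. It assumes the default TREC run format, in which each line reads `qid Q0 docid rank score tag`; if a different `--output-format` is used, the parsing has to change accordingly. The function name and the sample lines are hypothetical.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Parse TREC-style run lines into {qid: {docid: score}}."""
    run = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, docid, _rank, score, _tag = parts
        run[qid][docid] = float(score)
    return dict(run)


# Hypothetical run lines; real ones would be read from $QUERY_DOC_SEARCH_RESULT.
sample = [
    "q1 Q0 d3 1 0.91 Faiss",
    "q1 Q0 d7 2 0.85 Faiss",
    "q2 Q0 d3 1 0.77 Faiss",
]
run = parse_trec_run(sample)
```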

### 3.2. Search queries within propositions
Here, we will search for propositions given the queries.
```
python -m pyserini.search.faiss \
--encoder $ENCODER_MODEL \
--index $PROP_INDEX \
--topics $QUERY_FILE \
--output $QUERY_PROP_SEARCH_RESULT \
--batch-size 64 --threads 32 \
--hits 500
```

### 3.3. Search subqueries within propositions
Here, we will search for propositions given the subqueries.
```
python -m pyserini.search.faiss \
--encoder $ENCODER_MODEL \
--index $PROP_INDEX \
--topics $SUBQUERY_FILE \
--output $SUBQUERY_PROP_SEARCH_RESULT \
--batch-size 64 --threads 32 \
--hits 500
```

### 3.4. Search subqueries within documents
We did not use this metric for the fusion. However, the following code expects this search result to exist, so please generate it as well.
```
python -m pyserini.search.faiss \
--encoder $ENCODER_MODEL \
--index $DOC_INDEX \
--topics $SUBQUERY_FILE \
--output $SUBQUERY_DOC_SEARCH_RESULT \
--batch-size 64 --threads 32 \
--hits 500
```

## 4. Fuse the results from multiple sources
Given the search results above, we fuse them across the mixed granularities.

### 4.1. Generate the extra similarities between queries (subqueries) and documents (propositions)
As mentioned in Section 3.3 of the paper, we additionally compute the similarities between queries (subqueries) and documents (propositions) that are missing from one retrieved set but present in the other.
```
python $ROOT_DIR/src/retrieval/union_retrieval_quadrant.py \
    --encoder $ENCODER \
    --prop_bm25_dir $PROP_BM25_INDEX \
    --chunk_bm25_dir $CHUNK_BM25_INDEX \
    --subquery_file $SUBQUERY_FILE \
    --query_file $QUERY_FILE \
    --passage2prop $CHUNK2PROP_PKL \
    --whole_prop_path $QUERY_PROP_SEARCH_RESULT \
    --multi_prop_path $SUBQUERY_PROP_SEARCH_RESULT \
    --whole_chunk_path $QUERY_DOC_SEARCH_RESULT \
    --multi_chunk_path $SUBQUERY_DOC_SEARCH_RESULT
```

Note:
- We need the BM25 indexes of documents and propositions, i.e., `$DOC_BM25_INDEX` (passed to `--chunk_bm25_dir` as `$CHUNK_BM25_INDEX` in the command above) and `$PROP_BM25_INDEX`, which can be generated by `pyserini`.
- We need a mapping between chunks and their propositions as a pkl file, i.e., `$CHUNK2PROP_PKL`.
- We save these to disk because they are reused frequently.
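The idea behind this step can be sketched as follows: for one query, find the chunks that were reached only through retrieved propositions and the propositions of retrieved chunks that were not retrieved themselves. These are the pairs whose similarity still has to be computed. The sketch assumes the mapping is a dictionary from chunk ID to a list of proposition IDs; the actual structure of `$CHUNK2PROP_PKL` and the logic in `union_retrieval_quadrant.py` may differ, and all names and IDs below are hypothetical.

```python
def find_missing(doc_hits, prop_hits, chunk2prop):
    """Return (missing_chunks, missing_props) for a single query.

    doc_hits:   {chunk_id: score} from the chunk-level search
    prop_hits:  {prop_id: score} from the proposition-level search
    chunk2prop: {chunk_id: [prop_id, ...]} (assumed layout)
    """
    # Invert the mapping so each proposition points to its chunk.
    prop2chunk = {p: c for c, props in chunk2prop.items() for p in props}

    # Chunks that appear only via their retrieved propositions.
    chunks_from_props = {prop2chunk[p] for p in prop_hits if p in prop2chunk}
    missing_chunks = chunks_from_props - set(doc_hits)

    # Propositions of retrieved chunks that were not retrieved themselves.
    props_from_chunks = {p for c in doc_hits for p in chunk2prop.get(c, [])}
    missing_props = props_from_chunks - set(prop_hits)
    return missing_chunks, missing_props


# Hypothetical mapping and search results for one query.
chunk2prop = {"c1": ["c1-p0", "c1-p1"], "c2": ["c2-p0"], "c3": ["c3-p0"]}
doc_hits = {"c1": 0.9, "c2": 0.5}
prop_hits = {"c1-p0": 0.8, "c3-p0": 0.7}
missing_chunks, missing_props = find_missing(doc_hits, prop_hits, chunk2prop)
```

In this toy case, chunk `c3` lacks a query-document score, and propositions `c1-p1` and `c2-p0` lack query-proposition scores.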

### 4.2. Fuse the metrics on different granularities
```
python $ROOT_DIR/src/retrieval/rrf_wo_eval_quadrant.py \
    --encoder $ENCODER \
    --sub_query_jsonl $SUBQUERY_FILE \
    --retrieval_dir $ROOT_RETRIEVAL_DIR \
    --whole_chunk_path $QUERY_DOC_FILENAME \
    --multi_chunk_path $SUBQUERY_DOC_FILENAME \
    --whole_prop_path $QUERY_PROP_FILENAME \
    --multi_prop_path $SUBQUERY_PROP_FILENAME \
    --add_query_file $ADDITIONAL_FILE \
    --save_query_file $SAVE_FILE \
    --passage2prop $CHUNK2PROP_PKL
```

Note:
- `$ADDITIONAL_FILE` is the result obtained from step 4.1.
- `$ROOT_RETRIEVAL_DIR` is the directory shared by `$QUERY_DOC_FILENAME`, `$QUERY_PROP_FILENAME`, `$SUBQUERY_DOC_FILENAME`, and `$SUBQUERY_PROP_FILENAME`.
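The script name suggests reciprocal rank fusion (RRF). As a point of reference, the sketch below shows the standard RRF formula, in which each document receives the sum of `1 / (k + rank)` over all rankings. It is not taken from `rrf_wo_eval_quadrant.py`: the constant `k = 60` is only the conventional default, and the function name, score dictionaries, and document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(runs, k=60):
    """Fuse several {doc_id: score} dicts with standard RRF.

    Each run is first sorted by score (higher is better); a document's
    fused score is the sum over runs of 1 / (k + rank), with rank from 1.
    """
    fused = defaultdict(float)
    for scores in runs:
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, doc_id in enumerate(ranked, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-document scores for one query at three granularities.
s_qd = {"d1": 0.9, "d2": 0.8, "d3": 0.1}
s_qp = {"d2": 0.7, "d1": 0.6, "d3": 0.2}
s_sp = {"d2": 0.5, "d3": 0.4, "d1": 0.3}
fused = reciprocal_rank_fusion([s_qd, s_qp, s_sp])
fused_order = [doc_id for doc_id, _ in fused]
```

RRF uses only ranks, so scores from different granularities do not need to be on a comparable scale before fusion.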

## 5. Evaluation
### 5.1. General evaluation, incl. everything except Table 3
Please refer to `notebooks/result_collection.ipynb` for details.

### 5.2. Evaluation with Large Language Models
Please run the following command to reproduce our results in Table 3:

```
python $ROOT_DIR/src/qa/llama.py \
--prop_bm25_dir $PROP_BM25_INDEX \
--doc_bm25_dir $DOC_BM25_INDEX
```

