# Step-level Value Preference Optimization for Mathematical Reasoning
This is the official repository of "Step-level Value Preference Optimization for Mathematical Reasoning".

## Installation
1. Install the dependencies listed in `requirements.txt`:
```
cd ./batch_SBS/n_sbs/
pip install -r requirements.txt
```
2. Install the evaluation toolkit (`../MATH_EVAL`) as a package.
3. Install our customized vLLM (`../vllm`), which adds support for the value model.

## Checkpoint Initialization
1. Download `deepseek-math-7b-base`.
2. Use `./batch_SBS/n_sbs/scripts/save_value_head.py` to add the value head to the LLM.
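
For readers unfamiliar with the idea, the sketch below illustrates what "adding a value head" can look like in PyTorch. It is not the code in `save_value_head.py`: the class names `ValueHead` and `TinyLMWithValue`, the `tanh` squashing, and the embedding layer standing in for the transformer backbone are all assumptions made for illustration.

```python
import torch
from torch import nn


class ValueHead(nn.Module):
    """Hypothetical value head: one scalar per token position."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq, hidden) -> (batch, seq); tanh is an assumed choice
        # that bounds the predicted value to [-1, 1].
        return torch.tanh(self.proj(hidden_states)).squeeze(-1)


class TinyLMWithValue(nn.Module):
    """Toy model: an embedding stands in for the real transformer backbone."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 16):
        super().__init__()
        self.backbone = nn.Embedding(vocab_size, hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.value_head = ValueHead(hidden_size)

    def forward(self, input_ids: torch.Tensor):
        hidden = self.backbone(input_ids)
        # Both heads read the same hidden states.
        return self.lm_head(hidden), self.value_head(hidden)


model = TinyLMWithValue()
logits, values = model(torch.randint(0, 100, (2, 5)))
```

Here `logits` holds the usual next-token scores and `values` holds one value estimate per position, so a single forward pass can serve both generation and scoring.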

## Greedy Decoding
You can run either of the following two commands. The accuracy of the two may differ slightly.
```
bash ./batch_greedy/multi_model/batch_greedy.sh
```
or
```
# use step_beam (1, 1) without value func
python solver_demo.py \
--custom_cfg configs/sbs_greedy.yaml \
--qaf ../MATH_EVAL/data/math_testset_annotation.json
```

## Step-level Beam Search
The following command runs step-level beam search on the MATH test set. On our machine, we ran it with the config `B1=1, B2=5` and with the config `B1=3, B2=5`.
```
python solver_demo.py \
--custom_cfg configs/sbs_sft.yaml \
--qaf ../MATH_EVAL/data/math_testset_annotation.json
```
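
To make the roles of `B1` and `B2` concrete, here is a minimal pure-Python sketch of step-level beam search. It assumes `B1` is the number of partial solutions kept after each step and `B2` is the number of candidate next steps sampled per partial solution. The functions `expand`, `value`, and `is_final` are hypothetical stand-ins for the LLM sampler, the value model, and the stopping check; none of them is part of this repository, and `solver_demo.py` may work differently.

```python
import heapq
from typing import Callable, List

Steps = List[str]


def step_beam_search(
    expand: Callable[[Steps, int], List[str]],  # hypothetical: sample next steps
    value: Callable[[Steps], float],            # hypothetical: value model score
    is_final: Callable[[Steps], bool],          # hypothetical: solution finished?
    b1: int,
    b2: int,
    max_steps: int,
) -> Steps:
    beams: List[Steps] = [[]]
    for _ in range(max_steps):
        candidates: List[Steps] = []
        for beam in beams:
            if is_final(beam):
                candidates.append(beam)  # finished beams are carried over
                continue
            for step in expand(beam, b2):
                candidates.append(beam + [step])
        # Keep the b1 candidates with the highest value.
        beams = heapq.nlargest(b1, candidates, key=value)
        if all(is_final(b) for b in beams):
            break
    return max(beams, key=value)


# Toy problem: each step is a digit, and the goal is a running sum of 10.
def toy_expand(partial: Steps, n: int) -> List[str]:
    return [str(i) for i in range(1, n + 1)]


def toy_value(partial: Steps) -> float:
    return -abs(10 - sum(int(s) for s in partial))


def toy_is_final(partial: Steps) -> bool:
    return sum(int(s) for s in partial) >= 10


sbs_1_5 = step_beam_search(toy_expand, toy_value, toy_is_final, 1, 5, 5)
sbs_3_5 = step_beam_search(toy_expand, toy_value, toy_is_final, 3, 5, 5)
greedy = step_beam_search(toy_expand, toy_value, toy_is_final, 1, 1, 12)
```

With `b1=1, b2=1` only one candidate exists per step, so the value function never affects the choice; this mirrors the `step_beam (1, 1)` setting described under Greedy Decoding.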