# RoCBert: Robust Chinese Bert with Multimodal Contrastive Pretraining

---

This repository contains the code, model, and dataset for RoCBert, published at ACL 2022.

## Download

---

**Pretrained model**

We provide pretrained RoCBert models for PyTorch, which can be downloaded [here](https://drive.google.com/file/d/1_vwoXGwb7d0SDwVWoxRiXCP7KEsPenwg/view?usp=sharing).

**Dataset**

The CLUE dataset can be downloaded [here](https://github.com/CLUEbenchmark/CLUE).


## Requirements

---

* Python 3.7+
* transformers==3.0.2
* torch==1.8.1
* pypinyin
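
The pins above can be checked against the current environment before running anything. The following is a minimal sketch, not part of this repository: `check_requirements` and its `get_version` parameter are hypothetical names, and it assumes Python 3.8+ because it relies on the standard-library `importlib.metadata` module.

```python
from importlib import metadata  # standard library from Python 3.8 onward

# Pins copied from the list above; pypinyin is listed without a version.
REQUIRED = {"transformers": "3.0.2", "torch": "1.8.1", "pypinyin": None}


def check_requirements(required=REQUIRED, get_version=metadata.version):
    """Return a list of problems; an empty list means every pin is satisfied."""
    problems = []
    for name, wanted in required.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        # torch wheels may carry a local suffix such as "1.8.1+cu111".
        if wanted is not None and found.split("+")[0] != wanted:
            problems.append(f"{name}=={found} found, {wanted} expected")
    return problems
```

Calling `check_requirements()` with no arguments inspects the installed packages; the `get_version` hook exists only so the function can be exercised without them.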


## QuickTour

---

**Pretrain**

```
cd pretrain
sh pretrain.sh
```

**Finetune**

1. Download the CLUE dataset and put the data into the `finetune/CLUEdatasets` directory.

```
├── CLUEdatasets
|   └── tnews
|   |   └── train.json
|   |   └── dev.json
|   |   └── test.json
|   |   └── ...
|   └── csl
|   |   └── ...
```

2. Download the pretrained model and put it into the `prev_trained_model/robust-bert` directory.

```
├── prev_trained_model
|   └── robust-bert
|   |   └── config.json
|   |   └── pytorch_model.bin
|   |   └── vocab.txt
|   |   └── word_pronunciation.json
|   |   └── word_shape.json
```



3. Run the corresponding shell script.

```
sh run_classifier_[task_name].sh # finetune
sh run_classifier_[task_name].sh evaluate # evaluation
```
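
Before launching a run, it can help to confirm that the files from steps 1 and 2 are where the scripts expect them. The sketch below is an illustration, not part of this repository: `missing_paths` is a hypothetical helper, the split names are taken from the `tnews` listing and may differ for other tasks, and the default directories assume the script is run from a location where those relative paths resolve.

```python
from pathlib import Path

# File names taken from the directory listings in steps 1 and 2.
SPLIT_FILES = ("train.json", "dev.json", "test.json")
MODEL_FILES = (
    "config.json",
    "pytorch_model.bin",
    "vocab.txt",
    "word_pronunciation.json",
    "word_shape.json",
)


def missing_paths(task_name,
                  data_root="finetune/CLUEdatasets",
                  model_dir="prev_trained_model/robust-bert"):
    """Return the expected files that do not exist for the given task."""
    expected = [Path(data_root) / task_name / name for name in SPLIT_FILES]
    expected += [Path(model_dir) / name for name in MODEL_FILES]
    return [str(path) for path in expected if not path.is_file()]
```

For example, `missing_paths("tnews")` returns an empty list when everything is in place, and otherwise lists the paths that still need to be downloaded.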


**Load model**

```python
from finetune.RobustBert.models import RobustBert
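# Directory holding the files from step 2; adjust to your setup.
model_path = "prev_trained_model/robust-bert"
model = RobustBert.from_pretrained(model_path)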
print(model)
```

**Get sentence representation**

```python
import torch

from finetune.RobustBert.models import RobustBert
from finetune.RobustBert.robust_tokenizer import RobustTokenizer

# Directory holding the files from step 2; adjust to your setup.
model_path = "prev_trained_model/robust-bert"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = RobustTokenizer(model_path)
model = RobustBert.from_pretrained(model_path)
model.to(device)
model.eval()
sentence = "我喜欢小猫咪"
token_res = tokenizer.encode_plus(sentence)
print(token_res)
"""
{'input_ids': tensor([[ 101, 2769, 1599, 3614, 2207, 4344, 1488, 102]], device='cuda:0'), 'shape_ids': tensor([[ 2, 17383, 5102, 12590, 23155, 20557, 13676, 2]], device='cuda:0'), 'pronunciation_ids': tensor([[ 2, 298, 61, 157, 129, 90, 251, 2]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

"""

sequence_output, pooled_output = model.get_bert_output(**token_res)
print(pooled_output)
"""
tensor([[ 7.8390e-01, 9.9754e-01, 9.9819e-01, 7.0014e-01, 8.7884e-01,
-9.3026e-01, -8.9485e-01, 9.8657e-01, 1.9697e-03, -8.2809e-01,
9.9674e-01, 9.9985e-01, 2.0567e-01, -9.6421e-01, 9.9892e-01,
-9.8048e-01, 6.4294e-01, -8.4301e-01, -9.3481e-01, 4.2152e-02,
9.9456e-01, -9.8749e-01, -9.8239e-01, -4.3763e-01, 2.9538e-01,
8.1702e-01, 9.5609e-01, 8.3711e-01, -9.9982e-01, 9.9277e-01,
2.0431e-01, 7.7141e-01, -3.1260e-01, -9.9939e-01, -9.9044e-01,
...]])
"""

```
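
A typical use of the pooled output is comparing two sentences. The sketch below is an illustration, not something this repository prescribes: cosine similarity is a generic choice, `cosine_similarity` is a hypothetical helper, and it assumes each pooled vector has first been converted to a plain list, for example with `pooled_output[0].tolist()`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Values close to 1 indicate vectors pointing in nearly the same direction; how well that tracks semantic similarity for this model's pooled output is not something the example establishes.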
