# Arabic Memes Categorization Dataset -- ArMemes

This repository contains the dataset for the Arabic Memes Categorization project, namely **ArMemes**. The dataset is split into training, development, and test sets.

## Directory Structure
    ├── arabic_memes_categorization_dev.jsonl
    ├── arabic_memes_categorization_test.jsonl
    ├── arabic_memes_categorization_train.jsonl
    ├── licenses_by-nc-sa_4.0_legalcode.txt
    └── README.md

**Note:** Because the image files are large, they are stored on Google Drive. Please download them from the following link:

### Directory/Files Description
- `arabic_memes_categorization_train.jsonl`: Training set for Arabic memes categorization.
- `arabic_memes_categorization_dev.jsonl`: Development set for Arabic memes categorization.
- `arabic_memes_categorization_test.jsonl`: Test set for Arabic memes categorization.
- `licenses_by-nc-sa_4.0_legalcode.txt`: License information for the dataset, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- `README.md`: This README file, which describes the dataset and its structure.
- `bin`: Contains scripts to run experiments.


## License

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. You can view the full license in the `licenses_by-nc-sa_4.0_legalcode.txt` file.

## Usage

To use this dataset, you can load the JSONL files into your data processing pipeline. Each file contains one JSON object per line, representing individual memes and their categorization.

### Example (Python)

```python
import json

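def load_jsonl(file_path):
    """Read a JSONL file and return a list with one dict per non-empty line."""
    with open(file_path, 'r', encoding='utf-8') as file:
        # Skip blank lines (e.g. a trailing newline) so json.loads does not fail
        return [json.loads(line) for line in file if line.strip()]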

# Load training data
train_data = load_jsonl('arabic_memes_categorization_train.jsonl')

# Load development data
dev_data = load_jsonl('arabic_memes_categorization_dev.jsonl')

# Load test data
test_data = load_jsonl('arabic_memes_categorization_test.jsonl')

```
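Once a split loads, a quick look at the label distribution is a useful sanity check. The sketch below counts labels per file. The field name `class_label` is an assumption (borrowed from the `--task-name` value used in the training command further down), so replace it with the actual key in the JSONL records if it differs.

```python
import json
from collections import Counter


def label_distribution(file_path, label_key="class_label"):
    """Count how often each label occurs in one JSONL split.

    NOTE: `label_key` is an assumed field name, not confirmed by the
    dataset documentation; adjust it to the real key in your files.
    """
    counts = Counter()
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:  # ignore blank lines
                continue
            record = json.loads(line)
            # Records without the key are grouped under "<missing>"
            counts[record.get(label_key, "<missing>")] += 1
    return counts


# Example usage (uncomment once the file is in the working directory):
# for label, count in label_distribution(
#         "arabic_memes_categorization_train.jsonl").most_common():
#     print(f"{label}\t{count}")
```

If most records end up under `<missing>`, the assumed key is wrong; printing the keys of the first record shows the available fields.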

## Experiments

### Data splits
We split the dataset in a stratified manner, allocating 70%, 10%, and 20% for training, development, and testing, respectively.
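
The released JSONL files already contain these splits, so there is no need to re-split. For illustration only, a stratified 70/10/20 split can be sketched with the standard library as follows. This is not the script used to produce the released files; the `class_label` key, the seed, and the rounding rule are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(records, label_key="class_label",
                     ratios=(0.7, 0.1, 0.2), seed=1):
    """Split records into train/dev/test while preserving label proportions.

    NOTE: illustrative sketch only; `label_key` is an assumed field name.
    The test split receives whatever remains after train and dev, so
    ratios[2] is implied by the first two ratios.
    """
    # Group records by their label
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    # Iterate labels in a fixed order so the result is reproducible
    for label in sorted(by_label, key=str):
        items = by_label[label]
        rng.shuffle(items)
        n_train = round(len(items) * ratios[0])
        n_dev = round(len(items) * ratios[1])
        train.extend(items[:n_train])
        dev.extend(items[n_train:n_train + n_dev])
        test.extend(items[n_train + n_dev:])
    return train, dev, test
```

Splitting each label group separately is what makes the split stratified: every split keeps roughly the same class balance as the full dataset, which matters when some categories are rare.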

### Baseline
- Random and n-gram baselines
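
As a rough illustration of what a random baseline looks like, the sketch below draws each prediction uniformly from the labels seen in training and scores it with accuracy. It is a minimal example under assumed choices (uniform sampling, accuracy as the metric, a fixed seed) and is not necessarily how the reported baseline was computed.

```python
import random


def random_baseline(train_labels, n_predictions, seed=1):
    """Predict labels uniformly at random from the labels seen in training.

    NOTE: uniform sampling is an assumption made for this sketch.
    """
    label_set = sorted(set(train_labels), key=str)
    rng = random.Random(seed)
    return [rng.choice(label_set) for _ in range(n_predictions)]


def accuracy(gold, predicted):
    """Fraction of positions where the prediction equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```

Any trained model should clearly beat this number; if it does not, the training setup or the label mapping is worth checking first.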

### Image modality
Before running the image-modality experiments, download the images and place them in the same directory as the JSONL files. Then create your working environment using the provided Python environment file.
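
To confirm the images landed in the right place before training, a small check like the following can help. It resolves each record's image path relative to the directory of the JSONL file and reports the ones that do not exist. The field name `img_path` is hypothetical; substitute the key your JSONL records actually use for the image location.

```python
import json
import os


def find_missing_images(jsonl_path, image_key="img_path"):
    """Return image paths referenced in a JSONL file that are not on disk.

    NOTE: `image_key` is a hypothetical field name used for illustration.
    Paths are resolved relative to the directory containing the JSONL file.
    """
    base_dir = os.path.dirname(os.path.abspath(jsonl_path))
    missing = []
    with open(jsonl_path, "r", encoding="utf-8") as file:
        for line in file:
            line = line.strip()
            if not line:  # ignore blank lines
                continue
            rel_path = json.loads(line).get(image_key)
            # A record without the key is reported as None
            if rel_path is None or not os.path.isfile(
                    os.path.join(base_dir, rel_path)):
                missing.append(rel_path)
    return missing
```

An empty list means every referenced image was found; a list full of `None` values suggests the assumed key name is wrong rather than that the images are missing.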


```bash
WDIR=./
task_name=class_label
model="resnet50"
task=memes_multiclass
# NOTE: exp_name and results_file were not defined in the original snippet.
# The two values below are placeholders; adjust them to your setup.
exp_name="${task}_${model}"
output_dir="$WDIR/outputs/$task/$model/"
results_dir="$WDIR/results/$task/$model/"
results_file="$results_dir/results.json"
# Make sure the output and results directories exist before training
mkdir -p "$output_dir" "$results_dir"

python "$WDIR/code/src/train.py" --name="$exp_name" --task-name="$task_name" --seed=1 \
--train=data/arabic_memes_categorization_train.jsonl --dev=data/arabic_memes_categorization_dev.jsonl --test=data/arabic_memes_categorization_test.jsonl --file-type "jsonl" \
--out-file="$results_file" --data-dir="$WDIR/" --best-state-path="$output_dir/best.pth" --fig-dir="$output_dir" \
--checkpoint-dir="$output_dir" --arch="$model" --batch-size=32 --learning-rate=1e-5 --weight-decay=0.0001 \
--num-epochs=100 --keep-frozen=False --use-rand-augment=True --rand-augment-n=2 --rand-augment-m=9
```

### Text modality
We mostly used Hugging Face Transformers models. A sample script is shown below.

```
bash bin/run_arabertv2.sh aubmindlab/bert-base-arabertv2
```

### Multimodality
For the multimodal experiments, we used OpenAI's GPT models and Google's Gemini.
