1. Introduction

2. Background
2.1 Dataset
% Getting to know the data
Single sample: id, URL, text content; labels are the 20 leaves of the hierarchy
Example: MAKE AMERICA GREAT AGAIN
Labels are not distributed equally; min: X, max: Y out of Z samples in the training set.

The authors provided train, dev, valid, and test data in English.
For three other languages, no labeled data was available.

2.2 Models
Transformers work well across a large variety of tasks
2.2.1 Encoder Only Models
Return an embedding for a sequence of tokens
2.2.2 Decoder Only Models
Usually trained to predict the next token, but a linear layer can be put on top of the model's output for the final token once the sequence has been fully processed.
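As a minimal sketch of this setup (toy dimensions and random weights are assumptions, not the paper's actual configuration), one common variant places the linear head on the hidden state of the last token, which in a causal decoder has attended to the whole sequence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not the paper's actual sizes).
hidden_dim, num_labels, seq_len = 8, 20, 5

# Pretend these are the decoder's hidden states for one sequence.
hidden_states = rng.normal(size=(seq_len, hidden_dim))

# Linear classification head: weights and bias.
W = rng.normal(size=(hidden_dim, num_labels))
b = np.zeros(num_labels)

# Use the hidden state of the last token, which has seen the full sequence.
logits = hidden_states[-1] @ W + b
probs = 1.0 / (1.0 + np.exp(-logits))  # independent sigmoids for multi-label
```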

3. System Overview
% What we plan on doing (high-ish level)
3.1 Architecture
- We tried encoder- as well as decoder-only models with a classification head
- Using just a linear layer as classifier does not incorporate the structure of the labels
=> Designed a custom classification head that incorporates the hierarchy of the labels

3.2 Custom classification head
- Gets the features of the last layer as input
...
- Returns the logits for all 28 labels (leaves plus internal nodes of the hierarchy)
=> Custom head incorporates the hierarchy of the labels during training
=> A threshold is required to classify a sample; more details in the experiments chapter
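Since the head's exact design is not spelled out above, here is one minimal sketch of a hierarchy-aware head, assuming a hypothetical toy taxonomy (the node names are invented, not the task's real labels) and the convention that a node's logit adds its parent's logit, so evidence is shared along each path:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy label hierarchy (hypothetical node names, not the task's real taxonomy).
parent = {
    "Ethos": None, "Pathos": None,
    "Ad Hominem": "Ethos", "Glittering Generalities": "Ethos",
    "Appeal to Fear": "Pathos", "Loaded Language": "Pathos",
}
nodes = list(parent)

hidden_dim = 8
features = rng.normal(size=hidden_dim)         # features from the last layer
W = rng.normal(size=(hidden_dim, len(nodes)))  # one raw score per node
raw = features @ W

# A node's logit is its own raw score plus its parent's logit.
logits = {}
def node_logit(name):
    if name not in logits:
        p = parent[name]
        logits[name] = raw[nodes.index(name)] + (node_logit(p) if p else 0.0)
    return logits[name]

all_logits = {n: node_logit(n) for n in nodes}
```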


3.3 Handling inputs in another language
- Use GPT-4 to translate to English

4. Experimental Setup
% What we were doing (choices we made)
- Train on train, develop on dev, validate on valid
- Hardware

4.1 Hyperparameters
We identified the following main hyperparameters:
- Model
- Learning rates
- Experimented with multiple pre-processing variants; early small-scale tests suggested using only the all-lowercase and cleaned variants
- Labels are not distributed equally -> possibility of weighting the loss
- Extra layers in the classification head 
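A minimal sketch of weighting the loss for imbalanced labels, assuming made-up label counts and a per-class positive weight of #negatives/#positives (one common choice, not necessarily the one used here):

```python
import numpy as np

# Hypothetical per-label positive counts on the training split (illustrative only).
label_counts = np.array([900, 120, 40, 15], dtype=float)
n_samples = 1000.0

# pos_weight = (#negatives / #positives) per class, so rare labels weigh more.
pos_weight = (n_samples - label_counts) / label_counts

def weighted_bce(logits, targets, pos_weight):
    """Binary cross-entropy with per-class positive weights."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    loss = -(pos_weight * targets * np.log(probs + eps)
             + (1.0 - targets) * np.log(1.0 - probs + eps))
    return loss.mean()

logits = np.array([2.0, -1.0, 0.5, -3.0])
targets = np.array([1.0, 0.0, 1.0, 0.0])
loss = weighted_bce(logits, targets, pos_weight)
```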

4.2 Evaluation Metrics and Baseline
- Metrics should incorporate the structure of the labels (some confusions are worse than others)
- Hierarchical Losses
- Baseline: always predict the same label (always yes / always no, ...)
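One standard way to make the metric hierarchy-aware (assumed here, with a toy taxonomy rather than the task's real one) is to expand the gold and predicted label sets with all ancestors before computing micro precision/recall/F1, so confusing siblings still earns partial credit through the shared parent:

```python
# Toy hierarchy (hypothetical names); None marks a root.
parent = {"Ethos": None, "Pathos": None,
          "Ad Hominem": "Ethos", "Glittering Generalities": "Ethos",
          "Loaded Language": "Pathos"}

def with_ancestors(labels):
    out = set()
    for l in labels:
        while l is not None:
            out.add(l)
            l = parent[l]
    return out

def hierarchical_f1(gold_sets, pred_sets):
    tp = fp = fn = 0
    for gold, pred in zip(gold_sets, pred_sets):
        g, p = with_ancestors(gold), with_ancestors(pred)
        tp += len(g & p); fp += len(p - g); fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Predicting the wrong sibling still matches the shared ancestor "Ethos".
score = hierarchical_f1([{"Ad Hominem"}], [{"Glittering Generalities"}])
```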

4.3 Improving predictions

4.3.1 Determining Thresholds for Classification
- Common approach: use 0.5 as the threshold
- Pick the same threshold for all classes
- Find the best threshold for each class individually
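A minimal sketch of the per-class variant: search a grid of thresholds on one class's dev-set probabilities and keep the one maximizing F1 (the grid and the toy numbers are illustrative assumptions):

```python
import numpy as np

def f1_binary(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, y_true, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the threshold on this class's dev probabilities that maximizes F1."""
    scores = [f1_binary(y_true, (probs >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]

# Toy dev-set probabilities for one class (illustrative numbers only).
probs = np.array([0.9, 0.8, 0.42, 0.28, 0.1])
y_true = np.array([1, 1, 1, 0, 0])
th = best_threshold(probs, y_true)
```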

4.3.2 Ensemble
- Multiple Models with the same architecture
- Hopes of providing different views on the same sample (needs verification)
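A sketch of one simple combination scheme, assuming probability averaging over several models of the same architecture (the combination rule is an assumption; the exact scheme used may differ), with a shared threshold applied afterwards:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy logits from three independently trained models (same architecture),
# for 4 samples x 5 labels; real logits would come from the trained networks.
member_logits = rng.normal(size=(3, 4, 5))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Average the per-model probabilities, then threshold.
ensemble_probs = sigmoid(member_logits).mean(axis=0)
preds = (ensemble_probs >= 0.5).astype(int)
```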

5. Results
% What we achieved
- Achieved a hierarchical F1 of X (ranked third)

5.1 Hyperparameter influence: only rough trends, since we do not have complete data!!!
- LLaMA-2-13b works best
- LR
- all-lowercase works best
- Surprisingly, weighting the loss performs worse
- More layers -> probably better (?)
- Custom head: small improvement
- Thresholds

5.2 Error Analysis
- See PowerPoint



For our submission to Task 4 of SemEval 2024, we developed a custom classification head, designed to be applied atop a Large Language Model, that incorporates the hierarchy of the labels.
To find the best hyperparameters for the LLaMA-2-13b model, we first conducted a grid search using various smaller BERT and RoBERTa models.
To compete in the multilingual setting, we translated all documents into English.