REVIEWER #1
Lack of relevant model comparison experiments.
Will the descriptive information generated by the LLaVA model differ each run, resulting in non-reproducible experimental results?
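This concern hinges on the decoding strategy: greedy (argmax) decoding is deterministic, while sampling is not. A toy sketch of the distinction (the vocabulary and probabilities are illustrative; with LLaVA in Hugging Face `transformers`, the equivalent switch is `generate(..., do_sample=False)`):

```python
import random

# Illustrative next-token distribution (not from any real model).
vocab_probs = {"cat": 0.6, "dog": 0.3, "meme": 0.1}

def greedy(probs):
    # Argmax decoding: always picks the highest-probability token,
    # so the same input yields the same output on every run.
    return max(probs, key=probs.get)

def sample(probs, rng):
    # Sampling decoding: output varies with the random state.
    words, weights = zip(*probs.items())
    return rng.choices(words, weights=weights, k=1)[0]

# Greedy decoding is reproducible by construction.
assert all(greedy(vocab_probs) == "cat" for _ in range(5))
```

Under sampling, reproducibility instead requires fixing the random seed before each generation.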

REVIEWER #2
UMUTeam participated in all three subtasks, in English only.
For subtask 1 they fine-tuned the RoBERTa-large LLM on the task; for the other subtasks they again used RoBERTa-large, but its input was first concatenated with a representation of the meme produced by LLaVA, an end-to-end multimodal Large Language Model (LLM) that incorporates a vision encoder for general-purpose visual and language understanding.
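The described input construction can be sketched as a simple concatenation step before tokenisation; the separator string below is an assumption, as the paper may join the two fields differently:

```python
def build_input(meme_text: str, llava_description: str, sep: str = " </s> ") -> str:
    """Concatenate the meme's overlaid text with the LLaVA-generated image
    description into a single sequence for RoBERTa. The `</s>` separator is
    illustrative (RoBERTa's sep token); the actual joining scheme may differ."""
    return meme_text.strip() + sep + llava_description.strip()

# Hypothetical example fields, not from the dataset.
example = build_input("Text overlaid on the meme",
                      "A photo of a politician at a podium")
```

The combined string would then be tokenised and fed to the fine-tuned RoBERTa-large classifier.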

The paper is rather clear, but it misses some key information that needs to be added to the camera-ready version. The task was offered in several languages; which languages you participated in should be clear to the reader without having to check the leaderboard.
Done 

While the authors provide a link to the competition website, they should cite the task description paper, since the paper will always be accessible, whereas the website is not guaranteed to remain so.

Moreover, when you refer to a model in the paper, you cite the corresponding paper so that the reader can get more information on it and verify what you wrote about it. Not doing so for the task description paper detracts from the quality of yours. See the Post-Task guide on the website for how to do so.

The experimental section reports only the experiments performed during the competition. To make the paper more interesting, the authors could perform a simple ablation study, for example trying different random seeds for the initialisation of the parameters, or solving subtask 2a without LLaVA.
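The suggested seed ablation amounts to repeating the run over several seeds and reporting the spread of the metric. A minimal sketch, assuming a hypothetical `train_and_evaluate` routine (a full version would also seed the deep learning framework, e.g. `torch.manual_seed`):

```python
import random
import numpy as np

def set_seed(seed: int) -> None:
    # Minimal seeding helper; extend with torch.manual_seed / CUDA seeding
    # when a DL framework is involved.
    random.seed(seed)
    np.random.seed(seed)

# Seed ablation: repeat the experiment for each seed and summarise the metric.
scores = []
for seed in (13, 42, 77):  # seed choices are arbitrary
    set_seed(seed)
    # score = train_and_evaluate(seed)  # hypothetical training routine
    score = np.random.rand()  # stand-in value so the sketch runs end-to-end
    scores.append(score)

mean, std = float(np.mean(scores)), float(np.std(scores))
```

Reporting mean and standard deviation over seeds makes it clear how much of the result is attributable to initialisation noise.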

While the description of the model and the experimental setup is reasonably detailed, a link to the code would enhance the replicability of the work.

Notice that the number of techniques for subtask 2a is 22.

Done 
REVIEWER #3
This paper has a solid foundation and demonstrates good results. However, there is a lack of detail in the results section.

In my opinion there are a few minor issues:
You mention misinformation frequently; while it is relevant to the use of persuasion techniques, it seems a little disconnected, as misinformation detection is not the focus of the task.

No citations for transformers and RoBERTa in the introduction
Done

The text in figure 1 (left hand side) is very small

Done 

Figure 2 seems to combine the training process with the system architecture, and it implies that the sequence classification layer was not present during fine-tuning.
Done 


"LlaVa has demonstrated impressive multimodal conversational capabilities, sometimes exhibiting behavior similar to the multimodal GPT-4 on unseen images/instructions, and achieving a relative score of 85.1% compared to GPT-4 on a synthetic multimodal instruction-following dataset." -> the source for this information is not clear

Questions for Authors
---------------------------------------------------------------------------
1. How does your model perform on each individual label? 
2. Does including image descriptions improve the performance for some labels more than others? 
3. Does including image descriptions reduce the performance in some cases?
4. What were the most common misclassifications your models made? 
5. When your models misclassify, do they tend to confuse classes that are closer together in the hierarchy?
