REVIEWER #1

The current manuscript delineates the involvement of the UMUTeam in Task 8 of SemEval-2024, designated as "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection." This collaborative endeavor aims to identify automated systems capable of detecting machine-generated text to mitigate potential misuse. The task comprises three subtasks: Subtask A, involving binary classification to ascertain whether a given full text originated from a human or a machine; Subtask B, addressing a multi-class classification challenge to identify the source of a full text, distinguishing between human-generated content and that produced by specific language models; and Subtask C, focused on mixed human-machine text recognition. The UMUTeam contributed to Subtask B, employing a multimodal approach that integrates fine-tuning of a pre-trained model, such as RoBERTa, with syntactic features extracted from the texts. The system attained the 23rd position among 77 participants, achieving a score of 75.350%, thereby surpassing the baseline performance.

The article is well-written and only some minor English editing is required.

The article misses several relevant works/models based on ensembles or data augmentation for NLP tasks. The authors should discuss further ensemble-based SOTA models and data augmentation strategies to better motivate their choices. Some recent and relevant works worth mentioning are:



1) PL-Transformer: a POS-aware and layer ensemble transformer for text classification, 2023

2) TextCNN-based ensemble learning model for Japanese Text Multi-classification, 2023

3) T100: A modern classic ensemble to profile irony and stereotype spreaders, 2022

4) Ensemble feature selection for multi-label text classification: An intelligent order statistics approach, 2022

5) Text enrichment with Japanese language to profile cryptocurrency influencers, 2023

6) Opinion mining using ensemble text hidden Markov models for text classification, 2018

7) Backtranslate what you are saying and I will tell who you are, 2024

8) XLNet with data augmentation to profile cryptocurrency influencers, 2023

9) Profiling cryptocurrency influencers with few-shot learning using data augmentation and electra, 2023

10) A survey on data augmentation for text classification, 2022


Please detail your preprocessing choices further, possibly referencing some recent reference works on text preprocessing to describe the techniques used. For instance:

a) Is text preprocessing still worth the time? A comparative survey on the influence of popular preprocessing methods on Transformers and traditional classifiers, 2024

b) Text preprocessing for text mining in organizational research: Review and recommendations, 2022

Here are some specific comments to improve the paper:

You introduce the acronyms Artificial Intelligence (AI) and Natural Language Processing (NLP) with the first letters capitalized, but for natural language generation (NLG) you do not. Please make this uniform. Also, define AI before using it.

"...The models evaluated for Subtask B are RoBERTa..., a model based..." -> "...subtask B is RoBERTa..."

"...number of times different punctuation marks..." -> "...number of times that different..."

"...treat PAN corpora..." -> Where did you reference these PAN corpora?

"...we have extracted the confusion matrix of our model on the test set..." -> please rephrase.

"We can see that our model predicts the texts generated by Bloomz, Dolly, ChatGPT and Davinci very well,..." -> "very well"? So would a system with 95% accuracy be "very very very well"? Or just "very very well"? Please just report numbers and facts. Do not forget that you are writing a scientific paper.

REVIEWER #2

Needs minor revisions for camera-ready submission. Other notes:

- It's mentioned that your approach is multimodal. In the literature, modality usually refers to text, image, or audio modalities, so the term may be misleading here, since all of your inputs are textual.
- No need to outline that AI is short for Artificial Intelligence or that NLP is short for Natural Language Processing. Same applies to LLMs. This paper is to be published in an NLP conference, and the audience knows these abbreviations.
- No need to explain how RoBERTa was pretrained
- The tokenizer does not generate embeddings as mentioned in the paper. Rather, it tokenizes the text and creates a vector of token indices.
- The system description does not make sense. It claims that syntax features and token indices are concatenated, fed into BERT and RoBERTa, then fine-tuned. I think what was actually done is that the embeddings extracted from the transformers, NOT the tokenizer, are concatenated with the syntax features, and the resulting model is then fine-tuned.
- The paper does not cite the shared-task paper.
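To illustrate the tokenizer/embedding distinction raised in the notes above, here is a toy sketch in plain Python/NumPy. The vocabulary, dimensions, and syntax features are all hypothetical, and this is not the authors' actual pipeline:

```python
import numpy as np

# Toy vocabulary: the tokenizer maps text to integer token indices;
# it does NOT produce embeddings.
vocab = {"[PAD]": 0, "the": 1, "text": 2, "is": 3, "generated": 4}

def tokenize(text, max_len=6):
    ids = [vocab.get(tok, 0) for tok in text.lower().split()]
    return ids + [0] * (max_len - len(ids))  # pad with [PAD] indices

# The embedding lookup lives inside the transformer, not the tokenizer.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))  # 8-dim toy embeddings

token_ids = tokenize("the text is generated")       # [1, 2, 3, 4, 0, 0]
token_embeddings = embedding_table[token_ids]       # shape (6, 8)
pooled = token_embeddings.mean(axis=0)              # shape (8,): sentence vector

# Syntax features (e.g., punctuation counts) are concatenated with the
# pooled transformer representation, not with the raw token indices.
syntax_features = np.array([3.0, 1.0, 0.0])         # hypothetical counts
fused = np.concatenate([pooled, syntax_features])   # shape (11,): classifier input
```

The point of the sketch is that `token_ids` is just a list of integers, while `fused` combines a learned representation with the hand-crafted features; only the latter is a meaningful classifier input.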


The paper needs revision, especially regarding the last two notes.
How do you know that the syntactic features were actually useful? It would be helpful to see ablation results.


REVIEWER #3
Appropriateness
Paper fits in the event and describes a submission with a good score.

Clarity
The writing is of good quality and understandable for most readers. Minor gripe: the first paragraph of the Results section says the accuracy is 15.5% higher than the first, which should read "lower".

Originality / Innovativeness
No groundbreaking or innovative techniques are used. Using stylometric features along with embeddings from a RoBERTa model is a good technique, though. It would have been great if the authors had explored more techniques or innovations to the model.

Soundness / Correctness
Whatever is chosen is done well. However, many other models could have been tried and tested. There is no discussion of hyperparameter tuning for the selected model. The concatenation of numerical features and embeddings is not explored in depth.

Meaningful Comparison
The references are adequate for the work done.

Thoroughness
Absolutely not thorough. Only one model is mentioned and tested (according to the paper). There is no discussion of hyperparameter tuning. The experiments mentioned are limited to the final configuration. I would have loved to see more discussion of the results, especially on why the misclassifications occur for Dolly/Cohere/DaVinci.

One alternative could have been testing an ensemble of two models, with the numerical features and the embeddings used to train separate models.
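Such a two-model ensemble could be realized, for instance, as late fusion of the two models' prediction probabilities. A minimal sketch with mock probability outputs (the class layout and the 0.5 weight are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical per-class probabilities from two separately trained models:
# one on the transformer embeddings, one on the numerical/syntax features.
# Rows = examples, columns = classes (e.g., human, Bloomz, Dolly, ChatGPT).
p_embeddings = np.array([[0.70, 0.10, 0.10, 0.10],
                         [0.20, 0.60, 0.10, 0.10]])
p_features   = np.array([[0.50, 0.20, 0.20, 0.10],
                         [0.10, 0.30, 0.40, 0.20]])

def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two models' probabilities (soft voting)."""
    return weight_a * probs_a + (1.0 - weight_a) * probs_b

ensemble = soft_vote(p_embeddings, p_features)
predictions = ensemble.argmax(axis=1)  # final class per example
```

The fusion weight could be tuned on a validation split, which would also make the contribution of each feature set directly measurable.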


Impact of Ideas or Results
Concatenating numerical features and embeddings is moderately interesting but remains a minor incremental improvement. 

Recommendation
I would recommend the paper primarily for the feature design, but am conflicted because it remains a work in progress and is not thorough in its exploration.