REVIEWER #1
The paper is clearly written and well structured. The authors tried reasonable pre-trained models in zero-shot settings and obtained decent results.

It would have been better to include a detailed error analysis rather than just discussing confusion matrices. The paper should also have reported task-level (paraphrase generation, machine translation, and definition modelling) analysis and numbers, which were not present.


REVIEWER #2

In this paper the authors present their participation in SemEval Task 6: SHROOM. SHROOM is about hallucination detection, one of the greatest current challenges of NLG using language models. For this, the authors use prompt engineering to test the zero-shot classification capability of three Large Language Models: Mistral, LLaMa-2, and Tulu. Additionally, the authors provide an interesting error analysis section that has the potential to shed more light on their results.


Here are some aspects that need to be revised before the article is ready:
- Rather than simply describing the confusion matrices, I think you can make better use of the error analysis results you present.

- I think there might be a miscommunication in calling the presented method "zero-shot learning", because the prompts you present are likely similar to what the systems you are using have seen. I would like to see some evidence that this task was not seen during training for any of the three models you are testing.

- The second and third paragraphs of the intro are almost a direct copy-paste of the task description provided on the SHROOM website. Please reframe or acknowledge this.

- [meta] Please add a citation to the SHROOM task description paper when referring to the dataset and/or the task (this will be provided in the upcoming days).
- Please check for typos throughout the document.