============================================================================ 
SemEval 2024 Reviews for Submission #4
============================================================================ 

Title: nicolay-r at SemEval-2024 Task 3: Reasoning Emotion Cause Supported by Emotion State with Chain-of-Thoughts
Authors: Nicolay Rusnachenko and Huizhi Liang


============================================================================
                            REVIEWER #1
============================================================================

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
                   Appropriateness (1-5): 5
                           Clarity (1-5): 4
      Originality / Innovativeness (1-5): 3
           Soundness / Correctness (1-5): 5
             Meaningful Comparison (1-5): 5
                      Thoroughness (1-5): 5
        Impact of Ideas or Results (1-5): 4
                    Recommendation (1-5): 4
               Reviewer Confidence (1-5): 4


============================================================================
                            REVIEWER #2
============================================================================

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
                   Appropriateness (1-5): 5
                           Clarity (1-5): 3
      Originality / Innovativeness (1-5): 3
           Soundness / Correctness (1-5): 3
             Meaningful Comparison (1-5): 3
                      Thoroughness (1-5): 4
        Impact of Ideas or Results (1-5): 3
                    Recommendation (1-5): 4
               Reviewer Confidence (1-5): 4

Detailed Comments
---------------------------------------------------------------------------
This paper presents a submission by the team nicolay-r to SemEval-2024 Task 3, which targets emotion-cause pair analysis in conversations. The models developed in this work rely on the Flan-T5 language model, enhanced by a three-hop reasoning approach, to perform emotion classification. The submitted system scored 3rd and 4th for F1-proportional and 5th for F1-strict among 15 participating teams. The presented approach is interesting and the results are promising; these are the main strengths of this submission. However, several aspects of the paper can be improved. These are outlined below.

Overall, the paper is structured reasonably but it would be good to proofread it for the camera-ready submission as it currently contains grammatical errors and the writing is not always fluent. The use of some terminology can also be improved: for instance, instead of "instructive large language model (LLM) training", "LLM instruction-tuning" is more commonly used in the field; similarly, "instructive language models" should be "instruction-tuned language models".

Clarity of writing can also be improved: for example,
- It is not entirely clear what Table 1 covers; its content (and the importance of the included statistics) should be explained in the main text.
- Please make sure that all abbreviations and terms (mathematical and other) are first defined in text.
- It is unclear why a specific model (Flan-T5) was used as this is never explained in the paper.
- It is unclear whether the idea outlined in the "Spans Correction" subsection of Section 3 was included in the submission or whether it is a suggestion for future improvements. How do the results with this correction compare to the results without such a correction?

The paper currently lacks a dedicated Previous Work section and does not explain in enough detail the methodology of Hao et al. (2023), which was used as the basis for the proposed framework. Without such a dedicated discussion it is hard to put the current work in context, make meaningful comparisons, and judge whether the proposed solution is fully based on the previous work or is innovative.

Other suggestions:
- Subsection 1.2 contains references to Section 1, i.e., essentially to itself. If you mean to refer to a specific part, link to the relevant subsection instead.
- Make sure all text is in the same font; e.g., prompt templates in Section 2 are written in a smaller font for no obvious reason.
---------------------------------------------------------------------------


Questions for Authors
---------------------------------------------------------------------------
See the suggestions on how to improve the paper above.
---------------------------------------------------------------------------



============================================================================
                            REVIEWER #3
============================================================================

---------------------------------------------------------------------------
Reviewer's Scores
---------------------------------------------------------------------------
                   Appropriateness (1-5): 4
                           Clarity (1-5): 3
      Originality / Innovativeness (1-5): 4
           Soundness / Correctness (1-5): 3
             Meaningful Comparison (1-5): 2
                      Thoroughness (1-5): 2
        Impact of Ideas or Results (1-5): 3
                    Recommendation (1-5): 3
               Reviewer Confidence (1-5): 4

Detailed Comments
---------------------------------------------------------------------------
1. Appropriateness - The paper addresses a way to obtain causal utterances and emotion labels for a given utterance. However, it does not provide a way to extract causal spans.

2. Clarity - The paper had to be read multiple times. Tables 1 and 4 seem very cramped and are hard to interpret in one go. In Figure 3, the legend occludes the line plot.

3. Originality / Innovativeness - In regards to the literature in emotion cause analysis, very few have attempted Chain of Thought framework in this area.

4. Soundness / Correctness
    4.1. THoR engine chains probabilities thrice. Perhaps using other alternatives of CoT which use less 'hops' could prove beneficial?
    4.2. The dataset has some special characters from string.punctuation which are present in the label spans. Were they incorporated appropriately in the span correction algorithm?
    4.3. 'Future' contexts, i.e., utterances at later time steps, are ignored, although they are crucial for this task because conversations can be simultaneous.

5. Meaningful Comparison - No prior work is discussed regarding the efficacy of prior approaches or CoT-based frameworks.

6. Thoroughness
    6.1. The idea sounds great, but the efficacy of THoR cause and THoR cause-RR is questionable when compared to the fine-tuned Flan-T5 base for D cause. Figure 3 shows that in later epochs (> 5) the prompt-based method performs better than THoR cause and THoR cause-RR.
    
7. Impact of Ideas or Results - Incorporating CoT framework has not been done extensively in Emotion Cause Analysis (ECA), but this might be cited as a way to benchmark these frameworks for ECA.
---------------------------------------------------------------------------


Questions for Authors
---------------------------------------------------------------------------
1. Tables 1 and 4 seem very cramped, which makes them hard to interpret in one go. In Figure 3, the legend occludes the line plot. Could these items be reformatted?
2. Is the span correction algorithm impacted by the choice of punctuation set? If so, string.punctuation should be avoided, as it contains characters that are present in the dataset and are valid prefixes.
3. Figure 3 might be showing that in later epochs (> 5) the prompt-based method performs better than THoR cause and THoR cause-RR on the dev set. Is this true? Does the metric rebound and increase?
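
To make the concern in question 2 concrete, here is a minimal sketch of how the choice of punctuation set can change a span's boundaries. It is not the authors' span correction algorithm, whose details are not reproduced here: the helper trim_span, the narrower set NARROW_SET, and the example utterance are all assumptions made for illustration only.

```python
import string

# Hypothetical helper, NOT the authors' span correction algorithm:
# it only illustrates how the strip set affects span boundaries.
def trim_span(span: str, strip_chars: str) -> str:
    """Remove leading/trailing whitespace and any characters in strip_chars."""
    return span.strip(string.whitespace + strip_chars)

# Assumed narrower strip set: sentence-level separators only.
NARROW_SET = ",.;:"

# Made-up utterance (not taken from the dataset) whose span starts with an
# apostrophe and ends with an exclamation mark.
example = "'Cause I was scared!"

# string.punctuation contains both "'" and "!", so both ends are trimmed.
print(trim_span(example, string.punctuation))

# The narrower set leaves the leading apostrophe and trailing "!" intact.
print(trim_span(example, NARROW_SET))
```

If the gold spans keep such leading or trailing characters, trimming with the full string.punctuation set would shift the predicted span boundaries away from the annotation, which may affect a strict span-matching metric more than a proportional one.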
---------------------------------------------------------------------------