REVIEWER #1

I will say that overall the paper was well-written. I think it was mostly clear what was done and why.

Section by section breakdown:

Intro:
The first paragraph is good; I think it introduces the topic effectively. The second paragraph, where you introduce code mixing, is slightly awkward. You might want some more words there to demonstrate the importance of code mixing. Since this task deals simultaneously with emotion recognition in conversation, emotion flip reasoning, and code mixing, I think you have to make sure they're introduced one at a time and the importance of each is demonstrated. In paragraph 3 you say that for emotions "it is not enough to simply recognize them," which I think might be a bit strong. There are plenty of cases where emotion recognition alone is sufficient for solving a problem.

Background:
How come you have a paragraph on multimodal emotion recognition? It's interesting but I'm not sure what the applicability is here.
This is small but in your list of models you should probably start with BERT since the other two are based on it.

System Overview / Experimental Setup:
How come you decided to translate the Hindi-English data to English? Maybe it's simply because that's the language the models you're using were trained on. If so, you should mention that.
I think grouping the utterances made by the speaker of the target utterance is creative, that's certainly not something that I thought of when attempting these tasks, though I'm not sure what your reasoning behind this is (discussed more in Questions for Authors).
A question: how did you separate the utterances when you passed them to the model? Was it semicolons, as shown in figure 2? You might want to say that explicitly.
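To make the question concrete, here are the two obvious joining schemes I can imagine (a purely illustrative sketch with made-up utterances, not your actual preprocessing):

```python
# Toy utterances standing in for one speaker's turns in a dialogue.
utterances = ["I can't believe it!", "Why not?", "It just seems too good."]

# Option 1: plain semicolon separators, as figure 2 seems to suggest.
joined_semicolon = "; ".join(utterances)

# Option 2: the model's own separator token, which BERT-style models
# were pretrained to treat as a segment boundary.
joined_sep = " [SEP] ".join(utterances)

print(joined_semicolon)
print(joined_sep)
```

The choice matters because a pretrained model assigns special meaning to `[SEP]` but treats a semicolon as ordinary punctuation.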
I really like how you included the model diagrams, however I have a couple questions about them. What are the "user n" blocks in the preprocessing diagram? Is it all the utterances each user made? I'm also not sure what the fine-tuning diagram shows. Is it a diagram of the model or is it how you trained the model?
Very small thing: "traslation" misspelled in diagram.

Results:
Why did XLM and DeBERTa get a 0 F1? Did they predict all 0's or something? It would be nice to know why.
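For what it's worth, a model that collapses to always predicting the negative class gets exactly 0 F1 on the positive class, since precision and recall both have zero true positives, which would explain these scores. A quick sketch (my own toy example, not your evaluation code):

```python
def f1(y_true, y_pred, positive=1):
    """Binary F1 for the given positive class, computed from counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are 0, so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 1, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0]  # degenerate model: always predicts class 0
print(f1(y_true, y_pred))  # 0.0
```

So even a model with decent accuracy on an imbalanced label set can report 0 F1 if it never fires on the positive class, and the paper should say which case applies.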

Analysis:
I like the confusion matrix; it's good to see where the model is going right and wrong. Question: why do tables 5 and 6 show a much higher F1 score than was reported in the results? They show 92 F1, which is really high. Is that the training set or something?

Good work.
---------------------------------------------------------------------------


Questions for Authors
---------------------------------------------------------------------------
Why do you group the utterances made by the speaker of the target utterance? It seems to me there's an implicit hypothesis here: that when classifying the emotion of a specific utterance, all the utterances that speaker makes in the conversation matter more than the utterances other speakers make. In other words, we understand a speaker's emotion better by looking at everything they say rather than what others say. Why do you think that's the case? Isn't a speaker's emotion highly dependent on what others say? Why do we not include other speakers' utterances as context? You don't have to go too in depth here, but I think you should explain your reasoning somewhat. It could be pretty interesting and may be the "secret sauce" of this paper.

What is the "sequence classification layer"? Is that a fully connected layer, and if so, how is that "sequence classification"? Or do I totally misunderstand your model and you're using some sort of RNN? When you refer to a sequence do you mean the words in an utterance or the sequence of utterances?
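In case it helps clarify what I'm asking: my mental model of a "sequence classification layer" (in the usual transformer-library sense) is a single fully connected layer applied to one pooled vector for the whole input sequence, such as the [CLS] token's hidden state, with no recurrence anywhere. Roughly, and purely as an assumption about your architecture:

```python
def sequence_classification_head(pooled_vector, weights, bias):
    """One linear layer mapping a pooled sequence encoding to class logits:
    logits[c] = sum_d pooled_vector[d] * weights[c][d] + bias[c]."""
    return [
        sum(x * w for x, w in zip(pooled_vector, row)) + b
        for row, b in zip(weights, bias)
    ]

pooled = [0.5, -1.0, 2.0]      # toy pooled hidden state (dim 3)
W = [[1.0, 0.0, 0.0],          # toy weights: 2 classes x 3 dims
     [0.0, 1.0, 1.0]]
b = [0.0, 0.1]
logits = sequence_classification_head(pooled, W, b)
print(logits)
```

If that's all the layer is, calling it "sequence classification" refers to classifying the sequence as a whole, not to any sequential processing; please confirm whether that matches your model.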

Do you use the same model for ERC and EFR (besides the difference in text processing)? Did you make any changes to the last layers in the model?

REVIEWER #2
Good analysis using three transformer models: XLM-R, DeBERTa, and BERT.
The paper surpassed the 5-page limit. I recommend removing table 4 with the official ranking to fit in the 5-page limit. It was interesting to see that for tasks 2 and 3, XLM-R and DeBERTa are outperformed by BERT by a large margin. Why do you think that is? Why does XLM-R have an F1 score of 0?
