Chinese_tense_data.txt (UTF-8 encoded) is the dataset used in our paper. It contains 294 conversations collected from 25 Chinese movies, dramas and TV shows. Each conversation contains 2-18 sentences, and there are 1,857 sentences in total.

In Chinese_tense_data.txt, each line is one sentence of a conversation. If a sentence has a predicate, its main predicate is marked in the form #predicate#. The annotation for a sentence is at the beginning of its line in the form <x,y>.

x represents the speaker of a sentence. We use a,b,c,d,e… to distinguish different speakers in a conversation.

y is the manually labeled tense of the main predicate of the sentence. It takes one of the values {p,c,f,i,none}:

p denotes the past tense;
c denotes the present tense;
f denotes the future tense;
i denotes that the sentence is imperative;
none denotes that the sentence has no predicate, so no tense label is needed.
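
The line format described above can be read with a short script. The following Python code is only a minimal sketch: the names parse_line and Sentence are hypothetical, the tolerance for spaces inside <x,y> is an assumption, and the sample sentence in the usage note is invented for illustration and is not taken from the dataset.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "<x,y>" at the start of the line, followed by the sentence.
# x is a single lowercase letter; y is one of p, c, f, i, none.
LINE_RE = re.compile(r"^<\s*([a-z])\s*,\s*(p|c|f|i|none)\s*>\s*(.*)$")
# The main predicate is wrapped in a pair of '#' characters.
PRED_RE = re.compile(r"#([^#]+)#")


class Sentence(NamedTuple):
    speaker: str              # x in <x,y>
    tense: str                # y in <x,y>
    text: str                 # sentence text, still containing the '#' markers
    predicate: Optional[str]  # marked main predicate, or None if absent


def parse_line(line: str) -> Optional[Sentence]:
    """Parse one annotated line; return None if it does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    speaker, tense, text = match.groups()
    pred = PRED_RE.search(text)
    return Sentence(speaker, tense, text, pred.group(1) if pred else None)
```

For example, parse_line("<a,p>我昨天#去#了北京。") yields speaker "a", tense "p" and predicate "去". Lines that do not match the assumed layout return None, so they can be inspected by hand instead of being silently misread.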

In our paper, we mainly focus on the {p,c,f} tense system and do not consider imperative sentence identification. In evaluation, we only consider predictions for sentences labeled p, c or f.
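
The evaluation restriction above can be sketched in Python as follows. The function names are hypothetical, and plain accuracy is used here only as an illustrative metric; this file does not state which metric the paper reports.

```python
# Labels that count toward evaluation; i and none are excluded.
EVAL_LABELS = {"p", "c", "f"}


def evaluation_subset(gold, predicted):
    """Keep (gold, predicted) pairs whose gold label is p, c or f."""
    return [(g, p) for g, p in zip(gold, predicted) if g in EVAL_LABELS]


def accuracy(pairs):
    """Fraction of pairs where the prediction equals the gold label."""
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)
```

With gold labels ["p", "i", "c", "none", "f"] and predictions ["p", "p", "f", "c", "f"], only the first, third and fifth sentences are scored, so two of three predictions count as correct.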

To refer to this dataset, please cite the following paper:
Tao Ge, Heng Ji, Baobao Chang, Zhifang Sui: One Tense per Scene: Predicting Tense in Chinese Conversations. In Proceedings of ACL-IJCNLP 2015.

