
================================================================================
Priberam Fine-Grained Opinion Corpus, V1.0
================================================================================

- This package contains the Priberam Fine-Grained Opinion Corpus, a 
  Portuguese (MPQA-like) fine-grained dependency opinion mining corpus, which is
  described in [1]. 

- If you use this data in your research, please cite the paper:

	[1] Mariana S. C. Almeida, Claudia Pinto, Helena Figueira, Pedro Mendes
	    and André F. T. Martins. 2015. "Aligning Opinions: Cross-Lingual
	    Opinion Mining with Dependencies", In Annual Meeting of the
	    Association for Computational Linguistics (ACL).

- License: Priberam Fine-Grained Opinion Corpus (c) by Priberam Informática, S.A.
	   Priberam Fine-Grained Opinion Corpus is licensed under a 
	   Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. 
	   You should have received a copy of the license along with this work (file LICENSE.txt). 
	   If not, see <http://creativecommons.org/licenses/by-nc-sa/4.0/>. 

- Release date: May 2015

- Website: http://labs.priberam.com/Resources/Fine-Grained-Opinion-Corpus.aspx


================================================================================
Contents
================================================================================

1. Description

2. Annotations 
	2.1 Annotation Procedure
	2.2 Annotated Elements
	2.3 Data Format
	2.4 Examples

3. References

4. Acknowledgements


================================================================================
1. Description
================================================================================

The Priberam Fine-Grained Opinion Corpus (available at 
http://labs.priberam.com/Resources/Fine-Grained-Opinion-Corpus.aspx)
consists of a subset of the documents of the Priberam Compressive Summarization 
Corpus (PCSC) [2] (http://labs.priberam.com/Resources/PCSC.aspx), which contains 
80 news topics with 10 documents each, collected from several Portuguese newspapers, 
TV and radio websites in the biennia 2010–2011 and 2012–2013. For this corpus, we 
selected and annotated one document of each of the 80 topics. 
The corpus contains a total of 80 news documents with 1226 sentences annotated with
a total of 828 opinions (direct-subjectives) and respective agents, targets, 
polarities and intensities (see 2.2. Annotated Elements).

This package contains six files:
	README.txt  	--> This file
	LICENSE.txt 	--> License file 
	lexicon_pt.txt  --> A Portuguese translation of the Subjectivity Lexicon of [5]
	train.txt   	--> 20 training documents, selected from the biennium 2010–2011.
	dev.txt     	--> 20 development documents, selected from the biennium 2010–2011.
	test.txt    	--> 40 test documents, selected from the biennium 2012–2013.


You may also be interested in the Portuguese translation of the Subjectivity Lexicon of [5] 
that we used in [1]: the Priberam Subjectivity Lexicon for Portuguese, which is 
available at http://labs.priberam.com/Resources/Subjectivity-Lexicon-PT.aspx.


================================================================================
2. Annotations
================================================================================


======= 2.1. Annotation Procedure ========

The corpus was annotated in a similar vein to the MPQA corpus [3], with the addition of a 
head node for each span element of the opinion frame (see 2.2. Annotated Elements).

The annotation was carried out by three linguists, after they had read the MPQA annotation 
guidelines [3,4] and completed a short practice period using the provided examples and 
some MPQA annotated sentences. Each document was annotated by two of the three linguists 
and then revised by the third linguist, who (in case of any doubts) discussed them with 
the initial annotators to reach a final consensus. For inter-annotator agreement 
scores, please see Table 3 in [1]. 


======= 2.2. Annotated Elements =========

- The corpus is annotated with several of the subjectivity elements annotated in the MPQA corpus:
	• direct-subjective expressions (or opinions), which are direct mentions of a private 
	  state, e.g. opinions, beliefs, emotions, sentiments, speculations, goals, etc.;
	• the opinion agent, i.e., the holder of the opinion;
	• the opinion target, i.e., what is being argued about;
	• the opinion polarity, i.e., the sentiment (positive, negative or neutral) of the 
	  opinion as a whole and towards each target;
	• the opinion intensity, i.e., the intensity (low, medium, high, extreme) of the 
	  opinion as a whole and towards each target.

- For each annotated span (of an opinion, agent or target), the annotator indicated the 
head word that best represents that span (most typically, this head corresponds directly 
to the syntactic head of the span).

- The corpus also includes, for each word, subjectivity information obtained by translating 
and inflecting the words in the Subjectivity Lexicon of [5]. 
(A translated version of the Subjectivity Lexicon of [5] is also available on the 
Priberam Labs page: http://labs.priberam.com/Resources/Subjectivity-Lexicon-PT.aspx)


========= 2.3. Data Format ===========

The annotations are provided in the CoNLL format (illustrated by the two examples in 2.4). 
Each word is annotated on a separate line, and sentences are separated by empty lines. 
Before each sentence there is a comment line (starting with the character "#") that contains 
the name of the document (obtained from the original summarization corpus PCSC [2]) 
and the number of the sentence in that document.
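
A minimal reading sketch in Python follows, assuming only the layout described 
above (a "#" comment line, one word per line, empty lines between sentences). 
The function name read_sentences is hypothetical. Treating a run of several tabs 
as a single separator is also an assumption, made because the second example in 
2.4 is aligned with repeated tabs and empty cells are always written as "_".

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

Sentence = Tuple[Dict[str, str], List[List[str]]]


def read_sentences(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield (header, rows) pairs, one per sentence block."""
    header: Dict[str, str] = {}
    rows: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence, if any.
            if rows:
                yield header, rows
            header, rows = {}, []
        elif line.startswith("#"):
            # Comment line: collect key=value pairs (doc=..., sentence=...).
            header = dict(re.findall(r"(\w+)=(\S+)", line))
        else:
            # Word line: split on tabs (assumption: runs of tabs count once).
            rows.append(re.split(r"\t+", line.strip()))
    if rows:
        yield header, rows


sample = [
    "#\tdoc=PT2012-2013_03_20120813_DN_2718630\tsentence=2\n",
    "1\tOs\t_\t_\tA-14-a1\t_\n",
    "2\tfísicos\t_\t_\t*A-14-a1\t_\n",
    "3\tcreem\tweaksubj|both\t*DS-14_i=low_p=0\t_\t_\n",
    "\n",
]
sentences = list(read_sentences(sample))
```

To read a whole file, the same function can be given an open file handle, e.g. 
open("train.txt", encoding="utf-8"); the UTF-8 encoding is an assumption.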

Each word line contains the following columns, separated by tabs:

	1st - word position.

	2nd - word surface.

	3rd - lexicon annotation, in the same format as the Subjectivity Lexicon [5]:
		"intensity|polarity", where intensity can be "weaksubj" or "strongsubj"  
		and polarity can be "positive", "negative", "neutral" or "both".

	The 4th, 5th and 6th columns contain MPQA-like annotations of opinion, agent and target
	elements. If spans of different frames overlap, the annotations of the overlapping spans are
	separated by a pipe character ("|") (see the second example, below). All the words of a
	span have the same annotation, with the exception of the head word, whose annotation 
	starts with the special character "*".

	4th - Opinion annotation, DS-"ID"_i="intensity"_p="polarity" (e.g., DS-14_i=low_p=0):
		- All words of the span have the same annotation, with the exception of the head word, 
		  whose annotation starts with the special character "*".
		- The annotation starts with the two control characters "DS" (for direct-subjective).
		- Then, after the separator "-", comes the frame ID (ID=14 in the example); this ID 
		  links the opinion span to its agents and targets.
		- Finally, the intensity and polarity of the opinion follow, each introduced by the character "_". 
		  Intensity can be "low", "medium", "high" or "extreme"; in the example it is "low" (i=low).
		  Polarity can be "-1" (negative), "0" (neutral) or "1" (positive); in the example it is "0", i.e. neutral (p=0).

	5th - Agent annotation, A-"ID"-a"agent_number" (e.g., A-14-a1):
		- All words of the span have the same annotation, with the exception of the head word, 
		  whose annotation starts with the special character "*".
		- The annotation starts with the control character "A" (for agent).
		- Then, after the separator "-", comes the frame ID (ID=14 in the example); this ID 
		  links the agent span to the corresponding opinion and targets.
		- Finally, after another separator "-", comes an agent number ("a1" in the example),
		  which is useful for the few cases where an opinion has more than one agent span.

	6th - Target annotation, T-"ID"-t"target_number"_i="intensity"_p="polarity" (e.g., T-14-t1_i=low_p=0):
		- All words of the span have the same annotation, with the exception of the head word, 
		  whose annotation starts with the special character "*".
		- The annotation starts with the control character "T" (for target).
		- Then, after the separator "-", comes the frame ID (ID=14 in the example); this ID 
		  links the target span to the corresponding opinion and agents.
		- After another separator "-", comes a target number ("t1" in the example),
		  which is useful for the few cases where an opinion has more than one target span.
		- Finally, the intensity and polarity of the opinion towards this specific target follow, 
		  each introduced by the character "_". 
		  Intensity can be "low", "medium", "high" or "extreme"; in the example it is "low" (i=low).
		  Polarity can be "-1" (negative), "0" (neutral) or "1" (positive); in the example it is "0", i.e. neutral (p=0).
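
The column formats above can be decoded with a short Python sketch such as the 
one below. It is an illustration only: the names TAG_RE, parse_tags and 
parse_lexicon are hypothetical, and the regular expression is inferred from the 
patterns described in this section rather than taken from any official tool.

```python
import re
from typing import Dict, List, Optional, Tuple

# Optional "*" head marker, kind (DS, A or T), frame ID, optional agent/target
# number, and optional intensity/polarity suffix (absent on agent tags).
TAG_RE = re.compile(
    r"^(?P<head>\*)?(?P<kind>DS|A|T)-(?P<id>\d+)"
    r"(?:-(?P<num>[at]\d+))?"
    r"(?:_i=(?P<intensity>low|medium|high|extreme)_p=(?P<polarity>-?\d+))?$"
)


def parse_tags(cell: str) -> List[Dict[str, object]]:
    """Decode one cell of column 4, 5 or 6; "_" means no annotation."""
    if cell == "_":
        return []
    tags = []
    for part in cell.split("|"):  # overlapping frames are pipe-separated
        m = TAG_RE.match(part)
        if m is None:
            raise ValueError(f"unrecognised annotation: {part!r}")
        polarity = m.group("polarity")
        tags.append({
            "kind": m.group("kind"),
            "frame_id": int(m.group("id")),
            "number": m.group("num"),
            "is_head": m.group("head") is not None,
            "intensity": m.group("intensity"),
            "polarity": None if polarity is None else int(polarity),
        })
    return tags


def parse_lexicon(cell: str) -> Optional[Tuple[str, str]]:
    """Decode column 3 ("intensity|polarity"); "_" means not in the lexicon."""
    if cell == "_":
        return None
    strength, polarity = cell.split("|")
    return strength, polarity


opinion = parse_tags("*DS-14_i=low_p=0")
overlap = parse_tags("T-210-t1_i=low_p=0|*T-211-t1_i=low_p=0")
lexicon = parse_lexicon("weaksubj|both")
```

Here overlap holds two entries, one per frame, and only the entry for frame 211 
is marked as a head.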


========= 2.4. Examples =========

1) The FIRST EXAMPLE, below, has:

- Three subjective words: "creem" (weaksubj|both), "imediatamente" (strongsubj|neutral) and 
			  "confinados" (weaksubj|negative)

- One opinion element (DS-14_i=low_p=0): 
	span: "creem"
	head word: "creem"
	opinion ID: 14
	opinion intensity: "low"
	opinion polarity: "0" (neutral)

- One agent for opinion with ID 14 (A-14-a1): 
	span: "Os físicos"
	head word: "físicos"
	agent ID: 14

- One target for opinion with ID 14 (T-14-t1_i=low_p=0): 
	span: "os quarks e os gluões"
	head word: "quarks"
	target ID: 14
	intensity of the opinion towards the target: "low"
	polarity of the opinion towards the target: "0" (neutral)

-----
#	doc=PT2012-2013_03_20120813_DN_2718630	sentence=2
1	Os	_	_	A-14-a1	_
2	físicos	_	_	*A-14-a1	_
3	creem	weaksubj|both	*DS-14_i=low_p=0	_	_
4	que	_	_	_	_
5	,	_	_	_	_
6	em_	_	_	_	_
7	os	_	_	_	_
8	instantes	_	_	_	_
9	imediatamente	strongsubj|neutral	_	_	_
10	posteriores	_	_	_	_
11	a_	_	_	_	_
12	o	_	_	_	_
13	Big	_	_	_	_
14	Bang	_	_	_	_
15	,	_	_	_	_
16	os	_	_	_	T-14-t1_i=low_p=0
17	quarks	_	_	_	*T-14-t1_i=low_p=0
18	e	_	_	_	T-14-t1_i=low_p=0
19	os	_	_	_	T-14-t1_i=low_p=0
20	gluões	_	_	_	T-14-t1_i=low_p=0
21	'	_	_	_	_
22	estruturas	_	_	_	_
23	básicas	_	_	_	_
24	de_	_	_	_	_
25	a	_	_	_	_
26	matéria	_	_	_	_
27	'	_	_	_	_
28	não	_	_	_	_
29	estavam	_	_	_	_
30	confinados	weaksubj|negative	_	_	_
31	a	_	_	_	_
32	partículas	_	_	_	_
33	compostas	_	_	_	_
34	como	_	_	_	_
35	os	_	_	_	_
36	protões	_	_	_	_
37	e	_	_	_	_
38	os	_	_	_	_
39	neutrões	_	_	_	_
40	,	_	_	_	_
41	tal	_	_	_	_
42	como	_	_	_	_
43	ocorre	_	_	_	_
44	em_	_	_	_	_
45	a	_	_	_	_
46	atualidade	_	_	_	_
47	.	_	_	_	_
-----
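
As an illustration of how the spans and head words listed above can be recovered 
from the rows, here is a small Python sketch. The function name collect_spans is 
hypothetical, and only the annotated rows of the first example (words 1–3 and 
16–20) are included to keep it short.

```python
from typing import Dict, List, Sequence

# Annotated rows copied from the first example (unannotated rows omitted).
rows = [
    ("1", "Os", "_", "_", "A-14-a1", "_"),
    ("2", "físicos", "_", "_", "*A-14-a1", "_"),
    ("3", "creem", "weaksubj|both", "*DS-14_i=low_p=0", "_", "_"),
    ("16", "os", "_", "_", "_", "T-14-t1_i=low_p=0"),
    ("17", "quarks", "_", "_", "_", "*T-14-t1_i=low_p=0"),
    ("18", "e", "_", "_", "_", "T-14-t1_i=low_p=0"),
    ("19", "os", "_", "_", "_", "T-14-t1_i=low_p=0"),
    ("20", "gluões", "_", "_", "_", "T-14-t1_i=low_p=0"),
]


def collect_spans(rows: Sequence[Sequence[str]]) -> Dict[str, dict]:
    """Map each annotation (without the "*") to its words and head word."""
    spans: Dict[str, dict] = {}
    for row in rows:
        word = row[1]
        for cell in row[3:6]:  # opinion, agent and target columns
            if cell == "_":
                continue
            for tag in cell.split("|"):
                key = tag.lstrip("*")
                span = spans.setdefault(key, {"words": [], "head": None})
                span["words"].append(word)
                if tag.startswith("*"):
                    span["head"] = word
    return spans


spans = collect_spans(rows)
for key, span in spans.items():
    print(key, "|", " ".join(span["words"]), "| head:", span["head"])
```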

2) In the SECOND EXAMPLE, below, we have:

- Subjective words: "introdução" (strongsubj|positive), "considerou" (strongsubj|neutral),
		    "questão" (strongsubj|negative), "justificaria" (weaksubj|neutral),
		    "pedido" (weaksubj|both), "demissão" (weaksubj|negative),
		    "tivesse" (strongsubj|neutral), "deu" (weaksubj|positive),
		    "aterrar" (strongsubj|negative).

- Four opinion frames, with ID=210, ID=211, ID=212, and ID=2170.

- Opinion frame with ID=210: 
	- Opinion span (DS-210_i=low_p=0): "considerou", where "considerou" is the head word,
	  with "low" intensity and "0" (neutral) polarity.
	- Agent span (A-210-a1): "o deputado do PCP Bernardino Soares", where "Soares" is the head word.
	- Target span (T-210-t1_i=low_p=0): "a questão justificaria um pedido de demissão", 
	  where "questão" is the head word. The opinion towards this target has a "low" 
	  intensity and "0" (neutral) polarity.

- Opinion frame with ID=211: 
	- Opinion span (DS-211_i=low_p=0): "decisão", where "decisão" is the head word,
	  with "low" intensity and "0" (neutral) polarity.
	- Agent span (A-211-a1): "o senhor ministro", where "ministro" is the head word.
	- Target span (T-211-t1_i=low_p=0): "um pedido de demissão", where "pedido" is the head word. 
	  The opinion towards this target has a "low" intensity and "0" (neutral) polarity.

- Opinion frame with ID=212: 
	- Opinion span (DS-212_i=low_p=0): "considerou", where "considerou" is the head word,
	  with "low" intensity and "0" (neutral) polarity.
	- Agent span (A-212-a1): "o deputado do PCP Bernardino Soares", where "Soares" is the head word.
	- Target span (T-212-t1_i=low_p=0): "uma questão", where "questão" is the head word. 
	  The opinion towards this target has a "low" intensity and "0" (neutral) polarity.

- Opinion frame with ID=2170: 
	- Opinion span (DS-2170_i=medium_p=0): "deu a ordem", where "ordem" is the head word,
	  with "medium" intensity and "0" (neutral) polarity.
	- Agent span (A-2170-a1): "Quem", where "Quem" is the head word.
	- Target span (T-2170-t1_i=medium_p=0): "o avião não aterrar", where "avião" is the head word. 
	  The opinion towards this target has a "medium" intensity and "0" (neutral) polarity.

-----
#	doc=PT2012-2013_21_20130709_JN_3313908	sentence=4
1	Em_		_			_		_		_
2	a		_			_		_		_
3	introdução	strongsubj|positive	_		_		_
4	prévia		_			_		_		_
5	,		_			_		_		_
6	o		_			_	A-210-a1|A-212-a1	_
7	deputado	_			_	A-210-a1|A-212-a1	_
8	de_		_			_	A-210-a1|A-212-a1	_
9	o		_			_	A-210-a1|A-212-a1	_
10	PCP		_			_	A-210-a1|A-212-a1	_
11	Bernardino	_			_	A-210-a1|A-212-a1	_
12	Soares		_			_	*A-210-a1|*A-212-a1	_
13	,		_			_		_		_
14	considerou	strongsubj|neutral	*DS-210_i=low_p=0	_	_
15	que		_			_		_		_
16	a		_			_		_		T-210-t1_i=low_p=0
17	questão		strongsubj|negative	_		_		*T-210-t1_i=low_p=0
18	'		_			_		_		T-210-t1_i=low_p=0
19	justificaria	weaksubj|neutral	_		_		T-210-t1_i=low_p=0
20	um		_			_		_		T-210-t1_i=low_p=0|T-211-t1_i=low_p=0
21	pedido		weaksubj|both		_		_		T-210-t1_i=low_p=0|*T-211-t1_i=low_p=0
22	de		_			_		_		T-210-t1_i=low_p=0|T-211-t1_i=low_p=0
23	demissão	weaksubj|negative	_		_		T-210-t1_i=low_p=0|T-211-t1_i=low_p=0
24	,		_			_		_		_
25	não		_			_		_		_
26	tivesse		strongsubj|neutral	_		_		_
27	o		_			_		A-211-a1	_
28	senhor		_			_		A-211-a1	_
29	ministro	_			_		*A-211-a1	_
30	já		_			_		_		_
31	tomado		_			_		_		_
32	esse		_			_		_		_
33	decisão		_			*DS-211_i=low_p=0	_	_
34	'		_			_		_		_
35	,		_			_		_		_
36	e		_			_		_		_
37	colocou		_			_		_		_
38	uma		_			_		_		T-212-t1_i=low_p=0
39	questão		strongsubj|negative	_		_		*T-212-t1_i=low_p=0
40	que		_			_		_		_
41	considerou	strongsubj|neutral	*DS-212_i=low_p=0	_	_
42	decisiva	_			_		_		_
43	:		_			_		_		_
44	'		_			_		_		_
45	Quem		_			_		*A-2170-a1	_
46	deu		weaksubj|positive	DS-2170_i=medium_p=0	_	_
47	a		_			DS-2170_i=medium_p=0	_	_
48	ordem		_			*DS-2170_i=medium_p=0	_	_
49	para		_			_		_		_
50	o		_			_		_		T-2170-t1_i=medium_p=0
51	avião		_			_		_		*T-2170-t1_i=medium_p=0
52	não		_			_		_		T-2170-t1_i=medium_p=0
53	aterrar		strongsubj|negative	_		_		T-2170-t1_i=medium_p=0
54	?		_			_		_		_
55	'		_			_		_		_
56	.		_			_		_		_
-----
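
Because the frame ID is shared by an opinion and its agents and targets, the head 
words of each frame can be grouped by that ID even when spans overlap. The Python 
sketch below illustrates this on a few head-word rows copied from the second 
example; the names ID_RE and heads_by_frame are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Sequence

# Kind (DS, A or T) and frame ID at the start of an annotation.
ID_RE = re.compile(r"^\*?(DS|A|T)-(\d+)")

# Rows copied from the second example (head words of frames 210, 211, 212).
rows = [
    ("12", "Soares", "_", "_", "*A-210-a1|*A-212-a1", "_"),
    ("14", "considerou", "strongsubj|neutral", "*DS-210_i=low_p=0", "_", "_"),
    ("17", "questão", "strongsubj|negative", "_", "_", "*T-210-t1_i=low_p=0"),
    ("21", "pedido", "weaksubj|both", "_", "_",
     "T-210-t1_i=low_p=0|*T-211-t1_i=low_p=0"),
    ("29", "ministro", "_", "_", "*A-211-a1", "_"),
    ("33", "decisão", "_", "*DS-211_i=low_p=0", "_", "_"),
    ("41", "considerou", "strongsubj|neutral", "*DS-212_i=low_p=0", "_", "_"),
]


def heads_by_frame(rows: Sequence[Sequence[str]]) -> Dict[int, Dict[str, List[str]]]:
    """Group head words by frame ID and element kind (DS, A, T)."""
    frames: Dict[int, Dict[str, List[str]]] = defaultdict(dict)
    for row in rows:
        for cell in row[3:6]:
            if cell == "_":
                continue
            for tag in cell.split("|"):  # one entry per overlapping frame
                m = ID_RE.match(tag)
                if m and tag.startswith("*"):  # keep head words only
                    kind, frame_id = m.group(1), int(m.group(2))
                    frames[frame_id].setdefault(kind, []).append(row[1])
    return dict(frames)


frames = heads_by_frame(rows)
```

In this sample, "pedido" is counted only for frame 211, since it is an ordinary 
(non-head) word of the target span of frame 210.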

================================================================================
3. References
================================================================================

If you use this corpus, please cite:

[1] Mariana S. C. Almeida, Claudia Pinto, Helena Figueira, Pedro Mendes
    and André F. T. Martins. 2015. "Aligning Opinions: Cross-Lingual
    Opinion Mining with Dependencies", In Annual Meeting of the 
    Association for Computational Linguistics (ACL).

Related references:

[2] Miguel B. Almeida, Mariana S. C. Almeida, André F. T. Martins, Helena Figueira,
    Pedro Mendes, and Claudia Pinto. 2014. Priberam compressive summarization 
    corpus: A new multi-document summarization corpus for European Portuguese. 
    In Int. Conference on Language Resources and Evaluation (LREC).

[3] Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions
    of opinions and emotions in language. Language resources and evaluation.

[4] Theresa Wilson. 2008. Fine-Grained Subjectivity Analysis. 
    Ph.D. thesis, University of Pittsburgh.

[5] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing Contextual 
    Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.

================================================================================
4. Acknowledgements
================================================================================

This work was partially supported by the EU/FEDER programme, QREN/POR Lisboa 
(Portugal), under the Intelligo project (contract 2012/24803) and by FCT grants 
UID/EEA/50008/2013 and PTDC/EEISII/2312/2012.


