*
**
***
**** Personal Events in Dialogue Corpus Version 1.0
***
**
*
* Joshua Daniel Eisenberg, PhD
* Lead Scientist, Natural Language Understanding
* Artie, Inc.
* joshua.eisenberg@artie.com
* www.artie.com
* www.research-josh.com



If you use this corpus, please cite our work as follows:

Joshua D Eisenberg and Michael Sheriff. 2020. Automatic extraction of personal events from dialogue. In Proceedings of the 1st Workshop on Narrative Understanding, Storylines, and Events (NUSE 2020), Seattle, Washington. Association for Computational Linguistics.


*
*ZIP Contents 
*
*annotation_guide_personal_event_extraction_v_04_JDE.pdf
******A PDF of the most recent version of the event extraction annotation guide. 
******This explains the specifics of what is and isn't an event. 
*
*COPYRIGHT_NOTICE_transcript_citations.pdf
******A PDF with citations and copyright information for the 14 episodes of This American Life in the corpus. 
******This American Life is the copyright holder of the episode transcripts. 
*
*corpus
***gold_standard
******event_annotations
************The event annotations for each episode are in this folder.
************Each file contains the annotations for a single episode.
************The annotations are saved in two formats:
			1. .ser -- Serialized Java ArrayList<ArrayList<Boolean>>
				For those who use Java, the annotations for each episode are serialized as a 
				Java ArrayList<ArrayList<Boolean>> in a .ser file. 
				The following code can deserialize an episode's annotations:
				~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~		
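					import java.io.FileInputStream;
					import java.io.ObjectInputStream;
					import java.util.ArrayList;

					// try-with-resources closes both streams, even if readObject() throws.
					ArrayList<ArrayList<Boolean>> goldStandard;
					try (FileInputStream fis = new FileInputStream(pathToAnnotations);
					     ObjectInputStream ois = new ObjectInputStream(fis)) {
						goldStandard = (ArrayList<ArrayList<Boolean>>) ois.readObject();
					}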
				~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~		
				Each ArrayList<Boolean> represents the annotations for a single utterance.
				Each Boolean represents the event annotation for a token. 
				If it is true, the token is an event.
				If it is false, the token is a nonevent.
			2. .txt -- text files
				We have encoded the annotations for each episode into an easily parseable text file.
				Each utterance's annotations are on a single line of text. 
				Individual annotations are separated by a tab. 
				The annotation values in the .txt file are 'true' and 'false'.
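************The .txt format described above can also be read directly, without Java deserialization.
************The following is a minimal sketch, not part of the corpus distribution: the class name PedcAnnotationReader
************and its method names are hypothetical, and UTF-8 encoding and Java 11 or later are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of the corpus distribution.
final class PedcAnnotationReader {

    // Parses one line of tab-separated 'true'/'false' values (one utterance).
    static List<Boolean> parseLine(String line) {
        List<Boolean> annotations = new ArrayList<>();
        for (String value : line.split("\t")) {
            String trimmed = value.trim();
            if (trimmed.isEmpty()) {
                continue; // assumption: ignore empty fields, e.g. on a blank line
            }
            if (!trimmed.equals("true") && !trimmed.equals("false")) {
                throw new IllegalArgumentException("Unexpected annotation value: " + trimmed);
            }
            annotations.add(Boolean.parseBoolean(trimmed));
        }
        return annotations;
    }

    // Reads a whole episode file: one utterance per line.
    // Assumption: the file is UTF-8 encoded.
    static List<List<Boolean>> readEpisode(Path annotationFile) throws IOException {
        List<List<Boolean>> episode = new ArrayList<>();
        for (String line : Files.readAllLines(annotationFile, StandardCharsets.UTF_8)) {
            episode.add(parseLine(line));
        }
        return episode;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to one episode's .txt annotation file.
        List<List<Boolean>> goldStandard = readEpisode(Path.of(args[0]));
        System.out.println("Utterances read: " + goldStandard.size());
    }
}
```

************The sketch rejects any value other than 'true' or 'false', so a malformed file fails loudly instead of being read as all nonevents.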
******tokens
************The tokenized utterances for each episode are stored in this folder.
************Each file contains the tokens for a single episode.
************The tokens are saved in two formats:
			1. .ser -- Serialized Java ArrayList<ArrayList<String>>
				For those who use Java, the tokens for each episode are serialized as a 
				Java ArrayList<ArrayList<String>> in a .ser file. 
				The following code can deserialize an episode's tokenized utterances:
				~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~		
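					import java.io.FileInputStream;
					import java.io.ObjectInputStream;
					import java.util.ArrayList;

					// try-with-resources closes both streams, even if readObject() throws.
					ArrayList<ArrayList<String>> tokenizedUtterances;
					try (FileInputStream fis = new FileInputStream(pathToTokens);
					     ObjectInputStream ois = new ObjectInputStream(fis)) {
						tokenizedUtterances = (ArrayList<ArrayList<String>>) ois.readObject();
					}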
				~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~		
				Each ArrayList<String> represents the tokens for a single utterance.
				Each String is a token. 
			2. .txt -- text files
				We have encoded the tokens for each episode into an easily parseable text file.
				Each utterance's tokens are on a single line of text. 
				Individual tokens are separated by a tab. 
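************Because the 'tokens' and 'event_annotations' files describe the same utterances, the two can be paired to recover the event tokens.
************The following is a minimal sketch, assuming that the i-th annotation of an utterance belongs to its i-th token.
************The class name PedcAligner and its methods are hypothetical, and the example utterance is made up for illustration; it is not taken from the corpus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; not part of the corpus distribution.
final class PedcAligner {

    // Splits one line of a tokens .txt file into its tab-separated tokens.
    static List<String> parseTokenLine(String line) {
        if (line.isEmpty()) {
            return new ArrayList<>(); // assumption: an empty line is an utterance with no tokens
        }
        return new ArrayList<>(Arrays.asList(line.split("\t")));
    }

    // Returns the tokens of one utterance that are annotated as events.
    // Assumption: annotations.get(i) is the annotation for tokens.get(i).
    static List<String> eventTokens(List<String> tokens, List<Boolean> annotations) {
        if (tokens.size() != annotations.size()) {
            throw new IllegalArgumentException(
                "Token count " + tokens.size() + " != annotation count " + annotations.size());
        }
        List<String> events = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (annotations.get(i)) {
                events.add(tokens.get(i));
            }
        }
        return events;
    }

    public static void main(String[] args) {
        // Made-up utterance for illustration only; it is not taken from the corpus.
        List<String> tokens = parseTokenLine("I\twent\tto\tthe\tstore");
        List<Boolean> annotations = Arrays.asList(false, true, false, false, false);
        System.out.println(eventTokens(tokens, annotations)); // prints [went]
    }
}
```

************The length check is deliberate: a mismatch between token and annotation counts usually means the two files come from different episodes.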
***raw_transcripts
********This folder contains the transcripts for each episode.
********The transcripts are saved as text files.
********Each utterance is preceded by the name of the speaker, followed by a colon.
********The marker XXX separates dialogue from different scenes.
********The utterances correspond to the utterances in the 'tokens' and 'event_annotations' folders.
********We included the raw transcripts so that you have access to the dialogue with all punctuation
********and to information about who speaks each utterance.
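********The speaker prefix and the XXX scene marker make the transcripts straightforward to split into scenes and speakers.
********The following is a minimal sketch under stated assumptions: each utterance sits on its own line, XXX appears alone on a line,
********and the first colon on a line ends the speaker name. Check these against the actual files before relying on them.
********The class name PedcTranscriptReader is hypothetical, and the example lines are made up; they are not taken from the corpus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; not part of the corpus distribution.
final class PedcTranscriptReader {

    // One utterance with its scene index (0-based) and speaker name.
    static final class Utterance {
        final int scene;
        final String speaker;
        final String text;

        Utterance(int scene, String speaker, String text) {
            this.scene = scene;
            this.speaker = speaker;
            this.text = text;
        }
    }

    // Parses transcript lines into utterances.
    // Assumptions: one utterance per line, XXX alone on a line marks a new scene,
    // and the first colon separates the speaker name from the utterance.
    static List<Utterance> parse(List<String> lines) {
        List<Utterance> utterances = new ArrayList<>();
        int scene = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            if (trimmed.equals("XXX")) {
                scene++; // scene separator: start counting a new scene
                continue;
            }
            int colon = trimmed.indexOf(':');
            if (colon < 0) {
                // Assumption: keep lines without a colon, with an empty speaker name.
                utterances.add(new Utterance(scene, "", trimmed));
            } else {
                utterances.add(new Utterance(scene,
                    trimmed.substring(0, colon).trim(),
                    trimmed.substring(colon + 1).trim()));
            }
        }
        return utterances;
    }

    public static void main(String[] args) {
        // Made-up lines for illustration only; they are not taken from the corpus.
        List<String> lines = Arrays.asList("Host: Hello there.", "XXX", "Guest: Thanks for having me.");
        for (Utterance u : parse(lines)) {
            System.out.println(u.scene + " | " + u.speaker + " | " + u.text);
        }
    }
}
```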
*
*		
*license_for_PEDC_CC-BY-4.0.txt
******This is the license for the corpus. 
******Important: if you share the corpus, you must include a copy of COPYRIGHT_NOTICE_transcript_citations.pdf.
*
*
*NUSE_Eisenberg_Personal_Event_Extraction_Paper.pdf
******This is a copy of the paper written about this corpus. 
******This paper also details a set of experiments on the automatic extraction of events from dialogue. 
*
*PEDC_stats.png
******This is a table which gives general statistics on the corpus.
******It can be used to verify the number of utterances, events, and nonevents in each episode.
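******The counts in the table can be recomputed from an episode's annotations once they are loaded.
******The following is a minimal sketch, assuming the table counts events and nonevents per token.
******The class name PedcStats is hypothetical, and the example data is made up; it is not taken from the corpus.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; not part of the corpus distribution.
final class PedcStats {

    // Returns {utterances, events, nonevents} for one episode.
    // Assumption: every 'true' token is one event and every 'false' token is one nonevent.
    static int[] count(List<List<Boolean>> episode) {
        int events = 0;
        int nonevents = 0;
        for (List<Boolean> utterance : episode) {
            for (boolean isEvent : utterance) {
                if (isEvent) {
                    events++;
                } else {
                    nonevents++;
                }
            }
        }
        return new int[] {episode.size(), events, nonevents};
    }

    public static void main(String[] args) {
        // Made-up annotations for illustration only: two utterances, three tokens in total.
        List<List<Boolean>> episode = Arrays.asList(
            Arrays.asList(true, false),
            Arrays.asList(false));
        int[] counts = count(episode);
        System.out.println("utterances=" + counts[0]
            + " events=" + counts[1]
            + " nonevents=" + counts[2]); // prints utterances=2 events=1 nonevents=2
    }
}
```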
*
*
*
*README.txt
******You are currently reading the README. 
*******I think you know where you are :)
********You're at the back of the worm.
https://youtu.be/xCQb4dVIM14?t=542