@inproceedings{neerbek-etal-2020-real,
    title = "A Real-World Data Resource of Complex Sensitive Sentences Based on Documents from the Monsanto Trial",
    author = "Neerbek, Jan  and
      Eskildsen, Morten  and
      Dolog, Peter  and
      Assent, Ira",
    editor = "Calzolari, Nicoletta  and
      B{\'e}chet, Fr{\'e}d{\'e}ric  and
      Blache, Philippe  and
      Choukri, Khalid  and
      Cieri, Christopher  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Isahara, Hitoshi  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, H{\'e}l{\`e}ne  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://preview.aclanthology.org/ingest-emnlp/2020.lrec-1.158/",
    pages = "1258--1267",
    language = "eng",
    ISBN = "979-10-95546-34-4",
    abstract = "In this work we present a corpus for the evaluation of sensitive information detection approaches that addresses the need for real-world sensitive information for empirical studies. Our sentence corpus contains different notions of complex sensitive information that correspond to different aspects of concern in a current trial of the Monsanto company. This paper describes the annotation process, where we both employ human annotators and create automatically inferred labels regarding technical, legal and informal communication within and with employees of Monsanto, drawing on a classification of documents by lawyers involved in the Monsanto court case. We release a corpus of high-quality sentences and parse trees with these two types of labels at the sentence level. We characterize the sensitive information via several representative sensitive information detection models, in particular both keyword-based (n-gram) approaches and recent deep learning models, namely recurrent neural networks (LSTM) and recursive neural networks (RecNN). Data and code are made publicly available."
}