*Deceptive Opinion Spam Corpus v3*
*FOR ACADEMIC PURPOSES ONLY*

Questions:

Please direct all questions to Myle Ott <myleott@cs.cornell.edu>

Usage:

This data may be used for academic research purposes only. If you use this
data in your work, please cite [1].

Overview:

This corpus consists of truthful and deceptive hotel reviews from 20 hotels in
the Chicago area, described in [1]. Specifically, this corpus contains:

    a) 400 truthful reviews from TripAdvisor.com (*)
    b) 400 deceptive reviews from Amazon Mechanical Turk

(*) TripAdvisor reviews are provided only as extracted features (unigrams and
    bigrams), not as the original review text, to protect the privacy of the
    original posters. The features additionally encode POS (part-of-speech)
    information as given by the Stanford Tagger [2]. Mechanical Turk reviews
    are given in both feature-only and raw formats.

Naming Convention:

Each directory prefixed with "fold" corresponds to a single fold from the
cross-validation experiments presented in [1]. Files are named according to the
format "%c_%h_%i(.%n).txt", where:

    - %c denotes the class: (t)ruthful or (d)eceptive

    - %h denotes the hotel:

        affinia -> Affinia Chicago
        allegro -> Hotel Allegro Chicago
        amalfi -> Amalfi Hotel Chicago
        ambassador -> Ambassador East Hotel
        conrad -> Conrad Chicago
        fairmont -> Fairmont Chicago Millennium Park
        hardrock -> Hard Rock Hotel Chicago
        hilton -> Hilton Chicago
        homewood -> Homewood Suites by Hilton Chicago Downtown
        hyatt -> Hyatt Regency Chicago
        intercontinental -> InterContinental Chicago
        james -> James Chicago
        knickerbocker -> Millennium Knickerbocker Hotel Chicago
        monaco -> Hotel Monaco Chicago - a Kimpton Hotel
        omni -> Omni Chicago Hotel
        palmer -> The Palmer House Hilton
        sheraton -> Sheraton Chicago Hotel and Towers
        sofitel -> Sofitel Chicago Water Tower
        swissotel -> Swissotel Chicago
        talbott -> The Talbott Hotel

    - %i serves as a counter to make the filename unique

    - %n (optionally) indicates whether the file contains unigrams or bigrams
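
Example:

The naming convention above can be split into its parts with a short Python
sketch. This is an illustration only, not part of the original corpus tools.
It assumes that %i is a run of digits and that %n is a single dot-free token;
the exact strings used for %n are not listed here, so the value is returned
unchanged. The names FILENAME_RE, ReviewFile and parse_review_filename are
hypothetical, as are the filenames in the usage note.

```python
import re
from typing import NamedTuple, Optional

# Class codes and hotel keys as listed in the naming convention.
CLASSES = {"t": "truthful", "d": "deceptive"}
HOTELS = {
    "affinia", "allegro", "amalfi", "ambassador", "conrad", "fairmont",
    "hardrock", "hilton", "homewood", "hyatt", "intercontinental", "james",
    "knickerbocker", "monaco", "omni", "palmer", "sheraton", "sofitel",
    "swissotel", "talbott",
}

# Assumption: %i is all digits and %n contains no dots.
FILENAME_RE = re.compile(
    r"(?P<label>[td])_(?P<hotel>[a-z]+)_(?P<counter>\d+)"
    r"(?:\.(?P<ngram>[^.]+))?\.txt"
)


class ReviewFile(NamedTuple):
    label: str            # "truthful" or "deceptive"
    hotel: str            # hotel key, e.g. "hilton"
    counter: int          # %i, the uniqueness counter
    ngram: Optional[str]  # %n if present, otherwise None


def parse_review_filename(name: str) -> ReviewFile:
    """Split a "%c_%h_%i(.%n).txt" filename into its fields."""
    match = FILENAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a corpus filename: {name!r}")
    if match["hotel"] not in HOTELS:
        raise ValueError(f"unknown hotel key: {match['hotel']!r}")
    return ReviewFile(
        label=CLASSES[match["label"]],
        hotel=match["hotel"],
        counter=int(match["counter"]),
        ngram=match["ngram"],
    )
```

For instance, parse_review_filename("t_hilton_1.txt") yields the label
"truthful", the hotel key "hilton", the counter 1 and no %n value. Names that
do not fit the pattern, or that use a hotel key outside the list above, raise
ValueError.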

References:

[1] M. Ott, Y. Choi, C. Cardie, and J.T. Hancock. 2011. Finding Deceptive
Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th
Annual Meeting of the Association for Computational Linguistics: Human Language
Technologies.

[2] D. Klein and C.D. Manning. 2003. Accurate unlexicalized parsing. In
Proceedings of the 41st Annual Meeting of the Association for Computational
Linguistics.
