The file ‘annotations_UQ.csv’ contains 10,909 Reddit comments, each annotated for whether it contains an unpalatable question.

# To read as a DataFrame:
>> import pandas as pd
>> df = pd.read_csv('./annotations_UQ.csv', lineterminator='\n')


Column description:
- ‘reply_id’ = unique ID for each row
- ‘reply_text’ = text for the main comment (or reply)
- ‘comment_text’ = text for the preceding comment in the thread
- ‘label’ = majority label selected by MTurk coders. It can take two values: “yes_unpalatable” or “not_unpalatable”
- ‘confidence’ = annotator agreement. Since we collected five annotations per comment, it can take three values: 0.6, 0.8, or 1.0
- ‘unmodified_confidence’ = raw agreement value before bucketing. A very small number of comments received more than five annotations because they were dynamically used as test questions across batches, so their raw values are not exactly 0.6, 0.8, or 1.0. We created the ‘confidence’ column from ‘unmodified_confidence’ using the following brackets: [0.5, 0.7) -> 0.6 ; [0.7, 0.9) -> 0.8 ; [0.9, 1.0] -> 1.0
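
The bracket mapping from ‘unmodified_confidence’ to ‘confidence’ can be sketched in Python as follows. This is a minimal illustration, not the script used to build the file: the function name bucket_confidence and the sample raw values are hypothetical, and the sketch assumes every raw value lies in [0.5, 1.0].

```python
def bucket_confidence(value):
    """Map a raw agreement value to 0.6, 0.8, or 1.0 using the stated brackets."""
    if 0.5 <= value < 0.7:    # [0.5, 0.7) -> 0.6
        return 0.6
    if 0.7 <= value < 0.9:    # [0.7, 0.9) -> 0.8
        return 0.8
    if 0.9 <= value <= 1.0:   # [0.9, 1.0] -> 1.0
        return 1.0
    # Assumption: values outside [0.5, 1.0] do not occur, so treat them as errors.
    raise ValueError(f"confidence outside [0.5, 1.0]: {value}")


# Hypothetical raw values, chosen only to exercise each bracket.
examples = [0.67, 0.71, 0.86, 1.0]
bucketed = [bucket_confidence(v) for v in examples]
print(bucketed)
```

With the DataFrame loaded as above, the same function could be applied to the whole column to check the result against the stored ‘confidence’ values:
>> recomputed = df['unmodified_confidence'].map(bucket_confidence)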