The full AOC dataset is too large to include here (300+ MB).  It is hosted online by the
first author at http://cs.jhu.edu/~ozaidan/AOC/.

Instead, we're including small samples of the harvested comment and article sentences.

This archive also includes the 108K sentences that have been classified as dialectal or MSA
(summarized in Table 2).  Each file holds one sentence per line, so its line count equals the
corresponding sentence count in the table.
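
As a quick sanity check, the line counts can be compared against the table with a few lines
of Python.  This is only a sketch: the file paths and expected counts are supplied by the
caller (none are hard-coded here), and UTF-8 encoding is an assumption about the files.

```python
def count_lines(path, encoding="utf-8"):
    """Return the number of lines (one sentence per line) in a text file."""
    # Assumption: the files are UTF-8 encoded.
    with open(path, encoding=encoding) as handle:
        return sum(1 for _ in handle)


def check_counts(expected, encoding="utf-8"):
    """Compare actual line counts against expected sentence counts.

    `expected` maps a file path to the sentence count read off the table.
    Returns a dict of path -> (actual, expected) for every file that differs.
    """
    mismatches = {}
    for path, want in expected.items():
        got = count_lines(path, encoding=encoding)
        if got != want:
            mismatches[path] = (got, want)
    return mismatches
```

An empty result from check_counts means every listed file matches its expected count.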

The sentences in these files have undergone some simple cleaning of the raw Arabic text,
hence the .norm extension.  This normalization removes long sequences of a single repeated
letter and splits on some special characters, such as punctuation marks.  The normalized
versions are the ones used in the experiments of Section 4.
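
For readers who want to apply a similar cleaning step to their own text, the two operations
described above might look roughly like the sketch below.  It is not the script that produced
the .norm files.  The run-length threshold (MIN_RUN), the choice to collapse a run to a single
letter, and the punctuation set (PUNCT) are all assumptions made for illustration.

```python
import re

# Assumption: a "long" run is three or more copies of the same letter.
# The threshold actually used for the .norm files is not stated here.
MIN_RUN = 3

# Assumption: an illustrative punctuation set, including the Arabic comma,
# semicolon, and question mark.
PUNCT = ".,!?;:\u060C\u061B\u061F"


def normalize(sentence, min_run=MIN_RUN):
    """Collapse long repeated-letter runs and split off punctuation marks."""
    # A letter followed by at least (min_run - 1) more copies of itself.
    run_pattern = r"([^\W\d_])\1{%d,}" % (min_run - 1)
    # Assumption: a long run is reduced to a single copy of the letter.
    collapsed = re.sub(run_pattern, r"\1", sentence)
    # Surround each punctuation mark with spaces so it becomes its own token.
    spaced = re.sub("([" + re.escape(PUNCT) + "])", r" \1 ", collapsed)
    # Squeeze the whitespace introduced by the previous step.
    return " ".join(spaced.split())
```

For example, normalize("wooooow!!") gives "wow ! !", while a doubled letter such as the "ll"
in "hello" is left alone because it is shorter than the assumed threshold.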
