These files contain our processed Bernstein-Ratner-Brent (BRB) corpus,
used in our submission "Bootstrapping a Unified Model of Lexical and
Phonetic Acquisition". To obtain a development corpus, run "head -n
8000" on each file. To obtain a test corpus, run "tail -n 1790" on
each file (we run test experiments on the full set, but evaluate only
on the last 1790 lines).
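
For readers who are not working in a shell, the same split can be
sketched in Python. This is only an illustration: split_corpus and the
stand-in list below are hypothetical names, not part of the
distribution, and the sizes are shrunk so the example is easy to read.

```python
def split_corpus(lines, dev_size=8000, test_size=1790):
    """Return (dev, test): the first dev_size and last test_size lines.

    Mirrors "head -n dev_size" and "tail -n test_size" on one file.
    """
    dev = lines[:dev_size]
    # Guard against test_size == 0, since lines[-0:] would be the whole list.
    test = lines[-test_size:] if test_size > 0 else []
    return dev, test

# Hypothetical stand-in for the lines of one corpus file.
toy_lines = ["utt%d" % i for i in range(10)]
toy_dev, toy_test = split_corpus(toy_lines, dev_size=6, test_size=3)
```

With the real data, the list would come from reading a file, e.g.
open("brentStest.surface").read().splitlines(), and the default sizes
would apply.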

There are two files:

  brentStest.surface    : the "surface" tokens
  brentStest.underlying : the "intended form" tokens

These files were prepared from the BRB corpus as distributed by Sharon
Goldwater at
http://homepages.inf.ed.ac.uk/sgwater/software/br_data.tar,
specifically the br-phono.txt file, by sampling a new pronunciation
for each token from the empirical distribution of pronunciations in
Buckeye (Pitt et al. 2007); see section 5.1 of the paper for details.
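
The sampling step can be pictured with the sketch below. It is not the
code used to build the corpus: it assumes the empirical distribution is
kept per word as a table of pronunciation counts, and the counts shown
are made up for illustration, not taken from Buckeye. The function name
sample_pronunciation is likewise hypothetical.

```python
import random

def sample_pronunciation(counts, rng):
    """Draw one pronunciation with probability proportional to its count."""
    forms = list(counts)
    weights = [counts[form] for form in forms]
    return rng.choices(forms, weights=weights, k=1)[0]

# Made-up counts for a single word, for illustration only (not Buckeye data).
toy_counts = {"dh ah": 6, "ah": 3, "dh iy": 1}

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
sampled = [sample_pronunciation(toy_counts, rng) for _ in range(5)]
```

Each token in the corpus would get its own independent draw, so the
same word can surface with different pronunciations in different
utterances.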

Because the Brent and Buckeye phonetic alphabets do not quite
correspond, this data is in a subset of the Buckeye alphabet
containing all the Buckeye symbols that correspond to a Brent
symbol. The main difference is that Buckeye has more nasal vowels,
which appear as a vowel followed by a nasal in our data.

The surface file contains lines like this:

uw || hh w ah n || n ah || s iy || ah || b uh k ||
l uh k || dh eh r z || ah || b oy || w ah th || s ih z || hh ae t ||

Phonetic symbols are space-delimited and word boundaries are indicated
by "||".
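
A minimal Python sketch of reading one surface line, based only on the
format described above (parse_surface is a hypothetical helper, not
part of the distribution):

```python
def parse_surface(line):
    """Split a surface line into words, each a list of phonetic symbols."""
    words = []
    for chunk in line.split("||"):
        symbols = chunk.split()
        if symbols:  # skip the empty chunk after the final "||"
            words.append(symbols)
    return words

example = "uw || hh w ah n || n ah || s iy || ah || b uh k ||"
surface_words = parse_surface(example)
```

Here surface_words holds six words, the first being ["uw"] and the
second ["hh", "w", "ah", "n"].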

The underlying file contains lines like this:

y uw |0-1| w aa n |1-5| t ah |5-7| s iy |7-9| dh ah |9-10| b uh k |10-13|
l uh k |0-3| dh eh r z |3-7| ah |7-8| b oy |8-10| w ih th |10-13| hh ih z |13-16| hh ae t |16-19|

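As above, phonetic symbols are space-delimited. Word boundaries are
indicated by "|[start]-[end]|" where [start] and [end] are character
indices in the surface string, indicating the corresponding surface
segmentation. In the examples, each "character" is one phonetic
symbol: indices count symbols from 0, skipping spaces and "||", and
[end] is exclusive. For instance, the first word, "y uw", corresponds
to characters 0-1 of the surface string: "uw". The second, "w aa n",
corresponds to characters 1-5: "hh w ah n". (This means the two
representations are redundant as to the surface word boundaries.)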
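
The alignment can be checked with a short Python sketch. It assumes,
as the examples suggest, that the indices count phonetic symbols from
0 with an exclusive end; parse_underlying and align are hypothetical
helpers, not part of the distribution.

```python
import re

# One intended word: its symbols, followed by |start-end|.
WORD_RE = re.compile(r"([^|]+)\|(\d+)-(\d+)\|")

def parse_underlying(line):
    """Return a list of (symbols, start, end) triples for one line."""
    return [(m.group(1).split(), int(m.group(2)), int(m.group(3)))
            for m in WORD_RE.finditer(line)]

def align(underlying_line, surface_line):
    """Pair each intended word with the surface symbols it spans."""
    # Flatten the surface line to its symbols, dropping the "||" markers.
    flat = [s for s in surface_line.split() if s != "||"]
    return [(" ".join(symbols), " ".join(flat[start:end]))
            for symbols, start, end in parse_underlying(underlying_line)]

underlying = ("y uw |0-1| w aa n |1-5| t ah |5-7| s iy |7-9| "
              "dh ah |9-10| b uh k |10-13|")
surface = "uw || hh w ah n || n ah || s iy || ah || b uh k ||"
pairs = align(underlying, surface)
```

For the first example line, pairs begins with ("y uw", "uw") and
("w aa n", "hh w ah n"), matching the surface word boundaries.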

The "intended forms" in this file are the most frequent Buckeye
pronunciation of the corresponding orthographic word. (It doesn't
matter what the actual forms are, though, because our evaluation
treats them as arbitrary and uses a mapping process; see section 5.2.)
