
This directory contains data associated with the paper:

Young-Bum Kim and Benjamin Snyder. Universal Grapheme-to-Phoneme Prediction 
Over Latin Alphabets. EMNLP 2012.

We provide one file for each target grapheme. Within each file, each line
corresponds to a language that employs the target grapheme. The lines are
formatted as follows:

LANGUAGE PH_1,PH_2,...,PH_n FEAT_1 FEAT_2 ...

The first two fields indicate the language and the phone values that the
grapheme may take according to this language. The remaining fields indicate
feature values.

LANGUAGE  := ISO 639-3 code indicating language identity
PH_i      := IPA symbol of a phone that is represented by the grapheme
             according to the language
FEAT_i    := LANGUAGE_FEATURE | TEXT_FEATURE | PHONETIC_FEATURE

The three feature types correspond respectively to (i) general language
features drawn from Ethnologue and the World Atlas of Language Structures, (ii)
text features drawn from the language's translation of the Universal
Declaration of Human Rights, and (iii) phonetic features extracted from the
text features using the IPA heuristic described in the paper. Details of the
feature formats are given below.
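
As a rough illustration of the line format above, the following Python sketch
splits one line into its three parts. It assumes that fields are separated by
whitespace and that no field contains internal whitespace; neither point is
stated explicitly here. The function name parse_line and the example line are
hypothetical and are not taken from the data files.

```python
def parse_line(line):
    """Split one data line into (language, phones, features).

    Assumes whitespace-separated fields, as the format line suggests.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected at least LANGUAGE and a phone list")
    language = fields[0]            # ISO 639-3 code
    phones = fields[1].split(",")   # comma-separated IPA symbols
    features = fields[2:]           # raw FEAT_i strings, still unparsed
    return language, phones, features


# Made-up example line for illustration only (not real data).
example = "deu k,ts region=Europe llan=Indo-European slan=Germanic count=120"
language, phones, features = parse_line(example)
```

The feature strings are left unparsed at this stage, since the three feature
types use different layouts.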

===============================================================================

LANGUAGE_FEATURE := region=REGION | llan=LLAN | slan=SLAN
REGION           := Africa | Americas | Asia | Europe | Other | Pacific | UNK
LLAN             := Afro-Asiatic | Algic | Altaic | Arawakan | Austro-Asiatic |
                    Austronesian | Basque | Eskimo-Aleut | Harakmbet |
                    Huitotoan | Indo-European | Mayan | Muskogean | 
                    Niger-Congo | Nilo-Saharan | Oto-Manguean | Uralic | 
                    Uto-Aztecan | Zaparoan | other | UNK
SLAN             := Algonquian | Arawakan | Aztecan | Baltic | Bantoid | 
                    Basque | Celtic | Central_Malayo-Polynesian | Chamorro | 
                    Creoles_and_Pidgins | Eastern_Cushitic | Eskimo | Finnic | 
                    Germanic | Harakmbet | Huitoto | Igboid | Kwa | Mayan | 
                    Meso-Philippine | Muskogean | Nilotic | Northern_Atlantic | 
                    Northern_Philippines | Oceanic | Otomian | Romance | 
                    Semitic | Slavic | Sundic | Turkic | Ubangi | 
                    Viet-Muong | West_Chadic | Western_Mande | Zaparoan | UNK

These features provide general information about the language in question:

'region' -- the region of the world in which the language is spoken
'llan' -- the large language family of the language
'slan' -- the small language family of the language
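
A minimal sketch of how the language features might be collected from the
feature fields of a line, assuming each feature is a single KEY=VALUE token.
The helper name language_features and the sample list are hypothetical.

```python
LANGUAGE_KEYS = {"region", "llan", "slan"}


def language_features(features):
    """Return a dict holding only the region/llan/slan features."""
    out = {}
    for feat in features:
        key, sep, value = feat.partition("=")
        # Keep only well-formed KEY=VALUE tokens with a language key.
        if sep and key in LANGUAGE_KEYS:
            out[key] = value
    return out


# Made-up feature list for illustration only.
sample = ["region=Europe", "llan=Indo-European", "slan=Germanic", "count=12"]
info = language_features(sample)
```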

===============================================================================

TEXT_FEATURE := count=N | l1=G=N | r1=G=N | l2=G1@G2=N | r2=G1@G2=N | 
                l1r1=G1@G2=N
G            := any Latin grapheme
N            := integer count

These features count contexts in which the target grapheme appears in the text:

'count' -- the total count of the target grapheme
'l1' -- the number of times grapheme G appears to the immediate left of the
        target grapheme
'r1' -- the number of times grapheme G appears to the immediate right of the
        target grapheme
'l2' -- the number of times the grapheme pair G1, G2 appears to the immediate
        left of the target grapheme
'r2' -- the number of times the grapheme pair G1, G2 appears to the immediate
        right of the target grapheme
'l1r1' -- the number of times the grapheme pair G1, G2 surrounds the target
          grapheme
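
The text feature layout above could be parsed along these lines. This is only
a sketch: it assumes that "@" and "=" never occur as graphemes, and it makes
no attempt to tell a text 'l1r1' feature apart from the phonetic 'l1r1' form
listed in the next section. The name parse_text_feature is hypothetical.

```python
TEXT_KEYS = {"count", "l1", "r1", "l2", "r2", "l1r1"}


def parse_text_feature(feat):
    """Return (kind, graphemes, count), or None if not a text feature."""
    kind, sep, rest = feat.partition("=")
    if not sep or kind not in TEXT_KEYS:
        return None
    if kind == "count":
        # count=N carries no context graphemes.
        return kind, (), int(rest)
    # The count follows the last "="; the context precedes it.
    context, sep, n = rest.rpartition("=")
    if not sep:
        return None
    return kind, tuple(context.split("@")), int(n)
```

For instance, a made-up token "l2=s@c=7" would yield ("l2", ("s", "c"), 7).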

===============================================================================

PHONETIC_FEATURE  := il1=PH=N | ir1=PH=N | il2=PH1@PH2=N | ir2=PH1@PH2=N | 
                     l1r1=PH1@PH2=N
PH                := alveolar | approximant | back | bilabial | central |       
                     centralized | close | consonant | dental | extra_short | 
                     fricative | front | glottal | implosive | labial | 
                     labiodental | lateral | mid | nasal | nasalized | near | 
                     open | palatal | pharyngeal | plosive | retroflex | 
                     rounded | trill | unrounded | uvular | velar | voiced | 
                     voiceless | vowel
N                 := integer count

These features abstract the text features by mapping each contextual grapheme
to a phone (using the equivalent IPA symbol) and then recording attributes of
that phone, such as its place or manner of articulation. As with the text
features, each of these features describes the context immediately to the
left of, immediately to the right of, or surrounding the target grapheme.
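
A sketch of how phonetic features might be recognized, under the same
tokenization assumptions as for the text features. Because the 'l1r1' key is
listed for both text and phonetic features, this sketch treats a token as
phonetic only when every context value is one of the PH names; that rule is
an assumption made for illustration and is not described in the paper. The
name parse_phonetic_feature is hypothetical.

```python
PH = {
    "alveolar", "approximant", "back", "bilabial", "central", "centralized",
    "close", "consonant", "dental", "extra_short", "fricative", "front",
    "glottal", "implosive", "labial", "labiodental", "lateral", "mid",
    "nasal", "nasalized", "near", "open", "palatal", "pharyngeal", "plosive",
    "retroflex", "rounded", "trill", "unrounded", "uvular", "velar",
    "voiced", "voiceless", "vowel",
}
PHONETIC_KEYS = {"il1", "ir1", "il2", "ir2", "l1r1"}


def parse_phonetic_feature(feat):
    """Return (kind, attributes, count), or None if not phonetic."""
    kind, sep, rest = feat.partition("=")
    if not sep or kind not in PHONETIC_KEYS:
        return None
    context, sep, n = rest.rpartition("=")
    if not sep:
        return None
    values = tuple(context.split("@"))
    # Assumed disambiguation rule: all context values must be PH names.
    if not all(v in PH for v in values):
        return None
    return kind, values, int(n)
```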

Note that reproducing the results in the paper requires further discretization
and feature selection, performed in a leave-one-out scenario as detailed in
the paper. Please contact the authors with any questions or with requests for
the data at the various processing steps.


ybkim@cs.wisc.edu, bsnyder@cs.wisc.edu
