Re-evaluating the Word Token for Bilingual Speech Processing: The Case for Intonation Units

Rebecca Pattichis, Dora LaCasse, Rena Torres Cacoullos


Abstract
Natural Language Processing (NLP) metrics for bilingual code-switching (CS) have, until now, used words as the token level. However, the assumption that any two words constitute an equally likely switch point is erroneous. In spoken language, a major delimiter of CS is a prosodic chunk known as the Intonation Unit (IU). Switch points are far more likely between words at IU boundaries than between words in the same IU. The word as an elementary NLP unit is thus incommensurate with bilingual speech patterns. Here, we put forward an IU-based adaptation of a familiar metric of CS probability. We then compare the token levels on this metric for ten bilingual datasets featuring multi-word CS. Our comparison shows that the currently standard two-significant-figure precision of the word-based metric is insufficient, as the token level compresses the range of values by inflating the universe of CS. More discerning CS probability values can be obtained by normalizing word-based counts using mean IU length.
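To illustrate the contrast between token levels, here is a minimal sketch that is not taken from the paper. It assumes a transcript represented as a list of IUs, each a list of per-word language tags, and it assumes a simple switch-rate formulation (switches divided by candidate switch sites). The function name `switch_rates`, the dictionary keys, and the scaling of the word-based rate by mean IU length are hypothetical readings of the abstract, not the authors' definitions.

```python
from typing import Dict, List


def switch_rates(ius: List[List[str]]) -> Dict[str, float]:
    """Compare word-level and IU-level switch rates on a toy transcript.

    `ius` is a list of Intonation Units; each IU is a non-empty list of
    per-word language tags (e.g. "en", "es"). Hypothetical formulation:
    the paper's exact metric is not given in the abstract.
    """
    if len(ius) < 2 or any(len(iu) == 0 for iu in ius):
        raise ValueError("need at least two non-empty IUs")

    words = [tag for iu in ius for tag in iu]
    n_words, n_ius = len(words), len(ius)

    # Word level: every pair of adjacent words counts as a candidate site.
    switches = sum(a != b for a, b in zip(words, words[1:]))
    word_rate = switches / (n_words - 1)

    # IU level: only boundaries between consecutive IUs are candidate sites.
    boundary_switches = sum(
        prev[-1] != nxt[0] for prev, nxt in zip(ius, ius[1:])
    )
    iu_rate = boundary_switches / (n_ius - 1)

    # Assumed normalization: scale the word-based rate by mean IU length.
    mean_iu_len = n_words / n_ius
    return {
        "switches": float(switches),
        "word_rate": word_rate,
        "iu_rate": iu_rate,
        "mean_iu_len": mean_iu_len,
        "scaled_word_rate": word_rate * mean_iu_len,
    }


toy = [["en", "en", "en"], ["es", "es"], ["es", "es", "es", "es"], ["en", "en", "en"]]
print(switch_rates(toy))
```

In this toy input both switches fall at IU boundaries, so the word-based rate (2 of 11 sites) is much smaller than the IU-based rate (2 of 3 sites). This is the kind of compression from an inflated universe of switch sites that the abstract describes.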
Anthology ID:
2026.cl-1.8
Volume:
Computational Linguistics, Volume 52, Issue 1 - March 2026
Month:
March
Year:
2026
Address:
Cambridge, MA
Venue:
CL
Publisher:
MIT Press
Pages:
271–293
URL:
https://preview.aclanthology.org/ingest-latest-mitpress-cl-tacl/2026.cl-1.8/
DOI:
10.1162/coli.a.580
Cite (ACL):
Rebecca Pattichis, Dora LaCasse, and Rena Torres Cacoullos. 2026. Re-evaluating the Word Token for Bilingual Speech Processing: The Case for Intonation Units. Computational Linguistics, 52(1):271–293.
Cite (Informal):
Re-evaluating the Word Token for Bilingual Speech Processing: The Case for Intonation Units (Pattichis et al., CL 2026)
PDF:
https://preview.aclanthology.org/ingest-latest-mitpress-cl-tacl/2026.cl-1.8.pdf