Kenneth Church

Also published as: Kenneth Ward Church, Ken Church, Kenneth W. Church

2025

Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet remains understudied as a low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP, a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization, and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.

pdf bib abs
Comparable Corpora: Opportunities for New Research Directions
Kenneth Ward Church
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)

Most conference papers present new results, but this paper will focus more on opportunities for the audience to make their own contributions. This paper is intended to challenge the community to think more broadly about what we can do with comparable corpora. We will start with a review of the history, and then suggest new directions for future research.

pdf bib abs
Is Peer-Reviewing Worth the Effort?
Kenneth Ward Church | Raman Chandrasekar | John E. Ortega | Ibrahim Said Ahmad
Proceedings of the 31st International Conference on Computational Linguistics

How effective is peer-reviewing in identifying important papers? We treat this question as a forecasting task. Can we predict which papers will be highly cited in the future based on venue and “early returns” (citations soon after publication)? We show early returns are more predictive than venue. Finally, we end with a constructive suggestion to simplify reviewing.

2024

pdf bib abs
On Translating Technical Terminology: A Translation Workflow for Machine-Translated Acronyms
Richard Yue | John Ortega | Kenneth Church
Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

The typical workflow for a professional translator to translate a document from its source language (SL) to a target language (TL) is not always focused on what many language models in natural language processing (NLP) do - predict the next word in a series of words. While high-resource languages like English and French are reported to achieve near human parity using common metrics for measurement such as BLEU and COMET, we find that an important step is being missed: the translation of technical terms, specifically acronyms. Some state-of-the art machine translation systems like Google Translate which are publicly available can be erroneous when dealing with acronyms - as much as 50% in our findings. This article addresses acronym disambiguation for MT systems by proposing an additional step to the SL-TL (FR-EN) translation workflow where we first offer a new acronym corpus for public consumption and then experiment with a search-based thresholding algorithm that achieves nearly 10% increase when compared to Google Translate and OpusMT.

pdf bib abs
Are Generative Language Models Multicultural? A Study on Hausa Culture and Emotions using ChatGPT
Ibrahim Ahmad | Shiran Dudy | Resmi Ramachandranpillai | Kenneth Church
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP

Large Language Models (LLMs), such as ChatGPT, are widely used to generate content for various purposes and audiences. However, these models may not reflect the cultural and emotional diversity of their users, especially for low-resource languages. In this paper, we investigate how ChatGPT represents Hausa’s culture and emotions. We compare responses generated by ChatGPT with those provided by native Hausa speakers on 37 culturally relevant questions. We conducted experiments using emotion analysis. We also used two similarity metrics to measure the alignment between human and ChatGPT responses. We also collect human participants ratings and feedback on ChatGPT responses. Our results show that ChatGPT has some level of similarity to human responses, but also exhibits some gaps and biases in its knowledge and awareness of Hausa culture and emotions. We discuss the implications and limitations of our methodology and analysis and suggest ways to improve the performance and evaluation of LLMs for low-resource languages.

pdf bib abs
No Culture Left Behind: ArtELingo-28, a Benchmark of WikiArt with Captions in 28 Languages
Youssef Mohamed | Runjia Li | Ibrahim Said Ahmad | Kilichbek Haydarov | Philip Torr | Kenneth Church | Mohamed Elhoseiny
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Research in vision and language has made considerable progress thanks to benchmarks such as COCO. COCO captions focused on unambiguous facts in English; ArtEmis introduced subjective emotions and ArtELingo introduced some multilinguality (Chinese and Arabic). However we believe there should be more multilinguality. Hence, we present ArtELingo-28, a vision-language benchmark that spans 28 languages and encompasses approximately 200,000 annotations (140 annotations per image). Traditionally, vision research focused on unambiguous class labels, whereas ArtELingo-28 emphasizes diversity of opinions over languages and cultures. The challenge is to build machine learning systems that assign emotional captions to images. Baseline results will be presented for three novel conditions: Zero-Shot, Few-Shot and One-vs-All Zero-Shot. We find that cross-lingual transfer is more successful for culturally-related languages. Data and code will be made publicly available.

pdf bib abs
Comparing Edge-based and Node-based Methods on a Citation Prediction Task
Peter Vickers | Kenneth Church
Findings of the Association for Computational Linguistics: EMNLP 2024

Citation Prediction, estimating whether paper a cites paper b, is particularly interesting in a forecasting setting where the model is trained on papers published before time t, and evaluated on papers published after h, where h is the forecast horizon. Performance improves with t (larger training sets) and degrades with h (longer forecast horizons). The trade-off between edge-based methods and node-based methods depends on t. Because edges grow faster than nodes, larger training sets favor edge-based methods.We introduce a new forecast-based Citation Prediction benchmark of 3 million papers to quantify these trends.Our benchmark shows that desirable policies for combining edge- and node-based methods depend on h and t.We release our benchmark, evaluation scripts, and embeddings.

2023

pdf bib abs
A Research-Based Guide for the Creation and Deployment of a Low-Resource Machine Translation System
John E. Ortega | Kenneth Church
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

The machine translation (MT) field seems to focus heavily on English and other high-resource languages. Though, low-resource MT (LRMT) is receiving more attention than in the past. Successful LRMT systems (LRMTS) should make a compelling business case in terms of demand, cost and quality in order to be viable for end users. When used by communities where low-resource languages are spoken, LRMT quality should not only be determined by the use of traditional metrics like BLEU, but it should also take into account other factors in order to be inclusive and not risk overall rejection by the community. MT systems based on neural methods tend to perform better with high volumes of training data, but they may be unrealistic and even harmful for LRMT. It is obvious that for research purposes, the development and creation of LRMTS is necessary. However, in this article, we argue that two main workarounds could be considered by companies that are considering deployment of LRMTS in the wild: human-in-the-loop and sub-domains.

2022

pdf bib abs
A Gentle Introduction to Deep Nets and Opportunities for the Future
Kenneth Church | Valia Kordoni | Gary Marcus | Ernest Davis | Yanjun Ma | Zeyu Chen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

The first half of this tutorial will make deep nets more accessible to a broader audience, following “Deep Nets for Poets” and “A Gentle Introduction to Fine-Tuning.” We will also introduce GFT (general fine tuning), a little language for fine tuning deep nets with short (one line) programs that are as easy to code as regression in statistics packages such as R using glm (general linear models). Based on the success of these methods on a number of benchmarks, one might come away with the impression that deep nets are all we need. However, we believe the glass is half-full: while there is much that can be done with deep nets, there is always more to do. The second half of this tutorial will discuss some of these opportunities.

pdf bib abs
ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture
Youssef Mohamed | Mohamed Abdelfattah | Shyma Alhuwaider | Feifan Li | Xiangliang Zhang | Kenneth Church | Mohamed Elhoseiny
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

This paper introduces ArtELingo, a new benchmark and dataset, designed to encourage work on diversity across languages and cultures. Following ArtEmis, a collection of 80k artworks from WikiArt with 0.45M emotion labels and English-only captions, ArtELingo adds another 0.79M annotations in Arabic and Chinese, plus 4.8K in Spanish to evaluate “cultural-transfer” performance. 51K artworks have 5 annotations or more in 3 languages. This diversity makes it possible to study similarities and differences across languages and cultures. Further, we investigate captioning tasks, and find diversity improves the performance of baseline models. ArtELingo is publicly available at ‘www.artelingo.org‘ with standard splits and baseline models. We hope our work will help ease future research on multilinguality and culturally-aware AI.

pdf bib abs
Training on Lexical Resources
Kenneth Church | Xingyu Cai | Yuchen Bian
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We propose using lexical resources (thesaurus, VAD) to fine-tune pretrained deep nets such as BERT and ERNIE. Then at inference time, these nets can be used to distinguish synonyms from antonyms, as well as VAD distances. The inference method can be applied to words as well as texts such as multiword expressions (MWEs), out of vocabulary words (OOVs), morphological variants and more. Code and data are posted on https://github.com/kwchurch/syn_ant.

pdf bib abs
Data Augmentation for the Post-Stroke Speech Transcription (PSST) Challenge: Sometimes Less Is More
Jiahong Yuan | Xingyu Cai | Kenneth Church
Proceedings of the RaPID Workshop - Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments - within the 13th Language Resources and Evaluation Conference

We employ the method of fine-tuning wav2vec2.0 for recognition of phonemes in aphasic speech. Our effort focuses on data augmentation, by supplementing data from both in-domain and out-of-domain datasets for training. We found that although a modest amount of out-of-domain data may be helpful, the performance of the model degrades significantly when the amount of out-of-domain data is much larger than in-domain data. Our hypothesis is that fine-tuning wav2vec2.0 with a CTC loss not only learns bottom-up acoustic properties but also top-down constraints. Therefore, out-of-domain data augmentation is likely to degrade performance if there is a language model mismatch between “in” and “out” domains. For in-domain audio without ground truth labels, we found that it is beneficial to exclude samples with less confident pseudo labels. Our final model achieves 16.7% PER (phoneme error rate) on the validation set, without using a language model for decoding. The result represents a relative error reduction of 14% over the baseline model trained without data augmentation. Finally, we found that “canonicalized” phonemes are much easier to recognize than manually transcribed phonemes.

2021

pdf bib
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future
Kenneth Church | Mark Liberman | Valia Kordoni
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

pdf bib abs
Benchmarking: Past, Present and Future
Kenneth Church | Mark Liberman | Valia Kordoni
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

Where have we been, and where are we going? It is easier to talk about the past than the future. These days, benchmarks evolve more bottom up (such as papers with code). There used to be more top-down leadership from government (and industry, in the case of systems, with benchmarks such as SPEC). Going forward, there may be more top-down leadership from organizations like MLPerf and/or influencers like David Ferrucci, who was responsible for IBM’s success with Jeopardy, and has recently written a paper suggesting how the community should think about benchmarking for machine comprehension. Tasks such as reading comprehension become even more interesting as we move beyond English. Multilinguality introduces many challenges, and even more opportunities.

pdf bib abs
Data Collection vs. Knowledge Graph Completion: What is Needed to Improve Coverage?
Kenneth Church | Yuchen Bian
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This survey/position paper discusses ways to improve coverage of resources such as WordNet. Rapp estimated correlations, rho, between corpus statistics and pyscholinguistic norms. rho improves with quantity (corpus size) and quality (balance). 1M words is enough for simple estimates (unigram frequencies), but at least 100x more is required for good estimates of word associations and embeddings. Given such estimates, WordNet’s coverage is remarkable. WordNet was developed on SemCor, a small sample (200k words) from the Brown Corpus. Knowledge Graph Completion (KGC) attempts to learn missing links from subsets. But Rapp’s estimates of sizes suggest it would be more profitable to collect more data than to infer missing information that is not there.

pdf bib abs
On Attention Redundancy: A Comprehensive Study
Yuchen Bian | Jiaji Huang | Xingyu Cai | Jiahong Yuan | Kenneth Church
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Multi-layer multi-head self-attention mechanism is widely applied in modern neural language models. Attention redundancy has been observed among attention heads but has not been deeply studied in the literature. Using BERT-base model as an example, this paper provides a comprehensive study on attention redundancy which is helpful for model interpretation and model compression. We analyze the attention redundancy with Five-Ws and How. (What) We define and focus the study on redundancy matrices generated from pre-trained and fine-tuned BERT-base model for GLUE datasets. (How) We use both token-based and sentence-based distance functions to measure the redundancy. (Where) Clear and similar redundancy patterns (cluster structure) are observed among attention heads. (When) Redundancy patterns are similar in both pre-training and fine-tuning phases. (Who) We discover that redundancy patterns are task-agnostic. Similar redundancy patterns even exist for randomly generated token sequences. (“Why”) We also evaluate influences of the pre-training dropout ratios on attention redundancy. Based on the phase-independent and task-agnostic attention redundancy patterns, we propose a simple zero-shot pruning method as a case study. Experiments on fine-tuning GLUE tasks verify its effectiveness. The comprehensive analyses on attention redundancy make model understanding and zero-shot model pruning promising.

2020

pdf bib abs
Improving Bilingual Lexicon Induction for Low Frequency Words
Jiaji Huang | Xingyu Cai | Kenneth Church
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

This paper designs a Monolingual Lexicon Induction task and observes that two factors accompany the degraded accuracy of bilingual lexicon induction for rare words. First, a diminishing margin between similarities in low frequency regime, and secondly, exacerbated hubness at low frequency. Based on the observation, we further propose two methods to address these two factors, respectively. The larger issue is hubness. Addressing that improves induction accuracy significantly, especially for low-frequency words.

Text-to-speech synthesis (TTS) has witnessed rapid progress in recent years, where neural methods became capable of producing audios with high naturalness. However, these efforts still suffer from two types of latencies: (a) the computational latency (synthesizing time), which grows linearly with the sentence length, and (b) the input latency in scenarios where the input text is incrementally available (such as in simultaneous translation, dialog generation, and assistive technologies). To reduce these latencies, we propose a neural incremental TTS approach using the prefix-to-prefix framework from simultaneous translation. We synthesize speech in an online fashion, playing a segment of audio while generating the next, resulting in an O(1) rather than O(n) latency. Experiments on English and Chinese TTS show that our approach achieves similar speech naturalness compared to full sentence TTS, but only with a constant (1-2 words) latency.

Simultaneous speech-to-speech translation is an extremely challenging but widely useful scenario that aims to generate target-language speech only a few seconds behind the source-language speech. In addition, we have to continuously translate a speech of multiple sentences, but all recent solutions merely focus on the single-sentence scenario. As a result, current approaches will accumulate more and more latencies in later sentences when the speaker talks faster and introduce unnatural pauses into translated speech when the speaker talks slower. To overcome these issues, we propose Self-Adaptive Translation which flexibly adjusts the length of translations to accommodate different source speech rates. At similar levels of translation quality (as measured by BLEU), our method generates more fluent target speech latency than the baseline, in both Zh<->En directions.

2019

pdf bib abs
Hubless Nearest Neighbor Search for Bilingual Lexicon Induction
Jiaji Huang | Qiang Qiu | Kenneth Church
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Bilingual Lexicon Induction (BLI) is the task of translating words from corpora in two languages. Recent advances in BLI work by aligning the two word embedding spaces. Following that, a key step is to retrieve the nearest neighbor (NN) in the target space given the source word. However, a phenomenon called hubness often degrades the accuracy of NN. Hubness appears as some data points, called hubs, being extra-ordinarily close to many of the other data points. Reducing hubness is necessary for retrieval tasks. One successful example is Inverted SoFtmax (ISF), recently proposed to improve NN. This work proposes a new method, Hubless Nearest Neighbor (HNN), to mitigate hubness. HNN differs from NN by imposing an additional equal preference assumption. Moreover, the HNN formulation explains why ISF works as well as it does. Empirical results demonstrate that HNN outperforms NN, ISF and other state-of-the-art. For reproducibility and follow-ups, we have published all code.

While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.

2009

pdf bib
Using Word-Sense Disambiguation Methods to Classify Web Queries by Intent
Emily Pitler | Ken Church
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Repetition and Language Models and Comparable Corpora
Ken Church
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

2007

pdf bib
Compressing Trigram Language Models With Golomb Coding
Kenneth Church | Ted Hart | Jianfeng Gao
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf bib
A Sketch Algorithm for Estimating Two-Way and Multi-Way Associations
Ping Li | Kenneth W. Church
Computational Linguistics, Volume 33, Number 3, September 2007

pdf bib
K-Best Suffix Arrays
Kenneth Church | Bo Thiesson | Robert Ragno
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

2005

pdf bib
Using Sketches to Estimate Associations
Ping Li | Kenneth W. Church
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

pdf bib
Last Words: Reviewing the Reviewers
Kenneth Church
Computational Linguistics, Volume 31, Number 4, December 2005

pdf bib
The Wild Thing
Ken Church | Bo Thiesson
Proceedings of the ACL Interactive Poster and Demonstration Sessions

2002

pdf bib
NLP Found Helpful (at least for one Text Categorization Task)
Carl Sable | Kathleen McKeown | Kenneth Church
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

2001

pdf bib
Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus
Mikio Yamamoto | Kenneth W. Church
Computational Linguistics, Volume 27, Number 1, March 2001

pdf bib
Using Bins to Empirically Estimate Term Weights for Text Categorization
Carl Sable | Kenneth W. Church
Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing

2000

pdf bib
Empirical Estimates of Adaptation: The chance of Two Noriegas is closer to p/2 than p2
Kenneth W. Church
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf bib
Empirical Term Weighting and Expansion Frequency
Kyoji Umemura | Kenneth W. Church
2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

1999

pdf bib
What’s Happened Since the First SIGDAT Meeting?
Kenneth Ward Church
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

1998

pdf bib
Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus
Mikio Yamamoto | Kenneth W. Church
Sixth Workshop on Very Large Corpora

1996

1995

pdf bib
Inverse Document Frequency (IDF): A Measure of Deviations from Poisson
Kenneth Church | William Gale
Third Workshop on Very Large Corpora

1994

pdf bib
Termight: Identifying and Translating Technical Terminology
Ido Dagan | Ken Church
Fourth Conference on Applied Natural Language Processing

pdf bib
Fax: An Alternative to SGML
Kenneth W. Church | William A. Gale | Jonathan I. Helfman | David D. Lewis
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

pdf bib
K-vec: A New Approach for Aligning Parallel Texts
Pascale Fung | Kenneth Ward Church
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

1993

pdf bib
Introduction to the Special Issue on Computational Linguistics Using Large Corpora
Kenneth W. Church | Robert L. Mercer
Computational Linguistics, Volume 19, Number 1, March 1993, Special Issue on Using Large Corpora: I

pdf bib
A Program for Aligning Sentences in Bilingual Corpora
William A. Gale | Kenneth W. Church
Computational Linguistics, Volume 19, Number 1, March 1993, Special Issue on Using Large Corpora: I

pdf bib
Char_align: A Program for Aligning Parallel Texts at the Character Level
Kenneth Ward Church
31st Annual Meeting of the Association for Computational Linguistics

pdf bib
Robust Bilingual Word Alignment for Machine Aided Translation
Ido Dagan | Kenneth Church | Willian Gale
Very Large Corpora: Academic and Industrial Perspectives

1992

pdf bib
Using bilingual materials to develop word sense disambiguation methods
William A. Gale | Kenneth W. Church | David Yarowsky
Proceedings of the Fourth Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

pdf bib
One Sense Per Discourse
William A. Gale | Kenneth W. Church | David Yarowsky
Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992

pdf bib
Estimating Upper and Lower Bounds on the Performance of Word-Sense Disambiguation Programs
William Gale | Kenneth Ward Church | David Yarowsky
30th Annual Meeting of the Association for Computational Linguistics

1991

pdf bib
Identifying Word Correspondences in Parallel Texts
William A. Gale | Kenneth W. Church
Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California, February 19-22, 1991

pdf bib
Book Reviews: Theory and Practice in Corpus Linguistics
Kenneth Ward Church
Computational Linguistics, Volume 17, Number 1, March 1991

pdf bib
A Program for Aligning Sentences in Bilingual Corpora
William A. Gale | Kenneth W. Church
29th Annual Meeting of the Association for Computational Linguistics

1990

pdf bib
A Spelling Correction Program Based on a Noisy Channel Model
Mark D. Kernighan | Kenneth W. Church | William A. Gale
COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics

pdf bib
Poor Estimates of Context are Worse than None
William A. Gale | Kenneth W. Church
Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27,1990

pdf bib
Word Association Norms, Mutual Information, and Lexicography
Kenneth Ward Church | Patrick Hanks
Computational Linguistics, Volume 16, Number 1, March 1990

1989

pdf bib
Parsing, Word Associations and Typical Predicate-Argument Relations
Kenneth Church | William Gale | Patrick Hanks | Donald Hindle
Speech and Natural Language: Proceedings of a Workshop Held at Cape Cod, Massachusetts, October 15-18, 1989

pdf bib
Enhanced Good-Turing and Cat-Cal: Two New Methods for Estimating Probabilities of English Bigrams (abbreviated version)
Kenneth W. Church | William A. Gale
Speech and Natural Language: Proceedings of a Workshop Held at Cape Cod, Massachusetts, October 15-18, 1989

pdf bib
Session 11 Natural Language III
Kenneth Ward Church
Speech and Natural Language: Proceedings of a Workshop Held at Cape Cod, Massachusetts, October 15-18, 1989

pdf bib
Word Association Norms, Mutual Information, and Lexicography
Kenneth Ward Church | Patrick Hanks
27th Annual Meeting of the Association for Computational Linguistics

pdf bib abs
Parsing, Word Associations and Typical Predicate-Argument Relations
Kenneth Church | William Gale | Patrick Hanks | Donald Hindle
Proceedings of the First International Workshop on Parsing Technologies

There are a number of collocational constraints in natural languages that ought to play a more important role in natural language parsers. Thus, for example, it is hard for most parsers to take advantage of the fact that wine is typically drunk, produced, and sold, but (probably) not pruned. So too, it is hard for a parser to know which verbs go with which prepositions (e.g., set up) and which nouns fit together to form compound noun phrases (e.g., computer programmer). This paper will attempt to show that many of these types of concerns can be addressed with syntactic methods (symbol pushing), and need not require explicit semantic interpretation. We have found that it is possible to identify many of these interesting co-occurrence relations by computing simple summary statistics over millions of words of text. This paper will summarize a number of experiments carried out by various subsets of the authors over the last few years. The term collocation will be used quite broadly to include constraints on SVO (subject verb object) triples, phrasal verbs, compound noun phrases, and psychoiinguistic notions of word association (e.g., doctor/nurse).