Philip Resnik

2024

pdf abs
Overview of the CLPsych 2024 Shared Task: Leveraging Large Language Models to Identify Evidence of Suicidality Risk in Online Posts
Jenny Chim | Adam Tsakalidis | Dimitris Gkoumas | Dana Atzil-Slonim | Yaakov Ophir | Ayah Zirikly | Philip Resnik | Maria Liakata
Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2024)

We present the overview of the CLPsych 2024 Shared Task, focusing on leveraging open source Large Language Models (LLMs) for identifying textual evidence that supports the suicidal risk level of individuals on Reddit. In particular, given a Reddit user, their pre-determined suicide risk level (‘Low’, ‘Moderate’ or ‘High’) and all of their posts in the r/SuicideWatch subreddit, we frame the task of identifying relevant pieces of text in their posts supporting their suicidal classification in two ways: (a) on the basis of evidence highlighting (extracting sub-phrases of the posts) and (b) on the basis of generating a summary of such evidence. We annotate a sample of 125 users and introduce evaluation metrics based on (a) BERTScore and (b) natural language inference for the two sub-tasks, respectively. Finally, we provide an overview of the system submissions and summarise the key findings.

pdf abs
TopicGPT: A Prompt-based Topic Modeling Framework
Chau Pham | Alexander Hoyle | Simeng Sun | Philip Resnik | Mohit Iyyer
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Topic modeling is a well-established technique for exploring text corpora. Conventional topic models (e.g., LDA) represent topics as bags of words that often require “reading the tea leaves” to interpret; additionally, they offer users minimal control over the formatting and specificity of resulting topics. To tackle these issues, we introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics in a text collection. TopicGPT produces topics that align better with human categorizations compared to competing methods: it achieves a harmonic mean purity of 0.74 against human-annotated Wikipedia topics compared to 0.64 for the strongest baseline. Its topics are also more interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions. Moreover, the framework is highly adaptable, allowing users to specify constraints and modify topics without the need for model retraining. By streamlining access to high-quality and interpretable topics, TopicGPT represents a compelling, human-centered approach to topic modeling.

2023

pdf abs
Words, Subwords, and Morphemes: What Really Matters in the Surprisal-Reading Time Relationship?
Sathvik Nair | Philip Resnik
Findings of the Association for Computational Linguistics: EMNLP 2023

An important assumption that comes with using LLMs on psycholinguistic data has gone unverified. LLM-based predictions are based on subword tokenization, not decomposition of words into morphemes. Does that matter? We carefully test this by comparing surprisal estimates using orthographic, morphological, and BPE tokenization against reading time data. Our results replicate previous findings and provide evidence that *in the aggregate*, predictions using BPE tokenization do not suffer relative to morphological and orthographic segmentation. However, a finer-grained analysis points to potential issues with relying on BPE-based tokenization, as well as providing promising results involving morphologically-aware surprisal estimates and suggesting a new method for evaluating morphological prediction.

pdf abs
Natural Language Decompositions of Implicit Content Enable Better Text Representations
Alexander Hoyle | Rupak Sarkar | Pranav Goel | Philip Resnik
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

When people interpret text, they rely on inferences that go beyond the observed language itself. Inspired by this observation, we introduce a method for the analysis of text that takes implicitly communicated content explicitly into account. We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed, then validate the plausibility of the generated content via human judgments. Incorporating these explicit representations of implicit content proves useful in multiple problem settings that involve the human interpretation of utterances: assessing the similarity of arguments, making sense of a body of opinion data, and modeling legislative behavior. Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP and particularly its applications to social science.

2022

We provide an overview of the CLPsych 2022 Shared Task, which focusses on the automatic identification of ‘Moments of Change’ in longitudinal posts by individuals on social media and its connection with information regarding mental health. This year’s task introduced the notion of longitudinal modelling of the text generated by an individual online over time, along with appropriate temporally sensitive evaluation metrics. The Shared Task consisted of two subtasks: (a) the main task of capturing changes in an individual’s mood (drastic changes, ‘Switches’, and gradual changes, ‘Escalations’) on the basis of textual content shared online; and subsequently (b) the subtask of identifying the suicide risk level of an individual, a continuation of the CLPsych 2019 Shared Task, where participants were encouraged to explore how the identification of changes in mood in task (a) can help with assessing suicidality risk in task (b).

The language of Twitter differs significantly from that of other domains commonly included in large language model training. While tweets are typically multilingual and contain informal language, including emoji and hashtags, most pre-trained language models for Twitter are either monolingual, adapted from other domains rather than trained exclusively on Twitter, or are trained on a limited amount of in-domain Twitter data. We introduce Bernice, the first multilingual RoBERTa language model trained from scratch on 2.5 billion tweets with a custom tweet-focused tokenizer. We evaluate on a variety of monolingual and multilingual Twitter benchmarks, finding that our model consistently exceeds or matches the performance of a variety of models adapted to social media data as well as strong multilingual baselines, despite being trained on less data overall. We posit that it is more efficient compute- and data-wise to train completely on in-domain data with a specialized domain-specific tokenizer.

pdf abs
Are Neural Topic Models Broken?
Alexander Miserlis Hoyle | Rupak Sarkar | Pranav Goel | Philip Resnik
Findings of the Association for Computational Linguistics: EMNLP 2022

Recently, the relationship between automated and human evaluation of topic models has been called into question. Method developers have staked the efficacy of new topic model variants on automated measures, and their failure to approximate human preferences places these models on uncertain ground. Moreover, existing evaluation paradigms are often divorced from real-world use. Motivated by content analysis as a dominant real-world use case for topic modeling, we analyze two related aspects of topic models that affect their effectiveness and trustworthiness in practice for that purpose: the stability of their estimates and the extent to which the model’s discovered categories align with human-determined categories in the data. We find that neural topic models fare worse in both respects compared to an established classical method. We take a step toward addressing both issues in tandem by demonstrating that a straightforward ensembling method can reliably outperform the members of the ensemble.

2021

pdf abs
Syntopical Graphs for Computational Argumentation Tasks
Joe Barrow | Rajiv Jain | Nedim Lipka | Franck Dernoncourt | Vlad Morariu | Varun Manjunatha | Douglas Oard | Philip Resnik | Henning Wachsmuth
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Approaches to computational argumentation tasks such as stance detection and aspect detection have largely focused on the text of independent claims, losing out on potentially valuable context provided by the rest of the collection. We introduce a general approach to these tasks motivated by syntopical reading, a reading process that emphasizes comparing and contrasting viewpoints in order to improve topic understanding. To capture collection-level context, we introduce the syntopical graph, a data structure for linking claims within a collection. A syntopical graph is a typed multi-graph where nodes represent claims and edges represent different possible pairwise relationships, such as entailment, paraphrase, or support. Experiments applying syntopical graphs to the problems of detecting stance and aspects demonstrate state-of-the-art performance in each domain, significantly outperforming approaches that do not utilize collection-level information.

pdf
Using surprisal and fMRI to map the neural bases of broad and local contextual prediction during natural language comprehension
Shohini Bhattasali | Philip Resnik
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access
Nazli Goharian | Philip Resnik | Andrew Yates | Molly Ireland | Kate Niederhoffer | Rebecca Resnik
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

pdf abs
Community-level Research on Suicidality Prediction in a Secure Environment: Overview of the CLPsych 2021 Shared Task
Sean MacAvaney | Anjali Mittu | Glen Coppersmith | Jeff Leintz | Philip Resnik
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

Progress on NLP for mental health — indeed, for healthcare in general — is hampered by obstacles to shared, community-level access to relevant data. We report on what is, to our knowledge, the first attempt to address this problem in mental health by conducting a shared task using sensitive data in a secure data enclave. Participating teams received access to Twitter posts donated for research, including data from users with and without suicide attempts, and did all work with the dataset entirely within a secure computational environment. We discuss the task, team results, and lessons learned to set the stage for future tasks on sensitive or confidential data.

2020

Text segmentation aims to uncover latent structure by dividing text from a document into coherent sections. Where previous work on text segmentation considers the tasks of document segmentation and segment labeling separately, we show that the tasks contain complementary information and are best addressed jointly. We introduce Segment Pooling LSTM (S-LSTM), which is capable of jointly segmenting a document and labeling segments. In support of joint training, we develop a method for teaching the model to recover from errors by aligning the predicted and ground truth segments. We show that S-LSTM reduces segmentation error by 30% on average, while also improving segment labeling.

pdf abs
A Prioritization Model for Suicidality Risk Assessment
Han-Chin Shing | Philip Resnik | Douglas Oard
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We reframe suicide risk assessment from social media as a ranking problem whose goal is maximizing detection of severely at-risk individuals given the time available. Building on measures developed for resource-bounded document retrieval, we introduce a well-founded evaluation paradigm, and demonstrate using an expert-annotated test collection that meaningful improvements over plausible cascade model baselines can be achieved using an approach that jointly ranks individuals and their social media posts.

pdf abs
Developing a Curated Topic Model for COVID-19 Medical Research Literature
Philip Resnik | Katherine E. Goodman | Mike Moran
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

Topic models can facilitate search, navigation, and knowledge discovery in large document collections. However, automatic generation of topic models can produce results that fail to meet the needs of users. We advocate for a set of user-focused desiderata in topic modeling for the COVID-19 literature, and describe an effort in progress to develop a curated topic model for COVID-19 articles informed by subject matter expertise and the way medical researchers engage with medical literature.

pdf abs
Improving Neural Topic Models using Knowledge Distillation
Alexander Miserlis Hoyle | Pranav Goel | Philip Resnik
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Topic models are often used to identify human-interpretable topics to help make sense of large document collections. We use knowledge distillation to combine the best attributes of probabilistic topic models and pretrained transformers. Our modular method can be straightforwardly applied with any neural topic model to improve topic quality, which we demonstrate using two models having disparate architectures, obtaining state-of-the-art topic coherence. We show that our adaptable framework not only improves performance in the aggregate over all estimated topics, as is commonly reported, but also in head-to-head comparisons of aligned topics.

2019

pdf abs
A Multilingual Topic Model for Learning Weighted Topic Links Across Corpora with Low Comparability
Weiwei Yang | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Multilingual topic models (MTMs) learn topics on documents in multiple languages. Past models align topics across languages by implicitly assuming the documents in different languages are highly comparable, often a false assumption. We introduce a new model that does not rely on this assumption, particularly useful in important low-resource language scenarios. Our MTM learns weighted topic links and connects cross-lingual topics only when the dominant words defining them are similar, outperforming LDA and previous MTMs in classification tasks using documents’ topic posteriors as features. It also learns coherent topics on documents with low comparability.

pdf bib
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology
Kate Niederhoffer | Kristy Hollingshead | Philip Resnik | Rebecca Resnik | Kate Loveys
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology

pdf abs
CLPsych 2019 Shared Task: Predicting the Degree of Suicide Risk in Reddit Posts
Ayah Zirikly | Philip Resnik | Özlem Uzuner | Kristy Hollingshead
Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology

The shared task for the 2019 Workshop on Computational Linguistics and Clinical Psychology (CLPsych’19) introduced an assessment of suicide risk based on social media postings, using data from Reddit to identify users at no, low, moderate, or severe risk. Two variations of the task focused on users whose posts to the r/SuicideWatch subreddit indicated they might be at risk; a third task looked at screening users based only on their more everyday (non-SuicideWatch) posts. We received submissions from 15 different teams, and the results provide progress and insight into the value of language signal in helping to predict risk level.

2018

pdf abs
Assessing Composition in Sentence Vector Representations
Allyson Ettinger | Ahmed Elgohary | Colin Phillips | Philip Resnik
Proceedings of the 27th International Conference on Computational Linguistics

An important component of achieving language understanding is mastering the composition of sentence meaning, but an immediate challenge to solving this problem is the opacity of sentence vector representations produced by current neural sentence composition models. We present a method to address this challenge, developing tasks that directly target compositional meaning information in sentence vector representations with a high degree of precision and control. To enable the creation of these controlled tasks, we introduce a specialized sentence generation system that produces large, annotated sentence sets meeting specified syntactic, semantic and lexical constraints. We describe the details of the method and generation system, and then present results of experiments applying our method to probe for compositional information in embeddings from a number of existing sentence composition models. We find that the method is able to extract useful information about the differing capacities of these models, and we discuss the implications of our results with respect to these systems’ capturing of sentence information. We make available for public use the datasets used for these experiments, as well as the generation system.

pdf bib
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic
Kate Loveys | Kate Niederhoffer | Emily Prud’hommeaux | Rebecca Resnik | Philip Resnik
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

pdf abs
Expert, Crowdsourced, and Machine Assessment of Suicide Risk via Online Postings
Han-Chin Shing | Suraj Nair | Ayah Zirikly | Meir Friedenberg | Hal Daumé III | Philip Resnik
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

We report on the creation of a dataset for studying assessment of suicide risk via online postings in Reddit. Evaluation of risk-level annotations by experts yields what is, to our knowledge, the first demonstration of reliability in risk assessment by clinicians based on social media postings. We also introduce and demonstrate the value of a new, detailed rubric for assessing suicide risk, compare crowdsourced with expert performance, and present baseline predictive modeling experiments using the new dataset, which will be made available to researchers through the American Association of Suicidology.

pdf abs
CLPsych 2018 Shared Task: Predicting Current and Future Psychological Health from Childhood Essays
Veronica Lynn | Alissa Goodman | Kate Niederhoffer | Kate Loveys | Philip Resnik | H. Andrew Schwartz
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

We describe the shared task for the CLPsych 2018 workshop, which focused on predicting current and future psychological health from an essay authored in childhood. Language-based predictions of a person’s current health have the potential to supplement traditional psychological assessment such as questionnaires, improving intake risk measurement and monitoring. Predictions of future psychological health can aid with both early detection and the development of preventative care. Research into the mental health trajectory of people, beginning from their childhood, has thus far been an area of little work within the NLP community. This shared task represents one of the first attempts to evaluate the use of early language to predict future health; this has the potential to support a wide variety of clinical health care tasks, from early assessment of lifetime risk for mental health problems, to optimal timing for targeted interventions aimed at both prevention and treatment.

2017

pdf abs
Adapting Topic Models using Lexical Associations with Tree Priors
Weiwei Yang | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Models work best when they are optimized taking into account the evaluation criteria that people care about. For topic models, people often care about interpretability, which can be approximated using measures of lexical association. We integrate lexical association into topic optimization using tree priors, which provide a flexible framework that can take advantage of both first order word associations and the higher-order associations captured by word embeddings. Tree priors improve topic interpretability without hurting extrinsic performance.

2010

Targeted paraphrasing is a new approach to the problem of obtaining cost-effective, reasonable quality translation that makes use of simple and inexpensive human computations by monolingual speakers in combination with machine translation. The key insight behind the process is that it is possible to spot likely translation errors with only monolingual knowledge of the target language, and it is possible to generate alternative ways to say the same thing (i.e. paraphrases) with only monolingual knowledge of the source language. Evaluations demonstrate that this approach can yield substantial improvements in translation quality.

pdf
Shedding (a Thousand Points of) Light on Biased Language
Tae Yano | Philip Resnik | Noah A. Smith
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf
Measuring Transitivity Using Untrained Annotators
Nitin Madnani | Jordan Boyd-Graber | Philip Resnik
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf
Error Driven Paraphrase Annotation using Mechanical Turk
Olivia Buzek | Philip Resnik | Ben Bederson
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf
The University of Maryland Statistical Machine Translation System for the Fifth Workshop on Machine Translation
Vladimir Eidelman | Chris Dyer | Philip Resnik
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

pdf
More than Words: Syntactic Packaging and Implicit Sentiment
Stephan Greene | Philip Resnik
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf
Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases
Yuval Marton | Chris Callison-Burch | Philip Resnik
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf
Estimating Semantic Distance Using Soft Semantic Constraints in Knowledge-Source – Corpus Hybrid Models
Yuval Marton | Saif Mohammad | Philip Resnik
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf
The University of Maryland Statistical Machine Translation System for the Fourth Workshop on Machine Translation
Chris Dyer | Hendra Setiawan | Yuval Marton | Philip Resnik
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf
Topological Ordering of Function Words in Hierarchical Phrase-based Translation
Hendra Setiawan | Min-Yen Kan | Haizhou Li | Philip Resnik
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

pdf
Soft Syntactic Constraints for Hierarchical Phrased-Based Translation
Yuval Marton | Philip Resnik
Proceedings of ACL-08: HLT

pdf
Generalizing Word Lattice Translation
Christopher Dyer | Smaranda Muresan | Philip Resnik
Proceedings of ACL-08: HLT

pdf
Cross-Language Parser Adaptation between Related Languages
Daniel Zeman | Philip Resnik
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages

pdf
Online Large-Margin Training of Syntactic and Structural Translation Features
David Chiang | Yuval Marton | Philip Resnik
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf abs
Are Multiple Reference Translations Necessary? Investigating the Value of Paraphrased Reference Translations in Parameter Optimization
Nitin Madnani | Philip Resnik | Bonnie J. Dorr | Richard Schwartz
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

Most state-of-the-art statistical machine translation systems use log-linear models, which are defined in terms of hypothesis features and weights for those features. It is standard to tune the feature weights in order to maximize a translation quality metric, using held-out test sentences and their corresponding reference translations. However, obtaining reference translations is expensive. In our earlier work (Madnani et al., 2007), we introduced a new full-sentence paraphrase technique, based on English-to-English decoding with an MT system, and demonstrated that the resulting paraphrases can be used to cut the number of human reference translations needed in half. In this paper, we take the idea a step further, asking how far it is possible to get with just a single good reference translation for each item in the development set. Our analysis suggests that it is necessary to invest in four or more human translations in order to significantly improve on a single translation augmented by monolingual paraphrases.

2007

pdf
Tor, TorMd: Distributional Profiles of Concepts for Unsupervised Word Sense Disambiguation
Saif Mohammad | Graeme Hirst | Philip Resnik
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf
Using Paraphrases for Parameter Tuning in Statistical Machine Translation
Nitin Madnani | Necip Fazil Ayan | Philip Resnik | Bonnie Dorr
Proceedings of the Second Workshop on Statistical Machine Translation

2006

pdf abs
Word-Based Alignment, Phrase-Based Translation: What’s the Link?
Adam Lopez | Philip Resnik
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

State-of-the-art statistical machine translation is based on alignments between phrases – sequences of words in the source and target sentences. The learning step in these systems often relies on alignments between words. It is often assumed that the quality of this word alignment is critical for translation. However, recent results suggest that the relationship between alignment quality and translation quality is weaker than previously thought. We investigate this question directly, comparing the impact of high-quality alignments with a carefully constructed set of degraded alignments. In order to tease apart various interactions, we report experiments investigating the impact of alignments on different aspects of the system. Our results confirm a weak correlation, but they also illustrate that more data and better feature engineering may be more beneficial than better alignment.

2005

pdf
Improved HMM Alignment Models for Languages with Scarce Resources
Adam Lopez | Philip Resnik
Proceedings of the ACL Workshop on Building and Using Parallel Texts

pdf
The Hiero Machine Translation System: Extensions, Evaluation, and Analysis
David Chiang | Adam Lopez | Nitin Madnani | Christof Monz | Philip Resnik | Michael Subotin
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

pdf
OCR Post-Processing for Low Density Languages
Okan Kolak | Philip Resnik
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

pdf
Pattern Visualization for Machine Translation Output
Adam Lopez | Philip Resnik
Proceedings of HLT/EMNLP 2005 Interactive Demonstrations

pdf
The Linguist’s Search Engine: An Overview
Philip Resnik | Aaron Elkiss
Proceedings of the ACL Interactive Poster and Demonstration Sessions

2004

pdf
Inducing Frame Semantic Verb Classes from WordNet and LDOCE
Rebecca Green | Bonnie J. Dorr | Philip Resnik
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

pdf
The University of Maryland Senseval-3 system descriptions
Clara Cabezas | Indrajit Bhattacharya | Philip Resnik
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

2003

pdf
A Generative Probabilistic OCR Model for NLP Applications
Okan Kolak | William Byrne | Philip Resnik
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
The Web as a Parallel Corpus
Philip Resnik | Noah A. Smith
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus

2002

pdf
An Unsupervised Method for Word Sense Tagging using Parallel Corpora
Mona Diab | Philip Resnik
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

pdf
Evaluating Translational Correspondence using Annotation Projection
Rebecca Hwa | Philip Resnik | Amy Weinberg | Okan Kolak
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics

2001

pdf
Book Reviews: Parallel Text Processing: Alignment and Use of Translation Corpora
Philip Resnik
Computational Linguistics, Volume 27, Number 4, December 2001

pdf
Improved Cross-Language Retrieval using Backoff Translation
Philip Resnik | Douglas Oard | Gina Levow
Proceedings of the First International Conference on Human Language Technology Research

pdf
Rapidly Retargetable Interactive Translingual Retrieval
Gina-Anne Levow | Douglas W. Oard | Philip Resnik
Proceedings of the First International Conference on Human Language Technology Research

pdf
Mapping Lexical Entries in a Verbs Database to WordNet Senses
Rebecca Green | Lisa Pearl | Bonnie J. Dorr | Philip Resnik
Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics

pdf
Supervised Sense Tagging using Support Vector Machines
Clara Cabezas | Philip Resnik | Jessica Stevens
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

1999

pdf
Mining the Web for Bilingual Text
Philip Resnik
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

pdf abs
Parallel strands: a preliminary investigation into mining the Web for bilingual text
Philip Resnik
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Technical Papers

Parallel corpora are a valuable resource for machine translation, but at present their availability and utility are limited by genre- and domain-specificity, licensing restrictions, and the basic difficulty of locating parallel texts in all but the most dominant of the world’s languages. A parallel corpus resource not yet explored is the World Wide Web, which hosts an abundance of pages in parallel translation, offering a potential solution to some of these problems and unique opportunities of its own. This paper presents the necessary first step in that exploration: a method for automatically finding parallel translated documents on the Web. The technique is conceptually simple, fully language independent, and scalable, and preliminary evaluation results indicate that the method may be accurate enough to apply without human intervention.