Douglas Biber


2021

pdf
Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers
Liina Repo | Valtteri Skantsi | Samuel Rönnqvist | Saara Hellström | Miika Oinonen | Anna Salmela | Douglas Biber | Jesse Egbert | Sampo Pyysalo | Veronika Laippala
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform previous state-of-the-art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse classification results finding that certain registers continue to pose challenges in particular for cross-lingual transfer.

2019

pdf
Toward Multilingual Identification of Online Registers
Veronika Laippala | Roosa Kyllönen | Jesse Egbert | Douglas Biber | Sampo Pyysalo
Proceedings of the 22nd Nordic Conference on Computational Linguistics

We consider cross- and multilingual text classification approaches to the identification of online registers (genres), i.e. text varieties with specific situational characteristics. Register is the most important predictor of linguistic variation, and register information could improve the potential of online data for many applications. We introduce the first manually annotated non-English corpus of online registers featuring the full range of linguistic variation found online. The data set consists of 2,237 Finnish documents and follows the register taxonomy developed for the Corpus of Online Registers of English (CORE). Using CORE and the newly introduced corpus, we demonstrate the feasibility of cross-lingual register identification using a simple approach based on convolutional neural networks and multilingual word embeddings. We further find that register identification results can be improved through multilingual training even when a substantial number of annotations is available in the target language.

1999

pdf
Book Reviews: Exploring Textual Data
Douglas Biber
Computational Linguistics, Volume 25, Number 1, March 1999

1993

pdf bib
Using Register-Diversified Corpora for General Language Studies
Douglas Biber
Computational Linguistics, Volume 19, Number 2, June 1993, Special Issue on Using Large Corpora: II

pdf
Co-occurrence Patterns among Collocations: A Tool for Corpus-Based Lexical Knowledge Acquisition
Douglas Biber
Computational Linguistics, Volume 19, Number 3, September 1993

1992

pdf
Book Reviews: English Computer Corpora: Selected Papers and Research Guide
Douglas Biber
Computational Linguistics, Volume 18, Number 4, December 1992