Chris Brew

2022

We demonstrate that knowledge distillation can be used not only to reduce model size, but also to simultaneously adapt a contextual language model to a specific domain. We use Multilingual BERT (mBERT; Devlin et al., 2019) as a starting point and follow the knowledge distillation approach of Sanh et al. (2019) to train a smaller multilingual BERT model that is adapted to the domain at hand. We show that for in-domain tasks, the domain-specific model achieves an average improvement of 2.3% in F1 score relative to a model distilled on domain-general data. Whereas much previous work with BERT has fine-tuned the encoder weights during task training, we show that the model improvements from distillation on in-domain data persist even when the encoder weights are frozen during task training, allowing a single encoder to support classifiers for multiple tasks and languages.
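
As an illustration of the soft-target idea behind knowledge distillation, the sketch below computes the cross-entropy between a teacher's and a student's temperature-softened output distributions. It is a minimal sketch, not the training objective used in the paper: the abstract does not spell out the loss, and the function names, the temperature value, and the toy logits are all assumptions made for this example. Domain adaptation would come from the choice of text fed to both models, which the sketch does not cover.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Toy logits (assumed values): one student agrees with the teacher, one does not.
teacher = [2.0, 0.5, -1.0]
close_student = [1.8, 0.6, -0.9]
far_student = [-1.0, 0.5, 2.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close < loss_far)
```

A student whose distribution matches the teacher's gets the lowest possible loss, so minimizing this quantity over in-domain text pushes the smaller model toward the teacher's behavior on that text.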

2020

Abusive Language Detection using Syntactic Dependency Graphs
Kanika Narang | Chris Brew
Proceedings of the Fourth Workshop on Online Abuse and Harms

Automated detection of abusive language online has become imperative. Current sequential models (LSTM) do not work well for long and complex sentences, while bi-transformer models (BERT) are not computationally efficient for the task. We show that classifiers based on the syntactic structure of the text, dependency graph convolutional networks (DepGCNs), can achieve state-of-the-art performance on abusive language datasets. The overall performance is on par with that of strong baselines such as fine-tuned BERT. Further, our GCN-based approach is much more efficient than BERT at inference time, making it suitable for real-time detection.
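
To make the idea of convolving over a dependency graph concrete, here is a minimal sketch of a single graph-convolution step in plain Python. It is not the DepGCN architecture from the paper, whose details the abstract does not give: the mean-over-neighbours normalization, the undirected treatment of dependency arcs, the function name `gcn_layer`, and the toy features are all assumptions for this example.

```python
def gcn_layer(features, edges, weight):
    """One graph-convolution step over a dependency graph.

    Each token's vector is averaged with those of its syntactic neighbours
    (plus itself), multiplied by a weight matrix, and passed through ReLU.
    """
    n = len(features)
    dim = len(features[0])
    # Start with self-loops, then add each dependency arc in both directions.
    neighbours = [{i} for i in range(n)]
    for head, dependent in edges:
        neighbours[head].add(dependent)
        neighbours[dependent].add(head)

    output = []
    for i in range(n):
        mean = [
            sum(features[j][d] for j in neighbours[i]) / len(neighbours[i])
            for d in range(dim)
        ]
        projected = [
            max(0.0, sum(mean[d] * weight[d][k] for d in range(dim)))
            for k in range(len(weight[0]))
        ]
        output.append(projected)
    return output


# Toy sentence "dogs chase cats": token 1 (chase) heads tokens 0 and 2.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(1, 0), (1, 2)]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the averaging is easy to see

hidden = gcn_layer(features, edges, weight)
print(hidden[0])
```

Because each token only exchanges information with its syntactic neighbours, one layer costs time proportional to the number of dependency arcs, which is linear in sentence length for a tree.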

Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda
Giovanni Da San Martino | Chris Brew | Giovanni Luca Ciampaglia | Anna Feldman | Chris Leberknight | Preslav Nakov

2019

Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda
Anna Feldman | Giovanni Da San Martino | Alberto Barrón-Cedeño | Chris Brew | Chris Leberknight | Preslav Nakov

2018

Proceedings of the First Workshop on Natural Language Processing for Internet Freedom
Chris Brew | Anna Feldman | Chris Leberknight

Digital Operatives at SemEval-2018 Task 8: Using dependency features for malware NLP
Chris Brew
Proceedings of the 12th International Workshop on Semantic Evaluation

The four sub-tasks of SecureNLP build towards a capability for quickly highlighting critical information from malware reports, such as the specific actions taken by a malware sample. Digital Operatives (DO) submitted to sub-tasks 1 and 2, using standard text analysis technology (text classification for sub-task 1, and a CRF for sub-task 2). Performance is broadly competitive with other submitted systems on sub-task 1 and weak on sub-task 2. The annotation guidelines for the intermediate sub-tasks create a linkage to the final task, which is both an annotation challenge and a potentially useful feature of the task. The methods that DO chose do not attempt to make use of this linkage, which may be a missed opportunity. This motivates a post-hoc error analysis. It appears that the annotation task is very hard, and that in some cases both deep conceptual knowledge and substantial surrounding context are needed in order to correctly classify sentences.

We describe a process for converting the Penn Arabic Treebank into the CCG formalism. Previous efforts have yielded CCGbanks in English, German, and Turkish, thus opening these languages to the sophisticated computational tools developed for CCG and enabling further cross-linguistic development. Conversion from a context-free grammar treebank to a CCGbank is a four-stage process: head finding, argument classification, binarization, and category conversion. In the process of implementing a basic CCGbank conversion algorithm, we reveal properties of Arabic grammar that interfere with conversion, such as subject topicalization, genitive constructions, relative clauses, and optional pronominal subjects. All of these problematic phenomena can be resolved in a variety of ways; we discuss the advantages and disadvantages of each in their respective sections. We describe our categorial analysis of each of these Arabic grammatical phenomena in depth, and give technical details on their integration into the conversion algorithm.
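
Of the four stages named above, binarization is the easiest to show in isolation. The sketch below folds a flat constituent into nested binary nodes around a head child that is assumed to have been chosen already by the head-finding stage. It is a simplified illustration, not the paper's algorithm: the attachment order (right siblings first, then left), the tuple representation, and the placeholder child labels are assumptions for this example.

```python
def binarize(label, children, head_index):
    """Fold a flat constituent into nested binary nodes around its head child.

    Right siblings are attached first, nearest to the head first;
    left siblings are then attached, again nearest to the head first.
    """
    node = children[head_index]
    for right in children[head_index + 1:]:
        node = (label, node, right)
    for left in reversed(children[:head_index]):
        node = (label, left, node)
    return node


# A flat four-child constituent whose head is the second child.
flat = ["subject", "verb", "object", "adverb"]
tree = binarize("S", flat, head_index=1)
print(tree)
```

Every internal node in the result has exactly two children, which is the shape a later category-conversion step needs in order to assign a functor and an argument at each node.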

2009

Brutus: A Semantic Role Labeling System Incorporating CCG, CFG, and Dependency Features
Stephen Boxwell | Dennis Mehay | Chris Brew
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Using the Wiktionary Graph Structure for Synonym Detection
Timothy Weale | Chris Brew | Eric Fosler-Lussier
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)

2008

Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics
Martha Palmer | Chris Brew | Fei Xia

Statistical Identification of English Loanwords in Korean Using Automatically Generated Training Data
Kirk Baker | Chris Brew
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes an accurate, extensible method for automatically classifying unknown foreign words that requires minimal monolingual resources and no bilingual training data (which is often difficult to obtain for an arbitrary language pair). We use a small set of phonologically based transliteration rules to generate a potentially unlimited amount of pseudo-data that can be used to train a classifier to distinguish etymological classes of actual words. We ran a series of experiments on identifying English loanwords in Korean in order to explore the consequences of using pseudo-data in place of the original training data. Results show that a sufficient quantity of automatically generated training data, even when produced by fairly low-precision transliteration rules, can be used to train a classifier that performs within 0.3% of one trained on actual English loanwords (96% accuracy).
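
The pseudo-data idea can be sketched as a rule-based rewriter that turns source-language words into labeled training examples. The rules below are hypothetical toy substitutions invented for this example; they are not the paper's transliteration rules, they operate on spelling rather than on a phonological representation, and the names `RULES`, `transliterate`, and `make_pseudo_data` are likewise assumptions.

```python
# Hypothetical toy rewrite rules (NOT the paper's rules), applied in order.
RULES = [("f", "p"), ("v", "b"), ("th", "s"), ("l", "r")]
VOWELS = set("aeiou")


def transliterate(word):
    """Apply the toy substitutions, then break up consonant clusters."""
    for source, target in RULES:
        word = word.replace(source, target)
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        following = word[i + 1] if i + 1 < len(word) else ""
        # Insert a filler vowel after a consonant not followed by a vowel.
        if ch not in VOWELS and following not in VOWELS:
            out.append("u")
    return "".join(out)


def make_pseudo_data(words, label):
    """Turn a word list into (pseudo-form, label) training pairs."""
    return [(transliterate(w), label) for w in words]


pseudo = make_pseudo_data(["film", "video"], "english")
print(pseudo)
```

Since any list of source-language words can be run through the rules, the amount of labeled pseudo-data is limited only by the size of the word list, which is the property the abstract relies on.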

Which Are the Best Features for Automatic Verb Classification
Jianguo Li | Chris Brew
Proceedings of ACL-08: HLT

2007

BLEUÂTRE: flattening syntactic dependencies for MT evaluation
Dennis N. Mehay | Chris Brew
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers

2006

Tagging Portuguese with a Spanish Tagger
Jirka Hana | Anna Feldman | Luiz Amaral | Chris Brew
Proceedings of the Cross-Language Knowledge Induction Workshop

A Finite-State Model of Human Sentence Processing
Jihyun Park | Chris Brew
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Parsing and Subcategorization Data
Jianguo Li | Chris Brew
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources
Anna Feldman | Jirka Hana | Chris Brew
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languages, for which such resources are likely to remain unavailable in the foreseeable future. We compare the performance of our system on languages that belong to different language families (Romance vs. Slavic), as well as on different language pairs within the same language family (Portuguese via Spanish vs. Catalan via Spanish). We show that across language families, the most difficult category is the category of nominals (noun homonymy is challenging for morphological analysis, and the order variation of adjectives within a sentence makes it challenging to create a reliable model), whereas different language families present different challenges with respect to their morpho-syntactic descriptions: for the Slavic languages, case is the most challenging category; for the Romance languages, gender is more challenging than case. In addition, we present an alternative evaluation metric for our system, in which we measure how much human labor will be needed to convert the result of our tagging into a high-precision annotated resource.