Julian Brooke

2018

Proceedings of the Second Workshop on Stylistic Variation
Julian Brooke | Lucie Flekova | Moshe Koppel | Thamar Solorio
Proceedings of the Second Workshop on Stylistic Variation

Cross-corpus Native Language Identification via Statistical Embedding
Francisco Rangel | Paolo Rosso | Julian Brooke | Alexandra Uitdenbogerd
Proceedings of the Second Workshop on Stylistic Variation

In this paper, we approach the task of native language identification in a realistic cross-corpus scenario where a model is trained with available data and has to predict the native language from data of a different corpus. The motivation behind this study is to investigate native language identification in the Australian academic scenario where a majority of students come from China, Indonesia, and Arabic-speaking nations. We propose a statistical embedding representation that yields a significant improvement over common state-of-the-art single-layer approaches when identifying Chinese, Arabic, and Indonesian in a cross-corpus scenario. The proposed approach remains competitive even when the data is scarce and imbalanced.

Deep-speare: A joint neural model of poetic language, meter and rhyme
Jey Han Lau | Trevor Cohn | Timothy Baldwin | Julian Brooke | Adam Hammond
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we propose a joint architecture that captures language, rhyme and meter for sonnet modelling. We assess the quality of generated poems using crowd and expert judgements. The stress and rhyme models perform very well, as generated poems are largely indistinguishable from human-written poems. Expert evaluation, however, reveals that a vanilla language model captures meter implicitly, and that machine-generated poems still underperform in terms of readability and emotion. Our research shows the importance of expert evaluation for poetry generation, and that future research should look beyond rhyme/meter and focus on poetic language.

2017

Joint Sentence-Document Model for Manifesto Text Analysis
Shivashankar Subramanian | Trevor Cohn | Timothy Baldwin | Julian Brooke
Proceedings of the Australasian Language Technology Association Workshop 2017

Unsupervised Acquisition of Comprehensive Multiword Lexicons using Competition in an n-gram Lattice
Julian Brooke | Jan Šnajder | Timothy Baldwin
Transactions of the Association for Computational Linguistics, Volume 5

We present a new model for acquiring comprehensive multiword lexicons from large corpora based on competition among n-gram candidates. In contrast to the standard approach of simple ranking by association measure, in our model n-grams are arranged in a lattice structure based on subsumption and overlap relationships, with nodes inhibiting other nodes in their vicinity when they are selected as a lexical item. We show how the configuration of such a lattice can be optimized tractably, and demonstrate using annotations of sampled n-grams that our method consistently outperforms alternatives by at least 0.05 F-score across several corpora and languages.
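
For orientation, the "simple ranking by association measure" that the abstract contrasts itself with can be sketched as below. This is a minimal illustration of that baseline only, not the paper's lattice model. The choice of pointwise mutual information (PMI) as the measure, the function name `rank_bigrams_by_pmi`, the `min_count` threshold, and the toy sentence are all assumptions made for this example.

```python
import math
from collections import Counter

def rank_bigrams_by_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Illustrative baseline only: the association measure and the frequency
    threshold are assumptions, not taken from the paper.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))
    # Highest-scoring candidates first
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy input, invented for this example
tokens = "new york is big and new york is old and the city is big".split()
ranking = rank_bigrams_by_pmi(tokens)
```

Each candidate here is scored independently of the others, so overlapping or nested n-grams can all rank highly at once; the abstract describes the lattice model as addressing this by letting selected nodes inhibit their neighbours.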

Semi-Automated Resolution of Inconsistency for a Harmonized Multiword Expression and Dependency Parse Annotation
King Chan | Julian Brooke | Timothy Baldwin
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)

This paper presents a methodology for identifying and resolving various kinds of inconsistency in the context of merging dependency and multiword expression (MWE) annotations, to generate a dependency treebank with comprehensive MWE annotations. Candidates for correction are identified using a variety of heuristics, including an entirely novel one which identifies violations of MWE constituency in the dependency tree, and resolved by arbitration with minimal human intervention. Using this technique, we identified and corrected several hundred errors across both parse and MWE annotations, representing changes to a significant percentage (well over 10%) of the MWE instances in the joint corpus.

Sub-character Neural Language Modelling in Japanese
Viet Nguyen | Julian Brooke | Timothy Baldwin
Proceedings of the First Workshop on Subword and Character Level Models in NLP

In East Asian languages such as Japanese and Chinese, the semantics of a character are (somewhat) reflected in its sub-character elements. This paper examines the effect of using sub-characters for language modelling in Japanese. This is achieved by decomposing characters according to a range of character decomposition datasets, and training a neural language model over variously decomposed character representations. Our results indicate that language modelling can be improved through the inclusion of sub-characters, though this result depends on a good choice of decomposition dataset and the appropriate granularity of decomposition.
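
The decomposition step described in the abstract might look roughly like the following sketch. The three-entry `DECOMPOSITION` table, the `decompose` function, and the `depth` parameter (standing in for the granularity of decomposition) are hypothetical and chosen for illustration; they do not reproduce any of the decomposition datasets used in the paper.

```python
# Hypothetical toy table; real decomposition datasets are far larger
DECOMPOSITION = {
    "明": ["日", "月"],
    "林": ["木", "木"],
    "森": ["木", "林"],
}

def decompose(text, table, depth=1):
    """Replace each character by its components, up to `depth` levels deep.

    Characters missing from the table are kept unchanged.
    """
    units = list(text)
    for _ in range(depth):
        expanded = []
        for unit in units:
            expanded.extend(table.get(unit, [unit]))
        units = expanded
    return units

shallow = decompose("森明", DECOMPOSITION, depth=1)  # one level of decomposition
deep = decompose("森明", DECOMPOSITION, depth=2)     # finer granularity
```

The resulting unit sequences would then serve as input tokens to a language model in place of whole characters; how the paper combines them with the original characters is not specified in the abstract.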

Proceedings of the Workshop on Stylistic Variation
Julian Brooke | Thamar Solorio | Moshe Koppel
Proceedings of the Workshop on Stylistic Variation

2016

Melbourne at SemEval 2016 Task 11: Classifying Type-level Word Complexity using Random Forests with Corpus and Word List Features
Julian Brooke | Alexandra Uitdenbogerd | Timothy Baldwin
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Bootstrapped Text-level Named Entity Recognition for Literature
Julian Brooke | Adam Hammond | Timothy Baldwin
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

GutenTag: an NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus
Julian Brooke | Adam Hammond | Graeme Hirst
Proceedings of the Fourth Workshop on Computational Linguistics for Literature

Distinguishing Voices in The Waste Land using Computational Stylistics
Julian Brooke | Adam Hammond | Graeme Hirst
Linguistic Issues in Language Technology, Volume 12, 2015 - Literature Lifts up Computational Linguistics

T. S. Eliot’s poem The Waste Land is a notoriously challenging example of modernist poetry, mixing the independent viewpoints of over ten distinct characters without any clear demarcation of which voice is speaking when. In this work, we apply unsupervised techniques in computational stylistics to distinguish the particular styles of these voices, offering a computer’s perspective on longstanding debates in literary analysis. Our work includes a model for stylistic segmentation that looks for points of maximum stylistic variation, a k-means clustering model for detecting non-contiguous speech from the same voice, and a stylistic profiling approach which makes use of lexical resources built from a much larger collection of literary texts. Evaluating using an expert interpretation, we show clear progress in distinguishing the voices of The Waste Land as compared to appropriate baselines, and we also offer quantitative evidence both for and against that particular interpretation.

The task of native language (L1) identification suffers from a relative paucity of useful training corpora, and standard within-corpus evaluation is often problematic due to topic bias. In this paper, we introduce a method for L1 identification in second language (L2) texts that relies only on much more plentiful L1 data, rather than the L2 texts that are traditionally used for training. In particular, we do word-by-word translation of large L1 blog corpora to create a mapping to L2 forms that are a possible result of language transfer, and then use that information for unsupervised classification. We show this method is effective in several different learner corpora, with bigram features being particularly useful.

Robust, Lexicalized Native Language Identification
Julian Brooke | Graeme Hirst
Proceedings of COLING 2012

Building Readability Lexicons with Unannotated Corpora
Julian Brooke | Vivian Tsang | David Jacob | Fraser Shein | Graeme Hirst
Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations

Unsupervised Stylistic Segmentation of Poetry with Change Curves and Extrinsic Features
Julian Brooke | Adam Hammond | Graeme Hirst
Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature