Simon Petitjean

2022

This paper describes the first release of RRGparbank, a multilingual parallel treebank for Role and Reference Grammar (RRG) containing annotations of George Orwell’s novel 1984 and its translations. The release comprises the entire novel for English and a constructionally diverse and highly parallel sample (“seed”) for German, French and Russian. The paper gives an overview of annotation decisions that have been taken and describes the adopted treebanking methodology. Finally, as a possible application, a multilingual parser is trained on the treebank data. RRGparbank is one of the first resources to apply RRG to large amounts of real-world data. Furthermore, it enables comparative and typological corpus studies in RRG. And, finally, it creates new possibilities of data-driven NLP applications based on RRG.

This paper considers the task of parsing low-resource languages in a scenario where parallel English data and also a limited seed of annotated sentences in the target language are available, as for example in bootstrapping parallel treebanks. We focus on constituency parsing using Role and Reference Grammar (RRG), a theory that has so far been understudied in computational linguistics but that is widely used in typological research, i.e., in particular in the context of low-resource languages. Starting from an existing RRG parser, we propose two strategies for low-resource parsing: first, we extend the parsing model into a cross-lingual parser, exploiting the parallel data in the high-resource language and unsupervised word alignments by providing internal states of the source-language parser to the target-language parser. Second, we adopt self-training, thereby iteratively expanding the training data, starting from the seed, by including the most confident new parses in each round. Both in simulated scenarios and with a real low-resource language (Daakaka), we find substantial and complementary improvements from both self-training and cross-lingual parsing. Moreover, we also experimented with using gloss embeddings in addition to token embeddings in the target language, and this also improves results. Finally, starting from what we have for Daakaka, we also consider parsing a related language (Dalkalaen) where glosses and English translations are available but no annotated trees at all, i.e., a no-resource scenario wrt. syntactic annotations. We start with cross-lingual parser trained on Daakaka with glosses and use self-training to adapt it to Dalkalaen. The results are surprisingly good.

Complex predicates formed of a semantically ‘light’ verbal head and a noun or verb which contributes the major part of the meaning are frequently referred to as ‘light verb constructions’ (LVCs). In the paper, we present a case study of LVCs with the German posture verb stehen ‘stand’. In our account, we model the syntactic as well as semantic composition of such LVCs by combining Lexicalized Tree Adjoining Grammar (LTAG) with frames. Starting from the analysis of the literal uses of posture verbs, we show how the meaning components of the literal uses are systematically exploited in the interpretation of stehen-LVCs. The paper constitutes an important step towards a compositional and computational analysis of LVCs. We show that LTAG allows us to separate constructional from lexical meaning components and that frames enable elegant generalizations over event types and related constraints.

2018

pdf
A Parser for LTAG and Frame Semantics
David Arps | Simon Petitjean
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf
Multi-tape Computing with Synchronous Relations
Christian Wurm | Simon Petitjean
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)

pdf abs
Describing derivational polysemy with XMG
Marios Andreou | Simon Petitjean
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

In this paper, we model and test the monosemy and polysemy approaches to derivational multiplicity of meaning, using Frame Semantics and XMG. In order to illustrate our claims and proposals, we use data from deverbal nominalizations with the suffix -al on verbs of change of possession (e.g. rental, disbursal). In our XMG implementation, we show that the underspecified meaning of affixes cannot always be reduced to a single unitary meaning and that the polysemy approach to multiplicity of meaning is more judicious compared to the monosemy approach. We also introduce constraints on the potential referents of derivatives. These constraints have the form of type constraints and specify which arguments in the frame of the verbal base are compatible with the referential argument of the derivative. The introduction of type constraints rules out certain readings because frame unification only succeeds if types are compatible.

2016

pdf
Argument linking in LTAG: A constraint-based implementation with XMG
Laura Kallmeyer | Timm Lichte | Rainer Osswald | Simon Petitjean
Proceedings of the 12th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+12)

2015

pdf abs
Une métagrammaire de l’interface morpho-sémantique dans les verbes en arabe
Simon Petitjean | Younes Samih | Timm Lichte
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Dans cet article, nous présentons une modélisation de la morphologie dérivationnelle de l’arabe utilisant le cadre métagrammatical offert par XMG. Nous démontrons que l’utilisation de racines et patrons abstraits comme morphèmes atomiques sous-spécifiés offre une manière élégante de traiter l’interaction entre morphologie et sémantique.