Diego Roca


2026

Various encodings have been proposed to cast constituent parsing in terms of a sequence labeling task. However, unlike in the case of dependency parsing, existing comparisons have not been entirely homogeneous and, to the best of our knowledge, there is no systematic evaluation of these encodings under uniform configurations. A homogeneous evaluation needs to account for various aspects that could influence results, either by controlling for these aspects to ensure uniformity (e.g., network architecture, parameter settings, postprocessing of ill-formed output), or by systematically analyzing their impact (e.g., the impact of binary versus arbitrary structures). In this article, we: (1) compare different encodings comprehensively both theoretically and empirically, on a modern neural architecture and across nine languages, and (2) introduce new encodings and variants, including an encoding that our analysis finds particularly accurate and compact.

2023

We introduce an encoding for parsing as sequence labeling that can represent any projective dependency tree as a sequence of 4-bit labels, one per word. The bits in each word’s label represent (1) whether it is a right or left dependent, (2) whether it is the outermost (left/right) dependent of its parent, (3) whether it has any left children and (4) whether it has any right children. We show that this provides an injective mapping from trees to labels that can be encoded and decoded in linear time. We then define a 7-bit extension that represents an extra plane of arcs, extending the coverage to almost full non-projectivity (over 99.9% empirical arc coverage). Results on a set of diverse treebanks show that our 7-bit encoding obtains substantial accuracy gains over the previously best-performing sequence labeling encodings.
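
As an illustration of the four bits described above, here is a minimal sketch of the encoding direction only (tree to labels). It is not the authors' implementation. The function name `encode_4bit`, the bit order, the bit polarity (1 = right dependent), and the convention of a dummy root at position 0 (so the sentence root counts as a right dependent) are all assumptions made for this example. The sketch does not check the input for projectivity and does not implement decoding.

```python
from typing import List


def encode_4bit(heads: List[int]) -> List[str]:
    """Map a dependency tree to one 4-character bit string per word.

    heads[i - 1] is the head of word i (words are 1-based); 0 denotes an
    assumed dummy root placed to the left of the sentence.
    Assumed bit order: right-dependent, outermost, has-left-children,
    has-right-children.
    """
    leftmost = {}   # head -> its leftmost left dependent
    rightmost = {}  # head -> its rightmost right dependent
    for dep, head in enumerate(heads, start=1):
        if dep < head:
            leftmost[head] = min(leftmost.get(head, dep), dep)
        else:
            rightmost[head] = max(rightmost.get(head, dep), dep)

    labels = []
    for dep, head in enumerate(heads, start=1):
        is_right = dep > head  # the head lies to the left of the word
        if is_right:
            outermost = rightmost[head] == dep
        else:
            outermost = leftmost[head] == dep
        has_left = dep in leftmost    # the word has at least one left child
        has_right = dep in rightmost  # the word has at least one right child
        bits = (is_right, outermost, has_left, has_right)
        labels.append("".join("1" if b else "0" for b in bits))
    return labels


# "the cat sleeps": the -> cat, cat -> sleeps, sleeps -> dummy root
print(encode_4bit([2, 3, 0]))  # ['0100', '0110', '1110']
```

Both passes over the sentence take time proportional to its length, consistent with the linear-time encoding stated in the abstract.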