Ye Ma
Large language models (LLMs) develop an in-context learning capability through pretraining and instruction tuning, enabling task adaptation without parameter updates. Self-refinement is one manifestation of this capability: it allows LLMs to iteratively refine their output using self-generated feedback. However, empirical observations reveal Inference-Free Self-Refinement (IFSR) in preference alignment: LLMs generate preference-improved output from fixed instructions, requiring no specific feedback or even an initial response. IFSR in preference alignment has two key components. The refining instruction is a fixed instruction that constrains the output distribution from a preference-semantic perspective. During training, it facilitates joint learning of preference-related semantic representations and data distribution alignment. The pseudo reference response is constructed from paired preference data and serves as a demonstration to guide the output distribution. It mitigates off-policy distributional bias while enhancing token-level preference learning during training. Experiments across multiple datasets demonstrate that incorporating IFSR into preference alignment yields performance improvements of over 10%. Further ablation studies reveal additional characteristics and potential principles of IFSR.
Recent years have witnessed impressive progress in Neural Dependency Parsing. According to how they factorize the joint probability of the graph, existing parsers can be roughly divided into autoregressive and non-autoregressive ones. The former factorize the graph into multiple sequentially dependent components, so that it can be built up component by component. The latter assume these components to be independent, so that they can be output in a one-shot manner. However, when treating each directed edge as an explicit dependency relationship, we discover that the dependency graph contains a mixture of independent and interdependent components, which means that neither kind of model precisely captures the explicit dependencies among nodes and edges. Based on this property, we design a Semi-Autoregressive Dependency Parser that generates dependency graphs by adding node groups and edge groups autoregressively while emitting all elements of a group in parallel. The model strikes a trade-off between non-autoregression, which suffers from a lack of target inter-dependencies, and autoregression, which suffers from the uncertainty of graph generation orders. Experiments show that the proposed parser outperforms strong baselines on Enhanced Universal Dependencies in multiple languages, in particular achieving an average improvement of 4% in graph-level accuracy. The performance of model variants also shows the importance of specific components.
Building on WikiSum (CITATION), which enables Neural Multi-Document Summarization (MDS) models to learn from a large-scale dataset, this study develops two hierarchical Transformers (HT) that describe both cross-token and cross-document dependencies while allowing longer input documents. By incorporating word- and paragraph-level multi-head attention in the decoder, based on parallel and vertical architectures, the proposed parallel and vertical hierarchical Transformers (PHT & VHT) generate summaries using context-aware word embeddings together with static and dynamic paragraph embeddings, respectively. A comprehensive evaluation is conducted on WikiSum to compare PHT & VHT with established models and to answer the question of whether hierarchical structures perform better than flat structures in the MDS task. The results suggest that our hierarchical models generate summaries of higher quality by better capturing cross-document relationships, and use less memory than flat-structure models. Moreover, we recommend PHT given its practical advantages of higher inference speed and greater memory savings.
With the development of NLP technologies, news can be automatically categorized and labeled according to a variety of characteristics, and at the same time be represented as low-dimensional embeddings. However, there is no systematic approach that effectively integrates the inherited features and inter-textual knowledge of news to represent the collective information as a dense vector. To fill this gap, the News2vec model is proposed to produce distributed representations of news that take its associated features into account. To describe the cross-document linkages between news items, a network consisting of news and its attributes is constructed. Moreover, the News2vec model treats each news node as a bag of features by developing the Subnode model. Based on the biased random walk and the skip-gram model, each news feature is mapped to a vector, and the news is thus represented as the sum of its features. This approach offers an easy way to create embeddings for unseen news nodes based on their attributes. To evaluate our model, dimension reduction plots and correlation heat-maps are created to visualize the news vectors, and the vectors are applied to two downstream tasks, stock movement prediction and news recommendation. By comparing with other established text/sentence embedding models, we show that News2vec achieves state-of-the-art performance on these news-related tasks.
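The "sum of its features" representation described in the abstract above can be illustrated with a minimal sketch. The feature names, the toy three-dimensional vectors, and the function name `news_vector` are hypothetical and used only for illustration; in the paper the feature vectors come from biased random walks and skip-gram training, which this sketch does not reproduce. Skipping features without a learned vector is also an assumption, not something the abstract states.

```python
from typing import Dict, List, Sequence


def news_vector(
    features: Sequence[str],
    feature_vectors: Dict[str, List[float]],
) -> List[float]:
    """Represent one news item as the element-wise sum of its feature vectors.

    Features with no learned vector are skipped (an assumption of this sketch).
    """
    known = [feature_vectors[f] for f in features if f in feature_vectors]
    if not known:
        raise ValueError("none of the features has a learned vector")
    total = [0.0] * len(known[0])
    for vec in known:
        for i, value in enumerate(vec):
            total[i] += value
    return total


# Toy feature vectors standing in for trained embeddings (hypothetical values).
feature_vectors = {
    "topic:finance": [1.0, 0.0, 0.5],
    "entity:ACME": [0.0, 2.0, 0.5],
    "sentiment:negative": [-1.0, 0.0, 1.0],
}

# An unseen news item is embedded from its attributes alone.
unseen_news = ["topic:finance", "entity:ACME", "sentiment:negative"]
print(news_vector(unseen_news, feature_vectors))
```

Because the news vector depends only on feature vectors, a news item that was absent from the training network can still be embedded as long as its attributes were seen.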
In this paper, we describe our hierarchical ensemble system designed for SemEval-2019 Task 3, EmoContext. In our system, three sets of classifiers are trained for different sub-targets, and the predicted labels of these base classifiers are combined through three steps of voting to make the final prediction. Details that proved effective in developing the base classifiers are highlighted.
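The abstract above does not spell out the three voting steps, so the following is only a generic sketch of hierarchical majority voting, assuming that labels are first voted on within each group of base classifiers and the group winners are then voted on again. The function names, the tie-breaking rule, and the example predictions are hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label that appears first."""
    if not labels:
        raise ValueError("cannot vote on an empty list of labels")
    counts = Counter(labels)
    best = max(counts.values())
    return next(label for label in labels if counts[label] == best)


def hierarchical_vote(groups: Sequence[Sequence[str]]) -> str:
    """Vote inside each classifier group, then vote over the group winners."""
    group_winners = [majority_vote(group) for group in groups]
    return majority_vote(group_winners)


# Hypothetical predictions for one utterance from three groups of classifiers.
predictions = [
    ["happy", "happy", "others"],
    ["sad", "happy", "happy"],
    ["others", "others", "sad"],
]
print(hierarchical_vote(predictions))
```

Voting in stages lets each group settle its own disagreement before the groups are compared, so a single large group cannot dominate the final label by sheer number of members.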