Beihong Jin


Improving Document Representations by Generating Pseudo Query Embeddings for Dense Retrieval
Hongyin Tang | Xingwu Sun | Beihong Jin | Jingang Wang | Fuzheng Zhang | Wei Wu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Recently, the retrieval models based on dense representations have been gradually applied in the first stage of the document retrieval tasks, showing better performance than traditional sparse vector space models. To obtain high efficiency, the basic structure of these models is Bi-encoder in most cases. However, this simple structure may cause serious information loss during the encoding of documents since the queries are agnostic. To address this problem, we design a method to mimic the queries to each of the documents by an iterative clustering process and represent the documents by multiple pseudo queries (i.e., the cluster centroids). To boost the retrieval process using approximate nearest neighbor search library, we also optimize the matching function with a two-step score calculation procedure. Experimental results on several popular ranking and QA datasets show that our model can achieve state-of-the-art results while still remaining high efficiency.

TITA: A Two-stage Interaction and Topic-Aware Text Matching Model
Xingwu Sun | Yanling Cui | Hongyin Tang | Qiuyu Zhu | Fuzheng Zhang | Beihong Jin
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this paper, we focus on the problem of keyword and document matching by considering different relevance levels. In our recommendation system, different people follow different hot keywords with interest. We need to attach documents to each keyword and then distribute the documents to people who follow these keywords. The ideal documents should have the same topic with the keyword, which we call topic-aware relevance. In other words, topic-aware relevance documents are better than partially-relevance ones in this application. However, previous tasks never define topic-aware relevance clearly. To tackle this problem, we define a three-level relevance in keyword-document matching task: topic-aware relevance, partially-relevance and irrelevance. To capture the relevance between the short keyword and the document at above-mentioned three levels, we should not only combine the latent topic of the document with its deep neural representation, but also model complex interactions between the keyword and the document. To this end, we propose a Two-stage Interaction and Topic-Aware text matching model (TITA). In terms of “topic-aware”, we introduce neural topic model to analyze the topic of the document and then use it to further encode the document. In terms of “two-stage interaction”, we propose two successive stages to model complex interactions between the keyword and the document. Extensive experiments reveal that TITA outperforms other well-designed baselines and shows excellent performance in our recommendation system.

Enhancing Document Ranking with Task-adaptive Training and Segmented Token Recovery Mechanism
Xingwu Sun | Yanling Cui | Hongyin Tang | Fuzheng Zhang | Beihong Jin | Shi Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose a new ranking model DR-BERT, which improves the Document Retrieval (DR) task by a task-adaptive training process and a Segmented Token Recovery Mechanism (STRM). In the task-adaptive training, we first pre-train DR-BERT to be domain-adaptive and then make the two-phase fine-tuning. In the first-phase fine-tuning, the model learns query-document matching patterns regarding different query types in a pointwise way. Next, in the second-phase fine-tuning, the model learns document-level ranking features and ranks documents with regard to a given query in a listwise manner. Such pointwise plus listwise fine-tuning enables the model to minimize errors in the document ranking by incorporating ranking-specific supervisions. Meanwhile, the model derived from pointwise fine-tuning is also used to reduce noise in the training data of the listwise fine-tuning. On the other hand, we present STRM which can compute OOV word representation and contextualization more precisely in BERT-based models. As an effective strategy in DR-BERT, STRM improves the matching perfromance of OOV words between a query and a document. Notably, our DR-BERT model keeps in the top three on the MS MARCO leaderboard since May 20, 2020.


A Topic Augmented Text Generation Model: Joint Learning of Semantics and Structural Features
Hongyin Tang | Miao Li | Beihong Jin
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Text generation is among the most fundamental tasks in natural language processing. In this paper, we propose a text generation model that learns semantics and structural features simultaneously. This model captures structural features by a sequential variational autoencoder component and leverages a topic modeling component based on Gaussian distribution to enhance the recognition of text semantics. To make the reconstructed text more coherent to the topics, the model further adapts the encoder of the topic modeling component for a discriminator. The results of experiments over several datasets demonstrate that our model outperforms several states of the art models in terms of text perplexity and topic coherence. Moreover, the latent representations learned by our model is superior to others in a text classification task. Finally, given the input texts, our model can generate meaningful texts which hold similar structures but under different topics.