Li Yuan

2024

Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most of the state-of-the-art TVR methods learn image-to-video transfer learning based on the large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation cost. To this end, we propose to conduct efficient text-video Retrieval with a salient-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics including temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from frozen CLIP backbone, which accentuates silent frames within the video features while alleviating temporal redundancy. Besides, we introduce an asynchronous self-attention mechanism which firstly selects top responsive visual patch and augments the correlation modeling between them with learnable temporal and patch offsets. Extensive experiments on four TVR datasets demonstrate that our RAP achieves superior or comparable performance compared to the fully fine-tuned counterpart and other parameter-efficient finetuning methods.

pdf abs
A Logical Pattern Memory Pre-trained Model for Entailment Tree Generation
Li Yuan | Yi Cai | Haopeng Ren | Jiexin Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Generating coherent and credible explanations remains a significant challenge in the field of AI. In recent years, researchers have delved into the utilization of entailment trees to depict explanations, which exhibit a reasoning process of how a hypothesis is deduced from the supporting facts. However, existing models often overlook the importance of generating intermediate conclusions with logical consistency from the given facts, leading to inaccurate conclusions and undermining the overall credibility of entailment trees. To address this limitation, we propose the logical pattern memory pre-trained model (LMPM). LMPM incorporates an external memory structure to learn and store the latent representations of logical patterns, which aids in generating logically consistent conclusions. Furthermore, to mitigate the influence of logically irrelevant domain knowledge in the Wikipedia-based data, we introduce an entity abstraction approach to construct the dataset for pre-training LMPM. The experimental results highlight the effectiveness of our approach in improving the quality of entailment tree generation. By leveraging logical entailment patterns, our model produces more coherent and reasonable conclusions that closely align with the underlying premises.

pdf abs
Grounded Multimodal Procedural Entity Recognition for Procedural Documents: A New Dataset and Baseline
Haopeng Ren | Yushi Zeng | Yi Cai | Zhenqi Ye | Li Yuan | Pinli Zhu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Much of commonsense knowledge in real world is the form of procudures or sequences of steps to achieve particular goals. In recent years, knowledge extraction on procedural documents has attracted considerable attention. However, they often focus on procedural text but ignore a common multimodal scenario in the real world. Images and text can complement each other semantically, alleviating the semantic ambiguity suffered in text-only modality. Motivated by these, in this paper, we explore a problem of grounded multimodal procedural entity recognition (GMPER), aiming to detect the entity and the corresponding bounding box groundings in image (i.e., visual entities). A new dataset (Wiki-GMPER) is bult and extensive experiments are conducted to evaluate the effectiveness of our proposed model.

2020

pdf abs
Graph Attention Network with Memory Fusion for Aspect-level Sentiment Analysis
Li Yuan | Jin Wang | Liang-Chih Yu | Xuejie Zhang
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Aspect-level sentiment analysis(ASC) predicts each specific aspect term’s sentiment polarity in a given text or review. Recent studies used attention-based methods that can effectively improve the performance of aspect-level sentiment analysis. These methods ignored the syntactic relationship between the aspect and its corresponding context words, leading the model to focus on syntactically unrelated words mistakenly. One proposed solution, the graph convolutional network (GCN), cannot completely avoid the problem. While it does incorporate useful information about syntax, it assigns equal weight to all the edges between connected words. It may still incorrectly associate unrelated words to the target aspect through the iterations of graph convolutional propagation. In this study, a graph attention network with memory fusion is proposed to extend GCN’s idea by assigning different weights to edges. Syntactic constraints can be imposed to block the graph convolutional propagation of unrelated words. A convolutional layer and a memory fusion were applied to learn and exploit multiword relations and draw different weights of words to improve performance further. Experimental results on five datasets show that the proposed method yields better performance than existing methods.

pdf abs
YNU-HPCC at SemEval-2020 Task 8: Using a Parallel-Channel Model for Memotion Analysis
Li Yuan | Jin Wang | Xuejie Zhang
Proceedings of the Fourteenth Workshop on Semantic Evaluation

this paper proposed a parallel-channel model to process the textual and visual information in memes and then analyze the sentiment polarity of memes. In the shared task of identifying and categorizing memes, we preprocess the dataset according to the language behaviors on social media. Then, we adapt and fine-tune the Bidirectional Encoder Representations from Transformers (BERT), and two types of convolutional neural network models (CNNs) were used to extract the features from the pictures. We applied an ensemble model that combined the BiLSTM, BIGRU, and Attention models to perform cross domain suggestion mining. The officially released results show that our system performs better than the baseline algorithm

Co-authors

Ge Li 1

Li Yuan

2024

2020

Co-authors

Venues