Grounded Multimodal Named Entity Recognition (GMNER), which aims to extract textual entities, their types, and corresponding visual regions from image-text data, has become a critical task in multimodal information extraction. However, existing methods face two major challenges. First, they fail to address the semantic ambiguity caused by polysemy and the long-tail distribution of datasets. Second, unlike visual grounding, which provides descriptive phrases, entity grounding offers only brief entity names that carry less semantic information. Current methods lack sufficient semantic interaction between text and image, hindering accurate entity-visual region matching. To tackle these issues, we propose MAKAR, a Multi-Agent framework for Knowledge-Augmented Reasoning, comprising three agents: Knowledge Enhancement, Entity Correction, and Entity Reasoning Grounding. Specifically, in the named entity recognition phase, the Knowledge Enhancement Agent leverages a Multimodal Large Language Model (MLLM) as an implicit knowledge base to enhance ambiguous image-text content with its internal knowledge. For samples with low-confidence entity boundaries and types, the Entity Correction Agent uses web search tools to retrieve and summarize relevant web content, thereby correcting entities using both internal and external knowledge. In the entity grounding phase, the Entity Reasoning Grounding Agent utilizes multi-step Chain-of-Thought reasoning to perform grounding for each entity. Extensive experiments show that MAKAR achieves state-of-the-art performance on two benchmark datasets. Code is available at: https://github.com/Nikol-coder/MAKAR.
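To make the three-agent pipeline concrete, the following is a minimal Python sketch of how such a flow could be wired together. The function names, prompts, 0.5 confidence threshold, placeholder NER step, and placeholder box parsing are illustrative assumptions, not the released MAKAR implementation (see the repository above for the authors' code).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical stand-ins for the three MAKAR agents.
MLLM = Callable[[str], str]        # prompt -> multimodal-LLM response
WebSearch = Callable[[str], str]   # query  -> summarised web evidence

@dataclass
class Entity:
    name: str
    etype: str
    confidence: float
    region: Optional[Tuple[int, int, int, int]] = None  # grounded box (x1, y1, x2, y2)

def knowledge_enhancement_agent(text: str, mllm: MLLM) -> str:
    """Enrich ambiguous image-text content with the MLLM's internal knowledge."""
    return text + " | context: " + mllm(f"Clarify ambiguous mentions in: {text}")

def recognize_entities(enhanced_text: str) -> List[Entity]:
    """Placeholder NER over the knowledge-enhanced text (a real model goes here)."""
    return [Entity(name="Jordan", etype="PER", confidence=0.3)]

def entity_correction_agent(ent: Entity, search: WebSearch, mllm: MLLM,
                            threshold: float = 0.5) -> Entity:
    """Correct low-confidence entity boundaries/types with retrieved web evidence."""
    if ent.confidence >= threshold:
        return ent
    evidence = search(ent.name)
    corrected_type = mllm(f"Given evidence {evidence}, assign a type to {ent.name}")
    return Entity(ent.name, corrected_type, confidence=1.0)

def entity_reasoning_grounding_agent(ent: Entity, image: str, mllm: MLLM) -> Entity:
    """Multi-step chain-of-thought grounding: reason first, then request a box."""
    rationale = mllm(f"Step 1: describe how {ent.name} might appear in {image}")
    _box_answer = mllm(f"Step 2: using {rationale}, give a bounding box for {ent.name}")
    ent.region = (0, 0, 10, 10)  # placeholder: parse _box_answer in practice
    return ent

def makar_pipeline(text: str, image: str, mllm: MLLM, search: WebSearch) -> List[Entity]:
    enhanced = knowledge_enhancement_agent(text, mllm)
    entities = recognize_entities(enhanced)
    corrected = [entity_correction_agent(e, search, mllm) for e in entities]
    return [entity_reasoning_grounding_agent(e, image, mllm) for e in corrected]

# Toy run with dummy tools so the sketch executes end to end.
dummy_mllm: MLLM = lambda prompt: f"<response to: {prompt[:30]}...>"
dummy_search: WebSearch = lambda query: f"<web summary for {query}>"
print(makar_pipeline("Jordan scores again", "game.jpg", dummy_mllm, dummy_search))
```

The key design point the sketch preserves is that external web evidence is consulted only when the recognizer's confidence falls below a threshold, while every entity passes through the reasoning-then-grounding step.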
Large language models (LLMs) have achieved remarkable results across various language tasks. To enhance the performance of LLMs in scientific literature services, we developed the scientific literature LLM (SciLit-LLM) through pre-training and supervised fine-tuning on scientific literature, building upon the iFLYTEK Spark LLM. Furthermore, we present a knowledge service system, Spark Research Assistant (SparkRA), based on our SciLit-LLM. SparkRA is accessible online and provides three primary functions: literature investigation, paper reading, and academic writing. As of July 30, 2024, SparkRA has garnered over 50,000 registered users, with a total usage count exceeding 1.3 million.
Recent years have witnessed increasing interest in prompt-based learning, in which models can be trained on only a few annotated instances, making it suitable for low-resource settings. This is even more challenging in fine-grained classification, as pre-trained language models tend to generate similar output embeddings that are difficult for a prompt-based classifier to discriminate. In this work, we alleviate this information diffusion issue by proposing a calibration method based on a transformation that rotates the embedding features into a new metric space, where we adapt the ratio of each dimension toward a uniform distribution to guarantee the distinguishability of the learned embeddings. Furthermore, we take advantage of hyperbolic embeddings to capture the relations between dimensions through a coarse-to-fine metric learning strategy, enhancing interpretability. Extensive experiments on three datasets under various settings demonstrate the effectiveness of our approach.
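The abstract does not specify the transformation itself; as a rough illustration, here is a minimal NumPy sketch of one way a rotation-plus-rescaling calibration could look, assuming a PCA-style rotation followed by per-dimension variance equalisation. The function name and toy data are hypothetical, and the paper's learned rotation and hyperbolic coarse-to-fine metric learning are not reproduced here.

```python
import numpy as np

def calibrate_embeddings(emb: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Rotate embeddings into their principal axes and rescale each dimension
    so that all dimensions contribute comparably (a uniform variance ratio)."""
    centered = emb - emb.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)          # feature-space covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # orthogonal rotation + spreads
    rotated = centered @ eigvecs                  # rotate into the new metric space
    return rotated / np.sqrt(eigvals + eps)       # equalise per-dimension spread

# Toy usage: 200 "sentence embeddings" of dimension 16 with skewed spreads.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16)) * np.linspace(0.1, 5.0, 16)
calibrated = calibrate_embeddings(embeddings)
print(calibrated.std(axis=0))  # roughly uniform across dimensions after calibration
```

Before calibration, a few dominant directions carry most of the variance; after the rotation and rescaling, every dimension contributes on a comparable scale, which is the distinguishability property the abstract argues a prompt-based classifier needs.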