Menachem Brief


2025

SECQUE: A Benchmark for Evaluating Real-World Financial Analysis Capabilities
Noga BenYoash | Menachem Brief | Oded Ovadia | Gil Shenderovitz | Moshik Mishaeli | Rachel Lemberg | Eitam Sheetrit
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

We introduce SECQUE, a comprehensive benchmark for evaluating large language models (LLMs) in financial analysis tasks. SECQUE comprises 565 expert-written questions covering SEC filings analysis across four key categories: comparison analysis, ratio calculation, risk assessment, and financial insight generation. To assess model performance, we develop SECQUE-Judge, an evaluation mechanism leveraging multiple LLM-based judges, which demonstrates strong alignment with human evaluations. Additionally, we provide an extensive analysis of various models’ performance on our benchmark. By making SECQUE publicly available (https://huggingface.co/datasets/nogabenyoash/SecQue), we aim to facilitate further research and advancements in financial AI.
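The abstract describes SECQUE-Judge as an evaluation mechanism built on multiple LLM-based judges. The paper's actual prompts, judge models, and aggregation are not reproduced here; the following is a minimal sketch of a multi-judge scoring loop under assumed choices (an OpenAI-compatible client, two hypothetical judge models, a 0-to-1 scoring prompt, and a simple average as the aggregation rule).

# Minimal multi-judge evaluation sketch in the spirit of SECQUE-Judge.
# The judge models, prompt, and averaging rule are illustrative assumptions,
# not the paper's actual implementation.
from statistics import mean

from openai import OpenAI  # any OpenAI-compatible client works here

client = OpenAI()
JUDGE_MODELS = ["gpt-4o", "gpt-4o-mini"]  # hypothetical judge ensemble

JUDGE_PROMPT = (
    "You are grading an answer to a question about an SEC filing.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with a single score from 0 (wrong) to 1 (fully correct)."
)

def judge_score(question: str, reference: str, candidate: str) -> float:
    """Ask each judge model for a score and average the results."""
    scores = []
    for model in JUDGE_MODELS:
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(
                    question=question, reference=reference, candidate=candidate
                ),
            }],
            temperature=0,
        )
        scores.append(float(response.choices[0].message.content.strip()))
    return mean(scores)

In practice the individual judge scores can also be kept separately to measure agreement with human annotators, which is how the abstract frames the validation of the judging mechanism.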

2024

Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
Oded Ovadia | Menachem Brief | Moshik Mishaeli | Oren Elisha
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) encapsulate a vast amount of factual information within their pre-trained weights, as evidenced by their ability to answer diverse questions across different domains. However, this knowledge is inherently limited, relying heavily on the characteristics of the training data. Consequently, using external datasets to incorporate new information or refine the capabilities of LLMs on previously seen information poses a significant challenge. In this study, we compare two common approaches: unsupervised fine-tuning and retrieval-augmented generation (RAG). We evaluate both approaches on a variety of knowledge-intensive tasks across different topics. Our findings reveal that while unsupervised fine-tuning offers some improvement, RAG consistently outperforms it, both for existing knowledge encountered during training and entirely new knowledge. Moreover, we find that LLMs struggle to learn new factual information through unsupervised fine-tuning, and that exposing them to numerous variations of the same fact during training could alleviate this problem.
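The abstract contrasts unsupervised fine-tuning with retrieval-augmented generation (RAG) as ways to inject knowledge into an LLM. Below is a minimal sketch of the RAG side only: embed a corpus, retrieve the top-k passages for a question, and prepend them to the prompt. The embedding model, toy corpus, and prompt format are illustrative assumptions, not the evaluation setup used in the paper.

# Minimal RAG retrieval sketch: similarity search over a small corpus,
# followed by prompt augmentation. All concrete choices here are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

corpus = [
    "The Eiffel Tower was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
    "The Python language was first released in 1991.",
]
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

def build_rag_prompt(question: str, k: int = 2) -> str:
    """Retrieve the k most similar passages and build an augmented prompt."""
    q_emb = encoder.encode([question], normalize_embeddings=True)
    scores = corpus_emb @ q_emb[0]          # cosine similarity (vectors are normalized)
    top_k = np.argsort(scores)[::-1][:k]
    context = "\n".join(corpus[i] for i in top_k)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_rag_prompt("When was the Eiffel Tower finished?"))

The fine-tuning baseline in the comparison would instead train on the corpus directly (e.g., with a standard causal language modeling objective) and answer without retrieved context; the abstract reports that this route struggles to absorb new facts unless the same fact appears in many paraphrased variations during training.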