An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact

Avanija Menon, Ovidiu Serban


Abstract
The European Union Deforestation Regulation (EUDR) requires companies to prove their products do not contribute to deforestation, creating a critical demand for precise, asset-level environmental impact data. Current databases lack the necessary detail, relying heavily on broad financial metrics and manual data collection, which limits regulatory compliance and accurate environmental modeling. This study presents an automated, end-to-end data extraction pipeline that uses LLMs to create, clean, and validate structured databases, specifically targeting sectors with a high risk of deforestation. The pipeline introduces Instructional, Role-Based, Zero-Shot Chain-of-Thought (IRZ-CoT) prompting to enhance data extraction accuracy and a Retrieval-Augmented Validation (RAV) process that integrates real-time web searches for improved data reliability. Applied to SEC EDGAR filings in the Mining, Oil & Gas, and Utilities sectors, the pipeline demonstrates significant improvements over traditional zero-shot prompting approaches, particularly in extraction accuracy and validation coverage. This work advances NLP-driven automation for regulatory compliance, CSR (Corporate Social Responsibility), and ESG, with broad sectoral applicability.
Anthology ID:
2025.climatenlp-1.10
Volume:
Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025)
Month:
July
Year:
2025
Address:
Bangkok, Thailand
Editors:
Kalyan Dutia, Peter Henderson, Markus Leippold, Christopher Manning, Gaku Morio, Veruska Muccione, Jingwei Ni, Tobias Schimanski, Dominik Stammbach, Alok Singh, Alba (Ruiran) Su, Saeid A. Vaghefi
Venues:
ClimateNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
142–167
URL:
https://preview.aclanthology.org/landing_page/2025.climatenlp-1.10/
Cite (ACL):
Avanija Menon and Ovidiu Serban. 2025. An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact. In Proceedings of the 2nd Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2025), pages 142–167, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
An Automated LLM-based Pipeline for Asset-Level Database Creation to Assess Deforestation Impact (Menon & Serban, ClimateNLP 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.climatenlp-1.10.pdf