Md Mofijul Islam
2026
MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
Mahbub E Sobhani | Md. Faiyaz Abdullah Sayeedi | Tasnim Mohiuddin | Md Mofijul Islam | Swakkhar Shatabda
Findings of the Association for Computational Linguistics: EACL 2026
Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses 2,890 parallel Bangla-English gold standard artifacts, totaling ≈30K aligned question–answer pairs across thirteen languages, representing extensive coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution-synthesis capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), perturbed reasoning, and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs’ ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All code and data are available on GitHub: https://github.com/mahbubhimel/MathMist
DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Md Mofijul Islam | Md Sirajus Salekin | Nivedha Balakrishnan | Vincil C. Bishop III | Niharika Jain | Spencer Romo | Bob Strahan | Boyi Xie | Diego A. Socolinsky
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models’ ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets and evaluation code to facilitate future research in document packet processing.
IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation
Md Mofijul Islam | Md Sirajus Salekin | Joe King | Priyashree Roy | Vamsi Thilak Gudi | Spencer Romo | Akhil Nooney | Bob Strahan | Boyi Xie | Diego A. Socolinsky
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Understanding and extracting structured insights from unstructured documents remains a foundational challenge in industrial NLP. While Large Language Models (LLMs) enable zero-shot extraction, traditional pipelines often fail to handle multi-document packets, complex reasoning, and strict compliance requirements. We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging to segment complex document packets; (2) a configurable Extraction Module leveraging multimodal LLMs to transform unstructured content into structured data; (3) an Agentic Analytics Module, compliant with the Model Context Protocol (MCP), providing data access through secure, sandboxed code execution; and (4) a Rule Validation Module replacing deterministic engines with LLM-driven logic for complex compliance checks. The interactive demonstration enables users to upload document packets, visualize classification results, and explore extracted data through an intuitive web interface. We demonstrate effectiveness across industries, highlighting a production deployment at a leading healthcare provider achieving 98% classification accuracy, 80% lower processing latency, and 77% lower operational costs than legacy baselines. IDP Accelerator is open-sourced with a live demonstration available to the community.
2025
DM-Codec: Distilling Multimodal Representations for Speech Tokenization
Md Mubtasim Ahasan | Md Fahim | Tasnim Mohiuddin | Akmmahbubur Rahman | Aman Chadha | Tariq Iqbal | M Ashraful Amin | Md Mofijul Islam | Amin Ahsan Ali
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset.
2022
A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning
Md Mofijul Islam | Gustavo Aguilar | Pragaash Ponnusamy | Clint Solomon Mathialagan | Chengyuan Ma | Chenlei Guo
Proceedings of the 7th Workshop on Representation Learning for NLP
Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models’ ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in low-resource languages, leading models to produce suboptimal representations. Additionally, the dependency on a fixed vocabulary limits the subword models’ adaptability across languages and domains. In this work, we propose a vocabulary-free neural tokenizer by distilling segmentation information from heuristic-based subword tokenization. We pre-train our character-based tokenizer by processing unique words from a multilingual corpus, thereby extensively increasing word diversity across languages. Unlike the predefined and fixed vocabularies in subword methods, our tokenizer allows end-to-end task learning, resulting in optimal task-specific tokenization. The experimental results show that replacing the subword tokenizer with our neural tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks, with larger gains in low-resource languages. Additionally, our neural tokenizer exhibits robust performance on downstream tasks when adversarial noise is present (typos and misspellings), further increasing the initial improvements over statistical subword tokenizers.
Co-authors
- Muhammad Tasnim Mohiuddin 2
- Spencer Romo 2
- Md Sirajus Salekin 2
- Diego A. Socolinsky 2
- Bob Strahan 2
- Boyi Xie 2
- Gustavo Aguilar 1
- Md Mubtasim Ahasan 1
- Amin Ahsan Ali 1
- M Ashraful Amin 1
- Nivedha Balakrishnan 1
- Aman Chadha 1
- Mahbub E Sobhani 1
- Md Fahim 1
- Vamsi Thilak Gudi 1
- Chenlei Guo 1
- Vincil C. Bishop III 1
- Tariq Iqbal 1
- Niharika Jain 1
- Joe King 1
- Chengyuan Ma 1
- Clint Solomon Mathialagan 1
- Akhil Nooney 1
- Pragaash Ponnusamy 1
- Akmmahbubur Rahman 1
- Priyashree Roy 1
- Md. Faiyaz Abdullah Sayeedi 1
- Swakkhar Shatabda 1