Shafqat Mumtaz Virk

2021

pdf abs
A Novel Machine Learning Based Approach for Post-OCR Error Detection
Shafqat Mumtaz Virk | Dana Dannélls | Azam Sheikh Muhammad
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Post processing is the most conventional approach for correcting errors that are caused by Optical Character Recognition(OCR) systems. Two steps are usually taken to correct OCR errors: detection and corrections. For the first task, supervised machine learning methods have shown state-of-the-art performances. Previously proposed approaches have focused most prominently on combining lexical, contextual and statistical features for detecting errors. In this study, we report a novel system to error detection which is based merely on the n-gram counts of a candidate token. In addition to being simple and computationally less expensive, our proposed system beats previous systems reported in the ICDAR2019 competition on OCR-error detection with notable margins. We achieved state-of-the-art F1-scores for eight out of the ten involved European languages. The maximum improvement is for Spanish which improved from 0.69 to 0.90, and the minimum for Polish from 0.82 to 0.84.

pdf abs
A Data-Driven Semi-Automatic Framenet Development Methodology
Shafqat Mumtaz Virk | Dana Dannélls | Lars Borin | Markus Forsberg
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

FrameNet is a lexical semantic resource based on the linguistic theory of frame semantics. A number of framenet development strategies have been reported previously and all of them involve exploration of corpora and a fair amount of manual work. Despite previous efforts, there does not exist a well-thought-out automatic/semi-automatic methodology for frame construction. In this paper we propose a data-driven methodology for identification and semi-automatic construction of frames. As a proof of concept, we report on our initial attempts to build a wider-scale framenet for the legal domain (LawFN) using the proposed methodology. The constructed frames are stored in a lexical database and together with the annotated example sentences they have been made available through a web interface.

pdf abs
A Deep Learning System for Automatic Extraction of Typological Linguistic Information from Descriptive Grammars
Shafqat Mumtaz Virk | Daniel Foster | Azam Sheikh Muhammad | Raheela Saleem
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Linguistic typology is an area of linguistics concerned with analysis of and comparison between natural languages of the world based on their certain linguistic features. For that purpose, historically, the area has relied on manual extraction of linguistic feature values from textural descriptions of languages. This makes it a laborious and time expensive task and is also bound by human brain capacity. In this study, we present a deep learning system for the task of automatic extraction of linguistic features from textual descriptions of natural languages. First, textual descriptions are manually annotated with special structures called semantic frames. Those annotations are learned by a recurrent neural network, which is then used to annotate un-annotated text. Finally, the annotations are converted to linguistic feature values using a separate rule based module. Word embeddings, learned from general purpose text, are used as a major source of knowledge by the recurrent neural network. We compare the proposed deep learning system to a previously reported machine learning based system for the same task, and the deep learning system wins in terms of F1 scores with a fair margin. Such a system is expected to be a useful contribution for the automatic curation of typological databases, which otherwise are manually developed.

2020

pdf abs
From Linguistic Descriptions to Language Profiles
Shafqat Mumtaz Virk | Harald Hammarström | Lars Borin | Markus Forsberg | Søren Wichmann
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)

Language catalogues and typological databases are two important types of resources containing different types of knowledge about the world’s natural languages. The former provide metadata such as number of speakers, location (in prose descriptions and/or GPS coordinates), language code, literacy, etc., while the latter contain information about a set of structural and functional attributes of languages. Given that both types of resources are developed and later maintained manually, there are practical limits as to the number of languages and the number of features that can be surveyed. We introduce the concept of a language profile, which is intended to be a structured representation of various types of knowledge about a natural language extracted semi-automatically from descriptive documents and stored at a central location. It has three major parts: (1) an introductory; (2) an attributive; and (3) a reference part, each containing different types of knowledge about a given natural language. As a case study, we develop and present a language profile of an example language. At this stage, a language profile is an independent entity, but in the future it is envisioned to become part of a network of language profiles connected to each other via various types of relations. Such a representation is expected to be suitable both for humans and machines to read and process for further deeper linguistic analyses and/or comparisons.

pdf abs
The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages
Shafqat Mumtaz Virk | Harald Hammarström | Markus Forsberg | Søren Wichmann
Proceedings of the Twelfth Language Resources and Evaluation Conference

There exist as many as 7000 natural languages in the world, and a huge number of documents describing those languages have been produced over the years. Most of those documents are in paper format. Any attempts to use modern computational techniques and tools to process those documents will require them to be digitized first. In this paper, we report a multilingual digitized version of thousands of such documents searchable through some well-established corpus infrastructures. The corpus is annotated with various meta, word, and text level attributes to make searching and analysis easier and more useful.

2019

pdf abs
Exploiting Frame-Semantics and Frame-Semantic Parsing for Automatic Extraction of Typological Information from Descriptive Grammars of Natural Languages
Shafqat Mumtaz Virk | Azam Sheikh Muhammad | Lars Borin | Muhammad Irfan Aslam | Saania Iqbal | Nazia Khurram
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We describe a novel system for automatic extraction of typological linguistic information from descriptive grammars of natural languages, applying the theory of frame semantics in the form of frame-semantic parsing. The current proof-of-concept system covers a few selected linguistic features, but the methodology is general and can be extended not only to other typological features but also to descriptive grammars written in languages other than English. Such a system is expected to be a useful assistance for automatic curation of typological databases which otherwise are built manually, a very labor and time consuming as well as cognitively taxing enterprise.

2016

pdf abs
A Supervised Approach for Enriching the Relational Structure of Frame Semantics in FrameNet
Shafqat Mumtaz Virk | Philippe Muller | Juliette Conrath
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Frame semantics is a theory of linguistic meanings, and is considered to be a useful framework for shallow semantic analysis of natural language. FrameNet, which is based on frame semantics, is a popular lexical semantic resource. In addition to providing a set of core semantic frames and their frame elements, FrameNet also provides relations between those frames (hence providing a network of frames i.e. FrameNet). We address here the limited coverage of the network of conceptual relations between frames in FrameNet, which has previously been pointed out by others. We present a supervised model using rich features from three different sources: structural features from the existing FrameNet network, information from the WordNet relations between synsets projected into semantic frames, and corpus-collected lexical associations. We show large improvements over baselines consisting of each of the three groups of features in isolation. We then use this model to select frame pairs as candidate relations, and perform evaluation on a sample with good precision.

In this paper, we describe a multilingual open-source computational grammar of Persian, developed in Grammatical Framework (GF) ― A type-theoretical grammar formalism. We discuss in detail the structure of different syntactic (i.e. noun phrases, verb phrases, adjectival phrases, etc.) categories of Persian. First, we show how to structure and construct these categories individually. Then we describe how they are glued together to make well-formed sentences in Persian, while maintaining the grammatical features such as agreement, word order, etc. We also show how some of the distinctive features of Persian, such as the ezafe construction, are implemented in GF. In order to evaluate the grammar's correctness, and to demonstrate its usefulness, we have added support for Persian in a multilingual application grammar (the Tourist Phrasebook) using the reported resource grammar.

pdf bib
Computational evidence that Hindi and Urdu share a grammar but not the lexicon
K.V.S Prasad | Shafqat Mumtaz Virk
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing