Proceedings of the 17th International Conference on Natural Language Processing (ICON): System Demonstrations
- Patna, India
- NLP Association of India (NLPAI)
A novel literature-based discovery system based on UMLS ontologies, semantic filters, statistics, and word embeddings was developed and validated against the well-established Raynaud’s disease – Fish Oil discovery by mining PubMed title and abstract corpora of different sizes and specificity. Results show an ‘inverse effect’ between open and closed discovery search modes. In open discovery, a larger, more general corpus (Vascular disease or Peri-vascular disease) produces better results than a smaller, more specific corpus (Raynaud disease), whereas in closed discovery the exact opposite is true.
Identifying and extracting Multiword Expressions (MWEs) is a hard and challenging task in Natural Language Processing applications such as Information Retrieval (IR), Information Extraction (IE), Question-Answering systems, Speech Recognition and Synthesis, Text Summarization and Machine Translation (MT). A Multiword Expression consists of two or more consecutive words that are treated as a single unit, and its actual meaning cannot be derived from the meanings of the individual words. If a system treats such an expression as separate words, its results will be incorrect. It is therefore necessary to identify these expressions to improve the results of the system. In this report, our main focus is on developing an automated tool to extract Multiword Expressions from monolingual and parallel corpora of English and Punjabi. The tool uses rule-based, linguistic, statistical and other approaches to identify and extract MWEs from these corpora, and achieves an F-score of more than 90% for some types of MWEs.
Machine Translation has been an active research area for the last few decades. Today, corpus-based Machine Translation systems are very popular. Statistical Machine Translation and Neural Machine Translation both rely on parallel corpora. In this research, a bidirectional Punjabi-English Neural Machine Translation system is developed. To improve the accuracy of the system, word embeddings and Byte Pair Encoding are used. The reported BLEU score is 38.30 for the Punjabi to English Neural Machine Translation system and 36.96 for the English to Punjabi Neural Machine Translation system.
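As a rough illustration of the Byte Pair Encoding step mentioned in this abstract, the following minimal sketch learns merge operations from a toy word-frequency table. It shows the generic BPE merge-learning procedure only; the function names, the `</w>` end-of-word marker and the toy English words are assumptions for illustration and do not reflect the authors' actual configuration or data.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges; returns (merges, final vocab)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Toy frequencies (hypothetical, not taken from the paper's corpus).
toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_freqs, 3)
```

Frequent character sequences such as a shared suffix end up as single subword units, which is how BPE keeps rare and inflected words representable with a fixed-size vocabulary.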
Machine translation from English to Indian languages is a difficult task because good-quality corpora are unavailable and Indian languages are morphologically rich. For a system to produce better translations, the corpus needs to be very large. We employed three similarity and distance measures and developed software that automatically extracts parallel data from comparable corpora with high precision using minimal resources. The software is built on four algorithms. The first three compute Cosine Similarity, Euclidean Distance Similarity and Jaccard Similarity. The fourth integrates the outputs of the first three in order to improve the efficiency of the system.
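The three measures named in this abstract can be sketched as follows for two tokenized sentences. This is a minimal illustration under stated assumptions: both sentences are assumed to be already mapped into a shared vocabulary (for example through a bilingual dictionary), the Euclidean distance is turned into a similarity with 1 / (1 + d), and the fourth, integrating step is approximated by a plain average. The abstract does not say how the outputs are actually combined, so `combined_score` is hypothetical.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return sentence.lower().split()


def cosine_similarity(a, b):
    """Cosine of the angle between two bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_similarity(a, b):
    """Map the Euclidean distance d between count vectors to 1 / (1 + d)."""
    ca, cb = Counter(a), Counter(b)
    d = math.sqrt(sum((ca[w] - cb[w]) ** 2 for w in ca.keys() | cb.keys()))
    return 1.0 / (1.0 + d)


def jaccard_similarity(a, b):
    """Size of the token-set intersection divided by the size of the union."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def combined_score(s1, s2):
    """Hypothetical integration step: unweighted mean of the three scores."""
    a, b = tokenize(s1), tokenize(s2)
    scores = (
        cosine_similarity(a, b),
        euclidean_similarity(a, b),
        jaccard_similarity(a, b),
    )
    return sum(scores) / len(scores)
```

Sentence pairs whose combined score exceeds a chosen threshold would then be kept as candidate parallel data; the threshold itself would have to be tuned on held-out pairs.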
Hindi and Sanskrit share the same script, Devanagari, which results in a few basic similarities in their grammar rules. Hindi ranks fourth in the world by number of speakers, and over 60 million people in India are Hindi internet users. India itself has approximately 120 languages and 240 mother tongues, but only a few of these languages are recognized worldwide, while the others are steadily losing their place in society. Sanskrit is one of those important languages that are being neglected. According to the 2001 Census of India, fewer than 15,000 citizens returned Sanskrit as their mother tongue or preferred medium of communication. A key reason for the poor acceptance of Sanskrit is the language barrier among the Indian masses and a general lack of knowledge about the language. Our aim is therefore to connect the large population of Hindi users with Sanskrit and make them familiar with at least its basics. We developed a translation tool that parses Sanskrit words (prose) one by one and translates them into equivalent Hindi step by step: (i) We created a strong Hindi-Sanskrit corpus that can deal with Sanskrit words effectively and efficiently. (ii) We proposed an algorithm to stem Sanskrit words, which chops off the starts/ends of words to find the root words in the form of nouns and verbs. (iii) After stemming, we developed an algorithm to search the corpus for the equivalent Hindi meaning of the stemmed words, based on semantic analysis. (iv) We developed a semantic analysis algorithm for translating words, which helps the tool identify required parameters such as gender, number and case. (v) Next, we developed an algorithm for discourse integration to disjoin each translated sentence based on subject/noun dependency. 
(vi) Next, we implemented a pragmatic analysis algorithm that validates the translated Hindi sentences syntactically and semantically. (vii) We further extended our work to summarize the translated story and suggest a suitable heading/title. For this, we used the ripple down rule-based parts of speech (RDR-POS) tagger for word tagging in the POS tagger corpora. (viii) We proposed a title generation algorithm that suggests a suitable title for the translated text. (ix) Finally, we assembled all phases into one translation tool that takes a story of at most one hundred words as input and translates it into equivalent Hindi.
Machine Translation is a popular area of NLP research. There are various approaches to developing a machine translation system, such as Rule-Based, Statistical, Neural and Hybrid. A Rule-Based system relies on grammatical rules and bilingual lexicons. Statistical and Neural systems use large parallel corpora to train their respective models, whereas a Hybrid MT system is a mixture of different approaches. Corpus-based machine translation is currently quite popular in NLP research, but these models demand huge parallel corpora. In this research, we used a hybrid approach to develop an Urdu to Punjabi machine translation system. The developed system combines a statistical approach with various sub-systems based on linguistic rules. The system yields 80% accuracy on a varied set of sentences from domains such as Politics, Entertainment, Tourism, Sports and Health. The complete system has been developed in the C#.NET programming language.
The Hindi to Dogri Machine translation system is a rule-based MT developed and copyrighted by GoI in 2014. It is the first system developed to convert Hindi text into Dogri (the regional language of Jammu). The system is developed using ASP.Net and the databases are in MS-Access. This Machine Translation system accepts Hindi text as input and provides Dogri text as output in Unicode.
Opinion Mining (OM) is a field of study in Computer Science that deals with the development of software applications for text classification and summarization. Researchers working in this field contribute lexical resources, computing methodologies, text classification approaches, and summarization modules to perform OM tasks across various domains and languages. This demonstration paper presents the lexical and computational components developed for an Opinion Mining system that processes Hindi text taken from weblogs. The texts chosen for processing are those that demonstrate a cause and effect relationship between the related entities ‘Food’ and ‘Health Issues’. The work is novel, and the lexical resources developed are useful in current research and may be of importance for future research in the field. The resources are developed for an algorithm ‘A’ such that, for a domain-specific sentence ‘Y’ from Hindi weblogs, A(Y) returns a tuple (F, HI, p, s). Here F is a subset of FOOD, the set of Hindi words or phrases used for an edible item, and HI is a subset of HEALTH_ISSUE, the union of BODY_COMPONENT (the set of Hindi words or phrases used for a part of the body composition) and HEALTH_PROBLEM (the set of Hindi words or phrases used for a health problem a human being faces). The element ‘p’ takes the numeric value ‘1’ or ‘-1’: ‘1’ means that, from the text ‘Y’, algorithm ‘A’ computationally derived that the food entities in set ‘F’ have a positive effect on the health issues in set ‘HI’, and ‘-1’ means that the food entities in set ‘F’ have a negative effect on the health issues in set ‘HI’. The element ‘s’ takes the value ‘1’ or ‘2’, indicating that the strength of polarity ‘p’ is medium or strong.
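The shape of the output A(Y) described in this abstract can be captured in a small data structure. The sketch below only encodes the stated constraints (F ⊆ FOOD, HI ⊆ HEALTH_ISSUE, p ∈ {1, -1}, s ∈ {1, 2}); it does not implement algorithm ‘A’ itself. The class name `OpinionResult` and the few Hindi lexicon entries are illustrative assumptions, not the paper's actual resources.

```python
from dataclasses import dataclass

# Illustrative entries only; the paper's real lexicons are not reproduced here.
FOOD = frozenset({"हल्दी", "दही"})        # turmeric, yogurt
BODY_COMPONENT = frozenset({"पेट"})       # stomach
HEALTH_PROBLEM = frozenset({"खांसी"})     # cough
HEALTH_ISSUE = BODY_COMPONENT | HEALTH_PROBLEM


@dataclass(frozen=True)
class OpinionResult:
    """Hypothetical container for the tuple (F, HI, p, s) returned by A(Y)."""
    F: frozenset   # food entities found in the sentence
    HI: frozenset  # health issues found in the sentence
    p: int         # polarity: 1 (positive effect) or -1 (negative effect)
    s: int         # strength of the polarity: 1 or 2 (medium or strong)

    def __post_init__(self):
        # Enforce the constraints stated in the abstract.
        if not self.F <= FOOD:
            raise ValueError("F must be a subset of FOOD")
        if not self.HI <= HEALTH_ISSUE:
            raise ValueError("HI must be a subset of HEALTH_ISSUE")
        if self.p not in (1, -1):
            raise ValueError("p must be 1 or -1")
        if self.s not in (1, 2):
            raise ValueError("s must be 1 or 2")


# Example: turmeric has a positive effect on cough, with strength value 2.
example = OpinionResult(frozenset({"हल्दी"}), frozenset({"खांसी"}), p=1, s=2)
```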
Sentiment analysis is a field of study that analyzes people’s emotions, expressed with words such as Nice, Happy, ਦੁਖੀ (sad) and changa (good), towards the entities and attributes mentioned in written text. It has been observed that on microblogging websites (Facebook, YouTube, Twitter) most people use more than one language to express their emotions. Switching from one language to another within the same written text is called code-mixing. In this research, we gathered an English-Punjabi code-mixed corpus from microblogging websites. We performed language identification on the code-mixed text, which includes phonetic typing, abbreviations, wordplay, intentionally misspelled words and slang words. We then tokenized the English and Punjabi words, which appear in different spellings, and performed lexicon-based sentiment analysis on this text. The dictionary created for English-Punjabi code-mixed text consists of opinionated words, which are categorized into three lists: positive words, negative words, and neutral words. The remaining words are stored in an unsorted word list. Using the N-gram approach, a statistical technique is applied to determine sentence-level sentiment polarity on the English-Punjabi code-mixed dataset. Our results show an accuracy of 83% with an F-1 measure of 77%.
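A lexicon-based, sentence-level polarity step of the kind described in this abstract might look like the sketch below. It is a simplified stand-in, not the authors' method: the word lists contain the example words from the abstract plus a few hypothetical entries (such as the bigram "not changa"), and the greedy longest-match over unigrams and bigrams is an assumption about how N-grams could be used.

```python
# Toy lexicons. "nice", "happy", "changa" and "ਦੁਖੀ" come from the abstract;
# the remaining entries are hypothetical.
POSITIVE = {"nice", "happy", "changa"}
NEGATIVE = {"ਦੁਖੀ", "bad", "not changa"}
NEUTRAL = {"ok"}


def sentence_polarity(sentence, max_n=2):
    """Return (label, unsorted_words) for one code-mixed sentence.

    N-grams are matched greedily, longest first, so "not changa" is
    counted once as negative instead of "changa" counting as positive.
    """
    tokens = sentence.lower().split()
    score = 0
    unsorted_words = []  # tokens found in none of the three lists
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            gram = " ".join(tokens[i:i + n])
            if gram in POSITIVE:
                score += 1
            elif gram in NEGATIVE:
                score -= 1
            elif gram not in NEUTRAL:
                continue  # no lexicon hit at this length, try a shorter n-gram
            i += n
            matched = True
            break
        if not matched:
            unsorted_words.append(tokens[i])
            i += 1
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, unsorted_words
```

For instance, `sentence_polarity("movie bahut changa si")` yields a positive label and places the three unmatched tokens in the unsorted list.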
Khasi is an Austroasiatic language spoken by one of the tribes in Meghalaya and in parts of Assam and Bangladesh. The fact that some NLP tools for Khasi are now available online for testing purposes is the culmination of an arduous investment of time and effort. When work on Khasi began, resources such as a tagset, an annotated corpus or any NLP tools were nonexistent. As part of the author’s ongoing doctoral work, the resources currently in place for Khasi are the BIS (Bureau of Indian Standards) tagset for Khasi, a 90k annotated corpus, and NLP tools such as POS (parts of speech) taggers and shallow parsers. These tools are highlighted in this demonstration paper.
A chatbot is defined as one of the most advanced and promising expressions of interaction between humans and machines. Chatbots are sometimes called digital assistants that can analyze human capabilities. Many chatbots have already been developed in English, with supporting libraries and packages, but customizing these engines for other languages is a tedious process, and there are many barriers to training them on morphologically rich languages. Artificial Intelligence (AI) based or Machine Learning based chatbots can answer complex, ambiguous questions and are capable of creating replies from scratch using Natural Language Processing techniques. Rule-based and machine learning based chatbots each have their advantages and disadvantages. Rule-based chatbots can give more reliable and grammatically correct answers but fail to respond to questions outside their knowledge base. Machine learning based chatbots, on the other hand, need a vast amount of training data and require continuous improvement of the database to improve their cognitive capabilities. A hybrid chatbot employs the concepts of both AI and rule-based bots, so it can handle situations with both approaches. One of the biggest threats faced by society during the Corona pandemic was misinformation, disinformation and malinformation. The Government wanted to establish a single source of truth on which the public could rely for authentic information. To support this cause and to meet the needs of the general public during the rapid spread of the COVID-19 pandemic in February and March 2020, ICFOSS developed an interactive bot that is based on ‘hybrid technology’ and interacts with people in the regional language (Malayalam).
Code mixing is prevalent when users communicate in two or more languages. It becomes more complex when users prefer romanized text to Unicode typing. The automatic processing of social media data has become a popular area of interest, especially since the COVID period, during which the involvement of young people has grown sharply. In keeping with this trend, our software deals with language identification and normalization of English and Punjabi code-mixed text. The software follows a pipeline that includes data collection, pre-processing, language identification, handling of out-of-vocabulary words, normalization and transliteration of English-Punjabi text. With five-fold cross-validation on the corpus, an accuracy of 96.8% is achieved on a training dataset of around 80025 tokens. After the tags are predicted, slang and contractions in the user input are normalized to their standard form. In addition, words whose predicted tag is Punjabi are transliterated to Punjabi.
Development of a Machine Translation System (MTS) for any language pair is a challenging task for several reasons. Lack of lexical resources for a language is one of the major issues that arise while developing an MTS for that language. For example, during the development of a Punjabi to Urdu MTS, many issues were recognized while preparing lexical resources for both languages. No machine-readable Punjabi to Urdu dictionary is available that can be used directly for translation, although various dictionaries are available that explain the meanings of words. In addition, handling OOV (out of vocabulary) words, handling Punjabi words with multiple senses in Urdu, identifying proper nouns, and identifying collocations in the source sentence (the Punjabi sentence in our case) are issues we face during the development of this system. MTSs have been in great demand over the last decade and are widely used in applications such as smartphones. Developing such a system and making it more user friendly has therefore become more pressing. They are mainly used for large-scale and automated translation, and they act as an instrument to bridge the digital divide.
Natural Language Processing (NLP) is one of the most attention-grabbing fields of artificial intelligence. It focuses on the interaction between humans and computers. Through NLP we can make computers recognize, decode and deduce the meaning of human language. However, numerous difficulties are encountered in NLP, and anaphora is one such issue. Anaphora occurs often in written text and spoken discourse. Anaphora Resolution is the process of finding the antecedent of a corresponding referent and is required in various NLP applications. Appreciable work has been reported on anaphora in English and other languages, but no work has been done for the Punjabi language. In this paper we introduce Anaphora Resolution for the Punjabi language. The accuracy achieved by the system is 47%.
People in the hearing-impaired community feel very uncomfortable while travelling through or visiting an airport without the help of a human interpreter. Hearing-impaired people are not able to hear announcements made at the airport, such as which flight is heading to which destination. Without the help of an interpreter, they remain unaware of which gate or counter number to go to. They cannot even find out whether a flight is on time, delayed or cancelled. The Airport Announcement System for Deaf is a rule-based MT system. It is the first system developed in the domain of public places to translate all the announcements used at airports into Indian Sign Language (ISL) synthetic animations. The system is developed using Python and the Flask framework. This Machine Translation system accepts announcements in the form of English text as input and produces Indian Sign Language (ISL) synthetic animations as output.
People in the hearing-impaired community feel very uncomfortable while travelling through or visiting railway stations without the help of a human interpreter. Hearing-impaired people are not able to hear announcements made at railway stations, such as which train is heading to which destination. Without the help of an interpreter, they remain unaware of which platform or counter number to go to. They cannot even find out whether a train is on time, delayed or cancelled. The Railway Stations Announcement System for Deaf is a rule-based MT system. It is the first system developed in the domain of public places to translate all the announcements used at railway stations into Indian Sign Language (ISL) synthetic animations. The system is developed using Python and the Flask framework. This Machine Translation system accepts announcements in the form of English text as input and produces Indian Sign Language (ISL) synthetic animations as output.
Sign language is the natural way of expressing thoughts and feelings for the deaf community. It is a diagrammatic, non-verbal language used by the hearing-impaired community to communicate their feelings to one another. Today we live in an era of technological development in which instant communication is quite easy, but a lot of work still needs to be done in the field of sign language automation to improve the quality of life of the deaf community. The traditional approaches to representing signs, in the form of videos or text, are expensive, time-consuming, and not easy to use. In this research work, an attempt is made to convert complex and compound English sentences to Indian Sign Language (ISL) using synthetic video animations. The translation architecture includes a parsing module that parses the input complex or compound English sentences into their simplified versions using complex-to-simple and compound-to-simple English grammar rules respectively. The simplified sentence is then forwarded to the conversion segment, which rearranges the words of the English sentence into the corresponding ISL order using the devised grammar rules. The next segment, the eliminator, takes the sentence generated by the ISL grammar rules as input and removes unwanted words or stop words. This removal is important because ISL needs only the meaningful words of a sentence, without unnecessary linking verbs, helping verbs, and so on. After passing through the eliminator segment, the sentence is sent to the concordance segment. This segment checks each word in the sentence and translates it into its respective lemma. 
The lemma is the basic required form of each word, because sign language uses base words, unlike other languages, which use gerunds, suffixes, three forms of verbs, and different kinds of nouns, adjectives and pronouns in their sentence structure. All the words of the sentence are checked against the lexicon, which contains English words with their HamNoSys notation, and words that are not in the lexicon are replaced by a synonym. The words of the sentence are then replaced by their corresponding HamNoSys codes. If a word is not present in the lexicon, the HamNoSys code is taken for each letter of the word in sequence. The HamNoSys code is converted into SiGML tags (a form of XML tags), and these SiGML tags are then sent to the animation module, which converts the SiGML code into synthetic animation using an avatar (a computer-generated animated character).
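The later stages of the pipeline described in this abstract (stop-word elimination, lemma lookup, lexicon lookup with synonym replacement, and letter-by-letter fallback) can be sketched as follows. Everything here is a toy stand-in: the word tables are hypothetical, the input is assumed to be already reordered by the ISL grammar rules, and the `<HNS:...>` strings are placeholders rather than real HamNoSys notation or SiGML.

```python
# Hypothetical toy resources; the real system's lexicon is not reproduced.
STOP_WORDS = {"is", "am", "are", "the", "a", "an", "to"}
LEMMAS = {"reading": "read", "books": "book"}
SYNONYMS = {"novel": "book"}
LEXICON = {"i": "<HNS:i>", "read": "<HNS:read>", "book": "<HNS:book>"}
LETTER_CODES = {ch: f"<HNS:letter-{ch}>" for ch in "abcdefghijklmnopqrstuvwxyz"}


def to_hamnosys(words):
    """Map ISL-ordered English words to placeholder HamNoSys codes."""
    codes = []
    for word in words:
        w = word.lower()
        if w in STOP_WORDS:
            continue  # eliminator segment: drop stop words
        lemma = LEMMAS.get(w, w)  # concordance segment: reduce to the lemma
        if lemma not in LEXICON and lemma in SYNONYMS:
            lemma = SYNONYMS[lemma]  # replace an unknown word by a synonym
        if lemma in LEXICON:
            codes.append(LEXICON[lemma])
        else:
            # Fallback: one code per letter, in sequence (fingerspelling).
            codes.extend(LETTER_CODES[ch] for ch in lemma if ch in LETTER_CODES)
    return codes
```

In the described system the resulting codes would next be converted to SiGML and passed to the avatar; that conversion is outside the scope of this sketch.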
Plagiarism is closely linked with Intellectual Property Rights and copyright laws, both of which were formed to protect the ownership of a concept. Most of the available plagiarism detection tools, when tested with sample Punjabi text, failed to recognise the Punjabi text, and the ones that supported Punjabi text did a simple string comparison to detect suspected copy-paste plagiarism, ignoring other forms of plagiarism such as word switching, synonym replacement and sentence switching.