In this paper, we propose an alternative to the classic masked language modeling (MLM) pre-training paradigm, where the objective is altered from the reconstruction of the exact identity of randomly selected masked subwords to the prediction of their latent semantic properties. We coin the proposed pre-training technique masked latent semantic modeling (MLSM for short). In order to make the contextualized determination of the latent semantic properties of the masked subwords possible, we rely on an unsupervised technique which uses sparse coding. Our experimental results reveal that the fine-tuned performance of those models that we pre-trained via MLSM is consistently and significantly better compared to the use of vanilla MLM pretraining and other strong baselines.
In this paper, we introduce our submission in the task of visual word sense disambiguation (vWSD). Our proposed solution operates by deriving quasi-symbolic semantic categories from the hidden representations of multi-modal text-image encoders. Our results are mixed, as we manage to achieve a substantial boost in performance when evaluating on a validation set, however, we experienced detrimental effects during evaluation on the actual test set. Our positive results on the validation set confirms the validity of the quasi-symbolic features, whereas our results on the test set revealed that the proposed technique was not able to cope with the sufficiently different distribution of the test data.
Codenames is a popular board game, in which knowledge and cooperation between players play an important role. The task of a player playing as a spymaster is to find words (clues) that a teammate finds related to as many of some given words as possible, but not to other specified words. This is a hard challenge even with today’s advanced language technology methods. In our study, we create spymaster agents using four types of relatedness measures that require only a raw text corpus to produce. These include newly introduced ones based on co-occurrences, which outperform FastText cosine similarity on gold standard relatedness data. To generate clues in Codenames, we combine relatedness measures with four different scoring functions, for two languages, English and Hungarian. For testing, we collect decisions of human guesser players in an online game, and our configurations outperform previous agents among methods using raw corpora only.
In this paper, we advocate for using large pre-trained monolingual language models in cross lingual zero-shot word sense disambiguation (WSD) coupled with a contextualized mapping mechanism. We also report rigorous experiments that illustrate the effectiveness of employing sparse contextualized word representations obtained via a dictionary learning procedure. Our experimental results demonstrate that the above modifications yield a significant improvement of nearly 6.5 points of increase in the average F-score (from 62.0 to 68.5) over a collection of 17 typologically diverse set of target languages. We release our source code for replicating our experiments at https://github.com/begab/sparsity_makes_sense.
In this work, we analyze the performance and properties of cross-lingual word embedding models created by mapping-based alignment methods. We use several measures of corpus and embedding similarity to predict BLI scores of cross-lingual embedding mappings over three types of corpora, three embedding methods and 55 language pairs. Our experimental results corroborate that instead of mere size, the amount of common content in the training corpora is essential. This phenomenon manifests in that i) despite of the smaller corpus sizes, using only the comparable parts of Wikipedia for training the monolingual embedding spaces to be mapped is often more efficient than relying on all the contents of Wikipedia, ii) the smaller, in return less diversified Spanish Wikipedia works almost always much better as a training corpus for bilingual mappings than the ubiquitously used English Wikipedia.
In this paper, we introduce our system that we participated with at the multilingual and cross-lingual word-in-context disambiguation SemEval 2021 shared task. In our experiments, we investigated the possibility of using an all-words fine-grained word sense disambiguation system trained purely on sense-annotated data in English and draw predictions on the semantic equivalence of words in context based on the similarity of the ranked lists of the (English) WordNet synsets returned for the target words decisions had to be made for. We overcame the multi,-and cross-lingual aspects of the shared task by applying a multilingual transformer for encoding the texts written in either Arabic, English, French, Russian and Chinese. While our results lag behind top scoring submissions, it has the benefit that it not only provides a binary flag whether two words in their context have the same meaning, but also provides a more tangible output in the form of a ranked list of (English) WordNet synsets irrespective of the language of the input texts. As our framework is designed to be as generic as possible, it can be applied as a baseline for basically any language (supported by the multilingual transformed architecture employed) even in the absence of any additional form of language specific training data.
The application of transformer-based contextual representations has became a de facto solution for solving complex NLP tasks. Despite their successes, such representations are arguably opaque as their latent dimensions are not directly interpretable. To alleviate this limitation of contextual representations, we devise such an algorithm where the output representation expresses human-interpretable information of each dimension. We achieve this by constructing a transformation matrix based on the semantic content of the embedding space and predefined semantic categories using Hellinger distance. We evaluate our inferred representations on supersense prediction task. Our experiments reveal that the interpretable nature of transformed contextual representations makes it possible to accurately predict the supersense category of a word by simply looking for its transformed coordinate with the largest coefficient. We quantify the effects of our proposed transformation when applied over traditional dense contextual embeddings. We additionally investigate and report consistent improvements for the integration of sparse contextual word representations into our proposed algorithm.
We propose the technique of quasi-multitask learning (Q-MTL), a simple and easy to implement modification of standard multitask learning, in which the tasks to be modeled are identical. With this easy modification of a standard neural classifier we can get benefits similar to an ensemble of classifiers with a fraction of the resources required. We illustrate it through a series of sequence labeling experiments over a diverse set of languages, that applying Q-MTL consistently increases the generalization ability of the applied models. The proposed architecture can be regarded as a new regularization technique that encourages the model to develop an internal representation of the problem at hand which is beneficial to multiple output units of the classifier at the same time. Our experiments corroborate that by relying on the proposed algorithm, we can approximate the quality of an ensemble of classifiers at a fraction of computational resources required. Additionally, our results suggest that Q-MTL handles the presence of noisy training labels better than ensembles.
This paper introduces our efforts at the FinCasual shared task for modeling causality in financial utterances. Our approach uses the commonly and successfully applied strategy of fine-tuning a transformer-based language model with a twist, i.e. we modified the training and inference mechanism such that our model produces multiple predictions for the same instance. By designing such a model that returns k>1 predictions at the same time, we not only obtain a more resource efficient training (as opposed to fine-tuning some pre-trained language model k independent times), but our results indicate that we are also capable of obtaining comparable or even better evaluation scores that way. We compare multiple strategies for combining the k predictions of our model. Our submissions got ranked third on both subtasks of the shared task.
In this paper, we demonstrate that by utilizing sparse word representations, it becomes possible to surpass the results of more complex task-specific models on the task of fine-grained all-words word sense disambiguation. Our proposed algorithm relies on an overcomplete set of semantic basis vectors that allows us to obtain sparse contextualized word representations. We introduce such an information theory-inspired synset representation based on the co-occurrence of word senses and non-zero coordinates for word forms which allows us to achieve an aggregated F-score of 78.8 over a combination of five standard word sense disambiguating benchmark datasets. We also demonstrate the general applicability of our proposed framework by evaluating it towards part-of-speech tagging on four different treebanks. Our results indicate a significant improvement over the application of the dense word representations.
This paper describes 300-sparsians’s participation in SemEval-2018 Task 9: Hypernym Discovery, with a system based on sparse coding and a formal concept hierarchy obtained from word embeddings. Our system took first place in subtasks (1B) Italian (all and entities), (1C) Spanish entities, and (2B) music entities.
In this paper we introduce our system participating at the 2017 SemEval shared task on keyphrase extraction from scientific documents. We aimed at the creation of a keyphrase extraction approach which relies on as little external resources as possible. Without applying any hand-crafted external resources, and only utilizing a transformed version of word embeddings trained at Wikipedia, our proposed system manages to perform among the best participating systems in terms of precision.
In this paper we propose and carefully evaluate a sequence labeling framework which solely utilizes sparse indicator features derived from dense distributed word representations. The proposed model obtains (near) state-of-the art performance for both part-of-speech tagging and named entity recognition for a variety of languages. Our model relies only on a few thousand sparse coding-derived features, without applying any modification of the word representations employed for the different tasks. The proposed model has favorable generalization properties as it retains over 89.8% of its average POS tagging accuracy when trained at 1.2% of the total available training data, i.e. 150 sentences per language.