There is a lack of reproducibility in results from experiments that apply the Appraisal taxonomy. Appraisal is widely used by linguists to study how people judge things or people. Automating Appraisal could be beneficial for use cases such as moderating online comments. Past work in Appraisal annotation has been descriptive in nature and, the lack of publicly available data sets hinders the progress of automation. In this work, we are interested in two things; first, measuring the performance of automated approaches to Appraisal classification in the publicly available Australasian Language Technology Association (ALTA) Shared Task Challenge data set. Second, we are interested in reproducing the annotation of the ALTA data set. Four additional annotators, each with a different linguistics background, were employed to re-annotate the data set. Our results show a poor level of agreement at more detailed Appraisal categories (Fleiss Kappa = 0.059) and a fair level of agreement (Kappa = 0.372) at coarse-level categories. We find similar results when using automated approaches that are available publicly. Our empirical evidence suggests that at present, automating classification is practical only when considering coarse-level categories of the taxonomy.
In 2019, the Australasian Language Technology Association (ALTA) organised a shared task to detect the target of sarcastic comments posted on social media. However, there were no winners as it proved to be a difficult task. In this work, we revisit the task posted by ALTA by using transformers, specifically BERT, given the current success of the transformer-based model in various NLP tasks. We conducted our experiments on two BERT models (TD-BERT and BERT-AEN). We evaluated our model on the data set provided by ALTA (Reddit) and two additional data sets: ‘book snippets’ and ‘Tweets’. Our results show that our proposed method achieves a 15.2% improvement from the current state-of-the-art system on the Reddit data set and 4% improvement on Tweets.
We describe our methods for automatically grading the level of clinical evidence in medical papers, as part of the ALTA 2021 shared task. We use a combination of transfer learning and a hand-crafted, feature-based classifier. Our system (�orangutanV3�) obtained an accuracy score of 0.4918, which placed third in the leaderboard. From our failure analysis, we find that our classification techniques do not appropriately handle cases when the conclusions of across the medical papers are themselves inconclusive. We believe that this shortcoming can be overcome�thus improving the classification accuracy�by incorporating document similarity techniques.
We describe our method for classifying short texts into the APPRAISAL framework, work we conducted as part of the ALTA 2020 shared task. We tackled this problem using transfer learning. Our team, “orangutanV2” placed equal first in the shared task, with a mean F1-score of 0.1026 on the private data set.
We describe our methods in trying to detect the target of sarcasm as part of ALTA 2019 shared task. We use combination of ensemble of clas- sifiers and a rule-based system. Our team ob- tained a Dice-Sorensen Coefficient score of 0.37150, which placed 2nd in the public leader- board. Despite no team beating the baseline score for the private dataset, we present our findings and also some of the challenges and future improvements which can be used in or- der to tackle the problem.