Pedro José Vivancos-Vicente

Also published as: Pedro José Vivancos-vicente


2024

pdf
UMUTeam at SemEval-2024 Task 8: Combining Transformers and Syntax Features for Machine-Generated Text Detection
Ronghao Pan | José Antonio García-díaz | Pedro José Vivancos-vicente | Rafael Valencia-garcía
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

These working notes describe the UMUTeam’s participation in Task 8 of SemEval-2024 entitled “Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection”. This shared task aims at identifying machine-generated text in order to mitigate its potential misuse. This shared task is divided into three subtasks: Subtask A, a binary classification task to determine whether a given full-text was written by a human or generated by a machine; Subtask B, a multi-class classification problem to determine, given a full-text, who generated it. It can be written by a human or generated by a specific language model; and Subtask C, mixed human-machine text recognition. We participated in Subtask B, using an approach based on fine-tuning a pre-trained model, such as RoBERTa, combined with syntactic features of the texts. Our system placed 23rd out of a total of 77 participants, with a score of 75.350%, outperforming the baseline.

2022

pdf
UMUTextStats: A linguistic feature extraction tool for Spanish
José Antonio García-Díaz | Pedro José Vivancos-Vicente | Ángela Almela | Rafael Valencia-García
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Feature Engineering consists in the application of domain knowledge to select and transform relevant features to build efficient machine learning models. In the Natural Language Processing field, the state of the art concerning automatic document classification tasks relies on word and sentence embeddings built upon deep learning models based on transformers that have outperformed the competition in several tasks. However, the models built from these embeddings are usually difficult to interpret. On the contrary, linguistic features are easy to understand, they result in simpler models, and they usually achieve encouraging results. Moreover, both linguistic features and embeddings can be combined with different strategies which result in more reliable machine-learning models. The de facto tool for extracting linguistic features in Spanish is LIWC. However, this software does not consider specific linguistic phenomena of Spanish such as grammatical gender and lacks certain verb tenses. In order to solve these drawbacks, we have developed UMUTextStats, a linguistic extraction tool designed from scratch for Spanish. Furthermore, this tool has been validated to conduct different experiments in areas such as infodemiology, hate-speech detection, author profiling, authorship verification, humour or irony detection, among others. The results indicate that the combination of linguistic features and embeddings based on transformers are beneficial in automatic document classification.