Benjamin C. M. Fung


2021

The Topic Confusion Task: A Novel Evaluation Scenario for Authorship Attribution
Malik Altakrori | Jackie Chi Kit Cheung | Benjamin C. M. Fung
Findings of the Association for Computational Linguistics: EMNLP 2021

Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether new, unseen topics are used in the testing phase. However, neither scenario allows us to explain whether errors are caused by a failure to capture the authors' writing styles or by the topic shift. Motivated by this, we propose the topic confusion task, where we switch the author-topic configuration between the training and testing sets. This setup allows us to investigate two types of errors: one caused by the topic shift and one caused by the features’ inability to capture the writing styles. We show that stylometric features with part-of-speech tags are the least susceptible to topic variations. We further show that combining them with other features leads to significantly lower topic confusion and higher attribution accuracy. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task and are surpassed by simple features such as word-level n-grams.

ER-AE: Differentially Private Text Generation for Authorship Anonymization
Haohan Bo | Steven H. H. Ding | Benjamin C. M. Fung | Farkhund Iqbal
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Most privacy protection studies for textual data focus on removing explicit sensitive identifiers. However, personal writing style, a strong indicator of authorship, is often neglected. Recent studies, such as SynTF, have shown promising results on privacy-preserving text mining. However, their anonymization algorithms can only output numeric term vectors, which are difficult for recipients to interpret. We propose a novel text generation model with a two-set exponential mechanism for authorship anonymization. By augmenting the semantic information through a REINFORCE training reward function, the model can generate differentially private text that has close semantics and a similar grammatical structure to the original text while removing personal traits of the writing style. It does not assume any conditioned labels or parallel text data for training. We evaluate the performance of the proposed model on a real-life peer review dataset and the Yelp review dataset. The results suggest that our model outperforms the state of the art on semantic preservation, authorship obfuscation, and stylometric transformation.

2006

Document Clustering Method Based on Frequent Co-occurring Words
Ye-Hang Zhu | Guan-Zhong Dai | Benjamin C. M. Fung | De-Jun Mu
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation