The Influence of Background Data Size on the Performance of a Score-based Likelihood Ratio System: A Case of Forensic Text Comparison

Shunichi Ishihara


Abstract
This study investigates the robustness and stability of a likelihood ratio–based (LR-based) forensic text comparison (FTC) system against the size of background population data. It focuses on a score-based approach for estimating authorship LRs. Each document is represented with a bag-of-words model, and the cosine distance is used as the score-generating function. Population data sets that differed in the number of scores were synthesised 20 times using the Monte Carlo simulation technique. The FTC system's performance with different population sizes was evaluated by a gradient metric, the log-LR cost (Cllr). The experimental results revealed two outcomes: 1) the score-based approach is rather robust against a small population size, in that, with the scores obtained from 40 to 60 authors in the database, the stability and the performance of the system become fairly comparable to those of the system with the maximum number of authors (720); and 2) poor performance in terms of Cllr, which occurred because of limited background population data, is largely due to poor calibration. The results also indicated that the score-based approach is more robust against data scarcity than the feature-based approach; however, this finding requires further study.
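As a minimal sketch of two quantities named in the abstract, the Python below computes a cosine-distance score between two bag-of-words vectors and the log-LR cost (Cllr) from a set of LRs. The whitespace tokeniser and the function names (bag_of_words, cosine_distance, cllr) are assumptions made for illustration and are not taken from the paper. The step that converts scores into LRs is not shown, because the abstract does not specify it.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lower-cased, whitespace-separated tokens (assumed tokenisation)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def cllr(same_author_lrs, diff_author_lrs):
    """Log-LR cost: mean penalty over same- and different-author comparisons."""
    same_term = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs)
    diff_term = sum(math.log2(1 + lr) for lr in diff_author_lrs)
    return 0.5 * (same_term / len(same_author_lrs)
                  + diff_term / len(diff_author_lrs))


if __name__ == "__main__":
    doc1 = bag_of_words("the cat sat on the mat")
    doc2 = bag_of_words("the dog sat on the log")
    print("score:", cosine_distance(doc1, doc2))
    # LRs of exactly 1 carry no information, which gives the reference Cllr of 1.
    print("Cllr:", cllr([1.0, 1.0], [1.0, 1.0]))
```

Lower Cllr is better: a system that always outputs LR = 1 scores exactly 1, so values at or above 1 indicate that the LRs are uninformative or misleading.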
Anthology ID:
2020.alta-1.3
Volume:
Proceedings of the 18th Annual Workshop of the Australasian Language Technology Association
Month:
December
Year:
2020
Address:
Virtual Workshop
Editors:
Maria Kim, Daniel Beck, Meladel Mistica
Venue:
ALTA
Publisher:
Australasian Language Technology Association
Pages:
21–31
URL:
https://aclanthology.org/2020.alta-1.3
Cite (ACL):
Shunichi Ishihara. 2020. The Influence of Background Data Size on the Performance of a Score-based Likelihood Ratio System: A Case of Forensic Text Comparison. In Proceedings of the 18th Annual Workshop of the Australasian Language Technology Association, pages 21–31, Virtual Workshop. Australasian Language Technology Association.
Cite (Informal):
The Influence of Background Data Size on the Performance of a Score-based Likelihood Ratio System: A Case of Forensic Text Comparison (Ishihara, ALTA 2020)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2020.alta-1.3.pdf