WebDP: Understanding Discourse Structures in Semi-Structured Web Documents
Peilin Liu, Hongyu Lin, Meng Liao, Hao Xiang, Xianpei Han, Le Sun
Abstract
Web documents have become rich data resources in the current era, and understanding their discourse structure will potentially benefit various downstream document processing applications. Unfortunately, current discourse analysis and document intelligence research mostly focuses on either the discourse structure of plain text or superficial visual structures in documents, which cannot accurately describe the discourse structure of highly free-styled and semi-structured web documents. To promote discourse studies on web documents, this paper introduces a benchmark, WebDP, for a new task named Web Document Discourse Parsing. Specifically, a web document discourse structure representation schema is proposed by extending classical discourse theories and adding special features to better represent the discourse characteristics of web documents. Then, a manually annotated web document dataset, WEBDOCS, is developed to facilitate the study of this parsing task. We compare current neural models on WEBDOCS, and the experimental results show that WebDP is feasible but also challenging for current models.
- Anthology ID:
- 2023.findings-acl.650
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2023
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 10235–10258
- URL:
- https://preview.aclanthology.org/icon-24-ingestion/2023.findings-acl.650/
- DOI:
- 10.18653/v1/2023.findings-acl.650
- Cite (ACL):
- Peilin Liu, Hongyu Lin, Meng Liao, Hao Xiang, Xianpei Han, and Le Sun. 2023. WebDP: Understanding Discourse Structures in Semi-Structured Web Documents. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10235–10258, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- WebDP: Understanding Discourse Structures in Semi-Structured Web Documents (Liu et al., Findings 2023)
- PDF:
- https://preview.aclanthology.org/icon-24-ingestion/2023.findings-acl.650.pdf