@inproceedings{forkel-list-2020-cldfbench,
    title = "{CLDFB}ench: Give Your Cross-Linguistic Data a Lift",
    author = "Forkel, Robert  and
      List, Johann-Mattis",
    editor = "Calzolari, Nicoletta  and
      B{\'e}chet, Fr{\'e}d{\'e}ric  and
      Blache, Philippe  and
      Choukri, Khalid  and
      Cieri, Christopher  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Isahara, Hitoshi  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, H{\'e}l{\`e}ne  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://preview.aclanthology.org/ingest-emnlp/2020.lrec-1.864/",
    pages = "6995--7002",
    language = "eng",
    ISBN = "979-10-95546-34-4",
    abstract = "While the amount of cross-linguistic data is constantly increasing, most datasets produced today and in the past cannot be considered FAIR (findable, accessible, interoperable, and reproducible). To remedy this and to increase the comparability of cross-linguistic resources, it is not enough to set up standards and best practices for data to be collected in the future. We also need consistent workflows for the ``retro-standardization'' of data that has been published during the past decades and centuries. With the Cross-Linguistic Data Formats initiative, first standards for cross-linguistic data have been presented and successfully tested. So far, however, CLDF creation was hampered by the fact that it required a considerable degree of computational proficiency. With cldfbench, we introduce a framework for the retro-standardization of legacy data and the curation of new datasets that drastically simplifies the creation of CLDF by providing a consistent, reproducible workflow that rigorously supports version control and long term archiving of research data and code. The framework is distributed in form of a Python package along with usage information and examples for best practice. This study introduces the new framework and illustrates how it can be applied by showing how a resource containing structural and lexical data for Sinitic languages can be efficiently retro-standardized and analyzed."
}