@inproceedings{kohn-etal-2016-mining,
    title = "Mining the Spoken {W}ikipedia for Speech Data and Beyond",
    author = {K{\"o}hn, Arne  and
      Stegen, Florian  and
      Baumann, Timo},
    editor = "Calzolari, Nicoletta  and
      Choukri, Khalid  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Grobelnik, Marko  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, Helene  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://preview.aclanthology.org/ingest-emnlp/L16-1735/",
    pages = "4644--4647",
    abstract = "We present a corpus of time-aligned spoken data of Wikipedia articles as well as the pipeline that allows generating such corpora for many languages. There are initiatives to create and sustain spoken Wikipedia versions in many languages and hence the data is freely available, grows over time, and can be used for automatic corpus creation. Our pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which we align 71h in full sentences and another 86h of sentences with some missing words. The English corpus consists of 287h, for which we align 27h in full sentences and 157h with some missing words. Results are publicly available."
}