Amir Ejmail


2024

pdf
DiaSet: An Annotated Dataset of Arabic Conversations
Abraham Israeli | Aviv Naaman | Guy Maduel | Rawaa Makhoul | Dana Qaraeen | Amir Ejmail | Dina Lisnanskey | Julian Jubran | Shai Fine | Kfir Bar
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce DiaSet, a novel dataset of dialectical Arabic speech, manually transcribed and annotated for two specific downstream tasks: sentiment analysis and named entity recognition. The dataset encapsulates the Palestine dialect, predominantly spoken in Palestine, Israel, and Jordan. Our dataset incorporates authentic conversations between YouTube influencers and their respective guests. Furthermore, we have enriched the dataset with simulated conversations initiated by inviting participants from various locales within the said regions. The participants were encouraged to engage in dialogues with our interviewer. Overall, DiaSet consists of 644.8K tokens and 23.2K annotated instances. Uniform writing standards were upheld during the transcription process. Additionally, we established baseline models by leveraging some of the pre-existing Arabic BERT language models, showcasing the potential applications and efficiencies of our dataset. We make DiaSet publicly available for further research.