Steven Au
2026
MIDI-PHOR: Multi-View Distillation for Music Understanding and Captioning
Steven Au
Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026)
Text-only training is a promising new method for training multimodal machine learning models without data from every modality. However, few studies have explored its use as an approximation of missing data for supervised learning in data-scarce environments. In this work, we examine techniques to acquire text-based training data, address the modality gap, and present a case study on classifying subjective audio timbre descriptions based on three kinds of text-only training data and six augmentation methods on eight audio-timbre datasets. We find that text-only training successfully produces supervised audio classifiers, trained without audio, that compete with a zero-shot baseline and with training on real audio.
2025
Personalized Graph-Based Retrieval for Large Language Models
Steven Au | Cameron Dimacali | Ojasmitha Pedirappagari | Namyong Park | Franck Dernoncourt | Yu Wang | Nikos Kanakaris | Hanieh Deilamsalehy | Ryan A. Rossi | Nesreen K. Ahmed
Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation
2024
UCSC NLP at SemEval-2024 Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF)
Neng Wan | Steven Au | Esha Ubale | Decker Krogh
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
We describe SemEval-2024 Task 10: EDiReF, which consists of three subtasks involving emotion in conversation across Hinglish code-mixed and English datasets. Subtasks include classification of speaker emotion in multiparty conversations (Emotion Recognition in Conversation) and reasoning around shifts in speaker emotion state (Emotion Flip Reasoning). We deployed a BERT model for emotion recognition and two GRU-based models for emotion flip reasoning. Our models achieved F1 scores of 0.45, 0.79, and 0.68 for subtasks 1, 2, and 3, respectively.