Liang Xie
2025
A Learning-based Multi-Frame Visual Feature Framework for Real-Time Driver Fatigue Detection
Liang Xie | Songlin Fan
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Driver fatigue is a significant factor contributing to road accidents, highlighting the need for reliable and accurate detection methods. In this study, we introduce a novel learning-based multi-frame visual feature framework (LMVFF) designed for precise fatigue detection. Our methodology comprises several clear and interpretable steps. First, 68 facial landmarks are detected, enabling the computation of eye and lip distance measures and the estimation of head rotation angles. Subsequently, visual features from the eye region are extracted, and an effective visual model is developed to accurately classify eye openness. Additionally, features characterizing lip movements are analyzed to detect yawning, enriching fatigue detection through continuous monitoring of eye blink frequency, yawning occurrences, and head movements. Compared to conventional single-feature detection approaches, LMVFF significantly reduces instances of fatigue misidentification. Moreover, we apply quantization and compression techniques to multiple computation stages, substantially reducing system latency and achieving a real-time frame rate of 25-30 FPS for practical applications.
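The per-frame landmark-distance stage can be made concrete with a short sketch. The snippet below is a minimal illustration only, assuming the standard dlib/iBUG 68-point landmark convention; the fixed aspect-ratio thresholds (ear_thresh, mar_thresh) are placeholders standing in for the paper's learned eye-openness and lip-movement models, and head-rotation estimation is omitted.

```python
# Minimal sketch of landmark-based eye/mouth measures, assuming the standard
# dlib/iBUG 68-point layout; the aspect-ratio thresholds below stand in for
# the paper's learned eye-openness and lip-movement classifiers.
import numpy as np

LEFT_EYE = slice(36, 42)     # 6 landmarks per eye
RIGHT_EYE = slice(42, 48)
INNER_MOUTH = slice(60, 68)  # 8 inner-lip landmarks

def eye_aspect_ratio(eye):
    # eye: (6, 2) array; mean vertical opening over horizontal extent
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return (v1 + v2) / (2.0 * h)

def mouth_aspect_ratio(mouth):
    # mouth: (8, 2) inner-lip array; a large ratio suggests a yawn
    v = np.linalg.norm(mouth[2] - mouth[6])   # points 62 and 66
    h = np.linalg.norm(mouth[0] - mouth[4])   # points 60 and 64
    return v / h

def frame_measures(landmarks, ear_thresh=0.21, mar_thresh=0.6):
    """landmarks: (68, 2) array for one frame -> (eyes_closed, yawning)."""
    ear = 0.5 * (eye_aspect_ratio(landmarks[LEFT_EYE]) +
                 eye_aspect_ratio(landmarks[RIGHT_EYE]))
    mar = mouth_aspect_ratio(landmarks[INNER_MOUTH])
    return ear < ear_thresh, mar > mar_thresh
```

Counting how often eyes_closed and yawning fire over a sliding window of frames then yields the blink-frequency and yawn-frequency statistics that the framework monitors alongside head movements.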
2024
Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization
Linzhi Wu | Xingyu Zhang | Yakun Zhang | Changyan Zheng | Tiejun Liu | Liang Xie | Ye Yan | Erwei Yin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Lip reading, the process of interpreting silent speech from visual lip movements, has gained increasing attention for its wide range of realistic applications. Deep learning approaches have greatly improved current lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, remains challenging due to inter-speaker variability: a well-trained lip reading system may perform poorly when handling a brand-new speaker. To learn a speaker-robust lip reading model, a key insight is to reduce visual variation across speakers and avoid overfitting to specific speakers. In this work, building on a hybrid CTC/attention architecture, we address both the input visual cues and the latent representations. We propose to exploit lip landmark-guided fine-grained visual cues instead of the frequently used mouth-cropped images as input features, diminishing speaker-specific appearance characteristics. Furthermore, a max-min mutual information regularization approach is proposed to capture speaker-insensitive latent representations. Experimental evaluations on public lip reading datasets demonstrate the effectiveness of the proposed approach under both intra-speaker and inter-speaker conditions.
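One way to make the max-min mutual information idea concrete is sketched below. This is a hedged illustration, not the authors' implementation: it uses a MINE-style lower bound on the mutual information between the latent lip-reading representation and a speaker embedding, with illustrative names (MICritic, rep_dim, spk_dim) and dimensions chosen for the example.

```python
# A MINE-style sketch of a max-min mutual information regularizer (PyTorch).
# Names, dimensions, and the critic architecture are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn

class MICritic(nn.Module):
    """Scores (latent representation, speaker embedding) pairs."""
    def __init__(self, rep_dim, spk_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(rep_dim + spk_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, rep, spk):
        return self.net(torch.cat([rep, spk], dim=-1)).squeeze(-1)

def mi_lower_bound(critic, rep, spk):
    # Joint term: matched (representation, speaker) pairs.
    joint = critic(rep, spk).mean()
    # Marginal term: speaker embeddings shuffled across the batch.
    shuffled = spk[torch.randperm(spk.size(0))]
    marginal = (torch.logsumexp(critic(rep, shuffled), dim=0)
                - torch.log(torch.tensor(float(spk.size(0)))))
    return joint - marginal

# Max-min use: ascend this bound w.r.t. the critic parameters (max step), and
# add a weighted copy of it to the encoder's CTC/attention loss so the encoder
# descends it (min step), discouraging speaker information in the latents.
```

In this kind of setup the critic tightens the mutual-information estimate while the lip-reading encoder is pushed to reduce it, which is the general mechanism by which a max-min MI regularizer encourages speaker-insensitive latent representations.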