Abstract:
Speech disorders can significantly impact speakers’ articulation, resulting in large variations in speech. These variations can degrade the performance of Automatic Speech Recognition (ASR), limiting the access of individuals with speech disorders to the benefits provided by this technology. Previous research on human speech perception has shown that both auditory and articulatory information play important roles, with the latter being more effective when the input speech is distorted. When a sound is perceived, the brain processes its auditory features and activates neural simulations of the articulatory movements associated with that sound. Throughout this process, acoustic and articulatory information often enhance each other, improving the overall comprehension and processing of the auditory stimulus. Motivated by these findings, this study proposes an Inclusive Gestural Feature Extraction (InGesFE) method and a Multi-Feature Co-Activation Module (MF-CoAct) to address the challenge of large variability in dysarthric ASR. The InGesFE method extracts features under a richness constraint and a phoneme distinctiveness constraint, enabling them to share key characteristics with articulatory gestures: (1) they capture rich aspects of the input speech, (2) they are phonemically distinctive, and (3) they are robust in conveying intent. Meanwhile, MF-CoAct facilitates the co-activation of auditory and articulatory (gestural) features through a statistical variable-based activation network. Additionally, a continual pre-training method is designed to support faster and more effective adaptation to highly variable speech. To evaluate the effectiveness of the proposed method, two widely used dysarthria datasets, TORGO and UASpeech, are employed. Across both datasets, our approach led to a relative word error rate reduction (WERR) of 13.75%–15.37% for single-word recognition and 36.48% for multiword recognition compared to the baseline. It outperformed existing methods for speakers with severe dysarthria and very low intelligibility, reaching a word error rate (WER) of 51.41% on the UASpeech dataset. It also demonstrated increased robustness in noisy environments, achieving a 19.16% WERR in single-word recognition and a 38.49% WERR in multiword recognition under noisy conditions. Further analysis indicates that the features extracted by InGesFE capture richer articulatory information than auditory features alone, particularly improving the representation of co-articulatory cues. © 2025 Elsevier B.V.
Source:
Information Fusion
ISSN: 1566-2535
Year: 2026
Volume: 125
Impact Factor: 14.800 (JCR 2023)