The Evolution of Speech Recognition

From Garbled Glitches to Pocket Professors

If Speech Synthesis (TTS) is the "mouth" of CALL, then Automatic Speech Recognition (ASR) is its "ears". For years, these "ears" were notoriously hard of hearing, especially when faced with the accented, hesitant speech of a language learner (and in some cases, they still are!). However, the last five years have seen a paradigm shift. We have moved from simple "pattern matching", where the computer merely checks whether you said a specific word, to systems that can actually evaluate the quality of your communication, such as in Computer-Assisted Pronunciation Training (CAPT).
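That shift can be sketched in a few lines of code: the old approach asks a yes/no question, while the newer one returns a graded score. Both functions below are toy illustrations (simple word overlap standing in for the richer acoustic measures real CAPT systems use), not the API of any actual tool.

```python
# Toy contrast between old-style "pattern matching" and modern
# quality evaluation. All names here are illustrative, not a real API.

def pattern_match(transcript: str, target: str) -> bool:
    """Old approach: did the learner say exactly the target phrase?"""
    return transcript.strip().lower() == target.strip().lower()

def quality_score(transcript: str, target: str) -> float:
    """Newer approach: a graded score (0-1) based on word overlap,
    standing in for the acoustic measures real CAPT systems use."""
    said = transcript.lower().split()
    want = target.lower().split()
    if not want:
        return 1.0
    hits = sum(1 for word in want if word in said)
    return hits / len(want)

# One grammatical slip fails the all-or-nothing check entirely...
print(pattern_match("i goes to school", "I go to school"))   # False
# ...but a graded score still credits the learner for what went right.
print(quality_score("i goes to school", "I go to school"))   # 0.75
```

The point of the graded version is pedagogical: partial credit tells a learner they are close, where a bare "wrong" does not.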

The Transformer Revolution and Accuracy

The biggest technical leap since 2021 has been the widespread adoption of Transformer-based models. Unlike older systems that processed sound as a strictly linear sequence, Transformers allow for parallel processing of audio data, meaning the AI can look at the "whole" sentence to understand context. As explored in Speech Recognition Transformers, this has drastically reduced Word Error Rates (WER) in noisy environments, previously the bane of classroom-based CALL tools.
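Since WER is the accuracy yardstick here, it is worth seeing how it is computed: the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the reference length. This is the standard textbook formulation, not code from any particular ASR toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with standard edit-distance dynamic programming."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# The ASR dropped one word out of six: WER = 1/6 ≈ 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Note that WER can exceed 1.0 when the system hallucinates extra words, which is one reason it behaves so badly on the hesitant, restart-heavy speech of language learners.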

Impact on Pronunciation and Anxiety

Recent empirical evidence suggests that ASR is no longer just a gimmick. A 2023 study found that consistent, proper use of ASR technology with immediate personalised feedback produced significant improvement in both the accentedness and the general comprehensibility of English as a Foreign Language (EFL) learners. Perhaps more importantly, it addressed the "Affective Filter" I mentioned in my second blog post: learners report that the immediate, private feedback from an ASR system allows them to "fail safely", meaning they can practise without the feeling that a human listener is judging their use of the language.

However, a more recent study highlights some important observations which bring forward an interesting paradox. While the use of ASR for CAPT improves phoneme-level accuracy (how you say your 'p's and 'b's), it can lower speaking confidence if the feedback is too critical, too negative, or overly "game-ified". This suggests that the next five years of CALL design shouldn't just focus on "perfect" recognition, but on "supportive" recognition.

The Move Toward Corrective Feedback

We are now seeing ASR usage move beyond simple transcription. Modern systems are being integrated with Automated Corrective Feedback (ACF). According to research comparing different ASR-based systems, the most effective tools are those that provide "phonetic-level" feedback (pointing out exactly how the tongue and mouth should be positioned) rather than just giving a global "Correct/Incorrect" score [see previous blog]. This bridges the gap between simply being understood by a machine and actually improving as a speaker.
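To make that contrast concrete, here is a toy sketch of the two feedback styles. The phoneme scores, threshold, and articulation tips are entirely hypothetical illustrations of the idea, not output from any real ACF system.

```python
# Hypothetical articulation hints, keyed by phoneme symbol.
ARTICULATION_TIPS = {
    "p": "Close both lips fully, then release with a small burst of air.",
    "b": "Same lip closure as /p/, but let your vocal cords vibrate.",
}

def global_feedback(phoneme_scores: dict, threshold: float = 0.7) -> str:
    """Old style: one coarse Correct/Incorrect verdict for the whole word."""
    avg = sum(phoneme_scores.values()) / len(phoneme_scores)
    return "Correct" if avg >= threshold else "Incorrect"

def phonetic_feedback(phoneme_scores: dict, threshold: float = 0.7) -> list:
    """Newer style: a per-phoneme tip on what the mouth should do."""
    tips = []
    for phoneme, score in phoneme_scores.items():
        if score < threshold:
            tip = ARTICULATION_TIPS.get(phoneme, "Listen and repeat this sound.")
            tips.append(f"/{phoneme}/ ({score:.0%}): {tip}")
    return tips

# Invented per-phoneme confidence scores for a learner saying "pub".
scores = {"p": 0.45, "b": 0.9}
print(global_feedback(scores))        # "Incorrect" tells the learner nothing
for line in phonetic_feedback(scores):
    print(line)                        # ...but this says exactly what to fix
```

The design point is that the global verdict discards exactly the information (which sound failed, and why) that the learner needs in order to improve.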

Towards The Future

As we look toward the next steps, the challenge for ASR in CALL remains the "long tail" of diversity: capturing the speech of children, non-standard dialects, and highly emotional speech. While we've finally got a working model of the "ears", we are still far from a complete system that handles all of these cases.