The State of Modern AI Text To Speech Systems for Screen Reader Users, received 2026 01 06

Colin Howard

Jan 6, 2026, 2:18:44 AM
to post AVIP list
Greetings,

Please read this to its end; maybe you have opinions to express. I plan to
stay with what I'm already using until matters are finally settled.

I wonder what the future will be for software such as the Guide programs.

Sam's Stuff - Sunday, January 4

If you're not a screen reader user yourself, you might be surprised to learn
that the text to speech technology used by most blind people hasn't changed
in the last 30 years. While text to speech has taken the sighted world by
storm, in everything from personal assistants to GPS to telephone systems,
the voices used by blind folks have remained mostly static. This is largely
intentional. The needs of a blind text to speech user are vastly different
from those of a sighted user. While sighted users prefer voices that are
natural, conversational, and as human-like as possible, blind users tend to
prefer voices that are fast, clear, predictable, and efficient. This results
in a preference among blind users for voices that sound somewhat robotic,
but can be understood at high rates of speed, often upwards of 800 to 900
words per minute. The speaking rate of an average person hovers around 200
to 250 words per minute, for comparison.

Unfortunately, this difference in needs has resulted in blind people getting
left out of the explosion of text to speech advancement, and has caused many
problems. First, the voice that is preferred by the majority of western
English blind users, called Eloquence, was last updated in 2003. While it is
so overwhelmingly popular that even Apple was eventually pressured to add
the voice to iPhone, Mac, Apple TV, and Apple Watch, even they were forced
to use an emulation layer. As Eloquence is a 32-bit voice last compiled in
2003, it cannot run in modern software without some sort of emulation or
bridge. Whether the source code to Eloquence still exists and can be
compiled is an open question; even large companies like Apple haven't
managed to find or compile it. As
the NVDA screen reader moves from being a 32-bit application to a 64-bit
one, keeping Eloquence running with it has been a challenge that I and many
other community members have spent a lot of time and effort solving. The
Eloquence libraries also have many known security issues, and anyone using
the libraries today is forced to understand and program around them, as
Eloquence itself can never be updated or fixed. These stopgap solutions are
entirely untenable, and are likely to take us only so far. A better solution
is urgently needed.
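
To make the bridging approach concrete, here is a rough sketch of the
pattern in Python: the 64-bit process spawns a 32-bit helper hosting the
legacy DLL and exchanges length-prefixed messages with it over pipes. The
helper executable name and the wire format are illustrative assumptions on
my part, not the code NVDA or any existing addon actually uses.

    import struct
    import subprocess

    class EloquenceBridge:
        """Sketch of a 64-bit front end driving a 32-bit helper process."""

        def __init__(self, helper_path="eloq_host32.exe"):
            # Hypothetical 32-bit helper that loads the legacy DLL and
            # serves requests over stdin/stdout.
            self.proc = subprocess.Popen(
                [helper_path],
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
            )

        def speak(self, text):
            payload = text.encode("utf-8")
            # Length-prefixed request so the helper knows how much to read.
            self.proc.stdin.write(struct.pack("<I", len(payload)) + payload)
            self.proc.stdin.flush()
            # The helper replies with a length-prefixed block of PCM audio.
            (size,) = struct.unpack("<I", self.proc.stdout.read(4))
            return self.proc.stdout.read(size)

Every layer like this adds latency and another process to babysit, which is
part of why these stopgaps can only take us so far.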

The second problem this has caused is for those who speak languages other
than English. As most modern text to speech voices are created by and for
sighted users, blind users find that the voices available in less
popular languages are inefficient, overly conversational, slow, and
otherwise unsatisfactory. While espeak-ng is an open-source text to speech
system that attempts to support hundreds of languages while meeting the
needs of blind users, it brings a different set of problems to the table.
First, many of the languages it supports were added based on pronunciation
rules taken from Wikipedia articles, without involving speakers of the
language. Second, espeak-ng is based directly on Speak, a text to speech
system written by Jonathan Duddington in 1995 for RISC OS on Acorn
computers,
meaning that espeak users today continue to have to live with many of the
design decisions made back in 1995 for an operating system that no longer
exists. Third, looking at the espeak-ng repository, it seems to have only
one or two active maintainers. While this is obviously better than the zero
active maintainers of Eloquence, it could still become a problem in the
future.

These are the reasons that I'm always interested in advancements in text to
speech, and am actively keeping my ears open for something that takes
advantage of modern technology, while continuing to suit the needs of screen
reader users like myself.

Over the holiday break, I decided to take a look at two modern AI-based text
to speech systems, and see if they could be added to NVDA. I chose two
models because they advertised themselves as fast, responsive, and able to
run without a GPU. The first was Supertonic, and the second was Kitten
TTS. As both models require 64-bit Python, I wrote the addons for the 64-bit
alpha of NVDA. However, other than making development easier, this had
little effect on the results.
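
For context, an NVDA synth driver addon boils down to a Python class that
NVDA discovers by name. The skeleton below shows the overall shape; the
model hand-off is elided, and this is a simplified sketch rather than the
actual code in my addons.

    from synthDriverHandler import SynthDriver as BaseSynthDriver

    class SynthDriver(BaseSynthDriver):
        # NVDA looks for a class named SynthDriver in the driver module.
        name = "kittenTTS"  # internal identifier; illustrative
        description = "Kitten TTS (experimental)"

        @classmethod
        def check(cls):
            # Report availability; a real driver would verify that the
            # bundled model and its dependencies can be imported.
            return True

        def speak(self, speechSequence):
            # speechSequence mixes plain strings with command objects
            # such as index markers and prosody changes.
            text = " ".join(s for s in speechSequence if isinstance(s, str))
            # ...hand `text` to the model and queue the returned audio...

        def cancel(self):
            # Called constantly as the user navigates; must silence and
            # discard any pending audio immediately.
            pass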

Unfortunately, doing this work uncovered a number of issues that I believe
are common to all of the modern AI-based text to speech systems, and make
them unsuitable for use in screen readers. The first issue is dependency
bloat. In order to bundle these systems as NVDA addons, developers are
required to include a vast multitude of large and complex Python packages.
In the case of Kitten TTS, the number is around 103, and just over 30 for
Supertonic. As the standard building and packaging methods for NVDA addons
do not support specifying and building requirements, these dependencies need
to be manually copied over, included in any GitHub repositories, and cannot
be automatically updated. Loading all of these dependencies directly into
NVDA also causes the screen reader to load slower, use more system
resources, and opens NVDA users up to any security issue in any of these
libraries. As a screen reader needs access to the entire system, this is far
from ideal.
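
For those unfamiliar with the workflow, the manual vendoring step looks
roughly like this; the sketch assumes the addon keeps a requirements.txt,
since NVDA's addon tooling has nothing equivalent built in:

    import subprocess
    import sys

    # Copy every dependency into the addon's driver directory so NVDA can
    # import them without any package manager at runtime.
    subprocess.check_call([
        sys.executable, "-m", "pip", "install",
        "--target", "addon/synthDrivers/kittenTTS/deps",  # illustrative path
        "--only-binary", ":all:",  # wheels only; users have no compiler
        "-r", "requirements.txt",
    ])

The resulting deps tree then has to be committed and shipped inside the
addon, which is exactly the bloat described above.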

The second issue is accuracy. These modern systems are developed to sound
human, natural, and conversational. Unfortunately this seems to come at the
expense of accuracy. In my testing, both models had a tendency to skip
words, read numbers incorrectly, chop off short utterances, and ignore
prosody hints from text punctuation. Kitten TTS is slightly better here, as
it uses a deterministic phonemizer (the same one used by espeak, actually)
to determine the correct way to pronounce words, leaving only the generation
of the speech itself up to AI. But nevertheless, Kitten TTS is still far
from perfectly accurate. When it comes to use in a screen reader, skipping
words, or reading numbers incorrectly, is unacceptable.
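
To illustrate the split, the deterministic half of a Kitten-TTS-style
pipeline can be reproduced with the open-source phonemizer package and its
espeak backend, leaving only phoneme-to-audio to the neural network. (This
demonstrates the approach; it is not Kitten TTS's actual code.)

    from phonemizer import phonemize

    # Text-to-phoneme conversion is rule-based and repeatable: the same
    # input always produces the same IPA string, so nothing is skipped or
    # misread at this stage. Remaining errors come from the audio model.
    print(phonemize("Read chapter 21 aloud.", language="en-us",
                    backend="espeak", strip=True))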

The third issue is speed. Supertonic has the edge here, but even it is far
too slow. Unlike older text to speech systems, Supertonic and Kitten TTS
cannot begin generating speech until they have an entire chunk of text.
Supertonic is slightly faster, as it can stream the resulting audio as it becomes
available, whereas Kitten TTS cannot start speaking until all of the audio
for the chunk is fully generated. But for use in a screen reader, a text to
speech system needs to begin generating speech as quickly as possible,
rather than waiting for an entire phrase or sentence. Users of screen
readers quickly jump through text and frequently interrupt the screen
reader, and thus require the text to speech system to be able to quickly
discard and restart speech.
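
Concretely, the control flow a screen reader needs looks something like the
sketch below: a worker that abandons the current utterance the moment the
user presses a key. All names here are illustrative; the point is the
cancellation check inside the streaming loop, which a model that only
produces finished chunks has no equivalent of.

    import queue
    import threading

    class SpeechWorker:
        def __init__(self, synthesize, play):
            # synthesize(text) yields audio chunks as they are generated;
            # play(chunk) sends one chunk to the sound device. Both are
            # supplied by the caller in this sketch.
            self._q = queue.Queue()
            self._cancel = threading.Event()
            self._synthesize = synthesize
            self._play = play
            threading.Thread(target=self._run, daemon=True).start()

        def say(self, text):
            self._q.put(text)

        def stop(self):
            # The user moved on: flush everything queued and abandon the
            # utterance currently being generated.
            self._cancel.set()
            while not self._q.empty():
                self._q.get_nowait()

        def _run(self):
            while True:
                text = self._q.get()
                self._cancel.clear()
                for chunk in self._synthesize(text):
                    if self._cancel.is_set():
                        break  # discard mid-stream; batch-only models can't
                    self._play(chunk)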

The fourth and final issue is control. Older text to speech systems make
changing the pitch, speed, volume, breathiness, roughness, headsize, and
other parameters of the voice easy. This allows screen reader users to
customize the voice to our exact needs, as well as offering the ability to
change the characteristics of the voice in real time based on the formatting
or other attributes of the text. AI text to speech models, being trained on
data from a particular set of speakers, cannot offer this customization.
Instead, they inherit the speaking speed, pitch, volume, and other
characteristics that were present in the training data. Kitten TTS and
Supertonic both offer basic speed control, but it is highly variable
from voice to voice and utterance to utterance. This leads to a loss of
functionality that many blind users depend on.
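
For comparison, giving NVDA full real-time control is almost free in a
classic driver: extend the skeleton shown earlier with a tuple of settings
and map each one onto an engine parameter. The setting classes are NVDA's
real mechanism; the engine attribute below is a placeholder. An AI model
simply has nothing to map most of these onto.

    from synthDriverHandler import SynthDriver as BaseSynthDriver

    class SynthDriver(BaseSynthDriver):
        name = "classicSketch"
        description = "Classic formant synth (sketch)"

        # NVDA exposes each of these as a control the user can adjust live.
        supportedSettings = (
            BaseSynthDriver.RateSetting(),
            BaseSynthDriver.PitchSetting(),
            BaseSynthDriver.InflectionSetting(),
            BaseSynthDriver.VolumeSetting(),
        )

        def _set_pitch(self, value):
            # value is 0-100 from NVDA; a formant engine applies it to the
            # very next phoneme, with no retraining or re-inference.
            self._engine_pitch = value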

If you'd like to experience these issues for yourself, feel free to follow
the links above to my GitHub repositories. They offer ready-to-install
addons for the 64-bit NVDA alphas.

I'm picking on Kitten TTS and Supertonic not because they're particularly
bad offenders on the above problems, but because they're the state of the
art in AI text to speech right now when it comes to speed and
size. Other models, like Kokoro, exhibit all of the same issues, but more
so.

So what's the way forward for blind screen reader users? Sadly, I don't
know. Modern text to speech research has little to no overlap with our
requirements. Using Eloquence, the system that many blind people find best,
is becoming increasingly untenable. espeak-ng uses an odd architecture
originally designed for computers in 1995, and has few maintainers. Blastbay
Studios has done some interesting work to create a text to speech voice
using modern design and technology that meets the requirements of blind
users. But it's a closed-source product with a single maintainer, and it
also suffers from a lack of pronunciation accuracy. In an ideal world,
someone
would re-implement Eloquence as a set of open source libraries. However,
doing so would require expertise in linguistics, digital signal processing,
and audiology, as well as excellent programming abilities. My suspicion is
that modernizing the text to speech stack that is preferred by blind
power-users is an effort that would require several million dollars of
funding at minimum. Instead, we'll probably wind up having to settle for
text to speech voices that are "good enough", while being nowhere near as
fast and efficient as what we have currently. Personally, I intend to keep
Eloquence limping along for as long as I can, until the layers of required
emulation and bridges make real time use impossible. Perhaps at that point
AI will be good enough that it can be prompted to create a text to speech
system that's up to our standards. Or, more hopefully, articles like this
one may bring attention to the issues, and bring our community together to
recognize the problems and find solutions.


https://stuff.interfree.ca/2026/01/05/ai-tts-for-screenreaders.html

David Goldfield,

Blindness Assistive Technology Specialist

http://www.DavidGoldfield.com

Director of Marketing,

Blazie Technologies

http://www.BlazieTech.com

JAWS Certified, 2022

NVDA Certified Expert

Subscribe to the Tech-VI announcement list to receive blindness technology
news, events and information.

Email: tech-vi+...@groups.io

http://www.DavidGoldfield.com

Colin Howard, living in Southern England, hopes you and your family,
acquaintances and friends are enjoying a peaceful, prosperous and happy
2026.
