
Getting AI to speak Māori — well, sort of

When we hear computers or AI try to output Māori speech, they usually butcher the pronunciation. This was an issue for my automated blog audio system, so here’s the solution I found.


What’s the problem?

Most text-to-speech (TTS) systems cannot output Māori speech correctly.

It’s always a terrible Americanised or British-sounding attempt that falls drastically short of what is acceptable.

This blog now has a seriously cool automated audio generator for all my future blog posts. But I wanted to ensure Māori text sounded good, too.

So I got to thinking 🤔

What does it usually sound like?

NZ English speakers generally mix English and Māori together, which is even more challenging for computer-based TTS to handle.

Let’s take this example sentence:

Kia ora koutou! How is everyone going? We’re here today to talk about making Aotearoa more accessible for tāngata whaikaha, tāngata whaikaha Māori, and their whānau. Ka kite anō, ngā mihi nui.

Here’s how typical TTS would announce this:

And here’s how my new audio generation system sounds:

A pretty impressive improvement, I think.

How did people solve this before?

Well, usually people wouldn’t solve this problem. We’d be stuck with butchered Māori pronunciation on most TTS outputs.

A naïve solution is to “massage” the English text input by sending “phonetic English” to the TTS, to try and coax it into better pronunciation. For example, you’d send “Fye-kar-har” instead of “Whaikaha”, because most TTS engines don’t realise the “Wh” in Māori is an “f” sound.

This solution rarely works properly. It’s very manual, and requires a lot of trial and error just to get semi-passable results. Not good enough for me.

Background on local neural speech generation

Recently, I’ve been tinkering with local neural text-to-speech engines. I was sick of paying money to ElevenLabs to create speech outputs that kinda sucked.

I discovered there are many free and open-source alternatives that are perfectly capable of running on my modern-ish Apple M3 processor.

This was an excellent discovery, because it meant my blog’s audio generation system was now completely free to run, and sounded great (for English text).

How my speech generation pipeline works

Firstly, my blog posts are stored as Markdown, with a bit of HTML mixed in where necessary.

This is ideal, because it means my blog posts are basically just text — easy for TTS to handle.

The Markdown is cleaned up by some JavaScript to ensure it’s a simple string of text for TTS to handle. Images get replaced by their alt text, etc.
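
My actual cleanup code is JavaScript, but the idea is simple enough to sketch. Here’s an illustrative, cut-down version in Python (the regexes and function name are for explanation, not my real implementation):

```python
import re

def markdown_to_plain_text(markdown: str) -> str:
    """Reduce a Markdown blog post to a simple string of text for TTS."""
    text = markdown
    # Replace images with their alt text: ![alt](url) -> alt
    text = re.sub(r'!\[([^\]]*)\]\([^)]*\)', r'\1', text)
    # Unwrap links, keeping only the visible text: [text](url) -> text
    # (this runs before any [word](/ipa/) phoneme notation is injected)
    text = re.sub(r'\[([^\]]+)\]\([^)]*\)', r'\1', text)
    # Strip heading markers, emphasis, and inline code
    text = re.sub(r'^#{1,6}\s*', '', text, flags=re.MULTILINE)
    text = re.sub(r'[*_`]', '', text)
    # HTML (like lang="mi" spans) is kept at this stage; it's needed later
    return text.strip()
```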

Then, that text is piped to Misaki G2P. Misaki is a grapheme-to-phoneme converter.

When Misaki G2P converts the text from graphemes to phonemes, it basically outputs the text using the International Phonetic Alphabet (IPA). The IPA is a way of writing down how speech sounds, rather than how words are spelled.
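
To make that concrete, here’s roughly what driving Misaki looks like, based on its documented Python API (my pipeline wires this up a little differently, so treat it as a sketch):

```python
from misaki import en, espeak

# Misaki's English G2P, with espeak-ng as the fallback for words it doesn't know
fallback = espeak.EspeakFallback(british=False)
g2p = en.G2P(trf=False, british=False, fallback=fallback)

phonemes, tokens = g2p('Kia ora koutou! How is everyone going?')
print(phonemes)  # the sentence rendered as phonemes, ready for Kokoro
```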

I then pipe the phonemes to Kokoro-82M, which can be found on Hugging Face 🤗.

Kokoro is an open-weight TTS model — this basically means you can download the model and run it locally on your computer — no more paying tech billionaires for expensive AI subscriptions.

Kokoro’s main feature is it’s extremely lightweight — you can run this thing on a laptop and it’s seriously fast.

So, Kokoro gets fed a bunch of phonemes, then it outputs silky-smooth AI-generated speech in either a British or American accent. It works amazingly well, and sounds very natural and human-like, unlike older TTS systems.
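
For reference, here’s roughly what feeding Kokoro looks like using its Python package, following its documented example (af_heart is just one of the stock American voices):

```python
from kokoro import KPipeline
import soundfile as sf

pipeline = KPipeline(lang_code='a')  # 'a' selects American English

text = 'Kia ora koutou! How is everyone going?'
generator = pipeline(text, voice='af_heart', speed=1)

for i, (graphemes, phonemes, audio) in enumerate(generator):
    sf.write(f'segment_{i}.wav', audio, 24000)  # Kokoro outputs 24 kHz audio
```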

But… it struggles with Māori speech. Damn 😲

So how did I fix this?

Well, when looking at the pipeline, I realised that the issue is that Misaki G2P (the thing that converts English text into phonemes) probably has no idea how to convert Māori words into the correct phonemes.

I then formed a hypothesis: Kokoro, the AI model that takes the phonemes and outputs audio, might be able to handle being fed corrected Māori phonemes and output well-pronounced Māori speech.

Looking further at the pipeline, I saw that espeak-ng is used as a fallback G2P. Basically, where Misaki G2P doesn’t know how to convert something to phonemes, it hands the job over to espeak-ng instead.

From past experience, I knew that espeak-ng (an open-source TTS engine) does output Māori speech, albeit highly croaky and robotic speech. espeak-ng is not just a G2P converter; it also outputs audible speech, but my god it sounds terrible.

But what if we fed Māori words to just espeak-ng’s G2P system, took the phoneme output, and fed that to Kokoro? This way, we’re taking the part of espeak-ng that is amazing (its Māori G2P), and using Kokoro to handle the conversion of phonemes into audible speech. Sounds like a plan.
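
Testing that hypothesis is easy, because espeak-ng’s G2P is available from the command line. Here’s a small Python wrapper around it (the exact phoneme string you get back may vary between espeak-ng versions):

```python
import subprocess

def maori_to_ipa(text: str) -> str:
    """Ask espeak-ng for the IPA of some Māori text, without generating audio."""
    result = subprocess.run(
        ['espeak-ng', '-q', '-v', 'mi', '--ipa', text],  # -q quiet, -v mi Māori voice
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

print(maori_to_ipa('Whaikaha'))  # prints something like 'fˌaikˈaha'
```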

I then asked my trusty buddy Claude Opus 4.7 to code this up.

In my blog posts, I mark up any Māori text with the appropriate lang attributes, which are used to detect whether text should be sent to espeak-ng’s Māori G2P instead of Misaki.

I do not rely on lang="mi" alone. Some words and phrases are part of normal NZ English usage, so I handle those with a manual override system instead. That lets me replace particular words and phrases globally with phoneme overrides, rather than depending entirely on HTML lang attribute markup.

Once espeak-ng has converted each piece of Māori text into phonemes, I inject those phonemes into the blog post using a notation that allows you to override Misaki G2P’s guess at Māori pronunciation. For example, “Whaikaha” would be changed to [Whaikaha](/fIkˈɑhɑ/). The part in square brackets is the original word, and the stuff in parentheses is the pronunciation in IPA that Kokoro will use when generating speech.
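
Put together, the injection step looks something like this sketch, reusing the maori_to_ipa wrapper from earlier (real HTML parsing is messier than a regex, and the function name is illustrative, not my actual code):

```python
import re

def inject_maori_phonemes(html: str) -> str:
    """Swap lang="mi" spans for [text](/phonemes/) pronunciation notation."""
    def replace(match: re.Match) -> str:
        maori = match.group(1)
        ipa = maori_to_ipa(maori)  # espeak-ng G2P, from the earlier sketch
        return f'[{maori}](/{ipa}/)'
    return re.sub(r'<span lang="mi">(.*?)</span>', replace, html)

html = 'We work with <span lang="mi">Whaikaha</span> on accessibility.'
print(inject_maori_phonemes(html))
# We work with [Whaikaha](/.../) on accessibility.
```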


But wait, there was a problem

I discovered that espeak-ng outputs standard IPA — the standardised way we write down how language sounds. But, annoyingly, Kokoro’s AI model was not trained on standard IPA, it uses a strange custom variant.

What makes this worse is that it’s an easy problem to miss. Kokoro doesn’t give errors when it sees IPA characters it doesn’t understand; it silently drops that phoneme. So much of espeak-ng’s phoneme output was vanishing into the abyss, and I didn’t realise.

I thought, well, Misaki clearly uses espeak-ng as a fallback, so how did they handle this problem?

So I pulled the Misaki and Kokoro source code, and got Claude Opus to delve into how the two talk to each other. It turns out Misaki, since it uses espeak-ng itself as a fallback, already has translation tables for converting espeak-ng’s output (including its diphthongs) into Kokoro’s phoneme set. I re-implemented those tables in my speech synthesis pipeline.

This ensured Kokoro was receiving phonemes in the format it expected, and the Māori text sounded even better.
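
To give a flavour of what that translation layer does: Kokoro’s vocabulary writes diphthongs as single characters, so espeak-ng’s two-character IPA diphthongs need collapsing. The snippet below is a cut-down illustration based on Misaki’s tables, not the complete mapping:

```python
# A few of the espeak-ng -> Kokoro substitutions (an illustrative subset).
ESPEAK_TO_KOKORO = {
    'eɪ': 'A',   # as in "say"
    'aɪ': 'I',   # as in "sky" -- the I you can see in /fIkˈɑhɑ/ above
    'aʊ': 'W',   # as in "how"
    'ɔɪ': 'Y',   # as in "boy"
    'oʊ': 'O',   # as in "go" (American)
    'əʊ': 'Q',   # as in "go" (British)
}

def espeak_ipa_to_kokoro(ipa: str) -> str:
    """Rewrite espeak-ng IPA into the phoneme set Kokoro was trained on."""
    for espeak_sym, kokoro_sym in ESPEAK_TO_KOKORO.items():
        ipa = ipa.replace(espeak_sym, kokoro_sym)
    return ipa
```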

Sometimes, manual correction is required

Unfortunately, even all the above effort is not enough to pronounce some words correctly. I’d say it sounds good about 85% of the time, and the other 15% of words need case-by-case intervention.

To deal with this, I added a manual override system. It lets me specify the IPA for particular words or phrases on a case-by-case basis. To generate better override IPA, I tell Claude what the mispronunciation sounded like, and Claude makes an educated guess at a better IPA for Kokoro — and this generally gives me great results.
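
Conceptually, the override table is just a word-to-phonemes lookup applied before anything reaches the G2P step. Something like this (the entries shown are examples; Whaikaha uses the phonemes from earlier):

```python
# Word/phrase -> hand-tuned Kokoro phonemes. These entries are examples;
# 'Whaikaha' uses the phonemes shown earlier in this post.
PHONEME_OVERRIDES = {
    'Whaikaha': 'fIkˈɑhɑ',
    # more entries get added case-by-case as mispronunciations are caught
}

def apply_overrides(text: str) -> str:
    """Swap known-tricky words for [word](/phonemes/) notation before G2P runs."""
    for word, phonemes in PHONEME_OVERRIDES.items():
        text = text.replace(word, f'[{word}](/{phonemes}/)')
    return text
```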

These case-by-case issues are a combination of two factors: espeak-ng might be producing incorrect IPA, and Kokoro’s voice is trained on English IPA only.


What was the result?

Honestly, it blew me away. It was one of those amazing vibe-coding moments where a little bit of human wisdom, paired with AI implementation, resulted in an output I didn’t believe was possible.

Over the years I’ve seen a few people attempt to fix this problem, but I’m not aware of anything that is fully automated, open source, free, and available to me currently. My use-case is the generation of low-cost audio for my blog posts, to help blind/low vision people, and people with reading disabilities like dyslexia. I don’t claim to be an expert on te reo — I’m just trying to fix a practical accessibility issue for disabled people here.

The innovative thing here is that I’m simply taking off-the-shelf open-source systems and mixing them together (with a translation layer) to create an optimised output. I think this has resulted in speech outputs that are less overtly wrong, albeit not perfect, which I think is still a win.

This approach has taken Māori speech which was impossible to understand, to a point where it is easy to understand for end-users.

This approach likely will not work well for text that is predominantly Māori. I think where this system will excel is with the form of NZ English we predominantly use, which mixes short Māori words and phrases in with English text.

So what I’ve learnt is that even small local models like Kokoro, while trained primarily on English text, are capable of being automatically corrected into better-sounding Māori pronunciation.

I think the lesson here is that while current TTS models are not trained on Māori speech, you can intervene at the grapheme-to-phoneme layer of the speech synthesis pipeline to get better results. This method is far easier to implement, and doesn’t require the manual trial and error of cajoling phonetic English text into good Māori speech.

I should note: I failed a linguistics course at university, twice. So, this is my small contribution to the field I failed to learn about.

Further research

I think espeak-ng’s G2P might work well enough, but it’s a pretty old-school rule-based G2P — it relies on dictionary lookups and deterministic rules.

I took a look at espeak-ng’s source code to inspect its Māori G2P rules, and it does look very rudimentary compared to other languages — so it’s a definite weakness. Ideally, someone far more knowledgeable than me could contribute to espeak-ng to help improve it.

More modern G2P systems use neural networks to convert text into phonemes. It might be worthwhile investigating the creation of a Māori G2P using a neural network, in order to feed more accurate IPA data to neural TTS, like Kokoro.

I know there are a few other people working on actual solutions to this problem, and I don’t claim to be working on that level. This is a hacky, open-source partial solution that met the speech-quality goals I set for myself.

To be clear, I’m not trying to solve Māori speech synthesis in general here; my problem is more specific: just making small bits of Māori text sound less wrong in predominantly English blog posts. And I think I achieved that.

Ngā mihi, Callum.


Buy me a coffee?

If you like this content and want to support it, consider buying me a coffee. I’ll use the funds to keep writing free accessibility content.

Buy me a coffee on Ko-fi ☕️


Need accessibility help?

If you need support with accessibility audits, team training, or ongoing accessibility consulting, OpenAccess can help.

Get in contact