SpongeBob Can Now Narrate Your Writing

Fifteen.ai is a proof-of-concept web platform that lets various characters from different pieces of media repeat whatever you write. The site, funded by MIT, has served over 4.2 million audio files of characters speaking text submitted by users.

I’m frankly surprised this wasn’t done earlier.

Text to speech is a technology that has been around for a while now, having been part of many operating systems since as early as the 1990s. However, thanks to recent innovations in artificial intelligence, the quality of text-to-speech engines has improved rapidly, and the voices they generate are increasingly natural. The most famous example of this is an unreleased feature of Google's voice assistant, which can call and make appointments for you and also answer calls if needed. The assistant goes as far as to include "um"s and "hmm"s in its replies, undoubtedly fooling the person on the other end into thinking they are talking to an actual human.

There is no doubt that Google used days upon days of training data to hone its voice assistant into answering in the most natural way possible. Fifteen, on the other hand, highlights a very different improvement in voice-synthesising algorithms: it can produce audio samples similar to the original character from as little as 5 minutes of training data. Though it is still a bit rough around the edges (try inputting "coronavirus" and listen to different characters absolutely butcher the pronunciation), the AI does a good job of accurately replicating most basic words. Additionally, depending on the character chosen, the web platform lets you give the narration an emotion. For example, Twilight Sparkle from My Little Pony has the option of reading the written text in a happy tone.

The fact that you can train artificial intelligence to synthesise voices this natural with just 5 minutes of training data comes with both benefits and potential problems. The obvious benefit most people might think of is better-sounding voice assistants, but companies have so far declined to use better voice synthesis in their assistants, with research showing that it can drive users away by coming off as eerily human. It is unlikely that we will see phone voice assistants with natural-sounding voices until they can consistently be as natural as the voice shown in the Google assistant calling demonstration and cross the infamous "uncanny valley", the point at which a technology generates an imperfect human and causes discomfort in users despite being close to the real thing. A more important benefit of better voice synthesis is giving those who are impaired and unable to speak a nearly natural voice with which to express their thoughts.

On the other hand, a problem we may face at some point in the future is that if someone records you speaking about different things for 5 minutes, they can impersonate your voice using the same technology this artificial intelligence uses. That being said, there is still a while to go before artificial intelligence can accurately synthesise more complex words from such a small set of training data.