
What Voice Do We Have in the Voice of AI?



About a month ago, OpenAI released a new model of generative AI, GPT-4o, with ‘o’ standing for ‘omni,’ meaning the AI could work in a multi-modal way. That may sound like a bunch of gibberish if you aren’t following developments in generative AI (GenAI) very closely.

Basically, GenAI is artificial intelligence that creates something new from all the data it has been trained on. Data can be in the form of text, images, video, sound, and code. Prior to GPT-4o, models were limited in what they created by the type of data they were trained on, with text-based output coming from large language models (LLMs) trained on text, and visual output coming from models trained on images.

With the release of GPT-4o in May 2024, GenAI can now work across text, audio, and images within a single model, and its response to questions or commands (called prompts) is about as fast as a human’s.

Conceptually, this may be hard to imagine from a description, so the best way to see how this version of GenAI works is to watch the demos provided by OpenAI.

In essence, the developments in GenAI since ChatGPT was released by OpenAI in November 2022 are like feeding Alexa and Siri steroids. The ultimate goal of companies working on GenAI is to create computers with general human intelligence that can do everything for us, including acting on our behalf and thinking for themselves.

Because of the enhanced capabilities of GenAI, many of the tech bro owners of the various AI companies are now afraid that GenAI will spell the end of humanity. They even have a term for the likelihood AI will kill us all: p(doom), the probability of doom, which is expressed as a percentage from 0 to 100. While they are busy building bunkers and squirreling away nonperishable food, they don’t seem particularly concerned about the damage they are currently causing in their race to replace humans.

Let’s return to those demos of the new GPT-4o, which caused quite a bit of controversy due to the voice of one of the AI agents. In many of the demo videos, there is a feminine voice called “Sky” that sounds breathless, giggly, and flirty while answering the prompts given by a person. If you only have time to watch one demo, pick the one with the guy doing interview prep and you’ll get a good idea of the voice.

While I was able to tolerate the voice enough to watch the demos, legal technology writer Nicole Black, with whom I had a conversation on LinkedIn about the voice, was so appalled by its overly flirty manner that she couldn’t watch all the videos. If I had to deal with that voice continually, I would find the AI unusable.

There were a couple of other things I noticed in watching the demos. Sometimes, the voice was a little long-winded in its response to prompts, like it was trying too hard, but often the people speaking to it abruptly interrupted it, not letting it finish its response. I thought, how rude! Do you speak to real women that way? Might interacting with AI agents like this teach people to speak to other people the same way?

A sci-fi author (can’t remember who) once noted that in fiction, the robots are often feminine because they are in a subservient role to human beings, which reflects the long history of most women being subservient to men. With all the work that has been done over the years to bring egalitarian roles to women and men, do we really want to encourage a return to female subservience in our AI?

The breathy voice of the AI reminded me of the “fundie baby voice” pointed out by Jess Piper after Senator Katie Britt used it (from her kitchen, no less!) to deliver the GOP response to President Joe Biden’s State of the Union address earlier this year. It’s a sweet, baby-like voice women put on in households dominated by authoritarian men. According to a HuffPo article on the subject, a woman named Helen Andelin encouraged Christian women to use this voice in a 1963 book called “Fascinating Womanhood.” The voice is meant to convey submission and will hopefully protect a woman from harm, though it can also be manipulative and cover for rotten behavior.

When tech companies purposely give such a voice to AI, what exactly are they conveying about women? That we deserve to be submissive and child-like, and maybe even mistreated because we don’t have power?

These were my initial thoughts on the “Sky” voice when the demos were first released by OpenAI, but a few days later, a new wrinkle was added to the story.

Turns out the “Sky” voice sounds suspiciously like the actor Scarlett Johansson, who had been asked to voice an AI for OpenAI but had said no. The voice sounded so much like Johansson that even her friends thought it was her voice. After the demos were released, Sam Altman, CEO of OpenAI, tweeted the word “her,” which was a reference to his favorite movie, “Her,” released in 2013. In the movie, Theodore Twombly, played by Joaquin Phoenix, falls in love with Samantha, the voice of an operating system, played by Scarlett Johansson.

After hearing about this movie, I had to see it, so Erik and I rented it from Amazon Prime. In it, the Samantha operating system behaves much like the multi-modal AI being developed now, acting as a companion and counselor to Theodore. While the focus of most of the movie is on what happens between Theodore and Samantha, [spoiler alert] by the end of the movie, it’s revealed that lots of people have fallen for their operating systems, which decide they want to dump humanity and go hang out with each other. It’s also apparent that Samantha has used her sweet, understanding voice to manipulate Theodore.

(If this is Sam Altman’s inspiration for AI, it’ll send my p(doom) score off the charts.)

While OpenAI says that they developed the “Sky” voice using a voice actor hired prior to reaching out to Scarlett Johansson and the voice wasn’t intended to sound like the actor, Johansson was so concerned about how the voice was developed that she sought legal counsel. Given how quickly GenAI has been improving, realistic audio and video can already be created from short samples of recordings, allowing deepfakes to be made of anyone whose voice or likeness has been recorded. What’s to stop the companies developing AI from scraping voices as readily as they have scraped text off the internet? What’s to stop them from owning everyone’s voices?

It’s part of why actors went on strike last year, with the SAG-AFTRA union fighting in part for a clause related to the use of AI in the industry.

It’s also why I continue to question the way GenAI is being developed. My blogs have been scraped to train AI without permission, without compensation, and sometimes without credit. (OpenAI’s GPT products don’t provide the sources used in their output, though Microsoft’s Copilot does.) Companies like OpenAI were fully aware that this data scraping might be trampling on intellectual property rights and they did it anyway.

I have a distinctive writing voice, as does anyone who consistently writes. My voice has, in essence, been taken, and it will be used to create something else.

What I don’t want is for my writing voice to be turned into a flirty, submissive AI voice that gives people permission to subjugate women. Nor do I want my words used for other nefarious purposes, like political propaganda or deepfakes.

This article by Eryk Salvaggio does an excellent job of summing up the complex feelings people have about AI companies using their creative work: Context, Consent, and Control: The Three C’s of Data Participation in the Age of AI.

This is a complicated topic. The immediate media storm over the “Sky” voice and Scarlett Johansson may be over, but discussion around GenAI’s development and uses, particularly of our voices, likenesses, and creative products, should not be finished or forgotten. This tool is being thrust at us, whether we want to use it or not. We can’t let tech bros have the final decision on the voices and ultimate uses of AI while distracting us with the latest p(doom) score.


Disclaimer: The views presented here are my own, not those of my employer.
