May 27, 2024
ChatGPT goes multimodal: now supports voice, image uploads


After unveiling its newest image generation model DALL-E 3 with support for text and typography generations last week, OpenAI is moving to make its hit AI chatbot ChatGPT better.

In a surprise move, OpenAI announced that ChatGPT will now support both voice prompts and image uploads from users.

The move will give users the ability to have back-and-forth conversations with ChatGPT – in a way similar to how they talk to Amazon’s Alexa, Apple’s Siri, or Google Assistant – and ask for the bot to analyze and react to any image they upload, such as translating signage or identifying objects when asked by the user in text accompanying their image upload.

Voice inputs will only be available on OpenAI’s ChatGPT mobile apps for Android and iOS. Image inputs will be available across the mobile apps and desktop.



OpenAI says the features are powered by its proprietary speech recognition, speech synthesis and vision models, and will roll out to ChatGPT Plus and Enterprise subscribers over the next two weeks. Other groups of users, including developers, will get these capabilities soon after, according to the company.

How will voice and image prompting work?

In a blog post published this morning, OpenAI said the voice conversation capabilities will allow users to talk about anything and everything simply by speaking aloud.

They’ll just have to pick one of five voice options and say what they want, and the bot will answer in the chosen voice. For instance, one could ask for a bedtime story or pose questions to settle an ongoing debate at the dinner table.

The company delivers these capabilities with speech-to-text and text-to-speech models that function in near real-time, converting input voice into text, feeding that text into OpenAI’s underlying large language model (LLM) GPT-4 to deliver a response, and finally converting that text back into the user-selected voice. OpenAI claims it has worked with multiple voice artists to create human-like voices for synthesis.
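The three-stage flow described above (speech to text, text through the language model, text back to speech) can be sketched roughly as follows. This is only an illustration of the data flow, not OpenAI's implementation: the `VoicePipeline` class, its field names, the stand-in stage functions and the voice label `voice-1` are all hypothetical, and no real speech or language model is called.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Hypothetical wrapper chaining the three stages named in the article."""
    transcribe: Callable[[bytes], str]        # speech-to-text stage
    generate: Callable[[str], str]            # language model stage
    synthesize: Callable[[str, str], bytes]   # text-to-speech stage (text, voice)
    voice: str                                # user-selected voice (assumed label)

    def respond(self, audio_in: bytes) -> bytes:
        prompt = self.transcribe(audio_in)          # 1. input voice -> text
        reply = self.generate(prompt)               # 2. text -> model response
        return self.synthesize(reply, self.voice)   # 3. response -> chosen voice


# Stand-in stages so the sketch runs without any real models.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_generate(prompt: str) -> str:
    return f"Echo: {prompt}"


def fake_synthesize(text: str, voice: str) -> bytes:
    return f"[{voice}] {text}".encode("utf-8")


pipeline = VoicePipeline(
    transcribe=fake_transcribe,
    generate=fake_generate,
    synthesize=fake_synthesize,
    voice="voice-1",
)
audio_out = pipeline.respond(b"tell me a bedtime story")
```

Keeping each stage as a separate callable mirrors the article's description of distinct models working in sequence; in a real system each stand-in would be replaced by a call to the corresponding model.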

Notably, Amazon is similarly working to enhance its Alexa digital assistant, which powers the Echo line of smart devices, with the power of LLMs – to make its answers more relevant and contextual than they are at present. And earlier today, Amazon announced it is investing up to $4 billion in OpenAI rival Anthropic, maker of the Claude 2 chatbot.

While voice adds conversational capabilities to ChatGPT, image support gives it the power of Google Lens, allowing users to simply take a picture and add it to the chat along with a question. ChatGPT will analyze the image in the context of the accompanying text and produce an answer. It can even engage in a back-and-forth conversation around that subject.

For instance, with the new capabilities, it could help users fix a bike, work through a math problem or even discuss the historical relevance of a monument they are visiting, all starting from a single image.

The new capabilities appear to greatly enhance the utility of ChatGPT, and OpenAI’s choice to deploy them now is notable: the company did not wait to bundle them into the anticipated, presumably more powerful GPT-4.5 or GPT-5 LLM.

Available to ChatGPT Plus and Enterprise users soon

Over the next two weeks, both voice and image prompting capabilities will be available for Enterprise and Plus users of ChatGPT, the former mobile-only (for now) and the latter both desktop and mobile.

The update from OpenAI comes nearly a year after the initial blockbuster release of ChatGPT and multiple updates to its underlying models and interfaces since. The company said it is moving slowly to make sure that the capabilities of the bot are not misused in any way.

“We believe in making our tools available gradually, which allows us to make improvements and refine risk mitigations over time while also preparing everyone for more powerful systems in the future. This strategy becomes even more important with advanced models involving voice and vision,” the company noted in the blog.

To prevent the misuse of its voice synthesis capabilities, which can be abused for things like fraud, the company has restricted their use to voice chat and certain approved partnerships. These include one with Spotify, in which the music platform is helping its podcasters translate their content into different languages while retaining their own voice.

Similarly, to avoid privacy and accuracy concerns stemming from image recognition, the company has also restricted the bot’s ability to analyze and make direct statements about people if they’re present in an input image.

The new features are expected to reach non-paying users as well, but the company has not yet shared an exact timeline.
