The big problem with Veo 3

Veo 3 is Google's latest AI video generation model. Launched in late May, it lets users create videos from text prompts. The model has drawn attention from content creators because it can generate videos with sound and dialogue, something earlier versions of Google's model could not do, which makes the results more realistic.

Many users have been using Veo 3's clips, up to 8 seconds long, to make commercials, ASMR videos, fantasy movie trailers, and humorous street interviews.

Oscar-nominated director Darren Aronofsky used the tool to create a short film called Ancestra. At the press conference, Google DeepMind CEO Demis Hassabis likened Veo 3's arrival to cinema stepping "out of the silent film era."

"Persistent" subtitles from Veo 3

However, many users have found that the tool doesn't always behave as expected. When generating clips with dialogue, Veo 3 often inserts meaningless, jumbled subtitles on its own, even when the prompt explicitly asks for no subtitles.

Removing these subtitles isn't simple. Users have three options: regenerate the clip, spending more credits and therefore more money with Google; use external tools to remove the subtitles; or crop the video to cut the subtitles out of the frame.

Veo 3 produces realistic visuals and dialogue that matches lip movements, but the subtitles are meaningless. Photo: LessWrong.

Josh Woodward, vice president of Google Labs and Gemini, posted on X on June 9 that Google had developed fixes to reduce the gibberish subtitles. But more than a month later, users are still reporting the problem on Google Labs' Discord channel, a sign that fixing bugs in large AI models is not easy.

Like Google's previous AI video generation models, Veo 3 is a paid product, with plans starting at $249.99 per month. To create an 8-second video, users enter a description into Flow, Gemini, or another platform. Each Veo 3 generation costs at least 20 AI credits, and users can top up 2,500 credits for $25.

Mona Weiss, a commercial director, says regenerating footage to get rid of subtitles is becoming a significant expense. "If you create a scene with dialogue using Veo 3, about 40% of the output will have meaningless subtitles that render the video unusable," she says. "It costs a lot of money to get a scene you like, and then it ends up being unusable."
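To put Weiss's figure in perspective, the sketch below combines it with the pricing above. It is a rough estimate under simplifying assumptions, not Google's billing logic: every attempt costs exactly the 20-credit minimum, credits cost $25 per 2,500 (about $0.01 each), and each attempt independently has a 40% chance of coming out with unusable subtitles. The names `expected_attempts` and `expected_cost` are illustrative only.

```python
# Rough expected-cost estimate for getting one subtitle-free Veo 3 clip.
# Assumptions (illustrative, not Google's billing rules): every attempt costs
# the 20-credit minimum, credits cost $25 per 2,500, and each attempt
# independently has a 40% chance of unusable subtitles (Weiss's estimate).

CREDITS_PER_CLIP = 20            # minimum credits per generation
DOLLARS_PER_CREDIT = 25 / 2500   # $0.01 per credit at the top-up price
P_UNUSABLE = 0.40                # share of dialogue clips with gibberish subtitles


def expected_attempts(p_unusable: float) -> float:
    """Mean of a geometric distribution: attempts until the first usable clip."""
    return 1 / (1 - p_unusable)


def expected_cost(p_unusable: float) -> tuple[float, float]:
    """Return (expected credits, expected dollars) spent per usable clip."""
    credits = CREDITS_PER_CLIP * expected_attempts(p_unusable)
    return credits, credits * DOLLARS_PER_CREDIT


if __name__ == "__main__":
    credits, dollars = expected_cost(P_UNUSABLE)
    print(f"{expected_attempts(P_UNUSABLE):.2f} attempts, "
          f"{credits:.1f} credits, ${dollars:.2f} per usable clip")
```

Under these assumptions, one usable clip takes about 1.67 attempts on average, or roughly 33 credits (about $0.33) instead of 20 credits ($0.20), a markup of about two-thirds. Generations that cost more than the 20-credit minimum scale by the same factor.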

Meaningless subtitles are difficult to remove from Veo 3 clips. Photo: Technology Review.

When Weiss reported the issue to Google Labs via Discord, hoping to get her wasted credits back, the support team referred her to Google's official support department. Support offered to refund her Veo 3 subscription fee, but not the credits. Weiss declined, because accepting the refund would have meant losing access to the model.

The Google Labs support team on Discord said that subtitles might be switched on automatically when speech is detected, and that they are working to fix the bug.

The problem stems from Google's approach

Veo 3 inserts subtitles on its own because of the data the model was trained on.

Although Google hasn't disclosed what data it used to train the model, that data likely includes videos from platforms such as YouTube and TikTok, many of which contain subtitles, according to Shuo Niu, a researcher studying video-sharing platforms and AI at Clark University (Massachusetts, USA). Because these subtitles are burned directly into the video frames, they are difficult to remove before the videos are used as training data.

"Text-to-video models are trained using reinforcement learning to create content that mimics human-made videos, and if those videos have subtitles, the model can 'learn' that adding subtitles makes the product more like a human-made video," he explained.

Veo 3's output may reflect training data drawn from YouTube and TikTok videos. Image: Mashable.

A Google spokesperson said: "We are constantly improving our video creation capabilities, especially in terms of text, natural-sounding voice, and perfectly synchronized audio. We encourage users to retry their prompt if they find the results inconsistent and to give us feedback through the like or dislike feature."

The model may also ignore instructions such as "no subtitles" because negative prompts (telling an AI what not to do) are generally less effective than affirmative ones, according to Tuhin Chakrabarty, an AI systems researcher at Stony Brook University.

To fully resolve the issue, Chakrabarty added, Google would have to examine every frame of every video used to train Veo 3, remove or relabel the videos that contain subtitles, and then retrain the model, a process that would take weeks.

Katerina Cizek, a documentary filmmaker and artistic director at the MIT Open Documentary Lab, argues that the issue shows Google's willingness to release products that aren't fully finished.

"Google needs a win," Cizek stated. "They need to be the first to release a tool that can create audio that matches lip movements. And that's more important than fixing the subtitle issue."

Source: https://znews.vn/van-de-lon-cua-veo-3-post1569402.html