
The image of the astronaut riding a horse was created by combining two types of generative AI models. Photo: MIT News
When speed and quality are no longer a trade-off.
In the field of AI-powered image creation, there are currently two main methods:
Diffusion models allow for the creation of detailed, sharp images. However, they are very slow and consume a lot of computational resources because they require dozens of processing steps to remove noise from each pixel.
Autoregressive models, on the other hand, are much faster because they can predict small parts of an image sequentially. However, they often produce images with less detail and are prone to errors.
HART (Hybrid Autoregressive Transformer) combines both, offering "the best of both worlds." First, it uses an autoregressive model to construct the overall image by encoding it into discrete tokens. Then, a small diffusion model predicts residual tokens, which restore the details lost during the encoding process.
The resulting images match or exceed the quality of the most advanced diffusion models, while generation is about nine times faster and uses about 31% less computation.
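The two-stage flow described above can be illustrated with a toy sketch. This is not HART's code: the codebook, the function names, and the one-dimensional "image" are all made up for illustration, and the `target` list stands in for what trained networks would predict (a real generator has no target to look at).

```python
# Toy illustration of "coarse discrete tokens first, residual refinement second".
# Every name and value here is hypothetical; this is not the HART implementation.

CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]  # assumed tiny codebook of discrete levels


def nearest_token(value):
    """Return the index of the codebook entry closest to value."""
    return min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - value))


def autoregressive_stage(target):
    """Stand-in for the autoregressive model: emit one discrete token at a time."""
    tokens = []
    for value in target:  # strictly sequential, one position after another
        tokens.append(nearest_token(value))
    return tokens


def refinement_stage(coarse, target, steps=8):
    """Stand-in for the small diffusion model: close the residual gap in a few steps."""
    image = list(coarse)
    for step in range(steps):
        remaining = steps - step
        # Move each value an equal share of the remaining distance toward the target.
        image = [x + (t - x) / remaining for x, t in zip(image, target)]
    return image


target = [0.1, 0.4, 0.62, 0.9]                  # made-up "true" intensities
tokens = autoregressive_stage(target)           # stage 1: discrete tokens
coarse = [CODEBOOK[t] for t in tokens]          # decoded coarse image
refined = refinement_stage(coarse, target)      # stage 2: add back the residual

print("tokens: ", tokens)
print("coarse: ", coarse)
print("refined:", [round(x, 3) for x in refined])
```

The point of the sketch is the division of labor: the first stage fixes the overall content, so the second stage only has to cover the small gap between the coarse image and the final one.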
This new approach helps create high-quality images at high speed.
One of the notable innovations of HART is how it addresses the problem of information loss when using autoregressive models. Converting images into discrete tokens speeds up the process, but also results in the loss of important details such as object outlines, facial features, hair, eyes, and mouth.
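A minimal sketch of why discrete tokens lose detail, assuming a simple uniform quantizer rather than the learned tokenizer HART actually uses. The pixel values and the `quantize` helper are hypothetical; the sketch only shows that nearby values can collapse into the same token and that the residual is exactly what is needed to restore them.

```python
def quantize(values, levels=4):
    """Map each value in [0, 1] to the nearest of `levels` evenly spaced levels."""
    step = 1.0 / (levels - 1)
    tokens = [round(v / step) for v in values]   # discrete token ids
    decoded = [t * step for t in tokens]         # what the tokens can reproduce
    return tokens, decoded


pixels = [0.05, 0.30, 0.48, 0.71, 0.97]          # made-up intensities
tokens, decoded = quantize(pixels)

# The residual is the detail that the discrete tokens could not represent.
residual = [p - d for p, d in zip(pixels, decoded)]
restored = [d + r for d, r in zip(decoded, residual)]

print("tokens:  ", tokens)
print("residual:", [round(r, 3) for r in residual])
```

Here 0.30 and 0.48 land on the same token, so the difference between them is invisible after decoding; only the residual carries it.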
HART's solution is to have the diffusion model focus solely on "patching up" these details using residual tokens. Because the autoregressive model has already done most of the work, the diffusion model needs only 8 processing steps instead of the 30 or more that a standalone diffusion model requires.
"The diffusion model has an easier job to do, which leads to more efficiency," co-author Haotian Tang explained.
Specifically, the combination of an autoregressive transformer model with 700 million parameters and a lightweight diffusion model with 37 million parameters allows HART to achieve performance comparable to a diffusion model with up to 2 billion parameters, but nine times faster.
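To put the figures above in proportion, the short calculation below uses only the numbers quoted in this article. The 30-step baseline is a lower bound ("over 30"), and these ratios are simple arithmetic, not a measurement of speed.

```python
AR_PARAMS = 700e6          # autoregressive transformer
DIFFUSION_PARAMS = 37e6    # lightweight diffusion model
BASELINE_PARAMS = 2e9      # diffusion model used for comparison
HART_STEPS = 8
BASELINE_STEPS = 30        # "over 30" in the article, so a lower bound

total_params = AR_PARAMS + DIFFUSION_PARAMS
param_ratio = total_params / BASELINE_PARAMS        # HART size vs. baseline
diffusion_share = DIFFUSION_PARAMS / total_params   # diffusion part of HART
step_ratio = HART_STEPS / BASELINE_STEPS            # denoising steps vs. baseline

print(f"HART parameters:      {total_params / 1e6:.0f} million")
print(f"Share of baseline:    {param_ratio:.1%}")
print(f"Diffusion share:      {diffusion_share:.1%}")
print(f"Denoising step ratio: {step_ratio:.1%} or less")
```

By these figures, HART has roughly 37% of the baseline's parameters, and the diffusion component accounts for about 5% of HART itself.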
Initially, the research team also tried integrating the diffusion model into the early stages of the image creation process, but this led to an accumulation of errors. The most effective approach is to let the diffusion model handle the final step and focus only on the "missing" parts of the image.
Unlocking the future of multimodal AI.
The research team's next step is to build next-generation vision-language models on the HART architecture. Because HART is scalable and adaptable to many types of data (multimodal), they expect to be able to apply it to video creation, audio prediction, and many other fields.
This research was funded by multiple organizations, including the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the U.S. National Science Foundation. NVIDIA also provided GPU infrastructure for training the model.
(According to MIT News)
Source: https://vietnamnet.vn/cong-cu-ai-moi-tao-anh-chat-luong-cao-nhanh-gap-9-lan-2384719.html