
DeepSeek training is already cheap; now inference is getting cheaper too

Researchers at DeepSeek have published a new experimental model designed to significantly reduce the cost of inference when used in long contexts.

Báo Khoa học và Đời sống · 03/10/2025

Researchers at DeepSeek have announced a new experimental model called V3.2-exp, designed to significantly reduce the cost of inference in long-context operations.

DeepSeek announced the model in a post on Hugging Face, and also posted a linked academic paper on GitHub.

The key feature of the new model is called DeepSeek Sparse Attention. Essentially, the system uses a module called a “lightning indexer” to prioritize specific excerpts from the context window.

DeepSeek announces cost-effective inference model.

A separate system called the “fine-grained token selection system” then picks specific tokens from those excerpts to load into the module’s limited attention window. Combined, they allow Sparse Attention models to operate over long stretches of context with relatively low server load.
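The paper describes these components only at a high level. The snippet below is a minimal, illustrative sketch in plain NumPy of the general idea: a cheap indexer scores every token in the context, and full attention is then computed only over the top-scoring tokens. All names, dimensions, and the scoring rule here are invented for illustration and are not DeepSeek's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(q, keys, values, idx_q, idx_k, top_k=64):
    """Toy sparse attention for a single query position.

    A small "indexer" scores every past token cheaply, then standard
    attention is computed only over the top_k highest-scoring tokens
    instead of the whole context.
    """
    # 1) Indexer stand-in: cheap relevance scores over the full context,
    #    using low-dimensional index projections (hypothetical).
    index_scores = idx_k @ idx_q                      # shape: (seq_len,)

    # 2) Fine-grained token selection: keep only the top_k tokens.
    top_k = min(top_k, index_scores.shape[0])
    selected = np.argsort(index_scores)[-top_k:]

    # 3) Ordinary scaled dot-product attention, restricted to the
    #    selected tokens, so cost grows with top_k rather than seq_len.
    d = q.shape[-1]
    attn = softmax(keys[selected] @ q / np.sqrt(d))   # shape: (top_k,)
    return attn @ values[selected]                    # shape: (d_model,)

# Usage: a 4096-token context attended through only 64 selected tokens.
rng = np.random.default_rng(0)
seq_len, d_model, d_index = 4096, 64, 16
q = rng.standard_normal(d_model)
keys = rng.standard_normal((seq_len, d_model))
values = rng.standard_normal((seq_len, d_model))
idx_q = rng.standard_normal(d_index)
idx_k = rng.standard_normal((seq_len, d_index))
out = sparse_attention(q, keys, values, idx_q, idx_k, top_k=64)
print(out.shape)  # (64,)
```

The point of the sketch is the cost shape: the full context is touched only by the cheap indexer pass, while the expensive attention step scales with the number of selected tokens, which is how a sparse-attention scheme can keep server load low on long inputs.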

For long-context operations, the system's benefits are significant. DeepSeek's preliminary testing suggests that the cost of a simple API call can be cut by as much as half in long-context scenarios.

Further testing is needed to build a more robust assessment, but since the model is open and freely available on Hugging Face, it shouldn't be long before third-party tests can evaluate the claims in the paper.


Unlike other AI chatbot models that consume large amounts of energy, DeepSeek has focused on cutting costs from training through to operation.

DeepSeek's new model is one of a series of recent breakthroughs that tackle the problem of inference cost—essentially, the server cost of running a pre-trained AI model, as opposed to the cost of training it.

In DeepSeek's case, the researchers were looking for ways to make the basic transformer architecture more efficient, and found that there are still significant improvements to be made.

Based in China, DeepSeek is an unusual figure in the AI craze, especially for those who see AI research as a competition between the US and China. The company made a splash earlier this year with its R1 model, trained primarily using reinforcement learning at a much lower cost than its US competitors.

However, the model failed to spark the full-scale revolution in AI training that some predicted, and the company slowly retreated from the spotlight in the months that followed.

The new “sparse attention” approach is unlikely to cause as big a stir as R1, but it could still teach US service providers some much-needed tricks for keeping inference costs low.

https://techcrunch.com/2025/09/29/deepseek-releases-sparse-attention-model-that-cuts-api-costs-in-half/

Source: https://khoahocdoisong.vn/deepseek-dao-tao-da-re-nay-con-co-ban-suy-luan-re-hon-post2149057353.html

