PaLM 2, Google's latest large language model (LLM) announced last week, was trained on nearly five times as much data as its 2022 predecessor, allowing it to perform more advanced programming, math, and content-creation tasks.
At the Google I/O developer conference, the search giant introduced PaLM 2 - a language model trained on 3.6 trillion tokens. Tokens are short sequences of words or word fragments - the building blocks an LLM is trained on to predict the next word.
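To make the token idea concrete, here is a minimal sketch of the next-word-prediction setup described above. It uses a toy whitespace tokenizer for illustration only; production LLMs such as PaLM use subword tokenizers (e.g. SentencePiece), so real token counts differ from word counts.

```python
# Illustrative only: real LLM tokenizers split text into subword units,
# not whitespace-separated words, so actual token counts will differ.
def tokenize(text):
    """Toy whitespace tokenizer: one token per lowercased word."""
    return text.lower().split()

def next_word_training_pairs(tokens):
    """Build (context, next-token) pairs - the prediction task LLMs train on."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = tokenize("The model predicts the next word")
pairs = next_word_training_pairs(tokens)
print(pairs[0])  # (['the'], 'model')
```

Each training pair asks the model to predict one token from everything before it; a 3.6-trillion-token corpus simply gives it vastly more such examples.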
The previous version of PaLM, released in 2022, was trained on 780 billion tokens.
While Google has been touting its AI prowess in search, email, word processing, and spreadsheets, the company has been reluctant to disclose the size or details of its training datasets. OpenAI likewise keeps the training details of its latest LLM, GPT-4, secret.
Tech companies attribute this secrecy to the competitive nature of their businesses: both Google and OpenAI are racing to attract users with chatbots rather than traditional search engines.
Compact, powerful, cost-effective
Google says PaLM 2 is smaller than its predecessor, with 340 billion parameters compared to 540 billion for the previous version - a sign that the company's technology is becoming more efficient at performing complex tasks.
To achieve this, PaLM 2 uses a technique called "compute-optimal scaling," which Google says delivers better overall performance, including faster inference with fewer parameters, reducing serving overhead.
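The shift is visible in the article's own figures: compute-optimal scaling favors training a smaller model on proportionally more data. A quick back-of-the-envelope calculation with the reported numbers shows the change in tokens trained per parameter:

```python
# Tokens-per-parameter ratio, using the figures reported in this article.
# Compute-optimal scaling favors smaller models trained on much more data.
palm1 = {"params": 540e9, "tokens": 780e9}    # PaLM (2022)
palm2 = {"params": 340e9, "tokens": 3.6e12}   # PaLM 2

for name, m in [("PaLM", palm1), ("PaLM 2", palm2)]:
    ratio = m["tokens"] / m["params"]
    print(f"{name}: {ratio:.1f} tokens per parameter")
# PaLM: 1.4 tokens per parameter
# PaLM 2: 10.6 tokens per parameter
```

Roughly a sevenfold increase in data per parameter, even as the parameter count shrank.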
Google's latest language model, trained on text in more than 100 languages, already powers 25 Google features and products, including the experimental chatbot Bard. PaLM 2 comes in four sizes, from smallest to largest: Gecko, Otter, Bison, and Unicorn.
Judging by publicly disclosed figures, PaLM 2 was trained on more data than any other released model. Meta's LLaMA, released in February, was trained on 1.4 trillion tokens, while the last training-data size OpenAI publicly disclosed was for GPT-3, at 300 billion tokens.
The explosion of AI applications has stirred controversy around the technology. Earlier this year, El Mahdi El Mhamdi, a senior scientist at Google Research, resigned in protest over the search giant's lack of transparency.
This week, OpenAI CEO Sam Altman also testified before the US Senate Judiciary subcommittee on privacy and technology amid AI's growing reach, where the ChatGPT creator agreed with lawmakers that new regulations are needed to govern AI.
(According to CNBC)