Using adversarial training to let AI generate new text data, research by two students from Ho Chi Minh City University of Technology was published at AAAI, one of the world's leading AI conferences.
Research on a multilingual model that trains AI to generate paraphrases, by Pham Khanh Trinh and Le Minh Khoi, both 23, was published in the proceedings of the AAAI-24 Conference on Artificial Intelligence, held at the end of February in Vancouver, Canada.
Associate Professor Dr. Quan Thanh Tho, Deputy Dean of the Faculty of Computer Science and Engineering at Ho Chi Minh City University of Technology, called this a commendable result. According to Mr. Tho, researchers and experts regard AAAI as among the top-quality conferences in computer science and artificial intelligence, with a very low paper acceptance rate, this year 23.75%.
Minh Khoi and Khanh Trinh (center) at their graduation thesis defense, 2023. Photo: Courtesy of the subjects
Sharing a passion for deep learning and natural language processing, Trinh and Khoi chose to research large language models (LLMs). Both wanted to identify the limitations of LLMs and improve them.
Khanh Trinh said that ChatGPT and other LLMs must be trained on a huge amount of text data to generate accurate and diverse responses for users. The two students realized that for less widely used languages such as Hindi, Kazakh, or Indonesian, these models often give poor results, because they have seen little of those languages or because those languages simply lack enough data to learn from.
"Why don't we create more text data from the limited resources of those languages to further train the AI?" the two students asked. From there, LAMPAT (Low-rank Adaptation for Multilingual Paraphrasing using Adversarial Training), a multilingual paraphrasing model trained with an adversarial method, was born.
LAMPAT generates a paraphrase of a given input sentence, producing additional text data. Adversarial training, the method behind it, is relatively new for training large language models. With traditional training, the model simply maps an input sentence to an output sentence; with adversarial training, the model also critiques and edits its own output, "adversarially" pushing itself to generate additional, more varied sentences.
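The article does not detail the math, but one common form of adversarial training (an assumption here, not a description of LAMPAT's exact procedure) perturbs an input representation in the direction that most increases the loss, then trains on the perturbed version. A minimal FGSM-style sketch with a toy loss and made-up numbers:

```python
# Hypothetical sketch of one adversarial-training step: nudge an input
# representation in the direction that most increases the loss.
# The toy loss and all values are illustrative, not from the LAMPAT paper.

def numerical_gradient(loss_fn, x, eps=1e-6):
    """Estimate d(loss)/d(x_i) by central differences."""
    grads = []
    for i in range(len(x)):
        up, down = x[:], x[:]
        up[i] += eps
        down[i] -= eps
        grads.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grads

def fgsm_perturb(x, loss_fn, step=0.1):
    """Move each coordinate by +/- step, following the sign of the gradient."""
    g = numerical_gradient(loss_fn, x)
    return [xi + step * (1 if gi >= 0 else -1) for xi, gi in zip(x, g)]

# Toy loss over a 3-dimensional "sentence embedding"
loss = lambda v: sum(vi * vi for vi in v)
x = [0.5, -0.2, 0.0]
x_adv = fgsm_perturb(x, loss)  # perturbed input to train on next
```

In a real setup the perturbed inputs are fed back into training, so the model learns to produce stable, varied outputs even under such perturbations.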
LAMPAT's multilingual nature lies in the fact that the model covers 60 languages at once. On the collected datasets, the team trained LAMPAT to generate paraphrases. The text data LAMPAT generates can then be used to train LLMs, so that these models learn many different ways of expressing the same content and give diverse responses with a higher chance of being correct. With this feature, the team believes LAMPAT could be integrated into applications such as ChatGPT to further improve them.
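The augmentation loop this describes can be sketched as follows. The trivial synonym-substitution "paraphraser" below is a hypothetical stand-in for LAMPAT, which actually uses a fine-tuned multilingual language model; all names and data here are made up for illustration.

```python
# Illustrative-only data augmentation via paraphrasing.
SYNONYMS = {"big": "large", "quick": "fast", "speak": "talk"}

def toy_paraphrase(sentence):
    """Stand-in for a real paraphrasing model: swap in known synonyms."""
    return " ".join(SYNONYMS.get(w, w) for w in sentence.split())

def augment(corpus):
    """Keep the original corpus and add one paraphrase per changed sentence."""
    out = list(corpus)
    for s in corpus:
        p = toy_paraphrase(s)
        if p != s:
            out.append(p)
    return out

corpus = ["the quick fox", "hello world"]
augmented = augment(corpus)
# "the fast fox" is added; "hello world" has no known synonyms, so no variant
```

The augmented corpus would then feed back into LLM training, giving the model several phrasings of the same content.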
In addition, the shortage of data for ChatGPT and other LLMs forces some companies to scrape many external sources such as books, newspapers, and blogs without regard for copyright. Generating paraphrases is also one way to limit plagiarism and copyright infringement, according to Khanh Trinh.
Trinh gave an example with applications like ChatGPT: when a user requests a summary of a text A, the application generates a summary B. With the team's method integrated, on receiving text A the application would first generate several texts A1, A2, A3 with the same content via the paraphrasing mechanism, then summarize them and offer the user several results to choose from.
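That paraphrase-then-summarize pipeline can be sketched in a few lines. Both functions below are hypothetical stubs standing in for real models; only the flow (several paraphrases in, several candidate summaries out) reflects the description above.

```python
# Sketch of the described pipeline with placeholder model calls.

def paraphrase_variants(text, n=3):
    # Placeholder: a real system would call a paraphrasing model like LAMPAT.
    return [f"{text} (variant {i + 1})" for i in range(n)]

def summarize(text):
    # Placeholder: a real system would call a summarization model;
    # here we just keep the first few words as a mock "summary".
    return " ".join(text.split()[:4])

def candidate_summaries(text, n=3):
    """Summarize each paraphrase, returning several options for the user."""
    return [summarize(v) for v in paraphrase_variants(text, n)]

summaries = candidate_summaries("long report about model training results", 2)
```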
In the early stages of the research, the team struggled to prepare evaluation data for all 60 languages. Lacking access to enough data, they compiled a diverse and complete dataset covering 13 languages for objective evaluation: Vietnamese, English, French, German, Russian, Japanese, Chinese, Spanish, Hungarian, Portuguese, Swedish, Finnish, and Czech. This dataset also served as a reliable basis for the final human-evaluation step.
Minh Khoi (left) and Khanh Trinh (right) with their advisor Quan Thanh Tho on graduation day, November 2023. Photo: Courtesy of the subjects
For each of English, Vietnamese, German, French, and Japanese, the team randomly sampled 200 sentence pairs (each pair consisting of the model's output sentence and the reference label) for evaluation. For each language, five language experts scored the pairs independently on three criteria: semantic preservation; word choice and lexical similarity; and fluency and coherence of the output sentence, each on a scale of 1 to 5. The average expert score across these five languages ranged from 4.2 to 4.6 out of 5.
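The scoring scheme just described reduces to averaging per-criterion ratings across experts. A minimal sketch, with made-up ratings:

```python
# Five experts rate each sentence pair on three criteria, 1..5;
# a pair's score is the average over criteria and experts.
# The ratings below are invented for illustration.
from statistics import mean

CRITERIA = ("semantic_preservation", "lexical_similarity", "fluency")

def pair_score(ratings):
    """ratings: one dict per expert, mapping criterion -> 1..5 score."""
    return mean(mean(r[c] for c in CRITERIA) for r in ratings)

experts = [
    {"semantic_preservation": 5, "lexical_similarity": 4, "fluency": 4},
    {"semantic_preservation": 4, "lexical_similarity": 5, "fluency": 5},
]
score = pair_score(experts)  # mean of 13/3 and 14/3 = 4.5
```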
One example is a pair of Vietnamese sentences scored 4.4/5; the input and output differ in wording in Vietnamese, but both translate to English as "He explained the problem in detail".
But some pairs were of poor quality, with semantic errors, such as "We eat while the soup is hot" paraphrased as "We eat the soup while we are hot", which scored only 2/5.
Khanh Trinh said the project took eight months to research and complete. It was also the topic of Trinh and Khoi's graduation thesis, which ranked first in Computer Science Council 2 with a score of 9.72/10.
According to Mr. Quan Thanh Tho, although LAMPAT has demonstrated its ability to generate human-like paraphrases across multiple languages, it still needs improvement in handling idioms, folk verses, and proverbs in different languages.
Furthermore, the team's evaluation dataset covers only 13 languages, leaving out many others, especially minority languages. The team therefore needs further research to improve and extend current multilingual paraphrasing models, which could help remove language barriers between countries and ethnic groups.
At the end of 2023, Trinh and Khoi graduated with honors and distinction in Computer Science with GPAs of 3.7 and 3.9 out of 4, respectively. Both plan to study abroad for a master's degree and pursue research in artificial intelligence and machine learning.
"We continue to research this topic with the goal of applying LAMPAT more to upcoming scientific projects, creating a reliable multilingual product for users," Trinh shared.
Le Nguyen