PV: Sir, could you tell us about the role and value of data in training artificial intelligence?
Mr. Dao Duc Minh: The success of artificial intelligence will depend largely on knowing how to select, collect and process data. To train a high-quality artificial intelligence model, we often start by training from a fairly large database.
Then, when the model is deployed and tested, continued data collection and processing will play a very important role in improving and perfecting the model quality.
Data must meet standards of quantity, quality, diversity and universality. For example, in developing the ViVi Virtual Assistant for Vietnamese users, we had to collect and process tens of thousands of hours of high-quality data to train it, from hundreds of thousands of voices across different regions, ages and genders, with content spanning hundreds of fields.
The raw data is first cleaned, labeled and processed through many steps to create the highest-quality data source to feed into the AI model, thereby improving ViVi's accuracy, which is now close to the maximum at over 98%.
Collecting and processing thousands of hours of data is very expensive and complicated. But we need good data to have quality artificial intelligence. ChatGPT and Bard (Google's chatbot) are both trained on huge volumes of data collected from many different sources on the Internet.
For AI to be successful, it must be trained on large and diverse data sources, so that the results it produces are highly accurate. Conversely, to analyze big data, we need AI to process that data accurately at large scale, producing results that better support decisions and predictions.
It is a resonance between artificial intelligence and big data.
PV: Please tell us about the process of selecting and collecting data for machine learning. How will this data be collected, and from what sources, especially when most of the information about Vietnamese users is held by the social networking sites of foreign companies (Google, Facebook...)?
Mr. Dao Duc Minh: The first step in selecting and collecting data for machine learning models is to understand what makes a good data source. We can refer to the 5V model of big data: a good data source will have all five factors, namely volume, value, variety, velocity and veracity.
Typically, to create the best AI model for a practical application, a good data source will need to be both diverse and universal across many similar problems, as well as specific and individual to that application.
It is a fact that the largest source of human data is the Internet and social networks, and that this data is largely owned by foreign companies. However, data can come from many different sources, and Vietnam still has the advantage of access to its own data sources. Besides, there are data problems that only Vietnamese people can solve, because we are the ones who understand the characteristics of "Vietnamese data" and the needs and characteristics of Vietnamese people, and can therefore apply technology successfully to serve their lives.
For ViVi, the first problem VinBigData set out to solve was to deliver a voice assistant made by Vietnamese people, for Vietnamese people. That means we must master Vietnamese data sources and combine them with artificial intelligence technology to deliver a highly practical product that serves the needs of Vietnamese people as well as possible.
From these goals, we understand what data we need to collect for training and where to collect it. It does not necessarily have to come from the vast data sources on the web.
With the desire to master Vietnamese data and technology, VinBigData has, since its inception, built its own data sources that are unique to Vietnamese people. The total amount of data we own has reached more than 3,500 terabytes. Specifically, we have data on millions of voices from regions across Vietnam; more than 2 million medical images from many different sources; millions of camera images of multiple kinds of subjects in Vietnam (people, vehicles, and objects); and dozens of multi-disciplinary databases, all of which have been collected, cleaned, processed, and labeled.
In particular, in 2021, we also announced the Project to Sequence 1000 Vietnamese Genomes (published by the Big Data Research Institute - the predecessor of VinBigData), becoming one of the units owning the largest Vietnamese genome database. This research result has been and is being shared with the community of doctors and geneticists, aiming towards personalized medicine for Vietnam in the future.
PV: What happens after the data is collected, and how is it standardized? Is bigger data always better?
Mr. Dao Duc Minh: As I said, volume is one of the important factors when collecting data. However, I also want to emphasize that big data alone is not enough if it is not selected, cleaned and clearly classified.
Typically, data goes through a basic processing cycle: collection (of structured and unstructured data), storage (in a database system), processing (a series of steps such as filtering, cleaning, labeling, data enhancement, information extraction/synthesis, and data visualization) and analysis. This cycle can be repeated many times during the development and completion of an AI system.
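As a rough illustration of the cycle described above, the following Python sketch runs a few toy text records through collection, storage, processing and analysis. It is only a minimal sketch under stated assumptions: the sample records, the function names (`collect`, `clean`, `label`, `analyze`, `run_cycle`) and the labeling rule are all hypothetical and are not VinBigData's actual pipeline.

```python
from collections import Counter


def collect():
    """Collection: gather raw records (hypothetical in-memory samples)."""
    return [
        {"text": "  turn on the lights  ", "region": "north"},
        {"text": "", "region": "south"},                    # empty record
        {"text": "what is the weather today", "region": "central"},
        {"text": "turn on the lights", "region": "north"},  # duplicate once cleaned
    ]


def clean(records):
    """Processing: strip whitespace, then drop empty and duplicate texts."""
    seen, cleaned = set(), []
    for rec in records:
        text = rec["text"].strip()
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append({**rec, "text": text})
    return cleaned


def label(records):
    """Processing: attach a toy label (assumed rule, for illustration only)."""
    return [
        {**rec, "label": "command" if rec["text"].startswith("turn") else "question"}
        for rec in records
    ]


def analyze(records):
    """Analysis: count how many records carry each label."""
    return Counter(rec["label"] for rec in records)


def run_cycle():
    """One pass of the cycle; a real system would repeat this many times."""
    stored = collect()  # storage: a plain list stands in for a database
    processed = label(clean(stored))
    return processed, analyze(processed)


processed, summary = run_cycle()
print(summary)
```

In practice each stage would be far more elaborate, but the shape stays the same: every pass takes in raw records and hands a smaller, cleaner, labeled set to analysis.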
The important question is what value data will bring to life. This is what VinBigData has been pursuing through nearly 5 years of researching and developing products. We believe that research is truly successful only when technology really enters life, solves social problems and improves people's lives.
PV: You have just talked a lot about how we collect and create our own data warehouses. So what criteria will set the boundaries of data collection and use to protect user rights?
Mr. Dao Duc Minh: The process of collecting and processing data requires legal regulations or security standards to protect users as well as businesses. Vietnam is still in the process of building and perfecting specific standards to protect user data.
There are already quite a few standards in the world, for example GDPR, the European Union’s user data protection standard, and PCI-DSS, a standard aimed at protecting payment card users.
When we want to popularize or bring Vietnamese products to the international market, complying with these international standards is very necessary.
In the immediate future, to protect the rights of users, VinBigData strives to be transparent about how it collects and uses data, making the purposes and objectives of that collection and use public, especially for data owned by individuals.
Currently, VinBigData has signed agreements with a series of international organizations to ensure the security and rights of users. Beyond that, we hope businesses and the Government will reach a consensus so that a legal framework and legal standards for protecting user data can soon be built.
PV: When an organization holds big data, what risks or data security vulnerabilities will artificial intelligence face?
Mr. Dao Duc Minh: If used properly, data will be a valuable asset. The risk of data loss and leakage is an issue that requires security measures from the beginning.
Until something happens, we often don’t fully understand the importance of data security. But when something does happen, the damage can be huge. Recently, the data of more than 200 million Twitter users was leaked, and user information was sold publicly on many different platforms. If all of those users filed lawsuits, Twitter would suffer huge losses.
If a data leak is purely technical, the damage is usually smaller. But if the leak involves intentional data theft, the consequences are very unpredictable. For individuals, bad actors can use the leaked information for many different illegal purposes. For businesses, a leak not only causes huge financial losses to fix the related problems, but also damages reputation and brand in the market.
PV: What solutions are needed to "patch" these vulnerabilities and improve data security, sir?
Mr. Dao Duc Minh: The first and most useful solution is prevention from the beginning: building systems to protect information security and safety, applying multi-layer protection, and following the correct processes.
Specifically, security prevention involves many different layers. In addition to investing in security and safety equipment, it is necessary to build a process for handling and interacting with users and data, and to establish strict control over the data lifecycle. At the same time, organizations must improve the information security skills and awareness of users and the operating team, and assign appropriate data usage rights (who has the right to access and use which data?).
On the other hand, businesses also need to be flexible in applying data security policies. They should classify the sensitivity and security level of each type of data so that appropriate security measures can be applied, and avoid mechanically applying information security policies so tightly that they hinder the development and exploitation of data.
Data classification is even more important for units that use data for development, because their data has to circulate a great deal between different departments.
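The idea of classifying data by sensitivity and assigning usage rights accordingly can be sketched in a few lines of Python. This is only an illustrative sketch: the level names, dataset names, roles and the `can_access` helper are all assumptions made for the example, not a description of any real system mentioned in the interview.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical sensitivity levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Assumed classification of example datasets (names are illustrative).
DATASET_LEVEL = {
    "product_docs": Sensitivity.PUBLIC,
    "traffic_camera_stats": Sensitivity.INTERNAL,
    "voice_recordings": Sensitivity.CONFIDENTIAL,
    "genome_records": Sensitivity.RESTRICTED,
}

# Assumed maximum level each role may read.
ROLE_CLEARANCE = {
    "intern": Sensitivity.PUBLIC,
    "engineer": Sensitivity.CONFIDENTIAL,
    "data_steward": Sensitivity.RESTRICTED,
}


def can_access(role: str, dataset: str) -> bool:
    """Allow access only if the role's clearance covers the dataset's level.

    Unknown roles or datasets are denied by default.
    """
    clearance = ROLE_CLEARANCE.get(role)
    level = DATASET_LEVEL.get(dataset)
    if clearance is None or level is None:
        return False
    return clearance >= level


print(can_access("engineer", "voice_recordings"))  # True
print(can_access("engineer", "genome_records"))    # False
```

Graded levels let low-sensitivity data circulate freely between departments while tight controls apply only where they are needed, which is the flexibility described above.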
Businesses need to be prepared for the worst case scenario, with relevant experts on hand to minimize damage to the greatest extent possible.
PV: 2023 will be the year of data. What are Vietnam's strengths and weaknesses in data? In your opinion, what do we need to prepare for a successful year of digital data?
Mr. Dao Duc Minh: 2023 will be the year of digital data for Vietnam. In terms of strengths, the first is data itself. Vietnam has a population of 100 million, with a high proportion of young people using smartphones, personal computers and so on. That generates data and poses problems that artificial intelligence needs to solve in Vietnam. The second strength is people. Vietnam has world-leading experts in artificial intelligence. In addition, the country's young information technology workforce has a very good foundation in mathematics. These two groups can be combined to create products of international standard.
Regarding limitations, we have difficulty standardizing data. In Vietnam, each place, each enterprise and each administrative unit has different data. The data is unstandardized, fragmented and unsynchronized. We also need a more specific legal framework to standardize data.
To have a successful digital data year, Vietnam needs to grasp the core points as well as take advantage of the power of technology. The resonance between big data and artificial intelligence will be the lever for Vietnam's digital data year.
By mastering data at all levels, from central to local and across government and enterprises, Vietnam will be able to “preserve” the country’s valuable digital resources. Combined with advanced intelligent technologies, we will be able to “exploit” this resource to the fullest.
“Vietnamese people own Vietnamese data” also helps Vietnam avoid having to buy back products built from its own resources.
At present, in the Fourth Industrial Revolution, Vietnam has many advantages compared with previous revolutions. We have the opportunity to take advantage of technology to catch up quickly and improve the country's position on the world map. I think the key to achieving this goal faster and more sustainably is "data" and "people".
PV: Having worked at a large artificial intelligence company in the US, what made you return to Vietnam?
Mr. Dao Duc Minh: In 2017, I returned to Vietnam. It can be said that this was a turning point. While working in the US, although I worked on many large government projects, the results I achieved were often just a few steps in a large processing process. There were even times when I did not know whether the solutions I developed had been used or not, because the project's security procedures were very strict.
Meanwhile, Vietnam is in a development stage, and there are many big data and artificial intelligence problems that need to be solved. At that time, I received an invitation from Professor Vu Ha Van to return to Vietnam and realize the goal of developing Vietnamese technology solutions to serve the lives of Vietnamese people.
I felt that in Vietnam I would be able to work on problems with greater impact. That is one of the important points that makes my return much more meaningful.
PV: Thank you for this conversation.
- Production organization: Viet Anh - Hong Van
- Performed by: Thi Uyen
- Photo: Thanh Dat