They are coming for your data
There's an epic battle for more data to train the latest AI models. And it's drawing uncomfortably close.
Last week, I wrote that we are witnessing an unprecedented race for AI supremacy as tech giants rush to train ever larger, more powerful AI models. To win that race, they are deploying gargantuan clusters of AI systems and pouring billions into new power-intensive data centres.
There's a lot of pomp and spectacle for sure. But beyond increasingly strident warnings of impending environmental catastrophe, I imagine that most of us are simply wringing our hands and continuing with our lives.
Yet far from the maddening clamour at the frontlines of AI research (or fundraising), another battle of titanic proportions is happening. And this one should give you pause.
The hidden battle for your data
If GPUs are the modern-day equivalents of pickaxes and shovels in the current craze around generative artificial intelligence, then data is undoubtedly the fuel needed to power the engines. And the technology giants have just about finished strip mining all the available data on the Internet.
Actually, that's not an accurate statement, so let's try again. They have just about finished strip mining all available data on the Internet, including copyrighted media such as published books and online articles, YouTube videos protected by terms of service, and a plethora of other high-quality sources.
According to a report in The New York Times, companies like OpenAI, Meta, and Google cut corners in their quest to harvest enough data for their AI models. Indeed, OpenAI's Whisper model, an automatic speech recognition system, was purportedly created to transcribe YouTube videos into text for training. (I highlighted key details from the Times report here.)
In case there is any doubt left, Mira Murati, the CTO of OpenAI, was caught in an awkward moment earlier this year with The Wall Street Journal over Sora, OpenAI's jaw-dropping, unreleased text-to-video model. When asked about the source of Sora's training data in a video interview, Murati replied: "We used publicly available data and licensed data."
But when pressed on whether that included videos on YouTube, she fumbled and made a remark that many saw as a confession: "You know, if they were publicly available, publicly available to use... but I'm not sure. I'm not confident about it." She then backpedalled: "I'm actually not sure about that."
Feeding the insatiable appetite of AI
The reality of AI is an open secret to everyone involved in AI training: Generative AI models need a vast amount of data. The more diverse and accurate the data, the better the performance of the resulting AI models. And guess where the highest-quality data are? Hint: It's probably not your midnight Facebook ramblings but the copyrighted works of the world.
Unsurprisingly, the conglomerates and multi-billion-dollar AI startups have been locked in a constant hunt for more data since ChatGPT burst onto the scene. Twitter (now "X") was among the first to shut off free access to its API in February 2023, and since then every major online service has been tightening its terms of service or shutting down the spigot of free access to data via API. Not that the former matters much when Internet data is often scraped without asking.
It's easier to beg for forgiveness than to ask for permission, right? I mean, just ask Scarlett Johansson and her feelings towards OpenAI right now.
When I spoke with Dr Leslie Teo, senior director of AI Products at AI Singapore, in April this year about the rampant misuse of data, he told me his team was not unaware of "what others are doing." And he should know: He helms the project to train SEA-LION, an AI model designed for the Southeast Asian region.
Instead, the AI Singapore team took great pains to source their data ethically. Dr Teo had personally turned away data brokers offering to sell high-quality data sourced from dubious or unknown origins. I can only imagine these data brokers are selling because there are buyers in the market.
There is a cost for setting the bar high, however. "We pay a price... our models will not be as good," he said.
The end of data naivety
If we believe estimates by research institute Epoch, tech companies could run through the high-quality data on the Internet as soon as 2026. That's barely 18 months from today.
For now, the tech giants are signing deals with publishers left, right, and centre to gain access to more quality content. For instance, OpenAI earlier this week signed a deal with News Corp to access new and archived material from its portfolio of companies, such as The Wall Street Journal, the New York Post, MarketWatch, and Barron's.
Beyond the niceties in the press statements over each deal, there is a nuanced strategy around putting past infringements to rest, access to fresh content, and gaining a leg up over rival AI firms. For publishers, money in their bank accounts today is infinitely better than a costly, protracted court battle and an uncertain outcome. Honestly, how does an outsider prove that an AI firm has stolen its content anyway?
Elsewhere, Google, and now OpenAI, have each signed deals to access real-time content from Reddit's data API. And numerous deals are being struck right now as the scramble for gated quality content heats up.
What next? Will they come for your personal data? As of May 2024, this sounds implausible. However, there are some troubling signs. For instance, Google has apparently already broadened its terms of service to let it tap publicly available Google Docs for its AI products, though it denies having done so. (Read: "Yet")
Just the other day, I noticed that ChatGPT now offers a handy interface on my Plus account to connect Google Drive and OneDrive accounts to help me manage my information in ChatGPT. What a potentially useful feature...
You know what they say these days: If you are not paying for an online service, then you are the product. In 2026, you might have to pay to ensure your data stays unmolested.