The Atlantic's AI Watchdog Database Is A Wakeup Call For Indian Creators

When Homegrown searched AI Watchdog's databases, the results showed that a wide range of Indian creatives had found their way into datasets that have been assembled for AI research and machine learning.
Images Courtesy: Raveena & Ecca Vandal
Summary

This article explores how AI companies train generative AI models using massive datasets containing books, music, films, images, websites, and other copyrighted material, examining the inclusion of Indian artists and writers through The Atlantic's AI Watchdog search engine. It also looks at the legal, economic, and environmental consequences of large-scale data extraction, alongside the growing efforts by creators to protect their work through new tools, copyright campaigns, legislation, subscription-based publishing, and physical media.

“The old colonialism grabbed land, resources and human labour. The new one grabs us, the daily flow of our lives, in the abstract form of digital data,” write media scholars Nick Couldry and Ulises A. Mejias in their paper on ‘Data Colonialism,’ describing data extraction as the normalisation of exploiting human beings. The internet runs on data, and over the last two decades, technology companies have become extraordinarily good at collecting it. Every search, click, scroll, interaction, purchase, location ping, playlist, comment, photo upload, voice recording, and viewing habit creates information that can be recorded and analysed.

Social media platforms have built their entire business models on the attention economy, generating a constant stream of behavioural data in the process. Viral trends, personality filters, reaction videos, AI image generators, "add yours" chains, and countless other forms of online participation all produce information about how people behave and what they’re into. In 2024, the US Federal Trade Commission said major social media and video streaming companies had engaged in "vast surveillance" of users, highlighting the enormous scale at which personal information is collected and monetised. For years, this data has been used to target advertisements, shape recommendation algorithms, and predict consumer behaviour. With the rise of artificial intelligence, it has acquired another highly ‘valuable’ function: training machine learning systems.

In 2006, British mathematician and data strategist Clive Humby called data “the new oil.” His argument was that data had become one of the most valuable resources in the modern economy. But, just as crude oil must be extracted and refined before it can be used, raw information must be processed before it becomes valuable to tech companies. The information collected from websites, apps, images, books, videos, emails, songs, and user activity has to be organised, cleaned, sorted, categorised, and labelled so that computers can understand it. The result of this processing is what is known as a ‘dataset’: a large collection of organised information that can be used for machine learning.
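To make those steps concrete, here is a minimal sketch in Python of how raw scraped records might be cleaned, de-duplicated, labelled, and sorted into a small dataset. The records, the field names, and the `build_dataset` helper are all hypothetical and chosen only for illustration; real pipelines are far larger and more elaborate.

```python
import re

# Hypothetical raw records, standing in for text scraped from different sources.
RAW_RECORDS = [
    {"source": "lyrics-site", "text": "  Hold me   close tonight \n"},
    {"source": "blog", "text": "A review of the new album."},
    {"source": "lyrics-site", "text": "Hold me close tonight"},  # duplicate once cleaned
    {"source": "blog", "text": "   "},  # empty once cleaned
]

# Assumed mapping from a record's source to a category label.
LABELS = {"lyrics-site": "lyrics", "blog": "article"}


def clean(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def build_dataset(records):
    """Clean, de-duplicate, label, and sort raw records."""
    seen = set()
    dataset = []
    for record in records:
        text = clean(record["text"])
        if not text or text in seen:
            continue  # drop empty entries and exact duplicates
        seen.add(text)
        label = LABELS.get(record["source"], "unknown")
        dataset.append({"text": text, "label": label})
    # Sort by label, then text, so the result is organised and reproducible.
    return sorted(dataset, key=lambda row: (row["label"], row["text"]))


dataset = build_dataset(RAW_RECORDS)
```

Of the four raw records, two survive: one labelled as an article and one labelled as lyrics.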

Google's translation systems have been trained on billions of translated examples across more than 100 languages, while Gmail's Smart Reply was trained on around 238 million email messages after preprocessing. Modern AI systems operate on the same principle. To generate text, images, music, video, or code, they need access to enormous datasets made up of books, articles, research papers, websites, forums, photographs, illustrations, films, songs, lyrics, and other creative works produced by millions of people. In practice, any material available on the internet is taken, without permission, to train models, and can later be reproduced in response to a user's prompt.

Last year, The Atlantic launched an investigation, called AI Watchdog, into the datasets used to train some of the world's most powerful AI systems. Led by journalist Alex Reisner, the project aims to open what The Atlantic describes as the "black box" of AI training by showing the public exactly what kinds of books, music, films, television scripts, videos, and other creative works have been collected in these datasets. The project includes searchable databases that allow writers, musicians, filmmakers, journalists, and other creators to look up their own work and see whether it appears in datasets linked to AI development. In June 2026, the publication expanded the investigation with searchable music datasets containing millions of tracks, revealing the enormous scale of copyrighted material gathered for AI training. More than a simple search engine, AI Watchdog functions as a public record of how creative work moves from the internet into machine-learning datasets, giving artists and rights holders a rare opportunity to see material that would otherwise remain buried inside technical archives and training corpora. 

When Homegrown searched AI Watchdog's databases, the results showed that a wide range of Indian and South Asian musicians had found their way into datasets that have been assembled for AI research and machine learning. One of the datasets where Indian artists appeared was Sleeping Disco, a collection of 9,713,413 music tracks sourced from YouTube, along with lyrics gathered from Genius.com. The dataset was assembled by Sleeping AI, a group of AI researchers that builds training datasets and publicly shares research on different aspects of AI development. Artists found in this dataset ranged from acclaimed New Delhi indie-rock band Peter Cat Recording Co. to Indian-American rapper and producer KOAD, Sri Lankan-Australian punk-rap artist Ecca Vandal, and indie band Green Park.

Images Courtesy: When Chai Met Toast & KOAD

Another dataset, Laion Disco, contained 12,320,916 music tracks sourced from YouTube, representing a total of 91 years of music. It was assembled by LAION, a nonprofit organisation based in Germany that builds large datasets used in AI research. Indian and South Asian artists identified in this dataset included everyone from industry legends like Shreya Ghoshal to indie artists like Prateek Kuhad, Chaar Diwaari, Raveena, Midival Punditz, When Chai Met Toast, Bloodywood, and Curtain Blue. AI Watchdog also showed artists appearing in other datasets, such as Spotify Tracks, a collection of 114,000 music tracks ripped from Spotify. This dataset was assembled by an unknown AI developer on Hugging Face and had been downloaded more than 70,000 times as of May 2026. Among the artists found in this dataset were Anoushka Shankar and Seedhe Maut.

When it comes to writers, everyone from fiction and non-fiction authors to screenwriters has had their work swept into machine-learning datasets. Arundhati Roy, Aravind Adiga, Annie Zaidi, Kiran Desai, Manu Pillai, Tishani Doshi, Siddhartha Deb, and Janice Pariat appear within datasets called Library Genesis, Library Genesis Fiction, and Sci-Mag, which also included Anurag Kashyap, Sneha Desai, Sriram Raghavan, and Varun Grover. Library Genesis is one of the largest digital libraries circulating online. According to AI Watchdog, it contains more than 7.5 million books across fiction and nonfiction collections and more than 81 million research papers through its Sci-Mag archive. The repository was created in Russia in 2008 and has continued to grow through contributions from users around the world. Publishers have filed at least two lawsuits against its creators over the years, yet the archive remains online. According to AI Watchdog, Library Genesis has been used to train AI systems developed by companies including OpenAI, Meta, and Anthropic, and other AI companies may also have used material from the archive.

The unauthorised use of copyrighted material to train generative AI systems has also raised major alarms about the future of intellectual property rights. By downloading millions of books, articles, songs, and images at an industrial scale, tech companies are completely bypassing the traditional rules of copyright law. These frameworks exist to ensure that creators have control over their work through explicit consent, credit, and licensing fees before anyone can copy or distribute it. Instead, AI developers feed this data into machine learning systems to build commercial tools, creating an unfair transfer of value where tech corporations profit off independent artists who lack the corporate backing or legal resources to block scraping bots from taking their portfolios.

This practice threatens the financial viability of creative industries. These data-hungry systems swallow original human work to churn out cheap, automated outputs that directly compete in the exact same markets as the original creators. If courts rule that this massive scraping falls under fair use or technological progress, it will establish a dangerous legal precedent. Experts warn this could permanently destroy the economic protections that allow writers, musicians, and artists to make a living, effectively replacing independent human creativity with corporate-owned automated software. 

So what can you do? There are now several tools designed to protect your IP and make it harder for AI companies to steal creative work. For visual artists, Glaze disguises an artist's style so that AI systems have a harder time analysing and copying it, while Nightshade introduces tiny changes that are invisible to the human eye but can ‘poison’ AI training data and cause models to learn incorrect information if enough protected images are stolen and added to a dataset. Writers do not yet have a widely adopted equivalent to Nightshade, but there are tools that target the theft itself. Nepenthes creates endless fake pages and links that trap AI crawlers and keep them away from real content, while Iocaine feeds scrapers large amounts of junk text that can contaminate the data they steal. Cloudflare's AI Labyrinth works similarly by directing suspected AI crawlers into networks of AI-generated decoy pages, wasting their resources and making it harder for them to collect genuine material. Researchers have also explored the use of invisible text and Unicode-based techniques that can interfere with how language models read and process documents, although these remain experimental and are not yet widely used by writers.
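For readers curious about how a crawler trap of this kind can work, here is a minimal Python sketch of the general idea. It is not how Nepenthes or any of the other tools named above is actually implemented; the `/maze/` path, the word list, and the `fake_page` function are hypothetical. The point is only that each decoy page can be generated on demand from its own address, and each one links to further decoy pages, so a crawler that keeps following links never runs out of pages.

```python
import hashlib
import random

# Hypothetical filler vocabulary for the decoy text.
WORDS = ["river", "signal", "amber", "lattice", "echo", "harbour", "static", "violet"]


def fake_page(path: str, links: int = 3, words: int = 40) -> str:
    """Build a decoy HTML page whose content depends only on its path."""
    # Seed a private random generator from the path, so the same address
    # always yields the same page and nothing has to be stored.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    text = " ".join(rng.choice(WORDS) for _ in range(words))
    # Each page links to further decoy pages, which link to more, and so on.
    anchors = "".join(
        f'<a href="/maze/{rng.getrandbits(32):08x}">more</a>\n' for _ in range(links)
    )
    return f"<html><body><p>{text}</p>\n{anchors}</body></html>"


page = fake_page("/maze/start")
```

A real deployment would also have to decide which visitors to send into the maze and how to keep ordinary readers out of it, which this sketch does not attempt.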


The aggressive rollout of AI has become an exercise in techno-authoritarianism. Governments and tech companies are demanding more data centres, which communities across the United States have been fighting against, over concerns about water consumption, energy use, environmental damage, and the growing power of a handful of technology companies. In India, the expansion of data centres is forcing vulnerable groups out of their homes. In Mumbai, residents of a Dalit settlement were evicted from an area where the Hiranandani Group is expanding its data centres, while in Andhra Pradesh, Dalit families are being pressured to sell government-allocated land to make way for a massive Google data centre campus, threatening their primary source of income and security. 

It is deeply unsettling to realize that the physical resources we need to survive on this planet are being aggressively drained to fuel data centres, while the very escape hatch we use to cope with life's daily struggles and systemic injustices — our art, our music, and our stories — is simultaneously being scraped and hollowed out.

However, if there’s one thing we know about ourselves, it’s that we don’t just go gentle into that good night. New bipartisan bills in the United States are attempting to aggressively close the legal loopholes that tech companies have exploited for years. Take the CREATOR Act, which targets "style impersonation" by establishing a federal right that prevents AI companies from using a creator's name in a prompt to mimic their distinct visual technique for free commercial gain. Similarly, the NO FAKES Act treats a person's voice and visual likeness as a protected, licensable right, establishing a system to hold hosting platforms and corporate entities legally liable for unauthorized digital replicas. There is even the TRAIN Act, which forces AI companies to hand over their secret training records, effectively stripping away the corporate "black box" defence (which refers to a company citing the complexity of its own algorithms to avoid legal accountability) and giving creators the transparency they need to fight back.

Meanwhile, artists and creators have already taken matters into their own hands, and it has completely transformed how the internet works. The web has essentially been split into a two-tier reality. The first layer is the free, public internet that Google indexes, a space that has now succumbed to "enshittification" and, as per the dead internet theory, is flooded with algorithmic SEO sludge, content farms, and AI-generated articles. The second layer is where anything real and human now lives: the most thoughtful writing, reporting, criticism, and expertise sit behind private, subscription-based spaces like Substack newsletters, private Discord servers, Patreon communities, podcasts, paid memberships, WhatsApp groups, and specialist forums. Writers and researchers realised that the old model of giving work away for free to earn through unstable ad networks or platform reach is broken. By using email lists and direct subscriptions, they own their audience while keeping their work out of reach of AI crawlers.

Last year in February, more than 1,000 musicians, including Kate Bush, Paul McCartney, Damon Albarn, Annie Lennox, Yusuf/Cat Stevens, and Imogen Heap, came together to release ‘Is This What We Want?’, a protest album opposing proposed UK copyright rules that would allow AI companies to train their models on copyrighted work unless creators actively opted out. Instead of songs, the album features recordings of empty studios and silent performance spaces. The titles of its 12 tracks combine to deliver a single message: "The British government must not legalise music theft to benefit AI companies." Revenue from the project was donated to the charity Help Musicians, while a later physical vinyl edition included an exclusive bonus track by Paul McCartney. The release used silence as a political statement, arguing that if artists lose ownership over their work, the spaces where music is created could one day be left empty.

Alongside a growing switch to physical media, with independent artists bundling albums with artwork, videos, photographs, lyrics, demos, and other material into limited physical releases sold directly to fans on USB drives, we had our first straight-to-VHS movie release in 20 years. Robert dos Santos’ ‘This is How the World Ends’ is a sci-fi feature made as an artistic protest against automated, AI-driven creativity and modern digital streaming, speaking to a collective resistance that has sprouted over the last few years against our own tech-driven erasure. “I'm asking people to do a lot, but that's what it means to be a human,” notes Robert. “That's what it means to exist in this lifetime, to actually participate in the act of life, and not to just allow things to happen.”

All the dataset information referenced in this article was identified using The Atlantic's AI Watchdog search engine.
Homegrown
homegrown.co.in