An Atlantic reporter just made the invisible visible: a searchable database showing exactly which 21 million songs Google, Stability, and other AI labs used to train their music-generating systems.
For months, AI music generators like Suno and Udio have promised to create original compositions in seconds. But the question of where those systems learned their craft—and whether artists consented—has remained locked inside research papers and corporate servers. Alex Reisner changed that by uncovering four separate training datasets and publishing them as fully searchable tools anyone can query from their phone.
- The Scale of Extraction: Four AI music training datasets contain a combined 21 million tracks, with the two largest holding 12 million and 9 million songs respectively.
- Corporate Confirmation: Both Google and Stability AI have acknowledged using these datasets in their own published research papers, establishing a direct chain of accountability.
- The Consent Gap: Artists received no notification, no consent request, and no compensation when their work was absorbed into datasets downloaded thousands of times across the research community.
Two of the datasets are staggering in scale: one contains 12 million tracks and another holds 9 million. The remaining two each hold more than 100,000 songs, still a significant volume of training material. Reisner confirmed that both Google and Stability have acknowledged using these datasets in their own published work: Google cited one in a research paper on music generation, and Stability referenced another in its technical documentation.
The Free Music Archive dataset stands out as one source Reisner identified. While the archive allows free personal streaming, the terms of use restrict commercial reuse—a detail that raises questions about whether training commercial AI systems on that material aligns with the original licensing agreement. The datasets have been downloaded thousands of times since their initial release, though it remains impossible to know exactly which companies or researchers have accessed them beyond the confirmed cases.
What Does It Mean When 21 Million Songs Are Scraped Without Consent?
What makes Reisner’s work significant is the shift from opacity to accountability. Before this, musicians and rights holders had no easy way to discover whether their work ended up in an AI training set. A producer could have dozens of songs absorbed into a 12-million-track dataset with no notification, no consent request, and no compensation. The searchable database flips that dynamic: artists can now look up their own names, labels can audit their catalogs, and researchers can see the actual composition of datasets that power commercial products.
This pattern—harvesting creative output at scale without the knowledge of those who produced it—echoes a structural logic that has appeared before in data-driven industries. Cambridge Analytica’s operation depended on the same principle: that information generated through ordinary human activity could be collected, aggregated, and commercially exploited before any regulatory framework existed to stop it. The music training data situation differs in medium but shares the same foundational dynamic: value is extracted from human expression, at scale, before consent frameworks catch up.
• 21 million total tracks identified across four AI music training datasets
• 12 million songs in the largest single dataset alone
• Thousands of downloads recorded for these datasets, with no comprehensive disclosure of which organizations accessed them
• Research published in the ACM Digital Library identifies copyright infringement in AI music generator outputs as an active and unresolved legal question across the field
The implications ripple across the music industry. Streaming platforms like Spotify have already faced pressure from artists over how AI is trained on their work. Taylor Swift, Billie Eilish, and other high-profile musicians have publicly objected to their music being used to train generative AI without permission. But the problem extends far beyond A-list names. Independent artists, session musicians, and catalog owners, many with limited legal resources, have had even less visibility into whether their work was scraped and repurposed. The question of who gets heard in AI music systems is not merely metaphorical: research posted on arXiv examining fairness in AI music systems has documented how dataset composition determines which cultural traditions, genres, and artist communities are represented in the outputs these tools produce.
Why Did It Take a Journalist to Create This Transparency?
For consumers, the database matters because it exposes the foundation of tools they may already be using. If you’ve experimented with an AI music generator, you’re interacting with a system trained on real human creativity, often without those humans knowing it. The quality and character of AI-generated music reflect the datasets it learned from. A system trained on 12 million songs has absorbed patterns, styles, and emotional textures from decades of human artistry. That’s not a neutral technical process; it’s a transfer of creative labor into a machine.
Reisner’s investigation also highlights a structural gap in AI governance. Research papers from Google and Stability describing their music models are public, but the actual datasets they reference are often buried in appendices or hosted on academic servers with minimal discoverability. By making these datasets searchable, Reisner has done what the companies themselves did not: create transparency at scale. He’s given independent researchers, artists, and journalists the tools to audit AI training practices without waiting for corporate disclosure or regulatory intervention. This is precisely the kind of accountability gap that Meta’s AI training data practices have also exposed—where public-facing research papers acknowledge data use that the affected parties never consented to.
• A 2025 survey on generative music model evaluation identifies the question of whether AI training on copyrighted material constitutes infringement as legally unresolved and technically complex
• Analysis published on ScienceDirect documents Sony Music’s formal Declaration of AI Training Opt Out as an industry response to the absence of consent mechanisms in current training pipelines
• Research examining representational bias in AI music datasets finds that watermarking training data has emerged as a proposed technical solution, though adoption remains inconsistent across the industry
Will Searchable Accountability Actually Change Industry Behavior?
The question now is whether this transparency will translate into policy change. The music industry has already begun pushing back through lawsuits and licensing demands. Stability and other AI labs face multiple copyright claims. But individual accountability—knowing which specific artists are in which datasets—has been missing from those conversations. A searchable database changes that equation. It transforms an abstract legal dispute about categories of content into a concrete, verifiable record of whose work was used and by whom.
The broader governance question is whether voluntary transparency tools can substitute for regulatory mandates. Reisner’s database exists because a journalist built it, not because any company or regulator required it. That distinction matters. Where accountability for data has been left in the hands of a few platforms, voluntary disclosure frameworks have consistently produced incomplete records. The music training data landscape is no different: four datasets are now visible, but the full scope of what has been scraped across the industry remains unknown.
For artists wondering if they’re in one of these datasets, Reisner’s tools are now live and queryable. For AI companies, the database represents a new baseline of public scrutiny. The era of training on massive music collections in the shadows appears to be ending, at least for these four datasets. Whether that transparency extends to future training data remains an open question as AI labs continue developing new music generation systems—and as the legal and ethical frameworks struggle to keep pace with the speed of extraction.
