Robots.txt won't save you: The case for licensed AI content

A lot of the anger around AI comes down to a simple complaint: Publishers feel like they’ve been robbed. When AI developers use publishers' work without compensation or consent, it can feel like that work has been strip-mined. Third-party scrapers grab the content and resell it to enterprise buyers at full margin while publishers get nothing in return. But it doesn’t have to be this way. Content licensing lets publishers receive proper compensation for the use of their work while giving developers high-quality material that makes their AI tools better.
The extractive model: a tax that takes everything
Publishers have lived through ad tech's opaque supply chains for years. Intermediaries took a share of the revenue, and it was rarely clear what value they added. Frustrating to be sure, but at least the publisher got something back.
The scraper economy is a different proposition. As Chris Dicker, CEO of Candr Media, put it to Digiday:
"With scrapers, the value extraction is total. They're taking 100% of the content, paying 0% and then in some cases using that content to create competing products that remove the publisher entirely. It's not a tax, it's a hostile takeover funded by our own IP."
Dicker went on to note that the industry has plenty of Napsters, but no iTunes or Spotify.
The numbers behind the scraper economy
This isn’t fringe activity. Citing Mordor Intelligence data, Matthew Scott Goldstein estimates the third-party scraper economy is a $1 billion industry, and publishers make nothing from it. His report identifies 21 vendors operating in the space, including Firecrawl, Exa, Tavily, Brave, You.com, Perplexity Sonar and Bright Data. TollBit's running index identifies nearly 40. More than 70 enterprise companies were found to be paying for publisher content sourced this way, among them BCG, IBM, Cohere, AWS, Salesforce, Apple, Zoom, PwC, Shopify and Alibaba.
As Goldstein observed:
"The scraper economy is being rebranded as agent infrastructure, and while the technology is getting sharper and the enterprise pitch is getting cleaner, the underlying economics have not changed."
Without a marketplace layer to mediate that consumption, it’s a race to see who can extract value fastest while the question of who gets paid is pushed off to the side.
Defensive measures alone are not closing the gap. TollBit's State of the Bots report found roughly 30 percent of AI bot scrapes violate explicit robots.txt directives. One publishing exec told Digiday that robots.txt "is as useful as a chocolate teapot right now."
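To see why robots.txt alone cannot enforce anything, it helps to look at where the check actually happens. The sketch below uses Python's standard-library urllib.robotparser to show the test a well-behaved crawler runs before fetching a page. The robots.txt contents, the bot names (ExampleAIBot, OtherBot) and the publisher.example URLs are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example publisher. The bot names and
# paths are illustrative assumptions, not taken from any real site.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /premium/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the directives allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The named AI bot is blocked from the whole site.
print(may_fetch(parser, "ExampleAIBot", "https://publisher.example/news/story"))

# Any other bot may read ordinary pages but not the premium section.
print(may_fetch(parser, "OtherBot", "https://publisher.example/news/story"))
print(may_fetch(parser, "OtherBot", "https://publisher.example/premium/report"))
```

The check runs entirely inside the crawler. Nothing on the publisher's side stops a scraper that skips this step or ignores the answer, which is why a directive that depends on voluntary compliance offers little protection against bots that choose not to comply.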
The aligned model: structured licensing as operational logic
The alternative is not new in principle, but it’s becoming clearer in practice. According to Pauline Frommer:
"AI-driven search does not have to be based on theft; publishers can, and should, be compensated for the use of their copyrighted material."
This isn’t just an ethics problem. The extractive model gives rise to three structural problems that can compound over time.
- It denies publishers revenue while raising their infrastructure costs and siphoning off their readers.
- It exposes platforms to growing legal risk as the rules tighten and provenance becomes a much bigger deal.
- It kills the economic incentive to produce the kind of structured content that could make AI output better in the first place.
Licensing infrastructure can address all three of these issues. Publishers get a defined revenue line and visibility into how their content is being used. Platforms get content with a clean chain of title. And the revenue stream allows publishers to keep producing content.
Licensing fees aren’t pocket change, either:
- AI companies are estimated to have paid an average of $2.9 billion in content licensing fees as of January 2025.
- Amazon agreed to pay the New York Times $20 million per year to access its content.
- A collective licensing body in the UK pays out £50 million (about $66.5 million) a year to small and mid-size publishers.
Where this leaves the choice
Newstex is one example of how the aligned model works in practice. We license high-quality content directly from publishers and deliver it to platforms in AI-ready formats while making sure that creators are properly compensated for the use of their work. Publishers interested in syndication can apply to Newstex directly.
Structured content licensing is the best way to ensure the information ecosystem remains viable. Developers gain access to first-class material that can improve the output of their AI tools while publishers gain the security of an income stream. Ultimately, it’s a win for everyone.


