More than 340 local news outlets are limiting the Internet Archive’s access to their journalism
In January, Nieman Lab broke the story that major news publishers — including The New York Times, The Guardian, and USA Today Co. — had started blocking the Internet Archive due to concerns that AI companies might scrape the nonprofit’s repositories for training data.
No news publisher has confirmed to Nieman Lab that an AI company has already scraped their content from the Wayback Machine. Still, in the five months since we published our story, the number of news sites blocking the Internet Archive has continued to rise.
Overwhelmingly, these sites are local news outlets.
Our new analysis shows that more than 340 local news sites across the United States are now limiting the Internet Archive’s ability to access and preserve their stories. Many sites in our sample are owned by five of the seven largest local news publishers in the country: USA Today Co., McClatchy, Advance Local, MediaNews Group, and Tribune Publishing. The latter two are both subsidiaries of the “vulture hedge fund” Alden Global Capital.
Researchers, historians, and citizens around the world rely on the web archives of local news sites to do their work.
“Blocking the Internet Archive’s web crawlers threatens one of the most effective ways that we capture and store news content for the long term,” Edward McCain, a journalism librarian at the University of Missouri, said. “In the present we may have some workarounds, but in the long run, it weakens a vital link in primary source materials that we need to understand where we’ve been and where we want to go.”
Working journalists are among the most frequent users of the Wayback Machine’s local news archives. Over the last month, online petitions have called for news media companies to allow the Internet Archive to preserve their journalism.
“I cover news within a larger news desert in New York’s Orange, Sullivan, and Rockland counties. This means I need to heavily rely on archival data of old news articles from now deceased, or zombie-fied, media outlets,” wrote B.J. Mendelson, the editor of The Monroe Gazette newsletter, in one recent petition signed by over 200 journalists. “Without the Internet Archive, my [work] would be incredibly difficult to do.”
In the face of publisher concerns, the Wayback Machine has highlighted its efforts to minimize abuse of its site, including implementing systems that limit bulk downloading and working with vendors like Cloudflare to monitor bot activity. “We are in conversation with many publishers and appreciate the opportunity to address their concerns,” Mark Graham, the director of the Wayback Machine, told Nieman Lab, noting that the Internet Archive’s terms of use permit using its collections only for scholarship or research purposes.
Meredith Broussard, a data journalist and professor at New York University, said that as profit margins for news thin, it’s only become more important to news publishers to protect their intellectual property.
“This is the same fight that everybody has been having with the Internet Archive since its inception,” Broussard said. “Internet Archive is a very old-school, ‘information-should-be-free’ organization. But the people who are invested differently have different priorities. There are lots of different historical and legal and economic issues that are colliding in this situation. AI companies [are] the catalyst for the latest skirmish in a very old battle.”
Alden Global Capital is another major local news owner whose papers have rolled out new restrictions on the Internet Archive. About 60 of the sites in our sample are owned by MediaNews Group, the Alden subsidiary that operates dailies across the country, including The Mercury News, the Denver Post, and the New York Daily News. Another seven publications are operated by Tribune Publishing, most notably the Chicago Tribune.
Alden has been criticized for aggressively acquiring U.S. newspapers and stripping them of resources for short-term profits. Alden did not respond to requests for comment.
In July 2025, Alden ran an editorial in more than 60 of its daily newspapers openly criticizing OpenAI and other AI companies that have used news content to train their models without compensation. “Securing permission from, and fairly compensating, those publishers who created this great foundation of knowledge is the right, just and American thing to do,” read the editorial. Both Alden publishers are part of the major copyright infringement suit against OpenAI and Microsoft that includes The New York Times and is currently winding its way through federal court.
Some independent local publishers, like The Baltimore Banner, are open to AI chatbots surfacing their stories without licensing deals. But they’re still concerned that a “back door” like the Wayback Machine might hurt their chances of being cited properly.
Last year, The Banner worked with the company DataDome to analyze crawler activity on its site. The findings were striking: about 25% of The Banner’s site traffic was coming from bots, including crawlers operated by the Internet Archive, according to Biswajit Ganguly, the chief technology officer and AI strategist at The Banner.
Based on that analysis, The Banner started blocking the Internet Archive, later adding one of its crawlers to its robots.txt file. It still lets major AI companies through, including crawlers used by ChatGPT and Claude.
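For readers unfamiliar with robots.txt, the sketch below shows roughly how a rule of this kind works: a site lists a crawler by its user-agent name and disallows it, while leaving other crawlers alone. The file contents here are hypothetical and are not The Banner’s actual rules. The user-agent names (`archive.org_bot` for an Internet Archive crawler, `GPTBot` for an OpenAI crawler) and the example URL are illustrative assumptions, and the check uses Python’s standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The user-agent names and rules are
# illustrative assumptions, not any publisher's real configuration.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical story URL on a placeholder domain.
STORY_URL = "https://example.com/story"

archive_ok = is_allowed(ROBOTS_TXT, "archive.org_bot", STORY_URL)
gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", STORY_URL)
print(f"archive.org_bot allowed: {archive_ok}")
print(f"GPTBot allowed: {gptbot_ok}")
```

A robots.txt file is a request rather than an enforcement mechanism, so it only works when a crawler chooses to honor it. That may be one reason publishers in this story pair it with bot-management vendors.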
As Ganguly explains it, the new restrictions on the Wayback Machine are less about negotiating licensing deals or preventing The Banner’s stories from appearing in AI products, and more about ensuring those products trace information back to The Banner instead of linking to sites that aggregate its work.
“We didn’t want the bots to be trained on our content, and then spit out answers based on the content without any kind of references, link, or attribution to our sources,” said Ganguly. “If ChatGPT finds something in the Wayback Machine…we were not sure how well it would be attributed back to us.”
He added that The Banner is still gathering information on how AI search products interact with news about the Baltimore region, and that the publication is open to lifting its block down the line.
“The threat is definitely not the Internet Archive,” Ganguly said. “But it’s a question of how the other actors are going to provide references or attributions and links back to the real creator of the content.”
Blocking as leverage for payment
Local publishers aren’t the only ones ramping up these efforts. Condé Nast, another arm of Advance Publications, has rolled out a coordinated effort to disallow the Internet Archive. Vogue, The New Yorker, Pitchfork, Vanity Fair, Bon Appétit, and Wired currently disallow four crawling bots from our list. (Last month, Wired covered the existential threat these blocks pose to the Internet Archive.) Condé Nast did not respond to a request for comment.
The Atlantic has been working with Cloudflare to block the Internet Archive since last summer and added one of the Internet Archive’s crawlers to its robots.txt file in an update earlier this year, according to Anna Bross, The Atlantic’s SVP of communications. She said the decision is part of the outlet’s “aggressive” blocking policy.
“Our default is to block: No one should be scraping The Atlantic’s journalism without permission, regardless of the use,” Bross said.
The Atlantic’s CEO Nick Thompson commented on our January reporting in a video posted to LinkedIn in April. He said blocking the Internet Archive is important for publishers that want to maintain leverage when negotiating licensing with big AI companies.
“Because of the damages that can be done when you let all your content be scraped, because of all the leverage you lose, there will be worthy products that you previously gave your data to and now you can’t,” said Thompson.
Major international publishers have also started to block the Internet Archive, including the leading newspaper in Brazil, Folha de S.Paulo. Folha added three Internet Archive user agents to its robots.txt file in February.
“Folha believes that the sustainability of professional journalism — the very material the public record seeks to preserve — depends on protecting intellectual property,” said Sérgio Dávila, Folha’s editor-in-chief. “If AI companies wish to use this archive for training, they must enter into licensing agreements rather than rely on third-party repositories.”
Dávila noted that Folha invests in its own digital archive, Acervo Folha, which includes digitized editions of print issues going back to the paper’s founding in 1921. Access to Acervo Folha is available to paying subscribers.
What can be done?
Archiving is expensive; the technical infrastructure, storage, and expertise can be cost-prohibitive to smaller news organizations.
Before the rise of digital news, many papers maintained physical archives, often staffed with in-house librarians. Today, due to the contraction of the newspaper industry, most of those dedicated archiving roles are gone, and the move to digital publishing has only complicated the issue.
A new content management system (CMS) can often lead to major archival losses. In 2024, thousands of articles vanished from the sites of the Daily Hampshire Gazette and the Greenfield Recorder in Western Massachusetts during a CMS switch. When publications close, many former owners don’t want to shoulder the cost of maintaining a site. In 2022, a decade after The Hook, a Charlottesville weekly, went under, its archived site went offline, along with over 22,000 stories.
The Internet Archive is often touted as a hero of the web for taking on the Herculean task of preserving the entirety of the internet, and for stepping in when news organizations fail to preserve their own work.
In December, the Internet Archive partnered with the Poynter Institute and Investigative Reporters and Editors to train a cohort of 33 local and national news outlets on how to develop and implement an archiving strategy. The initiative, funded through a Press Forward grant, aims to train 300 newsrooms in digital preservation and in using the Internet Archive’s services by the end of 2027.
Most of the initial cohort is made up of independent and nonprofit local newsrooms, including Outlier Media, Charlottesville Tomorrow, and The 51st. Wired is the only publication in our dataset restricting Internet Archive access that is participating in the program.
As Broussard, the NYU professor, points out, while the Internet Archive is one of the few efforts to make archives free, it isn’t the only effort to archive news. News publishers have long licensed their journalism to commercial archives like ProQuest and LexisNexis, which are often available through libraries, universities, and individual subscriptions. They’re not free, but they do exist. Several publications in our sample appear in ProQuest databases, including the Chicago Tribune, The Baltimore Sun, Honolulu Civil Beat, and USA Today.
Economic incentives are a valid reason for publishers to want to keep their content out of the Internet Archive, Broussard said, but news outlets should have a long-term, multifaceted preservation strategy. Even with a plan in place, the reality for many publishers is that they’re unlikely to be able to save everything.
“Every news organization, especially local news organizations, generally launch thinking, ‘we’re going to put stuff on the internet and it’s going to be there forever,’ and that’s not true,” Broussard said. “Anybody who told you the internet is forever lied.”