
Government health datasets were altered without documentation, Lancet study shows

The Journalist's Resource · Naseem S. Miller

For months now, researchers and journalists have been documenting the disappearance of federal health data and monitoring changes to government websites. Now, a new analysis finds that some of the existing datasets have also been modified, most of them lacking a notice or log about the change.

Researchers compared more than 200 federal datasets that were available between January and March with their archived versions and found that nearly half were altered. In most cases, the word “gender” was changed to “sex.” Only 15 of the altered datasets included a note about the modification.

“The lack of transparency is a particular concern,” says Janet Freilich, a professor at Boston University School of Law, and co-author of the study, which was published in The Lancet in July.

Alterations were made across multiple federal agencies, including the Department of Veterans Affairs and the Centers for Disease Control and Prevention. The reason for the modifications was not documented in the datasets, but they coincide with a January 20 presidential directive instructing federal agencies to use the term “sex” instead of “gender.”

Federal health datasets have been a major source of information for scientists, and undocumented changes to existing data can undermine confidence in government statistics and distort research.

“There are two levels of harm here,” says Freilich, a patent lawyer by training, who has been following changes to the federal data in recent months. “If you think you’re looking for whatever the column title reflects, but the column — the underlying data — actually reflects something else, then you’re going to get a wrong answer. But the second level of harm is, this really impairs trust in federal data.”


A screenshot of CDC’s Youth Risk Behavior Surveillance System, captured on Aug. 18, 2025.

In March, Freilich co-wrote a paper in The New England Journal of Medicine on the disappearing data, finding that from Jan. 21 to Feb. 11, 2025, the Centers for Disease Control and Prevention had removed 203 databases.

“I’m not expecting this information to come back,” Freilich says. “I just plead for transparency.”

Michelle Kaufman, an associate professor and director of the Gender Equity Unit at the Johns Hopkins Bloomberg School of Public Health, who was not involved in the Lancet study, said that while most people are aware that several federal datasets have been taken down, “this actual doctoring of it takes it to the next level.”

“I’ve been telling my students, ‘You might want to find other data sets that aren’t connected to the U.S. government, because we don’t know the accuracy at this point,’” Kaufman says.

She has also been advising her students to immediately download federal datasets they might need for research.

“You don’t know if it’s going to be there tomorrow,” she says.

The study and its findings

Freilich and her co-author Aaron Kesselheim, a professor of medicine at Harvard Medical School, examined metadata from more than 200 datasets from the Department of Health and Human Services, the Centers for Disease Control and Prevention, and the Department of Veterans Affairs, covering Jan. 20 to March 25, 2025.

Under the OPEN Government Data Act, federal agencies keep lists of information about all their datasets, called their metadata, including a unique ID, title, creation date, description, and content of the dataset. These lists are collected from each agency regularly by Data.gov, which acts as a central hub that brings together datasets from across the federal government and other sources.

Some of the datasets researchers examined in the Lancet study include the Behavioral Risk Factor Surveillance System Prevalence Data, Global Tobacco Surveillance System, and U.S. Census Annual Estimates of the Resident Population for Selected Age Groups by Sex for the United States.

Using Microsoft Word’s comparison tool, the authors then manually compared current datasets to the archived versions recorded by the Internet Archive. They focused on word changes, not numerical data. Researchers also didn’t track changes to the wording on government websites.

In one example, the authors identified a modified Department of Veterans Affairs dataset about veterans’ health care use in 2021, in which a column titled “gender” was renamed “sex.” Those words were also changed in the dataset’s title and description. Before March 5, the dataset had not changed since it was published in 2022.
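The comparison described above was done by hand in Microsoft Word. As a rough illustration only, and not the authors' method, a similar word-level check can be sketched in Python with the standard library's difflib module. The two metadata records below are hypothetical, loosely modeled on the Veterans Affairs example, and the function names are made up for this sketch.

```python
import difflib
import re


def word_changes(old_text, new_text):
    """List word-level differences between two strings.

    Returns tuples of (kind, old_words, new_words), where kind is
    'replace', 'delete', or 'insert'. Punctuation is ignored.
    """
    old_words = re.findall(r"\w+", old_text)
    new_words = re.findall(r"\w+", new_text)
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind != "equal":
            changes.append(
                (kind, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


def compare_metadata(archived, current):
    """Compare two metadata records (field name -> text), field by field."""
    report = {}
    for field in sorted(set(archived) | set(current)):
        changes = word_changes(archived.get(field, ""), current.get(field, ""))
        if changes:
            report[field] = changes
    return report


# Hypothetical records, not copied from any real dataset.
archived = {
    "title": "Health care use by gender, 2021",
    "description": "Counts of veterans using health care, by gender.",
}
current = {
    "title": "Health care use by sex, 2021",
    "description": "Counts of veterans using health care, by sex.",
}

report = compare_metadata(archived, current)
for field, changes in report.items():
    for kind, old, new in changes:
        print(f"{field}: {kind} {old!r} -> {new!r}")
```

Like the study, this sketch looks only at wording, not numerical data, and it can flag a change only when an archived copy exists to compare against.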

Because many datasets did not have an archived copy, the Lancet study may not be representative of all datasets in federal repositories, the authors note. But in addition to documenting undisclosed changes to some of the existing datasets, the study reveals an increase in the pace of data alterations since January: 4% of changes happened in late January, while 72% occurred in March.

Researchers also found:

  • In 25% of altered datasets, the change from “gender” to “sex” made the data descriptions more consistent, as the word “gender” had been applied to data also labeled as “sex.”
  • In four datasets, “social determinants of health” was changed to “non medical factors.” In one, “socioeconomic status” was changed to “socioeconomic characteristics.” In another existing dataset, the question “Are PTSD clinical trials gender diverse?” was changed to “Do PTSD clinical trials include men and women?”
  • Of the altered datasets, 89 involved changes in classification or categorization, such as changed column headers. About 25 had modified descriptive text, such as tags and overview paragraphs.

To safeguard data integrity, Freilich and Kesselheim call for stronger transparency measures, independent archiving, and international alternatives.

“Gender” and “sex” in research

Sex and gender capture different information in research.

Sex usually refers to a person’s biological characteristics, whereas gender refers to socially constructed roles and norms, according to a 2023 paper by Kaufman, published in the Bulletin of the World Health Organization.

“So just because you were born as a designated sex category at birth, it does not mean that, psychologically, that’s how you feel, and that’s where the separation of biological sex comes in as separate to the social construction of gender,” Kaufman says.

Gender has been a focus of research, particularly in psychology, since the 1970s. Researchers still conflate the two concepts, which can make it difficult to compare studies. However, overall, gender and sex are not interchangeable in most studies and surveys. Gender captures a wider range of social experiences of people, compared with sex, which only captures male and female.

“Whether you’re talking about intersex people biologically, or nonbinary, third gender, transgender people in terms of identity, it erases that experience because you have to fit people into one of those two categories, male or female,” Kaufman says.

In addition, if a study aims to investigate the social constructions of gender and how roles and norms might have impacted health outcomes, using “sex” would make it difficult to interpret the results.

“Is it about the biology, the hormones, the chemical makeup of the person that led to these health outcomes, or was it their roles as a woman, or expectations as a man, that then led them down a certain path to those health outcomes?” Kaufman says. “By going back to this sort of gender essentialism of sex being a binary and that lining up completely with gender is sort of backtracking a lot of the research that’s been done over the past several decades.”

Where to find archived data

There’s no perfect alternative to the government databases.

“There’s a lot that can be done on the non-governmental side, but the government has such a leg up in the scope of information it can gather and its authority to gather information that others just can’t get access to,” Freilich says.

Some non-governmental organizations do have their own datasets, as we explained in a February 2025 piece.

Since January, several volunteer groups and newsrooms have also been downloading and archiving government datasets and making them available to the public.

We’ve curated some of those resources below.

  • The Data Rescue Project is a collaboration among a group of data organizations and members of the Data Curation Network. The project — a clearinghouse for preserving at-risk public information — has created a Data Rescue Tracker and a Portal to catalogue ongoing public data rescue efforts.
  • Harvard Dataverse: Harvard Dataverse is a large publicly available repository of data from researchers at Harvard University and around the world, covering a range of topics from astronomy to engineering to health and medicine.
  • The Harvard Library Innovation Lab Team has released more than 311,000 datasets harvested in 2024 and 2025 on Source Cooperative.
  • DataLumos is a crowdsourced repository for at-risk US federal government data. DataLumos is hosted by ICPSR, an international consortium of more than 800 academic institutions and research organizations.
  • Public Environmental Data Project: Run by a coalition of volunteers from several organizations, including Boston University and the Harvard Climate and Health CAFE Research Coordinating Center, the project has compiled a large list of federal databases and tools, including the CDC’s Social Vulnerability Index and Environmental Justice Index.
  • The Federal Environmental Web Tracker is monitoring and tracking changes to thousands of pages of federal government websites.
  • STAT News is backing up and monitoring CDC data in real time.
  • Run by health policy data analyst Charles Gaba, ACASignups.net has a list of archived versions of cdc.gov web pages.
  • Here are some of the CDC datasets uploaded to the Internet Archive before January 28th, 2025.
  • Archive.org has an “End of Term 2024 Web Crawls” downloadable data collection.
  • The Data Liberation Project, run by MuckRock and Big Local News, has a list of archived datasets.
  • Looking for an alternative to the National Library of Medicine’s PubMed to look for research papers? Try Europe PMC. Germany is also planning a global alternative to PubMed.
  • Data journalist Hannah Recht is tracking changes to the U.S. Census datasets.
  • Dataindex.us is a collaborative effort to monitor changes to federal datasets.
  • The 19th, an independent nonprofit newsroom reporting on gender, politics, and policy, has archived government documents, including the CDC’s maternal mortality data, the CDC’s abortion and contraception data, research studies on teens, and guidelines from the National Academies on how to collect data on gender and sexuality.
  • Investigative Reporters & Editors: The nonprofit journalism organization has downloaded more than 120 data sets from federal websites, as recently as November. Those data sets include the Adverse Event Reporting System, Behavioral Risk Factor Surveillance System, Medical Device Reports, Mortality Multiple Cause-of-Death Database, National Electronic Injury Surveillance System (NEISS), National Practitioner Databank, Nuclear Materials Events Database, OSHA Workplace Safety Data, and Social Security Administration Death Master File. IRE members can contact the organization and order the data sets. The organization has been providing data to members since the early 1990s.
