Soda Makes You Old and Other “Data Mined” Myths

“‘If you torture your data long enough, they will tell you whatever you want to hear,’” Dr. James Mills noted in a 1993 New England Journal of Medicine article. “In plain English, this means that study data, if manipulated in enough different ways, can prove whatever the investigator wants to prove.”

Indeed, such “data torturing” is responsible for a recent junk-science study claiming that drinking soda ages your cells, making it as dangerous as smoking. But these findings reflect clever manipulation of data and nothing more.

Regarding the soda study, columnist Daniel Engber points out in Slate: “The newly published paper delivers a mishmash of suspect stats and overbroad conclusions, marshaled to advance a theory that’s both unsupported by the data and somewhat at odds with existing research in the field.” Engber goes on to detail how these researchers worked and reworked the data to get their desired result.

Unfortunately, such data mining is all too common. In many cases, researchers don’t even have to work that hard. Their “data mining” may simply involve excluding certain participants as “outliers” or pulling select subsets of data from a larger database. In either case, the sample may reflect the researcher’s bias rather than a truly random draw from the population.
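To make the “outlier” point concrete, here is a minimal sketch in Python. The eight data points are invented for illustration and come from no study, and the helper names `pearson_r` and `r_of` are hypothetical. It shows how dropping just two observations that contradict a trend can turn a weak correlation into a perfect one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def r_of(pairs):
    """Correlation for a list of (exposure, outcome) pairs."""
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys)

# Invented (exposure, outcome) pairs; the last two contradict the trend.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (1, 6), (6, 1)]

r_all = r_of(points)            # every participant included
r_trimmed = r_of(points[:-2])   # two inconvenient "outliers" excluded

print(f"all participants:    r = {r_all:.2f}")      # r = 0.17
print(f"'outliers' excluded: r = {r_trimmed:.2f}")  # r = 1.00
```

Nothing about the underlying population changed between the two lines of output; only the choice of whom to count did.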

For example, many studies on the chemical bisphenol A (BPA) rely on data mined from the National Health and Nutrition Examination Survey (NHANES), a Centers for Disease Control and Prevention (CDC) program that assesses national health trends. CDC collects health data from a different group of volunteers every year via physical exams and interviews. In addition to recording the volunteers’ health ailments, the data also include measurements of BPA in urine and blood. NHANES is a treasure trove for researchers who want to cherry-pick subsets of data to produce spurious correlations that they can market as meaningful findings.

Numerous BPA studies pull the data from various years to see if there are correlations between certain illnesses and levels of BPA in the volunteers’ urine. Not surprisingly, many generate “positive” results. Although generally meaningless, these studies draw myriad headlines.
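Why “positive” results are unsurprising can be sketched with a short simulation. The Python below uses purely random numbers, not NHANES data; the sample size, the 200 outcomes, the seed, and names such as `count_false_positives` are all assumptions chosen for illustration. It compares one random “exposure” against many unrelated random “illnesses” and counts how many clear the conventional 5 percent significance bar by chance alone.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def count_false_positives(n_people=1000, n_outcomes=200, seed=0):
    """Count chance 'findings' when exposure and outcomes are pure noise."""
    rng = random.Random(seed)
    exposure = [rng.gauss(0, 1) for _ in range(n_people)]
    # Large-sample approximation: |r| > 1.96 / sqrt(n) corresponds
    # roughly to a two-sided p-value below 0.05.
    threshold = 1.96 / math.sqrt(n_people)
    hits = 0
    for _ in range(n_outcomes):
        # Each outcome is generated independently of the exposure.
        outcome = [rng.gauss(0, 1) for _ in range(n_people)]
        if abs(pearson_r(exposure, outcome)) > threshold:
            hits += 1
    return hits

hits = count_false_positives()
print(f"{hits} of 200 unrelated outcomes look 'significant'")
```

On average, about 5 percent of such tests (roughly 10 of 200) would be expected to pass even though no real relationship exists; the exact count depends on the seed. A researcher who runs enough comparisons and reports only the hits will nearly always have something to publish.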

With BPA, there are other serious problems associated with using these data. NHANES captures only a single spot measurement of BPA for each volunteer, yet BPA levels in the body vary considerably over just hours. These data tell us nothing about overall exposure and hence are inappropriate for drawing conclusions about BPA risk. Yet dozens of studies that rely on BPA data from NHANES have been published in peer-reviewed journals. While these studies make headlines, they do not offer much scientific insight.

The study attempting to condemn soda also relies on NHANES data and appears to cherry-pick a subset of those data to “prove” that drinking soda is as bad for you as smoking. The authors use the following data:

The study population included 5309 US adults, aged 20 to 65 years, with no history of diabetes or cardiovascular disease, from the 1999 to 2002 National Health and Nutrition Examination Surveys. 

Notice that they only use data from four years from a database that contains decades of data. Why those years? We don’t know, but it may well be that other years did not generate the same provocative results. But as Engber details, data selection is just the tip of this data-mining iceberg. Unfortunately, not everyone who reads the scary headlines will find Engber’s excellent analysis, leaving many consumers misinformed and needlessly alarmed.