Data is powerful, allowing us to make valuable connections and unprecedented predictions, but can be difficult to use correctly—especially when it’s flawed to begin with.

“We have computers now, which means that our thirst for answers has only increased,” says John C. Malone Associate Professor of Computer Science Ilya Shpitser. “Our ability to store data has increased, too, but the fact that we can store more doesn’t make it any higher quality. If anything, I would say data quality has actually decreased in this era of big, ubiquitous data.”

That’s why Shpitser’s research goal, as he puts it, is to “help people deal with screwed-up data.” This involves using techniques like data fusion, auxiliary data, and error correction to fix common issues in datasets that can prevent researchers from drawing accurate conclusions.

Avoiding causal inference pitfalls

In an ideal world, causal questions—such as “Does A cause B?”—would be answered by randomized controlled trials, the gold standard for scientific experiments. But these trials can’t always be performed; they might be too dangerous or unethical for the participants, too time-consuming or expensive, or simply unrepresentative of the true population being studied.

Researchers have long used observational data, or data gathered through passive observation rather than active experimentation, to approximate cause-and-effect relationships when randomized controlled trials aren’t feasible. But even observational data has its pitfalls. One is confounding, in which unmeasured factors, such as socioeconomic background in a study of health outcomes, distort the causal relationship a researcher hopes to establish. Another is systematic selection, in which only a certain group of people is observed in a study even though researchers want to extrapolate a causal relationship to a larger population.

Shpitser and other data scientists working in the field of causal inference use techniques like inverse weighting and data fusion, which combines datasets collected on different but overlapping populations (referred to as “superpopulations”), to work around these common pitfalls. If, after these adjustments, there is enough information to predict what would have happened had a randomized trial actually been performed, then a causal relationship can be established; if not, researchers will need to collect more data before they can determine cause and effect.
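To see the intuition behind inverse weighting, consider the minimal sketch below. It is not drawn from Shpitser’s papers; the simulated data, variable names, and true effect of 2.0 are purely illustrative. Each observed unit is reweighted by the inverse of its estimated probability of receiving the treatment it actually got, which undoes the confounder’s influence on who ends up treated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy illustration of inverse probability weighting (IPW), assuming a single
# measured confounder x that influences both treatment a and outcome y.
rng = np.random.default_rng(0)
n = 50_000

x = rng.normal(size=n)                      # confounder (e.g., a socioeconomic proxy)
a = rng.binomial(1, 1 / (1 + np.exp(-x)))   # treatment is more likely at higher x
y = 2.0 * a + 1.5 * x + rng.normal(size=n)  # true causal effect of a on y is 2.0

# A naive comparison of treated vs. untreated groups is biased by the confounder.
naive = y[a == 1].mean() - y[a == 0].mean()

# Estimate propensity scores P(a=1 | x), then weight each unit by the inverse
# probability of the treatment it actually received.
ps = LogisticRegression().fit(x.reshape(-1, 1), a).predict_proba(x.reshape(-1, 1))[:, 1]
ipw = np.average(y, weights=a / ps) - np.average(y, weights=(1 - a) / (1 - ps))

print(f"naive difference: {naive:.2f}")  # noticeably larger than 2.0
print(f"IPW estimate:     {ipw:.2f}")    # close to the true effect of 2.0
```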

In recent work that appeared last month at the 40th Conference on Uncertainty in Artificial Intelligence (UAI-24), Shpitser and his PhD student Jaron Lee present a causal inference algorithm that answers this question for data on the same superpopulation collected under different experimental conditions; they are joined in this research by Amir Emad Ghassami, a former postdoctoral fellow at Hopkins who now teaches at Boston University.

A solve for systematic missingness

Another common data complication is “missingness”: when the data a researcher is looking for simply isn’t present in the data they’ve collected. Random missingness, which occurs when participants drop out of a study for unrelated reasons, for example, can generally be ignored, but missingness that is systematic must be addressed, says Shpitser. Systematic missingness occurs when data values are missing as a direct result of the causal relationship being investigated, such as when people with more severe depression don’t answer a survey because of their condition; the very data the study needs most is then missing.
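A few lines of code make the distinction concrete. The depression-survey numbers below are invented solely for illustration; the point is that when the chance of responding depends on the very quantity being measured, simply averaging the responses that do arrive gives the wrong answer.

```python
import numpy as np

# Invented toy data: 'severity' is a depression score we would like to average
# over the whole population, but more severe cases are less likely to respond.
rng = np.random.default_rng(1)
severity = rng.normal(loc=10.0, scale=3.0, size=100_000)

# Random missingness: 30% of everyone, regardless of severity, skips the survey.
random_missing = rng.random(severity.size) < 0.3

# Systematic missingness: the probability of responding drops as severity rises.
p_respond = 1 / (1 + np.exp(severity - 12.0))
systematic_missing = rng.random(severity.size) > p_respond

print(f"true mean severity:                {severity.mean():.2f}")
print(f"mean under random missingness:     {severity[~random_missing].mean():.2f}")      # essentially unbiased
print(f"mean under systematic missingness: {severity[~systematic_missing].mean():.2f}")  # biased low
```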

Researchers may be tempted to treat this as a one-size-fits-all problem: download what’s called an imputation package, clean the data, and call the missingness issue solved. Shpitser warns that it’s not that easy. It is possible, however, to compensate for systematically missing data by combining it with auxiliary data that is missing at random.

Shpitser, Ghassami, and Zixiao Wang, BSPH ’24 (ScM), show how this works with a real-life example in new work recently presented at the 41st International Conference on Machine Learning. They demonstrate their method’s effectiveness by estimating the hospitalization rate in New York during the initial stage of the COVID-19 pandemic in March 2020, when hospitalization status often went unrecorded, using auxiliary data from March 2023, a period of improved conditions and thus far less missingness in hospitalization data.
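The estimators in that paper are more sophisticated than anything that fits in a few lines, but the synthetic sketch below conveys the basic role of the auxiliary data. It assumes, purely for illustration, that the relationship between a recorded covariate (here, a made-up “age”) and hospitalization is the same in both periods; the well-recorded period then supplies the outcome model, and the early-pandemic period supplies the patient mix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic sketch only -- not the estimator from the ICML paper. It assumes,
# for illustration, that P(hospitalized | age) is the same in both periods.
rng = np.random.default_rng(2)

def p_hosp(age):
    return 1 / (1 + np.exp(-(age - 70) / 10))

# Early-2020 analogue: older patient mix; hospitalized patients are far more
# likely to have an unrecorded status (systematic missingness).
age_2020 = rng.uniform(40, 90, size=50_000)
hosp_2020 = rng.binomial(1, p_hosp(age_2020))
unrecorded = rng.random(50_000) < 0.6 * hosp_2020

# 2023 analogue: a different patient mix, but nearly complete records.
age_2023 = rng.uniform(20, 90, size=50_000)
hosp_2023 = rng.binomial(1, p_hosp(age_2023))

# Fit the outcome model on the well-recorded 2023-style data, then average its
# predictions over the 2020-style patient mix.
model = LogisticRegression(max_iter=1000).fit(age_2023.reshape(-1, 1), hosp_2023)
transported = model.predict_proba(age_2020.reshape(-1, 1))[:, 1].mean()

print(f"true 2020-analogue rate:   {hosp_2020.mean():.3f}")
print(f"ignoring unrecorded cases: {hosp_2020[~unrecorded].mean():.3f}")  # biased low
print(f"raw 2023-analogue rate:    {hosp_2023.mean():.3f}")               # wrong patient mix
print(f"using the auxiliary model: {transported:.3f}")                    # close to the truth
```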

The fight against zero inflation

There’s another, more insidious type of data missingness that occurs when missing data is recorded as a zero value, which artificially inflates the number of zeros present in the data. In such cases, it can be very difficult to determine which zeros are real and which aren’t.

For example, in another paper presented at UAI-24, Shpitser’s research team—which includes Lee; Trung Phung, Engr ’22 (PhD); and collaborators from the School of Medicine, the Bloomberg School of Public Health, the Johns Hopkins Health System, and Vassar Brothers Medical Center in Poughkeepsie, New York—attempts to find the true rate of central line-associated bloodstream infections, or CLABSIs, which occur when bacteria or germs enter a patient’s bloodstream through their IV.

CLABSI occurrence is determined after the fact when an adjudicator looks at a patient’s electronic health record, or EHR. They denote the presence of a CLABSI with a “1” and the absence of an infection with a “0.” But when the adjudicator doesn’t have enough information to determine the presence of a CLABSI—perhaps because they’re prevented from accessing the full EHR or the EHR data is inconclusive—they may record a zero anyway as a “presumed negative.”

Shpitser describes how to account for these potentially fake zeros: “We use a proxy variable—whether the adjudicator has access to the patient’s full EHR in the first place. It turns out that if we have access to such a proxy, we can put very strong bounds on what the CLABSI probability is.”
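The bounds in the paper rest on its specific causal assumptions and are likely tighter, but the basic worst-case arithmetic of bounding a rate with a proxy fits in a few lines; the counts below are invented for illustration.

```python
# Made-up counts for illustration -- not data from the study. The proxy records
# whether the adjudicator had access to the patient's full EHR.
confirmed_positive = 40   # "1" recorded with full EHR access
confirmed_negative = 900  # "0" recorded with full EHR access
presumed_negative = 60    # "0" recorded without full EHR access; could be either
total = confirmed_positive + confirmed_negative + presumed_negative

# Worst-case bounds on the true CLABSI rate: treat every presumed negative as
# truly negative (lower bound) or as a hidden positive (upper bound).
lower = confirmed_positive / total
upper = (confirmed_positive + presumed_negative) / total
print(f"true CLABSI rate lies between {lower:.1%} and {upper:.1%}")
```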

Data science in health care

Most of the pressing questions in medicine and public health are causal, such as “What’s the cause of this disease?” or “What is the effect of this public health intervention?”

“Using the wealth of data that we have to answer these questions is very difficult and requires sophisticated methods,” says Shpitser. “The fact that we have a lot of data doesn’t necessarily translate into making this any easier; if anything, it makes it harder because a lot of this large-scale data we’re collecting is incredibly poorly curated or has a lot of problems in it.”

To help other researchers address these problems, Shpitser regularly compiles his new findings and causal inference methods into an open-source Python package called Ananke, which is freely available for download.
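For readers who want to experiment, a minimal identifiability check in Ananke looks roughly like the snippet below. The class and method names follow the package’s documentation, but they should be verified against the current release; the three-variable graph is just an example.

```python
# Sketch of an identifiability check with Ananke; verify names against the docs.
from ananke.graphs import ADMG
from ananke.identification import OneLineID

# A small causal graph: treatment T affects outcome Y through mediator M, with
# unmeasured confounding between T and Y represented by a bidirected edge.
vertices = ["T", "M", "Y"]
di_edges = [("T", "M"), ("M", "Y")]
bi_edges = [("T", "Y")]
graph = ADMG(vertices, di_edges=di_edges, bi_edges=bi_edges)

# Ask whether the effect of T on Y is identified from observational data alone.
one_id = OneLineID(graph=graph, treatments=["T"], outcomes=["Y"])
print(one_id.id())          # True for this classic front-door structure
print(one_id.functional())  # the identifying functional, as a string
```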

“I think that my most valuable contribution is not any specific analysis that I do, but just in developing partnerships with clinicians and solving problems together,” Shpitser says. “That’s the thing that’s going to keep paying dividends forward for other problems—in health care and beyond.”

Image Caption: Ilya Shpitser