
Economist uses machine learning to rethink how cause and effect are assessed in statistics

Using Google Cloud preemptible instances and the Techila Technologies Distributed Computing Engine, a University of Pittsburgh team ran 1,000 complex simulations in 48 hours.

One of the major challenges in the social sciences is inferring causation for complex events that unfold in the dynamic conditions of the real world. Variables can be measured and compared, but which ones are most important? Which ones cause the others to follow, in a domino-like chain of cause and effect? One popular approach is to use an instrumental variable to examine whether the variable under investigation has a causal effect on a specific outcome. In theory, the instrument must be strongly correlated with the variable under investigation while having no direct effect on the outcome; in practice, that is hard to achieve. In recent decades, econometricians, who study statistical inference in economics, have developed methods for causal inference that are robust to weak instruments.
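To make the instrumental variable idea concrete, here is a minimal sketch of the textbook two-stage least squares (2SLS) estimator on simulated data. The data-generating process, coefficients, and variable names are all illustrative assumptions; this is the classic estimator, not the authors' weak-instrument-robust method.

```python
# Illustrative 2SLS sketch on simulated data; not the authors' method.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

z = rng.normal(size=n)                       # instrument
u = rng.normal(size=n)                       # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)         # endogenous regressor
y = 2.0 * x + 3.0 * u + rng.normal(size=n)   # outcome; true effect of x is 2.0

# Naive OLS is biased because x and the outcome share the confounder u.
ols = (x @ y) / (x @ x)

# Stage 1: project x onto the instrument z.
# Stage 2: regress y on that projection, keeping only variation in x driven by z.
x_hat = z * ((z @ x) / (z @ z))
iv = (x_hat @ y) / (x_hat @ x_hat)

print(f"OLS: {ols:.2f} (biased upward), 2SLS: {iv:.2f} (close to 2.0)")
```

Shrinking the 0.8 coefficient toward zero weakens the instrument, and the 2SLS estimate becomes erratic; that instability is precisely the weak-instrument problem the team's method is designed to handle.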

Running large-scale statistics with confidence

Mahrad Sharifvaghefi knew this was a central problem in his field. As an Assistant Professor of Economics at the University of Pittsburgh, Sharifvaghefi studies applied macroeconomics and specializes in using machine learning for large-scale statistical analysis. He and his coauthors, Marcelo Moreira of Fundação Getulio Vargas (FGV) in Brazil and Geert Ridder of the University of Southern California, wanted to develop a powerful method for inferring causation in general settings where correlation with the instruments might be weak.

Sharifvaghefi and his coauthors began comparing the performance of their proposed method against competing procedures by running extensive Monte Carlo simulations on servers in the University of Pittsburgh's data center. But their initial analysis took more than 30 minutes to produce a result for a single simulation, and to validate the experiment they needed to run at least 1,000 simulations with 300 different designs. They estimated that at this pace it would take more than six months to reach high-confidence results, and they couldn't be sure they would always have access to the on-premises time and computing capacity they needed. The research required a faster, more efficient computing solution.
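What makes this workload a natural fit for distributed computing is that every (design, replication) pair is an independent simulation. The sketch below illustrates that structure with Python's standard library; run_simulation is a stand-in for the real, roughly 30-minute estimator comparison, and treating the 1,000 simulations as per-design replications is an assumption made for illustration.

```python
# Embarrassingly parallel Monte Carlo driver; the workload is a placeholder.
from concurrent.futures import ProcessPoolExecutor
from itertools import product

import numpy as np

N_DESIGNS = 300         # experimental designs, per the article
N_REPLICATIONS = 1_000  # Monte Carlo replications (assumed per design)

def run_simulation(task: tuple[int, int]) -> tuple[int, int, float]:
    design, rep = task
    # Stand-in workload: seed deterministically so every task is reproducible.
    rng = np.random.default_rng(design * N_REPLICATIONS + rep)
    return design, rep, rng.normal(size=1_000).mean()

if __name__ == "__main__":
    tasks = product(range(N_DESIGNS), range(N_REPLICATIONS))
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_simulation, tasks, chunksize=256))
    print(f"completed {len(results)} independent simulations")
```

Because no task depends on another, the same pattern scales from one multicore server to thousands of cloud workers; a platform like Techila's handles the distribution, retries, and result collection automatically.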

Making big data research easier, faster, and less expensive

Enterprise Architect Brian Pasquini works to solve exactly these kinds of problems for Pittsburgh’s research scientists. He and his colleague Sandra E. Brandon, Strategic Research Liaison in the Office of the CIO, ensure researchers have the right resources at the right time. Pasquini says, “It was impractical for Mahrad to use a whole cluster on premises. He’d have to queue up for time, and our maximum run is six days on at most 672 cores. We looked for a solution where he could scale up on demand and on his schedule.” This led the Pitt team to Google Cloud and Techila Technologies, an approved partner on Google Cloud Marketplace. Techila’s Distributed Computing Engine accelerates research by running parallel computations on Google Cloud’s preemptible virtual machine instances, and it integrates with the statistical applications Mahrad was already using. Rainer Wehkamp, CEO of Techila, says, “Techila’s platform essentially hides the complexity of high-performance computing so scientists can focus on their results while ensuring reproducibility. And using Google Cloud’s preemptible instances is easy and scalable.”
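Techila provisions the cloud workers behind the scenes, but for a sense of what is involved, here is a minimal sketch of requesting a single preemptible VM with the google-cloud-compute Python client. The project, zone, machine type, and image below are placeholder assumptions, not details from the Pitt deployment.

```python
# Sketch: create one preemptible VM with the google-cloud-compute client.
# All identifiers below are illustrative placeholders.
from google.cloud import compute_v1

def create_preemptible_vm(project: str, zone: str, name: str) -> None:
    instance = compute_v1.Instance()
    instance.name = name
    # Four vCPUs per worker, matching the per-instance core count in the article.
    instance.machine_type = f"zones/{zone}/machineTypes/e2-standard-4"

    disk = compute_v1.AttachedDisk()
    disk.boot = True
    disk.auto_delete = True
    init = compute_v1.AttachedDiskInitializeParams()
    init.source_image = "projects/debian-cloud/global/images/family/debian-12"
    disk.initialize_params = init
    instance.disks = [disk]

    nic = compute_v1.NetworkInterface()
    nic.network = "global/networks/default"
    instance.network_interfaces = [nic]

    # The key setting: preemptible capacity is far cheaper, but Google Cloud
    # may reclaim it, so jobs must tolerate workers disappearing mid-run.
    instance.scheduling = compute_v1.Scheduling(preemptible=True)

    operation = compute_v1.InstancesClient().insert(
        project=project, zone=zone, instance_resource=instance
    )
    operation.result()  # block until the instance is created

# Example (placeholders): create_preemptible_vm("my-project", "us-central1-a", "mc-worker-0001")
```

Monte Carlo replications suit preemptible capacity well: if a worker is reclaimed mid-run, its unfinished tasks can simply be rescheduled on another instance.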

It worked. With Google Cloud and Techila, Sharifvaghefi ran the simulations on 10,000 preemptible instances, each with four cores, in only 48 hours. “This would have been impossible on shared campus resources,” he says. “The experiment ran much faster and was also more accurate,” Pasquini reports. “That means Mahrad has more definite results or inferences.” Using preemptible instances was also more cost-effective: the team estimates that the project cost less than a third of what it would have cost on premises.

Transforming research methodologies across fields

“This workflow could transform statistical methodology for economics and beyond,” Sharifvaghefi says. It could improve the confidence level of any complex statistical analysis, such as how different interventions affect COVID infection rates or how interest rates and consumer spending interact. It could help answer these and other urgent socioeconomic questions that impact our everyday lives.

The Pitt team wants to expand the benefits of these innovative cloud technologies to researchers outside the sciences as well. The university’s shared high-performance computing resources are heavily utilized, so this will also help distribute demand and accommodate projects that can’t wait. Brandon says, “We want to encourage research that transcends disciplines and impacts the whole community. This is about speed to science: demonstrating the capacity of cloud computing for scaling and just-in-time resources. If we rely only on traditional infrastructure, by the time we design, purchase, and build it, it could very well be obsolete.” Pasquini agrees, adding that deploying Google Cloud and Techila was like “providing an easy button. Mahrad’s project was our first foray into this new workflow, but there’s more to come.”

For more details about Sharifvaghefi et al.’s research, see their working paper. To get started with Google Cloud, apply for free credits towards your research.

