• ## Uncertainty analysis of GiveWell's cost-effectiveness analysis

GiveWell produces cost-effectiveness models of its top charities. These models take as inputs many uncertain parameters. Instead of representing those uncertain parameters with point estimates—as the cost-effectiveness analysis spreadsheet does—we can (should) represent them with probability distributions. Feeding probability distributions into the models allows us to output explicit probability distributions on the cost-effectiveness of each charity.

### GiveWell’s cost-effectiveness analysis

GiveWell, an in-depth charity evaluator, makes their detailed spreadsheet models available for public review. These spreadsheets estimate the value per dollar of donations to their 8 top charities: GiveDirectly, Deworm the World, Schistosomiasis Control Initiative, Sightsavers, Against Malaria Foundation, Malaria Consortium, Helen Keller International, and the END Fund. For each charity, a model maps input values to an estimated value per dollar donated to that charity. The inputs to these models vary from parameters like “malaria prevalence in areas where AMF operates” to “value assigned to averting the death of an individual under 5”.

Helpfully, GiveWell isolates the input parameters it deems as most uncertain. These can be found in the “User inputs” and “Moral weights” tabs of their spreadsheet. Outsiders interested in the top charities can reuse GiveWell’s model but supply their own perspective by adjusting the values of the parameters in these tabs.

For example, if I go to the “Moral weights” tab and run the calculation with a 0.1 value for doubling consumption for one person for one year—instead of the default value of 1—I see the effect of this modification on the final results: deworming charities look much less effective since their primary effect is on income.

### Uncertain inputs

GiveWell provides the ability to adjust these input parameters and observe altered output because the inputs are fundamentally uncertain. But our uncertainty means that picking any particular value as input for the calculation misrepresents our state of knowledge. From a subjective Bayesian point of view, the best way to represent our state of knowledge on the input parameters is with a probability distribution over the values the parameter could take. For example, I could say that a negative value for increasing consumption seems very improbable to me but that a wide range of positive values seem about equally plausible. Once we specify a probability distribution, we can feed these distributions into the model and, in principle, we’ll end up with a probability distribution over our results. This probability distribution on the results helps us understand the uncertainty contained in our estimates and how literally we should take them.

#### Is this really necessary?

Perhaps that sounds complicated. How are we supposed to multiply, add and otherwise manipulate arbitrary probability distributions in the way our models require? Can we somehow reduce our uncertain beliefs about the input parameters to point estimates and run the calculation on those?

One candidate is to take the single most likely value of each input and use that value in our calculations. This is the approach the current cost-effectiveness analysis takes (assuming you provide input values selected in this way). Unfortunately, the output of running the model on these inputs is necessarily a point value and gives no information about the uncertainty of the results. Because the results are probably highly uncertain, being unable to talk about that uncertainty is a major loss.

A second possibility is to run the calculation twice: once on lower bounds for the input parameters and once on upper bounds. This will produce two bounding values on our results, but it’s hard to give them a useful meaning. If the lower and upper bounds on our inputs describe, for example, a 95% confidence interval, the lower and upper bounds on the result don’t (usually) describe a 95% confidence interval.
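
To make the problem with bounds concrete, here is a minimal sketch. It assumes a toy "model" that just multiplies two independent inputs, and the lognormal shapes are my own illustrative choice rather than anything taken from GiveWell's spreadsheet. It compares the interval you get by pushing the input bounds through the model with the actual 95% interval of the output.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two hypothetical, independent uncertain inputs (illustrative assumption).
a = rng.lognormal(mean=0.0, sigma=0.5, size=n)
b = rng.lognormal(mean=0.0, sigma=0.5, size=n)


def interval_95(samples):
    """Return the central 95% interval of a sample."""
    lo, hi = np.percentile(samples, [2.5, 97.5])
    return lo, hi


a_lo, a_hi = interval_95(a)
b_lo, b_hi = interval_95(b)

# Naive approach: run the model (here, a product) on the input bounds.
naive_lo, naive_hi = a_lo * b_lo, a_hi * b_hi

# Actual 95% interval of the output distribution.
output = a * b
true_lo, true_hi = interval_95(output)

# Share of output samples that land inside the naive bounds.
naive_coverage = np.mean((output >= naive_lo) & (output <= naive_hi))

print(f"naive bounds: ({naive_lo:.2f}, {naive_hi:.2f})")
print(f"true 95% interval: ({true_lo:.2f}, {true_hi:.2f})")
print(f"coverage of naive bounds: {naive_coverage:.3f}")
```

In this setup the naive bounds come out wider than the true 95% interval, because both inputs are rarely extreme at the same time. Other models and dependence structures can behave differently, which is exactly why the bounds are hard to interpret.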

#### Computers are nice

If we had to proceed analytically, working with probability distributions throughout, the model would indeed be troublesome and we might have to settle for one of the above approaches. But we live in the future. We can use computers and Monte Carlo methods to numerically approximate the results of working with probability distributions while leaving our models clean and unconcerned with these probabilistic details. Guesstimate is a tool that works along these lines and bills itself as “A spreadsheet for things that aren’t certain”.
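
As a rough sketch of the idea (plain NumPy rather than Guesstimate or PyMC3), the function below is an ordinary point-estimate model that knows nothing about probability. The function name, the three parameters and every distribution are hypothetical placeholders, not GiveWell's actual inputs. Monte Carlo just means calling the model on many sampled inputs and looking at the spread of outputs.

```python
import numpy as np


def value_per_dollar(cost_per_person, effect_per_person, moral_weight):
    """Toy point-estimate model (hypothetical); no probabilistic details inside."""
    return effect_per_person * moral_weight / cost_per_person


rng = np.random.default_rng(1)
n = 100_000

# Assumed input distributions, chosen purely for illustration.
cost = rng.lognormal(mean=np.log(5.0), sigma=0.2, size=n)
effect = rng.normal(loc=0.02, scale=0.005, size=n)
weight = rng.uniform(0.5, 2.0, size=n)

# The same model runs unchanged on arrays of samples.
samples = value_per_dollar(cost, effect, weight)

# Compare with a single run on point estimates.
point = value_per_dollar(5.0, 0.02, 1.0)
lo, median, hi = np.percentile(samples, [5, 50, 95])

print(f"point estimate: {point:.4f}")
print(f"5th / 50th / 95th percentiles: {lo:.4f} / {median:.4f} / {hi:.4f}")
```

The model code stays clean; all the uncertainty lives in how the inputs are generated.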

### Analysis

We have the beginnings of a plan then. We can implement GiveWell’s cost-effectiveness models in a Monte Carlo framework (PyMC3 in this case), specify probability distributions over the input parameters, and finally run the calculation and look at the uncertainty that’s been propagated to the results.

Full post

• ## Conditioning in causal graphs

As mentioned in the warnings on the first post on graphical causal models, I’ve been lying to you so far. But it was for a good reason: that sweet, sweet expository simplicity. So far, all our definitions, algorithms, etc. have proceeded without any acknowledgment of the social scientists’ favorite statistical tool: controlling for a variable.

In this post, we’ll introduce the concept of conditioning to our graphical causal models framework and see how it both complicates things and offers new possibilities. (This post deliberately mirrors the structure of that one so it may be handy to have it open in a second tab/window for comparison purposes.)

### Causal triplets, again

We started out by talking about three types of causal triplets: chains, forks and inverted forks. For convenience, here is the summary table we ended up with:

When we add the possibility of conditioning, things change dramatically:

The complete reversal of in/dependence occasioned by conditioning on the middle vertex may be a bit surprising. There’s a certain reflex that says whenever you want to draw a clean causal story out of messy data, conditioning on more stuff will help you. But as we see here, that’s not generally true. Conditioning can also introduce spurious correlation.
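
A quick simulation can make the inverted fork case tangible. This is a sketch under assumptions of my own: the variables are linear and Gaussian, and "conditioning" is approximated crudely by keeping only samples where the middle vertex falls in a narrow slice.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Inverted fork: x -> z <- y, with x and y generated independently.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(scale=0.5, size=n)

# Without conditioning, x and y are (nearly) uncorrelated.
corr_marginal = np.corrcoef(x, y)[0, 1]

# Crude conditioning on the collider: keep only samples with z close to 0.
selected = np.abs(z) < 0.1
corr_conditional = np.corrcoef(x[selected], y[selected])[0, 1]

print(f"correlation of x and y: {corr_marginal:.3f}")
print(f"correlation of x and y given z near 0: {corr_conditional:.3f}")
```

Within the slice, knowing x tells you a lot about y (they must roughly cancel), so a strong negative correlation appears even though neither variable causes the other.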

Full post

• ## Instrumental variables on causal graphs

Last time we talked about viewing d-separation as a tool for model selection. But we’re pretty limited in the causal models we can distinguish between by only observing our variables of interest—any two graphs with the same set of d-separations are indistinguishable. Instrumental variables are a common tool for trying to get around the limitations of purely observational data.

### Instrumental variables

Instrumental variables (IV) are variables that we’re not intrinsically interested in but that we look at in an attempt to suss out causality. The instrument must be correlated with our cause, but its only impact on the effect should be via the cause.

The classic example is about—you guessed it—smoking. Because running an RCT on smoking is ethically verboten, we’re limited to observational data. How can we determine if smoking causes lung cancer from observational data alone? An instrumental variable! To reiterate, we want a factor that affects smoking prevalence but (almost certainly) does not affect lung cancer in other ways. Finding an instrument that satisfies the IV criteria generally seems to require substantial creativity. Can you think of an instrument for the causal effect of smoking on lung cancer?

An instrument that meets these criteria is a tax on cigarettes. We expect smoking to decrease as taxes increase, but it seems hard to imagine a cigarette tax otherwise having an effect on lung cancer.
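
To see why an instrument helps, here is a small simulation. It is only a sketch: it assumes everything is linear, invents an unobserved confounder `u`, and picks all coefficients arbitrarily. The IV estimate uses the standard ratio (Wald) form, dividing the instrument's association with the effect by its association with the cause.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
true_effect = 2.0  # assumed causal effect of smoking on cancer risk

tax = rng.normal(size=n)  # instrument
u = rng.normal(size=n)    # unobserved confounder (hypothetical)

# Higher taxes reduce smoking; the confounder affects both variables.
smoking = -1.0 * tax + u + rng.normal(size=n)
cancer = true_effect * smoking + 3.0 * u + rng.normal(size=n)


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


# Regressing the effect on the cause directly is biased by the confounder.
naive = slope(smoking, cancer)

# Ratio estimator: effect of tax on cancer divided by effect of tax on smoking.
iv = slope(tax, cancer) / slope(tax, smoking)

print(f"true effect: {true_effect:.2f}")
print(f"naive regression: {naive:.2f}")
print(f"IV estimate: {iv:.2f}")
```

The naive regression overstates the effect because `u` pushes smoking and cancer in the same direction, while the IV estimate recovers something close to the assumed value.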

### Instrumental variables on causal graphs

Okay, so that’s what IVs are at a high level. But what are they concretely in the graphical causal model setting we’ve been developing?

#### A brief notational interlude

We’ll get this out of the way here:

• $$\perp\!\!\!\perp$$ is the symbol for d-separation
• Once we add the strikethrough, $$\not\!\!{\perp\!\!\!\perp}$$ means d-connected.
• If $$G$$ is a graph, $$G_{\overline{X}}$$ is $$G$$ in which all the edges pointing to vertex X have been removed.

#### Defined

We’ll start with the definition and then try to build up a feel for it. An instrumental variable X for the causal effect of Y on Z in graph G must be:

1. d-connected to our cause Y—$$(X \not\!\!{\perp\!\!\!\perp} Y)_G$$
2. d-separated from our effect Z after severing the cause Y from all its parents—$$(X \perp\!\!\!\perp Z)_{G_{\overline{Y}}}$$
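
The two conditions above can be checked mechanically on a small graph. The sketch below is my own illustration, not code from the post. It represents a graph as a dict from each vertex to its children, and relies on the fact that, with no conditioning, two vertices are d-connected exactly when they share an ancestor (counting each vertex as its own ancestor). The example graphs, including the unobserved confounder `U`, are hypothetical.

```python
def ancestors(graph, node):
    """Return `node` together with every vertex that has a directed path to it."""
    seen = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        for parent, children in graph.items():
            if current in children and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_connected(graph, a, b):
    """Unconditional d-connection: a collider-free path exists between a and b."""
    return bool(ancestors(graph, a) & ancestors(graph, b))


def sever_parents(graph, node):
    """Return a copy of `graph` with every edge pointing into `node` removed."""
    return {p: [c for c in cs if c != node] for p, cs in graph.items()}


def is_instrument(graph, x, y, z):
    """Check both conditions for X as an instrument for the effect of Y on Z."""
    connected_to_cause = d_connected(graph, x, y)
    separated_from_effect = not d_connected(sever_parents(graph, y), x, z)
    return connected_to_cause and separated_from_effect


# Hypothetical graph: tax affects cancer only through smoking.
good = {"tax": ["smoking"], "U": ["smoking", "cancer"], "smoking": ["cancer"]}
# Hypothetical graph: tax also affects cancer directly.
bad = {"tax": ["smoking", "cancer"], "U": ["smoking", "cancer"], "smoking": ["cancer"]}

print(is_instrument(good, "tax", "smoking", "cancer"))
print(is_instrument(bad, "tax", "smoking", "cancer"))
```

In the first graph the tax qualifies; in the second, the direct edge from tax to cancer survives the severing step and breaks condition 2.
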
Full post

• ## Flip it and reverse it

Last time we found the d-separations that correspond to a graph. This time, we find the graphs that correspond to a set of d-separations. This direction is more useful because we generally know d-separations and generally don’t know graphs.

Last time we talked about causal graphs, what d-separation and d-connection mean, and how to infer these properties from a causal graph. But this isn’t terribly useful because it requires that we have a fully specified causal graph. If we’re performing research in new or uncertain areas, we have data rather than a causal graph. And this data tells us about d-separations (variables that are independent of each other) and d-connections (variables that are correlated). So our work last time was exactly backwards: graphs to d-separations. This time we’ll go from d-separations to graphs.

### Model selection

One way to think about d-separation and d-connection is as helping us with model selection. Last time we presented

as one possible causal model regarding smoking. But it’s not the only possibility. We might also be worried that the true causal structure looks like this (just go with it):

How can we tell them apart? Can we use observational data alone? In this case, observational data alone is enough to distinguish between these two causal models! The key is that the two models have different sets of d-separations. In the original model, all the vertices are d-connected and there are no d-separations (this must be the case since there are no colliders). In the second (silly) model, “smoking” and “lung cancer” are d-separated because “yellow fingers” is a collider between them. If our data show that smoking and lung cancer are independent, we must rule out the first model and prefer the second. If the two variables are correlated, we must rule out the second model and prefer the first.

This is a procedure that works generally:

1. Draw out the plausible graphical causal models that include all the variables you have data on
2. Determine the d-separations for each plausible model
3. Determine the variables in your data that are independent
4. Retain the models from step 1 whose d-separations in step 2 are compatible with the data analysis in step 3

The ideal is that there’s only one model left at the end of step 4. However, it’s possible to end up with none. This means that step 1 wasn’t permissive enough and more models need to be considered. It’s also possible to end up with more than one model. Not all models are distinguishable by observational data alone. This occurs whenever two models have the same set of d-separations.
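
The four-step procedure above can be sketched in a few lines. This is an illustration under assumptions: graphs are dicts from each vertex to its children, only unconditional d-separations are considered, and the two candidate graphs are my own stand-ins for the smoking models discussed (the first is simply some collider-free graph, the second has "yellow fingers" as a collider).

```python
from itertools import combinations


def ancestors(graph, node):
    """Return `node` together with every vertex that has a directed path to it."""
    seen = {node}
    stack = [node]
    while stack:
        current = stack.pop()
        for parent, children in graph.items():
            if current in children and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def d_separations(graph, variables):
    """Step 2: all pairs with no shared ancestor, i.e. no collider-free path."""
    return {
        frozenset((a, b))
        for a, b in combinations(variables, 2)
        if not ancestors(graph, a) & ancestors(graph, b)
    }


def compatible_models(models, variables, observed_independent):
    """Step 4: keep models whose d-separations match the observed independencies."""
    return [
        name
        for name, graph in models.items()
        if d_separations(graph, variables) == observed_independent
    ]


variables = ["smoking", "yellow fingers", "lung cancer"]

# Step 1: hypothetical candidate models.
models = {
    "no collider": {"smoking": ["yellow fingers", "lung cancer"]},
    "collider": {"smoking": ["yellow fingers"], "lung cancer": ["yellow fingers"]},
}

# Step 3: suppose the data showed smoking and lung cancer to be independent.
independent = {frozenset(("smoking", "lung cancer"))}
print(compatible_models(models, variables, independent))

# Alternatively, suppose the data showed no independencies at all.
print(compatible_models(models, variables, set()))
```

An empty result would mean the candidate list in step 1 was too narrow; more than one surviving name would mean the data cannot tell those models apart.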

Full post

can the human brain deal with the complexity to control an extra limb and yield advantages from it? […] Anatomical MRI of the supernumerary finger (SF) revealed that it is actuated by extra muscles and nerves, and fMRI identified a distinct cortical representation of the SF. […] Polydactyly subjects were able to coordinate the SF with their other fingers for more complex movements than five fingered subjects, and so carry out with only one hand tasks normally requiring two hands.

In summary, most of the biggest claims made by Wilkinson and Pickett in The Spirit Level look even weaker today than they did when the book was published. Only one of the six associations stand up under W & P’s own methodology and none of them stand up when the full range of countries is analysed. In the case of life expectancy - the very flagship of The Spirit Level - the statistical association is the opposite of what the hypothesis predicts.

If The Spirit Level hypothesis were correct, it would produce robust and consistent results over time as the underlying data changes. Instead, it seems to be extremely fragile, only working when a very specific set of statistics are applied to a carefully selected list of countries.

The allure of “meta” and “axiomatic first principles” is that it’s kinda like get-rich-quick thinking but for epistemics. Get a few abstractions really right and potentially earn more than you would grinding as an object-level wage slave for decades.

Trying to identify the best policy is different from estimating the precise impact of every individual policy: as long as we can identify the best policy, we do not care about the precise impacts of inferior policies. Yet, despite this, most experiments follow protocols that are designed to figure out the impact of every policy, even the obviously inferior ones.

Cambiaso rode six different horses to help his team win. […] What is noteworthy is that all six horses were clones of the same mare—they’re named Cuartetera 01 through 06. […] “Every scientist that deals with epigenetics told me this would never work,” says Meeker

Full post
