(This post is painfully long. Coping advice: Each subsection within Direct (empirical) evidence, within Indirect evidence, and within Responses is pretty independent—feel free to dip in and out as desired. I’ve also put a list-formatted summary at the end of each of these sections boiling down each subsection to one or two sentences.)

Intro

Dan is a student council representative at his school. This semester he is in charge of scheduling discussions about academic issues. He often picks topics that appeal to both professors and students in order to stimulate discussion.

Is Dan’s behavior morally acceptable? At first glance, you’d be inclined to say yes. And even on second and third glance, obviously, yes. Dan is a stand-up guy. But what if you’d been experimentally manipulated to feel disgust while reading the vignette? If we’re to believe (Wheatley and Haidt 2005), there’s a one-third chance you’d judge Dan as morally suspect. ‘One subject justified his condemnation of Dan by writing “it just seems like he’s up to something.” Another wrote that Dan seemed like a “popularity seeking snob.”’

The possibility that moral judgments track irrelevant factors like incidental disgust at the moment of evaluation is (to me, at least) alarming. But now that you’ve been baited, we can move on to the boring, obligatory formalities.

Arguably, we don’t care about the exact cost-effectiveness estimates of each of GiveWell’s top charities. Instead, we care about their relative values. By using distance metrics across these multidimensional outputs, we can perform uncertainty and sensitivity analysis to answer questions about:

how uncertain we are about the overall relative values of the charities

which input parameters this overall relative valuation is most sensitive to

In the last two posts, we performed uncertainty and sensitivity analyses on GiveWell’s charity cost-effectiveness estimates. Our outputs were, respectively:

probability distributions describing our uncertainty about the value per dollar obtained for each charity and

estimates of how sensitive each charity’s cost-effectiveness is to each of its input parameters

Another issue is that by treating each cost-effectiveness estimate as independent, we underweight parameters that are shared across many models. For example, the moral weight that ought to be assigned to increasing consumption shows up in many models. If we consider all the charity-specific models together, this input seems to become more important.

Metrics on rankings

We can solve both of these problems by abstracting away from particular values in the cost-effectiveness analysis and looking at the overall rankings returned. That is, we want to transform:

GiveWell’s cost-effectiveness estimates for its top charities

| Charity | Value per $10,000 donated |
|---|---|
| GiveDirectly | 38 |
| The END Fund | 222 |
| Deworm the World | 738 |
| Schistosomiasis Control Initiative | 378 |
| Sightsavers | 394 |
| Malaria Consortium | 326 |
| Against Malaria Foundation | 247 |
| Helen Keller International | 223 |

into:

But how do we usefully express probabilities over rankings^{1} (rather than probabilities over simple cost-effectiveness numbers)? The approach we’ll follow below is to characterize a ranking produced by a run of the model by computing its distance from the reference ranking listed above (i.e. GiveWell’s current best estimate). Our output probability distribution will then express how far we expect to be from the reference ranking—how much we might learn about the ranking with more information on the inputs. For example, if the distribution is narrow and near 0, that means our uncertain input parameters mostly produce results similar to the reference ranking. If the distribution is wide and far from 0, that means our uncertain input parameters produce results that are highly uncertain and not necessarily similar to the reference ranking.

Spearman’s footrule

What is this mysterious distance metric between rankings that enables the above approach? One such metric is called Spearman’s footrule distance. For rankings \(u\) and \(v\), it’s defined as:

\[d(u, v) = \sum_{x \in A} \lvert \text{pos}(u, x) - \text{pos}(v, x) \rvert\]

where:

\(x\) varies over all the elements of the set \(A\) of items being ranked and

\(\text{pos}(r, x)\) returns the integer position of item \(x\) in ranking \(r\).

In other words, the footrule distance between two rankings is the sum over all items of the (absolute) difference in positions for each item. (We also add a normalization factor so that the distance ranges from 0 to 1 but omit that trivia here.)

So the distance between A, B, C and A, B, C is 0; the (unnormalized) distance between A, B, C and C, B, A is 4; and the (unnormalized) distance between A, B, C and B, A, C is 2.
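These toy calculations are easy to check with a few lines of Python (a minimal sketch, assuming each ranking is a list of distinct items):

```python
def footrule_distance(u, v):
    """Unnormalized Spearman's footrule: the sum over all items of the
    absolute difference between the item's positions in the two rankings."""
    return sum(abs(u.index(x) - v.index(x)) for x in u)

print(footrule_distance(["A", "B", "C"], ["C", "B", "A"]))  # 4
print(footrule_distance(["A", "B", "C"], ["B", "A", "C"]))  # 2
```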

Kendall’s tau

Another common distance metric between rankings is Kendall’s tau. For rankings \(u\) and \(v\), it’s defined as:

\[K(u, v) = \sum_{\{i, j\} \in P} \bar{K}_{i,j}(u, v)\]

where:

\(\{i, j\}\) varies over the set \(P\) of unordered pairs of distinct elements in \(u\) and \(v\) and

\(\bar{K}_{i,j}(u, v) = 0\) if \(i\) and \(j\) are in the same order (concordant) in \(u\) and \(v\) and \(\bar{K}_{i,j}(u, v) = 1\) otherwise (discordant).

In other words, the Kendall tau distance looks at all possible pairs across items in the rankings and counts up the ones where the two rankings disagree on the ordering of these items. (There’s also a normalization factor that we’ve again omitted so that the distance ranges from 0 to 1.)

So the distance between A, B, C and A, B, C is 0; the (unnormalized) distance between A, B, C and C, B, A is 3; and the (unnormalized) distance between A, B, C and B, A, C is 1.
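Again, a minimal Python sketch (assuming the rankings are lists of distinct items):

```python
from itertools import combinations

def kendall_tau_distance(u, v):
    """Unnormalized Kendall tau: the number of pairs of items whose
    relative order differs between the two rankings."""
    pos_v = {item: i for i, item in enumerate(v)}
    # combinations(u, 2) yields each pair in u's order, so a pair is
    # discordant exactly when v reverses that order.
    return sum(1 for a, b in combinations(u, 2) if pos_v[a] > pos_v[b])

print(kendall_tau_distance(["A", "B", "C"], ["C", "B", "A"]))  # 3
print(kendall_tau_distance(["A", "B", "C"], ["B", "A", "C"]))  # 1
```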

Angular distance

One drawback of the above metrics is that they throw away information in going from the table with cost-effectiveness estimates to a simple ranking. What would be ideal is to keep that information and find some other distance metric that still emphasizes the relationship between the various numbers rather than their precise values.

Angular distance is a metric which satisfies these criteria. We can regard the table of charities and cost-effectiveness values as an 8-dimensional vector. When our output produces another vector of cost-effectiveness estimates (one for each charity), we can compare this to our reference vector by finding the angle between the two^{2}.
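A sketch with NumPy, using the reference estimates from the table above. The key property is scale invariance: multiplying every estimate by the same factor leaves the angle unchanged, which is the sense in which the metric cares about relationships between the numbers rather than their precise values.

```python
import numpy as np

# GiveWell's reference estimates (value per $10,000) from the table above.
reference = np.array([38, 222, 738, 378, 394, 326, 247, 223], dtype=float)

def angular_distance(u, v):
    """Angle (in radians) between two cost-effectiveness vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))  # clip guards against rounding error

print(angular_distance(reference, reference))      # ~0
print(angular_distance(reference, 2 * reference))  # ~0 (scale-invariant)
```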

Visual (scatter plot) and delta moment-independent sensitivity analysis on GiveWell’s cost-effectiveness models show which input parameters the cost-effectiveness estimates are most sensitive to. Preliminary results (given our input uncertainty) show that some input parameters are much more influential on the final cost-effectiveness estimates for each charity than others.

Last time we introduced GiveWell’s cost-effectiveness analysis which uses a spreadsheet model to take point estimates of uncertain input parameters to point estimates of uncertain results. We adjusted this approach to take probability distributions on the input parameters and in exchange got probability distributions on the resulting cost-effectiveness estimates. But this machinery lets us do more. Now that we’ve completed an uncertainty analysis, we can move on to sensitivity analysis.

Sensitivity analysis

The basic idea of sensitivity analysis is, when working with uncertain values, to see which input values most affect the output when they vary. For example, if you have the equation \(f(a, b) = 2^a + b\) and each of \(a\) and \(b\) varies uniformly over the range from 5 to 10, \(f(a, b)\) is much more sensitive to \(a\) than \(b\). A sensitivity analysis is practically useful in that it can offer you guidance as to which parameters in your model it would be most useful to investigate further (i.e. to narrow their uncertainty).
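A quick Monte Carlo check of this toy example (the correlation coefficients printed here are only a crude stand-in for a proper sensitivity index):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(5, 10, size=100_000)
b = rng.uniform(5, 10, size=100_000)
f = 2**a + b

# f co-varies strongly with a and barely with b: 2**a ranges over
# 32..1024 while b only ranges over 5..10.
print(np.corrcoef(a, f)[0, 1])  # high (~0.9)
print(np.corrcoef(b, f)[0, 1])  # near 0
```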

Visual sensitivity analysis

The first kind of sensitivity analysis we’ll run is just to look at scatter plots comparing each input parameter to the final cost-effectiveness estimates. We can imagine these scatter plots as the result of running the following procedure many times^{1}: sample a single value from the probability distribution for each input parameter and run the calculation on these values to determine a result value. If we repeat this procedure enough times, it starts to approximate the true values of the probability distributions.

(One nice feature of this sort of analysis is that we see how the output depends on a particular input even in the face of variations in all the other inputs—we don’t hold everything else constant. In other words, this is a global sensitivity analysis.)

(Caveat: We are again pretending that we are equally uncertain about each input parameter and the results reflect this limitation. To see the analysis result for different input uncertainties, edit and run the Jupyter notebook.)

Direct cash transfers

GiveDirectly

The scatter plots show that, given our choice of input uncertainty, the output is most sensitive (i.e. the scatter plot for these parameters shows the greatest directionality) to the input parameters:

Highlighted input factors to which result is highly sensitive

| Input | Type of uncertainty | Meaning/importance |
|---|---|---|
| value of increasing ln consumption per capita per annum | Moral | Determines final conversion between empirical outcomes and value |
| transfer as percent of total cost | Operational | Determines cost of results |
| return on investment | Opportunities available to recipients | Determines stream of consumption over time |
| baseline consumption per capita | Empirical | Diminishing marginal returns to consumption mean that baseline consumption matters |

GiveWell produces cost-effectiveness models of its top charities. These models take as inputs many uncertain parameters. Instead of representing those uncertain parameters with point estimates—as the cost-effectiveness analysis spreadsheet does—we can (should) represent them with probability distributions. Feeding probability distributions into the models allows us to output explicit probability distributions on the cost-effectiveness of each charity.

GiveWell, an in-depth charity evaluator, makes their detailed spreadsheet models available for public review. These spreadsheets estimate the value per dollar of donations to their 8 top charities: GiveDirectly, Deworm the World, Schistosomiasis Control Initiative, Sightsavers, Against Malaria Foundation, Malaria Consortium, Helen Keller International, and the END Fund. For each charity, a model is constructed taking input values to an estimated value per dollar of donation to that charity. The inputs to these models vary from parameters like “malaria prevalence in areas where AMF operates” to “value assigned to averting the death of an individual under 5”.

Helpfully, GiveWell isolates the input parameters it deems as most uncertain. These can be found in the “User inputs” and “Moral weights” tabs of their spreadsheet. Outsiders interested in the top charities can reuse GiveWell’s model but supply their own perspective by adjusting the values of the parameters in these tabs.

For example, if I go to the “Moral weights” tab and run the calculation with a 0.1 value for doubling consumption for one person for one year—instead of the default value of 1—I see the effect of this modification on the final results: deworming charities look much less effective since their primary effect is on income.

Uncertain inputs

GiveWell provides the ability to adjust these input parameters and observe altered output because the inputs are fundamentally uncertain. But our uncertainty means that picking any particular value as input for the calculation misrepresents our state of knowledge. From a subjective Bayesian point of view, the best way to represent our state of knowledge on the input parameters is with a probability distribution over the values the parameter could take. For example, I could say that a negative value for increasing consumption seems very improbable to me but that a wide range of positive values seem about equally plausible. Once we specify a probability distribution, we can feed these distributions into the model and, in principle, we’ll end up with a probability distribution over our results. This probability distribution on the results helps us understand the uncertainty contained in our estimates and how literally we should take them.
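To make this concrete, here is a toy sketch of the idea (not GiveWell’s actual model; both input distributions and the “model” itself are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical stand-ins for two uncertain inputs: a moral weight that is
# surely positive but widely spread, and an operational share between 0 and 1.
value_of_doubling_consumption = rng.lognormal(mean=0.0, sigma=0.5, size=n)
transfer_as_pct_of_total_cost = rng.beta(80, 20, size=n)

# A toy "model": value per dollar as a product of the two inputs.
value_per_dollar = value_of_doubling_consumption * transfer_as_pct_of_total_cost

# Instead of a single point estimate, we get a whole distribution to summarize.
lo, med, hi = np.percentile(value_per_dollar, [5, 50, 95])
print(f"median {med:.2f}, 90% interval [{lo:.2f}, {hi:.2f}]")
```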

Is this really necessary?

Perhaps that sounds complicated. How are we supposed to multiply, add and otherwise manipulate arbitrary probability distributions in the way our models require? Can we somehow reduce our uncertain beliefs about the input parameters to point estimates and run the calculation on those? One candidate is to take the single most likely value of each input and use that value in our calculations. This is the approach the current cost-effectiveness analysis takes (assuming you provide input values selected in this way). Unfortunately, the output of running the model on these inputs is necessarily a point value and gives no information about the uncertainty of the results. Because the results are probably highly uncertain, losing this information and being unable to talk about the uncertainty of the results is a major loss. A second possibility is to run the calculation once with the lower bounds of the input parameters and once with their upper bounds. This will produce two bounding values on our results, but it’s hard to give them a useful meaning. If the lower and upper bounds on our inputs describe, for example, a 95% confidence interval, the lower and upper bounds on the result don’t (usually) describe a 95% confidence interval.
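The last point is easy to demonstrate numerically. If each of two independent standard normal inputs gets a 95% interval, propagating the interval endpoints through even the simple model \(f(x, y) = x + y\) produces an interval that covers far more than 95% of the output:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)

# Each input has a 95% interval of roughly [-1.96, 1.96]. Propagating the
# endpoints through f(x, y) = x + y gives [-3.92, 3.92], but the true 95%
# interval of x + y is about [-2.77, 2.77] (i.e. ±1.96·√2).
coverage = np.mean(np.abs(x + y) < 3.92)
print(coverage)  # well above 0.95
```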

Computers are nice

If we had to proceed analytically, working with probability distributions throughout, the model would indeed be troublesome and we might have to settle for one of the above approaches. But we live in the future. We can use computers and Monte Carlo methods to numerically approximate the results of working with probability distributions while leaving our models clean and unconcerned with these probabilistic details. Guesstimate is a tool that works along these lines and bills itself as “A spreadsheet for things that aren’t certain”.

Analysis

We have the beginnings of a plan then. We can implement GiveWell’s cost-effectiveness models in a Monte Carlo framework (PyMC3 in this case), specify probability distributions over the input parameters, and finally run the calculation and look at the uncertainty that’s been propagated to the results.

As mentioned in the warnings on the first post on graphical causal models, I’ve been lying to you so far. But it was for a good reason: that sweet, sweet expository simplicity. So far, all our definitions, algorithms, etc. have proceeded without any acknowledgment of the social scientists’ favorite statistical tool: controlling for a variable^{1}.

In this post, we’ll introduce the concept of conditioning to our graphical causal models framework and see how it both complicates things and offers new possibilities. (This post deliberately mirrors the structure of that one so it may be handy to have it open in a second tab/window for comparison purposes.)

Causal triplets, again

We started out by talking about three types of causal triplets: chains, forks and inverted forks. For convenience, here is the summary table we ended up with:

Types of causal triplets

| Name of triplet | Name of central vertex | Diagram | Ends (A and C) dependent? |
|---|---|---|---|
| Chain | Mediator/Traverse | A → B → C | Causally (probably) |
| Fork | Confounder/Common cause | A ← B → C | Noncausally |
| Inverted fork | Collider/Common effect | A → B ← C | No |

When we add the possibility of conditioning, things change dramatically:

Types of causal triplets with conditioning on central vertex

| Name of triplet | Name of central vertex | Diagram | Ends (A and C) dependent? |
|---|---|---|---|
| Chain | Mediator/Traverse | A → B → C | No |
| Fork | Confounder/Common cause | A ← B → C | No |
| Inverted fork | Collider/Common effect | A → B ← C | Noncausally |

The complete reversal of in/dependence occasioned by conditioning on the middle vertex may be a bit surprising. There’s a certain reflex that says whenever you want to draw a clean causal story out of messy data, conditioning on more stuff will help you. But as we see here, that’s not generally true^{2}. Conditioning can also introduce spurious correlation.
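We can watch this happen in simulated data. Below is a toy linear Gaussian inverted fork \(A \rightarrow B \leftarrow C\); “conditioning on \(B\)” is approximated by restricting attention to samples where \(B\) is near 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Inverted fork A → B ← C: the causes A and C are independent.
a = rng.normal(size=n)
c = rng.normal(size=n)
b = a + c + 0.1 * rng.normal(size=n)

# Unconditionally, the ends are independent, as in the first table.
unconditional = np.corrcoef(a, c)[0, 1]
print(unconditional)  # ~0

# "Condition on B" by restricting to samples with B near 0. Within that
# slice, a large A must be offset by a small C, so a strong spurious
# (noncausal) correlation appears, as in the second table.
near_zero = np.abs(b) < 0.1
conditional = np.corrcoef(a[near_zero], c[near_zero])[0, 1]
print(conditional)  # strongly negative
```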

Last time we talked about viewing d-separation as a tool for model selection. But we’re pretty limited in the causal models we can distinguish between by only observing our variables of interest—any two graphs with the same set of d-separations are indistinguishable. Instrumental variables are a common tool for trying to get around the limitations of purely observational data.

Instrumental variables

Instrumental variables (IV) are variables that we’re not intrinsically interested in but that we look at in an attempt to suss out causality. The instrument must be correlated with our cause, but its only impact on the effect should be via the cause.

The classic example is about—you guessed it—smoking. Because running an RCT on smoking is ethically verboten, we’re limited to observational data. How can we determine if smoking causes lung cancer from observational data alone? An instrumental variable! To reiterate, we want a factor that affects smoking prevalence but (almost certainly) does not affect lung cancer in other ways. Finding an instrument that satisfies the IV criteria generally seems to require substantial creativity. Can you think of an instrument for the causal effect of smoking on lung cancer?

…

An instrument that meets these criteria is a tax on cigarettes. We expect smoking to decrease as taxes increase, but it seems hard to imagine a cigarette tax otherwise having an effect on lung cancer.

Instrumental variables on causal graphs

Okay, so that’s what IVs are at a high level. But what are they concretely in the graphical causal model setting we’ve been developing?

A brief notational interlude

We’ll get this out of the way here:

\(\perp\!\!\!\perp\) is the symbol for d-separation

Once we add the strikethrough, \(\not\!\!{\perp\!\!\!\perp}\) means d-connected.

If \(G\) is a graph, \(G_{\overline{X}}\) is \(G\) in which all the edges pointing to vertex \(X\) have been removed^{1}.

Defined

We’ll start with the definition and then try to build up a feel for it. An instrumental variable X for the causal effect of Y on Z in graph G must be:

d-connected to our cause Y—\((X \not\!\!{\perp\!\!\!\perp} Y)_G\)

d-separated from our effect Z after severing the cause Y from all its parents—\((X \perp\!\!\!\perp Z)_{G_\overline{Y}}\)
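Because both conditions use an empty conditioning set, we can check them in a DAG with a simple fact: two vertices are d-connected given nothing if and only if they share a common ancestor (counting a vertex as an ancestor of itself). Here is a sketch on a hypothetical version of the smoking graph, with a cigarette tax as the instrument and an unobserved gene confounding smoking and cancer:

```python
def ancestors(parents, node):
    """All ancestors of `node` (including itself) in a DAG given as a
    {child: [parents]} mapping."""
    seen, stack = set(), [node]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, []))
    return seen

def d_connected_empty(parents, x, y):
    """With an empty conditioning set, x and y are d-connected iff they
    share a common ancestor (an active path has no colliders)."""
    return bool(ancestors(parents, x) & ancestors(parents, y))

def sever_incoming(parents, y):
    """The overline operation: the graph with all edges into y removed."""
    g = dict(parents)
    g[y] = []
    return g

# Hypothetical graph: tax → smoking → cancer, confounded by a gene.
g = {"tax": [], "gene": [], "smoking": ["tax", "gene"], "cancer": ["smoking", "gene"]}

# Condition 1: the instrument is d-connected to the cause.
print(d_connected_empty(g, "tax", "smoking"))                            # True
# Condition 2: severing the cause from its parents d-separates tax from cancer.
print(d_connected_empty(sever_incoming(g, "smoking"), "tax", "cancer"))  # False
```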

Last time we found the d-separations that correspond to a graph. This time, we find the graphs that correspond to a set of d-separations. This is more useful because we generally know d-separations and generally don’t know graphs.

Last time we talked about causal graphs, what d-separation and d-connection mean, and how to infer these properties from a causal graph. But this isn’t terribly useful because it requires that we have a fully specified causal graph. If we’re performing research in new or uncertain areas, we have data rather than a causal graph. And this data tells us about d-separations (variables that are independent of each other) and d-connections (variables that are correlated). So our work last time was exactly backwards: graphs to d-separations. This time we’ll go from d-separations to graphs.

Model selection

One way to think about d-separation and d-connection is as helping us with model selection. Last time we presented

as one possible causal model regarding smoking. But it’s not the only possibility. We might also be worried that the true causal structure looks like this (just go with it):

How can we tell them apart? Can we use observational data alone? In this case, observational data alone is enough to distinguish between these two causal models! The key is that the two models have different sets of d-separations. In the original model, all the vertices are d-connected and there are no d-separations (this must be the case since there are no colliders). In the second (silly) model, “smoking” and “lung cancer” are d-separated because “yellow fingers” is a collider between them. If our data show that smoking and lung cancer are independent, we must rule out the first model and prefer the second. If the two variables are correlated, we must rule out the second model and prefer the first.

This is a procedure that works generally:

Draw out the plausible graphical causal models that include all the variables you have data on

Determine the d-separations for each plausible model

Determine the variables in your data that are independent

Retain the models from step 1 whose d-separations in step 2 are compatible with the data analysis in step 3

The ideal is that there’s only one model left at the end of step 4. However, it’s possible to end up with none. This means that step 1 wasn’t permissive enough and more models need to be considered. It’s also possible to end up with more than one model. Not all models are distinguishable by observational data alone. This occurs whenever two models have the same set of d-separations.
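Here is a sketch of this procedure for the two smoking models (the model structures are assumed for illustration; with an empty conditioning set, two vertices are d-separated exactly when they have no common ancestor, counting a vertex as its own ancestor):

```python
def ancestors(parents, node):
    """All ancestors of `node` (including itself) in a DAG given as a
    {child: [parents]} mapping."""
    seen, stack = set(), [node]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, []))
    return seen

def d_separated_empty(parents, x, y):
    """With an empty conditioning set, x and y are d-separated iff they
    share no common ancestor."""
    return not (ancestors(parents, x) & ancestors(parents, y))

# Step 1: plausible models as {child: [parents]} (structures assumed here).
models = {
    "smoking causes cancer": {
        "yellow fingers": ["smoking"],
        "cancer": ["smoking"],
    },
    "yellow fingers is a collider": {
        "yellow fingers": ["smoking", "cancer"],
    },
}

# Steps 2-4: keep the models whose predicted in/dependence of smoking
# and cancer matches what the data show.
data_say_independent = False  # pretend our data show a correlation
survivors = [
    name
    for name, parents in models.items()
    if d_separated_empty(parents, "smoking", "cancer") == data_say_independent
]
print(survivors)  # ["smoking causes cancer"]
```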

can the human brain deal with the complexity to control an extra limb and yield advantages from it? […] Anatomical MRI of the supernumerary finger (SF) revealed that it is actuated by extra muscles and nerves, and fMRI identified a distinct cortical representation of the SF. […] Polydactyly subjects were able to coordinate the SF with their other fingers for more complex movements than five fingered subjects, and so carry out with only one hand tasks normally requiring two hands.

In summary, most of the biggest claims made by Wilkinson and Pickett in The Spirit Level look even weaker today than they did when the book was published. Only one of the six associations stand up under W & P’s own methodology and none of them stand up when the full range of countries is analysed. In the case of life expectancy - the very flagship of The Spirit Level - the statistical association is the opposite of what the hypothesis predicts.

If The Spirit Level hypothesis were correct, it would produce robust and consistent results over time as the underlying data changes. Instead, it seems to be extremely fragile, only working when a very specific set of statistics are applied to a carefully selected list of countries.

The allure of “meta” and “axiomatic first principles” is that it’s kinda like get-rich-quick thinking but for epistemics. Get a few abstractions really right and potentially earn more than you would grinding as an object-level wage slave for decades.

Trying to identify the best policy is different from estimating the precise impact of every individual policy: as long as we can identify the best policy, we do not care about the precise impacts of inferior policies. Yet, despite this, most experiments follow protocols that are designed to figure out the impact of every policy, even the obviously inferior ones.

Cambiaso rode six different horses to help his team win. […] What is noteworthy is that all six horses were clones of the same mare—they’re named Cuartetera 01 through 06. […] “Every scientist that deals with epigenetics told me this would never work,” says Meeker