# Biased and Inefficient

## Taking meta-analysis heterogeneity seriously

In fixed-effects meta-analysis of a set of trials the goal is to find a weighted average of the true treatment effects in those trials (whatever they might be). The results are summarised by the weighted average and a confidence interval reflecting its sampling uncertainty.

In random-effects meta-analysis the trials are modelled as an exchangeable sample, implying that they can be treated as coming independently from some latent distribution of true treatment effects. That’s attractive in some situations. What doesn’t make sense to me is summarising the results just by the mean of this latent distribution and a confidence interval for that mean.

That is, the model for individual study estimates $\Delta_i$ is
$$\Delta_i\sim N(\mu_i,\sigma^2_i)$$
$$\mu_i\sim N(\mu_0, \tau^2)$$
and we usually report a confidence interval for $\mu_0.$

If you take seriously the idea of modelling heterogeneity in the true treatment effect, a confidence interval for the mean isn’t enough. In order to make decisions you need a prediction interval for the the true treatment effect in a new population that might include you.

The difference between these intervals can be pretty large. Today, I saw a paper (open-access) in the new Nature Scientific Reports journal, a meta-analysis of observational studies of vitamin C and lung cancer.  Their Table 3 presents study-specific estimates and a random-effects meta-analysis for the risk ratio per extra 100mg/day vitamin C.

The point estimate is 0.93 and the confidence interval is 0.88-0.98, but the $I^2$ heterogeneity statistic is 75%.  That is, the heterogeneity in the estimates is about three times the sampling uncertainty.  Putting the data into my rmeta package in R I can reproduce their output (apart from their summary $p$-value, which I think must be a typo), and I get an estimate of $\tau=0.23$.

Combining that with the mean, the simple heterogeneity model says that the true effect on the relative risk scale of an extra 100mg/day vitamin C varies enormously depending on context, with 95% limits from 0.58 to 1.47.  The true effect is beneficial in 62% of trials and harmful in 48%. This is without adding in the sampling uncertainty, which would expand the bounds slightly for a true prediction interval.

If we take the heterogeneity model seriously, this meta-analysis is telling us we have almost no clue about the effect of vitamin C on lung cancer in a new population that wasn’t one of the studies that went into the analysis.  Averaging over all populations, vitamin C is estimated to be slightly beneficial, but in your population we can’t tell.  Since the data are all observational and are visibly inconsistent, that’s not terribly surprising, and is most likely due to different confounding patterns.

I think reporting suitable upper and lower quantiles of the latent treatment effect distribution in addition to a confidence interval for the mean would be an improvement for random-effects meta-analysis.  In particular, it would help with the ‘how much is too much’ question about $I^2$, since a highly-heterogeneous set of studies would always end up with a wide treatment effect distribution.

It would be even better to report confidence intervals for the upper and lower quantiles, but that would take a little theoretical work, and the simple solution is probably good enough.

## A brief observation on shills

By and large, it isn’t cost-effective to bribe academics to say things they don’t believe.

Academics who are sufficiently well-known to be worth bribing tend to be (ipso facto) both moderately affluent, and willing to have given up chances for greater wealth in exchange for freedom to do basically whatever they are interested in. The sensible strategy is to find academics who already believe what you want said, and to buy them bigger megaphones.  Over time, cognitive dissonance may warp their views to be even better aligned with yours, but that’s just a bonus.

Given that, it may be sensible to downweight the level of support for a position when it comes from sponsored research, but it is irrational to say an argument should be dismissed just because it comes from someone you view as a shill.

## Upcoming talks

"Statistics in the media: <sigh>". Café Scientifique, Horse and Trap pub, Mount Eden, Auckland. August 27, evening.

Short course: “Elements of R”, aka the Ken and Thomas show, University of Lausanne, September 3-5.

Short course: Complex sampling and epidemiologic designs. University of Milan, September 9-10.

"Two million t-tests: lessons from GWAS", Australasian Epidemiology Association, Auckland, October 9.

"Measure Everything on Everyone? Substudies and Sampling in CVD
Cohorts”  The Kronmal Symposium: Lessons Learned from Cardiovascular Epidemiology. November 24, University of Washington, Seattle.

## Survey package update

There’s a new version, 3.30-3, of the ‘survey’ package for R.  It’s got quite a lot of new stuff:

• AIC and BIC for generalised linear models
• Rank tests for more than two groups
• Logrank and generalised logrank tests

Since I’m known for a lack of enthusiasm about any of these techniques, why are they in the package? Am I just enabling?

Well, AIC and BIC are interesting, and I’ll say more below. For all these techniques, though, the lack of versions that account for the sampling design hasn’t stopped people using them. They’re probably even worse off using versions that assume independent sampling than versions that handle weights and clustering. If we’re going to give the Kruskal-Wallis test the unmarked grave it so richly deserves, we clearly aren’t going to do it by restricting implementations.  Instead, we need to teach people what it really does, and show them that isn’t what they want.

AIC turns out to be relatively straightforward: Jon Rao and Alastair Scott worked out the sampling distribution of the likelihood ratio back when I was learning how to solve simultaneous equations. The expected value of the log likelihood ratio when the smaller model fits as well as the larger model is the trace of a matrix. That matrix would be the identity under iid sampling and correctly-specified models, giving the $p$ penalty in AIC; under complex sampling the matrix isn’t the identity, but is easily estimated.

BIC is harder, because the likelihood can’t just be a convenience. However, if all the models you consider can be nested in some Big Ugly Model and obtained by setting specific parameters to specific values in that Big Ugly Model, we can use the (approximate) Gaussian likelihood of the parameter estimates in the Big Ugly Model.  Each submodel also has a Gaussian likelihood, and we can work out the honest, traditional, entirely Bayesian BIC for those Gaussian models.  The result is a penalised Wald statistic, with a penalty that is the number of parametris times a sort of effective log sample size

I should also note that the generalised logrank tests are the work of Kevin Rader and Stuart Lipsitz, who presented them at the Joint Statistical Meetings in San Diego two years ago.  They were on my to-do list, but the Harvard folks got to them first, and I just implemented what they presented.

## Structure and interpretation of unspeakable tongues

As a simple programming/probability exercise, I fitted a second-order Markov chain to "Structure and Interpretation of Computer Programs" and the complete works of H.P. Lovecraft. Adding the King James Bible tends to improve the output, but even in NZ that seems inappropriate for a class example.

Here’s some selected but unedited output: it always finishes at a full stop and starts on the word after one.

Turing (1912-1954), whose 1936 paper laid the foundations of computing, but rather of a machine for each pair (i, j) that passes through the weed-choked fissure to that moment, I crouched on the floor in front of my strange new impulses seemed subtly connected with if and have to write programs as large-scale data-base-management systems.

This has important implications, both theoretical and practical, because we must distinguish different types as being defined in terms of opprobrium.

I reached for the special symbol that can be implemented as an element of dy, which will be the result of its midst, and before he could - he was unknown in Cairo, to know terror lives.

I turned again to the horror of its still greater was he thinking? To what extent could the expert test of the or expression, and we are given points a and b are equal? if they had a curiously fragmentary, intangible, and heterogeneous army of invasion.

Then, bracing himself, he added, had important business elsewhere, and which involved other French settlers of two cubes in two registers with those who inferred from the data base satisfying a specified location.

Unify-match is basically the right branch, a set of needed and modified by the generic arithmetic operations, we use assoc to see the rescue made by a sleeper is indeed the best course would involve many additional hazards.

Information Retrieval Logic programming excels in incidents and sensations of tense and heated discussion; and before the recursive definition which translates readily into the flat head and end with only general sense, memory, and the mockery of the defenders, and afterward the physician, unable to sing.

I could see that it signals an error if the evaluation rule for adding “greats” to a complex number, such as “If x is 1 and the nameless scent was now shaking with their wives testify the bewilderment which their far too common Knowledge to be delayed, we must ensure that our programs at a distance from the all-pervading and almost dreaded the exertion, for his family and Dr.

De Marigny recalled the old room of my youth about the matter of duty, though in reverse order.

Binger saw me and bade me mark that all the other hand, even toy parsers can be extended to include some germ-tainted crocodile meat as food for the body of a century.

## Feynman and the Suck Fairy

There’s been a bit of…discussion…about Richard Feynman recently. In one Twitter discussion, Richard Easther said he had been thinking of using Feynman’s commencement address "Cargo Cult Science" with a first-year physics class, and had decided against.

I was a bit surprised. It’s been a long time since I read that piece, but I couldn’t remember anything objectionable in it. So I re-read it.  It’s still really good in a lot of ways. But. Yeah. That.

The problems are more obvious in Feynman’s book of autobiographical anecdotes "Surely You’re Joking, Mr Feynman". I read that when it came out, and loved it. I was in high school at the time. It wasn’t that I loved the casual sexism; I just didn’t really notice. I was interested in the science bits of the book, and I didn’t care what he did in bars in Buffalo. It’s harder to re-read it now without wincing.

If you find yourself feeling defensive about Feynman, you might like to read Jo Walton on the Suck Fairy in literature. The Suck Fairy goes through books you used to love, and edits them to make them suck when you read them again. The whole point of the name, of course, is that this isn’t true; the book always did suck and you just didn’t notice when you were younger.  The book you remember is still good, it just isn’t real.

The "Surely You’re Joking.." I remember, and its author and hero, were great. There’s no need to deny that. The problem is, they weren’t the actual book and the actual physicist. The actual person was a genius, but not really a role model.

The imaginary Feynman, the one who wrote the imaginary book I remember reading, would have wanted us to be honest about the faults of the actual Feynman.

## Unacceptable Use

The East-West Center at the University of Hawai’i has one of the most bizarre Internet Acceptable Use Policies I have ever seen. Among other things, it “strictly prohibits”

The distribution, dissemination, storage, copying and/or sale of materials protected by copyright, trademark, trade secret or other intellectual-property right laws.

That bans making this post, downloading the slides for the Summer Institute in Statistical Genetics that I want to revise, installing new R packages, and probably even reading email (unless it was sent by someone who is a US government employee as part of their work).  You might think there’s an implied “except with permission of copyright owner”, except that in lots of other places in the policy they make that sort of exception explicit. Also, the policy goes on to say that this includes, but is not limited to, illegal copying and distribution.

More practically annoying is the fact that they block ssh and scp, so I can’t upload the files for my course tomorrow. Maybe I’ll go for a walk and see if I can get an eduroam connection.

## Herd immunity simulations

Especially for vaccines that are not 100% effective, a large chunk of the benefit comes from ‘herd immunity’, the fact that incomplete vaccination makes it harder for an epidemic to get started and spread. Increasing the proportion of people vaccinated helps those people, and it also helps the people who aren’t vaccinated.

Here’s a set of simulations (code, needs FNN package and R) that show the effect. There is a simulated population of 10,000 people living on a square (actually, a doughnut, since it wraps around).  Vaccinated people are green, unvaccinated are black.

Each day there is a 1/30 chance of a new disease case arriving.

If you are near a disease case you have a chance of being infected (red) which is lower, but still not zero, if you are vaccinated. The disease lasts four days and then you are immune (blue). Everyone moves slowly around, except that each day 1% of people get on a plane and move a long way.

With 10% vaccinated there is no herd immunity. As soon as an epidemic gets going, it spreads everywhere.

With 50% vaccinated, the epidemics still spread, but more slowly, and there’s a lower chance of starting one when a case arrives.

With 70% vaccinated, the epidemics burn out before covering the population

With 90% vaccinated, the epidemics don’t even get started.

## Monotonicity and smoothness

Andrew Gelman has an interesting discussion of monotonicity as a modelling constraint.  I basically agree with what he says, but since my first real statistical research (my M.Sc. thesis) was on order restrictions I thought I’d write about a related aspect of the problem.

Assuming that a relationship is monotone sounds like a very strong assumption, and therefore one that you’d expect to gain a lot by making. Asymptotically, this isn’t true.  If the relationship between $X$ and $Y$ is only known to be monotone, you get $E[Y|X=x]$ estimated to $O_p(n^{-1/3})$ where $X$ has non-zero density. By assuming smoothness you can get $O_p(n^{-2/5})$, which is better. That is, if you have a lot of data and you know a relationship is smooth, you don’t gain anything by knowing it is monotone, but if you know it is monotone you do gain by knowing it is smooth.

I think that’s non-intuitive, and I think the reason it’s non-intuitive is the asymptotics. If you have relatively sparse data, knowing that the relationship is monotone is fairly powerful, but knowing it is smooth is pretty useless. If you have very dense data, knowing a priori the relationship is smooth is useful, but knowing a priori that it is monotone is not all that helpful, since it will be fairly obvious whether it’s monotone or not.

## Anchoring bias

Anchoring bias: high school students asked to add up the digits in their phone number and to estimate how many countries there are in Africa.

(phew, it worked)

(I did delete one data point as non-responsive: estimated number of countries in Africa was 1)

(with adults I’d use last two digits of phone number, but with teenage girls I thought a bit more information-hiding was appropriate)