# Biased and Inefficient

## My likelihood depends on your frequency properties


The likelihood principle states that given two hypotheses $H_0$ and $H_1$ and data $X$, all the evidence regarding which hypothesis is true is contained in the likelihood ratio
$$LR=\frac{P[X|H_1]}{P[X|H_0]}.$$

One of the fundamentals of scientific research is the idea of scientific publication, which allows other researchers to form their own conclusions based on your results and those of others. The data available to other researchers, and thus the likelihood on which they rely for inference, depends on your publication behaviour. In practice, and even in principle, publication behaviour for one hypothesis does depend on evidence you obtained for other hypotheses under study, so likelihood-based inference by other researchers depends on the operating characteristics of your inference.

Consider an idealised situation of two scientists, Alice and Bob (who are on sabbatical from the cryptography literature). Alice spends her life collecting, analysing, and reporting on data $X$ that are samples of size $n$ from $N_p(\mu, I)$ distributions, in order to make inference about $\mu$. Bob is also interested in $\mu$ but doesn’t have the budget to collect his own $N_p(\mu,I)$ data. He assesses the evidence for various values of $\mu$ by reading the papers of Alice and other researchers and using their reported statistics $Y$. In the future, he might be able to get their raw data easily, but not yet.

Alice and Bob primarily care about $\mu_1$ which is obviously much more interesting than $\left\{\mu_i\right\}_{i=2}^p$, and more likely to be meaningfully far from zero, but they have some interest in the others. Alice bases her likelihood inference on the multivariate Normal distributions $f_X(X|\mu_i)$, Bob bases his on $f_Y(Y|\mu_i)$.

Compare Alice and Bob’s likelihood functions for the hypotheses $\mu_i=0$ and $\mu_i=\delta$ with $\delta$ meaningfully greater than $0$ in the following scenarios. In all of them, Alice collects data on $\mu_1$ and reports the likelihood ratio for $\mu_1=0$ versus $\mu_1=\delta$.

1. Alice collects only data on $\mu_1$ and reports the likelihood ratio for $\mu_1=0$ versus $\mu_1=\delta$.
2. Alice also collects data on $\mu_2$ and reports whether she finds strong evidence for $\mu_2=\delta$ over $\mu_2=0$ or not.
3. Alice also collects data on $\mu_2\ldots\mu_q$ for some $q\leq p$. If she finds evidence worth mentioning in favour of $\mu_i=\delta$, she publishes her likelihood ratio, otherwise she reports that there wasn’t enough evidence.
4. Alice also collects data on $\mu_2\ldots\mu_q$ for some $q\leq p$. If she finds sufficient evidence for $\mu_i=\delta$ for any $i>1$ she reports the likelihood ratios for all $\mu_i$, otherwise only for $\mu_1$.

Alice’s likelihood ratio is the same in all scenarios. She obtains for each $i$

$$\frac{L_1}{L_0}=\frac{L(\mu_i=\delta)}{L(\mu_i=0)}=\frac{\exp(-n(\bar X_i-\delta)^2/2)}{\exp(-n\bar X_i^2/2)}.$$

Because she has been properly trained in decision theory, her beliefs and her decisions about future research for any $\mu_i$ depend only on $\bar X_i$: not on $q$, not on the other $\bar X_j$, and not on how she decided what to publish.
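As a quick numerical sanity check, the ratio can be evaluated directly (a minimal sketch; the function name and the values plugged in are illustrative):

```python
import math

def alice_lr(xbar, delta, n):
    """Likelihood ratio for mu_i = delta vs mu_i = 0, based on the sample
    mean xbar of n observations from N(mu_i, 1), so xbar ~ N(mu_i, 1/n)."""
    log_l1 = -n * (xbar - delta) ** 2 / 2  # log-likelihood at mu_i = delta, up to a constant
    log_l0 = -n * xbar ** 2 / 2            # log-likelihood at mu_i = 0, same constant
    return math.exp(log_l1 - log_l0)

# The ratio favours delta when xbar is near delta and 0 when xbar is near 0,
# and it depends on the data only through xbar.
print(alice_lr(0.5, 0.5, 10))  # greater than 1
print(alice_lr(0.0, 0.5, 10))  # less than 1
```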

Bob’s likelihood ratio for $\mu_1$ is always the same as Alice’s. For the other parameters, things are more complicated.

1. no other parameters
2. Bob’s data is Alice’s result, $Y_2=1$ for finding strong evidence, $Y_2=0$ for not. His likelihoods are $L_1=(1-\beta)^{Y_2}\beta^{1-Y_2}$ and $L_0=\alpha^{Y_2}(1-\alpha)^{1-Y_2}$, where $\alpha$ is the probability Alice finds strong evidence for $\mu_2=\delta$ when $\mu_2=0$ is true and $\beta$ is the probability Alice fails to find strong evidence for $\mu_2=\delta$ when $\mu_2=\delta$ is true.
3. Bob has a censored Normal likelihood, which depends on $\alpha$ and $\beta$. If he ignores this and just uses Alice’s likelihood ratio when it’s available, he will inevitably end up believing $\mu_i=\delta$ for all $i>1$, regardless of the truth.
4. Bob’s likelihood ratio for the other $\mu_i$ depends on $\alpha$, $\beta$, $q$ and on the values of $\mu_j$ for $j\neq i$.
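Bob’s scenario-2 likelihood ratio is simple enough to write down in code (a sketch; the values of $\alpha$ and $\beta$ below are purely illustrative):

```python
def bob_lr(y, alpha, beta):
    """Bob's likelihood ratio for mu_2 = delta vs mu_2 = 0 in scenario 2.
    y is Alice's report (1 = strong evidence, 0 = not); alpha and beta are
    her probabilities of a false positive and a false negative report."""
    l1 = (1 - beta) if y == 1 else beta    # P[Y_2 = y | mu_2 = delta]
    l0 = alpha if y == 1 else (1 - alpha)  # P[Y_2 = y | mu_2 = 0]
    return l1 / l0

# With alpha = 0.05 and beta = 0.2, a positive report carries a likelihood
# ratio of about 16, but a negative report is only weak evidence for mu_2 = 0.
print(bob_lr(1, 0.05, 0.2))  # about 16
print(bob_lr(0, 0.05, 0.2))  # about 0.21
```

Note that the same report $Y_2$ gives Bob a different likelihood ratio if Alice’s $\alpha$ or $\beta$ changes: his inference depends on her operating characteristics even though hers does not.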

In scenarios 2-4, Bob’s likelihood depends on Alice’s criterion for strength of evidence and on how likely she is to satisfy it — if Alice were a frequentist, we’d call $\alpha$ and $\beta$ her Type I and Type II error rates. But it’s not a problem of misuse of $p$-values. Alice doesn’t use $p$-values. She would never touch a $p$-value without heavy gloves. She doesn’t even like being in the same room as a $p$-value.
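The inevitability claimed for scenario 3 is easy to see by simulation (a sketch with made-up numbers: the truth is $\mu_i=0$, and Alice publishes only likelihood ratios that favour $\delta$ by some margin):

```python
import math
import random

random.seed(1)
n, delta = 25, 0.5
threshold = 8.0  # made-up cutoff for 'evidence worth mentioning'

published_log_lr = []
for _ in range(20000):
    xbar = random.gauss(0, 1 / math.sqrt(n))  # truth: mu_i = 0, so xbar ~ N(0, 1/n)
    log_lr = -n * (xbar - delta) ** 2 / 2 + n * xbar ** 2 / 2
    if log_lr > math.log(threshold):          # publish only favourable ratios
        published_log_lr.append(log_lr)

# Bob only ever sees ratios favouring delta, so naively multiplying them
# accumulates evidence against the true value mu_i = 0.
print(len(published_log_lr), min(published_log_lr), sum(published_log_lr))
```

Every published ratio exceeds the cutoff by construction, so Bob’s naive running product grows without bound even though $\mu_i=0$ is true; a correct analysis would use the censored likelihood, which requires knowing $\alpha$ and $\beta$.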

In scenario 4, Bob also needs to know $q$ in order to interpret papers that do not include results for $i>1$ — he needs to know Alice’s family-wise power and Type I error rate. That’s actually not quite true: if Bob knows Alice is following this rule he can ignore her papers that don’t contain all the likelihood ratios, since he does know $q$ for the ones that do.  His likelihood for $i>1$ still depends on $\alpha$, $\beta$, $q$, and the other $\mu$s.

At least, if nearly all Alice’s papers report results for all the $\mu$s, Bob knows that the bias from just using Alice’s likelihood ratio when available will be small and he may be able to get by without all the detail and complication.

This isn’t quite the same as publication bias, though it’s related. At least if $q$ is given and we know Alice’s criteria,  she always publishes information about every analysis that would be sufficient for likelihood inference not only about $\mu_i=0$ vs $\mu_i=\delta$, but even for point and interval estimation of $\mu_i$. Alice isn’t being evil here. She’s not hiding negative results; they just aren’t that interesting.

Of course, the problem would go away if Alice published, say, posterior distributions or point and interval estimates for all $\mu_i$, at least if $p$ isn’t so large that the complete set could be sensitive.

tl;dr:  If I can’t get your data or at least (approximately) sufficient statistics, my conclusions may depend on details of your analysis and decision making that don’t affect your conclusions. And if you ever just report "was/wasn’t significant," Bob will hunt you down and make you regret it.

## Chemical nerdview

One of Stephen J. Gould’s essays contains the admission

> I confess that I have always been greatly amused by the term primate, used in its ecclesiastical sense as “an archbishop … holding the first place among the bishops of a province.” My merriment must be shared by all zoologists, for primates, to us, are monkeys and apes—members of the order Primates.

But this amusement is silly, parochial, and misguided.

Gould points out that the clerics had the term first, and that the etymology, from the Latin primas, first, is just as appropriate (actually, more appropriate) in the ecclesiastical usage as in the zoological one.

Chemists tend to find the concept of ‘organic farming’ amusing, because they think of the term ‘organic’ as meaning ‘containing carbon atoms’, not as indicating derivation from living things.  The chemists are even more at fault than the zoologists, because this sense of organic is not only older than the ‘carbon atoms’ sense, but actually comes from chemistry, as does the semi-mystical distinction between synthesis in a lab and in a living cell.

Before 1828, there was a clear division in chemistry between ordinary compounds that could be made in the lab by ordinary chemical procedures and organic chemicals that were produced by living creatures. When Friedrich Wöhler first made the organic compound urea from inorganic ammonium cyanate he virtually pissed himself in shock, writing to Berzelius

> I cannot, so to say, hold my chemical water and must tell you that I can make urea without thereby needing to have kidneys, or anyhow, an animal, be it human or dog.

Even then, the vitalist idea that organic compounds were special remained for a couple of decades, until Kolbe synthesized acetic acid step by step from precursors that were undeniably non-organic (in the old sense).

Chemists then had to adapt their terminology. Since nearly all of the compounds produced only by living cells contained carbon atoms, and vice versa, they could just make a slight shift in definition. This occasionally caused problems for small molecules without much carbon — John D. Clark’s book Ignition! mentions some confusion over organic vs non-organic nomenclature of rocket fuel compounds — but worked well enough that many people seem to have forgotten the shift occurred.

The chemists got over vitalism; cells do just work by the same rules of chemistry and physics as test-tubes. Ascorbic acid from a Chinese chemical company really does fail to prevent cancer just as well as natural vitamin C. If complex mixtures of carotenoids turn out to have health benefits not provided by pure beta-carotene, these benefits will just depend on the actual molecules present, not on their origin.

Vitalism in chemistry was wrong, but the vitalist sense of ‘organic’ is older and was just as rooted in chemistry as the ‘carbon atoms’ sense. There are better things to sneer at Whole Foods about.

## WNAR Biometrics short course

I’m giving the one-day short course at the WNAR Biometrics meeting in Honolulu this year. It’s an introduction to large-scale genetic association studies. Here’s the blurb. You’ll soon be able to register for the conference, probably at a link here.

Introduction to large-scale genetic association studies

It is now feasible to measure hundreds of thousands of genetic variants on enough individuals for population-based epidemiology, and is becoming feasible to do the same for all or part of the complete DNA sequence.  This course will cover the statistics and some of the computing needed to analyse large-scale genetic data on large numbers of unrelated individuals.  Apart from a few brief mentions, it will not cover family-based studies or organisms other than humans, and will assume raw measurements have already been turned into genotypes.

Since 2008, Thomas has been a member of the Analysis Committee of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), which has done quite a lot of genome-wide SNP analysis and, more recently, DNA sequencing, in large cardiovascular cohort studies.

Outline.

1. In which we meet DNA and learn how it is measured

DNA is very stable and is the same in essentially all cells in the body, so blood samples taken in the 1990s can easily be used to study disease and biology today (unlike RNA, protein, methylation, etc.). How DNA varies, and how these variants are measured. What goes wrong with the measurements.

2. In which we perform simple tests many times

The basic genome-wide association study consists of millions of simple linear or logistic regressions.  The multiplicity leads to the resurrection of statistical issues that had previously been dismissed or settled.

3. In which we go to the library

Annotation is the process of working out that your association between TLA levels and genotype at  rs5551212 is uninteresting because that variant is in the TLA kinase gene and we already know about it.  Or, occasionally, not.

4. In which we may not be broken down by age and sex, but do discriminate based on ancestry

Confounding works differently in genetics, because your genome gets fixed before you are born.  How model selection for main effects and interactions works in this context.  How confounding by ancestry can cause problems, and what can be done about it.

5. In which we make up data

You have measured a million SNPs, but that’s a tiny fraction of all the known ones. How we impute unmeasured SNPs.

6. In which we have friends

Genetic effects tend to be really, really small. A single cohort study typically isn’t big enough to be useful, so we need to work together in larger consortia.  Combining results without sharing individual data is important. So is playing nicely with the other children.

7.  In which we do not take one thing at a time

In principle, a meaningful genetic difference could involve multiple SNPs in a complicated non-additive way. How might we tell?

Also, when looking at very rare genetic variants there’s no real point studying them individually. How can we group them to increase power?

8. In which we worry about the future

If there is still time and energy, a brief lecture on some things we might have in the future, such as molecular haplotyping and reliable functional annotation.

## This is a wug. Now you have two of them.

Three words that used to be plurals, and are changing in three different ways.

Candelabra used to be the plural of candelabrum, a multiple-armed candlestick holder. There are very few other English words ending in ‘brum’, and most of the words ending in ‘bra’ are singular (e.g. vertebra, penumbra, cobra, zebra, sabra). Over time, candelabra has been used more and more often as the singular, perhaps most famously in the biographical movie “Behind the Candelabra” about Liberace; the corresponding plural is candelabras. This change presumably started as a simple error; some authorities would say it still is.

Agenda used to be the plural of agendum, from the Latin “which ought to be done”. People would write lists of things which ought to be done, and write “Agenda” at the top of the lists rather than “To do”, because they were pretentious and British.  Over time, agenda became the name of the list itself, rather than the items, and specifically the name of a list of things to be done at a meeting; the corresponding plural is agendas. There’s no error involved, just metonymy.

Data used to be the plural of datum, from the Latin “was given”. In statistics, you usually need a set of data, and the numbers are only really meaningful in context, not on their own. The new use has been to treat data as a mass noun, like information, something for which singular and plural are not relevant. There’s no error involved; there is a change of meaning.

Mass nouns take the same verb forms as singular nouns, so underinformed pedant wannabes sometimes claim data is being treated as singular. The simplest way to see that this claim is wrong is that data has no plural. No-one regards datas as correct English, although new singular count nouns very reliably form plurals with ‘s’.

## At risk of vanishing

A degree in science, in addition to specific facts about squid, neutrinos, or palladium-catalysed cross-couplings, should teach students what to do with questions about the world. In particular, they should learn to think about what the implications would be of each answer to the question, and know how we might use these implications to rule out some of the answers and reduce our uncertainty about others.

A degree in the humanities, in addition to specific facts about tenses in French, resource-allocation procedures in village societies, or the development of the Sangam literature, should teach students what to do with questions about the world. In particular, they should learn to think about what questions should be asked on a particular topic, the different ways these could be answered, and whose interests are served by systems that promote one question or answer over another.

Of course, there’s some overlap in all disciplines, and a lot in some disciplines (such as, say, linguistics, statistics, geography, sociology), but I think a lot of academics would recognise these divisions, for good or bad.

So, when faced with a claim that by adding extra tertiary places in science and engineering

> The country could lose “an informed and thoughtful citizenry which understands the history and cultures of a diverse nation and supports social and economic innovation and international engagement”.

we’d hope that someone with science training would ask if there’s any empirical support for the idea that people with science degrees are less informed and thoughtful, or less supportive of social and economic innovation and international engagement. We’d also hope that they would have some idea how empirical support or refutation could be generated if it wasn’t available.

We’d  hope that someone with humanities training might wonder who was making this claim, and why the media thought it was the most important issue about current funding allocations to NZ universities, and which personal interests and historical power structures this choice of issue tends to assume and reinforce.

Hey, I didn’t start this. And it’s not hard to find an argument for increased humanities funding that I’d support. But this wasn’t it.

## Moving the goalposts?

There’s a paper in PNAS suggesting that lots of published scientific associations are likely to be false, and that Bayesian considerations imply a p-value threshold of 0.005 instead of 0.05 would be good. It’s had an impact outside the statistical world, e.g., with a post on the blog Ars Technica. The motivation for the PNAS paper is a statistics paper showing how to relate p-values to Bayes Factors in some tests.

Some people have asked me what I think. So.

1. I much prefer the other way (non-paywalled tech report) to get classical p-values as part of an optimal Bayesian decision, because it’s based on estimation rather than on identifying arbitrary alternatives, and it seems to correspond better to what my scientific colleagues are trying to do with p-values. Ok, and because Ken is a friend.

2. The PNAS paper starts off by talking about reproducibility in terms of scientific fraud and slides into talking about publishing results that don’t meet the proposed new $p<0.005$ threshold. I’m not exaggerating: here’s the complete first paragraph:

> Reproducibility of scientific research is critical to the scientific endeavor, so the apparent lack of reproducibility threatens the credibility of the scientific enterprise (e.g., refs. 1 and 2). Unfortunately, concern over the nonreproducibility of scientific studies has become so pervasive that a Web site, Retraction Watch, has been established to monitor the large number of retracted papers, and methodology for detecting flawed studies has developed nearly into a scientific discipline of its own.

That’s not a rhetorical device I’m happy with, to put it mildly.

3. If you don’t use p-value thresholds as a publishing criterion, the change won’t have any impact. And if you think p-value thresholds should be a publishing criterion, you’ve got worse problems than reproducibility.

4. False negatives are errors, too. People already report “there was no association between X and Y” (or worse, “there was no effect of X on Y”) in subgroups where the p-value is greater than 0.05. If you have the same data and decrease the false positives you have to increase the false negatives.

5. The problem isn’t the threshold so much as the really weak data in a lot of research, especially small-sample experimental research [large-sample observational research has different problems].  Larger sample sizes or better experimental designs would actually reduce the error rate; moving the threshold only swaps which kind of error you make.

6. Standards are valuable in scientific writing, but only to the extent that they reduce communication costs. That applies to statistical terminology as much as it applies to structured abstracts. Changing standards imposes substantial costs and is only worth it if there are substantial benefits.

7. And finally, why is it a disaster that a single study doesn’t always reach the correct answer? Why would any reasonable person expect it to? It’s not as if we have to ignore everything except the results of that one experiment in making any decisions.
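To put a number on point 4: a study designed for 80% power at $p<0.05$ has a true effect of about 2.8 standard errors, and a simple normal approximation (a sketch; the critical values are the usual two-sided ones) shows what moving the threshold to 0.005 does to power with the same data:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(effect_in_se, z_crit):
    """Approximate power of a two-sided z-test for a true effect of
    effect_in_se standard errors, ignoring rejections in the wrong direction."""
    return normal_cdf(effect_in_se - z_crit)

print(power(2.8, 1.96))   # about 0.80 at the 0.05 threshold
print(power(2.8, 2.807))  # about 0.50 at the 0.005 threshold
```

For this design the false negative rate goes from roughly 20% to roughly 50%: the errors don’t disappear, they just change kind.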

## A diversity of gifts, but the same spirit

Peter Green used this line (from I Corinthians) for his Royal Statistical Society Presidential Address in 2003, which anyone interested in the future of statistics should read. I’ve been planning to steal it ever since then, and the time seems right.

Roger, Jeff, and Rafa at Simply Statistics are holding an unconference on the future of statistics, some time before dawn tomorrow morning New Zealand time. I probably won’t be attending, but if you’re in a more compatible time zone it promises to be interesting. It’s also sparked some Twitter chatter on the future of statistics. As you’d expect given the promoters, the chatter has focused on the importance of computation and applications and argued that theory is overrated. To some extent I agree, but I’m writing this in defense of methodological pluralism.

I think it’s unquestionably true that the academic statistical community overvalues mathematical formalism. It can be easier to publish sterile generalisations or pointless complications of mathematical statistics than useful simulation studies or high-quality applications. Much of the community has not really caught up with the fact that computation is thousands of times cheaper than it was three decades ago, and this has real implications for the best ways to solve problems. My colleague Alastair Scott (Chicago, 1965) tells the story of joining a discussion with other members of his generation about the most important advances in statistics over their careers. He suggested computing, which had not been brought up by anyone else and was received with some surprise.

On the other hand, some of the discussion reminds me of non-statisticians finding that, say, Andrew Gelman or Don Rubin is more knowledgeable and sensible than whoever taught them statistics in QMETH 101 and taking this as strong evidence for Bayesian statistics over frequentist statistics. It’s certainly true that, say, Hadley Wickham or Roger Peng’s research is of more benefit to humankind than the median piece of asymptotic statistics. But that’s mostly because they are really smart and hardworking. If everyone learned lots of statistical computing and graphics and less theory it wouldn’t turn everyone into Roger or Hadley. It would mean that useless papers on Edgeworth expansions of the overgeneralized beta distribution were replaced by useless simulations or ungeneralisable data analyses or pointless graphs. Beating Sturgeon’s Law just isn’t that easy — as any journal editor can tell you.

The point of mathematical statistics is that it tells you how to simplify problems that are too hard to think about heuristically. That’s only a minority of scientific problems, but it’s an important minority.  A huge amount of cognitive effort has gone into developing mathematical tools for thinking about inference, and these tools are valuable today. I’ve given some examples in past posts, and lots of people could give you others.

So, what would I recommend? Diversity. The heretics are right that we shouldn’t have all PhD programs teaching two semesters from TSH and TPE (Lehmann’s Testing Statistical Hypotheses and Theory of Point Estimation); but some programs should. Some of them should focus on computation and algorithms. Some should teach more modern theory from van der Vaart instead. Some, like Santa Cruz and Duke, should focus on decision theory and Bayesian methods. Perhaps some should just concentrate on particular areas of application.

Statistics needs Savage Bayesians and moderate borrowing-strength Bayesians; we need applied statisticians who know the difference between DNA and RNA and probabilists who know the difference between $\ell_2$ and $L_2(P)$; we need Big Data and randomised experiments; and we need mathematical statisticians who understand why asymptotic approximations are useful.

I spent a lot of time and effort on statistical computing before it became popular, against the advice of my seniors. I understand the attraction in elevating Chambers and Friedman and casting down Cramér and Kolmogorov. I can see the poetic justice in mathematical statistics becoming a peripheral subject in graduate programs. But I think it would be a terrible waste.

## Interaction: ‘real’ and statistical

Confounding is a model-independent property of nature: if doing A has a particular effect on Y, it is objectively either true or untrue that the conditional distributions of Y given A and not A match that particular effect.

Interaction or effect modification is scale-dependent: you ask “is the effect of A on Y in the presence of B the same as the effect of A on Y in the absence of B?” This requires reducing “the effect” to a single number or other low-dimensional summary. If A and B both have an effect on Y there must be summaries that show an interaction — the effect can’t be both exactly additive and exactly multiplicative, for example — so interaction is intrinsically more statistical and model-based than confounding.
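A toy risk table (all numbers invented, constructed so the effects are exactly additive) shows why at most one scale can be interaction-free:

```python
# Hypothetical risks of the outcome by exposure status for A and B.
risk = {(0, 0): 0.10, (1, 0): 0.20, (0, 1): 0.25, (1, 1): 0.35}

# Additive scale: effect of A measured as a risk difference.
rd_without_b = risk[(1, 0)] - risk[(0, 0)]  # 0.10
rd_with_b = risk[(1, 1)] - risk[(0, 1)]     # 0.10: no additive interaction

# Multiplicative scale: effect of A measured as a risk ratio.
rr_without_b = risk[(1, 0)] / risk[(0, 0)]  # 2.0
rr_with_b = risk[(1, 1)] / risk[(0, 1)]     # 1.4: multiplicative interaction

print(rd_without_b, rd_with_b)  # the same (up to rounding)
print(rr_without_b, rr_with_b)  # different
```

The same four risks show no interaction on the difference scale and clear interaction on the ratio scale; nothing about the data decides which summary is the “real” one.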

Scientists often dismiss mere ‘statistical interaction’ and say they are interested in ‘real’ interaction. As they should be. But it’s not that simple.

Two real-world examples show that even when everything is known there may not be a good answer to whether there is “really” interaction or “really” effect modification.

1. Antifolate antibiotics.  Folate is essential for cell growth. It acts as a co-enzyme, taking part in reactions and then being recycled. There are two classes of antibiotic that act on folate: the sulfonamides prevent bacteria from synthesizing folate, and trimethoprim and its relatives prevent folate from being recycled after use.

Do these drugs interact?

- A biochemist says ‘No’: they inhibit completely different enzymes and have no effect on each other.
- A microbiologist says ‘Yes’: blocking availability of folate in two ways allows bacteria to be killed (in a Petri dish) with much lower doses of the two drugs when they are combined.
- A clinician says ‘Kinda, but not really’: because of different absorption and distribution in the body, the two drugs don’t really act synergistically. They are sometimes given together, but mostly to avoid resistance.

2. Hib vaccination.  This one is even simpler.

In Australia before the Haemophilus influenzae type B (Hib) vaccine, the Hib meningitis rate was 4.5/100000/year in indigenous communities and 1.7/100000/year in the rest of the population.

After the vaccine was introduced, the rate was 0.5/100000/year in indigenous communities and 0.1/100000/year elsewhere.

Did the vaccination increase or decrease the disparity in meningitis risk? It depends how you measure: the relative risk is higher, the risk difference is lower.
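The arithmetic, using the rates quoted above (cases per 100,000 per year):

```python
# Hib meningitis rates per 100,000 per year, as quoted in the text.
before = {"indigenous": 4.5, "other": 1.7}
after = {"indigenous": 0.5, "other": 0.1}

def relative_risk(rates):
    return rates["indigenous"] / rates["other"]

def risk_difference(rates):
    return rates["indigenous"] - rates["other"]

print(relative_risk(before), relative_risk(after))      # about 2.6 -> 5.0: relative disparity up
print(risk_difference(before), risk_difference(after))  # about 2.8 -> 0.4: absolute disparity down
```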

In both cases there is ambiguity, but in neither case are there any facts whose addition would settle the question.

## Barren proxies

In causal inference it is often the case that you can’t obtain a confounding variable directly, you can only measure something that it affects.  Judea Pearl correctly points out the danger of conditioning on a ‘barren proxy’ for a confounder, in situations like this one:

A confounds the effect of B on C. D is affected by A but does not directly affect either B or C, so it is a ‘barren proxy’ for A.

It’s easy to see that conditioning on D will not, in general, remove confounding by A. The problem, as so often with causal graphs, is where to draw the boundaries. Examined sufficiently closely, almost every variable in statistics is a barren proxy.

Suppose A is average particulate air pollution dose for people in a city and D is measured particulate air pollution concentration.  The standard measurement technique is to force air through a filter and trap the pollution particles, which are then weighed. The particles that end up on the filter cannot have any effect on health; measured air pollution is a barren proxy for exposure.

Suppose A is blood glucose concentration. A blood drop is removed and fed into a testing device. The glucose in that drop of blood doesn’t participate in future chemical reactions in the body; measured glucose is a barren proxy.

Suppose A is whether or not you have had a heart attack, and D is the conclusion from expert examination of your medical records. A subsequent examination of the medical records can’t possibly have an impact on whether your heart muscle cells actually died during the event; diagnosis of heart attack is a barren proxy.

As a simple matter of fact, essentially all measured variables in medicine are barren proxies. That’s not the important distinction, though. What actually matters is whether there are good causal reasons that the relationship between the true confounder and the measured variable is close, not just observationally but under intervention.  That is, we should care whether D is a reliable measurement of A, not whether it is a barren proxy for A. Unfortunately, the criteria for being a reliable measurement are not simple and qualitative.