I'm a statistical researcher in Auckland. This blog is for things that don't fit on my department's blog, StatsChat.
I also tweet as @tslumley

We’ve moved

Nothing’s going to change here, but new entries and a copy of almost everything from here are at netlify 

Graduation

So, I gave one of the graduation addresses during Silly Hat Week last week.  It’s not my usual style of writing, but since there’s no point being embarrassed about a speech you’ve already given in front of your boss, your boss’s boss, and 2000 other people, I thought I’d post it here. 


Chancellor, Vice-Chancellor, Members of Council, fellow Members of the University, Graduands, families and friends: kia ora tātou

It’s five o’clock on a Friday. In the cultural traditions of my people, that’s a time to celebrate the successful end of some hard work. Today we’re celebrating a longer period of harder work than most Fridays, and with two thousand more people than usual.

For some of you, the journey will have been especially challenging: your friends didn’t go to university with you, or you’ve had to juggle family and work and study obligations, or you’re studying in a foreign country and a foreign language. But it won’t have been easy for anyone: your degree is a real achievement to be proud of.

Congratulations, graduands. You did it.

In a few minutes from now, the Chancellor will say the magic words and it will be official: you know Science. I’m sure I speak for my colleagues on the stage when I say we’re all proud to have helped.

Of course, no-one knows everything about science, but you do know more than most people in New Zealand – or in the world.  Part of that is the specific facts about plankton or neutrinos or Sonogashira couplings that you’ve studied. Just as important is what you’ve learned about how science works: how it fits together like a crossword puzzle, and how you can use what we already know about the world to test answers to new questions.

Some of you will go on to make your own contributions to scientific knowledge – in fact, some of you already have. Many of you will use what you’ve learned about science in your future careers.  But I want to talk about a way all of you can use your scientific training to contribute to the community.

Today, scientific research is helping us uncover problems and risks for society and potential solutions to them: climate change, drug policy, road pricing, crime prediction, and many more.  The news media report some of this research, but tend to lose the context it was done in: they don’t distinguish well between solid and speculative ideas or between major and minor risks or between breakthroughs and the usual incremental scientific advances. It’s not that the journalists don’t care, it’s that they don’t have the time and resources to cover each story in the depth it would need — it’s not sports journalism.

But the news – and maybe social media – is how most people get their information about policy issues and the science informing them. And that’s a problem.

In a democracy we want decisions made in a way that reflects the whole range of interests and values across society. It’s hard to do that unless there’s a shared understanding of what the problems and potential solutions are. Either people need to understand the issues themselves or they need to trust someone else’s understanding.  

Helping create this shared understanding is partly a job for those of us on the stage – one of the University’s official roles,  according to the legislation that set it up, is to act as a critic and conscience for society. Since I moved to New Zealand I’ve been spending a lot more time on science communication: blogging, tweeting, talking to journalists and journalism students. Many of my colleagues do the same. But there are things you can do that we can’t.  

As I said at the beginning, you know more about science – both the basic facts, and the ways of answering questions – than most people in New Zealand, or most people in the world. Knowledge is power, and with modest power comes modest responsibility.   Each of you has the skills and training to understand at least some of the scientific issues in our political and social debates. If you have a smartphone, you’ve got instant access to a lot of the world’s scientific knowledge in a way that was unimaginable fifty years ago. And, most importantly, you’ve got people – friends and family – who trust you to share their values and interests, when they wouldn’t trust me.  

I’m not suggesting you go around preaching science to all your friends: that would just get annoying, and you’d end up with fewer friends. But if you keep up the habits of learning from your time at the University, and if you look into issues carefully before making up your mind, it’s going to have an influence on the people around you – and sometimes they’ll ask your opinion.

Now, trying to follow the facts behind social issues can be uncomfortable.  Like all of us, you’re going to sometimes find out that the arguments you’d like to support are wrong, or at least a lot weaker than you’d like. In fact, finding that you were wrong about something is one of the best signs that you’re still doing science right.  

So, I’m going to propose a challenge: at the end of every year when you think about New Year’s resolutions, also ask yourself: “What did I change my mind about this year?”


svylme

I’m working on an R package for mixed models under complex sampling.  It’s here. At the moment, it only tries to fit two-level linear mixed models to two-stage samples – for example, if you sample schools then students within schools, and want a model with school-level random effects. Also, it’s still experimental and not really tested and may very well contain nuts.
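To give a sense of the intended interface, here’s a sketch of the sort of call I mean – the function name svy2lme, and the data and variable names, are purely illustrative of the development version, so don’t treat this as documentation:

library(survey)
library(svylme)   ## development version, from github
## two-stage sample: schools, then students within schools; pw is the overall weight
des <- svydesign(id = ~school + student, weights = ~pw, data = students)
## lme4-style formula with a school-level random intercept
fit <- svy2lme(score ~ ses + (1 | school), design = des)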

The package uses pairwise composite likelihood, because that’s a lot easier to implement efficiently than the other approaches, and because it doesn’t have the problems with nonlinearity and weight scaling. On the downside, there’s some loss in efficiency, especially in situations where both the predictor and response are very strongly correlated within clusters. 

Currently, I’ve only got standard error estimation for the fixed effects. I think I know how to do it for the random effects, but it’s enough of a pain that I don’t want to unless people actually need it.  A resampling approach may turn out to be better. 

Extending this to more levels of model and stages of sampling shouldn’t be that hard – more tedious than difficult – as long as the model levels are nested in the sampling stages. That is, you could have school districts as sampling units above schools, or classrooms as a model level below schools. 

The more-general case where model and sampling units aren’t nested turns out to be straightforward for estimation (though efficient implementations are non-trivial) but rather more complicated for standard error estimation.  It will come, but not right now. 

Generalised linear mixed models are also tricky: for what I’m doing now you don’t need the conditional modes of the random effects, but for efficient computation in GLMMs you do, in order to use adaptive Gaussian quadrature. It should still be feasible, but it may be a while. 

The package relies on lme4 in two ways. First, it uses lmer to get good starting values for optimisation, which is always helpful. Second, it uses Doug Bates’s idea of profiling out the fixed effects parameters and then just feeding the profile loglikelihood to a black-box, derivative-free optimiser. Profiling cuts the number of parameters down a lot, which makes optimisation easier.  Using a derivative-free optimiser simplifies the code a lot, since the derivatives with respect to the variance components get messy fast. 
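As an illustration of the profiling idea (a sketch, not the package’s actual code; make_V and the data objects are hypothetical): for a fixed value of the variance parameters $\theta$, the fixed effects have a closed-form generalised-least-squares solution, so the outer optimiser only ever sees $\theta$.

profile_deviance <- function(theta, X, y, make_V) {
   V <- make_V(theta)                  ## covariance matrix implied by theta
   VinvX <- solve(V, X)
   beta <- solve(crossprod(X, VinvX), crossprod(VinvX, y))   ## GLS fixed effects
   r <- y - X %*% beta
   as.numeric(determinant(V)$modulus) + as.numeric(crossprod(r, solve(V, r)))
}
## hand the profile deviance to a black-box, derivative-free optimiser, eg
## opt <- minqa::bobyqa(theta_start, profile_deviance, X = X, y = y, make_V = make_V)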

Small p hacking

The proposal to change p-value thresholds from 0.05 to 0.005 won’t die. I think it’s targeting the wrong question:  many studies are too weak in various ways to provide the sort of reliable evidence they want to claim, and the choices available in the analysis and publication process eat up too much of that limited information.  If you use p-values to decide what to publish, that’s your problem, and that’s what you need to fix.

One issue that doesn’t get as much attention is how a change would affect the sensitivity of p-values to analysis choices.  When we started doing genome-wide association studies, we noticed that the results were much more sensitive than we had expected. If you changed the inclusion criteria or the adjustment variables in the model, the p-value for the most significant SNPs often changed a lot.  Part of that is just having low power. Part of it is because the Normal tail probability gets small very quickly: small changes in a $Z$-statistic of -5 give you quite large changes in the tail probability. Part of it is because we were working at the edge of where the Normal approximation holds up: analyses that are equivalent to first order can start showing differences.  Overall, as Ken Rice put it “the data are very tired after getting to $5\times 10^{-8}$”, and you can’t expect them to do much additional work.  In our (large) multi-cohort GWAS we saw these problems only at very small p-values, but with smaller datasets you’d see them a lot earlier.
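You can see how fast the Normal tail moves around $|Z|=5$ with two lines of R:

pnorm(-5.0)   ## about 2.9e-7
pnorm(-5.2)   ## about 1.0e-7: a 4% change in Z moves the tail probability by a factor of nearly 3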

None of this is a problem if you have a tightly pre-specified analysis and do simulations to check that it’s going to be ok.  But it did mean we had to be more careful about being too flexible.  Fortunately, the computational load of GWAS (back then) was enough to encourage pre-specified analyses.  But if you’re trying to make a change to general statistical practice in order to mask some of the symptoms of analysis flexibility and publication bias, you do need to worry.

Chebyshev’s inequality and ‘UCL’

Chebyshev’s inequality (or any of the other transliterations of Чебышёв) is a simple bound on the proportion of a distribution that can be far from the mean. The Wikipedia page, on the other hand, isn’t simple. I’m hoping this will be more readable.

We have a random quantity $X$ with mean $\mu$ and variance $\sigma^2$, and – knowing nothing else – we want to say something about the probability that $X-\mu$ is large.  Since we know nothing else, we need to find the largest possible value of $\Pr(|X-\mu|\geq d)$ for any distribution with that mean and variance. 

It’s easier to think about the problem the other way: fix $\Pr(|X-\mu|\geq d)$, and ask how small we can make the variance.  So, how can we change the distribution to reduce the variance without changing $\Pr(|X-\mu|\geq d)$?  For any point further than $d$ away we can move it to distance $d$ away. It still has distance at least $d$, but the variance is smaller.  For any point closer than $d$ we can move it all the way to the mean. It’s still closer than $d$, and the variance is smaller.

So, the worst-case distribution has all its probability at $\mu-d$, $\mu$, or $\mu+d$.  Write $p/2$ for the probability that $X=\mu-d$. This  must be the same as the probability that $X=\mu+d$, or $\mu$ wouldn’t be the mean.  By a straightforward calculation the variance of this worst-case distribution is $\sigma^2=pd^2$.  So, $p=\sigma^2/d^2$ for the worst-case distribution, and $p\leq \sigma^2/d^2$ always.
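Spelling out that straightforward calculation:
$$\mathrm{var}[X]=\frac{p}{2}d^2+\frac{p}{2}d^2+(1-p)\times 0^2=pd^2.$$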

In environmental monitoring there’s an approach to upper confidence limits for a mean based on this: the variance of the mean of $n$ observations is $\sigma^2/n$, and the probability of being more than $d$ units away from the mean is bounded by $\sigma^2/nd^2$.  The problem, where this is used in environmental monitoring, is that you don’t know $\sigma^2$.  You’ve got an estimate based on the data, but this estimate is going to be unreliable in precisely the situations where you’d need a conservative, worst-case interval.  The approach is ok if you’ve taken enough samples to see the bad bits of the pollution distribution, but it’s not very helpful if you don’t know whether you have.
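As a sketch of how the bound gets used (using the two-sided bound above, and pretending $\sigma$ is known): setting $\sigma^2/(nd^2)=\alpha$ and solving for $d$ gives an upper confidence limit of $\bar x+\sigma/\sqrt{n\alpha}$. In practice $\sigma$ gets replaced by the sample standard deviation, which is exactly the weak point.

chebyshev_ucl <- function(xbar, sigma, n, alpha = 0.05) {
   ## d solves sigma^2/(n d^2) = alpha
   xbar + sigma / sqrt(n * alpha)
}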

Why pairwise likelihood?

Xudong Huang and I are working on fitting mixed models using pairwise composite likelihood. JNK Rao and various co-workers have done this in the past, but only for the setting where the structure (clusters, etc) in the sampling is the same as in the model.  That’s not always true. 

The example that made me interested in this was genetic analyses from the Hispanic Community Health Survey.  The survey is a multistage sample: census block groups and then households. The models have random effect structure corresponding to relatedness.  Since there are unrelated people in the same household (eg, spouses) and related people in different households, the sampling and model structures are very different.

What you’d start off trying to do in a traditional design-based approach is to estimate the population loglikelihood.  In a linear mixed model that’s of the form
$$-\frac{1}{2}\log\left|\Xi\right| -\frac{1}{2} (y-\mu(\beta))^T\Xi^{-1}(y-\mu(\beta))$$
for a covariance matrix $\Xi$ and regression parameters $\beta$. 

The way I’ve described the problem previously is “it’s not clear where you stick the weights”. That’s true, but it’s worth going into more detail.  Suppose you know $\Xi$ for the whole population.  You then know the log-determinant term, and the residual term is a pairwise sum. If $R_{ij}$ is the indicator that the pair $(i,\,j)$ is sampled, and $\pi_{ij}$ is the probability, you could use
$$\hat\ell =-\frac{1}{2}\log\left|\Xi\right| -\frac{1}{2} \sum_{i,j} \frac{R_{ij}}{\pi_{ij}}(y-\mu(\beta))_i\left(\Xi^{-1}\right)_{ij}(y-\mu(\beta))_j$$

Typically we don’t know $\Xi$ for the whole population: it can depend on covariates (in a random-slope model), or we might not even know the number of people in non-sampled clusters, or in the genetic example we wouldn’t know the genetic relationships between non-sampled people.  Also, the full population could be quite big, so working out $-\frac{1}{2}\log\left|\Xi\right|$ might be no fun.  The approach might be worth trying for a spatial model, but it’s not going to work for a health survey.

If the model and design structures were the same, we might treat $\Xi$ as block diagonal, with some blocks observed and others not, and hope to easily rescale the sample log determinant to the population one. But in general that will be hopeless, too. 

Pairwise composite likelihood works because we can use a different objective function, one that really is a sum that’s easy to reweight.  If $\ell_{ij}$ is the loglikelihood based on observations $i$ and $j$, then in the population
$$\ell_\textrm{composite} = \sum_{i,j} \ell_{ij}$$
and based on the sample
$$\hat\ell_\textrm{composite} = \sum_{i,j} \frac{R_{ij}}{\pi_{ij}}\ell_{ij}.$$
We now only need to be able to work out $\Xi$ for observed pairs of people, which we can easily do.  Since the summands are honest-to-Fisher loglikelihoods, they have their expected maximum at the true parameter values and estimation works straightforwardly for both $\beta$ and $\Xi$. Variance estimation for $\hat\beta$ doesn’t work so easily: a sandwich estimator has a lot of terms, and proving it’s consistent is fairly delicate. But it is consistent.
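A schematic version of the weighted pairwise objective (illustrative only, not the code we’re actually writing): each term is an ordinary bivariate Normal loglikelihood for one sampled pair, up to an additive constant, weighted by the reciprocal of that pair’s sampling probability.

pairwise_loglik <- function(beta, Xi, y, X, pairs, pi_ij) {
   mu <- X %*% beta
   total <- 0
   for (r in seq_len(nrow(pairs))) {
      i <- pairs[r, 1]; j <- pairs[r, 2]
      S <- Xi[c(i, j), c(i, j)]                ## 2x2 covariance block for the pair
      e <- c(y[i] - mu[i], y[j] - mu[j])
      ll <- -0.5 * as.numeric(determinant(S)$modulus) - 0.5 * sum(e * solve(S, e))
      total <- total + ll / pi_ij[r]
   }
   total
}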

We would do strictly better in terms of asymptotic efficiency by going beyond pairs to triples or fourples or whatever. But even triples would up the computational complexity by a factor of $n$, and require us to have explicit formulas for sixth-order joint sampling probabilities – and it gets exponentially worse for larger tuples. 

Faster generalised linear models in largeish data

There basically isn’t an algorithm for generalised linear models that computes the maximum likelihood estimator in a single pass over the $N$ observations in the data. You need to iterate.  The bigglm function in the biglm package does the iteration using bounded memory, by reading in the data in chunks, and starting again at the beginning for each iteration. That works, but it can be slow, especially if the database server doesn’t communicate that fast with your R process.

There is, however, a way to cheat slightly. If we had a good starting value $\tilde\beta$, we’d only need one iteration – and all the necessary computation for a single iteration can be done in a single database query that returns only a small amount of data.  It’s well known that if $\|\tilde\beta-\beta\|=O_p(N^{-1/2})$, the estimator resulting from one step of Newton–Raphson is fully asymptotically efficient. What’s less well known is that for simple models like glms, we only need $\|\tilde\beta-\beta\|=o_p(N^{-1/4})$.

There’s not usually much advantage in weakening the assumption that way, because in standard asymptotics for well-behaved finite-dimensional parametric models, any reasonable starting estimator will be $\sqrt{N}$-consistent. In the big-data setting, though, there’s a definite advantage: a starting estimator based on a bit more than $N^{1/2}$ observations will have error less than $N^{-1/4}$.  More concretely, if we sample $n=N^{5/9}$ observations and compute the full maximum likelihood estimator, we end up with a starting estimator $\tilde\beta$ satisfying $$\|\tilde\beta-\beta\|=O_p(n^{-1/2})=O_p(N^{-5/18})=o_p(N^{-1/4}).$$

The proof is later, because you don’t want to read it. The basic idea is doing a Taylor series expansion and showing the remainder is $O_p(\|\tilde\beta-\beta\|^2)$, not just $o_p(\|\tilde\beta-\beta\|).$

This approach should be faster than bigglm, because it only needs one and a bit iterations, and because the data stays in the database. It doesn’t scale as far as bigglm, because you need to be able to handle $n$ observations in memory, but with $N$ being a billion, $n$ is only a hundred thousand. 

The database query is fairly straightforward because the efficient score in a generalised linear model is of the form 
$$\sum_{i=1}^N x_iw_i(y_i-\mu_i)$$
for some weights $w_i$. Even better, $w_i=1$ for the most common models. We do need an exponentiation function, which isn’t standard SQL, but is pretty widely supplied. 

So, how well does it work? On my ageing Macbook Air, I did a 1.7-million-record logistic regression to see if red cars are faster. More precisely, using the “passenger car/van” records from the NZ vehicle database, I fit a regression model where the outcome was being red and the predictors were vehicle mass, power, and number of seats. More powerful engines, fewer seats, and lower mass were associated with the vehicle being red. Red cars are faster.

The computation time was 1.4s for the sample+one iteration approach and 15s for bigglm.

Now I’m working on  an analysis of the NYC taxi dataset, which is much bigger and has more interesting variables.  My first model, with 87 million records, was a bit stressful for my laptop. It took nearly half an hour elapsed time for the sample+one-step approach and 41 minutes for bigglm, though bigglm took about three times as long in CPU time.  I’m going to try on my desktop to see how the comparison goes there.  Also, this first try was using the in-process MonetDBLite database, which will make bigglm look good, so I should also try a setting where the data transfer between R and the database actually needs to happen. 

I’ll be talking about this at the JSM and (I hope) at useR.

Math stuff

Suppose we are fitting a generalised linear model with regression parameters $\beta$, outcome $Y$, and predictors $X$.  Let $\beta_0$ be the true value of $\beta$, $U_N(\beta)$ be the score at $\beta$ on $N$ observations and $I_N(\beta)$ the Fisher information at $\beta$ on $N$ observations. Assume the second partial derivatives of the loglikelihood have uniformly bounded second moments on a compact neighbourhood $K$ of $\beta_0$. Let $\Delta_3$ be the tensor of third partial derivatives of the log likelihood, and assume its elements

$$(\Delta_3)_{ijk}=\frac{\partial^3}{\partial \beta_i\,\partial \beta_j\,\partial \beta_k}\log\ell(Y;X,\beta)$$
have uniformly bounded second moments on $K$.

Theorem:  Let $n=N^{\frac{1}{2}+\delta}$ for some $\delta\in (0,1/2]$, and let $\tilde\beta$ be the maximum likelihood estimator of $\beta$ on a subsample of size $n$.  The one-step estimators
$$\hat\beta_{\textrm{full}}= \tilde\beta + I_N(\tilde\beta)^{-1}U_N(\tilde\beta)$$
and
$$\hat\beta= \tilde\beta + \frac{n}{N}I_n(\tilde\beta)^{-1}U_N(\tilde\beta)$$
are first-order efficient.

Proof: The score function at the true parameter value is of the form
$$U_N(\beta_0)=\sum_{i=1}^N x_iw_i(\beta_0)(y_i-\mu_i(\beta_0)).$$
By the mean-value form of Taylor’s theorem we have
$$U_N(\beta_0)=U_N(\tilde\beta)+I_N(\tilde\beta)(\tilde\beta-\beta_0)+\frac{1}{2}\Delta_3(\beta^*)(\tilde\beta-\beta_0,\tilde\beta-\beta_0)$$
where $\beta^*$ is on the interval between $\tilde\beta$ and $\beta_0$. With probability 1, $\tilde\beta$ and thus $\beta^*$ is in $K$ for all sufficiently large $n$, so the remainder term is $O_p(Nn^{-1})=o_p(N^{1/2})$.
Thus
$$I_N^{-1}(\tilde\beta) U_N(\beta_0) = I^{-1}_N(\tilde\beta)U_N(\tilde\beta)+\tilde\beta-\beta_0+o_p(N^{-1/2})$$

Let $\hat\beta_{MLE}$ be the maximum likelihood estimator. It is a standard result that
$$\hat\beta_{MLE}=\beta_0+I_N^{-1}(\beta_0) U_N(\beta_0)+o_p(N^{-1/2})$$

So
$$\begin{eqnarray*}
\hat\beta_{MLE}&=& \tilde\beta+I^{-1}_N(\tilde\beta)U_N(\tilde\beta)+o_p(N^{-1/2})\\
&=& \hat\beta_{\textrm{full}}+o_p(N^{-1/2})
\end{eqnarray*}$$

Now, define $\tilde I(\tilde\beta)=\frac{N}{n}I_n(\tilde\beta)$, the estimated full-sample information based on the subsample, and let ${\cal I}(\tilde\beta)=E_{X,Y}\left[N^{-1}I_N\right]$ be the expected per-observation information.  By the Central Limit Theorem we have  
$$I_N(\tilde\beta)=I_n(\tilde\beta)+(N-n){\cal I}(\tilde\beta)+O_p((N-n)n^{-1/2}),$$
so
$$\left(\frac{N}{n}I_n(\tilde\beta)\right)^{-1}I_N(\tilde\beta)=\mathrm{Id}_p+ O_p(n^{-1/2})$$
where $\mathrm{Id}_p$ is the $p\times p$ identity matrix.
We have
$$\begin{eqnarray*}
\hat\beta-\tilde\beta&=&\left(\frac{N}{n}I_n(\tilde\beta)\right)^{-1}I_N(\tilde\beta)(\hat\beta_{\textrm{full}}-\tilde\beta)\\
&=&\left(\mathrm{Id}_p+ O_p(n^{-1/2})\right)(\hat\beta_{\textrm{full}}-\tilde\beta)\\
&=&(\hat\beta_{\textrm{full}}-\tilde\beta)+ O_p(n^{-1})
\end{eqnarray*}$$
so $\hat\beta$ (without the $\textrm{full}$) is also asymptotically efficient. 

Useful debugging trick

If you have a thing with lots of indices, such as a fourth-order sampling probability $\pi_{ijk\ell}$ (the probability that individuals $i$, $j$, $k$ and $\ell$ are all sampled), there will likely be scenarios where it has lots and lots of symmetries. 

A useful trick is to write a wrapper that checks them:

FourPi <- function(i, j, k, l, EPSILON = 1e-8) {
   answer <- FourPiInternal(i, j, k, l)
   ## swapping the first two indices shouldn't change the joint sampling probability
   sym <- FourPiInternal(j, i, k, l)
   if (abs((answer - sym) / (answer + sym)) > EPSILON)
      stop(paste("asymmetric pi for", i, j, k, l))
   answer
}

Other useful tricks:

  • The score (derivative of loglikelihood) has mean zero at the true parameters under sampling from the model, even in finite samples (see the sketch after this list)
  • Quite a few design-based variance estimators are unbiased for the sampling variance even in small samples. 
  • Many (but not all) variance estimators should be positive semidefinite even in small samples.
  • If you have two estimators of the same thing, do a scatterplot of them or of their estimating functions.
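For example, the first check might look like this for a logistic regression (all names illustrative):

check_score_mean <- function(nsim = 1000, beta = c(1, -2), n = 50) {
   scores <- replicate(nsim, {
      x <- rnorm(n)
      X <- cbind(1, x)
      mu <- plogis(drop(X %*% beta))
      y <- rbinom(n, 1, mu)
      colSums(X * (y - mu))      ## score at the true beta
   })
   rowMeans(scores)              ## should be close to zero
}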

More generally, properties of estimating functions are often easier to check in small samples than properties of the estimators.  This is especially useful when you have an estimator that takes $\Omega\left(M^2N^2\right)$ time for large $N$ and moderate $M$, so you can’t just scale up and use asymptotics.  If the computation time isn’t $O(N)$ or close to it, tests you can do with small samples are enormously useful.

The Ihaka Lectures

They’re back!

This year our theme is visualisation. The lectures will again run on Wednesday evenings in March. The three speakers work in different areas of data visualisation: collect the complete set!

Paul Murrell is an Associate Professor in Statistics here in Auckland. He’s a member of the R Core Development Team, and responsible for a lot of graphics infrastructure in R. The ‘grid’ graphics system grew out of his PhD thesis with Ross Ihaka. 

Di Cook is Professor of Business Analytics at Monash University, and was previously a Professor at Iowa State University. She’s an expert in visualisation for data analysis: using graphics, especially interactive and dynamic graphics, to learn from data. 

Alberto Cairo is the Knight Chair in Visual Journalism at the University of Miami. His focus is on data visualisation for communication, especially with non-specialists.  He led the creation of the Interactive Infographics Department at El Mundo (elmundo.es, Spain) in 2000.

More details to follow.

More tests for survey data

If you know about design-based analysis of survey data, you probably know about the Rao–Scott tests, at least in contingency tables.  The tests started off in the 1980s as “ok, people are going to keep doing Pearson $X^2$ tests on estimated population tables, can we work out how to get $p$-values that aren’t ludicrous?” Subsequently, they turned out to have better operating characteristics than the Wald-type tests that were the obvious thing to do – mostly by accident.  Finally, they’ve been adapted to regression models in general, and reinterpreted as tests in a marginal working model of independent sampling, where they are distinctive in that they weight different directions of departure from the null in a way that doesn’t depend on the sampling design. 

The Rao–Scott test statistics are asymptotically equivalent to $(\hat\beta-\beta_0)^TV_0^{-1}(\hat\beta-\beta_0)$, where $\hat\beta$ is the estimate of $\beta_0$, and $V_0$ is the variance matrix you’d get with full population data. The standard Wald tests are targeting $(\hat\beta-\beta_0)^TV^{-1}(\hat\beta-\beta_0)$, where $V$ is the actual variance matrix of $\hat\beta$.  One reason the Rao–Scott score and likelihood ratio tests work better in small samples is just that score and likelihood ratio tests seem to work better in small samples than Wald tests. But there’s another reason. 

The actual Wald-type test statistic (up to degree-of-freedom adjustments) is $(\hat\beta-\beta_0)^T\hat V^{-1}(\hat\beta-\beta_0)$. In small samples $\hat V$ is often poorly estimated, and in particular its condition number is, on average, larger than the condition number of $V$, so its inverse is wobblier. The Rao–Scott tests obviously can’t avoid this problem completely: $\hat V$ must be involved somewhere. However, they use $\hat V$ only via the eigenvalues of $\hat V_0^{-1}\hat V$; in the original Satterthwaite approximation, only via the mean and variance of these eigenvalues.  In typical survey settings, $V_0$ is fairly well estimated, so inverting it isn’t a problem. The fact that $\hat V$ is more ill-conditioned than $V$ translates into fewer degrees of freedom for the Satterthwaite approximation, and so into a more conservative test.  This conservative bias happens to cancel out a lot of the anticonservative bias and the tests work relatively well.  

Here’s an example of qqplots of $-\log_{10} p$-values simulated  in a Cox model: the Wald test is the top panel and the Rao–Scott LRT is the bottom panel. The clusters are of size 100; the orange tests use the design degrees of freedom minus the number of parameters as the denominator degrees of freedom in an $F$ test. 

[Figure: QQ plots of the simulated $-\log_{10}p$-values described above.]

So, what’s new? SUDAAN has tests they call “Satterthwaite Adjusted Wald Tests”, which are based on $(\hat\beta-\beta_0)^T\hat V_0^{-1} (\hat\beta-\beta_0)$.  I’ve added similar tests to version 3.33 of the survey package (which I hope will be on CRAN soon).  These new tests are (I think) asymptotically locally equivalent to the Rao–Scott LRT and score tests. I’d expect them to be slightly inferior in operating characteristics just based on traditional folklore about score and likelihood ratio tests being better. But you can do the simulations yourself and find out. 

The implementation is in the regTermTest() function, and I’m calling these “working Wald tests” rather than “Satterthwaite adjusted”, because the important difference is the substitution of $V_0$ for $V$, and because I don’t even use the Satterthwaite approximation to the asymptotic distribution by default.
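For example, with the api data that ships with the survey package (the option name below is what I expect the released version to use, so check the help page):

library(survey)
data(api)
dclus2 <- svydesign(id = ~dnum + snum, fpc = ~fpc1 + fpc2, data = apiclus2)
model <- svyglm(api00 ~ ell + meals + mobility, design = dclus2)
regTermTest(model, ~ ell + meals, method = "WorkingWald")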