The title is stolen from Dijkstra.

Papers submitted to medical journals below the top rank are most often reviewed by medical reviewers who have used the Wilcoxon test as a catch-all in their active days. To smuggle a paper though their phalanx, garnish it generously with:

- p-values
- Wilcoxon-tests (t-test: “The authors should test for normality”)
- p-values
- Bland-Altman plots
- p-values
- Area under curve (AUC)
- p-values

Other modernistic stuff will delay the review process: “The authors
should supplement their Bayesian analysis by non-random methods that
provide correct p-values”. There is nothing wrong with Wilcoxon tests
and Bland-Altman plots, and the mentioned **AUC** is on the boundary to
semi-professionalism, so is is worth a look. Anything left I did not
mention?

The AUC (area under the curve) or the AOB (signed area over baseline) is popular because it is easy to calculate: for the case of equidistant points, it is the average over all values multiplied with the record duration. Or, equivalently, the area of a the equivalent rectangle shown as shaded area below, which has a height equal to the mean volume and a width equal to record duration. There are better approximation for the AUC (Gabrielsson (2006), p. 161), but that does not change the validity of the arguments in the following.

- The average over all data points recorded is moderately informative.
Assuming the vertical axis shows a volume, it has units of
*ml*, which*is*easy to understand, but the dependency on the arbitrary record duration is awkward. Should the study result depend on the time the patient or experimenter is willing to sit around and wait? - The record duration is mostly arbitrary. It may differ between records, e.g. when we have to stop recording before target time.

**The AUC is highly biased when used naively; that is, when it is
computed over an arbitrary record duration.**

The AUC *is* very useful when computed up to very large times, when the
volume has gone back to baseline and no further contribution to the area
is to be expected. This is illustrated in the curves below, which are
similar to the PDR curves recorded in ^{13}C breath tests. For
all time series shown, the total area under the curve up to infinity has
been chosen to be the same. The AUC is a measure of bioavailability
(Gabrielsson 2006), the chief reason why in non-compartmental
pharmacodynamics it is the prime parameter of interest. The AUCs
computed from the recorded data range is invalid in the left panel, but
a good approximation in the right panel.

It is possible to compute the AUC up to infinity even when only a
limited data set is available; or, even worse, when the duration of the
records is not the same between test units. First, you have to fit a
model curve to your data that goes back to the reference line at
infinity, and has a finite improper
integral; the first
does not imply the latter. Remember high school calculus: the function
1/x does not have an improper integral - it is infinite - but
1/x^{2} has one; even 1/x^{1.0001} has one.

In analysis of ^{13}C breath test curves (see Bluck (2009) for a
review), the Gamma function from earlier publications does not have an
improper integral, while the exponential beta function has one; some
examples are given here. In analysis of gastric
emptying, the power exponential generally does not have an improper
integral, but the linexp function has one, so
the AUC can be calculated.

Once you fitted a parametric curve to your data, the (**the** only,
real, valid, nice!) AUC can be computed by consulting an integral table
or feeding Wolfram Alpha. Note that a
spline function cannot be used in this case, because the spline is a
spline is a spline and thus flexible to bend to data’s capriciousness,
but it does not know where volume will wander in the infinite future.

Too much to remember for the next meeting with you boss (the “we did all with Wilcox” guy)? Read the executive summary:

- … and someone suggests “compute the AUC”, follow Nancy Reagan and
just
**say no**. It is almost always the wrong approach, even if the tests give wonderful p-values. - … if the someone insists, ask whether using the average over all
measured points instead will do
^{1}. A ‘yes’ answer is acceptable, a ‘no’ means that someone misuses poor AUC for pseudo-scientific blow-off. - … if you have measured the curve long enough until it is back to baseline, say “Sorry, will try”. With the exception of some gastric emptying studies, I have never encountered such a case, because patient and experimenters share the desire to be home in time.
- … if you have fitted a parametric curve, you can often compute the AUC. It might be useful to quantify your study data, but always consider other alternatives, such as peak height or peak position which you also can extract from the fit. And don’t try half a dozen parameters extracted from a fitted curve to pick “the significant one”; I wish more people would disappear into Nirwana when hiking the garden of forking paths of Borges and Gelman.

1: Use a weighted mean for non-equidistant points

One feature of AUC is its robustness; the form of the curve does not play a role, a curve with two peaks can have the same AUC as one dropping down montonously, even if both curves tell different stories. Robustness can be a virtue, but can also hide interesting effects; or, to tell your boss: you may loose power by reducing curve details to an area.

When you find a good model function for your interesting alternatives, you are back in power.

Bluck, L J C. 2009. “Recent Advances in the Interpretation of the (13)C
Octanoate Breath Test for Gastric Emptying.” *J Breath Res* 3 (3):
034002. .

Gabrielsson, Dan, Johan ; Weiner. 2006. *Pharmacokinetic &
Pharmacodynamic Data Analysis : Concepts and Applications*. 4. ed., rev.
and expanded. PK/Pd Data Analysis: Concepts and Applications. Stockholm:
Swedish Pharmaceutical Press.