Eliciting probability distributions from quantiles

We often have intuitions about the probability distribution of a variable that we would like to translate into a formal specification of a distribution. Transforming our beliefs into a fully specified probability distribution allows us to further manipulate the distribution in useful ways.

For example, you believe that the cost of a medication is a positive number that’s about 10, but with a long right tail: say, a 10% probability of being more than 100. To use this cost estimate in a Monte Carlo simulation, you need to know exactly what distribution to plug in. Or perhaps you have a prior about the effect of creatine on cognitive performance, and you want to formally update that prior using Bayes’ rule when a new study comes out. Or you want to make a forecast about a candidate’s share of the vote and evaluate the accuracy of your forecast using a scoring rule.

In most software, you have to specify a distribution by its parameters, but these parameters are rarely intuitive. The normal distribution’s mean and standard deviation are somewhat intuitive, but this is the exception rather than the rule. The lognormal’s mu and sigma correspond to the mean and standard deviation of the variable’s logarithm, something I personally have no intuitions about. And I don’t expect good results if you ask someone to supply a beta distribution’s alpha and beta shape parameters.
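To illustrate how unintuitive these parametrizations are, here is a small sketch in SciPy (parameter names are SciPy's): mu and sigma describe the variable's logarithm, and only the median has a direct reading as exp(mu).

```python
import math
from scipy import stats

# A lognormal whose *logarithm* has mean mu=2 and standard deviation
# sigma=0.5. In SciPy's parametrization: lognorm(s=sigma, scale=exp(mu)).
mu, sigma = 2, 0.5
d = stats.lognorm(s=sigma, scale=math.exp(mu))

# The median is exp(mu); mu alone says little about the variable itself
# until you exponentiate.
print(d.median())  # ~7.39
```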

I have built a tool that creates a probability distribution (of a given family) from user-supplied quantiles, sometimes also called percentiles. Quantiles are points on the cumulative distribution function: \((p,x)\) pairs such that \(P(X<x)=p\). To illustrate what quantiles are, we can look at the example distribution below, which has a 50th percentile (or median) of -1 and a 90th percentile of 10.
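For instance, here is one distribution (a normal, chosen for illustration; it is not the only one) passing through those two quantiles, checked with SciPy:

```python
from scipy import stats

# ppf, the inverse of the CDF, maps a probability p back to the value x
# such that P(X < x) = p.
z90 = stats.norm.ppf(0.9)  # ~1.2816
d = stats.norm(loc=-1, scale=(10 - (-1)) / z90)

print(d.ppf(0.5))  # -1.0, the median
print(d.ppf(0.9))  # 10.0, the 90th percentile
```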


A cumulative distribution function with a median of -1 and a 90th percentile of 10

The code is on GitHub. To use the tool, you have a few options.

Let’s run through some examples of how you can use this tool. At the end, I will discuss how it compares to other probability elicitation software, and why I think it’s a valuable addition.

Traditional distributions

The tool supports the normal and lognormal distributions, and more of the usual distribution families could easily be added. The user supplies the distribution family, along with an arbitrary number of quantiles. If more quantiles are provided than the distribution has parameters (more than two in this case), the system is over-determined. The tool then uses least squares to find the best fit.

This is some example input:

family = 'lognormal'
quantiles = [(0.1,50),(0.5,70),(0.75,100),(0.9,150)]

And the corresponding output:

More than two quantiles provided, using least squares fit

Lognormal distribution
mu 4.313122980928514
sigma 0.409687416531683

quantiles:
0.01 28.79055927521217
0.1 44.17183774344628
0.25 56.64439363937313
0.5 74.67332855521319
0.75 98.44056294458953
0.9 126.2366766332274
0.99 193.67827989071688
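To give an idea of how such a fit can be set up (this is my own sketch, not the tool's code), an over-determined lognormal fit reduces to linear least squares in log space; the tool evidently uses a somewhat different objective, since its mu and sigma above differ slightly:

```python
import numpy as np
from scipy import stats

quantiles = [(0.1, 50), (0.5, 70), (0.75, 100), (0.9, 150)]
ps, xs = zip(*quantiles)
zs = stats.norm.ppf(ps)  # standard normal quantiles z_p

# For a lognormal, each quantile satisfies ln(x_p) = mu + sigma * z_p,
# so four quantiles and two parameters form an over-determined linear
# system, solved here by least squares.
sigma, mu = np.polyfit(zs, np.log(xs), 1)
print(mu, sigma)  # roughly 4.37 and 0.42
```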

Metalog distribution1

The feature I am most excited about, however, is the support for a new type of distribution developed specifically for the purposes of flexible elicitation from quantiles, called the meta-logistic distribution. It was first described in Keelin 2016, which puts it at the cutting edge compared to the venerable normal distribution invented by Gauss and Laplace around 1810. The meta-logistic, or metalog for short, does not use traditional parameters. Instead, it can take on as many terms as the user provides quantiles, and adopts whatever shape is needed to fit these quantiles very closely. Closed-form expressions exist for its quantile function (the inverse of the CDF) and for its PDF. This leads to attractive computational properties (see footnote)2.

Keelin explains that

[t]he metalog distributions provide a convenient way to translate CDF data into smooth, continuous, closed-form distribution functions that can be used for real-time feedback to experts about the implications of their probability assessments.

The metalog quantile function is derived by modifying the logistic quantile function,

\[\mu + s \ln{\frac{y}{1-y}} \quad\text{ for } 0 < y < 1\]

by letting \(\mu\) and \(s\) depend on \(y\) instead of being constant.

As Keelin writes, a systematically increasing \(s\) as one moves from left to right produces a right-skewed distribution, and a systematically decreasing \(\mu\) as one moves from left to right makes the distribution spikier in the middle, with correspondingly heavier tails.

By modifying \(s\) and \(\mu\) in ever more complex ways we can make the metalog take on almost any shape. In particular, in most cases the metalog CDF passes through all the provided quantiles exactly3. Moreover, we can specify the metalog to be unbounded, to have arbitrary bounds, or to be semi-bounded above or below.

Instead of thinking about which of several highly constraining distribution families to use, just choose the metalog and let your quantiles speak for themselves. As Keelin says:

one needs a distribution that has flexibility far beyond that of traditional distributions – one that enables “the data to speak for itself” in contrast to imposing unexamined and possibly inappropriate shape constraints on that data.

For example, we can fit an unbounded metalog to the same quantiles as above:

family = 'metalog'
quantiles = [(0.1,50),(0.5,70),(0.75,100),(0.9,150)]
metalog_leftbound = None
metalog_rightbound = None
Meta-logistic distribution

quantiles:
0.01 11.968367580205552
0.1 50.000000000008185
0.25 58.750000000005215
0.5 70.0
0.75 100.00000000000519
0.9 150.00000000002515
0.99 281.7443263650518
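These quantile values can in fact be reproduced with a small linear solve. Below is a sketch of a four-term metalog fit following the equations in Keelin 2016 (my own illustration; the tool itself calls the R metalog package):

```python
import numpy as np

quantiles = [(0.1, 50), (0.5, 70), (0.75, 100), (0.9, 150)]
ys = np.array([p for p, _ in quantiles])
xs = np.array([x for _, x in quantiles])

# Four-term metalog basis (Keelin 2016, eq. 6), with L = ln(y/(1-y)):
# M(y) = a1 + a2*L + a3*(y - 0.5)*L + a4*(y - 0.5)
L = np.log(ys / (1 - ys))
Y = np.column_stack([np.ones_like(ys), L, (ys - 0.5) * L, ys - 0.5])

# As many terms as quantiles: the fit is an exact linear solve.
a = np.linalg.solve(Y, xs)

def metalog_quantile(y):
    L = np.log(y / (1 - y))
    return a[0] + a[1] * L + a[2] * (y - 0.5) * L + a[3] * (y - 0.5)

print(metalog_quantile(0.25))  # ~58.75, matching the output above
```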

The metalog’s actual parameters (as opposed to the user-supplied quantiles) have no simple interpretation and are of no use unless the next piece of software you’re going to use knows what a metalog is. Therefore the program doesn’t return the parameters. Instead, if we want to manipulate this distribution, we can use the expressions of the PDF and CDF that the software provides4, or alternatively export a large number of samples into another tool that accepts distributions described by a list of samples (such as the Monte Carlo simulation tool Guesstimate). By default, 5000 samples will be printed; you can copy and paste them.
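Generating those samples is cheap precisely because the quantile function is in closed form: uniform draws are pushed through it (the inverse transform method). A sketch using the plain logistic quantile function, the base case that the metalog generalizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inverse transform sampling: uniform draws on (0,1) pushed through the
# closed-form logistic quantile function mu + s*ln(y/(1-y)).
mu, s = 0.0, 1.0
y = rng.uniform(size=100_000)
samples = mu + s * np.log(y / (1 - y))

print(np.median(samples))  # close to mu = 0
```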

Approaches to elicitation

How does this tool compare to other approaches for creating subjective belief distributions? Here are the strategies I’ve seen.

Belief intervals

The first approach is to provide a belief interval that is mapped to some fixed quantiles, e.g. a 90% belief interval (between the 0.05 and 0.95 quantiles) as on Guesstimate. Metaculus provides a graphical way to input the same data, allowing the user to drag the quantiles across a line under a graph of the PDF. This is the simplest and most user-friendly approach. The tool I built incorporates the belief interval approach while going beyond it in two ways. First, you can provide completely arbitrary quantiles, instead of specifically the 0.05 and 0.95 – or some other belief interval symmetric around 0.5. Second, you can provide more than two quantiles, which lets the user pin down intuitions about more parts of the distribution.


Guesstimate


Metaculus

Drawing

Another option is to draw the PDF on a canvas, in free form, using your mouse. This is the very innovative approach of probability.dev.5


probability.dev

Ought’s Elicit

Ought’s Elicit lets you provide quantiles like my tool does, or equivalently bins with some probability mass in each bin6. The resulting distribution is by default piecewise uniform (the CDF is piecewise linear), but it’s possible to apply smoothing. It has all the features I want; the drawback is that it only supports bounded distributions7.


Elicit

Mixtures

A meta-level approach that can be applied to any of the above is to allow the user to specify a mixture distribution, a weighted average of distributions. For example, 1/3 weight on a normal(5,5) and 2/3 weight on a lognormal(1,0.75). My opinion on mixtures is that they are good if the user is thinking about the event disjunctively; for example, she may be envisioning two possible scenarios, each of which she has a distribution in mind for. But on Metaculus and Foretold my impression is that mixtures are often used to indirectly achieve a single distribution whose rough shape the user had in mind originally.
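Mechanically, a mixture is easy to sample from: pick a component according to the weights, then draw from that component. A sketch with the example weights above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 1/3 weight on a normal(5,5), 2/3 on a lognormal(1,0.75)
# (in SciPy notation: lognorm(s=0.75, scale=exp(1))).
components = [stats.norm(5, 5), stats.lognorm(s=0.75, scale=np.exp(1))]
weights = [1/3, 2/3]

def sample_mixture(n):
    # choose a component for each draw according to the weights,
    # then keep that draw from the chosen component
    idx = rng.choice(len(components), size=n, p=weights)
    draws = np.column_stack([c.rvs(size=n, random_state=rng) for c in components])
    return draws[np.arange(n), idx]

samples = sample_mixture(100_000)
```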

The future

This is an exciting space with many recent developments. Guesstimate, Metaculus, Elicit and the metalog distribution have all been created in the last 5 years.

  1. Regrettably, I had to use R for this part, making my program an unholy mixture of Python and R (with rpy2). Beyond the inelegance of it, this adds a delay of multiple seconds the first time a metalog is created, to load the R session from within Python. Subsequent runs are not slowed down noticeably.

    There actually exist two Python ports of the R package (pymetalog and metalog), but I could make neither of them work with a reasonable level of effort for this project. pymetalog, with which I spent the most time, gives an answer that implies a decreasing CDF when using qmetalog on some ‘peaky’ inputs. The packages don’t appear to be maintained. 

  2. For the quantile function expression, see Keelin 2016, definition 1. The fact that this is in closed form means, first, that sampling randomly from the distribution is computationally trivial. We can use the inverse transform method: we take random samples from a uniform distribution over \([0,1]\) and plug them into the quantile function. Second, plotting the CDF for a certain range of probabilities (e.g. from 1% to 99%) is also easy.

    The expression for the PDF is unusual in that it is a function of the cumulative probability \(p \in (0,1)\), instead of a function of values of the random variable. See Keelin 2016, definition 2. As Keelin explains (p. 254), to plot the PDF as is customary we can use the quantile function \(q(p)\) on the horizontal axis and the PDF expression \(f(p)\) on the vertical axis, and vary \(p\) in, for example, \([0.01,0.99]\) to produce the corresponding values on both axes.

    Hence, for (i) querying the quantile function of the fitted metalog, sampling, and plotting the CDF, and (ii) plotting the PDF, everything can be done in closed form.

    To query the CDF, however, numerical equation solving is applied. Since the quantile function is differentiable, Newton’s method can be applied and is fast. (Numerical equation solving is also used to query the PDF as a function of values of the random variable – but I don’t see why one would need densities except for plotting.) 

  3. In most cases, there exists a metalog whose CDF passes through all the provided quantiles exactly. In that case, there exists an expression of the metalog parameters that is in closed form as a function of the quantiles (“\(a = Y^{−1}x\)”, Keelin 2016, p. 253. Keelin denotes the metalog parameters \(a\), the matrix \(Y\) is a simple function of the quantiles’ y-coordinates, and the vector \(x\) contains the quantiles’ x-coordinates. The metalog parameters \(a\) are the numbers that are used to modify the logistic quantile function. This modification is done according to equation 6 on p. 254.)

    If there is no metalog that fits the quantiles exactly (i.e. the expression for \(a\) above does not imply a valid probability distribution), we have to use optimization to find the feasible metalog that fits the quantiles most closely. In this software implementation, “most closely” is defined as minimizing the absolute differences between the quantiles and the CDF (see here for more discussion).

    In my experience, if a small number of quantiles describing a PDF with sharp peaks are provided, the closest feasible metalog fit to the quantiles may not pass through all the quantiles exactly. 

  4. You can use my script’s Python function objects. Simply look for r_dmetalog_func (the PDF) and r_pmetalog_func (the CDF). These are currently rpy2 function objects which take R vectors as parameters, but you can easily wrap them to turn them into ordinary Python functions. 

  5. Drawing the PDF instead of the CDF makes it difficult to hit quantiles. But drawing the CDF would probably be less intuitive – I often have the rough shape of the PDF in mind, but I never have intuitions about the rough shape of the CDF. The canvas-based approach also runs into difficulty with the tail of unbounded distributions. Overall I think it’s very cool but I haven’t found it that practical. 

  6. To provide quantiles, simply leave the Min field empty – it defaults to the left bound of the distribution. 

  7. I suspect this is a fundamental problem of the approach of starting with piecewise uniforms and adding smoothing. You need the tails of the CDF to asymptote towards 0 and 1, but it’s hard to find a mathematical function that does this while also (i) having the right probability mass under the tail (ii) stitching onto the piecewise uniforms in a natural way. I’d love to be proven wrong, though; the user interface and user experience on Elicit are really nice. (I’m aware that Elicit allows for ‘open-ended’ distributions, where probability mass can be assigned to an out-of-bounds outcome, but one cannot specify how that mass is distributed inside the out-of-bounds interval(s). So there is no true support for unbounded distributions. The ‘out-of-bounds’ feature exists because Elicit seems to be mainly intended as an add-on to Metaculus, which supports such ‘open-ended’ distributions but no truly unbounded ones.) 

August 28, 2020

Debugging surprising behavior in SciPy numerical integration

I wrote a Python app to apply Bayes’ rule to continuous distributions. It looks like this:

Screenshot

I’m learning a lot about numerical analysis from this project. The basic idea is simple:

from scipy import integrate
import numpy as np

# prior and likelihood are continuous distributions from scipy.stats
def unnormalized_posterior_pdf(x):
    return prior.pdf(x)*likelihood.pdf(x)

# integrate unnormalized_posterior_pdf over the reals
normalization_constant = integrate.quad(unnormalized_posterior_pdf,-np.inf,np.inf)[0]

def posterior_pdf(x):
    return unnormalized_posterior_pdf(x)/normalization_constant

However, when testing my code on complicated distributions, I ran into some interesting puzzles.

A first set of problems was caused by the SciPy numerical integration routines that my program relies on. They were sometimes returning incorrect results or RuntimeErrors. These problems appeared when the integration routines had to deal with ‘extreme’ values: small normalization constants or large inputs into the cdf function. I eventually learned to hold the integration algorithm’s hand a little bit and show it where to go.

A second set of challenges had to do with how long my program took to run: sometimes 30 seconds to return the percentiles of the posterior distribution. While 30 seconds might be acceptable for someone who desperately needed that Bayesian update, I didn’t want my tool to feel like a punch card mainframe. I eventually managed to make the program more than 10 times faster. The tricks I used all followed the same strategy: to make repeated numerical evaluation of the posterior’s cdf less expensive, find ways to narrow the interval being integrated.

You can follow along with all the tests described in this post using this file, whereas the code doing the calculations for the webapp is here.

Small normalization constants


When the prior and likelihood are far apart, the unnormalized posterior takes tiny values.

It turns out that SciPy’s integration routine, integrate.quad, (incidentally, written in actual Fortran!) has trouble integrating such a low-valued pdf.

import math
import numpy as np
from scipy import stats, integrate

prior = stats.lognorm(s=.5,scale=math.exp(.5)) # a lognormal(.5,.5) in SciPy notation
likelihood = stats.norm(20,1)

class Posterior_scipyrv(stats.rv_continuous):
    def __init__(self,d1,d2):
        super(Posterior_scipyrv, self).__init__()
        self.d1= d1
        self.d2= d2

        self.normalization_constant = integrate.quad(self.unnormalized_pdf,-np.inf,np.inf)[0]

    def unnormalized_pdf(self,x):
        return self.d1.pdf(x) * self.d2.pdf(x)

    def _pdf(self,x):
        return self.unnormalized_pdf(x)/self.normalization_constant

posterior = Posterior_scipyrv(prior,likelihood)

print('normalization constant:',posterior.normalization_constant)
print("CDF values:")
for i in range(30):
    print(i,posterior.cdf(i))

The cdf converges to… 52,477. This is not so good.

Because the cdf does converge, but to an incorrect value, we can conclude that the normalization constant is to blame. Because the cdf converges to a number greater than 1, posterior.normalization_constant, about 3e-12, is an underestimate of the true value.

If we shift the likelihood distribution just a little bit to the left, to likelihood = stats.norm(18,1), the cdf converges correctly, and we get a normalization constant of about 6e-07. Obviously, the normalization constant should not jump five orders of magnitude from 6e-07 to 3e-12 as a result of this small shift.

The program is not integrating the unnormalized pdf correctly.

Difficulties with integration usually have to do with the shape of the function. If your integrand zig-zags up and down a lot, the algorithm may miss some of the peaks. But here, the shape of the posterior is almost the same whether we use stats.norm(18,1) or stats.norm(20,1)1. So the problem really seems to occur once we are far enough in the tails of the prior that the unnormalized posterior pdf takes values below a certain absolute (rather than relative) threshold. I don’t yet understand why. Perhaps some of the values are becoming too small to be represented with standard floating point numbers.
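The underflow conjecture is easy to check in isolation (whether it is exactly what trips up quad here is a separate question):

```python
from scipy import stats

# Far enough into the tails, pdf values underflow to exactly 0.0 in
# double precision: exp(-800) is smaller than the smallest positive float.
print(stats.norm(0, 1).pdf(40))     # 0.0
print(stats.norm(0, 1).logpdf(40))  # about -800.9, still representable
```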

This seems rather bizarre, but here’s a piece of evidence that really demonstrates that low absolute values are what’s tripping up the integration routine that calculates the normalization constant. We just multiply the unnormalized pdf by 10000 (which will cancel out once we normalize).

def unnormalized_pdf(self,x):
    return 10000*self.d1.pdf(x) * self.d2.pdf(x)

Now the cdf converges to 1 perfectly (??!).

Large inputs into cdf

We take a prior and likelihood that are unproblematically close together:

prior = stats.lognorm(s=.5,scale=math.exp(.5)) # a lognormal(.5,.5) in SciPy notation
likelihood = stats.norm(5,1)
posterior = Posterior_scipyrv(prior,likelihood)

for i in range(100):
    print(i,posterior.cdf(i))

At first, the cdf goes to 1 as expected, but suddenly all hell breaks loose and the cdf decreases to some very tiny values:

22 1.0000000000031484
23 1.0000000000095246
24 1.0000000000031442
25 2.4520867144186445e-09
26 2.7186998869943613e-12
27 1.1495658559228458e-15

What’s going on? When asked to integrate the pdf from minus infinity up to some large value like 25, quad doesn’t know where to look for the probability mass. When the upper bound of the integral is in an area with still enough probability mass, like 23 or 24 in this example, quad finds its way to the mass. But if you ask it to find a peak very far away, it fails.

A piece of confirmatory evidence is that if we make the peak spikier and harder to find, by setting the likelihood’s standard deviation to 0.5 instead of 1, the cdf fails earlier:

22 1.000000000000232
23 2.9116983489798973e-12

We need to hold the integration algorithm’s hand and show it where on the real line the peak of the distribution is located. In SciPy’s quad, you can supply the points argument to point out places ‘where local difficulties of the integrand may occur’, but only when the integration interval is finite. The solution I came up with is to split the interval into two halves.

def split_integral(f, splitpoint, integrate_to):
    a = -np.inf
    if integrate_to < splitpoint:
        # just return the integral normally
        return integrate.quad(f, a, integrate_to)[0]
    else:
        integral_left = integrate.quad(f, a, splitpoint)[0]
        integral_right = integrate.quad(f, splitpoint, integrate_to)[0]
        return integral_left + integral_right

This definitely won’t work for every difficult integral, but should help for many cases where most of the probability mass is not too far from the splitpoint.
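A quick self-contained demonstration (restating the function so the snippet runs on its own): a pdf whose mass sits around x = 5, integrated up to a distant bound. Splitting at the peak means each call to quad starts right at the mass.

```python
import numpy as np
from scipy import stats, integrate

def split_integral(f, splitpoint, integrate_to):
    # integrate f from -inf to integrate_to, splitting at splitpoint
    if integrate_to < splitpoint:
        return integrate.quad(f, -np.inf, integrate_to)[0]
    left = integrate.quad(f, -np.inf, splitpoint)[0]
    right = integrate.quad(f, splitpoint, integrate_to)[0]
    return left + right

pdf = stats.norm(5, 1).pdf
print(split_integral(pdf, 5, 500))  # ~1.0
```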

For splitpoint, a simple choice is the average of the prior and likelihood’s expected values.

class Posterior_scipyrv(stats.rv_continuous):
    def __init__(self,d1,d2):
        self.splitpoint = (d1.expect() + d2.expect()) / 2

We can now override the built-in cdf method, and specify our own method that uses split_integral:

class Posterior_scipyrv(stats.rv_continuous):
    def _cdf(self,x):
        return split_integral(self.pdf,self.splitpoint,x)

Now things run correctly:

22 1.0000000000000198
23 1.0000000000000198
24 1.0000000000000198
25 1.00000000000002
26 1.0000000000000202
...
98 1.0000000000000198
99 1.0000000000000193

Defining support of posterior

So far I’ve only talked about problems that cause the program to return the wrong answer. This section is about a problem that only causes inefficiency, at least when it isn’t combined with other problems.

If you don’t specify the support of a continuous random variable in SciPy, it defaults to the entire real line. This leads to inefficiency when querying quantiles of the distribution. If I want to know the 50th percentile of my distribution, I call ppf(0.5). As I described previously, ppf works by numerically solving the equation \(cdf(x)=0.5\). The ppf method automatically passes the support of the distribution into the equation solver and tells it to only look for solutions inside the support. When a distribution’s support is a subset of the reals, searching over the entire reals is inefficient.
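What "numerically solving" means here can be sketched directly with brentq, the root-finder SciPy's ppf uses (shown on a normal distribution for simplicity):

```python
from scipy import stats, optimize

d = stats.norm(3, 2)

# ppf(q) solves cdf(x) = q; brentq finds the root of cdf(x) - q
# inside the bracket we give it.
x = optimize.brentq(lambda x: d.cdf(x) - 0.5, -100, 100)

print(x)           # ~3.0, the median
print(d.ppf(0.5))  # the built-in method agrees
```

The narrower the bracket around the root, the less work the solver has to do, which is the intuition behind the optimizations below.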

To remedy this, we can define the support of the posterior as the intersection of the prior and likelihood’s support. For this we need a small function that calculates the intersection of two intervals.

def intersect_intervals(two_tuples):
    d1, d2 = two_tuples

    d1_left, d1_right = d1[0], d1[1]
    d2_left, d2_right = d2[0], d2[1]

    if d1_right < d2_left or d2_right < d1_left:
        raise ValueError("the distributions have no overlap")

    intersect_left, intersect_right = max(d1_left, d2_left), min(d1_right, d2_right)

    return intersect_left, intersect_right

We can then call this function:

class Posterior_scipyrv(stats.rv_continuous):
    def __init__(self,d1,d2):
        super(Posterior_scipyrv, self).__init__()
        a1, b1 = d1.support()
        a2, b2 = d2.support()

        # 'a' and 'b' are scipy's names for the bounds of the support
        self.a , self.b = intersect_intervals([(a1,b1),(a2,b2)])

To test this, let’s use a beta distribution, which is defined on \([0,1]\):

prior = stats.beta(1,1)
likelihood = stats.norm(1,3)

We know that the posterior will also be defined on \([0,1]\). By defining the support of the posterior inside the __init__ method of Posterior_scipyrv, we give SciPy access to this information.

We can time the resulting speedup in calculating posterior.ppf(0.99):

print("support:",posterior.support())
s = time.time()
print("result:",posterior.ppf(0.99))
e = time.time()
print(e-s,'seconds to evaluate ppf')

Without the support defined, the output is:

support: (-inf, inf)
result: 0.9901821216897447
3.8804399967193604 seconds to evaluate ppf

And with it:

support: (0.0, 1.0)
result: 0.9901821216904315
0.40013647079467773 seconds to evaluate ppf

We’re able to achieve an almost 10x speedup, with very meaningful impact on user experience. For less extreme quantiles, like posterior.ppf(0.5), I still get a 2x speedup.

The lack of properly defined support causes only inefficiency if we continue to use split_integral to calculate the cdf. But if we leave the cdf problem unaddressed, it can combine with the too-wide support to produce outright errors.

For example, suppose we use a beta distribution again for the prior, but we neither use the split integral for the cdf nor define the support of the posterior as \([0,1]\) instead of \({\rm I\!R}\).

prior = stats.beta(1,1)
likelihood = stats.norm(1,3)

class Posterior_scipyrv(stats.rv_continuous):
    def __init__(self,d1,d2):
        super(Posterior_scipyrv, self).__init__()
        self.d1= d1
        self.d2= d2

        self.normalization_constant = integrate.quad(self.unnormalized_pdf,-np.inf,np.inf)[0]
    
    def unnormalized_pdf(self,x):
        return self.d1.pdf(x) * self.d2.pdf(x)

    def _pdf(self,x):
        return self.unnormalized_pdf(x)/self.normalization_constant

posterior = Posterior_scipyrv(prior,likelihood)

print("cdf values:")
for i in range(20):
    print(i/5,posterior.cdf(i/5))

The cdf fails quickly now:

3.2 0.9999999999850296
3.4 0.0
3.6 0.0

When the integration algorithm is looking over all of \((-\infty,3.4]\), it has no way of knowing that all the probability mass is in \([0,1]\). The posterior distribution has only one big bump in the middle, so it’s not surprising that the algorithm misses it.

If we now ask the equation solver in ppf to find quantiles, without telling it that all the solutions are in \([0,1]\), it will try to evaluate points like cdf(4), which return 0 – but ppf is assuming that the cdf is increasing. This leads to catastrophe. Running posterior.ppf(0.5) gives a RuntimeError: Failed to converge after 100 iterations. At first I wondered why beta distributions would always give me RuntimeErrors…

Optimization: CDF memoization

When we call ppf, the equation solver calls cdf for the same distribution many times. This suggests we could optimize things further by storing known cdf values, and only doing the integration from the closest known value to the desired value. This will result in the same number of integration calls, but each will be over a smaller interval (except the first). This is a form of memoization.

We can also squeeze out some additional speedup by considering the cdf to be 1 forevermore once it reaches values close to 1.

class Posterior_scipyrv(stats.rv_continuous):
    def _cdf(self,x):
        # self.cdf_lookup is a dict, initialized to {} in __init__.
        # Consider the cdf to be 1 forevermore
        # once it reaches values close to 1.
        for x_lookup in self.cdf_lookup:
            if x_lookup < x and np.around(self.cdf_lookup[x_lookup],5)==1.0:
                return 1

        # check lookup table for largest integral already computed below x
        sortedkeys = sorted(self.cdf_lookup ,reverse=True)
        for key in sortedkeys:
            #find the greatest key less than x
            if key<x:
                ret = self.cdf_lookup[key]+integrate.quad(self.pdf,key,x)[0]
                self.cdf_lookup[float(x)] = ret
                return ret
        
        # Initial run
        ret = split_integral(self.pdf,self.splitpoint,x)
        self.cdf_lookup[float(x)] = ret
        return ret

If we return to our earlier prior and likelihood

prior = stats.lognorm(s=.5,scale=math.exp(.5)) # a lognormal(.5,.5) in SciPy notation
likelihood = stats.norm(5,1)

and make calls to ppf([0.1, 0.9, 0.25, 0.75, 0.5]), the memoization gives us about a 5x speedup:

memoization False
[2.63571613 5.18538207 3.21825988 4.56703016 3.88645864]
length of lookup table: 0
2.1609253883361816 seconds to evaluate ppf

memoization True
[2.63571613 5.18538207 3.21825988 4.56703016 3.88645864]
length of lookup table: 50
0.4501194953918457 seconds to evaluate ppf

These speed gains again occur over a range that makes quite a difference to user experience: going from multiple seconds to a fraction of a second.

Optimization: ppf with bounds

In my webapp, I give the user some standard percentiles: 0.1, 0.25, 0.5, 0.75, 0.9.

Given that ppf works by numerical equation solving on the cdf, if we give the solver a smaller domain in which to look for the solutions, it should find them more quickly. When we calculate multiple percentiles, each percentile we calculate helps us close in on the others. If the 0.1 percentile is 12, we have a lower bound of 12 on any percentile \(p>0.1\). If we have already calculated a percentile on each side, we have both a lower and an upper bound.

We can’t directly pass the bounds to ppf, so we have to wrap the method, which is found here in the source code. (To help us focus, I give a simplified presentation below that cuts out some code designed to deal with unbounded supports. The code below will not run correctly).

class Posterior_scipyrv(stats.rv_continuous):
    def ppf_with_bounds(self, q, leftbound, rightbound):
        left, right = self._get_support()

        # SciPy ppf code to deal with case where left or right are infinite.
        # Omitted for simplicity.

        if leftbound is not None:
          left = leftbound
        if rightbound is not None:
          right = rightbound


        # brentq is the equation solver (from Brent 1973)
        # _ppf_to_solve is simply cdf(x)-q, since brentq
        # finds points where a function equals 0
        return optimize.brentq(self._ppf_to_solve,left, right, args=q)

To get some bounds, we run the extreme percentiles first, narrowing in on the middle percentiles from both sides. For example in 0.1, 0.25, 0.5, 0.75, 0.9, we want to evaluate them in this order: 0.1, 0.9, 0.25, 0.75, 0.5. We store each of the answers in result.

class Posterior_scipyrv(stats.rv_continuous):
    def compute_percentiles(self, percentiles_list):
        result = {}
        percentiles_list.sort()

        # put percentiles in the order they should be computed
        percentiles_reordered = sum(zip(percentiles_list,reversed(percentiles_list)), ())[:len(percentiles_list)] # see https://stackoverflow.com/a/17436999/8010877

        def get_bounds(computed, p):
            # get bounds (if any) from the percentiles already computed
            keys = list(computed.keys())
            keys.append(p)
            keys.sort()
            i = keys.index(p)
            if i != 0:
                leftbound = computed[keys[i - 1]]
            else:
                leftbound = None
            if i != len(keys) - 1:
                rightbound = computed[keys[i + 1]]
            else:
                rightbound = None
            return leftbound, rightbound

        for p in percentiles_reordered:
            leftbound , rightbound = get_bounds(result,p)
            res = self.ppf_with_bounds(p,leftbound,rightbound)
            result[p] = np.around(res,2)

        sorted_result = {key:value for key,value in sorted(result.items())}
        return sorted_result

The speedup is relatively minor when calculating just 5 percentiles.

Using ppf bounds? True
total time to compute percentiles: 3.1997928619384766 seconds

Using ppf bounds? False
total time to compute percentiles: 3.306936264038086 seconds

It grows a little bit with the number of percentiles, but calculating a large number of percentiles would just lead to information overload for the user.

This was surprising to me. Using the bounds dramatically cuts the width of the interval for equation solving, but leads to only a minor speedup. Passing full_output=True to optimize.brentq, we can see how many function evaluations brentq uses, and it turns out that this number is highly non-linear in the width of the interval. The solver gets quite close to the solution very quickly, so giving it a narrow interval hardly helps.

Using ppf bounds? True
brentq looked between 0.0 10.0 and took 11 iterations
brentq looked between 0.52 10.0 and took 13 iterations
brentq looked between 0.52 2.24 and took 8 iterations
brentq looked between 0.81 2.24 and took 9 iterations
brentq looked between 0.81 1.73 and took 7 iterations
total time to compute percentiles: 3.1997928619384766 seconds

Using ppf bounds? False
brentq looked between 0.0 10.0 and took 11 iterations
brentq looked between 0.0 10.0 and took 10 iterations
brentq looked between 0.0 10.0 and took 10 iterations
brentq looked between 0.0 10.0 and took 10 iterations
brentq looked between 0.0 10.0 and took 9 iterations
total time to compute percentiles: 3.306936264038086 seconds

Brent’s method is a very efficient equation solver.
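
This behavior is easy to reproduce: passing full_output=True to optimize.brentq returns a results object with the iteration count. A minimal sketch, using a standard normal in place of a fitted distribution:

```python
from scipy import optimize, stats

dist = stats.norm()

def f(x):
    # the root of this function is ppf(0.7)
    return dist.cdf(x) - 0.7

# a wide bracketing interval and a much narrower one around the root
root_wide, info_wide = optimize.brentq(f, -10.0, 10.0, full_output=True)
root_narrow, info_narrow = optimize.brentq(f, 0.0, 1.0, full_output=True)

# both converge to the same root; the narrow bracket saves only a few iterations
print(info_wide.iterations, info_narrow.iterations)
```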

  1. It has a very similar shape to the likelihood (because the likelihood has much lower variance than the prior). 

July 1, 2020

How long does it take to sample from a distribution?

Suppose a study comes out about the effect of a new medication and you want to precisely compute how to update your beliefs given this new evidence. You might use Bayes’ theorem for continuous distributions.

\[p(\theta | x) =\frac{p(x | \theta) p(\theta) }{p(x)}=\frac{p(x | \theta) p(\theta) }{\int_\Theta p(x | \theta) p(\theta) d \theta}\]

The normalization constant (the denominator of the formula) is an integral that is not too difficult to compute, as long as the distributions are one-dimensional.

For example, with:

import math

import numpy as np
from scipy import stats
from scipy import integrate

prior = stats.lognorm(scale=math.exp(1), s=1)
likelihood = stats.norm(loc=5, scale=20)

def unnormalized_posterior_pdf(x):
    return prior.pdf(x) * likelihood.pdf(x)

normalization_constant = integrate.quad(
    unnormalized_posterior_pdf, -np.inf, np.inf)[0]

the integration runs in less than 100 milliseconds on my machine. So we can get a PDF for an arbitrary 1-dimensional posterior very easily.

But taking a single sample from the (normalized) distribution takes about a second:

# Normalize unnormalized_posterior_pdf
# using the method above and return the posterior as a
# scipy.stats.rv_continuous object.
# This takes about 100 ms
posterior = update(prior, likelihood)

# Take 1 random sample, this takes about 1 s
posterior.rvs(size=1) 

And this difference can be even starker for higher-variance posteriors (with s=4 in the lognormal prior, I get 250 ms for the normalization constant and almost 10 seconds for 1 random sample).

For a generic continuous random variable, rvs uses inverse transform sampling. It first generates a random number from the uniform distribution between 0 and 1, then passes this number to ppf, the percent point function, more commonly called the quantile function, of the distribution. This function is the inverse of the CDF: for a given probability, it tells you which value sits at that quantile of the distribution. Drawing a uniform random number \(u\) and evaluating the quantile function at \(u\) is equivalent to sampling from the distribution directly.
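
As a sketch, the generic sampler amounts to something like this (a hand-rolled illustration, not SciPy's actual internals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def inverse_transform_sample(dist, size):
    # draw uniform probabilities, then map them through the quantile function
    u = rng.uniform(0.0, 1.0, size=size)
    return dist.ppf(u)

samples = inverse_transform_sample(stats.norm(), 10_000)
```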

How is ppf evaluated? The CDF, which in general (and in fact most of the time1) has no explicit expression at all, is inverted by numerical equation solving, also known as root finding. For example, evaluating ppf(0.7) is equivalent to solving cdf(x)-0.7=0, which can be done with numerical methods. The simplest such method is the bisection algorithm, but more efficient ones have been developed (ppf uses Brent’s method). The interesting thing for the purposes of runtime is that the root-finding algorithm must call cdf repeatedly to home in on the solution. Each call to cdf means an expensive integration of the PDF.

CDF Bisection
The bisection algorithm to solve cdf(x)-0.7=0
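
For illustration, here is a bare-bones bisection version of ppf, with a counter for how many cdf evaluations it needs (each of which, for a generic distribution, means integrating the PDF):

```python
from scipy import stats

def ppf_by_bisection(dist, p, lo, hi, tol=1e-8):
    # repeatedly halve the bracket; each step costs one cdf evaluation
    cdf_calls = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        cdf_calls += 1
        if dist.cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0, cdf_calls

x, cdf_calls = ppf_by_bisection(stats.norm(), 0.7, -10.0, 10.0)
```

Dozens of cdf evaluations per quantile is why a single call is so slow for a distribution that only exists as a numerically normalized PDF.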

An interesting corollary is that getting one random number is just as expensive as computing a chosen percentile of the distribution using ppf (assuming that drawing a random number between 0 and 1 takes negligible time). For approximately the cost of 10 random numbers, you could characterize the distribution by its deciles.
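
Concretely, nine ppf evaluations give the nine inner deciles (a standard normal stands in for the posterior here):

```python
import numpy as np
from scipy import stats

dist = stats.norm()  # stand-in for an arbitrary posterior

# nine ppf evaluations characterize the distribution by its inner deciles
deciles = dist.ppf(np.linspace(0.1, 0.9, 9))
```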

On the other hand, sampling from a distribution whose family is known (like the lognormal) is extremely fast with rvs. I’m getting 10,000 samples in a millisecond (prior.rvs(size=10000)). This is not because there exists an analytical expression for its inverse CDF, but because there are very efficient algorithms2 for sampling from these specific distributions3.
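
For example (exact timings are machine-dependent):

```python
import time

import numpy as np
from scipy import stats

prior = stats.lognorm(s=1, scale=np.exp(1))

t0 = time.perf_counter()
samples = prior.rvs(size=10_000, random_state=0)
elapsed = time.perf_counter() - t0  # typically on the order of a millisecond
```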

So far I have only spoken about 1-dimensional distributions. The difficulty of computing the normalization constant in multiple dimensions is often given as a reason for using numerical approximation methods like Markov chain Monte Carlo (MCMC). For example, here:

Although in low dimension [the normalization constant] can be computed without too much difficulties, it can become intractable in higher dimensions. In this last case, the exact computation of the posterior distribution is practically infeasible and some approximation techniques have to be used […]. Among the approaches that are the most used to overcome these difficulties we find Markov Chain Monte Carlo and Variational Inference methods.

However, the difficulty of sampling from a posterior distribution that isn’t in a familiar family could be a reason to use such techniques even in the one-dimensional case. This is true despite the fact that we can easily get an analytic expression for the PDF of the posterior.

For example, with the MCMC package emcee, I’m able to get 10,000 samples from the posterior in 8 seconds, less than a millisecond per sample and a 1,000x improvement over rvs!

import math

import emcee
import numpy as np

ndim, nwalkers, nruns = 1, 20, 500

def log_prob(x):
    density = posterior.pdf(x)
    if density > 0:
        return math.log(density)
    return -np.inf

# starting positions for the walkers, drawn from the prior
p0 = prior.rvs(size=(nwalkers, ndim))

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, nruns)

These samples will only be drawn from a distribution approximating the posterior, whereas rvs is as precise as SciPy’s root-finding and integration algorithms. However, I think well-tuned MCMC algorithms can approximate the posterior very closely.

Here’s the code for running the timings on your machine.

  1. “For a continuous distribution, however, we need to integrate the probability density function (PDF) of the distribution, which is impossible to do analytically for most distributions (including the normal distribution).” Wikipedia on Inverse transform sampling

  2. “For the normal distribution, the lack of an analytical expression for the corresponding quantile function means that other methods (e.g. the Box–Muller transform) may be preferred computationally. It is often the case that, even for simple distributions, the inverse transform sampling method can be improved on: see, for example, the ziggurat algorithm and rejection sampling. On the other hand, it is possible to approximate the quantile function of the normal distribution extremely accurately using moderate-degree polynomials, and in fact the method of doing this is fast enough that inversion sampling is now the default method for sampling from a normal distribution in the statistical package R.” Wikipedia on Inverse transform sampling

  3. The way it works in SciPy is that the lognormal class (a subclass of the generic continuous random variable class) overrides the generic inverse transform sampling method with a more tailored sampling algorithm. SciPy then applies the more efficient method whenever rvs is called on a lognormal instance. 

May 31, 2020

Hidden subsidies for cars

Personal vehicles are ubiquitous. They dominate cities. They are actually so entrenched that they can blend into the background, no longer rising to our attention. Having as many cars as we do can seem to be the ‘natural’ state of affairs.

Our level of car use could perhaps be called natural if it were the result of people’s preferences interacting in well-functioning markets. No reader of this blog, I take it, would believe such a claim. The negative externalities of cars are well-documented: pollution, congestion, noise, and so on.

The subsidies for cars are less obvious, but I think they’re also important.

In our relationship to cars in the urban environment, we’re almost like David Foster Wallace’s fish who asked ‘what the hell is water?’. I want to flip that perspective and point out some specific government policies that increase the number of cars in cities.

"Manhattan, 1964 by Evelyn Hofer"
“Manhattan, 1964” by Evelyn Hofer

Free or cheap street parking

Privately provided parking in highly desirable city centres can cost hundreds of dollars a month. But the government provides car storage on the side of the street for a fraction of that, often for free.1

The width of roads

Streets and sidewalks sit on large amounts of strategically placed land that is publicly owned. Most of that land is devoted to cars. On large thoroughfares, I’d guess cars take easily 70% of the space, leaving only thin slivers on each side for pedestrians.

This blogger estimates, apparently by eyeballing Google Maps, that streets take up 43% of the land in Washington DC, 25% in Paris, and 20% in Tokyo.

Space that is now used for parked or moving cars could be used, for example, by shops and restaurants, for bikeshare stations, to plant trees, for parklets, or even to add more housing. And if there were a market for this land, I’m sure people would come up with many other clever uses.

Highways

Even if highways aren’t actually inside the city, they have important indirect effects on urban life. Whether the government pays for highways or train lines to connect cities to each other is a policy choice with clear effects on day-to-day life in the city, even for those who do not travel.

In the United States, this implicit subsidy for cars is large. According to the Department of Transportation, in 2018 $49 billion of the department’s $87 billion budget was spent on highways2.

In this post I don’t want to get into the very complicated question of how much governments should optimally spend on highways. For all I know the U.S. policy may be optimal. My point is only that any government spending on highways indirectly subsidises the presence of cars in cities. This is non-obvious and worth pointing out. When the government pays for a Metro in your city, the subsidy to Metros is plain to see. Meanwhile, the subsidy to cars via a huge network of roads across the country passes unnoticed by many.

To be fair, in the United States federal spending on highways is largely financed by taxes on vehicle fuel. So it’s not clear whether federal highway policy is a net subsidy to cars. However, the way highway spending is financed varies by country. For example, in Germany, “federal highways are funded by the federation through a combination of general revenue and receipts from tolls imposed on truck traffic”.

Minimum parking requirements

Many zoning codes require new buildings to include some fixed number of off-street parking spaces. This isn’t as much of a problem in the European cities I’m familiar with, but in the US, parking minimums are far beyond what the market would provide, and are a significant cost to developers. One paper estimated that parking requirements in Los Angeles increase the cost of office space by 27-67%3.

Suburban sprawl

The United States built sprawling suburbs in the postwar period. I still remember the famous aerial view of Levittown, the prototypical prefabricated suburb, from my middle school history book.

The growth of suburbia was aided by specific government policies that tipped the scales in favour of individual homes in the suburbs, and against apartments in cities. The growth of suburbia led to more cars in the city, because people who live in suburbs are much more likely to drive to work.

Devon Zuegel has an excellent exposition of how federal mortgage insurance subsidized suburbia4:

[The federal housing administration] provides insurance on mortgages that meet certain criteria, repaying the principal to lenders if borrowers default. […] Mortgages had to meet an opinionated set of criteria to qualify for the federal insurance. […] The ideal house had “sunshine, ventilation, scenic outlook, privacy, and safety”, and “effective landscaping and gardening” added to its worth. The guide recommended that houses should be set back at least 15 feet from the road, and well-tended lawns that matched the neighbors’ yards helped the rating. […] [The FHA manual] prescribed minimum street widths and other specific measurements.

The federal government was effectively prescribing how millions of Americans should live, down to their landscaping and gardening! I wonder if Khrushchev brought up this interesting fact about American life in his conversations with Eisenhower. ;)

Further reading

  • A study from the Canadian Victoria Transport Policy Institute, Transportation Land Valuation
  • Anything by Donald Shoup, an economist and urban planner
  • Some cool colour-coded maps of U.S. cities, showing the surface area devoted to surface parking, above-ground parking garages, and park space.
  • Barcelona’s superblocks
  1. If you want more on this topic, economist and urban planner Donald Shoup has a 733-page tome called The High Cost of Free Parking. 

  2. See the supporting summary table on page 82 of this document. The sum of spending for the Federal Highway Administration, the Federal Motor Carrier Safety Administration, and the National Traffic Safety Administration comes to $49 billion. Thanks to Devin Jacob for the pointer. 

  3. Shoup 1999, The trouble with minimum parking requirements, in section 3.1, estimates that parking requirements in Los Angeles increase the cost of office space by 27% for aboveground parking, and 67% for underground parking. 

  4. Devon wrote a two-part series: Part 1, quoted above, deals with federal mortgage policy, and lays out a convincing case that it included large implicit subsidies. Part 2 is about “how suburban sprawl gets special treatment in our tax code”. It shows that owning and building homes is heavily subsidized, for example by the gargantuan mortgage interest deduction. I agree that this means people are encouraged to consume more housing, but I don’t see how it differentially encourages suburban housing. Devon quotes economist Edward Glaeser, who says that

    More than 85 percent of people in detached homes are owner-occupiers, in part because renting leads to home depreciation. More than 85 percent of people in larger buildings rent. Since ownership and structure type are closely connected, subsidizing homeownership encourages people to leave urban high-rises and move into suburban homes.

    So the key link in the argument is the connection between ownership and structure type. I’d like to see it spelled out and sourced better. Could the observed correlation just be due to a selection effect? If there’s a true causal effect, do large buildings have more renters because it’s genuinely more efficient that way, or is there some market failure that prevents people from being apartment-owners in the city? 

December 6, 2019

Why scientific fraud is hard to catch

It’s nearly impossible to catch a scientific fraudster if they’re halfway competent.

Uri Simonsohn has become a minor nerd celeb by exposing fraudulent academic scientists who used fabricated data to get published. The Atlantic called him “the data vigilante”. I’ll describe two simple statistical techniques he has used – and why I’m pessimistic about the impact of such techniques.

If a parameter is measured with many significant digits, the last digit should be distributed uniformly over 0-9. In a study of an intervention to increase factory workers’ use of hand sanitizer, sanitizer use was measured with a scale sensitive to the hundredth of a gram. But the data had an unusual prevalence of 6s, 7s and 9s in the last digit. Uri Simonsohn and colleagues conducted a chi-square test and rejected the hypothesis that the digits follow a uniform distribution, p=0.00000000000000001.1
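
Here’s a sketch of that last-digit check on simulated data (not the actual study data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# honest measurements should have uniform last digits
honest = np.bincount(rng.integers(0, 10, size=1000), minlength=10)

# fabricated ones overrepresent 6s, 7s and 9s (weights chosen for illustration)
weights = [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.22, 0.22, 0.04, 0.22]
faked = np.bincount(rng.choice(10, size=1000, p=weights), minlength=10)

# chisquare tests observed counts against a uniform expectation by default
_, p_honest = stats.chisquare(honest)
_, p_faked = stats.chisquare(faked)
```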

A second sign of fraudulent data is baseline means that are too similar between treatment groups. In one of the hand sanitizer studies, there were 40 participants, 20 in the control condition and 20 in the treatment condition. Simonsohn used a “bootstrapping” technique – randomly shuffling the 40 observations into two groups of 20, and repeating this millions of times – to estimate how often we would see means as similar as those reported if the data were truly drawn randomly (less than once in 100,000)2.
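
A sketch of the same idea on simulated data, with a fabricated-looking treatment group whose mean has been nudged to nearly match the control group’s:

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_means_this_close(control, treatment, n_shuffles=50_000):
    # share of random reshuffles of the pooled data whose group means
    # are at least as close as the observed ones
    pooled = np.concatenate([control, treatment])
    n = len(control)
    observed_gap = abs(control.mean() - treatment.mean())
    count = 0
    for _ in range(n_shuffles):
        shuffled = rng.permutation(pooled)
        gap = abs(shuffled[:n].mean() - shuffled[n:].mean())
        if gap <= observed_gap:
            count += 1
    return count / n_shuffles

control = rng.normal(50, 10, size=20)
# fabricated-looking treatment: same spread, mean nudged to match control's
treatment = rng.normal(50, 10, size=20)
treatment += control.mean() - treatment.mean() + 0.001
p_value = prob_means_this_close(control, treatment)
```

A tiny p_value here says the baseline means are suspiciously more similar than chance alone would produce.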

There are other, more mathematically intense techniques for forensic data analysis3, but the common theme among them is to detect fraudsters creating suspiciously non-random data.

I want to tell these hand sanitizer people: come on, how hard can it be to use a random number generator? We know people are bad at producing randomness. In poker, it’s often optimal to play a mixed strategy, which requires randomising your play. But we have a strong natural tendency to play non-randomly, so poker players have developed ad hoc randomisation devices, like looking at your watch and playing call if you’re in the first half of the minute and fold if you’re in the second half. A similar incapacity to produce enough randomness seems to have befallen these amateurish scientific fakers. In order to produce data that violates the last-digit-uniformity law, you have to literally be writing the fake numbers by hand into a computer!

Savvier baddies would not shoot themselves in the foot in this way. It’s very easy to just draw some random numbers from a pre-specified distribution.

I can imagine that as you run more complex experiments, with multiple treatment arms and many potentially correlated parameters, it becomes difficult to create realistic fake data, even if you randomly draw it from a distribution. Some inconsistency could always escape your notice, and a sufficiently determined data sleuth might catch you.

But there’s a much easier solution: just run a legitimate experiment, and then add a constant of your choice to all observations in the treatment group. This data would look exactly like the real thing – the only lie would be that the “treatment” was you logging on to the computer in the middle of the night and changing the numbers. I can’t think of any way this misconduct could be detected statistically. And it has the additional benefit that you’re running an experiment, so people in your department won’t be wondering where you’re getting all that data from.

Statistical sleuthing is fun, but I suspect it’s powerless against the majority of fraud.

My broader hope is that we’ll see a rise in the norm of having multiple independent replications of a study. This single tide should wash away many of the problems with current science. If a study fails to replicate multiple times, the result will lose credibility – even if we never find out whether it was due to outright fraud or merely flawed science.

  1. http://datacolada.org/74, Figure 2 

  2. http://datacolada.org/74, Problem 4 

  3. see the “fake data” category of Simonsohn’s blog Data Colada, which by the way is excellent on many topics besides fraud. 

August 2, 2019