Modelling longevity in the UK using Markov chains


There are many pressing policy problems that could be affected by the dynamics of ageing in the UK.  In particular, improvements in life expectancy could pose all sorts of problems:

I sought to model the dynamics of ageing in the UK by using mortality rate data from the ONS, in order to apply a Python library (QuantEcon) and see how much insight a basic model can produce.

In terms of insight, I will not pretend to know whether more resources are needed for pensions or social care, as I am sure that an army of government boffins is on the case.

Mortality datasets often used in risk modelling for insurance.  Markov chains are often used to model the dynamics of ageing.  These are two basic states that a person could be:

  • Alive
  • Dead

The ‘dead’ state is an absorbing state, as once that state is reached, it is never left.  I will leave the actuaries to account for the supernatural in their models!

The ONS has data on age-specific mortality rates in a recent statistical bulletin (see table in image).

Data on mortality rates (ONS)
Data on age-specific mortality rates (ONS)

From that level of data, we can model the age of death for a person, by creating discrete random variables governed by Markov chains.  Once we have a way to simulate longevity for one person, then a series of simulations could form a distribution for a population.


See the whole Jupyter notebook for this model as a Github gist in the frame below:

The QuantEcon module (imported as ‘qe’) has the handy DiscreteRV class and capabilities specific to Markov chains.  The DiscreteRV class is all that is needed for this model from QuantEcon.

From the data above, these are discrete random variables for the age-specific mortality rates (e.g. ‘rv65’ for 65-69 group; ‘rv90’ for 90+ group):

# Set up discrete variables for each Markov chain
# STATE '0': alive this year
# STATE '1': dies this year
rv65 = qe.DiscreteRV([0.9881, 0.0119]) # 65-69
rv70 = qe.DiscreteRV([0.9807, 0.0193]) # 70-74
rv75 = qe.DiscreteRV([0.9664, 0.0336]) # (etc.)
rv80 = qe.DiscreteRV([0.9409, 0.0591])
rv85 = qe.DiscreteRV([0.8922, 0.1078])
rv90 = qe.DiscreteRV([0.7851, 0.2149]) # 90+

 # Set up accumulator variable to track person's age
 # Due to nature of model, restrict starting age parameter to range [65, 90]
 if age < 65: counter = 65 elif age > 90:
     counter = 90
     counter = age

The principle behind this setup is to do a series of draws until state 1 (death) is reached: i.e. the series of draws will be like [0, 0, … , 1].  That is how the Markov process is being simulated.

The ‘simulation’ function will have a ‘starting age’, which will go through draws of random variables until death is reached.  Through that, it will be possible to consider the expected longevity for any given age (e.g. from age 65 or from age 100).

Let’s set out some assumptions for the model:

  • A series of Markov chains represent mortality dynamics.
    • State set (for each year): {alive, dead}
    • Morality rates are across 5-year intervals (excl. 90+), sourced from recent ONS data
    • Model uses draws of (independent) random variables
  • Age of death is restricted to range [65, 109] for simplicity.
  • Model does not account for gender.

The upper limit on age of death is not too problematic for the model, given the mortality rates.  Only a few dozen people will reach the upper limit over 100,000 simulations.  A uniform mortality rate for every year after 90 is almost certainly unrealistic, as will be discussed later, so simulations past 100 are restricted by the data.

In the simulation function, following the code outlined above, the sequence across the random variables can be done through a series of ‘while loops’ (one per variable).  Let’s look at the first two loops:

# Sequence of while loops
while counter < 70: draws = 5 - (counter - 65) # 1-5 draws states_65 = rv65.draw(draws) # array for draws if 1 in states_65: # sum: index of first '1' obs + lower limit # e.g. 4 + 65 sum if '1' state in 5th draw return np.argmax(states_65 > 0) + counter
        # No '1' state observed, so move to next chain (70-74)
        counter = 70 # move to next loop
while counter < 75: draws = 5 - (counter - 70) states_70 = rv70.draw(draws) if 1 in states_70: return np.argmax(states_70 > 0) + counter
        counter = 75
# ...

There seems to be a lot of repetition of code, which is not ideal, but it is not time-consuming to do six ‘while loops’.  I spent some time trying to automate this code more generally but could not think of an elegant approach.

The idea in this code is to store the results of the draws from the random variable in an array (using the draw method from DiscreteRV).  If ‘1’ is observed, then the loop is ended and the function will return one scalar: the age of death.  if all the draws are ‘0’, then the loop is exited and we move on to the next loop, where the idea repeats in a similar fashion.

The last loop needs to make sure that every simulation returns a value (everyone dies).  If entering the loop, the counter will have value 90.  Here is the final loop:

while True:
    # Person has reached age 90, so use final chain
    draws = 110 - counter # treat 109 as maximum age
    states_fin = rv90.draw(draws)
    # Final
    return np.argmax(states_fin > 0) + counter

Thus the ‘simulation’ function is complete and returns an integer in the range [age, 109], where the lower bound is just the argument of the function (‘age’).

Now, let’s create a function to store a distribution:

def distribution_gen(age=65, n=10000):
    Generate a distribution of ages of death.
    Returns a list of size n, where each element is an age.
        age (int): number of years, restricted to range [65, 90]
    See 'simulation' function for more detail on the model.
    distribution = [] # store each observation in list
    for i in range(n):
        obs = simulation(age)
    return distribution

Longevity for people aged 65

As an example of the model in action, let’s model the longevity of people currently aged 65 (bearing in mind the law of large numbers).

large_dist = distribution_gen(65, 10000)

The function runtime was less than one second, which is a pleasant surprise.  Here is a histogram of the distribution, plotted using Matplotlib:

Histogram: longevity of people aged 65
Histogram for age=65

The mean age of death is 84 for the distribution.

As the distribution’s values are in a list, a dictionary can be created to store the frequencies:

from collections import Counter
# Store distribution's information in dictionary
freq = Counter(large_dist)

Suppose there are 650,000 people aged 65 in the UK today, which is a fair approximation based on ONS population data.  How many of them are expected to become centenarians (aged 100+) by the model?

This is where the dictionary comes in handy, to craft a function that can perform calculations for this distribution, given any age:

def pop_greater_equal(age=100, pop_size=650):
    dict_sub = sum([v for k, v in freq.items() if k >= age])
    return dict_sub * 1000 * (pop_size / sum(freq.values()))

One iteration of my model estimates that 15,400 of them will become centenarians.  That will amount to over 40 birthday cards for the monarch to send every day.

Note: the actual numbers vary for each generated distribution.  I chose n=10,000 as a reasonable number to draw, given the size of the population.  The distribution’s key summary stats (e.g. mean) differ by 0.1 years or so depending on the iteration, with n set at that size.

How does the model compare to actual statistics?

Consider the context of the model, by looking at ONS statistics (2012) on mortality and population:

  • “A man aged 65 in the UK could expect to live for 18.2 years, […] and a 65-year-old woman, for 20.7 years”
  • “In 2012 there were 13,350 centenarians living in Britain”
  • “Out of the 13,350 centenarians living in the UK in 2012, 660 were aged 105 years and older”

The results from the model seem intuitive:

  • The mean age of death for 65-year-olds is approximately 84, i.e. expect to live for almost 19 more years. This is fairly close to the ONS stats above.
  • The estimate of the number of centenarians is close to the actual number today.  However, the model is a projection of today’s 65-year-olds, so that is not the right comparison.  An ONS calculator estimates the chances of 65-year-olds reaching 100 to be 7% and 11% for men and women respectively.  Taking the mid-range, then that would give around 50,000 according to the ONS compared to my figure.  The calculator’s mortality data point to higher longevity than my (ONS-sourced) data, so that could explain some of the discrepancy from my model.
  • The model’s estimate of people aged 105+ (3,700) is quite far from the actual number today.  The mortality rate of a 100-year-old is assumed to be the same for a 90-year-old, when in reality the mortality rate is likely to be much higher, so the model’s estimate is likely to be an overestimate. This is a limitation of the dataset.  I cannot find an ONS calculator online that looks at the probability of reaching an age beyond 100 unfortunately.

For what is a basic model, given its reliance on one small table, it seems to estimate the actual UK stats reasonably well and the probability distribution is straightforward to modify.

It would be interesting to see if more parameters can be used in the model (e.g. gender).  Sourcing of suitable datasets would be a challenge, but the next issue is how that would be scaled up: e.g. can the loops be replaced by recursive processes or object-oriented programming?