Planning a trip to London: finding your bearings via nearest neighbour search

Big Ben near Westminster station
What are the nearest stations to London’s top tourist attractions?

Introduction

Travelling is one of my favourite hobbies, but it often comes with frustration: planning a trip can be so difficult.  That is why I have been working on a data science project that tries to make it easier to plan a trip itinerary based on geodata.

I have published my project to my Github, so check out my initial notebook there.  This is the first in a series of blog posts on this project; each post will focus on small components of code and describe things in more general terms.

There are so many things to do in the travel planning process, e.g.:

  • Choose a destination (probably from a shortlist).
  • Book flights.
  • Book accommodation (again from a shortlist).
  • Develop a travel itinerary.
  • Pack bags for the trip.
  • Arrange transport (e.g. to airport and accommodation).

All that is done before even going on a trip, but already the number of tasks involved seems time-consuming.  Some trips can take days to plan over a long period and the planning process is typically iterative (e.g. develop a draft itinerary and then refine it nearer the time).

Even as a seasoned traveller I would say that developing a travel itinerary is the hardest thing to do in that list. An itinerary defines a trip and it is something that you want to get right, especially if it may be your only visit to a destination.  For all the travel guides and websites I examine when planning a trip, I normally need to adjust my plans ‘on the day’: e.g. the transportation isn’t as straightforward as I thought; an attraction is closed; plans are dependent on weather conditions.

Since itineraries are so time-consuming to get right, before and during a trip, I’ve sought a way to make it easier for me to develop a travel itinerary via data science techniques.  As someone who has lived, studied and worked in London, I thought it would be worthwhile to start off the project by using data on London.

I read Hamza Bendemra’s post on geo-location clustering in Paris, which inspired me to do my own take on a project like this.  With my background in travel technology, my experience of planning travel for large groups and a previous project on transportation, I thought I could go much further with my own project on travel planning.

These are the two initial aims of the project:

  • Find the nearest rail station for London’s top attractions and calculate how close the attractions are to each station.
  • Group together London’s top attractions and use the groupings for itinerary planning (e.g. spend one day for attractions in one group, another day for attractions in a second group, etc.).

That could help to establish a proof of concept for a more ambitious project.


This post focuses on the first part: calculating the nearest station to London’s top tourist attractions.  Before going into that, it is also worthwhile to describe the nature of the datasets.

Data on attractions and stations

There are over 1,600 attractions in London listed on TripAdvisor, so there are plenty of things to do.  I typically use TripAdvisor to compare destinations before planning a trip and it’s normally beneficial to look over some of the top attractions.  The top attractions on TripAdvisor have high ratings and a relatively high volume of reviews, so they are likely to have widespread appeal and fit into almost any travel itinerary.

I created a spreadsheet with geodata on more than 20 of the top attractions in London on TripAdvisor.  These are the main items for each record in the spreadsheet table:

  • Attraction rank
  • Attraction name
  • Geodata (e.g. latitude, longitude)
  • Recommended visit duration interval (e.g. 2–3 hours)

This dataset involved some manual data entry and no web scraping at all.

Top London attractions
The top London attraction on TripAdvisor is the National Gallery

I also have a dataset on over 600 rail stations in London (e.g. London Underground, London Overground, National Rail stations, etc.), which I have used previously for my graph analysis of London’s rail system.  These are the main items for each record in the spreadsheet:

  • Station name
  • Geodata (e.g. latitude, longitude, postcode)
  • TfL fare zone

I’ve loaded up the data into dataframes that are imaginatively called sights and rail respectively in my notebook.
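For context, loading the two spreadsheets into pandas is a one-liner each; the file names below are hypothetical stand-ins for my own files rather than the actual paths in the notebook.

import pandas as pd

# Hypothetical file names for illustration
sights = pd.read_excel('london_attractions.xlsx')  # rank, name, latitude, longitude, duration
rail = pd.read_excel('london_stations.xlsx')       # station, latitude, longitude, postcode, zone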

Nearest station to attractions

With the data loaded and the geodata in a consistent format across both dataframes, we are ready to calculate the proximity between rail stations and tourist attractions.

The aim in this post is to find the nearest station for London’s top attractions and calculate how close the attractions are to each station.

In essence, the primary problem is a ‘nearest neighbour’ problem and the second problem involves applied trigonometry.

The rationale for tackling this aim first is to get more context on the attractions before approaching the travel planning problem, which is an unsupervised learning problem.

There are 21 attractions that need a ‘nearest station’ calculated from over 600 stations.

A KD Tree can store the latitude–longitude pairs of all the stations so that nearest neighbour calculations for any co-ordinates queried on the tree are efficient.

 

# Set up KD Tree object
import numpy as np
import scipy as sp
import scipy.spatial  # ensures sp.spatial is available

rail_tree = sp.spatial.KDTree(np.array(rail[['latitude', 'longitude']]))

That tree was set up with a Numpy array of latitude–longitude pairs (i.e. an n x 2 array where n is the number of stations).

A Numpy array, e.g. an m x 2 array, can also be queried on the tree to return the nearest neighbour in the tree for m points (in object nn_id) and their distances from the neighbours (in object nn_dist).

 

# Query the nearest neighbours of the tree and store info in arrays
nn_dist, nn_id = rail_tree.query(sight_geo_pairs, p=2) # Euclidean distance

Nearest neighbour calculations

A query of the National Gallery’s co-ordinates (51.508929, -0.128299) returns position 387 in the tree as the nearest neighbour, which is Leicester Square.
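As a quick sketch of that single-point query (using the rail_tree and rail objects set up above):

# Query a single point: the National Gallery's co-ordinates
dist, idx = rail_tree.query(np.array([51.508929, -0.128299]), p=2)
print(idx, rail['station'].iloc[idx])  # 387, Leicester Square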

See the notebook for more depth on the code and more examples beyond the first case (National Gallery).

Euclidean distance is used to represent the distance ‘as the crow flies’ between two points.  There are other types of distance like Manhattan distance but the Euclidean distance is most appropriate for the data, as it is rotation invariant.  A new column is added to the dataframe to record the nearest station for each attraction.


# Create a new column in sights df for nearest station
sights['nearest_station'] = rail['station'].iloc[nn_id].values


The distance returned by the tree for the National Gallery is 0.00236287, but this is in raw latitude–longitude degree units, so it is not easy to interpret and does not account for the curvature of the Earth’s surface.

We cannot convert the distances already calculated into miles, for example.  Instead, we need the lat-long pairs of both points in order to calculate the miles between them (via the haversine formula).

Distance calculations

The haversine formula applies spherical trigonometry to calculate the distance between two points on a sphere.  It can be used to give an approximate great-circle distance between two points on Earth.

My notebook defines a function called haversine, which takes geodata from two points as parameters and returns the distance in miles between the two points by default.  The mathematics in the function matches the formula specified in the Wikipedia article linked above.
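For reference, here is a minimal sketch of what such a haversine function can look like.  The parameter order (attraction latitude/longitude first, then station latitude/longitude) and the 3,959-mile Earth radius are assumptions for illustration, not a copy of the notebook’s code.

import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius=3959):
    """Approximate great-circle distance between two points (miles by default)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * np.arcsin(np.sqrt(a))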

I gather latitude–longitude pairs for each attraction and their nearest station in two numpy arrays that both have the shape m x 2, where m is the number of attractions.  My function requires the data to be flattened, so I concatenate the two arrays into one array of shape m x 4, which contains all the data needed to calculate the relevant haversine distances for the attractions.  A column of distances in miles is created through a list comprehension, which iterates over each row of the array to unpack the two latitude–longitude pairs and pass them to the haversine function.


# Create a col for the distance
# For each attraction, the two geo pairs are passed to the haversine function to obtain a value
sights['nearest_station_miles'] = [haversine(*four_coords) for four_coords
  in np.concatenate([sight_geo_pairs, station_geo_pairs], axis=1)]

The top 5 attractions are all approximately 0.2 miles away from their nearest stations and each one has a different station.  The National Gallery’s nearest station is 0.163 miles away (Leicester Square).  London’s #5 attraction is the Victoria & Albert Museum and its nearest station is 0.193 miles away (South Kensington).

There are some attractions that are within 500 feet of the nearest station.  Big Ben is approx. 260 feet away from its nearest station (Westminster).  The photo at the top of this article shows a roundel of Westminster station in the foreground, with the nearby clock tower in the background.

Concluding points

The application of nearest neighbour search to my data has enabled me to obtain more geographical context on London’s top attractions.  Most people traversing through London can use the Tube map as a point of reference for their travelling, so calculating the nearest station to London’s attractions is a useful first step in planning a travel itinerary for the city’s key sights.

One point to bear in mind with the calculations is that the nearest station calculated for an attraction may not be the nearest in terms of actual walking distance at the street level.  The Euclidean distance is a simple way to calculate distance between points; much richer data would be needed, e.g. street routes in a graph, to calculate walking distances.  Euclidean distance works well for giving geographical context, without introducing unnecessary complexity.

This context will be useful for approaching the next part of the project, which is an unsupervised learning problem: clustering of London’s top attractions and using the clusters to plan an itinerary.  The attractions in London are spread across areas of differing densities, so it will be interesting to see how attractions are grouped together and to what extent an itinerary could be formed from them.

UK phone number parser: recognising phone numbers from text input

Introduction

In a previous post, I outlined business problems I encountered with data that contain UK phone numbers:

  • Dealing with spreadsheets or databases that have inconsistent phone number formats
  • Scraping phone number information from a clunky PDF (e.g. a directory) or webpage

I designed a regex to recognise UK phone numbers and think it is better than many of the example regular expressions listed online (at least, for what I was looking to achieve for my own program).


This is the sequence I followed in designing a program that recognises UK phone numbers and parses them to a consistent format:

  1. Write the regex
  2. Create program that processes an input
  3. Export the phone numbers to a workbook and process them

This post covers the intermediate step: creating a program that processes an input (text input).  This ended up being clipboard input, as web scraping is not as versatile in my view and I am more likely to be using the program for PDFs than webpages.


Getting started with the program

The first building block for the program:

Define a regex and take clipboard input.  Then find all strings that match the regex pattern and return them as a list (of raw phone numbers).

Once that is done, it is possible to increase the functionality of the program later (e.g. export the list to Excel; convert the raw numbers to a consistent format).

Regex object

The regular expression was defined and tested in the previous post on this project.  The testing was done in RegExr rather than through a Python IDE, but it served as a dry run for writing the program.

The RegexObject is set up as follows, via the re library:


 phone_regex_UK = re.compile(r"""(
   [multi-line regex here]
    )""", re.VERBOSE | re.DOTALL)

The re.VERBOSE and re.DOTALL flags are used so that the multi-line pattern stays readable and the regex can process long blocks of text well.

These are the three regex groups, as could be seen last time with RegExr:

  1. International dialling code (e.g. 00 44; +44(0)), if any.
  2. Phone number without leading 0.
  3. Extension information (e.g. ‘ext. 123’, ‘ext 123’), if any.

For example, this would be the breakdown of ‘0161 123 4567 ext. 1231’:

  1. [blank]
  2. ‘161 123 4567’
  3. ‘ext. 1231’

Clipboard input

I had considered incorporating a web scraping library for the program, but thought that there would need to be some element of visual inspection by an end-user to account for basic problems such as formatting and cumbersome websites.

The intent of the program was always to reduce the time needed for data cleaning, so clipboard input would be much more powerful and allow for more adaptability.

This block of code uses the pyperclip library to process clipboard text and datetime to help monitor the performance of the program:


 print("Copying text from the clipboard ...")
 timer_start = datetime.datetime.now() #timer
 raw_clip = str(pyperclip.paste())
 pyperclip.copy(raw_clip)

Search functionality

Now that an input can be stored within an object in the program, more code is needed to process what is contained in that object.

The input could be a messy million-character string of text for a webpage or long PDF file, which could contain hundreds of phone numbers.  It is critical then to create code that achieves these things:

  • Returns the phone numbers in a consistent format: in essence, the dialling code is extraneous information and all the white space can be removed from ‘group 2’.
  • Fast performance: the program needs iteration, so we expect O(n) performance as a target.  List and string methods can be used to keep performance to that level.

Here is code that loops over every match returned by the regex and does just that:


# Search functionality block
# raw_list will store all the raw numbers recognised by regex
# matches_list will store numbers in a semi-standardised format
#    i.e. leading zeroes & no intl codes, but all else is messy
matches_list = []
raw_list = []
for groups in phone_regex_UK.findall(raw_clip):
    raw_num = groups[0]
    raw_list.append(raw_num)
    num = "0" + groups[2]
    num = ''.join(ch for ch in num if ch.isdigit())
    if groups[3] != "":
        ext = groups[3]
        ext = ''.join(ch for ch in ext if ch.isdigit())
        num += "x" + ext
    matches_list.append(num)

Each phone number is stored as a raw_num (all content is kept via groups[0]).  Then the essential phone number detail is accessed via groups[2]: the number is stored in a digits-only format by using a generator expression (ch for ch…) that strips out spaces and any other non-digit characters.

In case of an extension number, the number is kept in a form such as ‘x123’ for ‘ext. 123’.  At the end of the iteration, the ‘cleaned up’ number is placed at the end of the matches_list.
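As a rough illustration of the intended end state (the input values here are made up, and assume the regex captures the groups as described above), the two lists might end up looking like this:

raw_list       # e.g. ['0161 123 4567 ext. 1231', '+44 (0)20 7946 0000']
matches_list   # e.g. ['01611234567x1231', '02079460000']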

 

So far all the processing action is being kept under the hood but the next post on this project will address program usability.  If thinking about the program in ‘minimum viable product’ terms, it is perhaps one step away: the functionality has been tested a fair amount, but more attention is needed on design and usability.

My Github page is live

Pleased to announce that my Github page is now live: @exmachinadata.

It has been quite a learning curve to get a Jupyter notebook published there and to understand the nuances of its features (e.g. gists).

My first repository is my London Tube repo.  I hope to publish some gists and integrate my Github page with this website better – watch this space!

 

London’s Tube network: adding interchange data

Introduction

Modelling and analysis of London’s Tube network has been an interesting project for me to do, as it has enabled me to create my own data pipeline by using multiple data sources.  I have refined the model and will discuss it in this post.

These are the two challenges I have sought to tackle so far in this project:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

As presented in my previous posts on this project, I managed to create a function that displays the shortest path between two stations and do some initial link analysis on the most important stations in the network.  My graph has all the current London Overground routes (e.g. Lea Valley Lines), whereas many network analyses online are based on older data.

What elements can be improved?

  • More interchange data can be added to the graph.
  • The interchange edges can be less clunky (especially the OSI [out-of-station interchange] and walking routes between stations).
  • Data visualisation could add much more value to the analysis.

Improving the graph data

The first two problems can be addressed together, by adding edges that represent distances from:

  • Platform to street level
  • Platform to platform (interchange)

In the previous model, there were assumptions set on the interchange times (in essence, they were treated as identical due to lack of data).  However, the interchange times vary across the network.  An interchange time could be as high as 15 minutes, or at the lower end could be a hop to a parallel platform.  Look at the layout of King’s Cross St. Pancras station as an example of a large station with many tunnels.

I have incorporated TfL’s data on interchange times in my model, which has involved adding over 600 edges.  The graph has approximately twice the number of edges as a consequence.
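To illustrate the structure of the model, here is a minimal sketch assuming a weighted, undirected NetworkX graph in which edge weights are times in minutes; the times below are illustrative rather than values from the TfL dataset.

import networkx as nx

G = nx.Graph()

# Line edges: travel time between adjacent platforms on the same line
G.add_edge("Paddington (Bakerloo) [Circle]", "Bayswater [Circle]", weight=2.0)
G.add_edge("Bayswater [Circle]", "Notting Hill Gate [Circle]", weight=2.0)

# Interchange edges: street level to platform, and platform to platform
G.add_edge("Paddington", "Paddington (Bakerloo)", weight=3.0)
G.add_edge("Paddington (Bakerloo)", "Paddington (Bakerloo) [Circle]", weight=2.0)
G.add_edge("Notting Hill Gate [Circle]", "Notting Hill Gate", weight=2.0)

# A fastest route is then a shortest weighted path
print(nx.dijkstra_path(G, "Paddington", "Notting Hill Gate"))
print(nx.dijkstra_path_length(G, "Paddington", "Notting Hill Gate"), "minutes")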

There are missing data, so I had to estimate distances for edges at some stations: e.g. Clapham North; most of the Lea Valley Line stations.

Tube network connections in Excel (including interchange edges)
Tube network connections (e.g. distances between platforms in a station)

Fastest route

Google Maps is not great for calculating short journeys between stations, especially journeys involving a large National Rail station.

This is probably my best example of a calculation by Google Maps that seems ‘off’.

Suppose you arrive in Paddington station from a National Rail service.  How long would it take to get to Notting Hill Gate station (street level)?

Google Maps says 5 minutes:

  • Paddington station concourse to Paddington’s District/Circle Line platforms: 2 minutes
  • Two stops to Notting Hill Gate: 3 minutes

This journey would not take 5 minutes.  It would take around 5 minutes to get this far: Paddington concourse; Paddington Underground entrance on Praed Street; Circle/District platforms.

Google Maps does not appear to use much data on interchange times for Tube stations.  I am surprised, as it does use timetable data of some sort.  In fact, it even lists Notting Hill Gate station incorrectly as ‘Notting Hill’.

Using my fastest_route function (code shown in previous post), the journey is estimated to take 11 minutes, not 5 minutes:


# Google Maps has Paddington concourse -> Notting Hill Gate as 5min journey
fastest_route("Paddington", "Notting Hill Gate [Circle]") # start: National Rail
fastest_route("Paddington", "Notting Hill Gate")
fastest_route("Paddington (H&C)", "Notting Hill Gate") # start: north end of Paddington

The middle case, of most interest, returns 11 minutes:

JOURNEY:

Paddington
Paddington (Bakerloo)
Paddington (Bakerloo) [Circle]
Bayswater [Circle]
Notting Hill Gate [Circle]
Notting Hill Gate

JOURNEY TIME: 11.0 minutes

The first and last journeys have journey times of 9.0 and 14.0 minutes respectively.

West Hampstead to West Ruislip (revisited)

An interesting journey discussed in a previous post was: West Hampstead area to West Ruislip.  There are three separate stations in West Hampstead (Underground, Overground and Thameslink) so that opens up three different routes to West Ruislip, which take approx. 50 minutes.  The inclusion of interchange data in the data pipeline would give a better idea of the actual ‘fastest’ route.

With the interchange data, the fastest route from West Hampstead Overground to West Ruislip takes 49.0 minutes.  The new model is suggesting a walk to the Underground station, instead of taking the Overground to Shepherd’s Bush and taking the Central line from there:

JOURNEY:

West Hampstead
West Hampstead Underground
West Hampstead Underground [Jubilee]
Finchley Road [Jubilee]
Finchley Road [Metropolitan]
Wembley Park [Metropolitan]
Preston Road [Metropolitan]
Northwick Park [Metropolitan]
Harrow-on-the-Hill [Metropolitan]
West Harrow [Metropolitan]
Rayners Lane [Metropolitan]
Eastcote [Metropolitan]
Ruislip Manor [Metropolitan]
Ruislip [Metropolitan]
Ickenham [Metropolitan]
Ickenham
West Ruislip

JOURNEY TIME: 49.0 minutes

The route via the Jubilee line and Central line is not far off on journey time though.  It takes 47.0 minutes from Swiss Cottage’s Jubilee line platform to West Ruislip. In total, the journey from West Hampstead Underground to West Ruislip via Jubilee and Central lines is 51.0 minutes.

Journeys that are not working so well

There are some quirks in the Tube network that are hard to model, so an undirected graph will not give truly accurate times for some journeys.  These are examples:

  • Interchanges between platforms on the same line are difficult to model: e.g. Euston’s two Northern line branches; Edgware Road’s two Circle line branches; interchanges such as Harrow-on-the-Hill and Leytonstone.
  • The London Overground has different services – some involve interchanges, others may not.  There are services that do a circle starting and ending at Clapham Junction, but other services do not do an orbit.  In my current model, a journey from West Hampstead to Shepherd’s Bush involves an interchange at Willesden Junction.  However, a service doing the Clapham Junction loop would not require that interchange.
  • Some Underground branches have different service patterns: e.g. trains often start at Mill Hill East for commuters, but during the day trains do not tend to terminate there, so a journey to that station can involve a 15-minute interchange at Finchley Central!  The fast Metropolitan line trains at peak times are another example.

As can be seen, there are a lot of complexities in the Tube network, so a graph cannot model all of them successfully.

Consider some examples in action:


fastest_route("Paddington (H&C) [Circle]", "Paddington (Bakerloo) [Circle]")
# The two Circle services at Edgware Rd are at different stations - this journey is not right

JOURNEY:

Paddington (H&C) [Circle]
Edgware Road (Circle) [Circle]
Paddington (Bakerloo) [Circle]

JOURNEY TIME: 6.0 minutes

The time between the two Paddington stations’ Circle line platforms is not 6 minutes.  The problem is the Edgware Road (Circle) [Circle] node.  There needs to be an interchange between the different Circle line platforms at Edgware Road (Circle) station.  To avoid this problem, the node would need to be split up into two nodes.

For example, Euston [Northern] node could be broken up into two nodes: Euston [Northern – Bank] and Euston [Northern – Charing X].  This small change introduces another level of complexity for the whole network.
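A small sketch of that change, again assuming a NetworkX graph and illustrative times:

import networkx as nx

G = nx.Graph()

# Split the single Euston [Northern] node into two branch-specific nodes,
# joined by an interchange edge between the platforms
G.add_edge("Euston [Northern - Bank]", "Euston [Northern - Charing X]", weight=3.0)

# Line edges then attach to the branch they actually serve
G.add_edge("King's Cross St. Pancras [Northern]", "Euston [Northern - Bank]", weight=2.0)
G.add_edge("Mornington Crescent [Northern]", "Euston [Northern - Charing X]", weight=2.0)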

Overall, I am still pleased with the model’s results for journey times, now that interchange times are incorporated in the model.  In some respects, the model is working better than Google Maps (e.g. the nuances of Paddington’s rail connections).

In the next post on this project, I will revisit the graph link analysis (e.g. PageRank) and see how the interchange data have influenced the centrality of stations.

Modelling longevity in the UK using Markov chains

Introduction

There are many pressing policy problems that could be affected by the dynamics of ageing in the UK.  In particular, improvements in life expectancy raise all sorts of questions, e.g. around pensions and social care.

I sought to model the dynamics of ageing in the UK by using mortality rate data from the ONS, in order to apply a Python library (QuantEcon) and see how much insight a basic model can produce.

In terms of insight, I will not pretend to know whether more resources are needed for pensions or social care, as I am sure that an army of government boffins is on the case.


Mortality datasets are often used in risk modelling for insurance.  Markov chains are often used to model the dynamics of ageing.  These are the two basic states that a person could be in:

  • Alive
  • Dead

The ‘dead’ state is an absorbing state, as once that state is reached, it is never left.  I will leave the actuaries to account for the supernatural in their models!

The ONS has data on age-specific mortality rates in a recent statistical bulletin (see table in image).

Data on mortality rates (ONS)
Data on age-specific mortality rates (ONS)

From that level of data, we can model the age of death for a person, by creating discrete random variables governed by Markov chains.  Once we have a way to simulate longevity for one person, then a series of simulations could form a distribution for a population.

Model

See the whole Jupyter notebook for this model as a Github gist in the frame below:

The QuantEcon module (imported as ‘qe’) has the handy DiscreteRV class and capabilities specific to Markov chains.  The DiscreteRV class is all that is needed for this model from QuantEcon.

From the data above, these are discrete random variables for the age-specific mortality rates (e.g. ‘rv65’ for 65-69 group; ‘rv90’ for 90+ group):


# Set up discrete variables for each Markov chain
# STATE '0': alive this year
# STATE '1': dies this year
rv65 = qe.DiscreteRV([0.9881, 0.0119]) # 65-69
rv70 = qe.DiscreteRV([0.9807, 0.0193]) # 70-74
rv75 = qe.DiscreteRV([0.9664, 0.0336]) # (etc.)
rv80 = qe.DiscreteRV([0.9409, 0.0591])
rv85 = qe.DiscreteRV([0.8922, 0.1078])
rv90 = qe.DiscreteRV([0.7851, 0.2149]) # 90+

# Set up accumulator variable to track person's age
# Due to nature of model, restrict starting age parameter to range [65, 90]
if age < 65:
    counter = 65
elif age > 90:
    counter = 90
else:
    counter = age

The principle behind this setup is to do a series of draws until state 1 (death) is reached: i.e. the series of draws will be like [0, 0, … , 1].  That is how the Markov process is being simulated.
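For example, a single pass through the 65-69 chain is just a handful of draws (the output below is illustrative, as the draws are random):

rv65.draw(5)  # e.g. array([0, 0, 0, 1, 0]): a '1' in the 4th draw means death at age 68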

The ‘simulation’ function will have a ‘starting age’, which will go through draws of random variables until death is reached.  Through that, it will be possible to consider the expected longevity for any given age (e.g. from age 65 or from age 100).

Let’s set out some assumptions for the model:

  • A series of Markov chains represent mortality dynamics.
    • State set (for each year): {alive, dead}
    • Mortality rates are across 5-year intervals (excl. 90+), sourced from recent ONS data
    • Model uses draws of (independent) random variables
  • Age of death is restricted to range [65, 109] for simplicity.
  • Model does not account for gender.

The upper limit on age of death is not too problematic for the model, given the mortality rates.  Only a few dozen people will reach the upper limit over 100,000 simulations.  A uniform mortality rate for every year after 90 is almost certainly unrealistic, as will be discussed later, so simulations past 100 are restricted by the data.

In the simulation function, following the code outlined above, the sequence across the random variables can be done through a series of ‘while loops’ (one per variable).  Let’s look at the first two loops:


# Sequence of while loops
while counter < 70:
    draws = 5 - (counter - 65)  # 1-5 draws
    states_65 = rv65.draw(draws)  # array for draws
    if 1 in states_65:
        # sum: index of first '1' obs + lower limit
        # e.g. 4 + 65 if the '1' state is in the 5th draw
        return np.argmax(states_65 > 0) + counter
    else:
        # No '1' state observed, so move to next chain (70-74)
        counter = 70  # move to next loop
while counter < 75:
    draws = 5 - (counter - 70)
    states_70 = rv70.draw(draws)
    if 1 in states_70:
        return np.argmax(states_70 > 0) + counter
    else:
        counter = 75
# ...

There seems to be a lot of repetition of code, which is not ideal, but it is not time-consuming to do six ‘while loops’.  I spent some time trying to automate this code more generally but could not think of an elegant approach.

The idea in this code is to store the results of the draws from the random variable in an array (using the draw method from DiscreteRV).  If ‘1’ is observed, then the loop is ended and the function returns one scalar: the age of death.  If all the draws are ‘0’, then the loop is exited and we move on to the next loop, where the idea repeats in a similar fashion.

The last loop needs to make sure that every simulation returns a value (everyone dies).  If entering the loop, the counter will have value 90.  Here is the final loop:


while True:
    # Person has reached age 90, so use the final chain
    draws = 110 - counter  # treat 109 as maximum age
    states_fin = rv90.draw(draws)
    if 1 in states_fin:
        return np.argmax(states_fin > 0) + counter
    # No '1' observed in the remaining draws, so cap the age of death at 109
    return 109

Thus the ‘simulation’ function is complete and returns an integer in the range [age, 109], where the lower bound is just the argument of the function (‘age’).

Now, let’s create a function to store a distribution:

def distribution_gen(age=65, n=10000):
    """
    Generate a distribution of ages of death.
    Returns a list of size n, where each element is an age.
    
    Args:
        age (int): starting age, restricted to range [65, 90]
        n (int): number of simulations to run
    
    See 'simulation' function for more detail on the model.
    
    """
    distribution = [] # store each observation in list
    for i in range(n):
        obs = simulation(age)
        distribution.append(obs)
    return distribution

Longevity for people aged 65

As an example of the model in action, let’s model the longevity of people currently aged 65 (bearing in mind the law of large numbers).


large_dist = distribution_gen(65, 10000)

The function runtime was less than one second, which is a pleasant surprise.  Here is a histogram of the distribution, plotted using Matplotlib:

Histogram: longevity of people aged 65
Histogram for age=65
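The plotting code is not shown in the post itself; here is a minimal sketch of how such a histogram could be produced (the bin choice is mine):

import matplotlib.pyplot as plt

plt.hist(large_dist, bins=range(65, 111))  # one bin per year of age
plt.xlabel('Age of death')
plt.ylabel('Frequency')
plt.title('Simulated longevity of people currently aged 65 (n = 10,000)')
plt.show()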

The mean age of death is 84 for the distribution.

As the distribution’s values are in a list, a dictionary can be created to store the frequencies:


from collections import Counter
# Store distribution's information in dictionary
freq = Counter(large_dist)

Suppose there are 650,000 people aged 65 in the UK today, which is a fair approximation based on ONS population data.  How many of them are expected to become centenarians (aged 100+) by the model?

This is where the dictionary comes in handy, to craft a function that can perform calculations for this distribution, given any age:


def pop_greater_equal(age=100, pop_size=650):
    # pop_size is in thousands of people (650 -> 650,000 people aged 65)
    # Count the simulated individuals who die at or after the given age,
    # then scale that count up to the population size
    dict_sub = sum([v for k, v in freq.items() if k >= age])
    return dict_sub * 1000 * (pop_size / sum(freq.values()))
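The centenarian estimate quoted below comes from a call along these lines (the exact figure varies between generated distributions):

pop_greater_equal(100, 650)  # e.g. ~15,400 in one iteration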

One iteration of my model estimates that 15,400 of them will become centenarians.  That will amount to over 40 birthday cards for the monarch to send every day.

Note: the actual numbers vary for each generated distribution.  I chose n=10,000 as a reasonable number to draw, given the size of the population.  The distribution’s key summary stats (e.g. mean) differ by 0.1 years or so depending on the iteration, with n set at that size.

How does the model compare to actual statistics?

Consider the context of the model, by looking at ONS statistics (2012) on mortality and population:

  • “A man aged 65 in the UK could expect to live for 18.2 years, […] and a 65-year-old woman, for 20.7 years”
  • “In 2012 there were 13,350 centenarians living in Britain”
  • “Out of the 13,350 centenarians living in the UK in 2012, 660 were aged 105 years and older”

The results from the model seem intuitive:

  • The mean age of death for 65-year-olds is approximately 84, i.e. they can expect to live for almost 19 more years.  This is fairly close to the ONS stats above.
  • The estimate of the number of centenarians is close to the actual number today.  However, the model is a projection of today’s 65-year-olds, so that is not the right comparison.  An ONS calculator estimates the chances of 65-year-olds reaching 100 to be 7% and 11% for men and women respectively.  Taking the mid-range, that would give around 50,000 according to the ONS, compared with my figure.  The calculator’s mortality data point to higher longevity than my (ONS-sourced) data, which could explain some of the discrepancy with my model.
  • The model’s estimate of people aged 105+ (3,700) is quite far from the actual number today.  The mortality rate of a 100-year-old is assumed to be the same as that of a 90-year-old, when in reality it is likely to be much higher, so the model’s estimate is likely to be an overestimate.  This is a limitation of the dataset.  I cannot find an ONS calculator online that looks at the probability of reaching an age beyond 100, unfortunately.

For what is a basic model, given its reliance on one small table, it seems to estimate the actual UK stats reasonably well and the probability distribution is straightforward to modify.

It would be interesting to see if more parameters can be used in the model (e.g. gender).  Sourcing of suitable datasets would be a challenge, but the next issue is how that would be scaled up: e.g. can the loops be replaced by recursive processes or object-oriented programming?