Modelling longevity in the UK using Markov chains

Introduction

There are many pressing policy problems that are affected by the dynamics of ageing in the UK.  In particular, continued improvements in life expectancy raise all sorts of questions, from pension costs to social care provision.

I sought to model the dynamics of ageing in the UK using mortality rate data from the ONS, applying a Python library (QuantEcon) to see how much insight a basic model can produce.

In terms of insight, I will not pretend to know whether more resources are needed for pensions or social care, as I am sure that an army of government boffins is on the case.


Mortality datasets are often used in risk modelling for insurance, and Markov chains are often used to model the dynamics of ageing.  There are two basic states that a person can be in:

  • Alive
  • Dead

The ‘dead’ state is an absorbing state, as once that state is reached, it is never left.  I will leave the actuaries to account for the supernatural in their models!
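
For illustration, the two-state chain for a single age band can be written down explicitly.  Here is a minimal sketch (my own, not part of the model code below), using QuantEcon’s MarkovChain class and, as an example, the 65-69 mortality rate from the ONS table below:

import quantecon as qe

# Illustration only: a two-state Markov chain for one age band,
# where p is that band's annual mortality rate
p = 0.0119                       # e.g. the 65-69 rate used later
P = [[1 - p, p],                 # alive -> {alive, dead}
     [0.0,   1.0]]               # dead is absorbing: it is never left
mc = qe.MarkovChain(P, state_values=('alive', 'dead'))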

The ONS has data on age-specific mortality rates in a recent statistical bulletin (see table in image).

Data on age-specific mortality rates (ONS)

From that level of data, we can model a person’s age of death by creating discrete random variables that drive the Markov chains.  Once we have a way to simulate longevity for one person, a series of simulations can form a distribution for a population.

Model

See the whole Jupyter notebook for this model as a Github gist in the frame below:

The QuantEcon module (imported as ‘qe’) has the handy DiscreteRV class, along with capabilities specific to Markov chains.  The DiscreteRV class is all that this model needs from QuantEcon.

From the data above, these are discrete random variables for the age-specific mortality rates (e.g. ‘rv65’ for 65-69 group; ‘rv90’ for 90+ group):


import numpy as np
import quantecon as qe

# Set up discrete variables for each Markov chain
# STATE '0': alive this year
# STATE '1': dies this year
rv65 = qe.DiscreteRV([0.9881, 0.0119]) # 65-69
rv70 = qe.DiscreteRV([0.9807, 0.0193]) # 70-74
rv75 = qe.DiscreteRV([0.9664, 0.0336]) # (etc.)
rv80 = qe.DiscreteRV([0.9409, 0.0591])
rv85 = qe.DiscreteRV([0.8922, 0.1078])
rv90 = qe.DiscreteRV([0.7851, 0.2149]) # 90+

# Set up accumulator variable to track person's age
# Due to nature of model, restrict starting age parameter to range [65, 90]
if age < 65:
    counter = 65
elif age > 90:
    counter = 90
else:
    counter = age

The principle behind this setup is to do a series of draws until state 1 (death) is reached: i.e. the series of draws will be like [0, 0, … , 1].  That is how the Markov process is being simulated.

The ‘simulation’ function will have a ‘starting age’, which will go through draws of random variables until death is reached.  Through that, it will be possible to consider the expected longevity for any given age (e.g. from age 65 or from age 100).

Let’s set out some assumptions for the model:

  • A series of Markov chains represent mortality dynamics.
    • State set (for each year): {alive, dead}
    • Mortality rates are across 5-year intervals (excl. 90+), sourced from recent ONS data
    • Model uses draws of (independent) random variables
  • Age of death is restricted to range [65, 109] for simplicity.
  • Model does not account for gender.

The upper limit on age of death is not too problematic for the model, given the mortality rates.  Only a few dozen people will reach the upper limit over 100,000 simulations.  A uniform mortality rate for every year after 90 is almost certainly unrealistic, as will be discussed later, so simulations past 100 are restricted by the data.

In the simulation function, following the code outlined above, the sequence across the random variables can be done through a series of ‘while loops’ (one per variable).  Let’s look at the first two loops:


# Sequence of while loops
while counter < 70:
    draws = 5 - (counter - 65) # 1-5 draws
    states_65 = rv65.draw(draws) # array for draws
    if 1 in states_65:
        # sum: index of first '1' obs + lower limit
        # e.g. 4 + 65 sum if '1' state in 5th draw
        return np.argmax(states_65 > 0) + counter
    else:
        # No '1' state observed, so move to next chain (70-74)
        counter = 70 # move to next loop
while counter < 75:
    draws = 5 - (counter - 70)
    states_70 = rv70.draw(draws)
    if 1 in states_70:
        return np.argmax(states_70 > 0) + counter
    else:
        counter = 75
# ...

There seems to be a lot of repetition of code, which is not ideal, but it is not time-consuming to do six ‘while loops’.  I spent some time trying to automate this code more generally but could not think of an elegant approach.
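
For what it is worth, one possible way to reduce the repetition (a sketch of my own, not the notebook’s code) is to drive a single loop from a list of (band lower bound, random variable) pairs inside the ‘simulation’ function:

# Sketch: one loop over the 65-89 bands instead of five separate while loops
bands = [(65, rv65), (70, rv70), (75, rv75), (80, rv80), (85, rv85)]
for lower, rv in bands:
    if counter < lower + 5:
        draws = (lower + 5) - counter          # 1-5 draws in this band
        states = rv.draw(draws)
        if 1 in states:
            return np.argmax(states > 0) + counter
        counter = lower + 5                    # survived this band; move on
# fall through to the final 90+ chain, handled separately below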

The idea in this code is to store the results of the draws from the random variable in an array (using the draw method from DiscreteRV).  If a ‘1’ is observed, the loop ends and the function returns one scalar: the age of death.  If all the draws are ‘0’, the loop is exited and we move on to the next loop, where the idea repeats in a similar fashion.

The last loop needs to make sure that every simulation returns a value (everyone dies).  On entering this loop, the counter will have the value 90.  Here is the final loop:


while True:
    # Person has reached age 90, so use final chain
    draws = 110 - counter # treat 109 as maximum age
    states_fin = rv90.draw(draws)
    # Final: return the age of death, capping at the maximum age of 109
    # if no '1' appears in the remaining draws
    if 1 in states_fin:
        return np.argmax(states_fin > 0) + counter
    return 109

Thus the ‘simulation’ function is complete and returns an integer in the range [age, 109], where the lower bound is just the argument of the function (‘age’).

Now, let’s create a function to store a distribution:

def distribution_gen(age=65, n=10000):
    """
    Generate a distribution of ages of death.
    Returns a list of size n, where each element is an age.
    
    Args:
        age (int): starting age in years, restricted to range [65, 90]
        n (int): number of simulations to run
    
    See 'simulation' function for more detail on the model.
    
    """
    distribution = [] # store each observation in list
    for i in range(n):
        obs = simulation(age)
        distribution.append(obs)
    return distribution

Longevity for people aged 65

As an example of the model in action, let’s model the longevity of people currently aged 65 (bearing in mind the law of large numbers).


large_dist = distribution_gen(65, 10000)

The function runtime was less than one second, which is a pleasant surprise.  Here is a histogram of the distribution, plotted using Matplotlib:

Histogram: longevity of people aged 65 (age=65)

The mean age of death is 84 for the distribution.
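
For reference, here is a minimal sketch of how the histogram and the mean can be produced (the notebook’s plot has more styling):

import numpy as np
import matplotlib.pyplot as plt

# Histogram with one bin per year of age
plt.hist(large_dist, bins=list(range(65, 111)))
plt.xlabel('Age of death')
plt.ylabel('Frequency')
plt.show()

# Mean age of death (~84 in this run)
print(np.mean(large_dist))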

As the distribution’s values are in a list, a dictionary can be created to store the frequencies:


from collections import Counter
# Store distribution's information in dictionary
freq = Counter(large_dist)

Suppose there are 650,000 people aged 65 in the UK today, which is a fair approximation based on ONS population data.  How many of them are expected to become centenarians (aged 100+) by the model?

This is where the dictionary comes in handy, to craft a function that can perform calculations for this distribution, given any age:


def pop_greater_equal(age=100, pop_size=650):
    # Count simulated ages of death at or above 'age', then scale the
    # frequencies up to a population of pop_size thousand people
    dict_sub = sum([v for k, v in freq.items() if k >= age])
    return dict_sub * 1000 * (pop_size / sum(freq.values()))
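
Calling it with the defaults gives the centenarian estimate quoted below:

# Estimated number of today's 650,000 65-year-olds whose simulated age of
# death is 100 or more (the exact figure varies between runs)
pop_greater_equal(age=100, pop_size=650)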

One iteration of my model estimates that 15,400 of them will become centenarians.  That will amount to over 40 birthday cards for the monarch to send every day.

Note: the actual numbers vary for each generated distribution.  I chose n=10,000 as a reasonable number to draw, given the size of the population.  The distribution’s key summary stats (e.g. mean) differ by 0.1 years or so depending on the iteration, with n set at that size.

How does the model compare to actual statistics?

Consider the context of the model, by looking at ONS statistics (2012) on mortality and population:

  • “A man aged 65 in the UK could expect to live for 18.2 years, […] and a 65-year-old woman, for 20.7 years”
  • “In 2012 there were 13,350 centenarians living in Britain”
  • “Out of the 13,350 centenarians living in the UK in 2012, 660 were aged 105 years and older”

The results from the model seem intuitive:

  • The mean age of death for 65-year-olds is approximately 84, i.e. expect to live for almost 19 more years. This is fairly close to the ONS stats above.
  • The estimate of the number of centenarians is close to the actual number today.  However, the model is a projection of today’s 65-year-olds, so that is not the right comparison.  An ONS calculator estimates the chances of a 65-year-old reaching 100 to be 7% for men and 11% for women.  Taking the mid-range, that would give around 50,000 according to the ONS, compared to my figure.  The calculator’s mortality data point to higher longevity than my (ONS-sourced) data, so that could explain some of the discrepancy from my model.
  • The model’s estimate of people aged 105+ (3,700) is quite far from the actual number today.  The mortality rate of a 100-year-old is assumed to be the same as that of a 90-year-old, when in reality it is likely to be much higher, so the model’s estimate is probably an overestimate.  This is a limitation of the dataset.  Unfortunately, I cannot find an ONS calculator online that looks at the probability of reaching an age beyond 100.

For a basic model that relies on one small table, it estimates the actual UK stats reasonably well, and the probability distribution is straightforward to modify.

It would be interesting to see if more parameters can be used in the model (e.g. gender).  Sourcing of suitable datasets would be a challenge, but the next issue is how that would be scaled up: e.g. can the loops be replaced by recursive processes or object-oriented programming?

Recognising UK phone numbers through a regex

Introduction

These are some business problems that have motivated my desire to make a program focused on UK phone numbers:

  • Dealing with spreadsheets or databases that have inconsistent phone number formats
  • Scraping phone number information from a clunky PDF (e.g. a directory) or webpage

Through some experimentation with a regex, pyperclip and OpenPyXL, I eventually produced a program that recognises UK phone numbers and parses them to a consistent format.


In this post, I will focus on my regex.  There are many websites that have regular expressions for recognising UK phone numbers, but I did not find one that accounted well for features of phone numbers such as extension numbers and international call prefixes.

This will be the sequence of posts:

  1. Write the regex
  2. Create program that processes an input
  3. Export the phone numbers to a workbook and process them

Regex

Starting point

I will avoid discussing the intricacies of UK phone numbers here, so refer to Wikipedia’s article on UK phone numbers for detail on how the numbering system works.

These are features of phone numbers that will be important in a regex for my program:

  1. Access codes: e.g. 00 (international call prefix) and +44 (UK country code)
  2. Phone number without the leading 0
  3. Extension number (e.g. x1000 or ext1000)

These numbers correspond to my regex’s ‘groups’.  Of course, group 2’s information is of most interest, but the information in the other groups also needs to be processed.

I found quite a neat regex on a Stack Overflow thread (‘Match UK telephone number in any format’):


^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$

However, it does not work for my programming task as it stands, so I wanted to improve it:

  • Use groups in a regex
  • Stop digits from being excluded in some cases
  • Account for more characters (e.g. I have seen ‘.’ separator many times in phone numbers from UK companies, whose IT people are in the USA!)
  • Break up the regex so that it is easier to read and modify
Matches for the regex found online. Notice that some formats are not recognised.

I have not seen a thorough exposition of how a regex actually ‘works’ for UK phone numbers, so I will produce one here.  I hope that this post is useful in showing how a regex can work, as I certainly came to appreciate more of their nuances while writing this one.

My regex

This is my regex, with comments in the code snippet to explain what is happening for some lines:


(?:
(\(?
(?:0(?:0|11)\)?[\s-]? # almost every country's exit code, e.g. 00 or 011
\(?|\+?)44\)?[\s-]? # UK country code (+44)
(?:\(?0\)?[\s-]?)?) # leading 0 (trunk prefix)
| # one of many Boolean operators to come for the actual number
\(?0)((?:\d{5}\)?[\.\s-]?\d{4,5}) # e.g. 016977 xxxx
|
(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3})) # e.g. 01234 xxx xxx
|
(?:\d{4}\)?[\.\s-]?(?:\d{5})) # e.g. 01xxx xxxxx
|
(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4}) # e.g. 01xx xxx xxxx
|
(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}) # e.g. 020 xxxx xxxx
) # that's the largest capturing group over now
(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?) # e.g. x123, ext123


This is the one-line version, if you wish to paste it into RegExr for your own analysis:


(?:(\(?(?:0(?:0|11)\)?[\s-]?\(?|\+?)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|\(?0)((?:\d{5}\)?[\.\s-]?\d{4,5})|(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3}))|(?:\d{4}\)?[\.\s-]?(?:\d{5}))|(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4})|(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}))(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?)
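
As a usage sketch in Python (assuming the commented pattern above is stored as the string phone_regex_verbose — a name of my choosing), the pattern can be compiled with the re.VERBOSE flag so that the comments and line breaks are ignored:

import re

# Compile the commented, multi-line pattern; re.VERBOSE ignores the
# unescaped whitespace and '#' comments in the pattern string
phone_pattern = re.compile(phone_regex_verbose, re.VERBOSE)

text = "Call 020 7946 0000 ext. 123 or +44 (0)1234 567 890."
for match in phone_pattern.finditer(text):
    print(match.group(0))   # the full phone number as written
    print(match.groups())   # (access code, main number, extension) - may contain None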

What I found is that the regex needs to be in a precise order, so that it has the best chance of recognising any phone number given to it.  The sequence of alternatives (…|…|…|…|…) needs to have the phone number formats ‘cascade’, as it were: start with the longest area codes and longest numbers first.  For example, 01xxx xxx xxx comes before 01xxx xxxxx as the former is the longer number (the regex I found online had this the other way round).  The reason is that alternation is tried from left to right and the engine keeps the first alternative that matches, so a shorter format listed first can ‘steal’ the start of a longer number.
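
A quick illustration of that left-to-right behaviour of alternation in Python:

import re

# Alternation is tried from left to right; the first alternative that can
# match at a position wins, even if a later alternative would match more
print(re.findall(r"abc|abcd", "abcd"))   # ['abc']  - shorter alternative listed first
print(re.findall(r"abcd|abc", "abcd"))   # ['abcd'] - longer alternative listed first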

Matches for my phone regex (phone numbers surrounded by ‘1’ characters). Notice that there are three capturing groups.

I am pleased with my regex’s performance with phone numbers and it is more comprehensive in the cases it recognises than any other regex I have found online.

There are some caveats to bear in mind for the regex:

  • The regex may recognise a non-UK phone number as a valid UK phone number, e.g. phone numbers from countries such as Italy.
  • The regex may recognise a non-existent phone number as a valid UK phone number.

This is just the nature of regular expressions, as a regex is concerned with format.

I still think data validation that aims to exclude non-UK phone numbers would be a worthy programming task, but it needs to go beyond regex and join up with other software such as Excel (see future post for an implementation).  One task that would be especially interesting is to extract the area code from a phone number (e.g. 020 from a number written as 0207 xxx xxxx).

In the next post on this project, I will present a Python program that takes clipboard input and passes it through my regex.

Exploring my music collection: foobar and folders

My music collection

Listening to music is one of my hobbies and I have a large collection of CDs.  (Huh, CDs, what are those again?)  All my CDs have been meticulously ripped to my computer as lossless files and then converted to high-bitrate MP3s.  I have been ripping my music as lossless files by habit, even during the heyday of peer-to-peer file sharing and before online music stores (e.g. the iTunes Store) became popular, so that I can archive my collection.  Of course I use music streaming services such as Spotify today, but I use them more for music discovery than for listening to my favourite albums.

I wanted to know ‘more’ about my music and make my listening experience fun and personal, so I started to use foobar2000 and was attracted by the variety of its components.  Over time, I have tagged my music files with all sorts of information, such as beats-per-minute (BPM) and ‘beat intensity’.  Through all my tagging, I have been able to produce playlists for my collection and search my collection more efficiently.

Without further ado, let’s see a glimpse of my music collection.

My music collection in foobar2000 (currently playing The Killers).

I have nearly 400 releases (albums, extended plays, singles, etc.) in my collection and that amounts to almost 19 days of music.

As that glimpse shows, I like to have a nice visual element to my music library and the listening experience.  I have album art for everything in my collection and other images such as the disc of an album (where available).  If anything, I think vinyl has had a resurgence in the last 10 years because of the large artwork (eye candy) and the more physical nature of the medium, compared to CDs and digital downloads.

There are so many questions that I think could be answered by making programs for my collection.  Streaming services are making their own algorithms with their big data systems.  Here are some examples that I could explore:

  • Are there artists in my collection that tend to have songs with strong rhythms?  (Perhaps more than I might expect?)
    • I would expect genres such as house, funk, ska and electronic dance music to have songs with prominent rhythms.  In contrast, jazz numbers and ballads would not have prominent beats.
  • What songs could go into my party playlists?  e.g. BPM range of 100 to 140 and songs with a strong rhythm seem ideal.  How much music would I have left when filtering the playlist to a certain decade, e.g. 2000s?
  • Could I categorise my collection based on the tags and metadata, e.g. via K-means clustering?
  • What are the most common colours across the album artwork in my collection?  Is the black-and-white palette dominated by post-punk albums or certain artists (e.g. Joy Division, the xx)?
  • What are the longest and shortest songs in my collection, by: genre, artist, year, etc.?

Making it easier to maintain and search my collection

My library is organised into a directory with this folder structure:

[album artist]\[year] – [album]

It works well for finding files quickly, but it is very difficult to ‘zoom out’ or take a ‘macro’ view of my collection.  This is where programming can help: automating tedious tasks.  For example, if I want to do a batch rename of a selection of filenames, then a program could do that quickly and efficiently (see the sketch below).
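
As a hypothetical example of such a batch rename (renaming stray ‘cover.jpg’ files to the ‘folder.jpg’ convention used in my library — a sketch only):

import os

# Walk the library and rename 'cover.jpg' to 'folder.jpg' where no
# 'folder.jpg' already exists (library path as used later in this post)
library = r'C:\Users\Mark\Music\MP3-library'
for root, dirs, files in os.walk(library):
    if 'cover.jpg' in files and 'folder.jpg' not in files:
        os.rename(os.path.join(root, 'cover.jpg'),
                  os.path.join(root, 'folder.jpg'))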

In my album folders, I can have a lot of ‘non-music’ files:

  • folder.jpg: the album art image
  • artist.jpg: image of the artist
  • disc.png: image of CD or record, with transparency
  • LOG file that has information on my music rip, e.g. accurate rip statistics, errors with the rip

As I have two separate libraries stored on different hard drives—my lossless and MP3 libraries—it can be hard to keep track of everything.  For example, I might have added album artwork to my MP3 library but not to my lossless library.

I have produced my own program to list directory paths that have image files, as one way to make it easier to maintain my music library.

These are some things I wanted to answer through my program:

  • How many images are in my music library’s directory?
  • How much hard drive storage is being used for the images?

With about 100 lines of code, I made a program that works a treat for my collection.

Music library images program: start of output

Here is a snippet of the code for the main function:


import os
import time

def list_images_in_lib():
    # Start time
    start_time = time.time()

    # 1: check album art
    tot_f, tot_f_size = (0, 0)
    print("\n", '*' * 20, "folder.jpg", '*' * 20, "\n")
    for root, albumArtists, albums in os.walk(os.getcwd()):
        for file in albums:
            if file.endswith("folder.jpg"):
                print(os.path.join(root, file))
                tot_f += 1
                tot_f_size += os.path.getsize(os.path.join(root, file)) / 1000000
    print("\n", "Number of 'folder' images in library:", tot_f, "(%f MB)" % (tot_f_size))

    # 2: check artist art
    tot_a, tot_a_size = (0, 0)

    # ...LARGE SNIP...
    # ................
    print("\n", "Number of 'disc' images in library:", tot_d, "(%f MB)" % (tot_d_size))

    # Calculate amount of time needed to process all the images:
    end_time = time.time()
    elapsed_time = end_time - start_time
    tot_images = tot_f + tot_a + tot_d
    tot_images_size = tot_f_size + tot_a_size + tot_d_size
    print("\n", "Time to process all", tot_images, "images:",
          round(elapsed_time, 6), "seconds",
          "(%f MB)" % (tot_images_size))

The list_images_in_lib function uses Python’s os.walk() and loops to get all the information needed for my files.  Fortunately, my music library is structured in a way that can make use of the os.walk() tuple results neatly.

The working directory here is ‘C:\…\MP3-library’.  Let’s give some intuition on how the os.walk() tuple is being used to reach an album:

  1. C:\Users\Mark\Music\MP3-library\
  2. Air\
  3. {2016} – Twentyears\

That is not enough information, as the item of interest is a file (e.g. folder.jpg, disc.png) in the album folder.  Another loop is needed to process the files: ‘for file in albums: […]’.  Within that loop, accumulator patterns and print statements produce the output of interest, e.g. the stats on file size and the file paths.

Music library images program: end of output

It takes about 6 seconds to traverse my music library and return all the output.  Perhaps there could be cleverer use of string concatenation, or a way to write the code without a nested loop, but for my purposes the code works well and I would not benefit much from speed gains, even with a larger collection.

I was surprised that my average image size was anywhere near 1MB.

There is other functionality that I have included in my program but have not discussed in this post:

  • Check if an external hard drive is connected (e.g. my backup hard drive).
  • Choose music directory (local or external) to use for the program.

Tools that could be used to explore my music collection further

The foobar2000 Text Tools component can be used to gather the tags (metadata) for my music and eventually have all the music organised in a spreadsheet format.  A spreadsheet with all my songs opens up immense potential for statistical analysis and even machine learning (e.g. classification of songs).

Now that I have all my album folders as string output (when removing ‘folder.jpg’), there is also potential to make more programs for library organisation, e.g. batch renaming.

 

Where should I go next?  It would be great to see how else people have analysed their music collections.

London’s Tube network: initial link analysis

In a previous blog post, I outlined my aims to analyse London’s current Tube network.

As a recap, these are two core challenges I want to tackle:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

The previous post focused on the first challenge, whereas this post follows on from it and approaches the second challenge: link analysis.

I hope to update these posts with links to my Jupyter notebooks, once I tidy up my ‘lab’ notebooks.

 

For my data, I want to make explicit some issues that affect the analysis, due to how the dataset is set up and the assumptions in this initial graph model:

  • All interchanges have the same cost (1 minute). In reality, interchanges within a station are not equal. Green Park has very long tunnels for an interchange, whereas stations such as Finsbury Park can have an interchange on a parallel platform.
  • The walking routes and out-of-station interchanges (OSIs) have a costly interchange between the station and its street (a result of the first assumption). This should be zero.  An example path with this issue is Bayswater → Bayswater [WALK] → Queensway [WALK] → Queensway.

These two problems can be solved together, once connections between platforms within a station are added. TfL has released data on interchange times within each station, but some data cleaning is needed before these connections can be added to the model.

Before improving the dataset, let us do some initial analysis to see how the graph is working.


Link analysis

I created a new NetworkX (nx) graph object, so that the graph is in a format for link analysis.  Why a separate object?  This is just because of how the weights need to be specified for the analysis you want to perform:

  • For ‘shortest path’ work, the weights in the graph are the journey times (mins).
  • For ‘link analysis’ work, the weights in the graph are the inverses of the journey times, e.g. a 2-minute edge has weight 1/2.  This is how the graph needs to be set up for the link analysis algorithms to work as expected (see the sketch after this list).
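
A minimal sketch of that setup (assuming a journey-time graph called graph_times, as in the companion post; the notebook’s actual setup code is longer):

import networkx as nx

# Copy the edges, replacing each weight in minutes with its inverse
graph_weights = nx.Graph()
for u, v, data in graph_times.edges(data=True):
    graph_weights.add_edge(u, v, weight=1 / data['weight'])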

There are two link analysis algorithms we could do on the network fairly quickly, which can open up different interpretations:

  • PageRank
  • HITS algorithm (hubs and authorities)

Mark Dunne also used those algorithms for his graph analysis, so it would be interesting to see how my results compare.

As a brief note, the Tube is modelled as an undirected graph in my dataset, so the hubs and authorities analyses of the HITS algorithm are equivalent.  There are only a few edges in the network that are actually unidirectional (e.g. Heathrow loop; West India Quay is skipped in the DLR route towards Lewisham).

Setup

My dataset is only one CSV file and I loaded it into a Pandas dataframe.  The graph object for the link analysis will be called ‘graph_weights’.  I’ll leave the code for setting up these objects in the Jupyter notebook, as it’s fairly long and it will probably be simplified in a v2 project when the within-station interchanges are added to my dataset.

With the graph object set up, here is the code for setting up a dataframe of PageRank values for each node:


pagerank = nx.pagerank_numpy(graph_weights, weight='weight')
pagerank = pd.DataFrame.from_dict(pagerank, orient='index').reset_index()
pagerank.columns = ['node', 'pagerank']

Now, for completeness, here is the code for the HITS algorithm (hubs and authorities):


hits = nx.hits_scipy(graph_weights, max_iter=2500)
# Hub values
hits_hub = hits[0] # Get the dictionary out of the tuple
hits_hub = pd.DataFrame.from_dict(hits_hub, orient='index').reset_index()
hits_hub.columns = ['node', 'hub']
# Authority values
hits_auth = hits[1] # Get the dictionary out of the tuple
hits_auth = pd.DataFrame.from_dict(hits_auth, orient='index').reset_index()
hits_auth.columns = ['node', 'authority']
# Show hub and authority values
hits_all = pd.merge(hits_hub, hits_auth, on='node')

For some reason, I needed the max_iter argument to have a value (e.g. 2500) much higher than the default.

Analysis

Here is my analysis, with the top 20 nodes shown for each algorithm (PageRank and hub values are shown).  To be explicit: due to how my dataset is structured, a node that is just the station name is akin to the station’s ticket hall, and nodes that have the station line in square brackets are akin to station platforms.  I haven’t aggregated everything by station yet, as it is interesting to see how the analysis works with the nodes arranged like this for now.
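
For reference, the top-20 views can be obtained by sorting the dataframes set up earlier, e.g.:

# Top 20 nodes by PageRank and by hub value
top_pagerank = pagerank.sort_values('pagerank', ascending=False).head(20)
top_hubs = hits_all.sort_values('hub', ascending=False).head(20)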

The top 20 nodes by algorithm (PageRank and HITS)

The results for PageRank seem fairly intuitive:

  • Major stations have a high rank and tend to have many different services (multiple Underground, Overground or DLR services): e.g. Liverpool Street, Bank, Waterloo, Stratford, Willesden Junction.
  • Fairly small stations such as Euston Square and Wood Lane are ranked quite high. In these cases, the edges for OSIs and short walking routes could explain why they have high rankings.
  • The Poplar [DLR] node is the highest-ranked ‘platform’ (not shown in the image, but it is in the top 25, with value 0.001462). This seems reasonable, as every DLR service passes through the station; it may be that more journeys pass through Poplar than start or end there.  Earl’s Court [District] is next in the ranking and the intuition is similar.

The HITS results are much more striking in contrast:

  • Liverpool Street is by far the most prominent station in this HITS analysis.  All of its platforms and the ticket hall score higher than other nodes.
  • The main hubs in my dataset are all in the City of London: Liverpool Street, Moorgate, Bank, Barbican and Farringdon are all prominent.

It is interesting that Mark Dunne’s analysis has the main hubs in the West End: in particular, Oxford Circus, Green Park, Piccadilly Circus, etc.  I think this is largely because my dataset has all the current London Overground routes, including the Lea Valley Lines. This has brought many more routes (via Hackney) onto TfL’s map: it could be likened to a north-eastern Metro-land.

If this graph were extended to include National Rail routes from the London commuter belt (especially ones in TfL’s fare zones), then I would expect that stations such as Clapham Junction, Victoria, London Bridge and King’s Cross would be more prominent in the network. Inclusion of these routes is problematic, as they operate a variety of services (e.g. fast trains vs all-stations trains), so assigning values to edges is difficult.

Retrospective

Let’s wrap up this project in its current form and consider whether it is worthwhile to extend it any further.

  • What has worked well so far?
    • The initial analysis appears to represent the network well. The results seem intuitive.
    • The Excel workbook with all the data has been quick to edit when significant changes needed to be made.
    • Some functionality has been straightforward to implement, e.g. shortest path function.
  • What can be improved?
    • More interchange data can be added to the graph.
    • The interchange edges can be less clunky.
    • Data visualisation could add much more value to the analysis.

TfL have produced data on interchange times within stations and I seek to incorporate those in my dataset.  That would add a few hundred connections to my network: edges from platforms to ticket halls; edges between platforms in stations.  I have done some initial data cleaning on these connections, but the most time-consuming work will be matching TfL’s station names to my dataset’s station naming convention.

Traversing London’s Tube network

The TfL Tube map is becoming increasingly dense, as more and more lines and stations are added to it. (By the time I achieve anything with this graph analysis, there may be more tweaking to do because of Crossrail!)

I have tried to find a graph online that is set up for basic analysis, which accounts for:

  • Interchanges within a station
  • Interchanges between stations, i.e. out-of-station interchanges (OSIs) and other short walking routes.
  • The latest TfL network, especially all the current London Overground routes.

There are interactive tools (e.g. Oliver O’Brien’s Tube Creature and Google Maps) that cover those three factors, but I have not found a workable dataset that can be used for my own graph analysis.

Mark Dunne’s Tube graph project is the best graph analysis I have come across, which used data from Nicola Greco. However, the data do not cover the three elements specified above.


These are two core challenges I want to tackle:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

I have created my own dataset to include all the Overground routes and added my own interchange edges. There are about 120 out-of-station interchanges (OSIs) and walking routes in my dataset (fairly even split between the two).

For my dataset, I have also designed an ID system that includes information about all the stations: e.g. Piccadilly Circus has ID 71034, where ’10’ shows that it is in Zone 1. This numbering system has been useful for producing pivot tables and bringing in an element of data verification.

Tube network connections in Excel

I used the NetworkX (as nx) library to set up the graph.
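
The setup is roughly along these lines (a sketch with hypothetical file and column names; the actual workbook uses its own headers):

import pandas as pd
import networkx as nx

# Load the edge list and build a weighted, undirected graph of journey times
edges = pd.read_csv('tube_connections.csv')
graph_times = nx.from_pandas_edgelist(edges, source='station_from',
                                      target='station_to', edge_attr='weight')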

Journey times

A function

Here are the neighbours for Queen’s Park station: “Queen’s Park [Bakerloo]”, “Queen’s Park [WDCL]”, “Queen’s Park [WALK]”.

In other words, Queen’s Park station has its Bakerloo and Watford DC line platforms as neighbours, in addition to the ‘WALK’ platform, which arises from how my dataset is constructed to account for short walking routes.
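
A neighbour query along these lines produces that list (using the graph object, called graph_times here as in the setup sketch above):

list(graph_times.neighbors("Queen's Park"))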

Now let’s create a function to obtain the fastest route.


def fastest_route(start, end):
    """
    Print the fastest path between the 'start' and 'end' points.
    Each station and interchange is printed, along with the journey time.

    Tip: use "" when calling the function, as escape characters may be needed with ''.
    """
    journey_path = nx.shortest_path(graph_times, start, end, weight='weight')
    journey_time = nx.shortest_path_length(graph_times, start, end, weight='weight')
    print('\nJOURNEY:', *journey_path, sep='\n\t')
    print('\nJOURNEY TIME:', journey_time, 'minutes')

Let’s try some examples with the function and see how it is working.


fastest_route("Queen's Park", "Brondesbury Park")

JOURNEY:

Queen’s Park
Queen’s Park [WALK]
Brondesbury Park [WALK]
Brondesbury Park

JOURNEY TIME: 12.0 minutes

Queen’s Park to Brondesbury Park is not an OSI, despite the two stations being only 0.5 miles apart on the same road (Salisbury Road) and sitting on different Overground lines.  On TfL’s map, people unfamiliar with the area might think that an interchange at Willesden Junction would be a faster journey.

Note: The distance between each station and its ‘walk’ node is a 1-minute journey due to the uniform assumption applied to all interchanges in the graph design.

Queen’s Park and Brondesbury Park stations are only half a mile apart. (Image source: Google Maps)

West Hampstead to West Ruislip

Let’s look at an interesting case: a journey from the West Hampstead area to West Ruislip station.

There are three different routes that take a similar amount of time (approx. 50 minutes, +/- a few minutes):

  • Take Jubilee line to Bond Street, then take the Central line direct to West Ruislip.
  • Get on the Metropolitan line (Finchley Road), leave at Ickenham and then walk from there to West Ruislip. That walking route is an OSI on the network.
  • Take Overground service to Shepherd’s Bush, then take the Central line direct to West Ruislip.

West Hampstead’s Underground, Overground and Thameslink stations are all separate stations.  These are the fastest routes based on my current graph:

  • West Hampstead Overground → West Ruislip: 50.5 mins, with interchange at Shepherd’s Bush.
  • West Hampstead Underground → West Ruislip: 48 mins, with interchange at Finchley Road and walk from Ickenham to destination.

If more data on actual interchange times were put in the dataset, then perhaps the shortest paths would change.  The Overground and Underground stations are less than 2.5 minutes apart in reality.

Thoughts so far

I was pleased with the results of my fastest route function and moved on to some graph analysis: I will show the findings in the next blog post.

The key thing to change in the data pipeline is to add the actual interchange times between stations’ platforms, in order to give more accurate journey times.

Note: I hope to publish the Jupyter notebook of this project later on and am trying not to dwell too much on the actual code in these blog posts.