London’s Tube network: adding interchange data


Modelling and analysing London’s Tube network has been an interesting project, as it has let me build my own data pipeline from multiple data sources.  I have since refined the model and discuss the changes in this post.

These are the two challenges I have sought to tackle so far in this project:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

As presented in my previous posts on this project, I created a function that displays the shortest path between two stations and did some initial link analysis to identify the most important stations in the network.  My graph has all the current London Overground routes (e.g. Lea Valley Lines), whereas many network analyses online are based on older data.

What elements can be improved?

  • More interchange data can be added to the graph.
  • The interchange edges can be less clunky (especially the OSI [out-of-station interchange] and walking routes between stations).
  • Data visualisation could add much more value to the analysis.

Improving the graph data

The first two problems can be addressed together, by adding edges that represent distances from:

  • Platform to street level
  • Platform to platform (interchange)

In the previous model, I had to make an assumption about interchange times: in essence, they were all treated as identical due to lack of data.  However, interchange times vary across the network.  An interchange could take as long as 15 minutes or, at the lower end, be a hop to a parallel platform.  King’s Cross St. Pancras is an example of a large station with many tunnels.

I have incorporated TfL’s data on interchange times in my model, which has involved adding over 600 edges.  The graph has approximately twice the number of edges as a consequence.
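
As a rough illustration of what these extra edges do, here is a minimal sketch in plain Python (standard library only, rather than the NetworkX graph used in this project).  The station names, the times and the helper names add_edge and fastest_time are all placeholders invented for the example; they are not values from TfL’s data.

```python
import heapq
from collections import defaultdict


def add_edge(graph, a, b, minutes):
    """Add an undirected edge weighted by travel time in minutes."""
    graph[a][b] = minutes
    graph[b][a] = minutes


def fastest_time(graph, start, end):
    """Dijkstra's algorithm: return the shortest total time, or None if unreachable."""
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        time, node = heapq.heappop(queue)
        if node == end:
            return time
        if time > best.get(node, float("inf")):
            continue  # stale queue entry; a faster way to this node was already found
        for neighbour, minutes in graph[node].items():
            candidate = time + minutes
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None


graph = defaultdict(dict)
# Train edge between two platforms (placeholder time)
add_edge(graph, "Station A [Line 1]", "Station B [Line 1]", 2.0)
# Platform to street level (placeholder times)
add_edge(graph, "Station A", "Station A [Line 1]", 3.5)
add_edge(graph, "Station B", "Station B [Line 1]", 1.5)
# Platform to platform interchange (placeholder time)
add_edge(graph, "Station B [Line 1]", "Station B [Line 2]", 4.0)

street_to_street = fastest_time(graph, "Station A", "Station B")
print(street_to_street, "minutes")
```

With station-specific values on the platform-to-street edges, the street-to-street time is no longer just the train time plus a fixed allowance.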

Some data are missing, so I had to estimate the edge values at some stations, e.g. Clapham North and most of the Lea Valley Lines stations.

[Figure: Tube network connections in Excel, including interchange edges (e.g. distances between platforms in a station)]

Fastest route

Google Maps is not great for calculating short journeys between stations, especially journeys involving a large National Rail station.

This is probably my best example of a calculation by Google Maps that seems ‘off’.

Suppose you arrive in Paddington station from a National Rail service.  How long would it take to get to Notting Hill Gate station (street level)?

Google Maps says 5 minutes:

  • Paddington station concourse to Paddington’s District/Circle Line platforms: 2 minutes
  • Two stops to Notting Hill Gate: 3 minutes

This journey would not take 5 minutes.  It would take around 5 minutes just to walk from the Paddington concourse, via the Underground entrance on Praed Street, to the Circle/District platforms.

Google Maps does not appear to use much data on interchange times for Tube stations.  I am surprised, as it does use timetable data of some sort.  In fact, it even lists Notting Hill Gate station incorrectly as ‘Notting Hill’.

Using my fastest_route function (code shown in previous post), the journey is estimated to take 11 minutes, not 5 minutes:

# Google Maps has Paddington concourse -> Notting Hill Gate as 5min journey
fastest_route("Paddington", "Notting Hill Gate [Circle]") # start: National Rail
fastest_route("Paddington", "Notting Hill Gate")
fastest_route("Paddington (H&C)", "Notting Hill Gate") # start: north end of Paddington

The middle case, of most interest, returns 11 minutes:


Paddington (Bakerloo)
Paddington (Bakerloo) [Circle]
Bayswater [Circle]
Notting Hill Gate [Circle]
Notting Hill Gate

JOURNEY TIME: 11.0 minutes

The first and last journeys have journey times of 9.0 and 14.0 minutes respectively.

West Hampstead to West Ruislip (revisited)

An interesting journey discussed in a previous post was: West Hampstead area to West Ruislip.  There are three separate stations in West Hampstead (Underground, Overground and Thameslink) so that opens up three different routes to West Ruislip, which take approx. 50 minutes.  The inclusion of interchange data in the data pipeline would give a better idea of the actual ‘fastest’ route.

With the interchange data, the fastest route from West Hampstead Overground to West Ruislip takes 49.0 minutes.  The new model is suggesting a walk to the Underground station, instead of taking the Overground to Shepherd’s Bush and taking the Central line from there:


West Hampstead
West Hampstead Underground
West Hampstead Underground [Jubilee]
Finchley Road [Jubilee]
Finchley Road [Metropolitan]
Wembley Park [Metropolitan]
Preston Road [Metropolitan]
Northwick Park [Metropolitan]
Harrow-on-the-Hill [Metropolitan]
West Harrow [Metropolitan]
Rayners Lane [Metropolitan]
Eastcote [Metropolitan]
Ruislip Manor [Metropolitan]
Ruislip [Metropolitan]
Ickenham [Metropolitan]
West Ruislip

JOURNEY TIME: 49.0 minutes

The route via the Jubilee line and Central line is not far off on journey time though.  It takes 47.0 minutes from Swiss Cottage’s Jubilee line platform to West Ruislip. In total, the journey from West Hampstead Underground to West Ruislip via Jubilee and Central lines is 51.0 minutes.

Journeys that are not working so well

There are some quirks in the Tube network that are hard to model, so an undirected graph will not give truly accurate times for some journeys.  These are examples:

  • Interchanges between platforms on the same line are difficult to model: e.g. Euston’s two Northern line branches; Edgware Road’s two Circle line branches; interchanges such as Harrow-on-the-Hill and Leytonstone.
  • The London Overground has different services – some involve interchanges, others may not.  There are services that do a circle starting and ending at Clapham Junction, but other services do not do an orbit.  In my current model, a journey from West Hampstead to Shepherd’s Bush involves an interchange at Willesden Junction.  However, a service doing the Clapham Junction loop would not require that interchange.
  • Some Underground edges have different services, e.g. trains often start at Mill Hill East for commuting, but during the day trains do not tend to terminate there.  A journey to that station can involve a 15-minute interchange at Finchley Central!  There are some fast trains on the Metropolitan line at peak times as another example.

As can be seen, there are a lot of complexities in the Tube network, so a graph cannot model all of them successfully.

Consider some examples in action:

fastest_route("Paddington (H&C) [Circle]", "Paddington (Bakerloo) [Circle]")
# The two Circle services at Edgware Rd are at different stations - this journey is not right


Paddington (H&C) [Circle]
Edgware Road (Circle) [Circle]
Paddington (Bakerloo) [Circle]

JOURNEY TIME: 6.0 minutes

The time between the two Paddington stations’ Circle line platforms is not 6 minutes.  The problem is the Edgware Road (Circle) [Circle] node.  There needs to be an interchange between the different Circle line platforms at Edgware Road (Circle) station.  To avoid this problem, the node would need to be split up into two nodes.

For example, the Euston [Northern] node could be broken up into two nodes: Euston [Northern – Bank] and Euston [Northern – Charing X].  This small change introduces another level of complexity for the whole network.
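
To make the node-splitting idea concrete, here is a minimal sketch in plain Python using a dict-of-dicts graph (not the NetworkX object from this project).  The helper split_node, the two split node names and the 4-minute interchange are hypothetical; the two 3-minute train edges are placeholders chosen only so that they sum to the 6.0 minutes reported above.

```python
def split_node(graph, node, parts, interchange_minutes):
    """Replace `node` with several nodes, each inheriting a subset of its edges.

    `parts` maps each new node name to the neighbours it takes over.
    The new nodes are then joined to each other by an interchange edge.
    """
    old_edges = graph.pop(node)
    for new_node, neighbours in parts.items():
        graph[new_node] = {}
        for neighbour in neighbours:
            minutes = old_edges[neighbour]
            graph[new_node][neighbour] = minutes
            del graph[neighbour][node]
            graph[neighbour][new_node] = minutes
    new_nodes = list(parts)
    for i, a in enumerate(new_nodes):
        for b in new_nodes[i + 1:]:
            graph[a][b] = interchange_minutes
            graph[b][a] = interchange_minutes


def path_time(graph, path):
    """Sum the edge weights along a path given as a list of node names."""
    return sum(graph[a][b] for a, b in zip(path, path[1:]))


hub = "Edgware Road (Circle) [Circle]"
pad_hc = "Paddington (H&C) [Circle]"
pad_bak = "Paddington (Bakerloo) [Circle]"

# Undirected graph: every edge is stored in both directions (placeholder times).
graph = {
    pad_hc: {hub: 3.0},
    hub: {pad_hc: 3.0, pad_bak: 3.0},
    pad_bak: {hub: 3.0},
}
before = path_time(graph, [pad_hc, hub, pad_bak])

# Hypothetical names for the split nodes and a hypothetical interchange time.
hub_a = "Edgware Road (Circle) [Circle - A]"
hub_b = "Edgware Road (Circle) [Circle - B]"
split_node(graph, hub, {hub_a: [pad_hc], hub_b: [pad_bak]}, interchange_minutes=4.0)
after = path_time(graph, [pad_hc, hub_a, hub_b, pad_bak])

print(before, "->", after, "minutes")
```

After the split, any path through the station has to cross the interchange edge, so the through journey can no longer skip the change of platform.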

Overall, I am still pleased with the model’s results for journey times, now that interchange times are incorporated in the model.  In some respects, the model is working better than Google Maps (e.g. the nuances of Paddington’s rail connections).

In the next post on this project, I will revisit the graph link analysis (e.g. PageRank) and see how the interchange data have influenced the centrality of stations.

London’s Tube network: initial link analysis

In a previous blog post, I outlined my aims to analyse London’s current Tube network.

As a recap, these are two core challenges I want to tackle:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

The previous post focused on the first challenge, whereas this post follows on from it and approaches the second challenge: link analysis.

I hope to update these posts with links to my Jupyter notebooks, once I tidy up my ‘lab’ notebooks.


For my data, I want to make explicit some issues that affect the analysis, due to how the dataset is set up and the assumptions in this initial graph model:

  • All interchanges have the same cost (1 minute). In reality, interchanges within a station are not equal. Green Park has very long tunnels for an interchange, whereas stations such as Finsbury Park can have an interchange on a parallel platform.
  • The walking routes and OSIs have a costly interchange between the station and its street (a result of the first assumption). This should be zero.  An example path with this issue is Bayswater → Bayswater [WALK] → Queensway [WALK] → Queensway.

These two problems can be solved together, once connections between platforms within a station are added. TfL has released data on interchange times within each station, but some data cleaning is needed before these connections can be added to the model.

Before improving the dataset, let us do some initial analysis to see how the graph is working.

Link analysis

I created a new NetworkX (nx) graph object, so that the graph is in a format for link analysis.  Why a separate object?  This is just because of how the weights need to be specified for the analysis you want to perform:

  • For ‘shortest path’ work, the weights in the graph are the journey times (mins).
  • For ‘link analysis’ work, the weights in the graph are the inverses of the journey times, e.g. a 2-minute edge has weight 1/2.  The link analysis algorithms treat a larger weight as a stronger connection, so taking the inverse makes short journeys count for more than long ones.
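
The conversion between the two weightings can be sketched as follows.  This is an illustration in plain Python with placeholder station names and times, and the helper name to_link_weights is made up for the example; it is not the notebook code.

```python
def to_link_weights(time_edges):
    """Convert (node_a, node_b, minutes) edges to (node_a, node_b, 1 / minutes)."""
    link_edges = []
    for node_a, node_b, minutes in time_edges:
        if minutes <= 0:
            # A zero-minute edge has no finite inverse, so it needs a decision
            # (e.g. merging the two nodes) before it can be used for link analysis.
            raise ValueError(f"Cannot invert a {minutes}-minute edge: {node_a} - {node_b}")
        link_edges.append((node_a, node_b, 1.0 / minutes))
    return link_edges


time_edges = [
    ("Station A [Line 1]", "Station B [Line 1]", 2.0),  # placeholder times
    ("Station B [Line 1]", "Station B", 4.0),
]
link_edges = to_link_weights(time_edges)
for edge in link_edges:
    print(edge)
```

The guard on non-positive times matters if any edge is ever given a cost of zero, since that edge would have no usable inverse.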

There are two link analysis algorithms we could run on the network fairly quickly, which can open up different interpretations:

  • PageRank
  • HITS algorithm (hubs and authorities)

Mark Dunne also used those algorithms for his graph analysis, so it would be interesting to see how my results compare.

As a brief note, the Tube is modelled as an undirected graph in my dataset, so the hubs and authorities analyses of the HITS algorithm are equivalent.  There are only a few edges in the network that are actually unidirectional (e.g. Heathrow loop; West India Quay is skipped in the DLR route towards Lewisham).


My dataset is only one CSV file and I loaded it into a Pandas dataframe.  The graph object for the link analysis will be called ‘graph_weights’.  I’ll leave the code for setting up these objects in the Jupyter notebook, as it’s fairly long and it will probably be simplified in a v2 project when the within-station interchanges are added to my dataset.

With the graph object set up, here is the code for setting up a dataframe of PageRank values for each node:

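import networkx as nx
import pandas as pd
# Note: pagerank_numpy was removed in NetworkX 3.0; nx.pagerank accepts the same arguments used here
pagerank = nx.pagerank_numpy(graph_weights, weight='weight')
pagerank = pd.DataFrame.from_dict(pagerank, orient='index').reset_index()
pagerank.columns = ['node', 'pagerank']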
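
For readers who want to see what a weighted PageRank does with these inverse-time weights, here is a small power-iteration sketch in plain Python.  It is not NetworkX’s implementation: the function name weighted_pagerank, the toy graph and its weights are assumptions made for illustration, and it assumes every node has at least one edge.  The damping factor of 0.85 matches the default alpha in nx.pagerank.

```python
def weighted_pagerank(graph, damping=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PageRank on a weighted dict-of-dicts graph.

    A node passes its rank to each neighbour in proportion to the edge weight.
    Assumes no isolated nodes (every node has a positive total edge weight).
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    strength = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(max_iter):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            for neighbour, weight in graph[node].items():
                new_rank[neighbour] += damping * rank[node] * weight / strength[node]
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy undirected graph; weights are inverse journey times (placeholder values).
toy = {
    "Hub": {"Leaf 1": 0.5, "Leaf 2": 0.25, "Leaf 3": 1.0},
    "Leaf 1": {"Hub": 0.5},
    "Leaf 2": {"Hub": 0.25},
    "Leaf 3": {"Hub": 1.0},
}
ranks = weighted_pagerank(toy)
for node, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(value, 4))
```

In this toy graph the node that every other node connects to ends up with the highest rank, and among the leaves the one joined by the heaviest edge (the shortest journey) ranks highest.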

Now for a full implementation of HITS for completeness:

# Note: hits_scipy was removed in NetworkX 3.0; nx.hits is the replacement in newer versions
hits = nx.hits_scipy(graph_weights, max_iter=2500)
# Hub values
hits_hub = hits[0] # Get the dictionary out of the tuple
hits_hub = pd.DataFrame.from_dict(hits_hub, orient='index').reset_index()
hits_hub.columns = ['node', 'hub']
# Authority values
hits_auth = hits[1] # Get the dictionary out of the tuple
hits_auth = pd.DataFrame.from_dict(hits_auth, orient='index').reset_index()
hits_auth.columns = ['node', 'authority']
# Show hub and authority values
hits_all = pd.merge(hits_hub, hits_auth, on='node')

For some reason, I needed the max_iter argument to have a value (e.g. 2500) much higher than the default.


Here is my analysis, with the top 20 nodes shown for each algorithm (PageRank and hub values are shown).  To be explicit about how my dataset is structured: a node that is just the station name is akin to the station’s ticket hall, and a node with the line in square brackets is akin to a station platform.  I haven’t aggregated everything by station yet, as it is interesting to see how the analysis works with the nodes arranged like this for now.

[Figure: The top 20 nodes by algorithm (PageRank and HITS)]

The results for PageRank seem fairly intuitive:

  • Major stations have a high rank and tend to have many different services (multiple Underground, Overground or DLR services): e.g. Liverpool Street, Bank, Waterloo, Stratford, Willesden Junction.
  • Fairly small stations such as Euston Square and Wood Lane are ranked quite high. In these cases, the edges for OSIs and short walking routes could explain why they have high rankings.
  • Poplar [DLR] node is the highest ranked ‘platform’ (not shown in image, but is in top 25, with value 0.001462). This seems reasonable as every DLR service passes through the station; it may be the case that more journeys pass through Poplar, rather than start or end there.  Earl’s Court [District] is next in ranking and the intuition is similar.

The HITS results are much more striking in contrast:

  • Liverpool Street is by far the most prominent station in this HITS analysis.  All of its platforms and the ticket hall score higher than other nodes.
  • The main hubs in my dataset are all in the City of London: Liverpool Street, Moorgate, Bank, Barbican and Farringdon are all prominent.

It is interesting that Mark Dunne’s analysis has the main hubs in the West End: in particular, Oxford Circus, Green Park, Piccadilly Circus, etc.  I think this is largely because my dataset has all the current London Overground routes, including the Lea Valley Lines. This has brought many more routes (via Hackney) onto TfL’s map: it could be likened to a north-eastern Metro-land.

If this graph were extended to include National Rail routes from the London commuter belt (especially ones in TfL’s fare zones), then I would expect that stations such as Clapham Junction, Victoria, London Bridge and King’s Cross would be more prominent in the network. Inclusion of these routes is problematic, as they operate a variety of services (e.g. fast trains vs all-stations trains), so assigning values to edges is difficult.


Let’s wrap up this project in its current form, so that we can consider whether it is worthwhile to extend it any further.

  • What has worked well so far?
    • The initial analysis appears to represent the network well. The results seem intuitive.
    • The Excel workbook with all the data has been quick to edit when significant changes needed to be made.
    • Some functionality has been straightforward to implement, e.g. shortest path function.
  • What can be improved?
    • More interchange data can be added to the graph.
    • The interchange edges can be less clunky.
    • Data visualisation could add much more value to the analysis.

TfL have produced data on interchange times within stations and I seek to incorporate those in my dataset.  That would add a few hundred connections to my network: edges from platforms to ticket halls; edges between platforms in stations.  I have done some initial data cleaning on these connections, but the most time-consuming work will be matching TfL’s station names to my dataset’s station naming convention.

Traversing London’s Tube network

The TfL Tube map is becoming increasingly dense, as more and more lines and stations are added to it. (By the time I achieve anything with this graph analysis, there may be more tweaking to do because of Crossrail!)

I have tried to find a graph online that is set up for basic analysis, which accounts for:

  • Interchanges within station
  • Interchanges between stations, i.e. out-of-station interchanges (OSIs) and other short walking routes.
  • The latest TfL network, especially all the current London Overground routes.

There are interactive tools (e.g. Oliver O’Brien’s Tube Creature and Google Maps) that cover those three factors, but I have not found a workable dataset that can be used for my own graph analysis.

Mark Dunne’s Tube graph project, which used data from Nicola Greco, is the best graph analysis I have come across. However, the data do not cover the three elements specified above.

These are two core challenges I want to tackle:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

I have created my own dataset to include all the Overground routes and added my own interchange edges. There are about 120 out-of-station interchanges (OSIs) and walking routes in my dataset (fairly even split between the two).

For my dataset, I have also designed an ID system that includes information about all the stations: e.g. Piccadilly Circus has ID 71034, where ’10’ shows that it is in Zone 1. This numbering system has been useful for producing pivot tables and bringing in an element of data verification.
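
As a small illustration of how such an ID could be used for verification, the sketch below pulls out the zone code.  It assumes that IDs are always five digits and that the zone code sits in the second and third digits, which is consistent with the Piccadilly Circus example but is otherwise a guess; the helper name zone_code is made up, and only the code ‘10’ is known from this post.

```python
# Only this code is given in the post; other codes would need the full ID scheme.
KNOWN_ZONE_CODES = {"10": "Zone 1"}


def zone_code(station_id):
    """Return the two-digit zone code, assumed to be digits 2-3 of a five-digit ID."""
    text = str(station_id)
    if len(text) != 5 or not text.isdigit():
        raise ValueError(f"Expected a five-digit station ID, got {station_id!r}")
    return text[1:3]


code = zone_code(71034)  # Piccadilly Circus
print(code, KNOWN_ZONE_CODES.get(code, "unknown zone code"))
```

A check like this can flag rows whose ID is malformed or whose zone code is not recognised.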

[Figure: Tube network connections in Excel]

I used the NetworkX (as nx) library to set up the graph.

Journey times

A function

Here are the neighbours for Queen’s Park station: “Queen’s Park [Bakerloo]”, “Queen’s Park [WDCL]”, “Queen’s Park [WALK]”.

In other words, Queen’s Park station has its Bakerloo and Watford DC line platforms as neighbours, in addition to the ‘WALK’ platform, which arises from how my dataset is constructed to account for short walking routes.

Now let’s create a function to obtain the fastest route and see how it is working.

def fastest_route(start, end):
    """
    Print the fastest path between the 'start' and 'end' points.
    Each station and interchange is printed, along with the journey time.
    Tip: use "" when calling the function, as escape characters may be needed with ''.
    """
    # graph_times is the NetworkX graph whose edge weights are journey times (mins)
    journey_path = nx.shortest_path(graph_times, start, end, weight='weight')
    journey_time = nx.shortest_path_length(graph_times, start, end, weight='weight')
    print('\nJOURNEY:', *journey_path, sep='\n\t')
    print('\nJOURNEY TIME:', journey_time, 'minutes')

Let’s try some examples with the function and see how it is working.

fastest_route("Queen's Park", "Brondesbury Park")


Queen’s Park
Queen’s Park [WALK]
Brondesbury Park [WALK]
Brondesbury Park

JOURNEY TIME: 12.0 minutes

Queen’s Park to Brondesbury Park is not an OSI, even though the stations are only 0.5 miles apart on the same road (Salisbury Road) and are on different Overground lines. Looking at TfL’s map, people unfamiliar with the area might think that an interchange at Willesden Junction would be faster.

Note: The distance between each station and its ‘walk’ node is a 1-minute journey due to the uniform assumption applied to all interchanges in the graph design.
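
The effect of that uniform assumption can be checked with a little arithmetic.  The sketch below is plain Python, not the project code.  The 10-minute walking edge is inferred from the 12.0-minute total minus the two 1-minute station-to-walk edges, and the zero-cost variant is a hypothetical alternative shown for comparison.

```python
def path_time(edge_minutes, path):
    """Sum edge times along a path; edges are keyed by frozenset so direction is ignored."""
    return sum(edge_minutes[frozenset(pair)] for pair in zip(path, path[1:]))


path = [
    "Queen's Park",
    "Queen's Park [WALK]",
    "Brondesbury Park [WALK]",
    "Brondesbury Park",
]

# Uniform assumption: each station <-> [WALK] edge costs 1 minute.
# The walking edge (10 minutes) is inferred from the 12.0-minute total: 12 - 1 - 1.
uniform = {
    frozenset(path[0:2]): 1.0,
    frozenset(path[1:3]): 10.0,
    frozenset(path[2:4]): 1.0,
}
# Hypothetical variant: the station <-> [WALK] edges cost nothing.
zero_cost = {**uniform, frozenset(path[0:2]): 0.0, frozenset(path[2:4]): 0.0}

uniform_total = path_time(uniform, path)
zero_cost_total = path_time(zero_cost, path)
print(uniform_total, "vs", zero_cost_total, "minutes")
```

So the uniform assumption adds two minutes to every walking route: one minute at each end.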

[Figure: Queen’s Park and Brondesbury Park stations are only half a mile apart. (Image source: Google Maps)]

West Hampstead to West Ruislip

Let’s look at an interesting case: a journey from the West Hampstead area to West Ruislip station.

There are three different routes that take a similar amount of time (approx. 50 minutes, +/- a few minutes):

  • Take Jubilee line to Bond Street, then take the Central line direct to West Ruislip.
  • Get on the Metropolitan line (Finchley Road), leave at Ickenham and then walk from there to West Ruislip. That walking route is an OSI on the network.
  • Take Overground service to Shepherd’s Bush, then take the Central line direct to West Ruislip.

West Hampstead’s Underground, Overground and Thameslink stations are all separate stations.  These are the fastest routes based on my current graph:

  • West Hampstead Overground → West Ruislip: 50.5 mins, with interchange at Shepherd’s Bush.
  • West Hampstead Underground → West Ruislip: 48 mins, with interchange at Finchley Road and walk from Ickenham to destination.

If more data on actual interchange times were put in the dataset, then perhaps the shortest paths would change.  The Overground and Underground stations are less than 2.5 minutes apart in reality.

Thoughts so far

I was pleased with the results of my fastest route function and moved on to some graph analysis: I will show the findings in the next blog post.

The key thing to change in the data pipeline is to add the actual interchange times between stations’ platforms, in order to give more accurate journey times.

Note: I hope to publish the Jupyter notebook of this project later on and am trying not to dwell too much on the actual code in these blog posts.