Modelling and analysing London’s Tube network has been an interesting project, as it has let me build my own data pipeline from multiple data sources. I have since refined the model, and this post discusses the changes.
These are the two challenges I have sought to tackle so far in this project:
- Calculate the shortest path between two stations.
- Perform graph analysis / link analysis on the network.
As presented in my previous posts on this project, I managed to create a function that displays the shortest path between two stations and do some initial link analysis on the most important stations in the network. My graph has all the current London Overground routes (e.g. Lea Valley Lines), whereas many network analyses online are based on older data.
What elements can be improved?
- More interchange data can be added to the graph.
- The interchange edges can be less clunky (especially the OSI [out-of-station interchange] and walking routes between stations).
- Data visualisation could add much more value to the analysis.
Improving the graph data
The first two problems can be addressed together, by adding edges that represent distances from:
- Platform to street level
- Platform to platform (interchange)
In the previous model, interchange times were set by assumption: in essence, they were treated as identical because of a lack of data. In practice, interchange times vary across the network. An interchange can take as long as 15 minutes or, at the lower end, be a hop to a parallel platform. King’s Cross St. Pancras is an example of a large station with many tunnels.
I have incorporated TfL’s data on interchange times in my model, which has involved adding over 600 edges. The graph has approximately twice the number of edges as a consequence.
Some data are missing, so I had to estimate distances for edges at some stations, e.g. Clapham North and most of the Lea Valley Lines stations.
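The two kinds of edge can be sketched as follows. This is only an illustration: it assumes a plain adjacency-dictionary graph and the node naming used in this post (the bare station name for street level, "Station [Line]" for a platform). The helper add_station_edges, the fallback DEFAULT_INTERCHANGE_MIN and every time in the example are hypothetical, not TfL figures or the pipeline's actual code.

```python
from collections import defaultdict

# Hypothetical fallback for stations with no published interchange time.
DEFAULT_INTERCHANGE_MIN = 3.0


def add_edge(graph, a, b, minutes):
    """Add an undirected edge weighted by travel time in minutes."""
    graph[a][b] = minutes
    graph[b][a] = minutes


def add_station_edges(graph, station, lines, street_times=None, platform_times=None):
    """Link each platform node to street level and to the other platforms.

    street_times maps line -> minutes.
    platform_times maps frozenset({line_a, line_b}) -> minutes.
    Missing entries fall back to DEFAULT_INTERCHANGE_MIN (an estimate).
    """
    street_times = street_times or {}
    platform_times = platform_times or {}
    # Platform to street level
    for line in lines:
        minutes = street_times.get(line, DEFAULT_INTERCHANGE_MIN)
        add_edge(graph, station, f"{station} [{line}]", minutes)
    # Platform to platform (interchange)
    for i, line_a in enumerate(lines):
        for line_b in lines[i + 1:]:
            key = frozenset((line_a, line_b))
            minutes = platform_times.get(key, DEFAULT_INTERCHANGE_MIN)
            add_edge(graph, f"{station} [{line_a}]", f"{station} [{line_b}]", minutes)


graph = defaultdict(dict)
# Illustrative values only.
add_station_edges(
    graph,
    "Finchley Road",
    ["Jubilee", "Metropolitan"],
    street_times={"Jubilee": 2.0},  # no Metropolitan entry, so the fallback applies
    platform_times={frozenset(("Jubilee", "Metropolitan")): 1.0},
)
```

Using a frozenset as the key means the order of the two lines does not matter when looking up an interchange time, which suits an undirected graph.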
Google Maps is not great for calculating short journeys between stations, especially journeys involving a large National Rail station.
This is probably my best example of a calculation by Google Maps that seems ‘off’.
Suppose you arrive in Paddington station from a National Rail service. How long would it take to get to Notting Hill Gate station (street level)?
Google Maps says 5 minutes:
- Paddington station concourse to Paddington’s District/Circle Line platforms: 2 minutes
- Two stops to Notting Hill Gate: 3 minutes
This journey would not take 5 minutes. It would take around 5 minutes just to get from the Paddington concourse, via the Underground entrance on Praed Street, to the Circle/District platforms.
Google Maps does not appear to use much data on interchange times for Tube stations. I am surprised, as it does use timetable data of some sort. In fact, it even lists Notting Hill Gate station incorrectly as ‘Notting Hill’.
Using my fastest_route function (code shown in previous post), the journey is estimated to take 11 minutes, not 5 minutes:
# Google Maps has Paddington concourse -> Notting Hill Gate as 5min journey
fastest_route("Paddington", "Notting Hill Gate [Circle]")  # start: National Rail
fastest_route("Paddington", "Notting Hill Gate")
fastest_route("Paddington (H&C)", "Notting Hill Gate")  # start: north end of Paddington
The middle case, of most interest, returns 11 minutes:
Paddington (Bakerloo) [Circle]
Notting Hill Gate [Circle]
Notting Hill Gate
JOURNEY TIME: 11.0 minutes
The first and last journeys have journey times of 9.0 and 14.0 minutes respectively.
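For readers who have not seen the previous post, the idea behind a fastest-route lookup can be sketched with Dijkstra's algorithm over a time-weighted graph. This is an independent minimal sketch, not the fastest_route function itself: the name fastest_route_sketch, the helper undirected, the station names and all the times are placeholders chosen for illustration.

```python
import heapq


def fastest_route_sketch(graph, start, end):
    """Return (minutes, path) for the quickest route, using Dijkstra's algorithm."""
    queue = [(0.0, start, [start])]
    settled = set()
    while queue:
        minutes, node, path = heapq.heappop(queue)
        if node == end:
            return minutes, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in settled:
                heapq.heappush(queue, (minutes + weight, neighbour, path + [neighbour]))
    return float("inf"), []  # no route found


def undirected(edges):
    """Build an adjacency dictionary from (a, b, minutes) tuples."""
    graph = {}
    for a, b, minutes in edges:
        graph.setdefault(a, {})[b] = minutes
        graph.setdefault(b, {})[a] = minutes
    return graph


# Toy graph with placeholder times (not TfL data, not this post's results).
toy = undirected([
    ("Station A", "Station A [Circle]", 4.0),           # street level to platform
    ("Station A [Circle]", "Station B [Circle]", 3.0),  # train between stations
    ("Station B [Circle]", "Station B", 2.0),           # platform to street level
    ("Station A", "Station B", 15.0),                   # slow walking route
])

minutes, path = fastest_route_sketch(toy, "Station A", "Station B")
for stop in path:
    print(stop)
print(f"JOURNEY TIME: {minutes} minutes")
```

Because the platform-to-street edges carry their own weights, the total includes the time spent getting in and out of each station, which is the component the Google Maps estimate appears to underweight.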
West Hampstead to West Ruislip (revisited)
An interesting journey discussed in a previous post was from the West Hampstead area to West Ruislip. There are three separate stations in West Hampstead (Underground, Overground and Thameslink), which opens up three different routes to West Ruislip, each taking approx. 50 minutes. Including interchange data in the data pipeline should give a better idea of the actual ‘fastest’ route.
With the interchange data, the fastest route from West Hampstead Overground to West Ruislip takes 49.0 minutes. The new model is suggesting a walk to the Underground station, instead of taking the Overground to Shepherd’s Bush and taking the Central line from there:
West Hampstead Underground
West Hampstead Underground [Jubilee]
Finchley Road [Jubilee]
Finchley Road [Metropolitan]
Wembley Park [Metropolitan]
Preston Road [Metropolitan]
Northwick Park [Metropolitan]
West Harrow [Metropolitan]
Rayners Lane [Metropolitan]
Ruislip Manor [Metropolitan]
JOURNEY TIME: 49.0 minutes
The route via the Jubilee line and Central line is not far off on journey time though. It takes 47.0 minutes from Swiss Cottage’s Jubilee line platform to West Ruislip. In total, the journey from West Hampstead Underground to West Ruislip via Jubilee and Central lines is 51.0 minutes.
Journeys that are not working so well
There are some quirks in the Tube network that are hard to model, so an undirected graph will not give truly accurate times for some journeys. These are examples:
- Interchanges between platforms on the same line are difficult to model: e.g. Euston’s two Northern line branches; Edgware Road’s two Circle line branches; interchanges such as Harrow-on-the-Hill and Leytonstone.
- The London Overground has different services – some involve interchanges, others may not. There are services that do a circle starting and ending at Clapham Junction, but other services do not do an orbit. In my current model, a journey from West Hampstead to Shepherd’s Bush involves an interchange at Willesden Junction. However, a service doing the Clapham Junction loop would not require that interchange.
- Some Underground edges have different services, e.g. trains often start at Mill Hill East for commuting, but during the day trains do not tend to terminate there. A journey to that station can involve a 15-minute interchange at Finchley Central! There are some fast trains on the Metropolitan line at peak times as another example.
As can be seen, there are a lot of complexities in the Tube network, so a graph cannot model all of them successfully.
Consider some examples in action:
fastest_route("Paddington (H&C) [Circle]", "Paddington (Bakerloo) [Circle]")
# The two Circle services at Edgware Rd are at different stations - this journey is not right
Paddington (H&C) [Circle]
Edgware Road (Circle) [Circle]
Paddington (Bakerloo) [Circle]
JOURNEY TIME: 6.0 minutes
The time between the two Paddington stations’ Circle line platforms is not 6 minutes. The problem is the Edgware Road (Circle) [Circle] node. There needs to be an interchange between the different Circle line platforms at Edgware Road (Circle) station. To avoid this problem, the node would need to be split up into two nodes.
For example, Euston [Northern] node could be broken up into two nodes: Euston [Northern – Bank] and Euston [Northern – Charing X]. This small change introduces another level of complexity for the whole network.
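A node split of this kind could look something like the sketch below. It assumes an adjacency-dictionary graph; the helper split_node, the choice of neighbouring nodes and all the times are hypothetical and only there to show the mechanics.

```python
def split_node(graph, node, branches, interchange_minutes):
    """Replace `node` with one node per branch, joined by interchange edges.

    branches maps each new node name to the neighbours it should keep.
    """
    old_edges = graph.pop(node)
    for new_node, neighbours in branches.items():
        graph[new_node] = {}
        for neighbour in neighbours:
            minutes = old_edges[neighbour]
            graph[new_node][neighbour] = minutes
            # Re-point the neighbour's edge from the old node to the new one.
            del graph[neighbour][node]
            graph[neighbour][new_node] = minutes
    # Changing between branches now costs an explicit interchange.
    new_nodes = list(branches)
    for i, a in enumerate(new_nodes):
        for b in new_nodes[i + 1:]:
            graph[a][b] = interchange_minutes
            graph[b][a] = interchange_minutes


# Placeholder times, for illustration only.
graph = {
    "Euston [Northern]": {
        "King's Cross St. Pancras [Northern]": 2.0,
        "Warren Street [Northern]": 2.0,
    },
    "King's Cross St. Pancras [Northern]": {"Euston [Northern]": 2.0},
    "Warren Street [Northern]": {"Euston [Northern]": 2.0},
}

split_node(
    graph,
    "Euston [Northern]",
    {
        "Euston [Northern – Bank]": ["King's Cross St. Pancras [Northern]"],
        "Euston [Northern – Charing X]": ["Warren Street [Northern]"],
    },
    interchange_minutes=4.0,
)
```

Before the split, a path through Euston [Northern] between the two branches costs only the two train edges. Afterwards it must also cross the interchange edge, which is the behaviour the Edgware Road example above is missing. The extra complexity comes from having to decide, for every affected station, which neighbours belong to which branch.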
Overall, I am still pleased with the model’s results for journey times, now that interchange times are incorporated in the model. In some respects, the model is working better than Google Maps (e.g. the nuances of Paddington’s rail connections).
In the next post on this project, I will revisit the graph link analysis (e.g. PageRank) and see how the interchange data have influenced the centrality of stations.