London’s Tube network: initial link analysis

In a previous blog post, I outlined my aims to analyse London’s current Tube network.

As a recap, these are two core challenges I want to tackle:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

The previous post focused on the first challenge, whereas this post follows on from it and approaches the second challenge: link analysis.

I hope to update these posts with links to my Jupyter notebooks, once I tidy up my ‘lab’ notebooks.

 

For my data, I want to make explicit some issues that affect the analysis, due to how the dataset is set up and the assumptions in this initial graph model:

  • All interchanges have the same cost (1 minute). In reality, interchanges within a station are not equal. Green Park has very long tunnels for an interchange, whereas stations such as Finsbury Park can have an interchange on a parallel platform.
  • The walking routes and OSIs have a costly interchange between the station and its street (a result of the first assumption). This should be zero.  An example path with this issue is Bayswater → Bayswater [WALK] → Queensway [WALK] → Queensway.

These two problems can be solved together, once connections between platforms within a station are added. TfL has released data on interchange times within each station, but some data cleaning is needed before these connections can be added to the model.

Before improving the dataset, let us do some initial analysis to see how the graph is working.


Link analysis

I created a new NetworkX (nx) graph object, so that the graph is in a format for link analysis.  Why a separate object?  This is just because of how the weights need to be specified for the analysis you want to perform:

  • For ‘shortest path’ work, the weights in the graph are the journey times (mins).
  • For ‘link analysis’ work, the weights in the graph are the inverses of the journey times, e.g. 2-minute edge has weight 1/2.  This is how the graph needs to be set up for the link analysis algorithms to work as expected.

There are two link analysis algorithms we could do on the network fairly quickly, which can open up different interpretations:

  • PageRank
  • HITS algorithm (hubs and authorities)

Mark Dunne also used those algorithms for his graph analysis, so it would be interesting to see how my results compare.

As a brief note, the Tube is modelled as an undirected graph in my dataset, so the hubs and authorities analyses of the HITS algorithm are equivalent.  There are only a few edges in the network that are actually unidirectional (e.g. Heathrow loop; West India Quay is skipped in the DLR route towards Lewisham).

Setup

My dataset is only one CSV file and I loaded it into a Pandas dataframe.  The graph object for the link analysis will be called ‘graph_weights’.  I’ll leave the code for setting up these objects in the Jupyter notebook, as it’s fairly long and it will probably be simplified in a v2 project when the within-station interchanges are added to my dataset.

With the graph object set up, here is the code for setting up a dataframe of PageRank values for each node:


pagerank = nx.pagerank_numpy(graph_weights, weight='weight')
pagerank = pd.DataFrame.from_dict(pagerank, orient='index').reset_index()
pagerank.columns = ['node', 'pagerank']

Now for a full implementation of HITS for completeness:


hits = nx.hits_scipy(graph_weights, max_iter=2500)
# Hub values
hits_hub = hits[0] # Get the dictionary out of the tuple
hits_hub = pd.DataFrame.from_dict(hits_hub, orient='index').reset_index()
hits_hub.columns = ['node', 'hub']
# Authority values
hits_auth = hits[1] # Get the dictionary out of the tuple
hits_auth = pd.DataFrame.from_dict(hits_auth, orient='index').reset_index()
hits_auth.columns = ['node', 'authority']
# Show hub and authority values
hits_all = pd.merge(hits_hub, hits_auth, on='node')

For some reason, I needed the max_iter argument to have a value (e.g. 2500) much higher than the default.

Analysis

Here is my analysis, with the top 20 nodes shown for each algorithm (PageRank and hub values are shown).  To make explicit, due to how my dataset is structured, a node that is just the station name is akin to the station’s ticket hall, and nodes that have the station line in square brackets are akin to station platforms.  I haven’t aggregated everything by station yet, as it is interesting to see how it works with the nodes arranged like this for now.

Top 20 nodes by algorithm
The top 20 nodes by algorithm (PageRank and HITS)

The results for PageRank seem fairly intuitive:

  • Major stations have a high rank and tend to have many different services (multiple Underground, Overground or DLR services): e.g. Liverpool Street, Bank, Waterloo, Stratford, Willesden Junction.
  • Fairly small stations such as Euston Square and Wood Lane are ranked quite high. In these cases, the edges for OSIs and short walking routes could explain why they have high rankings.
  • Poplar [DLR] node is the highest ranked ‘platform’ (not shown in image, but is in top 25, with value 0.001462). This seems reasonable as every DLR service passes through the station; it may be the case that more journeys pass through Poplar, rather than start or end there.  Earl’s Court [District] is next in ranking and the intuition is similar.

The HITS results are much more striking in contrast:

  • Liverpool Street is by far the most prominent station in this HITS analysis.  All of its platforms and the ticket hall score higher than other nodes.
  • The main hubs in my dataset are all in the City of London: Liverpool Street, Moorgate, Bank, Barbican and Farringdon are all prominent.

It is interesting that Mark Dunne’s analysis has the main hubs in the West End: in particular, Oxford Circus, Green Park, Piccadilly Circus, etc.  I think this is largely because my dataset has all the current London Overground routes, including the Lea Valley Lines. This has brought many more routes (via Hackney) onto TfL’s map: it could be likened to a north-eastern Metro-land.

If this graph were extended to include National Rail routes from the London commuter belt (especially ones in TfL’s fare zones), then I would expect that stations such as Clapham Junction, Victoria, London Bridge and King’s Cross would be more prominent in the network. Inclusion of these routes is problematic, as they operate a variety of services (e.g. fast trains vs all-stations trains), so assigning values to edges is difficult.

Retrospective

Let’s wrap up this project in its current form, so that we consider whether it is worthwhile to extend this project any further.

  • What has worked well so far?
    • The initial analysis appears to represent the network well. The results seem intuitive.
    • The Excel workbook with all the data has been quick to edit when significant changes needed to be made.
    • Some functionality has been straightforward to implement, e.g. shortest path function.
  • What can be improved?
    • More interchange data can be added to the graph.
    • The interchange edges can be less clunky.
    • Data visualisation could add much more value to the analysis.

TfL have produced data on interchange times within stations and I seek to incorporate those in my dataset.  That would add a few hundred connections to my network: edges from platforms to ticket halls; edges between platforms in stations.  I have done some initial data cleaning on these connections, but the most time-consuming work will be matching TfL’s station names to my dataset’s station naming convention.

Leave a Reply

Your email address will not be published. Required fields are marked *