Planning a trip to London: finding your bearings via nearest neighbour search

Big Ben near Westminster station
What are the nearest stations to London’s top tourist attractions?

Introduction

Travelling is one of my favourite hobbies, but it often comes with frustration.  Planning a trip can be difficult, so I have been working on a data science project that uses geodata to make it easier to plan a trip itinerary.

I have published my project on my GitHub, so check out my initial notebook there.  This is the first in a series of blog posts on this project, which will focus on only small components of code and describe things in more general terms.

There are so many things to do in the travel planning process, e.g.:

  • Choose a destination (probably from a shortlist).
  • Book flights.
  • Book accommodation (again from a shortlist).
  • Develop a travel itinerary.
  • Pack bags for the trip.
  • Arrange transport (e.g. to airport and accommodation).

All that is done before even going on a trip, but already the number of tasks involved seems time-consuming.  Some trips can take days to plan over a long period and the planning process is typically iterative (e.g. develop a draft itinerary and then refine it nearer the time).

Even as a seasoned traveller I would say that developing a travel itinerary is the hardest thing to do in that list. An itinerary defines a trip and it is something that you want to get right, especially if it may be your only visit to a destination.  For all the travel guides and websites I examine when planning a trip, I normally need to adjust my plans ‘on the day’: e.g. the transportation isn’t as straightforward as I thought; an attraction is closed; plans are dependent on weather conditions.

Since itineraries are so time-consuming to get right, before and during a trip, I’ve sought a way to make it easier for me to develop a travel itinerary via data science techniques.  As someone who has lived, studied and worked in London, I thought it would be worthwhile to start off the project by using data on London.

I read Hamza Bendemra’s post on geo-location clustering in Paris, which inspired me to do my own take on this kind of project.  With my experience in travel technology, experience of travel planning for large groups and a project on transportation, I thought I could go much further in my own project on travel planning.

These are the two initial aims of the project:

  • Find the nearest rail station for London’s top attractions and calculate how close the attractions are to each station.
  • Group together London’s top attractions and use the groupings for itinerary planning (e.g. spend one day for attractions in one group, another day for attractions in a second group, etc.).

That could help to establish a proof of concept for a more ambitious project.

This post focuses on the first part: calculating the nearest station to London’s top tourist attractions.  Before going into that, it is also worthwhile to describe the nature of the datasets.

Data on attractions and stations

There are over 1,600 attractions in London listed on TripAdvisor, so there are plenty of things to do.  I typically use TripAdvisor to compare destinations before planning a trip and it’s normally beneficial to look over some of the top attractions.  The top attractions on TripAdvisor have high ratings and a relatively high volume of reviews, so they may have widespread appeal to fit in any travel itinerary.

I created a spreadsheet with geodata on more than 20 of the top attractions in London on TripAdvisor.  These are the main items for each record in the spreadsheet table:

  • Attraction rank
  • Attraction name
  • Geodata (e.g. latitude, longitude)
  • Recommended visit duration interval (e.g. 2–3 hours)

This dataset involved some manual data entry and no web scraping at all.

Top London attractions
The top London attraction on TripAdvisor is the National Gallery

I also have a dataset on over 600 rail stations in London (e.g. London Underground, London Overground, National Rail stations, etc.), which I have used previously for my graph analysis of London’s rail system.  These are the main items for each record in the spreadsheet:

  • Station name
  • Geodata (e.g. latitude, longitude, postcode)
  • TfL fare zone

I’ve loaded up the data into dataframes that are imaginatively called sights and rail respectively in my notebook.

Nearest station to attractions

With the data loaded and the geodata in a consistent format across both dataframes, we are ready to calculate the proximity between rail stations and tourist attractions.

The aim in this post is to find the nearest station for London’s top attractions and calculate how close the attractions are to each station.

In essence, the first problem is a ‘nearest neighbour’ problem and the second problem involves applied trigonometry.

The rationale for tackling this aim first is to get more context on the attractions before approaching the travel planning problem, which is an unsupervised learning problem.

There are 21 attractions that need a ‘nearest station’ calculated from over 600 stations.

A KD Tree can store the latitude–longitude pairs of all the stations so that nearest neighbour calculations for any co-ordinates queried on the tree are efficient.

 

# Set up KD Tree object from an n x 2 array of station co-ordinates
import numpy as np
from scipy.spatial import KDTree

rail_tree = KDTree(np.array(rail[['latitude', 'longitude']]))

That tree was set up with a NumPy array of latitude–longitude pairs (i.e. an n x 2 array where n is the number of stations).

A NumPy array of query points, e.g. an m x 2 array, can then be queried against the tree.  The query returns, for each of the m points, the position of its nearest neighbour in the tree (in object nn_id) and the distance to that neighbour (in object nn_dist).

 

# Query the nearest neighbours of the tree and store info in arrays
nn_dist, nn_id = rail_tree.query(sight_geo_pairs, p=2) # Euclidean distance
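
To make the query step concrete, here is a minimal, self-contained sketch.  The co-ordinates are toy values rather than rows from the rail or sights dataframes, and the names station_pairs and query_pairs are assumptions made for illustration only.

```python
import numpy as np
from scipy.spatial import KDTree

# Toy (latitude, longitude) rows: illustrative values, not real station data
station_pairs = np.array([
    [51.50, -0.12],
    [51.51, -0.13],
    [51.52, -0.10],
])

# Two toy query points, standing in for attraction co-ordinates
query_pairs = np.array([
    [51.509, -0.128],
    [51.519, -0.101],
])

tree = KDTree(station_pairs)

# p=2 requests Euclidean distance; each result array has one entry per query point
dist, idx = tree.query(query_pairs, p=2)

print(idx)   # row positions of the nearest neighbours in station_pairs
print(dist)  # distances in degrees, not miles
```

The positions in idx refer to rows of the array used to build the tree, which is why they can be passed to iloc on the original dataframe.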

Nearest neighbour calculations

A query of the National Gallery’s co-ordinates (51.508929, -0.128299) returns position 387 in the tree as the nearest neighbour, which is Leicester Square.

See the notebook for more depth on the code and more examples beyond the first case (National Gallery).

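Euclidean distance is used to represent the distance ‘as the crow flies’ between two points.  There are other types of distance, such as Manhattan distance, but Euclidean distance is more appropriate for the data because it is rotation invariant.  Bear in mind that it is computed here on raw latitude–longitude values, so it is only an approximation: at London’s latitude, a degree of longitude covers less ground than a degree of latitude.  A new column is added to the dataframe to record the nearest station for each attraction.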


# Create a new column in sights df for nearest station
sights['nearest_station'] = rail['station'].iloc[nn_id].values


The distance returned by the tree for the National Gallery is 0.00236287, but this value is in degrees of latitude and longitude, so it is not easy to interpret and does not account for the curvature of the Earth’s surface.

We cannot convert it directly into miles, for example.

Instead, we need the latitude–longitude pairs of both points in order to calculate the miles between them (via the haversine formula).
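
A rough back-of-the-envelope calculation shows why a single conversion factor will not do.  The sketch below assumes a round figure of about 69 miles per degree of latitude (an approximation, not a value from the notebook) and scales it by the cosine of the latitude to estimate the length of a degree of longitude at the National Gallery.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # rough approximation, assumed for illustration

national_gallery_lat = 51.508929  # latitude quoted earlier in this post

# A degree of longitude shrinks with the cosine of the latitude
miles_per_degree_lon = MILES_PER_DEGREE_LAT * math.cos(math.radians(national_gallery_lat))

print(round(MILES_PER_DEGREE_LAT, 1), "miles per degree of latitude (approx.)")
print(round(miles_per_degree_lon, 1), "miles per degree of longitude near this latitude (approx.)")
```

Because the two scales differ, the same distance in degrees corresponds to a different number of miles depending on its direction, which is why the co-ordinates of both points are needed.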

Distance calculations

The haversine formula applies spherical trigonometry to calculate the great-circle distance between two points on a sphere.  It can be used to give an approximate distance ‘as the crow flies’ between two points on Earth, which is treated as a perfect sphere.

My notebook defines a function called haversine, which takes geodata from two points as parameters and returns the distance in miles between the two points by default.  The mathematics in the function matches the formula specified in the Wikipedia article linked above.
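
For readers without the notebook to hand, a function along these lines could look like the sketch below.  This is not a copy of the notebook’s code: the argument order (lat1, lon1, lat2, lon2), the radius parameter and the mean Earth radius of 3,958.8 miles are assumptions made for this illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumed constant


def haversine(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_MILES):
    """Great-circle distance between two points given in decimal degrees.

    The result is in the same unit as `radius` (miles by default).
    """
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)

    # Haversine of the central angle between the two points
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)

    return 2 * radius * math.asin(math.sqrt(a))


# A quarter of the way round the equator should be a quarter of the circumference
quarter_equator = haversine(0.0, 0.0, 0.0, 90.0)
print(quarter_equator)
```

Passing a radius in kilometres instead would return kilometres, so the unit is controlled entirely by that one argument.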

I gather latitude–longitude pairs for each attraction and their nearest station in two NumPy arrays that both have the shape m x 2, where m is the number of attractions.  My function requires the data to be flattened, so I concatenate the two arrays into one array of shape m x 4 to contain all the data we need to calculate the relevant haversine distances for the attractions.  A column of haversine distances (miles) is created through a list comprehension, which iterates over each row of the array to unpack the two latitude–longitude pairs and pass them to the haversine function.


# Create a col for the distance
# For each attraction, the two geo pairs are passed to the haversine function to obtain a value
sights['nearest_station_miles'] = [haversine(*four_coords) for four_coords
  in np.concatenate([sight_geo_pairs, station_geo_pairs], axis=1)]

The top 5 attractions are all approximately 0.2 miles away from their nearest stations and each one has a different station.  The National Gallery’s nearest station is 0.163 miles away (Leicester Square).  London’s #5 attraction is the Victoria & Albert Museum and its nearest station is 0.193 miles away (South Kensington).

There are some attractions that are within 500 feet of the nearest station.  Big Ben is approx. 260 feet away from its nearest station (Westminster).  The photo at the top of this article shows a roundel of Westminster station in the foreground, with the nearby clock tower in the background.

Concluding points

The application of nearest neighbour search to my data has enabled me to obtain more geographical context on London’s top attractions.  Most people travelling around London can use the Tube map as a point of reference, so calculating the nearest station to London’s attractions is a useful first step in planning a travel itinerary for the city’s key sights.

One point to bear in mind with the calculations is that the nearest station calculated for an attraction may not be the nearest in terms of actual walking distance at the street level.  Straight-line distance is a simple way to calculate distance between points; much richer data would be needed, e.g. street routes in a graph, to calculate walking distances.  Straight-line distance works well for giving geographical context, without introducing unnecessary complexity.

This context will be useful for approaching the next part of the project, which is an unsupervised learning problem: clustering of London’s top attractions and using the clusters to plan an itinerary.  The attractions in London are spread across areas of differing densities, so it will be interesting to see how attractions are grouped together and to what extent an itinerary could be formed from them.

London’s Tube network: adding interchange data

Introduction

Modelling and analysis of London’s Tube network has been an interesting project for me to do, as it has enabled me to create my own data pipeline by using multiple data sources.  I have refined the model and will discuss it in this post.

These are the two challenges I have sought to tackle so far in this project:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

As presented in my previous posts on this project, I managed to create a function that displays the shortest path between two stations and do some initial link analysis on the most important stations in the network.  My graph has all the current London Overground routes (e.g. Lea Valley Lines), whereas many network analyses online are based on older data.

What elements can be improved?

  • More interchange data can be added to the graph.
  • The interchange edges can be less clunky (especially the OSI [out-of-station interchange] and walking routes between stations).
  • Data visualisation could add much more value to the analysis.

Improving the graph data

The first two problems can be addressed together, by adding edges that represent distances from:

  • Platform to street level
  • Platform to platform (interchange)

In the previous model, there were assumptions set on the interchange times (in essence, they were treated as identical due to lack of data).  However, the interchange times vary across the network.  An interchange time could be as high as 15 minutes, or at the lower end could be a hop to a parallel platform.  Look at the layout of King’s Cross St. Pancras station as an example of a large station with many tunnels.

I have incorporated TfL’s data on interchange times in my model, which has involved adding over 600 edges.  The graph has approximately twice the number of edges as a consequence.

There are missing data, so I had to estimate distances for edges at some stations: e.g. Clapham North; most of the Lea Valley Line stations.
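
The two kinds of edge listed above can be sketched in NetworkX as follows.  The station name and the interchange times here are made up for illustration and are not TfL figures; graph_times simply follows the name used in my fastest_route function.

```python
import networkx as nx

graph_times = nx.Graph()

# Hypothetical station with a ticket hall node and two platform nodes
graph_times.add_edge("Example Station", "Example Station [Line A]", weight=3.0)  # platform to street level
graph_times.add_edge("Example Station", "Example Station [Line B]", weight=4.0)  # platform to street level
graph_times.add_edge("Example Station [Line A]", "Example Station [Line B]", weight=2.0)  # platform to platform

# Changing lines now costs the direct interchange time rather than a trip via the ticket hall
change_time = nx.shortest_path_length(
    graph_times, "Example Station [Line A]", "Example Station [Line B]", weight="weight"
)
print(change_time)
```

With separate weights on each edge, a long tunnel and a cross-platform hop no longer have to share the same cost.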

Tube network connections in Excel (including interchange edges)
Tube network connections (e.g. distances between platforms in a station)

Fastest route

Google Maps is not great for calculating short journeys between stations, especially journeys involving a large National Rail station.

This is probably my best example of a calculation by Google Maps that seems ‘off’.

Suppose you arrive in Paddington station from a National Rail service.  How long would it take to get to Notting Hill Gate station (street level)?

Google Maps says 5 minutes:

  • Paddington station concourse to Paddington’s District/Circle Line platforms: 2 minutes
  • Two stops to Notting Hill Gate: 3 minutes

This journey would not take 5 minutes.  It would take around 5 minutes to get this far: Paddington concourse; Paddington Underground entrance on Praed Street; Circle/District platforms.

Google Maps does not appear to use much data on interchange times for Tube stations.  I am surprised, as it does use timetable data of some sort.  In fact, it even lists Notting Hill Gate station incorrectly as ‘Notting Hill’.

Using my fastest_route function (code shown in previous post), the journey is estimated to take 11 minutes, not 5 minutes:


# Google Maps has Paddington concourse -> Notting Hill Gate as 5min journey
fastest_route("Paddington", "Notting Hill Gate [Circle]") # start: National Rail
fastest_route("Paddington", "Notting Hill Gate")
fastest_route("Paddington (H&C)", "Notting Hill Gate") # start: north end of Paddington

The middle case, of most interest, returns 11 minutes:

JOURNEY:

Paddington
Paddington (Bakerloo)
Paddington (Bakerloo) [Circle]
Bayswater [Circle]
Notting Hill Gate [Circle]
Notting Hill Gate

JOURNEY TIME: 11.0 minutes

The first and last journeys have journey times of 9.0 and 14.0 minutes respectively.

West Hampstead to West Ruislip (revisited)

An interesting journey discussed in a previous post was: West Hampstead area to West Ruislip.  There are three separate stations in West Hampstead (Underground, Overground and Thameslink) so that opens up three different routes to West Ruislip, which take approx. 50 minutes.  The inclusion of interchange data in the data pipeline would give a better idea of the actual ‘fastest’ route.

With the interchange data, the fastest route from West Hampstead Overground to West Ruislip takes 49.0 minutes.  The new model is suggesting a walk to the Underground station, instead of taking the Overground to Shepherd’s Bush and taking the Central line from there:

JOURNEY:

West Hampstead
West Hampstead Underground
West Hampstead Underground [Jubilee]
Finchley Road [Jubilee]
Finchley Road [Metropolitan]
Wembley Park [Metropolitan]
Preston Road [Metropolitan]
Northwick Park [Metropolitan]
Harrow-on-the-Hill [Metropolitan]
West Harrow [Metropolitan]
Rayners Lane [Metropolitan]
Eastcote [Metropolitan]
Ruislip Manor [Metropolitan]
Ruislip [Metropolitan]
Ickenham [Metropolitan]
Ickenham
West Ruislip

JOURNEY TIME: 49.0 minutes

The route via the Jubilee line and Central line is not far off on journey time though.  It takes 47.0 minutes from Swiss Cottage’s Jubilee line platform to West Ruislip. In total, the journey from West Hampstead Underground to West Ruislip via Jubilee and Central lines is 51.0 minutes.

Journeys that are not working so well

There are some quirks in the Tube network that are hard to model, so an undirected graph will not give truly accurate times for some journeys.  These are examples:

  • Interchanges between platforms on the same line are difficult to model: e.g. Euston’s two Northern line branches; Edgware Road’s two Circle line branches; interchanges such as Harrow-on-the-Hill and Leytonstone.
  • The London Overground has different services – some involve interchanges, others may not.  There are services that do a circle starting and ending at Clapham Junction, but other services do not do an orbit.  In my current model, a journey from West Hampstead to Shepherd’s Bush involves an interchange at Willesden Junction.  However, a service doing the Clapham Junction loop would not require that interchange.
  • Some Underground edges have different services, e.g. trains often start at Mill Hill East for commuting, but during the day trains do not tend to terminate there.  A journey to that station can involve a 15-minute interchange at Finchley Central!  There are some fast trains on the Metropolitan line at peak times as another example.

As can be seen, there are a lot of complexities in the Tube network, so a graph cannot model all of them successfully.

Consider some examples in action:


fastest_route("Paddington (H&C) [Circle]", "Paddington (Bakerloo) [Circle]")
# The two Circle services at Edgware Rd use different platforms, so this journey is not right

JOURNEY:

Paddington (H&C) [Circle]
Edgware Road (Circle) [Circle]
Paddington (Bakerloo) [Circle]

JOURNEY TIME: 6.0 minutes

The time between the two Paddington stations’ Circle line platforms is not 6 minutes.  The problem is the Edgware Road (Circle) [Circle] node.  There needs to be an interchange between the different Circle line platforms at Edgware Road (Circle) station.  To avoid this problem, the node would need to be split up into two nodes.

For example, Euston [Northern] node could be broken up into two nodes: Euston [Northern – Bank] and Euston [Northern – Charing X].  This small change introduces another level of complexity for the whole network.
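
As a rough idea of what splitting a node might involve, here is a sketch on a toy three-node line.  The helper split_platform_node, the node names and all the times are hypothetical; none of this is taken from my dataset.

```python
import networkx as nx


def split_platform_node(graph, node, branches, interchange_minutes):
    """Replace `node` with one node per branch, linked by interchange edges.

    `branches` maps each new node name to the existing neighbours of `node`
    that the new node should keep.
    """
    for new_node, neighbours in branches.items():
        for neighbour in neighbours:
            # Carry over the journey time of the original edge
            graph.add_edge(new_node, neighbour, weight=graph[node][neighbour]["weight"])

    # Join every pair of new nodes with an explicit interchange edge
    new_nodes = list(branches)
    for i, first in enumerate(new_nodes):
        for second in new_nodes[i + 1:]:
            graph.add_edge(first, second, weight=interchange_minutes)

    graph.remove_node(node)


# Toy line: West -> Hub -> East, 3 minutes per hop (invented values)
toy = nx.Graph()
toy.add_edge("West [Circle]", "Hub [Circle]", weight=3.0)
toy.add_edge("Hub [Circle]", "East [Circle]", weight=3.0)

before = nx.shortest_path_length(toy, "West [Circle]", "East [Circle]", weight="weight")

split_platform_node(
    toy,
    "Hub [Circle]",
    branches={
        "Hub [Circle, branch 1]": ["West [Circle]"],
        "Hub [Circle, branch 2]": ["East [Circle]"],
    },
    interchange_minutes=4.0,
)

after = nx.shortest_path_length(toy, "West [Circle]", "East [Circle]", weight="weight")
print(before, after)
```

The journey through the hub now includes the interchange time, which is the behaviour missing from the Edgware Road example.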

Overall, I am still pleased with the model’s results for journey times, now that interchange times are incorporated in the model.  In some respects, the model is working better than Google Maps (e.g. the nuances of Paddington’s rail connections).

In the next post on this project, I will revisit the graph link analysis (e.g. PageRank) and see how the interchange data have influenced the centrality of stations.

London’s Tube network: initial link analysis

In a previous blog post, I outlined my aims to analyse London’s current Tube network.

As a recap, these are two core challenges I want to tackle:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

The previous post focused on the first challenge, whereas this post follows on from it and approaches the second challenge: link analysis.

I hope to update these posts with links to my Jupyter notebooks, once I tidy up my ‘lab’ notebooks.

 

For my data, I want to make explicit some issues that affect the analysis, due to how the dataset is set up and the assumptions in this initial graph model:

  • All interchanges have the same cost (1 minute). In reality, interchanges within a station are not equal. Green Park has very long tunnels for an interchange, whereas stations such as Finsbury Park can have an interchange on a parallel platform.
  • The walking routes and OSIs have a costly interchange between the station and its street (a result of the first assumption). This should be zero.  An example path with this issue is Bayswater → Bayswater [WALK] → Queensway [WALK] → Queensway.

These two problems can be solved together, once connections between platforms within a station are added. TfL has released data on interchange times within each station, but some data cleaning is needed before these connections can be added to the model.

Before improving the dataset, let us do some initial analysis to see how the graph is working.


Link analysis

I created a new NetworkX (nx) graph object, so that the graph is in a format for link analysis.  Why a separate object?  This is just because of how the weights need to be specified for the analysis you want to perform:

  • For ‘shortest path’ work, the weights in the graph are the journey times (mins).
  • For ‘link analysis’ work, the weights in the graph are the inverses of the journey times, e.g. 2-minute edge has weight 1/2.  This is how the graph needs to be set up for the link analysis algorithms to work as expected.
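
The relationship between the two graph objects can be sketched as below.  The nodes and journey times are toy values, and building one graph from the other in a loop is just one possible approach rather than the code in my notebook.

```python
import networkx as nx

# Graph for 'shortest path' work: weights are journey times in minutes (toy values)
graph_times = nx.Graph()
graph_times.add_edge("A", "B", weight=2.0)
graph_times.add_edge("B", "C", weight=4.0)

# Graph for 'link analysis' work: weights are the inverses of the journey times
graph_weights = nx.Graph()
for u, v, minutes in graph_times.edges(data="weight"):
    graph_weights.add_edge(u, v, weight=1.0 / minutes)

print(list(graph_weights.edges(data="weight")))
```

A shorter journey therefore becomes a stronger link, which is the direction the link analysis algorithms expect.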

There are two link analysis algorithms we could do on the network fairly quickly, which can open up different interpretations:

  • PageRank
  • HITS algorithm (hubs and authorities)

Mark Dunne also used those algorithms for his graph analysis, so it would be interesting to see how my results compare.

As a brief note, the Tube is modelled as an undirected graph in my dataset, so the hubs and authorities analyses of the HITS algorithm are equivalent.  There are only a few edges in the network that are actually unidirectional (e.g. Heathrow loop; West India Quay is skipped in the DLR route towards Lewisham).

Setup

My dataset is only one CSV file and I loaded it into a Pandas dataframe.  The graph object for the link analysis will be called ‘graph_weights’.  I’ll leave the code for setting up these objects in the Jupyter notebook, as it’s fairly long and it will probably be simplified in a v2 project when the within-station interchanges are added to my dataset.

With the graph object set up, here is the code for setting up a dataframe of PageRank values for each node:


pagerank = nx.pagerank_numpy(graph_weights, weight='weight')
pagerank = pd.DataFrame.from_dict(pagerank, orient='index').reset_index()
pagerank.columns = ['node', 'pagerank']

Now for a full implementation of HITS for completeness:


hits = nx.hits_scipy(graph_weights, max_iter=2500)
# Hub values
hits_hub = hits[0] # Get the dictionary out of the tuple
hits_hub = pd.DataFrame.from_dict(hits_hub, orient='index').reset_index()
hits_hub.columns = ['node', 'hub']
# Authority values
hits_auth = hits[1] # Get the dictionary out of the tuple
hits_auth = pd.DataFrame.from_dict(hits_auth, orient='index').reset_index()
hits_auth.columns = ['node', 'authority']
# Show hub and authority values
hits_all = pd.merge(hits_hub, hits_auth, on='node')

For some reason, I needed the max_iter argument to have a value (e.g. 2500) much higher than the default.
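
A note for anyone running this today: as far as I am aware, pagerank_numpy and hits_scipy were removed in NetworkX 3.0, leaving nx.pagerank and nx.hits as the entry points.  The sketch below uses those two functions on a small toy graph (the node names and weights are invented) and also checks the earlier point that hub and authority scores coincide on an undirected graph.

```python
import networkx as nx
import pandas as pd

# Toy undirected graph; weights stand in for inverse journey times
graph_weights = nx.Graph()
graph_weights.add_weighted_edges_from([
    ("A", "B", 0.5),
    ("B", "C", 0.5),
    ("A", "C", 1.0),
    ("C", "D", 0.25),
])

pagerank = nx.pagerank(graph_weights, weight="weight")
hubs, authorities = nx.hits(graph_weights, max_iter=2500)

# One row per node, one column per score
scores = pd.DataFrame({"pagerank": pagerank, "hub": hubs, "authority": authorities})
scores = scores.rename_axis("node").reset_index()
print(scores)
```

Building the dataframe from a dictionary of dictionaries avoids the separate merge step, since pandas aligns the scores on the node names.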

Analysis

Here is my analysis, with the top 20 nodes shown for each algorithm (PageRank and hub values are shown).  To be explicit, due to how my dataset is structured, a node that is just the station name is akin to the station’s ticket hall, and nodes that have the station line in square brackets are akin to station platforms.  I haven’t aggregated everything by station yet, as it is interesting to see how it works with the nodes arranged like this for now.

Top 20 nodes by algorithm
The top 20 nodes by algorithm (PageRank and HITS)

The results for PageRank seem fairly intuitive:

  • Major stations have a high rank and tend to have many different services (multiple Underground, Overground or DLR services): e.g. Liverpool Street, Bank, Waterloo, Stratford, Willesden Junction.
  • Fairly small stations such as Euston Square and Wood Lane are ranked quite high. In these cases, the edges for OSIs and short walking routes could explain why they have high rankings.
  • Poplar [DLR] node is the highest ranked ‘platform’ (not shown in image, but is in top 25, with value 0.001462). This seems reasonable as every DLR service passes through the station; it may be the case that more journeys pass through Poplar, rather than start or end there.  Earl’s Court [District] is next in ranking and the intuition is similar.

The HITS results are much more striking in contrast:

  • Liverpool Street is by far the most prominent station in this HITS analysis.  All of its platforms and the ticket hall score higher than other nodes.
  • The main hubs in my dataset are all in the City of London: Liverpool Street, Moorgate, Bank, Barbican and Farringdon are all prominent.

It is interesting that Mark Dunne’s analysis has the main hubs in the West End: in particular, Oxford Circus, Green Park, Piccadilly Circus, etc.  I think this is largely because my dataset has all the current London Overground routes, including the Lea Valley Lines. This has brought many more routes (via Hackney) onto TfL’s map: it could be likened to a north-eastern Metro-land.

If this graph were extended to include National Rail routes from the London commuter belt (especially ones in TfL’s fare zones), then I would expect that stations such as Clapham Junction, Victoria, London Bridge and King’s Cross would be more prominent in the network. Inclusion of these routes is problematic, as they operate a variety of services (e.g. fast trains vs all-stations trains), so assigning values to edges is difficult.

Retrospective

Let’s wrap up this project in its current form, so that we can consider whether it is worthwhile to extend this project any further.

  • What has worked well so far?
    • The initial analysis appears to represent the network well. The results seem intuitive.
    • The Excel workbook with all the data has been quick to edit when significant changes needed to be made.
    • Some functionality has been straightforward to implement, e.g. shortest path function.
  • What can be improved?
    • More interchange data can be added to the graph.
    • The interchange edges can be less clunky.
    • Data visualisation could add much more value to the analysis.

TfL have produced data on interchange times within stations and I seek to incorporate those in my dataset.  That would add a few hundred connections to my network: edges from platforms to ticket halls; edges between platforms in stations.  I have done some initial data cleaning on these connections, but the most time-consuming work will be matching TfL’s station names to my dataset’s station naming convention.

Traversing London’s Tube network

The TfL Tube map is becoming increasingly dense, as more and more lines and stations are added to it. (By the time I achieve anything with this graph analysis, there may be more tweaking to do because of Crossrail!)

I have tried to find a graph online that is set up for basic analysis, which accounts for:

  • Interchanges within station
  • Interchanges between stations, i.e. out-of-station interchanges (OSIs) and other short walking routes.
  • The latest TfL network, especially all the current London Overground routes.

There are interactive tools (e.g. Oliver O’Brien’s Tube Creature and Google Maps) that cover those three factors, but I have not found a workable dataset that can be used for my own graph analysis.

Mark Dunne’s Tube graph project is the best graph analysis I have come across, which used data from Nicola Greco. However, the data do not cover the three elements specified above.


These are two core challenges I want to tackle:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

I have created my own dataset to include all the Overground routes and added my own interchange edges. There are about 120 out-of-station interchanges (OSIs) and walking routes in my dataset (fairly even split between the two).

For my dataset, I have also designed an ID system that includes information about all the stations: e.g. Piccadilly Circus has ID 71034, where ‘10’ shows that it is in Zone 1. This numbering system has been useful for producing pivot tables and bringing in an element of data verification.

Tube network connections (Excel)
Tube network connections in Excel

I used the NetworkX (as nx) library to set up the graph.
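
One way to go from a table of connections to a graph is NetworkX’s from_pandas_edgelist function.  The sketch below is an assumption about how that could look: the column names from_node, to_node and weight are hypothetical, and the rows are invented rather than taken from my workbook.

```python
import networkx as nx
import pandas as pd

# Invented rows; the real dataset and its column names will differ
connections = pd.DataFrame({
    "from_node": ["A", "A [Line 1]", "B [Line 1]"],
    "to_node": ["A [Line 1]", "B [Line 1]", "B"],
    "weight": [1.0, 2.0, 1.0],  # journey or interchange time in minutes
})

# Each row becomes an undirected edge carrying its 'weight' attribute
graph_times = nx.from_pandas_edgelist(
    connections, source="from_node", target="to_node", edge_attr="weight"
)

print(graph_times.number_of_nodes(), graph_times.number_of_edges())
```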

Journey times

A function

Here are the neighbours for Queen’s Park station: “Queen’s Park [Bakerloo]”, “Queen’s Park [WDCL]”, “Queen’s Park [WALK]”.

In other words, Queen’s Park station has its Bakerloo and Watford DC line platforms as neighbours, in addition to the ‘WALK’ platform, which arises from how my dataset is constructed to account for short walking routes.
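
In NetworkX, a node’s neighbours can be listed with the graph’s neighbors method.  The sketch below rebuilds only the three edges mentioned above; the 1-minute weights follow the uniform interchange assumption in this version of the graph.

```python
import networkx as nx

graph_times = nx.Graph()

# Ticket hall node linked to its two platform nodes and its 'WALK' node
for platform in ["Queen's Park [Bakerloo]", "Queen's Park [WDCL]", "Queen's Park [WALK]"]:
    graph_times.add_edge("Queen's Park", platform, weight=1.0)

neighbours = sorted(graph_times.neighbors("Queen's Park"))
print(neighbours)
```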

Now let’s create a function to obtain the fastest route and see how it is working.


import networkx as nx

def fastest_route(start, end):
    """
    Return the fastest path between the 'start' and 'end' points.
    Each station and interchange is printed, along with the journey time.

    Tip: use "" when calling the function, as escape characters may be needed with ''.
    """
    journey_path = nx.shortest_path(graph_times, start, end, weight='weight')
    journey_time = nx.shortest_path_length(graph_times, start, end, weight='weight')
    print('\nJOURNEY:', *journey_path, sep='\n\t')
    print('\nJOURNEY TIME:', journey_time, 'minutes')

Let’s try some examples with the function and see how it is working.


fastest_route("Queen's Park", "Brondesbury Park")

JOURNEY:

Queen’s Park
Queen’s Park [WALK]
Brondesbury Park [WALK]
Brondesbury Park

JOURNEY TIME: 12.0 minutes

Queen’s Park to Brondesbury Park is not an OSI, even though the two stations are only 0.5 miles apart on the same road (Salisbury Road) and are on different Overground lines. On TfL’s map, people unfamiliar with the area might think that an interchange at Willesden Junction would be a faster journey.

Note: The distance between each station and its ‘walk’ node is a 1-minute journey due to the uniform assumption applied to all interchanges in the graph design.
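
The 12-minute figure can be reproduced on a four-node toy graph.  The two 1-minute edges come from the uniform interchange assumption; the 10-minute walking edge is inferred here by subtraction (12 - 1 - 1) rather than read from the dataset.

```python
import networkx as nx

graph_times = nx.Graph()
graph_times.add_edge("Queen's Park", "Queen's Park [WALK]", weight=1.0)
graph_times.add_edge("Queen's Park [WALK]", "Brondesbury Park [WALK]", weight=10.0)  # inferred value
graph_times.add_edge("Brondesbury Park [WALK]", "Brondesbury Park", weight=1.0)

journey_path = nx.shortest_path(graph_times, "Queen's Park", "Brondesbury Park", weight="weight")
journey_time = nx.shortest_path_length(graph_times, "Queen's Park", "Brondesbury Park", weight="weight")

print(journey_path)
print(journey_time)
```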

Queen's Park map
Queen’s Park and Brondesbury Park stations are only half a mile apart. (Image source: Google Maps)

West Hampstead to West Ruislip

Let’s look at an interesting case: a journey from the West Hampstead area to West Ruislip station.

There are three different routes that take a similar amount of time (approx. 50 minutes, +/- a few minutes):

  • Take Jubilee line to Bond Street, then take the Central line direct to West Ruislip.
  • Get on the Metropolitan line (Finchley Road), leave at Ickenham and then walk from there to West Ruislip. That walking route is an OSI on the network.
  • Take Overground service to Shepherd’s Bush, then take the Central line direct to West Ruislip.

West Hampstead’s Underground, Overground and Thameslink stations are all separate stations.  These are the fastest routes based on my current graph:

  • West Hampstead Overground → West Ruislip: 50.5 mins, with interchange at Shepherd’s Bush.
  • West Hampstead Underground → West Ruislip: 48 mins, with interchange at Finchley Road and walk from Ickenham to destination.

If more data on actual interchange times were put in the dataset, then perhaps the shortest paths would change.  The Overground and Underground stations are less than 2.5 minutes apart in reality.

Thoughts so far

I was pleased with the results of my fastest route function and moved on to some graph analysis: I will show the findings in the next blog post.

The key thing to change in the data pipeline is to add the actual interchange times between stations’ platforms, in order to give more accurate journey times.

Note: I hope to publish the Jupyter notebook of this project later on and am trying not to dwell too much on the actual code in these blog posts.