Travelling is one of my favourite hobbies but it often comes with frustration. It can be so difficult to plan a trip so I have been working on a data science project that tries to make it easier to plan a trip itinerary based on geodata.
I have published my project to my Github so check out my initial notebook there. This is the first in a series of blog posts on this project, which will focus on only small components of code and describe things in more general terms.
There are so many things to do in the travel planning process, e.g.:
- Choose a destination (probably from a shortlist).
- Book flights.
- Book accommodation (again from a shortlist).
- Develop a travel itinerary.
- Pack bags for the trip.
- Arrange transport (e.g. to airport and accommodation).
All that is done before even going on a trip, but already the number of tasks involved seems time-consuming. Some trips can take days to plan over a long period and the planning process is typically iterative (e.g. develop a draft itinerary and then refine it nearer the time).
Even as a seasoned traveller I would say that developing a travel itinerary is the hardest thing to do in that list. An itinerary defines a trip and it is something that you want to get right, especially if it may be your only visit to a destination. For all the travel guides and websites I examine when planning a trip, I normally need to adjust my plans ‘on the day’: e.g. the transportation isn’t as straightforward as I thought; an attraction is closed; plans are dependent on weather conditions.
Since itineraries are so time-consuming to get right, before and during a trip, I’ve sought a way to make it easier for me to develop a travel itinerary via data science techniques. As someone who has lived, studied and worked in London, I thought it would be worthwhile to start off the project by using data on London.
I read Hamza Bendemra’s post on geo-location clustering in Paris, which inspired me to do a take on a project like this one. With my experience in travel technology, experience of travel planning for large groups and a project on transportation, I thought I could go much further in my own project on travel planning.
These are the two initial aims of the project:
- Find the nearest rail station for London’s top attractions and calculate how close the attractions are to each station.
- Group together London’s top attractions and use the groupings for itinerary planning (e.g. spend one day for attractions in one group, another day for attractions in a second group, etc.).
That could help to establish a proof of concept for a more ambitious project.
This post focuses on the first part: calculating the nearest station to London’s top tourist attractions. Before going into that, it is also worthwhile to describe the nature of the datasets.
Data on attractions and stations
There are over 1,600 attractions in London listed on TripAdvisor, so there is plenty of things to do. I typically use TripAdvisor to compare destinations before planning a trip and it’s normally beneficial to look over some of the top attractions. The top attractions on TripAdvisor have high ratings and a relatively high volume of reviews, so they may have widespread appeal to fit in any travel itinerary.
I created a spreadsheet with geodata on more than 20 of the top attractions in London on TripAdvisor. These are the main items for each record in the spreadsheet table:
- Attraction rank
- Attraction name
- Geodata (e.g. latitude, longitude)
- Recommended visit duration interval (e.g. 2–3 hours)
This dataset involved some manual data entry and no web scraping at all.
I also have a dataset on over 600 rail stations in London (e.g. London Underground, London Overground, National Rail stations, etc.), which I have used previously for my graph analysis of London’s rail system. These are the main items for each record in the spreadsheet:
- Station name
- Geodata (e.g. latitude, longitude, postcode)
- TfL fare zone
I’ve loaded up the data into dataframes that are imaginatively called
rail respectively in my notebook.
Nearest station to attractions
With the data loaded and the geodata in a consistent format across both dataframes, we are ready to calculate the proximity between rail stations and tourist attractions.
The aim in this post is to find the nearest station for London’s top attractions and calculate how close the attractions are to each station.
In essence, the primary problem is a ‘nearest neighbour’ problem and the second problem involves applied trigonometry.
The rationale for tackling this aim first is to get more context on the attractions before approaching the travel planning problem, which is an unsupervised learning problem.
There are 21 attractions that need a ‘nearest station’ calculated from over 600 stations.
A KD Tree can store the latitude–longitude pairs of all the stations so that nearest neighbour calculations for any co-ordinates queried on the tree are efficient.
# Set up KD Tree object rail_tree = sp.spatial.KDTree(np.array(rail[['latitude', 'longitude']]))
That tree was set up with a Numpy array of latitude–longitude pairs (i.e. an
n x 2 array where
n is the number of stations).
A Numpy array, e.g. an
m x 2 array, can also be queried on the tree to return the nearest neighbour in the tree for
m points (in object
nn_id) and their distances from the neighbours (in object
# Query the nearest neighbours of the tree and store info in arrays nn_dist, nn_id = rail_tree.query(sight_geo_pairs, p=2) # Euclidean distance
Nearest neighbour calculations
A query of the National Gallery’s co-ordinates (51.508929 , -0.128299) returns position 387 in the tree as the nearest neighbour, which is Leicester Square.
See the notebook for more depth on the code and more examples beyond the first case (National Gallery).
Euclidean distance is used to represent the distance ‘as the crow flies’ between two points. There are other types of distance like Manhattan distance but the Euclidean distance is most appropriate for the data, as it is rotation invariant. A new column is added to the dataframe to record the nearest station for each attraction.
# Create a new column in sights df for nearest station sights['nearest_station'] = rail['station'].iloc[nn_id].values
The distance returned by the tree for the National Gallery is 0.00236287, but this is not easy to interpret and does not account for the Earth’s surface.
We cannot find out the distance in miles, for example, from the distances already calculated.
We need to have the lat-long pairs of both points, which form a line, in order to calculate the miles between them (via the haversine formula).
The haversine formula applies trigonometry on a sphere to calculate the distance between two points on a sphere. It can be used to give an approximate (Euclidean) distance between two points on Earth.
My notebook defines a function called
haversine, which takes geodata from two points as parameters and returns the distance in miles between the two points by default. The mathematics in the function matches the formula specified in the Wikipedia article linked above.
I gather latitude–longitude pairs for each attraction and their nearest station in two numpy arrays that both have the shape
m x 2, where m is the number of attractions. My function requires the data to be flattened, so I concatenate the two arrays into one array of shape
m x 4 to contain all the data we need to calculate the relevant haversine distances for the attractions. A column of Euclidean distance (miles) is created through a list comprehension, which iterates over each row of the array to unpack the two latitude–longitude pairs and pass them to the haversine function.
# Create a col for the distance # For each attraction, the two geo pairs are passed to the haversine function to obtain a value sights['nearest_station_miles'] = [haversine(*four_coords) for four_coords in np.concatenate([sight_geo_pairs, station_geo_pairs], axis=1)]
The top 5 attractions are all approximately 0.2 miles away from their nearest stations and each one has a different station. The National Gallery’s nearest station is 0.163 miles away (Leicester Square). London’s #5 attraction is the Victoria & Albert Museum and its nearest station is 0.193 miles away (South Kensington).
There are some attractions that are within 500 feet of the nearest station. Big Ben is approx. 260 feet away from its nearest station (Westminster). The photo at the top of this article shows a roundel of Westminster station in the foreground, with the nearby clock tower in the background.
The application of nearest neighbour search to my data has enabled me to obtain more geographical context on London’s top attractions. Most people traversing through London can use the Tube map as a point of reference for their travelling, so calculating the nearest station to London’s attractions is a useful first step in planning a travel itinerary for the city’s key sights.
One point to bear in mind with the calculations is that the nearest station calculated for an attraction may not be the nearest in terms of actual walking distance at the street level. The Euclidean distance is a simple way to calculate distance between points; much richer data would be needed, e.g. street routes in a graph, to calculate walking distances. Euclidean distance works well for giving geographical context, without introducing unnecessary complexity.
This context will be useful for approaching the next part of the project, which is an unsupervised learning problem: clustering of London’s top attractions and using the clusters to plan an itinerary. The attractions in London are spread across areas of differing densities, so it will be interesting to see how attractions are grouped together and to what extent an itinerary could be formed from them.