Recognising UK phone numbers through a regex

Introduction

These are some business problems that motivated me to write a program focused on UK phone numbers:

  • Dealing with spreadsheets or databases that have inconsistent phone number formats
  • Scraping phone number information from a clunky PDF (e.g. a directory) or webpage

Through some experimentation with a regex, pyperclip and OpenPyXL, I eventually produced a program that recognises UK phone numbers and parses them to a consistent format.


In this post, I will focus on my regex.  There are many websites that have regular expressions for recognising UK phone numbers, but I did not find one that accounted well for features of phone numbers such as extension numbers and international call prefixes.

This will be the sequence of posts:

  1. Write the regex
  2. Create a program that processes an input
  3. Export the phone numbers to a workbook and process them

Regex

Starting point

I will avoid discussing the intricacies of UK phone numbers here, so refer to Wikipedia’s article on UK phone numbers for detail on how the numbering system works.

These are features of phone numbers that will be important in a regex for my program:

  1. Access codes: e.g. 00 (international call prefix) and +44 (UK country code)
  2. Phone number without the leading 0
  3. Extension number (e.g. x1000 or ext1000)

These numbers correspond to my regex’s ‘groups’.  Of course, group 2’s information is of most interest, but the information in the other groups also needs to be processed.

I found quite a neat regex on a Stack Overflow thread (‘Match UK telephone number in any format’):


^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$

However, it would not work as-is for my programming task, so I wanted to improve it:

  • Use groups in a regex
  • Stop digits from being excluded in some cases
  • Account for more characters (e.g. I have seen ‘.’ separator many times in phone numbers from UK companies, whose IT people are in the USA!)
  • Break up the regex so that it is easier to read and modify
Matches for the regex found online. Notice that some formats are not recognised.

I have not seen a thorough exposition of how a regex actually ‘works’ for UK phone numbers, so I will produce one here.  I hope that this post is useful in seeing how a regex can work, as I certainly noticed more nuances of regular expressions while writing mine.

My regex

This is my regex, with comments in the code snippet to explain what is happening on some lines (the line breaks and # comments rely on free-spacing/verbose mode, e.g. Python’s re.VERBOSE):


(?:
(\(?
(?:0(?:0|11)\)?[\s-]? # almost every country's exit code, e.g. 00 or 011
\(?|\+?)44\)?[\s-]? # UK country code (+44)
(?:\(?0\)?[\s-]?)?) # leading 0 (trunk prefix)
| # the first of several alternation operators to come for the actual number
\(?0)((?:\d{5}\)?[\.\s-]?\d{4,5}) # e.g. 016977 xxxx
|
(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3})) # e.g. 01234 xxx xxx
|
(?:\d{4}\)?[\.\s-]?(?:\d{5})) # e.g. 01xxx xxxxx
|
(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4}) # e.g. 01xx xxx xxxx
|
(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}) # e.g. 020 xxxx xxxx
) # that's the largest capturing group over now
(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?) # e.g. x123, ext123


This is the one-line version, if you wish to paste it into RegExr for your own analysis:


(?:(\(?(?:0(?:0|11)\)?[\s-]?\(?|\+?)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|\(?0)((?:\d{5}\)?[\.\s-]?\d{4,5})|(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3}))|(?:\d{4}\)?[\.\s-]?(?:\d{5}))|(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4})|(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}))(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?)
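
If you would like to try the pattern in Python rather than RegExr, here is a minimal sketch (the sample numbers are illustrative only, and this is not the full program from the later posts):

import re

# Minimal sketch: compile the one-line pattern and inspect the three groups
# (access code, number without the leading 0, extension).
phone_regex = re.compile(
    r"(?:(\(?(?:0(?:0|11)\)?[\s-]?\(?|\+?)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|\(?0)"
    r"((?:\d{5}\)?[\.\s-]?\d{4,5})"
    r"|(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3}))"
    r"|(?:\d{4}\)?[\.\s-]?(?:\d{5}))"
    r"|(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4})"
    r"|(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}))"
    r"(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?)"
)

for text in ["+44 (0)20 7946 0000 ext. 123", "01632 960 961", "020.7946.0000 x400"]:
    match = phone_regex.search(text)
    if match:
        print(match.groups())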

What I found is that the alternatives in the regex need to be in a precise order, so that the regex has the best chance of recognising any phone number given to it.  The sequence of alternatives (…|…|…|…|…) needs to have the phone number formats ‘cascade’, as it were: start with the longest area codes and longest numbers first.  For example, 01xxx xxx xxx comes before 01xxx xxxxx, as the former is the longer number (the regex I found online had these the other way round).  The reason is that a regex engine tries alternatives from left to right and settles for the first one that matches at that position, so a shorter format listed first can match only part of a longer number and leave digits behind.
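
To see that left-to-right behaviour in isolation, here is a toy example (unrelated to the phone pattern itself, just to show the principle):

import re

# Alternatives are tried left to right and the first match wins, so a shorter
# alternative listed first can 'steal' part of a longer run of digits.
print(re.match(r"(?:\d{4}|\d{5})", "01632").group())  # '0163' (the shorter alternative wins)
print(re.match(r"(?:\d{5}|\d{4})", "01632").group())  # '01632' (longest listed first)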

Matches for my phone regex (phone numbers surrounded by ‘1’ characters). Notice that there are three capturing groups.

I am pleased with my regex’s performance on phone numbers, and it recognises a more comprehensive set of cases than any other regex I have found online.

There are some caveats to bear in mind for the regex:

  • The regex may recognise non-UK phone numbers as valid UK phone numbers, e.g. this could happen for phone numbers from countries such as Italy.
  • The regex may recognise a non-existent phone number as a valid UK phone number.

This is just the nature of regular expressions, as a regex is concerned with format.

I still think data validation that aims to exclude non-UK phone numbers would be a worthy programming task, but it needs to go beyond regex and join up with other software such as Excel (see future post for an implementation).  One task that would be especially interesting is to extract the area code from a phone number (e.g. 020 from a number written as 0207 xxx xxxx).
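
As a taster of the area code task, a crude first pass could be a prefix lookup against a list of known codes, longest code first.  This is only a sketch with a hypothetical helper: the code list below is a tiny illustrative subset, and a proper solution needs the full Ofcom list.

# Sketch only: a tiny, illustrative subset of area codes (not a complete list).
KNOWN_AREA_CODES = ["020", "0161", "016977"]

def extract_area_code(number):
    digits = "".join(ch for ch in number if ch.isdigit())
    # Try the longest codes first, so e.g. 016977 is not mistaken for a shorter code
    for code in sorted(KNOWN_AREA_CODES, key=len, reverse=True):
        if digits.startswith(code):
            return code
    return None

print(extract_area_code("0207 946 0000"))  # -> '020'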

In the next post on this project, I will present a Python program that takes clipboard input and passes it through my regex.

Exploring my music collection: foobar and folders

My music collection

Listening to music is one of my hobbies and I have a large collection of CDs.  (Huh, CDs, what are those again?)  All my CDs have been meticulously ripped to my computer as lossless files and then converted to high-bitrate MP3s.  I have been ripping my music as lossless files by habit, even during the heyday of peer-to-peer file sharing and before online music stores (e.g. iTunes Store) became popular, so that I can archive my collection.  Of course I use music streaming services such as Spotify today, but I use them more for music discovery, rather than for listening to my favourite albums.

I wanted to know ‘more’ about my music and make my listening experience fun and personal, so I started to use foobar2000 and was attracted by the variety of its components.  Over time, I have tagged my music files with all sorts of information, such as beats-per-minute (BPM) and ‘beat intensity’.  Through all my tagging, I have been able to produce playlists for my collection and search my collection more efficiently.

Without further ado, let’s see a glimpse of my music collection.

My music collection in foobar2000 (currently playing The Killers).

I have nearly 400 releases (albums, extended plays, singles, etc.) in my collection and that amounts to almost 19 days of music.

As that glimpse shows, I like to have a nice visual element to my music library and the listening experience.  I have album art for everything in my collection and other images such as the disc of an album (where available).  If anything, I think vinyl has had a resurgence in the last 10 years because of the large artwork (eye candy) and the more physical nature of the medium, compared to CDs and digital downloads.

There are so many questions that I think could be answered by writing programs for my collection; streaming services answer similar questions with their own algorithms and big data systems.  Here are some examples that I could explore:

  • Are there artists in my collection that tend to have songs with strong rhythms?  (Perhaps more than I might expect?)
    • I would expect genres such as house, funk, ska and electronic dance music to have songs with prominent rhythms.  In contrast, jazz numbers and ballads would not have prominent beats.
  • What songs could go into my party playlists?  e.g. BPM range of 100 to 140 and songs with a strong rhythm seem ideal.  How much music would I have left when filtering the playlist to a certain decade, e.g. 2000s?
  • Could I categorise my collection based on the tags and metadata, e.g. via K-means clustering?
  • What are the most common colours across the album artwork in my collection?  Is the black-and-white palette dominated by post-punk albums or certain artists (e.g. Joy Division, the xx)?
  • What are the longest and shortest songs in my collection, by: genre, artist, year, etc.?

Making it easier to maintain and search my collection

My library is organised into a directory with this folder structure:

[album artist]\[year] – [album]

It works well for finding files quickly, but it is very difficult to ‘zoom out’ or take a ‘macro’ view of my collection.  This is where programming can help: automating tedious tasks.  For example, if I want to do a batch rename of a selection of filenames, then a program can do that quickly and efficiently.

In my album folders, I can have a lot of ‘non-music’ files:

  • folder.jpg: the album art image
  • artist.jpg: image of the artist
  • disc.png: image of CD or record, with transparency
  • .log file with information on the music rip, e.g. AccurateRip statistics and any errors with the rip

As I have two separate libraries stored on different hard drives—my lossless and MP3 libraries—it can be hard to keep track of everything.  For example, I might have added album artwork to my MP3 library but not to my lossless library.

I have produced my own program to list directory paths that have image files, as one way to make it easier to maintain my music library.

These are some things I wanted to answer through my program:

  • How many images are in my music library’s directory?
  • How much hard drive storage is being used for the images?

With about 100 lines of code, I made a program that works a treat for my collection.

Music library images program: start of output

Here is a snippet of the code for the main function:


import os
import time


def list_images_in_lib():
    # Start time
    start_time = time.time()

    # 1: check album art
    tot_f, tot_f_size = (0, 0)
    print("\n", '*' * 20, "folder.jpg", '*' * 20, "\n")
    for root, albumArtists, albums in os.walk(os.getcwd()):
        # 'albums' is os.walk()'s list of file names in the current folder
        for file in albums:
            if file.endswith("folder.jpg"):
                print(os.path.join(root, file))
                tot_f += 1
                # Convert bytes to MB
                tot_f_size += os.path.getsize(os.path.join(root, file)) / 1000000
    print("\n", "Number of 'folder' images in library:", tot_f, "(%f MB)" % (tot_f_size))

    # 2: check artist art
    tot_a, tot_a_size = (0, 0)

    # ...LARGE SNIP...
    # ................
    print("\n", "Number of 'disc' images in library:", tot_d, "(%f MB)" % (tot_d_size))

    # Calculate amount of time needed to process all the images
    end_time = time.time()
    elapsed_time = end_time - start_time
    tot_images = tot_f + tot_a + tot_d
    tot_images_size = tot_f_size + tot_a_size + tot_d_size
    print("\n", "Time to process all", tot_images, "images:",
          round(elapsed_time, 6), "seconds",
          "(%f MB)" % (tot_images_size))

The list_images_in_lib function uses Python’s os.walk() and loops to get all the information needed for my files.  Fortunately, my music library is structured in a way that can make use of the os.walk() tuple results neatly.

The working directory here is ‘C:\…\MP3-library’.  Let’s give some intuition on how the os.walk() tuple is being used to reach an album:

  1. C:\Users\Mark\Music\MP3-library\
  2. Air\
  3. {2016} – Twentyears\

That is not enough information, as the item of interest is a file (e.g. folder.jpg, disc.png) in the album folder.  Another loop is needed to process the files: ‘for file in albums: […]’.  Within that loop, accumulator patterns and print statements produce the output of interest, e.g. the stats on file size and the file paths.
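
As a small illustration of that tuple once the walk reaches an album folder (the folder names follow the example above; the file listing is hypothetical):

import os

# Sketch: what one os.walk() tuple looks like at album level in this layout.
for root, dirnames, filenames in os.walk(r"C:\Users\Mark\Music\MP3-library"):
    # e.g. root      -> r"C:\Users\Mark\Music\MP3-library\Air\{2016} – Twentyears"
    #      dirnames  -> []  (album folders have no subfolders)
    #      filenames -> ["folder.jpg", "disc.png", ...]  (the 'albums' variable above)
    pass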

Music library images program: end of output

It takes about 6 seconds to traverse my music library and return all the output.  Perhaps there could be more clever use of string concatenation, or a way to write the code without a nested loop, but for my purposes the code works well and I would not benefit much from speed gains, even with a larger collection.

I was surprised that my average image size was anywhere near 1MB.

There is other functionality that I have included in my program but have not discussed in this post:

  • Check if an external hard drive is connected (e.g. my backup hard drive).
  • Choose music directory (local or external) to use for the program.

Tools that could be used to explore my music collection further

The foobar2000 Text Tools component can be used to gather the tags (metadata) for my music and eventually have all the music organised in a spreadsheet format.  A spreadsheet with all my songs opens up immense potential for statistical analysis and even machine learning (e.g. classification of songs).

Now that I have all my album folder paths as string output (after stripping ‘folder.jpg’), there is also potential to make more programs for library organisation, e.g. batch renaming.


Where should I go next?  It would be great to see how else people have analysed their music collections.

London’s Tube network: initial link analysis

In a previous blog post, I outlined my aims to analyse London’s current Tube network.

As a recap, these are two core challenges I want to tackle:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

The previous post focused on the first challenge, whereas this post follows on from it and approaches the second challenge: link analysis.

I hope to update these posts with links to my Jupyter notebooks, once I tidy up my ‘lab’ notebooks.


For my data, I want to make explicit some issues that affect the analysis, due to how the dataset is set up and the assumptions in this initial graph model:

  • All interchanges have the same cost (1 minute). In reality, interchanges within a station are not equal. Green Park has very long tunnels for an interchange, whereas stations such as Finsbury Park can have an interchange on a parallel platform.
  • The walking routes and out-of-station interchanges (OSIs) have a costly interchange between the station and its street (a result of the first assumption). This cost should be zero.  An example path with this issue is Bayswater → Bayswater [WALK] → Queensway [WALK] → Queensway.

These two problems can be solved together, once connections between platforms within a station are added. TfL has released data on interchange times within each station, but some data cleaning is needed before these connections can be added to the model.

Before improving the dataset, let us do some initial analysis to see how the graph is working.


Link analysis

I created a new NetworkX (nx) graph object, so that the graph is in a suitable format for link analysis.  Why a separate object?  Simply because the weights need to be specified differently depending on the analysis you want to perform:

  • For ‘shortest path’ work, the weights in the graph are the journey times (mins).
  • For ‘link analysis’ work, the weights in the graph are the inverses of the journey times, e.g. a 2-minute edge has weight 1/2 (see the sketch below).  This is how the graph needs to be set up for the link analysis algorithms to work as expected.

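The full setup code will stay in the notebook, but the inverse-weight idea itself is short.  A minimal sketch, assuming the journey-time graph is the graph_times object from the shortest-path post and that the edge weights are stored under ‘weight’:

import networkx as nx

# Sketch (not the notebook's exact setup): copy each edge from the journey-time
# graph and store the inverse of the journey time as the new weight.
graph_weights = nx.Graph()
for u, v, data in graph_times.edges(data=True):
    graph_weights.add_edge(u, v, weight=1 / data['weight'])
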
There are two link analysis algorithms we could run on the network fairly quickly, and they open up different interpretations:

  • PageRank
  • HITS algorithm (hubs and authorities)

Mark Dunne also used those algorithms for his graph analysis, so it would be interesting to see how my results compare.

As a brief note, the Tube is modelled as an undirected graph in my dataset, so the hubs and authorities analyses of the HITS algorithm are equivalent.  There are only a few edges in the network that are actually unidirectional (e.g. Heathrow loop; West India Quay is skipped in the DLR route towards Lewisham).

Setup

My dataset is only one CSV file and I loaded it into a Pandas dataframe.  The graph object for the link analysis will be called ‘graph_weights’.  I’ll leave the code for setting up these objects in the Jupyter notebook, as it’s fairly long and it will probably be simplified in a v2 project when the within-station interchanges are added to my dataset.

With the graph object set up, here is the code for setting up a dataframe of PageRank values for each node:


pagerank = nx.pagerank_numpy(graph_weights, weight='weight')
pagerank = pd.DataFrame.from_dict(pagerank, orient='index').reset_index()
pagerank.columns = ['node', 'pagerank']

Now the HITS computation, shown in full for completeness:


hits = nx.hits_scipy(graph_weights, max_iter=2500)
# Hub values
hits_hub = hits[0] # Get the dictionary out of the tuple
hits_hub = pd.DataFrame.from_dict(hits_hub, orient='index').reset_index()
hits_hub.columns = ['node', 'hub']
# Authority values
hits_auth = hits[1] # Get the dictionary out of the tuple
hits_auth = pd.DataFrame.from_dict(hits_auth, orient='index').reset_index()
hits_auth.columns = ['node', 'authority']
# Show hub and authority values
hits_all = pd.merge(hits_hub, hits_auth, on='node')

For some reason, I needed the max_iter argument to have a value (e.g. 2500) much higher than the default.
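
The top-20 tables in the next section then come from a simple sort of these dataframes; something along these lines (a sketch, with the column names defined above and hypothetical names for the sorted results):

# Sketch: take the 20 highest-scoring nodes from each dataframe.
top_pagerank = pagerank.sort_values('pagerank', ascending=False).head(20)
top_hubs = hits_all.sort_values('hub', ascending=False).head(20)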

Analysis

Here is my analysis, with the top 20 nodes shown for each algorithm (PageRank and hub values).  To be explicit about how my dataset is structured: a node that is just the station name is akin to the station’s ticket hall, and nodes with the line in square brackets are akin to station platforms.  I haven’t aggregated everything by station yet, as it is interesting to see how the analysis works with the nodes arranged like this for now.

The top 20 nodes by algorithm (PageRank and HITS)

The results for PageRank seem fairly intuitive:

  • Major stations have a high rank and tend to have many different services (multiple Underground, Overground or DLR services): e.g. Liverpool Street, Bank, Waterloo, Stratford, Willesden Junction.
  • Fairly small stations such as Euston Square and Wood Lane are ranked quite high. In these cases, the edges for OSIs and short walking routes could explain why they have high rankings.
  • The Poplar [DLR] node is the highest-ranked ‘platform’ (not shown in the image, but it is in the top 25, with value 0.001462). This seems reasonable, as every DLR service passes through the station; it may be the case that more journeys pass through Poplar than start or end there.  Earl’s Court [District] is next in the ranking and the intuition is similar.

The HITS results are much more striking in contrast:

  • Liverpool Street is by far the most prominent station in this HITS analysis.  All of its platforms and the ticket hall score higher than other nodes.
  • The main hubs in my dataset are all in the City of London: Liverpool Street, Moorgate, Bank, Barbican and Farringdon are all prominent.

It is interesting that Mark Dunne’s analysis has the main hubs in the West End: in particular, Oxford Circus, Green Park, Piccadilly Circus, etc.  I think this is largely because my dataset has all the current London Overground routes, including the Lea Valley Lines. This has brought many more routes (via Hackney) onto TfL’s map: it could be likened to a north-eastern Metro-land.

If this graph were extended to include National Rail routes from the London commuter belt (especially ones in TfL’s fare zones), then I would expect that stations such as Clapham Junction, Victoria, London Bridge and King’s Cross would be more prominent in the network. Inclusion of these routes is problematic, as they operate a variety of services (e.g. fast trains vs all-stations trains), so assigning values to edges is difficult.

Retrospective

Let’s wrap up this project in its current form and consider whether it is worthwhile to extend it any further.

  • What has worked well so far?
    • The initial analysis appears to represent the network well. The results seem intuitive.
    • The Excel workbook with all the data has been quick to edit when significant changes needed to be made.
    • Some functionality has been straightforward to implement, e.g. shortest path function.
  • What can be improved?
    • More interchange data can be added to the graph.
    • The interchange edges can be less clunky.
    • Data visualisation could add much more value to the analysis.

TfL have produced data on interchange times within stations and I seek to incorporate those in my dataset.  That would add a few hundred connections to my network: edges from platforms to ticket halls; edges between platforms in stations.  I have done some initial data cleaning on these connections, but the most time-consuming work will be matching TfL’s station names to my dataset’s station naming convention.
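
One possible first pass at that name-matching step (a sketch with a hypothetical helper, not necessarily the approach I will settle on) is fuzzy matching with Python’s difflib:

import difflib

# Sketch: match a TfL station name to the closest name in my dataset's
# naming convention (the names below are purely illustrative).
def closest_station(tfl_name, my_station_names):
    matches = difflib.get_close_matches(tfl_name, my_station_names, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(closest_station("Kings Cross St. Pancras", ["King's Cross St Pancras", "Euston"]))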

Traversing London’s Tube network

The TfL Tube map is becoming increasingly dense, as more and more lines and stations are added to it. (By the time I achieve anything with this graph analysis, there may be more tweaking to do because of Crossrail!)

I have tried to find a graph online that is set up for basic analysis, which accounts for:

  • Interchanges within station
  • Interchanges between stations, i.e. out-of-station interchanges (OSIs) and other short walking routes.
  • The latest TfL network, especially all the current London Overground routes.

There are interactive tools (e.g. Oliver O’Brien’s Tube Creature and Google Maps) that cover those three factors, but I have not found a workable dataset that can be used for my own graph analysis.

Mark Dunne’s Tube graph project, which used data from Nicola Greco, is the best graph analysis I have come across.  However, those data do not cover the three elements specified above.


These are two core challenges I want to tackle:

  • Calculate the shortest path between two stations.
  • Perform graph analysis / link analysis on the network.

I have created my own dataset to include all the Overground routes and added my own interchange edges. There are about 120 out-of-station interchanges (OSIs) and walking routes in my dataset (fairly even split between the two).

For my dataset, I have also designed an ID system that includes information about all the stations: e.g. Piccadilly Circus has ID 71034, where ’10’ shows that it is in Zone 1. This numbering system has been useful for producing pivot tables and bringing in an element of data verification.

Tube network connections in Excel

I used the NetworkX library (imported as nx) to set up the graph.

Journey times

A function

Here are the neighbours for Queen’s Park station: “Queen’s Park [Bakerloo]”, “Queen’s Park [WDCL]”, “Queen’s Park [WALK]”.

In other words, Queen’s Park station has its Bakerloo and Watford DC line platforms as neighbours, in addition to the ‘WALK’ platform, which arises from how my dataset is constructed to account for short walking routes.
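
For reference, that neighbour list can be pulled straight from the NetworkX graph.  A minimal sketch, using the graph_times object that also appears in the function below:

# Sketch: list a station node's neighbours in the journey-time graph.
print(list(graph_times.neighbors("Queen's Park")))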

Now let’s create a function to obtain the fastest route between two points.


def fastest_route(start, end):
    """
    Return the fastest path between the 'start' and 'end' points.
    Each station and interchange is printed, along with the journey time.

    Tip: use "" when calling the function, as escape characters may be needed with ''.
    """
    journey_path = nx.shortest_path(graph_times, start, end, weight='weight')
    journey_time = nx.shortest_path_length(graph_times, start, end, weight='weight')
    print('\nJOURNEY:', *journey_path, sep='\n\t')
    print('\nJOURNEY TIME:', journey_time, 'minutes')

Let’s try some examples with the function and see how it is working.


fastest_route("Queen's Park", "Brondesbury Park")

JOURNEY:

Queen’s Park
Queen’s Park [WALK]
Brondesbury Park [WALK]
Brondesbury Park

JOURNEY TIME: 12.0 minutes

Queen’s Park to Brondesbury Park is not an OSI, despite the two stations being only 0.5 miles apart on the same road (Salisbury Road) and on different Overground lines. On TfL’s map, people unfamiliar with the area might think that an interchange at Willesden Junction would be a faster journey.

Note: The distance between each station and its ‘walk’ node is a 1-minute journey due to the uniform assumption applied to all interchanges in the graph design.

Queen’s Park and Brondesbury Park stations are only half a mile apart. (Image source: Google Maps)

West Hampstead to West Ruislip

Let’s look at an interesting case: a journey from the West Hampstead area to West Ruislip station.

There are three different routes that take a similar amount of time (approx. 50 minutes, +/- a few minutes):

  • Take Jubilee line to Bond Street, then take the Central line direct to West Ruislip.
  • Get on the Metropolitan line (Finchley Road), leave at Ickenham and then walk from there to West Ruislip. That walking route is an OSI on the network.
  • Take Overground service to Shepherd’s Bush, then take the Central line direct to West Ruislip.

West Hampstead’s Underground, Overground and Thameslink stations are all separate stations.  These are the fastest routes based on my current graph:

  • West Hampstead Overground → West Ruislip: 50.5 mins, with interchange at Shepherd’s Bush.
  • West Hampstead Underground → West Ruislip: 48 mins, with interchange at Finchley Road and walk from Ickenham to destination.

If more data on actual interchange times were put in the dataset, then perhaps the shortest paths would change.  The Overground and Underground stations are less than 2.5 minutes apart in reality.

Thoughts so far

I was pleased with the results of my fastest route function and moved on to some graph analysis: I will show the findings in the next blog post.

The key thing to change in the data pipeline is to add the actual interchange times between stations’ platforms, in order to give more accurate journey times.

Note: I hope to publish the Jupyter notebook of this project later on and am trying not to dwell too much on the actual code in these blog posts.

The pursuit of data science: reflections from self-teaching so far

Introduction

As a first blog post, I thought it would be worthwhile to describe my journey so far in data science and present some reflections.  All my programming knowledge is self-taught, as I never studied programming formally.  Roamer-Too was the only ‘programming’ I ever really saw before university, and the number of people studying computer science degrees was falling while I was at school; my interest in computers and technology didn’t really translate into programming.

I have a quantitative economics background from my degree.  I gained experience with proprietary software such as Stata and SPSS during my studies, but alas I do not have the deep pockets to justify licences for my own use.  One of my main motivations for learning data science was to build up my experience in programming and have a (mostly) open-source toolkit that I can use to do data analysis, particularly regression.

The most challenging part of developing my data science skills was knowing where to start.  For many months I was dabbling with Julia and Python, all the while wondering whether R was the best option for me.  It was a struggle to weigh up which language was most suitable, and it was difficult to stay motivated when I had little spare time.

Python seemed the most versatile language for my goals and I was ultimately swayed by Anaconda and Jupyter notebooks.  Python’s syntax also feels like the most intuitive language for my mind.

I have been inspired by guides to data science and will give the path/roadmap I made below.

My roadmap

Core competences

  1. Installation; basic programming and computer science concepts
  2. Applications of programming to tasks
  3. Applications of programming to my research specialism
  4. Machine learning fundamentals

Programming exercises keep everything ticking over, but at some point it is time to bring everything together in a major project.

In my case, my first major project was modelling house prices through machine learning (regression).  This was not the clean Boston housing dataset: it involved creating a data pipeline from scratch, which also let me apply SQL.  I will revisit this project in a blog post, as it was a fun process.

Develop competences incrementally

The experience of a major project is great encouragement for tackling new projects.  Through an incremental approach, look at:

  1. Tutorials and documentation on libraries of interest (e.g. scikit-learn)
  2. New concepts as needed for projects (e.g. graph theory)
  3. How to be a better coder (e.g. PEP-8 practices)

Patience, young grasshopper

How long does it take to develop a solid grounding in data science?

That is the burning question for a beginner, and it does not have a standard answer, except: it depends! *sigh*

I would say that it took me 10 weeks of full-time study (equivalent) to feel comfortable with the core competences to pursue multiple projects.  In my case, that was starting with little understanding of computer science and programming, while feeling comfortable with core econometrics concepts (statistics, linear algebra, etc.).

The first project or programming milestone can be achieved quite soon after starting, though.  Victories of any scale along the way are encouraging!  After going through two Python textbook courses, I wrote my first program, which used a regex, OpenPyXL and Pyperclip to parse phone numbers.

Of course, the journey continues.

Resources

Throughout my journey, data science websites, YouTube, blogs and other materials have inspired me and helped me to develop a path that suits my goals.

The roadmap described is not really sequential, but the course materials I studied evolved into this order:

  • Interactive Python – Think like a computer scientist
  • Automate the Boring Stuff with Python
  • QuantEcon (first few parts)
  • Andrew Ng’s Machine Learning course
  • PostgreSQL tutorial

Those resources gave me the confidence to do my house price machine learning.

Here are some resources that have aided my incremental development:

  • DataSchool.io tutorial on scikit-learn
  • Subreddits
  • Stack Overflow community
  • Python documentation; Python libraries’ documentation
  • PEP-8 documentation
  • Interactive Python – Problem solving with algorithms and data structures
  • Conference talks from data scientists on YouTube
  • Cheat sheets for programming languages, concepts, etc.
  • Dataquest.io blog posts
  • Practice with ‘take-home’ assignments (interview preparation)

Main reflections

Here are the main reflections on my journey so far:

  • Installation of libraries and other components did frustrate me at the start (yes, even Anaconda).  I found Interactive Python incredibly useful, as all the programming exercises are done online and the course content has interactive demonstrations of Python’s control flow across topics.
  • It can be quite hard to achieve much in one-hour chunks of time (e.g. evenings) spread across a week.  I feel that I can achieve more by immersing myself in one day of seven hours of practice than in seven days of one-hour chunks.  Reading blogs and the like can be done continuously in small chunks and is useful as a break from learning.
  • It is worthwhile to look at many different sources for inspiration before establishing firm plans.  It was through QuantEcon that I discovered Interactive Python and Automate the Boring Stuff as recommended materials to read.  Communities such as Reddit are useful in getting multiple opinions.
  • Progress can stagnate suddenly, but see that as temporary.  I think it has been worth persisting with some course content even if it does not appear to be helping me to do my own programs.  QuantEcon has some very technical material, but even that content has some beautiful object-oriented programs and the course got me started with libraries such as SciPy and NumPy.
  • ‘Recent’ material becomes obsolete quickly!  Even tutorials produced a year ago can rely on code that has since been deprecated.  Some materials are still based on Python 2 instead of Python 3, so those differences may take additional time to pick up.  Scikit-learn is an evolving library, so the documentation needs to be consulted to get past deprecation warnings.  I followed a plotting tutorial that was interesting, but it used an API that was being phased out, so that tutorial became a dead end.

I have been inspired by other bloggers’ journeys but have not seen posts that give a brief roadmap and learning tips, so I hope this post is useful for others.