My GitHub page is live

Pleased to announce that my GitHub page is now live: @exmachinadata.

It has been quite a learning curve to get a Jupyter notebook onto GitHub and to understand the nuances of its features (e.g. gists).

My first repository is my London Tube repo.  I hope to publish some gists and to integrate my GitHub page with this website more fully – watch this space!


The pursuit of data science: reflections from self-teaching so far

Introduction

As a first blog post, I thought it would be worthwhile to describe my journey in data science so far and present some reflections.  All my programming knowledge is self-taught, as I never studied programming formally.  Roamer-Too was the only ‘programming’ I ever really saw before university, and the number of people studying computer science degrees was falling while I was at school; my interest in computers and technology never really translated into programming.

I have a quantitative economics background from my degree.  I gained experience with proprietary software such as Stata and SPSS during my studies, but alas I do not have the deep pockets to justify licences for my own use.  One of my main motivations for learning data science was to build up my experience in programming and to have a (mostly) open-source toolkit that I can use for data analysis, particularly regression.

The most challenging part of developing my data science skills was knowing where to start.  For many months I dabbled with Julia and Python, all the while wondering whether R was the best option for me.  It was a struggle to weigh up which language was most suitable, and it was difficult to stay motivated when I had little spare time.

Python seemed the most versatile language for my goals, and I was ultimately swayed by Anaconda and Jupyter notebooks.  Python’s syntax also feels the most intuitive to me.

I have been inspired by guides to data science, and I set out below the roadmap I made for myself.

My roadmap

Core competences

  1. Installation; basic programming and computer science concepts
  2. Applications of programming to tasks
  3. Applications of programming to my research specialism
  4. Machine learning fundamentals

Programming exercises keep everything ticking over, but eventually it is time to bring everything together in a major project.

In my case, my first major project was modelling house prices through machine learning (regression).  This was not the clean Boston housing dataset: it involved creating a data pipeline from scratch in order to apply SQL.  I will revisit this project in a blog post, as it was a fun process.
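
As a taster, here is a minimal, hypothetical sketch of that kind of pipeline: pull records out of PostgreSQL with pandas and fit a regression with scikit-learn.  The connection string and the `sales` table (with `price`, `floor_area` and `bedrooms` columns) are invented for illustration, not the actual project.

```python
# Hypothetical sketch of a SQL-to-regression pipeline.
# The database, table and column names are invented for illustration.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sqlalchemy import create_engine

# Assumed local PostgreSQL database; adjust the connection string as needed.
engine = create_engine("postgresql://user:password@localhost:5432/housing")

# Pull the cleaned sale records into a DataFrame.
df = pd.read_sql("SELECT price, floor_area, bedrooms FROM sales", engine)

X = df[["floor_area", "bedrooms"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```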

Develop competences incrementally

The experience of completing a major project is great encouragement for tackling new ones.  Taking an incremental approach:

  1. Work through tutorials/documentation on libraries of interest (e.g. scikit-learn)
  2. Learn new concepts as needed for projects (e.g. graph theory)
  3. Learn how to be a better coder (e.g. PEP 8 practices)

Patience, young grasshopper

How long does it take to develop a solid grounding in data science?

That is the burning question for a beginner, and it does not have a standard answer, except: it depends! *sigh*

I would say that it took me the equivalent of 10 weeks of full-time study to feel comfortable enough with the core competences to pursue multiple projects.  In my case, that meant starting with little understanding of computer science and programming, while already being comfortable with core econometrics concepts (statistics, linear algebra, etc.).

The first project or programming milestone can be achieved quite soon after starting, though.  Victories of any scale along the way are encouraging!  After going through two Python textbook courses, I wrote my first program: it implemented a regex and used OpenPyXL and Pyperclip to parse phone numbers.
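
For anyone curious, here is a minimal sketch in the spirit of that first program (not the original code): it reads free text from a hypothetical spreadsheet, matches UK-style phone numbers with a regex, and copies the results to the clipboard.  The workbook name, column layout and phone number pattern are all assumptions.

```python
# A sketch in the spirit of my first program; the workbook name, column
# layout and phone number pattern are assumptions for illustration.
import re

import openpyxl   # pip install openpyxl
import pyperclip  # pip install pyperclip

# Simple pattern for UK-style numbers, e.g. 020 7946 0958 or 07700 900123.
PHONE_RE = re.compile(r"\b0\d{2,4}[ -]?\d{3,4}[ -]?\d{3,4}\b")

workbook = openpyxl.load_workbook("contacts.xlsx")
sheet = workbook.active

# Scan column A for anything that looks like a phone number.
numbers = []
for (cell,) in sheet.iter_rows(min_col=1, max_col=1, values_only=True):
    if cell:
        numbers.extend(PHONE_RE.findall(str(cell)))

# Put the matches on the clipboard, one per line.
pyperclip.copy("\n".join(numbers))
print(f"Copied {len(numbers)} phone numbers to the clipboard.")
```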

Of course, the journey continues.

Resources

Throughout my journey, data science websites, YouTube, blogs and other materials have inspired me and helped me to develop a path that suits my goals.

The roadmap described is not really sequential, but the course materials I studied evolved into this order:

  • Interactive Python – Think like a computer scientist
  • Automate the Boring Stuff with Python
  • QuantEcon (first few parts)
  • Andrew Ng’s Machine Learning course
  • PostgreSQL tutorial

Those resources gave me the confidence to take on my house price machine learning project.

Here are some resources that have aided my incremental development:

  • DataSchool.io tutorial on scikit-learn
  • Subreddits
  • Stack Overflow community
  • Python documentation; Python libraries’ documentation
  • PEP 8 documentation
  • Interactive Python – Problem solving with algorithms and data structures
  • Conference talks from data scientists on YouTube
  • Cheat sheets for programming languages, concepts, etc.
  • Dataquest.io blog posts
  • Practice with ‘take-home’ assignments (interview preparation)

Main reflections

Here are the main reflections on my journey so far:

  • Installation of libraries and other components did frustrate me at the start (yes, even Anaconda).  I found Interactive Python incredibly useful, as all the programming exercises are done online and the course content has interactive demonstrations of Python’s control flow across topics.
  • It can be quite hard to achieve much in one-hour chunks of time (e.g. evenings) spread across a week.  I feel that I can achieve more by immersing myself for one day with seven hours of practice, rather than seven days of one-hour chunks.  Reading blogs and the like can be done continuously in small chunks, though, and is useful as a break from study.
  • It is worthwhile to look at many different sources for inspiration before establishing firm plans.  It was through QuantEcon that I discovered Interactive Python and Automate the Boring Stuff as recommended materials.  Communities such as Reddit are useful for getting multiple opinions.
  • Progress can stagnate suddenly, but see that as temporary.  It has been worth persisting with some course content even when it did not appear to be helping me write my own programs.  QuantEcon has some very technical material, but even that content contains some beautiful object-oriented programs, and the course got me started with libraries such as SciPy and NumPy.
  • ‘Recent’ material becomes obsolete quickly!  Even tutorials produced a year ago can rely on code that has since been deprecated.  Some materials are still based on Python 2 rather than Python 3, so those differences may take additional time to pick up.  Scikit-learn is an evolving library, so the documentation needs to be consulted to get past deprecation warnings (see the sketch after this list).  I followed a plotting tutorial that was interesting, but it used an API that had been phased out, so that tutorial became a dead end.
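
On the deprecation point, here is a small, generic sketch (standard library only; nothing here is specific to scikit-learn): while updating old tutorial code, warnings can be promoted to errors so that outdated calls fail loudly instead of scrolling past.

```python
import warnings

# Make deprecation-style warnings impossible to miss while porting old code.
warnings.simplefilter("error", DeprecationWarning)
warnings.simplefilter("error", FutureWarning)
```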

I have been inspired by other bloggers’ journeys but have not seen posts that give a brief roadmap and learning tips, so I hope this post is useful for others.