Recognising UK phone numbers through a regex

Introduction

These business problems motivated me to write a program focused on UK phone numbers:

  • Dealing with spreadsheets or databases that have inconsistent phone number formats
  • Scraping phone number information from a clunky PDF (e.g. a directory) or webpage

Through some experimentation with a regex, pyperclip and OpenPyXL, I eventually produced a program that recognises UK phone numbers and parses them to a consistent format.


In this post, I will focus on my regex.  There are many websites that have regular expressions for recognising UK phone numbers, but I did not find one that accounted well for features of phone numbers such as extension numbers and international call prefixes.

This will be the sequence of posts:

  1. Write the regex
  2. Create a program that processes an input
  3. Export the phone numbers to a workbook and process them

Regex

Starting point

I will avoid discussing the intricacies of UK phone numbers here, so refer to Wikipedia’s article on UK phone numbers for detail on how the numbering system works.

These are features of phone numbers that will be important in a regex for my program:

  1. Access codes: e.g. 00 (international call prefix) and +44 (UK country code)
  2. Phone number without the leading 0
  3. Extension number (e.g. x1000 or ext1000)

These numbers correspond to my regex’s ‘groups’.  Of course, group 2’s information is of most interest, but the information in the other groups also needs to be processed.

I found quite a neat regex on a Stack Overflow thread (‘Match UK telephone number in any format’):


^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$

However, it does not work for my programming task as it stands, so I wanted to improve it:

  • Use groups in a regex
  • Stop digits from being excluded in some cases
  • Account for more characters (e.g. I have seen the ‘.’ separator many times in phone numbers from UK companies whose IT people are in the USA!)
  • Break up the regex so that it is easier to read and modify
Matches for the regex found online. Notice that some formats are not recognised.
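
To make the unrecognised formats concrete, here is a minimal Python sketch that runs the Stack Overflow regex against a few made-up sample numbers. The name ONLINE_REGEX and the samples are my own assumptions for illustration; the pattern itself is copied from above and only split across lines for readability.

```python
import re

# The Stack Overflow regex quoted above, split into adjacent raw strings.
ONLINE_REGEX = re.compile(
    r"^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))"
    r"(?:(?:\d{5}\)?[\s-]?\d{4,5})"
    r"|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))"
    r"|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})"
    r"|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))"
    r"(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$"
)

samples = [
    "020 7946 0958",             # spaces as separators
    "020.7946.0958",             # '.' is not in the separator class [\s-]
    "Call 020 7946 0958 today",  # ^ and $ require the whole string to be a number
]

for sample in samples:
    match = ONLINE_REGEX.search(sample)
    print(repr(sample), "->", "match" if match else "no match")
```

Only the first sample should match. Even then, match.groups() is empty, because every group in the pattern is non-capturing, so there is nothing to pull the access code or extension out with.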

I have not seen a thorough exposition of how a regex actually ‘works’ for UK phone numbers, so I will produce one here.  I hope that this post is useful in showing how a regex can work, as I certainly came to appreciate more of the nuances while writing this one.

My regex

This is my regex, with comments in the code snippet to explain what is happening for some lines:


(?:
(\(?
(?:0(?:0|11)\)?[\s-]? # almost every country's exit code, e.g. 00 or 011
\(?|\+?)44\)?[\s-]? # UK country code (+44)
(?:\(?0\)?[\s-]?)?) # leading 0 (trunk prefix)
| # alternation: either the access code above or the plain leading 0 below
\(?0)((?:\d{5}\)?[\.\s-]?\d{4,5}) # e.g. 016977 xxxx
|
(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3})) # e.g. 01234 xxx xxx
|
(?:\d{4}\)?[\.\s-]?(?:\d{5})) # e.g. 01xxx xxxxx
|
(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4}) # e.g. 01xx xxx xxxx
|
(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}) # e.g. 020 xxxx xxxx
) # that's the largest capturing group over now
(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?) # e.g. x123, ext123


This is the one-line version, if you wish to paste it into RegExr for your own analysis:


(?:(\(?(?:0(?:0|11)\)?[\s-]?\(?|\+?)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|\(?0)((?:\d{5}\)?[\.\s-]?\d{4,5})|(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3}))|(?:\d{4}\)?[\.\s-]?(?:\d{5}))|(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4})|(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}))(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?)
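
As a rough sketch of how the three capturing groups come out in Python, the snippet below compiles the one-line regex and prints the groups for a few made-up numbers. The names UK_PHONE and split_number are hypothetical helpers for this illustration, not part of the program described in this series.

```python
import re

# One-line regex from this post, split into adjacent raw strings.
# It has no ^/$ anchors, so search() can find a number inside longer text.
UK_PHONE = re.compile(
    r"(?:(\(?(?:0(?:0|11)\)?[\s-]?\(?|\+?)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|\(?0)"
    r"((?:\d{5}\)?[\.\s-]?\d{4,5})"
    r"|(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3}))"
    r"|(?:\d{4}\)?[\.\s-]?(?:\d{5}))"
    r"|(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4})"
    r"|(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}))"
    r"(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?)"
)


def split_number(text):
    """Return (access code, number without leading 0, extension), or None."""
    match = UK_PHONE.search(text)
    return match.groups() if match else None


samples = [
    "020 7946 0958",
    "+44 20 7946 0958",
    "01632.960.123",
    "07700 900123 x123",
]

for sample in samples:
    print(sample, "->", split_number(sample))
```

A group that did not take part in the match comes back as None, and group 1 keeps any trailing separator (e.g. ‘+44 ’ with the space), so a later processing step would need to strip it. The commented multi-line version should compile the same way if re.VERBOSE is passed as the flag, since whitespace and # comments are then ignored.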

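What I found is that the alternatives need to be in a precise order, so that the regex has the best chance of recognising any phone number given to it.  The sequence of alternatives (…|…|…|…|…) needs to have the phone number formats ‘cascade’, as it were: start with the longest area codes and the longest numbers.  For example, 01xxx xxx xxx comes before 01xxx xxxxx as the former is the longer number (the regex I found online had this the other way round).  The reason is that a backtracking engine such as Python’s re module tries the alternatives from left to right and settles on the first one that lets the overall match succeed.  My regex has no ^ and $ anchors, so a shorter format listed first would match and leave the final digit behind; the regex I found online is anchored, which forces the engine to keep trying alternatives until the whole string fits.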
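
The effect of the ordering can be seen with two deliberately simplified patterns. These are illustrative stand-ins that I have made up for this sketch, not excerpts from the full regex, and the number is fictitious.

```python
import re

number = "01632 960123"  # six digits after the area code

# Shorter format (five digits) listed first.
short_first = re.compile(r"0\d{4}\s?\d{5}|0\d{4}\s?\d{3}\s?\d{3}")
# Longer format (six digits) listed first.
long_first = re.compile(r"0\d{4}\s?\d{3}\s?\d{3}|0\d{4}\s?\d{5}")

print(short_first.search(number).group())  # stops after five digits
print(long_first.search(number).group())   # takes all six digits

# With ^ and $, the first alternative cannot satisfy $, so the engine
# backtracks and tries the second alternative instead.
anchored = re.compile(r"^(?:0\d{4}\s?\d{5}|0\d{4}\s?\d{3}\s?\d{3})$")
print(anchored.search(number).group())
```

The unanchored short-first pattern silently drops the last digit, which is the ‘digits being excluded’ problem that the cascade avoids.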

Matches for my phone regex (phone numbers surrounded by ‘1’ characters). Notice that there are three capturing groups.

I am pleased with my regex’s performance with phone numbers and it is more comprehensive in the cases it recognises than any other regex I have found online.

There are some caveats to bear in mind for the regex:

  • The regex may recognise non-UK phone numbers as valid UK phone numbers.  For example, this could happen with phone numbers from countries such as Italy.
  • The regex may recognise a non-existent phone number as a valid UK phone number.

This is just the nature of regular expressions, as a regex is concerned only with format and not with whether a number actually exists.
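
A small sketch of that limitation, using a simplified stand-in pattern of my own (a leading 0, a four-digit area code and six further digits) rather than the full regex:

```python
import re

# Hypothetical, simplified pattern: it checks the shape of the string only.
shape_only = re.compile(r"0\d{4}\s?\d{6}")

candidates = [
    "01632 960123",  # plausible-looking, made-up number
    "00000 000000",  # right shape, but no UK national number starts with 00
]

for candidate in candidates:
    print(candidate, "->", bool(shape_only.fullmatch(candidate)))
```

Both candidates pass, because the pattern has no knowledge of which number ranges are actually in use.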

I still think data validation that aims to exclude non-UK phone numbers would be a worthy programming task, but it needs to go beyond a regex and join up with other software such as Excel (see a future post for an implementation).  One task that would be especially interesting is to extract the area code from a phone number (e.g. 020 from a number written as 0207 xxx xxxx).

In the next post on this project, I will present a Python program that takes clipboard input and passes it through my regex.
