UK phone number parser: recognising phone numbers from text input

Introduction

In a previous post, I outlined business problems I encountered with data that contain UK phone numbers:

  • Dealing with spreadsheets or databases that have inconsistent phone number formats
  • Scraping phone number information from a clunky PDF (e.g. a directory) or webpage

I designed a regex to recognise UK phone numbers and think it is better than many of the example regular expressions listed online (at least, for what I was looking to achieve for my own program).


This is the sequence I followed in designing a program that recognises UK phone numbers and parses them to a consistent format:

  1. Write the regex
  2. Create program that processes an input
  3. Export the phone numbers to a workbook and process them

In this post, I am at the second step: creating a program that processes a text input.  This ended up being clipboard input, as web scraping is less versatile in my view and I am more likely to use the program on PDFs than on webpages.


Getting started with the program

The first building block for the program:

Define a regex and take clipboard input.  Then find all strings that match the regex pattern and return them as a list (of raw phone numbers).

Once that is done, it is possible to increase the functionality of the program later (e.g. export the list to Excel; convert the raw numbers to a consistent format).

Regex object

The regular expression was defined and tested in the previous post on this project.  The regex was tested using RegExr rather than in a Python IDE, but that testing served as a dry run before writing the program.

The compiled regex object is set up as follows, via the re module:


 import re

 phone_regex_UK = re.compile(r"""(
   [multi-line regex here]
     )""", re.VERBOSE | re.DOTALL)

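Two standard flags are passed to re.compile.  re.VERBOSE allows the pattern to be split over several lines with comments, and re.DOTALL makes the ‘.’ wildcard match newline characters as well.  The aim is for the regex to cope with long, multi-line blocks of text.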
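As a minimal sketch of what the re.VERBOSE flag allows, here is a toy pattern laid out over several lines with comments.  The pattern and the sample text are hypothetical and only illustrate the layout; this is not the UK phone regex itself.

```python
import re

# Hypothetical toy pattern (not the UK phone regex): under re.VERBOSE,
# whitespace and "#" comments inside the pattern are ignored.
toy_regex = re.compile(r"""(
    \d{3}      # three digits
    [\s-]?     # optional space or hyphen
    \d{4}      # four digits
    )""", re.VERBOSE)

print(toy_regex.findall("ring 555 1234 or 555-9876"))
# ['555 1234', '555-9876']
```

Whitespace inside a character class such as [\s-] is still respected, which is why separators can be matched even though the layout spaces are ignored.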

These are the three regex groups, as could be seen last time with RegExr:

  1. International dialling code (e.g. 00 44; +44(0)), if any.
  2. Phone number without leading 0.
  3. Extension information (e.g. ‘ext. 123’, ‘ext 123’), if any.

For example, this would be the breakdown of ‘0161 123 4567 ext. 1231’:

  1. [blank]
  2. ‘161 123 4567’
  3. ‘ext. 1231’
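To check a breakdown like this in Python rather than RegExr, one option is to compile the one-line version of the regex from the regex post in this series and print the groups.  This is only a sketch: the name UK_PHONE is my own, and the regex is compiled here without the extra outer parentheses used in the program.

```python
import re

# One-line regex from the regex post, split across raw strings for readability.
# UK_PHONE is a hypothetical name used only in this sketch.
UK_PHONE = re.compile(
    r"(?:(\(?(?:0(?:0|11)\)?[\s-]?\(?|\+?)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|\(?0)"
    r"((?:\d{5}\)?[\.\s-]?\d{4,5})|(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3}))"
    r"|(?:\d{4}\)?[\.\s-]?(?:\d{5}))|(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4})"
    r"|(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}))"
    r"(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?)"
)

match = UK_PHONE.search("0161 123 4567 ext. 1231")
print(match.groups())
# (None, '161 123 4567', 'ext. 1231')
```

With search, a group that did not take part in the match is reported as None; with findall, the same group comes back as an empty string.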

Clipboard input

I had considered incorporating a web scraping library for the program, but thought that there would need to be some element of visual inspection by an end-user to account for basic problems such as formatting and cumbersome websites.

The intent of the program was always to reduce the time needed for data cleaning, so clipboard input would be much more powerful and allow for more adaptability.

This block of code uses the pyperclip library to process clipboard text and datetime to help monitor the performance of the program:


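 import datetime
 import pyperclip

 print("Copying text from the clipboard ...")
 timer_start = datetime.datetime.now()  # start of timer
 raw_clip = str(pyperclip.paste())  # clipboard contents as a string
 pyperclip.copy(raw_clip)  # put the same text back on the clipboard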

Search functionality

Now that an input can be stored within an object in the program, more code is needed to process what is contained in that object.

The input could be a messy million-character string of text for a webpage or long PDF file, which could contain hundreds of phone numbers.  It is critical then to create code that achieves these things:

  • Returns the phone numbers in a consistent format: in essence, the dialling code is extraneous information and all the white space can be removed from ‘group 2’.
  • Fast performance: the program has to iterate over the matches, so the target is O(n) performance in the length of the input.  Sticking to simple list and string methods keeps performance at that level.

Here is code that loops over every match that the regex finds in the clipboard text and does just that:


# Search functionality block
# raw_list will store all the raw numbers recognised by regex
# matches_list will store numbers in a semi-standardised format
#    i.e. leading zeroes & no intl codes, but all else is messy
matches_list = []
raw_list = []
for groups in phone_regex_UK.findall(raw_clip):
   raw_num = groups[0]
   raw_list.append(raw_num)
   num = "0" + groups[2]
   num = ''.join(ch for ch in num if ch.isdigit())
   if groups[3] != "":
      ext = groups[3]
      ext = ''.join(ch for ch in ext if ch.isdigit())
      num += "x" + ext
   matches_list.append(num)

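Each phone number is stored as a raw_num (all content is kept with groups[0]).  Because the compiled pattern wraps the whole regex in an outer pair of parentheses, findall returns four items per match: groups[0] is the whole match, and groups[1] to groups[3] are the three groups described above.  The essential phone number detail is accessed via groups[2]: a leading 0 is restored, and a generator expression (ch for ch…) keeps only the digits, which removes spaces and other separators.

If there is an extension, its digits are appended in a form such as ‘x123’ for ‘ext. 123’.  At the end of each iteration, the ‘cleaned up’ number is appended to matches_list.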
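To see the search block at work without the clipboard, here is a small self-contained sketch.  It assumes the one-line regex from the regex post, wrapped in outer parentheses as in the compiled pattern; the function name extract_numbers and the sample text are hypothetical and only there for illustration.

```python
import re

# One-line regex from the regex post in this series.
ONE_LINE = (
    r"(?:(\(?(?:0(?:0|11)\)?[\s-]?\(?|\+?)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|\(?0)"
    r"((?:\d{5}\)?[\.\s-]?\d{4,5})|(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3}))"
    r"|(?:\d{4}\)?[\.\s-]?(?:\d{5}))|(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4})"
    r"|(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}))"
    r"(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?)"
)
# The outer parentheses mirror the compiled pattern in the post, so findall
# yields 4-tuples: (whole match, intl code, number, extension).
phone_regex_UK = re.compile("(" + ONE_LINE + ")")


def extract_numbers(text):
    """Hypothetical wrapper around the search block; returns (raw, cleaned)."""
    raw_list, matches_list = [], []
    for groups in phone_regex_UK.findall(text):
        raw_list.append(groups[0])
        # Restore the leading 0 and keep only the digits of the number
        num = "0" + "".join(ch for ch in groups[2] if ch.isdigit())
        if groups[3]:
            # Keep only the digits of the extension, marked with "x"
            num += "x" + "".join(ch for ch in groups[3] if ch.isdigit())
        matches_list.append(num)
    return raw_list, matches_list


sample = "Call 0161 123 4567 ext. 1231 or +44 20 7946 0018."
raw, cleaned = extract_numbers(sample)
print(raw)      # ['0161 123 4567 ext. 1231', '+44 20 7946 0018']
print(cleaned)  # ['01611234567x1231', '02079460018']
```

Returning both lists from one function is just a convenience for testing; the program itself keeps them as two module-level lists.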

 

So far all the processing action is being kept under the hood but the next post on this project will address program usability.  If thinking about the program in ‘minimum viable product’ terms, it is perhaps one step away: the functionality has been tested a fair amount, but more attention is needed on design and usability.

Recognising UK phone numbers through a regex

Introduction

These are some business problems that have motivated my desire to make a program focused on UK phone numbers:

  • Dealing with spreadsheets or databases that have inconsistent phone number formats
  • Scraping phone number information from a clunky PDF (e.g. a directory) or webpage

Through some experimentation with a regex, pyperclip and OpenPyXL, I eventually produced a program that recognises UK phone numbers and parses them to a consistent format.


In this post, I will focus on my regex.  There are many websites that have regular expressions for recognising UK phone numbers, but I did not find one that accounted well for features of phone numbers such as extension numbers and international call prefixes.

This will be the sequence of posts:

  1. Write the regex
  2. Create program that processes an input
  3. Export the phone numbers to a workbook and process them

Regex

Starting point

I will avoid discussing the intricacies of UK phone numbers here, so refer to Wikipedia’s article on UK phone numbers for detail on how the numbering system works.

These are features of phone numbers that will be important in a regex for my program:

  1. Access codes: e.g. 00 (international call prefix) and +44 (UK country code)
  2. Phone number without the leading 0
  3. Extension number (e.g. x1000 or ext1000)

These numbers correspond to my regex’s ‘groups’.  Of course, group 2’s information is of most interest, but the information in the other groups also needs to be processed.

I found quite a neat regex on a Stack Overflow thread (‘Match UK telephone number in any format’):


^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$

However, it would not work for my programming task as it stands, so I wanted to improve it in these ways:

  • Use groups in a regex
  • Stop digits from being excluded in some cases
  • Account for more characters (e.g. I have seen ‘.’ separator many times in phone numbers from UK companies, whose IT people are in the USA!)
  • Break up the regex so that it is easier to read and modify
Matches for the regex found online. Notice that some formats are not recognised.
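Some of these gaps can be reproduced in Python with a short sketch.  The test strings below are my own hypothetical examples, and SO_REGEX is just a name for the Stack Overflow regex quoted above.  Note that this regex is anchored with ^ and $, so it validates a whole string rather than finding numbers inside a longer text.

```python
import re

# The Stack Overflow regex quoted above, split across raw strings.
SO_REGEX = re.compile(
    r"^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))"
    r"(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))"
    r"|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))"
    r"(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$"
)

tests = [
    "0161 123 4567",            # plain number on its own
    "0161.123.4567",            # '.' used as the separator
    "0161 123 4567 ext. 1231",  # extension written with a space after 'ext.'
    "Call 0161 123 4567",       # number embedded in other text
]
for candidate in tests:
    print(candidate, "->", bool(SO_REGEX.search(candidate)))
# 0161 123 4567 -> True
# 0161.123.4567 -> False
# 0161 123 4567 ext. 1231 -> False
# Call 0161 123 4567 -> False
```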

I have not seen a thorough exposition of how a regex actually ‘works’ for UK phone numbers, so I will produce one here.  I hope that this post is useful in seeing how a regex can work, as I certainly came to appreciate more of the nuances of regular expressions while writing this one.

My regex

This is my regex, with comments in the code snippet to explain what is happening for some lines:


(?:
(\(?
(?:0(?:0|11)\)?[\s-]? # almost every country's exit code, e.g. 00 or 011
\(?|\+?)44\)?[\s-]? # UK country code (+44)
(?:\(?0\)?[\s-]?)?) # leading 0 (trunk prefix)
| # first of several alternations (|); more follow for the actual number
\(?0)((?:\d{5}\)?[\.\s-]?\d{4,5}) # e.g. 016977 xxxx
|
(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3})) # e.g. 01234 xxx xxx
|
(?:\d{4}\)?[\.\s-]?(?:\d{5})) # e.g. 01xxx xxxxx
|
(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4}) # e.g. 01xx xxx xxxx
|
(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}) # e.g. 020 xxxx xxxx
) # that's the largest capturing group over now
(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?) # e.g. x123, ext123


This is the one-line version, if you wish to paste it into RegExr for your own analysis:


(?:(\(?(?:0(?:0|11)\)?[\s-]?\(?|\+?)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|\(?0)((?:\d{5}\)?[\.\s-]?\d{4,5})|(?:\d{4}\)?[\.\s-]?(?:\d{3}[\.\s-]?\d{3}))|(?:\d{4}\)?[\.\s-]?(?:\d{5}))|(?:\d{3}\)?[\.\s-]?\d{3}[\.\s-]?\d{3,4})|(?:\d{2}\)?[\.\s-]?\d{4}[\.\s-]?\d{4}))(?:[\s-]?((?:x|ext[\.\s]*|\#)\d{3,4})?)

What I found is that the alternatives in the regex need to be in a precise order, so that the regex has the best chance of recognising any phone number given to it.  The sequence of alternatives (…|…|…|…|…) needs to have the phone number formats ‘cascade’, as it were: start with the longest area codes and longest numbers first.  For example, 01xxx xxx xxx comes before 01xxx xxxxx as the former is the longer number (the regex I found online had this the other way round).  The reason is that alternation is ordered in engines such as Python’s re: the alternatives are tried from left to right and the first one that allows a match wins, even if a later alternative would have matched more text.  A shorter format listed first can therefore match and leave the final digit of a longer number out of the match.
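The effect of the order can be seen with a toy pair of patterns.  This is a minimal sketch using only the local part of a number (hypothetical digits), not the full regex: the same two alternatives are tried in both orders against the same six digits.

```python
import re

text = "567890"

# Shorter alternative first: \d{5} matches and the last digit is left out.
shorter_first = re.search(r"\d{5}|\d{3}[\s-]?\d{3}", text)
# Longer alternative first: all six digits are matched.
longer_first = re.search(r"\d{3}[\s-]?\d{3}|\d{5}", text)

print(shorter_first.group())  # 56789
print(longer_first.group())   # 567890
```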

Matches for my phone regex (phone numbers surrounded by ‘1’ characters). Notice that there are three capturing groups.

I am pleased with my regex’s performance with phone numbers and it is more comprehensive in the cases it recognises than any other regex I have found online.

There are some caveats to bear in mind for the regex:

  • The regex may recognise non-UK phone numbers as a valid UK phone number.  e.g. this could happen for phone numbers from countries such as Italy.
  • The regex may recognise a non-existent phone number as a valid UK phone number.

This is just the nature of regular expressions: a regex is concerned only with the format of a number, not with whether that number has actually been allocated.

I still think data validation that aims to exclude non-UK phone numbers would be a worthy programming task, but it needs to go beyond regex and join up with other software such as Excel (see future post for an implementation).  One task that would be especially interesting is to extract the area code from a phone number (e.g. 020 from a number written as 0207 xxx xxxx).

In the next post on this project, I will present a Python program that takes clipboard input and passes it through my regex.