In a previous post, I outlined business problems I encountered with data that contain UK phone numbers:
- Dealing with spreadsheets or databases that have inconsistent phone number formats
- Scraping phone number information from a clunky PDF (e.g. a directory) or webpage
I designed a regex to recognise UK phone numbers and think it is better than many of the example regular expressions listed online (at least, for what I was looking to achieve for my own program).
This is the sequence I followed in designing a program that recognises UK phone numbers and parses them to a consistent format:
- Write the regex
- Create a program that processes an input
- Export the phone numbers to a workbook and process them
This post covers the middle step: creating a program that processes a text input. The input ended up being the clipboard, as web scraping is less versatile in my view and I am more likely to use the program on PDFs than on webpages.
Getting started with the program
The first building block for the program:
Define a regex and take clipboard input. Then find all strings that match the regex pattern and return them as a list (of raw phone numbers).
Once that is done, it is possible to increase the functionality of the program later (e.g. export the list to Excel; convert the raw numbers to a consistent format).
The regular expression was defined and tested in the previous post on this project. The testing was done in RegExr rather than in a Python IDE, but it served as a dry run before writing the program.
The RegexObject is set up as follows, via the re library:
phone_regex_UK = re.compile(r"""( [multi-line regex here] )""", re.VERBOSE | re.DOTALL)
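Two standard flags are passed so that the regex copes well with long blocks of text: re.VERBOSE allows the pattern to be written over several lines with whitespace and comments, and re.DOTALL makes '.' match newline characters as well.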
These are the three regex groups, as could be seen last time with RegExr:
- International dialling code (e.g. 00 44; +44(0)), if any.
- Phone number without leading 0.
- Extension information (e.g. ‘ext. 123’, ‘ext 123’), if any.
For example, this would be the breakdown of ‘0161 123 4567 ext. 1231’:
- ‘0161 123 4567’
- ‘ext. 1231’
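To illustrate how groups like these come back from findall, here is a minimal sketch. The pattern demo_regex is a simplified stand-in written only for this example (it is not the regex from the previous post), and the sample text is made up. Because the whole pattern is wrapped in an outer pair of brackets, each match comes back as a tuple whose first element is the whole match, followed by the three groups.

```python
import re

# Simplified stand-in pattern (an assumption for illustration only,
# NOT the full regex from the previous post).
demo_regex = re.compile(r"""(
    (?:(\+44(?:\s?\(0\))?|00\s?44)\s?|0)   # dialling code (captured) or a bare leading 0
    (\d{2,4}\s?\d{3,4}\s?\d{3,4})          # phone number without leading 0
    (?:\s*(ext\.?\s*\d{1,5}))?             # extension, if any
)""", re.VERBOSE | re.DOTALL)

sample = "Call 0161 123 4567 ext. 1231 or +44 161 123 4567."
matches = demo_regex.findall(sample)

for whole, intl_code, number, extension in matches:
    # Groups that did not take part in the match come back as empty strings.
    print(repr(whole), repr(intl_code), repr(number), repr(extension))
```

In this sketch the first tuple has an empty dialling code and the second has an empty extension, which is why the later processing can simply test for an empty string.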
I had considered incorporating a web scraping library for the program, but thought that there would need to be some element of visual inspection by an end-user to account for basic problems such as formatting and cumbersome websites.
The intent of the program was always to reduce the time needed for data cleaning, so clipboard input would be much more powerful and allow for more adaptability.
This block of code uses the pyperclip library to process clipboard text and datetime to help monitor the performance of the program:
import datetime
import pyperclip

print("Copying text from the clipboard ...")
timer_start = datetime.datetime.now()  # timer
raw_clip = str(pyperclip.paste())
pyperclip.copy(raw_clip)
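The timer only becomes useful once it is compared with a later timestamp. That step is not shown here, so the following is a minimal sketch of how it could look. The helper elapsed_seconds and the stand-in text are hypothetical names and data for this example, not part of the program.

```python
import datetime


def elapsed_seconds(start, end=None):
    """Return the number of seconds between start and end (end defaults to now)."""
    if end is None:
        end = datetime.datetime.now()
    return (end - start).total_seconds()


timer_start = datetime.datetime.now()

# Stand-in for the clipboard text (assumption: real input comes from pyperclip.paste()).
text = "0161 123 4567 " * 1000
char_count = len(text)

print(f"Processed {char_count} characters in {elapsed_seconds(timer_start):.4f} s")
```

For finer-grained timing, time.perf_counter() from the standard library is an alternative to subtracting datetime objects.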
Now that an input can be stored within an object in the program, more code is needed to process what is contained in that object.
The input could be a messy million-character string of text for a webpage or long PDF file, which could contain hundreds of phone numbers. It is critical then to create code that achieves these things:
- Returns the phone numbers in a consistent format: in essence, the dialling code is extraneous information and all the white space can be removed from ‘group 2’.
- Fast performance: the program has to iterate over the input, so the target is O(n) performance, i.e. running time that grows linearly with the size of the input. List and string methods can be used to keep performance to that level.
Here is code that loops over every match that the compiled regex finds in the clipboard text and does just that:
# Search functionality block
# raw_list will store all the raw numbers recognised by regex
# matches_list will store numbers in a semi-standardised format
#     i.e. leading zeroes & no intl codes, but all else is messy
# Each item from findall is a tuple; the indices below assume the layout
# [0] whole match, [1] dialling code, [2] number, [3] extension
matches_list = []
raw_list = []
for groups in phone_regex_UK.findall(raw_clip):
    raw_num = groups[0]
    raw_list.append(raw_num)
    num = "0" + groups[2]
    num = ''.join(ch for ch in num if ch.isdigit())
    if groups[3] != "":
        ext = groups[3]
        ext = ''.join(ch for ch in ext if ch.isdigit())
        num += "x" + ext
    matches_list.append(num)
Each phone number is first stored as raw_num: the whole match, held in the first element of groups, is kept as it is. The essential phone number detail is then taken from the element of groups that holds the number without its leading 0. A "0" is added to the front and everything that is not a digit, including spaces, is stripped out by passing a generator expression (ch for ch…) to ''.join().
If there is an extension number, it is appended in a form such as ‘x123’ for ‘ext. 123’. At the end of each iteration, the ‘cleaned up’ number is appended to matches_list.
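The clean-up step can also be tried on its own, without the clipboard or the full regex. This is a minimal sketch: the function name normalise and the two example tuples are made up for illustration, and the tuple layout (whole match, dialling code, number, extension) is an assumption about what findall returns.

```python
def normalise(groups):
    """Turn one findall tuple into a digits-only number with an optional 'x' extension.

    Assumes the layout (whole match, dialling code, number, extension).
    """
    # Restore the leading 0 and keep only the digits of the number.
    num = "0" + "".join(ch for ch in groups[2] if ch.isdigit())
    # Keep only the digits of the extension, if there is one.
    ext = "".join(ch for ch in groups[3] if ch.isdigit())
    return f"{num}x{ext}" if ext else num


# Hand-written example tuples (hypothetical data).
examples = [
    ("0161 123 4567 ext. 1231", "", "161 123 4567", "ext. 1231"),
    ("+44 161 123 4567", "+44", "161 123 4567", ""),
]

cleaned = [normalise(groups) for groups in examples]
print(cleaned)  # ['01611234567x1231', '01611234567']
```

Both spellings of the same number end up with the same digits, which is the point of discarding the dialling code and the white space.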
So far all the processing action is being kept under the hood but the next post on this project will address program usability. If thinking about the program in ‘minimum viable product’ terms, it is perhaps one step away: the functionality has been tested a fair amount, but more attention is needed on design and usability.