perl programming, regular expressions, ipv4 addresses, and failed job interviews

Introduction

Some time ago I had a phone interview to test my programming skills and after a few basic questions with some live coding – how would you implement memcpy()? how would you implement memmove() – and after an expansion for exponentiation that was out of my league without consulting Knuth – the challenge was given:

“Write a regular expression to match an IPv4 address.”

…and I felt really good about this – I consider myself to be a serious regexp geek, not Spencer / Wall / Schwartz / Rossum, but I have at least written a few lexers, parsers, matchers and a bunch of fileglob routines – so I started out with the obvious base for refinement:

\d+\.\d+\.\d+\.\d+

“Yes, but that will match any octets.”

Yes, yes, quite true – so let’s do some restricted matching on it:

\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

…or in fact I wrote:

(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})

…because (having done this a few hundred times in the past two decades) I reasoned that the programmer would want to parse-out the octets at this point, and so it’d be worthwhile saving them; my interviewer followed through with:

“Yes, but that will still match values larger than 255!”

Well, OK – I thought – if you want to start being strict with the parser you could use the regular expression engine to reject inputs outside the range 0..255; you can do that, but you are starting to overload the code. Nonetheless playing along in the spirit of the endeavour (and with the clock ticking) – I began to mod the code, and eventually each octet started to look like:

(\d|\d\d|1\d\d|2(([0-4]\d)|5[0-5]))

…the total looking somewhat like:


(\d|\d\d|1\d\d|2(([0-4]\d)|5[0-5]))\.
(\d|\d\d|1\d\d|2(([0-4]\d)|5[0-5]))\.
(\d|\d\d|1\d\d|2(([0-4]\d)|5[0-5]))\.
(\d|\d\d|1\d\d|2(([0-4]\d)|5[0-5]))

…but I said to the guy “Look, you really don’t want to be doing it this way, it’ll spend all of its time backtracking…” which brought the prompt retort:

“It will not! There is no backtracking!” [ed: verbatim, I think]

Hmm. Okay. Maybe that was the wrong way to explain it?

What I was trying to convey was that the pattern matcher would be thrashing backwards and forwards (because of all the parenthetical OR-separated subgroups) looking for fixed points to anchor against; but we were very short for time and in fact the interview was soon terminated for overrunning – just after I pasted a code-fragment of how I would more verbosely but I felt more wisely go about solving the problem…

[update: see also the DFA vs: NFA comments in the Reddit thread]

Yet one week later the feedback from the interview was deemed to have been “negative” – and that was the end of that job application.

So be it. Maybe I need to be able to write my own exponentiation routine to be a real programmer?

However, not being one to let a performance issue drop I decided to document the experience for my blog, and hence this posting.

Anyone who’s read the first edition of The Camel Book[1] will remember the section that tells you that multiple, simple regular expressions are dealt with much faster than long and complex ones, and it’s kinda obvious why that happens when you consider what it’s trying to achieve – but the cost/scale of the problem often gets missed.

But let’s put that aside those concerns for a moment and look at three other issues:

Non-Performance Concerns

  1. Code Maintainability: the above is atrocious from the perspective of maintenance, it’s the sort of thing which unfairly gives Perl a bad reputation. It looks complex, the core pattern is repeated four times – meaning four things to poke, four places to make mistakes – plus the above is actually buggy, as we will soon demonstrate.
  2. Code Coupling: you should not be using your parser to do semantic validation; syntactic validation certainly, but in compiler design you don’t hotwire lex to check for integer overflow, instead you say this looks like an integer and pass it off to another module which does input validation and optimisation.

    Perhaps my interviewer knew this, perhaps not; but it’s very easy (but incorrect) to assume that parsing comes for free and any validation processing that is put on top of that is high and extra cost.

  3. Code Complexity: what happens when the next specification change requires you to match only legitimate class-A/B/C addresses, and skip multicast. What then? It’s already bad enough…

So those are three more reasons not to solve the challenge this way, but I am making a performance claim so let’s get back to the experiment:

Lab Conditions

I created a standard blob of test data from 2Gb of Apache webserver logfiles and errorfiles; actually it was 700-ish megabytes of log data concatenated thrice, but that’s close enough for government work.

I then built a small Perl harness for pushing this 2Gb file through a regular expression, doing some small amount of pointless work to ensure that nothing is being optimised-away, while printing the matches for debugging.

Finally: I wrote a shellscript wrapper to run the whole thing against the clock.

All code for this posting is available at:code.google.com/p/regular-expression-experiment; the stuff in the posting will likely be a little munged by HTML making it inoperable.

The Perl Harness

Small and simple, the Perl harness is this:

$pat = qr!( REGULAR EXPRESSION )!ox;

while (<>) {
    while (s/$pat/-/o) {
	print "$1\n";
    }
}

The Challenge

So let’s run again through the development and refinement of this regular expression challenge; first let’s define our own problem statement:

Take a large, arbitrary logfile; extract from it anything that looks like a legitimate IPv4 address in decimal-dotted-quad notation, possibly multiple ones per line, possibly embedded within strings of text (hostnames?) rather than nicely tokenised. For the moment ignore the problem of leading zeros and/or zero-padding. Make three timed passes over the data for each test. Quiescent machine with 4Gb RAM so the input file can prettymuch live in the disk buffer cache.

The Simple Approach

Let’s start with the simple pattern which will produce wrong results:

$pat = qr!( \d+\. \d+\. \d+\. \d+ )!ox;

runtimes: 1m37.020s 1m37.204s 1m36.845s

We know this pattern will match illegal strings like “1000.1000.1000.1000″ however it provides a useful fast-and-dumb baseline time of about 1m37s. If we could guarantee that no illegal data were in the file, this would be the time we would have to beat – but then that would be no fun.

Note that we only have one set of parenthesis in the pattern, saving out the whole thing for printing; if we also save the individual octets with more parenthses for later processing, what’s the extra cost?

$pat = qr!( (\d+)\. (\d+)\. (\d+)\. (\d+) )!ox;

runtimes: 1m48.240s 1m48.152s 1m48.114s

Answer: it costs an extra 11 seconds with no other code changes; not much, but it does demonstrate that regular expression simplicity is a factor in performance, even if you change nothing else.

Now let’s try and legitimise the output and print only those strings with legitimate “octets” which are in the range 0..255:

$pat = qr!( (\d+)\. (\d+)\. (\d+)\. (\d+) )!ox;

...

print "$1\n" if (($2 < 256) and ($3 < 256) and ($4 < 256) and ($5 < 256));

runtimes: 2m7.649s 2m8.172s 2m7.414s

The test shows that this works really quite well, and the four octet string-to-int-and-comparison cost for every line of output has added only another 20s to runtime.

SUMMARY FOR THIS SECTION: 2m7s is the time to beat.

The Slightly More Clever Matching Approach

So let’s try to get clever – after all, all that string-to-int conversion has to be more expensive than regular expressions, surely? A succession of development is necessary, however:


$pat = qr!( \d{1,3}\. \d{1,3}\. \d{1,3}\. \d{1,3} )!ox;

There’s a bug in the above code because the regexp is not adequately greedy and so if it encounters a string like foo-12.34.56.1000-bar then it will return 12.34.56.100 ignoring the trailing “0″ that would disqualify this from being an IP address. [nb: EXAMPLE FIXED]

In short, this regexp would synthesise bad data, so let’s improve it:


$pat = qr!( (\d{1,3})\. (\d{1,3})\. (\d{1,3})\. (\d{1,3}) )!ox;

...

print "$1\n" if (($1 < 256) and ($2 < 256) and ($3 < 256) and ($4 < 256));

Nope, you can’t do this because the match passes the bad data into the integer comparison test. It’s useful to do this to get a sense of the timings but bzzzzt! try again:

$pat = qr!( (\d{1,3})\. (\d{1,3})\. (\d{1,3})\. (\d{1,3})\b # ANCHOR TO WORD BOUNDARY )!ox;

...

print "$1\n" if (($2 < 256) and ($3 < 256) and ($4 < 256) and ($5 < 256));

runtimes: 2m32.568s 2m33.067s 2m32.716s

That’s clever, isn’t it? Anchor the tail of the regexp to a word boundary so that the last octet must be comprised of between one and three digits at the end of a word; plus you still have to check for “octets” like 999.

And it takes about 25 seconds longer than the dumb-parse-and-compare method.

Ouch.

SUMMARY FOR THIS SECTION: “smarter pattern matching” = 120% FAIL.

The Complex Matching Approach

Let’s throw more science at the problem; first-up are the non-solutions:

$pat = qr!(
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5]) # nongreedy, RE will take shortest match
)!ox;

Oy veh, this is a revision of the behemoth from above. It’s also buggy. The problem is the left-to-right evaluation of the parethesised subgroups – in short if you feed 255.255.255.255 into this match, you’ll probably get 255.255.255.2 returned to you because the first \d matched the first character of the final octet, and returned that.

I have no idea whether there’s a mandatory left-to-right execution of matches within groups (POSIX?) so the results might end up being implementation-dependent – and that is always a bad thing.

Never fear – we can just reverse the order:

$pat = qr!(
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)

…except that this one suffers the same “unanchored” problem as the “slightly more clever” set, so that 12.34.56.789 synthesises 12.34.56.78, data that can’t be an IP address and didn’t exist in the original.

However we know how to fix this, don’t we?

$pat = qr!(
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\.
(25[0-5]|2[0-4]\d|1\d\d|\d\d|\d)\b # ANCHOR TO WORD BOUNDARY
)!ox;

runtimes: 3m8.960s 3m8.586s 3m8.445s

So solving this problem by pushing values (“…or a three digit number beginning with ’25′…”) into the regular expression has added 1 minute to the program runtime.

SUMMARY FOR THIS SECTION: “complex pattern matching” = 148% FAIL.

That’s really bad, and that’s what I was trying to get across to my interviewer; but let’s try and salvage this approach, just one last time…

The Hybrid Approach

$pat = qr!(
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d|\d\d|1\d\d|2[0-4]\d|25[0-5])\.
(\d+) # GREEDY BUT NEEDS TO BE BOUNDSCHECKED
)!ox;

...

print "$1\n" if ($5 < 256);

runtimes: 2m54.555s 2m53.972s 2m53.931s

You get an intermediate result, recovering about 14 seconds of runtime by not trying to coersce the matcher into working out how big a decimal number should be, and instead letting it get on with identifying things that look like sequences of decimal numerals; and that's only with respect to the final octet.

SUMMARY FOR THIS SECTION: "hybrid pattern matching" = 136% FAIL.

So it looks like fast-and-dumb wins... except there's one more twist in the tale.

The Anchoring Bug

Remember we anchored a few of the patterns above with \b to tie the final octet to a word boundary? Well comparison of the output showed those algorithms to be rejecting strings of the form:

foo192.168.1.1bar

...because the final octet was not on a "word boundary" as defined by Perl regular expressions; in Perl \b is defined as a interchange between alphanumeric and non-alphanumeric and the final octet in the example above is not on such an interchange; so all of those implementations are also dropping potentially-viable inputs. Plus the anchoring issue also applies to the head-end, too, which we missed in the first round of regexps. Argh.

This aspect beat me - I stared at the regular expressions technical reference for a while trying to address this before deciding "to hell with this, it's going to still be slower than the parse-and-convert method".

Which (again) was my point all along.

The Summary

There are still a bunch of open questions and ambiguities in the above problem definition, and guesses that we have to make about potential inputs and whether they create viable outputs. Also: what about leading zero padding? Are they permissible? If so, how many zeroes should be permitted? Some of the regular expressions cited above permit leading zeros, others don't.

Not to mention: what about IPv6? :-)

But to get the performance boosts I've described above, I can at least provide the following advice:

  1. you - or someone - really ought to be specifying the problem really clearly before implementation
  2. and then you should be implementing in a style that satisfies all of the performance and maintainability issues that are raised above.
  3. use your patternmatcher to match patterns, not to compare values.
  4. k.i.s.s. works really well for regexp.
  5. this is a question which is really hard to treat properly in 10 minutes.
  6. the camel book is still epic.

--
[1] I have January 1991 first edition, love it so much that I rescued it from my Saab after crashing it in 1993 (winter, bend, black ice) into a river and flooding the car; I washed the book in pure water to remove the silt and spent four weeks drying it in a cupboard before I could open it again.

22 thoughts on “perl programming, regular expressions, ipv4 addresses, and failed job interviews

  1. alecm Post author

    have reviewed it now – the point is that:

    >I’m not sure about your complaint about the greediness of \d{1,3}

    …because it is constrained by the “3″ it is incapable of grabbing the fourth digit which prevents the final octet from making sense; there’s no way for the regexp as written to check what follows without using some kind of complex lookahead pattern, hence the subsequent anchoring issue

    UPDATE UPDATE UPDATE

    I realise that when writing this up, I flubbed the main text description totally. What I am trying to convey is that this happens:


    $ echo foo-12.34.56.1000-bar | three-digits-1.pl,bug
    12.34.56.100

    ie: something that is clearly not an IP address, gets converted into one.

    Reply
  2. Mark Pettit

    If you don’t care about capturing the octets, I’d be curious what the performance of this regex would be:


    (?:(?<!\d)|\A)(?:(?:\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.){3}(?:\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])(?:(?!\d)|\z)

    I believe that's a legal, legitimate, maximally-optimized regex that will match legal four-octet IP addresses and only legal four-octet IP addresses. Specifically, it will not match "127.1" which is a legal IP in some cases (try "ping 127.1" from the command line) or "2130706433" (which is the decimal equivalent for "127.0.0.1") which is also legal in some places.

    It uses negative look-around assertions instead of \b.

    Reply
  3. Mark Pettit

    An additional comment: I always ask this question when I interview people for a coding-focused Sysadmin position. Nobody has *ever* pointed out the performance shortcomings of the more complicated regexes. If they had, I would have highly recommended they be hired. Most people get lost when I point out the problems with the most basic \d+\.\d+\.\d+\.\d+ regex.

    Shame on the guy for cutting you off and not understanding your point when you raised a legitimate issue with the more complicated regex.

    Reply
  4. P

    exponentiation: montgomery multiplication – an exercise I did in learning the little bit of python I know so far

    def exp_2(a,b):
    ans=1
    part=a
    while (b>0):
    if (b%2):
    ans *= part
    # print ‘A=’,a, ‘B=’,b, ‘PART=’,part, ‘ANS=’,ans
    b = int(b/2)
    part *= part
    return ans

    Reply
  5. v shah

    Try /D (non-/d) instead of /b, but I am not sure if it will work for end of line.

    Reply
    1. alecm Post author

      @ v shah / @ imd

      yes, but you have to allow for “\D or perhaps end of line” because you are pulling this out of raw data files, so once again you have a messy complex expression.

      @ mike:

      if you use “\D*” (rewriting your comment) then you’ve forgotten that “(foo)*” at the end of a regexp is essentially redundant – you are saying “zero or more instances of foo”, and in the expression “1234″ against “\d\d\d\D*” then 123 will match because there is a “zero or more non-digits” anchor string between the 3 and the 4, so to speak.

      $ perl -e 'print "$1\n" if ("1234" =~ /(\d\d\d)\D*/)'
      123

      there would also be an overlap match of 234, if we were being non-destructive.

      @ P

      yes, that kind of thing.

      @ all

      I am amazed how many programmers – including me – try to “fix” the regexp and take absolutely no notice of the three other maintainability (etc) issues.

      Reply
  6. alecm Post author

    MARK PETIT REGEXP RESULTS

    I took Mark Petit’s awesome regex from the above and dropped it into the framework, quiesced the machine and ran the same tests against it as everything else. I had to wrap a master pair of (parens) around the regexp to pull out the text that had been matched, but otherwise[1] left it vanilla:

    http://code.google.com/p/regular-expression-experiment/source/browse/trunk/petit-1.pl

    All the code and results have been posted back to the repo.

    http://code.google.com/p/regular-expression-experiment/source/browse/trunk/results-petit.txt

    I ran it against simple-3 to make sure everything is about the same, and yes simple-3 produced the same times as yesterday: 2m8.271s 2m8.941s 2m8.941s

    Then I ran petit-1: 5m2.360s 5m1.968s 5m2.028s

    Sorry Mark, your regexp adds almost 3 minutes / a factor of 2.37x increase in execution time – but that is a very interesting bit of research.

    Your regexp also _does_ produce results which were very similar to the simple-3 output, with the difference being that simple-3 would match/permit zero padding like “6.011.025.001″ whereas petit-1 does not.

    If it’s important to reject zero-padding, the next question is whether a factor of 2.3x provides enough time in which to make the checks; my bet would be “yes” but I’ve not tried it yet.


    [1] ps: I did also try a petit-2.pl, attempting to use one of the groups in the regexp itself to extract the IP address string, rather than adding a master set of parens. No dice, nothing wanted to help.

    Reply
  7. alecm Post author

    PERL::REGEXP::COMMON RESULTS

    At the behest of Reddit, I did a test using Regexp::Common:

    http://code.google.com/p/regular-expression-experiment/source/browse/trunk/regexp-common-1.pl

    As with Mark’s code I had to wrap it in (parens) but rather than mess around I embedded that straight into the match, since I was importing a prefab regexp from a library

    Runtimes: 2m35.002s 2m34.942s 2m34.999s

    from: http://code.google.com/p/regular-expression-experiment/source/browse/trunk/results-regexp-common.txt

    So basically it is performing about the same as the “Slightly More Clever Matching Approach”, perhaps a little bit slower.

    It is also – like some other complex-regexp-based code in this blogpost – generating bad output from implausible input:


    $ echo 5.9.1.14019 | ./regexp-common-1.pl
    5.9.1.140

    Put differently: using Regexp::Common over brute force numeric comparison takes 1.2x longer and may be giving you one kind of potentially bad output; but at least it’s cheap to code and someone else’s problem, is it not?

    Reply
  8. Niall

    I think this is a case where the interview process clearly failed to arrive at the right answer. Sadly interviewing processes like this don’t often produce deterministic results :(

    Good analysis though,

    N

    Reply
  9. alecm Post author

    Philosophical thought:

    Maybe simple-3.pl is simple the duck-typing approach to IP-address matching?

    ie: rip out four things that could plausibly be integer octets and check if they fit within bounds. It’s pretty hard to argue with that.

    Everything else is doing classic pattern matching. More slowly, with more complexity, and suffering bugs.

    Reply
  10. alecm Post author

    MORE PATTERNS

    Fripltister of Reddit suggested this code:

    http://code.google.com/p/regular-expression-experiment/source/browse/trunk/fripletister-1.pl

    …which ran in 2m23s but only produced an output 53% of the size of simple-3.pl; this may be a anchoring issue, we’ll look into it because a lot of thought went into the regexp; but until then this performance figure is dubious and the algorithm should be considered busted.

    Perlgeek of Reddit suggested:

    http://code.google.com/p/regular-expression-experiment/source/browse/trunk/perlgeek-1.pl

    …which bears repetition because of its Perliness:

    my $octet = qr/\b(\d{1,3})\b(??{ $^N <= 255 ? '' : '(?!)'})/;
    my $pat = qr/((?:$octet\.){3}$octet)/;

    …which ran in 24m48s, well beyond a factor of 10x worse, but did seem to generate plausible output; that said I went out for a nice meal while it was running.

    The search to beat brute-force simple-3.pl continues. I recommend simpler approaches.

    Reply
  11. Anonymous Coward

    Great article. Completely agree with you on the principles.

    There’s another bug in your more more complicated regular expressions. Just like the trailing regex truncation problem, there’s the leading regex truncation problem. So

    \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
    will wrongly match 1234.56.78.9012 as
    234.56.78.901

    More reason for keeping the regex simple.

    Reply
  12. Pingback: HELP: Advice to a new graduate applying for a job in Pentesting & Security, first interview – dropsafe

Leave a Reply