Sunday, December 14, 2014

Regular Expressions

Regular expressions (regex or regexp) are a pattern-matching syntax most commonly used for finding or replacing strings. They were defined by the mathematician Stephen Kleene in the 1950s and have since made their way into a good number of languages and applications. You'll want to be familiar with them: it could mean the difference between having an excellent tool in your arsenal and having something that haunts you throughout your career in software engineering. If you're lucky, you'll find a project that forces you to apply them long enough to know them like the back of your hand. If you aren't, you'll run into scenarios here and there and find yourself reviewing the same obscure hieroglyphics over and over again.

Prerequisites: Complete the lessons at regexone. Also check out this excellent video series on Regular Expressions by Derek Banas.

Real World Applications

Pattern Matching

I'll get right into the scenarios where I find myself leveraging regex the most. The first is the need to perform more complex searches in my IDE. Most IDEs have regex capabilities built in. Since I use Vim, regex is seamlessly integrated into the search function. Here are some common regexes I've written to search through JavaScript files:

Note: Vim has its own flavor of regex. In its default mode, you have to escape the plus sign when you want to check for repetition, and escape the parentheses when you want to group, which is the opposite of Perl-style regexes.

// javascript functions to match
function thing(a, x,y, b,c) {};
function thingDoThis(a, b,c) {};

var thing = function(x, y, z){
    console.log('hello');
};

var thingDo = function(){
    console.log('hello');
};

var thingDo = function(x, y){
    console.log('hello');
};

var thing_do =function(z){
    console.log('hello');
};
this.thing_do =function(a, b,c){};
var self = this;
self.thing_do = function(asdf, f){};
// find all of the named function declarations
/function\s\+\w\+([a-zA-Z\,\ ]*)\s*/

// find all of the function expressions assigned to a variable or property
/\(var\s\+\|this\.\|self\.\)\w\+\s*=\s*function([a-zA-Z\,\ ]*)\s*/

// convert all named function declarations into function expressions assigned to variables
:%s/\(function\s\)\+\(\w\+\)\(([a-zA-Z\,\ ]*)\)\s*/var\ \2\ =\ function\3/g

As you can see, the regexes can get very complicated. I'll break down some of the characters below, but it might help to refer to vimregex.com if you're at all interested in Vim. Regular expressions take a significant amount of time to learn. I really recommend finding every excuse to expose yourself to them every day.

// full regex in Vim
/function\s\+\w\+([a-zA-Z\,\ ]*)\s*/

// breakdown of each component
/             # start the search
function      # must start with "function"
\s\+          # followed by one or more whitespace characters
\w\+          # and one or more word characters; the function name
(             # followed by an open parenthesis
[a-zA-Z\,\ ]* # that optionally contains parameters; any combination of letters, spaces, and commas
)             # followed by a closing parenthesis
\s*           # and optional whitespace
/             # end the search
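
For comparison, here is a rough sketch of the same search in Python, where the escaping rules are reversed: `+` is a quantifier as written, while literal parentheses need a backslash. The names FUNCTION_DECLARATION and SAMPLE_LINES are made up for this illustration, and the sample lines are taken from the JavaScript snippet earlier in the post.

```python
import re

# Python translation of the Vim search, written with re.VERBOSE so each
# component can carry its own comment.
FUNCTION_DECLARATION = re.compile(
    r"""
    function        # the literal keyword "function"
    \s+             # one or more whitespace characters
    \w+             # one or more word characters; the function name
    \(              # a literal open parenthesis
    [a-zA-Z, ]*     # zero or more letters, commas, and spaces
    \)              # a literal closing parenthesis
    \s*             # optional trailing whitespace
    """,
    re.VERBOSE,
)

SAMPLE_LINES = [
    "function thing(a, x,y, b,c) {};",
    "function thingDoThis(a, b,c) {};",
    "var thing = function(x, y, z){",
    "this.thing_do =function(a, b,c){};",
]

# Keep only the lines that contain a named function declaration.
matching = [line for line in SAMPLE_LINES if FUNCTION_DECLARATION.search(line)]
print(matching)
```

Only the first two sample lines match, because the other two have no function name between "function" and the open parenthesis.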

Hopefully, it's easier to understand broken up that way. Now we'll break down a regex that involves searching and replacing. We'll skip some of the basics this time and focus on something called capture groups. In Vim regexes, capture groups are designated by escaped parentheses, and they store a piece of the match for backreferencing. Let's look at our example:

// full regex in Vim
:%s/\(function\s\)\+\(\w\+\)\(([a-zA-Z\,\ ]*)\)\s*/var\ \2\ =\ function\3/g

// breakdown of each component
:%s                     # tell vim to substitute on every line of the file
/                       # start the search
\(function\s\)\+        # capture group 1 is one or more instances of "function" followed by a whitespace character
\(\w\+\)                # capture group 2 is a word; the function name
\(([a-zA-Z\,\ ]*)\)     # capture group 3 is the function's parameter list, including the parentheses
\s*                     # followed by optional whitespace
/                       # end the search and begin the replace rules
var\ \2\ =\ function\3  # replace with "var ", capture group 2, " = function" and capture group 3
/                       # end the replace
g                       # tell vim to replace every match on a line, not just the first
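
The same capture-group idea can be sketched in Python with re.sub, which uses the same \2 and \3 backreference syntax in the replacement string. The function name to_function_expression is hypothetical, chosen only for this example.

```python
import re

# Group 1: "function" plus a whitespace character (one or more times).
# Group 2: the function name.
# Group 3: the parameter list, including its parentheses.
DECLARATION = re.compile(r"(function\s)+(\w+)(\([a-zA-Z, ]*\))\s*")


def to_function_expression(source: str) -> str:
    """Rewrite `function name(args)` as `var name = function(args)`."""
    return DECLARATION.sub(r"var \2 = function\3", source)


converted = to_function_expression("function thing(a, x,y, b,c) {};")
print(converted)  # var thing = function(a, x,y, b,c){};

# Function expressions have no name after "function", so they are left alone.
unchanged = to_function_expression("var thingDo = function(x, y){")
print(unchanged)  # var thingDo = function(x, y){
```

As in the Vim version, the trailing `\s*` is part of the match but not part of any group, so the space before the opening brace is dropped in the output.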

And there you go. If you're being exposed to this for the first time, this may be a bit overwhelming. Just keep practicing the basics outlined in the prerequisites and you'll get there.

Validating Password Inputs

Let's say you need to make sure a password starts with a letter, is between 4 and 20 characters long, and contains only alphanumeric characters and a handful of symbols:

^[A-Za-z][A-Za-z0-9@#$%^&+=]{3,19}$

And to make that reusable, you would do the following in Python:

import re
password = input("Enter string: ")
# The leading [A-Za-z] enforces the first letter; {3,19} allows 4 to 20 characters in total.
if re.match(r'^[A-Za-z][A-Za-z0-9@#$%^&+=]{3,19}$', password):
    print("match")
else:
    print("no match")
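
One way to package the check for reuse is a small function around a precompiled pattern. This is a minimal sketch assuming the rules stated above (first character is a letter, 4 to 20 characters in total, only the listed symbols allowed); the names PASSWORD_PATTERN and is_valid_password are made up for illustration.

```python
import re

# One leading letter, then 3 to 19 more allowed characters: 4 to 20 in total.
PASSWORD_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9@#$%^&+=]{3,19}")


def is_valid_password(candidate: str) -> bool:
    """Return True only if the whole string satisfies the password rules."""
    # fullmatch requires the pattern to cover the entire string, so no
    # explicit ^ and $ anchors are needed.
    return PASSWORD_PATTERN.fullmatch(candidate) is not None


print(is_valid_password("pass@word1"))  # True
print(is_valid_password("1password"))   # False: starts with a digit
print(is_valid_password("abc"))         # False: too short
```

Compiling the pattern once at import time avoids repeating the regex string wherever the check is needed.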

Other Scenarios

  • Parsing text files for database importing
  • Finding and replacing text
  • Validating email entries and other types of inputs
  • Offloading validation and other processes to the client-side
  • Parsing log files
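
As a taste of the last scenario, here is a small sketch that pulls fields out of a log line using named capture groups. The log format (date, time, level, message separated by single spaces) is hypothetical, as are the names LOG_LINE and parse_log_line; real log formats will need their own pattern.

```python
import re

# Assumed format: "YYYY-MM-DD HH:MM:SS LEVEL message text"
LOG_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)


def parse_log_line(line):
    """Return a dict of the named fields, or None if the line does not fit."""
    match = LOG_LINE.fullmatch(line)
    return match.groupdict() if match else None


parsed = parse_log_line("2014-12-14 10:15:32 ERROR Disk quota exceeded")
print(parsed)
```

Named groups keep the calling code readable, since fields are looked up by name rather than by position.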

That concludes this article. Check back for a follow-up article where we'll be applying this to log parsing for Logstash.
