Basic Regular Expressions
I got an e-mail about my blog on JDK 5 features asking for details on how to replace StringBuffer with StringBuilder in a large codebase. I am going to use it as an opportunity to talk a little about regular expressions, even though:
- Many, if not most of you, reading this probably already know all about regular expressions. If you are able to sneer at this post as too remedial, then good for you. And I don’t mean that sarcastically. I think having even a cursory understanding of regular expressions can dramatically improve your productivity, both in moving around in your IDE, and in the code you write.
- Regular expressions are probably overkill for this particular problem. But hey, I think I can still squeeze something useful out of it.
Tools
Here are two indispensable "tools" for working with regular expressions:
- The book Mastering Regular Expressions, 2nd Edition, by Jeffrey Friedl. If you are using regular expressions for the first time, it explains everything. If you are experienced with them, it’s a great reference. If you haven’t already, you could just get the book and stop reading the rest of this blog. But I’ll continue anyway.
- The RegEx Coach lets you try out all your regular expressions interactively. It is a standalone program.
I also see there is a Java-specific RegEx book, Java Regular Expressions: Taming the java.util.regex Engine. It has good reviews on Amazon, but I have not had a chance to read it.
There is also an Eclipse plugin for regular expressions, Regular Expression Tester. I tried it during the JBuilder 2007 development phase and it looked promising. The neat thing about it is that when you copy a regular expression and paste it into your Java code, it adds the extra backslashes. (Background: Regular expressions, just like Java, use the backslash as an escape character. Thus when you code a regular expression in Java, you have to escape the escape character. You might end up with a lot of backslashes in a particular expression, and it’s easy to miss escaping one.) Something bothered me about the plugin though. I think it took over a keystroke I commonly use, but I don’t recall for sure; I’ve installed so many different plugins over the last year that my memory is hazy. I just remember being very busy and deciding I didn’t have time to figure out the issue, so I ended up uninstalling it. I’ll have to give it another try.
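To make the backslash point concrete, here is a minimal sketch using java.util.regex. The class and method names are made up for illustration. It shows that the regular expression \w+ has to be typed as "\\w+" inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class BackslashEscapingSketch {
    // The regular expression \w+ as a Java string literal: the backslash is doubled
    // because the Java compiler consumes one level of escaping first.
    static final String WORD = "\\w+";

    // True if the whole input consists of one or more word characters.
    static boolean isWord(String input) {
        return Pattern.matches(WORD, input);
    }

    public static void main(String[] args) {
        System.out.println(WORD.length());      // prints 3: the characters \, w and +
        System.out.println(isWord("foo_1"));
        System.out.println(isWord("foo bar"));  // the space is not a word character
    }
}
```

Forgetting one of the two backslashes usually shows up as a compile error ("illegal escape character"), but for escapes that Java itself understands, such as \t or \b, the mistake can silently change the meaning of the expression.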
Replacing StringBuffer with StringBuilder
As I mentioned in my previous post, I went through the JBuilder codebase and replaced all instances of StringBuffer with StringBuilder. I don’t recall how many I ended up replacing, but I believe it was at least in the hundreds. Now I could have just gone into the IDE and done a global search and replace of the literal string "StringBuffer" with the literal string "StringBuilder". But I didn’t want to do that in case we were using StringBuffer anywhere as a non-initialized field, because there might be threading issues, or as a parameter to a method, or in any other case that I can’t think of. Basically I wanted to change all lines of the type:
StringBuffer sb = new StringBuffer();
with
StringBuilder sb = new StringBuilder();
The problem with a literal search/replace is of course that the variable may not always be called sb. We can easily handle that with a regular expression. All of the following can be entered in the JBuilder 2007/Eclipse Search | File | File Search. Make sure you have the Regular expression checkbox selected. Tip: In that dialog, if you press Ctrl+Space in the Containing text field, you get Code Assist for regular expressions.
So let’s start out with a simple regular expression:
StringBuffer (\w+) = new StringBuffer\(\);
Let’s look at the new "(\w+)". The \w is a regular expression construct standing for a word character (0-9, a-z, A-Z, _). The + means one or more consecutive instances of the preceding construct. Our regular expression will find any line instantiating a StringBuffer variable whose name consists of 1…n word characters. The parentheses around the \w+ are not literal. They group that part of the match, and we will need the group when we do the replace. Since the parenthesis is a special character in regular expressions, we need to escape it in the expression when we want to look for a literal parenthesis in a string. That’s why the expression ends with \(\);. It means we are looking for the literal left parenthesis and the literal right parenthesis, followed by a semicolon.
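For readers who want to try the expression outside the IDE, here is a small sketch with java.util.regex. The class and method names are hypothetical, and the backslashes are doubled because the expression lives in a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class SimpleDeclarationSketch {
    // StringBuffer (\w+) = new StringBuffer\(\);  written as a Java string literal.
    static final Pattern DECLARATION =
            Pattern.compile("StringBuffer (\\w+) = new StringBuffer\\(\\);");

    // Returns the variable name captured by group 1, or null if the line does not match.
    static String variableName(String line) {
        Matcher m = DECLARATION.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Matches: the variable name "result" ends up in group 1.
        System.out.println(variableName("StringBuffer result = new StringBuffer();"));
        // Does not match: the escaped parentheses must be adjacent, so "(16)" fails.
        System.out.println(variableName("StringBuffer sb = new StringBuffer(16);"));
    }
}
```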
One important thing to note is that we could further refine the expression so that it only matches legal Java variable names. A Java variable name cannot begin with a digit, and it cannot be a reserved word such as class. Our current expression does not check for either of those cases. But we don’t worry about it. My codebase compiles, which means I don’t have any illegal Java variable names. It’s one of the points that I took from Mastering Regular Expressions: you can base your expression on what you know about the text you are searching. In this case I know I am searching error-free Java code.
To do the replace, from the dialog, we click on Replace instead of Search. When our first match is found, we enter this in the With portion of the Replace dialog, making sure that Regular Expression is checked:
StringBuilder $1 = new StringBuilder();
The $1 refers to (\w+) from our search expression. It means that whatever characters the expression matched in that group become part of the replacement. For example, if the variable name was foo in the original statement, then it will continue to be foo in the replaced statement.
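The same search and replace pair can be sketched in code with String.replaceAll, which also uses $1 for the first group. This is only an illustration of the group reference, not what the IDE does internally, and the class name is hypothetical.

```java
// Hypothetical class name, used only for this illustration.
class SimpleReplaceSketch {
    // Applies the search expression and the $1 replacement to a piece of source text.
    static String replace(String source) {
        return source.replaceAll(
                "StringBuffer (\\w+) = new StringBuffer\\(\\);",  // search expression
                "StringBuilder $1 = new StringBuilder();");       // $1 = the variable name
    }

    public static void main(String[] args) {
        // The variable name foo is carried over by $1.
        System.out.println(replace("StringBuffer foo = new StringBuffer();"));
        // A field declaration without an initializer is left untouched.
        System.out.println(replace("private StringBuffer buf;"));
    }
}
```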
What’s wrong with this regular expression? A few things:
- It doesn’t work if the user passes a parameter to the StringBuffer constructor.
- It doesn’t work if there are any extra spaces between any of the tokens.
- It doesn’t work if there is an embedded tab between any of the tokens.
- It doesn’t work if the statement spans two lines.
Let’s tackle the argument(s) to the constructor first. To handle it, we can use this expression:
StringBuffer (\w+) = new StringBuffer\((.*)\);
We’ve added the (.*). The parentheses are there for grouping. The . means match any character. The * means 0 or more instances. So what we are looking for after the literal left parenthesis is any number of characters, including 0 characters to handle the no-argument constructor case, followed by a right parenthesis, followed by a semicolon. This should handle constructors like new StringBuffer(), new StringBuffer("whatever"), new StringBuffer(16), and new StringBuffer(methodThatReturnsAStringOrInt()).
If we run this, we have to modify our replace expression, because we now have two groups:
StringBuilder $1 = new StringBuilder($2);
In general, you want to be careful when using .* in a regular expression, because it matches anything. Again, we could come up with a more refined expression, but this one serves our purpose.
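Here is a sketch of the two-group version in code, again with String.replaceAll and a hypothetical class name. It also includes one input, invented for this example, where the greedy .* does more than intended.

```java
// Hypothetical class name, used only for this illustration.
class ConstructorArgsSketch {
    // Group 1 is the variable name, group 2 is whatever sits between the parentheses.
    static String replace(String source) {
        return source.replaceAll(
                "StringBuffer (\\w+) = new StringBuffer\\((.*)\\);",
                "StringBuilder $1 = new StringBuilder($2);");
    }

    public static void main(String[] args) {
        System.out.println(replace("StringBuffer sb = new StringBuffer();"));
        System.out.println(replace("StringBuffer sb = new StringBuffer(16);"));
        System.out.println(replace("StringBuffer sb = new StringBuffer(name());"));
        // Two declarations on one line: .* runs to the LAST ");" on the line,
        // so the second declaration is swallowed into group 2 and left unchanged.
        System.out.println(replace(
                "StringBuffer a = new StringBuffer(); StringBuffer b = new StringBuffer();"));
    }
}
```

In the last call only the first declaration is converted, because the whole line is consumed as a single match. That is the kind of surprise to watch for with .*, and a reason to review the matches rather than replacing blindly.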
What do we do about the potential extra spaces, tabs, and/or newlines? For that we can use \s, which matches a whitespace character. Whitespace is typically considered to be the space, tab, carriage return, and line feed characters, although its meaning can vary depending on the specific regular expression implementation. So to handle either a space or a tab before the variable:
StringBuffer\s(\w+) = new StringBuffer\((.*)\);
That will only handle one tab or space. What if there are two or more? We just add the +:
StringBuffer\s+(\w+) = new StringBuffer\((.*)\);
Now we put the \s+ between all the tokens:
StringBuffer\s+(\w+)\s+=\s+new\s+StringBuffer\((.*)\);
Our replace expression remains the same.
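As a rough check of the whitespace handling, the sketch below runs the expression through java.util.regex against a tab-separated declaration and one that spans two lines. The class name and sample inputs are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class WhitespaceTolerantSketch {
    // \s+ between the tokens accepts spaces, tabs and line breaks.
    static final Pattern DECLARATION = Pattern.compile(
            "StringBuffer\\s+(\\w+)\\s+=\\s+new\\s+StringBuffer\\((.*)\\);");

    static String replace(String source) {
        return DECLARATION.matcher(source)
                .replaceAll("StringBuilder $1 = new StringBuilder($2);");
    }

    public static void main(String[] args) {
        // Tabs and repeated spaces between the tokens.
        System.out.println(replace("StringBuffer\tsb  =  new   StringBuffer(16);"));
        // A statement that spans two lines.
        System.out.println(replace("StringBuffer sb =\n        new StringBuffer();"));
    }
}
```

Note that because the replace expression is unchanged, the replaced statement always comes out with single spaces on one line, whatever the original layout was.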
This one still has problems. What if somebody has put a space before the opening parenthesis, e.g.,
StringBuffer sb = new StringBuffer ();
We can handle this if we want to. Just insert a \s* before the \( to handle 0 or more whitespace characters. It depends how far you want to go.
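A quick way to see the effect of the optional whitespace is to compare the two expressions side by side. In this sketch the \s* is placed directly before the escaped opening parenthesis, which is my reading of where it belongs; the class, field, and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class OptionalSpaceSketch {
    static final Pattern WITHOUT_OPTIONAL_SPACE = Pattern.compile(
            "StringBuffer\\s+(\\w+)\\s+=\\s+new\\s+StringBuffer\\((.*)\\);");
    // Same expression with \s* (zero or more whitespace characters) before the \(.
    static final Pattern WITH_OPTIONAL_SPACE = Pattern.compile(
            "StringBuffer\\s+(\\w+)\\s+=\\s+new\\s+StringBuffer\\s*\\((.*)\\);");

    // True if the pattern matches the entire line.
    static boolean matches(Pattern pattern, String line) {
        return pattern.matcher(line).matches();
    }

    public static void main(String[] args) {
        String spaced = "StringBuffer sb = new StringBuffer ();";
        System.out.println(matches(WITHOUT_OPTIONAL_SPACE, spaced)); // false
        System.out.println(matches(WITH_OPTIONAL_SPACE, spaced));    // true
    }
}
```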
Conclusion
Again, for this case, regular expressions are arguably a bit of overkill. I still like the approach better because it matches the whole statement: it skips statements that do not initialize a StringBuffer, and if you decide to do the replace, it replaces the whole statement. At the very least, you could end up with half as many matches to review manually as you would with a literal string search and replace. For those of you new to regular expressions, it will hopefully give you a sense of what you can do with them. If you want to learn more about them, I strongly recommend the previously mentioned book, Mastering Regular Expressions. I had played around with regular expressions before reading the book, but never understood them the way I did after reading it.
Charles
I have made some edits since my initial post — I had the usage of StringBuffer and StringBuilder reversed in some cases in the original post.
Posted by Charles Overbeck on January 16th, 2007 under Java | 8 Responses to “Basic Regular Expressions”
January 16th, 2007 at 11:23 am
http://www.regextester.com is another handy way to test regular expressions. As far as I know, Java regular expressions are at least "close enough" to Perl regular expressions (maybe exactly the same?) for this page to be useful.
January 16th, 2007 at 1:29 pm
Nice. Regular expressions always start nice and simple, and end up as horrors of complexity, like your example. But walking through the evolution of such an expression is quite enlightening.
January 16th, 2007 at 2:50 pm
JBuilder 2007 says StringBuffer\s+(\w+)\s+=\s+new\s+StringBuffer\(.*)\); is missing a parenthesis.
I suspect you meant: StringBuffer\s+(\w+)\s+=\s+new\s+StringBuffer\((.*)\);
Cheers,
Andy Dingfelder
January 16th, 2007 at 2:58 pm
Andy,
Yes, you are right. I have edited the post accordingly. Thanks!
Charles
January 17th, 2007 at 6:27 am
To replace this type line,
please use a Freeware: Search & Replace Master
http://www.knowlesys.com/software/search-and-replace-master/index.htm
replace
StringBuffer*=*new*StringBuffer();
with
StringBuilder{$1}= new StringBuilder();
Enjoy it!
January 17th, 2007 at 6:30 am
Thanks for this post. Another RegEx utility to add to your tools list is RegexBuddy by Jan Goyvaerts at JGSoft (Just-GreatSoftware.com):
http://www.regexbuddy.com/
The best parts (to me) are that it is written with Delphi, it’s inexpensive (29.95 US dollars), and you get lifetime updates for no extra charge.
Regards,
Jim Dodd
Onset Computer Corp.
January 17th, 2007 at 7:21 am
Well, I was going to point out that identifiers can have underscores as well as A-Z, a-z, 0-9, but it seems that \w accepts those as well. (So the regular expression is correct, but you should probably mention it in the text.)
Also, since everyone is pitching their favorite Regex tool, mine is Regular Expression Workbench, a free utility from Eric Gunnerson of Microsoft. (http://blogs.msdn.com/ericgu/archive/2003/07/07/52362.aspx)
January 17th, 2007 at 8:56 am
Bear,
Search and Replace Master looks pretty neat, but I would personally prefer to bite the bullet and go the regular expression route. The next step after using regular expressions in the IDE is to start using them in code. There’s a whole slew of programming problems you can solve with regular expressions, so I always like to use them wherever I can, or else I start to get rusty.
James,
You are right about the underscore and \w. I have edited the post yet again so that it should now be correct. Thanks for pointing it out.
All,
Thanks for pointing out your favorite RegEx tools. I’ve been so happy with The RegEx Coach, that I never looked further, but maybe I’ll look around some more.
Charles Overbeck