I've just seen a huge regex for Java that made me think a little about maintainability of regular expressions in general. I believe that most people - except some badass perl mongers - would agree that regular expressions are hardly maintainable.
In this very short example, the common way of creating regular expression is still quite readable to any mediocre talented developer. However, think about those spooky expressions that fill two or more lines with 80 characters each. Sure, the (verbose) fluent interface would require several lines instead of only two, but I'm sure it would be much more readable (hence maintainable).
It's interesting to read that most people think that regular expressions are here to stay - although it takes tools to read them and smart guys to think of ways to make them maintainable. While I'm not sure that a fluent interface is the best way to go, I'm sure that some smart engineers - we? ;) - should spend some time to make regular expressions a thing of the past - it's enough that they've been with us for 50 years know, don't you think?
here's the kind of pattern I'm talking about - extra kudos to the first one who's able to translate it - RegexBuddies allowed (it's from an Apache project btw) extra kudos awarded to chii and mez: it's a RFC compliant email address validation pattern - although its RFC822 (see ex-parrot.com), not 5322 - not sure if there is a difference though - if yes, I'll award next extra kudos for a patch to fit 5322 ;)
This is easy to read for any experience level, and for the inexperienced, it teaches regex at the same time. Also, the comments can be tailored to the situation to explain the business rules behind the regex rather than just the structure.
Your example seems to create a regex by concatenating on to strings. This is not very robust. I believe that these methods should be added to String and Patern/Regex objects, because it will make implementation and code cleaner. Besides it's similar to the way Regular Expressions are classically defined.
pattern.capture(variable) would store pattern in variable. In the event that the capture is part of an expression to be matched multiple times, variable should contain an array of strings of all matches for pattern.
Fluent languages are not well suited to the recursive nature of regular expressions. So consideration needs to be given to the order of operations. Just chaining methods together does not allow for very complex regular expressions. Exactly the situation where such a tool would be useful.
The proposed scheme used .domain() which seems to be a common regex. To treat user defined patterns as a methods doesn't make it easy to add new patterns. In a language like Java users of the library would have to override the class to add methods for commonly used patterns.
This is addressed by my suggestion to treat every piece as an object. A pattern object can be created for each commonly used regex like matching a domain for example. Given my previous thoughts about capturing it wouldn't too be hard to ensure that capturing works for multiple copies of the same common pattern that contains a captured section.
In hindsight, my scheme borrows heavily from Martin Fowler's approach in use. Although I didn't intend for things to go this way, it definitely makes using such a system more maintainable. It also addresses a problem or two with Fowler's approach (capturing order).
This notation is a bit more concise, and I feel that it's easier to understand the hierarchy/structure of the pattern. Each method could accept either a string, or a pattern fragment, as the first argument.
Long answer:The company I work for makes hardware-based regular expression engines for enterprise content filtering applications. Think running anti-virus or firewall applications at 20GB/sec speeds right in the network routers rather than taking up valuable server or processor cycles. Most anti-virus, anti-spam or firewall apps are a bunch of regex expressions at the core.
Anyway, the way the regex's are written has a huge impact on the performance of the scanning. You can write regexs in several different ways to do the same thing, and some will have drastically faster performance. We've written compilers and linters for our customers to help them maintain and tune their expressions.
Back to the OP's question, rather than defining a whole new syntax, I would write a linter (sorry, ours is proprietary) cut and paste regex's into that will break down legacy regex's and output "fluent english" for someone to understand better. I'd also add relative performance checks and suggestions for common modifications.
The short answer, for me, is that, once you get to regular expressions (or other pattern matching that does the same thing) that are long enough to cause a problem... you should probably be considering if they're the right tool for the job in the first place.
Honestly, any fluent interface seems like it would be harder to read than a standard regular expression. For really short expressions, the fluent version is verbose, but not too long; it's readable. But so is the regular expression for something that long.
Note that the above will incorrectly classify arbritarily many email addresses as being valid that do NOT conform to the grammar they're required to follow, as that grammar requires more than a simple FSM for full parsing. I don't believe it'd misclassify a valid address as invalid, though.
I would most likely not. I think this is a case of using the right tool for the job. There are some great answers here such as: 1579202 from Jeremy Stein. I have recently "gotten" regular expressions on a recent project and found them to be very useful just as they are, and when properly commented they are understandable if you know the syntax.
I think the "knowing the syntax" part is what turns people off to Regular Expressions, to those who don't understand they look arcane and cryptic, but they are a powerful tool and in applied Computer Science (e.g. writing software for a living) I feel like intelligent professionals should and should be able to learn to use them and us them appropriately.
As they say "With great power comes great responsibility." I have seen people use regular expressions everywhere for everything, but used judiciously by someone who has taken the time to learn the syntax thoroughly, they are incredibly helpful; to me, adding another layer would in a way defeat their purpose, or at a minimum take away their power.
This just my own opinion and I can understand where people are coming from who would desire a framework like this, or who would avoid regular expressions, but I have hard time hearing "Regular Expressions are bad" from those who haven't take the time to learn them and make an informed decision.
To make everybody happy (regex masters and fluid interface proponents), make sure the fluid interface can output an appropriate raw regex pattern, and also take a regular regex using a Factory method and generate fluid code for it.
Let's compare: I have worked often with (N)Hibernate ICriteria queries, which can be considered a Fluent mapping to SQL. I was (and still am) enthusiastic about them, but did they make the SQL queries more legible? No, more to the contrary, but another benefit rose: it became much easier to programmatically build statements, to subclass them and create your own abstractions etc.
What I'm getting at is that using a new interface for a given language, if done right, can prove worthwhile, but don't think too highly of it. In many cases it won't become easier to read (nested subtraction character classes, Captures in look-behind, if-branching to name a few advanced concepts that will be hard to combine fluently). But in just as many cases, the benefits of greater flexibility outweigh the added overhead of syntax complexity.
To add to your list of possible alternative approaches and to take this out of the context of only Java, consider the LINQ syntax. Here's what it could look like (a bit contrived) (from, where and select are keywords in LINQ):
just a rough idea, I know. The good thing of LINQ is that it is checked by the compiler, a kind-of language in a language. By default, LINQ can also be expressed as fluent chained syntax, which makes it, if well designed, compatible with other OO languages.
In fact, I am not even sure it would be possible to have an API smart enough to really provide enough guidance to the developer when creating a new regexp, not talking about ambiguous cases, as mentioned in a previous answer.
You mentioned a regexp example for an RFC. I am 99% sure (there is still 1% hope;-)) that any API would not make that example any simpler, but conversely that would only make it more complex to read! That's a typical example where you don't want to use regexp anyway!
Even regarding regexp creation, due to the problem of ambiguous patterns, it is probable that with a fluent API, you would never come with the right expression the first time, but would have to change several times until you get what you really want.
Make no mistake, I do love fluent interfaces; I have developed some libraries that use them, and I use several 3rd-party libraries based on them (e.g. FEST for Java testing). But I don't think they can be the golden hammer for any problem.
If we consider Java exclusively, I think the main problem with regexps is the necessary escaping of backslashes in Java string constants. That's one point that makes it incredibly difficult to create and understand regexp in Java. Hence, the first step to enhance Java regexp would, for me, be a language change, a la Groovy, where string constants don't need to escape backslashes.
Let me state right up front that I'm not in the intended audience for this book. Fluent C# is a book for beginners to programming. Although I do feel like one from time to time, I've spent the better parts of my adult life hacking together software systems of all scales. Reviewing a book for novices is hard. My view will be quite different. A reader with little programming experience is likely to focus on other aspects and will probably face different problems. So why did I opt to review it? Well, this book claims to be different. It's based on principles of cognitive science and instructional design. Both topics I have an interest in. From that perspective parts of the book provide an interesting read. To a beginner interested in learning C#, the technical inaccuracies are likely to become an obstacle.
7fc3f7cf58