Sometimes it is necessary to perform substitions where the result of one substition will match what is substituted from the first. The most common may be when escaping HTML, where “<” will be “<” and “&” will be “&”. In that case, only one of the substitions produce a result matching the other, and it can be solved by performing the substitutions in the correct order. What if both substitutions produce something matched by the other? The reasonable answer is of course simply parsing the data properly, but sometimes a round fighting with regular expressions is simply too tempting.
Consider a case where curly braces surrounded by whitespace, or where one of the whitespace characters instead is the beginning or end of a line, are to substituted with “{ < }” for “{” and “{ > }” for “}”. In other words, both results match both patterns, so performing blind pattern substitution in a series will only lead to noise. This exact case is probably best illustrated by example:
{ a b { c { d e{ {f g{h } i j } k } l m { < } n { _
Above, all cases of curly braces should be substituted, except in the lines with the letters e, f, and g/h, as the they don't line start/end or whitespace on both sides.
Below, the correctly quoted data is presented, which probably illustrates what I am trying to achieve in a clearer way than the above introduction.
{ < } a b { < } c { < } d e{ {f g{h { > } i j { > } k { > } l m { < } < { > } n { < } _
Regular expressions are line noise two minutes after having written them, so I will present my work step by step. I will be using the GNU implementation of “sed” as platform.
First the input must be matched, either whitespace or start of line, followed by a curly brace, then whitespace or end of line:
/\(^\|\s\)}\(\s\|$\)/
The grouping parentheses for the subexpressions should be almost visible in the fog of backslashes.
Since there are two possible starts and two possible ends, and they must be preserved, the substition includes two back references, using “\1” to refer to the first subexpression and “\2” to the second.
/\1{ > }\2/
So, given the above, we get the following command line to substitute the closing curly braces:
sed -e 's/\(^\|\s\)}\(\s\|$\)/\1{ > }\2/g'
The “g” is just to match and substitute all matches in a line, not just the first.
This is all well and good, but since our substitution contains opening curly braces, just doing the same thing for the opening brace won't work at all.
To avoid garbling the output, one of the strings must first be transformed to something not matching the the other, then the other is substituted, and then, finally, the first placeholder is replaced with the desired result.
The problem with step-wise substitions, is that one must make sure that the placeholder is not a string which already exists in the input data. That can be solved by either using some other unique string or marker characters which for some reason can not exist in the input, but a more elegant solution is using a string which would be matched by the search expressions themselves.
Substituting the matching “{” with “{ _” makes an output which will match the search for opening braces, since there will always be a space after them, but any existing combinations of underscores and curly braces which should not be substituted will not be touched, since they did not match the expression in the first place. This gives a new sed command:
sed -e 's/\(^\|\s\){\(\s\|$\)/\1{ _\2/g'
On other words, match the opening braces in exactly the same manner as the closing ones above, but do not insert any closing braces in the substitution.
The complete series of commands becomes:
#! /bin/zsh sed -e 's/\(^\|\s\){\(\s\|$\)/\1{ _\2/g' \ | sed -e 's/\(^\|\s\)}\(\s\|$\)/\1{ > }\2/g' \ | sed -e 's/\(^\|\s\){ _\(\s\|$\)/\1{ < }\2/g'
In other words, by using an extra substitution, the problem where the string mutually match is simplified to a problem where only one match, and the operations can be ordered to produce the correct result.
> sed <<EOF -e 's/\(^\|\s\){\(\s\|$\)/\1{ _\2/g' \ > | sed -e 's/\(^\|\s\)}\(\s\|$\)/\1{ > }\2/g' \ pipe> | sed -e 's/\(^\|\s\){ _\(\s\|$\)/\1{ < }\2/g' pipe pipe heredoc> { a b { c { d e{ {f g{h } i j } k } l m { < } n { _ EOF { < } a b { < } c { < } d e{ {f g{h { > } i j { > } k { > } l m { < } < { > } n { < } _
Simply fire up that lexer. That's actually readable in the future. Ironically, I first intended to include that as a proper punchline in this article, but the lexer I used (which was for the grammar which was the source of this question in the first place) used backslash as escape characters, so using it naïvely on data containing backslashes worked just as bad as using regular expressions naïvely. That's a pretty good moral of a story too.
Steinar Knutsen, 20201003T132031Z, manuscript hash 01A87E22