Substituting strings matching each other

Sometimes it is necessary to perform substitions where the result of one substition will match what is substituted from the first. The most common may be when escaping HTML, where “<” will be “&lt;” and “&” will be “&amp;”. In that case, only one of the substitions produce a result matching the other, and it can be solved by performing the substitutions in the correct order. What if both substitutions produce something matched by the other? The reasonable answer is of course simply parsing the data properly, but sometimes a round fighting with regular expressions is simply too tempting.

1.0 Example problem

Consider a case where curly braces surrounded by whitespace, or where one of the whitespace characters instead is the beginning or end of a line, are to substituted with “{ < }” for “{” and “{ > }” for “}”. In other words, both results match both patterns, so performing blind pattern substitution in a series will only lead to noise. This exact case is probably best illustrated by example:

1.1 Input file

{ a
b {
c { d
e{
{f
g{h
} i
j }
k } l
m { < }
n { _

Above, all cases of curly braces should be substituted, except in the lines with the letters e, f, and g/h, as the they don't line start/end or whitespace on both sides.

1.2 Desired output

Below, the correctly quoted data is presented, which probably illustrates what I am trying to achieve in a clearer way than the above introduction.

{ < } a
b { < }
c { < } d
e{
{f
g{h
{ > } i
j { > }
k { > } l
m { < } < { > }
n { < } _

2.0 Building the regular expressions

Regular expressions are line noise two minutes after having written them, so I will present my work step by step. I will be using the GNU implementation of “sed” as platform.

2.1 Matching the input

First the input must be matched, either whitespace or start of line, followed by a curly brace, then whitespace or end of line:

/\(^\|\s\)}\(\s\|$\)/

The grouping parentheses for the subexpressions should be almost visible in the fog of backslashes.

2.2 Substituting with back references

Since there are two possible starts and two possible ends, and they must be preserved, the substition includes two back references, using “\1” to refer to the first subexpression and “\2” to the second.

/\1{ > }\2/

2.3 sed arguments for one patterns

So, given the above, we get the following command line to substitute the closing curly braces:

sed -e 's/\(^\|\s\)}\(\s\|$\)/\1{ > }\2/g'

The “g” is just to match and substitute all matches in a line, not just the first.

This is all well and good, but since our substitution contains opening curly braces, just doing the same thing for the opening brace won't work at all.

3.0 Step-wise substitution

To avoid garbling the output, one of the strings must first be transformed to something not matching the the other, then the other is substituted, and then, finally, the first placeholder is replaced with the desired result.

The problem with step-wise substitions, is that one must make sure that the placeholder is not a string which already exists in the input data. That can be solved by either using some other unique string or marker characters which for some reason can not exist in the input, but a more elegant solution is using a string which would be matched by the search expressions themselves.

3.1 Substitution which matches only one of the patterns

Substituting the matching “{” with “{ _” makes an output which will match the search for opening braces, since there will always be a space after them, but any existing combinations of underscores and curly braces which should not be substituted will not be touched, since they did not match the expression in the first place. This gives a new sed command:

sed -e 's/\(^\|\s\){\(\s\|$\)/\1{ _\2/g'

On other words, match the opening braces in exactly the same manner as the closing ones above, but do not insert any closing braces in the substitution.

3.2 Putting it all together

The complete series of commands becomes:

#! /bin/zsh

sed -e 's/\(^\|\s\){\(\s\|$\)/\1{ _\2/g' \
    | sed -e 's/\(^\|\s\)}\(\s\|$\)/\1{ > }\2/g' \
    | sed -e 's/\(^\|\s\){ _\(\s\|$\)/\1{ < }\2/g'
  1. Substitute the first pattern with something matching itself, but not the other.
  2. Substitute the second pattern with the desired end result.
  3. Substitute the placeholder from the first step with the desired end result.

In other words, by using an extra substitution, the problem where the string mutually match is simplified to a problem where only one match, and the operations can be ordered to produce the correct result.

3.3 Shell example

> sed <<EOF -e 's/\(^\|\s\){\(\s\|$\)/\1{ _\2/g' \              
> | sed -e 's/\(^\|\s\)}\(\s\|$\)/\1{ > }\2/g' \
pipe> | sed -e 's/\(^\|\s\){ _\(\s\|$\)/\1{ < }\2/g'
pipe pipe heredoc> { a
b {
c { d
e{
{f
g{h
} i
j }
k } l
m { < }
n { _
EOF
{ < } a
b { < }
c { < } d
e{
{f
g{h
{ > } i
j { > }
k { > } l
m { < } < { > }
n { < } _

4.0 Conclusion

Simply fire up that lexer. That's actually readable in the future. Ironically, I first intended to include that as a proper punchline in this article, but the lexer I used (which was for the grammar which was the source of this question in the first place) used backslash as escape characters, so using it naïvely on data containing backslashes worked just as bad as using regular expressions naïvely. That's a pretty good moral of a story too.

Steinar Knutsen, 20201003T132031Z, manuscript hash 01A87E22