regexp char class subtract and intersect

# Russ Cox (19 years ago)

Regarding the proposal at developer.mozilla.org/es4/proposals/extend_regexps.html to add intersection and subtraction of character classes, it certainly sounds like a good idea in principle, but I am not sure I agree about the particulars of the proposal. (Of course if the Unicode properties go off the table then it makes sense to toss intersection and subtraction as well; I am assuming that Unicode properties are still a planned addition.)

First, the proposal does not specify the binding precedences of the new operators: surely [a-c-b] means [ac] but does [a-c-bb] also mean [ac] or does it mean [acb]? Etc. This is hardly a show-stopper, just a detail missing from the rough sketch proposal.

More importantly the proposal breaks the nice regexp syntax property, inherited from egrep, that escaped punctuation always denotes the corresponding literal (and that non-punctuation is always a literal unless escaped). Adding - and & as operators breaks the rule and adds confusion. Making - special is particularly bad, because - is already special in character classes. I find the - rules weird enough that I routinely use - in character classes when I want the literal so I don't have to remember when - is special and when it isn't.

I suggest using unescaped & as the conjunction operator, to preserve the property that \punctuation is always a literal. It is true that this would affect existing character classes that list &, but these can be diagnosed by the compiler as an empty character class: [!@#$%&*()] is obviously (to the compiler) a mistake, because the intersection is empty.

I also suggest not adding an explicit subtraction operator but instead using intersection with the complement, since there is already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac], [a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.

The specific grammar would be something like:

CharClass ::= "[" ClassIntersect "]" ClassComplement ::= ClassRanges | "^" ClassRanges ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement

Russ

# Russ Cox (18 years ago)

[Reposting because I never saw any replies nor any notes on the wiki, in case this got lost in the shuffle.]

From: Russ Cox <rsc at swtch.com>

Date: Feb 13, 2007 9:48 PM Subject: regexp char class subtract and intersect To: es4-discuss at mozilla.org

Regarding the proposal at developer.mozilla.org/es4/proposals/extend_regexps.html to add intersection and subtraction of character classes, it certainly sounds like a good idea in principle, but I am not sure I agree about the particulars of the proposal. (Of course if the Unicode properties go off the table then it makes sense to toss intersection and subtraction as well; I am assuming that Unicode properties are still a planned addition.)

First, the proposal does not specify the binding precedences of the new operators: surely [a-c-b] means [ac] but does [a-c-bb] also mean [ac] or does it mean [acb]? Etc. This is hardly a show-stopper, just a detail missing from the rough sketch proposal.

More importantly the proposal breaks the nice regexp syntax property, inherited from egrep, that escaped punctuation always denotes the corresponding literal (and that non-punctuation is always a literal unless escaped). Adding - and & as operators breaks the rule and adds confusion. Making - special is particularly bad, because - is already special in character classes. I find the - rules weird enough that I routinely use - in character classes when I want the literal so I don't have to remember when - is special and when it isn't.

I suggest using unescaped & as the conjunction operator, to preserve the property that \punctuation is always a literal. It is true that this would affect existing character classes that list &, but these can be diagnosed by the compiler as an empty character class: [!@#$%&*()] is obviously (to the compiler) a mistake, because the intersection is empty.

I also suggest not adding an explicit subtraction operator but instead using intersection with the complement, since there is already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac], [a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.

The specific grammar would be something like:

CharClass ::= "[" ClassIntersect "]" ClassComplement ::= ClassRanges | "^" ClassRanges ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement

Russ

# Lars T Hansen (18 years ago)

These are all good points. Java (java.sun.com/javase/6/docs/api, see java.util.regex) solves all these problems in the following way:

  • && in a charset means intersection; this sequence is not likely to be used except in error since a single & always suffices

  • subtraction is handled through intersection with a complemented embedded charset, eg, [a-c&&[^cde]].

I think we perceived Java's solution as inelegant when we devised what's now proposed, especially subtraction-by-complement-and-intersection is unpleasant, especially because the nested character sets makes regular expression lexing a little messier. (As it is, and by design, regular expression literals can be lexed by a DFA when the program is parsed; only the regular expression compiler needs to handle the context-free language. Allowing nested sets breaks this.)

I'll put this item on the agenda for the next committee meeting.

Sorry for taking so long to answer,

# Russ Cox (18 years ago)

These are all good points. Java (java.sun.com/javase/6/docs/api, see java.util.regex) solves all these problems in the following way:

  • && in a charset means intersection; this sequence is not likely to be used except in error since a single & always suffices

  • subtraction is handled through intersection with a complemented embedded charset, eg, [a-c&&[^cde]].

I think we perceived Java's solution as inelegant when we devised what's now proposed, especially subtraction-by-complement-and-intersection is unpleasant, especially because the nested character sets makes regular expression lexing a little messier. (As it is, and by design, regular expression literals can be lexed by a DFA when the program is parsed; only the regular expression compiler needs to handle the context-free language. Allowing nested sets breaks this.)

That's very interesting, thanks. Perhaps it would suffice for drop the [ ], e.g., [a-c&&^cde]. But perhaps that would be too close to regular Java not to go the rest of the way.

I don't understand the final parenthetical. Regular expressions already have nested parens, so I don't understand how nested [ ] makes them harder to parse -- I thought that in both cases one just looked for the termination character ignoring escapes and checking of nesting. Unless /[/]/ is a valid way to match a slash, I don't see why the lexer needs to care whether the [ ] are properly nested.

Russ

# Lars T Hansen (18 years ago)

On 5/7/07, Russ Cox <rsc at swtch.com> wrote:

These are all good points. Java (java.sun.com/javase/6/docs/api, see java.util.regex) solves all these problems in the following way:

  • && in a charset means intersection; this sequence is not likely to be used except in error since a single & always suffices

  • subtraction is handled through intersection with a complemented embedded charset, eg, [a-c&&[^cde]].

I think we perceived Java's solution as inelegant when we devised what's now proposed, especially subtraction-by-complement-and-intersection is unpleasant, especially because the nested character sets makes regular expression lexing a little messier. (As it is, and by design, regular expression literals can be lexed by a DFA when the program is parsed; only the regular expression compiler needs to handle the context-free language. Allowing nested sets breaks this.)

That's very interesting, thanks. Perhaps it would suffice for drop the [ ], e.g., [a-c&&^cde]. But perhaps that would be too close to regular Java not to go the rest of the way.

I don't understand the final parenthetical. Regular expressions already have nested parens, so I don't understand how nested [ ] makes them harder to parse -- I thought that in both cases one just looked for the termination character ignoring escapes and checking of nesting. Unless /[/]/ is a valid way to match a slash, I don't see why the lexer needs to care whether the [ ] are properly nested.

That's precisely the problem. The charset [.../...] was not legal in ES3, the / needed to be escaped. But MSIE allows it and the web uses it, so the other browsers have followed, and it will be legal in ES4.

# Russ Cox (18 years ago)

Sorry for continuing to bring this up, but did anything get resolved here? I'm still hoping that ES4 won't break the "backslash before punctuation is always a literal" rule followed in Perl, PCRE, and many other modern regexp packages.

Either using && and &&^ as the current & and - or using & and &^ as the current & and - (with an error if CharsetIntersect returns an empty set) would avoid this problem.

Russ

# Lars T Hansen (18 years ago)

I'm actively at work on this, and I'm very mindful about the compatibility problem. Most likely we'll follow Java, which should not be a problem. More soon.