regexp char class subtract and intersect

# Russ Cox (19 years ago)

Regarding the proposal at developer.mozilla.org/es4/proposals/extend_regexps.html to add intersection and subtraction of character classes, it certainly sounds like a good idea in principle, but I am not sure I agree about the particulars of the proposal. (Of course if the Unicode properties go off the table then it makes sense to toss intersection and subtraction as well; I am assuming that Unicode properties are still a planned addition.)

First, the proposal does not specify the binding precedences of the new operators: surely [a-c-b] means [ac] but does [a-c-bb] also mean [ac] or does it mean [acb]? Etc. This is hardly a show-stopper, just a detail missing from the rough sketch proposal.

More importantly the proposal breaks the nice regexp syntax property, inherited from egrep, that escaped punctuation always denotes the corresponding literal (and that non-punctuation is always a literal unless escaped). Adding - and & as operators breaks the rule and adds confusion. Making - special is particularly bad, because - is already special in character classes. I find the - rules weird enough that I routinely use - in character classes when I want the literal so I don't have to remember when - is special and when it isn't.

I suggest using unescaped & as the conjunction operator, to preserve the property that \punctuation is always a literal. It is true that this would affect existing character classes that list &, but these can be diagnosed by the compiler as an empty character class: [!@#$%&*()] is obviously (to the compiler) a mistake, because the intersection is empty.

I also suggest not adding an explicit subtraction operator but instead using intersection with the complement, since there is already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac], [a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.

The specific grammar would be something like:

CharClass ::= "[" ClassIntersect "]" ClassComplement ::= ClassRanges | "^" ClassRanges ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement

Russ

Regarding the proposal at
http://developer.mozilla.org/es4/proposals/extend_regexps.html
to add intersection and subtraction of character classes,
it certainly sounds like a good idea in principle, but I am not
sure I agree about the particulars of the proposal.  (Of course
if the Unicode properties go off the table then it makes sense
to toss intersection and subtraction as well; I am assuming
that Unicode properties are still a planned addition.)

First, the proposal does not specify the binding precedences
of the new operators: surely [a-c\-b] means [ac] but does [a-c\-bb]
also mean [ac] or does it mean [acb]?  Etc.  This is hardly a
show-stopper, just a detail missing from the rough sketch proposal.

More importantly the proposal breaks the nice regexp syntax property,
inherited from egrep, that escaped punctuation always denotes
the corresponding literal (and that non-punctuation is always a
literal unless escaped).  Adding \- and \& as operators breaks
the rule and adds confusion.  Making \- special is particularly bad,
because - is already special in character classes.  I find the - rules
weird enough that I routinely use \- in character classes when
I want the literal so I don't have to remember when - is special and
when it isn't.

I suggest using unescaped & as the conjunction operator,
to preserve the property that \punctuation is always a literal.
It is true that this would affect existing character classes that
list &, but these can be diagnosed by the compiler as an
empty character class: [!@#$%&*()] is obviously (to the compiler)
a mistake, because the intersection is empty.

I also suggest not adding an explicit subtraction operator but
instead using intersection with the complement, since there is
already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac],
[a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.

The specific grammar would be something like:

  CharClass ::= "[" ClassIntersect "]"
  ClassComplement ::= ClassRanges | "^" ClassRanges
  ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement

Russ

# Russ Cox (18 years ago)

[Reposting because I never saw any replies nor any notes on the wiki, in case this got lost in the shuffle.]

From: Russ Cox <rsc at swtch.com>

Date: Feb 13, 2007 9:48 PM Subject: regexp char class subtract and intersect To: es4-discuss at mozilla.org

The specific grammar would be something like:

CharClass ::= "[" ClassIntersect "]" ClassComplement ::= ClassRanges | "^" ClassRanges ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement

Russ

[Reposting because I never saw any replies nor any notes
on the wiki, in case this got lost in the shuffle.]

From: Russ Cox <rsc at swtch.com>
Date: Feb 13, 2007 9:48 PM
Subject: regexp char class subtract and intersect
To: es4-discuss at mozilla.org

Regarding the proposal at
http://developer.mozilla.org/es4/proposals/extend_regexps.html
to add intersection and subtraction of character classes,
it certainly sounds like a good idea in principle, but I am not
sure I agree about the particulars of the proposal.  (Of course
if the Unicode properties go off the table then it makes sense
to toss intersection and subtraction as well; I am assuming
that Unicode properties are still a planned addition.)

First, the proposal does not specify the binding precedences
of the new operators: surely [a-c\-b] means [ac] but does [a-c\-bb]
also mean [ac] or does it mean [acb]?  Etc.  This is hardly a
show-stopper, just a detail missing from the rough sketch proposal.

More importantly the proposal breaks the nice regexp syntax property,
inherited from egrep, that escaped punctuation always denotes
the corresponding literal (and that non-punctuation is always a
literal unless escaped).  Adding \- and \& as operators breaks
the rule and adds confusion.  Making \- special is particularly bad,
because - is already special in character classes.  I find the - rules
weird enough that I routinely use \- in character classes when
I want the literal so I don't have to remember when - is special and
when it isn't.

I suggest using unescaped & as the conjunction operator,
to preserve the property that \punctuation is always a literal.
It is true that this would affect existing character classes that
list &, but these can be diagnosed by the compiler as an
empty character class: [!@#$%&*()] is obviously (to the compiler)
a mistake, because the intersection is empty.

I also suggest not adding an explicit subtraction operator but
instead using intersection with the complement, since there is
already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac],
[a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.

The specific grammar would be something like:

  CharClass ::= "[" ClassIntersect "]"
  ClassComplement ::= ClassRanges | "^" ClassRanges
  ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement

Russ

# Lars T Hansen (18 years ago)

These are all good points. Java (java.sun.com/javase/6/docs/api, see java.util.regex) solves all these problems in the following way:

&& in a charset means intersection; this sequence is not likely to be used except in error since a single & always suffices
subtraction is handled through intersection with a complemented embedded charset, eg, [a-c&&[^cde]].

I think we perceived Java's solution as inelegant when we devised what's now proposed, especially subtraction-by-complement-and-intersection is unpleasant, especially because the nested character sets makes regular expression lexing a little messier. (As it is, and by design, regular expression literals can be lexed by a DFA when the program is parsed; only the regular expression compiler needs to handle the context-free language. Allowing nested sets breaks this.)

I'll put this item on the agenda for the next committee meeting.

Sorry for taking so long to answer,

These are all good points.  Java
(http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
all these problems in the following way:

-  && in a charset means intersection; this sequence is not likely to
be used except in error since a single & always suffices

- subtraction is handled through intersection with a complemented
embedded charset, eg, [a-c&&[^cde]].

I think we perceived Java's solution as inelegant when we devised
what's now proposed, especially
subtraction-by-complement-and-intersection is unpleasant, especially
because the nested character sets makes regular expression lexing a
little messier.  (As it is, and by design, regular expression literals
can be lexed by a DFA when the program is parsed; only the regular
expression compiler needs to handle the context-free language.
Allowing nested sets breaks this.)

I'll put this item on the agenda for the next committee meeting.

Sorry for taking so long to answer,
--lars

On 4/5/07, Russ Cox <rsc at swtch.com> wrote:

> Regarding the proposal at
> http://developer.mozilla.org/es4/proposals/extend_regexps.html
> to add intersection and subtraction of character classes,
> it certainly sounds like a good idea in principle, but I am not
> sure I agree about the particulars of the proposal.  (Of course
> if the Unicode properties go off the table then it makes sense
> to toss intersection and subtraction as well; I am assuming
> that Unicode properties are still a planned addition.)
>
> First, the proposal does not specify the binding precedences
> of the new operators: surely [a-c\-b] means [ac] but does [a-c\-bb]
> also mean [ac] or does it mean [acb]?  Etc.  This is hardly a
> show-stopper, just a detail missing from the rough sketch proposal.
>
> More importantly the proposal breaks the nice regexp syntax property,
> inherited from egrep, that escaped punctuation always denotes
> the corresponding literal (and that non-punctuation is always a
> literal unless escaped).  Adding \- and \& as operators breaks
> the rule and adds confusion.  Making \- special is particularly bad,
> because - is already special in character classes.  I find the - rules
> weird enough that I routinely use \- in character classes when
> I want the literal so I don't have to remember when - is special and
> when it isn't.
>
> I suggest using unescaped & as the conjunction operator,
> to preserve the property that \punctuation is always a literal.
> It is true that this would affect existing character classes that
> list &, but these can be diagnosed by the compiler as an
> empty character class: [!@#$%&*()] is obviously (to the compiler)
> a mistake, because the intersection is empty.
>
> I also suggest not adding an explicit subtraction operator but
> instead using intersection with the complement, since there is
> already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac],
> [a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.
>
> The specific grammar would be something like:
>
>   CharClass ::= "[" ClassIntersect "]"
>   ClassComplement ::= ClassRanges | "^" ClassRanges
>   ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement
>
> Russ
> _______________________________________________
> Es4-discuss mailing list
> Es4-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es4-discuss
>

# Russ Cox (18 years ago)

These are all good points. Java (java.sun.com/javase/6/docs/api, see java.util.regex) solves all these problems in the following way:

&& in a charset means intersection; this sequence is not likely to be used except in error since a single & always suffices

subtraction is handled through intersection with a complemented embedded charset, eg, [a-c&&[^cde]].

I think we perceived Java's solution as inelegant when we devised what's now proposed, especially subtraction-by-complement-and-intersection is unpleasant, especially because the nested character sets makes regular expression lexing a little messier. (As it is, and by design, regular expression literals can be lexed by a DFA when the program is parsed; only the regular expression compiler needs to handle the context-free language. Allowing nested sets breaks this.)

That's very interesting, thanks. Perhaps it would suffice for drop the [ ], e.g., [a-c&&^cde]. But perhaps that would be too close to regular Java not to go the rest of the way.

I don't understand the final parenthetical. Regular expressions already have nested parens, so I don't understand how nested [ ] makes them harder to parse -- I thought that in both cases one just looked for the termination character ignoring escapes and checking of nesting. Unless /[/]/ is a valid way to match a slash, I don't see why the lexer needs to care whether the [ ] are properly nested.

Russ

> These are all good points.  Java
> (http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
> all these problems in the following way:
> 
> -  && in a charset means intersection; this sequence is not likely to
> be used except in error since a single & always suffices
> 
> - subtraction is handled through intersection with a complemented
> embedded charset, eg, [a-c&&[^cde]].
> 
> I think we perceived Java's solution as inelegant when we devised
> what's now proposed, especially
> subtraction-by-complement-and-intersection is unpleasant, especially
> because the nested character sets makes regular expression lexing a
> little messier.  (As it is, and by design, regular expression literals
> can be lexed by a DFA when the program is parsed; only the regular
> expression compiler needs to handle the context-free language.
> Allowing nested sets breaks this.)

That's very interesting, thanks.  Perhaps it would suffice
for drop the [ ], e.g., [a-c&&^cde].  But perhaps that would 
be too close to regular Java not to go the rest of the way.

I don't understand the final parenthetical.
Regular expressions already have nested parens,
so I don't understand how nested [ ] makes them 
harder to parse -- I thought that in both cases one
just looked for the termination character ignoring
escapes and checking of nesting.  Unless /[/]/ is a
valid way to match a slash, I don't see why the lexer
needs to care whether the [ ] are properly nested.

Russ

# Lars T Hansen (18 years ago)

On 5/7/07, Russ Cox <rsc at swtch.com> wrote:

These are all good points. Java (java.sun.com/javase/6/docs/api, see java.util.regex) solves all these problems in the following way:

&& in a charset means intersection; this sequence is not likely to be used except in error since a single & always suffices

subtraction is handled through intersection with a complemented embedded charset, eg, [a-c&&[^cde]].

I think we perceived Java's solution as inelegant when we devised what's now proposed, especially subtraction-by-complement-and-intersection is unpleasant, especially because the nested character sets makes regular expression lexing a little messier. (As it is, and by design, regular expression literals can be lexed by a DFA when the program is parsed; only the regular expression compiler needs to handle the context-free language. Allowing nested sets breaks this.)

That's very interesting, thanks. Perhaps it would suffice for drop the [ ], e.g., [a-c&&^cde]. But perhaps that would be too close to regular Java not to go the rest of the way.

I don't understand the final parenthetical. Regular expressions already have nested parens, so I don't understand how nested [ ] makes them harder to parse -- I thought that in both cases one just looked for the termination character ignoring escapes and checking of nesting. Unless /[/]/ is a valid way to match a slash, I don't see why the lexer needs to care whether the [ ] are properly nested.

That's precisely the problem. The charset [.../...] was not legal in ES3, the / needed to be escaped. But MSIE allows it and the web uses it, so the other browsers have followed, and it will be legal in ES4.

On 5/7/07, Russ Cox <rsc at swtch.com> wrote:
> > These are all good points.  Java
> > (http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
> > all these problems in the following way:
> >
> > -  && in a charset means intersection; this sequence is not likely to
> > be used except in error since a single & always suffices
> >
> > - subtraction is handled through intersection with a complemented
> > embedded charset, eg, [a-c&&[^cde]].
> >
> > I think we perceived Java's solution as inelegant when we devised
> > what's now proposed, especially
> > subtraction-by-complement-and-intersection is unpleasant, especially
> > because the nested character sets makes regular expression lexing a
> > little messier.  (As it is, and by design, regular expression literals
> > can be lexed by a DFA when the program is parsed; only the regular
> > expression compiler needs to handle the context-free language.
> > Allowing nested sets breaks this.)
>
> That's very interesting, thanks.  Perhaps it would suffice
> for drop the [ ], e.g., [a-c&&^cde].  But perhaps that would
> be too close to regular Java not to go the rest of the way.
>
> I don't understand the final parenthetical.
> Regular expressions already have nested parens,
> so I don't understand how nested [ ] makes them
> harder to parse -- I thought that in both cases one
> just looked for the termination character ignoring
> escapes and checking of nesting.  Unless /[/]/ is a
> valid way to match a slash, I don't see why the lexer
> needs to care whether the [ ] are properly nested.

That's precisely the problem.  The charset [.../...] was not legal in
ES3, the / needed to be escaped.  But MSIE allows it and the web uses
it, so the other browsers have followed, and it will be legal in ES4.

--lars

# Russ Cox (18 years ago)

Sorry for continuing to bring this up, but did anything get resolved here? I'm still hoping that ES4 won't break the "backslash before punctuation is always a literal" rule followed in Perl, PCRE, and many other modern regexp packages.

Either using && and &&^ as the current & and - or using & and &^ as the current & and - (with an error if CharsetIntersect returns an empty set) would avoid this problem.

Russ

Sorry for continuing to bring this up, but did anything
get resolved here?  I'm still hoping that ES4 won't
break the "backslash before punctuation is always
a literal" rule followed in Perl, PCRE, and many other
modern regexp packages.

Either using && and &&^ as the current \& and \-
or using & and &^ as the current \& and \-
(with an error if CharsetIntersect returns an empty set)
would avoid this problem.

Russ


On 5/7/07, Lars T Hansen <lth at acm.org> wrote:
> These are all good points.  Java
> (http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
> all these problems in the following way:
>
> -  && in a charset means intersection; this sequence is not likely to
> be used except in error since a single & always suffices
>
> - subtraction is handled through intersection with a complemented
> embedded charset, eg, [a-c&&[^cde]].
>
> I think we perceived Java's solution as inelegant when we devised
> what's now proposed, especially
> subtraction-by-complement-and-intersection is unpleasant, especially
> because the nested character sets makes regular expression lexing a
> little messier.  (As it is, and by design, regular expression literals
> can be lexed by a DFA when the program is parsed; only the regular
> expression compiler needs to handle the context-free language.
> Allowing nested sets breaks this.)
>
> I'll put this item on the agenda for the next committee meeting.
>
> Sorry for taking so long to answer,
> --lars
>
> On 4/5/07, Russ Cox <rsc at swtch.com> wrote:
>
> > Regarding the proposal at
> > http://developer.mozilla.org/es4/proposals/extend_regexps.html
> > to add intersection and subtraction of character classes,
> > it certainly sounds like a good idea in principle, but I am not
> > sure I agree about the particulars of the proposal.  (Of course
> > if the Unicode properties go off the table then it makes sense
> > to toss intersection and subtraction as well; I am assuming
> > that Unicode properties are still a planned addition.)
> >
> > First, the proposal does not specify the binding precedences
> > of the new operators: surely [a-c\-b] means [ac] but does [a-c\-bb]
> > also mean [ac] or does it mean [acb]?  Etc.  This is hardly a
> > show-stopper, just a detail missing from the rough sketch proposal.
> >
> > More importantly the proposal breaks the nice regexp syntax property,
> > inherited from egrep, that escaped punctuation always denotes
> > the corresponding literal (and that non-punctuation is always a
> > literal unless escaped).  Adding \- and \& as operators breaks
> > the rule and adds confusion.  Making \- special is particularly bad,
> > because - is already special in character classes.  I find the - rules
> > weird enough that I routinely use \- in character classes when
> > I want the literal so I don't have to remember when - is special and
> > when it isn't.
> >
> > I suggest using unescaped & as the conjunction operator,
> > to preserve the property that \punctuation is always a literal.
> > It is true that this would affect existing character classes that
> > list &, but these can be diagnosed by the compiler as an
> > empty character class: [!@#$%&*()] is obviously (to the compiler)
> > a mistake, because the intersection is empty.
> >
> > I also suggest not adding an explicit subtraction operator but
> > instead using intersection with the complement, since there is
> > already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac],
> > [a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.
> >
> > The specific grammar would be something like:
> >
> >   CharClass ::= "[" ClassIntersect "]"
> >   ClassComplement ::= ClassRanges | "^" ClassRanges
> >   ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement
> >
> > Russ
> > _______________________________________________
> > Es4-discuss mailing list
> > Es4-discuss at mozilla.org
> > https://mail.mozilla.org/listinfo/es4-discuss
> >
>
>

# Lars T Hansen (18 years ago)

I'm actively at work on this, and I'm very mindful about the compatibility problem. Most likely we'll follow Java, which should not be a problem. More soon.

I'm actively at work on this, and I'm very mindful about the
compatibility problem.  Most likely we'll follow Java, which should
not be a problem.  More soon.

--lars

On 6/14/07, Russ Cox <rsc at swtch.com> wrote:
> Sorry for continuing to bring this up, but did anything
> get resolved here?  I'm still hoping that ES4 won't
> break the "backslash before punctuation is always
> a literal" rule followed in Perl, PCRE, and many other
> modern regexp packages.
>
> Either using && and &&^ as the current \& and \-
> or using & and &^ as the current \& and \-
> (with an error if CharsetIntersect returns an empty set)
> would avoid this problem.
>
> Russ
>
>
> On 5/7/07, Lars T Hansen <lth at acm.org> wrote:
> > These are all good points.  Java
> > (http://java.sun.com/javase/6/docs/api/, see java.util.regex) solves
> > all these problems in the following way:
> >
> > -  && in a charset means intersection; this sequence is not likely to
> > be used except in error since a single & always suffices
> >
> > - subtraction is handled through intersection with a complemented
> > embedded charset, eg, [a-c&&[^cde]].
> >
> > I think we perceived Java's solution as inelegant when we devised
> > what's now proposed, especially
> > subtraction-by-complement-and-intersection is unpleasant, especially
> > because the nested character sets makes regular expression lexing a
> > little messier.  (As it is, and by design, regular expression literals
> > can be lexed by a DFA when the program is parsed; only the regular
> > expression compiler needs to handle the context-free language.
> > Allowing nested sets breaks this.)
> >
> > I'll put this item on the agenda for the next committee meeting.
> >
> > Sorry for taking so long to answer,
> > --lars
> >
> > On 4/5/07, Russ Cox <rsc at swtch.com> wrote:
> >
> > > Regarding the proposal at
> > > http://developer.mozilla.org/es4/proposals/extend_regexps.html
> > > to add intersection and subtraction of character classes,
> > > it certainly sounds like a good idea in principle, but I am not
> > > sure I agree about the particulars of the proposal.  (Of course
> > > if the Unicode properties go off the table then it makes sense
> > > to toss intersection and subtraction as well; I am assuming
> > > that Unicode properties are still a planned addition.)
> > >
> > > First, the proposal does not specify the binding precedences
> > > of the new operators: surely [a-c\-b] means [ac] but does [a-c\-bb]
> > > also mean [ac] or does it mean [acb]?  Etc.  This is hardly a
> > > show-stopper, just a detail missing from the rough sketch proposal.
> > >
> > > More importantly the proposal breaks the nice regexp syntax property,
> > > inherited from egrep, that escaped punctuation always denotes
> > > the corresponding literal (and that non-punctuation is always a
> > > literal unless escaped).  Adding \- and \& as operators breaks
> > > the rule and adds confusion.  Making \- special is particularly bad,
> > > because - is already special in character classes.  I find the - rules
> > > weird enough that I routinely use \- in character classes when
> > > I want the literal so I don't have to remember when - is special and
> > > when it isn't.
> > >
> > > I suggest using unescaped & as the conjunction operator,
> > > to preserve the property that \punctuation is always a literal.
> > > It is true that this would affect existing character classes that
> > > list &, but these can be diagnosed by the compiler as an
> > > empty character class: [!@#$%&*()] is obviously (to the compiler)
> > > a mistake, because the intersection is empty.
> > >
> > > I also suggest not adding an explicit subtraction operator but
> > > instead using intersection with the complement, since there is
> > > already a complement operator ^: [a-c&^bb] = [a-c&^b] = [ac],
> > > [a-z&^by] = [a-z&^b&^y] = [ac-xz], etc.
> > >
> > > The specific grammar would be something like:
> > >
> > >   CharClass ::= "[" ClassIntersect "]"
> > >   ClassComplement ::= ClassRanges | "^" ClassRanges
> > >   ClassIntersect ::= ClassComplement | ClassIntersect "&" ClassComplement
> > >
> > > Russ
> > > _______________________________________________
> > > Es4-discuss mailing list
> > > Es4-discuss at mozilla.org
> > > https://mail.mozilla.org/listinfo/es4-discuss
> > >
> >
> >
>