Inline regexps and caching
2009/1/23 Laurens Holst <lholst at students.cs.uu.nl>:
Hi,
A colleague and I were puzzled by some strange behaviour in Firefox; we found that in some browsers literal regular expressions are cached and reused. Testcase:
function test(str) { var regexp = /^[^d]*\bd{1,4}\b[^d]*$/g; alert('Expected: 0/true, result: ' + regexp.lastIndex + '/' + regexp.test(str) ); }
var xxx = "MM/dd/yyyy"; test(xxx); test(xxx);
It turns out that Firefox and Opera return 'false' for the second test result, whereas Internet Explorer and Safari return 'true' in both cases.
Firefox and Opera are doing what ES3 requires (s 7.8.5: bclary.com/2004/11/07/#a-7.8.5 ), but I believe that it's being changed in 3.1 to produce a new one each time the literal expression is executed.
Mike
On Fri, Jan 23, 2009 at 7:33 AM, Mike Shaver <mike.shaver at gmail.com> wrote:
Firefox and Opera are doing what ES3 requires (s 7.8.5: bclary.com/2004/11/07/#a-7.8.5 ),
Correct.
but I believe that it's being changed in 3.1 to produce a new one each time the literal expression is executed.
Correct. In the meantime, you can change expressions like
var regexp = /^[^d]*\bd{1,4}\b[^d]*$/g;
into
var regexp = new RegExp("^[^d]*\\bd{1,4}\\b[^d]*$","g");
Yes, this is ugly. But an ugly program that works is better than a pretty one that doesn't.
Laurens Holst wrote:
Hi,
A colleague and I were puzzled by some strange behaviour in Firefox; we found that in some browsers literal regular expressions are cached and reused. Testcase:
function test(str) { var regexp = /^[^d]*\bd{1,4}\b[^d]*$/g; alert('Expected: 0/true, result: ' + regexp.lastIndex + '/' + regexp.test(str) ); }
var xxx = "MM/dd/yyyy"; test(xxx); test(xxx);
It turns out that Firefox and Opera return ‘false’ for the second test result, whereas Internet Explorer and Safari return ‘true’ in both cases.
The latter behaviour seems most sensible and expected to me; browsers can of course cache the regular expression object to avoid parsing it over and over again, but they should IMO clone that cached object every time it is used.
This is a known problem that has been fixed in ES3.1.
ES3 section 7.8.5:
A regular expression literal is an input element that is converted to
a RegExp object (section 15.10) when it is scanned.
ES3.1 section 7.8.5:
A regular expression literal is an input element that is converted to
a RegExp object (section 15.10) each time the literal is evaluated.
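For anyone trying this on a current engine, the difference between the two wordings can be approximated as below. This is only a sketch: it assumes an engine that already evaluates literals afresh each time, so the ES3 "created when scanned" object is imitated by hoisting one RegExp out of the function. The names sharedRegexp, testShared and testFresh are made up for the illustration.

```javascript
// Imitates ES3: a single RegExp object, created once, reused by every call.
var sharedRegexp = /^[^d]*\bd{1,4}\b[^d]*$/g;

function testShared(str) {
  var before = sharedRegexp.lastIndex;      // state left over from the previous call
  return [before, sharedRegexp.test(str)];
}

// Imitates ES3.1: a new RegExp object is produced on every evaluation.
function testFresh(str) {
  var regexp = new RegExp("^[^d]*\\bd{1,4}\\b[^d]*$", "g");
  return [regexp.lastIndex, regexp.test(str)];
}

var xxx = "MM/dd/yyyy";
var shared1 = testShared(xxx); // [0, true]
var shared2 = testShared(xxx); // [10, false]: matching started at the old lastIndex
var fresh1 = testFresh(xxx);   // [0, true]
var fresh2 = testFresh(xxx);   // [0, true]
```

The second shared call fails because the first successful match left lastIndex at 10, the end of the string, and the pattern is anchored with ^.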
Mark S. Miller wrote:
Mike Shaver wrote:
Firefox and Opera are doing what ES3 requires (s 7.8.5: bclary.com/2004/11/07/#a-7.8.5 ),
Correct.
but I believe that it's being changed in 3.1 to produce a new one each time the literal expression is executed.
Correct. In the meantime, you can change expressions like
var regexp = /^[^d]*\bd{1,4}\b[^d]*$/g;
into
var regexp = new RegExp("^[^d]*\\bd{1,4}\\b[^d]*$","g");
Yes, this is ugly. But an ugly program that works is better than a pretty one that doesn't.
Ugly, and an example of using a hammer to crack a nut. The issue is provoked by the fact that for a regular expression with the global flag set the - exec - method employs the regular expression object's - lastIndex - property, leaving it set to the end index of the last match made. Knowing that suggests that a simple 'solution' would be to explicitly set the regular expression object's - lastIndex - property to zero before using it. That must be cheaper than creating a new regular expression object just for the side effect of then having one with a zero - lastIndex - property.
In addition, knowing the mechanism also directs attention towards the global flag; does the regular expression being used need to have the global flag set in the first place? If the flag is not set then subsequent - exec - uses will always start at the zero index. The example regular expression used above only appears to be interested in making a single match so probably there was never a need to have the flag set.
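The two alternatives described above might look roughly like this. It is a sketch under an assumption: the variables shared and sharedNoG stand in for the parse-time singleton of an ES3 implementation, and the function names are invented for the example.

```javascript
var xxx = "MM/dd/yyyy";

// Alternative 1: keep the 'g' flag but reset lastIndex before each use.
var shared = /^[^d]*\bd{1,4}\b[^d]*$/g;
function testWithReset(str) {
  shared.lastIndex = 0;          // discard whatever the previous match left behind
  return shared.test(str);
}

// Alternative 2: drop the 'g' flag; matching then always starts at index 0.
var sharedNoG = /^[^d]*\bd{1,4}\b[^d]*$/;
function testWithoutGlobal(str) {
  return sharedNoG.test(str);
}

var results = [
  testWithReset(xxx), testWithReset(xxx),
  testWithoutGlobal(xxx), testWithoutGlobal(xxx)
]; // every entry is true
```

Neither variant creates a new object per call, which is the point being made: the failure comes from leftover lastIndex state, not from the literal as such.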
Richard Cornford.
On Jan 24, 2009, at 5:42 PM, Richard Cornford wrote:
Ugly, and an example of using a hammer to crack a nut.
I do this all the time, works great ;-).
Seriously, there's more afoot than can be patched by resetting
lastIndex.
The issue is provoked by the fact that for a regular expression with
the global flag set the - exec - method employs the regular
expression object's - lastIndex - property, leaving it set to the
end index of the last match made. Knowing that suggests that a
simple 'solution' would be to explicitly set the regular expression
object's - lastIndex - property to zero before using it. That must
be cheaper than creating a new regular expression object just for
the side effect of then having one with a zero - lastIndex - property.
The more general problem is shared mutable literal-expressed
singletons. In no other case (object or array initialiser, function
expressions, primitive literals) does evaluation return the singleton
created as if at parse time. Mutation hurts, sharing should be
explicit. To match the other kinds of literals and avoid bugs such as
bugzilla.mozilla.org/show_bug.cgi?id=474412
Efficiency concerns are secondary but can be addressed by lightweight
cloning of a shared-immutable compiler-created regexp.
In addition, knowing the mechanism also directs attention towards
the global flag; does the regular expression being used need to have
the global flag set in the first place? If the flag is not set then
subsequent - exec - uses will always start at the zero index. The
example regular expression used above only appears to be interested
in making a single match so probably there was never a need to have
the flag set.
This is an optimization challenge for implementors, not a reason to
specify a shared singleton with mutable state (lastIndex is mutable
and set to 0 even without the 'g' flag).
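The parenthetical about lastIndex can be checked with a few lines. This is a small sketch, with the pattern and variable names chosen only for illustration: a regexp without the 'g' flag still carries a writable lastIndex property, even though matching does not start from it.

```javascript
var plain = /d{1,4}/;             // no 'g' flag
var initial = plain.lastIndex;    // 0 on a newly created regexp

plain.lastIndex = 5;              // the property is writable all the same

// "dd" sits at index 3, before position 5. A search honouring lastIndex
// would miss it; a non-global search starts at 0 and finds it.
var matched = plain.test("MM/dd/yyyy");

// A successful non-global match leaves the property as we set it.
var after = plain.lastIndex;
```

So the object has mutable state with or without the flag; the flag only decides whether exec and test read and update it.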
Brendan Eich wrote:
On Jan 24, 2009, at 5:42 PM, Richard Cornford wrote:
Ugly, and an example of using a hammer to crack a nut.
I do this all the time, works great ;-).
And I never do it, and that works great too.
Seriously, there's more afoot than can be patched by resetting lastIndex.
My intention was to suggest that the 'prettiest' solution was to delete the superfluous 'g' from the end of the regular expression literal. But resetting the - lastIndex - prior to using the regular expression object would also eliminate the undesirable behaviour in Laurens Holst's code, and has the merit of directly addressing the characteristic of regular expressions that results in the issue.
I admit, though, that the - new RegExp - thing is a bit of a bugbear for me. There are two reasons for that. The first is that I have encountered orders of magnitude more issues arising from people failing to cope with the 'double escaping' needed in string literal arguments for the RegExp constructor than issues following from the handling of - lastIndex -. So, for example, if you want to match a dot or a backslash in a regular expression they will need to be escaped by preceding them with a backslash, but in the string literal that backslash needs to be escaped with a second backslash if the RegExp constructor is going to see it (and in the case of matching the backslash that also needs to be escaped in the string literal). People just seem to make a lot of mistakes when being required to do that, and those mistakes don't seem to be easy to spot as the resulting regular expressions still 'work', even to the extent of sometimes making some 'correct'/expected matches.
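A short sketch of the kind of mistake meant here; the variable names and test strings are invented for the illustration.

```javascript
// Matching a literal dot.
var literalDot = /\./;
var wrongDot = new RegExp("\.");   // the string is just ".", so this matches any character
var rightDot = new RegExp("\\.");  // the string is "\.", which is what the constructor needs

var dotResults = [
  literalDot.test("a.b"), literalDot.test("axb"),  // true, false
  wrongDot.test("a.b"), wrongDot.test("axb"),      // true, true: it still seems to 'work'
  rightDot.test("a.b"), rightDot.test("axb")       // true, false
];

// Matching a literal backslash takes four backslashes in the string literal.
var backslash = new RegExp("\\\\");
var backslashFound = backslash.test("a\\b");       // "a\\b" is the 3-character string a\b

// "\b" in a string literal is a backspace character, not a word boundary.
var wrongBoundary = new RegExp("\bd").test("dd");  // false
var rightBoundary = new RegExp("\\bd").test("dd"); // true
```

The wrongDot case is the troublesome one: it passes the obvious test with "a.b" and only misbehaves on input nobody thought to try.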
The second reason is that the construct is often proposed without explanation and so can be received as a mystical incantation to be chanted in the face of every regular expression regardless of whether it is achieving anything useful in context. And so you encounter things like:-
...
format: function(s) {
    return $.tablesorter.formatFloat(s.replace(new RegExp(/%/g),""));
},
...
(from a jQuery table sorting plug-in)
- and end up wondering what on earth the author thought that - new RegExp - was supposed to achieve.
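To make the redundancy concrete, here is a minimal comparison. The helper names stripWrapped and stripLiteral are hypothetical, and the call to $.tablesorter.formatFloat is left out so that the snippet runs without jQuery.

```javascript
// The form used in the plug-in: a literal wrapped in the constructor.
function stripWrapped(s) {
  return s.replace(new RegExp(/%/g), "");
}

// The plain literal.
function stripLiteral(s) {
  return s.replace(/%/g, "");
}

var inner = /%/g;
var wrapped = new RegExp(inner);  // a distinct object with the same pattern and flags

var sameOutput = stripWrapped("12.5%") === stripLiteral("12.5%"); // both give "12.5"
var sameSource = wrapped.source === inner.source && wrapped.global === inner.global;
var distinct = wrapped !== inner;
```

The wrapper only makes a copy of a regexp that was already usable as it stood, so in this position it buys nothing.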
The issue is provoked by the fact that for a regular expression with the global flag set the - exec - method employs the regular expression object's - lastIndex - property, leaving it set to the end index of the last match made. Knowing that suggests that a simple 'solution' would be to explicitly set the regular expression object's - lastIndex - property to zero before using it. That must be cheaper than creating a new regular expression object just for the side effect of then having one with a zero - lastIndex - property.
The more general problem is shared mutable literal-expressed singletons. In no other case (object or array initialiser, function expressions, primitive literals) does evaluation return the singleton created as if at parse time. Mutation hurts, sharing should be explicit.
All of that is true, and making sure the next language version eliminates that is a good idea. But that does not help people who have to address current ES 3 implementations.
To match the other kinds of literals and avoid bugs such as bugzilla.mozilla.org/show_bug.cgi?id=474412
Now that is an issue that relates to the identity of regular expression objects, and so can only be addressed by creating distinct objects with - new RegExp -.
Efficiency concerns are secondary but can be addressed by lightweight cloning of a shared-immutable compiler-created regexp.
"Can be addressed by ...", 'will be addressed by ...' and 'MUST be addressed by ...' are all very different things. It is not in the remit of the new specification to be requiring specific optimisations in future implementations.
In addition, knowing the mechanism also directs attention towards the global flag; does the regular expression being used need to have the global flag set in the first place? If the flag is not set then subsequent - exec - uses will always start at the zero index. The example regular expression used above only appears to be interested in making a single match so probably there was never a need to have the flag set.
This is an optimization challenge for implementors, not a reason to specify a shared singleton with mutable state (lastIndex is mutable and set to 0 even without the 'g' flag).
I am not saying that there should be a shared singleton. In the situation as we have it now there are implementations that create regular expression literals while parsing, and others that create them when the expression is evaluated. So it is not possible to rely on the former or expect the latter. The result is a minefield that needs to be cleaned up. But in the meanwhile bulldozing all regular expression uses with - new RegExp - seems an extreme alternative to recognising the few that can blow up in your face and defusing them individually.
Richard Cornford.
On Jan 24, 2009, at 10:56 PM, Richard Cornford wrote:
All of that is true, and making sure the next language version
eliminates that is a good idea. But that does not help people who
have to address current ES 3 implementations.
The bug I cited,
bugzilla.mozilla.org/show_bug.cgi?id=474412
claims IE and Safari do what ES3.1 specifies already. Pratap's JScript
Deviations doc didn't mention this, at least not in the early form I
just checked, and I can't test IE here. It's true for Safari 3.2.1.
To match the other kinds of literals and avoid bugs such as
bugzilla.mozilla.org/show_bug.cgi?id=474412
Now that is an issue that relates to the identity of regular
expression objects, and so can only be addressed by creating
distinct objects with - new RegExp -.
The question of identity is potentially involved, unless you can show
that resetting lastIndex but preserving the lexical singleton ES3
conformance satisfies most of the complaints behind the highly-dup'ed
bug
bugzilla.mozilla.org/show_bug.cgi?id=98409
(see weblogs.mozillazine.org/roadmap/archives/008325.html where
the top three most-frequently dup'ed bugzilla.mozilla.org JS bugs were
tabulated).
I'm not sure how much lastIndex resetting would help, but I know it
would hurt those reporters who complain (also or separately) about the
odd lexical singleton evaluation model. I'd rather fix regexp literals
to evaluate like other mutable-object literals. This fix subsumes any
ad-hoc fix for the lastIndex bug, for principled reasons: by
eliminating implicit sharing of literally expressed mutable objects.
"Can be addressed by ...", 'will be addressed by ...' and 'MUST be
addressed by ...' are all very different things. It is not in the
remit of the new specification to be requiring specific
optimisations in future implementations.
Of course not -- nor is it the spec's job to prematurely optimize at
the expense of semantic cleanliness and principle-of-least-astonishment.
The market will sort out the implementation quality issue, and it's
already forcing major performance work from all the top vendors. It
won't happen overnight, but we've already got a cross-browser
difference to deal with. Better to make the right long-term fix to the
spec.
I am not saying that there should be a shared singleton. In the
situation as we have it now there are implementations that create
regular expression literals while parsing, and others that create
them when the expression is evaluated.
The latter are not conforming to ES3, FWIW. The spec is clear.
So it is not possible to rely on the former or expect the latter.
The result is a minefield that needs to be cleaned up. But in the
meanwhile bulldozing all regular expression uses with - new RegExp -
seems an extreme alternative to recognising the few that can blow up
in your face and defusing them individually.
Oh, I didn't mean to argue against your argument with Mark's advice to
use new RegExp(...) exclusively and never use literals! Sorry if I
misread you, I thought you were arguing for lastIndex resetting as an
alternative to the 3.1 evaluation change.
Brendan Eich wrote:
On Jan 24, 2009, at 10:56 PM, Richard Cornford wrote: <snip>
I am not saying that there should be a shared singleton. In the situation as we have it now there are implementations that create regular expression literals while parsing, and others that create them when the expression is evaluated.
The latter are not conforming to ES3, FWIW. The spec is clear.
They are not conforming, but their (now long-term) existence has prevented people from expecting conformity in this area. Which becomes the thing that allows the new spec to change the way regular expression literals are handled.
So it is not possible to rely on the former or expect the latter. The result is a minefield that needs to be cleaned up. But in the meanwhile bulldozing all regular expression uses with - new RegExp - seems an extreme alternative to recognising the few that can blow up in your face and defusing them individually.
Oh, I didn't mean to argue against your argument with Mark's advice to use new RegExp(...) exclusively and never use literals! Sorry if I misread you, I thought you were arguing for lastIndex resetting as an alternative to the 3.1 evaluation change.
It would never have been realistic/practical to change the way - lastIndex - is handled in the - exec - method as that would break code such as:-
var m, rx = / ... /g;
while (( m = rx.exec( st ) )){
    ... // handle each successive match in turn.
}
- which, even if not often seen, is something people do.
Fortunately the existing non-conforming implementations will have prevented the variation:-
var m;
while (( m = / ... /g.exec( st ) )){
    ... // handle each successive match in turn.
}
- which would otherwise have been broken by getting rid of the shared singleton.
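A runnable version of the two loops, as a sketch only. The pattern /[a-z]+/gi and the input string are placeholders for the elided parts, and the results noted in the comments assume an engine that creates a new object each time a literal is evaluated. The second loop is capped at three iterations because on such an engine it would otherwise never terminate.

```javascript
var st = "MM/dd/yyyy";

// First form: one regexp object held in a variable, so lastIndex advances.
var m, rx = /[a-z]+/gi;
var found = [];
while ((m = rx.exec(st))) {
  found.push(m[0] + "@" + m.index);   // handle each successive match in turn
}
// found is ["MM@0", "dd@3", "yyyy@6"]

// Second form: the literal sits inside the loop, so each pass gets a new
// object whose lastIndex is 0 and the first match is found again and again.
var repeated = [];
for (var i = 0; i < 3; i++) {
  var again = /[a-z]+/gi.exec(st);
  repeated.push(again[0]);
}
// repeated is ["MM", "MM", "MM"]
```

On an implementation with the ES3 shared object the second loop would instead step through the matches just as the first does.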
Richard Cornford.
A colleague and I were puzzled by some strange behaviour in Firefox; we found that in some browsers literal regular expressions are cached and reused. Testcase:
function test(str) { var regexp = /^[^d]*\bd{1,4}\b[^d]*$/g; alert('Expected: 0/true, result: ' + regexp.lastIndex + '/' + regexp.test(str) ); }
var xxx = "MM/dd/yyyy"; test(xxx); test(xxx);
It turns out that Firefox and Opera return ‘false’ for the second test result, whereas Internet Explorer and Safari return ‘true’ in both cases.
The latter behaviour seems most sensible and expected to me; browsers can of course cache the regular expression object to avoid parsing it over and over again, but they should IMO clone that cached object every time it is used.
~Laurens