Streaming regexp matching

# Isiah Meadows (7 years ago)

There's two things I've found that need suspendable matching:

Partially matching against a string, which is useful with interactive form validation and such.
Pattern matching and replacement over a character stream, which is useful for things like matching against files without loading the entire thing into memory or easier filtering of requests.

Also, it'd be nice if there was a facility to get all matches, including duplicate group matches. This is often useful for simple parsing, where if such support existed, you could just use a Kleene star instead of the standard exec loops (which admittedly get old).

And finally, we could avoid setting regexp globals here. That would speed up the matcher quite a bit.

So, here's my proposal:

regexp.matcher() -> matcher - Create a streaming regexp matcher.
matcher.consume(codePoint, charSize?) -> result | undefined -

Consume a Unicode code point or -1 if no more characters exist, and return a match result, undefined if no match occurred. charSize is the number of bytes represented by codePoint (default: 1-2 if /u is set, 1 otherwise), so it can work with other encodings flexibly.

matcher.nextPossibleStart -> number - The next possible start the

matcher could have, for more effective buffering and stream management. This is implementation-defined, but it must be be -1 after the matcher completes, and it must be within [0, N) otherwise, where N is the next returned match.

result.group -> string | number | undefined - Return the group

index/name of the current match, or undefined if it's just issuing a match of the global regexp.

result.start -> number - Return the matched value's start index.
result.end -> number - Return the matched value's end index.
This does not modify any globals or regexp instance members. It only reads regexp.lastIndex on creation. (It doesn't operate on strings, so it shouldn't return any it doesn't already have.)

Most RegExp methods could similarly be built using this as a base: if they work on strings, they can iterate their code points.

As for the various concerns:

Partial matching is just iterating a string's character codes and seeing if the matcher ever returned non-undefined.
Streaming pattern matching is pretty obvious from just reading the API.
Getting all matches is just iterating the string and returning an object with all the groups + strings it matched.

So WDYT?

/cc Mathias Bynens, since I know you're involved in this kind of text-heavy stuff.

Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com

There's two things I've found that need suspendable matching:

1. Partially matching against a string, which is useful with
interactive form validation and such.
2. Pattern matching and replacement over a character stream, which is
useful for things like matching against files without loading the
entire thing into memory or easier filtering of requests.

Also, it'd be nice if there was a facility to get *all* matches,
including duplicate group matches. This is often useful for simple
parsing, where if such support existed, you could just use a Kleene
star instead of the standard `exec` loops (which admittedly get old).

And finally, we could avoid setting regexp globals here. That would
speed up the matcher quite a bit.

So, here's my proposal:

- `regexp.matcher() -> matcher` - Create a streaming regexp matcher.
- `matcher.consume(codePoint, charSize?) -> result | undefined` -
Consume a Unicode code point or `-1` if no more characters exist, and
return a match result, `undefined` if no match occurred. `charSize` is
the number of bytes represented by `codePoint` (default: 1-2 if `/u`
is set, 1 otherwise), so it can work with other encodings flexibly.
- `matcher.nextPossibleStart -> number` - The next possible start the
matcher could have, for more effective buffering and stream
management. This is implementation-defined, but it *must* be be `-1`
after the matcher completes, and it *must* be within [0, N) otherwise,
where N is the next returned match.
- `result.group -> string | number | undefined` - Return the group
index/name of the current match, or `undefined` if it's just issuing a
match of the global regexp.
- `result.start -> number` - Return the matched value's start index.
- `result.end -> number` - Return the matched value's end index.
- This does *not* modify any globals or regexp instance members. It
only reads `regexp.lastIndex` on creation. (It doesn't operate on
strings, so it shouldn't return any it doesn't already have.)

Most RegExp methods could similarly be built using this as a base: if
they work on strings, they can iterate their code points.

As for the various concerns:

- Partial matching is just iterating a string's character codes and
seeing if the matcher ever returned non-`undefined`.
- Streaming pattern matching is pretty obvious from just reading the API.
- Getting all matches is just iterating the string and returning an
object with all the groups + strings it matched.

So WDYT?

/cc Mathias Bynens, since I know you're involved in this kind of
text-heavy stuff.

-----

Isiah Meadows
contact at isiahmeadows.com
www.isiahmeadows.com

# Jordan Harband (7 years ago)

Have you tried to implement this as a RegExp subclass, overriding all the necessary extension points?

Have you tried to implement this as a RegExp subclass, overriding all the
necessary extension points?

On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com>
wrote:

> There's two things I've found that need suspendable matching:
>
> 1. Partially matching against a string, which is useful with
> interactive form validation and such.
> 2. Pattern matching and replacement over a character stream, which is
> useful for things like matching against files without loading the
> entire thing into memory or easier filtering of requests.
>
> Also, it'd be nice if there was a facility to get *all* matches,
> including duplicate group matches. This is often useful for simple
> parsing, where if such support existed, you could just use a Kleene
> star instead of the standard `exec` loops (which admittedly get old).
>
> And finally, we could avoid setting regexp globals here. That would
> speed up the matcher quite a bit.
>
> So, here's my proposal:
>
> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher.
> - `matcher.consume(codePoint, charSize?) -> result | undefined` -
> Consume a Unicode code point or `-1` if no more characters exist, and
> return a match result, `undefined` if no match occurred. `charSize` is
> the number of bytes represented by `codePoint` (default: 1-2 if `/u`
> is set, 1 otherwise), so it can work with other encodings flexibly.
> - `matcher.nextPossibleStart -> number` - The next possible start the
> matcher could have, for more effective buffering and stream
> management. This is implementation-defined, but it *must* be be `-1`
> after the matcher completes, and it *must* be within [0, N) otherwise,
> where N is the next returned match.
> - `result.group -> string | number | undefined` - Return the group
> index/name of the current match, or `undefined` if it's just issuing a
> match of the global regexp.
> - `result.start -> number` - Return the matched value's start index.
> - `result.end -> number` - Return the matched value's end index.
> - This does *not* modify any globals or regexp instance members. It
> only reads `regexp.lastIndex` on creation. (It doesn't operate on
> strings, so it shouldn't return any it doesn't already have.)
>
> Most RegExp methods could similarly be built using this as a base: if
> they work on strings, they can iterate their code points.
>
> As for the various concerns:
>
> - Partial matching is just iterating a string's character codes and
> seeing if the matcher ever returned non-`undefined`.
> - Streaming pattern matching is pretty obvious from just reading the API.
> - Getting all matches is just iterating the string and returning an
> object with all the groups + strings it matched.
>
> So WDYT?
>
> /cc Mathias Bynens, since I know you're involved in this kind of
> text-heavy stuff.
>
> -----
>
> Isiah Meadows
> contact at isiahmeadows.com
> www.isiahmeadows.com
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20180730/91b610aa/attachment.html>

# Michael Theriot (7 years ago)

I'll say I have at least a few times desired a way to stream input to RegExp. To the point I fiddled around implementing a finite state machine.

My use cases:

short circuiting on nth capture
parsing streams

I'll say I have at least a few times desired a way to stream input to
RegExp. To the point I fiddled around implementing a finite state machine.

My use cases:
1. short circuiting on nth capture
2. parsing streams

On Mon, Jul 30, 2018 at 2:39 PM, Isiah Meadows <isiahmeadows at gmail.com>
wrote:

> There's two things I've found that need suspendable matching:
>
> 1. Partially matching against a string, which is useful with
> interactive form validation and such.
> 2. Pattern matching and replacement over a character stream, which is
> useful for things like matching against files without loading the
> entire thing into memory or easier filtering of requests.
>
> Also, it'd be nice if there was a facility to get *all* matches,
> including duplicate group matches. This is often useful for simple
> parsing, where if such support existed, you could just use a Kleene
> star instead of the standard `exec` loops (which admittedly get old).
>
> And finally, we could avoid setting regexp globals here. That would
> speed up the matcher quite a bit.
>
> So, here's my proposal:
>
> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher.
> - `matcher.consume(codePoint, charSize?) -> result | undefined` -
> Consume a Unicode code point or `-1` if no more characters exist, and
> return a match result, `undefined` if no match occurred. `charSize` is
> the number of bytes represented by `codePoint` (default: 1-2 if `/u`
> is set, 1 otherwise), so it can work with other encodings flexibly.
> - `matcher.nextPossibleStart -> number` - The next possible start the
> matcher could have, for more effective buffering and stream
> management. This is implementation-defined, but it *must* be be `-1`
> after the matcher completes, and it *must* be within [0, N) otherwise,
> where N is the next returned match.
> - `result.group -> string | number | undefined` - Return the group
> index/name of the current match, or `undefined` if it's just issuing a
> match of the global regexp.
> - `result.start -> number` - Return the matched value's start index.
> - `result.end -> number` - Return the matched value's end index.
> - This does *not* modify any globals or regexp instance members. It
> only reads `regexp.lastIndex` on creation. (It doesn't operate on
> strings, so it shouldn't return any it doesn't already have.)
>
> Most RegExp methods could similarly be built using this as a base: if
> they work on strings, they can iterate their code points.
>
> As for the various concerns:
>
> - Partial matching is just iterating a string's character codes and
> seeing if the matcher ever returned non-`undefined`.
> - Streaming pattern matching is pretty obvious from just reading the API.
> - Getting all matches is just iterating the string and returning an
> object with all the groups + strings it matched.
>
> So WDYT?
>
> /cc Mathias Bynens, since I know you're involved in this kind of
> text-heavy stuff.
>
> -----
>
> Isiah Meadows
> contact at isiahmeadows.com
> www.isiahmeadows.com
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20180730/2d3beb55/attachment-0001.html>

# Isiah Meadows (7 years ago)

Not yet, but if I were, it wouldn't be a subclass (it's not necessary). But the key trouble implementing this is that I'd have to reimplement the regexp matching logic entirely from scratch if I want remotely reasonable runtime complexity. As for why:

If you ignore backreferences (which IIUC makes JS regexps go from regular to context-sensitive recognizers), it's possible to implement partial matching using regexp rewriting, but only because JS doesn't have any other arcane enough features to make it infeasible.
There is no possible way to extract duplicate matches without rewriting the regexp entirely and using regexp.exec, and the complexity of the logic to do this generally I suspect is NP-complete, maybe PSPACE-complete, and in either case, certainly infeasible when backreferences enter the picture.
It's technically possible to refactor a string for streaming, but I then lose the ability to discern a match from a non-match. I also have the same issue as per above WRT duplicate matches and partial matching with complexity issues.
If you were to combine the three, rewriting the regexp generally to support streaming matches, including duplicates, I suspect would likely be PSPACE-complete because it seems very similar to the first problem listed here 1.

* Backreferences bring the grammatical complexity of JS regexps from I'm pretty sure regular to context-sensitive.

Or in other words, either you control the raw matching logic yourself, or the polyfill runtime complexity could get absurd for this proposal. And implementing a pushdown state machine-based regexp engine + parser in JS isn't exactly something I'm willing to prototype for a strawman.

Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com

Not yet, but if I were, it wouldn't be a subclass (it's not
necessary). But the key trouble implementing this is that I'd have to
reimplement the regexp matching logic entirely from scratch if I want
remotely reasonable runtime complexity. As for why:

- If you ignore backreferences (which IIUC makes JS regexps go from
regular to context-sensitive recognizers), it's *possible* to
implement partial matching using regexp rewriting, but only because JS
doesn't have any other arcane enough features to make it infeasible.
- There is no possible way to extract duplicate matches without
rewriting the regexp entirely and using `regexp.exec`, and the
complexity of the logic to do this generally I suspect is NP-complete,
maybe PSPACE-complete, and in either case, certainly infeasible when
backreferences enter the picture.
- It's technically possible to refactor a string for streaming, but I
then lose the ability to discern a match from a non-match. I also have
the same issue as per above WRT duplicate matches and partial matching
with complexity issues.
- If you were to combine the three, rewriting the regexp generally to
support streaming matches, including duplicates, I suspect would
likely be PSPACE-complete because it seems *very* similar to the first
problem listed here [1].

\* Backreferences bring the grammatical complexity of JS regexps from
I'm pretty sure regular to context-sensitive.

Or in other words, either you control the raw matching logic yourself,
or the polyfill runtime complexity could get absurd for this proposal.
And implementing a pushdown state machine-based regexp engine + parser
in JS isn't exactly something I'm willing to prototype for a strawman.

[1]: https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata

-----

Isiah Meadows
contact at isiahmeadows.com
www.isiahmeadows.com


On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote:
> Have you tried to implement this as a RegExp subclass, overriding all the
> necessary extension points?
>
> On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com>
> wrote:
>>
>> There's two things I've found that need suspendable matching:
>>
>> 1. Partially matching against a string, which is useful with
>> interactive form validation and such.
>> 2. Pattern matching and replacement over a character stream, which is
>> useful for things like matching against files without loading the
>> entire thing into memory or easier filtering of requests.
>>
>> Also, it'd be nice if there was a facility to get *all* matches,
>> including duplicate group matches. This is often useful for simple
>> parsing, where if such support existed, you could just use a Kleene
>> star instead of the standard `exec` loops (which admittedly get old).
>>
>> And finally, we could avoid setting regexp globals here. That would
>> speed up the matcher quite a bit.
>>
>> So, here's my proposal:
>>
>> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher.
>> - `matcher.consume(codePoint, charSize?) -> result | undefined` -
>> Consume a Unicode code point or `-1` if no more characters exist, and
>> return a match result, `undefined` if no match occurred. `charSize` is
>> the number of bytes represented by `codePoint` (default: 1-2 if `/u`
>> is set, 1 otherwise), so it can work with other encodings flexibly.
>> - `matcher.nextPossibleStart -> number` - The next possible start the
>> matcher could have, for more effective buffering and stream
>> management. This is implementation-defined, but it *must* be be `-1`
>> after the matcher completes, and it *must* be within [0, N) otherwise,
>> where N is the next returned match.
>> - `result.group -> string | number | undefined` - Return the group
>> index/name of the current match, or `undefined` if it's just issuing a
>> match of the global regexp.
>> - `result.start -> number` - Return the matched value's start index.
>> - `result.end -> number` - Return the matched value's end index.
>> - This does *not* modify any globals or regexp instance members. It
>> only reads `regexp.lastIndex` on creation. (It doesn't operate on
>> strings, so it shouldn't return any it doesn't already have.)
>>
>> Most RegExp methods could similarly be built using this as a base: if
>> they work on strings, they can iterate their code points.
>>
>> As for the various concerns:
>>
>> - Partial matching is just iterating a string's character codes and
>> seeing if the matcher ever returned non-`undefined`.
>> - Streaming pattern matching is pretty obvious from just reading the API.
>> - Getting all matches is just iterating the string and returning an
>> object with all the groups + strings it matched.
>>
>> So WDYT?
>>
>> /cc Mathias Bynens, since I know you're involved in this kind of
>> text-heavy stuff.
>>
>> -----
>>
>> Isiah Meadows
>> contact at isiahmeadows.com
>> www.isiahmeadows.com
>> _______________________________________________
>> es-discuss mailing list
>> es-discuss at mozilla.org
>> https://mail.mozilla.org/listinfo/es-discuss
>
>

# Jordan Harband (7 years ago)

(It'd have to be a subclass for RegExp.prototype.whatever.call to do the right thing for your alternative regex)

(It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the
right thing for your alternative regex)

On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com>
wrote:

> Not yet, but if I were, it wouldn't be a subclass (it's not
> necessary). But the key trouble implementing this is that I'd have to
> reimplement the regexp matching logic entirely from scratch if I want
> remotely reasonable runtime complexity. As for why:
>
> - If you ignore backreferences (which IIUC makes JS regexps go from
> regular to context-sensitive recognizers), it's *possible* to
> implement partial matching using regexp rewriting, but only because JS
> doesn't have any other arcane enough features to make it infeasible.
> - There is no possible way to extract duplicate matches without
> rewriting the regexp entirely and using `regexp.exec`, and the
> complexity of the logic to do this generally I suspect is NP-complete,
> maybe PSPACE-complete, and in either case, certainly infeasible when
> backreferences enter the picture.
> - It's technically possible to refactor a string for streaming, but I
> then lose the ability to discern a match from a non-match. I also have
> the same issue as per above WRT duplicate matches and partial matching
> with complexity issues.
> - If you were to combine the three, rewriting the regexp generally to
> support streaming matches, including duplicates, I suspect would
> likely be PSPACE-complete because it seems *very* similar to the first
> problem listed here [1].
>
> \* Backreferences bring the grammatical complexity of JS regexps from
> I'm pretty sure regular to context-sensitive.
>
> Or in other words, either you control the raw matching logic yourself,
> or the polyfill runtime complexity could get absurd for this proposal.
> And implementing a pushdown state machine-based regexp engine + parser
> in JS isn't exactly something I'm willing to prototype for a strawman.
>
> [1]: https://en.wikipedia.org/wiki/PSPACE-complete#Regular_
> expressions_and_automata
>
> -----
>
> Isiah Meadows
> contact at isiahmeadows.com
> www.isiahmeadows.com
>
>
> On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote:
> > Have you tried to implement this as a RegExp subclass, overriding all the
> > necessary extension points?
> >
> > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com>
> > wrote:
> >>
> >> There's two things I've found that need suspendable matching:
> >>
> >> 1. Partially matching against a string, which is useful with
> >> interactive form validation and such.
> >> 2. Pattern matching and replacement over a character stream, which is
> >> useful for things like matching against files without loading the
> >> entire thing into memory or easier filtering of requests.
> >>
> >> Also, it'd be nice if there was a facility to get *all* matches,
> >> including duplicate group matches. This is often useful for simple
> >> parsing, where if such support existed, you could just use a Kleene
> >> star instead of the standard `exec` loops (which admittedly get old).
> >>
> >> And finally, we could avoid setting regexp globals here. That would
> >> speed up the matcher quite a bit.
> >>
> >> So, here's my proposal:
> >>
> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher.
> >> - `matcher.consume(codePoint, charSize?) -> result | undefined` -
> >> Consume a Unicode code point or `-1` if no more characters exist, and
> >> return a match result, `undefined` if no match occurred. `charSize` is
> >> the number of bytes represented by `codePoint` (default: 1-2 if `/u`
> >> is set, 1 otherwise), so it can work with other encodings flexibly.
> >> - `matcher.nextPossibleStart -> number` - The next possible start the
> >> matcher could have, for more effective buffering and stream
> >> management. This is implementation-defined, but it *must* be be `-1`
> >> after the matcher completes, and it *must* be within [0, N) otherwise,
> >> where N is the next returned match.
> >> - `result.group -> string | number | undefined` - Return the group
> >> index/name of the current match, or `undefined` if it's just issuing a
> >> match of the global regexp.
> >> - `result.start -> number` - Return the matched value's start index.
> >> - `result.end -> number` - Return the matched value's end index.
> >> - This does *not* modify any globals or regexp instance members. It
> >> only reads `regexp.lastIndex` on creation. (It doesn't operate on
> >> strings, so it shouldn't return any it doesn't already have.)
> >>
> >> Most RegExp methods could similarly be built using this as a base: if
> >> they work on strings, they can iterate their code points.
> >>
> >> As for the various concerns:
> >>
> >> - Partial matching is just iterating a string's character codes and
> >> seeing if the matcher ever returned non-`undefined`.
> >> - Streaming pattern matching is pretty obvious from just reading the
> API.
> >> - Getting all matches is just iterating the string and returning an
> >> object with all the groups + strings it matched.
> >>
> >> So WDYT?
> >>
> >> /cc Mathias Bynens, since I know you're involved in this kind of
> >> text-heavy stuff.
> >>
> >> -----
> >>
> >> Isiah Meadows
> >> contact at isiahmeadows.com
> >> www.isiahmeadows.com
> >> _______________________________________________
> >> es-discuss mailing list
> >> es-discuss at mozilla.org
> >> https://mail.mozilla.org/listinfo/es-discuss
> >
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20180730/4ff593f3/attachment.html>

# Isiah Meadows (7 years ago)

I was thinking in terms of what hooks need overridden (none for the base proposal), not where the method itself lived.

Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com

I was thinking in terms of what hooks need overridden (none for the
base proposal), not where the method itself lived.
-----

Isiah Meadows
contact at isiahmeadows.com
www.isiahmeadows.com


On Mon, Jul 30, 2018 at 5:44 PM, Jordan Harband <ljharb at gmail.com> wrote:
> (It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the
> right thing for your alternative regex)
>
> On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com>
> wrote:
>>
>> Not yet, but if I were, it wouldn't be a subclass (it's not
>> necessary). But the key trouble implementing this is that I'd have to
>> reimplement the regexp matching logic entirely from scratch if I want
>> remotely reasonable runtime complexity. As for why:
>>
>> - If you ignore backreferences (which IIUC makes JS regexps go from
>> regular to context-sensitive recognizers), it's *possible* to
>> implement partial matching using regexp rewriting, but only because JS
>> doesn't have any other arcane enough features to make it infeasible.
>> - There is no possible way to extract duplicate matches without
>> rewriting the regexp entirely and using `regexp.exec`, and the
>> complexity of the logic to do this generally I suspect is NP-complete,
>> maybe PSPACE-complete, and in either case, certainly infeasible when
>> backreferences enter the picture.
>> - It's technically possible to refactor a string for streaming, but I
>> then lose the ability to discern a match from a non-match. I also have
>> the same issue as per above WRT duplicate matches and partial matching
>> with complexity issues.
>> - If you were to combine the three, rewriting the regexp generally to
>> support streaming matches, including duplicates, I suspect would
>> likely be PSPACE-complete because it seems *very* similar to the first
>> problem listed here [1].
>>
>> \* Backreferences bring the grammatical complexity of JS regexps from
>> I'm pretty sure regular to context-sensitive.
>>
>> Or in other words, either you control the raw matching logic yourself,
>> or the polyfill runtime complexity could get absurd for this proposal.
>> And implementing a pushdown state machine-based regexp engine + parser
>> in JS isn't exactly something I'm willing to prototype for a strawman.
>>
>> [1]:
>> https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata
>>
>> -----
>>
>> Isiah Meadows
>> contact at isiahmeadows.com
>> www.isiahmeadows.com
>>
>>
>> On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote:
>> > Have you tried to implement this as a RegExp subclass, overriding all
>> > the
>> > necessary extension points?
>> >
>> > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com>
>> > wrote:
>> >>
>> >> There's two things I've found that need suspendable matching:
>> >>
>> >> 1. Partially matching against a string, which is useful with
>> >> interactive form validation and such.
>> >> 2. Pattern matching and replacement over a character stream, which is
>> >> useful for things like matching against files without loading the
>> >> entire thing into memory or easier filtering of requests.
>> >>
>> >> Also, it'd be nice if there was a facility to get *all* matches,
>> >> including duplicate group matches. This is often useful for simple
>> >> parsing, where if such support existed, you could just use a Kleene
>> >> star instead of the standard `exec` loops (which admittedly get old).
>> >>
>> >> And finally, we could avoid setting regexp globals here. That would
>> >> speed up the matcher quite a bit.
>> >>
>> >> So, here's my proposal:
>> >>
>> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher.
>> >> - `matcher.consume(codePoint, charSize?) -> result | undefined` -
>> >> Consume a Unicode code point or `-1` if no more characters exist, and
>> >> return a match result, `undefined` if no match occurred. `charSize` is
>> >> the number of bytes represented by `codePoint` (default: 1-2 if `/u`
>> >> is set, 1 otherwise), so it can work with other encodings flexibly.
>> >> - `matcher.nextPossibleStart -> number` - The next possible start the
>> >> matcher could have, for more effective buffering and stream
>> >> management. This is implementation-defined, but it *must* be be `-1`
>> >> after the matcher completes, and it *must* be within [0, N) otherwise,
>> >> where N is the next returned match.
>> >> - `result.group -> string | number | undefined` - Return the group
>> >> index/name of the current match, or `undefined` if it's just issuing a
>> >> match of the global regexp.
>> >> - `result.start -> number` - Return the matched value's start index.
>> >> - `result.end -> number` - Return the matched value's end index.
>> >> - This does *not* modify any globals or regexp instance members. It
>> >> only reads `regexp.lastIndex` on creation. (It doesn't operate on
>> >> strings, so it shouldn't return any it doesn't already have.)
>> >>
>> >> Most RegExp methods could similarly be built using this as a base: if
>> >> they work on strings, they can iterate their code points.
>> >>
>> >> As for the various concerns:
>> >>
>> >> - Partial matching is just iterating a string's character codes and
>> >> seeing if the matcher ever returned non-`undefined`.
>> >> - Streaming pattern matching is pretty obvious from just reading the
>> >> API.
>> >> - Getting all matches is just iterating the string and returning an
>> >> object with all the groups + strings it matched.
>> >>
>> >> So WDYT?
>> >>
>> >> /cc Mathias Bynens, since I know you're involved in this kind of
>> >> text-heavy stuff.
>> >>
>> >> -----
>> >>
>> >> Isiah Meadows
>> >> contact at isiahmeadows.com
>> >> www.isiahmeadows.com
>> >> _______________________________________________
>> >> es-discuss mailing list
>> >> es-discuss at mozilla.org
>> >> https://mail.mozilla.org/listinfo/es-discuss
>> >
>> >
>
>

# Isiah Meadows (7 years ago)

BTW, I created a rough sketch of a strawman here: isiahmeadows/streaming-regexp-proposal

Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com

BTW, I created a rough sketch of a strawman here:
https://github.com/isiahmeadows/streaming-regexp-proposal
-----

Isiah Meadows
contact at isiahmeadows.com
www.isiahmeadows.com


On Mon, Jul 30, 2018 at 5:48 PM, Isiah Meadows <isiahmeadows at gmail.com> wrote:
> I was thinking in terms of what hooks need overridden (none for the
> base proposal), not where the method itself lived.
> -----
>
> Isiah Meadows
> contact at isiahmeadows.com
> www.isiahmeadows.com
>
>
> On Mon, Jul 30, 2018 at 5:44 PM, Jordan Harband <ljharb at gmail.com> wrote:
>> (It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the
>> right thing for your alternative regex)
>>
>> On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com>
>> wrote:
>>>
>>> Not yet, but if I were, it wouldn't be a subclass (it's not
>>> necessary). But the key trouble implementing this is that I'd have to
>>> reimplement the regexp matching logic entirely from scratch if I want
>>> remotely reasonable runtime complexity. As for why:
>>>
>>> - If you ignore backreferences (which IIUC makes JS regexps go from
>>> regular to context-sensitive recognizers), it's *possible* to
>>> implement partial matching using regexp rewriting, but only because JS
>>> doesn't have any other arcane enough features to make it infeasible.
>>> - There is no possible way to extract duplicate matches without
>>> rewriting the regexp entirely and using `regexp.exec`, and the
>>> complexity of the logic to do this generally I suspect is NP-complete,
>>> maybe PSPACE-complete, and in either case, certainly infeasible when
>>> backreferences enter the picture.
>>> - It's technically possible to refactor a string for streaming, but I
>>> then lose the ability to discern a match from a non-match. I also have
>>> the same issue as per above WRT duplicate matches and partial matching
>>> with complexity issues.
>>> - If you were to combine the three, rewriting the regexp generally to
>>> support streaming matches, including duplicates, I suspect would
>>> likely be PSPACE-complete because it seems *very* similar to the first
>>> problem listed here [1].
>>>
>>> \* Backreferences bring the grammatical complexity of JS regexps from
>>> I'm pretty sure regular to context-sensitive.
>>>
>>> Or in other words, either you control the raw matching logic yourself,
>>> or the polyfill runtime complexity could get absurd for this proposal.
>>> And implementing a pushdown state machine-based regexp engine + parser
>>> in JS isn't exactly something I'm willing to prototype for a strawman.
>>>
>>> [1]:
>>> https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata
>>>
>>> -----
>>>
>>> Isiah Meadows
>>> contact at isiahmeadows.com
>>> www.isiahmeadows.com
>>>
>>>
>>> On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote:
>>> > Have you tried to implement this as a RegExp subclass, overriding all
>>> > the
>>> > necessary extension points?
>>> >
>>> > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com>
>>> > wrote:
>>> >>
>>> >> There's two things I've found that need suspendable matching:
>>> >>
>>> >> 1. Partially matching against a string, which is useful with
>>> >> interactive form validation and such.
>>> >> 2. Pattern matching and replacement over a character stream, which is
>>> >> useful for things like matching against files without loading the
>>> >> entire thing into memory or easier filtering of requests.
>>> >>
>>> >> Also, it'd be nice if there was a facility to get *all* matches,
>>> >> including duplicate group matches. This is often useful for simple
>>> >> parsing, where if such support existed, you could just use a Kleene
>>> >> star instead of the standard `exec` loops (which admittedly get old).
>>> >>
>>> >> And finally, we could avoid setting regexp globals here. That would
>>> >> speed up the matcher quite a bit.
>>> >>
>>> >> So, here's my proposal:
>>> >>
>>> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher.
>>> >> - `matcher.consume(codePoint, charSize?) -> result | undefined` -
>>> >> Consume a Unicode code point or `-1` if no more characters exist, and
>>> >> return a match result, `undefined` if no match occurred. `charSize` is
>>> >> the number of bytes represented by `codePoint` (default: 1-2 if `/u`
>>> >> is set, 1 otherwise), so it can work with other encodings flexibly.
>>> >> - `matcher.nextPossibleStart -> number` - The next possible start the
>>> >> matcher could have, for more effective buffering and stream
>>> >> management. This is implementation-defined, but it *must* be be `-1`
>>> >> after the matcher completes, and it *must* be within [0, N) otherwise,
>>> >> where N is the next returned match.
>>> >> - `result.group -> string | number | undefined` - Return the group
>>> >> index/name of the current match, or `undefined` if it's just issuing a
>>> >> match of the global regexp.
>>> >> - `result.start -> number` - Return the matched value's start index.
>>> >> - `result.end -> number` - Return the matched value's end index.
>>> >> - This does *not* modify any globals or regexp instance members. It
>>> >> only reads `regexp.lastIndex` on creation. (It doesn't operate on
>>> >> strings, so it shouldn't return any it doesn't already have.)
>>> >>
>>> >> Most RegExp methods could similarly be built using this as a base: if
>>> >> they work on strings, they can iterate their code points.
>>> >>
>>> >> As for the various concerns:
>>> >>
>>> >> - Partial matching is just iterating a string's character codes and
>>> >> seeing if the matcher ever returned non-`undefined`.
>>> >> - Streaming pattern matching is pretty obvious from just reading the
>>> >> API.
>>> >> - Getting all matches is just iterating the string and returning an
>>> >> object with all the groups + strings it matched.
>>> >>
>>> >> So WDYT?
>>> >>
>>> >> /cc Mathias Bynens, since I know you're involved in this kind of
>>> >> text-heavy stuff.
>>> >>
>>> >> -----
>>> >>
>>> >> Isiah Meadows
>>> >> contact at isiahmeadows.com
>>> >> www.isiahmeadows.com
>>> >> _______________________________________________
>>> >> es-discuss mailing list
>>> >> es-discuss at mozilla.org
>>> >> https://mail.mozilla.org/listinfo/es-discuss
>>> >
>>> >
>>
>>