Streaming regexp matching
Have you tried to implement this as a RegExp subclass, overriding all the necessary extension points?
Have you tried to implement this as a RegExp subclass, overriding all the necessary extension points? On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com> wrote: > There's two things I've found that need suspendable matching: > > 1. Partially matching against a string, which is useful with > interactive form validation and such. > 2. Pattern matching and replacement over a character stream, which is > useful for things like matching against files without loading the > entire thing into memory or easier filtering of requests. > > Also, it'd be nice if there was a facility to get *all* matches, > including duplicate group matches. This is often useful for simple > parsing, where if such support existed, you could just use a Kleene > star instead of the standard `exec` loops (which admittedly get old). > > And finally, we could avoid setting regexp globals here. That would > speed up the matcher quite a bit. > > So, here's my proposal: > > - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. > - `matcher.consume(codePoint, charSize?) -> result | undefined` - > Consume a Unicode code point or `-1` if no more characters exist, and > return a match result, `undefined` if no match occurred. `charSize` is > the number of bytes represented by `codePoint` (default: 1-2 if `/u` > is set, 1 otherwise), so it can work with other encodings flexibly. > - `matcher.nextPossibleStart -> number` - The next possible start the > matcher could have, for more effective buffering and stream > management. This is implementation-defined, but it *must* be be `-1` > after the matcher completes, and it *must* be within [0, N) otherwise, > where N is the next returned match. > - `result.group -> string | number | undefined` - Return the group > index/name of the current match, or `undefined` if it's just issuing a > match of the global regexp. > - `result.start -> number` - Return the matched value's start index. > - `result.end -> number` - Return the matched value's end index. > - This does *not* modify any globals or regexp instance members. It > only reads `regexp.lastIndex` on creation. (It doesn't operate on > strings, so it shouldn't return any it doesn't already have.) > > Most RegExp methods could similarly be built using this as a base: if > they work on strings, they can iterate their code points. > > As for the various concerns: > > - Partial matching is just iterating a string's character codes and > seeing if the matcher ever returned non-`undefined`. > - Streaming pattern matching is pretty obvious from just reading the API. > - Getting all matches is just iterating the string and returning an > object with all the groups + strings it matched. > > So WDYT? > > /cc Mathias Bynens, since I know you're involved in this kind of > text-heavy stuff. > > ----- > > Isiah Meadows > contact at isiahmeadows.com > www.isiahmeadows.com > _______________________________________________ > es-discuss mailing list > es-discuss at mozilla.org > https://mail.mozilla.org/listinfo/es-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20180730/91b610aa/attachment.html>
I'll say I have at least a few times desired a way to stream input to RegExp. To the point I fiddled around implementing a finite state machine.
My use cases:
- short circuiting on nth capture
- parsing streams
I'll say I have at least a few times desired a way to stream input to RegExp. To the point I fiddled around implementing a finite state machine. My use cases: 1. short circuiting on nth capture 2. parsing streams On Mon, Jul 30, 2018 at 2:39 PM, Isiah Meadows <isiahmeadows at gmail.com> wrote: > There's two things I've found that need suspendable matching: > > 1. Partially matching against a string, which is useful with > interactive form validation and such. > 2. Pattern matching and replacement over a character stream, which is > useful for things like matching against files without loading the > entire thing into memory or easier filtering of requests. > > Also, it'd be nice if there was a facility to get *all* matches, > including duplicate group matches. This is often useful for simple > parsing, where if such support existed, you could just use a Kleene > star instead of the standard `exec` loops (which admittedly get old). > > And finally, we could avoid setting regexp globals here. That would > speed up the matcher quite a bit. > > So, here's my proposal: > > - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. > - `matcher.consume(codePoint, charSize?) -> result | undefined` - > Consume a Unicode code point or `-1` if no more characters exist, and > return a match result, `undefined` if no match occurred. `charSize` is > the number of bytes represented by `codePoint` (default: 1-2 if `/u` > is set, 1 otherwise), so it can work with other encodings flexibly. > - `matcher.nextPossibleStart -> number` - The next possible start the > matcher could have, for more effective buffering and stream > management. This is implementation-defined, but it *must* be be `-1` > after the matcher completes, and it *must* be within [0, N) otherwise, > where N is the next returned match. > - `result.group -> string | number | undefined` - Return the group > index/name of the current match, or `undefined` if it's just issuing a > match of the global regexp. > - `result.start -> number` - Return the matched value's start index. > - `result.end -> number` - Return the matched value's end index. > - This does *not* modify any globals or regexp instance members. It > only reads `regexp.lastIndex` on creation. (It doesn't operate on > strings, so it shouldn't return any it doesn't already have.) > > Most RegExp methods could similarly be built using this as a base: if > they work on strings, they can iterate their code points. > > As for the various concerns: > > - Partial matching is just iterating a string's character codes and > seeing if the matcher ever returned non-`undefined`. > - Streaming pattern matching is pretty obvious from just reading the API. > - Getting all matches is just iterating the string and returning an > object with all the groups + strings it matched. > > So WDYT? > > /cc Mathias Bynens, since I know you're involved in this kind of > text-heavy stuff. > > ----- > > Isiah Meadows > contact at isiahmeadows.com > www.isiahmeadows.com > _______________________________________________ > es-discuss mailing list > es-discuss at mozilla.org > https://mail.mozilla.org/listinfo/es-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20180730/2d3beb55/attachment-0001.html>
Not yet, but if I were, it wouldn't be a subclass (it's not necessary). But the key trouble implementing this is that I'd have to reimplement the regexp matching logic entirely from scratch if I want remotely reasonable runtime complexity. As for why:
- If you ignore backreferences (which IIUC makes JS regexps go from regular to context-sensitive recognizers), it's possible to implement partial matching using regexp rewriting, but only because JS doesn't have any other arcane enough features to make it infeasible.
- There is no possible way to extract duplicate matches without
rewriting the regexp entirely and using
regexp.exec
, and the complexity of the logic to do this generally I suspect is NP-complete, maybe PSPACE-complete, and in either case, certainly infeasible when backreferences enter the picture. - It's technically possible to refactor a string for streaming, but I then lose the ability to discern a match from a non-match. I also have the same issue as per above WRT duplicate matches and partial matching with complexity issues.
- If you were to combine the three, rewriting the regexp generally to support streaming matches, including duplicates, I suspect would likely be PSPACE-complete because it seems very similar to the first problem listed here 1.
* Backreferences bring the grammatical complexity of JS regexps from I'm pretty sure regular to context-sensitive.
Or in other words, either you control the raw matching logic yourself, or the polyfill runtime complexity could get absurd for this proposal. And implementing a pushdown state machine-based regexp engine + parser in JS isn't exactly something I'm willing to prototype for a strawman.
Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com
Not yet, but if I were, it wouldn't be a subclass (it's not necessary). But the key trouble implementing this is that I'd have to reimplement the regexp matching logic entirely from scratch if I want remotely reasonable runtime complexity. As for why: - If you ignore backreferences (which IIUC makes JS regexps go from regular to context-sensitive recognizers), it's *possible* to implement partial matching using regexp rewriting, but only because JS doesn't have any other arcane enough features to make it infeasible. - There is no possible way to extract duplicate matches without rewriting the regexp entirely and using `regexp.exec`, and the complexity of the logic to do this generally I suspect is NP-complete, maybe PSPACE-complete, and in either case, certainly infeasible when backreferences enter the picture. - It's technically possible to refactor a string for streaming, but I then lose the ability to discern a match from a non-match. I also have the same issue as per above WRT duplicate matches and partial matching with complexity issues. - If you were to combine the three, rewriting the regexp generally to support streaming matches, including duplicates, I suspect would likely be PSPACE-complete because it seems *very* similar to the first problem listed here [1]. \* Backreferences bring the grammatical complexity of JS regexps from I'm pretty sure regular to context-sensitive. Or in other words, either you control the raw matching logic yourself, or the polyfill runtime complexity could get absurd for this proposal. And implementing a pushdown state machine-based regexp engine + parser in JS isn't exactly something I'm willing to prototype for a strawman. [1]: https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata ----- Isiah Meadows contact at isiahmeadows.com www.isiahmeadows.com On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote: > Have you tried to implement this as a RegExp subclass, overriding all the > necessary extension points? > > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com> > wrote: >> >> There's two things I've found that need suspendable matching: >> >> 1. Partially matching against a string, which is useful with >> interactive form validation and such. >> 2. Pattern matching and replacement over a character stream, which is >> useful for things like matching against files without loading the >> entire thing into memory or easier filtering of requests. >> >> Also, it'd be nice if there was a facility to get *all* matches, >> including duplicate group matches. This is often useful for simple >> parsing, where if such support existed, you could just use a Kleene >> star instead of the standard `exec` loops (which admittedly get old). >> >> And finally, we could avoid setting regexp globals here. That would >> speed up the matcher quite a bit. >> >> So, here's my proposal: >> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. >> - `matcher.consume(codePoint, charSize?) -> result | undefined` - >> Consume a Unicode code point or `-1` if no more characters exist, and >> return a match result, `undefined` if no match occurred. `charSize` is >> the number of bytes represented by `codePoint` (default: 1-2 if `/u` >> is set, 1 otherwise), so it can work with other encodings flexibly. >> - `matcher.nextPossibleStart -> number` - The next possible start the >> matcher could have, for more effective buffering and stream >> management. This is implementation-defined, but it *must* be be `-1` >> after the matcher completes, and it *must* be within [0, N) otherwise, >> where N is the next returned match. >> - `result.group -> string | number | undefined` - Return the group >> index/name of the current match, or `undefined` if it's just issuing a >> match of the global regexp. >> - `result.start -> number` - Return the matched value's start index. >> - `result.end -> number` - Return the matched value's end index. >> - This does *not* modify any globals or regexp instance members. It >> only reads `regexp.lastIndex` on creation. (It doesn't operate on >> strings, so it shouldn't return any it doesn't already have.) >> >> Most RegExp methods could similarly be built using this as a base: if >> they work on strings, they can iterate their code points. >> >> As for the various concerns: >> >> - Partial matching is just iterating a string's character codes and >> seeing if the matcher ever returned non-`undefined`. >> - Streaming pattern matching is pretty obvious from just reading the API. >> - Getting all matches is just iterating the string and returning an >> object with all the groups + strings it matched. >> >> So WDYT? >> >> /cc Mathias Bynens, since I know you're involved in this kind of >> text-heavy stuff. >> >> ----- >> >> Isiah Meadows >> contact at isiahmeadows.com >> www.isiahmeadows.com >> _______________________________________________ >> es-discuss mailing list >> es-discuss at mozilla.org >> https://mail.mozilla.org/listinfo/es-discuss > >
(It'd have to be a subclass for RegExp.prototype.whatever.call
to do the
right thing for your alternative regex)
(It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the right thing for your alternative regex) On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com> wrote: > Not yet, but if I were, it wouldn't be a subclass (it's not > necessary). But the key trouble implementing this is that I'd have to > reimplement the regexp matching logic entirely from scratch if I want > remotely reasonable runtime complexity. As for why: > > - If you ignore backreferences (which IIUC makes JS regexps go from > regular to context-sensitive recognizers), it's *possible* to > implement partial matching using regexp rewriting, but only because JS > doesn't have any other arcane enough features to make it infeasible. > - There is no possible way to extract duplicate matches without > rewriting the regexp entirely and using `regexp.exec`, and the > complexity of the logic to do this generally I suspect is NP-complete, > maybe PSPACE-complete, and in either case, certainly infeasible when > backreferences enter the picture. > - It's technically possible to refactor a string for streaming, but I > then lose the ability to discern a match from a non-match. I also have > the same issue as per above WRT duplicate matches and partial matching > with complexity issues. > - If you were to combine the three, rewriting the regexp generally to > support streaming matches, including duplicates, I suspect would > likely be PSPACE-complete because it seems *very* similar to the first > problem listed here [1]. > > \* Backreferences bring the grammatical complexity of JS regexps from > I'm pretty sure regular to context-sensitive. > > Or in other words, either you control the raw matching logic yourself, > or the polyfill runtime complexity could get absurd for this proposal. > And implementing a pushdown state machine-based regexp engine + parser > in JS isn't exactly something I'm willing to prototype for a strawman. > > [1]: https://en.wikipedia.org/wiki/PSPACE-complete#Regular_ > expressions_and_automata > > ----- > > Isiah Meadows > contact at isiahmeadows.com > www.isiahmeadows.com > > > On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote: > > Have you tried to implement this as a RegExp subclass, overriding all the > > necessary extension points? > > > > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com> > > wrote: > >> > >> There's two things I've found that need suspendable matching: > >> > >> 1. Partially matching against a string, which is useful with > >> interactive form validation and such. > >> 2. Pattern matching and replacement over a character stream, which is > >> useful for things like matching against files without loading the > >> entire thing into memory or easier filtering of requests. > >> > >> Also, it'd be nice if there was a facility to get *all* matches, > >> including duplicate group matches. This is often useful for simple > >> parsing, where if such support existed, you could just use a Kleene > >> star instead of the standard `exec` loops (which admittedly get old). > >> > >> And finally, we could avoid setting regexp globals here. That would > >> speed up the matcher quite a bit. > >> > >> So, here's my proposal: > >> > >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. > >> - `matcher.consume(codePoint, charSize?) -> result | undefined` - > >> Consume a Unicode code point or `-1` if no more characters exist, and > >> return a match result, `undefined` if no match occurred. `charSize` is > >> the number of bytes represented by `codePoint` (default: 1-2 if `/u` > >> is set, 1 otherwise), so it can work with other encodings flexibly. > >> - `matcher.nextPossibleStart -> number` - The next possible start the > >> matcher could have, for more effective buffering and stream > >> management. This is implementation-defined, but it *must* be be `-1` > >> after the matcher completes, and it *must* be within [0, N) otherwise, > >> where N is the next returned match. > >> - `result.group -> string | number | undefined` - Return the group > >> index/name of the current match, or `undefined` if it's just issuing a > >> match of the global regexp. > >> - `result.start -> number` - Return the matched value's start index. > >> - `result.end -> number` - Return the matched value's end index. > >> - This does *not* modify any globals or regexp instance members. It > >> only reads `regexp.lastIndex` on creation. (It doesn't operate on > >> strings, so it shouldn't return any it doesn't already have.) > >> > >> Most RegExp methods could similarly be built using this as a base: if > >> they work on strings, they can iterate their code points. > >> > >> As for the various concerns: > >> > >> - Partial matching is just iterating a string's character codes and > >> seeing if the matcher ever returned non-`undefined`. > >> - Streaming pattern matching is pretty obvious from just reading the > API. > >> - Getting all matches is just iterating the string and returning an > >> object with all the groups + strings it matched. > >> > >> So WDYT? > >> > >> /cc Mathias Bynens, since I know you're involved in this kind of > >> text-heavy stuff. > >> > >> ----- > >> > >> Isiah Meadows > >> contact at isiahmeadows.com > >> www.isiahmeadows.com > >> _______________________________________________ > >> es-discuss mailing list > >> es-discuss at mozilla.org > >> https://mail.mozilla.org/listinfo/es-discuss > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20180730/4ff593f3/attachment.html>
I was thinking in terms of what hooks need overridden (none for the base proposal), not where the method itself lived.
Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com
I was thinking in terms of what hooks need overridden (none for the base proposal), not where the method itself lived. ----- Isiah Meadows contact at isiahmeadows.com www.isiahmeadows.com On Mon, Jul 30, 2018 at 5:44 PM, Jordan Harband <ljharb at gmail.com> wrote: > (It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the > right thing for your alternative regex) > > On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com> > wrote: >> >> Not yet, but if I were, it wouldn't be a subclass (it's not >> necessary). But the key trouble implementing this is that I'd have to >> reimplement the regexp matching logic entirely from scratch if I want >> remotely reasonable runtime complexity. As for why: >> >> - If you ignore backreferences (which IIUC makes JS regexps go from >> regular to context-sensitive recognizers), it's *possible* to >> implement partial matching using regexp rewriting, but only because JS >> doesn't have any other arcane enough features to make it infeasible. >> - There is no possible way to extract duplicate matches without >> rewriting the regexp entirely and using `regexp.exec`, and the >> complexity of the logic to do this generally I suspect is NP-complete, >> maybe PSPACE-complete, and in either case, certainly infeasible when >> backreferences enter the picture. >> - It's technically possible to refactor a string for streaming, but I >> then lose the ability to discern a match from a non-match. I also have >> the same issue as per above WRT duplicate matches and partial matching >> with complexity issues. >> - If you were to combine the three, rewriting the regexp generally to >> support streaming matches, including duplicates, I suspect would >> likely be PSPACE-complete because it seems *very* similar to the first >> problem listed here [1]. >> >> \* Backreferences bring the grammatical complexity of JS regexps from >> I'm pretty sure regular to context-sensitive. >> >> Or in other words, either you control the raw matching logic yourself, >> or the polyfill runtime complexity could get absurd for this proposal. >> And implementing a pushdown state machine-based regexp engine + parser >> in JS isn't exactly something I'm willing to prototype for a strawman. >> >> [1]: >> https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata >> >> ----- >> >> Isiah Meadows >> contact at isiahmeadows.com >> www.isiahmeadows.com >> >> >> On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote: >> > Have you tried to implement this as a RegExp subclass, overriding all >> > the >> > necessary extension points? >> > >> > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com> >> > wrote: >> >> >> >> There's two things I've found that need suspendable matching: >> >> >> >> 1. Partially matching against a string, which is useful with >> >> interactive form validation and such. >> >> 2. Pattern matching and replacement over a character stream, which is >> >> useful for things like matching against files without loading the >> >> entire thing into memory or easier filtering of requests. >> >> >> >> Also, it'd be nice if there was a facility to get *all* matches, >> >> including duplicate group matches. This is often useful for simple >> >> parsing, where if such support existed, you could just use a Kleene >> >> star instead of the standard `exec` loops (which admittedly get old). >> >> >> >> And finally, we could avoid setting regexp globals here. That would >> >> speed up the matcher quite a bit. >> >> >> >> So, here's my proposal: >> >> >> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. >> >> - `matcher.consume(codePoint, charSize?) -> result | undefined` - >> >> Consume a Unicode code point or `-1` if no more characters exist, and >> >> return a match result, `undefined` if no match occurred. `charSize` is >> >> the number of bytes represented by `codePoint` (default: 1-2 if `/u` >> >> is set, 1 otherwise), so it can work with other encodings flexibly. >> >> - `matcher.nextPossibleStart -> number` - The next possible start the >> >> matcher could have, for more effective buffering and stream >> >> management. This is implementation-defined, but it *must* be be `-1` >> >> after the matcher completes, and it *must* be within [0, N) otherwise, >> >> where N is the next returned match. >> >> - `result.group -> string | number | undefined` - Return the group >> >> index/name of the current match, or `undefined` if it's just issuing a >> >> match of the global regexp. >> >> - `result.start -> number` - Return the matched value's start index. >> >> - `result.end -> number` - Return the matched value's end index. >> >> - This does *not* modify any globals or regexp instance members. It >> >> only reads `regexp.lastIndex` on creation. (It doesn't operate on >> >> strings, so it shouldn't return any it doesn't already have.) >> >> >> >> Most RegExp methods could similarly be built using this as a base: if >> >> they work on strings, they can iterate their code points. >> >> >> >> As for the various concerns: >> >> >> >> - Partial matching is just iterating a string's character codes and >> >> seeing if the matcher ever returned non-`undefined`. >> >> - Streaming pattern matching is pretty obvious from just reading the >> >> API. >> >> - Getting all matches is just iterating the string and returning an >> >> object with all the groups + strings it matched. >> >> >> >> So WDYT? >> >> >> >> /cc Mathias Bynens, since I know you're involved in this kind of >> >> text-heavy stuff. >> >> >> >> ----- >> >> >> >> Isiah Meadows >> >> contact at isiahmeadows.com >> >> www.isiahmeadows.com >> >> _______________________________________________ >> >> es-discuss mailing list >> >> es-discuss at mozilla.org >> >> https://mail.mozilla.org/listinfo/es-discuss >> > >> > > >
BTW, I created a rough sketch of a strawman here: isiahmeadows/streaming-regexp-proposal
Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com
BTW, I created a rough sketch of a strawman here: https://github.com/isiahmeadows/streaming-regexp-proposal ----- Isiah Meadows contact at isiahmeadows.com www.isiahmeadows.com On Mon, Jul 30, 2018 at 5:48 PM, Isiah Meadows <isiahmeadows at gmail.com> wrote: > I was thinking in terms of what hooks need overridden (none for the > base proposal), not where the method itself lived. > ----- > > Isiah Meadows > contact at isiahmeadows.com > www.isiahmeadows.com > > > On Mon, Jul 30, 2018 at 5:44 PM, Jordan Harband <ljharb at gmail.com> wrote: >> (It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the >> right thing for your alternative regex) >> >> On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com> >> wrote: >>> >>> Not yet, but if I were, it wouldn't be a subclass (it's not >>> necessary). But the key trouble implementing this is that I'd have to >>> reimplement the regexp matching logic entirely from scratch if I want >>> remotely reasonable runtime complexity. As for why: >>> >>> - If you ignore backreferences (which IIUC makes JS regexps go from >>> regular to context-sensitive recognizers), it's *possible* to >>> implement partial matching using regexp rewriting, but only because JS >>> doesn't have any other arcane enough features to make it infeasible. >>> - There is no possible way to extract duplicate matches without >>> rewriting the regexp entirely and using `regexp.exec`, and the >>> complexity of the logic to do this generally I suspect is NP-complete, >>> maybe PSPACE-complete, and in either case, certainly infeasible when >>> backreferences enter the picture. >>> - It's technically possible to refactor a string for streaming, but I >>> then lose the ability to discern a match from a non-match. I also have >>> the same issue as per above WRT duplicate matches and partial matching >>> with complexity issues. >>> - If you were to combine the three, rewriting the regexp generally to >>> support streaming matches, including duplicates, I suspect would >>> likely be PSPACE-complete because it seems *very* similar to the first >>> problem listed here [1]. >>> >>> \* Backreferences bring the grammatical complexity of JS regexps from >>> I'm pretty sure regular to context-sensitive. >>> >>> Or in other words, either you control the raw matching logic yourself, >>> or the polyfill runtime complexity could get absurd for this proposal. >>> And implementing a pushdown state machine-based regexp engine + parser >>> in JS isn't exactly something I'm willing to prototype for a strawman. >>> >>> [1]: >>> https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata >>> >>> ----- >>> >>> Isiah Meadows >>> contact at isiahmeadows.com >>> www.isiahmeadows.com >>> >>> >>> On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote: >>> > Have you tried to implement this as a RegExp subclass, overriding all >>> > the >>> > necessary extension points? >>> > >>> > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com> >>> > wrote: >>> >> >>> >> There's two things I've found that need suspendable matching: >>> >> >>> >> 1. Partially matching against a string, which is useful with >>> >> interactive form validation and such. >>> >> 2. Pattern matching and replacement over a character stream, which is >>> >> useful for things like matching against files without loading the >>> >> entire thing into memory or easier filtering of requests. >>> >> >>> >> Also, it'd be nice if there was a facility to get *all* matches, >>> >> including duplicate group matches. This is often useful for simple >>> >> parsing, where if such support existed, you could just use a Kleene >>> >> star instead of the standard `exec` loops (which admittedly get old). >>> >> >>> >> And finally, we could avoid setting regexp globals here. That would >>> >> speed up the matcher quite a bit. >>> >> >>> >> So, here's my proposal: >>> >> >>> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. >>> >> - `matcher.consume(codePoint, charSize?) -> result | undefined` - >>> >> Consume a Unicode code point or `-1` if no more characters exist, and >>> >> return a match result, `undefined` if no match occurred. `charSize` is >>> >> the number of bytes represented by `codePoint` (default: 1-2 if `/u` >>> >> is set, 1 otherwise), so it can work with other encodings flexibly. >>> >> - `matcher.nextPossibleStart -> number` - The next possible start the >>> >> matcher could have, for more effective buffering and stream >>> >> management. This is implementation-defined, but it *must* be be `-1` >>> >> after the matcher completes, and it *must* be within [0, N) otherwise, >>> >> where N is the next returned match. >>> >> - `result.group -> string | number | undefined` - Return the group >>> >> index/name of the current match, or `undefined` if it's just issuing a >>> >> match of the global regexp. >>> >> - `result.start -> number` - Return the matched value's start index. >>> >> - `result.end -> number` - Return the matched value's end index. >>> >> - This does *not* modify any globals or regexp instance members. It >>> >> only reads `regexp.lastIndex` on creation. (It doesn't operate on >>> >> strings, so it shouldn't return any it doesn't already have.) >>> >> >>> >> Most RegExp methods could similarly be built using this as a base: if >>> >> they work on strings, they can iterate their code points. >>> >> >>> >> As for the various concerns: >>> >> >>> >> - Partial matching is just iterating a string's character codes and >>> >> seeing if the matcher ever returned non-`undefined`. >>> >> - Streaming pattern matching is pretty obvious from just reading the >>> >> API. >>> >> - Getting all matches is just iterating the string and returning an >>> >> object with all the groups + strings it matched. >>> >> >>> >> So WDYT? >>> >> >>> >> /cc Mathias Bynens, since I know you're involved in this kind of >>> >> text-heavy stuff. >>> >> >>> >> ----- >>> >> >>> >> Isiah Meadows >>> >> contact at isiahmeadows.com >>> >> www.isiahmeadows.com >>> >> _______________________________________________ >>> >> es-discuss mailing list >>> >> es-discuss at mozilla.org >>> >> https://mail.mozilla.org/listinfo/es-discuss >>> > >>> > >> >>
There's two things I've found that need suspendable matching:
Also, it'd be nice if there was a facility to get all matches, including duplicate group matches. This is often useful for simple parsing, where if such support existed, you could just use a Kleene star instead of the standard
exec
loops (which admittedly get old).And finally, we could avoid setting regexp globals here. That would speed up the matcher quite a bit.
So, here's my proposal:
regexp.matcher() -> matcher
- Create a streaming regexp matcher.matcher.consume(codePoint, charSize?) -> result | undefined
-Consume a Unicode code point or
-1
if no more characters exist, and return a match result,undefined
if no match occurred.charSize
is the number of bytes represented bycodePoint
(default: 1-2 if/u
is set, 1 otherwise), so it can work with other encodings flexibly.matcher.nextPossibleStart -> number
- The next possible start thematcher could have, for more effective buffering and stream management. This is implementation-defined, but it must be be
-1
after the matcher completes, and it must be within [0, N) otherwise, where N is the next returned match.result.group -> string | number | undefined
- Return the groupindex/name of the current match, or
undefined
if it's just issuing a match of the global regexp.result.start -> number
- Return the matched value's start index.result.end -> number
- Return the matched value's end index.regexp.lastIndex
on creation. (It doesn't operate on strings, so it shouldn't return any it doesn't already have.)Most RegExp methods could similarly be built using this as a base: if they work on strings, they can iterate their code points.
As for the various concerns:
undefined
.So WDYT?
/cc Mathias Bynens, since I know you're involved in this kind of text-heavy stuff.
Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com