Streaming regexp matching
Have you tried to implement this as a RegExp subclass, overriding all the necessary extension points?
Have you tried to implement this as a RegExp subclass, overriding all the necessary extension points? On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com> wrote: > There's two things I've found that need suspendable matching: > > 1. Partially matching against a string, which is useful with > interactive form validation and such. > 2. Pattern matching and replacement over a character stream, which is > useful for things like matching against files without loading the > entire thing into memory or easier filtering of requests. > > Also, it'd be nice if there was a facility to get *all* matches, > including duplicate group matches. This is often useful for simple > parsing, where if such support existed, you could just use a Kleene > star instead of the standard `exec` loops (which admittedly get old). > > And finally, we could avoid setting regexp globals here. That would > speed up the matcher quite a bit. > > So, here's my proposal: > > - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. > - `matcher.consume(codePoint, charSize?) -> result | undefined` - > Consume a Unicode code point or `-1` if no more characters exist, and > return a match result, `undefined` if no match occurred. `charSize` is > the number of bytes represented by `codePoint` (default: 1-2 if `/u` > is set, 1 otherwise), so it can work with other encodings flexibly. > - `matcher.nextPossibleStart -> number` - The next possible start the > matcher could have, for more effective buffering and stream > management. This is implementation-defined, but it *must* be be `-1` > after the matcher completes, and it *must* be within [0, N) otherwise, > where N is the next returned match. > - `result.group -> string | number | undefined` - Return the group > index/name of the current match, or `undefined` if it's just issuing a > match of the global regexp. > - `result.start -> number` - Return the matched value's start index. > - `result.end -> number` - Return the matched value's end index. > - This does *not* modify any globals or regexp instance members. It > only reads `regexp.lastIndex` on creation. (It doesn't operate on > strings, so it shouldn't return any it doesn't already have.) > > Most RegExp methods could similarly be built using this as a base: if > they work on strings, they can iterate their code points. > > As for the various concerns: > > - Partial matching is just iterating a string's character codes and > seeing if the matcher ever returned non-`undefined`. > - Streaming pattern matching is pretty obvious from just reading the API. > - Getting all matches is just iterating the string and returning an > object with all the groups + strings it matched. > > So WDYT? > > /cc Mathias Bynens, since I know you're involved in this kind of > text-heavy stuff. > > ----- > > Isiah Meadows > contact at isiahmeadows.com > www.isiahmeadows.com > _______________________________________________ > es-discuss mailing list > es-discuss at mozilla.org > https://mail.mozilla.org/listinfo/es-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20180730/91b610aa/attachment.html>
I'll say I have at least a few times desired a way to stream input to RegExp. To the point I fiddled around implementing a finite state machine.
My use cases:
- short circuiting on nth capture
- parsing streams
I'll say I have at least a few times desired a way to stream input to RegExp. To the point I fiddled around implementing a finite state machine. My use cases: 1. short circuiting on nth capture 2. parsing streams On Mon, Jul 30, 2018 at 2:39 PM, Isiah Meadows <isiahmeadows at gmail.com> wrote: > There's two things I've found that need suspendable matching: > > 1. Partially matching against a string, which is useful with > interactive form validation and such. > 2. Pattern matching and replacement over a character stream, which is > useful for things like matching against files without loading the > entire thing into memory or easier filtering of requests. > > Also, it'd be nice if there was a facility to get *all* matches, > including duplicate group matches. This is often useful for simple > parsing, where if such support existed, you could just use a Kleene > star instead of the standard `exec` loops (which admittedly get old). > > And finally, we could avoid setting regexp globals here. That would > speed up the matcher quite a bit. > > So, here's my proposal: > > - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. > - `matcher.consume(codePoint, charSize?) -> result | undefined` - > Consume a Unicode code point or `-1` if no more characters exist, and > return a match result, `undefined` if no match occurred. `charSize` is > the number of bytes represented by `codePoint` (default: 1-2 if `/u` > is set, 1 otherwise), so it can work with other encodings flexibly. > - `matcher.nextPossibleStart -> number` - The next possible start the > matcher could have, for more effective buffering and stream > management. This is implementation-defined, but it *must* be be `-1` > after the matcher completes, and it *must* be within [0, N) otherwise, > where N is the next returned match. > - `result.group -> string | number | undefined` - Return the group > index/name of the current match, or `undefined` if it's just issuing a > match of the global regexp. > - `result.start -> number` - Return the matched value's start index. > - `result.end -> number` - Return the matched value's end index. > - This does *not* modify any globals or regexp instance members. It > only reads `regexp.lastIndex` on creation. (It doesn't operate on > strings, so it shouldn't return any it doesn't already have.) > > Most RegExp methods could similarly be built using this as a base: if > they work on strings, they can iterate their code points. > > As for the various concerns: > > - Partial matching is just iterating a string's character codes and > seeing if the matcher ever returned non-`undefined`. > - Streaming pattern matching is pretty obvious from just reading the API. > - Getting all matches is just iterating the string and returning an > object with all the groups + strings it matched. > > So WDYT? > > /cc Mathias Bynens, since I know you're involved in this kind of > text-heavy stuff. > > ----- > > Isiah Meadows > contact at isiahmeadows.com > www.isiahmeadows.com > _______________________________________________ > es-discuss mailing list > es-discuss at mozilla.org > https://mail.mozilla.org/listinfo/es-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20180730/2d3beb55/attachment-0001.html>
Not yet, but if I were, it wouldn't be a subclass (it's not necessary). But the key trouble implementing this is that I'd have to reimplement the regexp matching logic entirely from scratch if I want remotely reasonable runtime complexity. As for why:
- If you ignore backreferences (which IIUC makes JS regexps go from regular to context-sensitive recognizers), it's possible to implement partial matching using regexp rewriting, but only because JS doesn't have any other arcane enough features to make it infeasible.
- There is no possible way to extract duplicate matches without
rewriting the regexp entirely and using
regexp.exec, and the complexity of the logic to do this generally I suspect is NP-complete, maybe PSPACE-complete, and in either case, certainly infeasible when backreferences enter the picture. - It's technically possible to refactor a string for streaming, but I then lose the ability to discern a match from a non-match. I also have the same issue as per above WRT duplicate matches and partial matching with complexity issues.
- If you were to combine the three, rewriting the regexp generally to support streaming matches, including duplicates, I suspect would likely be PSPACE-complete because it seems very similar to the first problem listed here 1.
* Backreferences bring the grammatical complexity of JS regexps from I'm pretty sure regular to context-sensitive.
Or in other words, either you control the raw matching logic yourself, or the polyfill runtime complexity could get absurd for this proposal. And implementing a pushdown state machine-based regexp engine + parser in JS isn't exactly something I'm willing to prototype for a strawman.
Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com
Not yet, but if I were, it wouldn't be a subclass (it's not necessary). But the key trouble implementing this is that I'd have to reimplement the regexp matching logic entirely from scratch if I want remotely reasonable runtime complexity. As for why: - If you ignore backreferences (which IIUC makes JS regexps go from regular to context-sensitive recognizers), it's *possible* to implement partial matching using regexp rewriting, but only because JS doesn't have any other arcane enough features to make it infeasible. - There is no possible way to extract duplicate matches without rewriting the regexp entirely and using `regexp.exec`, and the complexity of the logic to do this generally I suspect is NP-complete, maybe PSPACE-complete, and in either case, certainly infeasible when backreferences enter the picture. - It's technically possible to refactor a string for streaming, but I then lose the ability to discern a match from a non-match. I also have the same issue as per above WRT duplicate matches and partial matching with complexity issues. - If you were to combine the three, rewriting the regexp generally to support streaming matches, including duplicates, I suspect would likely be PSPACE-complete because it seems *very* similar to the first problem listed here [1]. \* Backreferences bring the grammatical complexity of JS regexps from I'm pretty sure regular to context-sensitive. Or in other words, either you control the raw matching logic yourself, or the polyfill runtime complexity could get absurd for this proposal. And implementing a pushdown state machine-based regexp engine + parser in JS isn't exactly something I'm willing to prototype for a strawman. [1]: https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata ----- Isiah Meadows contact at isiahmeadows.com www.isiahmeadows.com On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote: > Have you tried to implement this as a RegExp subclass, overriding all the > necessary extension points? > > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com> > wrote: >> >> There's two things I've found that need suspendable matching: >> >> 1. Partially matching against a string, which is useful with >> interactive form validation and such. >> 2. Pattern matching and replacement over a character stream, which is >> useful for things like matching against files without loading the >> entire thing into memory or easier filtering of requests. >> >> Also, it'd be nice if there was a facility to get *all* matches, >> including duplicate group matches. This is often useful for simple >> parsing, where if such support existed, you could just use a Kleene >> star instead of the standard `exec` loops (which admittedly get old). >> >> And finally, we could avoid setting regexp globals here. That would >> speed up the matcher quite a bit. >> >> So, here's my proposal: >> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. >> - `matcher.consume(codePoint, charSize?) -> result | undefined` - >> Consume a Unicode code point or `-1` if no more characters exist, and >> return a match result, `undefined` if no match occurred. `charSize` is >> the number of bytes represented by `codePoint` (default: 1-2 if `/u` >> is set, 1 otherwise), so it can work with other encodings flexibly. >> - `matcher.nextPossibleStart -> number` - The next possible start the >> matcher could have, for more effective buffering and stream >> management. This is implementation-defined, but it *must* be be `-1` >> after the matcher completes, and it *must* be within [0, N) otherwise, >> where N is the next returned match. >> - `result.group -> string | number | undefined` - Return the group >> index/name of the current match, or `undefined` if it's just issuing a >> match of the global regexp. >> - `result.start -> number` - Return the matched value's start index. >> - `result.end -> number` - Return the matched value's end index. >> - This does *not* modify any globals or regexp instance members. It >> only reads `regexp.lastIndex` on creation. (It doesn't operate on >> strings, so it shouldn't return any it doesn't already have.) >> >> Most RegExp methods could similarly be built using this as a base: if >> they work on strings, they can iterate their code points. >> >> As for the various concerns: >> >> - Partial matching is just iterating a string's character codes and >> seeing if the matcher ever returned non-`undefined`. >> - Streaming pattern matching is pretty obvious from just reading the API. >> - Getting all matches is just iterating the string and returning an >> object with all the groups + strings it matched. >> >> So WDYT? >> >> /cc Mathias Bynens, since I know you're involved in this kind of >> text-heavy stuff. >> >> ----- >> >> Isiah Meadows >> contact at isiahmeadows.com >> www.isiahmeadows.com >> _______________________________________________ >> es-discuss mailing list >> es-discuss at mozilla.org >> https://mail.mozilla.org/listinfo/es-discuss > >
(It'd have to be a subclass for RegExp.prototype.whatever.call to do the
right thing for your alternative regex)
(It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the right thing for your alternative regex) On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com> wrote: > Not yet, but if I were, it wouldn't be a subclass (it's not > necessary). But the key trouble implementing this is that I'd have to > reimplement the regexp matching logic entirely from scratch if I want > remotely reasonable runtime complexity. As for why: > > - If you ignore backreferences (which IIUC makes JS regexps go from > regular to context-sensitive recognizers), it's *possible* to > implement partial matching using regexp rewriting, but only because JS > doesn't have any other arcane enough features to make it infeasible. > - There is no possible way to extract duplicate matches without > rewriting the regexp entirely and using `regexp.exec`, and the > complexity of the logic to do this generally I suspect is NP-complete, > maybe PSPACE-complete, and in either case, certainly infeasible when > backreferences enter the picture. > - It's technically possible to refactor a string for streaming, but I > then lose the ability to discern a match from a non-match. I also have > the same issue as per above WRT duplicate matches and partial matching > with complexity issues. > - If you were to combine the three, rewriting the regexp generally to > support streaming matches, including duplicates, I suspect would > likely be PSPACE-complete because it seems *very* similar to the first > problem listed here [1]. > > \* Backreferences bring the grammatical complexity of JS regexps from > I'm pretty sure regular to context-sensitive. > > Or in other words, either you control the raw matching logic yourself, > or the polyfill runtime complexity could get absurd for this proposal. > And implementing a pushdown state machine-based regexp engine + parser > in JS isn't exactly something I'm willing to prototype for a strawman. > > [1]: https://en.wikipedia.org/wiki/PSPACE-complete#Regular_ > expressions_and_automata > > ----- > > Isiah Meadows > contact at isiahmeadows.com > www.isiahmeadows.com > > > On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote: > > Have you tried to implement this as a RegExp subclass, overriding all the > > necessary extension points? > > > > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com> > > wrote: > >> > >> There's two things I've found that need suspendable matching: > >> > >> 1. Partially matching against a string, which is useful with > >> interactive form validation and such. > >> 2. Pattern matching and replacement over a character stream, which is > >> useful for things like matching against files without loading the > >> entire thing into memory or easier filtering of requests. > >> > >> Also, it'd be nice if there was a facility to get *all* matches, > >> including duplicate group matches. This is often useful for simple > >> parsing, where if such support existed, you could just use a Kleene > >> star instead of the standard `exec` loops (which admittedly get old). > >> > >> And finally, we could avoid setting regexp globals here. That would > >> speed up the matcher quite a bit. > >> > >> So, here's my proposal: > >> > >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. > >> - `matcher.consume(codePoint, charSize?) -> result | undefined` - > >> Consume a Unicode code point or `-1` if no more characters exist, and > >> return a match result, `undefined` if no match occurred. `charSize` is > >> the number of bytes represented by `codePoint` (default: 1-2 if `/u` > >> is set, 1 otherwise), so it can work with other encodings flexibly. > >> - `matcher.nextPossibleStart -> number` - The next possible start the > >> matcher could have, for more effective buffering and stream > >> management. This is implementation-defined, but it *must* be be `-1` > >> after the matcher completes, and it *must* be within [0, N) otherwise, > >> where N is the next returned match. > >> - `result.group -> string | number | undefined` - Return the group > >> index/name of the current match, or `undefined` if it's just issuing a > >> match of the global regexp. > >> - `result.start -> number` - Return the matched value's start index. > >> - `result.end -> number` - Return the matched value's end index. > >> - This does *not* modify any globals or regexp instance members. It > >> only reads `regexp.lastIndex` on creation. (It doesn't operate on > >> strings, so it shouldn't return any it doesn't already have.) > >> > >> Most RegExp methods could similarly be built using this as a base: if > >> they work on strings, they can iterate their code points. > >> > >> As for the various concerns: > >> > >> - Partial matching is just iterating a string's character codes and > >> seeing if the matcher ever returned non-`undefined`. > >> - Streaming pattern matching is pretty obvious from just reading the > API. > >> - Getting all matches is just iterating the string and returning an > >> object with all the groups + strings it matched. > >> > >> So WDYT? > >> > >> /cc Mathias Bynens, since I know you're involved in this kind of > >> text-heavy stuff. > >> > >> ----- > >> > >> Isiah Meadows > >> contact at isiahmeadows.com > >> www.isiahmeadows.com > >> _______________________________________________ > >> es-discuss mailing list > >> es-discuss at mozilla.org > >> https://mail.mozilla.org/listinfo/es-discuss > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20180730/4ff593f3/attachment.html>
I was thinking in terms of what hooks need overridden (none for the base proposal), not where the method itself lived.
Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com
I was thinking in terms of what hooks need overridden (none for the base proposal), not where the method itself lived. ----- Isiah Meadows contact at isiahmeadows.com www.isiahmeadows.com On Mon, Jul 30, 2018 at 5:44 PM, Jordan Harband <ljharb at gmail.com> wrote: > (It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the > right thing for your alternative regex) > > On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com> > wrote: >> >> Not yet, but if I were, it wouldn't be a subclass (it's not >> necessary). But the key trouble implementing this is that I'd have to >> reimplement the regexp matching logic entirely from scratch if I want >> remotely reasonable runtime complexity. As for why: >> >> - If you ignore backreferences (which IIUC makes JS regexps go from >> regular to context-sensitive recognizers), it's *possible* to >> implement partial matching using regexp rewriting, but only because JS >> doesn't have any other arcane enough features to make it infeasible. >> - There is no possible way to extract duplicate matches without >> rewriting the regexp entirely and using `regexp.exec`, and the >> complexity of the logic to do this generally I suspect is NP-complete, >> maybe PSPACE-complete, and in either case, certainly infeasible when >> backreferences enter the picture. >> - It's technically possible to refactor a string for streaming, but I >> then lose the ability to discern a match from a non-match. I also have >> the same issue as per above WRT duplicate matches and partial matching >> with complexity issues. >> - If you were to combine the three, rewriting the regexp generally to >> support streaming matches, including duplicates, I suspect would >> likely be PSPACE-complete because it seems *very* similar to the first >> problem listed here [1]. >> >> \* Backreferences bring the grammatical complexity of JS regexps from >> I'm pretty sure regular to context-sensitive. >> >> Or in other words, either you control the raw matching logic yourself, >> or the polyfill runtime complexity could get absurd for this proposal. >> And implementing a pushdown state machine-based regexp engine + parser >> in JS isn't exactly something I'm willing to prototype for a strawman. >> >> [1]: >> https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata >> >> ----- >> >> Isiah Meadows >> contact at isiahmeadows.com >> www.isiahmeadows.com >> >> >> On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote: >> > Have you tried to implement this as a RegExp subclass, overriding all >> > the >> > necessary extension points? >> > >> > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com> >> > wrote: >> >> >> >> There's two things I've found that need suspendable matching: >> >> >> >> 1. Partially matching against a string, which is useful with >> >> interactive form validation and such. >> >> 2. Pattern matching and replacement over a character stream, which is >> >> useful for things like matching against files without loading the >> >> entire thing into memory or easier filtering of requests. >> >> >> >> Also, it'd be nice if there was a facility to get *all* matches, >> >> including duplicate group matches. This is often useful for simple >> >> parsing, where if such support existed, you could just use a Kleene >> >> star instead of the standard `exec` loops (which admittedly get old). >> >> >> >> And finally, we could avoid setting regexp globals here. That would >> >> speed up the matcher quite a bit. >> >> >> >> So, here's my proposal: >> >> >> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. >> >> - `matcher.consume(codePoint, charSize?) -> result | undefined` - >> >> Consume a Unicode code point or `-1` if no more characters exist, and >> >> return a match result, `undefined` if no match occurred. `charSize` is >> >> the number of bytes represented by `codePoint` (default: 1-2 if `/u` >> >> is set, 1 otherwise), so it can work with other encodings flexibly. >> >> - `matcher.nextPossibleStart -> number` - The next possible start the >> >> matcher could have, for more effective buffering and stream >> >> management. This is implementation-defined, but it *must* be be `-1` >> >> after the matcher completes, and it *must* be within [0, N) otherwise, >> >> where N is the next returned match. >> >> - `result.group -> string | number | undefined` - Return the group >> >> index/name of the current match, or `undefined` if it's just issuing a >> >> match of the global regexp. >> >> - `result.start -> number` - Return the matched value's start index. >> >> - `result.end -> number` - Return the matched value's end index. >> >> - This does *not* modify any globals or regexp instance members. It >> >> only reads `regexp.lastIndex` on creation. (It doesn't operate on >> >> strings, so it shouldn't return any it doesn't already have.) >> >> >> >> Most RegExp methods could similarly be built using this as a base: if >> >> they work on strings, they can iterate their code points. >> >> >> >> As for the various concerns: >> >> >> >> - Partial matching is just iterating a string's character codes and >> >> seeing if the matcher ever returned non-`undefined`. >> >> - Streaming pattern matching is pretty obvious from just reading the >> >> API. >> >> - Getting all matches is just iterating the string and returning an >> >> object with all the groups + strings it matched. >> >> >> >> So WDYT? >> >> >> >> /cc Mathias Bynens, since I know you're involved in this kind of >> >> text-heavy stuff. >> >> >> >> ----- >> >> >> >> Isiah Meadows >> >> contact at isiahmeadows.com >> >> www.isiahmeadows.com >> >> _______________________________________________ >> >> es-discuss mailing list >> >> es-discuss at mozilla.org >> >> https://mail.mozilla.org/listinfo/es-discuss >> > >> > > >
BTW, I created a rough sketch of a strawman here: isiahmeadows/streaming-regexp-proposal
Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com
BTW, I created a rough sketch of a strawman here: https://github.com/isiahmeadows/streaming-regexp-proposal ----- Isiah Meadows contact at isiahmeadows.com www.isiahmeadows.com On Mon, Jul 30, 2018 at 5:48 PM, Isiah Meadows <isiahmeadows at gmail.com> wrote: > I was thinking in terms of what hooks need overridden (none for the > base proposal), not where the method itself lived. > ----- > > Isiah Meadows > contact at isiahmeadows.com > www.isiahmeadows.com > > > On Mon, Jul 30, 2018 at 5:44 PM, Jordan Harband <ljharb at gmail.com> wrote: >> (It'd have to be a subclass for `RegExp.prototype.whatever.call` to do the >> right thing for your alternative regex) >> >> On Mon, Jul 30, 2018 at 12:36 PM, Isiah Meadows <isiahmeadows at gmail.com> >> wrote: >>> >>> Not yet, but if I were, it wouldn't be a subclass (it's not >>> necessary). But the key trouble implementing this is that I'd have to >>> reimplement the regexp matching logic entirely from scratch if I want >>> remotely reasonable runtime complexity. As for why: >>> >>> - If you ignore backreferences (which IIUC makes JS regexps go from >>> regular to context-sensitive recognizers), it's *possible* to >>> implement partial matching using regexp rewriting, but only because JS >>> doesn't have any other arcane enough features to make it infeasible. >>> - There is no possible way to extract duplicate matches without >>> rewriting the regexp entirely and using `regexp.exec`, and the >>> complexity of the logic to do this generally I suspect is NP-complete, >>> maybe PSPACE-complete, and in either case, certainly infeasible when >>> backreferences enter the picture. >>> - It's technically possible to refactor a string for streaming, but I >>> then lose the ability to discern a match from a non-match. I also have >>> the same issue as per above WRT duplicate matches and partial matching >>> with complexity issues. >>> - If you were to combine the three, rewriting the regexp generally to >>> support streaming matches, including duplicates, I suspect would >>> likely be PSPACE-complete because it seems *very* similar to the first >>> problem listed here [1]. >>> >>> \* Backreferences bring the grammatical complexity of JS regexps from >>> I'm pretty sure regular to context-sensitive. >>> >>> Or in other words, either you control the raw matching logic yourself, >>> or the polyfill runtime complexity could get absurd for this proposal. >>> And implementing a pushdown state machine-based regexp engine + parser >>> in JS isn't exactly something I'm willing to prototype for a strawman. >>> >>> [1]: >>> https://en.wikipedia.org/wiki/PSPACE-complete#Regular_expressions_and_automata >>> >>> ----- >>> >>> Isiah Meadows >>> contact at isiahmeadows.com >>> www.isiahmeadows.com >>> >>> >>> On Mon, Jul 30, 2018 at 2:44 PM, Jordan Harband <ljharb at gmail.com> wrote: >>> > Have you tried to implement this as a RegExp subclass, overriding all >>> > the >>> > necessary extension points? >>> > >>> > On Mon, Jul 30, 2018 at 11:39 AM, Isiah Meadows <isiahmeadows at gmail.com> >>> > wrote: >>> >> >>> >> There's two things I've found that need suspendable matching: >>> >> >>> >> 1. Partially matching against a string, which is useful with >>> >> interactive form validation and such. >>> >> 2. Pattern matching and replacement over a character stream, which is >>> >> useful for things like matching against files without loading the >>> >> entire thing into memory or easier filtering of requests. >>> >> >>> >> Also, it'd be nice if there was a facility to get *all* matches, >>> >> including duplicate group matches. This is often useful for simple >>> >> parsing, where if such support existed, you could just use a Kleene >>> >> star instead of the standard `exec` loops (which admittedly get old). >>> >> >>> >> And finally, we could avoid setting regexp globals here. That would >>> >> speed up the matcher quite a bit. >>> >> >>> >> So, here's my proposal: >>> >> >>> >> - `regexp.matcher() -> matcher` - Create a streaming regexp matcher. >>> >> - `matcher.consume(codePoint, charSize?) -> result | undefined` - >>> >> Consume a Unicode code point or `-1` if no more characters exist, and >>> >> return a match result, `undefined` if no match occurred. `charSize` is >>> >> the number of bytes represented by `codePoint` (default: 1-2 if `/u` >>> >> is set, 1 otherwise), so it can work with other encodings flexibly. >>> >> - `matcher.nextPossibleStart -> number` - The next possible start the >>> >> matcher could have, for more effective buffering and stream >>> >> management. This is implementation-defined, but it *must* be be `-1` >>> >> after the matcher completes, and it *must* be within [0, N) otherwise, >>> >> where N is the next returned match. >>> >> - `result.group -> string | number | undefined` - Return the group >>> >> index/name of the current match, or `undefined` if it's just issuing a >>> >> match of the global regexp. >>> >> - `result.start -> number` - Return the matched value's start index. >>> >> - `result.end -> number` - Return the matched value's end index. >>> >> - This does *not* modify any globals or regexp instance members. It >>> >> only reads `regexp.lastIndex` on creation. (It doesn't operate on >>> >> strings, so it shouldn't return any it doesn't already have.) >>> >> >>> >> Most RegExp methods could similarly be built using this as a base: if >>> >> they work on strings, they can iterate their code points. >>> >> >>> >> As for the various concerns: >>> >> >>> >> - Partial matching is just iterating a string's character codes and >>> >> seeing if the matcher ever returned non-`undefined`. >>> >> - Streaming pattern matching is pretty obvious from just reading the >>> >> API. >>> >> - Getting all matches is just iterating the string and returning an >>> >> object with all the groups + strings it matched. >>> >> >>> >> So WDYT? >>> >> >>> >> /cc Mathias Bynens, since I know you're involved in this kind of >>> >> text-heavy stuff. >>> >> >>> >> ----- >>> >> >>> >> Isiah Meadows >>> >> contact at isiahmeadows.com >>> >> www.isiahmeadows.com >>> >> _______________________________________________ >>> >> es-discuss mailing list >>> >> es-discuss at mozilla.org >>> >> https://mail.mozilla.org/listinfo/es-discuss >>> > >>> > >> >>
There's two things I've found that need suspendable matching:
Also, it'd be nice if there was a facility to get all matches, including duplicate group matches. This is often useful for simple parsing, where if such support existed, you could just use a Kleene star instead of the standard
execloops (which admittedly get old).And finally, we could avoid setting regexp globals here. That would speed up the matcher quite a bit.
So, here's my proposal:
regexp.matcher() -> matcher- Create a streaming regexp matcher.matcher.consume(codePoint, charSize?) -> result | undefined-Consume a Unicode code point or
-1if no more characters exist, and return a match result,undefinedif no match occurred.charSizeis the number of bytes represented bycodePoint(default: 1-2 if/uis set, 1 otherwise), so it can work with other encodings flexibly.matcher.nextPossibleStart -> number- The next possible start thematcher could have, for more effective buffering and stream management. This is implementation-defined, but it must be be
-1after the matcher completes, and it must be within [0, N) otherwise, where N is the next returned match.result.group -> string | number | undefined- Return the groupindex/name of the current match, or
undefinedif it's just issuing a match of the global regexp.result.start -> number- Return the matched value's start index.result.end -> number- Return the matched value's end index.regexp.lastIndexon creation. (It doesn't operate on strings, so it shouldn't return any it doesn't already have.)Most RegExp methods could similarly be built using this as a base: if they work on strings, they can iterate their code points.
As for the various concerns:
undefined.So WDYT?
/cc Mathias Bynens, since I know you're involved in this kind of text-heavy stuff.
Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com