Streaming regexp matching

# Isiah Meadows (18 days ago)

There's two things I've found that need suspendable matching:

  1. Partially matching against a string, which is useful with interactive form validation and such.
  2. Pattern matching and replacement over a character stream, which is useful for things like matching against files without loading the entire thing into memory or easier filtering of requests.

Also, it'd be nice if there was a facility to get all matches, including duplicate group matches. This is often useful for simple parsing, where if such support existed, you could just use a Kleene star instead of the standard exec loops (which admittedly get old).

And finally, we could avoid setting regexp globals here. That would speed up the matcher quite a bit.

So, here's my proposal:

  • regexp.matcher() -> matcher - Create a streaming regexp matcher.
  • matcher.consume(codePoint, charSize?) -> result | undefined -

Consume a Unicode code point or -1 if no more characters exist, and return a match result, undefined if no match occurred. charSize is the number of bytes represented by codePoint (default: 1-2 if /u is set, 1 otherwise), so it can work with other encodings flexibly.

  • matcher.nextPossibleStart -> number - The next possible start the

matcher could have, for more effective buffering and stream management. This is implementation-defined, but it must be be -1 after the matcher completes, and it must be within [0, N) otherwise, where N is the next returned match.

  • result.group -> string | number | undefined - Return the group

index/name of the current match, or undefined if it's just issuing a match of the global regexp.

  • result.start -> number - Return the matched value's start index.
  • result.end -> number - Return the matched value's end index.
  • This does not modify any globals or regexp instance members. It only reads regexp.lastIndex on creation. (It doesn't operate on strings, so it shouldn't return any it doesn't already have.)

Most RegExp methods could similarly be built using this as a base: if they work on strings, they can iterate their code points.

As for the various concerns:

  • Partial matching is just iterating a string's character codes and seeing if the matcher ever returned non-undefined.
  • Streaming pattern matching is pretty obvious from just reading the API.
  • Getting all matches is just iterating the string and returning an object with all the groups + strings it matched.

So WDYT?

/cc Mathias Bynens, since I know you're involved in this kind of text-heavy stuff.


Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com

# Jordan Harband (18 days ago)

Have you tried to implement this as a RegExp subclass, overriding all the necessary extension points?

# Michael Theriot (18 days ago)

I'll say I have at least a few times desired a way to stream input to RegExp. To the point I fiddled around implementing a finite state machine.

My use cases:

  1. short circuiting on nth capture
  2. parsing streams
# Isiah Meadows (18 days ago)

Not yet, but if I were, it wouldn't be a subclass (it's not necessary). But the key trouble implementing this is that I'd have to reimplement the regexp matching logic entirely from scratch if I want remotely reasonable runtime complexity. As for why:

  • If you ignore backreferences (which IIUC makes JS regexps go from regular to context-sensitive recognizers), it's possible to implement partial matching using regexp rewriting, but only because JS doesn't have any other arcane enough features to make it infeasible.
  • There is no possible way to extract duplicate matches without rewriting the regexp entirely and using regexp.exec, and the complexity of the logic to do this generally I suspect is NP-complete, maybe PSPACE-complete, and in either case, certainly infeasible when backreferences enter the picture.
  • It's technically possible to refactor a string for streaming, but I then lose the ability to discern a match from a non-match. I also have the same issue as per above WRT duplicate matches and partial matching with complexity issues.
  • If you were to combine the three, rewriting the regexp generally to support streaming matches, including duplicates, I suspect would likely be PSPACE-complete because it seems very similar to the first problem listed here 1.

* Backreferences bring the grammatical complexity of JS regexps from I'm pretty sure regular to context-sensitive.

Or in other words, either you control the raw matching logic yourself, or the polyfill runtime complexity could get absurd for this proposal. And implementing a pushdown state machine-based regexp engine + parser in JS isn't exactly something I'm willing to prototype for a strawman.


Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com

# Jordan Harband (18 days ago)

(It'd have to be a subclass for RegExp.prototype.whatever.call to do the right thing for your alternative regex)

# Isiah Meadows (18 days ago)

I was thinking in terms of what hooks need overridden (none for the base proposal), not where the method itself lived.

Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com

# Isiah Meadows (18 days ago)

BTW, I created a rough sketch of a strawman here: isiahmeadows/streaming-regexp-proposal

Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com