Inclusion of Unicode character set constants in RegExp (\p{L} ... )

# Hans Schmucker (16 years ago)

Apologies if this isn't the right place or the right format to suggest a modification to ES. I'll try to be as specific as possible to minimize problems if and when this is moved into the spec.

So far, the regular expression functions in ES are based on the assumption that the only language worth matching is English (\w), and that the limited number of special cases (e.g. äöü...) can easily be handled by combining the predefined range with explicitly declared Unicode ranges (e.g. \u1FE0-\u1FEC ...). While this is mostly true for Latin-based languages like German or French, it cannot be applied to languages that do not use the Latin script, like Arabic or Japanese.

This shortcoming causes major issues if international names are to be allowed in ES applications, as they cannot reliably be checked using the built-in RegExp constants. Instead, all ranges have to be specified, which translates to a RegExp around 4000 characters long (using ranges like \u1FE0-\u1FEC; see pastebin.mozilla.org/530081 for the full definition of a regular expression matching the characters in \p{Ll}, \p{Lu}, \p{Lt}, \p{Lm}, \p{Lo} and \p{Nl}). This is highly impractical, as it bloats the code and dramatically increases compile time.
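To make the problem concrete, here is a minimal sketch of the explicit-range approach. The ranges below are a tiny illustrative subset chosen for this example (they are not taken from the pastebin definition), and the name `nameByRanges` is hypothetical.

```javascript
// Hand-maintained character class: ASCII letters plus a few explicit ranges.
// Illustrative subset only; a complete letter table needs hundreds of ranges.
const letterRanges =
  "A-Za-z" +
  "\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF" + // Latin-1 letters (skips the × and ÷ signs)
  "\\u3041-\\u3096";                                // Hiragana

const nameByRanges = new RegExp("^[" + letterRanges + "]+$");

console.log(nameByRanges.test("Müller")); // true
console.log(nameByRanges.test("たなか")); // true
console.log(nameByRanges.test("محمد"));   // false: no Arabic range was listed
```

Every script that should be accepted has to be added by hand, which is how the full expression grows to thousands of characters.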

Recent regular expression environments like Perl, Java or PCRE therefore ship with an extended syntax that allows matching based on a character's Unicode properties. All Unicode characters are flagged with properties based on their role, e.g. uppercase letter (Lu), lowercase letter (Ll) and so on, and can be matched by those properties using the aforementioned syntax (\p{Ll}, \p{Lu}, ...).
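For illustration, this is roughly what property-based matching looks like in ES terms. The sketch assumes an engine that implements the Unicode property escapes later standardized in ES2018, which require the `u` flag; it does not run in engines without that support. The variable names are hypothetical.

```javascript
// Assumption: the engine supports ES2018 Unicode property escapes (needs the "u" flag).
const lowercaseOnly = /^\p{Ll}+$/u;
const uppercaseOnly = /^\p{Lu}+$/u;
// The six categories listed in the proposal, combined into one class.
const letters = /^[\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{Nl}]+$/u;

console.log(lowercaseOnly.test("straße")); // true: ß is a lowercase letter (Ll)
console.log(uppercaseOnly.test("ÄÖÜ"));    // true
console.log(letters.test("محمد"));         // true: Arabic letters are Lo
console.log(letters.test("田中"));         // true: CJK ideographs are Lo
console.log(letters.test("abc123"));       // false: digits are Nd, not letters
```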

The practical value would be enormous, while implementation is relatively trivial if the regular expression library in use already supports this feature, which increases the chances of adoption. Also, a regular expression like the aforementioned pastebin.mozilla.org/530081 provides a fallback for web authors in case the functionality is not available.

The only issue I see so far is that these escapes are already legal in RegExp patterns, but produce a different result. Therefore, in order to avoid undesired behaviour, I'd suggest an additional flag, so that authors can detect support using try/catch blocks and only enable the new behaviour where it is actually used.
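A sketch of both points, the existing meaning of the pattern and flag-based detection with a fallback. It uses the `u` flag as a stand-in for the proposed flag, on the assumption that the engine either implements ES2018 property escapes under `u` or throws a SyntaxError when compiling the pattern. The helpers `supportsPropertyEscapes` and `makeLetterMatcher` are hypothetical names.

```javascript
// Without a flag, \p is an identity escape, so the pattern matches the literal text "p{L}".
const legacy = new RegExp("\\p{L}");
console.log(legacy.test("p{L}")); // true
console.log(legacy.test("ä"));    // false

// Detect support by compiling with the flag; unsupported engines throw a SyntaxError.
function supportsPropertyEscapes() {
  try {
    new RegExp("\\p{L}", "u");
    return true;
  } catch (e) {
    return false;
  }
}

// Use the property escape when available, otherwise an explicit-range class
// supplied by the caller (for example the long pastebin-style definition).
function makeLetterMatcher(fallbackClass) {
  return supportsPropertyEscapes()
    ? new RegExp("^\\p{L}+$", "u")
    : new RegExp("^[" + fallbackClass + "]+$");
}

const isName = makeLetterMatcher("A-Za-z");
console.log(isName.test("Hans")); // true with either branch
```

Because the flag changes how the pattern is parsed, existing flagless expressions keep their current behaviour.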

I think this would be an obvious candidate for Harmony, as it increases functionality in common scenarios without requiring changes to the basic logic of ES.

Hans Schmucker Mannheim Germany

(hansschmucker at gmail.com)

# Hans Schmucker (16 years ago)

I've also filed a bug at bugzilla.mozilla.org, which can be used to discuss this: bugzilla.mozilla.org/show_bug.cgi?id=453554

Hans Schmucker Mannheim Germany

(hansschmucker at gmail.com)