\u0085 whitespace or a random unicode character by ES5?

# Dave Fugate (13 years ago)

Several test262 test cases operate on the assumption '\u0085', aka Next Line, is considered a whitespace character and I'd like to get some clarification on whether it really is or not as-per ES5.1.

Table 3 of ES5, Line Terminator Characters, does not call out \u0085 as being a valid line terminator. It does however state: Only the characters in Table 3 are treated as line terminators. Other new line or line breaking characters are treated as white space but not as line terminators.

This raises two questions:

  1.   Is Next Line considered to be a 'new line' or 'line breaking character'?  By definition, the answer seems to be yes.

  2.   Next Line is not called out anywhere in Table 2, Whitespace Characters.  Does this mean Table 2 is simply missing a row for the clause from "Line Terminators" above, or that the clause should not even exist?

# Peter van der Zee (13 years ago)

On Wed, Jul 6, 2011 at 7:38 PM, Dave Fugate <dfugate at microsoft.com> wrote:

Several test262 test cases operate on the assumption ‘\u0085’, aka Next Line, is considered a whitespace character and I’d like to get some clarification on whether it really is or not as-per ES5.1.

Table 3 of ES5, Line Terminator Characters, does not call out \u0085 as being a valid line terminator.  It does however state:

Only the characters in Table 3 are treated as line terminators. Other new line or line breaking characters are treated as white space but not as line terminators.

This raises two questions:

1.       Is Next Line considered to be a ‘new line’ or ‘line breaking character’?  By definition, the answer seems to be yes

I would say no. 7.3 clearly states it's NOT a line terminator.

2.       Next Line is not called out anywhere in Table 2, Whitespace Characters.  Does this mean Table 2 is simply missing a row for the clause from “Line Terminators” above, or that the clause should not even exist?

Or 7.2 should be extended to note that any Unicode line terminator that's not listed in 7.3 is also considered (regular) white space, as per 7.2.

# Dave Fugate (13 years ago)

-----Original Message----- From: Peter van der Zee [mailto:ecma at qfox.nl] Sent: Wednesday, July 06, 2011 10:57 AM To: Dave Fugate Cc: es-discuss at mozilla.org Subject: Re: \u0085 whitespace or a random unicode character by ES5?

On Wed, Jul 6, 2011 at 7:38 PM, Dave Fugate <dfugate at microsoft.com> wrote:

Several test262 test cases operate on the assumption '\u0085', aka Next Line, is considered a whitespace character and I'd like to get some clarification on whether it really is or not as-per ES5.1.

Table 3 of ES5, Line Terminator Characters, does not call out \u0085 as being a valid line terminator.  It does however state:

Only the characters in Table 3 are treated as line terminators. Other new line or line breaking characters are treated as white space but not as line terminators.

This raises two questions:

1.       Is Next Line considered to be a 'new line' or 'line breaking character'?  By definition, the answer seems to be yes

I would say no. 7.3 clearly states it's NOT a line terminator.

[DWF] Sorry for the confusion; I should have been clearer here. What I meant was: is Next Line considered to be a 'new line' or 'line breaking character' by Unicode 3.0? (It's already established that ES5.1 does not view it as a line terminator.)

2.       Next Line is not called out anywhere in Table 2, Whitespace Characters.  Does this mean Table 2 is simply missing a row for the clause from "Line Terminators" above, or that the clause should not even exist?

Or 7.2 should be extended to note that any Unicode line terminator that's not listed in 7.3 is also considered (regular) white space, as per 7.2. [DWF] Good idea.

# Allen Wirfs-Brock (13 years ago)

On Jul 6, 2011, at 11:09 AM, Dave Fugate wrote:

-----Original Message----- From: Peter van der Zee [mailto:ecma at qfox.nl] Sent: Wednesday, July 06, 2011 10:57 AM To: Dave Fugate Cc: es-discuss at mozilla.org Subject: Re: \u0085 whitespace or a random unicode character by ES5?

On Wed, Jul 6, 2011 at 7:38 PM, Dave Fugate <dfugate at microsoft.com> wrote:

Several test262 test cases operate on the assumption '\u0085', aka Next Line, is considered a whitespace character and I'd like to get some clarification on whether it really is or not as-per ES5.1.

Table 3 of ES5, Line Terminator Characters, does not call out \u0085 as being a valid line terminator. It does however state:

Only the characters in Table 3 are treated as line terminators. Other new line or line breaking characters are treated as white space but not as line terminators.

This raises two questions:

  1.   Is Next Line considered to be a 'new line' or 'line breaking character'? By definition, the answer seems to be yes

I would say no. 7.3 clearly states it's NOT a line terminator.

[DWF] Sorry for the confusion; I should have been clearer here. What I meant was: is Next Line considered to be a 'new line' or 'line breaking character' by Unicode 3.0? (It's already established that ES5.1 does not view it as a line terminator.)

I don't think the phrase "new line or line breaking characters" in 7.3 has any normative meaning other than in the context of Table 2 (Whitespace Characters). Table 2 only lists category Zs as a source of additional whitespace characters. U+0085 is in category Cc, so it isn't whitespace.
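
The category argument above can be checked directly in an engine that supports Unicode property escapes (ES2018 or later, so not available to ES5-era engines). This is only a sketch: the helper name `generalCategoryOfInterest` is made up for illustration, and `\p{Zs}` reflects whatever Unicode version the engine ships, not Unicode 3.0.

```javascript
// Hypothetical helper: report whether a character is in Zs (space separator),
// Cc (control), or neither. Only these two categories matter for this thread.
function generalCategoryOfInterest(ch) {
  if (/^\p{Zs}$/u.test(ch)) return "Zs"; // whitespace via the Zs row of Table 2
  if (/^\p{Cc}$/u.test(ch)) return "Cc"; // control character
  return "other";
}

const nel = generalCategoryOfInterest("\u0085");         // Next Line
const nbsp = generalCategoryOfInterest("\u00A0");        // No-Break Space
const ideographic = generalCategoryOfInterest("\u3000"); // Ideographic Space

console.log(nel, nbsp, ideographic); // Cc Zs Zs
```

Note that being in Cc does not by itself rule a character out: TAB (U+0009) is also Cc, but it is whitespace because Table 2 lists it explicitly. U+0085 has no such row.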

  2.   Next Line is not called out anywhere in Table 2, Whitespace Characters. Does this mean Table 2 is simply missing a row for the clause from "Line Terminators" above, or that the clause should not even exist?

Or 7.2 should be extended to note that any Unicode line terminator that's not listed in 7.3 is also considered (regular) white space, as per 7.2. [DWF] Good idea.

What's a "line terminator"? Unicode categories Zl (line separator) and Zp (paragraph separator) contain U+2028 and U+2029 respectively, both of which are already in Table 3.
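
One way to probe how a given engine tokenizes a character is to feed it source text through the `Function` constructor. The sketch below assumes that approach; the helper names `actsAsLineTerminator` and `separatesTokens` are invented here. The first relies on automatic semicolon insertion, which only applies when the engine sees a LineTerminator between the two statements. The second only checks that the character is accepted between two tokens at all, which is true of both WhiteSpace and LineTerminator.

```javascript
// Hypothetical probe: does `sep` let two statements parse without a semicolon?
// That only works if the engine treats `sep` as a LineTerminator (via ASI).
function actsAsLineTerminator(sep) {
  try {
    new Function("var a = 1" + sep + "var b = 2");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// Hypothetical probe: is `sep` accepted between two tokens at all?
// True for WhiteSpace and for LineTerminator; false for an illegal character.
function separatesTokens(sep) {
  try {
    new Function("var" + sep + "a");
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

for (const [name, ch] of [["LF", "\n"], ["LS", "\u2028"], ["NBSP", "\u00A0"], ["NEL", "\u0085"]]) {
  console.log(name, actsAsLineTerminator(ch), separatesTokens(ch));
}
```

In an engine that follows 7.2 and 7.3 as written, U+0085 should fail both probes, since it is neither WhiteSpace nor a LineTerminator.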

There are many other Unicode characters that contribute to line breaking. See unicode.org/reports/tr14

ES5.1 is precise enough WRT which characters are whitespace and which are Line Terminators. To me, there seem to be only two relevant questions:

  1. Is there a legacy browser de facto standard of including other Unicode characters in either of these ES token categories?
  2. Are there real use cases for adding other Unicode characters to either of these categories?

Given that Dave is asking, I infer that IE does not recognize 0085. If that has always been the case (i.e., prior to IE9), then it seems unlikely that recognition of 0085 has been a de facto standard.

# Dave Fugate (13 years ago)

Hi Allen, this is a case where Canary and IE9 agree that \u0085 is whitespace or a line terminator as far as String.prototype.trim is concerned, but other browsers do not. I'll file a spec bug to clarify the text, as there seems to be confusion on the Web about this one (e.g., do a search on "ECMAScript" within en.wikipedia.org/wiki/Newline). Also, it doesn't help that Unicode 3.0 calls out \u0085 as being a line break (ftp://ftp.unicode.org/Public/3.0-Update/LineBreak-5.txt).

Anyway, thanks for everyone's help!

Dave


# Allen Wirfs-Brock (13 years ago)

On Jul 6, 2011, at 3:09 PM, Dave Fugate wrote:

Hi Allen, this is a case where Canary and IE9 agree that \u0085 is whitespace or a line terminator as far as String.prototype.trim is concerned, but other browsers do not. I’ll file a spec bug to clarify the text as there seems to be confusion on the Web about this one

I don't think there is any ambiguity in the spec about trim. It references the WhiteSpace and LineTerminator grammar productions, which are quite precise in their definition. Since trim is new with ES5, there presumably aren't legacy compat issues in regard to it. It looks to me that Canary and IE9 are just out of spec WRT trim.
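
To make the trim point concrete, here is a rough sketch of a trim built from the two productions named above. It is not the spec algorithm verbatim: `specTrim` and `ES5_TRIM_CHAR` are names made up for this example, the character list is my reading of Table 2 and Table 3, and `\p{Zs}` (ES2018) stands in for the "any other Zs" row using the engine's own Unicode version.

```javascript
// Assumed character set: WhiteSpace (7.2: TAB, VT, FF, SP, NBSP, BOM, any Zs)
// plus LineTerminator (7.3: LF, CR, LS, PS). U+0085 is deliberately absent.
const ES5_TRIM_CHAR = /[\t\v\f \u00A0\uFEFF\p{Zs}\n\r\u2028\u2029]/u;

// Hypothetical reference trim: strip leading and trailing characters in the set.
function specTrim(s) {
  let start = 0;
  let end = s.length;
  while (start < end && ES5_TRIM_CHAR.test(s[start])) start++;
  while (end > start && ES5_TRIM_CHAR.test(s[end - 1])) end--;
  return s.slice(start, end);
}

// NBSP and LS at the edges are stripped; the U+0085 characters are kept.
const sample = "\u00A0\u0085abc\u0085\u2028";
const trimmed = specTrim(sample);

console.log(trimmed.length);            // 5
console.log(sample.trim() === trimmed); // true when the built-in agrees with this reading
```

An engine whose built-in `trim` returns a 3-character string for `sample` is treating U+0085 as whitespace or a line terminator, which is the behaviour reported above for Canary and IE9.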

(e.g., do a search on “ECMAScript” within en.wikipedia.org/wiki/Newline).

I saw that, and as far as I can tell its assertion about ECMAScript and U+0085 is just wrong. I found no evidence that ES ever specified U+0085 to be a line terminator. Of course, various implementations may have.

Also, doesn’t help that Unicode 3.0 calls out \u0085 as being a line break.

The ES spec defines ES tokens, not the Unicode spec. Also, the Unicode concept of line breaks is quite complex; as I mentioned earlier, see Unicode TR14.