Expectations around line ending behavior for U+2028 and U+2029

# Logan Smyth (7 years ago)

Something I've recently realized is just how much U+2028 and U+2029 being
newlines introduces a mismatch between different parts of a dev
environment, and I'm curious for thoughts.

Engines understandably take these two characters into account when defining
their line number offsets in stack traces, since they are part of the
LineTerminator grammar. Similarly, Babel's parser and I assume others will
do the same and take them into account for their line number data. On the
other hand, it seems like every editor that I've looked at so far will not
render these characters as newlines, which can create confusion for users
because error messages will not align with what they see in their editors.
This seems like a burden for editors, since they would need to know the
type of file in order to know how to render it. There's also a question of
mixed content. If I have an HTML file with a <script>, would an editor need
to be content-aware to render the newlines correctly only within the
<script> tag, since U+2028/29 are not newline characters for HTML?

Another case that comes to mind is that sourcemaps don't appear to specify
what counts as a line. While mappings are defined per-line, it's not clear
whether these should take U+2028/29 into account or not, though I'd assume
the intention is /\r?\n/. Tooling like Babel will currently take U+2028/29
into account because otherwise we'd need two independent concepts of
line/column number for each location. That said, this Babel behavior is
likely a bad idea because it means the application of a sourcemap would
need to be aware of whether a given mapping within a file applies to JS
content, or something else.

Would it be worth exploring a definition of U+2028/29 in the spec such that
they behave as line terminators for ASI, but otherwise do not increment
things like line number counts and behave as whitespace characters? If not, what
are your thoughts on the issues I've mentioned?
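
To make the mismatch concrete, here is a minimal sketch that computes the position of the same offset under both conventions. The helper name `positionOf` and both regex constants are hypothetical names chosen for illustration, and the "editor" pattern is only the /\r?\n/ assumption discussed above, not a statement about any particular editor.

```javascript
// Line terminators per the ECMAScript LineTerminatorSequence production:
// <CR><LF> counts once; lone <CR>, <LF>, U+2028 and U+2029 each count once.
const JS_LINE_TERMINATOR = /\r\n|[\r\n\u2028\u2029]/g;
// The /\r?\n/ convention assumed in this thread for editors and other tools.
const EDITOR_LINE_TERMINATOR = /\r?\n/g;

// Returns the 1-based line and 0-based column of `offset` in `source`,
// counting only the breaks matched by `terminator`.
function positionOf(source, offset, terminator) {
  let line = 1;
  let lineStart = 0;
  // Fresh regex so the shared constants never carry lastIndex state.
  const re = new RegExp(terminator.source, "g");
  let match;
  while ((match = re.exec(source)) !== null) {
    const end = match.index + match[0].length;
    if (end > offset) break;
    line += 1;
    lineStart = end;
  }
  return { line, column: offset - lineStart };
}

const source = "let a = 1;\u2028let b = 2;\nthrow new Error('boom');";
const offset = source.indexOf("throw");

console.log(positionOf(source, offset, JS_LINE_TERMINATOR));     // { line: 3, column: 0 }
console.log(positionOf(source, offset, EDITOR_LINE_TERMINATOR)); // { line: 2, column: 0 }
```

The `throw` statement sits on line 3 by the language grammar but on line 2 for any tool that only splits on /\r?\n/.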

# Richard Gibson (7 years ago)

The only explicit mention of line numbers in the spec is to note that
"<CR><LF>… should be considered a single SourceCharacter for the purpose of
reporting line numbers", but it's clear from things like ASI and
termination of single-line comments that every *LineTerminatorSequence* is
equal in this sense. Editors and HTML are free to do what they want, but in
my opinion ECMAScript tooling at least should not pretend that these input
elements don't terminate lines.
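
A small sketch of the two behaviors mentioned here, using `new Function` so the characters can be written as escapes in the host string. The variable names are only illustrative.

```javascript
// U+2028 terminates the single-line comment, so the `return` that follows
// it is executed.
const withSeparator = new Function("// note\u2028return 42;");
// With an ordinary space the whole body is one comment.
const withSpace = new Function("// note return 42;");

console.log(withSeparator()); // 42
console.log(withSpace());     // undefined

// ASI: a line terminator between the statements lets semicolons be inserted.
const asi = new Function("let a = 1\u2029let b = 2\u2029return a + b");
console.log(asi()); // 3

// The same text with spaces instead of U+2029 is a SyntaxError.
let threw = false;
try {
  new Function("let a = 1 let b = 2 return a + b");
} catch (e) {
  threw = e instanceof SyntaxError;
}
console.log(threw); // true
```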

# Logan Smyth (7 years ago)

Yeah, *LineTerminatorSequence* is definitely the canonical definition of
line numbers in JS at the moment. As we explore
https://github.com/tc39/proposal-error-stacks, it would be good to clearly
specify how a line number is computed from the original source. As
currently specified, a line number in a stack trace takes U+2028/29 into
account, and thus requires any consumer of this source code and line number
value to have a special case for JS code. It seems unrealistic to expect
that every piece of tooling that works with source code would have a
special case for JS code to take these 2 characters into account. Given
that, the choices are:

1. Every tool that manipulates source code needs to know what type it is
so it can special-case JS in order to process line-related information
2. Every tool should consider U+2028/29 newlines, causing line numbers to
be off in other programming languages
3. Accept that tooling and the spec will never correspond and the use of
these two characters in source code will continue to cause issues
4. Diverge the definition of current source-code line from the current
*LineTerminatorSequence* lexical grammar such that source line number is
always /\r?\n/, which is what the user is realistically going to see in
their editor
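
As a rough illustration of what the last choice would mean for tooling today, the following sketch re-expresses a position reported under the *LineTerminatorSequence* convention in terms of /\r?\n/ lines. `lineStarts` and `toPlainPosition` are hypothetical helpers, not part of any existing tool, and the sketch assumes the caller still has the original source text.

```javascript
const JS_BREAK = /\r\n|[\r\n\u2028\u2029]/g; // LineTerminatorSequence
const PLAIN_BREAK = /\r?\n/g;                 // the /\r?\n/ convention

// Offsets at which each line starts, under the given break pattern.
function lineStarts(source, breakPattern) {
  const starts = [0];
  for (const m of source.matchAll(breakPattern)) {
    starts.push(m.index + m[0].length);
  }
  return starts;
}

// Hypothetical helper: convert a 1-based line / 0-based column reported
// under the JS convention into a position counted in /\r?\n/ lines.
function toPlainPosition(source, jsLine, jsColumn) {
  const offset = lineStarts(source, JS_BREAK)[jsLine - 1] + jsColumn;
  const plain = lineStarts(source, PLAIN_BREAK);
  // Walk back to the last plain line that starts at or before the offset.
  let line = plain.length;
  while (plain[line - 1] > offset) line -= 1;
  return { line, column: offset - plain[line - 1] };
}

const source = "a();\u2028b();\nc();\u2029d();";
console.log(toPlainPosition(source, 4, 0)); // { line: 2, column: 5 }
```

The call to `d()` is on line 4 for a JS parser, but a tool that only knows /\r?\n/ would show it at line 2, column 5.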

# Carsten Bormann (7 years ago)

On Oct 25, 2018, at 18:24, Logan Smyth <loganfsmyth at gmail.com> wrote:

> Diverge the definition of current source-code line from the current LineTerminatorSequence lexical grammar such that source line number is always /\r?\n/, which is what the user is realistically going to see in their editor

This. U+2028/U+2029 is widely recognized as a mistake (and not only because of the discrepancy with JSON it created).

While this mistake probably cannot be repaired easily in existing parts of the specification, you can make sure that the problem does not infect new parts. (This may cause the people who are actually using 2028/2029 to take some damage, but that is entirely OK as long as their precious existing scripts don’t break.)

Grüße, Carsten

# Waldemar Horwat (7 years ago)

On 10/25/2018 09:24 AM, Logan Smyth wrote:
> Yeah, /LineTerminatorSequence/ is definitely the canonical definition of line numbers in JS at the moment. As we explore https://github.com/tc39/proposal-error-stacks, it would be good to clearly specify how a line number is computed from the original source. As currently specified, a line number in a stack trace takes U+2028/29 into account, and thus requires any consumer of this source code and line number value needs to have a special case for JS code. It seems unrealistic to expect every piece of tooling that works with source code would have a special case for JS code to take these 2 characters into account. Given that, the choices are
> 
> 1. Every tool that manipulates source code needs to know what type it is so it can special-case JS in order to process line-related information
> 2. Every tool should consider U+2028/29 newlines, causing line numbers to be off in other programming languages
> 3. Accept that tooling and the spec will never correspond and the use of these two characters in source code will continue to cause issues
> 4. Diverge the definition of current source-code line from the current /LineTerminatorSequence/ lexical grammar such that source line number is always /\r?\n/, which is what the user is realistically going to see in their editor

The Unicode standard is the more relevant one here.  Choice 2 is the correct one per the Unicode standard.  Tools that do not consider U+2028/29 to be line breaks are not behaving as they should according to the latest Unicode standard.

     Waldemar

# Logan Smyth (7 years ago)

> Tools that do not consider U+2028/29 to be line breaks are not behaving
> as they should according to the latest Unicode standard.

That's part of what I'm attempting to understand. What specifically does
Unicode require for these code points? What are the expectations for
languages that have differing definitions of line separators? The HTML spec
defines newlines in https://html.spec.whatwg.org/#newlines as CR and LF
only. Is that technically in violation of the Unicode spec then? If code
editors were to adopt U+2028 and U+2029 as line separators, is the
expectation that they would apply that to HTML files too, even though that
would put the editor's concept of a line in conflict with the
language's specification?

It seems unrealistic to expect that all tooling that processes source code
would adopt a new type of line separator. Given that, JS is the outlier.
Similarly, does Unicode make any guarantees about what counts as a line
terminator? If it changes in the future, would JS be forced to add that as
a type of LineTerminator as well? If it did, that could break existing
code, and if it doesn't, then JS would end up right back in the same place
with a concept of line numbers that differs from other tooling. CR and LF
are already the de facto standards. Is it really realistic to expect tooling
to _ever_ change? It is much more likely that JS will have simply specified
itself as a special-case forever, which tooling will never handle.

# Allen Wirfs-Brock (7 years ago)

> On Oct 25, 2018, at 4:49 PM, Logan Smyth <loganfsmyth at gmail.com> wrote:
> 
> > Tools that do not consider U+2028/29 to be line breaks are not behaving as they should according to the latest Unicode standard.
> 
> That's part of what I'm attempting to understand. What specifically does Unicode require for these code points? What are the expectations for languages that have differing definitions of line separators? 

see https://www.unicode.org/versions/Unicode11.0.0/ch05.pdf#G10213

# Carsten Bormann (7 years ago)

On Oct 26, 2018, at 02:17, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:
> 
> see https://www.unicode.org/versions/Unicode11.0.0/ch05.pdf#G10213 

Please explain how this is even remotely relevant for a programming language.

(Clearly, this was written by people who were trying to encode word processing text.
The giveaway is the phrase “simple text editors such as program editors”.
This recommendation is attempting to solve a problem that programming languages do not have.
I’m not aware of any uptake in the word processing world, either.)

Grüße, Carsten

# Claude Pache (7 years ago)

> On Oct 24, 2018, at 21:58, Logan Smyth <loganfsmyth at gmail.com> wrote:
> 
> On the other hand, it seems like every editor that I've looked at so far will not render these characters as newlines,

I have just tried to open a file containing U+2028 and U+2029 in four different text editors / integrated environments on my Mac. All of them recognise both characters as newlines (and increment the line number for those that display it).

—Claude

# Claude Pache (7 years ago)

> 
> Would it be worth exploring a definition of U+2028/29 in the spec such that they behave as line terminators for ASI, but otherwise do not increment things line number counts and behave as whitespace characters?

Diverging the definition of line terminator for the purpose of line counting on one side, and ASI and single-line comment on the other side, is adding yet another complication in a matter that is already messy. And I suspect that most tools that have issues with the former case, have issues as well with the latter case, so that half-fixing is equivalent to not fixing.

If we want to “fix” the definition of line terminator somewhere, we should “fix” it everywhere.

(Note that the recent addition of U+2028 and U+2029 inside string literals does not constitute a modification of the definition of line terminator in that context; it is rather allowing string literals to span multiple lines in some specific cases.)

—Claude
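
A brief sketch of that distinction, assuming an engine that implements the ES2019 change allowing raw U+2028/U+2029 inside string literals (which I take to be the addition referred to above). The variable names are illustrative only.

```javascript
// A raw U+2028 inside a string literal is accepted, matching JSON.
const viaEval = eval('"a\u2028b"');
const viaJson = JSON.parse('"a\u2028b"');
console.log(viaEval === viaJson); // true
console.log(viaEval.length);      // 3

// Elsewhere the character is still a LineTerminator; a regular expression
// literal, for example, may not contain one.
let regexRejected = false;
try {
  eval("/a\u2028b/");
} catch (e) {
  regexRejected = e instanceof SyntaxError;
}
console.log(regexRejected); // true
```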

# Logan Smyth (7 years ago)

Great, thank you for that resource Allen, it's helpful to have something
concrete to consider.

What you'd prefer is that other languages should also be rendered with
U+2028/29 as creating new lines, even though their specifications do not
define them as line terminators? That means that any parser for these
languages that follows the language spec would then be outputting line
numbers that would potentially not correspond with the code as rendered
inside of the developer's editor, if the editor renders U+2028/29 as line
separators? That would for instance mean that Rust's single-line comments
could actually be rendered as multiple lines, even though they are a single
line according to the spec.

My frustration here isn't that the characters exist; it's that their
behavior seems poorly defined in a world of explicitly defined syntactic
grammars that depend on line numbers for errors and the like, even if their
behavior in text documents may have more meaning. For instance, here is
Xcode's rendering of 2028/2029:
[image: Screen Shot 2018-10-26 at 2.33.56 PM.png]

2028 does seem to render as a "line separator" in that visually the code is
on a new line, but it is rendered within the same line number marker as the
start of that snippet of text. That seems to satisfy the behavior defined
by Unicode, but it's not helpful from the standpoint of code looking to
process source code. Should a parser follow that definition of line
separator, since 2028 suggests rendering a new line, but since it's not a
paragraph, it's conceptually part of the same paragraph? What is a
paragraph in source code? Unicode has no sense of line numbers as far as I
know, which means it seems up to an individual language to define what line
number a given token is on.

> All of them recognise both characters as newlines (and increment the line
> number for those that display it).

Revisiting my tests on my OSX machine, it seems like there is a difference
in treatment of 2028 and 2029 that threw off at least some of my tests.
* VSCode: 2028 is a unicode placeholder and 2029 seems to be rendered
zero-width, no new lines
* Sublime 3: 2028/29 rendered zero-width, no new lines
* TextEdit: 2028 is a newline, 2029 is zero-width, no new lines
* XCode: Per above screenshot, 2028 creates a line but renders within the
same line number, 2029 creates a new line number
* Firefox, Chrome, and Safari, with text in a <pre> or <textarea> renders
them all on one line zero-width, no new lines (though how HTML renders may
just be a whole separate question)

On Fri, Oct 26, 2018 at 7:42 AM Claude Pache <claude.pache at gmail.com> wrote:

>
>
> >
> > Would it be worth exploring a definition of U+2028/29 in the spec such
> that they behave as line terminators for ASI, but otherwise do not
> increment things line number counts and behave as whitespace characters?
>
> Diverging the definition of line terminator for the purpose of line
> counting on one side, and ASI and single-line comment on the other side, is
> adding yet another complication in a matter that is already messy. And I
> suspect that most tools that have issues with the former case, have issues
> as well with the latter case, so that half-fixing is equivalent to not
> fixing.
>
> If we want to ”fix” the definition of line terminator somewhere, we should
> ”fix” it everywhere.
>
> (Note that the recent addition of U+2028 and U+2029 inside string literals
> does not constitutes a modification of the definition of line terminator in
> that context; it is rather allowing string literals to span multiple lines
> in some specific cases.)
>
> —Claude

# Isiah Meadows (7 years ago)

So in other words, all these IDEs are broken and in violation of the Unicode spec. BTW, VSCode depends on Chrome, so it'll likely have most of the same behavior if it doesn't correctly account for them.

Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com


# Logan Smyth (7 years ago)

Sounds good. This means that the expectation, from the standpoint of the Unicode spec, is that all existing parsers and tooling for all languages would also be updated to have line numbering that includes U+2028/29, or else that their line numbers would indefinitely be out of sync with the line numbers rendered in an editor whenever a file contains these characters. I assume it would be a breaking change for most languages to add U+2028/29 as line terminators for line comments and potentially string content at a minimum, meaning that even if editors render them as line separators, there would still be single-line tokens bridging multiple lines in those languages.

I understand the desire for all this, but I hope it's understandable why this situation is a little frustrating. There is no clean way to update any given tool to treat U+2028/29 as newlines without either special-casing JS or accepting that the tool's line numbers will not correspond with line numbers from any tool that does not treat them as newlines, which is realistically the vast majority of tooling and parsers. It is difficult to make the argument to migrate any given tool when you're migrating away from the current de facto behavior, even if you're migrating to the Unicode-defined behavior, especially when it's not clear that the community as a whole is even aware that they should be treating these as newlines in the first place.
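To make the mismatch concrete, here is a rough sketch of two tools reporting the position of the same character. The `locate` helper and the two regex constants are hypothetical names for this illustration and are not taken from any real parser.

```javascript
// Two competing definitions of "what ends a line".
const JS_TERMINATOR = /\r\n|[\n\r\u2028\u2029]/; // ECMAScript LineTerminator set
const LF_ONLY = /\r?\n/;                         // assumed de facto tooling behavior

// Hypothetical helper: convert a string offset into a 1-based line and a
// 0-based column, given a regex describing what ends a line.
function locate(text, offset, terminator) {
  const re = new RegExp(terminator.source, "g");
  let line = 1;
  let lineStart = 0;
  let match;
  // Every terminator that starts before `offset` begins a new line.
  while ((match = re.exec(text)) !== null && match.index < offset) {
    line += 1;
    lineStart = match.index + match[0].length;
  }
  return { line, column: offset - lineStart };
}

const text = "// note\u2028x\nfoo();";
const offsetOfX = text.indexOf("x");

console.log(locate(text, offsetOfX, JS_TERMINATOR)); // { line: 2, column: 0 }
console.log(locate(text, offsetOfX, LF_ONLY));       // { line: 1, column: 8 }
```

Both the line and the column differ, so a consumer cannot translate between the two conventions without rescanning the original text.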


# Boris Zbarsky (7 years ago)

On 10/29/18 2:04 PM, Logan Smyth wrote:

> This means that the expectation, from the standpoint of Unicode spec, is that all existing parsers and tooling for all languages would also be updated to have line numbering that include U+2028/29

There is also the somewhat widespread opinion that Unicode goofed by adding those characters and that the best thing to do with them is to simply ignore them. So I expect a number of languages and tools to do just that.


# Isiah Meadows (7 years ago)

I could give languages a pass, since the grammars of some are ASCII-only (like Lua), and it's easier to just not include those code points as line terminators (they're rarely used anyway). But editors and browsers generally deal with arbitrary text display and modification, so it makes little sense for them to ignore these semantics.

Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com


# Carsten Bormann (7 years ago)

On Oct 26, 2018, at 10:48, Claude Pache <claude.pache at gmail.com> wrote:

> I have just tried to open a file containing U+2028 and U+2029 in four different text editors / integrated environments on my Mac. All of them recognise both characters as newlines (and increment the line number for those that display it).

Hi Claude,

can you identify those apps? I’d like to meet them in person.

Grüße, Carsten


# J Decker (7 years ago)

On Mon, Oct 29, 2018 at 1:50 PM Carsten Bormann <cabo at tzi.org> wrote:

> On Oct 26, 2018, at 10:48, Claude Pache <claude.pache at gmail.com> wrote:
>
> > I have just tried to open a file containing U+2028 and U+2029 in four different text editors / integrated environments on my Mac. All of them recognise both characters as newlines (and increment the line number for those that display it).
>
> Hi Claude,
>
> can you identify those apps? I’d like to meet them in person.

https://esdiscuss.org/topic/expectations-around-line-ending-behavior-for-u-2028-and-u-2029#content-10 The screen shot didn't save in that archive though...

It looks to me like 2028 isn't a real line break; just a visual line break... since it kept the same line number. Not sure what to do about a 'column' count in this case though... character index on line != column in this case...

2029 does look like '\r\n' (if \r is return and \n is linefeed as in classic TTY)
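The column question can be sketched as follows. This is only a model of the reading given above (U+2028 starts a new visual row but keeps the line number, U+2029 behaves like a hard line break); it is an assumption for illustration, not a description of how Xcode or any other editor is actually implemented. `visualPosition` and `HARD_BREAK` are made-up names.

```javascript
// Assumption: "hard" breaks start a new numbered line, while U+2028 only
// starts a new visual row inside the current numbered line.
const HARD_BREAK = /\r\n|[\n\r\u2029]/y; // sticky: matches only at lastIndex

// Hypothetical helper: describe where `offset` lands under that model.
function visualPosition(text, offset) {
  let line = 1;        // numbered line, as shown in the gutter
  let row = 0;         // visual row within the numbered line
  let indexOnLine = 0; // characters since the last hard break
  let column = 0;      // characters since the last visual break of any kind
  let i = 0;
  while (i < offset) {
    HARD_BREAK.lastIndex = i;
    const hard = HARD_BREAK.exec(text);
    if (hard) {
      line += 1; row = 0; indexOnLine = 0; column = 0;
      i += hard[0].length;
    } else if (text[i] === "\u2028") {
      row += 1; indexOnLine += 1; column = 0;
      i += 1;
    } else {
      indexOnLine += 1; column += 1;
      i += 1;
    }
  }
  return { line, row, indexOnLine, column };
}

const sample = "ab\u2028cd\u2029ef";
console.log(visualPosition(sample, sample.indexOf("d")));
// { line: 1, row: 1, indexOnLine: 4, column: 1 }
console.log(visualPosition(sample, sample.indexOf("f")));
// { line: 2, row: 0, indexOnLine: 1, column: 1 }
```

For "d", the character index on the numbered line is 4 while the visual column is 1, which is the "character index on line != column" situation.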


# Carsten Bormann (7 years ago)

On Oct 29, 2018, at 21:55, J Decker <d3ck0r at gmail.com> wrote:

> https://esdiscuss.org/topic/expectations-around-line-ending-behavior-for-u-2028-and-u-2029#content-10

Your message was non-surprising to me: Most editors indeed do not heed the Unicode lore on 2028 and 2029, as nobody uses these characters in plaintext.

I was interested in learning about ones that do implement 2028 and 2029 as intended by Unicode; Claude had four of them, and I would like to meet them.

Grüße, Carsten
