Expectations around line ending behavior for U+2028 and U+2029

# Logan Smyth (6 years ago)

Something I've recently realized is just how much U+2028 and U+2029 being newlines introduces a mismatch between different parts of a dev environment, and I'm curious for thoughts.

Engines understandably take these two characters into account when defining their line number offsets in stack traces, since they are part of the LineTerminator grammar. Similarly, Babel's parser, and I assume others, will take them into account for their line number data. On the other hand, it seems like every editor that I've looked at so far will not render these characters as newlines, which can create confusion for users because error messages will not align with what they see in their editors. This seems like a burden for editors, since they would need to know the type of file in order to know how to render it. There's also a question of mixed content. If I have an HTML file with a <script>, would an editor need to be content-aware to render the newlines correctly only within the <script> tag, since U+2028/29 are not newline characters for HTML?

Another case that comes to mind is that sourcemaps don't appear to specify what counts as a line. While mappings are defined per-line, it's not clear whether these should take U+2028/29 into account or not, though I'd assume the intention is /\r?\n/. Tooling like Babel will currently take U+2028/29 into account because otherwise we'd need two independent concepts of line/column number for each location. That said, this Babel behavior is likely a bad idea because it means the application of a sourcemap would need to be aware of whether a given mapping within a file applies to JS content, or something else.
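
To make the mismatch concrete, here is a minimal sketch of the two competing definitions of a line. The helper names `countLinesJS` and `countLinesPlain` are hypothetical and are not taken from Babel or from any sourcemap library. The first follows the ECMAScript LineTerminatorSequence grammar, the second follows the /\r?\n/ convention assumed above.

```javascript
// Line breaks per the ECMAScript LineTerminatorSequence grammar:
// <CR><LF> counts once; LF, CR, U+2028 and U+2029 each count once.
const JS_LINE_BREAK = /\r\n|[\n\r\u2028\u2029]/;

// Line breaks as most editors and non-JS tooling appear to see them.
const PLAIN_LINE_BREAK = /\r?\n/;

// Hypothetical helpers: number of lines under each definition.
function countLinesJS(source) {
  return source.split(JS_LINE_BREAK).length;
}

function countLinesPlain(source) {
  return source.split(PLAIN_LINE_BREAK).length;
}

const source = "let a = 1;\u2028let b = 2;\nthrow new Error('boom');";

console.log(countLinesJS(source));    // 3: the throw is on line 3
console.log(countLinesPlain(source)); // 2: the throw is on line 2
```

A mapping generated with one definition and consumed with the other would point at the wrong line for everything after the first U+2028 or U+2029.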

Would it be worth exploring a definition of U+2028/29 in the spec such that they behave as line terminators for ASI, but otherwise do not increment things like line number counts and behave as whitespace characters? If not, what are your thoughts on the issues I've mentioned?

# Richard Gibson (6 years ago)

The only explicit mention of line numbers in the spec is to note that "<CR><LF>… should be considered a single SourceCharacter for the purpose of reporting line numbers", but it's clear from things like ASI and termination of single-line comments that every LineTerminatorSequence is equal in this sense. Editors and HTML are free to do what they want, but in my opinion ECMAScript tooling at least should not pretend that these input elements don't terminate lines.
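
As a small illustration of that point, the sketch below relies only on U+2028 being a LineTerminator, which means it ends a single-line comment and triggers ASI after `return`. The variable names are hypothetical, and `new Function` is used only as a convenient way to parse source text that contains the raw character.

```javascript
// U+2028 is a LineTerminator, so it ends the single-line comment
// and the `return` statement after it is real code.
const withLineSeparator = new Function("// comment\u2028return 42;");

// A space is not a LineTerminator, so everything after `//`
// stays inside the comment and the function body is empty.
const withSpace = new Function("// comment return 42;");

// ASI: a LineTerminator directly after `return` ends the statement,
// so the function returns undefined and `42;` is never reached.
const withAsi = new Function("return\u2028 42;");

console.log(withLineSeparator()); // 42
console.log(withSpace());         // undefined
console.log(withAsi());           // undefined
```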

# Logan Smyth (6 years ago)

Yeah, LineTerminatorSequence is definitely the canonical definition of line numbers in JS at the moment. As we explore tc39/proposal-error-stacks, it would be good to clearly specify how a line number is computed from the original source. As currently specified, a line number in a stack trace takes U+2028/29 into account, and thus requires that any consumer of this source code and line number value have a special case for JS code. It seems unrealistic to expect that every piece of tooling that works with source code would have a special case for JS code to take these 2 characters into account. Given that, the choices are

  1. Every tool that manipulates source code needs to know what type it is so it can special-case JS in order to process line-related information
  2. Every tool should consider U+2028/29 newlines, causing line numbers to be off in other programming languages
  3. Accept that tooling and the spec will never correspond and the use of these two characters in source code will continue to cause issues
  4. Diverge the definition of current source-code line from the current LineTerminatorSequence lexical grammar such that source line number is always /\r?\n/, which is what the user is realistically going to see in their editor

# Carsten Bormann (6 years ago)

On Oct 25, 2018, at 18:24, Logan Smyth <loganfsmyth at gmail.com> wrote:

  4. Diverge the definition of current source-code line from the current LineTerminatorSequence lexical grammar such that source line number is always /\r?\n/, which is what the user is realistically going to see in their editor

This. U+2028/U+2029 is widely recognized as a mistake (and not only because of the discrepancy with JSON it created).

While this mistake probably cannot be repaired easily in existing parts of the specification, you can make sure that the problem does not infect new parts. (This may cause the people who are actually using 2028/2029 to take some damage, but that is entirely OK as long as their precious existing scripts don’t break.)

Grüße, Carsten

# Waldemar Horwat (6 years ago)

On 10/25/2018 09:24 AM, Logan Smyth wrote:

Yeah, /LineTerminatorSequence/ is definitely the canonical definition of line numbers in JS at the moment. As we explore tc39/proposal-error-stacks, it would be good to clearly specify how a line number is computed from the original source. As currently specified, a line number in a stack trace takes U+2028/29 into account, and thus requires that any consumer of this source code and line number value have a special case for JS code. It seems unrealistic to expect that every piece of tooling that works with source code would have a special case for JS code to take these 2 characters into account. Given that, the choices are

  1. Every tool that manipulates source code needs to know what type it is so it can special-case JS in order to process line-related information
  2. Every tool should consider U+2028/29 newlines, causing line numbers to be off in other programming languages
  3. Accept that tooling and the spec will never correspond and the use of these two characters in source code will continue to cause issues
  4. Diverge the definition of current source-code line from the current /LineTerminatorSequence/ lexical grammar such that source line number is always /\r?\n/, which is what the user is realistically going to see in their editor

The Unicode standard is the more relevant one here. Choice 2 is the correct one per the Unicode standard. Tools that do not consider U+2028/29 to be line breaks are not behaving as they should according to the latest Unicode standard.

Waldemar

# Logan Smyth (6 years ago)

Tools that do not consider U+2028/29 to be line breaks are not behaving as they should according to the latest Unicode standard.

That's part of what I'm attempting to understand. What specifically does Unicode require for these code points? What are the expectations for languages that have differing definitions of line separators? The HTML spec defines newlines in html.spec.whatwg.org/#newlines as CR and LF only. Is that technically in violation of the Unicode spec then? If code editors were to adopt U+2028 and U+2029 as line separators, is the expectation that they would apply that to HTML files too, even though that would put the editor's concept of a line in conflict with the language's specification?

It seems unrealistic to expect that all tooling that processes source code would adopt a new type of line separator. Given that, JS is the outlier. Similarly, does Unicode make any guarantees about what counts as a line terminator? If it changes in the future, would JS be forced to add that as a type of LineTerminator as well? If it did, that could break existing code, and if it didn't, then JS would end up right back in the same place with a concept of line numbers that differs from other tooling. CR and LF are already the de facto standard. Is it really realistic to expect tooling to ever change? It is much more likely that JS will simply have specified itself as a special case forever, which tooling will never handle.

# Allen Wirfs-Brock (6 years ago)

On Oct 25, 2018, at 4:49 PM, Logan Smyth <loganfsmyth at gmail.com> wrote:

Tools that do not consider U+2028/29 to be line breaks are not behaving as they should according to the latest Unicode standard.

That's part of what I'm attempting to understand. What specifically does Unicode require for these code points? What are the expectations for languages that have differing definitions of line separators?

see www.unicode.org/versions/Unicode11.0.0/ch05.pdf#G10213

# Carsten Bormann (5 years ago)

On Oct 26, 2018, at 02:17, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

see www.unicode.org/versions/Unicode11.0.0/ch05.pdf#G10213

Please explain how this is even remotely relevant for a programming language.

(Clearly, this was written by people who were trying to encode word processing text. The giveaway is the phrase “simple text editors such as program editors”. This recommendation is attempting to solve a problem that programming languages do not have. I’m not aware of any uptake in the word processing world, either.)

Grüße, Carsten

# Claude Pache (5 years ago)

Le 24 oct. 2018 à 21:58, Logan Smyth <loganfsmyth at gmail.com> a écrit :

On the other hand, it seems like every editor that I've looked at so far will not render these characters as newlines,

I have just tried to open a file containing U+2028 and U+2029 in four different text editors / integrated environments on my Mac. All of them recognise both characters as newlines (and increment the line number for those that display it).

# Claude Pache (5 years ago)

Would it be worth exploring a definition of U+2028/29 in the spec such that they behave as line terminators for ASI, but otherwise do not increment things like line number counts and behave as whitespace characters?

Diverging the definition of line terminator for the purpose of line counting on one side, and ASI and single-line comment on the other side, is adding yet another complication in a matter that is already messy. And I suspect that most tools that have issues with the former case, have issues as well with the latter case, so that half-fixing is equivalent to not fixing.

If we want to “fix” the definition of line terminator somewhere, we should “fix” it everywhere.

(Note that the recent addition of U+2028 and U+2029 inside string literals does not constitute a modification of the definition of line terminator in that context; it is rather allowing string literals to span multiple lines in some specific cases.)
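
A minimal sketch of that distinction, assuming an engine that implements the ES2019 string literal change: a raw U+2028 or U+2029 is accepted inside a string literal, while a raw line feed in the same position is still a syntax error. `parses` is a hypothetical helper, not part of any library.

```javascript
// Hypothetical helper: does `source` parse as a function body?
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error;
  }
}

// Raw U+2028 / U+2029 inside a string literal (ES2019 and later).
const acceptsLineSeparator = parses('return "a\u2028b";');
const acceptsParagraphSeparator = parses('return "a\u2029b";');

// A raw LF inside the same literal is still an unterminated string.
const acceptsLineFeed = parses('return "a\nb";');

console.log(acceptsLineSeparator, acceptsParagraphSeparator, acceptsLineFeed);

// The literal simply contains the character as one code unit.
const value = new Function('return "a\u2028b";')();
console.log(value.length);
```

On an ES2019-capable engine this should log `true true false` and then `3`. Older engines would reject the first two bodies as well.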

# Logan Smyth (5 years ago)

Great, thank you for that resource Allen, it's helpful to have something concrete to consider.

What you'd prefer is that other languages should also be rendered with U+2028/29 as creating new lines, even though their specifications do not define them as line terminators? That means that any parser for these languages that follows the language spec would then be outputting line numbers that would potentially not correspond with the code as rendered inside of the developer's editor, if the editor renders U+2028/29 as line separators? That would for instance mean that Rust's single-line comments could actually be rendered as multiple lines, even though they are a single line according to the spec.

My frustration here isn't that the characters exist, it's just that, in a world of explicitly defined syntactic grammars that depend on line numbers for errors and other things, their behavior seems poorly defined, even if their behavior in text documents may have more meaning. For instance, here is XCode's rendering of 2028/2029 [image: Screen Shot 2018-10-26 at 2.33.56 PM.png]

2028 does seem to render as a "line separator" in that visually the code is on a new line, but it is rendered within the same line number marker as the start of that snippet of text. That seems to satisfy the behavior defined by Unicode, but it's not helpful from the standpoint of code looking to process source code. Should a parser follow that definition of line separator, where 2028 suggests rendering a new line but, since it does not start a new paragraph, the text is conceptually part of the same paragraph? What is a paragraph in source code? Unicode has no sense of line numbers as far as I know, which means it seems up to an individual language to define what line number a given token is on.

All of them recognise both characters as newlines (and increment the line number for those that display it).

Revisiting my tests on my OSX machine, it seems like there is a difference in treatment of 2028 and 2029 that threw off at least some of my tests.

  • VSCode: 2028 is a Unicode placeholder and 2029 seems to be rendered zero-width, no new lines
  • Sublime 3: 2028/29 rendered zero-width, no new lines
  • TextEdit: 2028 is a newline, 2029 is zero-width, no new lines
  • XCode: Per above screenshot, 2028 creates a line but renders within the same line number, 2029 creates a new line number
  • Firefox, Chrome, and Safari, with text in a <pre> or <textarea>: both are rendered on one line, zero-width, no new lines (though how HTML renders may just be a whole separate question)

# Isiah Meadows (5 years ago)

So in other words, all these IDEs are broken and in violation of the Unicode spec. BTW, VSCode depends on Chrome, so it'll likely have most of the same behavior if it doesn't correctly account for them.


Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com

# Logan Smyth (5 years ago)

Sounds good. This means that the expectation, from the standpoint of the Unicode spec, is that all existing parsers and tooling for all languages would also be updated to have line numbering that includes U+2028/29, or else that the line numbers would indefinitely be out of sync with the line numbers rendered in an editor when a file contains these characters. I assume it would be a breaking change for most languages to add U+2028/29 as line terminators for line comments and potentially string content at a minimum, meaning that even if the editors render them as line separators, it would still be a single-line token bridging multiple lines in those languages.

I understand the desire for all this, but I hope it's understandable why this situation is a little frustrating. There is no clean way to update any given tool to treat U+2028/29 as newlines without either special-casing JS or accepting that the tool's line numbers will not correspond with line numbers from any tool that does not treat them as newlines, which is realistically the vast majority of tooling and parsers. It is difficult to make the argument to migrate any given tool when you're migrating away from the current de facto behavior, even if you're migrating to the Unicode-defined behavior, especially when it's not clear that the community as a whole is even aware that they should be treating these as newlines in the first place.

# Boris Zbarsky (5 years ago)

On 10/29/18 2:04 PM, Logan Smyth wrote:

This means that the expectation, from the standpoint of the Unicode spec, is that all existing parsers and tooling for all languages would also be updated to have line numbering that includes U+2028/29

There is also the somewhat widespread opinion that Unicode goofed by adding those characters and that the best thing to do with them is to simply ignore them. So I expect a number of languages and tools to do just that.

# Isiah Meadows (5 years ago)

I could give a pass for languages, since the grammar for some is ASCII-only (like Lua), and it's easier to just not include those code points as line terminators (they're rarely used anyways). But editors and browsers generally deal with arbitrary text display and modification, so it makes little sense for them to ignore this semantic.


Isiah Meadows contact at isiahmeadows.com, www.isiahmeadows.com

# Carsten Bormann (5 years ago)

On Oct 26, 2018, at 10:48, Claude Pache <claude.pache at gmail.com> wrote:

I have just tried to open a file containing U+2028 and U+2029 in four different text editors / integrated environments on my Mac. All of them recognise both characters as newlines (and increment the line number for those that display it).

Hi Claude,

can you identify those apps? I’d like to meet them in person.

Grüße, Carsten

# J Decker (5 years ago)

On Mon, Oct 29, 2018 at 1:50 PM Carsten Bormann <cabo at tzi.org> wrote:

On Oct 26, 2018, at 10:48, Claude Pache <claude.pache at gmail.com> wrote:

I have just tried to open a file containing U+2028 and U+2029 in four different text editors / integrated environments on my Mac. All of them recognise both characters as newlines (and increment the line number for those that display it).

Hi Claude,

can you identify those apps? I’d like to meet them in person.

esdiscuss.org/topic/expectations-around-line-ending-behavior-for-u-2028-and-u-2029#content-10 The screen shot didn't save in that archive though...

It looks to me like 2028 isn't a real line break; just a visual line break... since it kept the same line number. Not sure what to do about a 'column' count in this case though... character index on line != column in this case...

2029 does look like '\r\n' (if \r is return and \n is linefeed as in classic TTY)
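
To illustrate the line/column mismatch, here is a rough sketch that converts a character offset into a line and column under both definitions of a line break. `positionAt` is a hypothetical helper, and the 1-based line / 0-based column convention is an assumption borrowed from common tooling rather than anything specified in this thread.

```javascript
const JS_BREAK = /\r\n|[\n\r\u2028\u2029]/g;
const PLAIN_BREAK = /\r?\n/g;

// Hypothetical helper: 1-based line and 0-based column of `offset`,
// treating every match of `lineBreak` that ends at or before it
// as a line break.
function positionAt(source, offset, lineBreak) {
  let line = 1;
  let lineStart = 0;
  for (const match of source.matchAll(lineBreak)) {
    const breakEnd = match.index + match[0].length;
    if (breakEnd > offset) break;
    line += 1;
    lineStart = breakEnd;
  }
  return { line, column: offset - lineStart };
}

const source = "foo();\u2028bar();";
const offset = source.indexOf("bar");

console.log(positionAt(source, offset, JS_BREAK));    // { line: 2, column: 0 }
console.log(positionAt(source, offset, PLAIN_BREAK)); // { line: 1, column: 7 }
```

The same offset therefore yields a different line and a different column depending on which definition the tool uses, so neither number can be exchanged between tools without also agreeing on the definition.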

# Carsten Bormann (5 years ago)

On Oct 29, 2018, at 21:55, J Decker <d3ck0r at gmail.com> wrote:

esdiscuss.org/topic/expectations-around-line-ending-behavior-for-u-2028-and-u-2029#content-10

Your message was non-surprising to me: Most editors indeed do not heed the Unicode lore on 2028 and 2029, as nobody uses these characters in plaintext.

I was interested in learning about ones that do implement 2028 and 2029 as intended by Unicode; Claude had four of them, and I would like to meet them.

Grüße, Carsten