Language Negotiation API

# Zbigniew Braniecki (11 years ago)

Currently, ECMA 402 specifies a pretty nice language negotiation algorithm... and keeps it private.

While working on l10n frameworks, we need to be able to negotiate between at least two parties (the application and the user's preferences) in the very same way the I18n API does. If the language negotiation bits were exposed, we could just use them and keep language choices between l10n and i18n in sync.

So, what do we need?

While working on L20n we identified two functions as crucial:

  1. CanonicalizeLanguageTag [1]

Because language tags come from developers and users, the ability to canonicalize them is crucial to us. ECMA 402 specifies this function, and all we need is to expose it in the API.

1.1) CanonicalizeLocaleList [2]

That would also be nice to have :)

  2. LookupAvailableLocales

This function has an almost identical heuristic to LookupSupportedLocales [3], with the single difference being in step d).

Replace:

"If availableLocale is not undefined, then append locale to the end of subset."

with:

"If availableLocale is not undefined, then append availableLocale to the end of subset."

The reason behind this is that localization frameworks need to choose the available locales that most closely match the user preferences. If we used LookupSupportedLocales, we would receive the locales that the user requested, not the ones that are available on the system. As a result, we'd have to call BestAvailableLocale [4] on each of them to get the tag name that we can pull resources for.

With that one change, we are actually going to receive the right set of language tags that we can then use to provide best language with fallbacks.
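A minimal sketch of the proposed lookup, assuming both lists already hold canonicalized tags. The function names are hypothetical, the extension-stripping regular expression is a simplification of what the spec describes, and the duplicate check is an added assumption so that two requested tags resolving to the same available locale yield one entry.

```javascript
// Hypothetical sketch; these functions are not part of ECMA 402's public API.

// BestAvailableLocale: truncate the tag from the right until it matches
// one of the available locales, or return undefined if nothing matches.
function bestAvailableLocale(availableLocales, locale) {
  let candidate = locale;
  while (true) {
    if (availableLocales.includes(candidate)) return candidate;
    let pos = candidate.lastIndexOf("-");
    if (pos === -1) return undefined;
    // Also drop a singleton subtag that would otherwise be left dangling.
    if (pos >= 2 && candidate[pos - 2] === "-") pos -= 2;
    candidate = candidate.substring(0, pos);
  }
}

// Like LookupSupportedLocales, but step d) appends availableLocale
// instead of the requested locale.
function lookupAvailableLocales(availableLocales, requestedLocales) {
  const subset = [];
  for (const locale of requestedLocales) {
    // Simplified removal of Unicode extension sequences ("-u-...").
    const noExtensionsLocale = locale.replace(/-u(?:-[0-9a-z]{2,8})+/gi, "");
    const availableLocale = bestAvailableLocale(availableLocales, noExtensionsLocale);
    // Assumption: skip duplicates so the fallback chain has unique entries.
    if (availableLocale !== undefined && !subset.includes(availableLocale)) {
      subset.push(availableLocale);
    }
  }
  return subset;
}
```

With available locales `["en", "de", "fr"]` and requested locales `["de-AT-u-co-phonebk", "en-GB", "pl"]`, this sketch returns `["de", "en"]`, whereas LookupSupportedLocales would return the first two requested tags unchanged.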

An example implementation of this is the L20n localization framework [5], which copies Mozilla's ECMA 402 code to expose the required functions and uses a custom function called prioritizeLocales to build the final locale fallback chain.

Comments? Feedback? Next steps? :)

# Andy Earnshaw (11 years ago)

On Thu, Jul 11, 2013 at 11:52 PM, Zbigniew Braniecki <zbraniecki at mozilla.com> wrote:

  1. CanonicalizeLanguageTag [1]

Because language tags come from developers and users, ability to canonicalize them is crucial to us. ECMA 402 specifies this function and all we need is to expose it in the API

I was thinking the same thing recently, at least for CanonicalizeLanguageTag. I was working with a platform that gave me a language tag in non-canonical form, meaning I had to either canonicalize it or rename my language files to match the same non-canonical form. Exposing it as Intl.canonicalizeLanguageTag(tag) seems like a good idea.

1.1) CanonicalizeLocaleList [2]

That would also be nice to have :)

I don't think you could expose CanonicalizeLocaleList directly without altering it to return an array; you'd have to do something similar to step 5 of LookupSupportedLocales. I'm not sure we could change that function in the spec without other abstract operations potentially being affected by a tainted Array.prototype, so I guess you'd need to specify a new function. In which case, I'm wondering if maybe you'd be better off with Intl.canonicalizeTags(tags), which would cover both CanonicalizeLanguageTag() and CanonicalizeLocaleList().

  2. LookupAvailableLocales

This function has almost identical heuristic to LookupSupportedLocales [3] with a single difference being in step d).

Replace:

"If availableLocale is not undefined, then append locale to the end of subset."

with:

"If availableLocale is not undefined, then append availableLocale to the end of subset."

The reason behind this is that localization frameworks need to choose the available locales that closest match the user preferences. If we used LookupSupportedLocales, we will receive the locales that user requested, not ones that are available on the system. In result on each of those, we'd have to call BestAvailableLocale [4] to receive the tag name that we can pull resources for.

You can at least work around this for a single locale with Intl.NumberFormat(tag).resolvedOptions().locale. If you're already using the native localisation APIs, this might not be too much of a hindrance. What you're suggesting would need to be a function property of the constructors, e.g. Intl.NumberFormat.availableLocalesOf(). I'm not so sure this approach makes sense, though; wouldn't you still have a problem if your own API provided variant data where the system does not?
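The single-locale workaround can be wrapped in a small helper. This is only a sketch: the helper name is made up, and the tag it returns depends on which locales the host's Intl implementation ships.

```javascript
// Hypothetical helper: ask a native constructor which locale it
// actually resolved the requested tag to.
function resolveNativeLocale(tag) {
  return new Intl.NumberFormat(tag).resolvedOptions().locale;
}

// The result depends on the locale data available in the engine.
console.log(resolveNativeLocale("en-GB"));
```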

# André Bargull (11 years ago)

I was thinking the same thing recently, at least for CanonicalizeLanguageTag. I was working with a platform that gave me a language tag in non-canonical form, meaning I had to either canonicalize it or rename my language files to match the same non-canonical form. Exposing it as Intl.canonicalizeLanguageTag(tag) seems like a good idea.

Only exposing CanonicalizeLanguageTag does not seem useful to me without having access to IsStructurallyValidLanguageTag. Most likely a combined IsStructurallyValidLanguageTag + CanonicalizeLanguageTag function is necessary/wanted for most use cases.

I don't think you could expose CanonicalizeLocaleList directly without altering it to return an array; you'd have to do something similar to step 5 of LookupSupportedLocales. I'm not sure we could change that function in the spec without other abstract operations potentially being affected by a tainted Array.prototype, so I guess you'd need to specify a new function. In which case, I'm wondering if maybe you'd be better off with Intl.canonicalizeTags(tags), which would cover both CanonicalizeLanguageTag() and CanonicalizeLocaleList().

I don't see why you'd need to change CanonicalizeLocaleList at all. Just let it return the internal list as-is, and then define Intl.canonicalizeLocaleList like so:

Intl.canonicalizeLocaleList(locales):

  1. Let canonicalizedLocaleList be the result of CanonicalizeLocaleList(locales).
  2. ReturnIfAbrupt(canonicalizedLocaleList).
  3. Return CreateArrayFromList(canonicalizedLocaleList).

(ReturnIfAbrupt and CreateArrayFromList are defined in ES6 as internal abstract operations.)

It also needs to be considered whether the duplicate removal in CanonicalizeLocaleList creates any issues for users of a potential Intl.canonicalizeLocaleList or Intl.canonicalizeTags function.
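The wrapper steps and the duplicate-removal concern can be illustrated with a short sketch. It assumes an engine that provides Intl.getCanonicalLocales, which was standardized after this thread; the wrapper name mirrors the hypothetical Intl.canonicalizeLocaleList discussed here and is not a real API.

```javascript
// Assumption: Intl.getCanonicalLocales is available in the engine.
// canonicalizeLocaleList is a hypothetical name used for illustration.
function canonicalizeLocaleList(locales) {
  // Validates each tag (RangeError on a malformed one), canonicalizes it,
  // removes duplicates, and returns the result as a fresh Array.
  return Intl.getCanonicalLocales(locales);
}

const requested = ["EN-us", "en-US", "de-de"];
const canonicalized = canonicalizeLocaleList(requested);
console.log(canonicalized); // → ["en-US", "de-DE"]
```

Because the two spellings of en-US collapse into one entry, `canonicalized[i]` is not necessarily the canonical form of `requested[i]`.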

# Andy Earnshaw (11 years ago)

On Sat, Jul 13, 2013 at 1:05 PM, André Bargull <andre.bargull at udo.edu> wrote:

... Only exposing CanonicalizeLanguageTag does not seem useful to me without having access to IsStructurallyValidLanguageTag. Most likely a combined IsStructurallyValidLanguageTag + CanonicalizeLanguageTag function is necessary/wanted for most use cases.

Hmm. I'm not sure I'd agree it's necessary. IsStructurallyValidLanguageTag makes sense as an abstract function because you need to throw accordingly when an invalid tag is passed to the constructors or methods. However, it's still the developer's responsibility to make sure their tags are valid during the development process. Canonicalisation would still throw an error if the tag is invalid.

I don't see why you'd need to change CanonicalizeLocaleList at all. Just let it return the internal list as-is, and then define Intl.canonicalizeLocaleList like so:

Lists are internal, they aren't part of the ECMAScript language. It makes no sense to return an internal list to ECMAScript code unless you intend to go the whole hog and specify them with a constructor/prototype.

It also needs to be considered whether the duplicate removal in CanonicalizeLocaleList creates any issues for users of a potential Intl.canonicalizeLocaleList or Intl.canonicalizeTags function.

Perhaps. Are there any cases you think of where removing duplicates would be a problem?

# André Bargull (11 years ago)

On 7/13/2013 8:48 PM, Andy Earnshaw wrote:

On Sat, Jul 13, 2013 at 1:05 PM, André Bargull <andre.bargull at udo.edu> wrote:

... Only exposing CanonicalizeLanguageTag does not seem useful to me without having access to IsStructurallyValidLanguageTag. Most likely a combined IsStructurallyValidLanguageTag + CanonicalizeLanguageTag function is necessary/wanted for most use cases.

Hmm. I'm not sure I'd agree it's necessary. IsStructurallyValidLanguageTag makes sense as an abstract function because you need to throw accordingly when an invalid tag is passed to the constructors or methods. However, it's still the developer's responsibility to make sure their tags are valid during the development process. Canonicalisation would still throw an error if the tag is invalid.

CanonicalizeLanguageTag isn't even defined for non-structurally valid language tags. That's why I meant a combined IsStructurallyValidLanguageTag + CanonicalizeLanguageTag function is more useful than access to the bare CanonicalizeLanguageTag function.

I don't see why you'd need to change CanonicalizeLocaleList at all. Just let it return the internal list as-is, and then define Intl.canonicalizeLocaleList like so:

Lists are internal, they aren't part of the ECMAScript language. It makes no sense to return an internal list to ECMAScript code unless you intend to go the whole hog and specify them with a constructor/prototype.

The internal list structure is not returned to user code. Instead, a possible Intl.canonicalizeLocaleList function is a simple wrapper around CanonicalizeLocaleList that performs the necessary conversion from list to array. That's exactly the point of the algorithm steps in my previous mail.

It also needs to be considered whether the duplicate removal in CanonicalizeLocaleList creates any issues for users of a potential Intl.canonicalizeLocaleList or Intl.canonicalizeTags function.

Perhaps. Are there any cases you think of where removing duplicates would be a problem?

I thought about use cases when a user assumes the i-th element of the output array is the canonicalised value of the i-th element in the input array. I can't tell whether this is a valid use case - I've only implemented ECMA-402, so I know a bit about the spec, but never actually used it in an application...

# Norbert Lindenberg (11 years ago)

On Jul 13, 2013, at 12:37 , André Bargull <andre.bargull at udo.edu> wrote:

CanonicalizeLanguageTag isn't even defined for non-structurally valid language tags. That's why I meant a combined IsStructurallyValidLanguageTag + CanonicalizeLanguageTag function is more useful than access to the bare CanonicalizeLanguageTag function.

Correct. As currently specified, the CanonicalizeLanguageTag abstract operation assumes that its input is a String value that's a structurally valid language tag. An API cannot make such assumptions - it has to be ready to deal with any input, as well as the absence of input. It has to do something like the steps in CanonicalizeLocaleList 8.c.ii-iv before calling the current CanonicalizeLanguageTag.
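A rough sketch of what such an input-checking API could look like for a single tag. The function name is hypothetical, and the sketch leans on Intl.getCanonicalLocales (standardized after this thread) for the structural validity check and the canonicalization itself.

```javascript
// Hypothetical single-tag API: check the input type first, then validate
// and canonicalize. Assumes Intl.getCanonicalLocales is available.
function canonicalizeLanguageTag(tag) {
  const isObject =
    (typeof tag === "object" && tag !== null) || typeof tag === "function";
  // Reject anything that is neither a String nor an Object,
  // including absent input (undefined).
  if (typeof tag !== "string" && !isObject) {
    throw new TypeError("language tag must be a String or an Object");
  }
  // Throws a RangeError if the string is not a structurally valid tag.
  const [canonical] = Intl.getCanonicalLocales(String(tag));
  return canonical;
}
```

For example, `canonicalizeLanguageTag("EN-us")` returns `"en-US"`, while `canonicalizeLanguageTag(undefined)` throws a TypeError and `canonicalizeLanguageTag("not a tag")` throws a RangeError.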

Before we get too much into spec details: Do others believe that exposing API as proposed by Zbigniew would be useful?

# Andy Earnshaw (11 years ago)

On Sun, Jul 14, 2013 at 2:07 AM, Norbert Lindenberg <ecmascript at lindenbergsoftware.com> wrote:

CanonicalizeLanguageTag isn't even defined for non-structurally valid language tags. That's why I meant a combined IsStructurallyValidLanguageTag + CanonicalizeLanguageTag function is more useful than access to the bare CanonicalizeLanguageTag function.

Correct. As currently specified, the CanonicalizeLanguageTag abstract operation assumes that its input is a String value that's a structurally valid language tag. An API cannot make such assumptions - it has to be ready to deal with any input, as well as the absence of input. It has to do something like the steps in CanonicalizeLocaleList 8.c.ii-iv before calling the current CanonicalizeLanguageTag.

You're both right, it assumes a string and doesn't check validity. That didn't occur to me, it's been a few months since my implementation.

Before we get too much into spec details: Do others believe that exposing API as proposed by Zbigniew would be useful?

I certainly do, at least for Canonicalize-. I've come across one user agent that returns navigator.language in non-canonical form which presented a small problem for data I had stored with canonical file names. This was a WebKit based Smart TV platform from 2012, so it was fairly recent, there could be other platforms or frameworks that do the same.

As for LookupAvailableLocales, there might be a problem with Zbigniew's vision of it as any tags would be returned without extensions. I'm not sure if this is something that we'd need to worry about, though.

# Zbigniew Braniecki (11 years ago)

As for LookupAvailableLocales, there might be a problem with Zbigniew's vision of it as any tags would be returned without extensions. I'm not sure if this is something that we'd need to worry about, though.

No, that's good, because locales will be stored under names without them as well.

# Andy Earnshaw (11 years ago)

Would you expect to support the same locales as Intl constructors in your library? Can you safely make that assumption?

Canonicalisation makes sense because I would expect a library to canonicalise the tag and then try and load the file containing relevant data whether the native API supports it or not. Forgive me if I'm misunderstanding something, I didn't have a look at your project in great detail.

# Zbigniew Braniecki (11 years ago)

Would you expect to support the same locales as Intl constructors in your library?

Yes.

Can you safely make that assumption?

I'd have to think more about edge cases, but my initial reaction is - yes.

Canonicalisation makes sense because I would expect a library to canonicalise the tag and then try and load the file containing relevant data whether the native API supports it or not. Forgive me if I'm misunderstanding something, I didn't have a look at your project in great detail.

There's no need to look at my project. All I'm asking is to talk about exposing the API for negotiating between locales provided by the application and locales requested by the user with the result being the list of available locales that the user wants sorted by the user preference.

That enables us to load locale 0, fall back to locale 1, then to locale 2, etc.

The only crucial point here is that we need to operate on the list of available locales, not requested, because we will be selecting from the available ones.
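To make the fallback idea concrete, here is a small sketch of a lookup that walks a negotiated chain of available locales. The resource layout and the function name are invented for illustration and are not taken from L20n.

```javascript
// Hypothetical resource store: one bundle of strings per available locale.
const resources = {
  de: { hello: "Hallo" },
  en: { hello: "Hello", bye: "Goodbye" },
};

// Walk the negotiated chain (locale 0, then locale 1, ...) and return
// the first bundle entry that defines the requested key.
function getString(bundles, fallbackChain, key) {
  for (const locale of fallbackChain) {
    const bundle = bundles[locale];
    if (bundle && Object.prototype.hasOwnProperty.call(bundle, key)) {
      return { locale, value: bundle[key] };
    }
  }
  return undefined;
}

console.log(getString(resources, ["de", "en"], "hello")); // found in "de"
console.log(getString(resources, ["de", "en"], "bye"));   // falls back to "en"
```

Every entry in the chain has to be the name of a locale the application actually ships, which is why the negotiation needs to return available tags rather than requested ones.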

# Anne van Kesteren (11 years ago)

On Sun, Jul 14, 2013 at 5:20 AM, Andy Earnshaw <andyearnshaw at gmail.com> wrote:

I certainly do, at least for Canonicalize-. I've come across one user agent that returns navigator.language in non-canonical form which presented a small problem for data I had stored with canonical file names. This was a WebKit based Smart TV platform from 2012, so it was fairly recent, there could be other platforms or frameworks that do the same.

FWIW, exposing a new API because another API is broken in a particular implementation is a known anti-pattern. We should fix problems at the source.

# Andy Earnshaw (11 years ago)

On Mon, Jul 15, 2013 at 9:37 PM, Anne van Kesteren <annevk at annevk.nl> wrote:

On Sun, Jul 14, 2013 at 5:20 AM, Andy Earnshaw <andyearnshaw at gmail.com> wrote:

I certainly do, at least for Canonicalize-. I've come across one user agent that returns navigator.language in non-canonical form which presented a small problem for data I had stored with canonical file names. This was a WebKit based Smart TV platform from 2012, so it was fairly recent, there could be other platforms or frameworks that do the same.

FWIW, exposing a new API because another API is broken in a particular implementation is a known anti-pattern. We should fix problems at the source.

Normally, I would agree. However, I was just using my scenario as an example for where exposing the API would have been useful for me. I can also think of a few other reasons:

  • Language tags can be in extlang form or canonical form. Depending on the source providing the language tag, it's not guaranteed to be the canonical form (extlang form can reinstate extlang subtags that were removed during canonicalisation).
  • The Internationalization API doesn't cover all aspects of its namesake, like translation, or formatting of postal codes or telephone numbers, as a few examples. Developer libraries could augment Intl with this data, so it would make lives easier if we exposed CanonicalizeLanguageTag to be used by such libraries.
  • Canonicalisation has at least a couple of optional steps (like normalising case or ordering variant subtags) so exposing a canonicalizing method would give developers a way to achieve consistency with the Internationalisation API.

navigator.language isn't part of any stable specification, and even the current HTML 5.1 draft doesn't specify that tags should be returned in canonical form. Do you think it would be a good idea to raise an issue for this?

Andy

# Anne van Kesteren (11 years ago)

On Mon, Jul 15, 2013 at 7:51 PM, Andy Earnshaw <andyearnshaw at gmail.com> wrote:

navigator.language isn't part of any stable specification, and even the current HTML 5.1 draft doesn't specify that tags should be returned in canonical form. Do you think it would be a good idea to raise an issue for this?

Filed www.w3.org/Bugs/Public/show_bug.cgi?id=22681

-- annevankesteren.nl

# Ian Hickson (11 years ago)

On Tue, 16 Jul 2013, Andy Earnshaw wrote:

navigator.language isn't part of any stable specification

It's part of the HTML standard:

whatwg.org/html/#language-preferences

...which is very stable at this point (there's basically no way that part of the spec can change in an incompatible fashion, since it's widely implemented; the only possible changes are those that approach reality more, and those that add features).

and even the current HTML 5.1 draft doesn't specify that tags should be returned in canonical form. Do you think it would be a good idea to raise an issue for this?

Fixed. (A change that approaches reality more.)

# Zbigniew Braniecki (11 years ago)

Anne van Kesteren <mailto:annevk at annevk.nl> July 15, 2013 1:37 PM

FWIW, exposing a new API because another API is broken in a particular implementation is a known anti-pattern. We should fix problems at the source.

Good point, but I believe that there are more potential sources of language tags passed to language negotiation, including programmed composition, feeding from unknown sources (databases etc.), or even manually entered by the user.

Having a function that enables us to canonicalize it (even the simplest part of that - upper/lower case) allows us to use comparison operators (langTag1 == langTag2) or, in the localization case, to build a path to the resource on case-sensitive systems.
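As an illustration of that simplest part, the sketch below normalizes only the case of a tag, following the BCP 47 convention (lowercase everywhere, except uppercase two-letter and titlecase four-letter subtags that are neither first nor after a singleton). It assumes a structurally valid tag and does not perform full canonicalization; the function names and the path layout are hypothetical.

```javascript
// Case normalization only; assumes the tag is structurally valid.
function normalizeCase(tag) {
  let afterSingleton = false;
  return tag
    .toLowerCase()
    .split("-")
    .map((subtag, index) => {
      let result = subtag;
      if (index > 0 && !afterSingleton) {
        if (subtag.length === 2) {
          result = subtag.toUpperCase(); // region, e.g. "us" -> "US"
        } else if (subtag.length === 4) {
          result = subtag[0].toUpperCase() + subtag.slice(1); // script
        }
      }
      // Everything after a singleton (extension or private use) stays lowercase.
      if (subtag.length === 1) afterSingleton = true;
      return result;
    })
    .join("-");
}

// Hypothetical resource layout on a case-sensitive file system.
function resourcePath(tag) {
  return `locales/${normalizeCase(tag)}/strings.json`;
}

console.log(normalizeCase("ZH-hans-cn")); // → "zh-Hans-CN"
console.log(normalizeCase("en-us") === normalizeCase("EN-US")); // → true
```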