LocaleInfo implementation details and best match algorithm

# Nebojša Ćirić (14 years ago)

I’ve started implementation of the LocaleInfo class in Chrome and I would like to clarify what the actual parameters are and how we construct the object given those parameters.

Differences from the current proposal (for the sake of simplification, but without any loss in clarity or functionality):

- Combine languageNames and localeName parameters from the original proposal (0.5) into a single localeID parameter.
- Leave currencyCode and currencySymbol as parameters for the NumberFormat constructor.
- LocaleInfo.options.regionID contains either the user-specified regionID or some inferred value (vs. only the user-specified value) once the LocaleInfo object is created.

The LocaleInfo constructor takes an options parameter with two fields as input.

- Optional: localeID - a string or an array of Unicode locale identifiers: [unicode locale id, ulocid,...]. We assume the array is sorted by importance, from the most to the least important locale. The localeID format is specified in the UTS#35 document <http://unicode.org/reports/tr35/#Language_Locale_Field_Definitions>.
- Optional: regionID - a string representing a region ID, used for currency inference and possibly other uses. The region ID values follow the UTS#35 region subtag specification <http://unicode.org/reports/tr35/#Language_Locale_Field_Definitions> (the ISO 3166-1 specification with the addition of invalid, undefined and reserved region codes).

If the options parameter is missing, we set the resulting localeID field to the default** value.
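
To make the input shape concrete, here is a minimal sketch of how the two optional fields could be normalized before any matching happens. The helper name `normalizeSettings` and the returned shape are hypothetical and not part of the proposal; the sketch only assumes what is stated above (localeID is a string or an array ordered by importance, regionID is a string, and both may be absent).

```javascript
// Hypothetical helper, not part of the proposal: turns the constructor
// input into a uniform shape (a priority-ordered array plus a region).
function normalizeSettings(options) {
  // A missing options parameter means "no preference"; the caller is then
  // expected to fall back to the implementation's default locale.
  if (options === undefined || options === null) {
    return { localeIDs: [], regionID: undefined };
  }

  const raw = options.localeID;
  let localeIDs;
  if (raw === undefined) {
    localeIDs = [];
  } else if (Array.isArray(raw)) {
    localeIDs = raw.slice(); // copy, keep the caller's priority order
  } else {
    localeIDs = [String(raw)]; // a single ID is a one-element priority list
  }

  const regionID =
    options.regionID === undefined ? undefined : String(options.regionID);

  return { localeIDs, regionID };
}

console.log(normalizeSettings({ localeID: "fr" }));
console.log(normalizeSettings({ localeID: ["es", "es-419"], regionID: "BA" }));
console.log(normalizeSettings());
```

Treating a single string as a one-element list keeps the later matching step the same for both input forms.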

I propose the algorithm at <http://pastebin.mozilla.org/1198734> for resolving the best match locale ID and region ID from the inputs. We should discuss whether the actual distance computation for the best match should be left to implementers or whether we should standardize it (the data it relies on may be different).

As the product of the algorithm, the LocaleInfo object will have:

- options.localeID set to a string that is the best match among the locales available for the given implementation, in canonical form. At a minimum, localeID will have a base language subtag. If the localeID input was empty or we couldn’t find a match, localeID will have a default value.
- options.regionID set to a canonicalized (uppercased) region ID. If one was not specified as an input, we store an inferred value: either the region subtag from the original locale or, if the region subtag is missing, the most likely region for the original locale. Invalid/undefined/reserved codes are preserved.

Examples - actual values may vary among implementations because of data variation. Implementers can also decide to pick a different most likely locale (say en-GB instead of en-US) based on their target region...

| localeID | regionID | options.localeID | options.regionID |
| --- | --- | --- | --- |
| - | - | default | default’s region |
| fr | - | fr | FR |
| fr-CA | - | fr-CA | CA |
| fr | BE | fr-BE | BE |
| fr | RU | fr (fr-RU is not available) | RU |
| ['es', 'es-419'] | - | es | ES |
| ['pt', 'pt-BR'] | PT | pt-PT | PT |
| sr | ZZ | sr (didn’t match sr-ZZ, best match sr) | ZZ |
| de-Latn-DE-u-co-phonebk | AT | de-DE-u-co-phonebk (best match de-DE, and added extension) | AT |
| sr-MN | BA | sr-RS (didn’t match sr-MN, best match was sr-RS) | BA |

** - Implementations are free to pick the best default value for their platform. One possible default could be the root locale.
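
The regionID rules above (uppercase a supplied value, otherwise take the region subtag of the locale, otherwise use the most likely region) can be sketched as follows. This is not the pastebin algorithm, only an illustration under stated assumptions: the names `regionSubtagOf` and `resolveRegionID` are hypothetical, and the small `LIKELY_REGION` table is made-up sample data standing in for whatever likely-subtags data an implementation ships.

```javascript
// Assumed sample data; a real implementation would consult its own
// likely-subtags data, and the values may differ between implementations.
const LIKELY_REGION = { fr: "FR", es: "ES", pt: "PT", de: "DE", sr: "RS" };

// Returns the region subtag of a locale ID (uppercased), or undefined.
function regionSubtagOf(localeID) {
  const parts = localeID.split(/[-_]/);
  for (let i = 1; i < parts.length; i++) {
    const part = parts[i];
    // A single-character subtag starts an extension (e.g. "-u-co-phonebk");
    // nothing after it can be a region.
    if (part.length === 1) break;
    // Region subtags are two letters or three digits.
    if (/^([A-Za-z]{2}|[0-9]{3})$/.test(part)) return part.toUpperCase();
  }
  return undefined;
}

function resolveRegionID(localeID, regionID) {
  // 1. A user-specified region wins and is only canonicalized (uppercased).
  //    Invalid/undefined/reserved codes such as "ZZ" pass through unchanged.
  if (regionID !== undefined && regionID !== "") return regionID.toUpperCase();
  // 2. Otherwise use the region subtag of the original locale, if any.
  const fromLocale = regionSubtagOf(localeID);
  if (fromLocale !== undefined) return fromLocale;
  // 3. Otherwise fall back to the most likely region for the language.
  const language = localeID.split(/[-_]/)[0].toLowerCase();
  return LIKELY_REGION[language]; // undefined when there is no data
}

console.log(resolveRegionID("fr"));          // "FR"
console.log(resolveRegionID("fr-CA"));       // "CA"
console.log(resolveRegionID("fr", "be"));    // "BE"
console.log(resolveRegionID("sr", "ZZ"));    // "ZZ"
```

With this sample data the sketch agrees with the options.regionID column for the rows that pass a concrete localeID; the first row (no input at all) depends on the implementation's default and is not modeled here.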


# Nebojša Ćirić (14 years ago)

Richard, see my comments inline.

> I finally had a chance to look at your proposal, and I'm having some trouble following it. It sort of winds up sounding like there are four different values-- localeID, regionID, options.localeID, and options.regionID, and they all mean different things. It seems like this might make sense if the "options" stuff were output parameters, but the text makes it all sound like they're input parameters. This seems extremely confusing.

I agree. The input parameter to the LocaleInfo constructor is called options. LocaleInfo also contains an options key that holds the resolved input values (like the actual localeID and the canonicalized regionID). I will rename the input parameter to settings to clarify things in the strawman.

> The other thing I find interesting is that the caller can supply an ARRAY of language tags-- a language preference list, essentially. I work in an environment where there's always one locale, so I might be biased, but I'm wondering whether we really need all this functionality now, or if we can have everything take one language tag now and work up to taking multiple tags in a future version.

I think once you manage to match one localeID from the input against the available locales on the system, extending the problem to matching a priority list is fairly simple. I feel a single localeID input is going to be the prevalent way of specifying which locale you want, but I think the vision here was to enable something like the Accept-Language header in the HTTP protocol, where the user can have a say in how fallback should work. I'll let Shawn comment more here.
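
As a rough illustration of why the list case follows from the single-locale case, the sketch below wraps a single-locale matcher in a loop over the priority list. Everything in it is an assumption made for illustration: the `AVAILABLE` set is invented, and `matchOne` uses plain subtag truncation as a stand-in for the distance-based best match discussed in this thread, so it will not reproduce every row of the examples table.

```javascript
// Hypothetical set of locales an implementation supports.
const AVAILABLE = new Set(["fr", "fr-CA", "fr-BE", "es", "pt-PT", "sr", "de-DE"]);

// Single-locale matcher: drop trailing subtags until something is available.
// This is simple truncation, NOT the distance computation from the proposal.
// It assumes the input is already in canonical case.
function matchOne(localeID, available) {
  const parts = localeID.split("-");
  while (parts.length > 0) {
    const candidate = parts.join("-");
    if (available.has(candidate)) return candidate;
    parts.pop();
  }
  return undefined;
}

// Priority-list matcher: the first entry that matches anything wins,
// otherwise the implementation's default locale is used.
function matchList(localeIDs, available, defaultLocale) {
  for (const id of localeIDs) {
    const match = matchOne(id, available);
    if (match !== undefined) return match;
  }
  return defaultLocale;
}

console.log(matchList(["es", "es-419"], AVAILABLE, "root")); // "es"
console.log(matchList(["fr-RU"], AVAILABLE, "root"));        // "fr"
console.log(matchList(["xx", "sr-ZZ"], AVAILABLE, "root"));  // "sr"
console.log(matchList(["xx"], AVAILABLE, "root"));           // "root"
```

Swapping `matchOne` for a smarter matcher leaves `matchList` unchanged, which is the sense in which the priority list adds little on top of single-locale matching.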

> I've also never been a fan of specifying the country separately from the country in the language. I understand "de-AT" just means "The Austrian dialect of German," not "Austrian German, and the country is Austria," but I'm wondering again if the need the extra flexibility of specifying a region code for the language and a SEPARATE country code just for things like currency formatting, especially with a bunch of extra machinery for guessing the country code if the user doesn't supply one. This seems like way more machinery than the vast majority of users will ever need, and I'm wondering if we can blow it off or defer it to some future iteration.

There was a long discussion about this item and we decided to go with the stricter definition (specifying both localeID and regionID). Users are of course free to use only localeID, but some results may be off if localeID doesn't match the user's expected region (e.g. currency for now, maybe more in the future). Without regionID there is no way to specify that the preferred region for an en-CA user may be the UK...

I agree that the current usefulness is pretty low (it only defines the currency), but I can't predict all future uses.

> For that matter, if the only thing that's driven by the country code is the currency unit, why not just specify the currency unit? What other stuff do we expect to be driven off the country code in the future?

I was hoping to move currencyCode/Symbol to the NumberFormat constructor.

Region code could influence measurement units, default paper size or something else in the future.

Region info could also help with things outside of the API, say the default search engine domain (is it google.com or google.rs...) or bookstore (Amazon for western people, something else for others).

> For whatever it's worth...

Please, keep it coming :).
