fix: BL-14889 use indiv langtags for rep languages in tag creation#134
fix: BL-14889 use indiv langtags for rep languages in tag creation#134andrew-polk merged 1 commit intomainfrom
Conversation
andrew-polk
left a comment
There was a problem hiding this comment.
@andrew-polk made 3 comments.
Reviewable status: 0 of 11 files reviewed, 3 unresolved discussions (waiting on @nabalone).
components/language-chooser/common/find-language/languageTagUtils.ts line 49 at r1 (raw file):
} export function ensureLangSubtagIsCanonicalForReps(
This function wants a comment.
components/language-chooser/common/find-language/languageTagUtils.ts line 55 at r1 (raw file):
const canonicalLanguageSubtag = language.isRepresentativeForMacrolanguage ? language.alternativeTags?.[0]?.split("-")[0] || language.languageSubtag : language.languageSubtag;
pasting devin comment below.
We should at least have a comment?
components/language-chooser/common/find-language/languageTagUtils.ts:R53-55
ensureLangSubtagIsCanonicalForReps relies on alternativeTags[0] ordering convention
The canonical macrolanguage code is derived from language.alternativeTags?.[0]?.split("-")[0] at languageTagUtils.ts:54. This assumes the first entry in alternativeTags always begins with the macrolanguage code (e.g., ar-Arab-EG for Standard Arabic, uz-Latn-UZ for Northern Uzbek). If the data ordering ever changes such that the first alternative tag starts with the individual code (e.g., arb-Brai), the canonical extraction would silently return the wrong code. The current data appears consistent, and the test at languageTagUtils.spec.ts:516-531 covers the empty alternativeTags fallback case, but there is no validation that alternativeTags[0] actually starts with a macrolanguage code.
components/language-chooser/common/find-language/searchForLanguage.ts line 234 at r1 (raw file):
const canonicalLanguageSubtag = language.isRepresentativeForMacrolanguage ? language.alternativeTags?.[0]?.split("-")[0] || language.languageSubtag : language.languageSubtag;
devin says
components/language-chooser/common/find-language/searchForLanguage.ts:R232-248
Redundant canonical conversion in parseLangtagFromLangChooser
On line 238, the code manually replaces the language subtag via languageTag.replace(languageSubtag || "", canonicalLanguageSubtag), and then also passes the language object to getMaximalLangtag, which internally calls ensureLangSubtagIsCanonicalForReps again (languageTagUtils.ts:111). The second conversion is always a no-op since the subtag is already canonical after the manual replace, so this is harmless but redundant. Additionally, canonicalLanguageSubtag is computed inline here duplicating the logic from ensureLangSubtagIsCanonicalForReps. The manual replace could be removed entirely since getMaximalLangtag now handles the conversion internally via the language parameter.
andrew-polk
left a comment
There was a problem hiding this comment.
@andrew-polk reviewed 11 files and all commit messages, and made 8 comments.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @nabalone).
components/language-chooser/common/find-language/languageTagUtils.spec.ts line 672 at r1 (raw file):
and so are tested with the modifier instead of here
I think there is a mistake in the wording here. At least, I don't understand what it means
components/language-chooser/common/find-language/searchForLanguage.ts line 227 at r1 (raw file):
The script must be implied, look for the equivalent maximal tag, which will have a script subtag explicit.
I think this should be two sentences? Or at least a semicolon?
components/language-chooser/common/find-language/searchForLanguage.ts line 231 at r1 (raw file):
to to
to do?
components/language-chooser/common/find-language/languageTagUtils.spec.ts line 102 at r1 (raw file):
NORTHERN_UZBEK_LANGUAGE ) ).toEqual("uz");
For any test cases where the output of the test is not the same as the output of the chooser overall, it seems like we want a clarifying comment?
I.e. I think this is giving uz because it is beyond the scope of createTag to make sure it is using the representative language? But at first glance, it seems like this is validating something different than our rep-first policy.
components/language-chooser/common/find-language/languageTagUtils.spec.ts line 196 at r1 (raw file):
expect( getShortestSufficientLangtag("uzn-Latn-UZ", NORTHERN_UZBEK_LANGUAGE) ).toContain("uz");
contain here seems odd since it will match uzn and uz
components/language-chooser/common/find-language/languageTagUtils.spec.ts line 202 at r1 (raw file):
expect( getShortestSufficientLangtag("uzn-Cyrl-x-foobar", NORTHERN_UZBEK_LANGUAGE) ).toContain("-x-foobar");
Should verify language subtag?
components/language-chooser/common/find-language/languageTagUtils.spec.ts line 258 at r1 (raw file):
expect(getMaximalLangtag("arb-Arab", STANDARD_ARABIC_LANGUAGE)).toContain( "-Arab-EG" );
Seems like being explicit about the full tag we expect would be better.
I.e. use toEqual instead of toContain
components/language-chooser/common/find-language/README.md line 203 at r1 (raw file):
#### Language tag shortening <!-- TODO this should be createTagFromOrthography? -->
Looks like this TODO comment can be removed.
Should the "createTag" in the last sentence also be changed to createTagFromOrthography?
8f72658 to
096a374
Compare
nabalone
left a comment
There was a problem hiding this comment.
@nabalone made 8 comments.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @andrew-polk).
components/language-chooser/common/find-language/languageTagUtils.ts line 49 at r1 (raw file):
Previously, andrew-polk wrote…
This function wants a comment.
Done.
components/language-chooser/common/find-language/languageTagUtils.ts line 55 at r1 (raw file):
Previously, andrew-polk wrote…
pasting devin comment below.
We should at least have a comment?
components/language-chooser/common/find-language/languageTagUtils.ts:R53-55
ensureLangSubtagIsCanonicalForReps relies on alternativeTags[0] ordering convention
The canonical macrolanguage code is derived from
language.alternativeTags?.[0]?.split("-")[0]atlanguageTagUtils.ts:54. This assumes the first entry inalternativeTagsalways begins with the macrolanguage code (e.g.,ar-Arab-EGfor Standard Arabic,uz-Latn-UZfor Northern Uzbek). If the data ordering ever changes such that the first alternative tag starts with the individual code (e.g.,arb-Brai), the canonical extraction would silently return the wrong code. The current data appears consistent, and the test atlanguageTagUtils.spec.ts:516-531covers the emptyalternativeTagsfallback case, but there is no validation thatalternativeTags[0]actually starts with a macrolanguage code.
Done.
components/language-chooser/common/find-language/languageTagUtils.spec.ts line 102 at r1 (raw file):
Previously, andrew-polk wrote…
For any test cases where the output of the test is not the same as the output of the chooser overall, it seems like we want a clarifying comment?
I.e. I think this is giving uz because it is beyond the scope ofcreateTagto make sure it is using the representative language? But at first glance, it seems like this is validating something different than our rep-first policy.
Done.
components/language-chooser/common/find-language/languageTagUtils.spec.ts line 196 at r1 (raw file):
Previously, andrew-polk wrote…
contain here seems odd since it will match uzn and uz
The point of all these functions that I was really trying to test here is that they remove or add the other subtags as appropriate, but I can just use equals if it creates less confusion.
components/language-chooser/common/find-language/languageTagUtils.spec.ts line 202 at r1 (raw file):
Previously, andrew-polk wrote…
Should verify language subtag?
Done.
components/language-chooser/common/find-language/languageTagUtils.spec.ts line 258 at r1 (raw file):
Previously, andrew-polk wrote…
Seems like being explicit about the full tag we expect would be better.
I.e. use toEqual instead of toContain
Done.
components/language-chooser/common/find-language/languageTagUtils.spec.ts line 672 at r1 (raw file):
Previously, andrew-polk wrote…
and so are tested with the modifier instead of here
I think there is a mistake in the wording here. At least, I don't understand what it means
Hmm, does this make sense?
components/language-chooser/common/find-language/README.md line 203 at r1 (raw file):
Previously, andrew-polk wrote…
Looks like this TODO comment can be removed.
Should the "createTag" in the last sentence also be changed to createTagFromOrthography?
Done. oops
nabalone
left a comment
There was a problem hiding this comment.
@nabalone made 1 comment.
Reviewable status: 6 of 14 files reviewed, 12 unresolved discussions (waiting on @andrew-polk).
components/language-chooser/common/find-language/scripts/langtagProcessing.ts line 114 at r2 (raw file):
isRepresentativeForMacrolanguage: entry.isRepresentativeForMacrolanguage, // entry.tag is the canonical tag (https://github.com/silnrsi/langtags/blob/release/doc/langtags.md) alternativeTags: new Set([entry.tag, entry.full, ...(entry.tags || [])]),
Note this change tends to bump representative languages up in the results order, e.g. if you type "uz", Northern Uzbek now shows up on top of the Uzbek Macrolanguage whereas before the latter would come up first. But I'm thinking this change is not bad since we want to discourage macrolanguages anyway
andrew-polk
left a comment
There was a problem hiding this comment.
@andrew-polk reviewed 8 files and all commit messages, made 2 comments, and resolved 7 discussions.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @nabalone).
components/language-chooser/common/find-language/README.md line 203 at r1 (raw file):
Previously, nabalone (Noel) wrote…
Done. oops
The text in this paragraph still references "createTag". I think that's a mistake. But maybe I'm not undestanding.
components/language-chooser/react/language-chooser-react-mui/e2e/languageSearch.e2e.ts line 105 at r2 (raw file):
await expect(sanCard).not.toContainText("vsn"); });
Sorry; I should have asked for this in the last round...
Seems worth having one e2e here for selecting a representative language and getting the correct subtag out?
I.e. something which would have failed before these changes?
andrew-polk
left a comment
There was a problem hiding this comment.
@andrew-polk made 2 comments.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @nabalone).
components/language-chooser/common/find-language/testUtils.ts line 73 at r2 (raw file):
"arb-Brai", "ar", ],
devin says the following.
Feel free to say this doesn't matter
components/language-chooser/common/find-language/testUtils.ts:R65-73
STANDARD_ARABIC_LANGUAGE test data has non-canonical first alternativeTag
The STANDARD_ARABIC_LANGUAGE test constant has alternativeTags[0] = "ar-Arab-EG" (a full tag), whereas the actual generated data in languageData.json has "ar" as the first element. The code comment at languageTagUtils.ts:57 states language.alternativeTags[0] is the canonical tag. The test languageSearch.spec.ts:528 verifies real data has the canonical tag first. The test data inconsistency doesn't cause failures because ensureLangSubtagIsCanonicalForReps does .split("-")[0] to extract just the language subtag ("ar" in both cases), but the test data doesn't match real data structure and could mislead future developers.
components/language-chooser/common/find-language/scripts/langtagProcessing.ts line 75 at r2 (raw file):
entry.full, ...(entry.tags ?? []), ]);
devin says
Missing `entry.tag` in combine path of `addOrCombineLangtagsEntry` causes canonical short tags to be absent from `alternativeTags`
The PR added entry.tag to the new-entry path (line 114) to ensure the canonical tag appears first in alternativeTags, but the combine path (lines 71-75) was not updated to also include entry.tag. This means canonical short tags from subsequent langtags.json entries for the same language are missing from alternativeTags.
When a language has multiple entries in langtags.json (e.g., Northern Uzbek has entries for uz, uz-AF, uz-Brai, uz-Cyrl, uz-Sogd), only the first entry goes through the new-entry path at line 114 where entry.tag is now added. Subsequent entries go through the combine path at lines 71-75, which adds entry.full and entry.tags but not entry.tag.
For example, for Northern Uzbek the canonical short tags uz-AF, uz-Brai, uz-CN, uz-Cyrl, and uz-Sogd are confirmed missing from alternativeTags in the generated languageData.json. Since alternativeTags is used as a Fuse search key (searchForLanguage.ts:26), searching for these canonical short forms won't match the language.
The first entry's canonical tag (uz) IS correctly included and first, so ensureLangSubtagIsCanonicalForReps works correctly. The impact is limited to search completeness for secondary canonical tag forms.
Suggested fix
langs[entry.indivIsoCode].alternativeTags = new Set([
...langs[entry.indivIsoCode].alternativeTags,
entry.tag,
entry.full,
...(entry.tags ?? []),
]);
096a374 to
91c4be9
Compare
nabalone
left a comment
There was a problem hiding this comment.
@nabalone made 6 comments.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @andrew-polk).
components/language-chooser/common/find-language/README.md line 203 at r1 (raw file):
Previously, andrew-polk wrote…
The text in this paragraph still references "createTag". I think that's a mistake. But maybe I'm not undestanding.
Done.
components/language-chooser/common/find-language/searchForLanguage.ts line 227 at r1 (raw file):
Previously, andrew-polk wrote…
The script must be implied, look for the equivalent maximal tag, which will have a script subtag explicit.
I think this should be two sentences? Or at least a semicolon?
Done.
components/language-chooser/common/find-language/searchForLanguage.ts line 231 at r1 (raw file):
Previously, andrew-polk wrote…
to to
to do?
Done.
components/language-chooser/common/find-language/searchForLanguage.ts line 234 at r1 (raw file):
Previously, andrew-polk wrote…
devin says
components/language-chooser/common/find-language/searchForLanguage.ts:R232-248
Redundant canonical conversion in parseLangtagFromLangChooser
On line 238, the code manually replaces the language subtag via
languageTag.replace(languageSubtag || "", canonicalLanguageSubtag), and then also passes thelanguageobject togetMaximalLangtag, which internally callsensureLangSubtagIsCanonicalForRepsagain (languageTagUtils.ts:111). The second conversion is always a no-op since the subtag is already canonical after the manual replace, so this is harmless but redundant. Additionally,canonicalLanguageSubtagis computed inline here duplicating the logic fromensureLangSubtagIsCanonicalForReps. The manual replace could be removed entirely sincegetMaximalLangtagnow handles the conversion internally via thelanguageparameter.
Done.
components/language-chooser/common/find-language/testUtils.ts line 73 at r2 (raw file):
Previously, andrew-polk wrote…
devin says the following.
Feel free to say this doesn't matter
components/language-chooser/common/find-language/testUtils.ts:R65-73
STANDARD_ARABIC_LANGUAGE test data has non-canonical first alternativeTag
The
STANDARD_ARABIC_LANGUAGEtest constant hasalternativeTags[0]="ar-Arab-EG"(a full tag), whereas the actual generated data inlanguageData.jsonhas"ar"as the first element. The code comment atlanguageTagUtils.ts:57stateslanguage.alternativeTags[0] is the canonical tag. The testlanguageSearch.spec.ts:528verifies real data has the canonical tag first. The test data inconsistency doesn't cause failures becauseensureLangSubtagIsCanonicalForRepsdoes.split("-")[0]to extract just the language subtag ("ar"in both cases), but the test data doesn't match real data structure and could mislead future developers.
Done. Wow, I didn't think of this, glad Devin caught it! It doesn't make a practical difference now but is good for consistency and potential future use.
components/language-chooser/react/language-chooser-react-mui/e2e/languageSearch.e2e.ts line 105 at r2 (raw file):
Previously, andrew-polk wrote…
Sorry; I should have asked for this in the last round...
Seems worth having one e2e here for selecting a representative language and getting the correct subtag out?I.e. something which would have failed before these changes?
Done. There are a lot because copilot wrote a lot
91c4be9 to
537a292
Compare
537a292 to
38baaa0
Compare
nabalone
left a comment
There was a problem hiding this comment.
@nabalone made 1 comment.
Reviewable status: 11 of 15 files reviewed, 8 unresolved discussions (waiting on @andrew-polk).
components/language-chooser/common/find-language/scripts/langtagProcessing.ts line 75 at r2 (raw file):
Previously, andrew-polk wrote…
devin says
Missing `entry.tag` in combine path of `addOrCombineLangtagsEntry` causes canonical short tags to be absent from `alternativeTags`
The PR added
entry.tagto the new-entry path (line 114) to ensure the canonical tag appears first inalternativeTags, but the combine path (lines 71-75) was not updated to also includeentry.tag. This means canonical short tags from subsequent langtags.json entries for the same language are missing fromalternativeTags.When a language has multiple entries in langtags.json (e.g., Northern Uzbek has entries for
uz,uz-AF,uz-Brai,uz-Cyrl,uz-Sogd), only the first entry goes through the new-entry path at line 114 whereentry.tagis now added. Subsequent entries go through the combine path at lines 71-75, which addsentry.fullandentry.tagsbut notentry.tag.For example, for Northern Uzbek the canonical short tags
uz-AF,uz-Brai,uz-CN,uz-Cyrl, anduz-Sogdare confirmed missing fromalternativeTagsin the generated languageData.json. SincealternativeTagsis used as a Fuse search key (searchForLanguage.ts:26), searching for these canonical short forms won't match the language.The first entry's canonical tag (
uz) IS correctly included and first, soensureLangSubtagIsCanonicalForRepsworks correctly. The impact is limited to search completeness for secondary canonical tag forms.Suggested fix
langs[entry.indivIsoCode].alternativeTags = new Set([ ...langs[entry.indivIsoCode].alternativeTags, entry.tag, entry.full, ...(entry.tags ?? []), ]);
Done.
andrew-polk
left a comment
There was a problem hiding this comment.
@andrew-polk reviewed 7 files and all commit messages, and resolved 7 discussions.
Reviewable status: all files reviewed, 1 unresolved discussion.
andrew-polk
left a comment
There was a problem hiding this comment.
@andrew-polk resolved 1 discussion and dismissed @nabalone from a discussion.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @nabalone).
This change is