Regexp: support for case-insensitive Unicode matching by balajirrao · Pull Request #2130 · mozilla/rhino

balajirrao · 2025-10-17T16:06:00Z

Enable Unicode case-insensitive regex matching (/iu flag combination) using approximate case folding.

rbri · 2025-11-21T06:30:48Z

@balajirrao any plans for finishing this? Waiting for that to make the separate engine PR...

balajirrao · 2025-11-21T09:13:47Z

@rbri I thought I was going to finish it and then I hit a wall. I believe that in order to do this in the general case, we'd need icu4j. I'm considering creating a module outside of Rhino, say rhino-icu4j, that when included would offer complete Unicode support in regexps and possibly in other cases too. How does that sound?

andreabergia · 2025-11-28T10:13:36Z

> @rbri I thought I was going to finish it and then I hit a wall. I believe that in order to do this in the general case, we'd need icu4j. I'm considering creating a module outside of rhino, say, rhino-icu4j that when included would offer complete Unicode support in regexps and possibly in other cases too. How does that sound ?

IMHO that's the right approach. An opt-in module that, if present, adds the capability. If not, we can error out with "not supported". It would be a good improvement on what we do now.
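A minimal sketch of how such an opt-in check could look, assuming the optional module is detected by probing for a class on the classpath. The class `OptionalUnicodeSupport`, its methods, and the error message are hypothetical and not part of Rhino; `com.ibm.icu.lang.UCharacter` is an ICU4J class used here only as a presence marker.

```java
/** Sketch only: detects an optional Unicode provider at runtime. All names are hypothetical. */
class OptionalUnicodeSupport {
    // Assumption: the presence of this ICU4J class means the optional module is installed.
    static final String ICU_MARKER_CLASS = "com.ibm.icu.lang.UCharacter";

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, OptionalUnicodeSupport.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Errors out with "not supported" when the optional module is absent. */
    static void requireFullUnicode(String feature) {
        if (!isAvailable(ICU_MARKER_CLASS)) {
            throw new UnsupportedOperationException(
                    feature + " is not supported without the optional Unicode module");
        }
    }

    public static void main(String[] args) {
        System.out.println("ICU4J present: " + isAvailable(ICU_MARKER_CLASS));
    }
}
```

With this shape, the core engine has no compile-time dependency on ICU4J; only features that genuinely need it call `requireFullUnicode`.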

aardvark179 · 2025-12-01T14:07:59Z

I'm not sure the complement classes present an insurmountable wall. icu4j would certainly offer a route to a complete implementation, but it would also be entirely reasonable to calculate classes, and their complements, when needed. Looping from 0 to MAX_CODE_POINT and building a range structure doesn't actually take much time, and most Unicode classes have ranges that can be represented pretty compactly.
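The loop described above could be sketched roughly as follows. This is an illustration, not code from the PR: `CodePointRanges`, `rangesOf`, and `complementOf` are hypothetical names, and the class membership test is passed in as a plain `IntPredicate`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Sketch only: builds inclusive [start, end] ranges by scanning every code point. */
class CodePointRanges {
    /** Collects maximal runs of code points accepted by the predicate. */
    static List<int[]> rangesOf(IntPredicate inClass) {
        List<int[]> ranges = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a run"
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (inClass.test(cp)) {
                if (start < 0) start = cp;
            } else if (start >= 0) {
                ranges.add(new int[] {start, cp - 1});
                start = -1;
            }
        }
        if (start >= 0) ranges.add(new int[] {start, Character.MAX_CODE_POINT});
        return ranges;
    }

    /** The complement is the same scan with the predicate negated. */
    static List<int[]> complementOf(IntPredicate inClass) {
        return rangesOf(inClass.negate());
    }

    public static void main(String[] args) {
        IntPredicate upper = Character::isUpperCase;
        System.out.println(rangesOf(upper).size() + " ranges, "
                + complementOf(upper).size() + " ranges in the complement");
    }
}
```

The scan visits about 1.1 million code points per class, so caching the resulting range list per class would be a natural follow-up.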

balajirrao · 2026-02-24T14:08:24Z

I've finally managed to finish it up.

@aardvark179 It turns out I didn't need to compute the case folding of arbitrary Unicode regions in u mode. That is needed only for v mode. The spec was clear about this; it was MDN that confused me.

@rbri would appreciate you taking a look when you have a chance!

rbri · 2026-03-01T15:12:25Z

@balajirrao I did a smoke test with this and also took the chance to ask some LLMs to create test cases for it. It all looks good; I think we can go with it.

rbri · 2026-03-12T18:41:45Z

@gbrail will be great if this is in...

rbri · 2026-03-18T04:49:32Z

@gbrail reminder

gbrail

Please take a look at this test case, generated by AI, but which looks right to me. It succeeds in Node but fails in Rhino. It looks like our use of .toLowerCase and .toUpperCase for case folding in this case may not be strictly correct according to Unicode. Please check out this test case and let me know what you think. Thanks!

```js
// Assumed definition: the line that declared wordChar is missing from this copy of the comment.
const wordChar = /\w/ui.test('ß');
console.log("Is 'ß' a word character (\\w) with /ui? " + wordChar);
if (wordChar) {
    console.log("  FAILED: Should be false");
}

const matches = /S/ui.test('ß');
console.log("Does /S/ui match 'ß'? " + matches);
if (matches) {
    console.log("  FAILED: Should not match");
}

const boundary = ' ß'.search(/\b/ui);
console.log("Boundary index in ' ß': " + boundary);
if (boundary !== -1) {
    console.log("  FAILED: Should be -1, was %d", boundary);
}
```
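To illustrate why string-level case conversion can disagree with the one-to-one folding that `/iu` matching expects, here is a small Java sketch. It is not the PR's code: `SharpSCaseMapping` and its methods are hypothetical, and `approxSimpleFold` is only an approximation of Unicode simple case folding built from `Character.toUpperCase` and `Character.toLowerCase`.

```java
import java.util.Locale;

/** Sketch only: contrasts full (string) case mapping with a one-to-one approximation. */
class SharpSCaseMapping {
    /** Full, string-level uppercase mapping; the result may be longer than the input. */
    static String fullUpper(int codePoint) {
        return new String(Character.toChars(codePoint)).toUpperCase(Locale.ROOT);
    }

    /** One-to-one, code-point-level mapping used as a stand-in for simple case folding. */
    static int approxSimpleFold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    /** Compares two code points the way a one-to-one folding would. */
    static boolean foldEquals(int a, int b) {
        return approxSimpleFold(a) == approxSimpleFold(b);
    }

    public static void main(String[] args) {
        int sharpS = 0x00DF; // 'ß'
        System.out.println(fullUpper(sharpS));        // "SS": one character became two
        System.out.println(foldEquals('S', sharpS));  // false
        System.out.println(foldEquals('S', 's'));     // true
        System.out.println(foldEquals('s', 0x017F));  // true: U+017F is the long s
    }
}
```

Any comparison that goes through the string-level mapping sees "SS" for 'ß' and can therefore report a match against `S`, whereas the code-point-level mapping leaves 'ß' unchanged.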

gbrail · 2026-03-21T17:58:27Z

```diff
+        }
+        // For other characters, use Java's built-in case conversion
+        // This approximates Unicode case folding for most common cases
+        return Character.toString(codePoint)
```


I do not have extensive experience here but my AI tells me that this is incorrect for JavaScript. I will attach a test case that fails in Rhino but works in Node.

balajirrao · 2026-03-24

Cool, thanks! Just to be sure: you mean a new test case and not the comment, right?

I'm not an expert in Unicode myself either. I was going for a decent approximation of case folding without including the raw case folding data or using icu4j.

gbrail · 2026-03-26

Sorry, I put the test case in the comment. It basically exercises the German letter 'ß'. I am not a Unicode expert either, but I do know someone working on this who speaks German, so maybe he can help!

rbri · 2026-03-26

Will have a look...

rbri · 2026-03-26

Regarding this fancy German letter 'ß': its uppercase version is 'SS', and this is something that Java can't handle at the moment (as far as I and my new girlfriend Claude.ia know).

From my point of view:

- We can start with this one, because it is the 80% solution.
- For the rest we should go with ICU4J. Yes, it's a dependency, but there seems to be no real alternative, and we might need it in some other places as well. Dependency management is no longer a problem these days.
- If we agree, @balajirrao can create another PR, based on ICU4J, that brings us closer to the spec.

Claude.ia has already generated some more test cases for me. I will attach them as a file: casefolding.txt

balajirrao · 2026-03-26

@rbri thanks for the test cases! I think I have found a way to make the test cases from @gbrail pass, but at the cost of a few existing tests failing. I'm trying to get those to pass now. I'll let you know how it goes.

I've managed to get those cases to pass. Please take a look!

icu4j would be very useful in regexp v mode, which requires support for arbitrary sets of Unicode code points and operations on them. @rbri it seems you are suggesting that icu4j could be a dependency of Rhino itself? My idea was to provide a rhino-icu4j project that has first-class support for Unicode in regexps and in the language itself.
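For context, v mode allows set expressions such as intersection (`&&`) and subtraction (`--`) inside character classes. The sketch below shows the kind of set operations involved, using `java.util.BitSet` from the standard library as a stand-in rather than ICU4J; `CodePointSetOps` and its methods are hypothetical and not part of Rhino.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

/** Sketch only: code point sets as BitSets, with the operations v mode needs. */
class CodePointSetOps {
    static final int LIMIT = Character.MAX_CODE_POINT + 1;

    /** Builds a set containing every code point accepted by the predicate. */
    static BitSet of(IntPredicate member) {
        BitSet set = new BitSet(LIMIT);
        for (int cp = 0; cp < LIMIT; cp++) {
            if (member.test(cp)) set.set(cp);
        }
        return set;
    }

    /** a && b */
    static BitSet intersection(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    /** a -- b */
    static BitSet subtraction(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.andNot(b);
        return result;
    }

    /** Everything in the code point range that is not in a. */
    static BitSet complement(BitSet a) {
        BitSet result = (BitSet) a.clone();
        result.flip(0, LIMIT);
        return result;
    }

    public static void main(String[] args) {
        BitSet letters = of(Character::isLetter);
        BitSet ascii = of(cp -> cp < 128);
        System.out.println(intersection(letters, ascii).cardinality()); // 52 ASCII letters
    }
}
```

A BitSet over the whole code point range costs roughly 136 KB per set, so a range-based structure is likely the better fit for sets that are kept around.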

gbrail · 2026-03-27T22:40:23Z

@rbri if this is close enough for you then I'm OK to merge it as we figure out if we want to incorporate the whole ICU4j library later.

rbri · 2026-03-28T07:34:19Z

happy with this, let's merge

gbrail · 2026-03-28T22:23:02Z

That is great -- thanks -- I like this version better.

FWIW, Gemini Pro had some code review suggestions related to handling of a few more character classes, and also a suggestion that would reduce GC pressure. I suggest running some AI code reviews on this as there are optimization and correctness fixes we could certainly apply, but this is good so far. Thanks for all the work!

balajirrao · 2026-03-30T08:05:41Z

> That is great -- thanks -- I like this version better.
>
> FWIW, Gemini Pro had some code review suggestions related to handling of a few more character classes, and also a suggestion that would reduce GC pressure. I suggest running some AI code reviews on this as there are optimization and correctness fixes we could certainly apply, but this is good so far. Thanks for all the work!

Thanks! Yeah, I'll continue improving this.

balajirrao force-pushed the regexp-unicode-caseinsensitive branch from 6a24f28 to 647c882 on October 17, 2025 16:06

balajirrao force-pushed the regexp-unicode-caseinsensitive branch 3 times, most recently from fa34971 to e8f1bf2, on February 24, 2026 09:44

balajirrao added 5 commits February 24, 2026 10:56

Allow 'u' and 'i' flags to be used together

e782982

Add approximate unicode case-folding

9c73b8d

Change isWord to handle case-insensitive Unicode mode

a3572a1

# Conflicts: # rhino/src/test/java/org/mozilla/javascript/tests/NativeRegExpTest.java

Introduce opcode REOP_UCSPFLAT1i

2b52a81

For case-insensitive matching of Unicode surrogate pairs

Case-insensitive matching with anchor

64e9ece

balajirrao force-pushed the regexp-unicode-caseinsensitive branch from e8f1bf2 to 462c8b0 on February 24, 2026 10:29

balajirrao marked this pull request as ready for review February 24, 2026 13:38

balajirrao added 5 commits February 24, 2026 15:07

case-insensitive unicode support for flatNIMatcher and flatNIBackward…

7b6f835

… matchers

case-insensitive matching support for classes

57c0769

Property escapes

cce9253

Backref matcher

33535f0

Update test262.properties

db0763e

balajirrao force-pushed the regexp-unicode-caseinsensitive branch from 462c8b0 to db0763e on February 24, 2026 14:08

balajirrao changed the title ~~Regexp: support for case-insensitive unicode matching~~ Regexp: support for case-insensitive Unicode matching Feb 24, 2026

rbri approved these changes Mar 1, 2026

rbri mentioned this pull request Mar 21, 2026

Make NativeDate.toLocaleString() behave more like real browsers #2338

Merged

gbrail reviewed Mar 21, 2026

Improve support for 'ß' and other tiny fixes

e4981b6

Optimize caseFolding for ASCII and BMP cases

71df1ff

balajirrao requested review from gbrail and rbri March 27, 2026 14:15

rbri approved these changes Mar 28, 2026

gbrail merged commit ffda2f2 into mozilla:master Mar 28, 2026
11 checks passed
