Regexp: support for case-insensitive Unicode matching#2130
Regexp: support for case-insensitive Unicode matching#2130gbrail merged 12 commits intomozilla:masterfrom
Conversation
6a24f28 to
647c882
Compare
|
@balajirrao any plans for finishing this? Waiting for that to makt the separate engine pr... |
|
@rbri I thought I was going to finish it and then I hit a wall. I believe that in order to do this in the general case, we'd need icu4j. I'm considering creating a module outside of rhino, say, rhino-icu4j that when included would offer complete Unicode support in regexps and possibly in other cases too. How does that sound ? |
IMHO that's the right approach. An opt-in module that, if present, adds the capability. If not, we can error out with "not supported". It would be a good improvement on what we do now. |
|
I'm not sure the complement classes present an insurmountable wall. icu4j would certainly offer a route to a complete implementation, but it would also be entirely reasonable to calculate classes, and their complements, when needed. Looping from 0 to MAX_CODE_POINT and building a range structure doesn't actually take much time, and most unicode classes have ranges that can be represented pretty compactly. |
fa34971 to
e8f1bf2
Compare
# Conflicts: # rhino/src/test/java/org/mozilla/javascript/tests/NativeRegExpTest.java
For case-insensitive matching of Unicode surrogate pairs
e8f1bf2 to
462c8b0
Compare
462c8b0 to
db0763e
Compare
|
I've finally managed to finish it up. @aardvark179 It turns out I didn't need to compute case fold of arbitrary Unicode regions in the @rbri would appreciate you taking a look when you have a chance! |
|
@balajirrao did a smoke thest with this and also took the chance to ask some LLM's to create test cases for that. Looks all good - i think we can go with it. |
|
@gbrail will be great if this is in... |
|
@gbrail reminder |
gbrail
left a comment
There was a problem hiding this comment.
Please take a look at this test case, generated by AI, but which looks right to me. It succeeds in Node but fails in Rhino. It looks like our use of .toLowerCase and .toUpperCase for case folding in this case may not be strictly correct according to Unicode. Please check out this test case and let me know what you think. Thanks!
console.log("Is 'ß' a word character (\\w) with /ui? " + wordChar);
if (wordChar) {
console.log(" FAILED: Should be false");
}
const matches = /S/ui.test('ß');
console.log("Does /S/ui match 'ß'? " + matches);
if (matches) {
console.log(" FAILED: Should not match");
}
const boundary = ' ß'.search(/\b/ui);
console.log("Boundary index in ' ß': " + boundary);
if (boundary !== -1) {
console.log(" FAILED: Should be -1, was %d", boundary);
}```
| } | ||
| // For other characters, use Java's built-in case conversion | ||
| // This approximates Unicode case folding for most common cases | ||
| return Character.toString(codePoint) |
There was a problem hiding this comment.
I do not have extensive experience here but my AI tells me that this is incorrect for JavaScript. I will attach a test case that fails in Rhino but works in Node.
There was a problem hiding this comment.
Cool, thanks! Just to be sure - you mean a new test case and not the comment right ?
I'm not an expert in Unicode myself either - I was going for a decent approximation of case folding without including the raw case folding data or using icu4j.
There was a problem hiding this comment.
Sorry, I put the test case in the comment -- it basically exercises the German letter 'ß', and I am not a Unicode expert either but I do know someone working on this who speaks German so maybe he can help!
There was a problem hiding this comment.
Regarding this fancy german letter 'ß' - the uppercase version of this is 'SS' and this is something that java can't handle at the moment (as far as i and my new girlfriend Claude.ia knowing).
From my point of view we can:
- we can start with this one because it is the 80% solution
- for the rest we should go with ICU4J - yes its a dependency but there seem to be no real alternative and we might need it at some other places also. And dependency mgnt is no longer a problem these days
- if we agree @balajirrao can create another pr that brings us closer to the spec based on ICU4j
And Claude.ia generates already some more test cases for me. will attach them as file
casefolding.txt
There was a problem hiding this comment.
I've managed to get those cases to pass. Please take a look!
icu4j would be very useful in regexp vmode - this requires support for arbitrary sets of Unicode codepoints and operations on them. @rbri it seems you are suggesting that icu4j can be dependency of rhino itself ? My idea was to provide a rhino-icu4j project that has first class support for Unicode in regexp and in the language itself.
|
@rbri if this is close enough for you then I'm OK to merge it as we figure out if we want to incorporate the whole ICU4j library later. |
|
happy with this, let's merge |
|
That is great -- thanks -- I like this version better. FWIW, Gemini Pro had some code review suggestions related to handling of a few more character classes, and also a suggestion that would reduce GC pressure. I suggest running some AI code reviews on this as there are optimization and correctness fixes we could certainly apply, but this is good so far. Thanks for all the work! |
Thanks! Yeah, I'll continue improving this. |
Enable Unicode case-insensitive regex matching (/iu flag combination) using approximate case folding.