Regex whole word match not working for Vietnamese language

Ben_Ghaner · July 12, 2021, 12:50am

Good evening!

I have a string field in Mongo which holds text in Vietnamese. For example, 2 documents:

bắt
bắt cóc

The program needs to do a whole “word” match against the text. A demo on regex101 works as expected. For example:

bắt → return #1 and #2 above
cóc → return #2 above
bắ → return nothing
có → return nothing

Here’s the .NET code, but the word boundary \b is not working, e.g. when searching for “bắ”:

string syllable = "bắ";
string pattern = @"\b" + syllable + @"\b";

var options = RegexOptions.CultureInvariant | RegexOptions.IgnoreCase;
var regex = new Regex(pattern, options);
var filter = Builders<Translation>.Filter.Regex(t => t.Vietnamese, new BsonRegularExpression(regex));

var results = collection
    .Find(filter)
    .ToList();

Expected: No results
Actual: Both #1 (bắt) and #2 (bắt cóc) are returned in error

The same issue occurs querying directly through Atlas / Compass. E.g.

{ Vietnamese: { $regex: /\bcó\b/, $options: "ui" } }

Expected: No results
Actual: #2 (bắt cóc) is returned in error

It seems to have something to do with encoding: when I run the same code against data containing only Basic Latin characters, it works as expected.
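A minimal sketch of what is probably going on, written in JavaScript rather than the .NET code above, because JavaScript's \b is also limited to ASCII word characters and so shows the same symptom. It assumes the server-side regex engine likewise treats "ắ" and "ó" as non-word characters. The helper names matchWithWordBoundary and isWordChar are hypothetical, made up for this illustration.

```javascript
// The two sample documents from the post above.
const sampleDocs = ["bắt", "bắt cóc"];

// Hypothetical helper mirroring the failing pattern: \b + syllable + \b.
function matchWithWordBoundary(syllable, docs) {
  const re = new RegExp("\\b" + syllable + "\\b", "iu");
  return docs.filter((doc) => re.test(doc));
}

// Without Unicode-aware word classes, \w covers only ASCII letters, digits and "_".
function isWordChar(ch) {
  return /\w/.test(ch);
}

console.log(isWordChar("t"));      // ASCII letter
console.log(isWordChar("\u1EAF")); // "ắ" as a single precomposed code point
console.log(matchWithWordBoundary("bắ", sampleDocs));
console.log(matchWithWordBoundary("có", sampleDocs));
```

Because "ắ" is not a word character here, the engine sees a boundary between "ắ" and "t", so \bbắ\b matches inside "bắt". With Basic Latin text every letter is a word character, which would explain why the same pattern behaves correctly there.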

Thanks,
Ben

Ben_Ghaner · July 15, 2021, 12:52am

Fixed using “negative lookbehind” and “negative lookahead.”

The goal is the same: a whole “word” match. The negative lookbehind (?<!\S) ensures the match is preceded only by whitespace or the start of the string. The negative lookahead (?!\S) ensures the match is followed only by whitespace or the end of the string.

(?<!\S)match(?!\S)

E.g. with match set to “bắ”:

var regex = new Regex(@"(?<!\S)" + match + @"(?!\S)", RegexOptions.IgnoreCase);
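The same lookaround idea, sketched in JavaScript so it can be run against the sample documents from the first post. This is an illustration only, not the code used in the application. The helper names escapeRegex and matchWholeSyllable are hypothetical, and escaping the search term is an extra precaution that the snippet above does not include.

```javascript
// The two sample documents from the first post.
const sampleDocs = ["bắt", "bắt cóc"];

// Escape regex metacharacters so the search term is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Whole-"word" match: the term must be delimited by whitespace
// or by the start/end of the string on both sides.
function matchWholeSyllable(syllable, docs) {
  const re = new RegExp("(?<!\\S)" + escapeRegex(syllable) + "(?!\\S)", "iu");
  return docs.filter((doc) => re.test(doc));
}

console.log(matchWholeSyllable("bắt", sampleDocs));
console.log(matchWholeSyllable("cóc", sampleDocs));
console.log(matchWholeSyllable("bắ", sampleDocs));
console.log(matchWholeSyllable("có", sampleDocs));
```

One trade-off of this approach: punctuation counts as non-whitespace, so a term directly followed by a comma or full stop (for example "bắt,") will not be matched by a search for "bắt".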

Asya_Kamsky · July 16, 2021, 7:15pm

Does this look like the same issue mentioned in this ticket?

Ben_Ghaner · July 17, 2021, 12:42pm

Yes, that looks like it! It's the \b Unicode issue.

Thanks for finding that Asya.

Tung_Tri_u · November 11, 2024, 4:21am

How did you resolve this issue? Can you help me handle it? I'm having trouble with this issue too.