Good evening!
I have a string field in Mongo which holds text in Vietnamese. For example, 2 documents:
1. bắt
2. bắt cóc
The program needs to do a whole “word” match against the text. A demo on regex101 works as expected. For example:
bắt → returns #1 and #2 above
cóc → returns #2 above
bắ → returns nothing
có → returns nothing
Here’s the .NET code, where the word boundary \b is not working as expected, e.g. when searching for “bắ”:
string syllable = "bắ";
string pattern = @"\b" + syllable + @"\b";
var options = RegexOptions.CultureInvariant | RegexOptions.IgnoreCase;
var regex = new Regex(pattern, options);
var filter = Builders<Translation>.Filter.Regex(t => t.Vietnamese, new BsonRegularExpression(regex));
var results = collection
    .Find(filter)
    .ToList();
Expected: No results
Actual: both #1 (bắt) and #2 (bắt cóc) are returned in error
The same issue occurs when querying directly through Atlas / Compass, e.g.:
{ Vietnamese: { $regex: /\bcó\b/, $options: "ui" } }
Expected: No results
Actual: #2 (bắt cóc) is returned in error
It seems to have something to do with encoding, or with how non-ASCII characters are treated at word boundaries: when I run the same code against data with only Basic Latin characters, it works as expected.
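One way to probe that hunch outside the database is the sketch below. It is only an illustration under stated assumptions. It runs in plain JavaScript (Node.js), where \b treats only ASCII letters, digits and underscore as word characters even with the u flag, and it assumes, without confirming, that the server-side regex engine treats \b the same way. The helper names (asciiBoundaryMatch, wholeWordPattern, unicodeBoundaryMatch) are hypothetical, and the syllable is assumed to contain no regex metacharacters. The second variant replaces \b with explicit lookarounds on Unicode letters, marks and digits.

```javascript
// Hypothetical helpers for illustration; not part of the original program.

// Plain \b: in JavaScript, \b is defined by ASCII word characters only,
// even with the "u" flag, so "ắ" and "ó" count as non-word characters.
function asciiBoundaryMatch(syllable, text) {
  return new RegExp("\\b" + syllable + "\\b", "ui").test(text);
}

// Explicit boundaries: the syllable must not be directly preceded or
// followed by a letter, combining mark, digit, or underscore.
// Assumption: syllable contains no regex metacharacters (no escaping done).
function wholeWordPattern(syllable) {
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  return "(?<!" + wordChar + ")" + syllable + "(?!" + wordChar + ")";
}

function unicodeBoundaryMatch(syllable, text) {
  return new RegExp(wholeWordPattern(syllable), "ui").test(text);
}

// The two sample documents from the question.
const docs = ["bắt", "bắt cóc"];

for (const syllable of ["bắt", "cóc", "bắ", "có"]) {
  const ascii = docs.filter((d) => asciiBoundaryMatch(syllable, d));
  const unicode = docs.filter((d) => unicodeBoundaryMatch(syllable, d));
  console.log(syllable, JSON.stringify(ascii), JSON.stringify(unicode));
}
```

Run locally, the \b variant reproduces the unwanted matches for bắ and có, while the lookaround variant matches only whole syllables. Whether the string returned by wholeWordPattern behaves identically when passed to $regex on the server is untested here.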
Thanks,
Ben