string equivalence whitespace handling

#fhirpath


Brian Postlethwaite Feb 18, 2026, 09:04 PM

FHIR-55473 was recorded by @Josh Mandel, and it would be good to get some feedback from others in the community. The specific ask is whether the non-breaking space could be included in the whitespace lexical category.

Vasilii Kupryakov Feb 19, 2026, 10:31 AM

For whitespace you could consider the Unicode White_Space and Pattern_White_Space properties. But what is the use case for string equivalence? It is problematic and inconsistent between existing implementations. You can check:

('İ' ~ 'I').combine('İ' ~ 'i').combine('İ' ~ 'ı').combine('I' ~ 'i').combine('I' ~ 'ı').combine('i' ~ 'ı')
('ẞ' ~ 'ß').combine('ẞ' ~ 'SS').combine('ẞ' ~ 'ss').combine('ß' ~ 'SS').combine('ß' ~ 'ss')
// Or with escape syntax
('\u0130' ~ 'I').combine('\u0130' ~ 'i').combine('\u0130' ~ '\u0131').combine('I' ~ 'i').combine('I' ~ '\u0131').combine('i' ~ '\u0131')
('\u1e9e' ~ '\u00df').combine('\u1e9e' ~ 'SS').combine('\u1e9e' ~ 'ss').combine('\u00df' ~ 'SS').combine('\u00df' ~ 'ss')
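To illustrate why implementations can disagree on these pairs, here is a minimal Python sketch (not FHIRPath, and not any engine's actual algorithm) that compares two strings under three common "ignore case" strategies. The helper name `compare` is hypothetical, and the results depend on the Unicode tables shipped with the Python runtime.

```python
def compare(a: str, b: str) -> dict[str, bool]:
    """Compare two strings under three case-insensitive strategies."""
    return {
        "lower": a.lower() == b.lower(),        # simple lowercasing
        "upper": a.upper() == b.upper(),        # simple uppercasing
        "casefold": a.casefold() == b.casefold(),  # Unicode full case folding
    }


# A few of the pairs from the expressions above.
pairs = [
    ("\u00df", "SS"),   # ß vs SS
    ("\u1e9e", "ss"),   # ẞ vs ss
    ("I", "\u0131"),    # I vs dotless ı
    ("\u0130", "i"),    # dotted İ vs i
]

for a, b in pairs:
    print(f"{a!r} ~ {b!r}: {compare(a, b)}")
```

Under CPython, the three strategies should not all agree on these pairs (for example, 'ß' and 'SS' match when uppercased or case-folded but not when lowercased), which is the kind of inconsistency an engine inherits from whichever primitive it happens to use.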

Brian Postlethwaite Mar 4, 2026, 01:04 AM

/poll Should we add the non-breaking space (\u00A0) to the list of whitespace lexical elements?
Yes
No

Paul Lynch Mar 4, 2026, 04:16 PM

Why "no"?

Vasilii Kupriakov Mar 5, 2026, 09:24 AM

I think if you want to add non-ASCII whitespace, you should add the whole Unicode category, not just one character.

Brian Postlethwaite Mar 5, 2026, 09:26 AM

Add it to the list of characters that are interpreted as a whitespace?

Brian Postlethwaite Mar 5, 2026, 09:26 AM

And that are normalised when doing a string equivalence test? (which is what we're discussing)

Vasilii Kupriakov Mar 5, 2026, 09:35 AM

To the list of characters interpreted as equivalent inside strings. But not to the list of lexical whitespaces. You can see the list of these characters here: https://en.wikipedia.org/wiki/Whitespace_character#Unicode

Brian Postlethwaite Mar 5, 2026, 09:38 AM

So you're not saying no, you're saying include these others too?

Vasilii Kupriakov Mar 5, 2026, 09:40 AM

Yes

Brian Postlethwaite Mar 5, 2026, 09:40 AM

Cool, that makes more sense. Thanks.

Brian Postlethwaite Mar 18, 2026, 07:51 PM

@Bryn Rhodes thoughts? Any concern about the performance impact of adding 20 more characters here?

Steve Munini Mar 18, 2026, 08:11 PM

I'd advocate yes for these reasons:
- The current 4-character set is ASCII-era thinking
- NBSP is the most common "invisible surprise" character in real clinical data
- Implementations using standard Unicode whitespace handling are likely already treating it as whitespace - the spec should align with reality rather than force implementers to write custom matchers
- At minimum, add \u{00A0}; ideally, consider adopting Unicode's \p{White_Space} property to future-proof it - https://www.unicode.org/reports/tr44/#White_Space

Brian Postlethwaite Mar 18, 2026, 08:14 PM

Thanks, that last reference is what I was hunting for!

Brian Postlethwaite Mar 18, 2026, 08:36 PM

Maybe this set? https://www.unicode.org/charts/collation/chart_Whitespace.html
dotnet has 17 of them listed: https://learn.microsoft.com/en-us/dotnet/api/system.char.iswhitespace?view=net-10.0

Brian Postlethwaite Mar 18, 2026, 08:47 PM

While wandering down this rabbit hole, I also came across Unicode Normalization (https://unicode.org/reports/tr15), which helps normalize Unicode characters. Lots of programming languages support it too, e.g. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
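As a rough illustration of what normalization would buy a string equivalence test, here is a small Python sketch using the standard library's `unicodedata.normalize`. The helper name `nfc_equal` is made up for this example, and choosing NFC (rather than NFD, NFKC or NFKD) is an assumption, not something this thread has settled.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(composed == decomposed)           # raw code point comparison
print(nfc_equal(composed, decomposed))  # comparison after NFC

# The compatibility forms (NFKC/NFKD) also map NBSP to a plain space,
# which overlaps with the whitespace question in this thread.
print(unicodedata.normalize("NFKC", "a\u00a0b") == "a b")
```

The compatibility forms go further than whitespace (they also fold things like ligatures and full-width letters), so they would be a much broader change than just adding NBSP.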

Brian Postlethwaite Mar 18, 2026, 08:53 PM

This is where the unicode whitespace characters are enumerated. https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
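For a sense of what "all Unicode whitespace" would mean in an implementation, here is a minimal Python sketch. The code points are the 25 characters that carry the White_Space property in PropList.txt. The function names `normalize_ws` and `ws_equivalent` are hypothetical, and the normalization rule (map every whitespace character to U+0020 and ignore leading/trailing whitespace, without collapsing runs or handling case) is my assumption about the intended behaviour, not spec text.

```python
# Code points with the Unicode White_Space property (PropList.txt).
WHITE_SPACE = frozenset([
    *range(0x0009, 0x000E),  # TAB, LF, VT, FF, CR
    0x0020,                  # SPACE
    0x0085,                  # NEXT LINE
    0x00A0,                  # NO-BREAK SPACE
    0x1680,                  # OGHAM SPACE MARK
    *range(0x2000, 0x200B),  # EN QUAD .. HAIR SPACE
    0x2028,                  # LINE SEPARATOR
    0x2029,                  # PARAGRAPH SEPARATOR
    0x202F,                  # NARROW NO-BREAK SPACE
    0x205F,                  # MEDIUM MATHEMATICAL SPACE
    0x3000,                  # IDEOGRAPHIC SPACE
])

# str.translate accepts a dict mapping code points to replacement strings.
_TO_SPACE = {cp: " " for cp in WHITE_SPACE}


def normalize_ws(s: str) -> str:
    """Assumed rule: every whitespace char becomes U+0020, ends are trimmed."""
    return s.translate(_TO_SPACE).strip(" ")


def ws_equivalent(a: str, b: str) -> bool:
    """Whitespace-only equivalence; case handling is deliberately left out."""
    return normalize_ws(a) == normalize_ws(b)


print(len(WHITE_SPACE))
print(ws_equivalent("120\u00a0mg", "120 mg"))
print(ws_equivalent("a\u200bb", "a b"))  # ZERO WIDTH SPACE is not White_Space
```

A table lookup like this is a single pass over the string regardless of whether the set holds 4 or 25 entries, though real engines may implement it differently.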

Brian Postlethwaite Mar 18, 2026, 08:58 PM

/poll Should we change the whitespace equivalence handling for strings?
No
Add JUST non-breaking space (\u00A0)
Add ALL unicode whitespace (as defined in https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt )

Brian Postlethwaite Mar 18, 2026, 08:59 PM

/poll Should we also add unicode normalization to the string equivalence?
No
Add Unicode Normalization (as a may)
Add Unicode Normalization

Brian Postlethwaite Mar 18, 2026, 09:10 PM

I don't believe that CQL does any of this either.

Vasilii Kupriakov Mar 20, 2026, 02:37 PM

I think the most universal is collation-based equality with root search collation, normalization, primary strength, and shifted alternate. https://icu4c-demos.unicode.org/icu-bin/collation.html