#fhirpath
string equivalence whitespace handling
Brian Postlethwaite Feb 18, 2026, 09:04 PM
FHIR-55473: @Josh Mandel recorded this issue, and I'd like to get some feedback from others in the community. The specific ask is whether the non-breaking space could be included in the whitespace lexical category.
Vasilii Kupryakov Feb 19, 2026, 10:31 AM
For whitespace you could consider the Unicode White_Space and Pattern_White_Space properties. But what is the use case for string equivalence? It is problematic and inconsistent between existing implementations. You can check:
('İ' ~ 'I').combine('İ' ~ 'i').combine('İ' ~ 'ı').combine('I' ~ 'i').combine('I' ~ 'ı').combine('i' ~ 'ı')
('ẞ' ~ 'ß').combine('ẞ' ~ 'SS').combine('ẞ' ~ 'ss').combine('ß' ~ 'SS').combine('ß' ~ 'ss')
// Or with escape syntax
('\u0130' ~ 'I').combine('\u0130' ~ 'i').combine('\u0130' ~ '\u0131').combine('I' ~ 'i').combine('I' ~ '\u0131').combine('i' ~ '\u0131')
('\u1e9e' ~ '\u00df').combine('\u1e9e' ~ 'SS').combine('\u1e9e' ~ 'ss').combine('\u00df' ~ 'SS').combine('\u00df' ~ 'ss')
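As a rough illustration of why results for these pairs can differ between implementations, the sketch below compares three plausible ways of "ignoring case" in Python. The helper names are hypothetical, and nothing here reflects how any particular FHIRPath engine implements `~`; it only shows that lowercasing, uppercasing, and case folding are not interchangeable.

```python
# Three plausible ways an engine might implement "ignore case".
# These helpers are hypothetical; no FHIRPath engine is being quoted.
def eq_lower(a: str, b: str) -> bool:
    return a.lower() == b.lower()


def eq_upper(a: str, b: str) -> bool:
    return a.upper() == b.upper()


def eq_casefold(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()


# A few of the pairs from the message above, written with escapes.
PAIRS = [
    ("\u1e9e", "\u00df"),  # capital sharp s vs small sharp s
    ("\u00df", "SS"),      # small sharp s vs "SS"
    ("\u0130", "i"),       # capital I with dot above vs "i"
    ("I", "\u0131"),       # "I" vs dotless small i
]


def compare_all(pairs=PAIRS):
    """Return {(a, b): (lower, upper, casefold)} for each pair."""
    return {
        (a, b): (eq_lower(a, b), eq_upper(a, b), eq_casefold(a, b))
        for a, b in pairs
    }


if __name__ == "__main__":
    for (a, b), (lo, up, cf) in compare_all().items():
        print(f"{a!r} ~ {b!r}: lower={lo} upper={up} casefold={cf}")
```

With the Unicode data bundled in CPython, the three strategies disagree on the sharp-s and dotless-i pairs, which is the kind of inconsistency described above.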
Brian Postlethwaite Mar 4, 2026, 01:04 AM
/poll Should we add the non-breaking space (\u00A0) to the list of whitespace lexical elements?
Yes
No
Paul Lynch Mar 4, 2026, 04:16 PM
Why "no"?
Vasilii Kupriakov Mar 5, 2026, 09:24 AM
I think if you want to add non-ASCII whitespace, you should add the whole Unicode category, not just one character.
Brian Postlethwaite Mar 5, 2026, 09:26 AM
Add it to the list of characters that are interpreted as a whitespace?
Brian Postlethwaite Mar 5, 2026, 09:26 AM
And are normalised when doing a string equivalence test? (Which is what we're discussing.)
Vasilii Kupriakov Mar 5, 2026, 09:35 AM
To the list of characters interpreted as equivalent inside strings. But not to the list of lexical whitespaces. You can see the list of these characters here: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
Brian Postlethwaite Mar 5, 2026, 09:38 AM
So you're not saying no, you're saying include these others too?
Vasilii Kupriakov Mar 5, 2026, 09:40 AM
Yes
Brian Postlethwaite Mar 5, 2026, 09:40 AM
Cool, that makes more sense. Thanks.
Brian Postlethwaite Mar 18, 2026, 07:51 PM
@Bryn Rhodes thoughts? Any concern about the performance impact of adding 20 more characters here?
Steve Munini Mar 18, 2026, 08:11 PM
I'd advocate yes for these reasons:
- The current 4-character set is ASCII-era thinking
- NBSP is the most common "invisible surprise" character in real clinical data
- Implementations using standard Unicode whitespace handling are likely already treating it as whitespace - the spec should align with reality rather than force implementers to write custom matchers
- At minimum, add \u{00A0}; ideally, consider adopting Unicode's \p{White_Space} property to future-proof it - https://www.unicode.org/reports/tr44/#White_Space
Brian Postlethwaite Mar 18, 2026, 08:14 PM
Thanks, that last reference is what I was hunting for!
Brian Postlethwaite Mar 18, 2026, 08:36 PM
Maybe this set? https://www.unicode.org/charts/collation/chart_Whitespace.html
dotnet has 17 of them listed: https://learn.microsoft.com/en-us/dotnet/api/system.char.iswhitespace?view=net-10.0
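One way to see where a count like 17 could come from is to enumerate the Unicode SpaceSeparator (Zs) general category, which is one of the groups the .NET page describes. The sketch below does that with Python's standard `unicodedata` module. Treating Zs as the source of the figure is an assumption, and the exact list depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def chars_in_category(category: str) -> list[str]:
    """Return every code point whose general category equals `category`."""
    return [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]


# Zs = SpaceSeparator; assumed here to be the group behind the quoted count.
space_separators = chars_in_category("Zs")

if __name__ == "__main__":
    print(unicodedata.unidata_version, len(space_separators))
    print(" ".join(f"U+{ord(ch):04X}" for ch in space_separators))
```

Note that Zs alone does not cover tab, line feed, carriage return, or the line and paragraph separators, so it is narrower than the full White_Space property.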
Brian Postlethwaite Mar 18, 2026, 08:47 PM
While wandering down this rabbit hole, I also came across Unicode Normalization (https://unicode.org/reports/tr15), which helps normalize Unicode characters. Lots of programming languages support this too: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
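For a concrete sense of what normalization would add to an equivalence test, here is a minimal Python sketch using the standard `unicodedata.normalize` function. The helper name `nfc_equal` is hypothetical, and the choice of NFC rather than another form is an assumption made for illustration; the thread does not settle on a form.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"     # e with acute as one precomposed code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

if __name__ == "__main__":
    print(composed == decomposed)           # False: code points differ
    print(nfc_equal(composed, decomposed))  # True: same text after NFC
    # NFKC additionally applies compatibility mappings, e.g. NBSP -> space.
    print(unicodedata.normalize("NFKC", "\u00a0") == " ")  # True
```

The last line shows that the compatibility forms (NFKC/NFKD) also fold the non-breaking space into an ordinary space, so the whitespace and normalization questions overlap to some degree.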
Brian Postlethwaite Mar 18, 2026, 08:53 PM
This is where the unicode whitespace characters are enumerated. https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt
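A minimal sketch of what whitespace-normalizing equivalence could look like with the full White_Space set is shown below. The 25 code points are those carrying the White_Space property in PropList.txt. Everything else is an assumption for illustration: the function names are hypothetical, the "current" set is taken to be the four ASCII characters space, tab, CR, and LF, and the trimming, collapsing, and case-folding choices are just one reading of "normalizing whitespace and ignoring case".

```python
# Code points carrying the Unicode White_Space property (PropList.txt).
UNICODE_WHITE_SPACE = frozenset(
    "\u0009\u000a\u000b\u000c\u000d"  # tab, LF, VT, FF, CR
    "\u0020\u0085\u00a0\u1680"        # space, NEL, NBSP, Ogham space mark
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a"
    "\u2028\u2029\u202f\u205f\u3000"
)

# Assumption: the "current 4-character set" is space, tab, CR, LF.
ASCII_WHITE_SPACE = frozenset(" \t\r\n")


def normalize_whitespace(text: str, whitespace=UNICODE_WHITE_SPACE) -> str:
    """Map each whitespace char to a space, collapse runs, trim the ends."""
    spaced = "".join(" " if ch in whitespace else ch for ch in text)
    return " ".join(part for part in spaced.split(" ") if part)


def equivalent(a: str, b: str, whitespace=UNICODE_WHITE_SPACE) -> bool:
    """Hypothetical string equivalence: normalize whitespace, fold case."""
    left = normalize_whitespace(a, whitespace).casefold()
    right = normalize_whitespace(b, whitespace).casefold()
    return left == right


if __name__ == "__main__":
    a, b = "Blood\u00a0Pressure", "blood  pressure "
    print(equivalent(a, b))                     # NBSP treated as whitespace
    print(equivalent(a, b, ASCII_WHITE_SPACE))  # NBSP treated as a letter
```

Membership in a small frozen set is a constant-time lookup, so in this sketch the cost of going from 4 to 25 characters is negligible; real engines may of course differ.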
Brian Postlethwaite Mar 18, 2026, 08:58 PM
/poll Should we change the whitespace equivalence handling for strings?
No
Add JUST non-breaking space (\u00A0)
Add ALL unicode whitespace (as defined in https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt)
Brian Postlethwaite Mar 18, 2026, 08:59 PM
/poll Should we also add unicode normalization to the string equivalence?
No
Add Unicode Normalization (as a may)
Add Unicode Normalization
Brian Postlethwaite Mar 18, 2026, 09:10 PM
I don't believe that CQL does any of this either.
Vasilii Kupriakov Mar 20, 2026, 02:37 PM
I think the most universal is collation-based equality with root search collation, normalization, primary strength, and shifted alternate. https://icu4c-demos.unicode.org/icu-bin/collation.html