Micro-blog exercise should cover graphemes

Hi there!

The instructions of the micro-blog exercise say:

The trick to this exercise is to use APIs designed around Unicode characters (code points) instead of Unicode code units.

I understand that we want to keep things simple, but I think this is misleading. For example, in the Roc track these instructions led some people to split the string into code points, when there is actually a very simple function to split it into graphemes instead. The tests pass in both cases because they only include graphemes composed of a single code point, but they would fail if the tests included flags, characters with multiple diacritics, complex emojis, or basically any grapheme composed of multiple code points (i.e., extended grapheme clusters).

In short: we shouldn’t encourage people to work with code points when they can just as easily work with graphemes.

I suggest at least updating the instructions to cover graphemes, and ideally also including some tests with extended grapheme clusters. If we’re going to handle Unicode, we should try to handle all possible characters. Handling graphemes might be harder in some languages, but those tracks can simply disable the extended grapheme tests.

I’m happy to submit a PR if there’s an agreement on this issue.


@ageron Thanks.

It feels to me that this should be something defined by tracks in amend files, not something in problem specs.

Opinions, maintainers?


Definitely should be an amend/append and supplement situation for … so so many reasons:

“might be harder” is a massive understatement. For languages like Rust or Python, it is darn near impossible to handle graphemes and extended grapheme clusters properly without manually implementing the Unicode grapheme cluster break algorithm. The only “sane” solution is to use third-party libraries that implement the algorithm properly.

And “properly” is debatable…

There are extended grapheme clusters and tailored grapheme clusters when dealing with Unicode. Turns out that the major grapheme libraries for Python disagree on what default (extended vs tailored) to implement. Half of them use one, half use the other.

And for scripts like Devanagari, that causes a bit of chaos: python - Combining Devanagari characters - Stack Overflow. Some libraries say हिन्दी decomposes into two tailored clusters (as do speakers of the language), others say three extended clusters (which is what the third-party Python regex engine does). :woman_facepalming:

I could go on and on here … but I shouldn’t. Suffice to say that just about every programming language that doesn’t have a core implementation of the cluster break algorithm would need to have their specific explanations/decisions as to the alternative. And in some cases, it also requires adjustments to the test runners and other tooling.

Don’t get me wrong: I want Unicode and grapheme cluster tests for Microblog, Reverse String, and anywhere else they make sense for text.

But I have to ignore all of them until I can make a decision about modifying the Python runner.

I suspect many tracks are in the same situation.


Not sure how relevant this is, but there’s a note in the problem specs that says:

“Avoid adding tests that involve characters (graphemes) that are made up of multiple characters, or introduce them as a more advanced step.”

You’re right, Unicode is complex, for sure. :sweat_smile:

But claiming that one code point = one character is just wrong. We should at the very least mention graphemes, and explain that for the sake of simplicity the tests only use basic graphemes composed of a single code point each. What do you think?
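To illustrate why “one code point = one character” breaks down, here is a small JavaScript sketch (the combining-accent string is just an example; `Intl.Segmenter` needs a reasonably recent engine):

```javascript
// "café" written with a combining acute accent (U+0301) on the final 'e'
const cafe = "cafe\u0301";

console.log(cafe.length);      // → 5 (UTF-16 code units)
console.log([...cafe].length); // → 5 (code points)

// Intl.Segmenter is the built-in way to count graphemes:
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...seg.segment(cafe)].length); // → 4 graphemes: c, a, f, é
```

Any code-point-based solution sees five “characters” here, while a reader sees four.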

Some time ago, half of the emails I used to receive would call me “Aurélien”, “Aur�lien”, “Aurlien” or “Aur?lien”. So it’s a bit personal for me. :confounded: Seriously though, extended grapheme clusters are not a luxury: some languages really need them, and billions of people use all sorts of emojis to express themselves now, so we should not hide them under the rug.


I think mentioning that the examples use only one codepoint sounds sensible. And noting that in the real world, we need more complex solutions.

But I don’t really want to do a lot of explaining about graphemes here. That is the space of Learning Exercises; different tracks have exercises that explain these topics in detail.

I’d also welcome an exercise (or even several!) that uses graphemes and can be implemented by languages where it’s more straightforward. But we shouldn’t turn this exercise into a “grapheme” exercise if it’s currently a “codepoint” exercise.


We can work out the exact wording etc. once we have some consensus. I’d like to hear the thoughts of a wide range of maintainers before we get to that stage though, as there are so many complexities with Unicode across different programming languages.


@ageron
Having written the JS implementation of this exercise, I feel like I should chime in on this. My solution was to include an explanation (and even a ‘dig deeper’ article) about the difference between code points, code units and graphemes (do check it out).

Specifically, in JS graphemes are a bit of a pain. Most built-in methods for working with strings are Unicode-aware, but they work with code units instead of code points, and this isn’t really mentioned anywhere obvious (I had to read into the docs quite a bit). There is a way to work with graphemes, but it comes with performance trade-offs, and judging by the community solutions tab it’s also a lot clunkier than regular string manipulation methods, so not many people are aware of it and not many use it. I imagine the situation is similar in other languages.
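To make the three levels concrete, here is a JavaScript sketch (the French flag emoji is just an example of a multi-code-point grapheme; `Intl.Segmenter` requires a fairly recent engine):

```javascript
const flag = "\u{1F1EB}\u{1F1F7}"; // 🇫🇷, two regional indicator symbols

// .length counts UTF-16 code units (each regional indicator is a surrogate pair):
console.log(flag.length); // → 4

// Spread/iteration walks code points:
console.log([...flag].length); // → 2

// Intl.Segmenter yields grapheme clusters:
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...segmenter.segment(flag)].length); // → 1
```

One visible character, two code points, four code units: each API answers a different question.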

As Jeremy said, I feel like this exercise is about code points vs. code units, and we can’t expect every language to handle graphemes in a way that’s reliable enough.


However, I think I understand the point about languages and emoji. After all, that was the entire point of creating Unicode: representing every character in a uniform way. My suggestions would be something like:

  • Maybe include an explanation about the difference and how a specific language handles Unicode, as an addendum to that language’s implementation of this exercise.
  • If you wish, you can also add additional tests to a specific implementation that aren’t in the canonical data (I think so, at least…)
  • A better idea would be to have a similar exercise that focuses solely on graphemes and grapheme clusters and have an implementation on a per-language basis, so that languages that have good methods of handling graphemes can implement that one, instead of (or in conjunction with) this exercise.

Thanks for all your interesting answers. I like the idea of keeping this exercise focused on code points, and creating a new exercise specifically focused on graphemes. :+1:

Regarding the micro-blog exercise, I believe the explanation still needs to be more accurate, without going into too much detail. Right now, it conflates code points and characters, and it talks about code units without defining what they are. I suggest fixing both these issues by changing the part after the word “Emoji” (line 30) to something like the explanation below. I checked: it just adds 30 seconds of reading time, and I believe this is time well spent.


In Unicode, each character is called a grapheme, and it is represented using one or more numbers called code points. For example:

  • The letter ‘A’ is represented using a single code point, number 65 (noted U+0041 because 65 is 0x41 in hexadecimal), which represents the Latin capital letter A.

  • The family emoji ‘:family_man_woman_girl_boy:’ is represented using 7 code points: U+1F468 (Man), U+200D (Zero Width Joiner), U+1F469 (Woman), U+200D, U+1F467 (Girl), U+200D, U+1F466 (Boy).

Note: To keep things simple in this exercise, we will avoid characters that require multiple code points, so you can assume that one character = one code point.
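As a quick sanity check of that family-emoji decomposition (a JavaScript sketch; iterating a string walks code points):

```javascript
const family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"; // 👨‍👩‍👧‍👦

// Format each code point as U+XXXX:
const codePoints = [...family].map(
  (c) => "U+" + c.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(codePoints.length); // → 7
console.log(codePoints.join(" "));
// → U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
```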

UTF-8 and UTF-16 are both variable-length encodings, which means that different code points take up different amounts of space. For example, in UTF-8 the code point for ‘A’ (U+0041) takes one 8-bit code unit (1 byte) while the code point for ‘:stuck_out_tongue:’ (U+1F61B) takes 4 code units (4 bytes). In UTF-16, ‘A’ takes one 16-bit code unit (2 bytes), while ‘:stuck_out_tongue:’ takes two code units (4 bytes).
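These sizes are easy to check in JavaScript (`TextEncoder` always encodes UTF-8, and `.length` counts UTF-16 code units):

```javascript
const a = "A";
const tongue = "\u{1F61B}"; // 😛

// UTF-8: one byte for 'A', four bytes for U+1F61B
console.log(new TextEncoder().encode(a).length);      // → 1
console.log(new TextEncoder().encode(tongue).length); // → 4

// UTF-16: one code unit for 'A', a surrogate pair for U+1F61B
console.log(a.length);      // → 1
console.log(tongue.length); // → 2
```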

The trick to this exercise is to use APIs designed around Unicode code points instead of Unicode code units.
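For instance, in JavaScript (a sketch, assuming the exercise’s truncate-to-five-characters task): slicing by code unit can cut a surrogate pair in half, while iterating by code point stays safe, at least for single-code-point graphemes.

```javascript
// Code-point-aware truncation: spread iterates by code point, not code unit.
const truncate = (s) => [...s].slice(0, 5).join("");

console.log(truncate("Hello, world")); // → "Hello"

// Six 😛 emojis (each one code point, but two UTF-16 code units):
const emojis = "\u{1F61B}".repeat(6);
console.log(truncate(emojis) === "\u{1F61B}".repeat(5)); // → true
// A unit-based slice grabs only 2½ emojis, ending on a lone surrogate:
console.log(emojis.slice(0, 5) === truncate(emojis));    // → false
```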
