Unicode testing for anagram doesn't actually test for graphemes

Meatball · April 21, 2024, 5:25pm

The current tests don't actually test that the application can work with grapheme clusters. They only test whether the system can handle Unicode characters, not whether a character made of multiple Unicode code points actually gets treated as a single character.

Given the following assertion, we would expect it to hold, since the two words have no letters in common:

 Anagram.find("üy", ["uÿ"]).should eq([] of String)

But if we actually split these 2 strings into their individual Unicode code points (in decomposed form), both leave us with the same three: ["u", "y", "<two dots>"]. So the 2 arrays will actually be the same, and the candidate gets wrongly reported as an anagram.

The current test cases don't test for a scenario like this.
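
To make the failure mode concrete, here is a rough Python sketch (not the Crystal track's code; the helper names `code_point_key` and `cluster_key` are made up for this illustration). It assumes the strings are compared in decomposed (NFD) form, and it approximates grapheme clusters by attaching combining marks to the preceding base character, which is simpler than the full Unicode segmentation rules.

```python
import unicodedata
from collections import Counter


def code_point_key(word):
    # Count individual code points after canonical decomposition (NFD).
    return Counter(unicodedata.normalize("NFD", word.lower()))


def cluster_key(word):
    # Approximate grapheme clusters: glue each combining mark onto the
    # base character before it. This is a simplification, not full UAX #29.
    clusters = []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return Counter(clusters)


word = "\u00fcy"       # "üy"
candidate = "u\u00ff"  # "uÿ"

# Code-point comparison: both words reduce to u, y and one diaeresis.
print(code_point_key(word) == code_point_key(candidate))  # True

# Cluster comparison: {ü, y} is not the same as {u, ÿ}.
print(cluster_key(word) == cluster_key(candidate))  # False
```

A solution that compares code points would report "uÿ" as an anagram of "üy", while one that compares whole characters would not.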

ErikSchierboom · April 23, 2024, 7:20am

Excellent point! I think a new test case would make sense.

Cool-Katt · April 23, 2024, 7:56am

I noticed this when writing up the implementation for Microblog on the JS track. That’s why I also wrote a bunch of approaches to explain that some solutions will fail in some cases.

I think the point for such a test is excellently made, but in the case of Microblog for example, it’ll also invalidate almost all current solutions since almost all of them don’t take grapheme clusters into account.

ErikSchierboom · April 23, 2024, 7:57am

Well, it’ll only invalidate them if tracks choose to implement it. That’s why we have the tests.toml file where tracks can opt out of individual test cases.

senekor · April 23, 2024, 10:26am

On the Rust track, some test cases are made optional by hiding them behind feature flags. The test runner ignores those, but people can run them locally. Optional bonus challenges are implemented this way, including handling grapheme clusters. Adding such tests would be backwards-compatible. Maybe a similar approach is possible in other languages.

Cool-Katt · April 23, 2024, 11:13am

Do you have an example of such an exercise at hand? I’d investigate.

senekor · April 23, 2024, 12:10pm

sure: reverse-string

BethanyG · May 18, 2024, 5:03pm

I would have replied earlier, but this somehow fell through the cracks. For this exercise in particular, there are ~~only~~ mostly English words used. The exercise instructions specify ASCII only.

I think that unless we are going to introduce non-English words as start words and candidates, we should probably leave this particular exercise unicode-free.

Not sure how we’d make sure that a student knew when a word was an anagram of another in different languages (I know we ask that already of non-English speakers, but as it stands now it’s one language as opposed to potentially multiple languages).

We could provide examples/lookups or wordlists. But then the exercise becomes different and potentially much more difficult. And that’s fine – but unlike Reverse String or Microblog, anagram detection requires that the word be valid in the target language, so we’d need to make sure that students had a way of doing that.

There is also the question of how Unicode would be treated. Would the Unicode characters be different, or would they be treated as their ASCII equivalents? Would a word that has Unicode characters be considered an Anagram of a word that displayed the same characters, but was made up of only ASCII?

BethanyG · May 18, 2024, 5:28pm

As @Meatball has pointed out in the comments of his PR, we’ve already added Greek and another Unicode scenario to this. Which I at the time approved – but shouldn’t have.

So at the very least, we need to change the instructions for the exercise (yes, I know that we can write instruction appends, but in this case it feels like we should be rewording things to be clearer). But I really think that if we want to do Unicode variations here, we probably want to introduce an all-Unicode version of the exercise rather than adding Unicode on to this one.

Meatball · May 18, 2024, 5:36pm

I am open to various approaches. I personally don’t like the Unicode tests (a bit ironic that I opened a PR adding more) and have therefore disabled them on the tracks I maintain. If they are to keep existing, I might implement a system similar to the Rust track's (which was discussed earlier in this thread). I am not against removing the Unicode tests, but I am not going to push a larger change to this exercise (including removing the tests). I'll leave that to somebody else, and I am open to closing my PR if that route is chosen.

senekor · May 18, 2024, 6:55pm

@BethanyG What are the downsides of excluding these tests on tracks that don’t want them? That was the point of the original forum discussion. Just exclude that scenario if it doesn’t fit a track.

(sorry for the duplicate comment in the PR, I guess discussion should continue here)

senekor · May 18, 2024, 6:59pm

I don’t get it, this is already the case, no? The tests define input candidates, which are passed to the students. Students don’t need to know anything about the language. Am I misunderstanding something?

BethanyG · May 18, 2024, 9:24pm

As mentioned in the PR Comment, the difference here is that the instructions define an Anagram as ASCII-only. We should change that.

So admittedly, this is very contrived. But in the proposed test case, we have üy, which according to wiktionary.org means house in Crimean Tatar.

According to the same source, uÿ means water in Xinca. And Yü is the name of a Chinese emperor, which is also spelled Yu, since the diaeresis is for pronunciation only.

So the expected result could be ["uÿ"], not [], and it could be argued that the candidates could actually be `[uÿ, yu, Yü, uy]`, depending on whether or not a given language considers a diacritical mark “part of” a letter.

Do we consider the accents/diacritics part of the letter for all languages? Do we assume that all candidates are valid words?
According to some sources, diacritics can be added, moved, or omitted to form Anagrams. Here is an example using Mädchen in German. Are all of those valid anagrams, or only the word that includes the umlaut?
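
To illustrate how much the answer depends on that choice, here is a rough Python sketch of the "diacritics are not part of the letter" reading. The helper `letters_ignoring_marks` is hypothetical and not taken from any track; it assumes that stripping combining marks after NFD decomposition is an acceptable stand-in for "ignoring diacritics", and it ignores the exercise's other rules (such as a word not being its own anagram).

```python
import unicodedata


def letters_ignoring_marks(word):
    # Case-fold, decompose, then drop combining marks such as the diaeresis.
    decomposed = unicodedata.normalize("NFD", word.casefold())
    return sorted(ch for ch in decomposed if not unicodedata.combining(ch))


word = "\u00fcy"  # "üy"
candidates = ["u\u00ff", "yu", "Y\u00fc", "uy"]  # "uÿ", "yu", "Yü", "uy"

matches = [c for c in candidates
           if letters_ignoring_marks(c) == letters_ignoring_marks(word)]

# Under this reading every candidate reduces to the letters u and y.
print(len(matches))  # 4
```

Under the opposite reading, where the diaeresis belongs to its letter, only "Yü" of those four would still match.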

It just becomes … well, complicated in the way the Pig Latin exercise became complicated with different rulesets. So I think we probably want to be clear about what we consider an Anagram, and maybe even what languages we consider as part of the search space.

(getting down off the soapbox now, because I look stupid up here…)

senekor · May 19, 2024, 6:39am

I agree with the first point. The instructions on the Rust track are self-contradicting right now. It would be better if the general instructions were less strict about the character range, such that individual tracks can expand on that without having to contradict it.

The second point is not very convincing to me, but at best, it’s an argument against grapheme clusters. Which I can kinda get on board with. reverse-string is only one exercise where the Rust track has tests for grapheme clusters, and even there, they are an optional challenge.

Let’s think about what the goal of the exercise is. Is it to teach people how to deal with grapheme clusters? I don’t think so. That would be an argument to not add tests for grapheme clusters.

The existing unicode tests are different though. We have many of them on the Rust track to make sure students use the standard unicode-aware string manipulation functions from the standard library. Without these tests, students could grab a string and convert it into a byte slice, which would be bad.
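
As a rough illustration of why byte-level comparison is a problem, here is a small Python sketch (the Rust track's tests are not shown in this thread, and the helper names `byte_key` and `char_key` are made up). The two words are a contrived pair, chosen so that their UTF-8 encodings contain the same bytes in a different order even though they share no characters.

```python
def byte_key(word):
    # Deliberately naive: treats every UTF-8 byte as if it were a letter.
    return sorted(word.encode("utf-8"))


def char_key(word):
    # Compares whole code points instead of raw bytes.
    return sorted(word)


word = "\u00e9\u0428"       # "éШ", UTF-8 bytes: C3 A9 D0 A8
candidate = "\u00e8\u0429"  # "èЩ", UTF-8 bytes: C3 A8 D0 A9

# Byte-level check: the same four bytes, so a false positive.
print(byte_key(word) == byte_key(candidate))  # True

# Character-level check: no characters in common.
print(char_key(word) == char_key(candidate))  # False
```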