Anagram exercise needs more tests

merlin-vrn · February 5, 2024, 11:40am

Hello,

The “Anagram” exercise is a great one, and it uncovers quite unconventional ways to do things.

In Go it even has tests that check that non-Latin Unicode strings are processed correctly. The problem is that a solution that counts byte occurrences instead of rune occurrences can give incorrect results, and there are no tests to detect this. For example, such a solution would consider “бղ” and “вձ” to be anagrams, because their UTF-8 encodings are 0xD0 0xB1 0xD5 0xB2 and 0xD0 0xB2 0xD5 0xB1, respectively. These are “byte-level anagrams”, so to speak, but decoding the two strings as UTF-8 shows that they are not anagrams, since they don’t share any characters.

This is not an artificially invented problem; I’ve seen a solution that is susceptible to it. It uses a brilliant idea, but unfortunately the implementation is flawed.
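A minimal sketch of the difference, using the two strings from the example above. The helper names `byteAnagram` and `runeAnagram` are made up for illustration and are not taken from any submitted solution; the sketch also skips the exercise's other rules (case-insensitivity, a word not being its own anagram).

```go
package main

import (
	"fmt"
	"sort"
)

// byteAnagram (hypothetical helper) reports whether a and b contain the same
// multiset of bytes. It ignores UTF-8 character boundaries, which is the flaw.
func byteAnagram(a, b string) bool {
	x, y := []byte(a), []byte(b)
	sort.Slice(x, func(i, j int) bool { return x[i] < x[j] })
	sort.Slice(y, func(i, j int) bool { return y[i] < y[j] })
	return string(x) == string(y)
}

// runeAnagram (hypothetical helper) reports whether a and b contain the same
// multiset of runes, i.e. decoded Unicode code points.
func runeAnagram(a, b string) bool {
	x, y := []rune(a), []rune(b)
	sort.Slice(x, func(i, j int) bool { return x[i] < x[j] })
	sort.Slice(y, func(i, j int) bool { return y[i] < y[j] })
	return string(x) == string(y)
}

func main() {
	a, b := "бղ", "вձ"
	fmt.Printf("% X\n", a)         // D0 B1 D5 B2
	fmt.Printf("% X\n", b)         // D0 B2 D5 B1
	fmt.Println(byteAnagram(a, b)) // true
	fmt.Println(runeAnagram(a, b)) // false
}
```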

IsaacG · February 5, 2024, 2:31pm

Make sure you give the “Suggesting Exercise Improvements” page in Exercism's Docs a read.
Requiring rune support changes the difficulty and scope of the exercise, making this a bit of a different, harder exercise. Just because exercises can be extended and made more difficult doesn’t mean they should be. There is value in limiting the scope and challenge of an exercise.
I’m not a Go maintainer so my opinion is by no means a verdict on this idea.

iHiD · February 5, 2024, 2:52pm

This is basically a runes vs graphemes question, I think (if I understand Go correctly!). Do we want anagram to allow you to practice working with runes or graphemes? I think if we decide runes, we should probably add a comment to the exercise acknowledging what you’re saying above. If graphemes, then the exercise gets harder and existing solutions will break (not necessarily wrong, but much more consequential).

merlin-vrn · February 6, 2024, 5:14am

The description of the Go version of the exercise already states that it should process Unicode correctly. Here’s a quote:

Unlike other tracks, the Go version of the exercise includes test cases that use UTF-8 (non-ASCII) characters in the strings. However, with Go’s first-class support for UTF-8 given by the rune type, that should not bother you too much.

Indeed, it includes a test with Greek letters.

So, this is not an extension of the exercise. It doesn’t make it harder than it is. A note about this corner case can be added, but even that is not obligatory: the requirement to process Unicode is already given.

andrerfcsantos · February 7, 2024, 1:02am

Since the exercise already expects the student to handle runes correctly, I don’t see any problem extending the tests to include more cases with Unicode characters.

To consider adding the test case you suggest, it would be great to see a solution that currently passes the tests (including the tests with Unicode), but uses byte-level concepts to detect anagrams. Can you point us to the solution you are talking about?

The point about limiting scope is valid in general, but Go is unique in that it treats strings as a collection of runes by default for the most part. This is true in the standard library, but also in the language constructs, e.g. a for ... range loop over a string iterates over the runes, not the bytes.

So I don’t think rune support makes this exercise harder, because seeing strings as runes is the default in Go.
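As a rough illustration of that default, here is a small sketch contrasting index-based access (which yields bytes) with ranging over a string (which yields runes). The helper names `byteValues` and `runeValues` are invented for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteValues (hypothetical helper) collects what indexing s yields: raw bytes.
func byteValues(s string) []byte {
	out := make([]byte, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, s[i])
	}
	return out
}

// runeValues (hypothetical helper) collects what ranging over s yields:
// decoded runes, one per UTF-8 encoded code point.
func runeValues(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	s := "бղ"
	fmt.Println(len(s), utf8.RuneCountInString(s)) // 4 2
	fmt.Printf("% X\n", byteValues(s))             // D0 B1 D5 B2
	fmt.Printf("%U\n", runeValues(s))              // [U+0431 U+0572]
}
```

Note that `len(s)` and `s[i]` still work on bytes, so a solution only gets rune semantics if it ranges over the string or converts it to `[]rune`.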

To keep using the exercise to practice runes is a good idea, I think. Graphemes not so much, as it adds complexity to the exercise without adding value to the core concepts the exercise is about.

merlin-vrn · February 7, 2024, 7:18am

Yes, at least this one: gaetanww's solution for Anagram in Go on Exercism

I didn’t include a link at the beginning so as not to offend anyone. I also want to point out that the number-theory-based approach is a very good idea in general, but for this case it would require a table of 4 billion primes to work (i.e. a dedicated prime for each Unicode code point), while the byte-level solution requires only 256 primes. I am a bit sad it doesn’t work that well at the byte level, but that’s life.
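For readers unfamiliar with the idea, here is a rough sketch of a byte-level prime-product signature. This is not the linked solution's code: the names `firstPrimes`, `bytePrimes` and `byteSignature` are made up, and `math/big` is used only to sidestep overflow in the sketch. Each byte value gets its own prime, and two strings compare equal when the products of their primes match, which happens exactly when they contain the same bytes in any order.

```go
package main

import (
	"fmt"
	"math/big"
)

// bytePrimes assigns one distinct prime to each of the 256 byte values.
var bytePrimes = firstPrimes(256)

// firstPrimes returns the first n primes using trial division.
func firstPrimes(n int) []int64 {
	primes := make([]int64, 0, n)
	for c := int64(2); len(primes) < n; c++ {
		isPrime := true
		for _, p := range primes {
			if p*p > c {
				break
			}
			if c%p == 0 {
				isPrime = false
				break
			}
		}
		if isPrime {
			primes = append(primes, c)
		}
	}
	return primes
}

// byteSignature multiplies the primes assigned to every byte of s.
// By unique factorization, equal signatures mean equal byte multisets.
func byteSignature(s string) *big.Int {
	sig := big.NewInt(1)
	for i := 0; i < len(s); i++ {
		sig.Mul(sig, big.NewInt(bytePrimes[s[i]]))
	}
	return sig
}

func main() {
	a, b := "бղ", "вձ"
	// The signatures match even though the strings share no characters.
	fmt.Println(byteSignature(a).Cmp(byteSignature(b)) == 0) // true
}
```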

andrerfcsantos · February 7, 2024, 12:59pm

Downloaded that solution and ran it against the most recent version of the exercise that includes all the tests (including the unicode ones). Here are the findings:

The solution does pass the current Unicode tests. This is because, in the current Unicode tests, the rune-level anagrams are also byte-level anagrams when converted to lowercase. Made a Go Playground showing this: Go Playground - The Go Programming Language
Adding a new test case with the strings бղ and вձ, as originally suggested, and expecting them not to be anagrams does indeed make that solution fail as expected, because while the strings are byte-level anagrams, they are not rune-level anagrams.
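A sketch of what such a test case could look like, checked against a simple rune-based detector. The `anagramCase` struct and its field names are assumptions for illustration and may not match the track's actual test file; `Detect` here is a stand-in written for this example, not the official reference solution.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// anagramCase is a hypothetical test-case shape; the real test file may differ.
type anagramCase struct {
	description string
	subject     string
	candidates  []string
	expected    []string
}

// byteLevelCase is the case proposed in this thread.
var byteLevelCase = anagramCase{
	description: "does not detect byte-level anagrams that are not rune-level anagrams",
	subject:     "бղ",
	candidates:  []string{"вձ"},
	expected:    []string{},
}

// sortedRunes lowercases s and returns its runes in sorted order.
func sortedRunes(s string) string {
	r := []rune(strings.ToLower(s))
	sort.Slice(r, func(i, j int) bool { return r[i] < r[j] })
	return string(r)
}

// Detect is an illustrative rune-based detector: a candidate matches when it
// has the same runes as the subject (ignoring case) and is not the same word.
func Detect(subject string, candidates []string) []string {
	matches := []string{}
	lower := strings.ToLower(subject)
	key := sortedRunes(subject)
	for _, c := range candidates {
		if strings.ToLower(c) != lower && sortedRunes(c) == key {
			matches = append(matches, c)
		}
	}
	return matches
}

func main() {
	got := Detect(byteLevelCase.subject, byteLevelCase.candidates)
	fmt.Println(len(got) == len(byteLevelCase.expected)) // true
}
```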

So I think this is a good test case to add. @merlin-vrn Does this test case also make the other solutions you’ve seen fail?

@merlin-vrn if you or anyone else wants to open a PR with this, I’ll gladly accept it