[custom-set] Add test for deduplication?

Sorry for PR with no forum post.

Here’s the original issue on the Rust track.

It argues that there should be a test where a set is instantiated with duplicate values.

@neenjaw already commented:

Is a set with duplicate elements a valid set? I think part of what may be adding to confusion here is that an array notation is being used to hold the contents of a set.

Also, I am not sure this is a test that tests for deduplication as exercism/rust#1741 seemed concerned about and this was identified as the solve covering it.

Is a set with duplicate elements a valid set?

Good question. I guess there are three options:

  • Such input is considered invalid
    • Exercism often has exercises where testing invalid input is out of scope → We should not include the test.
    • Include the test, but expect an error.
  • Such input is considered valid → include the test as is.

I do tend toward the third. Just from intuition, it feels like a common operation to build sets from arrays, and in my opinion the least surprising semantics are that this operation deduplicates automatically. But I guess the opposite opinion would be equally valid.

Also, I am not sure this is a test that tests for deduplication as exercism/rust#1741 seemed concerned about

I’m actually not sure what the example test case in that issue was supposed to do. It seems to me a straight forward implementation which does have this duplication “bug” would probably pass that test as well.

I tried to add the simplest possible test case to make sure some deduplication happens when the set is instantiated. One could also imagine testing that add, union and so on don’t introduce duplicates later on. (those tests might exist already, I haven’t checked)

Yea, I think you’re right, there is a grey area here for sure and I think we do need to strive for a balance between “correctness” and “teaching the concept”.

Yea, I’m not saying that it isn’t common to build from arrays, I think the difference is what is the test testing? Testing that you’ve implemented set operations? or testing the implementation of the set? Maybe it’s both, but it appears at this point that it tests the implementation via the behavior of the set operations.

If testing for deduplication in a set, perhaps writing a test for the “size” of a set would be more clear to assert this property of the custom set. Testing for equality against another set constructed with only 1 input feels circumstantial and roundabout.

Testing that you’ve implemented set operations? or testing the implementation of the set?

I’d say it’s testing the equality operation, not the implementation of the set. Testing the implementation is impossible, because it’s private. It is possible to pass my test while still storing an array containing duplicates internally, one would have to write a custom Eq implementation which disregards duplicates.

perhaps writing a test for the “size” of a set would be more clear

I don’t quite see the difference. Wouldn’t that just test the .len() operation or whatever? Seems equally circumstantial than equality. Also, the test suite has no property for size so far. Adding such a test would expand the scope of the exercise.

There is at least mathematically nothing wrong with having multiple of the same element in a set (I think at least).

Say the Set should be implemented in the way that the Python sets work (e.g. if you have a duplicate item will one of them be removed). I would personally find it weird that a duplicate test is only found under one of the tasks, I would in that case expect at least one for every task.

That’s not quite right:

Set theory begins with a fundamental binary relation between an object o and a set A. If o is a member (or element) of A, the notation oA is used.

from Wikipedia.

Although I haven’t heard of it in the context of mathematics, in my experience a “set” with duplicates is usually called a bag.

Set theory begins with a fundamental binary relation between an object o and a set A . If o is a member (or element ) of A , the notation oA is used.

This doesn’t strictly speaking say that there can’t be duplicate items. This is at least how Wikipedia deffiens a set “A set is the mathematical model for a collection of different things; a set contains elements or members, which can be mathematical objects of any kind: numbers, symbols, points in space, lines, other geometrical shapes, variables, or even other sets.”

And looking under cardinality (or “size”), you can see they are using a set with duplicates: Set (mathematics) - Wikipedia

It is just there is no reason to have duplicates since that won’t change how they work (again I think).

A quite common usage is having different sets of numbers. And then working with equations and etc, you can say this element is a member of for example Natural Numbers. Or you can create your own sets. Computer science sets often differ from mathematical sets in the way that mathematical sets can be infinite while computer science sets are most often finite.

Because set theory has no notion of duplicates. The link behind “binary relation” might make it more clear.

Binary relations are used in many branches of mathematics to model a wide variety of concepts. These include, among others:

Maybe more intuitive for us programmers: A set can be interpreted as a function which takes two arguments, the potential element and the potential container, and it returns a boolean, indicating whether or not arg1 is element of arg2.

they are using a set with duplicates

That’s just a notational convention. {1, 1} is not a set with duplicates, is just a funny way to write {1}.

Which might be an argument to include the test. We had this question earlier:

Is [1, 1] an invalid set or is it equal to [1]?

That is merely a question of notation. And mathematicians have chosen the second option, {1, 1} is just a funny way to write {1}.

I think in theory it can have multiple of the same item, but it will act as it only have one of the items. As it will act as it is unorderd, even If most implimentation of the exercises likely implimenets through an orderd collection. And then make it act as the order doesnt matter.

Moving away from theory of maths and into the practicality of programming…

Are there any programming languages on Exercism that allow duplicates in a set? Ruby will automatically remove duplicates for example. I feel like when implementing a custom set, it should implement using the same rules as the built-in set in that language. I raise this as it’s important to consider whether different languages have different considerations here.

My intuition/understanding is that you can’t have duplicates in Sets and that all programming languages that use the concept of “Set” will largely do so for this reason. But I may be wrong and welcome any counterexamples :slight_smile:

Is a set with duplicate elements a valid set?

I feel that this should come down to the language again. In Ruby, for example, [1,1] is valid input to create a set, but results in a set of [1].

(Excuse the double post)

If languages do differ, we may want to consider adding two (or more) tests with different scenario tags. What we don’t want to do is a test that is conceptually “odd” or even “wrong” in any given language. And also, we should consider in adding the test whether we want to invalidate old solutions.

My sense is that I’m happy to invalidate old solutions as long as there’s a learning opportunity (e.g. learning how a Set works in that particular language and that its different to the student’s implementation)

Note, we do already have a test which checks that {1, 2, 3} + {3} == {1, 2, 3}. This test checks specifically the initialization. Is a duplicate in the initialization value necessarily valid? Maybe that should be a value error?

1 Like

Are there any programming languages on Exercism that allow duplicates in a set?

Are there any programming languages where two plus two equals five? Even if there was, I would vehemently oppose including it in our tests.

Exercism’s job is to help people learn the idioms of a language, not to define what the languages should do :slight_smile: So if that was idiomatic in a language, we’d have a test for it for that language, even that language didn’t comply with what we consider correct most of the time. (I do agree that this would feel weird - but we’re not the guardians of correctness - just educators about what different people consider true :wink: But I did once work in a very old and esotoric variant of Objective C where dividing a number by itself resulted in 0 - if we added that (terrible) language - we’d have to include a test for that so that it didn’t catch people out!). Whether these tests should be added to problem-specs would be a different question though.

This was my point really. Duplication in init is valid in Ruby. Is it valid in every language on Exercism? If so, then we add a standardised test. If not, then we need either zero or multiple scenario-based tests.

I think the test makes sense and is a good addition to the exercise, but we need to consider all the programming languages before doing so.

So my question is, are there languages on Exercism where [1,1] is an invalid init value?

Incomplete list of languages where initialization of a set from an array automatically deduplicates:

  • Rust
  • Python
  • C++
  • Java
  • C#
  • JavaScript
  • Swift
  • Ruby
  • Kotlin
  • Haskell
  • Elixir
  • Gleam

All of the mainstream languages agree on this. I couldn’t find one that does something different.

Outliers:

  • C & Go: don’t have sets in their standard library.
1 Like

OK, so my preference would be to add a test that tests that dedup init behaviour, as I do feel it’s a sensible requirement of Sets and it seems thanks to @senekor’s research that the majority of languages agree on that.

Does anyone disagree - and if so, why?

(Note to everyone that agrees with me: Let’s make space to hear opposing opinions before debating them pls! :slight_smile:)

I don’t have anything against the approach. The only “spec” that is missing is having another set inside a set. But that could make the exercises a bit complicated though.

But I would like to push on my comment earlier:

I would personally find it weird that a duplicate test is only found under one of the tasks, I would in that case expect at least one for every task.

Multiple tasks don’t test for duplicates, from my perspective would it be odd if only equal had such a test.

Not all the operations need to deduplicate in the first place, the ones that do are in my opinion constructing a set from an array, adding an element, constructing the union of two sets. The first is covered by my new test, the other two already have tests covering deduplication.

@Meatball which operations do you suggest should receive deduplication tests as well?