Bug in Baffling Birthdays Tests

rzuckerm · May 29, 2025, 12:45am

The tests for Baffling Birthdays is not testing the whether the estimated probability is range. This:

function Test-Probability($Probability, $Expected, $Tolerance) {
    ($Probability -ge $Expected - $Tolerance) -or ($Probability -le $Expected + $Tolerance)
}

should be this:

function Test-Probability($Probability, $Expected, $Tolerance) {
    ($Probability -ge $Expected - $Tolerance) -and ($Probability -le $Expected + $Tolerance)
}

Also, in order the tests to complete in a reasonable amount of time, the tolerance needs to be higher than 1%. Through trial-and-error I found that 5% seems to work if the number of runs for testing random birthdays of the specified number of people is 1000.

I can do a PR to make this change if you’d like. I’ve never made a PR to an exercism repo before. Where can I find the documentation on what to do?

rzuckerm · May 30, 2025, 7:54pm

Actually a better fix for the tests would be this:

$tolerance = 5
function Test-Probability($Probability, $Expected, $Tolerance) {
    [Math]::Abs($Probability- $Tolerance) -le $Tolerance
}

Also, the example code has a bug. The return value from Invoke-BafflingBirthdays should be this:

$count * 100.0 / $runs

Can someone please tell me how to go about doing a PR for this?

IsaacG · May 30, 2025, 8:34pm

What’s a reasonable amount of time? How did you get to 1000?

I’m able to run 5,000 tests for 73 people in bash in 2.5 seconds. With 5,000 simulations, a 2% tolerance should be plenty. At 1000 tests, my simulations suggest 4% ought to suffice.

Ask if a PR is welcomed on the forum prior to opening a PR. Get approval from the track maintainer before creating a PR.
Suggesting Exercise Improvements
Exercism Pull Requests

Changing the or to an and is Powershell specific. The rest looks like it may belong in the upstream problem spec.

+cc @glaxxie

rzuckerm · May 30, 2025, 9:01pm

A reasonable amount of time is 1 to 2 seconds. On my PC, running Ubuntu 22.04 in WSL with Windows 11, all the tests completed in about 1.5 seconds when I used 1000. My main goal is to have the tests run reliably without timing out.

The or, and thing is Powershell-specific. I did a GitHub search on the exercism org to see what other tracks are using this exercise, and it looks like the only other one is C#. The example and tests looked right for that language.

Is this the right place to do that? If not, where?

IsaacG · May 30, 2025, 9:39pm

This is the right place We just need to wait for the maintainer to chime in.

rzuckerm · May 30, 2025, 9:55pm

I tried using 5000 runs and 2% tolerance. It looks like it works, but it takes anything from 7 to 10 seconds. I’ll submit the solution and see if it runs fast enough on the server.

IsaacG · May 30, 2025, 9:57pm

None the less, if the tests ought to be setting a fixed threshold, that threshold is something that might make sense to set independently of the track implementation/details.

rzuckerm · May 30, 2025, 9:57pm

It’s not fast enough on the server.

rzuckerm · May 30, 2025, 10:19pm

I modified my code that finds the shared birthdays to be more efficient. Now my local run completes in about 3-4 seconds for 5000 runs and a 2% tolerance.

rzuckerm · May 30, 2025, 11:26pm

I pulled down the code from exercism/powershell and ran it locally. It is quite slow, even for just 1000 iterations. I took about 5 seconds. I had to set the tolerance to 4% to get the tests to pass.

glaxxie · May 31, 2025, 4:48pm

I’m very welcome of a PR.
I think I was juggling 2 exercises at the time so I made a mistake for that test file and thus lead to the mistake for the example file as well. Sorry about that.

So yeah I will take a PR that fix the mistake of the test file (and instead of or), and an update of the example file.

About the point of iterations number and tolerance, here are my two cents:

I really want to keep the 1% point of tolerance, but that require a greater number of iterations (10k+) for better accuracy and some language is just not very well equip to deal with that. They are still fast but maybe not fast enough for the platform to run in time, not to mention the need for generating random birthday between batches.
I feel like the exercise is about working with daytime objects, random and ultimately proving that the paradox might sound unusual at first but they are indeed correct. So having a 4% for tolerance wouldn’t make sense for me, one of the test has an expected value of 11.694818, 4% deviation would be like 1/3 off.
With the above point, 5000 (still 12 mil unique pairs) for 2% seem reasonable, we can start with that and see how it work on the platform. Remember that learners only have that 2% tolerance number from the test file, how many iterations they ultimately choose is kinda depend on them. So if this work well, we might as well hardcode that number of run somewhere in the stub.

IsaacG · May 31, 2025, 4:51pm

Note, my implementation I wrote up quickly for benchmarking didn’t use any date objects I just generated a bunch of integer numbers in the 1-365 range and checked for a repeat value.

glaxxie · May 31, 2025, 5:28pm

Oh yeah, just for the base concept of it, sure. What I meant was based on the exercise spec itself here . As you can see there is a number of tests about randomized datetime objects, tests for non leap year, etc.

It would have been much simpler and faster if we can skip that part.

rzuckerm · May 31, 2025, 6:11pm

I’m not 100% sure what to do at this point. The example code is much too slow to handle 5000 runs. It takes about 5 seconds for 1000 runs on my PC, whereas my solution takes less than 1 second. I found that randomizing the month and day is really slow, and it is a lot more efficient to generate a number from 0 to 364 and add it to January 1 of the chosen year. Also having Get-Random generate multiple numbers at once seems to work better than generating one number at a time in a loop.

With regard to the PR, would I just change the test file and example, or is there something else I need to do? Also, does the build run for the PR run the same tests are run for a regular user submission? In other words, will it have the same timeout?

glaxxie · May 31, 2025, 6:22pm

The example code is not always the best solution btw, oftentimes it is encourage for maintainer to write the “best” solution. However its main function is proof that the exercise can be solved.
Generally I don’t go and change example file that already there, unless there is actual problem like this case because there were other issue in the test suite.

For the PR, yes, you should make change for the test file and the example file, that should be enough.

Also, does the build run for the PR run the same tests are run for a regular user submission? In other words, will it have the same timeout?

I don’t think so. Timeout on the platform is specify for each track (there is a general acceptable number, but im not sure what), the main reason is because you don’t want something stuck in a loop and cost you an insane amount of money. The pr run might be more forgiving but it also has limit obviously.

kotp · May 31, 2025, 6:31pm

5000 runs is the problem, I think. If 23 birthdays have a possibility of 50% collision, and 75 has a 99.9% possibility, then at most we should be doing 750 to get a 99.99% collision possibility. At this point, we might just be checking that it can match and not match.

Statistically, then, on the rare case that the code is right, but we do not have a collision, we can fail the test with the message “no duplicate found, and statistically this has a 1/1000 chance to happen based on this test. Consider re-running the test.”

The tests do not have to be exhaustive, it only needs to check that things are as expected.

rzuckerm · May 31, 2025, 6:41pm

@glaxxie Here is the PR:

kotp · May 31, 2025, 6:44pm

This I find surprising as there are only 365 days in a year. Once you have 366 people (in a standard year) someone should have a duplicate birthday).

If you want to show that the statistics match, you might generate 23 birthdays, and “flip the coin” 5 times to see if you get a normal distribution.

If you generate 75 birthdays, you find out if 95% or greater is true, and maybe do that twice, to get the median closer to the actual probability, and pass.

This should be less exhaustive than the Test-Probability going through thousands of tests to get scientific style results.

The tests are not meant to be exhaustive here, as much as they are to confirm a situation of student may have solved the exercise, with strong possibility.

rzuckerm · May 31, 2025, 6:54pm

I’m not saying that 1000 people are needed. I’m saying that the number of times to run the simulation is 1000 to get within 5% of the true probability. That’s what I originally got working.

glaxxie · May 31, 2025, 7:04pm

Looks good, but why didn’t you include the extra optimization that you did to be included in the example file as well? Since we’re changing the file anyway, might as well do it.
Please include it and I’ll see how faster it can be, then we can add in a hint for people to start with 5000 iterations.