Can the error experience during tests be improved?

I’ve seen this message a few times in various tracks and I’ve wondered if this particular experience can be improved?

An error occurred while running your tests. This might mean that there was an issue in our infrastructure, or it might mean that you have something in your code that’s causing our systems to break.

Please check your code, and if nothing seems to be wrong, try running the tests again.

In most cases, I’d say some hidden issue in my code was causing the problem, and often a local test run will reveal it. That isn’t always the case, though, and sometimes it feels impossible to move forward.

Also, I’ve noticed that this seems to plague some tracks more than others. For example, some tracks display compiler errors where others do not. I assume this is due to the implementation of each test runner but I’m not sure.

Are there any proposals or ideas for ways to provide more insight when these errors occur? Something that can help indicate whether the problem is in fact with the code or with the infrastructure? I would love to bring a nice solution to the discussion but I haven’t thought of anything good just yet.

Thanks!

This error shows when the test runner dies without an expected result.

This is normally down to:

  1. Some underlying infrastructure issue (generally an outage of some kind)
  2. A test runner that performs unpredictably, or that is close to the timeout window and sometimes trickles over.
  3. Some issue in the code

We fundamentally don’t know which one of these three things is causing the issues, so it’s nigh on impossible to improve the error message.
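
To make that concrete, here’s a minimal, hypothetical sketch of what the invoking side sees (the function name, paths and deadline are made up for illustration; this is not Exercism’s actual orchestration code). All it really knows is whether a usable results file appeared before the deadline, so all three causes collapse into the same generic error:

```python
import json
import subprocess
from pathlib import Path

GENERIC_ERROR = {
    "status": "error",
    "message": "An error occurred while running your tests...",
}

# Hypothetical illustration: run a test runner and look for the results file
# it is supposed to produce.
def run_submission(runner_command: list[str], results_path: Path, deadline_secs: int = 20) -> dict:
    try:
        subprocess.run(runner_command, timeout=deadline_secs, check=True, capture_output=True)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # An infrastructure outage, a slow/flaky runner and an infinite loop in
        # the student's code all land here and look identical: the runner died.
        return GENERIC_ERROR
    if not results_path.exists():
        # The runner exited cleanly but produced nothing we can show.
        return GENERIC_ERROR
    return json.loads(results_path.read_text())
```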

What we can do is try to write better software so that (1) and (2) happen less. And maybe improve our software so that (3) is better reported too.

Some assorted information/thoughts below.


About 1:
I spend a lot of my programming time working on (1). Sadly, for the last year, that’s been very little time because I’ve been doing a lot of other things on Exercism. I’m changing things now so I have more time to code, which hopefully means (1) happens less often. The reality is, though, that the infrastructure works great and processes millions of submissions a month. When it breaks, it always breaks in a new, unexpected way, which means we then have to add code to stop that from happening again. As it’s just me working on this, sometimes we have outages for longer than I’d like, as it takes me time to get back to a computer to debug and fix. But normally we don’t get the same type of outage more than once.

As an example of this: this week, this PR was merged. It took down the whole test runner infrastructure, because it reran all Haskell solutions and, due to a bug in the Haskell test runner, all the machines ran out of HDD space and collapsed. Some code that was intended to catch this didn’t work (because I made a mistake and didn’t test something properly), so rather than magically fixing itself (by the machine killing itself and being replaced), it just sat there hoping to get well again.

These three PRs solve this issue:

That last PR has had probably 30 hours of work go into it so far.

As you can see, when we do have issues they tend to be complex, multi-faceted and difficult to predict in advance.


About 2:

Maintainers do do this all the time, and @ErikSchierboom is currently going through reviewing all test-runners to see where problems are occurring. We’re intending to add more CI to the test-runner repos with benchmarking and some meta-testing, to check for regressions. We have over 100 different pieces of production tooling running code in 65 languages though, so this is also complex and time-consuming.
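
As a rough sketch of what that meta-testing could look like (the directory layout, script arguments and file names here are assumptions, not a description of the existing repos): keep sample solutions alongside the results.json each one is expected to produce, run the test runner over them in CI, and fail the build on any mismatch.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical meta-test: every tests/<case>/ directory holds a sample solution
# plus the expected_results.json the runner should produce for it.
def check_golden_cases(run_script: Path, cases_dir: Path) -> bool:
    all_ok = True
    for case in sorted(p for p in cases_dir.iterdir() if p.is_dir()):
        output_dir = case / "output"
        output_dir.mkdir(exist_ok=True)
        subprocess.run([str(run_script), case.name, str(case), str(output_dir)], check=True)

        actual = json.loads((output_dir / "results.json").read_text())
        expected = json.loads((case / "expected_results.json").read_text())
        if actual != expected:
            print(f"Regression in {case.name}: results.json no longer matches")
            all_ok = False
    return all_ok
```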


About 3:

We could maybe do a better job at detecting things like student’s code timeouts within the test runners themselves. This would then allow us to provide better messages. We could also report infinite loops or other such things. But this needs to happen within the test-runners for it to work, which again means working across all 65 of them.
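
A rough sketch of what a runner could do, assuming a results.json contract with a version, status and message (the command, limit and wording below are placeholders): run the track’s test command under its own, slightly shorter timeout, and when that trips, write an error result that actually tells the student their code ran too long.

```python
import json
import subprocess
from pathlib import Path

# Sketch: run the track's test command under an internal limit that is a bit
# shorter than the platform's hard timeout, so there is time to report nicely.
def run_tests(test_command: list[str], output_dir: Path, limit_secs: int = 15) -> None:
    try:
        completed = subprocess.run(test_command, capture_output=True, text=True, timeout=limit_secs)
    except subprocess.TimeoutExpired:
        results = {
            "version": 2,
            "status": "error",
            "message": f"Your tests did not finish within {limit_secs} seconds. "
                       "This often means an infinite loop or very slow code.",
        }
    else:
        # A real runner would parse `completed.stdout` into per-test results
        # here; this sketch only distinguishes "finished" from "timed out".
        results = {
            "version": 2,
            "status": "pass" if completed.returncode == 0 else "fail",
            "tests": [],
        }
    (output_dir / "results.json").write_text(json.dumps(results))
```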

Maintainers could help with (or do) this on many of the bigger languages though, and maybe this is a good piece of work we should ask people to consider doing.


Well, I think I might add a bit to 2 and 3.

Also, I’ve noticed that this seems to plague some tracks more than others. For example, some tracks display compiler errors where others do not. I assume this is due to the implementation of each test runner but I’m not sure.

Test runners are complex pieces of machinery, designed uniquely for each track. Designing a robust, feature-rich one can be quite hard and time-consuming. On the Swift track I had the same experience of runs failing, and sometimes a run would fail but pass when rerun (an example of “a test runner that performs unpredictably, or that is close to the timeout window and sometimes trickles over”). Because of that, I decided to completely rewrite the test runner. That body of work took weeks, and I’d say it paid off: I now very rarely see a failure.

But it would be good to know which tracks have that kind of behavior, where the test runner just fails.

We’re intending to add more CI to the test-runner repos with benchmarking

Benchmarking is a very good thing to do, but it can be a bit tricky: execution time is never constant, updating anything could mean the benchmarks need to change, and building a suitable benchmarking system feels quite hard and complex.
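
One way to blunt the noise problem might be to compare the median of several runs against a stored baseline with a generous relative tolerance, so CI only fails on regressions well outside normal jitter. A rough sketch (the run count, tolerance and baseline file name are arbitrary choices):

```python
import json
import statistics
import subprocess
import time
from pathlib import Path

# Sketch of a noise-tolerant check: time several runs, compare the median to a
# stored baseline, and only flag regressions well outside normal jitter.
def check_benchmark(command: list[str], baseline_file: Path, runs: int = 5, tolerance: float = 0.30) -> bool:
    timings = []
    for _ in range(runs):
        start = time.monotonic()
        subprocess.run(command, check=True, capture_output=True)
        timings.append(time.monotonic() - start)

    median = statistics.median(timings)
    baseline = json.loads(baseline_file.read_text())["median_seconds"]

    if median > baseline * (1 + tolerance):
        print(f"Possible regression: median {median:.2f}s vs baseline {baseline:.2f}s")
        return False
    return True
```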

We could maybe do a better job at detecting things like student’s code timeouts within the test runners themselves.

This would be really good, although I have no clue how it would be pulled off, since it can be hard to control the execution of tests, and it could lead to a lot of false positives if test runners start to bail out because they think they are running out of time. I’m not saying it is impossible, just that it would be hard to pull off, at least for some tracks.

It feels like it would be easier if the Docker containers themselves knew that they killed the execution due to a timeout, and then passed that error on to the student.
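
A sketch of that idea, kept generic so it wouldn’t need per-track changes (the entrypoint shape, run-script name and message wording are assumptions): a thin wrapper in the container runs the track’s own run script, and if that script times out or dies without writing results.json, the wrapper writes a fallback error result saying what happened.

```python
import json
import subprocess
import sys
from pathlib import Path

# Hypothetical container entrypoint: wrap the track-specific run script so a
# timeout or crash still produces a results.json the student can read.
def main(slug: str, solution_dir: str, output_dir: str, limit_secs: int = 20) -> None:
    results_path = Path(output_dir) / "results.json"
    message = None
    try:
        subprocess.run(["bin/run.sh", slug, solution_dir, output_dir],
                       timeout=limit_secs, check=True)
    except subprocess.TimeoutExpired:
        message = (f"The test run was stopped after {limit_secs} seconds. "
                   "Check for infinite loops or very slow code.")
    except subprocess.CalledProcessError:
        message = "The test runner crashed before it could report any results."

    if message is None and not results_path.exists():
        message = "The test runner finished without producing any results."

    if message is not None:
        results_path.write_text(json.dumps({"version": 2, "status": "error", "message": message}))

if __name__ == "__main__":
    main(*sys.argv[1:4])
```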


Thanks so much for the deep, very thoughtful reply, Jeremy! Kudos to you and the team for all the hard work that goes into Exercism!

Your insights have helped shape some thoughts about the common causes you mentioned, specifically causes 2 and 3, the non-infrastructure issues.

One idea is to add a path forward to the error message for those interested in contributing improvements to the test runners; maybe an expandable part of the message for people who want to learn more. I could see linking from the error message to the README of the test runner, for example, but I’m sure there are many ways this could be done nicely.

Along those lines, I wonder if it would be worth evaluating and documenting the robustness of the tooling for each track? I could see creating a feature list based on the track with the most robust tooling and turning that into a template/wish list for the others (obviously, there are vast differences in the capabilities and tooling for different languages that would present significant limitations). Something like this might be really nice for contributors. Maybe this already exists?

These are 100% take-it-or-leave-it ideas and not very well thought through but, in any case, I imagine I may not be the only one who gets a lot of value from Exercism and feels like it would be worth helping out as a contributor.

Thanks again!
