Feature proposal: benchmarking program efficiency

Let me begin by saying that my first intro to Exercism was with the Clojure track – I’ve learned to really love this platform and have been a regular donor for a while now.
I’ve taken up learning C here, and I’ve noticed something that I think perhaps hasn’t been fully taken into account with these lower-level languages, especially ones like C: performance is critical.

If you can shave off a few bytes from memory, or significantly reduce time complexity, it adds a really interesting aspect to solving a problem.

I came across this project recently:
https://exercism.org/tracks/c/exercises/grains/dig_deeper
It’s the classic “Grains on a chessboard” problem. The Dig Deeper solution implemented something I’d never seen before: rather than storing the values in an array, each one is computed on the fly by bit-shifting within a uint64_t, which apparently gives immense speedups, because shifting a bit is significantly faster than indexing an array.
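For concreteness, a rough sketch of that bit-shifting idea (my own paraphrase, not the exact Dig Deeper code):

```c
#include <stdint.h>

/* The number of grains on square n is 2^(n-1), which is just a single bit
 * set in a 64-bit integer, so a shift replaces any lookup table or loop. */
uint64_t grains_on_square(int square)
{
    return UINT64_C(1) << (square - 1);
}

/* The total over all 64 squares is 2^64 - 1, i.e. every bit set. */
uint64_t grains_total(void)
{
    return UINT64_MAX;
}
```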

Another example: the Fast Inverse Square Root algorithm, which was used in Quake III Arena to give roughly a 4x speedup when calculating angles of reflection for lighting:
https://en.wikipedia.org/wiki/Fast_inverse_square_root
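For reference, here is a sketch of that algorithm as described on the Wikipedia page above, rewritten with memcpy instead of the original’s pointer-cast type punning:

```c
#include <stdint.h>
#include <string.h>

/* Approximates 1/sqrt(x) without any division: the 0x5f3759df magic constant
 * produces a rough first estimate, and one Newton-Raphson step refines it. */
float fast_rsqrt(float x)
{
    const float halfx = 0.5f * x;
    float y = x;
    uint32_t bits;

    memcpy(&bits, &y, sizeof(bits));   /* reinterpret the float's bits as an integer  */
    bits = 0x5f3759df - (bits >> 1);   /* bit-level trick yields the initial estimate */
    memcpy(&y, &bits, sizeof(bits));   /* back to float                               */
    y = y * (1.5f - halfx * y * y);    /* one Newton-Raphson refinement step          */
    return y;
}
```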

Of course, there is a balance between code readability and performance. I’d often much rather have clear readable code than a performance increase, and I think I’m in the majority here.

That being said:
I think that automatically benchmarking user-submitted solutions could add a really interesting dynamic to this platform, especially for lower-level languages. Users could choose to sort by stars, or by performance, among other metrics.

I’m sure this topic has been discussed internally, as it’s not a new concept. But I couldn’t find a relevant topic here on the forum. I’m interested in seeing if this is something that the community would support, and perhaps fostering a discussion around it.

Thanks!

The Go track (and the Go tooling) has benchmarking that users can run to see how efficient various approaches might be.

One of the challenges of running benchmarks on students’ solutions is that, in general, benchmarking tends to be CPU intensive, and Exercism pays for the CPU cycles it consumes :) Drastically spiking the CPU load to get benchmark results has a very real cost associated with it.

Depending on the student and the questions they ask, I do suggest both profiling and benchmarking. Sometimes I will help them with their benchmarks or profiling. But all of these things are available to any student who wants to explore them, on their own system.

I suspect that these things come up in mentoring sessions, where the conversation and the solutions help bubble them up.

As Isaac says, having this done on the platform would have to have a return on investment to justify the additional cost of the work, so I would imagine this is not something that makes sense built into the system.

It is interesting, though.

Is there any way that the users can run the benchmarks locally?

I would imagine that different computers would get different results based on things like processor architecture, CPU speed, memory, etc. I wonder if there’s a creative and easy way around this.

Assuming we trust the users, we could run the computation locally and just upload the result. Or perhaps users could opt in to offer their computational resources, and the Exercism client could coordinate and verify computations across multiple machines as a proof of authenticity.

I don’t have enough CS knowledge to know if this is a viable solution, or if it’s even possible. But I’m thinking that if you could virtualize or containerize the software, and constrain the resources, I would expect to see similar results across different machines.

Perhaps there are benchmarking utilities that can count the number of instructions that a program performs during a run, making testing on platforms with different computational resources a non-issue.
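Tools along those lines do exist: on Linux, for example, perf stat and Valgrind’s callgrind report instruction counts, and those counts are largely independent of how fast or busy the machine is. As a rough in-process illustration (Linux-only, and just a sketch following the pattern from the perf_event_open man page), a program can read the hardware instruction counter around a section of code:

```c
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Thin wrapper: glibc has no perf_event_open() function, only the syscall. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* count retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd == -1) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* placeholder for the code being measured */
    volatile uint64_t sum = 0;
    for (int i = 0; i < 1000000; i++)
        sum += (uint64_t)i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t instructions = 0;
    if (read(fd, &instructions, sizeof(instructions)) == (ssize_t)sizeof(instructions))
        printf("instructions retired: %llu\n", (unsigned long long)instructions);

    close(fd);
    return 0;
}
```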

FWIW I’d be interested in creating a PR to add profiling support to the C track, given that I can grow my knowledge enough to do so properly within the next month or so.

While this is true, correct me if I’m wrong, but usually when you benchmark things it is to compare different objects or implementations against each other?
So even when the hardware differs from user to user, the idea behind it should still be the same: more efficient code is still faster than less efficient code, whether it is being tested on a “slow” system or a “fast” one. That’s why, when people test smaller pieces of code, they pump the iteration count up to hundreds, thousands, or even millions of runs to see a significant time difference.
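For illustration, a minimal sketch of that idea in C (the two sum functions are just placeholders): each implementation is run a large number of times and timed with clock_gettime, so the relative ordering shows up even though the absolute numbers differ from machine to machine.

```c
#include <stdio.h>
#include <time.h>

/* Placeholder implementations of the same task; the point is the harness. */
static long slow_sum(long n) { long s = 0; for (long i = 0; i <= n; i++) s += i; return s; }
static long fast_sum(long n) { return n * (n + 1) / 2; }

/* Run fn repeatedly and return the elapsed wall-clock time in seconds. */
static double bench(long (*fn)(long), long arg, int iterations)
{
    struct timespec start, end;
    volatile long sink = 0;   /* keeps the calls from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; i++)
        sink += fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);

    (void)sink;
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    const int iterations = 100000;   /* pump the iteration count up */
    printf("slow_sum: %.3f s\n", bench(slow_sum, 10000, iterations));
    printf("fast_sum: %.3f s\n", bench(fast_sum, 10000, iterations));
    return 0;
}
```

On a fast machine and a slow one the absolute times will differ, but fast_sum should beat slow_sum on both.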

Also when you talk about efficiency, I assume you mostly mean speed and not space.

I don’t even think “trust the users” is a thing, since they can already post their findings in comments when they profile or benchmark.

When students are working through the practice exercises in a language track (some tracks do not have a “learning mode” yet), it may be presumed that they are already running whatever they want locally on their own computers. No trust is necessary, except from the platform to the user. It is their own code they would be running.

We already do static analysis for some tracks as well. That can compare solutions and may give some insight into “more performant” code, at least as one measure of performance.

But users/students are free to benchmark and do anything else that they may want to do with their code, already.

The same machine can also yield different results if it’s being used by other workloads at the same time :)

Thank you :slight_smile:

There are two conversations that have happened here, I think:

  1. Benchmarking two solutions by a student against each other.
  2. Benchmarking two students’ solutions against each other.

For the first, doing this locally is definitely the best way IMO: you have control over the environment, so you can be confident in the benchmarks, and it’s an important skill to learn and good to do yourself.

For the second, which was the OP question, if we had a repeatable infrastructure, then we’d love to do it. But we don’t. Our servers change, the workloads on the servers change, the performance of many test runners varies wildly - so it’s not a feature we can offer.

Non-timing benchmarks like you suggest (e.g. number of instructions, amount of memory allocated) are interesting alternative metrics that we could potentially measure at no increased cost or effort. But I’m not sure of the value without the raw benchmarking data too?