The Exercism docs say that there is a 20 second limit for a test runner to process a solution in production. That is a good thing to have.
Do tracks actually test the runtime of their example solutions in track CI? What experiences are there with runtime differences between GitHub CI and the Exercism production environment?
The background:
While upgrading the PHP track to newer versions of PHP and PHPUnit, we experienced serious performance issues. We hit the timeout of the GitHub action first, which could be mitigated by deactivating Xdebug. But we are still not far from the GitHub timeout.
While investigating further, I was able to narrow the problem down to one test taking ~1 minute to run. That was unexpected and turned out to be an issue with an assertion in a range of PHPUnit versions.
So while the exercise was successfully passing all the CI tests, students would likely have hit the 20 second timeout with any solution they submitted, because that one test took ~1 minute.
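For context, a minimal sketch of how Xdebug can be kept out of such a CI job, assuming the workflow uses the shivammathur/setup-php action (which may differ from what our workflow actually does):

```yaml
# Sketch of a PHP setup step with coverage drivers disabled.
# "coverage: none" tells setup-php to load neither Xdebug nor PCOV,
# avoiding the per-call overhead Xdebug adds to every test run.
- name: Set up PHP without Xdebug
  uses: shivammathur/setup-php@v2
  with:
    php-version: '8.2'
    coverage: none
```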
For the Python content repo, we build the test runner and run each example & exemplar solution against their test suites for each PR to main.
While we aren’t specifically timing them, pytest does report how long each took to run, and we output that during the CI run. Here is a typical run for the test runner phase.
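If you want that same report locally, pytest’s built-in `--durations` option prints the slowest tests; a rough sketch, where the exercise path is just my guess at a content repo layout:

```bash
# Print the 10 slowest test phases for one exercise;
# --durations=0 would list every single test instead.
python -m pytest --durations=10 exercises/practice/alphametics
```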
We’ve rarely hit any timeouts with GitHub CI (that I recall), and when we have, it’s usually been due to not being able to pull/build a container, some permission error, or some flat-out error or bug in one of the action scripts. Most of our issues cause CI failure as opposed to CI timeout.
Timeouts in CI are (for me) a cause for concern, since I assume that CI runs slightly quicker than the production Docker containers do.
We also have tests for the test runner, but have yet to hook up the content repo tests in the test runner repo.
Thanks for your feedback. We can currently scan the CI logs manually for such slow tests, but that is error-prone, so I’d like to set a limit for reporting slow tests. But which limit is appropriate?
The TL;DR? Maybe if we did do timeouts for each exercise in CI and were being generous, we’d set them to 0.5–1 second per exercise, just in case there were GH slowdowns or pytest hiccups we didn’t account for.
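If we ever did enforce that, one low-tech sketch would be wrapping each exercise’s test run with GNU coreutils’ `timeout` (the path and the 1 second limit here are placeholders, not something we actually do):

```bash
# Abort the whole exercise test run if it exceeds 1 second.
# GNU timeout exits with status 124 when the limit is hit,
# which would fail the CI step.
timeout 1 python -m pytest exercises/practice/alphametics
```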
That being said, Python’s average run time for testing solutions in production is set to 2 seconds in config.json.
But I can’t speak for your tests or tooling; I don’t know what a “reasonable” overhead is for PHP tests. I do think it should be well below that production limit of 20s for Docker spin-up and test suite completion.
For Python, if any one exercise took half a second for testing, it would be significant. The entire run of building the docker container and running all exercise example solutions against all test suites (144 exercises) completes in CI in 45-55s. So if that piece of the CI run jumped up in time, we’d be looking hard at why.
Pulling, building, installing dependencies, and running a solution through the test-runner Docker container on my local machine takes 15.9/16s, with the test run portion for a single solution taking around 0.17s. My internet is not blazing fast, and neither is my computer, so production is quite a bit quicker. But that still means spin-up is the bigger portion of the time, so the testing itself needs to take less than half of that time limit.
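If you want to reproduce that split yourself, here is roughly what I mean; the image name, mounts, and arguments follow the generic Exercism test runner interface (slug, input dir, output dir), so treat them as placeholders rather than our exact CI commands:

```bash
# Rough timing of "pull/build + container start + test run" for one solution.
# The directories must exist locally; output lands in ./output/results.json.
time docker run --rm --network none \
  --mount type=bind,src="$PWD/alphametics",dst=/solution \
  --mount type=bind,src="$PWD/output",dst=/output \
  exercism/python-test-runner alphametics /solution /output
```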
@ErikSchierboom Is there any resource we could look into for runtimes of tests in production? Log files with timing information?
@BethanyG Thanks for all of your feedback. I did a similar analysis for PHP and came to the conclusion that a limit of about 1 second per test would work for all but one test in CI. But that limit would not be OK for my computer, which takes > 10x the CI time to run the tests. The online test runner surely isn’t that slow compared to CI.
So I think I’ll go for 5 seconds per test as the CI limit. That’s hopefully well below the 20 seconds for the whole container enchilada in production and well above any CI runtime hiccups. It also leaves headroom for contributors to submit unoptimized solutions.
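A sketch of how that 5 second limit could be declared, assuming we enforce it via PHPUnit’s time-limit settings in phpunit.xml (this needs the pcntl extension and phpunit/php-invoker installed, so it’s an idea rather than a done deal):

```xml
<!-- Sketch: fail any test that runs longer than 5 seconds in CI.
     enforceTimeLimit requires ext-pcntl and phpunit/php-invoker. -->
<phpunit enforceTimeLimit="true"
         defaultTimeLimit="5">
    <!-- existing testsuites, source filters, etc. -->
</phpunit>
```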
But I really would like to have some data for that.
OK, so no data. Maybe I will go crazy about this and write some solutions to put into the online editor that measure total time from first to last call of my code…or probably not.
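If I ever do, the idea would be something like this hypothetical sketch (not a real exercise solution): record a timestamp on the first call into the solution code and print the elapsed wall time when the process shuts down, so the number is visible somewhere in the runner’s output.

```php
<?php

declare(strict_types=1);

// Hypothetical instrumented "solution": remembers when it was first called
// and writes the elapsed wall time to STDERR on shutdown.
class TimedSolution
{
    private static ?float $firstCall = null;

    private static function mark(): void
    {
        if (self::$firstCall === null) {
            self::$firstCall = hrtime(true) / 1e9;
            register_shutdown_function(static function (): void {
                $elapsed = hrtime(true) / 1e9 - self::$firstCall;
                fwrite(STDERR, sprintf("total time: %.3fs\n", $elapsed));
            });
        }
    }

    public function solve(string $puzzle): array
    {
        self::mark();
        // ... the actual exercise logic would go here ...
        return [];
    }
}
```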
Here is some very unscientifically gathered data to look at.
TL;DR: The production test runner is ~3x faster than GitHub CI.
All systems ran PHP 8.2 on Alpine or Ubuntu Linux; times are measured in minutes:seconds.
I picked the longest-running single test case we have in the PHP track (alphametics, ten letters). Then I ran the example code in the online test runner, measuring the runtime of that test case only. I ran the very same solution code on my local machine and measured the same test case’s performance. This gave me a somewhat data-backed performance relation between my machine and the online test runner.
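For anyone wanting to reproduce this, a single PHPUnit test case can be isolated and timed with `--filter`; roughly like this, where the file path and test method name are approximations rather than the exact names from the track:

```bash
# Run and time only the ten-letter alphametics test case.
time vendor/bin/phpunit --filter 'testTenLetters' \
  exercises/practice/alphametics/AlphameticsTest.php
```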
| Production test runner | My machine |
| --- | --- |
| 00:02.952 | 00:04.948 |
| 00:02.733 | 00:04.981 |
| 00:04.989 | 00:05.030 |
| 00:02.596 | 00:04.969 |
| 00:03.574 | 00:04.973 |
| 00:02.738 | 00:04.988 |
| 00:02.892 | 00:04.951 |
The online test runner varied much more in performance than my machine (no surprise to me), so a fair comparison is the slowest run of the test runner against the average of my machine.
4.989s vs. ø 4.977s → test runner is as fast as my machine
To compare my machine against GitHub CI, I used the same exercise (alphametics), but compared the total time of all the tests (which can be seen in the CI logs).
| My machine | GitHub CI |
| --- | --- |
| 00:14.178 | 00:42.455 |
| 00:14.299 | 00:43.015 |
| 00:14.438 | 00:42.605 |
| 00:14.256 | 00:42.682 |
| 00:14.593 | 00:43.054 |
| 00:14.317 | 00:42.973 |
| 00:14.254 | 00:43.713 |
As both machines seem to perform quite consistently, comparing the averages should be OK.
ø 14.333s vs. ø 42.928s → GitHub CI takes ~3x longer than my machine