I’m in the process of writing a test generator for the Rust track. I’m sufficiently far along that I think it would be great to get some input from people who may have experience with this.
Here’s what I’ve done already:
I’ve got Rust data structures for things like the track config.json and the exercise .meta/config.json, so I can easily check their structure and manipulate them.
I have a small CLI tool that adds an entry for a new practice exercise to the track config, with slug, name and difficulty (which I think are the only required properties).
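For illustration, here is a minimal sketch of what such a config entry and the "add exercise" step could look like. It is a simplification under assumptions: PracticeExercise and add_practice_exercise are hypothetical names, the real entry in config.json has more fields than these three, and the 1..=10 difficulty range is assumed rather than taken from the tool described above.

```rust
/// Hypothetical, simplified model of one entry in the track config's
/// practice exercise list; the real entry carries more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct PracticeExercise {
    pub slug: String,
    pub name: String,
    pub difficulty: u8,
}

/// Adds a new practice exercise, refusing duplicate slugs and
/// difficulties outside 1..=10 (an assumed valid range).
pub fn add_practice_exercise(
    exercises: &mut Vec<PracticeExercise>,
    slug: &str,
    name: &str,
    difficulty: u8,
) -> Result<(), String> {
    if !(1..=10).contains(&difficulty) {
        return Err(format!("difficulty {difficulty} is outside 1..=10"));
    }
    if exercises.iter().any(|e| e.slug == slug) {
        return Err(format!("an exercise with slug `{slug}` already exists"));
    }
    exercises.push(PracticeExercise {
        slug: slug.to_string(),
        name: name.to_string(),
        difficulty,
    });
    Ok(())
}
```

Rejecting duplicates up front keeps a repeated CLI invocation from silently producing two entries with the same slug.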
And here’s what I’m planning to do:
Shell out to configlet sync to generate part of the exercise skeleton.
Write a program to generate the Rust-specific program skeleton.
Use the local cache of problem-specifications maintained by configlet to read test data and generate tests from it.
The generated test cases defer to a separate function implemented by the exercise author. This function may do some data massaging, since we’re getting JSON from problem-specifications. We don’t want students to have to output JSON, and we don’t want the author to have to adjust every test case manually.
I want syncing test suites with updates from problem-specifications to be easy and fast, so I intend to add a comment marker to the test file (e.g. “// below here is generated code, do not edit!”).
Custom tests that cannot be upstreamed for some reason can be added above that line.
The program for generating a new exercise and the one for syncing with problem-specifications share most of their code. The syncing workflow simply skips overwriting gaps the exercise author has already filled in.
The ideal workflow I’m aiming at is one command, one commit, one PR, and the whole repository is perfectly in sync with problem-specifications. Adding a new exercise takes only as much effort as filling in the gaps that cannot be automated, with the tools pointing out what those gaps are.
What do experienced people think about this plan? Am I missing something? Any stumbling blocks to look out for?
The only thing I would change is the order of the tests. If I have understood you correctly, you would put the “special Rust-related” tests first and the “standard” test cases after them.
To my knowledge, (other) test runners work from top to bottom and stop when the first test fails. That might lead to exercises where the edge cases are tested first and the general ones afterwards. For TDD, that might make it hard for students to solve exercises one step at a time.
Practice exercises must have a tests.toml file. See this in the configlet docs for the details (there’s a sample one from Python here). See also the practice exercise spec for other requirements.
Not every test case for every exercise can be implemented in every track, so it’s important to periodically review incoming test cases to see whether they “make sense”, and to set the ones that don’t to include = false. And when an existing test case is changed in problem-specifications, it is important to look for the reimplements = <UUID> key.
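To make the include = false and reimplements points concrete, here is a sketch that reads the subset of TOML found in a typical tests.toml: [uuid] headers followed by description, include and reimplements keys. It is deliberately not a general TOML parser (a real tool would more likely use a TOML crate), it does not handle escaped quotes, and parse_tests_toml, TomlCase and should_generate are hypothetical names. A canonical case is treated as generated only if it is recorded in tests.toml and not excluded.

```rust
use std::collections::HashMap;

/// One `[uuid]` entry of a tests.toml file.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct TomlCase {
    pub description: String,
    pub include: bool,
    pub reimplements: Option<String>,
}

/// Minimal parser for the tests.toml subset: `[uuid]` headers followed by
/// `key = value` lines. Comments and blank lines are skipped.
pub fn parse_tests_toml(src: &str) -> HashMap<String, TomlCase> {
    let mut cases = HashMap::new();
    let mut current: Option<String> = None;
    for line in src.lines().map(str::trim) {
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        if let Some(uuid) = line.strip_prefix('[').and_then(|l| l.strip_suffix(']')) {
            // Cases are included unless explicitly marked otherwise.
            cases.insert(uuid.to_string(), TomlCase { include: true, ..Default::default() });
            current = Some(uuid.to_string());
            continue;
        }
        // Ignore key/value lines that appear before any header.
        let (Some(uuid), Some((key, value))) = (current.as_ref(), line.split_once('=')) else {
            continue;
        };
        let case = cases.get_mut(uuid).expect("header was inserted above");
        let value = value.trim().trim_matches('"').to_string();
        match key.trim() {
            "description" => case.description = value,
            "include" => case.include = value != "false",
            "reimplements" => case.reimplements = Some(value),
            _ => {}
        }
    }
    cases
}

/// A canonical case is generated only if it is recorded in tests.toml
/// and not marked `include = false`.
pub fn should_generate(cases: &HashMap<String, TomlCase>, uuid: &str) -> bool {
    cases.get(uuid).map_or(false, |c| c.include)
}
```

Keeping reimplements around makes it easy to list which older cases a change supersedes when reviewing an update.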
Because stuff … happens … I strongly recommend a CI workflow that validates existing auto-generated test files against freshly generated versions, to make sure nothing unexpectedly changed or got mutated. Here are some examples of what we’ve done on the Python track.
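A sketch of the check half of such a workflow, assuming the generator can first produce the expected content of each test file in memory. stale_files and report are hypothetical names and not the Python track’s script; a missing file counts as stale, and the value returned by report is meant to be used as the process exit code so that CI fails.

```rust
use std::fs;
use std::path::PathBuf;

/// `expected` pairs each test file path with the content the generator
/// produces today. Returns every file that is missing or differs.
pub fn stale_files(expected: &[(PathBuf, String)]) -> Vec<PathBuf> {
    let mut stale = Vec::new();
    for (path, fresh) in expected {
        match fs::read_to_string(path) {
            Ok(on_disk) if on_disk == *fresh => {}
            // Unreadable/missing files and mismatches are both reported.
            _ => stale.push(path.clone()),
        }
    }
    stale
}

/// Prints one line per stale file and returns an exit code (0 = all good).
pub fn report(stale: &[PathBuf]) -> i32 {
    for path in stale {
        eprintln!("{} is out of date; re-run the generator for it", path.display());
    }
    if stale.is_empty() { 0 } else { 1 }
}
```

Comparing whole files keeps the check simple; the error message is what tells a contributor which files to regenerate.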
I also recommend testing all example solutions against all test files, in case the examples need to change when test cases change.
Because you are already going to have to read the prob-specs JSON, I recommend that additional/custom test data for the track also be written in JSON, so you don’t have to write additional processing code. Here is an example from Python.
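Assuming both the canonical data and the track’s additional JSON file have already been parsed into the same struct (the parsing step, e.g. with a JSON crate, is left out to keep this std-only), merging them can be as simple as the sketch below. Case and merge_cases are hypothetical names, and real cases would also carry their input and expected values. Appending the track-specific cases after the canonical ones keeps the standard cases first, and a reused UUID is reported as an error.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory form of one test case after parsing; the same
/// shape is used for canonical and track-specific cases.
#[derive(Debug, Clone, PartialEq)]
pub struct Case {
    pub uuid: String,
    pub description: String,
    pub property: String,
}

/// Appends track-specific cases after the canonical ones and rejects
/// any additional case whose UUID is already taken.
pub fn merge_cases(canonical: Vec<Case>, additional: Vec<Case>) -> Result<Vec<Case>, String> {
    let mut seen: HashSet<String> = canonical.iter().map(|c| c.uuid.clone()).collect();
    let mut merged = canonical;
    for case in additional {
        if !seen.insert(case.uuid.clone()) {
            return Err(format!("duplicate test case uuid {}", case.uuid));
        }
        merged.push(case);
    }
    Ok(merged)
}
```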
I recommend having an option to use a local clone of Problem Specs, in case you are in a situation where you can’t (or don’t want to) connect to GH. Sounds silly, but it’s saved me on several occasions.
Thank you, this is very helpful! A couple of follow-up questions:
Because stuff … happens … I strongly recommend a CI workflow that includes validating existing auto-generated test files against their newly generated versions to ensure things didn’t unexpectedly change/get mutated.
Just to make sure I understand this correctly. The intent is to prevent humans from editing tests that were auto-generated? I’m guessing that’s what the --check flag here does:
Luckily, tests to make sure example solutions pass all tests are already in place, so that should prevent completely bad test cases from slipping through code review.
Recommend that you have an option of using a local clone of Problem Specs, in case you are in a situation where you can’t (or don’t want to) connect to GH.
Sounds like a great idea, how are you handling that? It looks to me like the most recent version of problem-specifications is cloned in CI. Wouldn’t that break when there is an update upstream, or am I misunderstanding how generate_tests.py --check operates?
We’re doing some work that will add a configlet command to do this (not yet available though).
Sounds great! I might keep around what I have already until that’s ready. Then I’ll just delete it. Less code is better code.
Are there many of these?
I have no idea. The intention is just to future proof. I like the approach in the python repo mentioned by @BethanyG. Additional tests are specified in a separate json file just like the problem spec tests.
The thing is, none of the exercises in the Rust repo have a tests.toml, and running configlet sync just produces a huge wall of warnings. It’s difficult to say how far the Rust repo has drifted from problem specs.
Here’s a recent example of where I was made aware of custom tests that weren’t upstreamed:
Yeah, that’s what I’ve been using for now. Bethany’s comment got me thinking about submodules. I’m still not sure, but the way I read the CI in the Python repo, it looks like it would break if the problem spec repo is updated. A submodule is just what came to mind to solve this problem: one could include a pinned version of the problem spec repo.
What I dislike about $XDG_CACHE_HOME/exercism/configlet/ is that it’s platform specific. I don’t wanna go out of my way to support proprietary operating systems, but a submodule sounds like a clean and low-effort solution.
Apologies for the delay – and for the long message!
That’s part of it, but it is also a check against someone not having updated their branch recently, or against problem specs having changed or deprecated a test case without those changes having been synced to the repo yet.
I guess this depends on how you define “break”. When a mismatch between the existing test files in the branch and the ones generated from problem-specs is detected, the CI run prints an error message about regenerating the mismatched test file(s), and any mismatched test files cause CI to fail. When this happens, we go back, use the same script manually in “create” mode to regenerate/update the failing test files, and check them into the branch.
There are two issues with this that I hope to address at some point:
If new test cases are added upstream, the tests.toml file has to be updated with them before generation, because generation only happens for tests recorded in tests.toml. This means we occasionally miss new upstream test cases, although not often.
Simply pulling in every new test case without review doesn’t quite work, and I’m not sure how to handle the review of new cases smoothly without doing it manually.
This sucks if the PR in question didn’t touch any of the mismatched files, but usually I can spot the error quickly and help anyone who gets confused, either by doing the update myself, having them refresh their branch, or editing their branch directly to refresh/add the files.
Often, I also use this as a reminder to do a general sync to pick up any interim updates/additions to problem specifications metadata or exercise text.
Ideally, the script would do the sync automagically by calling configlet to update tests.toml with a yes flag for any additions, but … I haven’t thought through all the steps of that yet, so it’s half-automated and half-manual.
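The first issue, new upstream cases being skipped because they aren’t in tests.toml yet, can at least be made visible. A sketch, assuming the canonical UUIDs (in upstream order) and the UUIDs recorded in tests.toml are already loaded; unreviewed is a hypothetical helper that lists the cases still waiting for an include/exclude decision instead of silently dropping them.

```rust
use std::collections::HashSet;

/// Returns the canonical case UUIDs, in upstream order, that have no
/// entry in tests.toml yet and therefore still need a human decision.
pub fn unreviewed(canonical: &[&str], recorded: &HashSet<String>) -> Vec<String> {
    canonical
        .iter()
        .filter(|uuid| !recorded.contains(**uuid))
        .map(|uuid| uuid.to_string())
        .collect()
}
```

A CI job could warn (or fail) when this list is non-empty, while configlet sync remains the tool that actually records the decision in tests.toml.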
But I digress. The scenario where you would work off a local copy of problem specifications is when you are adding an exercise, adding track-specific tests, or updating a bunch of test files and want to do that from the local clone instead of re-cloning problem specs. The script has a flag for specifying the path to the desired prob-specs repo. But you could also just use the configlet cache, and tell configlet to work offline.
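A sketch of how a generator could choose which problem-specifications checkout to read from: an explicit path wins (from a hypothetical command-line flag, or a pinned submodule at a hypothetical path such as vendor/problem-specifications), otherwise it falls back to configlet’s cache. The cache location used here, problem-specifications under $XDG_CACHE_HOME/exercism/configlet with ~/.cache as the fallback, is an assumption based on Linux defaults; other operating systems keep it elsewhere.

```rust
use std::env;
use std::path::PathBuf;

/// Picks the problem-specifications directory: the explicit path if given,
/// otherwise the assumed configlet cache location. Returns None if neither
/// XDG_CACHE_HOME nor HOME is set. The directory is not checked for existence.
pub fn prob_specs_dir(explicit: Option<PathBuf>) -> Option<PathBuf> {
    if let Some(dir) = explicit {
        return Some(dir);
    }
    let cache = env::var_os("XDG_CACHE_HOME")
        // An empty XDG_CACHE_HOME is treated as unset.
        .filter(|v| !v.is_empty())
        .map(PathBuf::from)
        .or_else(|| env::var_os("HOME").map(|h| PathBuf::from(h).join(".cache")))?;
    Some(cache.join("exercism").join("configlet").join("problem-specifications"))
}
```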
Why doesn’t it work? If the example solution passes the new test cases, it’s probably fine to include, right?
This sucks if the PR in question didn’t touch any of the mis-match files, but usually I can see the error quickly and help
Yeah, I think I want to avoid that. As I’m currently the only maintainer of the Rust track after months of nobody maintaining it, I can’t be sure there will always be someone around to help with surprises like this.
I’ll try the submodule approach to always have a pinned version locally. Looking ahead, I could add a GitHub Actions cron job that updates the submodule when there are upstream changes, basically like Dependabot.
Python uses the Jinja templating system for this
Oh wow, that looks awesome. I especially like that tests generated with templates like this are probably easier to read for students than ones that do nothing but pass json to a different function. Definitely gonna try it out.
… this is probably true. Hum. Now you have me thinking about how I could make things better in the test generator. I probably won’t get to it for a bit (currently doing violence to the Python test runner…), but I certainly have some more ideas on my list. Thank you.
ooh. This. An auto sync job to pull in test case changes and warn if any example solutions fail the newly generated test files…or something like that. Hum… I am going to have to play with that!