So I have been fooling around with implementing exercises lately, mainly on the JS track, and I noticed that the configlet docs mention test generation. That got me wondering: which tracks currently have a test generator?
I think JS doesn’t, at least to my knowledge, so I’ve been writing tests manually like a caveman. (Correct me if I’m wrong, @ JS maintainers.)
Also, if one was intending to write/implement a test generator for a specific track, where should one start?
There are quite a few tracks with generators, including Rust, Python, C#, F#, Crystal, and more.
My general advice would be to keep it simple. Don’t try to create a one-size-fits-all solution (like I did for C# and F#); instead, create a little scaffolding and then have exercise-specific templates.
I’m happy to help, since I wrote the Rust generator recently (and am still improving it), so all the little bumps along the road are still fresh in my memory.
Also, if one was intending to write/implement a test generator for a specific track, where should one start?
To give a very concrete answer here (but still just a suggestion): Write a program that does these very simple steps:
read an exercise slug as a command line argument
read canonical-data.json for that exercise; this is the input to the template
read the template for the exercise, e.g. in $SLUG/.meta/tests.my_fav_tmpl_engine
instantiate the template with the canonical data
write the result to the test file
That would be a pretty decent MVP. From there, you can extend according to your needs.
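To make that concrete, here is a minimal sketch of those five steps in Rust with the Tera engine. It is purely illustrative; the crates, paths and file names are assumptions, not what any particular track actually does:

```rust
// Minimal sketch of the five steps above. Assumes the `tera` and `serde_json`
// crates; all paths and file names here are illustrative placeholders.
use std::{env, fs};
use tera::{Context, Tera};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. exercise slug as a command line argument
    let slug = env::args().nth(1).ok_or("usage: generator <slug>")?;

    // 2. canonical data is the input to the template
    //    (path assumes a problem-specifications checkout, e.g. a submodule)
    let canonical_path = format!("problem-specifications/exercises/{slug}/canonical-data.json");
    let canonical: serde_json::Value = serde_json::from_str(&fs::read_to_string(canonical_path)?)?;

    // 3. the exercise-specific template
    let template = fs::read_to_string(format!("exercises/practice/{slug}/.meta/test_template.tera"))?;

    // 4. instantiate the template with the canonical data
    let rendered = Tera::one_off(&template, &Context::from_serialize(&canonical)?, false)?;

    // 5. write the result to the test file (name and location are track specific)
    fs::write(format!("exercises/practice/{slug}/tests/{slug}.rs"), rendered)?;
    Ok(())
}
```

A real generator would grow from there (handling nested test case groups, for instance), but something this small is already enough to iterate on.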
There is a bit of a gotcha with reading canonical-data.json: it lives in the problem-specifications repository. I decided to add that as a submodule, but you can also use configlet’s cache directly. The location of that cache is platform dependent though, see also this discussion.
Python uses Jinja2 templates. I’m planning on setting that up for the Pyret track since there’s a lot of test suite boilerplate. It does add a bit of a barrier for contributors, but I don’t believe there’s a lot that will need to be tweaked on a case-by-case basis.
We haven’t done as much base templating as we maybe should on the Python track.
I think a lot of boilerplate could be put into a well-thought-out base template plus a few macros, which could then be more easily extended by contributors who have less Jinja2 knowledge.
LMK if you want a hand there; I might be able to help with some of the templating.
I wasn’t necessarily planning to make a test generator, but I may consider it more seriously, and if I do, the steps @senekor outlined seem pretty straightforward to follow.
About the canonical-data.json problem: I think taking it from the cache is fine-ish, since you’re expected to run configlet sync before you start writing tests, and that fetches the latest available files. Another approach might be fetching it directly from GitHub, if possible?
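For what it’s worth, fetching directly from GitHub could look roughly like this in Rust. This is just a hedged sketch: the raw URL layout and the main branch name are assumptions, and it adds a network dependency that the configlet cache or a submodule would avoid:

```rust
// Rough sketch of fetching canonical data straight from GitHub. Assumes the
// `reqwest` crate with its `blocking` and `json` features enabled, and assumes
// the raw URL layout and branch name ("main") of problem-specifications.
fn fetch_canonical_data(slug: &str) -> Result<serde_json::Value, Box<dyn std::error::Error>> {
    let url = format!(
        "https://raw.githubusercontent.com/exercism/problem-specifications/main/exercises/{slug}/canonical-data.json"
    );
    Ok(reqwest::blocking::get(url)?.error_for_status()?.json()?)
}
```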
For templating, maybe something like Handlebars could work? Tho I haven’t used it in a while, so I’d have to reacquaint myself with it.
I do think I see a bit of an issue with this approach though: you’d have to have a template for every exercise, and that may end up being the same amount of effort as just writing the tests yourself. Maybe we could agree on a language-agnostic template engine and include a template in problem-specifications for each exercise?
The PHP track is currently also looking for a good way to implement a test generator.
Lesson #1 I took from the attempts so far: don’t worry about the templates; it’s the structure of canonical-data.json that needs to be parsed and processed.
Lesson #2: the generated code should not pretend to be “done”. Having all the comments and unexpected things dumped into the output, forcing the user to read and think about them, matters more than perfect formatting or handling every edge case.
Anyway, regarding @ErikSchierboom saying “exercise template”: if having a template per exercise is what’s recommended, couldn’t it be a generic generator with a common templating language?
Maybe it could be added to configlet, and implementers would then write the template for each exercise in some common template engine syntax. But then it boils down to pretty much the same thing: writing templates manually instead of tests, and I can’t tell whether that’s better.
I’m wondering if it’s sensible to have a generic track-specific template, i.e. a catch-all for exercises that have canonical data but don’t have a test template of their own?
It could, but I’d recommend against it. It’s what I did in C#, and that got quite complex. The boilerplate stuff could live in a shared template, I guess, but an exercise-specific template is likely easier to read. What I’m arguing in favor of is:
Writing templates is better! It means you can automate pulling in updates from problem-specifications.
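To illustrate what an exercise-specific template might look like, here is a toy Tera template for a hypothetical leap-year exercise, embedded in a small Rust snippet. The canonical-data shape assumed here (a top-level cases array with description, input and expected fields) and the is_leap_year function name are assumptions made for the example:

```rust
// Toy example: an exercise-specific Tera template as a raw string, rendered
// with the exercise's canonical data. Field names, the test layout and the
// tested function are illustrative assumptions, not any track's convention.
use tera::{Context, Tera};

const LEAP_TEMPLATE: &str = r#"
{% for case in cases %}
#[test]
fn {{ case.description | replace(from=" ", to="_") }}() {
    assert_eq!(is_leap_year({{ case.input.year }}), {{ case.expected }});
}
{% endfor %}
"#;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let canonical: serde_json::Value =
        serde_json::from_str(&std::fs::read_to_string("canonical-data.json")?)?;
    let tests = Tera::one_off(LEAP_TEMPLATE, &Context::from_serialize(&canonical)?, false)?;
    println!("{tests}");
    Ok(())
}
```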
I’ve thought about that as well. I can’t actually think of many reasons why that wouldn’t work, apart from the fact that everyone has to agree on The Best Templating Language™. Rust is using Tera at the moment, others Jinja, maybe more?
Speaking only for the Python track, the templating is effortful for more complex exercises that might have error handling or particular conditions or nested logic. But for 80%+ of the exercises, that’s not the case. It is also effort that is expended more or less once at exercise creation.
The payoff is not only ease of update when canonical data changes – it also means we don’t have creeping errors through PRs where contributors try to “improve” tests and make a mistake – or someone is convinced that a test is wrong when it is not wrong. While those things can be caught in review, no one is going to catch all the small errors all the time. Knowing that canonical tests largely don’t change at the repo level really cuts down on ongoing review effort.
It also serves as a checkpoint against the addition of problematic tests, since contributors have to discuss test changes at the problem-spec level, or make a case for a particular Python-specific test (which we handle with an additional_tests.json file).
There are more advantages – but I will get down off my soapbox.
I am deeply uninterested in re-writing all of Python’s test templates (or testing rewritten ones – we have 115+ at the moment). I’m also reluctant, as a maintainer, to learn new tooling & syntax to support something that already works well for the track. I’m guaranteed to introduce a bunch of problems. So I’d make a strong argument for a track-by-track opt-out there.
Each templating language I’ve come across assumes or requires a different programming language and tooling setup, so we’d need to agree on that as well…
This list isn’t fully accurate or up-to-date (for example, the Crystal track uses the Crystal-specific ECR for templating), but here is a Wikipedia rundown of some templating engines. It doesn’t have Handlebars or Mustache or probably many others, but there are quite a few there.
I agree that rewriting anything is not worth it. Saying that we all have to agree on the template language was silly; of course tracks can keep doing what they want to do.
Still, it might be nice to have a decent default solution available for tracks that don’t have their own generator yet. I wonder how many tracks do and don’t have generators already, and whether maintainers of the smaller tracks would be interested in this. I haven’t had the experience of building a new track from scratch, but I imagine having a generator available from the start would speed things up.
I would be happy to bang the Rust generator into a generic shape if that is useful to others. It should be as simple as dropping a couple of Rust-specific goodies. The Tera engine is fast and easily extensible with simple Rust functions. If the generator were to be included in configlet, though, it would have to be written in Nim.
I think, from reading through all the contributions, there are two kinds of test generators required. For me, it boils down to:
From Scratch vs. Updating
“Updating” requires per-exercise templates that match the current design of the exercise and produces repeatable, production-ready output. “From scratch” requires a boilerplate template, and its output must contain everything needed to design the production-ready exercise.
These are, in my eyes, fundamentally different. And attempts to do both in one must fail.
“From scratch”
This requires some kind of basic template that produces output to help design the exercise: all comments from the canonical data must be in there, all the descriptions from test case groups, anything provided by canonical-data.json.
The template may match a common test case shape (like calling a function with input and comparing to expected). But it will never produce production-ready exercises. It will produce something from which to design the exercise (the student interface and possible solutions).
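As a rough illustration of such a boilerplate template (again Tera syntax in a Rust raw string; the field names are assumptions and nested test case groups are ignored here), something like this would dump everything the designer needs to see without pretending to be finished:

```rust
// Sketch of a catch-all "from scratch" template: it dumps each case's
// description and data as TODO-style comments rather than producing finished
// tests. Field names are assumptions; nested case groups would need extra
// handling (or a recursive macro).
const FROM_SCRATCH_TEMPLATE: &str = r#"
{% for case in cases %}
// case: {{ case.description }}
#[test]
fn todo_name_me_{{ loop.index }}() {
    // TODO: design the student-facing interface for this case
    // input:    {{ case.input | json_encode() }}
    // expected: {{ case.expected | json_encode() }}
    unimplemented!()
}
{% endfor %}
"#;
```

It would be rendered exactly like the per-exercise templates in the earlier sketch; the point is only that the output asks to be edited rather than looking done.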
“Updating”
The goal is to ease the repeated updating of exercises without redesigning them. Here, one template per exercise makes total sense. It ignores all the additional information from comments or groups of test cases and focuses on rendering the previously designed test file.
This very much relies on a stable canonical-data.json structure for each exercise. Anything new (like adding an error handling scenario where there was none before) will require additional template development; the generator will neither fail nor help there.
Bringing them together
This is not possible. On the one hand, it would require parsing canonical-data.json and deciding whether it fits the template for “from scratch” generation or must be rendered into some “unknown” catch-all output. On the other hand, it should only hand all the data to a template engine, with the exercise-specific template containing all the knowledge of how to render the exercise’s data.
The part that may be shared is parsing the canonical data and providing the template engine with the required set of data. But even that involves knowing what the template requires. So combining the two makes no sense to me.
Okay, I think I do see the point of a track-by-track approach, but it might still be useful to have some sort of starting point when trying to implement a generator for a track that doesn’t have one. It doesn’t necessarily have to be a generic generator; maybe we could write some sort of guideline doc for generators, so that we don’t have to keep opening forum posts for every test generator.
Also, I still think that having a sort of catch-all template at the track level may be useful. It’d be used to get you started when implementing from scratch an exercise that has canonical data but no template of its own. By design it wouldn’t produce production-ready tests, so that you are forced to look at them and not rely on the tool completely.
If, on the other hand, an exercise needs to be updated or already has its own template, that template would be used with priority to produce reliable, predictable output.
I like the conceptual separation of “from scratch” and “updating”. My generator does both, with the expected limitation: if the exercise doesn’t exist yet, it generates some non-perfect and likely non-functional stubs and defaults. If the exercise does exist, it generates the test cases from the manually adjusted template. So, for adding a new exercise, I have to run the generator at least twice:
run generator
fix incomplete stubs and test template
run generator again
I’m not sure how familiar you are with all of the different shapes and sizes canonical data can take. My experience has been that about every second exercise has some weird oddity not seen anywhere else. I’d say it’s next to impossible for a catch-all template to produce anything of value. I think that’s what Eric’s trying to get at as well: it’s not worth the effort to try to make something that works in the general case.
Anyway, there’s a weaker version that probably provides what you want: In the “from scratch” scenario, just dump a (non-functional) default template into the exercise directory as a starting point for the manual work. That’s what the Rust generator does and it works great.
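In code, that fallback is only a few lines. A hedged sketch, with made-up paths and names:

```rust
// Rough sketch of the "dump a default template if none exists yet" fallback.
// Paths, file names and the default template content are illustrative.
use std::{fs, io, path::Path};

// A deliberately non-functional starter template shipped with the generator
// (in practice this could be include_str!'d from the generator's sources).
const DEFAULT_TEMPLATE: &str = "// TODO: flesh out this template for the exercise\n";

fn load_or_seed_template(slug: &str) -> io::Result<String> {
    let path = format!("exercises/practice/{slug}/.meta/test_template.tera");
    if !Path::new(&path).exists() {
        // "from scratch": seed the exercise with the default template
        // as a starting point for the manual work
        fs::write(&path, DEFAULT_TEMPLATE)?;
    }
    // "updating": from here on, the per-exercise template is the source of truth
    fs::read_to_string(&path)
}
```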