[Maintainers] Helping us build training data for tags

Hello lovely maintainers :wave:

We’ve recently added “tags” to students’ solutions. These express the constructs, paradigms and techniques that a solution uses. We are going to be using these tags for lots of things, including filtering, pointing a student to alternative approaches, and much more.

In order to do this, we’ve built out a full AST-based tagger in C#, which has allowed us to do things like detect recursion or bit shifting. We’ve set things up so other tracks can do the same for their languages, but it’s a lot of work, and we’ve determined that it may actually be unnecessary. Instead we think that we can use machine learning to achieve tagging with good-enough results. We’ve fine-tuned a model that can determine the correct tags for C# from the examples with a high success rate. It’s also doing reasonably well in an untrained state for other languages. We think that with only a few examples per language we can potentially get some quite good results, and that we can then refine things further as we go.

I released a new video on the Insiders page that talks through this in more detail.

We’re going to be adding a fully-fledged UI in the coming weeks that allows maintainers and mentors to tag solutions and create training sets for the neural networks, but to start with, we’re hoping you would be willing to manually tag 20 solutions for your track(s). We’ll be creating an issue on your track where we’ll add 20 comments, each with a student’s solution and the tags our model has generated. Your mission (should you choose to accept it) is to edit the tags on each comment, removing any incorrect ones and adding any that are missing. In order to build one model that performs well across languages, it’s best if you stick as closely as you can to the C# tags. Those are listed here. If you want to add extra tags, that’s totally fine, but please don’t arbitrarily reword existing tags, even if you don’t like what Erik’s chosen, as it’ll just make it less likely that your language gets the correct tags assigned by the neural network.

If you have questions or want to discuss things, it would be best to do so here (either in this post or in separate posts), so the knowledge can be shared across all maintainers in all tracks. Thanks!


In the GitHub issue, what are the mechanics for “updating the tags” in the ticket? How specifically do I indicate my changes to the list of tags for a particular solution?

I believe you can edit the comments and make your changes directly to them. If you can’t (because we’ve misunderstood GH permissions) then this whole mechanism will collapse :slight_smile:

In the docs I can’t find ranges? It feels like a concept/literal that appears in many languages, so I feel like it should be there?


I’m not seeing an issue for the Python track (yet). When should I expect one - or should I be doing this via other means?

I got rate limited. It’ll be there tomorrow!


I came here to ask the same question. It’s really not clear from the GitHub issue you’ve opened :grin:

If you can’t (because we’ve misunderstood GH permissions) then this whole mechanism will collapse :slight_smile:

I do have permissions to update the comments.

@ErikSchierboom I am under the impression that the suggested tags included in the 20 solutions were supposed to come from the “approved” tags listed in Tagging solutions | Exercism’s Docs, based on the C# tags.

Well, the ones in the Elixir solutions are a big mess. There are so many suggested tags that aren’t in the approved list. For example, I found these tag pairs, which are just different wordings of the same concepts:

construct:charlist
construct:char-list

construct:fn
construct:function

construct:defp
construct:private-function

construct:when_guard
construct:guard

Then I found some tags that make no sense to me (I don’t know what they mean) and they aren’t explained on the list of approved tags:

construct:ampersand
construct:annotation
construct:definition
construct:header
construct:underscore
construct:tagged-atom

Then there are also tags that use snake_case for some reason:

construct:date_to_seconds
construct:underscored_number
construct:if_unless
construct:implicit_capture

I wouldn’t mind helping out removing tags that don’t apply to the specific solution, but if I’m also supposed to manually check whether every tag is one of the preapproved tags, that’s just too much work…

I’m looking at the 20 solutions for the Java track and based on that I have some thoughts/questions:

  1. A couple of the submissions listed there are actually not solutions but test files, presumably also submitted by the students. We should find a way to make sure these are filtered out before having a neural network tag the solution, as the tags will probably be incorrect.
  2. At least one of the submissions listed there is incomplete. Should these be tagged at all? IMO it would make more sense if submissions were only tagged once a student publishes them and/or they pass all of the tests.
  3. The number of construct tags is a bit over the top. Tags like construct:assignment can probably be applied to 99% of solutions in the Java track, effectively rendering them useless.

I am under the impression that the suggested tags included in the 20 solutions were supposed to come from the “approved” tags listed in Tagging solutions | Exercism’s Docs, based on the C# tags.

The network is outputting text, so it can output pretty much anything. It’s pretty consistent now that it’s been trained on C#, but inconsistent for other languages, as it only knows tags in the context of C#, so it makes things up for things like defp.

Choose the one that is most sensible (e.g. private-function is better than defp in my eyes, or maybe even normalise to function).

So I think, for example, underscore refers to when Elixir ignores a value, when say doing a, _, b = [1,2,3] (presuming my memory is correct that Elixir works the same as Ruby). I don’t know what the idiomatic name for that behaviour is in Elixir (or Ruby or any other language), but we should standardise on that.
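To make that concrete, here’s a tiny sketch (Python shown, purely for illustration; Elixir and Ruby pattern matching have analogous forms) of the “ignore this value” behaviour I mean:

```python
# Purely illustrative: using _ to discard a value during destructuring.
a, _, b = [1, 2, 3]
print(a, b)  # 1 3

# Same idea as a throwaway loop variable:
for _ in range(2):
    print("value deliberately ignored")
```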

I wouldn’t mind helping out removing tags that don’t apply to the specific solution, but if I’m also supposed to manually check every tag if it’s one of the preapproved tags, that’s just too much work…

That’s totally fine. If you could just say in your final comment on the Issue that you’ve not checked them against the tags, we’ll know to do that. Many new tags will need to be added anyway :slight_smile:


@sanderploegsma Thanks for the comments!

  1. That’s useful to know. Thanks. Maybe we can delete those ones and generate you some extra ones. I’ll speak to @ErikSchierboom :slight_smile:
  2. I don’t think it hugely matters. These are the submitted, published solutions, but we haven’t checked historic solutions’ tests. That said, it probably does make sense moving forward to only use those we have as examples. Thanks for the prompt :slight_smile:
  3. Sort of true. But also (theoretically, for example), the one solution that doesn’t use assignment but still passes the tests is quite interesting. I think it would be harder to teach a neural network to ignore common things at this stage. I’d rather we weren’t telling it that assignment is not actually a construct just because it’s common. We do have code further downstream that means assignment won’t show in the filter list.

I’ve got multiple issues as well:

  1. Quite a few test files have been “chosen” as solutions.
  2. Deprecated exercises like “error handling”, “parallel letter frequency”, “trinary”, and “accumulate” have been chosen.
  3. Stub code (or very close to it) for “Markdown” has been chosen.
  4. Many of the exercises with the most solutions on the track have been omitted in favor of exercises that don’t have that many solutions and don’t use as many Python idioms (“rest api”, “go-counting”, “complex-numbers”).
  5. Samples of later (post-3.8/3.9) code are needed to make sure that things like the walrus operator, structural pattern matching, positional-only parameters, dictionary union operators … and more that I can’t remember off the top of my head are represented and tagged (see the sketch after this list).
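For reference, here’s a rough snippet (made up for illustration, not from any student solution; the function names are my own) showing the constructs I mean, so it’s clear what the tagger would need to recognise:

```python
# Hypothetical example of post-3.8/3.9 constructs, not a real submission.

def merge_settings(defaults, overrides, /):      # positional-only parameters (3.8)
    return defaults | overrides                  # dictionary union operator (3.9)


def describe(values):
    if (n := len(values)) > 3:                   # walrus operator (3.8)
        return f"long list of {n} items"
    match values:                                # structural pattern matching (3.10)
        case []:
            return "empty"
        case [x]:
            return f"single item: {x}"
        case [x, *rest]:
            return f"{x} followed by {len(rest)} more"


print(merge_settings({"a": 1}, {"b": 2}))
print(describe([1, 2]))
```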

Additionally, I’ve noticed that many, many of the tag groups show lambda where no lambdas are used in the code. I suspect that the model is having trouble recognizing comprehension and generator syntax and is mis-classifying it as lambda.
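To illustrate the confusion I suspect is happening, here are the two constructs side by side (a made-up snippet); only the last expression actually uses a lambda:

```python
words = ["alpha", "Beta", "gamma"]

# Comprehension / generator syntax -- no lambda involved, but I suspect
# the model tags code like this as construct:lambda.
lengths = [len(w) for w in words]
total = sum(len(w) for w in words)

# An actual lambda, which is what the tag should mean.
sorted_words = sorted(words, key=lambda w: w.lower())

print(lengths, total, sorted_words)
```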

Also noticing a lot of code classified as OOP when class definitions are not being used (an example is the Word Count code, which uses an object from the collections module but does so with a plain function). But I’m also seeing higher-order-functions and functional tags being applied to code that defines classes and doesn’t really use any “functional” tools or patterns.
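Roughly the shape of code I mean (a made-up sketch, not the actual student solution): it uses an object from the collections module but is a plain function, so tagging it as OOP seems off.

```python
from collections import Counter


def count_words(sentence):
    # Uses a class from the collections module, but no class is defined here.
    return Counter(sentence.lower().split())


print(count_words("the quick brown fox jumps over the lazy dog the end"))
```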

I keep seeing visibility and visibility-modifiers come up as tags. To my knowledge, these aren’t really an explicit “thing” in Python, and I’m not clear on why they’re showing up for solutions.

Before I wade further in, I think I’d like to get rid of all the test files and deprecated exercises and pick a set of exercises that is more indicative of what students are likely to submit going forward. I’d also like to add in some specific solutions that use constructs that may or may not be easily tagged (things like lambdas, comprehensions, the walrus operator, structural pattern matching, context managers, property decorators, etc.), so we know what we are up against.
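For instance (a purely illustrative snippet with made-up names, not a real submission), something like this would exercise a property decorator and a context manager:

```python
from contextlib import contextmanager


class Temperature:
    def __init__(self, celsius):
        self._celsius = celsius

    @property
    def fahrenheit(self):          # property decorator
        return self._celsius * 9 / 5 + 32


@contextmanager
def log_step(name):                # user-defined context manager
    print(f"start {name}")
    try:
        yield
    finally:
        print(f"end {name}")


with log_step("conversion"):
    print(Temperature(100).fahrenheit)  # 212.0
```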


All of those are valid issues, but I don’t think any of them will particularly stop a neural network from learning. The reason “visibility modifiers” comes up is that it’s hard to write a C# program that doesn’t have them, so the network has learnt it’s easier just to output them than not, etc. That’s the case for lots of things that seem wrong. Training on some languages other than C# will quickly fix this.

The neural net won’t know about solutions or anything like that - it just knows “here’s some Python code” and that it needs to say what it does, etc. Classifying the code that’s there will hold pretty much the same amount of value, even if the exercises are deprecated. Feel free to skip the ones with just test files though!

Generating these tags is time-consuming and expensive, so I’d rather not do another pass until I’ve got a set of examples from all the languages done.

Since the list of commonly used tags does not contain anything for a modulo construct, I expect different tracks to define their own. Can we decide on a shared one here?

  • :+1: construct:mod
  • :clap: construct:modulo
  • :confetti_ball: construct:modulus
  • :hugs: Something else

Let’s go with construct:modulo

Just wondering, what is the difference between sorting and ordering?


Technically ordering is having the ability to compare and rank values in some order – and not necessarily putting them in order. For instance, Poker (selecting a winning hand with the “highest” score) is an ordering exercise.
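A rough illustration of the difference (with a made-up scoring function, not the real Poker exercise): picking the winning hand is ordering, even though nothing ever gets sorted.

```python
def hand_score(hand):
    # Hypothetical scoring, purely for illustration: higher total wins.
    return sum(hand)


hands = [[2, 3, 5], [10, 11, 4], [7, 7, 7]]

winner = max(hands, key=hand_score)      # ordering: compare and rank
ranked = sorted(hands, key=hand_score)   # sorting: actually reorder them

print(winner)   # [10, 11, 4]
print(ranked)
```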

Tags I’m missing (so far for the Pharo track):
construct:symbol - a special kind of constant (a constant string that is unique in the program, used for naming variables or for matching between several options, e.g. #ThisIsASymbol or #'A symbol containing spaces'). See details here.
construct:string-concatenation - concatenating strings

Update:
maybe also:
construct:message-sending - method lookup on an object and invocation based on message sending (duck typing).

Something for languages that use 1/0 for boolean values, maybe concept:int-as-boolean

paradigm:data-oriented – e.g. AWK
paradigm:event-oriented

concept:substring – whether by a slice or a function

Different but similar, there’s also concept:truthy and concept:falsey.
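To make the distinction concrete (Python shown; the int-as-boolean case is really about languages like C or AWK that literally use 1/0):

```python
# concept:truthy / concept:falsey -- values coerced in a boolean context.
items = []
if not items:          # an empty list is falsey
    print("no items yet")

# concept:int-as-boolean -- languages that represent booleans as 1/0;
# mimicked here only as an analogy.
found = 0
if found == 0:
    print("nothing found")
```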


This is a hugely worthwhile effort. But it’s big. It’s taken me about an hour to review 5 solutions on 1 track (part of that hour was getting familiar with the concepts), and I’m a maintainer for several tracks. There’s no way I can give this much time to this effort.

We need to expand the “call to action” beyond maintainers.
