to get the most out of the Exercism environment for the C++ track, I want to start with the representer.
The goal is to have the same representation for two solutions, iff they “do the same thing.”
Doing the same thing is very general, so I want to have your opinions as to what is considered “the same”.
Please consider, that all files that are transformed into a representation have already passed the test-runner, so we know they are working.
Ideas
Remove all comments
That is a low-hanging fruit for multiline and single-line comments
Remove all whitespace
Currently, I do not see a scenario, where this might break things.
Does anyone have an example, where two files pass the tests, but would do different things, that are not visible if you remove the whitespace?
Sort Classes
I would try to sort all members alphanumerically. In their groups of public, protected, and private, and have these blocks sorted as well. I don’t think this is easy from a parsing standpoint
Remove all Preprocessor comments
I think this will help a lot to narrow down representations.
Unnecessary includes would not matter
No difference from #pragma once to #if !defined(... headers
Renaming with placeholders
PLACEHOLDER_ for variables, parameters, classes, enums, constants, and functions
Is there something I might miss?
Sort Functions
Just sort them alphabetically after placeholders are in place?
Strategy
Is it sensible to do a `“version 1”, with whitespace and comment removal, to have something “done” and do other parts, later? Or is this a terrible idea, as the whole mass of solutions has to be rescanned, @ErikSchierboom@iHiD?
Current implementation
I spend a lot of time to research AST parsing of cpp files and there is basically clang and tree-sitter to help. Clang’s -dump-ast does not seem to offer any real interface and might be “internal” for a while. Maybe @AndrewKhon98 can say something about his ideas?
Tree-sitter is promising, but I could not find a way to really change the generated tree programmatically based on queries. It would help a lot to move functions and parts around when they just sit in a tree. Regex parsing is definitely not up to the task.
Python’s pygments library is easy to use (for me). I already have a prototype to replace names, remove comments, preprocessor comments, and whitespace. It is a python script, so performance is slower than clang and tree-sitter (a C program). Also the docker container to run these is bigger.
( unclear how valuable; I haven’t yet written a Representer myself )
As far as I understand it, a representation does not need to be code, or even look anything like code. If you wanted to*, you could just feed the solution to the compiler, let it simplify the heck out of it, and then write the resulting binary – or just a hash of it – to representation.txt.
* which you plausibly do not, as the compiler is ‘too smart’.
You list some potential differences between solutions that you would like to remove. But what would be some distinctions that you would like to keep?
A Representer for use on Exercism should not necessarily be able to deal with any and all C++ that could be thrown at it. Most submissions are rather basic, and it is probably enough to be able to deal with just those.
So perhaps you do not need to depend on such powerful tools as the compiler’s parser, or tree-sitter. Maybe a simpler parser written by yourself would be nicer?
Statements depend on each other, and these dependencies form a DAG, which you can maybe extract a topological sort out of.
In my opinion renaming all variables with placehold is the hardest task that clang can help with. One could write a small tool to do this, which would bypass ast recursively and rename variables. An example of rewriting code with clang can be found here: rewritersample.
(Nearly?) all representers use ASTs to do the representation. I’d say if you can, you definitely want to use an AST parser for this.
AST’s don’t have things like comments or whitespace as part of them (ie you get those normalised/eliminated for free). You can choose to easily just ignore/remove any nodes you don’t care about (e.g. probably/maybe preprocessor comments form a node), and you can probably easily sort the different elements in the AST too. Placeholder naming becomes easy too.
So I think I’d definitely start with the AST approach - I think the learning curve will be steep but the dividends heavy. And it’s very reliable. The normal method is to make an AST then iterate over it, either manipulating, or just making a new “document” with the information in it. (e.g. you could walk over it, write each node back to a new file and manipulate the node as you go).
Well I might give you my input from my tracks. As I see there are 2 main approaches: Either return an ast tree, or code.
What the Crystal representer does, is it parse the code, remove comments, and normalize the code, and also removing all syntax sugar.
Then it uses ast parsing to rewrite the original code, I find this helpful since it makes it retain its original structure and makes it easier to debug. I think the Elixir representer uses the same approach, and I think DJ mentioned, that he wanted the Javascript or Typescript representer to switch to this approach.
So I mean in short what the Crystal representer do is 2 things, 1 reorder methods, and 1 removing nonstandard library names. And that is important for me since using different standard library methods can be a big difference, but can lead to the same result. And this is a 600-line program (lol), but it is super fast so I can make it super unoptimized and it won’t affect performance too much (thanks to Crystal’s amazing preformance).
To be clear, for this bit are you using AST or string processing? It sounds like string processing but as the AST does a lot of that for you for free (e.g. remove comments), so if so I’m not sure why you’d use string parsing? I believe Elixir just this via the AST: https://github.com/exercism/elixir-representer/blob/main/lib/representer.ex#L27-L31
I just want to clarify as I can’t see a situation where doing string parsing is valuable when you have the AST to hand. But maybe I’m underestimating something?
To be clear, for this bit are you using AST or string processing?
Yeah I formulated myself a bit wrong, I meant that when I parse the code it does the following, so I am not doing any of those things like removing comments. Since the parser does it. It has the downside of not giving me super much control but at the same time I don’t really care about it (since I want those stuff to be done) and parsing the code requires 2 line which does a lot of things. And the parser returns an ast.
For the rest, I am using 98% ast parsing, but when I meet a macro in the ast, I have to manually “represent” that code block so that is done by a combination of ast and string parsing but I am digging into optional other methods.
It would be a very different task if C++ would have “code is data, data is code” written on its flag. Some reflective behavior would be really nice as well.
I can see the advantages of AST parsing, I’d love to use it. Maybe someone else has better google foo, and can hint me at something that can really manipulate code with a good API.
I could use tree-sitters representation and write some parser to get it into some useable format, but at some point that does not feel too different from abusing the pygments lexer.
The only thing that my prototype needs is a sorting mechanism, so it doesn’t feel too far off what I was aiming for. Starting another shot at using some AST feels like a more tiresome endeavor. Maybe I overlooked something really obvious?
I thought of using the llvm intermediate representation to get to the bottom of the problem. There would be a lot of smart shortcuts, that we cannot control and the optimization might not see all the shortcomings that could be marked on certain solutions, because they get corrected.
Your example shows one of the problems I have with the clang ast-dumping. It is not stable, changes a lot, and is rather limited in its documentation as far as I can see it. LLVM uses it for internal tooling and has not offered a lot of guidance. At least I can’t find anything of real value. The repository you link is abandoned since 2018 as the requirements were shifting too much.
This all should not sound like a rambling complaint. I just want to open up and see if there is some solution that I should try or some abandoned strategy that I should follow because there is that great resource over at that one website.
I don’t speak C++ so it’s difficult for me to know what’s easy/hard to use, but this comes up frequently: GitHub - standardese/cppast: Library to parse and work with the C++ AST - that seems to be relatively easy to generate and output the AST. And then you have a relatively standardised structure you can work with and manipulate as you wish? (The manipulation doesn’t need to be using any specific AST methods, you could use string parsing on an AST, just much more easily than you ever could on code).
I have seen this and was all excited because it touches the clang points I complained about. But it is not useable for us:
Missing features:
- support modification of parsed entities […]
- function bodies aren’t parsed at all […]
- [separately specified] member specialization […] is not supported.
So we have no idea what happens inside of a function and members of classes are not visible either in many cases.
Maybe I have to write my own parser for the clang ast output and nail down the version so it cannot change. A new C++ version is released every three years, so it has to be redone every now and then.