Can a generic representer safely normalise whitespace?

iHiD · August 15, 2023, 3:00pm

We’re adding a generic representer for tracks that don’t have it.

As the solutions being represented already pass the tests and the analyzer should give comments for any formatting considerations, I believe we can have it safely remove all whitespace from solutions without losing meaning.

Are there any counter examples that people can come up with where removing the whitespace makes two solutions be considered identical, when that shouldn’t be the case?

Thanks.

vaeng · August 15, 2023, 3:27pm

Just to understand that correctly:
If the solution does not pass the test-runner, it will not reach the analyzer and neither will the represented ever see it, right?

So every solution will have the correct syntax, because it can only reach the representer after the fact of being checked for correctness, right?

I cannot come up with a cpp example that would have two solutions that can compile, do different things and get mashed into the same stage if you remove the whitespace.
Maybe very strange preprocessor macros.

MatthijsBlom · August 15, 2023, 3:43pm

The meaning of the following snippet depends on the indentation of where.

-- Haskell

data T = A | B

x :: String
x = "Kiekeboe!"

f :: T -> String
f = \case
  A -> x  -- 👈 shadowed or not?
  B -> x
  where
    x = "I shadow the outer x"

If where is indented to align with A and B, or less, then both f A and f B will evaluate to "I shadow the outer x". However, if where is indented any further, then f A will produce "Kiekeboe!".

iHiD · August 15, 2023, 3:54pm

Exactly.

But then they wouldn’t both pass the same test suite?

Thanks!

MatthijsBlom · August 15, 2023, 4:27pm

Almost none of the test suites are exhaustive, so they might yet.

~~I’ll think on it some more. Finding compelling counterexamples does not strike me as impossible, merely more difficult.~~

Found one, I think:

-- Haskell, nth-prime

module Prime (nth) where

import Control.Monad

nth :: Int -> Maybe Integer
nth n = last theFirstFewPrimes <$ guard (n > 0)
  where
    theFirstFewPrimes = take n primes

primes :: [Integer]
primes = 2 : filter isPrime [3, 5 ..]

isPrime :: Integer -> Bool
isPrime n = all (\p -> n `mod` p /= 0) $ takeWhile ((n >=) . (^ 2)) primes

Indentation can affect performance. If the above primes and isPrime are moved into the where clause (through nothing but indentation), then they will be recomputed every time nth is called. But if they are defined outside of nth, both will be computed only once, and primes will function somewhat like a cache.

superfastjellyfish · August 15, 2023, 6:25pm

~~Please excuse my ignorance for not knowing what a representer is.~~
Thanks @MatthijsBlom for DM with explanation and link

It’s really difficult to find examples in existing exercises, but here’s a small code snippet that does two different things based on indentation, and could still pass the test suite if the offending item is positioned last in the input stream:

# prohibits any calculated numbers to be 42
for i, n in enumerate(integers):
    transformed.append(i * n ** 2 - 6)
    if transformed[-1] == 42:
        raise ValueError("don't panic")

# only prohibits the last calculated number to be 42
for i, n in enumerate(integers):
    transformed.append(i * n ** 2 - 6)
if transformed[-1] == 42:
    raise ValueError("don't panic")

iHiD · August 15, 2023, 8:42pm

If you have a representer that uses ASTs you probably don’t care about whitespace anyway? This is more the case where there are not ASTs and we want to do basic text parsing.

Thanks for the examples everyone. Some good ideas generated there!

I’m currently leaning to having some incorrect groupings as I feel like we’re going to be a lot more useful overgrouping than under grouping in 99% of the cases here. The advantage of catching big things like

function x() {
  // ...
}

vs

function x()
{
  // ...
}

or trailing whitespace, or tabs vs spaces, etc etc outweighs the occasional place where we get things wrong I think.

Currently I feel like I want to see a really bad (from a grouping perspective) and realistic (as in happens more than once in a blue moon) example before not going down this route.

astyanax · August 16, 2023, 3:49pm

I think the main risks are Haskell, CoffeeScript and Fortran, because whitespace will have semantic meaning.

Python, Ruby, F# and others already have representers, so we assume those are fine.

Otherwise, I’m not sure if Lisp or Erlang family languages will depend on whitespace (at least having a single one to split “tokens” or keywords).

sh/awk definitely do, but I assume you want to normalize whitespace down to a single one, as well as excess newlines (i.e. always keep at least 1)? With the caveat that we’d want to remove spaces between successive newlines, if there are no other printable characters between then (code, newline, spaces and tabs, newline, code).

As long as you don’t prune strings or try to minify in a generic way, my gut feeling is that other languages shouldn’t be affected by whitespace, as long as they’re syntactically correct.

I recalled now, back in the day, when I’d use AStyle or https://universalindent.sourceforge.net/features.php for formatting. As expected, most tools are focused on C/C++ code, universalindent is just a wrapper to individual tools for each respective language.

iHiD · August 16, 2023, 3:57pm

I was planning on removing it altogether actually. To normalise common things like:

function(){ xxx }
function() { xxx }

I’m not sure that there’s more of a tangible risk of removing it than reducing it down to one, although I understand that instinctively it feels riskier. Again, I’d welcome concrete examples!

MatthijsBlom · August 16, 2023, 10:15pm

Don’t you mean removing/ignoring whitespace as a token?

I’d expect you don’t want to collapse Python to defname(args):e=1returne, or Haskell to mapsuccxs.

To the list if risky languages should be added: PureScript, and maybe Elm.

IsaacG · August 16, 2023, 11:55pm

If the uncollapsed version passes tests, why not?

ErikSchierboom · August 17, 2023, 9:04am

This is the key point.

superfastjellyfish · August 17, 2023, 9:35am

That could tremendously improve matches for subtleties like i=0 vs i = 0, and so on.

Could one make the case it’s only left-padding whitespace that increase risk compared to other languages? If so, that could limit the scope of the problem.

ramblings:
I spent some significant time last night trying to come up with potential issues and it’s really difficult problem. The closest I got was like combining a variable i with function call f(boolean) becoming if(predicate), or something that would trigger other keywords, but the syntax always breaks down. The same went for pattern matching in Haskell. One could perhaps find some obscure example where combining two pattern matches like func (x:xs) and func xs could form a recursive call, but again, syntax always break down due to e.g. incomplete expressions.

Given that the tests ensure correctness (mostly), is it likely that a single whitespace mismatch would even be enough to warrant a solution as being marked unique? You’d probably need at least an additional difference for the first one to make any sense, imo.