Go tests failing with “an error occurred”

Hello all. :wave:

I’ve noticed more tests failing recently: tests that pass at first, but when I check later they’ve failed, either with a timeout or a generic “an error occurred”.

I was instructed to open a topic to see if the issue is widespread, so if you have the time, consider answering the poll if your tests have failed unexpectedly. :slightly_smiling_face:

  • Yes
  • No


Another way to help gauge the scope of the issue is to reply below with the language and exercise. It could turn out to be limited to a specific language’s test runner.

The issue could also be my code, of course; I’m not excluding that possibility. :joy:

I keep having this problem with Go exercises… Every other one, really :cry:
The tests run fine on my computer… then run fine on Exercism after I submit…
And then switch to “Failed” after a few seconds :flushed:
It has not happened on any other track so far… but then again, I’m focusing on Go during Mechanical March :sweat_smile:

Yesterday I had the same issue on the Go track. I’ve been updating some old exercises and solving some new ones, and a majority of the exercises failed the tests when submitted via the command line.
I could get the tests to run again via the online editor, and they never failed (with this message) when run there.

Looks like the Go test runner is taking a long time to execute:

Mar 16 22:19:31 ip-10-1-0-249 tooling-invoker[1349]: SYSTEM: Worker 1: Alive at 2023-03-16 22:19:31 +0000
Mar 16 22:19:39 ip-10-1-0-249 tooling-invoker[1349]: 6af7db1c7e7a433e98d51b0b794b6028: Starting job
Mar 16 22:19:39 ip-10-1-0-249 tooling-invoker[1349]: 6af7db1c7e7a433e98d51b0b794b6028: Invoking request: test-runner:go:weather-forecast
Mar 16 22:19:39 ip-10-1-0-249 tooling-invoker[1349]: 6af7db1c7e7a433e98d51b0b794b6028: Preparing input
Mar 16 22:19:39 ip-10-1-0-249 tooling-invoker[1349]: 6af7db1c7e7a433e98d51b0b794b6028: Invoking container
Mar 16 22:19:39 ip-10-1-0-249 tooling-invoker[1349]: 6af7db1c7e7a433e98d51b0b794b6028: Running /usr/bin/timeout -s SIGTERM -k 1 20 docker container run -a stdout -a stderr --stop-timeout 0 --rm -v /opt/jobs/6af7db1c7e7a433e98d51b0b794b6028-02577d22bfd26651fd675f47845a6aab/code:/mnt/exercism-iteration -l exercism-6af7db1c7e7a433e98d51b0b794b6028-372ad2e28d19734a82ce6bcbdb4926ea --network none -m 3GB 681735686245.dkr.ecr.eu-west-2.amazonaws.com/go-test-runner:production weather-forecast /mnt/exercism-iteration/ /mnt/exercism-iteration/
Mar 16 22:19:52 ip-10-1-0-249 tooling-invoker[1349]: 6af7db1c7e7a433e98d51b0b794b6028: Docker time: 13.593092441558838
Mar 16 22:19:53 ip-10-1-0-249 tooling-invoker[1349]: 6af7db1c7e7a433e98d51b0b794b6028: Total time: 13.925870895385742

(cc @andrerfcsantos @junedev)

The last code change was merged on Sunday, and the last bigger change even before that, so it’s strange this wasn’t reported earlier if those changes were the cause.

@iHiD Do we have stats on those execution times to see when it started to degrade?

For the people seeing this issue:

  • When did you start noticing this?
  • Does it happen for all exercises or only some? Any difference between practice exercises and exercises linked from the syllabus?
  • Does the same solution sometimes pass without timing out when you try again?

@andrerfcsantos I will probably not have time to do anything about this tomorrow. In case you have time, you could try to revert some of the latest changes and redeploy the old state. Go back at least two versions, though, as going back only to the previous version will bring back the bug we had with the extra tests in practice exercises.

I’ve just spent a little time looking and it’s consistently taking 13-20 seconds (20 being the timeout).

  • I don’t have logs beyond 24hrs (they already cost us $600/month!) so I can’t say when it started.
  • It’s happening on every exercise I see go through the test runners.
  • I can test this tomorrow, but we don’t re-run automatically after a timeout, and as everything seems to be in this 13-20 second window, I don’t imagine it will change.

I’m gonna try and work on it with @ErikSchierboom tomorrow too.

On the plus side, Go is by far the most popular language based on test runs on the servers right now!

Bit more debugging from me

Looking at the tags on Docker Hub:

  • f9f9e3c5a069d5a4d23fa24bb8f9c7fab0979af9: Fast (5s)
  • 6aa7c0d947ad0cf7dc74851040d9b0cd72dd8c56: Slow (Times out)

So something in #97 is causing it.

Update: I’ve merged Revert changes that have made the test runner slow by iHiD · Pull Request #101 · exercism/go-test-runner · GitHub, which has now made it very fast in production again.

Andre’s been pairing with me as I’ve done this so he’s up to speed. I’ll leave the Go-pros to work it out from here :slight_smile:

FYI the command I’m using to test is:

docker container run -a stdout -a stderr --stop-timeout 0 --rm -v /tmp/exercism/tooling-jobs/57fdc613273d4db9a88973e43d5c48b6-1a7f0a0d968dc7b739c6c5558328b70a/code:/mnt/exercism-iteration -l exercism-57fdc613273d4db9a88973e43d5c48b6-a372545262df007ac62e9d96ec7593a8 exercism/go-test-runner:f9f9e3c5a069d5a4d23fa24bb8f9c7fab0979af9 hello-world /mnt/exercism-iteration/ /mnt/exercism-iteration/

Which is of the form:

docker container run -a stdout -a stderr --stop-timeout 0 --rm -v /PATH_TO_SOLUTION_CODE:/mnt/exercism-iteration -l RANDOM_SLUG exercism/go-test-runner:f9f9e3c5a069d5a4d23fa24bb8f9c7fab0979af9 hello-world /mnt/exercism-iteration/ /mnt/exercism-iteration/

It makes no difference whether I set the network or memory flags (which are set in production, but they’re not what’s making it slow).

And FYI /tmp/exercism/tooling-jobs/57fdc613273d4db9a88973e43d5c48b6-1a7f0a0d968dc7b739c6c5558328b70a/code is:

➜ ls /tmp/exercism/tooling-jobs/57fdc613273d4db9a88973e43d5c48b6-1a7f0a0d968dc7b739c6c5558328b70a/code
total 32
0 drwxrwxrwx  7 iHiD  wheel   224 17 Mar 00:41 .
0 drwxr-xr-x  4 iHiD  wheel   128 17 Mar 00:41 ..
0 drwxrwxrwx  5 iHiD  wheel   160 17 Mar 00:41 .meta
8 -rwxrwxrwx  1 iHiD  wheel    25 17 Mar 00:41 go.mod
8 -rwxrwxrwx  1 iHiD  wheel   117 17 Mar 00:41 hello_world.go
8 -rwxrwxrwx  1 iHiD  wheel  1382 17 Mar 00:41 hello_world_test.go
8 -rw-r--r--  1 iHiD  wheel   489 17 Mar 00:41 results.json

With some more debugging, I narrowed down the root of the problem a bit more.

Since 6aa7c0 was the problem, I pulled the revert Jeremy did and cherry-picked the changes from 6aa7c0 back on top to see what the problem was with that commit.

Two main things were happening in that commit:

  • Update of the Go version to 1.20
  • Some code changes to the generator

I first removed the code changes, so that I was only working with the changes related to the Go version, and the test runs stayed slow. This confirmed the problem was with upgrading to 1.20 and not with the code changes.

Then I tried several things, like changing the base image from alpine to buster, trying some older images of both, and changing Go versions. The pattern I saw was that 1.20 was slow in general, both on alpine and buster, while 1.19 seemed consistently faster across all the image/version combinations I tried.

Since this was a Go version problem and had nothing to do with the code, I reverted Jeremy’s reverts (effectively bringing all the changes back) and downgraded Go to 1.19. That way we keep the code changes and stay on a reasonably recent version of Go.

If that sounds good to you, here’s the PR bringing all changes back but with Go 1.19: Revert of revert with downgrade to Go1.19 by andrerfcsantos · Pull Request #102 · exercism/go-test-runner · GitHub

I left two comments there.

We should keep investigating exactly why 1.20 is slow and if we can fix it somehow, but I think this is a good solution for now.

Debugging this was fun, but I’m off to bed as it is way past my bedtime :smiley:


So, we did that and things seem to be running smoothly. The next step is to figure out why Go 1.20 is being slow.


Initially, I suspected it could be something to do with cgo. On systems without a C compiler, some standard library packages now use pure-Go implementations instead of relying on C. But if a C compiler is available, the cgo-based implementations should still be used.

However, even when using images that had a C compiler available and setting CGO_ENABLED=1, things were still slow.
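If anyone wants to double-check whether a given binary was really built with cgo enabled, the build settings Go embeds since 1.18 can be inspected. This is just a small verification sketch, not part of the test runner:

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no build info embedded in this binary")
		return
	}
	// Since Go 1.18, build settings such as CGO_ENABLED, GOOS and GOARCH
	// are recorded in the binary at build time.
	for _, s := range info.Settings {
		if s.Key == "CGO_ENABLED" {
			fmt.Println("built with CGO_ENABLED =", s.Value)
		}
	}
}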

I then put some timings inside the test runner to see exactly where the time was being lost:

1.20.2-bullseye with the env vars for cgo set:

$ time docker container run -a stdout -a stderr --stop-timeout 0 --rm -v /Users/andre/lab/hello-world:/mnt/exercism-iteration -l exercism-meta exercism/test-runner:1.20.2-bullseye-cgo hello-world /mnt/exercism-iteration/ /mnt/exercism-iteration/
2023/03/17 20:39:44 Checking args 11.713µs
2023/03/17 20:39:44     Parsing exercise config took 9.080198ms
2023/03/17 20:39:58         testCmd.Run() took 13.557003663s
2023/03/17 20:39:58     Running tests took 13.557117492s
2023/03/17 20:39:58     Getting structure took 7.914157ms
2023/03/17 20:39:58     Marshalling took 75.356µs
2023/03/17 20:39:58 Executing tests 13.574476829s
2023/03/17 20:39:58 Writing results 11.859595ms
2023/03/17 20:39:58 Test runner took 13.58681567s
docker container run -a stdout -a stderr --stop-timeout 0 --rm -v  -l       0.08s user 0.06s system 0% cpu 15.542 total

1.19.7-bullseye with the exact same conditions:

time docker container run -a stdout -a stderr --stop-timeout 0 --rm -v /Users/andre/lab/hello-world:/mnt/exercism-iteration -l exercism-meta exercism/test-runner:1.19.7-bullseye-cgo hello-world /mnt/exercism-iteration/ /mnt/exercism-iteration/
2023/03/17 20:44:01 Checking args 13.02µs
2023/03/17 20:44:01     Parsing exercise config took 6.497086ms
2023/03/17 20:44:03         testCmd.Run() took 1.991614259s
2023/03/17 20:44:03     Running tests took 1.991730087s
2023/03/17 20:44:03     Getting structure took 13.368996ms
2023/03/17 20:44:03     Marshalling took 150.848µs
2023/03/17 20:44:03 Executing tests 2.011823182s
2023/03/17 20:44:03 Writing results 21.412597ms
2023/03/17 20:44:03 Test runner took 2.033506323s
docker container run -a stdout -a stderr --stop-timeout 0 --rm -v  -l       0.08s user 0.05s system 3% cpu 3.401 total

The interesting thing to note here is that the difference in time between 1.19 and 1.20 is in the call to testCmd.Run(), which is where we “shell out” to run go test. Everything else looks normal, with no noticeable differences between versions.
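For context, the timing instrumentation is essentially a time.Now()/time.Since() pair around the call that shells out to go test. A rough sketch of the idea (the exact flags, paths and log format are assumptions, not the real test-runner code):

package main

import (
	"bytes"
	"log"
	"os/exec"
	"time"
)

func main() {
	// Shell out to `go test` in the solution directory, the same way the
	// test runner does, and measure only that call.
	testCmd := exec.Command("go", "test", "--json", "./...")
	testCmd.Dir = "/mnt/exercism-iteration" // assumed solution path

	var output bytes.Buffer
	testCmd.Stdout = &output
	testCmd.Stderr = &output

	start := time.Now()
	err := testCmd.Run()
	log.Printf("testCmd.Run() took %s", time.Since(start))

	if err != nil {
		// `go test` exits non-zero when tests fail; the JSON output still
		// contains the per-test results.
		log.Printf("go test finished with error: %v", err)
	}
	log.Printf("captured %d bytes of test output", output.Len())
}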

However, running the test runner locally outside Docker with Go 1.20 (on macOS) doesn’t show the same problem:

$ ./go-test-runner /Users/andre/lab/hello-world /Users/andre/lab/hello-world
2023/03/17 23:04:28 Checking args 18.503µs
2023/03/17 23:04:28 	Parsing exercise config took 914.321µs
2023/03/17 23:04:28 		testCmd.Run() took 697.718911ms
2023/03/17 23:04:28 	Running tests took 697.857369ms
2023/03/17 23:04:28 	Getting structure took 1.090841ms
2023/03/17 23:04:28 	Marshalling took 65.432µs
2023/03/17 23:04:28 Executing tests 699.98051ms
2023/03/17 23:04:28 Writing results 1.461088ms
2023/03/17 23:04:28 Test runner took 701.635727ms

In fact, attaching directly to the container and running the test runner there also shows similar results to running locally:

$ docker run -it -v /Users/andre/lab/hello-world:/mnt/exercism-iteration --user appuser --entrypoint /bin/bash exercism/test-runner:1.20.2-bullseye-cgo
appuser@17b5fc365cf7:/opt/test-runner$ go version
go version go1.20.2 linux/amd64
appuser@17b5fc365cf7:/opt/test-runner$ bin/test-runner /mnt/exercism-iteration/ /mnt/exercism-iteration/
2023/03/17 22:22:50 Checking args 33.339µs
2023/03/17 22:22:50 	Parsing exercise config took 11.801936ms
2023/03/17 22:22:51 		testCmd.Run() took 1.827009028s
2023/03/17 22:22:51 	Running tests took 1.827114131s
2023/03/17 22:22:51 	Getting structure took 4.361865ms
2023/03/17 22:22:51 	Marshalling took 64.518µs
2023/03/17 22:22:51 Executing tests 1.843516918s
2023/03/17 22:22:51 Writing results 5.259687ms
2023/03/17 22:22:51 Test runner took 1.849310084s

It seems the problem isn’t with Go 1.20 itself, but with some other Docker configuration that is making things slow.


Found a possible solution for this: Make 1.20 fast by andrerfcsantos · Pull Request #103 · exercism/go-test-runner · GitHub

I created a minimal example to reproduce the issue. As you will see in the code, there is really not much code needed to reproduce the performance regression.

The issue seems to be related to shelling out to go test (or maybe Go commands in general); it does not happen if I shell out to ls, for example.
The issue only appears when using the non-root user.
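The core of the repro is roughly along these lines: time a shelled-out go test next to a harmless command like ls, run inside the container as the non-root user. A sketch of the idea only (the directory and module layout are assumptions, not the exact repro code):

package main

import (
	"fmt"
	"os/exec"
	"time"
)

// timeCommand runs a command in dir and reports how long it took.
func timeCommand(dir string, name string, args ...string) {
	cmd := exec.Command(name, args...)
	cmd.Dir = dir
	start := time.Now()
	out, err := cmd.CombinedOutput()
	fmt.Printf("%s %v took %s (err: %v, %d bytes of output)\n",
		name, args, time.Since(start), err, len(out))
}

func main() {
	// Assumed: a directory containing a small Go module with tests.
	dir := "/mnt/exercism-iteration"

	// Shelling out to `go test` is where the slowdown shows up under Go 1.20
	// when running as the non-root user...
	timeCommand(dir, "go", "test", "./...")

	// ...while shelling out to something like `ls` stays fast.
	timeCommand(dir, "ls", "-la")
}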

I might open an issue in the Go GitHub repo, but first I want to check whether any of the flags we use for the new user cause this.

cc @andrerfcsantos

Sounds like a good idea :)

The --disabled-password flag seems mandatory. I tried without --gecos "" using the Dockerfile in your repo and it didn’t seem to make a difference.

To rule out this being a permissions thing, I also tried making appuser the owner of everything in /app with RUN chown -R appuser:appuser . just before USER appuser - no difference there either.

Ok, this is interesting.

So, making appuser the owner of app after go build and before switching to it doesn’t seem to work.

But making appuser the owner of app, switching to it, and building the binary as appuser does seem to fix it. For reference, this is the Dockerfile I used:

FROM golang:1.20.2-alpine3.17

RUN adduser --disabled-password appuser
COPY . .
RUN chown -R appuser:appuser app
USER appuser
WORKDIR app
RUN go build -o main
CMD ["./main"]

I get times of <2s using this.

@junedev Could this be the fix on our side?

I think it still deserves an issue in the Go repo.

PS: I managed to reproduce this fix in the actual test runner with this Dockerfile:

go-test-runner Dockerfile
FROM golang:1.20.2-alpine3.17

# add additional packages needed for the race detector to work

RUN apk add --update build-base make

# add a non-root user to run our code as

RUN adduser --disabled-password appuser

COPY . /opt/test-runner

RUN chown -R appuser /opt/test-runner

USER appuser

# Install external packages

WORKDIR /opt/test-runner/external-packages

RUN go mod download

# Build the test runner

WORKDIR /opt/test-runner

RUN go build -o /opt/test-runner/bin/test-runner /opt/test-runner

ENTRYPOINT ["sh", "/opt/test-runner/bin/run.sh"]

(Removing ENV GOCACHE=/tmp still seems to be necessary; even making appuser the owner of /tmp does not fix it.)
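If it helps with the cache angle, the effective build-cache directory can be queried from inside the container; a tiny verification sketch, not part of the runner:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// `go env GOCACHE` prints the build-cache directory the toolchain will
	// actually use, taking any GOCACHE environment override into account.
	out, err := exec.Command("go", "env", "GOCACHE").Output()
	if err != nil {
		fmt.Println("could not query go env:", err)
		return
	}
	fmt.Println("effective GOCACHE:", strings.TrimSpace(string(out)))
}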

Updated the PR: Make 1.20 fast by andrerfcsantos · Pull Request #103 · exercism/go-test-runner · GitHub


I filed the issue in the Go repo now: os/exec: 400% perf regression with Go 1.20 when triggering "go test" in Docker container · Issue #59146 · golang/go · GitHub


Hi all :wave:

Not sure if this is related to the same issue, but I’m getting some test timeouts on the Rust track. Yesterday I got one for the Gigasecond exercise and thought it was just temporary, but today I got the same for the Clock exercise.

I was able to see the hints from the automated analysis for a short time, but after a few seconds I got the timeout again.

Tests are passing in my local environment, and the exercise was submitted using the CLI.

Hi!

Yesterday we re-introduced Go 1.20 in the Go track, but from our testing everything seemed OK, so I’m not sure this is related to that.

Do you mind posting your code for us to take a look?

Hi André,

thanks for checking this out!
I marked my solution as completed, so you can use this link: leomarsjr's solution for Clock in Rust on Exercism

If you need more information, please let me know.

Best regards,

Leomar

I ran the tests on your solution using the online editor and they worked just fine.

Make sure you have the latest version of the exercise and try to submit again. Or open the online editor, make a small change (like adding a new line), run the tests, and submit there.