How does the website run submission code?

Hi,
I was trying to write a lightweight, performant judge API for PHP using Docker and came across your platform while browsing around the internet.
I’ve been reading the documentation and the repositories in this organization and found a lot of helpful insights, but there’s one missing piece of the puzzle that I haven’t been able to figure out:

How does the “website” run users’ code?

I come from a Python/Django world and know close to nothing about Ruby/Rails, but what I’ve understood so far is that for each programming language (Track) there are four repositories: the main exercises repository, the test runner, the analyzer and the representer.

What I’m focused on is the test runner, which is a Docker image that tests a specific version of a submission (Submission::Iteration) and returns the results.
The exercises repository holds many tests, grouped into exercises, that the test runner needs. So testing a submission means running the test runner with the exercises repository mounted into that Docker container.
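From the docs, I gather that invoking a runner looks roughly like this (a sketch in Python; the image name, entrypoint arguments and paths are my assumptions based on the test-runner interface docs, not the actual deployment):

import pathlib
import subprocess
import tempfile

# Hypothetical sketch: run a test-runner image against a mounted solution.
# The documented interface takes an exercise slug, an input dir and an
# output dir, and writes results.json into the output dir.
def run_tests(image, slug, solution_dir):
    # solution_dir must be an absolute path for the bind mount to work
    output_dir = tempfile.mkdtemp()
    subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",  # sandbox assumption: no network for user code
            "--mount", f"type=bind,src={solution_dir},dst=/solution",
            "--mount", f"type=bind,src={output_dir},dst=/output",
            image, slug, "/solution", "/output",
        ],
        check=True,
        timeout=20,  # give up on runaway submissions
    )
    return pathlib.Path(output_dir, "results.json").read_text()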

What I didn’t understand is “Who runs the Docker image?”

This is an important architectural question that I’m also facing in developing my own online judge.

What I was looking for was some kind of hook or signal in the API code that, for example, executes a docker run command when code is submitted for testing or a TestRun object is created.
I found a command module called ToolingJob, which is the closest thing to what I’m looking for, but that was as far as I could get before hitting a dead end in the website project.
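In Django terms, the kind of hook I had in mind would be something like this (every name here is hypothetical, just to illustrate the pattern I was searching for in the Rails code):

from django.db.models.signals import post_save
from django.dispatch import receiver

from judge.models import Submission       # hypothetical model
from judge.jobs import enqueue_test_run   # hypothetical helper that queues a docker run

# Whenever a Submission row is created, kick off a background job that
# eventually runs the test container and stores the results.
@receiver(post_save, sender=Submission)
def on_submission_created(sender, instance, created, **kwargs):
    if created:
        enqueue_test_run(submission_id=instance.id)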

I also noticed that there is Sidekiq somewhere running some Rake tasks, but I’m not sure whether it’s related to this part of the platform.

Can anyone help me out on this?

I don’t read Ruby well or know my way around the application’s internals, but I found a TestRunsChannel and a SubmissionTestRunsController which might help.


I’ve seen them already but couldn’t quite understand them.

The TestRunsChannel file seems to be sending an event or signal somewhere, but I don’t know where!

The SubmissionTestRunsController seems to be just an API endpoint (since it inherits from ApplicationController) that responds to a user request, but again I’m not completely sure.

Can you explain more about these two files?

Nope :slight_smile: I was able to find those but I have no idea how Ruby or the site works :smiley:

A bank of a dozen EC2 instances, each with all ~100 Docker images loaded onto it. Each instance constantly picks jobs off a queue (see the Tooling Orchestrator for some of the queuing/orchestration bits) and executes them by quickly spawning, using and destroying a container of the relevant Docker image. The main bits of code that do all this are in the Tooling Invoker, which runs on every EC2 machine. We also make heavy use of EFS to avoid having to pass data around everywhere.
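As a very rough sketch of what each invoker does (made-up function names and job fields; the real code is in the Tooling Invoker repo):

import subprocess
import time

def invoker_loop():
    while True:
        job = fetch_job()  # hypothetical: pull the next job from the orchestrator's queue
        if job is None:
            time.sleep(0.1)
            continue
        # Fresh container per job: spawn, use, destroy. The input/output
        # dirs live on shared storage (we use EFS), so no data shuffling.
        result = subprocess.run(
            [
                "docker", "run", "--rm",
                "--network", "none",
                "--mount", f"type=bind,src={job['input_dir']},dst=/solution",
                "--mount", f"type=bind,src={job['output_dir']},dst=/output",
                job["image"],  # a test runner, analyzer or representer image
                job["slug"], "/solution", "/output",
            ],
            capture_output=True,
            timeout=20,
        )
        report_result(job, result.returncode)  # hypothetical: hand the results back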

There’s a lot of complexity I’m skimming over here, but that’s the fundamental method. It would be much easier to use AWS Lambda for this, but Lambda leaves containers in a dirty state between invocations (i.e. it would reuse the same container across students), so it’s not currently usable for this purpose.


The Tooling Invoker was one of the only repositories that I didn’t look into deeply, because it had no documentation of what it does, but now I can clearly see it in the code.

But I’m still not sure what exactly the Tooling Orchestrator does. Is it the message queue that the Tooling Invokers consume messages from? Or is it a tool for managing the Tooling Invoker instances across the compute clusters? Or both? Or neither? :sweat_smile:

The Tooling Orchestrator is mainly the message queue handler, yeah. You could use other solutions if you didn’t mind their latency (e.g. SNS/SQS) or their missing some features I needed (e.g. Sidekiq), but I wanted specific enough queue behaviours that I rolled my own :slight_smile:


Thank you very much!

Just one more question… :sweat_smile:

One of the main reasons I was asking the previous questions was concurrency.

How do you balance the load and stay performant when thousands of tests are flooding the queue?

I tested the Python test runner on a bare machine, and it took 750 milliseconds to create the container, run the Guido’s Gorgeous Lasagna tests and write the results file, which was fairly performant.

But when I tried to run a hundred Docker containers at the same time, I hit an issue that I’m not sure whether to blame on the Docker engine’s limited concurrency or on Linux’s scheduling. A few of the containers again took less than a second, but the rest started piling up on compute time, with some taking more than 20 seconds to finish the tests.

The command I ran to test it was something like this:

for i in {1..100}
do
  # -d already detaches, so there's no need to background the client with &
  docker run -d --rm -w /python --mount type=bind,src=$PWD,dst=/python --entrypoint ./bin/test_exercises.py python-test-runner --runner test-runner guidos-gorgeous-lasagna
done

If the message queue handles the tests one by one, it will take even longer than that to handle a hundred tests (100 * 0.75 ~= 75 seconds).

Roughly how many tests can one of your EC2 instances handle at the same time, on average? Do these instances have minimal EC2 specs (General Purpose) or something higher (e.g. Compute Optimized)?


Three. Even at three, each run is slower than when running just one at a time.

I’ve experimented with both. Different languages are better optimised for different things, so you’ll need to experiment with this.
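If you want to measure this yourself, a sketch like the following (reusing your command, minus -d so each subprocess.run blocks until its container exits) compares different concurrency caps:

import concurrent.futures
import os
import subprocess
import time

# Run 100 test jobs through a pool of `limit` workers and time the batch.
# Without -d the docker client waits for the container, so the pool size
# is exactly the number of containers running at once.
CMD = [
    "docker", "run", "--rm",
    "-w", "/python",
    "--mount", f"type=bind,src={os.getcwd()},dst=/python",
    "--entrypoint", "./bin/test_exercises.py",
    "python-test-runner",
    "--runner", "test-runner", "guidos-gorgeous-lasagna",
]

def run_batch(jobs, limit):
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=limit) as pool:
        list(pool.map(lambda _: subprocess.run(CMD, check=True), range(jobs)))
    return time.monotonic() - start

for limit in (1, 3, 10, 100):
    print(f"{limit:>3} concurrent: {run_batch(100, limit):.1f}s")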

We’ve worked out our average high-demand level and keep that many machines on. We could auto-scale, but it’s more hassle than it’s worth for us. So, to take an example: 1,000 people solving at the same time means roughly 1000/60 test runs starting per second (presuming people tend to run the tests once per minute). With each run taking about a second and 3 running per machine, that’s 1000/60/3 ≈ 5 machines. If you then also have analyzers/representers, you need more machines for those (but I doubt you’re doing that).

A lot of your decisions are going to come down to how spiky your demand is. Ours is relatively flat: the busiest times are only about double the quietest, so we just take the approach of under-utilizing our machines during the quieter periods[0].

That make sense?


[0] In practice this often isn’t true, as we have a huge background queue of millions of old solutions being retested and analyzed against the latest versions of exercises, but that’s probably not relevant to you here :slightly_smiling_face:
