Background

This is based on a real-world problem we encountered early on in EdgeDB Cloud. To be clear, we already have a working solution in place; we are not asking you to do any work on our actual cloud as part of this interview process.

Problem overview

We run Consul in our internal cloud infrastructure.

Consul limits the number of open HTTP connections from clients, defaulting to 200.

Consul also exposes telemetry in standard Prometheus format.

However, "current count of open HTTP connections" is not part of Consul's telemetry, so we have no way of knowing how close we are to this 200-connection limit.

The goal here is to gather this count of open connections ourselves and send it to our Prometheus metrics server, where we can graph it or alert on it alongside the other metrics that Consul exposes natively.

Technical details

  • All of the connections that we care about are on TCP port 8500 (Consul's primary service port).

  • All of the connections we currently have are using IPv4, but we try to leave ourselves open to IPv6 compatibility. It's up to you whether you want to support IPv6 or leave that as a future TODO.

  • We run the Prometheus node_exporter on all hosts that run Consul, and have its textfile collector enabled with --collector.textfile.directory=/tmp/node-exporter. It's up to you whether you want to write a full Prometheus metrics collector implementation, or write to the node_exporter's textfile directory.

  • It's up to you how to collect the metric value itself - calling netstat and looking at its output, as in the example below; implementing it yourself by looking at files under /proc or /sys; using a 3rd-party library that exposes the value, etc.

  • We've provided a test script in Python, but there is absolutely no requirement that your solution be in Python. Use any language you feel appropriate.
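
As one illustration of the /proc option mentioned above, the sketch below counts established connections by parsing the text of /proc/net/tcp. It assumes a Linux host and covers IPv4 only (IPv6 sockets are listed separately in /proc/net/tcp6). The function names are hypothetical and are not part of the provided test script.

```python
from pathlib import Path

# State code the Linux kernel prints in the "st" column for ESTABLISHED.
TCP_ESTABLISHED = "01"


def count_established(proc_net_tcp_text: str, port: int) -> int:
    """Count ESTABLISHED sockets whose local port equals `port`."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:2134" (hex IP, hex port).
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port and fields[3] == TCP_ESTABLISHED:
            count += 1
    return count


def read_count(port: int = 8500, path: str = "/proc/net/tcp") -> int:
    """Read the live table from /proc (Linux only) and count matches."""
    return count_established(Path(path).read_text(), port)
```

Matching on the local port only counts the server side of each connection, so the client end of a loopback connection made by the test script is not counted a second time.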

Working example

Setting up a full Consul installation to reproduce this problem would be non-trivial and outside the scope of this interview question. Instead, we have provided a simple Python script that simulates the behavior: it opens a server on a specified port, opens a specified number of client connections to that server, and holds them open until the script is killed.

The script should run on any Python version higher than 3.7 and uses only the standard library (no virtualenv, pip install, or similar is required).

$ ./port-opener.py -h
usage: port-opener.py [-h] [--port PORT] [--num-connections NUM_CONNECTIONS] [--ipv {4,6}] [--verbose]

options:
  -h, --help          show this help message and exit
  --port PORT, -p PORT  port to listen and make connections on
  --num-connections NUM_CONNECTIONS, -n NUM_CONNECTIONS
                      number of connections to open
  --ipv {4,6}         IP version (4 or 6) to use
  --verbose, -v       enable verbose logging

To see it in action:

$ ./port-opener.py
started server on 127.0.0.1:8500
opened 200 client connections

Then, in a separate terminal window:

$ netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
...
tcp        0      0 127.0.0.1:8500          127.0.0.1:34980         ESTABLISHED
tcp        0      0 127.0.0.1:8500          127.0.0.1:35006         ESTABLISHED
...

With those 200 connections open, your solution to this problem should emit a Prometheus metric that looks something like consul_open_http_connections 200 or open_tcp_conns{port="8500"} 200.
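
If you go the textfile-collector route, emitting the metric could look roughly like the sketch below. It renders one gauge in the Prometheus text exposition format (label values must be double-quoted) and writes it with a temp-file-then-rename step so node_exporter never reads a half-written file. The function names, the help text, and the file name consul_conns.prom are hypothetical; the only requirement from node_exporter is that the file name ends in .prom.

```python
import os
import tempfile


def render_gauge(name: str, value: int, labels: dict, help_text: str) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    sample = f"{name}{{{label_str}}}" if label_str else name
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{sample} {value}\n"
    )


def write_textfile(directory: str, filename: str, body: str) -> str:
    """Write `body` to directory/filename atomically; return the final path."""
    final_path = os.path.join(directory, filename)
    # The temp file lives in the same directory so the rename stays on one
    # filesystem; its ".tmp" suffix keeps the collector from picking it up.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(body)
        # mkstemp creates the file as 0600; the exporter may run as another user.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, final_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return final_path
```

A call such as write_textfile("/tmp/node-exporter", "consul_conns.prom", render_gauge("open_tcp_conns", 200, {"port": "8500"}, "Open TCP connections")) would then be repeated on whatever interval you choose.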

Note that if you run the port-opener script and then kill it, this will close the connections, but for a few minutes afterwards netstat will still list them in TIME_WAIT state:

$ netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
...
tcp        0      0 127.0.0.1:55036         127.0.0.1:8500          TIME_WAIT
tcp        0      0 127.0.0.1:54748         127.0.0.1:8500          TIME_WAIT
tcp        0      0 127.0.0.1:54690         127.0.0.1:8500          TIME_WAIT
...

We don't care about any TIME_WAIT connections, because from Consul's point of view they don't count towards the 200-connection limit.
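
For the netstat-based approach, excluding TIME_WAIT comes down to filtering on the State column. The sketch below assumes the Linux net-tools column layout shown in the examples above (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); other platforms print different columns. The function names are hypothetical.

```python
import subprocess


def count_established_from_netstat(netstat_output: str, port: int) -> int:
    """Count rows that are ESTABLISHED and whose local port equals `port`."""
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the banner, the column header, and anything that is not TCP.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_addr, state = fields[3], fields[5]
        # rsplit keeps this working for IPv6 rows such as "::1:8500".
        local_port = local_addr.rsplit(":", 1)[-1]
        if local_port == str(port) and state == "ESTABLISHED":
            count += 1
    return count


def run_netstat() -> str:
    """Return the output of `netstat -tn` (requires net-tools to be installed)."""
    result = subprocess.run(
        ["netstat", "-tn"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

With this filter, the TIME_WAIT rows in the listing above contribute nothing, both because of their state and because port 8500 appears there as the foreign address rather than the local one.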

Your solution

Send us:

  • Your code implementing the Prometheus metric collection

  • A readme with:

    • How to run your code, if there are any non-obvious steps

    • Any design decisions or tradeoffs you made

    • Any test files or scripts you wrote, or modifications to our port-opener script

  • This take-home question is new-ish for us, so we would also appreciate any feedback you have about:

    • Any challenges you had understanding our description of the problem, or getting our test script to run, etc

    • Any feedback you have on this question that we can use to improve it for other candidates