2024-07-09 14:15:16 -04:00
## Background
This is based on a real-world problem we encountered early on in EdgeDB Cloud.
(To be clear, we already have a working solution in place; we are not asking
you to do any work on our actual cloud as part of this interview process.)
## Problem overview
We run Consul in our internal cloud infrastructure.
Consul limits the number of open HTTP connections from clients,
[defaulting to 200](https://developer.hashicorp.com/consul/docs/agent/config/config-files#http_max_conns_per_client).
Consul also [exposes telemetry](https://developer.hashicorp.com/consul/docs/agent/monitor/telemetry)
in standard Prometheus format.
However, "current count of open HTTP connections" is not part of Consul's
telemetry, so we have no way of knowing how close we are to this 200-connection
limit.
The goal here is to gather this count of open connections ourselves and send it
to our Prometheus metrics server, where we can graph it or alert on it alongside
the other metrics that Consul exposes natively.
## Technical details
- All of the connections that we care about are on TCP port 8500 (Consul's
primary service port).
- All of the connections we currently have are using IPv4, but we try to leave
ourselves open to IPv6 compatibility. It's up to you whether you want to
support IPv6 or leave that as a future TODO.
- We run the Prometheus node_exporter on all hosts that run Consul, and have
its [textfile collector](https://github.com/prometheus/node_exporter?tab=readme-ov-file#textfile-collector)
enabled with `--collector.textfile.directory=/tmp/node-exporter`. It's up to
you whether you want to write a full Prometheus metrics collector
implementation, or write to the node_exporter's textfile directory.
- It's up to you how to collect the metric value itself - calling `netstat` and
looking at its output, as in the example below; implementing it yourself by
looking at files under /proc or /sys; using a 3rd-party library that exposes
the value, etc.
- We've provided a test script in Python, but there is absolutely no requirement
that your solution be in Python. Use any language you feel appropriate.
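
As one illustration of the "/proc" option above, here is a minimal sketch that counts established connections by reading `/proc/net/tcp` (and `/proc/net/tcp6` when present). It assumes a Linux host; the function names `count_established` and `count_open_connections` are hypothetical, and this is not a reference solution. In these files the local address is `HEXIP:HEXPORT`, and state `01` means `ESTABLISHED`.

```python
from pathlib import Path

TCP_ESTABLISHED = "01"  # hex state code for ESTABLISHED in /proc/net/tcp{,6}


def count_established(proc_text: str, port: int) -> int:
    """Count ESTABLISHED sockets whose *local* port equals `port`."""
    count = 0
    for line in proc_text.splitlines()[1:]:  # skip the column header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_addr, state = fields[1], fields[3]
        # The port is the hex number after the last colon of HEXIP:HEXPORT.
        local_port = int(local_addr.rsplit(":", 1)[1], 16)
        if local_port == port and state == TCP_ESTABLISHED:
            count += 1
    return count


def count_open_connections(port: int = 8500) -> int:
    """Sum the IPv4 and IPv6 tables; a missing table contributes zero."""
    total = 0
    for table in ("/proc/net/tcp", "/proc/net/tcp6"):
        path = Path(table)
        if path.exists():
            total += count_established(path.read_text(), port)
    return total


if __name__ == "__main__":
    print(count_open_connections(8500))
```

Matching on the local port only means that, on a loopback test, the client half of each connection (where 8500 is the remote port) is not counted twice.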
## Working example
Setting up a full Consul installation to repro this problem would be non-trivial
and outside the scope of this interview question, so we have provided a simple
Python script that simulates the behavior, by opening a server on a specified
port, then opening a specified number of client connections to the server, and
holding them open until the script is killed.
The script should run on any Python version newer than 3.7 and uses only the
stdlib (it does not require a virtualenv, `pip install`, or anything similar).
```
$ ./port-opener.py -h
usage: port-opener.py [-h] [--port PORT] [--num-connections NUM_CONNECTIONS] [--ipv {4,6}] [--verbose]

options:
  -h, --help            show this help message and exit
  --port PORT, -p PORT  port to listen and make connections on
  --num-connections NUM_CONNECTIONS, -n NUM_CONNECTIONS
                        number of connections to open
  --ipv {4,6}           IP version (4 or 6) to use
  --verbose, -v         enable verbose logging
```
To see it in action:
```
$ ./port-opener.py
started server on 127.0.0.1:8500
opened 200 client connections
```
Then, in a separate terminal window:
```
$ netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
...
tcp        0      0 127.0.0.1:8500          127.0.0.1:34980         ESTABLISHED
tcp        0      0 127.0.0.1:8500          127.0.0.1:35006         ESTABLISHED
...
```
With those 200 connections open, your solution to this problem should emit a
Prometheus metric that looks something like `consul_open_http_connections 200`
or `open_tcp_conns{port="8500"} 200`.
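
For the textfile-collector route, a sketch like the following could render the metric in the Prometheus text exposition format (label values are quoted strings there) and write it atomically, so that node_exporter never reads a half-written file. The helper names `render_metric` and `write_textfile` and the file name `consul_connections.prom` are assumptions for illustration; the textfile collector only picks up files ending in `.prom`.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, Optional


def render_metric(name: str, value: int,
                  labels: Optional[Dict[str, str]] = None,
                  help_text: str = "Number of open TCP connections") -> str:
    """Render one gauge sample with HELP and TYPE lines."""
    label_str = ""
    if labels:
        # Label values must be double-quoted; escaping is omitted in this sketch.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


def write_textfile(directory: str, filename: str, body: str) -> Path:
    """Write to a temp file in the same directory, then rename it into place."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates the file with mode 0600; relax it so a collector
    # running as a different user can still read the result.
    os.chmod(tmp_path, 0o644)
    target = target_dir / filename
    os.replace(tmp_path, target)  # atomic within a single filesystem
    return target


if __name__ == "__main__":
    text = render_metric("open_tcp_conns", 200, {"port": "8500"})
    write_textfile("/tmp/node-exporter", "consul_connections.prom", text)
```

The rename step matters because the collector may scrape the directory at any moment; a plain `open(...).write(...)` could expose a truncated file.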
Note that if you run the port-opener script and then kill it, this will close
the connections, but for a few minutes afterwards netstat will still list them
in `TIME_WAIT` state:
```
$ netstat -tn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
...
tcp        0      0 127.0.0.1:55036         127.0.0.1:8500          TIME_WAIT
tcp        0      0 127.0.0.1:54748         127.0.0.1:8500          TIME_WAIT
tcp        0      0 127.0.0.1:54690         127.0.0.1:8500          TIME_WAIT
...
```
We don't care about any `TIME_WAIT` connections, because from Consul's point of
view they don't count towards the 200-connection limit.
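
If you go the `netstat` route, the filtering described above might look like the sketch below: keep only rows in `ESTABLISHED` state whose local port is 8500, which drops `TIME_WAIT` entries and the client side of loopback connections. The function names `count_from_netstat` and `count_live` are hypothetical, and the parser assumes the Linux `netstat -tn` column layout shown in the examples.

```python
import subprocess


def count_from_netstat(output: str, port: int = 8500) -> int:
    """Count ESTABLISHED rows whose local address ends in `:port`."""
    count = 0
    for line in output.splitlines():
        fields = line.split()
        # Data rows: proto, recv-q, send-q, local address, foreign address, state.
        # Header and "..." lines are skipped by the checks below.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        # rsplit on the last colon also handles IPv6 addresses such as ::1:8500.
        local_port = local.rsplit(":", 1)[-1]
        if local_port == str(port) and state == "ESTABLISHED":
            count += 1
    return count


def count_live(port: int = 8500) -> int:
    """Run `netstat -tn` and count matching connections."""
    result = subprocess.run(
        ["netstat", "-tn"], capture_output=True, text=True, check=True
    )
    return count_from_netstat(result.stdout, port)
```

Keeping the parser separate from the subprocess call makes it easy to test against captured output without needing live connections.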
## Your solution
Send us:
- Your code implementing the Prometheus metric collection
- A readme with:
  - How to run your code, if there are any non-obvious steps
  - Any design decisions or tradeoffs you made
- Any test files or scripts you wrote, or modifications to our port-opener
  script
- This take-home question is also new-ish for us, so we would also appreciate
  any feedback you have about:
  - Any challenges you had understanding our description of the problem, or
    getting our test script to run, etc.
  - Any feedback you have on this question that we can use to improve it for
    other candidates