Sometimes, I know that there's something that I should be doing, and I just don't do it. I keep putting it off because it's too much work, or too complicated, or uninteresting. It's easy to find an excuse not to do it.
For well over a decade, I've run this web site using a web server I wrote in Scheme. (My server runs behind nginx.) Until yesterday, I hadn't configured my Linux instance to start the server on boot. My hosting provider, Linode, is reliable, so I rarely had an unplanned outage. Whenever my instance restarted, I manually logged in and started my server. It worked.
But I knew that that was wrong. If there was a crash, or a power outage, or unplanned maintenance, I wanted my server to restart on its own, and I wanted it to do so right away. I didn't want to have to race to a terminal to get it going again. Let me show you how I finally made that happen. With dtach and systemd, we can make a web service that starts on its own and whose REPL is always available, even across logouts. And with call-with-current-continuation, we can take debugging to a whole new level.
dtach your REPL
If we're going to take advantage of the power of Scheme, we need access to a Read-Eval-Print Loop. That way, we can debug problems with more than just HTTP logs. We can even experiment with changes to the running server. But we need some way to connect to the REPL. I use dtach, which lets me start my server and log out, knowing that I'll be able to connect to the REPL whenever I log back in.
To start the server using my start-web-server script (not shown), we pass dtach a filename to tell it where to create a Unix domain socket:
dtach -n /tmp/speechcode.dtach /home/speechcode/bin/start-web-server
To connect to the server's REPL later, we specify the same socket with the -a option:
dtach -a /tmp/speechcode.dtach
If we type '(), this is what we see:
'()
;Value: ()

1 ]=> █
Now we have access to the full power of the REPL. We can check the status of the server, inspect its data structures, debug problems (more on that later), and even make changes to code while the server is running.
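If the server isn't running, there is no socket and dtach -a has nothing to attach to. Here is a minimal sketch of a wrapper that checks for the socket first. The names socket_ready and attach_repl are hypothetical, invented for this illustration; only the socket path comes from the commands above.

```shell
#!/bin/bash
# Hypothetical helpers: attach to the REPL only if the dtach socket exists.
SOCKET="${SOCKET:-/tmp/speechcode.dtach}"   # socket path used in this article

socket_ready() {
  # -S is true only when the path exists and is a Unix domain socket.
  [ -S "$1" ]
}

attach_repl() {
  if socket_ready "$1"; then
    dtach -a "$1"
  else
    echo "No dtach socket at $1; is the server running?" >&2
    return 1
  fi
}
```

With these defined, attach_repl "$SOCKET" either connects to the REPL or prints a hint and returns a non-zero status.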
systemd
Today, systemd is how one creates a service that automatically starts when Linux boots. The controversy around systemd was one reason I avoided this project. But Unix has taught me the secret to happiness: low expectations. (See The UNIX-HATERS Handbook.) When I finally started digging into systemd, none of the criticisms surprised me. The Tragedy of systemd, a thoughtful talk by Benno Rice (YouTube), finally convinced me to get moving.
The hardest problem I faced was convincing systemd that a process created using dtach, which immediately forks and exits, was still running. The key was to wrap dtach in a script that itself doesn't exit until the server does. It's called start-speechcode-service:
#!/bin/bash
source /home/speechcode/.environment
dtach -n /tmp/speechcode.dtach /home/speechcode/bin/start-web-server
PID=`lsof -t /tmp/speechcode.dtach`
tail --follow /dev/null --pid=$PID
This script uses source to set up the environment variables that configure the server. Next, it runs dtach, which starts the server, creates a socket for communicating with it, and exits. Finally, it finds the process ID of the forked dtach process, which is the parent process of the web server. Since it's not possible to use bash's wait command on a process that isn't a subprocess of the shell, I use tail to keep the script running until the server exits.
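The tail trick is worth isolating. Here is a minimal sketch of it as a reusable function, assuming GNU coreutils (the --pid option is a GNU extension). The name wait_for_pid is hypothetical, and the guard against an empty PID is an addition that the script above doesn't have.

```shell
#!/bin/bash
# wait_for_pid is a hypothetical helper name, not part of the original script.
wait_for_pid() {
  local pid="$1"
  # Refuse to wait if no PID was found (for example, if lsof printed nothing).
  if [ -z "$pid" ]; then
    echo "wait_for_pid: no PID given" >&2
    return 1
  fi
  # /dev/null never grows, so tail has nothing to print. With --pid (GNU
  # coreutils), it exits once the process with that PID no longer exists.
  tail --follow /dev/null --pid="$pid"
}
```

Unlike bash's wait, this works for any process, because tail only checks whether the PID still exists rather than waiting on a child.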
The speechcode service is defined in /etc/systemd/system/speechcode.service:
[Unit]
Description=Speechcode server

[Service]
ExecStart=/home/speechcode/bin/start-speechcode-service
Restart=always
Type=simple
User=speechcode
WorkingDirectory=/home/speechcode/scheme/web/

[Install]
WantedBy=multi-user.target
This defines what user will run the server, in what directory the server will start, and what script will start it. It also arranges for the server to start when the system reaches multi-user.target (systemd's counterpart to the old multi-user run levels, 2 through 4), and to restart if it ever exits, unless we use systemctl to stop it deliberately.
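Note that the WantedBy= line only takes effect once the unit has been enabled. Here is a minimal sketch of that step, assuming the unit file is already in place. The function name enable_service and the SYSTEMCTL override are hypothetical, added so the commands can be previewed without running them.

```shell
#!/bin/bash
# Hypothetical install helper; this step is not shown elsewhere in the article.
# Set SYSTEMCTL="echo systemctl" to print the commands instead of running them.
SYSTEMCTL="${SYSTEMCTL:-sudo systemctl}"

enable_service() {
  local name="$1"
  # Make systemd re-read unit files after one has been added or edited.
  $SYSTEMCTL daemon-reload || return 1
  # Create the symlink that WantedBy=multi-user.target asks for, so that
  # the service is started on every boot.
  $SYSTEMCTL enable "$name"
}
```

Calling enable_service speechcode runs systemctl daemon-reload followed by systemctl enable speechcode.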
To start the server manually, we run:
sudo systemctl start speechcode
To check the server's status (and see recent log lines), we run:
sudo systemctl status speechcode -l
To stop the server, we run:
sudo systemctl stop speechcode
That's it.
But wait. There's more.
call-with-current-continuation
When debugging a problem with a web server, we use logging. But logging is just an advanced way to use print statements. For it to be useful, we have to know, in advance, what information to print. If there's an unexpected problem, all we can do is stare at the code and the logs and try to imagine what could have caused the problem. If we come up with a hypothesis, we can add more logging and wait for the problem to recur. Eventually, perhaps, we'll find and fix our bug.
This is Scheme, though, and we have call-with-current-continuation. We can use it to capture the stack, all the variables the stack contains, and all the values they reference. This is of more than academic value. It's exactly the kind of information we'd like to have while debugging.
In my web server, I wrap the code that dispatches HTTP requests inside report-errors, defined below:
(define most-recent-condition #f)

(define record-most-recent-condition!
  (let ((record-most-recent-condition-mutex (make-thread-mutex)))
    (lambda (condition)
      (with-thread-mutex-lock record-most-recent-condition-mutex
        (lambda ()
          (set! most-recent-condition condition))))))

(define (report-errors thunk)
  (call-with-current-continuation
   (lambda (continuation)
     (bind-condition-handler
      (list condition-type:error)
      (lambda (condition)
        (record-most-recent-condition! condition)
        (continuation #f))
      thunk))))
This code uses MIT/GNU Scheme's exception-handling system and threads, but it could just as easily use R6RS or R7RS Small exceptions and SRFI 18 threads. The idea is that, if an error is ever detected, we capture the current condition in the variable most-recent-condition, then continue along our merry way. The server's outer exception handlers will run, and they will send the client the right HTTP error, and perhaps even an error page. But when it comes time to investigate the problem, we can connect to our REPL and run our debugger on most-recent-condition, which contains the continuation in effect at the time the error occurred.
Let's try an example. We'll define a web request handler that always fails. It will handle any GET request of the form /example/X by calling error. First, we connect to our REPL:
dtach -a /tmp/speechcode.dtach
Now we define a new GET handler directly on the server. No restart is required.
(define-web-dispatcher ((request get) ("example" (? x)) ())
  (error "This request failed." x)
  (make-http-response
   (lambda () (write-string x))))
;Unspecified return value

1 ]=> █
Since we're trying to show what an ordinary web handler would do, our handler includes code after the call to error that does what any handler in my server should do: it returns the response code, an alist of additional headers, and a thunk that would have written the HTTP response if the error hadn't occurred.
After we visit http://localhost:8443/example/foo, let's see whether we've captured a condition:
1 ]=> most-recent-condition
;Value: #[condition 1263 "simple-error"]
We have. Now let's attach MIT/GNU Scheme's debugger to the continuation. We'll be in the innermost frame on the stack. We can look around and see what was happening at the time of the error, even if it occurred hours earlier:
1 ]=> (debug most-recent-condition)

There are 50 subproblems on the stack.

Subproblem level: 0 (this is the lowest subproblem level)
Expression (from stack):
    (begin <!> (make-http-response (lambda () (write-string x))))
 subproblem being executed (marked by <!>):
    (error "This request failed." x)
Environment created by a LAMBDA special form
 applied to: (#[http-request 1258] "foo")
There is no execution history for this subproblem.
You are now in the debugger. Type q to quit, ? for commands.

2 debug> █
Immediately, we see the expression that caused the error. We have access to local variables and objects accessible from this frame. Let's examine the variable x:
2 debug> v v
Evaluate expression: x
Value: foo

2 debug> v
Evaluate expression: (string? x)
Value: #t

2 debug> █
We see "foo", so this must be the HTTP request we made. Let's pretty-print the request:
2 debug> v v
Evaluate expression: (pp request)
#[http-request 1258]
(connection #[textual-i/o-port 1259 for channel: #[channel 1260]])
(headers
 ((host . "localhost:8443")
  (connection . "keep-alive")
  (pragma . "no-cache")
  (cache-control . "no-cache")
  …
  (accept-encoding . "gzip, deflate, br")
  (accept-language . "en-US,en;q=0.9")))
(http-version http/1.1)
(method get)
(peer-address #u8(127 0 0 1))
(request-arrival-ticks 185379108)
(request-arrival-time 3819504338)
(request-number 941)
(uri #[web-uri 1261])
(web-server #[web-server 1262])
No value

2 debug> █
For brevity, I've elided some output. Still, you can see how valuable this information is, and we've only examined one frame. If we needed to, we could visit other frames and evaluate expressions in them, too.
I use this technique all the time to find problems in web apps running on my server. I can't imagine going back to debugging based on logs alone. And because I have the REPL, not only can I track bugs down, but I can fix them without restarting the server.
On a high-traffic server, we might only capture the continuations of certain classes of errors, or errors at certain URLs, and we might capture a few at a time rather than just one, but the basic idea would be the same, and would still be valuable.
In software, there are few things as empowering as being "inside" a running program.