REPL as a Service

Tue 12 Jan 2021

Sometimes, I know that there's something that I should be doing, and I just don't do it. I keep putting it off because it's too much work, or too complicated, or uninteresting. It's easy to find an excuse not to do it.

For well over a decade, I've run this web site using a web server I wrote in Scheme. (My server runs behind nginx.) Until yesterday, I hadn't configured my Linux instance to start the server on boot. My hosting provider, Linode, is reliable, so I rarely had an unplanned outage. Whenever my instance restarted, I manually logged in and started my server. It worked.

But I knew that that was wrong. If there was a crash, or a power outage, or unplanned maintenance, I wanted my server to restart on its own, and I wanted it to do so right away. I didn't want to have to race to a terminal to get it going again. Let me show you how I finally made that happen. With dtach and systemd, we can make a web service that starts on its own and whose REPL is always available, even across logouts. And with call-with-current-continuation, we can take debugging to a whole new level.

`dtach` your REPL

If we're going to take advantage of the power of Scheme, we need access to a Read-Eval-Print Loop. That way, we can debug problems with more than just HTTP logs. We can even experiment with changes to the running server. But we need some way to connect to the REPL. I use dtach, which lets me start my server and log out, knowing that I'll be able to connect to the REPL whenever I log back in.

To start the server using my start-web-server script (not shown), we pass dtach the -n option, which creates a new session without attaching to it, and a filename that tells it where to create a Unix domain socket:

dtach -n /tmp/speechcode.dtach /home/speechcode/bin/start-web-server

To connect to the server's REPL later, we specify the same socket with the -a option:

dtach -a /tmp/speechcode.dtach

If we type '(), this is what we see:

'()

;Value: ()

1 ]=> █

Now we have access to the full power of the REPL. We can check the status of the server, inspect its data structures, debug problems (more on that later), and even make changes to code while the server is running.
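If the server isn't running, there is no socket to attach to. Here is a minimal sketch of a guard around the attach command. The helper names repl_socket_ready and attach_repl are hypothetical, not part of dtach or of my scripts, and the default socket path is the one used above.

```shell
# Hypothetical helper: succeed only if the given path is a Unix domain socket.
repl_socket_ready() {
  [ -S "$1" ]
}

# Hypothetical helper: attach to the REPL, or explain why we cannot.
attach_repl() {
  socket=${1:-/tmp/speechcode.dtach}
  if repl_socket_ready "$socket"; then
    dtach -a "$socket"
  else
    echo "No dtach socket at $socket; is the server running?" >&2
    return 1
  fi
}
```

With these defined in a shell startup file, running attach_repl with no arguments would try the default socket.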

`systemd`

Today, systemd is how one creates a service that automatically starts when Linux boots. The controversy around systemd was one reason I avoided this project. But Unix has taught me the secret to happiness: low expectations. (See The UNIX-HATERS Handbook.) When I finally started digging into systemd, none of the criticisms surprised me. The Tragedy of systemd, a thoughtful talk by Benno Rice (YouTube), finally convinced me to get moving.

The hardest problem I faced was convincing systemd that a process created using dtach, which immediately forks and exits, was still running. The key was to wrap dtach in a script that itself doesn't exit until the server does. It's called start-speechcode-service:

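#!/bin/bash

# Load the environment variables that configure the server.
source /home/speechcode/.environment

# Start the server under dtach, which forks and returns immediately.
dtach -n /tmp/speechcode.dtach /home/speechcode/bin/start-web-server

# Find the dtach process that holds the socket open.
PID=$(lsof -t /tmp/speechcode.dtach)

# Block until that process exits, so that systemd sees the service as running.
tail --follow /dev/null --pid="$PID"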

This script uses source to set up the environment variables that configure the server. Next, it runs dtach, which starts the server, creates a socket for communicating with it, and exits. Finally, it uses lsof to find the process ID of the forked dtach process, which is the parent process of the web server. Since it's not possible to use bash's wait command on a process that isn't a subprocess of the shell, I use tail with the --pid option, which makes tail exit when that process does. That keeps the script running until the server exits.
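To see the tail trick in isolation, here is a small sketch that assumes GNU tail. A background sleep stands in for the server. Unlike the real dtach process, it is a child of the shell, so wait would also work here, but tail --pid doesn't rely on that.

```shell
# Stand-in for the server: a background process that exits after one second.
sleep 1 &
pid=$!

# GNU tail's --pid option makes it exit once that process is gone.
# Following /dev/null means tail never has anything to print.
tail --follow /dev/null --pid="$pid"

# kill -0 sends no signal; it only checks whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
  state="running"
else
  state="exited"
fi
echo "$state"
```

The tail command should return about a second after it starts, once the sleep has finished.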

The speechcode service is defined in /etc/systemd/system/speechcode.service:

[Unit]
Description=Speechcode server

[Service]
ExecStart=/home/speechcode/bin/start-speechcode-service
Restart=always
Type=simple
User=speechcode
WorkingDirectory=/home/speechcode/scheme/web/

[Install]
WantedBy=multi-user.target

This defines what user will run the server, in what directory the server will start, and what script will start it. It also arranges for the server to start when the system reaches multi-user.target (the target that replaces the traditional multi-user run levels 2 through 4), and to restart if it ever exits, unless we use systemctl to stop it deliberately.
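The WantedBy line only takes effect once the unit has been enabled, and systemd has to be told to re-read its unit files after we create or edit one. Here is a sketch of those steps, wrapped in a hypothetical helper function (the name install_speechcode_service is mine, for illustration) so that nothing runs until we call it:

```shell
# Hypothetical helper; run it once after creating or editing the unit file.
install_speechcode_service() {
  # Make systemd re-read unit files from disk.
  sudo systemctl daemon-reload
  # Create the symlink requested by WantedBy= so the service starts at boot.
  sudo systemctl enable speechcode
  # Report whether the service is now enabled.
  systemctl is-enabled speechcode
}
```

Calling install_speechcode_service should end by printing the unit's enablement state.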

To start the server manually, we run:

sudo systemctl start speechcode

To check the server's status (and see recent log lines), we run:

sudo systemctl status speechcode -l

To stop the server, we run:

sudo systemctl stop speechcode

That's it.

But wait. There's more.

`call-with-current-continuation`

When debugging a problem with a web server, we use logging. But logging is just an advanced way to use print statements. For it to be useful, we have to know, in advance, what information to print. If there's an unexpected problem, all we can do is stare at the code and the logs and try to imagine what could have caused the problem. If we come up with a hypothesis, we can add more logging and wait for the problem to recur. Eventually, perhaps, we'll find and fix our bug.

This is Scheme, though, and we have call-with-current-continuation. We can use it to capture the stack, all the variables the stack contains, and all the values they reference. This is of more than academic value. It's exactly the kind of information we'd like to have while debugging.

In my web server, I wrap the code that dispatches HTTP requests inside report-errors, defined below:

(define most-recent-condition #f)

(define record-most-recent-condition!
  (let ((record-most-recent-condition-mutex (make-thread-mutex)))
    (lambda (condition)
      (with-thread-mutex-lock
       record-most-recent-condition-mutex
       (lambda ()
         (set! most-recent-condition condition))))))

(define (report-errors thunk)
  (call-with-current-continuation
   (lambda (continuation)
     (bind-condition-handler
      (list condition-type:error)
      (lambda (condition)
        (record-most-recent-condition! condition)
        (continuation #f))
      thunk))))

This code uses MIT/GNU Scheme's exception-handling system and threads, but it could just as easily use R⁶RS or R⁷RS Small exceptions and SRFI 18 threads. The idea is that, if an error is ever detected, we capture the current condition in the variable most-recent-condition, then continue along our merry way. The server's outer exception handlers will run, and they will send the client the right HTTP error, and perhaps even an error page. But when it comes time to investigate the problem, we can connect to our REPL and run our debugger on most-recent-condition, which contains the continuation in effect at the time the error occurred.

Let's try an example. We'll define a web request handler that always fails. It will handle any GET request of the form /example/X by calling error. First, we connect to our REPL:

dtach -a /tmp/speechcode.dtach

Now we define a new GET handler directly on the server. No restart is required.

(define-web-dispatcher ((request get) ("example" (? x)) ())
  (error "This request failed." x)
  (make-http-response (lambda () (write-string x))))

;Unspecified return value

1 ]=> █

Since we're trying to show what an ordinary web handler would do, our handler includes code after the call to error that does what any handler in my server should do — it returns the response code, an alist of additional headers, and a thunk that would have written the HTTP response if the error hadn't occurred.

After we visit http://localhost:8443/example/foo, let's see whether we've captured a condition:

1 ]=> most-recent-condition

;Value: #[condition 1263 "simple-error"]

We have. Now let's attach MIT/GNU Scheme's debugger to the continuation. We'll be in the innermost frame on the stack. We can look around and see what was happening at the time of the error, even if it occurred hours earlier:

1 ]=> (debug most-recent-condition)

There are 50 subproblems on the stack.

Subproblem level: 0 (this is the lowest subproblem level)
Expression (from stack):
    (begin <!> (make-http-response (lambda () (write-string x))))
 subproblem being executed (marked by <!>):
    (error "This request failed." x)
Environment created by a LAMBDA special form

 applied to: (#[http-request 1258] "foo")
There is no execution history for this subproblem.
You are now in the debugger.   Type q to quit, ? for commands.

2 debug> █

Immediately, we see the expression that caused the error. We have access to local variables and objects accessible from this frame. Let's examine the variable x:

2 debug> v
v

Evaluate expression: x

Value: foo

2 debug> v

Evaluate expression: (string? x)

Value: #t

2 debug> █

We see "foo", so this must be the HTTP request we made. Let's pretty-print the request:

2 debug> v
v

Evaluate expression: (pp request)
#[http-request 1258]
(connection #[textual-i/o-port 1259 for channel: #[channel 1260]])
(headers
 ((host . "localhost:8443")
  (connection . "keep-alive")
  (pragma . "no-cache")
  (cache-control . "no-cache")
  …
  (accept-encoding . "gzip, deflate, br")
  (accept-language . "en-US,en;q=0.9")))
(http-version http/1.1)
(method get)
(peer-address #u8(127 0 0 1))
(request-arrival-ticks 185379108)
(request-arrival-time 3819504338)
(request-number 941)
(uri #[web-uri 1261])
(web-server #[web-server 1262])
No value

2 debug> █

For brevity, I've elided some output. Still, you can see how valuable this information is, and we've only examined one frame. If we needed to, we could visit other frames and evaluate expressions in them, too.

I use this technique all the time to find problems in web apps running on my server. I can't imagine going back to debugging based on logs alone. And because I have the REPL, not only can I track bugs down, but I can fix them without restarting the server.

On a high-traffic server, we might only capture the continuations of certain classes of errors, or errors at certain URLs, and we might capture a few at a time rather than just one, but the basic idea would be the same, and would still be valuable.

In software, there are few things as empowering as being "inside" a running program.
