Category Archives: System Administration

Scaling an API: The Things Nobody Tells You

I’ve spent a lot of time in my career building and scaling out APIs, and over the years I’ve noticed something: most of what makes an API scale is not what people write about. The standard advice stops at “put a load balancer in front and add more web servers.” That is the easy part. The web tier is stateless, so you can buy your way out of it. The interesting problems sit behind it, in the database, the cache, and the way clients talk to you.

So I wrote down the less-obvious things, the decisions that decide whether an API absorbs a traffic spike or falls over at the first one. None of it is exotic. It is just the stuff that tends to get learned the hard way.

Put Rate Limiting at the Front Door

Rate limiting is almost always an afterthought. The API works, traffic is light, and capping requests feels like next quarter’s problem. Then one bad client or a runaway retry loop sends ten thousand requests a minute and takes the whole service down. It belongs at the very front, not the end: a request you reject at the edge costs nothing, no query, no cache lookup, no compute. Use a token bucket keyed per client, which absorbs short bursts and settles to a steady rate:

# token bucket, one per client, refilled at a steady rate
b = buckets[client_id]
b.refill(rate = 100/min)
if b.tokens < 1 {
    return 429, ratelimit_headers(b)   # reject at the edge, cheaply
}
b.tokens -= 1

What people skip is telling the client what the limits are. A bare 429 leaves every client guessing, and a guessing client retries badly. The IETF is standardizing this in RateLimit header fields for HTTP, replacing the dozen proprietary X-RateLimit-* variants with one standard set:

HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 42
Retry-After: 42

It also defines a combined RateLimit field and a RateLimit-Policy that advertises the quota up front. The draft comes from the Italian government, Red Hat, and Microsoft rather than a single vendor, and providers like Cloudflare already emit these. Send them, and well-behaved clients throttle themselves, which is load you never have to shed by force.

Retries Are the Other Half of Rate Limiting

Rate limiting controls what comes in. Retries are what happens when you say no, and they can hurt more than the original traffic. The classic failure: you get slow, clients time out, every client retries at once, and that traffic lands on a system already underwater. It is called a retry storm, and it is one of the most common ways an API takes itself down.

Retries need exponential backoff with jitter. Backoff makes each attempt wait longer than the last; jitter adds randomness so a thousand clients that failed at the same instant do not all retry at the same instant and collide again on the next wave. AWS has a good write-up on why jitter matters. Cap the attempts as well, so a failing dependency is not hammered forever:

for attempt in 0..maxRetries {
    resp = send(request)
    if resp.ok or not retryable(resp) {
        return resp
    }
    delay = min(cap, base * 2**attempt)
    sleep(random(0, delay))            # full jitter desynchronizes clients
}
return last_error                      # give up; do not retry past the budget

The other question is whether a retry is even safe. A retried write must not charge a card or create a record twice. Idempotency keys solve this: the client sends a unique key, you store the result against it, and a repeat of that key returns the stored result instead of redoing the work.

POST /v1/charges
Idempotency-Key: 8f14e45f-ceea-467a-9f1a-1b2c3d4e5f60

# server side: claim the key atomically BEFORE doing any work
key = request.header["Idempotency-Key"]
if not store.claim(key) {              # insert-if-absent; false if it already exists
    return store.await_result(key)     # another copy owns it: wait for or replay its result
}
resp = process(request)
store.put(key, resp, ttl=24h)          # record the result for future replays
return resp

The claim has to be atomic. A plain check-then-write lets two concurrent retries both miss and both run, which defeats the whole point; an insert guarded by a unique constraint, or a Redis SETNX, makes exactly one caller the owner; the rest wait for or replay its result. Stripe wrote the canonical description. Build it in early, because retrofitting idempotency onto live write endpoints is painful.

Caching Is a Stack, Not a Setting

Everyone caches. What gets missed is that caching happens at several independent layers, and the biggest win is often the one you ignored:

  • HTTP caching via Cache-Control, so browsers and intermediaries reuse a response without asking again.
  • CDN or edge caching, serving one response to many clients from a nearby location.
  • Application caching in memory or Redis, for computed values and query results.
  • Database caching, the query-plan and buffer caches the database keeps on its own.

The application layer usually follows the cache-aside pattern: check the cache, fall back to the source on a miss, then populate it for next time. Store a sentinel for empty results too, or a lookup that finds nothing falls through to the database on every call.

# cache-aside: nil means a real miss, not a cached "not found"
v = cache.get(key)
if v == nil {
    v = db.query(key)
    cache.set(key, v ?? NEGATIVE, ttl = 10s)   # cache empties too, as a sentinel
}
return (v == NEGATIVE) ? null : v

The lesson that changed how I think about caching is that even a tiny TTL pays off. You do not need an hour. On an endpoint hit a hundred times a second, a ten-second cache means you compute the real answer once and serve it from cache about a thousand times:

Cache-Control: public, max-age=10, stale-while-revalidate=30

The data is ten seconds stale, which nobody notices on most read paths, and you have cut the load behind the cache by three orders of magnitude. Short TTLs are the cheapest scaling lever there is, and they get left on the table because ten seconds feels too short to bother with. The stale-while-revalidate hint, from RFC 5861, goes further still: a cache serves the stale copy instantly while it refreshes in the background, so no client ever waits on the miss.

Not Every Response Needs a Body

If a client already holds a copy that has not changed, the cheapest possible response is to say exactly that. You attach an ETag, a short version fingerprint, to the response; the client sends it back on its next request, and if nothing changed you return a 304 with no body:

GET /v1/widgets/42
If-None-Match: "a1b2c3d4"

HTTP/1.1 304 Not Modified

MDN has the details. That saves bandwidth, and often computation too: if you can decide the ETag is unchanged without building the full response, you skip the expensive work entirely.

Stampedes Bypass Your Cache

This is the one that bites people, because it defeats the cache you just built, and the reason is hiding in that cache-aside snippet: the cache is only written at the end, after the query returns. That is fine until traffic is concurrent.

Picture a popular key with a ten-second TTL. It expires, and in the window between the first request missing and that same request finishing its query, every other request that arrives also finds the cache empty, also misses, and also queries the database, because nobody has populated it yet. Under load that window holds hundreds of requests, and they all run the same query at once. The cache that was meant to protect the database has aimed a thundering herd straight at it. In the logs it looks like this:

12:00:09.998  hit   key=widgets:top
12:00:10.001  MISS  key=widgets:top  -> db query 1422ms
12:00:10.001  MISS  key=widgets:top  -> db query 1455ms
12:00:10.002  MISS  key=widgets:top  -> db query 1480ms
12:00:10.002  MISS  key=widgets:top  -> db query 1503ms
   ... 900 more identical queries in the same 4ms ...

The more popular the key, the bigger the herd, so a stampede hits your hottest paths hardest. Wikipedia calls it a cache stampede. The fix is single-flight, or request coalescing: when many callers need the same uncached value, one does the work and the rest wait and share its result. Go ships it as golang.org/x/sync/singleflight:

v, _, _ := g.Do("widgets:top", func() (any, error) {
    return loadFromDB()      // runs once; the herd shares its result
})

One query runs instead of a thousand. But single-flight only coalesces within one process. Across a fleet of servers each one still rebuilds the key once, so a popular key gives you one stampede per server. To collapse the rebuild across the whole fleet, take a short distributed lock in Redis with SET NX and let the single winner rebuild while everyone else serves the previous value:

# only ONE caller across the fleet rebuilds the key
ok = redis.set("lock:" + key, token, NX=true, EX=5)
if ok {                                # we won the lock
    v = loadFromDB()
    cache.set(key, v, ttl=10s)
    redis.del("lock:" + key)           # release (verify token first in real code)
} else {                               # someone else is already rebuilding
    v = cache.get_stale(key)           # serve the slightly-stale copy, do not pile on
}

SET key value NX means set the key only if it does not already exist. It is atomic, so it succeeds for exactly one caller and fails for every other, which is precisely the property a lock needs. The EX 5 gives the lock a five-second expiry, so a holder that crashes mid-rebuild cannot wedge the key shut forever. The winner does the one rebuild; everyone else sees the lock is taken and serves the old value instead of hammering the database. Single-flight handles the in-process herd, the Redis lock handles the cross-fleet herd, and they compose.

Two cheaper measures help on top: jitter your TTLs so keys created together do not all expire together, and use probabilistic early expiration so a hot key refreshes just before it goes cold. Behind a CDN the same idea is origin shield or tiered caching, collapsing many edge misses into a single origin request; CloudFront calls it Origin Shield. Turn it on.

Send Reads to Read Replicas

Most APIs are read-heavy, and mine is no exception: reads dwarf writes. Yet by default they all go to the same primary that handles writes, and that primary becomes your bottleneck long before anything else does. Split them: the primary takes writes, one or more replicas take reads, and you route each request to the right place:

func dbFor(req) Conn {
    if req.Method != "GET" {
        return primary              // all writes go to the primary
    }
    if req.Path == "/v1/orders/checkout" {
        return primary              // read-after-write must be current
    }
    return replicas.next()          // round-robin across read replicas
}

The catch is replication lag. A replica trails the primary by a fraction of a second, so a client that writes and immediately reads its own write can see stale data. Route those read-after-write paths to the primary, as the checkout case does, and send everything else to the replicas.

An application routing writes to a primary database and reads to two read replicas, with replication flowing from the primary to the replicas
Writes go to the primary; reads fan out to the replicas. Read-after-write paths stay on the primary, because replication lags by a fraction of a second.

Connection Pooling Is the Hidden Ceiling

Read replicas lead straight to the least glamorous and most important item here. A database connection is not free. On most engines each one is a server-side process or thread with real memory behind it, and there is a hard limit on how many can exist at once. If every worker opens its own, you hit that ceiling long before you run out of CPU or memory, and the database refuses new connections while your servers sit mostly idle.

The fix is a connection pooler between the application and the database, multiplexing many client connections onto a few real ones. Every engine has the option: PgBouncer or pgpool for Postgres, ProxySQL for MySQL, an in-process pool in most drivers and ORMs. PgBouncer makes a good example, because the detail that matters is its pooling mode:

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20

In transaction mode a client holds a real connection only for the length of a single transaction, so the twenty above can serve a thousand clients. The mode is the whole game: session pooling pins a connection to a client for its entire session and scales far worse. The one catch is that anything relying on session state, like prepared statements or advisory locks, needs care or a different mode.

Pooling and the read/write split from the last section are separate jobs, but they meet here: you usually run one pool per endpoint, one against the primary and one against the replicas, and the application picks the pool the same way the router earlier picked the database. Some proxies will do that routing for you. ProxySQL and pgpool-II can inspect a query and send writes to the primary and reads to a replica, and RDS Proxy offers read-only endpoints for Aurora. It is convenient, but I keep the routing in the application, because a proxy guessing read versus write still has to get transactions and read-after-write right, and that is exactly where it tends to surprise you.

This gets dramatically worse with serverless, a perfect segue to the next section. If every function instance opens its own connection, a spike that starts hundreds of instances opens hundreds of connections at once, and the database refuses them. On AWS the managed answer is RDS Proxy. Whatever the engine, the rule is the same: the connection count is a hard ceiling, and pooling is how you raise it.

The OFFSET Pagination Cliff

This one stays invisible until your tables get big. LIMIT/OFFSET pagination works perfectly in testing, but OFFSET does not skip rows for free: to reach the rows after an offset, the database has to generate every row before it and throw them away. The first page is instant. By the five-thousandth, it is walking past a hundred thousand rows and discarding them on every request, and it only gets slower the deeper anyone pages. Use keyset, or cursor, pagination instead: remember the sort key of the last row you returned and ask for the rows after it:

select id, name from widgets
where id > :last_id
order by id
limit 20;

With an index on the sort column, every page costs the same at any depth. It also stays correct when rows are inserted mid-page, where OFFSET silently skips or duplicates. The one thing you give up is jumping to an arbitrary page number, which most APIs do not actually need.

One catch: the sort key has to be unique. If you page by something that is not, like a timestamp, order by a unique tiebreaker as well, such as (created_at, id), and compare the pair; otherwise two rows that share a value can straddle a page boundary and get skipped or repeated.

Watch for Hot Keys

Even with replicas and caching, a single key can be popular enough that the one node holding it becomes a bottleneck while everything else sits idle: a celebrity account, a viral item, one enormous customer. Horizontal scaling does not help, because you cannot shard a single key across machines. Cache that value locally in each process so it never touches the shared store, or, for a write hot spot, add a small random suffix to the key so the load spreads across nodes:

# pin a hot value in-process so it never hits the shared store
if local.fresh(key) {
    return local.get(key)              # zero network, zero shared load
}
v = redis.get(key)
local.set(key, v, ttl = 1s)            # a one-second local cache is enough
return v

The trap is that average load can look perfectly healthy while one key is on fire, so watch your hottest keys specifically, not just the average.

Move Work to Write-Time

The cheapest request is one whose answer was computed before it arrived. A surprising amount of read load is the same aggregation or join, recomputed on every call, when it could have been computed once, when the data changed, and simply read back. This is the same instinct as caching, moved one step earlier: do the work on the write, not on the read.

The bluntest form is a denormalized counter or rollup maintained on write. Instead of a count(*) across a large table on every read, keep the number in a column and update it in the same transaction as the write:

-- on write: keep the rollup current
update widgets set view_count = view_count + 1 where id = :id;

-- on read: no aggregation, just a column
select view_count from widgets where id = :id;

For heavier aggregates, a materialized view precomputes the result and stores it, so the read is a plain select against an indexed table instead of a large scan. You refresh it on a schedule or when the inputs change:

create materialized view top_widgets as
    select widget_id, count(*) as hits
    from views
    group by widget_id;

refresh materialized view concurrently top_widgets;   -- on a schedule

The fullest form is a precomputed read model: when a write lands, assemble the exact shape the read endpoint serves and store it, so the endpoint becomes a key lookup rather than a query. The tradeoff is real, though. You do more work on writes, the read can lag the truth by a little, and you now have a thing to keep in sync. For a read-heavy API that is almost always the right trade, because writes are rare and reads are constant, so you are moving work off the path that runs a thousand times as often.

Do Not Let One Slow Dependency Sink Everything

Scaling is also about failing well. When one dependency slows down, the default is for requests to pile up waiting on it, each holding a thread and a connection, until the whole service is exhausted and even unrelated endpoints stop responding. One slow thing takes down everything. Three patterns prevent it: load shedding, rejecting some requests when you are over capacity, because a fast 503 beats a slow timeout; circuit breakers, which stop calling a clearly-failing dependency and fail fast; and bulkheads, which isolate the connection or thread pool per dependency so one cannot starve the others. Load shedding can be as blunt as a concurrency cap at the front of the handler:

# a bounded semaphore acquires-or-fails atomically: no check-then-act race
if not slots.try_acquire() {        # slots = maxInflight permits
    w.Header().Set("Retry-After", "1")
    return 503, "over capacity"     // shed early, protect the rest
}
defer slots.release()

A circuit breaker applies the same idea to a downstream dependency: once it is clearly failing, stop calling it and return right away instead of making every request wait out the timeout. The piece people forget is the way back: after a cooldown the breaker goes half-open and lets a single probe through, which closes it on success or re-opens it on failure. Without that, it stays open forever.

if breaker.state == OPEN {
    if not breaker.cooldown_elapsed() {
        return fallback()              # fail fast while the breaker is open
    }
    breaker.state = HALF_OPEN          # cooldown passed: let one probe through
}
resp, err = call(dependency)
breaker.record(err)                    # a clean probe closes it; a failure re-opens
return resp

And a lot of work does not belong in the request at all. If a client asks for something slow, accept it, return 202 Accepted with a URL to poll, and hand the work to a queue that background workers drain at their own pace:

HTTP/1.1 202 Accepted
Location: /v1/jobs/abc123

That keeps the request path fast and lets the slow work scale independently of the API that accepts it.

Scaling With Managed Pieces

You do not have to build all of this yourself. On AWS, API Gateway and Lambda bundle a lot of it together, and the pair is worth understanding even if you never deploy on it, because it is a good illustration of what sensible defaults look like.

Amazon API Gateway in front of AWS Lambda, which scales to many copies whose connections pass through Amazon RDS Proxy before reaching Amazon RDS
API Gateway throttles at the edge, Lambda scales to many copies, and RDS Proxy pools their connections down to a set the database can actually handle.

API Gateway sits in front as the door. It gives you throttling, per-client usage plans, response caching, and request validation at the edge, before a bad request ever reaches your code. A usage plan is the same token bucket from the first section, plus a burst ceiling and a monthly quota, attached to an API key and handed to you as configuration instead of code you write and maintain. Behind it, Lambda runs your handler and scales the compute automatically, spinning up copies to match concurrency, with provisioned concurrency to keep a pool warm if cold starts hurt.

The trap is the one from the last section: Lambda scales by running many isolated copies of your function, and if each copy opens its own database connection, a spike turns into a connection storm, which is exactly why RDS Proxy exists. But that connection problem hints at a larger truth: not every API belongs in Lambda in the first place.

Lambda is a great fit when traffic is spiky or has a low baseline, the kind of workload where you would otherwise pay for servers that sit idle most of the time. A webhook receiver that fires a few times a minute, a payment or Slack callback, an internal endpoint hit occasionally, a scheduled job, a public API whose traffic swings between near-zero and sudden bursts: in all of those you pay only for what actually runs, and the automatic scaling does precisely what you want.

It is a poor fit in a few specific cases, and they are worth knowing before you commit to it:

  • Sustained high throughput. At a steady few thousand requests a second, per-invocation pricing costs far more than a couple of always-on containers, and you get none of the idle-time savings that justify Lambda in the first place. A busy search or feed API is cheaper and more predictable running on containers.
  • Tight latency budgets. A cold start adds tens to hundreds of milliseconds. If your p99 target is 50ms, a cold start blows straight through it, and buying enough provisioned concurrency to avoid that is just paying for warm servers with extra steps.
  • Long-running or streaming requests. API Gateway caps a synchronous integration at 29 seconds, with no exceptions. A long report, a large file download, a slow upstream call, or a server-sent-events stream simply does not fit. That work belongs on a normal server, or behind the 202-and-a-queue pattern from earlier.
  • Heavy shared state. Each invocation is isolated, so the in-process hot-key cache from earlier does not survive between requests, and anything that wants a large warm local cache or a long-lived connection fights the execution model the whole way.

The rule I actually use: reach for Lambda when traffic is bursty and the work is short and self-contained, and reach for plain containers behind a load balancer when traffic is steady, latency-sensitive, or long-running. The underlying pattern outlives the vendor either way. Elastic compute in front of a database with a fixed connection limit needs a pooler between the two, or it will scale right past what the database can take.

The Thread Running Through All of It

If there is one idea connecting all of this, it is that scaling an API is mostly about not doing work. Cache the answer so you do not compute it twice. Coalesce duplicate requests so one query serves a thousand callers. Send reads to replicas so the primary is free for writes. Shed or defer the load you cannot handle right now. Reject early what you were going to reject anyway. Adding servers is the part everyone talks about, because it is the part you can buy. These are the things that decide whether those servers spend their time on real work, or on work you could have avoided.

Secure Email – Exim, Dovecot, Perdition.

Is your email secure?

The default behavior of POP3 and IMAP, is to pass your username and password (and all your e-mail for that matter) over a socket with no encryption. This means that all that information is susceptible to a “man in the middle” type attack, where the plain text packets can be intercepted and read.

Is that really going to happen? Probably not, but you can avoid it, by setting up your e-mail servers to provide SSL encryption for all incoming and outgoing email.

This post will talk only about the mail server software I use: Exim (SMTP), Dovecot (POP3 & IMAP) and Perdition (for POP3/IMAP proxying / load balancing).

SSL vs. TLS vs. STARTTLS

First off, there’s a lot of confusion around the naming conventions.

TLS (Transport Layer Security) is SSL- technically, TLS version 1 is SSL version 3. The use of SSL v1 and v2 should be avoided if not completed disabled, so for all intents and purposes, when we say “SSL”, we mean SSL v3.

SSL uses separate dedicated ports for mail: 993 for secure IMAP, 995 for secure POP3 and 465 for secure SMTP. The encrypted connection is negotiated immediately after the socket is opened, and all communication is encrypted. This is more attune to how HTTPS works. The downside with this implementation, is that the service is provided on different ports, which may mean changes to your firewall rules, ACL’s and mail clients.

STARTTLS is a command implementation that works on the existing IMAP, POP3 and SMTP ports. The email client connects to the normal unencrypted ports, and uses a plain text command (STARTTLS in SMTP and IMAP, and STLS in POP3) to initiate a TLS (SSLv3) connection. From that point on, all communication is encrypted. The plus with this, is that the service works over the existing ports, so no firewall or ACL changes are needed.

A big part of the confusion comes from the mail clients, as they all seem to refer to the implementations with different names:

http://en.wikipedia.org/wiki/Comparison_of_e-mail_clients#SSL_and_TLS_support

In a lot of cases, the options are between “SSL” and “TLS”; SSL being an SSLv3 (TLS) connection on the different ports, and TLS meaning a STARTTLS connection. So for this post, we’ll use the same language.

We’re going to setup both (SSL and STARTTLS), which gives us the most flexibility with our clients, though technically, having dedicated ports for SSL services was deprecated in favor of STARTTLS, as it was thought that using two different ports for plaintext and SSL connections seemed wasteful.

Certificates

For our purposes, we’re just going to use a self-signed certificate for e-mail. This provides the same encryption as a certificate bought from a trusted authority- the only difference is the trust part. Obviously, if you’re going to set this up for customers or for the general public, a trusted authority is better, as you won’t receive any of those annoying trust errors when connecting.

There’s a million tutorials out there that show you how to create a self-signed certificate, so I won’t go on too much about this. There’s even a handy site that will generate it for you online.

http://www.selfsignedcertificate.com/

Create  the key:

openssl genrsa -out mail.example.com.key 2048

Create the certificate:

openssl req -new -x509 -key mail.example.com.key -out mail.example.com.crt -days 3650 -subj /CN=mail.example.com

Exim

The SSL config for Exim is pretty straight forward:

tls_advertise_hosts     = *
tls_on_connect_ports    = 465
tls_certificate         = /path/to/ssl/mail.example.com.crt
tls_privatekey          = /path/to/ssl/mail.example.com.key

This basically says to advertise STARTTLS to everybody (*), to assume connections on port 456 are SSL connections, and use the certificate and key files referenced.

To make SSL work on port 465, you’ll need to tell Exim to also listen on that port. This would likely require adding another entry to your “local_interfaces” option- something like:

local_interfaces = 192.168.0.1.25:192.168.0.1.587:192.168.0.1.465

Restart Exim, and SSL should work. More information can be found here.

Dovecot

Dovecot provides POP3 and IMAP services, and supports both SSL and STARTTLS. The setup, again, is pretty straight forward:

protocols = imap pop3 imaps pop3s
ssl = yes
ssl_cert = </path/to/ssl/mail.example.com.crt
ssl_key = </path/to/ssl/mail.example.com.key

Restart Dovecot and test. More information can be found here.

Perdition

Perdition is a mail retrieval proxy. It handles load balancing and load distribution of POP3/IMAP connections, by proxying based on database or regex lookups. One other nice feature of Perdition, is that it can offload the encryption handling of mail connections. It can handle the SSL/STARTTLS negotiation, and then proxy the connections to local servers unencrytped, which reduces the overhead on the actual mail servers.

In my system, I use Perdition to load balance mail between multiple mail servers, as well as handle all the encryption/decryption.

Copy your imap4 and pop3 config files to new imap4s and pop3s config files, and change each, respectively, to:

protocol IMAP4S

and

protocol POP3S

and then add to each:

ssl_mode ssl_listen
ssl_cert_file /path/to/ssl/mail.example.com.crt
ssl_key_file /path/to/ssl/mail.example.com.key

Now, like I said before, in my case I only listen for encrypted connections with Perdition, then I relay mail internally over a non-encrypted link. Perdition has all sorts of ssl_mod options for handling different setups.

In my existing imap4 and pop3 config files I also added:

ssl_mode tls_listen
ssl_cert_file /path/to/ssl/mail.example.com.crt
ssl_key_file /path/to/ssl/mail.example.com.key

Telling it to listen for STARTTLS requests on the non-secure IMAP and POP3 ports.

Restart your existing IMAP4 and POP3 servers, and then start two new perdition instances, using the new IMAPs and POP3s config files.

Testing

The easiest way to test is to use the OpenSSL command line “s_client”, which lets you connect to encrypted services as a client, and validate that the SSL config is working.

IMAPs (SSL)

openssl s_client -connect mail.example.com:993

IMAP + STARTTLS

openssl s_client -connect mail.example.com:143 -crlf -starttls imap

POP3s (SSL)

openssl s_client -connect mail.example.com:995

POP3 + STARTTLS

openssl s_client -connect mail.example.com:110 -crlf -starttls pop3

SMTPs (SSL)

openssl s_client -connect mail.example.com:465

SMTP + STARTTLS

openssl s_client -connect mail.example.com:25 -crlf -starttls smtp

Adding encryption to your mail system is easy, and pretty much every mail client supports either SSL or STARTTLS, if not both.

No “Fail Whale” for Fonolo After Slashdot

Last Thursday (June 23rd), my company Fonolo was featured on Slashdot, which then it made it’s way to Boing Boing, and a handful of other sites.

Now, this was great news for us, as it generated a TON of traffic into our sites, but I was also understandably concerned about what this extra traffic coming into our system was going to do. I’ve heard (and seen) dozens of horror stories about sites featured on Slashdot, who’s servers simply ended up melting from the extra traffic. Having a full out “fail whale” situation, would simply be an embarrassment to us; and I would consider it a personal failure, given that I have been the chief architect of the Fonolo system.

But despite traffic peaking at ~44MB sustained, our system handled it with tons of room to spare- no interruptions, no failures, no downtime.

and lighttpd handled all the web traffic like a champ.

Simple FreeBSD Tunnels

It’s pretty easy in FreeBSD to create simple tunnels between two FreeBSD machines, between a FreeBSD machine and a Cisco router, or to any other host that supports the GRE interface.

The GRE (Generic Routing Encapsulation) interface (defined in RFC 1701 and 1702) has been included for a while, and provides a “standard” encapsulation protocol for tunneling.

So lets assume first a host-to-host tunnel:

Host A:

External IP: 192.168.10.1
Tunnel IP: 10.255.255.1/30
Internal Block: 10.10.0.1/24

Host B:

External IP: 192.168.20.1
Tunnel IP: 10.255.255.2/30
Internal Block: 10.20.0.1/24

Obviously, in a real-world situation, the “External IP” value in your setup, would likely be a real, valid, routable IP address. Also, I like to use a “stub” host block for the tunnel IP addresses- this is basically a block of four IP’s (/30, 2 usable), that are simply used to route either end of the tunnel.

Yes, you can use an IP in the internal block as the tunnel interface, which means you wouldn’t need the extra routes to route the local blocks, but this can break OSPF, so it’s just good (maybe?) practice to keep the two blocks separate.

First off, you need to make sure the gre interface type is available; your kernel should be compiled with:

device gre

or, if you’ve built all the modules as loadable, you can add:

if_gre_load="YES"

to your /boot/loader.conf; you can then either reboot, or run:

host> /sbin/kldload /boot/kernel/if_gre.ko

Which should load the module.

Next, you need to configure the interfaces on both machines; this assumes you don’t already have a greN interface setup- if you do, then just use the next logical number instead of 0.

On Host A:

ifconfig gre0 create
ifconfig gre0 10.255.255.1 10.255.255.2 netmask 255.255.255.252
ifconfig gre0 tunnel 192.168.10.1 192.168.20.1

On Host B:

ifconfig gre0 create
ifconfig gre0 10.255.255.2 10.255.255.1 netmask 255.255.255.252
ifconfig gre0 tunnel 192.168.20.1 192.168.10.1

You can run ifconfig gre0 on each host, and the interface should show as UP. You can also confirm it’s up by pinging the other end of the tunnel from each host; so from Host A, ping the 10.255.255.2 IP address (which is the remote end of the tunnel):

hosta> ping -c 3 10.255.255.2
64 bytes from 10.255.255.2: icmp_seq=0 ttl=64 time=43.192 ms
64 bytes from 10.255.255.2: icmp_seq=1 ttl=64 time=51.099 ms
64 bytes from 10.255.255.2: icmp_seq=2 ttl=64 time=51.459 ms

--- 10.255.255.2 ping statistics ---
3 packets transmitted, 3 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 43.192/48.583/51.459/3.815 ms

Now you need to add routes for your internal blocks, so each end know to route over the tunnel:

On Host A:

route add 10.20.0.1/24 10.255.255.2

On Host B:

route add 10.10.0.1/24 10.255.255.1

Now don’t forget to add this config to your rc.conf files, so the interfaces get setup when the machines reboot:

On Host A:

cloned_interfaces="gre0"
ifconfig_gre0="inet 10.255.255.1 10.255.255.2 netmask 255.255.255.252 
tunnel 192.168.10.1 192.168.20.1"
static_routes="tunnel"
route_tunnel="10.20.0.1/24 10.255.255.2"

On Host B:

cloned_interfaces="gre0"
ifconfig_gre0="inet 10.255.255.2 10.255.255.1 netmask 255.255.255.252 
tunnel 192.168.20.1 192.168.10.1"
static_routes="tunnel"
route_tunnel="10.10.0.1/24 10.255.255.1"

(the ifconfig line was broken for display purposes; it should be one long line in you rc.conf file)

Now if your Host B in this scenario is not a FreeBSD box, but a Cisco device, the Cisco setup goes like this:

interface Tunnel0
  ip address 10.255.255.2 255.255.252
  tunnel mode gre
  tunnel source 192.168.20.1
  tunnel destination 192.168.10.1
!

ip route 10.10.0.1/24 255.255.255.0 10.255.255.2

Remember, this is just a simple encapsulation protocol, and not encryption- so data is not secure. Later, I’ll do an update which includes IPSEC on top of the GRE interface.

Using DKIM in Exim

Since Exim 4.70, DKIM (DomainKeys Indentified Mail – RFC4871) has been supported by default.

The current implementation supports signing outgoing mail, as well as verifying signatures in incoming messages, using the acl_smtp_dkim ACL. By default, DKIM signatures are verified as new messages come in, though no action is taken unless you’ve implicitly configured rules in the DKIM ACL.

After installing Exim (>= 4.70), you should see debug logs for incoming mail from servers that have DKIM signatures setup- they look like:

DKIM: d=gmail.com s=gamma c=relaxed/relaxed a=rsa-sha256 [verification succeeded]
Verifying Incoming Mail

By default, Exim does not filter any mail based on the validity of the DKIM signature- it’s up to you to add ACL rules to control what happens when you receive messages with “bad” signatures.

First add an ACL section for the DKIM processing; this should be included with your other ACL statements:

acl_smtp_dkim = acl_check_dkim

Next, after the “begin acl”, section, add your DKIM ACL section, and by default, accept all messages in this ACL:

acl_check_dkim:

	accept

Now you need to decide what kind of rules you want to setup- you probably don’t want to put a rule that applies to all domains- though, if the company went to the trouble of adding DKIM signatures to their e-mail, you’d hope they’d get it right, and not publish invalid public keys.

For now, let’s add a simple rule for gmail; google knows what they’re doing, so their systems should be setup correctly:

acl_check_dkim:

	#
	# check the DKIM signature for gmail
	#
	deny 	message 	= Common guys, what's going on?
		sender_domains 	= gmail.com
		dkim_signers 	= gmail.com
		dkim_status 	= none:invalid:fail

	accept

You can add as many rules, for whatever domains you want in this ACL.

Signing Outgoing Mail

Now that you’re checking incoming mail, you probably want to sign mail coming out of your system. This is a relatively easy process, that I’ve broken down into three steps:

Step1– Generate a private and public key to sign your messages; you can do this easily with openssl:

#openssl genrsa -out dkim.private.key 1024

Then extract the public key from the private key:

#openssl rsa -in dkim.private.key -out dkim.public.key -pubout -outform PEM

Step2– Configure the Exim remote-smtp transport to sign outgoing messages, using your new private key. You’ll need to pick a domain and a selector for this process.

When remote SMTP servers validate your DKIM signatures, they simply do a DNS look up, based on the selector and your domain- the domain needs to (obviously) be a valid domain you own, that you can add DNS entries to, and the selector can be any string you want. So, for example, using the domain “example.com”, and the selector “x”, you would add to the remote_smtp transport in Exim:

remote_smtp:
        driver = smtp
        dkim_domain = example.com
        dkim_selector = x
        dkim_private_key = dkim.private.key
        dkim_canon = relaxed

This tells Exim to sign any outbound e-mail, using the domain example.com, the selector “x”, and the private key we just generated. The dkim_canon = relaxed, sets the canonicalization method to use when signing messages. DKIM supports “simple” and “relaxed” algorithms- to understand the difference, see section 3.4 of the DKIM RFC.

Step3– add your DKIM public key to your DNS.

The DKIM public key generated above is advertised to other SMTP servers, using a DNS TXT record. In DNS for the domain example.com, add a new TXT record:

x._domainkey.example.com.   TXT v=DKIM1; t=y; k=rsa; p=<public key>

Where “x” is the selector you used above, and <public key> is the public key data (minus the key header/footer text).

When setup correctly, your DKIM text record should look something like this:

# host -t txt x._domainkey.example.com

x._domainkey.example.com descriptive text "v=DKIM1\; t=y\; k=rsa\; p=MIGfMA0GCS
qGSIb3DQEBAQUAA4GNADCBiQKBgQC5k8yUyuyu9UAVHHU7Al4ppTDtxFWsZ6Pqd9NWZnomtewBdz8I
2LJkqmA/3Cyb5Eiaqk4NulPFfDbfA0Lkw7SNyOS9BRN02KGtKIWjFqDwjB99haaWYw9H4IZcuJp0Y
q0kySCdBp/sPP+iTotdBiE85Jakw3tzgYkdvaS05ZUdBwIDAQAB"

(lines breaks were added for readability- your entry should be one continuous line)

This DNS record is referred to as the “selector” record; you need to also setup a “policy” record. The policy record is your domains policy for domain keys- you should start with something like:

_domainkey.example.com. t=y; o=~;

The t=y specifies that you are in test mode and this should be removed when you are certain that your domain key setup is functioning properly. The “~” in the o=~ specifies that some of the mail from your domain is signed, but not all. You could also specify o=- if all of the mail coming from your domain will be signed.

Once you have all of that in-place,  restart Exim, and send out a message using the remote-smtp transport. You should now see a DKIM-Signature: header listed in the message headers, which lists your domain (as d=), and selector (as s=), as well as a signature for this e-mail, which can be validated against your public DKIM key, that you’ve published in DNS.

For more information, see the Exim DKIM page, or the DKIM RFC.

UPDATE:

Once you’ve set everything up, you can test your DKIM (and SPF and SenderID, etc) install, by using the port25.com validation service.

Just send an e-mail to check-auth@verifier.port25.com, and it will auto-respond with a validation report

UPDATE 2:

I’ve updated this to use a key length of at least 1024 bits, otherwise it’s possible to crack the DKIM key, and fake it to show that your email is valid. This came to light largely because of an article about how Google was using a 512 bit key and a “hacker” factored the key, and spoofed emails to the Google founders:

http://www.wired.com/threatlevel/2012/10/dkim-vulnerability-widespread/

I previously listed a key length of 768, which is significantly harder to break than a 512 bit key, but just to be safe, use 1024 or better yet, 2048.