Infrastructure Active 2026

Platform Stack

A self-hosted platform behind Cloudflare. Traefik routes traffic, Tailscale handles SSH, and Docker Compose runs the services. The host exposes no inbound ports to the public internet. Deploys come from GitHub Actions; Prometheus and Grafana watch what's running.

Zero-ingress self-hosted platform

Docker
Traefik
Tailscale
Cloudflare
GitHub Actions
Prometheus
Grafana

platform-stack (panel tabs: routes, tunnel, networks, metrics)

Route	Status	Via
app.example.com	healthy	→ traefik
api.example.com	healthy	→ traefik
grafana.example.com	access	cf gated
prom.example.com	access	cf gated

cloudflare ─tunnel→ cloudflared ─edge→ traefik

Why I built this

I wanted somewhere to run my own services (small APIs, internal tools, side projects) without paying for a managed runtime per app, and without standing up a Kubernetes cluster I’d then have to operate. One Linux host, around ten service stacks, a shared Postgres and Redis. That’s the right size for what I actually do.

Most self-hosted writeups stop at docker compose up. The bit that actually matters is the wrapper around it: keeping the host off the public internet, shipping changes from CI without SSH-ing in by hand, and having enough signal to know what’s broken without having to log in. This repo is that wrapper.

What it is

A reproducible runtime for one Linux host:

Cloudflare DNS + Tunnel for public ingress without open ports.
Traefik v3 for label-driven, host-header routing to Compose containers.
Docker Compose: one stack per service, shared Postgres and Redis.
GitHub Actions over Tailscale SSH for deploys without a public SSH port.
Prometheus + Grafana + exporters for metrics, dashboards, and alerts.
Terraform for Cloudflare DNS, tunnel ingress, and Access policies.

Hostnames, tokens, and the actual list of services are scrubbed in the public repo. The shape is real.

Request flow

The only path from the public internet to a service is Cloudflare → tunnel → Traefik. The host has no inbound 80/443. ufw is default-deny on the public interface, with allow rules on the tailnet interface and loopback. To bypass any of this you need a tailnet identity.

Operators take a different path entirely: SSH over the tailnet, straight to the host, never through Cloudflare or Traefik.

TLS terminates at Cloudflare; inside the tunnel it’s plain HTTP. If this ever leaves Cloudflare, that’s the day I’d add Let’s Encrypt to Traefik.

Networks

Three Docker networks, all internal. Services join only the ones they need; the Docker socket is mounted read-only into Traefik and nowhere else.

Network	Members
`edge`	Traefik, cloudflared, every HTTP-serving container
`data`	Postgres, Redis, services that need them
`monitoring`	Prometheus, Grafana, exporters
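
One way to read the table: two containers can talk to each other only if they share a network. A minimal Python sketch of that rule follows. The service names and memberships are hypothetical (the real list is scrubbed), so treat it as an illustration of the segmentation idea rather than the actual topology.

```python
# Hypothetical membership map; only the three network names come from the table above.
NETWORKS = {
    "traefik": {"edge"},
    "cloudflared": {"edge"},
    "api": {"edge", "data"},          # serves HTTP and needs Postgres
    "postgres": {"data"},
    "redis": {"data"},
    "prometheus": {"monitoring"},
    "grafana": {"edge", "monitoring"},
}


def can_reach(a: str, b: str) -> bool:
    """Two containers can talk only if they share at least one Docker network."""
    return bool(NETWORKS[a] & NETWORKS[b])


print(can_reach("traefik", "api"))       # True: both on edge
print(can_reach("traefik", "postgres"))  # False: no shared network
```

The useful property is the second line of output: the proxy that faces the tunnel has no route to the database.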

Deployments

CI builds the image, pushes it to GHCR, joins the tailnet with a short-lived identity, SSHes to the host as the deploy user, and runs docker compose pull followed by docker compose up -d.

The runner has no standing identity on the host. Tailscale OAuth gives each job an ephemeral tailnet IP, the SSH key is short-lived, and the deploy user can only run compose inside /srv/* (no sudo). Image references resolve to the commit SHA; :latest doesn’t exist in production.

Migrations run in a one-shot container under a Compose migrate profile before the new app comes up, and have to be backwards-compatible with the previous version. If a migration fails, the old container keeps serving and the workflow goes red.

Rollback is the same workflow with a different tag. A workflow_dispatch input redeploys any past SHA. That’s the only rollback path I keep working, because anything fancier rots between the times I’d actually need it.
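
The ordering described above (pull, migrate, then bring the app up) can be sketched as a small command planner. This is an illustration under assumptions, not the workflow from the repo: the IMAGE_TAG variable, the migrate service name, and the example service and SHA are all hypothetical.

```python
from __future__ import annotations

import shlex


def deploy_plan(service: str, sha: str) -> list[str]:
    """Return the shell commands for one deploy, in order.

    IMAGE_TAG and the `migrate` service name are assumptions for illustration.
    """
    compose = (
        f"IMAGE_TAG={shlex.quote(sha)} "
        f"docker compose --project-directory /srv/{shlex.quote(service)}"
    )
    return [
        f"{compose} pull",
        # One-shot migration container. If it exits non-zero the caller stops
        # here, so the old app container keeps serving.
        f"{compose} --profile migrate run --rm migrate",
        f"{compose} up -d",
    ]


for cmd in deploy_plan("api", "3f2c1ab"):
    print(cmd)
```

A rollback is the same call with an older SHA, which is the point of keeping a single path.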

Adding a service

Five steps. Four are platform-side; the fifth runs from CI on every deploy.

1. Service manifest. A small YAML file in the service’s own repo with name, domain, image, port, healthcheck, and middleware chain.
2. Compose stack at /srv/<name>/ with Traefik labels for routing. For containers I can’t relabel, scripts/render-service.sh builds a Traefik file-provider config from the same manifest.
3. DNS + tunnel route. Add the hostname to terraform/cloudflare/main.tf. make apply creates the CNAME, the tunnel ingress rule, and (for internal hostnames) the Cloudflare Access app and email-domain policy.
4. Scrape target in docker/monitoring/prometheus.yml, then reload Prometheus.
5. Deploy workflow. Copy github-actions/deploy-example.yml into the service repo, set SERVICE and IMAGE, push.
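
To make the manifest-to-routing step concrete, here is a hedged sketch of how a manifest could be turned into Traefik Docker-provider labels. The manifest keys and the middleware names are hypothetical; only the label keys themselves are standard Traefik ones, and the real rendering may well differ.

```python
from __future__ import annotations


def traefik_labels(manifest: dict) -> dict[str, str]:
    """Build Traefik Docker-provider labels from a service manifest."""
    name = manifest["name"]
    labels = {
        "traefik.enable": "true",
        # Host-header routing: one router per service, matched on its domain.
        f"traefik.http.routers.{name}.rule": f"Host(`{manifest['domain']}`)",
        f"traefik.http.services.{name}.loadbalancer.server.port": str(manifest["port"]),
    }
    middlewares = manifest.get("middlewares", [])
    if middlewares:
        labels[f"traefik.http.routers.{name}.middlewares"] = ",".join(middlewares)
    return labels


# Hypothetical manifest; field names are illustrative.
manifest = {
    "name": "api",
    "domain": "api.example.com",
    "image": "ghcr.io/example/api",
    "port": 8080,
    "healthcheck": "/health",
    "middlewares": ["rate-limit@file", "compress@file"],
}

for key, value in traefik_labels(manifest).items():
    print(f"{key}={value}")
```

Keeping the manifest as the single input means the label path and the file-provider path cannot drift apart silently.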

To sanity-check, I look in four places: Docker logs, the Grafana host overview, up{service="<name>"} in Prometheus, and a real curl /health. If any of those is missing the service, the wiring is wrong somewhere in steps 1 to 4.

Observability

Prometheus uses static scrape config. There are few enough services that auto-discovery would add complexity without buying me anything. What I watch:

Edge (Traefik): RPS, latency histograms, error rate by router.
Host: CPU, memory, disk, load (node_exporter).
Containers: per-container CPU/RAM and restart count (cAdvisor).
Postgres: connections, tx rate, replication lag, slow queries.
Per service: /metrics with RED metrics.

Logs are JSON to stdout. Vector tails the Docker socket and ships them off-host. The pipeline doesn’t do any parsing. Structure happens at the source.
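
"Structure happens at the source" means each service emits one JSON object per line. A minimal sketch using Python's standard logging module is below. The field names other than trace_id are assumptions, and the services in this platform need not be written in Python.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # trace_id is attached via `extra=`; None when the caller omits it.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("api")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order created", extra={"trace_id": "abc123"})
```

Because the line is already structured, the shipping pipeline can forward it untouched.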

Alerts are a deliberately short list. Anything that fires more than once a quarter without anyone doing something about it gets tuned or deleted.

Alert	Trigger
`HostDown`	node_exporter scrape fails for 2m
`DiskWillFillIn4h`	linear projection of free space on `/` or `/var/lib/docker` reaches zero within 4h
`TraefikHigh5xx`	5xx ratio > 2% for 5m on any router
`ContainerRestartLoop`	container restarts > 5 in 15m
`PostgresReplicationLag`	lag > 60s for 5m
`BackupTooOld`	last successful backup > 36h ago
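
As a worked example of the DiskWillFillIn4h idea, the sketch below fits a least-squares line through recent free-space samples and asks when it crosses zero. It is similar in spirit to PromQL's predict_linear, but it is not the actual alert rule; the sample values and function name are made up for illustration.

```python
from __future__ import annotations


def seconds_until_full(samples: list[tuple[float, float]]) -> float | None:
    """Fit a least-squares line through (timestamp, free_bytes) samples.

    Returns seconds from the last sample until projected free space hits zero,
    or None when free space is flat or growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_v - slope * mean_t
    t_zero = -intercept / slope
    return t_zero - samples[-1][0]


# Made-up samples: 10 GB free, shrinking by 1 MB per second, sampled every 10 minutes.
samples = [(0, 10_000_000_000), (600, 9_400_000_000),
           (1200, 8_800_000_000), (1800, 8_200_000_000)]
remaining = seconds_until_full(samples)
fires = remaining is not None and remaining < 4 * 3600
print(remaining, fires)
```

With these numbers the projection lands at roughly 2.3 hours, inside the 4-hour window, so the alert would fire.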

For most of these, “did the thing run at all?” turns out to be more useful than threshold-based alerts. Thresholds drift over time. Absence doesn’t.
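
The absence-style check behind BackupTooOld fits in a few lines. This is a sketch of the logic only, assuming the last-success timestamp is available from somewhere; the function name and the way the timestamp is obtained are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=36)  # threshold from the BackupTooOld row in the table above


def backup_too_old(last_success: datetime | None, now: datetime) -> bool:
    """Fire when no successful backup is on record or the last one is older than MAX_AGE."""
    if last_success is None:
        return True  # never having succeeded is itself the signal
    return now - last_success > MAX_AGE


now = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
print(backup_too_old(now - timedelta(hours=20), now))  # False
print(backup_too_old(now - timedelta(hours=40), now))  # True
print(backup_too_old(None, now))                       # True
```

Note that nothing here depends on how large or how slow the backup was, which is why this kind of check does not drift.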

Backups

Nightly pg_dump plus WAL archive to off-host object storage. A weekly job restores the dump into a throwaway container and runs a smoke query against it. That’s the actual restore drill, not just a check that a backup file exists somewhere.

Key decisions

Docker Compose, not Kubernetes. For one host running about ten services, k8s would be more operational surface than the rest of the platform combined. Compose’s downsides (no multi-host, no rolling updates by default) are manageable in practice. Healthchecks and Traefik retries turn a recreate into a sub-second blip.

Cloudflare Tunnel, not public 80/443. Public 80/443 means an ACME setup, fail2ban tuning, and a public IP that turns up in Shodan within hours. The tunnel is one container, no inbound ports, and Cloudflare’s WAF and rate limiting come along for free. The cost is a hard dependency on Cloudflare and one extra hop per request. Both are fine here.

Traefik, not Caddy or NGINX. The Docker provider is the reason. Adding labels to a container and having it become routable is the right ergonomics for a Compose-based setup. Middlewares (rate limiting, headers, compression) as labels cover most of what I want without me writing config files.

GitHub Actions + Tailscale SSH, not a self-hosted runner. Actions is already there. The host doesn’t run a CI agent, so a compromised runner doesn’t put the attacker on the host. Tailscale OAuth gives each job an ephemeral identity, which is cleaner than IP-allowlisting GitHub egress.

Tailscale, not a bastion or hand-rolled WireGuard. WireGuard works, but I’d be writing Tailscale’s control plane badly. A bastion is one more box in the failure path. The control-plane dependency is real: existing connections survive a control-plane outage, but new logins might not. The break-glass plan is the hosting provider’s web console.

One host, for now. A second host would multiply operational surface without buying anything I currently need. Single-host risk is mitigated by nightly backups with weekly restore drills, IaC for the Cloudflare layer, and a documented rebuild path. None of that is HA. I don’t think paying the cost of HA makes sense until something here actually needs it.

Boundaries

Things I’ve intentionally not solved yet:

No multi-host orchestration. That’s a v2 problem.
No zero-downtime guarantees. A recreate is a sub-second blip that Traefik retries through, but it isn’t a guarantee.
No autoscaling. When CPU or RAM runs out, the answer is “buy a bigger box.”
No tracing yet. Services emit a trace_id in their logs but there’s no collector.
No long-term metric storage. Fifteen days locally; I’ll add remote-write to a managed Prometheus when capacity-planning trends actually matter.
No image signing. Images are pushed with a PAT and are not signed with cosign.
No inter-container network policy. For one host, segmentation by Compose network is enough.

Each of these is a gap I know about, with a known cost. They get fixed when the threat model or operational reality changes enough to make them worth it.

What I learned

Most of the value is in the wrapper, not the components. Traefik, Compose, Cloudflare, Tailscale, and Prometheus all do their jobs out of the box. The interesting question is how they fit together: what’s public, what’s private, what each layer is allowed to talk to, and what happens when one of them is unhealthy.

Writing it down made the boundaries obvious. There were three or four points where the only reason something worked was that I remembered to do a thing manually. Those got automated or moved into Terraform.

The full architecture and runbook are on GitHub.
