One-click homelab: integrating GitLab, Proxmox, and K8s with GitOps Principles

One unlucky day, I destroyed my homelab while upgrading various components. I had backups of the virtual machines running my Kubernetes cluster, but unfortunately I had forgotten to enable backups of the secondary disk attached to each VM. Those secondary disks stored all of my cluster's persistent data, including the Longhorn volumes. I assumed the backups were fine, but when I restored the VMs, the secondary disks were gone. I had lost all my persistent data and had to start from scratch.

My homelab was at least three or four years old, and I had accumulated a lot of configurations, customizations, and data over time. I had started it while first approaching the DevOps world, so it didn't really follow best practices. Rebuilding everything from scratch was a daunting task, and I realized I needed a better way to manage my homelab infrastructure.

It might sound incredible, but losing data due to a lack of proper backup strategies is a common issue in the industry as well, not just in personal homelabs. Some notable examples include the 2017 GitLab.com database incident, where an accidental deletion revealed that several backup mechanisms had been silently failing, and the 2021 OVHcloud datacenter fire, where customers without off-site backups lost their data for good.

Given the experience I had gathered over the years, I decided to rebuild my homelab following Infrastructure as Code (IaC) principles. This time, I wanted to be able to recreate my entire setup with just one click, without going through the tedious process of manual configuration. My objective was not just to have a backup of my infrastructure, but to achieve a recovery time objective (RTO) measured in minutes rather than days.

This way, I can also brag about having a better infrastructure than most companies out there! πŸ˜„

What to Expect from This Blog Post

In this blog post, you will find my personal solution for building a homelab following IaC principles. In particular, I will show how I integrated open-source tools, minimal custom code, and a small budget to achieve a one-click deployment of my homelab infrastructure.

The solution has the following properties:

  1. One-click provisioning: the entire infrastructure is recreated by running a single pipeline.
  2. An RTO of about one hour and a configurable RPO.
  3. GitOps-managed cluster state: everything is code, versioned, and reproducible.
  4. Client-side encrypted off-site backups.
  5. Minimal running cost (~€10/month).

The tech stack of the solution includes:

  1. GitLab SaaS: I use it for simplicity, but you can also use your own GitLab instance.
  2. Proxmox: I use it in my homelab, but you can use any hypervisor you prefer; the only requirement is that an OpenTofu provider exists for it.
  3. OpenTofu: necessary to create the various components of the solution. OpenTofu is the open-source fork of Terraform, which I use for infrastructure provisioning.
  4. Ubuntu Cloud Images: I use Ubuntu Cloud Images as the base operating system for the VMs in my Proxmox cluster. These images are optimized for cloud environments and provide an automated way to provision VMs with Cloud-Init.
  5. KubeSpray: I use KubeSpray to create the Kubernetes cluster on top of the Proxmox VMs. KubeSpray is a popular open-source project that provides a set of Ansible playbooks for deploying and managing Kubernetes clusters.
  6. FluxCD: I use FluxCD for GitOps management of the Kubernetes cluster. FluxCD is a popular open-source project that enables continuous delivery and GitOps for Kubernetes.
  7. Sealed Secrets: I use Sealed Secrets to store all the credentials directly in Git. While this is more than enough for a homelab, OpenBao might be a better fit for an enterprise.
  8. Longhorn: I use Longhorn as the storage solution for the Kubernetes cluster. Longhorn is a popular open-source project that provides a distributed block storage system for Kubernetes.
  9. GCP Cloud Storage: I use GCP Cloud Storage for off-site backups. Note that the solution I will present ensures client-side encryption of data (so even Google cannot decipher it). Alternatively, you can use any NFS server or a self-hosted S3 object store like Garage.

Recovery Time Breakdown

Here's what happens during a full disaster recovery (1 hour RTO):

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Full Infrastructure Recovery Timeline                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚ Parallel:                                                       β”‚
β”‚  GCP Bucket β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                            β”‚
β”‚  (4-5 min, ~8% - runs in parallel)                              β”‚
β”‚                                                                 β”‚
β”‚ Main Flow:                                                      β”‚
β”‚  0min                                                     60min β”‚
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚
β”‚  β”‚         β”‚                        β”‚       β”‚              β”‚    β”‚
β”‚  β–Ό         β–Ό                        β–Ό       β–Ό              β–Ό    β”‚
β”‚  VM       K8s                    Sealed  Flux          Flux     β”‚
β”‚  Prov.    Bootstrap              Secrets Deploy        Reconcileβ”‚
β”‚                                                                 β”‚
β”‚  9-10     23                     1-2   3-4              20-25   β”‚
β”‚  min      min                    min   min              min     β”‚
β”‚  (~16%)   (~38%)                 (~2%) (~6%)            (~38%)  β”‚
β”‚                                                                 β”‚
β”‚  Total: ~60 minutes                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Phase Details:
β”œβ”€ GCP Bucket (4-5 min): Create/verify backup storage [RUNS IN PARALLEL]
β”œβ”€ VM Provisioning (9-10 min): Download Ubuntu Cloud Images, create VMs with Cloud-Init
β”œβ”€ K8s Bootstrap (23 min): KubeSpray cluster deployment (long phase - ~38% of total time)
β”œβ”€ Sealed Secrets (1-2 min): Deploy secret management controller
β”œβ”€ FluxCD Deploy (3-4 min): Install GitOps operator and sync repositories
└─ FluxCD Reconcile (20-25 min): Complete reconciliation of cluster state from GitOps repos (long phase - ~38% of total time)

Note: The Kubernetes bootstrapping phase and the FluxCD reconciliation each account for roughly 38% of the total RTO. The GCP bucket creation runs in parallel with VM provisioning, so it doesn't add to the overall recovery time. It might also be possible to shorten the FluxCD reconciliation by tweaking its configuration for more aggressive syncs, as sketched below.
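As an illustrative sketch (the name, source, path, and values are placeholders of mine, and the exact numbers need tuning against cluster load), a Flux Kustomization with tighter intervals could look like this:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-controllers        # illustrative name
  namespace: flux-system
spec:
  interval: 10m                  # how often to re-apply from source
  retryInterval: 30s             # retry failed reconciliations quickly during bootstrap
  timeout: 5m                    # fail fast instead of waiting on the default timeout
  wait: true                     # block until the applied resources are ready
  prune: true
  sourceRef:
    kind: GitRepository
    name: d2-infra               # placeholder source
  path: ./controllers            # placeholder path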

Here's a time-lapse of the complete infrastructure deployment from scratch to a running cluster (~60 minutes compressed).

Repositories Structure

All the infrastructure code is organized in five repositories:

  1. gitlab-runner repository: contains the OpenTofu code to create the GitLab Runner on Proxmox as an LXC container.
  2. IaC repository: contains all the code necessary to bootstrap the Proxmox VMs and the Kubernetes cluster using KubeSpray.
  3. d2-fleet repository: defines the desired state of the Kubernetes clusters and tenants in the fleet.
  4. d2-infra repository: defines the desired state of the cluster add-ons and the monitoring stack.
  5. d2-apps repository: defines the desired state of the applications deployed across environments.

We will examine the IaC repository in detail; the d2-* repositories simply apply the d2-reference-architecture provided by the FluxCD team, which, I must say, is very well thought out and implemented. 👍

The GitLab Runner repository is also quite straightforward: it only contains the OpenTofu code to create the LXC container and register the runner with GitLab, so I will not cover it here.

How They Work Together

The deployment flow follows a clear progression:

  1. gitlab-runner β†’ Bootstraps the CI/CD infrastructure needed to run the automated pipelines
  2. IaC β†’ Handles the foundational layer: VMs, backup storage, Kubernetes cluster, and essential components (Sealed Secrets, FluxCD operator)
  3. d2-fleet, d2-infra, d2-apps β†’ Once the IaC pipeline completes, FluxCD takes over and continuously reconciles the cluster state based on these GitOps repositories

In essence, the IaC repository gets you from an empty hypervisor to a GitOps-ready cluster; the d2 repositories manage everything from that point forward. This separation means the IaC pipeline only needs to run for infrastructure changes and periodic disaster recovery tests, while the d2 repositories handle all day-to-day operations through FluxCD's automatic reconciliation.

IaC Pipeline

The high-level steps of provisioning the homelab infrastructure performed by the IaC pipeline are:

  1. Creates the GCP Cloud Storage bucket for offsite backups.
  2. Creates the Proxmox VMs using OpenTofu.
  3. Bootstraps the Kubernetes cluster using KubeSpray.
  4. Uploads the generated kubeconfig file to GitLab as an artifact with restricted access.
  5. Triggers the Sealed Secrets deployment sub-pipeline.
  6. Triggers the FluxCD deployment sub-pipeline.
  7. Longhorn is then deployed by FluxCD as part of the d2-infra repository.
  8. Restore Jobs defined in the d2-infra repository restore the Longhorn volumes from the GCP Cloud Storage bucket.

GCP Cloud Storage Bucket

The only prerequisites to create this part of the pipeline are:

  1. Create an OpenTofu service account with the necessary permissions to create and manage GCP Cloud Storage buckets.
  2. Create a GitLab Personal Access Token with the necessary permissions to modify CI/CD variables in the GitLab project. Note: I had to use a PAT because I'm on the free tier of GitLab SaaS, which does not yet support Project Access Tokens. If you have a paid plan, you should use a Project Access Token instead.

The process the pipeline follows is:

  1. If not already present, it creates the GCP Cloud Storage bucket using the google_storage_bucket resource:

    1resource "google_storage_bucket" "longhorn_backup_bucket" {
    2 name = var.gcp_backup_longhorn_bucket_name
    3 location = var.gcp_backup_region
    4 storage_class = "NEARLINE"
    5 uniform_bucket_level_access = true
    6 public_access_prevention = "enforced"
    7 hierarchical_namespace {
    8 enabled = true
    9 }
    10}
  2. It creates a dedicated service account, longhorn_backup_service_account.

  3. It grants the newly created service account the roles/storage.objectAdmin role on the bucket.

  4. It creates/syncs an HMAC key for the service account.

  5. The GitLab pipeline saves the generated HMAC key as a CI/CD variable in the GitLab project. In my case, it uses the Access Token created in the prerequisites step to do so.

This HMAC key is later injected as a secret into the Kubernetes cluster, allowing Longhorn to connect to the GCP Cloud Storage bucket.
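As a rough sketch, the Kubernetes secret that ends up in the cluster (created later by the FluxCD deployment step) looks like the following; the longhorn-system namespace is an assumption of mine and the values are placeholders, but the name and keys match the secretKeyRef entries used by the rclone deployment shown further below:

apiVersion: v1
kind: Secret
metadata:
  name: longhorn-backup-hmac-key
  namespace: longhorn-system   # assuming Longhorn's default namespace
type: Opaque
stringData:
  access_id: GOOG1E...         # HMAC access ID generated for the service account
  secret: <redacted>           # HMAC secret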

Proxmox VMs

The OpenTofu provider used is bpg/proxmox.

The prerequisites to create this part of the pipeline are:

  1. Create a Proxmox API token with the necessary permissions to create and manage VMs.
  2. Create an SSH key pair and store the private key as a GitLab CI/CD variable. This key will be used by the on-premises GitLab Runner to connect to the Proxmox VMs.

The pipeline follows this process:

  1. OpenTofu instructs Proxmox to download the Ubuntu Cloud Image using the proxmox_virtual_environment_download_file resource:

    1resource "proxmox_virtual_environment_download_file" "ubuntu_24_noble_qcow2_img" {
    2content_type = "iso"
    3datastore_id = "nfs-nas-1"
    4node_name = "pve2"
    5url = var.proxmox_k8s_node_image_url
    6overwrite = true
    7file_name = var.proxmox_k8s_node_image_name
    8checksum = var.proxmox_k8s_node_image_checksum
    9checksum_algorithm = var.proxmox_k8s_node_image_checksum_algorithm
    10}
  2. OpenTofu creates the Cloud-Init configuration file for the worker and master nodes using the proxmox_virtual_environment_file resource. The important configurations to add to the Cloud-Init file are:

    • SSH Keys: Inject both your personal public key and the GitLab Runner's public key
    • Package Management: Install necessary packages (like qemu-guest-agent)
    • Longhorn Prerequisites: Configure according to Longhorn documentation
    Code snippet (worker nodes)

    resource "proxmox_virtual_environment_file" "ubuntu_cloud_init_worker" {
      content_type = "snippets"
      datastore_id = "nfs-nas-1"
      node_name    = "pve2"
      overwrite    = true
      source_raw {
        data = <<EOF
    #cloud-config
    users:
      - default
      - name: ${var.proxmox_k8s_node_username}
        groups:
          - sudo
        shell: /bin/bash
        ssh_authorized_keys:
    %{~for key in var.proxmox_k8s_node_ssh_keys}
          - ${key}
    %{~endfor}
        sudo: ALL=(ALL) NOPASSWD:ALL

    package_update: true
    package_upgrade: true
    packages:
      - qemu-guest-agent
      - nfs-common

    # Disk partitioning setup
    disk_setup:
      /dev/sdb:
        table_type: gpt
        layout: true
        overwrite: false

    # Filesystem setup
    fs_setup:
      - label: data
        filesystem: ext4
        device: /dev/sdb1
        partition: auto
        overwrite: false

    # Mount configuration
    mounts:
      - [/dev/sdb1, /mnt/data, ext4, "defaults,nofail", "0", "2"]

    write_files:
      - path: /etc/modules-load.d/dm_crypt.conf
        content: |
          dm_crypt
        owner: root:root
        permissions: '0644'

    runcmd:
      - systemctl enable qemu-guest-agent
      - systemctl start qemu-guest-agent
      - systemctl stop multipathd.socket
      - systemctl stop multipathd
      - systemctl disable multipathd.socket
      - systemctl disable multipathd
      - systemctl mask multipathd
      - systemctl mask multipathd.socket
      - modprobe dm_crypt
      - systemctl enable iscsid
      - systemctl start iscsid
      - echo "done" > /tmp/cloud-config.done
    EOF
        file_name = "ubuntu.cloud-config-worker.yaml"
      }
    }
  3. OpenTofu generates the output kubespray_inventory using the templatefile function with inventory.tpl. You can customize the inventory.tpl to fit your needs following the KubeSpray documentation. Here's my version of the inventory.tpl file.

    inventory.tpl
    {
      "all": {
        "vars": {
          "ansible_user": "${ansible_user}",
          "ansible_become": true,
          "calico_cni_name": "k8s-pod-network",
          "nat_outgoing": true,
          "nat_outgoing_ipv6": true,
          "calico_pool_blocksize": 26,
          "calico_network_backend": "vxlan",
          "calico_vxlan_mode": "CrossSubnet",
          "kube_proxy_strict_arp": true,
          "kube_encrypt_secret_data": true,
          "kubeconfig_localhost": true,
          "artifacts_dir": "/output",
          "etcd_deployment_type": "host",
          "etcd_metrics_port": 2381,
          "etcd_listen_metrics_urls": "http://0.0.0.0:2381",
          "etcd_metrics_service_labels": {
            "k8s-app": "etcd",
            "app.kubernetes.io/managed-by": "kubespray",
            "app": "kube-prometheus-stack-kube-etcd",
            "release": "kube-prometheus-stack"
          },
          "kube_proxy_metrics_bind_address": "0.0.0.0:10249"
        },
        "children": {
          "kube_control_plane": {
            "hosts": {
              %{ for name in master_nodes ~}
              "${name}",
              %{ endfor ~}
            }
          },
          "etcd": {
            "hosts": {
              %{ for name in master_nodes ~}
              "${name}",
              %{ endfor ~}
            }
          },
          "kube_node": {
            "hosts": {
              %{ for name in worker_nodes ~}
              "${name}",
              %{ endfor ~}
            }
          },
          "k8s_cluster": {
            "children": [
              "kube_control_plane",
              "kube_node"
            ]
          }
        }
      }
    }

    IMHO, the most important thing here is to limit KubeSpray to installing only the components needed for a minimal Kubernetes cluster that is ready for FluxCD deployment.

  4. Saves the generated inventory file as a GitLab artifact for use in the next stage.

Kubernetes Cluster Bootstrapping

This part of the pipeline bootstraps the Kubernetes cluster using KubeSpray.

I think KubeSpray is the best solution for creating a production-ready Kubernetes cluster in an on-premises environment, but I also think it should be limited to only the components necessary for a working cluster. KubeSpray, which is based on Ansible, provides many options to install components like CNIs, Ingress controllers, and monitoring stacks. However, in my opinion, these components should be installed using a more mature GitOps tool like FluxCD.

The prerequisite for this part of the pipeline is the SSH key pair created earlier: KubeSpray connects to the VMs through the runner's private key, exposed to the job as $RUNNER_SSH_PRIVATE_KEY_PATH.

The pipeline job is defined as follows:

apply-kubespray-production-home:
  stage: deploy
  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/docker:cli
  tags:
    - mgmt-zone
    - self-hosted
  needs:
    - job: opentofu-apply-production-home
      artifacts: true
  services:
    - *dind
  before_script:
    - apk add --no-interactive jq
  script:
    - mkdir -p $CI_PROJECT_DIR/inventory
    - mkdir -p output
    - chmod 600 $RUNNER_SSH_PRIVATE_KEY_PATH
    - jq -r '.kubespray_inventory.value' home-tf-output.json > $CI_PROJECT_DIR/inventory/inventory.json
    - docker run --mount type=bind,source=$CI_PROJECT_DIR/inventory,dst=/inventory --mount type=bind,source=$RUNNER_SSH_PRIVATE_KEY_PATH,dst=/ssh/id --mount type=bind,source=$CI_PROJECT_DIR/output,dst=/output --rm quay.io/kubespray/kubespray:$KUBESPRAY_VERSION ansible-playbook -i /inventory/inventory.json --private-key /ssh/id cluster.yml
  environment:
    name: production/home
  artifacts:
    when: on_success
    access: developer
    expire_in: "10 mins"
    paths:
      - output/**

This step leverages the official KubeSpray Docker image to run the Ansible playbooks against the Proxmox VMs created in the previous step. It then saves the generated kubeconfig file as a restricted-access artifact for use in subsequent pipeline stages.

Uploading the Kubeconfig to GitLab

This step uploads the generated kubeconfig file to GitLab as a CI/CD variable using the GitLab API directly.

.upload-secret-base64-encoded:
  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/alpine/curl:latest
  script:
    - |
      set -e
      DATA=$(base64 -w 0 ${DATA_FILE_PATH})
      curl -s -f --request PUT \
        --header "PRIVATE-TOKEN: ${ACCESS_TOKEN}" \
        --header "Content-Type: application/json" \
        --data "{\"variable_type\":\"file\",\"key\":\"${VAR_NAME}\",\"value\":\"$DATA\",\"hidden\":false,\"protected\":true,\"masked\":true,\"raw\":true,\"description\":\"\"}" \
        "$CI_API_V4_URL/projects/$CI_PROJECT_ID/variables/${VAR_NAME}" > /dev/null 2>&1

This way, after the IaC pipeline finishes, the kubeconfig file will be available for administrators to download and use.

Sealed Secrets Deployment

This part of the pipeline creates the Sealed Secrets controller in the Kubernetes cluster.

The prerequisite is a certificate/key pair to be used by the Sealed Secrets controller. You can find the instructions here: Bring your own certificates.

The private key needs to be stored as a GitLab CI/CD variable, while the certificate can be stored directly in the IaC repository.
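Following the linked guide, what the pipeline effectively creates is a TLS secret in the controller's namespace, labeled so the controller picks it up as its active key. A minimal sketch (the secret name is arbitrary, and kube-system is an assumption about where the controller runs):

apiVersion: v1
kind: Secret
metadata:
  name: sealed-secrets-byo-key       # arbitrary name
  namespace: kube-system             # assuming the controller runs here
  labels:
    sealedsecrets.bitnami.com/sealed-secrets-key: active  # mark this key as active
type: kubernetes.io/tls
stringData:
  tls.crt: <certificate stored in the IaC repository>
  tls.key: <private key injected from the GitLab CI/CD variable>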

FluxCD Deployment

This part of the pipeline uses OpenTofu to deploy the FluxCD Operator in the Kubernetes cluster as described in the official documentation.

This step also creates the Kubernetes secret with the GCP HMAC key for Longhorn and the secret with the registry credentials to pull container images from private registries.
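For reference, here is a hedged sketch of the FluxInstance resource the operator consumes, adapted from the upstream documentation; the Git URL, path, and version pin are placeholders for my actual values:

apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
  name: flux
  namespace: flux-system
spec:
  distribution:
    version: "2.x"            # track the latest Flux v2 minor release
    registry: ghcr.io/fluxcd
  sync:
    kind: GitRepository
    url: https://gitlab.com/<group>/d2-fleet.git  # placeholder repository
    ref: refs/heads/main
    path: clusters/production                     # placeholder path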

Longhorn Volumes Restoration Process

Unfortunately, at the time of writing, Longhorn does not support a declarative way to restore volumes from offsite backups (issue#5787).

To work around this limitation, I have created a simple Dockerized Python program that leverages the Longhorn API to restore the latest backup for defined volumes from offsite storage. You can find the repository here.

Simply define a Job in the d2-infra repository for each volume you want to restore. For example, here is the Job definition to restore the Authelia Longhorn volume:

apiVersion: batch/v1
kind: Job
metadata:
  name: authelia-volume-restore
  namespace: authelia
spec:
  template:
    spec:
      containers:
        - name: restore
          image: your-registry/longhorn-backup-restore:latest
          env:
            - name: LONGHORN_URL
              value: http://longhorn-frontend.longhorn-system.svc.cluster.local
            - name: VOLUME_HANDLE
              value: authelia-production-vol
            - name: NUMBER_OF_REPLICAS
              value: "3"
            - name: LOG_LEVEL
              value: INFO
      restartPolicy: Never
  backoffLimit: 3

Then you need to create the corresponding PV and PVC to use the restored volume in your application. These can be defined in the d2-infra repository, for example in the same file as the restore job.
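As a hedged sketch (the sizes and claim name are illustrative and must match the restored volume), the PV/PVC pair could look like this:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: authelia-production-vol
spec:
  capacity:
    storage: 2Gi                 # must match the restored volume size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeHandle: authelia-production-vol  # the restored Longhorn volume name
    volumeAttributes:
      numberOfReplicas: "3"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: authelia-data            # illustrative claim name
  namespace: authelia
spec:
  storageClassName: longhorn
  volumeName: authelia-production-vol     # bind explicitly to the PV above
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi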

You can find more information in the README of the repository.

Encrypt Longhorn Backups Client-Side

Unfortunately, at the time of writing, Longhorn does not support client-side encryption of backups natively (issue#5220).

A simple solution I found is to use rclone to encrypt the backups client-side before uploading them to the offsite backup location.

Simply declare a Deployment, a Service, some Secrets, and a one-time Job in the same namespace where Longhorn is installed.

The important bits are in the deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3-rclone-longhorn-bck
  labels:
    app.kubernetes.io/name: s3-rclone-longhorn-bck
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: s3-rclone-longhorn-bck
  template:
    metadata:
      labels:
        app.kubernetes.io/name: s3-rclone-longhorn-bck
    spec:
      containers:
        - name: rclone
          image: ghcr.io/rclone/rclone:1.71.1
          imagePullPolicy: IfNotPresent
          command:
            - "rclone"
            - "serve"
            - "s3"
            - "--no-cleanup"
            - "--auth-key"
            - "$(RC_ACCESS_KEY_ID),$(RC_ACCESS_KEY)"
            - "crypt_out_s3:"
            - "--s3-force-path-style=true"
            - "--addr=:8080"
            - "--log-level=WARNING"
          env:
            - name: RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: access_id
            - name: RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: secret
            - name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: rclone-secret
                  key: password
            # salt is used as password2
            - name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD2
              valueFrom:
                secretKeyRef:
                  name: rclone-secret
                  key: salt
            - name: RC_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_ACCESS_KEY_ID
            - name: RC_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_SECRET_ACCESS_KEY
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /root/.config/rclone
      volumes:
        - name: config
          configMap:
            name: s3-rclone-longhorn-bck-config
            items:
              - key: rclone.conf
                path: rclone.conf

and in the rclone config I used:
[out_s3]
type = s3
provider = GCS
endpoint = https://storage.googleapis.com
region = europe-west4
use_multipart_uploads = false

[crypt_out_s3]
type = crypt
remote = out_s3:<BUCKET-NAME-REDACTED>
directory_name_encryption = false
filename_encryption = off


Why We Use rclone serve s3

rclone serve s3 is the key component that makes this solution work. It implements a basic S3-compatible server that exposes any rclone backend (in our case, the encrypted crypt_out_s3 remote) as an S3 endpoint.

This is essential because Longhorn ships backups to an S3-compatible (or NFS) target and knows nothing about rclone or client-side encryption, while rclone's crypt remote encrypts everything transparently but is not an S3 endpoint on its own. Serving the crypt remote over the S3 protocol bridges the two.

The command essentially creates an S3 gateway that sits between Longhorn and the actual storage backend, handling all encryption/decryption automatically.
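To give an idea of how Longhorn is pointed at this gateway, here is a hedged sketch of the relevant Longhorn Helm chart values (the bucket name is a placeholder; the region in the URL is arbitrary for an rclone-served endpoint, but Longhorn requires the s3:// URL format):

defaultSettings:
  # virtual region; rclone serve s3 ignores it, but the URL must still parse
  backupTarget: s3://<bucket-name>@us-east-1/
  # secret containing AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and
  # AWS_ENDPOINTS=http://s3-rclone-longhorn-bck.longhorn-system:8080
  backupTargetCredentialSecret: longhorn-rclone-bck-key-secret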

Understanding the rclone crypt Options

The crypt remote configuration uses two important settings:

directory_name_encryption = false

This keeps directory names in plaintext (unencrypted). While this reduces security slightly, it has practical benefits: you can still recognize the backup directory layout when browsing the bucket, debugging and selective restores are easier, and path lengths don't grow through encryption toward provider limits.

The actual file data is still fully encrypted, so the main content security is preserved.

filename_encryption = off

With this setting, files only get a .bin extension added instead of having their filenames encrypted. This provides several advantages: filenames remain recognizable when inspecting or restoring backups, name lengths stay close to the originals, and object keys remain predictable for tooling.

Security trade-off: This setting trades some security for practicality. If you need maximum security, you could use standard encryption, which encrypts filenames completely.

Both options make the encrypted remote more manageable and reduce the risk of hitting storage provider limitations while keeping the actual file content fully encrypted.


Next, create the corresponding Service and a simple Job to initialize the backup bucket.
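The Service is a plain ClusterIP in front of the deployment; a minimal sketch (assuming the longhorn-system namespace) is:

apiVersion: v1
kind: Service
metadata:
  name: s3-rclone-longhorn-bck
  namespace: longhorn-system     # same namespace as Longhorn, per the note above
  labels:
    app.kubernetes.io/name: s3-rclone-longhorn-bck
spec:
  selector:
    app.kubernetes.io/name: s3-rclone-longhorn-bck
  ports:
    - name: http
      port: 8080
      targetPort: http
      protocol: TCP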

The important bits are in the job manifest:
apiVersion: batch/v1
kind: Job
metadata:
  name: s3-rclone-longhorn-bck-init
  labels:
    app.kubernetes.io/name: s3-rclone-longhorn-bck
    app.kubernetes.io/component: init
spec:
  backoffLimit: 3
  template:
    metadata:
      labels:
        app.kubernetes.io/name: s3-rclone-longhorn-bck
        app.kubernetes.io/component: init
    spec:
      restartPolicy: OnFailure
      containers:
        - name: rclone-init
          image: ghcr.io/rclone/rclone:1.71.1
          imagePullPolicy: IfNotPresent
          command:
            - "/bin/sh"
            - "-c"
            - |
              set -e

              echo "Waiting for rclone S3 service to be ready..."
              # limit to 10 minutes
              COUNTER=0
              until nc -z s3-rclone-longhorn-bck 8080 2>/dev/null; do
                echo "Waiting for service..."
                sleep 10
                COUNTER=`expr $COUNTER + 1`
                if [ $COUNTER -ge 60 ]; then
                  echo "Timeout waiting for service after 10 minutes"
                  exit 1
                fi
              done
              echo "Service is ready!"
              COUNTER=0

              echo "Using bucket name: $${BUCKET_NAME}"

              # Set up the remote for the local rclone S3 service (encrypted)
              export RCLONE_CONFIG_LOCAL_S3_TYPE=s3
              export RCLONE_CONFIG_LOCAL_S3_PROVIDER=Other
              export RCLONE_CONFIG_LOCAL_S3_ENDPOINT=http://s3-rclone-longhorn-bck:8080
              export RCLONE_CONFIG_LOCAL_S3_ACCESS_KEY_ID="$${RC_ACCESS_KEY_ID}"
              export RCLONE_CONFIG_LOCAL_S3_SECRET_ACCESS_KEY="$${RC_ACCESS_KEY}"
              export RCLONE_CONFIG_LOCAL_S3_FORCE_PATH_STYLE=true

              export RCLONE_CONFIG_OUT_S3_USE_MULTIPART_UPLOADS=false
              export RCLONE_CONFIG_OUT_S3_NO_CHECK_BUCKET=true
              export RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID="$${BACKEND_S3_ACCESS_KEY_ID}"
              export RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY="$${BACKEND_S3_SECRET_ACCESS_KEY}"

              # Check if INFO.txt already exists in the backend (unencrypted)
              echo "Checking if INFO.txt already exists in backend..."
              if rclone lsf "out_s3:$${BUCKET_NAME}/INFO.txt" 2>/dev/null | grep -q "INFO.txt"; then
                echo "INFO.txt already exists in backend - initialization already complete"
                echo "Bucket is ready for Longhorn backups"
                exit 0
              fi

              # Check if bucket exists (via encrypted endpoint)
              echo "Checking if bucket exists..."
              if rclone lsd local_s3: 2>/dev/null | grep -q "$${BUCKET_NAME}"; then
                echo "Bucket '$${BUCKET_NAME}' already exists"

                # Check if bucket contains any encrypted data
                echo "Checking bucket contents (encrypted view)..."
                FILE_COUNT=$$(rclone ls "local_s3:$${BUCKET_NAME}/" 2>/dev/null | wc -l)

                if [ "$$FILE_COUNT" -gt 0 ]; then
                  echo "Bucket contains $${FILE_COUNT} encrypted file(s)"
                  echo "Listing existing files:"
                  rclone ls "local_s3:$${BUCKET_NAME}/" --max-depth 1
                fi
              else
                echo "Creating new bucket: $${BUCKET_NAME}"
                rclone mkdir "local_s3:$${BUCKET_NAME}" --log-level=INFO
                echo "Bucket created successfully"
              fi

              echo "Listing all buckets..."
              rclone lsd local_s3: --log-level=INFO

              echo "Generating INFO.txt file..."
              TIMESTAMP=$$(date -u +"%Y-%m-%d %H:%M:%S UTC")
              HOSTNAME=$$(hostname)

              cat > /tmp/INFO.txt <<EOF
              ============================================
              Longhorn Backup Bucket Information
              ============================================

              Bucket Name: $${BUCKET_NAME}
              Created: $${TIMESTAMP}
              Created By: $${HOSTNAME}

              Configuration:
              - Service: s3-rclone-longhorn-bck
              - Endpoint: http://s3-rclone-longhorn-bck:8080
              - Encryption: Enabled (rclone crypt)
              - Backend: Google Cloud Storage (GCS)
              - Region: europe-west4

              Rclone Configuration:
              - Remote: crypt_out_s3
              - Base Remote: out_s3
              - Encryption: Standard encryption with password and salt
              - Directory Name Encryption: Disabled
              - Filename Encryption: Disabled

              Environment:
              - Kubernetes Namespace: $${K8S_NAMESPACE}
              - Init Job: s3-rclone-longhorn-bck-init

              Notes:
              - All data stored in this bucket is encrypted using rclone crypt
              - Access requires proper HMAC credentials (stored in secrets)
              - Encryption password and salt are required for decryption
              - INFO.txt is stored UNENCRYPTED for easy access

              Secrets Used:
              - longhorn-backup-hmac-key: GCS HMAC credentials
              - rclone-secret: Encryption password and salt
              - longhorn-rclone-bck-key-secret: API access credentials

              ============================================
              EOF

              echo "Uploading INFO.txt to backend GCS (UNENCRYPTED)..."
              rclone copy /tmp/INFO.txt "out_s3:$${BUCKET_NAME}/" --log-level=INFO --s3-no-check-bucket

              echo "Verifying upload..."
              echo "Files in encrypted view:"
              rclone ls "local_s3:$${BUCKET_NAME}/" --log-level=INFO
              echo ""
              echo "Files in unencrypted backend:"
              rclone ls "out_s3:$${BUCKET_NAME}/" --max-depth 1 --log-level=INFO

              echo "Initialization complete!"
              echo "Bucket '$${BUCKET_NAME}' is ready for Longhorn backups"
              echo "INFO.txt is available unencrypted in the backend storage"
          env:
            - name: BUCKET_NAME
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_TYPE
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_PROVIDER
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_ENDPOINT
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_REGION
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_USE_MULTIPART_UPLOADS
              value: ~
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: RC_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_ACCESS_KEY_ID
            - name: RC_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_SECRET_ACCESS_KEY
            - name: BACKEND_S3_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: access_id
            - name: BACKEND_S3_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: secret
Two details of this Job deserve a closer look: the INFO.txt file and the way the bucket is created.

Why We Generate the INFO.txt File

The INFO.txt file serves several important purposes:

  1. Documentation: It provides human-readable information about the bucket configuration, encryption setup, and required credentials. This is invaluable when you need to restore or troubleshoot backups months or years later.

  2. Accessible without encryption: Critically, INFO.txt is stored directly in the backend (out_s3), bypassing the encryption layer. This means it can be read without needing the encryption keys, making it a self-documenting backup location.

  3. Verification: By uploading it to the backend, we verify that:

    • The GCS backend connection works correctly
    • Credentials are properly configured
    • The bucket is accessible and writable
  4. Recovery aid: If you ever lose your rclone configuration but still have the encryption passwords in a password manager, the INFO.txt file tells you exactly how the encryption was configured. This makes it possible to reconstruct the setup and recover your backups.

  5. Idempotency check: The script checks for INFO.txt existence to determine if initialization has already been completed, preventing duplicate initialization runs.

Why We Create the Bucket with rclone mkdir "local_s3:${BUCKET_NAME}"

Creating the bucket through the local_s3: remote (the encrypted S3 endpoint) rather than directly on the backend has several advantages:

  1. End-to-end testing: This verifies the entire encryption pipeline is working:

    • The rclone serve s3 service is running and accessible
    • Authentication is correctly configured
    • The crypt layer is properly set up
    • The underlying GCS backend is reachable
  2. S3 API validation: It ensures that bucket creation operations work through the S3 API layer, which is exactly how Longhorn will interact with the system. If rclone mkdir succeeds through local_s3:, we know Longhorn's S3 operations will also work.

  3. Consistent access path: By creating the bucket the same way Longhorn will access it (through the S3 API), we ensure there are no surprises or incompatibilities when Longhorn starts using the bucket.

  4. Automatic bucket initialization: On GCS, when you create a "bucket" through rclone's S3 interface, it actually creates a folder/prefix in the specified GCS bucket (configured as <BUCKET-NAME-REDACTED> in the config). This happens automatically through the crypt layer.

  5. Proper permissions verification: This confirms that the service account credentials (RC_ACCESS_KEY_ID/RC_ACCESS_KEY) have the necessary permissions to create buckets through the S3 interface.

RTO and RPO

With this setup, you can easily modify the RPO by adjusting the Longhorn backup schedule to fit your needs.
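For example, here is a hedged sketch of a Longhorn RecurringJob that takes a nightly backup and keeps a week of history (the name, schedule, and retention are placeholders to tune against your desired RPO):

apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup           # illustrative name
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"              # every night at 02:00 -> worst-case RPO of ~24h
  task: backup
  groups:
    - default                    # applies to all volumes in the default group
  retain: 7                      # keep the last 7 backups
  concurrency: 2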

For RTO, it mainly depends on the pipeline execution time. The main time-consuming components are the Kubernetes cluster bootstrapping using KubeSpray and the FluxCD reconciliation process.

In my case, the total time to provision the entire infrastructure from scratch and reconcile the cluster state is around 1 hour, which works well for my use case and would fit the disaster-recovery targets of many organizations.

To lower the RTO further, you could customize FluxCD timeouts, retry periods, and parallelism to achieve more aggressive reconciliation.

However, I'm planning to set up a disaster recovery site for even better redundancy. More on this in the "Next Steps" section.

Conclusion

This journey from a homelab data loss incident to a production-grade IaC setup taught me that:

  1. A backup is only as good as its last tested restore.
  2. Treating infrastructure as code turns a rebuild from a daunting task into a pipeline run.
  3. Separating infrastructure provisioning (IaC) from cluster state (GitOps) keeps both layers simple and testable.

While the initial setup took about a month (including research, testing, and iteration), I can now rebuild my entire infrastructure in 1 hour. More importantly, I've eliminated the anxiety of "did I back that up?": everything is code, versioned, and reproducible.

The total monthly cost (~€10 for GCS storage) is minimal compared to the value of reliable, reproducible infrastructure. If you're running a homelab, I encourage you to treat it like productionβ€”your future self will thank you.

Limitations

As you might have noticed, some components are not yet part of the automated rebuild process.

Infrastructure Assumptions:

  1. The Proxmox hosts themselves are installed and configured manually.
  2. The physical network, DNS, and the NAS backing the nfs-nas-1 datastore are assumed to already exist.
  3. The GitLab Runner LXC container must be bootstrapped once before the IaC pipeline can run.

While this is acceptable for a homelab, an enterprise-grade setup should include these components in the IaC pipeline as well. For on-premises environments, this presents additional challenges: automating bare-metal hypervisor installation (for example via PXE boot) and physical network configuration is considerably harder than provisioning VMs.

These omissions mean a true "datacenter destroyed" scenario still requires some manual intervention. However, for the more common scenarios (VM corruption, cluster misconfiguration, accidental deletion), the current setup provides comprehensive protection.

Next Steps

While Google Cloud is a great option and my monthly cost is only ~€10, in the future I would like to explore Hetzner. The pricing is really competitive, and they offer both an S3-compatible object storage service and an OpenTofu provider.

Another area I would like to explore is leveraging Longhorn Disaster Recovery Volumes in conjunction with the FluxCD d2 architecture. This way, I might be able to create a recovery cluster in another location and have a more robust disaster recovery plan. I think that by using Hetzner with only the strictly necessary services, a single server might be sufficient to host the recovery cluster cheaply.

With this setup, it might be possible to achieve a very low RTO and RPO.

Finally, I hope that in the future Longhorn will natively support client-side encryption and a declarative way to restore volumes from offsite backups so that I can simplify the current setup.