One-click homelab: integrating Gitlab, Proxmox and K8s with GitOps Principles

One unlucky day, I destroyed my homelab while trying to upgrade various components. I had a backup of the virtual machines running my Kubernetes cluster, but unfortunately, I had forgotten to enable backups for the secondary disk attached to each VM. This secondary disk stored all the persistent data of my cluster, including the Longhorn volumes. I assumed the backups were fine, but when I restored the VMs, I realized the secondary disks were missing. As a result, I lost all my persistent data and had to start from scratch.

My homelab was three to four years old, and I had accumulated a lot of configurations, customizations, and data over time. I had started it while first approaching the DevOps world, so it didn't really follow best practices. Rebuilding everything from scratch was a daunting task, and I realized that I needed a better way to manage my homelab infrastructure.

It might sound incredible, but losing data due to a lack of proper backup strategies, while perhaps forgivable in a personal homelab, is a common issue in the industry as well. Some notable examples include:

Given the experience I gathered over the years, I decided to rebuild my homelab using Infrastructure as Code (IaC) principles. This time, I wanted to ensure that I could easily recreate my entire setup with just one click, without going through the tedious process of manual configuration. So my objective was not just to have a backup of my infrastructure, but to achieve a recovery time objective (RTO) measured in minutes.

This way, I can also brag about having a better infrastructure than most companies out there! πŸ˜„

What to Expect from This Blog Post

In this blog post you will find my personal solution for building a homelab following IaC principles. In particular, I will show you how I've successfully integrated open source tools with minimal cost and custom code to achieve a one-click deployment of my homelab infrastructure.

The solution has the following properties:

The tech stack of the solution includes:

  1. GitLab SaaS: I use it for simplicity, but you can also use your own GitLab instance at your discretion.
  2. Proxmox: I use it in my homelab, but you can use any hypervisor you prefer; the only requirement is that an OpenTofu provider exists for it.
  3. OpenTofu: Necessary to create the various components of the solution. OpenTofu is the open-source fork of Terraform, which I use for infrastructure provisioning.
  4. Ubuntu Cloud Images: I use Ubuntu Cloud Images as the base operating system for the VMs in my Proxmox cluster. These images are optimized for cloud environments and provide an automated way to provision VMs with Cloud-Init.
  5. KubeSpray: I use KubeSpray to create the Kubernetes cluster on top of Proxmox VMs. KubeSpray is a popular open-source project that provides a set of Ansible playbooks for deploying and managing Kubernetes clusters.
  6. FluxCD: I use FluxCD for GitOps management of the Kubernetes cluster. FluxCD is a popular open-source project that enables continuous delivery and GitOps for Kubernetes.
  7. Sealed Secrets: I use Sealed Secrets to store all the credentials directly in git. While this is more than enough for a homelab, OpenBao might be a better fit for an enterprise.
  8. Longhorn: I use Longhorn as the storage solution for the Kubernetes cluster. Longhorn is a popular open-source project that provides a distributed block storage system for Kubernetes.
  9. GCP Cloud Storage: In my solution I use GCP Cloud Storage for off-site backups. Please note that the solution I will provide ensures client-side encryption of data (so even Google will not be able to decipher it). Alternatively, you can use any NFS server or a self-hosted S3 object store such as Garage.

Recovery Time Breakdown

Here's what happens during a full disaster recovery (1 hour RTO):

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Full Infrastructure Recovery Timeline                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚ Parallel:                                                       β”‚
β”‚  GCP Bucket β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ                                            β”‚
β”‚  (4-5 min, ~8% - runs in parallel)                              β”‚
β”‚                                                                 β”‚
β”‚ Main Flow:                                                      β”‚
β”‚  0min                                                     60min β”‚
β”‚  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚
β”‚  β”‚         β”‚                        β”‚       β”‚              β”‚    β”‚
β”‚  β–Ό         β–Ό                        β–Ό       β–Ό              β–Ό    β”‚
β”‚  VM       K8s                    Sealed  Flux          Flux     β”‚
β”‚  Prov.    Bootstrap              Secrets Deploy        Reconcileβ”‚
β”‚                                                                 β”‚
β”‚  9-10     23                     1-2   3-4              20-25   β”‚
β”‚  min      min                    min   min              min     β”‚
β”‚  (~16%)   (~38%)                 (~2%) (~6%)            (~38%)  β”‚
β”‚                                                                 β”‚
β”‚  Total: ~60 minutes                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Phase Details:
β”œβ”€ GCP Bucket (4-5 min): Create/verify backup storage [RUNS IN PARALLEL]
β”œβ”€ VM Provisioning (9-10 min): Download Ubuntu Cloud Images, create VMs with Cloud-Init
β”œβ”€ K8s Bootstrap (23 min): KubeSpray cluster deployment (long phase - ~38% of total time)
β”œβ”€ Sealed Secrets (1-2 min): Deploy secret management controller
β”œβ”€ FluxCD Deploy (3-4 min): Install GitOps operator and sync repositories
└─ FluxCD Reconcile (20-25 min): Complete reconciliation of cluster state from GitOps repos (long phase - ~38% of total time)

Note: The Kubernetes bootstrapping phase accounts for approximately 38% of the total RTO. The GCP bucket creation runs in parallel with VM provisioning, so it doesn't add to the overall recovery time. It might also be possible to shorten the FluxCD reconciliation time by tweaking its configuration for more aggressive syncs.
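As a sanity check, the percentages in the diagram can be reproduced from the midpoints of the phase duration ranges shown above (a quick sketch using my own estimates; the GCP bucket step is excluded because it runs in parallel):

```python
# Midpoint duration (minutes) of each sequential recovery phase,
# taken from the timeline above.
phases = {
    "VM provisioning": 9.5,
    "K8s bootstrap": 23.0,
    "Sealed Secrets": 1.5,
    "FluxCD deploy": 3.5,
    "FluxCD reconcile": 22.5,
}

total = sum(phases.values())
print(f"Total RTO: ~{total:.0f} minutes")
for name, minutes in phases.items():
    print(f"  {name}: {minutes:g} min ({minutes / total:.1%})")
```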

Here's a time-lapse of the complete infrastructure deployment from scratch to a running cluster (~60 minutes compressed):

Repositories Structure

All the infrastructure code is organized in five repositories:

  1. gitlab-runner repository: contains the OpenTofu code to create the GitLab Runner on Proxmox as an LXC container.
  2. IaC repository: contains all the code necessary to bootstrap the Proxmox VMs and the Kubernetes cluster using KubeSpray.
  3. d2-fleet repository: defines the desired state of the Kubernetes clusters and tenants in the fleet.
  4. d2-infra repository: defines the desired state of the cluster add-ons and the monitoring stack.
  5. d2-apps repository: defines the desired state of the applications deployed across environments.

We will examine the IaC repository in detail; the d2-* repositories simply apply the d2-reference-architecture provided by the FluxCD team, which, I must say, is very well thought out and implemented. πŸ‘

The GitLab Runner repository is also quite straightforward, as it only contains the OpenTofu code to create the LXC container and register the runner with GitLab, so I will not cover it here.

How They Work Together

The deployment flow follows a clear progression:

  1. gitlab-runner β†’ Bootstraps the CI/CD infrastructure needed to run the automated pipelines
  2. IaC β†’ Handles the foundational layer: VMs, backup storage, Kubernetes cluster, and essential components (Sealed Secrets, FluxCD operator)
  3. d2-fleet, d2-infra, d2-apps β†’ Once the IaC pipeline completes, FluxCD takes over and continuously reconciles the cluster state based on these GitOps repositories

In essence, the IaC repository gets you from an empty Hypervisor to a GitOps-ready cluster, then the d2-repositories manage everything from that point forward. This separation means the IaC pipeline only needs to run for infrastructure changes and periodic disaster recovery tests, while the d2 repositories handle all day-to-day operations through FluxCD's automatic reconciliation.

IaC Pipeline

The IaC pipeline provisions the homelab infrastructure through these high-level steps:

  1. Creates the GCP Cloud Storage bucket for offsite backups.
  2. Creates the Proxmox VMs using OpenTofu.
  3. Bootstraps the Kubernetes cluster using KubeSpray.
  4. Uploads the generated kubeconfig file to GitLab as an artifact with restricted access.
  5. Triggers the Sealed Secrets deployment sub-pipeline.
  6. Triggers the FluxCD deployment sub-pipeline.
  7. Longhorn will be deployed by FluxCD as part of the d2-infra repository.
  8. Restore Jobs defined in the d2-infra repository will restore Longhorn volumes from the GCP Cloud Storage bucket.

GCP Cloud Storage Bucket

The only prerequisites to create this part of the pipeline are:

  1. Create an OpenTofu service account with the necessary permissions to create and manage GCP Cloud Storage buckets.
  2. Create a GitLab Personal Access Token with the necessary permissions to modify CI/CD variables in the GitLab project. Note: I had to use a PAT because I'm on the free tier of GitLab SaaS, which does not yet support Project Access Tokens. If you have a paid plan, you should use a Project Access Token instead.

The process the pipeline follows is:

  1. If not already present, it creates the GCP Cloud Storage bucket using the google_storage_bucket resource.

    resource "google_storage_bucket" "longhorn_backup_bucket" {
        name                        = var.gcp_backup_longhorn_bucket_name
        location                    = var.gcp_backup_region
        storage_class               = "NEARLINE"
        uniform_bucket_level_access = true
        public_access_prevention    = "enforced"
        hierarchical_namespace {
            enabled = true
        }
    }
  2. It creates a longhorn_backup_service_account.

  3. Assigns the roles/storage.objectAdmin role on the created bucket to the newly created service account.

  4. Creates/Syncs a HMAC key for the service account.

  5. The GitLab pipeline saves the generated HMAC key as a CI/CD variable in the GitLab project. In my case, it uses the Access Token created in the prerequisites step to do so.

This HMAC key is later injected as a secret into the Kubernetes cluster, allowing Longhorn to connect to the GCP Cloud Storage bucket.
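To make the shape of that secret concrete, here's a sketch of how the HMAC pair maps onto the longhorn-backup-hmac-key Secret consumed later by the backup setup (rendered in Python purely for illustration; in the real pipeline OpenTofu creates it during the FluxCD deployment step, and the namespace here is my assumption):

```python
import base64
import json

def hmac_secret_manifest(access_id: str, secret: str,
                         namespace: str = "longhorn-system") -> dict:
    """Render a Secret holding the GCS HMAC credentials; values are
    base64-encoded as required by the Kubernetes Secret `data` field."""
    b64 = lambda s: base64.b64encode(s.encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": "longhorn-backup-hmac-key", "namespace": namespace},
        "data": {"access_id": b64(access_id), "secret": b64(secret)},
    }

# Placeholder credentials, not a real HMAC key pair.
manifest = hmac_secret_manifest("GOOG1EXAMPLEKEYID", "example-hmac-secret")
print(json.dumps(manifest, indent=2))
```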

Proxmox VMs

The OpenTofu provider used is bpg/proxmox.

The prerequisites to create this part of the pipeline are:

  1. Create a Proxmox API token with the necessary permissions to create and manage VMs.
  2. Create an SSH key pair and store the private key as a GitLab CI/CD variable. This key will be used by the on-premise GitLab Runner to connect to the Proxmox VMs.

The pipeline follows this process:

  1. OpenTofu instructs Proxmox to download the Ubuntu Cloud Image using the proxmox_virtual_environment_download_file resource.

    resource "proxmox_virtual_environment_download_file" "ubuntu_24_noble_qcow2_img" {
        content_type       = "iso"
        datastore_id       = "nfs-nas-1"
        node_name          = "pve2"
        url                = var.proxmox_k8s_node_image_url
        overwrite          = true
        file_name          = var.proxmox_k8s_node_image_name
        checksum           = var.proxmox_k8s_node_image_checksum
        checksum_algorithm = var.proxmox_k8s_node_image_checksum_algorithm
    }
  2. OpenTofu creates the Cloud-Init configuration file for the worker and master nodes using the proxmox_virtual_environment_file resource. The important configurations to add to the Cloud-Init file are:

    • SSH Keys: Inject both your personal public key and the GitLab Runner's public key
    • Package Management: Install necessary packages (like qemu-guest-agent)
    • Longhorn Prerequisites: Configure according to Longhorn documentation
    Code snippet (worker nodes)
    resource "proxmox_virtual_environment_file" "ubuntu_cloud_init_worker" {
        content_type = "snippets"
        datastore_id = "nfs-nas-1"
        node_name    = "pve2"
        overwrite    = true
        source_raw {
            data      = <<EOF
    #cloud-config
    users:
      - default
      - name: ${var.proxmox_k8s_node_username}
        groups:
          - sudo
        shell: /bin/bash
        ssh_authorized_keys:
    %{~for key in var.proxmox_k8s_node_ssh_keys}
          - ${key}
    %{~endfor}
        sudo: ALL=(ALL) NOPASSWD:ALL

    package_update: true
    package_upgrade: true
    packages:
      - qemu-guest-agent
      - nfs-common

    # Disk partitioning setup
    disk_setup:
      /dev/sdb:
        table_type: gpt
        layout: true
        overwrite: false

    # Filesystem setup
    fs_setup:
      - label: data
        filesystem: ext4
        device: /dev/sdb1
        partition: auto
        overwrite: false

    # Mount configuration
    mounts:
      - [/dev/sdb1, /mnt/data, ext4, "defaults,nofail", "0", "2"]

    write_files:
      - path: /etc/modules-load.d/dm_crypt.conf
        content: |
          dm_crypt
        owner: root:root
        permissions: '0644'

    runcmd:
      - systemctl enable qemu-guest-agent
      - systemctl start qemu-guest-agent
      - systemctl stop multipathd.socket
      - systemctl stop multipathd
      - systemctl disable multipathd.socket
      - systemctl disable multipathd
      - systemctl mask multipathd
      - systemctl mask multipathd.socket
      - modprobe dm_crypt
      - systemctl enable iscsid
      - systemctl start iscsid
      - echo "done" > /tmp/cloud-config.done
    EOF
            file_name = "ubuntu.cloud-config-worker.yaml"
        }
    }
  3. OpenTofu generates the output kubespray_inventory using the templatefile function with inventory.tpl. You can customize the inventory.tpl to fit your needs following the KubeSpray documentation. Here's my version of the inventory.tpl file.

    inventory.tpl
    {
        "all": {
            "vars": {
            "ansible_user": "${ansible_user}",
            "ansible_become": true,
            "calico_cni_name": "k8s-pod-network",
            "nat_outgoing": true,
            "nat_outgoing_ipv6": true,
            "calico_pool_blocksize": 26,
            "calico_network_backend": "vxlan",
            "calico_vxlan_mode": "CrossSubnet",
            "kube_proxy_strict_arp": true,
            "kube_encrypt_secret_data": true,
            "kubeconfig_localhost": true,
            "artifacts_dir": "/output",
            "etcd_deployment_type": "host",
            "etcd_metrics_port": 2381,
            "etcd_listen_metrics_urls": "http://0.0.0.0:2381",
            "etcd_metrics_service_labels": {
                "k8s-app": "etcd",
                "app.kubernetes.io/managed-by": "kubespray",
                "app": "kube-prometheus-stack-kube-etcd",
                "release": "kube-prometheus-stack"
            },
            "kube_proxy_metrics_bind_address": "0.0.0.0:10249"
            },
            "children": {
            "kube_control_plane": {
                "hosts": {
        %{ for name in master_nodes ~}
                "${name}",
        %{ endfor ~}
                }
            },
            "etcd": {
                "hosts": {
        %{ for name in master_nodes ~}
                "${name}",
        %{ endfor ~}
                }
            },
            "kube_node": {
                "hosts": {
        %{ for name in worker_nodes ~}
                "${name}",
        %{ endfor ~}
                }
            },
            "k8s_cluster": {
                "children": [
                "kube_control_plane",
                "kube_node"
                ]
            }
            }
        }
    }

    IMHO, the most important thing here is to limit Kubespray to install only the necessary components to have a minimal Kubernetes cluster ready for FluxCD deployment.

  4. Saves the generated inventory file as a GitLab artifact for use in the next stage.

Kubernetes Cluster Bootstrapping

This part of the pipeline bootstraps the Kubernetes cluster using KubeSpray.

I think KubeSpray is the best solution for creating a production-ready Kubernetes cluster in an on-premise environment, but I also think it should be limited to only the necessary components for a working cluster. KubeSpray, which is based on Ansible, provides many options to install various components like CNI, Ingress controllers, and monitoring stacks. However, in my opinion, these components should be installed using a more mature GitOps tool like FluxCD.

The prerequisites for this part of the pipeline are:

The pipeline follows these steps:

apply-kubespray-production-home:
  stage: deploy
  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/docker:cli
  tags:
    - mgmt-zone
    - self-hosted
  needs:
    - job: opentofu-apply-production-home
      artifacts: true
  services:
    - *dind
  before_script:
    - apk add --no-interactive jq
  script:
    - mkdir -p $CI_PROJECT_DIR/inventory
    - mkdir -p output
    - chmod 600 $RUNNER_SSH_PRIVATE_KEY_PATH
    - jq -r '.kubespray_inventory.value' home-tf-output.json > $CI_PROJECT_DIR/inventory/inventory.json
    - docker run --mount type=bind,source=$CI_PROJECT_DIR/inventory,dst=/inventory --mount type=bind,source=$RUNNER_SSH_PRIVATE_KEY_PATH,dst=/ssh/id --mount type=bind,source=$CI_PROJECT_DIR/output,dst=/output --rm quay.io/kubespray/kubespray:$KUBESPRAY_VERSION ansible-playbook -i /inventory/inventory.json --private-key /ssh/id  cluster.yml
  environment:
    name: production/home
  artifacts:
    when: on_success
    access: developer
    expire_in: "10 mins"
    paths:
      - output/**

This step leverages the official KubeSpray Docker image to run the Ansible playbooks against the Proxmox VMs created in the previous step. It then saves the generated kubeconfig file as a restricted-access artifact for use in subsequent pipeline stages.

Uploading Kubeconfig to GitLab

This step uploads the generated kubeconfig file to GitLab as a CI/CD variable using the GitLab API directly.

.upload-secret-base64-encoded:
  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/alpine/curl:latest
  script:
    - |
      set -e
      DATA=$(base64 -w 0 ${DATA_FILE_PATH})
      curl -s -f --request PUT \
        --header "PRIVATE-TOKEN: ${ACCESS_TOKEN}" \
        --header "Content-Type: application/json" \
        --data "{\"variable_type\":\"file\",\"key\":\"${VAR_NAME}\",\"value\":\"$DATA\",\"hidden\":false,\"protected\":true,\"masked\":true,\"raw\":true,\"description\":\"\"}" \
        "$CI_API_V4_URL/projects/$CI_PROJECT_ID/variables/${VAR_NAME}" > /dev/null 2>&1

This way, after the IaC pipeline finishes, the kubeconfig file will be available for administrators to download and use.
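Since the variable stores the output of `base64 -w 0`, turning it back into a usable kubeconfig is just the reverse transformation. A minimal sketch of the decode side (the payload here is a stand-in, not a real kubeconfig):

```python
import base64

def decode_ci_file_variable(value: str) -> bytes:
    """Reverse the `base64 -w 0` encoding applied by the upload job."""
    return base64.b64decode(value)

# Round-trip example with a placeholder kubeconfig payload.
original = b"apiVersion: v1\nkind: Config\nclusters: []\n"
encoded = base64.b64encode(original).decode()
assert decode_ci_file_variable(encoded) == original
```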

Sealed Secrets Deployment

This part of the pipeline creates the Sealed Secrets controller in the Kubernetes cluster.

The prerequisite is a certificate/key pair to be used by the Sealed Secrets controller. You can find the instructions here: Bring your own certificates.

The private key needs to be stored as a GitLab CI/CD variable, while the certificate can be stored directly in the IaC repository.

FluxCD Deployment

This part of the pipeline uses OpenTofu to deploy the FluxCD Operator in the Kubernetes cluster as described in the official documentation.

This step also creates the Kubernetes secret with the GCP HMAC key for Longhorn and the secret with the registry credentials to pull container images from private registries.

Longhorn Volumes Restoration Process

Unfortunately, at the time of writing, Longhorn does not support a declarative way to restore volumes from offsite backups (issue #5787).

To work around this limitation, I have created a simple Dockerized Python program that leverages the Longhorn API to restore the latest backup for defined volumes from offsite storage. You can find the repository here.

Simply define a Job in the d2-infra repository for each volume you want to restore. For example, here is the Job definition to restore the Authelia Longhorn volume:

apiVersion: batch/v1
kind: Job
metadata:
  name: authelia-volume-restore
  namespace: authelia
spec:
  template:
    spec:
      containers:
        - name: restore
          image: your-registry/longhorn-backup-restore:latest
          env:
            - name: LONGHORN_URL
              value: http://longhorn-frontend.longhorn-system.svc.cluster.local
            - name: VOLUME_HANDLE
              value: authelia-production-vol
            - name: NUMBER_OF_REPLICAS
              value: "3"
            - name: LOG_LEVEL
              value: INFO
      restartPolicy: Never
  backoffLimit: 3

Then you need to create the corresponding PV and PVC to use the restored volume in your application. These can be defined in the d2-infra repository, for example in the same file as the restore job.
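For reference, the PV must point its CSI volumeHandle at the restored Longhorn volume name, i.e. the same value passed to the restore Job as VOLUME_HANDLE. A sketch of rendering such a PV (the PV name, size, and storage class here are placeholders):

```python
import json

def longhorn_pv(name: str, volume_handle: str, size: str = "2Gi") -> dict:
    """Static PersistentVolume bound to an existing (restored) Longhorn
    volume through the driver.longhorn.io CSI driver."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": "longhorn",
            "csi": {
                "driver": "driver.longhorn.io",
                "fsType": "ext4",
                # Must match the restored volume's name (VOLUME_HANDLE above).
                "volumeHandle": volume_handle,
            },
        },
    }

pv = longhorn_pv("authelia-production-pv", "authelia-production-vol")
print(json.dumps(pv, indent=2))
```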

You can find more information in the README of the repository.

Encrypt Longhorn Backups Client-Side

Unfortunately, at the time of writing, Longhorn does not support client-side encryption of backups natively (issue #5220).

A simple solution I found is to use rclone to encrypt the backups client-side before uploading them to the offsite backup location.

Simply declare a Deployment, a Service, some Secrets, and a one-time Job in the same namespace where Longhorn is installed.

The important bits in the

deployment manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3-rclone-longhorn-bck
  labels:
    app.kubernetes.io/name: s3-rclone-longhorn-bck
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: s3-rclone-longhorn-bck
  template:
    metadata:
      labels:
        app.kubernetes.io/name: s3-rclone-longhorn-bck
    spec:
      containers:
        - name: rclone
          image: ghcr.io/rclone/rclone:1.71.1
          imagePullPolicy: IfNotPresent
          command:
            - "rclone"
            - "serve"
            - "s3"
            - "--no-cleanup"
            - "--auth-key"
            - "$(RC_ACCESS_KEY_ID),$(RC_ACCESS_KEY)"
            - "crypt_out_s3:"
            - "--s3-force-path-style=true"
            - "--addr=:8080"
            - "--log-level=WARNING"
          env:
            - name: RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: access_id
            - name: RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: secret
            - name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: rclone-secret
                  key: password
            # salt is used as password2
            - name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD2
              valueFrom:
                secretKeyRef:
                  name: rclone-secret
                  key: salt
            - name: RC_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_ACCESS_KEY_ID
            - name: RC_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_SECRET_ACCESS_KEY
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /root/.config/rclone
      volumes:
        - name: config
          configMap:
            name: s3-rclone-longhorn-bck-config
            items:
              - key: rclone.conf
                path: rclone.conf

and in the

rclone config
[out_s3]
type = s3
provider = GCS
endpoint = https://storage.googleapis.com
region = europe-west4
use_multipart_uploads = false

[crypt_out_s3]
type = crypt
remote = out_s3:<BUCKET-NAME-REDACTED>
directory_name_encryption = false
filename_encryption = off

I used are explained below.

Why We Use rclone serve s3

rclone serve s3 is the key component that makes this solution work. It implements a basic S3-compatible server that exposes any rclone backend (in our case, the encrypted crypt_out_s3 remote) as an S3 endpoint.

This is essential because:

The command essentially creates an S3 gateway that sits between Longhorn and the actual storage backend, handling all encryption/decryption automatically.
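In practice this means Longhorn's backup target points at the in-cluster rclone Service rather than at GCS itself. If I recall the Longhorn conventions correctly, the target is an s3:// URL in bucket@region form, while the endpoint travels in the credential secret as AWS_ENDPOINTS; a sketch of assembling these values (the bucket name is a placeholder, the Service name comes from the manifests above):

```python
def longhorn_backup_target(bucket: str, region: str = "us-east-1") -> str:
    """Longhorn expects its S3 backup target in s3://<bucket>@<region>/ form."""
    return f"s3://{bucket}@{region}/"

# Point Longhorn at the rclone gateway Service instead of GCS directly;
# the endpoint goes into the backup credential secret (AWS_ENDPOINTS key).
target = longhorn_backup_target("longhorn-backups")
endpoint = "http://s3-rclone-longhorn-bck:8080"
print(target, endpoint)
```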

Understanding the rclone crypt Options

The crypt remote configuration uses two important settings:

directory_name_encryption = false

This keeps directory names in plaintext (unencrypted). While this reduces security slightly, it has practical benefits:

The actual file data is still fully encrypted, so the main content security is preserved.

filename_encryption = off

With this setting, files only get a .bin extension added instead of having their filenames encrypted. This provides several advantages:

Security trade-off: This setting trades some security for practicality. If you need maximum security, you could use standard encryption, which encrypts filenames completely.

Both options make the encrypted remote more manageable and reduce the risk of hitting storage provider limitations while keeping the actual file content fully encrypted.


Next, create the corresponding Service and a simple Job to initialize the backup bucket.

The important bits in the

job manifest
apiVersion: batch/v1
kind: Job
metadata:
  name: s3-rclone-longhorn-bck-init
  labels:
    app.kubernetes.io/name: s3-rclone-longhorn-bck
    app.kubernetes.io/component: init
spec:
  backoffLimit: 3
  template:
    metadata:
      labels:
        app.kubernetes.io/name: s3-rclone-longhorn-bck
        app.kubernetes.io/component: init
    spec:
      restartPolicy: OnFailure
      containers:
        - name: rclone-init
          image: ghcr.io/rclone/rclone:1.71.1
          imagePullPolicy: IfNotPresent
          command:
            - "/bin/sh"
            - "-c"
            - |
              set -e

              echo "Waiting for rclone S3 service to be ready..."
              # limit to 10 minutes
              COUNTER=0
              until nc -z s3-rclone-longhorn-bck 8080 2>/dev/null; do
                echo "Waiting for service..."
                sleep 10
                COUNTER=`expr $COUNTER + 1`
                if [ $COUNTER -ge 60 ]; then
                  echo "Timeout waiting for service after 10 minutes"
                  exit 1
                fi
              done
              echo "Service is ready!"
              COUNTER=0

              echo "Using bucket name: $${BUCKET_NAME}"

              # Set up the remote for the local rclone S3 service (encrypted)
              export RCLONE_CONFIG_LOCAL_S3_TYPE=s3
              export RCLONE_CONFIG_LOCAL_S3_PROVIDER=Other
              export RCLONE_CONFIG_LOCAL_S3_ENDPOINT=http://s3-rclone-longhorn-bck:8080
              export RCLONE_CONFIG_LOCAL_S3_ACCESS_KEY_ID="$${RC_ACCESS_KEY_ID}"
              export RCLONE_CONFIG_LOCAL_S3_SECRET_ACCESS_KEY="$${RC_ACCESS_KEY}"
              export RCLONE_CONFIG_LOCAL_S3_FORCE_PATH_STYLE=true

              export RCLONE_CONFIG_OUT_S3_USE_MULTIPART_UPLOADS=false
              export RCLONE_CONFIG_OUT_S3_NO_CHECK_BUCKET=true
              export RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID="$${BACKEND_S3_ACCESS_KEY_ID}"
              export RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY="$${BACKEND_S3_SECRET_ACCESS_KEY}"

              # Check if INFO.txt already exists in the backend (unencrypted)
              echo "Checking if INFO.txt already exists in backend..."
              if rclone lsf "out_s3:$${BUCKET_NAME}/INFO.txt" 2>/dev/null | grep -q "INFO.txt"; then
                echo "INFO.txt already exists in backend - initialization already complete"
                echo "Bucket is ready for Longhorn backups"
                exit 0
              fi

              # Check if bucket exists (via encrypted endpoint)
              echo "Checking if bucket exists..."
              if rclone lsd local_s3: 2>/dev/null | grep -q "$${BUCKET_NAME}"; then
                echo "Bucket '$${BUCKET_NAME}' already exists"
                
                # Check if bucket contains any encrypted data
                echo "Checking bucket contents (encrypted view)..."
                FILE_COUNT=$$(rclone ls "local_s3:$${BUCKET_NAME}/" 2>/dev/null | wc -l)
                
                if [ "$$FILE_COUNT" -gt 0 ]; then
                  echo "Bucket contains $${FILE_COUNT} encrypted file(s)"
                  echo "Listing existing files:"
                  rclone ls "local_s3:$${BUCKET_NAME}/" --max-depth 1
                fi
              else
                echo "Creating new bucket: $${BUCKET_NAME}"
                rclone mkdir "local_s3:$${BUCKET_NAME}" --log-level=INFO
                echo "Bucket created successfully"
              fi

              echo "Listing all buckets..."
              rclone lsd local_s3: --log-level=INFO

              echo "Generating INFO.txt file..."
              TIMESTAMP=$$(date -u +"%Y-%m-%d %H:%M:%S UTC")
              HOSTNAME=$$(hostname)

              cat > /tmp/INFO.txt <<EOF
              ============================================
              Longhorn Backup Bucket Information
              ============================================

              Bucket Name: $${BUCKET_NAME}
              Created: $${TIMESTAMP}
              Created By: $${HOSTNAME}

              Configuration:
              - Service: s3-rclone-longhorn-bck
              - Endpoint: http://s3-rclone-longhorn-bck:8080
              - Encryption: Enabled (rclone crypt)
              - Backend: Google Cloud Storage (GCS)
              - Region: europe-west4

              Rclone Configuration:
              - Remote: crypt_out_s3
              - Base Remote: out_s3
              - Encryption: Standard encryption with password and salt
              - Directory Name Encryption: Disabled
              - Filename Encryption: Disabled

              Environment:
              - Kubernetes Namespace: $${K8S_NAMESPACE}
              - Init Job: s3-rclone-longhorn-bck-init

              Notes:
              - All data stored in this bucket is encrypted using rclone crypt
              - Access requires proper HMAC credentials (stored in secrets)
              - Encryption password and salt are required for decryption
              - INFO.txt is stored UNENCRYPTED for easy access

              Secrets Used:
              - longhorn-backup-hmac-key: GCS HMAC credentials
              - rclone-secret: Encryption password and salt
              - longhorn-rclone-bck-key-secret: API access credentials

              ============================================
              EOF

              echo "Uploading INFO.txt to backend GCS (UNENCRYPTED)..."
              rclone copy /tmp/INFO.txt "out_s3:$${BUCKET_NAME}/" --log-level=INFO --s3-no-check-bucket

              echo "Verifying upload..."
              echo "Files in encrypted view:"
              rclone ls "local_s3:$${BUCKET_NAME}/" --log-level=INFO
              echo ""
              echo "Files in unencrypted backend:"
              rclone ls "out_s3:$${BUCKET_NAME}/" --max-depth 1 --log-level=INFO

              echo "Initialization complete!"
              echo "Bucket '$${BUCKET_NAME}' is ready for Longhorn backups"
              echo "INFO.txt is available unencrypted in the backend storage"
          env:
            - name: BUCKET_NAME
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_TYPE
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_PROVIDER
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_ENDPOINT
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_REGION
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_USE_MULTIPART_UPLOADS
              value: ~
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: RC_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_ACCESS_KEY_ID
            - name: RC_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_SECRET_ACCESS_KEY
            - name: BACKEND_S3_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: access_id
            - name: BACKEND_S3_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: secret


Why We Generate the INFO.txt File

The INFO.txt file serves several important purposes:

  1. Documentation: It provides human-readable information about the bucket configuration, encryption setup, and required credentials. This is invaluable when you need to restore or troubleshoot backups months or years later.

  2. Accessible without encryption: Critically, INFO.txt is stored directly in the backend (out_s3), bypassing the encryption layer. This means it can be read without needing the encryption keys, making it a self-documenting backup location.

  3. Verification: By uploading it to the backend, we verify that:

    • The GCS backend connection works correctly
    • Credentials are properly configured
    • The bucket is accessible and writable

  4. Recovery aid: If you ever lose your rclone configuration but still have the encryption passwords in a password manager, the INFO.txt file tells you exactly how the encryption was configured. This makes it possible to reconstruct the setup and recover your backups.

  5. Idempotency check: The script checks for INFO.txt existence to determine if initialization has already been completed, preventing duplicate initialization runs.
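
That idempotency guard is easy to sketch. In the real init job the check is an rclone lookup of INFO.txt on the backend remote; in this runnable sketch a stub function and an assumed local path (/tmp/mock-bucket) stand in for rclone so the logic works anywhere:

```shell
#!/bin/sh
# Sketch of the init job's idempotency guard. bucket_has_info is a stub
# standing in for the real check (roughly:
#   rclone lsf "out_s3:${BUCKET_NAME}/INFO.txt")
BUCKET_DIR=/tmp/mock-bucket   # assumed stand-in for the remote bucket

bucket_has_info() {
  [ -f "$BUCKET_DIR/INFO.txt" ]
}

init_bucket() {
  if bucket_has_info; then
    echo "INFO.txt found: bucket already initialized, skipping."
    return 0
  fi
  mkdir -p "$BUCKET_DIR"
  printf 'Longhorn Backup Bucket Information\n' > "$BUCKET_DIR/INFO.txt"
  echo "Initialized bucket."
}

init_bucket   # first run performs initialization
init_bucket   # second run detects INFO.txt and skips
```

Because the guard keys off a file that is only written at the very end of a successful initialization, a crashed run is retried in full rather than half-skipped.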

Why We Create the Bucket with rclone mkdir "local_s3:${BUCKET_NAME}"

Creating the bucket through the local_s3: remote (the encrypted S3 endpoint) rather than directly on the backend has several advantages:

  1. End-to-end testing: This verifies the entire encryption pipeline is working:

    • The rclone serve s3 service is running and accessible
    • Authentication is correctly configured
    • The crypt layer is properly set up
    • The underlying GCS backend is reachable

  2. S3 API validation: It ensures that bucket creation operations work through the S3 API layer, which is exactly how Longhorn will interact with the system. If rclone mkdir succeeds through local_s3:, we know Longhorn's S3 operations will also work.

  3. Consistent access path: By creating the bucket the same way Longhorn will access it (through the S3 API), we ensure there are no surprises or incompatibilities when Longhorn starts using the bucket.

  4. Automatic bucket initialization: On GCS, when you create a "bucket" through rclone's S3 interface, it actually creates a folder/prefix in the specified GCS bucket (configured as <BUCKET-NAME-REDACTED> in the config). This happens automatically through the crypt layer.

  5. Proper permissions verification: This confirms that the service account credentials (RC_ACCESS_KEY_ID/RC_ACCESS_KEY) have the necessary permissions to create buckets through the S3 interface.
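
For context, the three remotes involved can be expressed using rclone's environment-variable configuration syntax. This is a hedged sketch: the values are illustrative placeholders rather than my actual settings, and the encryption password/salt (injected from the rclone-secret Secret) are omitted:

```shell
# Backend remote (out_s3): GCS reached over its S3-compatible API.
export RCLONE_CONFIG_OUT_S3_TYPE=s3
export RCLONE_CONFIG_OUT_S3_PROVIDER=GCS
export RCLONE_CONFIG_OUT_S3_ENDPOINT=https://storage.googleapis.com
export RCLONE_CONFIG_OUT_S3_REGION=europe-west4

# Encryption layer (crypt_out_s3): wraps out_s3 with rclone crypt.
# Filename/directory-name encryption are disabled, matching INFO.txt above.
export RCLONE_CONFIG_CRYPT_OUT_S3_TYPE=crypt
export RCLONE_CONFIG_CRYPT_OUT_S3_REMOTE="out_s3:<backend-bucket>"  # placeholder
export RCLONE_CONFIG_CRYPT_OUT_S3_FILENAME_ENCRYPTION=off
export RCLONE_CONFIG_CRYPT_OUT_S3_DIRECTORY_NAME_ENCRYPTION=false
# RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD / _PASSWORD2 come from the Secret.

# local_s3 is not a config entry: it is the in-cluster S3 endpoint exposed by
# `rclone serve s3 crypt_out_s3:`, which Longhorn (and the init job) talk to.
```

Keeping the configuration in environment variables (rather than an rclone.conf file) is what lets the Kubernetes manifest inject everything from ConfigMaps and Secrets, as the env block above shows.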

RTO and RPO

With this setup, you can easily modify the RPO by adjusting the Longhorn backup schedule to fit your needs.
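
Concretely, the schedule lives in a Longhorn RecurringJob resource. A minimal sketch, where the name, cron expression, and retention are illustrative values rather than my actual configuration:

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup        # illustrative name
  namespace: longhorn-system
spec:
  task: backup                # write backups to the configured backup target
  cron: "0 3 * * *"           # nightly at 03:00 -> RPO of roughly 24 hours
  groups: ["default"]         # applies to volumes in the default group
  retain: 7                   # keep the last 7 backups
  concurrency: 2
```

Tightening the cron schedule directly tightens the RPO, at the cost of more backup traffic to GCS.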

RTO mainly depends on pipeline execution time. The two dominant components are bootstrapping the Kubernetes cluster with Kubespray and the FluxCD reconciliation process.

In my case, provisioning the entire infrastructure from scratch and reconciling the cluster state takes around 1 hour, which works well for my use case and is in line with the recovery objectives many organizations set for non-critical workloads.

To lower the RTO further, you could customize FluxCD timeouts, retry periods, and parallelism to achieve more aggressive reconciliation.
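
Those knobs live on each Flux Kustomization (parallelism is controlled on the controllers themselves, e.g. kustomize-controller's --concurrent flag). A hedged sketch with illustrative values:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps                 # illustrative name
  namespace: flux-system
spec:
  interval: 10m              # steady-state reconcile period
  retryInterval: 30s         # retry failures quickly during a rebuild
  timeout: 3m                # fail fast instead of blocking dependents
  wait: true                 # only mark Ready once resources are healthy
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps
```

A short retryInterval matters most during a full rebuild, when Kustomizations routinely fail on their first attempts while their dependencies are still coming up.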

However, I'm planning to set up a disaster recovery site for even better redundancy. More on this in the "Next Steps" section.

Conclusion

This journey from a homelab data loss incident to a production-grade IaC setup taught me to treat even a personal homelab like production infrastructure.

While the initial setup took about a month (including research, testing, and iteration), I can now rebuild my entire infrastructure in 1 hour. More importantly, I've eliminated the anxiety of "did I back that up?"β€”everything is code, versioned, and reproducible.

The total monthly cost (~€10 for GCS storage) is minimal compared to the value of reliable, reproducible infrastructure. If you're running a homelab, I encourage you to treat it like productionβ€”your future self will thank you.

Limitations

As you might have noticed, some components are not yet part of the automated rebuild process.

Infrastructure Assumptions:

While this is acceptable for a homelab, an enterprise-grade setup should include these components in the IaC pipeline as well. For on-premises environments, this presents additional challenges:

These omissions mean a true "datacenter destroyed" scenario still requires some manual intervention. However, for the more common scenarios (VM corruption, cluster misconfiguration, accidental deletion), the current setup provides comprehensive protection.

Next Steps

While Google Cloud is a great option and my monthly cost is only ~€10, in the future I would like to explore Hetzner. The pricing is really competitive and they have an S3-compatible object storage service. They also have an OpenTofu provider.

Another area I would like to explore is leveraging Longhorn Disaster Recovery Volumes in conjunction with the FluxCD d2 architecture. This way, I might be able to create a recovery cluster in another location and have a more robust disaster recovery plan. I think that by using Hetzner with only the strictly necessary services, a single server might be sufficient to host the recovery cluster cheaply.

With this setup, it might be possible to achieve a very low RTO and RPO.

Finally, I hope that in the future Longhorn will natively support client-side encryption and a declarative way to restore volumes from offsite backups so that I can simplify the current setup.