One-click homelab: integrating Gitlab, Proxmox and K8s with GitOps Principles

{homelab} {devops} :: #proxmox #kubernetes #gitlab #iac #kubespray #fluxcd #longhorn

One unlucky day, I destroyed my homelab while I was trying to upgrade various components. I had a backup of my virtual machines where my Kubernetes cluster was running, but unfortunately, I forgot to enable the backup of the secondary disk attached to each VM. This secondary disk was used to store all the persistent data of my cluster, including the Longhorn volumes. I thought that the backups were fine, but when I restored the VMs, I realized that the secondary disks were missing. As a result, I lost all my persistent data, and I had to start from scratch.

My homelab was at least 3/4 years old, and I had accumulated a lot of configurations, customizations, and data over time. I started it while approaching the DevOps world, so it didn't really follow best practices. Rebuilding everything from scratch was a daunting task, and I realized that I needed a better way to manage my homelab infrastructure.

It might sound incredible, but while it might be acceptable for a personal homelab, losing data due to a lack of proper backup strategies is a common issue in the industry as well. Some notable examples include:

South Korea's NIRS lost 858TB of government data due to not having an offsite backup
Gitlab lost 6 hours of data due to improper backup procedures and a lack of verification

Given the experience I gathered over the years, I decided to rebuild my homelab using Infrastructure as Code (IaC) principles. This time, I wanted to ensure that I could easily recreate my entire setup with just one click, without having to go through the tedious process of manual configuration. So my objective was not just having a backup of my infrastructure, but having a recovery time objective (RTO) of minutes.

This way, I can also brag about having a better infrastructure than most companies out there! 😄

What to Expect from This Blog Post

In this blog post you will find my personal solution to address the creation of a personal homelab following IaC principles. In particular, I will show you how I've successfully integrated open source tools with minimal cost and custom code to achieve a one-click deployment of my homelab infrastructure.

The solution has the following properties:

RTO of 1 hour
RPO of 1 day (can be customized)
Cost effective (less than 10$ / month in my case; depends on how much data you have to backup to the offsite location)
Privacy oriented (more later)
Fully automated using GitOps principles

The tech stack of the solution includes:

GitLab SaaS: I use it for simplicity, but you can also use your own GitLab instance at your discretion.
Proxmox: I use it in my homelab, but you can use any Hypervisor you prefer, the only important thing is that there is an OpenTofu provider for that.
OpenTofu: Necessary to create the various component of the solution. OpenTofu is the open-source fork of Terraform, which I use for infrastructure provisioning.
Ubuntu Cloud Images: I use Ubuntu Cloud Images as the base operating system for the VMs in my Proxmox cluster. These images are optimized for cloud environments and provide an automated way to provision VMs with Cloud-Init.
KubeSpray: I use KubeSpray to create the Kubernetes cluster on top of Proxmox VMs. KubeSpray is a popular open-source project that provides a set of Ansible playbooks for deploying and managing Kubernetes clusters.
FluxCD: I use FluxCD for GitOps management of the Kubernetes cluster. FluxCD is a popular open-source project that enables continuous delivery and GitOps for Kubernetes.
Sealed Secrets: I use Sealed Secrets for storing all the credentials directly on git. While for an homelab solution is more than enough, OpenBao might be a better solution for an Enterprise.
Longhorn: I use Longhorn as the storage solution for the Kubernetes cluster. Longhorn is a popular open-source project that provides a distributed block storage system for Kubernetes.
GCP Cloud Storage: In my solution I use GCP cloud storage for off-site backup. Please note that the solution I will provide ensures client-side encryption of data (so even Google will not be able to decipher your data). Additionally, you can use any NFS server or self-hosted S3 Object store solution like Garage.

Recovery Time Breakdown

Here's what happens during a full disaster recovery (1 hour RTO):

┌─────────────────────────────────────────────────────────────────┐
│ Full Infrastructure Recovery Timeline                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Parallel:                                                       │
│  GCP Bucket ████████                                            │
│  (4-5 min, ~8% - runs in parallel)                              │
│                                                                 │
│ Main Flow:                                                      │
│  0min                                                     60min │
│  ├─────────┬────────────────────────┬───────┬──────────────┤    │
│  │         │                        │       │              │    │
│  ▼         ▼                        ▼       ▼              ▼    │
│  VM       K8s                    Sealed  Flux          Flux     │
│  Prov.    Bootstrap              Secrets Deploy        Reconcile│
│                                                                 │
│  9-10     23                     1-2   3-4              20-25   │
│  min      min                    min   min              min     │
│  (~16%)   (~38%)                 (~2%) (~6%)            (~38%)  │
│                                                                 │
│  Total: ~60 minutes                                             │
└─────────────────────────────────────────────────────────────────┘

Phase Details:
├─ GCP Bucket (4-5 min): Create/verify backup storage [RUNS IN PARALLEL]
├─ VM Provisioning (9-10 min): Download Ubuntu Cloud Images, create VMs with Cloud-Init
├─ K8s Bootstrap (23 min): KubeSpray cluster deployment (long phase - ~38% of total time)
├─ Sealed Secrets (1-2 min): Deploy secret management controller
├─ FluxCD Deploy (3-4 min): Install GitOps operator and sync repositories
└─ FluxCD Reconcile (20-25 min): Complete reconciliation of cluster state from GitOps repos (long phase - ~38% of total time)

Note: The Kubernetes bootstrapping phase accounts for approximately 38% of the total RTO. The GCP bucket creation runs in parallel with VM provisioning, so it doesn't add to the overall recovery time. Also it might be possible to optimize the FluxCD reconciliation time by tweaking its configuration for more aggressive syncs.

Here's a time-lapse of the complete infrastructure deployment from scratch to a running cluster (~60 minutes compressed):

Repositories Structure

All the infrastructure code is organized in five repositories:

gitlab-runner repository: contains the OpenTofu code to create the Gitlab Runner on Proxmox as an LXC container.
IaC repository: contains all the code necessary to bootstrap the Proxmox VMs and the Kubernetes cluster using KubeSpray.
d2-fleet repository: defines the desired state of the Kubernetes clusters and tenants in the fleet.
d2-infra repository: defines the desired state of the cluster add-ons and the monitoring stack.
d2-apps repositories: defines the desired state of the applications deployed across environments.

We will examine the IaC repository in detail, the d2-* repositories simply apply the d2-reference-architecture provided by the FluxCD team, which, I must say, is very well thought out and implemented. 👍

The Gitlab Runner repository is also quite straightforward, as it only contains the OpenTofu code to create the LXC container and register the runner with Gitlab, so I will not cover it here.

How They Work Together

The deployment flow follows a clear progression:

gitlab-runner → Bootstraps the CI/CD infrastructure needed to run the automated pipelines
IaC → Handles the foundational layer: VMs, backup storage, Kubernetes cluster, and essential components (Sealed Secrets, FluxCD operator)
d2-fleet, d2-infra, d2-apps → Once the IaC pipeline completes, FluxCD takes over and continuously reconciles the cluster state based on these GitOps repositories

In essence, the IaC repository gets you from an empty Hypervisor to a GitOps-ready cluster, then the d2-repositories manage everything from that point forward. This separation means the IaC pipeline only needs to run for infrastructure changes and periodic disaster recovery tests, while the d2 repositories handle all day-to-day operations through FluxCD's automatic reconciliation.

IaC Pipeline

The high-level steps of provisioning the homelab infrastructure performed by the IaC pipeline are:

Creates the GCP Cloud Storage bucket for offsite backups.
Creates the Proxmox VMs using OpenTofu.
Bootstraps the Kubernetes cluster using KubeSpray.
Uploads the generated Kubeconfig file to GitLab as an artifact with restricted access.
Trigger the deployment of Sealed Secrets sub-pipeline.
Trigger the deployment of FluxCD sub-pipeline.
Longhorn will be deployed by FluxCD as part of the d2-infra repository.
Restore Jobs defined in the d2-infra repository will restore Longhorn volumes from the GCP Cloud Storage bucket.

GCP Cloud Storage Bucket

The only prerequisites to create this part of the pipeline are:

Create an OpenTofu service account with the necessary permissions to create and manage GCP Cloud Storage buckets.
Create a GitLab Personal Access Token with the necessary permissions to modify CI/CD variables in the GitLab project. Note: I had to use a PAT because I'm on the free tier of GitLab SaaS, which does not yet support Project Access Tokens. If you have a paid plan, you should use a Project Access Token instead.

The process the pipeline follows is:

If not already present, it creates the GCP Cloud Storage bucket using google_storage_bucket resource.

 1resource "google_storage_bucket" "longhorn_backup_bucket" {
 2    name                        = var.gcp_backup_longhorn_bucket_name
 3    location                    = var.gcp_backup_region
 4    storage_class               = "NEARLINE"
 5    uniform_bucket_level_access = true
 6    public_access_prevention    = "enforced"
 7    hierarchical_namespace {
 8        enabled = true
 9    }
10}

It creates a longhorn_backup_service_account.
Assigns to the newly created service account the roles/storage.objectAdmin role for the created bucket.
Creates/Syncs a HMAC key for the service account.
The GitLab Pipeline will save the generated HMAC key as a CI/CD variable in the GitLab project. In my case, it will use the Access Token created in the pre-requisites step to do so.

This HMAC key is later injected as a secret into the Kubernetes cluster, allowing Longhorn to connect to the GCP Cloud Storage bucket.

Proxmox VMs

The OpenTofu provider used is bpg/proxmox.

The prerequisites to create this part of the pipeline are:

Create a Proxmox API token with the necessary permissions to create and manage VMs.
Create an SSH key pair and store the private key as a GitLab CI/CD variable. This key will be used by the on-premise GitLab Runner to connect to the Proxmox VMs.

The pipeline follows this process:

OpenTofu makes Proxmox download the Ubuntu Cloud Image using proxmox_virtual_environment_download_file resource.

 1resource "proxmox_virtual_environment_download_file" "ubuntu_24_noble_qcow2_img" {
 2content_type       = "iso"
 3datastore_id       = "nfs-nas-1"
 4node_name          = "pve2"
 5url                = var.proxmox_k8s_node_image_url
 6overwrite          = true
 7file_name          = var.proxmox_k8s_node_image_name
 8checksum           = var.proxmox_k8s_node_image_checksum
 9checksum_algorithm = var.proxmox_k8s_node_image_checksum_algorithm
10}

OpenTofu creates the Cloud-Init configuration file for the worker and master nodes using proxmox_virtual_environment_file resource. The important configurations to add to the Cloud-Init file are:

SSH Keys: Inject both your personal public key and the GitLab Runner's public key
Package Management: Install necessary packages (like qemu-guest-agent)
Longhorn Prerequisites: Configure according to Longhorn documentation

Code snippet (worker nodes)

 1resource "proxmox_virtual_environment_file" "ubuntu_cloud_init_worker" {
 2content_type = "snippets"
 3datastore_id = "nfs-nas-1"
 4node_name    = "pve2"
 5overwrite    = true
 6source_raw {
 7    data      = <<EOF
 8#cloud-config
 9users:
10- default
11- name: ${var.proxmox_k8s_node_username}
12    groups:
13    - sudo
14    shell: /bin/bash
15    ssh_authorized_keys:
16%{~for key in var.proxmox_k8s_node_ssh_keys}
17    - ${key}
18%{~endfor}
19    sudo: ALL=(ALL) NOPASSWD:ALL
20
21package_update: true
22package_upgrade: true
23packages:
24- qemu-guest-agent
25- nfs-common
26
27# Disk partitioning setup
28disk_setup:
29/dev/sdb:
30    table_type: gpt
31    layout: true
32    overwrite: false
33
34# Filesystem setup
35fs_setup:
36- label: data
37    filesystem: ext4
38    device: /dev/sdb1
39    partition: auto
40    overwrite: false
41
42# Mount configuration
43mounts:
44- [/dev/sdb1, /mnt/data, ext4, "defaults,nofail", "0", "2"]
45
46write_files:
47- path: /etc/modules-load.d/dm_crypt.conf
48    content: |
49    dm_crypt
50    owner: root:root
51    permissions: '0644'
52
53    runcmd:
54    - systemctl enable qemu-guest-agent
55    - systemctl start qemu-guest-agent
56    - systemctl stop multipathd.socket
57    - systemctl stop multipathd
58    - systemctl disable multipathd.socket
59    - systemctl disable multipathd
60    - systemctl mask multipathd
61    - systemctl mask multipathd.socket
62    - modprobe dm_crypt
63    - systemctl enable iscsid
64    - systemctl start iscsid
65    - echo "done" > /tmp/cloud-config.done
66EOF
67      file_name = "ubuntu.cloud-config-worker.yaml"
68    }
69}

OpenTofu generates the output kubespray_inventory using the templatefile function with inventory.tpl. You can customize the inventory.tpl to fit your needs following the KubeSpray documentation. Here's my version of the inventory.tpl file.

inventory.tpl

 1{
 2    "all": {
 3        "vars": {
 4        "ansible_user": "${ansible_user}",
 5        "ansible_become": true,
 6        "calico_cni_name": "k8s-pod-network",
 7        "nat_outgoing": true,
 8        "nat_outgoing_ipv6": true,
 9        "calico_pool_blocksize": 26,
10        "calico_network_backend": "vxlan",
11        "calico_vxlan_mode": "CrossSubnet",
12        "kube_proxy_strict_arp": true,
13        "kube_encrypt_secret_data": true,
14        "kubeconfig_localhost": true,
15        "artifacts_dir": "/output",
16        "etcd_deployment_type": "host",
17        "etcd_metrics_port": 2381,
18        "etcd_listen_metrics_urls": "http://0.0.0.0:2381",
19        "etcd_metrics_service_labels": {
20            "k8s-app": "etcd",
21            "app.kubernetes.io/managed-by": "kubespray",
22            "app": "kube-prometheus-stack-kube-etcd",
23            "release": "kube-prometheus-stack"
24        },
25        "kube_proxy_metrics_bind_address": "0.0.0.0:10249"
26        },
27        "children": {
28        "kube_control_plane": {
29            "hosts": {
30    %{ for name in master_nodes ~}
31            "${name}",
32    %{ endfor ~}
33            }
34        },
35        "etcd": {
36            "hosts": {
37    %{ for name in master_nodes ~}
38            "${name}",
39    %{ endfor ~}
40            }
41        },
42        "kube_node": {
43            "hosts": {
44    %{ for name in worker_nodes ~}
45            "${name}",
46    %{ endfor ~}
47            }
48        },
49        "k8s_cluster": {
50            "children": [
51            "kube_control_plane",
52            "kube_node"
53            ]
54        }
55        }
56    }
57}

IMHO, the most important thing here is to limit Kubespray to install only the necessary components to have a minimal Kubernetes cluster ready for FluxCD deployment.

Saves the generated inventory file as a GitLab artifact for use in the next stage.

Kubernetes Cluster Bootstrapping

This part of the pipeline bootstraps the Kubernetes cluster using KubeSpray.

I think KubeSpray is the best solution for creating a production-ready Kubernetes cluster in an on-premise environment, but I also think it should be limited to only the necessary components for a working cluster. KubeSpray, which is based on Ansible, provides many options to install various components like CNI, Ingress controllers, and monitoring stacks. However, in my opinion, these components should be installed using a more mature GitOps tool like FluxCD.

The prerequisites for this part of the pipeline are:

A Personal Access Token with API scope to upload the kubeconfig file as a GitLab CI/CD variable. As already mentioned, if you have a paid plan, you should use a Project Access Token instead. Additionally, you can use the same token created for the GCP Bucket creation step.
The GitLab Runner needs to connect to the Proxmox VMs using SSH. The private key for the SSH connection should be stored as a GitLab CI/CD variable in the project, as already mentioned in the Proxmox VMs creation step.

The pipeline follows these steps:

 1apply-kubespray-production-home:
 2  stage: deploy
 3  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/docker:cli
 4  tags:
 5    - mgmt-zone
 6    - self-hosted
 7  needs:
 8    - job: opentofu-apply-production-home
 9      artifacts: true
10  services:
11    - *dind
12  before_script:
13    - apk add --no-interactive jq
14  script:
15    - mkdir -p $CI_PROJECT_DIR/inventory
16    - mkdir -p output
17    - chmod 600 $RUNNER_SSH_PRIVATE_KEY_PATH
18    - jq -r '.kubespray_inventory.value' home-tf-output.json > $CI_PROJECT_DIR/inventory/inventory.json
19    - docker run --mount type=bind,source=$CI_PROJECT_DIR/inventory,dst=/inventory --mount type=bind,source=$RUNNER_SSH_PRIVATE_KEY_PATH,dst=/ssh/id --mount type=bind,source=$CI_PROJECT_DIR/output,dst=/output --rm quay.io/kubespray/kubespray:$KUBESPRAY_VERSION ansible-playbook -i /inventory/inventory.json --private-key /ssh/id  cluster.yml
20  environment:
21    name: production/home
22  artifacts:
23    when: on_success
24    access: developer
25    expire_in: "10 mins"
26    paths:
27      - output/**

This step leverages the official KubeSpray Docker image to run the Ansible playbooks against the Proxmox VMs created in the previous step. It then saves the generated kubeconfig file as a restricted-access artifact for use in subsequent pipeline stages.

Uploading Kubeconfig to Gitlab

This step uploads the generated kubeconfig file to GitLab as a CI/CD variable using the GitLab API directly.

 1.upload-secret-base64-encoded:
 2  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/alpine/curl:latest
 3  script:
 4    - |
 5      set -e
 6      DATA=$(base64 -w 0 ${DATA_FILE_PATH})
 7      curl -s -f --request PUT \
 8        --header "PRIVATE-TOKEN: ${ACCESS_TOKEN}" \
 9        --header "Content-Type: application/json" \
10        --data "{\"variable_type\":\"file\",\"key\":\"${VAR_NAME}\",\"value\":\"$DATA\",\"hidden\":false,\"protected\":true,\"masked\":true,\"raw\":true,\"description\":\"\"}" \
11        "$CI_API_V4_URL/projects/$CI_PROJECT_ID/variables/${VAR_NAME}" > /dev/null 2>&1

This way, after the IaC pipeline finishes, the kubeconfig file will be available for administrators to download and use.

Sealed Secrets Deployment

This part of the pipeline creates the Sealed Secrets controller in the Kubernetes cluster.

The prerequisite is a certificate/key pair to be used by the Sealed Secrets controller. You can find the instructions here: Bring your own certificates.

The private key needs to be stored as a GitLab CI/CD variable, while the certificate can be stored directly in the IaC repository.

FluxCD Deployment

This part of the pipeline uses OpenTofu to deploy the FluxCD Operator in the Kubernetes cluster as described in the official documentation.

This step also creates the Kubernetes secret with the GCP HMAC key for Longhorn and the secret with the registry credentials to pull container images from private registries.

Longhorn Volumes Restoration Process

Unfortunately, at the time of writing, Longhorn does not support a declarative way to restore volumes from offsite backups (issue#5787).

To work around this limitation, I have created a simple Dockerized Python program that leverages the Longhorn API to restore the latest backup for defined volumes from offsite storage. You can find the repository here.

Simply define a Job in the d2-infra repository for each volume you want to restore. For example, here is the Job definition to restore the Authelia Longhorn volume:

 1apiVersion: batch/v1
 2kind: Job
 3metadata:
 4  name: authelia-volume-restore
 5  namespace: authelia
 6spec:
 7  template:
 8    spec:
 9      containers:
10        - name: restore
11          image: your-registry/longhorn-backup-restore:latest
12          env:
13            - name: LONGHORN_URL
14              value: http://longhorn-frontend.longhorn-system.svc.cluster.local
15            - name: VOLUME_HANDLE
16              value: authelia-production-vol
17            - name: NUMBER_OF_REPLICAS
18              value: "3"
19            - name: LOG_LEVEL
20              value: INFO
21      restartPolicy: Never
22  backoffLimit: 3

Then you need to create the corresponding PV and PVC to use the restored volume in your application. These can be defined in the d2-infra repository, for example in the same file as the restore job.

You can find more information in the README of the repository.

Encrypt Longhorn Backups Client-Side

Unfortunately, at the time of writing, Longhorn does not support client-side encryption of backups natively (issue#5220).

A simple solution I found is to use rclone to encrypt the backups client-side before uploading them to the offsite backup location.

Simply declare a Deployment, a Service, some Secrets, and a one-time Job in the same namespace where Longhorn is installed.

The important bits in the

deployment manifest

 1apiVersion: apps/v1
 2kind: Deployment
 3metadata:
 4  name: s3-rclone-longhorn-bck
 5  labels:
 6    app.kubernetes.io/name: s3-rclone-longhorn-bck
 7spec:
 8  replicas: 1
 9  selector:
10    matchLabels:
11      app.kubernetes.io/name: s3-rclone-longhorn-bck
12  template:
13    metadata:
14      labels:
15        app.kubernetes.io/name: s3-rclone-longhorn-bck
16    spec:
17      containers:
18        - name: rclone
19          image: ghcr.io/rclone/rclone:1.71.1
20          imagePullPolicy: IfNotPresent
21          command:
22            - "rclone"
23            - "serve"
24            - "s3"
25            - "--no-cleanup"
26            - "--auth-key"
27            - "$(RC_ACCESS_KEY_ID),$(RC_ACCESS_KEY)"
28            - "crypt_out_s3:"
29            - "--s3-force-path-style=true"
30            - "--addr=:8080"
31            - "--log-level=WARNING"
32          env:
33            - name: RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID
34              valueFrom:
35                secretKeyRef:
36                  name: longhorn-backup-hmac-key
37                  key: access_id
38            - name: RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY
39              valueFrom:
40                secretKeyRef:
41                  name: longhorn-backup-hmac-key
42                  key: secret
43            - name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD
44              valueFrom:
45                secretKeyRef:
46                  name: rclone-secret
47                  key: password
48            # salt is used as password2
49            - name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD2
50              valueFrom:
51                secretKeyRef:
52                  name: rclone-secret
53                  key: salt
54            - name: RC_ACCESS_KEY_ID
55              valueFrom:
56                secretKeyRef:
57                  name: longhorn-rclone-bck-key-secret
58                  key: AWS_ACCESS_KEY_ID
59            - name: RC_ACCESS_KEY
60              valueFrom:
61                secretKeyRef:
62                  name: longhorn-rclone-bck-key-secret
63                  key: AWS_SECRET_ACCESS_KEY
64          ports:
65            - name: http
66              containerPort: 8080
67              protocol: TCP
68          volumeMounts:
69            - name: config
70              mountPath: /root/.config/rclone
71      volumes:
72        - name: config
73          configMap:
74            name: s3-rclone-longhorn-bck-config
75            items:
76              - key: rclone.conf
77                path: rclone.conf
78

and in the

rclone config

 1[out_s3]
 2type = s3
 3provider = GCS
 4endpoint = https://storage.googleapis.com
 5region = europe-west4
 6use_multipart_uploads = false
 7
 8[crypt_out_s3]
 9type = crypt
10remote = out_s3:<BUCKET-NAME-REDACTED>
11directory_name_encryption = false
12filename_encryption = off

I used are

Why We Use `rclone serve s3`

rclone serve s3 is the key component that makes this solution work. It implements a basic S3-compatible server that exposes any rclone backend (in our case, the encrypted crypt_out_s3 remote) as an S3 endpoint.

This is essential because:

S3 Gateway: Longhorn expects to communicate with an S3-compatible storage backend for backups. By using rclone serve s3, we provide Longhorn with a standard S3 interface.
Encryption layer: By pointing the S3 server at the crypt_out_s3 remote, all data is transparently encrypted/decrypted as it passes through rclone
No Longhorn modifications: This approach requires zero changes to Longhorn itself—it just sees a standard S3 endpoint
Authentication: The --auth-key parameter allows us to secure the S3 endpoint with credentials that Longhorn can use

The command essentially creates an S3 gateway that sits between Longhorn and the actual storage backend, handling all encryption/decryption automatically.

Understanding the rclone crypt Options

The crypt remote configuration uses two important settings:

directory_name_encryption = false

This keeps directory names in plaintext (unencrypted). While this reduces security slightly, it has practical benefits:

Easier to navigate the bucket structure directly if needed
Simpler debugging—you can see the folder structure at a glance

The actual file data is still fully encrypted, so the main content security is preserved.

filename_encryption = off

With this setting, files only get a .bin extension added instead of having their filenames encrypted. This provides several advantages:

Shorter encrypted filenames, avoiding path length limits on some cloud storage providers
Easier to identify files when accessing the backend directly

Security trade-off: This setting trades some security for practicality. If you need maximum security, you could use standard encryption, which encrypts filenames completely.

Both options make the encrypted remote more manageable and reduce the risk of hitting storage provider limitations while keeping the actual file content fully encrypted.

Next, create the corresponding Service and a simple Job to initialize the backup bucket.

The important bits in the

job manifest

  1apiVersion: batch/v1
  2kind: Job
  3metadata:
  4  name: s3-rclone-longhorn-bck-init
  5  labels:
  6    app.kubernetes.io/name: s3-rclone-longhorn-bck
  7    app.kubernetes.io/component: init
  8spec:
  9  backoffLimit: 3
 10  template:
 11    metadata:
 12      labels:
 13        app.kubernetes.io/name: s3-rclone-longhorn-bck
 14        app.kubernetes.io/component: init
 15    spec:
 16      restartPolicy: OnFailure
 17      containers:
 18        - name: rclone-init
 19          image: ghcr.io/rclone/rclone:1.71.1
 20          imagePullPolicy: IfNotPresent
 21          command:
 22            - "/bin/sh"
 23            - "-c"
 24            - |
 25              set -e
 26
 27              echo "Waiting for rclone S3 service to be ready..."
 28              # limit to 10 minutes
 29              COUNTER=0
 30              until nc -z s3-rclone-longhorn-bck 8080 2>/dev/null; do
 31                echo "Waiting for service..."
 32                sleep 10
 33                COUNTER=`expr $COUNTER + 1`
 34                if [ $COUNTER -ge 60 ]; then
 35                  echo "Timeout waiting for service after 10 minutes"
 36                  exit 1
 37                fi
 38              done
 39              echo "Service is ready!"
 40              COUNTER=0
 41
 42              echo "Using bucket name: $${BUCKET_NAME}"
 43
 44              # Set up the remote for the local rclone S3 service (encrypted)
 45              export RCLONE_CONFIG_LOCAL_S3_TYPE=s3
 46              export RCLONE_CONFIG_LOCAL_S3_PROVIDER=Other
 47              export RCLONE_CONFIG_LOCAL_S3_ENDPOINT=http://s3-rclone-longhorn-bck:8080
 48              export RCLONE_CONFIG_LOCAL_S3_ACCESS_KEY_ID="$${RC_ACCESS_KEY_ID}"
 49              export RCLONE_CONFIG_LOCAL_S3_SECRET_ACCESS_KEY="$${RC_ACCESS_KEY}"
 50              export RCLONE_CONFIG_LOCAL_S3_FORCE_PATH_STYLE=true
 51
 52              export RCLONE_CONFIG_OUT_S3_USE_MULTIPART_UPLOADS=false
 53              export RCLONE_CONFIG_OUT_S3_NO_CHECK_BUCKET=true
 54              export RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID="$${BACKEND_S3_ACCESS_KEY_ID}"
 55              export RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY="$${BACKEND_S3_SECRET_ACCESS_KEY}"
 56
 57              # Check if INFO.txt already exists in the backend (unencrypted)
 58              echo "Checking if INFO.txt already exists in backend..."
 59              if rclone lsf "out_s3:$${BUCKET_NAME}/INFO.txt" 2>/dev/null | grep -q "INFO.txt"; then
 60                echo "INFO.txt already exists in backend - initialization already complete"
 61                echo "Bucket is ready for Longhorn backups"
 62                exit 0
 63              fi
 64
 65              # Check if bucket exists (via encrypted endpoint)
 66              echo "Checking if bucket exists..."
 67              if rclone lsd local_s3: 2>/dev/null | grep -q "$${BUCKET_NAME}"; then
 68                echo "Bucket '$${BUCKET_NAME}' already exists"
 69                
 70                # Check if bucket contains any encrypted data
 71                echo "Checking bucket contents (encrypted view)..."
 72                FILE_COUNT=$$(rclone ls "local_s3:$${BUCKET_NAME}/" 2>/dev/null | wc -l)
 73                
 74                if [ "$$FILE_COUNT" -gt 0 ]; then
 75                  echo "Bucket contains $${FILE_COUNT} encrypted file(s)"
 76                  echo "Listing existing files:"
 77                  rclone ls "local_s3:$${BUCKET_NAME}/" --max-depth 1
 78                fi
 79              else
 80                echo "Creating new bucket: $${BUCKET_NAME}"
 81                rclone mkdir "local_s3:$${BUCKET_NAME}" --log-level=INFO
 82                echo "Bucket created successfully"
 83              fi
 84
 85              echo "Listing all buckets..."
 86              rclone lsd local_s3: --log-level=INFO
 87
 88              echo "Generating INFO.txt file..."
 89              TIMESTAMP=$$(date -u +"%Y-%m-%d %H:%M:%S UTC")
 90              HOSTNAME=$$(hostname)
 91
 92              cat > /tmp/INFO.txt <<EOF
 93              ============================================
 94              Longhorn Backup Bucket Information
 95              ============================================
 96
 97              Bucket Name: $${BUCKET_NAME}
 98              Created: $${TIMESTAMP}
 99              Created By: $${HOSTNAME}
100
101              Configuration:
102              - Service: s3-rclone-longhorn-bck
103              - Endpoint: http://s3-rclone-longhorn-bck:8080
104              - Encryption: Enabled (rclone crypt)
105              - Backend: Google Cloud Storage (GCS)
106              - Region: europe-west4
107
108              Rclone Configuration:
109              - Remote: crypt_out_s3
110              - Base Remote: out_s3
111              - Encryption: Standard encryption with password and salt
112              - Directory Name Encryption: Disabled
113              - Filename Encryption: Disabled
114
115              Environment:
116              - Kubernetes Namespace: $${K8S_NAMESPACE}
117              - Init Job: s3-rclone-longhorn-bck-init
118
119              Notes:
120              - All data stored in this bucket is encrypted using rclone crypt
121              - Access requires proper HMAC credentials (stored in secrets)
122              - Encryption password and salt are required for decryption
123              - INFO.txt is stored UNENCRYPTED for easy access
124
125              Secrets Used:
126              - longhorn-backup-hmac-key: GCS HMAC credentials
127              - rclone-secret: Encryption password and salt
128              - longhorn-rclone-bck-key-secret: API access credentials
129
130              ============================================
131              EOF
132
133              echo "Uploading INFO.txt to backend GCS (UNENCRYPTED)..."
134              rclone copy /tmp/INFO.txt "out_s3:$${BUCKET_NAME}/" --log-level=INFO --s3-no-check-bucket
135
136              echo "Verifying upload..."
137              echo "Files in encrypted view:"
138              rclone ls "local_s3:$${BUCKET_NAME}/" --log-level=INFO
139              echo ""
140              echo "Files in unencrypted backend:"
141              rclone ls "out_s3:$${BUCKET_NAME}/" --max-depth 1 --log-level=INFO
142
143              echo "Initialization complete!"
144              echo "Bucket '$${BUCKET_NAME}' is ready for Longhorn backups"
145              echo "INFO.txt is available unencrypted in the backend storage"
146          env:
147            - name: BUCKET_NAME
148              value: ~
149            - name: RCLONE_CONFIG_OUT_S3_TYPE
150              value: ~
151            - name: RCLONE_CONFIG_OUT_S3_PROVIDER
152              value: ~
153            - name: RCLONE_CONFIG_OUT_S3_ENDPOINT
154              value: ~
155            - name: RCLONE_CONFIG_OUT_S3_REGION
156              value: ~
157            - name: RCLONE_CONFIG_OUT_S3_USE_MULTIPART_UPLOADS
158              value: ~
159            - name: K8S_NAMESPACE
160              valueFrom:
161                fieldRef:
162                  fieldPath: metadata.namespace
163            - name: RC_ACCESS_KEY_ID
164              valueFrom:
165                secretKeyRef:
166                  name: longhorn-rclone-bck-key-secret
167                  key: AWS_ACCESS_KEY_ID
168            - name: RC_ACCESS_KEY
169              valueFrom:
170                secretKeyRef:
171                  name: longhorn-rclone-bck-key-secret
172                  key: AWS_SECRET_ACCESS_KEY
173            - name: BACKEND_S3_ACCESS_KEY_ID
174              valueFrom:
175                secretKeyRef:
176                  name: longhorn-backup-hmac-key
177                  key: access_id
178            - name: BACKEND_S3_SECRET_ACCESS_KEY
179              valueFrom:
180                secretKeyRef:
181                  name: longhorn-backup-hmac-key
182                  key: secret

are:

Why We Generate the INFO.txt File

The INFO.txt file serves several important purposes:

Documentation: It provides human-readable information about the bucket configuration, encryption setup, and required credentials. This is invaluable when you need to restore or troubleshoot backups months or years later.
Accessible without encryption: Critically, INFO.txt is stored directly in the backend (out_s3), bypassing the encryption layer. This means it can be read without needing the encryption keys, making it a self-documenting backup location.
Verification: By uploading it to the backend, we verify that:
- The GCS backend connection works correctly
- Credentials are properly configured
- The bucket is accessible and writable
Recovery aid: If you ever lose your rclone configuration but still have the encryption passwords in a password manager, the INFO.txt file tells you exactly how the encryption was configured. This makes it possible to reconstruct the setup and recover your backups.
Idempotency check: The script checks for INFO.txt existence to determine if initialization has already been completed, preventing duplicate initialization runs.

Why We Create the Bucket with `rclone mkdir "local_s3:${BUCKET_NAME}"`

Creating the bucket through the local_s3: remote (the encrypted S3 endpoint) rather than directly on the backend has several advantages:

End-to-end testing: This verifies the entire encryption pipeline is working:
- The rclone serve s3 service is running and accessible
- Authentication is correctly configured
- The crypt layer is properly set up
- The underlying GCS backend is reachable
S3 API validation: It ensures that bucket creation operations work through the S3 API layer, which is exactly how Longhorn will interact with the system. If rclone mkdir succeeds through local_s3:, we know Longhorn's S3 operations will also work.
Consistent access path: By creating the bucket the same way Longhorn will access it (through the S3 API), we ensure there are no surprises or incompatibilities when Longhorn starts using the bucket.
Automatic bucket initialization: On GCS, when you create a "bucket" through rclone's S3 interface, it actually creates a folder/prefix in the specified GCS bucket (configured as <BUCKET-NAME-REDACTED> in the config). This happens automatically through the crypt layer.
Proper permissions verification: This confirms that the service account credentials (RC_ACCESS_KEY_ID/RC_ACCESS_KEY) have the necessary permissions to create buckets through the S3 interface.

RTO and RPO

With this setup, you can easily modify the RPO by adjusting the Longhorn backup schedule to fit your needs.

For RTO, it mainly depends on the pipeline execution time. The main time-consuming components are the Kubernetes cluster bootstrapping using KubeSpray and the FluxCD reconciliation process.

In my case, the total time to provision the entire infrastructure from scratch and reconcile the cluster state is around 1 hour, which works well for my use case and is also acceptable for most organizations.

To lower the RTO further, you could customize FluxCD timeouts, retry periods, and parallelism to achieve more aggressive reconciliation.

However, I'm planning to set up a disaster recovery site for even better redundancy. More on this in the "Next Steps" section.

Conclusion

This journey from a homelab data loss incident to a production-grade IaC setup taught me that:

Recovery Time Objectives aren't just for enterprises
GitOps principles significantly reduce operational burden in the long run

While the initial setup took about a month (including research, testing, and iteration), I can now rebuild my entire infrastructure in 1 hour. More importantly, I've eliminated the anxiety of "did I back that up?"—everything is code, versioned, and reproducible.

The total monthly cost (~€10 for GCS storage) is minimal compared to the value of reliable, reproducible infrastructure. If you're running a homelab, I encourage you to treat it like production—your future self will thank you.

Limitations

As you might have noticed, some components are not yet part of the automated rebuild process.

Infrastructure Assumptions:

Proxmox Host: The Proxmox hypervisor itself is treated as static infrastructure and is not recreated from scratch
DNS Records: External DNS records for services are assumed to be pre-configured
Network configuration: The underlying network infrastructure (VLANs, subnets, firewall rules) is not managed by the IaC pipeline

While this is acceptable for a homelab, an enterprise-grade setup should include these components in the IaC pipeline as well. For on-premises environments, this presents additional challenges:

Hypervisor bootstrapping requires out-of-band management (IPMI/iLO)
Network configuration can be scripted using OpenTofu, which leverages Proxmox Software Defined Networks
DNS automation depends on your DNS provider's API availability

These omissions mean a true "datacenter destroyed" scenario still requires some manual intervention. However, for the more common scenarios (VM corruption, cluster misconfiguration, accidental deletion), the current setup provides comprehensive protection.

Next Steps

While Google Cloud is a great option and my monthly cost is only ~€10, in the future I would like to explore Hetzner. The pricing is really competitive and they have an S3-compatible object storage service. They also have an OpenTofu provider.

Another area I would like to explore is leveraging Longhorn Disaster Recovery Volumes in conjunction with the FluxCD d2 architecture. This way, I might be able to create a recovery cluster in another location and have a more robust disaster recovery plan. I think that by using Hetzner with only the strictly necessary services, a single server might be sufficient to host the recovery cluster cheaply.

With this setup, it might be possible to achieve a very low RTO and RPO.

Finally, I hope that in the future Longhorn will natively support client-side encryption and a declarative way to restore volumes from offsite backups so that I can simplify the current setup.

This article is licensed under the CC BY-SA 4.0 license.

What to Expect from This Blog Post#

Recovery Time Breakdown#

Repositories Structure#

How They Work Together#

IaC Pipeline#

GCP Cloud Storage Bucket#

Proxmox VMs#

Kubernetes Cluster Bootstrapping#

Uploading Kubeconfig to Gitlab#

Sealed Secrets Deployment#

FluxCD Deployment#

Longhorn Volumes Restoration Process#

Encrypt Longhorn Backups Client-Side#

Why We Use rclone serve s3#

Understanding the rclone crypt Options#

Why We Generate the INFO.txt File#

Why We Create the Bucket with rclone mkdir "local_s3:${BUCKET_NAME}"#

RTO and RPO#

Conclusion#

Limitations#

Next Steps#