One-click homelab: integrating Gitlab, Proxmox and K8s with GitOps Principles

{homelab} {devops} :: #proxmox #kubernetes #gitlab #iac #kubespray #fluxcd #longhorn

One unlucky day, I destroyed my homelab while I was trying to upgrade various components. I had a backup of my virtual machines where my Kubernetes cluster was running, but unfortunately, I forgot to enable the backup of the secondary disk attached to each VM. This secondary disk was used to store all the persistent data of my cluster, including the Longhorn volumes. I thought that the backups were fine, but when I restored the VMs, I realized that the secondary disks were missing. As a result, I lost all my persistent data, and I had to start from scratch.

My homelab was at least 3/4 years old, and I had accumulated a lot of configurations, customizations, and data over time. I started it while approaching the DevOps world, so it didn't really follow best practices. Rebuilding everything from scratch was a daunting task, and I realized that I needed a better way to manage my homelab infrastructure.

It might sound incredible, but while it might be acceptable for a personal homelab, losing data due to a lack of proper backup strategies is a common issue in the industry as well. Some notable examples include:

South Korea's NIRS lost 858TB of government data due to not having an offsite backup
Gitlab lost 6 hours of data due to improper backup procedures and a lack of verification

Given the experience I gathered over the years, I decided to rebuild my homelab using Infrastructure as Code (IaC) principles. This time, I wanted to ensure that I could easily recreate my entire setup with just one click, without having to go through the tedious process of manual configuration. So my objective was not just having a backup of my infrastructure, but having a recovery time objective (RTO) of minutes.

This way, I can also brag about having a better infrastructure than most companies out there! 😄

What to Expect from This Blog Post

In this blog post you will find my personal solution to address the creation of a personal homelab following IaC principles. In particular, I will show you how I've successfully integrated open source tools with minimal cost and custom code to achieve a one-click deployment of my homelab infrastructure.

The solution has the following properties:

RTO of 1 hour
RPO of 1 day (can be customized)
Cost effective (less than 10$ / month in my case; depends on how much data you have to backup to the offsite location)
Privacy oriented (more later)
Fully automated using GitOps principles

The tech stack of the solution includes:

GitLab SaaS: I use it for simplicity, but you can also use your own GitLab instance at your discretion.
Proxmox: I use it in my homelab, but you can use any Hypervisor you prefer, the only important thing is that there is an OpenTofu provider for that.
OpenTofu: Necessary to create the various component of the solution. OpenTofu is the open-source fork of Terraform, which I use for infrastructure provisioning.
Ubuntu Cloud Images: I use Ubuntu Cloud Images as the base operating system for the VMs in my Proxmox cluster. These images are optimized for cloud environments and provide an automated way to provision VMs with Cloud-Init.
KubeSpray: I use KubeSpray to create the Kubernetes cluster on top of Proxmox VMs. KubeSpray is a popular open-source project that provides a set of Ansible playbooks for deploying and managing Kubernetes clusters.
FluxCD: I use FluxCD for GitOps management of the Kubernetes cluster. FluxCD is a popular open-source project that enables continuous delivery and GitOps for Kubernetes.
Sealed Secrets: I use Sealed Secrets for storing all the credentials directly on git. While for an homelab solution is more than enough, OpenBao might be a better solution for an Enterprise.
Longhorn: I use Longhorn as the storage solution for the Kubernetes cluster. Longhorn is a popular open-source project that provides a distributed block storage system for Kubernetes.
GCP Cloud Storage: In my solution I use GCP cloud storage for off-site backup. Please note that the solution I will provide ensures client-side encryption of data (so even Google will not be able to decipher your data). Additionally, you can use any NFS server or self-hosted S3 Object store solution like Garage.

Recovery Time Breakdown

Here's what happens during a full disaster recovery (1 hour RTO):

┌─────────────────────────────────────────────────────────────────┐
│ Full Infrastructure Recovery Timeline                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Parallel:                                                       │
│  GCP Bucket ████████                                            │
│  (4-5 min, ~8% - runs in parallel)                              │
│                                                                 │
│ Main Flow:                                                      │
│  0min                                                     60min │
│  ├─────────┬────────────────────────┬───────┬──────────────┤    │
│  │         │                        │       │              │    │
│  ▼         ▼                        ▼       ▼              ▼    │
│  VM       K8s                    Sealed  Flux          Flux     │
│  Prov.    Bootstrap              Secrets Deploy        Reconcile│
│                                                                 │
│  9-10     23                     1-2   3-4              20-25   │
│  min      min                    min   min              min     │
│  (~16%)   (~38%)                 (~2%) (~6%)            (~38%)  │
│                                                                 │
│  Total: ~60 minutes                                             │
└─────────────────────────────────────────────────────────────────┘

Phase Details:
├─ GCP Bucket (4-5 min): Create/verify backup storage [RUNS IN PARALLEL]
├─ VM Provisioning (9-10 min): Download Ubuntu Cloud Images, create VMs with Cloud-Init
├─ K8s Bootstrap (23 min): KubeSpray cluster deployment (long phase - ~38% of total time)
├─ Sealed Secrets (1-2 min): Deploy secret management controller
├─ FluxCD Deploy (3-4 min): Install GitOps operator and sync repositories
└─ FluxCD Reconcile (20-25 min): Complete reconciliation of cluster state from GitOps repos (long phase - ~38% of total time)

Note: The Kubernetes bootstrapping phase accounts for approximately 38% of the total RTO. The GCP bucket creation runs in parallel with VM provisioning, so it doesn't add to the overall recovery time. Also it might be possible to optimize the FluxCD reconciliation time by tweaking its configuration for more aggressive syncs.

Here's a time-lapse of the complete infrastructure deployment from scratch to a running cluster (~60 minutes compressed):

Repositories Structure

All the infrastructure code is organized in five repositories:

gitlab-runner repository: contains the OpenTofu code to create the Gitlab Runner on Proxmox as an LXC container.
IaC repository: contains all the code necessary to bootstrap the Proxmox VMs and the Kubernetes cluster using KubeSpray.
d2-fleet repository: defines the desired state of the Kubernetes clusters and tenants in the fleet.
d2-infra repository: defines the desired state of the cluster add-ons and the monitoring stack.
d2-apps repositories: defines the desired state of the applications deployed across environments.

We will examine the IaC repository in detail, the d2-* repositories simply apply the d2-reference-architecture provided by the FluxCD team, which, I must say, is very well thought out and implemented. 👍

The Gitlab Runner repository is also quite straightforward, as it only contains the OpenTofu code to create the LXC container and register the runner with Gitlab, so I will not cover it here.

How They Work Together

The deployment flow follows a clear progression:

gitlab-runner → Bootstraps the CI/CD infrastructure needed to run the automated pipelines
IaC → Handles the foundational layer: VMs, backup storage, Kubernetes cluster, and essential components (Sealed Secrets, FluxCD operator)
d2-fleet, d2-infra, d2-apps → Once the IaC pipeline completes, FluxCD takes over and continuously reconciles the cluster state based on these GitOps repositories

In essence, the IaC repository gets you from an empty Hypervisor to a GitOps-ready cluster, then the d2-repositories manage everything from that point forward. This separation means the IaC pipeline only needs to run for infrastructure changes and periodic disaster recovery tests, while the d2 repositories handle all day-to-day operations through FluxCD's automatic reconciliation.

IaC Pipeline

The high-level steps of provisioning the homelab infrastructure performed by the IaC pipeline are:

Creates the GCP Cloud Storage bucket for offsite backups.
Creates the Proxmox VMs using OpenTofu.
Bootstraps the Kubernetes cluster using KubeSpray.
Uploads the generated Kubeconfig file to GitLab as an artifact with restricted access.
Trigger the deployment of Sealed Secrets sub-pipeline.
Trigger the deployment of FluxCD sub-pipeline.
Longhorn will be deployed by FluxCD as part of the d2-infra repository.
Restore Jobs defined in the d2-infra repository will restore Longhorn volumes from the GCP Cloud Storage bucket.

GCP Cloud Storage Bucket

The only prerequisites to create this part of the pipeline are:

Create an OpenTofu service account with the necessary permissions to create and manage GCP Cloud Storage buckets.
Create a GitLab Personal Access Token with the necessary permissions to modify CI/CD variables in the GitLab project. Note: I had to use a PAT because I'm on the free tier of GitLab SaaS, which does not yet support Project Access Tokens. If you have a paid plan, you should use a Project Access Token instead.

The process the pipeline follows is:

If not already present, it creates the GCP Cloud Storage bucket using google_storage_bucket resource.

1 resource "google_storage_bucket" "longhorn_backup_bucket" {
2     name                        = var.gcp_backup_longhorn_bucket_name
3     location                    = var.gcp_backup_region
4     storage_class               = "NEARLINE"
5     uniform_bucket_level_access = true
6     public_access_prevention    = "enforced"
7     hierarchical_namespace {
8         enabled = true
9     }
10 }

It creates a longhorn_backup_service_account.
Assigns to the newly created service account the roles/storage.objectAdmin role for the created bucket.
Creates/Syncs a HMAC key for the service account.
The GitLab Pipeline will save the generated HMAC key as a CI/CD variable in the GitLab project. In my case, it will use the Access Token created in the pre-requisites step to do so.

This HMAC key is later injected as a secret into the Kubernetes cluster, allowing Longhorn to connect to the GCP Cloud Storage bucket.

Proxmox VMs

The OpenTofu provider used is bpg/proxmox.

The prerequisites to create this part of the pipeline are:

Create a Proxmox API token with the necessary permissions to create and manage VMs.
Create an SSH key pair and store the private key as a GitLab CI/CD variable. This key will be used by the on-premise GitLab Runner to connect to the Proxmox VMs.

The pipeline follows this process:

OpenTofu makes Proxmox download the Ubuntu Cloud Image using proxmox_virtual_environment_download_file resource.

1 resource "proxmox_virtual_environment_download_file" "ubuntu_24_noble_qcow2_img" {
2 content_type       = "iso"
3 datastore_id       = "nfs-nas-1"
4 node_name          = "pve2"
5 url                = var.proxmox_k8s_node_image_url
6 overwrite          = true
7 file_name          = var.proxmox_k8s_node_image_name
8 checksum           = var.proxmox_k8s_node_image_checksum
9 checksum_algorithm = var.proxmox_k8s_node_image_checksum_algorithm
10 }

OpenTofu creates the Cloud-Init configuration file for the worker and master nodes using proxmox_virtual_environment_file resource. The important configurations to add to the Cloud-Init file are:

SSH Keys: Inject both your personal public key and the GitLab Runner's public key
Package Management: Install necessary packages (like qemu-guest-agent)
Longhorn Prerequisites: Configure according to Longhorn documentation

Code snippet (worker nodes)

1 resource "proxmox_virtual_environment_file" "ubuntu_cloud_init_worker" {
2 content_type = "snippets"
3 datastore_id = "nfs-nas-1"
4 node_name    = "pve2"
5 overwrite    = true
6 source_raw {
7     data      = <<EOF
8 #cloud-config
9 users:
10 - default
11 - name: ${var.proxmox_k8s_node_username}
12     groups:
13     - sudo
14     shell: /bin/bash
15     ssh_authorized_keys:
16 %{~for key in var.proxmox_k8s_node_ssh_keys}
17     - ${key}
18 %{~endfor}
19     sudo: ALL=(ALL) NOPASSWD:ALL
20 
21 package_update: true
22 package_upgrade: true
23 packages:
24 - qemu-guest-agent
25 - nfs-common
26 
27 # Disk partitioning setup
28 disk_setup:
29 /dev/sdb:
30     table_type: gpt
31     layout: true
32     overwrite: false
33 
34 # Filesystem setup
35 fs_setup:
36 - label: data
37     filesystem: ext4
38     device: /dev/sdb1
39     partition: auto
40     overwrite: false
41 
42 # Mount configuration
43 mounts:
44 - [/dev/sdb1, /mnt/data, ext4, "defaults,nofail", "0", "2"]
45 
46 write_files:
47 - path: /etc/modules-load.d/dm_crypt.conf
48     content: |
49     dm_crypt
50     owner: root:root
51     permissions: '0644'
52 
53     runcmd:
54     - systemctl enable qemu-guest-agent
55     - systemctl start qemu-guest-agent
56     - systemctl stop multipathd.socket
57     - systemctl stop multipathd
58     - systemctl disable multipathd.socket
59     - systemctl disable multipathd
60     - systemctl mask multipathd
61     - systemctl mask multipathd.socket
62     - modprobe dm_crypt
63     - systemctl enable iscsid
64     - systemctl start iscsid
65     - echo "done" > /tmp/cloud-config.done
66 EOF
67       file_name = "ubuntu.cloud-config-worker.yaml"
68     }
69 }

OpenTofu generates the output kubespray_inventory using the templatefile function with inventory.tpl. You can customize the inventory.tpl to fit your needs following the KubeSpray documentation. Here's my version of the inventory.tpl file.

inventory.tpl

1 {
2     "all": {
3         "vars": {
4         "ansible_user": "${ansible_user}",
5         "ansible_become": true,
6         "calico_cni_name": "k8s-pod-network",
7         "nat_outgoing": true,
8         "nat_outgoing_ipv6": true,
9         "calico_pool_blocksize": 26,
10         "calico_network_backend": "vxlan",
11         "calico_vxlan_mode": "CrossSubnet",
12         "kube_proxy_strict_arp": true,
13         "kube_encrypt_secret_data": true,
14         "kubeconfig_localhost": true,
15         "artifacts_dir": "/output",
16         "etcd_deployment_type": "host",
17         "etcd_metrics_port": 2381,
18         "etcd_listen_metrics_urls": "http://0.0.0.0:2381",
19         "etcd_metrics_service_labels": {
20             "k8s-app": "etcd",
21             "app.kubernetes.io/managed-by": "kubespray",
22             "app": "kube-prometheus-stack-kube-etcd",
23             "release": "kube-prometheus-stack"
24         },
25         "kube_proxy_metrics_bind_address": "0.0.0.0:10249"
26         },
27         "children": {
28         "kube_control_plane": {
29             "hosts": {
30     %{ for name in master_nodes ~}
31             "${name}",
32     %{ endfor ~}
33             }
34         },
35         "etcd": {
36             "hosts": {
37     %{ for name in master_nodes ~}
38             "${name}",
39     %{ endfor ~}
40             }
41         },
42         "kube_node": {
43             "hosts": {
44     %{ for name in worker_nodes ~}
45             "${name}",
46     %{ endfor ~}
47             }
48         },
49         "k8s_cluster": {
50             "children": [
51             "kube_control_plane",
52             "kube_node"
53             ]
54         }
55         }
56     }
57 }

IMHO, the most important thing here is to limit Kubespray to install only the necessary components to have a minimal Kubernetes cluster ready for FluxCD deployment.

Saves the generated inventory file as a GitLab artifact for use in the next stage.

Kubernetes Cluster Bootstrapping

This part of the pipeline bootstraps the Kubernetes cluster using KubeSpray.

I think KubeSpray is the best solution for creating a production-ready Kubernetes cluster in an on-premise environment, but I also think it should be limited to only the necessary components for a working cluster. KubeSpray, which is based on Ansible, provides many options to install various components like CNI, Ingress controllers, and monitoring stacks. However, in my opinion, these components should be installed using a more mature GitOps tool like FluxCD.

The prerequisites for this part of the pipeline are:

A Personal Access Token with API scope to upload the kubeconfig file as a GitLab CI/CD variable. As already mentioned, if you have a paid plan, you should use a Project Access Token instead. Additionally, you can use the same token created for the GCP Bucket creation step.
The GitLab Runner needs to connect to the Proxmox VMs using SSH. The private key for the SSH connection should be stored as a GitLab CI/CD variable in the project, as already mentioned in the Proxmox VMs creation step.

The pipeline follows these steps:

1 apply-kubespray-production-home:
2   stage: deploy
3   image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/docker:cli
4   tags:
5     - mgmt-zone
6     - self-hosted
7   needs:
8     - job: opentofu-apply-production-home
9       artifacts: true
10   services:
11     - *dind
12   before_script:
13     - apk add --no-interactive jq
14   script:
15     - mkdir -p $CI_PROJECT_DIR/inventory
16     - mkdir -p output
17     - chmod 600 $RUNNER_SSH_PRIVATE_KEY_PATH
18     - jq -r '.kubespray_inventory.value' home-tf-output.json > $CI_PROJECT_DIR/inventory/inventory.json
19     - docker run --mount type=bind,source=$CI_PROJECT_DIR/inventory,dst=/inventory --mount type=bind,source=$RUNNER_SSH_PRIVATE_KEY_PATH,dst=/ssh/id --mount type=bind,source=$CI_PROJECT_DIR/output,dst=/output --rm quay.io/kubespray/kubespray:$KUBESPRAY_VERSION ansible-playbook -i /inventory/inventory.json --private-key /ssh/id  cluster.yml
20   environment:
21     name: production/home
22   artifacts:
23     when: on_success
24     access: developer
25     expire_in: "10 mins"
26     paths:
27       - output/**

This step leverages the official KubeSpray Docker image to run the Ansible playbooks against the Proxmox VMs created in the previous step. It then saves the generated kubeconfig file as a restricted-access artifact for use in subsequent pipeline stages.

Uploading Kubeconfig to Gitlab

This step uploads the generated kubeconfig file to GitLab as a CI/CD variable using the GitLab API directly.

1 .upload-secret-base64-encoded:
2   image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/alpine/curl:latest
3   script:
4     - |
5       set -e
6       DATA=$(base64 -w 0 ${DATA_FILE_PATH})
7       curl -s -f --request PUT \
8         --header "PRIVATE-TOKEN: ${ACCESS_TOKEN}" \
9         --header "Content-Type: application/json" \
10         --data "{\"variable_type\":\"file\",\"key\":\"${VAR_NAME}\",\"value\":\"$DATA\",\"hidden\":false,\"protected\":true,\"masked\":true,\"raw\":true,\"description\":\"\"}" \
11         "$CI_API_V4_URL/projects/$CI_PROJECT_ID/variables/${VAR_NAME}" > /dev/null 2>&1

This way, after the IaC pipeline finishes, the kubeconfig file will be available for administrators to download and use.

Sealed Secrets Deployment

This part of the pipeline creates the Sealed Secrets controller in the Kubernetes cluster.

The prerequisite is a certificate/key pair to be used by the Sealed Secrets controller. You can find the instructions here: Bring your own certificates.

The private key needs to be stored as a GitLab CI/CD variable, while the certificate can be stored directly in the IaC repository.

FluxCD Deployment

This part of the pipeline uses OpenTofu to deploy the FluxCD Operator in the Kubernetes cluster as described in the official documentation.

This step also creates the Kubernetes secret with the GCP HMAC key for Longhorn and the secret with the registry credentials to pull container images from private registries.

Longhorn Volumes Restoration Process

Unfortunately, at the time of writing, Longhorn does not support a declarative way to restore volumes from offsite backups (issue#5787).

To work around this limitation, I have created a simple Dockerized Python program that leverages the Longhorn API to restore the latest backup for defined volumes from offsite storage. You can find the repository here.

Simply define a Job in the d2-infra repository for each volume you want to restore. For example, here is the Job definition to restore the Authelia Longhorn volume:

1 apiVersion: batch/v1
2 kind: Job
3 metadata:
4   name: authelia-volume-restore
5   namespace: authelia
6 spec:
7   template:
8     spec:
9       containers:
10         - name: restore
11           image: your-registry/longhorn-backup-restore:latest
12           env:
13             - name: LONGHORN_URL
14               value: http://longhorn-frontend.longhorn-system.svc.cluster.local
15             - name: VOLUME_HANDLE
16               value: authelia-production-vol
17             - name: NUMBER_OF_REPLICAS
18               value: "3"
19             - name: LOG_LEVEL
20               value: INFO
21       restartPolicy: Never
22   backoffLimit: 3

Then you need to create the corresponding PV and PVC to use the restored volume in your application. These can be defined in the d2-infra repository, for example in the same file as the restore job.

You can find more information in the README of the repository.

Encrypt Longhorn Backups Client-Side

Unfortunately, at the time of writing, Longhorn does not support client-side encryption of backups natively (issue#5220).

A simple solution I found is to use rclone to encrypt the backups client-side before uploading them to the offsite backup location.

Simply declare a Deployment, a Service, some Secrets, and a one-time Job in the same namespace where Longhorn is installed.

The important bits in the

deployment manifest

1 apiVersion: apps/v1
2 kind: Deployment
3 metadata:
4   name: s3-rclone-longhorn-bck
5   labels:
6     app.kubernetes.io/name: s3-rclone-longhorn-bck
7 spec:
8   replicas: 1
9   selector:
10     matchLabels:
11       app.kubernetes.io/name: s3-rclone-longhorn-bck
12   template:
13     metadata:
14       labels:
15         app.kubernetes.io/name: s3-rclone-longhorn-bck
16     spec:
17       containers:
18         - name: rclone
19           image: ghcr.io/rclone/rclone:1.71.1
20           imagePullPolicy: IfNotPresent
21           command:
22             - "rclone"
23             - "serve"
24             - "s3"
25             - "--no-cleanup"
26             - "--auth-key"
27             - "$(RC_ACCESS_KEY_ID),$(RC_ACCESS_KEY)"
28             - "crypt_out_s3:"
29             - "--s3-force-path-style=true"
30             - "--addr=:8080"
31             - "--log-level=WARNING"
32           env:
33             - name: RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID
34               valueFrom:
35                 secretKeyRef:
36                   name: longhorn-backup-hmac-key
37                   key: access_id
38             - name: RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY
39               valueFrom:
40                 secretKeyRef:
41                   name: longhorn-backup-hmac-key
42                   key: secret
43             - name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD
44               valueFrom:
45                 secretKeyRef:
46                   name: rclone-secret
47                   key: password
48             # salt is used as password2
49             - name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD2
50               valueFrom:
51                 secretKeyRef:
52                   name: rclone-secret
53                   key: salt
54             - name: RC_ACCESS_KEY_ID
55               valueFrom:
56                 secretKeyRef:
57                   name: longhorn-rclone-bck-key-secret
58                   key: AWS_ACCESS_KEY_ID
59             - name: RC_ACCESS_KEY
60               valueFrom:
61                 secretKeyRef:
62                   name: longhorn-rclone-bck-key-secret
63                   key: AWS_SECRET_ACCESS_KEY
64           ports:
65             - name: http
66               containerPort: 8080
67               protocol: TCP
68           volumeMounts:
69             - name: config
70               mountPath: /root/.config/rclone
71       volumes:
72         - name: config
73           configMap:
74             name: s3-rclone-longhorn-bck-config
75             items:
76               - key: rclone.conf
77                 path: rclone.conf
78

and in the

rclone config

1 [out_s3]
2 type = s3
3 provider = GCS
4 endpoint = https://storage.googleapis.com
5 region = europe-west4
6 use_multipart_uploads = false
7 
8 [crypt_out_s3]
9 type = crypt
10 remote = out_s3:<BUCKET-NAME-REDACTED>
11 directory_name_encryption = false
12 filename_encryption = off

I used are

Why We Use `rclone serve s3`

rclone serve s3 is the key component that makes this solution work. It implements a basic S3-compatible server that exposes any rclone backend (in our case, the encrypted crypt_out_s3 remote) as an S3 endpoint.

This is essential because:

S3 Gateway: Longhorn expects to communicate with an S3-compatible storage backend for backups. By using rclone serve s3, we provide Longhorn with a standard S3 interface.
Encryption layer: By pointing the S3 server at the crypt_out_s3 remote, all data is transparently encrypted/decrypted as it passes through rclone
No Longhorn modifications: This approach requires zero changes to Longhorn itself—it just sees a standard S3 endpoint
Authentication: The --auth-key parameter allows us to secure the S3 endpoint with credentials that Longhorn can use

The command essentially creates an S3 gateway that sits between Longhorn and the actual storage backend, handling all encryption/decryption automatically.

Understanding the rclone crypt Options

The crypt remote configuration uses two important settings:

directory_name_encryption = false

This keeps directory names in plaintext (unencrypted). While this reduces security slightly, it has practical benefits:

Easier to navigate the bucket structure directly if needed
Simpler debugging—you can see the folder structure at a glance

The actual file data is still fully encrypted, so the main content security is preserved.

filename_encryption = off

With this setting, files only get a .bin extension added instead of having their filenames encrypted. This provides several advantages:

Shorter encrypted filenames, avoiding path length limits on some cloud storage providers
Easier to identify files when accessing the backend directly

Security trade-off: This setting trades some security for practicality. If you need maximum security, you could use standard encryption, which encrypts filenames completely.

Both options make the encrypted remote more manageable and reduce the risk of hitting storage provider limitations while keeping the actual file content fully encrypted.

Next, create the corresponding Service and a simple Job to initialize the backup bucket.

The important bits in the

job manifest

1 apiVersion: batch/v1
2 kind: Job
3 metadata:
4   name: s3-rclone-longhorn-bck-init
5   labels:
6     app.kubernetes.io/name: s3-rclone-longhorn-bck
7     app.kubernetes.io/component: init
8 spec:
9   backoffLimit: 3
10   template:
11     metadata:
12       labels:
13         app.kubernetes.io/name: s3-rclone-longhorn-bck
14         app.kubernetes.io/component: init
15     spec:
16       restartPolicy: OnFailure
17       containers:
18         - name: rclone-init
19           image: ghcr.io/rclone/rclone:1.71.1
20           imagePullPolicy: IfNotPresent
21           command:
22             - "/bin/sh"
23             - "-c"
24             - |
25               set -e
26 
27               echo "Waiting for rclone S3 service to be ready..."
28               # limit to 10 minutes
29               COUNTER=0
30               until nc -z s3-rclone-longhorn-bck 8080 2>/dev/null; do
31                 echo "Waiting for service..."
32                 sleep 10
33                 COUNTER=`expr $COUNTER + 1`
34                 if [ $COUNTER -ge 60 ]; then
35                   echo "Timeout waiting for service after 10 minutes"
36                   exit 1
37                 fi
38               done
39               echo "Service is ready!"
40               COUNTER=0
41 
42               echo "Using bucket name: $${BUCKET_NAME}"
43 
44               # Set up the remote for the local rclone S3 service (encrypted)
45               export RCLONE_CONFIG_LOCAL_S3_TYPE=s3
46               export RCLONE_CONFIG_LOCAL_S3_PROVIDER=Other
47               export RCLONE_CONFIG_LOCAL_S3_ENDPOINT=http://s3-rclone-longhorn-bck:8080
48               export RCLONE_CONFIG_LOCAL_S3_ACCESS_KEY_ID="$${RC_ACCESS_KEY_ID}"
49               export RCLONE_CONFIG_LOCAL_S3_SECRET_ACCESS_KEY="$${RC_ACCESS_KEY}"
50               export RCLONE_CONFIG_LOCAL_S3_FORCE_PATH_STYLE=true
51 
52               export RCLONE_CONFIG_OUT_S3_USE_MULTIPART_UPLOADS=false
53               export RCLONE_CONFIG_OUT_S3_NO_CHECK_BUCKET=true
54               export RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID="$${BACKEND_S3_ACCESS_KEY_ID}"
55               export RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY="$${BACKEND_S3_SECRET_ACCESS_KEY}"
56 
57               # Check if INFO.txt already exists in the backend (unencrypted)
58               echo "Checking if INFO.txt already exists in backend..."
59               if rclone lsf "out_s3:$${BUCKET_NAME}/INFO.txt" 2>/dev/null | grep -q "INFO.txt"; then
60                 echo "INFO.txt already exists in backend - initialization already complete"
61                 echo "Bucket is ready for Longhorn backups"
62                 exit 0
63               fi
64 
65               # Check if bucket exists (via encrypted endpoint)
66               echo "Checking if bucket exists..."
67               if rclone lsd local_s3: 2>/dev/null | grep -q "$${BUCKET_NAME}"; then
68                 echo "Bucket '$${BUCKET_NAME}' already exists"
69                 
70                 # Check if bucket contains any encrypted data
71                 echo "Checking bucket contents (encrypted view)..."
72                 FILE_COUNT=$$(rclone ls "local_s3:$${BUCKET_NAME}/" 2>/dev/null | wc -l)
73                 
74                 if [ "$$FILE_COUNT" -gt 0 ]; then
75                   echo "Bucket contains $${FILE_COUNT} encrypted file(s)"
76                   echo "Listing existing files:"
77                   rclone ls "local_s3:$${BUCKET_NAME}/" --max-depth 1
78                 fi
79               else
80                 echo "Creating new bucket: $${BUCKET_NAME}"
81                 rclone mkdir "local_s3:$${BUCKET_NAME}" --log-level=INFO
82                 echo "Bucket created successfully"
83               fi
84 
85               echo "Listing all buckets..."
86               rclone lsd local_s3: --log-level=INFO
87 
88               echo "Generating INFO.txt file..."
89               TIMESTAMP=$$(date -u +"%Y-%m-%d %H:%M:%S UTC")
90               HOSTNAME=$$(hostname)
91 
92               cat > /tmp/INFO.txt <<EOF
93               ============================================
94               Longhorn Backup Bucket Information
95               ============================================
96 
97               Bucket Name: $${BUCKET_NAME}
98               Created: $${TIMESTAMP}
99               Created By: $${HOSTNAME}
100 
101               Configuration:
102               - Service: s3-rclone-longhorn-bck
103               - Endpoint: http://s3-rclone-longhorn-bck:8080
104               - Encryption: Enabled (rclone crypt)
105               - Backend: Google Cloud Storage (GCS)
106               - Region: europe-west4
107 
108               Rclone Configuration:
109               - Remote: crypt_out_s3
110               - Base Remote: out_s3
111               - Encryption: Standard encryption with password and salt
112               - Directory Name Encryption: Disabled
113               - Filename Encryption: Disabled
114 
115               Environment:
116               - Kubernetes Namespace: $${K8S_NAMESPACE}
117               - Init Job: s3-rclone-longhorn-bck-init
118 
119               Notes:
120               - All data stored in this bucket is encrypted using rclone crypt
121               - Access requires proper HMAC credentials (stored in secrets)
122               - Encryption password and salt are required for decryption
123               - INFO.txt is stored UNENCRYPTED for easy access
124 
125               Secrets Used:
126               - longhorn-backup-hmac-key: GCS HMAC credentials
127               - rclone-secret: Encryption password and salt
128               - longhorn-rclone-bck-key-secret: API access credentials
129 
130               ============================================
131               EOF
132 
133               echo "Uploading INFO.txt to backend GCS (UNENCRYPTED)..."
134               rclone copy /tmp/INFO.txt "out_s3:$${BUCKET_NAME}/" --log-level=INFO --s3-no-check-bucket
135 
136               echo "Verifying upload..."
137               echo "Files in encrypted view:"
138               rclone ls "local_s3:$${BUCKET_NAME}/" --log-level=INFO
139               echo ""
140               echo "Files in unencrypted backend:"
141               rclone ls "out_s3:$${BUCKET_NAME}/" --max-depth 1 --log-level=INFO
142 
143               echo "Initialization complete!"
144               echo "Bucket '$${BUCKET_NAME}' is ready for Longhorn backups"
145               echo "INFO.txt is available unencrypted in the backend storage"
146           env:
147             - name: BUCKET_NAME
148               value: ~
149             - name: RCLONE_CONFIG_OUT_S3_TYPE
150               value: ~
151             - name: RCLONE_CONFIG_OUT_S3_PROVIDER
152               value: ~
153             - name: RCLONE_CONFIG_OUT_S3_ENDPOINT
154               value: ~
155             - name: RCLONE_CONFIG_OUT_S3_REGION
156               value: ~
157             - name: RCLONE_CONFIG_OUT_S3_USE_MULTIPART_UPLOADS
158               value: ~
159             - name: K8S_NAMESPACE
160               valueFrom:
161                 fieldRef:
162                   fieldPath: metadata.namespace
163             - name: RC_ACCESS_KEY_ID
164               valueFrom:
165                 secretKeyRef:
166                   name: longhorn-rclone-bck-key-secret
167                   key: AWS_ACCESS_KEY_ID
168             - name: RC_ACCESS_KEY
169               valueFrom:
170                 secretKeyRef:
171                   name: longhorn-rclone-bck-key-secret
172                   key: AWS_SECRET_ACCESS_KEY
173             - name: BACKEND_S3_ACCESS_KEY_ID
174               valueFrom:
175                 secretKeyRef:
176                   name: longhorn-backup-hmac-key
177                   key: access_id
178             - name: BACKEND_S3_SECRET_ACCESS_KEY
179               valueFrom:
180                 secretKeyRef:
181                   name: longhorn-backup-hmac-key
182                   key: secret

are:

Why We Generate the INFO.txt File

The INFO.txt file serves several important purposes:

Documentation: It provides human-readable information about the bucket configuration, encryption setup, and required credentials. This is invaluable when you need to restore or troubleshoot backups months or years later.
Accessible without encryption: Critically, INFO.txt is stored directly in the backend (out_s3), bypassing the encryption layer. This means it can be read without needing the encryption keys, making it a self-documenting backup location.
Verification: By uploading it to the backend, we verify that:
- The GCS backend connection works correctly
- Credentials are properly configured
- The bucket is accessible and writable
Recovery aid: If you ever lose your rclone configuration but still have the encryption passwords in a password manager, the INFO.txt file tells you exactly how the encryption was configured. This makes it possible to reconstruct the setup and recover your backups.
Idempotency check: The script checks for INFO.txt existence to determine if initialization has already been completed, preventing duplicate initialization runs.

Why We Create the Bucket with `rclone mkdir "local_s3:${BUCKET_NAME}"`

Creating the bucket through the local_s3: remote (the encrypted S3 endpoint) rather than directly on the backend has several advantages:

End-to-end testing: This verifies the entire encryption pipeline is working:
- The rclone serve s3 service is running and accessible
- Authentication is correctly configured
- The crypt layer is properly set up
- The underlying GCS backend is reachable
S3 API validation: It ensures that bucket creation operations work through the S3 API layer, which is exactly how Longhorn will interact with the system. If rclone mkdir succeeds through local_s3:, we know Longhorn's S3 operations will also work.
Consistent access path: By creating the bucket the same way Longhorn will access it (through the S3 API), we ensure there are no surprises or incompatibilities when Longhorn starts using the bucket.
Automatic bucket initialization: On GCS, when you create a "bucket" through rclone's S3 interface, it actually creates a folder/prefix in the specified GCS bucket (configured as <BUCKET-NAME-REDACTED> in the config). This happens automatically through the crypt layer.
Proper permissions verification: This confirms that the service account credentials (RC_ACCESS_KEY_ID/RC_ACCESS_KEY) have the necessary permissions to create buckets through the S3 interface.

RTO and RPO

With this setup, you can easily modify the RPO by adjusting the Longhorn backup schedule to fit your needs.

For RTO, it mainly depends on the pipeline execution time. The main time-consuming components are the Kubernetes cluster bootstrapping using KubeSpray and the FluxCD reconciliation process.

In my case, the total time to provision the entire infrastructure from scratch and reconcile the cluster state is around 1 hour, which works well for my use case and is also acceptable for most organizations.

To lower the RTO further, you could customize FluxCD timeouts, retry periods, and parallelism to achieve more aggressive reconciliation.

However, I'm planning to set up a disaster recovery site for even better redundancy. More on this in the "Next Steps" section.

Conclusion

This journey from a homelab data loss incident to a production-grade IaC setup taught me that:

Recovery Time Objectives aren't just for enterprises
GitOps principles significantly reduce operational burden in the long run

While the initial setup took about a month (including research, testing, and iteration), I can now rebuild my entire infrastructure in 1 hour. More importantly, I've eliminated the anxiety of "did I back that up?"—everything is code, versioned, and reproducible.

The total monthly cost (~€10 for GCS storage) is minimal compared to the value of reliable, reproducible infrastructure. If you're running a homelab, I encourage you to treat it like production—your future self will thank you.

Limitations

As you might have noticed, some components are not yet part of the automated rebuild process.

Infrastructure Assumptions:

Proxmox Host: The Proxmox hypervisor itself is treated as static infrastructure and is not recreated from scratch
DNS Records: External DNS records for services are assumed to be pre-configured
Network configuration: The underlying network infrastructure (VLANs, subnets, firewall rules) is not managed by the IaC pipeline

While this is acceptable for a homelab, an enterprise-grade setup should include these components in the IaC pipeline as well. For on-premises environments, this presents additional challenges:

Hypervisor bootstrapping requires out-of-band management (IPMI/iLO)
Network configuration can be scripted using OpenTofu, which leverages Proxmox Software Defined Networks
DNS automation depends on your DNS provider's API availability

These omissions mean a true "datacenter destroyed" scenario still requires some manual intervention. However, for the more common scenarios (VM corruption, cluster misconfiguration, accidental deletion), the current setup provides comprehensive protection.

Next Steps

While Google Cloud is a great option and my monthly cost is only ~€10, in the future I would like to explore Hetzner. The pricing is really competitive and they have an S3-compatible object storage service. They also have an OpenTofu provider.

Another area I would like to explore is leveraging Longhorn Disaster Recovery Volumes in conjunction with the FluxCD d2 architecture. This way, I might be able to create a recovery cluster in another location and have a more robust disaster recovery plan. I think that by using Hetzner with only the strictly necessary services, a single server might be sufficient to host the recovery cluster cheaply.

With this setup, it might be possible to achieve a very low RTO and RPO.

Finally, I hope that in the future Longhorn will natively support client-side encryption and a declarative way to restore volumes from offsite backups so that I can simplify the current setup.

This article is licensed under the CC BY-SA 4.0 license.

1	resource "google_storage_bucket" "longhorn_backup_bucket" {
2	name = var.gcp_backup_longhorn_bucket_name
3	location = var.gcp_backup_region
4	storage_class = "NEARLINE"
5	uniform_bucket_level_access = true
6	public_access_prevention = "enforced"
7	hierarchical_namespace {
8	enabled = true
9	}
10	}

1	resource "proxmox_virtual_environment_download_file" "ubuntu_24_noble_qcow2_img" {
2	content_type = "iso"
3	datastore_id = "nfs-nas-1"
4	node_name = "pve2"
5	url = var.proxmox_k8s_node_image_url
6	overwrite = true
7	file_name = var.proxmox_k8s_node_image_name
8	checksum = var.proxmox_k8s_node_image_checksum
9	checksum_algorithm = var.proxmox_k8s_node_image_checksum_algorithm
10	}

1	resource "proxmox_virtual_environment_file" "ubuntu_cloud_init_worker" {
2	content_type = "snippets"
3	datastore_id = "nfs-nas-1"
4	node_name = "pve2"
5	overwrite = true
6	source_raw {
7	data = <<EOF
8	#cloud-config
9	users:
10	- default
11	- name: ${var.proxmox_k8s_node_username}
12	groups:
13	- sudo
14	shell: /bin/bash
15	ssh_authorized_keys:
16	%{~for key in var.proxmox_k8s_node_ssh_keys}
17	- ${key}
18	%{~endfor}
19	sudo: ALL=(ALL) NOPASSWD:ALL
20
21	package_update: true
22	package_upgrade: true
23	packages:
24	- qemu-guest-agent
25	- nfs-common
26
27	# Disk partitioning setup
28	disk_setup:
29	/dev/sdb:
30	table_type: gpt
31	layout: true
32	overwrite: false
33
34	# Filesystem setup
35	fs_setup:
36	- label: data
37	filesystem: ext4
38	device: /dev/sdb1
39	partition: auto
40	overwrite: false
41
42	# Mount configuration
43	mounts:
44	- [/dev/sdb1, /mnt/data, ext4, "defaults,nofail", "0", "2"]
45
46	write_files:
47	- path: /etc/modules-load.d/dm_crypt.conf
48	content: \|
49	dm_crypt
50	owner: root:root
51	permissions: '0644'
52
53	runcmd:
54	- systemctl enable qemu-guest-agent
55	- systemctl start qemu-guest-agent
56	- systemctl stop multipathd.socket
57	- systemctl stop multipathd
58	- systemctl disable multipathd.socket
59	- systemctl disable multipathd
60	- systemctl mask multipathd
61	- systemctl mask multipathd.socket
62	- modprobe dm_crypt
63	- systemctl enable iscsid
64	- systemctl start iscsid
65	- echo "done" > /tmp/cloud-config.done
66	EOF
67	file_name = "ubuntu.cloud-config-worker.yaml"
68	}
69	}

1	{
2	"all": {
3	"vars": {
4	"ansible_user": "${ansible_user}",
5	"ansible_become": true,
6	"calico_cni_name": "k8s-pod-network",
7	"nat_outgoing": true,
8	"nat_outgoing_ipv6": true,
9	"calico_pool_blocksize": 26,
10	"calico_network_backend": "vxlan",
11	"calico_vxlan_mode": "CrossSubnet",
12	"kube_proxy_strict_arp": true,
13	"kube_encrypt_secret_data": true,
14	"kubeconfig_localhost": true,
15	"artifacts_dir": "/output",
16	"etcd_deployment_type": "host",
17	"etcd_metrics_port": 2381,
18	"etcd_listen_metrics_urls": "http://0.0.0.0:2381",
19	"etcd_metrics_service_labels": {
20	"k8s-app": "etcd",
21	"app.kubernetes.io/managed-by": "kubespray",
22	"app": "kube-prometheus-stack-kube-etcd",
23	"release": "kube-prometheus-stack"
24	},
25	"kube_proxy_metrics_bind_address": "0.0.0.0:10249"
26	},
27	"children": {
28	"kube_control_plane": {
29	"hosts": {
30	%{ for name in master_nodes ~}
31	"${name}",
32	%{ endfor ~}
33	}
34	},
35	"etcd": {
36	"hosts": {
37	%{ for name in master_nodes ~}
38	"${name}",
39	%{ endfor ~}
40	}
41	},
42	"kube_node": {
43	"hosts": {
44	%{ for name in worker_nodes ~}
45	"${name}",
46	%{ endfor ~}
47	}
48	},
49	"k8s_cluster": {
50	"children": [
51	"kube_control_plane",
52	"kube_node"
53	]
54	}
55	}
56	}
57	}

1	apply-kubespray-production-home:
2	stage: deploy
3	image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/docker:cli
4	tags:
5	- mgmt-zone
6	- self-hosted
7	needs:
8	- job: opentofu-apply-production-home
9	artifacts: true
10	services:
11	- *dind
12	before_script:
13	- apk add --no-interactive jq
14	script:
15	- mkdir -p $CI_PROJECT_DIR/inventory
16	- mkdir -p output
17	- chmod 600 $RUNNER_SSH_PRIVATE_KEY_PATH
18	- jq -r '.kubespray_inventory.value' home-tf-output.json > $CI_PROJECT_DIR/inventory/inventory.json
19	- docker run --mount type=bind,source=$CI_PROJECT_DIR/inventory,dst=/inventory --mount type=bind,source=$RUNNER_SSH_PRIVATE_KEY_PATH,dst=/ssh/id --mount type=bind,source=$CI_PROJECT_DIR/output,dst=/output --rm quay.io/kubespray/kubespray:$KUBESPRAY_VERSION ansible-playbook -i /inventory/inventory.json --private-key /ssh/id cluster.yml
20	environment:
21	name: production/home
22	artifacts:
23	when: on_success
24	access: developer
25	expire_in: "10 mins"
26	paths:
27	- output/**

1	.upload-secret-base64-encoded:
2	image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/alpine/curl:latest
3	script:
4	- \|
5	set -e
6	DATA=$(base64 -w 0 ${DATA_FILE_PATH})
7	curl -s -f --request PUT \
8	--header "PRIVATE-TOKEN: ${ACCESS_TOKEN}" \
9	--header "Content-Type: application/json" \
10	--data "{\"variable_type\":\"file\",\"key\":\"${VAR_NAME}\",\"value\":\"$DATA\",\"hidden\":false,\"protected\":true,\"masked\":true,\"raw\":true,\"description\":\"\"}" \
11	"$CI_API_V4_URL/projects/$CI_PROJECT_ID/variables/${VAR_NAME}" > /dev/null 2>&1

1	apiVersion: batch/v1
2	kind: Job
3	metadata:
4	name: authelia-volume-restore
5	namespace: authelia
6	spec:
7	template:
8	spec:
9	containers:
10	- name: restore
11	image: your-registry/longhorn-backup-restore:latest
12	env:
13	- name: LONGHORN_URL
14	value: http://longhorn-frontend.longhorn-system.svc.cluster.local
15	- name: VOLUME_HANDLE
16	value: authelia-production-vol
17	- name: NUMBER_OF_REPLICAS
18	value: "3"
19	- name: LOG_LEVEL
20	value: INFO
21	restartPolicy: Never
22	backoffLimit: 3

1	apiVersion: apps/v1
2	kind: Deployment
3	metadata:
4	name: s3-rclone-longhorn-bck
5	labels:
6	app.kubernetes.io/name: s3-rclone-longhorn-bck
7	spec:
8	replicas: 1
9	selector:
10	matchLabels:
11	app.kubernetes.io/name: s3-rclone-longhorn-bck
12	template:
13	metadata:
14	labels:
15	app.kubernetes.io/name: s3-rclone-longhorn-bck
16	spec:
17	containers:
18	- name: rclone
19	image: ghcr.io/rclone/rclone:1.71.1
20	imagePullPolicy: IfNotPresent
21	command:
22	- "rclone"
23	- "serve"
24	- "s3"
25	- "--no-cleanup"
26	- "--auth-key"
27	- "$(RC_ACCESS_KEY_ID),$(RC_ACCESS_KEY)"
28	- "crypt_out_s3:"
29	- "--s3-force-path-style=true"
30	- "--addr=:8080"
31	- "--log-level=WARNING"
32	env:
33	- name: RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID
34	valueFrom:
35	secretKeyRef:
36	name: longhorn-backup-hmac-key
37	key: access_id
38	- name: RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY
39	valueFrom:
40	secretKeyRef:
41	name: longhorn-backup-hmac-key
42	key: secret
43	- name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD
44	valueFrom:
45	secretKeyRef:
46	name: rclone-secret
47	key: password
48	# salt is used as password2
49	- name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD2
50	valueFrom:
51	secretKeyRef:
52	name: rclone-secret
53	key: salt
54	- name: RC_ACCESS_KEY_ID
55	valueFrom:
56	secretKeyRef:
57	name: longhorn-rclone-bck-key-secret
58	key: AWS_ACCESS_KEY_ID
59	- name: RC_ACCESS_KEY
60	valueFrom:
61	secretKeyRef:
62	name: longhorn-rclone-bck-key-secret
63	key: AWS_SECRET_ACCESS_KEY
64	ports:
65	- name: http
66	containerPort: 8080
67	protocol: TCP
68	volumeMounts:
69	- name: config
70	mountPath: /root/.config/rclone
71	volumes:
72	- name: config
73	configMap:
74	name: s3-rclone-longhorn-bck-config
75	items:
76	- key: rclone.conf
77	path: rclone.conf
78

1	[out_s3]
2	type = s3
3	provider = GCS
4	endpoint = https://storage.googleapis.com
5	region = europe-west4
6	use_multipart_uploads = false
7
8	[crypt_out_s3]
9	type = crypt
10	remote = out_s3:<BUCKET-NAME-REDACTED>
11	directory_name_encryption = false
12	filename_encryption = off

What to Expect from This Blog Post#

Recovery Time Breakdown#

Repositories Structure#

How They Work Together#

IaC Pipeline#

GCP Cloud Storage Bucket#

Proxmox VMs#

Kubernetes Cluster Bootstrapping#

Uploading Kubeconfig to Gitlab#

Sealed Secrets Deployment#

FluxCD Deployment#

Longhorn Volumes Restoration Process#

Encrypt Longhorn Backups Client-Side#

Why We Use rclone serve s3#

Understanding the rclone crypt Options#

Why We Generate the INFO.txt File#

Why We Create the Bucket with rclone mkdir "local_s3:${BUCKET_NAME}"#

RTO and RPO#

Conclusion#

Limitations#

Next Steps#