One-click homelab: integrating GitLab, Proxmox and K8s with GitOps Principles
One unlucky day, I destroyed my homelab while I was trying to upgrade various components. I had a backup of my virtual machines where my Kubernetes cluster was running, but unfortunately, I forgot to enable the backup of the secondary disk attached to each VM. This secondary disk was used to store all the persistent data of my cluster, including the Longhorn volumes. I thought that the backups were fine, but when I restored the VMs, I realized that the secondary disks were missing. As a result, I lost all my persistent data, and I had to start from scratch.
My homelab was three to four years old, and I had accumulated a lot of configurations, customizations, and data over time. I started it while approaching the DevOps world, so it didn't really follow best practices. Rebuilding everything from scratch was a daunting task, and I realized that I needed a better way to manage my homelab infrastructure.
It might sound incredible, but losing data due to a lack of proper backup strategies is not just a homelab problem; it is a common issue in the industry as well. Some notable examples include:
- South Korea's NIRS lost 858TB of government data due to not having an offsite backup
- GitLab lost 6 hours of data due to improper backup procedures and a lack of verification
Given the experience I gathered over the years, I decided to rebuild my homelab using Infrastructure as Code (IaC) principles. This time, I wanted to ensure that I could easily recreate my entire setup with just one click, without having to go through the tedious process of manual configuration. So my objective was not just having a backup of my infrastructure, but having a recovery time objective (RTO) of minutes.
This way, I can also brag about having a better infrastructure than most companies out there! 🙂
What to Expect from This Blog Post
In this blog post you will find my solution for building a personal homelab following IaC principles. In particular, I will show you how I've successfully integrated open source tools with minimal cost and custom code to achieve a one-click deployment of my homelab infrastructure.
The solution has the following properties:
- RTO of 1 hour
- RPO of 1 day (can be customized)
- Cost effective (less than $10/month in my case; depends on how much data you need to back up to the offsite location)
- Privacy oriented (more on this later)
- Fully automated using GitOps principles
The tech stack of the solution includes:
- GitLab SaaS: I use it for simplicity, but you can also use your own GitLab instance at your discretion.
- Proxmox: I use it in my homelab, but you can use any hypervisor you prefer; the only important thing is that an OpenTofu provider exists for it.
- OpenTofu: Necessary to create the various components of the solution. OpenTofu is the open-source fork of Terraform, which I use for infrastructure provisioning.
- Ubuntu Cloud Images: I use Ubuntu Cloud Images as the base operating system for the VMs in my Proxmox cluster. These images are optimized for cloud environments and provide an automated way to provision VMs with Cloud-Init.
- KubeSpray: I use KubeSpray to create the Kubernetes cluster on top of Proxmox VMs. KubeSpray is a popular open-source project that provides a set of Ansible playbooks for deploying and managing Kubernetes clusters.
- FluxCD: I use FluxCD for GitOps management of the Kubernetes cluster. FluxCD is a popular open-source project that enables continuous delivery and GitOps for Kubernetes.
- Sealed Secrets: I use Sealed Secrets to store all the credentials directly in git. While this is more than enough for a homelab, OpenBao might be a better fit for an enterprise.
- Longhorn: I use Longhorn as the storage solution for the Kubernetes cluster. Longhorn is a popular open-source project that provides a distributed block storage system for Kubernetes.
- GCP Cloud Storage: In my solution I use GCP cloud storage for off-site backup. Please note that the solution I will provide ensures client-side encryption of data (so even Google will not be able to decipher your data). Additionally, you can use any NFS server or self-hosted S3 Object store solution like Garage.
Recovery Time Breakdown
Here's what happens during a full disaster recovery (1 hour RTO):
```
┌───────────────────────────────────────────────────────────────────┐
│               Full Infrastructure Recovery Timeline               │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  Parallel:                                                        │
│    GCP Bucket ────────                                            │
│    (4-5 min, ~8% - runs in parallel)                              │
│                                                                   │
│  Main Flow:                                                       │
│  0min                                                      60min  │
│  ├─────────┬────────────────────────┬────────┬───────┬─────────┤  │
│  │         │                        │        │       │            │
│  ▼         ▼                        ▼        ▼       ▼            │
│  VM        K8s                      Sealed   Flux    Flux         │
│  Prov.     Bootstrap                Secrets  Deploy  Reconcile    │
│                                                                   │
│  9-10      23                       1-2      3-4     20-25        │
│  min       min                      min      min     min          │
│  (~16%)    (~38%)                   (~2%)    (~6%)   (~38%)       │
│                                                                   │
│  Total: ~60 minutes                                               │
└───────────────────────────────────────────────────────────────────┘
```

Phase Details:

```
├─ GCP Bucket (4-5 min): Create/verify backup storage [RUNS IN PARALLEL]
├─ VM Provisioning (9-10 min): Download Ubuntu Cloud Images, create VMs with Cloud-Init
├─ K8s Bootstrap (23 min): KubeSpray cluster deployment (long phase - ~38% of total time)
├─ Sealed Secrets (1-2 min): Deploy secret management controller
├─ FluxCD Deploy (3-4 min): Install GitOps operator and sync repositories
└─ FluxCD Reconcile (20-25 min): Complete reconciliation of cluster state from GitOps repos (long phase - ~38% of total time)
```
Note: The Kubernetes bootstrapping phase accounts for approximately 38% of the total RTO. The GCP bucket creation runs in parallel with VM provisioning, so it doesn't add to the overall recovery time. Also it might be possible to optimize the FluxCD reconciliation time by tweaking its configuration for more aggressive syncs.
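As a sketch of what such tuning could look like, a Flux Kustomization can be given a shorter interval, a quick retry interval, and a bounded timeout. The name, path, and values below are illustrative placeholders, not taken from my repositories:

```yaml
# Hypothetical example: more aggressive reconciliation for a Flux Kustomization.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infra-addons
  namespace: flux-system
spec:
  interval: 5m        # re-check desired state more often
  retryInterval: 1m   # retry failed applies quickly during recovery
  timeout: 3m         # fail fast instead of hanging on a stuck apply
  wait: true          # block until resources are ready, surfacing problems early
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./infrastructure
```

More aggressive intervals trade extra load on the source controller for faster convergence during a disaster recovery run.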
Here's a time-lapse of the complete infrastructure deployment from scratch to a running cluster (~60 minutes compressed):
Repositories Structure
All the infrastructure code is organized in five repositories:
- gitlab-runner repository: contains the OpenTofu code to create the Gitlab Runner on Proxmox as an LXC container.
- IaC repository: contains all the code necessary to bootstrap the Proxmox VMs and the Kubernetes cluster using KubeSpray.
- d2-fleet repository: defines the desired state of the Kubernetes clusters and tenants in the fleet.
- d2-infra repository: defines the desired state of the cluster add-ons and the monitoring stack.
- d2-apps repositories: defines the desired state of the applications deployed across environments.
We will examine the IaC repository in detail; the d2-* repositories simply apply the d2-reference-architecture provided by the FluxCD team, which, I must say, is very well thought out and implemented. 🙂
The GitLab Runner repository is also quite straightforward, as it only contains the OpenTofu code to create the LXC container and register the runner with GitLab, so I will not cover it here.
How They Work Together
The deployment flow follows a clear progression:
- gitlab-runner β Bootstraps the CI/CD infrastructure needed to run the automated pipelines
- IaC β Handles the foundational layer: VMs, backup storage, Kubernetes cluster, and essential components (Sealed Secrets, FluxCD operator)
- d2-fleet, d2-infra, d2-apps β Once the IaC pipeline completes, FluxCD takes over and continuously reconciles the cluster state based on these GitOps repositories
In essence, the IaC repository gets you from an empty Hypervisor to a GitOps-ready cluster, then the d2-repositories manage everything from that point forward. This separation means the IaC pipeline only needs to run for infrastructure changes and periodic disaster recovery tests, while the d2 repositories handle all day-to-day operations through FluxCD's automatic reconciliation.
IaC Pipeline
The high-level steps of provisioning the homelab infrastructure performed by the IaC pipeline are:
- Creates the GCP Cloud Storage bucket for offsite backups.
- Creates the Proxmox VMs using OpenTofu.
- Bootstraps the Kubernetes cluster using KubeSpray.
- Uploads the generated Kubeconfig file to GitLab as an artifact with restricted access.
- Triggers the Sealed Secrets deployment sub-pipeline.
- Triggers the FluxCD deployment sub-pipeline.
- Longhorn will be deployed by FluxCD as part of the d2-infra repository.
- Restore Jobs defined in the d2-infra repository will restore Longhorn volumes from the GCP Cloud Storage bucket.
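The steps above can be sketched as a top-level `.gitlab-ci.yml` skeleton. The stage and job names here are illustrative (except `opentofu-apply-production-home` and `apply-kubespray-production-home`, which appear later in this post), not copied from my repository:

```yaml
# Illustrative pipeline skeleton; most job names are placeholders.
stages:
  - backup-storage   # GCP bucket + HMAC key (runs in parallel with provisioning)
  - provision        # OpenTofu: Proxmox VMs + KubeSpray inventory
  - deploy           # KubeSpray bootstrap + kubeconfig upload
  - addons           # Sealed Secrets / FluxCD sub-pipelines

create-gcp-bucket:
  stage: backup-storage
  script:
    - tofu -chdir=gcp init && tofu -chdir=gcp apply -auto-approve

opentofu-apply-production-home:
  stage: provision
  script:
    - tofu init && tofu apply -auto-approve
    - tofu output -json > home-tf-output.json
  artifacts:
    paths: [home-tf-output.json]

apply-kubespray-production-home:
  stage: deploy
  needs: [opentofu-apply-production-home]
  script:
    - echo "runs KubeSpray as shown later in this post"

deploy-sealed-secrets:
  stage: addons
  trigger:
    include: sealed-secrets/.gitlab-ci.yml
```

Longhorn itself is intentionally absent here: it is reconciled later by FluxCD from the d2-infra repository.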
GCP Cloud Storage Bucket
The only prerequisites to create this part of the pipeline are:
- Create an OpenTofu service account with the necessary permissions to create and manage GCP Cloud Storage buckets.
- Create a GitLab Personal Access Token with the necessary permissions to modify CI/CD variables in the GitLab project. Note: I had to use a PAT because I'm on the free tier of GitLab SaaS, which does not yet support Project Access Tokens. If you have a paid plan, you should use a Project Access Token instead.
The process the pipeline follows is:
- If not already present, it creates the GCP Cloud Storage bucket using the `google_storage_bucket` resource:

  ```hcl
  resource "google_storage_bucket" "longhorn_backup_bucket" {
    name                        = var.gcp_backup_longhorn_bucket_name
    location                    = var.gcp_backup_region
    storage_class               = "NEARLINE"
    uniform_bucket_level_access = true
    public_access_prevention    = "enforced"

    hierarchical_namespace {
      enabled = true
    }
  }
  ```

- It creates a `longhorn_backup_service_account`.
- It assigns the `roles/storage.objectAdmin` role on the created bucket to the newly created service account.
- It creates/syncs an HMAC key for the service account.
- The GitLab pipeline saves the generated HMAC key as a CI/CD variable in the GitLab project. In my case, it uses the Access Token created in the prerequisites step to do so.

This HMAC key is later injected as a secret into the Kubernetes cluster, allowing Longhorn to connect to the GCP Cloud Storage bucket.
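The service account and HMAC key steps can be sketched with the Google provider like this; the resource names are my guesses for illustration, not the exact code from the repository:

```hcl
# Sketch of the service account, IAM binding, and HMAC key; names are illustrative.
resource "google_service_account" "longhorn_backup" {
  account_id   = "longhorn-backup-service-account"
  display_name = "Longhorn offsite backup"
}

# Grant object admin only on the backup bucket, not project-wide
resource "google_storage_bucket_iam_member" "longhorn_backup_object_admin" {
  bucket = google_storage_bucket.longhorn_backup_bucket.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.longhorn_backup.email}"
}

# HMAC key that Longhorn (via rclone) uses to talk to GCS over the S3 API
resource "google_storage_hmac_key" "longhorn_backup" {
  service_account_email = google_service_account.longhorn_backup.email
}
```

Scoping the role to the single bucket keeps the blast radius small if the HMAC key ever leaks.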
Proxmox VMs
The OpenTofu provider used is bpg/proxmox.
The prerequisites to create this part of the pipeline are:
- Create a Proxmox API token with the necessary permissions to create and manage VMs.
- Create an SSH key pair and store the private key as a GitLab CI/CD variable. This key will be used by the on-premise GitLab Runner to connect to the Proxmox VMs.
The pipeline follows this process:
- OpenTofu makes Proxmox download the Ubuntu Cloud Image using the `proxmox_virtual_environment_download_file` resource:

  ```hcl
  resource "proxmox_virtual_environment_download_file" "ubuntu_24_noble_qcow2_img" {
    content_type       = "iso"
    datastore_id       = "nfs-nas-1"
    node_name          = "pve2"
    url                = var.proxmox_k8s_node_image_url
    overwrite          = true
    file_name          = var.proxmox_k8s_node_image_name
    checksum           = var.proxmox_k8s_node_image_checksum
    checksum_algorithm = var.proxmox_k8s_node_image_checksum_algorithm
  }
  ```

- OpenTofu creates the Cloud-Init configuration file for the worker and master nodes using the `proxmox_virtual_environment_file` resource. The important configurations to add to the Cloud-Init file are:
  - SSH Keys: inject both your personal public key and the GitLab Runner's public key
  - Package Management: install necessary packages (like `qemu-guest-agent`)
  - Longhorn Prerequisites: configure according to the Longhorn documentation

  Code snippet (worker nodes):

  ```hcl
  resource "proxmox_virtual_environment_file" "ubuntu_cloud_init_worker" {
    content_type = "snippets"
    datastore_id = "nfs-nas-1"
    node_name    = "pve2"
    overwrite    = true

    source_raw {
      data = <<EOF
  #cloud-config
  users:
    - default
    - name: ${var.proxmox_k8s_node_username}
      groups:
        - sudo
      shell: /bin/bash
      ssh_authorized_keys:
  %{~for key in var.proxmox_k8s_node_ssh_keys}
        - ${key}
  %{~endfor}
      sudo: ALL=(ALL) NOPASSWD:ALL

  package_update: true
  package_upgrade: true
  packages:
    - qemu-guest-agent
    - nfs-common

  # Disk partitioning setup
  disk_setup:
    /dev/sdb:
      table_type: gpt
      layout: true
      overwrite: false

  # Filesystem setup
  fs_setup:
    - label: data
      filesystem: ext4
      device: /dev/sdb1
      partition: auto
      overwrite: false

  # Mount configuration
  mounts:
    - [/dev/sdb1, /mnt/data, ext4, "defaults,nofail", "0", "2"]

  write_files:
    - path: /etc/modules-load.d/dm_crypt.conf
      content: |
        dm_crypt
      owner: root:root
      permissions: '0644'

  runcmd:
    - systemctl enable qemu-guest-agent
    - systemctl start qemu-guest-agent
    - systemctl stop multipathd.socket
    - systemctl stop multipathd
    - systemctl disable multipathd.socket
    - systemctl disable multipathd
    - systemctl mask multipathd
    - systemctl mask multipathd.socket
    - modprobe dm_crypt
    - systemctl enable iscsid
    - systemctl start iscsid
    - echo "done" > /tmp/cloud-config.done
  EOF

      file_name = "ubuntu.cloud-config-worker.yaml"
    }
  }
  ```

- OpenTofu generates the `kubespray_inventory` output using the `templatefile` function with `inventory.tpl`. You can customize `inventory.tpl` to fit your needs following the KubeSpray documentation. Here's my version of the `inventory.tpl` file:

  ```
  {
    "all": {
      "vars": {
        "ansible_user": "${ansible_user}",
        "ansible_become": true,
        "calico_cni_name": "k8s-pod-network",
        "nat_outgoing": true,
        "nat_outgoing_ipv6": true,
        "calico_pool_blocksize": 26,
        "calico_network_backend": "vxlan",
        "calico_vxlan_mode": "CrossSubnet",
        "kube_proxy_strict_arp": true,
        "kube_encrypt_secret_data": true,
        "kubeconfig_localhost": true,
        "artifacts_dir": "/output",
        "etcd_deployment_type": "host",
        "etcd_metrics_port": 2381,
        "etcd_listen_metrics_urls": "http://0.0.0.0:2381",
        "etcd_metrics_service_labels": {
          "k8s-app": "etcd",
          "app.kubernetes.io/managed-by": "kubespray",
          "app": "kube-prometheus-stack-kube-etcd",
          "release": "kube-prometheus-stack"
        },
        "kube_proxy_metrics_bind_address": "0.0.0.0:10249"
      },
      "children": {
        "kube_control_plane": {
          "hosts": {
            %{ for name in master_nodes ~}
            "${name}",
            %{ endfor ~}
          }
        },
        "etcd": {
          "hosts": {
            %{ for name in master_nodes ~}
            "${name}",
            %{ endfor ~}
          }
        },
        "kube_node": {
          "hosts": {
            %{ for name in worker_nodes ~}
            "${name}",
            %{ endfor ~}
          }
        },
        "k8s_cluster": {
          "children": [
            "kube_control_plane",
            "kube_node"
          ]
        }
      }
    }
  }
  ```

  IMHO, the most important thing here is to limit KubeSpray to installing only the components necessary for a minimal Kubernetes cluster ready for FluxCD deployment.

- Saves the generated inventory file as a GitLab artifact for use in the next stage.
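For context, the template above can be wired to the VM resources through an OpenTofu output roughly like this; the variable and resource names are illustrative, not my exact code:

```hcl
# Sketch: render inventory.tpl from the created VMs; names are illustrative.
output "kubespray_inventory" {
  value = templatefile("${path.module}/inventory.tpl", {
    ansible_user = var.proxmox_k8s_node_username
    master_nodes = [for vm in proxmox_virtual_environment_vm.master : vm.name]
    worker_nodes = [for vm in proxmox_virtual_environment_vm.worker : vm.name]
  })
}
```

The pipeline then extracts this output from `tofu output -json` and hands it to KubeSpray as the Ansible inventory.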
Kubernetes Cluster Bootstrapping
This part of the pipeline bootstraps the Kubernetes cluster using KubeSpray.
I think KubeSpray is the best solution for creating a production-ready Kubernetes cluster in an on-premise environment, but I also think it should be limited to only the necessary components for a working cluster. KubeSpray, which is based on Ansible, provides many options to install various components like CNI, Ingress controllers, and monitoring stacks. However, in my opinion, these components should be installed using a more mature GitOps tool like FluxCD.
The prerequisites for this part of the pipeline are:
- A Personal Access Token with API scope to upload the kubeconfig file as a GitLab CI/CD variable. As already mentioned, if you have a paid plan, you should use a Project Access Token instead. Additionally, you can use the same token created for the GCP Bucket creation step.
- The GitLab Runner needs to connect to the Proxmox VMs using SSH. The private key for the SSH connection should be stored as a GitLab CI/CD variable in the project, as already mentioned in the Proxmox VMs creation step.
The pipeline follows these steps:
```yaml
apply-kubespray-production-home:
  stage: deploy
  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/docker:cli
  tags:
    - mgmt-zone
    - self-hosted
  needs:
    - job: opentofu-apply-production-home
      artifacts: true
  services:
    - *dind
  before_script:
    - apk add --no-interactive jq
  script:
    - mkdir -p $CI_PROJECT_DIR/inventory
    - mkdir -p output
    - chmod 600 $RUNNER_SSH_PRIVATE_KEY_PATH
    - jq -r '.kubespray_inventory.value' home-tf-output.json > $CI_PROJECT_DIR/inventory/inventory.json
    - docker run --mount type=bind,source=$CI_PROJECT_DIR/inventory,dst=/inventory --mount type=bind,source=$RUNNER_SSH_PRIVATE_KEY_PATH,dst=/ssh/id --mount type=bind,source=$CI_PROJECT_DIR/output,dst=/output --rm quay.io/kubespray/kubespray:$KUBESPRAY_VERSION ansible-playbook -i /inventory/inventory.json --private-key /ssh/id cluster.yml
  environment:
    name: production/home
  artifacts:
    when: on_success
    access: developer
    expire_in: "10 mins"
    paths:
      - output/**
```
This step leverages the official KubeSpray Docker image to run the Ansible playbooks against the Proxmox VMs created in the previous step. It then saves the generated kubeconfig file as a restricted-access artifact for use in subsequent pipeline stages.
Uploading Kubeconfig to GitLab
This step uploads the generated kubeconfig file to GitLab as a CI/CD variable using the GitLab API directly.
```yaml
.upload-secret-base64-encoded:
  image: ${CI_DEPENDENCY_PROXY_GROUP_IMAGE_PREFIX}/alpine/curl:latest
  script:
    - |
      set -e
      DATA=$(base64 -w 0 ${DATA_FILE_PATH})
      curl -s -f --request PUT \
        --header "PRIVATE-TOKEN: ${ACCESS_TOKEN}" \
        --header "Content-Type: application/json" \
        --data "{\"variable_type\":\"file\",\"key\":\"${VAR_NAME}\",\"value\":\"$DATA\",\"hidden\":false,\"protected\":true,\"masked\":true,\"raw\":true,\"description\":\"\"}" \
        "$CI_API_V4_URL/projects/$CI_PROJECT_ID/variables/${VAR_NAME}" > /dev/null 2>&1
```
This way, after the IaC pipeline finishes, the kubeconfig file will be available for administrators to download and use.
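To use the kubeconfig, an administrator simply reverses the encoding. The sketch below simulates the round-trip locally (the file names are placeholders):

```shell
# Simulate the encode/decode round-trip the upload job relies on.
# File names are placeholders for illustration.
printf 'apiVersion: v1\nkind: Config\n' > kubeconfig.sample

# Encode exactly as the upload job does: base64 without line wrapping
ENCODED=$(base64 -w 0 kubeconfig.sample)

# An administrator decodes the CI/CD variable value back into a usable file
printf '%s' "$ENCODED" | base64 -d > kubeconfig.decoded

# The round-trip must be lossless
cmp -s kubeconfig.sample kubeconfig.decoded && echo "round-trip OK"
```

Note that `base64 -w 0` is the GNU coreutils flag for disabling line wrapping, which keeps the value valid as a single JSON string in the API payload.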
Sealed Secrets Deployment
This part of the pipeline creates the Sealed Secrets controller in the Kubernetes cluster.
The prerequisite is a certificate/key pair to be used by the Sealed Secrets controller. You can find the instructions here: Bring your own certificates.
The private key needs to be stored as a GitLab CI/CD variable, while the certificate can be stored directly in the IaC repository.
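Once the controller is running with your certificate, credentials can live in git as SealedSecret manifests. A minimal, hypothetical example (the name, namespace, and truncated ciphertext are placeholders; the real ciphertext is produced by `kubeseal` against the controller's public certificate):

```yaml
# Hypothetical SealedSecret; only the controller's private key can decrypt it,
# so this manifest is safe to commit.
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: registry-credentials
  namespace: flux-system
spec:
  encryptedData:
    password: AgB7f3...   # truncated ciphertext placeholder
  template:
    metadata:
      name: registry-credentials
      namespace: flux-system
```

Because the key pair is pre-provisioned rather than generated by the controller, secrets sealed before a disaster remain decryptable after a full cluster rebuild.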
FluxCD Deployment
This part of the pipeline uses OpenTofu to deploy the FluxCD Operator in the Kubernetes cluster as described in the official documentation.
This step also creates the Kubernetes secret with the GCP HMAC key for Longhorn and the secret with the registry credentials to pull container images from private registries.
Longhorn Volumes Restoration Process
Unfortunately, at the time of writing, Longhorn does not support a declarative way to restore volumes from offsite backups (issue#5787).
To work around this limitation, I have created a simple Dockerized Python program that leverages the Longhorn API to restore the latest backup for defined volumes from offsite storage. You can find the repository here.
Simply define a Job in the d2-infra repository for each volume you want to restore. For example, here is the Job definition to restore the Authelia Longhorn volume:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: authelia-volume-restore
  namespace: authelia
spec:
  template:
    spec:
      containers:
        - name: restore
          image: your-registry/longhorn-backup-restore:latest
          env:
            - name: LONGHORN_URL
              value: http://longhorn-frontend.longhorn-system.svc.cluster.local
            - name: VOLUME_HANDLE
              value: authelia-production-vol
            - name: NUMBER_OF_REPLICAS
              value: "3"
            - name: LOG_LEVEL
              value: INFO
      restartPolicy: Never
  backoffLimit: 3
```
Then you need to create the corresponding PV and PVC to use the restored volume in your application. These can be defined in the d2-infra repository, for example in the same file as the restore job.
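As a sketch, the statically provisioned pair for the restored volume might look like the following; the capacity, claim name, and storage class are assumptions for illustration, so check them against your own setup:

```yaml
# Hypothetical static PV/PVC for the restored Longhorn volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: authelia-production-pv
spec:
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-static
  csi:
    driver: driver.longhorn.io
    volumeHandle: authelia-production-vol   # must match the restored Longhorn volume
    fsType: ext4
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: authelia-data
  namespace: authelia
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-static
  volumeName: authelia-production-pv
  resources:
    requests:
      storage: 2Gi
```

The key detail is `volumeHandle`, which binds the PV to the Longhorn volume created by the restore Job.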
You can find more information in the README of the repository.
Encrypt Longhorn Backups Client-Side
Unfortunately, at the time of writing, Longhorn does not support client-side encryption of backups natively (issue#5220).
A simple solution I found is to use rclone to encrypt the backups client-side before uploading them to the offsite backup location.
Simply declare a Deployment, a Service, some Secrets, and a one-time Job in the same namespace where Longhorn is installed.
The important bits are in the deployment manifest:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: s3-rclone-longhorn-bck
  labels:
    app.kubernetes.io/name: s3-rclone-longhorn-bck
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: s3-rclone-longhorn-bck
  template:
    metadata:
      labels:
        app.kubernetes.io/name: s3-rclone-longhorn-bck
    spec:
      containers:
        - name: rclone
          image: ghcr.io/rclone/rclone:1.71.1
          imagePullPolicy: IfNotPresent
          command:
            - "rclone"
            - "serve"
            - "s3"
            - "--no-cleanup"
            - "--auth-key"
            - "$(RC_ACCESS_KEY_ID),$(RC_ACCESS_KEY)"
            - "crypt_out_s3:"
            - "--s3-force-path-style=true"
            - "--addr=:8080"
            - "--log-level=WARNING"
          env:
            - name: RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: access_id
            - name: RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: secret
            - name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: rclone-secret
                  key: password
            # salt is used as password2
            - name: RCLONE_CONFIG_CRYPT_OUT_S3_PASSWORD2
              valueFrom:
                secretKeyRef:
                  name: rclone-secret
                  key: salt
            - name: RC_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_ACCESS_KEY_ID
            - name: RC_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_SECRET_ACCESS_KEY
          ports:
            - name: http
              containerPort: 8080
              protocol: TCP
          volumeMounts:
            - name: config
              mountPath: /root/.config/rclone
      volumes:
        - name: config
          configMap:
            name: s3-rclone-longhorn-bck-config
            items:
              - key: rclone.conf
                path: rclone.conf
```
and in the rclone config:
```ini
[out_s3]
type = s3
provider = GCS
endpoint = https://storage.googleapis.com
region = europe-west4
use_multipart_uploads = false

[crypt_out_s3]
type = crypt
remote = out_s3:<BUCKET-NAME-REDACTED>
directory_name_encryption = false
filename_encryption = off
```
Why We Use rclone serve s3
`rclone serve s3` is the key component that makes this solution work. It implements a basic S3-compatible server that exposes any rclone backend (in our case, the encrypted `crypt_out_s3` remote) as an S3 endpoint.
This is essential because:
- S3 Gateway: Longhorn expects to communicate with an S3-compatible storage backend for backups. By using `rclone serve s3`, we provide Longhorn with a standard S3 interface.
- Encryption layer: By pointing the S3 server at the `crypt_out_s3` remote, all data is transparently encrypted/decrypted as it passes through rclone.
- No Longhorn modifications: This approach requires zero changes to Longhorn itself; it just sees a standard S3 endpoint.
- Authentication: The `--auth-key` parameter allows us to secure the S3 endpoint with credentials that Longhorn can use.
The command essentially creates an S3 gateway that sits between Longhorn and the actual storage backend, handling all encryption/decryption automatically.
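From Longhorn's point of view, the gateway is then configured like any other S3 backup target. A hedged example of the relevant settings (the bucket name and region placeholder are illustrative, e.g. as Helm values for the Longhorn chart):

```yaml
# Hypothetical Longhorn settings; bucket name and region are placeholders.
defaultSettings:
  # Format is s3://<bucket>@<region>/ -- the region is effectively ignored
  # by the in-cluster rclone gateway.
  backupTarget: "s3://longhorn-backups@eu/"
  backupTargetCredentialSecret: "longhorn-rclone-bck-key-secret"
```

The referenced secret carries the `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` pair matching `--auth-key`, plus an `AWS_ENDPOINTS` entry pointing at the `s3-rclone-longhorn-bck` Service, so Longhorn talks to the gateway instead of a public cloud endpoint.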
Understanding the rclone crypt Options
The crypt remote configuration uses two important settings:
directory_name_encryption = false
This keeps directory names in plaintext (unencrypted). While this reduces security slightly, it has practical benefits:
- Easier to navigate the bucket structure directly if needed
- Simpler debuggingβyou can see the folder structure at a glance
The actual file data is still fully encrypted, so the main content security is preserved.
filename_encryption = off
With this setting, files only get a .bin extension added instead of having their filenames encrypted. This provides several advantages:
- Shorter encrypted filenames, avoiding path length limits on some cloud storage providers
- Easier to identify files when accessing the backend directly
Security trade-off: This setting trades some security for practicality. If you need maximum security, you could use standard encryption, which encrypts filenames completely.
Both options make the encrypted remote more manageable and reduce the risk of hitting storage provider limitations while keeping the actual file content fully encrypted.
Next, create the corresponding Service and a simple Job to initialize the backup bucket.
The important bits in the job manifest are:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: s3-rclone-longhorn-bck-init
  labels:
    app.kubernetes.io/name: s3-rclone-longhorn-bck
    app.kubernetes.io/component: init
spec:
  backoffLimit: 3
  template:
    metadata:
      labels:
        app.kubernetes.io/name: s3-rclone-longhorn-bck
        app.kubernetes.io/component: init
    spec:
      restartPolicy: OnFailure
      containers:
        - name: rclone-init
          image: ghcr.io/rclone/rclone:1.71.1
          imagePullPolicy: IfNotPresent
          command:
            - "/bin/sh"
            - "-c"
            - |
              set -e

              echo "Waiting for rclone S3 service to be ready..."
              # limit to 10 minutes
              COUNTER=0
              until nc -z s3-rclone-longhorn-bck 8080 2>/dev/null; do
                echo "Waiting for service..."
                sleep 10
                COUNTER=`expr $COUNTER + 1`
                if [ $COUNTER -ge 60 ]; then
                  echo "Timeout waiting for service after 10 minutes"
                  exit 1
                fi
              done
              echo "Service is ready!"
              COUNTER=0

              echo "Using bucket name: $${BUCKET_NAME}"

              # Set up the remote for the local rclone S3 service (encrypted)
              export RCLONE_CONFIG_LOCAL_S3_TYPE=s3
              export RCLONE_CONFIG_LOCAL_S3_PROVIDER=Other
              export RCLONE_CONFIG_LOCAL_S3_ENDPOINT=http://s3-rclone-longhorn-bck:8080
              export RCLONE_CONFIG_LOCAL_S3_ACCESS_KEY_ID="$${RC_ACCESS_KEY_ID}"
              export RCLONE_CONFIG_LOCAL_S3_SECRET_ACCESS_KEY="$${RC_ACCESS_KEY}"
              export RCLONE_CONFIG_LOCAL_S3_FORCE_PATH_STYLE=true

              export RCLONE_CONFIG_OUT_S3_USE_MULTIPART_UPLOADS=false
              export RCLONE_CONFIG_OUT_S3_NO_CHECK_BUCKET=true
              export RCLONE_CONFIG_OUT_S3_ACCESS_KEY_ID="$${BACKEND_S3_ACCESS_KEY_ID}"
              export RCLONE_CONFIG_OUT_S3_SECRET_ACCESS_KEY="$${BACKEND_S3_SECRET_ACCESS_KEY}"

              # Check if INFO.txt already exists in the backend (unencrypted)
              echo "Checking if INFO.txt already exists in backend..."
              if rclone lsf "out_s3:$${BUCKET_NAME}/INFO.txt" 2>/dev/null | grep -q "INFO.txt"; then
                echo "INFO.txt already exists in backend - initialization already complete"
                echo "Bucket is ready for Longhorn backups"
                exit 0
              fi

              # Check if bucket exists (via encrypted endpoint)
              echo "Checking if bucket exists..."
              if rclone lsd local_s3: 2>/dev/null | grep -q "$${BUCKET_NAME}"; then
                echo "Bucket '$${BUCKET_NAME}' already exists"

                # Check if bucket contains any encrypted data
                echo "Checking bucket contents (encrypted view)..."
                FILE_COUNT=$$(rclone ls "local_s3:$${BUCKET_NAME}/" 2>/dev/null | wc -l)

                if [ "$$FILE_COUNT" -gt 0 ]; then
                  echo "Bucket contains $${FILE_COUNT} encrypted file(s)"
                  echo "Listing existing files:"
                  rclone ls "local_s3:$${BUCKET_NAME}/" --max-depth 1
                fi
              else
                echo "Creating new bucket: $${BUCKET_NAME}"
                rclone mkdir "local_s3:$${BUCKET_NAME}" --log-level=INFO
                echo "Bucket created successfully"
              fi

              echo "Listing all buckets..."
              rclone lsd local_s3: --log-level=INFO

              echo "Generating INFO.txt file..."
              TIMESTAMP=$$(date -u +"%Y-%m-%d %H:%M:%S UTC")
              HOSTNAME=$$(hostname)

              cat > /tmp/INFO.txt <<EOF
              ============================================
              Longhorn Backup Bucket Information
              ============================================

              Bucket Name: $${BUCKET_NAME}
              Created: $${TIMESTAMP}
              Created By: $${HOSTNAME}

              Configuration:
              - Service: s3-rclone-longhorn-bck
              - Endpoint: http://s3-rclone-longhorn-bck:8080
              - Encryption: Enabled (rclone crypt)
              - Backend: Google Cloud Storage (GCS)
              - Region: europe-west4

              Rclone Configuration:
              - Remote: crypt_out_s3
              - Base Remote: out_s3
              - Encryption: Standard encryption with password and salt
              - Directory Name Encryption: Disabled
              - Filename Encryption: Disabled

              Environment:
              - Kubernetes Namespace: $${K8S_NAMESPACE}
              - Init Job: s3-rclone-longhorn-bck-init

              Notes:
              - All data stored in this bucket is encrypted using rclone crypt
              - Access requires proper HMAC credentials (stored in secrets)
              - Encryption password and salt are required for decryption
              - INFO.txt is stored UNENCRYPTED for easy access

              Secrets Used:
              - longhorn-backup-hmac-key: GCS HMAC credentials
              - rclone-secret: Encryption password and salt
              - longhorn-rclone-bck-key-secret: API access credentials

              ============================================
              EOF

              echo "Uploading INFO.txt to backend GCS (UNENCRYPTED)..."
              rclone copy /tmp/INFO.txt "out_s3:$${BUCKET_NAME}/" --log-level=INFO --s3-no-check-bucket

              echo "Verifying upload..."
              echo "Files in encrypted view:"
              rclone ls "local_s3:$${BUCKET_NAME}/" --log-level=INFO
              echo ""
              echo "Files in unencrypted backend:"
              rclone ls "out_s3:$${BUCKET_NAME}/" --max-depth 1 --log-level=INFO

              echo "Initialization complete!"
              echo "Bucket '$${BUCKET_NAME}' is ready for Longhorn backups"
              echo "INFO.txt is available unencrypted in the backend storage"
          env:
            - name: BUCKET_NAME
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_TYPE
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_PROVIDER
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_ENDPOINT
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_REGION
              value: ~
            - name: RCLONE_CONFIG_OUT_S3_USE_MULTIPART_UPLOADS
              value: ~
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: RC_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_ACCESS_KEY_ID
            - name: RC_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-rclone-bck-key-secret
                  key: AWS_SECRET_ACCESS_KEY
            - name: BACKEND_S3_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: access_id
            - name: BACKEND_S3_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: longhorn-backup-hmac-key
                  key: secret
```
Why We Generate the INFO.txt File
The INFO.txt file serves several important purposes:
- Documentation: It provides human-readable information about the bucket configuration, encryption setup, and required credentials. This is invaluable when you need to restore or troubleshoot backups months or years later.
- Accessible without encryption: Critically, INFO.txt is stored directly in the backend (`out_s3`), bypassing the encryption layer. This means it can be read without needing the encryption keys, making it a self-documenting backup location.
- Verification: By uploading it to the backend, we verify that:
  - The GCS backend connection works correctly
  - Credentials are properly configured
  - The bucket is accessible and writable
- Recovery aid: If you ever lose your rclone configuration but still have the encryption passwords in a password manager, the INFO.txt file tells you exactly how the encryption was configured. This makes it possible to reconstruct the setup and recover your backups.
- Idempotency check: The script checks for INFO.txt existence to determine whether initialization has already been completed, preventing duplicate initialization runs.
Why We Create the Bucket with `rclone mkdir "local_s3:${BUCKET_NAME}"`
Creating the bucket through the `local_s3:` remote (the encrypted S3 endpoint) rather than directly on the backend has several advantages:
- End-to-end testing: This verifies the entire encryption pipeline is working:
  - The `rclone serve s3` service is running and accessible
  - Authentication is correctly configured
  - The crypt layer is properly set up
  - The underlying GCS backend is reachable
- S3 API validation: It ensures that bucket creation operations work through the S3 API layer, which is exactly how Longhorn will interact with the system. If `rclone mkdir` succeeds through `local_s3:`, we know Longhorn's S3 operations will also work.
- Consistent access path: By creating the bucket the same way Longhorn will access it (through the S3 API), we ensure there are no surprises or incompatibilities when Longhorn starts using the bucket.
- Automatic bucket initialization: On GCS, creating a "bucket" through rclone's S3 interface actually creates a folder/prefix in the specified GCS bucket (configured as `<BUCKET-NAME-REDACTED>` in the config). This happens automatically through the crypt layer.
- Proper permissions verification: This confirms that the service account credentials (`RC_ACCESS_KEY_ID`/`RC_ACCESS_KEY`) have the necessary permissions to create buckets through the S3 interface.
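To make the remote layout easier to visualize, here is a minimal sketch of what the rclone configuration could look like. The `out_s3` and `local_s3` remote names match the ones used above; the crypt remote name, the endpoint, and every credential value here are illustrative placeholders, not my actual config:

```ini
# out_s3: the plain GCS backend, accessed via its S3-compatible
# (HMAC) API. INFO.txt is written here, outside the crypt layer.
[out_s3]
type = s3
provider = GCS
endpoint = https://storage.googleapis.com
access_key_id = <BACKEND_S3_ACCESS_KEY_ID>
secret_access_key = <BACKEND_S3_SECRET_ACCESS_KEY>

# Client-side encryption wrapped around out_s3.
# Passwords must be stored in "rclone obscure" form.
[backup_crypt]
type = crypt
remote = out_s3:<BUCKET-NAME-REDACTED>
password = <obscured password>
password2 = <obscured salt>

# local_s3: points at the endpoint exposed by
# "rclone serve s3 backup_crypt:" — this is what Longhorn
# (and rclone mkdir) actually talk to.
[local_s3]
type = s3
provider = Other
endpoint = http://127.0.0.1:8333
access_key_id = <RC_ACCESS_KEY_ID>
secret_access_key = <RC_ACCESS_KEY>
```

With a layout like this, the serving side would be started with something along the lines of `rclone serve s3 backup_crypt: --auth-key "$RC_ACCESS_KEY_ID,$RC_ACCESS_KEY"`, and `rclone mkdir "local_s3:${BUCKET_NAME}"` exercises the whole chain end to end.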
RTO and RPO
With this setup, you can easily modify the RPO by adjusting the Longhorn backup schedule to fit your needs.
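Concretely, the backup cadence (and therefore the RPO) boils down to a cron expression on a Longhorn `RecurringJob`. A sketch, with an illustrative name and schedule rather than my actual values:

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: offsite-backup        # illustrative name
  namespace: longhorn-system
spec:
  task: backup                # back up to the configured backup target
  cron: "0 */6 * * *"         # every 6 hours => worst-case RPO of ~6h
  retain: 7                   # keep the last 7 backups
  concurrency: 2              # volumes backed up in parallel
  groups:
    - default                 # applies to volumes in the default group
```

Tightening `cron` to an hourly schedule trades more object-storage traffic for a one-hour worst-case RPO.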
The RTO mainly depends on the pipeline execution time. The main time-consuming components are the Kubernetes cluster bootstrapping with Kubespray and the FluxCD reconciliation process.
In my case, the total time to provision the entire infrastructure from scratch and reconcile the cluster state is around 1 hour, which works well for my use case and is also acceptable for most organizations.
To lower the RTO further, you could customize FluxCD timeouts, retry periods, and parallelism to achieve more aggressive reconciliation.
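Most of these knobs live directly on the Flux `Kustomization` objects. A sketch of the relevant fields (the name, path, and values are illustrative, not my actual settings):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps                  # illustrative name
  namespace: flux-system
spec:
  interval: 10m               # how often to re-reconcile when healthy
  retryInterval: 1m           # retry sooner after a failure
  timeout: 5m                 # give up faster instead of the default wait
  wait: true                  # block until applied resources are ready
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./apps
```

Parallelism, by contrast, is a controller-level setting: the kustomize-controller's `--concurrent` flag controls how many Kustomizations it reconciles at once.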
However, I'm planning to set up a disaster recovery site for even better redundancy. More on this in the "Next Steps" section.
Conclusion
This journey from a homelab data loss incident to a production-grade IaC setup taught me that:
- Recovery Time Objectives aren't just for enterprises
- GitOps principles significantly reduce operational burden in the long run
While the initial setup took about a month (including research, testing, and iteration), I can now rebuild my entire infrastructure in 1 hour. More importantly, I've eliminated the anxiety of "did I back that up?": everything is code, versioned, and reproducible.
The total monthly cost (~€10 for GCS storage) is minimal compared to the value of reliable, reproducible infrastructure. If you're running a homelab, I encourage you to treat it like production: your future self will thank you.
Limitations
As you might have noticed, some components are not yet part of the automated rebuild process.
Infrastructure Assumptions:
- Proxmox Host: The Proxmox hypervisor itself is treated as static infrastructure and is not recreated from scratch
- DNS Records: External DNS records for services are assumed to be pre-configured
- Network configuration: The underlying network infrastructure (VLANs, subnets, firewall rules) is not managed by the IaC pipeline
While this is acceptable for a homelab, an enterprise-grade setup should include these components in the IaC pipeline as well. For on-premises environments, this presents additional challenges:
- Hypervisor bootstrapping requires out-of-band management (IPMI/iLO)
- Network configuration can be scripted with OpenTofu by leveraging Proxmox's Software-Defined Networking (SDN)
- DNS automation depends on your DNS provider's API availability
These omissions mean a true "datacenter destroyed" scenario still requires some manual intervention. However, for the more common scenarios (VM corruption, cluster misconfiguration, accidental deletion), the current setup provides comprehensive protection.
Next Steps
While Google Cloud is a great option and my monthly cost is only ~€10, in the future I would like to explore Hetzner: the pricing is really competitive, they offer an S3-compatible object storage service, and they have an OpenTofu provider.
Another area I would like to explore is leveraging Longhorn Disaster Recovery Volumes in conjunction with the FluxCD d2 architecture. This way, I might be able to create a recovery cluster in another location and have a more robust disaster recovery plan. I think that by using Hetzner with only the strictly necessary services, a single server might be sufficient to host the recovery cluster cheaply.
With this setup, it might be possible to achieve a very low RTO and RPO.
Finally, I hope that in the future Longhorn will natively support client-side encryption and a declarative way to restore volumes from offsite backups so that I can simplify the current setup.
This article is licensed under the CC BY-SA 4.0 license.