clusterFer is two Proxmox hosts, pve1 and host1, sharing one /etc/pve via corosync. host1 had been silent for weeks. I rebuilt it on a different CPU, fought a boot freeze, brought it back into quorum, and then realized its 500 GB NVMe had 319 GiB of free blocks the SSD did not know about.
This is the story of how clusterFer got its second node back, and what I found once I could log in.
The hardware migration
host1's old chassis was an eight-core 4 GHz AMD box with 16 GB of RAM and three disks: a 4 TB SATA for the system and a legacy LVM-thin pool, a 500 GB NVMe (Crucial P1, QLC) for the active CTs and VMs, and a second 4 TB SATA for backups exported via NFS. Proxmox 8.x had been installed years ago and never reinstalled. Then the board failed.
I moved the disks into a different chassis with a Xeon E5-2680 v4 and 32 GB of RAM, expecting Proxmox to boot on the new CPU and figure it out. GRUB came up, I picked Proxmox Boot, and it stopped at:
Loading Linux 6.8.12-4-pve ... Loading initial ramdisk ...
That was it. No kernel panic, no further output, no reaction to keys. Either the system was still loading silently, or the initramfs was struggling with the new hardware - the old AMD board had a wholly different chipset, and the SATA controllers did not match either. I let it sit, power-cycled, tried again. After a few attempts the kernel came up, the system started, and I could log in over SSH. The boot stack survived the CPU swap; what stalled the first attempts looks in hindsight like a slow firmware re-handshake that GRUB was not reporting visually.
Back in quorum
From pve1 I checked corosync:
$ pvecm status
Cluster information
-------------------
Name: clusterFer
Nodes: 2
Quorate: Yes
Membership information
----------------------
Nodeid Votes Name
0x00000001 1 192.168.1.200
0x00000002 1 192.168.1.2 (local)
Both nodes online, quorate. /etc/pve was already in sync. host1 had 47 LXC containers and 18 VMs in its inventory, only one CT running. I started looking at the storage to confirm everything I cared about was still there.
The LVM-thin snapshot tree, the reason host1 exists
host1 has always been the experimental node. The reason is its NVMe storage layout: a single LVM-thin pool, nvme/data, with 47 thin volumes formatted ext4 inside. Each CT and VM disk lives in that pool, and snapshots are independent thin LVs that share blocks via copy-on-write at 256 KiB granularity.
What makes that interesting in Proxmox is the parent: field in /etc/pve/lxc/<id>.conf. It does not just record a linear history; it records a tree. CT 244 is the canonical example. Its snapshot tree:
rootAccess
+- Odoo16_oca_manual
+- Odoo16_oca_working (current head)
+- Odoo17_oca
| +- Odoo17_workiing
+- Odoo14_not_working
+- Odoo_18_working
+- Documents
+- visualCodeConfigSettings
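On disk, the tree is nothing more than parent: pointers. Each snapshot is a [name] section in the CT config carrying the full config of that moment plus a parent: line. An illustrative excerpt of what /etc/pve/lxc/244.conf would contain for the Odoo17 branch (all other keys omitted):

# top of the active config: points at the current head
parent: Odoo16_oca_working

[Odoo17_oca]
parent: rootAccess

[Odoo17_workiing]
parent: Odoo17_oca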
From the same baseline I had branched five different Odoo installs (14, 16 manual, 16 OCA, 17, 18) and a writing-tools branch. Rolling back to one of them does not delete the others. ZFS supports the same shape via zfs clone, but the Proxmox UI for LXC on ZFS does not expose it the same way; it treats snapshots as a line. host1's NVMe had become my "try things and back out without losing the other branches" environment over four years.
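Creating a branch needs no special tooling: roll back to any snapshot, then snapshot again. Roughly how the Odoo17 line would have come about (commands reconstructed for illustration, not pulled from shell history):

pct snapshot 244 rootAccess      # the shared baseline
pct rollback 244 rootAccess      # step back to it later
pct snapshot 244 Odoo17_oca      # a new branch off rootAccess, siblings untouched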
That mattered now because I needed to confirm the tree was intact. It was. pct listsnapshot 244 still listed the nine snapshots; the same for CT 1002 with seven. The hardware change had not touched the LVM metadata.
The 11% wear problem
While I was in there, I ran SMART on the NVMe:
Model Number:       CT500P1SSD8
Percentage Used:    11%
Data Units Written: 42,702,673 [21.8 TB]
Available Spare:    100%
11% wear after 21.8 TB written is healthy for QLC, but it raised a question I had never asked: was the SSD actually receiving TRIM? I checked.
- systemctl is-enabled fstrim.timer on the host: disabled. Never run.
- /etc/lvm/lvm.conf: thin_pool_autoextend_threshold = 100. Autoextend disabled, no early warning.
- lvs --segments -o +zero,discards nvme/data: zero passdown. passdown is good, but zero on means the pool zeroes every reused chunk - doubling writes for free.
- 43 of 47 LXC containers are unprivileged. fstrim from inside fails with FITRIM ioctl: Operation not permitted, and systemd's upstream fstrim.timer skips itself with ConditionVirtualization=!container.
- VMs had no discard=on on their disk lines. Guest TRIM had no path through QEMU.
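For anyone auditing a similar stack, lsblk gives a quick first read on whether each block layer advertises discard at all (the device path here is an assumption):

# nonzero DISC-GRAN/DISC-MAX means the layer accepts discards;
# zeros mean TRIM dies at that layer
lsblk --discard /dev/nvme0n1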
The QLC controller had been doing four years of garbage collection without ever knowing which blocks were actually free. Time to fix it.
Phase 1: no downtime
On pve1 (ZFS rpool, 4 TB Predator NVMe, 2% wear):
zpool set autotrim=on rpool
zpool trim rpool
zfs set atime=off rpool
zfs set xattr=sa rpool
systemctl enable --now zfs-scrub-monthly@rpool.timer
On host1 (LVM-thin nvme/data, QLC NVMe, 11% wear):
lvchange --zero n nvme/data
systemctl enable --now fstrim.timer
sed -i 's/thin_pool_autoextend_threshold = 100/thin_pool_autoextend_threshold = 80/' /etc/lvm/lvm.conf
# install /usr/local/bin/check-thinpool.sh + /etc/cron.d/check-thinpool
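The check script is nothing clever. A minimal sketch of what /usr/local/bin/check-thinpool.sh could look like - the 85% warning level and the mail alert are assumptions, not what actually runs on host1:

#!/bin/bash
# check-thinpool.sh - cron-driven warning when the thin pool fills up
POOL="nvme/data"
WARN=85   # percent, integer comparison below

# lvs prints e.g. " 54.82"; strip whitespace and the decimals
data=$(lvs --noheadings -o data_percent "$POOL" | tr -d ' ' | cut -d. -f1)
meta=$(lvs --noheadings -o metadata_percent "$POOL" | tr -d ' ' | cut -d. -f1)

if [ "$data" -ge "$WARN" ] || [ "$meta" -ge "$WARN" ]; then
    echo "thin pool $POOL: data ${data}%, metadata ${meta}%" \
      | mail -s "host1: thin pool above ${WARN}%" root
fi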
And cluster-wide, since /etc/pve is shared, I scoped each storage to the host that owns it:
pvesm set local-lvm --nodes host1
pvesm set nvme-lvm --nodes host1
pvesm set zfs-storage --nodes pve1
Until then both nodes had been showing each other's storages as inactive in the UI - not broken, just messy.
Phase 2: rolling
For VMs I added discard=on,ssd=1 to every disk on LVM-thin storage. A small Python script, add_discard.py, parsed each .conf, edited only the active section (everything before the first [snapshot_name] header), and skipped disks whose storage was not LVM-thin.
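The edit itself is one field per disk line. Before and after, illustratively (the size is made up):

# before
virtio0: nvme-lvm:vm-405-disk-0,size=32G
# after - the ssd=1 half of this comes back to bite later
virtio0: nvme-lvm:vm-405-disk-0,size=32G,discard=on,ssd=1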
For LXC the story was less clean. The Proxmox 8 schema rejects discard=on on rootfs: for LXC; that option does not exist for containers. And as I had already seen, unprivileged containers cannot FITRIM. The only way to get TRIM into the SSD for LXC volumes is offline:
- Stop the container.
- Mount its thin volume on the host with mount -o discard /dev/nvme/vm-XXX-disk-0 /mnt/trim-XXX.
- Run fstrim -v /mnt/trim-XXX.
- Unmount, start the container again.
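Condensed into a script, the loop looks roughly like this - a sketch with the error handling and was-it-running checks stripped out:

#!/bin/bash
# trim-lxc.sh - offline TRIM for LXC volumes on the nvme thin pool (sketch)
for ct in "$@"; do
    lv="/dev/nvme/vm-${ct}-disk-0"
    mnt="/mnt/trim-${ct}"
    pct stop "$ct"
    mkdir -p "$mnt"
    mount -o discard "$lv" "$mnt"
    fstrim -v "$mnt"          # prints how much was trimmed per CT
    umount "$mnt"
    rmdir "$mnt"
    pct start "$ct"
done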
I wrote that as /usr/local/bin/trim-lxc.sh and ran it across every CT on nvme-lvm. The numbers fstrim reported, space finally handed back to the SSD:
CT 188: 23.3 GiB CT 245: 18.1 GiB CT 515: 32.2 GiB
CT 227: 4.6 GiB CT 246: 20.3 GiB CT 516: 26.6 GiB
CT 234: 15.3 GiB CT 513: 16.6 GiB CT 519: 20.8 GiB
CT 236: 27.4 GiB CT 514: 2.1 GiB CT 1000: 18.0 GiB
CT 237: 16.6 GiB CT 244: 16.6 GiB CT 1001: 18.8 GiB
CT 1002: 9.9 GiB CT 1003: 18.9 GiB CT 1004: 13.8 GiB
~319 GiB total
The thin pool itself dropped from 60.35% to 54.82%. Most of the 319 GiB was never freed at the pool level, because those chunks are still referenced by snapshots somewhere in the DAG; but the filesystems have now told the stack exactly which blocks they consider free, and whatever the pool could pass down reached the controller. From the SSD's view, years of stale references were finally wiped.
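Both percentages come straight from the pool; a one-liner for checking before and after:

lvs -o lv_name,data_percent,metadata_percent nvme/data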
The local-lvm pool that was not
One detail kept nagging. pvesm status on host1 reported local-lvm as inactive with this message:
activating LV 'pve/data' failed: Activation of logical volume pve/data is prohibited while logical volume pve/data_tmeta is active.
Classic LVM-thin partial-activation bug. udev had brought up the metadata LV directly, leaving the parent thin pool unable to attach. The fix is three lvchange calls, archived in a tiny script:
$ cat /root/fixLVpveData.sh
lvchange -an pve/data_tdata
lvchange -an pve/data_tmeta
lvchange -ay pve/data
I ran it. pve/data came up with attribute twi-aotz-- (active, open, thin pool, zero), the sub-LVs reattached cleanly, and local-lvm in pvesm status went from inactive to active 48.65%. 3.49 TiB of legacy thin volumes back in scope.
The schema gotcha
That is when I caught my own bug. With local-lvm active, Proxmox revalidated all VM configs on it. Five VMs failed to parse:
vm 405 - unable to parse value of 'virtio0' - format error
ssd: property is not defined in schema and the schema does not allow additional properties
My Phase 2 script had blanket-added ,discard=on,ssd=1 to every disk. It turns out ssd is not in the schema for virtio* lines - only for scsi*, sata*, and ide*. virtio-blk does not expose a rotational rate, so QEMU has no concept of SSD emulation for it; the option is rejected, and Proxmox 8 strict-validates instead of silently accepting it the way older versions did.
A tiny followup script removed ,ssd=1 from virtio* lines and left discard=on in place. Five VMs (101, 306, 405, 408, 409) parsed again. None of them had been running, so there was no operational impact - but if I had not noticed the warnings I would have hit it the next time someone tried to start one.
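The fix boils down to stripping one option from virtio lines only. As a shell sketch of the same edit, applied per config file (back the file up first; the actual followup was a script, not this one-liner):

# remove ssd=1 from virtio* disk lines; scsi*/sata* keep it, discard=on stays
sed -i -E '/^virtio[0-9]+:/ s/,ssd=1//' /etc/pve/qemu-server/405.conf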
Final state
Name         Type     Status   Used     Total    %
local-lvm    lvmthin  active   1.7 TiB  3.5 TiB  48.65
nvme-lvm     lvmthin  active   249 GiB  465 GiB  54.82
backupHost1  nfs      active   2.5 TiB  3.6 TiB  68.44
Both nodes quorate, every storage scoped to its owner, autotrim on for the ZFS root pool, fstrim weekly on host1, zero-on-reuse off on the QLC pool, autoextend at 80% threshold with a cron alert at /usr/local/bin/check-thinpool.sh, VMs ready to TRIM through to the SSD on next start, and 319 GiB of stale free blocks finally returned to the controller.
The lessons
Hardware migration on Proxmox usually just works. The boot freeze that scared me on the first attempt resolved itself on a later reboot. Proxmox's modular initramfs and modern kernel happily picked up a different CPU, different chipset, different SATA controllers. I should have given the first boot more time before assuming it had hung.
LVM-thin's snapshot DAG is genuinely better than ZFS for the "branch and try" workflow on Proxmox LXC. Five Odoo versions branched from one baseline, four years later, all still rollbackable, no merge conflicts. The Proxmox UI exposes that tree directly via the parent: field. ZFS could do the same with clones, but the UI does not let you.
Unprivileged LXC plus LVM-thin equals zero TRIM unless you set up a workaround. The container cannot FITRIM, the host's fstrim does not see container mounts, and the per-disk discard=on option does not exist for rootfs in the schema. Stop, mount on the host, fstrim, start. If you have hundreds of thousands of file deletions in your CTs, you owe your SSD this.
Schema validation is per-disk-type. ssd=1 on virtio* breaks parsing in Proxmox 8. Read the schema, or use the UI to add disk options and copy what it produces.
May 9, 2026 - written from pve1, two nodes quorate, host1 humming on a Xeon it had never met before that morning.