I've maintained a Ceph cluster at home since 2011. It spans 3 machines with 6 disks each. For quite a while, in those early days, I lost several hard disks to local heat, misconfigurations, early dogfooding of Ceph and Btrfs, and workloads that pushed the disks too hard. That has long been left behind, and for years the cluster has been quite stable as long-term storage, mainly for my archival backups.
But as of the past... months? years? I'm not sure... when I ran monthly-ish archival backups, I hit some quite annoying problems.
On one of the servers, one of the disks often seemed to overheat while I copied the backups to the cluster, and then the disk would stop talking to the server until I left it to cool down for long enough. Not a great thing when it holds part of your root filesystem and a swap partition.
Another disk on the same server would then slow down to a crawl. Other disks would kick it from the cluster, then it would rejoin, and there'd be a lot of additional slowdown due to the recovery activity.
They had been like that for a while, and I got used to enduring those annoyances. It was something to be looked into some day, postponed forever because, honestly, I did enough hard disk fixing for a lifetime in those early days, and the peace of mind of data replication across lots of disks was the primary reason for me to get interested in Ceph to begin with.
I just didn't feel like dealing with such failures any more. Presumably those disks would soon fail for good, and the replacements wouldn't be so annoying. Meanwhile, they provided me with a little extra redundancy that I hoped I'd never need.
The other day, I replaced the old USB fan that I had running pointed at the disks. It was hardly spinning any more. I hoped a new one would help get the backups finished. This backup, run at the hottest end of the summer, was hitting the long-known problems pretty hard.
Shortly after unplugging the old fan and plugging the new one in, I smelled smoke.
Ugh, right?
Well... Not so fast!
I powered off that server right away, disconnected the USB disks, powered the server back up and started plugging the disks back in one by one. They were all alive. Phew! (or, for those rooting for the annoying disks to rest in pieces, no such luck)
But when I plugged them in through the USB hub, they all seemed sort of dead. Their LEDs would light up, but the server couldn't detect them any more. So the hub was the source of the magic smoke that had escaped! Weirdly, it could still carry power to the disks and to the fan, but not data.
Anyway, the hub was replaced, and then the server came back up, resynced with the other machines, and got the remaining pieces of the archival backup in. Since then, I've been checking the data, and the cluster has been running its own scrubbing, for much longer than it used to take for those annoying disks to start giving me headaches, and none of them did.
Conclusion: it was the USB hub all along! Sorry, disks. You aren't annoying, and I hope you live a long and happy life now that you're connected to a hub that doesn't start depriving you of the power you need when it heats up.
So, I guess I should thank the fan for solving the problem for me?
I wonder if the older fan was spinning slowly because it was power-deprived too. I shall check that theory some day.
Anyway, for now, I'm happy with the newly-headache-free cluster. And the newly-data-less but still power-capable hub has got a new fan, spinning happily and cooling down the disks.
This post celebrates 16 years of Blonging for Freedom.
It kind of sucks to launch a blog on April 1st, because anniversary posts may always be mistaken for pranks. So I haven't posted much on such dates (or at all), but this is another such post, and to the best of my knowledge, it's all true. And so were the earlier ones. No kidding!
So blong...