Several years back, I pulled a boneheaded move. I'm talking a real whopper. So, my first gig in the civilian sector was at a small MSP. We had a client with no real budget, but my boss had sold them on virtualization. He sold them a single host, running the free version of ESXi, with onboard storage. Before this was implemented, he'd also sold them an external SCSI tape drive.
Well, after the virtualization project, I was directed to use the tape drive and - somehow, without any direction or budget - create a backup solution that copied all the vmdks on the host to another building, nightly. I asked for assistance, advice, any sort of feedback on how to get it done, and was told to "just figure it out."
So I tried a free product Veeam was offering at the time. Free ESXi doesn't expose the storage APIs, so that was out. Then I found ghettoVCB. Long story short, I used ghettoVCB to back things up to an old box I put in the other building, then dumped that to tape. Problem was, the destination only had enough space for a single night's backup, so the script had to clean up the previous night's copy first. You can probably see where I'm going with this.
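For anyone who never ran ghettoVCB: the setup looked roughly like this. The paths and datastore names here are invented, and the variable names are from memory - check the ghettoVCB README for the exact spelling - but the shape is a global config pointing at the backup datastore, plus a nightly cron entry on the host:

```shell
# ghettoVCB.conf (illustrative -- verify variable names against the
# ghettoVCB docs; the volume name here is made up)
VM_BACKUP_VOLUME=/vmfs/volumes/nfs-backup-ds
DISK_BACKUP_FORMAT=thin
VM_BACKUP_ROTATION_COUNT=1

# nightly cron entry on the host, backing up every VM listed in a file:
# 0 1 * * * /vmfs/volumes/datastore1/ghettoVCB.sh -f /vmfs/volumes/datastore1/vms_to_backup -g /vmfs/volumes/datastore1/ghettoVCB.conf
```

ghettoVCB itself was fine, for the record. The problem was everything I bolted on around it.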
The script I built to automate all this didn't have any error handling, didn't check for the existence of the destination, nothing. Sooo, yeah, bad, idiot, boneheaded idea. One night, the thing kicks off, and the destination datastore isn't reachable. But the script isn't checking for that, and just goes on to the next command, which is - you guessed it - rm -rf.
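To make the failure mode concrete, here's a sandboxed sketch of the pattern - not the original script, and everything lives in a throwaway temp tree. The killer is a `cd` that fails silently, leaving the cleanup to run wherever the script happened to be sitting:

```shell
#!/bin/sh
# Sandboxed illustration of an unguarded cleanup -- NOT the real script.
# Everything lives under a throwaway directory in /tmp.
ROOT=/tmp/unguarded-cleanup-demo
rm -rf "$ROOT"
mkdir -p "$ROOT/backup-ds" "$ROOT/host-datastore"
touch "$ROOT/host-datastore/important-flat.vmdk"

cd "$ROOT/host-datastore"    # pretend this is where the script starts

# Cleanup step: hop to the destination, wipe last night's backup.
cd "$ROOT/backup-ds-GONE"    # destination unreachable: cd fails...
rm -rf ./*                   # ...but this runs anyway, in the wrong dir

ls -l "$ROOT/host-datastore" # the "host" files are gone
```

One missing error check, and the delete fires in whatever directory the script was left standing in.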
So I come in the next morning, and all the active VMs are humming right along, doing their thing. Didn't realize it at that point, but that was the first thing I learned that day - you can nuke an entire host, and whatever VMs are running on it at the time will continue to do so. Until you reboot them, that is. I needed to bounce a box at one point, so RDP into it, reboot, and wait. It doesn't come back. Fire up the vSphere Client to take a look; it won't connect. WTF. Grab the crash cart, hook it up to the host, and see a heart-stopping message:
no hypervisor found
Next mistake: bounced the host. Nothing comes back. Oh. Fuck. Even better, my boss had virtualized the fucking router, too. So now, I can't even check the vmware kb.
So I prod and prod and prod, and eventually I call the boss. He has no idea wtf could cause the error, and eventually we make the call to yank the host, dump it in my car, haul ass back to the office. Hook the box up on the bench, try to troubleshoot the issue.
Eventually, we figure out that the [whatever]-flat.vmdk files are still there, but literally everything else - including the bootbanks - is gone. Copy the flat.vmdk files off to an external drive, reinstall ESXi, dump the flat.vmdk files back on the array. OK... so I've got the data itself, but no vmx files and no disk descriptors. Used vi to rebuild the disk descriptors by hand, then manually recreated the vmx files and just attached the existing vmdks. Power shit on, it all comes back up.
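For the curious: a vmdk descriptor is just a small text file pointing at the -flat file, so you can recreate it by hand if you can compute the extent size. Here's a sketch of the idea - the names are invented, the dummy flat file only exists so the sketch runs end-to-end, and on a real host you'd also want the ddb geometry lines to match the original disk (or use vmkfstools to create a disk of identical size and repoint its descriptor at the old flat file):

```shell
#!/bin/sh
# Sketch: rebuild a missing descriptor for a surviving -flat.vmdk.
# Names are invented; the dd line just fabricates a 1 MiB dummy flat
# file so this runs standalone. On a real host the flat already exists.
FLAT=/tmp/descriptor-demo/myvm-flat.vmdk
DESC=/tmp/descriptor-demo/myvm.vmdk
mkdir -p /tmp/descriptor-demo
dd if=/dev/zero of="$FLAT" bs=512 count=2048 2>/dev/null

# The extent line needs the disk's size in 512-byte sectors.
SECTORS=$(( $(wc -c < "$FLAT") / 512 ))

cat > "$DESC" <<EOF
# Disk DescriptorFile
version=1
CID=fffffffe
parentCID=ffffffff
createType="vmfs"

# Extent description
RW $SECTORS VMFS "$(basename "$FLAT")"

# The Disk Data Base
ddb.adapterType = "lsilogic"
ddb.virtualHWVersion = "7"
EOF

echo "wrote $DESC: $SECTORS sectors"
```

Once each disk had a descriptor again, the vmx files were just a matter of creating new VMs and pointing them at the existing disks.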
Figured out what caused the problem, and learned more about ESXi's boot process, the vmdk format, etc. than I ever wanted to know at the time. Also rebuilt my scripts to make damned sure the fucking things knew exactly what was going on before they executed their cleanup steps.
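The fix itself is boring, which is the point. A sketch of the kind of guard rails the rebuilt scripts grew (paths are invented, and the mkdir/touch lines are demo-only setup so this runs standalone): fail on any error, verify the destination exists, verify it's the destination you think it is via a sentinel file, and only delete after the cd has verifiably succeeded:

```shell
#!/bin/sh
# Sketch of a guarded cleanup -- paths are invented; the mkdir/touch
# lines just build a demo tree so this runs standalone.
set -eu                                # abort on any error or unset var

DEST=/tmp/guarded-cleanup-demo/backup-ds
mkdir -p "$DEST"                       # demo setup
touch "$DEST/.backup-ds-marker"        # demo setup: plant the sentinel
touch "$DEST/old-backup.vmdk"          # demo setup: last night's copy

# 1. The destination has to exist...
[ -d "$DEST" ] || { echo "destination missing, aborting" >&2; exit 1; }
# 2. ...and has to be the directory we think it is.
[ -f "$DEST/.backup-ds-marker" ] || { echo "sentinel missing, aborting" >&2; exit 1; }
# 3. Under set -e, a failed cd kills the script before anything deletes.
cd "$DEST"
rm -rf ./*                             # scoped to a verified directory

echo "cleanup ran safely in $(pwd)"
```

If the datastore had dropped out with checks like these in place, the script would have bailed with an error instead of wandering off and eating the host.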
God, I feel like a complete dipshit to this day even thinking about the incident. Overall, though, the total downtime was only about 2.5 - 3 hours. Still, it never should have happened - and wouldn't have - if I'd been thinking about how my scripts worked rather than just "get it done."