Recently, security researchers identified a vulnerability in some cloud services (in particular VPS.net and Rackspace Cloud) that caused customers’ data to be leaked.
Secure virtual disk deletion is something we wrote about back in 2007, and it has been part of Brightbox Cloud since the very first design specification, over two years ago. So, I thought I’d go through some of the options we considered and what we finally decided on.
To maximise performance (and minimise complexity!) we use full disk allocation so that each server is allocated a contiguous section of disk using LVM. This also means we avoid a lot of very subtle data security problems associated with file-backed disks that make secure wiping extremely difficult (such as customer data being written to host file system journals).
This still leaves the problem of preventing a newly built server from reading data from previously deleted servers.
We originally considered using a copy-on-write system that would return zeroes when reading sections of disk that had not yet been written (as Amazon EC2 does) but this design has a few downsides. In particular, it affects write performance because the first write to each block requires a copy-on-write bitmask update to be synchronously written to disk.
It would also mean that customers’ data would still actually be on disk after they destroyed their server, which we didn’t like. If you destroy a server, you should be confident that the data was destroyed too - it shouldn’t still be stored anywhere, no matter what assurances are given about the access security.
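To make the trade-off concrete, here is a minimal Python sketch of the zero-on-read idea. It is a toy in-memory model under stated assumptions, not how Amazon EC2 or any real hypervisor implements it: the class name ZeroOnReadDisk, its methods and the metadata_syncs counter are all made up for illustration.

```python
class ZeroOnReadDisk:
    """Toy block device that hides stale data behind a 'written' bitmap.

    Hypothetical model for illustration only; all names are assumptions.
    """

    def __init__(self, backing: bytearray, block_size: int = 4096):
        if len(backing) % block_size != 0:
            raise ValueError("backing size must be a multiple of block_size")
        self.backing = backing
        self.block_size = block_size
        # One flag per block: has the current owner written it yet?
        self.written = [False] * (len(backing) // block_size)
        # Counts the extra metadata updates a real system would have to
        # persist synchronously on the first write to each block.
        self.metadata_syncs = 0

    def read_block(self, index: int) -> bytes:
        if not self.written[index]:
            # Never written by this owner: return zeroes, not the stale bytes.
            return bytes(self.block_size)
        start = index * self.block_size
        return bytes(self.backing[start:start + self.block_size])

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        if not self.written[index]:
            self.written[index] = True
            self.metadata_syncs += 1  # the first-write penalty
        start = index * self.block_size
        self.backing[start:start + self.block_size] = data


# Two blocks of data left behind by a previously "deleted" server.
stale = bytearray(b"secret!!" * 1024)
disk = ZeroOnReadDisk(stale)

new_owner_view = disk.read_block(0)   # all zeroes
still_on_disk = bytes(stale[:8])      # the old bytes are physically intact
```

The new owner can only ever read zeroes, but the previous customer's bytes stay in the backing store until something overwrites them, and every first write to a block costs an extra metadata update. Those are the two downsides described above.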
Instead, we opted for fully wiping the disks when a server is destroyed. We implemented QoS-style limits on our disk wipes, so they don’t interfere with active servers on the same host. The disk wiping is very low priority and only runs when the host has IOPS to spare. This is why destroying a server takes a little time to complete: we only confirm that it has been destroyed once the disk wipe has finished successfully.
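As a rough illustration of a throttled wipe, the Python sketch below overwrites a file or block device with zeroes while staying under a byte-rate budget. It is illustrative only, not the production implementation: the function name throttled_wipe, its parameters and the simple sleep-based limiter are assumptions, and a real system could instead rely on the host's I/O scheduling to hand the wipe only spare capacity.

```python
import os
import time


def throttled_wipe(path, max_bytes_per_sec, chunk_size=1 << 20,
                   sleep=time.sleep, clock=time.monotonic):
    """Overwrite `path` with zeroes, never exceeding `max_bytes_per_sec` on average.

    Hypothetical helper for illustration. Returns the number of bytes wiped,
    and only returns once the zeroes have been flushed to the device.
    """
    if max_bytes_per_sec <= 0 or chunk_size <= 0:
        raise ValueError("rate and chunk size must be positive")

    zeroes = bytes(chunk_size)
    with open(path, "r+b") as dev:
        # Seeking to the end gives the size for regular files and block devices.
        size = dev.seek(0, os.SEEK_END)
        dev.seek(0)

        start = clock()
        written = 0
        while written < size:
            n = min(chunk_size, size - written)
            dev.write(zeroes[:n])
            written += n

            # If we are ahead of the budget, pause until we are back on it.
            earliest_allowed = written / max_bytes_per_sec
            delay = earliest_allowed - (clock() - start)
            if delay > 0:
                sleep(delay)

        # Do not report success until the data has actually reached the disk.
        dev.flush()
        os.fsync(dev.fileno())
    return written
```

A caller would mark the server as destroyed only after throttled_wipe returns without raising, which mirrors the behaviour described above. The sleep and clock parameters exist purely so the limiter can be exercised in tests without real delays.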
Physical disk failure
If a physical disk fails, our policy is never to send it back to the manufacturer without wiping it first. Disks often fail in such a way that reading existing data from them is easy, even though they’re not usable in production. Sending these back to the original manufacturer would mean sending fragments of customer data to an untrusted third party.
So, the disks we’re able to wipe are returned, or retained for use in staging systems. Disks that do not respond are physically destroyed.
I hope this small glimpse of the steps we take to safeguard our customers’ data shows that we take it seriously, and that this level of data security is at the heart of the Brightbox Cloud architecture.
posted 27 Apr 2012 by John Leach