(October 2007)
The company I work for is a small startup. By definition, startups are forced to work with limited budgets, limited resources and limited staff - which makes them perfect places for people like me. Why? Well... let's just say that my coding skills span a... wide range of "stuff". I have worked on almost everything you can think of, from mini OSes for tiny embedded MPC860s and the Windows device drivers that talked to them, to full-fledged database-driven MFC and C++ Builder GUIs for chemical plants... I enjoy diversity, and I can suggest no better way of staying sharp and keeping your mental health than working in a startup.
(Just make sure you have loads of patience... you'll need it :‑)
One of the roles I double (or is it triple? quadruple? :‑) for is that of "the crazy Unix guy" (tm). Put simply, whenever the services offered by my company involve Unix technology, advanced text processing and/or scripting, I am "called to the rescue". It is in this context that I was asked to manage a USA-based Virtual Dedicated Server (VDS) running CentOS (a Linux distribution). The VDS is used for the hosting (web site, e-mails, etc.) of our company products.
Initially, I tried to avoid as much effort as possible :‑) I opted for a small web and e-mail hosting package that gave no shell access. Unfortunately, I quickly found out that our provider co-hosted lots of other sites on the same IP address as ours. Some of them were spammers - or, more probably, their users were infected by malware that sent spam. This made our IP address a permanent resident of the Spamhaus blocklists, which occasionally caused mails sent by us to travel to Never-never land. In the end, we were forced to move - to our very own server, with an IP address that was ours and ours alone.
That immediately took care of our Spamhaus issues. Moreover, SSH access allowed me to tinker with the machine and set up many things I deemed necessary. Over the course of weeks, I fine-tuned the installation: I got acquainted with bind, and made our DNS server publish SPF records, so that no spammer could use our e-mail addresses as phony "From"s; I set up Mantis-based bug trackers for our product and client support teams; Joomla for our main site; scripted and automated package updates; and lots more. And naturally, I had to devise a way to back up the server: if the real machine hosting our server crashed, I wanted to be able to re-introduce all my custom changes - with minimal effort.
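(If you haven't met SPF before, it is nothing more than a TXT record in the DNS zone, listing the hosts allowed to send mail for the domain. A minimal sketch - with example.com standing in for our real domain - that authorizes only the domain's MX hosts to send, and asks receivers to reject everything else:)

example.com.    IN    TXT    "v=spf1 mx -all"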
Unfortunately... startups have limited resources. Our company uses a unified Windows backup strategy, so whatever data I collected from our VDS machine would have to be stored on Windows machines (equipped with RAID drives and a host of external USB disks).
As far as remote synchronization is concerned, I believe it is safe to say that there is an undisputed king: the best way to synchronize data between a remote machine and a local one is rsync. This amazing utility identifies and transfers only the differences between files; it currently allows us to take a backup of around 3GB of data (residing inside our USA-based VDS) and synchronize it with our local mirror by transferring just under 5MB - in less than one minute! Moreover, the transfer is done over a compressed, secure SSH tunnel, so what's not to like?
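(If you want to see these savings on your own data, rsync will happily report them. A hypothetical invocation - adapt the host and paths to your setup - where --stats prints, among other things, the total bytes actually sent over the wire:)

bash$ rsync -avz --stats root@remote.machine:/ /local/mirror/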
My only issue was where to store the data. That is, where to store the actual files received through rsync... I only had Windows machines to work with, so a number of challenges started to crop up...
Initially, I thought about Cygwin; I performed a small test, and found the results to be... so-so. Certainly not optimal: Cygwin uses an emulation layer to transform the Unix system calls (made by rsync) into those supported by Windows, so it is slower than the real thing. Rsync-ing to Windows filesystems also meant issues with file permissions and owners: try as I might, I could never figure out how to successfully map the UNIX machine's owners (those inside the VDS) to the Windows ones of our domain. Problems would inevitably crop up when - eventually - I would need to restore something.
Performance-wise, I found that VMware Server offered a much better alternative. I installed the free version on my Windows machine (as a Windows service, so that it runs even when I am not logged on), and proceeded to install a small Debian under it (I would have used QEMU, but I didn't find any way to run it as a Windows service). I benchmarked the virtual machine I created, and found that it ran rsync at near-native speeds - a much better performer than Cygwin. The tests I did, however, were run on the virtual disk that VMware Server offered to my Debian - and that wasn't enough. I had to somehow store the files on the Windows backup server, since only then would they be backed up (copied to the external USB drives, etc).
Samba to the rescue: from inside the Debian virtual machine, I could mount the backup server's share...

root# mount -t smbfs -o username=SEMANTIX\\BackupAdministrator \
    //WindowsBackup/VDS-backup /mnt/backup

...and the /mnt/backup directory offered me a way into the automatically backed-up folders of the WindowsBackup machine.
Now, you might think that this solved the problem: I finally had a Linux-accessible folder to run rsync in, at the full native speed of my machine. Unfortunately, speed was only a small part of the problem. The big one remained: the permissions/owners issue. Whether you run rsync under Cygwin or over a Samba-mounted folder in Linux, you still have to cope with mapping the Unix owners and permissions to the Windows ones. Darn. And the performance of rsync itself is not enough, either: even though rsync would be running natively, it would have to write many small files over the Samba protocol - not the best way towards achieving speed.
Then I remembered the loopback device: under Linux, a plain file can be mounted as a filesystem - even an encrypted one:

root# mount -o loop,encryption=aes256 BigFile /mnt/
I already knew about this... What I didn't realize for quite some time was that I could use it on a BigFile living inside a Samba-mounted share!
root# mount -t smbfs -o lfs,username=SEMANTIX\\BackupAdministrator \
    //WindowsBackup/VDS-backup /mnt/windows
root# mount -o loop,encryption=aes256 \
    /mnt/windows/BigFile /mnt/backup
Now that solves many issues... The /mnt/backup folder will be used as rsync's destination, and since the actual filesystem inside BigFile can be any Linux-supported one (ext3, ReiserFS, etc.), UNIX permissions and owners will be kept just fine!
Caveat: by default, Samba mounting has a file size limitation of 2GB. This is why the "lfs" (large file support) option is used in the mount above, to allow for larger files.
Then again, how big should this (Windows-hosted) BigFile be? Possible hard drive upgrades in our VDS must be accounted for...
Well, how about anticipating all possible sizes, without wasting a single bit of real hard drive space? Unbeknownst to many, sparse files come to the rescue: they have been commonplace in the UNIX world for decades; Microsoft OSes started supporting them with the introduction of NTFS.
Sparse files are files whose real allocation happens only when data are actually written inside them. If you read from places you haven't written to before, you get zeroes - and these zeroes don't occupy any space on your hard drive. If only one sector of data (512 bytes) is written at some offset inside a sparse file, then one sector is all the sparse file will reserve from the filesystem - not the (much larger) apparent size reported when you list the file!
All that is required to create a 150GB sparse file under Windows is this:
(From within a Cygwin command prompt)
dd if=/dev/zero of=BigFile bs=1M count=1 seek=150000
This command executes in about a second, and reserves only 1MB of real hard drive space. Real storage will grow as needed, as data are written inside BigFile.
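(You can see the magic for yourself under any Unix or Cygwin: ls reports the apparent size, while du reports the blocks actually allocated. Approximate figures shown:)

bash$ ls -lh BigFile     # apparent size: ~147G
bash$ du -h BigFile      # actual allocation: ~1.0M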
At this point, all the necessary components are there:
(on the Windows machine, from within a Cygwin command prompt)
dd if=/dev/zero of=BigFile bs=1M count=1 seek=150000

(and on the Linux virtual machine)
root# mount -t smbfs -o lfs,username=SEMANTIX\\BackupAdministrator \
    //WindowsBackup/VDS-backup /mnt/windows
root# losetup /dev/loop0 /mnt/windows/BigFile
root# mkreiserfs /dev/loop0        (first time only, to create the filesystem)
root# mount /dev/loop0 /mnt/backup
root# cd /mnt/backup
root# rsync -avz root@hosting.machine.in.US:/ ./
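(And when the mirroring is done, the layers unwind in reverse order - a teardown sketch:)

root# cd / && umount /mnt/backup
root# losetup -d /dev/loop0
root# umount /mnt/windows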
But this can be improved even further... Navigable directories with daily snapshots can be created - and, again, almost no hard drive space needs to be paid for them!
This is accomplished through hard links. Try this under any Unix (or Cygwin):
bash$ dd if=/dev/zero of=/tmp/file1 bs=1K count=10
bash$ cp -al /tmp/file1 /tmp/file2
bash$ ls -la /tmp/file1 /tmp/file2
-rw-r--r--  2 owner group 10240 Oct 14 12:14 /tmp/file1
-rw-r--r--  2 owner group 10240 Oct 14 12:14 /tmp/file2

Did you notice the "2" in the second column of both output lines? It is the link count: two directory entries point to the same data on disk. If one of the two is removed...

bash$ rm -f /tmp/file1
bash$ ls -l /tmp/file2
-rw-r--r--  1 owner group 10240 Oct 14 12:14 /tmp/file2

...the other one can still be used to access the data. Notice that the link count column is now 1. Only if you remove this file as well will the data really go to Never-never land.
How can this be put to use in our backups? Simple: since hard links take almost no space, a "free" local copy of the previous backup can be made into another directory (using hard links), and THEN rsync can work on one of the two copies, leaving the hard links in the other one pointing to the old data:
root# mount /dev/loop0 /mnt/backup
root# cd /mnt/backup
root# rm -rf OneBeforeLast
root# cp -al LastBackup OneBeforeLast
root# cd LastBackup
root# rsync -avz --delete root@hosting.machine.in.US:/ ./

The "cp -al" creates a zero-cost copy of the data (using hard links, the only price paid is that of the directory entries - and ReiserFS is well known for storing these extremely efficiently). Then rsync is executed with the --delete option, meaning that it must remove from our local mirror all the files that were removed on the server - thus creating an accurate image of the current state.
And here's the icing on the cake: the data inside the removed (or changed) files are not lost! They are still accessible from the OneBeforeLast/ directory, since the hard links (the old directory entries) still point to them!
In plain terms, simply navigating inside OneBeforeLast shows the exact contents of the server as they were BEFORE the last mirroring.
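(Or, to get a quick report of everything that changed between the two snapshots - a hypothetical example for /etc:)

root# diff -r --brief OneBeforeLast/etc LastBackup/etc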
Naturally, this history bookkeeping can be extended beyond two images. Personally, I created a script that keeps a little over two weeks' worth of daily backups - run automatically from cron at 3:00am every day, performing actions like these:
#!/bin/bash
rm -rf backup.16
mv backup.15 backup.16
mv backup.14 backup.15
...
mv backup.2 backup.3
cp -al backup.1 backup.2
rsync -avz --delete root@hosting.machine.in.US:/ backup.1/
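(The repetitive mv commands can, of course, be generated with a loop. A minimal equivalent sketch, assuming the same backup.N naming - backup.1 being the newest snapshot, backup.16 the oldest:)

#!/bin/bash
rm -rf backup.16
for i in $(seq 15 -1 2); do
    mv backup.$i backup.$((i+1))     # shift each snapshot one day back
done
cp -al backup.1 backup.2             # zero-cost hard-link copy
rsync -avz --delete root@hosting.machine.in.US:/ backup.1/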
Current statistics show that our instantly navigable backups of the last 15 days (for 3GB of remote data) are only costing...
Am I not right in calling this the perfect backup?
If you are to keep anything in mind from the above, it is this:
The power of UNIX lies in NOT trying to hit all birds with one stone; each utility does one job, and does it well. The amazing overall functionality we built (not existing, as a whole - at least to my knowledge - in any commercial offering) was obtained by "glue"-ing the functionality of different components together. That's how the Unix world works: by weaving components together - rsync for the transfers, SSH for security, Samba for the Windows-hosted storage, loopback mounts and sparse files for the filesystem, hard links for the snapshots, and cron for the scheduling.
In fact, I could add compression of the backed-up data in more than one place: I could ask the Windows server (through Explorer) to keep BigFile as an NTFS-compressed file; or I could layer one of the many FUSE-based compressing filesystems over the mounted ReiserFS image (much better).
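(The Explorer route also has a command-line equivalent: on the Windows side, from cmd or a Cygwin prompt, the compact utility flags a file for transparent NTFS compression:)

compact /c BigFile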
The possibilities are endless - in the open source world, at least.
Update 1: rsync to FreeBSD/ZFS

Keep in mind that in the years that have passed since I wrote this, my backup procedure improved further, by moving to... FreeBSD/ZFS. I am still rsync-ing - but the target storage is a FreeBSD/ZFS Atom330 machine: Atoms have low wattage (which is important for an always-on backup machine), and much more importantly...
Update 2: Use Borgbackup, store on ZFS
Borg goes a step further than ZFS: it uses rolling checksums (the kind rsync uses) to identify parts of a stream that remain identical. Think of it this way: ZFS snapshots will only store the sectors that changed, via copy-on-write semantics; Borg backups will identify identical areas regardless of where they are found - even in separate files. This de-duplication is then followed by compression, with an end result that is insanely efficient; I can safely say that I will be able to store thousands of daily backups without breaking a sweat.
And of course, the Borg-backed-up data are stored on ZFS, to take advantage of the self-healing (automatic detection of and recovery from data corruption) that ZFS offers.
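(A minimal sketch of the daily cycle - the repository path, compression choice and retention numbers here are assumptions for illustration, not my exact setup:)

bash$ borg init --encryption=repokey /tank/borg-repo     # once, to create the repository
bash$ borg create --compression zstd \
          /tank/borg-repo::{hostname}-{now} /data/to/backup
bash$ borg prune --keep-daily 30 /tank/borg-repo         # thin out old archives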
Also note: you don't have to use FreeBSD anymore; ZFS-on-Linux works perfectly.
Executive summary: as of 2021, at least, the state of the art is Borg backups stored in ZFS pools.