A comparison of rsync vs. rdiff-backup

Abstract

rsync and rdiff-backup are both incremental solutions, so only the differences in files from source to destination are transferred, not the entire source.

rdiff-backup is more CPU intensive than rsync because it SHA-1 checksums everything it encounters[1][2]. If the hash of a file in the destination directory differs from that of the source file, it calculates the differences between the two and sends them over. rsync, however, is super quick. It doesn’t checksum everything by default; it simply compares each file’s size and timestamp, and only if those differ between source and destination does it calculate the differences and apply them to the destination.

I am not going to comment on how either of the programs calculate file differences in this post.

Whilst rdiff-backup is slow, on the flipside it does provide you with access to a version history of your data by archiving old data, just in case you make any accidental changes and then back up those changes. This is not naturally achievable with rsync, although there are scripts you can run alongside rsync to give you a similar feature. Mike Rubel briefly touches on how this is achievable with rsync.

rsync has an incredibly extensive options list, so I’m sure you can configure it any way imaginable. rsync is still being officially supported, to this date.

rdiff-backup’s options list is minimal, although this was its intention; to have sensible default configurations for simplicity. rdiff-backup was brought to a halt in 2009. We are yet to see a return from the rdiff-backup development team.

Both utilities can back up to a local destination, or to a remote destination via ssh.

Speed

The first thing I will mention is the speed. By looking at the table below, we can see that rdiff-backup is actually faster than rsync in practice for the initial full-backup. The table shows a time-lapse comparison between rsync and rdiff-backup from my Desktop, via my Virgin SuperHub, to my Raspberry Pi file server. The total amount of data transmitted was 50.4GB.

It’s worth taking these results with a pinch of salt, as I ran the experiment over an uncontrolled network, so interference from other housemates’ network activity may have impacted the results. Then again I’m not a networking expert, I’m sure the router could handle local traffic well without too much disruption.

Utility        Time (hh:mm:ss)   Average speed
rsync          10:26:19          1.37 MB/s
rdiff-backup   09:53:29          1.44 MB/s
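
As a sanity check on the table, here is a small Python sketch of the arithmetic behind the average speed column: total data divided by elapsed time. It assumes 1 GB = 1024 MB, which is my guess at how the figures were derived; the function names are made up for this illustration.

```python
def to_seconds(hhmmss):
    """Convert an 'hh:mm:ss' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def average_speed_mb_s(total_gb, hhmmss):
    """Average throughput in MB/s, assuming 1 GB = 1024 MB."""
    return total_gb * 1024 / to_seconds(hhmmss)


for tool, elapsed in (("rsync", "10:26:19"), ("rdiff-backup", "09:53:29")):
    print(f"{tool}: {average_speed_mb_s(50.4, elapsed):.3f} MB/s")
```

Under that assumption, both results land within about 0.01 MB/s of the figures in the table.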

Introduction

Professionals and home users swear by utilities like rsync or rdiff-backup to do their home, small business or enterprise backups. Take a quick glance at rsync’s and rdiff-backup’s man pages to see how far you can bend these utilities. You won’t see much of rdiff-backup and rsync used in businesses though, as the lack of corporate or business-level support doesn’t appeal to large stakeholders.

I will not cover all options available for both commands, I will merely pick out those that I deem worthy enough of a comparison and/or extra explanation. If you want to know any more, drop a comment or check the manual page.

Look at my GitHub repo of the Python scripts I made using both rsync and rdiff-backup, transferring my Dropbox directory to my Raspberry Pi file server.

Differential and Incremental backups

This isn’t essential to understanding rdiff-backup vs rsync, since they’re both incremental backup solutions. So feel free to skip, or read on to be more informed.

So you create a complete backup of your machine and you plop this on your server. To maintain an up to date backup on your server you could just do the entire backup process again, say, once a week. This is the most inefficient way to do this. It’s a very time consuming and expensive process.

Clever technologies exist to help you free up the time you spend on backing up your data.

Differential backups allow you to add only the differences on top of your full-backup. However, each differential backup holds everything that has changed since the last full-backup, so it provides you with access to the whole lot or none of it, and this entails sending a large chunk of data per backup transaction.

Incremental backups appeal to me as I like the idea of drip-feeding my RPi small chunks of data every night, and being able to refer back to older versions of documents from x days ago. Depending on how often you schedule it, transferring tens or hundreds of MBs of data takes no time at all, especially within a LAN.

It’s important to note that incremental backups have serious consequences on your full-backup accessibility if any version of your previous incremental backups become corrupt. Say for example today is Friday, and you did a full-backup on Monday. You need to access Wednesday’s version of Sales.docx. BUT! On Tuesday an error occurred and caused a corruption on Sales.docx. Because Wednesday is based on Tuesday’s contents, this will impair your ability [to operate machinery] to access, or at least view correctly, Wednesday’s Sales.docx.
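
To make the Sales.docx example concrete, here is a toy Python model of a chain of increments. It is only a sketch for illustration: plain strings stand in for file contents, `None` stands in for a corrupt increment, and the `restore` function is hypothetical. It is not how rsync or rdiff-backup actually store their data.

```python
def restore(full_backup, increments, day):
    """Rebuild the content as of `day` by replaying every increment up to it."""
    content = full_backup
    for name, delta in increments:
        if delta is None:  # None stands in for a corrupt increment
            raise ValueError(f"{name}'s increment is corrupt")
        content += delta
        if name == day:
            return content
    raise KeyError(day)


increments = [("Tuesday", None), ("Wednesday", " + Wednesday's edits")]
try:
    restore("Monday's Sales.docx", increments, "Wednesday")
except ValueError as error:
    print(f"Cannot restore Wednesday: {error}")
```

Because Wednesday can only be reached by passing through Tuesday, the one bad increment blocks every later version.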

For more information on this, simply read on or see the useful reads/watches section of this post.

Speed (continued)

Preston states in his book that rdiff-backup consumes more CPU power than any other rsync script. I can confirm that the i7 870 @ 3.53 GHz on my desktop was occupied during the backup with rdiff-backup, across all four cores, whereas the CPU usage for rsync was as low as idle CPU usage. This makes sense, as rdiff-backup checksums all files.

rsync operates on a timestamp basis by default: it skips any file whose size and modification time at the destination match those of the source. This is significantly quicker to work out, and doesn’t consume so much CPU power.
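
To show why a timestamp check is so cheap, here is a rough Python sketch of the idea. It is a simplification written for this post, not rsync’s actual code, and the function name `quick_check_differs` is my own invention.

```python
import os


def quick_check_differs(source_path, dest_path):
    """Return True if the file looks changed, judging by size and mtime only."""
    if not os.path.exists(dest_path):
        return True  # nothing at the destination yet
    src, dst = os.stat(source_path), os.stat(dest_path)
    # No file contents are read here, which is why this is cheap on the CPU.
    return src.st_size != dst.st_size or int(src.st_mtime) != int(dst.st_mtime)
```

A checksum-based check would instead have to read every byte of every file on each run, which matches the CPU load I saw with rdiff-backup.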

Your overall backup time with rdiff-backup and rsync can be mitigated by specifying a limit on the maximum file sizes you want to transfer.

Preston progresses to suggest that slow speeds by rdiff-backup are unnoticeable when the bottleneck is networking or disk drives. In my situation with the RPi, the bottleneck is indeed present with the Ethernet port, which shares the same hub as the USB 2.0 interface. Therefore you could argue that rdiff-backup’s full potential was never going to be met in the speed test that I carried out.

See this post by ppumpkin with greater details on the RPi’s architecture.

I would only suggest using rdiff-backup locally. I would like to recommend it remotely for faster servers,  as after all, I am using a RPi, but I can’t comment on that as I haven’t tested it.

Simplicity

As you saw by glancing at rdiff-backup’s manual page, it’s far shorter than rsync’s. It is how it looks at first glance: rdiff-backup was built to be just as powerful as rsync, but much simpler.

Disk space

As we’ve highlighted, rsync can’t provide file versions by default. Where rdiff-backup does have this feature, it requires extra space to store the deltas. So rdiff-backup by default compresses old versions of data with gzip.

It does however give you the --remove-older-than option, so you can control your disk space usage with rdiff-backup.

Progression feedback

By default, rsync prints out everything it’s doing. Not to the extent where it’s unreadable, but in a friendly scrolling manner which illustrates each directory or file it is currently working on. rdiff-backup however does not provide this functionality by default.

[rdiff-backup] --terminal-verbosity=[0-9]

Here’s the documentation for the different levels of verbosity in rdiff-backup. Otherwise below is a brief overview.

Note: the level of verbosity specified here will only occur within terminal, not the log file. If you want to adjust the verbosity of the log files, look at the --log-file-format option.

[rdiff-backup] --terminal-verbosity=3

3 is the default and it no longer prints the starting table, so it is what you see within the logs and as stdout to the terminal without this option specified. As I was seeking to improve progression feedback with rdiff-backup, so it was similar to rsync, I didn’t bother trying any lower than 3.

[rsync] --progress

The above command is rsync’s equivalent to rdiff-backup’s verbosity level 5. This option for rsync provides you with statistical feedback within terminal per file transaction: time elapsed, MB/s etc. It’s helpful if your source directory consists of many large files.

Screenshot: output of --progress

[rsync] -i

This itemized-changes option clearly illustrates what is happening to each file as the terminal quickly scrolls through the file list. See this link or this link for brilliant descriptions of all the --itemize-changes arguments.

[rsync] --out-format=

Using the above command you can customise the standard output datastream to only display the information you’re interested in. Arguments given to this option do not change the format of the log file.

End of transfer statistics

[rdiff-backup] --print-statistics

The above is very useful feedback that sums up the changes that occurred in the destination directory.
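
Since my scripts are in Python, here is a minimal sketch of how the two rdiff-backup feedback options covered so far could be combined into a single call. The function names are hypothetical and the paths are whatever you pass in; only the two flags come from this post.

```python
import subprocess


def rdiff_backup_command(source, destination, terminal_verbosity=5):
    """Build an rdiff-backup command line with chattier terminal output."""
    return [
        "rdiff-backup",
        f"--terminal-verbosity={terminal_verbosity}",
        "--print-statistics",
        source,
        destination,
    ]


def run_backup(source, destination):
    """Run the backup; raises CalledProcessError if rdiff-backup fails."""
    subprocess.run(rdiff_backup_command(source, destination), check=True)
```

Keeping the command as a list, rather than one long string, avoids any quoting trouble with spaces in paths.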

Screenshot: output of --print-statistics

[rsync] --stats

rsync provides an identical feature to the previous command, but with greater control on the formatting and which kind of stats print out as stdout and to the log file. Similar to --out-format=, you dictate what information you want to see.

You will notice with this option that it doesn’t provide you with a start and finish time, like rdiff-backup does. This makes it a little difficult to identify the overall time elapsed (the “Total time:” at the end was my Python datetime implementation). Thankfully, the log file prints the current time in the left column of each new line. Check out the manual page for a detailed going-over of this option’s format, including details of all values printed out at the end.
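
For anyone wanting a similar “Total time:” line, here is a rough sketch of the datetime approach. It is not the exact code from my scripts; `run_timed` is a made-up name, and the command in the example is just a stand-in so the sketch runs anywhere.

```python
import subprocess
import sys
from datetime import datetime


def run_timed(command):
    """Run a command and return (exit code, elapsed time as a timedelta)."""
    started = datetime.now()
    completed = subprocess.run(command)
    return completed.returncode, datetime.now() - started


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; swap in your rsync call.
    code, elapsed = run_timed([sys.executable, "-c", "pass"])
    print(f"Total time: {elapsed}")
```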

Screenshot: output of --stats

[rsync] --info=

I didn’t quite know whether to put this here, or in Miscellaneous. I placed it within this section because --info=progress2 gives statistics based on the whole transfer, as opposed to the --progress option that provides statistics per file. I can also tell you that --info=name0 stops all the names scrolling through during the whole transaction; a silent terminal. The manual gives very little detail on how to use --info.

Note: I’m running rsync 3.0.9, and this version doesn’t support this option. Although according to the manual, version 3.1.0 does support it. If I could find any documentation on this option, then I might just go through the effort of compiling the latest version, because by looking at the manual, it seems like an awesome option.

Logging

[rsync] --log-file=

If you’re running this from your server, then this option alone tells rsync where you want to keep your log file. Your log file’s contents by default will be what is printed out to terminal. However, this can be altered with --log-file-format. If you wish to just alter the stdout formatting and not the log file, then use --out-format.

Note: I used this option to measure how long rsync took to execute. You would think --stats would be of use, but no. --info= would be of use, but read on to find out why (in my instance, at least).

[rsync] --log-file-format=

Screenshot: the rsync log using the default format masks

This is something quite unique that rdiff-backup does not offer. Whilst documentation is again, minimal, it seems very powerful. You can control exactly what is being printed to the log file. The best example I can give you is showing the masks used in order to produce the default log file format. I will also provide a screenshot of the log file by rsync to illustrate understanding:
--log-file-format="%i %n%L"

[rsync] -v

This option is for verbosity and controls the level of detail being printed to the terminal during transfer. The more v’s you add, the greater the level of detail. Its successors are --info and --debug. However, as previously mentioned, documentation for the arguments to these options is scarce.

[rdiff-backup] logging

Screenshot: an example illustrating one of rdiff-backup’s backup.log files

The above is not a command, it’s just a title for this part. By default, rdiff-backup creates a log file. The log is stored within your destination directory, in the dedicated rdiff-backup-data directory, and is named backup.log for backups and restore.log for restores.

[rdiff-backup] -v[0-9]

This option is for verbosity and controls the level of verbosity being printed to the log file. You can specify the level of verbosity you want by adding a value between 0 and 9. The greater the value, the greater the detail.

Miscellaneous

[rdiff-backup] --remove-older-than time_spec

This removes the incremental backup information in the destination directory that has been around longer than the given time. time_spec can be either an absolute time, like “2002-01-04”, or a time interval. The time interval is an integer followed by the character s, m, h, D, W, M, or Y, indicating seconds, minutes, hours, days, weeks, months, or years respectively, or a number of these concatenated. For example, 32m means 32 minutes, and 3W2D10h7s means 3 weeks, 2 days, 10 hours, and 7 seconds. In this context, a month means 30 days, a year is 365 days, and a day is always 86400 seconds.
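
To make the interval format easier to picture, here is a small Python sketch that converts such an interval into seconds using the definitions quoted above. It is my own illustration rather than rdiff-backup’s parser, the names are made up, and it deliberately ignores absolute times like “2002-01-04”.

```python
import re

# Seconds per unit, using the manual's definitions quoted above.
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 3600,
    "D": 86400,
    "W": 7 * 86400,
    "M": 30 * 86400,
    "Y": 365 * 86400,
}


def interval_to_seconds(time_spec):
    """Convert an interval such as '3W2D10h7s' into seconds."""
    if not re.fullmatch(r"(\d+[smhDWMY])+", time_spec):
        raise ValueError(f"not a time interval: {time_spec!r}")
    return sum(
        int(count) * UNIT_SECONDS[unit]
        for count, unit in re.findall(r"(\d+)([smhDWMY])", time_spec)
    )
```

With this, `interval_to_seconds("32m")` gives 1920 and `interval_to_seconds("3W2D10h7s")` gives 2023207, matching the two examples in the manual’s text.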

I may have just copied and pasted that from the manual…

Note: you cannot back up and use --remove-older-than in the same command; you will receive an error if you attempt to do so. You must execute these commands separately. If you wish to run the --remove-older-than option and delete archives older than, say, 2 weeks on a remote directory, use the following command (notice the use of a double colon, not sure why… maybe you do?):
rdiff-backup --remove-older-than 2W --remote-schema "ssh %s -p1019 rdiff-backup --server" user@x.x.x.x::/path/to/backup/directory

[rsync] --rsync-path=

I use this option to execute commands on the remote server. There is a new option in rsync 3.1.*, --remote-option, which is intended for sending options to the remote side only. However, if you’re like me and got 3.0.9 from Cygwin, then --rsync-path will suffice.

For example, I use --rsync-path='rsync --log-file=/mnt/disk1/path/to/directory/rsync.log' to identify where I want to keep the log file at the remote end. I use --rsync-path for this because, as I’m executing the backup transaction from the client side, I’m unable to tell my local rsync program which directory to keep the log file in when that directory doesn’t exist on my client. So I execute the rsync command at the server end and it works well: it doesn’t overwrite the data upon each transaction, it just appends the new data.
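
In a Python script, the same trick might look like the sketch below. The function name is hypothetical, and the -a (archive) flag is my own addition for the example rather than something discussed in this post.

```python
def rsync_command(source, destination, remote_log):
    """Build an rsync call that asks the remote rsync to keep its own log."""
    return [
        "rsync",
        "-a",  # assumption: archive mode, not covered in this post
        # One list item, so no extra shell quoting is needed locally.
        f"--rsync-path=rsync --log-file={remote_log}",
        source,
        destination,
    ]
```

Passing the list to subprocess.run keeps the whole --rsync-path value together as one argument, which is what the single quotes do on the command line.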

[rdiff-backup] --max-file-size=
[rsync] --max-size=

These options take their argument as an integer number of bytes. Any file that exceeds the given size is disregarded, both for calculating a checksum and for transfer. rdiff-backup will still briefly pause to read each file’s size, so even if you use this option to speed up the process, it is still much slower than rsync overall.
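
As a rough picture of what a size limit does, here is a Python sketch that walks a directory and keeps only the files at or under a given size. It is an illustration of the idea only, with a made-up function name, and is not how either utility implements the option.

```python
import os


def files_within_limit(root, max_bytes):
    """Yield paths under root whose size does not exceed max_bytes."""
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(directory, name)
            # Only the size is read here; the contents are never opened.
            if os.path.getsize(path) <= max_bytes:
                yield path
```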

Support

As much as I’ve been supporting rdiff-backup in this post, unfortunately there is no longer any more official support for it. Its most stable version was released in 2009 and has not seen any activity since. I’ve read various things around the web saying the original developer stepped down, but a new one stood up two years ago – we’re yet to see any new features or improvements.

This means that rdiff-backup may be susceptible to failure from unfixed bugs. cwRsync, the Windows packaging of rsync (see Installation), on the other hand has turned proprietary by offering a GUI and 64-bit edition, while the CLI and 32-bit edition remains free; rsync itself is still free software. You can feel more confident of support from developers with rsync than you could with rdiff-backup. It is probably a safer bet if you’re considering it for your business.

Installation

rsync and rdiff-backup are both very easy to install. I got both as a Cygwin package so my process was very easy. No difficulties were experienced when installing on my RPi either – Raspbian.

Note: ensure you download from the original source to get the latest versions. Cygwin provides you with rdiff-backup 1.2.8 and rsync 3.0.9, whereas a 3.1.* version of rsync is available. If you wish to use rdiff-backup within Windows PowerShell or CMD then you must download the executable from here and add the directory where you decide to keep the executable – say, C:\Users\<name>\Downloads – to your PATH environment variable. I keep mine in C:\Program Files\, just for organisation.

Remember to change your home directory in Cygwin’s passwd file. See this post for more details! However, remember to use the absolute Cygwin path values within the --remote-schema option and when connecting via SSH, so /cygdrive/c/Users/<name>/ instead of C:\Users\<name>.

As just mentioned, the free edition of rsync is 32-bit. The environment variable defined in line #13 of cwrsync.cmd was configured for a 32-bit system, thus causing the command to fail when trying to execute the script on a 64-bit system. This is easily mitigated by changing line #13 from:

SET CWRSYNCHOME=%programfiles%\CWRSYNC

To:

SET CWRSYNCHOME=%programfiles(x86)%\CWRSYNC

Useful reads/watches

Conclusion

An important note to point out is that I’ve only compared two backup utilities here. Take a look at rsnapshot; Preston suggests that you can use rsync with rsnapshot. Whilst I haven’t looked at this tool yet (and I plan to), I suggest you do before you make your decision, just to scope out potentially more suitable options for your needs. I also must point out that I haven’t detailed all the commands that I believe make up an appropriate command set for a suitable and efficient backup with rsync and rdiff-backup. Please check out my posts (will update in due course) that are dedicated to either rsync or rdiff-backup.

All in all, if you’re thinking of applying rdiff-backup to a setup that is similar to mine, then I recommend rsync instead, purely because of how slow rdiff-backup is. If you have a standard server, or want to back up to an old computer with a PCI network card, then give it a go and tell me how you get on with rdiff-backup. It is a shame rdiff-backup is no longer supported; maybe the developer experienced great difficulty programming it to create deltas as well as be super quick, like rsync.

Personally, I really look forward to continuing to work with these utilities. I’ve never taken backups seriously, ever. So I’m glad that I’ve learnt these tools because I can stop riding my luck and taking my data for granted.

Bibliography

W. Curtis Preston. (2007). Backup & Recovery: Inexpensive Backup Solutions for Open Systems. Available: http://books.google.co.uk/books?id=6-w4fXbBInoC&pg=PA197&dq=rsync+vs+rdiff-backup&hl=en&sa=X&ei=qZg5U6fJFMGp7Aba6YCIAQ&ved=0CDEQ6AEwAA#v=onepage&q=rsync%20vs%20rdiff-backup&f=false. Last accessed 31st March 2014.

ppumpkin. (2012). What is the highest performing hardware configuration?. Available: http://raspberrypi.stackexchange.com/questions/1262/what-is-the-highest-performing-hardware-configuration. Last accessed 31st March 2014.

5 Comments:

  1. Nice write-up. Note that rsync-backup does have a mailing list that, while not very active, seems to be active enough to help with issues.

  2. Thanks for providing this valuable information/ comparison.
    In regard to access to a version history, you mention that this is not naturally achievable with rsync, although there are scripts you can run along side rsync to give you a similar feature. Not sure what “scripts along side rsync” you’re referring to. You can also save versions with rsync by using the “--hard-links” flag. No extra scripts required 🙂

    • Hi

      rsnapshot is a shining example of a comprehensive script written in Perl that uses rsync with the hardlinks option you mentioned.

      Adam

  3. A huge difference between rsync and rdiff–backup is the following: Suppose you have a file A, hundreds of MBs in size. Also suppose that you are incrementally backing it up on a daily basis. Finally, suppose that everyday you change just a few bytes in A.

    Both rsync and rdiff-backup will have to transfer just a few bytes every day – the bytes that changed in A. However, rsync will keep a complete copy of A for each incremental backup, whereas rdiff-backup will only keep the differences for each backup.

    In the case described, therefore, rsync will require many hundreds of MBs more of disk space to hold the same backup data as rdiff-backup.
