Amazon’s Simple Storage Service (S3) is a robust, inexpensive, and highly available Internet data storage service. At Blue Gecko, we occasionally help our customers design and implement S3-based backup strategies.
Compared to conventional off-site tape vaulting services, vaulting database and other backups to S3 has many advantages. S3 backups are always online, so you never have to wait for a truck to arrive with your tapes. S3 backups are replicated, so if one of Amazon’s availability zones experiences a failure, your data is still intact and available at one of their other zones. Best of all, Amazon also offers the Elastic Compute Cloud (AKA EC2, virtual server hosts by the hour), so your S3 backups double as a super-low-cost disaster recovery strategy. S3 is low-cost, starting at just 3.7¢ / GB / month for storage, and 10¢ / GB for uploads.
I back up all my home computers to S3 using third-party software called Jungle Disk. Jungle Disk runs in the background on a Mac or PC and backs up new data to the cloud every five minutes (or at whatever frequency you specify). These backups have come in handy many times, because I can browse and retrieve files from my home computers (such as photos and documents) from the office, without my home computers even being on.
Sounds like an ideal backup solution for small to medium-sized businesses, right? So what’s the catch?
The catch is your Internet connection. Consider the following:
A typical cable Internet connection with an actual upload rate of 2.8 MB/s could transfer a 100 GB file to Amazon S3 in just over ten hours. For many businesses, that’s the whole backup window. If there is more than 100 GB to transfer, you’re out of luck. If your servers are co-located somewhere with a fast connection to the Internet, you might get better transfer rates, but there are other limiting factors, like the number of hops to Amazon S3 and overall latency.
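To see where the ten-hour figure comes from, here is a rough back-of-the-envelope sketch in Python. It assumes the upload rate is measured in megabytes per second and that sizes use binary units (1 GB = 1024 MB). The function name transfer_hours is just an illustration, not part of any tool mentioned in this post.

```python
def transfer_hours(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Return the hours needed to move size_bytes at a sustained rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s / 3600


MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumption: a 100 GiB file and a sustained 2.8 MiB per second upload rate.
hours = transfer_hours(100 * GIB, 2.8 * MIB)
print(f"{hours:.1f} hours")  # prints "10.2 hours"
```

Plugging in your own backup size and measured upload rate tells you quickly whether a single-stream upload can fit in your backup window at all.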
Compress before uploading
If the backup files are not already compressed, you can almost always dramatically improve upload times by compressing them before uploading them.
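As a minimal sketch of the idea, the following Python snippet gzips a file in a streaming fashion before it would be handed to an uploader. The helper name gzip_file and the sample file name backup.bak are hypothetical, and the highly repetitive sample data compresses far better than a real backup would.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gzip_file(source: Path, level: int = 6) -> Path:
    """Compress source to '<name>.gz' next to it and return the new path."""
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb", compresslevel=level) as dst:
        # copyfileobj streams in chunks, so large backups need not fit in memory
        shutil.copyfileobj(src, dst)
    return target


# Demo with a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "backup.bak"
    sample.write_bytes(b"0123456789" * 100_000)
    packed = gzip_file(sample)
    original_size = sample.stat().st_size
    packed_size = packed.stat().st_size

print(original_size, "bytes before,", packed_size, "bytes after")
```

How much you gain depends entirely on the data. Backups that are already compressed or encrypted will barely shrink.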
Parallel uploads save the day
Upload performance to Amazon S3 can almost always be improved by running uploads in parallel. Choosing a degree of parallelism depends on the connection between your site and Amazon, the host from which you are uploading, and several other factors. The best way to determine your optimal degree of parallelism is to test it!
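The testing idea can be sketched in a few lines of Python. This is not the tool described in this post; it only illustrates timing the same set of uploads at several degrees of parallelism. The fake_upload function is a stand-in that sleeps instead of talking to S3, so you would swap in a real upload call of your own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def time_parallel(upload: Callable[[str], None], names: Sequence[str], degree: int) -> float:
    """Run upload(name) for every name using `degree` workers; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=degree) as pool:
        # list() waits for every upload and re-raises the first failure, if any
        list(pool.map(upload, names))
    return time.perf_counter() - start


def fake_upload(name: str) -> None:
    """Hypothetical stand-in for a real S3 upload: sleeps to simulate network time."""
    time.sleep(0.05)


files = [f"dummy.{i}" for i in range(1, 9)]
for degree in (1, 2, 4, 8):
    print(f"degree {degree}: {time_parallel(fake_upload, files, degree):.2f} s")
```

With a real uploader, the elapsed time usually stops improving at some degree. That point is what you are looking for.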
Blue Gecko happens to have a customer who wants to vault their Microsoft SQL Server backups to Amazon S3. Unlike Oracle, SQL Server has no native facility that allows it to stream backups directly to S3 as if it were tape. Instead, for this customer, we will compress and upload their database backups after they complete each night. To make it easy to find the optimal degree of parallelism, I delved into the murky world of Windows command shell programming. Against all instincts, I wrote this tool in Batch so that it would work easily on any of this customer’s SQL Server hosts.
This tool is designed to allow you to effectively determine the optimal parallel degree for backing up data from a particular server over the Internet to Amazon S3. It generates its own large files to upload. All you need is an Amazon S3 account. The tool comes as a pair of scripts that call a Ruby tool called s3cmd. To use it, follow these steps:
- Download and install Ruby 1.8.7-p334 for Windows.
- Download S3Sync into a convenient directory.
- Download and install gnuwin gzip and gnuwin tar for Windows.
- Open a command prompt window, and change to the directory where you downloaded S3Sync:
c:\> cd my_directory
- Unzip and untar the S3Sync package, then change to the S3Sync directory:
c:\my_directory> "\Program Files\GnuWin32\bin\gzip.exe" -d s3sync.tar.gz
c:\my_directory> "\Program Files\GnuWin32\bin\tar.exe" xvf s3sync.tar
c:\my_directory> cd s3sync
- Edit a file called test_parallel.bat, and paste the following contents into the file:
echo off
set /a filesize = %1 / %2
for /l %%v in (1,1,%2) do (
  fsutil file createnew dummy.%%v %filesize%
)
for /l %%v in (1,1,%2) do (
  start /b upload %3 %%v
)
echo on
- Create a bucket for testing. Make sure to substitute your AWS security credentials in the appropriate places:
c:\my_directory> set AWS_ACCESS_KEY_ID=your AWS access key ID
c:\my_directory> set AWS_SECRET_ACCESS_KEY=your AWS secret access key
c:\my_directory> set AWS_CALLING_FORMAT=SUBDOMAIN
c:\my_directory> s3cmd.rb createbucket my_test_bucket_1234
- Edit a file called upload.bat, and paste the following contents into the file. Make sure to substitute your AWS security credentials in the appropriate places:
echo off
set AWS_ACCESS_KEY_ID=your AWS access key ID
set AWS_SECRET_ACCESS_KEY=your AWS secret access key
set AWS_CALLING_FORMAT=SUBDOMAIN
echo %time%
s3cmd.rb put %1:dummy.%2 dummy.%2
echo %time%
del dummy.%2
- Now you can start testing parallel uploads to AWS. The syntax to call test_parallel.bat is:
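c:\my_directory> test_parallel file_size degree bucket
Here file_size is the total number of bytes to upload; the script divides it evenly among the parallel uploads. For example, to upload 120 MB in total (125829120 bytes, or ten files of 12 MB each) at parallel degree 10 to a bucket called my_test_bucket_1234:
c:\my_directory> test_parallel 125829120 10 my_test_bucket_1234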
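If you would rather not work out the byte count by hand, a small helper can do it. This is only a convenience sketch: the function name plan_upload is hypothetical, and it mirrors the integer division that test_parallel.bat performs with set /a. It also guards against totals above the 32-bit signed integer range that set /a arithmetic is limited to.

```python
from typing import Tuple

MIB = 1024 * 1024
SET_A_MAX = 2**31 - 1  # cmd.exe `set /a` works with 32-bit signed integers


def plan_upload(total_mib: int, degree: int) -> Tuple[int, int]:
    """Return (file_size argument in bytes, bytes per dummy file)."""
    if degree < 1:
        raise ValueError("degree must be at least 1")
    total_bytes = total_mib * MIB
    if total_bytes > SET_A_MAX:
        raise ValueError("total is too large for `set /a` arithmetic")
    # Same integer division the batch script applies to file_size / degree
    return total_bytes, total_bytes // degree


print(plan_upload(120, 10))  # prints (125829120, 12582912)
```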
The elapsed time can be determined as follows:
- Look for the first timestamp displayed in the output and note it.
- Wait for the last timestamp to display. Note it.
- The intervening time is the elapsed time for the upload. You can use Excel or any other tool you like to calculate time deltas.
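If you prefer a script to a spreadsheet, the steps above can be sketched in Python. This assumes the timestamps look like the HH:MM:SS.cc format that echo %time% prints on a US English system (other locales may differ), and the two sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def elapsed_seconds(first: str, last: str) -> float:
    """Seconds between two `echo %time%` stamps such as ' 9:03:27.45'."""
    fmt = "%H:%M:%S.%f"
    start = datetime.strptime(first.strip(), fmt)
    end = datetime.strptime(last.strip(), fmt)
    if end < start:
        # The run crossed midnight, so the last stamp belongs to the next day
        end += timedelta(days=1)
    return (end - start).total_seconds()


# Hypothetical first and last timestamps copied from the test output
print(elapsed_seconds("22:15:04.10", "22:17:49.85"))  # prints 165.75
```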
I typically run the upload without parallelism (degree = 1), then increase it in increments of five. If there is any doubt as to which S3 region will provide the best performance, I create a bucket in each region (US-West and US-East), then perform identical tests against each.