Copying blob data between storage accounts on Azure. Did you say scale? (part 1)

A couple of weeks ago I published a post on copying data between storage accounts using Azure Functions. The code for that function is quite basic and responds only to upload events: every time someone uploads a blob to a container monitored by the Function, the blob gets automatically copied across to the destination storage account. That’s great for small workloads, but what about copying 100k blobs from one storage account to another? And how about making that task super scalable? Challenge accepted! In this post I will explain how to solve this problem using the .NET Azure SDK and the Azure Data Movement library.

The architecture

The architecture of the solution is fairly simple, but I put particular focus on making the copy operation concurrent and highly scalable.

Firstly, I wrote a task to push all the existing blob URIs onto a queue; in this instance I used an Azure Storage Queue. Then I wrote a simple console application that spins up a configurable number of threads, each of which, in a loop, grabs the next item (a blob reference) from the queue and copies it to the destination storage account. Each thread runs until all items in the queue have been consumed. Overall, I managed to keep the architecture to a minimum while allowing the application to scale horizontally. In other words, you can spin up multiple processes on multiple VMs until you run out of data to copy, or money.

For example, for 100k blobs, you could spin up 100 VMs, each running the application with 1,000 threads. That’s 100,000 threads running concurrently, making the copy operation extremely fast. And the design allows you to scale up or down to meet your needs.

The code

The code consists of three parts:

  1. The Console application that retrieves and uploads all the blob references (URIs) to the Azure storage queue. The queue gets created automatically if you haven’t got one. The code can be found here. The important part is shown below:
static void Main(string[] args)  
{
    AddAllBlobsToQueue("video-queue");
}

private static void AddAllBlobsToQueue(string queueName)  
{
    var azureUtil = new AzureUtil();
    var queue = azureUtil.GetCloudQueue(queueName);
    var blobUris = azureUtil.GetAllBlobsInStorageAccount(StorageLocation.source);

    foreach(var blobUri in blobUris)
    {
        queue.AddMessage(new CloudQueueMessage(blobUri));
        Console.WriteLine($"Added {blobUri} to the queue");
    }
}
  2. The Console application that spins up a configurable number of threads to run the copy operation between the storage accounts. The code for this part of the operation can be found here.
static void Main(string[] args)  
{
    // Default to 1000 parallel tasks unless a count is passed on the command line.
    var parallelTaskCount = args.Length == 0 ? 1000 : int.Parse(args[0]);

    Parallel.For(0, parallelTaskCount, i =>
    {
        CopyBlob();
    });
}

public static void CopyBlob()  
{
    var queueName = "video-queue";
    var azureUtil = new AzureUtil();

    CloudQueueMessage nextQueueItem;
    while(true)
    {
        // Grab the next blob reference off the queue; null means the queue is drained.
        nextQueueItem = azureUtil.GetQueueItem(queueName);
        if(nextQueueItem == null)
        {
            return;
        }

        // Copy the blob server-side, then remove the message from the queue.
        var destinationContainer = azureUtil.GetContainerFromBlobUri(nextQueueItem.AsString);
        azureUtil.CopyBlob(nextQueueItem.AsString, destinationContainer).GetAwaiter().GetResult();
        azureUtil.DeleteQueueMessage(queueName, nextQueueItem);
    }
}
  3. A helper class (AzureUtil) that contains all the Azure Storage code and uses the Azure Data Movement library to perform the copy server-side. This means that the blobs are copied between the storage accounts without having to be downloaded locally first.
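
The AzureUtil class itself isn’t reproduced in the post, so here is a minimal sketch of its key methods to show the idea. This assumes the classic WindowsAzure.Storage SDK plus the WindowsAzure.Storage.DataMovement package; the connection-string constants and method signatures are illustrative and may differ from the actual repo:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.Queue;
using Microsoft.WindowsAzure.Storage.DataMovement;

public class AzureUtil
{
    // Hypothetical connection strings -- replace with your own.
    private const string SourceConnection = "<source-storage-connection-string>";
    private const string DestConnection = "<destination-storage-connection-string>";

    // Returns a reference to the queue, creating it on first use.
    public CloudQueue GetCloudQueue(string queueName)
    {
        var account = CloudStorageAccount.Parse(SourceConnection);
        var queue = account.CreateCloudQueueClient().GetQueueReference(queueName);
        queue.CreateIfNotExists();
        return queue;
    }

    // Enumerates every blob URI in every container of the source account.
    public IEnumerable<string> GetAllBlobsInStorageAccount()
    {
        var client = CloudStorageAccount.Parse(SourceConnection).CreateCloudBlobClient();
        foreach (var container in client.ListContainers())
        {
            // Flat listing so blobs nested under virtual directories are included.
            foreach (var blob in container.ListBlobs(null, useFlatBlobListing: true).OfType<CloudBlob>())
            {
                yield return blob.Uri.AbsoluteUri;
            }
        }
    }

    // Server-side copy via the Data Movement library: the bytes move directly
    // between the two storage accounts, never through the machine running this code.
    public async Task CopyBlob(string sourceBlobUri, CloudBlobContainer destinationContainer)
    {
        var sourceBlob = new CloudBlockBlob(new Uri(sourceBlobUri),
            CloudStorageAccount.Parse(SourceConnection).Credentials);
        var destBlob = destinationContainer.GetBlockBlobReference(sourceBlob.Name);

        // isServiceCopy: true asks the storage service to perform the copy itself.
        await TransferManager.CopyAsync(sourceBlob, destBlob, isServiceCopy: true);
    }
}
```

The `isServiceCopy: true` flag is what keeps the operation server-side; with `false`, Data Movement would stream the blob through the local machine instead.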

The full project

The code is freely available on GitHub for everyone to use (under the MIT license) here.

