What Protocol Does Azcopy Use


Marybelle Bailey

Aug 4, 2024, 2:57:38 PM
to winbiderda
I've been tasked with looking into Azure Files to gradually move our file server to the cloud. This file server has well over 2 million files and close to 5 TB of data. The NTFS permissions are a mess in terms of broken inheritance. I've gone through and set up a file share and had a look at File Sync, and AzCopy appears to be for blob storage only right now.

With File Sync, it doesn't appear to carry over the NTFS permissions from my file server. With the file share, I went through the process of setting up the Azure share to use AD but ran into the issue of port 445 being blocked by my ISP. I will have to look into the alternatives (Azure P2S VPN, Azure S2S VPN, or Express Route) to tunnel SMB traffic over a different port.
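
For reference, a quick way to confirm the TCP/445 block from a given machine is to try a plain TCP connection to the file endpoint (mystorageaccount below is a placeholder for the real storage account name):

    nc -vz mystorageaccount.file.core.windows.net 445

If the connection times out, the port is almost certainly being filtered by the ISP or an upstream firewall.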


It is a known problem that some ISPs block the TCP/445 port. This practice originates from security guidance about legacy and deprecated versions of the SMB protocol. Although SMB 3.0 is an internet-safe protocol (and Azure Files only uses this version), older versions of SMB, especially SMB 1.0, are not.


Azure File Sync does not look like what we need. What we are trying to do is move chunks of data (files and documents) to Azure, set the share permission to read-only, and maintain the existing NTFS permissions. We would then map that data, stored in and shared from Azure, for end users.
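
For what it's worth, newer AzCopy (v10) releases did add Azure Files as a target, so if a current build is an option, a copy that attempts to carry the NTFS permissions and timestamps along might look roughly like this (account, share, local path, and SAS token are all placeholders; in the most recent versions the flag is named --preserve-permissions):

    azcopy copy "D:\FileServer\Departments" "https://mystorageaccount.file.core.windows.net/myshare?<SAS>" --recursive --preserve-smb-permissions=true --preserve-smb-info=true

This only sketches the idea; whether it copes well with the broken inheritance described above would need testing on a small subtree first.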


AzCopy by default starts a certain number of concurrent operations to increase data transfer throughput. Note that a large number of concurrent operations in a low-bandwidth environment may overwhelm the network connection and prevent the operations from fully completing. Throttle concurrent operations based on the actual available network bandwidth.
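
That guidance comes from the older AzCopy documentation (where concurrency was a command option); in AzCopy v10 the equivalent knobs are an environment variable and a bandwidth cap flag. A sketch, with example values only:

    export AZCOPY_CONCURRENCY_VALUE=16
    azcopy copy "/data/archive" "https://mystorageaccount.blob.core.windows.net/backup?<SAS>" --recursive --cap-mbps 100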


Each time you issue a command to AzCopy, it checks whether a journal file exists in the default folder, or whether it exists in a folder that you specified via the journal-folder option. If the journal file does not exist in either place, AzCopy treats the operation as new and generates a new journal file.
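
That journal behaviour belongs to the older AzCopy (v8). In AzCopy v10 the same role is played by job plan files, and an interrupted transfer is resumed by job ID, roughly like this (the SAS tokens usually have to be supplied again on resume):

    azcopy jobs list
    azcopy jobs resume <job-id> --source-sas "<SAS>" --destination-sas "<SAS>"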






Since I have a very large number of files, my question is this: does the rclone client call out to Azure for every file to get the md5sum in order to decide whether to upload, or does it keep some kind of local cache of such values?


If you want to sync, you should probably use the sync command rather than the copy command.

If you are unsure about the differences, I recommend you read the documentation first. The gist of it is that copy will never delete anything, unless it is to overwrite a file with the same name in the same place.

sync will delete any files that no longer exist on the source (thus making an exact clone of your data on the other side).
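
As an illustration, the difference is easy to preview with --dry-run before letting anything touch real data (the remote name azblob: and the container are placeholders for however the Azure remote was configured):

    rclone copy /srv/fileserver azblob:archive --dry-run   # never deletes anything on the destination
    rclone sync /srv/fileserver azblob:archive --dry-run   # would also delete destination files missing from the source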


rclone can cache file attributes - but probably not in the way you mean.

It would not be very safe to use unverified old cached data to sync, as the files in the cloud could have changed since the last time.


But no - rclone does need to ping every file and ask for its attributes.

While I am not very familiar with Azure in particular, basically all cloud storage uses listings (and from what I can see, AzureBlob does too). rclone will ask for the listing data from the server (which is just a little text that contains all the names, hashes, and other attribute data). A listing covers many files at once, usually all files in a logical folder per request, and this process is multithreaded as much as you set it to be / as much as the server can handle. If the cloud system supports fast-list (also called recursive listing) then rclone can ask the server for a recursive listing of whole folder trees all at once. This greatly increases efficiency: it can easily be 15-20x faster than not using it (it does use some extra memory, though only as much as it takes to store these listing texts).


The result is that you can pretty easily get a lot of file attributes (which rclone then makes choices based upon) from relatively few API calls, and at a pretty good speed. If your folder hierarchy is relatively flat then it will be even more efficient. In AzureBlob, the default is up to 5000 files in a single listing request (configurable).
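
Assuming the Azure remote is configured as azblob:, a sketch of the relevant knobs (the numbers are examples, not recommendations):

    rclone sync /srv/fileserver azblob:archive --fast-list --checkers 16 --transfers 8 --azureblob-list-chunk 5000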


Can I ask what the background for the question is?

Are you worried about what it will cost you to sync based on operation-type? Or is the worry how it will perform? Knowing your motivation will make it easier for me to give you pertinent advice and recommendations.


Rclone will, by default, check the size and the modification date of each file to see if it needs to be uploaded - this is quick to read locally and no cache is needed. When rclone uploads a file to Azure it sets the modification time, and this is read back in the directory listings rclone requests.


Note that on S3 and Swift, reading the modification time does take an extra transaction but it doesn't on Azure Blob or GCS. I should probably put this in a column in the overview as it is quite important information for optimization.
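
If reading modification times ever becomes the expensive part (as it can on S3 or Swift), rclone can be told to compare files differently; for example:

    rclone sync /srv/fileserver azblob:archive --size-only   # compare by size only, ignore modtime
    rclone sync /srv/fileserver azblob:archive --checksum    # compare by hash instead of size and modtime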


If the OP needs help understanding how the operation types defined in the AzureBlob pricing documentation relate to rclone, I will be happy to help with that. But generally, the listing-type operations are in the cheapest tier of operations; you often pay per tens of thousands of operations in that tier (based on experience with Google Cloud Storage).


Let's imagine this scenario: you have built a hybrid network between your corporate network and Azure using Azure Express Route and now you would like to leverage this private high-bandwidth connectivity for various purposes.


Broadly, the transfer options fall into two categories: online (network-based) transfer and offline transfer. The first category is represented by tools (azcopy or the Azure Storage SDK) and Azure services (like Azure Data Factory), and it requires good network bandwidth (and time). The second category is dominated by the Azure Data Box product family.


If you need some help with making the right choice, you can leverage a fairly new experience in the Azure portal. When you provision a new Storage Account (or open an existing one), you can see the "Data Transfer" blade, where you fill in three attributes (estimated data size, available network bandwidth, and transfer frequency) and, based on your input, you will get a recommendation for the most suitable transfer option.


So far so good? Well, the moment you start planning your data transfer in detail, you will (most likely) realize that the Azure Storage service exposes several public endpoints; for Blob storage it will look something like https://<storage-account-name>.blob.core.windows.net.




Unless you configured something called Microsoft peering on your Express Route circuit, all traffic from your network to those public endpoints will be routed over your Internet connection (router) and not via Express Route!


Do not despair. Microsoft launched a new service called Private Link that allows you to create a "private endpoint" that represents a specific instance of a PaaS service (like Azure Storage) and makes it available inside your Virtual Network, presented in the form of a network interface card with a private IP.


This has several benefits, like the ability to completely close the public endpoints of your PaaS instance (blocking any access from the Internet) and to make the instance available not only from within your VNet, but also from all peered VNets as well as your corporate network.
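
For completeness, a private endpoint for the blob sub-resource of a storage account can be created with the Azure CLI roughly as follows (every name and the resource ID are placeholders, and the exact parameter names can differ slightly between CLI versions):

    az network private-endpoint create \
      --resource-group my-rg \
      --name pe-blob \
      --vnet-name my-vnet \
      --subnet my-subnet \
      --private-connection-resource-id "/subscriptions/<sub-id>/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mystorageaccount" \
      --group-id blob \
      --connection-name pe-blob-connection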


How does this help in our data transfer scenario? A lot. The private endpoint that I will attach to my target Blob storage will have a private IP address belonging to an IP range that is advertised via the BGP protocol over my Express Route connection to my corporate network. In other words, if I use a tool like azcopy or Azure Storage Explorer from a computer inside my network and target such a private endpoint, the connection will utilize my Express Route connection and its bandwidth. This is exactly what we wanted :)
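
Concretely, nothing changes in the command itself; azcopy still targets the ordinary blob URL, and it is only name resolution and routing that send the traffic down the Express Route path (account, container, and SAS are placeholders):

    azcopy copy "D:\archive" "https://mystorageaccount.blob.core.windows.net/archive?<SAS>" --recursive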


You can keep the 'Integrate with Private DNS zone' option enabled, so the wizard also creates a private DNS zone (part of the Azure DNS service offering). However, depending on your scenario, making this private zone available from your on-premises network requires additional configuration and careful planning. We will use a simpler option to make it work for our use case.


There is one more step you need to do to complete the setup. As I mentioned above, the name resolution is a complex topic in hybrid DNS scenarios, where you are in many cases bringing your existing DNS to Azure VNets while trying to utilize Azure DNS private zones. I will leave this topic for another article :).


For now, we will use the simplest option available: we will modify the hosts file on our source server (e.g. the machine where we have our archive mounted) and add an entry that resolves our blob endpoint to the private IP address of our Private endpoint. You can get both values from the Private endpoint resource.
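
As an illustration, assuming the private endpoint received the address 10.1.2.4, the entry on a Linux source server could be added like this (on Windows the file lives at C:\Windows\System32\drivers\etc\hosts and the line itself is identical):

    echo "10.1.2.4  mystorageaccount.blob.core.windows.net" | sudo tee -a /etc/hosts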


This might look like a lot of work, but provisioning and configuring that handful of resources took me approximately 15 minutes (again, assuming that all the "plumbing" was done beforehand), so it is definitely worth it.


@jeevith Yep, here it is

Create Blob Container: Invalid Blob Container Name XXXXXXX/test. Please check Naming and Referencing Containers, Blobs, and Metadata - Azure Storage Microsoft Learn for more information
