How to sync only new or changed/modified files from an AWS bucket to a GCP bucket?


Liran Gabay

Jul 4, 2017, 6:34:57 AM
to gce-discussion
Hi,
I copied all of my AWS S3 bucket to GCP a month ago using Google Cloud Storage Transfer.
I now want to sync only new or changed/modified objects from S3 to my bucket on Google Cloud Storage, so that I don't have to pay again to transfer all the data in the bucket.
Is the command below the best way to do it?
gsutil -m rsync -r s3://bucket gs://bucket
Thanks

Faizan (Google Cloud Support)

Jul 4, 2017, 7:49:42 PM
to gce-dis...@googlegroups.com
Hello Liran,

The gsutil rsync command is the best way to sync contents between a source and a destination. You need to make sure gsutil has access to the files at both the source and the destination. Also, if you are synchronizing a large amount of data between clouds, you might consider setting up a Google Compute Engine instance and running gsutil there. Since cross-provider gsutil data transfers flow through the machine where gsutil is running, doing this can make your transfer run significantly faster than running gsutil on your local workstation.
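
For example, gsutil can pick up AWS credentials from the [Credentials] section of the ~/.boto file on the machine where it runs (the key names are the standard boto settings; the values below are placeholders):

[Credentials]
aws_access_key_id = YOUR_AWS_ACCESS_KEY_ID
aws_secret_access_key = YOUR_AWS_SECRET_ACCESS_KEY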

You can also take advantage of gsutil rsync flags; for example, the -n flag runs rsync in dry-run mode, which lists the files that would be copied or deleted without actually doing so. For more information you can refer to this link [1].
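
A typical workflow (the bucket names below are placeholders) is to preview the sync first and then run it for real:

# Dry run: show what would be copied or deleted, without transferring anything
gsutil -m rsync -r -n s3://your-s3-bucket gs://your-gcs-bucket

# If the preview looks right, run the actual sync
gsutil -m rsync -r s3://your-s3-bucket gs://your-gcs-bucket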

I hope that helps.

Faizan

Liran Gabay

Jul 5, 2017, 7:38:15 AM
to gce-discussion
Hi Faizan,
When I ran the command `gsutil -m rsync -r s3://photos gs://photo`, I got the following message:
Non-MD5 etag ("ba1264cfb028c7cb60d6e98e0da11c15-1") present for key <Key: gs_photos,001ef37b27e845af0/216cb6b9bd729ec.jpg>, data integrity checks are not possible.

# gsutil --version
gsutil version: 4.26
Any thoughts?
Thanks

Liran Gabay

Jul 5, 2017, 7:51:38 AM
to gce-discussion
From what I understand, that warning means there is no way to ensure that the data downloaded from S3 is actually the same data stored on GCP, because S3 does not provide an MD5 or other standard hash to check against.
So there is no real way around this?
Thanks

Liran Gabay

Jul 5, 2017, 10:15:33 AM
to gce-discussion
I cannot submit a support request ticket because Google support allows only up to 1000 characters :/ so I cannot explain everything in one ticket, weird... :/
My goal is to copy only new or changed/modified files from the AWS bucket to the GCP bucket, so that I don't have to pay again and again to transfer all the data from AWS to GCP.
It seems that if I repeat the same process of copying the AWS S3 bucket to GCP using Google Cloud Storage Transfer, it copies only the new changes. Can anyone from Google please confirm that this is the right way to do it?
Sorry, but I could not find the answer to this question in the GCP documentation.
Thanks

Faizan (Google Cloud Support)

Jul 5, 2017, 4:25:05 PM
to gce-discussion
On Wednesday, July 5, 2017 at 10:15:33 AM UTC-4, Liran Gabay wrote:
I cannot submit a support request ticket because Google support allows only up to 1000 characters :/ so I cannot explain everything in one ticket, weird... :/
My goal is to copy only new or changed/modified files from the AWS bucket to the GCP bucket, so that I don't have to pay again and again to transfer all the data from AWS to GCP.
It seems that if I repeat the same process of copying the AWS S3 bucket to GCP using Google Cloud Storage Transfer, it copies only the new changes. Can anyone from Google please confirm that this is the right way to do it?
Sorry, but I could not find the answer to this question in the GCP documentation.
Thanks

Cloud Storage Transfer Service can be used to synchronize data, as mentioned here [1] (a rough example of creating such a transfer job follows the list below):

Cloud Storage Transfer Service has options that make data transfers and synchronization between data sources and data sinks easier. For example, you can:

  • Schedule one-time transfers or recurring transfers.
  • Delete existing objects in the destination bucket if they don't have a corresponding object in the source.
  • Delete source objects after transferring them.
  • Schedule periodic synchronization from data source to data sink with advanced filters based on file creation dates, file-name filters, and the times of day you prefer to import data.
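
As an illustration only (the project ID, dates, bucket names, and AWS keys below are placeholders, and this is a rough sketch rather than an official recipe), a one-time transfer job can be created through the Storage Transfer REST API. With the default transfer options the service copies an object only if it is missing from the sink or differs from the source version:

# Create a one-time S3 -> GCS transfer job (all values are placeholders)
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://storagetransfer.googleapis.com/v1/transferJobs \
  -d '{
    "description": "Incremental S3 to GCS sync",
    "status": "ENABLED",
    "projectId": "my-project",
    "schedule": {
      "scheduleStartDate": {"year": 2017, "month": 7, "day": 6},
      "scheduleEndDate": {"year": 2017, "month": 7, "day": 6}
    },
    "transferSpec": {
      "awsS3DataSource": {
        "bucketName": "my-s3-bucket",
        "awsAccessKey": {"accessKeyId": "AWS_ACCESS_KEY_ID", "secretAccessKey": "AWS_SECRET_ACCESS_KEY"}
      },
      "gcsDataSink": {"bucketName": "my-gcs-bucket"}
    }
  }'

Setting scheduleStartDate equal to scheduleEndDate should make the job run only once; leaving scheduleEndDate out makes it recur.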



On Wednesday, July 5, 2017 at 2:51:38 PM UTC+3, Liran Gabay wrote:
From what I understand, that warning means there is no way to ensure that the data downloaded from S3 is actually the same data stored on GCP, because S3 does not provide an MD5 or other standard hash to check against.
So there is no real way around this?
Thanks

The warning you have received is expected, for the same reason you mentioned: objects uploaded to S3 via multipart upload carry an ETag with a "-N" suffix rather than an MD5 hash, so gsutil cannot perform an MD5 integrity check on them.

Liran Gabay

Jul 5, 2017, 5:20:25 PM
to gce-discussion
Thank you. I am just trying to be careful: I have 50 TB of data, and I want to move part of my system from AWS to Google, but first I need to be sure that Cloud Storage Transfer copies only new or changed/modified files, because I cannot pay for the whole transfer again. The filters are not so good for my case, because there might be a situation where a file was created 3 months ago but was changed 2 weeks ago.
I saw the below in the documentation.

By default, Cloud Storage Transfer Service copies a file from the data source if the file doesn't exist in the data sink or if the version in the source differs from the version in the sink. The default is also to retain files in the source after the transfer.

Liran Gabay

Jul 5, 2017, 5:29:28 PM
to gce-discussion
The warning below appears only when I try to sync from the root of the bucket with `gsutil -m rsync -r s3://photos gs://photo`; if I sync a prefix instead, for example
`gsutil -m rsync -r s3://photos/aaa gs://photo/aaa`, it works.

Non-MD5 etag ("ba1264cfb028c7cb60d6e98e0da11c15-1") present for key <Key: gs_photos,001ef37b27e845af0/216cb6b9bd729ec.jpg>, data integrity checks are not possible.
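
For reference, syncing prefix by prefix could be scripted roughly like this (the prefix names are placeholders for the top-level folders in the bucket):

# Sync each top-level prefix separately instead of the bucket root
for prefix in aaa bbb ccc; do
  gsutil -m rsync -r "s3://photos/${prefix}" "gs://photo/${prefix}"
done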