get_url support for comparing remote file size against local file size

William Jimenez

Jan 7, 2014, 5:33:34 PM
to ansible...@googlegroups.com
Hi Guys
I submitted a pull request today but wanted to provide some background on the use case. I ran into some situations where it was desirable to update an artifact in a content repository (say, Artifactory) and then re-run Ansible as is and have that updated artifact pulled down. Currently, I don't see a way to do this with get_url, since it skips the download if it already sees the file on disk. So after talking about it with the team, we came up with an idea to detect these changes. The method is low cost: it doesn't require any MD5 or SHA hashing of the local file on each run, and for the remote side it just makes an HTTP header request for the file size. There is a small chance that this method could return a false positive when different files happen to be the same size, but it's a trade-off for speed. Looking forward to your feedback!
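
For readers following along, here is a rough Python sketch of the size check described above. It is not the code from the pull request; the function names `remote_content_length` and `size_changed` are made up for illustration, and it assumes the server answers a HEAD request with a Content-Length header.

```python
import os
import urllib.request


def remote_content_length(url, timeout=10):
    """Return the Content-Length the server reports for url, or None.

    Uses a HEAD request, so only headers are transferred.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def size_changed(local_path, remote_length):
    """Return True when the file should be fetched again, judging by size alone."""
    if not os.path.exists(local_path):
        return True  # nothing on disk yet
    if remote_length is None:
        return True  # server gave no size, so be conservative and re-fetch
    return os.path.getsize(local_path) != remote_length
```

A caller would pass the result of `remote_content_length(url)` into `size_changed(dest, ...)`. Two different files of identical size still compare as unchanged, which is the trade-off mentioned above.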


Thanks,

William

Michael DeHaan

Jan 7, 2014, 5:48:07 PM
to ansible...@googlegroups.com
Was the time spent doing something SHA-related actually the problem, or was it more a problem of needing to calculate the SHA?




--
You received this message because you are subscribed to the Google Groups "Ansible Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ansible-proje...@googlegroups.com.
To post to this group, send email to ansible...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--
Michael DeHaan <mic...@ansibleworks.com>
CTO, AnsibleWorks, Inc.
http://www.ansibleworks.com/

William Jimenez

Jan 7, 2014, 5:55:58 PM
to ansible...@googlegroups.com
The latter. If we have to SHA/MD5 a 500M+ file every time we run Ansible, the thought was that it would be too slow.


--
You received this message because you are subscribed to a topic in the Google Groups "Ansible Project" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/ansible-project/SmmJg3BSArY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to ansible-proje...@googlegroups.com.

Michael DeHaan

Jan 7, 2014, 6:49:31 PM
to ansible...@googlegroups.com
Seems you would want to run it once and then put that value in the playbook.

If the SHA/MD5 was of a build product, maybe it could be generated by Jenkins as an artifact?

Just playing Devil's advocate somewhat -- I suspect people *may* raise the "but... but... file sizes aren't secure" complaint without seeing the SHA option.  We could also of course just make sure this was very very very very obvious in the docs, but I'm also a little wary of including a feature for a specific use case, so if there's a more mainstream way, we should perhaps see if it's workable first?

Thoughts welcome.
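
As an illustration of "run it once and then put that value in the playbook", a small Python sketch like the following could compute the digest a single time, for example as a build step. The helper name `file_sha256` is hypothetical; reading in chunks keeps memory use flat even for a 500M+ file.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting string is what would be pasted into the playbook or published next to the build artifact.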


William Jimenez

Jan 8, 2014, 1:56:46 PM
to ansible...@googlegroups.com
Those are good points. I guess the challenge is how much time you are willing to spend when running playbooks to compute SHA/MD5 sums of files on disk. If we are OK spending that time, then we could just as easily have the conditionals do that. I was looking at the file module to see what approach is used elsewhere, and from what I can tell, Ansible isn't computing hashes; it is just asserting that the file isn't already there. So doing this in get_url seems to be a new thing for file-related modules.

What about adding both options, file length or MD5/SHA, for checking files? Both would be optional, and each offers a different trade-off between cost during the playbook run and level of accuracy.

This is just me thinking out loud so please point out any oversight in my logic.

Brian Coca

Jan 8, 2014, 2:35:53 PM
to ansible...@googlegroups.com
File sizes can be tricky: most systems report how much space the file occupies, which is normally larger than the exact bit count of the file itself (be it fs blocks, packets or encoded base64/mime encapsulated/armored/etc).

Chad Scott

Jan 8, 2014, 2:59:29 PM
to ansible...@googlegroups.com
I hit the wrong reply option previously. Sorry for the duplicate in private, Brian.

I see this pretty rarely. It's normally the length of the file in bytes.

I think we're overthinking this. I had originally suggested to William that comparing the timestamps and length would be a super inexpensive way to determine conclusively we needed to download the file without having to download the hash file and do a hash locally. Obviously, if all of those indicators agree, we'd need to do that anyway.

If the math is too hard, just hash it each time.
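
A minimal Python sketch of the timestamp-plus-length idea, under stated assumptions: the function name `definitely_changed` is hypothetical, and the remote values are assumed to come from the Content-Length and Last-Modified headers of a HEAD response. A True result means the cheap indicators already prove a download is needed; a False result is inconclusive, which is the case where a hash would still be required.

```python
import os
from email.utils import parsedate_to_datetime


def definitely_changed(local_path, remote_length, remote_last_modified):
    """Return True when cheap metadata proves the remote file differs.

    remote_length: int or None (Content-Length header value).
    remote_last_modified: str or None (Last-Modified header value).
    False means "cannot tell"; a hash comparison is still needed.
    """
    if not os.path.exists(local_path):
        return True
    info = os.stat(local_path)
    if remote_length is not None and info.st_size != remote_length:
        return True
    if remote_last_modified:
        try:
            remote_mtime = parsedate_to_datetime(remote_last_modified).timestamp()
        except (TypeError, ValueError):
            return False  # unparseable date, treat as inconclusive
        if remote_mtime > info.st_mtime:
            return True
    return False
```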

Michael DeHaan

Jan 8, 2014, 4:52:03 PM
to ansible...@googlegroups.com
Brian,

Are you sure that's true of os.stat() and not just the shell commands?

Seems like it would not be.

I think I'm ok with adding a bytes= parameter if we can get around this question.





Brian Coca

Jan 8, 2014, 5:26:11 PM
to ansible...@googlegroups.com
I've been burned by this before; stat is supposed to return the size of the file in bytes.

I haven't really checked in a long time, as I've grown accustomed to not relying on size for file comparisons, but my issues stemmed from some tools/implementations using # of blocks * block_size to measure the file size.
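
To make the distinction concrete, here is a small Python sketch (the helper name `size_report` is made up) that returns both numbers for a file: `st_size`, the length in bytes, and the allocated space derived from `st_blocks`, which Python documents as a count of 512-byte blocks. `st_blocks` is not available on every platform, so the second value may be None.

```python
import os


def size_report(path):
    """Return (length_in_bytes, allocated_bytes_or_None) for path."""
    info = os.stat(path)
    # st_blocks counts 512-byte blocks on POSIX systems; it is missing on Windows
    blocks = getattr(info, "st_blocks", None)
    allocated = blocks * 512 if blocks is not None else None
    return info.st_size, allocated
```

For a 1000-byte file the first value is 1000, while the second depends on the filesystem and is often rounded up to a whole block. A size comparison against an HTTP Content-Length would want the first value.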



--
Brian Coca
Stultorum infinitus est numerus
0110000101110010011001010110111000100111011101000010000001111001011011110111010100100000011100110110110101100001011100100111010000100001
Pedo mellon a minno

William Jimenez

Jan 10, 2014, 2:28:44 PM
to ansible...@googlegroups.com
After more thought on this and further discussion, it seems that MD5 is the right way to do this. The use case that I am designing for, however, requires the repo manager Artifactory, because it relies on the MD5 sums that Artifactory generates for each artifact (other similar tools likely have the same feature). With that in mind, it might make more sense to create an artifactory module that incorporates the MD5 sum but otherwise has essentially the same functionality as get_url. This also addresses the concern about the increased cost of running get_url with an MD5 sum, because the documentation can make clear that the artifactory module does this validation, as opposed to straight get_url.

It also sets up a framework for future Artifactory-specific features by having a new module for it.

What do you all think about this?
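
A rough Python sketch of what the MD5 validation might look like, not the proposed module itself. All three function names are hypothetical, and it assumes the repository publishes the sum as text whose first whitespace-separated token is the hex digest (the common Maven-style `.md5` sidecar layout); how that text is fetched is left out.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_text(text):
    """Return the first token of a published checksum text, lowercased.

    Assumption: the text is either just the digest or "digest  filename".
    """
    tokens = text.split()
    return tokens[0].lower() if tokens else ""


def matches_published_md5(local_path, published_text):
    """Return True when the local file's MD5 equals the published one."""
    return file_md5(local_path) == parse_checksum_text(published_text)
```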





--
William Jimenez | Site Reliability Engineer | Office 415.442.8434 | Mobile 650.241.8470

AppDynamics