Hackfest tomorrow night at thoughtbot!

Dan Croak

Aug 3, 2009, 4:05:19 PM
to boston-r...@googlegroups.com
Reminder:

http://bostonrb.org/events/94

All are welcome. Learn, hack, meet, teach, share.

--
Dan Croak
@Croaky

Philipp Hanes

Aug 4, 2009, 11:50:59 AM
to boston-r...@googlegroups.com
I don't remember who I was talking to at a previous thoughtbot hackfest about a big "database-in-the-sky" containing file hashes and meta-information about those files.
I'd like to explore that idea a little further, if anyone is interested.
Probably mostly discussion and only minimal coding, but maybe the bare bones of a Sinatra app or something could come out of it. I suspect most of the work on a project like this would be in the clients, of various flavors and languages, that gather the information, but I might be wrong.

Keenan Brock

Aug 4, 2009, 12:46:44 PM
to boston-r...@googlegroups.com
Wasn't there but

Philipp Hanes

Aug 4, 2009, 2:56:56 PM
to boston-r...@googlegroups.com
I probably completely misrepresented the concept :-)
The idea was about generating a hash for every file on your hard disk, and uploading the hash, and meta-information about that file, to a server somewhere.
Eventually, there would be an aggregate consensus on what a particular file hash "means".  And how common it is out there.
So I could, for example, easily identify the 99.99% of files on my system that are known, valid Windows files, perhaps find out what they are supposed to be doing, and get suspicious about the 0.01% of items that no one else has ever seen before, and which might therefore be infected with a virus.
It could also have as a convenient side effect the ability to tell you that you have umpteen copies of the same file on your file system.
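
As a rough illustration of the client side described above, here is a minimal Ruby sketch that hashes every regular file under a directory and reports the groups of paths with identical content. The method name `duplicate_files` and the choice of SHA-256 are assumptions made for this example; the thread does not settle on an algorithm. Unreadable files would raise an error here, which a real client would need to handle.

```ruby
require "digest"
require "find"

# Hash every regular file under root and group paths by content digest.
# Returns only the groups that contain more than one path (the duplicates).
# NOTE: duplicate_files and the use of SHA-256 are illustrative assumptions.
def duplicate_files(root)
  by_digest = Hash.new { |hash, key| hash[key] = [] }
  Find.find(root) do |path|
    # Skip directories, symlinks, and other non-regular entries.
    next unless File.file?(path) && !File.symlink?(path)
    by_digest[Digest::SHA256.file(path).hexdigest] << path
  end
  by_digest.select { |_digest, paths| paths.size > 1 }
end
```

Calling `duplicate_files(".")` returns a hash from digest to the list of paths sharing that digest; the same digests are what a client would upload along with its meta-information.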

With some additional specialized file-type information, the clients could glean more specific stuff (e.g. this is the same MP3 file as that one, but someone modified the ID3 tags).
Or they could look inside archives.

Open to suggestions as to similar existing projects, or uses for it.

    Philipp

Steve Morss

Aug 4, 2009, 5:39:55 PM
to boston-r...@googlegroups.com
Git does something very much like the hashing you are describing. It
creates a unique hash for every file (blob) stored in its database. It
then knows the file's contents just by knowing its hash. It uses a big
hash (160 bits) and it uses SHA-1 hashing, so it's virtually
impossible to create 2 different files with the same hash (by mistake or
by intent).

If you used the same hashing algorithm as Git, you'd get a known good
method for creating hashes, and your database of hash values could be
hugely augmented by finding and adding the millions of hash values
already out there in all the Git repositories of the web.

Steve

Philipp Hanes

Aug 5, 2009, 4:09:16 PM
to boston-r...@googlegroups.com
Interesting.
Git does a bit of prefixing (with "blob ", the file size, and a binary 0) before it calculates the SHA-1.
I think I'd want some straight-up checksum, as well.
Then again, it may actually be worth collecting multiple checksums, anyway.
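
To make the difference concrete, here is a small Ruby sketch that computes both the Git-style blob hash (SHA-1 over the "blob <size>\0" header plus the content) and a straight SHA-1 of the bytes alone. The method names `git_blob_sha1` and `plain_sha1` are made up for this example.

```ruby
require "digest"

# Git's object ID for a blob: SHA-1 over the header "blob <byte size>\0"
# followed by the raw file content.
# NOTE: git_blob_sha1 and plain_sha1 are illustrative names, not a Git API.
def git_blob_sha1(content)
  Digest::SHA1.hexdigest("blob #{content.bytesize}\0" + content.b)
end

# Straight-up SHA-1 of the file bytes, with no header.
def plain_sha1(content)
  Digest::SHA1.hexdigest(content)
end

data = File.exist?(__FILE__) ? File.binread(__FILE__) : ""
puts "git blob: #{git_blob_sha1(data)}"
puts "plain:    #{plain_sha1(data)}"
```

The first value should match what `git hash-object` prints for the same file, while the second matches ordinary checksum tools, so a database that wants to line up with both would need to store both.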

Note that all of this is pretty expensive on large hard drives full of data, since the utility would have to read every single byte, and do the calculation on them.

I wonder if there are hooks into other systems that hold on to checksums (like corporate intrusion detection systems and the like).

Philipp

Wyatt Greene

Aug 5, 2009, 4:15:30 PM
to boston-r...@googlegroups.com
Intensive, yes.  It strikes me as about the same intensity as scanning for viruses:  something that takes a long time running in the background to accomplish.  But once you index everything, you could have the OS tell you whenever a file changed, which would be less resource-heavy.
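
A minimal Ruby sketch of the "index once, then only redo what changed" idea follows. It does not use OS change notifications; as a simpler stand-in it compares each file's size and modification time against the stored index and re-hashes only on a mismatch. The method name `refresh_index`, the index layout, and the use of SHA-256 are all assumptions made for this example.

```ruby
require "digest"
require "find"

# index maps path => { size:, mtime:, digest: }.
# A file is re-hashed only when its size or mtime differs from the stored
# entry; otherwise the old entry is reused without reading the file.
# NOTE: this is a stat-based stand-in for real OS change notifications.
def refresh_index(root, index = {})
  fresh = {}
  Find.find(root) do |path|
    next unless File.file?(path) && !File.symlink?(path)
    stat = File.stat(path)
    old = index[path]
    if old && old[:size] == stat.size && old[:mtime] == stat.mtime
      fresh[path] = old
    else
      fresh[path] = { size: stat.size, mtime: stat.mtime,
                      digest: Digest::SHA256.file(path).hexdigest }
    end
  end
  fresh # paths that vanished from disk are dropped from the new index
end
```

The first call with an empty index reads every byte; later calls such as `index = refresh_index(root, index)` only stat each file. The trade-off is that a change preserving both size and mtime would go unnoticed.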

Philipp Hanes

Aug 5, 2009, 6:35:50 PM
to boston-r...@googlegroups.com
Interesting, yeah, I guess people would have to be willing to have yet another constantly-running process sitting around.  I already hate all the stuff that's sucking up my CPU and I/O etc.

That said, there may be hooks into virus scanners, too.  Not that I would expect to be able to access that :-/

Another interesting challenge might be how to deal with moving files around.  Again, if you're having the OS inform you of changes, it's not too bad.  Though the tools might get confused if someone goes behind their back at some point.  Or things on a shared file system.

Wyatt Greene

Aug 5, 2009, 6:39:29 PM
to boston-r...@googlegroups.com
It sounds like functionality that might fit better *in* an operating system instead of on top of it.  But then you couldn't do it in Ruby. ;)