Re: Sub chunks


dcs dcs

Apr 21, 2011, 12:12:11 PM
to Google Safe Browsing API
As far as I know, sub chunks are a white list, so I just append them
to my database. I never try to delete the add chunks that sub chunks
refer to through addChunkNum. I've been testing with the data downloaded
from Google. After the first download I got the following sub chunk:

chunkNum  Host key    Prefix      AddChunkNum  time
48787     0x85A1C960  0x613D643C  36280        2011-04-21 10:14:23.423  Y

Then I used "goog-malware-shavar;a:34881-36338:s:47361-48787" for my
subsequent download and got the same record again, only with a different
timestamp, as follows:

chunkNum  Host key    Prefix      AddChunkNum  time
48787     0x85A1C960  0x613D643C  36280        2011-04-21 10:49:56.680  Y


Why are there duplicate records like this on subsequent pulls? Should I try
to delete the add chunk from my database?

Garrett Casto

Apr 21, 2011, 3:12:01 PM
to google-safe-...@googlegroups.com
Because of the way that we cache, we will occasionally send you chunks that you already know about. You can safely ignore a chunk number that you already know about; the new chunk will not contain any new data.



dcs dcs

Apr 21, 2011, 3:59:18 PM
to google-safe-...@googlegroups.com
Thanks, but it's not practical to check whether the chunk number is
already in the database for every new chunk that comes in. I would rather
let the duplicates stay in the database. What's your suggestion? I now
have 132 distinct chunks with duplicates out of a total of 17515
chunks in my database. Is this reasonable? I did not expect that
you could send duplicates of the same chunk.

Regards,

dcs dcs

Garrett Casto

Apr 21, 2011, 5:12:09 PM
to google-safe-...@googlegroups.com
I'm not sure why it's not practical to check for every new chunk. You are going to want your database to be keyed by chunk number anyway, since we tell you to delete whole chunks at a time. If you just want to keep adding duplicate entries, that's fine too. The information contained will be the same; it's just wasteful. Having 132 duplicate entries does not seem unreasonable to me.
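
To illustrate the point about keying by chunk number, here is a minimal in-memory sketch. The `ChunkStore` class and its method names are hypothetical, invented for this example; it only assumes that a chunk is identified by list name, chunk type ("a" or "s") and chunk number, and that each entry is a (host key, prefix, add chunk number) tuple like the rows shown earlier in the thread.

```python
class ChunkStore:
    """Hypothetical store keyed by (list name, chunk type, chunk number)."""

    def __init__(self):
        # Maps (list_name, chunk_type, chunk_num) -> list of entries.
        self._chunks = {}

    def add_chunk(self, list_name, chunk_type, chunk_num, entries):
        """Store a chunk; return False if that chunk number is already known."""
        key = (list_name, chunk_type, chunk_num)
        if key in self._chunks:
            # A resent chunk carries no new data, so it is simply ignored.
            return False
        self._chunks[key] = list(entries)
        return True

    def delete_chunk(self, list_name, chunk_type, chunk_num):
        """Drop a whole chunk at once; return True if it was present."""
        return self._chunks.pop((list_name, chunk_type, chunk_num), None) is not None

    def __len__(self):
        return len(self._chunks)


store = ChunkStore()
entry = (0x85A1C960, 0x613D643C, 36280)  # host key, prefix, add chunk number
store.add_chunk("goog-malware-shavar", "s", 48787, [entry])  # stored
store.add_chunk("goog-malware-shavar", "s", 48787, [entry])  # duplicate, ignored
```

Because the lookup is a dictionary access, the duplicate check costs about the same as the insert itself, and deleting a whole chunk is a single operation.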

Sam C

Apr 22, 2011, 10:10:44 AM
to Google Safe Browsing API
Surely if you're indexing your database by chunk number (the obvious
way to me), then it's not a huge operation to check if that chunk
already exists (because they're indexed). Obviously if you're using a
relational database model then keying the chunk number as unique would
also work (and the insert operation would just fail). In my opinion
having duplicates in any database is wasteful and can always be
avoided; if you didn't want to check on insert for duplicates then
have a cleanup script that runs periodically to check.
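
As a rough sketch of the unique-key approach, using Python's built-in `sqlite3` module: the table and column names below are assumptions made up for this example, not anything defined by the Safe Browsing API. The chunk identity goes into a primary key, so a resent chunk makes the insert fail and is skipped.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the (hypothetical) schema: one row per chunk, many entries per chunk."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS chunks (
               list_name  TEXT    NOT NULL,
               chunk_type TEXT    NOT NULL,
               chunk_num  INTEGER NOT NULL,
               PRIMARY KEY (list_name, chunk_type, chunk_num))"""
    )
    conn.execute(
        """CREATE TABLE IF NOT EXISTS entries (
               list_name     TEXT    NOT NULL,
               chunk_type    TEXT    NOT NULL,
               chunk_num     INTEGER NOT NULL,
               host_key      INTEGER NOT NULL,
               prefix        INTEGER,
               add_chunk_num INTEGER)"""
    )
    return conn


def store_chunk(conn, list_name, chunk_type, chunk_num, entries):
    """Insert a chunk and its entries; return False if the chunk already exists."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO chunks VALUES (?, ?, ?)",
                (list_name, chunk_type, chunk_num),
            )
            conn.executemany(
                "INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)",
                [(list_name, chunk_type, chunk_num, h, p, a) for h, p, a in entries],
            )
    except sqlite3.IntegrityError:
        # Primary key violation: this chunk number is already stored.
        return False
    return True


conn = open_db()
row = (0x85A1C960, 0x613D643C, 36280)  # host key, prefix, add chunk number
store_chunk(conn, "goog-malware-shavar", "s", 48787, [row])  # stored
store_chunk(conn, "goog-malware-shavar", "s", 48787, [row])  # rejected as duplicate
```

Since the primary key is indexed, the database does the existence check as part of the insert, so no separate lookup or cleanup script is needed.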

dcs dcs

Apr 22, 2011, 12:04:33 PM
to google-safe-...@googlegroups.com
This does not pose much of a problem on my side either. (I was just
surprised that they can send dups for their own reasons, and it's also
not in their documentation.)

1) I read in the Google documentation that each subsequent download will
have the add chunk number incremented (correct me if I read it wrongly),
but when I checked the database, the new download had add chunk numbers
lower than those in my previous download. Can you explain why?

2) After more than 20 downloads from Google for both malware and
phish, I only have about 15,000 unique add chunks. Is this a
reasonable number of chunks to have? In addition, it is now
only downloading 3 chunks at a time. I wonder whether something is
wrong: why aren't more chunks being downloaded, since some
document says you should have about 773,404 add chunks in all?


Thanks!

Garrett Casto

Apr 22, 2011, 4:48:17 PM
to google-safe-...@googlegroups.com
On Fri, Apr 22, 2011 at 9:04 AM, dcs dcs <ddc...@gmail.com> wrote:
> This does not pose much of a problem on my side either. (I was just
> surprised that they can send dups for their own reasons, and it's also
> not in their documentation.)
>
> 1) I read in the Google documentation that each subsequent download will
> have the add chunk number incremented (correct me if I read it wrongly),
> but when I checked the database, the new download had add chunk numbers
> lower than those in my previous download. Can you explain why?


We don't guarantee that the chunk numbers you get will be increasing. In fact, if you aren't up to date you will get the newest chunks first, so you may just get chunks with smaller numbers on your next run. The spec does say that "Chunk numbers within the same chunk type grow increasingly, without gaps.", but we don't specify the order in which you will receive them.
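
Since chunks can arrive in any order, one way to build the list state for the next request is to collect the chunk numbers you hold into a set and collapse them into ranges. The sketch below is an assumption-laden illustration: `to_ranges` and `list_state` are hypothetical helper names, the output follows the "goog-malware-shavar;a:34881-36338:s:47361-48787" shape quoted earlier in this thread, and the comma used to separate multiple ranges is an assumption that is not shown in the thread.

```python
def to_ranges(chunk_nums):
    """Collapse chunk numbers (any order, duplicates allowed) into 'a-b,c' form."""
    nums = sorted(set(chunk_nums))
    if not nums:
        return ""
    ranges = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:
            prev = n  # still inside a consecutive run
            continue
        ranges.append((start, prev))  # run ended, start a new one
        start = prev = n
    ranges.append((start, prev))
    # Assumption: multiple ranges are joined with commas.
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def list_state(list_name, add_nums, sub_nums):
    """Build a 'name;a:...:s:...' line from the add and sub chunk numbers held."""
    parts = []
    add_part = to_ranges(add_nums)
    sub_part = to_ranges(sub_nums)
    if add_part:
        parts.append("a:" + add_part)
    if sub_part:
        parts.append("s:" + sub_part)
    return list_name + ";" + ":".join(parts)


# Arrival order does not matter: newer chunks may be stored before older ones.
state = list_state(
    "goog-malware-shavar",
    list(range(36000, 36339)) + list(range(34881, 36000)),
    range(47361, 48788),
)
print(state)
```

Using a set also absorbs resent chunks, so a duplicate chunk number never changes the ranges reported.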
 
> 2) After more than 20 downloads from Google for both malware and
> phish, I only have about 15,000 unique add chunks. Is this a
> reasonable number of chunks to have? In addition, it is now
> only downloading 3 chunks at a time. I wonder whether something is
> wrong: why aren't more chunks being downloaded, since some
> document says you should have about 773,404 add chunks in all?


As of right now, the valid chunk ranges are 29086-36395 for malware and 132422-136695 for phishing. So you should have closer to 20,000 chunks at the moment, not 773,404. Where did you get that number? If you don't have all of the chunks I mentioned (or close to it, as the numbers are changing constantly), you can ping me and I can see whether there is a bug in the server, but I would be very surprised if a bug of this type existed.

dcs dcs

Apr 28, 2011, 11:49:03 AM
to google-safe-...@googlegroups.com
Hi all:

Thank you for clarifying my concerns.

The number of chunks I have is now about 20,000. I took the earlier
number from a web site that is one of the few documents available
on the internet, but it is wrong.

1) My chunk numbers for phish run from 92450-99557 for add
chunks and 2634-3023 for sub chunks. Why are they so far apart from your
numbers? Is there anything wrong with my phish data? Why are the sub chunk
numbers so small (which seems rather meaningless)? The malware part
is very close to your numbers.

Regards,

DCS DCS
