Re: why duplicate host, and prefix records in different add chunks

dcs dcs

May 24, 2011, 6:10:20 PM
to google-safe-...@googlegroups.com
Hi all:

I am puzzled by what I've just found in my database for Google Safe
Browsing v2. I have the same host key and prefix in two different add
chunks.

Chunk_num  Host_key    Prefix      creation_date
31992      0x0021DC6F  0x523AB964  2011-04-21 15:23:41:217
31800      0x0021DC6F  0x523AB964  2011-04-21 15:25:02:590


I have the saved data to prove that both records came from the same
download from Google. I believed there should be no duplicates across
different add chunks. Can someone explain to me whether this is okay?
On the other hand, it may be okay, since sub chunks have to reference
an add chunk number.

Thank you!

DCS DCS

Denis Ilguzin

May 25, 2011, 12:02:01 AM
to Google Safe Browsing API
Hi!

Well, actually it's not a big deal for prefixes to be duplicated in
different add chunks, so it's just a coincidence. Try to get the full
hashes for these prefixes; they should be different.
By the way, using chunks does not guarantee that prefixes are unique.
The main goal of chunks is to divide the whole list of prefixes into
batches whose entries can be disabled/enabled one by one or as a whole
batch. Another goal is to reduce the amount of data exchanged for
those who already have part of the GSB database (i.e. with chunks you
only tell Google which chunks you have, instead of sending all your
prefixes, before requesting new prefixes).

Denis
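
To illustrate the point about telling the server which chunks you hold, here is a minimal Python sketch that collapses chunk numbers into ranges. The function names are hypothetical, and the `list;a:...:s:...` layout is my assumption of how the v2 download request describes a list, so check it against the protocol spec before relying on it.

```python
def compress_chunks(chunk_nums):
    """Collapse chunk numbers into a range string such as '1-3,5,8-9'."""
    nums = sorted(set(chunk_nums))
    ranges = []
    i = 0
    while i < len(nums):
        start = end = nums[i]
        # Extend the current range while the next number is consecutive.
        while i + 1 < len(nums) and nums[i + 1] == end + 1:
            i += 1
            end = nums[i]
        ranges.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(ranges)


def describe_list(list_name, add_chunks, sub_chunks):
    """Build one line describing which add (a:) and sub (s:) chunks we hold.

    The layout is an assumption for illustration, not taken from this thread.
    """
    parts = []
    if add_chunks:
        parts.append("a:" + compress_chunks(add_chunks))
    if sub_chunks:
        parts.append("s:" + compress_chunks(sub_chunks))
    return f"{list_name};" + ":".join(parts)


print(describe_list("goog-malware-shavar", [1, 2, 3, 5], [4, 5]))
```

A few dozen bytes like this describe the client's whole copy of the list, which is the saving described above.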

Denis Ilguzin

May 25, 2011, 12:16:37 AM
to Google Safe Browsing API
The same goes for the Hostkey. I don't think it would be efficient to
store all identical Hostkeys in the same chunk. Once a chunk is created
on the GSB side, they never add prefixes to it, but they can delete
prefixes from it using a SUB chunk that references the ADD chunk
number. Telling Google which ADD and SUB chunk numbers you have gives
it full information about your copy of the database. If you think of
"Hostkey+Prefix" as a "key" for full hashes, it should be clear that
Hostkey+Fullhash will be unique for the whole database, so you won't
find a duplicated Hostkey+Fullhash in different chunks or anywhere in
the database.

Garrett Casto

May 25, 2011, 1:47:51 PM
to google-safe-...@googlegroups.com
This is mostly true. Two things to note. First, you may start an update with a hash in chunk 10, have it get deleted, and then have it re-added in chunk 50. That is, we try to keep only one "live" version of the hash, but it may move around. Second, in the phishing list it is possible, though rare, to have two live chunks that contain the same hostkey + fullhash. This is a by-product of the fact that we don't always SUB every hash that we take off of this list. Like I said, it's rather rare, but you should make sure that your implementation can handle this case.

Garrett
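
One way a client might cope with the same hostkey + prefix being live in two add chunks is to track the set of add chunks per key, so that a SUB removes only the copy it names. This is a minimal Python sketch under that assumption. The class and method names are hypothetical, and it ignores details such as a SUB arriving before its ADD.

```python
from collections import defaultdict


class PrefixStore:
    """Toy in-memory store that tolerates duplicates across add chunks."""

    def __init__(self):
        # (host_key, prefix) -> set of add chunk numbers currently carrying it
        self._live = defaultdict(set)

    def add(self, chunk_num, host_key, prefix):
        self._live[(host_key, prefix)].add(chunk_num)

    def sub(self, add_chunk_num, host_key, prefix):
        """Remove the entry from one specific add chunk only."""
        key = (host_key, prefix)
        chunks = self._live.get(key)
        if chunks is None:
            return
        chunks.discard(add_chunk_num)
        if not chunks:
            # No live copy is left, so drop the key entirely.
            del self._live[key]

    def live_chunks(self, host_key, prefix):
        return sorted(self._live.get((host_key, prefix), ()))


store = PrefixStore()
store.add(31992, 0x0021DC6F, 0x523AB964)
store.add(31800, 0x0021DC6F, 0x523AB964)
store.sub(31800, 0x0021DC6F, 0x523AB964)
print(store.live_chunks(0x0021DC6F, 0x523AB964))  # the copy in 31992 survives
```

With this layout the prefix stays listed until every add chunk that carries it has been subbed or deleted.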

dcs dcs

Jun 13, 2011, 11:20:39 AM
to google-safe-...@googlegroups.com
Hi Denis:

It seems that Google places no restrictions on the data it sends to
clients, which makes the database design inefficient for the client.
This is a minor issue, since we can spend some time to speed it up.

The main problem is when we have to update the full paths for these
prefixes. Suppose a prefix is deleted from one chunk, but the other
chunk with the same prefix is not deleted. Are we supposed to keep
that full path for the prefix, or do we have to pull it every time?
(It looks like we have to pull it every time, since we don't know
whether what we have is up to date, even from the last pull.)

Regards,

DCS DCS

dcs dcs

Jun 13, 2011, 12:10:32 PM
to google-safe-...@googlegroups.com
That's not true according to another reply I got. The phishing list can
have the same host+prefix in different chunks, and full paths depend
only on the prefix. So your assumption is invalid and will not work.
Plus, your database will fail.


Regards,

Sam C

Jun 14, 2011, 10:53:14 AM
to Google Safe Browsing API
I'll look up and sort by full-hashes; that way the records that
already have full-hashes will be returned first. This way, if I have
duplicate data in chunks, I'll get the most efficient one, with the
full-hash already cached.
If I don't have any full-hashes yet, I'll then look up the full-hash.
Google will respond with either a match or no match. If it responds
with a match, it'll provide the chunknum (you'll already have the
prefix). You can then just insert the full-hash where the prefix
matches the prefix and the chunknum matches the chunknum.

I've got a few extra records that /could/ be removed in my database,
but they have to stay there to keep all my chunks correct. I don't see
a few extra bytes of data as an issue. Am I missing the point?

--Sam

dcs dcs

Jun 14, 2011, 11:25:40 AM
to google-safe-...@googlegroups.com
Hi Sam:

Thanks for your time and reply. Since the full path depends only on
the prefix, we don't need to be concerned with the host key or chunk
number for that matter, only with the prefix, for efficiency. That
means this full path is for the same prefix in different chunks, which
may even have different host keys from the download. I am just
concerned that one prefix may have multiple full paths. (I guess that
could be the case from basic reasoning.) Tell me otherwise.

The problem is: how can you be sure the full paths you have are up to date?
By the way, don't do the sorting; that will slow down the search in SQL.


Regards,

DCS DCS

Garrett Casto

Jun 14, 2011, 2:35:32 PM
to google-safe-...@googlegroups.com
I'm not sure that I follow this; in particular, I don't know what you mean by full path (full hash?). Let me just give an example and hope that clears things up.

Current Database
ChunkNum   HostPrefixHash   HashPrefix   FullHash
1          1234             1111         11111111
2          1234             1111         N/A
3          5678             1111         11112222

Let's say that you are looking up a url that gives a HostPrefix of 1234 and a FullHash of 11112222. In this case you have to make a request to Google to see if chunk 2 matches, as chunk 1 doesn't (wrong FullHash) and chunk 3 doesn't (wrong HostPrefixHash).  Basically if there is any possible match in the database, you should make a hash request to see if there is a match. Does this help?
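
The decision in this example can be sketched in a few lines of Python. This is an illustration only: the `Row` and `classify` names are hypothetical, the hash values are the toy strings from the table, and it assumes a cached FullHash can be trusted as complete and fresh, which a real client would need to verify.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    chunk_num: int
    host_prefix: str
    hash_prefix: str
    full_hash: Optional[str]  # None stands for "N/A", i.e. not fetched yet


DB = [
    Row(1, "1234", "1111", "11111111"),
    Row(2, "1234", "1111", None),
    Row(3, "5678", "1111", "11112222"),
]


def classify(db, host_prefix, full_hash, prefix_len=4):
    """Return 'match', 'need_hash_request' or 'no_match' for one lookup."""
    hash_prefix = full_hash[:prefix_len]
    # Only rows agreeing on both host prefix and hash prefix are candidates.
    candidates = [
        r for r in db
        if r.host_prefix == host_prefix and r.hash_prefix == hash_prefix
    ]
    if any(r.full_hash == full_hash for r in candidates):
        return "match"
    # A candidate without a cached full hash could still match,
    # so the server has to be asked.
    if any(r.full_hash is None for r in candidates):
        return "need_hash_request"
    return "no_match"


print(classify(DB, "1234", "11112222"))  # need_hash_request, because of chunk 2
```

Chunk 1 is ruled out by its cached FullHash and chunk 3 by its HostPrefixHash, so chunk 2 alone forces the request.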

Sam C

Jun 15, 2011, 6:40:35 AM
to Google Safe Browsing API
That's the answer :) I need to sort out my implementation then! I
incorrectly presumed that if records had the same HostKey and Prefix
then they would more than likely be the same, but in fact they could
be different, and all results returned should be checked (even if the
FullHash on one doesn't match).

dcs dcs

Jun 15, 2011, 5:25:21 PM
to google-safe-...@googlegroups.com
Thanks for your time. I understand.


I am more concerned with efficiency. We have at most 30 prefixes
(5 hosts * 6 paths) to query the database with. Because hosts and
prefixes are duplicated across different chunks, I have no control
over how many records will come back to me or how many unique chunk
numbers will be in the query result. Can you guarantee that there will
be at most 30 records in the database, not counting full hashes?


Regards,

DCS DCS

dcs dcs

Jun 17, 2011, 4:22:11 PM
to google-safe-...@googlegroups.com
Hi Garrett:

I tested a bunch of sites that host malware, but I don't have pages
that contain malware for testing. If you have such URLs handy, can you
send me some of them? I just can't find them on the internet.


Thank you!

Peter Zhou
