Best way to get_all_keys for large # of keys?


jeff

Feb 3, 2007, 5:08:40 AM
to boto-users
What's the recommended way to call get_all_keys on a bucket with a large
number of keys (i.e. > 1000, the default max)? I have hacked up some
code that works for me, but it clearly is not the recommended practice
because I use the private _results variable.

My working code follows. How can it be better? Note that originally I
was looping on "while rs.is_truncated", but it never got set to false when
I reached the end of the keys (bug?).

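rs = b.get_all_keys()
keys = rs
while True:
    # Resume the listing after the last key fetched so far.
    last = keys[len(keys)-1]
    print("Getting file list (" + str(len(keys)) + ")")
    rs = b.get_all_keys(marker=last.key)
    if len(rs) < 1:
        break
    # Relies on the private _results list of the result set.
    keys._results.extend(rs._results)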

Mitch....@gmail.com

Feb 3, 2007, 6:26:19 PM
to boto-users
Hi Jeff -

I think you've found a couple of bugs. First, the is_truncated
attribute is being set to the string "true" or "false" rather than the
boolean True or False. Secondly, the marker attribute doesn't seem to
be getting set at all. I'll fix both of those this weekend.
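
The first bug explains why a "while rs.is_truncated" loop never ends: in Python any non-empty string is truthy, including "false". A minimal sketch of the problem and one possible conversion follows. The helper name parse_s3_bool is hypothetical and is not taken from boto's actual fix.

```python
def parse_s3_bool(text):
    """Convert the text 'true' or 'false' (any case) to a Python bool."""
    return text.strip().lower() == "true"


# A non-empty string is always truthy, so a loop condition holding
# the string "false" never becomes false.
string_flag = "false"
looks_truncated = bool(string_flag)

# After conversion the flag behaves as a real boolean.
really_truncated = parse_s3_bool(string_flag)
```

Here looks_truncated is True even though the listing is complete, while really_truncated is False.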

I'm also thinking that the ResultSet object should be augmented with a
method to fetch the next set of results when the results are
truncated. That would seem like the most natural way for things to
work to me. What do you think?

Mitch

jeff

Feb 3, 2007, 7:10:25 PM
to boto-users
I'm not quite sure that I'm qualified (yet) to comment on best
practices for python implementation, because I quite frankly consider
myself a hack and slasher with the language at this point. To give
you an idea, I had to dive into your source code and python
documentation to figure out the parameter passing convention for the
'marker=' stuff (which is _very_ nice, but I just had no familiarity
with the Python concept).

From an API user perspective, however, I would prefer that more of the
details of S3 be abstracted away from me. For example, I should be
able to call "get_all_keys(maxkeys=25000)" and have it return a single
result set while taking care of multiple queries 'under the hood'.
For that matter, I don't even know that it should return a ResultSet
object, as it encapsulates S3 implementation details that most API use
cases should not need. The ResultSet object is great, but
maybe there should be another level of abstraction above it that most
API users would interact with.
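
A rough sketch of what such a layer could look like, assuming only that get_all_keys accepts a marker argument and that each key has a .key attribute, as in the code earlier in this thread. FakeBucket, Key, fetch_keys and the limit parameter are hypothetical names used for illustration; none of them is part of boto.

```python
from collections import namedtuple

Key = namedtuple("Key", ["key"])  # stand-in for a key object (assumption)


class FakeBucket(object):
    """Hypothetical bucket that returns at most page_size keys per call."""

    def __init__(self, names, page_size=1000):
        self.names = sorted(names)
        self.page_size = page_size

    def get_all_keys(self, marker=""):
        # Like S3, return only keys that sort strictly after the marker.
        later = [n for n in self.names if n > marker]
        return [Key(n) for n in later[:self.page_size]]


def fetch_keys(bucket, limit=None):
    """Collect keys page by page until the bucket is exhausted
    or at least `limit` keys have been gathered."""
    keys = []
    marker = ""
    while limit is None or len(keys) < limit:
        page = bucket.get_all_keys(marker=marker)
        if not page:
            break
        keys.extend(page)
        marker = page[-1].key  # resume after the last key seen
    return keys if limit is None else keys[:limit]


bucket = FakeBucket(["k%04d" % i for i in range(2500)])
all_keys = fetch_keys(bucket)
some_keys = fetch_keys(bucket, limit=1200)
```

The caller gets back a plain list and never touches markers or private attributes. The cost is that every key is held in memory at once.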

I love boto and it helps me get work done.

thanks,
jeff

Mitch....@gmail.com

Feb 4, 2007, 11:01:01 AM
to boto-users
Well, you're using boto so you are qualified to comment.

I guess I'm a little conflicted on this topic. The first library I
did for accessing S3 was called bitbucket and it's still available and
I'm still fixing bugs when they come up. For bitbucket, I was very
much of the mind that I wanted to abstract as much of S3 away as
possible and basically made the bucket object act, fairly
transparently, as a Python dictionary (or, more specifically, a
Mapping object). That was nice in a way but after a while I found
myself getting a bit irritated with it because in the process of
abstracting the details of S3 it was doing lots of things (e.g.
caching) behind the scenes that could produce unintended consequences.

So, with boto I wanted to make the library simpler and more directly
linked to the underlying Amazon services. But, then perfectly
reasonable requests such as yours come along and start to push things
back in the other direction. I need to ponder it for a while. I'm
interested in other opinions.

Anyway, the problem with is_truncated is fixed and checked in. See:

http://code.google.com/p/boto/issues/detail?id=29

for details.

Mitch

bress

Mar 1, 2007, 4:54:24 PM
to boto-users
On Feb 4, 11:01 am, Mitch.Garn...@gmail.com wrote:
> Well, you're using boto so you are qualified to comment.
>
> I guess I'm a little conflicted on this topic. The first library I
> did for accessing S3 was called bitbucket and it's still available and
> I'm still fixing bugs when they come up. For bitbucket, I was very
> much of the mind that I wanted to abstract as much of S3 away as
> possible and basically made the bucket object act, fairly
> transparently, as a Python dictionary (or, more specifically, a
> Mapping object). That was nice in a way but after a while I found
> myself getting a bit irritated with it because in the process of
> abstracting the details of S3 it was doing lots of things (e.g.
> caching) behind the scenes that could produce unintended consequences.
>
> So, with boto I wanted to make the library simpler and more directly
> linked to the underlying Amazon services. But, then perfectly
> reasonable requests such as yours come along and start to push things
> back in the other direction. I need to ponder it for a while. I'm
> interested in other opinions.
>

I think a method called get_all_keys() that doesn't return all keys is
a bit misleading. The right way to do this is probably to have a
method called get_keys(), which acts like the current get_all_keys(),
and make get_all_keys() do exactly what it claims to do. If a user
wants caching of data, they can write their own fancy version of
get_all_keys. If I want a huge number of keys, I'm willing to either
do something clever, or wait for the results.

This bit me this week. I started writing a horrid hack to deal with
the large number of files, but the cache keeps getting out of sync, so
now I'm just calling get_all_keys multiple times. I'm OK with it
being a little slow.

Mitch....@gmail.com

Mar 1, 2007, 6:01:51 PM
to boto-users
Good point. A method called get_all_keys really should get all keys.
For BitBucket, I created a generator class which wasn't a bad way to
deal with this problem.
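
For illustration, the generator idea can be sketched with a plain generator function rather than a class. This is not BitBucket's actual code. It assumes a page-fetching callable that returns the key names sorting after a given marker; iter_keys, fake_fetch_page and NAMES are hypothetical names, and the keys are plain strings to keep the example short.

```python
NAMES = ["file-%03d" % i for i in range(25)]  # pretend bucket contents
calls = []  # records each marker requested, to show the laziness


def fake_fetch_page(marker, page_size=10):
    """Hypothetical page fetcher: up to page_size names after marker."""
    calls.append(marker)
    later = [n for n in NAMES if n > marker]
    return later[:page_size]


def iter_keys(fetch_page):
    """Yield key names one at a time, fetching a new page only when
    the previous one has been consumed."""
    marker = ""
    while True:
        page = fetch_page(marker)
        if not page:
            return
        for name in page:
            yield name
        marker = page[-1]  # resume after the last name yielded


keys = iter_keys(fake_fetch_page)
first = next(keys)           # triggers only the first page request
calls_after_first = len(calls)
rest = list(keys)            # drains the remaining pages
```

With a real bucket the callable could be something like "lambda m: [k.key for k in b.get_all_keys(marker=m)]", again assuming the marker argument and .key attribute used earlier in the thread. Unlike building one big list, nothing beyond the current page is fetched until the caller asks for it.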

Mitch
