My working code follows. How can it be better? Note that originally I
was looping with "while rs.is_truncated", but it never got set to false
when I reached the end of the keys (bug?)

    rs = b.get_all_keys()
    keys = rs
    while True:
        # Resume the listing after the last key fetched so far.
        last = keys[-1]
        print("Getting file list (" + str(len(keys)) + ")")
        rs = b.get_all_keys(marker=last.key)
        # An empty page means there is nothing left to fetch.
        if len(rs) < 1:
            break
        keys._results.extend(rs._results)

I think you've found a couple of bugs. First, the is_truncated
attribute is being set to the string "true" or "false" rather than the
boolean True or False. Secondly, the marker attribute doesn't seem to
be getting set at all. I'll fix both of those this weekend.
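
The loop that never ended follows from Python's truthiness rules: any non-empty string, including "false", counts as true in a boolean context. A minimal sketch of the problem, where `parse_bool` is a hypothetical helper and not necessarily how the fix in boto is written:

```python
# Any non-empty string is truthy, so a loop guarded by the raw string
# "false" would keep running.
raw_flag = "false"                 # the string value described above
loops_forever = bool(raw_flag)     # True, even though the text says "false"


def parse_bool(text):
    """Hypothetical helper: map the strings "true"/"false" to booleans."""
    return text.strip().lower() == "true"


is_truncated = parse_bool(raw_flag)  # False, so "while is_truncated" stops
```
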
I'm also thinking that the ResultSet object should be augmented with a
method to fetch the next set of results when the results are
truncated. That would seem like the most natural way for things to
work to me. What do you think?
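
One way such a method might look, as a rough sketch only. `PagedResults`, `make_fetcher`, and `next_results` are hypothetical names, not boto's API, and an in-memory list of key names stands in for the bucket:

```python
class PagedResults(list):
    """Hypothetical stand-in for a result set that can fetch its next page."""

    def __init__(self, items, is_truncated, fetch_page):
        super(PagedResults, self).__init__(items)
        self.is_truncated = is_truncated   # a real boolean, not a string
        self._fetch_page = fetch_page      # callable(marker) -> PagedResults

    def next_results(self):
        """Return the following page, or None when nothing is left."""
        if not self.is_truncated or not self:
            return None
        # The last item of this page is the marker for the next request.
        return self._fetch_page(self[-1])


def make_fetcher(all_keys, page_size):
    """Fake service: serves sorted key names in pages, starting after marker."""
    ordered = sorted(all_keys)

    def fetch_page(marker=None):
        remaining = [k for k in ordered if marker is None or k > marker]
        page = remaining[:page_size]
        return PagedResults(page, len(remaining) > page_size, fetch_page)

    return fetch_page


fetch = make_fetcher(["a", "b", "c", "d", "e"], page_size=2)
page = fetch()
collected = []
while page is not None:
    collected.extend(page)
    page = page.next_results()
```

With this shape the caller's loop needs neither the marker nor the internals of the result object; both stay inside `next_results`.
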
Mitch
From an API user's perspective, however, I would prefer that more of
the details of S3 be abstracted away from me. For example, I should be
able to call "get_all_keys(maxkeys=25000)" and have it return a single
result set while taking care of multiple queries 'under the hood'.
For that matter, I don't even know that it should return a ResultSet
object, as it encapsulates S3 implementation details that most API use
cases should not need. The ResultSet object is great, but maybe there
should be another level of abstraction above it that most API users
would interact with.
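
A thin layer of the kind described here could be sketched as follows. `iter_all_keys`, `FakeBucket`, and `Key` are hypothetical names. The only assumption about the real bucket is the `get_all_keys(marker=...)` call used earlier in this thread, returning objects with a `.key` name:

```python
import collections
from itertools import islice

Key = collections.namedtuple("Key", "key")


class FakeBucket(object):
    """Stand-in bucket that returns at most page_size keys per request."""

    def __init__(self, names, page_size=1000):
        self._names = sorted(names)
        self._page_size = page_size
        self.requests = 0

    def get_all_keys(self, marker=""):
        self.requests += 1
        after = [n for n in self._names if n > marker]
        return [Key(n) for n in after[:self._page_size]]


def iter_all_keys(bucket):
    """Yield every key, issuing follow-up requests behind the scenes."""
    marker = ""
    while True:
        page = bucket.get_all_keys(marker=marker)
        if not page:           # an empty page means we are past the last key
            return
        for key in page:
            yield key
        marker = page[-1].key  # resume after the last key already seen


bucket = FakeBucket(["file%03d" % i for i in range(25)], page_size=10)
first_12 = [k.key for k in islice(iter_all_keys(bucket), 12)]
everything = [k.key for k in iter_all_keys(bucket)]
```

Because it is a generator, a caller who wants only the first N keys can stop early (here with `itertools.islice`) without fetching the rest, and a caller who wants everything simply iterates to the end.
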
I love boto and it helps me get work done.
thanks,
jeff
I guess I'm a little conflicted on this topic. The first library I
did for accessing S3 was called bitbucket and it's still available and
I'm still fixing bugs when they come up. For bitbucket, I was very
much of the mind that I wanted to abstract as much of S3 away as
possible and basically made the bucket object act, fairly
transparently, as a Python dictionary (or, more specifically, a
Mapping object). That was nice in a way but after a while I found
myself getting a bit irritated with it because in the process of
abstracting the details of S3 it was doing lots of things (e.g.
caching) behind the scenes that could produce unintended consequences.
So, with boto I wanted to make the library simpler and more directly
linked to the underlying Amazon services. But then perfectly
reasonable requests such as yours come along and start to push things
back in the other direction. I need to ponder it for a while. I'm
interested in other opinions.
Anyway, the problem with is_truncated is fixed and checked in. See:
http://code.google.com/p/boto/issues/detail?id=29
for details.
Mitch
I think a method called get_all_keys() that doesn't return all keys is
a bit misleading. The right way to do this is probably to have a
method called get_keys(), which acts like the current get_all_keys(),
and make get_all_keys() do exactly what it claims to do. If a user
wants caching of data, they can write their own fancy version of
get_all_keys. If I want a huge number of keys, I'm willing to either
do something clever, or wait for the results.
This bit me this week. I started writing a horrid hack to deal with
the large number of files, but the cache keeps getting out of sync, so
now I'm just calling get_all_keys multiple times. I'm OK with it
being a little slow.
Mitch