RBTools API design


Christian Hammond

Jul 11, 2011, 4:08:37 AM
to reviewb...@googlegroups.com, Alexander Solovets
Hi Alexander,

(For context, Alexander is working on the RBTools API, which started as a UCOSP project in 2010 and will hopefully be completed for GSoC this year. See the proposal at http://reviews.reviewboard.org/r/2458/diff/)

This design looks like a good start and I like that server-side introspection plays a part in this. I have some questions and comments about the design though, before much work goes into it. A lot of this is based on working with developing async APIs for other software. The e-mail is long, but don't let that scare you :) Just a braindump of thoughts with some code samples.

1) Method names for getters

In your example, you access "review_requests()." This is a bit odd for a function name. It'd be better to rename it to "get_review_requests". In fact, I'd much prefer any function that's going to fetch information be prefixed with "get_".

However, "review_requests" makes perfect sense as an attribute name. Below I talk a bit about a proposal for handling both async and sync methods. I think if you had an attribute wrapper for each get_* that fetches lists of resources, that would be awesome. So, "resource.get_review_requests()" and "resource.review_requests". It'll make more sense later I hope.
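To make the pairing concrete, here's a minimal sketch (all names and the stubbed `_fetch` transport are hypothetical, just to illustrate the idea): each get_* method performs the fetch and optionally takes a callback, while a read-only attribute wraps the simple synchronous form.

```python
class Resource(object):
    """Sketch only: get_* performs the fetch; the attribute wraps it."""

    def __init__(self, payload=None):
        self._payload = payload or {}

    def get_review_requests(self, done_func=None):
        # Fetch the review requests list. If a callback was given,
        # invoke it with the result; either way, return the result.
        result = self._fetch('review_requests')

        if done_func:
            done_func(result)

        return result

    @property
    def review_requests(self):
        # Attribute form: always the plain synchronous fetch.
        return self.get_review_requests()

    def _fetch(self, name):
        # Stand-in for the real HTTP request to the server.
        return self._payload.get(name, [])
```

The attribute form keeps simple scripts terse, while the method form leaves room for the callback parameter discussed below.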

2) fire()

I think it is important to have something that executes the request and takes a callback for the result. I'm not sure about the current design, though, which has another object with a method that must be called, and I want to get your opinion on my counter-proposal.

Rather than certain functions returning a Request object, which the caller then must call fire() on, why don't the methods themselves take the callback and immediately execute the query? This would look like:

    def on_get_root_done(root):
        root.get_review_requests(on_get_review_requests_done)

    factory.get_root(on_get_root_done)

Note that the callback is the parameter to the getter, instead of on a fire() method provided by an object returned by the getter. It's more clear and natural, as you know that function will fetch the results, rather than fetching something that isn't a result but is instead a proxy to the results.

3) Creating and saving resources.

Looking at the create() call, I sort of want a better way to prepare a request before sending to the server. Sort of like the attribute-based updates you show. Being able to create an instance of a review request, set the data there, and then push to the server would be nice.

I think what I'm thinking is something different from the create/update you have right now, which takes parameters and does a push. Instead, maybe you'd do:

def on_get_review_requests_done(review_requests):
    review_request = review_requests.create()
    review_request.summary = "..."
    review_request.repository = repository_id
    review_request.save(on_review_request_saved)

So create() wouldn't immediately create, but rather would create a new object that could be saved. save() would be the new replacement for the current create().fire() and update().fire().

This mirrors Django's own way of handling objects, for the most part.

Updating would work similarly. Set some attributes, then do save().

The advantage here is that the caller doesn't have to distinguish between creating new objects and updating them. A function may take a review request and populate it and save it, without having to know if we've already created this object or if it's an existing object.
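One way to sketch that unified save() path (hypothetical names throughout; the HTTP layer is reduced to a stub transport object): save() checks whether the object already has a server-assigned id and issues a create or an update accordingly, so callers never have to branch.

```python
class SavableResource(object):
    """Sketch: one save() path for both new and existing objects."""

    def __init__(self, transport, **fields):
        self._transport = transport  # stub with post()/put() methods
        self.id = None               # no id until the server assigns one
        self.__dict__.update(fields)

    def save(self, done_func=None):
        # Collect the public fields to send to the server.
        fields = dict((k, v) for k, v in self.__dict__.items()
                      if not k.startswith('_') and k != 'id')

        if self.id is None:
            # New object: create it on the list resource.
            self.id = self._transport.post(fields)
        else:
            # Existing object: push the updated fields.
            self._transport.put(self.id, fields)

        if done_func:
            done_func(self)
```

A function handed one of these objects can populate it and call save() without knowing whether it originated from create() or from a fetch.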

I think it'd be perfectly fine for create() to actually be able to take some default parameters to make the initial setup easier. So the above could be:

    review_request = review_requests.create(summary="...", repository=repository_id)
    review_request.save()

Implementation would be straightforward. It would just do something like:

    def create(self, **kwargs):
        resource = Resource(...)
        resource.__dict__.update(**kwargs)
        return resource

4) Pagination of resources.

One thing that'll be important is iterating over objects in a list properly without blocking. They're paginated in the payload, instead of being one large list of results. This means if you do:

    for review_request in review_requests:
        ...

It would either not be able to return all results (only the first set fetched), or it would synchronously have to fetch new results.

The solution I believe we need to investigate is to use Python coroutines with generators. These, basically, will allow iteration or yielding of values coming from asynchronous queries. Remember that our goal is to not stall a UI mainloop by blocking the main/only thread, so as long as we can achieve that, we're in good shape.

I'm no expert in coroutines, but there are a couple things to look at:

http://www.python.org/dev/peps/pep-0342/ - The introduction of coroutines into Python (sadly, Python 2.4 does not include it, but 2.5+ does)
http://code.activestate.com/recipes/577129-run-asynchronous-tasks-using-coroutines/ - Some examples of different coroutines

This is all something to keep in mind when developing. However, I think it's reasonable that the first pass at this API just blocks and fetches the extra pages of resources synchronously for now. Just as long as we make a note of it (a big comment would be good, talking about what needs to be done).
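As a first-pass sketch of that blocking behaviour (names are hypothetical and the page fetch is stubbed): __iter__ yields the items from the current page, then synchronously pulls each subsequent page until the server reports there are no more.

```python
class ResourceList(object):
    """Sketch: iterate across pages, fetching each next page on demand.

    NOTE: this blocks on every page fetch. It should eventually be
    replaced with coroutine-based iteration (PEP 342) so a UI mainloop
    is never stalled.
    """

    def __init__(self, fetch_page):
        # fetch_page(start) returns (items, next_start or None).
        # In the real API this would issue the paginated GET request.
        self._fetch_page = fetch_page

    def __iter__(self):
        start = 0

        while start is not None:
            items, start = self._fetch_page(start)

            for item in items:
                yield item
```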

Potentially, we could add a method that calls a callback for each item in the list as well. Not as pythonic, but would allow for truly asynchronous iteration across large numbers of results.

5) Synchronous vs. Asynchronous

I'm not entirely sure if this is touched upon in your examples or not, but I think it's important we allow for both sync and async usage. The reason being that complex apps (GUI apps largely) will need to be async (so as not to block the UI event loop), yet simple scripts (of which I expect there will be many) don't need to worry about blocking.

So the design should ideally account for both.

The functions that take a callback could be made to be smart and could handle both. Basic logic:

    def get_something(self, done_func=None, force_async=False):
        # Create the initially empty resource that will be filled in
        # upon fetching.
        resource = Resource()

        if done_func or force_async:
            # Handle this asynchronously and later call done_func with
            # the result.
            pass
        else:
            # Handle this synchronously.
            pass

        return resource

The caller would immediately have a resource object (which for async calls would be unloaded by default -- we could have a flag indicating whether it's loaded). If the caller provided a done slot, it'd run in async mode, calling the done slot when done. Otherwise, it'd be a synchronous call.

That way, both of these work:

    def on_review_request_saved():
        print "Saved!"

    review_request.save(on_review_request_saved)

And:

    review_request.save()
    print "Saved!"

Note force_async. If there was some way to be notified on that returned object when it was loaded, then a caller could potentially load the resource without having to provide a done function.

Why is this useful? Well, you may have some wrapping code doing the complex work of fetching some resources, but it doesn't care when they're loaded. The caller cares, but not the wrapper. Maybe there's a debug console in a graphical app that's showing the resources as they load. The debug console could be passed these resources, listen itself for when they load, and display something. No need for the wrapper to be involved, aside from returning the resource. All it would have to do is pass force_async=True. Without that, it'd need to set up a potentially empty function to pass in order to fetch asynchronously.
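That notification hook could look something like this (the naming is entirely hypothetical): the unloaded resource keeps a list of listeners, and anything holding the object, like that debug console, can subscribe without the wrapper ever supplying a done function.

```python
class LazyResource(object):
    """Sketch: a resource that announces when its data arrives."""

    def __init__(self):
        self.loaded = False
        self._listeners = []

    def add_loaded_callback(self, func):
        # If the data already arrived, notify immediately; otherwise
        # remember the listener for later.
        if self.loaded:
            func(self)
        else:
            self._listeners.append(func)

    def _on_data_arrived(self, payload):
        # Called by the transport layer when the async fetch completes.
        self.__dict__.update(payload)
        self.loaded = True

        for func in self._listeners:
            func(self)
```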

(Hope that made sense -- It's late and I may be rambling.)

6) Getting individual objects

You asked how we should get a particular resource directly. I think there are two ways.

The first way is to have a get() function on a resource list (like review_requests above) that takes an ID and fetches it. Should be straightforward. It assumes you have the list and the ID.

The second is URI templates. I think what you have could work, but there are two changes I would make. "get_" prefix, and requiring keyword args for the parameters (in order to know what to fill in where). So:

factory.get_screenshot(on_get_screenshot_done, review_request_id=review_request_id, screenshot_id=screenshot_id)

Now that said, URI templates only get us to certain items. You can't fetch *anything* from here. But it might be enough.

7) Singleton resources

Something you'll come across and have to deal with are special kinds of resources. There are singleton resources (like review_requests/draft) which are not a list and are not ID-based. They either exist at that location or they don't, and they're one object rather than a list of objects. If this is flexible, it should be able to handle it fine, but that's something to be aware of.

8) "Factory"

While technically ResourceFactory is indeed a factory, I think it's a lot more clear if we use the name "ReviewBoardServer." It reads a lot nicer. server.get_root(), etc. In fact, I think it could even have the root pre-loaded and built-in, so instead of the extra step of getting the root, you could do: server.get_review_requests().


Thoughts?

I'd love to hear thoughts from other people on this list, too.

Christian

--
Christian Hammond - chi...@chipx86.com
Review Board - http://www.reviewboard.org
VMware, Inc. - http://www.vmware.com

Alexander Solovets

Jul 11, 2011, 9:50:05 AM
to reviewb...@googlegroups.com, Alexander Solovets
It seems that you didn't get my idea. My approach is based on the fact
that the resource tree has a quite regular structure. That is, every
resource may have 'links' leading to other resources, which can hence
be turned into object methods, and arbitrary fields, which are merely
converted into object properties. That is why I suggested the notions
of 'ResourceFactory' - the object that knows how to make a resource
object out of a JSON payload - and 'Resource' - a stub object whose
methods and properties are filled in by ResourceFactory once the HTTP
response is received. 'Request' knows how to wrap a particular method
into the following chain: method invocation -> request creation ->
response reception -> resource generation. Of course we could avoid
the redundant call of 'fire()', but then we meet the problem that
sometimes, along with callbacks and other internal properties, we want
to pass request parameters like 'public=True'. So fire() serves in two
ways: it separates request parameters from internal options, and it
makes the request storable, i.e. once a request object is created via
'r = request.draft()' we will be able to call 'r.fire()' multiple
times. Finally, the main advantage is that API users probably won't
have to change their code each time the API changes, nor will they
have to wait until changes in the API are reflected in rbtools/api.

As you see, your approach is likely to be orthogonal to mine and relies
on the fact that we know something more about the API itself (like the
fact that only resource lists have a 'create' method).

Since I'm slightly behind my schedule, I'd very much appreciate other
developers voicing their thoughts about the right approach, so that I
can finish this framework sooner.

Hongbin Lu

Jul 11, 2011, 12:22:41 PM
to reviewb...@googlegroups.com, Alexander Solovets
Hi.
It sounds like a very exciting project. I am looking forward to playing with these APIs. I don't have many comments on the design, just a couple of questions:
* Does the documentation of the APIs get auto-generated as well?
* It looks like the API is for Python only, correct? What about support for other programming languages?

Best Regards,
Hongbin

Hongbin Lu

Jul 11, 2011, 2:38:45 PM
to reviewb...@googlegroups.com
Embarrassing. Please don't mind if I killed the discussion. Go ahead guys.

Christian Hammond

Jul 11, 2011, 3:30:27 PM
to Alexander Solovets, reviewb...@googlegroups.com
Hi Alexander,

I definitely get the approach and I think it's a great one. I probably wasn't very clear with my comments, so I'll address your points and see if I can clear that up.

ResourceFactory is great, and I don't want to change the concept. Just the name. From an API writer's point of view, what they care about is talking to the server. A ResourceFactory is fairly generic and sounds like something they may use to create brand new resources (technically they're creating the objects, but not the server-side resources necessarily). So just renaming that to ReviewBoardServer would make the usage more clear. You know you're talking to the server at that point.

The ResourceFactory and Resource model doesn't need to change. I was just recommending simplification of usage by getting rid of Request and merging its responsibilities in with the caller.

I don't think we hit any problems by getting rid of fire() and merging it in with the originating function. Every function operating on a remote resource could look something like:

    def foo(done_func=None, **kwargs):

All foo=bar parameters would appear in kwargs, and we could rely on done_func to always be a callback or None. I don't think there are any real problems with separation here.

I get what you're saying about having a storable request and being able to call fire() multiple times, but I don't see how that's really useful in practice (and the semantics of it feel confusing -- I don't think we'd want to advise that usage).

I'm all for having the generic Resource objects that pass things to the server in order to keep knowledge of the API out of these Python bindings. Like you, I think that's key to maintainability.

I know we have disagreements about create(), but I do believe it's worth doing it the way I outlined, and that it's not a hindrance to keeping it maintainable. There are some basic things that will always be true about this kind of API. One of them is that things can be created on list resources (if the list resource claims to allow it, by way of the links). By the time we can do a create() on a list resource, we'd have loaded the first payload of the list resource and we know whether creation is allowed. Because of this, we can know whether create() can return something valid, or whether there'd be an error.
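That gating on the links could be sketched like so (the payload shape and names here are hypothetical, based on the 'links' idea above): create() consults the links block that arrived with the list resource's first payload and fails loudly when creation isn't advertised.

```python
class ListResource(object):
    """Sketch: create() is only valid if the payload advertises it."""

    def __init__(self, payload):
        # The first GET of the list resource tells us which operations
        # the server allows, via its 'links' block.
        self._links = payload.get('links', {})

    def create(self, **kwargs):
        if 'create' not in self._links:
            raise AttributeError(
                'This list resource does not allow creation')

        # Build an unsaved local object; a plain dict stands in here
        # for the Resource object that save() would later POST.
        return dict(kwargs)
```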

Having built-in support for a basic primitive like create() is far different from hard-coding support for, say, closing a review request.

Something that's very important to me is how the API feels to use. Having a good backend for it is very important, but if it mirrors the REST API or transport too closely, it can be harder to use for people. That's one reason why I want to get rid of fire(), and why I want the same API to handle both sync and async functions.

Down the road, I do believe we'll need to have support for special functions like a close(type, message) for review requests. Again, this goes back to API usability. Right now, it's not necessary, but once we get this initial version in I want to start looking at things like that. Hopefully, though, what we can do is have something server-side that maps "actions" (like close) to sets of parameters and HTTP verbs, and the Python API can just learn from that. But this isn't something to care about right now. More food for thought :)

So, I think I gave the impression in my post that I wanted more knowledge of the API and how it's used, but really I'm aiming for close to what you're aiming for. If our job is done correctly, a client could talk to any Review Board server and just call/update things without having to, as you said, update RBTools or rewrite their code.

Sorry if I gave the wrong impression on my goals, and maybe it's still not completely clear. Let me know. We just have different ideas about parts of the design for this same goal :)


Christian



Alexander Solovets

Jul 15, 2011, 2:20:51 AM
to reviewb...@googlegroups.com, Alexander Solovets
Christian, thank you for explaining your idea one more time! I think I agree with the points above: indeed, some functionality should be built into the API while the rest may be generated on the fly.

Alexander Solovets

Jul 16, 2011, 9:53:11 PM
to reviewb...@googlegroups.com, Alexander Solovets
One more question, related to HTTP transport. Given that urllib2 does not support asynchronous I/O, we have to choose how to make HTTP requests. At this point I see two options: use threads, or use another network backend. I would like to hear your thoughts =)

Christian Hammond

Jul 16, 2011, 10:18:57 PM
to reviewb...@googlegroups.com, Alexander Solovets
Right now, we can stick with urllib2 without threads. It will be
synchronous, but that's okay right now. The important part is for the
API to at least claim to be async so we can replace the backend later.
We'll tackle the backend itself when other things are in place.

Christian


John J Lee

Jul 18, 2011, 5:01:51 PM
to reviewb...@googlegroups.com

http://eventlet.net/doc/examples.html#web-crawler

(note that's neither a fork of urllib2 nor conventional monkey-patching --
rather, it dynamically injects "greened" dependencies into urllib2,
inserting the "greened" copy of that module in sys.modules under a key
other than "urllib2")


John

Alexander Solovets

Aug 4, 2011, 2:39:47 AM
to reviewb...@googlegroups.com, Alexander Solovets
> 4) Pagination of resources.
>
> One thing that'll be important is iterating over objects in a list properly without blocking. They're paginated in the payload, instead of being one large list of results. This means if you do:
>
>     for review_request in review_requests:
>         ...
>
> It would either not be able to return all results (only the first set fetched), or it would synchronously have to fetch new results.

I think we need to provide methods for both cases: the full list and partial content. I suggest the following strategy: if the fetched resource is a resource list, its GET response will provide pagination arguments (start/max-results), and the portion loaded immediately will be available as a sequence:

Example #1
for resource in parent.get_resource():
    ...

If the user wants to fetch the whole list, two additional methods are provided: all() and all_async(). The first method blocks and simply returns the list items, loading them as needed. The previous example will look like:

Example #2
for resource in parent.get_resource(counts_only=True).all():
    ...
 
The second method, as usual, receives a callback, which must be a generator (i.e. contain a "yield" statement):

Example #3
def read_list():
    while True:
        r = yield
  
parent.get_resource(counts_only=True).all_async(read_list)

Lastly, if the RB API ignores nonexistent parameters, we could make 'counts_only' a default parameter, so that the second example shortens to "parent.get_resource().all()". Along with that, we will provide delayed loading of resource lists, so that the list itself is loaded only when __iter__ is called (implicitly or explicitly).
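The all_async() half of the proposal can be sketched as follows (names hypothetical; a list of in-memory pages stands in for the paginated fetches): the generator callback is primed to its first yield, then each item is pushed into it with send().

```python
def all_async_sketch(pages, reader):
    """Sketch of all_async(): push every item into a generator callback.

    `pages` is a list of item lists, standing in for the paginated
    fetches; `reader` is a generator function (it must use `yield` to
    receive items), as in Example #3 above.
    """
    gen = reader()
    next(gen)  # prime the generator so it reaches its first yield

    for page in pages:
        for item in page:
            gen.send(item)
```

In the real API the pages would arrive from asynchronous fetch callbacks rather than a list, but the send()-based hand-off to the caller's generator would look the same.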

Christian Hammond

Aug 7, 2011, 6:23:51 PM
to reviewb...@googlegroups.com, Alexander Solovets
Your examples and suggestions for all/all_async sound fine to me. I wasn't sure about the counts_only=True though. Did you mean False? counts_only=True won't return any actual item results, just the number of items found, which isn't really something you can iterate over.

Did I misunderstand?


Christian



Alexander Solovets

Aug 7, 2011, 6:49:35 PM
to reviewb...@googlegroups.com, Alexander Solovets
If we call get_* with no args, i.e. let it fetch the first portion of results, and then call all/all_async, the first call performs an unnecessary retrieval of information. So with 'counts_only' set to True, we decrease the amount of information fetched by default. Then, if necessary ('counts_only' set to False, or the resource requested in an iterator context), we can re-fetch the list. In other words, the case where the user wants to fetch a resource list as-is (i.e. resource attributes plus a portion of the child resource list) is likely to be rare -- they either won't need this list at all or will call all/all_async.

Christian Hammond

Aug 7, 2011, 7:05:40 PM
to reviewb...@googlegroups.com, Alexander Solovets
I don't want to expose counts_only for what I think will be the common case of fetching the list of resources. It's not a lot of extra work to fetch the standard resource list data for one page, and it simplifies the API for the developer integrating it into their app.


Christian


