Today's update


Kevin Campbell

Apr 26, 2012, 1:26:41 PM
to total-im...@googlegroups.com
I've done a lot of cleanup today of the queues and application code. The system is now runnable using:
 
  python ./totalimpact/api.py


We now have working functional tests for retrieving an item. The test sends a POST request to the server for each of the test items, and then polls the response until all metrics are filled in. It uses the same test items I've been using for other tests, which I've written up in a google doc and will share shortly. You can run it using:

  python extras/functional_test.py
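
Roughly, the test does something like the following (just a sketch; the endpoint paths, port and response fields here are placeholders rather than the exact API, and the real script may use a different HTTP library):

import time
import requests

API = "http://localhost:5000"  # placeholder host/port

def check_item(namespace, identifier):
    # create (or re-request) the item, then poll until every metric has values
    resp = requests.post("%s/item/%s/%s" % (API, namespace, identifier))
    item_id = resp.text.strip('"')
    while True:
        item = requests.get("%s/item/%s" % (API, item_id)).json()
        metrics = item.get("metrics", {})
        if metrics and all(m.get("values") for m in metrics.values()):
            return item
        time.sleep(0.5)

check_item("github", "egonw/gtd")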

I've had a lot to change regarding concurrency and the processing of items in the queues. I've not done the aliases expansion discussed before, but I have had to change the use of some fields and adjust the model slightly. The lifecycle of an item is now:

    self.last_requested is set when an item is first created
    if self.aliases.last_modified is older than self.last_requested, we will attempt to update aliases
    self.aliases.last_modified is set once alias update succeeds or exceeds retries
    if self.metrics[key] exists and doesn't have an update newer than self.aliases.last_modified, we will attempt to update that metric
    self.metrics[key][values][ts] is set once the metrics update succeeds or exceeds retries

The upshot of this is that None values are required for both aliases and metrics to signify that they have failed and shouldn't be processed. This probably isn't the right semantics, as it lumps too many result types together: a provider that isn't implemented would result in None, and so would exceeding timeouts.
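
To make that concrete, the decision logic I'm describing is roughly this (a sketch only; the helper names are made up, and the None handling is the imperfect semantics described above):

def needs_alias_update(item):
    # aliases are refreshed when they're older than the latest request
    return (item.aliases.last_modified is None
            or item.aliases.last_modified < item.last_requested)

def needs_metric_update(item, key):
    metric = item.metrics.get(key)
    if metric is None:
        # None covers both "failed" and "missing" -- the lumping-together problem noted above
        return False
    timestamps = metric['values'].keys()
    latest = max(timestamps) if timestamps else None
    return latest is None or latest < item.aliases.last_modified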

For now, I've implemented alias failures by clearing out all of the alias values on the object. This is to avoid us processing metrics when we don't have all the aliases for an object, and subsequently recording a wrong result.


I have a question regarding multiple aliases on a single object. It seems we are doing item.metrics = self._update_metrics_from_dict(new_metrics, item.metrics) in a loop over the alias list. This then sets multiple metric values on an object. I assumed we'd want just one metric value on each update, and to compose these in some way. Is the issue here that we'd normally never encounter this situation in normal usage, e.g. github wouldn't ever have ('github','egonw/gtd') and ('github','egonw/iconv') in the same Item?

I'm not 100% sure we've got our HTTP API right, or at least that we've fully defined the behaviour. Having the behaviour written up in terms of use cases (an end user on the web client) would be great, if any of you have time. If not, a clarification of what specifically happens when users re-request items would be helpful. We poll items regularly in the UI looking for updates, so I expect we won't want to set last_requested in that situation (as that would cause us to recalculate). What would be sent to the API when we want to regenerate the results for an item?


Tomorrow I'm going to make the changes to providers I'd stated earlier on the list. The code is just in a mess now, as the providers are having to do far too much work on object saving, etc. The changes will let me simplify these massively and avoid headaches when we go on to implement all the other providers.

I've made some improvements to logging, and will hopefully get that all polished off tomorrow as well. Don't worry about the current aesthetics; I've changed the logging format so I can work through things easily. The logs should give some sort of useful feedback just now when you are doing basic item requests, but they're far from ideal. It's certainly too hard to debug the system under load.

K

Jason Priem

Apr 27, 2012, 1:47:59 AM
to total-im...@googlegroups.com
Nice work, K. To answer your questions:

1. metrics updates using multiple aliases: There are two cases of these: a) where a metric is using completely different namespaces (say, url + doi), or b) using multiple values from the same namespace (three different urls). I can't think of any situations for the former, but I think the latter may be rather common. I think they'll likely be handled the same way in any case. For a given metric, we want to sum values that come from different aliases...the user doesn't care about the aliases, just the item. We can do this:

* early: the provider runs updates using as many aliases as it wants (that may not be all of 'em...I could imagine providers may eschew all but one alias, if their sources are already doing deduplication?). once it's done, it adds all the values up, slaps the timestamp on 'em, and chucks that in the metric dict.
* late: the provider keeps a different value for each alias it runs; the metric dict is still keyed by timestamp, but now it holds a tuple of (alias, value). Client code is in charge of summing these.

I'm game to hear other thoughts, but I'd favor early. Yes, we're throwing away data. But it feels like relatively unimportant data, and my first thought is that it's not worth the hassle for the db and client code. It feels more like dirt than depth.
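
Very roughly, the early version would be something like this (the names are made up, and the timestamp-keyed dict shape is just my reading of the current model):

import time

def early_metric_update(provider, aliases):
    # the provider sums across whichever aliases it chose to use, and only
    # the total (keyed by timestamp) goes into the metric dict
    total = 0
    for alias in aliases:
        value = provider.get_metric_value(alias)  # hypothetical per-alias lookup
        if value is not None:
            total += value
    return {int(time.time()): total}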

2. How do we refresh item data? We can't just do it every request, because we're requesting these bad boys multiple times per second via the frontend polling. I think this is solved more easily: we continue to set last_requested, but we build a delay into the queue using the keys returned from couch...nothing goes on the update queue (even though it's in the couch view) unless it's been at least, say, 24hrs since the last request. The first request should still happen instantly, since last_requested starts set to null, but subsequent ones won't. Since we don't have to modify the couch view to do this (it still spits out the same keys), it should be relatively straightforward to implement different staleness thresholds for different providers. I probably need to look at the backend code again more closely, though; let me know if I'm missing something.

This said, another advantage of doing the update queue in-memory is that we don't have to be saving last-updated values up to the db multiple times per second.

These last two are my questions/notes:

3. I think that wiping out all the aliases works ok. Our plan in this event was to leave the aliases, but flip an "alias_update_error" bit on the item that could be used to interpret results accordingly. The advantage is that users can see what's had errors. I'm ok with wiping out the aliases, I guess, but I still think we need those two error flags. See the ticket from last week for more info: https://github.com/total-impact/total-impact-core/issues/95

4. I'm pretty sure we actually don't want to run the aliases every update--just the first one. It's fairly expensive, and they are really unlikely to change. Once we're done running them all the first time, I think we will just assume they stay that way for the foreseeable future (until we build a little updater script that crawls around in there and updates the really old ones).

Thanks for enduring the long post, all. Great job K, as usual.
j

ps to kevin: i made some additional notes on your earlier commits in the github comment system...those don't seem to be working as a communication channel too well, so i'll switch over to just the listserv in the future...no worries there. but would you mind reading the old ones and hitting me back on them? a few points in there that might be helpful to one or both of us.
 






--
Jason Priem
UNC Royster Scholar
School of Information and Library Science
University of North Carolina at Chapel Hill

Kevin Campbell

Apr 27, 2012, 6:50:22 AM
to total-im...@googlegroups.com
On Fri, Apr 27, 2012 at 6:47 AM, Jason Priem <j...@jasonpriem.org> wrote:
Nice work, K. To answer your questions:

1. metrics updates using multiple aliases: There are two cases of these: a) where a metric is using completely different namespaces (say, url + doi), or b) using multiple values from the same namespace (three different urls). I can't think of any situations for the former, but I think the latter may be rather common. I think they'll likely be handled the same way in any case. For a given metric, we want to sum values that come from different aliases...the user doesn't care about the aliases, just the item. We can do this

* early: the provider runs updates using as many aliases as it wants (that may not be all of 'em...I could imagine providers may eschew all but one alias, if their sources are already doing deduplication?). once it's done, it adds all the values up, slaps the timestamp on 'em, and chucks that in the metric dict.
* late: the provider keeps a different value for each alias it runs; the metric dict is still keyed by timestamp, but now it holds a tuple of (alias, value). Client code is in charge of summing these.

I'm game to hear other thoughts, but I'd favor early. Yes, we're throwing away data. But it feels like relatively unimportant data, and first thought is that it's not worth the hassle for the db and client code. it feels more like dirt than depth.

It would have to be the former, really. The latter option would only make sense if we were changing the queueing logic and allowing a provider to process metrics in stages with interim saves. Otherwise, there's no benefit to not having the provider do the summation.

Regarding the providers, it would be nice if we could write up each of them with a top-level description. Again, I'll share a doc on that with some examples. If we had this, I'm sure I could rattle through the remaining providers fairly quickly. It would be much easier than working just from the original source.

2. How do we refresh item data? We can't just do it every request, because we're requesting these bad boys multiple times per second via the frontend polling. I think this is solved more easily: we continue to set last_requested, but we build a delay into the queue using the keys returned from couch...nothing goes on the update queue (even though it's in the couch view) unless it's at least, say, 24hrs since the last request. the first request should still happen instantly, since last_requested starts set to null, but subsequent ones won't. Since we don't have to modify the couch view to do this (it still spits out the same keys), it should be relatively straightforward to implement different staleness thresholds for different providers. I probably need to look at the backend code again more closely, though; let me know if I'm missing something.

I'll ensure the code is sensible. I think I meant it more from the end-user perspective.

If I go to http://total-impact.org/collection/MqAnvI I can see the latest results for that collection. I'm assuming we don't update last_requested on items at this point. Instead, there's a button which says 'update now', which I assume calls the appropriate API method.

If there's a question of limiting to one request every 24 hours, I'd put that into the API method so that it returns an error code if an item has already been updated within that window.
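
Something like this, say (a sketch only -- I'm assuming a Flask-style handler here, and the route name, in-memory item store and window below are all placeholders rather than the real api.py code):

import time
from flask import Flask, abort, make_response

app = Flask(__name__)
ITEMS = {}  # stand-in for the couch-backed item store
UPDATE_WINDOW = 24 * 60 * 60  # seconds

@app.route('/item/<item_id>/update', methods=['POST'])
def request_update(item_id):
    item = ITEMS.get(item_id)
    if item is None:
        abort(404)
    last = item.get('last_requested')
    if last and time.time() - last < UPDATE_WINDOW:
        # already updated within the window: refuse rather than re-queue
        abort(409)
    item['last_requested'] = time.time()
    return make_response("update queued", 202)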

This said, another advantage of doing the update queue in-memory is that we don't have to be saving last-updated values up to the db multiple times per second.

These last two are my questions/notes:

3. I think that wiping out all the aliases works ok. Our plan in this event was to leave the aliases, but flip an "alias_update_error" bit on the item that could be used to interpret results accordingly. The advantage is that users can see what's had errors. I'm ok with wiping out the aliases, I guess, but I still think we need those two error flags. See the ticket from last week for more info: https://github.com/total-impact/total-impact-core/issues/95

Yes, I saw that ticket. I assumed we might want a nightly cron job which tries to retry failed items and gives a short report. This way, we could find any data-dependent bugs which start arising. Certainly the failed result is so semantically different from the 'not implemented' result that we'd want the two to be distinct.

4. I'm pretty sure we actually don't want to run the aliases every update--just the first one. It's fairly expensive, and they are really unlikely to change. Once we're done running them all the first time, I think we will just assume they stay that way for the foreseeable future (until we build a little updater script that crawls around in there and updates the really old ones).

Ok, in which case I'll look at putting that into the system.

Thanks for enduring the long post, all. Great job K, as usual.
j

ps to kevin: i made some additional notes on your earlier commits in the github comment system...those don't seem to be working as a communication channel too well, so i'll switch over to just the listserv in the future...no worries there. but would you mind reading the old ones and hitting me back on them? a few points in there that might be helpful to one or both of us.

Sorry, I hadn't been following github closely. I'll check it from now on and respond to anything already on there.

Regards,
Kevin

Kevin Campbell

Apr 27, 2012, 2:37:29 PM
to total-im...@googlegroups.com
So I've spent quite a bit of time today refactoring the providers, trying to take the item logic out of them. It's throwing up some more questions, but I think it's good to get these out in the open.

I've moved the code-specific pieces of the provider definition into the class. I don't think these belong in the config files, as they're not things an administrator or end user would change, and they're tightly linked to the code.

class Github(Provider):

    provider_name = "github"
    metric_names = ['github:watchers', 'github:forks']
    metric_namespaces = ["github"]
    alias_namespaces = []
    biblio_namespaces = []

    member_types = ['github_user']

    provides_members = True
    provides_aliases = False
    provides_metrics = True
    provides_biblio = False

    # bodies omitted -- this just shows the shape of the interface
    def member_items(self, query_string, query_type, logger):
        pass

    def metrics(self, aliases, logger):
        pass

    def aliases(self, aliases, logger):
        pass

    def biblio(self, aliases, logger):
        pass

The logger parameter should go; I'm just using it for now to contextualise the logs. I don't want providers to be supplied the item id, as they shouldn't really care about that. Log lines they produce need to be linked back to the item to be usable, though. This part I'm still working on, as it needs a fair amount of logging changes.

Alias lists passed in will already be stripped, based on the _namespaces definitions, before metrics, aliases or biblio are called. The methods won't be called with an empty list (results should be zero anyway). The return values from the providers should be:

get_metrics

get_metrics → {'metric' : float, …}
  values found for each metric
get_metrics → {'metric' : None, …}
  the specified metric doesn't exist for this item

get_metrics → None
  equivalent to {'metric' : None, …} for all metrics

Any other failures should result in an error being raised.


get_aliases

get_aliases → [(ns, val), (ns, val), …]
  new aliases found
get_aliases → [ ]
  no new aliases found

Any other failures should result in an error being raised.
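
As a concrete illustration, a metrics() implementation following the get_metrics conventions above might look roughly like this (the endpoint, JSON fields and provider are invented for the example, not real code from the branch):

import json
import urllib2

class Example(Provider):

    provider_name = "example"
    metric_names = ['example:views']
    metric_namespaces = ["example"]

    def metrics(self, aliases, logger):
        results = {'example:views': None}
        for ns, val in aliases:
            try:
                data = json.load(urllib2.urlopen(
                    "http://api.example.org/items/%s" % val))
            except urllib2.HTTPError as e:
                if e.code == 404:
                    continue  # unknown item here: leave the metric as None
                raise         # anything else is a genuine error
            views = data.get("views")
            if views is not None:
                results['example:views'] = (results['example:views'] or 0) + float(views)
        return results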

I've still got work to do here. I've put the changes on kevcampb-providers-refactor for now, so there's something visible. The branch is sort of working, at least for the github example in the functional tests.


Jason Priem

Apr 27, 2012, 3:48:45 PM
to total-im...@googlegroups.com
I think we're agreed on everything except the update procedure. In Edinburgh, we decided that it made sense to get rid of the update button altogether. There's a very good argument that the user should have no say in when stuff gets updated. Our API calls are limited by various sources we go out to, which makes API-calls-remaining our most important resource. We don't want users in the business of deciding how we spend that resource.

In the future, there's a good chance we'll want to implement some sort of tiered system where users can pay more to get more frequent updates. Again, though, the updates should happen on our schedule, not the users'.

This is why the api spec has had no "update" method in it since the Edinburgh meeting. Users just request or create items...we get to decide when they get updated. I think the ideal way to manage this is something like the following (a rough sketch in code follows the list):
1. an item gets requested
2. get it from couch
3. for each metric, has it been updated in the last s seconds?
4a. yes: pass
4b. no: that metric goes on the update queue
5. send the item
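
In very rough pseudo-Python (the numbers, names and data shapes here are illustrative only):

import time

STALENESS = {"default": 24 * 60 * 60}  # per-provider thresholds, in seconds

def handle_item_request(item, update_queue):
    # steps 3-5: queue only the stale metrics, then send the item as-is
    now = time.time()
    for metric_name, metric in item["metrics"].items():
        provider = metric_name.split(":")[0]          # e.g. 'github:watchers' -> 'github'
        threshold = STALENESS.get(provider, STALENESS["default"])
        timestamps = metric["values"].keys()
        last_update = max(timestamps) if timestamps else 0
        if now - last_update > threshold:
            update_queue.append((item["id"], metric_name))
    return item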

I think you've got the right idea in making these queues in memory rather than couch, as discussed earlier, though I'm not yet sure if we'll have the time to do that. But I think it's the Right Way, since adding/removing from queues will be happening so often (we want it to be fast), and because it feels more straightforward than the couch approach, which has long seemed a bit uncomfortable to me (although I confess it was me that advocated it in the first place).



In short: whether and when an item gets (queued to be) updated should be based on
1. has it been requested, and
2. has it been longer than s seconds since the last update


Jason Priem

Apr 27, 2012, 3:51:58 PM
to total-im...@googlegroups.com
Awesome stuff, K. I think this is just the direction in which we should be moving...I also wanted to clean up the method sigs of the provider.metrics() et al methods, but you've pushed further in that direction than I was planning to, and that's very much a good thing in this case. 

Once we've got that gDoc, we can start filling in the specs for the other providers. It would be useful if you had categories of what you'd most like to know for each provider.


Jason Priem

Apr 27, 2012, 11:59:41 PM
to total-im...@googlegroups.com
So, just ran this update flow by Heather, and we're both keen to give it a go. The current system is way more byzantine than it needs to be. I think you probably don't need much convincing in that direction, K, since it was your idea to use the in-memory queues in the first place (actually was Richard's originally, but we ended up going with couch back then...my bad :/ ).

This is the flow that gets used for metrics once we've got an item with the alias list full...filling that should be the job of a relatively separate system (although of course we can share some code). Not sure of the details on how to do that yet...can go over it in the sprint meeting and after. I do reckon it ought to wait for the next sprint.

Heather's got a list of queueing libs up on the etherpad that we can go over next tuesday...she got some good feedback from the lead dev on PLoS's similar altmetrics tool on these. 

Kevin Campbell

Apr 28, 2012, 2:21:29 PM
to total-im...@googlegroups.com
On Sat, Apr 28, 2012 at 4:59 AM, Jason Priem <j...@jasonpriem.org> wrote:
So, just ran this update flow by Heather, and we're both keen to give it a go. The current system is way more byzantine than it needs to be. I think you probably don't need much convincing in that direction, K, since it was your idea to use the in-memory queues in the first place (actually was Richard's originally, but we ended up going with couch back then...my bad :/ ).

This is the flow that gets used for metrics once we've got an item with the alias list full...filling that should be the job of a relatively separate system (although of course we can share some code). Not sure of the details on how to do that yet...can go over it in the sprint meeting and after. I do reckon it ought to wait for the next sprint.

Heather's got a list of queueing libs up on the etherpad that we can go over next tuesday...she got some good feedback from the lead dev on PLoS's similar altmetrics tool on these. 

Jason,

The design I had in the google doc for the queue changes looks very similar to the delayed_job that PLoS is reportedly using, or at least that's how it appears at first glance. Effectively all it would do is let us schedule an item to be placed back on the queue at a later time, should it fail. This way we aren't simply blocking a worker waiting for timeouts when there are other items still to process.
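
At its core it's just a priority queue keyed on a 'not before' time -- roughly like this (a sketch of the idea only, not the actual design from the doc):

import heapq
import time

class RetryQueue(object):
    """Failed items go back on the queue with a delay instead of
    blocking a worker while we wait out a timeout."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so items never get compared directly

    def push(self, item, delay=0):
        self._counter += 1
        heapq.heappush(self._heap, (time.time() + delay, self._counter, item))

    def pop(self):
        # return the next due item, or None if nothing is ready yet
        if self._heap and self._heap[0][0] <= time.time():
            return heapq.heappop(self._heap)[2]
        return None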

I haven't done these yet, but I still have time for that in this sprint.

Otherwise, I'm not sure there's much more that needs doing. Adding in some *MQ-style system would be quite a bit of work, and I don't think it would gain us much here.

Regards,
Kevin

Jason Priem

Apr 28, 2012, 6:41:52 PM
to total-im...@googlegroups.com
If you've got time and it interests you, I say go for it. The only important thing to me is that the update schedule more or less follows the five-step approach from earlier...stuff goes on the queue based on having been requested plus time since last update.

Alternatively, I like the idea of implementing more providers if you've got the time.


Kevin Campbell

Apr 29, 2012, 1:55:22 PM
to total-im...@googlegroups.com
On Sat, Apr 28, 2012 at 11:41 PM, Jason Priem <j...@jasonpriem.org> wrote:
If you've got time and it interests you, I say go for it. The only important thing to me is that the update schedule more or less follows the five-step approach from earlier...stuff goes on the queue based on having been requested plus time since last update.

Alternatively, I like the idea of implementing more providers if you've got the time.

I can certainly look at that, although it would really help if you were able to write descriptions for each provider explaining what they do. Maybe the wikipedia question I raised there is a good example? I expect they're all pretty simple.

With the new provider structure, I'm hoping that writing providers becomes very quick to do. Certainly the code for each existing provider has been cut down massively, as they're no longer having to deal with all the complexity of the dom or error handling. Similarly, I've factored a lot of the test code out into a base class.

Regards,
Kevin

Jason Priem

Apr 30, 2012, 1:13:49 AM
to total-im...@googlegroups.com
Great, Kevin. A few more thoughts:

It looks like there's still ProviderState stuff in the Provider module, and I can't see that we're using it anywhere anymore. Is it not safe to delete?

Heather and I talked about the relative priority of a new queue architecture versus implementing the providers, and we're agreed that the latter should be the priority (although I'm really excited about the new queues getting done next sprint), because it'll really help with testing the UI stuff I'm doing. I looked at doing provider specs, and it seems like I'd mostly just be translating the Python from the current, working providers into English for you to then translate back into Python. Instead, would you be willing to try working straight from the original code, and then emailing if you run into trouble or areas that aren't clear?

Also, you lost me on one thing in this email: What do you mean by "they're no longer having to deal with the all the complexity of the dom or error handing. "?  Seems like they are still very much going to have to deal with the dom of results from providers when those results are html, right? Who else would know how to do this? Again, they're also going to have to provide the layer interpreting the rich, beautiful tapestry of non-standard error messages that sources give us back ("200: <response>not found!</response>", etc.) and translating it into the standard error types we've defined, to be dealt with up the stack.

Sorry, that's a lot for one email. Will stop there. If you want to Skype tomorrow morning (I'll be up around 10am EDT) let me know.
j


Kevin Campbell

Apr 30, 2012, 5:18:50 AM
to total-im...@googlegroups.com
On Mon, Apr 30, 2012 at 6:13 AM, Jason Priem <j...@jasonpriem.org> wrote:
Great, Kevin. A few more thoughts:

It looks like there's still ProviderState stuff in the Provider module, and I can't see that we're using that anymore anywhere. Is that not safe to delete?

It's safe to delete; I'd left it around rather than removing it, in case we decided we wanted it. From what I understand, we felt that allowing the external websites to return errors when we hit rate limits, rather than us trying to pre-empt that, is acceptable. As such, ProviderState doesn't seem necessary at the moment.
 
Heather and I talked about the relative priority of a new queue architecture and implementing the providers, and we're agreed that the latter should be the priority (although I'm really excited about the new queues getting done next sprint), because it'll really help for the testing the UI stuff I'm doing. I looked at doing provider specs, and it seems like I'd mostly just be translating the Python from the current, working providers into english for you to then translate it back into Python. Instead, would you be willing to try working straight from the original code, and then emailing if you run into trouble or areas that aren't clear?

English descriptions would be very helpful. I can try working from the original code, but I'm certain it will be a lot slower.
 
Also, you lost me on one thing in this email: What do you mean by "they're no longer having to deal with the all the complexity of the dom or error handing. "?  Seems like they are still very much going to have to deal with the dom of results from providers when those results are html, right? Who else would know how to do this? Again, they're also going to have to provide the layer interpreting the rich, beautiful tapestry of non-standard error messages that sources give us back ("200: <response>not found!</response>", etc.) and translating it into the standard error types we've defined, to be dealt with up the stack.

Sorry, by DOM I meant our models. Yes, they still have to parse the returned data from the external sources.
 
Regards,
Kevin

Jason Priem

Apr 30, 2012, 3:01:48 PM
to total-im...@googlegroups.com
Sweet, then let's delete that ProviderState stuff when it's convenient for you.

I hear that you really want a text description of what each provider is doing. I'm not convinced that's a good use of our collective time, but I should give it a shot before dismissing it. So, I'll try one today and we can see how it goes. I do think there would be some value in making some of the assumptions we made in the providers explicit, so I can see where you're coming from there.

J

Sent from my iPad

Kevin Campbell

Apr 30, 2012, 5:40:08 PM
to total-im...@googlegroups.com
Jason,

Provider states is now gone.

I tried Mendeley today; the notes are at the end of the google doc. If you could check whether it looks mostly correct, that would be helpful. I should get it implemented tomorrow morning if it is.

Regards,
Kevin

Jason Priem

Apr 30, 2012, 8:40:48 PM
to total-im...@googlegroups.com
Kevin, we can talk about this more at the sprint meeting tomorrow, but for what it's worth I would recommend saving Mendeley until towards the end; I think it's our most complicated provider.
J

Sent from my iPad