Nice work, K. To answer your questions:

1. Metrics updates using multiple aliases: There are two cases here: a) a metric using completely different namespaces (say, url + doi), or b) using multiple values from the same namespace (three different urls). I can't think of any situations for the former, but I think the latter may be rather common. They'll likely be handled the same way in any case. For a given metric, we want to sum values that come from different aliases...the user doesn't care about the aliases, just the item. We can do this

* early: the provider runs updates using as many aliases as it wants (that may not be all of 'em...I could imagine providers eschewing all but one alias if their sources are already doing deduplication?). Once it's done, it adds all the values up, slaps the timestamp on 'em, and chucks that in the metric dict.
* late: the provider keeps a different value for each alias it runs; the metric dict is still keyed by timestamp, but now it holds a tuple of (alias, value). Client code is in charge of summing these.

I'm game to hear other thoughts, but I'd favor early. Yes, we're throwing away data, but it feels like relatively unimportant data, and my first thought is that it's not worth the hassle for the db and client code. It feels more like dirt than depth.
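To make the two options concrete, here's a rough sketch in Python (the function and argument names are made up for illustration, not the actual total-impact-core API):

import time

def update_metric_early(metric_dict, aliases, get_count_for_alias):
    # "early": sum the provider's counts over every alias as soon as the
    # provider finishes, and store one total keyed by timestamp
    total = sum(get_count_for_alias(alias) for alias in aliases)
    metric_dict[int(time.time())] = total
    return metric_dict

# the "late" option would instead keep one (alias, value) pair per alias:
#     metric_dict[int(time.time())] = [(alias, get_count_for_alias(alias))
#                                      for alias in aliases]
# and leave the summing to client code.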
2. How do we refresh item data? We can't just do it on every request, because we're requesting these bad boys multiple times per second via the frontend polling. I think this is solved more easily: we continue to set last_requested, but we build a delay into the queue using the keys returned from couch...nothing goes on the update queue (even though it's in the couch view) unless it's been at least, say, 24hrs since the last request. The first request should still happen instantly, since last_requested starts set to null, but subsequent ones won't. Since we don't have to modify the couch view to do this (it still spits out the same keys), it should be relatively straightforward to implement different staleness thresholds for different providers. I probably need to look at the backend code again more closely, though; let me know if I'm missing something.
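Something like this is what I have in mind for the staleness gate (the names, the per-provider threshold dict, and the assumption that last_requested is stored as a datetime are all just for illustration):

import datetime

STALENESS = {"default": datetime.timedelta(hours=24)}

def needs_update(item, provider_name, now=None):
    # true if the item has never been requested before, or the last request
    # is older than the provider's staleness threshold
    now = now or datetime.datetime.utcnow()
    last_requested = item.get("last_requested")
    if last_requested is None:
        return True    # first request goes straight on the queue
    threshold = STALENESS.get(provider_name, STALENESS["default"])
    return (now - last_requested) > threshold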
This said, another advantage of doing the update queue in-memory is that we don't have to be saving last-updated values up to the db multiple times per second.

These last two are my questions/notes:

3. I think that wiping out all the aliases works ok. Our plan in this event was to leave the aliases, but flip an "alias_update_error" bit on the item that could be used to interpret results accordingly. The advantage is that users can see what's had errors. I'm ok with wiping out the aliases, I guess, but I still think we need those two error flags. See the ticket from last week for more info: https://github.com/total-impact/total-impact-core/issues/95
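For illustration, the item could carry the flags like this (field names beyond alias_update_error are guesses on my part; see issue #95 for the real proposal):

item = {
    "_id": "some-item-id",
    "aliases": {"doi": ["10.1371/journal.pone.0000000"]},  # placeholder value
    "alias_update_error": True,    # alias lookup failed on the last run
    "metric_update_error": False,  # hypothetical name for the second flag
}

if item["alias_update_error"]:
    print("warning: alias data for %s may be incomplete" % item["_id"])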
4. I'm pretty sure we actually don't want to run the aliases on every update--just the first one. It's fairly expensive, and they are really unlikely to change. Once we're done running them all the first time, I think we will just assume they stay that way for the foreseeable future (until we build a little updater script that crawls around in there and updates the really old ones).
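Roughly what I'm picturing, as a sketch (the item shape, the aliases_last_run marker, and the provider interface are all assumptions, not existing code):

import datetime

def maybe_update_aliases(item, alias_providers):
    # only hit the (expensive) alias providers if they've never been run for this item
    if item.get("aliases_last_run") is not None:
        return item    # assume aliases stay put for the foreseeable future
    for provider in alias_providers:
        # first run only: fill the alias list once
        item.setdefault("aliases", {}).update(provider.get_aliases(item))
    item["aliases_last_run"] = datetime.datetime.utcnow().isoformat()
    return item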
Thanks for enduring the long post, all. Great job K, as usual.

j

ps to kevin: i made some additional notes on your earlier commits in the github comment system...those don't seem to be working too well as a communication channel, so i'll switch over to just the listserv in the future...no worries there. but would you mind reading the old ones and hitting me back on them? a few points in there that might be helpful to one or both of us.
So, I just ran this update flow by Heather, and we're both keen to give it a go. The current system is way more byzantine than it needs to be. I think you probably don't need much convincing in that direction, K, since it was your idea to use the in-memory queues in the first place (actually it was Richard's originally, but we ended up going with couch back then...my bad :/ ).

This is the flow that gets used for metrics once we've got an item with the alias list full...filling that list should be the job of a relatively separate system (although of course we can share some code). Not sure of the details on how to do that yet...we can go over it in the sprint meeting and after. I do reckon it ought to wait for the next sprint.

Heather's got a list of queueing libs up on the etherpad that we can go over next Tuesday...she got some good feedback on these from the lead dev on PLoS's similar altmetrics tool.
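For the metrics flow itself, I'm imagining something as simple as this (pure sketch; the worker and provider interfaces are invented here, and the real thing would presumably use whichever queueing lib we pick):

import queue
import threading

update_queue = queue.Queue()

def enqueue_for_update(item_id):
    # called when an item is requested and found stale (see the staleness gate above)
    update_queue.put(item_id)

def metrics_worker(provider):
    # one worker thread per provider, pulling item ids and updating metrics
    while True:
        item_id = update_queue.get()
        try:
            provider.update_metrics(item_id)
        finally:
            update_queue.task_done()

# threading.Thread(target=metrics_worker, args=(some_provider,), daemon=True).start()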
If you've got time and it interests you, I say go for it. The only important thing to me is that the update schedule more or less follows the five-step approach from earlier...stuff goes on the queue based on having been requested + time since the last update.

Alternatively, I like the idea of implementing more providers if you've got the time.
Great, Kevin. A few more thoughts:

It looks like there's still ProviderState stuff in the Provider module, and I can't see that we're using it anywhere anymore. Isn't that safe to delete?
Heather and I talked about the relative priority of a new queue architecture versus implementing the providers, and we agreed that the latter should be the priority (although I'm really excited about the new queues getting done next sprint), because it'll really help with testing the UI stuff I'm doing. I looked at doing provider specs, and it seems like I'd mostly just be translating the Python from the current, working providers into English for you to then translate back into Python. Instead, would you be willing to try working straight from the original code, and then emailing if you run into trouble or areas that aren't clear?
Also, you lost me on one thing in this email: what do you mean by "they're no longer having to deal with the all the complexity of the dom or error handing"? Seems like they are still very much going to have to deal with the DOM of results from providers when those results are HTML, right? Who else would know how to do that? They're also going to have to provide the layer that interprets the rich, beautiful tapestry of non-standard error messages that sources give us back ("200: <response>not found!</response>", etc.) and translates it into the standard error types we've defined, to be dealt with further up the stack.
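Just to make sure we're talking about the same layer, here's the kind of thing I mean (the exception names and matching rules are made up for illustration; they're not the real provider code):

class ProviderItemNotFoundError(Exception):
    pass

class ProviderContentMalformedError(Exception):
    pass

def interpret_response(status_code, body):
    # map a source's ad-hoc error reporting onto our standard error types,
    # including errors hidden inside a 200 response
    if status_code == 200 and "not found" in body.lower():
        # e.g. '200: <response>not found!</response>'
        raise ProviderItemNotFoundError(body)
    if status_code >= 400:
        raise ProviderContentMalformedError("HTTP %s: %s" % (status_code, body))
    return body    # looks like a real payload; hand it up the stack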