dev plans for HTTP Archive


Steve Souders

Jul 18, 2011, 7:56:26 PM
to httpa...@googlegroups.com
I've been buried with Velocity (US), 4th of July vacation, and kickoff for Velocity Europe. I finally have cleared my inbox and want to get going on HTTP Archive development. The intent of this email is to share my plans, get feedback, and hopefully entice some folks to contribute patches.

There are many (many (many)) possible enhancements for HTTP Archive. Here are my high-level priorities, to help guide prioritization at the bug level.

1. erroneous data - Highest priority is to fix any erroneous information. A possible example is the issue "Correlation coefficient is inaccurate".

2. gather new data - There are a ton of UI issues, and the DB won't scale, but those can be fixed later. What can't be fixed later is gathering data that is impossible (or painful) to recreate after the fact. An example is the issue "add Connection: Close stat". Plus we want to increase the number of URLs that are crawled. So priority #2 is to modify the DB and crawl scripts to start pulling in data we know we want going forward.

3. recruit contributors - As quickly as possible we need to get more contributions to the project. Some things about the code make that difficult; see, for example, the issue "separate "downloads/" into separate project". Plus the code is a bit of a rat's nest.

4. maintenance - There's a fair amount of manual maintenance I have to do on the site, esp. wrt running the crawls. Reducing that will allow more time to work on other things.

5. scalability - As we get more data the current DB schema will likely become a major issue, so we need to rethink the schema. Also, focusing on more caching so we avoid running queries at all will help here, regardless of how good the schema becomes.

6. analysis - There are new ways of analyzing the data that would be useful, such as the issue "view websites by country".

7. performance - Better caching of aggregate results (such as the issue "cache aggregate stats for trends.php"), caching chart images, and general JS cleanup will make the site more responsive.

8. UI - Last but not least is the UI. The design needs improvement.
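
To make the caching idea in items 5 and 7 concrete, here is a minimal sketch in Python of caching an aggregate result so the underlying query runs only on a miss or after expiry. Everything in it is an assumption for illustration: the `AggregateCache` class, the `average_bytes` stand-in query, and the one-hour expiry are hypothetical and are not taken from the HTTP Archive code.

```python
import time


class AggregateCache:
    """Hypothetical in-memory cache for expensive aggregate queries."""

    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.ttl_seconds = ttl_seconds  # assumed expiry; tune to the crawl schedule
        self.clock = clock              # injectable so expiry can be tested
        self._entries = {}              # key -> (expires_at, value)

    def get(self, key, compute):
        """Return the cached value for key; call compute() only on a miss or after expiry."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self.ttl_seconds, value)
        return value


def average_bytes(rows):
    """Stand-in for a slow DB query: mean of the 'bytes' column."""
    return sum(row["bytes"] for row in rows) / len(rows)


if __name__ == "__main__":
    cache = AggregateCache()
    rows = [{"bytes": 100}, {"bytes": 300}]
    # The second call is served from the cache without recomputing.
    print(cache.get("avg_bytes", lambda: average_bytes(rows)))
    print(cache.get("avg_bytes", lambda: average_bytes(rows)))
```

Since the data only changes when a crawl finishes, a real version could invalidate entries at that point instead of relying on a timer; the timer just keeps the sketch short.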

I'm going to focus on priorities 1 & 2 first. It's going to be hard for other folks to work in those areas (since it's hard to set up a crawl environment). But there are many fairly standalone bugs people could work on. You can search for issues that have the Contributor label. Here are some:

If you want to work on a bug, contact me first and I'll "reserve" it in your name and answer any questions you have about it.

Thanks.

-Steve




 