I've been buried with Velocity (US), 4th of July vacation, and
kickoff for Velocity Europe. I have finally cleared my inbox and
want to get going on HTTP Archive development. The intent of this
email is to share my plans, get feedback, and hopefully entice some
folks to contribute patches.
There are many (many (many)) possible enhancements for HTTP Archive.
Here are my high-level priorities, which should help guide
prioritization at the bug level.
1. erroneous data - Highest priority is to fix any erroneous
information. The "Correlation coefficient is inaccurate" issue is
a possible example.
2. gather new data - There are a ton of UI issues,
and the DB won't scale, but those can be fixed later. What can't
be fixed later is gathering data that is not possible to recreate
(or is painful to recreate) after the fact. An example is the
"add Connection: Close stat" issue. Plus we want to increase the
number of URLs that are crawled. So priority #2 is to modify the
DB and crawl scripts to start pulling in data we know we want
going forward.
3. recruit contributors - As quickly as possible we
need to get more contributions to the project. Some things about
the code make that difficult (see, for example, the issue to
separate "downloads/" into a separate project). Plus the code is
a bit of a rat's nest.
4. maintenance - There's a fair amount of manual
maintenance I have to do on the site, esp. wrt running the crawls.
Reducing that will allow more time to work on other things.
5. scalability - As we get more data the current DB
schema will likely become a major issue, so we need to rethink the
schema. Also, focusing on more caching so we avoid doing any
queries will help here, regardless of how good the schema becomes.
6. analysis - There are new ways of analyzing the
data that would be useful, such as the ability to view websites
by country.
7. performance - Adding better caching of aggregate
results (such as caching aggregate stats for trends.php), caching
chart images, and doing general JS cleanup will make the site more
responsive.
8. UI - Last but not least is the UI. The design
needs improvement.
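
To make the Connection: Close example in priority 2 a bit more
concrete, here is a rough sketch of what such a stat could compute: the
number and share of responses that carry "close" in their Connection
header. The input shape (a list of header dicts) and the function name
are assumptions for illustration only; this is not the HTTP Archive
schema or crawl code.

```python
def connection_close_share(responses):
    """Return (count, fraction) of responses whose Connection header includes 'close'.

    `responses` is a hypothetical list of dicts mapping header name -> value.
    """
    closed = 0
    for headers in responses:
        # Header names are case-insensitive, so normalize before the lookup.
        normalized = {name.lower(): value for name, value in headers.items()}
        # Connection holds a comma-separated list of tokens,
        # e.g. "close" or "keep-alive, Upgrade".
        raw = normalized.get("connection", "")
        tokens = [token.strip().lower() for token in raw.split(",")]
        if "close" in tokens:
            closed += 1
    total = len(responses)
    return closed, (closed / total if total else 0.0)


# Made-up sample data, purely for illustration.
sample = [
    {"Connection": "close"},
    {"connection": "Keep-Alive"},
    {"Content-Type": "text/html"},
    {"CONNECTION": "Upgrade, Close"},
]
count, fraction = connection_close_share(sample)
print(count, fraction)
```

The point of priority 2 is that a number like this can only be
computed later if the header was recorded during the crawl.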
I'm going to focus on priorities 1 & 2 first. It's going to be
hard for other folks to work in those areas (since it's hard to
set up a crawl environment). But there are many fairly standalone
bugs people could work on. You can search for issues that have the
Contributor label. Here are some:
If you want to work on a bug, contact me first and I'll "reserve" it
in your name and answer any questions you have about it.
Thanks.
-Steve