uzbl: a browser that adheres to the unix-philosophy

Bryan Bishop

Aug 1, 2009, 12:47:18 PM
to kan...@gmail.com, diytrans...@googlegroups.com
http://uzbl.org/

git clone git://github.com/Dieterbe/uzbl.git

I think I am in love. Might eventually need to merge in surfraw utilities.

git clone git://git.debian.org/surfraw/surfraw.git

At the moment it seems that the main work needed on uzbl is really
porting the WebKit API over to the webkit-gtk project, which has
nowhere near a full implementation of the WebKit that everyone knows
and loves.

Webkit has a rather large repo, so be careful when grabbing it (600+ MB).

git clone git://git.webkit.org/WebKit.git WebKit

I was looking into a doxygen --comparison mode or something this
morning, but didn't find anything. It would be useful to figure out
which parts of the WebKit API are missing from the webkit-gtk API.
Maybe there's already a tool that does this that I don't know about?
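Failing that, even a crude symbol diff would get most of the way there. Something like this throwaway python -- the header snippets below are toy stand-ins, not real WebKit headers, and a real run would read the two header trees off disk:

```python
import re

# Toy stand-ins for two header files; in practice you'd read the real
# WebKit and webkit-gtk headers from disk and concatenate them.
FULL_API = """
    void loadURL(const char *url);
    void goBack();
    void goForward();
"""
GTK_API = """
    void loadURL(const char *url);
    void goBack();
"""

# Very crude: grab any identifier immediately followed by '('.
DECL = re.compile(r"(\w+)\s*\(")

def symbols(header_text):
    """Scrape function-like identifiers out of header text."""
    return set(DECL.findall(header_text))

# Everything the full API declares that the gtk port doesn't.
missing = symbols(FULL_API) - symbols(GTK_API)
print(sorted(missing))
```

A regex over headers will obviously miss overloads and macros, but as a first-pass TODO list for the port it might be good enough.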

Sometime last year I complained a lot about browsers:

http://heybryan.org/bookmarking.html

And even came up with a partial solution for running an excessive
number of tabs:

http://heybryan.org/projects/browsehack/tabtabtab.html

On the bookmarking.html page I mentioned how much it sucks to keep
clicking everywhere. And not only that, but it sucks to continuously
rewrite scrapers over and over again. Recently I joined the zotero-dev
mailing list and a few others there agreed. It would be exceedingly
awesome if we could somehow port zotero into pyscholar.

git clone git://github.com/kanzure/pyscholar.git

On the zotero-dev list, we got to talking and figured that two kinds
of file formats would be especially useful to share across web
scraping utilities (maybe even scrapy, BeautifulSoup, and
WWW::Mechanize): one file that lists xpaths and how they map to
certain attributes, and another that lists regular expressions for
each attribute of data harvested from a given site, describing how to
clean it up. There is almost always something that needs cleaning:
commas all over the place, HTML tags, or other weird things. Throw
this all into a unit testing framework, and you might have something
relatively interesting .. in particular, the idea is that we wouldn't
ever have to go *back* to a website unless they changed their
templates (in which case, they suck and should burn in hell for all
eternity).
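To make that concrete, here's a rough sketch of what the two shared files could boil down to at runtime. The attribute names, xpaths, and cleanup rules are all made up for illustration -- nothing here is an agreed format yet:

```python
import re

# File 1 (hypothetical): attribute -> xpath for one site's template.
XPATHS = {
    "title": "//div[@id='title']/text()",
    "authors": "//span[@class='authors']/text()",
}

# File 2 (hypothetical): attribute -> ordered (pattern, replacement)
# cleanup rules, since scraped values almost always need scrubbing.
CLEANUPS = {
    "authors": [
        (r"<[^>]+>", ""),      # strip stray HTML tags
        (r"\s*,\s*", ", "),    # normalize comma spacing
    ],
}

def clean(attr, raw):
    """Apply each cleanup rule for this attribute, in order."""
    for pattern, repl in CLEANUPS.get(attr, []):
        raw = re.sub(pattern, repl, raw)
    return raw.strip()

print(clean("authors", "<b>Smith ,Jones</b>"))
```

The unit tests then become trivial: a frozen sample page per site, and assertions that the xpaths plus cleanups still yield the expected attributes. When a site changes its template, the tests break instead of the downstream data.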

There is a firefox extension for zotero developers called "xpather",
which helps a user figure out an xpath without trudging through the
code. Something like that in uzbl, plus the scraping formats, could
turn browsing from the passive act of reading a page into something
more active: making a new scraper, or writing code based on whatever
the page says, and making stuff that *works* so you don't have to go
back and re-read a million and one times.
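Once you have the xpath (from xpather or a uzbl equivalent), actually pulling the value out is the easy part. A toy example with the stdlib -- the page and the id are invented, and ElementTree only supports a small XPath subset, but that subset covers a lot of scraping:

```python
import xml.etree.ElementTree as ET

# Toy page; in practice this would be the DOM of whatever page
# xpather (or a uzbl equivalent) was pointed at.
page = "<html><body><div id='price'>$1,299</div></body></html>"
tree = ET.fromstring(page)

# ElementTree's limited XPath: tag paths plus [@attr='value'] predicates.
node = tree.find(".//div[@id='price']")
print(node.text)
```

For real, tag-soup HTML you'd want lxml or BeautifulSoup in front of this, but the xpath-to-value step is the same shape.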

((As an added bonus, for organizing my bookmarks, I think squid-proxy
plus WordNet, grammars, and a bayesian filter might be useful enough
to figure out the underlying topology, instead of me continuously
reinventing the wheel every time I reorganize my bookmarks according
to some other view, or something.))
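The bayesian half of that is not much code. A toy topic guesser over bookmark titles -- the topics and training titles are made up, and a real version would train on the squid logs instead:

```python
import math
from collections import Counter

# Invented topics and training titles, purely for illustration.
TRAINING = {
    "bio": ["protein folding database", "dna synthesis howto"],
    "code": ["git branching tutorial", "python scraping tips"],
}

# Per-topic word counts from the training titles.
counts = {topic: Counter(w for t in titles for w in t.split())
          for topic, titles in TRAINING.items()}

def guess(title):
    """Pick the topic with the highest add-one-smoothed log-likelihood."""
    def score(topic):
        c = counts[topic]
        total = sum(c.values())
        return sum(math.log((c[w] + 1) / (total + 1))
                   for w in title.split())
    return max(counts, key=score)

print(guess("dna database"))
```

WordNet and the grammars would do the heavier lifting of collapsing synonyms before the counts, but even bare word counts get surprisingly far.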

- Bryan
http://heybryan.org/
1 512 203 0507
