Sediment Metaphor


Ward Cunningham

Dec 6, 2010, 4:26:38 PM
to software...@googlegroups.com
Friends -- I'd like to advise you of a new metaphor that has turned out to be very productive in high-volume feature extraction and data mining. I call it the sediment metaphor:

A parsing algorithm is like a branching river delta.
Collect sample parse data as if they were sediment accumulating in each streamlet.
When too much accumulates somewhere, wash part away, then continue collecting more.
You will have much to look at when you are done.
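
The four lines above can be read as a bounded sampling scheme. What follows is a minimal Python sketch of that reading, not code from this thread: the names `Sediment`, `collect`, `wash` and `capacity`, and the choice to wash away a random half, are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class Sediment:
    """Bounded sample store keyed by stream (for example, one grammar rule)."""

    def __init__(self, capacity=8, seed=0):
        self.capacity = capacity            # assumed per-stream limit
        self.samples = defaultdict(list)    # stream name -> samples kept so far
        self.flow = defaultdict(int)        # stream name -> total matches seen
        self.rng = random.Random(seed)

    def collect(self, stream, text):
        """Deposit one sample; wash part away if the stream is too full."""
        self.flow[stream] += 1
        bucket = self.samples[stream]
        bucket.append(text)
        if len(bucket) > self.capacity:
            self.wash(stream)

    def wash(self, stream):
        """Keep a random half of the samples, preserving arrival order."""
        bucket = self.samples[stream]
        keep = sorted(self.rng.sample(range(len(bucket)), len(bucket) // 2))
        self.samples[stream] = [bucket[i] for i in keep]


sediment = Sediment(capacity=8)
for n in range(1000):
    sediment.collect("word", f"sample-{n}")
sediment.collect("number", "42")
```

However many matches flow through a stream, its bucket never holds more than `capacity` samples, while `flow` still records the full count. Washing away half each time favors recent samples over old ones; a different washing policy would change that bias.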

Best regards. -- Ward



__________________
Ward Cunningham





john.g...@gmail.com

Dec 6, 2010, 4:57:13 PM
to software...@googlegroups.com
Ward, that is a neat idea. Do you know of any code that attempts to implement this approach?



Ward Cunningham

Dec 6, 2010, 5:06:30 PM
to software...@googlegroups.com
Yes, I have instrumented peg/leg this way and have found it very useful.
This code has not been released into open source.

__________________
Ward Cunningham




Michael Feathers

Dec 7, 2010, 12:07:24 AM
to software...@googlegroups.com
On Mon, Dec 6, 2010 at 11:06 PM, Ward Cunningham <wa...@c2.com> wrote:
> Yes, I have instrumented peg/leg this way and have found it very useful.
> This code has not been released into open source.

You find the most interesting things to work on. :-) Sounds very cool.

Ward Cunningham

Dec 7, 2010, 1:14:48 PM
to software...@googlegroups.com

Michael -- What a nice thing to say. I've been a little bored with my work. Your comment is a helpful reminder. -- Ward


p.s. a little more about what I am doing ...

1. We scrape websites but have very little insight into what we've scraped. I asked myself the question, how can I look at 30,000 pages for some reasonable notion of look? I started exploring our data with regular expressions in Perl scripts. 30,000 pages was within Perl's limits: enough for statistical properties to emerge without runs being too slow.

2. Discussing my exploration at Open Source Bridge this summer, several colleagues mentioned PEG parsers as an alternative to regular expressions. I looked into Treetop, a PEG parser in Ruby, but it was way too slow. I tried Ian Piumarta's peg/leg parser, which was blindingly fast, but required that I write in C, hardly exploratory.

3. I needed an organizing/simplifying principle that could guide my C programming. I allowed myself to compare my quest to read a zillion pages to that of the climate scientists studying the history of weather on Earth with core samples. This step is vague in my memory, but I ended up in Wikipedia here:

http://en.wikipedia.org/wiki/Exner_equation

4. I wrote C subroutines called Aggrade and Degrade, inspired by the Exner equation. These adjusted to the various "flow rates" through the production rules of whatever grammar I happened to be running at the moment. We wrapped this all into a webapp that I could run on EC2, where my big data was stored. Thus was born "Exploratory Parsing", an agile data mining methodology.
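
For reference, the Exner equation on the linked page is commonly written in the form below. It expresses conservation of sediment mass: the bed rises where more sediment flows in than out, and falls where the reverse holds. How Aggrade and Degrade map onto its terms is not stated in this thread, so the equation is given only as background.

```latex
\frac{\partial \eta}{\partial t} = -\frac{1}{\varepsilon_0}\, \nabla \cdot \mathbf{q}_s
```

Here \eta is the bed elevation, t is time, \varepsilon_0 is the grain packing density, and \mathbf{q}_s is the sediment flux.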

I mention this here on this list because of the central role that the Sediment Metaphor took in guiding me when I had no other plan.

I wasn't looking for a way to explain what I had already done; I was looking for a way forward with only a vague notion of what I wanted to do (look at pages). Lakoff and Johnson say that a metaphor will be sustained in the culture when it works together in a cognitive system that delivers value. Sediment delivered where other ways to think about my problem failed me. The metaphor has compounded upon itself to produce further unexpected bounty. To quote some numbers: I can now read all 30 gigabytes of Wikipedia in 6 seconds, for a defensible definition of read.

Antony Marcano

Dec 8, 2010, 12:00:21 PM
to SoftwareMetaphor
Ward, this sounds awesome... are you able to share code snippets that
give more illustration to how this metaphor influenced the code you
ended up with?


Ward Cunningham

Dec 9, 2010, 10:19:20 AM
to software...@googlegroups.com
On Dec 8, 2010, at 9:00 AM, Antony Marcano wrote:
> Ward, this sounds awesome... are you able to share code snippets that
> give more illustration to how this metaphor influenced the code you
> ended up with?
>
> On Dec 7, 6:14 pm, Ward Cunningham <w...@c2.com> wrote:
>> On Dec 6, 2010, at 9:07 PM, Michael Feathers wrote:
>>> On Mon, Dec 6, 2010 at 11:06 PM, Ward Cunningham <w...@c2.com> wrote:
>>>> Yes, I have instrumented peg/leg this way and have found it very useful.
>>>> This code has not been released into open source.
>>>
>>> You find the most interesting things to work on. :-) Sounds very cool.



Antony -- The attached pdf is a figure from a document in preparation. It shows one step of an agile process, sort of like writing and passing one test, but in my case recognizing and describing one additional structure in the data.

In this example I am parsing the collected works of Dickens for "sentences", strings of words followed by a period. My methodology is to offer alternative definitions for sentence, look at the sediment that collects for each, and then revise my definitions to more completely describe what I find. The example in the figure shows me discovering that there are many ellipses in Dickens' text.

The figure is composed of snapshots from three different screens in the Exploratory Parsing webapp workflow:

* The diagrams are clickable SVG generated from a parser run. The numbers on these diagrams show how much data flow there was in any given "stream".  Clicking on a number exposes samples from that flow, analogous to taking core samples of the sediment within a natural stream.

* The screens with yellow highlights are individual parse matches, the material that makes up the "sediment". Successfully parsed text is shown in green, not yet parsed text, red. From my samples I notice that my 4,072 sentence-match failures come mostly from strings of periods.

* The grammar rules screen shows improvements in my description of the text. Here I am adding a generous rule for ellipsis: two or more periods in a row. We see in the second diagram that this rule matches 1,847 times, leaving only 2 matches in the other-character stream.
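
The revision step described in these bullets can be imitated on a toy input. This is a rough Python sketch using regular expressions as ordered alternatives, not the peg/leg grammar from the figure: the patterns, the `parse` helper and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Illustrative patterns only; the grammar in the figure is not shown in this thread.
SENTENCE = ("sentence", re.compile(r"\s*[A-Za-z][A-Za-z ,;']*\."))
ELLIPSIS = ("ellipsis", re.compile(r"\.{2,}"))      # two or more periods in a row
OTHER = ("other", re.compile(r".", re.DOTALL))      # fallback: any one character


def parse(text, rules, max_samples=3):
    """Try rules in order at each position; count flow and keep a few samples."""
    flow = Counter()
    samples = defaultdict(list)
    pos = 0
    while pos < len(text):
        for name, pattern in rules:
            match = pattern.match(text, pos)
            if match:
                flow[name] += 1
                if len(samples[name]) < max_samples:
                    samples[name].append(match.group())
                pos = match.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")
    return flow, samples


text = "He paused.... Then he spoke. And then... nothing."

# Before: only sentences and a catch-all stream.
flow_before, samples_before = parse(text, [SENTENCE, OTHER])
# After: the same grammar with the ellipsis rule added.
flow_after, samples_after = parse(text, [SENTENCE, ELLIPSIS, OTHER])
```

On this toy text the catch-all stream holds 5 stray periods before the ellipsis rule is added and none afterwards, when the same periods show up as 2 ellipsis matches. Looking at `samples_before["other"]` is the small-scale equivalent of clicking a number to see what collected there.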

I coded the pip() function in C. It says: the text matched within angle brackets is of interest and should be collected as sediment. It calls the aggrade() and degrade() functions mentioned earlier.

Again, I bring my experience to this list because it is an example of a productive metaphor that has guided me in applying parsing technology outside its usual bounds. Collecting and applying parsing's "best practices" could not have led me here. None of this work is yet current practice. I hope to change that.
Attachment: Ellipsis Example.pdf