I'm using Enlive in a work project, and decided to get more familiar with it through a fun side project.
I think I may have chosen poorly, but I'm determined to make it work one way or another.
I decided to scrape a web page and see what I could do with it. Unfortunately, unlike
the HTML I'll be writing myself, this page has almost no ids or class names.
I've got a select that gets the spans and the sub-tables, and I could just take that seq of nodes
and process it however I like in Clojure. But I'm wondering whether I would be missing an opportunity to
leverage Enlive more fully.
I'm only interested in the content of each span, and the rows of data held in the several tables nested within the table below it.
The span content and the sub-table th content need to become nested keys for the data rows. The biggest problem seems to be
that the shape I need to extract is not tree-like: each span's heading is a sibling of its table group rather than its parent.
The page structure itself is fairly straightforward:
<div><table><tr><td><a href="#group1"></td></tr>
<tr><td><a href="#group2"></td></tr></table>
<table/>
<table/>
<div>
<table>
<h3><span id="group1">group 1</span></h3>
<div>
<table>
<table>
<th>table1</th> .....
<table>
<th>table2</th> ....
....
</table>
</div>
<h3><span id="group2"> group 2</span></h3>
<div>
<table>
<table>
<th>table 3</th> ....
<table>
<th>table 4</th>.....
</table>
</div>
</table>
</div>
Obviously there are <tr><td> etc. in there. Overall the tables are very simple:
one header row followed by some data rows.
The very first table has links which I could use to get the groups, but I have managed to get an h3 followed
by the table group that belongs to it with just a selector.
What I would really like to create is this: each id from a span should map to the th text of each of its tables, which in turn maps to
the rows of that table, with the first value of each row as the key for the other values.
{:group1 {:table1 {:foo  {:value "bar" :type "baz"}
                   :foo2 {:value "bar2" :type "baz2"}}
          :table2 {:foo  {:value "bar" :type "baz"}
                   :foo2 {:value "bar2" :type "baz2"}}}
 :group2 {:table3 {:foo  {:value "bar" :type "baz"}
                   :foo2 {:value "bar2" :type "baz2"}}
          :table4 {:foo  {:value "bar" :type "baz"}
                   :foo2 {:value "bar2" :type "baz2"}}}}
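For reference, here is a rough sketch of the plain-Clojure route I have in mind for building that shape. It assumes the select hands me [h3 div] pairs of Enlive-style node maps ({:tag ... :attrs ... :content [...]}). The helper names (el, text, elements, leaf-tables, text->key, table->entry, group->entry, page->map), the column keys [:value :type], and the sample data are all made up for illustration; none of them are part of Enlive.

```clojure
(require '[clojure.string :as str])

(defn el
  "Builds an Enlive-style node map by hand (only used for the sample data)."
  [tag attrs & content]
  {:tag tag :attrs attrs :content (vec content)})

(defn text
  "Concatenates all string content found under a node."
  [node]
  (cond
    (string? node) node
    (map? node)    (apply str (map text (:content node)))
    :else          ""))

(defn elements
  "All elements with the given tag under node (node included), in document order."
  [tag node]
  (filter #(= tag (:tag %)) (tree-seq map? :content node)))

(defn leaf-tables
  "Tables under node that contain no nested table of their own."
  [node]
  (filter #(= 1 (count (elements :table %))) (elements :table node)))

(defn text->key
  "Assumption: header text like \"table 3\" should become :table3."
  [s]
  (keyword (str/replace (str/trim s) #"\s+" "")))

(defn table->entry
  "Turns one leaf table into [th-key {first-cell-key {col-key cell ...}}]."
  [col-keys table]
  (let [title (text (first (elements :th table)))
        rows  (for [tr (elements :tr table)
                    :let [[k & vs] (map (comp str/trim text) (elements :td tr))]
                    :when k] ; the header row has no td cells, so it is skipped
                [(text->key k) (zipmap col-keys vs)])]
    [(text->key title) (into {} rows)]))

(defn group->entry
  "Turns one [h3 div] pair into [span-id-key {th-key rows-map ...}]."
  [col-keys [h3 group]]
  (let [id (get-in (first (elements :span h3)) [:attrs :id])]
    [(keyword id)
     (into {} (map #(table->entry col-keys %) (leaf-tables group)))]))

(defn page->map
  "Folds a seq of [h3 div] pairs into the nested map."
  [col-keys pairs]
  (into {} (map #(group->entry col-keys %) pairs)))

;; Hand-built sample standing in for the result of the select.
(def sample-pairs
  [[(el :h3 nil (el :span {:id "group1"} "group 1"))
    (el :div nil
        (el :table nil
            (el :table nil
                (el :tr nil (el :th nil "table1"))
                (el :tr nil (el :td nil "foo") (el :td nil "bar") (el :td nil "baz")))))]])

(page->map [:value :type] sample-pairs)
;; => {:group1 {:table1 {:foo {:value "bar", :type "baz"}}}}
```

This only relies on the node maps being plain data, so it works on whatever seq the select returns, but it also means Enlive is doing nothing beyond the initial select.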
As I said before, I'm perfectly happy to deconstruct this with Clojure, but I'm wondering whether I could be using Enlive
in a better way. All I'm really doing is loading the page and applying a select to get the spans and the table groups.
I had thought I could build a more specific select from the ids in the first table, but I need the table that follows the <span> with that id, and within
that table I need the content of its only <th> to be the parent of the rows that follow.
The pages are variable. The number of groups and the number of tables in a group can vary.
So, is there a better way than just writing some Clojure functions to do the rest?
Thanks.