Newbie trying HTML parsing


Mike

Oct 14, 2015, 2:16:50 PM
to Clojure
Hello,

For my first real Clojure project I am attempting to get an HTML page and locate a particular <input> tag within the page. I have my program GETting the page, but
I am having trouble parsing it.  I am trying to use clj-tagsoup to parse the page, but I get an error message when I try to :require it as an external reference. Here
are (what I think are) the pertinent segments of "code":

project.clj file extract:
 
  :dependencies [[org.clojure/clojure "1.7.0"]
                 [clj-http "2.0.0"]
                 [clj-tagsoup "0.3.0"]]



source file extract:

(ns one.core
  (:gen-class)
  (:require [clj-http.client :as client])
  (:require [clj-tagsoup :as html]))

  
  
I am using LightTable as my IDE; when I added the clj-tagsoup dependency the next loading of my project did appear to go fetch the required files.

I have some questions:

1) I am using Windows 8.1 Pro, my project is located at c:\Users\Michael\CLJ\one; I have LightTable loaded at c:\LightTableWin\LightTable\LightTable.exe.
Where would I manually verify that the clj-tagsoup files have been loaded?

2) clj-tagsoup didn't have an example of how to reference it in a :require clause.  Did I write this correctly, and how would I know if it needed to be different?

3) Is this an appropriate approach?  I noticed that clj-http has a parse-html function, but when I call that I get an assert that says that crouton is not present.

Any answers or guidance appreciated.

James Reeves

Oct 14, 2015, 2:53:11 PM
to clo...@googlegroups.com
In the clj-tagsoup example it has the following line:

    (use 'pl.danieljanus.tagsoup)

The use function is like require, except it also refers the namespace's public vars into the current namespace, so they can be called without a prefix. So pl.danieljanus.tagsoup is the namespace to use.
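The difference between require and use can be sketched with namespaces that ship with Clojure. This is only an illustration; clojure.string, clojure.set and clojure.walk are stand-ins chosen here and have nothing to do with clj-tagsoup itself.

```clojure
;; require with :as loads the namespace and gives it a short alias.
(require '[clojure.string :as str])

;; require with :refer brings in only the named vars.
(require '[clojure.set :refer [union]])

;; use refers vars into the current namespace; :only limits which ones.
(use '[clojure.walk :only [keywordize-keys]])

(str/upper-case "tagsoup")   ; called through the alias
;; => "TAGSOUP"

(union #{1} #{2})            ; referred var, no prefix needed
;; => #{1 2}

(keywordize-keys {"a" 1})    ; referred by use, no prefix needed
;; => {:a 1}
```

In an ns form, the :require variants with :as or :refer are generally easier to follow than a bare use, because each name can be traced back to its namespace.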

If the README doesn't provide any clues, you can sometimes figure out the namespace to use by looking at the src directory in the repository. In the clj-tagsoup repository there's a file:

    src/pl/danieljanus/tagsoup.clj

Which again indicates a namespace of pl.danieljanus.tagsoup. Just convert "/" to "." and "_" to "-", then remove the file extension to produce the namespace of the file.
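The path-to-namespace rule described above can be written as a small function. This is a minimal sketch; path->ns is a hypothetical helper name introduced for illustration, and it assumes the path is given relative to the src directory.

```clojure
(require '[clojure.string :as str])

(defn path->ns
  "Hypothetical helper: turns a source path (relative to src) into the
  namespace name the file is expected to declare."
  [relative-path]
  (-> relative-path
      (str/replace #"\.clj[sc]?$" "")  ; remove the file extension
      (str/replace "/" ".")            ; directory separators become dots
      (str/replace "_" "-")))          ; underscores become hyphens

(path->ns "pl/danieljanus/tagsoup.clj")
;; => "pl.danieljanus.tagsoup"
```

With that namespace, the :require clause from the original question would presumably read (:require [pl.danieljanus.tagsoup :as html]).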

Crouton is an alternative HTML parsing library (that's coincidentally written by me) and can be found at: https://github.com/weavejester/crouton

Crouton uses a slightly different output syntax, which is compatible with Clojure's xml zipper functions, making it more suitable for document searches and traversal (IMO).

- James


Mike

Oct 14, 2015, 5:53:37 PM
to Clojure, ja...@booleanknot.com
Thanks James!  You helped me get another step along the way, I got this working.

Of course you mentioned Crouton; you should, and I asked for advice on my approach.  So please allow me to expand the problem statement so you can advise me further...

Once I get this HTML parsed, I know that somewhere buried in this page is an <input> tag that has a name="name" attribute, where I will specify the name value at run time.  I will need to be able to programmatically find this tag and pull some values out of it.  Will using clj-tagsoup or Crouton make this location operation easier?  Perhaps even using Enlive might make it easier, since the location and path to the tag is not known; it must be located.

James Reeves

Oct 14, 2015, 7:03:29 PM
to Mike, Clojure
I'm not that familiar with Enlive, so I can't comment on the ease of that approach.

However, the way I'd personally do it is that I'd make use of Crouton and the zipper functions in clojure.zip and clojure.data.zip. A zipper is a functional way of navigating an immutable data structure.

So first add Crouton and data.zip to your dependencies:

  [[crouton "0.1.2"]
   [org.clojure/data.zip "0.1.1"]]

Then use Crouton to parse the body of the response from clj-http:

  (crouton.html/parse (:body response))

This will give you a data structure that's compatible with clojure.xml, and therefore compatible with the XML zipper functions:

    ;; tags and attribute names are keywords in the clojure.xml format
    (dzx/xml1-> (z/xml-zip parsed-html)
                dz/descendants
                (dzx/tag= :input)
                (dzx/attr= :name "foo"))

In the above case I'm using the following namespace aliases:

  (require '[clojure.zip :as z]
           '[clojure.data.zip :as dz]
           '[clojure.data.zip.xml :as dzx])

It's been a while since I've needed to traverse X/HTML in Clojure though, so my code might be a little off.

- James

Matching Socks

Oct 14, 2015, 7:38:07 PM
to Clojure
(Enlive wraps JSoup and TagSoup and causes them both to return a value in the same format as clojure.xml.  Likewise, Enlive's transformation features will work with anything that looks like clojure.xml.)

Mike

Oct 14, 2015, 8:27:59 PM
to Clojure, ja...@booleanknot.com
So now I'm trying to make the conversion to Crouton.  Of course that is not going well.  Here is a chunk of code:

(ns one.core
  (:gen-class))

(require '[clj-http.client :as client]
         '[clojure.zip :as z]
         '[clojure.data.zip :as dz]
         '[clojure.data.zip.xml :as dzx]
         '[crouton.html :as html])

(defn get-post-data [url]
  (client/get url))

(def response (get-post-data login-URL))

(html/parse (:body response))

The response value is correct (it's the HTML), but when I try to execute the html/parse I get:

java.io.FileNotFoundException:

<html>
<head id="Head1"><title>
User Login Page
<\title>
<style>
        body
        {
            color: #000000;
            font: 12px\1.4 arial,FreeSans,Helvetica,sans-serif;
            margin: 0;

... TONS OF HTML DELETED ...

<\center>
<\form>
<\body>
<\html>
(The filename or extension is too long)
    (Unknown Source) java.io.FileInputStream.open0
    FileInputStream.java:195 java.io.FileInputStream.open
    FileInputStream.java:138 java.io.FileInputStream.<init>

... LOTS OF STACK TRACE DELETED ...

I hope someone can help.  TIA.

James Reeves

Oct 14, 2015, 9:36:54 PM
to Mike, Clojure
It looks like the response body is a string rather than a stream. Try using crouton.html/parse-string instead.
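The FileNotFoundException can be reproduced without Crouton. The sketch below assumes that crouton.html/parse hands its argument to clojure.java.io/input-stream, which is consistent with the FileInputStream frames in the stack trace but is not confirmed in this thread. Given a string, io/input-stream treats it as a URL or a file path, not as content. The names html-text, string-as-path-result and stream-result are made up for the example.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io ByteArrayInputStream FileNotFoundException))

(def html-text "<html><body><input name='foo'/></body></html>")

;; A string is interpreted as a URL first, then as a file path, so the
;; HTML text is looked up as a file that does not exist.
(def string-as-path-result
  (try
    (with-open [in (io/input-stream html-text)]
      :opened)
    (catch FileNotFoundException _
      :file-not-found)))

;; Wrapping the bytes of the string yields a stream over the content itself.
(def stream-result
  (with-open [in (io/input-stream
                  (ByteArrayInputStream. (.getBytes html-text "UTF-8")))]
    (slurp in)))

string-as-path-result
;; => :file-not-found
```

Calling a string-specific function such as parse-string avoids the problem, because the text is then parsed as a document and never resolved as a location.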

- James

Mike

Oct 15, 2015, 1:00:23 PM
to Clojure, ja...@booleanknot.com
I've read the clojure.data.zip.xml docs carefully and looked at many examples, but I don't understand this behavior:
 
(require '[clj-http.client :as client]
         '[clojure.zip :as z]
         '[clojure.data.zip :as dz]
         '[clojure.data.zip.xml :as dzx]
         '[crouton.html :as html])
(def my-html "<html>\n<body>\n<input src='a.png'/>\n</body>\n</html>")
(def my-zipper (z/xml-zip (html/parse-string my-html)))
(dzx/xml1-> my-zipper)
(dzx/xml1-> my-zipper dz/descendants)
(dzx/xml1-> my-zipper :html)

I am starting with this so that I can understand step-by-step what is happening here.  I built a simple HTML string, converted it into an XML zipper and then tried a few xml1-> calls on it:

(dzx/xml1-> my-zipper)

gives me the original zipper, which is what I expected.

(dzx/xml1-> my-zipper dz/descendants)

gives me what appears to be the original zipper structure, which I wasn't expecting.  I was expecting a flattened-out seq of the nodes.

(dzx/xml1-> my-zipper :html)

returns nil, which I really wasn't expecting.  Examples on the web led me to believe that this last call should match on the html tag.  Can anyone explain these calls and why I got these return values?



James Reeves

Oct 15, 2015, 2:46:07 PM
to Mike, Clojure
On 15 October 2015 at 18:00, Mike <mi...@thefrederickhome.name> wrote:

> (dzx/xml1-> my-zipper dz/descendants)
>
> gives me what appears to be the original zipper structure, which I wasn't expecting.  I was expecting a flattened-out seq of the nodes.

The dz/descendants function doesn't return a seq of nodes, but a seq of zippers. Remember, the moment you convert back into an individual node you lose the capability to go back up the tree. So the descendants are actually zipper locations, each one positioned at a descendant node.

Because you're using xml1-> and not xml->, you only get the first result of the seq. The seq returned by dz/descendants starts with the current location itself, which is why you get the original zipper back.

If you want to get a list of all the descendants as nodes, you need to use xml->, and put clojure.zip/node on the end:

  (dzx/xml-> my-zipper dz/descendants z/node)
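The difference between a zipper location and a plain node can be seen with clojure.zip alone. This is a sketch on a hand-written tree in the clojure.xml shape, assumed here to resemble what a parser returns. It walks the tree with z/next and does not use the data.zip implementation of descendants.

```clojure
(require '[clojure.zip :as z])

;; Hand-written stand-in for parsed HTML (clojure.xml format).
(def tree
  {:tag :html :attrs nil
   :content [{:tag :body :attrs nil
              :content [{:tag :input :attrs {:src "a.png"} :content nil}]}]})

(def root-loc (z/xml-zip tree))

;; Every location in depth-first order, starting with the root itself.
(def all-locs
  (take-while (complement z/end?) (iterate z/next root-loc)))

;; z/node turns each location back into the plain node it points at.
(map (comp :tag z/node) all-locs)
;; => (:html :body :input)

;; A location still knows its parent; a bare node map does not.
(def input-loc (last all-locs))
(:tag (z/node (z/up input-loc)))
;; => :body
```

The first element of all-locs is the root location, which mirrors what xml1-> returns when it takes only the first result of a descendant search.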

 
> (dzx/xml1-> my-zipper :html)
>
> returns nil, which I really wasn't expecting.  Examples on the web led me to believe that this last call should match on the html tag.  Can anyone explain these calls and why I got these return values?

This is because you're searching the children of the current node. So you're looking for a html tag inside a html tag.

These should return non-nil values:

  (dzx/xml1-> my-zipper :head)
  (dzx/xml1-> my-zipper :body)

If you want to find your input:

  (dzx/xml1-> my-zipper :body :input z/node)

Or to do a descendant search:

  (dzx/xml1-> my-zipper dz/descendants :input z/node)
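For the original goal, finding an input tag whose name is only known at run time, a descendant search can also be sketched with clojure.zip alone. find-input and sample are hypothetical names, and the sample tree is a hand-written stand-in for parsed HTML in the clojure.xml shape.

```clojure
(require '[clojure.zip :as z])

(defn find-input
  "Hypothetical helper: returns the first :input node whose name attribute
  equals input-name, or nil when there is none."
  [parsed-html input-name]
  (->> (iterate z/next (z/xml-zip parsed-html))   ; depth-first walk
       (take-while (complement z/end?))
       (map z/node)
       (filter #(and (map? %)                     ; skip text nodes (strings)
                     (= :input (:tag %))
                     (= input-name (get-in % [:attrs :name]))))
       first))

(def sample
  {:tag :html :attrs nil
   :content [{:tag :body :attrs nil
              :content ["Login"
                        {:tag :input :attrs {:name "user" :value "mike"} :content nil}
                        {:tag :input :attrs {:name "token" :value "abc123"} :content nil}]}]})

(get-in (find-input sample "token") [:attrs :value])
;; => "abc123"
```

Because the name is an ordinary function argument, it can be supplied at run time, and the returned node's :attrs map holds the values to pull out.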

- James

edbond

Oct 16, 2015, 9:04:30 AM
to Clojure, ja...@booleanknot.com
Hello Mike,

Take a look at hickory, it's more straightforward than enlive if you want to find something in html: https://github.com/davidsantiago/hickory

(ns ......
  (:require [hickory.core :as h]
            [hickory.select :as hs]
            [cljs.core.async :as a]))


(let [html   (:body (a/<! (http/get url)))
      parsed (-> html h/parse h/as-hickory)
      inputs (hs/select
               (hs/and
                 (hs/tag :input)
                 (hs/attr :href #(re-find #"sop://" %)))
               parsed)]
  ...)


Best regards,
Eduard