Parsing HTML in clojure


Base

Jun 5, 2011, 9:01:16 PM
to Clojure
hi all,

I am working on an app that will parse web pages to do some NLP and
statistics. I am able to parse the HTML using several different tools
(Enlive, HTML Parser, etc.). However, I would like to discard all the
rest of the junk in the web page that is not pertinent (e.g. ads).
Does anyone have any experience doing this? Any tips on how to do
this, or even better, tools that you can recommend? I have been
digging around on this for a while now and am stuck!

Thanks!

Base

Andreas Kostler

Jun 5, 2011, 11:04:57 PM
to clo...@googlegroups.com
There's a Java library called HtmlCleaner. You might wanna give that a shot.
Btw, I'm working on quite a similar project, so if you like, email me and maybe we can join forces.
Andreas

> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clo...@googlegroups.com
> Note that posts from new members are moderated - please be patient with your first post.
> To unsubscribe from this group, send email to
> clojure+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en

Myriam Abramson

Jun 5, 2011, 11:32:02 PM
to clo...@googlegroups.com
Me too, starting in October. I still need to get up to speed with Clojure however.

Bruce Williams

Jun 6, 2011, 12:14:04 AM
to clo...@googlegroups.com
I looked at HtmlCleaner and it pretty much cleans up the 'syntax' of the
HTML but does nothing with the 'semantics' (ads, etc.).

Bruce Williams
Concepts, like individuals, have their histories and are just as incapable of
withstanding the ravages of time as are individuals. But in and through all this
they retain a kind of homesickness for the scenes of their childhood.
Soren Kierkegaard

Base

Jun 6, 2011, 8:38:16 AM
to Clojure
Hi All -

Thanks for your help! I found this last night and it looks pretty
promising. It is apparently part of Apache Tika (which I had never
heard of until now), which has a lot of interesting functionality!

https://boilerpipe-web.appspot.com/

Thanks!

Rasmus Svensson

Jun 6, 2011, 8:47:53 AM
to clo...@googlegroups.com
2011/6/6 Base <basse...@gmail.com>:

In Enlive there are at least two approaches available:

The first approach is to use the 'select' function to pick out the
interesting part of the element tree. You use CSS-style selectors to
describe the element.

The second approach is to use the 'at' macro. You give it an element
tree and pairs of selectors and transformations. For each
selector-transformation pair, the transformation is applied to all
elements that match the selector. A transformation takes a node and
returns what it should be replaced with. You can do almost anything
with them, including removing the element (which might be useful for
the ads in your case) or extracting the text of the node. The matching
nodes deepest in the tree are processed first. The result of the 'at'
form is the element tree with all transformations applied.

Both 'select' and 'at' accept an element tree, which you can create
with the html-resource function, which accepts, among other things,
URLs.
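
To make the two approaches concrete, here is a minimal sketch. It assumes Enlive is on the classpath and required under the alias html. The sample markup, the div.ad and div#main selectors, and the main-text helper are made up for illustration; a real page will need its own selectors.

```clojure
(ns example.scrape
  (:require [net.cgrand.enlive-html :as html]))

;; html-snippet parses a string into a sequence of nodes. For a live page
;; you would use html-resource instead, e.g.
;; (html/html-resource (java.net.URL. "http://example.com/"))
(def page
  (html/html-snippet
   "<div id=\"main\"><p>Real article text.</p><div class=\"ad\">Buy now!</div></div>"))

(defn main-text
  "Hypothetical helper: drop every div.ad, then return the text of div#main."
  [nodes]
  (let [cleaned (html/at nodes [:div.ad] nil)       ; a nil transformation removes the match
        main    (html/select cleaned [:div#main])]  ; select returns a seq of matching nodes
    (apply str (map html/text main))))
```

Calling (main-text page) should give back only the article text, with the ad block gone.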

You probably need to write some html element processing functions, so
it's probably a good idea to get familiar with the data format of the
nodes:

Element: {:tag :a, :attrs {:href "http://example.com/"}, :content <sequence of nodes>}
Text: "text node"
Comment: {:type :comment, :data "comment node"}
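
Since the nodes are plain maps and strings, a processing function can be written in ordinary Clojure without any Enlive calls. A rough sketch follows; node-text, the skip-tags argument and the sample tree are names invented for this example, not part of Enlive.

```clojure
(defn node-text
  "Concatenate the text under node, ignoring comment nodes and any
  element whose tag is in the set skip-tags."
  [skip-tags node]
  (cond
    (string? node)                    node   ; text node
    (= :comment (:type node))         ""     ; comment node
    (contains? skip-tags (:tag node)) ""     ; unwanted element, e.g. :script
    (map? node) (apply str (map #(node-text skip-tags %) (:content node)))
    :else ""))

;; A hand-built tree in the format described above.
(def sample
  {:tag :div, :attrs nil,
   :content ["Hello, "
             {:tag :a, :attrs {:href "http://example.com/"}, :content ["world"]}
             {:type :comment, :data "comment node"}
             {:tag :script, :attrs nil, :content ["track();"]}]})

(node-text #{:script :style} sample)
;; => "Hello, world"
```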

I found the wiki of Enlive very useful. The "Getting Started" explains
what's there and how to use it very well, I think.
https://github.com/cgrand/enlive/wiki/_pages

I should also mention David Nolen's comprehensive tutorial which
begins with scraping: https://github.com/swannodette/enlive-tutorial

// raek

Mukul

Jun 6, 2011, 7:01:24 AM
to clo...@googlegroups.com
Hi,

I have worked on a similar project before and found the following link useful:

http://blog.prashanthellina.com/2009/07/27/extracting-relevant-text-from-html-pages/

Best regards
~ Mukul Joshi

Director & CEO,
SpotOn Software Pvt. Ltd.
SpotOn : One stop spot for your mobile development

Vicente Bosch Campos

Jun 6, 2011, 5:47:27 PM
to clo...@googlegroups.com

Hi,

I am a newbie to Clojure and have decided to try it out after many years of doing mostly Ruby.

Lately I have been working through some basic tutorials, and I am also reading The Joy of Clojure.

Anyway, in the process I am trying to settle on a suitable project-creation workflow with Leiningen and IntelliJ, without much luck.
What I am trying to understand is which steps are needed to get the project working correctly in IntelliJ, so that the REPL works, the dependencies are picked up, etc.

I have currently the following SW:

- Leiningen 1.5.2
- IntelliJ 10.5
- LA Clojure latest plugin
- Leiningen-IntelliJ plugin installed and configured to point to the binary.

So far I have tried the following:

1) Create the project with Leiningen from the command line: lein new monkeyproject
2) If I choose to open an existing project and point it at the project.clj:
- I have to add the lib folder to the dependencies manually, otherwise running won't work.
- It does not automatically add the Clojure "aspect" or tooling.
3) When I select Run on the project.clj, it tells me:

/Users/vbosch/Programming/Clojure/hello/project.clj
Exception in thread "main" java.lang.Exception: Unable to resolve symbol: defproject in this context

I understand that defproject is part of Leiningen and of course is not defined anywhere in the project.clj, so maybe Leiningen knows how to treat the file and that "intelligence" is missing in IntelliJ
(or maybe just my intelligence...).

What is the correct way to set up a project with Leiningen and IntelliJ?

How should the execution workflow go? (I can launch the REPL and load files into it with no issues, but should I do a lein run from outside the IDE to launch the full project, or what?)

Sorry for my lengthy email. (I promise I have looked at guides like http://blog.kartikshah.info/2010/12/how-to-intellij-idea-for.html ... but no luck.)

Regards,
Vicente

Phil Hagelberg

Jun 6, 2011, 8:05:16 PM
to Clojure
On Jun 6, 2:47 pm, Vicente Bosch Campos <vbosch.cloj...@gmail.com>
wrote:
> 1) Create project with Leiningen on the command console: lein new monkeyproject
> 2) If I select to open an existing project and point out to the project.clj:
>             - I have to add manually the lib folder to the dependencies otherwise running wont work.
>             - It does not add directly the clojure "aspect" or tooling
> 3) When I select run on the project.clj it indicates me:
>
> /Users/vbosch/Programming/Clojure/hello/project.clj
> Exception in thread "main" java.lang.Exception: Unable to resolve symbol: defproject in this context          

The project.clj file is just meant to be used by Leiningen. Your
project's actual code should go in src/. Can you load any of the files
in there?

-Phil

Matt Hoyt

Jun 6, 2011, 10:06:35 PM
to clo...@googlegroups.com
Run:
lein pom

Then in IntelliJ go to File -> Open Project, navigate to the project directory, and select the pom.xml file.


From: Vicente Bosch Campos <vbosch....@gmail.com>;
To: <clo...@googlegroups.com>;
Subject: How to setup IntelliJ with Leiningen?
Sent: Mon, Jun 6, 2011 9:47:27 PM


Vicente Bosch

Jun 7, 2011, 2:42:34 AM
to clo...@googlegroups.com
Hi Phil,

Yeah, I can load my actual code files in the IDE's REPL, but that is only half of the deal, no? (Sorry if that sounds too opinionated.)

Are we supposed to go outside the IDE to the console and launch "lein run", or is there a way to have the IntelliJ Leiningen plugin run the project? (I am perfectly fine with the command console, I just thought there would be a more integrated way.)

How do you load your Leiningen-created projects into IntelliJ? (I mean the actual steps.) Do you create a new project and point it at the folder where the generated code is, or do an "import/open" project and select a .clj file?

I have the feeling I am missing something, like there must be a nice way to do this so that IntelliJ will understand the dependencies, load Clojure mode... make dinner for me.

Deeply appreciate the help.

Best,
Vicente



Vicente Bosch

Jun 7, 2011, 5:18:54 AM
to clo...@googlegroups.com
Hi Matt,

Just saw your response. Working!

After we add a new dependency to project.clj, we have to run lein pom again so that everything gets refreshed. I am totally fine with this. Is my assumption correct, or is there a better way?

Regards,
Vicente

Sergey Didenko

Jun 7, 2011, 6:02:56 AM
to clo...@googlegroups.com
Hi Vicente,

I think it's:

*project.clj editing*

lein deps
lein pom
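
As an illustration of the "project.clj editing" step, adding a dependency might look roughly like the sketch below. The project name is the one from the thread; the description and both version numbers are assumptions, so check Clojars or Maven Central for the versions you actually want.

```clojure
;; project.clj -- read by Leiningen only; it is not part of the application code,
;; which is why loading it directly in the IDE fails on defproject.
(defproject monkeyproject "1.0.0-SNAPSHOT"
  :description "Example project (placeholder description)"
  :dependencies [[org.clojure/clojure "1.2.1"]   ; assumed version
                 [enlive "1.0.0"]])              ; newly added dependency, assumed version
```

After saving the file, lein deps fetches the new jar and lein pom regenerates the pom.xml that IntelliJ reads.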