common ways to run regex against either Hickory HTML or zippers?


lawrence...@gmail.com

Feb 2, 2022, 3:22:53 PM
to Clojure
Assume I've been cursed to scrape HTML. If I convert the pages to Hickory, I end up with a big mass of data which, sadly, mostly lacks the "class" or "id" attributes that would let me easily pick out the data I need. However, for the most part, the only thing I really need off this page is the CVE identifiers, which look like this:

CVE-2021-40539

I'm thinking I might write regex against the plain text of the page, but I'm also curious: is it common to take something like Hiccup or Hickory or a zipper and run regex through it? If yes, how is that done?

A small part of the data looks like this:

                :content
                [{:type :element,
                  :attrs
                  {:class "tip-intro", :style "font-size: 15px;"},
                  :tag :p,
                  :content
                  [{:type :element,
                    :attrs nil,
                    :tag :em,
                    :content
                    ["This Joint Cybersecurity Advisory uses the MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK®) framework, Version 8. See the "
                     {:type :element,
                      :attrs
                      {:href
                       "https://attack.mitre.org/versions/v9/techniques/enterprise/"},
                      :tag :a,
                      :content ["ATT&CK for Enterprise"]}
                     " for  referenced threat actor tactics and for techniques."]}]}
                 "\n\n"
                 {:type :element,
                  :attrs nil,
                  :tag :p,
                  :content
                  ["This joint advisory is the result of analytic efforts between the Federal Bureau of Investigation (FBI), United States Coast Guard Cyber Command (CGCYBER), and the Cybersecurity and Infrastructure Security Agency (CISA) to highlight the cyber threat associated with active exploitation of a newly identified vulnerability (CVE-2021-40539) in ManageEngine ADSelfService Plus—a self-service password management and single sign-on solution."]}
                 "\n\n"
                 {:type :element,
                  :attrs nil,
                  :tag :p,
                  :content
                  ["CVE-2021-40539, rated critical by the Common Vulnerability Scoring System (CVSS), is an authentication bypass vulnerability affecting representational state transfer (REST) application programming interface (API) URLs that could enable remote code execution. The FBI, CISA, and CGCYBER assess that advanced persistent threat (APT) cyber actors are likely among those exploiting the vulnerability. The exploitation of ManageEngine ADSelfService Plus poses a serious risk to critical infrastructure companies, U.S.-cleared defense contractors, academic institutions, and other entities that use the software. Successful exploitation of the vulnerability allows an attacker to place webshells, which enable the adversary to conduct post-exploitation activities, such as compromising administrator credentials, conducting lateral movement, and exfiltrating registry hives and Active Directory files."]}
                 "\n\n"

Mark Nutter

Feb 2, 2022, 3:58:45 PM
to clo...@googlegroups.com
I don't know how common it is, but have you looked at the `tree-seq` function in Clojure? This seems like a good use case for it.
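
A minimal sketch of the `tree-seq` idea, assuming the tree is a Hickory-style map like the excerpt above. The names `sample`, `cve-pattern`, `text-nodes`, and `find-cves` are made up for illustration, and the sample text is abbreviated from the original post. The pattern assumes CVE ids are "CVE-", a four-digit year, then four or more digits.

```clojure
;; Abbreviated, hypothetical stand-in for the Hickory data in the question.
(def sample
  {:type :element, :attrs nil, :tag :div,
   :content
   [{:type :element, :attrs nil, :tag :p,
     :content ["a newly identified vulnerability (CVE-2021-40539) in ManageEngine ADSelfService Plus"]}
    "\n\n"
    {:type :element, :attrs nil, :tag :p,
     :content ["CVE-2021-40539, rated critical by the Common Vulnerability Scoring System (CVSS)"]}]})

;; Assumed id shape: CVE-YYYY-NNNN with four or more trailing digits.
(def cve-pattern #"CVE-\d{4}-\d{4,}")

(defn text-nodes
  "Returns every string found under :content, in document order."
  [tree]
  ;; Maps are branches, their :content vector holds the children,
  ;; and strings are leaves.
  (filter string? (tree-seq map? :content tree)))

(defn find-cves
  "Runs the regex over each text node and returns the distinct matches."
  [tree]
  (->> (text-nodes tree)
       (mapcat #(re-seq cve-pattern %))
       distinct))

(find-cves sample)
;; => ("CVE-2021-40539")
```

Because each string is matched on its own, an id that the markup splits across two nodes would be missed, and attribute values such as `:href` are not searched at all.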

Mark


Cora Sutton

Feb 2, 2022, 4:16:53 PM
to clo...@googlegroups.com
If all you're looking for is the format CVE-NNNN-NNNNN then by all means just use regex against the plain text of the page. If you need to do DOM traversal then jsoup is a good choice. Otherwise, like Mark said, tree-seq is a great choice if you don't want to play with clojure.walk.
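
As a rough sketch of the plain-text route, and of what the clojure.walk route might look like for comparison: `page-text` and `tree` below are hypothetical stand-ins (in practice the text would come from wherever you fetch the page), and `cves-in-text` and `cves-via-walk` are names invented for this example. The pattern assumes four or more digits after the year.

```clojure
(require '[clojure.walk :as walk])

;; Assumed id shape: CVE-YYYY-NNNN with four or more trailing digits.
(def cve-pattern #"CVE-\d{4}-\d{4,}")

;; Hypothetical raw page text, abbreviated from the original post.
(def page-text
  (str "<p>a newly identified vulnerability (CVE-2021-40539) in ManageEngine</p>\n"
       "<p>CVE-2021-40539, rated critical by the CVSS</p>"))

(defn cves-in-text
  "Distinct CVE ids found anywhere in a string, markup included."
  [s]
  (distinct (re-seq cve-pattern s)))

(cves-in-text page-text)
;; => ("CVE-2021-40539")

;; Hypothetical Hickory-style tree for the walk variant.
(def tree
  {:type :element, :attrs {:class "tip-intro"}, :tag :p,
   :content ["a newly identified vulnerability (CVE-2021-40539) in ManageEngine"]})

(defn cves-via-walk
  "Visits every value in the nested data with postwalk and collects
   matches from each string it meets, attribute values included."
  [data]
  (let [found (atom [])]
    (walk/postwalk
     (fn [x]
       (when (string? x)
         (swap! found into (re-seq cve-pattern x)))
       x) ; return x unchanged so the walk leaves the data as it was
     data)
    (distinct @found)))

(cves-via-walk tree)
;; => ("CVE-2021-40539")
```

The plain-text version is the shortest path when only the ids matter. The walk version touches every string in the structure, so unlike a `:content`-only traversal it would also pick up ids that appear in attributes such as `:href`.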

Harold

Feb 2, 2022, 7:32:55 PM
to Clojure
I would use enlive for this.
 - https://github.com/cgrand/enlive


 - They were also having problems with text encoding; hopefully you won't.

hth,
-Harold

Laws

Feb 3, 2022, 1:43:44 PM
to Clojure
Thank you, everyone.