Convert a nested HTML list into a nested list in R


Yoni Sidi

Sep 17, 2015, 3:07:14 AM
to Israel R User Group
I am trying to convert this page https://cran.r-project.org/web/classifications/JEL.html into a nested list in R. Does anyone know a good way to do this?

thanks

yoni

amit gal

Sep 17, 2015, 4:23:28 AM
to israel-r-...@googlegroups.com
Not at my computer, but a simple recursion over the parsed XML/HTML object should do the trick.
Let me know if you need further details, and I'll try to give some sample code later.
How do you represent the HTML document?



--
You received this message because you are subscribed to the Google Groups "Israel R User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to israel-r-user-g...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Yoni Sidi

Sep 17, 2015, 5:41:42 AM
to Israel R User Group
Represent? I read the HTML using

library(rvest)


but that just dumps all the lists and sublists into separate objects in a list, disregarding the hierarchy.



Tal Galili

Sep 17, 2015, 5:54:56 AM
to israel-r-...@googlegroups.com
Why would you want them in a nested list of lists? For what purpose?







----------------Contact Details:-------------------------------------------------------
Contact me: Tal.G...@gmail.com
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
----------------------------------------------------------------------------------------------

Yoni Sidi

Sep 17, 2015, 7:56:51 AM
to Israel R User Group
I'm building an internal hierarchical search that joins another data.frame with external data depending on the branch and leaf. Don't mind the typos in the code I wrote before.
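
A rough sketch of how such a branch/leaf join could look once the nested list exists, in base R only. The helper `flatten_tree`, the toy `tree`, and the `external` data.frame are all made up for illustration; the real column names and join keys are not given in the thread.

```r
# Hypothetical helper: walk a named nested list and return one row per leaf,
# recording the path of names ("branch") that leads to it.
flatten_tree <- function(tree, path = character(0)) {
  rows <- lapply(seq_along(tree), function(i) {
    nm <- names(tree)[i]
    child <- tree[[i]]
    if (is.list(child) && length(child) > 0) {
      # Non-empty sublist: descend, extending the path with this name
      flatten_tree(child, c(path, nm))
    } else {
      # Leaf: emit one row with the path so far and the leaf name
      data.frame(branch = paste(path, collapse = "/"), leaf = nm,
                 stringsAsFactors = FALSE)
    }
  })
  do.call(rbind, rows)
}

# Toy tree standing in for the parsed page (assumed shape)
tree <- list(A = list(A1 = list(), A2 = list()), B = list())
index <- flatten_tree(tree)

# Made-up external data keyed by leaf name
external <- data.frame(leaf = c("A1", "B"), value = c(10, 20),
                       stringsAsFactors = FALSE)

# Left join: keep every leaf, attach external values where they exist
merged <- merge(index, external, by = "leaf", all.x = TRUE)
print(merged)
```

Flattening to one row per leaf keeps the hierarchy in the `branch` column, so an ordinary `merge()` can do the join.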

amit gal

Sep 17, 2015, 8:21:41 AM
to israel-r-...@googlegroups.com
Look at this code. It converts an HTML document to a nested list, named by element names:

library(XML)

# Parse the page and take its root element
root = xmlRoot(htmlParse("your file name here"))

# Recursively convert a node into a nested list named by element names
html2list = function(root) {
  result = list()
  name = xmlName(root)
  # lapply() keeps the names of xmlChildren(), i.e. the child element names
  sublist = lapply(xmlChildren(root), html2list)
  result[[name]] = sublist
  result
}

nested.list = html2list(root)

I didn't debug it thoroughly; I give the code just to show the direction.



Yoni Sidi

Sep 17, 2015, 9:17:26 AM
to Israel R User Group
Thanks, but that gives the element types and not the text in the lists. It is the right direction, though.

amit gal

Sep 17, 2015, 9:30:00 AM
to israel-r-...@googlegroups.com
Just change name = xmlName(root) to whatever value you want to save. I also think there is a small bug there that causes every node to be treated twice in a row. If it matters, I can fix the code and make it specific to the ul/li structure. Let me know if you need further help.

amit
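
One possible reading of the two changes mentioned above (keep the text rather than only the element name, and avoid the doubled level), as a minimal sketch. The name `html2list_text` and the inline HTML snippet are made up for illustration and are not from the thread; the doubling most likely comes from `lapply()` already naming each child by its element name before the function wraps the result under that name again.

```r
library(XML)

# Hypothetical variant of html2list: returns the children directly
# (no extra wrapping level) and keeps the text of leaf nodes.
html2list_text <- function(node) {
  children <- xmlChildren(node)
  # Leaf (text node or empty element): return its trimmed text
  if (length(children) == 0) return(trimws(xmlValue(node)))
  # lapply() keeps the names of xmlChildren(), i.e. the child element names
  lapply(children, html2list_text)
}

# Small inline document standing in for the real page
doc <- htmlParse(
  "<ul><li>A<ul><li>A1</li><li>A2</li></ul></li><li>B</li></ul>",
  asText = TRUE
)
ul <- getNodeSet(doc, "//body/ul")[[1]]
res <- html2list_text(ul)
str(res)
```

The result is still named by element type ("li", "text", "ul"), so it is a generic dump rather than a list keyed by the item labels.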



Yoni Sidi

Sep 20, 2015, 7:45:03 AM
to Israel R User Group
Thank you, Amit:

library(XML)
# root = xmlRoot(htmlParse("your file name here"))

# Generic version: nested list named by element names
html2list = function(root) {
  result = list()
  name = xmlName(root)
  sublist = lapply(xmlChildren(root), html2list)
  result[[name]] = sublist
  result
}

# Parse a node: convert every top-level ul found under it
pnode = function(node) {
  tuls = findTopLevelULS(node)
  if (length(tuls) == 1) {
    pul(tuls[[1]])
  } else {
    lapply(tuls, pul)
  }
}

# Parse one ul: return a list named by the text of each li
pul = function(ulTop) {
  result = list()
  # collect li elements
  children = xmlChildren(ulTop)
  cnames = sapply(children, xmlName)
  wli = which(cnames == "li")
  liNodes = children[wli]
  # extract the li text to use as names in the result list,
  # assuming the text is the first child of each li
  # (xmlValue() from XML is used here instead of rvest's html_text(),
  # which in current rvest versions expects xml2 nodes)
  nnames = sapply(liNodes, function(li) xmlValue(xmlChildren(li)[[1]]))
  # now build the sublists recursively
  # (seq_along() avoids the 1:0 loop when a ul has no li children)
  for (i in seq_along(nnames)) {
    result[[nnames[i]]] = pnode(liNodes[[i]])
  }
  result
}

# Return the ul nodes closest to the given node, without descending into them
findTopLevelULS = function(node) {
  if (xmlName(node) == "ul") return(list(node))
  children = xmlChildren(node)
  if (length(children) == 0) return(list())
  do.call(c, lapply(children, findTopLevelULS))
}
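
For readers on a current rvest, which is built on xml2 rather than XML, the same ul/li idea could be sketched with xml2 alone. This is an alternative under stated assumptions, not the code from the thread: `ul_to_list` and the inline HTML are hypothetical, the label is taken from the first child of each li (same assumption as above), and only a ul that is a direct child of the li is followed.

```r
library(xml2)

# Hypothetical helper: turn a <ul> node into a named nested list,
# using the first child of each <li> as the name.
ul_to_list <- function(ul) {
  items <- xml_find_all(ul, "./li")
  # Label of each item: text of its first child node (assumed to exist)
  labels <- vapply(items, function(li) {
    trimws(xml_text(xml_contents(li)[[1]]))
  }, character(1))
  # Recurse into a directly nested <ul>, if the item has one
  out <- lapply(items, function(li) {
    sub <- xml_find_first(li, "./ul")
    if (inherits(sub, "xml_missing")) list() else ul_to_list(sub)
  })
  names(out) <- labels
  out
}

# Small inline document standing in for the real page
doc <- read_html("<ul><li>A<ul><li>A1</li><li>A2</li></ul></li><li>B</li></ul>")
tree <- ul_to_list(xml_find_first(doc, "//ul"))
str(tree)
```

Leaves come back as empty lists, so every level of the result has the same type and can be walked with the same recursion.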
