Re: XML to CSV with ldply


Brandon Hurr

Jun 19, 2018, 4:29:58 PM

to gi...@tcalumni.columbia.edu, manipulatr

Gerald,

It would help a ton if you could include your example xml file.

One thing that's bugging me is that you're using ldply from plyr. I wonder if you could use something else, like purrr::map or dplyr::bind_rows, to slap everything together.

I did a quick google and there were some good resources that might help you solve your problem:

https://dantonnoriega.github.io/ultinomics.org/post/2017-04-18-xmltools-package.html

https://github.com/r-lib/xml2

https://data.metinyazici.org/2017/10/working-with-web-data-in-r.html

HTH,

B

On Tue, Jun 19, 2018 at 1:22 PM Gerald C <gi...@tcalumni.columbia.edu> wrote:

When I run the function on the list file, I get the following error:
> listFile <- xmlToList(doc)
> data <- ldply(listFile, data.frame)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 1, 3, 5
Although the XML file is nested, each document represents one row - a single record. Is there a way to modify the arguments to imply 1 row?
Thanks.

complete script and file below

require(XML)
require(plyr)

file <- "C:\\Users\\GCheves\\Test-Files\\Test.xml"

## xmlParse parses the xml file and stores it in the object doc

doc <- xmlParse(file, useInternalNodes = TRUE)

## convert xml doc to list

listFile <- xmlToList(doc)

data <- ldply(listFile, data.frame)
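
A note on the error above: data.frame() is most likely being handed one list element whose components have lengths 1, 3 and 5, so it cannot line them up as a single row. One general way to get one row per record is to flatten the nested structure so that every leaf value becomes its own column. Here is a minimal sketch of that idea in Python (the language of the solution posted later in this thread), not a fix for the R call itself. The real XML isn't shown here, so the `flatten` helper and the sample record, including its keys and values, are hypothetical.

```python
def flatten(node, prefix=""):
    """Collapse a nested dict/list into one flat dict of scalar leaves."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        # A scalar leaf: it becomes one column, named by its path.
        return {prefix: node}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


# Hypothetical record: components of length 1, 3 and 3 under one document.
record = {
    "id": "42",
    "name": {"first": "Ada", "middle": "K", "last": "Lee"},
    "address": ["1 Main St", "Springfield", "IL"],
}

row = flatten(record)  # one flat mapping, i.e. one row
print(row)
```

Every record flattened this way is a single row, and rows with different sets of keys can be combined afterwards by filling the missing columns with blanks.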


Gerald C

Jun 28, 2018, 8:25:44 PM

to manipulatr

Please don't banish me from the group. This is the solution I settled on:

#! Python3

"""
Program uses xml.etree module to select data elements from the xml tree, names
row headers, and csv.writer then writes them to a csv file. Rows are appended
to the file as each file in the directory is processed.

Program Author:        Gerald I Cheves
Date written:        25 June 2018
Last Revision:        28 June 2018
"""

import winsound
import xml.etree.ElementTree as ET
import os
import sys
import csv

# newline='' stops the csv module from writing blank lines between rows on Windows
with open('C:\\Users\\GCheves\\aladin\\Output\\xml-csv-output.csv', 'w', newline='') as fout:
    writer = csv.writer(fout, dialect='excel')
    row = ['ObjectType', 'ObjectID', 'Ethnicity', 'Title', 'Address', 'City', 'State', 'Gender', 'DateOfBirth', 'FirstName', 'MiddleName', 'Lastname', 'Height', 'PropertyRole', 'LinkType', 'ChildRef', 'ParentRef']
    writer.writerow(row)

    # raw string, so the backslashes are not read as escape sequences
    folderpath = r'C:\Users\GCheves\aladin\Person-Files'
    for filename in os.listdir(folderpath):
        if filename.endswith('.xml'):
            filepath = os.path.join(folderpath, filename)
            tree = ET.parse(filepath)
            root = tree.getroot()

            ns = {'aa': 'http://www.aladintech.com/pg/schema/export/'}

            for child in root.findall(".//aa:object[@type='com.aladin.object.person']", ns):
                ObjectType = child.attrib.get('type')
                ObjectID = child.attrib.get('id')
                #print(ObjectType)
                #print(ObjectID)

            for child in root.findall(".//aa:property[@type='com.aladin.property.fic.Ethnicity']", ns):
                child = child.find('./aa:propertyValue/aa:propertyData', ns)
                Ethnicity = child.text
                #print(Ethnicity)

            for child in root.findall(".//aa:property[@type='com.aladin.property.IntrinsicTitle']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='TITLE']/aa:propertyData", ns)
                Title = child.text
                #print(Title)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Address']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='ADDRESS1']/aa:propertyData", ns)
                Address = child.text
                #print(Address)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Address']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='CITY']/aa:propertyData", ns)
                City = child.text
                #print(City)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Address']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='STATE']/aa:propertyData", ns)
                State = child.text
                #print(State)

            for child in root.findall(".//aa:property[@type='com.aladin.property.fic.Gender']", ns):
                child = child.find('./aa:propertyValue/aa:propertyData', ns)
                Gender = child.text
                #print(Gender)

            for child in root.findall(".//aa:property[@type='com.aladin.property.DateofBirth']", ns):
                child = child.find('./aa:propertyValue/aa:propertyData', ns)
                DateOfBirth = child.text
                #print(DateOfBirth)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Name']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='FIRST_NAME']/aa:propertyData", ns)
                FirstName = child.text
                #print(FirstName)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Name']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='MIDDLE_NAME']/aa:propertyData", ns)
                MiddleName = child.text
                #print(MiddleName)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Name']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='LAST_NAME']/aa:propertyData", ns)
                LastName = child.text
                #print(LastName)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Height']", ns):
                child = child.find('./aa:propertyValue/aa:propertyData', ns)
                Height = child.text
                #print(Height)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Role']", ns):
                child = child.find('./aa:propertyValue/aa:propertyData', ns)
                PropertyRole = child.text
                #print(PropertyRole)

            for child in root.findall(".//aa:link[@type='com.aladin.link.ApparentPersonInCharge']", ns):
                LinkType = child.attrib['type'].split('.')[-1]
                ParentRef = child.attrib.get('parentRef')
                ChildRef = child.attrib.get('childRef')
                print(LinkType)
                #print(ParentRef)
                #print(ChildRef)

            # write one row per XML file, inside the loop over files
            row = [ObjectType, ObjectID, Ethnicity, Title, Address, City, State, Gender, DateOfBirth, FirstName, MiddleName, LastName, Height, PropertyRole, LinkType, ChildRef, ParentRef]
            writer.writerow(row)

winsound.Beep(500, 1000)
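
The repeated find-and-assign loops above could probably be folded into one small helper. Below is a minimal sketch of that idea, using only the namespace and the property/component type strings that appear in the script. The `prop_text` and `extract_row` helpers, the `FIELDS` table and the inline `SAMPLE` document are hypothetical. The real export layout isn't shown in this thread, so the sample is only a guess reconstructed from the XPath expressions. It writes to an in-memory buffer so it runs without any files on disk.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {'aa': 'http://www.aladintech.com/pg/schema/export/'}

# Hypothetical sample document; element nesting is inferred from the XPaths.
SAMPLE = """<export xmlns="http://www.aladintech.com/pg/schema/export/">
  <object type="com.aladin.object.person" id="p1">
    <property type="com.aladin.property.fic.Gender">
      <propertyValue><propertyData>F</propertyData></propertyValue>
    </property>
    <property type="com.aladin.property.Name">
      <propertyValue>
        <propertyComponent type="FIRST_NAME"><propertyData>Ada</propertyData></propertyComponent>
        <propertyComponent type="LAST_NAME"><propertyData>Lee</propertyData></propertyComponent>
      </propertyValue>
    </property>
  </object>
</export>"""

# (column name, property type, component type or None)
FIELDS = [
    ('Gender', 'com.aladin.property.fic.Gender', None),
    ('FirstName', 'com.aladin.property.Name', 'FIRST_NAME'),
    ('MiddleName', 'com.aladin.property.Name', 'MIDDLE_NAME'),
    ('LastName', 'com.aladin.property.Name', 'LAST_NAME'),
]


def prop_text(root, prop_type, component=None):
    """Text of the first matching propertyData, or '' when it is absent."""
    path = f".//aa:property[@type='{prop_type}']/aa:propertyValue"
    if component:
        path += f"/aa:propertyComponent[@type='{component}']"
    path += "/aa:propertyData"
    return root.findtext(path, default='', namespaces=NS)


def extract_row(root):
    """Build one CSV row (one record) from one parsed document."""
    return [prop_text(root, ptype, comp) for _, ptype, comp in FIELDS]


root = ET.fromstring(SAMPLE)
out = io.StringIO()
writer = csv.writer(out, dialect='excel')
writer.writerow([name for name, _, _ in FIELDS])
writer.writerow(extract_row(root))
print(out.getvalue())
```

Two behaviours differ from the script above. First, findtext takes the first match, where the loops keep the last one. Second, a missing element yields an empty cell. In the script it would either raise an error on `child.text` or silently reuse the value left over from the previous file.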

Brandon Hurr

Jun 29, 2018, 11:22:26 AM

to Gerald Cheves, manipulatr

Glad that you solved it. Sadly I didn't have an answer and I'm guessing neither did anyone else who follows this list.

My suggestion would be to follow up on the RStudio Community page. I'm certain you can get to a similar result in R with the tidyverse, but I don't work with XML much, so it's hard to say.

B
