Re: XML to CSV with ldply

Brandon Hurr

Jun 19, 2018, 4:29:58 PM
to gi...@tcalumni.columbia.edu, manipulatr
Gerald,

It would help a ton if you could include your example xml file.

One thing that's bugging me is that you're using ldply from plyr. I wonder whether you could use something else, like purrr::map or dplyr::bind_rows, to slap everything together.

I did a quick google and there were some good resources that might help you solve your problem:

HTH,
B

On Tue, Jun 19, 2018 at 1:22 PM Gerald C <gi...@tcalumni.columbia.edu> wrote:
When I run the function on the list file I get the following error

> listFile <- xmlToList(doc)
> data <- ldply(listFile, data.frame)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 1, 3, 5

Although the XML file is nested, each document represents one row - a single record. Is there a way to modify the arguments to imply 1 row?
Thanks.

complete script and file below

require(XML)
require(plyr)

file <- "C:\\Users\\GCheves\\Test-Files\\Test.xml"

## xmlParse parses the XML file and stores it in the object doc

doc <- xmlParse(file, useInternalNodes = TRUE)

## convert xml doc to list

listFile <- xmlToList(doc)

data <- ldply(listFile, data.frame)




Gerald C

Jun 28, 2018, 8:25:44 PM
to manipulatr
Please don't banish me from the group. This is the solution I settled on:

#! python3

"""
Uses the xml.etree.ElementTree module to select data elements from each XML
tree and csv.writer to write them to a CSV file. A header row is written
first; one row is then appended for each XML file processed in the directory.

Program Author:        Gerald I Cheves
Date written:        25 June 2018
Last Revision:        28 June 2018
"""

import winsound   
import xml.etree.ElementTree as ET
import os
import sys
import csv



# newline='' keeps csv.writer from emitting blank lines between rows on Windows
with open('C:\\Users\\GCheves\\aladin\\Output\\xml-csv-output.csv', 'w', newline='') as fout:
    writer = csv.writer(fout, dialect='excel')
    row = ['ObjectType', 'ObjectID', 'Ethnicity', 'Title', 'Address', 'City', 'State', 'Gender', 'DateOfBirth', 'FirstName', 'MiddleName', 'LastName', 'Height', 'PropertyRole', 'LinkType', 'ChildRef', 'ParentRef']
    writer.writerow(row)

    # Raw string so the backslashes are not read as escape sequences (\U, \a)
    folderpath = r'C:\Users\GCheves\aladin\Person-Files'
    for filename in os.listdir(folderpath):
        if filename.endswith('.xml'):
            filepath = os.path.join(folderpath, filename)
            tree = ET.parse(filepath)
            root = tree.getroot()

            ns = {'aa':  'http://www.aladintech.com/pg/schema/export/'}


            for child in root.findall(".//aa:object[@type='com.aladin.object.person']", ns):
                ObjectType = child.attrib.get('type')
                ObjectID = child.attrib.get('id')
                #print(ObjectType)
                #print(ObjectID)

            for child in root.findall(".//aa:property[@type='com.aladin.property.fic.Ethnicity']", ns):
                child = child.find('./aa:propertyValue/aa:propertyData', ns)
                Ethnicity = child.text
                #print(Ethnicity)

            for child in root.findall(".//aa:property[@type='com.aladin.property.IntrinsicTitle']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='TITLE']/aa:propertyData", ns)
                Title = child.text
                #print(Title)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Address']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='ADDRESS1']/aa:propertyData", ns)
                Address = child.text
                #print(Address)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Address']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='CITY']/aa:propertyData", ns)
                City = child.text
                #print(City)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Address']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='STATE']/aa:propertyData", ns)
                State = child.text
                #print(State)

            for child in root.findall(".//aa:property[@type='com.aladin.property.fic.Gender']", ns):
                child = child.find('./aa:propertyValue/aa:propertyData', ns)
                Gender = child.text
                #print(Gender)

            for child in root.findall(".//aa:property[@type='com.aladin.property.DateofBirth']", ns):
                child = child.find('./aa:propertyValue/aa:propertyData', ns)
                DateOfBirth = child.text
                #print(DateOfBirth)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Name']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='FIRST_NAME']/aa:propertyData", ns)
                FirstName = child.text
                #print(FirstName)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Name']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='MIDDLE_NAME']/aa:propertyData", ns)
                MiddleName = child.text
                #print(MiddleName)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Name']", ns):
                child = child.find("./aa:propertyValue/aa:propertyComponent[@type='LAST_NAME']/aa:propertyData", ns)
                LastName = child.text
                #print(LastName)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Height']", ns):
                child = child.find('./aa:propertyValue/aa:propertyData', ns)
                Height = child.text
                #print(Height)

            for child in root.findall(".//aa:property[@type='com.aladin.property.Role']", ns):
                child = child.find('./aa:propertyValue/aa:propertyData', ns)
                PropertyRole = child.text
                #print(PropertyRole)

            for child in root.findall(".//aa:link[@type='com.aladin.link.ApparentPersonInCharge']", ns):
                LinkType = child.attrib['type'].split('.')[-1]
                ParentRef = child.attrib.get('parentRef')
                ChildRef = child.attrib.get('childRef')
                print(LinkType)
                #print(ParentRef)
                #print(ChildRef)

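            # One row per XML file: keep this at the same indentation as the
            # lookups above so that it runs for every file, not just the last one
            row = [ObjectType, ObjectID, Ethnicity, Title, Address, City, State, Gender, DateOfBirth, FirstName, MiddleName, LastName, Height, PropertyRole, LinkType, ChildRef, ParentRef]
            writer.writerow(row)
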
winsound.Beep(500, 1000)
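
One caveat with the script above: each field is assigned inside a loop over `findall`, so a file that lacks a property leaves the variable undefined or still holding the previous file's value, and a `find` that returns `None` would raise on `.text`. Below is a minimal sketch of one possible guard. It assumes the element layout implied by the XPath expressions in the script. The sample XML, the `FIELDS` table, and the `person_row` helper are hypothetical names made up for illustration, and only a subset of the columns is shown.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {'aa': 'http://www.aladintech.com/pg/schema/export/'}

# Hypothetical sample document, reconstructed from the XPath expressions in
# the script; the real export format may differ.
SAMPLE = """\
<export xmlns="http://www.aladintech.com/pg/schema/export/">
  <object type="com.aladin.object.person" id="42">
    <property type="com.aladin.property.Name">
      <propertyValue>
        <propertyComponent type="FIRST_NAME"><propertyData>Ada</propertyData></propertyComponent>
        <propertyComponent type="LAST_NAME"><propertyData>Lovelace</propertyData></propertyComponent>
      </propertyValue>
    </property>
    <property type="com.aladin.property.fic.Gender">
      <propertyValue><propertyData>F</propertyData></propertyValue>
    </property>
  </object>
</export>
"""

# Column name -> XPath relative to the document root (subset of the columns).
NAME = ".//aa:property[@type='com.aladin.property.Name']/aa:propertyValue"
FIELDS = {
    'Gender': ".//aa:property[@type='com.aladin.property.fic.Gender']"
              "/aa:propertyValue/aa:propertyData",
    'FirstName': NAME + "/aa:propertyComponent[@type='FIRST_NAME']/aa:propertyData",
    'LastName': NAME + "/aa:propertyComponent[@type='LAST_NAME']/aa:propertyData",
    # Not present in SAMPLE, to show the fallback to an empty string.
    'Height': ".//aa:property[@type='com.aladin.property.Height']"
              "/aa:propertyValue/aa:propertyData",
}


def person_row(root):
    """Build one CSV row (as a dict) from a parsed document root."""
    person = root.find(".//aa:object[@type='com.aladin.object.person']", NS)
    row = {'ObjectID': person.get('id', '') if person is not None else ''}
    for column, xpath in FIELDS.items():
        # findtext returns the default when no element matches the path
        row[column] = root.findtext(xpath, default='', namespaces=NS)
    return row


root = ET.fromstring(SAMPLE)
row = person_row(root)

# Write to an in-memory buffer here; a real run would open a file with newline=''
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

In the directory loop, `ET.parse(filepath).getroot()` would take the place of `ET.fromstring(SAMPLE)`. Note that `findtext` returns the first match, whereas the loops in the script keep the last one; with one person per file the two should agree.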

Brandon Hurr

Jun 29, 2018, 11:22:26 AM
to Gerald Cheves, manipulatr
Glad that you solved it. Sadly I didn't have an answer and I'm guessing neither did anyone else who follows this list.

My suggestion would be to follow up on the RStudio Community site. I'm certain you can get to a similar result in R with the tidyverse, but I don't work with XML much, so it's hard to say.

B