How to use StreamXmlRecordReader to parse single & multiline xml records within a single file

ravi...@fractalanalytics.com

Jun 12, 2016, 12:16:23 AM

to Hadoop Learners from Hadoop-skills.com

I have an input file (txt) with the following content:

<a><b><c>val1</c></b></a>||<a><b><c>val2</c></b></a>||<a><b>
<c>val3</c></b></a>||<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>

If you look at the input carefully, the third XML record (the one after the second '||') is split across two lines.
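To make the intended record boundaries concrete, here is a minimal Python sketch of marker-based splitting. It is only an illustration and not how StreamXmlRecordReader is implemented; the function name `split_records` is hypothetical, and it assumes a record is everything from a `<a>` marker to the next `</a>` marker, with line breaks inside a record dropped.

```python
def split_records(text, begin="<a>", end="</a>"):
    """Return every begin..end span in text, with newlines inside a span removed."""
    records = []
    pos = 0
    while True:
        start = text.find(begin, pos)
        if start == -1:
            break
        stop = text.find(end, start)
        if stop == -1:
            break  # unterminated record: ignore the tail
        stop += len(end)
        # Drop line breaks so a record split across lines becomes one string.
        records.append(text[start:stop].replace("\n", ""))
        pos = stop
    return records


# The sample input from this post, including the line break inside record 3.
sample = (
    "<a><b><c>val1</c></b></a>||<a><b><c>val2</c></b></a>||<a><b>\n"
    "<c>val3</c></b></a>||<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>"
)

records = split_records(sample)
for number, record in enumerate(records, 1):
    print("rec::%d" % number)
    print(record)
```

With this kind of splitting the sample yields four records, and the third one comes out as a single string.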

I want to use the StreamXmlRecordReader of Hadoop Streaming to parse this file, with the following option:

-inputreader "org.apache.hadoop.streaming.StreamXmlRecordReader,begin=<a>,end=</a>,slowmatch=true"

With this option, I am unable to parse the 3rd record.

I am getting the following error:

Traceback (most recent call last):
  File "/home/rsome/test/code/m1.py", line 13, in <module>
    root = ET.fromstring(xml_str.getvalue())
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 964, in XML
    return parser.close()
  File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 1254, in close
    self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 18478
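The "no element found" message is what the parser reports when the input ends before the root element is closed. The sketch below reproduces that with `<a><b>`, the first half of the split third record, on the assumption that the mapper handed such an incomplete fragment to `ET.fromstring`. It is written for Python 2.7 or later, where the exception is `ET.ParseError`; on Python 2.6, as in the traceback above, the same condition surfaces as `xml.parsers.expat.ExpatError`.

```python
import xml.etree.ElementTree as ET

# First half of the third record, as it appears before the line break.
fragment = "<a><b>"

try:
    ET.fromstring(fragment)
    error_message = None
except ET.ParseError as exc:
    # The input ended while <a> and <b> were still open.
    error_message = str(exc)

print(error_message)
```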

I have also set slowmatch=true, but still no luck.

Below is the output from running the above:

$ hdfs dfs -text /user/rsome/poc/testout001/part-*
rec::1
<a><b><c>val1</c></b></a>
rec::2
<a><b><c>val2</c></b></a>
rec::3
<a><b>
rec::4
<c>val3</c></b></a>
rec::1
<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>

My expected output is:

$ hdfs dfs -text /user/rsome/poc/testout001/part-*
rec::1::mapper1
<a><b><c>val1</c></b></a>
rec::2::mapper1
<a><b><c>val2</c></b></a>
rec::3::mapper1
<a><b><c>val3</c></b></a>
rec::1::mapper2
<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>
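One possible way to get from the observed output to the expected one is to buffer lines in the mapper until the end marker is seen. This is only a sketch under stated assumptions: it assumes the mapper receives the record text line by line as in the observed output, that a record always ends with `</a>` at the end of a line, and the helper name `join_records` is hypothetical. It does not change how StreamXmlRecordReader itself splits the input.

```python
def join_records(lines, end="</a>"):
    """Yield complete records, joining consecutive lines until one ends with the end marker."""
    buffered = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        buffered.append(line)
        if line.endswith(end):
            yield "".join(buffered)
            buffered = []
    # Any trailing fragment without an end marker is dropped.


# The record lines from the observed output above, in order.
observed = [
    "<a><b><c>val1</c></b></a>\n",
    "<a><b><c>val2</c></b></a>\n",
    "<a><b>\n",
    "<c>val3</c></b></a>\n",
    "<a></b><c>val4-c-1</c><c>val4-c-2</c></b><d>val-d-1</d></a>\n",
]

joined = list(join_records(observed))
for number, record in enumerate(joined, 1):
    print("rec::%d" % number)
    print(record)
```

In an actual streaming mapper, `sys.stdin` could be passed in place of the `observed` list.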