Simple XML parsing - this should be easy


da...@cloudera.com

Nov 6, 2017, 5:35:52 PM
to sdc-user
New to StreamSets, so I apologize in advance if I am doing something goofy. All I want to do is parse an XML file with the following format:

<?xml version="1.0" encoding="utf-8"?>
<ordata>
  <row Id="2" Id2="1" Count="7" ... />
.
.
.
</ordata>

I've tried multiple combinations of directory reader with the XML data format, including the XPath /ordata/row/ as the record delimiter, row as the record delimiter, and no record delimiter at all. I'm wondering whether it fails because all the fields are attributes, or because there's no explicit end tag. In preview, all I get back is:

Event Record1 (new-file): {MAP}
  filepath: {STRING} "/STREAMSETS/so/source/Data.xml

The sdc log file contains the following error:

2017-11-06 17:29:32,251 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] INFO  Pipeline - Processing lifecycle start event with stage
2017-11-06 17:29:32,254 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] ERROR SpoolDirSource - Failed to process file '/STREAMSETS/SO/source/Data.xml' at position '-1': com.streamsets.pipeline.stage.origin.spooldir.BadSpoolFileException: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_00 - Cannot advance reader 'Data.xml' to offset '0'
com.streamsets.pipeline.stage.origin.spooldir.BadSpoolFileException: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_00 - Cannot advance reader 'Data.xml' to offset '0'
        at com.streamsets.pipeline.stage.origin.spooldir.SpoolDirSource.produce(SpoolDirSource.java:652)
        at com.streamsets.pipeline.stage.origin.spooldir.SpoolDirSource.produce(SpoolDirSource.java:510)
        at com.streamsets.pipeline.configurablestage.DSource.produce(DSource.java:38)
        at com.streamsets.datacollector.runner.StageRuntime$2.call(StageRuntime.java:228)
        at com.streamsets.datacollector.runner.StageRuntime$2.call(StageRuntime.java:222)
        at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:180)
        at com.streamsets.datacollector.runner.StageRuntime.execute(StageRuntime.java:249)
        at com.streamsets.datacollector.runner.StagePipe.process(StagePipe.java:231)
        at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.runPollSource(PreviewPipelineRunner.java:315)
        at com.streamsets.datacollector.runner.preview.PreviewPipelineRunner.run(PreviewPipelineRunner.java:214)
        at com.streamsets.datacollector.runner.Pipeline.run(Pipeline.java:510)
        at com.streamsets.datacollector.runner.preview.PreviewPipeline.run(PreviewPipeline.java:51)
        at com.streamsets.datacollector.execution.preview.sync.SyncPreviewer.start(SyncPreviewer.java:206)
        at com.streamsets.datacollector.execution.preview.async.AsyncPreviewer.lambda$start$0(AsyncPreviewer.java:94)
        at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.lambda$call$0(SafeScheduledExecutorService.java:249)
        at com.streamsets.datacollector.security.GroupsInScope.execute(GroupsInScope.java:33)
        at com.streamsets.pipeline.lib.executor.SafeScheduledExecutorService$SafeCallable.call(SafeScheduledExecutorService.java:245)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: com.streamsets.pipeline.lib.parser.DataParserException: XML_PARSER_00 - Cannot advance reader 'Data.xml' to offset '0'
        at com.streamsets.pipeline.lib.parser.xml.XmlDataParserFactory.createParser(XmlDataParserFactory.java:80)
        at com.streamsets.pipeline.lib.parser.xml.XmlDataParserFactory.getParser(XmlDataParserFactory.java:60)
        at com.streamsets.pipeline.lib.parser.WrapperDataParserFactory.getParser(WrapperDataParserFactory.java:65)
        at com.streamsets.pipeline.stage.origin.spooldir.SpoolDirSource.produce(SpoolDirSource.java:585)
        ... 22 more
Caused by: java.io.IOException: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
        at com.streamsets.pipeline.lib.parser.xml.XmlCharDataParser.<init>(XmlCharDataParser.java:89)
        at com.streamsets.pipeline.lib.parser.xml.XmlDataParserFactory.createParser(XmlDataParserFactory.java:77)
        ... 25 more
Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
Message: Content is not allowed in prolog.
        at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:596)
        at com.sun.xml.internal.stream.XMLEventReaderImpl.peek(XMLEventReaderImpl.java:276)
        at javax.xml.stream.util.EventReaderDelegate.peek(EventReaderDelegate.java:104)
        at com.streamsets.pipeline.lib.xml.StreamingXmlParser.skipIgnorable(StreamingXmlParser.java:232)
        at com.streamsets.pipeline.lib.xml.StreamingXmlParser.hasNext(StreamingXmlParser.java:238)
        at com.streamsets.pipeline.lib.xml.StreamingXmlParser.<init>(StreamingXmlParser.java:113)
        at com.streamsets.pipeline.lib.xml.OverrunStreamingXmlParser.<init>(OverrunStreamingXmlParser.java:59)
        at com.streamsets.pipeline.lib.parser.xml.XmlCharDataParser.<init>(XmlCharDataParser.java:80)
        ... 26 more
2017-11-06 17:29:32,254 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] ERROR DirectorySpooler - Leaving file in error '/STREAMSETS/SO/source/Data.xml' in spool directory
2017-11-06 17:29:32,254 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] INFO  Pipeline - Destroying pipeline with reason=UNKNOWN
2017-11-06 17:29:32,255 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] INFO  Pipeline - Processing lifecycle stop event
2017-11-06 17:29:32,255 [user:*admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:preview-pool-1-thread-1] INFO  Pipeline - Pipeline finished destroying with final reason=FAILURE
2017-11-06 17:29:33,444 [user:admin] [pipeline:SO Input/SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e] [runner:] [thread:webserver-127] WARN  StandaloneAndClusterPipelineManager - Evicting idle previewer 'SOInputfda19c9f-674f-4325-99b5-1e0533a68d4e::0'::'47ee8166-ccaf-4e87-b576-e030695edc91' in status 'FINISHED

Thank you!

Pat Patterson

Nov 6, 2017, 6:19:53 PM
to da...@cloudera.com, sdc-user
Hi Dan,

This worked for me...

Input file:

<?xml version="1.0" encoding="utf-8"?>
<ordata>
  <row Id="2" Id2="1" Count="7" />
</ordata>

Directory Origin config:

Inline image 1

Preview:

Inline image 2

Are you positive there's nothing on the first line of the file except

<?xml version="1.0" encoding="utf-8"?>

That seems to be what it's complaining about.

Take a look at the top of the file with hexdump. I see this for my test file:

$ hexdump -C -n 80 ~/Documents/file.xml
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 75 74  |.0" encoding="ut|
00000020  66 2d 38 22 3f 3e 0a 3c  6f 72 64 61 74 61 3e 0a  |f-8"?>.<ordata>.|
00000030  20 20 3c 72 6f 77 20 49  64 3d 22 32 22 20 49 64  |  <row Id="2" Id|
00000040  32 3d 22 31 22 20 43 6f  75 6e 74 3d 22 37 22 20  |2="1" Count="7" |

I suspect you might have a byte order mark in your file.
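If hexdump isn't handy, the same check can be sketched in a few lines of Python. This is only an illustration, not part of StreamSets: the helper name `has_utf8_bom` is made up here, and it simply compares the first three bytes of the file against the UTF-8 byte order mark (EF BB BF).

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM (EF BB BF).

    Hypothetical helper for illustration; `path` is any file you want to check.
    """
    with open(path, "rb") as f:
        # Read only as many bytes as the BOM is long (3) and compare.
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8
```

Calling `has_utf8_bom("Data.xml")` on a file like the one in the question would return True if the BOM is present, which would be consistent with the parser complaining at row 1, column 1.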


Cheers,

Pat

--

Pat Patterson | Community Champion | http://about.me/patpatterson

--
You received this message because you are subscribed to the Google Groups "sdc-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sdc-user+unsubscribe@streamsets.com.
Visit this group at https://groups.google.com/a/streamsets.com/group/sdc-user/.

Dan Flavin

Nov 7, 2017, 9:54:12 AM
to Pat Patterson, sdc-user

That was it: the file had the UTF-8 BOM (EF BB BF). The raw preview displayed OK, though, so I'm wondering if there is a warning about the BOM that I missed somewhere. I did have the file encoding set to UTF-8. Is there a recommended way to handle BOMs with StreamSets? Can I do some pre-processing in StreamSets to take out the BOM on the first line only? I'd prefer to keep an all-StreamSets solution rather than use sed, etc.
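One possible reason a BOM is easy to miss: when the bytes are decoded as plain UTF-8, the BOM survives as the character U+FEFF, which is zero-width and usually invisible on screen. The Python sketch below illustrates that behavior in general; it makes no claim about how Data Collector itself decodes the file, and the sample bytes and the helper name `decode_keeping_bom` are made up for the example.

```python
import codecs

# Sample bytes: a UTF-8 BOM followed by a small XML document (made up for illustration).
RAW = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?>\n<ordata/>\n'


def decode_keeping_bom(data):
    """Decode as plain UTF-8; a leading BOM stays in the text as U+FEFF."""
    return data.decode("utf-8")


def decode_stripping_bom(data):
    """Decode with Python's 'utf-8-sig' codec, which drops a leading BOM if present."""
    return data.decode("utf-8-sig")


kept = decode_keeping_bom(RAW)
stripped = decode_stripping_bom(RAW)

print(kept.startswith("\ufeff"))      # the invisible character is still the first one
print(stripped.startswith("<?xml"))   # the declaration is now the very first thing
```

With the BOM left in place, the XML declaration is no longer the first thing in the text, which matches the "Content is not allowed in prolog" error at [1,1].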

 

Thank you, Dan

 

 


Pat Patterson

Nov 7, 2017, 12:56:20 PM
to Dan Flavin, sdc-user
Hi Dan,

This is tricky, since the BOM needs to be removed before the data gets to the origin. The only all-StreamSets solution I can think of would be a new pipeline to just read the files in text format and do the replacement in Expression Evaluator, writing the output to a staging directory for your existing pipeline to read. Here's a pipeline I just created that does this:

Inline image 3

Note - I don't even use regular expressions to remove the BOM, since we know the offending line is at offset zero and we know exactly what the first line of the file should be, so we just use that verbatim. Here's the expression, so it's easy to copy/paste:

${(record:attribute('offset') == 0) ? '<?xml version="1.0" encoding="utf-8"?>' : record:value('/text')}
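For readers who want to see the logic outside the pipeline, here is a rough Python sketch of what the expression does. It is an analogy only, not StreamSets code: the `(offset, text)` tuples stand in for text records with their `offset` attribute and `/text` field, and the names `XML_DECLARATION` and `fix_first_line` are invented for the example.

```python
# The first line we expect every file to have, used verbatim as in the expression.
XML_DECLARATION = '<?xml version="1.0" encoding="utf-8"?>'


def fix_first_line(records):
    """Yield each line's text, replacing the record at offset 0 with the known declaration.

    `records` is an iterable of (offset, text) pairs; this shape is an assumption
    made for illustration, mirroring a text record's offset attribute and /text field.
    """
    for offset, text in records:
        yield XML_DECLARATION if offset == 0 else text


# Example input: the first line carries a BOM (U+FEFF); the offsets are illustrative.
sample = [
    (0, '\ufeff<?xml version="1.0" encoding="utf-8"?>'),
    (42, "<ordata>"),
    (51, '  <row Id="2" Id2="1" Count="7" />'),
    (86, "</ordata>"),
]

cleaned = list(fix_first_line(sample))
```

As in the expression, nothing is pattern-matched: the line at offset zero is overwritten wholesale, so this only works when every input file is known to start with exactly that declaration.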

Cheers,

Pat

--

Pat Patterson | Community Champion | http://about.me/patpatterson
