Beginner questions about EDIFACT processing

Axel Guckelsberger

Jun 13, 2023, 8:11:24 AM

to Smooks Users

Hello all,

I've been trying to do some EDIFACT conversions using Smooks 2.0.0-RC1. During that, I've collected some questions about unclear aspects, hoping you can shed some light on them for me.

First, let me briefly describe the use case I want to cover: I've created a tiny Spring Boot app that includes Smooks using Gradle. The app contains a controller that exposes some REST endpoints for receiving payloads that should be parsed into XML or unparsed again into the raw format. Based on parameters it should process different EDIFACT versions and message types.

1) Most parts of the Smooks documentation and examples use some kind of XML-based configuration. My feeling was that defining such resource configurations would not be flexible enough for the use case described above. Instead, I started with a Java-based approach using the EdifactReaderConfigurator class. Do you think this is an adequate foundation to proceed with, or should I take a deeper look at the XML-based definitions?

2) I've read in this user group that it is recommended to use different profiles. Can you please elaborate on the conceptual intentions of the mechanisms involved (profiles, sub-profiles, profile sets, etc.) and how these can/should be used to manage and distinguish different types of processing?
I've also observed that I currently need to disable the cacheOnDisk flag if I want to change the EDIFACT schema version. Possibly this can be resolved by introducing profiles as well?

3) I've managed to process the two example files for the EDIFACT d03b schema that are available on GitHub. When I try other messages, particularly those referencing an older schema (e.g. d96a), I get a lot of data inside a BadMessage tag. When it comes to validating and debugging what is going wrong:
- Is there some kind of documentation of how to read the debug output of Smooks? There are information bits like bitPosition, childIndex, foundDelimiter, etc. but I have no clue how this can help me.
- Is it also possible to retrieve the diagnostics from Daffodil?
- Can you recommend a validator for EDIFACT files which I could use manually in order to get some insights?
- Are there ways to find out programmatically if and at which line there is an issue in a given message?
- Is EDIFACT backwards compatible? If so, I can maybe rely on some specific versions instead of just trusting/using what is referenced in the message UNH segment.

4) One particular issue with older schema versions affects enumerations. It seems that the old schema versions (up to and including d96a) do not contain all enumeration values but only the first one for each enum. This looks like a bug. As a result, all messages that use other enum values are invalid. Can you confirm this?

Thanks in advance for your feedback.

Claude Mamo

Jun 21, 2023, 4:58:06 PM

to smook...@googlegroups.com

Hello Axel,

Thanks for these questions. We should start offering guidance in the docs around some of this stuff since it's not the first time I'm bumping into these.

1) Most parts of the Smooks documentation and examples use some kind of XML-based configuration. My feeling was that defining such resource configurations would not be flexible enough for the use case described above. Instead, I started with a Java-based approach using the EdifactReaderConfigurator class. Do you think this is an adequate foundation to proceed with, or should I take a deeper look at the XML-based definitions?

There's more than one way to skin a cat and your approach might be the most pragmatic one. I'd hesitate to recommend it as a "foundation" if Smooks will play a big role in your project, given that XML configs are typically easier to reason about than plain Java. One approach to consider is to have a template Smooks XML config that is materialised at runtime. You would then create a Smooks instance for each materialised config. It goes without saying that this won't scale unless you implement it alongside pooling.
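A minimal Java sketch of the template idea, for illustration only: a config template is materialised per EDIFACT version and the resulting instance is cached so that it is built once. The template contents (namespaces, the edifact:parser element, the schemaUri layout), the ${version} placeholder and the MaterialisedConfigCache class are assumptions rather than confirmed Smooks conventions, and a simple cache stands in for real pooling. The factory is left generic; in a real application it might build a Smooks instance from the materialised XML.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Caches one processing instance per materialised config, keyed by EDIFACT version. */
final class MaterialisedConfigCache<T> {

    // Hypothetical template: element names, namespaces and the schemaUri layout
    // are assumptions to verify against the Smooks EDIFACT cartridge docs.
    static final String TEMPLATE = String.join("\n",
        "<?xml version=\"1.0\"?>",
        "<smooks-resource-list xmlns=\"https://www.smooks.org/xsd/smooks-2.0.xsd\"",
        "                      xmlns:edifact=\"https://www.smooks.org/xsd/smooks/edifact-2.0.xsd\">",
        "    <edifact:parser schemaUri=\"/${version}/EDIFACT-Messages.dfdl.xsd\"/>",
        "</smooks-resource-list>");

    private final Map<String, T> instances = new ConcurrentHashMap<>();
    private final Function<String, T> factory;

    /** The factory turns a materialised XML config into a processing instance. */
    MaterialisedConfigCache(Function<String, T> factory) {
        this.factory = factory;
    }

    /** Fills the ${version} placeholder after checking the value looks like "d96a". */
    static String materialise(String version) {
        if (!version.matches("d\\d{2}[a-z]")) {
            throw new IllegalArgumentException("Unexpected EDIFACT version: " + version);
        }
        return TEMPLATE.replace("${version}", version);
    }

    /** Returns the cached instance for a version, building it on first use. */
    T forVersion(String version) {
        return instances.computeIfAbsent(version, v -> factory.apply(materialise(v)));
    }
}
```

Validating the version before substitution keeps request parameters from leaking arbitrary text into the config, which matters when the version comes from a REST endpoint.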

2) I've read in this user group that it is recommended to use different profiles. Can you please elaborate on the conceptual intentions of the mechanisms involved (profiles, sub-profiles, profile sets, etc.) and how these can/should be used to manage and distinguish different types of processing?
I've also observed that I currently need to disable the cacheOnDisk flag if I want to change the EDIFACT schema version. Possibly this can be resolved by introducing profiles as well?

I need to find an hour or two to properly document profiles unless we have volunteers in the community. A profile is a scope that is activated when the Smooks execution has its context set to use the profile (e.g., smooks.createExecutionContext("d97b")). You can have many profiles and you can even have profiles within profiles (i.e., sub-profiles). Resource configs such as readers can be made part of a profile by referencing the profile name in the targetProfile XML attribute. When a profile is active, the resource configs within the profile target events as per their selectors. On the other hand, when the profile is inactive, these same resource configs are excluded from the execution. There is an example illustrating how Smooks can be used to apply "profile" based transformations on a message. I can imagine having a profile for every version/message type of EDIFACT, though I'd generate these profiles if there are many versions and message types you need to support.
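To illustrate how a profile per version/message type could be selected at runtime, here is a small Java sketch that derives a profile name from the first UNH segment of a payload. It assumes the default EDIFACT delimiters (+ and :) and a naming convention such as "d96a" or "d96a-orders"; the ProfileNames class and the "-orders" suffix are hypothetical, and the profiles themselves would still have to exist in the Smooks config.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Derives profile names from the message identifier in the first UNH segment. */
final class ProfileNames {

    // Matches e.g. UNH+1+ORDERS:D:96A:UN' at the start of the input or right
    // after a segment terminator. Assumes the default delimiters + : and '.
    private static final Pattern UNH = Pattern.compile(
        "(?:^|')\\s*UNH\\+[^+']*\\+([A-Z0-9]{1,6}):([A-Z0-9]{1,3}):([A-Z0-9]{1,3})");

    /** Returns a version-level name such as "d96a". */
    static String versionProfile(String edifact) {
        Matcher m = firstUnh(edifact);
        return (m.group(2) + m.group(3)).toLowerCase(Locale.ROOT);
    }

    /** Returns a finer-grained, hypothetical name such as "d96a-orders". */
    static String messageProfile(String edifact) {
        Matcher m = firstUnh(edifact);
        return (m.group(2) + m.group(3) + "-" + m.group(1)).toLowerCase(Locale.ROOT);
    }

    private static Matcher firstUnh(String edifact) {
        Matcher m = UNH.matcher(edifact);
        if (!m.find()) {
            throw new IllegalArgumentException("No UNH segment found");
        }
        return m;
    }
}
```

The returned name could then be passed to smooks.createExecutionContext(...) in the same way as the "d97b" example above.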

Is there some kind of documentation of how to read the debug output of Smooks? There are information bits like bitPosition, childIndex, foundDelimiter, etc. but I have no clue how this can help me.

This is probably produced by Apache Daffodil. Could you post an example?

Is it also possible to retrieve the diagnostics from Daffodil?

Have you tried turning the debugging attribute on the parser/unparser to true?

- Are there ways to find out programmatically if and at which line there is an issue in a given message?

Not that I know of. I'm seeking to implement better error handling support since it has been asked for in the past. I know it's not ideal but for now my recommendation is to route bad documents to another application. I had created an example showing this. It's not specific to EDIFACT but you can easily adapt it to your use-case.

4) One particular issue with older schema versions affects enumerations. It seems that the old schema versions (up to and including d96a) do not contain all enumeration values but only the first one for each enum. This looks like a bug. As a result, all messages that use other enum values are invalid. Can you confirm this?

This issue was fixed a while back and it will be rolled out in RC2 which should go out in the next few days.

Claude

--
You received this message because you are subscribed to the Google Groups "Smooks Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to smooks-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/smooks-user/6228fb08-97a3-4eca-b90d-38697b788012n%40googlegroups.com.

Axel Guckelsberger

Jun 26, 2023, 6:07:03 AM

to Smooks Users

Hello Claude,

Many thanks for your answers.

At the moment we are running fine with the Java-based approach. I also enabled support for multiple EDIFACT versions including the slicing per document type by subclassing EdifactDataProcessorFactory and changing materialiseEntrySchema to my needs. Maybe we will re-introduce XML config and profile support if we need to support further transformation types.

This issue was fixed a while back and it will be rolled out in RC2 which should go out in the next few days.

Already updated, works like a charm now.

I noticed though that there are no Java bindings available for the four new schema versions (d20a, d20b, d21a, d21b). Is this intended or were they just overlooked?

Is there some kind of documentation of how to read the debug output of Smooks? There are information bits like bitPosition, childIndex, foundDelimiter, etc. but I have no clue how this can help me.

This is probably produced by Apache Daffodil. Could you post an example?

Here is a snippet cut out of the output:

diff:
bitPosition: 30528 -> 30536
childIndex: 484 -> 485
foundDelimiter: + -> (no value)
foundField: 1 -> (no value)
groupIndex: 1 -> 2
----------------------------------------------------------------- 1268
parser: <Element name='E0020'><DelimiterStackParser>...</DelimiterStackParser></Element>
bitPosition: 30536
data:
│ │
87654321 0011 2233 4455 6677 8899 aabb ccdd eeff 0123456789abcdef
00000ee0: 3127 0a55 4e5a 2b31 2b31 3127 0a 1'␊UNZ+1+11'␊
infoset:
<?xml version="1.0" encoding="UTF-8"?>
<UNZ>
<E0036>1</E0036>
<E0020></E0020>
</UNZ>
diff:
(no differences)
----------------------------------------------------------------- 1269
parser: <StringDelimitedParser/>
bitPosition: 30552
data:
├───┤ ├┤
87654321 0011 2233 4455 6677 8899 aabb ccdd eeff 0123456789abcdef
00000ee0: 0a55 4e5a 2b31 2b31 3127 0a ␊UNZ+1+11'␊
infoset:
<?xml version="1.0" encoding="UTF-8"?>
<UNZ>
<E0036>1</E0036>
<E0020>11</E0020>
</UNZ>
diff:
bitPosition: 30536 -> 30552
foundDelimiter: (no value) -> '␊
foundField: (no value) -> 11

- Are there ways to find out programmatically if and at which line there is an issue in a given message?

Not that I know of. I'm seeking to implement better error handling support since it has been asked for in the past. I know it's not ideal but for now my recommendation is to route bad documents to another application. I had created an example showing this. It's not specific to EDIFACT but you can easily adapt it to your use-case.

Until now I could distinguish the following error causes:

1) On the one hand there are documents which have an old EDIFACT syntax version. Daffodil supports syntax versions 3 and 4, but cannot process versions 1 and 2. This looks like a dead end, unless there is another solution for such documents (but I don't think so).

2) Second, there are documents that do not match their schema. These again break down into two groups:

2.1) Documents with minor problems, such as invalid enum values, can be processed as long as ValidationMode is switched to Off. The invalid values must then be handled subsequently.

2.2) Larger problems, like 20 FTX segments in a row instead of 5, cause the processing to jump to the BadMessage branch. I already had a case where a message declared version 96A and failed to convert, yet the conversion worked with 99B. Maybe a fallback approach would be to try other versions when the result is a BadMessage.
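The fallback idea from 2.2 could be sketched in Java roughly as follows. This is only an illustration under assumptions: the converter is a placeholder for whatever performs the Smooks conversion for a given version, the VersionFallback class is hypothetical, and detecting failure by searching the XML for a BadMessage element is a heuristic rather than an official API.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.regex.Pattern;

/** Tries a list of EDIFACT versions in order until one converts without BadMessage. */
final class VersionFallback {

    // Heuristic: an element named BadMessage, with or without a namespace prefix.
    private static final Pattern BAD_MESSAGE =
        Pattern.compile("<(?:[\\w.-]+:)?BadMessage[\\s/>]");

    /** A successful conversion together with the version that produced it. */
    static final class Result {
        final String version;
        final String xml;

        Result(String version, String xml) {
            this.version = version;
            this.xml = xml;
        }
    }

    /**
     * @param versions  candidates in order of preference, declared version first
     * @param converter placeholder taking (version, edifact) and returning XML
     */
    static Optional<Result> convert(String edifact, List<String> versions,
                                    BiFunction<String, String, String> converter) {
        for (String version : versions) {
            String xml = converter.apply(version, edifact);
            if (!BAD_MESSAGE.matcher(xml).find()) {
                return Optional.of(new Result(version, xml));
            }
        }
        return Optional.empty(); // every candidate produced a BadMessage
    }
}
```

Returning the version alongside the XML makes it possible to log when a message was only convertible with a version other than the one it declares.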

If anyone has any further insights on this topic, I would appreciate feedback.

Best regards,

Axel

Claude Mamo

Jul 22, 2023, 5:51:20 AM

to smook...@googlegroups.com

I noticed though that there are no Java bindings available for the four new schema versions (d20a, d20b, d21a, d21b). Is this intended or were they just overlooked?

Good catch! It was missed in PR 193. The bindings will be added in the next RC.

Here is a snippet cut out of the output:

Indeed, it's Daffodil output. It's documented here but admittedly it's not a lot to go on.
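One way to read the snippet, assuming bitPosition is a zero-based bit offset into the input: dividing by 8 gives a byte offset. This appears consistent with the hex dump, since 30528 / 8 = 3816 = 0xee8 (the + before 11), 30536 / 8 = 3817 = 0xee9 (the first digit of 11) and 30552 / 8 = 3819 = 0xeeb (the ' after 11). The Java sketch below turns such a position into a line and column in the original message; the BitPositions class is hypothetical and the column counts bytes, not characters.

```java
/** Maps a Daffodil-style bitPosition onto a line and column of the input. */
final class BitPositions {

    /** Converts a zero-based bit position into a zero-based byte offset. */
    static long byteOffset(long bitPosition) {
        return bitPosition / 8;
    }

    /** Returns {line, column}, both one-based, for a zero-based byte offset. */
    static int[] lineAndColumn(byte[] data, long byteOffset) {
        if (byteOffset < 0 || byteOffset > data.length) {
            throw new IllegalArgumentException("Offset outside input: " + byteOffset);
        }
        int line = 1;
        int column = 1;
        // Walk the bytes before the offset, starting a new line at each \n.
        for (int i = 0; i < byteOffset; i++) {
            if (data[i] == '\n') {
                line++;
                column = 1;
            } else {
                column++;
            }
        }
        return new int[] {line, column};
    }
}
```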

On the one hand there are documents which have an old EDIFACT syntax version. Daffodil supports syntax versions 3 and 4, but cannot process versions 1 and 2. This looks like a dead end, unless there is another solution for such documents (but I don't think so).

Perhaps you could modify the generated DFDL schemas to support these syntax versions?
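Until the schemas handle those syntax versions, such documents could at least be recognised up front and routed elsewhere. The Java sketch below reads the syntax version number from the UNB segment (for example 3 in UNB+UNOC:3+...), honouring the delimiters declared by an optional UNA segment. The SyntaxVersion class is hypothetical, and isSupported simply encodes the observation quoted above that versions 3 and 4 work.

```java
/** Reads the EDIFACT syntax version number from the interchange header. */
final class SyntaxVersion {

    /** Returns the syntax version declared in UNB, e.g. 3 for UNB+UNOC:3+... */
    static int detect(String edifact) {
        String s = edifact.trim();
        char component = ':';
        char element = '+';
        // An optional UNA segment declares the delimiters at fixed positions:
        // UNA, component separator, element separator, then four more characters.
        if (s.startsWith("UNA") && s.length() >= 9) {
            component = s.charAt(3);
            element = s.charAt(4);
            s = s.substring(9).trim();
        }
        if (!s.startsWith("UNB" + element)) {
            throw new IllegalArgumentException("No UNB segment found");
        }
        int componentIdx = s.indexOf(component, 4);
        int elementIdx = s.indexOf(element, 4);
        // The version must sit inside the first data element of UNB.
        if (componentIdx < 0 || (elementIdx >= 0 && elementIdx < componentIdx)) {
            throw new IllegalArgumentException("UNB has no syntax version number");
        }
        int end = componentIdx + 1;
        while (end < s.length() && Character.isDigit(s.charAt(end))) {
            end++;
        }
        if (end == componentIdx + 1) {
            throw new IllegalArgumentException("Syntax version is not numeric");
        }
        return Integer.parseInt(s.substring(componentIdx + 1, end));
    }

    /** Based on the observation in this thread that versions 3 and 4 are processed. */
    static boolean isSupported(int syntaxVersion) {
        return syntaxVersion == 3 || syntaxVersion == 4;
    }
}
```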

If anyone has any further insights on this topic, I would appreciate feedback.

I've posted a proposal to improve the EDIFACT cartridge's validation. Input is more than welcome.

Claude


Claude

Aug 18, 2023, 10:22:10 AM

to Smooks Users

I've given this further thought. A simple way to tolerate invalid documents is to use the general-purpose reader from the EDI cartridge (edi:parser). The reader will only complain when the EDI format is broken, so you can process EDIFACT documents that are well-formed but invalid. Granted, the EDI reader is not as dynamic as the EDIFACT one since you need to configure the delimiters ahead of time.

Claude

Aug 27, 2023, 8:26:43 AM

to Smooks Users

Following the proposal and a discussion with the Daffodil maintainers, there is a simple solution to problem 2.2. It requires a small change to the generated DFDL schemas and doesn't need anything fancy like Schematron. I've created an issue and the enhancement will be included in RC3. Note that BadMessage will still be produced for required segments (i.e., minOccurs=1 & maxOccurs=1) but I believe this isn't an issue in your case.

Claude

Aug 27, 2023, 8:30:27 AM

to Smooks Users

Linking here the Daffodil discussion: https://lists.apache.org/thread/fwm9l9z2mt6cw50xdfhyl0yjfmy476rn

Claude
