Supporting non-latin Java identifiers in element and attribute names

Hosam Aly

Feb 21, 2018, 9:51:27 AM
to scalaxb

This is a suggestion to change the behaviour of ScalaXB in a backwards-incompatible way. I'd like to collect your feedback about it.

Currently, when generating classes for XML entities, ScalaXB encodes every character outside the Latin word range (anything that doesn't match [a-zA-Z_0-9]) as its decimal character code prefixed with u. For example, an XML element named data-format results in a Scala class named Datau45format. Characters beyond 7-bit ASCII (e.g. extended ASCII or other Unicode characters) are treated in the same way.
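
To make the current behaviour concrete, here is a minimal sketch of the encoding as I described it above. The object and method names (CurrentEncoding, encode, className) are made up for illustration and are not ScalaXB's actual code; the sketch only reproduces the rule "u followed by the decimal character code".

```scala
// Hypothetical illustration of the encoding described above; not ScalaXB's real implementation.
object CurrentEncoding {
  // True for characters matching [a-zA-Z_0-9].
  private def isLatinWordChar(c: Char): Boolean =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '_'

  // Replace every other character with "u" followed by its decimal character code.
  def encode(name: String): String =
    name.map { c =>
      if (isLatinWordChar(c)) c.toString else "u" + c.toInt
    }.mkString

  // Class names start with an upper-case letter.
  def className(name: String): String = encode(name).capitalize
}
```

With this sketch, CurrentEncoding.className("data-format") gives Datau45format, because the hyphen has decimal code 45.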

I'd like to change this behaviour in multiple ways:
  • Characters that are acceptable in a Java identifier should remain as-is, because the resulting class or field name would be valid.
  • The four ASCII symbols that are acceptable in an XML identifier (:, ., -, and _) should be replaced with their names (colon, dot, hyphen, and underscore).
    • Additionally, I'd like to add an option to remove them from the generated class name, so data-format can result in the class name DataFormat.
  • Other characters should be encoded as U0000, where 0000 is replaced by the 4-digit hexadecimal Unicode code point of the character.
    • This makes it easier to find out which character it refers to by opening a REPL and typing '\u1234'.
    • The capital U makes it easily distinguishable from the previous word.
I have described and implemented some of these changes in this pull request.
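
As a rough sketch of the rules in the list above (this is not the code from the pull request), the proposed encoding could look like the following. The names ProposedEncoding, className and stripSymbols are hypothetical. The sketch also makes two assumptions the list does not settle: the four symbols are checked before the Java-identifier test, so _ is spelled out even though it is valid in a Java identifier, and the strip option capitalizes each remaining segment.

```scala
// Hypothetical sketch of the proposed rules; names and details are assumptions.
object ProposedEncoding {
  // Names for the four ASCII symbols that XML identifiers allow.
  private val symbolNames: Map[Char, String] =
    Map(':' -> "colon", '.' -> "dot", '-' -> "hyphen", '_' -> "underscore")

  // Assumption: symbols are checked first, then Java-identifier validity,
  // and everything else becomes "U" plus the 4-digit hexadecimal code point.
  private def encodeChar(c: Char): String =
    if (symbolNames.contains(c)) symbolNames(c)
    else if (Character.isJavaIdentifierPart(c)) c.toString
    else f"U${c.toInt}%04X"

  def className(name: String, stripSymbols: Boolean = false): String =
    if (stripSymbols)
      // Assumption: drop the symbols and capitalize each remaining segment.
      name.split("[:.\\-_]").filter(_.nonEmpty)
        .map(seg => seg.map(c => encodeChar(c)).mkString.capitalize)
        .mkString
    else
      name.map(c => encodeChar(c)).mkString.capitalize
}
```

Under these assumptions, ProposedEncoding.className("data-format") gives Datahyphenformat, and ProposedEncoding.className("data-format", stripSymbols = true) gives DataFormat. A name containing a space, such as "a b", becomes AU0020b, while non-Latin letters that Java accepts are left unchanged.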

These suggested changes are backwards-incompatible in that ScalaXB would generate different class names from the ones it generates today, so existing code that refers to the old names would need to be updated.

I'd like to gauge your interest in these suggestions.
Do you think these are good ideas?
Do you have any suggestions to improve them?
Would you mind the backwards-incompatibility?

Your feedback is appreciated.

Thank you,

Hosam Aly