metadata add exception: not well-formed (invalid token): line 101, column 23


Blake Fitch

Jul 9, 2024, 6:07:02 AM
to iRODS-Chat
iRODS v4.3.2

Access is via the Python client library. Python logging is turned on.

obj.metadata.add( attr_str, value_str )

This results in the exception: not well-formed (invalid token): line 101, column 23

The metadata does, however, seem to have been applied.

Any thoughts?

Best,
Blake

Kory Draughn

Jul 9, 2024, 8:31:39 AM
to irod...@googlegroups.com
Hi Blake,

What do attr_str and value_str hold?
How are the values generated?

Kory Draughn
Chief Technologist
iRODS Consortium



Blake Fitch

Jul 9, 2024, 9:19:50 AM
to iRODS-Chat
After digging in more, here's the scenario...

I have a bunch of metadata that I am trying to apply with: obj.metadata.apply_atomic_operations( *AVU_list )

The way this situation occurs is that the call fails because one of the values contains non-printable characters: SkopeCalX��# 2d1

The way the code used to work was: after a failure to apply an AVU list, the code would fall back to trying to apply each AVU individually using obj.metadata.add().

In that case, since this is rare, the attribute and value strings are run through a function that converts any 'bad' chars to '_' before calling obj.metadata.add(). The value string becomes: SkopeCalX______#_2d1  Nonetheless, that add, and in fact all others on the same object, would fail with the exception noted earlier.

Since then, I've noticed that the first call via apply_atomic_operations( *AVU_list ) did apply the metadata, but sent me down the repair path by raising an exception. What I just tried is intercepting and 'fixing' the 'bad' string before the initial call to apply_atomic_operations(). This makes the first call work.

So, it seems like while obj.metadata.apply_atomic_operations( *AVU_list ) is depositing all the metadata on the object, it is not returning cleanly. And then the later calls to obj.metadata.add() all fail with the invalid token exception.

Since I almost never (years now) get 'bad' chars in the metadata, I'd like to continue making the call without converting bytes outside 0x20-0x7E to underscores, but I guess if that's required, I can do it.
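
In rough pseudocode, the old flow was something like this (an illustrative sketch, not the production code; convert_non_ascii_to_underscore is the scrubbing function, and metadata_dict is the Python dict the AVU list is built from):

try:
    obj.metadata.apply_atomic_operations( *AVU_list )
except Exception:
    # Fallback: add each AVU individually, with 'bad' chars converted to '_'.
    for key, value in metadata_dict.items():
        obj.metadata.add( convert_non_ascii_to_underscore( key ),
                          convert_non_ascii_to_underscore( value ) )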

Thanks for considering this issue!

Kory Draughn

Jul 9, 2024, 9:55:29 AM
to irod...@googlegroups.com
Got it.

Can you post the error returned by the atomic metadata API?
Is there anything in the server's log file?

You may have uncovered an edge case which needs to be addressed in the API.

We'll look into reproducing the error.


Kory Draughn
Chief Technologist
iRODS Consortium

Blake Fitch

Jul 9, 2024, 11:20:24 AM
to iRODS-Chat

The error on the initial call to obj.metadata.apply_atomic_operations( *AVU_list ) is the same as the others:  ex: not well-formed (invalid token): line 101, column 23

Kory Draughn

Jul 9, 2024, 12:53:16 PM
to irod...@googlegroups.com
What database are you using for the catalog? PostgreSQL, MySQL 8?

Are there any messages in the server's log file related to the atomic metadata API?

Kory Draughn
Chief Technologist
iRODS Consortium

Blake Fitch

Jul 9, 2024, 2:52:38 PM
to iRODS-Chat
Postgres

Blake Fitch

Jul 9, 2024, 2:53:02 PM
to iRODS-Chat
If you Google that error, a bunch of other packages/apps come up.

Blake Fitch

Jul 10, 2024, 6:57:52 AM
to iRODS-Chat
Some more info...

The test system is running PostgreSQL 12.

There are no updates to files in /var/lib/irods/log, and I don't see errors in /var/log/postgresql/.

Here's the string dumped as a hex value.
string: SkopeCalX��# 2d1
hexvalue: 53:6b:6f:70:65:43:61:6c:58:c3:af:c2:bf:c2:bd:c3:af:c2:bf:c2:bd:23:01:32:64:31

BTW, this issue happens on our production system too. That system is running iRODS 4.2.10 with PostgreSQL 10.

Terrell Russell

Jul 10, 2024, 4:53:35 PM
to irod...@googlegroups.com
This is an XML parsing issue.

I've created an issue to track the discussion.


Terrell


dmoore.renci

Jul 11, 2024, 1:12:58 PM
to iRODS-Chat
Blake, 
  I second what Terrell has just said about the parsing.  That could potentially just be a peripheral issue and a matter of switching to the QUASI_XML parser.

  <digression>

  Another matter is  how unprintables should be interpreted when encountered in metadata.  In this application, iRODS is a pass-through to a database that (like the OS on which it sits) in most cases prefers Unicode to binary data. When naked binary data is presented to such a system, the usual interpretation of that is: here's a UTF-8 encoded byte string to be processed  and/or stored as Unicode.

  Looking at that string, it's easy to believe that's not what was intended.  Binary '\x01' won't occur in arbitrary encoded UTF-8 unless it's part of a multibyte encoding, and even then it won't be the first byte of that subsequence. Furthermore, some binary strings will generate errors on the way into iRODS, due to things in the input that look to the system like invalid embedded  UTF-8 encoding sequences....

  So we have at best an impedance mismatch, and at worst a potential integrity problem in using iRODS to store arbitrary byte strings.   Although it may work alright for some values, it will not *in the general case* preserve information authors' intent or even function error-free.

Perhaps a good answer is to convert values such as '\x01' into u'\u0001' on the way into iRODS, i.e. in general just map every non-printable character with an ordinal less than 256 to its corresponding Unicode code point, and (assuming Python 3) use the resulting string value directly in the process of AVU manipulation and retrieval. It's essentially forcing a particular interpretation, but it would result in a more stable system overall. Depending on the application, this could either be the fallback when Unicode decode errors are encountered, or perhaps the approach should be applied uniformly for consistency's and reliability's sake. Policy will no doubt, as usual, have to depend on the system owners.
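
As a concrete sketch of the idea (plain Python 3, nothing PRC-specific; the assumption here is that the raw value arrives as a bytes object):

def bytes_to_codepoints(raw: bytes) -> str:
    # Latin-1 maps each byte 0x00-0xFF to the Unicode code point with the same
    # ordinal, so b'\x01' becomes '\u0001' and b'\xc3' becomes '\u00c3'.
    return raw.decode('latin-1')

>>> bytes_to_codepoints(b'SkopeCalX#\x012d1')
'SkopeCalX#\x012d1'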

  Forgive the digression, but I thought there was the possibility that (parts of) it might help : )
 
Daniel Moore
Applications Engineer
iRODS Consortium, RENCI, UNC Chapel Hill

dmoore.renci

Jul 11, 2024, 1:37:10 PM
to iRODS-Chat
Blake,

Again, pardon. It does seem, from the binary dump that you gave us, that the value is validly encoded UTF-8, and as such, the embedded chr(1) does have utility in a Unicode sense.

>>> hexvalues
[83, 107, 111, 112, 101, 67, 97, 108, 88, 195, 175, 194, 191, 194, 189, 195, 175, 194, 191, 194, 189, 35, 1, 50, 100, 49]
>>> bytearray(_)
bytearray(b'SkopeCalX\xc3\xaf\xc2\xbf\xc2\xbd\xc3\xaf\xc2\xbf\xc2\xbd#\x012d1')
>>> _.decode('utf8')
'SkopeCalXï¿½ï¿½#\x012d1'

So, I'm guessing that this data was intended to represent Unicode. It will be significant to the iRODS client whether this is presented as a Python 3 string (natively Unicode) or a bytestring (natively octet-based). If the former in your case, then it should be produced via the decode() method, with argument 'utf8', rather than by mapping binary codes directly to produce the str object. b'...\x01...' will have a fundamentally different meaning than '...\u0001...'

- Dan

dmoore.renci

Jul 11, 2024, 1:41:52 PM
to iRODS-Chat
Or if not '\x01', then other codes will ...

$ python3 -c "print ('\u0001',end='')" |od -vtx1
0000000 01
0000001
$ python3 -c "print ('\u00ce',end='')" |od -vtx1
0000000 c3 8e
0000002

dmoore.renci

Jul 11, 2024, 1:45:12 PM
to iRODS-Chat
And to be complete: 
python -c "print ('\xce',end='')" |od -vtx1  
0000000 c3 8e
0000002

Blake Fitch

Jul 14, 2024, 9:50:35 AM
to iRODS-Chat
Thanks for giving the matter consideration!

I believe it's all just standard Python 3 strings, and I think that's also what the examples show. Here's some code and output.

Probably tens of thousands of strings have passed through this interface without breaking it until this one.

First, some debug output:

2024-07-14 13:18:23,077 [irods_utils.py: 207 - pydict_to_irods_avus() ] DEBUG: Create AVU OP list. ipath: /MRDataZone/home/mri_etl_s94t/echtdata/studies/42/experiments/EPAB-5ULJ/TWIX/allData#S94Tuebingen#F11217#M340#D120624#T110143#SkopeCalib.dat.part.2024-07-14_13:18:22.626622Z key: SequenceName type(key): <class 'str'> value: SkopeCalX��# 2d1 type(value): <class 'str'>
2024-07-14 13:18:23,077 [irods_utils.py: 223 - pydict_to_irods_avus() ] WARNING: Modify metadata value to contain only ascii chars. new: SkopeCalX______#_2d1 old: SkopeCalX��# 2d1 ipath: /MRDataZone/home/mri_etl_s94t/echtdata/studies/42/experiments/EPAB-5ULJ/TWIX/allData#S94Tuebingen#F11217#M340#D120624#T110143#SkopeCalib.dat.part.2024-07-14_13:18:22.626622Z
2024-07-14 13:18:23,077 [irods_utils.py: 225 - pydict_to_irods_avus() ] WARNING: old hexvalue: 53:6b:6f:70:65:43:61:6c:58:c3:af:c2:bf:c2:bd:c3:af:c2:bf:c2:bd:23:01:32:64:31

This is the code that produces those lines (it is currently removing whatever isn't ASCII, though I'd prefer going back to feeding in whatever I'm handed by the application):

nonprintable_regex_compiled = re.compile( r'[^\x20-\x7E]' )
def convert_non_ascii_to_underscore(s):
     return nonprintable_regex_compiled.sub( '_', s )

# This ignores the AVU units field.
def pydict_to_irods_avus( coll_or_obj, metadata_dict ):

    AVU_list = []

    # Load the dict into a list of AVU ops
    for key, value in metadata_dict.items():
        logging.debug( "Create AVU OP list. ipath: " + coll_or_obj.path
                     + " key: " + key + " type(key): " + str(type(key))
                     + " value: " + value + " type(value): " + str(type(value)) )

        if value == "":
            value = " "

        k = convert_non_ascii_to_underscore(key)
        if k != key:
            logging.warning( f"Modify metadata attribute to contain only ascii chars. new: {k} old: {key} ipath: {coll_or_obj.path}" )
            hexkey = ":".join("{:02x}".format(c) for c in key.encode())
            logging.warning( f"old hexkey: {hexkey}" )
            key = k

        v = convert_non_ascii_to_underscore(value)
        if v != value:
            logging.warning( f"Modify metadata value to contain only ascii chars. new: {v} old: {value} ipath: {coll_or_obj.path}" )
            hexvalue = ":".join("{:02x}".format(c) for c in value.encode())
            logging.warning( f"old hexvalue: {hexvalue}" )
            value = v

        AVU_list.append( AVUOperation( operation='add', avu=iRODSMeta( key, value ) ) )

    worked = False
    try:
        coll_or_obj.metadata.apply_atomic_operations( *AVU_list )
        worked = True
        logging.debug( f"Succeeded in adding AVU_list" )
    except Exception as ex:
        logging.error( "Applying list of AVUs as atomic op to collection or object failed. (Will try individual AVUs)"
                     + " ex: " + str(ex)

... ... ...                                                                                           

dmoore.renci

Jul 17, 2024, 8:27:31 AM
to iRODS-Chat
hi Blake,

  I need to ask one more thing: Were there practical concerns, other than the dysfunctional STANDARD_XML parser, for eliminating unprintables from your metadata?  I can imagine if you were regularly printing out such strings to the terminal, for example, it might be a problem!  You certainly wouldn't want, say, ANSI control sequences to be buried in them.

Where the parser issue is concerned, we're on a more serious tack to address it in the future. Right now we've opened #582:


The path we are on at the moment would provide users, in the near term, with more options for controlling which XML parser is in force for a particular section of application code, within the current thread. If the standard protocol parser "blowing up" were the only concern, this might help your application avoid the explosive cases! But I'd understand at this point if you did not want to make QUASI_XML the default for your application just for the purpose of allowing these unprintable characters.
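
For example, the client already honors a process-wide default via the PYTHON_IRODSCLIENT_DEFAULT_XML environment variable; treat the in-code variant below as a sketch, since the setting must be in place before the client is first imported:

# In the shell, before starting the application:
#   export PYTHON_IRODSCLIENT_DEFAULT_XML=QUASI_XML
# or from Python, before the first irods import:
import os
os.environ['PYTHON_IRODSCLIENT_DEFAULT_XML'] = 'QUASI_XML'
import irods.session   # the parser default is chosen when the client loads

The per-section, per-thread switching is what the new issue would add on top of that.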

We are projecting that by python-irodsclient 3.0 we'll have a more robust default parser (either we'll have put QUASI_XML through its paces and removed any holes in its performance we might find, or we'll have something even better). And there will be more thorough testing around that default parser, as a matter of course!

At the moment however, it does seem you have a good solution, one that at least preserves the great majority of the metadata in your application.

We're happy to help in any way we can in the meantime.

Daniel Moore
Applications Engineer
iRODS Consortium, RENCI, UNC Chapel Hill

Terrell Russell

Jul 17, 2024, 8:39:40 AM
to irod...@googlegroups.com
We've opened https://github.com/irods/python-irodsclient/issues/586 (xml_mode) to grant a more local option to control which parser is in use.
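
Once that lands, per-section control would look roughly like this (names and signature follow the proposal in #586 and may differ in the released client):

from irods.helpers import xml_mode

with xml_mode('QUASI_XML'):
    coll_or_obj.metadata.apply_atomic_operations( *AVU_list )
# outside the block, the previously configured parser is restored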

Terrell


Blake Fitch

Jul 18, 2024, 9:44:33 AM
to iRODS-Chat
Thanks for the update!

I met with the scientific content providers and they were not too enthused about the idea of scrubbing all metadata down to ASCII. Something about "this is Europe" and umlauts and such. So wanton scrubbing to ASCII is out.

As I mentioned, once the bad string gets into the metadata, even removing the metadata and reapplying scrubbed metadata after the exception doesn't seem to work.

The new plan, whenever we hit the error, is to copy the data object that has the bad metadata, remove the original, and apply ASCII-scrubbed metadata to the copy. Since this almost never happens, I hope this solves the problem with little overhead or disruption for now.
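
In PRC terms, the plan is roughly this (a sketch only; the path and the scrubbed dict are illustrative, and 'session' is the existing iRODSSession):

# Copy the object, drop the original with its stuck metadata, then apply the
# scrubbed AVUs to the copy (the copy starts with no AVUs of its own).
session.data_objects.copy( bad_path, bad_path + '.recovered' )
session.data_objects.unlink( bad_path, force=True )
new_obj = session.data_objects.get( bad_path + '.recovered' )
for key, value in scrubbed_metadata.items():
    new_obj.metadata.add( key, value )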

Terrell Russell

Jul 18, 2024, 9:54:17 AM
to irod...@googlegroups.com
Yes, ASCII is a bit harsh - we won't be suggesting that as the 'solution'.   But valid UTF-8... should be more palatable.

Glad you have a workaround for now - and of course, we'll provide additional tooling with #586.

Terrell


Blake Fitch

Jul 18, 2024, 3:38:33 PM
to iRODS-Chat
Didn't someone famous say: God gave us ASCII, all the rest is the work of men ...
... or was it EBCDIC?

;-)