metadata add exception: not well-formed (invalid token): line 101, column 23


Blake Fitch

Jul 9, 2024, 6:07:02 AM
to iRODS-Chat
iRODS v4.3.2

Access is via the Python client library. Python logging is turned on.

obj.metadata.add( attr_str, value_str )

This results in the exception: not well-formed (invalid token): line 101, column 23

The metadata does, however, seem to have been applied.

Any thoughts?

Best,
Blake

Kory Draughn

Jul 9, 2024, 8:31:39 AM
to irod...@googlegroups.com
Hi Blake,

What do attr_str and value_str hold?
How are the values generated?

Kory Draughn
Chief Technologist
iRODS Consortium



Blake Fitch

Jul 9, 2024, 9:19:50 AM
to iRODS-Chat
After digging in more, here's the scenario...

I have a bunch of metadata that I am trying to apply with: obj.metadata.apply_atomic_operations( *AVU_list )

The way this situation occurs is that the call fails because one of the values contains non-printable characters: SkopeCalX��# 2d1

The way the code used to work was: after a failure to apply an AVU list, the code would fall back to trying to apply each AVU individually using obj.metadata.add().

In that case, since this is rare, the attribute and value strings are run through a function that converts any 'bad' chars to '_' before calling obj.metadata.add(). The value string becomes: SkopeCalX______#_2d1  Nonetheless, that add, and in fact all others on the same object, would fail with the exception noted earlier.

Since then, I've noticed that the first call via apply_atomic_operations( *AVU_list ) did apply the metadata, but sent me down the repair path by raising an exception. What I just tried is intercepting and 'fixing' the 'bad' string before the initial call to apply_atomic_operations(). This makes the first call work.

So, it seems like while obj.metadata.apply_atomic_operations( *AVU_list ) is depositing all the metadata on the object, it is not returning cleanly. And then the later calls to obj.metadata.add() all fail with the invalid token exception.

Since I almost never (years now) get 'bad' chars in the metadata, I'd like to continue making the call without converting bytes outside 0x20-0x7E to underscores, but I guess if that's required, I can do it.
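
In rough pseudocode, the old flow was something like this (an illustrative sketch, not the production code; convert_non_ascii_to_underscore is the scrubbing function, and metadata_dict is the Python dict the AVU list is built from):

try:
    obj.metadata.apply_atomic_operations( *AVU_list )
except Exception:
    # Fallback: add each AVU individually, with 'bad' chars converted to '_'.
    for key, value in metadata_dict.items():
        obj.metadata.add( convert_non_ascii_to_underscore( key ),
                          convert_non_ascii_to_underscore( value ) )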

Thanks for considering this issue!

Kory Draughn

Jul 9, 2024, 9:55:29 AM
to irod...@googlegroups.com
Got it.

Can you post the error returned by the atomic metadata API?
Is there anything in the server's log file?

You may have uncovered an edge case which needs to be addressed in the API.

We'll look into reproducing the error.


Kory Draughn
Chief Technologist
iRODS Consortium

Blake Fitch

Jul 9, 2024, 11:20:24 AM
to iRODS-Chat

The error on the initial call to obj.metadata.apply_atomic_operations( *AVU_list ) is the same as the others:  ex: not well-formed (invalid token): line 101, column 23

Kory Draughn

Jul 9, 2024, 12:53:16 PM
to irod...@googlegroups.com
What database are you using for the catalog? PostgreSQL, MySQL 8?

Are there any messages in the server's log file related to the atomic metadata API?

Kory Draughn
Chief Technologist
iRODS Consortium

Blake Fitch

Jul 9, 2024, 2:52:38 PM
to iRODS-Chat
Postgres

Blake Fitch

Jul 9, 2024, 2:53:02 PM
to iRODS-Chat
If you Google that error, a bunch of other packages/apps come up.

Blake Fitch

Jul 10, 2024, 6:57:52 AM
to iRODS-Chat
Some more info...

The test system is running PostgreSQL 12.

There are no updates to files in /var/lib/irods/log, and I don't see errors in /var/log/postgresql/.

Here's the string dumped as a hex value.
string: SkopeCalX��# 2d1
hexvalue: 53:6b:6f:70:65:43:61:6c:58:c3:af:c2:bf:c2:bd:c3:af:c2:bf:c2:bd:23:01:32:64:31

BTW, this issue happens on our production system too. That system is running iRODS 4.2.10 with PostgreSQL 10.

Terrell Russell

Jul 10, 2024, 4:53:35 PM
to irod...@googlegroups.com
This is an XML parsing issue.

I've created an issue to track the discussion.


Terrell


dmoore.renci

Jul 11, 2024, 1:12:58 PM
to iRODS-Chat
Blake, 
  I second what Terrell has just said about the parsing.  That could potentially just be a peripheral issue and a matter of switching to the QUASI_XML parser.

  <digression>

  Another matter is  how unprintables should be interpreted when encountered in metadata.  In this application, iRODS is a pass-through to a database that (like the OS on which it sits) in most cases prefers Unicode to binary data. When naked binary data is presented to such a system, the usual interpretation of that is: here's a UTF-8 encoded byte string to be processed  and/or stored as Unicode.

  Looking at that string, it's easy to believe that's not what was intended.  Binary '\x01' won't occur in arbitrary encoded UTF-8 unless it's part of a multibyte encoding, and even then it won't be the first byte of that subsequence. Furthermore, some binary strings will generate errors on the way into iRODS, due to things in the input that look to the system like invalid embedded  UTF-8 encoding sequences....

  So we have at best an impedance mismatch, and at worst a potential integrity problem in using iRODS to store arbitrary byte strings.   Although it may work alright for some values, it will not *in the general case* preserve information authors' intent or even function error-free.

Perhaps a good answer is to convert values such as '\x01' into u'\u0001' on the way into iRODS, i.e. in general just map every non-printable character with an ordinal less than 256 to its corresponding Unicode code point, and (assuming Python 3) use the resulting string value directly in the process of AVU manipulation and retrieval. It's essentially forcing a particular interpretation, but it would result in a more stable system overall. Depending on the application, this could either be the fallback when Unicode decode errors are encountered, or perhaps the approach should be applied uniformly for consistency's and reliability's sake. Policy will no doubt, as usual, have to depend on the system owners.
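
As a concrete sketch of the idea (plain Python 3, nothing PRC-specific; the assumption here is that the raw value arrives as a bytes object):

def bytes_to_codepoints(raw: bytes) -> str:
    # Latin-1 maps each byte 0x00-0xFF to the Unicode code point with the same
    # ordinal, so b'\x01' becomes '\u0001' and b'\xc3' becomes '\u00c3'.
    return raw.decode('latin-1')

>>> bytes_to_codepoints(b'SkopeCalX#\x012d1')
'SkopeCalX#\x012d1'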

  Forgive the digression, but I thought there was the possibility that (parts of) it might help : )
 
Daniel Moore
Applications Engineer
iRODS Consortium, RENCI, UNC Chapel Hill

dmoore.renci

Jul 11, 2024, 1:37:10 PM
to iRODS-Chat
Blake,

Again, pardon. It does seem, from the binary dump that you gave us, that the value is validly encoded UTF-8, and as such, the embedded chr(1) does have utility in a Unicode sense.

>>> hexvalues
[83, 107, 111, 112, 101, 67, 97, 108, 88, 195, 175, 194, 191, 194, 189, 195, 175, 194, 191, 194, 189, 35, 1, 50, 100, 49]
>>> bytearray(_)
bytearray(b'SkopeCalX\xc3\xaf\xc2\xbf\xc2\xbd\xc3\xaf\xc2\xbf\xc2\xbd#\x012d1')
>>> _.decode('utf8')
'SkopeCalXï¿½ï¿½#\x012d1'

So, I'm guessing that this data was intended to represent Unicode. It will be significant to the iRODS client whether this is presented as a Python 3 string (natively Unicode) or a bytestring (natively octet-based). If the former in your case, then it should be produced via the decode() method, with argument 'utf8', rather than by mapping binary codes directly to produce the str object. b'...\x01...' will have a fundamentally different meaning than '...\u0001...'

- Dan

dmoore.renci

Jul 11, 2024, 1:41:52 PM
to iRODS-Chat
Or if not '\x01', then other codes will ...

$ python3 -c "print ('\u0001',end='')" |od -vtx1
0000000 01
0000001
$ python3 -c "print ('\u00ce',end='')" |od -vtx1
0000000 c3 8e
0000002

dmoore.renci

Jul 11, 2024, 1:45:12 PM
to iRODS-Chat
And to be complete: 
python -c "print ('\xce',end='')" |od -vtx1  
0000000 c3 8e
0000002

Blake Fitch

Jul 14, 2024, 9:50:35 AM
to iRODS-Chat
Thanks for giving the matter consideration!

I believe it's all just standard Python 3 strings, and I think that's also what the examples show. Here's some code and output.

Probably tens of thousands of strings have passed through this interface without breaking it until this one.

First, some debug output:

2024-07-14 13:18:23,077 [irods_utils.py: 207 - pydict_to_irods_avus() ] DEBUG: Create AVU OP list. ipath: /MRDataZone/home/mri_etl_s94t/echtdata/studies/42/experiments/EPAB-5ULJ/TWIX/allData#S94Tuebingen#F11217#M340#D120624#T110143#SkopeCalib.dat.part.2024-07-14_13:18:22.626622Z key: SequenceName type(key): <class 'str'> value: SkopeCalX��# 2d1 type(value): <class 'str'>
2024-07-14 13:18:23,077 [irods_utils.py: 223 - pydict_to_irods_avus() ] WARNING: Modify metadata value to contain only ascii chars. new: SkopeCalX______#_2d1 old: SkopeCalX��# 2d1 ipath: /MRDataZone/home/mri_etl_s94t/echtdata/studies/42/experiments/EPAB-5ULJ/TWIX/allData#S94Tuebingen#F11217#M340#D120624#T110143#SkopeCalib.dat.part.2024-07-14_13:18:22.626622Z
2024-07-14 13:18:23,077 [irods_utils.py: 225 - pydict_to_irods_avus() ] WARNING: old hexvalue: 53:6b:6f:70:65:43:61:6c:58:c3:af:c2:bf:c2:bd:c3:af:c2:bf:c2:bd:23:01:32:64:31

This is the code that produces those lines (it is currently removing whatever isn't ASCII, though I'd prefer going back to feeding in whatever I'm handed by the application):

nonprintable_regex_compiled = re.compile( r'[^\x20-\x7E]' )
def convert_non_ascii_to_underscore(s):
     return nonprintable_regex_compiled.sub( '_', s )

# This ignores the AVU units field.
def pydict_to_irods_avus( coll_or_obj, metadata_dict ):

    AVU_list = []

    # Load the dict into a list of AVU ops
    for key, value in metadata_dict.items():
        logging.debug( "Create AVU OP list. ipath: " + coll_or_obj.path
                     + " key: " + key + " type(key): " + str(type(key))
                     + " value: " + value + " type(value): " + str(type(value)) )

        if value == "":
            value = " "

        k = convert_non_ascii_to_underscore(key)
        if k != key:
            logging.warning( f"Modify metadata attribute to contain only ascii chars. new: {k} old: {key} ipath: {coll_or_obj.path}" )
            hexkey = ":".join("{:02x}".format(c) for c in key.encode())
            logging.warning( f"old hexkey: {hexkey}" )
            key = k

        v = convert_non_ascii_to_underscore(value)
        if v != value:
            logging.warning( f"Modify metadata value to contain only ascii chars. new: {v} old: {value} ipath: {coll_or_obj.path}" )
            hexvalue = ":".join("{:02x}".format(c) for c in value.encode())
            logging.warning( f"old hexvalue: {hexvalue}" )
            value = v

        AVU_list.append( AVUOperation( operation='add', avu=iRODSMeta( key, value ) ) )

    worked = False
    try:
        coll_or_obj.metadata.apply_atomic_operations( *AVU_list )
        worked = True
        logging.debug( f"Succeeded in adding AVU_list" )
    except Exception as ex:
        logging.error( "Applying list of AVUs as atomic op to collection or object failed. (Will try individual AVUs)"
                     + " ex: " + str(ex)

... ... ...                                                                                           

dmoore.renci

Jul 17, 2024, 8:27:31 AM
to iRODS-Chat
hi Blake,

  I need to ask one more thing: Were there practical concerns, other than the dysfunctional STANDARD_XML parser, for eliminating unprintables from your metadata?  I can imagine if you were regularly printing out such strings to the terminal, for example, it might be a problem!  You certainly wouldn't want, say, ANSI control sequences to be buried in them.

Where the parser issue is concerned, we're on a more serious tack to address it in the future. Right now we've opened #582:


The path we are on at the moment would provide users, in the near term, with more options for controlling which XML parser is in force for a particular section of application code, within the current thread. If the standard protocol parser "blowing up" were the only concern, this might help your application avoid the explosive cases! But I'd understand at this point if you did not want to make QUASI_XML the default for your application just for the purpose of allowing these unprintable characters.
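
For example, the client already honors a process-wide default via the PYTHON_IRODSCLIENT_DEFAULT_XML environment variable; treat the in-code variant below as a sketch, since the setting must be in place before the client is first imported:

# In the shell, before starting the application:
#   export PYTHON_IRODSCLIENT_DEFAULT_XML=QUASI_XML
# or from Python, before the first irods import:
import os
os.environ['PYTHON_IRODSCLIENT_DEFAULT_XML'] = 'QUASI_XML'
import irods.session   # the parser default is chosen when the client loads

The per-section, per-thread switching is what the new issue would add on top of that.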

We are projecting that by python-irodsclient 3.0 we'll have a more robust default parser (either we'll have put QUASI_XML through its paces and removed any holes in its performance we might find, or we'll have something even better). And there will be more thorough testing around that default parser, as a matter of course!

At the moment however, it does seem you have a good solution, one that at least preserves the great majority of the metadata in your application.

We're happy to help in any way we can in the meantime.

Daniel Moore
Applications Engineer
iRODS Consortium, RENCI, UNC Chapel Hill

Terrell Russell

Jul 17, 2024, 8:39:40 AM
to irod...@googlegroups.com
We've opened https://github.com/irods/python-irodsclient/issues/586 (xml_mode) to grant a more local option to control which parser is in use.
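
Once that lands, per-section control would look roughly like this (names and signature follow the proposal in #586 and may differ in the released client):

from irods.helpers import xml_mode

with xml_mode('QUASI_XML'):
    coll_or_obj.metadata.apply_atomic_operations( *AVU_list )
# outside the block, the previously configured parser is restored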

Terrell


Blake Fitch

Jul 18, 2024, 9:44:33 AM
to iRODS-Chat
Thanks for the update!

I met with the scientific content providers and they were not too enthused about the idea of scrubbing all metadata down to ASCII. Something about "this is Europe" and umlauts and such. So wanton scrubbing to ASCII is out.

As I mentioned, once the bad string gets into the metadata, even removing the metadata and reapplying scrubbed metadata after the exception doesn't seem to work.

The new plan, whenever we hit the error, is to copy the data object that has the bad metadata, remove the original, and apply ASCII-scrubbed metadata to the copy. Since this almost never happens, I hope this solves the problem with little overhead or disruption for now.
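
In PRC terms, the plan is roughly this (a sketch only; the path and the scrubbed dict are illustrative, and 'session' is the existing iRODSSession):

# Copy the object, drop the original with its stuck metadata, then apply the
# scrubbed AVUs to the copy (the copy starts with no AVUs of its own).
session.data_objects.copy( bad_path, bad_path + '.recovered' )
session.data_objects.unlink( bad_path, force=True )
new_obj = session.data_objects.get( bad_path + '.recovered' )
for key, value in scrubbed_metadata.items():
    new_obj.metadata.add( key, value )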

Terrell Russell

Jul 18, 2024, 9:54:17 AM
to irod...@googlegroups.com
Yes, ASCII is a bit harsh - we won't be suggesting that as the 'solution'.   But valid UTF-8... should be more palatable.

Glad you have a workaround for now - and of course, we'll provide additional tooling with #586.

Terrell


Blake Fitch

Jul 18, 2024, 3:38:33 PM
to iRODS-Chat
Didn't someone famous say: God gave us ASCII, all the rest is the work of men ...
... or was it EBCDIC?

;-)