Problem with Non-breaking Spaces

Alistair Windsor

Nov 4, 2023, 4:39:56 PM

to Suite of automatic linguistic analysis tools

Dear All,

I am getting errors running the Python 2.7 version of TAASSC 1.3.8 on a corpus:

Traceback (most recent call last):
  File "/Users/awindsor/miniforge3/envs/py27/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/Users/awindsor/miniforge3/envs/py27/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "TAASSC_1.3.8.py", line 1417, in main
    data_file_2.write(phrase_const_string +"\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 337: ordinal not in range(128)

Now u'\xa0' is the Unicode non-breaking space, which lies outside ASCII. My first thought was that it was present in my corpus, despite my care to normalize Unicode and ignore the non-ASCII portions; however, a careful check revealed that there are no non-ASCII characters in the processed corpus.
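For anyone wanting to repeat that kind of check, here is a minimal sketch. The helper name find_non_ascii and the sample string are made up for illustration and are not part of TAASSC; the code assumes the text has already been decoded to a unicode string.

```python
from __future__ import print_function


def find_non_ascii(text):
    """Return (position, character) pairs for every character outside ASCII."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


# Made-up sample containing a non-breaking space (U+00A0) after "555".
sample = u"Call 555\u00a0123 4567 today"

# An empty list means the text is pure ASCII.
print(find_non_ascii(sample))
print(find_non_ascii(u"plain ascii text"))
```

Running this over each processed file gives the position of any stray character, which makes it easier to tell whether it came from the corpus or from a later processing step.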

Apparently, CoreNLP introduces these characters as a way of joining single tokens that contain a space (such as phone numbers), but the Python code of TAASSC 1.3.8 chokes on them. I am trying to trace through the code to find the right place to intervene.
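The failure can be reproduced outside TAASSC. In Python 2, writing a unicode string to a file opened with the built-in open() implicitly encodes it as ASCII, so any token containing u'\xa0' raises UnicodeEncodeError. The sketch below shows the same encoding step explicitly, along with one possible workaround (swapping the non-breaking space for a plain space). The helper names and the sample token are hypothetical, and whether replacing the character is acceptable depends on how the output is used downstream.

```python
from __future__ import print_function

NBSP = u"\u00a0"  # the non-breaking space reported in the traceback


def ascii_encodable(text):
    """Return True if text survives an ASCII encode, False otherwise."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def replace_nbsp(text):
    """Swap every non-breaking space for an ordinary space."""
    return text.replace(NBSP, u" ")


# Made-up token of the kind described above: one token with an internal space.
token = u"555" + NBSP + u"1234"

print(ascii_encodable(token))                # the raw token fails the ASCII encode
print(ascii_encodable(replace_nbsp(token)))  # the cleaned token passes
```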

Can anyone help? Has anyone faced this issue before? Python 2's handling of Unicode was problematic.

Yours,

Alistair

Alistair Windsor

Nov 5, 2023, 8:28:19 AM

to Suite of automatic linguistic analysis tools

It appears that adding an encode('utf8') call on line 1417 solves the issue:

                            data_file_2.write(phrase_const_string.encode('utf8') + "\n")
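An alternative that avoids encoding at every write call is to open the output file with io.open and an explicit encoding, which behaves the same way in Python 2.7 and Python 3. This is only a sketch: the function name write_lines_utf8 and the output path are hypothetical, I have not checked how TAASSC actually opens data_file_2, and the lines passed in must be unicode strings.

```python
from __future__ import print_function

import io
import os
import tempfile


def write_lines_utf8(path, lines):
    """Write unicode lines to path, encoding them as UTF-8 on the way out."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + u"\n")


# Hypothetical output location and a made-up line containing U+00A0.
out_path = os.path.join(tempfile.mkdtemp(), "phrase_const.txt")
write_lines_utf8(out_path, [u"555\u00a0123 4567"])

# Read the file back to confirm the non-breaking space survived the round trip.
with io.open(out_path, encoding="utf-8") as handle:
    round_trip = handle.read()
print(repr(round_trip))
```

With the file opened this way, the write call can take the unicode string directly, so there is no need to remember an encode at each call site.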

Yours,

Alistair
