Problem with Non-breaking Spaces


Alistair Windsor

Nov 4, 2023, 4:39:56 PM
to Suite of automatic linguistic analysis tools
Dear All, 

I am getting errors running the Python 2.7 version of TAASSC 1.3.8 on a corpus:

Traceback (most recent call last):
  File "/Users/awindsor/miniforge3/envs/py27/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/Users/awindsor/miniforge3/envs/py27/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "TAASSC_1.3.8.py", line 1417, in main
    data_file_2.write(phrase_const_string +"\n")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 337: ordinal not in range(128)

Now u'\xa0' is the Unicode non-breaking space, which is not an ASCII character. My first thought was that it was present in my corpus, despite my care to normalize the Unicode and ignore the non-ASCII portions. However, a careful check revealed that there are no non-ASCII characters in the processed corpus.
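For anyone who wants to repeat that kind of check, here is a minimal sketch. The function name `find_non_ascii` and the sample strings are hypothetical and not part of TAASSC; in practice the text would be read from each corpus file.

```python
def find_non_ascii(text):
    """Return (index, character) pairs for every non-ASCII character in text."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]

# Hypothetical sample strings standing in for corpus text.
clean = u"Call 555 1234 now"
joined = u"555\xa01234"  # contains a non-breaking space at index 3

print(find_non_ascii(clean))
print(find_non_ascii(joined))
```

An empty list means the text is pure ASCII; any entry gives the position and the offending character.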

Apparently, CoreNLP introduces these characters as a way of joining single tokens that contain a space (such as phone numbers), but the Python code of TAASSC 1.3.8 chokes on them. I am trying to trace through the code to find the right place to intervene.
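The failure can be reproduced in isolation, without CoreNLP or TAASSC. The sketch below assumes a hypothetical token containing a non-breaking space; encoding it as ASCII raises the same kind of error as in the traceback, while encoding it as UTF-8 succeeds.

```python
# Hypothetical token of the kind described above: two parts joined by U+00A0.
token = u"555\xa01234"

message = None
try:
    # The ASCII codec only covers code points 0-127, so U+00A0 (160) fails.
    token.encode("ascii")
except UnicodeEncodeError as err:
    message = str(err)
    print(message)

# UTF-8 can represent every Unicode code point, so this succeeds.
encoded = token.encode("utf-8")
print(repr(encoded))
```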

Can anyone help? Has anyone faced this issue before? Python 2's handling of Unicode was problematic.

Yours, 

Alistair




Alistair Windsor

Nov 5, 2023, 8:28:19 AM
to Suite of automatic linguistic analysis tools
It appears that adding .encode('utf8') to the string written on line 1417 solves the issue:

data_file_2.write(phrase_const_string.encode('utf8') + "\n")
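An alternative that avoids encoding at every write call is to open the output file with an explicit encoding. This is only a sketch: I have not checked how TAASSC actually opens data_file_2, and the sample string and temporary path below are hypothetical. `io.open` accepts an `encoding` argument in both Python 2.7 and Python 3, but in Python 2 everything written to such a file must be unicode, hence the u"\n".

```python
import io
import os
import tempfile

# Hypothetical parse string containing a non-breaking space.
phrase_const_string = u"(NP (CD 555\xa01234))"

# Hypothetical output location; TAASSC's real output path will differ.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# The file object encodes to UTF-8 itself, so no .encode() call is needed.
with io.open(path, "w", encoding="utf-8") as data_file_2:
    data_file_2.write(phrase_const_string + u"\n")

# Read the file back to confirm the non-breaking space survived.
with io.open(path, "r", encoding="utf-8") as check:
    round_trip = check.read()
```

The trade-off is that every other write to the same file would also need to pass unicode strings under Python 2, so the one-line fix above is the smaller change.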

Yours, 

Alistair
