Displaying and Searching for Unicode Private Use Area Characters in AntConc 3.5.9 and 4.3.1

Mark Faulkner

Aug 23, 2024, 5:43:49 AM

to AntConc-Discussion

I am currently working on a project that makes extensive use of the Private Use Area of Unicode to record different letter forms found in medieval texts.

I have tried opening *.txt files (encoded in UTF-8) derived from this project in both AntConc 3.5.9 and 4.3.1 and changing the font to Junicode, the font we are using.

In AntConc 4.3.1, it is possible to get the text to display properly:

However, while it is possible to search for mainstream Unicode characters (such as ‘m’ here), it is not possible to search for PUA ones (pasting the character into the search bar returns no hits), nor for some characters in other parts of Unicode (e.g. in the second line of the concordance, the 7-shaped symbol is the Tironian nota (U+204A), which is in the General Punctuation block of Unicode, but searching for it returns no hits).

In AntConc 3.5.9, many of the PUA code points decompose into multiple katakana characters:

For instance, in row 14 the word after the hit is ‘mid’, with the d being PUA U+10A00C in Junicode. However, here it has become U+00F4 + U+FF8A + U+FF80 + U+FF8C. On the other hand, in this version of AntConc it is possible to search for the Tironian nota and get results.
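
The four code points reported for 3.5.9 are consistent with the character's UTF-8 bytes being handled one at a time instead of being decoded as a single code point. A minimal Python sketch of that observation follows; it only inspects the encoding, and how AntConc 3.5.9 arrives at those particular code points is not established here.

```python
# U+10A00C lies in Supplementary Private Use Area-B (U+100000..U+10FFFD).
pua_d = chr(0x10A00C)

# Its UTF-8 encoding is four bytes long.
utf8_bytes = pua_d.encode("utf-8")
print(utf8_bytes.hex(" ").upper())  # F4 8A 80 8C

# Code points seen in AntConc 3.5.9, as reported in the post.
observed = [0x00F4, 0xFF8A, 0xFF80, 0xFF8C]

# The low byte of each observed code point equals the corresponding UTF-8 byte.
low_bytes = bytes(cp & 0xFF for cp in observed)
print(low_bytes == utf8_bytes)  # True
```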

Eventually, the corpus project I am involved with will probably need its own front end, but in the short to medium term being able to use AntConc to do corpus analysis of the data would be a significant boon, so if there is a way to resolve the issue in either 3.5.9 or 4.3.1, I would be very happy to hear it.

Thanks,

Mark

Laurence Anthony

Aug 26, 2024, 7:34:09 AM

to AntConc-Discussion

Hi Mark,

Let's focus on AntConc 4x, which has much better support for Unicode than the old 3x version.

The default setting for AntConc uses the 'Letter' class of Unicode characters. If you are using characters beyond this, you simply need to append them to the default set in the Token Definition settings before you create your corpus. You do this in the Corpus Manager. If you paste all the special characters into the append box, everything should work as expected.
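
One way to collect the characters to paste into the append box is to scan the corpus text for characters whose Unicode general category is not a Letter category. The following is a minimal Python sketch, not part of AntConc. The function name `non_letter_chars` and the sample string are made up for illustration, and it assumes that the 'Letter' class corresponds to the Unicode general categories beginning with `L`.

```python
import unicodedata

def non_letter_chars(text):
    """Return the sorted, de-duplicated non-letter, non-whitespace characters in text."""
    found = set()
    for ch in text:
        if ch.isspace():
            continue
        # General categories starting with "L" (Lu, Ll, Lt, Lm, Lo) are letters.
        if not unicodedata.category(ch).startswith("L"):
            found.add(ch)
    return sorted(found)

# Hypothetical sample: 'mi' + PUA U+10A00C, the Tironian nota U+204A,
# and 'e' followed by the combining acute accent U+0301.
sample = "mi\U0010A00C \u204A e\u0301"

# List each candidate with its code point and general category.
for ch in non_letter_chars(sample):
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)}")

# A single string that could be pasted into the append box.
append_string = "".join(non_letter_chars(sample))
```

On a real corpus this will also list ordinary punctuation and digits, which may not be wanted as token characters, so the list is best reviewed before pasting.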

Can you try it and let me know if this solves your problem?

Laurence.

Mark Faulkner

Aug 27, 2024, 3:50:59 AM

to AntConc-Discussion

Dear Laurence,

Thanks very much for your reply.

Preliminary testing with a subset of our special characters suggests your solution works very well (including making it possible to search for combining characters like accents on their own, which is particularly helpful!), so thank you. Apologies that I did not know about this feature of AntConc 4. I'll let you know if a full test of our (evolving) character set identifies any further issues.

Thanks again for all your work with AntConc: it is a marvellous tool.

Mark

Laurence Anthony

Aug 27, 2024, 3:52:40 AM

to ant...@googlegroups.com

Hi Mark,

That's great to hear. I might also suggest that you activate the 'marks' option in the token definition settings, as some of your characters may use them.
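
Assuming the 'marks' option corresponds to the Unicode Mark categories (Mn, Mc, Me), a short Python sketch shows why it can matter: an accented letter stored in decomposed form is a base letter followed by a combining character that is not in the Letter class. The helper name `is_mark` is hypothetical.

```python
import unicodedata

precomposed = "\u00E9"  # e-acute as a single code point
# NFD splits it into 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", precomposed)

for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

def is_mark(ch):
    """True for any character in a Mark category (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

print([is_mark(ch) for ch in decomposed])  # [False, True]
```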

Regards,

Laurence.

###############################################################
Laurence ANTHONY, Ph.D.
Professor of Applied Linguistics
Faculty of Science and Engineering
Waseda University
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
E-mail: antho...@gmail.com
WWW: http://www.laurenceanthony.net/
###############################################################
