Re: [chinese mac] Delete repeated Chinese characters


Jens Østergaard Petersen

Oct 8, 2019, 8:01:15 AM
to chine...@googlegroups.com
The funny thing about these repetitions is that the first of them is a Kangxi radical, the second the corresponding CJK character. If real duplicate characters were involved, I would simply grep for "(\p{Han})\1" or "(\p{InCJKUnifiedIdeographs})\1". I thought that hunting for the radicals would be possible with "\p{InCJKRadicalsSupplement}", but not so. However, you can catch the same span with "[\x{2E80}-\x{2FFF}]".

Last time I checked, you couldn't do this in Word. I would save the document in RTF format and search using Nisus Writer's unparalleled PowerFind.

Jens

On 7 October 2019 at 16.49.17, Christopher Cullen (christoph...@nri.cam.ac.uk) wrote:

I have a Word document with Chinese text produced by OCR of a scanned Chinese text.

The scan is largely accurate - except that for some reason the text 'stutters', meaning that some single characters in the scanned text appear repeated in the OCR text.

So instead of, for instance   用之,并写成后给与小的们阅看。

I  have : ⽤用之,并写成后给与⼩小的们阅看。

Is there a suitable macro or other means of eliminating these repeats by replacing each pair of repeated characters with a single character?   I have a long text to process, so I need something that does the job at least semi-automatically.

--
--
You received this message because you are subscribed to the Chinese Mac group.
For answers to frequently-asked questions, visit http://www.chinesemac.org
To start a new topic, send a new message to chine...@googlegroups.com
To unsubscribe, send a message to chinesemac-...@googlegroups.com
For more options, visit http://groups.google.com/group/chinesemac
---
You received this message because you are subscribed to the Google Groups "Chinese Mac" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chinesemac+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/chinesemac/c100de14-cd8a-4651-9026-9b9435297eb2%40googlegroups.com.

Christopher Cullen

Oct 10, 2019, 7:31:48 AM
to Chinese Mac
Thanks for responding. I am afraid I don't really understand the technical details in your post, but I am currently asking Nisus to explain to me how to use PowerFind to do this job. I already have Nisus Writer Pro.



TenThousandThings

Oct 10, 2019, 8:24:13 AM
to Chinese Mac
Hi Christopher,

I'll try to explain, in case it will help. Unicode has two sets of alternative characters for the Kangxi radicals: one is the "Kangxi Radicals" block and the other is the "CJK Radicals Supplement" block. The Kangxi Radicals block is not for general use, and is largely just there for legacy encoding purposes. The CJK Radicals Supplement has specific uses, but those are not relevant here.

Jens is a little off, in that in the two cases you sent us, the first character is from the Kangxi Radicals block and the second is from the "CJK Unified Ideographs" block; the latter is the correct one to use for your text. The CJK Radicals Supplement block is not involved.

So what you need to do to your text is search for and remove any character that is from the Kangxi Radicals block of Unicode. The character range is 2F00-2FD5.
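
For anyone who can get the text out of Word as plain text, the removal described above might be sketched in Python roughly as follows. This is only an illustration, not a tested workflow: the function name is made up here, and the sample string is built from the two pairs in the original question (U+2F64/U+7528 and U+2F29/U+5C0F).

```python
import re

# Kangxi Radicals range as given in the post above (U+2F00 to U+2FD5).
KANGXI_RADICALS = re.compile("[\u2F00-\u2FD5]")

def strip_kangxi_radicals(text: str) -> str:
    """Delete every character that falls in the Kangxi Radicals range."""
    return KANGXI_RADICALS.sub("", text)

# U+2F64 and U+2F29 are the radical forms; U+7528 and U+5C0F are the
# ordinary CJK Unified Ideographs that follow them in the OCR output.
sample = "\u2F64\u7528\u4E4B\u2F29\u5C0F"
cleaned = strip_kangxi_radicals(sample)
print(cleaned)
```

Note that this deletes the radicals blindly. It assumes each one is followed by its ordinary twin; a radical standing alone would simply vanish, so checking a sample first seems wise.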

The only way to do this in Word is one by one, just using Search (for the Kangxi Radicals block characters) and Replace All (with nothing) 214 times. Use the Character Viewer (access it via "Emoji & Symbols") to input the Unicode block characters directly in macOS.

I don't have time now, and the Nisus people should be able to help you, but you could also use a text editor to do this using regular expressions. If you are stymied, write back and I can help you do it that way.

Eric

Christopher Cullen

Oct 10, 2019, 9:42:06 AM
to chine...@googlegroups.com
Dear Eric,

Thanks for taking the time to write to me.

Unfortunately, things are more complicated than my sample indicated.  Both the repeated characters there (小, 用) are Kangxi radicals, but that was just the accident of choice.   There are other examples that are not in the radical list, so I hope that it will be possible to find a way to search for any pair of repeated characters (which occurs very rarely in original texts of the kind I am dealing with), and restore such instances to a single character.

I can see, however, that the fact that Unicode treats the same glyph differently when it represents a radical and when it represents a CJK character in the general list will pose problems!

Yours,

Christopher


TenThousandThings

Oct 10, 2019, 5:18:23 PM
to Chinese Mac
Searching for repeated characters is not hard, and anyone with a working knowledge of regular expressions could do that for you. Probably there is even a way to do that in Word using wildcards.

Unfortunately, your problem is that these are NOT repeated characters. From the perspective of the machine, that is. They share the same glyph, but they are at different code points. A search for repeated characters won't find these pairs.

I suspect the other cases you mention are similar (i.e., NOT repeated characters), even though they are not Kangxi Radicals or CJK Radicals Supplement. There are other duplicate glyphs in Unicode, including the CJK Compatibility Ideographs block (F900-FA6D, which must be handled carefully because it has twelve stray non-duplicates sprinkled in) and others.

Copy and paste a few of these other (non-Kangxi Radicals) examples, and we can confirm their origins. You can look them up yourself by copying the characters into Character Viewer, which will show you the code point. I couldn't figure out how to do it directly in Word. Any good text editor will do it, probably Nisus, too.
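
If a Python interpreter is at hand, the same lookup can be approximated with the standard `unicodedata` module. The sketch below is an assumption about how one might do it, not a description of what Character Viewer does; `describe` is a hypothetical helper name.

```python
import unicodedata

def describe(text: str) -> list:
    """Return one 'U+XXXX  NAME' line per character in the text."""
    lines = []
    for ch in text:
        # Fall back to a placeholder for code points without a name.
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X}  {name}")
    return lines

# The first pair from the original question: radical form, then ordinary form.
for line in describe("\u2F64\u7528"):
    print(line)
```

The character names make the block obvious at a glance: Kangxi radicals are named "KANGXI RADICAL ...", while ordinary characters are named "CJK UNIFIED IDEOGRAPH-...".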

I'm sure it's possible to write a script using regular expressions to do what you need done, but it would take some work to generate the list/range of code points to search for and remove. Can you easily move the text into a text editor?

If you have access to the original OCR scan materials, it might be easier to fix the problem in settings. Or use a different OCR application.

If that's not possible, then you'll need that list/range of code points to search for and remove. With that, Nisus could probably do it for you. Or you could just slog through using Search and Replace All to remove the problem code points as you find them (don't assume it will always be the first of the two characters -- that's likely true for the radicals but probably not for the CJK Compatibility Ideographs).
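
One way to avoid building the list of code points by hand, and to cope with the stray character coming either first or second, might be to compare neighbours after Unicode compatibility normalization. The following is a rough sketch under several assumptions: it uses NFKC from Python's `unicodedata`, the helper names are invented for illustration, and `is_han` covers only the main CJK Unified Ideographs block and Extension A.

```python
import unicodedata

def is_han(ch: str) -> bool:
    """True for CJK Unified Ideographs and Extension A (a partial, assumed range)."""
    return "\u4E00" <= ch <= "\u9FFF" or "\u3400" <= ch <= "\u4DBF"

def collapse_lookalike_pairs(text: str) -> str:
    """Collapse two adjacent, different code points that normalize to the same ideograph."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt and ch != nxt:
            a = unicodedata.normalize("NFKC", ch)
            b = unicodedata.normalize("NFKC", nxt)
            if a == b and len(a) == 1 and is_han(a):
                out.append(a)   # keep the ordinary form once
                i += 2          # skip both members of the pair
                continue
        out.append(ch)
        i += 1
    return "".join(out)

print(collapse_lookalike_pairs("\u2F64\u7528\u4E4B"))
```

Because the two characters must differ as code points, genuine reduplications in the original text (two identical ordinary characters in a row) are left alone.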

Eric


TenThousandThings

Oct 10, 2019, 5:38:59 PM
to Chinese Mac
Also, it occurs to me -- Wenlin -- put the text into Wenlin. Then you can just wave the cursor over the text and see the character info.

Wenlin's search and replace can include regular expressions. I've never used it, but I think it's quite powerful, using the PCRE (Perl) library. See Wenlin Help.

ER

Kerim Friedman

Oct 10, 2019, 11:56:05 PM
to chine...@googlegroups.com
Wondering about some other approaches. I tried converting the text from Chinese to Chinese with Google translate, but it didn't help. However, another option might be to save the text as a high quality PDF and then run OCR on it? Then the identical characters will be identical and you should be able to use simpler tools to de-duplicate it? (Haven't tested this theory yet...)

K



--


Associate Professor
The Department of Ethnic Relations and Cultures
College of Indigenous Studies
National DongHwa University, TAIWAN
教授國立東華大學族群關係與文化學系


Kerim Friedman

Oct 10, 2019, 11:56:59 PM
to chine...@googlegroups.com
By the way, this Unicode search tool makes the problem clear. It shows the Unicode value of each character in the pasted text in a way that is easy to see.


K

Jens Østergaard Petersen

Oct 11, 2019, 5:19:34 AM
to chine...@googlegroups.com
In Nisus Writer, you would open up Find & Replace and select PowerFind Pro (regex). If you paste in "(\p{Han})\1" or "(\p{InCJKUnifiedIdeographs})\1", you should find repeated CJK characters. Obviously, some of these may be legit, so you have to go through them one by one. In Replace with, you write "\1" to get only the first character. If you wish to find all radicals (and remove them), you should enter "[\x{2E80}-\x{2FD5}]". That would give you the CJK radicals (2E80-2EF3) as well as the Kangxi radicals (2F00-2FD5). Probably you can delete all of these, but you had better check a sample first. If the document holds no formatting worth preserving, you can save the Word document as a text document and do the same find & replace in a text editor such as BBEdit.
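
For readers working outside Nisus Writer, the back-reference search for true duplicates could be sketched in Python along these lines. Python's built-in `re` module does not support `\p{Han}`, so the sketch substitutes an explicit range covering only the main CJK Unified Ideographs block and Extension A; that substitution, and the function names, are assumptions made for illustration.

```python
import re

# Assumed stand-in for \p{Han}: CJK Unified Ideographs plus Extension A only.
HAN = "[\u3400-\u4DBF\u4E00-\u9FFF]"
REPEATED_HAN = re.compile(f"({HAN})\\1")

def find_repeats(text: str) -> list:
    """List (offset, character) for each doubled character, for manual review."""
    return [(m.start(), m.group(1)) for m in REPEATED_HAN.finditer(text)]

def collapse_repeats(text: str) -> str:
    """Replace each doubled character with a single copy, like '\\1' in Replace with."""
    return REPEATED_HAN.sub(r"\1", text)

print(find_repeats("\u4E4B\u5C0F\u5C0F"))
print(collapse_repeats("\u7528\u7528\u4E4B"))
```

Listing the matches before replacing mirrors the advice to go through them one by one, since some doubled characters are legitimate. A radical followed by its ordinary twin is not matched by this pattern, because the two are different code points.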

I am not clear whether your document contains CJK Compatibility Ideographs, as Eric writes, but if it does, I would recommend using UnicodeChecker <https://earthlingsoft.net/UnicodeChecker/>. This will install a number of services in your Services menu. Select the whole text (in Nisus Writer or in a text editor) and select the service Convert to Unicode Normalization Form C. This will convert the compatibility forms to their regular equivalents.
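
The effect of normalization can be checked on single characters with Python's standard `unicodedata` module. The sketch below is an illustration only and says nothing about how UnicodeChecker is implemented. It suggests that Form C handles the CJK Compatibility Ideographs but leaves Kangxi radicals untouched; the radicals only fold into ordinary characters under the compatibility form NFKC.

```python
import unicodedata

compat_ideograph = "\uF900"   # CJK COMPATIBILITY IDEOGRAPH-F900
kangxi_radical = "\u2F64"     # KANGXI RADICAL USE

# Form C maps the compatibility ideograph to its unified equivalent.
nfc_compat = unicodedata.normalize("NFC", compat_ideograph)

# Form C leaves the Kangxi radical as it is; NFKC replaces it.
nfc_radical = unicodedata.normalize("NFC", kangxi_radical)
nfkc_radical = unicodedata.normalize("NFKC", kangxi_radical)

for label, ch in [("NFC compat", nfc_compat),
                  ("NFC radical", nfc_radical),
                  ("NFKC radical", nfkc_radical)]:
    print(f"{label}: U+{ord(ch):04X}")
```

After NFKC, a radical-plus-character pair becomes a true duplicate, which a back-reference search can then find. NFKC also folds other compatibility characters such as fullwidth punctuation, so it is worth trying on a copy of the text first.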

Jens

Christopher Cullen

Oct 11, 2019, 5:57:06 AM
to chine...@googlegroups.com
I am very grateful indeed to everybody who has offered help so far.

I have been in touch with Nisus support, who have set me up to use PowerFind to find and replace repeats - and after experimenting with that, it seems able to do a lot of what I want.  

For information, I do have high quality scans of this material in the form of a PDF document. I have OCR’d this using PDFPen Pro. The ‘repeat characters’ situation that results from that seems to depend quite heavily on the means by which I extract the Chinese text from the OCR layer of the PDF. While I have noticed this, I am not in a position to give a systematic account of exactly what path produces what, because (as one does) I have tried a number of ways to get through this problem and have not kept a ‘lab notebook’ style record of everything I have tried.

I’ll report back when I have finished the job, in the hope that my experience may be useful to others.

Christopher Cullen



Nobumi Iyanaga

Oct 11, 2019, 8:50:01 AM
to chine...@googlegroups.com
Hello,

My suggestion would be to use ABBYY FineReader to do the OCR. I tried PDFPen (for Japanese), but the result was not good at all. If the original PDF is of good quality, then ABBYY FineReader certainly does a much better job...

Best regards,

Nobumi Iyanaga

Kerim Friedman

Oct 11, 2019, 11:48:00 AM
to chine...@googlegroups.com
I second the recommendation for FineReader. Nothing has worked as well for me except for Microsoft’s cloud OCR which I use via Prizmo on my iPhone. I don’t know if there are any desktop apps that use this API?

K


---

P. Kerim Friedman 傅可恩
Associate Professor
The Dept. of Ethnic Relations & Cultures

College of Indigenous Studies
National DongHwa University, TAIWAN
副教授國立東華大學族群關係與文化學系