On 7 October 2019 at 16.49.17, Christopher Cullen (christoph...@nri.cam.ac.uk) wrote:
I have a Word document with Chinese text produced by OCR of a scanned Chinese text.
The scan is largely accurate - except that for some reason the text 'stutters', meaning that some single characters in the scanned text appear repeated in the OCR text.
So instead of, for instance 用之,并写成后给与小的们阅看。
I have : ⽤用之,并写成后给与⼩小的们阅看。
Is there a suitable macro or other means of eliminating these repeats by replacing each pair of repeated characters with a single character? I have a long text to process, so I need something that does the job at least semi-automatically.
--
You received this message because you are subscribed to the Chinese Mac group.
For answers to frequently-asked questions, visit http://www.chinesemac.org
To start a new topic, send a new message to chine...@googlegroups.com
To unsubscribe, send a message to chinesemac-...@googlegroups.com
For more options, visit http://groups.google.com/group/chinesemac
---
You received this message because you are subscribed to the Google Groups "Chinese Mac" group.
To unsubscribe from this group and stop receiving emails from it, send an email to chinesemac+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/chinesemac/c100de14-cd8a-4651-9026-9b9435297eb2%40googlegroups.com.
The funny thing about these repetitions is that the first of them is a Kangxi radical, the second the corresponding CJK character. If real duplicate characters were involved, I would simply grep for "(\p{Han})\1" or "(\p{InCJKUnifiedIdeographs})\1". I thought that hunting for the radicals would be possible with "\p{InCJKRadicalsSupplement}", but not so. However, you can catch them with the explicit range "[\x{2E80}-\x{2FFF}]", which also covers the Kangxi Radicals block (U+2F00-U+2FDF).
Last time I checked, you couldn’t do this in Word. I would save the document in RTF format and search using Nisus Writer’s unparalleled PowerFind.
Jens
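For anyone who would rather script this than use a word processor, here is a minimal Python sketch of the same idea. It is not something proposed in this thread. The function name `drop_radical_stutter` is made up for illustration, and the sketch assumes the document has been exported to plain text, that the stray first character always lies in the Kangxi Radicals block, and that NFKC normalization maps each Kangxi radical to its ordinary CJK character.

```python
import re
import unicodedata

# A Kangxi radical (block U+2F00-U+2FDF) followed by any character.
# The lookahead captures the next character without consuming it.
RADICAL = re.compile(r"[\u2F00-\u2FDF](?=(.))", re.DOTALL)


def drop_radical_stutter(text):
    """Drop a Kangxi radical when the next character is its ordinary equivalent."""

    def fix(match):
        radical = match.group(0)
        following = match.group(1)
        # NFKC folds a Kangxi radical into the corresponding unified ideograph.
        if unicodedata.normalize("NFKC", radical) == following:
            return ""       # remove the radical, keep the ordinary character
        return radical      # not a stutter: leave the radical untouched

    return RADICAL.sub(fix, text)


if __name__ == "__main__":
    # U+2F64 (Kangxi radical "use") + U+7528 + U+4E4B, as at the start of the sample.
    sample = "\u2F64\u7528\u4E4B"
    print(drop_radical_stutter(sample))  # prints the two characters U+7528 U+4E4B
```

A radical that is not followed by its own equivalent is left alone, so the result can still be checked by eye for anything unexpected.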
Dear Eric,
Thanks for taking the time to write to me.
Unfortunately, things are more complicated than my sample indicated. Both the repeated characters there (小, 用) are Kangxi radicals, but that was just the accident of choice. There are other examples that are not in the radical list, so I hope that it will be possible to find a way to search for any pair of repeated characters (which occurs very rarely in original texts of the kind I am dealing with), and restore such instances to a single character.
I can see, however, that the fact that Unicode treats the same glyph differently when it represents a radical and when it represents a CJK character in the general list will pose problems!
Yours,
Christopher
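One way to handle both kinds of repeat at once can be sketched in Python as follows. This is only an illustration under stated assumptions, not a tested solution for the actual document: the name `collapse_stutter` is hypothetical, the text is assumed to be available as plain text, the Han ranges cover only the main CJK Unified Ideographs block and Extension A, and every Kangxi radical in the OCR output is assumed to be a misrecognized ordinary character, so all of them are folded before pairs are collapsed.

```python
import re
import unicodedata

# Assigned Kangxi radicals (U+2F00-U+2FD5).
KANGXI = re.compile(r"[\u2F00-\u2FD5]")
# The same Han character twice in a row (Extension A and the main unified block).
HAN_PAIR = re.compile(r"([\u3400-\u4DBF\u4E00-\u9FFF])\1")


def collapse_stutter(text):
    """Return (cleaned text, list of characters whose doubling was collapsed)."""
    # Step 1: fold each Kangxi radical into its ordinary character via NFKC,
    # so that a radical/character pair becomes a plain repeated pair.
    folded = KANGXI.sub(
        lambda m: unicodedata.normalize("NFKC", m.group(0)), text
    )

    collapsed = []

    def fix(match):
        collapsed.append(match.group(1))  # record the character for later review
        return match.group(1)             # keep a single copy

    # Step 2: replace every doubled Han character with one copy.
    return HAN_PAIR.sub(fix, folded), collapsed


if __name__ == "__main__":
    # Radical U+2F64 + U+7528 + U+4E4B, then a genuinely doubled U+7684.
    cleaned, changed = collapse_stutter("\u2F64\u7528\u4E4B\u7684\u7684")
    print(cleaned)
    print(changed)
```

Because genuine reduplication does occur in Chinese, even if rarely in texts of this kind, the returned list of collapsed characters is meant for checking by eye, which keeps the process semi-automatic rather than blind.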