Eliminate Chinese Text

Kim Mosley

Mar 21, 2023, 12:08:46 PM

to BBEdit Talk

What would be the grep pattern to either eliminate Chinese text or select English text? The document is 230 pages.

河大地（悟後十方空）。
Case: A monk asked Master Langya Jiao, "Purity is originally so--how does it suddenly produce
mountains, rivers, and the great earth?"(When deluded, the world exists.) Langya said, "Purity is
originally so--how does it suddenly produce mountains, rivers, and the great earth?"(After enlightenment,
everywhere is void.)
229
師云。汾陽無德昭禪師。北地苦寒。因罷夜參。梵僧乘雲而至勸。不可失時。此眾雖不多。六人
大器。道廕人天。陽明日上堂云。胡僧金錫光。為法到汾陽。六人成大器。勸請為敷揚。時大愚
芝。慈明圓。瑯琊覺。法華舉。天勝泰。石霜永等。皆在席下。滁州瑯琊山。開化廣照禪師。諱
慧覺。西洛人。父為衡陽太守。捐館。扶襯歸洛。過澧州。登藥山古剎瞻禮。觀其游處。宛若舊
居。緣此出家。得法于汾陽。應緣滁水。與雪竇明覺。同時唱道。天下指為二甘露門。逮今淮南
遺化如昔。湖南祇林和尚。纔見僧來便云。魔來魔來。以木劍揮之。潛入方丈。如是十二年。後
置劍無言。有僧問。十二年前為甚麼降魔。林云。賊不打貧兒家。僧云。十二年後為甚麼不降魔
。林云。賊不打貧兒家。此名一劍下分身之意。首楞嚴第四。富樓那問。若復世間一切根塵。陰
處界等。皆如來藏。清淨本然。云何忽生山河大地諸有為相。次第遷流。終而復始。說者云。若
解則已知。覺體本妙。無明本空。山河大地。如空花相。若惑則能所妄分。強覺俄起。三細為世
。四輪成界。瑯琊云。我則不然。清淨本然。云何忽生山河大地。此喚騎賊馬赶賊。奪賊槍殺賊
。薦福信云。先行不到。末後太過。萬松道。徐六檐板。各見一邊。要除見滲漏。須見天童始得
。

Kaveh Bazargan

Mar 21, 2023, 12:24:03 PM

to bbe...@googlegroups.com

Maybe a start:

^[^A-ž0-9]+$

This matches whole lines that contain none of the following:

Roman letters (including diacritics)
digits
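
For readers who want to try the idea outside BBEdit, here is a minimal sketch in Python. It assumes the pattern is applied one line at a time; the function name `drop_non_roman_lines` and the sample string are illustrative only and not part of the original post.

```python
import re

# Python rendering of the pattern from the post. The range A-ž spans
# U+0041 to U+017E, so it also covers some ASCII punctuation such as [ ] _ `.
NO_ROMAN_OR_DIGIT = re.compile(r"^[^A-ž0-9]+$")


def drop_non_roman_lines(text: str) -> str:
    """Remove every line that has no Roman letter and no digit (hypothetical helper)."""
    kept = [
        line
        for line in text.splitlines()
        if not NO_ROMAN_OR_DIGIT.match(line)
    ]
    return "\n".join(kept)


sample = "河大地（悟後十方空）。\neverywhere is void.)\n229\n師云。汾陽無德昭禪師。"
print(drop_non_roman_lines(sample))
```

Empty lines are kept because the pattern requires at least one character. A line made up only of punctuation or spaces, such as "--", has no letters or digits either, so it is removed along with the Chinese lines.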

--
This is the BBEdit Talk public discussion group. If you have a feature request or need technical support, please email "sup...@barebones.com" rather than posting here. Follow @bbedit on Twitter: <https://twitter.com/bbedit>
---
You received this message because you are subscribed to the Google Groups "BBEdit Talk" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bbedit+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bbedit/5bc239e8-c8df-498b-bae0-94792def4447n%40googlegroups.com.

--

Kaveh Bazargan PhD

Director

River Valley Technologies ● Twitter ● LinkedIn ● ORCID ● @kave...@mastodon.social

Accelerating the Communication of Research

Fletcher Sandbeck

Mar 21, 2023, 12:24:52 PM

to bbe...@googlegroups.com

You can find a Unicode character using \x{####} and then use that pattern to build up a range. The following pattern finds all characters outside the range U+0000 to U+0400, which covers Latin characters with the most common extensions and punctuation. You can tighten or loosen it based on the ranges listed at https://jrgraphix.net/r/Unicode/.

Find: [^\x{0000}-\x{0400}]+
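
As a rough illustration of the same range-based approach outside BBEdit, the sketch below uses Python, whose `re` module writes code points as `\uXXXX` rather than `\x{XXXX}`. The function name `strip_outside_latin` is an assumption made for this example.

```python
import re

# Same range as the Find pattern above, in Python's \uXXXX notation:
# any run of characters outside U+0000..U+0400.
OUTSIDE_LATIN = re.compile(r"[^\u0000-\u0400]+")


def strip_outside_latin(text: str) -> str:
    """Delete every run of characters outside the range (hypothetical helper)."""
    return OUTSIDE_LATIN.sub("", text)


print(strip_outside_latin("河大地（悟後十方空）。\nCase: A monk"))
```

Newlines fall inside the range, so line breaks are preserved. Typographic punctuation such as the curly apostrophe (U+2019) lies outside it and is deleted too, which may or may not be what you want.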

Hope this helps,

[fletcher]

Dan Barrett

Mar 21, 2023, 12:32:24 PM

to bbe...@googlegroups.com

[^[:ascii:]]+ will match every run of characters that are not ASCII (character codes 0-127), so it will skip over ASCII letters, digits, punctuation, etc.
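
A hedged Python sketch of the same idea follows. Python's `re` module does not support POSIX classes such as [:ascii:], so the explicit range \x00-\x7F stands in for it here; the helper name `ascii_only_lines` is an assumption for illustration.

```python
import re

# Stand-in for [^[:ascii:]]+ : any run of characters above code 127.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def ascii_only_lines(text: str) -> list[str]:
    """Keep only the lines made entirely of ASCII characters (hypothetical helper)."""
    # str.isascii() requires Python 3.7 or later.
    return [line for line in text.splitlines() if line.isascii()]


print(NON_ASCII.findall("Langya 瑯琊 said 云。"))
print(ascii_only_lines("河大地。\neverywhere is void.)\n229"))
```

The first call lists the non-ASCII runs that a find-and-replace would delete; the second takes the other route from the original question and selects the English lines instead.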

Kim Mosley

Mar 21, 2023, 12:50:19 PM

to bbe...@googlegroups.com

I used Kaveh’s code just because it came first and it worked for all but a few… job is done. Thanks everyone.

Though Dan’s is my favorite solution.

Kim
