Canonize emoji for an XML file


Justin Ross

Apr 10, 2022, 3:48:51 PM
to BBEdit Talk
Hi all, I'm looking for a way to catch any emoji that's used amongst regular text. This is so that I can create an XML file to import into InDesign. Then I simply find/replace any emoji found and convert the character to an emoji font so it can be printed.

I've made a canonize file with over 1,000 emoji, one per line, each followed by a tab and then its decimal entity code.

This works great.  

For example:
This emoji is found in the text somewhere:
👦

and is changed to:
&#128102;

It also works for skin-tone emoji, which are made up of more than one decimal code.

This emoji is found in the text somewhere:
👦🏻

and is changed to:
&#128102;&#127995;
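
For anyone who wants to check the decimal codes, here is a minimal Swift sketch of the mapping described above. The function name `decimalEntities(for:)` is made up for illustration and is not part of BBEdit or the canonize file; it simply writes out each Unicode scalar of the string as a decimal entity.

```swift
// Hypothetical helper (an assumption, not from the thread): turn every
// Unicode scalar of a string into a decimal numeric character reference.
func decimalEntities(for text: String) -> String {
    return text.unicodeScalars.map { "&#\($0.value);" }.joined()
}

// U+1F466 is the boy emoji; U+1F3FB is the light skin-tone modifier.
print(decimalEntities(for: "\u{1F466}"))           // &#128102;
print(decimalEntities(for: "\u{1F466}\u{1F3FB}"))  // &#128102;&#127995;
```

The skin-tone version yields two codes because the emoji is stored as two scalars, the base character followed by the modifier.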


However, as soon as I wrap the code in marker text (so it's easier to find/change in InDesign), the emoji made up of more than one code cause a problem.

For example:

This emoji:
👦

Is changed to this:
(ef)&#128102;(\ef)

BUT...

This:
👦🏻

Is changed to this:
(ef)&#128102;(\ef)(ef)&#127995;(\ef)

Note the extra (\ef)(ef) in the middle.

Now I could use a find/replace to remove that bit. But what if there are two different emoji next to each other? I'm replacing one problem with another.

Is there a way round this?

Many thanks if anyone can help.

Justin Ross

Apr 12, 2022, 11:38:38 AM
to BBEdit Talk
Sorted it.

The only way for it to work with extra text on either end is to move all the single entity codes (e.g. &#128102;) below the multiple entity ones (e.g. &#128102;&#127995;) in the canonize file, so the longer sequences get replaced before their shorter parts.

So it would look something like this...

&#128102;&#127995;
&#1255;&#127967;
&#1153;
&#128102;
&#128103;
&#128104;
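
The ordering above can also be produced mechanically. What follows is only a sketch under assumptions: the function name `canonizeLines(for:)` and the three sample emoji are invented for illustration, and the line layout (search text, a tab, then the wrapped replacement) is taken from the description earlier in the thread. It sorts the entries so that sequences with more Unicode scalars come first.

```swift
// Hypothetical sketch (names are assumptions, not from the thread):
// build canonize lines and list longer sequences before shorter ones.
let openingTag = "(ef)"      // wrapper text used in the thread
let closingTag = "(\\ef)"

func canonizeLines(for emoji: [String]) -> [String] {
    // Longest first, measured in Unicode scalars, so that a skin-tone
    // sequence is listed above its base emoji.
    let ordered = emoji.sorted { $0.unicodeScalars.count > $1.unicodeScalars.count }
    return ordered.map { item in
        let entities = item.unicodeScalars.map { "&#\($0.value);" }.joined()
        return "\(item)\t\(openingTag)\(entities)\(closingTag)"
    }
}

// Boy, girl, and boy with light skin tone, given in the "wrong" order.
let lines = canonizeLines(for: ["\u{1F466}", "\u{1F467}", "\u{1F466}\u{1F3FB}"])
print(lines.joined(separator: "\n"))
```

With this input the skin-tone entry ends up on the first line, ahead of the two single-code entries.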

jj

Apr 16, 2022, 10:24:16 AM
to BBEdit Talk
Hi Community,

Let's celebrate BBEdit's 30 years of existence. 👏  🎉  🎂  🍾 

🥂 👉 👨🏼‍💻 & 🍀️  & 🦜 & 👥

Here is a Swift text filter that could help you prepare your InDesign birthday cards.

It is based on Unicode's Emoji regular expression and Swift's ICU regular expression engine.

Save it as ~/Library/Application Support/BBEdit/Text Filters/encode_emojis.swift

    #!/usr/bin/env swift

    // Based on: https://unicode.org/reports/tr51/#EBNF_and_Regex
    //
    // Changed \p{Emoji} to \p{Basic_Emoji} to avoid matching '#', numbers, etc.
    // Tweaked to match uncovered cases revealed by test files.
    //
    // Tested against the contents of those test files:
    // ------------------------------------------------
    // https://unicode.org/emoji/charts/full-emoji-list.html
    // https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/unidata/emoji-sequences.txt
    // https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/unidata/emoji-zwj-sequences.txt
    // https://unicode.org/Public/emoji/14.0/emoji-test.txt

    // example:     France 🇫🇷, Snail 🐌, Family👨‍👩‍👧‍👦, man technologist with skin tone 👨🏼‍💻
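    // decimal:     France (ef)&#127467;&#127479;(\ef), Snail (ef)&#128012;(\ef), Family(ef)&#128104;&#8205;&#128105;&#8205;&#128103;&#8205;&#128102;(\ef), man technologist with skin tone (ef)&#128104;&#127996;&#8205;&#128187;(\ef)
    // hex:         France (ef)&#x1F1EB;&#x1F1F7;(\ef), Snail (ef)&#x1F40C;(\ef), Family(ef)&#x1F468;&#x200D;&#x1F469;&#x200D;&#x1F467;&#x200D;&#x1F466;(\ef), man technologist with skin tone (ef)&#x1F468;&#x1F3FC;&#x200D;&#x1F4BB;(\ef)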
    import Foundation

    let useDecimalEntities = true           // Change to false to encode as hexadecimal entities.
    let openingWrapperTag = "(ef)"          // Set to "" if no wrapper tag needed.
    let closingWrapperTag = "(\\ef)"        // Set to "" if no wrapper tag needed.

    let pattern = #"""
    (?x-i)
    (?:
        \p{RI} \p{RI}
    |
        [
            \x{00A9}
            \x{00AE}
            \x{203C}
            \x{2049}
            \x{2122}
            \x{2139}
            \x{2194}
            \x{2195}
            \x{2196}
            \x{2197}
            \x{2198}
            \x{2199}
            \x{21A9}
            \x{21AA}
            \x{2328}
            \x{23CF}
            \x{23ED}
            \x{23EE}
            \x{23EF}
            \x{23F1}
            \x{23F2}
            \x{23F8}
            \x{23F9}
            \x{23FA}
            \x{24C2}
            \x{25AA}
            \x{25AB}
            \x{25B6}
            \x{25C0}
            \x{25FB}
            \x{25FC}
            \x{2702}
            \x{2708}
            \x{2709}
            \x{270F}
            \x{2712}
            \x{2714}
            \x{2716}
            \x{271D}
            \x{2721}
            \x{2733}
            \x{2734}
            \x{2744}
            \x{2747}
            \x{2763}
            \x{27A1}
            \x{2934}
            \x{2935}
            \x{2B05}
            \x{2B06}
            \x{2B07}
            \x{3030}
            \x{303D}
            \x{3297}
            \x{3299}
            \x{1F170}
            \x{1F171}
            \x{1F17E}
            \x{1F17F}
            \x{1F202}
            \x{1F237}
        ]
        \x{FE0F}
    |
        [
            \x{0023}
            \x{002A}
            \x{0030}
            \x{0031}
            \x{0032}
            \x{0033}
            \x{0034}
            \x{0035}
            \x{0036}
            \x{0037}
            \x{0038}
            \x{0039}
        ]
        \x{FE0F} \x{20E3}

    |
        [
            \p{Basic_Emoji}
            \x{1F300}-\x{1F5FF}
            \x{1F3CA}-\x{1F3CC}
            \x{1F3F3}
            \x{1F3F4}
            \x{1F441}
            \x{1F574}
            \x{1F575}
            \x{1F590}
            \x{1F680}-\x{1F6FF}
            \x{2600}-\x{26FF}
            \x{261D}
            \x{26F9}
            \x{270C}
            \x{270D}
            \x{2764}
        ]
        (?:
            \p{EMod}
        |
            \x{FE0F} \x{20E3}?
        |
            [\x{E0020}-\x{E007E}]+
            \x{E007F}
        )?
        (?:
            \x{200D}
            [
                \p{Basic_Emoji}
                \x{1F32B}
                \x{1F5E8}
                \x{2620}
                \x{2640}
                \x{2642}
                \x{2695}
                \x{2696}
                \x{26A7}
                \x{2708}
                \x{2744}
                \x{2764}
            ]
            (?:
                \p{EMod}
            |
                \x{FE0F} \x{20E3}?
            |
                [\x{E0020}-\x{E007E}]+
                \x{E007F}
            )?
        )*
    )
    """#

    let regex = try NSRegularExpression(pattern: pattern, options: [])
    var output: [String] = []

    while var line = readLine() {
        let range = NSRange(line.startIndex..<line.endIndex, in: line)
        let matches = regex.matches(in: line, options: [], range: range)
        for match in matches.reversed() {
            if let range = Range(match.range, in: line) {
                let emoji = line[range]
                let entities = emoji.unicodeScalars.map {
                    useDecimalEntities ? "&#\(String($0.value, radix: 10, uppercase: true));" : "&#x\(String($0.value, radix: 16, uppercase: true));"
                }
                let replacement = entities.joined(separator:"")
                line.replaceSubrange(range, with: "\(openingWrapperTag)\(replacement)\(closingWrapperTag)")
            }
        }
        output.append(line)
    }

    print(output.joined(separator: "\n"), terminator:"")
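
The filter above walks the matches in reverse. A reduced sketch of why that matters: replacing from the end of the line keeps the ranges of the earlier matches valid, because text in front of a replacement never moves. The pattern here is a deliberately tiny stand-in (a single scalar in U+1F300 to U+1F6FF), not the full expression, and `wrapEmoji(in:)` is a name made up for this example.

```swift
import Foundation

// Hypothetical, much-reduced pattern (an assumption for illustration):
// one scalar in the range U+1F300...U+1F6FF.
let simplePattern = #"[\x{1F300}-\x{1F6FF}]"#

func wrapEmoji(in text: String) throws -> String {
    let regex = try NSRegularExpression(pattern: simplePattern, options: [])
    var result = text
    let fullRange = NSRange(result.startIndex..<result.endIndex, in: result)
    // Back to front: each replacement only changes text after the
    // matches that are still waiting to be processed.
    for match in regex.matches(in: result, options: [], range: fullRange).reversed() {
        guard let range = Range(match.range, in: result) else { continue }
        let entities = result[range].unicodeScalars.map { "&#\($0.value);" }.joined()
        result.replaceSubrange(range, with: "(ef)\(entities)(\\ef)")
    }
    return result
}

// U+1F40C is the snail emoji.
let wrapped = try wrapEmoji(in: "a \u{1F40C} b \u{1F40C}")
print(wrapped)
```

Going front to back would work too, but every range after the first replacement would then have to be shifted by the length difference.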

--

BBEdit rocks!

Kind regards,

Jean Jourdain