Canonize emoji for an XML file

Justin Ross

Apr 10, 2022, 3:48:51 PM

to BBEdit Talk

Hi all, I'm looking for a way to catch any emoji that's used amongst regular text. This is so that I can create an XML file to import into InDesign. Then I simply find/replace any emoji found and convert the character to an emoji font so it can be printed.

I've made a canonize file with over 1,000 emoji, each followed by a tab and then its decimal equivalent.

This works great.

For example:

This emoji is found in the text somewhere:

👦

and is changed to:

&#128102;

It also works for skin-tone emoji, which are made up of more than one decimal code.

This emoji is found in the text somewhere:
👦🏻

and is changed to:

&#128102;&#127995;
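For reference, the decimal codes are just the Unicode scalar values of each emoji written as numeric character references. A minimal Swift sketch of that mapping, where `decimalEntities(for:)` is a made-up helper name and not part of BBEdit or the canonize file:

```swift
// Hypothetical helper: one decimal character reference per Unicode scalar.
func decimalEntities(for text: String) -> String {
    return text.unicodeScalars.map { "&#\($0.value);" }.joined()
}

// Boy (U+1F466) on its own.
print(decimalEntities(for: "\u{1F466}"))           // &#128102;
// Boy followed by the light skin tone modifier (U+1F3FB).
print(decimalEntities(for: "\u{1F466}\u{1F3FB}"))  // &#128102;&#127995;
```

A skin-tone emoji is two scalars, which is why it ends up as two codes in a row.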

However, as soon as I wrap the code (so it's easier to find/change in InDesign), the duplicate codes cause a problem.

For example:

This emoji:

👦

Is changed to this:

(ef)&#128102;(\ef)

BUT...

This:
👦🏻

Is changed to this:

(ef)&#128102;(\ef)(ef)&#127995;(\ef)

Note the extra (\ef)(ef) in the middle.

Now I could use a find/replace to remove that bit. But what if there are two different emoji next to each other? I'm replacing one problem with another.

Is there a way round this?

Many thanks if anyone can help.

Justin Ross

Apr 12, 2022, 11:38:38 AM

to BBEdit Talk

Sorted it.

The only way for it to work with extra text on either end is to move all the single entity codes (e.g. &#128102;) below the multiple entity ones (e.g. &#128102;&#127995;) in the canonize file, so that the longer sequences are replaced first.

So it would look something like this...

&#128102;&#127995;
&#1255;&#127967;
&#210;
&#128102;
&#128103;
&#128104;
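A rough Swift sketch of why the order matters. It assumes the canonize entries are applied one after another in file order, which the behaviour above suggests but which is not confirmed here. The names `replaceAll`, `wrap` and `applyInOrder` are made up for illustration, and the comparison is done scalar by scalar so that a boy with a skin tone still contains a plain boy:

```swift
/// Hypothetical helper: replace every occurrence of `needle`, comparing Unicode scalars one by one.
func replaceAll(_ needle: [Unicode.Scalar],
                with replacement: [Unicode.Scalar],
                in haystack: [Unicode.Scalar]) -> [Unicode.Scalar] {
    guard !needle.isEmpty else { return haystack }
    var result: [Unicode.Scalar] = []
    var i = 0
    while i < haystack.count {
        let end = i + needle.count
        if end <= haystack.count, haystack[i..<end].elementsEqual(needle) {
            result.append(contentsOf: replacement)
            i = end
        } else {
            result.append(haystack[i])
            i += 1
        }
    }
    return result
}

/// Hypothetical helper: the wrapped decimal entities for one emoji entry.
func wrap(_ emoji: String) -> String {
    let entities = emoji.unicodeScalars.map { "&#\($0.value);" }.joined()
    return "(ef)\(entities)(\\ef)"
}

/// Applies the entries in the given order, the way the canonize file is assumed to.
func applyInOrder(_ entries: [String], to text: String) -> String {
    var scalars = Array(text.unicodeScalars)
    for emoji in entries {
        scalars = replaceAll(Array(emoji.unicodeScalars),
                             with: Array(wrap(emoji).unicodeScalars),
                             in: scalars)
    }
    return String(String.UnicodeScalarView(scalars))
}

let boy = "\u{1F466}"
let tone = "\u{1F3FB}"
let text = "a \(boy)\(tone) b"

// Single codes first: the boy is wrapped before the pair can match.
print(applyInOrder([boy, tone, boy + tone], to: text))
// a (ef)&#128102;(\ef)(ef)&#127995;(\ef) b

// Longest entries first: the pair is wrapped as one unit.
let longestFirst = [boy, tone, boy + tone].sorted {
    $0.unicodeScalars.count > $1.unicodeScalars.count
}
print(applyInOrder(longestFirst, to: text))
// a (ef)&#128102;&#127995;(\ef) b
```

Sorting by scalar count is one way to get the multi-code entries to the top without moving them by hand.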

jj

Apr 16, 2022, 10:24:16 AM

to BBEdit Talk

Hi Community,

Let's celebrate BBEdit's 30 years of existence. 👏 🎉 🎂 🍾

🥂 👉 👨🏼‍💻 ＆ 🍀️ & 🦜 & 👥

Here is a Swift text filter that could help you prepare your InDesign birthday cards.

It is based on Unicode's emoji regular expression and on the ICU regular expression engine that Swift uses through NSRegularExpression.

Save in ~/Library/Application Support/BBEdit/Text Filters/encode_emojis.swift

#!/usr/bin/env swift

// Based on: https://unicode.org/reports/tr51/#EBNF_and_Regex
//
// Changed \p{Emoji} to \p{Basic_Emoji} to avoid matching '#', numbers, etc.
// Tweaked to match uncovered cases revealed by test files.
//
// Tested against the contents of those test files:
// ------------------------------------------------
// https://unicode.org/emoji/charts/full-emoji-list.html
// https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/unidata/emoji-sequences.txt
// https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/unidata/emoji-zwj-sequences.txt
// https://unicode.org/Public/emoji/14.0/emoji-test.txt

// example: France 🇫🇷, Snail 🐌, Family👨‍👩‍👧‍👦, man technologist with skin tone 👨🏼‍💻
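// decimal: France (ef)&#127467;&#127479;(\ef), Snail (ef)&#128012;(\ef), Family(ef)&#128104;&#8205;&#128105;&#8205;&#128103;&#8205;&#128102;(\ef), man technologist with skin tone (ef)&#128104;&#127996;&#8205;&#128187;(\ef)
// hex: France (ef)&#x1F1EB;&#x1F1F7;(\ef), Snail (ef)&#x1F40C;(\ef), Family(ef)&#x1F468;&#x200D;&#x1F469;&#x200D;&#x1F467;&#x200D;&#x1F466;(\ef), man technologist with skin tone (ef)&#x1F468;&#x1F3FC;&#x200D;&#x1F4BB;(\ef)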
import Foundation

let useDecimalEntities = true // Change to false to encode as hexadecimal entities.
let openingWrapperTag = "(ef)" // Set to "" if no wrapper tag needed.
let closingWrapperTag = "(\\ef)" // Set to "" if no wrapper tag needed.

let pattern = #"""
(?x-i)
(?:
\p{RI} \p{RI}
|
[
\x{00A9}
\x{00AE}
\x{203C}
\x{2049}
\x{2122}
\x{2139}
\x{2194}
\x{2195}
\x{2196}
\x{2197}
\x{2198}
\x{2199}
\x{21A9}
\x{21AA}
\x{2328}
\x{23CF}
\x{23ED}
\x{23EE}
\x{23EF}
\x{23F1}
\x{23F2}
\x{23F8}
\x{23F9}
\x{23FA}
\x{24C2}
\x{25AA}
\x{25AB}
\x{25B6}
\x{25C0}
\x{25FB}
\x{25FC}
\x{2702}
\x{2708}
\x{2709}
\x{270F}
\x{2712}
\x{2714}
\x{2716}
\x{271D}
\x{2721}
\x{2733}
\x{2734}
\x{2744}
\x{2747}
\x{2763}
\x{27A1}
\x{2934}
\x{2935}
\x{2B05}
\x{2B06}
\x{2B07}
\x{3030}
\x{303D}
\x{3297}
\x{3299}
\x{1F170}
\x{1F171}
\x{1F17E}
\x{1F17F}
\x{1F202}
\x{1F237}
]
\x{FE0F}
|
[
\x{0023}
\x{002A}
\x{0030}
\x{0031}
\x{0032}
\x{0033}
\x{0034}
\x{0035}
\x{0036}
\x{0037}
\x{0038}
\x{0039}
]
\x{FE0F} \x{20E3}

|
[
\p{Basic_Emoji}
\x{1F300}-\x{1F5FF}
\x{1F3CA}-\x{1F3CC}
\x{1F3F3}
\x{1F3F4}
\x{1F441}
\x{1F574}
\x{1F575}
\x{1F590}
\x{1F680}-\x{1F6FF}
\x{2600}-\x{26FF}
\x{261D}
\x{26F9}
\x{270C}
\x{270D}
\x{2764}
]
(?:
\p{EMod}
|
\x{FE0F} \x{20E3}?
|
[\x{E0020}-\x{E007E}]+
\x{E007F}
)?
(?:
\x{200D}
[
\p{Basic_Emoji}
\x{1F32B}
\x{1F5E8}
\x{2620}
\x{2640}
\x{2642}
\x{2695}
\x{2696}
\x{26A7}
\x{2708}
\x{2744}
\x{2764}
]
(?:
\p{EMod}
|
\x{FE0F} \x{20E3}?
|
[\x{E0020}-\x{E007E}]+
\x{E007F}
)?
)*
)
"""#

let regex = try NSRegularExpression(pattern: pattern, options: [])
var output: [String] = []

// BBEdit passes the text to filter on standard input.
while var line = readLine() {
    let lineRange = NSRange(line.startIndex..<line.endIndex, in: line)
    let matches = regex.matches(in: line, options: [], range: lineRange)
    // Replace from the last match to the first so earlier ranges stay valid.
    for match in matches.reversed() {
        if let matchRange = Range(match.range, in: line) {
            let emoji = line[matchRange]
            // One entity per Unicode scalar in the matched emoji sequence.
            let entities = emoji.unicodeScalars.map {
                useDecimalEntities
                    ? "&#\(String($0.value, radix: 10));"
                    : "&#x\(String($0.value, radix: 16, uppercase: true));"
            }
            let replacement = entities.joined(separator: "")
            line.replaceSubrange(matchRange, with: "\(openingWrapperTag)\(replacement)\(closingWrapperTag)")
        }
    }
    output.append(line)
}

print(output.joined(separator: "\n"), terminator: "")
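The filter walks the matches in reverse because each replacement changes the length of the line, and the ranges returned by NSRegularExpression refer to the line as it was before any edits. A small sketch of the same technique in isolation, using a stand-in pattern (runs of ASCII digits) instead of the emoji pattern. The function name `bracketMatches(in:)` is made up for illustration:

```swift
import Foundation

// Hypothetical stand-in for the emoji pattern: matches runs of ASCII digits.
func bracketMatches(in input: String) -> String {
    let digits = try! NSRegularExpression(pattern: "[0-9]+", options: [])
    var line = input
    let whole = NSRange(line.startIndex..<line.endIndex, in: line)
    // Going from the last match to the first keeps the earlier offsets valid.
    for match in digits.matches(in: line, options: [], range: whole).reversed() {
        if let range = Range(match.range, in: line) {
            let found = String(line[range])
            line.replaceSubrange(range, with: "[\(found)]")
        }
    }
    return line
}

print(bracketMatches(in: "a1 b22 c333"))  // a[1] b[22] c[333]
```

Replacing front to back would shift every later match by the length of the text already inserted.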

--

BBEdit rocks!

Kind regards,

Jean Jourdain
