ICU API Proposal


Mark Davis Ⓤ

Jun 11, 2026, 12:54:13 AM
to icu-design

I would like to propose the following API for: ICU 79

Please provide feedback by: next Wednesday, 2026-jun-17

Designated API reviewer: Markus

Ticket: https://unicode-org.atlassian.net/browse/ICU-23431

Pull request: TBD


/**

* Class for detecting links (URLs or emails) in text, and formatting them for display, implementing

* the algorithms in https://www.unicode.org/reports/tr58/. It supplies not only higher-level APIs

* that work directly, but also lower level APIs that can be used to augment existing scanners or

* formatters.<br>

* [TBD explain how the scanner handles cases with and without schemes (mailto:, https://, etc.)

* and ambiguous cases (john.smith@foo.com/somepath?somequery)]

*/

public class LinkScanner {

private LinkUtilities.LinkScanner linkScanner;


/**

* Constructor, from source, start, limit, and an optional set of valid TLDs.

*

* @param source the source to be scanned

* @param start where to start the scanning from.

* @param limit where to stop the scanning.

* @param validTLDs list of valid TLDs. If the list is empty or null, then a default list is

* used. That list is derived from https://data.iana.org/TLD/tlds-alpha-by-domain.txt just

* before the last release of ICU, so for best results a more current list can be used.

*/

public LinkScanner(CharSequence source, int start, int limit, Set<String> validTLDs) {

...

}


/**

* Searches for the next link.

*

* @return true if one is found.

*/

public boolean next() {

...

}


/**

* Gets the start offset of a found link.

*

* @return the start offset of the link, or -1 if not called after next()

*/

public int getLinkStart() {

...

}


/**

* Gets the end offset (limit) of a found link.

*

* @return the end offset of the link, or -1 if not called after next()

*/

public int getLinkEnd() {

...

}


/**

* Lower level utility for finding the end of a PathQueryFragment (PQF) in text. It assumes that

* the start position is immediately after an identified domain name.

* The purpose of this routine is for fitting into algorithms that are already in use, just

* taking over for the PQF scanning.

*

* @param source the text to be scanned

* @param start the position in the text to be scanned from. It should be immediately after a

* domain name.

* @return the end position of the PQF, or the start value if there is none.

*/

public static int scanPathQueryFragment(CharSequence source, int start) {

...

}

}


/**

* Provides for either minimal or maximal escaping of URLs, implementing the algorithms in

* https://www.unicode.org/reports/tr58/.

*/

public class UrlEscaper {


/** Enum for determining whether any percent-escaping is minimal or maximal */

public enum Extent {

/** Minimal percent-escaping only percent-escapes non-ASCII where necessary. */

MINIMAL,

/** Maximal percent-escaping percent-escapes all non-ASCII. */

MAXIMAL

}


/**

* Escapes a URL according to the Extent parameter

* @param source In the source, it is assumed that ASCII syntax characters requiring escaping have already been escaped.

* For example, a literal / in a path segment would already be percent-escaped. [TBD give example]

* @param extent either MINIMAL or MAXIMAL

* @return an escaped string according to the extent parameter.

*/

public String escapeUrl(String source, Extent extent) {

...

}

}
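
To make the MAXIMAL extent concrete, here is a minimal sketch. It is not the proposed ICU implementation: the names MaximalEscapeSketch and escapeMaximal are hypothetical, and the use of UTF-8 bytes for the percent-escapes is an assumption. It covers only the rule stated in the enum doc (percent-escape all non-ASCII) and copies ASCII unchanged, since the source is assumed to have its ASCII syntax characters escaped already. MINIMAL is not sketched because its rules depend on UTS #58 details not given in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration only; not the ICU code.
final class MaximalEscapeSketch {
    private MaximalEscapeSketch() {}

    /** Percent-escapes every non-ASCII code point as its UTF-8 bytes (assumed encoding). */
    static String escapeMaximal(CharSequence source) {
        StringBuilder out = new StringBuilder(source.length());
        int i = 0;
        while (i < source.length()) {
            int cp = Character.codePointAt(source, i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                // ASCII is copied as-is, including any existing %XX escapes.
                out.append((char) cp);
            } else {
                byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
        }
        return out.toString();
    }
}
```

With this sketch, a path containing U+00E9 would come out with that character replaced by %C3%A9, while an existing %2F would pass through untouched.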


Alan Liu

Jun 11, 2026, 6:24:31 AM
to Mark Davis Ⓤ, icu-design
Is it worth it to make LinkScanner implement Iterable<T>?
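
For discussion, a rough sketch of what an Iterable shape could look like. Everything here is hypothetical: IterableLinksSketch is not part of the proposal, and the crude regular expression only stands in for real link detection, which would follow UTS #58.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the pattern is a deliberately crude stand-in.
final class IterableLinksSketch implements Iterable<String> {
    private static final Pattern CRUDE_LINK =
        Pattern.compile("[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+(/\\S*)?");

    private final CharSequence source;

    IterableLinksSketch(CharSequence source) {
        this.source = source;
    }

    @Override
    public Iterator<String> iterator() {
        Matcher m = CRUDE_LINK.matcher(source);
        return new Iterator<String>() {
            private boolean ready = false;     // m holds a match not yet returned
            private boolean exhausted = false; // no further matches exist

            @Override
            public boolean hasNext() {
                if (!ready && !exhausted) {
                    ready = m.find();
                    exhausted = !ready;
                }
                return ready;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                ready = false;
                return m.group();
            }
        };
    }
}
```

A caller could then write a plain for-each loop over new IterableLinksSketch(text). The trade-off is that an Iterator of strings hides the offsets, so a real design would probably iterate over a small span type instead.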

--
You received this message because you are subscribed to the Google Groups "icu-design" group.
To unsubscribe from this group and stop receiving emails from it, send an email to icu-design+...@unicode.org.
To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-design/CAGuL-cgDL-KLoBG5MwD3__EpNenmC4vEdTMPECCGF4rzB4XDpg%40mail.gmail.com.
For more options, visit https://groups.google.com/a/unicode.org/d/optout.

--
You received this message because you are subscribed to the Google Groups "ICU - Team" group.
To unsubscribe from this group and stop receiving emails from it, send an email to icu-team+u...@unicode.org.
To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-team/CAGuL-cgDL-KLoBG5MwD3__EpNenmC4vEdTMPECCGF4rzB4XDpg%40mail.gmail.com.

Mark Davis Ⓤ

Jun 11, 2026, 9:58:15 AM
to Alan Liu, icu-design
I thought a little bit about making it an Iterable and/or supporting streaming.
I can whip up what those would look like.

One additional API I thought about adding would return a UnicodeSet of "hard characters". These are characters that can never occur in any link. That makes it easy to break up a larger volume of text and process links over those segments in parallel with different link scanners. It can also be used to support parallel streams with some futzing under the covers.
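
A minimal sketch of that segmentation idea, under stated assumptions: the class HardSplitSketch and its methods are hypothetical, whitespace stands in for the real set of hard characters, and "contains a dot" stands in for real per-segment link scanning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration of splitting on "hard" characters.
final class HardSplitSketch {
    private HardSplitSketch() {}

    /** Returns [start, end) ranges of maximal runs that contain no hard code point. */
    static List<int[]> segments(CharSequence text, IntPredicate isHard) {
        List<int[]> result = new ArrayList<>();
        int segStart = -1; // -1 means "not inside a segment"
        int i = 0;
        while (i < text.length()) {
            int cp = Character.codePointAt(text, i);
            if (isHard.test(cp)) {
                if (segStart >= 0) {
                    result.add(new int[] {segStart, i});
                    segStart = -1;
                }
            } else if (segStart < 0) {
                segStart = i;
            }
            i += Character.charCount(cp);
        }
        if (segStart >= 0) {
            result.add(new int[] {segStart, text.length()});
        }
        return result;
    }

    /** Processes segments in parallel; the filter is a crude placeholder for a scanner. */
    static long countCandidates(String text) {
        // Assumption: whitespace approximates the hard-character set.
        return segments(text, Character::isWhitespace).parallelStream()
            .filter(r -> text.substring(r[0], r[1]).indexOf('.') >= 0)
            .count();
    }
}
```

Because no link can span a hard character, each range can be handed to its own scanner without coordinating with the others.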

I was a little reluctant to go too far afield, because a lot of implementations already have fairly sophisticated mechanisms to identify domain names, so the key functionality they need is the PQF part and the user part of an email address. (I just realized I didn't add the API for that.)

Mark Davis Ⓤ

Jun 16, 2026, 6:31:25 PM
to Alan Liu, icu-design
Here is a revised proposal. It uses a handler, which should be easier for people to handle.

/**

* Class for detecting links (URLs or emails) in text, and formatting them for display, implementing

* the algorithms in https://www.unicode.org/reports/tr58/ to handle Unicode characters properly. It

* supplies not only higher-level APIs that work directly, but also lower level APIs that can be

* used to augment existing scanners or formatters.

*

* <p>The high-level API handles text with and without schemes (mailto:, https://, etc.). It does a

* fast scan for valid TLDs, then backs up for other components (full host, email user-part,

* scheme). For ambiguous cases, the first valid case is chosen. For example,

* "john.smith@foo.com/somepath?somequery" would detect "john.smith@foo.com", and treat

* "/somepath?somequery" as plain text.

*/

public class LinkScanner {

   /**

    * An interface for handling text being scanned. For example, in the text<br>

    * "At abc.net/foo#def can you find j...@somemail.com?", the following handler methods will be

    * called in sequence:

    *

    * <ul>

    *   <li>handlePlainText 0,3: e.g. "At "

    *   <li>handleUrl 3,18: e.g. "abc.net/foo#def"

    *   <li>handlePlainText 18,32: e.g. " can you find "

    *   <li>handleMailAddress 32,48: e.g. "j...@somemail.com"

    *   <li>handlePlainText 48,49: e.g. "?"

    * </ul>

    */

   public interface TextHandler {

       /**

        * Process a segment of plain text.

        *

        * @param plainText A string containing the plain text to process

        * @param start The start index

        * @param end The end offset (exclusive)

        */

       default void handlePlainText(CharSequence plainText, int start, int end) {}

       /**

        * Process a detected URL.

        *

        * @param urlText A string containing the URL to process. This exactly matches the source

        *     string: no normalization is performed.

        * @param start The start index

        * @param end The end offset (exclusive)

        */

       default void handleUrl(CharSequence urlText, int start, int end) {}

       /**

        * Process a detected email address.

        *

        * @param emailAddressText A string containing the email address to process. This exactly

        *     matches the source string: no normalization is performed.

        * @param start The start index

        * @param end The end offset (exclusive)

        */

       default void handleMailAddress(CharSequence emailAddressText, int start, int end) {}

   }


   /**

    * Constructor

    *

    * @param validTLDs list of valid TLDs. If the list is empty or null, then a default list is

    *     used. That list is derived from https://data.iana.org/TLD/tlds-alpha-by-domain.txt just

    *     before the last release of ICU, so for best results a more current list can be used.

    */

   public LinkScanner(Set<String> validTLDs) {

 

   }

   /**

    * Processes the given input text using the given handler to process the chunks of text

    * detected. This method is not thread-safe.

    *

    * @param input The plain text to process.

    * @param start where to start the scanning from.

    * @param end where to stop the scanning.

    * @param handler The object which formats the detected chunks of text.

    */

   public void process(CharSequence input, int start, int end, TextHandler handler) {

 

   }

   /**

    * Lower level utility for finding the end of a PathQueryFragment (PQF) in text. It assumes that

    * the start position is immediately after an identified domain name. The purpose of this

    * routine is for fitting into algorithms that are already in use, just taking over for the PQF

    * scanning.

    *

    * @param source the text to be scanned

    * @param start the position in the text to be scanned from. It should be immediately after a

    *     domain name.

    * @param limit the offset after the last character in the text to be scanned.

    * @return the end position of the PQF, or the start value if there is none.

    */

   public static int scanPathQueryFragment(CharSequence source, int start, int limit) {

 

   }

   // NOTE for reviewers: if ICU supplies \p{Link_Term=hard}, the following is not necessary

   /**

    * Returns a frozen set of Unicode characters that are guaranteed to never be part of a URL or

    * email address. This allows implementations to make various optimizations because URLs and

    * email addresses can never span these characters. For example, a span of characters between

    * safe characters that doesn't have a sequence of domain-character + . + domain-character can

    * be skipped in processing.

    */

   public static UnicodeSet getSafeCharacters() {

 

   }

   // NOTE for reviewers: if ICU supplies \p{Idn_Status≠disallowed}, the following is not necessary

   /**

    * Returns a frozen set of Unicode characters that are possible characters in a domain name

    * (pre-mapping). This allows implementations to make various optimizations because URLs and

    * email addresses must contain a sequence of domain-character + . + domain-character. It is the

    * same as the set of IDNA Mapping Table characters with values ≠ disallowed.

    */

   public static UnicodeSet getDomainCharacters() {

 

   }

}
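
To show how a client might use the handler, here is a small sketch. It is self-contained, so it declares a local copy of the TextHandler shape rather than using ICU. LinkCollector and replayExample are hypothetical names, the sample text is made up, and the calls are replayed by hand instead of coming from a real scanner. It also assumes that the CharSequence passed to each callback is the whole input, with start and end delimiting the segment; the Javadoc above does not settle that point.

```java
import java.util.ArrayList;
import java.util.List;

// Local copy of the proposed callback shape, so the sketch compiles without ICU.
interface TextHandler {
    default void handlePlainText(CharSequence text, int start, int end) {}
    default void handleUrl(CharSequence text, int start, int end) {}
    default void handleMailAddress(CharSequence text, int start, int end) {}
}

// Hypothetical handler: ignores plain text and records the links it is given.
final class LinkCollector implements TextHandler {
    final List<String> urls = new ArrayList<>();
    final List<String> mailAddresses = new ArrayList<>();

    @Override
    public void handleUrl(CharSequence text, int start, int end) {
        urls.add(text.subSequence(start, end).toString());
    }

    @Override
    public void handleMailAddress(CharSequence text, int start, int end) {
        mailAddresses.add(text.subSequence(start, end).toString());
    }

    /** Stands in for process(): replays the calls a scanner might make for this text. */
    static LinkCollector replayExample() {
        String text = "At abc.net/foo#def can you find me@example.com?";
        LinkCollector handler = new LinkCollector();
        handler.handlePlainText(text, 0, 3);     // "At "
        handler.handleUrl(text, 3, 18);          // "abc.net/foo#def"
        handler.handlePlainText(text, 18, 32);   // " can you find "
        handler.handleMailAddress(text, 32, 46); // "me@example.com"
        handler.handlePlainText(text, 46, 47);   // "?"
        return handler;
    }
}
```

Because every method has a default body, a client only overrides the callbacks it cares about, which is the main convenience of the handler design.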


Mark Davis Ⓤ

Jun 16, 2026, 6:40:37 PM
to Alan Liu, icu-design

Rich Gillam

Jun 17, 2026, 8:02:14 PM
to Mark Davis Ⓤ, Alan Liu, icu-design
I won’t be at the meeting tomorrow, but I looked this over on the API-change list, and it all looked well thought out and well-justified to me.  No notes.
:-)  Somebody will have to port the whole thing to C++ at some point, of course…

—Rich




Mark Davis Ⓤ

Jun 17, 2026, 9:44:13 PM
to Rich Gillam, Markus Unicode Scherer, Alan Liu, icu-design
Thanks!

I've got an implementation over in the unicodetools repo, plus the test files, so it shouldn't take too long to get the API to function, and a test to run.

One question for Markus is whether his update to Unicode 18.0 properties includes the properties in 

Markus Scherer

Jun 18, 2026, 11:13:37 AM
to Mark Davis Ⓤ, Rich Gillam, Markus Unicode Scherer, Alan Liu, icu-design
Hi...

First, sorry for not getting to this earlier.

Second, sorry, but to me it doesn't feel appropriate for ICU to try to implement the whole end-to-end link detection.

Other people / library implementers have spent years figuring out how to detect links in text. UTS #58 covers the piece of the problem where those other libraries are bad or inconsistent.

Therefore, scanPathQueryFragment() feels right -- that's the value that we can reasonably provide -- but the rest should be out of scope for ICU.
And that could be a static method. Especially in Java where lazy init is easy.

I also don't want to be in the business of maintaining a set of TLDs. It's yet one more thing that would complicate a release, but also, link detectors should be free to ignore whether TLDs are valid, and also free to validate below-top-level domain labels, etc.

Even considering the TextHandler callback, that feels like ignoring where the language has gone -- a modern Java iteration API would return a Stream.
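
As a point of comparison, a Stream-returning shape might look roughly like the sketch below. This is only an illustration: LinkStreamSketch and links are hypothetical names, the pattern is a crude stand-in rather than UTS #58 detection, and Matcher.results() requires Java 9 or later.

```java
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical illustration of a Stream-based iteration API.
final class LinkStreamSketch {
    private LinkStreamSketch() {}

    // Deliberately crude: dotted labels, optionally followed by a slash and non-space characters.
    private static final Pattern CRUDE_LINK =
        Pattern.compile("[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+(/\\S*)?");

    /** Returns one MatchResult per detected span, carrying both offsets and text. */
    static Stream<MatchResult> links(CharSequence text) {
        return CRUDE_LINK.matcher(text).results();
    }
}
```

Each MatchResult exposes start(), end(), and group(), so callers get offsets without a callback interface.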

Plus a constructor that may well sprout more arguments later... would probably want to have a Builder.

On Wed, Jun 17, 2026 at 6:44 PM Mark Davis Ⓤ <ma...@unicode.org> wrote:
One question for Markus is whether his update to Unicode 18.0 properties includes the [UTS #58] properties

Not yet, but I can try to work on that for ICU 79. Worst case, ICU 80.

Not sure where getDomainCharacters() should live. Is it derived from the IdnaMappingTable.txt type values? We could consider a regular property API for that.

markus

Markus Scherer

Jun 18, 2026, 11:15:45 AM
to Mark Davis Ⓤ, Rich Gillam, Markus Unicode Scherer, Alan Liu, icu-design
For the UrlEscaper:

Is it likely that enum Extent will ever have more than two values?
If not, then we could just have two separate functions.

Why not make the input a CharSequence like elsewhere?

markus

Mark Davis Ⓤ

Jun 18, 2026, 1:33:19 PM
to Markus Scherer, Rich Gillam, Markus Unicode Scherer, Alan Liu, icu-design
Here is the revised proposal after the discussion in the ICU-TC meeting today. Please let me know if there are any further questions or concerns.

/**

* Utility class for assisting with detecting links (URLs or emails) in text, and formatting them

* for display, implementing the algorithms in https://www.unicode.org/reports/tr58/ to handle

* Unicode characters properly. It supplies lower level APIs for use in augmenting existing scanners

* and formatters.

*/

public class LinkUtilities {

   /**

     * Lower level utility for finding the end of a PathQueryFragment (PQF) in text. It assumes that

     * the start position is immediately after an identified domain name. The purpose of this

     * routine is for fitting into algorithms that are already in use, just taking over for the PQF

     * scanning. For more information, see https://www.unicode.org/reports/tr58/.

     *

     * @param source the text to be scanned

     * @param start the position in the text to be scanned from. It should be immediately after a

     *     domain name.

     * @param limit the offset after the last character in the text to be scanned.

     * @return the end position of the PQF, or the start value if there is none.

     */

   public static int scanPathQueryFragment(CharSequence source, int start, int limit) {

   }

   /**

    * Lower level utility for finding the start of an email address in text. It assumes that the

    * limit position is immediately before an '@' + identified domain name. The purpose of this

    * routine is for fitting into algorithms that are already in use, just handling the email

    * local-part.

    *

    * @param s the text to be scanned

    * @param start the position that is the earliest that should be considered in a backwards scan

    * @param limit the position to start scanning backwards from; it should be just before '@' +

    *     domain name.

    * @return the start position of the email local-part.

    */

   public static int scanBackEmailLocalPart(CharSequence s, int start, int limit) {

   }

   /** Enum for determining whether any percent-escaping is minimal or maximal, for use 

    * in escapePathQueryFragment.

    */

   public enum Extent {

       /** Minimal percent-escaping only percent-escapes non-ASCII where necessary. */

       MINIMAL,

       /** Maximal percent-escaping percent-escapes all non-ASCII. */

       MAXIMAL

   }

   /**

    * Escapes a URL according to the Extent parameter.

    *

    * @param source In the source, it is assumed that ASCII syntax characters requiring escaping

    *     have already been escaped. For example, a literal / in a path segment would already be

    *     percent-escaped. For more information, see https://www.unicode.org/reports/tr58/.

    * @param extent either MINIMAL or MAXIMAL

    * @return an escaped string according to the extent parameter.

    */

   public String escapePathQueryFragment(String source, Extent extent) {

   }


   // NOTE for reviewers: These are only temporary, until ICU supplies the following:

   //  \p{Link_Term=hard}

   // \p{Idn_Status≠disallowed}

   /**

    * Returns a frozen set of Unicode characters that are guaranteed to never be part of a URL or

    * email address. This allows implementations to make various optimizations because URLs and

    * email addresses can never span these characters. For example, a span of characters between

    * safe characters that doesn't have a sequence of domain-character + . + domain-character can

    * be skipped in processing.

    */

   public static UnicodeSet getSafeCharacters() {

   }

   /**

    * Returns a frozen set of Unicode characters that are possible characters in a domain name

    * (pre-mapping). This allows implementations to make various optimizations because URLs and

    * email addresses must contain a sequence of domain-character + . + domain-character. It is the

    * same as the set of IDNA Mapping Table characters with values ≠ disallowed.

    */

   public static UnicodeSet getDomainCharacters() {

   }

}
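
To illustrate how scanPathQueryFragment is meant to slot into an existing scanner, here is a sketch of the calling pattern. The scanPathQueryFragment body below is a toy stand-in with the proposed signature, not the UTS #58 algorithm: it just takes non-whitespace characters and drops a few trailing punctuation marks. PqfIntegrationSketch, firstLink, and the DOMAIN pattern are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the integration point; not the ICU code.
final class PqfIntegrationSketch {
    private PqfIntegrationSketch() {}

    // Stands in for whatever domain detection the existing scanner already has.
    private static final Pattern DOMAIN = Pattern.compile("[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+");

    /** Toy stand-in: returns the end of the PQF, or start if there is none. */
    static int scanPathQueryFragment(CharSequence source, int start, int limit) {
        if (start >= limit || "/?#".indexOf(source.charAt(start)) < 0) {
            return start; // nothing that looks like a path, query, or fragment
        }
        int end = start;
        while (end < limit && !Character.isWhitespace(source.charAt(end))) {
            end++;
        }
        // Crude placeholder for the termination rules: drop trailing sentence punctuation.
        while (end > start && ".,?!".indexOf(source.charAt(end - 1)) >= 0) {
            end--;
        }
        return end;
    }

    /** The existing scanner finds the domain itself, then delegates the tail. */
    static String firstLink(String text) {
        Matcher m = DOMAIN.matcher(text);
        if (!m.find()) {
            return null;
        }
        int end = scanPathQueryFragment(text, m.end(), text.length());
        return text.substring(m.start(), end);
    }
}
```

The point of the design is visible in firstLink: the caller keeps its own domain logic and only hands over the position right after the domain name.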

Markus Scherer

Jun 22, 2026, 12:31:57 PM
to Mark Davis Ⓤ, Rich Gillam, Markus Unicode Scherer, Alan Liu, icu-design
lgtm up to here

   /**

    * Escapes a URL according to the Extent parameter.

    *

    * @param source In the source, it is assumed that ASCII syntax characters requiring escaping

    *     have already been escaped. For example, a literal / in a path segment would already be

    *     percent-escaped. For more information, see https://www.unicode.org/reports/tr58/.

    * @param extent either MINIMAL or MAXIMAL

    * @return an escaped string according to the extent parameter.

    */

   public String escapePathQueryFragment(String source, Extent extent) {

   }


We said that this should be "static" as well.

We also discussed: "consider taking & returning a CharSequence"

   // NOTE for reviewers: These are only temporary, until ICU supplies the following:

   //  \p{Link_Term=hard}

   // \p{Idn_Status≠disallowed}


These getters should not be part of the public API and proposal.

tnx
markus

Markus Scherer

Jul 2, 2026, 3:30:19 PM
to Mark Davis Ⓤ, Rich Gillam, Markus Unicode Scherer, Alan Liu, icu-design
The ICU-TC has approved the new API with the following changes:

  • escapePathQueryFragment() made static

  • removed getSafeCharacters() & getDomainCharacters()


This leaves just three static functions, and the enum:

public class LinkUtilities {

   public static int scanPathQueryFragment(CharSequence source, int start, int limit) { ... }

   public static int scanBackEmailLocalPart(CharSequence s, int start, int limit) { ... }

   public enum Extent { MINIMAL, MAXIMAL }

   public static String escapePathQueryFragment(String source, Extent extent) { ... }

}



markus
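
For readers wondering how the backwards scan would be called, here is a sketch of the calling pattern for scanBackEmailLocalPart. The body is a toy stand-in with the approved signature, not the UTS #58 rules: it walks back over ASCII letters, digits, and a few punctuation characters. Returning limit when there is no local part is an assumption, and LocalPartSketch and mailAddressAt are hypothetical names.

```java
// Hypothetical illustration of the integration point; not the ICU code.
final class LocalPartSketch {
    private LocalPartSketch() {}

    /** Toy stand-in: walks backwards from limit, never going before start. */
    static int scanBackEmailLocalPart(CharSequence s, int start, int limit) {
        int pos = limit;
        while (pos > start && isLocalPartChar(s.charAt(pos - 1))) {
            pos--;
        }
        return pos; // assumption: equals limit when no local part was found
    }

    // Crude ASCII-only approximation of local-part characters.
    private static boolean isLocalPartChar(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || "._+-".indexOf(c) >= 0;
    }

    /** Caller side: the existing scanner knows where '@' and the domain end are. */
    static String mailAddressAt(String text, int atIndex, int domainEnd) {
        int localStart = scanBackEmailLocalPart(text, 0, atIndex);
        return localStart == atIndex ? null : text.substring(localStart, domainEnd);
    }
}
```

Here limit is the index of the '@' and start bounds how far back the scan may go, for example the end of the previous link.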

Mark Davis Ⓤ

2:43 AM
to Markus Scherer, Rich Gillam, Markus Unicode Scherer, Alan Liu, icu-design
Thanks!