I would like to propose the following API for: ICU 79
Please provide feedback by: next Wednesday, 2026-jun-17
Designated API reviewer: Markus
Ticket: https://unicode-org.atlassian.net/browse/ICU-23431
Pull request: TBD
/**
* Class for detecting links (URLs or emails) in text, and formatting them for display, implementing
* the algorithms in https://www.unicode.org/reports/tr58/. It supplies not only higher-level APIs
* that work directly, but also lower level APIs that can be used to augment existing scanners or
* formatters.<br>
* [TBD explain how the scanner handles cases with and without schemes (mailto:, https://, etc.)
* and ambiguous cases (john.smith@foo.com/somepath?somequery)]
*/
public class LinkScanner {
private LinkUtilities.LinkScanner linkScanner;
/**
* Constructor, from source, start, and limit.
*
* @param source the source to be scanned
* @param start where to start the scanning from.
* @param limit where to stop the scanning.
* @param validTLDs list of valid TLDs. If the list is empty or null, then a default list is
* used. That list is derived from https://data.iana.org/TLD/tlds-alpha-by-domain.txt just
* before the last release of ICU, so for best results a more current list can be used.
*/
public LinkScanner(CharSequence source, int start, int limit, Set<String> validTLDs) {
...
}
/**
* Searches for the next link.
*
* @return true if one is found.
*/
public boolean next() {
...
}
/**
* Gets the start offset of a found link
*
* @return the start offset of the link, or -1 if not called after next()
*/
public int getLinkStart() {
...
}
/**
* Gets the end offset (limit) of a found link
*
* @return the end offset of the link, or -1 if not called after next()
*/
public int getLinkEnd() {
...
}
/**
* Lower level utility for finding the end of a PathQueryFragment (PQF) in text. It assumes that
* the start position is immediately after an identified domain name.
* The purpose of this routine is for fitting into algorithms that are already in use, just
* taking over for the PQF scanning.
*
* @param source the text to be scanned
* @param start the position in the text to be scanned from. It should be immediately after a
* domain name.
* @return the end position of the PQF, or the start value if there is none.
*/
public static int scanPathQueryFragment(CharSequence source, int start) {
...
}
}
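To make the next()/getLinkStart()/getLinkEnd() calling pattern concrete, here is a minimal sketch of the loop a caller might write. ToyLinkScanner is a hypothetical stand-in built on a naive ASCII regex, purely so the example runs; it is not the UTS #58 algorithm, it ignores validTLDs, and its name and pattern are assumptions, not part of the proposal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in mimicking the proposed iterator shape; NOT the UTS #58 algorithm. */
final class ToyLinkScanner {
    // Naive ASCII-only pattern: optional http(s) scheme, dotted host, optional
    // path/query/fragment that may not end in common sentence punctuation.
    private static final Pattern LINK = Pattern.compile(
            "(?:https?://)?[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+(?:[/?#][^\\s]*[^\\s.,;:!?])?");

    private final Matcher matcher;
    private int linkStart = -1;
    private int linkEnd = -1;

    ToyLinkScanner(CharSequence source, int start, int limit) {
        matcher = LINK.matcher(source).region(start, limit);
    }

    /** Searches for the next link; returns true if one is found. */
    boolean next() {
        boolean found = matcher.find();
        linkStart = found ? matcher.start() : -1;
        linkEnd = found ? matcher.end() : -1;
        return found;
    }

    int getLinkStart() { return linkStart; }

    int getLinkEnd() { return linkEnd; }
}

final class LinkScannerLoopDemo {
    /** Collects every link by calling next() until it returns false. */
    static List<String> findLinks(String text) {
        List<String> links = new ArrayList<>();
        ToyLinkScanner scanner = new ToyLinkScanner(text, 0, text.length());
        while (scanner.next()) {
            links.add(text.substring(scanner.getLinkStart(), scanner.getLinkEnd()));
        }
        return links;
    }
}
```

With this stand-in, findLinks("See abc.net/foo#def. Also https://example.org?q=1 works.") yields the two links without the sentence-final period. The real class would presumably differ in how it treats trailing punctuation and non-ASCII text.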
/**
* Provides for either minimal or maximal escaping of URLs, implementing the algorithms in
* https://www.unicode.org/reports/tr58/.
*/
public class UrlEscaper {
/** Enum for determining whether any percent-escaping is minimal or maximal */
public enum Extent {
/** Minimal percent-escaping only percent-escapes non-ASCII where necessary. */
MINIMAL,
/** Maximal percent-escaping percent-escapes all non-ASCII. */
MAXIMAL
}
/**
* Escapes a URL according to the Extent parameter
* @param source In the source, it is assumed that ASCII syntax characters requiring escaping have already been escaped.
* For example, a literal / in a path segment would already be percent-escaped. [TBD give example]
* @param extent either MINIMAL or MAXIMAL
* @return an escaped string according to the extent parameter.
*/
public String escapeUrl(String source, Extent extent) {
...
}
}
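As a rough illustration of what Extent.MAXIMAL describes (percent-escaping all non-ASCII), here is a small self-contained sketch. The class and method names are hypothetical, and it only covers the MAXIMAL case: the rules for when MINIMAL escaping is "necessary" come from UTS #58 and are not reproduced here. Like the proposed method, it assumes ASCII syntax characters have already been escaped and leaves all ASCII untouched.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of MAXIMAL-style escaping; not the proposed implementation. */
final class MaximalEscapeSketch {
    /** Percent-escapes every non-ASCII code point as its UTF-8 bytes; ASCII is copied as-is. */
    static String escapeAllNonAscii(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int i = 0;
        while (i < source.length()) {
            int codePoint = source.codePointAt(i);
            int length = Character.charCount(codePoint);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else {
                // Note: an unpaired surrogate is not handled specially in this sketch.
                byte[] utf8 = source.substring(i, i + length).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += length;
        }
        return out.toString();
    }
}
```

For example, a path containing U+00E9 (e with acute accent) after "/caf" comes out as "/caf%C3%A9".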
--
You received this message because you are subscribed to the Google Groups "icu-design" group.
To unsubscribe from this group and stop receiving emails from it, send an email to icu-design+...@unicode.org.
To view this discussion visit https://groups.google.com/a/unicode.org/d/msgid/icu-design/CAGuL-cgDL-KLoBG5MwD3__EpNenmC4vEdTMPECCGF4rzB4XDpg%40mail.gmail.com.
For more options, visit https://groups.google.com/a/unicode.org/d/optout.
/**
* Class for detecting links (URLs or emails) in text, and formatting them for display, implementing
* the algorithms in https://www.unicode.org/reports/tr58/ to handle Unicode characters properly. It
* supplies not only higher-level APIs that work directly, but also lower level APIs that can be
* used to augment existing scanners or formatters.
*
* <p>The high-level API handles text with and without schemes (mailto:, https://, etc.). It does a
* fast scan for valid TLDs, then backs up for other components (full host, email user-part,
* scheme). For ambiguous cases, the first valid case is chosen. For example,
* "john.smith@foo.com/somepath?somequery" would detect "john.smith@foo.com", and treat
* "/somepath?somequery" as plain text.
*/
public class LinkScanner {
/**
* An interface for handling text being scanned. For example, in the text<br>
* "At abc.net/foo#def can you find j...@somemail.com?", the following handler methods will be
* called in sequence:
*
* <ul>
* <li>handlePlainText 0,3: e.g. "At "
* <li>handleUrl 3,18: e.g. "abc.net/foo#def"
* <li>handlePlainText 18,32: e.g. " can you find "
* <li>handleMailAddress 32,48: e.g. "j...@somemail.com"
* <li>handlePlainText 48,49: e.g. "?"
* </ul>
*/
public interface TextHandler {
/**
* Process a segment of plain text.
*
* @param plainText A string containing the plain text to process
* @param start The start index
* @param end The end offset (exclusive)
*/
default void handlePlainText(CharSequence plainText, int start, int end) {}
/**
* Process a detected URL.
*
* @param urlText A string containing the URL to process. This exactly matches the source
* string: no normalization is performed.
* @param start The start index
* @param end The end offset (exclusive)
*/
default void handleUrl(CharSequence urlText, int start, int end) {}
/**
* Process a detected email address.
*
* @param emailAddressText A string containing the email address to process. This exactly
* matches the source string: no normalization is performed.
* @param start The start index
* @param end The end offset (exclusive)
*/
default void handleMailAddress(CharSequence emailAddressText, int start, int end) {}
}
/**
* Constructor
*
* @param validTLDs list of valid TLDs. If the list is empty or null, then a default list is
* used. That list is derived from https://data.iana.org/TLD/tlds-alpha-by-domain.txt just
* before the last release of ICU, so for best results a more current list can be used.
*/
public LinkScanner(Set<String> validTLDs) {
…
}
/**
* Processes the given input text using the given handler to process the chunks of text
* detected. This method is not thread-safe.
*
* @param input The plain text to process.
* @param start where to start the scanning from.
* @param end where to stop the scanning.
* @param handler The object which formats the detected chunks of text.
*/
public void process(CharSequence input, int start, int end, TextHandler handler) {
…
}
/**
* Lower level utility for finding the end of a PathQueryFragment (PQF) in text. It assumes that
* the start position is immediately after an identified domain name. The purpose of this
* routine is for fitting into algorithms that are already in use, just taking over for the PQF
* scanning.
*
* @param source the text to be scanned
* @param start the position in the text to be scanned from. It should be immediately after a
* domain name.
* @param limit the offset after the last character in the text to be scanned.
* @return the end position of the PQF, or the start value if there is none.
*/
public static int scanPathQueryFragment(CharSequence source, int start, int limit) {
…
}
// NOTE for reviewers: if ICU supplies \p{Link_Term=hard}, the following is not necessary
/**
* Returns a frozen set of Unicode characters that are guaranteed to never be part of a URL or
* email address. This allows implementations to make various optimizations because URLs and
* email addresses can never span these characters. For example, a span of characters between
* safe characters that doesn't have a sequence of domain-character + . + domain-character can
* be skipped in processing.
*/
public static UnicodeSet getSafeCharacters() {
…
}
// NOTE for reviewers: if ICU supplies \p{Idn_Status≠disallowed}, the following is not necessary
/**
* Returns a frozen set of Unicode characters that are possible characters in a domain name
* (pre-mapping). This allows implementations to make various optimizations because URLs and
* email addresses must contain a sequence of domain-character + . + domain-character. It is the
* same as the set of IDNA Mapping Table characters with values ≠ disallowed.
*/
public static UnicodeSet getDomainCharacters() {
…
}
}
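To show how a TextHandler might be used for formatting, here is a minimal sketch of a handler that turns the callbacks into HTML. SketchTextHandler is a local copy of the proposed interface shape so that the example compiles on its own. The sketch assumes the first argument is the whole input and that start/end delimit the segment, which the Javadoc does not state explicitly. The driver calls the handler by hand in the order process() is described as using. HTML escaping and the "https://" default for schemeless URLs are illustrative choices, not part of the proposal.

```java
/** Local copy of the proposed TextHandler shape, so the sketch is self-contained. */
interface SketchTextHandler {
    default void handlePlainText(CharSequence text, int start, int end) {}

    default void handleUrl(CharSequence text, int start, int end) {}

    default void handleMailAddress(CharSequence text, int start, int end) {}
}

/** Hypothetical handler that wraps detected links in HTML anchors (no HTML escaping). */
final class HtmlLinkHandler implements SketchTextHandler {
    private final StringBuilder html = new StringBuilder();

    @Override
    public void handlePlainText(CharSequence text, int start, int end) {
        html.append(text, start, end);
    }

    @Override
    public void handleUrl(CharSequence text, int start, int end) {
        String url = text.subSequence(start, end).toString();
        // Assumption: a schemeless URL is linked as https.
        String href = url.contains("://") ? url : "https://" + url;
        html.append("<a href=\"").append(href).append("\">").append(url).append("</a>");
    }

    @Override
    public void handleMailAddress(CharSequence text, int start, int end) {
        String address = text.subSequence(start, end).toString();
        html.append("<a href=\"mailto:").append(address).append("\">").append(address).append("</a>");
    }

    String result() {
        return html.toString();
    }
}

final class HandlerDemo {
    /** Simulates the callback sequence a scanner would produce for one sample sentence. */
    static String render() {
        String input = "At abc.net/foo#def can you find bob@example.org?";
        String url = "abc.net/foo#def";
        String mail = "bob@example.org";
        int urlStart = input.indexOf(url);
        int urlEnd = urlStart + url.length();
        int mailStart = input.indexOf(mail);
        int mailEnd = mailStart + mail.length();

        HtmlLinkHandler handler = new HtmlLinkHandler();
        handler.handlePlainText(input, 0, urlStart);
        handler.handleUrl(input, urlStart, urlEnd);
        handler.handlePlainText(input, urlEnd, mailStart);
        handler.handleMailAddress(input, mailStart, mailEnd);
        handler.handlePlainText(input, mailEnd, input.length());
        return handler.result();
    }
}
```

Because every method has an empty default body, a handler that only cares about URLs could override handleUrl alone and ignore the rest.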
One question for Markus is whether his update to Unicode 18.0 properties includes the [UTS #58] properties.
/**
* Utility class for assisting with detecting links (URLs or emails) in text, and formatting them
* for display, implementing the algorithms in https://www.unicode.org/reports/tr58/ to handle
* Unicode characters properly. It supplies lower level APIs for use in augmenting existing scanners
* and formatters.
*/
public class LinkUtilities {
/**
* Lower level utility for finding the end of a PathQueryFragment (PQF) in text. It assumes that
* the start position is immediately after an identified domain name. The purpose of this
* routine is for fitting into algorithms that are already in use, just taking over for the PQF
* scanning. For more information, see https://www.unicode.org/reports/tr58/.
*
* @param source the text to be scanned
* @param start the position in the text to be scanned from. It should be immediately after a
* domain name.
* @param limit the offset after the last character in the text to be scanned.
* @return the end position of the PQF, or the start value if there is none.
*/
public static int scanPathQueryFragment(CharSequence source, int start, int limit) {
…
}
/**
* Lower level utility for finding the start of an email address in text. It assumes that the
* limit position is immediately before an '@' + identified domain name. The purpose of this
* routine is for fitting into algorithms that are already in use, just taking over the scanning
* of the email `local-part`.
*
* @param s the text to be scanned
* @param start the earliest position that should be considered in the backwards scan
* @param limit the position to start scanning backwards from; it should be just before '@' +
* domain name.
* @return the start position of the email local-part.
*/
public static int scanBackEmailLocalPart(CharSequence s, int start, int limit) {
…
}
/** Enum for determining whether any percent-escaping is minimal or maximal, for use
* in escapePathQueryFragment.
*/
public enum Extent {
/** Minimal percent-escaping only percent-escapes non-ASCII where necessary. */
MINIMAL,
/** Maximal percent-escaping percent-escapes all non-ASCII. */
MAXIMAL
}
/**
* Escapes a URL according to the Extent parameter.
*
* @param source In the source, it is assumed that ASCII syntax characters requiring escaping
* have already been escaped. For example, a literal / in a path segment would already be
* percent-escaped. For more information, see https://www.unicode.org/reports/tr58/.
* @param extent either MINIMAL or MAXIMAL
* @return an escaped string according to the extent parameter.
*/
public String escapePathQueryFragment(String source, Extent extent) {
…
}
// NOTE for reviewers: These are only temporary, until ICU supplies the following:
// \p{Link_Term=hard}
// \p{Idn_Status≠disallowed}
/**
* Returns a frozen set of Unicode characters that are guaranteed to never be part of a URL or
* email address. This allows implementations to make various optimizations because URLs and
* email addresses can never span these characters. For example, a span of characters between
* safe characters that doesn't have a sequence of domain-character + . + domain-character can
* be skipped in processing.
*/
public static UnicodeSet getSafeCharacters() {
…
}
/**
* Returns a frozen set of Unicode characters that are possible characters in a domain name
* (pre-mapping). This allows implementations to make various optimizations because URLs and
* email addresses must contain a sequence of domain-character + . + domain-character. It is the
* same as the set of IDNA Mapping Table characters with values ≠ disallowed.
*/
public static UnicodeSet getDomainCharacters() {
…
}
}
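The stated purpose of scanPathQueryFragment is to slot into a scanner that already finds domain names. A minimal sketch of that hand-off might look like the following. The domain regex stands in for "an existing scanner", and the local scanPathQueryFragment is a naive placeholder with the proposed signature (stop at whitespace, drop trailing sentence punctuation); neither reflects the UTS #58 rules, and both are assumptions made only so the example runs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of an existing domain scanner handing off to a PQF scanner. */
final class PqfHandOffSketch {
    // Stand-in for an existing, ASCII-only domain detector.
    private static final Pattern DOMAIN =
            Pattern.compile("[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    /** Naive placeholder with the proposed signature; returns start if there is no PQF. */
    static int scanPathQueryFragment(CharSequence source, int start, int limit) {
        if (start >= limit || "/?#".indexOf(source.charAt(start)) < 0) {
            return start;
        }
        int end = start;
        while (end < limit && !Character.isWhitespace(source.charAt(end))) {
            end++;
        }
        // Drop trailing sentence punctuation such as the period in "abc.net/foo."
        while (end > start && ".,;:!?".indexOf(source.charAt(end - 1)) >= 0) {
            end--;
        }
        return end;
    }

    /** Returns the first domain plus any PQF that follows it, or null if no domain is found. */
    static String firstLink(String text) {
        Matcher domain = DOMAIN.matcher(text);
        if (!domain.find()) {
            return null;
        }
        int linkEnd = scanPathQueryFragment(text, domain.end(), text.length());
        return text.substring(domain.start(), linkEnd);
    }
}
```

The point of the shape is that the caller keeps its own domain detection and only delegates the decision of where the link ends.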
escapePathQueryFragment() made static
removed getSafeCharacters() & getDomainCharacters()
public class LinkUtilities {
public static int scanPathQueryFragment(CharSequence source, int start, int limit) { … }
public static int scanBackEmailLocalPart(CharSequence s, int start, int limit) { … }
public enum Extent { MINIMAL, MAXIMAL }
public static String escapePathQueryFragment(String source, Extent extent) { … }
}