approved API proposal for Segmenter API

Elango Cheran

Apr 11, 2025, 7:39:40 PM
to icu-d...@unicode.org
Hi everyone,
This announcement is somewhat retroactive: I just realized that I left the icu-design@ mailing list off the messages I sent to solicit reviews and discussion of the Segmenter API proposal.

The Segmenter API is designed as a higher-level, more modern API for segmentation. It makes instances immutable and isolates iteration state across instances, which avoids a big source of complexity and bugs in BreakIterator. It also uses the Stream API from Java 8+ to represent sequences of elements, which in turn provides APIs for functional programming constructs. The Segmenter API is not meant to be a 1-to-1 replacement for BreakIterator, so it does not attempt to replicate every BreakIterator API. Another non-goal of the Segmenter API is to maintain 100% performance parity with BreakIterator while it is wrapping BreakIterator.
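To illustrate the iteration-state point, here is a minimal sketch of confining a stateful iterator inside one method and exposing only a stream. It is not the Segmenter implementation. It uses the JDK's `java.text.BreakIterator` as a stand-in for ICU's class so that it runs without ICU4J, and the names `BoundaryStreamSketch` and `wordBoundaries` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;
import java.util.stream.IntStream;

final class BoundaryStreamSketch {

    // Collects every word boundary of `text` into an IntStream. The
    // BreakIterator is created, advanced, and dropped inside this method,
    // so no caller can observe or disturb its mutable cursor.
    static IntStream wordBoundaries(CharSequence text, Locale locale) {
        BreakIterator iter = BreakIterator.getWordInstance(locale);
        iter.setText(text.toString());
        IntStream.Builder boundaries = IntStream.builder();
        // first() and next() both move the iterator's internal position.
        for (int b = iter.first(); b != BreakIterator.DONE; b = iter.next()) {
            boundaries.add(b);
        }
        return boundaries.build();
    }

    public static void main(String[] args) {
        int[] result = wordBoundaries("Hello world", Locale.ROOT).toArray();
        System.out.println(Arrays.toString(result));
    }
}
```

Because each call builds its own iterator, two callers working on the same text cannot interfere with each other's position, which is the kind of isolation described above.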

The proposal went through rounds of discussion in the TC over the past few months, and the ICU4J portion received an Approved as Amended status yesterday.
There are a couple of issues that were not discussed in the TC but are implicit in the implementation PR branch:
  1. Should the `Segment` class be a top-level class, or an inner class of `Segments`?
    • The implementation PR has moved the `Segment` class, along with all of the other inner classes used for the implementation, out of `Segments` so that they are top-level classes.
    • Top-level classes allow a clearer delineation of which classes are public. A class nested in an interface is implicitly public, but only the `Segment` class needs to be public (since it is part of the return type in the signatures of the public APIs), whereas all of the iteration implementation-specific classes should be kept private.
  2. Is it okay to create a new package in the Java package hierarchy for the Segmenter API classes, and where should it go?
    • The implementation PR has created a package `com.ibm.icu.segmenter`.
    • Even though `BreakIterator` exists in `com.ibm.icu.text`, that package is a hodgepodge of all types of classes that were created a long time ago.
    • Newer APIs have started to create packages with a specific focus, like `com.ibm.icu.message2` for the new MessageFormatter that implements MF2. The PR follows this pattern.
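
The visibility concern in item 1 can be checked with a small sketch. The types `Container` and `Nested` are hypothetical and are not part of the proposal. The sketch only demonstrates the Java language rule that a class declared inside an interface is implicitly `public` and `static`.

```java
import java.lang.reflect.Modifier;

interface Container {
    // No modifier is written here, yet the compiler treats this member
    // class as public and static because it sits inside an interface.
    class Nested {}
}

final class NestedVisibilitySketch {

    static boolean isPublic(Class<?> c) {
        return Modifier.isPublic(c.getModifiers());
    }

    static boolean isStatic(Class<?> c) {
        return Modifier.isStatic(c.getModifiers());
    }

    public static void main(String[] args) {
        System.out.println("public: " + isPublic(Container.Nested.class));
        System.out.println("static: " + isStatic(Container.Nested.class));
    }
}
```

An implementation helper nested in an interface therefore cannot be hidden, whereas a top-level class can be left package-private.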

The following are the public API signatures:

public interface Segmenter {
  Segments segment(CharSequence s);
}


public interface Segments {

  /**
   * Returns a {@code Stream} of the {@code CharSequence}s for all of the segments in the source
   * sequence. Start from the beginning of the sequence and iterate forwards until the end.
   * @return a {@code Stream} of the {@code CharSequence}s of all segments in the source sequence.
   */
  Stream<CharSequence> subSequences();


  /**
   * Returns the segment that contains index {@code i}. Containment is inclusive of the start index
   * and exclusive of the limit index.
   *
   * <p>Specifically, the containing segment is defined as the segment with start {@code s} and
   * limit {@code l} such that {@code s ≤ i < l}.</p>
   * @param i index in the input {@code CharSequence} to the {@code Segmenter}
   * @throws IllegalArgumentException if {@code i} is less than 0 or greater than the length of the
   *    input {@code CharSequence} to the {@code Segmenter}
   * @return A segment that either starts at or contains index {@code i}
   */
  Segment segmentAt(int i);


  /**
   * Returns a {@code Stream} of all {@code Segment}s in the source sequence. Start with the first
   * and iterate forwards until the end of the sequence.
   *
   * <p>This is equivalent to {@code segmentsFrom(0)}.</p>
   * @return a {@code Stream} of all {@code Segment}s in the source sequence.
   */
  Stream<Segment> segments();


  /**
   * Returns a {@code Stream} of all {@code Segment}s in the source sequence where all segment limits
   * {@code l} satisfy {@code i < l}.  Iteration moves forwards.
   *
   * <p>This means that the first segment in the stream is the same
   * as what is returned by {@code segmentAt(i)}.</p>
   *
   * <p>The word "from" is used here to mean "at or after", with the semantics of "at" for a
   * {@code Segment} defined by {@link #segmentAt(int)}. We cannot describe the segments all as
   * being "after" since the first segment might contain {@code i} in the middle, meaning that
   * in the forward direction, its start position precedes {@code i}.</p>
   *
   * <p>{@code segmentsFrom} and {@link #segmentsBefore(int)} create a partitioning of the space of
   * all {@code Segment}s.</p>
   * @param i index in the input {@code CharSequence} to the {@code Segmenter}
   * @return a {@code Stream} of all {@code Segment}s at or after {@code i}
   */
  Stream<Segment> segmentsFrom(int i);


  /**
   * Returns whether offset {@code i} is a segmentation boundary. Throws an exception when
   * {@code i} is not a valid index position for the source sequence.
   * @param i index in the input {@code CharSequence} to the {@code Segmenter}
   * @throws IllegalArgumentException if {@code i} is less than 0 or greater than the length of the
   *     input {@code CharSequence} to the {@code Segmenter}
   * @return whether offset {@code i} is a segmentation boundary.
   */
  boolean isBoundary(int i);


  /**
   * Returns all segmentation boundaries, starting from the beginning and moving forwards.
   *
   * <p><b>Note:</b> {@code boundaries() != boundariesAfter(0)}.
   * This difference naturally results from the strict inequality condition in boundariesAfter,
   * and the fact that 0 is the first boundary returned from the start of an input sequence.</p>
   * @return An {@code IntStream} of all segmentation boundaries, starting at the first
   * boundary with index 0, and moving forwards in the input sequence.
   */
  IntStream boundaries();


  /**
   * Returns all segmentation boundaries after the provided index.  Iteration moves forwards.
   * @param i index in the input {@code CharSequence} to the {@code Segmenter}
   * @return An {@code IntStream} of all boundaries {@code b} such that {@code b > i}
   */
  IntStream boundariesAfter(int i);


  /**
   * Returns all segmentation boundaries on or before the provided index. Iteration moves backwards.
   *
   * <p>The phrase "back from" is used to indicate both that: 1) boundaries are "on or before" the
   * input index; 2) the direction of iteration is backwards (towards the beginning).
   * "on or before" indicates that the result set is {@code b} where {@code b ≤ i}, which is a weak
   * inequality, while "before" might suggest the strict inequality {@code b < i}.</p>
   *
   * <p>{@code boundariesBackFrom} and {@link #boundariesAfter(int)} create a partitioning of the
   *     space of all boundaries.</p>
   * @param i index in the input {@code CharSequence} to the {@code Segmenter}
   * @return An {@code IntStream} of all boundaries {@code b} such that {@code b ≤ i}
   */
  IntStream boundariesBackFrom(int i);


  class Segment {
    public final int start;
    public final int limit;
    public final int ruleStatus = 0;

    /**
     * Return the subsequence represented by this {@code Segment}
     * @return a new {@code CharSequence} object that is the subsequence represented by this
     * {@code Segment}.
     */
    public CharSequence getSubSequence() { ... }
  }

}


public class LocalizedSegmenter implements Segmenter {

  @Override
  public Segments segment(CharSequence s) { ... }

  public static Builder builder() {
    return new Builder();
  }

  public enum SegmentationType {
    GRAPHEME_CLUSTER,
    WORD,
    LINE,
    SENTENCE,
  }

  public static class Builder {
    public Builder setLocale(ULocale locale) { ... }

    public Builder setLocale(Locale locale) { ... }

    public Builder setSegmentationType(SegmentationType segmentationType) { ... }

    public Segmenter build() { ... }
  }
}


public class RuleBasedSegmenter implements Segmenter {

  @Override
  public Segments segment(CharSequence s) { ... }

  public static Builder builder() {
    return new Builder();
  }

  public static class Builder {
    public Builder setRules(String rules) { ... }

    public Segmenter build() { ... }
  }
}
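
To make the index semantics in the Javadoc above concrete, here is a minimal sketch that models them over a plain sorted array of boundaries. It does not use ICU4J and is not the implementation in the PR. The class `BoundarySemanticsSketch` and the `int[] boundaries` parameter are assumptions made for illustration; only the method names and the inequalities come from the signatures above.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

final class BoundarySemanticsSketch {

    // `boundaries` is assumed sorted ascending, starting at 0 and ending
    // at the length of the text.

    // All boundaries b with b > i, in forward order.
    static IntStream boundariesAfter(int[] boundaries, int i) {
        return Arrays.stream(boundaries).filter(b -> b > i);
    }

    // All boundaries b with b <= i, in backward order.
    static IntStream boundariesBackFrom(int[] boundaries, int i) {
        return IntStream.rangeClosed(1, boundaries.length)
            .map(k -> boundaries[boundaries.length - k])
            .filter(b -> b <= i);
    }

    // Returns {start, limit} of the segment with start <= i < limit.
    static int[] segmentAt(int[] boundaries, int i) {
        for (int k = 0; k + 1 < boundaries.length; k++) {
            if (boundaries[k] <= i && i < boundaries[k + 1]) {
                return new int[] {boundaries[k], boundaries[k + 1]};
            }
        }
        // This sketch rejects any index that lies in no segment; the
        // proposal's own bounds rule is the one stated in its Javadoc.
        throw new IllegalArgumentException("index " + i + " is not inside any segment");
    }

    public static void main(String[] args) {
        int[] boundaries = {0, 5, 6, 11}; // e.g. word boundaries of an 11-char text
        System.out.println(Arrays.toString(boundariesAfter(boundaries, 5).toArray()));    // [6, 11]
        System.out.println(Arrays.toString(boundariesBackFrom(boundaries, 5).toArray())); // [5, 0]
        System.out.println(Arrays.toString(segmentAt(boundaries, 7)));                    // [6, 11]
    }
}
```

For any index, the two boundary streams together cover every boundary exactly once, which is the partitioning property the Javadoc describes. Likewise, `boundariesAfter(boundaries, 0)` omits 0, matching the note on `boundaries()`.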



-- Elango

Markus Scherer

Apr 16, 2025, 7:35:49 PM
to Elango Cheran, icu-d...@unicode.org
On Fri, Apr 11, 2025 at 4:39 PM Elango Cheran <ela...@unicode.org> wrote:
This is a little retroactive of an announcement since I just realized that I omitted including the icu-design@ mailing list when sending out mailings to solicit reviews and discussions on the Segmenter API proposal.

Thanks for sending this, and for working on it!
Looking forward to using the new API!
Now I need to review the PR...

markus