| Jakub Pawlinski This is a known issue with the Unicode spec and the Java platform's implementation of it, not with Pipeline. In UTF-8 the BOM is neither required nor recommended, and since it is essentially meaningless there, Java passes it through untouched when decoding. First I'd make sure to add the "encoding: 'UTF-8'" argument to your readFile step to ensure the file is read as UTF-8. Then do some postprocessing to correct for the nonstandard input. Some suggested solutions are available on StackOverflow. Personally, I'd do something like this to sanitize your input:
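/** Removes the BOM, which appears as the single character U+FEFF once the text is decoded as UTF-8. */
private static String removeUTF8BOM(String s) {
    return s.replace("\uFEFF", "");
}
(The BOM bytes EF BB BF decode to the single character U+FEFF, so that is the escape to match; "\uEFBBBF" would be parsed as U+EFBB followed by the literal text "BF".) There are also code snippets out there that take a more efficient approach and only look at the start of the String. |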
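For what it's worth, the leading-only variant mentioned above could look roughly like the sketch below. The class and method names (BomStripper, stripLeadingBom) are made up for illustration and don't come from any library. It assumes the text has already been decoded as UTF-8, so the BOM shows up as a single U+FEFF character at index 0.

```java
final class BomStripper {
    /** The character that the UTF-8 BOM bytes (EF BB BF) decode to. */
    private static final char BOM = '\uFEFF';

    private BomStripper() {
        // Static helper only; not meant to be instantiated.
    }

    /**
     * Removes one leading BOM if present. Unlike String.replace, this does not
     * scan the whole text and leaves any U+FEFF found later in the string alone.
     */
    static String stripLeadingBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == BOM) {
            return s.substring(1);
        }
        return s;
    }
}
```

You would call it on whatever readFile returns, e.g. `BomStripper.stripLeadingBom(text)`. Input without a BOM (or null) comes back unchanged.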