On Wednesday, June 13, 2012 5:11:49 PM UTC-5, Chris Price wrote:
[...] Due to limitations in Puppet's representation of strings (character encoding is not explicitly specified), it's not possible for us to do anything too fancy when we encounter a byte sequence that is not directly representable in UTF-8.
Is Puppet's representation of strings distinct from Ruby's representation? In any case, it seems like a fundamental problem that Puppet is working with a bunch of strings whose encoding is uncertain. Why can't that be tackled farther upstream with a mechanism for ensuring that Puppet uses a consistent and known encoding for strings? Or even that it uses UTF-8 internally, so that no transcoding is needed when sending data to puppetdb?
Furthermore, what do you mean by "a byte sequence that is not directly representable in UTF-8"? UTF-8 encodes characters as bytes, not bytes as bytes, so no byte sequence is inherently non-representable. For example, you can encode any byte sequence in UTF-8 by assuming that it represents a sequence of Latin-1 (ISO-8859-1) encoded characters, so that each byte's value is also the corresponding character's Unicode code point. Do you perhaps mean "a byte sequence that isn't already valid UTF-8"?
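To make the Latin-1 point concrete, here is a minimal sketch. It assumes Ruby 1.9 or later (1.8 has no String#force_encoding), and the helper name is made up for illustration; it is not anything that exists in Puppet or puppetdb.

```ruby
# Hypothetical helper, not part of Puppet: treat every input byte as one
# Latin-1 (ISO-8859-1) character, then transcode those characters to UTF-8.
def bytes_to_utf8_via_latin1(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "caf" followed by 0xE9 and 0xFF; these five bytes are not valid UTF-8.
raw  = [0x63, 0x61, 0x66, 0xE9, 0xFF].pack("C*")
utf8 = bytes_to_utf8_via_latin1(raw)

puts utf8.valid_encoding?                                  # the result is valid UTF-8
puts utf8.encode(Encoding::ISO_8859_1).bytes == raw.bytes  # and the original bytes are recoverable
```

The result is only meaningful as text if the input really was Latin-1, but nothing is lost either way: transcoding back to ISO-8859-1 returns the original bytes.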
I understand that Ruby 1.8 has pretty dismal character encoding support, but there are ways to deal with it. Surely you can do better than just an improved warning and a "don't do that".
At the very least, there is room for some user guidance. For example, would the problem be adequately addressed if all manifests and data were encoded in UTF-8 and the agent were guaranteed to run in a UTF-8-based locale?
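If so, that guidance could be backed by a check along these lines. Again this is only a sketch assuming Ruby 1.9 or later, and `valid_utf8?` is a hypothetical name, not an existing Puppet API.

```ruby
# Hypothetical pre-flight check: do these bytes already form valid UTF-8?
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

puts valid_utf8?("caf\u00E9")                          # → true  (e-acute encoded as C3 A9)
puts valid_utf8?([0x63, 0x61, 0x66, 0xE9].pack("C*"))  # → false (a bare 0xE9 byte, as in Latin-1)
```

Something like this could be run over manifest and data file contents before anything is sent to puppetdb, so that the user is told which input is the offender.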
John