puppetdb: UTF-8 byte sequence

Antidot SAS

Jun 13, 2012, 6:51:22 AM

to puppet-users

Hi everyone,

Me again regarding PuppetDB. I am getting the following warning message:

"Jun 13 12:49:15 puppetmaster puppet-master[28444]: Ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB"

Do I have to worry?

Regards,

JM

jcbollinger

Jun 13, 2012, 9:06:38 AM

to puppet...@googlegroups.com

I don't know any relevant specifics about PuppetDB, but on general principles I would say that to the extent you rely on the data curated by PuppetDB to be correct, yes, you should worry. The message suggests data stream corruption between PuppetDB and whatever other part of the master is talking to it at that point. Probably they disagree about what character encoding to use, but whatever the cause of the problem, the message suggests that PuppetDB interpreted the data in question differently than its source intended. There is a bug of some kind in there, so I would file a ticket.

John

Chris Price

Jun 13, 2012, 6:11:49 PM

to puppet...@googlegroups.com

Because the serialization format (JSON) and the database both require UTF-8 character encoding for their data, PuppetDB needs to encode strings before it sends them from the puppet master to the PuppetDB server. Due to limitations in Puppet's representation of strings (character encoding is not explicitly specified), it's not possible for us to do anything too fancy when we encounter a byte sequence that is not directly representable in UTF-8. Thus, when this scenario occurs, you will see the warning that you mentioned. This does mean that we will be discarding the invalid bytes.
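For illustration only, the "warn and discard invalid bytes" behavior described above can be sketched in Python. This is not PuppetDB's actual code (the terminus is written in Ruby); the function name `to_utf8_lossy` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("utf8_sketch")  # hypothetical logger name


def to_utf8_lossy(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, dropping any bytes that are not valid UTF-8."""
    try:
        # Fast path: the bytes are already valid UTF-8 (this includes plain ASCII).
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        logger.warning("Ignoring invalid UTF-8 byte sequences in data")
        # errors="ignore" silently drops the offending bytes, so data is lost.
        return raw.decode("utf-8", errors="ignore")


clean = to_utf8_lossy(b"network_site=dmz")  # pure ASCII, passes through unchanged
lossy = to_utf8_lossy(b"caf\xe9")           # 0xE9 alone is not valid UTF-8 and is dropped
```

The second call shows why the warning matters: the string comes back shorter than it went in, so anything that depends on the exact value (an exported resource, a query) sees different data.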

Whether or not this is cause for concern in your particular case depends on which resource triggered the warning, and what your use case for that resource is. If the offending resource is an exported resource that other nodes are relying on, then this could cause problems. If the offending resource is one that you query or report on, then your data could be skewed slightly. Otherwise, this is effectively harmless for you.

One thing that we should do on our end, though, is try to provide a bit more context to the warning message to help you try to identify which resource is causing the warning. To that end I've filed the following ticket:

http://projects.puppetlabs.com/issues/15016

(Also worth noting: in the existing/old storeconfigs, the behavior for handling this scenario is undefined... so for us, this warning is a first step towards providing comprehensive, robust support for handling string encoding.)

We are definitely interested in hearing more details about your setup if this does cause you any problems.

Thanks for the feedback!

Chris

Antidot SAS

Jun 14, 2012, 6:20:07 AM

to puppet...@googlegroups.com

Hi,

I am not sure how I can help, but tell me what to do and I will be glad to.

Regards,

Jeremy MAURO

--
You received this message because you are subscribed to the Google Groups "Puppet Users" group.
To view this discussion on the web visit https://groups.google.com/d/msg/puppet-users/-/PZtYDMbV1XQJ.

To post to this group, send email to puppet...@googlegroups.com.
To unsubscribe from this group, send email to puppet-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/puppet-users?hl=en.

Chris Price

Jun 14, 2012, 8:41:39 AM

to puppet...@googlegroups.com

No action necessary; we should be able to create repro scenarios that will help us provide more info in the warning message (and resolve the ticket that I mentioned). If you happen to know (or are able to identify) which resource in your system is triggering the warning (because of a String that contains a non-UTF-8 byte sequence), it would be interesting to see what your resource looked like. Otherwise, since the odds are high that the warning is harmless, just let us know if you notice any other unusual behavior or problems that you suspect might be related to this.
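One possible way to narrow down the offending data is to scan a raw dump (for example, captured `facter` output) line by line for bytes that are not valid UTF-8. The Python sketch below assumes the dump is available as raw bytes; the function name `find_invalid_utf8_lines` and the sample data are hypothetical.

```python
from typing import List, Tuple


def find_invalid_utf8_lines(raw: bytes) -> List[Tuple[int, bytes]]:
    """Return (line_number, line_bytes) for each line that is not valid UTF-8."""
    bad = []
    for number, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


# Made-up sample: the second line contains a lone Latin-1 byte (0xE9).
sample = b"network_site=dmz\nowner=Andr\xe9\nkernel=Linux\n"
offenders = find_invalid_utf8_lines(sample)
```

To check real output, the same function could be fed with `sys.stdin.buffer.read()` so that the bytes are not decoded before they are inspected.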

Thanks again for the feedback!

Antidot SAS

Jun 14, 2012, 10:37:22 AM

to puppet...@googlegroups.com

Hi again,

Can I run facter and dump the result? Would that be enough? I get the warning on every client, so I would say that the scenario is pretty much reproducible. The only custom facts that I use are shell scripts with the facts function from: https://github.com/ripienaar/facter-facts/tree/master/facts-dot-d

One example of the output would be:

--

#!/bin/bash

echo "network_site=dmz"

--

Could that be the problem?

Regards,

JM

jcbollinger

Jun 14, 2012, 12:22:32 PM

to puppet...@googlegroups.com

On Wednesday, June 13, 2012 5:11:49 PM UTC-5, Chris Price wrote:

> [...] Due to limitations in Puppet's representation of strings (character encoding is not explicitly specified), it's not possible for us to do anything too fancy when we encounter a byte sequence that is not directly representable in UTF-8.

Is Puppet's representation of strings distinct from Ruby's representation? In any case, it seems like a fundamental problem that Puppet is working with a bunch of strings whose encoding is uncertain. Why can't that be tackled farther upstream with a mechanism for ensuring that Puppet uses a consistent and known encoding for strings? Or even that it uses UTF-8 internally, so that no transcoding is needed when sending data to puppetdb?

Furthermore, what do you mean by "a byte sequence that is not directly representable in UTF-8"? UTF-8 encodes characters as bytes, not bytes as bytes. No byte sequence is inherently non-representable. For example, you can encode any byte sequence in UTF-8 by assuming that it represents a sequence of Latin1-encoded characters, so that the bytes are also the characters' Unicode scalar values. Do you perhaps mean "a byte sequence that isn't already valid UTF-8"?
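The Latin-1 point can be checked with a short Python sketch (the helper names are hypothetical). Latin-1 maps each byte value 0x00-0xFF to the Unicode code point with the same number, so decoding never fails and the round trip through UTF-8 is lossless at the byte level.

```python
def latin1_bytes_to_utf8(raw: bytes) -> bytes:
    """Treat every byte as a Latin-1 character and re-encode the text as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def utf8_to_latin1_bytes(encoded: bytes) -> bytes:
    """Reverse the mapping above to recover the original bytes."""
    return encoded.decode("utf-8").encode("latin-1")


all_bytes = bytes(range(256))               # every possible byte value
encoded = latin1_bytes_to_utf8(all_bytes)   # always valid UTF-8
restored = utf8_to_latin1_bytes(encoded)    # identical to all_bytes
```

Note that this preserves bytes, not meaning: if the input was really Shift-JIS or binary data, the resulting characters are not the ones the author intended, although the original bytes remain recoverable.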

I understand that Ruby 1.8 has pretty dismal character encoding support, but there are ways to deal with it. Surely you can do better than just an improved warning and a "don't do that".

At least there is a potential for some user guidance. For example, would the problem be adequately addressed if all manifests and data were encoded in UTF-8 and the agent were ensured to run in a UTF-8-based locale?

John

Deepak Giridharagopal

Jun 14, 2012, 1:07:36 PM

to puppet...@googlegroups.com

On Thu, Jun 14, 2012 at 9:22 AM, jcbollinger <John.Bo...@stjude.org> wrote:

> On Wednesday, June 13, 2012 5:11:49 PM UTC-5, Chris Price wrote:
>> [...] Due to limitations in Puppet's representation of strings (character encoding is not explicitly specified), it's not possible for us to do anything too fancy when we encounter a byte sequence that is not directly representable in UTF-8.
>
> Is Puppet's representation of strings distinct from Ruby's representation? In any case, it seems like a fundamental problem that Puppet is working with a bunch of strings whose encoding is uncertain. Why can't that be tackled farther upstream with a mechanism for ensuring that Puppet uses a consistent and known encoding for strings? Or even that it uses UTF-8 internally, so that no transcoding is needed when sending data to puppetdb?

Agreed, it can and should be tackled upstream! I believe there's already a ticket for that, but I'll verify that assumption.

Your suspicions are correct: Puppet doesn't currently track the encoding of any strings inside the language. Once a string is "inside" of Puppet, we no longer know what its original encoding was. All we have are bytes. Nothing is converted to an internal, "neutral" encoding, nor do we maintain metadata about the original character set. A string in Puppet could contain ASCII, Latin-1, UTF-8, Shift-JIS, binary data, etc...and we unfortunately don't have any way to distinguish between them.
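As a small illustration of why bytes without encoding metadata are ambiguous, the Python sketch below decodes the same two bytes under three encodings. The function name `interpretations` is hypothetical, and this only demonstrates the general problem, not anything Puppet itself does.

```python
from typing import Dict, Iterable


def interpretations(raw: bytes, encodings: Iterable[str]) -> Dict[str, str]:
    """Decode the same bytes under several encodings, skipping those that fail."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results


raw = b"\xc3\xa9"
decoded = interpretations(raw, ["utf-8", "latin-1", "shift_jis"])
as_utf8 = decoded["utf-8"]      # one character, U+00E9
as_latin1 = decoded["latin-1"]  # two characters, U+00C3 and U+00A9
as_sjis = decoded["shift_jis"]  # two different characters again
```

All three decodings succeed, so nothing in the bytes themselves says which reading the author intended.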

So until this issue is fixed in core Puppet, if we need to send those bytes over-the-wire to a system that actually cares about the precise encoding of what you're sending, our options are limited. What we do in the PuppetDB terminus is apply a heuristic: we attempt to convert the string to UTF-8. For things like ASCII (which in our research represents the lion's share of Puppet code out there) this works fine, and preserves all data. For things like Latin-1 etc., though, which can't be transcoded in a lossless way, we emit the warning and try to preserve as much of the original data as we can. Once the root cause is fixed, though, PuppetDB can take advantage of it with very minor changes to our terminus code.

> Furthermore, what do you mean by "a byte sequence that is not directly representable in UTF-8"? UTF-8 encodes characters as bytes, not bytes as bytes. No byte sequence is inherently non-representable. For example, you can encode any byte sequence in UTF-8 by assuming that it represents a sequence of Latin1-encoded characters, so that the bytes are also the characters' Unicode scalar values. Do you perhaps mean "a byte sequence that isn't already valid UTF-8"?

Yes, that phrasing is more accurate. :)

> I understand that Ruby 1.8 has pretty dismal character encoding support, but there are ways to deal with it. Surely you can do better than just an improved warning and a "don't do that".
>
> At least there is a potential for some user guidance. For example, would the problem be adequately addressed if all manifests and data were encoded in UTF-8 and the agent were ensured to run in a UTF-8-based locale?

Correct on all counts, I think. I'll add that suggestion to the ticket. Ultimately, this needs to be fixed in core...then downstream services (PuppetDB, Foreman, Dashboard, etc.) can all benefit. But certainly, I believe that improved guidelines would help. And the ticket that Chris filed earlier against PuppetDB specifically will at least help users in the interim figure out precisely which resources are giving us trouble.

Cheers,

Deepak

--

Deepak Giridharagopal / Puppet Labs / @grim_radical

David Schmitt

Jun 15, 2012, 3:30:24 AM

to puppet...@googlegroups.com

On 14.06.2012 19:07, Deepak Giridharagopal wrote:
> At least there is a potential for some user guidance. For example,
> would the problem be adequately addressed if all manifests and data
> were encoded in UTF-8 and the agent were ensured to run in a
> UTF-8-based locale?
>
> Correct on all counts, I think. I'll add that suggestion to the
> ticket. Ultimately, this needs to be fixed in core.

JFTR: +1 for recommending/enforcing UTF-8 on manifests. File contents
and data are of course something completely different, but you know that
anyway.
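A check along those lines could be as simple as the Python sketch below, which lists manifest files that are not valid UTF-8. The function name `non_utf8_manifests` is hypothetical, and the directory to scan is an assumption that depends on the local setup; only the `.pp` extension for manifests is taken as given.

```python
from pathlib import Path
from typing import List


def non_utf8_manifests(root: str) -> List[Path]:
    """Return the *.pp files under root whose contents are not valid UTF-8."""
    offenders = []
    for path in sorted(Path(root).rglob("*.pp")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            offenders.append(path)
    return offenders
```

Calling it on a manifests directory returns an empty list when every manifest is clean, which makes it easy to wire into a pre-commit hook or a CI step.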

Best Regards, David
