U+200B to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252


Scott Gardner

Mar 19, 2020, 4:16:07 AM
to Prawn
Hi there,

I'm trying to generate a simple PDF from a CSV in Ruby. I do some basic parsing, pick out a couple of cells that matter to me, and do some math to average some values. Nothing fancy. I only use `text`, `table`, and `move_down` a few times, and the PDFs are only two pages long. However, Prawn repeatedly fails to convert a zero-width space character (U+200B, which I think is what the error in the subject line refers to) a casual 4300 times. I'm pretty sure this is directly related to how many empty cells there are in the CSV, but I'm not calling `pdf.text` anywhere near that many times. My questions are:

- Where and why is Prawn repeatedly trying to encode the zero-width space? Or choking on it?

- Is there something I'm missing about default PDF generation that's forcing the encoding to be WINDOWS-1252? Which I assume can't handle the U+200B character? Can I avoid it?

- Are there unsurprising ways to strip the character from the CSV? I'm looking into doing this currently...

Thanks,

Scott

Alexander Mankuta

Mar 19, 2020, 4:24:40 AM
to Prawn
Hi Scott,

I assume you use default PDF fonts. Those fonts only support Windows-1252 encoding. Prawn will try converting your text to that encoding when you use default fonts.
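
To illustrate why the conversion fails, here is a minimal plain-Ruby sketch that does not involve Prawn at all. The helper name `win1252_compatible?` is made up for this example. It only checks whether a string can be transcoded to Windows-1252, which is roughly the constraint the built-in fonts impose.

```ruby
# Hypothetical helper (not part of Prawn): returns true if every character
# in the string has a Windows-1252 equivalent, false otherwise.
def win1252_compatible?(text)
  text.encode("Windows-1252")
  true
rescue Encoding::UndefinedConversionError
  false
end

puts win1252_compatible?("caf\u00e9")     # "é" exists in Windows-1252
puts win1252_compatible?("foo\u200bbar")  # U+200B has no Windows-1252 mapping
```

Running a check like this over your parsed cells before handing them to Prawn should show which values carry the offending character.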

You have a few options:
  • Make sure your text is compatible with the Windows-1252 encoding.
  • Use an external font that supports Unicode and provides glyphs for all the characters you use. Most fonts do support ZWS, as it is essentially invisible and doesn't require much effort from font designers.
You can also remove the character manually (e.g. my_string.gsub("\u200b", '')), but be careful: it is still a space, and it can be used as a break point when a string is wrapped onto multiple lines, for example.
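
As a rough sketch of the stripping approach, assuming the rows have already been parsed into arrays of strings (for example with Ruby's CSV library) and that dropping the character outright is acceptable for your layout. `strip_zws`, `ZERO_WIDTH_SPACE`, and the sample `rows` are names and data invented for this example.

```ruby
ZERO_WIDTH_SPACE = "\u200b"

# Hypothetical helper: remove every zero-width space from a cell.
# nil cells (empty CSV fields) are passed through unchanged.
def strip_zws(cell)
  cell&.delete(ZERO_WIDTH_SPACE)
end

# Sample data standing in for parsed CSV rows.
rows = [["name", "sco\u200bre"], ["\u200b", nil]]

# Clean every cell of every row before passing the data on to Prawn.
cleaned = rows.map { |row| row.map { |cell| strip_zws(cell) } }
p cleaned
```

`String#delete` removes all occurrences, so a cell containing several zero-width spaces is cleaned in one pass.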

--
Alex

Scott Gardner

Mar 19, 2020, 1:20:13 PM
to Prawn
Hi Alex,

Thanks for your response! I did some digging, and although I was attempting to force UTF-8 encoding, Prawn was overriding it in prawn-2.2.2/lib/prawn/font/afm.rb (`normalize_encoding`). I supplied my own font, as you suggested, using `pdf.font("./Roboto-Regular.ttf")`, which seems to have fixed the problem.

Thank you!