PDF Previews and StaticSync


Maria T

Nov 12, 2012, 8:40:54 AM
to resour...@googlegroups.com
Hi,

I have been trying out staticsync to import PDFs from our server. Previews do not seem to be generated when the file name contains an umlaut (ü/ä/ö). These names only display correctly in RS when the charset is set to iso-8859-1 instead of UTF-8; otherwise the umlaut looks like this: �. When I upload the same files using plupload, previews are generated and the umlauts are displayed correctly.
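The symptom described above is what one would expect if the file names on disk are stored as ISO-8859-1 bytes while the page is served as UTF-8. The following is only a minimal sketch of that assumption, not a confirmed diagnosis of this installation; the file name is illustrative.

```python
# Assumption: the name on disk was written as ISO-8859-1 (Latin-1) bytes.
name = "Jüdischer.pdf"
latin1_bytes = name.encode("iso-8859-1")  # ü becomes the single byte 0xFC
utf8_bytes = name.encode("utf-8")         # ü becomes the two bytes 0xC3 0xBC

# Read as UTF-8, the lone 0xFC byte is invalid and is shown as U+FFFD.
shown_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")
# Read as ISO-8859-1, the same bytes decode cleanly.
shown_as_latin1 = latin1_bytes.decode("iso-8859-1")

print(shown_as_utf8)    # J�discher.pdf
print(shown_as_latin1)  # Jüdischer.pdf
```

If this is what is happening, the names only look right with the charset set to iso-8859-1 because that setting matches the bytes actually stored.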

Does anyone know where the problem may lie? 

Any pointers/tips appreciated, Maria

Maria T

Nov 14, 2012, 10:04:24 AM
to resour...@googlegroups.com
So to summarise what I have found out so far:
- changed the settings so that everything is now set to UTF-8 (checked by running the query: show variables like 'char%';).
- re-installed ResourceSpace via svn and created a new database. Everything now displays correctly in phpMyAdmin and when running queries from the command line.
- re-ran staticsync. Instead of displaying d�w177, it now 'breaks' when it encounters an umlaut and displays only the first letter, 'd'. No preview and, unlike before, no PDFs either. All other files (i.e. those without umlauts) have been imported correctly.
- interestingly, if the collection has an umlaut in its name (e.g. Jüdischer), it also breaks: the files in that folder, even those without special characters, are not imported and previews are not created, even though the original filename and title are intact.
- tried uploading files via plupload, which is not working, so I have reverted to the Java uploader. Files upload correctly, with or without umlauts, and with previews.
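The truncation to a single letter would be consistent with a string being cut at the first byte that is not valid UTF-8, which is how MySQL is known to treat invalid input in non-strict SQL mode. The sketch below only imitates that behaviour in Python under the assumption that the name arrives as ISO-8859-1 bytes; `valid_utf8_prefix` is a hypothetical helper, not part of ResourceSpace or MySQL.

```python
def valid_utf8_prefix(raw: bytes) -> str:
    """Return the longest leading part of raw that decodes as UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that failed to decode.
        return raw[:err.start].decode("utf-8")


latin1_name = "Jüdischer".encode("iso-8859-1")  # assumed on-disk encoding
utf8_name = "Jüdischer".encode("utf-8")

print(valid_utf8_prefix(latin1_name))  # J
print(valid_utf8_prefix(utf8_name))    # Jüdischer
```

Under the same assumption, a name whose second character is a Latin-1 umlaut would be reduced to its first letter, as in the 'd' case above.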

There must be a difference in the way that upload and staticsync import files, but I don't have the PHP knowledge to figure it out for myself. Are filenames encoded differently, and would this have an impact on staticsync? I noticed that when I am importing files, umlauts do not display correctly in the terminal.
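One way to answer the question about filename encoding is to list every file and folder whose name is not valid UTF-8. This is a minimal sketch; `non_utf8_names` is a hypothetical helper, and the folder used in the example is only a guess at the sync directory and should be replaced with the real one.

```python
import os


def non_utf8_names(root: bytes):
    """Yield paths under root whose file or folder name is not valid UTF-8."""
    # Walking with a bytes path makes os.walk return raw, undecoded names.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                name.decode("utf-8")
            except UnicodeDecodeError:
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Assumed location of the staticsync folder; adjust as needed.
    for path in non_utf8_names(b"/var/www/Analysis"):
        print(path)
```

Any path printed here was most likely written in a legacy encoding such as ISO-8859-1, which would also explain why the umlauts look wrong in a UTF-8 terminal.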

If anyone could replicate this, that would be awesome. I am running RS on Debian.

Many Thanks,
Maria

Maria T

Nov 20, 2012, 6:56:18 AM
to resour...@googlegroups.com
Apologies for all the updates but I have made some interesting discoveries over the past week which may help resolve this issue (I may have reached the limit of my abilities with this one):

mysql_charset in config.php
- mysql_charset needs to be set in config.php before the database is created for multibyte characters to be stored correctly as UTF-8.
- If I comment out mysql_charset before starting staticsync, umlauts are replaced by � in the web browser, but as soon as I re-activate mysql_charset in the config, they display correctly! To summarise, with mysql_charset set to UTF-8, staticsync breaks and truncates the name when it encounters an umlaut. By commenting it out before the sync and un-commenting it afterwards, files are imported correctly (with ingest set to false), albeit without previews.

ghostscript and staticsync
- PDF previews are generated when files are uploaded with the Java uploader because the file path is scrambled. However, when files are imported via staticsync, preview generation breaks when it encounters an umlaut. I am assuming the problem lies with RS (or perhaps my installation of RS), as Ghostscript is supposed to support UTF-8 (or at least that's what I read; I have not tested this).

Comparing the debug output, with mysql_charset="UTF8" set to on vs off, it breaks at the following points:
- SQL: select file_path value from resource where ref='1' etc
- SQL: update resource set preview_tweaks (i.e. just before GS is supposed to start)
- it seems to think file source is: filestore/1_[etc]/1_[etc].pdf (i.e. that the file has been ingested, when in actual fact there is nothing in that folder)

With mysql_charset commented out, GS breaks when trying to create previews, e.g.:
- 'PDF multi page preview [....] /var/www/Analysis/hpihh/dradio.pdf', even though it recognises that: 'file source is var/www/Analysis/hpäihh/dradio.pdf'
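The two paths in the debug line differ only by the missing 'ä', which looks as if non-ASCII characters are being discarded somewhere between reading the path and calling Ghostscript. One possible cause, not confirmed for this installation, is that PHP's escapeshellarg() is known to drop such characters when the process locale (LC_CTYPE) is not a UTF-8 locale. The sketch below only imitates that effect in Python; `drop_non_ascii` is a hypothetical stand-in, not the real escaping function.

```python
def drop_non_ascii(path: str) -> str:
    """Imitate a sanitiser that silently discards every non-ASCII character."""
    return path.encode("ascii", errors="ignore").decode("ascii")


source = "/var/www/Analysis/hpäihh/dradio.pdf"
passed_to_gs = drop_non_ascii(source)

print(passed_to_gs)  # /var/www/Analysis/hpihh/dradio.pdf
```

The result points at a folder that does not exist, so Ghostscript would fail to open the file even though the original path was read correctly.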

I was wondering if anyone could shed any light on what may be happening here? 

Many Thanks,
Maria

Jeff Harmon

Nov 20, 2012, 1:27:46 PM
to resour...@googlegroups.com, resour...@googlegroups.com
Probably a stupid question, but are all your tables in MySQL set to UTF as well?

Just curious. 

J

--
Jeff Harmon
Chief Executive Officer
Colorhythm LLC

Main Office:  +1 415-399-9921
Mobile:  +1 510-710-9590


Maria T

Nov 21, 2012, 4:19:09 AM
to resour...@googlegroups.com
No, not a stupid question at all. Everything is set to UTF-8. When I run the query - show variables like 'char%'; - I get the following result:

character_set_client      utf8
character_set_connection  utf8
character_set_database    utf8
character_set_filesystem  binary
character_set_results     utf8
character_set_server      utf8
character_set_system      utf8
character_sets_dir        /usr/share/mysql/charsets/

I have also exported the table, and the umlauts display correctly and are encoded in UTF-8. When I run staticsync, theme categories are also imported correctly into the database, but only if mysql_charset is turned off in the config; otherwise the name is truncated when it encounters a special character.

Maria 
