Can anyone help with the CP conversion script?


Matthew Gerring

Mar 7, 2011, 3:23:07 PM
to cop...@googlegroups.com
I'm getting this error:

Traceback (most recent call last):
  File "cpconvert.py", line 848, in <module>
    main()
  File "cpconvert.py", line 821, in main
    version,stories,images = importStories(verbose)
  File "cpconvert.py", line 679, in importStories
    for line in storiesCSV:
_csv.Error: line contains NULL byte

It looks pretty straightforward, but I don't know a whole lot about Python. Can anybody help?

Daniel Bachhuber

Mar 7, 2011, 5:57:00 PM
to cop...@googlegroups.com
Hey Matt,
I'm not positive, since I'm going off memory, but I think you need to do a find-and-replace on the null bytes in the file. You can replace each one with a space or with nothing.
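
For anyone who would rather script the find-and-replace than do it in an editor, here is a minimal sketch in Python 3. It is not part of cpconvert.py; the function name `strip_nul_bytes` and the chunk size are assumptions made for illustration. It copies the file in fixed-size binary chunks, so a very large CSV never has to fit in memory, and it drops every NUL (0x00) byte along the way.

```python
def strip_nul_bytes(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path, dropping every NUL (0x00) byte.

    Hypothetical helper, not part of cpconvert.py. The file is handled as
    raw bytes, so its text encoding does not matter here.
    Returns the number of NUL bytes that were removed.
    """
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty read means end of file
                break
            removed += chunk.count(b"\x00")
            dst.write(chunk.replace(b"\x00", b""))
    return removed
```

Calling `strip_nul_bytes("stories.csv", "stories_clean.csv")` (both file names are placeholders) writes the cleaned copy and leaves the original export untouched, so the step can be repeated if something goes wrong.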

Daniel

> --
> You received this message because you are a part of CoPress (http://www.copress.org/).
> - To post a message to this group, send email to cop...@googlegroups.com
> - To unsubscribe from this group, send an email to copress+u...@googlegroups.com
> - For more options, visit this group at http://groups.google.com/group/copress
> - Get connected on Twitter http://www.twitter.com/copress or Facebook http://www.facebook.com/copress
>
> http://www.copress.org/

Matthew Gerring

Mar 7, 2011, 6:11:14 PM
to cop...@googlegroups.com
Fixed the null byte issue, but now I'm getting "list index out of range". Most of the data is normalized, but a few of the stories (the ones containing NULL bytes) have really weird characters in them, and there are newlines in the story text that might be throwing it off.

I checked the encoding of the CSV file and it's telling me: Non-ISO extended ASCII HTML document text, with very long lines, with CRLF line terminators.

The error I get is:

Traceback (most recent call last):
  File "cpconvert.py", line 848, in <module>
    main()
  File "cpconvert.py", line 821, in main
    version,stories,images = importStories(verbose)
  File "cpconvert.py", line 722, in importStories
    story = [line[0],line[2],line[3],line[4],line[6],line[7],line[8],line[5]]
IndexError: list index out of range

And here's what the last few lines of text in my CSV before the error look like (in verbose mode, answering "yes" to "would you like to run a test"):

$%?m(?O` "<p>

Q
A
"
?
N<p>

qra?\
i
O[?HD??=?A~r
-/ ?="#?+$?+%<p>


Any ideas?

Miles Skorpen

Mar 7, 2011, 6:15:45 PM
to copress
I haven't dealt with this in well over a year, but it looks like a story is screwing up the CSV parsing: it thinks that "$%?m(?O`        "<p>" is the title, "Q" is the author, etc., and then the line ends before the script expects it to, causing the error. I'd find that line in your CSV and remove or replace it. Ideally, look closely and try to figure out why the CSV parser would choke on it (are there lots of commas and quotation marks? That's a recipe for issues, though I believe the script should automatically escape a lot of them) and then do a general find-and-replace throughout the whole file.
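
One way to locate such rows without scrolling through the file is sketched below in Python 3. It is not taken from cpconvert.py: the function name `find_short_rows` is hypothetical, the default of 9 fields is inferred from the traceback (which indexes up to `line[8]`), and `latin-1` is only a guess at the encoding, chosen because it can decode any byte without raising. NUL bytes should be removed first, since some Python versions refuse to parse them.

```python
import csv


def find_short_rows(csv_path, min_fields=9, encoding="latin-1"):
    """List rows that have fewer fields than the converter indexes.

    Hypothetical helper. Returns (line_number, field_count) pairs, where
    line_number is the physical line on which the parsed row ends, which
    is usually close enough to jump to in a text editor.
    """
    short = []
    # newline="" lets the csv module handle embedded line breaks itself.
    with open(csv_path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        for row in reader:
            if len(row) < min_fields:
                short.append((reader.line_num, len(row)))
    return short
```

A row count check like this only catches rows that come up short. A mangled story that happens to split into nine or more fields would still pass, so the reported lines are a starting point for inspection rather than a complete list.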

Matthew Gerring

Mar 7, 2011, 6:21:32 PM
to cop...@googlegroups.com
I figured it out: it's mangled text that results from pasting out of a Word document into the College Publisher editor.

Behold:


OO 5 0 @ G à Times New Roman5 à éSymbol3 à Arial3 à Times"" 1 Ü - h jf jf E † # ! * ® ùùé 0K ` By Daniel Lopez JMC JMC <p>

ÿôÅU Oh 'â +' Y0X Ö à " - ÿ I _<p>

<p>

, 8 @ H P ' By Daniel Lopez rd y D JMC MC Normale JMC 1C Microsoft Word 8.0d@@& aÖj™ @& aÖj™ E † <p>

'O'£. ç¢ +, íD 'O'£. ç¢ +, í< _ h p | • Ü å £ " ù<p>

ß Y ' an # K : By Daniel Lopez Title ï 6 > <p>

_PID_GUID 'AN{3AB27300-D641-11D6-B322-0005020E1577} <p>

<p>

!""# %&'()*+ -./0123 6 Root Entry ®Fÿ+ Mj™ 8é1Table WordDocument 6 SummaryInformation( $ DocumentSummaryInformation8 , CompObj XObjectPool ÿ+ Mj™ ÿ+ Mj™ ®F Microsoft Word Document NB6W Word.Document.8","Daniel Lopez, Daily Staff Writer"
>.<

Manual search and replace would be fine if my computer could handle a million-line text file, but it's having a little bit of trouble with that. Any suggestions?


Miles Skorpen

Mar 7, 2011, 6:25:10 PM
to copress, Matthew Gerring
Open the file, then copy and paste it into a collection of smaller text files. Do the find-and-replace on each one, then merge them back together.

Miles

Daniel Bachhuber

Mar 7, 2011, 7:34:17 PM
to cop...@googlegroups.com
Which text editor are you using? I believe a free trial of BBEdit for the Mac will work for your needs. Otherwise, maybe vim or nano on the command line?

Daniel Bachhuber

Mar 7, 2011, 7:36:10 PM
to cop...@googlegroups.com
Also, if you can, you may want to skip the stories with junk data anyway and reimport them manually. I don't think it's worth corrupting your archives with junk data and causing trouble for the next migration, etc.
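
Setting the junk stories aside can also be scripted. The sketch below, in Python 3, is one possible approach and not something cpconvert.py does: the function name `split_rows`, the 9-field threshold (taken from the `line[8]` index in the traceback), and the `latin-1` encoding are all assumptions. It writes rows that look complete to one file and everything else to a second file for manual reimport.

```python
import csv


def split_rows(src_path, good_path, rejected_path,
               min_fields=9, encoding="latin-1"):
    """Separate complete-looking rows from short ones.

    Hypothetical helper. Rows with at least min_fields fields go to
    good_path; the rest go to rejected_path so they can be reviewed and
    re-entered by hand. Returns (kept, rejected) row counts.
    """
    kept = rejected = 0
    with open(src_path, newline="", encoding=encoding) as src, \
         open(good_path, "w", newline="", encoding=encoding) as good, \
         open(rejected_path, "w", newline="", encoding=encoding) as bad:
        good_writer = csv.writer(good)
        bad_writer = csv.writer(bad)
        for row in csv.reader(src):
            if len(row) >= min_fields:
                good_writer.writerow(row)
                kept += 1
            else:
                bad_writer.writerow(row)
                rejected += 1
    return kept, rejected
```

Two caveats: the rows are rewritten with Python's default CSV quoting, which may differ from the original export, and a field count alone will not catch a junk story that still parses into enough fields. It is worth spot-checking both output files before running the conversion.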

Matthew Gerring

Mar 7, 2011, 7:43:05 PM
to cop...@googlegroups.com
Did it with TextMate. I removed the offending stories from the CSV and pasted them elsewhere so they can be manually re-inserted. Thank you guys so much for developing this. Now that we've got the archives, we'll be able to get back off of College Publisher next semester. Awesomesauce!

Now, does anybody have ideas about hosting that will address the journalism department's concerns? I offered to put up my own money for a VPS but they're not having that, and I don't want to run WordPress on the Windows machines we have here.

-Matthew

Andrew Spittle

Mar 7, 2011, 7:44:36 PM
to cop...@googlegroups.com
We had really good success with WebFaction when CoPress was running. It's also where I have my personal site hosted. Pretty solid.

-- 
Andrew Spittle | andrewspittle.net

Matthew Gerring

Mar 7, 2011, 7:49:22 PM
to cop...@googlegroups.com
Ouch- almost got there, but then I got a message from WordPress 3.0 that the files produced by the script are not valid WXR. Says in the file that the generator is WordPress 2.7.1.

Daniel Bachhuber

Mar 7, 2011, 7:54:26 PM
to cop...@googlegroups.com
Did it fail uploading, or just give you the error? As far as I can remember, we just arbitrarily wrote the generator number.

If it still doesn't work, do you mind doing a comparison between the file the script generated and the newer WXR files? I can help improve it this weekend or early next week
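
A rough way to do that comparison is sketched below in Python 3. It assumes only that both files are RSS-style XML with a `channel` element under the root, which is how WXR exports are laid out; the function names are hypothetical, and the sketch says nothing about which elements the importer actually requires. It checks that each file is well-formed and reports which top-level channel elements appear in one file but not the other.

```python
import xml.etree.ElementTree as ET


def channel_child_tags(path):
    """Return the set of element names directly under <channel>.

    Hypothetical helper. ET.parse raises ParseError if the file is not
    well-formed XML, which is itself a useful first check. Namespaced
    elements come back as "{namespace-uri}name".
    """
    root = ET.parse(path).getroot()
    channel = root.find("channel")
    if channel is None:
        raise ValueError("no <channel> element found under the root")
    return {child.tag for child in channel}


def compare_exports(generated_path, reference_path):
    """Return (missing, extra): tags only in the reference export,
    and tags only in the generated file."""
    generated = channel_child_tags(generated_path)
    reference = channel_child_tags(reference_path)
    return sorted(reference - generated), sorted(generated - reference)
```

Because namespaced tags include the full namespace URI, a difference in the declared namespace between the two files would also show up as a mismatch here.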

Andrew Nacin

Mar 7, 2011, 7:54:51 PM
to cop...@googlegroups.com

On Mar 7, 2011 7:49 PM, "Matthew Gerring" <beat...@gmail.com> wrote:
>
> Ouch- almost got there, but then I got a message from WordPress 3.0 that the files produced by the script are not valid WXR. Says in the file that the generator is WordPress 2.7.1.

What version of the WordPress Importer plugin are you running? Be sure you're on 0.3 and preferably running PHP 5.2. If that still doesn't work, try adding this to wp-config.php, for expanded error reporting:

define('IMPORT_DEBUG', true);

Nacin

Miles Skorpen

Mar 7, 2011, 7:55:25 PM
to copress
I'm not familiar with the differences. You may want to install 2.7, add the stories, then update the site.

Miles

Daniel Bachhuber

Mar 7, 2011, 7:57:10 PM
to cop...@googlegroups.com
Also, I forgot to mention this, but there are two hosting options I'd consider if you have any budget: http://page.ly/ and http://wpengine.com/. Both offer WordPress-specific support, so I think they're better suited for student publications.


Matthew Gerring

Mar 7, 2011, 8:08:09 PM
to cop...@googlegroups.com
Great news- upgrading to WordPress 3.1 fixed it. Also, Daniel, you have a user account in this database. Such a small world!

Adam Hemphill

Mar 7, 2011, 8:55:33 PM
to cop...@googlegroups.com
College Publisher now offers WordPress hosting, don'tyaknow!

Juuust kidding. WebFaction is great for flexible managed hosting and support, though the WP-specific options might placate the higher-ups a bit more.