Issue 179 in couchdb-python: couchdb-dump cannot deal with unicode characters in doc ids


couchdb...@googlecode.com

May 14, 2011, 3:59:08 AM
to couchdb...@googlegroups.com
Status: New
Owner: ----
Labels: Type-Defect Priority-Medium

New issue 179 by heshim...@gmail.com: couchdb-dump cannot deal with unicode
characters in doc ids
http://code.google.com/p/couchdb-python/issues/detail?id=179

What steps will reproduce the problem?
1. Create a document in CouchDB whose id contains Chinese characters, such as "文档".
2. Run couchdb-dump on the database.

What is the expected output? What do you see instead?
couchdb-dump crashes upon reaching this document. Here are the last lines
of the trace:
  File "/pylonsenv/lib/python2.6/site-packages/couchdb/multipart.py", line 122, in __init__
    self._write_headers(headers)
  File "/pylonsenv/lib/python2.6/site-packages/couchdb/multipart.py", line 175, in _write_headers
    self.fileobj.write(headers[name])
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

What version of the product are you using? On what operating system?
couchdb-python 0.8 against couchdb 1.0.1 on Ubuntu.
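
The failure in the trace above can be approximated in isolation. This is a minimal sketch in Python 3, not code from couchdb-python: it assumes the crash comes from Python 2 implicitly encoding a unicode value with the 'ascii' codec when writing it to a file, and the helper name `encode_like_python2_file_write` is hypothetical.

```python
# Sketch (Python 3): emulate the implicit ASCII encoding that Python 2
# applies when a unicode string is written to a byte-oriented file.
doc_id = "文档"  # hypothetical document id containing non-ASCII characters


def encode_like_python2_file_write(value):
    """Encode text with the 'ascii' codec, as Python 2 does implicitly."""
    return value.encode("ascii")


try:
    encode_like_python2_file_write(doc_id)
    error = None
except UnicodeEncodeError as exc:
    # Same exception class as in the reported traceback.
    error = exc

print(type(error).__name__, error)
```

An id made only of ASCII characters passes through the same helper without error, which would explain why the problem only shows up for some documents.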



couchdb...@googlecode.com

May 14, 2011, 4:21:19 AM
to couchdb...@googlegroups.com

Comment #1 on issue 179 by heshim...@gmail.com: couchdb-dump cannot deal

I just needed a quick solution to dump the database and reload it in
another environment, so I made some changes to multipart.py to get past
this UTF-8 problem. It worked.

However, I understand that other parts use multipart.py too, and this
change probably doesn't conform to the MIME standard. If I have time, I'll
investigate further and provide a patch that does satisfy the MIME standard.

Attachments:
utf-8_dump_load.patch 1.1 KB

couchdb...@googlecode.com

May 14, 2011, 4:25:21 AM
to couchdb...@googlegroups.com

Comment #2 on issue 179 by kxepal: couchdb-dump cannot deal with unicode

Confirmed. There is also an invalid test case for how the multipart module
handles unicode data: StringIO can handle mixed "str" and "unicode" values,
but files accept only "str".

couchdb...@googlecode.com

May 14, 2011, 5:21:33 AM
to couchdb...@googlegroups.com

Comment #3 on issue 179 by kxepal: couchdb-dump cannot deal with unicode

Sorry, I was wrong about the tests; StringIO confused me. (: Don't rush, sit
down and think first... yes. (:
There is no need to fix the multipart module, only the dump tool, because it
passes a unicode document id to the multipart writer. That is what
dump-tool.patch does.

dump-tool-2.patch solves the same problem, but respects the Content-Type
header and its charset. I suppose that would be the more correct solution.

Attachments:
dump-tool.patch 648 bytes
dump-tool-2.patch 1.5 KB

couchdb...@googlecode.com

May 14, 2011, 6:43:51 AM
to couchdb...@googlegroups.com

Comment #4 on issue 179 by heshim...@gmail.com: couchdb-dump cannot deal

Ah, that's much smarter. Thanks!

couchdb...@googlecode.com

May 14, 2011, 6:48:51 AM
to couchdb...@googlegroups.com

Comment #5 on issue 179 by heshim...@gmail.com: couchdb-dump cannot deal

Hmm... another thing. I was under the impression that UTF-8 encoded strings
aren't valid ASCII. Currently, isn't multipart.py expecting strict ASCII
strings as headers?

couchdb...@googlecode.com

May 14, 2011, 7:14:56 AM
to couchdb...@googlegroups.com

Comment #6 on issue 179 by kxepal: couchdb-dump cannot deal with unicode

Actually, only the first 128 code points are encoded identically in UTF-8
and ASCII. The problem was not which characters appear in the headers, but
the type of string that multipart tries to write to the output stream.
Files and streams don't expect unicode strings; they want what Python 3
calls "bytes", and the multipart module expects that as well.

But there was a "hack" that adds the document id to the headers, which the
couchdb-load tool uses to create the document with the same id. Since a
document id can be unicode, this "hack" breaks that expectation and makes
multipart crash.

You could try reverting the patch and changing the default value of the
output argument of the dump_db function in dump.py from sys.stdout to
StringIO.StringIO; the error would not occur, because StringIO can handle
both str and unicode values.
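
The difference between the two kinds of stream can be sketched as follows. This is an illustration in Python 3 rather than the Python 2 environment of the report, so the byte stream raises TypeError instead of UnicodeEncodeError; the variable names are hypothetical and nothing here is taken from dump.py or multipart.py.

```python
import io

header_value = "文档"  # hypothetical non-ASCII document id

# A text stream accepts text directly, so the problem stays hidden.
text_stream = io.StringIO()
text_stream.write(header_value)

# A byte stream rejects text. Python 3 raises TypeError here, whereas
# Python 2 attempted an implicit ASCII encode and raised UnicodeEncodeError.
byte_stream = io.BytesIO()
try:
    byte_stream.write(header_value)
    rejected = False
except TypeError:
    rejected = True

# Encoding explicitly produces bytes that the byte stream accepts.
byte_stream.write(header_value.encode("utf-8"))
```

Encoding the value before it reaches the writer avoids the crash, but raw UTF-8 bytes in a header are a separate question of MIME conformance.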

couchdb...@googlecode.com

May 14, 2011, 8:25:10 AM
to couchdb...@googlegroups.com

Comment #7 on issue 179 by djc.ocht...@gmail.com: couchdb-dump cannot deal

IMO the correct way to have non-ASCII strings in MIME headers would be to
use RFC 2047 encoding for any non-ascii header values.
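
For illustration, RFC 2047 encoding of a header value is available in Python's standard `email.header` module. The sketch below uses Python 3 and a hypothetical document id; it shows the encoding itself, not how couchdb-python applies it.

```python
from email.header import Header, decode_header, make_header

doc_id = "文档"  # hypothetical non-ASCII document id

# Encode as an RFC 2047 "encoded-word"; the result is pure ASCII.
encoded = Header(doc_id, "utf-8").encode()

# Decode the header value back to the original text.
decoded = str(make_header(decode_header(encoded)))

print(encoded)
print(decoded)
```

Because the encoded form contains only ASCII characters, it can be written to a byte-oriented stream without triggering the error from the original report, and a reader can recover the id with `decode_header`.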

couchdb...@googlecode.com

May 14, 2011, 8:51:15 AM
to couchdb...@googlegroups.com

Comment #8 on issue 179 by kxepal: couchdb-dump cannot deal with unicode

Correct, but that looks like overhead in this case, because it would apply
to only one header while the others would still follow RFC 822. Wouldn't it
be better to use base64 encoding?

couchdb...@googlecode.com

Jun 2, 2011, 2:47:19 AM
to couchdb...@googlegroups.com

Comment #9 on issue 179 by heshim...@gmail.com: couchdb-dump cannot deal

Hmm... I'd like to make a note here that kxepal's dump-tool-2.patch
actually generated some invalid multipart boundaries.

couchdb...@googlecode.com

Sep 21, 2012, 4:33:25 AM
to couchdb...@googlegroups.com
Updates:
Owner: kxepal

Comment #10 on issue 179 by djc.ochtman: couchdb-dump cannot deal with
(No comment was entered for this change.)

couchdb...@googlecode.com

Sep 21, 2012, 8:45:28 PM
to couchdb...@googlegroups.com
Updates:
Labels: Milestone-0.9

Comment #11 on issue 179 by wickedg...@gmail.com: couchdb-dump cannot deal

couchdb...@googlecode.com

Oct 22, 2012, 7:27:37 AM
to couchdb...@googlegroups.com

Comment #12 on issue 179 by djc.ochtman: couchdb-dump cannot deal with
Any progress on this?

couchdb...@googlecode.com

Oct 22, 2012, 7:33:14 AM
to couchdb...@googlegroups.com

Comment #13 on issue 179 by kxepal: couchdb-dump cannot deal with unicode
Yes, I will submit a patch with tests this week. I agree with you about
RFC 2047, so I'm diving into the specification.

couchdb...@googlecode.com

Apr 24, 2013, 1:20:24 PM
to couchdb...@googlegroups.com
Updates:
Status: Accepted

Comment #14 on issue 179 by kxepal: couchdb-dump cannot deal with unicode
Patch attached. Non-ASCII headers are now encoded following RFC 2047.
Actually, I'd like to rewrite the multipart module on top of the email
package, but that is probably a separate issue: some email-specific
features would need workarounds to keep backward compatibility.

Attachments:
couchdb-python_485.patch 3.9 KB
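
The approach described in this comment might look roughly like the sketch below. It is not the attached patch: the function `write_headers`, the header names, and the Python 3 idioms are assumptions made for illustration only.

```python
import io
from email.header import Header


def write_headers(fileobj, headers):
    """Write header lines to a byte stream, RFC 2047-encoding non-ASCII values."""
    for name, value in sorted(headers.items()):
        try:
            value.encode("ascii")  # plain ASCII values are written unchanged
        except UnicodeEncodeError:
            value = Header(value, "utf-8").encode()
        fileobj.write(("%s: %s\r\n" % (name, value)).encode("ascii"))
    fileobj.write(b"\r\n")  # blank line terminates the header section


out = io.BytesIO()
# Hypothetical headers: one ASCII value and one non-ASCII document id.
write_headers(out, {"Content-ID": "文档", "Content-Type": "application/json"})
print(out.getvalue().decode("ascii"))
```

Only the values that actually need it are encoded, so existing ASCII headers keep their current form.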


couchdb...@googlecode.com

Apr 25, 2013, 6:09:49 AM
to couchdb...@googlegroups.com
Updates:
Status: Fixed

Comment #16 on issue 179 by djc.ochtman: couchdb-dump cannot deal with
Pushed a slightly changed patch as rce40fd77ae8d, thanks!

couchdb...@googlecode.com

Apr 25, 2013, 7:17:56 AM
to couchdb...@googlegroups.com
Updates:
Labels: -Milestone-0.9

Comment #17 on issue 179 by djc.ochtman: couchdb-dump cannot deal with
(No comment was entered for this change.)
