Question about fetching web pages

주성

Jul 20, 2011, 10:42:49 AM

to Python 3 Q&A Board

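I'm using the urllib.request module to fetch web pages,
but the Korean text comes out garbled.
I've spent three days reading about encoding settings and still can't solve it.

For example, when I run code like this: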

=========================
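import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'chrome/5.0')]
# Note: the second positional argument of open() is `data` (the request body),
# not an encoding, so 'MS949' here does not affect how the page is decoded.
fp = opener.open('http://ko.wikipedia.org/wiki/%EC%95%84%EC%A3%BC%EB%8C%80', 'MS949')
page = fp.read()
print(page)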

===========================

Fetching a Korean-language Wikipedia page and printing it gives the following:

========================================
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html
lang="ko" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">\n<head>
\n<title>\xec\x95\x84\xec\xa3\xbc\xeb\x8c\x80\xed\x95\x99\xea\xb5\x90
- \xec\x9c\x84\xed\x82\xa4\xeb\xb0\xb1\xea\xb3\xbc, \xec\x9a\xb0\xeb
\xa6\xac \xeb\xaa\xa8\xeb\x91\x90\xec\x9d\x98 \xeb\xb0\xb1\xea\xb3\xbc
\xec\x82\xac\xec\xa0\x84</title>\n
... (snipped)
======================================
What exactly are these \xeb\xaa\xa8\xeb characters?

평면우주

Jul 21, 2011, 11:22:52 AM

to Python 3 Q&A Board

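Hello.

The Korean looks garbled because what you are printing is still the raw encoded data. A website sends its text as a multibyte byte sequence (UTF-8 for this page), and read() returns it as bytes, so print shows every non-ASCII byte as a \x escape.
To see the text properly, decode the bytes as UTF-8 with the decode method, like this: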

>>> import urllib.request
>>> opener = urllib.request.build_opener()
>>> opener.addheaders = [('User-agent', 'Mozilla/5.0')]
>>> fp = opener.open('http://ko.wikipedia.org/wiki/%EC%95%84%EC%A3%BC%EB%8C%80')
>>> print(fp.read(500))
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n<html
lang="ko" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">\n<head>
\n<title>\xec\x95\x84\xec\xa3\xbc\xeb\x8c\x80\xed\x95\x99\xea\xb5\x90
- \xec\x9c\x84\xed\x82\xa4\xeb\xb0\xb1\xea\xb3\xbc, \xec\x9a\xb0\xeb
\xa6\xac \xeb\xaa\xa8\xeb\x91\x90\xec\x9d\x98 \xeb\xb0\xb1\xea\xb3\xbc
\xec\x82\xac\xec\xa0\x84</title>\n<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8" />\n<meta http-equiv="Content-Style-
Type" content="text/css" />\n<meta name="generator" content="MediaWiki
1.17wmf1" />\n<link rel="canonical" href="/wiki/%EC%95%84%E'
>>> print(fp.read(500).decode('utf-8'))
C%A3%BC%EB%8C%80%ED%95%99%EA%B5%90" />
<link rel="alternate" type="application/x-wiki" title="편집" href="/w/
index.php?title=%EC%95%84%EC%A3%BC%EB%8C%80%ED%95%99%EA
%B5%90&action=edit" />
<link rel="edit" title="편집" href="/w/index.php?title=%EC%95%84%EC%A3%BC
%EB%8C%80%ED%95%99%EA%B5%90&action=edit" />
<link rel="apple-touch-icon" href="http://ko.wikipedia.org/apple-touch-
icon.png" />
<link rel="shortcut icon" href="/favicon.ico" />
<link rel="search" type="application/opensearchdesc
>>>
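
As a small offline sketch of the same idea (not part of the original session): the escaped bytes in the <title> above are simply the UTF-8 encoding of the page title. One caveat with read(500) is that a fixed byte count can stop in the middle of a multibyte character, in which case calling decode('utf-8') on that piece alone raises UnicodeDecodeError. The helper name decode_in_chunks below is made up for illustration; it relies on the incremental decoder from the standard codecs module to carry a partial character over to the next chunk.

```python
import codecs

# Bytes copied from the start of the <title> in the output above.
raw = b'\xec\x95\x84\xec\xa3\xbc\xeb\x8c\x80\xed\x95\x99\xea\xb5\x90'

# Decoding the complete byte string gives an ordinary str.
title = raw.decode('utf-8')  # '아주대학교'


def decode_in_chunks(data, chunk_size, encoding='utf-8'):
    """Decode bytes piece by piece without breaking multibyte characters.

    Hypothetical helper for illustration; `data` stands in for what
    repeated fp.read(chunk_size) calls would return.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for start in range(0, len(data), chunk_size):
        # The decoder buffers an incomplete trailing character internally.
        parts.append(decoder.decode(data[start:start + chunk_size]))
    # final=True raises if the input ended in the middle of a character.
    parts.append(decoder.decode(b'', final=True))
    return ''.join(parts)


# Each Korean character is 3 bytes here, so 4-byte chunks split characters.
rebuilt = decode_in_chunks(raw, 4)
```

Reading the whole response with fp.read() and decoding once avoids the problem entirely; the chunked version is only worth it when the page is processed piece by piece.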

------ Below is a follow-up question from 주성.

Actually, I've run into one more problem.

The code the author showed me reads only the first 500 bytes, like this:

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'chrome/5.0')]

fp = opener.open('http://ko.wikipedia.org/wiki/%ED%8A%B9%EC%88%98%EA%B8%B0%EB%8A%A5:%EC%9E%84%EC%9D%98%EB%AC%B8%EC%84%9C')
print(fp.read(500).decode('utf-8'))

I want to read the whole page, but when I change the last line as follows, I get an error:
print(fp.read().decode('utf-8'))

The error message is:
Traceback (most recent call last):
  File "C:\Users\justin\workspace\KorDocList\src\main.py", line 14, in <module>
    print(page.decode('utf-8'))
  File "C:\Python30\lib\io.py", line 1494, in write
    b = encoder.encode(s)
UnicodeEncodeError: 'cp949' codec can't encode character '\ufeff' in position 4846: illegal multibyte sequence

This error does not occur when I fetch 'http://daum.net'.
What is going on?

Answer.

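As the error message shows, the failure happens because '\ufeff' cannot be encoded.
Decoding as UTF-8 succeeded; the error is raised when print writes the resulting string to the console, which uses the cp949 encoding. The HTML source read from Wikipedia contains a character (U+FEFF, the byte order mark) that cp949 cannot represent.

The daum page did not raise the error because it contained no such character.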
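
To make the cause concrete, here is a minimal sketch, assuming the console encoding is cp949 as in the traceback. It reproduces the failing encode step without any network access. The helper console_safe is a made-up name: it replaces characters the target encoding cannot represent with '?', which is one possible workaround (writing the text to a file opened with encoding='utf-8' is another).

```python
def console_safe(text, encoding='cp949'):
    """Return text with characters the encoding cannot represent replaced by '?'.

    Hypothetical helper; 'cp949' is assumed from the traceback above.
    """
    return text.encode(encoding, errors='replace').decode(encoding)


# U+FEFF (byte order mark) followed by ordinary Korean text.
page = '\ufeff아주대학교'

try:
    page.encode('cp949')   # the step print() performs on a cp949 console
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False      # U+FEFF has no cp949 representation

safe = console_safe(page)  # the BOM becomes '?', the Korean text is kept
```

With this, print(console_safe(page)) should work on such a console, at the cost of losing the characters cp949 cannot show.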