Information disappears while sving to mysql

49 views
Skip to first unread message

gulan

unread,
Jul 17, 2012, 5:10:12 AM7/17/12
to scrapy...@googlegroups.com
Hi, 

I'm using the following xpath expression to scrape a couple of sites:

x.select('//div[@id="text"]/p | //div[@id="text"]/blockquote/p').extract()

When I save the output to json I retrieve all the information I want, but when I save to mysql using a pipeline not all of the information that is retrieved by the expression is inserted in the db. This problem only occurs, it seems, when I use the '|' operator in my expression. Does anyone now why this is? 

Here's my pipeline:

class MySQLStorePipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='user',
                           passwd='',
                           db='',
                           host='local',
                           charset="utf8",
                           use_unicode="True"
)
self.cursor = self.conn.cursor()
self.conn.set_character_set('utf8')
    
    def process_item(self, item, spider):                 
        
for ite in item['titel']:
titel = ite
for itea in item['desc']:
desc = itea
desc = re.escape(desc)
link = item['link']
sql = "INSERT INTO articles (link, title, content) \
VALUES ('%s','%s','%s')" % \
(link, titel, desc)
try:
self.cursor.execute(sql)
self.cursor.execute('SET NAMES utf8;')
self.cursor.execute('SET CHARACTER SET utf8;')
self.cursor.execute('SET character_set_connection=utf8;')
            
self.conn.commit()
            

except MySQLdb.Error, e:
print ("MySQL Error -------------------------------------------------------")
print (e)
print ("-------------------------------------------------------------------")

return item


All help much appreciated!






 

Gustav Julander

unread,
Jul 20, 2012, 2:37:34 PM7/20/12
to scrapy...@googlegroups.com
Does anyone recognize this problem? 

Any help much appreciated!

2012/7/17 gulan <gustav....@gmail.com>






 

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To view this discussion on the web visit https://groups.google.com/d/msg/scrapy-users/-/drFVT0pS818J.
To post to this group, send email to scrapy...@googlegroups.com.
To unsubscribe from this group, send email to scrapy-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/scrapy-users?hl=en.

ScrapMe

unread,
Jul 21, 2012, 2:32:35 PM7/21/12
to scrapy...@googlegroups.com
Hi gulan,

I'm a little confused about your code (which could be because of formatting).

In your SQl statement, link, titel, desc. are all single values correct?
They all contain only one value which will then be inserted into the database?
The SQL statement only executes once and that is it?


sql = "INSERT INTO articles (link, title, content) \
VALUES ('%s','%s','%s')" % \
(link, titel, desc)

If they are one value, I don't understand why you have these loops?

for ite in item['titel']:
titel = ite
for itea in item['desc']:
desc = itea
desc = re.escape(desc)

Also, I have used SQL statements without the '' in tuple following VALUES:
sql = "INSERT INTO table_name(col_1, col_2, col_3) VALUES (%s, %s, %s)"

Again, it hard to understand your code.

Any help much appreciated!

2012/7/17 gulan <gustav....@gmail.com>
To unsubscribe from this group, send email to scrapy-users+unsubscribe@googlegroups.com.

Gustav Julander

unread,
Jul 22, 2012, 9:56:15 AM7/22/12
to scrapy...@googlegroups.com

Hi,

Thank's for taking the time to answer my question!

I can understand your confusion regarding the for loop. Yes, its single values being inserted, but my spider returns each item as à tuple. Probably not ideal, but thats the reason I need the loop. Maybe this is where my problem comes from?

Also I will try the insert statement without the "" . It will be a while before I can test this, but I will let you know!
> To view this discussion on the web visit https://groups.google.com/d/msg/scrapy-users/-/20VA9jJ_mWAJ.

> To post to this group, send email to scrapy...@googlegroups.com.
> To unsubscribe from this group, send email to scrapy-users...@googlegroups.com.

Steven Almeroth

unread,
Jul 23, 2012, 2:19:21 PM7/23/12
to scrapy...@googlegroups.com
I am confused by these for loops as well, for example:

    for ite in item['titel']:
        titel = ite

This loop is functionaly equivelent to a direct assignment:

    titel = item['titel'][-1]

It does not matter how many elements the list has, at the end of looping thru them titel will contain the value of the last one.

Let's try some sample code:

    >>> item = dict(titel=['one', 'two', 'three'])
    >>> for ite in item['titel']:
    ...     titel = ite
    ...
    >>> titel
    'three'

Can you provide sample item data for us?

Gustav Julander

unread,
Jul 25, 2012, 1:07:02 PM7/25/12
to scrapy...@googlegroups.com
Hi Steven,

Thank you for your input! Unfortunately I will not be able to provide sample until next friday. I will post it then.
> --
> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
> To view this discussion on the web visit https://groups.google.com/d/msg/scrapy-users/-/tfD88Uo76IgJ.

Gustav Julander

unread,
Aug 3, 2012, 11:33:31 AM8/3/12
to scrapy...@googlegroups.com
Hi, 

Sorry for late response, but here's some sample output when I use the spider quoted in the original question. As I wrote before, I don't seem to be able to access the 'title' and the 'desc' items without using the for loop.  

[{"link": "http://dinsvenska.se/ordformer/sager-man-jag-jobbar-i-hamnen-eller-jag-jobbar-pa-hamnen/", "titel": ["S\u00e4ger man \u201djag jobbar i hamnen\u201d eller \u201djag jobbar p\u00e5 hamnen\u201d?"], "desc": ["<div class=\"cat\"><p><a href=\"http://dinsvenska.se/category/ordfoljd/\" title=\"Se alla inl\u00e4gg i Ordf\u00f6ljd\" rel=\"category tag\">Ordf\u00f6ljd</a>, <a href=\"http://dinsvenska.se/category/ordformer/\" title=\"Se alla inl\u00e4gg i Ordformer\" rel=\"category tag\">Ordformer</a> av <a href=\"http://dinsvenska.se/author/anna-danielsson/\" title=\"Inl\u00e4gg av Anna Danielsson\" rel=\"author\">Anna Danielsson</a></p></div>"]},
.....
.....
.....
{"link": "http://dinsvenska.se/ordfoljd/ar-det-bast-att-inleda-ett-e-post-meddelande-med-hej-hej-eller-hej-2/", "titel": ["\u00c4r det b\u00e4st att inleda ett e-post-meddelande med \u201dHej!\u201d, \u201dHej,\u201d, eller \u201dHej.\u201d?"], "desc": ["<div class=\"cat\"><p><a href=\"http://dinsvenska.se/category/ordfoljd/\" title=\"Se alla inl\u00e4gg i Ordf\u00f6ljd\" rel=\"category tag\">Ordf\u00f6ljd</a>, <a href=\"http://dinsvenska.se/category/skiljetecken-och-andra-tecken/\" title=\"Se alla inl\u00e4gg i Skiljetecken och andra tecken\" rel=\"category tag\">Skiljetecken och andra tecken</a> av <a href=\"http://dinsvenska.se/author/anna-danielsson/\" title=\"Inl\u00e4gg av Anna Danielsson\" rel=\"author\">Anna Danielsson</a></p></div>"]}]




2012/7/25 Gustav Julander <gustav....@gmail.com>

Максим Горковский

unread,
Aug 3, 2012, 11:49:02 AM8/3/12
to scrapy...@googlegroups.com
Try this:

class Pipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db, use_unicode=True)
        self.conn.set_character_set('utf-8')
        self.cur = self.conn.cursor()
        self.cur.execute('set names utf8')
        self.cur.execute('set character set utf8')
        self.cur.execute('set character_set_connection=utf8')
        self.conn.commit()

    def process_item(self, item, spider):
        ...
        self.cur.execute('insert into table (a, b, c) values (%s, %s, %s)', (a,b,c))
        self.conn.commit()

this syntax will take care of types of variables. also make sure variables are not lists/tuples

2012/8/3 Gustav Julander <gustav....@gmail.com>



--
С уважением,
Максим Горковский

Gustav Julander

unread,
Aug 4, 2012, 7:42:48 AM8/4/12
to scrapy...@googlegroups.com
Hi, 

Thank's for helping out! I tried your code but it issues the following error:

_mysql_exceptions.ProgrammingError: (1064, 'You have an error in your SQL syntax; check the manual that correspo
nds to your MySQL server version for the right syntax to use near \'), (\'\\\'<div class=\\\\"cat\\\\"><p><a href=\\\\"h

The thing is that my original code, quoted earlier in the thread, works fine with an xpath like:

x.select('//div[@class="cat"]').extract()

But when I add someting more to the expression, for eample; 

x.select('//div[@class="cat"] | //div[@class="entry"] ').extract() 

It fails to insert all of the information retrieved, even though I know that the info is scraped. This is the output for a desc item when I print to json using the more complex expression cited above. Note that both the 'cat' class and the "entry" class is scraped. 

{"desc": ["<div class=\"cat\"><p><a href=\"http://dinsvenska.se/category/ordfoljd/\" title=\"Se alla inl\u00e4gg i Ordf\u00f6ljd\" rel=\"category tag\">Ordf\u00f6ljd</a> av <a href=\"http://dinsvenska.se/author/anna-danielsson/\" title=\"Inl\u00e4gg av Anna Danielsson\" rel=\"author\">Anna Danielsson</a></p></div>", "<div class=\"entry\"><p>Ja, det g\u00e5r bra.</p><div class=\"clear2\"></div></div>"]}

And this is the information for the same item when inserted to a mysql using my original pipeline:

<div class="entry"><p>Ja, det går bra.</p><div class="clear2"></div></div>

It only inserts the last part of the expression, namely the "entry" class. It could be that the loop is causing this. But if this is the case, how can I insert the items without using it. My item output follows the pattern:

[{"link": "link", "titel": ["titel"], "desc": ["descraiption"]},
{"link": "link", "titel": ["titel"], "desc": ["descraiption"]},
{"link": "link", "titel": ["titel"], "desc": ["descraiption"]}]    

/Gustav






2012/8/3 Максим Горковский <ragzo...@gmail.com>
Reply all
Reply to author
Forward
0 new messages