Removing duplicate items from CSV

Jaspreet Singh

May 4, 2014, 5:33:23 AM
to scrapy...@googlegroups.com
I am using Scrapy to crawl news articles posted as blog entries on a fixed URL. Each article (blog entry) is an item on that page, with fields like date, heading, description, etc.

I am using the following command to store the items in a CSV file:

scrapy crawl NewsBlog -o Items_File.csv -t csv

This web page is refreshed every few hours with new articles (blog entries). If I run the above command again, I get duplicate items in the CSV, because the entire set of items is appended to the file each time. I don't want duplicates in the CSV. I am not using a pipeline to filter duplicates, since there are no duplicates within the blog itself. I am also not deleting the CSV each time, because old articles disappear from the blog page. It is fair to assume that the combination of date and heading is unique across items.

Please suggest a viable solution for storing only the unique items in the csv.

Bill Ebeling

May 5, 2014, 12:31:59 PM
to scrapy...@googlegroups.com
Good option: Sounds like a case for a database.

Very bad option: The only other approach I can think of is storing a hash of each URL in a flat file, reading that file back in, and checking whether the hash of the current URL is already in the list; if not, save the item and add the hash to the file. This leads to many other problems.
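
For what it's worth, a rough sketch of that flat-file idea (Python 2; the file name and helper names are placeholders I made up, not anything standard):

import hashlib

SEEN_FILE = 'seen_urls.txt'  # hypothetical flat file, one sha1 hex hash per line

def load_seen():
    # Read previously stored hashes into a set for fast membership checks.
    try:
        with open(SEEN_FILE) as f:
            return set(line.strip() for line in f)
    except IOError:
        return set()

def is_new(url, seen):
    # Hash the URL; if unseen, remember it and append it to the flat file.
    h = hashlib.sha1(url).hexdigest()
    if h in seen:
        return False
    seen.add(h)
    with open(SEEN_FILE, 'a') as f:
        f.write(h + '\n')
    return True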

Jaspreet Singh

May 5, 2014, 1:12:41 PM
to scrapy...@googlegroups.com
Thanks for your reply, Bill.

I will go with the good option. Where should I place the SQL insert? I am thinking of placing it inside the parse function of the spider.

Bill Ebeling

May 5, 2014, 1:20:32 PM
to scrapy...@googlegroups.com
I have a separate file for the DB in the same folder as settings.py. The file handles the DB overhead and contains loose methods like 'saveItem' or 'itemExists' and whatnot. I call it from the pipeline: if not 'itemExists', then 'saveItem'.

This way the spider just happily crawls along and the pipeline deals with whatever items it's producing.
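
Roughly like this, as a sketch -- 'dbutils', 'itemExists', and 'saveItem' are placeholders for whatever your DB file actually exposes:

# pipelines.py -- drop duplicates, store everything else
from scrapy.exceptions import DropItem
import dbutils  # the hypothetical DB helper file next to settings.py

class NewsBlogPipeline(object):

    def process_item(self, item, spider):
        # Let the DB decide: skip items it already has, persist new ones.
        if dbutils.itemExists(item['date'], item['heading']):
            raise DropItem('duplicate item: %s' % item['heading'])
        dbutils.saveItem(item)
        return item

You would still need to register the pipeline class in ITEM_PIPELINES in settings.py so Scrapy actually runs it.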



Jaspreet Singh

May 8, 2014, 1:12:07 PM
to scrapy...@googlegroups.com
I am setting the web link (URL) field as the primary key in the MySQL DB. I have specified the length for this field as VARCHAR(1000); however, MySQL allows a maximum key length of 767 bytes. Is it safe to use the hash value of the URL as the primary key in MySQL?

Bill Ebeling

May 8, 2014, 2:21:10 PM
to scrapy...@googlegroups.com
Depends on your hash algo. If you use SHA-1 it gives you a 20-byte digest; take its hexdigest and you end up with a 40-character hex string, which fits a CHAR(40) column. That's what the internet seems to do, at least.

Py Docs: https://docs.python.org/2/library/hashlib.html#module-hashlib
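
For example (untested, Python 2; the URL is just a placeholder):

import hashlib

url = 'http://example.com/some/very/long/article/url'
url_hash = hashlib.sha1(url).hexdigest()
print len(url_hash)  # 40 -- fits a CHAR(40) primary key column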

Bill Ebeling

May 8, 2014, 2:25:12 PM
to scrapy...@googlegroups.com
Though, now that I think of it, Scrapy uses a fingerprint ID to determine whether it has already submitted a request. Maybe you could grab the ID Scrapy uses and make it easy on yourself.

See if this helps:  http://doc.scrapy.org/en/latest/search.html?q=fingerprint&check_keywords=yes&area=default#
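
If I remember right it lives in scrapy.utils.request, so something like this should work (untested; the URL is a placeholder):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

req = Request('http://example.com/article')
fp = request_fingerprint(req)  # sha1 hexdigest of the canonicalized request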
