Duplicate items when using url as ID FIELD

Boy Sandy Gladies Arriezona

May 7, 2017, 3:39:11 AM
to django-dynamic-scraper
Hi, I use the url as the ID field, but the problem is that the url carries query arguments like this:
https://www.tokopedia.com/sumberjayamotor/jas-hujan-axio-europe-original?trkid=f%3DCa63L000P0W0S0Sh00Co0Po0Fr0Cb0_src%3Ddirectory_page%3D1_ob%3D24_q%3D_catid%3D1318_po%3D1

Sometimes the arguments differ even though it is actually the same page. Because of this,
I have a lot of duplicate items in the database. I have tried to normalize the url in a pipeline,
but the item is still saved instead of being dropped. I have also tried an option on the scraper
page, 'remove_chars': '\?.+', but I still get a lot of duplicate items.

This is my pipeline:
import logging

from django.db.utils import IntegrityError
from scrapy.exceptions import DropItem
from dynamic_scraper.models import SchedulerRuntime


class DjangoWriterPipeline(object):
    def process_item(self, item, spider):
        if spider.conf['DO_ACTION']:
            try:
                # Attach the item to its source and give it a checker runtime.
                item['source_detail'] = spider.ref_object
                checker_rt = SchedulerRuntime(runtime_type='C')
                checker_rt.save()
                item['checker_runtime'] = checker_rt

                item.save()
                spider.action_successful = True
                spider.log("Item saved to Django DB.", logging.INFO)

            except IntegrityError as e:
                spider.log(str(e), logging.ERROR)
                raise DropItem("Missing attribute.")

        return item


class UrlNormalizerPipeline(object):
    def process_item(self, item, spider):
        if spider.conf['DO_ACTION']:
            try:
                if 'url' in item:
                    # Keep only the part of the url before the query string.
                    item['url'] = item['url'].split('?')[0]

                spider.action_successful = True
            except IntegrityError as e:
                spider.log(str(e), logging.ERROR)
                raise DropItem("Something went wrong when normalizing url.")

        return item
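
For illustration, the normalization step described above can be tried outside Scrapy. This is only a minimal sketch, assuming that everything after the "?" is tracking data that does not identify the page; normalize_url and DuplicateUrlFilter are hypothetical names, not part of django-dynamic-scraper. Whether a normalized url prevents duplicates inside the scraper also depends on where the duplicate check runs relative to this pipeline, which the thread does not show.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return the url without its query string and fragment."""
    parts = urlsplit(url)
    # Keep scheme, host and path; drop the query (tracking arguments) and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


class DuplicateUrlFilter(object):
    """Hypothetical helper: remembers normalized urls seen during one run."""

    def __init__(self):
        self.seen = set()

    def is_duplicate(self, url):
        key = normalize_url(url)
        if key in self.seen:
            return True
        self.seen.add(key)
        return False
```

Two urls that differ only in their trkid argument map to the same key, so the second one is reported as a duplicate.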

Holger Drewes

May 7, 2017, 4:31:10 AM
to django-dyna...@googlegroups.com
Hmm, maybe just switch your ID concept and use a combination of, e.g., category, subcategory and title instead? (You can select several fields that together form the ID.)

Probably a bit messy in the transition period :-) but maybe worth it.

Holger
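
The idea of an ID formed from several fields can be sketched in plain Python. This is an illustration only, not how django-dynamic-scraper builds its IDs internally; composite_id and the field names category, sub_category and title are assumptions taken from the example fields mentioned above.

```python
import hashlib


def composite_id(item, fields=("category", "sub_category", "title")):
    """Build one stable key from several item fields."""
    # Normalize each value so whitespace or case differences do not create new IDs.
    parts = [str(item.get(name, "")).strip().lower() for name in fields]
    # Join with a separator unlikely to appear in the values, then hash to a fixed length.
    joined = "\x1f".join(parts)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()
```

Because the url is not one of the fields, two scraped items with different tracking arguments but the same category, subcategory and title get the same ID.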


Boy Sandy Gladies Arriezona

May 7, 2017, 5:17:18 AM
to django-dynamic-scraper
I have already solved it by searching for the url in the database (I overrode the save method in DjangoItem).
But I still cannot find a way to update the existing data. Do you have an example for that?
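
The "look up by url, then update instead of insert" pattern can be sketched without Django. This is a minimal sketch using the standard library sqlite3 module with a hypothetical product table (url, title, price); it is not the DjangoItem override from this thread. In Django itself, QuerySet.update_or_create(defaults=..., **lookup) covers the same pattern.

```python
import sqlite3


def save_or_update(conn, url, title, price):
    """Insert a row for url, or update the existing row with the same url."""
    row = conn.execute("SELECT id FROM product WHERE url = ?", (url,)).fetchone()
    if row is None:
        # No row with this url yet: store a new one.
        conn.execute(
            "INSERT INTO product (url, title, price) VALUES (?, ?, ?)",
            (url, title, price),
        )
        created = True
    else:
        # Row already exists: overwrite its data instead of adding a duplicate.
        conn.execute(
            "UPDATE product SET title = ?, price = ? WHERE id = ?",
            (title, price, row[0]),
        )
        created = False
    conn.commit()
    return created


# In-memory database with a hypothetical table, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, url TEXT UNIQUE, title TEXT, price INTEGER)"
)
```

Calling save_or_update twice with the same url leaves a single row holding the values from the second call.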