Duplicate items when using url as ID FIELD

Boy Sandy Gladies Arriezona

May 7, 2017, 3:39:11 AM

to django-dynamic-scraper

Hi, I use url as the ID FIELD, but the problem is that the URL carries a query argument like this:

https://www.tokopedia.com/sumberjayamotor/jas-hujan-axio-europe-original?trkid=f%3DCa63L000P0W0S0Sh00Co0Po0Fr0Cb0_src%3Ddirectory_page%3D1_ob%3D24_q%3D_catid%3D1318_po%3D1

Sometimes the argument differs even though the URL points to the same page. Because of this, I have a lot of duplicate items in the database. I have tried normalizing the URL in a pipeline, but the item is still saved instead of being dropped. I have also tried the processor option on the scraper page with 'remove_chars': '\?.+', but I still get a lot of duplicate items.
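As a quick sanity check on the pattern itself, outside of the scraper, the sketch below applies the same regular expression to the example URL with Python's `re.sub`. Whether the `remove_chars` processor applies the pattern in exactly this way, and whether it runs before the ID comparison, are assumptions that this snippet does not verify.

```python
import re

url = (
    "https://www.tokopedia.com/sumberjayamotor/jas-hujan-axio-europe-original"
    "?trkid=f%3DCa63L000P0W0S0Sh00Co0Po0Fr0Cb0_src%3Ddirectory_page%3D1"
    "_ob%3D24_q%3D_catid%3D1318_po%3D1"
)

# Same pattern as in the post: a literal '?' followed by one or more characters.
stripped = re.sub(r"\?.+", "", url)
print(stripped)
```

If the printed URL has no query string, the pattern is fine and the duplicates more likely come from where or when the stripping is applied.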

This is my pipeline:

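import logging

from django.db.utils import IntegrityError
from scrapy.exceptions import DropItem
from dynamic_scraper.models import SchedulerRuntime


class DjangoWriterPipeline(object):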
   def process_item(self, item, spider):
      if spider.conf['DO_ACTION']:
         try:
            item['source_detail'] = spider.ref_object
            checker_rt = SchedulerRuntime(runtime_type='C')
            checker_rt.save()
            item['checker_runtime'] = checker_rt

            item.save()
            spider.action_successful = True
            spider.log("Item saved to Django DB.", logging.INFO)

         except IntegrityError as e:
            spider.log(str(e), logging.ERROR)
            raise DropItem("Missing attribute.")

      return item


class UrlNormalizerPipeline(object):
   def process_item(self, item, spider):
      if spider.conf['DO_ACTION']:
         try:
            if 'url' in item:
               item['url'] = item['url'].split('?')[0]

            spider.action_successful = True
         except IntegrityError as e:
            spider.log(str(e), logging.ERROR)
            raise DropItem("Something went wrong when normalizing url.")

      return item
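Note that the UrlNormalizerPipeline above only rewrites the URL; nothing in it compares the result against URLs that are already stored, so it never drops a duplicate by itself. A minimal sketch of that missing comparison follows. It is plain Python so it runs without Scrapy: `DuplicateItem` is a stand-in for `scrapy.exceptions.DropItem`, and `UrlDedupFilter` and its `known_urls` argument are hypothetical names, not part of django-dynamic-scraper.

```python
from urllib.parse import urlsplit, urlunsplit


class DuplicateItem(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


class UrlDedupFilter:
    """Normalizes item['url'] and rejects URLs that were already seen."""

    def __init__(self, known_urls=()):
        # In a real pipeline these would be loaded from the database (assumption).
        self.seen = set(known_urls)

    def process_item(self, item):
        parts = urlsplit(item["url"])
        # Keep scheme, host and path; drop the query string and the fragment.
        url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if url in self.seen:
            raise DuplicateItem("Duplicate url: %s" % url)
        self.seen.add(url)
        item["url"] = url
        return item
```

In a Scrapy project a filter like this would also have to run before the pipeline that saves the item, which is controlled by the integer order values in the `ITEM_PIPELINES` setting (lower values run first).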

Holger Drewes

May 7, 2017, 4:31:10 AM

to django-dyna...@googlegroups.com

Hmmm, maybe just switch your ID concept and take a combination of e.g. category, sub category and title (you can select several fields forming the ID) instead?

Probably a bit messy in the transition period :-) but maybe worth it.

Holger
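To illustrate the idea of an ID built from several fields, here is a small sketch in plain Python. The field names `category`, `sub_category` and `title` and the helper `make_item_id` are hypothetical; in django-dynamic-scraper itself the ID fields are selected in the scraper configuration rather than in code like this.

```python
def make_item_id(item, id_fields=("category", "sub_category", "title")):
    """Build a hashable key from several fields (field names are hypothetical)."""
    # Trim whitespace and ignore case so cosmetic differences do not create new IDs.
    return tuple(item[field].strip().lower() for field in id_fields)


page_a = {
    "url": "https://example.com/item?trkid=1",
    "category": "Automotive",
    "sub_category": "Raincoats",
    "title": "Jas Hujan Axio Europe Original",
}
# Same product reached through a different tracking argument.
page_b = dict(page_a, url="https://example.com/item?trkid=2")

same_item = make_item_id(page_a) == make_item_id(page_b)
```

Because the URL is not part of the key, the two pages compare as the same item even though their tracking arguments differ.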

--

You are receiving this message because you are subscribed to the Google Groups "django-dynamic-scraper" group.

To unsubscribe from this group and stop receiving emails from it, send an email to django-dynamic-sc...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Boy Sandy Gladies Arriezona

May 7, 2017, 5:17:18 AM

to django-dynamic-scraper

I already solved it by searching for the url in the database (I overrode the save method in DjangoItem).
But I still cannot find a way to update the data. Do you have an example for that?
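The thread does not include an answer to the update question. As a rough sketch of the usual "update if present, otherwise insert" logic, the example below uses a plain dict in place of the database; `save_or_update` and `store` are hypothetical names. With the Django ORM, the built-in `Model.objects.update_or_create(defaults=..., **lookup)` performs the same kind of lookup-then-write and returns an `(object, created)` tuple; how best to call it from an overridden DjangoItem save method is not covered here.

```python
def save_or_update(store, item, key="url"):
    """Insert the item if its key is new, otherwise update the stored record.

    Returns True when a new record was created, False when one was updated.
    """
    existing = store.get(item[key])
    if existing is None:
        store[item[key]] = dict(item)
        return True
    # Overwrite the stored fields with the freshly scraped values.
    existing.update(item)
    return False


store = {}
created_first = save_or_update(store, {"url": "https://example.com/p", "price": 100})
created_second = save_or_update(store, {"url": "https://example.com/p", "price": 90})
```

After the second call the store still holds a single record for the URL, now carrying the newer price.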
