Hi, I use the URL as the ID field, but the problem is that the URL carries query arguments like this:
https://www.tokopedia.com/sumberjayamotor/jas-hujan-axio-europe-original?trkid=f%3DCa63L000P0W0S0Sh00Co0Po0Fr0Cb0_src%3Ddirectory_page%3D1_ob%3D24_q%3D_catid%3D1318_po%3D1
Sometimes the arguments differ even though the URL points to the same page. Because of this,
I have a lot of duplicate items in the database. I have tried to normalize the URL in a pipeline,
but the item is still saved instead of being dropped. I have also tried the processor option in the
scraper page, using 'remove_chars': '\?.+', but I still get a lot of duplicate items.
These are my pipelines:
import logging

from django.db.utils import IntegrityError
from scrapy.exceptions import DropItem
from dynamic_scraper.models import SchedulerRuntime


class DjangoWriterPipeline(object):

    def process_item(self, item, spider):
        if spider.conf['DO_ACTION']:
            try:
                item['source_detail'] = spider.ref_object
                # Each saved item gets its own checker runtime object.
                checker_rt = SchedulerRuntime(runtime_type='C')
                checker_rt.save()
                item['checker_runtime'] = checker_rt
                item.save()
                spider.action_successful = True
                spider.log("Item saved to Django DB.", logging.INFO)
            except IntegrityError as e:
                spider.log(str(e), logging.ERROR)
                raise DropItem("Missing attribute.")
        return item


class UrlNormalizerPipeline(object):

    def process_item(self, item, spider):
        if spider.conf['DO_ACTION']:
            try:
                if 'url' in item:
                    # Keep only the part of the URL before the query string.
                    item['url'] = item['url'].split('?')[0]
                    spider.action_successful = True
            except IntegrityError as e:
                spider.log(str(e), logging.ERROR)
                raise DropItem("Something went wrong when normalizing url.")
        return item
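
To make the intended normalization concrete, here is a minimal standalone sketch (Python 3). It assumes the query string never identifies a distinct page, which seems to hold for the `trkid` argument above but may not hold for every site. The helper name `normalize_url` and the shortened `trkid` values are hypothetical and only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    # Keep scheme, host and path; replace query and fragment with empty strings.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Two URLs for the same page that differ only in the tracking argument.
first = normalize_url(
    "https://www.tokopedia.com/sumberjayamotor/jas-hujan-axio-europe-original?trkid=abc"
)
second = normalize_url(
    "https://www.tokopedia.com/sumberjayamotor/jas-hujan-axio-europe-original?trkid=xyz"
)
print(first == second)
```

A normalized value like this can only prevent duplicates if it is applied before the item is compared against existing entries and saved, so the position of the normalizing step relative to the other pipelines is probably what matters here.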