Crawling Sitemap with Custom Metadata


ravis...@gmail.com

Mar 16, 2020, 9:58:44 AM
to DigitalPebble
We have a use case where custom metadata for documents stored in a content management repository (not part of the documents' own metadata) needs to be passed into Elasticsearch.

One option we are considering is to add the custom metadata to our sitemap.xml. This would require us to develop a custom SiteMapParserBolt, customizing the parser to use a custom XSD, handlers, etc. to parse and seed the metadata from our sitemap.
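
To make the idea concrete, here is a minimal sketch of reading custom elements from a sitemap entry. It is plain Python rather than a SiteMapParserBolt, and the `cms` namespace URI, the element names (`department`, `owner`) and the sample URL are all made up for illustration; a real sitemap would use whatever schema the repository exports.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace plus a hypothetical namespace for the custom fields.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "cms": "http://example.com/schemas/cms-metadata/1.0",  # assumption, not a real schema
}

SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:cms="http://example.com/schemas/cms-metadata/1.0">
  <url>
    <loc>https://example.com/docs/report.pdf</loc>
    <cms:department>finance</cms:department>
    <cms:owner>ravi</cms:owner>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return {url: {key: value}} with the custom metadata of every <url> entry."""
    root = ET.fromstring(xml_text.strip())
    prefix = "{" + NS["cms"] + "}"
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        # Keep only child elements that belong to the custom namespace.
        metadata = {
            child.tag[len(prefix):]: (child.text or "").strip()
            for child in url
            if child.tag.startswith(prefix)
        }
        result[loc] = metadata
    return result


if __name__ == "__main__":
    for url, metadata in parse_sitemap(SAMPLE).items():
        print(url, metadata)
```

In a crawler, each key/value pair extracted this way would be attached to the metadata of the corresponding outlink.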

But would maintaining our own sitemap parser implementation cause problems with future upgrades, and is there a better approach for this scenario? Please let me know.

thanks
ravi

DigitalPebble

Mar 17, 2020, 5:55:30 AM
to DigitalPebble
Hi Ravi

There are plans to support sitemap extensions (https://github.com/DigitalPebble/storm-crawler/issues/749). However, you'd also need to find a way of representing your bespoke metadata that is compatible with the extensions supported by crawler-commons.

The code of the sitemap parser shouldn't change much, so extending it would only cause minor problems in the future. An alternative would be to generate the information in a separate, non-sitemap format (e.g. tab-separated key values) and write a custom parser specifically for it.
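
As a rough illustration of the tab-separated alternative, the sketch below assumes one line per document of the form `URL<TAB>key=value<TAB>key=value`. That layout, the function name `parse_line` and the sample keys are assumptions made for the example, not a format prescribed anywhere in this thread.

```python
def parse_line(line):
    """Split 'URL<TAB>key=value<TAB>key=value' into (url, metadata).

    Metadata values are collected in lists so that a key may appear
    more than once on the same line.
    """
    url, *pairs = line.rstrip("\n").split("\t")
    metadata = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            continue  # skip empty or malformed entries that lack 'key='
        metadata.setdefault(key.strip(), []).append(value.strip())
    return url.strip(), metadata


if __name__ == "__main__":
    sample = "https://example.com/docs/report.pdf\tdepartment=finance\towner=ravi\n"
    print(parse_line(sample))
```

A custom parser built along these lines only has to split strings, which keeps it independent of any changes to the sitemap parsing code.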

Hope this helps

Julien

--
You received this message because you are subscribed to the Google Groups "DigitalPebble" group.
To unsubscribe from this group and stop receiving emails from it, send an email to digitalpebbl...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/digitalpebble/8b029af3-d44b-4ab5-9a6b-43cb358ccefc%40googlegroups.com.


ravis...@gmail.com

Mar 17, 2020, 1:34:13 PM
to DigitalPebble
Thanks Julien. This is helpful. I can successfully extend the sitemap parser and parse and transfer the metadata values into the outlink status index. However, as you raised, I am still trying to figure out how to pass the metadata into the content index using the existing implementation.

Unfortunately I don't have control over the source, so creating a custom non-sitemap format would need development on the other side and may not be feasible. Alternatively, I could parse the sitemap to generate the information in a separate format, as you suggested, and let a custom parser take it from there. I'll spend some more time exploring the options I have. Thanks again.

thanks
ravi

DigitalPebble

Mar 18, 2020, 4:05:31 AM
to DigitalPebble
That's easy. Set the following in your config file:

indexer.md.mapping:
- parse.title=title
- parse.keywords=keywords
- parse.description=description
 
where the left-hand sides are the key names in the metadata and the right-hand sides are the names of the fields you want to generate in Elasticsearch. You can of course configure the latter in the ES mapping.
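
To illustrate what such a mapping expresses, here is a small sketch of how `key=field` entries could translate metadata into index fields. It is not StormCrawler's actual code; the function name `build_fields` and the sample metadata are hypothetical, and only the three mapping entries come from the config above.

```python
# The mapping entries from the config, written as 'metadata_key=field_name'.
MAPPING = [
    "parse.title=title",
    "parse.keywords=keywords",
    "parse.description=description",
]


def build_fields(metadata, mapping):
    """Return the index fields produced by applying the mapping to the metadata.

    Metadata keys without a mapping entry are left out, and mapping entries
    whose key is absent from the metadata produce no field.
    """
    fields = {}
    for entry in mapping:
        md_key, _, field_name = entry.partition("=")
        values = metadata.get(md_key)
        if values:
            fields[field_name] = values
    return fields


if __name__ == "__main__":
    metadata = {
        "parse.title": ["Quarterly report"],
        "parse.keywords": ["finance", "q1"],
        "internal.flag": ["ignored"],  # no mapping entry, so not indexed
    }
    print(build_fields(metadata, MAPPING))
```

Custom keys extracted from the sitemap would need their own entries in the mapping, following the same `key=field` pattern.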

 


ravis...@gmail.com

Mar 18, 2020, 9:51:48 AM
to DigitalPebble
Thanks again Julien. Great, it is working now. We did try this mapping before, but it didn't work at the time, when we were trying it along with other settings. I wonder what mistake we made back then. I was thinking about an alternative process: setting an additional flag in the outlinks and updating the content index based on that flag and _id. Glad there is no need for that.

thanks
ravi

DigitalPebble

Mar 18, 2020, 10:45:41 AM
to DigitalPebble
You're welcome. Glad you got it to work
