How to pass a cookie from a curl response to outlinks

ravis...@gmail.com

Mar 2, 2020, 3:02:55 PM
to DigitalPebble
Hi,

I am trying to crawl content on a secured site that uses basic authentication. If I understand correctly, I can either use Selenium with a custom navigation filter to log in first, or send a curl POST to the login URL to generate a cookie and transfer it to the outlinks.

I am trying the second option. I wrote a simple bat/sh script that generates cookie.txt, and I have the config below to set metadata.transfer. My question is: how do I read the cookie from this file and pass it on? Do I have to pass it as the value for the set-cookie key? If so, where do I do that? Should I create a custom Protocol, or can I just run the curl command and then start the crawl?

I am new to StormCrawler and I really like the power of the crawler combined with Elasticsearch. Any help is appreciated.

  metadata.transfer:
   - set-cookie

  http.use.cookies: true

  http.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"
  https.protocol.implementation: "com.digitalpebble.stormcrawler.protocol.okhttp.HttpProtocol"


thanks
ravi

DigitalPebble

Mar 3, 2020, 8:48:57 AM
to DigitalPebble
Hi


You could also generate the cookie externally prior to the crawling and specify it in the seed metadata using the key set-cookie. You'd need to add that key to metadata.transfer in your conf so that it gets transmitted to the outlinks and persisted to the storage.

Simply inject the initial cookie as a key/value pair in the seed metadata, e.g.

http://stormcrawler.net    \t   set-cookie=xxxxxxxxxxxxxxxxxxxx
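
If the cookie comes from a cookie.txt written by curl (its -c/--cookie-jar option), a seed line in the form shown above could be generated with a small script. This is only a rough sketch and not part of StormCrawler: the function names are hypothetical, it assumes the file is in curl's Netscape cookie-jar format (seven tab-separated fields, with the cookie name and value last), and it assumes the set-cookie metadata value can be supplied as name=value.

```python
def cookies_from_netscape(text):
    """Return (name, value) pairs from the text of a Netscape-format cookie jar."""
    pairs = []
    for line in text.splitlines():
        # curl prefixes HttpOnly cookies with "#HttpOnly_"; these are real entries
        if line.startswith("#HttpOnly_"):
            line = line[len("#HttpOnly_"):]
        elif line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # not a well-formed cookie entry
        # fields: domain, subdomain flag, path, secure, expiry, name, value
        pairs.append((fields[5], fields[6]))
    return pairs


def seed_line(url, pairs):
    """Build a seed line: the URL, then one tab-separated set-cookie entry per cookie.

    Assumption: the crawler accepts set-cookie metadata values as name=value.
    """
    meta = "\t".join(f"set-cookie={name}={value}" for name, value in pairs)
    return f"{url}\t{meta}" if meta else url
```

Reading cookie.txt, passing its contents to cookies_from_netscape, and writing the result of seed_line to the seed file would then happen once, before the crawl is started.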

The httpclient protocol implementation supports basic authentication (https://github.com/DigitalPebble/storm-crawler/wiki/Protocols). Would this work for you?

Hope this helps

Julien
 


DigitalPebble

Mar 3, 2020, 9:53:11 AM
to DigitalPebble
BTW, I've just added basic authentication for OkHttp: https://github.com/DigitalPebble/storm-crawler/issues/792

ravis...@gmail.com

Mar 3, 2020, 11:17:47 AM
to DigitalPebble
Thanks a lot, Julien, for sending the details. HTTP basic authentication does work. I am also testing set-cookie, which will be helpful once we switch to an SSO-protected site. Thanks again, much appreciated.

thanks
ravi