REDIRECTION question


Steven Zhu

May 27, 2021, 10:31:10 AM
to DigitalPebble
I just finished setting up secure HTTPS access to Elasticsearch from StormCrawler (SC), and everything works perfectly. I am going to create an issue and post everything I did once I have resolved a couple of other issues.

We currently use Nutch in production, but we have decided to move to SC. The main reason is that Nutch can't do continuous crawling. We think SC is the right solution for us.

Now the major issue I have to resolve is the REDIRECTION issue. I get consistent results: SC is short of around 5000 indexed documents compared to Nutch. I finally figured out that SC has around 5000 entries with the status REDIRECTION. I read through the questions and answers on Stack Overflow, and it seems that they will eventually be indexed. But I have waited almost 5 days and they are still not indexed. I thought I might have configured something wrong, but I couldn't figure out what. I looked at a specific URL marked as REDIRECTION in the status index and saw that the field "_redirTo" only contains a relative URL, although _routing plus _redirTo would give the complete URL. It might not be a problem. Anyway, I am trying to resolve this major issue so that we can migrate to production soon.

The other two minor issues concern the description and modified metadata from PDF (non-HTML) documents. Some PDF files return a String but others return an Array. Both Nutch and SC use Tika to parse non-HTML documents. Our Nutch production sites don't seem to have this issue, but SC does. I am just wondering why. These should be easy fixes; I was just hoping that you had faced the same issues so that I could get the solution from you.

So far SC's performance is very good. SC needs only 20% of the time to index almost the same number of documents. Thanks, Julien, for the excellent job.

Thanks

Steven

Steven Zhu

May 27, 2021, 10:39:47 AM
to DigitalPebble
BTW, the redirection is within the exact same domain/subdomain, from one page to another (e.g. https://example.com/page1/index.html to https://example.com/page1).
Thanks

Steven

Julien Nioche

May 27, 2021, 11:02:40 AM
to DigitalPebble
Hi Steven, 

Comments inlined below


> I just finished setting up secure HTTPS access to Elasticsearch from StormCrawler (SC), and everything works perfectly. I am going to create an issue and post everything I did once I have resolved a couple of other issues.

Fab, thanks
 

> We currently use Nutch in production, but we have decided to move to SC. The main reason is that Nutch can't do continuous crawling. We think SC is the right solution for us.

Great!
 

> Now the major issue I have to resolve is the REDIRECTION issue. I get consistent results: SC is short of around 5000 indexed documents compared to Nutch. I finally figured out that SC has around 5000 entries with the status REDIRECTION.
> I read through the questions and answers on Stack Overflow, and it seems that they will eventually be indexed.

The targets are DISCOVERED URLs like any other; there is no specific reason why they would be processed later. The URLs marked as REDIRECTION might get retried later, but this is a separate issue.
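
To see how many URLs sit in each status, a terms aggregation on the status index is one option. The sketch below only builds the request body and sends nothing. The index name `status` and the keyword field `status` are assumptions about a typical setup, so check them against your own mapping; `status_count_query` is a hypothetical helper name.

```python
import json


def status_count_query(max_buckets=10):
    """Build an Elasticsearch search body that counts documents per status.

    Assumes the status index has a keyword field named "status"
    (an assumption, verify it in your own mapping).
    """
    return {
        "size": 0,  # no hits needed, only the aggregation buckets
        "aggs": {
            "by_status": {
                "terms": {"field": "status", "size": max_buckets}
            }
        },
    }


if __name__ == "__main__":
    # Body for something like POST /status/_search (index name assumed)
    print(json.dumps(status_count_query(), indent=2))
```

Comparing the REDIRECTION and DISCOVERED buckets over time should show whether the redirection targets are being added and then fetched.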
 
> But I have waited almost 5 days and they are still not indexed. I thought I might have configured something wrong, but I couldn't figure out what. I looked at a specific URL marked as REDIRECTION in the status index and saw that the field "_redirTo" only contains a relative URL, although _routing plus _redirTo would give the complete URL. It might not be a problem. Anyway, I am trying to resolve this major issue so that we can migrate to production soon.

Check the status index for the targets of the redirections. _redirTo is just the value as found, but the complete URL should still be built and added.
Redirections are treated like outlinks, so maybe check that they don't get filtered out, e.g. based on distance from the seed.
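
As a rough illustration of that check, the sketch below resolves a relative redirect value against the URL that returned it and builds a lookup for the resulting target. It is not StormCrawler's own code. The relative value `/page1` and the keyword field `url` are assumptions, and both function names are hypothetical.

```python
from urllib.parse import urljoin


def resolve_redirect(source_url, redir_to):
    """Resolve a possibly relative redirect value against its source URL."""
    return urljoin(source_url, redir_to)


def target_lookup_query(source_url, redir_to):
    """Build a search body that looks for the redirect target by full URL.

    Assumes the status index stores the complete URL in a keyword
    field named "url" (an assumption, verify it in your own mapping).
    """
    target = resolve_redirect(source_url, redir_to)
    return {"query": {"term": {"url": target}}}


if __name__ == "__main__":
    source = "https://example.com/page1/index.html"
    print(resolve_redirect(source, "/page1"))  # -> https://example.com/page1
    print(target_lookup_query(source, "/page1"))
```

If the lookup finds nothing for the resolved URL, the target was probably dropped by a URL filter before it reached the status index.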
 

> The other two minor issues concern the description and modified metadata from PDF (non-HTML) documents. Some PDF files return a String but others return an Array. Both Nutch and SC use Tika to parse non-HTML documents. Our Nutch production sites don't seem to have this issue, but SC does. I am just wondering why. These should be easy fixes; I was just hoping that you had faced the same issues so that I could get the solution from you.

Please open an issue and share a reproducible URL.
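
Until the cause is known, whatever reads the indexed documents can be made tolerant of both shapes. This is a minimal client-side sketch, not a StormCrawler or Tika API. It assumes a field may arrive as a single string, a list of strings, or be missing; `as_list` and `first_value` are hypothetical helper names and the sample values are made up.

```python
def as_list(value):
    """Return a metadata value as a list of strings, whatever shape it came in."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return [str(v) for v in value]


def first_value(value, default=None):
    """Return the first value of a field, or a default when it is empty."""
    values = as_list(value)
    return values[0] if values else default


if __name__ == "__main__":
    # Hypothetical documents: one with a string field, one with an array field
    print(first_value("Annual report"))
    print(first_value(["Annual report", "Report 2021"]))
```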
 

> So far SC's performance is very good. SC needs only 20% of the time to index almost the same number of documents. Thanks, Julien, for the excellent job.

Thanks, I am glad you like it. It is a community effort; loads of people have contributed to make it what it is now.
 

> BTW, the redirection is within the exact same domain/subdomain, from one page to another (e.g. https://example.com/page1/index.html to https://example.com/page1).

OK, so it should not be related to domain or host filtering.

Kind regards

Julien
 