I just finished setting up secure HTTPS access to Elasticsearch from SC. Everything works perfectly. I am going to create an issue and post everything I did after I resolve a couple of other issues.
Currently we are using Nutch in production, but we have decided to move to SC. The main reason is that Nutch can't do continuous crawling. We think that SC is the right solution for us.
The major issue I have to resolve now is REDIRECTION. I get consistent results: SC indexes around 5000 fewer documents than Nutch. I finally figured out that SC has around 5000 entries with the status REDIRECTION. I read through the questions and answers on Stack Overflow. It seems that these should eventually be indexed, but I have waited almost 5 days and they are still not indexed. I thought I might have misconfigured something, but I couldn't figure out what is wrong. I looked at a specific URL marked as REDIRECTION in the status index. I saw that the field "_redirTo" contains only a relative URL, although _routing plus _redirTo would give the complete URL. It might not be a problem. Anyway, I am trying to resolve this major issue so that we can migrate to production soon.
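The point about the relative "_redirTo" value can be illustrated with a small sketch. It assumes the URL that issued the redirect is available next to the redirect target. The URLs and the function name `resolve_redirect` are hypothetical, and this is not SC's own code, only a way to check by hand what the absolute target would be:

```python
from urllib.parse import urljoin


def resolve_redirect(source_url: str, redir_to: str) -> str:
    """Resolve a possibly relative redirect target against the URL that issued it."""
    # urljoin returns an already-absolute target unchanged and resolves
    # path-only or relative targets against source_url.
    return urljoin(source_url, redir_to)


# Hypothetical values, for illustration only.
source = "https://example.com/docs/old.pdf"
for target in ("/docs/new.pdf", "new.pdf", "https://other.example.org/a"):
    print(target, "->", resolve_redirect(source, target))
```

If the resolved URL does not appear in the status index at all, that would suggest the redirect target was never added as a new URL, which narrows down where to look.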
The other two minor issues concern the description and modified metadata from PDF (non-HTML) documents. Some PDF files return a String for these fields, but some return an Array. Both Nutch and SC use Tika as the parser for non-HTML content. Our Nutch production sites do not seem to have this issue, but SC does, so I am just wondering why. These should be easy fixes. I was just hoping that you have faced the same issues so that I can get the solution from you.
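One way to cope with a field that is sometimes a String and sometimes an Array is to normalise it on the consumer side. The sketch below is only an assumption about how that could look. The helper names `as_list` and `first_value` are made up for illustration and are not part of SC, Nutch, or Tika:

```python
from typing import List, Optional, Union

MetadataValue = Union[str, List[str], None]


def as_list(value: MetadataValue) -> List[str]:
    """Return the metadata value as a list, whatever shape it arrived in."""
    if value is None:
        return []
    if isinstance(value, str):
        # A single string becomes a one-element list; empty strings are dropped.
        return [value] if value else []
    # Already a list: keep only the non-empty entries.
    return [v for v in value if v]


def first_value(value: MetadataValue) -> Optional[str]:
    """Return the first non-empty value, or None if there is none."""
    values = as_list(value)
    return values[0] if values else None


# Hypothetical documents: one with a string field, one with an array field.
print(first_value("Annual report"))
print(first_value(["Annual report", "Report 2019"]))
```

Taking the first value is a design choice made here for simplicity. Joining all values or keeping the full list may suit the description field better.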
So far SC's performance has been very good. SC needs only about 20% of the time Nutch takes to index almost the same number of documents. Thanks, Julien, for the excellent job.
Thanks
Steven