Hi Nico,
I think your problem is not the amount of RAM you dedicated to Solr, but the fact that some files contain too much text to index.
That is why I suggest changing the configuration of your job to add a content limiter. You can follow this documentation: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/100663333/Use+the+content+limiter+transformation+connector
Note that you will need to add a Tika connector because, as far as I understand, you are currently using the embedded Tika of Solr, which I don't recommend. By doing this, you will also need to allocate more RAM to MCF, at least 3 GB (location: DATAFARI_HOME/mcf/mcf_home/option.env.unix).
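For reference, allocating that RAM means raising the JVM heap options in option.env.unix. A minimal sketch of the two relevant lines, assuming the 3 GB figure above (the rest of the file stays untouched; later in this thread the same values are raised to 5120m when the crawl still runs out of memory):

```
-Xms3072m
-Xmx3072m
```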
Let us know if it fixes your issue or if you have difficulties configuring your job.
Regards,
Julien Massiera
From: Nico D
Sent: Wednesday, June 27, 2018 09:05
To: Datafari
Subject: Re: "No Results Found" - Job Done Successfully and Autocomplete Suggestions working
Hi Olivier,
Sorry to have to come back to this thread, but it apparently is not actually working as planned.
The crawling process went much further, but still seemed to crash, and my job now shows the error message that no Solr instance is online (hint: it crashed lol).
Error message:
Error: Unhandled SolrServerException: No live SolrServers available to handle this request:[http://localhost:8983/solr/FileShare]
I am able to restart Datafari through the provided restart-datafari.sh script and try again, but I receive the same error shortly thereafter.
Like I mentioned previously, this new machine has about 50 GB of RAM, a Xeon X5560 and 500 GB of HDD space, so I hope performance is not an issue this time.
Below is a screenshot of my Solr config page this time around:
On Friday, June 22, 2018 at 11:00:21 AM UTC+2, Olivier Tavard wrote:
Hi,
Thanks for the screenshots.
It is most likely the low amount of RAM allocated to Solr that causes your issue:
the JVM pauses all threads when running a full GC, and if the GC pause lasts longer than the ZooKeeper session timeout, ZooKeeper will consider the Solr instance offline.
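To make that failure mode concrete, here is a toy sketch (hypothetical function name and illustrative numbers; the real mechanism is ZooKeeper's heartbeat/session machinery, not a simple comparison):

```javascript
// A Solr node is declared dead when it stays silent longer than the
// ZooKeeper session timeout; a long stop-the-world GC has exactly that effect.
function isConsideredOffline(gcPauseMs, zkSessionTimeoutMs) {
  return gcPauseMs > zkSessionTimeoutMs;
}

// A 45 s full GC against a 30 s session timeout: the node is marked offline.
console.log(isConsideredOffline(45000, 30000)); // true
// A 5 s GC is absorbed without consequence.
console.log(isConsideredOffline(5000, 30000)); // false
```

This is why giving Solr more heap helps: full GCs become shorter or rarer, so the pause never outlasts the session timeout.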
If you cannot increase the amount of RAM, a workaround is to decrease the number of fetches per minute performed on your filer by MCF.
To do that, edit your MCF repository connector, go to the Throttling tab, and enter for example 10 for the number of fetches/min:
I hope that it helps.
Best regards,
Olivier
On Monday, June 4, 2018 at 11:32:37 PM UTC+2, Nico D wrote:
Hi Olivier,
Thanks for the reply.
So, to answer some of your questions: that screenshot of the Solr Admin homepage was taken shortly after stopping and starting the container again.
Here are the screenshots you requested:
As you can see, there are some errors and warnings for Solr. Unfortunately I cannot decipher what any of them mean. What exactly is the Overseer in terms of Solr?
Next, as you can see, I have two jobs atm, one for each SMB share. The top one, which is "Done" successfully, is for a relatively small share (879 docs), and the bottom one is a much bigger share. It hung and crashed after a while, and even now, after refreshing and waiting a day or two, I cannot restart this bottom job. It seems to have died entirely on me. Unfortunately the Docker container's latest commit includes the job in this status, so I can't really jump back to before it was in this crashed/aborted state.
Maybe I could just re-add the share in a new job?
I also unfortunately only have access to this machine atm, with 8 GB of RAM total. I do have access to another server with a static public IP, but it only has one NIC, so it is not on my internal network and cannot see the SMB shares.
Thanks!
On Friday, June 1, 2018 at 9:44:34 PM UTC+2, Olivier Tavard wrote:
Hello,
First, thanks for the info about your issue, it will be easier to help you ;)
However, I will need a bit more information:
In the Solr admin UI, could you click on Logging and take a screenshot, so we can check whether any errors are displayed?
About the Solr screenshot you sent: was it the Solr status just after you launched the job, or did you stop and restart the container in the meantime?
Because the first bar, full at 8 GB RAM, is "normal": it is the amount of RAM used by the whole server at the moment you viewed the page. For Solr alone, look at the last bar at the bottom: JVM memory.
Concerning ManifoldCF, could you take a screenshot of the job status screen, i.e., in the Jobs menu, click on "Status and Job management".
About the requirements: the minimum amount of RAM is 16 GB without ELK, and 32 GB is recommended when you activate ELK, for the Community edition of Datafari (https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/110788626/Hardware+requirements+Community+Edition).
Please send me the screenshots so I can investigate your problem.
Best regards,
Olivier Tavard
On Friday, June 1, 2018 at 1:19:24 AM UTC+2, N Dom wrote:
--
You received this message because you are subscribed to the Google Groups "Datafari" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datafari+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Be careful!
The DatafariSolrNoTika output connector is meant to be used only if a Tika connector is configured in your job config! If you use it without a Tika connector in your job, you will break the crawl!
To use the content limiter, you will be forced to add a Tika connector to your job.
The content limiter will bypass documents exceeding the indicated content length, but it does not guarantee that the pure text content of a document will not exceed that length (e.g., a zip produces more content after a Tika extraction).
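A toy sketch of why a raw-size check cannot bound the extracted text (hypothetical function name and an illustrative 50:1 compression ratio; the real limiter operates on MCF document streams):

```javascript
// A limiter that looks only at the raw byte size of a document.
function passesContentLengthLimit(sizeBytes, maxBytes) {
  return sizeBytes <= maxBytes;
}

const maxBytes = 10 * 1024 * 1024;   // a 10 MB limit configured in the job
const rawZipSize = 1 * 1024 * 1024;  // a 1 MB zip archive

// The raw file sails through the limit...
console.log(passesContentLengthLimit(rawZipSize, maxBytes)); // true

// ...but after Tika decompresses and extracts its entries, the resulting
// text can be far larger and would no longer fit the same budget.
const extractedTextSize = rawZipSize * 50;
console.log(passesContentLengthLimit(extractedTextSize, maxBytes)); // false
```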
Julien
From: Nico D
Sent: Wednesday, June 27, 2018 13:50
Hi Nico,
can you please send a screenshot of your job configuration view, and also the list of transformation connections?
Regards,
Julien
--
Julien MASSIERA
Datafari Product Manager
France Labs – The Search experts
Winner of the EY Internal Search challenge at Viva Technologies 2016
www.francelabs.com
Hi Nico,
Indeed, no documents have been indexed because of an error on the sequence_number field, which is an EXIF field. This is a very interesting case I have never seen before; could you send us at least one of the files that are not indexed, please?
In the meantime, you can ignore the field by editing the /opt/datafari/solr/solrcloud/FileShare/conf/solrconfig.xml file: find the /update/no-tika and /update/extract request handlers and add the following line to their defaults list: <str name="fmap.sequence_number">ignored_</str>
You should obtain:
<requestHandler class="com.francelabs.datafari.handler.parsed.ParsedRequestHandler" name="/update/no-tika" startup="lazy">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.language">ignored_</str>
    <str name="fmap.source">ignored_</str>
    <str name="fmap.sequence_number">ignored_</str>
    <str name="uprefix">ignored_</str>
    <str name="update.chain">datafari</str>
  </lst>
</requestHandler>

<requestHandler class="solr.extraction.ExtractingRequestHandler" name="/update/extract" startup="lazy">
  <lst name="defaults">
    <str name="scan">false</str>
    <str name="captureAttr">true</str>
    <str name="lowernames">true</str>
    <str name="fmap.language">ignored_</str>
    <str name="fmap.source">ignored_</str>
    <str name="fmap.sequence_number">ignored_</str>
    <str name="fmap.url">ignored_</str>
    <str name="uprefix">ignored_</str>
    <str name="update.chain">datafari</str>
    <bool name="ignoreTikaException">true</bool>
    <str name="tika.config">/opt/datafari/solr/solrcloud/FileShare/conf/tika.config</str>
  </lst>
</requestHandler>
After this modification, you will need to upload the new conf through the admin UI: Search Engine Configuration -> Zookeeper, then click on Upload, wait a few seconds, then click on Reload.
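If you prefer the command line, here is a sketch of the equivalent operations. The zkcli.sh path and the ZooKeeper host/port are assumptions for a default Datafari install; the admin UI remains the documented route:

```shell
# Push the edited config set to ZooKeeper (upconfig is a standard Solr zkcli
# command; the script path and ZK address below are assumed, adjust as needed)
/opt/datafari/solr/server/scripts/cloud-scripts/zkcli.sh \
  -zkhost localhost:2181 \
  -cmd upconfig \
  -confdir /opt/datafari/solr/solrcloud/FileShare/conf \
  -confname FileShare

# Reload the collection so the new solrconfig.xml is picked up
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=FileShare"
```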
Unfortunately, you will need to recrawl your share after this modification (make a copy of your job and run the copy).
Regards,
Julien
<requestHandler class="solr.extraction.ExtractingRequestHandler" name="/update/extract" startup="lazy">
  <lst name="defaults">
    <str name="captureAttr">true</str>
    <str name="lowernames">true</str>
    <str name="fmap.language">ignored_</str>
    <str name="fmap.source">ignored_</str>
    <str name="fmap.sequence_number">ignored_</str>
    <str name="uprefix">ignored_</str>
    <str name="update.chain">datafari</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
Hi Nico,
I don't think the path length is an issue. If MCF didn't complain and Solr received the files, it's fine.
The missing tika.config is not a problem: I pasted my own Solr configuration, which contains some custom modifications, including this one; you can remove the tika.config line.
Julien
From: Nico D
Sent: Thursday, July 5, 2018 08:39
To: Datafari
Subject: Re: "No Results Found" - Job Done Successfully and Autocomplete Suggestions working
Hi Julien,
Could that have anything to do with not receiving any results in the search?
LG
Nico
On Monday, July 2, 2018 at 4:56:36 PM UTC+2, julien.massiera wrote:
Yes, I think the job is frozen because your MCF agent may have crashed due to a low memory issue. I suggest stopping your Datafari and allocating 5 GB of memory to MCF by changing the Xms and Xmx values in the option.env.unix file:
-Xms5120m
-Xmx5120m
During a crawl, Tika is the component that uses a lot of memory to extract the content of files. The job config that I suggested and that you implemented uses the Tika of MCF instead of Solr's embedded one. This is the reason why you really need to allocate more memory to MCF than the default configuration, which should be 256 MB.
In the same way, you also need to explicitly allocate more RAM to Solr, in the solr.in.sh file located in /opt/datafari/solr/bin:
SOLR_JAVA_MEM="-Xms10240m -Xmx10240m"
Finally, you can now switch from the DatafariSolr output to the DatafariSolrNoTika in your job config.
Julien
On 02/07/2018 16:29, Nico D wrote:
Hi Julien,
Unfortunately, it is still in a frozen state. I did not allocate more RAM, because on this new machine Datafari is running natively (not in a Docker container) and the Solr config says it has all 50 GB available.
Do I still need to somehow allocate more in the option.env.unix file?
This is my Solr log - only some warnings about Tika
I am still using the same Job config as in my previous screenshots - with the TikaOCR transformation and the contentLimiter transformation after that.
LG
Nico
On Monday, July 2, 2018 at 4:21:27 PM UTC+2, julien.massiera wrote:
Hi Nico,
Is the job still in a frozen state? Do you have more information about its status in the Simple History of MCF?
Also, did you allocate more RAM to MCF than the default configuration, in the option.env.unix file located in /opt/datafari/mcf/mcf_home?
Regards,
Julien
On 29/06/2018 16:57, Nico D wrote:
Hi Julien,
I am trying it now with your content limiter with the following job config:
It seems to be running a bit better, but it slows to a crawl after about 1000 documents. Solr doesn't seem to be crashing, but it doesn't seem to be going forward either.
On Friday, June 29, 2018 at 2:30:00 PM UTC+2, julien.massiera wrote:
Hi Nico,
thanks for the screenshot. To be honest, I'm not certain why you didn't apply my recommendations to get rid of your issues, but for sure I won't be of any help if you don't give them a try...
Are there any steps in my explanations that may have been unclear? If so, let me know and I'll try to be more precise.
Julien
On 29/06/2018 14:19, Nico D wrote:
Hi Julien,
Below is a screenshot of my job config. I don't have any transformations except for the default "TikaOCR" that is included, which I am not using.
Did you run a job that was in a "Done" status?
If so, MCF will not send the documents to Solr because it does not detect any changes in the already crawled files. You need to make a copy of the job and run the fresh copy to avoid this behavior.
If not, and you ran a new job, what does the MCF Simple History say about the crawl?
The warning message of Solr is not an issue.
Julien
From: Nico D
Sent: Thursday, July 5, 2018 10:13
| Start Time | Activity | Identifier | Result Code | Bytes | Time | Result Description |
|---|---|---|---|---|---|---|
| 07-05-2018 08:03:13.973 | job end | 1530772428261(newtelco-daten_NoTikaOut2) | 0 | 1 | | |
| 07-05-2018 06:50:00.162 | document ingest (DatafariSolrNoTika) | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20... | OK | 1159 | 16 | |
| 07-05-2018 06:50:00.137 | limit [contentLimiter] | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20... | OK | 1159 | 1 | |
| 07-05-2018 06:50:00.068 | extract [TikaOCR] | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20... | OK | 1159 | 58 | |
| 07-05-2018 06:49:59.951 | access | smb://192.168.1.239/newtelco-daten/Technik/maintenance works/... | OK | 54784 | 249 | |
| 07-05-2018 06:49:50.359 | document ingest (DatafariSolrNoTika) | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20... | OK | 1155 | 23 | |
| 07-05-2018 06:49:50.315 | limit [contentLimiter] | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20... | OK | 1155 | 1 | |
| 07-05-2018 06:49:50.280 | extract [TikaOCR] | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20... | OK | 1155 | 26 | |
| 07-05-2018 06:49:50.168 | access | smb://192.168.1.239/newtelco-daten/Technik/maintenance works/... | OK | 66048 | 742 | |
When I look back at your job configuration, I can see that you enabled file security and share security! So if you did not configure an Active Directory in Datafari, you will not be able to see the indexed documents, as you need to be authenticated as an Active Directory user.
You can check the number of documents indexed in Solr thanks to the admin UI:
You basically have 2 options:
1. You don't really care about respecting the ACLs of your documents during search, as you just want to use Datafari as a demo. So either you change the configuration of your job to disable security and you recrawl, or you remove the security check module of Solr by removing/commenting the following lines in /opt/datafari/solr/solrcloud/FileShare/conf/solrconfig.xml for the "/select" request handler:
<lst name="appends">
  <str name="fq">{!manifoldCFSecurity}</str>
</lst>
Of course, you will then need to upload the new conf again (Upload then Reload through the ZooKeeper UI in the admin UI). With this solution, no need to recrawl!
2. You want to have security enabled for your search, and so you need to configure an Active Directory for Datafari.
Regards,
Julien
From: Nico D
Sent: Thursday, July 5, 2018 12:32
No problem Nico,
As I mentioned in a previous mail, we are interested in one or a few of the files that had a problem with the sequence_number field. As a thank-you for our help on your issues, we would really appreciate it if you shared some of your problematic files with us, as they are apparently triggering a bug that we'd like to handle properly.
Regards,
Julien
From: Nico D
Sent: Thursday, July 5, 2018 14:55
To: Datafari
Subject: Re: "No Results Found" - Job Done Successfully and Autocomplete Suggestions working
Hi Julien,
I do not need any additional security; this will be on a server with a private IP only, and all users should have the same level of access.
After disabling/commenting out the manifoldCFSecurity appends and reloading the search page, I have indeed received results!
According to my Solr statistics page, I have 431598 docs available!
Thanks for everything!
LG
Nico
On Thursday, July 5, 2018 at 2:11:00 PM UTC+2, julien.massiera wrote:
When I look back at your job configuration, I can see that you enabled file security and share security! So if you did not configure an Active Directory in Datafari, you will not be able to see the indexed documents, as you need to be authenticated as an Active Directory user.
You can check the number of documents indexed in Solr thanks to the admin UI:
You basically have 2 options:
1. You don't really care about respecting the ACLs of your documents during search, as you just want to use Datafari as a demo. So either you change the configuration of your job to disable security and you recrawl, or you remove the security check module of Solr by removing/commenting the following lines in /opt/Datafari/solr/solrcloud/FileShare/conf/solrconfig.xml for the "/select" request handler:
<lst name="appends">
  <str name="fq">{!manifoldCFSecurity}</str>
</lst>
Of course, you will then need to upload the new conf again (Upload then Reload through the ZooKeeper UI in the admin UI). With this solution, no need to recrawl!
2. You want to have security enabled for your search, and so you need to configure an Active Directory for Datafari.
Hi Nico,
Yes, the log files of Solr are located in /opt/datafari/logs.
Julien
From: Nico D
Sent: Thursday, July 5, 2018 15:39
Thanks Nico,
I realize I forgot to answer your questions:
Yes, you can remove the function that shortens the links in the results. The file involved is /opt/datafari/tomcat/webapps/Datafari/js/AjaxFranceLabs/widgets/SubClassResult.widget.js, and you need to replace this line:
elm.find('.doc:last .address').append('<span>' + AjaxFranceLabs.tinyUrl(decodeURIComponent(url)) + '</span>');
with this one:
elm.find('.doc:last .address').append('<span>' + decodeURIComponent(url) + '</span>');
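For illustration, the only difference between the two lines is whether the decoded URL goes through the tinyUrl shortener; decodeURIComponent itself just un-escapes the path (the sample URL below is made up):

```javascript
// MCF emits file URLs with percent-encoded characters; decodeURIComponent
// restores the human-readable form that the widget then displays.
const url = "file://///192.168.1.239/newtelco-daten/Technik/maintenance%20works/report.pdf";
console.log(decodeURIComponent(url));
// → file://///192.168.1.239/newtelco-daten/Technik/maintenance works/report.pdf
```

The original line then passed this decoded string through AjaxFranceLabs.tinyUrl, which truncates it for display; dropping that call leaves the full decoded URL.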
Concerning the fact that you are not able to directly open the files from the search results: it is a browser limitation, and you need to configure your browser. We have this documentation that may help, although browser updates may have made it outdated: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/14745633/Browsers+Configuration+to+Open+Files
Julien
From: Nico D
Sent: Friday, July 6, 2018 12:32
I think you just need to empty your browser cache.
From: Nico D
Sent: Friday, July 6, 2018 13:46
To: Datafari
Subject: Re: "No Results Found" - Job Done Successfully and Autocomplete Suggestions working
Hi Julien,
Thanks for the tip.
However, I have now removed the AjaxFranceLabs.tinyUrl function wrapped around the decodeURIComponent(url) variable, and it still prints the truncated URL on the results page.
I have also restarted Datafari via the provided restart-datafari.sh script. Do I need to upload and reload the ZooKeeper config(s) again for this change to apply?
See screenshot:
--
You received this message because you are subscribed to the Google Groups "Datafari" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datafari+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
The local link add-ons do not work with Datafari because of our redirection; you really need to do the manual configuration.
From: Nico D
Sent: Friday, July 6, 2018 14:05
To: Datafari
Subject: Re: "No Results Found" - Job Done Successfully and Autocomplete Suggestions working
Hi Julien,