"No Results Found" - Job Done Successfully and Autocomplete Suggestions working


N Dom

May 30, 2018, 5:09:51 PM
to Datafari
Hi All,

So I've got Datafari 4.0.0 up and running in a Docker container (from the datafari/datafariv4 image from the Docker marketplace) on a 64-bit Ubuntu 16.04 system.

I added the repo and the job as per the tutorial. The job has run and now has the status "Done".

On the search page, the autocomplete suggestions are even appearing successfully.

However, every search term results in "No Results Found".

Is there anything else one must do in order to successfully index documents and get them to show up?!

Thanks
Nico

cedric...@francelabs.com

May 31, 2018, 9:48:31 AM
to Datafari
Hi Nico,

Some questions for you, as we need some context:
1. Which tutorial did you follow? (please provide a link)
2. Can you give us a screenshot of your repo and job?
3. Do you have any ACLs involved?
4. How much RAM and how many cores did you give to your Docker image?

Regards,

Cedric

N Dom

May 31, 2018, 1:06:36 PM
to Datafari
Hi Cedric,

Thanks for answering,

I am following the instructions from the France Labs blog post about the Docker images. I can't seem to find the link anymore, unfortunately.

It is on a server at work and it's a bank holiday here in Germany today, so I also can't take a screenshot for you at the moment, but I will post one tomorrow.

There are no ACLs involved.

The server is running on a VM in our datacenter with 8 GB of RAM dedicated to it and an 8-core CPU. I didn't make any special configuration for the Docker image.

I definitely do remember seeing on the Solr config page that the memory is basically full - I think 7.9/8 GB or something like that. It was the very top info bar on the Solr config home page; I forget the exact label, as I can't access the machine at the moment.

N Dom

May 31, 2018, 1:21:52 PM
to Datafari
The memory being full seems like a legitimate reason for Solr to crash when scanning large Samba shares.

Unfortunately I'm not sure if it's the RAM that it's talking about or the HDD space available to the Docker image.

Like I said, it's the top "bar" on the main Solr config page. Maybe someone else with access could check briefly?

Thanks
Nico

N Dom

May 31, 2018, 7:19:24 PM
to Datafari

Screenshots incoming!

https://lh3.googleusercontent.com/-cR25apJmO6o/WxCCLkPJmII/AAAAAAAAiFo/85ahah0z-CQzQsAdp31BLKJQZiElJtoOQCLcBGAs/s320/solr_full_memory.png

https://lh3.googleusercontent.com/-epL2AELA5Qs/WxCCFVQBT4I/AAAAAAAAiFg/xWj56wJQY-8hsjekvJdutkl2ZKQmqAPMQCLcBGAs/s320/mcf_job.png
https://lh3.googleusercontent.com/-StbsvKRXR0s/WxCCJzRs18I/AAAAAAAAiFk/dmVpuBpjoPsUXVvfi7UZoC80C4awOMtPwCLcBGAs/s320/mcfrepo.png
Olivier Tavard

Jun 1, 2018, 3:44:34 PM
to Datafari
Hello,

First, thanks for the info about your issue, it will be easier to help you ;)
However, I will need a little bit more information:
In the Solr admin UI, could you click on Logging and take a screenshot, so we can check whether any errors are displayed?
About the Solr screenshot you sent: was it the Solr status right after you launched the job, or did you stop and start the container again in the meantime?
Because the first bar being full at 8 GB RAM is "normal": it is the amount of RAM used by the whole server when you saw the page. Regarding Solr only, it is the last bar at the bottom: JVM memory.

Concerning ManifoldCF, could you take a screenshot of the job status screen, i.e. in the Job menu click on "Status and Job management".

About the requirements, the minimum amount of RAM is 16 GB without ELK, and 32 GB is recommended when you activate ELK, for the Community edition of Datafari (https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/110788626/Hardware+requirements+Community+Edition).

Please send me the screenshots to investigate your problem.

Best regards,

Olivier Tavard

N Dom

Jun 4, 2018, 5:32:37 PM
to Datafari
Hi Olivier,

Thanks for the reply. 

So to answer some of your questions: that screenshot of the Solr Admin homepage was taken shortly after stopping and starting the container again.

For your next requested screenshots, here you go:

https://lh3.googleusercontent.com/-GsU2v2PIuKE/WxWu2wSvBxI/AAAAAAAAiMs/cHVByKrAtkISp4IbXgJ47Mo76QJG5IoYwCLcBGAs/s320/solr_log.png

As you can see, there are some errors and warnings for Solr. Unfortunately I cannot decipher what any of them mean. What exactly is the Overseer in Solr terms?

https://lh3.googleusercontent.com/-LwdQ_T5kGa8/WxWvMM7MuMI/AAAAAAAAiM0/IAgaccZ3IKkKOiLr9Fh7ke_K1muGnw4FACLcBGAs/s320/mcf_job_status.PNG

Next, as you can see, I have two jobs at the moment - one for each SMB share. The top one, which finished "Done" successfully, is for a relatively small share (879 docs), and the bottom one is a much bigger share. It hung up and crashed after a while, and even now, after refreshing and waiting a day or two, I cannot restart this bottom job. It seems to have died entirely on me. Unfortunately the Docker container's latest commit includes the job at this status, so I can't really jump back to before it was in this crashed/aborted state.

Maybe I could just re-add the share in a new job?

I also unfortunately only have access to this machine at the moment, with 8 GB of RAM total. I do have access to another server with a static public IP, but it only has that one NIC, so it is not on my internal network and cannot see the SMB shares.

Thanks!

Olivier Tavard

Jun 22, 2018, 5:00:21 AM
to Datafari
Hi,

Thanks for the screenshots.
It is most likely the low amount of RAM for Solr that is causing your issue: the JVM pauses all threads when running a full GC, and if the GC takes longer than the ZooKeeper session timeout, ZooKeeper will consider the Solr instance offline.
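This theory can be checked directly: with GC logging enabled on the Solr JVM, pauses longer than the ZooKeeper session timeout show up in the log. These are generic Java 8 HotSpot flags, not taken from this thread, and the log path is only an example.

```shell
# Append to the Solr JVM options (e.g. SOLR_OPTS in solr.in.sh); then look in
# the log for long "application threads were stopped" entries, which are the
# stop-the-world pauses that can exceed the ZooKeeper session timeout.
SOLR_OPTS="$SOLR_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/solr_gc.log"
```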
If you cannot increase the amount of RAM, a workaround is to decrease the number of fetches per minute that MCF performs on your filer.
To do that, edit your MCF repository connector, go to the Throttling tab, and enter for example 10 for the number of fetches/min:

https://lh3.googleusercontent.com/-LCEYI_yQtAw/Wyy6iDiz18I/AAAAAAAAJTw/1qbHyhsrpc0RfpzAs80RDummZ7pRpljmwCLcBGAs/s320/mcf_throttling.jpeg
I hope that it helps.


Best regards,


Olivier

Nico D

Jun 26, 2018, 4:41:40 AM
to Datafari
Hi All,

Thanks for the suggestions. I was able to dig a 16-core server with 50 GB of RAM out of our company's storage and run Datafari "natively" on it. Everything is working well. Looks like it really was just a lack of RAM.

Nico

Nico D

Jun 27, 2018, 3:05:25 AM
to Datafari
Hi Olivier,

Sorry to have to come back to this thread, but it apparently is not actually working as planned.

The crawling process went much further, but still seemed to crash, and my job now has the error message that no Solr instance is online (hint: it crashed, lol).
Error message:

Error: Unhandled SolrServerException: No live SolrServers available to handle this request:[http://localhost:8983/solr/FileShare]

I am able to restart Datafari through the provided restart-datafari.sh script and try again, but receive the same error shortly thereafter.

Like I mentioned previously, this new machine has about 50 GB of RAM, a Xeon X5560, and 500 GB of HDD space, so I hope performance is not an issue this time.

Below is a screenshot of my Solr config page this time around:

http://imgur.com/2vnisZ7l.png

Nico D

Jun 27, 2018, 6:54:45 AM
to Datafari
Hi All,

I can't seem to find an edit button on this Google Groups system, so I will just post the additional info in this reply.

So I also checked the Solr log - this was more or less empty; the only contents were:

Time (Local)          | Level | Core  | Logger    | Message
6/27/2018, 8:55:54 AM | WARN  | false | UpdateLog | Starting log replay tlog{file=/opt/datafari/solr/solr_home/Statistics_shard1_replica1/data/tlog/tlog.0000000000000000005 refcount=2} active=false starting pos=0
6/27/2018, 8:55:54 AM | WARN  | false | UpdateLog | Log replay finished. recoveryInfo=RecoveryInfo{adds=2 deletes=0 deleteByQuery=0 errors=0 positionOfStart=0}
Then I also took a screenshot of htop during my current crawling session - I restarted it minimally now, and it seems to have been running much longer at least. It has not crashed yet. Does a minimal start automatically apply a throttle or something of that nature?

See screenshot: 

Julien

Jun 27, 2018, 7:45:34 AM
to Datafari

Hi Nico,

 

I think your problem is not the amount of RAM you dedicated to Solr, but the fact that some files contain too much text to index.

This is why I suggest you change the configuration of your job to add a content limiter. You can follow this documentation: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/100663333/Use+the+content+limiter+transformation+connector

Note that you will need to add a Tika connector, as from what I understand, you are currently using the embedded Tika of Solr, which I don't recommend. By doing this, you will also need to allocate more RAM to MCF, at least 3 GB (location: DATAFARI_HOME/mcf/mcf_home/option.env.unix).

Let us know if it fixes your issue or if you have difficulties configuring your job.

 

Regards,

Julien Massiera

 


Nico D

Jun 27, 2018, 7:50:23 AM
to Datafari
Hi Julien,

Thanks for your reply. 

I am currently using the DatafariSolr output - I meant to use the 'NoTika' one - let me try that again with the correct output at least.

Can one not specify a max size through "Content Length" in the Job Config?

Julien

Jun 27, 2018, 8:07:29 AM
to Datafari

Be careful!

The DatafariSolrNoTika output connector is meant to be used only if a Tika connector is configured in your job config! If you use it without a Tika connector in your job, you will break the crawl!

To use the content limiter, you will need to add a Tika connector to your job.

The Content Length setting will skip documents above the indicated content length, but it cannot guarantee that the pure text content of a document will not exceed that length (e.g. a zip archive generates more content after a Tika extraction).
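A toy illustration of that point, in generic shell (nothing Datafari-specific here): the on-disk size of a compressed file says very little about how much text an extraction step will get out of it.

```shell
# Generate ~3.6 MB of repetitive text, then compress it: the archive ends up a
# few kilobytes, so a size check on the archive alone would not have stopped
# the large extracted text.
tmp=$(mktemp -d)
yes "the same line of text over and over" | head -n 100000 > "$tmp/big.txt"
gzip -k "$tmp/big.txt"
wc -c "$tmp/big.txt" "$tmp/big.txt.gz"
```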

 

Julien

 


Nico D

Jun 29, 2018, 5:19:57 AM
to Datafari
Hi Julien,

So I ended up purging and reinstalling Datafari from scratch.

I added the JCIFS connector according to the instructions in the documentation, and added a new repo and job for the same SMB share.

This time I only included a subfolder with about 1,500 documents in it and limited it to 5 connections at a time.

This, however, still led to Solr crashing eventually.

Am I missing some step or something here?

Best regards,
Nico

Julien Massiera

Jun 29, 2018, 5:34:25 AM
to data...@googlegroups.com

Hi Nico,

Can you please send a screenshot of your job configuration view, and also the list of transformation connections?

Regards,
Julien

-- 
Julien MASSIERA
Datafari Product Manager
France Labs - The Search Experts
Winner of the EY Internal Search challenge at Viva Technology 2016
www.francelabs.com

Nico D

Jun 29, 2018, 8:19:30 AM
to Datafari
Hi Julien,

Below is a screenshot of my job config. I don't have any transformations except for the default "TikaOCR" that is included, which I am not using.

https://lh3.googleusercontent.com/-H-6jjN17h7k/WzYjzV_iZ7I/AAAAAAAAi_0/UY5123uKTjMjwu19qX8_krkwvh7ksFQ-ACLcBGAs/s320/datafari_mcf_jobConfig.png
Julien Massiera

Jun 29, 2018, 8:30:00 AM
to data...@googlegroups.com

Hi Nico,

Thanks for the screenshot. To be honest, I'm not certain why you didn't apply my recommendations to get rid of your issues, but I certainly won't be of any help if you don't give them a try...
Are there any steps in my explanation that may have been unclear? If so, let me know and I'll try to be more precise.

Julien

Nico D

Jun 29, 2018, 8:59:39 AM
to Datafari
Hi Julien,

I wanted to try it from scratch, after a purge and fresh reinstall. I guess the only option left is to try your suggestion with the content limiter transformation then.

As far as I understand, I have to add the preexisting "TikaOCR" transformation and then, immediately after it, the newly created "contentLimiter" transformation. Correct?

I will try this now and get back to you guys.

Best regards,
Nico

Nico D

Jun 29, 2018, 10:57:52 AM
to Datafari
Hi Julien,

I am trying it now with your content limiter, with the following job config:

https://lh3.googleusercontent.com/-jXb8M8aXbUw/WzZI58CocOI/AAAAAAAAjAA/zMeGXpXuwK4SBFH_AE21YzBwOri-8QUhwCLcBGAs/s320/datafari_contentLimiter_JobDetails.png

It seems to be running a bit better, but it slows to a crawl after about 1,000 documents. Solr doesn't seem to be crashing, but it doesn't seem to be going forward either.

Julien Massiera

Jul 2, 2018, 10:21:27 AM
to data...@googlegroups.com

Hi Nico,

Is the job still in a frozen state? Do you have more information about the status in the Simple History of MCF?

Also, did you allocate more RAM to MCF than the default configuration, in the option.env.unix file located in /opt/datafari/mcf/mcf_home?

Regards,
Julien

Nico D

Jul 2, 2018, 10:29:38 AM
to Datafari
Hi Julien,

Unfortunately it is still in a frozen state. I did not allocate more RAM, because on this new machine Datafari is running natively (not in a Docker container) and the Solr config says it has all 50 GB available.

Do I still need to somehow allocate more in the option.env.unix file?

This is my Solr log - only some warnings about Tika:

https://lh3.googleusercontent.com/-Rjd6E8jSYmU/Wzo2n_gy_QI/AAAAAAAAjDA/ezIgt4noSJk-IcLRJZUI96ZhMrDSb3cegCLcBGAs/s320/SolrLogs.png

I am still using the same job config as in my previous screenshots - with the TikaOCR transformation and the contentLimiter transformation after it.


Best regards,
Nico

Julien Massiera

Jul 2, 2018, 10:56:36 AM
to data...@googlegroups.com

Yes, I think the job is frozen because your MCF agent may have crashed due to a low-memory issue. I suggest you stop Datafari and allocate 5 GB of memory to MCF by changing the Xms and Xmx values in the option.env.unix file:

-Xms5120m
-Xmx5120m

During a crawl, Tika is the component that uses a lot of memory to extract the content of files. The job config that I suggested, and that you implemented, uses the Tika of MCF instead of Solr's one. This is why you really need to allocate more memory to MCF than the default configuration, which should be 256 MB.

In the same way, you also need to explicitly allocate more RAM to Solr, in the solr.in.sh file located in /opt/datafari/solr/bin:

SOLR_JAVA_MEM="-Xms10240m -Xmx10240m"

Finally, you can now switch from the DatafariSolr output to the DatafariSolrNoTika in your job config.
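For reference, the two heap edits above can be scripted. This is only a sketch: it assumes a default layout where the MCF options file contains single -Xms/-Xmx lines and solr.in.sh a single SOLR_JAVA_MEM line; the helper name is made up, and GNU sed is assumed.

```shell
# Hypothetical helper applying the heap sizes suggested in this thread.
set_datafari_heaps() {
  mcf_opts=$1   # e.g. /opt/datafari/mcf/mcf_home/option.env.unix
  solr_in=$2    # e.g. /opt/datafari/solr/bin/solr.in.sh
  # 5 GB heap for the MCF agent:
  sed -i 's/^-Xms.*/-Xms5120m/; s/^-Xmx.*/-Xmx5120m/' "$mcf_opts"
  # 10 GB heap for Solr:
  sed -i 's/^SOLR_JAVA_MEM=.*/SOLR_JAVA_MEM="-Xms10240m -Xmx10240m"/' "$solr_in"
}
# Usage (restart Datafari afterwards so the new heaps take effect):
#   set_datafari_heaps /opt/datafari/mcf/mcf_home/option.env.unix \
#                      /opt/datafari/solr/bin/solr.in.sh
```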

Julien

Nico D

Jul 3, 2018, 10:48:17 AM
to Datafari
Hi Julien,

I have edited the aforementioned config files to allocate more RAM to MCF and Solr, and added a new job with the DatafariSolrNoTika output.

This is running, so far, pretty well! (Knock on wood!)

The SMB share is fairly large, but it has been reading and analyzing documents for at least an hour now, when previously it would stop within 5-10 minutes.

Thanks again, and I will keep you all up to date in case anything else comes up.

Best regards,
Nico

Nico D

Jul 4, 2018, 11:01:08 AM
to Datafari
Julien,

So the crawl of about 500,000 documents went through seemingly successfully. It took about 1.5 days, but did not hang and ended with a status of 'Done'.

Now, unfortunately, I still do not get any results, no matter what I search for. The search suggestions still appear and work; however, no results are ever produced.

I checked the Solr log and have a bunch of the following types of errors at the end:

https://lh3.googleusercontent.com/-salS-FAimYM/WzzhIKRNRVI/AAAAAAAAjGA/hgiY2jYfh6Mb_zIWsz6uzMpFczLP_W4CACLcBGAs/s320/Solr_ErrorSequenceNum.png

Could that have anything to do with not receiving any results in the search?

Best regards,
Nico




Julien Massiera

Jul 4, 2018, 11:30:43 AM
to data...@googlegroups.com

Hi Nico,

Indeed, no documents have been indexed, because of an error on the sequence_number field, which is an EXIF field. This is a very interesting case that I have never seen before; could you please send us at least one of the files that are not indexed?

In the meantime, you can ignore the field by editing the /opt/datafari/solr/solrcloud/FileShare/conf/solrconfig.xml file: find the /update/no-tika and /update/extract request handlers and add the following line to the defaults list: <str name="fmap.sequence_number">ignored_</str>

You should obtain :

<requestHandler class="com.francelabs.datafari.handler.parsed.ParsedRequestHandler" name="/update/no-tika" startup="lazy">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.language">ignored_</str>
    <str name="fmap.source">ignored_</str>
    <str name="fmap.sequence_number">ignored_</str>
    <str name="uprefix">ignored_</str>
    <str name="update.chain">datafari</str>
  </lst>
</requestHandler>

<requestHandler class="solr.extraction.ExtractingRequestHandler" name="/update/extract" startup="lazy">
  <lst name="defaults">
    <str name="scan">false</str>
    <str name="captureAttr">true</str>
    <str name="lowernames">true</str>
    <str name="fmap.language">ignored_</str>
    <str name="fmap.source">ignored_</str>
    <str name="fmap.sequence_number">ignored_</str>
    <str name="fmap.url">ignored_</str>
    <str name="uprefix">ignored_</str>
    <str name="update.chain">datafari</str>
    <bool name="ignoreTikaException">true</bool>
    <str name="tika.config">/opt/datafari/solr/solrcloud/FileShare/conf/tika.config</str>
  </lst>
</requestHandler>

After this modification, you will need to upload the new conf through the admin UI: Search Engine Configuration -> ZooKeeper, then click on Upload, wait a few seconds, then click on Reload.

Unfortunately, you will need to recrawl your share after this modification (make a copy of your job and run the copy).

Regards,
Julien

Nico D

Jul 5, 2018, 2:39:20 AM
to Datafari
Hi Julien,

Thanks again. So I have updated the defaults lists for the two requestHandlers mentioned below, uploaded my new config, reloaded, copied my job, and started it again.

This time I selected only a few subfolders, so as not to crawl the entire SMB share yet - for testing, and so it doesn't take ~1.5 days, haha.

The paths where I received the errors in my previous screenshot all seemed to be really long, including some file names that were also entirely too long. Could this be an issue?

Also, my /update/extract handler did not have the tika.config default value like the one you listed; mine looked like this after adding the sequence_number ignore:

<requestHandler class="solr.extraction.ExtractingRequestHandler" name="/update/extract" startup="lazy">
  <lst name="defaults">
    <str name="captureAttr">true</str>
    <str name="lowernames">true</str>
    <str name="fmap.language">ignored_</str>
    <str name="fmap.source">ignored_</str>
    <str name="fmap.sequence_number">ignored_</str>
    <str name="uprefix">ignored_</str>
    <str name="update.chain">datafari</str>
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>

Is that a problem?

Thanks!

Julien

Jul 5, 2018, 3:26:11 AM
to Datafari

Hi Nico,

 

I don't think the path length is an issue. If MCF didn't complain and Solr received the files, it's OK.

The missing tika.config is not a problem: I pasted my own Solr configuration, which includes some custom modifications such as this one. You can remove the tika.config line.

 

Julien

 


Nico D

Jul 5, 2018, 3:38:03 AM
to Datafari
OK, thanks - I will let this current (shorter) crawl run and report back whenever it's finished.

Nico D

Jul 5, 2018, 4:13:26 AM
to Datafari
Hi Julien,

The crawl with ~50,000 documents has now ended successfully - unfortunately, there are still no results, no matter what one searches for.

The Solr log has produced only one entry, and it is the following warning:

Julien

Jul 5, 2018, 4:48:03 AM
to Datafari

Did you run a job that was in a "Done" status?

If so, MCF will not send the documents to Solr, because it does not detect any changes in the already crawled files. You need to make a copy of the job and run the fresh copy to avoid this behavior.

If that is not the case and you ran a new job, what does the MCF Simple History say about the crawl?

The warning message from Solr is not an issue.

 

Julien

 

 


Nico D

Jul 5, 2018, 6:32:25 AM
to Datafari
Hi Julien,

I copied my old job to create a new one, also changed the path that it crawls (to keep the document count and time down), and ran that this morning.

The MCF Simple History looks good as well. For example, the last few rows:

Start Time              | Activity                             | Identifier                                                                                                                        | Result Code | Bytes | Time
07-05-2018 08:03:13.973 | job end                              | 1530772428261(newtelco-daten_NoTikaOut2)                                                                                          |             | 0     | 1
07-05-2018 06:50:00.162 | document ingest (DatafariSolrNoTika) | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20works/2018/Maintenance/Virtela/incoming%20Vimpelcom%2004.07.2018.msg | OK          | 1159  | 16
07-05-2018 06:50:00.137 | limit [contentLimiter]               | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20works/2018/Maintenance/Virtela/incoming%20Vimpelcom%2004.07.2018.msg | OK          | 1159  | 1
07-05-2018 06:50:00.068 | extract [TikaOCR]                    | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20works/2018/Maintenance/Virtela/incoming%20Vimpelcom%2004.07.2018.msg | OK          | 1159  | 58
07-05-2018 06:49:59.951 | access                               | smb://192.168.1.239/newtelco-daten/Technik/maintenance works/2018/Maintenance/Virtela/incoming Vimpelcom 04.07.2018.msg           | OK          | 54784 | 249
07-05-2018 06:49:50.359 | document ingest (DatafariSolrNoTika) | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20works/2018/Maintenance/Virtela/Outgoing%20Virtela%2004.07.2018.msg   | OK          | 1155  | 23
07-05-2018 06:49:50.315 | limit [contentLimiter]               | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20works/2018/Maintenance/Virtela/Outgoing%20Virtela%2004.07.2018.msg   | OK          | 1155  | 1
07-05-2018 06:49:50.280 | extract [TikaOCR]                    | file://///192.168.1.239/newtelco-daten/Technik/maintenance%20works/2018/Maintenance/Virtela/Outgoing%20Virtela%2004.07.2018.msg   | OK          | 1155  | 26
07-05-2018 06:49:50.168 | access                               | smb://192.168.1.239/newtelco-daten/Technik/maintenance works/2018/Maintenance/Virtela/Outgoing Virtela 04.07.2018.msg             | OK          | 66048 | 742

Julien

unread,
Jul 5, 2018, 8:11:00 AM
to Datafari

When I look back at your job configuration, I can see that you enabled file security and share security! So if you did not configure an Active Directory in Datafari, you will not be able to see the indexed documents, as you need to be authenticated as an Active Directory user.

 

You can check the number of documents indexed in Solr via the admin UI:


You basically have two options:

 

  1. You don't really care about respecting ACLs for your documents during search, as you just want to use Datafari as a demo. In that case, either change the configuration of your job to disable security (you will then need to recrawl), or remove the security check module of Solr by removing/commenting the following lines in /opt/Datafari/solr/solrcloud/FileShare/conf/solrconfig.xml for the « /select » request handler:

    <lst name="appends">

               <str name="fq">{!manifoldCFSecurity}</str>

</lst>

Of course, you will then need to upload the new conf again, with an upload followed by a reload through the ZooKeeper UI in the admin UI. With this solution, there is no need to recrawl!


  2. You want to have security enabled for your search, in which case you need to configure an Active Directory for Datafari.

 

Regards,
Julien

 

From: Nico D
Sent: Thursday, July 5, 2018 12:32

Nico D

unread,
Jul 5, 2018, 8:55:06 AM
to Datafari
Hi Julien,

I do not need any additional security, this will be on a server with a private IP only and all users should have the same level of access.

After disabling / commenting out the manifoldCFSecurity appends and reloading the search page, I have indeed received results!

According to my Solr statistics page, I have 431598 docs available!

Thanks for everything!

LG
Nico

Nico D

unread,
Jul 5, 2018, 9:13:39 AM
to Datafari
Hi Julien,

The only functionality that is still missing for me, and would be great to have, is the ability to open a file directly from the search result page.

I see that when you click on the link, Datafari is supposed to redirect to file:///PATH/TO/FILE.xyz, however this does not seem to work. I've tested it on my Ubuntu 18.04 desktop as well as a colleague's Win 10 machine.

This file:/// redirect never goes anywhere or does anything - it just opens a new empty browser tab. Is there anything else that must be enabled to allow Windows, for example, to deal with file:// type URLs? How about on Datafari's side?

Thanks
Nico

Julien

unread,
Jul 5, 2018, 9:24:05 AM
to Nico D, Datafari

No problem Nico,

As I mentioned in a previous mail, we are interested in one or a few of the files that had a problem with the sequence_number field. As a thanks for our help on your issues, we would really appreciate it if you could share some of your problematic files with us, as they are apparently triggering a bug that we'd like to handle properly.

Regards,
Julien

From: Nico D
Sent: Thursday, July 5, 2018 14:55

To: Datafari
Subject: Re: "No Results Found" - Job Done Successfully and Autocomplete Suggestions working

 


Nico D

unread,
Jul 5, 2018, 9:25:24 AM
to Datafari
Hi Julien,

Alternatively - if we could change a config value to print the entire path instead of shortening it - this could be a decent workaround too. I've looked in solrconfig.xml and didn't find anything to that effect, unfortunately.

LG
Nico

Nico D

unread,
Jul 5, 2018, 9:39:33 AM
to Datafari
Hi Julien,

Oh yeah, no problem - I tried to get you some of those files initially when we were talking about that, but the paths were so long that they weren't fully displayed in the screenshot, so I couldn't ascertain which files exactly it was complaining about.

Now those errors are also no longer in the Solr log. Does Solr also save these logs somewhere on disk where one can look at the files themselves? That way I can try to find the exact paths / files that it was complaining about for you guys.

Thanks!

Oh and btw - just to reiterate my previous question in case it got lost - is there any way to print the full file path in the results page?

LG
Nico



Julien

unread,
Jul 6, 2018, 5:10:30 AM
to Datafari

Hi Nico,

 

Yes the log files of Solr are located in /opt/datafari/logs

 

Julien

 

From: Nico D
Sent: Thursday, July 5, 2018 15:39


Nico D

unread,
Jul 6, 2018, 6:32:30 AM
to Datafari
Hi Julien,

Okay that was easy - should have probably looked there myself ;) 

Anyway, one of the files it could not index because of the missing sequence_number field was an image - I will attach it below.


LG
Nico

Julien

unread,
Jul 6, 2018, 7:25:06 AM
to Datafari

Thanks Nico,

 

I realize I forgot to answer your questions:

 

Yes, you can remove the function that shortens the links in the results. The involved file is /opt/datafari/tomcat/webapps/Datafari/js/AjaxFranceLabs/widgets/SubClassResult.widget.js and you need to replace this line:
elm.find('.doc:last .address').append('<span>' + AjaxFranceLabs.tinyUrl(decodeURIComponent(url)) + '</span>');

 

By this one :
elm.find('.doc:last .address').append('<span>' + decodeURIComponent(url) + '</span>');
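For illustration only, here is a hypothetical sketch of what a tinyUrl-style shortener typically does (the function name, truncation lengths, and sample path below are assumptions, not Datafari's actual implementation). It shows why the address line appears truncated until the call is removed:

```javascript
// Hypothetical sketch, NOT Datafari's actual tinyUrl implementation:
// collapse the middle of long URLs so they fit the result card, keeping
// the start (scheme/host) and the end (file name).
function tinyUrlSketch(url, maxLen) {
  if (url.length <= maxLen) return url;
  const keepStart = Math.floor(maxLen / 2);
  const keepEnd = maxLen - keepStart - 3; // reserve 3 chars for '...'
  return url.slice(0, keepStart) + '...' + url.slice(url.length - keepEnd);
}

const full = 'file://///192.168.1.239/newtelco-daten/Technik/some/deep/folder/report.msg';
console.log(tinyUrlSketch(full, 40)); // collapsed middle, 40 chars total
console.log(full);                    // what the edited widget line prints instead
```

Swapping the tinyUrl call for the raw decodeURIComponent(url) value simply skips this kind of collapsing step.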

 

Concerning the fact that you are not able to directly open the files from the search results, it is a browser limitation and you need to configure it. We have this documentation that may help you, but it is possible that with browser updates, it is outdated : https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/14745633/Browsers+Configuration+to+Open+Files

 

Julien

 

From: Nico D
Sent: Friday, July 6, 2018 12:32


Nico D

unread,
Jul 6, 2018, 7:46:44 AM
to Datafari
Hi Julien,

Thanks for the tip.

However, I have now removed the AjaxFranceLabs.tinyUrl function wrapping the decodeURIComponent(url) call, and it still prints the truncated URL on the results page.

I have also restarted Datafari via the provided restart-datafari.sh script. Do I need to upload and reload the zookeeper config(s) again for this change to apply?

See screenshot:



Julien

unread,
Jul 6, 2018, 7:48:20 AM
to Datafari

I think you just need to empty your browser cache.

 

From: Nico D
Sent: Friday, July 6, 2018 13:46

To: Datafari
Subject: Re: "No Results Found" - Job Done Successfully and Autocomplete Suggestions working

 


Nico D

unread,
Jul 6, 2018, 7:50:20 AM7/6/18
to Datafari
Hi Julien,

Ahhh yes - just tried it in a new incognito tab and it looks great. It doesn't overflow past the rest of the result text.

Great job! 

LG
Nico


P.S. Were you able to download the picture file I had passed along earlier? I hope it helps you guys out!

Nico D

unread,
Jul 6, 2018, 8:05:15 AM
to Datafari
Hi Julien,

The full path prints nicely - however I still want to try and get the browser to open the file:// link correctly.

I have installed a Firefox addon called 'Local Link Addon' and it seems to work great on other pages, including their jsfiddle demo page.

The problem with the Datafari file:// redirects, when you click on a search result's file name, is that the redirect never really happens, so to speak.

The browser just stays there; it never attempts to redirect to the URL specified by "?url=xyz".

Is this normal?
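To make the question concrete, here is a hedged sketch (assumed behaviour and a made-up sample path, not Datafari's actual redirect code) of a redirect page that reads its target from a "?url=" query parameter. When such a page is served over http(s), browsers refuse the final navigation to a file:// URL, which would match the empty-tab behaviour described above:

```javascript
// Assumed redirect logic (illustrative only): parse the "?url=" parameter
// and navigate to it. Browsers block the file:// hop from http(s) pages.
const query = '?url=' + encodeURIComponent('file://///192.168.1.239/newtelco-daten/doc.msg');
const target = new URLSearchParams(query).get('url');
console.log(target); // the decoded file:// target
// In a real page this would be: window.location.href = target;
// -> silently blocked by the browser when the target uses the file:// scheme.
```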

LG
Nico



On Friday, July 6, 2018 at 1:48:20 PM UTC+2, julien.massiera wrote:

Julien

unread,
Jul 6, 2018, 8:17:20 AM
to Datafari

The local link addon does not work with Datafari because of our redirection; you really need to go through the manual configuration.

 

From: Nico D
Sent: Friday, July 6, 2018 14:05

To: Datafari
Subject: Re: "No Results Found" - Job Done Successfully and Autocomplete Suggestions working

 

