Windows Share Indexation


Sam Zas

Aug 9, 2019, 5:05:06 AM
to Datafari
Hello,

We have successfully installed Datafari 4.2.1 on Debian 9.8 with 8 GB RAM and an Intel Core i5 processor.

The goal is to index and search across several local NAS shares holding many large video files (from 500 MB to 50 GB per file) and data files (txt, Excel, etc.).

When we mount an SMB share on the Debian host running Datafari, we can create a File system type repository and a job to crawl it. It seems to work quite fast, but as I read in the documentation, this kind of repository is meant for testing purposes and is not very convenient for what we need to do.
However, when we try to create a "Windows shares" repository (SMB share) and a job to crawl it, we run into several problems:

The repository connection status says "Connection working", but:

1/ Datafari is copying a lot of data into the /tmp folder. Is it copying every video file there? When the partition is full, it flushes the tmp folder, but indexing is then very, very slow.
2/ After a while, the whole system fails:
We end up with gigabytes of logs in the /opt/datafari/mcf/mcf_home/nohup.out file, with thousands of messages like:

[main-SendThread(localhost:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
[main-SendThread(localhost:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)

or 
[Crawler idle cleanup thread-SendThread(localhost:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
[Worker thread '27'-SendThread(localhost:2181)] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
[Worker thread '27'-SendThread(localhost:2181)] WARN org.apache.zookeeper.ClientCnxn - Session 0x16c72292007004a for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

or...
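For reference, here is a small probe we could run while the errors occur, to check whether anything is still listening on the ZooKeeper port. It is a plain Python sketch, nothing Datafari-specific:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The "Connection refused" traces above suggest ZooKeeper itself is down;
# this tells us whether anything still listens on its default port 2181.
print("ZooKeeper port reachable:", port_open("localhost", 2181))
```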


The question is simple: what have we done wrong? :)
Subsidiary question: what do you need to investigate this problem (config files, etc.)?

Thanks for your help!

Sam Zas

Sep 2, 2019, 11:34:41 AM
to Datafari
Hello,

Could anyone help me on this point?

Thanks a lot!

Sam

Olivier Tavard

Sep 3, 2019, 2:07:29 AM
to Datafari
Hello,

Sorry for the late answer, and thank you for your detailed use case.

- Could you tell me the total number of files that you want to index into Datafari?
- Did you change the default RAM settings for the different components of Datafari, especially MCF and Solr?
- For the /tmp folder, how large is that partition on your system? You can change the default location (/tmp) to another one in /opt/datafari/tomcat-mcf/bin/setenv.sh by changing the line:
JAVA_OPTS="-Duser.timezone=UTC"
to:
JAVA_OPTS="-Duser.timezone=UTC -Djava.io.tmpdir=/PATH_TO_OTHER_TMP_FOLDER"
- Just to be sure: do you need to index the video files, or only the data files like xls, doc, pdf, etc.? If you just need to index data files, you can skip the video files in the job configuration.
If you need to index both, my recommendation would be to increase the amount of RAM for the system. 8 GB is not enough if you need to index very big files; 16 GB would be better.
- What is the size of the swap on your system?
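As a quick sanity check (plain Python, not part of Datafari), you can verify that the partition behind the temp folder can hold your largest documents:

```python
import shutil

def free_gib(path: str) -> float:
    """Free space (GiB) on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / (1024 ** 3)

# The partition behind java.io.tmpdir must be able to absorb the largest
# document the crawler will fetch (up to 50 GB in your use case).
print(f"/tmp free: {free_gib('/tmp'):.1f} GiB")
```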

Best regards,

Olivier

Samuel Zaslavsky

Sep 5, 2019, 5:46:55 AM
to Olivier Tavard, Datafari
Hello Olivier,

Thank you for your answer!
Please find below my answers to your questions:


- Could you tell me the total number of files that you want to index into Datafari?

Hard to say exactly, but let's say tens of thousands of files. Sizes range from 100 MB to 50 GB.
 
- Did you change the default RAM settings for the different components of Datafari, especially MCF and Solr?

No. But I checked those settings and found (wrongly?) that they were correct for our configuration and your recommendations.
 
- For the /tmp folder, how large is that partition on your system? You can change the default location (/tmp) to another one in /opt/datafari/tomcat-mcf/bin/setenv.sh by changing the line:
JAVA_OPTS="-Duser.timezone=UTC"
to:
JAVA_OPTS="-Duser.timezone=UTC -Djava.io.tmpdir=/PATH_TO_OTHER_TMP_FOLDER"

The /tmp folder is part of the / partition, whose size is 103 GB (55 GB free remaining).

 
- Just to be sure: do you need to index the video files, or only the data files like xls, doc, pdf, etc.? If you just need to index data files, you can skip the video files in the job configuration.
If you need to index both, my recommendation would be to increase the amount of RAM for the system. 8 GB is not enough if you need to index very big files; 16 GB would be better.

We need to index all the files, but we only need metadata for the video files: name, path, size, and other metadata embedded in the video file itself. We'll try to add memory, but the weird thing is that Datafari seems to download the whole files into the /tmp folder, which quickly leads to a full HDD, errors, etc. The process is also very, very slow (since the downloads take time) compared to the same operation on a mounted filesystem... Is that normal?
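To make the split we are after concrete (video files: metadata only; office files: full text), here is a small illustration in Python. The extension list is just an example, and in MCF this would be expressed with include/exclude rules in the job configuration rather than code:

```python
from pathlib import PurePosixPath

# Example extension list; adjust to the actual NAS content.
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".mxf"}

def wants_full_text(path: str) -> bool:
    """True if the file's content should be extracted and indexed;
    False if only name/path/size metadata is wanted (video files)."""
    return PurePosixPath(path).suffix.lower() not in VIDEO_EXTS

print(wants_full_text("/nas/reports/q3-2019.xlsx"))  # True: index content
print(wants_full_text("/nas/footage/rush01.MKV"))    # False: metadata only
```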
 
- What is the size of your  SWAP size into the system ?

Swap size is 8GB.

I hope all this information will help you to help me.
My goal is to create a small internal search engine based on Datafari, and with a little hacking be able to edit metadata (like author, some custom data, cover picture name, etc.).
We also have a few xls, docx, pdf, etc. whose content is relevant to us and should also be searchable...

Thanks again for your help!

Best regards, 

Samuel

--
You received this message because you are subscribed to the Google Groups "Datafari" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datafari+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/datafari/c899fb55-4e27-419a-87df-511d81e8e1dc%40googlegroups.com.

Samuel Zaslavsky

Oct 7, 2019, 5:47:48 AM
to Olivier Tavard, Datafari
Hello Olivier,

Sorry to insist, but I didn't get an answer to my previous message.

Thank you for your help.

Best regards,

Samuel

Olivier Tavard

Oct 8, 2019, 5:45:53 AM
to Datafari
Hello Samuel,

Sorry for the delay, and thanks for your detailed response.
If I understood your use case correctly, it is unusual to put Datafari on the same server as the one that contains the filer. It is also not normal to have HDD errors during the data transfer to the /tmp partition during indexation; we have not encountered these problems with our customers.
In your case, I think the best would be to use an SSD disk for Datafari, or at least to investigate why there are HDD problems during indexation. To mitigate the issue, you can try lowering the max connections in the repository connection configuration in MCF, as well as the number of fetches per minute.
Keep me updated about that.
Best regards,

Olivier