Hi,
As Ziqi has already replied, you can try JATE2.0, which is built on the Solr framework and can therefore process a large number of documents.
JATE2.0 is a language-independent tool, but you need language-dependent components to work with your language, typically a tokeniser and a part-of-speech (PoS) tagger.
We implemented the OpenNLP tokeniser and PoS tagger as Solr plugins, so you can either train a tokenisation model for Persian yourself (see the example at
https://github.com/rfarahmand/PersianPoSTagger) or use a pre-trained one.
If you have more advanced knowledge of Solr, you can also develop your own tokeniser/PoS Solr plugin (e.g., using the Stanford parser or a universal tagger) to work within JATE2.0.
For a language-independent candidate extraction method, you can try the n-gram based approach.
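To give a rough idea of what n-gram based candidate extraction does, here is a minimal Python sketch. It is not JATE2.0's implementation (JATE2.0 does this inside Solr). The names `tokenize` and `ngram_candidates`, the regex tokeniser and the `min_freq` threshold are assumptions made up for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive stand-in tokeniser (assumption): runs of word characters,
    # keeping the zero-width non-joiner (U+200C) used inside Persian words.
    # A proper Persian tokeniser would handle far more cases than this.
    return re.findall(r"[\w\u200c]+", text.lower())


def ngram_candidates(docs, min_n=1, max_n=3, min_freq=2):
    """Count every n-gram (min_n..max_n tokens) across the documents and
    keep those seen at least min_freq times as term candidates."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {term: c for term, c in counts.items() if c >= min_freq}


if __name__ == "__main__":
    docs = ["term extraction with solr", "automatic term extraction"]
    print(ngram_candidates(docs))
```

No PoS tagger is needed here, which is why this route works before you have a Persian model. The trade-off is that it produces many noisy candidates, so some frequency threshold or stop-word filtering is usually required.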
You can have a look at our paper to get an overview. Also, the JATE2.0 wiki page contains enough information for a quick start with JATE2.0. We are still working on a complete version of the wiki.
Thanks for your interest. Please feel free to ask if you need any help with the set-up.
Jerry