Hi,
As part of the HDT analysis, I am still generating HDT files (only 10% complete after several days). Processing is too slow on a single processor, especially for the PubChem neighboring files. Each of these files takes several minutes to process: rapper TDT.gz->NT conversion [60-120 seconds], rdf2hdt NT->HDT conversion [60-120 seconds], and hdtSearch index generation [~1-10 seconds]. Most of the files are neighboring files.
I will parallelize this (across many processors) for expediency.
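A minimal sketch of what such a parallel run could look like, assuming each file can be processed independently of the others. The names run_pipeline, convert_all, and steps are hypothetical, and the actual rapper, rdf2hdt, and hdtSearch command lines are deliberately not spelled out here: each step is a function that turns a file path into an argument list, to be supplied by the caller.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_pipeline(path, steps):
    """Run every step for one file, in order.

    Each step is a function mapping the file path to an argv list
    (hypothetical placeholder for the real converter invocations).
    """
    for make_argv in steps:
        # check=True raises CalledProcessError on a non-zero exit code,
        # which stops the remaining steps for this file.
        subprocess.run(make_argv(path), check=True)
    return path


def convert_all(paths, steps, workers=8):
    """Process files concurrently; return (done, failed) lists of paths."""
    done, failed = [], []
    # Threads are enough here: the heavy work happens in child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_pipeline, p, steps): p for p in paths}
        for fut in as_completed(futures):
            path = futures[fut]
            try:
                fut.result()
                done.append(path)
            except subprocess.CalledProcessError:
                failed.append(path)
    return done, failed
```

Collecting failed paths separately means a single bad file does not abort the whole batch, and the failures can be rerun afterwards. The worker count of 8 is an arbitrary default and would need tuning to the machine.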
Again, if the TDT.gz->NT conversion step could be skipped, 1-2 minutes per file could be saved. With 21,360 files, that would be a considerable time saving.
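As a rough back-of-the-envelope check on that saving, the calculation below multiplies the 21,360 files by 60 and 120 seconds. It assumes every file costs the full 1-2 minutes and that processing is serial, so it is an upper bound rather than a measurement; the function name saved_hours is only illustrative.

```python
FILES = 21_360  # file count quoted above


def saved_hours(seconds_per_file, files=FILES):
    """Total serial time saved, in hours, if one step is skipped per file."""
    return files * seconds_per_file / 3600


low, high = saved_hours(60), saved_hours(120)
print(f"{low:.0f}-{high:.0f} hours ({low / 24:.1f}-{high / 24:.1f} days)")
# prints "356-712 hours (14.8-29.7 days)"
```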
Best,
Evan
--
Evan Bolton, Ph.D.
National Center for Biotechnology Information
Bldg. 38A, Room 8S810
National Library of Medicine
National Institutes of Health
8600 Rockville Pike, Bethesda, MD 20894
Phone: 301-451-1811
Fax: 301-480-4559
Email: bol...@ncbi.nlm.nih.gov
Skype: evan_bolton