SNOFYRE Indexing

JoeA

Feb 18, 2012, 10:34:52 AM
to snofyre-dev
I have recently been checking out Snofyre, including its capabilities and speed.

• I currently have the complete SNOMED CT RF1 dataset.
• I have also checked out your code (SNOMED & YASB 0.8, Sappheiros 1.8).
• My development environment is NetBeans.
• I use PostgreSQL 9.1 as my database, and I have resolved all the issues associated with datatype differences between MySQL and PostgreSQL.
• I have also upgraded the code base to use JPA 2 annotations.
• I deploy on JBoss AS 7.
• All the entities have been mapped correctly, and all the data sets from the RF1 release have been loaded.
• The prepared statements in the DAOs have been modified to point to the correct schemas, tables and columns.
• I am able to successfully make basic queries.
• Development machine specs: 64-bit HP, 4 GB RAM, Intel Core 2 Duo @ 2.8 GHz per core.

Over the past few days, I have been trying to make search faster by implementing indexing, and I found that YASB has an index generator, which I attempted to use.
The Concepts table has almost 400,000 records. It has been two and a half days since I ran the indexer, and it has indexed only 100,000 of the concepts.
I dread to think about indexing the descriptions; those are around 1 million records. Is there a way I can make this indexing faster?

The SNOMED docs list four tables which they mention are used for indexing: CONCWORDKEY, DESCWORDKEY, CONCDUALKEY and DESCDUALKEY. I looked through the Snofyre documentation and code, but found no references whatsoever to these tables. Are these tables being used at all? If so, how exactly?

Jay Kola

Feb 18, 2012, 7:25:17 PM
to snofy...@googlegroups.com
Hi Jonathan,

Many thanks for giving Snofyre a go, and congratulations on successfully migrating Snofyre to PostgreSQL. In fact, I had been wanting to do this for a while, since PostgreSQL has a more 'liberal' license than MySQL. See this issue (http://code.google.com/p/snofyre/issues/detail?id=18) on the issue tracker. There is also the issue of 'hard coded' table and column names that I would like to fix. So it would help me if you could send me a list of the changes you've had to make.

I am not sure I understand your 'indexing' issue entirely. Lexical search [matching based on terms (FSN, PT, etc.)] in Snofyre does not use the RDBMS engine, but uses Lucene instead. This separation of search from the RDBMS engine offers some significant advantages, in my opinion. Snofyre does not use the 'index' tables described in the SNOMED CT documentation.
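
For reference, querying the generated index with the Lucene 3.x API looks roughly like this. This is a minimal sketch only: the 'TERM', 'ID' and 'FSN' field names assume the fields the concept indexer writes, the index path is illustrative, and exception handling is omitted.

// Minimal search sketch (Lucene 3.4 API); adjust the index path to your setup.
File indexDir = new File("/path/to/lucene/index");
IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(Version.LUCENE_34, "TERM",
        new StandardAnalyzer(Version.LUCENE_34));
TopDocs hits = searcher.search(parser.parse("myocardial infarction"), 10);
for (ScoreDoc sd : hits.scoreDocs) {
    // Only stored fields (e.g. ID, FSN) can be read back from a hit.
    Document doc = searcher.doc(sd.doc);
    System.out.println(doc.get("ID") + " : " + doc.get("FSN"));
}
searcher.close();
reader.close();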

The 'Indexer' generally takes about an hour or two (depending on machine specs) to generate the index. You seem to have a decent-spec machine, so what you report is definitely 'not expected' behaviour. Can you please provide the following information, so I can understand what might be going wrong?

1. How are you calling the Index Generator? Are you running it directly from your IDE?
2. What happens when you introduce a 'counter' to restrict the index generation to just 1000 rows? Try modifying line 111 in SnomedLuceneDescriptionIndexer to while(rs.next() && counter < 1000); see the sketch after this list.
3. Check that your code is not persisting SNOMED CT data during the import, which can lead to long delays; verify that this is not happening behind the scenes via the JPA bindings.
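
To be explicit, point 2 plus a timing log amounts to something like this inside the indexer's read loop (a sketch only; I am assuming the rs, counter and indexWriter names from the indexer, so adjust to the actual code):

int counter = 0;
long start = System.currentTimeMillis();
// Restrict the run to 1000 rows so a single pass finishes quickly and can be timed.
while (rs.next() && counter < 1000) {
    // ... build the Lucene document and call indexWriter.addDocument(document) as before ...
    counter++;
}
logger.debug("Indexed " + counter + " rows in "
        + (System.currentTimeMillis() - start) + " ms");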

Please let me know how this goes… :)

Cheers,
Jay

JoeA

Feb 18, 2012, 8:27:51 PM
to snofyre-dev
// My code: a call is made to getConnectionString().
@Stateful
@Named
@ConversationScoped
public class MyClass implements Serializable {

    @Resource(name = "java:jboss/datasources/myDatasource")
    private DataSource dataSource;
    // Note: logger must be initialised, or the logging calls below throw a
    // NullPointerException (log4j-style shown; adjust to your logging framework).
    private Logger logger = Logger.getLogger(MyClass.class);
    SnomedConceptDatabaseDAO snomedConceptDatabaseDAO;
    String connectionString;

    public String getConnectionString() {
        try {
            logger.info("Connection Successful: "
                    + dataSource.getConnection().getCatalog());
            snomedConceptDatabaseDAO = new SnomedConceptDatabaseDAO(dataSource);
        } catch (SQLException ex) {
            logger.info("Connection Unsuccessful: " + ex.toString());
        }
        testLucene();
        return "DONE";
    }

    public void testLucene() {
        SnomedLuceneIndexer sli =
                new SnomedLuceneIndexer(dataSource, snomedConceptDatabaseDAO);
        sli.indexConcepts();
    }
}

// The SnomedLuceneIndexer constructor
public SnomedLuceneIndexer(DataSource dataSource,
        TerminologyConceptDAO terminologyConceptDAO) {
    try {
        this.connection = dataSource.getConnection();
        this.terminologyConceptDAO = terminologyConceptDAO;
        try {
            directory = FSDirectory.open(new File(System.getProperty("user.dir")
                    + "/.myFolder/snomed-desc/lucene"));
            logger.debug("Value of directory : " + directory);
            indexWriterConfig = new IndexWriterConfig(Version.LUCENE_34,
                    new StandardAnalyzer(Version.LUCENE_34));
            // Using the default line below, indexing is still slow. Your use of
            // IndexWriter is also deprecated under Lucene 3.4:
            // indexWriter = new IndexWriter(directory, new StandardAnalyzer(),
            //         true, IndexWriter.MaxFieldLength.UNLIMITED);
            indexWriter = new IndexWriter(directory, indexWriterConfig);
        } catch (IOException e) {
            logger.warn("Error locating index directory. Nested exception is : "
                    + e.fillInStackTrace());
        }
    } catch (SQLException e) {
        logger.warn("Error obtaining connection from data source. Nested exception is : "
                + e.fillInStackTrace());
    }
}

// Modified method from SnomedLuceneIndexer
public void indexConcepts() {
    try {
        Statement statement = connection.createStatement();
        ResultSet rs = statement.executeQuery(
                "SELECT DISTINCT CONCEPT_ID FROM PUBLIC.CONCEPTS_TABLE");
        int counter = 1;
        while (rs.next()) {
            String conceptID = rs.getString("CONCEPT_ID");
            // Fetch the full concept via the DAO (one query per concept).
            SnomedConcept concept = (SnomedConcept)
                    terminologyConceptDAO.getTerminologyConcept(conceptID);

            Document document = new Document();
            String preferredTerm = concept.getPreferredLabel();
            String fullySpecName = concept.getFullySpecifiedName();
            String isPrimitive = String.valueOf(concept.isPrimitive());
            String source = concept.getSource();
            String status = concept.getStatus().name();
            String conceptType = concept.getType().name().replaceAll(" ", "_");

            // FSN, PT and synonyms are also indexed under a catch-all TERM field.
            document.add(new Field("ID", conceptID, Store.YES, Index.ANALYZED));
            document.add(new Field("FSN", fullySpecName, Store.YES, Index.ANALYZED));
            document.add(new Field("TERM", fullySpecName, Store.NO, Index.ANALYZED));
            document.add(new Field("PT", preferredTerm, Store.NO, Index.ANALYZED));
            document.add(new Field("TERM", preferredTerm, Store.NO, Index.ANALYZED));
            document.add(new Field("SOURCE", source, Store.NO, Index.ANALYZED));
            document.add(new Field("STATUS", status, Store.YES, Index.ANALYZED));
            document.add(new Field("IS_PRIM", isPrimitive, Store.NO, Index.NOT_ANALYZED));
            document.add(new Field("TYPE", conceptType, Store.NO, Index.ANALYZED));

            Collection<String> synonyms = concept.getSynonyms();
            for (String synonym : synonyms) {
                document.add(new Field("SYN", synonym, Store.NO, Index.ANALYZED));
                document.add(new Field("TERM", synonym, Store.NO, Index.ANALYZED));
            }
            indexWriter.addDocument(document);
            if (logger.isDebugEnabled() && counter % 1000 == 0) {
                logger.debug("Concepts indexed : " + counter);
            }
            counter++;
        }
        rs.close();
        statement.close();
        // Optimise and commit once, after all concepts have been added.
        indexWriter.optimize();
        indexWriter.commit();
    } catch (SQLException e) {
        logger.warn(e.fillInStackTrace());
    } catch (ConceptNotFoundException e) {
        logger.warn(e.fillInStackTrace());
    } catch (CorruptIndexException e) {
        logger.warn(e.fillInStackTrace());
    } catch (IOException e) {
        logger.warn(e.fillInStackTrace());
    }
}

The above code snippets contain the relevant classes and calls being
made.

Ans 1. I am not calling the index generator directly from my IDE; I am calling it from my running JBoss application.
Ans 2. I introduced a counter (1000) the first time I attempted to index, and it took approximately 40 minutes to complete.
Ans 3. There is no SNOMED CT data being persisted.

Am I doing anything wrong, Jay?

Jonathan

Jay Kola

Feb 18, 2012, 8:42:10 PM
to snofy...@googlegroups.com
Hmm… off the top of my head, I'd try the following to check where the bottleneck is:

1. Check that you've created indices on CONCEPTS_TABLE. You are using a DISTINCT select statement, so add logging before and after your call to that statement to see how long it takes to execute in PostgreSQL.
2. I'd create similar 'log' records to see how long the optimise and the commit on your indexWriter each take, separately (see the sketch after this list).
3. Repeat with the counter set to just 10 and see what the profiles show.
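
Something like this, slotted into the indexConcepts() method you posted (a sketch only):

// Time the DISTINCT select on its own.
long t0 = System.currentTimeMillis();
ResultSet rs = statement.executeQuery(
        "SELECT DISTINCT CONCEPT_ID FROM PUBLIC.CONCEPTS_TABLE");
logger.debug("DISTINCT select took " + (System.currentTimeMillis() - t0) + " ms");

// ... indexing loop unchanged ...

// Time optimise and commit separately.
t0 = System.currentTimeMillis();
indexWriter.optimize();
logger.debug("optimize() took " + (System.currentTimeMillis() - t0) + " ms");

t0 = System.currentTimeMillis();
indexWriter.commit();
logger.debug("commit() took " + (System.currentTimeMillis() - t0) + " ms");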

Cheers,
Jay

JoeA

Feb 19, 2012, 7:55:05 AM
to snofyre-dev
1. I have created indices on the concepts table. I have added a snippet of the schema report below:

ALTER TABLE concepts_table
OWNER TO postgres;

-- Index: idx_concept_id

-- DROP INDEX idx_concept_id;

CREATE INDEX idx_concept_id
ON concepts_table
USING btree
(concept_id COLLATE pg_catalog."default" );

-- Index: idx_concept_type

-- DROP INDEX idx_concept_type;

CREATE INDEX idx_concept_type
ON concepts_table
USING btree
(concept_type COLLATE pg_catalog."default" );

-- Index: idx_ctv3id

-- DROP INDEX idx_ctv3id;

CREATE INDEX idx_ctv3id
ON concepts_table
USING btree
(ctv3_id COLLATE pg_catalog."default" );

-- Index: idx_fsn

-- DROP INDEX idx_fsn;

CREATE INDEX idx_fsn
ON concepts_table
USING btree
(fully_specified_name COLLATE pg_catalog."default" );

-- Index: idx_isprim

-- DROP INDEX idx_isprim;

CREATE INDEX idx_isprim
ON concepts_table
USING btree
(is_primitive );

-- Index: idx_sctid

-- DROP INDEX idx_sctid;

CREATE INDEX idx_sctid
ON concepts_table
USING btree
(snomed_id COLLATE pg_catalog."default" );

-- Index: idx_source

-- DROP INDEX idx_source;

CREATE INDEX idx_source
ON concepts_table
USING btree
(source COLLATE pg_catalog."default" );

2. The bulk of the time (approx. 20 seconds per record) is spent executing this line:

SnomedConcept concept = (SnomedConcept)
        terminologyConceptDAO.getTerminologyConcept(conceptID);

12:38:48,275 INFO [stdout] (http--127.0.0.1-8181-1) Sun Feb 19 12:38:48 UTC 2012 : Query 2 Begins:
12:39:08,514 INFO [stdout] (http--127.0.0.1-8181-1) Sun Feb 19 12:39:08 UTC 2012 : Query 2 Ends:

There is no issue with the optimise and commit. I believe these are called after all the concepts have been written to the Lucene index, and they executed within a few microseconds.

Jay Kola

Feb 19, 2012, 8:44:13 AM
to snofy...@googlegroups.com

You need to create corresponding indices on all tables in the database. Have a look at the indices in the MySQL database schema and recreate all of them in your PostgreSQL backend.

Performance will likely be poor without the indices, since the tables contain hundreds of thousands of rows... :-)
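
For example, something along these lines (the table and column names here are purely illustrative; take the authoritative list of indices from the MySQL schema):

-- Illustrative only: recreate whatever indices the MySQL schema defines.
CREATE INDEX idx_desc_concept_id ON descriptions_table USING btree (concept_id);
CREATE INDEX idx_rel_source_id ON relationships_table USING btree (source_concept_id);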

JoeA

Feb 19, 2012, 9:28:17 AM
to snofyre-dev
The indices being created by Snofyre have not been changed; they are all still being created. The snippet earlier was only from the concepts table report. All the other indices in the descriptions table, relationships table, clinical entity table, etc. have not been touched or removed.

At least I have been able to pinpoint that the DAO is causing the delays. I only hope it is not PostgreSQL itself causing this.

Jay Kola

Feb 19, 2012, 9:34:56 AM
to snofy...@googlegroups.com

Snofyre doesn't create indices on the concept, description and relationships tables... They are just imported as part of the MySQL data import. The JPA-generated tables for SNOMED CT content are not used. So I think it'll be useful to compare the indices in your backend with those in MySQL... :-)
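
A quick way to compare the two sides (the PostgreSQL table name follows your migrated schema; the MySQL table name may differ in your import):

-- MySQL: list the indices that came with the Snofyre import
SHOW INDEX FROM CONCEPTS_TABLE;

-- PostgreSQL: list what the migrated schema actually has
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'concepts_table';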

JoeA

Feb 21, 2012, 6:03:23 AM
to snofyre-dev
Eureka! It just so happened that I had not indexed one particular field in the descriptions table and one field in the relationships table.
I also made some optimizations in PostgreSQL itself (which might not be relevant), but after doing these and running the indexer again on the concepts, it took less than 30 minutes to complete. Now I am just left with querying the indexes.
Also, it seems Snofyre does not implement subsets. That is something I will be looking at soon enough.
Thanks for the help.

Jay Kola

Feb 21, 2012, 6:43:19 PM
to snofy...@googlegroups.com

Great to hear that! Have fun with Snofyre :-)
