LOAD CSV takes over an hour


Aram Chung

Mar 4, 2014, 10:54:03 AM
to ne...@googlegroups.com
Hi,

I was asked to post this here by Mark Needham (@markhneedham) who thought my query took longer than it should.

I'm trying to see how graph databases could be used in investigative journalism. I was loading New York State's "Active Corporations: Beginning 1800" data from https://data.ny.gov/Economic-Development/Active-Corporations-Beginning-1800/n9v6-gdp6 as a 1,964,486-row CSV (after deleting all U+F8FF characters, because they were triggering "[null] is not a supported property value"). The Cypher query I used was

USING PERIODIC COMMIT 500
LOAD CSV
  FROM "file://path/to/csv/Active_Corporations___Beginning_1800__without_header__wonky_characters_fixed.csv"
  AS company
CREATE (:DataActiveCorporations
{
DOS_ID:company[0],
Current_Entity_Name:company[1],
Initial_DOS_Filing_Date:company[2],
County:company[3],
Jurisdiction:company[4],
Entity_Type:company[5],

DOS_Process_Name:company[6],
DOS_Process_Address_1:company[7],
DOS_Process_Address_2:company[8],
DOS_Process_City:company[9],
DOS_Process_State:company[10],
DOS_Process_Zip:company[11],

CEO_Name:company[12],
CEO_Address_1:company[13],
CEO_Address_2:company[14],
CEO_City:company[15],
CEO_State:company[16],
CEO_Zip:company[17],

Registered_Agent_Name:company[18],
Registered_Agent_Address_1:company[19],
Registered_Agent_Address_2:company[20],
Registered_Agent_City:company[21],
Registered_Agent_State:company[22],
Registered_Agent_Zip:company[23],

Location_Name:company[24],
Location_Address_1:company[25],
Location_Address_2:company[26],
Location_City:company[27],
Location_State:company[28],
Location_Zip:company[29]
}
);

Each row is one node, so it's as close to the raw data as possible. The idea, loosely, is that these nodes will be linked to new nodes representing people and addresses verified by reporters.

This is what I got:

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1964486
Properties set: 58934580
Labels added: 1964486
4550855 ms

Some context information: 
Neo4j Milestone Release 2.1.0-M01
Windows 7
java version "1.7.0_03"

Best,
Aram

Mark Needham

Mar 4, 2014, 11:22:12 AM
to ne...@googlegroups.com
Hi Aram,

* Do you have any other information about the spec of the machine you're running this on? e.g. how much RAM etc.
* Have you tried upping the value of PERIODIC COMMIT? Perhaps try it out with a smaller subset of the data to measure the impact - try values of 1,000 / 10,000 perhaps.
* I think it would be interesting to pull out some other things as nodes as well - it might lead to more interesting queries. e.g. CEO, Location, Registered Agent, DOS Process, Jurisdiction could all be nodes that link back to a DOS record.
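That last suggestion could be sketched roughly like this (illustrative only - the extra labels, relationship names and constraints are made up; the column indices follow the query above):

```cypher
USING PERIODIC COMMIT 1000
LOAD CSV
  FROM "file://path/to/csv/Active_Corporations___Beginning_1800__without_header__wonky_characters_fixed.csv"
  AS company
CREATE (dos:DataActiveCorporations {DOS_ID: company[0], Current_Entity_Name: company[1]})
MERGE (ceo:CEO {name: company[12]})
MERGE (loc:Location {name: company[24]})
CREATE (dos)-[:HAS_CEO]->(ceo)
CREATE (dos)-[:HAS_LOCATION]->(loc);
```

(You'd want unique constraints on :CEO(name) and :Location(name) beforehand so the MERGEs hit an index.)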

Let me know if any of that doesn't make sense.
Mark


--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Michael Hunger

Mar 5, 2014, 2:29:56 AM
to ne...@googlegroups.com
Yep,

It would also be interesting to know how you ran this - with neo4j-shell? Against a running server?
Did you configure any RAM or memory-mapping settings in neo4j.properties?

Note that on Windows the heap settings include the mmio settings, unlike on other OSes.

Michael

Michael Hunger

Mar 5, 2014, 2:32:33 AM
to ne...@googlegroups.com
Also, if you're not yet creating rels (i.e. not reading your own writes), you should be able to up the periodic commit to 50k.

Michael

Michael Hunger

Mar 5, 2014, 2:34:04 AM
to ne...@googlegroups.com
Oh, and if you use neo4j-shell without the server, you have to set the heap in bin\Neo4jShell.bat via EXTRA_JVM_ARGUMENTS="-Xmx4G -Xms4G -Xmn1G"

and call 

bin\Neo4jShell -conf conf\neo4j.properties -path data\graph.db

On 05.03.2014 at 08:29, Michael Hunger <michael...@neopersistence.com> wrote:

Michael Hunger

Mar 5, 2014, 6:00:03 AM
to ne...@googlegroups.com
I just tested your file on MacOS with these settings
and got 6:30 for the 2m rows

EXTRA_JVM_ARGUMENTS="-Xmx6G -Xms6G -Xmn1G"

On Windows you have to add the memory from the mmio settings in neo4j.properties to the heap.

cat conf/neo4j.properties
# Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M

USING PERIODIC COMMIT 10000
> LOAD CSV
> FROM "file:///Users/mh/Downloads/Active_Corporations___Beginning_1800_no_head.csv"
> AS company
> CREATE (:DataActiveCorporations
> {
> DOS_ID:company[0],
> Current_Entity_Name:company[1],
> Initial_DOS_Filing_Date:company[2],
......
> Registered_Agent_Zip:company[23],
>
> Location_Name:company[24],
> Location_Address_1:company[25],
> Location_Address_2:company[26],
> Location_City:company[27],
> Location_State:company[28],
> Location_Zip:company[29]
> }
> );

+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1964486
Properties set: 58934580
Labels added: 1964486
391059 ms

Michael Hunger

Mar 5, 2014, 6:48:34 AM
to ne...@googlegroups.com
Oh and btw. I would LOVE to see a blog post from you about what you're working on!

Thanks so much

Michael

Aram Chung

Mar 5, 2014, 10:39:52 AM
to ne...@googlegroups.com
Wow this is great! I'll definitely try what you did. Please expect questions along the way.

And a write-up is coming; I was thinking I'd do that as soon as I get some relationships in, but now I should probably make a post about LOAD CSV. I'll post a link when I do.

Thanks!
Aram

Aram Chung

Apr 5, 2014, 12:18:14 PM
to ne...@googlegroups.com
Hello good people,

I need help!

Since my last post I've been trying to get a slightly altered LOAD CSV command to run, without much success. (I haven't been successful writing up a blog post either, though a summary is up at Aramology.com, first tile on the menu. Any corrections on the content are welcome.)

This (below) is how I want to structure the database, so that all the Business nodes that contain the same name point to the same Name node. (I also want to prevent it from creating Name nodes when the names are blank. Is this possible?) The database freezes up halfway through the command (it becomes unresponsive and the launcher window goes black).

USING PERIODIC COMMIT 10000
LOAD CSV
  FROM "path/to/Active_Corporations___Beginning_1800.csv"
  AS company
CREATE (n:Business)
MERGE (n0:Name {name: company[1]})
CREATE (n)-[:CURRENT_ENTITY_NAME]->(n0)
MERGE (n1:Name {name: company[6]})
CREATE (n)-[:DOS_PROCESS_NAME]->(n1)
MERGE (n2:Name {name: company[12]})
CREATE (n)-[:CEO_NAME]->(n2)
MERGE (n3:Name {name: company[18]})
CREATE (n)-[:REGISTERED_AGENT_NAME]->(n3)
MERGE (n4:Name {name: company[24]})
CREATE (n)-[:LOCATION_NAME]->(n4)
;


1. I tried to get around this by first using CREATE instead of MERGE, then copying the relationships from duplicate Name nodes over to a single node and deleting the duplicates. This works fine for a 10,000-row version. For a while I was even convinced it ran in linear time, but of course it doesn't, because I have to sort the Name nodes along the way to find the duplicates (is ORDER BY O(n log n)?). The full 1,964,486-row version gets derailed at this point.

I did some counting outside of Neo4j, and the full 1,964,486-row version should have 
 5,067,050 relationships and 
 2,816,857 non-blank names once the duplicates are deleted.
Is there an efficient way to get 5,067,050 relationships to connect 1,964,486 Business nodes to the correct Name node out of 2,816,857?
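Counting like this outside of Neo4j can be done with a short script along these lines (a sketch only - the path is a placeholder, and the column indices are the ones from the query above):

```python
import csv

# Columns holding names in the corporations CSV (same indices as the Cypher query:
# entity name, DOS process, CEO, registered agent, location)
NAME_COLUMNS = [1, 6, 12, 18, 24]

def count_names(path):
    """Return (relationship count, distinct non-blank name count).

    Each non-blank name cell becomes one Business->Name relationship;
    the distinct non-blank values are the Name nodes left after dedup.
    """
    names = set()
    relationships = 0
    with open(path, newline="") as f:
        for row in csv.reader(f):
            for col in NAME_COLUMNS:
                value = row[col].strip()
                if value:
                    relationships += 1
                    names.add(value)
    return relationships, len(names)
```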


2. I also recently tried using another computer, this time a Mac, and I need some help editing the memory settings. 

First I edited conf/neo4j.properties to include
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=1G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=250M
neostore.propertystore.db.arrays.mapped_memory=0M

Then I edited bin/neo4j-shell to say 
EXTRA_JVM_ARGUMENTS00="-Xmx6G -Xms6G -Xmn1G -XX:+UseConcMarkSweepGC -server"

Did I get everything? Even with the original LOAD CSV command just creating Business nodes, I wasn't able to bring it down to the 6:30 that you had, which makes me think I missed something. On my Windows machine I can't get 6:30 either (it's usually a little less than 30m), but I think that's because I can't do -Xmx6G -Xms6G -Xmn1G, so I settled for -Xmx4G -Xms4G -Xmn1G instead.


I'm very keen to get this working, as I'm getting some amazing query results even from the 10,000-row version - a great improvement over traditional relational databases. Once I can get the full dataset in there, Newsday is interested in using it for an ongoing investigative news story. I would very much like to know if it can be done. The next dataset I need to load in and connect with the current 1,964,486-row one is a whopping 8,765,456-row CSV.

Thanks,
Aram


P.S. This might be on your to-do list already, but will a future version of Neo4j support date types? I know there are ways around it, but so much journalism work relies on correct date information that I think this is what would most limit Neo4j's journalistic application. It's not a very pressing matter right now.

Rodger

Apr 7, 2014, 9:19:47 AM
to ne...@googlegroups.com
Hello Aram, 

This does look like a great project!
It's similar to Opencorporates.com -
you might want to communicate with them too.

I'll add that the neo4j-shell should be much faster
than the web GUI.

As I've written before, I agree that having
a Date datatype is critical.

I'm curious, what version of Neo4j
is this LOAD CSV command being run in?

Thanks a lot!

Best,

Rodger


Michael Hunger

Apr 7, 2014, 9:48:00 AM
to ne...@googlegroups.com
LOAD CSV is in Neo4j 2.1-M01, available as a preview version to experiment with.

You can also check out this blog post by Rik Van Bruggen: http://blog.bruggen.com/2014/03/food-networks-countries-diets-health.html?view=sidebar

(michael)-[:SUPPORTS]->(YOU)-[:USE]->(Neo4j)
Learn Online / Offline or Read a Book (in German)
We're trading T-shirts for cool Graph Models



Aram Chung

Apr 7, 2014, 5:03:20 PM
to ne...@googlegroups.com
Rodger: Thanks!
I agree, Opencorporates.com is lovely. My project isn't really about corporations, though; it just happened that the first dataset I loaded was corporations data. My aim is to demonstrate how investigative journalistic processes can be improved by graph databases. I'm getting worried now, as a lot of journalism work deals with large datasets (rows in the millions) and date information.

I'm using Neo4j Milestone Release 2.1.0-M01 and yes, I'm working in neo4j-shell.

Michael: Thanks for the link, it was a good read.
I'd like to point out that LOAD CSV isn't the problem. (Even when it is, I can always chop the CSV into 500,000-row pieces and load them one by one. Tedious, but totally doable.) Rather, the issue is how much time it takes to establish the correct 5,067,050 relationships between 1,964,486 Business nodes and 2,816,857 Name nodes.

I just got a new error message: 
SystemException: TM has encountered some problem, please perform necessary action (tx recovery/restart)
What is the necessary action in this case?

Did I do everything for the Mac memory settings then?

Rodger

Apr 8, 2014, 12:19:34 AM
to ne...@googlegroups.com
HI Michael,

That is an interesting blog post that Rik wrote!

Glad to hear that there is a new data loader.

This past winter, I outlined writing my own loader in Java:
one self-contained program
that would do a lot of data checking/cleansing
and convert dates to longs.
Although, I got sidelined by some other projects.


I'll take a look at the loader. 
Hopefully sooner than later.


Best,

Rodger


david fauth

Apr 10, 2014, 11:36:03 AM
to ne...@googlegroups.com
Aram,
 
I have some cycles and can possibly help. You can reach me at dsfauth at gmail dot com
 
df

Pavan Kumar

Jun 18, 2014, 2:46:16 AM
to ne...@googlegroups.com
Hi,
I have deployed Neo4j 2.1.0-M01 on Windows with 8GB RAM. I am trying to import a CSV file which has 30,000 records. I am using the USING PERIODIC COMMIT 1000 LOAD CSV command for importing, but it gives an unknown error. I have modified the neo4j.properties file as advised in the blogs. My neo4j.properties now looks like
# Default values for the low-level graph engine

neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=500M

# Enable this to be able to upgrade a store from an older version
allow_store_upgrade=true

# Enable this to specify a parser other than the default one.
#cypher_parser_version=2.0

# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true"
keep_logical_logs=true

# Autoindexing

# Enable auto-indexing for nodes, default is false
node_auto_indexing=true

# The node property keys to be auto-indexed, if enabled
#node_keys_indexable=name,age

# Enable auto-indexing for relationships, default is false
relationship_auto_indexing=true

# The relationship property keys to be auto-indexed, if enabled
#relationship_keys_indexable=name,age

# Setting for Community Edition:
cache_type=weak

Still I am facing the same problem. Is there any other file in which to change properties? Kindly help me with this issue.
Thanks in advance

Michael Hunger

Jun 18, 2014, 3:11:57 AM
to ne...@googlegroups.com
What does your query look like?
Please switch to Neo4j 2.1.2

And create indexes / constraints for the nodes you're inserting with MERGE or looking up via MATCH.

Michael

Pavan Kumar

Jun 18, 2014, 3:19:53 AM
to ne...@googlegroups.com
My query looks like the following:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM
"file:D:/Graph_Database/DrugBank_database/DrugbankFull_Database.csv"
AS csvimport
merge (uniprotid:Uniprotid{uniprotid: csvimport.ID, Name:csvimport.Name, Uniprot_title: csvimport.Uniprot_Title})
merge (genename:Gene_Name{genename: csvimport.Gene_Name})
merge (Genbank_prtn:GenBank_Protein{GenBank_protein_id: csvimport.GenBank_Protein_ID})
merge (Genbank_gene:GenBank_Gene{GenBank_gene_id: csvimport.GenBank_Gene_ID})
merge (pdbid:PDBID{PDBid: csvimport.PDB_ID})
merge (geneatlas:Geneatlasid{Geneatlas: csvimport.GenAtlas_ID})
merge (HGNC:HGNCid{hgnc: csvimport.HGNC_ID})
merge (species:Species{Species: csvimport.Species})
merge (genecard:Genecardid{Genecard: csvimport.GeneCard_ID})
merge (drugid:DrugID{DrugID: csvimport.Drug_IDs})
merge (uniprotid)-[:Genename]->(genename)
merge (uniprotid)-[:GenBank_ProteinID]->(Genbank_prtn)
merge (uniprotid)-[:GenBank_GeneID]->(Genbank_gene)
merge (uniprotid)-[:PDBID]->(pdbid)
merge (uniprotid)-[:GeneatlasID]->(geneatlas)
merge (uniprotid)-[:HGNCID]->(HGNC)
merge (uniprotid)-[:Species]->(species)
merge (uniprotid)-[:GenecardID]->(genecard)
merge (uniprotid)-[:DrugID]->(drugid)

I am attaching a sample CSV file as well; please find it attached.
As suggested, I will try the new version of Neo4j.





--
Thanks & Regards,
Pavan Kumar
Project Engineer
CDAC -KP
Ph +91-7676367646
SAmple_Drugbank.xls

Michael Hunger

Jun 18, 2014, 3:39:17 AM
to ne...@googlegroups.com
And create the indexes for all those node + property combinations.

And for operations like this: 

MERGE (uniprotid:Uniprotid{uniprotid: csvimport.ID, Name:csvimport.Name, Uniprot_title: csvimport.Uniprot_Title})

please use a constraint:

create constraint on (uniprotid:Uniprotid) assert uniprotid.uniprotid is unique;

and the merge operation like this, so it can actually leverage the index/constraint.

MERGE (uniprotid:Uniprotid{uniprotid: csvimport.ID}) ON CREATE SET uniprotid.Name=csvimport.Name,uniprotid.Uniprot_title=csvimport.Uniprot_Title
...

<SAmple_Drugbank.xls>

Pavan Kumar

Jun 18, 2014, 4:11:44 AM
to neo4j
When I use CREATE statements, it does not consider the empty fields from the CSV file. So I used the MERGE command.

Michael Hunger

Jun 18, 2014, 4:14:52 AM
to ne...@googlegroups.com
I don't understand.

Michael

Pavan Kumar

Jun 18, 2014, 5:13:42 AM
to ne...@googlegroups.com
Hi, 
So my Cypher will be like
----------------------------------------------------------
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM
"file:D:/Graph_Database/DrugBank_database/DrugbankFull_Database.csv"
AS csvimport
create constraint on (uniprotid:Uniprotid) assert uniprotid.uniprotid is unique;
MERGE (uniprotid:Uniprotid{uniprotid: csvimport.ID}) ON CREATE SET uniprotid.Name=csvimport.Name,uniprotid.Uniprot_title=csvimport.Uniprot_Title
create constraint on (genename:Gene_Name) assert genename:Gene_Name is unique;
merge (genename:Gene_Name{genename: csvimport.Gene_Name})
 and so on...
merge (uniprotid)-[:Genename]->(genename)
merge (uniprotid)-[:GenBank_ProteinID]->(Genbank_prtn)
and so on...
---------------------------------------------------------
Is that right...? I tried the same statements in 2.1.2 and I am getting the following errors.

1. Invalid input 'n': expected 'p/P' (line 5, column 20)

"create constraint on (uniprotid:Uniprotid) assert uniprotid.uniprotid is unique;"

2. Cannot merge node using null property value for uniprotid

Kindly help

david fauth

Jun 18, 2014, 10:50:26 AM
to ne...@googlegroups.com
Run the Create Constraint commands then attempt your LOAD CSV command.
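In other words, run each constraint as its own statement first, and only then run the import - roughly like this (a sketch reusing identifiers from the earlier query, not the full script):

```cypher
create constraint on (uniprotid:Uniprotid) assert uniprotid.uniprotid is unique;
create constraint on (genename:Gene_Name) assert genename.genename is unique;

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM
"file:D:/Graph_Database/DrugBank_database/DrugbankFull_Database.csv"
AS csvimport
MERGE (uniprotid:Uniprotid {uniprotid: csvimport.ID})
  ON CREATE SET uniprotid.Name = csvimport.Name, uniprotid.Uniprot_title = csvimport.Uniprot_Title
MERGE (genename:Gene_Name {genename: csvimport.Gene_Name})
MERGE (uniprotid)-[:Genename]->(genename);
```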

Pavan Kumar

Jun 19, 2014, 1:54:11 AM
to neo4j
Hi,
I have tried creating constraints and an index before attempting the LOAD CSV command.
The commands execute for a long time and then show me "Unknown error".
Any idea why it gives this error? I am running it on a Windows machine which has 8GB RAM.
Do I have to change properties in the neo4j.properties file?
Kindly help me

Pavan Kumar

Jun 25, 2014, 7:06:12 AM
to neo4j
Hello, 
I am using the LOAD CSV command for importing a database
and I am getting the following error: "GC overhead limit exceeded".
Do I have to change the JVM properties? I have the value set to -Xmx512; do I have to increase this to avoid the error?
Kindly help

Michael Hunger

Jun 25, 2014, 7:18:04 AM
to ne...@googlegroups.com
What does your query look like?
Please switch to Neo4j 2.1.2

And create indexes / constraints for the nodes you're inserting with merge or looking up via MATCH.


On 18.06.2014 at 08:46, Pavan Kumar <kumar.p...@gmail.com> wrote:

Michael Hunger

Jun 25, 2014, 7:19:14 AM
to ne...@googlegroups.com
Please read this blog post: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/

And yes, you should use more memory than 512 bytes.

-Xms4G -Xmx4G -Xmn1G

Pavan Kumar

Jun 25, 2014, 7:23:55 AM
to neo4j
Hi,
My query is as follows:
create constraint on (ChemicalName:chemicalname) assert ChemicalName.chemicalname is unique;
create constraint on (ChemicalID:chemicalid) assert ChemicalID.chemicalid is unique;
create constraint on (Genesymbol:genesymbol) assert Genesymbol.genesymbol is unique;
create constraint on (Geneid:geneid) assert Geneid.geneid is unique;
create constraint on (Geneform:geneform) assert Geneform.geneform is unique;
create constraint on (Interaction:interaction) assert Interaction.interaction is unique;
create constraint on (Interactionactions:interactionactions) assert Interactionactions.interactionactions is unique;
create constraint on (PubmedID:pubmed) assert PubmedID.pubmed is unique;
create index on :ChemicalName(chemicalname);
create index on :ChemicalID(chemicalid);
create index on :Genesymbol(genesymbol);
create index on :Geneid(geneid);
create index on :Geneform(geneform);
create index on :Interaction(interaction);
create index on :Interactionactions(interactionactions);
create index on :PubmedID(pubmed);

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM
"file:D:/Graph_Database/CTD/CTD_chem_gene_ixns.csv"
AS chemgeneinteractions
match (geneid:Geneid{geneid: chemgeneinteractions.GeneID})
match (genesymbol:Genesymbol{genesymbol: chemgeneinteractions.GeneSymbol})
merge (chemicalname:ChemicalID{chemicalid: chemgeneinteractions.ChemicalID, chemicalname: chemgeneinteractions.ChemicalName})
ON CREATE SET chemicalname.chemicalid=chemgeneinteractions.ChemicalID,chemicalname.chemicalid=chemgeneinteractions.ChemicalName
merge (geneform:Geneform {geneform: chemgeneinteractions.GeneForms})
merge (interations:Interaction{interact: chemgeneinteractions.Interaction})
merge (oraganism:Organism{organism: chemgeneinteractions.Organism})
merge (interaction:Interactionactions{interr: chemgeneinteractions.InteractionActions})
merge (pubmed:PubmedID{pub: chemgeneinteractions.PubMedIDs})
merge (geneid)-[:Gene_Symbol]->(genesymbol)
merge (geneid)-[:chemicalname]->(chemicalname)
merge (geneid)-[:geneform]->(genefrom)
merge (geneid)-[:Its_interaction_action]->(interactions)
merge (geneid)-[:Its_interaction]->(interaction)
merge (geneid)-[:PubmedID]->(pubmed)
merge (geneid)-[:Related_To]->(organism)
merge (genesymbol)-[:chemicalname]->(chemicalname)
merge (genesymbol)-[:geneform]->(genefrom)
merge (genesymbol)-[:geneform]->(genefrom)
merge (genesymbol)-[:Its_interaction_action]->(interactions)
merge (genesymbol)-[:Its_interaction]->(interaction)
merge (genesymbol)-[:PubmedID]->(pubmed)
merge (genesymbol)-[:Related_To]->(organism)



My jvm properties are 
-Xmx512m
-XX:+UseConcMarkSweepGC





Michael Hunger

Jun 25, 2014, 7:31:15 AM
to ne...@googlegroups.com
Use 1000 here:
USING PERIODIC COMMIT 1000

Increase the memory settings to 6G

As you run on Windows:

neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=200M
neostore.propertystore.db.arrays.mapped_memory=0M

Also, can you share your complete CSV with me privately?

Do you have any nodes in your dataset that have many (100k-1M) relationships?

Pavan Kumar

Jun 25, 2014, 9:04:40 AM
to neo4j
My property file is 

# Default values for the low-level graph engine

neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=200M
neostore.propertystore.db.arrays.mapped_memory=0M

# Enable this to be able to upgrade a store from an older version
allow_store_upgrade=true

# Enable this to specify a parser other than the default one.
#cypher_parser_version=2.0

# Keep logical logs, helps debugging but uses more disk space, enabled for
# legacy reasons To limit space needed to store historical logs use values such
# as: "7 days" or "100M size" instead of "true"
keep_logical_logs=true

# Autoindexing

# Enable auto-indexing for nodes, default is false
#node_auto_indexing=true

# The node property keys to be auto-indexed, if enabled
#node_keys_indexable=name,age

# Enable auto-indexing for relationships, default is false
#relationship_auto_indexing=true

# The relationship property keys to be auto-indexed, if enabled
#relationship_keys_indexable=name,age
cache_type=strong



My jvm options are 
# Enter one VM parameter per line, note that some parameters can only be set once.
# For example, to adjust the maximum memory usage to 512 MB, uncomment the following line
-Xmx512m

But still I am getting the GC overhead limit exceeded error.
Can somebody kindly make a suggestion?

Pavan Kumar

Jun 26, 2014, 5:58:24 AM
to neo4j
My JVM properties include

# Enter one VM parameter per line, note that some parameters can only be set once.
# For example, to adjust the maximum memory usage to 512 MB, uncomment the following line
-Xmx6144m
Xmx4G -Xms4G -Xmn1G

but still I am getting the GC overhead limit exceeded error (I have tried everything from 512m to 6GB).
My neo4j.properties file contains
neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=200M
neostore.propertystore.db.arrays.mapped_memory=0M

Any more suggestions to get rid of the error?


Michael Hunger

Jun 26, 2014, 9:27:57 AM
to ne...@googlegroups.com
Your constraints are wrong: you mixed up labels and identifiers.

Please also check the index properties

And I had better success doing a multi-pass for each set of elements to connect

Sent from mobile device

Pavan Kumar

Jun 28, 2014, 4:23:26 AM
to neo4j
I have changed my constraints and I am sure my labels and identifiers are not mixed up now.
But the query still executes for a long time; in the log file I can see "Application threads blocked for" statements, and I am getting the same error,
"GC overhead limit exceeded". Currently I have it set to 2 GB in the JVM file. Kindly tell me if I am making any mistake in my Cypher statements.
create constraint on (ChemicalName:chemicalname) assert ChemicalName.chemicalname is unique;
create constraint on (Chemicalid:chemicalid) assert Chemicalid.chemicalid is unique;
create constraint on (Genesymb:genesymbol) assert Genesymb.genesymbol is unique;
create constraint on (GeneID:geneid) assert GeneID.geneid is unique;
create constraint on (form:geneform) assert form.geneform is unique;
create constraint on (Interac:interaction) assert Interac.interaction is unique;
create constraint on (Interactactions:interactionactions) assert Interactactions.interactionactions is unique;
create constraint on (Pubmedid:pubmed) assert Pubmedid.pubmed is unique;
create index on :ChemicalName(chemicalname);
create index on :Chemicalid(chemicalid);
create index on :Genesymb(genesymbol);
create index on :GeneID(geneid);
create index on :form(geneform);
create index on :Interac(interaction);
create index on :Interactactions(interactionactions);
create index on :Pubmedid(pubmed);

USING PERIODIC COMMIT 1000

Thanks

Michael Hunger

Jun 28, 2014, 5:10:52 AM
to ne...@googlegroups.com
Let me send you my version of your import when I come back home

Sent from mobile device

Pavan Kumar

Jun 28, 2014, 5:19:50 AM
to neo4j
Thanks a lot.
Are these settings fine?

neo4j properties

neostore.nodestore.db.mapped_memory=100M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=200M
neostore.propertystore.db.strings.mapped_memory=200M
neostore.propertystore.db.arrays.mapped_memory=0M

Java properties

-Xmx2048m
-Xmx4G 
-Xms4G
-Xmn1G

Michael Hunger

Jun 30, 2014, 7:22:00 AM
to ne...@googlegroups.com
No, they are not.

These two lines set the same parameter - keep only one of them, and stick with 4 or 6G if you're on Windows:
-Xmx2048m
-Xmx4G

I'm pretty frustrated with the import script you provided.

Almost all your index/constraint definitions were actually wrong.

And some of the properties or identifiers in your load script were wrong too; that cost me several hours of finding those typos
and fixing them while wondering why the created indexes were not used.

Please take _more care_ when providing scripts like this - I can't afford to spend so much time on a single issue.
Usually this would have been a paid consulting effort of 1-2k for the time it took.

For the constraints, the syntax is:

create constraint on (alias:Label) assert alias.property is unique;

whereas you had the label and the alias swapped, e.g.

create constraint on (ChemicalName:chemicalname) assert ChemicalName.chemicalname is unique;

so they were not effective, and Neo4j had to scan the whole database for every row.

here is the right set:

create constraint on (chemicalname:ChemicalName) assert chemicalname.chemicalname is unique;
create constraint on (chemicalid:ChemicalID) assert chemicalid.chemicalid is unique;
create constraint on (genesymbol:Genesymbol) assert genesymbol.genesymbol is unique;
create constraint on (geneid:Geneid) assert geneid.geneid is unique;
create constraint on (geneform:Geneform) assert geneform.geneform is unique;
create constraint on (interaction:Interaction) assert interaction.interact is unique;
create constraint on (interactionactions:Interactionactions) assert interactionactions.interr is unique;
create constraint on (pubmed:PubmedID) assert pubmed.pub is unique;
create constraint on (oraganism:Organism) assert oraganism.organism is unique;
create index on :ChemicalID(chemicalname);

Unfortunately your data is not so well suited to the MERGE operation (large string properties), so I took a different approach and inserted it instead.

I did a multi-pass insert that imported one aspect at a time, using DISTINCT to limit the input to the distinct set of values or tuples, and then used CREATE for the actual data creation.
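A rough sketch of two such passes, reusing the file and header names from the query above (illustrative only - the pass structure is the point, not the exact properties):

```cypher
// Pass 1: one node per distinct chemical
LOAD CSV WITH HEADERS FROM
"file:D:/Graph_Database/CTD/CTD_chem_gene_ixns.csv" AS row
WITH DISTINCT row.ChemicalID AS chemicalid, row.ChemicalName AS chemicalname
CREATE (:ChemicalID {chemicalid: chemicalid, chemicalname: chemicalname});

// Later pass: connect genes to chemicals, looking both up via their constraints
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM
"file:D:/Graph_Database/CTD/CTD_chem_gene_ixns.csv" AS row
MATCH (g:Geneid {geneid: row.GeneID})
MATCH (c:ChemicalID {chemicalid: row.ChemicalID})
CREATE (g)-[:chemicalname]->(c);
```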

See this script for the import.


I think you should also re-think your model: I'm not sure what the difference between Geneid and Genesymbol is, and all the relationship types are repeated for both.

Cheers,

Michael