IntAct import problem


David...@schering.de

Feb 4, 2005, 10:59:38 AM
to cp...@googlegroups.com
Hello,

I'm experimenting with a fresh local installation of cPath.
The system setup and loading of the provided demo dataset work perfectly.
As the next step, I would like to import the data from EBI's IntAct database.
With human_small.xml I'm getting the error messages shown below.
This database is listed on psidev.sf.net as providing PSI-MI data.
Any ideas what's going wrong here?

Thanks,
David.

######################################################################
> ./admin.pl -d -u tomcat -p kitty -f ../dbData/Intact/human_small.xml import
Java HotSpot(TM) Client VM warning: Can't detect initial thread stack location
Based on the file extension, I am concluding that this is a PSI-MI File.
Loading data file: human_small.xml
Description: human_small.xml
XML Document Loaded. Ready for Import.
Transferring Import Records
Checking record: 1, Status: NEW
--> Transferring record
Step 1 of 4: Normalizing PSI Document


-----------------------------------------
Fatal Error: location is a required field.

Full Details are available in the stack trace below.
org.mskcc.pathdb.sql.transfer.ImportException: location is a required field.
    at org.mskcc.pathdb.sql.transfer.ImportPsiToCPath.addRecord(ImportPsiToCPath.java:182)
    at org.mskcc.pathdb.tool.ImportRecords.transferRecord(ImportRecords.java:106)
    at org.mskcc.pathdb.tool.ImportRecords.transferAllImportRecords(ImportRecords.java:83)
    at org.mskcc.pathdb.tool.ImportRecords.transferData(ImportRecords.java:62)
    at org.mskcc.pathdb.tool.Admin.importData(Admin.java:145)
    at org.mskcc.pathdb.tool.Admin.main(Admin.java:89)
Caused by: ValidationException: location is a required field.;
- location of error: XPATH: entrySet/entry/interactionList/interactionElementType/participantList/proteinParticipantType/featureList/featureType{file: [not available]; line: 135356; column: 12}
    at org.exolab.castor.xml.Unmarshaller.unmarshal(Unmarshaller.java:563)
    at org.exolab.castor.xml.Unmarshaller.unmarshal(Unmarshaller.java:487)
    at org.exolab.castor.xml.Unmarshaller.unmarshal(Unmarshaller.java:627)
    at org.mskcc.dataservices.schemas.psi.EntrySet.unmarshalEntrySet(EntrySet.java:302)
    at org.mskcc.pathdb.util.PsiUtil.normalizeDoc(PsiUtil.java:303)
    at org.mskcc.pathdb.util.PsiUtil.getNormalizedDocument(PsiUtil.java:82)
    at org.mskcc.pathdb.sql.transfer.ImportPsiToCPath.addRecord(ImportPsiToCPath.java:162)
    ... 5 more
-----------------------------------------





Ethan Cerami

Feb 4, 2005, 11:35:05 AM
to cp...@googlegroups.com
David,

I have been able to successfully import all of Intact, based on the June
2004 and November 2004 releases. However, using the very latest
release, I get the same error you do. I'll write back soon, once I
figure out some more details.

Ethan
--
Ethan Cerami
Computational Biology Center
Memorial Sloan-Kettering Cancer Center
http://cbio.mskcc.org
Email: cer...@cbio.mskcc.org
Direct phone: (646) 735-8082

Ethan Cerami

Feb 4, 2005, 12:41:04 PM
to cp...@googlegroups.com
David,

As far as I can tell, the last three months of IntAct releases contain
invalid XML (the specific error is noted below, if you are interested).
And, since cPath only works with valid XML, the command line script
refuses to allow the import.

If you are very eager to import the data today, you can get the November
2004 IntAct release:
ftp://ftp.ebi.ac.uk/pub/databases/IntAct/2004-11-01/xml/. This release
is valid, and will work with cPath. I just tried it again to
double-check that it is OK.

I'll also see about submitting a bug report to IntAct.

Here are the details regarding the December 2004, January 2005, and
February 2005 IntAct PSI-MI files:

Each of these files is invalid. For example, the February 2005 release
contains 1,427 validity errors. Here are the first few validity errors,
as determined by the Oxygen XML Editor:

E cvc-complex-type.2.4.a: Invalid content was found starting with
element 'featureDetection'. One of '{"net:sf:psidev:mi":location}' is
expected.
E cvc-complex-type.2.4.b: The content of element 'featureDetection' is
not complete. One of '{"net:sf:psidev:mi":names}' is expected.
E cvc-complex-type.2.4.a: Invalid content was found starting with
element 'featureDetection'. One of '{"net:sf:psidev:mi":location}' is
expected.
E cvc-complex-type.2.4.a: Invalid content was found starting with
element 'featureDetection'. One of '{"net:sf:psidev:mi":location}' is
expected.
E cvc-complex-type.2.4.b: The content of element 'featureDetection' is
not complete. One of '{"net:sf:psidev:mi":names}' is expected.

If you encounter a problem similar to this in the future, and don't have
a good XML editor/validator, you can try out:
http://tools.decisionsoft.com/schemaValidate.html. In the XML Instance
box, specify your PSI-MI file, and then click validate. If you get no
errors, the file will import to cPath.
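
A local check is also possible with the JDK's own javax.xml.validation
package. What follows is only a minimal sketch, not cPath's code: the
class name XsdCheck, the toy schema, and the two sample documents are
made up for illustration, and a real check would load the PSI-MI schema
and your data file instead. It collects every validity error rather than
stopping at the first one.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper; none of these names come from cPath.
class XsdCheck {
    // Toy schema (an assumption): <feature> requires <location>, then an
    // optional <featureDetection>. It only mimics the shape of the error above.
    public static final String TOY_XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"feature\"><xs:complexType><xs:sequence>"
      + "<xs:element name=\"location\" type=\"xs:string\"/>"
      + "<xs:element name=\"featureDetection\" type=\"xs:string\" minOccurs=\"0\"/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";
    public static final String GOOD_XML =
        "<feature><location>1-10</location></feature>";
    public static final String BAD_XML =
        "<feature><featureDetection>x</featureDetection></feature>";

    /** Validates xml against xsd and returns all validity errors found. */
    public static List<String> validate(String xsd, String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            Validator validator = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)))
                .newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                // Record the validity error and let validation continue.
                @Override public void error(SAXParseException e) {
                    errors.add("[Line: " + e.getLineNumber() + ", Column: "
                        + e.getColumnNumber() + "] " + e.getMessage());
                }
                // Not well-formed at all: nothing sensible to continue with.
                @Override public void fatalError(SAXParseException e)
                        throws SAXException { throw e; }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse schema or document", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("good: " + validate(TOY_XSD, GOOD_XML).size() + " error(s)");
        for (String err : validate(TOY_XSD, BAD_XML)) {
            System.out.println("bad: " + err);
        }
    }
}
```

An empty list means the document is valid against the schema it was
checked with; for cPath that would have to be the PSI-MI schema.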

Thanks for reporting this problem to us.

Ethan

Ethan Cerami

Feb 4, 2005, 1:43:55 PM
to cp...@googlegroups.com
Following up on this, I think that the error message that you
encountered here was less than helpful. I have therefore made a few
changes.

For example, when you load an invalid XML file, you will now see
something like this:

-START-
Based on the file extension, I am concluding that this is a PSI-MI File.
Validating XML File:
/Users/cerami/dev/cpath_data/intact/feb_2005/human_small.xml
XML File is Invalid.

-------------------------------------
Import aborted due to invalid XML.
Use the validate command to view a list of XML validation errors.
For example: admin.pl -f human_small.xml validate
-------------------------------------
-END-

You can also now validate the offending XML file directly from the admin
script. For example:

./admin.pl -f ~/dev/cpath_data/intact/feb_2005/human_small.xml validate

produces the following output:

XML File is Invalid.
Error #0: cvc-complex-type.2.4.a: Invalid content was found starting
with element 'featureDetection'. One of '{"net:sf:psidev:mi":location}'
is expected. [Line: 53911, Column: 20
Error #1: cvc-complex-type.2.4.b: The content of element
'featureDetection' is not complete. One of '{"net:sf:psidev:mi":names}'
is expected. [Line: 53911, Column: 20
Error #2: cvc-complex-type.2.4.a: Invalid content was found starting
with element 'featureDetection'. One of '{"net:sf:psidev:mi":location}'
is expected. [Line: 53923, Column: 19
Error #3: cvc-complex-type.2.4.a: Invalid content was found starting
with element 'featureDetection'. One of '{"net:sf:psidev:mi":location}'
is expected. [Line: 54161, Column: 20
...

BTW: the validation service works on any XML document, not just PSI-MI.

These changes are checked in, and will be available here:
http://www.cbio.mskcc.org/dev_site/cpath/source.html (under Nightly
Snapshots). The changes will get pushed out automatically at around 4
am EST tomorrow.

Ethan

David...@schering.de

Feb 7, 2005, 2:31:36 AM
to cp...@googlegroups.com

Ethan,

thank you for your fast response!
Did you test this with a local copy of the older IntAct data, or did you
do a fresh download from the EBI server?
I'm getting the same error with the June and November data as well.
All of the directories on the EBI FTP server are dated December 2 or
later, so maybe they changed something in how the XML dumps were created
on that day.
Just to make sure that I don't have a local problem with my Java setup
or anything else, could you please send me a copy of human_small.xml (or
any other XML file from IntAct) that you can import correctly?

Thanks a lot,
David.

Ethan Cerami

Feb 7, 2005, 11:17:25 AM
to cp...@googlegroups.com
David,

I just downloaded human_small.xml from here:
ftp://ftp.ebi.ac.uk/pub/databases/IntAct/2004-11-01/xml/, and it works ok.

But, I think your problem is due to something else: that is, cPath uses
a two-step process to import data. First, data is imported to the
"import" table; second, the data is "chopped up", and placed in the
core cpath tables. Each time you import a new file, it first copies the
new file to the "import" table, and then chops up all the pending
records in the "import" table.

Prior to the changes I made on Friday, cPath allowed invalid XML in the
first step. And, once there, it can never really get out of the pending
queue. Hence, each time you add a new file, the admin script gets stuck
in the second step for your old, invalid XML file from last week.
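
The effect can be illustrated with a toy model. This is a sketch under
stated assumptions, not cPath's actual code: the class ImportQueueSketch,
its Record type, and the oldest-first processing order are hypothetical,
chosen only to show how one invalid pending record blocks every later
import, and how validating before the first step avoids that.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model only; all names are hypothetical, not taken from cPath.
class ImportQueueSketch {
    /** Stand-in for one row in the "import" table. */
    static final class Record {
        final String name;
        final boolean valid;
        Record(String name, boolean valid) { this.name = name; this.valid = valid; }
    }

    private final Deque<Record> pending = new ArrayDeque<>(); // "import" table
    private final List<String> core = new ArrayList<>();      // core tables
    private final boolean validateFirst;

    ImportQueueSketch(boolean validateFirst) { this.validateFirst = validateFirst; }

    /** Step 1: copy the file to the import table; false if it is rejected. */
    boolean stage(Record r) {
        if (validateFirst && !r.valid) {
            return false; // invalid XML never enters the pending queue
        }
        pending.addLast(r);
        return true;
    }

    /** Step 2: transfer all pending records (assumed oldest first). */
    void transferAll() {
        while (!pending.isEmpty()) {
            Record r = pending.peekFirst();
            if (!r.valid) {
                // The failed record stays pending, so it fails again next run.
                throw new IllegalStateException("Import failed for " + r.name);
            }
            core.add(r.name);
            pending.removeFirst();
        }
    }

    /** Imports an invalid file, then a valid one; returns what reached core. */
    public static List<String> run(boolean validateFirst) {
        ImportQueueSketch q = new ImportQueueSketch(validateFirst);
        Record[] files = {
            new Record("feb_2005/human_small.xml", false),
            new Record("nov_2004/human_small.xml", true),
        };
        for (Record r : files) {
            if (q.stage(r)) {
                try {
                    q.transferAll();
                } catch (IllegalStateException e) {
                    // This import run aborts; the queue is left as it was.
                }
            }
        }
        return q.core;
    }

    public static void main(String[] args) {
        System.out.println("without validation: " + run(false));
        System.out.println("with validation:    " + run(true));
    }
}
```

In this model, the valid November file only reaches the core list when
the invalid February file is rejected before it is staged.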

Clearing up this confusion and creating a clear pipeline for validating
and importing new data is actually one of our to-do items. If you are
interested, more details are here:
http://www.cbio.mskcc.org/dev_site/cpath/todo.html (see #33, refactor
pipeline for importing PSI-MI data). Since you have already run up
against the issue, we will triage this item as a higher priority.

In the meantime, the simplest option is to reset the database:

bin/resetDb.pl

This clears everything, including the import tables. However, if you
already have data in cPath that you don't want to lose, you can also
connect to mysql directly, and issue this command:

truncate table import;

Then, try importing the November IntAct file.

Ethan

Ethan Cerami

Mar 4, 2005, 10:06:28 AM
to cp...@googlegroups.com, David...@schering.de, Gary Bader, sker...@ebi.ac.uk
Hello David,

After you contacted us about your problems with importing PSI-MI files
from IntAct, we relayed the problem to the IntAct team, cc'd here.

The IntAct group has fixed the problem, and their latest data is now
available here: ftp://ftp.ebi.ac.uk/pub/databases/IntAct/2005-03-01/xml/.

I also just downloaded the human_small.xml file, and successfully loaded
it into cPath without any problems.

Thanks again for reporting this problem to us.