PDB on BiO.CC?


Dan Bolser

unread,
Oct 26, 2010, 1:39:26 AM10/26/10
to BiO.CC server interface
Hi,

Do we mirror the PDB on BiO.CC?

For PDBWiki we made some nice scripts for syncing and unzipping the
PDB:

http://github.com/dbolser/PDB-rsync


If it is OK, I'll put them on bio.cc and stick them in crontab.
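For reference, the sync step in such scripts boils down to a single rsync call against the wwPDB's public rsync service. A hedged sketch (the real logic lives in the PDB-rsync repo above; server and port are the wwPDB's published ones, and the mirror path is the one used on bio.cc in this thread):

```shell
#!/bin/sh
# Rough shape of the sync step (a sketch, not the actual PDB-rsync code).
# Server and port are the wwPDB's published rsync service; the mirror
# path matches the one used on bio.cc later in this thread.
MIRRORDIR=/BiO/Store/DB/PDB
CMD="rsync -rlpt -v -z --delete --port=33444 rsync.rcsb.org::ftp_data/structures/divided/pdb/ $MIRRORDIR/data/structures/divided/pdb/"
# Echoed rather than executed so the sketch is safe to run anywhere;
# update.sh presumably runs it and tallies the stats from its output.
echo "$CMD"
```

Note the `--delete`: it keeps the mirror exact, but also removes anything local that isn't on the server.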


Cheers,
Dan.

Sung Gong

unread,
Oct 26, 2010, 4:10:39 AM10/26/10
to Dan Bolser, BiO.CC server interface
Hi, Dan

Not that I am aware of, but I found bio.cc:/BiO/Store/DB/PDB, which was
mirrored last year (maybe a one-off)?
So, please go ahead as long as the space is enough - 846GB remaining.
I believe you can overwrite the old one.

Cheers,
Sung

> --
> You received this message because you are subscribed to the Google
> Groups "BiO.CC server interface" group.
>
> BiOcentre proposes progressive concepts in using biological data, new types of databases, and new ways of looking at old problems. We encourage members to propose and realize radical and revolutionary methods in science and engineering.

Dan Bolser

unread,
Oct 26, 2010, 4:20:12 AM10/26/10
to Sung Gong, BiO.CC server interface
Cheers Sung,

We keep the parts we don't regularly use zipped, so 0.8 TB should be enough ;-)

Seriously, I think the whole archive comes in well under 100 GB.

jong bhak

unread,
Oct 26, 2010, 4:28:42 AM10/26/10
to Dan Bolser, Sung Gong, BiO.CC server interface
Hi,

We are buying hundreds of terabytes of disk.
If you need more, just shout.

Cheers

Jong
 

 

Jong Bhak,

\(^o^)/


Tel: 031-888-9311, Fax: 031-888-9314, Mobile: 010 9944 6754

Dan Bolser

unread,
Oct 26, 2010, 4:37:41 AM10/26/10
to jong bhak, Sung Gong, BiO.CC server interface
I need to get round to building my own RAID server. I'm just too disorganized!

Dan Bolser

unread,
Oct 27, 2010, 2:22:19 AM10/27/10
to BiO.CC server interface


On Oct 26, 9:10 am, Sung Gong <s...@bio.cc> wrote:
> Hi, Dan
>
> Not that I am aware of, but I found bio.cc:/BiO/Store/DB/PDB, which was
> mirrored last year (maybe a one-off)?
> So, please go ahead as long as the space is enough - 846GB remaining.
> I believe you can overwrite the old one.

OK, I changed the group owner of that dir to Biomatics, and made it
group writeable.

I checked out the scripts from git and made a few changes. I commented
out the rsync '--delete' line for this run, and I have started the
process.

Let's see if it works! :-D


> Cheers,
> Sung

Dan Bolser

unread,
Oct 28, 2010, 1:56:16 AM10/28/10
to BiO.CC server interface
On 27 October 2010 07:22, Dan Bolser <dan.b...@gmail.com> wrote:

> Let's see if it works! :-D

Seems the process crashed 6000 structures in:

./update.sh

Wed Oct 27 15:33:43 KST 2010

RSYNCING!
rsync: failed to set times on "/BiO/Store/DB/PDB/./.": Operation not
permitted (1)
rsync: writefd_unbuffered failed to write 4092 bytes to socket
[generator]: Connection reset by peer (104)
rsync error: error in rsync protocol data stream (code 12) at
io.c(1525) [generator=3.0.6]
rsync error: received SIGUSR1 (code 19) at main.c(1285) [receiver=3.0.6]

NEW AND/OR UPDATED stats:
PDB; all 0, divided 0.
mmCIF; all 0, divided 0.
XML; all 0, divided 0.
XML-noatom; all 0, divided 0.
XML-extatom; all 0, divided 0.
Structure factors; all 0, divided 0.
BioUnit; all 0, divided 6085.

DELETED stats:
PDB; all 0, divided 0.
mmCIF; all 0, divided 0.
XML; all 0, divided 0.
XML-noatom; all 0, divided 0.
XML-extatom; all 0, divided 0.
Structure factors; all 0, divided 0.
BioUnit; all 0, divided 0.

Wed Oct 27 16:44:30 KST 2010

UNZIPING!

doing /BiO/Store/DB/PDB/data/structures/all/pdb
cant open directory /BiO/Store/DB/PDB/data/structures/all/pdb : No
such file or directory

Wed Oct 27 16:44:30 KST 2010

DONE!
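The "cant open directory" message above suggests the unzip step ran unconditionally even though the rsync had died. A defensive version of that step might look like this (a sketch using throwaway /tmp paths and a fake gzipped entry, not the actual update.sh code):

```shell
#!/bin/sh
# Defensive sketch of the unzip step: bail out early if the source
# directory is missing (e.g. because the rsync step died), instead of
# failing file-by-file. Paths are throwaway demo paths, not the real ones.
SRC=/tmp/demo_pdb_gz
DST=/tmp/demo_pdb_unzipped
rm -rf "$SRC" "$DST"
mkdir -p "$SRC" "$DST"
printf 'HEADER    DEMO\n' | gzip > "$SRC/pdb1abc.ent.gz"   # fake PDB entry

if [ ! -d "$SRC" ]; then
    echo "skipping unzip: $SRC missing (did rsync fail?)" >&2
    exit 1
fi

for f in "$SRC"/*.ent.gz; do
    [ -e "$f" ] || continue                         # empty dir: nothing to do
    gunzip -c "$f" > "$DST/$(basename "$f" .gz)"    # keep the .gz around
done
ls "$DST"
```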

Dan Bolser

unread,
Dec 8, 2010, 4:15:54 AM12/8/10
to biocc-serve...@googlegroups.com

The rsync PDB update process seems to be working now, so I've added it to cron. For simplicity I have created a pdb group / user who owns all the files under /BiO/Store/DB/PDB (and this user runs the cron job).

Note: any files created under /BiO/Store/DB/PDB will be deleted by the rsync process unless you explicitly list them in /BiO/Store/DB/PDB/exclude.list - this is important to remember!


Other PDB-related files can be found under /BiO/Store/DB/DBd/PDB, which has nothing to do with the PDB mirror at /BiO/Store/DB/PDB. i.e. create what you like under /BiO/Store/DB/DBd/PDB, and it won't get deleted no matter what.


I'll check in on the cron job for at least the first few weeks... does anyone know how to get the job to report to a mailing list?
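One standard approach, assuming Vixie cron: set MAILTO in the pdb user's crontab and let the MTA's aliases handle forwarding. A hypothetical crontab sketch (the schedule is illustrative):

```crontab
# Hypothetical crontab for the pdb user (edit with `crontab -e` as pdb).
# With Vixie cron, MAILTO controls where job output is mailed; "pdb" here
# is the local user, which /etc/aliases can then forward onwards.
MAILTO=pdb
# weekly mirror update (day/time illustrative)
45 11 * * 3 /BiO/Store/DB/PDB/update.sh
```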

Cheers,
Dan.

Sung Gong

unread,
Dec 8, 2010, 4:28:08 PM12/8/10
to biocc-serve...@googlegroups.com
Dan, thanks for managing PDB stuff.

Did you try editing /etc/aliases?
e.g. pdb biocc-serve...@googlegroups.com

You need to issue 'newaliases' after editing the file.
I cannot access bio.cc at the moment - maybe it's down?

Cheers,
Sung

Sung Gong

unread,
Dec 8, 2010, 4:31:18 PM12/8/10
to biocc-serve...@googlegroups.com
Just edited - let's see how it goes.

Dan Bolser

unread,
Dec 9, 2010, 7:32:39 AM12/9/10
to Sung Gong, biocc-serve...@googlegroups.com
No problem Sung, we are hoping to use BiO.CC as the PDB MySQL server
for PDBWiki.

I'm loading the structures into the relational database now, and it's
taking a long time... We are up to 50k structures loaded after about
2.5 days...

It could be 'painful' to run this every week as we do currently. I'll
have to look at alternative PDB to RDB loaders to see how easy it is
to build a 'PDBLite' relational database.

I guess this is why bio.cc is running slow currently, because mysql
writes to NFS (and home dirs are also on NFS). Not sure how to get
round that easily... Do any other mysql servers use the same NFS
mounted data directory? (That could be a problem).


Thanks for changing /etc/aliases (sorry, I always forget that is the
place to edit). Looks like the first cron job got in successfully
(yay!) before the change was made:

From: ro...@jaesu.bio.cc (Cron Daemon)
To: p...@jaesu.bio.cc
Subject: Cron <pdb@jaesu> /BiO/Store/DB/PDB/update.sh
Content-Type: text/plain; charset=UTF-8
Auto-Submitted: auto-generated
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/home/pdb>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=pdb>
X-Cron-Env: <USER=pdb>
Status: R


Wed Dec 8 11:45:01 KST 2010

RSYNCING!

NEW AND/OR UPDATED stats:
PDB; all 181, divided 325.
mmCIF; all 181, divided 329.
XML; all 181, divided 329.
XML-noatom; all 181, divided 329.
XML-extatom; all 181, divided 329.
Structure factors; all 172, divided 172.
BioUnit; all 318, divided 529.

DELETED stats:
PDB; all 2, divided 2.
mmCIF; all 2, divided 2.
XML; all 2, divided 2.
XML-noatom; all 2, divided 2.
XML-extatom; all 2, divided 2.
Structure factors; all 2, divided 2.
BioUnit; all 5, divided 5.

Wed Dec 8 12:42:58 KST 2010

UNZIPING!

doing /BiO/Store/DB/PDB/data/structures/all/pdb
While unzipping I found...
181 new PDB files
144 updated PDB files
2 obsolete PDB files
doing /BiO/Store/DB/PDB/data/structures/all/mmCIF
While unzipping I found...
181 new PDB files
148 updated PDB files
2 obsolete PDB files

Wed Dec 8 12:47:51 KST 2010

DONE!


BTW, feel free to play with the new PDB relational database (pdbase).
We have some usage notes that I'll put somewhere visible soon.

Cheers,
Dan.

Henning Stehr

unread,
Dec 9, 2010, 10:50:00 AM12/9/10
to Dan Bolser, Sung Gong, biocc-serve...@googlegroups.com
Hi Dan,

Our admins pointed out another potential bottleneck for the PDBase
loading. Apparently NFS transmits a directory index every time a file
is accessed. For the 'unzipped' directory with 60,000 files this may
put significant load on the network and slow down the process. Maybe
we could work around this by using 'hash directories' just like for
the gzipped files. But I'd have to look into how OpenMMS handles file
loading. I think it currently takes a list file as input and I'm not
sure whether paths are allowed in this list (then it would be easy to
change). Otherwise, maybe running multiple parallel instances of
OpenMMS would help. This assumes that we have enough memory and MySQL
is not the bottleneck.

Thanks for all the work!

Henning

Dan Bolser

unread,
Dec 9, 2010, 11:21:43 AM12/9/10
to Henning Stehr, Sung Gong, biocc-serve...@googlegroups.com
On 9 December 2010 15:50, Henning Stehr <st...@molgen.mpg.de> wrote:
> Hi Dan,
>
> Our admins pointed out another potential bottleneck for the PDBase
> loading. Apparently NFS transmits a directory index every time a file
> is accessed. For the 'unzipped' directory with 60,000 files this may
> put significant load on the network and slow down the process. Maybe

Interesting. I remember hearing about problems with big directories
under NFS / GPFS / etc. However, I thought if you had the full path,
it didn't require the directory index lookup.

BTW, it's almost 70 thousand structures now! We should get ready for
the 100,000th structure ;-)


> we could work around this by using 'hash directories' just like for
> the gzipped files. But I'd have to look into how OpenMMS handles file
> loading. I think it currently takes a list file as input and I'm not
> sure whether paths are allowed in this list (then it would be easy to

It should be easy enough to hack the loader, but I figure if we are
going to do anything, it should be to implement incremental loading.
Improving load efficiency may let us load 70k structures per week, but
since we only really need to load about 500 (140 times fewer), we
should focus on that (i.e. what are the chances of us improving
loading efficiency by 140-fold?).

I remember that there are options for incremental loading, but that
the loader wasn't behaving as advertised?


> change). Otherwise, maybe running multiple parallel instances of
> OpenMMS would help. This assumes that we have enough memory and MySQL
> is not the bottleneck.

Yeah, I thought about running parallel instances too. I think this is
the best short term solution, but I don't mind looking at the loader
for a) incremental loading, or b) PDBLite schema.


> Thanks for all the work!

As you know, all I did was install your modified loader!


Cheers,
Dan.

Henning Stehr

unread,
Dec 9, 2010, 11:43:24 AM12/9/10
to Dan Bolser, biocc-serve...@googlegroups.com
I agree that incremental loading should be the way to go. From what I
remember, we had two reasons for loading the whole PDB every time:

1. to be sure that the DB is up-to-date no matter how often the update
was skipped
2. to avoid inconsistencies which would occur if the process is killed
while loading an entry

The first point can easily be addressed by doing a diff of the files
in 'unzipped' and in the DB. Just feeding the diff list to OpenMMS
would do it without even having to change anything in the pipeline.
The second point is more difficult because OpenMMS is not really
'transaction-safe'. But then, we could just hope for the best, and if
we really notice any inconsistencies we can always do a full update
again.
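The diff of 'unzipped' against the DB could be as simple as two sorted ID lists and comm. A self-contained sketch with made-up paths and IDs (the DB query in the comment is an assumption about the OpenMMS schema):

```shell
#!/bin/sh
# Sketch of the incremental-load diff: IDs on disk minus IDs already in
# the DB gives the list to feed to OpenMMS. Paths and IDs are made up.
UNZIPPED=/tmp/demo_unzipped
rm -rf "$UNZIPPED"
mkdir -p "$UNZIPPED"
touch "$UNZIPPED/pdb1abc.ent" "$UNZIPPED/pdb2xyz.ent" "$UNZIPPED/pdb9xim.ent"

# In reality this list would come from something like:
#   SELECT id FROM mms_entry WHERE load_status = 1
printf '1abc\n9xim\n' | sort > /tmp/in_db.list

# IDs present in the unzipped directory
ls "$UNZIPPED" | sed 's/^pdb//; s/\.ent$//' | sort > /tmp/on_disk.list

# On disk but not in the DB: the incremental load list
comm -23 /tmp/on_disk.list /tmp/in_db.list   # prints "2xyz"
```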

For both points it may help that table 'mms_entry' has a column
'load_status' which, I think, should only be '1' if loading was
successful.

I believe we mainly didn't implement incremental loading before
because we had enough resources available to not even bother.

Henning

Sung Gong

unread,
Dec 9, 2010, 5:44:38 PM12/9/10
to Dan Bolser, biocc-serve...@googlegroups.com
> I'm loading the structures into the relational database now, and it's
> taking a long time... We are up to 50k structures loaded after about
> 2.5 days...

It looks like it's taking too long?


>
> It could be 'painful' to run this every week as we do currently. I'll
> have to look at alternative PDB to RDB loaders to see how easy it is
> to build a 'PDBLite' relational database.
>
> I guess this is why bio.cc is running slow currently, because mysql
> writes to NFS (and home dirs are also on NFS). Not sure how to get
> round that easily... Do any other mysql servers use the same NFS
> mounted data directory? (That could be a problem).
>

Are you sure? Is mysql writing to an NFS directory?
I see this from /etc/my.cnf

datadir=/BiO/Serve/LocalmySQL

And this from /etc/fstab:
/dev/md0 /BiO ext3 defaults 0 0

I don't think /BiO is NFS-mounted.
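A quick way to settle this kind of question is to ask the kernel directly rather than read /etc/fstab, e.g. with GNU stat:

```shell
#!/bin/sh
# fstab only says what *should* be mounted; asking the kernel settles it.
# On bio.cc this would be `stat -f -c %T /BiO/Serve/LocalmySQL`; it is
# run on / here so the snippet works anywhere with GNU coreutils.
stat -f -c %T /   # e.g. "ext2/ext3" for local disk; "nfs" for a network mount
```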


>
> Thanks for changing /etc/aliases (sorry, I always forget that is the
> place to edit). Looks like the first cron job got in successfully
> (yay!) before the change was made:

Let's see whether we get a cron mail for this job next time when it runs.

>
>
> BTW, feel free to play with the new PDB relational database (pdbase).
> We have some usage notes that I'll put somewhere visible soon.
>

Thanks and cheers.

Dan Bolser

unread,
Dec 17, 2010, 4:10:38 PM12/17/10
to biocc-serve...@googlegroups.com


On Wednesday, December 8, 2010 9:31:18 PM UTC, Sungsam wrote:
> Just edited - let's see how it goes.

Seems it ran OK, and there is no new email in /var/spool/mail/pdb, however, I guess it didn't turn up here... I just tried sending a test mail from bio.cc here...

 

Dan Bolser

unread,
Dec 17, 2010, 5:11:15 PM12/17/10
to BiO.CC server interface


On Dec 9, 10:44 pm, Sung Gong <s...@bio.cc> wrote:
> > I'm loading the structures into the relational database now, and it's
> > taking a long time... We are up to 50k structures loaded after about
> > 2.5 days...
>
> It looks like it's taking too long?

I just checked the log now that the job is finished:
tail /BiO/Research/PDBWiki/Software/Pipeline/PDBASE.LOG

...
Loading PDB_ID: 9xim
Read time: 0.475 s. Load time: 1.702 s. Entry_key: 69650

69650 entries loaded. Fri Dec 10 10:36:11 KST 2010
Total time = 52:01:56 Time per entry = 2.689
Load list_file Done.

i.e. two days... which I guess wouldn't be too bad if it didn't create
a high load.


> > It could be 'painful' to run this every week as we do currently. I'll
> > have to look at alternative PDB to RDB loaders to see how easy it is
> > to build a 'PDBLite' relational database.
>
> > I guess this is why bio.cc is running slow currently, because mysql
> > writes to NFS (and home dirs are also on NFS). Not sure how to get
> > round that easily... Do any other mysql servers use the same NFS
> > mounted data directory? (That could be a problem).
>
> Are you sure? Is mysql writing a NFS directory?
> I see this from /etc/my.cnf
>
> datadir=/BiO/Serve/LocalmySQL
>
> And this from /etc/fstab:
> /dev/md0                /BiO                    ext3    defaults        0 0
>
> I don't think /BiO is NFS-mounted.

OK that's good.


Cheers,
Dan.

Sung Gong

unread,
Dec 18, 2010, 4:56:06 AM12/18/10
to biocc-serve...@googlegroups.com
On 17 December 2010 21:10, Dan Bolser <dan.b...@gmail.com> wrote:
>
>
> On Wednesday, December 8, 2010 9:31:18 PM UTC, Sungsam wrote:
>>
>> Just edited - let's see how it goes.
>
> Seems it ran OK, and there is no new email in /var/spool/mail/pdb, however,
> I guess it didn't turn up here... I just tried sending a test mail from
> bio.cc here...
>
>

I could see a new email in /var/spool/mail/pdb.
But it doesn't seem to relay properly to biocc-serve...@googlegroups.com

Maybe I need to check the biocc google-group settings - only members can
send emails?

Tested on my local account (su...@jaesu.bio.cc), which relayed
successfully to my gmail account.

Dan Bolser

unread,
Dec 18, 2010, 6:07:30 AM12/18/10
to Sung Gong, biocc-serve...@googlegroups.com
On 18 December 2010 09:56, Sung Gong <su...@bio.cc> wrote:
> On 17 December 2010 21:10, Dan Bolser <dan.b...@gmail.com> wrote:
>>
>>
>> On Wednesday, December 8, 2010 9:31:18 PM UTC, Sungsam wrote:
>>>
>>> Just edited - let's see how it goes.
>>
>> Seems it ran OK, and there is no new email in /var/spool/mail/pdb, however,
>> I guess it didn't turn up here... I just tried sending a test mail from
>> bio.cc here...
>>
>>
>
> I could see a new email in /var/spool/mail/pdb.
> But it doesn't seem to relay properly to biocc-serve...@googlegroups.com
>
> Maybe I need to check the biocc google-group settings - only members can
> send emails?
>
> Tested on my local account (su...@jaesu.bio.cc), which relayed
> successfully to my gmail account.

Interesting! I'll add it to the list of members.

Sung Gong

unread,
Dec 18, 2010, 6:23:03 AM12/18/10
to Dan Bolser, biocc-serve...@googlegroups.com
On 18 December 2010 11:07, Dan Bolser <dan.b...@gmail.com> wrote:
> On 18 December 2010 09:56, Sung Gong <su...@bio.cc> wrote:
>> On 17 December 2010 21:10, Dan Bolser <dan.b...@gmail.com> wrote:
>>>
>>>
>>> On Wednesday, December 8, 2010 9:31:18 PM UTC, Sungsam wrote:
>>>>
>>>> Just edited - let's see how it goes.
>>>
>>> Seems it ran OK, and there is no new email in /var/spool/mail/pdb, however,
>>> I guess it didn't turn up here... I just tried sending a test mail from
>>> bio.cc here...
>>>
>>>
>>
>> I could see a new email in /var/spool/mail/pdb.
>> But it doesn't seem to relay properly to biocc-serve...@googlegroups.com
>>
>> Maybe I need to check the biocc google-group settings - only members can
>> send emails?
>>
>> Tested on my local account (su...@jaesu.bio.cc), which relayed
>> successfully to my gmail account.
>
> Interesting! I'll add it to the list of members.
>

Just added to the member list.
Could you set up a dummy cron job to see whether p...@jaesu.bio.cc can
relay to biocc-server-interface?

Just to make things clear, my test above was done locally using mutt -
that's why the mail sender shows as jaesu.bio.cc.
But I just added 'jaesu' as a CNAME of bio.cc.

About godaddy, it's all about email forwarding - it's now set up for
su...@bio.cc and j...@bio.cc.
Would you like to add p...@bio.cc (same as p...@jaesu.bio.cc)?
I thought p...@bio.cc would be for sending only (to biocc-server-interface)

Dan Bolser

unread,
Dec 18, 2010, 6:39:26 AM12/18/10
to Sung Gong, biocc-serve...@googlegroups.com
On 18 December 2010 11:23, Sung Gong <su...@bio.cc> wrote:
> On 18 December 2010 11:07, Dan Bolser <dan.b...@gmail.com> wrote:
>> On 18 December 2010 09:56, Sung Gong <su...@bio.cc> wrote:
>>> On 17 December 2010 21:10, Dan Bolser <dan.b...@gmail.com> wrote:
>>>>
>>>>
>>>> On Wednesday, December 8, 2010 9:31:18 PM UTC, Sungsam wrote:
>>>>>
>>>>> Just edited - let's see how it goes.
>>>>
>>>> Seems it ran OK, and there is no new email in /var/spool/mail/pdb, however,
>>>> I guess it didn't turn up here... I just tried sending a test mail from
>>>> bio.cc here...
>>>>
>>>>
>>>
>>> I could see a new email in /var/spool/mail/pdb.
>>> But it doesn't seem to relay properly to biocc-serve...@googlegroups.com
>>>
>>> Maybe I need to check the biocc google-group settings - only members can
>>> send emails?
>>>
>>> Tested on my local account (su...@jaesu.bio.cc), which relayed
>>> successfully to my gmail account.
>>
>> Interesting! I'll add it to the list of members.
>>
>
> Just added to the member list.
> Could you set up a dummy cron job to see whether p...@jaesu.bio.cc can
> relay to biocc-server-interface?

I tried via Mutt to begin with (as pdb).

Just tried a cron at 20:33 (server time)...
Just tried a cron at 20:38 (server time)...


> Just to make things clear, my test above is done locally using mutt -
> that's why the mail sender can recognize jaesu.bio.cc
> But just added 'jaesu' as a CNAME of bio.cc
>
> About godaddy, it's all about email forwarding - now set up for
> su...@bio.cc and j...@bio.cc.
> Do you like to add p...@bio.cc (same with p...@jaesu.bio.cc)?
> I thought p...@bio.cc would be just sending only (to biocc-server-interface)

:-/ ... I don't know, sorry! If you think it's a good idea :-)


Dan.
