Help with setting up VASP example


Brad

May 29, 2011, 10:25:55 PM
to USPEX
Hi USPEXers, my name is Brad Malone and I'm a graduate student at UC
Berkeley. I've seen Prof. Oganov speak a couple of times about the
USPEX methodology and it always seemed like a powerful technique for
predicting structures. I also read Qiang Zhu's paper last week on
superdense carbon allotropes and figured I should finally try and see
how the code works.

I've downloaded the code successfully and have MATLAB installed and
working on a cluster I have access to. I'm currently trying to run the
8-atom silicon VASP example but am running into difficulties getting
it to run properly. The manual clearly states "we cannot guarantee
support for solving problems with massively parallel mode" but I was
hoping to get some suggestions if possible. Maybe others will find the
discussion useful as well.

My understanding is that to run the VASP example on the 8-atom silicon cell, there are essentially four things I need to do:

1). Modify the INPUT_EA.txt file
2). Drop either an executable or a batch script into /Specific.
3). Modify submitJob.m
4). Modify checkStatusC.m

If the list of things that need to be done is longer than this, then that's probably my problem. Assuming it's not, let me explain exactly what I changed and how it crashes.

1). I made minimal changes to the INPUT_EA.txt file. A 'diff' reveals that all I did was change whichCluster from nonParallel to hreidar, change the number of processors to 32, and change 'mpirun -np 2 ./vasp > log' in commandExecutable to 'batch.basic', which is the name of the batch script I dropped into the /Specific directory. While I don't think there's anything wrong with batch.basic, I'll post its contents below for completeness.
------------------------------------
#PBS -A our_repo
#PBS -q normal
#PBS -N USPEX_test
#PBS -l nodes=4:ppn=8
#PBS -l walltime=2:00:00
#PBS -j eo
#PBS -M my_email
#PBS -m e

cd $PBS_O_WORKDIR

VASP=/global/cluster/users/me/vasp

NPROCS=`wc -l $PBS_NODEFILE | awk '{print $1}'`

mpirun $VASP
-------------------------------------------------------------

2). As I said, the above script is dropped into /Specific.

3). With regard to the submitJob.m file, I modified the hreidar block, which originally looks like this:

---------------------------------------------------
elseif ORG_STRUC.hreidar
    if numProcessors == 1
        if ORG_STRUC.gonzales
            [nothing, tline] = unix(['bsub -W ' ORG_STRUC.wallTime ' -o schaa "prun ./' ORG_STRUC.commandExecutable{POP_STRUC.POPULATION(Ind_No).Step} '"']);
        else
            [nothing, tline] = unix(['bsub -o schaa -W ' ORG_STRUC.wallTime ' < ./' ORG_STRUC.commandExecutable{POP_STRUC.POPULATION(Ind_No).Step} ]);
        end
    else
        if ORG_STRUC.gonzales
            [nothing, tline] = unix(['bsub -n ' num2str(numProcessors) ' -W ' ORG_STRUC.wallTime ' -o schaa "prun ./' ORG_STRUC.commandExecutable{POP_STRUC.POPULATION(Ind_No).Step} '"']);
        else
            [nothing, tline] = unix(['bsub -x -n ' num2str(numProcessors) ' -o schaa -W ' ORG_STRUC.wallTime ' < ./' ORG_STRUC.commandExecutable{POP_STRUC.POPULATION(Ind_No).Step} ]);
        end
    end
----------------------------------------------

The only thing I modified here was the final unix command, which I changed to:
----------------------------
[nothing, tline] = unix(['qsub ' ORG_STRUC.commandExecutable{POP_STRUC.POPULATION(Ind_No).Step} ]);
------------------------------------------
since all my environment variables are defined in the batch script.

I also changed the jobNumber assignment to
jobNumber = str2num(tline(1:6));

since this is what's appropriate for my cluster.
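(As an aside, since tline(1:6) assumes the job ID is exactly the first six characters of the qsub reply, I suppose something like the untested sketch below would be slightly more robust if the ID length ever changes; it just assumes the ID is the leading run of digits in a reply like '123456.servername':)
----------------------------
% Untested sketch: grab the leading digits of the qsub reply instead of
% hard-coding six characters, then convert them to a number.
idStr = regexp(tline, '^\d+', 'match', 'once');
jobNumber = str2double(idStr);
----------------------------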


4). Finally, I made a small change to checkStatusC.m. Instead of
-----------------------
elseif ORG_STRUC.hreidar
    [nothing, statusStr] = unix(['bjobs ' num2str(jobID) ]);
    doneOr = strfind(statusStr, 'found');
    if ~isempty(doneOr)
        doneOr = 1;
    else
        doneOr = strfind(statusStr, 'DONE')
        if ~isempty(doneOr)
            doneOr = 1;
        else
            doneOr = 0;
        end
    end
---------------------------------
I have the section
---------------
elseif ORG_STRUC.hreidar
    disp('About to launch the python');
    [nothing, statusStr] = unix(['python ~/help_uspex.py status ' num2str(jobID) ]);
    doneOr = strfind(statusStr, 'Running');
    if ~isempty(doneOr)
        doneOr = 0;
    else
        donOr = 1;
    end
------------------------
help_uspex.py is a simple python script which returns the string
'Running' if the job is still in the queue (whether running, queued,
or just completed) and 'Done' if it's not found in the queue.
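
(For completeness: I imagine the same check could also be done directly from MATLAB without the helper script, along the lines of the untested sketch below. It assumes a PBS-style qstat is on the PATH, that it prints a line containing the job ID while the job is queued or running, and that it prints nothing useful once the job has left the queue.)
-----------------------------
% Untested sketch of an all-MATLAB status check (assumes PBS 'qstat' is available).
[nothing, statusStr] = unix(['qstat ' num2str(jobID)]);
if isempty(strfind(statusStr, num2str(jobID)))
    doneOr = 1;   % job no longer listed in the queue
else
    doneOr = 0;   % still queued or running
end
-----------------------------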


And that's all I've changed. When I run the code with MATLAB, my job is constructed properly in CalcFold1 (I believe), is sent to the queue, and completes successfully with VASP. The problem is that MATLAB exits almost immediately, with the last thing written to 'log' being

---------------------------------
rm: cannot remove `still_reading': No such file or directory
----------------------------------
and it appears this comes from this section of ev_alg.m:
-----------
if sum([POP_STRUC.POPULATION(:).Done])~= length(POP_STRUC.POPULATION)
    unix('rm still_reading');
-----------

Now I would have imagined (although I don't know) that MATLAB would continue to run, use checkStatusC.m to check on the job, wait until it finishes, and then proceed with the next member of the generation (or move on to the next generation) rather than exit immediately. However, the code that I modified in checkStatusC.m isn't being used (I never see my 'About to launch the python' message in the log file).

Did I not change something that I need to change, or perhaps change
something improperly? It seems like it's very close to working, just
not quite. Thanks in advance for any suggestions, I appreciate it!

Best,
Brad


Andriy Lyakhov

May 30, 2011, 12:26:46 PM
to USPEX
Hello,

it seems you actually did a good job changing the USPEX code, and it should work properly. What you didn't take into account is how USPEX works in massively parallel mode. Instead of constantly checking whether a job is done, as in 'nonParallel' mode, USPEX simply stops after submitting all jobs. You have to use a crontab to launch USPEX regularly. On every launch it will check the output: if a job is done it will process it and submit the next batch of jobs; if it's not done, USPEX will simply quit again. So just try launching USPEX again, without changing or deleting anything, and see what happens.

Such behavior may not make sense on a small cluster for a small job, but it is needed for jobs on big supercomputers that sometimes run for days. There is no need to waste the scarce resources of the login nodes by keeping MATLAB running for every USPEX calculation. Also, on some supercomputers that we used, it was apparently a problem if USPEX checked the status of the jobs too often.

If you don't want this behavior, I would recommend making the following change. Check how long VASP takes for your average task; let's say 10 minutes. Then change this part of ev_alg.m:
----------------------------------------------------------------------------
if sum([POP_STRUC.POPULATION(:).Done])~= length(POP_STRUC.POPULATION)
    unix('rm still_reading');
    fclose ('all');
    if ORG_STRUC.remote
        return
    elseif ORG_STRUC.remoteTALC
        return
    end
    quit
else
    break;
end
--------------------------------------------------------------------------

into:

----------------------------------------------------------------------------
if sum([POP_STRUC.POPULATION(:).Done])~= length(POP_STRUC.POPULATION)
    if ORG_STRUC.hreidar
        pause(600);
    else
        unix('rm still_reading');
        fclose ('all');
        if ORG_STRUC.remote
            return
        elseif ORG_STRUC.remoteTALC
            return
        end
        quit
    end
else
    break;
end
--------------------------------------------------------------------------

This way, instead of quitting, USPEX will wait for 10 minutes and then check whether the jobs are finished. Hope it helps.

Sincerely,
Andriy Lyakhov

Brad

May 30, 2011, 10:01:53 PM
to USPEX
Hi Andriy,

Thanks so much for the quick and helpful response! I did not, in fact, realize that cron (or another automation procedure) was needed to rerun USPEX periodically. I saw that section in the manual but erroneously thought it only dealt with remote submissions. I had tried running the code again yesterday, but it didn't fix my problem, which turned out to be due to a typo in one of my lines above (donOr rather than doneOr). After correcting that, the code seems to work correctly.

However, I have a question about the results. My calculation (still
the default VASP example) is currently on the 6th generation (with a
population of 20). If I look at my /results1/origin file I see the
following:


------- generation1 -------
1 random
2 random
3 random
4 random
5 random
6 random
7 random
8 random
9 random
10 random
11 random
12 random
13 random
14 random
15 random
16 random
17 random
18 random
19 random
20 random
------- generation2 -------
21 done by heredity, parents numbers 7 3
22 done by heredity, parents numbers 6 4
23 done by heredity, parents numbers 14 18
24 done by heredity, parents numbers 9 12
25 done by heredity, parents numbers 7 10
26 done by heredity, parents numbers 3 7
27 done by heredity, parents numbers 18 3
28 done by heredity, parents numbers 7 3
29 done by heredity, parents numbers 12 2
30 coormutated, parent number 7
31 done by heredity, parents numbers 10 18
32 coormutated, parent number 3
33 coormutated, parent number 12
34 latmutated, parent number 10
35 latmutated, parent number 10
36 latmutated, parent number 9
37 coormutated, parent number 18
38 kept as best, parent number 7
39 kept as best, parent number 3
40 kept as best, parent number 18
41 kept as best, parent number 6
42 latmutated, parent number 3
43 latmutated, parent number 3
44 latmutated, parent number 6
------- generation3 -------
45 done by heredity, parents numbers 22 21
46 done by heredity, parents numbers 21 1
47 done by heredity, parents numbers 13 1
48 done by heredity, parents numbers 15 11
49 done by heredity, parents numbers 15 13
50 done by heredity, parents numbers 2 1
51 done by heredity, parents numbers 21 1
52 done by heredity, parents numbers 15 21
53 done by heredity, parents numbers 4 22
54 done by heredity, parents numbers 15 22
55 coormutated, parent number 1
56 coormutated, parent number 1
57 coormutated, parent number 1
58 coormutated, parent number 1
59 latmutated, parent number 15
60 latmutated, parent number 11
61 latmutated, parent number 15
62 latmutated, parent number 22
63 kept as best, parent number 21
64 kept as best, parent number 22
65 kept as best, parent number 23
66 kept as best, parent number 1
67 latmutated, parent number 22
68 latmutated, parent number 24
------- generation4 -------
69 kept as best, parent number 21
70 kept as best, parent number 13
71 kept as best, parent number 20
72 kept as best, parent number 6
------- generation5 -------
73 latmutated, parent number 21
74 kept as best, parent number 21
75 kept as best, parent number 22
76 kept as best, parent number 23
77 kept as best, parent number 24
------- generation6 -------

This result seems odd to me, because I thought that I should have a
population of 20 within each generation (no more and no less). Am I
missing something? If we look at the /reference/origin file we see

------- generation1 -------
1 random
2 random
3 random
4 random
5 random
6 random
7 random
8 random
9 random
10 random
11 random
12 random
13 random
14 random
15 random
16 random
17 random
18 random
19 random
20 random
------- generation2 -------
21 done by heredity, parents numbers 6 17
22 done by heredity, parents numbers 19 12
23 done by heredity, parents numbers 19 6
24 done by heredity, parents numbers 17 6
25 done by heredity, parents numbers 12 17
26 done by heredity, parents numbers 17 19
27 done by heredity, parents numbers 6 19
28 done by heredity, parents numbers 12 6
29 done by heredity, parents numbers 13 17
30 done by heredity, parents numbers 17 6
31 coormutated, parent number 6
32 coormutated, parent number 13
33 coormutated, parent number 17
34 coormutated, parent number 12
35 latmutated, parent number 19
36 latmutated, parent number 4
37 latmutated, parent number 2
38 latmutated, parent number 6
39 latmutated, parent number 2
40 latmutated, parent number 17
41 kept as best, parent number 17
42 kept as best, parent number 6
43 kept as best, parent number 19
44 kept as best, parent number 12
------- generation3 -------
45 done by heredity, parents numbers 14 21
46 done by heredity, parents numbers 7 6
47 done by heredity, parents numbers 1 7
48 done by heredity, parents numbers 1 22
49 done by heredity, parents numbers 7 23
50 done by heredity, parents numbers 21 1
51 done by heredity, parents numbers 20 7
52 done by heredity, parents numbers 5 20
53 done by heredity, parents numbers 21 1
54 done by heredity, parents numbers 1 6
55 coormutated, parent number 7
56 coormutated, parent number 24
57 coormutated, parent number 1
58 coormutated, parent number 7
59 latmutated, parent number 7
60 latmutated, parent number 1
61 latmutated, parent number 1
62 latmutated, parent number 20
63 latmutated, parent number 1
64 latmutated, parent number 1
65 kept as best, parent number 7
66 kept as best, parent number 1
67 kept as best, parent number 22
68 kept as best, parent number 23
------- generation4 -------
69 done by heredity, parents numbers 19 4
70 done by heredity, parents numbers 7 5
71 done by heredity, parents numbers 7 19
72 done by heredity, parents numbers 9 4
73 done by heredity, parents numbers 4 8
74 done by heredity, parents numbers 5 23
75 done by heredity, parents numbers 8 19
76 done by heredity, parents numbers 7 9
77 done by heredity, parents numbers 7 4
78 done by heredity, parents numbers 4 5
79 coormutated, parent number 5
80 coormutated, parent number 4
81 coormutated, parent number 19
82 coormutated, parent number 5
83 latmutated, parent number 9
84 latmutated, parent number 7
85 latmutated, parent number 4
86 latmutated, parent number 4
87 latmutated, parent number 8
88 latmutated, parent number 6
89 kept as best, parent number 8
90 kept as best, parent number 4
91 kept as best, parent number 6
92 kept as best, parent number 5
------- generation5 -------
93 done by heredity, parents numbers 17 23
94 done by heredity, parents numbers 11 14
95 done by heredity, parents numbers 1 22
96 done by heredity, parents numbers 23 17
97 done by heredity, parents numbers 22 23
98 done by heredity, parents numbers 3 8
99 done by heredity, parents numbers 21 1
100 done by heredity, parents numbers 11 22
101 done by heredity, parents numbers 24 17
102 done by heredity, parents numbers 21 23
103 coormutated, parent number 22
104 coormutated, parent number 22
105 coormutated, parent number 8
106 coormutated, parent number 21
107 latmutated, parent number 21
108 latmutated, parent number 1
109 latmutated, parent number 3
110 latmutated, parent number 16
111 latmutated, parent number 1
112 latmutated, parent number 17
113 kept as best, parent number 21
114 kept as best, parent number 22
115 kept as best, parent number 23
116 kept as best, parent number 24
------- generation6 -------
117 done by heredity, parents numbers 9 11
118 done by heredity, parents numbers 17 3
119 done by heredity, parents numbers 22 3
120 done by heredity, parents numbers 3 22
121 done by heredity, parents numbers 16 18
122 done by heredity, parents numbers 22 24
123 done by heredity, parents numbers 6 3
124 done by heredity, parents numbers 17 3
125 done by heredity, parents numbers 16 10
126 done by heredity, parents numbers 6 17
127 coormutated, parent number 3
128 coormutated, parent number 5
129 coormutated, parent number 18
130 coormutated, parent number 24
131 latmutated, parent number 11
132 latmutated, parent number 17
133 latmutated, parent number 24
134 latmutated, parent number 3
135 latmutated, parent number 16
136 latmutated, parent number 3
137 kept as best, parent number 3
138 kept as best, parent number 22
139 kept as best, parent number 17
140 kept as best, parent number 24
------- generation7 -------
141 done by heredity, parents numbers 4 23
142 done by heredity, parents numbers 4 22
143 done by heredity, parents numbers 1 4
144 done by heredity, parents numbers 21 23
145 done by heredity, parents numbers 21 4
146 done by heredity, parents numbers 8 21
147 done by heredity, parents numbers 21 22
148 done by heredity, parents numbers 19 21
149 done by heredity, parents numbers 24 7
150 done by heredity, parents numbers 1 23
151 coormutated, parent number 19
152 coormutated, parent number 19
153 coormutated, parent number 23
154 coormutated, parent number 4
155 latmutated, parent number 4
156 latmutated, parent number 1
157 latmutated, parent number 21
158 latmutated, parent number 8
159 latmutated, parent number 21
160 latmutated, parent number 21
161 kept as best, parent number 21
162 kept as best, parent number 22
163 kept as best, parent number 23
164 kept as best, parent number 1
------- generation8 -------
165 done by heredity, parents numbers 3 23
166 done by heredity, parents numbers 24 23
167 done by heredity, parents numbers 21 7
168 done by heredity, parents numbers 23 21
169 done by heredity, parents numbers 3 23
170 done by heredity, parents numbers 22 3
171 done by heredity, parents numbers 3 5
172 done by heredity, parents numbers 3 21
173 done by heredity, parents numbers 3 21
174 done by heredity, parents numbers 24 21
175 coormutated, parent number 11
176 coormutated, parent number 23
177 coormutated, parent number 7
178 coormutated, parent number 24
179 latmutated, parent number 3
180 latmutated, parent number 22
181 latmutated, parent number 19
182 latmutated, parent number 19
183 latmutated, parent number 22
184 latmutated, parent number 3
185 kept as best, parent number 21
186 kept as best, parent number 22
187 kept as best, parent number 23
188 kept as best, parent number 24
------- generation9 -------
189 done by heredity, parents numbers 21 22
190 done by heredity, parents numbers 22 6
191 done by heredity, parents numbers 6 21
192 done by heredity, parents numbers 5 9
193 done by heredity, parents numbers 22 21
194 done by heredity, parents numbers 22 6
195 done by heredity, parents numbers 9 18
196 done by heredity, parents numbers 12 6
197 done by heredity, parents numbers 24 7
198 done by heredity, parents numbers 23 7
199 coormutated, parent number 21
200 coormutated, parent number 12
201 coormutated, parent number 6
202 coormutated, parent number 22
203 latmutated, parent number 18
204 latmutated, parent number 18
205 latmutated, parent number 24
206 latmutated, parent number 21
207 latmutated, parent number 22
208 latmutated, parent number 23
209 kept as best, parent number 21
210 kept as best, parent number 22
211 kept as best, parent number 23
212 kept as best, parent number 24
------- generation10 -------
213 done by heredity, parents numbers 23 24
214 done by heredity, parents numbers 17 22
215 done by heredity, parents numbers 24 5
216 done by heredity, parents numbers 23 19
217 done by heredity, parents numbers 1 24
218 done by heredity, parents numbers 5 12
219 done by heredity, parents numbers 23 22
220 done by heredity, parents numbers 23 24
221 done by heredity, parents numbers 23 24
222 done by heredity, parents numbers 23 22
223 coormutated, parent number 1
224 coormutated, parent number 1
225 coormutated, parent number 2
226 coormutated, parent number 3
227 latmutated, parent number 24
228 latmutated, parent number 24
229 latmutated, parent number 2
230 latmutated, parent number 24
231 latmutated, parent number 1
232 latmutated, parent number 23
233 kept as best, parent number 1
234 kept as best, parent number 22
235 kept as best, parent number 23
236 kept as best, parent number 24

which also doesn't keep the population fixed at 20, but doesn't suffer from the same population massacre that my run seems to have experienced. I realize the results are stochastic and shouldn't be exactly the same, but I'm just curious whether this is normal random behavior. With only 4-5 structures in a generation, I don't know if I'll have enough diversity to break out of the enthalpy minimum I'm currently in (judging from the much lower ground state found in the reference calculation).

Best,
Brad

Brad Malone

May 30, 2011, 10:55:55 PM
to USPEX
Also, I should note that after a while some of the calculations were failing with the familiar VASP error:

"Error reading item 'IMAGES' from file INCAR."

which occasionally happens when something goes awry on the supercomputer (this particular supercomputer's Lustre filesystem can be buggy, especially when you need it to work). Ahh, so maybe enough failures of this kind (> "maxErrors") eventually led to those structures being removed from the population? If so, I guess that makes sense.

Perhaps I'll see if I can try this same calculation on a different cluster. 

Brad