FAQ - Frequently Asked Questions


Steve Ludtke

Nov 29, 2011, 8:37:41 AM
to em...@googlegroups.com
This topic will contain answers to some frequently asked questions

Steve Ludtke

Nov 29, 2011, 8:39:15 AM
to em...@googlegroups.com
What sort of computer should I buy?

As of 11/29/2011:

A complete answer to your question depends a bit upon your budgetary constraints,
or lack thereof. As you are probably aware, at the 'high end', computers rapidly become more
expensive for marginal gains in performance. Generally speaking, we tend to build our own Linux boxes
in-house rather than purchasing prebuilt ones, both as a cost-saving measure and to ensure future
upgradability. That said, there is nothing wrong with most commercially available pre-built PCs as
long as you get the correct components. For a minimal cost-effective workstation, I would suggest:

CPU - a Sandy Bridge series processor; the quad-core Core i7-2600K is a good choice
    - If you can get one of the new 6-core versions, that would be 50% more performance
    - Note that Sandy Bridge significantly outperforms the previous generation, so going with a 6-core
      from the pre-Sandy Bridge series is not a great choice
    - If you can afford a dual-processor configuration with dual 6-core Xeons, you will presently have
      to go with the previous generation, as the Sandy Bridge Xeons won't be out for a while. This configuration
      (12 cores, last gen) is worthwhile, but expensive.

RAM - 3-6 GB/core is what I'd recommend for image processing
    - This depends a bit on the target application. For large viruses, you may wish to get more RAM/core
    - The performance benefit of high-speed RAM is rarely worth the cost. Get the fastest you can without
      breaking the bank

Disk - we would generally get something like four 2 TB drives for data/user files configured as software RAID 5,
    with a small (~100 GB) SSD as a boot drive; current Intel SSDs are good for this purpose.
    - Note that other than the very fastest SSDs, no drive can actually keep up with the
      latest SATA buses anyway, so going out of your way to get a superfast SATA drive is largely pointless

Video - Get an NVIDIA card, NOT ATI, particularly if you plan on doing stereo. This will also get you some
    CUDA capabilities. A reasonably high-end GeForce with a good amount of RAM is generally fine
    with some caveats below.

Stereo - This is a tricky and complicated issue. There are 2 main choices:
    - Active stereo
        - Requires a 120 Hz stereo-capable 1080p display AND, importantly, a Quadro series Nvidia graphics
          card (to do stereo in a window under Linux!). Note that you will have difficulty making most
          consumer '3D TVs' work with this setup, though some will. The most reliable option is to get
          a monitor designed for stereo use with Nvidia cards (Acer makes a decent 24"). Note that this also
          requires a dual-link DVI port.
    - Passive stereo
        - By FAR the easiest and cheapest option, which also allows multiple users with cheap passive glasses.
          It also does NOT require an expensive Quadro video card. Chimera and many other programs have built-in
          support for 'interleaved' stereo, which they implement without support from the underlying Nvidia driver,
          so you can do it even with cheap graphics cards. The only disadvantage is that you lose 50% of your vertical
          resolution; personally this doesn't bother me much. The other minor issue is that over the last couple
          of years these displays have been hard to find. LG has finally come out with one that is easy to purchase again,
          though I confess we haven't bought one of these new ones yet. Does not require dual-link DVI.

Monitor - Dual monitor setups can be very useful for image processing. If you can afford it, I would suggest a high-resolution
    30" primary display with a passive stereo secondary display. If you get an active stereo secondary display, you will
    need 2 dual-link DVI outputs on your graphics card.


Hope that helps. Note that this represents my personal opinion only. Your mileage may vary.

Steve Ludtke

Mar 20, 2012, 9:22:25 AM
to em...@googlegroups.com
3/20/2012 - Sandy Bridge Xeons are now available, and I've been getting questions again about which computer to get. Note that Macs are still using the earlier Westmere technology. Anyway, here's a quick analysis:

----------------------
Sandy Bridge Xeons (E5-2600 series) have finally become available, but aren't available in Macs yet. Certainly the Mac Pro loaded with 12 cores will give you the best available performance on a Mac right now. However, it is very far from the most cost-effective solution. So, it really depends on your budget and goals. Westmere still offers a decent price-performance ratio if you want dual CPUs. If you are happy with a single CPU, I'd say Core i5s are actually the way to go (this is what I just set up in my home PC).

Here is a rough comparison of 3 machines I use:
Linux - 12-core Xeon X5675 (3.07 GHz, Westmere): Speedtest = 4100/core -> ~50,000 total (2 CPUs, ~$2880 total)
Mac - 12-core Xeon (2.66 GHz): Speedtest = 3000/core -> ~36,000 total
Linux - 4-core i5-2500 (3.3 GHz + turbo): Speedtest = 6400/core (turbo), 5600/core (sustained) -> ~22,000 total (1 CPU, ~$210)

Now, they have just released the Sandy Bridge Xeons; for example, a dual 8-core system:
16-core E5-2690 (2.9 GHz): Speedtest (estimated) = 5650/core (turbo), 4950/core (sustained) -> ~80,000 total (2 CPUs, ~$4050)
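
The per-machine totals above are just the per-core Speedtest score multiplied by the number of cores (using the sustained numbers where both are given); a trivial check in Python:

scores=[("dual X5675, 12 cores",12,4100),
        ("Mac Westmere, 12 cores",12,3000),
        ("i5-2500, 4 cores",4,5600),
        ("dual E5-2690, 16 cores (estimated)",16,4950)]
for name,ncores,percore in scores:
  print name,"->",ncores*percore     # approximate whole-machine Speedtest score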

Now, the costs I gave above are just for the CPUs. If you wanted to build, for example, several of the Core i5 systems and use them in parallel, you'd need a motherboard, case, memory, etc. for each of them as well. A barebones Core i5 PC with 8 GB of RAM and a 2 TB drive would run you ~$650.

If you built a 16-core system around the E5-2690:
$4050 - CPUs
$600 - motherboard
$200 - case
$150 - power supply
$300 - 32 GB RAM
$500 - 4x 2 TB drives (equivalent)

So ~$5800 for the (almost) equivalent 16-core machine vs. $2600 for four of the 4-core i5 systems.

i.e. - you pay ~2x for the privilege of having it all integrated into a single box. Of course, that buys you a bit of flexibility as well, and saves you a lot of effort in configuration, running in parallel, etc. It also gives you 32 GB of RAM in one machine, which can be useful for dealing with large volume data, visualization, etc.

On the Mac side, a 12-core 2.93 GHz Westmere system with 2 GB/core of RAM -> $8000, and would give a Speedtest score of ~45,000. That is, ~40% more expensive and about half the speed of the single Linux box with the 16-core config, and 3x as expensive and about half the speed of the Core i5 solution.

Please keep in mind that this is just a quick estimate, and that actual prices can vary considerably, but as you can see, the decision you make will depend a lot on your goals and your budget.


Steve Ludtke

Apr 4, 2012, 12:43:57 PM
to em...@googlegroups.com
Q: What paper should I cite if I use EMAN2 for a reconstruction?


A: While we plan to publish something a bit more up to date documenting newer procedures in EMAN2, for now this is the main reference to cite:

Tang, G., Peng, L., Baldwin, P.R., Mann, D.S., Jiang, W., Rees, I. & Ludtke, S.J., 2007, EMAN2: an extensible image processing suite for electron microscopy, Journal of structural biology, 157(1), pp. 38-46.

Steve Ludtke

Apr 4, 2012, 3:43:49 PM
to em...@googlegroups.com
Q: I have a heterogeneous particle population. How can I separate the particles into groups, or produce multiple structures for this data in EMAN2?

A:
This is a question that's in need of a paper, not just a quick answer here, and it's one that I've probably answered 20 times in the past, so at the very least, it's worthy of a FAQ post.

If you have a discretely heterogeneous target, e.g. a ligand-binding or heteromultimeric complex which may or may not be 100% assembled, but whose components assemble fairly rigidly when they do, then you can probably produce a reliable separation between populations. In other words, if you split your particles into N groups, the particles within those groups should be highly structurally homogeneous, and the difference between the particles in the N different groups should be measurable, even if only weakly in an individual single-particle image.

If, on the other hand, you have particles exhibiting continuous heterogeneity of some sort, such as the ~20 Å motions that molecules like homodimeric fatty acid synthase undergo in solution, things become much murkier. It can become difficult or impossible to distinguish between subtle conformational changes and orientation changes. In such situations, there is often no alternative to an experimental methodology making use of tilt data (single particle tomography or RCT: http://blake.bcm.edu/emanwiki/EMAN2/Tutorials).

So, let's assume for the sake of argument that you're looking at a ligand binding experiment, where the ligand is either bound or unbound,  the ligand is smaller than the assembly it binds to, and upon binding it causes at most minimal changes to the structure of the assembly. Let's consider the different methods at our disposal to handle this:

I) 2-D classification

One approach to this problem is to perform reference-free 2-D classification of your particle data. The 2-D class-averages will generally exhibit the ligand binding much more clearly than individual particles would. You can then pick out all of the class-averages with apparent ligand binding, then extract the particles from these classes and put them into a subset of your original data. While this is an attractive approach at first, there are several problems with it:
  1. You need to identify the presence/absence of ligand binding manually in images in all different orientations. This may be challenging.
  2. While class-averages may look good, per-particle classification may not be. This caveat will apply to several other solutions as well. Consider a histogram that looks like this:

It is clearly possible to divide this data into two groups, and the mean of each group could be computed quite accurately. Classification would consist of drawing a line halfway between the peaks and putting everything on the left in one group and everything on the right in the other. However, if we overlay this with the underlying distribution of the two populations that produced this histogram:

you see clearly that drawing a line halfway between the two peaks will result in considerable misclassification of points. This is just an analogy; the situation in 2-D reference-free averaging is even worse, and you will tend to find that you can get absolutely beautiful looking class-averages which seem to represent liganded vs unliganded particles, but if you look at the particles that went into the average, you will find a considerable number of unliganded particles in the liganded average, and vice versa. The extent to which this problem exists depends on how strong the discriminatory signal provided by the ligand is in an individual particle.

With that warning in mind, it can still be quite useful to subdivide the particles into subpopulations. This is the basic method, and can also be applied to data heterogeneity/motion analysis in 2-D:

Note, there is also a discussion about related methods here: http://blake.bcm.edu/emanwiki/EMAN2/Programs/e2evalparticles

  1. Run e2refine2d.py on your particle data.
  2. Run e2evalparticles.py and select the class-averages image stack in the main window.
  3. Double-click on any class-averages which appear to be liganded. This will put a blue mark on each of these averages.
  4. Select "make new set". This will take all of the particles associated with the blue-marked class-averages and build a new particle set from them.
  5. Select "Invert", then "make new set" again. This will create a set containing all of the ostensibly unliganded particles.

II) 3-D classification using e2refinemulti.py

e2refinemulti.py takes almost exactly the same arguments as e2refine.py, but two of the options have been expanded: --model and --sym can take a comma-separated list of files (and corresponding symmetries).

This program (see: http://blake.bcm.edu/emanwiki/EMAN2/Programs/e2refinemulti) will do almost exactly the same thing as e2refine.py, but instead of making projections of a single model, it makes projections of N models, so when it classifies particles it is not only determining orientation, but also determining which model the particle is best associated with. After running iteratively for several rounds, the particle classification and the N models should gradually converge/stabilize. You can then take the particles associated with each model and turn them into a set, which can then be individually refined in 3-D. Note that the statistical argument made in the previous method holds for this method as well. One should not expect perfectly accurate per-particle classification, even if the two models look very good.  To separate the particles, use e2evalparticles.py just like in the first method. The documentation for e2evalparticles.py explains how this can be done easily for this specific use-case.
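
For example, a two-model run might look like the template below. This is only a sketch built from the option names mentioned above; the file names are placeholders, and every other option is whatever you would normally pass to e2refine.py:

e2refinemulti.py <your usual e2refine.py options> --model=conf1.hdf,conf2.hdf --sym=c1,c1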

You can find other caveats to this approach, particularly with continuously heterogeneous data, discussed in other questions on the Google Group.

III) There is a program designed specifically with the goal of separating liganded/unliganded particle populations. It is called e2classifyligand.py.

In theory, this program should be documented here: http://blake.bcm.edu/emanwiki/EMAN2/Programs/e2classifyligand
however, as of when I'm writing this message, the program is still considered highly experimental, and I haven't had time to properly document it. Since it does help answer the question of whether there is sufficient signal present in an individual particle to accurately classify it, I will discuss it a bit here.

This program contains two different algorithms. One is based on having two 3-D models, one liganded and one unliganded, for your particle, and using this as a basis for classification. The other method uses a mask-based approach. You define a 3-D binary mask covering the region of your map where the ligand is expected to be, and classification is based on the density present in regions defined by 2-D projections of this mask. In my experience, the two-reference approach works best.

Here is an overview of how this is done:
  1. run e2refine.py on your full data set. Generally you will use a large angular step (in the orientation generator) for this, since we aren't after high resolution. Ideally you'd like most class-averages to have at least 15-20 particles in them. Since you will have a mix of particles with and without ligand, the ligand should be present in the final structure, but with weak density. Exactly how weak the ligand is in the map depends on the fractional binding.
  2. use this result to generate two models: one with some density in the ligand position (even if not at 100%) and the other with no density in the ligand location. You could simply manually mask out the ligand and then use the original map and the masked map as your two references, or you could use these two maps as starting models for e2refinemulti.py and (hopefully) generate 2 better reference maps.
  3. Use e2classifyligand.py:
    • e2classifyligand.py <raw particle file> <class mx> <projections> --ref1=<liganded map> --ref2=<unliganded map> --cmp=<comparator specification> [--badgroup]
    • Here is an example of a full command: e2classifyligand.py bdb:sets#set-all_phase_flipped-hp bdb:refine_05#cls_result_04 bdb:refine_05#projections_04 --ref1=bdb:multi_01#threed_filt_04_00 --ref2=bdb:multi_01#threed_filt_04_01 --cmp=frc:snrweight=1:zeromask=1:maxres=15:minres=120 --badgroup -v 2
    • This command will produce a diagnostic output file, plot.ligand.txt, containing an x/y plot of sim1-sim2 vs sim1+sim2. That is, it plots the difference between how well the particle matched map 1 and map 2 (representing discrimination between the two maps) vs sim1+sim2, which should be representative of overall particle quality (a short plotting sketch follows this list). Ideally you would see clear separation on the y axis of this plot, but I wouldn't count on getting this unless you have a very large ligand.
    • The other output is 2 new sets, eg - bdb:set#set-all_phase_flipped-hp_ref1 and bdb:set#set-all_phase_flipped-hp_ref2
    • With the --badgroup option, you will get additional sets representing the ambiguous classification region.
  4. If you don't see clear separation between the groups on the sim1-sim2 axis, that tells you that there is probably a fair bit of classification ambiguity on a per-particle basis.
  5. Note that you have full control over the comparator used in the discrimination process, and this can have a MAJOR impact on results.
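
If you want a quick look at plot.ligand.txt outside the EMAN2 GUI, here is a minimal Python sketch. It assumes the file contains two whitespace-separated columns per particle; which column is sim1-sim2 and which is sim1+sim2 is an assumption here, so check against your own file:

import numpy as np
import matplotlib.pyplot as plt

data=np.loadtxt("plot.ligand.txt")    # one row per particle
x,y=data[:,0],data[:,1]               # assumed column order: sim1-sim2, sim1+sim2
plt.plot(x,y,".",markersize=2)
plt.xlabel("sim1 - sim2 (discrimination)")
plt.ylabel("sim1 + sim2 (overall quality)")
plt.show()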

IV) A fourth method is not yet fully implemented, but you should be aware that we are actively developing modules which will allow you to make calls to Relion and Xmipp to experiment with their classification routines directly from the EMAN2 GUI. We will announce these new capabilities here when they are considered stable.

Hope that helps. Feel free to post follow-up questions to the main Google group thread.

Steve Ludtke

Apr 10, 2012, 3:11:39 PM
to em...@googlegroups.com
Q: How can I get the transformation matrices for symmetry X?

A: I get this question a few times a year, so let me go ahead and post it in the FAQ.

You can get a list of the symmetries EMAN2 supports using
e2help.py symmetry -v 2

Once you have the name of the symmetry you want (D3 for example), this will print out the Transform objects for each element:

e2.py
s=Symmetry3D()
print s.get_symmetries("D3")

or, if you want to see the matrices, this will print them out on the display:

e2.py
s=Symmetry3D()
for x in s.get_symmetries("D3"):
  x.printme()
  print " "

Sorry, there is no command-line tool for this at the moment. You have to do it via the Python interface.
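
If you want the raw numbers in a file rather than the printme() output, here is a minimal sketch to run inside e2.py. It assumes Transform.get_matrix() returns the 12 entries of the 3x4 matrix in row order; check dir(Transform) or the e2help output if in doubt:

s=Symmetry3D()
out=open("d3_matrices.txt","w")
for i,xf in enumerate(s.get_symmetries("D3")):
  m=xf.get_matrix()                   # assumed: 12 values, row-major 3x4
  out.write("# element %d\n"%i)
  for r in range(3):
    out.write("%g %g %g %g\n"%tuple(m[r*4:r*4+4]))
out.close()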

Steve Ludtke

Aug 2, 2012, 9:18:52 AM
to em...@googlegroups.com
Q: I'm having difficulty understanding the syntax for some of the comparators available in EMAN2. What is the proper Python code for using the frc and phase comparators if I wish to alter variables such as snrweight, max resolution, min resolution, etc.? Thanks.

A:
It's fairly straightforward:

From the command-line, all modular functions (processors, comparators, averagers, etc.) are specified like this:

<name>:<parm>=<value>:<parm>=<value>

eg:

frc:zeromask=1:snrweight=1

In Python, these constructs are parsed using the parsemodopt() function from the EMAN2 module:

print parsemodopt("frc:zeromask=1:snrweight=1")

('frc', {'snrweight': 1, 'zeromask': 1})

As you can see, this returns a tuple with a string, and a dictionary containing the options. For comparators, this would be called as:

image1.cmp("frc",image2,{"snrweight":1,"zeromask":1})

ie - the normal code you'd see is:

cmpopt=parsemodopt(optionstring)
image1.cmp(cmpopt[0],image2,cmpopt[1])
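
To answer the specific question about resolution limits, here is the same pattern with the resolution-related options taken from the frc example elsewhere in this FAQ (file names and values are placeholders; the phase comparator is used the same way, and e2help.py cmp -v 2 lists exactly which parameters each comparator accepts):

from EMAN2 import *

img1=EMData("map_or_ptcl_1.hdf",0)      # placeholder file names
img2=EMData("map_or_ptcl_2.hdf",0)
name,params=parsemodopt("frc:snrweight=1:zeromask=1:maxres=10:minres=100")
print img1.cmp(name,img2,params)        # lower/higher is better depending on the comparator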

A list of all comparators and their options can be found several different ways, including:

e2help.py cmp -v 2

Steve Ludtke

Apr 23, 2013, 10:47:54 PM
to em...@googlegroups.com
I just realized that a fairly serious problem on the Windows platform was not well-described on the group. Windows 7 (and I presume later) likes to kill programs which are "unresponsive". Unlike earlier versions of Windows, it doesn't even ask, it just zaps them if they don't respond for more than a couple of seconds. If an EMAN2 GUI program gets killed like this, it's usually because it's in the middle of doing something that shouldn't be stopped. This has the unfortunate tendency to cause serious database corruption. It's important to reiterate that this is a Windows 7+ specific problem. It does not occur on Linux/Mac.  While we have done a lot already to reduce the cases where this problem will occur, really fixing it will involve massive code rewrites in parts of the GUI code, so it is not likely to be completely fixed any time soon. There are 2 solutions:

1) When an EMAN2 GUI program is busy doing something (spinny wheel cursor, etc., not long jobs running peacefully in the background), DON'T click on the window. Just wait.
or
2) This is experimental. No idea if it will really work. The following website purports to have a method to prevent this nasty Windows killer for specific programs. Try setting it up for EMAN2's Python.exe.
http://superuser.com/questions/311696/is-it-possible-to-prevent-windows-7-forcing-the-closure-of-an-unresponsive-progr

If you happen to run afoul of this bug and start getting database errors that persist after removing the cache in /temp, then searching for any browsercache.bdb files in your EMAN2 project and deleting them at the same time as removing the cache will often fix the problem without corrupting anything.
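
If you would rather not hunt for those files by hand, here is a minimal Python sketch that walks a directory tree and removes any browsercache.bdb files it finds (run it from the top of your EMAN2 project directory, with all EMAN2 programs closed; using the current directory is just an assumption for the example):

import os

for root,dirs,files in os.walk("."):    # "." = your EMAN2 project directory
  for f in files:
    if f=="browsercache.bdb":
      path=os.path.join(root,f)
      print "removing",path
      os.remove(path)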

Steve Ludtke

Nov 1, 2013, 9:45:29 AM
to em...@googlegroups.com
Q: I refined my structure in (EMAN1, BSOFT, SPIDER, ...) and got subnanometer resolution, but when I refine using the new "gold standard" in EMAN2.1, I'm only getting 15 Å. My structure looks right, and I think I see alpha helices, what's wrong with EMAN2.1 !?!?

A: I have gotten variations of this question from several different people recently. Not all were subnanometer, and not all cases were as significant as this, and most people were more tactful than I've been to myself above, but it's a very important question right now, so I have a pretty extensive answer:

{preach mode on} Note that there have been several controversial structures published recently in the cryo-EM community (one specific one will probably come to mind), and these are beginning to give the entire field a pretty negative public perception. Despite the fact that probably 90% of published structures are correct, people outside the field are beginning to question whether anything done by cryo-EM is trustworthy. While many other fields probably also need a similar 'housecleaning', in general I think the biological sciences need to do everything they can to verify the accuracy of their scientific conclusions before publishing, rather than just going for the splashy press release you think will help you get funded. {preach mode off}

When you have a structure which appears correct and can be reasonably interpreted at the resolution you believe you have achieved, it can be very disturbing when some new-fangled algorithm tells you there is something wrong. This could be the gold standard, tilt validation, or even MolProbity scores. In this rather lengthy post, I'm not going to explain the gold standard again, but rather discuss some specific reasons why your resolution may not behave as you expect.

In the original IP3R structure we solved ~12 years ago, we convinced ourselves we were correct, and by all of the measures in common use at that time, we were. However, it was later demonstrated (by ourselves, thankfully) to be complete rubbish. When we published the second (correct overall) structure of IP3R at ~10 Å resolution, we also had docked crystal structures and came up with biological stories that made excellent logical sense; however, when reproducing and validating, we showed conclusively that it was really only valid to a lower (17 Å) resolution, due to internal flexibility. While we self-corrected here, it was still very painful to effectively have to retract some portion of our earlier results. Don't get me wrong, I believe the vast majority of our published structures and interpretations have been fully correct and reasonable over the years, but there have been a few where we found that we had pushed too hard. In each case where I've gone back and checked, the new validation methods have correctly identified the previous cases which were correct and those which later proved to be overinterpretations. In many cases, the 'gold standard' in fact produced almost exactly the same resolution as we had originally published. In others, the old method produced a slight resolution exaggeration which would have had no impact on the interpretation of the map. In a few cases, the resolution exaggeration was significant. This means that the single particle structures published (by everyone, not just us) in the 2000s are highly variable: a majority really do have approximately the resolution they claim, whereas others are clearly overstated, in some cases significantly. I believe the tide has finally turned now, and it will become increasingly difficult to publish anything that does not have fairly robust validation and use a well-justified resolution criterion (as it should be). The fact that many people now have microscopes capable of producing easy subnanometer structures, and occasional sub-5 Å structures, even with these robust criteria, certainly helps :^)

Let's consider 4 scenarios:

1) Homogeneous data of similar quality, doubling particles: Say you collect 10,000 particles on a particular scope, then go back and collect another 10,000 on a grid of identical quality, with similar alignment and a similar range of defocuses. Regardless of the size of the particle, refining the first 10,000 particles and then refining the full set of 20,000 will produce only a modest resolution improvement. Resolution improvement with fixed data quality requires a worse-than-exponential increase in the number of particles. That is, doubling the number of particles in this situation will give some improvement, but it's going to be fairly marginal. If you have a data set where this behavior is not observed, i.e. doubling the particle count seems to provide a dramatic improvement in the structure, you should be VERY suspicious, and you need to figure out why (see case 3 for a possible explanation). The argument that there is a nonlinear process going on, and that the structure simply can't 'click in' correctly with fewer particles, is a difficult one to make. While this sort of thing may be possible, it would only happen with very small numbers of particles (tens or hundreds, not thousands).

2) Heterogeneity: Say you have 20,000 particles, but the structure is moving internally with ~15 Å motions. Further say that the power spectrum of this data extends well past 10 Å. If you refine such a data set with "old" methods, you will almost always achieve a structure which is assessed to be better than 10 Å resolution. However, if you repeat the refinement several times with different starting models, you will find that you may produce several different structures, each of which assesses at better than 10 Å resolution, BUT which only agree among themselves to ~15 Å. What is the correct interpretation in this situation? The "gold standard" says that you are only entitled to interpret such data at 15 Å. However, can you make an argument that there is some validity to each of these separate structures? Unfortunately this is on very shaky ground, since each of the structures ostensibly incorporates the same raw data. What you CAN do is subclassify the data, in which case you CAN achieve a "gold standard" resolution of better than 15 Å, since the data in each subset is more homogeneous. Doing this may require collecting larger data sets, of course.

In this situation halving the number of particles by subclassification actually IMPROVES the resolution of each structure. On the other hand, if you simply randomly split the data in half, each half will produce a structure with marginally worse resolution than the original set.

3) Data quality variations: As an extreme example, consider collecting 10,000 particles on an old 200 keV LaB6 scope, then collecting 10,000 more on a good 300 keV FEG scope. If you reconstruct the first 10,000 particles alone, then add the second 10,000, you will naturally find the second structure (with 20,000) has much better resolution than the first. If you then refine ONLY the second set of 10,000, i.e. completely exclude the lower quality data, you will generally find that the structure does not degrade at all. Due to the ~Gaussian falloff of the envelope function and the ~exponential falloff of the SSNR, using data of mixed quality is almost always pointless. That is, even if you have only 10,000 good particles and 50,000 'less good' particles, using only the 10,000 good particles will almost always produce a better structure than using the full set.

People REALLY hate to hear this one, and fight against it all the time. I cannot tell you how many times I've been asked by people how they can combine data collected on two different microscopes. Try to think like a crystallographer. You spend a lot of time screening your crystals on a tabletop machine in your lab. When you finally get that really good crystal, of course you're going to send it to the beamline, and it would be silly to try and merge your table-top data with the beamline data. Once you have good grids, image them on the best scope you can get access to. Data collected on a 'lesser' scope will be useless if you later use a better scope.

4) Preferred orientation/anisotropic resolution: Some particles with preferred orientation are harmless, meaning the orientation distribution still manages to fill Fourier space reasonably uniformly. However, others can produce strange effects. For example, if you look at ribosome structures in the EMDB, you will find that some (but not all!) of them appear to have a funny 'directional smearing' in their structures. This is due to a preferred orientation problem which sometimes occurs with ribosomes, where specific regions in Fourier space are less well sampled than others. That is, in one direction in Fourier space you may have 50,000 particles contributing to the structure, and in another direction there may be only 1000 particles contributing. In this situation, doubling the number of particles will have an unusual, largely visual, effect on the map. The directions that already had 50,000 particles contributing will often see almost no benefit from suddenly having 100,000, as you may have already reached the alignment-limited resolution of the data. However, the direction that had only 1000 particle contributions will now have 2000, and the improvement may be quite significant. That is, doubling the number of particles will have only a slight effect on the FSC-measured resolution, but the visual quality of the structure (due to filling in the 'bad' directions) may improve significantly.

Even with isotropic data, it's worth remembering that there is generally some limiting resolution for any given data set. That is, there is a resolution past which you cannot move regardless of how many particles you have acquired, due to limitations in alignment accuracy. However, even in these cases, adding more particles can still be worthwhile, as it will improve the quality of the reconstruction even if the resolution ceases to improve. That is, additional data will reduce the noise level in the map near the limiting resolution and improve interpretation of the structure.

---
If you believe that you have a subnanometer resolution structure, but the gold standard says you don't, the first test to try is:

Use exactly the same data, and exactly the same refinement command you used to produce the subnanometer resolution structure. Take your starting model, and phase randomize it beyond, say, 20 Å. This will preserve the quaternary structure, but scramble the high resolution details. In a well-behaved data set, the correct subnanometer structure will very quickly re-emerge, since the quaternary structure dominates the particle alignment (note that you are using the full data here, not 1/2 as in the gold standard). This is a test of model-bias, not resolution, and a variant of this was often performed before the concepts were merged into the "gold standard". The phase randomization can be achieved with:

e2proc3d.py old_model.hdf new_model.hdf --process filter.lowpass.randomphase:cutoff_freq=0.05

Repeat this process with at least 2 or 3 different starting models, then compute FSCs among the various refined results. If all of the refinements produce basically the same high resolution structure, then you can move forward with more confidence. (If you find this in a case where using 1/2 the data produces a substantially worse result, I would very much like to see it as a test case.) If, on the other hand, you find that you get different 'high resolution' structures out, then you have either a case of heterogeneity (which can be identified via improvement upon subclassification) or model bias (meaning the apparent high resolution is just incorrect).
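
A note on the cutoff above: cutoff_freq is given in 1/Å, so 0.05 corresponds to the 20 Å suggested earlier. To generate several independently randomized starting models, you can simply repeat the same command; a minimal sketch (file names are placeholders, and it assumes the random phases differ from run to run):

import os

for i in range(3):
  os.system("e2proc3d.py old_model.hdf rand_start_%d.hdf --process filter.lowpass.randomphase:cutoff_freq=0.05"%i)

Each rand_start_N.hdf then serves as the starting model for an otherwise identical refinement run.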

Anyway, hope this long discussion helps someone, and will help convince you that the solution to EMAN2.1 giving you a worse resolution value isn't to switch to some other program (even an older version of EMAN) which gives you a better looking number. Just keep in mind that even if the software doesn't do it automatically, you can do gold standard resolution tests in EMAN1, BSOFT, SPIDER, etc. manually without much effort.

cheers

Steve Ludtke

Dec 11, 2013, 5:50:01 PM
to em...@googlegroups.com
Most of this original FAQ post remains accurate, but for those of you using 2.1, let me update a couple of points, since this post is widely applicable:

- e2evalparticles is still broken in EMAN2.1. It will be fixed, but it's still broken right now.
- There is a command-line program called e2classextract.py which can also be used to extract subsets of particles associated with specific class averages. That one should work.
- In the new version of e2refinemulti, which is now much more similar to e2refine_easy, when the job completes, it will generate new 'sets' (LST files in EMAN2.1) representing the particles associated with each of the final refined maps. This is done automatically without any extra commands.

Steve Ludtke

Sep 17, 2015, 9:13:02 AM
to EMAN2
Q: Specifying options at the command line and relating these to options in Python for things like processors is confusing. Could you give an overview?
 
A:
From the command line, the core syntax is standard Unix, with the processors being the only EMAN2-defined form. Basic Unix syntax has 3 acceptable variants for command-line options:

Options with values:
--option=value
--option value
-o value

where -o is the 'short form' of the option, which only exists for some options. For any EMAN2 command, a list of options is available with 'command --help' or 'command -h'

Options without values:
--option
-o

Processors are specified with

name:option=value:option=value:...

So, here are equivalent ways of specifying the same operation:

e2proc2d.py single_image.hdf outfile.hdf --process=filter.lowpass.gauss:apix=2:cutoff_freq=.05

e2proc2d.py single_image.hdf outfile.hdf --process filter.lowpass.gauss:cutoff_freq=0.05:apix=2

(there is no short form for --process)

python
from EMAN2 import *                # needed in a plain python session (e2.py does this for you)
img=EMData("single_image.hdf",0)
img.process_inplace("filter.lowpass.gauss",{"cutoff_freq":0.05,"apix":2})
img.write_image("outfile.hdf",0)


In Python, quotation marks define strings, so "0.05" and 0.05 are not equivalent. From the command line, however, quotation marks are used to protect special characters from the shell, e.g.:

command test 1 2 3
has 4 parameters: "test", "1", "2" and "3", whereas:
command "test 1 2 3"
has a single parameter: "test 1 2 3"

At the Unix command line, all parameters are passed to the program as strings; the program then has to decide which ones to keep as strings and which to convert to integers or other data types.
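
A minimal illustration of that point in plain Python (a hypothetical script, not part of EMAN2):

import sys

print sys.argv          # run as: python myscript.py test 1 2 3 -> ['myscript.py','test','1','2','3'], all strings
n=int(sys.argv[2])      # the program itself converts '1' to the integer 1
print n,type(n)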

On the Python side, everything follows standard Python syntax for dictionaries, etc.; there isn't really anything in the syntax determined by EMAN. Processors and other modular functions are specified by name, with a dictionary { } to hold any associated parameters. Dictionaries in Python use {"key":value,"key":value,...}. The keys are always strings, and the values are of various types, as described by: e2help.py processors -v 2

Steve Ludtke

Sep 14, 2016, 7:25:59 PM
to EMAN2
I just prepared this for someone else, and thought it might be worth posting here. We have made a LOT of changes to EMAN2 over the last year, and can routinely match or beat Relion in terms of resolution, and visual appearance is also now comparable thanks to a discovery Pawel made about how Relion was actually filtering maps during post-processing. This is a quick summary of the current best-practices for refinement in EMAN:

Current best-practice for a high resolution EMAN2 refinement:

1) Movie processing: this could be EMAN2's e2ddd_movie (which has been rewritten and is at least as good as the other programs in common use) or another tool.
2) Import micrographs with CTF estimation, and, if you think you will need it, astigmatism. Defocus and astigmatism values computed on micrographs will be retained. You are welcome to use another tool and import the results if you like.
3) Box and extract particles; the box size should be close to 1.5x the longest particle axis, NOT smaller. Make sure you use a "good size" from the Wiki.
4) Run e2ctf_auto.py with the high resolution checkbox.
5) If you don't have a good initial model and need to make one:
 - use the most downsampled particles produced by e2ctf_auto, and optionally make a 'set' using ~10,000 particles from the best images (by SSNR)
 - generate class-averages
 - generate an initial model
 - do a quick refinement using the downsampled data to "clean up" the initial model
6) Run a refinement with the full particle set, but using one of the downsampled versions of the data (for speed), at speed 5, with a corresponding resolution target. You may get better bad-particle discrimination if you target ~8 Å, but you can do a pretty good job even at ~14 Å.
7) Run e2evalrefine to separate the good from the bad particles (this is described in the current version of the tutorial).
8) Run e2refine_easy using the _fullres particles, only using the good subset from step 7. The target resolution can be close to the final target resolution, but use speed 5. Use the "tophat" option as well (a rough command template is sketched after this list).
9) Once you have the best speed 5 structure you can get, run one more refinement with speed=1 for 1-3 iterations. Again, use tophat and only the good particles. This final step will take a pretty long time compared to step 8, and in most cases will only very slightly improve the structure.
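
For reference, the step 8/9 runs might look roughly like the templates below. The option names (--speed, --tophat, --targetres, --iter) are inferred from the description above rather than copied from the program, so treat this purely as a sketch and check e2refine_easy.py --help for the exact spelling and allowed values:

e2refine_easy.py <usual options, good-particle _fullres set> --targetres=<near final target> --speed=5 --tophat=<mode>
e2refine_easy.py <same options, starting from the best speed-5 map> --speed=1 --iter=<1-3> --tophat=<mode>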
