Splitting a JSON file by score or coverage or other attributes


digvij...@gmail.com

May 25, 2021, 7:12:56 PM
to EMAN2
Hello,

I would like to split a JSON file into two based on score, i.e. the top 70% of particles (by score) into one JSON file and the remaining 30% into another.

The reason is simple: while I want to use the top 70% of the particles for subsequent analysis, I would also like to revisit the worst 30%, 20%, or 10% of particles to try to diagnose issues with them.

The --keep option in e2spt_extract.py is a good way to extract the top n% of particles. But I was hoping to have a separate list or record of the remainder so I can troubleshoot or check them separately.

One way could be to parse the JSON file. E.g., I could collect all the top-level keys whose entry contains a key named score with a value greater than a threshold, and then write these entries to a new JSON file. There are many tools/editors for this in both Python and MATLAB, but I thought I would run this by you to see if there is an even simpler way, or if there is anything to keep in mind when attempting it via the JSON-editing route.
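
The parse-and-filter idea can be sketched with Python's standard json module alone. This is only an illustration: it assumes the file is a single top-level JSON object mapping particle keys to records that each carry a numeric "score" field, and the function name filter_json_by_score and its arguments are hypothetical, not part of EMAN2.

```python
import json

def filter_json_by_score(in_path, out_path, threshold):
    """Copy entries whose "score" exceeds threshold into a new JSON file.

    Hypothetical helper: assumes the input is one JSON object of the form
    {particle_key: {"score": number, ...}, ...}.
    Returns the number of entries written.
    """
    with open(in_path) as f:
        records = json.load(f)

    # Keep only the entries scoring above the threshold
    kept = {k: v for k, v in records.items() if v["score"] > threshold}

    with open(out_path, "w") as f:
        json.dump(kept, f, indent=1)
    return len(kept)
```

Everything except the score comparison is copied through unchanged, so other per-particle fields should survive the round trip as long as they are plain JSON values.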

Another option, if it is easy for you, would be to add a flag/option to e2spt_extract.py that allows extracting particles within a given score range. E.g., --keeprange 0.2,0.4 to re-extract particles that fall between the 20th (0.2) and 40th (0.4) percentiles.
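
To make the range idea concrete, here is a rough sketch of what such a selection might compute. No --keeprange option is known to exist; keys_in_rank_range is a hypothetical helper, and ranking by ascending score is an assumption that would need flipping if the score convention runs the other way.

```python
def keys_in_rank_range(scores, lo, hi):
    """Return the keys whose score rank falls in the fractional range [lo, hi).

    scores: mapping of particle key -> numeric score.
    lo, hi: fractions between 0 and 1, e.g. 0.2 and 0.4.
    Keys are ranked by ascending score (an assumption, see the text above).
    """
    if not 0.0 <= lo <= hi <= 1.0:
        raise ValueError("expected 0 <= lo <= hi <= 1")

    # Sort keys from lowest to highest score
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)

    # Convert the fractions to list positions and slice out that band
    return ranked[round(lo * n):round(hi * n)]
```

For example, with ten particles, keys_in_rank_range(scores, 0.2, 0.4) returns the two keys ranked third and fourth from the bottom.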

Just something to play around with the worst-score particles separately!

Thanks and cheers,
Digvijay

Ludtke, Steven J.

May 25, 2021, 7:24:45 PM
to em...@googlegroups.com
Hi Digvijay,
I don't think e2procjson.py has an option to do this. Also, we are moving towards standardizing all per-particle metadata like this into .lst files in future. This change has already been made for the _new programs, but we haven't worked our way back to the older programs yet. There is a script in examples/ which can interconvert if/when that becomes an issue.

Anyway, doing this in Python, using EMAN2 tools is pretty trivial. JSON files appear as dictionaries in EMAN2, so copying a portion of one JSON file to another is as easy as copying a portion of a dictionary to another:

from EMAN2 import *

jsin=js_open_dict("input_file.json")
jsout=js_open_dict("output_file.json")

for k in jsin.keys():
    if jsin[k]["score"]>0.03: jsout[k]=jsin[k]



--------------------------------------------------------------------------------------
Steven Ludtke, Ph.D. <slu...@bcm.edu>                      Baylor College of Medicine 
Charles C. Bell Jr., Professor of Structural Biology
Dept. of Biochemistry and Molecular Biology                      (www.bcm.edu/biochem)
Academic Director, CryoEM Core                                        (cryoem.bcm.edu)
Co-Director CIBR Center                                    (www.bcm.edu/research/cibr)





digvij...@gmail.com

May 25, 2021, 9:19:14 PM
to EMAN2
Thank you, Steve. 

This worked like a charm.

I added a simple routine to filter particles by percentile of score, coverage, etc.

import numpy as np

from EMAN2 import *

jsin=js_open_dict("input_file.json")
jsout=js_open_dict("output_file.json")

# Collect the score of every particle
score_list=[]
for i in jsin.keys():
    score_list.append(jsin[i]["score"])

# 80% of the particles score below this value
Threshold=np.percentile(score_list,80)

# Copy only the particles scoring above the threshold
for k in jsin.keys():
    if jsin[k]["score"]>Threshold: jsout[k]=jsin[k]
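
Since the original goal was two files (the particles to keep and the remainder), the same idea can be extended to produce both halves in one pass. The sketch below works on any dict-like object and does not import EMAN2; split_by_fraction is a hypothetical helper, and it assumes a higher score is better, matching the ">" comparison in the script above. Reverse the sort if the convention is the opposite.

```python
def split_by_fraction(records, keep_fraction, field="score"):
    """Split records into (kept, rest) by rank on records[key][field].

    records: dict-like mapping of particle key -> per-particle dict.
    keep_fraction: fraction of particles to place in kept, e.g. 0.7.
    Assumes higher values of field are better (an assumption).
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")

    # Rank keys from best (highest score) to worst
    ranked = sorted(records.keys(), key=lambda k: records[k][field], reverse=True)
    n_keep = round(keep_fraction * len(ranked))

    kept = {k: records[k] for k in ranked[:n_keep]}
    rest = {k: records[k] for k in ranked[n_keep:]}
    return kept, rest
```

Each half could then presumably be written out with the same kind of loop as above, one js_open_dict output per half, though that combination is untested here.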