Protein Categorization

Cory J. Geesaman

Dec 7, 2017, 10:10:01 PM

to DIYbio

Does anyone know of a set of categorizations for proteins that provides good coverage of all the observed structural conformations?
Examples of the parameters I'm referring to would be pH, salinity, temperature, and possibly reducing agent concentration.

The purpose is to try to classify the different stable folded forms of a protein for each set of possible changes. So if, for instance, there's a hard cutoff at a pH of 3 below which nothing further changes, or above 100 C, or above 90% salt saturation, or above a 2% solution of reducing agent (a specific reducing agent or in general), etc., that would be helpful to know. Additionally, if there is something of a gradient, that would be helpful too (e.g. will a protein fold differently at a pH of 6, 7, and 8, and if so, are there known transition points?).

Bryan Jones

Dec 7, 2017, 10:27:55 PM

to diy...@googlegroups.com

Are you asking about some particular protein? I don't think you could specify any general parameters as the limit for all proteins. Some proteins are much more tolerant of extreme conditions than others. Proteins are evolved for particular conditions, but those conditions are different for different proteins, and some proteins, by chance, happen to be even more tolerant than they were selected to be. Some proteins, especially those from extremophile organisms, show surprising tolerance for pH extremes, temperature extremes, or high salinity. That being said, if you want to find conditions that are almost guaranteed to completely unfold almost all proteins, look at the conditions usually used to prep samples for running protein gels: a bunch of soap (SDS) and reducing agent (beta-mercaptoethanol or dithiothreitol), heated to almost boiling.

Skyler Gordon

Dec 7, 2017, 11:16:01 PM

to diy...@googlegroups.com

I’d suggest looking at things like isoelectric points (the pH at which a peptide carries no net charge), and other properties related to charge, hydrophobicity, etc.

Chemistry is your friend here, not necessarily biochemistry or protein processing.
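
As a rough illustration of the isoelectric-point idea, here is a minimal Python sketch that estimates the net charge of an unfolded peptide from the Henderson-Hasselbalch equation and finds the pH where it crosses zero by bisection. The pKa values are approximate textbook-style assumptions (published tables differ by a few tenths of a pH unit), and the function names are made up for this example. It ignores folding entirely, so treat it as a first-pass estimate only.

```python
# Approximate pKa values: ASSUMPTIONS for illustration, not an authoritative table.
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0, "Nterm": 9.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "Cterm": 2.0}


def net_charge(sequence, ph):
    """Henderson-Hasselbalch estimate of net charge for an unfolded peptide."""
    charge = 0.0
    # Every peptide has one free N-terminus and one free C-terminus.
    groups = list(sequence.upper()) + ["Nterm", "Cterm"]
    for group in groups:
        if group in PKA_POSITIVE:
            # Fraction of basic groups that are protonated (+1 each).
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[group]))
        elif group in PKA_NEGATIVE:
            # Fraction of acidic groups that are deprotonated (-1 each).
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[group] - ph))
    return charge


def isoelectric_point(sequence, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection search; works because net charge only decreases as pH rises."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(sequence, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Calling isoelectric_point("DDDD") and isoelectric_point("KKKK") should give an acidic and a basic value respectively, which is the kind of sequence-level property that can be computed without any wet-lab work.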

-SG

--
-- You received this message because you are subscribed to the Google Groups DIYbio group. To post to this group, send email to diy...@googlegroups.com. To unsubscribe from this group, send email to diybio+un...@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/diybio?hl=en
Learn more at www.diybio.org
---
You received this message because you are subscribed to the Google Groups "DIYbio" group.
To unsubscribe from this group and stop receiving emails from it, send an email to diybio+un...@googlegroups.com.
To post to this group, send email to diy...@googlegroups.com.
Visit this group at https://groups.google.com/group/diybio.
To view this discussion on the web visit https://groups.google.com/d/msgid/diybio/CAKw3Q73yUfn6DymT4SW4-kaTpjpVP9s9RhZ-j871CxtH%3Dj5pfg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Cory J. Geesaman

Dec 8, 2017, 12:04:27 AM

to DIYbio

I'm aiming for more of a blanket id system, of the sort:
{
sequence,
chunk,
temp,
pH,
salinity,
reducingAgentConcentration
}
where sequence is the actual sequence of amino acids, chunk would be a pointer to something like the clustering functionality of MSM Builder, and the others are integers or single/double-precision floating-point numbers mapping to all the critical regions of each attribute as compactly as possible (i.e. if a linear or logarithmic pH scale with 4 bits is plenty to encompass all the possible folds given the other parameters, then I don't want to use a 64-bit floating-point number, as I would like to be able to actually simulate every combination for a given protein, or more accurately a polypeptide, given reasonable computational limits with current technology).
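
To make the record above concrete, here is a minimal Python sketch of how the non-sequence fields could be packed into one integer. The bit widths in FIELD_BITS are placeholders chosen only for illustration (the right widths are exactly the open question), and the names ConditionCode, pack, and unpack are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical field widths in bits. These are ASSUMPTIONS, not known-good values.
FIELD_BITS = {"chunk": 16, "temp": 6, "ph": 6, "salinity": 5, "reducing": 4}


@dataclass(frozen=True)
class ConditionCode:
    chunk: int      # index into a clustering of starting conformations
    temp: int       # quantized temperature code
    ph: int         # quantized pH code
    salinity: int   # quantized salinity code
    reducing: int   # quantized reducing-agent concentration code


def pack(code):
    """Concatenate the fields, most significant first, into one integer."""
    value = 0
    for name, bits in FIELD_BITS.items():
        field = getattr(code, name)
        if not 0 <= field < (1 << bits):
            raise ValueError(f"{name}={field} does not fit in {bits} bits")
        value = (value << bits) | field
    return value


def unpack(value):
    """Inverse of pack: peel fields off starting from the least significant."""
    fields = {}
    for name, bits in reversed(list(FIELD_BITS.items())):
        fields[name] = value & ((1 << bits) - 1)
        value >>= bits
    return ConditionCode(**fields)


def protein_id(sequence, code):
    """Full identifier: the amino-acid sequence plus the packed condition code."""
    return (sequence, pack(code))
```

With these placeholder widths the condition code takes 37 bits. Changing a width only requires editing FIELD_BITS, which keeps the scale of each dimension as the single thing left to decide.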

Skyler Gordon

Dec 8, 2017, 2:32:45 AM

to diy...@googlegroups.com

You would have to figure out some sort of Principal Component Analysis (PCA) algorithm to determine what you are most concerned about when it comes to how things ‘differ’. It has a lot to do with compressing thousands of variables into a few dimensions (ideally one or two) so you can graph it and people will actually understand it in a visual sense.
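
For anyone unfamiliar with PCA, here is a minimal pure-Python sketch of its first step: centre the data, build the covariance matrix, and find the direction of greatest variance by power iteration. The function name and the toy data are invented for this example; real work would more likely use an established numerical library.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the leading principal component.

    rows: list of equal-length numeric sequences, one per observation.
    Uses power iteration on the sample covariance matrix. Caveat: the fixed
    starting vector can fail if it is exactly orthogonal to the answer.
    """
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two observations")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance at all; keep the current direction
        v = [x / norm for x in w]
    # Project each centred observation onto the direction: one number per row.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores
```

Each observation is reduced to a single score along the dominant direction, which is the "compress many variables so you can graph them" step described above.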

pH is very important to proteins, but remember that’s sort of because hydrogen ions are positively charged. Proteins need to have specific amino acids protonated or deprotonated in order to properly associate with the nearest amino acids in the sequence, as well as amino acids further away in the sequence. Other positively charged ions like Calcium and Magnesium can also interfere.

Don’t forget: adding different salts, sugars, and peptides to an aqueous solution will change the way things dissolve, for lack of a better way of saying it. The hydration shells start to change, as does the viscosity, and the way HYDROPHOBIC MOLECULES FOLD is very complex.

If you figure this out, every company in the world that designs recombinant proteins will just throw money at you.

Again. Best of luck.

-SG

Skyler Gordon

Dec 8, 2017, 2:35:43 AM

to diy...@googlegroups.com

P.S. Buy a physics and math text. One that focuses on micro-fluidics. You’re gonna need it.

-SG

Cory J. Geesaman

Dec 8, 2017, 9:41:50 AM

to DIYbio

It will be open source if I crack it (though not without a revenue stream; I don't get why companies designing recombinant proteins specifically would want it). Any idea of the pH scale required to get good coverage? E.g. 0-15, 0-31, 0-63, 0-127, 0-255 (step sizes of 14/16, 14/32, 14/64, 14/128, 14/256 respectively). I may be looking at this the wrong way too, since pH, salinity, dissolved sugars, peptides, reducing agents, etc. are just how we measure modifications to different dimensions. Do you know the comprehensive but concise list of underlying dimensions I should be looking at? If the problem can be narrowed down to dimensions + scales for the dimensions, then it's pretty straightforward to get it into a single number identifying every combination thereof; the scale of each dimension is really the critical component.
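
The candidate pH scales can be tabulated with a few lines of Python. This sketch assumes the conventional 0-14 pH span used in the post and equal-width bins; the function names are hypothetical.

```python
PH_SPAN = 14.0  # conventional 0-14 range, as assumed in the post


def step_size(bits, span=PH_SPAN):
    """Width of one code when span is divided into 2**bits equal bins."""
    return span / (1 << bits)


def bits_needed(resolution, span=PH_SPAN):
    """Smallest bit count whose step size is no larger than resolution."""
    bits = 0
    while step_size(bits, span) > resolution:
        bits += 1
    return bits


# Print the step size for each of the scales listed above (4 to 8 bits).
for bits in range(4, 9):
    top_code = (1 << bits) - 1
    print(f"{bits} bits: codes 0-{top_code}, step = {step_size(bits):.5f} pH units")
```

If someone can say "conformations are known to change over about 0.1 pH units", bits_needed(0.1) turns that piece of experience directly into a field width.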

John Griessen

Dec 8, 2017, 10:12:16 AM

to diy...@googlegroups.com

On 12/08/2017 08:41 AM, Cory J. Geesaman wrote:
> I don't get why companies designing recombinant proteins specifically would want it,) any idea of the pH scale required to get
> good coverage?

Because once you knew such data points for a range of proteins you would shorten the trial and error of designing
proteins with similar conditions. You'd start to be able to treat proteins like a key to a lock, and know which detent position
on the key you were working with. That would be input to synthesis.

Getting the data on folding needs some kind of observation of folding states though. Visually observing it probably is not very
easy, so not very useful. Some kind of test, which is why Skyler Gordon was saying get knowledge on microfluidic physics for
corralling molecules, would be a help for categorizing. I can only see it being done by an automaton that moves protein molecules
through microscopic chambers of conditions that fold and unfold and also with some detector of folded state. Then you could relate
the basic conditions to how the folding happens in presence of other molecules -- the normal way to catalyze reactions, but since you
will get observations 2 ways you'll have a Rosetta stone to decode more...

Cory J. Geesaman

Dec 8, 2017, 10:36:06 AM

to DIYbio

I may have worded this poorly; I'm currently just looking for the core dimensions and their scale + span. The NP-hard folding problem still remains, and even the NP-hard chunking (or clustering) problem for actual simulation in silico still remains. The issue I'm trying to resolve at the moment is purely one of addressing, i.e. a unique identifier with a structure similar to { sequence, chunk, pH, salinity, etc } that can cover all, or at least nearly all, of what we've observed in as few bits as possible. I know the dimensions involved are based on the structure (sequence and chunking, though chunking is largely covered via the use of MSM Builder), but the possible conformations of importance beyond that are finite yet definitely spread across multiple dimensions (e.g. any temperature over 1,000 C might as well be considered the same, and 0.0000001 C of difference between the conditions while folding a protein is almost certainly not going to yield anything different from 0.000001 C of difference). How finite those dimensional spans and scales can be is the crux of this problem. I get what you and Skyler are saying about microfluidics for testing, but I'm trying to derive this from known data (way too many ongoing projects to try to engineer a microfluidic chip and then spend decades and millions, if not billions, of dollars running enough proteins through it to get an exact value), which is why I'm aiming for knowledge pulled from people who have more experience with this than I do. If it hits 99.9% of protein conformations for a given protein within a bioengineering application, it will pass the good-enough test in my opinion.
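
To show why "as few bits as possible" matters, here is a small Python sketch of how the number of condition combinations grows with the bit widths. The widths and the one-hour-per-simulation figure in the usage note are made-up placeholders, and the function names are hypothetical.

```python
import math


def condition_combinations(bits_per_dimension):
    """Number of distinct condition codes: the product of 2**bits per dimension."""
    return math.prod(1 << bits for bits in bits_per_dimension.values())


def total_cpu_years(bits_per_dimension, chunks, hours_per_simulation):
    """CPU-years needed to simulate every (chunk, condition) pair once."""
    runs = condition_combinations(bits_per_dimension) * chunks
    return runs * hours_per_simulation / (24 * 365)
```

For example, widths of 6, 6, 5, and 4 bits give 2**21 (about two million) condition codes per chunk, so at a hypothetical one hour per simulation a single chunk already costs a couple of hundred CPU-years. Every bit removed from any dimension halves that figure.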

John Griessen

Dec 8, 2017, 11:20:53 AM

to diy...@googlegroups.com

On 12/08/2017 09:36 AM, Cory J. Geesaman wrote:
> I may have worded this poorly, I'm currently just looking for the core dimensions and their scale+span.

So you mean molecule dimensions in aggregate, like mol. weight vs. volume? and how far that ranges over the conditions?

> If it hits 99.9%
> of protein conformations for a given protein within a bioengineering application it will pass the good enough test in my opinion.

--
John Griessen
industromatic.com Austin TX building lab gear for biologists

Patrik D'haeseleer

Dec 8, 2017, 12:02:49 PM

to DIYbio

The closest I've been able to find (at least as far as I understand what you are asking for) is PSCDB - the Protein Structural Change DataBase, but that one seems heavily focused on structural changes induced by ligand binding, not pH or temperature changes etc.

There are likely also some databases out there with information on what the physiological range is for enzymes or other proteins. But it may not have any info on exactly why they lose their function beyond those parameters - whether it's due to a distinct structural change, or they start to denature, or undergo some other non-structural changes that inhibit their function.

What I am fairly certain you will NOT find is a database with a large number of proteins documenting in detail how their structure changes with environmental parameters. There just aren't that many proteins that have been documented to switch between distinct structural conformations. And for those proteins we do know about, the characterization of the parameter space will typically be very coarse, e.g. something like "this protein has conformation X at pH 7, but conformation Y at pH 5 in the presence of Ca2+".

Patrik

Skyler Gordon

Dec 8, 2017, 12:19:49 PM

to diy...@googlegroups.com

Listen,

The cell doesn’t just change the pH (which isn’t in an integer scale, it has infinite range to it - and on a logarithmic axis), and the protein magically folds correctly. There are hundreds of different cell pathways for processing proteins to get them into a specific shape before being “shuttled” to the proper place.

This is a subject that people spend their entire lives studying, and that is just for a single protein much less every peptide combination possible.

Saying one sequence will fold like another because they are similar polypeptides in similar conditions is like saying two people with similar traits will act the same under the same circumstances or that two apple trees will taste the same since you planted the seeds in the same soil. It’s possible, but really unlikely. Just a single difference in CELL PROTEIN PROCESSING will change everything and your program goes out the window.

Now you’re talking about having users input their specific cell pathways to try to determine exactly what happens as a sum of those pathways and then GUESSING how all those variables are weighted.

Remember, even if you think you got it right you’ll have to use X-ray crystallography to find out if you’re right. Old school, and classic, but also very very difficult and time consuming.

-SG

Cory J. Geesaman

Dec 8, 2017, 1:30:38 PM

to DIYbio

I'm not talking about folding proteins based on a magic ID; I'm trying to classify known folds with unique IDs. A logarithmic scale can be defined with integers easily (e.g. 14/64 gets you to a precision of 0.21875 on a linear scale alone with only 6 bits, so for every 0.21875 of pH difference you would be able to have a distinct protein defined with the same sequence and chunk). Its being a logarithmic scale is essentially half of the question I was asking relative to pH. The other half is what step is needed for the scaling (e.g. do I need 6 bits or 8 bits to define it at the required precision?).
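
A minimal Python sketch of that integer mapping, assuming a 0-14 span and equal-width bins (the function names are hypothetical). Because pH is already the negative base-10 logarithm of hydrogen ion concentration, equal steps in the integer code correspond to a constant multiplicative factor in concentration.

```python
def ph_to_code(ph, bits=6, span=14.0):
    """Map a pH value to an integer code in [0, 2**bits - 1], clamping at the ends."""
    levels = 1 << bits
    clamped = min(max(ph, 0.0), span)
    return min(int(clamped / span * levels), levels - 1)


def code_to_ph(code, bits=6, span=14.0):
    """Representative pH for a code: the centre of its bin."""
    step = span / (1 << bits)
    return (code + 0.5) * step


def hydrogen_ion_ratio_per_step(bits=6, span=14.0):
    """Factor by which [H+] changes between adjacent codes."""
    return 10 ** (span / (1 << bits))
```

With 6 bits, pH 7.0 and pH 7.1 land in the same code, while with 8 bits they are distinct. So if two conformations are known to differ between those two pH values, 6 bits would not be enough.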

The chaperone proteins and such I'm not so concerned about because of the chunking parameters. I already know how to get that into a single integer for any protein with a high degree of stability in the conformation. E.g. if you have 2 amino acids in a polypeptide you might start with two rotamers, which would be chunked out as 0 or 1. You don't need to know what the most thermodynamically stable conformation is as derived by the folding process, or even which one is hit by the various chaperones; you only need to know the chunking step size and ambient dimensional parameters such as pH, salinity, dissolved sugars, etc. to say "from this starting conformation the most thermodynamically stable state is x, and you shouldn't expect to see the next most thermodynamically stable state until the next chunk signifying the next starting conformation given the same ambient dimensional parameters."
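
One simple way to get a set of discrete starting states into a single integer is a mixed-radix encoding, sketched below in Python. This is only an illustration of the "0 or 1" idea generalized to more states per position. It is an assumption for this example, not how MSM Builder's clustering assigns indices, and the function names are hypothetical.

```python
def encode_chunk(states, counts):
    """Mixed-radix encoding of per-position states into one integer.

    states[i] is the chosen state at position i; counts[i] is how many
    states that position allows, so 0 <= states[i] < counts[i].
    """
    if len(states) != len(counts):
        raise ValueError("states and counts must have the same length")
    index = 0
    for state, count in zip(states, counts):
        if not 0 <= state < count:
            raise ValueError(f"state {state} out of range for {count} options")
        index = index * count + state
    return index


def decode_chunk(index, counts):
    """Inverse of encode_chunk: recover the per-position states."""
    states = []
    for count in reversed(counts):
        index, state = divmod(index, count)
        states.append(state)
    return list(reversed(states))
```

With two positions of two states each, the codes run from 0 to 3. The integer needs only as many bits as the product of the per-position counts requires.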

The knowns are:
-folding
-chunking
-sequence

The unknowns are:
-dimensions (pH, sugar, salinity, etc)
-step sizes and scales of those dimensions

The objective here is to get a concise list of dimensions (even if they are completely different from "pH, salinity, dissolved sugar..." and are in the realm of "ambient electronegativity, ambient inclusions with their electronegativity, size, dispersion pattern or fractal geometry...") Knowing the distinct id is the only aspect involved in this question, the folded conformation is data tied to that id, and while initially derived from it, not expected to be a set of generation routines where you enter the id and have it spit out the completely folded protein with any degree of accuracy - just a form of classification.

Cory J. Geesaman

Dec 8, 2017, 1:35:45 PM

to DIYbio

Kind of both, look at it as trying to come up with a globally unique identifier for proteins in various conditions without the risk of collision beyond something really low like 0.0001%. I'm not aiming for perfection, if on average the same exact folded protein is derived from 500 GUIDs after months of intensive number crunching, meaning 500 duplicates on average, then that's fine, the objective is to limit the number of GUIDs a protein may cover to the minimum practical without collisions (e.g. 1 GUID being able to identify two different thermodynamically stable configurations should not happen more than 0.0001% of the time, to reuse the same arbitrary percentage.)
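
The two quantities in that trade-off can be measured from any table of (GUID, conformation) pairs. Here is a minimal Python sketch; the data format and function names are assumptions made for this example.

```python
from collections import defaultdict


def collision_rate(observations):
    """Fraction of GUIDs that map to more than one distinct conformation.

    observations: iterable of (guid, conformation_label) pairs.
    """
    conformations_by_guid = defaultdict(set)
    for guid, conformation in observations:
        conformations_by_guid[guid].add(conformation)
    if not conformations_by_guid:
        return 0.0
    colliding = sum(1 for c in conformations_by_guid.values() if len(c) > 1)
    return colliding / len(conformations_by_guid)


def duplicates_per_conformation(observations):
    """Average number of distinct GUIDs that resolve to the same conformation."""
    guids_by_conformation = defaultdict(set)
    for guid, conformation in observations:
        guids_by_conformation[conformation].add(guid)
    if not guids_by_conformation:
        return 0.0
    total = sum(len(g) for g in guids_by_conformation.values())
    return total / len(guids_by_conformation)
```

In these terms the goal is a collision_rate below about 0.000001 (0.0001%), while a duplicates_per_conformation of around 500 would still be acceptable.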

Cory J. Geesaman

Dec 8, 2017, 1:40:21 PM

to DIYbio

The datasets available are a definite pain from a bioinformatics standpoint; that's why I was aiming for a fuzzy-logic "what's your take" post on here. This strikes me as the kind of knowledge people don't bother to write down in the format needed, but would develop somewhat of a knack/feel for after many years of study on mostly unrelated topics which loosely depend upon it, because they happen to remember something that changed at a pH of 7 and another something which happened to change at a pH of 7.1, giving a hint of a necessary step size in the process. I'll check out the link; it looks like it might be useful. Thank you.

Skyler Gordon

Dec 9, 2017, 12:59:18 PM

to diy...@googlegroups.com

I like the idea of boiling down protein assembly / thermodynamic stability to Boolean mathematics, but I think you’ve gone a little beyond my reach.

-SG
