Ants, snakes, and the hydrophobic-polar model

[This article was first published on biochemistries, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The other night I was reading Lesk’s Introduction to Protein Science, when I came across this diagram:

A lattice model represents the structure of a protein as a connected set of points distributed at discrete and regular positions in space, with simplified interaction rules for calculating the energies of different conformations. For small lattice polymers, it is possible to enumerate every possible configuration. There is then no chance that a computer program failed to find the optimum because the simulation could not be carried out for long enough times.

The simplest lattice model is the two-dimensional hydrophobic-polar (HP) model, with the assumption that:

  1. The positions of residues are restricted to points on a square lattice in a plane.
  2. Each residue can be of one or two types for hydrophobic or polar — H or P
  3. The interaction energy is measured only between residues in contact: a neighbouring H-H pair contributes a constant attractive energy — E, and all other neighbours make no contribution to the conformational energy.

Figure 2.31 shows a two-dimensional lattice ‘protein’ in a low-energy conformation.

It’s clearly a well-defined problem, the lattice itself looking a lot like a matrix making it amenable to linear algebra (in technical terms it “can be translated into a binary linear program”). I did a little literature searching and found some interesting ‘state of the art’ approaches I’d have no idea how to implement, and eventually a nice and simple ready-made Python script.

A simple 2 day solution

The Python code I used to do so was written just last year for ‘virtual organisationHPCU (‘High Performance Computing University’), a neat little site who run weekly code challenges.

Note that in the second file of this gist I was adding data to a data.frame in R as I went along running benchmarks of different length sequences (this isn’t the normal way to write data). I’d still not gotten a strong grip on ggplot2 so I impatiently resorted to plotting in Excel (since worked it out, using a linear model for the line of best fit). It was clear that the runtime was increasing exponentially with sequence length (which I later found out mathematically in the literature, see later).

Impromptu data analysis: the trendline fitted perfectly when using just residues 9 to 15 (at smaller sequence lengths things like a decimal point here or there make a difference in the gradient, not when you quickly reach the hundreds). I used this gradient (0.4979) but kept the intercept which arose from using the actual lower sequence lengths (-4.3), as it seemed to make sense at the time, to extrapolate to expected running time for 20 residues: >144,000 seconds, 40+ hours, or ~2 days. For anything more professional than back-of-the-envelope calculations R and ggplot are the way to go though.

The spreadsheet pictured is attached to the GitHub gist below (I love how versatile these little things are, and just found I can make them from command line), click View Raw to download it. The only hassle is that unlike the parent site,, binary files can only be added from command line (clone to your computer, add them, push back to the online gist, delete local copy again). Hopefully they add this into the existing drag-and-drop option soon.

The good news was that the program ran for just under 1 and a half days. The bad news was that it did so because I forgot to plug my laptop in and it died so I had to start over, taking the predicted 40ish hours… The results came out looking like this:

The results show diagrams of the HP lattices, but without the protein chain shown, i.e. the numbering of H and P residues as in Lesk’s example. This can probably be solved with some further code, but in the end I just checked each H residue manually (drawn as green circles) for whether they had the required neighbours to act as the first in the 20 residue sequence, then checking if the subsequent residue had the required neighbours and so on, moving down the lattice.

In this way it’s possible to find the 2 residues able to start and end the chain (note that the sequence is palindromic so there’s no distinction), and thread the route of the protein chain, then count H-H contacts. I mistakenly took the output of 11 to mean Lesk’s conformation was not the lowest energy: this Python script counts adjacent chain H-H contacts, thus positions 6-7 and 14-15 make up the extra 2. Apologies for the dodgy handwriting.

A simple search online would have also found that 9 is the optimal minimum number of contacts (Dittel 2011, using the notation that gives 11; Krasnogor 2002, using Lesk’s notation for 9).

Dittel’s paper is fairly mathematical, and interested readers should try instead Chandru (2003), The algorithmics of folding proteins on lattices.

My two favourite solutions from the literature I came across before settling on the simple HPCU script were on a flexible ‘ant colony algorithm’ (FAC), which though published in Protein and Peptide Letters seemed to be more computer science-oriented than the other, described last year by Alessio Bechini at the University of Pisa, who presented a historical overview of this fundamental bioinformatics problem, detailed the ‘moves’ of algorithms on lattices, and gave guidance on their generic design.

The travelling ant problem

Wolfram Alpha MathWorld describes a peculiar type of algorithm based on the observed behaviours of ants searching for food:

At first, the ants wander randomly. When an ant finds a source of food, it walks back to the colony leaving “markers” (pheromones) that show the path has food. When other ants come across the markers, they are likely to follow the path with a certain probability. If they do, they then populate the path with their own markers as they bring the food back. As more ants find the path, it gets stronger until there are a couple streams of ants traveling to various food sources near the colony.

Because the ants drop pheromones every time they bring food, shorter paths are more likely to be stronger, hence optimizing the “solution.” In the meantime, some ants are still randomly scouting for closer food sources. A similar approach can be used find near-optimal solution to the traveling salesman problem.

Once the food source is depleted, the route is no longer populated with pheromones and slowly decays.

Because the ant-colony works on a very dynamic system, the ant colony algorithm works very well in graphs with changing topologies. Examples of such systems include computer networks, and artificial intelligence simulations of workers.

The ant algo came from Dorigo and Gambardella, who described such a system in 1997:

Consider Fig. 1(a): ants arrive at a decision point in which they have to decide whether to turn left or right. Since they have no clue about which is the best choice, they choose randomly. It can be expected that, on average, half of the ants decide to turn left and the other half to turn right. This happens both to ants moving from left to right (those whose name begins with an L) and to those moving from right to left (name begins with an R). Fig. 1(b) and (c) shows what happens in the immediately following instants, supposing that all ants walk at approximately the same speed. The number of dashed lines is roughly proportional to the amount of pheromone that the ants have deposited on the ground. Since the lower path is shorter than the upper one, more ants will visit it on average, and therefore pheromone accumulates faster. After a short transitory period the difference in the amount of pheromone on the two paths is sufficiently large so as to influence the decision of new ants coming into the system [this is shown by Fig. 1(d)]. From now on, new ants will prefer in probability to choose the lower path, since at the decision point they perceive a greater amount of pheromone on the lower path. This in turn increases, with a positive feedback effect, the number of ants choosing the lower, and shorter, path. Very soon all ants will be using the shorter path.

They published it along with ‘pseudo-code’ – there’s nothing to stop anyone without programming skills studying ‘computational thinking’ like this, indeed it’s an important part of ‘bioinformatics’ as a recent article in PLOS Comp. Biol. was keen to stress.

in addition to the cost measure δ(r,s), each edge (r,s) has also a desirability measure τ(r,s), called pheromone, which is updated at run time by artificial ants (ants for short). When ant system is applied to symmetric instances of the Travelling Salesman Problem (TSP), τ(r,s) = τ(s,r), but when it is applied to asymmetric instances it is possible that τ(r,s) ≠ τ(s,r).

Writing from Glasgow and Guangzhou, the authors describe further variants on the ant colony algorithm — the elitist ant system (EAS), max-min ant system (MMAS), and ant colony system (ACS) to name but 3 — as forming “at last” an ant colony optimisation (ACO) ‘paradigm’.

There’s something really nice about the heuristic and ‘pheromone’ steps mimicking activities of these tiny creatures, which are notoriously heavily researched (I liked this finding on the anternet, ants mimicking the TCP web protocol…)

Given a two-dimensional square lattice board, the protein folding problem is to place the protein sequence in the lattice to form a self-avoiding path. The mission of an ant colony is to discover a path, which maximizes the number of H-H bonds.

For the square lattice HP model, the best conformation is judged by the number of hydrophobic-hydrophobic (H-H) bonds that hydrophobic amino acids are adjacent on the lattice, but not consecutive in the sequence. The number of H-H bonds in each conformation in Figure (1) is 23 (i.e., the number of the dashed lines in the first conformation), which forms the minimum free energy (MFE) state with E* = -23. It can be seen that the hydrophobic amino acids do form a core inside the protein conformation, while the polar amino acids are surrounding the core and their placements are quite flexible.

Although the square lattice model is highly abstract from the real protein folding model, some special conformations can reflect a possible secondary structure of a protein… The 3-dimensional structure of a specific protein sequence is unique, but there may be several equivalent conformations by the same protein sequence. Therefore, the HP model is too simple to reflect the real protein structure completely. However, it is already a very challenging computational model.

Reminiscent of the folding funnel / energy landscape situation (shown quite nicely in this video) in real protein folding, the algorithm sometimes has to ‘retrieve’ the protein from an unfavourable conformation, lest it ‘stagnate’.

As with the ants, the author’s algorithm includes pheromone attraction, and further to this they throw in ‘local’ and ‘global’ pheromone updating, which they liken to the pheromone ‘evaporating’. The goal of this seemingly florid manoeuvre is to avoid early convergence of the ‘ants’ (basically tweaking the strength of the attraction to make the ants work a bit harder rather than quickly making a collective decision, at both a local, ant-level after each move they make, and then ‘globally’ once all ants have charted their protein folding paths).

In contrast to these ‘memories’, heuristic steps represent the intelligence given to each ant actor on the ground: conformations yielding more H-H bonds are favoured, conformations for polar amino acids try and avoid placing them next to more polar amino acids (if I read the equation right), and as the name suggests the overall selection process is flexible. This final heuristic involves a ‘reinforcement’ parameter which seems to update as it runs, and represents the model’s ‘confidence’ in its own parameters for the number of ants and the pheromone evaporation rate (essentially it can make it run faster if it has to optimise these less).

Hu et al. (2008) Protein folding in hydrophobic-polar lattice model: a flexible ant colony optimization approach

Multidimensional protein-folding

This paper caught my eye as it discussed how so far there’d “only” been attempts at making lattice models in two and three dimensions. Makes me feel a bit square…! As I’ll go into, I’m not totally convinced this holds much water after a couple of readings.

The inspection of the neighborhood of alpha carbons in the core of real proteins reveals that also lattices with higher coordination numbers, possibly in higher dimensional spaces, can be adopted.

The author Alessio Bechini presents a general parametric lattice model for simplified models of protein conformation — the phrase ‘lattice-agnostic’ is also used instead of ‘general’, meaning it’s made to be extensible from 2D to 3D, and beyond.

He takes an object-oriented approach to handling these protein lattices, so that he can “catch” the structural and behavioural details of proteins in “dedicated classes” (programming-speak for having nicely organised results, more or less).

Far from being an exercise in pure computer science, once again this style of work feeds back into biological thinking, aiding in biologists’ understanding of chaperonins.

One of these papers came from Ken Dill, who first put forward the lattice model in 1995 in 1989, later reviewed in 1995:

General principles of protein structure, stability, and folding kinetics have recently been explored in computer simulations of simple exact lattice models. These models represent protein chains at a rudimentary level, but they involve few parameters, approximations, or implicit biases, and they allow complete explorations of conformational and sequence spaces. Such simulations have resulted in testable predictions that are sometimes unanticipated: The folding code is mainly binary and delocalized throughout the amino acid sequence. The secondary and tertiary structures of a protein are specified mainly by the sequence of polar and nonpolar monomers. More specific interactions may refine the structure, rather than dominate the folding code. Simple exact models can account for the properties that characterize protein folding: two-state cooperativity, secondary and tertiary structures, and multistage folding kinetics-fast hydrophobic collapse followed by slower annealing. These studies suggest the possibility of creating “foldable” chain molecules other than proteins. The encoding of a unique compact chain conformation may not require amino acids; it may require only the ability to synthesize specific monomer sequences in which at least one monomer type is solvent-averse.

The paper has a wonderful outsider’s perspective to it, this section of the introduction being strikingly prescient of the forthcoming finding that not all proteins do fold, which biologists for a long time managed to sweep under the rug:

Although native proteins are specific, compact, and often remarkably symmetrical structures, ordinary synthetic polymers in solution, glasses, or melts adopt large ensembles of more expanded conformations, with little intrachain organization. With simple exact models, we ask what are the fundamental causes of the differences between proteins and other polymers- What makes proteins special?

On this note, while looking for answers to the question above in Lesk’s book I happened upon a not entirely glowing review of the text highlighting the lack of said answers, written by now-retired immunologist and structural bioinformatics pioneer Eric Martz (whose web 1.0 homepage full of Jmol-generated GIFs I quite enjoyed):

In addition to the many strengths listed above, there are also some weaknesses. Lesk conveys the impression that all proteins must fold into a compact, low-energy, native state in order to function. I found no mention of the concept of intrinsically unstructured or disordered regions of proteins, present in 40% of proteins by one estimate, with roughly 10% of proteins predicted to be fully disordered (Tompa 2002, Gunasekaran 2003. Neither “disordered” nor “unstructured” appear in the index nor glossary.

The Protein Data Bank is introduced in Chapter 2, but no estimate is given of the percentage of proteins genome-wide that have empirically known structure (which is less than a few percent). Although ab initio structure prediction of novel folds from sequence is introduced nicely in Chapter 5, and critical assessment of structure prediction is mentioned, the present success levels are not quantitated, nor is it made clear that for most purposes useful predictions of tertiary structure cannot be achieved from sequence alone. Similarly, the overall success rate of crystallography (less than 10%) is not discussed. Dramatic recent successes in computational redesign of ligand binding by Hellinga’s group most likely came out too late for inclusion.

I like Lesk’s style of writing and these critcisms notwithstanding would recommend his works to other students, but returning to Dill’s lattice model, which challenged the 1° ⇢ 2° ⇢ 3° structure paradigm that “seeks computer algorithms to predict secondary structures from sequence, and then to assemble them into the tertiary native structure”:

The nonlocal interactions that drive collapse processes in heteropolymers can give rise to protein structure, stability, and folding kinetics. This perspective is based on evidence that the folding code is not predominantly localised in short windows of the amino acid sequence. It implies that collpase drives secondary structure formation, rather than the reverse. It implies that proteins are special among polymers not primarily because of the 20 types of their monomers, the amino acids, but because the amino acids in proteins are linked in specific sequences. It implies that the folding code resides mainly in global patterns of contact interactions, which are nonlocal and arise from the arrangements of polar and nonpolar monomers in the sequence.

Dill and co. argued exact models were preferable to experiments, Monte Carlo partial sampling or approximating theories as they didn’t overice the cake with parameters, while exact models had partition functions working from thermodynamics to compute physical properties of simulated molecules (more on these later).

While they sacrifice geometric accuracy, simple exact models often adequately characterize the collection of all possible sequences of amino acids (sequence space) and the collection of all possible chain conformations (conformational space) of a given sequence. For many questions of folding, we believe that complete and unbiased characterizations of conformational and sequence spaces are more important than atomic detail and geometric accuracy.

This paper was a mammoth work, reaching almost 40 pages — and it’s wonderful, but in reading it I spotted it’s not the original description of the lattice model, which was in fact in 1989. Approximations lattice models avoid include those made from spin glasses, such as the random energy assumption proposed by Bryngelson and Wolynes in 1987, who drew links from problems in protein folding to descriptions from polymer physicists on glassy materials:

we often think of crystalline, liquid, and glassy states of simple material. Clearly in the glassy state many different free-energy minima can coexist and interconvert. The properties of glasses can be dependent on the history of their preparation, and in the glassy state very slow aging processes occur.

For anyone with an interest in intrinsically disordered proteins, pages 19 onwards of Dill (1995) offer very compelling early insights on the subject.

Back to the idea of multidimensional folding — the meaning is not entirely made clear in the paper of what dimensions beyond 3 would mean in the real world, though mathematically they are desirable to increase the number of adjacent vertices, quantified as the ‘coordination number’ (in a square 2D latice each vertex has at most 4 adjacent vertices, in a cubic 3D lattice the maximum rises to 6).

A caveat is mandatory: high dimensional models, although possibly attractive for several aspects, may lead to conformations with an unnatural surface/volume ratio, hampering their employment in the study of solvent interactions, or of multichain systems.

In the core of real proteins, residues are densely packed and alpha carbons are close to one another. We can quantitatively characterize such an accommodation, and subsequently we can to check whether one specific lattice type, with its own coordination number, is able to correctly describe the neighborhood of an alpha-carbon (Cα).

Despite a seemingly simple geometric setup, even a square lattice comes with a serious glitch, the parity constraint (think of how a bishop on a chessboard will never take a bishop on another colour):

no two residues can be placed on adjacent vertices if they are separated along the backbone by an odd number of residues. This limitation has pushed towards the exploration of different lattice types. Other significant choices are the planar triangular lattice, and the FCC (Face Centered Cubic) lattice (which can be considered as a three-dimensional generalization of the former), with coordination numbers 6 and 12 respectively. Finally, also the two-dimensional honeycomb lattice, called also hexagonal, with coordination number 3 (thus at the very lower boundary of the admissible range), has been explored; this one is not a Bravais lattice [Ed., explained here], and the directions to reach neighbor vertices depend on the specific vertex we are placed on.

These odd behaviours can be overcome by altered models. The parity problem was surmounted by Heun et al. in 2003 using ‘extended cubic lattices’ (the extensions being diagonals in the plane), though the results are still not optimal. Reviewing the work in 2008, Albrecht et al. noted that, “protein folding is one of the problems where indeed the knowledge of optimum solutions is essential”, though it ought be possible for such a sub-par algorithm to narrow down the results for a “local” optimisation (known as a heuristic step).

Bechini wrote,

Other exploration possibilities may easily come from the adoption of lattices in generic spatial dimensions, even beyond the classical planar and three-dimensional cases… the proposed generalization stems from square and triangular lattices. This does not represent a loss of generality, because the honeycomb model can be plainly obtained within the planar triangular lattice by imposing an angle of 120 degrees between any couple of subsequent bonds, and this constraint can also be added in the case of triangular lattices in higher dimensions.

For a visual explanation of this see here.

The rest of the paper digresses into linear algebra of higher dimensions (such as on seemingly mundane operations like ‘rotation’), and outlines the blueprints for software to this end (which makes it queerer still for the author’s software to be omitted from publication).

The advantage of higher dimensional underpinnings to protein folding become clear when the resultant conformation is shown in 3D, such as for these 3D projections of a five-dimensional configuration on a triangular lattice (in that they appear less regular than a cubic lattice, presumably more representative of true protein conformation):

From my elementary understanding of linear optimisation, the section on ‘moves’ refers to something akin to pivot operations in the simplex algorithm (without meaning to confuse readers though, one is called the ‘pivot move’).

The set of moves essentially constitute different ways of rejigging the parameters of various equations which together influence an objective function, which may or may not subsequently head in the right direction — minimisation algorithms seek to reduce the value of this objective function, maximisation algos to increase it.

Find out more about linear programming optimisation methods here, but in this case the goal is simply to go from one protein conformation to another by rejigging parameters (each residue’s position on the lattice). For protein folding, the objective is instead called the energy function, since the objective is to minimise the free energy of folding (i.e. find the most favourable conformation, maximising H-H inter-residue hydrophobic contacts etc.).

Bechini highlights research in augmented reality which showed that for a common class of simulation algorithm, Monte Carlo, “the effectiveness of a specific MC approach strictly depends on the used move set”. Having underscored the importance, he proceeds to detail the types of moves used on HP lattices:

One of the earliest move sets proposed by Verdier and Stockmayer, and it contains only two moves that operate upon a single residue, namely the single residue end move and the corner move. Such approach was further developed by Hilhorst and Deutch, who also introduced the two-residue crankshaft move. These three move types have been used tgether e.g. by Gurler et al. Following Tachuk et al, here they are collectively called VSHD set. Dealing with multidimensional models, VSHD moves have to be accordingly generalized.

The single residue end move can be described in a general way as the placement of an end residue onto a free vertex that is one step away from its adjacent residue, and its implementation is trivial.

The corner move (also known as kink-jump move) can be reformulated as the placement of a target residue onto a free vertex that is one step away from both its two adjacent residues. This definition can be used (and coherently implemented) for any type of lattice in any dimension.

Similarly, although it was originally conceived for square/cubic lattices, the definition of the two-residue crankshaft move can be generalized as follows: Given four successive residues rp, r0, r1 and rs on vertices vp, v0, v1 and vs, place t0 and t1 in two free vertices v′0 and v′1 such that all the pairs (vp,v0), (v0,v1) and (v1,vs) contain adjacent vertices.

The generalized VSHD moves just described can be viewed as local ones, as their application affects only residues in the very neighborhood of the target residue(s). On the contrary, the use of global moves (sometimes also called long-range moves) generates more distant conformations.

…The slithering snake move (a.k.a. reptation) looks first for a free position adjacent to a target end, and such terminus is moved onto it. The other residues are all shifted by one position along the backbone path. This move, although it keeps most of the overall position arrangement, may completely disrupt the adjacency pattern of residue of different types, thus deeply impacting on the global potential.

The pivot move (a.k.a. wiggle) represents an effective way to drive a potentially radical change n the conformation, and it has been used in plenty of simulation works on bi- and three-dimensional lattices. In practice, one residue rpivot on vertex vpivot is chosen as “pivot”, and one branch departing from it, either the forward or the backward one, is wholly rotated around vpivot to a new position.

Because of the aforementioned complication of rotation for higher dimensional spaces, “the corresponding pivot move could be unfeasible” [in the mathematical sense], and “in any case requires a significant computational effort”, which poses a problem when trying to avoid falling into the traps of local minima in the protein folding landscape en route to the true, global minimum (again, check out this video for a nice visualisation of the process, which I’ll go into a bit more detail below).

…[Unlike local moves] the pivot move may provide very significant conformation changes, but the more compact the conformation is, the more often the move comes to be unfeasible. A good tradeoff has been found with pull moves, originally introduced for square lattices… which exhibit the semi-local property, i.e. although in the worst case the pull move may relocate a large number of elements, on average only a few residues are involved.

Analysing the performance metrics of each move found the pull to be most favorable, with up to 100% hit rate at higher dimensions (none of the moves tried at random were ‘disallowed’ moves, and ~80% were effective). The algorithm described is ‘greedy’, preferring short-term best decisions in choosing how to position each step (‘chunk’) along the peptide chain; a chain-growth algorithm as was the ant-colony optimisation algos described above.

Moreover PERM, one of the most effective folding algorithms in the HP model, can be regarded as a particular form of Monte Carlo chain-growth.

The experiment over 10 optimisation runs confirmed minimum potential decreased for increasing dimensions, the trend coming to ‘a sort of saturation’ and levelling off at higher dimensions, illustrated in Figure 9, below. In lacking parity constraint, triangular lattices reached this ‘saturation’ sooner.

Far from being an abstract exercise in software development, there are notes scattered throughout directly relating to the protein folding problem itself (which biology ultimately tackles as computers must).

Usually longer sequences hit deeper minima and, the lower the minimum, the higher the dimension the trend saturation tends to occur. Actually, important facotrs are the number of H residues and the H/P ratio. The last three sequences in Figure 9 have different lengths but the same number of Hs, and the lower curve corresponds to the one with the highest H/P ratio.

…the curves correspond to the temporal computational complexity for the algorithm. Specifically, it is O( l/p · ( z – 1 )m), with l as the sequence length, m and p as the chain growth parameters, and z the lattice coordination number which in turn depends linearly on the dimension in the square case, and quadratically in the triangular one.

The notation in bold is called Big O, and mathematically describes the worst case running time of an algorithm, the largest amount of time an algorithm can take given the worst possible input of a given size [Jones and Pevzner, 2004].

Figure 11 compares the general case across both types of lattice across the dimensions to try and draw an objective comparison “as the neighborhood varies”. Though the triangular lattices’ superior performance may be particular to this software, due to ‘looser topological constraints’ they are able to accomodate more hydrophobic residues around an H vertex compared to a square lattice.

So far, studies on simplified protein structures based on lattice models have directly targeted only a few specific lattice models… Observing the neighborhood of alpha carbons in the core of real proteins, it can be easily noted that it is sensible making use of lattices whose dimension is larger than three.

Reading the whole paper through a few times now, I’m unable to find any reference to such a claim backed up with a citation, which I’d be wary of founding a paper on if I were writing a manuscript. Looking at the resultant ‘projection’ down into 3D space as above though does show realistic representations of proteins on cubic lattices, and perhaps this is discussed elsewhere unbeknownst to me.

McLeish (2005) describes protein-folding in higher dimensional spaces, describing ‘hypergutters’ in addition to the well-known folding ‘funnels’, where the additional dimensions incorporate some sort of “cooperativity of structure in several simultaneous dimensions”:

[which in more biochemical language] might be exemplified by cooperative secondary structure formation alongside native or near-native distant contacts in α-helix bundle proteins.

Wales (2010) describes how low-dimensional representations can “misrepresent or even remove barriers” (Krivov and Karplus 2004, 2006, and 2008; Muff and Caflisch 2008; Noé and Fischer 2008). These references all discuss dimensionality in representations of the free energy surface rather than to the physical space of the protein, so I’m still a little perplexed by Bechini’s meaning, but they’re interesting papers regardless from distinguished biochemists, so some notes on them are to follow.

It’s fair to say that the folding process has a statistical nature, with dimensions in a sense from the multiple interconverting conformations (precisely what the protein folding ‘landscape’ model captures) but it’s not really the same as the arrangement of the alpha carbons in the core of a protein as written.

  • In 2004, Krivov and Karplus (the latter of whom won a recent Nobel prize in Chemistry for this line of work) described ‘hidden complexity’ of free energy surfaces for protein folding stemming from the many degrees of freedom in folding, indicating the ‘folding funnel’ is an oversimplification.

    In this work they represented the landscape in a graphical form not ‘projected’ to ‘progress variables’. It’s pretty clear this is not dimensionality in the sense Bechini alludes to, rather in the sense of data dimensionality (in the statistical sense).

    An unprojected representation of the FES of a protein can be obtained and presented in a meaningful way by using an equilibrium trajectory to construct a transition disconnectivity graph (TRDG). The essential idea is to group the states into free energy minima, not according to their geometrical characteristics (such as the number of native contacts) but rather according to the equilibrium dynamics; i.e., from an equilibrium trajectory we determine the populations of the states, which provide the relative free energies, and the rates of the transition between the states, which provide the free energy barriers.

    ( Image: Poster from Miller et al. on TRDGs )

    Use of a disconnectivity graph to represent the FES reveals its inherent complexity. Most importantly, the denatured basin has a number of deep subbasins with low enthalpy and low entropy, which are not evident in projected surfaces. In fact, several of these basins have an energy below that of the native basin, which is stabilized by its higher entropy. This result contrasts with the standard funnel picture, which assumes a single large basin for the effective energy that directs folding to the native state.

For anyone not following the idea of a free energy landscape at this point, there was a brilliant introductory exposition on the model from Ruth Nussinov and Peter Wolynes at the start of a folding landscape-themed issue of Physical Chemistry Chemical Physics in April this year, available here).

The free energy landscape concept has commanded the imagination of many scientists because those who have adopted and exploited it can more deeply grasp mechanisms in biology than what can be achieved by the simple structure, function paradigm. Why has the free energy landscape idea been so useful for portraying function? While X-ray structures capture snapshots from an ensemble, the changes of this ensemble under varying conditions ultimately give rise to the non-linear information flows needed for life. By also depicting the barriers between the active and inactive ensembles of states, the kinetics of information flow in the living cell can begin to be understood.

Free energy landscape ideas clarify how some mutations can be oncogenic: they can either stabilize the active state ensemble or destabilize it. Essentially, mutations sculpt the free energy landscape of the kinase and the resulting ensembles of states partner differentially with other elements in the cell to turn on or off regulatory mechanisms.

While conceived originally as a physical concept to describe transitions between protein substates revealed only at cryogenic temperatures and in following years adopted to describe protein folding, its most significant impact will eventually be its application to understanding function. This may be the next revolution in physicochemical biology.

  • In 2006 Krivov and Karplus went on to describe how progress variables could preserve the barriers, as had been noted problematic in the 2004 work, reinterpreting their ‘minimum-cut’ method to give the exact values of the free energy as a function of the progress coordinate “i.e. at each value of the progress coordinate, the profile is obtained from the surface with the minimal partition function among the surfaces that divide the full free-energy surface between two chosen end points”.

    The partition function sums the Boltzmann factors for a system; the Oxford Dictionary of Biochemistry and Molecular Bio. defines it as “a dimensionless mathematical function, derived from statistical mechanical theory, that can be used to derive the average distribution of the internal energy of a system in terms of its partition between all its molecular inhabitants”.

    A minimum cut is a term proposed in Ford and Fulkerson (1955):

    “Consider a rail network connecting two cities by way of a number of intermediate cities, where each link of the network has a number assigned to it representing its capacity. Assuming a steady state condition, find a maximal flow from one given city to the other.” …the minimal cut theorem establishes that an obvious upper bound for flows over an arbitrary network can always be achieved… by specializing the network we obtain as a consequence of the minimal cut theorem an effective computational scheme… we observe the duality between the capacity problem and that of finding the shortest path, via a network, between two given points.

    Mishra and Caflisch explained this ‘balanced minimum-cut procedure’ as applied to the FES in Biochemistry, 2011:

    Exploiting the analogy between system kinetics and equilibrium flow-through network, a barrier preserving projection of the free energy surface… where the minimum cut in a network is used for finding the free energy barriers and the relative partition function arising from the minimum cut is used as the progress coordinate to project the free energy onto.

    This method, the so-called cut-based free energy profile (cFEP), preserves the barriers and minima in the order they appear during the sequence of events. The cFEP approach has been applied to extract protein-folding pathways and mechanisms from molecular dynamics simulations, to analyse kinetics of peptide aggregation, to derive thermodynamics and kinetics of small molecule unbinding from proteins, and has been extended recently to MD sampling by distributed computing. The success of cFEP method relies on the validity of the underlying coarse graining where care should be taken to avoid clustering of kinetically distant snapshots into the same node.

    Krivov and Karplus summarised the FEP as “the essence of the reaction kinetics by showing the barriers and basins on the way from the initial to the final state.

    if one wants to identify the structures associated with most important pathways, one can do this by postprocessing the obtained cuts, e.g., by considering the structural properties of the clusters between two successive cuts. The fundamental idea of the present approach is that the reaction coordinate is chosen to include any and all pathways from the initial to the final state without any prejudgment as to the geometric coordinates involved. This requires that the reaction coordinate be defined independent of structural features, as it is.

    Because the chosen progress coordinate is very flexible, the obtained FEP is likely to be the best way of projecting the FES onto a 1D coordinate. A posteriori, the structural features associated with the reaction can be determined.

  • In 2008 Krivov and Karplus showed how to combine these minimum-cut based free energy profiles (FC), with conventional “histogram-based” free-energy profiles (FH) to calculate the coordinate-dependent diffusion coefficient on the FC. It’s quite advanced theoretically, but enlightening. The ‘diffusive preexponential factor’ derived from this relates experimentally determined folding rates to computed free energy barriers.

    Even when a one-dimensional single-barrier free energy projection seems adequate to describe the kinetics, there is a fundamental difficulty in determining the barrier height, because the measurements provide only one parameter (e.g., in protein folding, the rate constant of the corresponding unimolecular reaction is obtained).

    [In our 2006 article] we considered the ballistic regime (i.e., the quenching interval was large enough that the number of recrossings of the transition state was negligible). The free energy of the barrier was actually determined only up to an arbitrary additive constant… that is, the minimum-cut value used for the free energy is equal to the total number of transition (i.e. proportional to the rate). Here we focus on the diffusive regime, which is implicit in Monte Carlo simulations and is valid in many cases for molecular dynamics simulations, as in protein-folding studies.

    we first demonstrate the results for a one-dimensional system to avoid complexity. We then outline how the results are generalized to the multidimensional case; for the practically important case of clustered equilbrium trajectories, the method corresponds to the minimum-cut procedure for the corresponding network.

  • Muff and Caflisch demonstrated how state-based kinetic models may reveal features hidden in projections on just a few coordinates through a kinetic analysis, again using coarse-grained MD. They applied this to a particular protein, mutating it in their simulations to show how a minor folding pathway could become the principle route, “point mutation modulating the flux through each pathway”.

    Remarkably just a single amino acid side chain’s difference (from a W10V, i.e. Trp to Val mutation) was enough to redistribute the [structured*] denatured state ensemble, i.e. at the beginning of the folding process
    * for a structured as opposed to intrinsically disordered protein, i.e. that is expected to fold into a native state

    Recent network and graph analyses have shown that the surprisingly simple two-state picture of protein folding, often obtained by projecting the free energy onto an arbitrarily chosen progress variable, is not consistent with the actual free energy surface.

    Above: Conformation space networks (CSNs) for the wild-type β3s protein in Figure 1, with each node representing a secondary structure string. When perturbed by mutation (W10V) the distribution of these changes, as shown by the altered networks in Figure 2, legend provided in the accompanying table.

  • Noé and Fischer 2008 lastly wrote of a transition network in contrast to previously described reaction-coordinate-based methods, in a special Theory and simulation themed issue of Current Opinion in Structural Biology on the pages of which Scheraga and colleagues also published an extensive review of the general field of protein conformational sampling simulations.

    [Conformational change] processes are often simulated by driving the transition with a few (often one) pre-defined reaction coordinates or order parameters while allowing the remaining degrees of freedom to relax. This assumes that the chosen reaction coordinates suffice to define all relevant states of the process, including the transition states, such that the slow transition events are entirely separated in the low-dimensional projection onto these coordinates.

    Such an approach may work for simple chemical systems, or for processes in which the driving coordinate corresponds to an experimentally applied external force, such as in force-clamp protein unfolding (Lu 1998; Grater 2005). However, reaction-coordinate-based simulation methods tend to perturb the transition mechanism and yield wrong rates for complex equilibrium processes and folding. A striking example of this is the method of targeted (or steered) molecular dynamics (TMD), in which a constraint is put on the RMS-difference between the current and target coordinate-set in order to accelerate the transition towards the target coordinate during molecular dynamics simulation. When the equilibrium transition mechanism involves a concerted interplay between localized and large-scale motions, as it often does for macromolecules, TMD biases the order of these motions (Koppole 2008).

Above: transition network illustrations in Noé and Fischer 2008

Moreover, attempting to understand the thermodynamics and kinetics of complex systems in terms of the free-energy landscape in low-dimensional projections is, although appealing, often highly deceptive. This is due to the fact that low-dimensional projections shrink distances between points in space and produce overlaps between conformations that are separated in the full-dimensional state space. This makes free energy barriers disappear in the low-dimensional projection, often producing apparently smooth free energy surfaces with only one or two basins even for systems that contain many kinetically separated substates.

Perhaps there’s some connection in these to what Bechini attempted to describe. I don’t see it in honesty, and find the suggestion that “multidimensional models may provide hints on the dynamic properties of protein systems” tenuous without further explanation – which is disappointing as the paper’s main finding concerns use of said multidimensional mathematics.

The paper was an intriguing look into HP lattices, and hopefully the subsequent foray into more complex (and prominent) protein folding simulation methods was as interesting to read as it was to write about.

I’m happy just seeing Bechini’s above-3D ‘generalised, multidimensional’ method as an abstract way for a computer to thread together more complex shapes [accessing favourable conformations] before the projection back to true 3 dimensions, which would surely explain the runtime results.

All this highlights to me the importance of understanding geometry at a more advanced level to even be able to interpret statements like this. The reality of the situation is likely that such interpretation will only come from waiting a few years, to see whether others build on the idea or if (I suspect) such ideas remain on the fringes of protein lattice models. In a more recent paper, Bechini collaborated with colleagues and notably stuck to 2 and 3D.

It almost sounds like a suggestion of non-Euclidean space, and papers found on searching for “non-Euclidean protein folding” are all too reminiscent of something from the likes of Deepak Chopra, e.g. this one absurdly claiming a cure for cancer from a physicist on self-publication server arXiv (I’ll refrain from details of more here). Sadly the internet gives bizarre things like this a home they’d struggle to find offline. Checking the citations to an odd-looking paper a couple of weeks ago, Google Scholar showed me a source titled ‘IMAGINATION’, which said all I needed to know. Needless to say it’s super irritating.

The closest thing I can find is a paper on Geometric Issues in Dimensionality Reduction and Protein Conformation Space, on treating protein conformation space as the field of robotic motion planning treats configuration space. Once again, the multidimensional side of things is statistical, and dimension reduction refers to protein conformational sampling. This subject was covered in a recent talk from Nancy Amato given at Robotics: Science and Systems conference which I watched just the other day, and would highly recommend (certified Chopra free!).

In the final period of questions (~44mins in) Nancy responds to a question of what this approach is missing, mentioning a familiar name:

Well, basically the energy function. One of my collaborators (Ken Dill) actually thinks we should still be able to do protein structure prediction, and he thinks that we can use these techniques to help us learn and improve the energy function. We have done that a little bit with him. Basically if you know the energy function, then it tells you which way to move.

Otherwise, what I showed you today is not doing structure prediction – it’s mapping the landscape. If we can come up with a good map of the landscape, we should be able to do protein structure predictions. But can we do that without having information at all about the native state?

Q) It’s kind of surprising that these methods are able to use sampling in a high-dimensional space effectively, despite the fact that the energy landscape might be very complex… do you think that’s really saying that the energy function is actually more structured than you might believe otherwise? Like maybe in a probabilistic sense, say because the energy function is composed of a number of independent components, that there’s actually some probabilistic converge to a more structured object – is there any theory where you try and analyse that aspect?

Well, one of the great strengths of these kinds of sampling methods is you can apply them to a very high dimensional space. If they’re going to work, you basically hit the nail on the head: you need to have some structure in the underlying landscape, you need to have something to map. If you’re going to map it in a very sparse set, in a very high-dimensional space, if the map will be meaningful you do need there to be some structure underneath it.

Ken Dill, the colleague who I mentioned before, he…— One thing that always bothers me is the way that we are weighting our edges. So we basically take the two endpoints of the edge, and we take them at some fixed resolution which we decided a priori – all edges have the same resolution – and we compute the intermediate, analyse each one of those and we say that that’s the edge. But the edge could be very long, and the landscape between those two edge points might really be something like this [*gestures rough surface*] and I’ve approximated it with this straight line. It bothers me, I still today am not convinced that this is a good thing. He is convinced that there is this underlying structure, and that this is a good approximation. You know, if it’s too long it’s probably not, but if it’s out there in the space of the landscape, if it’s a longer space, and we don’t really know very much, then maybe it’s ok… That’s his intuition, he basically believes what you’re saying. I… defer to him [*laughs*].

Q) Do you think that any other motion planning techniques (potential-function based ones) would have any luck tackling this problem?

In methods besides the sampling-based methods? …I think the sampling-based methods are best suited. If you think about it, ours is maybe a hybrid between a tree-based method, and a PRM [probabilistic road map], or graph-based method, because we are starting from some known states. We’re also doing things where we’re studying transitions between known conformations, maybe seeding our roadmaps from several different points.

I do think that this higher dimensional space makes it ideally suited for the sampling-based methods. I just believe that they are good methods, I haven’t honestly though about whether they would be good methods.

Q) Could you give us a sense on how much of an impact these techniques had on computational biology? Are they becoming mainstream?

Well, not as much as we would have liked. I think one of the challenges is that much of the work in biology is still very sequence-based, they’re not really using the three-dimensional structure as much. I think that we have a lot to offer in understanding the geometry, and the spatial information. I think that some of the forward-looking people in these fields are paying attention, and they like to collaborate with us. We’re not there yet, for sure, but there are a lot of people who’ve heard about it. It’s rare that I meet someone who’s not aware of these types of approaches.

⌘ Bechini (2013) On the Characterization and Software Implementation of General Protein Lattice Models. PLOS ONE 8(3): e59504

To leave a comment for the author, please follow the link and comment on their blog: biochemistries. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)