As if water didn’t have enough unusual aspects (of interest to biologists, since, well, it’s only the primary solvent in all living organisms), a paper out today highlights its disobedience to a fundamental law describing molecular diffusion.
Aqueous guanidinium hydrochloride ‒ a “water neutral” denaturant ‒ is used to produce so-called chemical (cf. physical/mechanical) unfolding of proteins. The particular paper I’m thinking of is Exploring Early Stages of the Chemical Unfolding of Proteins at the Proteome Scale from late last year, which used urea, not GuHCl (GuCl is also used), and had some really interesting conclusions.

The dominant paradigm for unfolding (the ‘‘direct’’ mechanism) claims that the denaturant properties of urea are related to its capacity to interact with exposed protein residues more strongly than water. However, the nature of such a preferential interaction is not so clear. Thus, while some authors suggest that it is mostly electrostatic and related to the formation of direct hydrogen bonds, others claim that preferential dispersion is the leading term. It is also unclear whether the major destabilizing effect of urea is related to interaction with the backbone or with side chains. In the latter case, there is also discussion regarding the preferential side chains: polar and charged or apolar.
We recently combined multi-replica molecular dynamics (MD) simulations and direct NMR measures of ubiquitin to characterize the “urea unfolded ensemble” of this model protein. Our results suggest that urea stabilizes flexible over-extended conformations of the protein, which are unlikely to be sampled in the “unfolded” state of aqueous proteins. Extended conformations of the protein with exposed hydrophobic surfaces are more urea-philic than the native globular state, due mostly to extensive London dispersion interactions (the attractive contribution in Van der Waals interactions between instantaneous dipoles) between apolar side chains and urea molecules in the first solvation shell of unfolded conformations. We believe that our results clarify the molecular basis of the effect of urea on the thermodynamics of the folded←→unfolded equilibrium, but unfortunately, they do not provide information on the kinetic role of urea in the unfolding process. In other words: does urea actively induce protein unfolding? Or, on the contrary, does it passively stabilize the unfolded state by selectively binding to unfolded conformations?
Urea and protein dynamics.
Urea diffuses quite slowly and limits protein fluctuations, which leads to an apparent paradox: a denaturant that slows down the dynamics of proteins compared to the equivalent simulations in water. However, analysis of the trajectories shows that such a paradox does not exist. Urea migration to the protein surface was slower than that of water, but once it reached the surface, urea remained for longer periods, especially when located in cavities near the hydrophobic core of the protein. Interestingly, the positions of long-lasting urea interactions are consistent among all four force-fields and seem associated with a sizeable improvement in van der Waals and electrostatic energies and with the formation of strong, long-living H-bonds. These findings demonstrate that even if H-bonding is not the driving force behind the urea-philicity of proteins, it is important for stabilizing urea molecules at specific positions in the protein interior.

The idea that solvent diffusion around the protein has an effect on its movements means diffusion constants can be useful readouts in protein folding experiments.

urea/water/protein effective parameters are able to reproduce a variety of experimental observables, such as mass densities and radial distribution functions of urea/water solutions derived from neutron scattering experiments, the experimental water/urea transfer free energies of tripeptides, and the urea density around unfolded proteins found by vapor pressure osmometry measures

“Long-residence” denaturant molecules (urea in the paper above) typically bind to regions adjacent to mobile residues, at ‘hinge-points’.
This was of interest when studying intrinsically disordered proteins, my dissertation topic this semester, as intrinsically disordered regions produce just such flexible loops or tails, meaning ‘sticky’ denaturants are expected to play a major part in guiding their unfolding.

similar to the unfolded state, it is the van der Waals interactions that drive the accumulation of urea on the surface of the folded protein. However, the role of H-bonding cannot be dismissed, as these bonds are crucial for the stabilization of long-living urea interactions near hinge points, which in turn are required to bias intrinsic protein dynamics towards unfolding. Clearly, “direct” effects not only are the main factors responsible for the urea-mediated stabilization of the unfolded state, but are also relevant in guiding the first steps of urea unfolding.
Microscopic unfolding events are related to stochastic thermal motions, which are in principle similar to those that occur spontaneously in water at room temperature. However, urea is not a mere passive spectator that simply stabilizes the small percentage of unfolded protein coexisting within the native ensemble and leading to a displacement in the folded←→unfolded equilibrium towards the denatured state. On the contrary, urea has a dual function: i) it takes advantage of microscopic unfolding events, decreasing their chances of refolding, and favoring further unfolding; and ii) among these microscopic unfolding events it selects and stabilizes microstates with exposed hydrophobic regions. These effects lead to a slow divergence in the temperature-unfolding pathways in water and urea, and, as shown for ubiquitin, to distinct unfolded states. Consequently, concepts such as folded and unfolded states or folding and unfolding pathways need to be revisited and reformulated considering the nature of the denaturant used.

The paper I wanted to talk about is from the University of Glasgow, on the validity of the “Stokes−Einstein−Debye” (SED) relationship, which describes rotational diffusion.
In the simplest model of particulate diffusion ‒ a sphere ‒ a frictional force arises in a fluid due to attraction between molecules. In order to move a solid object through it, some solvent molecules must also be displaced, and those nearest to the moving particle are the most perturbed, decreasing to zero with distance (Serdyuk, 2007). 
To compute the frictional force, we must calculate the force required to maintain the perturbed velocity distribution of the solvent molecules, which is related to the fluid’s viscosity, η₀.
Although the actual derivation is extremely difficult (the sphere’s motion induces velocity gradients in the fluid which must be calculated explicitly), it’s still worth examining the conclusions of the early biophysicists, to see what effect changes to them might have on our interpretation of protein structure.

According to Stokes, the translational friction of a sphere is proportional to its radius, R₀, and the viscosity η₀ of the solvent through which the particle moves:
f₀ = 6πη₀R₀
There are three important remarks concerning Stokes’ equation. First, the coefficient (i.e. 6π) is determined by the boundary conditions of fluid flow at the surface of the particle (‘stick’ or ‘slip’ conditions). The number 6 indicates the use of stick conditions. Second, from a mathematical point of view, Stokes’ approach is correct when the solvent is considered as an unstructured medium. It is evident that this condition holds if the molecules under consideration are much larger than the solvent molecules. It is generally accepted that for proteins with a molecular mass in excess of 5000 Da (so-called ‘large’ molecules), the motion of molecules in solution follows a continuous flow pattern. However, the lower limit for the correct application of Stokes’ law (for so-called ‘small’ particles) is still under discussion. Third, the equation was derived with the assumption that interparticle interactions are absent. This condition is realised upon extrapolation to infinite dilution of the solution.
Since the frictional force is directly connected with the surface area of the particle studied, and a sphere has the smallest surface area of all geometrical objects of a given volume, it can be concluded that in general the frictional coefficient of a spherical molecule is smaller than that of any non-spherical molecule of the same volume.

The equation shown in the main image above can now be recognised as a slightly modified form of this relation, one relevant to protein motions once the asymmetric (non-spherical) shape is taken into account.

From general considerations it is clear that the frictional force of asymmetrical particles depends on their orientation relative to the flow direction. Thus, an elongated particle has a smaller friction when it is oriented along the flow as compared to that when it is oriented perpendicular to the flow. In the averaged case the solution is
f₀ = 6πη₀R₀F(p)

The term introduced here, F(p), is the Perrin function (or Perrin translational friction factor, also written FP). From Wikipedia:

The frictional coefficient is related to the diffusion constant D by the Einstein relation, D = k_B T / f_tot
Hence, f_tot can be measured directly using analytical ultracentrifugation, or indirectly using various methods to determine the diffusion constant (e.g., NMR and dynamic light scattering).

The meaning of the equation shown above (being challenged in this paper) is now clear: the frictional coefficient is related to the diffusion coefficient by the Einstein relation, where the subscript 0 indicates surface friction rather than total friction (I think!).
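To put numbers on this, here is a minimal Python sketch (my own illustration, not code from any of the papers discussed) chaining Stokes’ law, a commonly quoted form of Perrin’s orientation-averaged friction for a prolate spheroid, and the Einstein relation. The 2 nm radius and the 4 nm semi-axis are arbitrary example values, so treat the whole thing as indicative rather than definitive.

```python
import math

KB = 1.380649e-23  # Boltzmann constant, J/K

def stokes_friction(radius_m, eta_pa_s):
    """Stokes' law for a sphere with stick boundary conditions: f0 = 6*pi*eta*R."""
    return 6 * math.pi * eta_pa_s * radius_m

def prolate_friction(a_m, b_m, eta_pa_s):
    """Orientation-averaged translational friction of a prolate spheroid (semi-axes a > b),
    in the commonly quoted form of Perrin's result:
        f = 6*pi*eta*sqrt(a^2 - b^2) / ln[(a + sqrt(a^2 - b^2)) / b]
    which reduces to Stokes' law as a -> b and to the thin-rod limit for a >> b."""
    c = math.sqrt(a_m**2 - b_m**2)
    return 6 * math.pi * eta_pa_s * c / math.log((a_m + c) / b_m)

def einstein_diffusion(friction, temperature_k=298.15):
    """Einstein relation: D = kB * T / f."""
    return KB * temperature_k / friction

eta_water = 0.89e-3  # Pa*s, water at ~25 C
radius = 2e-9        # an arbitrary 2 nm, roughly protein-sized, sphere

# A prolate spheroid of the same volume (a * b^2 = R^3), to illustrate that the
# sphere has the smallest friction of the equal-volume shapes.
a = 4e-9
b = math.sqrt(radius**3 / a)

f_sphere = stokes_friction(radius, eta_water)
f_prolate = prolate_friction(a, b, eta_water)
print(f"D (2 nm sphere)          = {einstein_diffusion(f_sphere):.2e} m^2/s")
print(f"D (equal-volume prolate) = {einstein_diffusion(f_prolate):.2e} m^2/s")
```

For the 2 nm sphere this gives D of roughly 1.2 × 10⁻¹⁰ m²/s ‒ the right order of magnitude for a small protein in water ‒ and the equal-volume prolate comes out slightly slower, as the surface-area argument above predicts.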
We can say that water is an extreme case of a molecular liquid
Now that we’re all experts in molecular biophysics, the paper’s claim is that the SED expression, used routinely to relate orientational molecular diffusivity quantitatively to viscosity for the asymmetric particles found in nature, is somewhat ill-suited to the job.

It is well-known that Einstein’s equations are derived from hydrodynamic theory for the diffusion of a Brownian particle in a homogeneous fluid and examples of SED breakdown and failure for molecular diffusion are not unusual. Here, using optical Kerr-effect spectroscopy to measure orientational diffusion for solutions of guanidine hydrochloride in water and mixtures of carbon disulfide with hexadecane, we show that these two contrasting systems each show pronounced exception to the SED relation and ask if it is reasonable to expect molecular diffusion to be a simple function of viscosity.
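For orientation, the SED relation at issue is, in its simplest stick-boundary spherical form, D_rot = k_B T / (8πηR³), with a corresponding rank-2 orientational correlation time τ₂ = 1/(6 D_rot). A minimal sketch of my own showing what it predicts when the viscosity roughly doubles, as it does going from neat water to a saturated GuHCl solution (the 0.3 nm radius is an arbitrary small-molecule value):

```python
import math

KB = 1.380649e-23  # Boltzmann constant, J/K

def sed_rotational_diffusion(radius_m, eta_pa_s, temperature_k=298.15):
    """Stokes-Einstein-Debye for a stick-boundary sphere: D_rot = kB*T / (8*pi*eta*R^3)."""
    return KB * temperature_k / (8 * math.pi * eta_pa_s * radius_m**3)

def rank2_correlation_time(d_rot):
    """Rank-2 orientational correlation time (the quantity probed, roughly, by NMR
    relaxation and optical Kerr-effect measurements): tau_2 = 1 / (6 * D_rot)."""
    return 1.0 / (6.0 * d_rot)

for eta in (0.89e-3, 1.8e-3):  # ~neat water vs ~saturated GuHCl (about twice the viscosity)
    d_rot = sed_rotational_diffusion(0.3e-9, eta)  # 0.3 nm: arbitrary small-solute radius
    tau = rank2_correlation_time(d_rot)
    print(f"eta = {eta*1e3:.2f} mPa.s  ->  tau_2 = {tau*1e12:.1f} ps")
```

SED says τ₂ should simply double as η doubles; the paper’s point is that the measured orientational relaxation times do not track viscosity in this way.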

D_rot shows a linear dependence on η only “over a limited range of temperature”; for small solutes and neat molecular liquids “there are observations of SED breakdown and failure”. Importantly, the paper finds this occurs for molecules used in probing protein structure:

for some molecular solutes, it has been suggested that specific local interactions are more important than the solvent bulk properties and a detailed description of the relation of diffusivity to viscosity may require the solvent structure to be taken into account. Here we investigate two very different systems of mixtures: an aqueous solution of guanidine hydrochloride representing a strongly interacting (hydrogen bonding) liquid and a mixture of carbon disulfide and hexadecane having only weak (van der Waals) interactions. By varying the composition, we isothermally change the viscosity in each system and observe, in general, no systematic relationship of viscosity to the molecular orientational diffusion.

The guanidinium ion (GuH+ = [C(NH2)3]+ ) is a fitting protein denaturant since it’s highly soluble and relatively “water neutral” (kosmotropic or perhaps isotropic as I understand it); i.e., GuH+:H2O interactions are similar to H2O:H2O interactions — they don’t bind strongly, and have “a most unusual ability to dissolve in water without altering its dynamics”.
With increasing concentration, the solution’s viscosity increases monotonically (“entirely non-decreasing”) up to about twice that of neat water for the saturated solution.
As well as water ‒ ‘an extreme case of a molecular liquid, characterized by strong directional bonding’ ‒ the authors examined hexadecane as a solvent for the nonpolar molecule CS2 (S=C=S), a system which lacks such bonding.
Moving into the heart of the paper, the authors highlight previous findings that water has an expanded structure in which the hydrogen-bonded molecules reorient through a complex large-angle “jump” process, “which, although an activated process, is not diffusional”.

The rate of relaxation is primarily determined by the rate of hydrogen bond fluctuations. Furthermore, in simple salt solutions, the orientational relaxation of the water molecule, as a function of concentration, is generally uncorrelated with viscosity.
Here, the presence of a high concentration of the large GuH+ ions does not strongly influence the relaxational rate; hence, it appears that H2O forms hydrogen bonds to both GuH+ (and chloride ions) that are of similar strength to H2O−H2O hydrogen bonds. We can say that water is an extreme case of a molecular liquid, characterized by strong directional bonding.

In contrast, the relaxation time scale of the GuH+ ion has an apparently linear dependence on concentration, but this is markedly different to the nonlinear viscosity increase. This linear trend, which breaks at ca. 2.1 M, suggests a simple dependence on concentration rather than viscosity. At the highest concentration of 7.35 M, each GuH+ ion has (6.1 Å)³ of space in which there are only 3.5 H2O molecules and 1 chloride ion. At the lowest concentration of 0.53 M, each GuH+ ion has (15 Å)³ of space in which there are ca. 100 H2O molecules. Therefore, at low concentration, the GuH+ ion is effectively surrounded by bulk water (and its dynamics are determined by collisions with water molecules that are relatively mobile but not by the bulk viscosity). The break in the line at 2.1 M suggests a transition to the regime where GuH+:GuH+ collisions become the dominant factor in the orientational diffusion rate.
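Those “space per ion” figures are just the inverse number density, quoted as the side of an equivalent cube, and the arithmetic is easy to check ‒ a quick sketch of my own, not the paper’s:

```python
# Volume available per GuH+ ion at molar concentration c is 1 / (c * N_A),
# i.e. the inverse number density; report it as the side of an equivalent cube.
N_A = 6.02214076e23  # Avogadro's number, per mol

def cube_side_angstrom(conc_mol_per_litre):
    volume_litres = 1.0 / (conc_mol_per_litre * N_A)  # litres per ion
    volume_cubic_angstrom = volume_litres * 1e27      # 1 L = 1e27 cubic angstrom
    return volume_cubic_angstrom ** (1.0 / 3.0)

for c in (7.35, 0.53):
    print(f"{c} M  ->  cube of side {cube_side_angstrom(c):.1f} angstrom per GuH+ ion")
# 7.35 M gives ~6.1 angstrom and 0.53 M gives ~14.6 angstrom, matching the quoted
# (6.1 angstrom)^3 and roughly (15 angstrom)^3 volumes.
```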

The upshot of this is that while the SED relation may be useful for large probe particles (colloids, fluorescent tracers), molecular diffusion doesn’t perform so well, and corrections have to be made to account for the more complex reality of a dielectric medium, with ‘stick’ or ‘slip’ surface interactions (molecular friction).
Lest this all sound a bit esoteric, the point is that deviations from SED are read as evidence of structural changes, and often fractional forms of the Stokes−Einstein (for translational diffusion) and Stokes−Einstein−Debye (for rotational) relations are interpreted as evidence of a change in effective volume with temperature.
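Those “fractional” forms replace the direct proportionality with a power law, D ∝ (T/η)^ξ with ξ ≤ 1, and the exponent is usually extracted from a log-log fit. A minimal sketch of that procedure, using made-up illustrative numbers rather than data from the paper:

```python
import numpy as np

# Hypothetical (made-up) orientational diffusion data at fixed temperature, varying viscosity.
eta = np.array([0.9e-3, 1.2e-3, 1.6e-3, 2.0e-3])    # viscosity, Pa*s
d_rot = np.array([4.0e10, 3.4e10, 2.9e10, 2.6e10])  # orientational diffusion, rad^2/s

# Fractional SED at constant T: D_rot proportional to (1/eta)^xi, i.e. a straight line
# of slope -xi in log-log coordinates; xi = 1 recovers the plain SED prediction.
slope, intercept = np.polyfit(np.log(eta), np.log(d_rot), 1)
xi = -slope
print(f"fitted fractional exponent xi = {xi:.2f}")  # ~0.5 for these made-up numbers
```

A fitted ξ well below 1 is the kind of result that gets read as a temperature-dependent effective volume; the argument here is that it may instead just reflect local interactions decoupling molecular motion from the bulk viscosity.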

However, since SED is based on hydrodynamics, and applies strictly to a particle immersed in a homogeneous fluid, there is little reason (as Einstein made clear in his 1906 paper) to expect it to apply on a molecular scale. Here, for these two contrasting systems, it is clear that SED does not generally apply. This suggests that diffusion of molecular-size particles is dominated by local interactions that decouple the diffusivity from the bulk viscosity. This would be consistent with the observation of the anomalous speeding up of CS2 relaxation in the hexadecane mixture, reflecting that the CS2−hexadecane interactions are weaker than the CS2−CS2 interactions.
A molecule is apparently aware of only short-range interactions, primarily then to its first solvation shell, and application of the Stokes−Einstein and Stokes−Einstein−Debye relations in studies of molecular self-diffusion must be made cautiously. It has indeed been suggested before that a critical particle volume exists below which the SE relation (for translational diffusion) fails, and molecular dynamics (MD) simulations for a Lennard-Jones liquid suggest a critical volume, in the nanometer range, below which local intermolecular forces dominate the translational mobility.

They conclude that taken together, this evidence suggests that molecular orientational diffusion is controlled by local (first solvation shell) interactions rather than by the bulk properties of the liquid.
If Einstein’s relations don’t quite capture it, then what is the true relationship of diffusion to viscosity?

However, because of this complexity and the distinct relaxation mechanisms that contribute, only detailed MD calculations are likely to be able to predict such a relationship, and there is no simple theory that is able to predict the value of viscosity from molecular properties.
The observations suggest that in the case that a molecule can support numerous weak interactions (hexadecane) the single molecule motion could correctly be termed diffusive, and SED is then obeyed, whereas if the interactions are dominated by a small number of strong interactions, then orientational relaxation is not diffusive, and SED is not obeyed.
We suggest that the majority of liquids composed of small molecules fall into the second category. Since SED is a widely used method of identifying anomalous behavior in molecular liquids, it is essential that such distinction can be made and this calls for a systematic approach to predict, perhaps through MD simulation, the nature of single molecule relaxation.
Often, temperature dependent measurements do show similar trends in viscosity and molecular diffusion (and in these cases the SED relation will remain an important metric), but as both processes are activated, this is unsurprising and is not evidence of a causal relationship. Hence, while there are systems for which the application of SED is appropriate (e.g., nanometer-scale probe molecules used for studies in homogeneous solvents), the apparent observance of SED in other liquids should be treated with caution.

❃ David A. Turton (2014) Stokes−Einstein−Debye Failure in Molecular Orientational Diffusion: Exception or Rule? J Phys Chem B, in press, doi: 10.1021/jp5012457
See also: 
• Bian and Ji (2014) Distribution, Transition and Thermodynamic Stability of Protein Conformations in the Denaturant-Induced Unfolding of Proteins. PLOS ONE, 9(3): e91129
• Kumar, Szamel, Douglas (2005) Nature of the breakdown in the Stokes-Einstein Relationship in a Hard Sphere Fluid. arXiv, cond-mat/0508172

Taurine (and also cysteate) are described as “amino acids” in the literature, even though their acid group is sulphonic, not carboxylic.

Both are derived from Cys and Met (the latter essential, and the two interconverted in the cell). It’s odd to notice, as I guess it’s something I would just take for granted. Similarly, there are phosphonic amino acids. A large part of this oversight is that the amino acids making up proteins are all carboxylic, but in the living cell these building blocks interconvert to modified forms.

Sitting down to write this made me realise I still don’t really have a good source for molecular animations either, which I’ve been meaning to look into. Thanks to Google Shopping’s 360° viewer (you couldn’t make it up) there are now neat looking molecular ball-and-stick 3D representations appearing in search results for many molecules, since around the end of last year.

Using these is cutting corners, and it's preferable to use QuteMol, UCSF Chimera (which I used to make the 2nd animation, above) or PyMOL. This thread gives a little intro to what's out there. I still prefer Google's somehow!

To grab their ready-made version, you have to Inspect Element in the browser using Ctrl + Shift + I (right click is disabled), which uncovers the video in the background carrying out the illusion of dragging the molecule (really it's just changing the time of a video). For HQ select the .webm video, not the .mp4, though the latter is better supported by Photoshop etc.


Back to the point anyway, a little rooting through Wikipedia led me to the loose class of “sulphur amino acids” which was similar to what I had in mind (i.e. the sulphonic rather than carboxylic amino acids). A look through this list gave me no new additions though.


I ended up using PubChem search to look for molecules with sulphonic acid substructures, which produced a list of over 16,000 compounds. Since it’s just out of curiosity, I picked the first 1,000 sulphonic acids by molecular weight, and filtered them down offline to only those that were (mono)amino acids, then cross-referenced the items from PubChem with the KEGG biological products database.

It felt like a tiny version of a pipeline (or ‘workflow’ in plainer language), and could probably be set up in code if I had a reason to do so.
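For what it’s worth, here’s a rough sketch of how that filtering step could look in code. It’s an illustration of the idea rather than what I actually ran: the SMARTS patterns, the cids.txt input file and the exact PubChem PUG REST call are my own assumptions, and the KEGG cross-referencing is left as a note.

```python
import requests
from rdkit import Chem

# SMARTS patterns (my own rough definitions, not PubChem's or KEGG's):
SULFONIC_ACID = Chem.MolFromSmarts("S(=O)(=O)[OX2H1,OX1-]")  # -SO3H or -SO3(-)
PRIMARY_AMINE = Chem.MolFromSmarts("[NX3;H2]")                # -NH2

def fetch_smiles(cid):
    """Fetch a canonical SMILES string for a PubChem CID via the PUG REST service."""
    url = (f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}"
           "/property/CanonicalSMILES/JSON")
    record = requests.get(url, timeout=30).json()
    return record["PropertyTable"]["Properties"][0]["CanonicalSMILES"]

def is_sulfonic_amino_acid(smiles):
    """Keep compounds carrying both a sulfonic acid group and a primary amine."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return mol.HasSubstructMatch(SULFONIC_ACID) and mol.HasSubstructMatch(PRIMARY_AMINE)

if __name__ == "__main__":
    # Hypothetical input: CIDs from the PubChem substructure search, one per line.
    with open("cids.txt") as handle:
        cids = [line.strip() for line in handle if line.strip()]
    hits = [cid for cid in cids if is_sulfonic_amino_acid(fetch_smiles(cid))]
    print("\n".join(hits))
    # Cross-referencing these hits against KEGG COMPOUND was the next, manual, step.
```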

I came back a little while later to repeat this for the full list, but the server was giving a Bad gateway error, so to do that would probably require downloading it all and doing a search using my own code.

You can download the PubChem structure database in its entirety here, with the option to get 3D structures too.

The results (in ascending order of molecular weight) were:

PubChem CID   KEGG ID   Compound name
1123          C00245    Taurine
1646          C03349    Homotaurine
23674183      C13229    Dipyron hydrate
68759         C05844    Glutaurine
6926          C06333    Orthanilic acid
72886         C00506    L-Cysteic acid
81831         C05353    TES
8474          C06334    Metanilic acid
8479          C06335    Sulfanilic acid

As I suspected when glancing through the list, taurine was not the only molecule with biological annotation ‒ by which I mean, endogenous ‘biomolecules’.

This is an interesting ambiguity in the KEGG annotation, which I wasn’t aware of and perhaps others aren’t.

Most of the Wikipedia articles for these compounds are stubs, to which I’ll have to add some detail. The organic chemists have declared sulphanilic acid an ‘off-white crystalline solid’ with its existence in biology ‒ very much not as a crystalline solid ‒ literally left out of the encyclopaedia (though not for long!).

image

As well as its primary role in bile acid biosynthesis, taurine appears in the ‘ABC transporters’ and ‘neuroactive ligand-receptor interaction’ pathways.

The diagram you get when you click through is a pretty dense image map (something KEGG uses a lot of) in which you can see taurine’s interaction with a GLR (glutamate receptor), GLRA1.

Homotaurine is, according to Wikipedia, a “synthetic organic compound” ‒ its presence in KEGG doesn’t indicate there’s a metabolic pathway making it, as I previously thought. In fact, drug molecules are also given KEGG descriptors. Homotaurine therefore has a place in KEGG since it was trialled as an amyloid β treatment for Alzheimer’s under the name of Tramiprosate but dropped by 2007 after little success. The distinguishing factor on its page is the lack of any metabolic pathways it’s involved in, i.e. it’s an exogenous compound.

Glutaurine

Glutaurine, as its name suggests, is a hybrid of glutamate and taurine. It’s only possible to see a hint of a biological role from its pathway diagram (an offshoot of taurine metabolism), but it’s been noted in studies as an antiepileptic, with anti-amnesic properties.

A really nice paper from 2005 in the Amino Acids journal gives an overview of its relevance in various organs and organisms.

the discovery of the dipeptide γ-glutamyltaurine (γ-GT; glutaurine, Litoralon) in the parathyroid in 1980 and later in the brain of mammals gave rise to studies on intrinsic and synthetic taurine peptides of this type. It was suggested that γ-glutamyltransferase (GGT; γ-glutamyl-transpeptidase) in the brain is responsible for the in vivo formation of this unusual dipeptide.

The versatile molecule mimics the anxiolytic drug diazepam, and is implicated in everything from feline aggression to amphibian metamorphosis, radiation protection and the glutamatergic system in schizophrenic disorders. The paper also covers taurine:

Since taurine plays a number of important roles in mammalian tissues, it has been thoroughly investigated over the past 50 years, but even so its precise biochemical function is not fully understood. It was generally accepted that taurine is not utilized in protein synthesis in the same way as other common α-amino acids. Consequently, studies on its peptidic derivatives were limited, and up to the end of the 1970s there was no information available on the existence of naturally occurring taurine peptides. The discovery of γ-GT in the parathyroid (Furka et al, 1980) and later in the brain of mammals (Marnela 1985, Nakamura 1990) prompted subsequent studies on both intrinsic and synthetic taurine peptides.

In recent years, studies on oligopeptides containing taurine residues have received considerable attention in view of their tendency to adopt preferential secondary structures as well as their stability towards enzyme degradation. It seemed possible that sulfonic acid analogs of amino acids built into peptides might provide a means of inhibiting the parent peptide.

It has been suggested that γ-GT can function as an intracellular storage form of taurine. Indeed, it has been shown that taurine can form small peptides with a number of other amino acids, e.g. N-acetylglutamyltaurine… 

It is disheartening to see that the scientific work on this molecule has, by and large, gone into hibernation over the past 10 years. Such a situation is uncalled for regarding a molecule with this much potential.

Orthanilic acid

Orthanilic acid, a.k.a. 2-aminobenzene sulphonate, has roles in benzoate degradation and ‘microbial metabolism in diverse environments’.

Last year it was noted to promote reverse turn formation in peptides:

Orthanilic acid (2-aminobenzenesulfonic acid, SAnt), an aromatic β-amino acid, has been shown to be highly useful in inducing a folded conformation in peptides. When incorporated into peptide sequences (Xaa-SAnt-Yaa), this rigid aromatic β-amino acid strongly imparts a reverse-turn conformation to the peptide backbone, featuring robust 11-membered-ring hydrogen-bonding.


In a paper that gives a name to what I was looking for (‘sulfonated amines’), Tan et al. describe orthanilic acid in its use in azo dyes as a problem, since the sulphonic acid group makes it highly water soluble and thus a pollutant, a situation worsened by poor bacterial degradation.

There also seems to be research on the effects of taurine, L-cysteic and orthanilic acids on cardiac tension, but it’s being withheld from my university by chemical giant Sigma-Aldrich (why they own a journal on Progress in Clinical and Biological Research is beyond me), so who knows…

L-cysteate

Taurine is what’s known as a ‘conditionally essential’ amino acid, as it can be manufactured in the body by conversion from Met and Cys.

That is, cysteine can be made from [non-protein α-amino acid] homocysteine, which in turn is made by cleaving methionine’s terminal Cε methyl group.

First found in wool in 1946,

When dry wool is very vigorously agitated in benzene, a small proportion of the cuticle cells (scales) becomes detached. These were found to contain a relatively large amount of cysteic acid and it is reasonable to suppose that these scales come almost exclusively from the tips of the wool.

It is of interest that a small amount of cysteic acid, but no cystine, was found to be present in a piece of bath sponge, which is usually prepared by allowing the living sponge to rot in the sun.

…the compound has also been found in Staphylococcus aureus and Bacillus subtilis, shown to be a precursor for synthesising the bacterial sulpholipid capnine and algal sulpholipids, and it occurs extracellularly as a component of spiders’ webs.

Sulphanilic acid


Resembling its orthanilic sibling, sulphanilic acid (or 4-aminobenzene sulphonate) was the hardest to find much information on; it is clearly heavily used in organic chemistry, which drowns out the studies on the endogenous molecule.

KEGG says it’s involved in aminobenzoate degradation and ‘microbial metabolism in diverse environments’.

Reading these network diagrams is like Where’s Wally sometimes.

So, it doesn’t seem that there’s all that much particularly special going on in sulphonated amine chemistry. Not all of these compounds are α-isomers of the proteinogenic amino acids, and for the likes of glutaurine the α-form is not present at all (as verified in the bovine brain).

There was an intriguing paper out yesterday regarding how taurine attenuates amyloid β 1–42-induced mitochondrial dysfunction by activating SIRT1 in neuroblastomas. It’s interesting in light of the abandoned homotaurine trials, and suggests maybe they were onto something but just weren’t sufficiently aware of and/or properly exploiting the underlying mechanism.

I’m quite interested in protein folding, aggregation, and the chaperones that mediate both processes. Likewise, mitochondrial bioenergetics was one of the first things that really gripped my attention in biochem, and SIRT1 was another early interest from a school project in which I looked into the biochemistry of the ageing process (available to read here).

The SIRT1 gene encodes NAD-dependent deacetylase sirtuin-1, an enzyme that regulates cellular reaction to stress and mediates longevity in what has been proposed as the mechanism behind the (tenuous) benefits of caloric restriction.

The report covers recent evidence from other labs of taurine’s inhibition of oxidative stress through restoring SIRT1 expression, particularly by regulating mitochondrial protein synthesis and enhancing electron transport chain activity (Jong 2012; Kumari 2013).

It has also been suggested that taurine attenuates Aβ1–42-induced neurons impairment in vitro as well as cognitive deficits in the transgenic mice model of AD via its antioxidant and neuroprotective properties. However, the potential protective effects of taurine against Aβ1–42-induced mitochondrial dysfunctions and neuronal death still need to be well clarified.

In the present study, we investigated neuroprotective effects of taurine against Aβ1–42-induced mitochondrial dysfunction and neuronal death in SK-N-SH cells. Furthermore, we explored the underlying mechanisms of taurine on mitochondrial function and neuronal loss.

Aβ plays a central role in the pathogenesis of AD. Aβ1–42 induces neurodegeneration in the cortex and hippocampus through disturbing calcium homeostasis, oxidative stress, ROS accumulation, and mitochondrial dysfunction. Taurine was reported to display potent antioxidant and neuroprotective properties. It is reported that taurine significantly attenuated neuronal death in ischemia-injured brain. Our results showed that taurine could exert a protective effect against the neuronal loss induced by Aβ1–42.

A growing number of studies show intracellular ROS accumulation and elevation of [Ca2+]i in neurons of the cerebral cortex and hippocampus in AD. Here, the results showed that Aβ1–42 significantly increased intracellular ROS generation as well as the level of [Ca2+]i in SK-N-SH cells, which were reversed by administration of taurine in the presence of Aβ1–42. It has been reported that taurine plays a role in modulating [Ca2+]i in neurons and cardiomyocytes.

The paper mentioned above that I don’t have access to, on the effects of sulfonated amines on cardiac tension, is likely related to this last point re: cardiomyocytes (heart muscle).


Other studies support that taurine exerts an anti-oxidative effect. Therefore, we may conclude that taurine protects neuronal cells against Aβ through inhibiting ROS generation and buffering [Ca2+]i as well. Studies show that excessive amounts of [Ca2+]i, as well as elevated intracellular ROS, are main factors to trigger the opening of the mitochondrial permeability transition pore (mPTP). Therefore, we hypothesized that the neuroprotection of taurine is related to the regulation of the mPTP. As expected, we observed that taurine blocked mPTP opening in SK-N-SH cells in an Aβ1–42 rich environment. The result is similar to the experimental results from the hypoxia model reported by Chen et al. The mPTP consists of the voltage-dependent anion channel (VDAC), adenine nucleotide translocator (ANT) and cyclophilin D (CypD).

The opening of the mPTP allows molecules <1.5 kDa across the mitochondrial membrane, which causes uncoupling of the electron respiratory chain, mitochondrial depolarization and rupture of the mitochondrial outer membrane, which finally leads to cell death. In our experiments, taurine recovered mitochondrial membrane potential and ATP level in SK-N-SH cells in the presence of Aβ1–42, which strengthens the notion that the neuroprotection of taurine is due to its prevention of mPTP opening. These results are consistent with the reports that taurine protects cerebellar granule cells from glutamate toxicity by enhancing mitochondrial activity. Overall, these data indicate that taurine inhibits mPTP opening by modulating [Ca2+]i and ROS generation.

✶ Sun et al. (2014) Taurine attenuates amyloid β 1–42-induced mitochondrial dysfunction by activating of SIRT1 in SK-N-SH cells. Biochem Biophys Res Commun, in press.


Big Data, big business and bioscience

There’s a funny sense of déjà vu when a passing thought reappears in a respected publication only a few hours after having popped into your head.

This happened today after seeing this morning’s news from the Science and Technology Facilities Council (STFC) ‒ Big Data Is Big Business. The update came across as quite heavy on buzzwords: big data, open data, and asking “Is there an app for that?” despite the piece having nothing to do with apps…

Big data is big business, with the British government estimating that it will have created 58,000 new jobs and added £216 billion to the UK economy by 2017. The UK has vast data sets that are open for public use, generated through world-class research activity and data-intensive public sector organisations. Research has shown that allowing unfettered access is likely to stimulate novel uses of the data, resulting in the emergence of many new companies selling new services.

Allowing unfettered access to public data has implications beyond what “research” can show, and there’s rightly contention over whether governmental bodies should really be handing what may be confidential data to avid entrepreneurs.


The news piece focussed on the Sentinel 1a satellite, launched a fortnight ago, and its uses for flood monitoring.

This isn’t really what I understand by the term Big Data, which is used by smaller companies such as Zillabyte to mean watching trends, predicting markets, and generally being very aware of the populace, and by monoliths like IBM and Palantir to indicate doing so on a grandiose scale, but with more emphasis on really strong software engineering.

This sort of tacit meaning left me with a vague unease. It takes a very minor change of perspective to see “predictive analytics” as Orwellian, and I had a moment of wondering whether groups pushing such initiatives on this public sector data might do so by intermingling them with (or disguising them behind) tech trends for “openness”.

I should probably clarify at this point that this unease is separate from any related to science working with industry, and more to do with the proven track record for some of these tech companies to act reprehensibly behind closed doors when left to decide what they should do with analytics, and just a gut feeling that this echoes previous scandals involving unregulated industries.

That the current UK government is in favour of these plans isn’t a great surprise after the recently exposed sale of NHS patient data, which we can only hope isn’t part of a greater push for privatisation. The event STFC are hosting (the reason behind said press release) will hold talks from heads of all 5 UK Research Councils, in the Daresbury lab at Harwell, Oxford.

The piece that made me think back to their news item came in Nature later today, cautioning readers to Beware of backroom deals in the name of ‘science’, from Colin Macilwain, editor of Research Europe. It’s hard to think of another article I’ve read in Nature in recent times which has been so outright political.

He describes the work of a neo-conservative lobby in US Congress, using the word ‘sound’ to give an unearned respectability to a policy (nothing new in Congress) known as the ‘sound science’ farm bill. AAAS helped block its passage to the President, stating that:

The Section would also require that agencies favor data that are “experimental, empirical, quantifiable, and reproducible,” although not all scientific research could meet each of these criteria. For example, some experiments are theoretical or statistical rather than experimental, and others are so large-scale that they may not be reproducible. The new regulation could also prevent policymakers from using science based on new technologies

this provision could “further hamstring agencies already under significant budgetary pressure.”

In short, the Section, if passed, may slow or even paralyze agencies’ rule-making abilities by complicating an already thorough review process, making it exceedingly difficult to implement new regulations pertaining to agricultural, environmental, or public health practices, among other things, the statement said.

Leshner echoed concerns of Sen. Edward Markey (D-MA), who released a statement of his own in December. The Consortium for Ocean Leadership and UCAR also released a joint letter opposing Section 12307 and calling for removal of the provision from the Farm Bill. 

Experimental, empirical, quantifiable and reproducible are lofty goals, but as Macilwain points out, “the approach would discount, for example, the use of weather modelling, or of data collected from one-off events, such as natural disasters.” It’s a good example of scientific language being used to shut down critical thinking in the scientifically-minded, but the appropriate groups took note and had it excised from the bill.

Dealing with such provisions is a bit like whack-a-mole. There is another mole already in sight on Capitol Hill: the Secret Science Reform Act, now under consideration by the House science committee, to stop the Environmental Protection Agency from using data that are not publicly available in its assessments.

And who could argue with that? Well, one issue with making all such data public is that it gives industry grounds for refusing to hand confidential data over, as it would then become public.

In the end, regulatory arguments are more philosophical than scientific in their nature. Environmentalists advocate caution in the face of uncertainty; industry wants cost-benefit analysis.

The natural sciences have little to say on which approach is wiser. Industry, however, has become adroit at using the concept of sound science to advocate the latter path. Too many researchers, as well as the wider public, are taken in by the claim that when someone says they are seeking the scientific answer to a regulatory question, they mean what they say. They very rarely do.

While this piece made no mention of ‘Big Data’, ‘Big Business’ cropped up several times, as did concerns around data sharing in the public and private sectors. I’m not here to espouse any particular political view, and personally I’m more interested to know what it is that each of the Research Councils is terming their ‘Big Data’, most of all the biotech and biosciences council (BBSRC).

The BBSRC Strategic Plan (2010-2015) covers their early intentions to exploit ‘big data’, but none of this has ever been called Big Data when I’ve heard it discussed.

The slide (just one, but with a lot crammed onto one page) highlighted a need for:

  • computationally proficient biologists
  • software engineers that understand the heterogeneity of biological data
  • biological engineers that can deploy computational models to design and manipulate biological systems

In 2012, work began on a new BBSRC bioinformatics ‘Technical Hub’ in Cambridgeshire, housing a training centre, EMBL-EBI office space and ‘an industry-led clinical translation suite for bioinformatics’. This all seems quite vague, and again it’d be interesting to hear what ‘Big Data’ BBSRC are going to discuss as ‘now available’ in the presentation to UK businesses.

Again, a presentation related to the 2010-15 plan makes no clear reference to industry other than citing them as partners. What’s more, the statement that “bioscience is big data science” leaves me a little bemused as to what the term is being used to mean.

Bioscience ‘big data’ is quite different from the behavioural/socioeconomic ‘Big Data’ that raises concerns of social engineering.

Recently here in the UK, there has been a row over patient privacy in a system known as care.data, which was to allow mining of GP records (amounting to entire medical histories). The NHS was described a few years ago as a huge, largely untapped source of health informatics, and the potential benefits to research are huge if handled well. Trust in this system is vital, as it’s possible to opt out, so poor handling could doom it.

Fears over privacy were downplayed as the records would be “anonymised” — though Ben Goldacre quickly highlighted how it could in fact be used to identify individuals.

Seemingly contrary to this statement, health secretary Jeremy Hunt outlined plans to link care.data to genomic sequencing, in what seems to be preparation for personalised medicine in the NHS (‘Genomics England’). As far as I can see, this is currently the most obvious instantiation of ‘Big Data’ in UK science, and it’s largely only through the biosciences’ links to the medical profession (i.e. patients and those in clinical trials) that they attract any of the accompanying privacy or social concerns, as opposed to lab research data.

Taking a look at the jobs listed on NatureJobs, to get an idea of exactly what sort of work comes under the title in the workplace as it stands, the reality gives little reason for alarm. No sign of bioscience-NSA partnerships, just use of a buzzword: bioinformatics by another name, with an element of bragging rights (similarly: why are we still calling it Next-Generation Sequencing, other than to make it sound cool?)

It’s also interesting to see the diversity in desired backgrounds: more from mathematics/statistics/computer science than from the biosciences.

PhD student Computational Discovery of Genetic Variation in Genomic ‘Big Data’ : Amsterdam, Netherlands

development of computational and statistical algorithmic frameworks for “finding needles in genomic big data haystacks”. This poses intriguing and challenging computational and/or statistical questions. The project will be linked to the “Genome of the Netherlands” (GoNL) project, which is concerned with the genomes of 769 Dutch individuals, grouped into families. The purpose of this project and the arrangement of its data aims at spotting and characterizing genetic variation in the light of evolution most favorably. It is based on 60 terabytes of genome data and provides a most relevant link to current genomics research, as being largest family-oriented such sequencing project worldwide. Beyond GoNL, applications to diseases such as cancer and virus genomes are also of great interest.

Biomedical Informatics Data Scientist : Austin, TX

develop software and methods to explore, analyze, and visualize clinical and biological data sets including genomic, neuroimaging, and electronic health record data

Data Analytics Engineer : Warrington, UK

STFC’s Hartree Centre, based within the Scientific Computing Department, specializes in exploring advanced computing techniques, including Big Data and HPC. The Hartree Centre was funded through the UK Government’s e-Leadership council to work closely with industrial applications, and some of its work is commercial in confidence and subject to IPR rules regarding public disclosure.

We are looking for someone with an aptitude for solving ‘Big Data’ problems. A background with exposure to data analytics systems and infrastructure such as Hadoop, HDFS, IBM Infosphere, IBM Streams, and MarkLogic is required, with an underlying scientific/engineering discipline. 

Postdoctoral Research Scientist in Single Cell Genomics : Oxford, UK

The research will focus on the development of novel statistical methodology for the analysis of large-scale single cell genomics data and offers an opportunity to be at the analytical forefront of institutional expansion in this area.

Ideally, you will have experience of statistical methods development gained through a recently obtained (or soon to be) PhD (or equivalent) in a quantitative subject (e.g. mathematics, statistics, physics, engineering or computer science). Experience of Bayesian Statistics and machine learning techniques is highly desirable as is evidence of prior experience of developing bioinformatics software and/or analysing genomic data sets.

Director, Micro & Nanotechnology Laboratory & Professor : Urbana, IL

Over the next few years, more than 35 new endowed professorships and chairs will be established, which will provide incredible opportunities for world-renowned researchers. The two main research areas are Big Data and Bioengineering.

I won’t be able to make it to Hartree, but Genomics England is holding ‘town hall engagement meetings’ across the UK starting later this month: sessions for the public and patients, with afterparties (i.e. technical talks) for clinicians and those interested in healthcare data and the realities of sequencing 100,000 Brits’ genomes over the next five years, which I’m looking forward to.

There was a nice Leading Edge Analysis report in Cell this week touching on this exact issue, interviewing Gene Myers, mathematician and computer scientist turned cell biologist (indeed a director at the Max Planck Institute of Molecular Cell Biology and Genetics).

“There was a time when [biological] data was really hard-fought and you wanted to preserve it,” Myers says, “but now that’s not true anymore.”

The piece underscores that data sharing is not just trendy, but rather it’s necessary for any hope of statistically significant findings for rare diseases and the intricacies of biology.

Santa Cruz bioinformatician David Haussler is involved in the Cancer Genome Atlas and many other big data projects such as the Genome 10K Project,

which aims to “assemble a genomic zoo” by collecting sequence data representing the genomes of 10,000 vertebrate species, which corresponds to approximately one genome per vertebrate genus. As of December 2013, the group has data for 94 species complete or in progress.

“I’m extremely passionate about the fact that we have an opportunity to understand life on this planet, how it evolved, how living systems are built by molecular evolution,” says Haussler. “This is a watershed for science.”

For research like this to flourish, though, certain cultural changes will be needed, including broader data sharing, improved computational training for biologists, and providing more support for data scientists in the traditional academic structure.

and the Global Alliance for Genomics and Health initiative,

which currently counts among its members more than 100 healthcare, research, and disease advocacy organizations from across the world. The initiative aims to support an infrastructure for sharing patient genetic data, keeping in mind concerns about patient privacy and security, to help push medical research forward.

“Individual researchers need to be committed to the idea of participating in collaborations to collect enough data together to do a truly deep analysis, to have the numbers to be able to investigate their individual cases in the context of other cases,” he says.

But convincing researchers to share their data has not been easy. “Scientists always want to have their cake and eat it too,” Haussler says. “They would like to have total control over all the genomes from their patients, yet they realize virtually everything we’re talking about becomes rare when you get down to the precise molecular characterization.”

The reported fear of researchers being “scooped” as a result of sharing data (after publication, i.e. reinterpretation of data from the literature) has been criticised as careerism of one sort or another, and not something the scientific community should make allowances for, given the benefits to science as a whole.

Scooping here takes on a different meaning to that where a researcher beats you to a publication off their own bat ‒ the issue is more about keeping hold of what you generated, since you paid for it.

Systems biologist Uri Alon even wrote a song about it… an inside-joke if ever there was one.

The alternative view is that if your research was paid for with public money, it’s not really yours to hoard in this way.

Others have suggested that the confidence to avoid being scooped is a luxury not available to early-career researchers, and that richer nations may have less to lose from opening up what they fund (as is seen with trends in the ‘openness’ of governmental data).

More discussion can be found here; it seems that genomics is one of the fields less prone to this fear, and perhaps ecology more so ‒ for social and cultural reasons that need to be treated more carefully than with simple mandates to deposit data.

A third interview was with Philip Bourne, leader of the National Institutes of Health’s Big Data to Knowledge (BD2K) initiative. In March he became the NIH’s first Associate Director for Data Science, described as the “so-called data czar for the NIH.”

“The notion of being a data scientist is crucially important, yet these people are typically not well looked after” in a university setting, Bourne says. “They don’t last in the system.” He cites as an example his own University of California, San Diego research group, which earned the nickname of “the Google bus” because so many of its alumni ended up working at the nearby Google office. “Every morning half the people on that bus were people from my lab. They weren’t even looking for jobs outside academia, but they were just attracted away. We need to raise awareness of the importance of these people in the system.”

It goes the other way as well, with current researchers requiring training in data science if they hope to be successful in the new research environment. “We’ve got to figure out how to train the next generation and our current generation,” says Eric Green [director of the National Human Genome Research Institute]. “Mid-career scientists are going to be practicing their trade for another 20 to 30 years, yet they’re woefully untrained when it comes to data science. The train is just going to pass them by if we’re not careful. Then we need to think about the next generation: for the run-of-the-mill biologist, what is the minimum competency they need to function in the new world of data science? We want to raise everyone’s floor.”

“On the one hand, it’s everybody’s problem, but at the same time it becomes nobody’s problem because it slips through the cracks.”

On the technical side, Bourne and others are also considering the best way to build systems that can support the huge data sets currently being generated. Bourne says it is likely that public-private partnerships between research institutions and corporations like Amazon, Microsoft, and Google will solve these problems. Scientists have been interested in such partnerships for a long time, but only recently have the data reached the scale where computer scientists are really interested in getting involved. “During the Human Genome Project, we would invite in computer scientists and entice them to get involved,” Green recalls, “but they really weren’t very interested.” Now that researchers are generating data three or four orders of magnitude larger than the human genome, though, “all of a sudden we have the technology to generate the scale of data to get them going.”

Infrastructure development is certainly important, but it’s also useful to step back and remember the real reason big data is scientifically important in the first place. “Applications drive everything,” says Schatz. “For the rest of our days storage and transfer will be a problem, but I’m really excited for the days where we’ll have those systems in place and can ask some really exciting questions.”

This final point was echoed in another post today, from Razib Khan ‒ in his closing notes on a cattle genetics paper, he got tangibly excited at the shifting ground Bernstein noted above.

Over the next decade it seems inevitable that the clusters at the heart of “genomics cores” across the world will be gorging on whole sequences of thousands of individuals for many organisms. It will be a “flood the zone” era for attempting to understand the tree of life. An army of bioinformaticists will be thrown at the data in human waves, absorbing shock after shock, slowly transforming the ad hoc kludge pipelines of the pre-Model T era of genomics into simpler turnkey solutions. And then the biology will come back to the fore, and the deep wellspring of knowledge of those who focus on specific organisms is going to be the essence of the enterprise once more.

❏ Bernstein R (2014) Shifting Ground for Big Data Researchers. Cell, 157, 283-284

Journal abbreviation generator

Here’s a little generator I made yesterday from the PubMed list of MEDLINE-indexed journals. Of minor note, but it might be a handy bookmark…
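Just to sketch the idea (this isn’t the generator’s actual code, and the file layout, field names and URL are from memory, so treat them as assumptions): the NLM journal list is a plain-text file of records containing “JournalTitle:” and “MedAbbr:” lines, which can be parsed into a simple lookup table.

```javascript
// Hypothetical sketch: build a title -> MEDLINE abbreviation lookup from
// PubMed's journal list (assumed to be J_Medline.txt, with records separated
// by dashed lines). Field names and file name are assumptions.
function buildAbbrevTable(text) {
  var table = {};
  text.split(/^-+$/m).forEach(function (record) {
    var title = /JournalTitle:\s*(.+)/.exec(record);
    var abbr = /MedAbbr:\s*(.+)/.exec(record);
    if (title && abbr) table[title[1].trim().toLowerCase()] = abbr[1].trim();
  });
  return table;
}

// Usage (e.g. in Node, after downloading the list):
// var fs = require("fs");
// var abbrevs = buildAbbrevTable(fs.readFileSync("J_Medline.txt", "utf8"));
// abbrevs["nature reviews. molecular cell biology"]; // should give something like "Nat Rev Mol Cell Biol"
```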

After installing a Linux Mint operating system (OS) as a “dual boot” setup alongside Windows 8 (not as bad as I’d heard!), I was hoping I’d be able to use some of the software I have student licenses for without having to restart each time.

The open-source LibreOffice just isn’t as smooth as the Office 2013 packages, Office 365 was made for the lightest of users, and more than anything I’m comfortable as I was in the Office setup, with university-supported plugins for bibliographies etc.

Before coming to Linux I’d heard of something called Wine, a compatibility layer that runs Windows programs without Windows. The versions of Office it supports, however, only go up to 2010, so that was useless to me.

I stumbled upon what seems to be a setup made for developers, to promote testing in Internet Explorer, on modern.ie under the heading Virtualisation Tools: essentially you can get a free copy of Windows, in a version of your choice, as a virtual machine to run inside another OS (Windows, Mac or Linux)…

And the result is great! Better yet, the 90 day trial period is just a technicality, and Microsoft themselves advise you to take a ‘snapshot’ of your machine at day 0 to reset it to after this period runs out. I’ve managed to get the full Office suite and Creative Cloud up and running without a fuss, compared to some ugly errors from Wine.

My only tip to any other Linux users would be to dock the taskbar on the left and set it to auto-hide; then you can run the two OSs seamlessly, as shown up top.

NB: the license terms state that this should be used “for testing” and explicitly not “in a live operating environment”. Others have interpreted this to mean not a production/business environment, nor a company- or school-bought computer (cf. a personal/home PC). Using it to ‘evaluate’ features, with fully paid licenses for the same Microsoft software on the same computer (and certainly not making any money in doing so), before switching back to an actual Windows 8 copy, seems perfectly within the terms to me; and given the state of Wine it’s worth knowing you’ve got the option even if it’s never used.

For anything slightly more professional, you need Software Assurance (full info) via the relevant reseller/IT support team.

P.S. My gripe about rubbish search indexing in Windows in my last post was solved in seconds on Linux with a little package called pdfgrep, which not only finds search strings within PDFs but brings back context around them in the terminal! :~)

Strategic reading and scientific discourse

I’m aware I’ve lapsed in posting here, and I think a good part of this is some level of subconscious awareness that my ability to read the scientific literature critically is not fully developed; more than anything though I’m just a busy bee. The ubiquitous style of science writing is to regurgitate as a news piece what is really an object up for interpretation, and originally this was something I was set on avoiding here.

Following 1,000+ life scientists on Twitter since I joined back in December has been informative on this deficiency; or perhaps it’s simply easier for me to read other people’s criticisms than engage naïvely on my own.

It’s been almost a year (well, ⅔) since I began what I think can be described as “strategic reading” with RSS feeds and I’m taking stock of how this has worked out.

Initially an outsourced short-term memory, I use it less in this way now, and am left wondering why I sometimes choose to (in an interrogative sense, as opposed to “whether I should”), given the overwhelming flow of news and views.

I’d say it helps me really engage with current events, and gives an overview of the state of the art in the life sciences, though that’s a very different thing from the understanding that core studies and revision supply. Rather, as described in Strategic Reading and Scientific Discourse by two information scientists at the University of Illinois at Urbana-Champaign (Center for Informatics Research in Science and Scholarship), the goal of lit. surfing is not to find an article to read, but rather to find, assess, and exploit a range of information by scanning portions of many articles, akin to channel hopping.

Picking up small nuggets (even if only gleaned from titles or abstracts) and being aware of the importance of the cornucopia of genes and proteins (and other…), as well as the changing face of this status quo, is something I would never get from textbooks.

The pair wrote this under an ‘e-Science’ initiative which took them to the Semantic Web Applications in Scientific Discourse meeting near Washington DC 5 years back.

The Center’s work seeks:

…to improve information transfer and integration, technology development and sustainability, and collaboration in the practice of science through basic and applied research and training of information specialists to work cooperatively with research scientists. Scientific data problems do not stand in isolation. They are part of a larger set of challenges associated with escalated production of scientific information and changes in scholarly communication in the digital environment. Across all scientific disciplines, researchers are producing and consuming increasing amounts and varieties of information and data, while striving to work with these resources in new ways. This has led to daunting problems and opportunities for information management and integration. There are numerous challenges associated with the amount and rate of data being generated; however, the complexity of the underlying science is of greater consequence for scientific discovery than the sheer volume of the data.

Their papers discuss case studies and best practices on topics such as:

Another development this semester has been getting increasingly involved in programming, i.e. mucking around with JavaScript and, as of today, R (I think I prefer Python), after only having tried out Ruby (nice in that it aims to be very ‘readable’, but not so popular) and manipulating webpages with CSS.

The list of papers above is a case in point: I threw together a quick citation grabber in JavaScript to ask me which papers to include and to assemble the links, saving me the mule work. Learning regex is honestly worth its weight in gold (for all academics).
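For a flavour of the sort of regex work involved (not the actual grabber; the patterns are deliberately crude and the helper name is made up):

```javascript
// Illustrative only: pull DOIs and PubMed IDs out of pasted text and build links.
function extractCitationLinks(text) {
  var dois = text.match(/10\.\d{4,9}\/[^\s"<>]+/g) || [];   // rough DOI pattern
  var pmids = text.match(/PMID:?\s*\d{6,8}/gi) || [];       // e.g. "PMID: 12345678"
  return dois.map(function (d) { return "https://doi.org/" + d; })
    .concat(pmids.map(function (p) {
      return "https://www.ncbi.nlm.nih.gov/pubmed/" + p.replace(/\D/g, "");
    }));
}

// extractCitationLinks("see doi:10.1000/xyz123 and PMID: 12345678")
// -> ["https://doi.org/10.1000/xyz123", "https://www.ncbi.nlm.nih.gov/pubmed/12345678"]
```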

One of my favourite little self-assigned projects recently has been completing various tasks from the browser console related to parsing the genetic code/protein sequence for a tutorial exercise on site-directed mutagenesis; very much ground floor computational biology but just as much a timesaver as Excel etc. (without the clunkiness). There are websites to do this and it’s still assumed you can do things like find the reverse complement of a sequence by eye, but… well, it’s 2014. The result was an eye opener on manipulating ‘strings’ (chunks of text) and seeing both the possibilities and limits therein (unlike Ruby, JavaScript has no multiline CDATA input other than a popup input box which is annoying…).
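As an example of the kind of console helper I mean (a minimal sketch, not the tutorial code itself):

```javascript
// Reverse-complement a DNA sequence; anything unrecognised becomes "N".
function revComp(seq) {
  var comp = { A: "T", T: "A", G: "C", C: "G", a: "t", t: "a", g: "c", c: "g" };
  return seq.split("").reverse().map(function (base) {
    return comp[base] || "N";
  }).join("");
}

// revComp("ATGGCCATTGTAATG")  ->  "CATTACAATGGCCAT"
```

Paste it straight into the console and you have a tiny sequence tool.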

My interest in coding actually stemmed from increasingly elaborate spreadsheets, and I don’t feel the barrier to entry is as high as perceived (I’d like to think sharing this will encourage others to take a look). In Google Chrome you can press Ctrl + Shift + J to bring up the JavaScript console and get started…

To return to my original point, the reasoning in this informatics paper resonated with my own experience of computational tools being fundamental to parsing the sheer variety of information formats related to scientific work, most importantly in reading the literature.

When they do engage directly with the literature scientists do not typically read individual articles, but rather work with many articles simultaneously to search, filter, compare, arrange, link, annotate, and analyze fragments of content, in order to gather information as efficiently as possible. This behavior, which we call strategic reading, has been well-documented for both digital as well as print media.

Today scientists often search and browse as if they were playing a video game. They rapidly move through resources, changing search strings, chaining references backward and citations forward. By note-taking or cutting and pasting, they extract and accumulate bits of specific information, such as findings, equations, protocols, and data. They make rapid judgments—such as assessments of relevance, impact, and quality—as they formulate and iteratively refine queries.

Again, this is something I find a lot of affinity with. Steve Caplan, a biochemistry professor in Nebraska, posted a sharp-tongued article in the Guardian this weekend (a British newspaper, but whatever) bemoaning how lazy and unprofessional students ‘these days’ are. It got on my nerves to read someone longing for the old days pre-PubMed, when a scientific article was a thing of reverence and students had to frantically catch everything a lecturer said. But podcasts are nothing all that new either, only levelling the playing field with those who could afford dictaphones. Each to their own ideal method of engagement, but I think it’s a little rich of him to call students lazy for using technology (only PowerPoint at that…) as he tweets pictures of his MacBook Pro.

Statistics presented on minutes per article etc. miss a key facet of this ‘strategic’ reading, which is its non-linearity. New articles will often be revisited, reinterpreted, or simply saved and indexed, to be rediscovered unintentionally through a later relevant search.

As an aside, perhaps it's just my bl**dy Windows PC, but I'm unable to search the content of articles. My Chromebook (a pared back Linux distro by Google) features flawless and instant text searching of every PDF in my online library, its winning feature in my eyes. I'm on the verge of installing Linux Mint to dual boot (live DVD at the ready) on a new laptop, and hope the situation is better there...

There’s still a sizeable deficit in programs related to this, with Mendeley, PubChase, Zotero, EndNote and Papers the most common software used in curating one’s publication catalogue.

I’m really excited by the possibilities for text mining and ‘enriching’ content in the literature beyond plain text through the use of annotations:

One example is Textpresso, an ontology-based mining and retrieval system that works with prepared collections of articles, split into sentences and annotated with terms from 33 ontology categories, three of which correspond to the Gene Ontology ontologies. Results screens present a ranked list of sentences within a ranked list of articles, with term highlighting, and links to articles and external databases. Reading the sentences of an article in relevance order rather than narrative order is an example of strategic reading within an article. An example of strategic reading across a collection is provided by Information Hyperlinked over Proteins (iHOP), which uses genes, proteins, NCBI taxonomy identifiers, and MeSH headings to create a network of sentences and abstracts for searching and navigating MEDLINE abstracts, presenting configurable pages of ranked lists of sentences retrieved from many abstracts.
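The underlying idea is simple enough that a toy version fits in a few lines (a sketch of the concept only, nothing to do with Textpresso’s actual implementation): split an article into sentences and rank them by how many terms from a small controlled vocabulary they mention.

```javascript
// Toy "strategic reading" ranker: score each sentence by matched vocabulary terms.
function rankSentences(text, terms) {
  return text
    .split(/[.!?]+\s+/)                                     // naive sentence splitter
    .map(function (sentence) {
      var hits = terms.filter(function (term) {
        return sentence.toLowerCase().indexOf(term.toLowerCase()) !== -1;
      });
      return { sentence: sentence, score: hits.length, terms: hits };
    })
    .filter(function (result) { return result.score > 0; })
    .sort(function (a, b) { return b.score - a.score; });
}

// rankSentences(abstractText, ["phosphorylation", "kinase", "ubiquitin"])
// -> sentences mentioning the most terms, first
```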

The article gives a shout out to Ted Nelson, a visionary in web technology whose enthusiasm for primordial ‘hypermedia’ is infectious.

Although much of this functionality had already been implemented in experimental hypertext systems, the predicted revolution never happened. Even now, in 2009, few of the features anticipated in the 1980s are available for general use, and those that have been realized are but pale versions of imagined and prototyped originals. It is true that there were substantial changes in scientific publishing during the 1990s: nearly all journals came to include a digital version that was distributed over the internet and linked from indexing systems. But these changes provided very little of the functionality explored in the hypertext systems of the 1980s.

In retrospect we can see that in the 1990s every aspect of computing, from hardware to supporting social infrastructure, was inadequate for the emergence of a high-function scientific communication system. However since then there have been extraordinary improvements in the functionality, interoperability, and efficiency of basic networking, hardware, and software. Key basic standards and protocols have been developed and widely implemented; new software engineering strategies such as object oriented programming and conceptual modeling have been widely adopted; powerful new software applications have been developed and diffused; and user interfaces are considerably more effective and compelling. Strategies for distributed development and interoperable tools and data have been developed and tested.

This gap in what’s possible is so tangible it’s frustrating: those in more commercial fields make full use of modern technology while, for a variety of reasons (many related to scientific publishing), we languish in these 10-to-20-year-old setups.

Particularly important for the changes underway is the widespread adoption of a standard serialization language (XML) with associated standards and tools (XSLT, XQuery, XPath), providing a high level of interoperability at the data structure serialization level. Interoperability at the level of logical syntax is provided by the rapidly developing standards and technologies of the semantic web (RDF, OWL, SPARQL, SWRL). But most important is the development of scientific domain ontologies, which promise the semantic interoperability needed to realize the anticipated functionality.

Originally designed to support the sharing and integration of scientific data, these ontologies will increasingly be integrated into the scientific publishing workflow. Once they are deployed as part of digital scientific literature these ontologies will of course enhance text mining, information extraction, and literature-based discovery—but just as importantly they will transform how scientists “read” the narrative prose of scientific literature.

In the first reference of the above excerpt, disadvantages of hypertext are discussed, which seem relevant to the discussion of ‘e-Science’: “problems with the current implementations and problems that seem to be endemic to hypertext”.

The problems in the first class include delays in the display of referenced material, restrictions on names and other properties of links, lack of or deficiencies in browsers etc. Two problems that are more challenging than these implementation shortcomings, and that may in fact ultimately limit the usefulness of hypertext, are disorientation (getting “lost in space”) and cognitive overhead (it is difficult to become accustomed to the additional mental overhead required to create, name, and keep track of links).

One of the solutions offered is exactly the one taken up by the group at Illinois:

one can filter (or elide) information so that the user is presented with a manageable level of complexity and detail, and can shift the view or the detail suppression while navigating through the network. However, much research remains to be done on effective and standardized methods for elision.

The authors of the 2009 paper focus on this elision, which is perfect, as it’s precisely what unsettles me. At the level of the decisions a program such as this would make:

  • At the moment you encounter a link, how do you decide if following the side path is worth the distraction?
  • Does the label appearing in the link tell you enough to decide?

No heuristic can match an individual’s preferences, and in any case this process of ‘parsing’ is itself a core part of firing up something fundamental to the scientific/creative thought process.

The brain can create ideas faster than the hand can write them or the mouth can speak them. There is always a balance between refining the current idea, returning to a previous idea to refine it, and attending to any of the vague “proto-ideas” which are hovering at the edge of consciousness. Hypertext simply offers a sufficiently sophisticated “pencil” to begin to engage the richness, variety and interrelatedness of creative thought. This aspect of hypertext has advantages when this richness is needed and drawbacks when it is not.

The only parallel of prescriptive content provision I see on the web today is from the likes of Facebook, Google and Amazon, whose machine-learning algorithms decide what you ought to be reading, at the expense of your personal agency.

This is fine, but my intuition is just as Conklin wrote — effective standardised methods are sorely lacking, and commercial interests reign supreme in the previous examples. Simply put, the available material I receive to read is one thing I’d like to retain control of.

I’m gladdened that those in the relevant areas of technology are tackling this, and are aware of the anxieties surrounding tearing this away from users.

A challenge of a different sort is the limited empirical research on how scientists using digital resources read and engage with texts in the course of research. Many of the traditional approaches to evaluating information systems, such as retrieval precision and recall or satisfaction measures, do not provide the kind of analysis needed to guide the development of strategic reading technologies.

If we want to understand the fast-paced and subtle tactics, interactions, and intentions involved in using and applying the literature in online environments, then methods need to be applied that capture what scientists actually do and value as they gather, review, and manipulate texts and work with them over time. We know, for instance, that scientists often have trouble locating very problem-specific information (on methods and protocols, for instance) and that the occasional exploration of results from another discipline can have considerable impact on progress or the direction of research.

These are the kinds of information behaviors that we need to understand more fully to design tools that go beyond search and retrieval to support creative strategic reading.

As this paper points out, parsing huge information sources is nothing unique to the digital age. Rather than the insidiously prescriptive approach of “Recommended for You”, which robs the user of this vital process of creative exploration, a more desirable tool would aid them by leveraging existing annotations, as Textpresso (above) does for example.

This morning I trialled a prototype of the recently announced speed-reading app Spritz ‒ OpenSpritz ‒ simply a bookmarklet filled with JavaScript to whizz through the current page. It’s a little manic and unnatural; bringing some strategy to the literature is a far more enviable prospect.
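For anyone wondering what a bookmarklet amounts to, here is a minimal RSVP-style sketch (not the OpenSpritz code): save it as a bookmark’s URL, select some text on a page, and click it to flash the words one at a time.

```javascript
javascript:(function () {
  // Flash the selected text one word at a time in a fixed overlay.
  var words = String(window.getSelection()).split(/\s+/).filter(Boolean);
  if (!words.length) { alert("Select some text first"); return; }
  var box = document.createElement("div");
  box.style.cssText = "position:fixed;top:40%;left:50%;transform:translateX(-50%);" +
    "padding:0.5em 1em;background:#222;color:#fff;font:bold 2em monospace;z-index:99999";
  document.body.appendChild(box);
  var i = 0;
  var timer = setInterval(function () {
    if (i >= words.length) { clearInterval(timer); document.body.removeChild(box); return; }
    box.textContent = words[i++];
  }, 150); // roughly 400 words per minute
})();
```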

⏛ Renear and Palmer (2009) Strategic Reading and Scientific Discourse. Central Europe Workshop Proceedings, from the International Semantic Web Conference.

» The proceedings paper was subsequently developed into an article in Science.

See also:
• David Shotton (2009) Semantic Publishing: the coming revolution in scientific journal publishing. Learned Publishing, 22(2): 85‒94

Despite easy online access to a greater variety of journals, they have developed tunnel vision, citing articles that are less diverse and more recently published than was the case in the days of purely paper-based journals, since they are undertaking less journal browsing that was formerly necessary for knowledge discovery.

CropScience

✂  ‒  ‒  ‒  ‒  ‒  ‒  ‒  ‒  ‒  ‒  ‒  ‒  ‒  ‒  ‒

The scientific literature is a dense thing, and it’s made denser still by being constantly shrunk down to make way for distracting little tidbits at the side of the page and just general poor use of space. This can be particularly annoying when these are web-only features, like the CrossMark button (explained in this previous post) shown in the top right corner of the PNAS paper above.

Printing papers was a constant cause of annoyance and frustration for me last year, and as a student ink is expensive. This little set of scripts is nothing huge, but will (for a good portion of bioscience journals) chop out all of this junk, and ensure the focus stays on the article itself when squeezed out onto A4.

The JavaScript code runs when you boot up Acrobat, and makes living with obsessive compulsions like this a whole lot less stressful…

You can find the repository, with explanations of both how to install the scripts and how to make your own (for whatever purpose), over on GitHub.
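The repository’s scripts are folder-level files (which is why they run when Acrobat starts), but to give a flavour of the idea, here’s a console-sized sketch along the same lines ‒ not the actual code, and the margin value is an arbitrary assumption. Open a paper, bring up Acrobat’s JavaScript console with Ctrl + J, and run it to shrink every page’s crop box.

```javascript
// Sketch: trim a fixed margin from each page's crop box in the open PDF.
// In Acrobat's JavaScript console, `this` refers to the current document.
var margin = 40; // points to trim from each edge (illustrative value)
for (var p = 0; p < this.numPages; p++) {
  // getPageBox returns [upper-left x, upper-left y, lower-right x, lower-right y]
  var box = this.getPageBox("Media", p);
  this.setPageBoxes({
    cBox: "Crop",
    nStart: p,
    nEnd: p,
    rBox: [box[0] + margin, box[1] - margin, box[2] - margin, box[3] + margin]
  });
}
```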

The thousands of mutations that are present in any one patient’s tumours have a robust medical literature backing them up — what pathways are they involved in, how do they drive the tumour, are they irrelevant mutations? Watson can read tens of thousands, maybe millions of papers, in a split second. Poring through that data will allow Watson and us as a team to think about which mutations we can focus on — to drive the identifiable mutations, together using Watson, with the pharmacopoeia.

✲   ✲   

The New York Genome Center held a press conference yesterday, streamed above, on how the IBM Watson machine learning technology is being applied to genomic medicine.

Dr John Kelly, Senior VP and director of IBM Research, was introduced by Dr Robert Darnell @darnelr.

We can now start to look for true needles in many haystacks… getting to the root cause of an individual’s cancer in seconds, rather than weeks to months.

This is a big bet. Normally, big bets take decades, but I think this one is going to occur very rapidly. The data is just immense — not only the data being generated from the patient, but the amount of data from the clinical journals, the background, the pharmaceutical companies, that needs to be understood by Watson and brought to the table.

I’ve posted briefly on Watson’s cognitive computing earlier this year, after Bernard Meyerson’s IET and BCS Turing Lecture, Beyond Silicon: Cognition and much, much more. Further details on the mechanism behind what’s known as Watson can be found in that talk, but this press conference focussed on the applications in genomic medicine, which Meyerson only referenced in the abstract.

When we created Watson, it was a demonstration of man vs. machine. … We have spent the last two years teaching Watson the language of medicine. We’re moving Watson to a cloud-based capability, so that [it] will rapidly scale to physicians around the world … to have a tremendous impact on cancer and some of these horrific diseases… very very soon.

Next up to the podium came Ajay Royyuru, director of computational biology at IBM Research, who gave a glimpse of how the software works. Essentially (with a lot of visual fanfare) it maps genomic information to pathways. So far, so unrevolutionary. It’s great and all, but you can do this much at home with KEGG (minus the fancy dark-field cell diagram).
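To make that comparison concrete, here’s roughly what “mapping a gene to pathways at home” looks like against KEGG’s REST API (a sketch; hsa:7157, i.e. TP53, is just an example identifier, and the endpoint is worth double-checking):

```javascript
// List the KEGG pathways linked to a gene (here hsa:7157, TP53).
// Runs in Node 18+ (built-in fetch); browsers may hit CORS restrictions.
fetch("https://rest.kegg.jp/link/pathway/hsa:7157")
  .then(function (response) { return response.text(); })
  .then(function (tsv) {
    // Each line looks like "hsa:7157<TAB>path:hsa04110"
    var pathways = tsv.trim().split("\n").map(function (line) {
      return line.split("\t")[1];
    });
    console.log(pathways);
  });
```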

Watson finds new links and significant events in existing biomedical literature

One part of Meyerson’s talk that really intrigued me was how Watson’s cognitive analysis can forecast based on statistical inference and highlight not just links and correlation but new, definitive causes of problems before they would have been detectable with standard diagnostics. The example Meyerson gave was related to computer hardware failures in a vast and complex network. On a biological level, Royyuru describes this as “drilling down” in order to ask “is this relevant to the context of what I’m studying”, finding new biochemical events such as phosphorylation that may have been missed by researchers. That is, expect plenty of unexpected hypotheses to be followed up in the lab.

Incorporating aspects of systems pharmacology, Watson uses information on drugs that deal with the perturbations present in the mutations carried by patients’ tumours, allowing clinical specialists to take advantage of available drugs, including those in clinical trials.

In a lovely closing statement, Darnell spoke of “harnessing empathy”, in a way that is “not academic, not chasing our next publication, but crystallising treatments”.

Questions highlighted journal bias — the software develops a “trust factor” based on outcomes of information gleaned from that journal (how useful it is in making the right decisions), rather than more primitive metrics such as readership.

The Memorial Sloan-Kettering Cancer Center, one of the 12 founding institutions of the NYGC, is working close by on breast cancer. Recent collaborative work has resulted in a paper in Science which describes a new chimeric RNA transcript (encoding a fusion protein) important in a specific type of carcinogenesis, and potentially also more widely. From the NYGC press release:

“We discovered chimeric RNAs in the tumor samples — made when DNA deletions create unnatural products that can drive cancer,” says Nicolas Robine, co-first author and NYGC Computational Biologist.  “This chimera had never been seen before, so we believe it will help drive the work of our Rockefeller colleagues and Elana’s future.  It is the NYGC’s mission to undertake such collaborative genomic studies that will accelerate medical advances.”
