Diverse biological sciences, from engineering to epidemiology, benefit from an increased understanding of how genetic code (genotype) determines downstream function (phenotype). While decreasing costs and increasing throughput have led to progressively larger-scale measurements, these measurements can still sample only a small fraction of the full genotype-phenotype landscape. A complete picture of the landscape therefore requires models that predict unmeasured genotype-phenotype combinations. Recently, researchers have increasingly relied on black-box models, such as deep neural networks, to make these predictions because of their unsurpassed ability to accurately predict the effects of genotype on phenotype. But these approaches have a substantial limitation: they are uninterpretable. Specifically, it is extremely difficult to understand how or why a black-box model makes a particular prediction. This drastically limits the insights these approaches can provide and decreases their trustworthiness to practitioners.
To address these problems, researchers in the Statistical Engineering Division and Cellular Engineering Group developed a novel approach that is fully interpretable: LANTERN. Across a broad benchmark of large-scale genotype-phenotype landscapes, LANTERN equals or outperforms alternative models (including deep neural networks), achieving state-of-the-art prediction. Beyond accurate prediction, we show how the model's interpretability provides novel insights into diverse protein landscapes relevant to public health and the bioeconomy. LANTERN demonstrates that state-of-the-art prediction is possible without sacrificing interpretability.
The bioeconomy is increasingly data-driven, relying on large-scale measurements to rapidly engineer novel functions. To facilitate rational engineering, where complex designs are constructed from well-understood components, models must distill the large design space associated with genetic sequences down to a scale comprehensible to designers. For example, a typical protein sequence has thousands of potential mutations that can be included in a novel construct. Since most bioengineering goals require multiple mutations to reach optimal performance, this massive design space creates a bottleneck for engineering tasks.
In this project, we address these problems with LANTERN, a fully interpretable model of genotype-phenotype landscapes. LANTERN compresses the high-dimensional genotype design space down to a continuous, low-dimensional space where the effects of mutations combine additively. This converts the challenge of designing over thousands of potential mutations into one of navigating a much smaller space when designing new proteins. For example, LANTERN compressed thousands of mutations in a previously measured landscape of genetic sensors (see "Large-Scale Genotype-Phenotype Landscape Measurements for Precision Engineering of Living Measurement Systems") down to only three dimensions. Importantly, LANTERN also predicts the effects of mutations with state-of-the-art accuracy: across a broad benchmark of large-scale datasets, LANTERN equaled or outperformed existing approaches in predicting novel function. This comparison included black-box neural networks, which are popular for their ability to generate accurate predictions but come with a substantial sacrifice in model interpretability. When modeling genotype-phenotype landscapes, we show that LANTERN makes this trade-off unnecessary: state-of-the-art prediction is possible while still remaining fully interpretable.
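To make the idea of an additive low-dimensional space concrete, the following is a minimal NumPy sketch and not the LANTERN implementation or its API. It assumes each mutation has an effect vector in a three-dimensional latent space, that a genotype's latent position is the sum of the vectors of the mutations it carries, and that phenotype is some smooth function of that position. The names `W`, `encode`, `latent_position`, and `surface` are hypothetical, the effect vectors are random rather than learned, and the sigmoid surface is only a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

n_mutations = 1000  # hypothetical number of possible mutations
n_latent = 3        # latent dimensions, as in the genetic sensor example

# Hypothetical effect vectors, one row per mutation.
# In a fitted model these would be learned from data; here they are random.
W = rng.normal(scale=0.5, size=(n_mutations, n_latent))


def encode(mutation_indices):
    """Return a binary genotype vector with 1 at each mutation present."""
    x = np.zeros(n_mutations)
    x[list(mutation_indices)] = 1.0
    return x


def latent_position(x):
    """Sum the effect vectors of the mutations present (additive combination)."""
    return x @ W


def surface(z):
    """Placeholder nonlinear map from latent position to phenotype (assumption)."""
    return 1.0 / (1.0 + np.exp(-z.sum(axis=-1)))


single_a = latent_position(encode([10]))
single_b = latent_position(encode([42]))
double = latent_position(encode([10, 42]))

# Additivity holds in the latent space, even though the phenotype is nonlinear.
print(np.allclose(double, single_a + single_b))
print(surface(double))
```

In this sketch, a designer reasons about a three-dimensional position rather than a thousand-dimensional genotype vector; any nonlinearity between mutations and phenotype is confined to the surface function.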
Source code: https://github.com/usnistgov/lantern
Documentation: https://lantern-gpl.readthedocs.io/en/latest/