Open source development of data fusion and analysis tools to enable machine learning for systems biology 


Subsurface Insights, in collaboration with Dr. John Bargar at PNNL, received a DOE SBIR award (DE-SC0024850) in February 2024 for a proposal responding to topic 17a (Complex Data: Advanced Data Analytic Technologies for Systems Biology and Bioenergy) of the FY2024 Phase I Release I SBIR call.

Our proposal addresses needs defined by BER’s Biological Systems Science Division (BSSD) for creating sustainable production systems for biofuels and bioproducts. Biofuels are a critical part of the US energy strategy (see, e.g., the recently released 2023 Billion Ton Report). However, numerous challenges stand in the way of going from feedstocks to biofuels and bioproducts at scale and in a cost-efficient manner. These challenges, which within DOE are in large part addressed by four Bioenergy Research Centers, can be roughly grouped into the four research thrusts shown below.

Biofuels Research Thrusts

The four research thrusts associated with the research done by DOE scientists. Figure source: Bioenergy Research Centers 2022 Program Update (page 3).

Specific need addressed by our proposal

The specific need addressed by our proposal relates to the first research thrust, Sustainability, and specifically to the need for optimal feedstock development. Selecting and developing site-specific optimal feedstocks requires knowledge of the complex multivariate interactions between crops and their environment, the impacts of crop choice and management systems, and key plant-microbe-environment interactions.

The increasing availability of large bioenergy crop, soil and environmental data sets promises major new opportunities to obtain this knowledge by fusing and analyzing this data to discover constitutive relationships that link key parameters such as genotype and soil microbiome to crop yield and water stress resiliency. 

As shown in the figure below, this data is multimodal and multiscale. It includes high-throughput plant phenotype data, multi-omic data from plants and soils, and imagery at scales ranging from molecular (e.g., protein structure models) to plot scale (e.g., Phenocam and drone data) to tens of meters (e.g., green normalized difference vegetation index (GNDVI) derived from satellite data). Critical ancillary data include soil types, topography, and environmental conditions (precipitation, solar radiation, and air and soil temperature). Our SBIR project is developing tools for fusing and analyzing this data.

Some of the different data sets associated with bioenergy crops. These data sets are multimodal and multiscale and can be categorized and grouped in different ways. One possible grouping and some of the sources are shown. Figure inspired by and modified from (Venturas, Sperry et al. 2018).
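
To make one of the satellite-derived quantities mentioned above concrete, here is a minimal sketch of how GNDVI is commonly computed from two reflectance bands, using the standard definition (NIR - Green) / (NIR + Green). The function name `gndvi` and the small example arrays are hypothetical and are not part of the project software; real imagery would also need steps such as cloud masking and reprojection that are not shown.

```python
import numpy as np

def gndvi(nir, green):
    """Green normalized difference vegetation index: (NIR - Green) / (NIR + Green).

    Pixels where NIR + Green is zero are returned as NaN instead of raising.
    """
    nir = np.asarray(nir, dtype=float)
    green = np.asarray(green, dtype=float)
    denom = nir + green
    out = np.full(denom.shape, np.nan)
    # Divide only where the denominator is nonzero; other pixels keep NaN.
    np.divide(nir - green, denom, out=out, where=denom != 0)
    return out

# Hypothetical 2x2 reflectance tiles (illustrative values, not real data).
nir_band = np.array([[0.50, 0.40], [0.30, 0.00]])
green_band = np.array([[0.10, 0.10], [0.30, 0.00]])
index = gndvi(nir_band, green_band)
print(index)
```

Values fall between -1 and 1, with larger values generally indicating denser or greener vegetation.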

SBIR approach

The approach in this SBIR is summarized in the figure below. It leverages capabilities at Subsurface Insights (e.g., the open source ODMX data management software developed by Subsurface Insights) and at PNNL (specifically the MONet program led by John Bargar). We will use the SmartTensors AI toolbox, and specifically NMFK, a novel unsupervised ML method for data analysis based on matrix decomposition.

Approach summary: we will fuse multimodal biofuel crop data into an ODMX Database, implement ML-guided analysis capabilities and demonstrate the application of the analysis tool on different field data sets.
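
To give a rough feel for what matrix-decomposition-based analysis means here, the sketch below factors a nonnegative samples-by-features matrix X into two nonnegative factors W and H using plain nonnegative matrix factorization with multiplicative updates. This is an assumption-laden illustration only: it is not NMFK, it does not use the SmartTensors API, and scanning several values of k is a crude stand-in for however NMFK chooses the number of hidden signals. The names `nmf` and `relative_error` and the synthetic data are hypothetical.

```python
import numpy as np

def nmf(X, k, n_iter=1000, seed=0, eps=1e-9):
    """Plain NMF via multiplicative updates: X (n x m) ~ W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # The updates multiply by nonnegative ratios, so W and H stay nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def relative_error(X, W, H):
    """Frobenius-norm reconstruction error relative to the norm of X."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)

# Synthetic stand-in for a fused data matrix: 30 samples x 12 features
# generated from 3 hidden nonnegative signals (not real crop data).
rng = np.random.default_rng(42)
true_W = rng.random((30, 3))
true_H = rng.random((3, 12))
X = true_W @ true_H

# Fit several candidate ranks and compare how well each reconstructs X.
fits = {k: nmf(X, k) for k in range(1, 6)}
errors = {k: relative_error(X, W, H) for k, (W, H) in fits.items()}
for k, err in errors.items():
    print(f"k={k}: relative error {err:.4f}")
```

In a sketch like this, the rows of H can be read as hidden signals shared across features and the rows of W as how strongly each sample expresses them. Interpreting such factors for real crop, soil, and environmental data is the kind of analysis the project aims to support.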

Implementation and opportunities for collaboration

The project team will implement this approach through an open source effort. We expect the first software release in mid-May 2024 (links will be posted here). In parallel with this release, we will provide web and API interfaces that allow users to interact and experiment with initial versions of our software. Note that the computational backend will use DOE compute resources, initially Tahoma at PNNL and similar resources at NERSC.

In our development we hope to benefit from data and insights from DOE scientists, academia, and the biofuel industry. We therefore invite anyone interested in this effort to fill out the form below to follow its progress and/or participate.