metaboWorkflows

Introduction

Metabolomics encompasses a wide range of techniques, including mass spectrometry-based fingerprinting and profiling. Analysing these data requires a number of steps, including spectral processing, data pre-treatment, quality control, data mining and visualisation, in order to extract relevant biological information. The goal of metaboWorkflows is to provide project directory templates for a range of mass spectrometry-based metabolomic techniques. The user can then extend the generated project directories to meet the needs of a particular analysis.

These project templates utilise a number of tools to promote efficient and reproducible analysis, agnostic of the actual analysis R code. These tools include:

targets - an R focused pipeline toolkit for efficiently maintaining reproducible analysis workflows.
renv - an R package for project-local R package dependency management for maintaining reproducible R package environments.
git - a widely used, open-source distributed version control system.
docker - enables the containerization of operating system (OS) level environments. This can be used to define reproducible OS environments in which a workflow analysis can be performed.

The use of most of these tools is optional but highly encouraged. Users should at least be familiar with the basic use of targets-based pipelines, as outlined in the targets documentation.
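For readers new to targets, the following minimal sketch may help to show what a targets-based pipeline looks like. It writes a three-target pipeline to a temporary directory, runs it and reads back one result. The target names and values are invented for illustration and are not part of any metaboWorkflows template. It assumes that the targets package is installed.

```r
library(targets)

# Work in a throwaway directory so nothing is written to the user's files
project_dir <- tempfile("targets_sketch_")
dir.create(project_dir)
old_wd <- setwd(project_dir)

# Write a _targets.R script defining three dependent targets
# (illustrative names only, not taken from a workflow template)
tar_script({
  list(
    tar_target(intensities, c(10, 20, 30, 40)),
    tar_target(normalised, intensities / sum(intensities)),
    tar_target(largest_fraction, max(normalised))
  )
}, ask = FALSE)

# Build the pipeline; targets that are already up to date are skipped on re-runs
tar_make()

# Retrieve a stored result from the pipeline's data store
largest_fraction <- tar_read(largest_fraction)

# Return to the original working directory
setwd(old_wd)
```

Each target is a named step whose dependencies are detected from the code it runs, which is the same mechanism the generated workflow templates rely on at a larger scale.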

There are three steps in using metaboWorkflows to generate a workflow project template directory. These are as follows:

input - define the input data source of the workflow
define - define the workflow steps based on the metabolomic technique and project directory structure
generate - generate the workflow project directory

This introduction will outline each of these steps, give an overview of an example workflow project directory and show how to execute one of these workflows.

To begin, load the package:

library(metaboWorkflows)
#> 
#> Attaching package: 'metaboWorkflows'
#> The following object is masked from 'package:base':
#> 
#>     args

Workflow input

Before defining a workflow, the user needs to decide which input type the workflow will use. metaboWorkflows currently supports two types: remote data retrieved from a grover RESTful web API, or local data supplied as a vector of .mzML file paths together with a tibble of sample information.

`grover` API input

The grover R package provides a framework for hosting RESTful web APIs for remote access and conversion of raw metabolomics data to the .mzML format.

This type of input can be declared by providing the host information of the grover API to the inputGrover function. An example for a fictitious grover API host is shown below.

workflow_input <- inputGrover(instrument = 'An_instrument',
                              directory = 'Experiment_directory',
                              host = 'a.grover.host',
                              port = 80,
                              auth = '1234')

print(workflow_input)
#> Grover API workflow input 
#> Instrument: An_instrument 
#> Directory: Experiment_directory 
#> Host: a.grover.host 
#> Port: 80 
#> Authentication: 1234

File path input

If the raw mass spectrometry .mzML format data files are locally available for a sample set, their file paths and a tibble of the sample information can be provided for workflow input.

The example below defines a file path input for the FIE-HRMS fingerprinting data set of Brachypodium distachyon ecotype comparisons available in the metaboData package.

file_paths <- metaboData::filePaths('FIE-HRMS','BdistachyonEcotypes')
sample_information <- metaboData::runinfo('FIE-HRMS','BdistachyonEcotypes')

workflow_input <- inputFilePath(file_paths,sample_information)

print(workflow_input)
#> File path workflow input
#> # files: 68

This example input will be used throughout the rest of this introduction.
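For a user's own data, the two inputs could be assembled along the following lines. This is only a sketch: the directory, file names and class labels are hypothetical, and the sample information column names are an assumption modelled on the run information tables in metaboData. Check the output of metaboData::runinfo() for the columns that are actually expected.

```r
# Hypothetical directory standing in for a folder of converted .mzML files
data_dir <- tempfile("mzML_example_")
dir.create(data_dir)
file.create(file.path(
  data_dir,
  c("sample_1.mzML", "sample_2.mzML", "sample_3.mzML")
))

# Collect the full paths of all .mzML files in the directory
file_paths <- list.files(data_dir, pattern = "\\.mzML$", full.names = TRUE)

# One row of sample information per file
# (column names are assumed, not confirmed by this introduction)
sample_information <- data.frame(
  fileOrder = seq_along(file_paths),
  injOrder = seq_along(file_paths),
  fileName = basename(file_paths),
  batch = 1,
  block = 1,
  name = tools::file_path_sans_ext(basename(file_paths)),
  class = c("A", "A", "B"),
  stringsAsFactors = FALSE
)

# With metaboWorkflows loaded, these could then be passed on as in the
# example above, after converting the data frame to a tibble:
# workflow_input <- inputFilePath(file_paths, tibble::as_tibble(sample_information))
```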

Define a workflow

Workflow definition is simple and requires the input definition outlined in the previous section, the name of the metabolomic technique and the project name.

To return the currently available metabolomic workflow templates, use:

availableWorkflows()
#> [1] "FIE-HRMS fingerprinting" "NSI-HRMS fingerprinting"
#> [3] "RP-LC-HRMS profiling"    "NP-LC-HRMS profiling"   
#> [5] "GC-MS profiling"

This example will use the FIE-HRMS fingerprinting template and the project will be named Example project.

workflow_definition <- defineWorkflow(
  input = workflow_input,
  workflow = 'FIE-HRMS fingerprinting',
  project_name = 'Example project'
)

Further project template options can also be specified at this point such as the output directory path or the use of renv for R package management. See ?defineWorkflow for more details. Printing the resulting workflow definition will provide information about the defined workflow.

print(workflow_definition)
#> Workflow:  FIE-HRMS fingerprinting 
#> 
#> Project name: Example project 
#> Directory path: . 
#> Use renv: TRUE 
#> Docker: TRUE 
#> GitHub repository: FALSE 
#> Private repository: FALSE 
#> GitHub Actions: FALSE 
#> Parallel plan: jfmisc::suitableParallelPlan() 
#> Force creation: FALSE 
#> 
#> File path workflow input
#> # files: 68
#> 
#> # targets: 37

As shown above, this workflow definition contains 37 targets. These targets are the individual steps in the analysis pipeline. The full workflow graph, showing the relationships between these targets, can be plotted as shown below:

glimpse(workflow_definition)

The package also contains functionality for modifying and extending the targets available in the workflow templates. This is outlined in the Workflow customisation and extension vignette.

Generate a workflow project directory

The final step is the generation of the project directory for the defined workflow.

generateWorkflow(workflow_definition)

This will then generate the project directory at the specified directory path. The additional start argument can be used to automatically open the project directory in the RStudio IDE after project generation.

The project directory

An overview of the project directory generated for Example project is shown below:

Example_project/
├── Example_project.Rproj
├── R
│   ├── functions
│   ├── targets
│   │   ├── correlations_targets.R
│   │   ├── input_targets.R
│   │   ├── modelling_targets.R
│   │   ├── molecular_formula_assignment_targets.R
│   │   ├── pre_treatment_targets.R
│   │   ├── report_targets.R
│   │   └── spectral_processing_targets.R
│   └── utils.R
├── README.md
├── _targets.R
├── _targets.yaml
├── data
│   ├── file_paths.txt
│   └── runinfo.csv
├── exports
├── misc
│   ├── build_project.sh
│   ├── docker
│   │   ├── Dockerfile
│   │   ├── build_image.sh
│   │   └── run_container.sh
│   └── run.R
├── renv
│   ├── activate.R
│   ├── library
│   ├── sandbox
│   ├── settings.dcf
│   └── staging
├── renv.lock
└── report
    └── report.Rmd

The presence of some of these components will be dependent on the defined project options and the input type that was selected in the workflow definition. Here, a brief overview of some of the important components will be given.

R/functions - R scripts containing additional functions can be placed here.

R/targets - This directory contains scripts for the target definitions of each module of the workflow.

R/utils.R - This contains any code related to loading packages and setting package options.

_targets.R - This R script sources all the necessary scripts in the project directory and contains the formal definition of the workflow. See the targets documentation for more information about this file.

_targets.yaml - A YAML file for targets pipeline configuration settings.

data - Where workflow input data is stored. The contents of this directory will differ depending on whether a grover API or file path input type has been selected.

exports - Any outputs from the workflow will be directed here including the HTML report output and any .csv data table outputs.

misc - Any miscellaneous scripts and files can be placed here.

misc/docker - This directory contains infrastructure for building a suitable docker image from which the workflow pipeline can be executed within a reproducible containerised OS environment.

misc/build_project.sh - A convenience shell script to build and run the workflow pipeline with a docker container.

misc/run.R - An R script that can be used to execute the workflow. See the section below for more information.

renv.lock - A lock file used by the renv package to capture the state of the R package library used in the project.

report/report.Rmd - An R Markdown report for summarising the workflow results. Its output will be saved in exports.

Executing the workflow analysis pipeline of the generated project

There are a number of ways in which the workflow targets pipeline can be executed after the project directory has been generated. The simplest is to execute tar_make() in an R session started from within the project directory. The targets package documentation provides further information on this topic.
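As a small sketch of this simplest route, the snippet below first checks that the session is in a project root, then lists the targets that still need to be built before running the pipeline. tar_outdated() is part of the targets package. The guard and message are illustrative additions rather than anything generated by metaboWorkflows.

```r
# Run from the root of a generated workflow project directory
in_project <- file.exists("_targets.R")

if (in_project && requireNamespace("targets", quietly = TRUE)) {
  # Names of targets that are missing or out of date
  print(targets::tar_outdated())
  # Build only those targets; up-to-date targets are skipped
  targets::tar_make()
} else {
  message(
    "No _targets.R found here (or targets is not installed); ",
    "set the working directory to the project root first."
  )
}
```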

A recommended method is to open the project in the RStudio IDE and to run the workflow as a background job using the rstudioapi package and misc/run.R script as follows:

rstudioapi::jobRunScript('misc/run.R')

Jasen Finch