Introduction
Metabolomics encompasses a wide range of techniques that includes
mass spectrometry based fingerprinting and profiling. The analysis of
this data requires a number of steps that includes spectral processing,
data pre-treatment, quality control, data mining and visualisation in
order to extract relevant biological information. The goal of
metaboWorkflows
is to provide project directory templates
for a range of mass spectrometry based metabolomic techniques. The
generated project template directories can then be easily extended by
the user to meet the needs of particular analysis goals.
These project templates utilise a number of tools to promote efficient and reproducible analysis, agnostic of the actual analysis R code. These tools include:
- targets - an R focused pipeline toolkit for efficiently maintaining reproducible analysis workflows.
- renv - an R package for project-local R package dependency management for maintaining reproducible R package environments.
- git - a widely used, open-source distributed version control system.
- docker - enables the containerization of operating system (OS) level environments. This can be used to define reproducible OS environments in which a workflow analysis can be performed.
While the use of most of these tools is optional but highly
encouraged, it is recommended for the user to at least be familiar with
the basic use of targets
based pipelines outlined here.
There are three steps in using metaboWorkflows
to
generate a workflow project template directory. These are as
follows:
- input - define the input data source of the workflow
- define - define the workflow steps based on the metabolomic technique and project directory structure
- generate - generate the workflow project directory
This introduction will outline each of these steps, provide an overview of an example workflow project directory and how to execute one of these workflows.
To begin, firstly load the package:
library(metaboWorkflows)
#>
#> Attaching package: 'metaboWorkflows'
#> The following object is masked from 'package:base':
#>
#> args
Workflow input
Prior to defining a workflow, the user needs to know what sort of
input type the workflow will use. metaboWorkflows
currently
supports two types: remote data obtained through the use of a grover
RESTful web API or through providing a vector of .mzML file paths and a
tibble
of
sample information.
grover
API input
The grover
R
package provides a framework for hosting RESTful web APIs for remote
access and conversion of raw metabolomics data to the .mzML format.
This type of input can be declared by providing the host information
of the grover
API
to the inputGrover
function. Below shows an example for a
fictitious grover
API
host.
workflow_input <- inputGrover(instrument = 'An_instrument',
directory = 'Experiment_directory',
host = 'a.grover.host',
port = 80,
auth = '1234')
print(workflow_input)
#> Grover API workflow input
#> Instrument: An_instrument
#> Directory: Experiment_directory
#> Host: a.grover.host
#> Port: 80
#> Authentication: 1234
File path input
If the raw mass spectrometry .mzML format data files are locally
available for a sample set, their file paths and a tibble
of the
sample information can be provided for workflow input.
Below shows an example file path input for the FIE-HRMS
fingerprinting data set of Brachypodium distachyon ecotype
comparisons available in the metaboData
package.
file_paths <- metaboData::filePaths('FIE-HRMS','BdistachyonEcotypes')
sample_information <- metaboData::runinfo('FIE-HRMS','BdistachyonEcotypes')
workflow_input <- inputFilePath(file_paths,sample_information)
print(workflow_input)
#> File path workflow input
#> # files: 68
This example input will be used throughout the rest of this introduction.
Define a workflow
Workflow definition is simple and requires the input definition outlined in the previous section, the name of the metabolomic technique and the project name.
To return the currently available metabolomic workflow templates, use:
availableWorkflows()
#> [1] "FIE-HRMS fingerprinting" "NSI-HRMS fingerprinting"
#> [3] "RP-LC-HRMS profiling" "NP-LC-HRMS profiling"
#> [5] "GC-MS profiling"
This example will use the FIE-HRMS fingerprinting
template and the project will be named Example project
.
workflow_definition <- defineWorkflow(
input = workflow_input,
workflow = 'FIE-HRMS fingerprinting',
project_name = 'Example project'
)
Further project template options can also be specified at this point
such as the output directory path or the use of renv
for R
package management. See ?defineWorkflow
for more details.
Printing the resulting workflow definition will provide information
about the defined workflow.
print(workflow_definition)
#> Workflow: FIE-HRMS fingerprinting
#>
#> Project name: Example project
#> Directory path: .
#> Use renv: TRUE
#> Docker: TRUE
#> GitHub repository: FALSE
#> Private repository: FALSE
#> GitHub Actions: FALSE
#> Parallel plan: jfmisc::suitableParallelPlan()
#> Force creation: FALSE
#>
#> File path workflow input
#> # files: 68
#>
#> # targets: 37
As shown above, this workflow definition contains 37 targets. These targets are the individual steps in the analysis pipeline. The full workflow graph, showing the relationships between these targets, can be plotted as shown below:
glimpse(workflow_definition)
The package also contains functionality for modifying and extending the targets available in the workflow templates. This is outlined in the Workflow customisation and extension vignette.
Generate a workflow project directory
The final step is the generation of the project directory for the defined workflow.
generateWorkflow(workflow_definition)
This will then generate the project directory at the specified
directory path. The additional start
argument can be used
to automatically open the project directory in the RStudio IDE after
project generation.
The project directory
Below shows an overview of the generated Example project
project directory:
Example_project/
├── Example_project.Rproj
├── R
│ ├── functions
│ ├── targets
│ │ ├── correlations_targets.R
│ │ ├── input_targets.R
│ │ ├── modelling_targets.R
│ │ ├── molecular_formula_assignment_targets.R
│ │ ├── pre_treatment_targets.R
│ │ ├── report_targets.R
│ │ └── spectral_processing_targets.R
│ └── utils.R
├── README.md
├── _targets.R
├── _targets.yaml
├── data
│ ├── file_paths.txt
│ └── runinfo.csv
├── exports
├── misc
│ ├── build_project.sh
│ ├── docker
│ │ ├── Dockerfile
│ │ ├── build_image.sh
│ │ └── run_container.sh
│ └── run.R
├── renv
│ ├── activate.R
│ ├── library
│ ├── sandbox
│ ├── settings.dcf
│ └── staging
├── renv.lock
└── report
└── report.Rmd
The presence of some of these components will be dependent on the defined project options and the input type that was selected in the workflow definition. Here, a brief overview of some of the important components will be given.
R/functions
- R scripts containing additional functions
can be placed here.
R/targets
- This directory contains scripts for the
target definitions of each module of the workflow.
R/utils.R
- This contains any code related to loading
packages and setting package options.
_targets.R
- This R script sources all the necessary
scripts in the project directory and contains the formal definition of
the workflow. See here
for more information about this file.
_targets.yaml
- A YAML file for targets
pipeline configuration settings.
data
- Where workflow input data is stored. The contents
of this directory will differ depending on whether a grover API or file
path input type has been selected.
exports
- Any outputs from the workflow will be directed
here including the HTML report output and any .csv data table
outputs.
misc
- Any miscellaneous scripts and files can be placed
here.
misc/docker
- This directory contains infrastructure for
building a suitable docker image from which the workflow pipeline can be
executed within a reproducible containerised OS environment.
misc/build_project.sh
- A convenience shell script to
build and run the workflow pipeline with a docker container.
misc/run.R
- An R script that can be used to execute the
workflow. See the section below for more information.
renv.lock
- A lock file used by the renv
package to capture the state of the R package library used in the
project.
report/report.Rmd
- An R Markdown report for summarising
the workflow results. It’s output will be saved in
exports
.
Executing the workflow analysis pipeline of the generated project
There are a number of ways in which the workflow targets pipeline can
be executed after the project directory has been generated. The simplest
way is to execute tar_make()
in an R session run from
within the project directory. The targets package provides further
information on this topic here.
A recommended method is to open the project in the RStudio IDE and to run the workflow
as a background job using the rstudioapi
package and misc/run.R
script as follows:
rstudioapi::jobRunScript('misc/run.R')