Modelling and feature selection
metabolyseR v0.15.4
Jasen Finch
12 September, 2023
Source: vignettes/modelling.Rmd
Introduction
Modelling provides the essential data mining step for extracting biological information and metabolome features that are explanatory for the experimental conditions from a data set. metabolyseR provides a number of univariate and multivariate methods for data mining.
For an introduction to the usage of metabolyseR for both exploratory and routine analyses, see the introduction vignette using:
vignette('introduction','metabolyseR')
To further supplement this document, a quick start example analysis is also available as a vignette:
vignette('quick_start','metabolyseR')
To begin, the package can be loaded using:
library(metabolyseR)
#>
#> Attaching package: 'metabolyseR'
#> The following objects are masked from 'package:stats':
#>
#> anova, predict
#> The following objects are masked from 'package:base':
#>
#> raw, split
Example data
The examples here use the abr1 data set from the metaboData package. This is nominal mass flow-injection mass spectrometry (FI-MS) fingerprinting data from a plant-pathogen infection time course experiment. The pipe %>% from the magrittr package will also be used. The example data can be loaded using:
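library(metaboData)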
Only the negative acquisition mode data (abr1$neg) will be used, along with the sample meta-information (abr1$fact). Create an AnalysisData class object, assigned to the variable d, using the following:
d <- analysisData(abr1$neg[,1:500],abr1$fact)
print(d)
#>
#> AnalysisData object containing:
#>
#> Samples: 120
#> Features: 500
#> Info: 9
As can be seen above, the data set contains a total of 120 samples and 500 features.
Parallel processing
The package supports parallel processing using the future package. By default, processing by metabolyseR will be done sequentially. However, parallel processing can be activated, prior to analysis, by specifying a parallel implementation using plan(). The following example specifies the multisession implementation (multiple background R sessions) with two worker processes.
future::plan(future::multisession,workers = 2)
See the future package documentation for more information on the types of parallel implementations that are available.
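Processing can be returned to sequential at any point by specifying the sequential implementation:

future::plan(future::sequential)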
Random Forest
Random forest is a versatile ensemble machine learning approach based on forests of decision trees for multivariate data mining. This can include unsupervised analysis, classification of discrete response variables and regression of continuous responses.
Random forest can be performed in metabolyseR using the randomForest() method. For further details on the arguments of this method, see ?randomForest. This implementation of random forest in metabolyseR utilises the randomForest package. See ?randomForest::randomForest for more information about that implementation.
Unsupervised
The unsupervised random forest approach can be a useful starting point for analysis in any experimental context. It can be used to give a general overview of the structure of the data and to identify any possible problems, such as the presence of outlier samples or splits in the data caused by analytical or sample preparation factors. Unsupervised random forest can have advantages in these assessments over other approaches such as Principal Component Analysis (PCA), as it is less sensitive to the effect of a single feature that could in fact have little overall impact relative to the hundreds of other features present in a data set.
The examples below will show the use of unsupervised random forest for assessing the general structure of the example data set and the presence of outlier samples.
Unsupervised random forest can be performed by setting the cls argument of randomForest() to NULL:
unsupervised_rf <- d %>%
randomForest(cls = NULL)
The type of random forest that has been performed can be checked using the type() method.
type(unsupervised_rf)
#> [1] "unsupervised"
Or by printing the results object.
unsupervised_rf
#>
#> Unsupervised random forest
#>
#> Samples: 120
#> Features: 500
Firstly, the presence of outlier samples will be assessed. A multidimensional scaling (MDS) plot can be used to visualise the relative proximity of the observations, as shown in the following. The individual points are also labelled by their injection order to enable the identification of individual samples if necessary.
plotMDS(unsupervised_rf,
cls = NULL,
label = 'injorder',
labelSize = 3,
title = 'Outlier detection')
From the plot above, it can be seen that a single sample lies outside the 95% confidence ellipse. It is unlikely that this sample can be considered an outlier, as its position is a result of the underlying class structure as opposed to differences specific to that individual sample.
The structure of these observations can be investigated further by colouring the points by a different experimental factor. Here, this will be the day class column, which is the main experimental factor of interest in this experiment.
plotMDS(unsupervised_rf,
cls = 'day')
This shows that it is indeed the experimental factor of interest that is having the greatest impact on the structure of the data. The progression of the experimental time points is clear across Dimension 1.
The available feature importance metrics for a random forest analysis can be retrieved by:
importanceMetrics(unsupervised_rf)
#> [1] "1" "2" "false_positive_rate"
#> [4] "MeanDecreaseAccuracy" "MeanDecreaseGini" "selection_frequency"
And the importance values of these metrics for each feature can be returned using:
importance(unsupervised_rf)
#> # A tibble: 3,000 × 3
#> feature metric value
#> <chr> <chr> <dbl>
#> 1 N1 1 0
#> 2 N1 2 0
#> 3 N1 MeanDecreaseAccuracy 0
#> 4 N1 MeanDecreaseGini 0
#> 5 N1 false_positive_rate 0.0238
#> 6 N1 selection_frequency 0
#> 7 N10 1 0
#> 8 N10 2 0
#> 9 N10 MeanDecreaseAccuracy 0
#> 10 N10 MeanDecreaseGini 0
#> # ℹ 2,990 more rows
The explanatory features for a given threshold can be extracted for any of the importance metrics. The following will extract the explanatory features below a threshold of 0.05 based on the false positive rate metric.
unsupervised_rf %>%
explanatoryFeatures(metric = "false_positive_rate",
threshold = 0.05)
#> # A tibble: 359 × 3
#> feature metric value
#> <chr> <chr> <dbl>
#> 1 N342 false_positive_rate 1.31e-19
#> 2 N161 false_positive_rate 2.34e-16
#> 3 N341 false_positive_rate 6.50e-16
#> 4 N315 false_positive_rate 1.79e-15
#> 5 N367 false_positive_rate 3.47e-14
#> 6 N173 false_positive_rate 9.09e-14
#> 7 N385 false_positive_rate 9.09e-14
#> 8 N133 false_positive_rate 1.52e-12
#> 9 N439 false_positive_rate 1.52e-12
#> 10 N379 false_positive_rate 3.78e-12
#> # ℹ 349 more rows
In this example there are 359 explanatory features.
The trend of the most highly ranked explanatory feature against the day factor can be plotted using the plotFeature() method.
unsupervised_rf %>%
plotFeature(feature = 'N425',
cls = 'day')
Classification
Random forest classification can be used to assess the extent of discrimination (difference) between classes of a discrete response variable. This includes both multinomial (number of classes > 2) and binary (number of classes = 2) comparisons.
In multinomial situations, the suitability of a multinomial comparison versus multiple binary comparisons can depend on the experimental context. For instance, in a treatment/control experiment that includes multiple time points, a multinomial comparison using all available classes could be useful to visualise the general structure of the data. However, it could make it difficult to reason about how any extracted explanatory features relate to the individual experimental time points or treatment conditions. An investigator could instead identify the binary comparisons relevant to the biological question and focus further classification comparisons on these to better select explanatory features.
Multinomial comparisons
In experiments with more than two classes, multinomial random forest classification can be used to assess the discrimination between the classes and give an overview of the relative structure between classes.
The example data set consists of a total of 6 classes for the day response variable.
d %>%
clsExtract(cls = 'day') %>%
unique()
#> [1] 2 3 4 1 H 5
#> Levels: 1 2 3 4 5 H
Multinomial classification can be performed by:
multinomial_rf <- d %>%
randomForest(cls = 'day')
print(multinomial_rf)
#>
#> Random forest classification
#>
#> Samples: 120
#> Features: 500
#> Response: day
#> # comparisons: 1
The performance of this model can be assessed using metrics based on the success of the out of bag (OOB) predictions. The performance metrics can be extracted using:
multinomial_rf %>%
metrics()
#> # A tibble: 4 × 5
#> response comparison .metric .estimator .estimate
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 day 1~2~3~4~5~H accuracy multiclass 0.8
#> 2 day 1~2~3~4~5~H kap multiclass 0.76
#> 3 day 1~2~3~4~5~H margin NA 0.146
#> 4 day 1~2~3~4~5~H roc_auc hand_till 0.964
These metrics include accuracy, Cohen’s kappa (kap), area under the receiver operating characteristic curve (roc_auc, ROC-AUC) and margin. Each metric has both strengths and weaknesses that depend on the context of the classification, such as the balance of observations between the classes. As shown below, the class frequencies for this example are balanced, with 20 observations per class.
d %>%
clsExtract(cls = 'day') %>%
table()
#> .
#> 1 2 3 4 5 H
#> 20 20 20 20 20 20
In this context, each of these metrics could be used to assess the predictive performance of the model. The margin metric is the difference between the proportion of votes for the correct class and the maximum proportion of votes for the other classes for a given observation, averaged across all the observations. A positive margin value indicates correct classification, and values greater than 0.2 can be considered to indicate a model with strong predictive power. The margin also allows the extent of discrimination to be discerned even in highly distinct cases where both the accuracy and ROC-AUC would register values of 1.
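As an illustrative sketch, the margin could be computed from a matrix of out of bag class vote proportions as follows. The vote proportions and true classes here are hypothetical and not taken from this model.

## hypothetical vote proportions (rows = observations, columns = classes)
votes <- rbind(c(0.7,0.2,0.1),
c(0.4,0.5,0.1),
c(0.3,0.3,0.4))
colnames(votes) <- c('A','B','C')

## the true class of each observation
true_classes <- c('A','B','A')

## margin = votes for the correct class minus the maximum votes
## for the other classes, averaged across the observations
margins <- vapply(seq_len(nrow(votes)),
function(i) {
correct <- votes[i,true_classes[i]]
others <- max(votes[i,colnames(votes) != true_classes[i]])
correct - others
},
numeric(1))

mean(margins)
#> [1] 0.1666667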
In this example, the values of all the metrics suggest that the model is showing good predictive performance. This can be investigated further by plotting the MDS of observation proximity values.
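multinomial_rf %>%
plotMDS(cls = 'day')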
This shows that the model is able to discriminate highly between classes such as 5 and H. It is less able to discriminate more similar classes such as H and 1, or 4 and 5, whose confidence ellipses show a high degree of overlap. This makes sense in the context of this experiment, as these are adjacent time points that are more likely to be similar than time points at each end of the experiment.
The ROC curves can also be plotted as shown below.
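plotROC(multinomial_rf)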
Classes with their line further from the central dashed line are those that were predicted with the greatest reliability by the model. This plot shows that the H and 1 classes were the least reliably predicted, which is a result of their close proximity shown in the MDS plot previously.
Importance metrics can be used to identify the metabolome features that contribute most to the class discrimination in the model. The available importance metrics for this model are shown below.
importanceMetrics(multinomial_rf)
#> [1] "1" "2" "3"
#> [4] "4" "5" "false_positive_rate"
#> [7] "H" "MeanDecreaseAccuracy" "MeanDecreaseGini"
#> [10] "selection_frequency"
Here, we will use the false positive rate metric with a threshold of 0.05 to identify explanatory features for the day response variable.
multinomial_rf %>%
explanatoryFeatures(metric = 'false_positive_rate',
threshold = 0.05)
#> # A tibble: 121 × 5
#> response comparison feature metric value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 day 1~2~3~4~5~H N341 false_positive_rate 1.02e-93
#> 2 day 1~2~3~4~5~H N133 false_positive_rate 7.38e-68
#> 3 day 1~2~3~4~5~H N163 false_positive_rate 3.59e-61
#> 4 day 1~2~3~4~5~H N439 false_positive_rate 1.07e-54
#> 5 day 1~2~3~4~5~H N342 false_positive_rate 3.19e-49
#> 6 day 1~2~3~4~5~H N377 false_positive_rate 3.19e-49
#> 7 day 1~2~3~4~5~H N171 false_positive_rate 6.26e-44
#> 8 day 1~2~3~4~5~H N497 false_positive_rate 6.11e-30
#> 9 day 1~2~3~4~5~H N146 false_positive_rate 2.74e-29
#> 10 day 1~2~3~4~5~H N195 false_positive_rate 7.16e-25
#> # ℹ 111 more rows
As shown above there were a total of 121 explanatory features identified.
Within a multinomial experiment, where it might not be suitable to compare all the classes at once, it is also possible to specify the exact class comparisons to include using the comparisons argument. This should be specified as a named list, the names corresponding to the cls argument. Each named element should then consist of a vector of comparisons, with the classes to compare separated by ~.
The following specifies two comparisons (H~1~2, H~1~5) for the day response variable and displays the performance metrics.
d %>%
randomForest(cls = 'day',
comparisons = list(day = c('H~1~2',
'H~1~5'))) %>%
metrics()
#> # A tibble: 8 × 5
#> response comparison .metric .estimator .estimate
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 day H~1~2 accuracy multiclass 0.833
#> 2 day H~1~2 kap multiclass 0.75
#> 3 day H~1~2 margin NA 0.172
#> 4 day H~1~2 roc_auc hand_till 0.906
#> 5 day H~1~5 accuracy multiclass 0.75
#> 6 day H~1~5 kap multiclass 0.625
#> 7 day H~1~5 margin NA 0.320
#> 8 day H~1~5 roc_auc hand_till 0.909
The MDS and ROC curve plots can also be plotted simultaneously for the two comparisons.
d %>%
randomForest(cls = 'day',
comparisons = list(day = c('H~1~2',
'H~1~5'))) %>%
{plotMDS(.,cls = 'day') +
plotROC(.) +
patchwork::plot_layout(ncol = 1)}
Similarly, it is also possible to model multiple response factors with a single random forest call by specifying a vector of response class information column names to the cls argument. In the following, both the name and day response factors will be analysed and the performance metrics returned in a single table.
d %>%
randomForest(cls = c('name','day')) %>%
metrics()
#> Unbalanced classes detected. Stratifying sample size to the smallest class size.
#> # A tibble: 8 × 5
#> response comparison .metric .estimator .estimate
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 name 11_2~12_2~12_4~13_4~14_4~15_3 accuracy multiclass 0.35
#> 2 name 11_2~12_2~12_4~13_4~14_4~15_3 kap multiclass 0.212
#> 3 name 11_2~12_2~12_4~13_4~14_4~15_3 margin NA -0.0485
#> 4 name 11_2~12_2~12_4~13_4~14_4~15_3 roc_auc hand_till 0.753
#> 5 day 1~2~3~4~5~H accuracy multiclass 0.8
#> 6 day 1~2~3~4~5~H kap multiclass 0.76
#> 7 day 1~2~3~4~5~H margin NA 0.146
#> 8 day 1~2~3~4~5~H roc_auc hand_till 0.964
The MDS plots can also be returned for both models simultaneously.
d %>%
randomForest(cls = c('name','day')) %>%
plotMDS()
#> Unbalanced classes detected. Stratifying sample size to the smallest class size.
Binary comparisons
It may in some cases be preferable to analyse class comparisons as multiple binary comparisons.
The possible binary comparisons for a given response variable can be displayed using the binaryComparisons() method. The following shows the 15 comparisons for the day response variable.
binaryComparisons(d,cls = 'day')
#> [1] "1~2" "1~3" "1~4" "1~5" "1~H" "2~3" "2~4" "2~5" "2~H" "3~4" "3~5" "3~H"
#> [13] "4~5" "4~H" "5~H"
For this example, only the binary comparisons containing the H class will be used.
binary_comparisons <- binaryComparisons(d,cls = 'day') %>%
.[stringr::str_detect(.,'H')]
The binary comparisons can then be performed using the following.
binary_rf <- d %>%
randomForest(cls = 'day',
comparisons = list(day = binary_comparisons))
print(binary_rf)
#>
#> Random forest classification
#>
#> Samples: 120
#> Features: 500
#> Response: day
#> # comparisons: 5
To run all possible binary comparisons, the binary = TRUE argument could instead be used.
The MDS plots for each comparison can be visualised to inspect the comparisons.
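binary_rf %>%
plotMDS(cls = 'day')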
These plots show good separation in all the comparisons except H~1, which is also shown by the plot of the performance metrics below. Each of the comparisons shows perfect performance for the accuracy, Cohen’s kappa and ROC-AUC metrics, as well as very high margin values, except for the H~1 comparison.
binary_rf %>%
plotMetrics()
The explanatory features for these comparisons can be extracted as below using the false positive rate metric and a cut-off threshold of 0.05. This gives a total of 251 explanatory features.
binary_rf %>%
explanatoryFeatures(metric = 'false_positive_rate',
threshold = 0.05)
#> # A tibble: 251 × 5
#> response comparison feature metric value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 day 2~H N341 false_positive_rate 7.34e-52
#> 2 day 2~H N439 false_positive_rate 1.80e-45
#> 3 day 3~H N342 false_positive_rate 2.71e-39
#> 4 day 2~H N327 false_positive_rate 1.06e-35
#> 5 day 3~H N439 false_positive_rate 1.06e-35
#> 6 day 2~H N477 false_positive_rate 1.60e-34
#> 7 day 3~H N377 false_positive_rate 1.60e-34
#> 8 day 4~H N477 false_positive_rate 7.40e-34
#> 9 day 2~H N447 false_positive_rate 6.48e-30
#> 10 day 3~H N163 false_positive_rate 6.48e-30
#> # ℹ 241 more rows
A heatmap of these explanatory features can be plotted to show their mean relative intensities across the experiment time points. Here, the classes are also refactored to customise the order of the classes on the x-axis.
refactor_cls <- clsExtract(binary_rf,
cls = 'day') %>%
factor(.,levels = c('H','1','2','3','4','5'))
binary_rf <- clsReplace(binary_rf,
value = refactor_cls,
cls = 'day')
binary_rf %>%
plotExplanatoryHeatmap(metric = 'false_positive_rate',
threshold = 0.05,
featureNames = TRUE)
Regression
Random forest regression can be used to assess the extent of association of the metabolomic data with continuous response variables.
In this example, the extent of association of injection order with the example data will be assessed.
regression_rf <- d %>%
randomForest(cls = 'injorder')
print(regression_rf)
#>
#> Random forest regression
#>
#> Samples: 120
#> Features: 500
#> Response: injorder
The regression model performance metrics, based on the OOB prediction error, can be extracted using the following:
regression_rf %>%
metrics()
#> # A tibble: 5 × 4
#> response .metric .estimator .estimate
#> <chr> <chr> <chr> <dbl>
#> 1 injorder ccc standard 0.513
#> 2 injorder mae standard 23.5
#> 3 injorder mape standard 154.
#> 4 injorder rmse standard 26.5
#> 5 injorder rsq standard 0.476
These regression metrics include R2 (rsq), mean absolute error (mae), mean absolute percentage error (mape), root mean squared error (rmse) and the concordance correlation coefficient (ccc).
The R2 and concordance correlation coefficient metrics suggest that there is some association of features with the injection order, although this is weak. This is in agreement with the mean absolute error metric, which shows that, on average, the injection order could only be predicted to an accuracy of around 23 injection order positions.
The MDS plot below shows the relative proximities of the samples based on this injection order regression model, generated here following the same approach as the earlier outlier detection plot. This shows that, for the most part, there is little correspondence between the sample positions and their injection order. However, there is a small grouping of samples towards the end of the run, around samples 99 to 120. This suggests that there could have been some analytical issues, for certain features, towards the end of the mass spectral analytical run.
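regression_rf %>%
plotMDS(cls = NULL,
label = 'injorder',
labelSize = 3)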
The available feature importance metrics for this regression model can be listed.
regression_rf %>%
importanceMetrics()
#> [1] "%IncMSE" "IncNodePurity"
The feature importance metrics can be plotted to give an overview of their distribution. The following will plot the percentage increase in mean squared error (%IncMSE) importance metric.
regression_rf %>%
plotImportance(metric = "%IncMSE",
rank = FALSE)
This shows that there are only a few features that are contributing to the association with injection order. These explanatory features can be extracted with the following, using a threshold of above 5.
regression_rf %>%
explanatoryFeatures(metric = '%IncMSE',
threshold = 5)
#> # A tibble: 7 × 4
#> response feature metric value
#> <chr> <chr> <chr> <dbl>
#> 1 injorder N283 %IncMSE 19.9
#> 2 injorder N135 %IncMSE 8.71
#> 3 injorder N451 %IncMSE 5.58
#> 4 injorder N161 %IncMSE 5.51
#> 5 injorder N306 %IncMSE 5.49
#> 6 injorder N118 %IncMSE 5.22
#> 7 injorder N297 %IncMSE 5.07
This returned a total of 7 explanatory features above this threshold.
The top ranked feature N283 can be plotted to investigate its trend in relation to injection order.
regression_rf %>%
plotFeature(feature = 'N283',
cls = 'injorder')
This shows an increase in the intensity of that feature for samples above 100 in the injection order which corresponds with the cluster that was seen in the MDS plot above.
Univariate analyses
Univariate methods select features that are explanatory for response variables, with each feature tested on an individual basis. These methods offer simplicity and easy interpretation; however, they provide no information as to how features may interact.
The univariate methods currently available in metabolyseR include Welch’s t-test, analysis of variance (ANOVA) and linear regression. The following sections will provide brief examples of the use of each of these methods.
Welch’s t-test
Welch’s t-test can be used to select explanatory metabolome features for binary comparisons of discrete variables. By default, all the possible binary comparisons for the categories of a response variable will be tested.
The possible binary comparisons for the day response variable of the example data set are shown below.
binaryComparisons(d,
cls = 'day')
#> [1] "1~2" "1~3" "1~4" "1~5" "1~H" "2~3" "2~4" "2~5" "2~H" "3~4" "3~5" "3~H"
#> [13] "4~5" "4~H" "5~H"
For the following example, only a subset of comparisons will be tested. These will be selected by supplying a list to the comparisons argument.
ttest_analysis <- ttest(d,
cls = 'day',
comparisons = list(day = c('H~1',
'H~2',
'H~5')))
print(ttest_analysis)
#>
#> Univariate t-test analysis
#>
#> Samples: 120
#> Features: 500
#> Responses: day
#> # comparisons: 3
The explanatory features that show a significant difference between the response categories can be extracted as shown below.
explanatoryFeatures(ttest_analysis,
threshold = 0.05)
#> # A tibble: 73 × 14
#> response comparison feature estimate estimate1 estimate2 statistic p.value
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 day H~5 N163 -735. 19.5 755. -13.8 1.43e-11
#> 2 day H~5 N341 2445. 2537. 92.6 13.6 2.88e-11
#> 3 day H~5 N133 1055. 1077. 21.9 13.0 5.44e-11
#> 4 day H~2 N341 200. 293. 92.6 10.6 1.38e-10
#> 5 day H~5 N171 62.6 64.7 2.15 11.9 2.62e-10
#> 6 day H~5 N119 17.2 17.9 0.763 11.0 8.54e-10
#> 7 day H~5 N342 243. 247. 4.13 10.8 1.42e- 9
#> 8 day H~5 N343 27.4 28.3 0.961 9.83 5.99e- 9
#> 9 day H~5 N377 152. 157. 5.05 9.81 6.75e- 9
#> 10 day H~5 N477 103. 129. 26.1 9.30 1.05e- 8
#> # ℹ 63 more rows
#> # ℹ 6 more variables: parameter <dbl>, conf.low <dbl>, conf.high <dbl>,
#> # method <chr>, alternative <chr>, adjusted.p.value <dbl>
This will threshold the features based on their adjusted p-values, found in the adjusted.p.value column of the table. The results for all of the features can be returned using the importance() method:
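importance(ttest_analysis)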
A heat map of the explanatory features can be plotted to inspect their relative trends in relation to the response variable.
plotExplanatoryHeatmap(ttest_analysis)
ANOVA
ANOVA can be used to select explanatory features for discrete response variables with 3 or more categories. The following example will compare all the categories in the day response variable. However, the comparisons argument can be used to select particular comparisons of interest.
anova_analysis <- anova(d,
cls = 'day')
print(anova_analysis)
#>
#> Univariate ANOVA analysis
#>
#> Samples: 120
#> Features: 500
#> Responses: day
#> # comparisons: 1
The explanatory features that are significantly different between the categories can then be extracted.
explanatoryFeatures(anova_analysis,
threshold = 0.05)
#> # A tibble: 110 × 10
#> response comparison feature term df sumsq meansq statistic p.value
#> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 day 1~2~3~4~5~H N341 response 5 1.09e8 2.17e7 124. 1.90e-44
#> 2 day 1~2~3~4~5~H N163 response 5 1.25e7 2.51e6 113. 1.71e-42
#> 3 day 1~2~3~4~5~H N133 response 5 1.96e7 3.92e6 108. 1.71e-41
#> 4 day 1~2~3~4~5~H N171 response 5 6.29e4 1.26e4 88.8 1.16e-37
#> 5 day 1~2~3~4~5~H N342 response 5 1.04e6 2.07e5 85.1 7.61e-37
#> 6 day 1~2~3~4~5~H N343 response 5 1.19e4 2.38e3 66.1 4.43e-32
#> 7 day 1~2~3~4~5~H N119 response 5 4.92e3 9.83e2 53.8 2.07e-28
#> 8 day 1~2~3~4~5~H N497 response 5 1.10e5 2.20e4 49.6 4.83e-27
#> 9 day 1~2~3~4~5~H N137 response 5 6.32e3 1.26e3 39.9 1.59e-23
#> 10 day 1~2~3~4~5~H N277 response 5 6.31e4 1.26e4 39.1 3.14e-23
#> # ℹ 100 more rows
#> # ℹ 1 more variable: adjusted.p.value <dbl>
The top ranked explanatory feature N341 can be plotted to inspect its trend relative to the day response variable.
plotFeature(anova_analysis,
feature = 'N341',
cls = 'day')
Linear regression
Univariate linear regression can be used to associate a continuous response variable with metabolome features. In the example below, the example data will be regressed against injection order to identify any linearly associated metabolome features.
lr_analysis <- linearRegression(d,
cls = 'injorder')
print(lr_analysis)
#>
#> Univariate linear regression analysis
#>
#> Samples: 120
#> Features: 500
#> Responses: injorder
The explanatory features can then be extracted.
explanatoryFeatures(lr_analysis)
#> # A tibble: 8 × 15
#> response feature r.squared adj.r.squared sigma statistic p.value df logLik
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 injorder N283 0.310 0.304 4.27 53.0 4.10e-11 1 -343.
#> 2 injorder N135 0.165 0.157 78.7 23.2 4.31e- 6 1 -693.
#> 3 injorder N221 0.140 0.133 5.87 19.3 2.50e- 5 1 -382.
#> 4 injorder N473 0.135 0.127 7.24 18.3 3.78e- 5 1 -407.
#> 5 injorder N335 0.132 0.124 20.1 17.9 4.59e- 5 1 -529.
#> 6 injorder N452 0.120 0.112 4.00 16.0 1.10e- 4 1 -335.
#> 7 injorder N255 0.119 0.111 11.1 15.9 1.17e- 4 1 -458.
#> 8 injorder N267 0.118 0.111 26.4 15.8 1.22e- 4 1 -562.
#> # ℹ 6 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
#> # nobs <int>, adjusted.p.value <dbl>
The top ranked explanatory feature N283 can be plotted to inspect its association with injection order.
plotFeature(lr_analysis,
feature = 'N283',
cls = 'injorder')
Routine analyses
For routine analyses, the analysis parameters can first be selected for both pre-treatment of the data and the subsequent modelling.
p <- analysisParameters(c('pre-treatment','modelling'))
More specific parameters for pre-treatment of the example data can be declared using the following.
parameters(p,'pre-treatment') <- preTreatmentParameters(
list(
keep = 'classes',
occupancyFilter = 'maximum',
transform = 'TICnorm'
)
)
The modellingMethods() function can be used to list the modelling methods that are currently available in metabolyseR.
modellingMethods()
#> [1] "anova" "ttest" "linearRegression" "randomForest"
The modellingParameters() function can be used to retrieve the default parameters for specific modelling methods. Below, the default modelling parameters for the randomForest and ttest methods are specified.
parameters(p,'modelling') <- modellingParameters(c('randomForest','ttest'))
The class parameters can then be universally specified for both the pre-treatment and modelling elements. For this example, the day response variable will be used with just the H and 2 classes.
changeParameter(p,'cls') <- 'day'
changeParameter(p,'classes') <- c('H','2')
This gives the following parameters for the analysis.
p
#> Parameters:
#> pre-treatment
#> keep
#> classes
#> cls = day
#> classes = c("H", "2")
#> occupancyFilter
#> maximum
#> cls = day
#> occupancy = 2/3
#> transform
#> TICnorm
#> refactor = TRUE
#>
#> modelling
#> randomForest
#> cls = day
#> rf = list()
#> reps = 1
#> binary = FALSE
#> comparisons = list()
#> perm = 0
#> returnModels = FALSE
#> seed = 1234
#> ttest
#> cls = day
#> pAdjust = bonferroni
#> comparisons = list()
#> returnModels = FALSE
The analysis can then be executed.
analysis <- metabolyse(abr1$neg,abr1$fact,p)
#>
#> metabolyseR v0.15.4 Tue Sep 12 15:16:38 2023
#> ________________________________________________________________________________
#>
#> Parameters:
#> pre-treatment
#> keep
#> classes
#> cls = day
#> classes = c("H", "2")
#> occupancyFilter
#> maximum
#> cls = day
#> occupancy = 2/3
#> transform
#> TICnorm
#> refactor = TRUE
#>
#> modelling
#> randomForest
#> cls = day
#> rf = list()
#> reps = 1
#> binary = FALSE
#> comparisons = list()
#> perm = 0
#> returnModels = FALSE
#> seed = 1234
#> ttest
#> cls = day
#> pAdjust = bonferroni
#> comparisons = list()
#> returnModels = FALSE
#> ________________________________________________________________________________
#>
#> Pre-treatment … ✔ [0.5S]
#> Modelling … ✔ [10.4S]
#> ________________________________________________________________________________
#>
#> Complete! [10.9S]
The results for the modelling can be specifically extracted using the following.
analysisResults(analysis,'modelling')
#> $randomForest
#>
#> Random forest classification
#>
#> Samples: 40
#> Features: 1713
#> Response: day
#> # comparisons: 1
#>
#>
#> $ttest
#>
#> Univariate t-test analysis
#>
#> Samples: 40
#> Features: 1713
#> Responses: day
#> # comparisons: 1
This returns the results as a list containing the modelling results objects for each specified method.
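The results object for an individual method can then be extracted from this list by name. For example, the t-test results could be accessed using:

analysisResults(analysis,'modelling')$ttest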
Alternatively, the modelling results can be accessed directly from the Analysis object. Below, the explanatory features are extracted using the default parameters for each method, with the results returned in a single table.
explanatory_features <- analysis %>%
explanatoryFeatures()
print(explanatory_features)
#> # A tibble: 100 × 17
#> Method response comparison feature metric value estimate estimate1
#> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 randomForest day 2~H N341 false_p… 8.06e-28 NA NA
#> 2 randomForest day 2~H N377 false_p… 5.70e-18 NA NA
#> 3 randomForest day 2~H N447 false_p… 5.70e-18 NA NA
#> 4 randomForest day 2~H N579 false_p… 5.70e-18 NA NA
#> 5 randomForest day 2~H N1084 false_p… 1.19e-16 NA NA
#> 6 randomForest day 2~H N327 false_p… 2.33e-15 NA NA
#> 7 randomForest day 2~H N580 false_p… 4.32e-14 NA NA
#> 8 randomForest day 2~H N1083 false_p… 7.49e-13 NA NA
#> 9 randomForest day 2~H N1085 false_p… 7.49e-13 NA NA
#> 10 randomForest day 2~H N503 false_p… 7.49e-13 NA NA
#> # ℹ 90 more rows
#> # ℹ 9 more variables: estimate2 <dbl>, statistic <dbl>, p.value <dbl>,
#> # parameter <dbl>, conf.low <dbl>, conf.high <dbl>, method <chr>,
#> # alternative <chr>, adjusted.p.value <dbl>
Heat maps of the explanatory features can also be plotted for both the modelling methods.
plotExplanatoryHeatmap(analysis) %>%
patchwork::wrap_plots()