Step 1: Processing the input data

In order to run bayesReact the following information is needed:

  1. Sequences of interest, e.g., 3’ UTRs.
  2. Motifs of interest, e.g., miRNA target sites.
  3. Experimental data used for sequence ranking, e.g., gene expression matrix.

data_flow

Above is shown a flow diagram of the initial data processing and the subsequent bayesReact framework. Conventional expression data processing is conducted before it is provided to bayesReact (example shown is miRNA activity inference from single-cell RNA-Seq data).

Sequence data

Species specific sequences can be obtained using BioMart. An example of how to obtain 3’ UTR sequences for protein-coding genes is provided in the following tutorial.

Motif data

Motif(s) are provided as a vector of regular expression(s) on the alphabet A, C, G, T. For example, the motif “ACACTCC” is a target site for miR-122-5p, and “ACA[CT]TCC” allows for either C or T at the fourth position. It is also possible to provide all k-mers of interest as motif input by specifying the k-mer length (integer), e.g., 7 entails all 7-mers are evaluated (M = 4^7 = 16,384 motifs).

Expression data

Expression data is provided as a matrix with genes as rows and conditions (cells or samples) as columns. It is important that the rownames of the expression matrix match the sequence names used from the fasta file (or sequence dataframe).

Generate fold-change ranks and calculate sequence specific motif probabilities (SSP)

bayesReact::process_raw_input() handles the initial data input. Note, this step is quite computationally demanding due to the use of a Markov model to calculate probabilities of motif occurrence for each sequence given its length and nucleotide composition (SSP values).

# Define a set of miRNA target-sites:
mots <- c("ACACTCC", "GTGCCTT") # Target-sites for miR-122-5p and miR-124-3p (assuming seed-site at position 2-8 of mature miRNAs)
#mots <- 7 # Specifies all 7-mer motifs as input

# Load in the expression data (genes x cells/samples matrix)
# Very important that the rownames of the expression matrix match the sequence names in the fasta file.
exp_data <- readRDS("path/to/exp.rds")

# Then process all the data
processed_output_paths <- bayesReact::process_raw_input(exp_data, seq_data = "/path/to/seqs.fasta", 
                          motifs = mots, 
                          out_path = "./",    # Specify output path
                          exp_type = "count", # Or use "CPM" for normalized values
                          seq_gene_id = "gid" # If you use Ensembl gene IDs, else use "gsym" for gene symbols
                                      )
# Running 'process_raw_input()' may be impossible locally, if you are working with a large set of motifs.
# SSP values only needs to be computed once per sequence and motif set, and can be reused for multiple (expression) datasets.

The ranking of sequences is done based on their fold-change (FC) scores, where the scores are defined as log2(exp of gene i in condition j) - log2(median exp of gene i across all conditions). Using the median expression of each gene as a pseudo-normal condition makes the normal setting dependent on the provided expression data, and impacts the interpretation of the activity output.

It is also possible to provide a user-defined control-setting, when paired case-control data is available:

# Define a set of miRNA target-sites:
mots <- c("ACACTCC", "GTGCCTT")

# Load in the expression data (genes x cells/samples matrix)
exp_data <- readRDS("path/to/exp.rds")

# Load the control data (genes x cells/samples matrix)
# Genes should match 'exp_data' and should either be provided as a vector (single control condition) or
# as a matrix with one control-setting matching each column in 'exp_data' (multiple control conditions).
control_exp_data <- readRDS("path/to/control_exp.rds")

# Then process all the data
processed_output_paths <- bayesReact::process_raw_input(exp_data, seq_data = "/path/to/seqs.fasta", 
                          motifs = mots, 
                          out_path = "./",    
                          exp_type = "count", 
                          seq_gene_id = "gid" 
                          control_exp = control_exp_data # Control condition(s) used to compute FC scores in instances of available case-control data
                          )

Step 2: Inferring motif activities

Running bayesReact locally

Depending on the size of your data, running bayesReact may take a while. MCMC sampling is used to approximate the posterior distribution of the underlying activity parameters ‘a_{c,m}’. Since all conditions are considered independent, it’s possible to split the FC-rank data into smaller partitions to be run in parallel if necessary.

# Run bayesReact
in_data <- list(FC_rank = "/path/to/FC_rank.rds", # All input data is obtained from 'processed_output_paths'
                motif_probs = "/path/to/motif_probs.rds", 
                motif_counts = "/path/to/motif_counts.rds")

act_out <- bayesReact::bayesReact_core(in_data, output_type = "activity_summary") # This output type returns a list of matrices with activity information, CI and more.

#act_out <- bayesReact::bayesReact_core(in_data, model = "BF", MCMC_cores = 4) # Will also return log BF values, however, this is computationally more demanding.

# Save the output
saveRDS(act_out, "path/to/output.rds")

If you have few computational resources available, you may also consider using the fast Laplace approximation of the posterior instead:

# Run bayesReact
act_out <- bayesReact::bayesReact_core(in_data, output_type = "activity", posterior_approx = "Laplace") # This outputs only the activity scores and is less information-rich.

Running bayesReact on a computer cluster

An outline of a Bash script to run bayesReact in parallel can be found here /inst/run_bayesReact_parallel_example.sh. If relevant, it is possible to adjust the resources used for each job, and the size of the parallel partitions, as well as specifications for the MCMC sampling such as the number of chains, iterations, and whether to use Laplace approximation.

Running bayesReact on test data and an initial look at the output

Below is an example of how to run bayesReact using the test data provided as part of the R package:

# Take a look at the test data
library(bayesReact)
#?test_data
# Dimensions: M x C x S = 2 x 20 x 18,559, where M is the number of motifs, C is the number of conditions, and S is the number of sequences.

# Look at the motifs evaluated
#colnames(test_data$lst_data$motif_probs) # miR-122-5p target site and seed site

# Structure of the input data
lst_data <- list(FC_rank = test_data$lst_data$FC_rank, # Fold-change ranks for each sequence
                  motif_probs = test_data$lst_data$motif_probs, # SSP values
                  motif_counts = test_data$lst_data$motif_counts) # Number of motif occurrences for each sequence

# Run bayesReact
act_out <- bayesReact::bayesReact_core(test_data$lst_data, MCMC_cores = 1) # Running on a single core for the test data
## 
## Initiating ...
##                                 core 
## 
## ________________________________________________________
## ________________________________________________________
## 
## Fitting model for 'ACACTCC'. Succesfull model fit. 
## Fitting model for 'GGAGTGT'. Succesfull model fit. 
## Done.
# bayesReact initially spends a bit of time compiling the STAN model before starting the MCMC sampling.
# The compilation is only done once.

# The output
names(act_out) # List of matrices
## [1] "motif_activity"   "motif_post_prob"  "motif_a_mean"     "motif_a_sd"      
## [5] "motif_a_CI_lower" "motif_a_CI_upper" "motif_model_nEff" "motif_model_Rhat"

Step 3: Inspecting the results

Below is a brief look at the output from the test data.

# Activity measure and correlation to observed miRNA expression
head(act_out$motif_activity)
mot <- "ACACTCC" # miR-122-5p target motif ("ACACTCC"), non-functional motif ("GGAGTGT")
cor(act_out$motif_activity[, mot], test_data$exp_miR_122_5p, method = "spearman") 

# Posterior distribution of the activity parameter 'a_{c,m}'
head(act_out$motif_a_mean) # Posterior mean of 'a_{c,m}' (similar to an effect size)
head(act_out$motif_a_sd) # Posterior standard deviation (sd) of 'a_{c,m}'
head(act_out$motif_a_CI_lower) # Lower bound of 80% credible interval (or user specified CI)
head(act_out$motif_a_CI_upper) # Upper bound of 80% credible interval
head(act_out$motif_post_prob) # Posterior probability of the of 'a_{c,m}' being smaller or equal to 0 under a normal approximation of the posterior MCMC samples, with a mean equal to the absolute value of the posterior mean and sd equal to the posterior sd (please see the method paper for more information).
# a_{c,m} = 0 corresponds to the null model of no motif clustering across the ranked sequences.

# Model diagnostics
head(act_out$motif_model_Rhat) # Rhat statistic for convergence diagnostics; should preferably be close to 1 (check for values larger than 1.05)
head(act_out$motif_model_nEff) # Effective sample size; should be large (check for values much smaller than the 'number of iterations' * 'number of chains'; check ?bayesReact_core for these values)

## Run bayesReact and compute the Bayes Factor ##
#act_out <- bayesReact::bayesReact_core(test_data$lst_data, model = "BF", MCMC_cores = 4) 
# head(act_out$motif_model_lBF) # log Bayes Factor values
# The BF evaluates log P(data | model) / P( data | null model)
# All positive values thus indicate support for the bayesReact model, while negative values indicate support for the null model (of no motif activity).

A large activity score indicates motif clustering in sequences with extreme relative abundance. Positive values indicate clustering in lowly abundant sequences, while negative values indicate clustering in highly abundant sequences. Subsequently, positive values may indicate motif association with target depletion while negative values may indicate association with target enrichment.