Title: | LS-TreeBoost and LAD-TreeBoost for Gene Regulatory Network Reconstruction |
---|---|
Description: | Provides an implementation of the regularized LS-TreeBoost and LAD-TreeBoost algorithms for regulatory network inference from any type of expression data (microarray, RNA-seq, etc.). |
Authors: | Raghvendra Mall [aut, cre], Khalid Kunji [aut], Melissa O'Neill [ctb] |
Maintainer: | Raghvendra Mall <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0-11 |
Built: | 2024-11-04 04:03:04 UTC |
Source: | https://github.com/cran/RGBM |
Here we add the names of the transcription factors (Tfs) as rownames and names of the target genes as column names to the adjacency matrix A.
add_names(A, tfs, targets)
A |
Adjacency matrix A obtained as a result of GBM procedure. |
tfs |
List of names of transcription factors. |
targets |
List of names of target genes. |
For the DREAM Challenge datasets, the list of transcription factors is the same as the list of target genes, and the genes are referred to as G1, ..., G100.
Raghvendra Mall <[email protected]>
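The labelling described above can be sketched in base R. This is a minimal illustration that does not call the package; the helper name label_adjacency is hypothetical and only stands in for add_names.

```r
# Hypothetical stand-in for add_names(): attach TF and target labels to A.
label_adjacency <- function(A, tfs, targets) {
  stopifnot(nrow(A) == length(tfs), ncol(A) == length(targets))
  rownames(A) <- tfs      # rows are transcription factors
  colnames(A) <- targets  # columns are target genes
  A
}

A <- matrix(0, nrow = 2, ncol = 3)
A_named <- label_adjacency(A, tfs = c("G1", "G2"),
                           targets = c("G1", "G2", "G3"))
```

With named dimensions, an edge can be looked up as A_named["G1", "G3"] rather than by position.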
This function computes the row-wise standard deviation of network A to generate an S1 matrix, which is then used to modify the weights in network A.
apply_row_deviation(A,Ntfs,Ntargets)
A |
Inferred GRN in the form of Ntfs-by-Ntargets matrix |
Ntfs |
Total number of transcription factors used in the experiment. |
Ntargets |
Total number of target genes used in the experiment |
Refined adjacency matrix A in the form of Ntfs-by-Ntargets matrix
Raghvendra Mall <[email protected]>
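One way to picture this step is sketched below. It assumes that S1 repeats each row's standard deviation across that row and that A is then multiplied by S1 element by element; the package's actual rule may differ, and row_deviation_sketch is a hypothetical name.

```r
# Assumption: S1[i, j] is the standard deviation of row i of A,
# and the refined network is the element-wise product A * S1.
row_deviation_sketch <- function(A) {
  s <- apply(A, 1, sd)                             # one value per TF (row)
  S1 <- matrix(s, nrow = nrow(A), ncol = ncol(A))  # repeat across columns
  A * S1
}

A <- matrix(c(0.1, 0.9,
              0.5, 0.5), nrow = 2, byrow = TRUE)
A_ref <- row_deviation_sketch(A)
```

Under this assumption a TF whose weights are identical for every target (row 2 above) is damped to zero, while a TF with selective, high-variance weights keeps its ranking.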
This function combines the adjacency matrix A_prev obtained as a result of first_GBM_step with the adjacency matrix A obtained as a result of second_GBM_step. All the edges in A and A_prev which have zero weights are initially given machine-precision weights. We then take the harmonic mean of each pair of corresponding elements of A_prev and A to obtain a regularized adjacency matrix (A_final). As a result of this procedure, transcriptional regulations which were strong and present in both A_prev and A end up getting the highest weights in A_final. We finally remove all edges whose weights are less than machine precision from A_final.
consider_previous_information(A, A_prev,real)
A |
Inferred GRN from second_GBM_step. |
A_prev |
Inferred GRN from first_GBM_step. |
real |
Numeric value 0 or 1 corresponding to simulated or real experiment respectively. |
Returns an adjacency matrix A_final of the form Ntfs-by-Ntargets
Raghvendra Mall <[email protected]>
first_GBM_step, second_GBM_step
## The function is currently defined as
function (A, A_prev) {
  # Utilize past information also to not remove true positives
  A_prev[A_prev == 0] <- .Machine$double.eps
  A_prev <- transform_importance_to_weights(A_prev)
  A[A == 0] <- .Machine$double.eps
  # Weight that a machine-precision edge receives after the transformation
  epsilon <- 1/log(1/.Machine$double.eps)
  A <- transform_importance_to_weights(A)
  # Element-wise harmonic mean of the two weight matrices
  A_final <- 2*A*A_prev/(A+A_prev)
  A_final <- A_final - epsilon
  A_final[A_final < 0] <- 0.0
  return(A_final)
}
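The effect of the harmonic mean can be checked on a small example. The sketch below is self-contained and does not call the package: to_weights is an assumed form of the inverse absolute log transform (the real transform_importance_to_weights may differ in detail), and a small tolerance replaces the exact comparison with zero to absorb floating-point rounding.

```r
# Assumed form of the "inverse absolute log" transform (hypothetical helper).
to_weights <- function(A) 1 / abs(log(A))

combine_sketch <- function(A, A_prev) {
  eps <- .Machine$double.eps
  epsilon <- 1 / log(1 / eps)          # weight that a zero edge maps to
  A[A == 0] <- eps
  A_prev[A_prev == 0] <- eps
  W <- to_weights(A)
  W_prev <- to_weights(A_prev)
  A_final <- 2 * W * W_prev / (W + W_prev)  # element-wise harmonic mean
  A_final <- A_final - epsilon
  A_final[A_final < 1e-12] <- 0        # tolerance instead of an exact test
  A_final
}

# Edge [1,1] is present in both networks, [1,2] and [2,2] in only one,
# and [2,1] in neither.
A_prev <- matrix(c(0.8, 0.0, 0.3, 0.0), nrow = 2)
A      <- matrix(c(0.7, 0.0, 0.0, 0.2), nrow = 2)
A_final <- combine_sketch(A, A_prev)
```

Because the harmonic mean is dominated by the smaller of its two arguments, an edge that is missing from either network ends up with a weight close to zero, and only the edge supported by both networks keeps a large weight.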
(GBM) on expression matrix E followed by the null_model_refinement_step
This function utilizes the core gradient boosting machine model (GBM) followed by the refinement step to generate the first adjacency matrix A of size Ntfs-by-Ntargets (p x p when all p genes serve as both Tfs and targets) using the list of Tfs and the set of target genes. Several such adjacency matrices (A) are obtained, one per iteration. All these adjacency matrices are averaged to reduce the noise in the inferred intermediate GRN.
first_GBM_step(E, K, tfs, targets, Ntfs, Ntargets, lf, M, nu,s_f, no_iterations)
E |
N-by-p expression matrix. Columns correspond to genes, rows correspond to experiments. E is expected to be already normalized using standard methods, for example RMA. The colnames of E are the set of all genes. |
K |
N-by-p initial perturbation matrix. It corresponds directly to the E matrix, i.e. if K[i,j] is equal to 1, gene j was knocked out in experiment i. Single-gene knock-out experiments are rows of K with exactly one value 1. The colnames of K are the set of all genes. By default it's a matrix of zeros of the same size as E, i.e. an unknown initial perturbation state of the genes. |
tfs |
List of names of transcription factors. In case of presence of prior mechanistic network it is a subset of all the p genes whereas in absence of such a mechanistic network it is a list of names of all the p genes. |
targets |
List of names of target genes. In case of presence of prior mechanistic network it is a subset of all the p genes whereas in absence of such a mechanistic network it is a list of names of all the p genes. |
Ntfs |
Total number of transcription factors used in the experiment. |
Ntargets |
Total number of target genes used in the experiment. |
lf |
Loss Function: 1 -> Least Squares and 2 -> Least Absolute Deviation |
M |
Number of extensions in boosting model, e.g. number of iterations of the main loop of RGBM algorithm. By default it's 5000. |
nu |
Shrinkage factor, learning rate, 0<nu<=1. Each extension to boosting model will be multiplied by the learning rate. By default it's 0.001. |
s_f |
Sampling rate of transcription factors, 0<s_f<=1. Fraction of transcription factors from E, as indicated by |
no_iterations |
Number of iterations to perform equivalent to building that many core LS-Boost/LAD-Boost models and then averaging them to have smooth edge-weights in the inferred intermediate GRN. |
Intermediate Gene Regulatory Network in form of a Ntfs-by-Ntargets adjacency matrix.
Raghvendra Mall <[email protected]>
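The averaging over no_iterations runs can be sketched as follows. The per-run matrices here are random stand-ins, not output of the package, and the variable names are illustrative only.

```r
set.seed(1)
no_iterations <- 3

# Stand-ins for the adjacency matrices produced by each boosting run.
runs <- lapply(seq_len(no_iterations),
               function(i) matrix(runif(6), nrow = 2))

# Element-wise mean of all runs gives the smoothed intermediate network.
A_avg <- Reduce(`+`, runs) / no_iterations
```

Averaging keeps the matrix shape unchanged and reduces the run-to-run variation of each edge weight that comes from random sampling of transcription factors.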
This function calculates an Ntfs-by-Ntargets adjacency matrix A from an N-by-p expression matrix E, which must be given as input. E has N rows corresponding to different experiments and p columns corresponding to all the genes; Ntfs is the number of transcription factors and Ntargets is the number of target genes. Additionally, the GBM function takes a matrix K of initial perturbations of genes, of the same size as E, and other parameters including which loss function to use (LS = 1, LAD = 2). As a result, GBM returns a matrix A of edge confidences of size Ntfs-by-Ntargets. The known transcription factors can be defined as a subset of all p genes.
GBM(E = matrix(rnorm(100), 10, 10), K = matrix(0, nrow(E), ncol(E)), tfs = paste0("G",c(1:10)), targets = paste0("G",c(1:10)), s_s = 1, s_f = 0.3, lf = 1, M = 5000,nu = 0.001, scale = TRUE,center = TRUE, optimization.stage = 2)
E |
N-by-p expression matrix. Columns correspond to genes, rows correspond to experiments. E is expected to be already normalized using standard methods, for example RMA. The colnames of E are the set of all genes. |
K |
N-by-p initial perturbation matrix. It corresponds directly to the E matrix, i.e. if K[i,j] is equal to 1, gene j was knocked out in experiment i. Single-gene knock-out experiments are rows of K with exactly one value 1. The colnames of K are the set of all genes. By default it's a matrix of zeros of the same size as E, i.e. an unknown initial perturbation state of the genes. |
tfs |
List of names of transcription factors |
targets |
List of names of target genes |
s_s |
Sampling rate of experiments, 0<s_s<=1. Fraction of rows of E, which will be sampled with replacement to calculate each extension in boosting model. By default it's 1. |
s_f |
Sampling rate of transcription factors, 0<s_f<=1. Fraction of transcription factors from E, as indicated by |
lf |
Loss function: 1 -> Least Squares, 2 -> Least Absolute deviation |
M |
Number of extensions in boosting model, e.g. number of iterations of the main loop of RGBM algorithm. By default it's 5000. |
nu |
Shrinkage factor, learning rate, 0<nu<=1. Each extension to boosting model will be multiplied by the learning rate. By default it's 0.001. |
scale |
Logical flag indicating if each column of E should be scaled to have unit standard deviation. By default it's TRUE. |
center |
Logical flag indicating if each column of E should be centered to have zero mean. By default it's TRUE. |
optimization.stage |
Numerical flag indicating if re-evaluation of edge confidences should be applied after calculating initial V, optimization.stage={0,1,2}. If optimization.stage=0, no re-evaluation will be applied. If optimization.stage=1, variance-based optimization will be applied. If optimization.stage=2, variance-based and z-score based optimizations will be applied. |
A |
Gene Regulatory Network in form of a Ntfs-by-Ntargets adjacency matrix. |
Raghvendra Mall <[email protected]>
# load RGBM library
library("RGBM")
# this step is optional, it helps speed up calculations, run in parallel on 2 processors
library(doParallel)
cl <- makeCluster(2)
# run network inference on the default 10-by-10 dummy expression data
V = GBM()
stopCluster(cl)
This function tests a regression model for a given X.test feature matrix, Y.test response vector, and working parameters.
GBM.test(model, X.test, Y.test, M.test)
model |
Model returned by GBM.train. |
X.test |
Input N-by-p feature matrix of unseen samples. Columns correspond to features, rows correspond to samples. |
Y.test |
Input N-element response vector of unseen samples. |
M.test |
Number of extensions of boosting model to take when predicting response. Must be not greater than |
Result of regression
Raghvendra Mall <[email protected]>
This function trains a regression model for a given X.train feature matrix, Y.train response vector, and working parameters. A model returned by this function can be used to predict the response for unseen data with the GBM.test function.
GBM.train(X.train, Y.train, s_f = 0.3, s_s = 1, lf =1, M.train = 5000, nu = 0.001)
X.train |
Input N-by-p feature matrix of training samples. Columns correspond to features, rows correspond to samples. |
Y.train |
Input N-element response vector of training samples. |
s_f |
Sampling rate of features, 0<s_f<=1. Fraction of columns from X.train, which will be sampled without replacement to calculate each extension in boosting model. By default it's 0.3. |
s_s |
Sampling rate of samples, 0<s_s<=1. Fraction of rows from X.train, which will be sampled with replacement to calculate each extension in boosting model. By default it's 1. |
lf |
Loss function: 1-> Least Squares and 2 -> Least Absolute Deviation |
M.train |
Number of extensions in boosting model, e.g. number of iterations of the main loop of RGBM algorithm. By default it's 5000. |
nu |
Shrinkage factor, learning rate, 0<nu<=1. Each extension to boosting model will be multiplied by the learning rate. By default it's 0.001. |
Regression model is a structure containing all the information needed to predict response for unseen data
Raghvendra Mall <[email protected]>
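To make the roles of M.train, nu and s_f concrete, here is a minimal least-squares boosting sketch with regression stumps written in base R. It is an illustration of the general technique, not the package's compiled implementation: all function names are hypothetical, row subsampling (s_s) is omitted, and the defaults are chosen small so that the example runs quickly.

```r
# Best single split of feature x for residuals r under squared error.
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (thr in unique(x)) {
    left <- x <= thr
    if (all(left)) next                      # no observations on the right
    ml <- mean(r[left]); mr <- mean(r[!left])
    sse <- sum((r[left] - ml)^2) + sum((r[!left] - mr)^2)
    if (sse < best$sse) best <- list(sse = sse, thr = thr, left = ml, right = mr)
  }
  best
}

boost_train_sketch <- function(X, y, M = 200, nu = 0.1, s_f = 0.5) {
  f0 <- mean(y)                              # initial constant model
  pred <- rep(f0, length(y))
  stumps <- vector("list", M)
  n_feat <- max(1, floor(s_f * ncol(X)))
  for (m in seq_len(M)) {
    r <- y - pred                            # least-squares residuals
    feats <- sample.int(ncol(X), n_feat)     # features sampled without replacement
    fits <- lapply(feats, function(j) fit_stump(X[, j], r))
    k <- which.min(vapply(fits, function(f) f$sse, numeric(1)))
    if (!is.finite(fits[[k]]$sse)) next      # no usable split this round
    s <- c(fits[[k]], feature = feats[k])
    pred <- pred + nu * ifelse(X[, s$feature] <= s$thr, s$left, s$right)
    stumps[[m]] <- s
  }
  list(f0 = f0, nu = nu, stumps = stumps)
}

boost_predict_sketch <- function(model, X, M.test = length(model$stumps)) {
  stopifnot(M.test <= length(model$stumps))  # cannot use more extensions than trained
  pred <- rep(model$f0, nrow(X))
  for (s in model$stumps[seq_len(M.test)]) {
    if (is.null(s)) next
    pred <- pred + model$nu * ifelse(X[, s$feature] <= s$thr, s$left, s$right)
  }
  pred
}

set.seed(42)
X <- matrix(rnorm(200 * 4), ncol = 4)
y <- 2 * (X[, 1] > 0) + rnorm(200, sd = 0.1)
model <- boost_train_sketch(X, y)
mse_full <- mean((y - boost_predict_sketch(model, X))^2)
mse_none <- mean((y - boost_predict_sketch(model, X, M.test = 0))^2)
```

Each extension adds a stump scaled by the learning rate nu, so a smaller nu needs a larger M to reach the same fit. Predicting with fewer extensions than were trained simply truncates the sum, which is why the number of extensions used at test time cannot exceed the number trained.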
This function is used to identify the rectified list of transcription factors for individual target genes after analysing the variable importance scores (where non-essential Tfs are pruned). These lists of Tfs are usually different for individual target genes. Hence we maintain them in the form of an adjacency matrix where the rownames correspond to all the Tfs and the colnames correspond to all the target genes. Each column is a binary vector in which the values corresponding to the rectified Tfs active for that target are 1 and the rest are 0.
get_colids(A, ideal_k, tfs, targets, Ntfs, Ntargets)
A |
Adjacency Matrix A obtained after the GBM and refinement step. |
ideal_k |
A vector containing the optimal value of k (no of active TFs) for each target gene obtained from |
tfs |
List of names of transcription factors. |
targets |
List of names of target genes. |
Ntfs |
Total number of transcription factors used in the experiment. |
Ntargets |
Total number of target genes used in the experiment. |
The function returns an adjacency matrix where the rownames correspond to all the Tfs and colnames correspond to all the target genes. Each column is a binary vector where all the values corresponding to the rectified Tfs active for that target are 1 while rest of the values are zeros.
Raghvendra Mall <[email protected]>
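The binary matrix described above can be illustrated with a short sketch. It assumes that, for each target, the ideal_k transcription factors with the largest weights in A are the ones kept; this selection rule and the name colids_sketch are assumptions made for illustration.

```r
# Assumption: for target j, keep the ideal_k[j] TFs with the largest weights.
colids_sketch <- function(A, ideal_k) {
  B <- matrix(0, nrow(A), ncol(A), dimnames = dimnames(A))
  for (j in seq_len(ncol(A))) {
    k <- ideal_k[j]
    if (k > 0) {
      top <- order(A[, j], decreasing = TRUE)[seq_len(k)]
      B[top, j] <- 1                 # mark active TFs for this target
    }
  }
  B
}

A <- matrix(c(0.9, 0.1, 0.4,
              0.2, 0.0, 0.7), nrow = 3,
            dimnames = list(c("G1", "G2", "G3"), c("T1", "T2")))
B <- colids_sketch(A, ideal_k = c(2, 0))
```

Target T1 keeps its two strongest regulators (G1 and G3), while T2, with an optimal k of 0, is treated as isolated and gets an all-zero column.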
This function generates a set of filepaths which are used to keep the adjacency matrix A obtained after the first_GBM_step + null_model_refinement_step. It also generates a path where an image of the variable importance curves for several target genes can be kept.
get_filepaths(A_prev, experimentid, outputpath, sample_type)
A_prev |
Adjacency matrix A obtained after |
experimentid |
The id of the experiment being conducted. It takes natural numbers like 1,2,3 etc. By default it's 1. |
outputpath |
Location where the Adjacency_Matrix and Images folder will be created. |
sample_type |
String argument representing a label for the experiment, e.g. for the DREAM3 challenge sample_type="DREAM3". |
Returns a data frame where the first element in the data frame is the location where the Adjacency_Matrix folder is located in the filesystem, second element represents the location where the Images folder is located in the filesystem, third element represents the path to the file where the Adjacency_Matrix will be written.
Raghvendra Mall <[email protected]>
This function provides the indices of all those samples (out of N) where it is known a priori that a gene was either knocked out or knocked down. This information is useful for the null_model_refinement_step, which utilizes the z_score_effect technique.
get_ko_experiments(K)
K |
N-by-p initial perturbation matrix. It corresponds directly to the E matrix, i.e. if K[i,j] is equal to 1, gene j was knocked out in experiment i. Single-gene knock-out experiments are rows of K with exactly one value 1. The colnames of K are the set of all genes. By default it's a matrix of zeros of the same size as E, i.e. an unknown initial perturbation state of the genes. |
Return a vector containing the indices of all the samples where a gene was knocked-out/down.
Raghvendra Mall <[email protected]>
null_model_refinement_step, z_score_effect
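Given the definition of K, the perturbed samples are simply the rows that contain at least one non-zero entry. A minimal base-R sketch follows; ko_rows_sketch is a hypothetical name and the package function may apply additional checks.

```r
# Rows of K with at least one non-zero entry are perturbed experiments.
ko_rows_sketch <- function(K) which(rowSums(K != 0) > 0)

K <- matrix(0, nrow = 4, ncol = 3)
K[2, 1] <- 1   # gene 1 knocked out in experiment 2
K[4, 3] <- 1   # gene 3 knocked out in experiment 4
ko_idx <- ko_rows_sketch(K)
```

When K is the default all-zero matrix the result is an empty index vector, which corresponds to having no knock-out information.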
This function provides the indices of all the transcription factors which are present in the expression matrix. In case of DREAM Challenges it will return the indices as 1,...,p for all the p genes in the data as the transcription factors are not known beforehand.
get_tf_indices(E, tfs, Ntfs)
E |
E is the expression matrix of size N x p, where N is the number of examples and p is the number of genes. The colnames of E are the set of all genes present in E. |
tfs |
List of names of transcription factors. |
Ntfs |
Total number of transcription factors used in the experiment. |
Returns the indices of all the transcription factors present in E matrix.
Raghvendra Mall <[email protected]>
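The lookup can be sketched with match() on the column names of E. This is an illustrative base-R version under the assumption that names absent from E are silently dropped; tf_indices_sketch is a hypothetical name.

```r
# Column positions in E of the transcription factors named in tfs.
tf_indices_sketch <- function(E, tfs) {
  idx <- match(tfs, colnames(E))
  idx[!is.na(idx)]               # assumption: names missing from E are dropped
}

set.seed(3)
E <- matrix(rnorm(12), nrow = 3, dimnames = list(NULL, paste0("G", 1:4)))
tf_idx <- tf_indices_sketch(E, c("G3", "G1", "G9"))
```

When tfs equals colnames(E), as in the DREAM Challenge setting, the result is 1, ..., p.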
We perform a column normalization on an adjacency matrix A, i.e. the inferred GRN.
normalize_matrix_colwise(A,Ntargets)
A |
Inferred GRN in the form of Ntfs-by-Ntargets matrix |
Ntargets |
Total number of target genes used in the experiment |
Column Normalized GRN of size Ntfs-by-Ntargets
Raghvendra Mall <[email protected]>
We use this function to refine the edge weights in an inferred GRN (A) by utilizing the matrix S2 obtained from the null-mutant z-score effect (z_score_effect), as shown in Slawek J, Arodz T, i.e. A = A x S2.
null_model_refinement_step(E, A, K,tfs, targets, Ntfs, Ntargets)
E |
N-by-p expression matrix. Columns correspond to genes, rows correspond to experiments. E is expected to be already normalized using standard methods, for example RMA. The colnames of E are the set of all genes. |
A |
Intermediate GRN in the form of a p-by-p adjacency matrix. |
K |
N-by-p initial perturbation matrix. It corresponds directly to the E matrix, i.e. if K[i,j] is equal to 1, gene j was knocked out in experiment i. Single-gene knock-out experiments are rows of K with exactly one value 1. The colnames of K are the set of all genes. By default it's a matrix of zeros of the same size as E, i.e. an unknown initial perturbation state of the genes. |
tfs |
List of names of transcription factors |
targets |
List of names of target genes |
Ntfs |
Number of transcription factors used while building the GBM ( |
Ntargets |
Number of targets used while building the GBM ( |
Returns a refined adjacency matrix A in the form of a Ntfs-by-Ntargets matrix.
Raghvendra Mall <[email protected]>
Slawek J, Arodz T. ENNET: inferring large gene regulatory networks from expression data using gradient boosting. BMC systems biology. 2013 Oct 22;7(1):1.
This function undertakes all the proposed steps for regularizing the list of transcription factors for individual target gene followed by re-iterating through the core GBM model and the refinement step to produce the final reverse engineered GRN.
regularized_GBM_step(E, A_prev, K, tfs, targets, Ntfs, Ntargets, lf, M, nu, s_f, experimentid, outputpath, sample_type, mink=0,real=0)
E |
N-by-p expression matrix. Columns correspond to genes, rows correspond to experiments. E is expected to be already normalized using standard methods, for example RMA. The colnames of E are the set of all genes. |
A_prev |
An intermediate inferred GRN obtained from |
K |
N-by-p initial perturbation matrix. It corresponds directly to the E matrix, i.e. if K[i,j] is equal to 1, gene j was knocked out in experiment i. Single-gene knock-out experiments are rows of K with exactly one value 1. The colnames of K are the set of all genes. By default it's a matrix of zeros of the same size as E, i.e. an unknown initial perturbation state of the genes. |
tfs |
List of names of transcription factors. |
targets |
List of names of target genes. |
Ntfs |
Total number of transcription factors used in the experiment. |
Ntargets |
Total number of target genes used in the experiment |
lf |
Loss Function: 1 -> Least Squares and 2 -> Least Absolute Deviation |
M |
Number of extensions in boosting model, e.g. number of iterations of the main loop of RGBM algorithm. By default it's 5000. |
nu |
Shrinkage factor, learning rate, 0<nu<=1. Each extension to boosting model will be multiplied by the learning rate. By default it's 0.001. |
s_f |
Sampling rate of transcription factors, 0<s_f<=1. Fraction of transcription factors from E, as indicated by |
experimentid |
The id of the experiment being conducted. It takes natural numbers like 1,2,3 etc. By default it's 1. |
outputpath |
Location where the Adjacency_Matrix and Images folder will be created. |
sample_type |
String argument representing a label for the experiment, e.g. for the DREAM3 challenge sample_type="DREAM3". |
mink |
User specified threshold i.e. the minimum number of Tfs to be considered while optimizing the L-curve criterion. By default it's 0. |
real |
Numeric value 0 or 1 corresponding to simulated or real experiment respectively. |
Returns the final inferred GRN in form of Ntfs-by-Ntargets matrix
Raghvendra Mall <[email protected]>
We control the size of the regulon for each TF by using a heuristic to remove the edges whose weights are small
regulate_regulon_size(A)
A |
Inferred GRN in the form of Ntfs-by-Ntargets matrix |
Refined adjacency matrix A in the form of Ntfs-by-Ntargets matrix
Raghvendra Mall <[email protected]>
This function performs the proposed regularized gradient boosting machines for reverse engineering a GRN. It allows the user to provide prior information in the form of a mechanistic network g_M. After an initial GRN is inferred using the core GBM model, it undergoes a pruning step. Here we detect and remove isolated nodes using the select_ideal_k function, and identify the optimal set of transcription factors for each target gene. We then re-iterate through the GBM followed by the refinement step to generate the final reconstructed GRN.
RGBM(E = matrix(rnorm(100), 10, 10), K = matrix(0, nrow(E), ncol(E)), g_M = matrix(1, 10, 10), tfs = paste0("G", c(1:10)), targets = paste0("G", c(1:10)), lf = 1, M = 5000, nu = 0.001, s_f = 0.3, no_iterations = 2, mink = 0, experimentid = 1, outputpath= "DEFAULT", sample_type = "Exp1_", real = 0)
E |
N-by-p expression matrix. Columns correspond to genes, rows correspond to experiments. E is expected to be already normalized using standard methods, for example RMA. The colnames of E are the set of all p genes; Ntfs represents the number of transcription factors and Ntargets represents the number of target genes. |
K |
N-by-p initial perturbation matrix. It corresponds directly to the E matrix, i.e. if K[i,j] is equal to 1, gene j was knocked out in experiment i. Single-gene knock-out experiments are rows of K with exactly one value 1. The colnames of K are the set of all genes. By default it's a matrix of zeros of the same size as E, i.e. an unknown initial perturbation state of the genes. |
g_M |
Initial mechanistic network in the form of an adjacency matrix (Ntfs-by-Ntargets). Each column is a binary vector whose elements are 1 only where the corresponding transcription factor has a connection with that target gene. The colnames of g_M should be the same as the names of the targets and the rownames of g_M should be the same as the names of the Tfs. By default it's a matrix of ones of size Ntfs x Ntargets. |
tfs |
List of names of transcription factors |
targets |
List of names of target genes |
lf |
Loss Function: 1 -> Least Squares and 2 -> Least Absolute Deviation |
M |
Number of extensions in boosting model, e.g. number of iterations of the main loop of RGBM algorithm. By default it's 5000. |
nu |
Shrinkage factor, learning rate, 0<nu<=1. Each extension to boosting model will be multiplied by the learning rate. By default it's 0.001. |
s_f |
Sampling rate of transcription factors, 0<s_f<=1. Fraction of transcription factors from E, as indicated by |
no_iterations |
Number of times initial GRN to be constructed and then averaged to generate smooth edge weights for the initial GRN as shown in |
mink |
User-specified threshold, i.e. the minimum number of Tfs to be considered while optimizing the L-curve criterion. By default it's 0. |
experimentid |
The id of the experiment being conducted. It takes natural numbers like 1,2,3 etc. By default it's 1. |
outputpath |
Location where intermediate Adjacency_Matrix and Images folder will be created. By default it's a temp directory (e.g. /tmp/Rtmp...) |
sample_type |
String argument representing a label for the experiment, e.g. for the DREAM3 challenge sample_type="DREAM3". |
real |
Numeric value 0 or 1 corresponding to simulated or real experiment respectively. |
Returns the final inferred GRN of form Ntfs-by-Ntargets adjacency matrix.
Raghvendra Mall <[email protected]>
select_ideal_k, first_GBM_step
# load RGBM library
library("RGBM")
# this step is optional, it helps speed up calculations, run in parallel on 2 processors
library(doParallel)
cl <- makeCluster(2)
# run network inference on the default 10-by-10 dummy expression data
A = RGBM()
stopCluster(cl)
This function tests a regression model for a given X.test feature matrix, Y.test response vector, and working parameters.
RGBM.test(model, X.test, Y.test, M.test)
model |
Model returned by RGBM.train. |
X.test |
Input S-by-P feature matrix of unseen samples. Columns correspond to features, rows correspond to samples. |
Y.test |
Input S-element response vector of unseen samples. |
M.test |
Number of extensions of boosting model to take when predicting response. Must be not greater than |
Result of regression
Raghvendra Mall <[email protected]>
This function trains a regression model for a given X.train feature matrix, Y.train response vector, and working parameters. A model returned by this function can be used to predict the response for unseen data with the RGBM.test function.
RGBM.train(X.train, Y.train, s_f = 0.3, s_s = 1, lf = 1, M.train = 5000, nu = 0.001)
X.train |
Input S-by-P feature matrix of training samples. Columns correspond to features, rows correspond to samples. |
Y.train |
Input S-element response vector of training samples. |
s_f |
Sampling rate of features, 0<s_f<=1. Fraction of columns from X.train, which will be sampled without replacement to calculate each extension in boosting model. By default it's 0.3. |
s_s |
Sampling rate of samples, 0<s_s<=1. Fraction of rows from X.train, which will be sampled with replacement to calculate each extension in boosting model. By default it's 1. |
lf |
Loss function: 1-> Least Squares and 2 -> Least Absolute Deviation |
M.train |
Number of extensions in boosting model, e.g. number of iterations of the main loop of RGBM algorithm. By default it's 5000. |
nu |
Shrinkage factor, learning rate, 0<nu<=1. Each extension to boosting model will be multiplied by the learning rate. By default it's 0.001. |
Regression model is a structure containing all the information needed to predict response for unseen data
Raghvendra Mall <[email protected]>
This function re-performs the core GBM model building (only one time) using the optimal set of transcription factors obtained from select_ideal_k followed by get_colids for each individual target gene to return a regularized GRN.
second_GBM_step(E, K, df_colids, tfs, targets, Ntfs, Ntargets, lf, M, nu, s_f)
E |
N-by-p expression matrix. Columns correspond to genes, rows correspond to experiments. E is expected to be already normalized using standard methods, for example RMA. The colnames of E are the set of all genes. |
K |
N-by-p initial perturbation matrix. It corresponds directly to the E matrix, i.e. if K[i,j] is equal to 1, gene j was knocked out in experiment i. Single-gene knock-out experiments are rows of K with exactly one value 1. The colnames of K are the set of all genes. By default it's a matrix of zeros of the same size as E, i.e. an unknown initial perturbation state of the genes. |
df_colids |
A matrix made up of column vectors where each column vector represents the optimal set of active Tfs which regulate each target gene and obtained from |
tfs |
List of names of transcription factors. |
targets |
List of names of target genes. |
Ntfs |
Total number of transcription factors used in the experiment. |
Ntargets |
Total number of target genes used in the experiment |
lf |
Loss Function: 1 -> Least Squares and 2 -> Least Absolute Deviation |
M |
Number of extensions in boosting model, e.g. number of iterations of the main loop of RGBM algorithm. By default it's 5000. |
nu |
Shrinkage factor, learning rate, 0<nu<=1. Each extension to boosting model will be multiplied by the learning rate. By default it's 0.001. |
s_f |
Sampling rate of transcription factors, 0<s_f<=1. Fraction of transcription factors from E, as indicated by |
Returns a regularized GRN of the form Ntfs-by-Ntargets
Raghvendra Mall <[email protected]>
This function detects the optimal number of transcription factors which are regulating each target gene. This number is different for different target genes. It also utilizes a heuristic to detect the isolated targets which are not regulated by any transcription factor. To detect the optimal number of Tfs for each target gene, it uses a notion similar to the L-curve criterion for Tikhonov regularization, evaluating the variable importance curve for each target gene.
select_ideal_k(experimentid, mink, filepath, imagepath, adjacency_matrix_path)
experimentid |
The id of the experiment being conducted. It takes natural numbers like 1,2,3 etc. By default it's 1. |
mink |
User specified threshold i.e. the minimum number of Tfs to be considered while optimizing the L-curve criterion. By default it's 0. |
filepath |
Path where some intermediate files will be written and provided by the function |
imagepath |
Path where an image of the variable importance curves for first 16 target genes will be written and provided by the function |
adjacency_matrix_path |
Path where an intermediate adjacency matrix will be written and provided by the function |
Returns a vector where each element represents the optimal number of transcription factors for each target gene.
Raghvendra Mall <[email protected]>
Test the regression model for each target gene
The format is:
List of 4
 $ name         : chr "test_regression_stump_R"
 $ address      :Class 'RegisteredNativeSymbol' <externalptr>
 $ dll          :List of 5
  ..$ name         : chr "RGBM"
  ..$ path         : chr "/home/raghvendra/R/x86_64-pc-linux-gnu-library/3.3/RGBM/libs/RGBM.so"
  ..$ dynamicLookup: logi TRUE
  ..$ handle       :Class 'DLLHandle' <externalptr>
  ..$ info         :Class 'DLLInfoReference' <externalptr>
  ..- attr(*, "class")= chr "DLLInfo"
 $ numParameters: int 15
 - attr(*, "class")= chr [1:2] "CRoutine" "NativeSymbolInfo"
Raghvendra Mall <[email protected]>
Train the regression stump for each target gene
The format is:
List of 4
 $ name         : chr "train_regression_stump_R"
 $ address      :Class 'RegisteredNativeSymbol' <externalptr>
 $ dll          :List of 5
  ..$ name         : chr "RGBM"
  ..$ path         : chr "/home/raghvendra/R/x86_64-pc-linux-gnu-library/3.3/RGBM/libs/RGBM.so"
  ..$ dynamicLookup: logi TRUE
  ..$ handle       :Class 'DLLHandle' <externalptr>
  ..$ info         :Class 'DLLInfoReference' <externalptr>
  ..- attr(*, "class")= chr "DLLInfo"
 $ numParameters: int 15
 - attr(*, "class")= chr [1:2] "CRoutine" "NativeSymbolInfo"
Raghvendra Mall <[email protected]>
This function performs an inverse absolute log-transformation of the non-zero edge weights in the final inferred GRN (A) to make the edge weights easier to interpret.
transform_importance_to_weights(A)
A |
Inferred GRN in the form of Ntfs-by-Ntargets matrix |
Refined adjacency matrix A in the form of Ntfs-by-Ntargets matrix
Raghvendra Mall <[email protected]>
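A possible reading of the "inverse absolute log-transformation" is w = 1 / |log(a)| applied to every non-zero entry a. The sketch below implements that reading in base R; the exact formula used by the package is an assumption here, and importance_to_weights_sketch is a hypothetical name.

```r
# Assumed transform: w = 1 / |log(a)| for non-zero a; zeros are left untouched.
# Note that an entry exactly equal to 1 would map to Inf under this formula.
importance_to_weights_sketch <- function(A) {
  W <- A
  nz <- A != 0
  W[nz] <- 1 / abs(log(A[nz]))
  W
}

A <- matrix(c(0, 1e-6, 1e-2, 0.5), nrow = 2)
W <- importance_to_weights_sketch(A)
```

For importances between 0 and 1 the transform is monotone, so the edge ranking is preserved, while values that differ by orders of magnitude (1e-6 versus 1e-2) are brought onto a comparable scale.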
This function converts adjacency matrix A to a sorted list of edges, i.e. a list in which edges are sorted by decreasing confidence.
v2l(A, max = 1e+05, check.names = TRUE)
A |
Input adjacency matrix. |
max |
Maximal length of the resulting list. This number may be lower than the number of all the edges from adjacency matrix. Then only top |
check.names |
Checks name of the gene ids |
A data frame of sorted edges: (1) list of sources (2) list of destinations (3) list of confidences. Elements in all the lists correspond to each other.
Raghvendra Mall <[email protected]>
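The conversion from matrix to edge list can be sketched in base R as follows. This version assumes that zero-weight entries are not reported as edges; the function name, the max_edges argument and the column names are illustrative and need not match what v2l returns.

```r
# Turn a named adjacency matrix into an edge list sorted by decreasing weight.
edge_list_sketch <- function(A, max_edges = 1e5) {
  idx <- which(A != 0, arr.ind = TRUE)     # assumption: zero weights are skipped
  edges <- data.frame(
    source = rownames(A)[idx[, "row"]],
    target = colnames(A)[idx[, "col"]],
    confidence = A[idx],
    stringsAsFactors = FALSE
  )
  edges <- edges[order(edges$confidence, decreasing = TRUE), ]
  rownames(edges) <- NULL
  head(edges, max_edges)                   # keep only the top edges
}

A <- matrix(c(0, 0.2, 0.9, 0), nrow = 2,
            dimnames = list(c("G1", "G2"), c("G1", "G2")))
edges <- edge_list_sketch(A)
```

Here the strongest edge, G1 to G2 with weight 0.9, comes first, followed by G2 to G1 with weight 0.2.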
This function generates a matrix of the form Ntfs-by-Ntargets using the steps proposed in the null-mutant z-score method. It acts as a refinement step for the inferred GRN, where this matrix is multiplied element by element with the inferred adjacency matrix A. However, this step is only effective in the presence of an additional source of information such as knockout or knockdown data, or knowledge of which genes are initially perturbed in time-series expression data.
z_score_effect(E, K, tfs, targets, Ntfs, Ntargets)
E |
N-by-p expression matrix. Columns correspond to genes, rows correspond to experiments. E is expected to be already normalized using standard methods, for example RMA. The colnames of E are the set of all genes. |
K |
N-by-p initial perturbation matrix. It corresponds directly to the E matrix, i.e. if K[i,j] is equal to 1, gene j was knocked out in experiment i. Single-gene knock-out experiments are rows of K with exactly one value 1. The colnames of K are the set of all genes. By default it's a matrix of zeros of the same size as E, i.e. an unknown initial perturbation state of the genes. |
tfs |
List of names of transcription factors |
targets |
List of names of target genes |
Ntfs |
Total number of transcription factors used in the experiment. |
Ntargets |
Total number of target genes used in the experiment. |
Returns an S2 matrix of form Ntfs-by-Ntargets. In absence of any additional knockout/knockdown/perturbation information the S2 matrix is a matrix of ones.
Raghvendra Mall <[email protected]>
Prill, Robert J., et al. "Towards a rigorous assessment of systems biology models: the DREAM3 challenges." PloS one 5.2 (2010): e9202.
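The idea behind the null-mutant z-score can be sketched as follows: when gene g is knocked out in an experiment, the deviation of every target from its mean expression, measured in standard deviations, indicates how strongly g affects that target. The sketch below is a simplified base-R illustration, not the package's implementation. It assumes that only single-gene knock-out rows are used and that rows of S2 for Tfs without a knock-out stay at 1; z_score_sketch is a hypothetical name.

```r
z_score_sketch <- function(E, K, tfs, targets) {
  # Start from a matrix of ones (no refinement when nothing is known).
  S2 <- matrix(1, length(tfs), length(targets), dimnames = list(tfs, targets))
  ko_rows <- which(rowSums(K != 0) == 1)   # single-gene knock-out experiments
  if (length(ko_rows) == 0) return(S2)
  mu <- colMeans(E)                        # mean expression of each gene
  sigma <- apply(E, 2, sd)                 # spread of each gene
  for (i in ko_rows) {
    gene <- colnames(K)[which(K[i, ] != 0)]
    if (!(gene %in% tfs)) next
    # Absolute z-score of every target in the knock-out experiment of `gene`.
    S2[gene, ] <- abs(E[i, targets] - mu[targets]) / sigma[targets]
  }
  S2
}

set.seed(7)
genes <- paste0("G", 1:4)
E <- matrix(rnorm(24), nrow = 6, dimnames = list(NULL, genes))
K <- matrix(0, nrow = 6, ncol = 4, dimnames = list(NULL, genes))
S2_none <- z_score_sketch(E, K, genes, genes)   # no perturbation information

K[3, "G2"] <- 1                                 # G2 knocked out in experiment 3
S2_ko <- z_score_sketch(E, K, genes, genes)
```

With an all-zero K the result is a matrix of ones, so multiplying it element by element with A leaves the network unchanged, which matches the behaviour described for the case without knockout information.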