worked_example_prediction_-_train

Differences

This shows you the differences between two versions of the page.

worked_example_prediction_-_train [2021/01/30 15:40] – [Training report] krian
worked_example_prediction_-_train [2021/01/30 16:16] (current) – removed krian
Line 1: Line 1:
-====== Worked example Prediction ======
-===== Training ===== 
- 
-**1.** Log in to HiPathia. For further information on this step, visit [[logging_in|logging in]].
- 
-**2.** Data collection. We will work with a breast cancer dataset from The Cancer Genome Atlas (TCGA) repository [[https://portal.gdc.cancer.gov/projects/TCGA-BRCA | Link to dataset]].\\ 
-More information on the proposed dataset is available here: 
-  * https://www.nature.com/articles/nature11412 
-  * https://pubmed.ncbi.nlm.nih.gov/23644459/ 
- 
-Before use in HiPathia, the dataset must be normalized. We recommend using the [[https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25 | logarithm of the trimmed mean of M values]] (log2TMM). 
- 
-We have selected samples of breast cancer tumors from the dataset, annotated as luminal A and luminal B (the molecular annotations come from [[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3465532/ | this paper]]). You can learn more about breast cancer molecular subtypes [[https://link.springer.com/article/10.1007/s10549-009-0499-6 | here]]. The purpose of this study is to train a predictor so that it can learn to distinguish molecular subtypes from gene expression data using the Hipathia mechanistic models, and evaluate the model with a controlled set of samples. 
- 
-The expression matrix and the experimental design can be downloaded from these links: 
- 
-  * Expression matrix: [[http://hipathia.babelomics.org/data/brca_sub_class_exp_train.txt|brca_sub_class_exp_train.txt]] 
-  * Experimental design: [[http://hipathia.babelomics.org/data/brca_sub_class_des_train.txt|brca_sub_class_des_train.txt]]
- 
-**3.** Click the //Prediction// button.
-{{ :hipathia_bar_pred.png?600 |}} 
- 
-**4.** Upload the normalized data to HiPathia by clicking //My data// in the data panel, or click the //Run training example// button. For further information on this step, visit [[upload_your_data|Upload your data]].
-{{ ::runtrainingexmpl.png?nolink |}} 
-**5.** In the //Type// panel select //Train new predictor//. 
- 
-**6.** In the //Input data// panel, select //Expression matrix//. Click the //File browser// of the //Expression matrix// section and select the desired file. 
- 
-{{ :hipathia_Pred_work1.png?600 |}} 
- 
-**7.** In the //Design data// panel, select //Class predictor//. Click the //File browser// of the //Experimental design// section and select the desired file. The //Condition 1// and //Condition 2// fields are then filled in automatically. Select "Tumor" for //Condition 1// and "Normal" for //Condition 2//.
- 
-{{ :hipathia_Pred_work2.png?600 |}} 
- 
-**8.** Select Human (Homo sapiens) as species (default). 
- 
- 
-**9.** In the //Pathways// panel, select all the pathways (default). 
- 
-**10.** In the //Study information// panel, click the //File browser// button and select the desired output folder. In this case, we will use //analysis_BRCA//. Give a name to the study, for example "BRCA train model". 
- 
-{{ :hipathia_work4.png?600 |}} 
- 
-**11.** Click the //Run analysis// button. A Study will be created and listed in the studies panel. You can access this panel by clicking on the //My studies// button. 
- 
-===== Training report ===== 
-Once the launched study has finished, the report and results will be available in //My studies//.
-The report page of the Prediction-training tool includes several output results. You can download any table or image shown on the results page by clicking on the name right before it. You can also download the pathway and function matrices by clicking on //Circuit values//. For more information about each result, please read [[prediction|Prediction]].
- 
-==== Study Information ==== 
-Here you can find the information about the selected study.  
-{{ ::studyinfopredreport.png?nolink |}} 
-  * **Name**: the study name. 
-  * **Description**: the description of the current study. 
-  * **Tool**: the name of the tool used (in this case, HiPathia).
-  * **Date**: the study launch date (MM/DD/YYYY, HH:MM:SS AM/PM format).
-==== Input Parameters ==== 
-Here you can visualize the parameters with which the current study was launched.  
-{{ ::inputpredictreport.png?nolink |}} 
-  * **Expression file**: The name of the expression file that has been used in the current study. 
-  * **Design file**: The name of the design file that has been used in the current study. 
-  * **Comparison**: The groups that have been compared, for example, Normal vs Tumor.
-  * **Paired analysis**: Whether the input samples are paired: **No** or **Yes**.
-  * **Species**: The species of this experiment: Human (Homo sapiens), Mouse (Mus musculus) or Rat (Rattus norvegicus).
-==== Circuit values ==== 
-You can download the matrix of circuit activity values by clicking on //Circuit values//.
-{{ ::circuitvaluespredreport.png?nolink |}}
-For each "effector circuit", this matrix file gives the activation level calculated with the HiPathia method for each sample.
-==== Model training ==== 
-Here you can view the results of the prediction analysis.
-  * **K-fold cross-validation**: The number of equal-sized subsamples into which the original sample is randomly partitioned.
-{{ ::k-foldcrossvalidation.png?nolink |}}
-  * **SVM-RBF hyperparameter performance**
-{{ ::hyperparameterpredreport.png?nolink |}} 
-  * **Test model statistics**: 
-{{ ::testmodelstatistics.png?nolink |}} 
-  * **Split Train pr**: 
-{{ ::splittrainpr.png?nolink |}} 
-  * **Split Train roc**: 
-{{ ::splittrainroc.png?nolink |}} 
-  * **Split Test pr**: 
-{{ ::splittestpr.png?nolink |}} 
-  * **Split Test roc**: 
-{{ ::splittestroc.png?nolink |}} 
-==== Prediction model ==== 
-{{ ::predictionmodelreport.png?nolink |}} 
- 
-==== Model statistics ==== 
-You can download the model statistics. 
-  * **Selected features**: You can download the filtered paths that best differentiate your phenotype. This section is only available when the //filter paths// option is selected.
-===== Workflow ===== 
-The prediction tool of the Hipathia web tool is built on a machine learning module, which can be summarized as follows:
-  * Expected input and output: 
-    * Input features: Hipathia circuit values. 
-    * Input response: 1-D binary array with the same number of samples as the input features. 
-    * Output 1: CV performance metrics 
-    * Output 2: The selected features with their respective interaction sign, sorted by their relevance. 
-      * A positive sign indicates that a given feature pushes the prediction towards the positive class. 
-    * Output 3: Statistics plus ROC and PR curves for a typical train-test split scenario.
-    * Output 4: Probability boxplots for the test set. 
-  * Feature selection: 
-    * We select the features that best discriminate between the response values by means of the LASSO [4] (using the ''glmnet'' R package, which implements a fast coordinate descent version of the LASSO [5]).
-    * We filter the feature space using those circuits selected in the previous step. 
-  * Hyperparameter search (the //cost// ''C'' and γ) of a non-linear SVM [6] with a radial-basis kernel:
-    * **γ**: determines the complexity of the SVM decision boundary.
-    * **cost** (''C''): penalizes margin violations and therefore controls the width of the //margin// around the decision boundary (a larger ''C'' gives a narrower margin).
-    * **method**: both γ and ''C'' are obtained using a k-fold cross-validation procedure:
-      * for each candidate pair (γ, ''C'') we train an SVM on the training part of every fold
-      * we compute the mean misclassification error over the held-out (test) part of all the folds
-      * we select the best pair of hyperparameters (γ, ''C''), i.e. the one with the lowest mean CV error.
-    * From now onwards we fix the features selected by the LASSO and the hyperparameters previously found. 
-    * The ''SVM'' training has been carried out using the ''LIBSVM'' library [2] by means of the R interface provided by the package ''e1071'' [1]. 
-  * Performance evaluation:  
-    * We perform a k-fold cross-validation with the features and hyperparameters selected above in order to report the generalization capabilities of the method.  
-      * The report contains a set of commonly used metrics for classification. 
-    * We perform a train-test split analysis:
-      * We randomly select 30% of the samples as the test set.
-      * We train an SVM on the train set using the hyperparameters and features previously found.
-      * We provide summary statistics as in the case of the k-fold cross-validation. 
-      * We plot the ROC and Precision-Recall (PR) curves along with the area under the curve.  
-      * Note that all curve visualizations were produced with the specialized R package ''PRROC'' [3].
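The hyperparameter search described above can be sketched in a few lines of standard-library Python. This is only an illustration of the selection loop, not the HiPathia implementation (which the page says relies on ''e1071'' and ''LIBSVM'' in R). The function names, the toy data and the grid values are all assumptions, and ''rbf_vote'' is a simple stand-in classifier rather than an SVM, so it ignores the cost value.

```python
import itertools
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def rbf_vote(train_X, train_y, test_X, gamma, cost):
    """Stand-in classifier (NOT an SVM): RBF-kernel-weighted vote.

    `cost` is accepted only to mirror the (gamma, C) grid; this toy ignores it.
    """
    predictions = []
    for x in test_X:
        score = 0.0
        for xi, yi in zip(train_X, train_y):
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
            score += (1 if yi == 1 else -1) * math.exp(-gamma * sq_dist)
        predictions.append(1 if score > 0 else 0)
    return predictions


def cv_error(fit_predict, X, y, folds, gamma, cost):
    """Mean misclassification error over the held-out part of each fold."""
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predicted = fit_predict([X[i] for i in train], [y[i] for i in train],
                                [X[i] for i in held_out], gamma, cost)
        wrong = sum(p != y[i] for p, i in zip(predicted, held_out))
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / len(fold_errors)


def grid_search(fit_predict, X, y, gammas, costs, k=5, seed=0):
    """Return (error, gamma, cost) for the pair with the lowest mean CV error."""
    folds = k_fold_indices(len(X), k, seed)
    return min((cv_error(fit_predict, X, y, folds, g, c), g, c)
               for g, c in itertools.product(gammas, costs))


# Hypothetical toy data: two well-separated groups of ten samples each.
X = [(0.1 * i, 0.0) for i in range(10)] + [(5 + 0.1 * i, 5.0) for i in range(10)]
y = [0] * 10 + [1] * 10
best_error, best_gamma, best_cost = grid_search(
    rbf_vote, X, y, gammas=[0.01, 0.1, 1.0], costs=[1, 10])
print(best_error, best_gamma, best_cost)
```

Replacing ''rbf_vote'' with a function that trains a real SVM would leave the selection loop unchanged, since the loop only needs predictions for the held-out samples of each fold.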
- 
-=== Breast Cancer Molecular Subtype Classification === 
- 
-The experiment consists of classifying a given sample as Luminal A or Luminal B (molecular subtype). We use TCGA data, and no pathway filtering was done by hand.
- 
-**Model Analysis** 
- 
-Hyperparameter search: 
- 
-{{ :svm.performance.heatmap.png?direct&400 | Hyperparameter search heatmap}} 
- 
-CV Performance: 
-{{ :model_stats.tsv| CV stats }} 
- 
-The most relevant features along with their interaction sign: 
- 
-|**features**                                                          |**coefs** | 
-|p53 signaling pathway: CDK1 CCNB3                                 |+     | 
-|p53 signaling pathway: CDK2 CCNE1                                 |+     | 
-|p53 signaling pathway: SERPINB5                                   |-     | 
-|Oocyte meiosis: REC8*                                             |+     | 
-|Neurotrophin signaling pathway: NFKB1                             |+     | 
-|Neurotrophin signaling pathway: RHOA                              |+     | 
-|Amphetamine addiction: ARC                                        |-     | 
-|RIG-I-like receptor signaling pathway: CHUK IKBKB IKBKG           |+     | 
-|Ras signaling pathway: RAP1A                                      |-     | 
-|Progesterone-mediated oocyte maturation: CDK1                     |+     | 
-|Vascular smooth muscle contraction: ACTA2                         |-     | 
-|Complement and coagulation cascades: BDKRB1                       |+     | 
-|Fanconi anemia pathway: RAD51C                                    |+     | 
-|TGF-beta signaling pathway: ROCK1                                 |-     | 
-|ErbB signaling pathway: ELK1*                                     |-     | 
-|HTLV-I infection: TP53 TBPL2                                      |+     | 
-|Platelet activation: ITPR1                                        |-     | 
-|Pathways in cancer: E2F1                                          |+     | 
-|PI3K-Akt signaling pathway: BCL2                                  |-     | 
-|Maturity onset diabetes of the young: NKX6-1                      |+     | 
-|Signaling pathways regulating pluripotency of stem cells: MAPK1*  |-     | 
-|Rap1 signaling pathway: THBS1                                     |+     | 
-|HTLV-I infection: E2F1                                            |+     | 
-|Jak-STAT signaling pathway: CDKN1A                                |+     | 
-|Epstein-Barr virus infection: RB1                                 |-     | 
-|Colorectal cancer: BIRC5                                          |+     | 
-|Signaling pathways regulating pluripotency of stem cells: MYC     |-     | 
-|Taste transduction: C00076*                                       |-     | 
-|Hepatitis B: JUN                                                  |-     | 
-|Axon guidance: GSK3B                                              |-     | 
-|MAPK signaling pathway: MAPT                                      |-     | 
-|cAMP signaling pathway: HHIP                                      |-     | 
-|Fanconi anemia pathway: BRCA1                                     |+     | 
-|Pathways in cancer: CSF3R                                         |+     | 
-|Cell cycle: CDC45 MCM7 MCM6 MCM5 MCM4 MCM3 MCM2                   |+     | 
-|ErbB signaling pathway: CDKN1A                                    |+     | 
-|HTLV-I infection: PTTG2                                           |+     | 
-|AMPK signaling pathway: CCNA2                                     |+     | 
-|Oocyte meiosis: CDC25C*                                           |+     | 
-|Non-alcoholic fatty liver disease (NAFLD): BAX                    |+     | 
-|Hepatitis B: PCNA                                                 |+     | 
-|AMPK signaling pathway: G6PC                                      |-     | 
-|Adrenergic signaling in cardiomyocytes: BCL2                      |-     | 
-|HTLV-I infection: ANAPC10 CDC20                                   |-     | 
-|Progesterone-mediated oocyte maturation: CDK1*                    |+     | 
-|Complement and coagulation cascades: C4A                          |-     | 
-|Choline metabolism in cancer: WAS                                 |+     | 
-|ErbB signaling pathway: STAT5A                                    |-     | 
-|Herpes simplex infection: FOS                                     |-     | 
-|Amyotrophic lateral sclerosis (ALS): DERL1                        |+     | 
-|AGE-RAGE signaling pathway in diabetic complications: F3          |-     | 
-|Non-alcoholic fatty liver disease (NAFLD): PKLR                   |+     | 
-|Maturity onset diabetes of the young: FOXA3                       |-     | 
-|AMPK signaling pathway: CPT1C                                     |+     | 
-|PPAR signaling pathway: FADS2                                     |+     | 
-|Rap1 signaling pathway: C00076*                                   |+     | 
-|cAMP signaling pathway: GRIN3A                                    |+     | 
-|Glutamatergic synapse: CACNA1A                                    |-     | 
-|Progesterone-mediated oocyte maturation: MAPK14                   |-     | 
-|Salivary secretion: BEST2                                         |+     | 
-|Vibrio cholerae infection: PDIA4                                  |+     | 
-|cAMP signaling pathway: PLN                                       |+     | 
-|Neurotrophin signaling pathway: JUN                               |-     | 
-|Pathways in cancer: CCNA1                                         |-     | 
-|Epithelial cell signaling in Helicobacter pylori infection: GIT1  |-     | 
-|Renal cell carcinoma: TGFA                                        |-     | 
-|Influenza A: RNASEL                                               |+     | 
-|Thyroid hormone signaling pathway: TP53*                          |+     | 
-|Epithelial cell signaling in Helicobacter pylori infection: CXCL1 |-     | 
-|Signaling pathways regulating pluripotency of stem cells: MAPK14  |+     | 
-|Hepatitis C: EIF2S1                                               |+     | 
-|Proteoglycans in cancer: CTNNB1                                   |+     | 
-|Influenza A: STAT1 IRF9                                           |+     | 
-|Thyroid hormone signaling pathway: CTNNB1                         |-     | 
-|Taste transduction: PKD1L3 PKD2L1                                 |+     | 
-|Prostate cancer: C16038                                           |-     | 
-|Basal cell carcinoma: PTCH1*                                      |-     | 
-|Toxoplasmosis: C06314                                             |-     | 
-|Prostate cancer: BCL2                                             |-     | 
-|Measles: EIF2S1                                                   |+     | 
-|Acute myeloid leukemia: CCNA1                                     |-     | 
-|Glucagon signaling pathway: CPT1C*                                |+     | 
- 
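To make the meaning of the interaction sign concrete, here is a minimal Python sketch. It assumes that the signs come from a logistic (binomial) linear model, which this page does not state explicitly. The circuit names and coefficient values are hypothetical, and only their signs matter.

```python
import math


def positive_class_probability(features, coefficients, intercept=0.0):
    """Logistic model: P(positive class) = sigmoid(intercept + sum(coef * value))."""
    linear = intercept + sum(coefficients[name] * value
                             for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients; only their signs matter for this illustration.
coefficients = {"circuit_plus": 0.8, "circuit_minus": -0.8}

baseline = positive_class_probability(
    {"circuit_plus": 0.5, "circuit_minus": 0.5}, coefficients)
# Raising the activity of a "+" circuit pushes the prediction towards the positive class.
more_plus = positive_class_probability(
    {"circuit_plus": 0.9, "circuit_minus": 0.5}, coefficients)
# Raising the activity of a "-" circuit pushes it towards the negative class.
more_minus = positive_class_probability(
    {"circuit_plus": 0.5, "circuit_minus": 0.9}, coefficients)

print(baseline, more_plus, more_minus)
```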
- 
-**Split Analysis** 
- 
-Split Performance: 
-{{ :test_model_stats.tsv| Test stats }} 
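As a rough illustration of how split statistics of this kind can be computed, the Python sketch below draws a random 30% test set and derives a few common classification metrics from the confusion counts. Which metrics the report actually contains is not specified here, so the choice of accuracy, precision, recall and specificity is an assumption, as are the function names and the example labels.

```python
import random


def train_test_split(n_samples, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices) with a random test_fraction held out."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def ratio(numerator, denominator):
    """Safe division: NaN when the metric is undefined."""
    return numerator / denominator if denominator else float("nan")


def classification_stats(y_true, y_pred, positive=1):
    """Confusion-matrix based metrics for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


train_idx, test_idx = train_test_split(20)

# Hypothetical true labels and predictions for eight test samples.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
stats = classification_stats(y_true, y_pred)
print(len(train_idx), len(test_idx), stats)
```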
- 
-PR curve over the test set:
- 
-{{ :split_test_pr.png?400 | Precision-recall (PR) curve for the test split. }} 
- 
-ROC curve over the test set: 
- 
-{{ :split_test_roc.png?400 | ROC curve for the test split. }} 
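For readers unfamiliar with how a ROC curve and its area are obtained from predicted probabilities, the following Python sketch sweeps the decision threshold and integrates with the trapezoidal rule. It is a plain standard-library stand-in, not the ''PRROC'' code used for the figures, and the labels and scores are hypothetical.

```python
def roc_curve(labels, scores):
    """Return (fpr, tpr) points for labels in {0, 1}, sweeping the threshold downward."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample sharing this score (handles ties).
        if index + 1 == len(ranked) or ranked[index + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical test-set labels (1 = positive class) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_points = roc_curve(labels, scores)
roc_auc = auc(roc_points)
print(roc_points, roc_auc)
```

In this toy example 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9.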
- 
-Predicted probabilities for the test set:
- 
-{{ :test_probability_boxplot.png?400 | Predicted probability boxplot for the test set. }}
- 
- 
-===== Bibliography ===== 
- 
-[1] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch, e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. 2019, https://CRAN.R-project.org/package=e1071
- 
-[2] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, May 2011, doi: 10.1145/1961189.1961199.
- 
-[3] J. Grau, I. Grosse, and J. Keilwagen, “PRROC: computing and visualizing precision-recall and receiver operating characteristic curves in R,” Bioinformatics, vol. 31, no. 15, pp. 2595–2597, 2015, doi: 10.1093/bioinformatics/btv153.
- 
-[4] R. Tibshirani, “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996, doi: 10.1111/j.2517-6161.1996.tb02080.x.
- 
-[5] J. Friedman, T. Hastie, and R. Tibshirani, “Regularization Paths for Generalized Linear Models via Coordinate Descent,” J Stat Softw, vol. 33, no. 1, pp. 1–22, 2010, doi: 10.18637/jss.v033.i01.
- 
-[6] C. Cortes and V. Vapnik, “Support-vector networks,” Mach Learn, vol. 20, no. 3, pp. 273–297, Sep. 1995, doi: 10.1007/BF00994018.
  
worked_example_prediction_-_train.1612021223.txt.gz · Last modified: 2021/01/30 15:40 by krian