This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
prediction [2020/02/05 16:34] krian [Workflow] |
prediction [2020/04/03 20:18] (current) |
||
---|---|---|---|
Line 19: | Line 19: | ||
The **expression data** has to be: | The **expression data** has to be: | ||
* Expression matrix provided by yourself (see how to upload files in [[upload_your_data|Upload your data]]). | * Expression matrix provided by yourself (see how to upload files in [[upload_your_data|Upload your data]]). | ||
+ | When you select a gene expression file, the number of samples in the matrix appears under the "file browser" button, as shown below. | ||
+ | {{ ::diffnumbersamples.png?nolink |}} | ||
==== Design data panel ==== | ==== Design data panel ==== | ||
The design data panel allows you to choose the kind of experiment you want to perform. You can choose between two kinds of experimental design: | The design data panel allows you to choose the kind of experiment you want to perform. You can choose between two kinds of experimental design: | ||
Line 35: | Line 37: | ||
{{ ::species.png?nolink |}} | {{ ::species.png?nolink |}} | ||
==== Parameters ==== | ==== Parameters ==== | ||
- | This panel includes further parameters necessary to run an analysis. | + | This panel includes further parameters necessary to run an analysis.\\ |
**Filter circuits**: Check to obtain the circuits that best differentiate your phenotype. This option is only available in the //Prediction// tool. | **Filter circuits**: Check to obtain the circuits that best differentiate your phenotype. This option is only available in the //Prediction// tool. | ||
{{ ::filtercircuits.png?nolink |}} | {{ ::filtercircuits.png?nolink |}} | ||
Line 106: | Line 108: | ||
You can download the model statistics. | You can download the model statistics. | ||
* **Selected features**: You can download the filtered paths that best differentiate your phenotype. This section is only available when the //filter paths// option is selected. | * **Selected features**: You can download the filtered paths that best differentiate your phenotype. This section is only available when the //filter paths// option is selected. | ||
+ | ===== Workflow ===== | ||
+ | The prediction tool is based on the machine learning module of the Hipathia web tool, which can be summarized as follows: | ||
+ | * Expected input and output: | ||
+ | * Input features: hipathia circuit values. | ||
+ | * Input response: 1-D binary array with the same number of samples as the input features. | ||
+ | * Output 1: Cross-validation (CV) performance metrics. | ||
+ | * Output 2: The selected features with their respective interaction sign, sorted by their relevance. | ||
+ | * A positive sign indicates that a given feature pushes the prediction towards the positive class. | ||
+ | * Output 3: Statistics, plus ROC and PR curves, for a typical train-test split scenario. | ||
+ | * Output 4: Probability boxplots for the test set. | ||
+ | * Feature selection: | ||
+ | * We select the features that best discriminate between the response values by means of the LASSO [4] (using the ''glmnet'' R package, which implements a fast coordinate descent version of the LASSO [5]). | ||
+ | * We filter the feature space using those circuits selected in the previous step. | ||
+ | * Hyperparameter search (''C'', the //cost// or //margin//, and γ) for a non-linear SVM [6] with a radial-basis kernel: | ||
+ | * **γ**: determines the complexity of the SVM frontier (the decision boundary). | ||
+ | * **cost**: sets the penalty for samples that fall on the wrong side of the //margin// around the frontier established by the SVM; a larger cost tends to give a narrower margin. | ||
+ | * **method**: both γ and ''C'' are selected using a k-fold cross-validation procedure: | ||
+ | * For each combination of γ and ''C'' we train an SVM. | ||
+ | * We compute the mean misclassification error over the held-out folds. | ||
+ | * We select the best pair of hyperparameters (γ, ''C''), i.e. the pair with the lowest mean CV error. | ||
+ | * From now onwards we fix the features selected by the LASSO and the hyperparameters previously found. | ||
+ | * The SVM is trained with ''LIBSVM'' [2], a widely used library for SVM-based models, through the R interface provided by the ''e1071'' package [1]. | ||
+ | * Performance evaluation: | ||
+ | * We perform a k-fold cross-validation with the features and hyperparameters selected above in order to report the generalization capabilities of the method. | ||
+ | * The report contains a set of commonly used metrics for classification. | ||
+ | * We perform a train-test split analysis: | ||
+ | * We randomly select 30% of the samples as the test set. | ||
+ | * We train an SVM on the train set using the hyperparameters and features previously found. | ||
+ | * We provide summary statistics as in the case of the k-fold cross-validation. | ||
+ | * We plot the ROC and Precision-Recall (PR) curves along with the area under the curve. | ||
+ | * Note that all curve visualizations are produced with the specialized R package ''PRROC'' [3]. | ||
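The feature-selection step and the sign convention listed in the workflow can be illustrated with a minimal sketch. It is written in plain Python rather than R and is not the code used by the tool. Several things are assumptions made for illustration: the squared-error form of the LASSO objective (the loss that ''glmnet'' is run with for a binary response is not stated here), inputs that are already centred, relevance measured by the absolute value of the coefficient, and the hypothetical names ''lasso_cd'', ''select_features'' and ''lam''.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values inside [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1.

    Assumes X (a list of rows) and y are centred, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # feature is all zeros after centring
            # Covariance of feature j with the residual that excludes feature j
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


def select_features(names, beta):
    """Keep features with a non-zero coefficient and report their sign.

    Assumption: relevance is measured by |coefficient|, largest first.
    """
    kept = [(name, b) for name, b in zip(names, beta) if b != 0.0]
    kept.sort(key=lambda item: abs(item[1]), reverse=True)
    return [(name, "+" if b > 0 else "-") for name, b in kept]
```

Features whose coefficient is shrunk exactly to zero are dropped, which is how the LASSO filters the feature space; the sign of each surviving coefficient is what the selected-features table reports as ''+'' or ''-''.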
+ | |||
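The hyperparameter search described in the workflow (train a model for each pair of γ and ''C'', average the misclassification error over the held-out folds, keep the pair with the lowest mean error) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the tool's R code: ''grid_search'', ''cv_error'' and ''k_fold_indices'' are hypothetical names, and ''rbf_vote'' is a toy stand-in classifier (not an SVM) that uses γ in an RBF similarity and ignores the cost, included only so that the loop can be run.

```python
import itertools
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_error(fit_predict, X, y, folds, **params):
    """Mean misclassification error over the held-out folds."""
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in held_out], **params)
        wrong = sum(p != y[i] for p, i in zip(preds, held_out))
        errors.append(wrong / len(held_out))
    return sum(errors) / len(errors)


def grid_search(fit_predict, X, y, gammas, costs, k=5):
    """Score every (gamma, cost) pair on the same folds; return the best."""
    folds = k_fold_indices(len(X), k)
    scored = {(g, c): cv_error(fit_predict, X, y, folds, gamma=g, cost=c)
              for g, c in itertools.product(gammas, costs)}
    best = min(scored, key=scored.get)  # lowest mean CV error
    return best, scored


def rbf_vote(X_train, y_train, X_test, gamma, cost):
    """Toy stand-in for an SVM: sign of the summed RBF similarity per class.

    The cost argument is accepted but unused; a real SVM trainer would use it.
    """
    preds = []
    for x in X_test:
        score = 0.0
        for xt, yt in zip(X_train, y_train):
            sim = math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xt)))
            score += sim if yt == 1 else -sim
        preds.append(1 if score > 0 else 0)
    return preds
```

Any function with the same signature as ''rbf_vote'' can be passed to ''grid_search''; all pairs are scored on identical folds so that their errors are comparable.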
+ | |||
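The train-test split evaluation in the workflow (hold out 30% of the samples at random, then summarise the predictions on that held-out set) can be sketched as follows. This is a plain-Python illustration, not the tool's implementation, which draws its curves with ''PRROC''. The function names are hypothetical, and the area under the ROC curve is computed here through its rank interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
import random


def split_indices(n_samples, test_fraction=0.3, seed=0):
    """Randomly assign a fraction of the sample indices to the test set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = round(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)


def confusion_counts(labels, preds):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(l == 1 and p == 1 for l, p in zip(labels, preds))
    tn = sum(l == 0 and p == 0 for l, p in zip(labels, preds))
    fp = sum(l == 0 and p == 1 for l, p in zip(labels, preds))
    fn = sum(l == 1 and p == 0 for l, p in zip(labels, preds))
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}


def accuracy(labels, preds):
    """Fraction of samples whose predicted class matches the label."""
    counts = confusion_counts(labels, preds)
    return (counts["tp"] + counts["tn"]) / len(labels)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute the AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The scores passed to ''roc_auc'' would be the predicted probabilities of the positive class for the test samples, the same quantities shown in the probability boxplots.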
+ | === Breast Cancer Molecular Subtype Classification === | ||
+ | |||
+ | The experiment consists of classifying a given sample as Luminal A or Luminal B (molecular subtype). We use TCGA data (processed by Inma); no manual pathway filtering was applied. | ||
+ | |||
+ | **Model Analysis** | ||
+ | |||
+ | Hyperparameter search: | ||
+ | |||
+ | {{ :svm.performance.heatmap.png?direct&400 | Hyperparameter search heatmap}} | ||
+ | |||
+ | CV Performance: | ||
+ | {{ :model_stats.tsv| CV stats }} | ||
+ | |||
+ | The most relevant features along with their interaction sign: | ||
+ | |||
+ | |features |coefs | | ||
+ | |:-----------------------------------------------------------------|:-----| | ||
+ | |p53 signaling pathway: CDK1 CCNB3 |+ | | ||
+ | |p53 signaling pathway: CDK2 CCNE1 |+ | | ||
+ | |p53 signaling pathway: SERPINB5 |- | | ||
+ | |Oocyte meiosis: REC8* |+ | | ||
+ | |Neurotrophin signaling pathway: NFKB1 |+ | | ||
+ | |Neurotrophin signaling pathway: RHOA |+ | | ||
+ | |Amphetamine addiction: ARC |- | | ||
+ | |RIG-I-like receptor signaling pathway: CHUK IKBKB IKBKG |+ | | ||
+ | |Ras signaling pathway: RAP1A |- | | ||
+ | |Progesterone-mediated oocyte maturation: CDK1 |+ | | ||
+ | |Vascular smooth muscle contraction: ACTA2 |- | | ||
+ | |Complement and coagulation cascades: BDKRB1 |+ | | ||
+ | |Fanconi anemia pathway: RAD51C |+ | | ||
+ | |TGF-beta signaling pathway: ROCK1 |- | | ||
+ | |ErbB signaling pathway: ELK1* |- | | ||
+ | |HTLV-I infection: TP53 TBPL2 |+ | | ||
+ | |Platelet activation: ITPR1 |- | | ||
+ | |Pathways in cancer: E2F1 |+ | | ||
+ | |PI3K-Akt signaling pathway: BCL2 |- | | ||
+ | |Maturity onset diabetes of the young: NKX6-1 |+ | | ||
+ | |Signaling pathways regulating pluripotency of stem cells: MAPK1* |- | | ||
+ | |Rap1 signaling pathway: THBS1 |+ | | ||
+ | |HTLV-I infection: E2F1 |+ | | ||
+ | |Jak-STAT signaling pathway: CDKN1A |+ | | ||
+ | |Epstein-Barr virus infection: RB1 |- | | ||
+ | |Colorectal cancer: BIRC5 |+ | | ||
+ | |Signaling pathways regulating pluripotency of stem cells: MYC |- | | ||
+ | |Taste transduction: C00076* |- | | ||
+ | |Hepatitis B: JUN |- | | ||
+ | |Axon guidance: GSK3B |- | | ||
+ | |MAPK signaling pathway: MAPT |- | | ||
+ | |cAMP signaling pathway: HHIP |- | | ||
+ | |Fanconi anemia pathway: BRCA1 |+ | | ||
+ | |Pathways in cancer: CSF3R |+ | | ||
+ | |Cell cycle: CDC45 MCM7 MCM6 MCM5 MCM4 MCM3 MCM2 |+ | | ||
+ | |ErbB signaling pathway: CDKN1A |+ | | ||
+ | |HTLV-I infection: PTTG2 |+ | | ||
+ | |AMPK signaling pathway: CCNA2 |+ | | ||
+ | |Oocyte meiosis: CDC25C* |+ | | ||
+ | |Non-alcoholic fatty liver disease (NAFLD): BAX |+ | | ||
+ | |Hepatitis B: PCNA |+ | | ||
+ | |AMPK signaling pathway: G6PC |- | | ||
+ | |Adrenergic signaling in cardiomyocytes: BCL2 |- | | ||
+ | |HTLV-I infection: ANAPC10 CDC20 |- | | ||
+ | |Progesterone-mediated oocyte maturation: CDK1* |+ | | ||
+ | |Complement and coagulation cascades: C4A |- | | ||
+ | |Choline metabolism in cancer: WAS |+ | | ||
+ | |ErbB signaling pathway: STAT5A |- | | ||
+ | |Herpes simplex infection: FOS |- | | ||
+ | |Amyotrophic lateral sclerosis (ALS): DERL1 |+ | | ||
+ | |AGE-RAGE signaling pathway in diabetic complications: F3 |- | | ||
+ | |Non-alcoholic fatty liver disease (NAFLD): PKLR |+ | | ||
+ | |Maturity onset diabetes of the young: FOXA3 |- | | ||
+ | |AMPK signaling pathway: CPT1C |+ | | ||
+ | |PPAR signaling pathway: FADS2 |+ | | ||
+ | |Rap1 signaling pathway: C00076* |+ | | ||
+ | |cAMP signaling pathway: GRIN3A |+ | | ||
+ | |Glutamatergic synapse: CACNA1A |- | | ||
+ | |Progesterone-mediated oocyte maturation: MAPK14 |- | | ||
+ | |Salivary secretion: BEST2 |+ | | ||
+ | |Vibrio cholerae infection: PDIA4 |+ | | ||
+ | |cAMP signaling pathway: PLN |+ | | ||
+ | |Neurotrophin signaling pathway: JUN |- | | ||
+ | |Pathways in cancer: CCNA1 |- | | ||
+ | |Epithelial cell signaling in Helicobacter pylori infection: GIT1 |- | | ||
+ | |Renal cell carcinoma: TGFA |- | | ||
+ | |Influenza A: RNASEL |+ | | ||
+ | |Thyroid hormone signaling pathway: TP53* |+ | | ||
+ | |Epithelial cell signaling in Helicobacter pylori infection: CXCL1 |- | | ||
+ | |Signaling pathways regulating pluripotency of stem cells: MAPK14 |+ | | ||
+ | |Hepatitis C: EIF2S1 |+ | | ||
+ | |Proteoglycans in cancer: CTNNB1 |+ | | ||
+ | |Influenza A: STAT1 IRF9 |+ | | ||
+ | |Thyroid hormone signaling pathway: CTNNB1 |- | | ||
+ | |Taste transduction: PKD1L3 PKD2L1 |+ | | ||
+ | |Prostate cancer: C16038 |- | | ||
+ | |Basal cell carcinoma: PTCH1* |- | | ||
+ | |Toxoplasmosis: C06314 |- | | ||
+ | |Prostate cancer: BCL2 |- | | ||
+ | |Measles: EIF2S1 |+ | | ||
+ | |Acute myeloid leukemia: CCNA1 |- | | ||
+ | |Glucagon signaling pathway: CPT1C* |+ | | ||
+ | |||
+ | |||
+ | **Split Analysis** | ||
+ | |||
+ | Split Performance: | ||
+ | {{ :test_model_stats.tsv| Test stats }} | ||
+ | |||
+ | PR curve over the test set: | ||
+ | |||
+ | {{ :split_test_pr.png?400 | Precision-recall (PR) curve for the test split. }} | ||
+ | |||
+ | ROC curve over the test set: | ||
+ | |||
+ | {{ :split_test_roc.png?400 | ROC curve for the test split. }} | ||
+ | |||
+ | Probability for the test set: | ||
+ | |||
+ | {{ :test_probability_boxplot.png?400 | Probability boxplot for the test set. }} | ||
+ | |||
+ | |||