--- worked_example_prediction_train_and_test [2021/01/30 19:09] – [Prediction evaluation] cloucera
+++ worked_example_prediction_train_and_test [2021/01/31 17:33] (current) – [Test report] cloucera
@@ Line 7: / Line 7: @@
 The training phase is used to build the predictor. To do so, we first need a training set (a dataset of individuals belonging to two classes and properly labeled, or individuals with the associated measurement). This training set must be large and diverse enough to represent the real population of individuals on which we will be using the trained predictor. Then the predictor “learns” from this example training dataset how the classes are related to the attributes, which we will call features from now on. The performance is assessed by cross-validation, and several scores give us an idea of how specific and sensitive the predictor is. Once the predictor is trained and its performance is acceptable, it is ready to be used. New samples can then be tested with the predictor, which returns a class membership probability.
+It is important to note that, when the features used for training the predictor have a functional meaning by themselves, as the signaling circuits do, the reasons behind each decision of the predictor are straightforward to interpret and relate to the biological processes that define the differences between the cases compared.
 Below is a worked example of how to train and use a predictor/estimator/classifier. In this example, we will work with a subsample of the Breast Cancer dataset from The Cancer Genome Atlas (TCGA) repository: those patients with a known molecular subtype of luminal A or luminal B [[https://portal.gdc.cancer.gov/projects/TCGA-BRCA | Link to dataset]].
@@ Line 57: / Line 59: @@
 ===== Training report =====
 Once the launched study is finished, the report/results will be available in “My studies”.
 The report page of the Prediction-training tool includes different output results. You can download any table or image shown on the results page by clicking on the name right before it. You can also download the pathway and function matrices by clicking on //Circuit values//. For more information about each result, please read the [[prediction#training_report|Prediction - Training report]] and [[prediction#workflow|Prediction - Workflow]] sections.
@@ Line 65: / Line 68: @@
 **Model Analysis**
-Hyperparameter search:
-{{ :svm.performance.heatmap.png?direct&400 | Hyperparameter search heatmap}}
 CV Performance:
+{{ ::model_stats.png?nolink |}}
 {{ :model_stats.txt| CV stats }}
-The most relevant features along with their interaction sign:
+When the features used for training the predictor have a functional meaning by themselves, as the signaling circuits do, the reasons behind each decision of the predictor are straightforward to interpret and relate to the biological processes that define the differences between the cases compared. The most relevant circuits, along with their interaction sign, are:
 |**Selected circuits name**                                                          |**Coef sign** |
@@ Line 124: / Line 125: @@
 PR curve over the test set:
-{{ :split_test_pr.png?400 | Precision-recall (PR) curve for the test split. }}
+{{ :split_test_pr.png?400 | Precision-recall (PR) curve for the simulated test split. }}
 ROC curve over the test set:
@@ Line 130: / Line 131: @@
 {{ :split_test_roc.png?400 | ROC curve for the test split. }}
-Probability for the test set:
+Probability distributions of the positive class with respect to the true labels for the simulated test split:
 {{ :test_probability_boxplot.png?400 | Probability distribution for the test split. }}
@@ Line 176: / Line 177: @@
 This is the most important result of our predictor, which is a matrix with three columns:
-  * Sample name: all the 125 samples in the expression matrix file used.
+  * Sample name: all the 125 samples (104 luminal A, 21 luminal B) in the expression matrix file used.
   * Prediction: the predicted group, LumB (Luminal B) or LumA (Luminal A).
   * Probability LumB: the probability that the sample is LumB; a value of 1 means the predictor is 100% sure that the sample is LumB.
 You can download the matrix of predicted experimental design by clicking on //Prediction results//.
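As a rough illustration of how the Prediction and Probability LumB columns relate, the sketch below assigns a label from the probability. The sample names, the probability values and the 0.5 cut-off are all hypothetical; the threshold actually applied by the tool is not stated here.

```python
# Hypothetical rows (sample name, probability of LumB); these are made-up
# values for illustration, not results from the TCGA-BRCA study.
rows = [("sample_1", 0.93), ("sample_2", 0.12), ("sample_3", 0.50)]

# Assumption: a probability of at least 0.5 is reported as LumB.
THRESHOLD = 0.5


def label_from_probability(prob_lumb, threshold=THRESHOLD):
    """Return the predicted group for a given probability of LumB."""
    return "LumB" if prob_lumb >= threshold else "LumA"


# Build the three-column layout: sample name, prediction, probability LumB.
predictions = [(name, label_from_probability(p), p) for name, p in rows]

for name, group, prob in predictions:
    print(f"{name}\t{group}\t{prob:.2f}")
```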
 ===== Prediction evaluation =====
-Note that for this example we know the ground truth labels beforehand, so we can compute the classification metrics as we did for the simulated split during the training phase. The ROC and PR curves are quite similar to those of the simulated split, which indicates that the tool generalizes well for this problem. The same trend can be observed in the companion metrics table.
+Note that for this example we know the ground truth labels beforehand, so we can compute the classification metrics as we did for the simulated split during the training phase. The ROC and PR curves are quite similar to those of the simulated split, which indicates that the tool generalizes well for this problem. The same trend can be observed in the companion metrics table and the confusion matrix.
-{{ :split_prediction_roc.png?nolink |}}
+{{ :split_prediction_roc.png?400 | ROC curve for the holdout split }}
-{{ :split_prediction_pr.png?nolink |}}
+{{ :split_prediction_pr.png?400 | PR curve for the holdout split }}
-{{ :split_prediction_probability_boxplot.png?nolink |}}
+{{ :split_prediction_probability_boxplot.png?400 | Probability distribution of the SVM for the holdout split }}
-statistic	value
-Sensitivity	0.761904761904762
-Specificity	0.913461538461538
-Positive Predictive Value	0.64
-Negative Predictive Value	0.95
-False Positive Rate	0.0865384615384616
-False Negative Rate	0.238095238095238
-Likelihood Ratio Positive	8.8042328042328
-Likelihood Ratio Negative	0.260651629072682
-Percentage of data points in the main diagonal	0.888
-Percentage of data points in the main diagonal corrected for agreement by chance	0.627659574468085
-Rand index	0.799483870967742
-Rand index corrected for agreement by chance	0.525491509396793
-Total Accuracy	0.888
-==== Confusion Matrix and Statistics ====
+^ statistic	^ value ^
+| Sensitivity	| 0.761904761904762 |
+| Specificity	| 0.913461538461538 |
+| Positive Predictive Value	| 0.64 |
+| Negative Predictive Value	| 0.95 |
+| False Positive Rate	| 0.0865384615384616 |
+| False Negative Rate	| 0.238095238095238 |
+| Likelihood Ratio Positive	| 8.8042328042328 |
+| Likelihood Ratio Negative	| 0.260651629072682 |
+| Percentage of data points in the main diagonal	| 0.888 |
+| Percentage of data points in the main diagonal corrected for agreement by chance	| 0.627659574468085 |
+| Rand index	| 0.799483870967742 |
+| Rand index corrected for agreement by chance	| 0.525491509396793 |
+| Total Accuracy	| 0.888 |
 ^              ^^              Reference              ^^
@@ Line 227: / Line 213: @@
 ^    :::    ^    LumB    |    9    |    16    |
-^              Accuracy              |||    0.888    |
-^              95% CI              |||    (0.8192, 0.9374)    |
-^              No Information Rate              |||    0.832    |
-^              P-Value [Acc > NIR]              |||    0.0547    |
-^              Kappa              |||    0.6277    |
-^              Mcnemar's Test              ^^^^
-^              P-Value              |||    0.4227    |
-^              Sensitivity              |||    0.9135    |
-^              Specificity              |||    0.7619    |
-^              Pos Pred Value              |||    0.9500    |
-^              Neg Pred Value              |||    0.6400    |
-^              Prevalence              |||    0.8320    |
-^              Detection Rate              |||    0.7600    |
-^              Detection Prevalence              |||    0.8000    |
-^              Balanced Accuracy              |||    0.8377    |
-It is important to note that, when the features used for training the predictor have a functional meaning by themselves, as the signaling circuits do, the reasons behind each decision of the predictor are straightforward to interpret and relate to the biological processes that define the differences between the cases compared.
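The values in the metrics table can be reproduced from the confusion matrix counts. The sketch below is a minimal illustration, not the tool's own code: it takes LumB as the positive class and uses the predicted-LumB row shown above (9 LumA, 16 LumB). The predicted-LumA row is not shown in this excerpt, so it is inferred here from the class totals (104 LumA, 21 LumB). The function and variable names are hypothetical.

```python
# Counts for the holdout split, with LumB as the positive class.
tp = 16        # predicted LumB, reference LumB (from the confusion matrix)
fp = 9         # predicted LumB, reference LumA (from the confusion matrix)
fn = 21 - tp   # inferred: 21 LumB samples in total
tn = 104 - fp  # inferred: 104 LumA samples in total


def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from the four counts."""
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Positive Predictive Value": tp / (tp + fp),
        "Negative Predictive Value": tn / (tn + fn),
        "False Positive Rate": 1 - specificity,
        "False Negative Rate": 1 - sensitivity,
        "Likelihood Ratio Positive": sensitivity / (1 - specificity),
        "Likelihood Ratio Negative": (1 - sensitivity) / specificity,
        "Total Accuracy": accuracy,
        # Accuracy corrected for agreement by chance (Cohen's kappa).
        "Kappa": (accuracy - expected) / (1 - expected),
    }


metrics = binary_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these assumptions, the kappa value corresponds to the "Percentage of data points in the main diagonal corrected for agreement by chance" row of the table.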
 ===== Discussion =====