data_format

Differences

This shows you the differences between two versions of the page.

data_format [2019/07/19 14:11]
mmarin
data_format [2020/04/03 20:18] (current)
Line 1: Line 1:
 ====== Data format ======
  
-Different types of data are used in Hipathia. Some of this data require a certain structure indicated on the following links:
+Different types of data are used in Hipathia. Some of them require a specific structure, explained in the sections below:
  
 +===== Expression matrix file format =====
  
-[[Expression matrix file format |Expression matrix file format]]
+The expression matrix file is a tab-separated values (TSV) file.
  
-[[Experimental design file format Experimental design file format ]]
+A tab-separated values (TSV) file is a simple text format for storing tabular data (e.g. database or spreadsheet data). Each record in the table is one line of the text file, and each field value in a record is separated from the next by a tab character. [[https://en.wikipedia.org/wiki/Tab-separated_values|More about TSV...]]
  
-[[Gene list file format | Gene list file format ]]
+
 +This file has two columns if there is only one sample, and one additional column for each further sample.
 +The first line is a header and must contain the sample names.  
 +The first column corresponds to genes, probes or proteins, and the following IDs are accepted: 
 + 
 +  * Ensembl gene 
 +  * HGNC symbol 
 +  * Entrez id 
 +  * Affy HG U133A probeset 
 +  * Affy HG U133B probeset 
 +  * Affy HG U133-PLUS-2 probeset 
 +  * Affy HTA 2.0 
 + 
 + 
 +The remaining columns contain the numeric expression values, one column per sample.
 + 
 +Here is an example of a file with only one sample:
 + 
 +<​code>​ 
 +id sampleName 
 +1 0.3 
 +2 1 
 +3 0.73 
 +</​code>​ 
 + 
 +And here is another example with more than one sample: 
 + 
 +<​code>​ 
 +id sample1 sample2 sample3 
 +1 0.31 0.6 0.24 
 +2 1 0.81 0.91 
 +3 0.7 0.9 0.3 
 +4 0.23 0.45 0.33 
 +</​code>​ 
 + 
 +For a complete example file, see {{:brca_genes_vals_bn.txt|}}.
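The layout described above (a header line with the sample names, an ID in the first column, numeric values in the remaining columns) can be checked before upload. The following is a minimal sketch in Python using only the standard library; the function name `read_expression_matrix` is hypothetical and is not part of Hipathia.

```python
import csv
import io


def read_expression_matrix(text):
    """Parse TSV text into (sample_names, {feature_id: [values]}).

    Hypothetical helper for illustration; not part of Hipathia.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    if not rows:
        raise ValueError("file is empty")
    header, body = rows[0], rows[1:]
    if len(header) < 2:
        raise ValueError("need an ID column plus at least one sample column")
    samples = header[1:]
    matrix = {}
    for row in body:
        if len(row) != len(header):
            raise ValueError(
                f"row {row[0]!r} has {len(row)} fields, expected {len(header)}"
            )
        # float() raises ValueError if a value is not numeric
        matrix[row[0]] = [float(v) for v in row[1:]]
    return samples, matrix


example = (
    "id\tsample1\tsample2\tsample3\n"
    "1\t0.31\t0.6\t0.24\n"
    "2\t1\t0.81\t0.91\n"
)
samples, matrix = read_expression_matrix(example)
print(samples)
print(matrix["2"])
```

A row with a missing field or a non-numeric value raises `ValueError`, which is usually easier to diagnose locally than after submitting the file.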
 + 
 +**Note**: If probe expression values are provided, they are converted to gene expression values by averaging the values of all probes that map to the same gene.
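The probe-to-gene averaging in the note can be illustrated with a short Python sketch. This is not Hipathia's implementation: the function name, the probe and gene names, and the choice to drop probes without a mapping are all assumptions made for the example.

```python
from collections import defaultdict


def probes_to_genes(probe_values, probe_to_gene):
    """Average probe rows per gene, sample by sample.

    Hypothetical helper; probes missing from probe_to_gene are dropped
    (an assumption, the page does not say how they are handled).
    """
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped[gene].append(values)
    # zip(*rows) walks the rows column by column, i.e. one sample at a time
    return {
        gene: [sum(col) / len(col) for col in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Made-up probe and gene names, two samples per probe
probe_values = {
    "probe_a": [1.0, 2.0],
    "probe_b": [3.0, 4.0],
    "probe_c": [0.5, 0.5],
}
probe_to_gene = {"probe_a": "gene_x", "probe_b": "gene_x", "probe_c": "gene_y"}
gene_values = probes_to_genes(probe_values, probe_to_gene)
print(gene_values["gene_x"])  # [2.0, 3.0]
```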
 + 
 +===== Experimental design file format ===== 
 + 
 +The experimental design file is a tab-separated values (TSV) file with two columns: the first contains the sample name and the second contains the phenotype.
 + 
 + 
 +**Note**: For **paired data**, the experimental design file must be **ordered**: in the example below, the samples are listed in the same order within each group.
 + 
 +Here is an example of a file with 4 paired samples, that is, 2 pairs (sample1_Normal and sample1_Treated are the same sample before and after treatment):
 + 
 +<​code>​ 
 +sample1_Normal Group_1 
 +sample2_Normal Group_1 
 +sample1_Treated Group_2 
 +sample2_Treated Group_2 
 +</​code>​ 
 + 
 +For another example file, see {{:brca_normal-basal_ed.txt|}}.
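For paired data, the ordering requirement can be checked with a small script. The sketch below is in Python and rests on an assumption taken only from the example above: that sample names follow a `<subject>_<condition>` pattern, so the subject is the part before the last underscore. Both function names are hypothetical.

```python
import csv
import io


def read_design(text):
    """Parse a two-column TSV into a list of (sample, phenotype) pairs."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    for row in rows:
        if len(row) != 2:
            raise ValueError(f"expected 2 columns, got {len(row)}: {row!r}")
    return [(sample, group) for sample, group in rows]


def paired_order_ok(design):
    """Return True if subjects appear in the same order in every group.

    Assumes names look like '<subject>_<condition>' (illustrative only).
    """
    by_group = {}
    for sample, group in design:
        subject = sample.rsplit("_", 1)[0]
        by_group.setdefault(group, []).append(subject)
    orders = list(by_group.values())
    return all(order == orders[0] for order in orders)


design_text = (
    "sample1_Normal\tGroup_1\n"
    "sample2_Normal\tGroup_1\n"
    "sample1_Treated\tGroup_2\n"
    "sample2_Treated\tGroup_2\n"
)
design = read_design(design_text)
print(paired_order_ok(design))  # True
```

If your sample names follow a different convention, replace the `rsplit` line with whatever extracts the subject identifier.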
 + 
 +===== Gene list file format =====
 + 
 +The gene list is a tab-separated values (TSV) file with a single column containing the Entrez IDs of the genes, one ID per line.
 + 
 + 
 +Here is an example of a file with 4 genes to be evaluated:  
 + 
 +<​code>​ 
 +Gene_1 
 +Gene_2 
 +Gene_3 
 +Gene_4 
 +</​code>​
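A gene list can be read and sanity-checked in a few lines of Python. This sketch assumes that Entrez IDs are written as plain integers; it reports anything else separately rather than rejecting the file. The function name and the sample IDs are illustrative.

```python
def read_gene_list(text):
    """Return (ids, rejected): numeric IDs in file order, and other lines.

    Hypothetical helper; blank lines are skipped and duplicates are kept once.
    """
    ids, rejected, seen = [], [], set()
    for line in text.splitlines():
        value = line.strip()
        if not value or value in seen:
            continue
        seen.add(value)
        # Assumption: a valid Entrez ID consists only of digits
        (ids if value.isdigit() else rejected).append(value)
    return ids, rejected


# Illustrative input: two numeric IDs, a duplicate, a blank line, a placeholder
gene_text = "672\n7157\n672\n\nGene_1\n"
ids, rejected = read_gene_list(gene_text)
print(ids)       # ['672', '7157']
print(rejected)  # ['Gene_1']
```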
  
data_format.1563545501.txt.gz · Last modified: 2020/04/03 20:17 (external edit)