data_format
data_format [2024/02/27 13:14] – [Experimental design file format] krian
data_format [2024/02/27 13:17] (current) – [Experimental design file format] krian
Line 1: Line 1:
 +====== Data format ======
  
 +Hipathia uses several types of data. Some of these data must follow a specific structure, described in the sections below.
 +
 +**Note:** The recommended file extensions are '.txt' or '.tsv'.
 +
 +===== Expression matrix file format =====
 +
 +The expression matrix file is a tab-separated values file.
 +
 +A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure (e.g. database or spreadsheet data). Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab stop character. [[https://en.wikipedia.org/wiki/Tab-separated_values|More about TSV...]]
 +
 +
 +This file has one ID column followed by one column per sample: two columns if there is only one sample, and more than two if there are several.
 +The first line is a header and must contain the sample names.
 +The first column identifies genes, probes or proteins. The following ID types are accepted:
 +
 +  * Ensembl gene
 +  * HGNC symbol
 +  * Entrez id
 +  * Affy HG U133A probeset
 +  * Affy HG U133B probeset
 +  * Affy HG U133-PLUS-2 probeset
 +  * Affy HTA 2.0
 +
 +
 +The remaining columns contain the numeric gene expression values, one column per sample.
 +
 +Here is an example of a file with only one sample:
 +
 +<code>
 +id sampleName
 +1 0.3
 +2 1
 +3 0.73
 +</code>
 +
 +And here is another example with more than one sample:
 +
 +<code>
 +id sample1 sample2 sample3
 +1 0.31 0.6 0.24
 +2 1 0.81 0.91
 +3 0.7 0.9 0.3
 +4 0.23 0.45 0.33
 +</code>
 +
 +For an example file, see {{:brca_genes_vals_bn.txt|}}.
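
The layout described above can be checked before uploading a file. The following is a minimal sketch, not part of Hipathia: the function name `read_expression_matrix` is hypothetical, and it assumes the first header cell labels the ID column, as in the examples above.

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a tab-separated expression matrix (hypothetical helper).

    Returns (sample_names, values), where values maps each ID to a list
    of floats, one per sample.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if len(header) < 2:
        raise ValueError("expected an ID column plus at least one sample column")
    sample_names = header[1:]  # assumption: first header cell labels the ID column
    values = {}
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        # float() raises ValueError if an expression value is not numeric
        values[row[0]] = [float(cell) for cell in row[1:]]
    return sample_names, values


example = "id\tsample1\tsample2\tsample3\n1\t0.31\t0.6\t0.24\n2\t1\t0.81\t0.91\n"
samples, matrix = read_expression_matrix(io.StringIO(example))
print(samples)      # ['sample1', 'sample2', 'sample3']
print(matrix["2"])  # [1.0, 0.81, 0.91]
```

For a file on disk, the same function can be called with `open(path, encoding="utf-8", newline="")` as the handle.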
 +
 +**Note**: If probe expression values are provided, they are converted to gene expression values, computed as the average of all probes that map to the gene.
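
To illustrate the averaging described in the note, here is a small sketch. It is not Hipathia's implementation: the function `average_probes`, the probe IDs and the probe-to-gene mapping are all hypothetical, and ignoring unmapped probes is an assumption, since the page does not say how they are handled.

```python
from collections import defaultdict


def average_probes(probe_values, probe_to_gene):
    """Collapse probe-level values to gene-level values by averaging.

    probe_values maps a probe ID to a list of values, one per sample.
    probe_to_gene maps a probe ID to the gene it belongs to.
    """
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # assumption: unmapped probes are ignored
            grouped[gene].append(values)
    # zip(*rows) walks the samples column by column
    return {
        gene: [sum(column) / len(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Hypothetical probes: p1 and p2 map to geneA, p3 maps to geneB
probe_values = {"p1": [1.0, 3.0], "p2": [3.0, 5.0], "p3": [0.5, 0.25]}
probe_to_gene = {"p1": "geneA", "p2": "geneA", "p3": "geneB"}
gene_values = average_probes(probe_values, probe_to_gene)
print(gene_values)  # {'geneA': [2.0, 4.0], 'geneB': [0.5, 0.25]}
```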
 +
 +===== Experimental design file format =====
 +
 +The experimental design file is a tab-separated values file. It has two columns: the first contains the sample name and the second contains the phenotype.
 +<code>
 +sample1 Group_1
 +sample2 Group_1
 +sample3 Group_2
 +</code>
 +
 +**Note**: For **paired data**, the experimental design file must be **ordered**: as in the example below, paired samples appear in the same order within each group.
 +
 +Here is an example of a file with 4 paired samples (sample1_Normal and sample1_Treated are the same sample before and after treatment):
 +
 +<code>
 +sample1_Normal Group_1
 +sample2_Normal Group_1
 +sample1_Treated Group_2
 +sample2_Treated Group_2
 +</code>
 +
 +For another example file, see {{:brca_normal-basal_ed.txt|}}.
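
The two-column layout and the ordering requirement for paired data can be sketched as follows. The helper names `read_design` and `pairs_by_position` are hypothetical. Pairing the i-th sample of one group with the i-th sample of the other is an assumption inferred from the paired example above, because the page does not spell out the exact ordering rule.

```python
import csv
import io


def read_design(handle):
    """Return a list of (sample_name, phenotype) pairs in file order."""
    design = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(row)}")
        design.append((row[0], row[1]))
    return design


def pairs_by_position(design):
    """Pair the i-th sample of the first group with the i-th of the second.

    Assumption: this positional rule is inferred from the example file.
    """
    groups = {}
    for sample, phenotype in design:
        groups.setdefault(phenotype, []).append(sample)
    if len(groups) != 2:
        raise ValueError("expected exactly two phenotypes")
    first, second = groups.values()  # dicts keep insertion order
    if len(first) != len(second):
        raise ValueError("paired groups must have the same number of samples")
    return list(zip(first, second))


text = (
    "sample1_Normal\tGroup_1\n"
    "sample2_Normal\tGroup_1\n"
    "sample1_Treated\tGroup_2\n"
    "sample2_Treated\tGroup_2\n"
)
design = read_design(io.StringIO(text))
pairs = pairs_by_position(design)
print(pairs)
# [('sample1_Normal', 'sample1_Treated'), ('sample2_Normal', 'sample2_Treated')]
```

Printing the pairs before running an analysis is a quick way to confirm that each "before" sample lines up with its own "after" sample.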
 +
 +===== Gene list file format =====
 +
 +The gene list file is a tab-separated values file. It has a single column containing the Entrez IDs of the genes (one Entrez ID per line).
 +
 +
 +Here is an example of a file with 4 genes to be evaluated: 
 +
 +<code>
 +Gene_1
 +Gene_2
 +Gene_3
 +Gene_4
 +</code>
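
A gene list can be validated with a few lines of code. This is a minimal sketch with hypothetical helper names (`read_gene_list`, `non_numeric_entries`). It relies on the fact that Entrez gene IDs are integers, and the IDs in the usage example are illustrative values only.

```python
def read_gene_list(lines):
    """Read one ID per line, skipping blank lines."""
    genes = []
    for line_no, line in enumerate(lines, start=1):
        entry = line.strip()
        if not entry:
            continue
        if "\t" in entry:
            raise ValueError(f"line {line_no}: expected a single column")
        genes.append(entry)
    return genes


def non_numeric_entries(genes):
    """Return entries that cannot be Entrez IDs because they are not integers."""
    return [gene for gene in genes if not gene.isdigit()]


genes = read_gene_list(["7157\n", "672\n", "\n", "Gene_3\n"])
print(genes)                       # ['7157', '672', 'Gene_3']
print(non_numeric_entries(genes))  # ['Gene_3']
```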
 +====== Character encoding ======
 +We recommend using the **[[https://en.wikipedia.org/wiki/UTF-8 | UTF-8]]** character encoding for your content or data.
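
One way to check the encoding before uploading is to try decoding the raw bytes as UTF-8. This is a minimal sketch; the function name `is_utf8` is hypothetical and is not part of Hipathia.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_utf8("café".encode("utf-8")))    # True
print(is_utf8("café".encode("latin-1")))  # False
```

For a real file, the bytes can be obtained with `pathlib.Path(path).read_bytes()`.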