data_format
Differences
This shows you the differences between two versions of the page.
data_format [2019/07/19 14:11] – mmarin | data_format [2024/02/27 13:17] (current) – [Experimental design file format] krian | ||
---|---|---|---|
Line 3: | Line 3: | ||
Different types of data are used in Hipathia. Some of these data require a specific structure, explained at the following links: | Different types of data are used in Hipathia. Some of these data require a specific structure, explained at the following links: | ||
+ | **Note:** The recommended file extensions are ' | ||
- | [[Expression matrix file format | + | ===== Expression matrix file format |
- | [[Experimental design | + | Expression matrix |
- | [[Gene list file format | + | A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure (e.g. database or spreadsheet data). Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab stop character. |
+ | |||
+ | This file has two columns when there is a single sample, and one additional column for each further sample. | ||
+ | The first line is a header and must contain the sample names. | ||
+ | The first column corresponds to genes, probes or proteins, and the following IDs are accepted: | ||
+ | |||
+ | * Ensembl gene | ||
+ | * HGNC symbol | ||
+ | * Entrez id | ||
+ | * Affy HG U133A probeset | ||
+ | * Affy HG U133B probeset | ||
+ | * Affy HG U133-PLUS-2 probeset | ||
+ | * Affy HTA 2.0 | ||
+ | |||
+ | |||
+ | The remaining columns contain the expression values in numeric format, one column per sample. | ||
+ | |||
+ | Here is an example of a file with only one sample: | ||
+ | |||
+ | < | ||
+ | id sampleName | ||
+ | 1 0.3 | ||
+ | 2 1 | ||
+ | 3 0.73 | ||
+ | </ | ||
+ | |||
+ | And here is another example with more than one sample: | ||
+ | |||
+ | < | ||
+ | id sample1 sample2 sample3 | ||
+ | 1 0.31 0.6 0.24 | ||
+ | 2 1 0.81 0.91 | ||
+ | 3 0.7 0.9 0.3 | ||
+ | 4 0.23 0.45 0.33 | ||
+ | </ | ||
+ | |||
+ | For an example file, see {{: | ||
+ | |||
+ | **Note**: If probe expression values are provided, they are converted to gene expression values, computed as the average of all probes mapping to the gene. | ||
+ | |||
+ | ===== Experimental design file format ===== | ||
+ | |||
+ | The experimental design is a tab-separated values (TSV) file. It has two columns: the first contains the sample name and the second contains the phenotype. | ||
+ | < | ||
+ | sample1 Group_1 | ||
+ | sample2 Group_1 | ||
+ | sample3 Group_2 | ||
+ | </ | ||
+ | |||
+ | **Note**: For **paired data**, the experimental design file must be **ordered**: paired samples appear in the same order within each group, as in the example that follows. | ||
+ | |||
+ | Here is an example of a file with 4 paired samples (sample1_Normal and sample1_Treated are the same sample before and after treatment): | ||
+ | |||
+ | < | ||
+ | sample1_Normal Group_1 | ||
+ | sample2_Normal Group_1 | ||
+ | sample1_Treated Group_2 | ||
+ | sample2_Treated Group_2 | ||
+ | </ | ||
+ | |||
+ | For another example file, see {{: | ||
+ | |||
+ | ===== Gene list file format ===== | ||
+ | |||
+ | The gene list is a tab-separated values (TSV) file with a single column containing the Entrez IDs of the genes (one Entrez ID per line). | ||
+ | |||
+ | |||
+ | Here is an example of a file with 4 genes to be evaluated: | ||
+ | |||
+ | < | ||
+ | Gene_1 | ||
+ | Gene_2 | ||
+ | Gene_3 | ||
+ | Gene_4 | ||
+ | </ | ||
+ | ====== Character encoding ====== | ||
+ | We recommend using the **[[https:// |
data_format.1563545518.txt.gz · Last modified: 2019/07/19 14:11 by mmarin
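
The expression matrix and experimental design formats described on this page can be checked before upload with a short script. The following is a minimal Python sketch, not part of Hipathia itself. It makes several assumptions: the function names are hypothetical, the header row starts with a label for the ID column (as in the examples on this page), `probe_to_gene` is an illustrative mapping supplied by the caller, and every sample in the matrix is expected to appear in the design file. The averaging step only illustrates the note about probe values; it is not Hipathia's actual implementation.

```python
import csv
import io
from statistics import mean


def read_expression_matrix(text):
    """Parse a TSV expression matrix: a header row, then one row per ID."""
    # Skip completely empty lines (csv.reader yields [] for them).
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    samples = header[1:]  # assumption: first header cell labels the ID column
    if not samples:
        raise ValueError("expected at least one sample column")
    matrix = {}
    for row in body:
        if len(row) != len(header):
            raise ValueError(f"row {row[0]!r}: expected {len(header)} fields")
        # float() raises ValueError if a value is not numeric.
        matrix[row[0]] = [float(value) for value in row[1:]]
    return samples, matrix


def read_design(text):
    """Parse a two-column TSV design file, keeping the row order."""
    design = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue
        if len(row) != 2:
            raise ValueError(f"expected 2 fields (sample, group), got {row!r}")
        design.append((row[0], row[1]))
    return design


def unmatched_samples(samples, design):
    """Return matrix samples that have no row in the design file."""
    known = {name for name, _group in design}
    return [name for name in samples if name not in known]


def average_probes(matrix, probe_to_gene):
    """Average, per sample, all probes that map to the same gene.

    probe_to_gene is a hypothetical dict {probe_id: gene_id}; probes
    without a mapping are dropped.
    """
    grouped = {}
    for probe, values in matrix.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped.setdefault(gene, []).append(values)
    # zip(*rows) regroups the values column by column, i.e. per sample.
    return {gene: [mean(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}


if __name__ == "__main__":
    expression = "id\tsample1\tsample2\n1\t0.31\t0.6\n2\t1\t0.81\n"
    design_text = "sample1\tGroup_1\nsample2\tGroup_2\n"
    samples, matrix = read_expression_matrix(expression)
    design = read_design(design_text)
    print("samples missing from design:", unmatched_samples(samples, design))
```

Because `read_design` preserves row order, the returned list can also be inspected to confirm that paired samples are listed in the same order within each group.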