data_format
data_format [2024/02/27 13:12] – [Experimental design file format] krian | data_format [2024/02/27 13:17] (current) – [Experimental design file format] krian | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Data format ====== | ||
+ | Hipathia works with several types of data. Some of them must follow a specific structure, which is explained in the following sections. | ||
+ | |||
+ | **Note:** The recommended file extensions are ' | ||
+ | |||
+ | ===== Expression matrix file format ===== | ||
+ | |||
+ | The expression matrix file is a tab-separated values (TSV) file. | ||
+ | |||
+ | A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure (e.g. database or spreadsheet data). Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab stop character. [[https:// | ||
+ | |||
+ | |||
+ | This file has two columns when there is a single sample, and one additional column for each further sample. | ||
+ | The first line is a header and must contain the sample names. | ||
+ | The first column corresponds to genes, probes or proteins, and the following IDs are accepted: | ||
+ | |||
+ | * Ensembl gene | ||
+ | * HGNC symbol | ||
+ | * Entrez id | ||
+ | * Affy HG U133A probeset | ||
+ | * Affy HG U133B probeset | ||
+ | * Affy HG U133-PLUS-2 probeset | ||
+ | * Affy HTA 2.0 | ||
+ | |||
+ | |||
+ | The remaining columns contain the expression values in numeric format, one column per sample. | ||
+ | |||
+ | Here is an example of a file with only one sample: | ||
+ | |||
+ | < | ||
+ | id sampleName | ||
+ | 1 0.3 | ||
+ | 2 1 | ||
+ | 3 0.73 | ||
+ | </ | ||
+ | |||
+ | And here is another example with more than one sample: | ||
+ | |||
+ | < | ||
+ | id sample1 sample2 sample3 | ||
+ | 1 0.31 0.6 0.24 | ||
+ | 2 1 0.81 0.91 | ||
+ | 3 0.7 0.9 0.3 | ||
+ | 4 0.23 0.45 0.33 | ||
+ | </ | ||
+ | |||
+ | For a file example see {{: | ||
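As an illustration of the layout described above, here is a minimal Python sketch that reads an expression matrix and checks its shape. It is not part of Hipathia: the function name `read_expression_matrix` and the checks it performs are assumptions made for this example, and it only relies on the rules stated in this section (tab-separated fields, a header line with sample names, IDs in the first column, numeric values elsewhere).

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a tab-separated expression matrix.

    Returns (sample_names, {feature_id: [values, one per sample]}).
    Hypothetical helper written for this page, not a Hipathia function.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if len(header) < 2:
        raise ValueError("expected an ID column plus at least one sample column")
    samples = header[1:]
    matrix = {}
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        # float() raises ValueError if a value is not numeric
        matrix[row[0]] = [float(value) for value in row[1:]]
    return samples, matrix


# Small in-memory example with real tab characters between fields
example = "id\tsample1\tsample2\tsample3\n1\t0.31\t0.6\t0.24\n2\t1\t0.81\t0.91\n"
samples, matrix = read_expression_matrix(io.StringIO(example))
print(samples)
print(matrix["2"])
```

For a file on disk, the same function could be called with a handle opened in text mode, for example `open(path, newline="")`, where `path` is whatever location your file has.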
+ | |||
+ | **Note**: If probe expression values are provided, they are converted to gene expression values, computed as the average of all probes that map to the gene. | ||
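The probe-to-gene averaging mentioned in the note can be sketched as follows. This is only an illustration of the idea, not Hipathia's actual implementation: the function name, the `probe_to_gene` mapping, and the probe and gene names are all hypothetical, and dropping unmapped probes is an assumption of this sketch.

```python
from collections import defaultdict


def average_probes_by_gene(probe_values, probe_to_gene):
    """Average probe-level values into gene-level values, sample by sample.

    probe_values:  {probe_id: [value per sample]}
    probe_to_gene: {probe_id: gene_id}  (hypothetical annotation table)
    """
    grouped = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # assumption: probes without a gene mapping are dropped
        grouped[gene].append(values)
    # zip(*rows) walks the probes of one gene column by column (one column per sample)
    return {
        gene: [sum(column) / len(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Made-up probe values for two samples
probe_values = {
    "probe_a": [1.0, 3.0],
    "probe_b": [3.0, 5.0],
    "probe_c": [0.5, 0.25],
}
probe_to_gene = {"probe_a": "gene_x", "probe_b": "gene_x", "probe_c": "gene_y"}

gene_values = average_probes_by_gene(probe_values, probe_to_gene)
print(gene_values)
```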
+ | |||
+ | ===== Experimental design file format ===== | ||
+ | |||
+ | The experimental design file is a tab-separated values file with two columns: the first contains the sample name and the second contains the phenotype. | ||
+ | < | ||
+ | sample1 Group_1 | ||
+ | sample2 Group_1 | ||
+ | sample3 Group_2 | ||
+ | </ | ||
+ | |||
+ | **Note**: For **paired data**, the experimental design file must be **ordered**. | ||
+ | |||
+ | Here is an example of a file with 4 paired samples (sample1_Normal and sample1_Treated are the same sample before and after treatment): | ||
+ | |||
+ | < | ||
+ | sample1_Normal Group_1 | ||
+ | sample2_Normal Group_1 | ||
+ | sample1_Treated Group_2 | ||
+ | sample2_Treated Group_2 | ||
+ | </ | ||
+ | |||
+ | For another example file, see {{: | ||
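The ordering requirement for paired data can be checked before uploading with a small script like the one below. This is a sketch under explicit assumptions: it interprets "ordered" the way the example above suggests (the paired subjects appear in the same order within each of the two groups), and it assumes sample names follow the `<subject>_<condition>` pattern used in that example. Neither the function names nor the naming convention come from Hipathia itself.

```python
import csv
import io


def read_design(handle):
    """Return [(sample_name, phenotype), ...] in file order."""
    rows = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(row)}")
        rows.append((row[0], row[1]))
    return rows


def pairs_are_aligned(design):
    """Check that both groups list the same subjects in the same order.

    Assumption: sample names look like '<subject>_<condition>', so the
    subject is everything before the last underscore.
    """
    subjects_by_group = {}
    for sample, group in design:
        subject = sample.rsplit("_", 1)[0]
        subjects_by_group.setdefault(group, []).append(subject)
    orders = list(subjects_by_group.values())
    return len(orders) == 2 and orders[0] == orders[1]


text = (
    "sample1_Normal\tGroup_1\n"
    "sample2_Normal\tGroup_1\n"
    "sample1_Treated\tGroup_2\n"
    "sample2_Treated\tGroup_2\n"
)
design = read_design(io.StringIO(text))
print(pairs_are_aligned(design))
```

If your sample names do not encode the subject, the pairing would have to come from elsewhere (for instance a separate lookup table), and the check would need to be adapted accordingly.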
+ | |||
+ | ===== Gene list file format ===== | ||
+ | |||
+ | The gene list file is a tab-separated values file with a single column containing the Entrez IDs of the genes (one Entrez ID per line). | ||
+ | |||
+ | |||
+ | Here is an example of a file with 4 genes to be evaluated: | ||
+ | |||
+ | < | ||
+ | Gene_1 | ||
+ | Gene_2 | ||
+ | Gene_3 | ||
+ | Gene_4 | ||
+ | </ | ||
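A gene list can be sanity-checked with a few lines of Python. The sketch below assumes that real Entrez gene IDs are written as plain integers, so names such as `Gene_1` in the example above would be placeholders for them. The function name `read_gene_list` is hypothetical and the IDs used in the demonstration are only illustrative.

```python
def read_gene_list(lines):
    """Collect one Entrez ID per non-empty line, rejecting non-numeric entries."""
    ids = []
    for line_no, line in enumerate(lines, start=1):
        value = line.strip()
        if not value:
            continue  # ignore blank lines
        # Assumption: Entrez gene IDs consist of ASCII digits only
        if not (value.isascii() and value.isdigit()):
            raise ValueError(f"line {line_no}: {value!r} is not a numeric Entrez ID")
        ids.append(value)
    return ids


gene_ids = read_gene_list(["7157\n", "672\n", "\n", "1956\n"])
print(gene_ids)
```

Because the function accepts any iterable of lines, an open text file handle can be passed to it directly.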
+ | ====== Character encoding ====== | ||
+ | We recommend using the **[[https:// |