Differences

This shows you the differences between two versions of the page.

--- data_format [2019/07/19 14:11] – mmarin
+++ data_format [2024/02/27 13:17] (current) – [Experimental design file format] krian
@@ Line 3: / Line 3: @@
 Different types of data are used in Hipathia. Some of this data require a certain structure explained on the following links:
+**Note:** The recommended file extensions are '.txt' or '.tsv'.
-[[Expression matrix file format |Expression matrix file format]]
+===== Expression matrix file format =====
-[[Experimental design file format | Experimental design file format ]]
+The expression matrix file is a tab-separated values (TSV) file.
-[[Gene list file format | Gene list file format ]]
+A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure (e.g. database or spreadsheet data). Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab stop character. [[https://en.wikipedia.org/wiki/Tab-separated_values|More about TSV...]]
+This file has two columns if there is only one sample, and one additional column for each further sample.
+The first line is a header and must contain the sample names.
+The first column corresponds to genes, probes or proteins, and the following IDs are accepted:
+  * Ensembl gene
+  * HGNC symbol
+  * Entrez id
+  * Affy HG U133A probeset
+  * Affy HG U133B probeset
+  * Affy HG U133-PLUS-2 probeset
+  * Affy HTA 2.0
+The remaining columns contain the expression values of each sample, in numeric format.
+Here is an example of a file with only one sample:
+<code>
+id	sampleName
+	0.3
+	1
+	0.73
+</code>
+And here is another example with more than one sample:
+<code>
+id	sample1	sample2	sample3
+	0.31	0.6	0.24
+	1	0.81	0.91
+	0.7	0.9	0.3
+	0.23	0.45	0.33
+</code>
+For an example file, see {{:brca_genes_vals_bn.txt|}}
+**Note**: If probe expression values are provided, they are converted to gene expression values, computed as the average value of all the probes mapping to the gene.
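The layout above can be checked before upload with a short script. The following is a minimal Python sketch, not Hipathia's own code: the function names and the `probe_to_gene` mapping are hypothetical, and the averaging step only illustrates the idea described in the note (the mean of all probes mapping to the same gene).

```python
import csv
import io
from collections import defaultdict


def read_expression_matrix(handle):
    """Parse a tab-separated expression matrix into (samples, {id: [values]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first header cell labels the ID column
    if not samples:
        raise ValueError("header must name at least one sample")
    matrix = {}
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        # float() raises ValueError if a value is not numeric
        matrix[row[0]] = [float(value) for value in row[1:]]
    return samples, matrix


def average_probes(matrix, probe_to_gene):
    """Average, per sample, all probes that map to the same gene.

    probe_to_gene is a hypothetical dict {probe_id: gene_id} supplied by the caller.
    """
    grouped = defaultdict(list)
    for probe, values in matrix.items():
        grouped[probe_to_gene[probe]].append(values)
    return {
        gene: [sum(column) / len(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Illustrative data only: probe and gene names are made up.
text = "id\ts1\ts2\nprobe_a\t0.25\t0.5\nprobe_b\t0.75\t1.5\nprobe_c\t1.0\t0.5\n"
samples, matrix = read_expression_matrix(io.StringIO(text))
genes = average_probes(
    matrix, {"probe_a": "GENE_X", "probe_b": "GENE_X", "probe_c": "GENE_Y"}
)
```

The parser is strict about the field count on every line, so a row with a missing value or a stray tab fails early with its line number.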
+===== Experimental design file format =====
+The experimental design file is a tab-separated values (TSV) file. It has two columns: the first contains the sample name and the second contains the phenotype.
+<code>
+sample1	Group_1
+sample2	Group_1
+sample3	Group_2
+</code>
+**Note**: In the case of **paired data**, the experimental design file must be **ordered**: the samples of each pair should appear in the same order within each group, as in the example below.
+Here is an example of a file with 4 paired samples (sample1_Normal and sample1_Treated are the same sample before and after treatment):
+<code>
+sample1_Normal	Group_1
+sample2_Normal	Group_1
+sample1_Treated	Group_2
+sample2_Treated	Group_2
+</code>
+For another example file, see {{:brca_normal-basal_ed.txt|}}.
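The ordering requirement for paired data can be checked mechanically. The sketch below is an illustration under stated assumptions, not part of Hipathia: it assumes sample names follow the `<subject>_<condition>` pattern used in the example above (the format itself does not require this), and the function names are hypothetical.

```python
import csv
import io


def read_design(handle):
    """Parse a two-column, tab-separated design file into [(sample, group), ...]."""
    design = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 fields, got {len(row)}")
        design.append((row[0], row[1]))
    return design


def paired_order_is_consistent(design):
    """Return True if every group lists its subjects in the same order.

    Assumption: the subject is the part of the sample name before the last '_'.
    """
    subjects_by_group = {}
    for sample, group in design:
        subject = sample.rsplit("_", 1)[0]
        subjects_by_group.setdefault(group, []).append(subject)
    orders = list(subjects_by_group.values())
    return all(order == orders[0] for order in orders)


ordered = (
    "sample1_Normal\tGroup_1\nsample2_Normal\tGroup_1\n"
    "sample1_Treated\tGroup_2\nsample2_Treated\tGroup_2\n"
)
shuffled = (
    "sample1_Normal\tGroup_1\nsample2_Normal\tGroup_1\n"
    "sample2_Treated\tGroup_2\nsample1_Treated\tGroup_2\n"
)
ordered_ok = paired_order_is_consistent(read_design(io.StringIO(ordered)))
shuffled_ok = paired_order_is_consistent(read_design(io.StringIO(shuffled)))
```

If your sample names use a different convention, replace the `rsplit` step with whatever identifies the subject behind each sample.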
+===== Gene list file format =====
+The gene list file is a tab-separated values (TSV) file. It has a single column containing the Entrez IDs of the genes (one Entrez ID per line).
+Here is an example of a file with 4 genes to be evaluated:
+<code>
+Gene_1
+Gene_2
+Gene_3
+Gene_4
+</code>
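A gene list can be validated in a few lines. This is a minimal sketch with a hypothetical function name; it relies on Entrez gene IDs being plain integers, so placeholder names such as `Gene_1` in the example above would be rejected. The IDs in the usage lines are illustrative.

```python
import io


def read_gene_list(handle):
    """Read one Entrez ID per line, skipping blank lines."""
    genes = []
    for line_no, line in enumerate(handle, start=1):
        gene_id = line.strip()
        if not gene_id:
            continue
        if not gene_id.isdigit():
            raise ValueError(f"line {line_no}: {gene_id!r} is not a numeric Entrez ID")
        genes.append(gene_id)
    return genes


gene_ids = read_gene_list(io.StringIO("7157\n672\n\n675\n"))
```

The IDs are kept as strings so that they can be matched directly against the first column of an expression matrix.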
+===== Character encoding =====
+We recommend using the **[[https://en.wikipedia.org/wiki/UTF-8 | UTF-8]]** character encoding for your content or data.
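Whether a file is valid UTF-8 can be tested by trying to decode its bytes. The helper below is a small sketch with a hypothetical name; it only reports whether decoding succeeds and does not convert anything.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_ok = is_utf8("señal".encode("utf-8"))
latin1_ok = is_utf8("señal".encode("latin-1"))
```

For a file on disk, read it in binary mode (`open(path, "rb").read()`) and pass the result to the helper. Plain ASCII files always pass, since ASCII is a subset of UTF-8.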