How to Format Your Data
Experimental data is often shared in a variety of formats. Carefully choosing a data format is a great way to extend the impact of your research by ensuring others can use it in the future.
On this page, we begin by defining the difference between raw data, processed data, and results data. We then provide a reference table for various types of data that you may share.
Raw data, processed data, and results
From a reusability perspective, data is more useful to future users than results. Both can be shared, but data matters more for reproducibility and reuse.
We consider data to be raw or partially processed information from a single sample, depending on the type of experiment being conducted.
Results are generally post-analysis information from an aggregate of samples or manuscript figures.
For example, if you are sharing RNA-seq information, the raw data would be the raw .fastq.gz files, the processed data would be the aligned reads (.bam) or gene counts data, and the differential expression analysis and volcano plots would be considered results. This distinction is well-defined for many types of data, but it may be less clear for assays that are less often encountered. "Results" might also be acceptable for assays that do not lend themselves to re-analysis, such as western blotting. We can work with you to help figure this out.
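As a rough illustration of the RNA-seq example above, the sketch below sorts file names into raw, processed, and results categories. The suffix table is hypothetical: only `.fastq.gz` and `.bam` come from the text, while the counts and volcano-plot file names are made up for illustration and are not Portal requirements.

```python
# Hypothetical suffix-to-category mapping for the RNA-seq example only.
# ".fastq.gz" and ".bam" come from the text; the other two suffixes are
# illustrative assumptions about how counts and plots might be named.
CATEGORY_BY_SUFFIX = {
    ".fastq.gz": "raw",
    ".bam": "processed",
    "_counts.tsv": "processed",
    "_volcano.png": "results",
}


def categorize(filename: str) -> str:
    """Return 'raw', 'processed', 'results', or 'unknown' for a file name."""
    name = filename.lower()
    for suffix, category in CATEGORY_BY_SUFFIX.items():
        if name.endswith(suffix):
            return category
    return "unknown"


if __name__ == "__main__":
    for f in ["sample1.fastq.gz", "sample1.bam", "cohort_volcano.png", "notes.txt"]:
        print(f, "->", categorize(f))
```

Anything that comes back as "unknown" is a good candidate for asking us how it should be classified.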
A rubric for determining what datasets are key data
For the purposes of this portal, we define key data as data that, when shared in a raw or semi-processed format, is of sufficient size or complexity OR can be combined with similar data such that it can be mined for additional knowledge beyond the primary research question.
For example, a single Western blot image is typically not key data, because it can be used to answer just a handful of questions, typically all related to the protein that was assayed, and it is difficult to combine this information with lots of other Western blots to create a resource that can be mined. On the other hand, a collection of 5 whole-slide images of patient tumor sections would likely be key data, because there are many questions that could potentially be asked of the data that were not examined in the study that generated the data.
As a rough rule of thumb, you might ask yourself: if I were not doing this experiment myself, would I still want access to the raw data to combine it with other data or to ask my own questions about the data? Or would a figure in a publication suffice? If the former, it’s probably key data. If the latter, it’s probably optional.
Key datasets generally fulfill at least one of the following criteria.
The dataset contains data generated using high-throughput methods that output raw data presented in a widely used systematic format, and includes more than just one or two samples.
The dataset is considered validation data for a new method being developed under the funded grant.
The dataset is deemed explicitly of interest by the investigator for some other reason, e.g., it is particularly unique or non-recreatable data.
The dataset is deemed explicitly of interest by the funder for some other reason.
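The criteria above can be read as a simple "any of" check. The sketch below is one possible encoding, not an official Portal tool. The class and field names are hypothetical, and "more than just one or two samples" is interpreted here as a sample count greater than 2.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical summary of a dataset; field names are illustrative."""
    high_throughput_standard_format: bool = False  # high-throughput, widely used format
    sample_count: int = 0
    validates_new_method: bool = False             # validation data for a new method
    investigator_interest: bool = False            # e.g. unique or non-recreatable data
    funder_interest: bool = False


def is_key_dataset(p: DatasetProfile) -> bool:
    """Return True if at least one of the listed criteria is fulfilled."""
    # Assumption: "more than just one or two samples" means more than 2.
    high_throughput = p.high_throughput_standard_format and p.sample_count > 2
    return any([
        high_throughput,
        p.validates_new_method,
        p.investigator_interest,
        p.funder_interest,
    ])


if __name__ == "__main__":
    print(is_key_dataset(DatasetProfile(high_throughput_standard_format=True, sample_count=5)))
    print(is_key_dataset(DatasetProfile(sample_count=1)))
```

A dataset that fails every check may still be worth sharing as optional data.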
In addition to key datasets, you might consider sharing “optional” data for reasons like archiving or to meet publication requirements.
What data are accepted, and how should you format them?
Please note: many common experimental data types are included in this table, but you may be generating different or novel types of data that are not included here. Please don’t hesitate to reach out and ask us for a recommendation for your type of data if you do not see it mentioned here.
Data type | Levels (a) | Format | Notes |
---|---|---|---|
DNA | |||
whole genome sequencing | raw OR semi-processed AND processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM; processed: any standard genotype format, e.g. VCF, PLINK (.bed/.bim/.fam), .ped/.map, etc. (genotypes per locus) | |
whole exome sequencing | raw OR semi-processed AND processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM; processed: any standard genotype format, e.g. VCF, PLINK (.bed/.bim/.fam), .ped/.map, etc. (genotypes per locus) | |
SNP microarray | raw AND processed | raw: CEL, IDAT, tsv (raw values per SNP); processed: tsv (genotypes per SNP), any standard genotype format, e.g. VCF, PLINK (.bed/.bim/.fam), .ped/.map, etc. (genotypes per locus) | |
RNA expression | |||
RNA sequencing (bulk) | raw OR semi-processed AND processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM; processed: counts matrices or quantification files | quantification files: like the quant.sf files generated by Salmon-based RNA-seq workflows |
RNA sequencing (single-cell) | raw AND processed | raw: FASTQ; processed: h5ad format following the cellxgene required format | fastq should be created from bcl files with a program like More documentation on formatting h5ad files can be found here. h5ad format is a type of hdf5 file. |
gene expression microarray | raw AND processed | raw: CEL, IDAT, tsv (raw values per SNP, copy number, and loss of heterozygosity); processed: tsv (normalized values and purity/ploidy) | |
methylation | |||
ATAC sequencing | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
methylation array | raw OR semi-processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM | |
bisulfite sequencing | raw OR semi-processed AND processed | raw: FASTQ, unaligned BAM, CRAM; semi-processed: aligned BAM; processed: BED, BEDGRAPH, etc. | |
proteomics/protein | |||
LC-MS | raw AND processed | raw: mzML; processed: protein intensities (csv/tsv), peptide abundance (csv/tsv) | |
western blot | processed | densitometry output (csv/tsv) | |
plate-based ELISA | raw | plate reader output (csv/tsv) | |
protein/peptide microarrays | processed | label-free quantification matrix (csv/tsv) | |
metabolomics | |||
LC-MS | raw AND processed | raw: mzML or vendor-dependent format & processed: metabolite intensities (csv/tsv) | |
metagenomics | |||
16S rRNA | raw AND processed | raw: raw sequence reads (FASTQ); processed: cleaned sequences or contigs (FASTA), aligned reads (BAM/SAM), feature tables (BIOM), counts per taxon (tsv/csv), pathway/gene-level abundance (tsv/csv/json), artifact visualization files (QZA/QZV), data matrices (HDF5), statistical results (RDS) | |
shotgun metagenomics | raw AND processed | raw: raw sequence reads (FASTQ); processed: cleaned sequences or contigs (FASTA), aligned reads (BAM/SAM), feature tables (BIOM), counts per taxon (tsv/csv), pathway/gene-level abundance (tsv/csv/json), artifact visualization files (QZA/QZV), data matrices (HDF5), statistical results (RDS) | |
phenotype/clinical | |||
structured clinical data | processed | csv/tsv or XML with metadata for each variable | |
clinical/imaging | |||
MRI or other radiological image | raw | DICOM | |
imaging | |||
immunohistochemistry | raw | OME-TIFF (preferred); otherwise, at least a Bio-Formats-compatible file format | |
immunofluorescence | raw | OME-TIFF (preferred); otherwise, at least a Bio-Formats-compatible file format | |
gross morphology photos (mice) | raw | tiff, png, jpg | |
in vitro drug screening | |||
plate-based cell viability assay | processed | csv/tsv (according to template) | |
other | |||
flow cytometry | raw | fcs with gating parameters | |
in vivo tumor growth experiments | raw OR processed | csv/tsv (according to template) where raw: tumor dimensions or other raw measurements & processed: calculated tumor volume/size | |
plate-based cell viability | processed | csv/tsv | |
electrophysiology | raw OR processed OR derived/analyzed | raw: binary (DAT/BIN); other formats: .abf / .atf, .nwb, .mat, .h5 / .hdf5, .smr / .smrx (Spike2), .csv / .tsv, .mcd, .brw, .nev / .nsx | Use NWB if possible; it supports metadata, raw, and derived data in one structured file and is becoming a FAIR standard. |
(a) Level nomenclature can be cross-referenced with https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/data-levels, where ‘raw’ corresponds to Level 1 and ‘semi-processed’ and ‘processed’ most closely correspond to Level 2.
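To make the Levels column easier to reason about, here is a minimal sketch that checks whether a set of levels you plan to share satisfies an entry such as "raw OR semi-processed AND processed". It assumes that entry reads as "(raw OR semi-processed) AND processed". That grouping is our interpretation, and the function names are hypothetical. The small lookup table restates the approximate TCGA level correspondence from the footnote.

```python
# Approximate correspondence to TCGA data levels, as stated in the footnote.
TCGA_LEVEL = {"raw": 1, "semi-processed": 2, "processed": 2}


def parse_levels(expr: str) -> list[set[str]]:
    """Split a Levels entry into AND-groups of OR-alternatives.

    Assumption: OR binds tighter than AND, so
    "raw OR semi-processed AND processed" means
    (raw OR semi-processed) AND processed.
    """
    return [
        {alternative.strip() for alternative in group.split(" OR ")}
        for group in expr.split(" AND ")
    ]


def satisfies(expr: str, provided: set[str]) -> bool:
    """Return True if the provided levels cover every AND-group."""
    return all(bool(group & provided) for group in parse_levels(expr))


if __name__ == "__main__":
    requirement = "raw OR semi-processed AND processed"
    print(satisfies(requirement, {"raw", "processed"}))  # both groups covered
    print(satisfies(requirement, {"raw"}))               # processed level missing
```

If the intended grouping for your data type differs from this reading, please ask us before preparing your files.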
What results are accepted, and how should you format them?
To share analysis and results data on the Portal, read our instructions and formatting guidelines.
Metadata requirements
To share your data on the ELITE Portal, we require annotations as defined in the assay-specific manifests.