Sequence
Supported EGA data submission formats are described below. If you have data in any other format or have any questions please contact the EGA Helpdesk.
Generic Formats
Format | File Suffix |
---|---|
CRAM format | .cram |
BAM format | .bam |
Fastq format | .fastq.gz / .fastq.bz2 / .fq.gz / .fq.bz2 / .txt.gz / .txt.bz2 |
VCF format | .vcf |
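As an illustration of how the suffix table above might be applied before upload, the following minimal sketch maps a file name to one of the generic formats. The function name `guess_generic_format` is hypothetical, and the case-sensitive matching is an assumption; this is not an EGA tool.

```python
from typing import Optional

# Suffixes copied from the Generic Formats table above.
GENERIC_SUFFIXES = {
    "CRAM": (".cram",),
    "BAM": (".bam",),
    "Fastq": (".fastq.gz", ".fastq.bz2", ".fq.gz", ".fq.bz2", ".txt.gz", ".txt.bz2"),
    "VCF": (".vcf",),
}


def guess_generic_format(filename: str) -> Optional[str]:
    """Return the table's format name for a file name, or None if no suffix matches.

    Assumption: matching is case-sensitive and looks at the file name only,
    not at the file content.
    """
    for format_name, suffixes in GENERIC_SUFFIXES.items():
        if filename.endswith(suffixes):
            return format_name
    return None
```

A file name such as `sample1.fq.gz` maps to Fastq, while an uncompressed `sample1.fastq` returns `None` because the table lists compressed Fastq suffixes only.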
Platform specific formats
Format | File Suffix | Notes |
---|---|---|
SFF Format (454 and Ion Torrent) | .sff | Spot Descriptor is required |
SOLiD csfasta/qual format | .csfasta / .csfasta.gz / .csfasta.bz2 / .qual / .qual.gz / .qual.bz2 | Support deprecated in 2015. |
Complete Genomics format | | |
PacBio HDF5 format | .metadata.xml / .bas.h5 / .bax.h5 | |
Illumina Qseq format | | Support deprecated in 2015. |
Illumina Scarf format | | Support deprecated in 2015. |
SRF Format (Illumina) | .srf | Support deprecated in 2015. |
Generic Formats
CRAM format (all platforms)
The CRAM format is our recommended primary sequence data submission format.
Submitted CRAM files must be readable with SAMtools and CRAMToolkit, and the reference sequences must exist in the CRAM Reference Registry. The ArchiveCRAM specification outlines the requirements for BAM and CRAM submissions.
Data files should be de-multiplexed prior to submission so that each run is submitted with files containing data for a single sample only.
BAM format (all platforms)
All submitted BAM files must be readable with SAMtools and Picard. BAM files must be de-multiplexed prior to submission. However, multi-sample BAM files may be submitted as analyses.
Please note that color space BAM submissions are not supported.
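Before running SAMtools or Picard, a quick sanity check can catch files that are mislabelled or truncated at the start. The sketch below looks only at the leading magic bytes: CRAM files begin with the ASCII bytes `CRAM`, and BAM files are BGZF (gzip-compatible) streams whose decompressed content begins with `BAM\1`. The function name `sniff_alignment_format` is hypothetical, and this check is not a substitute for the readability requirements above.

```python
import gzip
import zlib


def sniff_alignment_format(path: str) -> str:
    """Return 'CRAM', 'BAM', or 'unknown' based on the file's leading magic bytes."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == b"CRAM":
        return "CRAM"
    if head[:2] == b"\x1f\x8b":  # gzip magic, also used by BGZF blocks
        try:
            with gzip.open(path, "rb") as handle:
                if handle.read(4) == b"BAM\x01":
                    return "BAM"
        except (OSError, EOFError, zlib.error):
            # Corrupt or truncated compressed stream.
            return "unknown"
    return "unknown"
```

A result of `'unknown'` for a file named `.bam` or `.cram` suggests the file should be regenerated before submission.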
The EGA recommends converting BAM files to CRAM format whenever possible. Please review the CRAM Usage page to learn more about the advantages of the CRAM format and the available documentation.
Fastq format (all platforms)
Primary sequence data submissions of single and paired reads are accepted as Fastq files that meet the following requirements:
- Data files should be de-multiplexed prior to submission so that each run is submitted with files containing data for a single sample only.
- Quality scores must be on the Phred scale. For example, quality scores from early Solexa pipelines must be converted to this scale. Both ASCII and space-delimited decimal encodings of quality scores are supported. The Phred quality offset (33 or 64) is detected automatically.
- No technical reads (adapters, linkers, barcodes) are allowed.
- Single reads must be submitted using a single Fastq file and can be submitted with or without read names.
- Paired reads must be split and submitted using either one or two Fastq files. The read names must have a suffix identifying the first and second read from the pair, for example '/1' and '/2' (regular expression for the reads "^(.*)([\\.|:|/|_])([12])$").
- The first line for each read must start with '@'.
- The base calls and quality scores must be separated by a line starting with '+'.
- The Fastq files must be compressed using gzip or bzip2.
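The record-level requirements above can be sketched as a small pre-submission check. This is a minimal illustration, not the validation the EGA performs: the function name `check_fastq_records` is hypothetical, the records are assumed to span exactly four lines each, and the length comparison assumes ASCII-encoded quality scores (it does not apply to space-delimited decimal qualities). The regular expression is the one quoted above, with the doubled backslash reduced to a single escape.

```python
import re
from typing import Iterable, List

# Regular expression quoted in the requirements for paired read names.
PAIRED_NAME = re.compile(r"^(.*)([\.|:|/|_])([12])$")


def check_fastq_records(lines: Iterable[str], paired: bool = False) -> List[str]:
    """Return a list of problems found in four-line Fastq records (empty if none)."""
    problems = []
    rows = [line.rstrip("\n") for line in lines]
    if len(rows) % 4 != 0:
        problems.append("line count is not a multiple of four")
    complete = len(rows) - len(rows) % 4
    for start in range(0, complete, 4):
        header, bases, separator, quals = rows[start:start + 4]
        record = start // 4 + 1
        if not header.startswith("@"):
            problems.append(f"record {record}: first line does not start with '@'")
        if not separator.startswith("+"):
            problems.append(f"record {record}: separator line does not start with '+'")
        # Assumption: ASCII-encoded qualities, one character per base call.
        if len(bases) != len(quals):
            problems.append(f"record {record}: base and quality lengths differ")
        if paired:
            # Take the read name as the text between '@' and the first space.
            name = header[1:].split(" ")[0]
            if not PAIRED_NAME.match(name):
                problems.append(f"record {record}: read name lacks a pair suffix")
    return problems
```

To check a compressed file, the lines can be read with `gzip.open(path, "rt")` or `bz2.open(path, "rt")` and passed to the function directly.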
Example of Fastq file containing single reads:
    @read_name
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    ...
Example of Fastq file containing paired reads:
    @read_name/1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    @read_name/2
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    ...
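The quality string in the examples above can be used to illustrate how a Phred offset of 33 or 64 might be told apart. The heuristic below is an assumption for illustration only, not a description of the EGA's own detection: any character with an ASCII code below 64 can only occur with offset 33, so the lowest character seen decides the guess. The function name `guess_phred_offset` is hypothetical.

```python
from typing import Iterable


def guess_phred_offset(quality_strings: Iterable[str]) -> int:
    """Guess the Phred offset (33 or 64) from ASCII-encoded quality strings.

    Heuristic: offset-64 data never contains characters below ASCII 64,
    so a lower character implies offset 33. Data that only uses high
    characters is ambiguous and is reported as 64.
    """
    lowest = min(ord(char) for quality in quality_strings for char in quality)
    if lowest < 33:
        raise ValueError("quality string contains non-printable characters")
    return 33 if lowest < 64 else 64


# Quality line taken from the example records above.
example_quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
example_offset = guess_phred_offset([example_quality])
```

For the example quality line, the lowest character is `!` (ASCII 33), so the guess is offset 33. The more reads that are inspected, the less likely a high-quality offset-33 file is mistaken for offset 64.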
Note: in the illumina2srf command for paired lanes (see SRF format), <cycle> indicates the cycle number that starts the second read.
VCF format
Sequence variations are accepted in VCF format.
Please validate your VCF file(s) using the EVA VCF validator or, at a minimum, the VCFtools validator.
The EGA supports multi-sample VCFs and aggregate-level VCFs. These should preferably be split by chromosome.
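Splitting by chromosome can be done with standard VCF tooling; the following standard-library sketch shows the idea for an uncompressed VCF. The function name `split_vcf_by_chromosome` and the `<CHROM>.vcf` output naming are hypothetical, and the sketch assumes that every CHROM value is safe to use as a file name. Each output file receives a copy of the full header.

```python
import os
from typing import Dict, List, TextIO


def split_vcf_by_chromosome(vcf_path: str, out_dir: str) -> List[str]:
    """Write one uncompressed VCF per CHROM value, each with the full header."""
    os.makedirs(out_dir, exist_ok=True)
    header: List[str] = []
    outputs: Dict[str, TextIO] = {}
    try:
        with open(vcf_path, "r", encoding="utf-8") as source:
            for line in source:
                if not line.strip():
                    continue  # ignore blank lines
                if line.startswith("#"):
                    header.append(line)  # meta-information and column header
                    continue
                chrom = line.split("\t", 1)[0]
                if chrom not in outputs:
                    # Assumption: CHROM contains no path separators.
                    path = os.path.join(out_dir, f"{chrom}.vcf")
                    outputs[chrom] = open(path, "w", encoding="utf-8")
                    outputs[chrom].writelines(header)
                outputs[chrom].write(line)
    finally:
        for handle in outputs.values():
            handle.close()
    return sorted(handle.name for handle in outputs.values())
```

The resulting files should still be validated as described above, since this sketch copies records unchanged and does not check their content.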
Platform Specific Formats
SFF format (454 and Ion Torrent)
The SFF format is the recommended primary data submission format for the 454 and Ion Torrent platforms.
We recommend the conversion of the data to Fastq format.
SOLiD csfasta/qual format
Support deprecated in 2015. Please convert the data to Fastq format.
Complete Genomics format
Complete Genomics data should be submitted as the full Complete Genomics data package. For each sample a directory containing the ASM, LIB and MAP subfolders should be prepared. Please note that the directory should be compressed using gzip or tar prior to submission.
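A minimal sketch of the packaging step follows. It reads "gzip or tar" as permitting a gzip-compressed tar archive of the sample directory, which is an assumption; the function name `package_sample_directory` is hypothetical. The sketch refuses to build the archive if any of the ASM, LIB or MAP subfolders is missing.

```python
import os
import tarfile

# Subfolders named in the paragraph above.
REQUIRED_SUBFOLDERS = ("ASM", "LIB", "MAP")


def package_sample_directory(sample_dir: str, archive_path: str) -> str:
    """Create a gzip-compressed tar archive of one sample directory."""
    missing = [
        name for name in REQUIRED_SUBFOLDERS
        if not os.path.isdir(os.path.join(sample_dir, name))
    ]
    if missing:
        raise FileNotFoundError(f"missing subfolders: {', '.join(missing)}")
    # Store paths relative to the sample directory's own name.
    top_level = os.path.basename(os.path.normpath(sample_dir))
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(sample_dir, arcname=top_level)
    return archive_path
```

Keeping the sample directory name as the top-level entry means the ASM, LIB and MAP subfolders unpack under a single per-sample folder.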
PacBio HDF5 format
PacBio data submissions are supported in the PacBio HDF5 format.
One run consists of *.bax.h5, *.bas.h5 and *.metadata.xml files. Please note that these files must not be tarred.
Illumina qseq format
We accept but do not recommend primary data submissions in Illumina qseq format. Currently, qseq data submissions are not processed or made available in any other formats. We recommend that submitters convert their data from qseq format to Fastq format prior to submission.
If submitted, qseq files should be compressed using gzip or bzip2.
Illumina scarf format
We accept but do not recommend primary data submissions in Illumina scarf format. Currently, scarf data submissions are not processed or made available in any other formats. We recommend that submitters convert their data from scarf format to Fastq format prior to submission.
Please note that scarf format typically uses log-odds qualities, which should be converted into Phred qualities when preparing the Fastq files. If submitted, scarf files should be compressed using gzip or bzip2.
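The conversion from log-odds (Solexa) qualities to Phred qualities follows the standard relation Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The sketch below applies it per character. The function names are hypothetical, and the assumption that the input uses an ASCII offset of 64 and the output an offset of 33 should be checked against the actual data.

```python
import math


def solexa_to_phred(solexa_quality: float) -> int:
    """Convert a log-odds (Solexa) quality to the nearest integer Phred quality."""
    return round(10 * math.log10(10 ** (solexa_quality / 10) + 1))


def convert_quality_string(solexa_ascii: str) -> str:
    """Re-encode an ASCII quality string from Solexa+64 to Phred+33.

    Assumption: input characters encode log-odds qualities with offset 64.
    """
    return "".join(
        chr(solexa_to_phred(ord(char) - 64) + 33) for char in solexa_ascii
    )
```

The two scales differ mainly at low qualities: a log-odds quality of 0 corresponds to a Phred quality of about 3, while values of 10 and above are almost identical on both scales.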
SRF format (Illumina)
The SRF format continues to be supported as a historical primary data submission format for existing submitters only.
Preparing SRF files
The *_seq.txt files can be converted into SRF files using the illumina2srf utility available from the DNA Sequence Read Toolkit.
Each Illumina lane should be submitted as a separate SRF file, and runs should be demultiplexed prior to SRF file generation.
To produce an SRF submission file for a non-paired lane, change the working directory to the run folder and run:
illumina2srf -R -P -N <run>:%l:%t: -n %x:%y -o <center_name>_<run>_<lane>.srf s_<lane>_*_seq.txt
The -R and -P options exclude intensity, noise and signal data from the generated SRF files. These data series are no longer supported for new data submissions.
The recommended format for the SRF file names is <center_name>_<run>_<lane>.srf, where <center_name> is the center name abbreviation assigned to all submitters, and the <run> and <lane> are the run and the lane identifiers.
To produce an SRF submission file for a paired lane, change the working directory to the run folder and run:
illumina2srf -R -P -N <run>:%l:%t: -n %x:%y -2 <cycle> -o <center_name>_<run>_<lane>.srf s_<lane>_*_seq.txt
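For scripted submissions, the two commands above can be assembled programmatically. The sketch below only builds the argument list using the options shown in this section and does not run illumina2srf. The function names and the example values are hypothetical.

```python
from typing import List, Optional, Sequence


def srf_file_name(center_name: str, run: str, lane: int) -> str:
    """Build the recommended SRF file name: <center_name>_<run>_<lane>.srf."""
    return f"{center_name}_{run}_{lane}.srf"


def build_illumina2srf_command(
    center_name: str,
    run: str,
    lane: int,
    seq_files: Sequence[str],
    second_read_cycle: Optional[int] = None,
) -> List[str]:
    """Assemble the illumina2srf argument list for one lane.

    Pass second_read_cycle for a paired lane; it becomes the -2 <cycle> option.
    """
    command = ["illumina2srf", "-R", "-P", "-N", f"{run}:%l:%t:", "-n", "%x:%y"]
    if second_read_cycle is not None:
        command += ["-2", str(second_read_cycle)]
    command += ["-o", srf_file_name(center_name, run, lane)]
    command += list(seq_files)
    return command
```

From inside the run folder, the input list can be gathered with `sorted(glob.glob(f"s_{lane}_*_seq.txt"))`, which mirrors the `s_<lane>_*_seq.txt` pattern in the commands above, and the resulting list can be passed to `subprocess.run`.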