GoNL Data Access Committee of BBMRI-NL

DAC ID: EGAC00001000146
Contact Person: Elisa Hoekstra
Email: gonl [at] bbmri [dot] nl
Access Information: No additional information is available

This DAC controls 5 datasets:

Dataset ID: EGAD00001000743
Technology: not specified
Samples: 767
Description: These files contain a total of 20.4M SNVs and the complete information output by the GATK UnifiedGenotyper v1.4 on all 767 GoNL samples. These calls are not trio-aware, and all genotypes were reported regardless of their quality. Both filtered and passing calls are reported in these files. Filtered calls include (1) calls failing our VQSR threshold and (2) calls in the GoNL inaccessible genome.

Dataset ID: EGAD00001000744
Technology: not specified
Samples: 499
Description: The samples in this panel come from 250 families: 248 parent-child trios and 2 parent-child duos. As the children do not provide additional haplotypes or population information, they were excluded from the panel. The release therefore consists of 248 couples, 2 single individuals and 1 sample composed of the 2 haplotypes that the duos' children inherited from their missing parents. The composed sample is named gonl-220c_223c. The files contain a total of 18.9M SNVs and 1.1M INDELs on the autosomal chromosomes. They were generated by phasing/imputing the SNVs (a) and INDELs (b) using MVNCall. Only sites passing filters are reported. Sites filtered as part of the GoNL inaccessible genome were kept (but flagged as filtered). They may still contain true positive calls but should be used with care, as they are located in parts of the genome that are less well captured (systematically under- or over-covered, or of low mapping quality).

Dataset ID: EGAD00001000821
Technology: Illumina HiSeq 2000
Samples: 767
Description: Raw sequencing data for all samples in fastq format.

Dataset ID: EGAD00001001038
Technology: not specified
Samples: 769
Description: We mapped the data to the UCSC human reference genome build 37 using BWA 0.5.9-r16. We first mapped each read of a pair separately using bwa aln. Then we used bwa sampe to map the paired reads together, producing a BAM file. The BAM file was then sorted by genomic position and indexed using PicardTools-1.32 SortSam. To prevent PCR artifacts from influencing the downstream analysis of our data, we used Picard to mark the duplicate reads, which were ignored in downstream analysis. We used the GATK IndelRealigner on our data around known indels (from 1KG Pilot). The IndelRealigner creates all possible read alignments using the source and computes the likelihood of the data containing the indel based on the read pileup. Whenever the maximum-likelihood alignment contains an indel, the reads are realigned accordingly. Each base is associated with a phred-scaled base quality score. Calibration of Phred scores is crucial, as they are used in some of the downstream analysis models. We used GATK to recalibrate the base qualities with respect to (i) the base cycle, (ii) the original quality score, and (iii) the dinucleotide context. To minimize issues stemming from mapping problems around indels, we performed a second round of indel realignment using the GATK IndelRealigner by family rather than by individual. For this second round, we considered two sources of possible indels: 1KG Phase 1 indels and indels aligned by BWA in the GoNL data.

Dataset ID: EGAD00001002261
Technology: Illumina HiSeq 2000
Samples: 769
Description: These files contain indels and structural variants for 769 GoNL samples (SV release 6, 2016-05-25).
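
The description of EGAD00001001038 refers to phred-scaled base quality scores. As a minimal illustration of that scale only (not of the GATK recalibration itself), the Python sketch below converts between a Phred score Q and an error probability P using the standard relation Q = -10 * log10(P). The function names are hypothetical, and the ASCII offset of 33 used for decoding fastq quality strings is an assumption: this listing does not state which encoding the GoNL fastq files use.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score Q = -10*log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def decode_fastq_qualities(quality_string, offset=33):
    """Decode a fastq quality string into integer Phred scores.

    The offset of 33 is an assumption (Phred+33 encoding); older Illumina
    pipelines used an offset of 64 instead.
    """
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    # A Phred score of 30 corresponds to roughly a 1-in-1000 error rate.
    print(phred_to_error_prob(30))
    print(decode_fastq_qualities("I5!"))
```

Under this relation, every increase of 10 in Q corresponds to a tenfold drop in the estimated error probability, which is why miscalibrated scores can mislead downstream models that weight bases by quality.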