
GAW16 Framingham and Simulated Data

Important links to apply for individual-level data

  1. Genetic Analysis Workshop
  2. Instructions to Request Authorized Access
  3. Data Use Certification Requirements (DUC)
  4. Apply here for controlled access to individual level data
  5. Research Use Statement

Questions regarding GAW16 should be directed to Vanessa Olmo at vanessa@business-endeavors.com.

Problem 2: Description of the Framingham Heart Study

In GAW16, we use data drawn from the Framingham Heart Study. The Framingham Heart Study, conducted under the direction of the National Heart, Lung, and Blood Institute (NHLBI), began in 1948 with the recruitment of adults from the town of Framingham, Massachusetts. At the time, little was known about the general causes of heart disease and stroke, but the death rates for cardiovascular disease (CVD) had been increasing steadily since the beginning of the 20th century and had become an American epidemic. The Framingham Heart Study is now conducted in collaboration with Boston University.

The objective of the Framingham Heart Study was to identify the common factors or characteristics that contribute to CVD by following its development over a long period of time in a large group of participants who had not yet developed overt symptoms of CVD or suffered a heart attack or stroke.

Between 1948 and 1953, the researchers recruited 5,209 subjects (2,336 men and 2,873 women) between the ages of 29 and 62 from the town of Framingham, Massachusetts, and began the first round of extensive physical examinations and lifestyle interviews that they would later analyze for common patterns related to CVD development. Subjects were recruited from lists of addresses recorded for the town, and two out of every three households were approached for participation in the study. Although there was no intention to recruit families for family studies, the plan was to recruit all household members aged 30-60 within each household selected for study. Hence, many biologically related individuals were recruited, including 1,644 spouse pairs. Since 1948, these participants have returned to the study every two years for a detailed medical history, physical examination, and laboratory tests. As of 2008, after 60 years of follow-up, about 500 participants from this cohort remain.

Between 1971 and 1975, the study enrolled a second-generation group of 5,124 subjects, made up of the original participants' children and the spouses of these children, to participate in similar examinations. Of these, 2,616 are offspring of the original spouse pairs and 34 are stepchildren. A total of 898 offspring are children of cohort members where only one parent was a study participant, and 1,576 are spouses of the offspring. The Offspring Cohort has been followed every four years through 2001 (except for an 8-year interval between Exams 1 and 2) using protocols similar to those used for the Original Cohort.

Between 2002 and 2005, the study enrolled the third generation (Gen3) of the Framingham Heart Study: 4,095 offspring of the second generation. None of their spouses were recruited. An additional 103 parents of this third generation, who had not been recruited between 1971 and 1975, were also recruited at this time. The latter group is not included in the GAW16 data. With the recruitment of this third generation, the study has increasingly focused on genetic factors associated with the development of cardiovascular disease and its associated risk factors. To date, this generation of participants has had only one examination. A description of the recruitment of this third generation and a comparison with the earlier generations at their initial recruitment are presented in Splansky GL et al., 2007.

Further information on the Study can be found at http://www.nhlbi.nih.gov/about/framingham/index.html.

Genome-wide Dense SNP Scan in Framingham Heart Study

Genetic studies did not begin in the FHS until the 1990s. In the late 1980s and through the 1990s, DNA was extracted from blood samples of surviving FHS participants. In 2007, the FHS entered a new phase with genotyping for the FHS SHARe (SNP Health Association Resource) project, in which dense SNP genotyping was performed using approximately 550,000 SNPs (GeneChip® Human Mapping 500K Array Set and the 50K Human Gene Focused Panel) in 10,775 samples (including some duplicates) from the three generations of subjects (including over 900 pedigrees). Affymetrix conducted all genotyping for the FHS SHARe project, using the 250K Sty, 250K Nsp, and supplemental 50K platforms. Eighty-nine percent of the DNA samples were collected during the 1990s. To maximize the power of the study, we also extracted DNA from 1,133 blood samples, drawn from subjects for whom no DNA was available, to include in the SHARe project. These samples had been sitting in our refrigerators for some time, a few since the 1970s. We refer to these DNA samples as the legacy samples. The legacy samples had a higher failure rate in the genotyping process (40%) than the other eighty-nine percent of samples (3%). Affymetrix applied its own criteria for a sample to succeed in genotyping: non-legacy samples had to succeed on all three platforms, while legacy samples needed to pass on at least one platform. When a sample failed, additional attempts were made, and samples that failed 2-4 times were called failures. Other samples failed because the genotyped sex did not match our records, because of low concordance among SNPs common across arrays, or because of contamination. Eighty-nine percent of the legacy samples for which genotyping results are available passed all three platforms.

The genotyping data from the 10,043 samples from 9,354 subjects that passed the Affymetrix criteria were additionally checked for gender consistency and consistency with family structure, resulting in genotyping data for 9,274 participants in FHS SHARe. Genotype calls were made with the BRLMM algorithm.
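
The platform pass rule described above (non-legacy samples must pass all three platforms, legacy samples at least one) can be illustrated with a minimal Python sketch. This is not Affymetrix's actual procedure: the function name, the platform labels as dictionary keys, and the input layout are assumptions made for illustration, and the other failure causes (sex mismatch, low concordance, contamination) are not modeled.

```python
# Platform labels taken from the text; using them as dict keys is an assumption.
PLATFORMS = ("250K Sty", "250K Nsp", "50K")


def sample_passes(platform_results, is_legacy):
    """Apply the stated pass rule to one sample.

    platform_results: dict mapping platform label -> True if the sample
    succeeded on that platform (a missing platform counts as a failure).
    is_legacy: True for legacy samples, False otherwise.
    """
    passed = [bool(platform_results.get(p, False)) for p in PLATFORMS]
    if is_legacy:
        # Legacy samples needed to pass on at least one platform.
        return any(passed)
    # Non-legacy samples had to succeed on all three platforms.
    return all(passed)


# Hypothetical sample that succeeded on only one platform.
one_of_three = {"250K Sty": True, "250K Nsp": False, "50K": False}
legacy_ok = sample_passes(one_of_three, is_legacy=True)
non_legacy_ok = sample_passes(one_of_three, is_legacy=False)
```

Under this sketch the same set of platform results is accepted for a legacy sample and rejected for a non-legacy one, which is the asymmetry the text describes.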

The SHARe database is housed at the National Center for Biotechnology Information Database of Genotypes and Phenotypes (NCBI dbGaP) and contains all ~550,000 SNPs. This genome-wide dense SNP scan and a subset of phenotypes from the Framingham Heart Study are the focus of Genetic Analysis Workshop 16.

Further information on the specific variables in the Problem 2 dataset can be found by clicking on the Documents tab at the top of the page.

Problem 3: Description of the Simulated Data Set

The focus of this simulation is gene discovery in genome-wide association scans (GWAS). The Framingham Heart Study data set (distributed as "Problem 2") is the basis for the FHS* simulated data. The pedigree structures are derived from the data distributed for Problem 2, and we distribute an accompanying triplet file (triplet_sim) containing person ID, father ID, and mother ID, to ensure that identical subjects, pedigrees, and singletons are used in the simulated data analysis. Consistent with standard practice, founders and singletons are designated as subjects with both fshare and mshare equal to zero (missing). The simulated data include a total of 6,479 subjects with both phenotype and genotype data, in 942 pedigrees distributed among 3 generations, and 188 singletons. Data inclusion is consistent with the subjects' consent for use by both for-profit and not-for-profit researchers. The genotypes for all Problem 3 replicates are fixed as measured and distributed for Problem 2, for both the genome-wide scan and the additional candidate gene SNPs, for a total of approximately 550,000 SNPs (GeneChip® Human Mapping 500K Array Set and the 50K Human Gene Focused Panel). Thus, to analyze the Problem 3 simulated data, you will also need to download the Problem 2 genotypes. Note that there are slight discrepancies in counts between Problems 2 and 3 due to a change in consents between the two datasets.
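
As an illustration of the founder convention above, the following Python sketch separates subjects whose fshare and mshare are both zero from the rest. The on-disk layout of triplet_sim is not described here, so the comma-separated format with a shareid,fshare,mshare header, the toy rows, and the function name are all assumptions. Treating a zero-parent subject who never appears as anyone's parent as a singleton is likewise an assumption, not a definition taken from the GAW16 documentation.

```python
import csv
import io


def classify_subjects(triplet_rows):
    """Split subjects into founders, singletons, and non-founders.

    triplet_rows: iterable of dicts with keys 'shareid', 'fshare', 'mshare'
    (IDs as strings; '0' means the parent is missing).
    """
    rows = [(r["shareid"], r["fshare"], r["mshare"]) for r in triplet_rows]
    # Every ID that appears as somebody's father or mother.
    parents = {p for _, f, m in rows for p in (f, m) if p != "0"}
    founders, singletons, nonfounders = [], [], []
    for sid, f, m in rows:
        if f == "0" and m == "0":
            # Assumption: a zero-parent subject with no children in the
            # file is a singleton; one with children is a pedigree founder.
            (founders if sid in parents else singletons).append(sid)
        else:
            nonfounders.append(sid)
    return founders, singletons, nonfounders


# Toy data standing in for the real file (assumed CSV layout).
toy = io.StringIO("shareid,fshare,mshare\n1,0,0\n2,0,0\n3,1,2\n4,0,0\n")
founders, singletons, nonfounders = classify_subjects(csv.DictReader(toy))
```

In the toy data, subjects 1 and 2 are the parents of subject 3, and subject 4 has no recorded relatives.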

Several phenotypes that contribute to coronary heart disease (CHD) were simulated for all individuals with genotypes at three different time points, 10 years apart. All genotyped individuals have complete data; the effects of missing values can be investigated by applying user-specified missing value patterns. A total of 200 longitudinal datasets were created from the generating model, and each replication is stored in a separate dataset. We suggest that if only one replication is to be analyzed, it be replication 1, to enable more precise comparisons among analytical approaches. The 'shareid' variable will allow you to merge the simulated phenotype data with the Problem 2 genotype data and to reconstruct the pedigrees using the distributed 'triplet_sim' file or, for larger families' relationships, the triplet file distributed with Problem 2.
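
The merge on 'shareid' and a user-specified missing value pattern might be sketched in Python as follows. The record layout, the field names other than 'shareid' (trait_t1, snp_a), and both function names are hypothetical. The missingness shown is one simple choice (each value dropped independently with a fixed probability) and is not a pattern prescribed by the workshop.

```python
import random


def merge_on_shareid(phenotypes, genotypes):
    """Inner-join two lists of dicts on the 'shareid' key."""
    geno_by_id = {g["shareid"]: g for g in genotypes}
    return [
        {**geno_by_id[p["shareid"]], **p}
        for p in phenotypes
        if p["shareid"] in geno_by_id
    ]


def apply_missingness(records, fields, rate, seed=0):
    """Return copies of records with each listed field set to None
    independently with probability `rate` (the inputs are left untouched)."""
    rng = random.Random(seed)  # seeded so the pattern is reproducible
    masked = []
    for rec in records:
        new = dict(rec)
        for field in fields:
            if rng.random() < rate:
                new[field] = None
        masked.append(new)
    return masked


# Hypothetical toy records; the real variable names are in the data dictionary.
phenos = [{"shareid": 1, "trait_t1": 5.2}, {"shareid": 2, "trait_t1": 4.8}]
genos = [{"shareid": 1, "snp_a": 0}, {"shareid": 3, "snp_a": 2}]
merged = merge_on_shareid(phenos, genos)
all_missing = apply_missingness(phenos, ["trait_t1"], rate=1.0)
none_missing = apply_missingness(phenos, ["trait_t1"], rate=0.0)
```

Fixing the seed keeps the same missingness pattern across runs, which helps when comparing analytical approaches on the same replicate.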

The simulated data problem is further described in the associated readme file, and a data dictionary is provided defining all the variables. For disclosure of the generating model for these data, please contact Jean MacCluer at jean@sfbrgenetics.org.