How to encrypt files with EGACryptor
File preparation
Because of the processes used at the EGA for file archival, non-alphanumeric characters in a filename will cause issues during archival. By convention, whitespace in filenames should be avoided and replaced with the underscore character (_). Before encrypting your files, please make sure that no file that will be uploaded to the EGA uses special characters such as # ? ( ) [ ] / \ = + < > : ; " ' , * ^ | &
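The naming rule above can be checked before encryption with a short script. This is only a sketch: the function names are hypothetical, it works on bare filenames rather than full paths (since / is one of the listed characters), and dropping the special characters is an assumption, as the text above only prescribes the underscore for whitespace.

```python
import re

# Special characters listed in the file-preparation note above.
SPECIAL_CHARS = set("#?()[]/\\=+<>:;\"',*^|&")


def find_problem_characters(filename):
    """Return the sorted special characters and whitespace found in a bare filename."""
    return sorted({ch for ch in filename if ch in SPECIAL_CHARS or ch.isspace()})


def suggest_filename(filename):
    """Replace whitespace runs with underscores and drop the listed special characters."""
    no_spaces = re.sub(r"\s+", "_", filename.strip())
    return "".join(ch for ch in no_spaces if ch not in SPECIAL_CHARS)
```

For example, find_problem_characters("sample_01.bam") returns an empty list, so that name needs no change.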
Files encrypted with EGACryptor must be uploaded via FTP
EGACryptor
The EGACryptor v.2.0.0 is a Java-based application that enables submitters to produce EGA-compliant encrypted files, along with files containing the encrypted and unencrypted md5sum for each file to be submitted. The application generates an output folder that by default mirrors the directory structure containing the original files. This output folder can subsequently be uploaded to the EGA FTP staging area via an FTP or Aspera client.
Download EGACryptor
The required jar files can be obtained by downloading the EGACryptor jar file.
After the file has been downloaded, the zipped archive must be extracted. The EGACryptor has been built to work with Java Runtime Environments from version 6 onwards and with the OpenJDK environment. Please refer to the relevant resources for installation guidance. Installing the latest version of the OpenJDK will include the JCE files. If your installed Java JRE is older than 1.8.0_151, you will need to install the JCE Policy Files manually. You can verify the version of the Java SE Runtime Environment (JRE) installed by using the command:
$ java -version
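The version comparison described above can be sketched in a few lines. The function names are hypothetical, and the sketch assumes you pass in the quoted version string printed by the command (for example 1.8.0_144); it only encodes the 1.8.0_151 threshold quoted in this guide.

```python
import re


def parse_java_version(version):
    """Turn a version string such as '1.8.0_144' or '11.0.2' into a tuple of ints."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def needs_manual_jce(version):
    """True when the version is older than 1.8.0_151, the threshold quoted above."""
    return parse_java_version(version) < (1, 8, 0, 151)
```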
If you need to install the JCE, please follow the instructions below.
Installing the JCE policy files (due to licensing terms and conditions, the required policy files must be downloaded directly from the Oracle website):
Download the unlimited strength JCE policy files (JRE 6 / JRE 7 / JRE 8).
Uncompress and extract the downloaded file. This will create a subdirectory called JCE. This directory contains the following files: README.txt, COPYRIGHT.html, local_policy.jar and US_export_policy.jar.
Install the two policy JAR files by replacing the existing ones in your Java home directory.
The standard place for JCE jurisdiction policy JAR files is:
<java-home>/lib/security [Unix] or
<java-home>\lib\security [Win32]
Note: <java-home> refers to the directory where the Java SE Runtime Environment (JRE) was installed.
Additional performance enhancements that have been included in the EGACryptor v2.0.0:
The ability to parallelise the processing of datasets using the resources of the system. Multicore systems allow the user to specify up to n-1 cores for an n-core system. Using this feature on clusters may speed up the processing of datasets that contain large numbers of files, but consult your local cluster guide to ensure that you are not monopolising resources needed by other system users. The default for this process remains single-threaded.
Three levels of system usage can be specified: full usage, within the limits detailed above; a limited mode, which ensures that 50% of the system resources remain available for other tasks; and a maximum mode, limited to 75% of system resources, which prioritises encryption while keeping the system usable for light alternate tasks. Finally, there is a throttling mode that allows you to specify the exact number of computational threads to be used.
The EGACryptor is able to ingest a structured directory and will output a directory with the same structure containing the encrypted files along with the md5 checksums for the plain and encrypted files. The entire output directory can then be uploaded to the EGA for archival.
As with the input path, it is now possible to specify the output path.
The options have been updated in line with the upgraded functionality.
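To make the usage levels concrete, the sketch below estimates how many threads each level would allow on a machine with a given number of cores. It is illustrative only: the function name, the mode names, the rounding and the capping of the throttled mode at n-1 are all assumptions, and it does not reproduce the EGACryptor's internal logic.

```python
import os


def thread_budget(mode, cores=None, requested=None):
    """Illustrative thread count per usage level; the rounding rules are assumptions."""
    cores = cores or os.cpu_count() or 1
    ceiling = max(1, cores - 1)          # at most n-1 cores on an n-core system
    if mode == "single":                 # default behaviour: single-threaded
        return 1
    if mode == "full":
        return ceiling
    if mode == "limited":                # about 50% of the cores
        return max(1, min(ceiling, cores // 2))
    if mode == "maximum":                # about 75% of the cores
        return max(1, min(ceiling, (cores * 3) // 4))
    if mode == "throttled":              # exact number chosen by the user
        if requested is None or requested < 1:
            raise ValueError("throttled mode needs a positive thread count")
        return min(ceiling, requested)
    raise ValueError(f"unknown mode: {mode}")
```

On an 8-core machine this gives 7 threads for full usage, 4 for the limited mode and 6 for the maximum mode.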
The tool can only be used via the command line. The EGACryptor is designed to perform a single task: encrypting your data. To upload the encrypted files, please refer to our uploading guide.
Using the EgaCryptor
Below are the three ways in which the EGACryptor tool can be used:
Encrypting a single file:
java -jar ../EGA-Cryptor_2_0_0.jar -i example1.bam
Encrypting multiple files:
java -jar ../EGA-Cryptor_2_0_0.jar -i "example1.bam,example2.bam"
Encrypting all the files within a folder:
java -jar ../EGA-Cryptor_2_0_0.jar -i path/to/target/directory
By default, the EGACryptor v2 creates a new output directory, containing all encrypted files and the relevant checksums, within the target directory. If a specific directory is desired, it can be specified with the -o flag, as in the following example:
java -jar ../EGA-Cryptor_2_0_0.jar -i /path/to/target/directory -o /path/to/output/directory
The tool will output three files per input file:
file.gpg (encrypted file)
file.md5 (md5 checksum of the original file)
file.gpg.md5 (md5 checksum of the encrypted file)
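Before uploading, it can be useful to confirm that the three outputs exist and that the checksums match. The sketch below is one way to do that under stated assumptions: the helper names are hypothetical, "file" is taken to be the full input filename, and each .md5 file is assumed to start with the hexadecimal digest.

```python
import hashlib
from pathlib import Path


def expected_outputs(input_name):
    """Names of the three files described above for one input file."""
    return [f"{input_name}.gpg", f"{input_name}.md5", f"{input_name}.gpg.md5"]


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(data_path, md5_path):
    """Compare a file's MD5 with the first token stored in its .md5 file (assumed layout)."""
    recorded = Path(md5_path).read_text().split()[0].lower()
    return md5_of(data_path) == recorded
```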
All output files must then be uploaded to your submission account using Aspera or FTP. Further documentation on how to upload files: FTP and Aspera.
Points to note
Remember to provide the path to EgaCryptor.jar and run the command from within the directory where the file(s) are located.
The EGACryptor writes files to the source directory on your local file system, so you must have write permission for the source directory.
Troubleshooting
If in doubt about the function of the EGACryptor, it is recommended to first consult the built-in documentation. This can be accessed by using the -h flag, as listed in the following table.
Built-in Commands
Table: list of the command line options built into EGA-Cryptor v2.0.0.

Command line option    Action
-------------------    ------
-f                     Create the maximum number of threads, using the full
                       capacity of the cores/processors available on the machine.
-h                     View the built-in help menu.
* -i                   File(s) to encrypt. Provide a file or folder path, or, for
                       multiple files, comma-separated file paths in double quotes.
-l                     Create a maximum number of threads equal to 50% of the
                       capacity of the cores/processors available on the machine.
-m                     Create a maximum number of threads equal to 75% of the
                       capacity of the cores/processors available on the machine.
-o                     Path of the output file. This is optional. If not provided,
                       output files are generated in the same path as the source
                       file (default: output-files).
-t                     Create at most the specified number of threads. The
                       application calculates the number of cores/processors
                       available on the machine and creates threads accordingly.
Encryption Errors
UnixFileSystem.createFileExclusively (Native Method)
The error is thrown by UNIX ("UnixFileSystem.createFileExclusively(Native Method)"). It appears that the user does not have write access to the file system where the file to be uploaded is located. The EGACryptor always writes MD5 checksums into files before the data is uploaded to the server, and these files are created in the same location where the file itself resides.
Solution: address your directory permission issue and re-run the command.
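A quick way to confirm the permission problem before re-running is to try creating a temporary file in the source directory. This is a minimal sketch with a hypothetical function name; it checks only that a file can be created there, which is the operation that fails in the error above.

```python
import tempfile


def can_write_outputs(directory):
    """Try to create and remove a temporary file in the directory to confirm write access."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False
```

If can_write_outputs(".") returns False in the directory holding your files, fix the permissions or copy the files to a location you can write to.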
Install JCE Unlimited Strength Jurisdiction Policy files
The JCE unlimited strength jurisdiction policy files should be installed according to your current Java version.
If you are still facing difficulty with the EGACryptor v.2 after having consulted the documentation please contact the EGA Helpdesk.
Documentation
submission/data/file-preparation/egacryptor
Whole Transcriptome Sequencing of NXF1 or CRM1 depleted Cell
mRNA exporters, NXF1 and CRM1 are known to function in mRNA export. However, since each target mRNA has not been identified genome-widely, it is not clear whether they export exclusive mRNA subsets. To reveal this point, we prepared the nuclear and the cytoplasmic RNAs under either exporter inhibited condition and carried out RNA-seq. In these experiments, we also focused in the export machinery of aberrant transcripts that usually retained and decayed in the nucleus.
Study
JGAS000294
Distribution of data using Live Distribution
Live Distribution
Welcome to the documentation for using the Live Distribution feature for distributing data files securely through the EGA platform. This guide will walk you through the process of downloading encrypted files and decrypting them using Crypt4GH. Please follow the steps below to ensure a smooth experience.
Before Downloading
Create an EGA user.
Make sure that you have the permissions to download the dataset of interest. In case you don’t have access, request access to the dataset.
Add your Crypt4GH-compatible public key to your EGA account. Please allow a few hours for your public key to be synced with your profile. Afterwards, you will be able to connect to your EGA outbox using the SFTP protocol.
Download
Graphical User Interface (GUI)
You can use any GUI that supports SFTP connections, such as FileZilla, an open-source FTP client. If you use FileZilla as your GUI, follow these steps to download files:
Open FileZilla and access Site Manager (File > Site Manager).
Create a new connection with the following settings (Figure 1):
Protocol: SFTP - SSH File Transfer Protocol
Host: __EGA_OUTBOX_DOMAIN__
Logon Type: Key file
User: your EGA username
Key file: path/to/your/private_key
Figure 1: Process of establishing a new connection to __EGA_OUTBOX_DOMAIN__ using a key file as the logon method in FileZilla. The figure showcases the FileZilla version 3.52.2 operating on IOS v11.2.3. By following the depicted steps, users can create a secure and efficient connection to the EGA outbox, ensuring seamless data transfers.
Click Connect to access your Outbox. This folder serves as your storage space within the EGA cloud, containing files accessible for download in a secure way.
Browse the remote directory on the right side of the FileZilla screen. Select the files you wish to download, right-click, and choose Download (Figure 2).
Figure 2: Step-by-step process of manually downloading files from __EGA_OUTBOX_DOMAIN__ using FileZilla, with FileZilla version 3.52.2 operating on IOS v11.2.3. The figure demonstrates how users can download data from the EGA outbox to their local storage by following the depicted steps.
SFTP command line
You can also use the SFTP command line to securely download files from the EGA Outbox.
Using SFTP command line client in Linux/Unix
Open a terminal window
Enter the following command to connect: sftp username@hostname
Enter your EGA password
To see a list of available sftp commands type help
sftp> put – Upload file
sftp> get – Download file
sftp> cd path – Change remote directory to ‘path’
sftp> pwd – Display remote working directory
sftp> lcd path – Change the local directory to ‘path’
sftp> lpwd – Display local working directory
sftp> ls – Display the contents of the remote working directory
sftp> lls – Display the contents of the local working directory
Type the get command to download files. For example: get encrypted_file.c4gh
Use the bye command to close the connection (SFTP session).
Convenient SSH settings
Include the following settings in your SSH config file, located in ~/.ssh/config
Host __EGA_OUTBOX_DOMAIN__ EGA-outbox
    HostName __EGA_OUTBOX_DOMAIN__
    User username
    IdentityFile path/to/the/private/key
Replace username and path/to/the/private/key with the appropriate settings, and you will be able to connect to the __EGA_OUTBOX_DOMAIN__ simply using sftp EGA-outbox.
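With the alias in place, several downloads can be scripted through sftp's batch mode. The sketch below only builds the text of a batch file from the commands listed earlier (lcd, get, bye). The function name and file names are hypothetical.

```python
import shlex


def build_sftp_batch(remote_files, local_dir="."):
    """Return the text of an sftp batch file that downloads each remote file."""
    lines = [f"lcd {shlex.quote(local_dir)}"]
    lines += [f"get {shlex.quote(name)}" for name in remote_files]
    lines.append("bye")
    return "\n".join(lines) + "\n"
```

Save the returned text to a file such as batch.txt and run sftp -b batch.txt EGA-outbox. Batch mode cannot prompt for input, so it is best combined with key-based authentication.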
How to decrypt
Files archived at the EGA are encrypted using Crypt4GH. Hence, to decrypt the files you need to install Crypt4GH. You can install a Python implementation of it with
pip install crypt4gh
or directly from the GitHub repository:
pip install git+https://github.com/EGA-archive/crypt4gh.git
After installing Crypt4GH, decrypt files using the following command:
crypt4gh decrypt --sk /path/private/key < encrypted_file.c4gh > decrypted_filename
The command reads the encrypted file from stdin (with <) and outputs the decrypted version to stdout (with >).
Replace encrypted_file.c4gh and decrypted_filename with the appropriate filenames, but make sure not to use the same filename for both reading and writing: the shell would truncate the file before the command has a chance to read it.
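The same command can be wrapped in a script that refuses to run when the input and output names coincide. This is a sketch, not an official client: the function names are hypothetical, and it assumes the crypt4gh command installed above is on your PATH.

```python
import os
import subprocess


def decrypt_command(secret_key):
    """Argument list mirroring the crypt4gh decrypt command shown above."""
    return ["crypt4gh", "decrypt", "--sk", secret_key]


def decrypt_file(secret_key, encrypted_path, decrypted_path):
    """Run crypt4gh decrypt with the files wired to stdin and stdout."""
    # Guard first, so the output file is never opened (and truncated) by mistake.
    if os.path.abspath(encrypted_path) == os.path.abspath(decrypted_path):
        raise ValueError("input and output must be different files")
    with open(encrypted_path, "rb") as src, open(decrypted_path, "wb") as dst:
        subprocess.run(decrypt_command(secret_key), stdin=src, stdout=dst, check=True)
```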
Frequently Asked Questions
What username should I use to log in to my outbox?
The authentication process for logging in to the EGA website, as well as accessing your inbox and outbox, requires the use of your username, not your email address. Therefore, if you registered a username different from your email address when creating your EGA account, you must use that username to log in.
If you have forgotten your registered username, please contact our Helpdesk team for assistance.
I see that some files in my dataset have 'unavailable' as extension. What should I do?
Within your Outbox, you'll find a list of all the datasets available for download. Occasionally, certain files may be marked as "unavailable".
These unavailable files can be identified by the "unavailable" extension added to their filenames (e.g. filename.fastq.gz.unavailable.c4gh).
If you encounter an unavailable file that you need, please reach out to our Helpdesk. We'll promptly work on making the file accessible for download as soon as possible.
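When a dataset contains many files, the unavailable ones can be separated from a directory listing with a small helper. The function name is hypothetical, and the check relies only on the naming pattern shown above, where "unavailable" appears as one of the dot-separated parts of the filename.

```python
def split_by_availability(filenames):
    """Split a listing into (available, unavailable) using the 'unavailable' extension."""
    available, unavailable = [], []
    for name in filenames:
        target = unavailable if "unavailable" in name.split(".") else available
        target.append(name)
    return available, unavailable
```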
Specific to using keys
Can I access one EGA account from different devices?
Yes, you can access your account from different devices by linking several public keys to your EGA account. Each device can generate a unique public-private key pair, and the corresponding public keys can be linked to the same account. This way, you can use different public keys on different devices and still have access to the same account and data.
I have several keys and I don't remember which one is which
When generating SSH keys, it's a good practice to add a comment using the -C flag. This will allow you to add a descriptive tag to your key, making it easier to identify later on. Here's an example command that generates an SSH key with a comment:
ssh-keygen -t ed25519 -C work-pass
In this example, we're generating an ed25519 SSH key with the comment work-pass. Once you have multiple keys with different comments, you can use the comments to easily identify each key.
To view the comments for your existing SSH keys, you can use the following command:
ssh-keygen -l -f /path/to/key
This will display the key fingerprint and the associated comment. By checking the comments, you should be able to identify which key is which.
What if I can't find my SSH keys for uploading files with a key file, and how can I use new keys?
If you can't find your SSH keys, don't worry - you can make new ones. To do this, open your terminal or command prompt and type a command to make a new SSH key. You can pick a name for the key, and choose a password to keep it safe. After making the key, you can add the new key to your account or server where you want to upload files using the key file. This usually involves copying and pasting the key's "public" (e.g. file.pub) part to the right place. If you lose track of the key again, just make a new one and add it again. Keep in mind that SSH keys belong to you and your computer, so if you switch computers or accounts, you'll need to make new keys.
I don't want to type the passphrase every time I use the key. What can I do?
You can use an ssh-agent to avoid typing the passphrase every time you use the key. An ssh-agent is a program that stores your private keys in memory and provides them to ssh when needed. You can add your key to the ssh-agent using the command ssh-add followed by the path to your key file. Here's an example of the steps to follow:
Open a terminal window.
Start the ssh-agent by typing the command eval $(ssh-agent).
Add your key to the ssh-agent by typing the command ssh-add [key filepath].
For instance, if your key file is located in the home directory with the name mykey, the command will look like this:
ssh-add ~/mykey
After adding your key to the ssh-agent, you should be able to use ssh without having to enter your passphrase every time.
Can I use my password for authentication (without my private key)?
If you prefer to use your username and password for authentication instead of your private key, you can still do so. When using a Graphical User Interface (GUI) such as FileZilla, you can select Ask for password as your Logon Type (Figure 3). This option will prompt you to enter your password when you click Connect, instead of using your private key.
Figure 3: Process of establishing a new connection to __EGA_OUTBOX_DOMAIN__ using your password as the logon method in FileZilla. The figure showcases the FileZilla version 3.52.2 operating on IOS v11.2.3. By following the depicted steps, users can create a secure and efficient connection to the EGA outbox, ensuring seamless data transfers.
It's worth noting that using a password for authentication can be less secure than using an SSH key, as passwords can be more easily compromised through various means. However, if you choose to use your password for authentication, selecting "Ask for password" as your Logon Type is a good way to do so securely via a GUI.
Why is it better to use my key and not my password?
Using SSH keys for authentication is generally considered more secure and convenient than using passwords. SSH keys are more difficult to crack than passwords, and they can be restricted to specific users and machines, giving you more control over access. Once you set up your SSH keys, you can use them to authenticate quickly and easily, without having to enter a password every time. This makes automation of tasks, such as uploading encrypted files, much simpler. Additionally, SSH keys provide better logging, allowing you to keep track of who is accessing your systems and when. All in all, using SSH keys is a good practice for improving security and convenience in your authentication process.
Documentation
access/download/files/live-outbox
Whole exome sequencing of HCCs in children with bile salt export pump deficiency
Whole exome sequencing of 6 HCCs and matched background liver in children with bile salt export pump deficiency.
Dataset
EGAD00001000811
QSEA – modelling of genome-wide DNA methylation from sequencing enrichment experiments
We describe a new method, QSEA, for analyzing methylation enrichment data. We generated a benchmark experiment consisting of MeDIP-seq enrichment data, targeted BS-seq validation data, and RNA-seq data on five pairs of tumor and adjacent normal samples of non-small cell lung cancer patients.
Details of the method and its performance on the data have been described in Lienhard et al. (2016).
Study
EGAS00001001822
Targeting TRIP13 in Wilms Tumor with Nuclear Export Inhibitors
Wilms tumor (WT) is the most common renal malignancy of childhood. Despite improvements in the overall survival, relapse occurs in ~15% of patients with favorable histology WT (FHWT). Half of these patients will succumb to their disease. Identifying novel targeted therapies in a systematic manner remains challenging in part due to the lack of faithful preclinical in vitro models. We established ten short-term patient-derived WT cell lines and characterized these models using low-coverage whole genome sequencing, whole exome sequencing and RNA-sequencing, which demonstrated that these ex-vivo models faithfully recapitulate WT biology. We then performed targeted RNAi and CRISPR-Cas9 loss-of-function screens and identified the nuclear export genes (XPO1 and KPNB1) as strong vulnerabilities. We observed that these models are sensitive to nuclear export inhibition using the FDA approved therapeutic agent, selinexor (KPT-330). Selinexor treatment of FHWT suppressed TRIP13 expression, which was required for survival. We further identified in vitro and in vivo synergy between selinexor and doxorubicin, a chemotherapy used in high risk FHWT. Taken together, we identified XPO1 inhibition with selinexor as a potential therapeutic option to treat FHWTs and in combination with doxorubicin, leads to durable remissions in vivo.
Study
EGAS00001007389
Processed Chromium Single Cell GEX, CSP and VDJ data from intestinal plasma cells of untreated celiac disease patients
The dataset contains processed sequencing data from Chromium Single Cell 5’ gene expression, human B cell VDJ and feature barcode (CSP) sequencing from transglutaminase 2-specific and other small intestinal plasma cells isolated from four untreated celiac disease patients. The raw sequencing data has been processed with Cell Ranger v.6.0.2 with the multi and aggr functions using the pre-built Cell Ranger references GRCh38 version 2020-A for gene expression and GRCh38-alts-ensembl-5.0.0 for V(D)J analysis. The dataset consists of a gene expression and antibody capture expression matrix (cell barcodes and feature names in tsv.gz file, expression matrix in mtx.gz file) and VDJ sequences in AIRR format (csv file). A metadata file (csv file) details cells passing our custom quality control based on number of detected genes, UMIs, mitochondrial genes, immunoglobulin genes and a productively rearranged immunoglobulin heavy chain of the IgA isotype.
Dataset
EGAD50000000339
How to encrypt files with Crypt4gh
File preparation
Due to the processes used at the EGA for file archival the use of non-alphanumeric characters in a filename will cause issues in archival. By convention whitespaces in filenames are to be avoided and should be replaced with the underscore character (_). Before encrypting your files please make sure that any files that will be uploaded to EGA do not use special characters such as # ? ( ) [ ] / \ = + < > : ; " ' , * ^ | &
Files encrypted with Crypt4gh must be uploaded via INBOX
Before uploading
If you are not a registered EGA user, you will first need an EGA user account.
Please note that it may take a few days for your account to be activated, as it needs to be vouched for by the EGA Helpdesk. Once your account is validated you will be able to request a submitter role.
Meanwhile, you can create and add your public key to your EGA account profile. This option is not available for old submission accounts (e.g. ega-box-NNN).
As soon as you have been granted a submitter role, you will be able to connect with your username and password to the EGA inbox using the SFTP protocol. If you have also registered a public key in your profile, you can also connect using this key.
Encrypting your files
Please note that you can also encrypt your files by uploading them directly to the "to-encrypt" folder in your upload area.
If your connection is unstable, please encrypt them using Crypt4gh.
The EGA encryption of this inbox is based on Crypt4GH. You can install a Python implementation of it with
pip install crypt4gh
or directly from the GitHub repository:
pip install git+https://github.com/EGA-archive/crypt4gh.git
Now save the following Crypt4GH public key into a file, say ingestion.pubkey.
__EGA_INGESTION_PUBKEY__
Encrypt a given file with the following command:
crypt4gh encrypt --recipient_pk ingestion.pubkey < file_to_encrypt > encrypted_file.c4gh
The command reads the file from stdin (with <) and outputs the encrypted version to stdout (with >).
Replace file_to_encrypt and encrypted_file.c4gh with the appropriate filenames, but make sure not to use the same filename for both reading and writing: the shell would truncate the file before the command has a chance to read it.
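To encrypt a whole folder, the command above can be repeated per file from a script. The sketch below writes each result to a separate output directory with a .c4gh suffix, so input and output names can never collide. The function names and the directory layout are assumptions, and the crypt4gh command is assumed to be on your PATH.

```python
import subprocess
from pathlib import Path


def plan_encryption(input_dir, output_dir):
    """Pair each regular file in input_dir with a .c4gh target in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    if input_dir.resolve() == output_dir.resolve():
        raise ValueError("use a separate output directory")
    return [(src, output_dir / (src.name + ".c4gh"))
            for src in sorted(input_dir.iterdir()) if src.is_file()]


def encrypt_all(input_dir, output_dir, recipient_key="ingestion.pubkey"):
    """Run crypt4gh encrypt once per planned pair, wiring files to stdin and stdout."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_encryption(input_dir, output_dir):
        with open(src, "rb") as fin, open(dst, "wb") as fout:
            subprocess.run(["crypt4gh", "encrypt", "--recipient_pk", recipient_key],
                           stdin=fin, stdout=fout, check=True)
```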
Documentation
submission/data/file-preparation/crypt4gh
HV31 - Read identifier list for local CCS, CLR, ONT and MGI reads
This file contains read identifiers for local CCS, CLR, ONT and MGI reads for each of the eight selected genomic regions (HLA, KIR, IGH, IGK, IGL, TRA, TRD, and TRG). We extracted these reads by aligning whole-genome sequencing data to a draft whole-genome de novo assembly, and selecting reads that map to contigs representing each region. These reads were involved in the polishing and validation of the HV31 assembly. Please refer to the relevant manuscript (https://doi.org/10.1101/2021.02.03.429586) for additional details. Read identifiers are stored in JSON format. Along with the full FASTQ files, this file enables convenient re-analysis of the HV31 sequencing data in the eight selected regions.
Dataset
EGAD00001007761
How to upload GPG files
Uploading files
Users who hold an ega-box-XXX account can upload files using either INBOX or FTP. Users who have a Submitter role associated with their email will only be able to upload files using INBOX.
Before uploading your files please make sure that any files that will be uploaded to EGA do not use special characters in their naming convention such as # ? ( ) [ ] / \ = + < > : ; " ' , * ^ | &. This can cause issues with the archiving process, leading to problems for end users.
The EGA is a shared, public service with limited storage. In order to manage the available resources, we enforce a limit of 10Tb per submission account at any one time. Please do not exceed this limit.
The FTP is only compatible with files encrypted using the EGACryptor tool
Before uploading
Once your submission files have been prepared using the EGACryptor, the resulting encrypted files and associated md5sum files can be uploaded to your submission account using Aspera or FTP.
The EGA is a shared, public service with limited resources. In order to manage the available resources, EGA submission boxes should not exceed 8Tb in size, and cannot exceed 12Tb. If you are approaching this limit, please contact the EGA Helpdesk so that we can advise on how to register the associated metadata and trigger the archiving of files, so that you can continue with your submission. If we note that your submission account consistently exceeds 10Tb, your password will be changed until metadata is associated.
Aspera Download
Aspera is a commercial file transfer protocol that may provide faster transfer speeds than FTP, especially over longer distances.
To obtain the Aspera ascp command line client, please select Aspera Connect.
The ascp command line client is distributed as part of the Aspera Connect high-performance transfer browser plugin and is free to use, without registration.
The minimum required version of the IBM Aspera CLI is V4. Further instructions.
Using the Aspera ascp command line program
The location of the ascp program in the filesystem:
Mac: go to /Applications/Aspera\ Connect.app/Contents/Resources/ (for example with cd), where you'll find the command line utilities, including ascp.
Windows: the downloaded files are a bit hidden. For instance, in Windows 7 ascp.exe is located in the user's home directory, in AppData\Local\Programs\Aspera\Aspera Connect\bin\ascp.exe
Linux: it should be in your user's home directory. Go to /home/username/.aspera/connect/bin/ (for example with cd), where you'll find the command line utilities, including ascp.
Your command should look similar to this:
ascp -P33001 -O33001 -QT -l300M -L- /path/file ega-box-N@fasp.ega.ebi.ac.uk:/path
If you wish to upload several files without being prompted for the password, please use the command below:
ASPERA_SCP_PASS=ega-box-password ascp -P33001 -O33001 -QT -l300M /path/file ega-box-N@fasp.ega.ebi.ac.uk:/path/
Explanation of parameters
The -l300M option sets the upload speed limit to 300 Mbit/s. You may wish to lower this value to increase the reliability of the transfer.
The -L- option prints logs out while transferring.
The files to upload can be a file mask (e.g. '/homes/submitter/*.srf') or a list of files.
ega-box-N is your submission account login.
Add the -k2 switch to enable transfer restarts.
Check the command line transfer usage for more configuration details.
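For scripted uploads, the command above can be assembled and launched from Python. This sketch reuses only the flags shown in this section. The function names are hypothetical, the default remote path mirrors the /path/ placeholder used above, and ascp is assumed to be on your PATH.

```python
import os
import subprocess


def build_ascp_command(files, account, remote_path="/path/", rate="300M", resume=True):
    """Assemble the ascp argument list used in the examples above."""
    command = ["ascp", "-P33001", "-O33001", "-QT", f"-l{rate}", "-L-"]
    if resume:
        command.append("-k2")  # the restart switch mentioned above
    command += list(files)
    command.append(f"{account}@fasp.ega.ebi.ac.uk:{remote_path}")
    return command


def upload(files, account, password, **options):
    """Run ascp with the password supplied through ASPERA_SCP_PASS."""
    env = dict(os.environ, ASPERA_SCP_PASS=password)
    subprocess.run(build_ascp_command(files, account, **options), env=env, check=True)
```

Passing the password through the environment of the child process keeps it out of the argument list, which is what the ASPERA_SCP_PASS example above achieves on the shell.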
Using FTP to upload your prepared files
Use your preferred ftp client. For example, lftp is a popular choice for Linux and Mac users.
Use binary mode for file transfers.
Use ftp.ega.ebi.ac.uk as the target host.
Login with your ega-box username and password.
Upload files to your private ega-box upload area.
Depending on your network setting you might wish to start FTP in passive or active mode.
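The steps above can also be followed with Python's standard ftplib module instead of an interactive client. This is a minimal sketch with hypothetical function names. It uses plain FTP as described in this list and does not cover the TLS settings discussed for lftp.

```python
from ftplib import FTP
from pathlib import Path


def upload_files(ftp, paths):
    """Upload each path in binary mode over an already logged-in FTP connection."""
    uploaded = []
    for path in map(Path, paths):
        with open(path, "rb") as handle:
            ftp.storbinary(f"STOR {path.name}", handle)  # storbinary transfers in binary mode
        uploaded.append(path.name)
    return uploaded


def upload_to_ega(paths, username, password, passive=True):
    """Connect to ftp.ega.ebi.ac.uk, log in and upload the given files."""
    with FTP("ftp.ega.ebi.ac.uk") as ftp:
        ftp.login(user=username, passwd=password)
        ftp.set_pasv(passive)  # switch between passive and active mode
        return upload_files(ftp, paths)
```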
Using default FTP command line client in Windows
Start the command line interpreter: press Win+R, type cmd, hit enter
Enter ftp ftp.ega.ebi.ac.uk
Enter your submission username
Enter your submission password
Type binary to enter binary mode for transfer
To see a list of available ftp commands type help.
Type ls command to check the content of your submission account.
Type prompt to switch off confirmation for each file uploaded.
Use mput command to upload files: mput *.bam*
Use bye command to exit the ftp client.
Use exit command to exit the command line interpreter.
Using default FTP command line client in Linux / Unix
Open a terminal and type ftp ftp.ega.ebi.ac.uk
Enter your submission username
Enter your submission password
Type binary to enter binary mode for transfer
To see a list of available ftp commands type help.
Type ls command to check the content of your submission account.
Type prompt to switch off confirmation for each file uploaded.
Use mput command to upload files: mput *.bam*
Use bye command to exit the ftp client.
Using FTP client FileZilla
We recommend the use of FileZilla, a free FTP client. FileZilla is open source software distributed free of charge under the terms of the GNU General Public License.
Use the following connection details (File - Site Manager) and add your submission account username and password:
[Figure: Using FTP client FileZilla]
Select the files you wish to upload and then select upload:
[Figure: Using FTP client FileZilla]
Using LFTP with TLS
We recommend the following to force the use of a secure connection.
lftp > set ftp:ssl-force yes
We also recommend the following setting, which leaves the bulk data itself unencrypted for performance reasons (the authentication will still be encrypted):
lftp > set ftp:ssl-protect-data no
In order to verify the certificate, the recommended way is to use the CA certificates from your machine. To do that, use this command in lftp, adjusting the path to your ca-certificates location.
lftp > set ssl:ca-file "/etc/ssl/certs/ca-certificates.crt"
If that is not possible, or the certificates are old and you can't update them, you can download the certificates needed from the Quo Vadis Digital Repository.
The two certificates to download in PEM format are:
QuoVadis Root CA2 G3
QuoVadis EV SSL ICA G3
Then you can concatenate them into a single file, one after the other, and save it as lftp-certificates.pem
Once this is done, point the ssl:ca-file variable to the path of that file:
lftp > set ssl:ca-file "/path/to/lftp-certificates.pem"
Also note that you can save this configuration at ~/.lftp/rc
Another option is to download the certificates and add them to the ca-certificates of your machine. For example, on RHEL 7, CentOS and similar systems, the process to add the certificates globally is:
Download the two certificates (in PEM or DER format, doesn't matter) and save them to "/etc/pki/ca-trust/source/anchors/"
Run "update-ca-trust extract"
Another less secure option is to turn off certificate verification with the following command:
lftp > set ssl:verify-certificate false
Troubleshooting
If you are having problems with Aspera connection timeouts, the cause is usually one of the following.
Transfers cannot start: the connection fails instantly.
Ensure that TCP traffic on port 33001 is allowed (open) for outbound connections through both your computer's firewall and your network's firewall.
The connection is made and transfers start, but 0 bytes (0%) are uploaded for each file.
Ensure that UDP traffic on port 33001 is allowed (open) for outbound connections through both your computer's firewall and your network's firewall.
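To narrow down the first case, you can test whether an outbound TCP connection on port 33001 can be opened at all. The snippet below is a minimal Python sketch using only the standard library. The function name is hypothetical, and the host to test is whichever Aspera server you were given (none is assumed here). UDP reachability cannot be confirmed this way, because UDP is connectionless.

```python
import socket


def tcp_port_reachable(host, port=33001, timeout=5.0):
    """Return True if an outbound TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A result of False suggests that a firewall between your machine and the server is blocking the port, or that the host name is wrong. A result of True only shows that the TCP side is open.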
Submission FAQ
Before Submission
Is the EGA the right archive for my data?
The most suitable archive for your data depends on the type of data you wish to submit and whether the data require public or controlled access. Public access is defined as complete and open access to all submitted data. In contrast, controlled access, as applied by the EGA, requires formal applications to be made to access the submitted data files and metadata.
EGA only accepts human-derived data subject to controlled access. If your submission contains other types of data, please choose the appropriate repository for it (see image below): ENA, EVA, ArrayExpress, BioSD and GWAS catalog.
Should your submission be subjected to controlled access?
Data access conditions are normally defined in the original informed consent agreements signed by the participants involved in your study. All data submitted to the EGA is subject to controlled access. These consents prevent the derived data files, which are potentially identifiable, from being distributed through open and public access. Controlled access data often refers to human data derived from medical research and consortium projects. If in doubt, consult the informed consent agreements that apply to your study.
The EGA enables you to hold a submission before publication.
What data types can be submitted to the EGA?
Data types accepted by the EGA can be split into three categories:
Sequences: both in generic and platform-specific formats.
Array-based: from raw signal files to processed matrices.
Phenotypes: all possible phenotype formats are accepted.
All manufacturer-specific raw data formats derived from major next generation sequencing platforms are accepted. Also generic sequence formats: flat reads in a FASTQ file, aligned sequences (BAM or CRAM files) as well as sequence variation files in VCF format.
All array-based technologies are accepted, including raw data, intensity and analysis files, without any restriction on data formats accepted.
We also accept and distribute phenotype data (associated to the samples) in almost any format: from an image to a README file.
How long does a submission take?
Submissions to EGA come in a variety of formats and sizes, thus it is difficult for us to exactly predict how long a submission will take. We, therefore, advise all of our submitters to allow as much time as possible to make a submission. Based on previous records, we anticipate that the submission process may take at least one month. The submitter’s familiarity with the procedures, possible technical issues that may arise during submission and the amount of data that is being submitted are the main factors that will affect the length of the submission process. However, each step of a regular submission should be considered when estimating the time it would take:
Encryption of the files
Upload of the files
Metadata submission
Archival of the files
Release of the study and datasets to EGA webpage
For example, the upload of the files depends on the submission size, while metadata submission mainly relies on each submitter’s expertise. Further, some steps (e.g. answering inquiries) depend on the EGA Helpdesk team and may take some days to be processed during busy times.
Is data deposited in the EGA secure?
The EGA set-up consists of a secure computing facility for data processing, a shared EBI set-up for data submissions and distribution of data via data requests made through the EGA website. Data is also copied in the Barcelona Supercomputing Center (BSC) infrastructure, where all stored and distributed data is encrypted.
Data is encrypted throughout the submission process and stored securely, with access granted exclusively to authorised users. During the download process, through our Python Client or Aspera, all requested data is downloaded over secure https connections.
All data at the EGA is encrypted, and only accessible (for log-in and download) through secure protocols. For further information, please visit our security overview.
What documentation do I need to provide?
All submissions require policy documentation: 'Data Access Agreement (DAA)', 'Data Processing Agreement (DPA)' and 'Authorized Submitters Formulary'. The data processors (EGA) and the data owners will also sign the DPA.
Will all metadata be public?
Among the submitted metadata we need to make the distinction between identifiable and unidentifiable metadata: (1) the former may allow the identification of the person from whom the sample was derived (e.g. detailed geographical provenance, personal name, family ancestry…); (2) while the latter can be used to interpret the data without compromising the anonymity of the patients.
The majority of the metadata submitted to the EGA corresponds to the unidentifiable category (e.g. sequencer's model). This type of metadata is publicly available on the EGA website and other EBI resources/partners’ websites.
On the other hand, some parts of the samples’ metadata may be identifying, and are thus only accessible to authorized data requesters, with the exception of:
5 submitter-defined attributes of the sample: alias, title, subject_id, gender and phenotype. It is the submitter’s responsibility not to submit sensitive metadata in these public fields.
3 anonymised fields that pinpoint the sample record in archivals: sample’s EGA stable ID (EGAN…), BioSample ID (SAMEA…) and submitter’s center name.
During Submission
Are there any sample specific requirements for EGA?
All samples submitted to the EGA must include the attributes of biological sex, subject ID (anonymised individual identifier) and phenotype information. These are critical for data findability and its analysis, and we highly recommend using controlled ontology terms where applicable. For example: defining tumour and non-tumour samples and/or defining disease state.
The EGA recommends using the Experimental Factor Ontology Database to find ontologized terms that describe your sample phenotypes.
How do I get an accession number to use in my publication?
You will receive your study accession number (EGAS…) upon completing your submission, either:
Programmatically. As soon as the metadata is submitted and validated your study will be assigned an accession number that will be given in the submission’s response.
Manually registering your study and relevant metadata using the online metadata submission tool: the EGA submitter Portal.
How are files uploaded to the EGA?
Data files are uploaded into private submission drop boxes (i.e. environments to which you are granted access and where you can transfer your files) using INBOX or FTP. These spaces are provided as part of the submission procedure. Before uploading any file, you must encrypt it. Only encrypted files shall be uploaded to the drop boxes.
Why do my submitted files need to be encrypted?
It is one of the security steps the EGA has implemented. In case of a security breach, people without the proper encryption key will not be able to read or use the information that could have been leaked. This measure is essential when working with sensitive data, such as controlled access human data.
All submitters must use crypt4gh to create EGA compliant files prior to uploading them. The files are encrypted using EGA’s public key.
Why are my files not available if I see them in the INBOX?
There exists a time window between the data upload and the availability of such files via the Submitter Portal. For this reason, some metadata (run and analysis objects) cannot be registered until at least 24 hours after the files have been uploaded to your box.
Why are MD5 sum values generated for my submitted files?
We require pre- and post- encryption MD5 (message-digest) checksum values to be provided for all submitted files. These 128-bit values are computed using the content of each file, creating unique sequences that allow us to ensure that file integrity has been maintained during the transfer process. In other words, if the MD5 checksums we generate and those you generated match, we infer that the content of the transferred files is correct (i.e. files are not corrupted or truncated). MD5 checksums are computed automatically using the crypt4gh tool provided.
Your submission will not be accepted and may be significantly delayed if you do not provide MD5 checksum values for all data files in the required format.
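If you want to double-check a checksum yourself, an MD5 value can be computed with a few lines of code. The following is a minimal Python sketch using the standard hashlib module. The function names are hypothetical, and it makes no assumption about the file format in which the EGA expects the checksums to be delivered.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use constant for very large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_match(path, expected_md5):
    """Compare a file's MD5 with an expected value, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Running md5_of_file on the original file and again on the encrypted file gives the pre- and post-encryption values respectively. A mismatch after transfer indicates that the file was corrupted or truncated.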
How can I check if my files are correctly uploaded to the inbox?
It is important to check the status of your files so you know whether they are in the inbox, being processed, or affected by an issue. To check this, you should:
Look for the file locally.
Drag and drop it to the file table. Bars will then appear on the table, which means that we are processing it.
Green: The file's checksums are correct and your file will move to “ingested files”. No further action is needed from your end.
Red: The file's checksums do not match and your file needs to be re-uploaded. Please re-upload the relevant files to your inbox using the same path.
After Submission
How do I use my accession number in my publication?
We suggest the use of the below template, using your study accession ID (EGAS…) :
Data has been deposited at the European Genome-phenome Archive (EGA), which is hosted by the EBI and the CRG, under accession number EGASXXXXXXXXXXX.
Further information about EGA can be found at https://ega-archive.org and "The European Genome-phenome Archive of human data consented for biomedical research"
Your study ID will be the one that groups your whole submission, and thus its usage is recommended as such. Nevertheless, all metadata submitted to EGA hold a unique and persistent identifier (starting with EGA…) that can be used to identify specific records. For example, you could reference a specific dataset (EGAD…) or sample (EGAN…) in your publications (see full list of identifiers).
How do I make my data searchable?
Once you have finalized your submission, you can schedule the data release. Please take into account that the release process needs time for the files to be archived in our system, and for the Helpdesk team to validate your submission.
Can I withdraw (meta)data from the EGA?
We have methods in place for the secure removal of deposited (meta)data. Contact EGA-helpdesk for further details.
EGA complies with FAIRness of (meta)data, and thus, even when the data is removed we keep an entry for their identifiers in our system. In other words, we execute a soft delete on canceled objects (e.g. a study): metadata is still stored in our systems, but it loses all links, cannot be queried and data files cannot be retrieved anymore. The reason behind this behaviour is so that queries using withdrawn data properly respond back (see example of a canceled study).
What happens to the data once it has been submitted to the EGA?
When the data is submitted, the submitter can choose either to keep their data private or to schedule the release of their data.
EGAD00010000560
SNP array of 7 HCCs and matched background liver in children with bile salt export pump deficiency
Dataset
EGAD00010000560
Targeted capture, whole genome sequencing, and RNAseq to identify rearrangements in B-cell lymphomas
This study contains whole genome and custom targeted capture sequencing of mature B-cell lymphomas (DLBCL, FL, BL, HGBCL-DH-BCL2, HGBCL-DH-BCL6) to identify translocation breakpoints of common oncogene rearrangements (MYC, BCL2, BCL6). It is complemented by RNAseq data where available. Complete details are available in the publication Hilton et al, 2024, Blood.
Capture sequencing: 357 samples; 364 unique libraries; cram file format aligned to hg38
Whole genome sequencing: 12 samples; 12 unique libraries; cram file format aligned to grch37
RNAseq: 257 samples; 257 unique libraries; fastq file format
Study
EGAS50000000328
Quick Guide for data submission
Quick Guide
This is a quick guide to submit data to the EGA. Please select data type to display the right detailed instructions.
Sequence data
Array-based data
Get your submission account
You first need to create your EGA user. Then, once your account has been verified by the Help Desk team, request your submission account and provide details of the data type and platform(s) in your submission.
Register Study/DAC
Use Submitter Portal to register your study, samples, Data Access Committee (DAC) and Policy.
Upload data
Encrypt your data files and upload them to your inbox using SFTP.
Register experiments and runs
Associate each data file to a registered sample and study by Linking files to samples. Details of the experimental procedure you followed must be provided.
Finalise your submission
Group your runs/analyses into datasets and link them to your new or existing DAC and policy using DAC Portal. Data requests are made at the dataset level, thus files within a dataset must share release conditions.
Release and admin
When finalising the submission, set a release date to instruct our Helpdesk to release your study. All registered studies are automatically placed on hold until the named submission or DAC contact instructs our Helpdesk for the study to be released.
Get your submission account
Fill in the submission form and provide details of the data type and platform(s) in your submission.
Register Study/DAC
Use EGA Programmatic submission to register your study, samples, Data Access Committee (DAC) and Policy.
Upload data
Encrypt your data files using the EgaCryptor and upload them using default FTP clients or Aspera.
Complete the Array-Format (AF) spreadsheet
Download the AF spreadsheet and complete all four sections. Return the spreadsheet to the EGA helpdesk.
Release and admin
Instruct our Helpdesk to release your study. All registered studies are automatically placed on hold until the named submission or DAC contact instructs our Helpdesk for the study to be released.
Division of Cancer Epidemiology and Genetics (DCEG) Imputation Reference Dataset
We have built a new resource for imputation of SNPs for existing and future genome-wide association studies (GWAS), known as the Division of Cancer Epidemiology and Genetics (DCEG) Reference Set. The first build of the data set includes 728 cancer-free individuals of European descent from three large prospectively sampled studies, 98 African-American individuals from the Prostate, Lung, Colon, and Ovary Cancer Screening Trial (PLCO), 74 Chinese individuals from a Chinese clinical trial in Shanxi, China (SHNX), and 349 unrelated individuals from the HapMap Project (see Molecular Data Section for details on arrays used). The final harmonized dataset includes 2.8 million autosomal polymorphic SNPs on 1,249 subjects after rigorous quality control metrics were applied.
Study
phs000396
A WTCCC2 genome-wide association study for psychosis endophenotype (PE) in individuals from UK, Germany, Holland, Spain and Australia.
A WTCCC2 project genome-wide association study for psychosis endophenotype (PE) in 5831 individuals from UK, Germany, Holland, Spain and Australia, genotyped on the Affymetrix 6.0 array. Information about the family structure of the samples can be found in the .info file which accompanies the data. It should be noted that a number of samples in this study have no phenotype information including sex, case status and family membership. Details of the WTCCC2 analysis can be found in Bramon et al. [Biol Psychiatry. 2014 Mar 1;75(5):386-97].
Study
EGAS00001000817
Beacon v2
Beacon v2: a tool for data discovery
Motivation
In the era of data-driven health research and personalised medicine, human genomic data has become extremely valuable. These are also identifiable data, as they carry information pointing to a specific individual as well as their own family; and as such, they must be protected. This makes data discovery particularly challenging: this is where “Beacon” comes in.
A “Beacon” is an API aiming to enable the search of genomic variants and associated information without jeopardising the privacy of the dataset. Here, we refer to its current version, namely version 2 (v2).
Definition
Beacon v2 is a term that can refer to different aspects. The EGA is playing a central role in the following aspects:
The Beacon v2 protocol is a Global Alliance for Genomics and Health (GA4GH) standard.
The Beacon v2 Reference Implementation (B2RI) is an “out-of-the-box” Beacon instance developed with ELIXIR, which facilitates Beacon deployment.
The EGA Beacon(s) are Beacons following the v2 standard and using the B2RI, deployed on top of data hosted at the EGA and allowing for their discovery.
Resources
Depending on whether you are visiting us as a stakeholder (you need more general information about Beacon), a deployer/implementer (you want to have your own Beacon instance), or an EGA user (you want to query Beacon and start browsing data), you will be interested in the following resources:
Your role | Beacon aspect | Documentation type
Stakeholder | Beacon v2 protocol | Beacon website; Beacon page on the GA4GH website
Deployer/Implementer | Beacon v2 protocol | Read the docs: Beacon v2 standard technical description; GitHub repository Beacon v2 standard
Deployer/Implementer | Beacon v2 Reference Implementation | Read the docs: B2RI technical description; GitHub repository B2RI; Guide to deploy Beacon using B2RI
EGA user | EGA Beacon(s) | API in construction; UI in construction
Policy Documentation
The following policy documentation is required to be prepared and submitted to the EGA, together with your data files and associated metadata.
Data Access Agreement (DAA)
The Data Access Agreement is a contract made between Data User and Data Access Committee. The agreement should be drafted by the DAC and includes, but is not limited to, details of data use, publication embargoes and storage. Completion of a DAA by the applicant/s should form part of the application process to the DAC.
NOTICE The data access agreement template below is provided for guidance only and should be adapted as you see fit to suit your own purpose. In the interest of promoting data sharing, we suggest that if an agreement cannot be met around clause 19 in this example that both parties should agree to remain silent, and that the clause should be removed from the agreement.
Example DAA template
Alternate (harmonised) DAA template
TCELL_PILOT_ATAC_SEQ
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000758
Lymphocyte_RNA_profiling
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000564
Preclinical_evolution_of_haematological_malignancies_
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002128
CELM
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002261
Characterization_of_iPSC_derived_macrophages___cardiovascular_pilot
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000876
Programmatic submission based on XML
Programmatic submissions (XML based)
For further information please check our Submission FAQs, submission quickguide as well as submission terms!
Introduction
Besides the Submitter Portal tool, EGA supports programmatic sequence and
clinical data metadata submissions. If you are not sure what this means, you
may want to explore our brief
metadata introduction.
Programmatic submissions are recommended for array-based submissions. Moreover, they may help if your submission is recurrent or it
is difficult to manage manually due to its sheer size. Otherwise,
we highly recommend using the Submitter Portal to perform submissions.
In this page we will guide you through the required steps to programmatically
submit data to the EGA. Programmatic submissions require your metadata to be
structured for an easy and straightforward validation and archival. It
basically consists of formatting your metadata as Extensible Markup Language
(XML) files and submitting them to the EGA using
WEBIN.
Before submitting metadata to the EGA, it is important to ensure that the
information in your XML files is compliant with our standards. You can see
further details on how these standards are maintained at EGA at our EGA
Schemas documentation page. Using
WEBIN, you can validate your XML files against EGA's schemas to ensure that your
metadata is compliant before submission.
WEBIN services
WEBIN production service
WEBIN test service
We advise you to submit your metadata to the test service when submitting to the production service for the first time.
The test service is identical to the production service except that all submissions will be discarded in the following 24 hours.
This allows you to learn about the submission process without having to worry about data being submitted.
Authentication
Authentication is required each time a submission is made. The submission
service uses HTTPS protocol for metadata encryption and identification to
provide a secure submission environment.
Data file upload
The files referenced by Runs and Analyses (e.g. FASTQ) need to be uploaded to the
EGA before these metadata objects are submitted. In other words, if you submit
a Run that references a file that we cannot find associated with your account,
the metadata submission will fail. See further details on how to upload your
files in our
File Upload documentation.
Metadata model of the EGA
Our metadata model is formed by multiple metadata objects. Check further
details in our documentation at our
EGA Schema documentation page.
Working with EGA XML files
Now that the basic concepts of the EGA metadata have been described, you can
start preparing your programmatic submission through XML. Here you will find
the guidance on how to prepare the XML files.
Programmatic Submission Tutorial Video
Take a look at the Programmatic Submission Tutorial Video, which explains the
workflow of a programmatic submission and goes over an example metadata
submission.
When building your XML files, we recommend using text editors (e.g. Sublime Text
or Visual Studio) that allow you to
visualise the structure of the XML with ease. Furthermore, these editors
constantly check the consistency of the XML structure.
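If you generate XML files with scripts rather than an editor, the same kind of structural check can be run programmatically. The following is a minimal Python sketch with a hypothetical function name. It only tests that the XML is well-formed (tags properly nested and closed). It does not validate the content against EGA's schemas.

```python
import xml.etree.ElementTree as ET


def xml_is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the structure is broken."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

For example, xml_is_well_formed("<SAMPLE><TITLE>t</TITLE></SAMPLE>") returns True, whereas the same text with the closing </TITLE> tag missing returns False.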
Alternatively, if the submission consists of a large number of objects
(especially analyses), you may find the tool
star2xml handy. This tool allows for a direct
conversion between metadata in a tabular format (e.g. a spreadsheet) into
XMLs.
Identifying objects: Aliases and center names
Every EGA object must be uniquely identified within the submission account
using their alias attribute. The aliases can be used in submissions to make
references between EGA objects. Let us dig into EGA's use of aliases and
center names:
alias: every object should have a name that is unique within
your submission account. Once submitted successfully, every alias will be
assigned a unique and permanent accession (EGA ID).
refname: when an object
references another by its alias, the alias of the referenced object goes into
the "refname" attribute of the referencing object. For example, if a sample
has the alias "sample1", and an experiment uses this sample, then the
experiment's "EXPERIMENT/SAMPLE/refname" attribute should be "sample1".
center_name: The "center_name" attribute is required within the submission XML
and, if not provided when the object is submitted, it will be automatically
filled using your default EGA account center_name. This element is the
"controlled vocabulary acronym or abbreviation that is provided to the account
holder when the account is first generated". If the submitter is brokering a
submission for another institute, the submitter should use their special
broker account name in broker_name while the data centre acronym remains in
center_name. Log-in details should have been provided when you requested a
submission account. Please contact our Helpdesk team if you have any
questions.
run_center: Many submitting centers contract out the actual sample
sequencing to another center. In these cases, the sequencing center should be
acknowledged in the run_center attribute. Again, this is controlled vocabulary
and the acronym should be sought from EGA helpdesk before submitting. Please contact our Helpdesk team if you have any
questions.
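The alias and refname rules above lend themselves to a quick consistency check before submission. The following Python sketch is illustrative only. It assumes a hypothetical in-memory representation in which each object is a dictionary with an "alias" key and an optional "refnames" list, and it treats aliases as unique across all objects in the batch. References to previously submitted objects by accession are not covered.

```python
def check_aliases(objects):
    """Return a list of problems: duplicate aliases and unresolved refnames."""
    problems = []
    seen = set()
    for obj in objects:
        alias = obj["alias"]
        if alias in seen:
            problems.append(f"duplicate alias: {alias}")
        seen.add(alias)
    for obj in objects:
        # Every refname must point at the alias of an object in this batch.
        for ref in obj.get("refnames", []):
            if ref not in seen:
                problems.append(f"{obj['alias']} references unknown alias: {ref}")
    return problems


batch = [
    {"alias": "sample1"},
    {"alias": "experiment1", "refnames": ["sample1"]},
]
print(check_aliases(batch))
```

An empty list means that every alias is unique and every refname resolves to an object in the same batch.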
Prepare your XMLs
The goal of this section is to provide sufficient information to be able to create the metadata XML documents required for programmatic submissions.
Please note that the EGA utilises the XML schemas maintained at the European Nucleotide Archive (ENA). Because both archives use the same schemas, parts of the ENA's programmatic submission documentation can also help you with your programmatic submission to the EGA. For example, you can submit programmatically without using a Submission XML by following the steps at Submission actions without submission XML.
A submission does not have to contain all different types of XMLs. For example, it is possible to submit only a few samples; or a study that is later to be referenced. You can submit each object one by one, or submit all in a batch: you choose what method of submission works best for you. We do recommend, nevertheless, that you submit the objects to be referenced (e.g. samples or studies) first, and the objects that reference these (e.g. experiments or datasets) afterwards. You can see a graphical view of these objects and their relationships at our EGA Schemas page.
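The recommendation to submit referenced objects first amounts to a topological ordering of the object types. The sketch below illustrates this in Python (3.9 or later, for the standard graphlib module). The dependency map is a simplified assumption based only on the relationships described on this page, not a complete rendering of the EGA metadata model.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified map: each object type -> the types it references.
REFERENCES = {
    "study": set(),
    "sample": set(),
    "experiment": {"study", "sample"},
    "run": {"experiment"},
    "dataset": {"run"},
}


def submission_order(references):
    """Return the object types ordered so that referenced types come first."""
    return list(TopologicalSorter(references).static_order())


print(submission_order(REFERENCES))
```

Any order returned this way places studies and samples before experiments, experiments before runs, and runs before datasets. The relative order of unrelated types, such as study and sample, is not significant.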
Independently of the submission scenario, you will always require a Dataset XML. The entity of a dataset is what is used to control access to the given data, in the form of runs or analyses. In other words, when a requester is granted access, it is through the dataset and the objects (e.g. runs or analyses) that the dataset contains, granting access to them in one go. Given the nature of the EGA, a dataset XML will always be required for the data access.
First, we will differentiate between submissions of "raw" and "processed" data: Runs and Analyses, respectively.
Run data submissions
Raw data derives from instruments "as is". For example, a plain sequence file (e.g. FASTQ or unaligned BAM files) would be considered raw data.
A typical raw (unaligned) sequence read submission consists of 8 XMLs:
Submission
Study
Sample
Experiment
Run
DAC
Policy
Dataset
When technical reads (e.g. barcodes, adaptors or linkers) are included in the submitted raw sequences, a spot descriptor must be submitted to describe the position of the technical reads so that they can be removed. The following data files can be submitted without providing spot descriptor information in the experiment/run XML:
BAM files (single reads)
SFF files (single reads without barcodes)
FastQ files (single reads without any technical reads)
Complete Genomics files
Analysis data submissions
Processed data is, in some way, refined raw data. This includes raw data that has been processed by some form of analysis method (e.g. alignment, noise reduction, etc.). For example, an aligned sequence (e.g. BAM file), that was created using raw FASTQ files, would be a processed file.
This category includes most types of data: sequence alignment files (e.g. BAM or CRAM), clinical data (e.g. phenopackets), sequence variation files (e.g. VCF), sequence annotation, etc.
A typical EGA analysis data submission consists of 7 EGA XMLs:
Submission
Study
Sample
Analysis
DAC
Policy
Dataset
We accept three different types of analysis data submissions:
BAM files (for multiple read alignments)
VCF files (for sequence variations)
Phenotype files (in any format)
In any case, keep in mind that samples must be created in order to be referenced in the analyses. In other words, the provenance of the information within the BAM, VCF and phenotype files must be traceable to registered samples.
Example XMLs
Below you can find a non-exhaustive list of example XMLs with descriptive fields (i.e. explaining what to provide in each field). Furthermore, you can also find real examples (i.e. the true value of the provided fields) in our GitHub repository.
Submission XML
The submission XML is used to validate, submit or update any number of other objects. The submission XML refers to other XMLs.
New submissions use the ADD action to submit new objects. Object updates are done using the MODIFY action and objects can be validated using the VERIFY action.
Descriptive submission XML example
True values submission XML example
Study XML
The study XML is used to describe the study containing a title, a study type and abstract as it would appear in a publication.
Descriptive study XML example
True values study XML example
Please use the following notation within the property "STUDY_LINKS" when including PubMed citations in the Study XML:
<STUDY_LINKS>
  <STUDY_LINK>
    <XREF_LINK>
      <DB>PUBMED</DB>
      <ID>18987735</ID>
    </XREF_LINK>
  </STUDY_LINK>
</STUDY_LINKS>
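When a study cites several publications, the same fragment can be generated rather than typed by hand. The following is a minimal Python sketch using the standard xml.etree.ElementTree module. The function name is hypothetical, and the element names are exactly those shown in the notation above.

```python
import xml.etree.ElementTree as ET


def pubmed_study_links(pubmed_ids):
    """Build a STUDY_LINKS element with one PUBMED cross-reference per ID."""
    links = ET.Element("STUDY_LINKS")
    for pmid in pubmed_ids:
        link = ET.SubElement(links, "STUDY_LINK")
        xref = ET.SubElement(link, "XREF_LINK")
        ET.SubElement(xref, "DB").text = "PUBMED"
        ET.SubElement(xref, "ID").text = str(pmid)
    return links


# Serialise the fragment so it can be pasted into a Study XML.
print(ET.tostring(pubmed_study_links(["18987735"]), encoding="unicode"))
```

The output is the same structure as above, printed on a single line without indentation.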
Sample XML
The sample XML is used to describe the samples used to obtain the data, whether they were sequenced, measured in any other way, or have an associated phenotype. The mandatory fields include information about the taxonomy of the sample, sex, subject ID and phenotype.
For example, the mandatory attribute fields for each sample would look like these, within the array of "SAMPLE_ATTRIBUTES":
<SAMPLE_ATTRIBUTES>
  <SAMPLE_ATTRIBUTE>
    <TAG>subject_id</TAG>
    <VALUE>free text!</VALUE>
  </SAMPLE_ATTRIBUTE>
  <SAMPLE_ATTRIBUTE>
    <TAG>sex</TAG>
    <VALUE>female/male/unknown</VALUE>
  </SAMPLE_ATTRIBUTE>
  <SAMPLE_ATTRIBUTE>
    <TAG>phenotype</TAG>
    <VALUE>Free text, EFO terms (e.g. EFO:0000574) are recommended</VALUE>
  </SAMPLE_ATTRIBUTE>
</SAMPLE_ATTRIBUTES>
The sample is one of the most important objects to describe biologically, so it is highly recommended that “TAG-VALUE” pairs are generated as SAMPLE_ATTRIBUTES to describe the sample in as much detail as possible. For example, to give the population ancestry of the sample, we could add a new attribute to the array indicating that the sample derives from an individual of "Mende in Sierra Leone" (MSL), with African ancestry:
<SAMPLE_ATTRIBUTE>
  <TAG>Population</TAG>
  <VALUE>MSL</VALUE>
</SAMPLE_ATTRIBUTE>
Given that VALUE and TAG are free text, the combinations are limitless in order to give you full flexibility on the information you want to provide.
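For larger submissions, the attribute array can be generated from your own sample table while checking that the mandatory tags are present. The following Python sketch is illustrative only: the function name and the example values (such as "patient_001") are hypothetical, and the check covers only the three mandatory attribute tags named in this section.

```python
import xml.etree.ElementTree as ET

MANDATORY_TAGS = ("subject_id", "sex", "phenotype")


def sample_attributes(attributes):
    """Build SAMPLE_ATTRIBUTES from (tag, value) pairs, enforcing mandatory tags."""
    tags = {tag for tag, _ in attributes}
    missing = [tag for tag in MANDATORY_TAGS if tag not in tags]
    if missing:
        raise ValueError("missing mandatory sample attributes: " + ", ".join(missing))
    root = ET.Element("SAMPLE_ATTRIBUTES")
    # Pairs (not a dict) so that a tag such as phenotype can appear more than once.
    for tag, value in attributes:
        item = ET.SubElement(root, "SAMPLE_ATTRIBUTE")
        ET.SubElement(item, "TAG").text = tag
        ET.SubElement(item, "VALUE").text = value
    return root


example = sample_attributes([
    ("subject_id", "patient_001"),
    ("sex", "female"),
    ("phenotype", "EFO:0000574"),
    ("Population", "MSL"),
])
print(ET.tostring(example, encoding="unicode"))
```

Passing the attributes as a list of pairs, rather than a dictionary, allows several phenotype entries for the same sample.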
We recommend you use the Experimental Factor Ontology (EFO) to describe the phenotypes of your samples. You can provide more than one phenotype by adding more items to the array of SAMPLE_ATTRIBUTES. Phenotypes considered essential for understanding the data submission should be provided. Each phenotype described should be listed as a separate sample attribute <SAMPLE_ATTRIBUTE> </SAMPLE_ATTRIBUTE>. There is no limit to the number of phenotypes that can be submitted.
If a suitable EFO accession cannot be found for your phenotype attribute, please consider using another controlled ontology database (e.g. HPO, MONDO, etc.) before using free text.
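The attribute array described above can also be generated programmatically. The following is a minimal Python sketch using only the standard library; the helper name build_sample_attributes and the example values (such as patient_001) are hypothetical, and the output is only the SAMPLE_ATTRIBUTES fragment, not a complete sample XML.

```python
import xml.etree.ElementTree as ET


def build_sample_attributes(pairs):
    """Return a SAMPLE_ATTRIBUTES element with one SAMPLE_ATTRIBUTE per (tag, value) pair."""
    root = ET.Element("SAMPLE_ATTRIBUTES")
    for tag, value in pairs:
        attribute = ET.SubElement(root, "SAMPLE_ATTRIBUTE")
        ET.SubElement(attribute, "TAG").text = tag
        ET.SubElement(attribute, "VALUE").text = value
    return root


# Hypothetical values: the three mandatory attributes plus one optional attribute.
attributes = build_sample_attributes([
    ("subject_id", "patient_001"),
    ("sex", "female"),
    ("phenotype", "EFO:0000574"),
    ("Population", "MSL"),
])
xml_text = ET.tostring(attributes, encoding="unicode")
print(xml_text)
```

To describe more than one phenotype, add one ("phenotype", value) pair per phenotype to the list, so that each ends up in its own SAMPLE_ATTRIBUTE element.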
Descriptive sample XML example
True values sample XML example
Experiment XML
The experiment XML is used to describe the experimental setup, including instrument platform and model details, library preparation details, and any additional information required to correctly interpret the submitted data. Where any of these values differ between runs, a new experiment object must exist, since runs are grouped by experiments.
Each experiment references a study and a sample by alias, or if previously-submitted, by accession. Pooled data must be demultiplexed by barcode for submission.
Descriptive experiment ( Illumina paired read ) XML example
True values experiment ( Illumina paired read ) XML example
Run XML
The run XML is used to associate data files with experiments and typically comprises a single data file (e.g. a FASTQ file). Please note that pooled samples should be de-multiplexed prior to submission and submitted as different runs.
Descriptive run XML example
True values run XML example
Analysis XML
Given that an analysis can be used to submit any type of processed data to the EGA, we will list below an example of each of the three most common types of analysis XMLs submitted to the EGA: sequence alignments (e.g. BAM files); sequence variation (e.g. VCF files); and clinical metadata or phenotypes (e.g. phenopackets).
Regardless of the type of processed data submitted in the analysis, the analysis must be associated with a Study and can reference multiple types of other objects, from samples to experiments, if they are available at the EGA.
Just like with Runs, whenever a file is submitted to the EGA through an analysis object, the file MD5 checksums must be present, in order for the EGA to validate file integrity upon transfer. This also includes index files when applicable (e.g. .bai.md5 files).
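If you want to double-check a checksum before referencing a file, the MD5 value can be computed locally. The following is a minimal Python sketch using the standard hashlib module; the function name md5_of_file is hypothetical, and which file the checksum is computed on (encrypted or unencrypted) depends on what your XML needs to declare.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in chunks matters for large BAM or VCF files, which may not fit in memory if read in one call.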
Ideally, any analysis that uses a reference sequence for some kind of alignment (e.g. BAM, CRAM or VCF files), would contain metadata about the alignment, such as INSDC reference assemblies and sequences, by either using accessions (e.g. CM000663.1) or common labels (e.g. GRCh37).
Read alignment (BAM) Analysis XML
The Analysis can be used to submit BAM alignments to EGA. Only one BAM file can be submitted in each analysis and the samples used within the BAM read groups must be associated with Samples.
Descriptive bam alignments XML example
True values bam alignments XML example
Sequence variation (VCF) Analysis XML
The Analysis can be used to submit VCF files to EGA. Only one VCF file can be submitted in each analysis and the samples used within the VCF files must be associated with Samples.
Download analysis XML (VCF)
Phenotype files
The Analysis XML can be used to submit phenotype files to the EGA. Only one phenotype file can be submitted in each analysis and the samples used within the phenotype files must be associated with EGA Samples.
Download analysis XML (Phenotype)
DAC XML
The DAC XML describes the Data Access Committee (DAC) affiliated to the data submission. The DAC may consist of a group or a single individual and is responsible for the data access decisions based on the application procedure described in the POLICY.XML.
As with any other object, if it was already submitted to the EGA, there is no need to submit it again: you can reference an existing object within the EGA. Hence, a DAC XML does not need to be provided if your submission is affiliated to an existing EGA DAC. Further information on DACs can be found here, and you can always contact our Helpdesk team if you have further inquiries.
Descriptive dac XML example
True values dac XML example
Policy XML
The Policy XML describes the Data Access Agreement (DAA) to be affiliated to the named Data Access Committee.
Descriptive policy XML example
True values policy XML example
Dataset XML
The dataset XML describes the data files, defined by the Run.XML and Analysis.XML, that make up the dataset and links the collection of data files to a specified Policy. The dataset XML is commonly the last metadata object to be submitted, since it references multiple other entities.
Please consider the number of datasets that your submission consists of. For example, a case-control study is likely to consist of at least two datasets. In addition, we suggest that multiple datasets should be described for studies using the same samples but different sequence technologies.
Descriptive dataset XML example
True values dataset XML example
Validating and submitting your EGA
Validating EGA's XMLs through Webin
After you have ensured that the XMLs are properly formatted and contain all the required information, you can proceed to validate and submit your data.
Use the curl command to validate your XML file:
Once you have prepared your XML file and confirmed that you have access to Webin, you can validate your XML file programmatically against EGA's schemas using the curl command.
There are multiple ways in which you can validate your XMLs. This variety has to do with the fact that: (1) there are 2 instances of Webin (test and production); and (2) that validation is a default step during submission. In other words, any time that you submit your data through Webin, it will be validated automatically before being accepted. This allows for 4 possible routes of validation, all having the same validation result: validating or submitting to either the production service or the test service of Webin.
For example, directly validating a "study" object XML in the testing service (wwwdev…) would look like the following:
curl -u <USERNAME>:<PASSWORD> -F "ACTION=VALIDATE" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml"
In this command, you would need to replace <USERNAME> and <PASSWORD> with your EGA account username and password, respectively. You would also replace study.xml with the path to your XML file. A mock example would look like the following:
curl -u ega-test-data@ebi.ac.uk:egarocks -F "ACTION=VALIDATE" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml"
The validation attempt can have different results depending on the given arguments:
If your XML file is valid according to EGA's schemas, you will see a message indicating that your XML file is compliant. For example, see below for our mock example, where "success" was "true" (i.e. no validation errors were found). Notice, however, that the accession in "<STUDY accession=" is empty: because we were only validating, the study was not assigned an accession or ID.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2023-04-11T15:19:28.850+01:00" submissionFile="submission-EBI-TEST_1681222768850.xml" success="true">
<STUDY accession="" alias="Mock example" status="PRIVATE"/>
<SUBMISSION accession="" alias="SUBMISSION-11-04-2023-15:19:28:840"/>
<MESSAGES>
<INFO>VALIDATE action has been specified.</INFO>
<INFO>Submission has been rolled back.</INFO>
<INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
</MESSAGES>
<ACTIONS>VALIDATE</ACTIONS>
<ACTIONS>PROTECT</ACTIONS>
</RECEIPT>
If there are any errors or warnings, the response will list them, allowing you to correct them before submitting your data to EGA. For example, the following response reports that the object we were trying to submit already exists, and therefore "success" was "false".
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="receipt.xsl"?>
<RECEIPT receiptDate="2023-04-11T15:12:35.609+01:00" submissionFile="submission-EBI-TEST_1681222355609.xml" success="false">
<STUDY alias="Example!_Human Microbiome Project SP56J" status="PRIVATE" holdUntilDate="2023-03-11Z"/>
<SUBMISSION alias="SUBMISSION-11-04-2023-15:12:35:576"/>
<MESSAGES>
<ERROR>In study, alias: "Example!_Human Microbiome Project SP56J". The object being added already exists in the submission account with accession: "ERP127584".</ERROR>
<INFO>VALIDATE action has been specified.</INFO>
<INFO>Submission has been rolled back.</INFO>
<INFO>This submission is a TEST submission and will be discarded within 24 hours</INFO>
</MESSAGES>
<ACTIONS>VALIDATE</ACTIONS>
<ACTIONS>PROTECT</ACTIONS>
</RECEIPT>
If the curl command returns no response at all, please double-check that your username and password were provided correctly.
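When validating many files, it can help to read the receipt programmatically instead of by eye. The following is a minimal Python sketch using the standard xml.etree.ElementTree module; the function name summarise_receipt is hypothetical, and the embedded receipt is a shortened version of the failed validation shown above, not real output.

```python
import xml.etree.ElementTree as ET


def summarise_receipt(receipt_xml):
    """Extract the success flag, error messages and info messages from a receipt."""
    root = ET.fromstring(receipt_xml)
    return {
        "success": root.get("success") == "true",
        "errors": [error.text for error in root.iter("ERROR")],
        "infos": [info.text for info in root.iter("INFO")],
    }


# Shortened receipt modelled on the failed validation example.
example = """<RECEIPT receiptDate="2023-04-11T15:12:35.609+01:00" success="false">
  <MESSAGES>
    <ERROR>The object being added already exists in the submission account.</ERROR>
    <INFO>VALIDATE action has been specified.</INFO>
  </MESSAGES>
  <ACTIONS>VALIDATE</ACTIONS>
</RECEIPT>"""

summary = summarise_receipt(example)
print(summary["success"], summary["errors"])
```

An empty response would make ET.fromstring raise a ParseError, which is consistent with the advice above to check your credentials when nothing is returned.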
Also notice the "ACTION=..." argument passed to the curl command. This specifies the action to take during the call to Webin, so we do not need a "Submission" XML just for a validation attempt. See more at submission actions without submission XML.
Furthermore, validation of multiple files or objects (e.g. sample, experiment, study…) can be done in a single command by adding more arguments (i.e. '-F'). For example:
curl -u <USERNAME>:<PASSWORD> -F "ACTION=VALIDATE" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml" -F "SAMPLE=@sample.xml" -F "DATASET=@dataset.xml"
As mentioned above, besides the "validate" action in the test environment, you can also validate your metadata by three other methods:
"Validate" in the production server. From our example above, you simply need to take the "dev" away from the URL.
curl -u <USERNAME>:<PASSWORD> -F "ACTION=VALIDATE" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml"
"Add" in the development server. From our example above, you would simply need to replace the action: from "validate" to "add". Whatever is submitted to this service will be discarded in 24h, so whether something gets submitted or not would not matter in the long run.
curl -u <USERNAME>:<PASSWORD> -F "ACTION=ADD" "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml"
"Add" in the production server. A combination of the previous two methods, which turns this attempt into a real submission. Take this route only when you are sure that your metadata is compliant and is what you want to submit.
curl -u <USERNAME>:<PASSWORD> -F "ACTION=ADD" "https://www.ebi.ac.uk/ena/submit/drop-box/submit/" -F "STUDY=@study.xml"
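The four routes differ only in the URL and the ACTION value, so the command can be assembled from two switches. The following is a minimal Python sketch that builds the curl argument list without running it; the function name build_curl_command and its parameters are hypothetical, and it deliberately accepts only the two actions discussed in this section.

```python
TEST_URL = "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"
PRODUCTION_URL = "https://www.ebi.ac.uk/ena/submit/drop-box/submit/"


def build_curl_command(username, password, files, action="VALIDATE", production=False):
    """Return the curl argument list for one of the four validation routes.

    files maps an object type (e.g. "STUDY") to the path of its XML file.
    """
    if action not in ("VALIDATE", "ADD"):
        raise ValueError("action must be VALIDATE or ADD")
    url = PRODUCTION_URL if production else TEST_URL
    command = ["curl", "-u", f"{username}:{password}", "-F", f"ACTION={action}", url]
    # One -F argument per object, as in the multi-file example above.
    for object_type, path in files.items():
        command += ["-F", f"{object_type}=@{path}"]
    return command


# Safest route: VALIDATE against the test service.
command = build_curl_command(
    "<USERNAME>", "<PASSWORD>", {"STUDY": "study.xml", "SAMPLE": "sample.xml"}
)
print(" ".join(command))
```

Keeping the defaults at VALIDATE and the test service means a real submission only happens when both action="ADD" and production=True are passed explicitly. The returned list can be handed to subprocess.run to execute the command.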
What happens after the submission of a dataset XML?
Once you have completed the registration of your dataset(s), please contact our Helpdesk Team to provide a release date for your study.
Please note that all datasets affiliated to unreleased studies are automatically placed on hold until the authorised submitter or DAC contact contacts the EGA Helpdesk for the study to be released.
We strongly advise you not to delete your data until EGA Helpdesk confirms that your data has been successfully archived.
Genetics_of_gene_expression_in_human_macrophage_response_to_Salmonella
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002236
Patient-derived lung cancer organoids
Patient-derived lung cancer organoids cram files : targeted seq 13 samples, whole exome seq 12 samples
mutation profiles of PDO and matched tissue : aggregated vcf 1 file
details : https://www.nature.com/articles/s41467-019-11867-6
Dataset
EGAD00001005317
Validation of Exome-sequencing of S7RE-iPSC lines
PCR products were obtained from each target locus using genomic DNA from human iPS cells. Subsequently, PCR products are pooled and subjected to Illumina library preparation. The library will be sequenced either by HiSeq or MiSeq.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Dataset
EGAD00001001449
HCA_Heart_Disease_BHF_DZHK_RNA_
Cell Atlas of the diseased human heart. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001004566
STRATAA_RNAseq
Transcriptome signatures of acute typhoid infection
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001003967
A_systems_biology_approach_to_understand_immunity_and_pathogenesis_of_malaria_in_children_exposed_to_endemic_Plasmodium_falciparum_transmission
From long-term studies of children in Kenya, we know that some children have numerous malaria episodes through to older age, while other children become immune more rapidly. Our hypothesis is that these children with frequent malaria episodes are caught in a vicious circle, whereby malaria episodes lead to impaired immunity to malaria, which in turn leads to further episodes of malaria. We will therefore investigate immune responses in children with different life histories of malaria episodes. We will also include a group of children living in similar environmental conditions but who are completely unexposed to malaria. Aims: Using RNAseq we will characterise the transcriptome of a group of children with frequent and febrile malaria or infrequent/asymptomatic. These data will be combined with cytokine profiling data (NIMR) and used to build predictive models (Exeter). The predictions will be validated against a similar second group. A single snap shot of immune responses may represent only the endpoint of many immune processes and may not reflect those that are causally related to differences in malaria outcome. Therefore, we will collect and store samples from a larger group of children who will then be followed up for over 3-4 years. This extended follow up will allow us to identify the same groups as described above, and then examine samples taken before the development of those outcomes. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002978
Anaplastic Thyroid Cancer aligned whole exome sequence data
Whole exome sequence data in fastq format was aligned to the GRCh38 reference genome. Aligned sequence was preprocessed with GATK for Indel Realignment and Base Quality Score Recalibration. Duplicates were marked with Picard Mark Duplicates. Aligned sequence is in bam format. Details of the alignment can be found in the bam header. Tumour samples were classified as Anaplastic Thyroid, Poorly-differentiated or well-differentiated cancers.
Dataset
EGAD00001005791
Yemen_and_Chad_Genotyping
HumanOmni2.5-8 data from Chad and Yemen. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001001231
RNAseq from 6 organotypic co-cultures (OC cells bulk/OCSC with or without TME)
The dataset contains samples of 6 organotypic co-cultures, assembled with patient-derived material from ovarian cancer (OC) patients. Tumor cells, both as bulk and as cancer stem cells-enriched (OCSC) populations, are cultured or not with in vitro peritoneal TME (for details see Battistini C et al, Tumor microenvironment-induced FOXM1 regulates ovarian cancer stemness, CDDis 2024).
The dataset is composed of paired-end fastq files from bulk RNA-Seq.
Dataset
EGAD50000000523
Hereditary_Cerebellar_Ataxias___Whole_Genome_Sequencing___2021
Looking to identify mutations in order to validate that lines we have classified as from Ataxia patients contain the disease relevant mutations. This will allow us to publish on the existence of these lines which are now commercially available. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001006234
Anaplastic Thyroid Cancer aligned sequence data
Sequence data in fastq format was aligned to the GRCh38 reference genome. Aligned sequence was preprocessed with GATK for Indel Realignment and Base Quality Score Recalibration. Duplicates were marked with Picard Mark Duplicates. Aligned sequence is in bam format. Details of the alignment can be found in the bam header. In total, data generated from 174 tumour samples and 102 matched blood normal controls was aligned. Tumour samples were classified as Anaplastic Thyroid, Poorly-differentiated or well-differentiated cancers.
Dataset
EGAD00001004126
PAGE: Global Reference Panel
Stanford contributed samples to the PAGE study that can act as a population reference dataset across the globe. Therefore this dataset includes reference individuals, without phenotypes, chosen to help infer ancestry that will help us understand the diverse samples available in PAGE. The complete dataset comprises individuals of European, African, Asian, Oceanian, and Native American descent, from a total of over 50 populations. A subset of these individuals from Puno, Peru and Easter Island (Rapa Nui), Chile, are included in the PAGE samples that were whole genome sequenced in 2015. Additional details are available in the Study Acknowledgments.
The Global Reference Panel comprises 6 sample sets:
A population sample of Andean individuals primarily of Quechuan/Aymaran ancestry from Puno, Peru
A population sample of Easter Island (Rapa Nui), Chile
Individuals of indigenous origin from Oaxaca, Mexico
Individuals of indigenous origin from Honduras
Individuals of indigenous origin from Colombia
Individuals of indigenous origin from the Nama and Khomani KhoeSan populations of the Northern Cape, South Africa
In addition, we genotyped publicly available samples that will be hosted on the Bustamante lab website (https://bustamantelab.stanford.edu/). These comprise large public datasets to provide an open reference dataset for the world:
The additional related individuals from the Americas in the Human Genome Diversity Panel (H952) plus all additional samples from the Americas
A subset of the unrelated individuals from the Maasai in Kinyawa, Kenya (MKK) dataset from the International Hapmap Project hosted at Coriell
Additional samples will be available for restricted use with a data access agreement with the Bustamante Lab. This study is part of the Population Architecture using Genomics and Epidemiology (PAGE) study (phs000356).
Study
phs001033
IBD_Whole_Genome_Sequencing
Whole genome sequences at 15X depth of patients with Inflammatory Bowel Disease.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002754
Microinjection_of_hIPSC_derived_intestinal_organoids_with_Salmonella_Typhimurium
To generate an RNA-Seq dataset for organoids apically stimulated with Salmonella Typhimurium. These data are part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001001253
Repeated_clinical_malaria_episodes_are_associated_with_modification_of_the_immune_system_in_children_
Repeated clinical malaria episodes are associated with modification of the immune system in children. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/.
Study
EGAS00001003167
Whole_Genome_Sequencing_of_INTERVAL
15x Whole Genome Sequencing of 15,000 individuals from the INTERVAL study cohort, phase II.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002461
Whole_Genome_Sequencing_of_INTERVAL
15x Whole Genome Sequencing of 15,000 individuals from the INTERVAL study cohort, phase III.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002787
Bottleneck_Sequencing_Of_Human_Tissue__Wgs_
Bottleneck sequencing of human tissue including neurons, cord blood, sperm
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001004066
Single cell RNAseq of stenotic, inflamed and non-inflamed transmural lesions from patients with Crohn's disease
This dataset contains a gene-cell matrix derived from single-cell RNA sequencing (scRNA-seq) data of ileal tissue from Crohn's disease (CD) patients and colorectal cancer (CRC) patients. It includes:
Crohn's Disease Patients: A trio of transmural lesions (stenotic, inflamed, and non-inflamed) from each patient.
Colorectal Cancer Patients: Unaffected ileal tissue used as external non-inflamed control.
Cell Level Metadata:
The dataset includes relevant cell-level metadata such as cell type annotations used in the study.
Experimental Details:
Platform: 10x Genomics Chromium Single Cell 3' GEX
Sequencing: Illumina NovaSeq
Processing: Data processed with Cell Ranger software. Resulting count matrices were merged for downstream analysis, including integration and dimensionality reduction.
Dataset Composition:
Crohn's Disease Patients: 10 patients with 3 samples each (non-inflamed, inflamed, stenotic), totaling 30 samples.
Colorectal Cancer Patients: 5 patients with 1 sample each of unaffected tissue, totaling 5 samples.
Data Provided:
Merged Raw Count Matrix: The final merged raw count matrix used for downstream analysis.
Cell Metadata File: Contains details of sample, tissue, and patient for each cell in the count matrix.
Barcodes File: Lists each cell barcode, which also encodes the sample, tissue, and patient details for each cell.
CD.S_Inf: Stenotic Crohn's disease inflamed samples
CD.S_Sten: Stenotic CD patient stenosis sample
CD.S_Prox: Stenotic CD Patient - proximal non-inflamed sample
CC.C_Prox: CRC Patient proximal unaffected sample
e.g. a barcode 'CC.C_1_Prox_AAGTCGTAGACCCTTA' indicates the unaffected proximal sample from CRC patient no. 1, and the nucleic acid sequence identifies a unique cell from this sample.
Total Samples:
Crohn's Disease (CD) Patients: 30 samples
Colorectal Cancer (CRC) Patients: 5 samples
Row Patient_no Sample Sample_type
1 CC.C_1 CC.C_1_Prox CC.C_Prox
2 CD.S_1 CD.S_1_Prox CD.S_Prox
3 CD.S_1 CD.S_1_Infl CD.S_Infl
4 CD.S_1 CD.S_1_Sten CD.S_Sten
5 CC.C_2 CC.C_2_Prox CC.C_Prox
6 CD.S_2 CD.S_2_Prox CD.S_Prox
7 CD.S_2 CD.S_2_Infl CD.S_Infl
8 CD.S_2 CD.S_2_Sten CD.S_Sten
9 CC.C_3 CC.C_3_Prox CC.C_Prox
10 CC.C_4 CC.C_4_Prox CC.C_Prox
11 CD.S_3 CD.S_3_Prox CD.S_Prox
12 CD.S_3 CD.S_3_Infl CD.S_Infl
13 CD.S_3 CD.S_3_Sten CD.S_Sten
14 CD.S_4 CD.S_4_Prox CD.S_Prox
15 CD.S_4 CD.S_4_Infl CD.S_Infl
16 CD.S_4 CD.S_4_Sten CD.S_Sten
17 CC.C_5 CC.C_5_Prox CC.C_Prox
18 CD.S_5 CD.S_5_Prox CD.S_Prox
19 CD.S_5 CD.S_5_Infl CD.S_Infl
20 CD.S_5 CD.S_5_Sten CD.S_Sten
21 CD.S_6 CD.S_6_Prox CD.S_Prox
22 CD.S_6 CD.S_6_Infl CD.S_Infl
23 CD.S_6 CD.S_6_Sten CD.S_Sten
24 CD.S_7 CD.S_7_Prox CD.S_Prox
25 CD.S_7 CD.S_7_Infl CD.S_Infl
26 CD.S_7 CD.S_7_Sten CD.S_Sten
27 CD.S_8 CD.S_8_Prox CD.S_Prox
28 CD.S_8 CD.S_8_Infl CD.S_Infl
29 CD.S_8 CD.S_8_Sten CD.S_Sten
30 CD.S_9 CD.S_9_Prox CD.S_Prox
31 CD.S_9 CD.S_9_Infl CD.S_Infl
32 CD.S_9 CD.S_9_Sten CD.S_Sten
33 CD.S_10 CD.S_10_Prox CD.S_Prox
34 CD.S_10 CD.S_10_Infl CD.S_Infl
35 CD.S_10 CD.S_10_Sten CD.S_Sten
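The barcode convention described above can be decoded with a simple split. The following is a minimal Python sketch; the function name parse_barcode and the returned field names are hypothetical, and it assumes every barcode has exactly four underscore-separated parts as in the example 'CC.C_1_Prox_AAGTCGTAGACCCTTA'.

```python
def parse_barcode(barcode):
    """Split a barcode such as 'CC.C_1_Prox_AAGTCGTAGACCCTTA' into its parts."""
    # Assumption: group, patient index, tissue and cell sequence, in that order.
    group, patient_index, tissue, sequence = barcode.split("_")
    return {
        "patient": f"{group}_{patient_index}",
        "sample": f"{group}_{patient_index}_{tissue}",
        "sample_type": f"{group}_{tissue}",
        "cell_sequence": sequence,
    }


parsed = parse_barcode("CC.C_1_Prox_AAGTCGTAGACCCTTA")
print(parsed)
```

The patient, sample and sample_type values correspond to the Patient_no, Sample and Sample_type columns of the table above. A barcode with a different number of parts raises a ValueError rather than being silently misread.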
Dataset
EGAD50000000559
CRISPR_single_cell_activation
Single cell CRISPR activation analysis with 96 genes with the aim to build a quantitative CRISPR activation model. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001005528
Growth Hormone (GH) -secreting Pituitary Adenoma
Data from 12 fresh-frozen somatotropinomas and their corresponding blood samples. Details are given in Valimaki et al. Whole-genome sequencing of Growth Hormone (GH) -secreting Pituitary Adenoma. Provisionally accepted, 2015.
Study
EGAS00001001293
Validation of SNVs found by Exome-seq in S2-SF1, -SF5 and -SF9 hiPSCs
PCR products were obtained from each target locus using genomic DNA from human iPS cells. Subsequently, PCR products are pooled and subjected to Illumina library preparation. The library will be sequenced either by HiSeq or MiSeq.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Dataset
EGAD00001000631
Targeted_Sequencing_of_Human_Myeloid_Malignancies
This study involves targeted sequencing of samples from myeloid malignancies at different timepoints to assess clonal evolution of malignancy. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001001289
Orphan_Tumour_Study_NB
The aim of this study is to investigate the transcriptional landscape of human cancer.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001003445
Whole-genome sequencing of Tibetans from China
Whole-genome sequencing of 27 Tibetan individuals from China using the Illumina-B HiSeq X platform. These data are part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/.
Study
EGAS00001003500
Whole-genome sequencing of Himalayan populations
Whole-genome sequencing of 60 individuals from 15 Himalayan populations using the Illumina-B HiSeq X platform. These data are part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/.
Study
EGAS00001007269
Post_Mortem_Tissue_COVID19_RNA
Single cell analysis of post mortem tissue samples from SARS-CoV2-infected patients
We aim to directly identify the human cell types infected by SARS-CoV-2 and measure the cellular response to COVID19 infection across 20 different tissues from infected patient autopsies. We have examined the expression pattern of viral entry receptors across healthy human tissues to predict several candidate target cell types across the airway and heart. In addition, high prevalence of cardiac failure and abnormal renal function in COVID19 patients implicates heart and kidney involvement, but the pathogenesis of organ specific damage - whether via a direct cytopathic mechanism or an indirect inflammatory response - remains unknown. Currently, we lack confirmation of target cell types and cellular processes in infected tissues as autopsies are discouraged in most countries due to health and safety risks. Our collaborators Drs Michael Osborn and Brian Hanley (Imperial College) have outlined guidelines to perform post-mortem in COVID19 patients (Hanley et al., 2020) and have established a programme of autopsies for research to be performed in a high-risk facility at Westminster Public Mortuary. Here, we propose to identify infected cell types and aberrant molecular pathologies in this precious tissue resource using single cell and spatial genomics. We will prioritise three organ systems: the human airway, heart and the kidney. We will directly examine the cellular identities of SARS-CoV-2 infected cell types and identify the cellular responses to infection across these organs. This fundamental knowledge will help guide future treatment choices for COVID19.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001004442
Post_Mortem_Tissue_COVID19_spatial
Single cell analysis of post mortem tissue samples from SARS-CoV2-infected patients
We aim to directly identify the human cell types infected by SARS-CoV-2 and measure the cellular response to COVID19 infection across 20 different tissues from infected patient autopsies. We have examined the expression pattern of viral entry receptors across healthy human tissues to predict several candidate target cell types across the airway and heart. In addition, high prevalence of cardiac failure and abnormal renal function in COVID19 patients implicates heart and kidney involvement, but the pathogenesis of organ specific damage - whether via a direct cytopathic mechanism or an indirect inflammatory response - remains unknown. Currently, we lack confirmation of target cell types and cellular processes in infected tissues as autopsies are discouraged in most countries due to health and safety risks. Our collaborators Drs Michael Osborn and Brian Hanley (Imperial College) have outlined guidelines to perform post-mortem in COVID19 patients (Hanley et al., 2020) and have established a programme of autopsies for research to be performed in a high-risk facility at Westminster Public Mortuary. Here, we propose to identify infected cell types and aberrant molecular pathologies in this precious tissue resource using single cell and spatial genomics. We will prioritise three organ systems: the human airway, heart and the kidney. We will directly examine the cellular identities of SARS-CoV-2 infected cell types and identify the cellular responses to infection across these organs. This fundamental knowledge will help guide future treatment choices for COVID19.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001004441
Ribosome_Profiling_of_Macrophages_during_Salmonella_Infection
The aim of this study is to assess translational changes in macrophages over a time course of Salmonella infection.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001001285
Longitudinal cfDNA methylome and fragmentome profiles in health
This dataset contains 432 plasma cfDNA samples from healthy individuals, comprising a diurnal cohort of 16 participants and a cross-sectional cohort of 144 participants. Samples underwent capture-based target enrichment using a custom probe set covering 4,991 genomic regions, followed by targeted enzymatic methyl-sequencing on Illumina instruments in 150 bp paired-end mode. The dataset consists of raw FASTQ files generated from the sequencing runs, accompanied by a metadata file containing individual demographic details and sampling information.
Dataset
EGAD50000001721
Human_Colorectal_Cancer_Exome_Sequencing
In this experiment we sequenced a collection of genes identified as being mutated in a Sleeping Beauty Screen in a mouse colorectal Cancer Model in human tumours collected from patients with germline mutations in APC and also other familial CRC predisposition syndromes. We also sequenced the germline of these patients allowing us to identify somatically mutated genes. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000077
GATCI whole genome sequence data
Sequence data in fastq format was aligned to the GRCh38 reference genome with BWA-MEM. Aligned sequence was preprocessed with GATK for Indel Realignment and Base Quality Score Recalibration. Duplicates were marked with Picard Mark Duplicates. Aligned sequence is in bam format. Details of the alignment can be found in the bam header
Dataset
EGAD00001005914
Highlighted samples from the BCH CRDC
These exomes are from some of the patients and family members detailed in our forthcoming publication, Children’s Rare Disease Cohorts: an integrative research and clinical genomics initiative. While this paper details an institutional collection of exomes, the samples here were specifically highlighted as having significant findings.
Study
EGAS00001004436
TTV018_RORC_IBD_associated_genotype_effects_on_RORgT_expression_and_function_in_ex_vivo_T_cells
RNA sequencing of peripheral immune cells from patients +/- an IBD risk variant. Peripheral immune cells +/- in vitro test compound treatment. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001001590
Y_phylogeny_haplogroupDE
High-coverage whole genome sequences using HiSeq X for 4 individuals to investigate their Y chromosomes' relationship to the known phylogeny.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002674
SCAT_Osteosarcoma_Validation
This experiment is to validate putative somatic substitutions and indels identified in an exome screen of ~50 osteosarcoma tumour/normal pairs. It is the first stage in our ICGC commitment to study osteosarcoma. The validation process is an important component of our analysis to clarify the data prior to looking for evidence of new cancer genes, or subverted pathways important in the development of cancer. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Dataset
EGAD00001000280
Chracterising_cellur_pathways_underlying_CD3_CD28_activation_of_human_CD4__cells
We devised an approach to disentangle the TCR and CD28 pathways upon stimulation in naive and memory primary human CD4+ T cells (Tcons) in response to defined stimulatory signals. Sorted Tcons were activated using a titration of anti-CD3 and anti-CD28 in combination as well as individually. As a control we cultured cells in the same conditions but without the stimuli. In total, we defined seven conditions from four individuals for sequencing.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001004147
Chracterising_cellur_pathways_underlying_CD3_CD28_activation_of_human_CD4__cells
We devised an approach to disentangle the TCR and CD28 pathways upon stimulation in naive and memory primary human CD4+ T cells (Tcons) in response to defined stimulatory signals. Sorted Tcons were activated using a titration of anti-CD3 and anti-CD28 in combination as well as individually. As a control we cultured cells in the same conditions but without the stimuli. In total, we defined seven conditions from four individuals for sequencing.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002438
De_novo_mutations_in_cell_free_foetal_DNA__cffDNA_
This project is to develop and validate a method to detect de novo mutations in a foetal genome through deep sequencing of cell-free DNA from the plasma of pregnant women. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000322
Tracking early lung cancer metastatic dissemination in TRACERx using ctDNA
These data relate to Abbosh et al 2023 'Tracking early lung cancer metastatic dissemination in TRACERx using ctDNA'. Some patients were included in this work that were excluded from the other TRACERx 421 papers (Frankell et al and Al-bakir et al), often due to exclusion criteria in the TRACERx clinical trial. Reasons include partial contamination with C>A artefact mutational signature (CRUK0480, 0490), which were included for ctDNA detection analyses using high confidence clonal mutations but not clonality analyses, incomplete resection at primary surgery (CRUK0291, 0234, 0230, 0387, 0622) and a concurrent oesophageal second primary cancer (CRUK0498). The primary tissue exome sequencing data for these patients that were used in this study are included in this repository. Deep targeted sequencing of (median 200) tumour specific mutations in plasma for 1069 samples from 198 TRACERx patients is also included. Additional details related to these patients and samples can be found in the supplementary or data repositories from Frankell et al and Abbosh et al.
Study
EGAS00001006923
GATCI whole exome germline variants
Exome sequences were aligned to the GRCH38 reference genome. Aligned sequence was analyzed with GATK Haplotype Caller to generate germline variant calls. Variant calls are in VCF format. Details for the call can be found in the VCF header
Dataset
EGAD00001005916
GATCI whole exome somatic variants (MuTect)
Exome sequences were aligned to the GRCH38 reference genome. Aligned sequence was analyzed with GATK/MuTect, to generate somatic variant calls. Somatic variant calls are in VCF format. Details for the mutect call can be found in the vcf header.
Dataset
EGAD00001005917
GATCI whole exome somatic variants (SomaticSniper)
Exome sequences were aligned to the GRCH38 reference genome. Aligned sequence was analyzed with GATK/SomaticSniper, to generate somatic variant calls. Somatic variant calls are in VCF format. Details for the mutect call can be found in the vcf header.
Dataset
EGAD00001005918
Investigating_the_impact_of_MBD4_on_the_mutability_of_the_germline
We will be testing the hypothesis that MBD4 PTV germline carriers also show an increased number of C to T germline mutations in their offspring.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002861
IL_10_signalling_and_macrophage_gene_expression
Study to stimulate WT and IL-10RB mutant macrophages with LPS in presence or absence of recombinant IL-10 and compare their gene expression profiles by RNASeq. These data are part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001001283
CAR_T_cell_Study
CAR-T cell Study
The aim of this study is to understand single cell transcription and chromatin accessibility in CAR-T cells.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001004718
Gene_Characterization_in_Carbohydrate_metabolic_alterations__neonatel_diabetes___congenital_hyperinsulinemic__in_early_childhood
Whole Exome Sequencing of trios (proband + parents) or probands only with Neonatal Diabetes Mellitus (NDM) or Congenital Hyperinsulinism of Infancy (CHI) of unknown genetic origin.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002074
PROGRESS/ELEMENT DNA Methylation Study
An extension to the Early Life Exposures in Mexico to Environmental Toxicants (ELEMENT) birth cohort of Mexico City, the Programming Research in Obesity, GRowth, Environment and Social Stress (PROGRESS) Cohort is an ongoing longitudinal pre-birth cohort, established in 2006 in Mexico City, partnering Icahn School of Medicine at Mount Sinai with Harvard University and the National Institute of Public Health in Mexico, which was designed to study the effects of prenatal exposure to toxic metals, air pollution, phthalates, and stress on childhood development. Pregnant women of 18 years of age and older, pregnant for less than 20 weeks of gestation, had no documentation of heart or kidney disease, no use of steroids or anti-epilepsy drugs, no daily alcohol consumption, had telephone access, and planned to live in Mexico city for the following 3 years, and receiving care through the Mexican Social Security System were initially enrolled (n=1,054). In addition to clinical, demographic and exposure data collected, cord blood was collected to interrogate DNA methylation across the genome for over 300 mother-child dyads. Clinical assessments and exposures were captured during several life stages, including prenatal, infant (0-1 year), youth (1-18 years), and adulthood (mother). The PROGRESS cohort added well-documented phenotyping of children for obesity, metabolic dysfunction, respiratory outcomes, and cardiovascular outcomes, as well as measures of air pollutant, personal care/consumer product, non-chemical stress, and metal mixture exposures. No clinical trials were conducted in this cohort. The data collected in this study should provide a unique resource to investigate DNA methylation as it relates to several environmental exposures and adverse cardiometabolic and neurocognitive health in mothers and children from a prospective birthing cohort. For access to demographic, clinical, and exposure data please directly contact study principal investigators.
Study
phs002754
Transcriptomes_of_human_lymphocytes
In this study we will compare the single cell transcriptomes of immune cells from asthmatics responding to steroids, those refractory to steroids, and a control population. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001001755
WTCCC2 Reading and Mathematics (RM) samples
A WTCCC2 project genome-wide association study for reading and mathematics ability in 3665 12-year-old individuals from the UK, genotyped on the Affymetrix 6.0 array. Details of the WTCCC2 analysis can be found in Davis et al. [Nat. Commun. 2014 July;5:4204]
Study
EGAS00001000886
Paediatric_and_adult_nasal_RNAseq___COVID19
Understanding infectivity, progression and disease severity of COVID-19 in children.
Understanding infectivity, progression and disease severity of COVID-19 in children: It has been observed that COVID-19 infection appears to present with reduced clinical severity and prevalence within children (<18 years) compared to adults. This project as part of a national collaborative initiative seeks to answer this question.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001004391
Genetic_screening__of_GPI_anchor_protein_synthesis__
This study will analyse the guide sequences which were used for making mutations in the Cas9-expressing cells. We used the GeCKO v2 library, which was released by Feng Zhang, 2014. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001001256
Genome_Diversity_in_Africa_Project___GemCode_libraries_
High depth whole genome sequencing from GemCode (10x Genomics) DNA libraries containing long range linkage information for one Baganda trio and one Baganda child (parent already sequenced at high depth).
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001001828
Mapping_genetic_variants_underlying_gene_regulation_in_inflammed_intestinal_cell_types_to_identify_novel_IBD_drug_targets
Biopsies from the terminal ileum and rectum of individuals with Crohn's disease are digested on ice to single cells and processed for single-cell RNA-sequencing (10X Genomics and Illumina)
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001003770
ATAC_SEQ_MAIN___PHASE_1
This is a study to test ATAC-seq protocols. CD4+ and CD8+ cells have been obtained from three different anatomical compartments. We aim to assay open-chromatin regions across these cells and perform comparative analyses. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000947
TARGET-seq+ single-cell genotyping
Single-cell genotyping data for bone marrow samples from 9 cases with clonal hematopoiesis and 1 control sample. The TARGET-seq+ protocol was used to generate plate-based 3' transcriptome data. For details on cell sorting and the TARGET-seq+ protocol see the methods section of the manuscript. One FASTQ file is provided per cell. Cells are named with their plate and well IDs and the subject ID. Empty wells (no-cell controls) are named "blank". Corresponding transcriptome files use the same naming with the "_transcriptome" suffix.
Dataset
EGAD00001011150
The_genetics_of_thinness_compared_to_obesity
The variation in weight within a shared environment is largely attributable to genetic factors. Whilst many genes/loci confer susceptibility to obesity, little is known about the genetic architecture of thinness. In this study we performed a genome-wide association study of 1,622 persistently thin healthy individuals (STILTS), 1,985 severe childhood onset obesity cases (SCOOP) and 10,433 population based individuals (UKHLS) used as a common set of controls. All participants were genotyped on the Illumina Core Exome array, including 551,839 markers and imputed to the combined UK10K and 1000G (phase3) reference panel. We contrast the genetic architecture of thinness with that of severe early onset obesity and explore whether the genetic loci influencing thinness are the same as those influencing obesity or whether there are important genetic differences between them.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing
Study
EGAS00001002624
Exploring_the_heterogeneity_of_sarcoma_using_single_cell_sequencing_
Multi region samples are collected from patients, with consent, immediately after resection of the tumour. Samples are digested and sorted using FACS as single cells into lysis buffer. Cells are then stored until further processing for G&T-seq. After sequencing, we will explore intra-tumour heterogeneity using computational approaches to integrate RNA and DNA data onto the tumour phylogeny
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002866
Exome_sequencing_of_a_cohort_of_Rett_syndromelike_patients
The aim of the project is the definition of the molecular defect in a cohort of Rett-like patients negative for mutations in known disease genes. To this aim, a number of unrelated trios (patients plus parents) will be analysed by exome sequencing.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002059
BCR_repertoire_sequencing
Isotype-specific B-cell receptor repertoire sequencing in six immune-mediated diseases at diagnosis and during therapy reveals an unexpectedly complex B-cell architecture, which may provide a platform for a better understanding of pathological mechanisms and treatment responses.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001003185
How to upload Crypt4GH files
Uploading files
Users who hold an ega-box-XXX account can upload files using either INBOX or FTP. Users who have a Submitter role associated with their email will only be able to upload files using INBOX.
Before uploading your files, please make sure that any files that will be uploaded to EGA do not use special characters in their naming convention, such as # ? ( ) [ ] / \ = + < > : ; " ' , * ^ | &. This can cause issues with the archiving process, leading to problems for end users.
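As an illustration of the naming rule above, the following shell sketch checks a file name before upload. The function name is_safe_name is hypothetical, and the whitelist it applies (letters, digits, underscore, dot and hyphen) is an assumption that is stricter than the list of characters above; it is not an official EGA check.

```shell
# Hypothetical helper, not part of any EGA tool.
# Returns 0 when the file name (without its directory) contains only
# letters, digits, underscore, dot or hyphen; returns 1 otherwise.
is_safe_name() {
  name=${1##*/}                        # strip any leading directory path
  case "$name" in
    "") return 1 ;;                    # empty name
    *[!A-Za-z0-9._-]*) return 1 ;;     # a character outside the whitelist
    *) return 0 ;;
  esac
}

# Example: report every file in the current directory that would need renaming.
for f in ./*; do
  [ -e "$f" ] || continue
  is_safe_name "$f" || printf 'rename before upload: %s\n' "$f"
done
```

A whitelist is used here rather than a blacklist because it also catches whitespace and any special character not named in the list.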
The EGA is a shared, public service with limited storage. To manage the available resources, we enforce a limit of 10TB per submission account at any one time. If you exceed this limit, a “permission denied” message will be displayed. This will prevent you from uploading more files, but you will still be able to connect to your inbox.
For submissions larger than 10TB, please perform uploads in 10TB batches: register all the metadata and then finalise the submission. Upload the next batch of files and repeat the same metadata registration and finalisation process until you have completed the file upload. Further information can be found in the SP documentation.
INBOX
FTP
The INBOX is only compatible with files encrypted using the Crypt4gh tool
Before uploading
If you are not a registered EGA user, you will first need an EGA user account.
Please note that it may take a few days for your account to be activated, as it needs to be vouched for by the EGA Helpdesk. Once your account is validated, you will be able to request a submitter role.
[Optional] Meanwhile, you can create and add your public key to your EGA account profile. This option is not available for old submission accounts (e.g., ega-box-NNN).
As soon as you have been granted a submitter role, you will be able to connect with your username and password to the EGA inbox using the SFTP protocol. If you have also registered a public key in your profile, you can also connect using this key.
To upload files to your account, you can use the graphical user interface (GUI) or the command line.
Graphical User Interface (GUI)
We recommend using FileZilla, a free, open-source FTP client. However, you can use any other GUI that allows connecting over the SFTP protocol.
For FileZilla as your GUI, follow these steps to upload files:
Create a new connection in Site Manager (File > Site Manager) and select the following options (Figure 1):
Protocol: SFTP - SSH File Transfer Protocol
Host: __EGA_INBOX_DOMAIN__
Logon Type: Key file
User: your EGA username
Key file: Path/to/your/private_key
Figure 1: Process of establishing a new connection to __EGA_INBOX_DOMAIN__ using a key file as the logon method in FileZilla. The figure showcases the FileZilla version 3.52.2 operating on IOS v11.2.3. By following the depicted steps, users can create a secure and efficient connection to the inbox, ensuring seamless data transfers.
Click Connect, and you will log in remotely to your home directory. You can think of this folder as storage "in the EGA cloud" to which you will add your files for the EGA. The uploading area has three folders:
To-encrypt: Files uploaded to this folder will be encrypted automatically on the fly.
Encrypted: Files uploaded to this folder must already be encrypted with Crypt4gh. Upload your files here if your connection is unstable or you have problems completing the upload in to-encrypt.
Etc: This folder contains two files that allow the server to show you your username and group instead of some internal numbers. Please do not upload files here; otherwise, you will obtain a permission denied error.
Find the files you want to upload by browsing your local storage (left side of your screen in FileZilla). Select all the files you want to upload, then right-click on them and select Upload (Figure 2).
Figure 2: Step-by-step process of manually uploading files to __EGA_INBOX_DOMAIN__ using FileZilla, with FileZilla version 3.52.2 operating on IOS v11.2.3. The figure demonstrates how users can transfer data from their local storage to the "EGA cloud" by following the depicted steps
Please note that regardless of which folder you upload your files in, both folders (to-encrypt, encrypted) will point to the same path (/) (Figure 3). Therefore, you will see your files in both folders.
Figure 3: Both folders, to-encrypt and encrypted, point to the same path (/)
If your connection is unstable, please encrypt your files first using Crypt4gh. Then upload them to the ‘encrypted’ folder.
The example above shows how to connect to __EGA_INBOX_DOMAIN__ using the private key. However, if you prefer to log in using your credentials, you can do so. Please go to the Frequently Asked Questions (FAQs) for more information.
SFTP command line
To upload files securely to your private area of the EGA, you can use SFTP (Secure File Transfer Protocol) with your favorite FTP client. Here's what you need to know to get started:
Connect to the target host __EGA_INBOX_DOMAIN__. This is the new hostname for the EGA SFTP service. Log in with your EGA username and key files (or password). Upload files to your private EGA inbox to ensure that only you can access the files.
By following these steps, you can securely upload your files to the EGA for safe storage and sharing.
Using the SFTP command line client in Linux/Unix
Open a terminal and type sftp username@hostname
Enter your EGA password
To see a list of available SFTP commands, type help
sftp> put – Upload file
sftp> get – Download file
sftp> cd path – Change remote directory to ‘path’
sftp> pwd – Display remote working directory
sftp> lcd path – Change the local directory to ‘path’
sftp> lpwd – Display local working directory
sftp> ls – Display the contents of the remote working directory
sftp> lls – Display the contents of the local working directory
Type the "put" command to upload files. For example: put *.bam
Use the bye command to close the connection (SFTP session).
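The interactive steps above can also be scripted with the batch mode of the OpenSSH sftp client (sftp -b). The sketch below only builds the batch file and does not connect anywhere. The function name make_batch, the file names, and the choice of the encrypted folder are assumptions for illustration. Batch mode cannot prompt for a password, so it assumes non-interactive authentication, for example a key loaded in ssh-agent.

```shell
# Hypothetical helper: write an sftp batch file that uploads the given files.
# Usage: make_batch BATCH_FILE FILE...
make_batch() {
  batch_file=$1
  shift
  # Assumption: files are already encrypted with Crypt4gh, so they go to "encrypted".
  printf 'cd encrypted\n' > "$batch_file"
  for f in "$@"; do
    printf 'put "%s"\n' "$f" >> "$batch_file"   # one put command per file
  done
  printf 'bye\n' >> "$batch_file"               # close the SFTP session
}

# Example with placeholder file names:
make_batch upload.batch sample_01.bam.c4gh sample_02.bam.c4gh

# The batch file would then be run with (not executed here):
#   sftp -b upload.batch username@hostname
```

Keeping the batch file separate from the sftp call makes it possible to inspect the list of commands before any transfer starts.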
After uploading
- Once you have uploaded files to the inbox, please bear in mind that the checksum needs to be calculated, which can take up to two days. You will only be able to link your files to a run/analysis once the encrypted checksum has been calculated.
- When linking your files to the 'Run' or 'Analysis', ensure that the file name matches the file path '/name' in the INBOX folder.
- Please delete the files from your SFTP INBOX after all the runs/analyses have been registered and files are ingested (SP > Files > Files ingested). This will clear your inbox space and allow you to upload more files. This will also prevent the files from reappearing in your Submitter Portal inbox.
Frequently Asked Questions
Specific to the inbox
What username should I use to log in to my inbox?
The authentication process for logging in to the EGA website, as well as accessing your inbox and outbox, requires the use of your username.
If you have forgotten your registered username, please contact our Helpdesk team for assistance.
How are checksums calculated in your inbox?
If you encrypt the file beforehand and upload it to the "encrypted" folder, the unencrypted checksum will not be calculated until the file is ingested (i.e., until it is used in a run/analysis). If the file is uploaded to the "to-encrypt" folder, then both checksums are calculated.
Please bear in mind that after files have been uploaded to the inbox, the checksum must be calculated, which can take from a few hours to two days.
Specific to using keys to authenticate
Can I access one EGA account from different devices?
Yes, you can access your account from different devices by linking several public keys to your EGA account. Each device can generate a unique public-private key pair, and the corresponding public keys can be linked to the same account. This way, you can use different public keys on different devices and still have access to the same account and data.
I have several keys and I don't remember which one is which
When generating SSH keys, it's a good practice to add a comment using the -C flag. This will allow you to add a descriptive tag to your key, making it easier to identify later on. Here's an example command that generates an SSH key with a comment:
ssh-keygen -t ed25519 -C work-pass
In this example, we're generating an ed25519 SSH key with the comment work-pass. Once you have multiple keys with different comments, you can use the comments to easily identify each key.
To view the comments for your existing SSH keys, you can use the following command:
ssh-keygen -l -f /path/to/key
This will display the key fingerprint and the associated comment. By checking the comments, you should be able to identify which key is which.
What if I can't find my SSH keys for uploading files with a key file, and how can I use new keys?
If you can't find your SSH keys, don't worry - you can make new ones. To do this, open your terminal or command prompt and type a command to make a new SSH key. You can pick a name for the key, and choose a password to keep it safe. After making the key, you can add the new key to your account or server where you want to upload files using the key file. This usually involves copying and pasting the key's "public" (e.g. file.pub) part to the right place. If you lose track of the key again, just make a new one and add it again. Keep in mind that SSH keys belong to you and your computer, so if you switch computers or accounts, you'll need to make new keys.
I don't want to type the passphrase every time I use the key. What can I do?
You can use an ssh-agent to avoid typing the passphrase every time you use the key. An ssh-agent is a program that stores your private keys in memory and provides them to ssh when needed. You can add your key to the ssh-agent using the command ssh-add followed by the path to your key file.
Here's an example of the steps to follow:
Open a terminal window.
Start the ssh-agent by typing the command eval $(ssh-agent).
Add your key to the ssh-agent by typing the command ssh-add [key filepath].
For instance, if your key file is located in the home directory with the name mykey, the command will look like this:
ssh-add ~/mykey
After adding your key to the ssh-agent, you should be able to use ssh without having to enter your passphrase every time.
Can I use my password for authentication (without my private key)?
If you prefer to use your username and password for authentication instead of your private key, you can still do so. When using a Graphical User Interface (GUI) such as FileZilla, you can select Ask for password as your Logon Type (Figure 3). This option will prompt you to enter your password when you click Connect, instead of using your private key.
Figure 3: Process of establishing a new connection to __EGA_INBOX_DOMAIN__ using your password as the logon method in FileZilla. The figure showcases the FileZilla version 3.52.2 operating on IOS v11.2.3. By following the depicted steps, users can create a secure and efficient connection to the inbox, ensuring seamless data transfers.
It's worth noting that using a password for authentication can be less secure than using an SSH key, as passwords can be more easily compromised through various means. However, if you choose to use your password for authentication, selecting "Ask for password" as your Logon Type is a good way to do so securely via a GUI.
Why is it better to use my key and not my password?
Using SSH keys for authentication is generally considered to be more secure and convenient than using passwords. SSH keys are more difficult to crack than passwords, and they can be restricted to specific users and machines, giving you more control over access. Once you set up your SSH keys, you can use them to authenticate quickly and easily, without having to enter a password every time. This makes automation of tasks, such as uploading encrypted files, much simpler. Additionally, SSH keys provide better logging, allowing you to keep track of who is accessing your systems and when. All in all, using SSH keys is a good practice for improving security and convenience in your authentication process.
Documentation
submission/data/uploading-files/inbox
SudanMitoSeq: Sudanese mitochondrial sequencing
In various contexts, mitochondrial function or dysfunction can be linked to mitochondrial genome variations. The use of mitochondrial genetics thus promises personalized diagnostics and treatments. In order to devise specific precision medicine approaches based on mitochondrial genetic variation and to test them within clinical studies, control data is indispensable. Such control data needs to comprehensively cover genetic variation commonly observed and thus expected. In the context of this study we assessed whether current, world-wide mitochondrial data sufficiently represents the region of North and East Africa, that is, whether current reference data is ready for precision medicine in this region. Towards this, we sequenced mitochondrial genomes of 159 Sudanese individuals provided as part of this EGA study and analyzed them together with various other, publicly available data concerning mitochondrial variation. For details, please refer to the publication.
Study
EGAS00001005669
Role_of_Epigenetic_Memory_in_Human_Induced_Pluripotent_Stem_Cells_Pilot
Fibroblasts have been shown to re-program into human induced pluripotent stem (hiPS) cells through over-expression of pluripotency genes. These hiPS cells show similar characteristics to embryonic stem cells, including cell surface markers, epigenetic changes and the ability to differentiate into the three germ layers. However, the extent of the changes in gene expression through the re-programming process is unclear. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000742
Anaplastic Thyroid Cancer somatic variants (MuTect)
Sequence was aligned to the GRCh38 reference genome. Aligned sequence was analyzed with GATK/MuTect to generate somatic variant calls across the SureSelect All Exon V5+UTR target region. Somatic variant calls are in VCF format. In total there are 166 tumour samples, 94 of which have a matched normal. Somatic variants for tumours without a matched normal were called against a panel of normals. Details of the MuTect call can be found in the VCF header.
Dataset
EGAD00001004129
IBD_Whole_Genome_Sequencing
We will sequence at 15X coverage the genomes of 1536 IBD patients. These samples are currently onsite at Sanger and made available for sequencing via our collaboration with the UK IBD Genetics consortium.
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001002238
RNA_expression_profiling_of_melanoma_patient_derived_xenograft
Patient-derived xenografts (n=96) were derived from metastatic melanoma patients. RNA expression profiling will be performed to study (1) HLA typing and (2) the effect of the tumour microenvironment on tumour growth. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001001537
Mutation_analysis_in_human_iPS_cells_
PCR products were obtained from each target locus using genomic DNA from human iPS cells. Subsequently, PCR products are pooled and subjected to Illumina library preparation. The library will be sequenced by MiSeq. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000359
PyEGA3 download client
PyEGA3
Checks prior to download - Is the dataset available for download?
Most datasets are available using the PyEGA3 download client, except for specific legacy datasets. For these, refer to Live Outbox distribution, an alternative way to download them. If the dataset you've been granted access to begins with the ID EGAD5XXXXXXX, please refer to the Live Outbox distribution.
If the dataset you have been granted access to contains encrypted data (filename ending in .gpg), contact EGA Helpdesk for detailed information on how to download these files.
General Information
The PyEGA3 download client is available at its GitHub repository. In its README you will find a step by step guide on how to use it to download files you have been granted access to.
PyEGA3 implements the GA4GH-compliant HTSGET protocol supporting requests over genomic ranges. This enables the download of specific regions of interest rather than the entire file.
HTSGET is not possible for all the files held by EGA: if the dataset does not contain index files (e.g. filename ending in *.crai), then genomic range requests cannot be performed.
The maximum number of connections allowed is 30. If more than 30 connections are used, the download client will encounter 500 errors. The recommended number of connections for maximum output is one.
Check out our video tutorial!
Frequently Asked Questions
General questions
How do I download the datasets to which I have been granted access?
After setting up your EGA download account, you can proceed to download using the EGA download client – PyEGA3. The pyEGA3 download client is a python-based tool for viewing and downloading files from authorised EGA datasets. This download client is continuously being developed for more user-friendly download experiences.
Can I download the dataset’s metadata via the download client?
No, the download client cannot be used to download metadata. Registered EGA users can download metadata of an authorised dataset by logging into the EGA webpage and navigating to the dataset of choice. Approximately two thirds down the page you will find the option to download the metadata as a zip file.
How do I download large datasets as quickly as possible?
The main EGA download client is the pyEGA3 download client, a python-based tool for viewing and downloading files from authorised EGA datasets. We could exceptionally consider the option of an Aspera download for users experiencing very slow download rates or facing particular issues. Contact the EGA Helpdesk for further information.
Pre-requirement and Installation
What are the major requirements for the PyEGA3 download client installation and usage?
The PyEGA3 client is compatible with any OS with Python 3.6+ installed. The client requires a connection to the internet, sufficient space on the destination drive, and the EGA download account credentials.
Where can I find the PyEGA3 download client installation and download instructions?
We strongly recommend checking our video tutorial demonstrating the usage of PyEGA3 from installation to file download. More detailed information is available at PyEGA3 GitHub.
What ports are required for the PyEGA3 download client to function?
PyEGA3 makes HTTPS calls to the EGA Data API (https://ega.ebi.ac.uk:8443). Port 8443 must be reachable from the location where PyEGA3 is executed; otherwise requests will time out.
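A quick way to confirm that the port is reachable from your machine is to attempt a plain TCP connection. The following is a minimal sketch using only the Python standard library; the function name port_is_reachable and the 5 second timeout are illustrative choices and are not part of PyEGA3.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

Calling port_is_reachable("ega.ebi.ac.uk", 8443) from the machine where PyEGA3 runs should return True if no firewall is blocking the port. A False result only tells you that the connection failed, not why, so check local firewall or proxy rules next.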
PyEGA3 download client
How do I update the PyEGA3 download client to the latest version?
Updating the client to the latest version can be achieved by running the following command:
pip3 install pyega3 --upgrade
I do not have an active EGA account. Can I test the download with the PyEGA3 download client?
An EGA download test account (ega-test-data@ebi.ac.uk) has been created for troubleshooting and training purposes. The test account does not require an EGA username and password because it contains publicly accessible files from the 1000 Genomes Project. More information about the use of this account and its login details is available at pyEGA3.
I lost my connection while the download was in progress. Does this mean that I have to download the file again from the start?
The PyEGA3 download client supports automatic resumption of downloads. To enable quicker download speeds, the API breaks files into up to four segments and downloads them in parallel. With the resume feature, downloads will automatically resume if you encounter any errors or if the connection is interrupted.
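To illustrate the idea of segmented, resumable downloads, here is a minimal sketch. It is not PyEGA3's actual implementation: the function names, the even split of the file, and the representation of a segment as a (start, end) byte range are all assumptions made for this example.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[int, int]  # (start, end) byte offsets, end exclusive


def split_into_segments(file_size: int, max_segments: int = 4) -> List[Segment]:
    """Split [0, file_size) into at most `max_segments` contiguous byte ranges."""
    if file_size <= 0:
        return []
    count = min(max_segments, file_size)
    base, extra = divmod(file_size, count)
    segments = []
    start = 0
    for i in range(count):
        # The first `extra` segments take one additional byte each.
        length = base + (1 if i < extra else 0)
        segments.append((start, start + length))
        start += length
    return segments


def segments_to_resume(segments: Iterable[Segment], completed: Iterable[Segment]) -> List[Segment]:
    """Return the segments that still need downloading after an interruption."""
    done = set(completed)
    return [segment for segment in segments if segment not in done]
```

On restart, only the ranges returned by segments_to_resume need to be requested again, which is why an interrupted download does not have to start from the beginning.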
How can I improve the overall download speed of the PyEGA3 download client?
Download speeds can be optimised using the --connections parameter which will parallelise download at the file level. If the --connections parameter is provided, all files >100Mb will be downloaded using the specified number of parallel connections.
It is important to note that files are still downloaded sequentially, so using multiple connections does not mean downloading multiple files in parallel. The maximum number of connections allowed is 30. If more than 30 connections are used, the download client will encounter 500 errors. The recommended number of connections for maximum output is one.
Why is my file taking a long time to be saved?
Please note that when a file is being saved, it goes through two processes. First, the downloaded file "chunks" are pieced back together to reconstruct the original file. Secondly, PyEGA3 calculates the checksum of the file to confirm that the file was downloaded successfully. Larger files will take more time to reconstruct and validate the checksum.
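The two steps described above can be sketched as follows. This is an illustration only and not PyEGA3's code: the helper names merge_chunks and md5_of_file, and the assumption that each chunk is stored as a separate file on disk, are hypothetical.

```python
import hashlib
import shutil
from typing import Iterable


def merge_chunks(chunk_paths: Iterable[str], output_path: str) -> None:
    """Concatenate the chunk files, in the given order, into one output file."""
    with open(output_path, "wb") as out:
        for chunk_path in chunk_paths:
            with open(chunk_path, "rb") as part:
                shutil.copyfileobj(part, out)


def md5_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Both steps read every byte of the file once, so the time they take grows with file size. This is why large files can appear to stall at the end of a download.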
Why is the HTSGET protocol not working with my dataset of interest?
PyEGA3 implements the GA4GH-compliant htsget protocol for supporting requests over genomic ranges. This exciting new feature means that for data files with accompanying index files (e.g. .crai for CRAM files) users can download specific regions of interest rather than the entire file, saving both time and storage space. Please note that in order for the genomic range requests to work for a BAM, CRAM or VCF file there must be an associated BAI, CRAI or TBI file, respectively. If the dataset does not contain these index files then genomic range requests cannot be performed.
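Before attempting a genomic range request, you can check whether a dataset's file list contains the matching index file. The sketch below is a hypothetical helper, not part of PyEGA3. It assumes the common naming convention in which the index name is the data file name with the index extension appended (for example sample.cram and sample.cram.crai); datasets that name their index files differently would need a different rule.

```python
from typing import Iterable, Optional

# Data file suffix -> index suffix, following the BAM/BAI, CRAM/CRAI, VCF/TBI pairing.
INDEX_SUFFIX_BY_DATA_SUFFIX = {
    ".bam": ".bai",
    ".cram": ".crai",
    ".vcf.gz": ".tbi",
    ".vcf": ".tbi",
}


def expected_index_name(filename: str) -> Optional[str]:
    """Return the assumed index file name, or None for unsupported file types."""
    lower = filename.lower()
    for data_suffix, index_suffix in INDEX_SUFFIX_BY_DATA_SUFFIX.items():
        if lower.endswith(data_suffix):
            return filename + index_suffix
    return None


def can_request_genomic_range(filename: str, dataset_files: Iterable[str]) -> bool:
    """True if the dataset file list contains the assumed index for `filename`."""
    index_name = expected_index_name(filename)
    return index_name is not None and index_name in set(dataset_files)
```

For example, can_request_genomic_range("s1.cram", ["s1.cram"]) is False because no .crai file is listed, which matches the situation in which range requests cannot be performed.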
After download
Do I need to decrypt data files downloaded through the pyEGA3 download client?
Files are transferred over secure HTTPS connections and received unencrypted, thus there is no need for decryption after data download.
What is the purpose of an MD5 file?
After the download completes, file integrity is verified using checksums. The PyEGA3 download client automatically verifies each file’s unencrypted md5 after download to ensure that the file was downloaded correctly from the EGA.
Errors
Why do I get a “400 Client Error” while trying to access the download client?
Ensure that your credentials are formatted correctly. Please contact the EGA Helpdesk if you are unsure about your account’s username (i.e. your email address).
Why do I get a “500 Server Error”?
One common cause is using too many connections. The maximum number of connections allowed is 30. If more than 30 connections are used, the download client will encounter 500 errors. The recommended number of connections for maximum output is one. Note that files are still downloaded sequentially, so using multiple connections does not mean downloading multiple files in parallel.
Otherwise, 500 server errors are internal server errors from EGA’s end. Retry after some time and, if the 500 error persists, check our homepage or Twitter for any outage information or report it to EGA Helpdesk.
Why do I get a “slice error”?
This type of error occurs when one slice download was prematurely interrupted. Please re-start the same download. The download client will pick up any partial download and download any missing slices.
Why do I get the error “Dataset' EGADXXXXXXXXXXX' is not in the list of your authorised datasets”?
Please note that data access was granted to the email address (username) provided to the Data Access Committee (DAC) in your official request. Therefore, please check that the account you are using is the same one that you sent to the DAC.
Why do I get “ERROR:root:Failed to obtain IP address”?
This might mean that the connection ports are blocked. Check that port 8443 is reachable. If the port is reachable and the error persists, try modifying line 44 of the file pyega3.py in the installed python module:
Replace
endpoint = 'https://ipinfo.io/json'
with
endpoint = 'https://api.myip.com'
Why do I get the error “Invalid username, password or secret key - please check and retry”?
Make sure that you are using the correct credentials (username and password) and that they contain no typos. Contact EGA Helpdesk if the problem persists.
Documentation
access/download/files/pyega3
Lymphocyte RNA profiling
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Dataset
EGAD00001002183
RNA-seq BAM files from newborn screening dried blood spot samples
Paired-end RNA-seq BAM files from 21 newborn screening dried blood spot (DBS) samples. These DBS samples were obtained from extremely low gestational age newborns, 10 of whom were affected by a fetal inflammatory response (FIR) before birth, and 11 of whom were unaffected. Total RNA was sequenced using an Illumina NextSeq-500 instrument. The sample preparation protocol included the depletion of rRNA and globin mRNA using the Globin Zero Gold rRNA Removal Kit from Illumina. Libraries were prepared using the NEBNext Ultra II Directional RNA Library Prep Kit (New England Biolabs). There is one BAM file per sample and there is an additional BAM file, corresponding to sample BS13, which was downsampled to 1/4 of its original depth (see BS13_README file for details).
Dataset
EGAD00001005009
TARGET-seq+ single-cell transcriptome sequencing
Single-cell whole transcriptome sequencing data for bone marrow samples from 9 cases with clonal hematopoiesis and 4 control samples. The TARGET-seq+ protocol was used to generate plate-based 3' transcriptome data. For details on cell sorting and the TARGET-seq+ protocol see the methods section of the manuscript. One FASTQ file is provided per cell. Cells are named with their plate and well IDs and the subject ID. Empty wells (no-cell controls) are named "blank". Corresponding genotyping files use the same naming without the "_transcriptome" suffix.
Dataset
EGAD00001011175
TCELL PILOT ATAC-SEQ
This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Dataset
EGAD00001001317
Validation_of_Exome_sequencing_of_S7RE_iPSC_lines
PCR products were obtained from each target locus using genomic DNA from human iPS cells. Subsequently, PCR products are pooled and subjected to Illumina library preparation. The library will be sequenced either by HiSeq or MiSeq. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000423
Deep_sequencing_of_S7EPC_genome
PCR products were obtained from each target locus using genomic DNA from human iPS cells. Subsequently, PCR products are pooled and subjected to Illumina library preparation. The library will be sequenced either by HiSeq or MiSeq. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000437
Validation_of_SNVs_found_by_Exome_seq_in_S2_SF1___SF5_and__SF9_hiPSCs
PCR products were obtained from each target locus using genomic DNA from human iPS cells. Subsequently, PCR products are pooled and subjected to Illumina library preparation. The library will be sequenced either by HiSeq or MiSeq. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000464
Subclonal_analysis_in_S7RE2_and_S7RE14_iPS_cells
PCR products were obtained from each target locus using genomic DNA from human iPS cells. Subsequently, PCR products are pooled and subjected to Illumina library preparation. The library will be sequenced either by HiSeq or MiSeq. This data is part of a pre-publication release. For information on the proper use of pre-publication data shared by the Wellcome Trust Sanger Institute (including details of any publication moratoria), please see http://www.sanger.ac.uk/datasharing/
Study
EGAS00001000441