Download Client V3

Checks prior to download

*** Prior to starting your download, please check that your dataset is available through a download client by following this link: Datasets currently not present in the download client. Please email the EGA Helpdesk to request an Aspera download account for any datasets that are unavailable in the client.

If the dataset you have been granted access to contains *.gpg encrypted data, please contact the Helpdesk for information on installing and using a compatible download client. ***

GitHub

The download client V3 documentation is available from EGA's GitHub repository.

Frequently Asked Questions

After setting up your EGA download account, you can proceed to download using the EGA download client, pyEGA3. The pyEGA3 download client is a Python-based tool for viewing and downloading files from authorized EGA datasets. The client is under continuous development to make downloading more user-friendly.

The pyEGA3 client is compatible with any OS with Python 3.6+ installed. The client requires a connection to the internet, sufficient space on the destination drive, and the EGA download account credentials.

A video tutorial demonstrating the usage of pyEGA3, from installation to file download, is available. Detailed information is available at pyEGA3.

pyEGA3 makes HTTPS calls to the EGA AAI (https://ega.ebi.ac.uk:8443) and the EGA Data API (https://ega.ebi.ac.uk:8052). Ports 8443 and 8052 must both be reachable from the location where pyEGA3 is executed to avoid timeouts.
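If timeouts occur, it can help to confirm that both ports are reachable from the machine running pyEGA3. The following is a minimal sketch using only the Python standard library; it is not part of pyEGA3, and the function name `port_reachable` is an illustrative choice. It only tests whether a TCP connection can be opened, not whether the service behind the port is healthy.

```python
import socket


def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_ega_ports(host: str = "ega.ebi.ac.uk") -> dict:
    """Check the two ports named in the text above (8443 and 8052)."""
    return {port: port_reachable(host, port) for port in (8443, 8052)}
```

Calling `check_ega_ports()` returns a dictionary mapping each port to `True` or `False`. A `False` value usually points to a firewall or proxy rule on the local network rather than a problem with the client itself.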

Updating the client to the latest version can be achieved by running the following command:

 pip3 install pyega3 -U

An EGA download test account has been created which can be used for troubleshooting and training purposes. The test account does not require an EGA username and password because it contains publicly accessible files from the 1000 Genomes Project. More information is available at pyEGA3.

Files are transferred over secure HTTPS connections and received unencrypted, so no decryption is needed after download.

After the download completes, file integrity is verified using checksums. The Python download client automatically verifies each file's unencrypted MD5 checksum after download to ensure the file has been downloaded correctly from the EGA.
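The client performs this check automatically, but the same kind of verification can be repeated by hand, for example after copying a file to another disk. The sketch below is an illustration using the standard library and is not pyEGA3's own code; the function names and the 1 MiB read size are assumptions.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a file, reading it in fixed-size blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read block by block so large files do not need to fit in memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_md5(path: str, expected_md5: str) -> bool:
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Here `expected_md5` would be the unencrypted MD5 value reported for the file; where that value is obtained from is outside the scope of this sketch.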

Legacy datasets are not currently available for download through pyEGA3. Prior to starting your download, please check whether your dataset appears in the list "Datasets currently not present in the download client". If your dataset is listed, please contact the EGA Helpdesk and specify the legacy datasets you wish to download.

The Python download client cannot be used to download dataset metadata. Registered EGA users can download the metadata of a dataset they have permissions for by logging into the EGA Archive webpage and navigating to the dataset of choice. In the download section of the dataset page, you will find the option to download the metadata as a tarball file.

Please ensure that your credentials are formatted correctly. Email addresses (usernames) are case-sensitive. Please contact the EGA helpdesk if you are unsure about your username.

500 server errors are internal server errors on the EGA side. If a 500 error persists, please check our homepage or Twitter for outage information, or report it to the EGA Helpdesk.

The Python download client supports the automatic resumption of downloads. To enable quicker download speeds, the API breaks files into up to four segments and downloads them in parallel. With the resume feature, downloads will automatically resume if you encounter any errors or if the connection is interrupted.
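To illustrate what breaking a file into segments means, the sketch below divides a file of a given size into contiguous byte ranges. It is only an illustration: the function name and the even-split strategy are assumptions, and the client's actual segment boundaries may be chosen differently.

```python
def split_into_ranges(file_size: int, segments: int = 4) -> list:
    """Split `file_size` bytes into up to `segments` contiguous ranges.

    Returns a list of inclusive (start, end) byte offsets that together
    cover the whole file exactly once.
    """
    if file_size <= 0:
        return []
    # Never create more segments than there are bytes.
    segments = max(1, min(segments, file_size))
    base, extra = divmod(file_size, segments)
    ranges = []
    start = 0
    for index in range(segments):
        # The first `extra` segments take one additional byte each.
        length = base + (1 if index < extra else 0)
        ranges.append((start, start + length - 1))
        start += length
    return ranges
```

Because each range is independent, an interrupted transfer only needs to fetch the segments that did not complete, which is the general idea behind resumable downloads.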

Download speeds can be optimized using the --connections parameter, which parallelizes the download at the file level. If the --connections parameter is provided, all files larger than 100 MB will be downloaded using the specified number of parallel connections.

Using a very high number of connections will introduce overhead that can slow the download of the file. It is important to note that files are still downloaded sequentially, so using multiple connections does not mean downloading multiple files in parallel. We recommend trying with 30 connections initially and adjusting from there to get maximum throughput.

Please note that when a file is being saved, it goes through two processes. First, the downloaded file "chunks" are pieced back together to reconstruct the original file. Second, pyEGA3 calculates the checksum of the file to confirm the file downloaded successfully. Larger files will take more time to reconstruct and validate the checksum.
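The two steps described above can be sketched as follows. This is a simplified illustration, not pyEGA3's implementation: the function name is hypothetical, the chunks are assumed to be separate files listed in the correct order, and the checksum is computed while the data is written rather than in a separate pass.

```python
import hashlib


def reassemble_chunks(chunk_paths, dest_path, block_size=1024 * 1024):
    """Concatenate chunk files, in order, into dest_path.

    Returns the MD5 hex digest of the reconstructed file so the caller
    can compare it with the expected checksum.
    """
    digest = hashlib.md5()
    with open(dest_path, "wb") as out:
        for chunk_path in chunk_paths:
            with open(chunk_path, "rb") as src:
                # Copy block by block to keep memory use constant.
                for block in iter(lambda: src.read(block_size), b""):
                    out.write(block)
                    digest.update(block)
    return digest.hexdigest()
```

Both steps read every byte of the file at least once, which is why the time spent after the transfer itself grows with file size.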

pyEGA3 implements the GA4GH-compliant htsget protocol to support requests over genomic ranges. For data files with accompanying index files (e.g. .crai for CRAM files), users can download specific regions of interest rather than the entire file, saving both time and storage space. Please note that genomic range requests only work if each BAM or CRAM file has an associated BAI or CRAI index file. If the dataset does not contain these index files, genomic range requests cannot be performed.
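For readers unfamiliar with htsget, the sketch below shows roughly what a genomic range request looks like at the protocol level. It builds a reads request URL using the query parameters defined in the htsget specification (`format`, `referenceName`, `start`, `end`). The base URL and file identifier are hypothetical placeholders, the function name is illustrative, and this is not how pyEGA3 itself is invoked.

```python
from urllib.parse import quote, urlencode


def htsget_reads_url(base_url, file_id, reference_name, start, end, fmt="BAM"):
    """Build an htsget reads URL for one genomic region.

    htsget coordinates are 0-based, with `start` inclusive and `end` exclusive.
    """
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })
    # quote() guards against characters in the ID that are unsafe in a path.
    return f"{base_url.rstrip('/')}/reads/{quote(file_id, safe='')}?{query}"


# Hypothetical server and file ID, for illustration only.
example_url = htsget_reads_url(
    "https://htsget.example.org", "EXAMPLE_FILE_ID", "chr1", 100000, 200000
)
```

The server can only answer such a request efficiently by consulting the file's index, which is why a BAI or CRAI file must accompany each BAM or CRAM file.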