PyEGA3

Checks prior to download - Is the dataset available for download?

Most datasets can be downloaded with the PyEGA3 download client; the exceptions are specific legacy datasets. For those, refer to Live Outbox distribution, an alternative way to download them.

If the ID of the dataset you have been granted access to begins with EGAD5 (EGAD5XXXXXXX), please refer to the Live Outbox distribution.

If the dataset you have been granted access to contains encrypted data (filename ending in .gpg), contact EGA Helpdesk for detailed information on how to download these files.
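The checks above can be sketched as a small helper. This is only an illustration: the function name, the returned labels, and the example IDs and filenames are hypothetical, and it assumes that "begins with the ID EGAD5XXXXXXX" means the dataset ID starts with EGAD5.

```python
def download_route(dataset_id, filenames=()):
    """Suggest how to obtain a dataset, following the checks described above."""
    # Assumption: legacy datasets are recognised by an ID starting with EGAD5.
    if dataset_id.upper().startswith("EGAD5"):
        return "live-outbox"
    # Filenames ending in .gpg indicate encrypted data; the Helpdesk advises on these.
    if any(name.lower().endswith(".gpg") for name in filenames):
        return "contact-helpdesk"
    return "pyega3"


# Placeholder IDs and filenames, for illustration only.
print(download_route("EGAD50000000001"))                       # live-outbox
print(download_route("EGAD00001000001", ["sample1.bam.gpg"]))  # contact-helpdesk
print(download_route("EGAD00001000001", ["sample1.cram"]))     # pyega3
```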

General Information

The PyEGA3 download client is available at its GitHub repository. Its README contains a step-by-step guide on how to use it to download the files you have been granted access to.

PyEGA3 implements the GA4GH-compliant HTSGET protocol supporting requests over genomic ranges. This enables the download of specific regions of interest rather than the entire file.

HTSGET is not available for every file held by EGA: if the dataset does not contain index files (e.g. filenames ending in .crai), genomic range requests cannot be performed.

The maximum number of connections allowed is 30. If more than 30 connections are used, the download client will encounter 500 errors. The recommended number of connections for maximum output is one.

Check out our video tutorial!

Frequently Asked Questions

General questions

How do I download the datasets to which I have been granted access?

After setting up your EGA download account, you can proceed to download using the EGA download client, PyEGA3. The PyEGA3 download client is a Python-based tool for viewing and downloading files from authorised EGA datasets. The client is under continuous development to make downloading more user-friendly.

Can I download the dataset’s metadata via the download client?

No, the download client cannot be used to download metadata. Registered EGA users can download the metadata of an authorised dataset by logging into the EGA webpage and navigating to the dataset of choice. Approximately two-thirds of the way down the page you will find the option to download the metadata as a zip file.

How do I download large datasets as quickly as possible?

The main EGA download client is PyEGA3, a Python-based tool for viewing and downloading files from authorised EGA datasets. In exceptional cases we may consider an Aspera download for users experiencing very slow download rates or facing particular issues. Contact the EGA Helpdesk for further information.

Prerequisites and Installation

What are the major requirements for the PyEGA3 download client installation and usage?

The PyEGA3 client is compatible with any OS with Python 3.6+ installed. The client requires a connection to the internet, sufficient space on the destination drive, and the EGA download account credentials.
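As a rough illustration, two of these requirements (Python version and free disk space) can be checked locally before starting. The helper name and its parameters are hypothetical and not part of PyEGA3; the space you need depends on the files you plan to download.

```python
import shutil
import sys


def meets_requirements(destination, bytes_needed, min_python=(3, 6)):
    """Return basic local checks to run before installing or using the client."""
    free_bytes = shutil.disk_usage(destination).free
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        "space_ok": free_bytes >= bytes_needed,
    }


# Example: is there room for a 50 GiB download in the current directory?
print(meets_requirements(".", 50 * 1024**3))
```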

Where can I find the PyEGA3 download client installation and download instructions?

We strongly recommend checking our video tutorial demonstrating the usage of PyEGA3 from installation to file download. More detailed information is available at PyEGA3 GitHub.

What ports are required for the PyEGA3 download client to function?

PyEGA3 makes HTTPS calls to the EGA Data API (https://ega.ebi.ac.uk:8443). Port 8443 must be reachable from the machine where PyEGA3 is executed; otherwise requests will time out.
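One way to test reachability from the machine that will run the client is a plain TCP connection attempt. The sketch below uses only the Python standard library; the function name is hypothetical, and a successful connection shows only that the port is open, not that your credentials are valid.

```python
import socket


def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

With the host and port from the URL above, the call would be port_reachable("ega.ebi.ac.uk", 8443). A result of False suggests that a firewall or proxy is blocking the port.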

PyEGA3 download client

How do I update the PyEGA3 download client to the latest version?

To update the client to the latest version, run the following command:

pip3 install pyega3 --upgrade

I do not have an active EGA account. Can I test the download with the PyEGA3 download client?

An EGA download test account (ega-test-data@ebi.ac.uk) has been created for troubleshooting and training purposes. The test account does not require an EGA username and password because it contains publicly accessible files from the 1000 Genomes Project. More information about the use of this account and its login details is available at PyEGA3.

I lost my connection while the download was in progress. Does this mean that I have to download the file again from the start?

No. The PyEGA3 download client supports automatic resumption of downloads. To enable quicker download speeds, the API breaks files into up to four segments and downloads them in parallel. With the resume feature, downloads will automatically resume if you encounter any errors or if the connection is interrupted.
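To illustrate the idea of segmenting and resuming, the sketch below splits a file into up to four byte ranges and works out which ranges still need fetching. It is not PyEGA3's actual implementation: the function names and the even split are assumptions made for the example.

```python
def plan_segments(file_size, max_segments=4):
    """Split [0, file_size) into up to max_segments contiguous (start, end) ranges."""
    if file_size <= 0:
        return []
    count = min(max_segments, file_size)
    base, extra = divmod(file_size, count)
    segments, start = [], 0
    for i in range(count):
        # Spread any remainder over the first segments.
        length = base + (1 if i < extra else 0)
        segments.append((start, start + length))
        start += length
    return segments


def segments_to_fetch(file_size, completed, max_segments=4):
    """Return the segments that are still missing, given those already saved."""
    return [s for s in plan_segments(file_size, max_segments) if s not in completed]


print(plan_segments(1000))
# [(0, 250), (250, 500), (500, 750), (750, 1000)]
print(segments_to_fetch(1000, completed={(0, 250), (250, 500)}))
# [(500, 750), (750, 1000)]
```

After an interruption, only the missing ranges are requested again, which is why a restart does not begin from zero.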

How can I improve the overall download speed of the PyEGA3 download client?

Download speeds can be optimised using the --connections parameter, which parallelises the download of each individual file. If the --connections parameter is provided, all files larger than 100 MB will be downloaded using the specified number of parallel connections.

It is important to note that files are still downloaded sequentially, so using multiple connections does not mean downloading multiple files in parallel. The maximum number of connections allowed is 30. If more than 30 connections are used, the download client will encounter 500 errors. The recommended number of connections for maximum output is one.
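If you launch the client from your own scripts, a small guard can catch an out-of-range value before it reaches the server. This is a sketch only; the helper name is hypothetical and the limit of 30 is the one stated above.

```python
MAX_CONNECTIONS = 30  # limit stated above; exceeding it leads to 500 errors


def check_connections(requested):
    """Return the requested connection count, or raise if it is outside 1..30."""
    if not 1 <= requested <= MAX_CONNECTIONS:
        raise ValueError(
            f"connections must be between 1 and {MAX_CONNECTIONS}, got {requested}"
        )
    return requested


print(check_connections(4))  # 4
```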

Why is my file taking a long time to be saved?

Please note that when a file is being saved, it goes through two processes. First, the downloaded file "chunks" are pieced back together to reconstruct the original file. Secondly, PyEGA3 calculates the checksum of the file to confirm that the file was downloaded successfully. Larger files will take more time to reconstruct and validate the checksum.
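The two steps can be pictured with the following minimal sketch, which concatenates chunk files in order and computes the MD5 of the result as it goes. It is an illustration of the general technique, not the code PyEGA3 runs; the function name and the temporary files are made up for the example.

```python
import hashlib
import os
import tempfile


def reassemble(chunk_paths, output_path, block_size=1024 * 1024):
    """Concatenate chunk files in order and return the MD5 hex digest of the result."""
    digest = hashlib.md5()
    with open(output_path, "wb") as out:
        for path in chunk_paths:
            with open(path, "rb") as chunk:
                # Read in blocks so large chunks never sit in memory all at once.
                for block in iter(lambda: chunk.read(block_size), b""):
                    out.write(block)
                    digest.update(block)
    return digest.hexdigest()


# Demonstration with three small temporary chunks.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, data in enumerate([b"ACGT", b"TTGA", b"CC"]):
        path = os.path.join(tmp, f"chunk{i}")
        with open(path, "wb") as fh:
            fh.write(data)
        paths.append(path)
    md5 = reassemble(paths, os.path.join(tmp, "merged.bin"))
    print(md5 == hashlib.md5(b"ACGTTTGACC").hexdigest())  # True
```

Both steps read every byte of the file, so the time taken grows with file size.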

Why is the HTSGET protocol not working with my dataset of interest?

PyEGA3 implements the GA4GH-compliant htsget protocol for supporting requests over genomic ranges. This exciting new feature means that for data files with accompanying index files (e.g. .crai for CRAM files) users can download specific regions of interest rather than the entire file, saving both time and storage space. Please note that in order for the genomic range requests to work for a BAM, CRAM or VCF file there must be an associated BAI, CRAI or TBI file, respectively. If the dataset does not contain these index files then genomic range requests cannot be performed.
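A quick way to see whether a file is a candidate for genomic range requests is to look for its index in the dataset's file list. The sketch below assumes that an index is named after the data file plus the index suffix (for example sample.cram.crai) and that compressed VCFs end in .vcf.gz; both are assumptions, and actual naming in a dataset may differ.

```python
# Data file suffix -> required index suffix (pairs taken from the text above;
# the .vcf.gz entry is an added assumption).
INDEX_SUFFIX = {".bam": ".bai", ".cram": ".crai", ".vcf": ".tbi", ".vcf.gz": ".tbi"}


def supports_range_requests(filename, dataset_files):
    """Return True if dataset_files contains the index that filename would need."""
    names = {name.lower() for name in dataset_files}
    lowered = filename.lower()
    for data_suffix, index_suffix in INDEX_SUFFIX.items():
        if lowered.endswith(data_suffix):
            return lowered + index_suffix in names
    return False  # not a BAM, CRAM or VCF file


files = ["s1.cram", "s1.cram.crai", "s2.bam"]  # hypothetical file list
print(supports_range_requests("s1.cram", files))  # True
print(supports_range_requests("s2.bam", files))   # False
```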

After download

Do I need to decrypt data files downloaded through the pyEGA3 download client?

Files are transferred over secure HTTPS connections and received unencrypted, thus there is no need for decryption after data download.

What is the purpose of an MD5 file?

An MD5 checksum is used to verify file integrity once the download completes. The PyEGA3 download client automatically checks the MD5 checksum of each unencrypted file after download to ensure that the file was downloaded correctly from the EGA.

Errors

Why do I get a “400 Client Error” while trying to access the download client?

Ensure that your credentials are formatted correctly. Please contact the EGA Helpdesk if you are unsure about your account’s username (i.e. your email address).
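If you keep your credentials in a JSON file, a quick sanity check can rule out simple formatting mistakes. The sketch below assumes the file is a JSON object with "username" and "password" keys; please confirm the exact layout in the PyEGA3 README. The function name and the example values are hypothetical.

```python
import json


def credential_problems(text):
    """Return a list of problems found in the text of a JSON credentials file."""
    try:
        creds = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(creds, dict):
        return ["top level must be a JSON object"]
    problems = []
    # Assumed keys; check the README for the authoritative layout.
    for key in ("username", "password"):
        value = creds.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{key}'")
    username = creds.get("username")
    if isinstance(username, str) and username.strip() and "@" not in username:
        problems.append("username does not look like an email address")
    return problems


print(credential_problems('{"username": "someone@example.org", "password": "secret"}'))  # []
print(credential_problems('{"username": "someone"}'))
```

An empty list means no obvious formatting problem was found; it does not confirm that the username and password are correct.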

Why do I get a “500 Server Error”?

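Using more than 30 connections causes the download client to encounter 500 errors, so first check the value passed to --connections. Otherwise, 500 server errors are internal server errors on EGA’s side. Retry after some time and, if the 500 error persists:

check our homepage or Twitter for any outage information, or
report it to the EGA Helpdesk.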

Why do I get a “slice error”?

This type of error occurs when one slice download was prematurely interrupted. Please re-start the same download. The download client will pick up any partial download and download any missing slices.

Why do I get the error “Dataset 'EGADXXXXXXXXXXX' is not in the list of your authorised datasets”?

Please note that data access is granted to the email address (username) given to the Data Access Committee (DAC) in your official request. Check that the account you are using is the same one you provided to the DAC.

Why do I get “ERROR:root:Failed to obtain IP address”?

This might mean that the connection ports are blocked. Check that port 8443 is reachable.

If the port is reachable and the error persists, try editing line 44 of the file pyega3.py in the installed Python module:

Replace

endpoint = 'https://ipinfo.io/json'

with

endpoint = 'https://api.myip.com'

Why do I get the error “Invalid username, password or secret key - please check and retry”?

Make sure that you are using the correct credentials (username and password) and that they contain no typos. Contact the EGA Helpdesk if the problem persists.