Download Client V3.1

Checks prior to download

*** Prior to starting your download, please check that your Dataset is available through the download client by following this link: Datasets currently not present in the download Client. Please email the EGA Helpdesk to request an Aspera download account for any datasets that are unavailable in the client.

If the Dataset you have been granted access to contains *.gpg encrypted data please reach out to the Helpdesk for information on installation and use of a compatible download client. ***


Overview

The new download client is Python-based and downloads data over secure HTTPS connections instead of HTTP. This allows EGA to send unencrypted data over an encrypted connection, so you don't have to decrypt files after download. Files are verified against the unencrypted MD5 after download (you can also get the unencrypted MD5 directly from the API via a REST call).
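The MD5 verification described above can also be repeated by hand on a file you have already downloaded. The following is a minimal sketch, not the client's own code: the function names md5_of_file and matches_expected_md5 are hypothetical, and the expected value is assumed to be the unencrypted MD5 you obtained for the file.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that very large files never sit fully in memory.
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected_md5(path, expected_md5):
    """Compare a file's MD5 with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use constant, which matters for the multi-gigabyte files typical of sequencing datasets.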

The new client also supports segmenting (breaking a file into up to 4 segments and downloading them in parallel) and resume: even if a file is downloaded as one continuous stream, the download simply resumes after an error or an interrupted connection.
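To make the idea of segmenting and resuming concrete, here is an illustrative sketch of how a file could be divided into byte ranges and how a partly saved segment could be continued. This is an assumption about the general technique, not the client's actual logic; split_into_segments and remaining_range are hypothetical names.

```python
def split_into_segments(file_size, max_segments=4):
    """Return (start, end) byte offsets, end exclusive, that cover the file."""
    if file_size <= 0:
        return []
    # Never create more segments than there are bytes.
    count = min(max_segments, file_size)
    base, extra = divmod(file_size, count)
    segments = []
    start = 0
    for index in range(count):
        # The first `extra` segments take one additional byte each.
        length = base + (1 if index < extra else 0)
        segments.append((start, start + length))
        start += length
    return segments


def remaining_range(segment, bytes_already_saved):
    """Return the part of a segment that still has to be downloaded."""
    start, end = segment
    return (min(start + bytes_already_saved, end), end)
```

For a 10-byte file, split_into_segments(10) yields four contiguous ranges, and remaining_range tells a resumed download where to pick up within one of them.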


Tutorial Video

Here you can find a video tutorial demonstrating the usage of the Python Download client from its installation to download instructions.


Requirements

Python "requests" module (see the Python Requests Official Documentation)

pip3 install requests
If the "requests" module is already installed on your system we recommend ensuring that this module is up to date using the command:
pip3 install requests --upgrade

Firewall Ports

This client makes HTTPS calls to the EGA AAI (https://ega.ebi.ac.uk:8443/) and to the EGA Data API (https://ega.ebi.ac.uk:8052). Both ports 8443 and 8052 must be reachable from the location where this client script is run; otherwise you will experience timeouts. For example, https://ega.ebi.ac.uk:8443/ega-openid-connect-server/ and https://ega.ebi.ac.uk:8052/elixir/central/stats/load should not time out.

In order to check if ports 8443 and 8052 are open please run the following two commands:

openssl s_client -connect ega.ebi.ac.uk:8052
openssl s_client -connect ega.ebi.ac.uk:8443
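If openssl is not available, a plain TCP connection attempt gives a similar reachability check. The sketch below is an assumption-level alternative to the commands above: port_is_reachable is a hypothetical helper, and unlike openssl s_client it only tests that the port accepts connections, not that the TLS handshake succeeds.

```python
import socket


def port_is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def check_ports(host, ports, timeout=5.0):
    """Return a dict mapping each port to its reachability."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

Calling check_ports("ega.ebi.ac.uk", (8443, 8052)) from the machine that will run the client should report True for both ports if the firewall allows the traffic.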

Installation via Pip

sudo pip3 install pyega3

Installation via Conda (Bioconda channel)

conda config --add channels bioconda
conda config --add channels conda-forge
conda install pyega3

Upgrade via Pip

 pip3 install pyega3 -U

Installation via GitHub

Download the client from the EGA GitHub repository (click on 'clone' or 'download').

1. Navigate to the directory where the client was downloaded.

2. Three scripts are provided to install the required Python environment, based on the host operating system.

 Linux: debian_dependency_install.sh
 Linux (Red Hat): red_hat_dependency_install.sh
 Mac: osx_dependency_install.sh

3. Select the appropriate script for your host operating system and run it from the console, for example:

 sh  debian_dependency_install.sh  

Installing the Client for Windows Users

1. Download Python 3

2. Install Python 3 following the installer prompts

3. Verify the correct install from the terminal

python --version

4. Upgrade pip to the latest version

 python -m pip install --upgrade pip

5. Install the 'requests' module

 python -m pip install requests

6. Install the 'tqdm' module

 python -m pip install tqdm

7. Install the 'htsget' module

 python -m pip install htsget


Usage

pyega3 -h
positional arguments:
  {datasets,files,fetch}
                        subcommands
    datasets            List authorized datasets
    files               List files in a specified dataset
    fetch               Fetch a dataset or file

optional arguments:
  -h, --help            show this help message and exit
  -d, --debug           Extra debugging messages
  -cf CREDENTIALS_FILE, --credentials-file CREDENTIALS_FILE
                        JSON file containing credentials
                        e.g.{'username':'user1','password':'toor','key':
                        'abc'}
  -c CONNECTIONS, --connections CONNECTIONS
                        Download using the specified number of connections


Genomic Range Requests (via Htsget protocol):

usage: pyega3 fetch [-h] [--reference-name REFERENCE_NAME]
                    [--reference-md5 REFERENCE_MD5] [--start START]
                    [--end END] [--format {BAM,CRAM}] [--saveto [SAVETO]]
                    identifier

positional arguments:
  identifier            Id for dataset (e.g. EGAD00000000001) or file (e.g.
                        EGAF12345678901)

optional arguments:
  -h, --help            show this help message and exit
  --reference-name REFERENCE_NAME, -r REFERENCE_NAME
                        The reference sequence name, for example 'chr1', '1',
                        or 'chrX'. If unspecified, all data is returned.
  --reference-md5 REFERENCE_MD5, -m REFERENCE_MD5
                        The MD5 checksum uniquely representing the requested
                        reference sequence as a lower-case hexadecimal string,
                        calculated as the MD5 of the upper-case sequence
                        excluding all whitespace characters.
  --start START, -s START
                        The start position of the range on the reference,
                        0-based, inclusive. If specified, reference-name or
                        reference-md5 must also be specified.
  --end END, -e END     The end position of the range on the reference,
                        0-based exclusive. If specified, reference-name or
                        reference-md5 must also be specified.
  --format {BAM,CRAM}, -f {BAM,CRAM}
                        The format of data to request.
  --saveto [SAVETO]     Output file(for files)/output dir(for datasets)
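Because --start is 0-based inclusive and --end is 0-based exclusive, a region written in the common 1-based inclusive style needs a small conversion before it is passed to fetch. The sketch below illustrates that arithmetic; region_to_htsget_range and fetch_range_arguments are hypothetical helper names, and only flags listed in the help text above are used.

```python
def region_to_htsget_range(first_base, last_base):
    """Convert a 1-based inclusive region to a 0-based half-open (start, end)."""
    if first_base < 1 or last_base < first_base:
        raise ValueError("expected 1 <= first_base <= last_base")
    # Only the start shifts by one; the exclusive end equals the last base.
    return first_base - 1, last_base


def fetch_range_arguments(reference_name, first_base, last_base):
    """Build the range-related arguments for 'pyega3 fetch' as a list."""
    start, end = region_to_htsget_range(first_base, last_base)
    return [
        "--reference-name", reference_name,
        "--start", str(start),
        "--end", str(end),
    ]
```

For example, bases 1001 to 2000 of chr1 become --start 1000 --end 2000, a range of exactly 1000 positions.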


How to define your Credential file

Create a file called CREDENTIALS_FILE and place it in the directory from which the client will run. This file should be saved in JSON format (.json) and should contain your registered EGA email address and EGA password.

Please find a reference example for the credentials file.

Your username and password are provided to you by EGA. If you face any login issues with your correct credentials, or if you are unsure how your email is registered with us, please contact the EGA Helpdesk.
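One way to produce a well-formed credentials file is to let Python write the JSON for you. This is a minimal sketch under stated assumptions: the key names "username" and "password" are taken from the example in the pyega3 -h output, the optional "key" field shown there is omitted, and write_credentials_file is a hypothetical helper.

```python
import json
import os


def write_credentials_file(path, username, password):
    """Write an EGA credentials file as JSON and restrict its permissions."""
    # Key names follow the example in the 'pyega3 -h' output (assumption).
    credentials = {"username": username, "password": password}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(credentials, handle, indent=2)
    # Owner read/write only; this has limited effect on Windows.
    os.chmod(path, 0o600)
    return path
```

Using json.dump guarantees double-quoted keys and values, which strict JSON requires. Since the file holds a password in plain text, keep it out of shared directories and version control.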


Testing the pyEGA3 Download Client

We recommend that all fresh installations of the pyEGA3 Download Client be tested. To assist you in accomplishing this, we have created a test user account which can be used with the following commands:

Listing the datasets included in the Test Account

1. pyega3 -d -t datasets

Listing the Files included in the Test Account

2. pyega3 -d -t files EGAD00001003338

Downloading a File from the Test Account Dataset

3. pyega3 -d -t fetch EGAF00001775036

The test user does not require a username and password and is linked to a 1000 Genomes Public dataset that is accessible to all. The data in this Dataset is close to 1TB in size and can be used both for Troubleshooting and for Training purposes.

If you are having issues recovering data from the Test Account, include the debug output from the commands above with your request for assistance from the Helpdesk.

Following a successful Test Account verification you will be able to recover data from the Datasets you have been granted access to using the following process:

Using the pyEGA3 Download Client

The placeholder fields (/Path/To/CREDENTIALS_FILE.json, EGAD<NUM>, EGAF<NUM> and /Path/To/Output) need to be changed accordingly

Display authorized datasets

pyega3 -cf /Path/To/CREDENTIALS_FILE.json datasets 

Display files in a dataset

pyega3 -cf /Path/To/CREDENTIALS_FILE.json files EGAD<NUM> 

Download a dataset

pyega3 -cf /Path/To/CREDENTIALS_FILE.json fetch EGAD<NUM> --saveto /Path/To/Output 

Download a single file

pyega3 -cf /Path/To/CREDENTIALS_FILE.json fetch EGAF<NUM>  --saveto /Path/To/Output 

How to list the unencrypted md5s for all files in a dataset?

To list the unencrypted md5 sums for all the files in a dataset, please use the following command

pyega3 -cf  /Path/To/CREDENTIALS_FILE.json files EGAD<NUM>

If you need to output the unencrypted md5 sums into a separate text document, please redirect the output to a file with the command below

nohup pyega3 -cf  /Path/To/CREDENTIALS_FILE.json files EGAD<NUM> > /Path/To/File/md5sums.txt

Download a file or dataset using n connections/streams:

pyega3 -c 5 -cf /Path/To/CREDENTIALS_FILE.json fetch EGAD<NUM> --saveto /Path/To/Output 
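When downloads are scripted, the commands above can be assembled programmatically. The following sketch builds the argument list from the documented options (-c, -cf, fetch, --saveto) and optionally runs it; build_fetch_command and run_fetch are hypothetical helpers, and how pyega3 sets its exit code is not covered here, so the return value should be treated as informational only.

```python
import subprocess


def build_fetch_command(credentials_file, identifier, output_path, connections=None):
    """Return the pyega3 fetch command as a list of arguments."""
    command = ["pyega3"]
    if connections is not None:
        # Global options such as -c go before the subcommand.
        command += ["-c", str(connections)]
    command += ["-cf", credentials_file, "fetch", identifier, "--saveto", output_path]
    return command


def run_fetch(credentials_file, identifier, output_path, connections=None):
    """Run pyega3 fetch and return the process exit code."""
    command = build_fetch_command(credentials_file, identifier, output_path, connections)
    # Passing a list avoids shell quoting problems with paths containing spaces.
    return subprocess.run(command, check=False).returncode
```

Passing the arguments as a list rather than a single string means paths with spaces do not need extra quoting.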


Troubleshooting

If, following a successful Test Account connectivity verification, you are facing difficulties downloading files from your authorised Datasets, please retry the download commands listed above using the -d flag to generate the debug output for your account. Forwarding the debug output to the Helpdesk will speed up the investigation of your assistance request.


Reporting issues

We encourage our users facing download failures to contact the EGA Helpdesk.


Parallelism

Download via multiple connections works at the file level, but is still usable while downloading the whole dataset. If the -c command line switch is provided, all files >100Mb in the dataset will be downloaded using the specified number of connections.

The connections break down the download of individual files into segments, which can be processed in parallel. Using a very high number of connections will introduce an overhead that can slow the download of the file. It is important to note that files are still downloaded sequentially, so multiple connections won't mean downloading multiple files in parallel if the option is used for an entire dataset download.


Frequently Asked Questions

It is important to remember that when a file is being saved it goes through two processes. First, all of the downloaded "chunks" have to be pieced back together to make one large file. Following that, the client has to calculate the checksum of the file. So, in essence, the larger the file, the longer it will take for the file to be saved.