EGA Download Client v3 - Quick Guide

Background to download

*** Prior to starting your download, please check that your dataset is available to download using the client by following this link: Datasets currently not present in the download client. Please email the EGA Helpdesk to request an Aspera download account for any datasets that are unavailable in the client. ***


Secondly, check that your dataset contains files with the .cip extension. If it contains files with the .gpg extension, use the previous version of the client to download them.


Overview

The new download client is Python-based, and data is downloaded over secure HTTPS connections instead of HTTP. This allows EGA to send the data unencrypted (over encrypted connections), so you don't have to decrypt files after download. You can download files or datasets directly, without first having to make a request. Files are verified against the unencrypted MD5 after download (you can also get the unencrypted MD5 directly from the API via a REST call).
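If you want to repeat the MD5 check yourself on a downloaded file, the following is a minimal sketch. It is not the client's own verification code; the function names md5_of_file and matches_expected_md5 are hypothetical, and the expected checksum is assumed to be one you obtained from EGA.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large files are never loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected_md5(path, expected_md5):
    """Compare a file's MD5 with an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in fixed-size chunks keeps memory use constant regardless of file size, which matters for multi-gigabyte downloads.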

The new client also supports segmenting (breaking a file into up to 4 segments and downloading them in parallel) and resume: even if a file is downloaded as one continuous stream, the download simply resumes after an error or an interrupted connection.
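To illustrate the idea of segmenting and resuming, here is a minimal sketch of how a file could be divided into byte ranges and how a partly downloaded segment could be continued. This is an illustration only, not the client's actual implementation; the function names and the even-split strategy are assumptions.

```python
def split_into_segments(file_size, max_segments=4):
    """Split a file of file_size bytes into up to max_segments
    contiguous (first_byte, last_byte) ranges, both ends inclusive."""
    if file_size <= 0:
        return []
    count = min(max_segments, file_size)
    base, extra = divmod(file_size, count)
    segments = []
    start = 0
    for index in range(count):
        # The first `extra` segments take one additional byte each.
        length = base + (1 if index < extra else 0)
        segments.append((start, start + length - 1))
        start += length
    return segments


def remaining_range(segment, bytes_on_disk):
    """Return the byte range still to fetch for a segment of which
    bytes_on_disk bytes are already saved, or None if it is complete."""
    first, last = segment
    resume_from = first + bytes_on_disk
    if resume_from > last:
        return None
    return (resume_from, last)
```

For example, split_into_segments(10) yields four ranges covering bytes 0 to 9 without gaps or overlap, and remaining_range lets an interrupted segment continue from the first byte that is not yet on disk.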

*PLEASE NOTE: THE NEW CLIENT DOES NOT SUPPORT DOWNLOADING FILES THAT END WITH THE SUFFIX ".gpg". These will have to be downloaded using the previous version of the client, whose documentation can be found here.*

Downloading the Client
Files ready to use
  • At this point your files should be ready to use. Please contact the EGA Helpdesk if you have further queries.




Requirements

Python "requests" module
http://docs.python-requests.org/en/master/

pip3 install requests

Firewall Ports

This client makes HTTPS calls to the EGA AAI (https://ega.ebi.ac.uk:8443/) and to the EGA Data API (https://ega.ebi.ac.uk:8051). Both ports, 8443 and 8051, must be reachable from the location where the client script is run; otherwise you will experience timeouts. For example, https://ega.ebi.ac.uk:8443/ega-openid-connect-server/ and https://ega.ebi.ac.uk:8051/elixir/central/stats/load should not time out.
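A quick way to check whether a firewall is blocking these ports is to attempt a plain TCP connection. The sketch below uses only the Python standard library; the function name port_reachable is hypothetical, and a successful TCP connection only shows that the port is open, not that the HTTPS service behind it is healthy.

```python
import socket


def port_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened
    within `timeout` seconds, otherwise False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

Calling port_reachable("ega.ebi.ac.uk", 8443) and port_reachable("ega.ebi.ac.uk", 8051) from the machine that will run the client should both return True if the ports are reachable.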


Obtaining the Download Client

Download the client (click on 'Clone' or 'Download') from the EGA GitHub repository.

Installing the client

Installation via Pip:

sudo pip3 install pyega3

Installation via Conda:

conda config --add channels bioconda
conda install pyega3

Red fields need to be changed accordingly

1. Navigate to the directory where the client was downloaded to.

2. Select the appropriate installation script and follow the on-screen prompts.

Points to Note
Three scripts are provided to install the required Python environment, based on the host operating system:

  Linux: debian_dependency_install.sh
  Linux (Red Hat): red_hat_dependency_install.sh
  Mac: osx_dependency_install.sh
Select the appropriate script for your host operating system and run it from the console. For example:
 sh debian_dependency_install.sh

3. Create a file called CREDENTIALS_FILE in the directory where the client will run, and add the following text in exactly the same (JSON) format, using your own email address and EGA password.

Define the Credentials

Create a credentials file with the following JSON object in the directory where the client will run:

{
        "username": "my.email@domain",
        "password": "mypassword",
        "client_secret":"AMenuDLjVdVo4BSwi0QD54LL6NeVDEZRzEQUJ7hJOM3g4imDZBHHX0hNfKHPeQIGkskhtCmqAJtt_jm7EKq-rWw"
}

A copy of the template for credentials file can be downloaded from here. (right click and save file as ...)

Your username and password are provided to you by EGA. Specifying the password is not mandatory: if it is not provided, you will be asked to enter it at the console.
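If you prefer to generate the credentials file from a script, the following is a minimal sketch. The function name write_credentials is hypothetical; the client_secret value is assumed to be copied from the template above, and restricting the file permissions is an extra precaution that this guide does not require.

```python
import json
import os


def write_credentials(path, username, client_secret, password=None):
    """Write a credentials file in the JSON layout shown above.

    The password is left out when not given, in which case the client
    is expected to ask for it at the console (as described above).
    """
    credentials = {"username": username}
    if password is not None:
        credentials["password"] = password
    credentials["client_secret"] = client_secret
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(credentials, handle, indent=8)
    # Assumption: the file holds a password, so make it owner-only.
    os.chmod(path, 0o600)
    return credentials
```

Leaving the password out of the file avoids storing it on disk in plain text, at the cost of having to type it for each run.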

Installing the client for Windows users

1. Download Python 3

2. Install Python 3, following the installer prompts.

3. Verify the installation from the terminal

python --version

4. Upgrade pip to the latest version

python -m pip install --upgrade pip

5. Install the 'requests' module

python -m pip install requests

6. Install the 'tqdm' module

python -m pip install tqdm


Using the Download Client

Note: <output> must be populated with the full path and the file name without the .cip extension, such as:

 /Users/jeff/EGA/Download_client/3.0/Oxstat.tar.gz 
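As a small illustration of the note above, this sketch builds an <output> value from a target directory and a file name as listed in a dataset, removing a trailing .cip extension. The function name output_path is hypothetical and is not part of the client.

```python
import os


def output_path(directory, file_name):
    """Return the full output path for file_name inside directory,
    with a trailing '.cip' extension removed if present."""
    suffix = ".cip"
    if file_name.endswith(suffix):
        file_name = file_name[: -len(suffix)]
    # abspath turns a relative directory into a full path.
    return os.path.abspath(os.path.join(directory, file_name))
```

For instance, a directory of /Users/jeff/EGA/Download_client/3.0 and a listed name of Oxstat.tar.gz.cip give the path shown in the example above.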

Red fields below need to be changed accordingly

Display datasets

python pyega3/pyega3.py -cf CREDENTIALS_FILE datasets

Display files in a dataset

python pyega3/pyega3.py -cf CREDENTIALS_FILE files EGAD00001000951 <output> 

Download a dataset

python pyega3/pyega3.py -cf CREDENTIALS_FILE fetch EGAD00001000951 <output> 

Download a single file

python pyega3/pyega3.py -cf CREDENTIALS_FILE fetch EGAF00000585895 <output> 

Download a file or dataset using 4 streams:

python pyega3/pyega3.py -c 4 -cf CREDENTIALS_FILE fetch EGAF00001412793 <output> 

Parallelism (downloading via multiple connections) works at the file level, but can still be used while downloading a whole dataset. If the -c command-line switch is provided, all big files (>100Mb) in the dataset will be downloaded using the specified number of connections.

The number of connections determines how many segments each file download is broken into; the segments are then downloaded in parallel. Using a very high number therefore introduces overhead that slows down the download of the file. Files are still downloaded in sequence, so when an entire dataset is being downloaded, multiple connections do not mean that multiple files are downloaded in parallel.
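If you launch downloads from your own scripts, a small helper can assemble the command line. The sketch below only mirrors the example commands shown above (the -c and -cf switches, the fetch subcommand, an identifier and an output path); the function name build_fetch_command is hypothetical, and the argument layout may differ between client versions.

```python
def build_fetch_command(identifier, output,
                        credentials_file="CREDENTIALS_FILE",
                        connections=None):
    """Return the argument list for a fetch command laid out like the
    examples in this guide. The list can be passed to subprocess.run."""
    command = ["python", "pyega3/pyega3.py"]
    if connections is not None:
        if connections < 1:
            raise ValueError("connections must be at least 1")
        # -c sets how many segments each file is split into.
        command += ["-c", str(connections)]
    command += ["-cf", credentials_file, "fetch", identifier, output]
    return command
```

Building the command as a list rather than a single string avoids shell quoting problems when the output path contains spaces.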


Positional Arguments

positional arguments:
  {datasets,files,fetch}
                        subcommands
    datasets            List authorized datasets
    files               List files in a specified dataset
    fetch               Fetch a dataset or file

optional arguments:
  -h, --help            show this help message and exit
  -d, --debug           Extra debugging messages
  -cf CREDENTIALS_FILE, --credentials-file CREDENTIALS_FILE
                        JSON file containing credentials
                        e.g.{'username':'user1','password':'toor','key':
                        'abc'}
  -c CONNECTIONS, --connections CONNECTIONS
                        Download using the specified number of connections


Genomic Range Requests (via the htsget protocol):

usage: pyega3 fetch [-h] [--reference-name REFERENCE_NAME]
                    [--reference-md5 REFERENCE_MD5] [--start START]
                    [--end END] [--format {BAM,CRAM}] [--saveto [SAVETO]]
                    identifier

positional arguments:
  identifier            Id for dataset (e.g. EGAD00000000001) or file (e.g.
                        EGAF12345678901)

optional arguments:
  -h, --help            show this help message and exit
  --reference-name REFERENCE_NAME, -r REFERENCE_NAME
                        The reference sequence name, for example 'chr1', '1',
                        or 'chrX'. If unspecified, all data is returned.
  --reference-md5 REFERENCE_MD5, -m REFERENCE_MD5
                        The MD5 checksum uniquely representing the requested
                        reference sequence as a lower-case hexadecimal string,
                        calculated as the MD5 of the upper-case sequence
                        excluding all whitespace characters.
  --start START, -s START
                        The start position of the range on the reference,
                        0-based, inclusive. If specified, reference-name or
                        reference-md5 must also be specified.
  --end END, -e END     The end position of the range on the reference,
                        0-based exclusive. If specified, reference-name or
                        reference-md5 must also be specified.
  --format {BAM,CRAM}, -f {BAM,CRAM}
                        The format of data to request.
  --saveto [SAVETO]     Output file(for files)/output dir(for datasets)
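Because --start is 0-based inclusive and --end is 0-based exclusive, regions written in the common 1-based inclusive style (for example chr1:100-200) need converting before use. The following is a minimal sketch of that conversion; the function names are hypothetical, and only the --reference-name, --start and --end options listed above are used.

```python
def to_htsget_range(start_1based, end_1based):
    """Convert a 1-based inclusive region to the 0-based,
    end-exclusive (start, end) pair expected by --start/--end."""
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    # Only the start shifts: an inclusive 1-based end equals
    # the exclusive 0-based end.
    return start_1based - 1, end_1based


def range_arguments(reference_name, start_1based, end_1based):
    """Return the range options for a fetch command as a list of strings."""
    start, end = to_htsget_range(start_1based, end_1based)
    return ["--reference-name", reference_name,
            "--start", str(start), "--end", str(end)]
```

With this convention the number of bases requested is simply end minus start.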


Reporting issues

If you have any issues with the client, please raise them on the EGA GitHub repository or contact the EGA Helpdesk.