Backing up the data

From ECRIN-MDR Wiki
Revision as of 16:02, 21 December 2020 by Admin (talk | contribs) (Zipping up the XML files)
Jump to navigation Jump to search

Introduction

There are two sets of data that need to be backed up from the MDR system:

  • The data in the databases
  • The downloaded XML files

In both cases the data is first compressed on the server – as backup files for the databases, and as zip arechives for the XML files. The compressed data is then also copied onto a machine that is outside the OVH infrastructure.

Backing up the databases

Each schema in each database is backed up within the same procedure. The PostgreSQL utility pgdump is used to construct a compressed folder for each schema, and name it according to the source database and schema name, as well as the day of the week of the backup. Each folder contains a series of 'tarball' archive files, with extension .dat.gz. These archives must also be restored using pg_restore – they cannot be read directly. The alternative, of creating backup files as SQL statements, results in enormous files, and so was rejected.
Because pgdump is a console program that takes parameters it is relatively easy to call it within a Powershell script, and then schedule the script. Essentially the script

  • first determines the day of the week, to identify the destination folder for the backups,
  • establishes 2 SQL statements, one that retrieves all the databases present on the PostgreSQL instance, and a second that brings back all the schemas of interest in those databases,
  • deletes all files in the back up folder for that day,
  • uses the Postgres CLI psql to call the SQL commands constructed above. Note the connection parameters are ‘’not’’ included in the script.
  • for each schema, use pgdump to create a backup folder.

The Powershell script is below. It is heavily commented, and those comments should explain most of the parameters used for the various functions.

# set up variable pointing to backup location
# and transfer the execution context to the Postgres folder
# so that Postgres .exes (psql and pg_dump) can be found

$backup_folder = "C:\MDR_DBBackups\Postgres"
$postgres_folder = "C:\Program Files\PostgreSQL\12\bin\"

# Obtain day and folder name

$day = ((get-date).DayOfWeek).ToString().Substring(0,3)
$day_suffix = switch($day)
{ 
  "Sun" {"_01_Sun"; Break}
  "Mon" {"_02_Mon"; Break}
  "Tue" {"_03_Tue"; Break}
  "Wed" {"_04_Wed"; Break}
  "Thu" {"_05_Thu"; Break}
  "Fri" {"_06_Fri"; Break}
  "Sat" {"_07_Sat"; Break}
}
$backup_folder = $backup_folder + $day_suffix

# sql required to bring back a list of databases in the cluster
# (works in any database)

$db_sql = "select datname from pg_database where datistemplate = false"

# sql required to bring back list of schemas of interest in each database
# i.e. those that include tables, excluding foreign tables or views, 
# and that are not part of the system catalogues

$schema_sql = "select distinct s.schema_name from information_schema.schemata s inner join information_schema.tables t"
$schema_sql = $schema_sql + " on s.schema_name = t.table_schema where table_type <> 'FOREIGN' and table_type <> 'VIEW'"
$schema_sql = $schema_sql + " and schema_name <> 'pg_catalog' and schema_name <> 'information_schema'"

# get the dbs in the cluster and for each the schemas 
# of interest. -qtAX used in each case simplifies the data returned 
# from psql and makes it readable by powershell as a simple collection.
# -q = quiet (no messages) -t = tuples only (just values)
# -A = no align (no attempt at pretty arrangements) -X = no-psqlrc (do not read start-up files)
# -c is the command flag, followed by the sql statement

# calls to both psql and pg_dump use standard connection parameters 
# but do not explicitly include password. For security, this is stored in pgpass.conf 
# file, in %APPDATA%\postgresql\ for user, as per Postgres documentation

Set-Location $postgres_folder

# Need to first clear any pre-existing folders / files from the folder
Write-Host $backup_folder
Get-ChildItem $backup_folder -Recurse | Remove-Item -Recurse

$dblist = & psql -U postgres -d mdr -qtAX -c $db_sql
foreach($db in $dblist)
{
    $schemalist = & psql -U postgres -d $db -qtAX -c $schema_sql
    foreach($schema in $schemalist)
    {
        # Backup each schema using the 'Directory' option (-Fd)
        # Provides a highly compressed folder. Using a plain sql backup
        # resulted in huge files that were sometimes beyond the limit (2 GB) 
        # for compression within a Powershell script.
        # Data must be restored using pg_restore.
        # -c option means that tables are deleted (if they exist) and then recreated
        # but note that database itself should exist on destination cluster.
        # -n option makes it specific to the named schema

        $file_name = $backup_folder + "\" + $db + "_" + $schema + ".backup"
        Write-Host $file_name
        & pg_dump -h localhost -U postgres -d $db -p 5432 -n $schema -Fd -c --if-exists -f $file_name
    }
}

# Line below and Write Host in loop to be removed from
# production version. Retained here to assist development

# Read-Host -Prompt "Press Enter to exit"

The result of running the script is a set of dump folders in the relevant backup folder.

Zipping up the XML files

The XML files are stored in a set of folders, one for each source. Sources with very large numbers of files (e.g. ClinicalTrials.gov, PubMed) have the files distributed within sub-folders, usually by using their Id (in the source data, e.g. the NCT number). Sources with fewer files have no sub-folders, but may contain up to 50,000 files.

The folders are zipped by running a small utility written in C# called FileZipper. This is a console program that can take parameters. An -A flag will zip all files from all sources, whereas an -s flag will zip only the sources listed in a following string of source Ids (e.g. -s “100120, 100135”). A -J flag will zip the JSON files created by the MDR program.

For sources that have subfolders, the folders are split into 8 large groups, and each group is zipped into a single archive file. This is done to make it easier, if the need ever arose, to unzip the archives back to the original XML files. For source folders without subfolders all files are zipped into a single zip file. The zip files are placed into a folder, with sub-folders for each source. The location of the Zip folders is read from a JSON configuration file when the FileZipper utility begins

Copying the data from the server

The data is copied from the server by ‘pulling’ it with an installation of WinSCP on the external destination machine.

Scheduling back up activity

Because both backups and XML files are only scheduled to change in certain parts of the week they do not have to be backed up every day.