Backing up the data

From ECRIN-MDR Wiki
Revision as of 16:38, 21 December 2020 by Admin (talk | contribs) (Copying the data from the server)
Jump to navigation Jump to search

Introduction

There are two sets of data that need to be backed up from the MDR system:

  • The data in the databases
  • The downloaded XML files

In both cases the data is first compressed on the server – as backup files for the databases, and as zip arechives for the XML files. The compressed data is then also copied onto a machine that is outside the OVH infrastructure.

Backing up the databases

Each schema in each database is backed up within the same procedure. The PostgreSQL utility pgdump is used to construct a compressed folder for each schema, and name it according to the source database and schema name, as well as the day of the week of the backup. Each folder contains a series of 'tarball' archive files, with extension .dat.gz. These archives must also be restored using pg_restore – they cannot be read directly. The alternative, of creating backup files as SQL statements, results in enormous files, and so was rejected.
Because pgdump is a console program that takes parameters it is relatively easy to call it within a Powershell script, and then schedule the script. Essentially the script

  • first determines the day of the week, to identify the destination folder for the backups,
  • establishes 2 SQL statements, one that retrieves all the databases present on the PostgreSQL instance, and a second that brings back all the schemas of interest in those databases,
  • deletes all files in the back up folder for that day,
  • uses the Postgres CLI psql to call the SQL commands constructed above. Note the connection parameters are ‘’not’’ included in the script.
  • for each schema, use pgdump to create a backup folder.

The Powershell script is below. It is heavily commented, and those comments should explain most of the parameters used for the various functions.

# set up variable pointing to backup location
# and transfer the execution context to the Postgres folder
# so that Postgres .exes (psql and pg_dump) can be found

$backup_folder = "C:\MDR_DBBackups\Postgres"
$postgres_folder = "C:\Program Files\PostgreSQL\12\bin\"

# Obtain day and folder name

$day = ((get-date).DayOfWeek).ToString().Substring(0,3)
$day_suffix = switch($day)
{ 
  "Sun" {"_01_Sun"; Break}
  "Mon" {"_02_Mon"; Break}
  "Tue" {"_03_Tue"; Break}
  "Wed" {"_04_Wed"; Break}
  "Thu" {"_05_Thu"; Break}
  "Fri" {"_06_Fri"; Break}
  "Sat" {"_07_Sat"; Break}
}
$backup_folder = $backup_folder + $day_suffix

# sql required to bring back a list of databases in the cluster
# (works in any database)

$db_sql = "select datname from pg_database where datistemplate = false"

# sql required to bring back list of schemas of interest in each database
# i.e. those that include tables, excluding foreign tables or views, 
# and that are not part of the system catalogues

$schema_sql = "select distinct s.schema_name from information_schema.schemata s inner join information_schema.tables t"
$schema_sql = $schema_sql + " on s.schema_name = t.table_schema where table_type <> 'FOREIGN' and table_type <> 'VIEW'"
$schema_sql = $schema_sql + " and schema_name <> 'pg_catalog' and schema_name <> 'information_schema'"

# get the dbs in the cluster and for each the schemas 
# of interest. -qtAX used in each case simplifies the data returned 
# from psql and makes it readable by powershell as a simple collection.
# -q = quiet (no messages) -t = tuples only (just values)
# -A = no align (no attempt at pretty arrangements) -X = no-psqlrc (do not read start-up files)
# -c is the command flag, followed by the sql statement

# calls to both psql and pg_dump use standard connection parameters 
# but do not explicitly include password. For security, this is stored in pgpass.conf 
# file, in %APPDATA%\postgresql\ for user, as per Postgres documentation

Set-Location $postgres_folder

# Need to first clear any pre-existing folders / files from the folder
Write-Host $backup_folder
Get-ChildItem $backup_folder -Recurse | Remove-Item -Recurse

$dblist = & psql -U postgres -d mdr -qtAX -c $db_sql
foreach($db in $dblist)
{
    $schemalist = & psql -U postgres -d $db -qtAX -c $schema_sql
    foreach($schema in $schemalist)
    {
        # Backup each schema using the 'Directory' option (-Fd)
        # Provides a highly compressed folder. Using a plain sql backup
        # resulted in huge files that were sometimes beyond the limit (2 GB) 
        # for compression within a Powershell script.
        # Data must be restored using pg_restore.
        # -c option means that tables are deleted (if they exist) and then recreated
        # but note that database itself should exist on destination cluster.
        # -n option makes it specific to the named schema

        $file_name = $backup_folder + "\" + $db + "_" + $schema + ".backup"
        Write-Host $file_name
        & pg_dump -h localhost -U postgres -d $db -p 5432 -n $schema -Fd -c --if-exists -f $file_name
    }
}

# Line below and Write Host in loop to be removed from
# production version. Retained here to assist development

# Read-Host -Prompt "Press Enter to exit"

The result of running the script is a set of dump folders in the relevant backup folder.

Zipping up the XML files

The XML files are stored in a set of folders, one for each source. Sources with very large numbers of files (e.g. ClinicalTrials.gov, PubMed) have the files distributed within sub-folders, usually by using their Id (in the source data, e.g. the NCT number). Sources with fewer files have no sub-folders, but may contain up to 50,000 files.

The folders are zipped by running a small utility written in C# called FileZipper. This is a console program that can take parameters. An -A flag will zip all files from all sources, whereas an -s flag will zip only the sources listed in a following string of source Ids (e.g. -s “100120, 100135”). A -J flag will zip the JSON files created by the MDR program.

For sources that have subfolders, the folders are split into 8 large groups, and each group is zipped into a single archive file. This is done to make it easier, if the need ever arose, to unzip the archives back to the original XML files. For source folders without subfolders all files are zipped into a single zip file. The zip files are placed into a folder, with sub-folders for each source. The location of the Zip folders is read from a JSON configuration file when the FileZipper utility begins

Copying the data from the server

The data is copied from the server by ‘pulling’ it with an installation of WinSCP on the external destination machine. Setting up the WinSCP system is reasonably straightforward as long as the correct protocols are used. Attempts to use SSH or SFTP were unsuccessful, perhaps because of security settings on the server (dictated by the infrastructure), so secure FTP (FTPS) was used instead.

The script is shown below. The script's activity is logged into a named CSV file, so this is set up first as the 'logging provider'.

try 
{
    # Set up the logging provider as a log file
    $paramSetPSFLoggingProvider = @{
        Name         = 'logfile'
        InstanceName = 'Transferring DB backups'
        FilePath     = 'D:\MDR Scripts\Transfer log.csv'
        Enabled      = $true
    }
    Set-PSFLoggingProvider @paramSetPSFLoggingProvider

    $module = 'Transferring DB backups'
    Write-PSFMessage -Level Important -Message 'Started module.' -ModuleName $module -Tag 'Info'

    # get the day abbreviation
    $day = ((get-date).DayOfWeek).ToString().Substring(0,3)
    $day_suffix = switch($day)
    { 
        "Sun" {"01_Sun"; Break}
        "Mon" {"02_Mon"; Break}
        "Tue" {"03_Tue"; Break}
        "Wed" {"04_Wed"; Break}
        "Thu" {"05_Thu"; Break}
        "Fri" {"06_Fri"; Break}
        "Sat" {"07_Sat"; Break}
    }

    # Get folder names for this day
    $remote_folder = "/Postgres_" + $day_suffix
    $local_folder = "D:\MDR Backups\Postgres_" + $day_suffix
   
    # delete current contents of local folder
    Get-ChildItem $local_folder -Recurse | Remove-Item -Recurse
    Write-PSFMessage -Level Important -Message 'Existing files deleted in local folder' -ModuleName $module -Tag 'Info'

    # Load WinSCP .NET assembly
    Add-Type -Path "D:\MDR Scripts\WinSCPnet.dll"

    # Set up session options
    $sessionOptions = New-Object WinSCP.SessionOptions -Property @{
        Protocol = [WinSCP.Protocol]::Ftp
        HostName = "51.91.154.6"
        PortNumber = 22
        UserName = "Admin"
        Password = "1nU9I2wFumJPwyKk"
        FtpSecure = [WinSCP.FtpSecure]::Explicit
        TlsHostCertificateFingerprint = "97:09:fc:bc:1f:b3:62:0e:05:57:b6:f9:94:8e:cc:64:b9:56:7d:29"
    }
    $session = New-Object WinSCP.Session -ErrorAction Stop

    # Connect
    $session.Open($sessionOptions)  
    Write-PSFMessage -Level Important -Message 'Connection opened with remote server' -ModuleName $module -Tag 'Info'
    
    try
    {
        $session.add_QueryReceived( { 
            Write-PSFMessage -Level Warning -Message 'Error: $($_.Message)' -Tag 'Failure' -ModuleName $module -ErrorRecord $_
            $_.Continue()
        } )

        # Transfer files
        $session.GetFiles($remote_folder, $local_folder).Check()
        Write-PSFMessage -Level Important -Message 'Completed transfer of files.' -ModuleName $module -Tag 'Success'
    }

    finally
    {
        # Disconnect, clean up
        $session.Dispose()
    }

    exit 0
}

catch
{
    Write-PSFMessage -Level Warning -Message $Error[0] -Tag 'Failure' -ModuleName $module 
    exit 1
}

Scheduling back up activity

Because both backups and XML files are only scheduled to change in certain parts of the week they do not have to be backed up every day.
For the database data

  • Backups are carried out at 00.05 on Wednesday and Saturday morning. This is just before and just after the databases are usually changed.
  • The backup folders are transferred to an external machine at 04.00 on Wednesday and Saturday morning.

For the XML files

  • The zipping of the files is carried out at 00.05 on Monday and Thursday morning. This is just before and just after the XML files are usually updated.
  • The zip files are transferred to an external machine at 04.00 on Monday and Thursday morning.

Scheduling of the backups / zips on the server uses Windows Task Scheduler, scheduling PowerShell and FileZipper respectively. PowerShell is version 5.1 on the server, the default installation for Windows Server 2019. Scheduling the C# utility is straightforward, but for PowerShell to work properly with the task scheduler, after the job is set up initially using 'Create Basic Task', it has to be configured in a certain way:

  • In the General tab, ensure 'Run whether user is logged in or not' is checked.
  • In the General tab, ensure the 'Configure for' drop down indicates Windows Server 2019.
  • In the Settings tag, for the 'if the task is already running, then the following rule applies' ensure that 'Stop the existing instance' is selected. This is more to ensure that the scheduled task is not blocked from any earlier error.
  • In the Actions tag, ensure that the correct version of PowerShell is referenced. For v5.1 this is usually 'C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe'
  • In the Actions tag, ensure that the parameters begin with the -no profile and -ExecutionPolicy Bypass flags, and the script reference is preceded by -File, e.g.
 -noprofile -ExecutionPolicy Bypass -File "D:\MDR Scripts\DB backups.ps1"

For scheduling the transfer to the external machine, Powershell is used for both the database backups and the zip files. Again the job must be configured in a particular way. Because the external machine runs Windows 10 and uses the cross platform PowerShell 7, based on .Net core, (the scripts could therefore run on a Linux box) the configuration should be as described above, except that:

  • In the General tab, ensure the 'Configure for' drop down indicates Windows 10.
  • In the Actions tag, ensure that the correct version of PowerShell is referenced. For v7 this is usually 'C:\Program Files\PowerShell\7\pwsh.exe'.