RT Guide: Chapter 4, Exporting Datasets & Tables from the SDC

Exporting Datasets from the SDC

Researchers should be able to export the data of the system based on the compliance and data
usage policies set forth by a Data Provider.
There are two different types of Researchers:

  1. General Researcher : This type of Researcher must provide justification to the Data Provider for
    each data product that they want to export out of the SDC system. The intent is to ensure
    that the Data Provider has oversight of the exported data. This type of Researcher can also
    request trusted status from the Data Provider while filling out the approval form.

  2. Trusted Researcher : This type of Researcher already has a trusted status which is provided by
    the Data Providers. The intent is to reduce the effort for exporting data products of
    analyses out of the SDC system. A trusted user has a pre-existing and approved
    relationship with the Data Provider. Trusted status is typically granted in extremely limited cases, at the discretion of the Data Provider.

Once the Researcher completes creating derived datasets, either working on the SDC datasets
or combining with other datasets that they import into the system, they can export the derived datasets or share the datasets with other team members.

After Logging to SDC portal, navigate to Datasets tab as shown in figure below

 

Request to Export Data

The following are the steps that the Researcher needs to follow to export the data of their
analysis from the SDC system to support their research:

  1. Each Researcher is part of a team bucket which is displayed in the Datasets section.
    When ready to export, Researchers can select the file (or files) that they want to export
    out of the SDC system and place them in a separate staging folder (i.e., export_requests)
    in their team bucket. Researchers can request for exporting a file in this folder by
    clicking on the export symbol for the file they want to export out of the SDC system.
    Please note that if Researchers want to print out a hard copy of a document, they will
    need to export it from the SDC workstation to their local machine.



  2. Once the export button is selected, a dialog box for requesting the export data will be displayed. The Researcher will then need to provide the details of the Project, Data Provider, and Data Type that he has used to create his own dataset and click on the NEXT button once finished.

     

  3. The additional information regarding the request for exporting the data must be filled out in the Approval Form tab, shown below. These details are shared with the Data Providers, which helps them to accept or reject the request made by the Researchers.

     

Acceptable Use Policy

  1. If not previously awarded 'Trusted User Status'* for the selected dataset, users must accept the Acceptable Usage policy for the export request to go through to the Data Provider. The form will not be submitted if the user declines.
    (*see the Request Trusted User Status section below for more information.)

  2. Upon successful submission, the export request will be sent to appropriate Data
    Providers. Data Providers will be responsible for accepting or rejecting the export
    requests.

  3. Once Data Providers approve the request, Researchers will be able to download the
    dataset out of SDC through the portal.

     

Request Auto-Export Status

If the user is a trusted user for the selected data provider and data type, then he/she will be
prompted with the option for requesting the auto-export status from the Data Provider. This will
allow the Researcher to be able to export data automatically, as opposed to filling out the request
form and then going through the approval process before exporting. Not all data are available for the auto export feature. Data Providers may request the feature be available for their data.

NOTE: The Auto-Export Status tab will not be visible if the user does not have a trusted status.

  1. The Researcher enters the derived dataset name of the data type they want to export into
    ‘Specify the derived dataset name’ and then the justification for requesting permissions
    into ‘Reason for requesting permission.’

  2. The ‘Request to Export Data’ form will still be submitted if the user declines the option
    to submit a request for auto-export status by selecting ‘No’ under ‘Do you wish to request
    Auto-Export Status from the Data Provider?’

     

How to Auto-Export Files

Please make sure that you meet the following prerequisites:

Once you are approved for the auto-export status from a specific Data Provider for a specific
datatype, use the following instructions to auto-export files.

  1. Log into the SDC Portal to start and launch your SDC workstation.
    Refer to the instructions in Chapter 3 https://securedatacommons.atlassian.net/wiki/spaces/DESK/pages/2225405986/RT+Guide+Chapter+3+Accessing+and+Launching+Workstations if needed.

  2. Log into the Guacamole workstation.

  3. Perform work and save any derived data that needs to be exported to the local machine.
    For example, the ‘Documents’ folder would be a good place.
    NOTE: When auto-exporting a file, make sure the name of the file does not contain one
    of the following special characters: ! @ # $ % ^ & ( ) + = { } [ ] ; ' , ~ `
    A file that contains any one of these characters cannot be auto-exported.

  4. Open PowerShell and navigate to the directory where the derived data file is located.
    • If you are unable to find PowerShell, right-click on the Windows menu, select Search,
    and enter “powershell.”
    • Use the command cd to change the directory, for example, cd Documents

  5. Copy the derived data file into the correct directory under your team bucket’s
    auto_export directory that you were approved to auto-export.
    For example, if you were approved to auto-export ‘Monthly_Reports’ datatype, then run
    the following command in PowerShell to copy the file to the ‘auto_export/Monthly_Reports/’ directory under your team bucket.

    aws s3 cp deriveddatatoexport.file s3://teambucket/auto_export/Monthly_Reports/

    See the AWS S3 CLI Commands section for more information on the aws cp command.

  6. After the copy operation has completed, the SDC will automatically copy the derived
    data file to a specific datatype folder under the data provider’s auto-export bucket. This
    “auto-export” process should only take a few seconds.

  7. To download this auto-exported data file onto your local machine, sign into the AWS CLI on your local machine using the SDC-provided authentication script.

  8. Make sure to choose the auto-export role for the specific derived datatype. You will ONLY be able to export the data of the role you have assigned. You must change your role to export other types of derived data if you are approved for multiple datatypes. If you need access to the authentication script, please submit a Service Desk ticket.

  9. Using the AWS CLI, you may list the content of the derived datatype’s directory from the
    data provider’s auto-export bucket and download any files located in that directory to
    your local machine.
    Your permissions are very limited, and the commands must be very precise for them to
    work.
    List files in the data provider’s auto-export bucket for the specified datatype

    aws s3 ls s3://DP-auto-export-bucket/datatype/

    Download file in the data provider’s auto-export bucket for the specified datatype

    aws s3 cp s3://DP-auto-exportbucket/datatype/deriveddata.file localmachinelocation

 

Request Trusted User Status

If the user is not a trusted user, he/she has the option to request trusted status from the Data Owner/ Data Steward. This will allow the Researcher to export the data immediately, as opposed to waiting for review and approval from the Data Owner/Steward. Trusted status is typically granted in extremely limited cases, at the discretion of the Data Owner/Steward.

1. The form to request Trusted User Status can be opened from the DataSets page. Click on the ‘Request Trusted User Status’ button to open the dialog box.

 

2. The Researcher will then need to select the details of the Project, Data Provider, and Data Type for which he/she would like to request Trusted Status, as well as enter his/her reason for the request into the ‘Justification’ field'.

3. The Researcher must also indicate that he/she has read and accepts the Acceptable Use Policy.

4. Once all the required information has been selected and entered, the submit button will activate so the request may be submitted to the Data Owner/ Data Steward.

 

Exporting Tables to the Edge Database

Researchers can also export summarized, aggregated and/or analysis results data, which do not contain sensitive information, to the public Edge database. This enables connections from the Edge database to local applications and/or other public data sources which are located outside of your SDC workstation.

Researchers are required to submit requests prior to exporting a new table for review and approval by the Data Steward. Data Steward review and approval is also required given any changes to a previously approved table’s schema. In this case, a new Table Export request should be submitted to identify the change.

Submit a Request to Export Table Data to the Edge Database

To submit a request to export your table data:

  1. Navigate to the ‘My Edge Database’ panel, which is found on the Datasets page. This is located just below the ‘Request Trusted User Status’ button.

  2. Click on ‘Request to Export Table to Edge Database’ button to open a request form.

     

  3. Fill in the required fields on the form.

    1. The name of your database will appear pre-populated in the form.

    2. Provide the name of the table to be exported to the public Edge database.
      E.g. wazedata_test01

    3. Select from the cascading drop-down menus to identify the primary Project/Dataset, Data Provider, and Sub-Dataset/DataType used to create your table.
      E.g. Waze (Jams, Irregularity or Alert)

    4. List any additional data sources that are used to create your table dataset.

    5. Provide justification for your request.

    6. Click the check box to indicate your review of and agreement with the ‘Acceptable Use Policy’.

  4. Once all of the required information has been provided, click the 'Submit' button. This will send your request to the Data Owner/ Data Steward for review.

  5. When the Data Steward completes their review process, you will receive an email notification confirming whether your request was approved or denied.

Connect to Databases Using DBeaver

The Edge Database feature is architected with AWS Aurora (PostgreSQL compatible) database instances.
Endpoint connection strings for both the internal and external (Edge) databases will be provided to Researchers by the Enablement Team.

Complete the following pre-requisite steps to establish JDBC connection to either Aurora database endpoint using DBeaver.

Configure PostgreSQL Driver

  1. Open DBeaver.

  2. In the DBeaver menu, open Database → Driver Manager

  3. In the Driver Manager dialog box, select PostgreSQL, then click Edit…

  4. In Edit Driver ‘PostgreSQL' dialog box, click the Libraries tab

  5. Select and Delete the following
    org.postgresql:postgresql:RELEASE
    net.postgis:postgis-jdbc:RELEASE
    net.postgis:postgis-geometry:RELEASE

  6. Click Add Folder

  7. In the Open driver directory dialog box, navigate to: C:\Users\Public\JDBC Drivers\PostgreSQL and click Select Folder.

  8. Click Find Class.

  9. From the Driver Class dropdown, select org.postgresql.Driver

  10. Click OK.

  11. Click Close

Create a New Database Connection to the Edge Database

  1. In the DBeaver toolbar, click the New Database Connection button

  2. In the Connect to a database/Select your database dialog box, search for and select PostgreSQL

  3. Click Next

  4. In the Connect to a database/Connection Settings dialog box, on the Main tab, enter the Host (endpoint), Port, Database, Username and Password that were provided to you by the Enablement Team

  5. Click Test Connection

  6. You should see a new Connection test dialog box giving information about the server. Similar to the following:

  7. If you get an error, re-check the information you entered on step 4.

  8. f you continue to get an error, please contact the Enablement Team via the Service Desk.

Connecting to Tableau for Visualizations

This guide assumes you have Tableau Desktop installed on a DOT system with access to the DOT Common Operating Environment. Note: this will not work with Tableau Reader.

  1. After launching Tableau Desktop, you can connect to a server, select more, and select ‘PostgreSQL’ (The same options will show if you select ‘Connect to Data’ from the workbook view:

  2. After selecting PostgreSQL, you will be required to fill in the connection details. Similar to the DBeaver connection above, you will need to provide the server , port, database name, username, and password. Note: the server is provided without any slashes ( / ) or colons ( : ).

    1. server: edgedb-externaldb.cluster-cuwig46oq690.us-east-1.rds.amazonaws.com

    2. port: 5432

    3. database, username, and password are provided to you by the sdc support team

  3. After clicking on ‘Sign In’, Tableau will attempt to connect to the database and you will be presented with the option to select tables for use in your visualizations