DP Guide: Chapter 4, Data Ingestion

 

Data Ingestion

This chapter provides guidance on how to upload data to the SDC platform. The platform currently supports uploads through the AWS CLI, the SDC web portal, and SFTP. Note: the SDC can also connect to external data sources or APIs and pull data into the platform. Please contact sdc-support@dot.gov to discuss alternative methods for getting your data to the SDC.

Using AWS CLI to Upload Data to SDC

Prerequisites

  • Install the AWS Command Line Interface tools: https://aws.amazon.com/cli/

  • Please ensure you have the following folder and files under your home directory (these should be present after installing the CLI):

    .aws
    |--config
    \--credentials

  • Please make sure you have Python and the jq Python library installed on your machine. (Note: the scripts run successfully on Python 3.6 and above.)

  • Download the Python script attached to the instruction email from sdc-support@dot.gov, which also contains a set of instructions.

  • The script has the below set of values specific to each data provider:

    • API_KEY = <your API key>

    • API_END_POINT = <your API endpoint>

  • Run the downloaded script, which generates temporary access and secret keys and
    updates the ~/.aws/credentials file by creating a new profile named sdc-token. If you already
    have a profile named sdc-token, the script overwrites its credentials.

  • Use the sdc-token profile in the AWS commands or Python code that you use to
    upload data to the SDC.

  • You will need to run the above script before you start uploading data to the SDC.

  • Note: credentials expire after 1 hour. You will need to develop a process for rerunning the token script before an upload session crosses the 1 hour threshold.
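
One possible way to decide when to rerun the token script is to record when the credentials were issued and compare that against the 1 hour lifetime. The sketch below is only an illustration and is not part of the SDC script: the helper names, the five-minute safety margin, and the idea of tracking the issue time yourself are assumptions. Only the sdc-token profile name, the ~/.aws/credentials path, and the 1 hour lifetime come from this guide.

```python
import configparser
import os
import time

PROFILE = "sdc-token"            # profile name written by the SDC token script
LIFETIME_SECONDS = 60 * 60       # credentials expire after 1 hour
SAFETY_MARGIN_SECONDS = 5 * 60   # assumption: refresh 5 minutes before expiry


def profile_exists(credentials_path=None, profile=PROFILE):
    """Return True if the named profile is present in the credentials file."""
    path = credentials_path or os.path.expanduser("~/.aws/credentials")
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file simply yields no sections
    return parser.has_section(profile)


def needs_refresh(issued_at, now=None):
    """Return True if credentials issued at `issued_at` (epoch seconds) are near expiry."""
    now = time.time() if now is None else now
    return (now - issued_at) >= (LIFETIME_SECONDS - SAFETY_MARGIN_SECONDS)
```

A long-running upload job could call needs_refresh() before each file and rerun the token script whenever it returns True or profile_exists() returns False.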

Uploading to Your Drop Zone Bucket in S3

Run the following from a terminal window to copy a file from your local machine to the SDC:

aws --profile sdc-token s3 cp <local file path> s3://<drop zone bucket>/<file name>
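
The same copy can be done from Python code. The following is a minimal sketch, assuming the third-party boto3 package is installed and the sdc-token profile has already been refreshed. The function name upload_to_drop_zone and the optional s3_client parameter (which lets you pass in your own client) are hypothetical, and the bucket name must be the drop zone bucket assigned to you.

```python
import os


def upload_to_drop_zone(local_path, bucket, key=None, s3_client=None):
    """Upload one local file to the drop zone bucket and return the object key used."""
    if key is None:
        # Mirror the CLI example: name the object after the local file.
        key = os.path.basename(local_path)
    if s3_client is None:
        import boto3  # third-party package: pip install boto3
        # Use the temporary credentials stored under the sdc-token profile.
        session = boto3.Session(profile_name="sdc-token")
        s3_client = session.client("s3")
    s3_client.upload_file(local_path, bucket, key)
    return key
```

For example, upload_to_drop_zone("trips.csv", "<drop zone bucket>") would be roughly equivalent to the CLI command above.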

Uploading to Kinesis Firehose

Run the following from a terminal window to upload a message to the SDC via Kinesis Firehose:

aws --profile sdc-token firehose put-record --delivery-stream-name <name of delivery stream> --record="{\"Data\": \"test\"}"
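
A Python equivalent of the command above might look like the sketch below. It assumes the third-party boto3 package is installed, the sdc-token profile is current, and a region is configured for that profile. The function name put_firehose_record and the optional firehose_client parameter are hypothetical names used for illustration.

```python
def put_firehose_record(stream_name, message, firehose_client=None):
    """Send one text message to a Firehose delivery stream and return the service response."""
    if firehose_client is None:
        import boto3  # third-party package: pip install boto3
        # Use the temporary credentials stored under the sdc-token profile.
        session = boto3.Session(profile_name="sdc-token")
        firehose_client = session.client("firehose")
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": message.encode("utf-8")},  # Data is sent as bytes
    )
```

Calling put_firehose_record("<name of delivery stream>", "test") sends the same test payload as the CLI example.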

 

Uploading Data via the SDC Web Portal

  • Click “Okay” at the prompt.

  • Choose a location for your file to be uploaded from the dropdown menu.

  • Once the selection is made, the “Upload” button will be highlighted and you can start uploading files to your desired location.

    • Note: Uploads must be smaller than 5 GB in size (use SFTP to upload larger files).

    • Please note you can upload to only one location at a time.
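
If you script your uploads, the size limit above can be checked before choosing an upload route. The sketch below is an illustration only: the function names are hypothetical, and "5 GB" is read conservatively as 5 × 10^9 bytes, which is an assumption about how the portal measures size.

```python
import os

# Assumption: interpret the portal's 5 GB limit conservatively as decimal gigabytes.
PORTAL_LIMIT_BYTES = 5 * 10 ** 9


def choose_upload_method(size_bytes):
    """Return "web portal" for files under the limit, otherwise "sftp"."""
    return "web portal" if size_bytes < PORTAL_LIMIT_BYTES else "sftp"


def choose_upload_method_for_file(path):
    """Look up the file size on disk and pick an upload route for it."""
    return choose_upload_method(os.path.getsize(path))
```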

Upload Data to SDC via SFTP (AWS Transfer Family)

The steps for uploading files to our SFTP server are as follows:

  1. Create your SSH keys via PuTTY Key Generator

  2. Convert your public key to an AWS-compatible format via the CMD prompt

  3. Email us your public key and wait for the confirmation email that your key has been added

  4. Use your private key and SDC username to connect to the SFTP Server for upload via:

    1. GUI SFTP program (FileZilla)

    2. CMD Prompt

    3. Programmatically (Python)

Instructions

Create SSH Keys via PuTTY Generator

  1. Open PuTTY Generator, select RSA in the parameter menu, and then hit Generate. You will be prompted to move your mouse around the box to generate randomness for the key.

  2. Once your keys are generated, save the public and private keys to your computer using the “Save public key” and “Save private key” buttons. Name your public key “public-key.pub”.

  3. Once you have saved your keys, navigate to their location via the CMD prompt (Guide: https://www.digitalcitizen.life/command-prompt-how-use-basic-commands/)

  4. Once in the same directory as your public and private keys, run this command to convert your public key to an AWS-readable format:

    ssh-keygen -i -f public-key.pub > PEM-key.pub

  5. Open the PEM-key.pub in Notepad and copy it.

  6. Send the copied public key to sdc-support@dot.gov
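
To make the conversion step less opaque: PuTTYgen's “Save public key” button writes the multi-line RFC 4716 (SSH2) layout, and ssh-keygen -i rewrites it as a single OpenSSH line. The sketch below shows roughly what that conversion involves. It is an illustration under simplifying assumptions (one key per file, standard header syntax), not a replacement for ssh-keygen, and the function name is hypothetical.

```python
import base64
import struct


def rfc4716_to_openssh(text):
    """Convert an RFC 4716 (SSH2) public key block to a single OpenSSH line."""
    body = []
    in_key = False
    continued = False  # True while a header value continues after a trailing backslash
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("---- BEGIN SSH2 PUBLIC KEY"):
            in_key = True
            continue
        if line.startswith("---- END SSH2 PUBLIC KEY"):
            break
        if not in_key or not line:
            continue
        if continued or ":" in line:
            # Header line such as Comment: "..." (base64 data never contains a colon).
            continued = line.endswith("\\")
            continue
        body.append(line)
    if not body:
        raise ValueError("no key data found")
    blob_b64 = "".join(body)
    blob = base64.b64decode(blob_b64, validate=True)
    # The key blob begins with a 4-byte big-endian length followed by the key type name.
    (name_len,) = struct.unpack(">I", blob[:4])
    key_type = blob[4:4 + name_len].decode("ascii")
    return f"{key_type} {blob_b64}"
```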

Connecting to the Server

Once you have received the confirmation email that your public key has been added, you can connect to the server in three ways: FileZilla, CMD Prompt, or Python.

FileZilla

  1. Download FileZilla: https://filezilla-project.org/download.php?platform=win64

  2. Open FileZilla and navigate to File → Site Manager

    1. Fill out the following fields:

      1. Protocol: SFTP - SSH File Transfer Protocol

      2. Host: sftp.sdc.dot.gov

      3. Logon Type: Key file

      4. User: Your SDC Username

      5. Key file: The location of your private key

    2. Press Connect

  3. Drag and drop files into the highlighted box to upload to the SDC Data Lake

CMD Prompt

  1. Open CMD Prompt

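  2. Enter the psftp connection command, replacing “path_to_private_key” with your private key path and “your_username” with your SDC username, and hit “Enter”. The command typically takes this form:

    psftp -i path_to_private_key your_username@sftp.sdc.dot.gov

  3. You will then be greeted with the psftp prompt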

  4. You can now upload files with the “put” command followed by the file path of the file

  5. To leave the psftp menu use the “exit” command

Python

  1. Open PuTTY Key Generator

  2. Navigate to File → Load private key, and select your private key in the file browser popup

  3. Navigate to Conversions → Export OpenSSH key (force new file format) → Yes

  4. Save your new key with the “.pem” file type

  5. Now that you have your .pem key, you can use it as your private key in a Python program using the paramiko package.
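
A minimal sketch of such a program follows. It assumes the third-party paramiko package is installed and uses the host name from the FileZilla settings above. The function name upload_via_sftp, the optional ssh_client parameter, and the choice to accept the server's host key automatically on first connection are assumptions made for illustration, not the SDC team's reference implementation.

```python
import os


def upload_via_sftp(local_path, username, key_path, remote_path=None,
                    host="sftp.sdc.dot.gov", port=22, ssh_client=None):
    """Upload one file over SFTP with key-based authentication; return the remote path used."""
    if remote_path is None:
        # Default to the local file name, relative to the SFTP home directory.
        remote_path = os.path.basename(local_path)
    if ssh_client is None:
        import paramiko  # third-party package: pip install paramiko
        ssh_client = paramiko.SSHClient()
        ssh_client.load_system_host_keys()
        # Assumption: trust the server's host key on first connection.
        # Stricter setups should verify the host key out of band instead.
        ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh_client.connect(hostname=host, port=port, username=username,
                       key_filename=key_path)
    try:
        sftp = ssh_client.open_sftp()
        try:
            sftp.put(local_path, remote_path)
        finally:
            sftp.close()
    finally:
        ssh_client.close()
    return remote_path
```

For example, upload_via_sftp("trips.csv", "your_username", "path_to_private_key.pem") would upload trips.csv to your SFTP home directory.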