DP Guide: Chapter 4, Data Ingestion
Data Ingestion
This chapter explains how to upload data to the SDC platform. Three approaches are currently supported: the AWS CLI, the SDC web portal, and SFTP (AWS Transfer Family). Note: the SDC can also connect to external data sources or APIs and pull data into the platform. Please contact sdc-support@dot.gov to discuss alternative methods for getting your data to the SDC.
Using AWS CLI to Upload Data to SDC
Prerequisites
Install AWS Command Line interface tools: https://aws.amazon.com/cli/
Please ensure you have the following folder and files under your home directory (these should be present after installing the CLI):
.aws
|--config
\--credentials
Please make sure Python and the jq Python library are installed on your machine. (Note: the scripts run successfully on Python 3.6 and above.)
Download the Python script attached to the instruction email from sdc-support@dot.gov, which also contains a set of instructions.
The script contains the following values, which are specific to each data provider:
API_KEY = REAN CLOUDREAN CLOUD
API_END_POINT = REAN CLOUDREAN CLOUD.
Run the downloaded script. It generates temporary access and secret keys and updates your ~/.aws/credentials file by creating a new profile named sdc-token. If a profile named sdc-token already exists, the script overwrites its credentials.
Use the sdc-token profile in the aws commands or Python code with which you upload data to the SDC.
You need to run the script before you start uploading data to the SDC.
Note: credentials expire after 1 hour. You will need to develop a process for re-running the token script before your uploads cross the 1-hour threshold.
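One way to approach such a refresh process is sketched below in Python. It is a minimal sketch under stated assumptions, not part of the SDC tooling: the helper names, the 5-minute safety margin, and the idea of recording the time at which you last ran the token script are all hypothetical. It checks that the sdc-token profile exists in an AWS credentials file and decides whether the token is close enough to the 1-hour limit that the script should be run again.

```python
import configparser
import time

TOKEN_LIFETIME_SECONDS = 60 * 60   # from the note above: credentials expire after 1 hour
SAFETY_MARGIN_SECONDS = 5 * 60     # assumption: refresh 5 minutes before expiry


def has_profile(credentials_path, profile="sdc-token"):
    """Return True if the profile section exists in an AWS credentials file."""
    # interpolation=None so that '%' characters in key values are left alone
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(credentials_path)  # a missing file simply yields no sections
    return parser.has_section(profile)


def needs_refresh(issued_at, now=None):
    """Return True once the token is within the safety margin of expiring.

    issued_at is the Unix time at which the token script was last run
    (hypothetical bookkeeping that your own process would have to do).
    """
    if now is None:
        now = time.time()
    return (now - issued_at) >= (TOKEN_LIFETIME_SECONDS - SAFETY_MARGIN_SECONDS)
```

A long-running upload job could call needs_refresh before each file and re-run the token script whenever it returns True.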
Uploading to Your Drop Zone Bucket in S3
Run the following from a terminal window to copy a file from your local machine to the SDC:
aws --profile sdc-token s3 cp <local file path> s3://<drop zone bucket>/<file name>
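If you prefer to drive the same command from Python, the sketch below wraps it with the standard subprocess module. It assumes the aws executable is on your PATH and that the sdc-token profile has already been created; the function names and the example bucket and file names are hypothetical.

```python
import subprocess


def build_s3_cp_command(local_path, bucket, key, profile="sdc-token"):
    """Build the argument list for the 'aws s3 cp' command shown above."""
    return ["aws", "--profile", profile, "s3", "cp",
            local_path, f"s3://{bucket}/{key}"]


def upload_to_drop_zone(local_path, bucket, key):
    """Run the copy and raise CalledProcessError if the CLI reports a failure."""
    subprocess.run(build_s3_cp_command(local_path, bucket, key), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.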
Uploading to Kinesis Firehose
Run the following from a terminal window to upload a message to the SDC via Kinesis Firehose:
aws --profile sdc-token firehose put-record --delivery-stream-name <name of delivery stream> --record="{\"Data\": \"test\"}"
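Hand-escaping the quotes in the --record argument is error-prone. The Python sketch below builds the same command with json.dumps so that the escaping is handled for you. It assumes the aws executable is on your PATH; the function names and the example stream name are hypothetical. Depending on your AWS CLI version and its cli_binary_format setting, the Data value may be expected in base64, which this sketch does not handle.

```python
import json
import subprocess


def build_put_record_command(stream_name, payload, profile="sdc-token"):
    """Build the argument list for the 'aws firehose put-record' command above."""
    record = json.dumps({"Data": payload})  # json.dumps escapes embedded quotes
    return ["aws", "--profile", profile, "firehose", "put-record",
            "--delivery-stream-name", stream_name,
            "--record", record]


def put_record(stream_name, payload):
    """Send one record and raise CalledProcessError if the CLI reports a failure."""
    subprocess.run(build_put_record_command(stream_name, payload), check=True)
```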
Uploading Data via the SDC Web Portal
Navigate to the SDC web portal using https://portal.sdc.dot.gov/register
Log in with your SDC credentials
Navigate to the Datasets tab and click the “Upload Files” button
Click “Okay” at prompt.
From the dropdown menu, choose the location to which your file will be uploaded.
Once the selection is made, the “Upload” button is highlighted and you can start uploading files to the selected location.
Note: Each upload must be smaller than 5 GB (use SFTP to upload larger files).
Please note that you can upload to only one location at a time.
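The size rule can be checked before you start an upload. The Python sketch below is only an illustration: the function names are hypothetical, and it assumes “5 GB” means 5 × 1000³ bytes, the more conservative reading, since the guide does not say whether decimal or binary gigabytes are meant.

```python
import os

# Assumption: "5 GB" is read as decimal gigabytes (the stricter interpretation).
PORTAL_LIMIT_BYTES = 5 * 1000**3


def choose_upload_method(size_bytes):
    """Return 'web portal' for files under the limit, otherwise 'sftp'."""
    return "web portal" if size_bytes < PORTAL_LIMIT_BYTES else "sftp"


def choose_upload_method_for_file(path):
    """Look up the file size on disk and apply the same rule."""
    return choose_upload_method(os.path.getsize(path))
```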
Upload Data to SDC via SFTP (AWS Transfer Family)
The steps for uploading files to our SFTP server are as follows:
Create your SSH keys via PuTTY Key Generator
Convert your public key to an AWS-compatible format via the CMD prompt
Email us your public key and wait for the confirmation email that your key has been added
Use your private key and SDC username to connect to the SFTP Server for upload via:
GUI SFTP program (FileZilla)
CMD Prompt
Programmatically (Python)
Instructions
Create SSH Keys via PuTTY Generator
Open PuTTY Generator, select RSA in the parameter menu, and then hit Generate. You will then be prompted to move your mouse around the box to create a random key.
Once your keys are generated, save the public and private keys to your computer using the “Save public key” and “Save private key” buttons. Name your public key “public-key.pub”.
Once you have saved your keys, navigate to their location via the CMD prompt (Guide: CMD: 11 basic commands you should know (cd, dir, mkdir, etc.))
Once in the same directory as your public and private keys, run this command to convert your public key to an AWS readable format.
ssh-keygen -i -f public-key.pub > PEM-key.pub
Open PEM-key.pub in Notepad and copy its contents.
Send the copied public key to sdc-support@dot.gov
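For readers curious about what the conversion step does: PuTTYgen's “Save public key” writes the key in the multi-line RFC 4716 (SSH2) layout, and ssh-keygen -i rewrites it as a single OpenSSH-style line. The Python sketch below illustrates that transformation. It is an illustration only, not a replacement for ssh-keygen, and the function name is hypothetical.

```python
import base64
import struct


def rfc4716_to_openssh(text):
    """Convert an RFC 4716 (SSH2) public key to a one-line OpenSSH public key."""
    body = []
    in_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("----"):
            continue                      # skip blank lines and BEGIN/END markers
        if in_header or ":" in line:
            # Header lines look like 'Comment: "..."'; a trailing backslash
            # means the header continues on the next line.
            in_header = line.endswith("\\")
            continue
        body.append(line)                 # remaining lines are base64 key data
    b64 = "".join(body)
    blob = base64.b64decode(b64)
    # The key blob starts with a length-prefixed key type such as "ssh-rsa".
    (name_len,) = struct.unpack(">I", blob[:4])
    key_type = blob[4:4 + name_len].decode("ascii")
    return f"{key_type} {b64}"
```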
Connecting to the Server
Once you have received your confirmation email that your public key has been added you can connect to the server in 3 ways: FileZilla/CMD Prompt/Python
FileZilla
Download FileZilla https://filezilla-project.org/download.php?platform=win64
Open FileZilla and navigate to File → Site Manager
Fill out the following fields:
Protocol: SFTP - SSH File Transfer Protocol
Host: sftp.sdc.dot.gov
Logon Type: Key file
User: Your SDC Username
Key file: The location of your private key
Press Connect
Drag and drop files into the highlighted box to upload to the SDC Data Lake
CMD Prompt
Open CMD Prompt
Enter this command replacing the “path_to_private_key” with your private key path and “your_username” with your SDC username and hit “Enter”
You will then be greeted with this screen
You can now upload files with the “put” command followed by the file path of the file
To leave the psftp menu use the “exit” command
Python
Open PuTTY Key Generator
Navigate to File → Load private key, and select your private key in the file browser popup
Navigate to Conversions → Export OpenSSH key (force new file format) → Yes
Save your new key with the “.pem” file type
Now that you have your .pem key, you can use it as your private key in a Python program that uses the paramiko package.
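A minimal sketch of such a program follows. It is not the original SDC sample: the function names, the default remote directory, the use of port 22, and the choice to accept the server's host key automatically are all assumptions, and paramiko must be installed separately (for example with pip). The host name is the one listed in the FileZilla settings above.

```python
import posixpath
from pathlib import Path

SFTP_HOST = "sftp.sdc.dot.gov"  # host from the FileZilla settings above
SFTP_PORT = 22                  # assumption: standard SFTP port


def remote_path_for(local_path, remote_dir="."):
    """Build the remote file path; SFTP put needs a file name, not a directory."""
    return posixpath.join(remote_dir, Path(local_path).name)


def upload_file(local_path, username, key_path, remote_dir="."):
    """Upload one file using the .pem private key exported from PuTTYgen."""
    # Imported here so the path helper above can be used without paramiko installed.
    import paramiko

    key = paramiko.RSAKey.from_private_key_file(key_path)
    client = paramiko.SSHClient()
    # Assumption: trust the server's host key on first connection.
    # For stricter checking, load known host keys instead.
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(hostname=SFTP_HOST, port=SFTP_PORT,
                       username=username, pkey=key)
        sftp = client.open_sftp()
        try:
            sftp.put(local_path, remote_path_for(local_path, remote_dir))
        finally:
            sftp.close()
    finally:
        client.close()
```

A call would look like upload_file("data.csv", "your_username", "private-key.pem"), where the file name, username, and key path are placeholders for your own values.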
Video Link- AWS Transfer Family
Simplified Transfer of Files in the SDC