RT Guide: Chapter 1, Introduction and Document Overview
Introduction and Document Overview
The Secure Data Commons (SDC) is a United States Department of Transportation (U.S. DOT)
sponsored cloud-based analytical sandbox designed to create wider access to sensitive
transportation datasets, with the goal of advancing the state of the art of transportation research
and state/local traffic management.
The SDC stores sensitive transportation data made available by participating Data Providers, and
grants access to approved researchers to these datasets. The SDC also provides access to open
source tools and allows researchers to collaborate and share code with other system users.
The SDC platform is a research environment that allows users to conduct analyses and do
development and testing of new tools and software products. It is not intended to be an
alternative to any local jurisdiction’s traffic management center or local data repository. The
current SDC platform provides users with the following data, tools, and features:
Data: The SDC currently ingests several datasets, and additional datasets will be added
to the environment over time. Users can also bring their own data into the environment to use
along with the Waze data.
Tools: The environment provides access to open source tools including Python, RStudio,
Microsoft R, SQL Workbench, Power BI, LibreOffice, and Jupyter Notebook. These
tools are available on a virtual machine in the system, enabling data analytics in the cloud.
Functionality: Users can access and analyze data within the environment, save their
work to a virtual machine, and publish processes and results to share with other SDC
users.
The SDC platform supports two major roles:
Data Providers – These are entities that provide data hosted on the SDC platform. The
Data Provider establishes the data protection needs and acceptable use terms for the data
analysts.
Researchers – These are users that conduct research and analysis using the datasets hosted within
the SDC system. Note that researchers can bring their own data and tools into the SDC
system.
During a project’s onboarding phase, Data Providers work with the SDC support team to
describe their project’s data (for example, the type of data, the frequency of new data, and the
data file formats). Every project’s data is unique. Data Providers then upload data files to
designated “S3 Ingestion Buckets” (a secure, scalable object storage service provided by the
SDC platform through Amazon Web Services). We call this the “Data Lake.”
As new data arrives in the SDC, policies and procedures established by the Data Provider
govern which Researchers can access the data, when, and how. Once new data arrives in the
SDC, automated processes commonly “ingest” and “curate” it, making the data
available in other forms. For example, some data may be loaded into our Data Warehouse tools
(Redshift or Hadoop databases), whereas other data may be transformed from its source format
into other easier-to-use formats, or filtered through a process to identify corrupt, invalid, or
duplicate data. Exactly which automated processes a project’s data undergoes is determined
during the project onboarding process.
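As a rough illustration of the kind of curation step described above (filtering out invalid and duplicate records), the following is a minimal sketch in Python. It is not the SDC's actual pipeline: the `curate` function, the field names (`event_id`, `timestamp`, `speed_mph`), and the sample records are all hypothetical and chosen only for this example.

```python
import csv
import io


def curate(rows, required_fields):
    """Split raw records into clean rows and rejected rows (hypothetical helper)."""
    seen = set()
    clean, rejected = [], []
    for row in rows:
        # Treat a record as invalid if any required field is missing or blank.
        if any(not (row.get(field) or "").strip() for field in required_fields):
            rejected.append((row, "invalid"))
            continue
        # Treat a record as a duplicate if all required fields match an earlier one.
        key = tuple(row[field].strip() for field in required_fields)
        if key in seen:
            rejected.append((row, "duplicate"))
            continue
        seen.add(key)
        clean.append(row)
    return clean, rejected


# Made-up sample data standing in for a newly arrived file.
RAW_CSV = """event_id,timestamp,speed_mph
1,2020-01-01T00:00:00,55
2,2020-01-01T00:05:00,
1,2020-01-01T00:00:00,55
3,2020-01-01T00:10:00,61
"""

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
clean, rejected = curate(rows, ["event_id", "timestamp", "speed_mph"])
print(len(clean), "clean;", len(rejected), "rejected")
```

In this sketch the rejected records are kept alongside the reason for rejection rather than silently dropped, so that a Data Provider could review them. Which checks actually apply to a given project is decided during onboarding.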
Once data has been ingested and curated, it is then available to Researchers through the tools
listed above (note that we are always adding new Researcher tools based on request). A typical
Researcher workflow may be:
Use a tool to develop SQL queries to see a subset of the larger data set.
Compare the data subset against models to draw unique insights (for example, develop
programs utilizing the analytic capabilities of R or Python).
Use powerful tools that present data and insights in graphical format (some use the power
of Python and GeoPandas, others have developed graphical outputs in RStudio, whereas
others use LibreOffice as an open source alternative to Microsoft Excel). The SDC
support team has worked with yet other Researchers to install proprietary licensed
software to enhance their analytical capabilities.
Finally, there are capabilities by which Researchers can export their work out of the SDC,
subject to the data use agreements and approval of the Data Providers.
This document provides guidance for the Researcher role.
There is a similar guide for Data Providers: https://securedatacommons.atlassian.net/wiki/spaces/DESK/pages/1376780433
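To make the typical Researcher workflow described earlier more concrete, here is a minimal sketch of its first two steps: querying a subset with SQL, then analyzing that subset in Python. It makes several assumptions: an in-memory SQLite database stands in for the SDC Data Warehouse, and the `trips` table, its columns, and its values are invented for illustration and do not reflect any real SDC schema or dataset.

```python
import sqlite3
import statistics

# Assumption: in-memory SQLite stands in for the Data Warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (road TEXT, hour INTEGER, speed_mph REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("I-95", 8, 32.0),
        ("I-95", 8, 28.0),
        ("I-95", 14, 61.0),
        ("US-1", 8, 41.0),
    ],
)

# Step 1: use a SQL query to pull a subset of the larger data set.
subset = conn.execute(
    "SELECT speed_mph FROM trips WHERE road = ? AND hour = ? ORDER BY speed_mph",
    ("I-95", 8),
).fetchall()
speeds = [row[0] for row in subset]

# Step 2: analyze the subset in Python (here, a simple mean).
mean_speed = statistics.mean(speeds)
print(f"{len(speeds)} rows, mean speed {mean_speed:.1f} mph")

conn.close()
```

The query uses parameter placeholders (`?`) rather than string formatting, which keeps the filter values separate from the SQL text. In the SDC itself, the connection details and available tables depend on the project and the tools provisioned for it.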
This document is organized as follows:
Prerequisites
Workstation access will not be granted for a Researcher user until the user has:
Submitted a completed access request (https://securedatacommons.atlassian.net/wiki/spaces/DESK/pages/1349484563);
Received approval for the request;
Received an email message with onboarding instructions from the support team; and
Received a walkthrough of the system from the support team.
Refer to the Useful Links section later in the document for further information on technologies relevant to SDC.
https://securedatacommons.atlassian.net/wiki/spaces/DESK/pages/2226061322/RT+Guide+Chapter+6+Technical+Documentation+and+Contact+Information#Useful-Links