RT Guide: Chapter 1, Introduction and Document Overview

Introduction and Document Overview

The Secure Data Commons (SDC) is a United States Department of Transportation (U.S. DOT)
sponsored cloud-based analytical sandbox designed to create wider access to sensitive
transportation datasets, with the goal of advancing the state of the art of transportation research
and state/local traffic management.


The SDC stores sensitive transportation data made available by participating Data Providers and
grants approved researchers access to these datasets. The SDC also provides access to open
source tools and allows researchers to collaborate and share code with other system users.


The SDC platform is a research environment that allows users to conduct analyses and to
develop and test new tools and software products. It is not intended to be an
alternative to any local jurisdiction’s traffic management center or local data repository. The
current SDC platform provides users with the following data, tools, and features:

  • Data: The SDC currently ingests several datasets, and additional datasets will be added
    to the environment over time. Users can also bring their own data into the environment to
    use along with the Waze data.

  • Tools: The environment provides access to open source tools including Python, RStudio,
    Microsoft R, SQL Workbench, Power BI, LibreOffice, and Jupyter Notebook. These
    tools are available on a virtual machine in the system, enabling data analytics in the cloud.

  • Functionality: Users can access and analyze data within the environment, save their
    work to a virtual machine, and publish processes and results to share with other SDC
    users.

The SDC platform supports two major roles:

  • Data Providers – These are entities that provide data hosted on the SDC platform. The
    Data Provider establishes the data protection needs and acceptable use terms for the data
    analysts.

  • Researchers – These are users who conduct research and analysis using the datasets hosted within
    the SDC system. Note that researchers can bring their own data and tools into the SDC
    system.


During a project’s onboarding phase, Data Providers work with the SDC support team to
describe their project’s data (for example, the type of data, the frequency of new data, and the
data file formats), because every project’s data is unique. Data Providers then upload data files
to designated “S3 Ingestion Buckets” (S3 is a secure, scalable object storage service provided by
the SDC platform through Amazon Web Services). We call this storage the “Data Lake.”
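
As a rough illustration of the upload step, a Data Provider script might look like the sketch below. It assumes the third-party boto3 library is installed and AWS credentials are already configured. The bucket name, the "incoming" key prefix, and the helper names build_object_key and upload_to_ingestion_bucket are hypothetical placeholders, not actual SDC names; the real bucket and any naming conventions are agreed with the support team during onboarding.

```python
from pathlib import Path


def build_object_key(local_path, prefix="incoming"):
    """Return an object key such as 'incoming/trips.csv'.

    The 'incoming' prefix is a hypothetical layout, not an SDC convention.
    """
    return f"{prefix.strip('/')}/{Path(local_path).name}"


def upload_to_ingestion_bucket(local_path, bucket, s3_client=None, prefix="incoming"):
    """Upload one file to the given bucket and return the object key used."""
    if s3_client is None:
        # boto3 is the AWS SDK for Python; it is imported lazily so that
        # build_object_key can be used without it being installed.
        import boto3
        s3_client = boto3.client("s3")
    key = build_object_key(local_path, prefix)
    # upload_file(Filename, Bucket, Key) is the standard boto3 S3 client call.
    s3_client.upload_file(str(local_path), bucket, key)
    return key
```

A call such as upload_to_ingestion_bucket("trips.csv", "example-ingestion-bucket") would then place the file under incoming/trips.csv in that (hypothetical) bucket.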

As new data arrives in the SDC, policies and procedures established by the Data Provider
govern which Researchers can access the data, when, and how. It is common that once new data
arrives in the SDC, automated processes “ingest” and “curate” the data, making the data
available in other forms. For example, some data may be loaded into our Data Warehouse tools
(Redshift or Hadoop databases), whereas other data may be transformed from its source format
into other easier-to-use formats, or filtered through a process to identify corrupt, invalid, or
duplicate data. Exactly which automated processes a project’s data undergoes is determined
during the project onboarding process.
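
To make the idea of a curation filter concrete, the sketch below separates invalid and duplicate rows from clean rows in a small CSV extract. It is only an illustration under stated assumptions, not the SDC's actual curation process: the function name curate_rows, the column names id and speed, and the rule that a blank required field makes a row invalid are all hypothetical.

```python
import csv
import io


def curate_rows(csv_text, required_fields):
    """Split CSV rows into clean rows and rejected rows.

    A row is rejected as 'invalid' when a required field is missing or blank,
    and as 'duplicate' when an identical row has already been accepted.
    """
    clean, rejected = [], []
    seen = set()
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Fingerprint the row by its values in header order.
        fingerprint = tuple(row.get(name) for name in reader.fieldnames)
        if any(not (row.get(field) or "").strip() for field in required_fields):
            rejected.append(("invalid", row))
        elif fingerprint in seen:
            rejected.append(("duplicate", row))
        else:
            seen.add(fingerprint)
            clean.append(row)
    return clean, rejected


# Hypothetical sample: one repeated row and one row with a blank speed.
sample = "id,speed\n1,30\n1,30\n2,\n3,45\n"
clean, rejected = curate_rows(sample, ["id", "speed"])
```

Keeping the rejected rows together with a reason, rather than silently dropping them, makes it easier to review what a filter removed.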

Once data has been ingested and curated, it is available to Researchers through the tools
listed above (note that we are always adding new Researcher tools on request). A typical
Researcher workflow may be:

Prerequisites

Workstation access will not be granted for a Researcher user until the user has:

  1. Submitted a completed RT Form: Researcher Agreement and Access Request;

  2. Received approval for the request;

  3. Received an email message with onboarding instructions from the support team; and

  4. Received a walkthrough of the system from the support team.

Refer to the Useful Links section later in the document for further information on technologies
relevant to the SDC (RT Guide: Appendix B, Technical Documentation and Contact Information | Useful Links).