DP Guide: Data Documentation Guidelines

What’s New

  • May 14, 2021

    • You can now be notified when new or updated documentation is added to the data lake

  • April 28, 2021

    • Markdown documentation is published to the Data Provider's GitLab wiki

Instructions

This document summarizes the procedures for uploading metadata packages, data issue logs, and metadata packages to the SDC. Data providers should follow these guidelines when sharing these documents with the SDC platform.

Data providers should utilize their normal data upload steps to upload data documentation and metadata to the SDC. For specific documentation on how to upload data to the SDC as a data provider, see the SDC data provider user guide.

Data providers should upload documentation using a specific “prefix” folder in their normal ingest bucket. The name of this prefix should be “documentation.” As a reminder, data providers can use the following upload command to upload documentation to the “documentation” folder:

aws --profile sdc s3 cp <local files or folders to upload> s3://<Data Provider's Ingest Bucket>/documentation/
  • The term <local files or folders to upload> in the above command should indicate the files or folders that should be uploaded to the provider’s normal ingest bucket.

  • The term <Data Provider's Ingest Bucket> in the above command should be replaced with the data provider’s specific ingest bucket, for example prod-dot-sdc-cvp-nyc-ingest... .

Remember to use the credential generation script when uploading to your specific ingest bucket and the “sdc” profile to reference these temporary credentials. For other troubleshooting issues, please refer to the data provider user guide or contact the SDC Service Desk via the Secure Data Commons Help Center.

Documentation Template

Markdown is the preferred format for data documentation. Other file types will be supported, but only by upload and links in the GitLab data documentation wiki.

The following template should serve as a starting point and be adapted as needed. The only exception being the file level metadata. If file metadata is included, it must follow the metadata requirements.

This document can also be found on GitHub, https://github.com/USDOT-SDC/Public : https://github.com/USDOT-SDC/Public/blob/master/DOCUMENTATION-template.md


<template>

<Project Name> Data Documentation

  • Project Acronym: <PROJ-ACRNM>

1.0 Points of Contact (POCs)

Fill in the following, as appropriate and known

2.0 Dataset Overview:

  • Introduction or abstract

  • Time period covered by the data

  • Physical location (description, polygon or lat/lon/elev) of observations in the dataset

  • Data source if applicable

  • Any website references (i.e. additional documentation such as Project website)

3.0 Instrument/Collection Device Description:

  • Brief text (i.e. 1-2 paragraphs) describing the instrument(s) with references

  • Figures (or links), if applicable

  • Table of specifications (i.e. accuracy, precision, frequency, resolution, time zone, etc.)

4.0 Data Collection and Processing:

  • Description of data collection

  • Description of derived parameters and processing techniques used

  • Description of quality control procedures

  • Data inter-comparisons, if applicable

5.0 Data Files:

  • File names: Data file structure and file naming conventions (e.g. column delimited ASCII, NetCDF, GIF, JPEG, etc.)

  • Field names: Data format and layout (i.e. description of header/data records, sample records)

  • Data type (i.e. string, int, ISO DATE/TIME, etc.)

  • Description:

    • List of parameters with units, sampling intervals, frequency, range

    • Data version number and date

    • Description of flags, codes used in the data, and definitions (i.e. good, questionable, missing, estimated, etc.)

5.1 Data Elements

  • File name

    • Field name [datatype] Description

5.2 Data Format Example

  • agci_events_20210301.csv

    • event_datetime [ISO datetime] Date and time of event capture

    • event_id [int] Unique event identifier

    • event_type [string] Category of event

    • event_speed [decimal] Speed of vehicle in MPH

6.0 File Level Metadata

  • If possible, file level metadata should be included at the beginning of ASCII text files

  • If included, metadata will be enclosed in tags (<metadata></metadata>)

  • The contents of metadata will be in yaml format

6.1 Metadata Elements

If metadata is included, it should use the following elements.

  • Required

    • PROJECT [string]

    • SUBSET, NAME [string]

    • SUBSET, FORMAT VERSION [decimal]

    • COUNT [integer]

  • Optional

    • SUBSET, SOURCE [string]

    • COVERAGE, BEGIN [ISO DATE/TIME]

    • COVERAGE, END [ISO DATE/TIME]

    • REMARKS [string]

    • NOTIFY [email as string]

Please note that PROJECT and SUBSET, NAME values must be consistent throughout the SDC

See the next section for an example

6.3 Example

<metadata> PROJECT: FRA-ARDS SUBSET: NAME: AGCI FORMAT VERSION: 1.0 SOURCE: RSIS [schema].[table_name] COVERAGE: BEGIN: 2021-01-01T00:00:00.00-06:00 END: 2021-01-31T00:00:00.00-06:00 COUNT: 142 REMARKS: - Office of Safety/Knowledge Management Division - Observation DATE/TIME provided in local timezone - No data available for 2021-01-16 NOTIFY: - KRogers@stevens.edu - DCrocetti@dot.gov - SDavis@dot.gov </metadata> DATE/TIME LAT LONG ID TYPE TRESPASSER FATALITY Deg Deg NUMBER COUNT 2021-01-08T14:35-06:00 33.087 102.116 AI.20.1 Public True 0 2021-01-15T05:48-06:00 33.090 102.120 AI.20.2 Private False 0 2021-01-28T07:52-06:00 33.087 102.116 AI.20.3 Public True 1 2021-01-30T18:16-06:00 33.090 102.120 AI.20.4 Private False 0

7.0 Data Remarks:

  • PI's assessment of the data (i.e. disclaimers, instrument problems, quality issues, etc.)

  • Missing data periods

  • Software compatibility (i.e. list of existing software to view/manipulate the data)

8.0 References:

  • List of documents cited in this documentation

No data available for 2021-01-16

</template>