Data Quality Checks and Processes

Validating Uploaded Data

For security reasons and to prevent tampering with uploaded data, data files ingested through S3 buckets are moved to a different location immediately after upload. Once data files have been successfully uploaded to the ingest bucket using the above step, they are moved to the Standardized data bucket, or “Data Lake.”

Background processes move the data from the Raw data bucket to the Standardized data bucket, placing it in a folder labelled with the date it was uploaded. As a data provider, you have a folder in the Data Lake that contains all of the data you upload.

drop-zone → raw-data → standardized-data

To verify an upload, run the following AWS CLI command, which lists the objects in the Standardized data bucket:

aws s3 ls s3://prod.sdc.dot.gov.data-lake.standardized-data/<data-provider> --profile sdc
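The same check can be scripted. The following is a minimal Python sketch, assuming the AWS CLI is installed and the `sdc` profile is configured. The function names `list_uploads` and `parse_s3_ls`, the sample listing, and the date-style folder name in it are illustrative assumptions and not part of the SDC tooling.

```python
import subprocess

# Bucket URI taken from the CLI command above.
BUCKET_URI = "s3://prod.sdc.dot.gov.data-lake.standardized-data/"


def list_uploads(data_provider, profile="sdc"):
    """Run the `aws s3 ls` command shown above and return its raw text output."""
    result = subprocess.run(
        ["aws", "s3", "ls", f"{BUCKET_URI}{data_provider}", "--profile", profile],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI reports a failure
    )
    return result.stdout


def parse_s3_ls(output):
    """Split `aws s3 ls` output into (prefixes, objects).

    Folder lines look like "PRE name/"; object lines look like
    "YYYY-MM-DD HH:MM:SS <size> <key>".
    """
    prefixes, objects = [], []
    for line in output.splitlines():
        parts = line.split(None, 3)
        if len(parts) >= 2 and parts[0] == "PRE":
            # Keep the full folder name, even if it contains spaces.
            prefixes.append(line.split(None, 1)[1])
        elif len(parts) == 4:
            date, time, size, key = parts
            objects.append(
                {"key": key, "size": int(size), "modified": f"{date} {time}"}
            )
    return prefixes, objects


# Made-up sample listing, used so the parser can be tried without AWS access.
SAMPLE_LISTING = """\
                           PRE 2019-05-01/
2019-05-01 12:30:45       2048 messages 01.json
"""

prefixes, objects = parse_s3_ls(SAMPLE_LISTING)
print(prefixes)
print([obj["key"] for obj in objects])
```

To check a real upload, pass the output of `list_uploads("<data-provider>")` to `parse_s3_ls` and look for the expected file name among the returned keys.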

Configuration Files

When uploaded data reaches the Standardized data bucket, it is checked by the validation Lambda function. This function confirms that every required field exists in each message and that the value in each field is reasonable (e.g., a variable speed limit between 0 and 254 MPH). The field and data checks are stored in a file called config.ini.

The path to each config file must be added manually to the Lambda function in AWS.
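To illustrate the kind of check described above, here is a minimal Python sketch of field and range validation driven by an INI file. The layout of config.ini shown here (one section per required field with `min` and `max` keys), the field names, and the function names are assumptions made for illustration; the actual config.ini format and Lambda code may differ.

```python
import configparser

# Hypothetical config.ini contents: one section per required field,
# each giving the allowed numeric range. The real file layout may differ.
EXAMPLE_CONFIG = """
[speed_limit_mph]
min = 0
max = 254

[latitude]
min = -90
max = 90
"""


def load_rules(config_text):
    """Parse INI text into {field_name: (low, high)}."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return {
        field: (parser.getfloat(field, "min"), parser.getfloat(field, "max"))
        for field in parser.sections()
    }


def validate_message(message, rules):
    """Return a list of problems; an empty list means the message is valid."""
    problems = []
    for field, (low, high) in rules.items():
        if field not in message:
            problems.append(f"missing field: {field}")
            continue
        value = message[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"non-numeric value for {field}: {value!r}")
        elif not low <= value <= high:
            problems.append(f"{field}={value} is outside {low} to {high}")
    return problems


rules = load_rules(EXAMPLE_CONFIG)

# A message with every field present and in range produces no problems.
print(validate_message({"speed_limit_mph": 65, "latitude": 38.9}, rules))

# A message with an out-of-range speed limit and no latitude produces two.
print(validate_message({"speed_limit_mph": 300}, rules))
```

Keeping the rules in a config file rather than in code means a range can be adjusted without redeploying the function, which fits the description of config.ini above.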

Validation Results and Alerts

When the validator runs against the data files submitted to the SDC, it compiles metrics from the results. These metrics include the number of valid messages, the number of invalid messages, and the total number of data files validated. The numbers are summed over the course of the day and used to identify specific data files that may not be validating properly. A daily report is sent to all data providers summarizing the validation of their data files for that day, along with alerts for any validation failures that occurred.
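The daily roll-up described above can be sketched as follows. This is a minimal Python example under stated assumptions: the per-file result tuples, the provider and file names, and the function names `summarize_day` and `format_report` are hypothetical, and the real report format is not specified on this page.

```python
from collections import defaultdict


def summarize_day(file_results):
    """Sum per-file validation counts into one summary per data provider.

    file_results: iterable of (provider, file_name, valid_count, invalid_count).
    """
    totals = defaultdict(
        lambda: {"valid": 0, "invalid": 0, "files": 0, "flagged": []}
    )
    for provider, file_name, valid, invalid in file_results:
        summary = totals[provider]
        summary["valid"] += valid
        summary["invalid"] += invalid
        summary["files"] += 1
        if invalid > 0:
            # Remember which files contained invalid messages.
            summary["flagged"].append(file_name)
    return dict(totals)


def format_report(provider, summary):
    """Render one provider's summary as plain text (illustrative format)."""
    lines = [
        f"Validation report for {provider}",
        f"Files validated: {summary['files']}",
        f"Valid messages: {summary['valid']}",
        f"Invalid messages: {summary['invalid']}",
    ]
    for name in summary["flagged"]:
        lines.append(f"ALERT: invalid messages found in {name}")
    return "\n".join(lines)


# Made-up results for one day.
results = [
    ("provider-a", "file1.json", 100, 0),
    ("provider-a", "file2.json", 90, 10),
    ("provider-b", "file3.json", 50, 0),
]

for provider, summary in summarize_day(results).items():
    print(format_report(provider, summary))
    print()
```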
