Release Notes: Data Lake Pipeline

Data Lake Pipeline

The Data Lake is both a system and a repository for processing and storing data from Data Providers. This data is stored in its natural or raw format and may include structured, semi-structured, unstructured, or binary data. Research Teams use this data to support their research goals.

What's New

Check out the newest features and enhancements we've just released.

v1.3.0 (2022-07-13)

  • Updated Glue jobs by splitting the single job into three

  • Updated the allSubDivs ingest logic to match the stored procedure on RSIS

What's Next

Here are the features and enhancements we're working on... or at least we're thinking about working on them. 

  • We want you to have a list of all available documentation: wiki pages, Word documents, PDFs, etc.

What's Not New

v1.2.2 (2022-03-29)

  • Bugfix: removed transformation_ctx from the Glue job; railroad and failureformdata were failing to load

v1.2.1 (2022-02-04)

  • AWS Glue is now used to ingest RSIS files for FRA-ARDS

  • Created the following resources

    • JDBC connection to the RSIS server

    • Glue crawler to crawl the RSIS database and create the RSIS data catalog

    • Glue RSIS database/catalog

    • Glue job for ingest from RSIS

    • Schedule-based trigger for the job

    • IAM role for running Glue jobs/crawlers

v1.2.0 (2022-01-05)

  • Features

    • tar and tar.gz files are now decompressed and extracted during copy to standardized data bucket

  • Refactor

    • raw-data_to_standardized-data Lambda code is easier to follow

    • improved testing scripts/data
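The tar and tar.gz handling introduced in v1.2.0 could look roughly like the sketch below. It is an illustration only, not the actual Lambda code: the function name extract_tar_members is hypothetical, and it assumes the archive is small enough to hold in memory.

```python
import io
import tarfile
from typing import Dict


def extract_tar_members(archive_bytes: bytes) -> Dict[str, bytes]:
    """Return {member name: content} for each regular file in a tar or tar.gz archive.

    Hypothetical helper for illustration; not the pipeline's own function.
    """
    extracted = {}
    # Mode "r:*" lets tarfile detect the compression (none, gzip, ...) itself.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and device entries
            extracted[member.name] = archive.extractfile(member).read()
    return extracted
```

Each returned entry could then be written to the standardized data bucket as its own object.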

v1.1.0 (2021-11-24)

  • Refactor

    • Terraform now uses modules

    • Improved script build_lambda_deployment_packages.py

  • Add OSS4ITS Kinesis Data Firehose

v1.0.3 (2021-06-30)

  • Bugfixes

    • Fixed a bug that caused gzip extraction to produce empty files.

v1.0.2 (2021-05-27)

  • Features

    • You can now be notified when any new or updated documentation is added to the data lake (markdown and non-markdown)

    • You can now read a wiki page listing all non-markdown documentation in S3

v1.0.1 (2021-05-20)

  • Features

    • You can now be notified when new or updated documentation is added to the data lake

    • The Support Team will now be notified when objects in the drop-zone buckets are more than 24 hours old

    • Developers can now deploy the Data Lake Pipeline by running a single script, deploy.py

  • Component Upgrades

    • Terraform: 0.12 → 0.15

    • AWS Provider: 2.0 → 3.39
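The 24-hour drop-zone check from v1.0.1 can be sketched as a simple age filter. This is a sketch under assumptions, not the deployed code: the function name stale_keys is hypothetical, and the input is assumed to be shaped like the "Contents" entries of an S3 object listing (a "Key" string and a timezone-aware "LastModified" datetime).

```python
from datetime import datetime, timedelta, timezone


def stale_keys(objects, now, max_age=timedelta(hours=24)):
    """Return the keys of objects whose LastModified is older than max_age.

    Hypothetical helper; `objects` is a list of dicts with "Key" and
    "LastModified" entries, as in an S3 listing response.
    """
    return [obj["Key"] for obj in objects if now - obj["LastModified"] > max_age]


# Example with made-up listing entries.
now = datetime(2021, 5, 20, 12, 0, tzinfo=timezone.utc)
listing = [
    {"Key": "old.csv", "LastModified": now - timedelta(hours=25)},
    {"Key": "new.csv", "LastModified": now - timedelta(hours=1)},
]
print(stale_keys(listing, now))
```

A non-empty result would be the trigger for notifying the Support Team.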

v1.0.0 (2021-04-28)

  • Greenfield Data Lake Pipeline

  • Data Providers can push objects to the drop-zone buckets, or the SDC can pull objects into them

  • Objects are moved from drop-zones to raw-data

    • Date prefix is added (/2021/04/28/data.txt)

  • Markdown documentation is published to the Data Provider's GitLab wiki

  • Objects are copied from raw-data to standardized-data

    • Compressed objects are decompressed (zip files are unzipped)

  • Research Teams access objects in the standardized-data bucket
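The date prefix added when objects move from a drop-zone to raw-data can be illustrated with a short sketch. The function name raw_data_key is hypothetical, and the sketch assumes the prefix is derived from the date the object is moved; the pipeline's actual rule may differ.

```python
from datetime import date


def raw_data_key(object_key: str, moved_on: date) -> str:
    """Prefix an object key with YYYY/MM/DD (hypothetical helper)."""
    # Strip any leading slash so the result has exactly one separator.
    return f"{moved_on:%Y/%m/%d}/{object_key.lstrip('/')}"


print(raw_data_key("data.txt", date(2021, 4, 28)))  # prints 2021/04/28/data.txt
```

The sketch omits the leading slash shown in the release note because object keys inside a bucket are normally written without one.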