# Release Notes: Data Lake Pipeline

The Data Lake is both a system and a repository for processing and storing data from Data Providers. Data is stored in its natural or raw format and may include structured, semi-structured, unstructured, or binary data. Research Teams use this data to support their research goals.
## What's New

Check out the newest features and enhancements we've just released.

### v1.3.0 (2022-07-13)

- Split the single Glue job into three separate jobs
- Updated the `allSubDivs` ingest logic to match the stored procedure on RSIS
## What's Next

Here are the features and enhancements we're working on... or at least we're thinking about working on them.

- A single list of all available documentation: wiki pages, Word documents, PDFs, etc.
## What's Not New

### v1.2.2 (2022-03-29)

- Bugfix: removed `transformation_ctx` from the Glue job; the railroad and failureformdata tables had failed to load
### v1.2.1 (2022-02-04)

AWS Glue is now used to ingest RSIS files for FRA-ARDS. Created the following resources:

- JDBC connection to the RSIS server
- Glue crawler to crawl the RSIS database and build the RSIS data catalog
- Glue RSIS database/catalog
- Glue job for ingesting from RSIS
- Schedule-based trigger for the job
- IAM role for running Glue jobs and crawlers
### v1.2.0 (2022-01-05)

#### Features

- `tar` and `tar.gz` files are now decompressed and extracted during the copy to the standardized-data bucket
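The extraction step can be sketched with the standard-library `tarfile` module. This is a minimal in-memory illustration, not the pipeline's actual Lambda code, which streams objects between S3 buckets; the helper name `extract_tar_members` is hypothetical.

```python
import io
import tarfile


def extract_tar_members(data: bytes, compressed: bool = False):
    """Yield (member_name, file_bytes) pairs from a tar or tar.gz archive.

    Illustrative sketch only: the real job copies extracted members
    into the standardized-data bucket rather than returning bytes.
    """
    mode = "r:gz" if compressed else "r:"
    with tarfile.open(fileobj=io.BytesIO(data), mode=mode) as archive:
        for member in archive.getmembers():
            # Skip directories, symlinks, and other non-file entries.
            if member.isfile():
                extracted = archive.extractfile(member)
                if extracted is not None:
                    yield member.name, extracted.read()
```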
#### Refactor

- `raw-data_to_standardized-data` Lambda code is easier to follow
- Improved testing scripts/data
### v1.1.0 (2021-11-24)

#### Refactor

- Terraform now uses modules
- Improved the `build_lambda_deployment_packages.py` script
- Added the OSS4ITS Kinesis Data Firehose
### v1.0.3 (2021-06-30)

#### Bug Fixes

- Fixed a bug that caused gzip extraction to produce empty files
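The release notes don't state the root cause, but a correct gzip step must fully drain the decompressed stream before writing anything out. A minimal sketch with the standard-library `gzip` module (the function name `gunzip_bytes` is our own, for illustration):

```python
import gzip
import io


def gunzip_bytes(data: bytes) -> bytes:
    """Decompress a full gzip payload in memory.

    Reading through gzip.GzipFile until EOF guarantees the output
    contains the entire decompressed stream, instead of an empty
    buffer that was never filled.
    """
    with gzip.GzipFile(fileobj=io.BytesIO(data), mode="rb") as stream:
        return stream.read()
```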
### v1.0.2 (2021-05-27)

#### Features

- You can now be notified when any new or updated documentation (markdown or non-markdown) is added to the data lake
- You can now read a wiki page listing all non-markdown documentation in S3
### v1.0.1 (2021-05-20)

#### Features

- You can now be notified when new or updated documentation is added to the data lake
- The Support Team is now notified when objects in the drop-zone buckets are more than 24 hours old
- Developers can now deploy the Data Lake Pipeline by running a single script, `deploy.py`
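The staleness check behind the drop-zone notification can be sketched as a simple age comparison against the `LastModified` timestamp S3 reports for each object. The helper below is hypothetical, shown only to illustrate the 24-hour rule:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Drop-zone objects older than this trigger a Support Team notification.
STALE_AFTER = timedelta(hours=24)


def is_stale(last_modified: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when a drop-zone object is more than 24 hours old.

    `last_modified` mirrors the timezone-aware LastModified value
    S3 returns; `now` is injectable for testing.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_modified > STALE_AFTER
```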
#### Component Upgrades

- Terraform: 0.12 → 0.15
- AWS Provider: 2.0 → 3.39
### v1.0.0 (2021-04-28)

Greenfield Data Lake Pipeline:

- Data Providers can push objects to the drop-zone buckets, or the SDC can pull them in
- Objects are moved from drop-zones to raw-data
  - A date prefix is added (e.g. `/2021/04/28/data.txt`)
  - Markdown documentation is published to the Data Provider's GitLab wiki
- Objects are copied from raw-data to standardized-data
  - Compressed objects are uncompressed (zip files are unzipped)
- Research Teams access objects in the standardized-data bucket
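The date-prefixing step above can be sketched as a small key-building helper. The function name `raw_data_key` is hypothetical; it only illustrates the `/YYYY/MM/DD/object` layout shown in the example path:

```python
from datetime import date


def raw_data_key(object_name: str, ingest_date: date) -> str:
    """Build a date-prefixed raw-data key, e.g. /2021/04/28/data.txt.

    Illustrative only: the real pipeline derives the prefix from the
    ingest date when moving objects out of the drop-zone buckets.
    """
    return f"/{ingest_date:%Y/%m/%d}/{object_name}"
```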