
Data Lake Pipeline

The Data Lake is both a system and a repository for processing and storing data from Data Providers. Data is stored in its natural or raw format and may include structured, semi-structured, unstructured, or binary data. Research Teams use this data to support their research goals.

What's New

Check out the newest features and enhancements we've just released.

v1.1.0 (2021-11-24)

  • Refactor

    • Terraform now uses modules

    • Improved the build_lambda_deployment_packages.py script

  • Add OSS4ITS Kinesis Data Firehose (see the sketch below)

  • Bug fixes
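
The Firehose integration can be exercised from Python much like the pipeline's other scripts. The sketch below is illustrative only: the delivery stream name "oss4its-delivery-stream" is a placeholder, and the real name and configuration live in the Terraform modules.

```python
"""Hedged sketch: push a JSON record to a Kinesis Data Firehose delivery
stream with boto3. The stream name below is a placeholder, not the actual
OSS4ITS resource name."""
import json

import boto3

firehose = boto3.client("firehose")


def send_record(payload: dict, stream_name: str = "oss4its-delivery-stream") -> str:
    """Serialize the payload and put it on the delivery stream."""
    response = firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
    )
    return response["RecordId"]
```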

What's Next

Here are the features and enhancements we're working on... or at least we're thinking about working on them. 🤓

  • We want you to have a list of all available documentation: wiki pages, Word documents, PDFs, etc.

What's Not New

v1.0.3 (2021-06-30)

  • Bug fixes

    • Solved a bug that caused gzip extraction to produce empty files (see the sketch below)
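
For context on the fix, the snippet below sketches the kind of guard that prevents empty extraction output: read the compressed object, decompress it in memory, and refuse to write a zero-byte result. Bucket and key names are illustrative, not the pipeline's actual configuration.

```python
"""Hedged sketch of gzip extraction with an empty-output guard."""
import gzip

import boto3

s3 = boto3.client("s3")


def decompress_object(bucket: str, key: str, dest_key: str) -> None:
    """Decompress a gzipped S3 object and store the result under dest_key."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    data = gzip.decompress(body)
    if not data:
        raise ValueError(f"Decompressed output for {key} is empty")
    s3.put_object(Bucket=bucket, Key=dest_key, Body=data)
```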

v1.0.2 (2021-05-27)

  • Features

    • You can now be notified when any new or updated documentation is added to the data lake (markdown and non-markdown)

    • You can now read a wiki page listing all non-markdown documentation in S3 (see the sketch below)
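
The wiki listing is generated from the objects stored in S3. As a rough illustration (the bucket name and "documentation/" prefix are assumptions, not the pipeline's real layout), collecting the non-markdown documentation keys could look like this:

```python
"""Hedged sketch: list non-markdown documentation objects in an S3 bucket."""
import boto3

s3 = boto3.client("s3")


def list_non_markdown_docs(bucket: str = "standardized-data",
                           prefix: str = "documentation/") -> list:
    """Return the keys of all documentation objects that are not Markdown."""
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if not obj["Key"].lower().endswith(".md"):
                keys.append(obj["Key"])
    return keys
```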

v1.0.1 (2021-05-20)

  • Features

    • You can now be notified when new or updated documentation is added to the data lake

    • The Support Team will now be notified when objects in the drop-zone buckets are more than 24 hours old (see the sketch after this list)

    • Developers can now deploy the Data Lake Pipeline by running a single script, deploy.py

  • Component Upgrades

    • Terraform: 0.12 ↗ 0.15

    • AWS Provider: 2.0 ↗ 3.39
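
The 24-hour drop-zone check can be pictured as a scheduled job that lists each drop-zone bucket and reports anything that has sat there too long. The sketch below is an assumption of how such a check might look; the bucket name and SNS topic ARN are placeholders, and the real notification mechanism may differ.

```python
"""Hedged sketch: report drop-zone objects older than 24 hours via SNS."""
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")


def report_stale_objects(bucket: str, topic_arn: str) -> None:
    """Publish a message listing every object older than 24 hours."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=24)
    stale = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                stale.append(obj["Key"])
    if stale:
        sns.publish(
            TopicArn=topic_arn,
            Subject=f"Stale objects in {bucket}",
            Message="\n".join(stale),
        )
```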

v1.0.0 (2021-04-28)

  • Greenfield Data Lake Pipeline

  • Data Providers can push objects to drop-zone buckets, or the SDC can pull them in

  • Objects are moved from drop-zones to raw-data

    • A date prefix is added, e.g. /2021/04/28/data.txt (see the sketch after this list)

  • Markdown documentation is published to the Data Provider's GitLab wiki

  • Objects are copied from raw-data to standardized-data

    • Compressed objects are uncompressed (zip files are unzipped)

  • Research Teams access objects in the standardized-data bucket
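
To make the drop-zone → raw-data step concrete, the sketch below shows one way the date prefix could be applied when an object is moved. Bucket names are illustrative, and the actual Lambda code in the pipeline may structure this differently.

```python
"""Hedged sketch: move an object from a drop-zone bucket to raw-data,
prepending a YYYY/MM/DD prefix (e.g. 2021/04/28/data.txt)."""
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def move_to_raw_data(drop_zone_bucket: str, key: str,
                     raw_data_bucket: str = "raw-data") -> str:
    """Copy the object under a dated key, then delete the drop-zone original."""
    dated_key = datetime.now(timezone.utc).strftime("%Y/%m/%d/") + key
    s3.copy_object(
        Bucket=raw_data_bucket,
        Key=dated_key,
        CopySource={"Bucket": drop_zone_bucket, "Key": key},
    )
    s3.delete_object(Bucket=drop_zone_bucket, Key=key)
    return dated_key
```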
