Data Lakes Explained
FAQs
A data lake is a centralized repository for storing vast amounts of data in its original, raw format. This means data can be ingested into a data lake without any preformatting.
Some key characteristics of data lakes include:
- Multitenancy (where a single software instance serves multiple customers).
- Schema-on-read data ingestion.
- Storage of unstructured data in its native state.
- A "store now, analyze later" focus.
The main data lake architecture layers include:
- Ingestion layer: Raw data is ingested into the data lake, either in real time or in batches.
- Distillation layer: Raw data from the ingestion layer is converted into analysis-ready formats, such as column-oriented files and tables (sketched in the example after this list).
- Processing layer: Analytical algorithms and advanced analytics tools are applied to the prepared data in this layer.
- Insights layer: Sometimes referred to as the research layer, this is where patterns and output from the processing layer are explored and analyzed to produce insights.
- Operations layer: The operations layer governs system management and monitoring.
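Below is a minimal sketch of the first two layers, assuming a local filesystem lake and the pyarrow library. The folder names, the "events" dataset, and the batch file name are made up for illustration: the ingestion step lands raw JSON lines untouched, and the distillation step rewrites them as a column-oriented Parquet file for downstream processing.

```python
import json
from pathlib import Path

import pyarrow as pa           # assumed dependency for Parquet output
import pyarrow.parquet as pq

RAW_ZONE = Path("lake/raw/events")          # ingestion layer lands data here
CURATED_ZONE = Path("lake/curated/events")  # distillation layer output

def ingest_batch(records: list[dict]) -> Path:
    """Ingestion layer: land a batch of raw records as JSON lines, untouched."""
    RAW_ZONE.mkdir(parents=True, exist_ok=True)
    target = RAW_ZONE / "batch_0001.jsonl"  # hypothetical batch name
    target.write_text("\n".join(json.dumps(r) for r in records))
    return target

def distill(raw_file: Path) -> Path:
    """Distillation layer: convert raw JSON lines into a column-oriented
    Parquet file that the processing layer can scan efficiently."""
    CURATED_ZONE.mkdir(parents=True, exist_ok=True)
    rows = [json.loads(line) for line in raw_file.read_text().splitlines()]
    table = pa.Table.from_pylist(rows)      # infer columns from the rows
    out = CURATED_ZONE / (raw_file.stem + ".parquet")
    pq.write_table(table, str(out))
    return out

raw = ingest_batch([{"user": "u1", "action": "click"},
                    {"user": "u2", "action": "view"}])
print(distill(raw))
```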
There are many differences between a data lake and a data warehouse. Two of the main ones are:
Schema flexibility: Data lakes use a schema-on-read approach: the schema is determined at read time by the data actually being ingested and the data already in the lake. Data warehouses use schema-on-write, meaning a predefined schema and hierarchy must be applied to all data before it is ingested (the two approaches are contrasted in the sketch below).
Data types stored: Data lakes can store structured, semi-structured, and unstructured data in their native states. Data warehouses can only store structured and semi-structured data, and that data must be processed into the predefined format and hierarchy before ingestion.
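The sketch below contrasts the two approaches. The REQUIRED_COLUMNS schema, the order records, and the in-memory "tables" are hypothetical; a warehouse-style write rejects or coerces anything that does not match the predefined schema, while a lake-style write accepts the record as-is and defers structure to whoever reads it.

```python
REQUIRED_COLUMNS = {"order_id": str, "amount": float}  # hypothetical warehouse schema

def warehouse_write(record: dict, table: list[dict]) -> None:
    """Schema-on-write: validate and coerce before storing; records that
    do not fit the predefined schema are rejected."""
    for col, typ in REQUIRED_COLUMNS.items():
        if col not in record:
            raise ValueError(f"missing required column: {col}")
        record[col] = typ(record[col])
    table.append({col: record[col] for col in REQUIRED_COLUMNS})

def lake_write(record: dict, store: list[dict]) -> None:
    """Schema-on-read: store the record as-is; structure is imposed later,
    by whichever query or job reads it."""
    store.append(record)

warehouse_table, lake_store = [], []
warehouse_write({"order_id": "o1", "amount": "19.99"}, warehouse_table)
lake_write({"order_id": "o1", "amount": "19.99", "coupon": "SPRING"}, lake_store)
# This nested record lands fine in the lake, but would need remodeling
# before it could be written to the warehouse table.
lake_write({"order_id": "o2", "items": [{"sku": "a", "qty": 2}]}, lake_store)
print(warehouse_table, lake_store)
```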