Throughout the day we make many decisions relying on previous experience. Our brains store trillions of bits of data about past events and leverage those memories each time we face the need to make a decision. Like people, companies generate and collect tons of data about the past. And this data can be used to make better decisions.
While our brains both process and store information, companies need multiple tools to work with data. And one of the most important ones is a data warehouse.
In this article, we will discuss what an enterprise data warehouse is, its types and functions, and how it’s used in data processing. We will define how enterprise warehouses are different from the usual ones, what types of data warehouses exist, and how they work. The focus is to provide information about the business value of each architectural and conceptual approach to building a warehouse.
What is an Enterprise Data Warehouse?
If you know how much a terabyte is, you’d probably be impressed by the fact that Netflix had about 44 terabytes of data in its warehouse back in 2016. The size alone hints at why we call it a warehouse, instead of just a database. So let’s begin with the basics.
An Enterprise Data Warehouse (EDW) is a form of corporate repository that stores and manages all the historical business data of an enterprise. The information usually comes from different systems like ERPs, CRMs, physical recordings, and other flat files. To prepare data for further analysis, it must be placed in a single storage facility. This way, different business units can query it and analyze information from multiple angles.
With a data warehouse, an enterprise can manage huge data sets without administering multiple databases. Such practice is a future-proof way of storing data for business intelligence (BI), a set of methods and technologies for transforming raw data into actionable insights. With the EDW at its core, such a system resembles a human brain storing information, but on steroids.
Enterprise data warehouse vs usual data warehouse: what’s the difference?
Any data warehouse is a database that is always connected with raw-data sources via data integration tools on one end and analytical interfaces on the other. If so, why do we isolate the enterprise form for discussion?
Any warehouse provides storage that has mechanisms to transform data, move it, and present it to the end user. The difference between a usual data warehouse and an enterprise one lies in the latter’s much wider architectural diversity and functionality. Because of their complex structure and size, EDWs are often decomposed into smaller databases, so that end users are more comfortable querying them. Considering this, we’re focusing on an enterprise warehouse to cover the whole spectrum of functionality.
However, the size of a warehouse doesn’t define its technical complexity, the requirements for analytical and reporting capabilities, number of data models, and the data itself. So, to understand what makes a warehouse a warehouse, let’s dive into its core concepts and functionality.
Enterprise Data Warehouse concepts and functions
With all the bells and whistles, at the heart of every warehouse lie basic concepts and functions. These pillars define a warehouse as a technological phenomenon:
Serves as the ultimate storage. An enterprise data warehouse is a unified repository for all the corporate business data the organization has ever produced.
Reflects the source data. An EDW sources data from its original storage spaces like Google Analytics, CRMs, IoT devices, etc. If the data is scattered across multiple systems, it’s unmanageable. So, the purpose of an EDW is to provide the likeness of the original source data in a single repository. As there is always new, relevant data generated both inside and outside the company, the flow of data requires a dedicated infrastructure to manage it before it enters a warehouse.
Stores structured data. The data stored in an EDW is always standardized and structured. This makes it possible for the end users to query it via BI interfaces and form reports. And this is what makes a data warehouse different from a data lake. Data lakes are used to store unstructured data for analytical purposes. But unlike warehouses, data lakes are used more by data engineers/scientists to work with big sets of raw data.
Subject-oriented data. The main focus of a warehouse is business data that can relate to different domains. To understand what the data relates to, it’s always structured around a specific subject called a data model. An example of a subject can be a sales region or total sales of a given item. Additionally, metadata is added to explain in detail where every piece of information comes from.
Time-dependent. The data collected is usually historical data, because it describes past events. To understand when and for how long a certain tendency took place, stored data is usually divided into time periods.
Nonvolatile. Once placed in a warehouse, the data is never deleted from it. The data can be manipulated, modified, or updated due to source changes, but it’s never meant to be erased, at least by the end users. As we speak about historical data, deletions are counterproductive for analytical purposes. Yet general revisions may occur once in a few years to get rid of irrelevant data.
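The nonvolatile principle can be sketched in a few lines. In this hypothetical example (table and column names are invented), a change to a customer record is appended as a new timestamped row instead of overwriting the old one, so history stays queryable:

```python
import sqlite3

# Illustrative sketch of the "nonvolatile" principle: instead of updating a
# record in place, each change is appended as a new timestamped row.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_history (
        customer_id INTEGER,
        city        TEXT,
        valid_from  TEXT   -- date the record became current
    )
""")

# The customer moves; both versions are kept.
conn.execute("INSERT INTO customer_history VALUES (1, 'Berlin', '2015-01-01')")
conn.execute("INSERT INTO customer_history VALUES (1, 'Munich', '2016-06-15')")

# The current state is simply the latest row per customer ...
current = conn.execute("""
    SELECT city FROM customer_history
    WHERE customer_id = 1
    ORDER BY valid_from DESC LIMIT 1
""").fetchone()[0]

# ... while the full history remains intact for time-based analysis.
history_rows = conn.execute(
    "SELECT COUNT(*) FROM customer_history WHERE customer_id = 1"
).fetchone()[0]
```

This append-only pattern is what makes questions like “where did this customer live in 2015?” answerable years later.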
Considering the base principles, we’ll look at the implementation types of DWs.
Data warehouse types
Considering EDW functions, there is always room for discussion on how to design one technically. Data storage and processing needs are specific to each kind of business. Depending on the amount of data, analytical complexity, security issues, and budget, of course, there is always an option on how to set up your system.
Classic data warehouse
Unified storage that has its dedicated hardware and software is considered a classic variant for an EDW. With physical storage, you don’t have to set up data integration tools between multiple databases. Instead, EDW can be connected with data sources via APIs to constantly source information and transform it in the process. So, all the work is done either in the staging area (the place where data is transformed before loading into the DW), or in the warehouse itself.
A classic data warehouse is considered superior to a virtual one (which we discuss below), because there is no additional layer of abstraction. It simplifies the work for data engineers and makes it easier to manage data flow on the preprocessing side, as well as actual reporting. The drawbacks of the classic warehouse depend on the actual implementation, but for most businesses these are:
- Expensive technological infrastructure, both hardware and software;
- Hiring a team of data engineers and DevOps specialists to set up and maintain the whole data platform.
When to use: appropriate for organizations of all sizes that want to process their data and make use of it. Classic warehouses allow for morphing into different architectural styles of the data platform, as well as scaling up and down on demand.
Virtual data warehouse
A virtual data warehouse is a type of EDW used as an alternative to a classic warehouse. Essentially, these are multiple databases connected virtually, so they can be queried as a single system.
Such an approach allows organizations to keep it simple: The data can stay in its sources, but can still be pulled with the help of analytical tools. Virtual warehouses can be used if you don’t want to mess with all the underlying infrastructure, or the data you have is easily manageable as it is. However, such an approach has many drawbacks:
- Multiple databases will require constant software and hardware maintenance and costs.
- The data stored in a virtual DW still requires a transformation software to make it digestible for the end users and reporting tools.
- Complex data queries may take too much time, as the required pieces of data may be placed in two separate databases.
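The core idea of a virtual warehouse, that separate databases can be queried as one, can be sketched with SQLite’s ATTACH. In this toy example the two “source systems” and their table names are invented:

```python
import sqlite3

# Minimal sketch of the "virtual warehouse" idea: two physically separate
# databases (simulated here as two in-memory SQLite databases) are attached
# so that a single query can span both.
conn = sqlite3.connect(":memory:")           # pretend: the CRM database
conn.execute("ATTACH ':memory:' AS erp")     # pretend: the ERP database

conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Ltd')")

conn.execute("CREATE TABLE erp.orders (customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO erp.orders VALUES (1, 250.0)")

# One federated query joins data that never left its source database.
row = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    JOIN erp.orders AS o ON o.customer_id = c.id
""").fetchone()
```

The drawback listed above is also visible here: the join has to touch both databases at query time, which is exactly what slows down complex queries at scale.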
When to use: suitable for businesses that have raw data in a standardized form that doesn’t require complex analytics. It also fits organizations that don’t use BI systematically, or want to start with it.
Cloud Data Warehouse
For over a decade, cloud technologies have been the standard for setting up organization-level infrastructure. You’ll find countless providers on the market that offer warehousing-as-a-service. To name a few:
- Amazon Redshift
- IBM Db2
- Google BigQuery
- Snowflake
- Microsoft SQL Data Warehouse
All of the providers mentioned offer fully managed, scalable warehousing as a part of their BI tooling, or focus on EDW as a standalone service, like Snowflake does. In this case, cloud warehouse architecture has the same benefits as any other cloud service. Its infrastructure is maintained for you, meaning you don’t need to set up your own servers, databases, and tooling to manage it. The price for such a service will depend on the amount of storage required and the compute capacity needed for querying.
The only aspect you might be concerned about with a cloud warehouse platform is data security. Business data is sensitive, so you’ll want to verify that the vendor you choose can be trusted to prevent breaches. This doesn’t necessarily mean that an on-premise warehouse is more secure, but in that case, the safety of your data is in your hands.
When to use: Cloud platforms are a great choice for organizations of any size that need everything set up for them, including managed data integration, DW maintenance, and BI support.
Enterprise Data Warehouse Architecture
While there are many architectural approaches that extend warehouse capabilities in one way or another, we will focus on the most essential ones. Without diving into too much technical detail, the whole data pipeline can be divided into three layers:
- Raw data layer (data sources)
- Warehouse and its ecosystem
- User interface (analytical tools)
The tooling that concerns data Extraction, Transformation, and Loading into a warehouse is a separate category of tools known as ETL. Also, under the ETL umbrella, data integration tools perform manipulations with data before it’s placed in a warehouse. These tools operate between a raw data layer and a warehouse.
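To make the three letters concrete, here is a toy extract-transform-load pass, assuming a CSV export as the raw source (the file contents and table name are invented). Real ETL tools add scheduling, error handling, and incremental loads; this only shows where each step happens:

```python
import csv
import io
import sqlite3

# A hypothetical raw export from a source system, with inconsistent formatting.
raw_export = io.StringIO(
    "date,region,amount\n"
    "2016-01-04,eu, 120 \n"
    "2016-01-05,US,95\n"
)

# Extract: read rows from the source system's export.
rows = list(csv.DictReader(raw_export))

# Transform: standardize formats before the data reaches the warehouse.
cleaned = [
    (r["date"], r["region"].strip().upper(), float(r["amount"]))
    for r in rows
]

# Load: insert the unified records into the warehouse table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)

total = dw.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Note that the transformation runs before the load here, which is the defining trait of ETL as opposed to ELT, where raw rows land in the warehouse first.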
When the data is loaded into a warehouse, it can also be transformed. So, the warehouse will require certain functionality for cleaning/standardization/dimensionalization. These and other factors will determine architecture complexity. We will look at the EDW architecture from the standpoint of growing organizational needs.
Given that data integration is well-configured, we can choose our data warehouse. In most cases, a data warehouse is a relational database with modules to allow multidimensional data, or one that can separate some domain-specific information for easier access. In its most primitive form, warehousing can have just one-tier architecture.
One-tier architecture for EDW means that you have a database directly connected with the analytical interfaces where the end user can make queries. Setting the direct connection between an EDW and analytical tools brings several challenges:
- Traditionally, you can consider your storage a warehouse once it holds 100 GB of data or more. Working with it directly may result in messy query results, as well as low processing speed.
- Querying data right from the DW may require precise input, so that the system will be able to filter out non-required data, which makes dealing with presentation tools a little difficult.
- Flexibility and analytical capabilities are limited.
Additionally, the one-tier architecture sets some limits to reporting complexity. Such an approach is rarely used for large-scale data platforms, because of its slowness and unpredictability. To perform advanced data queries, a warehouse can be extended with low-level instances that make access to data easier.
Two-tier architecture (data mart layer)
In two-tier architecture, a data mart level is added between the user interface and EDW. A data mart is a low-level repository that contains domain-specific information. Simply put, it’s another, smaller-sized database that extends EDW with dedicated information for your sales/operational departments, marketing, etc.
Creating a data mart layer will require additional resources to establish hardware and integrate those databases with the rest of the data platform. But such an approach solves the querying problem: Each department will access required data more easily because a given mart will contain only domain-specific information. In addition, data marts will limit the access to data for end users, making the EDW more secure.
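The idea of carving a mart out of the central warehouse can be sketched in a few lines of SQL. In this illustrative example (table and department names are made up), the marketing team gets a small table holding only its domain’s rows:

```python
import sqlite3

# Sketch of deriving a domain-specific data mart from the central EDW.
edw = sqlite3.connect(":memory:")
edw.execute("CREATE TABLE sales (region TEXT, department TEXT, amount REAL)")
edw.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("EU", "marketing",  100.0),
    ("US", "marketing",   80.0),
    ("EU", "operations", 300.0),
])

# The mart is just a narrower, department-specific slice of the warehouse.
edw.execute("""
    CREATE TABLE marketing_mart AS
    SELECT region, amount FROM sales WHERE department = 'marketing'
""")

mart_rows = edw.execute("SELECT COUNT(*) FROM marketing_mart").fetchone()[0]
```

Queries against `marketing_mart` never see operations data, which is both the performance win and the access-control win described above.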
Three-tier architecture (Online analytical processing)
On top of the data mart layer, enterprises also use online analytical processing (OLAP) cubes. An OLAP cube is a specific type of database that represents data from multiple dimensions. While relational databases represent data in just two dimensions (think of Excel or Google Sheets), OLAP allows you to compile data in multiple dimensions and move between dimensions.
It’s pretty difficult to explain in words, so let’s look at this handy example of what a cube can look like.
So, as you can see, a cube adds dimensions to the data. You may think of it as multiple Excel tables combined with each other. The front of the cube is the usual two-dimensional table, where region (Africa, Asia, etc.) is specified vertically, while sales numbers and dates are written horizontally. The magic begins when we look at the upper facet of the cube, where sales are segmented by routes and the bottom specifies time-period. That’s known as multidimensional data.
The business value of OLAP is that it allows users to slice and dice the data to compile detailed reports. As long as the cubes are optimized to work with warehouses, they can be used both directly with an EDW to give access to all the corporate data or with each data mart specifically. In terms of implementation, nearly all warehouse providers offer OLAP as a service. As an example, check Microsoft documentation on their OLAP offer.
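Slicing and dicing can be illustrated without any OLAP engine at all. Below is a miniature “cube” of sales facts keyed by three dimensions (region, route, period), with the figures made up; aggregating along any one dimension is a view of one face of the cube:

```python
from collections import defaultdict

# Toy fact table: three dimensions plus a sales measure.
facts = [
    ("Africa", "route_1", "Q1", 10),
    ("Africa", "route_2", "Q1",  5),
    ("Asia",   "route_1", "Q1",  7),
    ("Asia",   "route_1", "Q2", 12),
]

def slice_by(dimension_index):
    """Aggregate the sales measure along one dimension of the cube."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension_index]] += fact[3]
    return dict(totals)

by_region = slice_by(0)   # view the cube's "region" face
by_period = slice_by(2)   # rotate the cube to the "period" face
```

A real OLAP cube precomputes and indexes these aggregations across all dimension combinations, which is what makes interactive pivoting fast on large data sets.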
On that point, we have discussed a high-level design of an EDW applied to organizational needs. Now we’re going to drill down into technical components that a warehouse may include.
Data Warehouse vs Data Lake vs Data Mart
Speaking about data storage architecture, we have to mention such options as using a data mart or a data lake instead of a warehouse. These concepts are frequently conflated, so let’s elaborate on the definitions.
Data warehouses are meant to store structured data, so that querying tools and end users can get comprehensive results. Warehouses, mostly used for BI, usually vary in size between 100GB and infinity.
Data lakes, however, are used to store mostly raw or mixed data. These are often leveraged for machine learning, big data, or data mining purposes. For the last couple of years, data lakes have also been used for BI: Raw data is loaded into a lake and transformed there, which is an alternative to the ETL process. While this approach has its pros and cons, data lakes can be too messy for reaching structured data.
Then we have data marts, which can also be used as an alternative to a DW. Some models (like Kimball’s) assume using multiple data marts that distribute information by domain and connect to each other. But, because of their small size (usually less than 100GB), data marts can hardly be used by enterprises on their own. More often, data marts are used to segment a large DW into more operable ones.
Enterprise Data Warehouse Components
There are a lot of instruments used to set up a warehousing platform. We’ve already mentioned most of them, including the warehouse itself. So, let’s take a bird’s-eye view of each component’s purpose and function.
Sources. That’s simple, the databases where raw data is stored.
Extract, Transform, Load (ETL) or Extract, Load, Transform (ELT) layer. These are the tools that perform actual connection with source data, its extraction, and loading to the place where it will be transformed. Transformation unifies data format. ETL and ELT approaches differ in that in ETL the transformation is done before EDW, in a staging area. ELT is a more modern approach that handles all the transformation in a warehouse.
Staging area. In the case of ETL, the staging area is the place data is loaded before EDW. Here, it will be cleaned and transformed to a given data model. The staging area may also include tooling for data quality management.
DW database. The data is finally loaded into the storage space. In ELT, some transformation may still happen here. But, at that stage, all the general changes will be applied, so the data will be loaded in its final model(s). As we mentioned, data warehouses are most often relational databases. A DW will also include a database management system and additional storage for metadata.
Meta-data module. Put simply, metadata is data about data. These are the explanations that give hints for users/administrators of what subject/domain this information relates to. This data can be technical meta (e.g. initial source), or business meta (e.g. region of sales). All the meta is stored in a separate module of EDW and is managed by a metadata manager.
Reporting layer. These are tools that give end users access to data. Also called BI interface, this layer will serve as a dashboard to visualize data, form reports, and pull separate pieces of information.
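The metadata module described above can be sketched as a simple registry. In this illustrative example (entries, source-system names, and the `lineage` helper are all invented, not a real standard), each warehouse column gets technical meta and business meta, mirroring the split described earlier:

```python
# Toy metadata registry: technical meta (where the data came from) and
# business meta (what it means) for each warehouse column.
metadata = {
    "sales.amount": {
        "technical": {"source_system": "ERP", "loaded_by": "nightly_etl"},
        "business":  {"description": "Order value in EUR, before tax"},
    },
    "sales.region": {
        "technical": {"source_system": "CRM", "loaded_by": "nightly_etl"},
        "business":  {"description": "Sales region of the ordering customer"},
    },
}

def lineage(column):
    """Answer the classic metadata question: where did this field come from?"""
    return metadata[column]["technical"]["source_system"]
```

A real metadata manager stores this in its own module of the EDW and keeps it in sync as pipelines change, but the question it answers is the same: `lineage("sales.amount")` tells an analyst which system to trust for that figure.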
Understanding the chain of tooling that passes data along can help you figure out what actually fits your data platform requirements. Setting up a warehouse may take years of planning and testing because of its scale, even in its most basic form.
As a business owner, you might be confused by the number of options and technologies used, so it’s vital to consult with experts in the field of warehousing, ETL, and BI. While experts can help you with the technical aspect, to define the business purpose, speak with the ones who will use the actual data in their work.
Originally published at AltexSoft “Enterprise Data Warehouse: Concepts, Architecture, and Components”