Let’s say you run a large online bookstore. It’s open 24/7. Users may place orders and pay for them literally every minute or second. That means your website must quickly process lots of transactions involving small amounts of data like order ID and details, user ID, or credit card data. Online transaction processing (OLTP) systems, namely databases and applications like a shopping cart, make it possible for an eCommerce business to work nonstop as it must.
Besides running daily operations, you may want to evaluate your performance, for instance, by analyzing sales of a given book or author during the previous month. That means you must collect transactional data and move it from the database that supports transactions to another system that can handle large volumes of data. Commonly, you also transform the data before loading it into that other storage system. Only after these actions can you analyze data with dedicated software (a so-called online analytical processing or OLAP system). But how do you move data? You need infrastructure, hardware and/or software, that will allow you to do that. You need an efficient data pipeline.
What is a data pipeline?
A data pipeline is a set of tools and activities for moving data from one system with its method of data storage and processing to another system in which it can be stored and managed differently. Moreover, pipelines allow for automatically getting information from many disparate sources, then transforming and consolidating it in one high-performing data storage.
Imagine that you’re gathering various data that shows how people engage with your brand. That can be their location, device, session recordings, purchase and customer service interaction history, feedback shared, and more. And then you place these pieces of information in one place, a warehouse, creating a profile for each customer.
Thanks to data consolidation, everyone who uses data to make strategic and operational decisions or who builds and maintains analytical tools can easily and quickly access it. These can be data science teams, data analysts, BI engineers, chief product officers, marketers, or any other specialists that rely on data in their work.
Building and managing infrastructure for data movement and its strategic usage is what data engineers do.
Data pipeline components
To understand how the data pipeline works in general, let’s see what a pipeline usually consists of. Senior research analyst of Eckerson Group David Wells considers eight types of data pipeline components. Let’s discuss them in brief.
Origin is the point of data entry in a data pipeline. Data sources (transaction processing application, IoT device sensors, social media, application APIs, or any public datasets) and storage systems (data warehouse or data lake) of a company’s reporting and analytical data environment can be an origin.
The final point to which data is transferred is called a destination. The destination depends on the use case: Data can be sourced to power data visualization and analytical tools or moved to storage like a data lake or a data warehouse. We’ll get back to the types of storage a bit later.
Dataflow is the movement of data from origin to destination, including the changes it undergoes along the way as well as the data stores it goes through. One of the approaches to dataflow is called ETL, which stands for extract, transform, and load:
Extract: getting/ingesting data from original, disparate source systems.
Transform: moving data to a temporary storage known as a staging area and transforming it there to ensure it meets agreed formats for further uses, such as analysis.
Load: loading reformatted data to the final storage destination.
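The three ETL steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: it assumes two in-memory SQLite databases standing in for the transactional source and the warehouse, the table and column names (`orders`, `sales_by_title`, `price_cents`) are made up for the bookstore example, and the transformation happens in memory instead of in a real staging area.

```python
import sqlite3


def extract(source):
    """Extract: read raw order rows from the (hypothetical) transactional database."""
    return source.execute(
        "SELECT order_id, title, price_cents, quantity FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: aggregate revenue per title and convert cents to currency units."""
    totals = {}
    for _order_id, title, price_cents, quantity in rows:
        totals[title] = totals.get(title, 0) + price_cents * quantity
    return [(title, cents / 100) for title, cents in sorted(totals.items())]


def load(destination, records):
    """Load: write the reformatted records to the destination table."""
    destination.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_title (title TEXT PRIMARY KEY, revenue REAL)"
    )
    destination.executemany("INSERT OR REPLACE INTO sales_by_title VALUES (?, ?)", records)
    destination.commit()


def run_etl():
    # Source system with a few sample orders (illustrative data only).
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE orders (order_id INTEGER, title TEXT, price_cents INTEGER, quantity INTEGER)"
    )
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [(1, "Dune", 1200, 2), (2, "Emma", 800, 1), (3, "Dune", 1200, 1)],
    )
    # Destination system standing in for a data warehouse.
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(extract(source)))
    return warehouse.execute(
        "SELECT title, revenue FROM sales_by_title ORDER BY title"
    ).fetchall()
```

Calling `run_etl()` returns one revenue row per title. Keeping each step in its own function makes it possible to replace the source, the transformation logic, or the destination independently.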
Storage refers to systems where data is preserved at different stages as it moves through the pipeline. Data storage choices depend on various factors, for example, volume of data and frequency and volume of queries to a storage system, uses of data, etc. (think of the online bookstore example).
Processing includes activities and steps for ingesting data from sources, storing it, transforming it, and delivering it to a destination. While data processing is related to dataflow, it focuses on how to implement this movement. For instance, one can ingest data by extracting it from source systems, copying it from one database to another (database replication), or streaming it. We mention just three options, but there are more of them.
Workflow defines a sequence of processes (tasks) and their dependence on each other in a data pipeline. Knowing several concepts — jobs, upstream, and downstream — would help you here. A job is a unit of work or execution that performs specified work — what is being done to data in this case. Upstream means a source from which data enters a pipeline, while downstream means a destination it goes to. Data, like water, flows down the data pipeline. Also, upstream jobs are the ones that must be successfully executed before the next ones — downstream — can begin.
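To make the upstream/downstream relationship concrete, here is a small sketch that orders jobs by their dependencies using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The job names and the `run_workflow` helper are hypothetical; real workflow schedulers add retries, scheduling, and parallel execution on top of the same idea.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of upstream jobs it depends on (hypothetical names).
DEPENDENCIES = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_sales": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_sales"},
}


def execution_order(dependencies):
    """Return job names so that every upstream job precedes its downstream jobs."""
    return list(TopologicalSorter(dependencies).static_order())


def run_workflow(dependencies, jobs):
    """Run jobs in dependency order; an exception in a job stops everything downstream."""
    completed = []
    for name in execution_order(dependencies):
        jobs[name]()  # call the unit of work registered under this name
        completed.append(name)
    return completed
```

With this definition, `transform_sales` can only start after both extract jobs have finished, and `load_warehouse` always runs last.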
The goal of monitoring is to check how the data pipeline and its stages are working: whether the pipeline remains efficient as the data load grows, whether data remains accurate and consistent as it goes through processing stages, and whether any data is lost along the way.
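One simple form such a check can take is a reconciliation between what left the source and what arrived at the destination. The sketch below is an assumption-laden example, not a standard tool: the `check_batch` function, the `order_id` key, and the `amount` field are invented for illustration, and records are plain dictionaries.

```python
def check_batch(source_rows, destination_rows, key="order_id", required=("order_id", "amount")):
    """Return a list of problems found; an empty list means the batch looks healthy."""
    problems = []
    source_keys = {row[key] for row in source_rows}
    destination_keys = {row[key] for row in destination_rows}

    # Data loss: records present in the source but absent from the destination.
    missing = sorted(source_keys - destination_keys)
    if missing:
        problems.append(f"lost records: {missing}")

    # Consistency: the same key should not be loaded twice.
    if len(destination_rows) != len(destination_keys):
        problems.append("duplicate keys in destination")

    # Accuracy: required fields must not be empty after processing.
    for row in destination_rows:
        empty = [field for field in required if row.get(field) is None]
        if empty:
            problems.append(f"record {row[key]} has empty fields: {empty}")
    return problems
```

A pipeline could run a check like this after every load and raise an alert whenever the returned list is not empty.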
Technology refers to the tools and infrastructure behind dataflow, storage, processing, workflow, and monitoring. Tooling and infrastructure options depend on many factors, such as organization size and industry, data volumes, use cases for data, budget, security requirements, etc. Some of the building blocks for a data pipeline are:
- ETL tools, including data preparation and data integration tools (Informatica PowerCenter, Apache Spark, Talend Open Studio).
- data warehouses: central repositories for relational data transformed (processed) for a particular purpose (Amazon Redshift, Snowflake, Oracle). Since the main users are business professionals, a common use case for data warehouses is business intelligence.
- data lakes: storage for raw data, both relational and non-relational (Microsoft Azure, IBM). Data lakes are mostly used by data scientists for machine learning projects.
- batch workflow schedulers (Airflow, Luigi, Oozie, or Azkaban) that allow users to programmatically specify workflows as tasks with dependencies between them, as well as automate and monitor these workflows.
- tools for processing streaming data, that is, data that’s continuously generated by sources like machinery sensors, IoT devices, and transaction systems (Apache Spark, Flink, Storm, Kafka).
- programming languages (Python, Ruby, Java) to define pipeline processes as code.
When do you need a data pipeline?
Reliable infrastructure for consolidating and managing data helps organizations power their analytical tools and support daily operations. Having a data pipeline is necessary if you plan to use data for different purposes, with at least one of them requiring data integration, for example, processing and storing transaction data and conducting a sales trend analysis for the whole quarter. To carry out the analysis, you will have to pull data from a number of sources (e.g., a transaction system, a CRM, a website analytics tool), put it in a single storage, and prepare it for the analysis. So, a data pipeline allows for solving “origin-destination” problems, especially with large amounts of data.
Also, the more use cases, the more forms data can be stored in, the more ways it can be processed, transmitted, and used.
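The integration step described above, pulling data from a transaction system, a CRM, and a website analytics tool into one place, might look roughly like this. It is a simplified sketch: the `consolidate` function, the record layouts, and the field names (`customer_id`, `amount_cents`, `email`) are assumptions made for the example, and real sources would be queried through their own connectors.

```python
def consolidate(transactions, crm_contacts, web_sessions):
    """Merge records from three hypothetical sources into one profile per customer."""
    profiles = {}

    def profile(customer_id):
        # Create an empty profile the first time a customer appears in any source.
        return profiles.setdefault(
            customer_id, {"orders": 0, "spent_cents": 0, "email": None, "sessions": 0}
        )

    for transaction in transactions:
        entry = profile(transaction["customer_id"])
        entry["orders"] += 1
        entry["spent_cents"] += transaction["amount_cents"]
    for contact in crm_contacts:
        profile(contact["customer_id"])["email"] = contact["email"]
    for session in web_sessions:
        profile(session["customer_id"])["sessions"] += 1
    return profiles
```

Once the profiles live in a single structure, an analysis such as quarterly sales per customer segment can read from one source instead of three.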
Data pipeline types and their use cases
Data pipelines can be distinguished by the type of analytics used in an organization: traditional and real-time (AKA streaming) analytics.
Traditional analytics is about making sense of data gathered over time (historical data) to support decision-making. This analytics type is related to business intelligence (BI). Traditional analytics uses batch processing: Data is periodically collected, transformed, moved to a destination system, and processed in blocks (batches). A batch is then queried by a user or a software program. So, some time passes between the moment data is generated and uploaded to a destination and the moment it is analyzed. Batch processing enables complex analysis of large datasets.
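As a rough illustration of batch processing, the sketch below groups accumulated sales records into daily batches and processes each batch as a whole. The record fields (`sold_on`, `amount_cents`) and the function names are hypothetical, and a real batch job would typically be triggered by a scheduler and read from a database or files.

```python
from collections import defaultdict
from datetime import date


def batches_by_day(records):
    """Group accumulated records into one batch per calendar day."""
    batches = defaultdict(list)
    for record in records:
        batches[record["sold_on"]].append(record)
    return dict(sorted(batches.items()))


def process_batch(batch):
    """Process a whole batch at once; here, sum the revenue in cents."""
    return sum(record["amount_cents"] for record in batch)


def daily_revenue(records):
    """Run the batch job over every daily batch collected so far."""
    return {day: process_batch(batch) for day, batch in batches_by_day(records).items()}
```

The results for a given day only become available after that day's batch has been collected and processed, which is the delay mentioned above.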
Dollar Shave Club: a data pipeline to power an ML-based recommendation engine
Dollar Shave Club is an American online store that delivers razors and men’s grooming products to more than 3 million subscribed members. Besides shooting hilarious commercials and succeeding in building a strong brand, the company has an efficient data infrastructure hosted on Amazon Web Services.
It uses a Redshift cluster as the central data warehouse that receives data from various systems, including production databases. “Data also moves between applications and into Redshift mediated by Apache Kafka,” said the company spokesperson. To collect event data from the web, mobile clients, as well as server-side applications, the company uses the Snowplow platform. Event data includes page views, link clicks, user browsing activity/behavior, and “any number of custom events and contexts.” Analytics platforms access data once it gets to the Redshift data warehouse.
Dollar Shave Club wanted to get even more data insights, so it developed a recommender system to define which products to promote and how to rank them in a monthly email sent to a specific customer. The engine is based on the Apache Spark unified analytics engine and runs on Databricks unified data analytics platform.
To enable the ETL process, the engineering team built an automated data pipeline on Spark. The pipeline works as follows:
- Data is extracted from Redshift.
- Data about a specific member’s behavior is aggregated and pivoted to get features that describe them.
- Selected features are included in final predictive models.
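The "aggregate and pivot" step can be illustrated in plain Python. This is not Dollar Shave Club's code, which runs on Spark; it is a minimal sketch of the general technique, assuming events arrive as `(member_id, event_type)` pairs, with both the function name and the event names invented for the example.

```python
from collections import Counter, defaultdict


def pivot_member_events(events):
    """Turn long-format (member_id, event_type) rows into one feature row per member."""
    # Aggregate: count how often each member triggered each event type.
    counts = defaultdict(Counter)
    for member_id, event_type in events:
        counts[member_id][event_type] += 1

    # Pivot: every event type becomes a column; absent events count as zero.
    event_types = sorted({event_type for _, event_type in events})
    return {
        member_id: {event_type: member_counts.get(event_type, 0) for event_type in event_types}
        for member_id, member_counts in counts.items()
    }
```

Each resulting row has the same set of columns, which is the shape predictive models usually expect as input.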
The brand said that this product recommendation project was successful.
Real-time analytics is about analyzing constantly flowing and updated (streaming) data. It uses stream processing that entails performing simple calculations over data as it’s created. Unlike batch processing, stream processing is about ingesting a sequence of data, and progressively updating metrics, reports, and summary statistics in response to every data record that becomes available.
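The idea of progressively updating summary statistics with every new record can be sketched with Welford's online algorithm, which maintains a running mean and variance without storing past values. The `RunningStats` class is an illustrative name, and a real streaming engine would apply logic like this per key or per time window.

```python
class RunningStats:
    """Keep count, mean, and variance up to date as records arrive one by one."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold a single new record into the statistics (Welford's algorithm)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of all records seen so far."""
        return self._m2 / self.count if self.count else 0.0
```

Each call to `update` does a constant amount of work, so the metrics stay current no matter how many records have already passed through.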
So, real-time analytics allows businesses to get up-to-date information about operations and react without a delay, or to provide solutions for smart monitoring of infrastructure performance as Hewlett Packard Enterprise does.
Hewlett Packard Enterprise’s switch to stream processing for its InfoSight solution
Hewlett Packard Enterprise (HPE) is a platform-as-a-service company providing data management, digital transformation, IT support, and financial services. One of their solutions, InfoSight, monitors data center infrastructure they provide to customers.
Storage devices are equipped with sensors that collect performance data and send it to InfoSight. “HPE has over 20 billion sensors deployed in data centers all around the globe sending trillions of metrics each day to InfoSight, providing analytics on petabytes of telemetry data,” a case study says.
To enhance customer experience with the predictive maintenance feature, HPE needed to upgrade the infrastructure so it could enable near-real-time analytics. The answer was data architecture with support for streaming analytics. Also, the streaming application must be available all the time, recover from failures quickly, and be able to scale elastically.
The company used the Lightbend Platform to develop such an infrastructure. The platform includes several streaming engines (Akka Streams, Apache Spark, Apache Kafka) “for handling tradeoffs between data latency, volume, transformation, and integration,” besides other technologies.
Companies may have pipelines serving both analytics types. For example, Uber uses Apache Kafka to connect the two parts of their data ecosystem. Real-time data is used “for activities like computing business metrics, debugging, alerting, and dashboarding.” The batch pipeline data is more exploratory, according to the company blog post.
Implementation options for data pipelines
You can implement your data pipeline using services from cloud providers or build it on-premises.
On-premises data pipeline. To have an on-premises data pipeline, you buy and deploy hardware and software for your private data center. You also have to maintain the data center yourself, taking care of data backup and recovery, doing a health check of your data pipeline, or increasing storage and computing capabilities. This approach is time- and resource-intensive but will give you full control over your data, which is a plus.
Cloud data pipeline. Cloud data infrastructure means you don’t have physical hardware. Instead, you access a provider’s storage space and computing power as a service over the internet and pay for the resources used. This brings us to a discussion of the pros of a cloud-based data pipeline.
- You don’t manage infrastructure or worry about data security because it’s the vendor’s responsibility.
- Scaling storage volume up and down is a matter of a few clicks.
- You can adjust computational power to meet your needs.
- Downtime risks are close to zero.
- Cloud ensures faster time-to-market.
Disadvantages of cloud include the danger of vendor lock-in: It will be costly to switch providers if one of the many pipeline solutions you use (e.g., a data lake) doesn’t meet your needs or if you find a cheaper option. Also, you must pay a vendor to configure settings for cloud services unless you have a data engineer on your team.
If you struggle to evaluate which option is right for you in both the short and long run, consider talking to data engineering consultants.
Originally published at AltexSoft tech blog “What is Data Pipeline: Components, Types, and Use Cases”