DataOps: Adjusting DevOps for Analytics Product Development

There is nothing particularly unlucky about the number 13, unless you meet it in an article revealing that “only 13 percent of data science projects… make it into production.” That sounds ominous, especially for companies investing heavily in data-driven transformations.

The bright side is that no initiative is doomed to fail from the start. New approaches keep emerging to speed up the transformation of raw data into useful insights. Just as DevOps once reshaped the software development landscape, an evolving methodology called DataOps is now changing Big Data analytics for the better.

What is DataOps: a brief introduction

The shorthand for data and operations was first introduced in 2015 by Lenny Liebmann, former Contributing Editor at InformationWeek. In his blog post “DataOps: Why Big Data Infrastructure Matters,” Lenny describes the new concept as “the discipline that ensures alignment between data science and infrastructure.”

Gartner broadens the initial definition to “a collaborative data management practice focused on improving the communication, integration, and automation of data flows between data managers and data consumers across an organization.” Whichever explanation you prefer, this doesn’t change DataOps objectives — building trust in analytics through better data quality and accelerating value creation in data-intensive projects.

To get a solid understanding of how DataOps achieves its goals, we must mention its close relatives — DevOps and MLOps.

How DataOps relates to Agile, DevOps, and MLOps.

DataOps vs DevOps

The table shows how DevOps helps tech giants accelerate deployment of updates with no loss in quality. Source: The Phoenix Project

DevOps, in turn, owes its breakthrough to Agile — an approach promoting small and frequent software releases instead of rare global changes. Seeking to speed up delivery of trustworthy data insights, DataOps takes the cue and incorporates Agile into data analytics. It also borrows some other best practices along with the overall mindset from DevOps, which we explained in our video.

DataOps can be a useful addition to DevOps, ensuring that an application is delivered with the right production data and is being tested against the right datasets.

DataOps vs MLOps

What MLOps has in common with DataOps.

Both DataOps and MLOps can be viewed as extensions of the DevOps methodology into data science. DataOps covers the data journey from extraction to the deployment of analytics products. It may prepare quality datasets and features for machine learning algorithms but doesn’t offer solutions for training ML models and running them in production. That’s where MLOps takes over.

Shared Ops principles

  • cross-functional collaboration,
  • shared responsibility for outcomes,
  • component reusability,
  • pervasive automation,
  • observability, or the ability to track and measure results,
  • learning from mistakes, and
  • continuous improvement through multiple iterations.

DataOps adapts these fundamentals to the needs of data professionals and data analytics flows.

DataOps process structure

All data operations run within a continuous integration / continuous delivery (CI/CD) workflow promoted by DevOps. It introduces automation throughout the entire lifecycle of the data analytics pipeline and into its individual segments to enable updates and ensure data quality at each step.

Data analytics pipeline exists within a CI/CD framework.

Data analytics pipeline: key stages

Data ingestion. Data, extracted from various sources, is explored, validated, and loaded into a downstream system.

Data transformation. Data is cleansed and enriched. Initial data models are designed to meet business needs.

Data analysis. At this stage, data teams may realize that they need to collect more data to draw trustworthy conclusions. If the data is sufficient, they produce insights using different data analysis techniques.

Data visualization/reporting. Data insights are represented in the form of reports or interactive dashboards.

Different teams conduct different stages of the data workflow pipeline. However, all individuals involved should share knowledge, learn from each other, and document how they achieve success.
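To make these stages more concrete, here is a minimal, illustrative sketch of a pipeline written as plain Python functions. The file and column names (orders.csv, order_id, region, quantity, unit_price) are hypothetical, and a real pipeline would typically read from databases or APIs rather than a local CSV.

```python
import pandas as pd

def ingest(source: str) -> pd.DataFrame:
    """Ingestion: extract raw records and load them into a working dataset."""
    return pd.read_csv(source)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Transformation: cleanse the data and derive the fields the business needs."""
    cleaned = raw.drop_duplicates().dropna(subset=["order_id"])
    return cleaned.assign(revenue=cleaned["quantity"] * cleaned["unit_price"])

def analyze(data: pd.DataFrame) -> pd.DataFrame:
    """Analysis: aggregate the cleansed data into an insight-ready table."""
    return data.groupby("region", as_index=False)["revenue"].sum()

def report(summary: pd.DataFrame, target: str) -> None:
    """Visualization/reporting: persist results for a dashboard or BI tool to pick up."""
    summary.to_csv(target, index=False)

if __name__ == "__main__":
    report(analyze(transform(ingest("orders.csv"))), "revenue_by_region.csv")
```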

CI/CD for data operations

Development. In DataOps, this step may involve building a new pipeline, changing a data model, or redesigning a dashboard.

Testing. The DataOps framework fosters checking even the most minor update for data accuracy, potential deviations, and errors. Testing of inputs, outputs, and business logic should be performed at each stage of the data analytics pipeline to verify that the results of a particular data job meet expectations. For complex pipelines, this means myriad tests that should be automated where possible.
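As an illustration, here is a minimal, pytest-style sketch of such checks against the hypothetical output file from the pipeline sketched earlier; a real suite would also validate inputs and intermediate tables.

```python
import pandas as pd

def test_revenue_report_meets_expectations():
    # Output of the (hypothetical) reporting step from the earlier sketch
    report = pd.read_csv("revenue_by_region.csv")

    # Output checks: the job produced data with the expected schema
    assert not report.empty, "pipeline produced no rows"
    assert {"region", "revenue"}.issubset(report.columns), "schema drift detected"

    # Business-logic check: revenue can never be negative
    assert (report["revenue"] >= 0).all(), "negative revenue violates business rules"
```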

Deployment. This presumes moving data jobs between environments, pushing them to the next stage, or deploying the entire pipeline in production.

Monitoring. A prerequisite for data quality, monitoring allows data professionals to identify bottlenecks, catch abnormal patterns, and measure adoption of changes. Currently, monitoring often relies on AI-driven anomaly detection algorithms.
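A toy example of the idea: flag a pipeline run whose ingested row count drifts far from its historical baseline. Production systems track many more metrics and use more sophisticated models, but the principle is the same; the numbers below are invented.

```python
import statistics

def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
    """Return True if the latest metric deviates from the baseline by more than
    `threshold` standard deviations (a simple z-score check)."""
    baseline = statistics.mean(history)
    spread = statistics.stdev(history) or 1.0  # guard against zero spread
    return abs(latest - baseline) / spread > threshold

# Hypothetical daily row counts ingested over the past week vs. today's run
if is_anomalous([10_120, 9_980, 10_340, 10_055, 10_210, 9_890], latest=3_400):
    print("Alert: today's ingestion volume deviates sharply from the baseline")
```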

Orchestration. In the data world, orchestration automates moving data between different stages, monitoring progress, triggering autoscaling, and other operations related to the management of data flows. It covers each stage and the end-to-end data analytics pipeline.

DataOps people

Data stakeholders uniting around business requirements.

The methodology builds the bridge between

  • data managers — data architects, data engineers, ETL developers, and data stewards;
  • data consumers — data analysts, data scientists, dashboard developers, BI teams, machine learning engineers and others who use data to deliver results via visualizations, APIs, ML models, applications or other mediums; and
  • a DevOps team — software developers and IT operations professionals.

Each of these players contributes to the end-to-end process of data transformation, from raw pieces of information to analytics solutions for end customers. Once settled into the DataOps environment, they work together to deliver valuable business insights faster.

Now that you have a general idea of DataOps principles, players, and components, the question is: What makes the entire ecosystem work? Below, we’ll look at the core practices and technologies that put the concept to work.

Best practices to support DataOps

Treat data as code

Create a data catalog

Catalogs enable data professionals to easily find and understand datasets they need. This results in significant time savings along with improved speed and quality of insights. Besides that, data catalogs can prevent unauthorized data access and simplify compliance with GDPR and other data protection regulations.
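What a catalog stores varies by tool, but at a minimum each dataset gets a documented owner, schema, freshness, and access policy. A hypothetical entry might look like this (field names are illustrative, not taken from any specific catalog product):

```python
# Illustrative catalog entry for one dataset
catalog_entry = {
    "dataset": "analytics.revenue_by_region",
    "description": "Daily revenue aggregated by sales region",
    "owner": "data-engineering@company.example",
    "columns": {"region": "string", "revenue": "decimal(18,2)"},
    "update_frequency": "daily",
    "contains_pii": False,                         # drives GDPR handling
    "allowed_roles": ["analyst", "bi_developer"],  # prevents unauthorized access
}
```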

Consider ELT

For decades, ETL prevailed over ELT because storage mediums couldn’t cope with the volume and complexity of raw data. So, businesses had to reshape, clean, and organize information before loading it.

Modern cloud data warehouses are much cheaper and faster than their predecessors. They handle both structured and unstructured data, making ELT an optimal choice for many DataOps projects — and here are key reasons why.

  • ELT is faster and cheaper to run.
  • ELT creates rich and easy-to-access pools of historical data, with no details missed during transformations. Businesses can use it anytime for analysis and generating BI.
  • With ETL, transformations are owned by data engineers. In ELT, changes happen in the warehouse, where data analysts can also contribute to transformations by writing them in SQL.
  • You can design new data models multiple times without revising the entire pipeline.
ETL and ELT approaches to moving data.
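To illustrate the difference, here is a minimal ELT sketch in Python, using SQLite as a stand-in for a cloud warehouse: raw data is loaded first, and the modeling happens afterwards as SQL inside the warehouse, where analysts can own it. Table and column names are hypothetical.

```python
import sqlite3
import pandas as pd

warehouse = sqlite3.connect("warehouse.db")  # stand-in for a cloud data warehouse

# Load: raw data lands in the warehouse untouched
pd.read_csv("orders.csv").to_sql("raw_orders", warehouse, if_exists="replace", index=False)

# Transform: modeling happens inside the warehouse, expressed in SQL
warehouse.executescript("""
    DROP TABLE IF EXISTS revenue_by_region;
    CREATE TABLE revenue_by_region AS
    SELECT region, SUM(quantity * unit_price) AS revenue
    FROM raw_orders
    GROUP BY region;
""")
warehouse.commit()
```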

There are still companies where ETL would be preferable. This includes cases when data is predictable, transformations are minimal, and required models are unlikely to change. ETL is also unavoidable with legacy architectures and with data that must be transformed before entering the target system, for example, when you have to delete personally identifiable information (PII) to comply with GDPR.

However, with more organizations moving to the cloud, ELT gains greater popularity due to its agility, low price, and flexibility. The downside is that, as with any relatively new concept, ELT tools are often far from perfect, requiring a high level of expertise from your data team.

Build Directed Acyclic Graphs (DAGs) for orchestration

Visualization of DAG dependencies in Apache Airflow.
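For reference, a minimal Airflow-style DAG (Airflow 2.x syntax) might look like the sketch below; the DAG id, schedule, and task names are hypothetical, and the callables are left as stubs.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    pass      # pull raw data from sources

def transform():
    pass      # cleanse and model the data

def publish():
    pass      # refresh reports or dashboards

with DAG(
    dag_id="daily_revenue_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    publish_task = PythonOperator(task_id="publish", python_callable=publish)

    # The >> operator defines the graph's edges: ingest, then transform, then publish
    ingest_task >> transform_task >> publish_task
```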

Technologies to run DataOps

Among preferred technologies are Git for version control and Jenkins for CI/CD practices. Similar to DevOps, DataOps accords well with microservices architecture, using Docker for containerization and Kubernetes for managing containers. For data visualizations, DataOps often utilizes Tableau. However, the core of the DataOps stack is made of data-specific solutions.

Data pipeline tools

Besides the above-mentioned Apache Airflow, the list of solutions widely used in DataOps includes

  • Piperr, a suite of pre-built pipelines to run data integration, cleaning, and enrichment;
  • Genie, an open-source engine designed by Netflix to orchestrate Big Data jobs;
  • Apache Oozie, a workflow scheduler for Hadoop jobs;
  • Prefect, a platform for data flow automation and monitoring;
  • Pachyderm for data version control and data lineage; and
  • dbt (data build tool), a development environment to write and execute data transformation jobs in SQL. In essence, dbt is the T in the ELT (extract, load, transform) process.
Here’s how dbt fits into data workflows. Source: Medium

Automated testing and monitoring tools

Automated testing in DataOps typically covers

  • initial data testing,
  • data structures testing (validating database objects, tables, columns, data types, etc.),
  • ETL testing (if a company opts for ETL),
  • integration testing, verifying that all pipeline components work well together, and
  • BI/report testing.

A reliable automated test suite is key to making a go of analytics continuous delivery. Here are several platforms to consider.

iCEDQ connects with over 50 widely used data sources to compare data across databases and files. It validates and verifies initial data, data structures, ETL processes, and BI reports. By integrating with popular CI/CD and orchestration tools, the technology facilitates workflow management and deployment as well.

ETL testing with iCEDQ.

Naveego is a data accuracy solution that detects errors across all enterprise data sources. It teams up with Kubernetes, big data processing engine Apache Spark, and event streaming platform Apache Kafka for fast deployment and seamless integration of data, no matter its schema or structure. The platform cleans stored information to make it ready for analytics.

RightData supports numerous data sources, compares datasets between source and target, and alerts to errors. Users can create their own validation rules.

Databand tracks data quality metrics across files, tables, and pipeline inputs and outputs. It also monitors the performance of workflows, capturing deviations from baselines.

DataOps platforms

DataKitchen is an enterprise-level DataOps platform that orchestrates all key operations, from data access to analytics delivery, providing testing and monitoring at each step. The technology allows data professionals to develop and maintain pipelines with Python, create models in R, and visualize results in Tableau, all in one workspace. It also automates end-to-end ML pipelines and can be used for MLOps projects.

Saagie accelerates delivery of data projects via a Plug-and-Play orchestrator. Data professionals can create and manage data pipelines combining a range of pre-integrated tools and technologies. The platform also gives access to versioning, rollback, and monitoring capabilities.

StreamSets offers visual tools to build, execute, monitor, and optimize myriad data pipelines. It speeds up big data ingestion and simplifies data transformation; with StreamSets, this step doesn’t require coding skills. As the solution is infrastructure-agnostic and legacy-friendly, engineers can run projects in different environments, keeping data synchronized across platforms.

StreamSets DataOps platform structure.

DataOps implementation tips

Choose the right time

Make sure your staff has core competencies

Keep in mind that success may not come quickly

Originally published at the AltexSoft tech blog as “DataOps: Adjusting DevOps for Analytics Product Development.”

Being a Technology & Solution Consulting company, AltexSoft co-builds technology products to help companies accelerate growth.