Ultimate Guide to Streaming Data Pipelines: Six Criteria to Evaluate and Select the Right Tool

ABSTRACT: This blog, the second in a series, recommends six criteria to help data leaders evaluate tools for managing streaming data pipelines.
Read time: 6 mins.
Sponsored by Striim
The Greek philosopher Heraclitus observed that we “never step in the same river twice” because change is constant. In business, it’s a similar story: we must seize opportunities before they wash away.
Streaming data pipelines can help. When we design and implement them well, streaming data pipelines enable us to identify, analyze and act on real-time business opportunities. This blog, the second in a series, recommends six criteria to help data leaders evaluate tools for managing such pipelines. The criteria include pipeline functionality, ease of use, support for unstructured data, ecosystem integration, governance and performance/scalability. The first blog in our series explained why and how streaming data drives the success of generative AI (GenAI).
To recap, a streaming data pipeline is a workflow that captures a digital “event”—perhaps a credit card purchase, website login, or factory sensor alert—and replicates that event to a target platform. Streaming pipelines typically relay events in “real-time” increments—every few minutes, seconds or even milliseconds, whatever the business requires. They also might perform light data transformations along the way.
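To make that concrete, here is a minimal, hypothetical sketch of what one such event might look like as it moves through a pipeline. The field names are illustrative, not any particular tool's event format.

```python
# An illustrative change event as it might flow through a streaming pipeline.
# Field names are hypothetical; real tools define their own event envelope.
purchase_event = {
    "source": "payments_db.transactions",     # where the change originated
    "op": "INSERT",                           # type of change captured
    "ts": "2024-06-01T12:00:00.250Z",         # when the event occurred
    "row": {"card_last4": "4242", "amount": 42.50, "currency": "USD"},
}
```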
Let’s walk through our evaluation criteria.
Pipeline functionality
Most streaming pipeline tools use change data capture (CDC) to identify and replicate live changes on sources, most often relational databases. For example, log-based CDC identifies and captures live data changes by scanning the source database's transaction logs, such as redo or write-ahead logs. This method is less disruptive to the source database than alternative CDC methods such as queries and triggers, both of which consume CPU and memory. Some data sources dictate the CDC method—for example, a source without a transaction log cannot be read this way—but you should favor pipeline tools that use log-based CDC wherever possible.
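As a rough illustration of the pattern (not any vendor's implementation), the following Python sketch polls a simulated transaction log and relays each new change downstream. The helper names and event shapes are hypothetical.

```python
# A minimal sketch of a log-based CDC polling loop over a simulated transaction log.
# Real tools read the database's own log (e.g., a write-ahead or redo log) instead of a list.

# Simulated transaction log: each entry is one committed change.
TRANSACTION_LOG = [
    {"lsn": 1, "op": "INSERT", "table": "orders", "row": {"id": 101, "total": 42.5}},
    {"lsn": 2, "op": "UPDATE", "table": "orders", "row": {"id": 101, "total": 40.0}},
]

def read_transaction_log(since_lsn):
    """Return changes committed after the given log sequence number (LSN)."""
    return [entry for entry in TRANSACTION_LOG if entry["lsn"] > since_lsn]

def publish_event(event):
    """Hypothetical relay step: in practice this writes to the streaming target."""
    print("replicating:", event)

def run_cdc_once(last_lsn=0):
    """One polling cycle: capture new changes and replicate them downstream."""
    for change in read_transaction_log(last_lsn):
        publish_event(change)
        last_lsn = change["lsn"]
    return last_lsn

run_cdc_once()
```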
Streaming pipelines should also be able to transform data in flight or after it arrives at the target. In-flight transformations, part of a streaming Extract, Transform, Load (ETL) pattern, might filter out certain events or table columns, merge two or more streams, convert data between formats or reorganize data by applying a new schema. In addition to transformations like these, your pipeline tool should support schema evolution by propagating table structure changes from source to target.
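To show what "in flight" means in practice, here is a hedged Python sketch of a few such transformations (filtering, column pruning and stream merging), using the same hypothetical event shape as the CDC sketch above.

```python
# Illustrative in-flight transformations over a change stream of dict events.
import heapq

def filter_table(stream, table):
    """Keep only events for one table (a simple in-flight filter)."""
    return (e for e in stream if e["table"] == table)

def drop_columns(stream, columns):
    """Remove unneeded or sensitive columns before loading."""
    for e in stream:
        yield {**e, "row": {k: v for k, v in e["row"].items() if k not in columns}}

def merge_streams(*streams):
    """Merge several time-ordered change streams into one, ordered by timestamp."""
    return heapq.merge(*streams, key=lambda e: e["ts"])

orders = [{"table": "orders", "ts": 1, "row": {"id": 1, "card_no": "4242", "total": 9.99}}]
events = drop_columns(filter_table(orders, "orders"), columns={"card_no"})
print(list(events))  # [{'table': 'orders', 'ts': 1, 'row': {'id': 1, 'total': 9.99}}]
```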
Ease of use
Your streaming pipeline tool should reduce the time and effort it takes your team to build and manage streaming data pipelines compared with homegrown scripting. Assess the level of automation it provides to design, configure, execute and monitor pipelines. For example, your tool should automate processes for inferring and migrating source schemas to the target, loading historical snapshots, activating CDC pipelines and monitoring latency.
Also look for tools that guide users with suggested pipeline attributes, intuitive icons, and clear mouseover explanations, along with searchable and navigable documentation to answer user questions. AI copilots can simplify the experience further with a natural language chatbot that recommends pipeline designs or code.
Given the pace of change in modern enterprise environments, ease of change is just as critical as ease of setup. The right pipeline tool enables users to reconfigure, add, or remove various pipeline elements, including sources, targets, and transformation scripts, in a modular fashion.
Support for unstructured data
Traditional data pipelines handle structured data (i.e., tables) and semi-structured data such as log files. Now the rapid adoption of AI is driving the need to handle text, imagery and other types of unstructured data. Ingesting and transforming unstructured objects like these requires new techniques. Your pipeline tool should be able to handle such transformations, for example by tokenizing and vectorizing content within a document, then loading the resulting vector embeddings into a vector database.
Once there, this content can feed retrieval-augmented generation (RAG), a popular workflow for applying GenAI to proprietary enterprise data. The RAG workflow retrieves the most relevant embeddings from the vector database and injects the associated content into user prompts, enabling a GenAI language model to generate accurate and context-sensitive outputs. RAG and GenAI are the future, and your tool should create the pipelines that support them.
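As a rough end-to-end illustration, the sketch below vectorizes a few document chunks, retrieves the most relevant one for a question and injects it into a prompt. It assumes the open-source sentence-transformers library; the model name, sample text and prompt wording are illustrative only.

```python
# Minimal RAG sketch: vectorize document chunks, retrieve the best match for a
# question, and inject it into a prompt for a GenAI model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

# 1. Ingest: split a document into chunks and compute vector embeddings.
chunks = [
    "Refunds are processed within five business days.",
    "Premium support is available 24/7 for enterprise customers.",
]
embeddings = model.encode(chunks)                # shape: (num_chunks, dim)

# 2. Retrieve: embed the question and rank chunks by cosine similarity.
question = "How long do refunds take?"
q = model.encode([question])[0]
scores = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
top_chunk = chunks[int(np.argmax(scores))]

# 3. Augment: inject the retrieved context into the user prompt.
prompt = f"Answer using only this context:\n{top_chunk}\n\nQuestion: {question}"
print(prompt)
```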
Ecosystem integration
The explosion of data supply and demand makes environments ever more complex. To simplify data pipelines, your tool should support all major data sources, targets, processors, formats, transformation tools, workflow tools, database tools, and programming languages. While your tool might not interact directly with all of them, you need to ensure it does not limit data portability between any of them as your environment evolves. Evaluate pipeline tools based on the level of effort required to support each of these elements:
- Sources include relational databases, legacy mainframe systems, SaaS applications, text files, imagery and IT log files. They reside on premises and in the cloud.
- Targets include data warehouses, lakehouses, vector databases, SaaS applications, relational databases and NoSQL databases, hosted in the cloud or, in some cases, on premises.
- Formats include open table formats such as Apache Iceberg and Delta Lake as well as the more traditional CSV, JSON and Apache Parquet.
- Transformation tools range from custom scripts to vendor offerings such as dbt.
- Workflow tools include orchestrators such as Apache Airflow, which schedules data pipelines and integrates them into operational or analytical workflows (see the sketch after this list).
- Programming languages range from SQL to JavaScript, Python and Java. Data scientists and developers build programs in these languages, most often SQL, to transform or analyze data in flight.
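For a sense of how a workflow tool fits in, here is a minimal Apache Airflow sketch that schedules a daily pipeline refresh. It assumes Airflow 2.x; the DAG ID and trigger logic are placeholders, not a specific vendor integration.

```python
# Minimal Airflow 2.x sketch: schedule a daily task that triggers a pipeline refresh.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def trigger_pipeline_refresh():
    # Hypothetical placeholder for calling your pipeline tool's API or CLI.
    print("Refreshing streaming pipeline targets...")

with DAG(
    dag_id="streaming_pipeline_refresh",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    refresh = PythonOperator(
        task_id="trigger_refresh",
        python_callable=trigger_pipeline_refresh,
    )
```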
Governance
Your pipeline tool should align with your data governance program and help enforce policies. For starters, it should track the lineage of data—from source to pipeline to target—so users can assess its validity. Pipeline tools also should organize metadata about files, users and pipelines to inform catalogs and satisfy internal or external audit requirements. Another requirement is role-based access controls: as with any data tool, you must authenticate user identities and authorize their actions, both natively and through integration with third-party identity management tools such as Azure Active Directory.
As an added security layer, your pipeline tool should obfuscate personally identifiable information (PII)—for example, social security numbers in a table column—using measures such as masking or encryption. These governance features support compliance with privacy regulations and reduce the risks of AI initiatives.
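As a simple illustration of masking in flight, the sketch below redacts all but the last four digits of a social security number in an event's row. The column name and masking rule are illustrative.

```python
# Illustrative PII masking step for a change event carrying an "ssn" column.
import re

SSN_PATTERN = re.compile(r"\d{3}-\d{2}-(\d{4})")

def mask_ssn(value: str) -> str:
    """Replace all but the last four digits of a social security number."""
    return SSN_PATTERN.sub(r"***-**-\1", value)

def mask_pii(event: dict) -> dict:
    """Return a copy of the event with the hypothetical 'ssn' column masked."""
    row = dict(event["row"])
    if "ssn" in row:
        row["ssn"] = mask_ssn(row["ssn"])
    return {**event, "row": row}

print(mask_pii({"table": "customers", "row": {"name": "Ada", "ssn": "123-45-6789"}}))
# {'table': 'customers', 'row': {'name': 'Ada', 'ssn': '***-**-6789'}}
```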
Performance and scalability
Your team needs confidence that its pipelines deliver timely data to the business even as volumes, varieties and velocities rise. You can probe vendor tools' performance and scalability in several ways. To start, learn how each tool would fit into your data architecture. It should impose as little processing burden as possible on existing resources, especially on-premises applications and servers. It also should scale on demand, for example by activating cloud compute nodes to support new sources, targets, and data streams.
Have your team devise a rigorous proof of concept that tests each tool's ability to support your most demanding use cases at the required latency, throughput and concurrency. Ask each vendor for proof points about the SLAs it has met for other customers with similar use cases and environments. You also should have granular visibility into CPU/memory utilization metrics and performance KPIs so you can anticipate and prevent bottlenecks.
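During such a proof of concept, even a simple probe like the sketch below, which compares each event's source timestamp with its arrival time at the target, can show whether latency stays within your target. Field names here are illustrative.

```python
# Illustrative end-to-end latency probe for a proof of concept.
import time
import statistics

def measure_lag(event):
    """Seconds between when the change happened and when it arrived here."""
    return time.time() - event["source_ts"]

# Simulated arrivals with known lags of 0.2s, 0.5s and 1.1s.
arrivals = [{"table": "orders", "source_ts": time.time() - lag} for lag in (0.2, 0.5, 1.1)]
lags = [measure_lag(e) for e in arrivals]
print(f"median latency: {statistics.median(lags):.2f}s across {len(lags)} events")
```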
Streaming forward
While we never step in the same river twice, we can adopt consistent tools that capture live opportunities as they arise—and crucially, these tools are built to adjust to changing circumstances. Data leaders can achieve this by selecting the right streaming data pipeline tool and using it to help AI models and business applications generate real-time insights. The right tool delivers on all six criteria: pipeline functionality, ease of use, support for unstructured data, ecosystem integration, governance, and performance and scalability.