Introduction to ETL pipelines
ETL, or extract, transform, and load, is the process of taking data from one source, transforming it, and then loading it into a destination. As an example, suppose you have a mobile app to sell goods, which sends log data in a generic format to your backend services. Now, you want to use this log data to figure out how frequently your app is crashing. Because you don’t need all the log fields, you can remove unwanted ones and transform the structure into a format that makes sense to you; then, you can load it into a database. This is ETL.
In a big data project, the collected data serves many different use cases across departments, with each department having its own ETL pipeline that transforms the incoming data per a predefined structure before loading it into a database.
Stages of ETL
Let’s look more closely at the three stages of ETL:
Extract: Data is extracted from the source data store (either internal or external) to the organization, aka extraction. Examples of data sources include SQL databases, data lakes, HTTP endpoints, and streaming tools such as Kafka.
Transform: Transformations are applied to the extracted data; these can be as simple as removing an unnecessary field to computing derivations from the data. Tools such as Apache Spark are used to transform data, after which it is written to another data store for other systems to read. ELT, discussed below, is when a transformation is done in place, meaning the source and the destination are the same.
Load: The last stage in the pipeline is the loading of data to the destination data store. Data transformation is usually done in memory, so it is important to persist the transformed data to disk. Tools here include Apache Spark, which has database connectors available to load the data directly from Spark to a destination database.
ETL vs. ELT
ELT is very similar to ETL in that you still perform the extraction, transformation, and loading of data, but not in the same order. Because the volume and the velocity of data are too high in a big data project, it becomes difficult to transform all of it before it actually gets loaded into a data store. So to avoid data loss or very slow pipelines, data is first loaded to the destination from the source before transformation.
In ELT, the data transformation happens in place, meaning the data is extracted and written to the same data store.
In most cases, the same tools and technologies used in ETL can be used in ELT, but there are specialized tools for the latter that you may want to implement for certain use cases.
Choosing the Right Data Tool
There are two approaches you can take for both ETL and ELT pipelines: transform all incoming data as a stream or do it in batches. This decision depends on your business use case and how important the recency of the data is. Here are some of the most commonly used tools:
- Apache Spark: Built on top of Hadoop’s MapReduce, Spark can connect to HDFS, Kafka, HBase, and other databases to extract data, transform it, and then load it to another database. WIth abstractions over MapReduce, Spark is easy to implement.
- Apache Kafka: Kafka is the tool of choice for message streaming at scale. It offers powerful features such as topic partitioning and consumer groups to improve parallelism. Kafka is a great choice for delivery of data to different tools.
- Amazon S3: S3 is used as a data lake in pipelines that are built in the AWS ecosystem. DIfferent AWS services can connect to S3 for both extracting and loading of data.
- Amazon Kinesis Data Firehose: Kinesis Firehose is the data delivery service within AWS. It is used to deliver data that is either extracted or transformed into data lakes such as S3, RedShift, etc.
Challenges with ETL
Every system poses difficulties, and an ETL pipeline is no different. Here are a few common challenges you may face.
With big data comes big scaling challenges. These can be solved in the initial stages of the pipeline by simply adding extra hardware, but this is otherwise not a long-term solution.
Certain design aspects of an ETL pipeline can help reduce the need for scaling; e.g., the ability to remove unwanted data, perform transformations at multiple stages, and scheduling.
Also, selecting a tool that can handle scaling is very important. For example, instead of a custom service with HTTP endpoints as interfaces to stream data, use tools such as Kafka and Kinesis to reduce complexity and improve scaling.
Another option is to use a fully managed cloud solution instead of hosting everything on-premises.
ETL pipelines usually have data coming in from a variety of different sources, and each source can have its own data structure. It’s up to you to massage that data into an optimized structure for your business needs.
Of course, as the number of data sources increases, so does the complexity of data transformation. Separate pipelines for each source can address this issue. But this creates yet another challenge as data sources increase. You have to strike a balance between how many sources you can club into one pipeline to apply common transformations and at how many stages this clubbing can happen so that ultimately you have only one stream of data to process.
Performance is another challenge that you face when working with a massive scale of data. If you have an on-premises setup, you have to make sure that you have enough hardware to not only handle the scale but also optimally perform no matter the scale. This is not so much of a problem with a cloud offering and is one of the reasons for the growing adoption of cloud services. The Big 3—Amazon, Microsoft, and Google—provide all the necessary services to set up both ETL and ELT pipelines in the cloud for both batch processing and streaming. Also, depending on the scale of data, cloud services can achieve better cost-efficiency.
Optimizing the performance of an ETL or ELT pipeline depends entirely on data coming into the pipeline. Understanding that data is always the first step to optimization, and there are a few general principles that can help you along the way.
Identifying and Fixing Data Issues
If you are working with a data source that you don’t control, you can always expect the schema of the data to change. Identifying this early on will make sure you don’t lose a lot of data due to such a change. For each run of the ETL pipeline, check for errors caused by a schema change. If there are any, make sure to fix them and rerun the pipeline before the next run.
Checkpoints and Logging in the Pipeline
No matter how careful you are while developing your pipeline, you will encounter errors. But if you are unable to figure out the cause of an error, more problems will ensue. During each stage of development, make sure you log enough data frequently so that you can easily figure out at what stage the issue occurred and why.
Break It Down into Chunks
In most cases, breaking down a problem into smaller chunks improves the performance of the pipeline overall. For example, suppose you have some sort of computation that repeatedly performs the same calculation on the same data as an intermediary step. Caching this data either in memory or to disk and reusing the result will speed up the pipeline a lot. Some tools, like Apache Spark, have built-in methods for this. Make sure you understand your tool well in order to utilize these features.
There’s a lot happening in the ETL and ELT space, with the volume, velocity, variety, veracity, variability, and value of data increasing on a daily basis. Keeping up with tooling and technology updates is important. But even more important is understanding how your pipelines are performing. So make sure you’re prepared for schema changes, increased load, and other issues when designing your pipeline so as not to be taken by surprise and to be able to address problems quickly.