The Granulate Blog: Spark

Cloudera vs Databricks vs Snowflake: Choosing the Right Data Management Platform for Your Needs
Dive into the world of three major players in data management: Cloudera, Databricks, and Snowflake, and discover which one is right for your...
Optimizing AI: Large-Scale Data Processing and Analytics
The second installment in Intel Granulate’s Optimizing AI series: a deep dive into optimizing large-scale data processing and analytics applications for...
Spark Streaming (Spark Structured Streaming): the Basics and a Quick Tutorial
Spark Structured Streaming is a newer streaming engine that provides a declarative API, offers end-to-end fault tolerance, and supports more...
Azure Databricks: Spark on Steroids in the Azure Cloud
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud.
Session Recap: Best Practices for Embracing EKS for Spark Workloads
If you missed the live session, read this recap on the best practices for embracing EKS for Spark workloads.
PySpark Tutorial: Setup, Key Concepts, and MLlib Quick Start
PySpark is a library that lets you work with Apache Spark in Python. Apache Spark is an open-source distributed general-purpose...
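As a rough sketch of the programming model the tutorial covers: PySpark’s classic word count chains the transformations flatMap, map, and reduceByKey on an RDD. The snippet below mimics that pipeline in plain Python (no Spark cluster required), with the real PySpark calls shown in comments; the sample lines are illustrative data, not from the tutorial.

```python
from collections import Counter
from itertools import chain

# Plain-Python sketch of PySpark's classic word count:
#   sc.textFile(path).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
# Each PySpark transformation is mimicked with a stdlib equivalent.

lines = ["spark makes big data simple", "big data needs spark"]

# flatMap: split every line into words and flatten the results
words = list(chain.from_iterable(line.split() for line in lines))

# map + reduceByKey: tally occurrences per word
counts = Counter(words)

print(counts["spark"])  # "spark" appears once in each line -> 2
```

In real PySpark the same chain runs in parallel across a cluster, with reduceByKey shuffling partial counts between executors; the functional shape of the code is the same.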
Spark vs. Kafka: 5 Key Differences and How to Choose
Apache Spark is an open-source, distributed system for processing large volumes of data. Apache Kafka is an open-source, high performance...
Apache Spark: Quick Start and Tutorial
Apache Spark is an open-source, distributed computing system for big data processing. Get a full tutorial and see how to get started with Apache...
Optimizing Resource Allocation for Apache Spark
Learn how resource allocation works in Apache Spark and how you can configure and optimize your Spark environment for maximum performance.
Understanding PySpark: Features, Ecosystem, and Optimization
PySpark is a Python library that allows users to interface with Apache Spark from Python.
Hadoop vs. Spark: 5 Key Differences and Using Them Together
Apache Hadoop is an open source framework for storing and processing large data sets across clusters of commodity hardware. Apache Spark is an open source...
5 PySpark Optimization Techniques You Should Know
PySpark is the Python API for Apache Spark, an open-source, distributed computing system that is designed for high-speed processing of...
Apache Spark: Architecture, Best Practices, and Alternatives
Apache Spark is an analytics engine that rapidly performs processing tasks on large datasets. It can distribute data processing tasks on...
Spark on AWS: How It Works and 4 Ways to Improve Performance
Apache Spark is an open source, distributed data processing system for big data applications. It enables fast data analysis using in-memory...
Introduction To Apache Spark Performance
In this article, we first present Spark’s fundamentals, including its architecture, components, execution modes, and APIs.