Apache Iceberg is an open table format designed for huge, petabyte-scale tables, bringing transactions and record-level updates and deletes to data lakes. The project was originally developed at Netflix to solve long-standing issues with their use of such large tables, and it was open-sourced in 2018 as an Apache Incubator project. Amazon Web Services (AWS) recently announced the public preview of Amazon Athena ACID transactions,... Continue Reading →
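As a quick illustration of what those record-level operations look like, here is a minimal PySpark sketch; it assumes a Spark session with the Iceberg runtime jar on the classpath and a Hadoop catalog named `local`, and the table schema is purely hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime jar is available and configures a
# Hadoop catalog named "local"; adjust the warehouse path as needed.
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table (hypothetical schema, for illustration only).
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id BIGINT, category STRING, amount DOUBLE
    ) USING iceberg
""")

# Record-level operations that plain Parquet directories do not support.
spark.sql("INSERT INTO local.db.events VALUES (1, 'click', 0.5), (2, 'view', 1.0)")
spark.sql("UPDATE local.db.events SET amount = 2.0 WHERE id = 2")
spark.sql("DELETE FROM local.db.events WHERE category = 'click'")
```

Each of these statements commits a new table snapshot, which is what gives Iceberg its transactional behavior.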
Step by Step Data Analysis Process
As data engineers, one of our biggest tasks is analyzing data. It is a vital step in understanding problems and exploring data in meaningful ways. Data analysis helps us understand the past by exploring the data, and it supports predictive modeling by providing input to data science teams. The... Continue Reading →
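As a hedged sketch of the exploration step described above, the following PySpark snippet profiles a dataset's schema, summary statistics, and missing values; the file path is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explore").getOrCreate()

# Hypothetical input file; replace with your own dataset.
df = spark.read.csv("data/sample.csv", header=True, inferSchema=True)

df.printSchema()        # what columns and types do we have?
df.describe().show()    # basic summary statistics per column

# Count missing values per column to gauge data quality.
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()
```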
Optimizing and Tuning Apache Spark
Spark has many configurations for tuning, and in this post we will look at some of the most important and commonly used ones. For a full list of configurations, see the Spark documentation at https://spark.apache.org/docs/latest/configuration.html How to View and Change the Spark Configurations There are multiple ways to edit Spark configurations.... Continue Reading →
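For illustration, here is a minimal sketch of two common ways to set and inspect configurations, via the `SparkSession` builder at startup and via `spark.conf` at runtime; the specific keys and values shown are just examples:

```python
from pyspark.sql import SparkSession

# One way: set configurations when building the session. The same keys can
# also be passed on the command line, e.g. spark-submit --conf key=value.
spark = (
    SparkSession.builder
    .appName("config-demo")
    .config("spark.sql.shuffle.partitions", "64")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)

# Inspect a value at runtime.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# SQL-visible configurations can be changed on a live session.
spark.conf.set("spark.sql.shuffle.partitions", "128")
```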
Quick Introduction to Apache NiFi and Key Features
Apache NiFi is one of the most popular ETL platforms in the open-source community. It provides a web-based user interface for creating, monitoring, and controlling data flows. Apache NiFi Terms While working with NiFi, there are terms you need to become familiar with, as they are the important aspects of NiFi. The most important building blocks... Continue Reading →
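Although NiFi is driven primarily from its web UI, it also exposes a REST API. Below is a hedged sketch that polls the system diagnostics endpoint, assuming an unsecured local NiFi instance listening on port 8080; secured installs require HTTPS and an access token, and the response field names should be verified against your NiFi version:

```python
import requests

# Assumes an unsecured local NiFi instance (e.g. a dev setup) listening
# on http://localhost:8080.
BASE = "http://localhost:8080/nifi-api"

resp = requests.get(f"{BASE}/system-diagnostics", timeout=10)
resp.raise_for_status()

# Pull a few health indicators out of the aggregate snapshot.
snapshot = resp.json()["systemDiagnostics"]["aggregateSnapshot"]
print("Heap used:", snapshot["usedHeap"])
print("Available processors:", snapshot["availableProcessors"])
```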
Analyzing Event Stream Dataset Using Spark
In this post, we will assume we were given a task to analyze clickstream events and generate a report of the top viewed items. We will use a sample dataset from Kaggle as input data. You can download the sample dataset from the following link: https://www.kaggle.com/retailrocket/ecommerce-dataset?select=events.csv There are multiple CSV files; we will... Continue Reading →
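As a sketch of the aggregation in question, the following PySpark snippet counts "view" events per item in `events.csv` and keeps the ten most viewed; the column names (`event`, `itemid`) match the Retailrocket dataset, but verify them against your download:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream").getOrCreate()

# events.csv from the Retailrocket dataset: timestamp, visitorid, event,
# itemid, transactionid.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Count "view" events per item and keep the ten most viewed.
top_viewed = (
    events.filter(F.col("event") == "view")
          .groupBy("itemid")
          .count()
          .orderBy(F.desc("count"))
          .limit(10)
)
top_viewed.show()
```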
Running Kafka Locally on Windows Using Docker
In this post, we will discuss how you can run Kafka on your Windows machine. If you are looking to create a local development environment that uses Kafka, the easiest way is to get the Confluent Platform Docker image and run it with Docker Compose. Compose is a tool for defining and running multi-container Docker applications,... Continue Reading →
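A minimal sketch of what such a `docker-compose.yml` might look like with the Confluent Platform images, using a single ZooKeeper node and a single broker; the image tags and ports are illustrative:

```yaml
# Single-broker dev setup; not suitable for production.
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.3.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    image: confluentinc/cp-kafka:7.3.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"   # expose the host-facing listener
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Two listeners: one for containers on the compose network, one for
      # clients on the host machine.
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```

Running `docker compose up -d` in the same directory should bring up a broker reachable from the host at `localhost:9092`.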
Analyzing Apache Web Server Logs with Apache Spark
In this post, we will analyze Apache web server log files, extract the HTTP status code information from them, and get the total number of responses for each status code. First we need to import the required packages. PySpark is an interface for Apache Spark in Python. PySpark isn't on the path by... Continue Reading →
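As a sketch of the extraction described above, the following PySpark snippet pulls the status code out of each Common Log Format line with a regular expression and counts responses per code; the log file path is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("access-logs").getOrCreate()

# Hypothetical path; each line is in Apache Common Log Format, e.g.
# 127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
logs = spark.read.text("access.log")

# The status code is the 3-digit field right after the quoted request.
parsed = logs.select(
    regexp_extract("value", r'"\s(\d{3})\s', 1).alias("status")
)

# Total number of responses per HTTP status code.
parsed.groupBy("status").count().orderBy(col("count").desc()).show()
```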