Spark has many configurations for tuning, and in this post we will look at some of the important and commonly used ones. For the full list of configurations, see the Spark documentation: https://spark.apache.org/docs/latest/configuration.html How to View and Change the Spark Configurations There are multiple ways to edit Spark configurations.... Continue Reading →
Analyzing Event Stream Dataset Using Spark
In this post, we will assume we have been given the task of analyzing clickstream events and generating a report of the most viewed items. We will use a sample dataset from Kaggle as input data. You can download the sample dataset from the following link: https://www.kaggle.com/retailrocket/ecommerce-dataset?select=events.csv There are multiple CSV files, we will... Continue Reading →
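To make the idea of a "top viewed items" report concrete, here is a minimal sketch of the counting logic in plain Python rather than Spark. The column names (`timestamp`, `visitorid`, `event`, `itemid`, `transactionid`) and the `view` event value are assumptions about the layout of events.csv, the sample rows are made up for illustration, and `top_viewed_items` is a hypothetical helper, not code from the post.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the header is an assumption about events.csv.
SAMPLE_EVENTS = """timestamp,visitorid,event,itemid,transactionid
1433221332117,257597,view,355908,
1433224214164,992329,view,248676,
1433221999827,111016,view,355908,
1433221955914,483717,addtocart,253185,
1433221337106,951259,view,355908,
1433224086234,972639,view,248676,
1433221923240,810725,transaction,443030,4000
"""


def top_viewed_items(csv_text, n=10):
    """Return the n most viewed item ids as (itemid, view_count) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep only "view" events and count how often each item id appears.
    views = Counter(row["itemid"] for row in reader if row["event"] == "view")
    return views.most_common(n)


top_items = top_viewed_items(SAMPLE_EVENTS, n=2)
print(top_items)
```

In Spark the same report would typically be a filter on the event type, followed by a group-by on the item id, a count, and a descending sort.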
Analyzing Apache Web Server Logs with Apache Spark
In this post we will analyze Apache web server log files, extract the HTTP status code from each log entry, and get the total number of responses for each status code. First we need to import the required packages. PySpark is an interface for Apache Spark in Python. Pyspark isn't on path by... Continue Reading →
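The status-code count described above can be sketched in plain Python before moving it to Spark. This is only an illustration under stated assumptions: the log lines are assumed to follow the Apache Common Log Format, the sample lines are made up, and `LOG_PATTERN` and `count_status_codes` are hypothetical names, not taken from the post.

```python
import re
from collections import Counter

# Assumed Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+')

# Made-up sample lines, including one that does not match the format.
SAMPLE_LOGS = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:40 -0700] "GET /missing.html HTTP/1.0" 404 209',
    '192.168.0.5 - - [10/Oct/2000:13:56:01 -0700] "POST /login HTTP/1.1" 200 512',
    'malformed line',
]


def count_status_codes(lines):
    """Count responses per HTTP status code, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


status_counts = count_status_codes(SAMPLE_LOGS)
print(dict(status_counts))
```

With PySpark, the same steps would usually be expressed with `regexp_extract` from `pyspark.sql.functions` to pull out the status code, followed by `groupBy` and `count` on that column.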