As data engineers, one of our biggest tasks is analyzing data. It is an essential step for understanding problems and exploring data in meaningful ways. Data analysis helps us understand the past by exploring the data, and it supports predictive modeling by providing input to data science teams.
The data analysis process is composed of the following steps:
Statement of Problem
Problem definition starts with business-domain questions, such as: How is the newly deployed product feature performing? How do users interact with the product? Understanding the requirements from a business perspective helps determine the problem and sets the data analysis project up for success.
Data collection is the gathering of information from various sources, and it is usually done with ETL tools.
Extract, Transform, and Load (ETL) obtains data from various data sources, applies transformation functions that depend on our data model, and loads the resulting data into the destination.
Data extraction allows us to obtain data from multiple data sources, such as relational databases, streaming sources, text files (JSON, CSV, and XML), and NoSQL databases.
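As a minimal sketch of the extract step, the snippet below pulls records from two of the text formats mentioned above (CSV and JSON) into one list of dictionaries. The sample payloads are hypothetical stand-ins for real data sources.

```python
import csv
import io
import json

def extract_csv(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def extract_json(text):
    """Parse a JSON array of records into a list of dictionaries."""
    return json.loads(text)

# Hypothetical payloads standing in for real data sources
csv_data = "id,state\n1,VA\n2,Virginia\n"
json_data = '[{"id": 3, "state": "MD"}]'

# Combine records from both sources into one dataset
records = extract_csv(csv_data) + extract_json(json_data)
```

In a real pipeline the same pattern applies, with database cursors or stream consumers feeding the combined record list instead of in-memory strings.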
Data preparation is the process of collecting, normalizing, and transforming data into a dataset. You should watch for data quality issues such as missing, out-of-range, invalid, or duplicate values. This is the most important step of data analysis, as skipping it can lead to highly misleading results.
Important characteristics of high-quality data include accuracy, completeness, consistency, and validity.
Data cleaning, also known as data scrubbing or data cleansing, is the process of correcting or removing incorrect, inaccurate, incomplete, improperly formatted, or duplicated data from a dataset. The result of the data analysis process depends heavily on the quality of the data, so this step plays an important role in the process.
Some statistical methods can help us determine which records should be corrected or removed. To clean the data, or to make sense of unexpected, out-of-range, or erroneous values, we need an understanding of the business domain. Statistical validations like the ones below give insight into the data and help handle missing or out-of-range values in the dataset.
Mean: The usual average of the data, calculated by summing all values and dividing by the number of values.
Median: The middle value in the sorted list of values.
Constraints: Values that should fall within a certain range. They have a possible min/max value, and all values should lie within this range.
Clustering: Data obtained from a source can contain ambiguity or typos. For instance, VA and Virginia may both be valid values for a state field; these can be clustered into one canonical value to get accurate results and eliminate duplicates.
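The validations above can be sketched with the standard library alone. The sample values, the 0–100 constraint range, and the state alias table are illustrative assumptions, not part of the original text.

```python
from statistics import mean, median

values = [10, 12, 11, 13, 500, 12]  # 500 looks like an out-of-range value

avg = mean(values)    # pulled upward by the outlier: (558 / 6) = 93
mid = median(values)  # robust to the outlier: middle of the sorted list

# Constraint check: keep only values inside an expected min/max range
in_range = [v for v in values if 0 <= v <= 100]

# Clustering ambiguous labels: map synonyms to one canonical value
STATE_ALIASES = {"VA": "Virginia", "Virginia": "Virginia"}
states = ["VA", "Virginia", "VA"]
canonical = [STATE_ALIASES[s] for s in states]
```

Comparing the mean (93) with the median (12) is itself a quick signal that the data contains outliers worth investigating.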
Text Parsing and Validation
We can parse text to validate whether a string is well formatted and free of syntax errors. Regular expressions are typically used for this kind of validation. For example, if you expect a certain date format (or formats), you can use a regular expression to validate a date field in the dataset and eliminate the values that have syntax errors. Almost every programming language provides a library for regular expression operations.
Normalization is usually required when working with attributes on different scales. Although there are several methods for normalizing data, the most common is min-max normalization. It shifts and rescales values so that they end up ranging between 0 and 1: a value equal to the column minimum becomes 0, and a value equal to the column maximum becomes 1.
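A minimal sketch of regex-based date validation, assuming the expected format is YYYY-MM-DD (the format and sample values are illustrative):

```python
import re

# Hypothetical expected format: YYYY-MM-DD
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def is_valid_date(value):
    """Return True if the value matches the expected date format."""
    return bool(DATE_RE.match(value))

dates = ["2021-05-01", "05/01/2021", "2021-5-1"]
valid = [d for d in dates if is_valid_date(d)]
```

Records that fail the check can be corrected (for example, by reparsing with an alternate format) or dropped, depending on the business rules.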
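Min-max normalization as described above can be written in a few lines; the sample scores are illustrative:

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

scores = [10, 20, 15, 30]
normalized = min_max_normalize(scores)
# 10 (the min) maps to 0.0, 30 (the max) maps to 1.0
```

Note that a real implementation would need to handle a column where all values are equal, since `hi - lo` is then zero.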
Data transformations allow us to clean, convert, aggregate, join, validate, replace, split, and format the data. There are many ETL tools, and they provide a wide range of transformation options. For instance, Apache NiFi is an open source ETL platform in which transformations are called processors; you can check some of the processors supported by NiFi here.
This step is the process of looking at the data, either visually or statistically, and trying to find patterns, relations, and connections between entities in the data. A variety of tools can help with data exploration and visualization. One I recently came across is Loginom, a platform that provides tools for data transformation, preparation, modeling, and visualization. Although its options are limited, it seems a good tool, and they will improve it further in the future. There are also better-known visualization tools such as Tableau and Power BI.
Data exploration starts with asking questions about the data:
- What are the values in each column?
- What unexpected values are in each column?
- Are there any data format irregularities, such as datetimes missing hours?
- What relationships are there between columns?
- What are the frequencies of values in each column, and do these frequencies make sense?
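Some of the questions above can be answered programmatically before reaching for a visualization tool. The sketch below, using an illustrative column of state values, checks value frequencies and counts unexpected (missing) entries:

```python
from collections import Counter

# Hypothetical column of state values, including a missing entry
column = ["VA", "MD", "VA", None, "VA"]

# What values appear in the column, and how often?
freq = Counter(v for v in column if v is not None)

# What unexpected values are there, e.g. missing entries?
missing = sum(1 for v in column if v is None)
```

Frequencies that look wrong (a state appearing far more often than expected, or many missing entries) point back to the data preparation and cleaning steps.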
These are the overall data analysis steps. They should give you an idea of how to start a data analysis project, which steps to follow, and which questions to ask to get the analysis results right.