First time using Spark

Sharing my story about my first time using the Spark framework

Dener Moreira
4 min read · Mar 17, 2021

You will probably hear about Apache Spark if you have to work with so-called Big Data, but how much data justifies migrating from the well-known pandas to Spark? Should I choose Spark or Hadoop?

The main goal of Spark is to process huge amounts of data (I would say from 10 GB to 1 TB), and it does this very well because Spark uses RDDs, or Resilient Distributed Datasets, which are evaluated lazily: transformations only describe the computation, and nothing is executed until you call an action. Spark can be used with Apache Hadoop (for storage), and its use is optimized when your data is stored in a cluster.
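
To make the lazy-evaluation point concrete, here is a minimal PySpark sketch (the numbers are arbitrary): transformations like filter only build an execution plan, and only the action at the end triggers the computation.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

# filter() is a transformation: Spark only records it in the execution plan
numbers = spark.range(1_000_000)
evens = numbers.filter(numbers.id % 2 == 0)

# count() is an action: this is the moment the plan actually runs
print(evens.count())  # 500000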

In my project (a Udacity project) I had a dataset with information about the users of a fictitious music platform called “Sparkify”, such as the user’s name, the user’s plan (free or paid), the page the user was on, the time the user spent on the platform, and other variables that you can see below.

Columns of the dataset and their types
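
Just to illustrate, loading the data and inspecting the schema looks roughly like this in PySpark (the file name is only an example, not the real path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify").getOrCreate()

# The path is illustrative; the events are stored as a JSON log
df = spark.read.json("sparkify_event_data.json")
df.printSchema()  # prints every column with its type, as in the figure above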

The problem

The goal of this project was to predict whether a user would cancel their plan or not, which is known as churn prediction, and the column that gives us this information is “page”, as you can see below. Fortunately for the Sparkify company, few people have cancelled their service, but this can cause a problem when using machine learning: if we have too few observations of a specific class, the model cannot learn much about that class. This is known as an imbalanced class problem. Take a look at the picture below: the number of users who cancelled appears under “Cancellation Conf…”, and there are only 52 of them.

Number of users per page
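
As a sketch, getting these counts in PySpark could look like the lines below, assuming the DataFrame df from above and that the truncated label corresponds to the full page value "Cancellation Confirmation" (an assumption on my part):

from pyspark.sql import functions as F

# How many events land on each page
df.groupBy("page").count().orderBy(F.desc("count")).show(truncate=False)

# Distinct users who reached the cancellation page (52 in this dataset)
df.filter(F.col("page") == "Cancellation Confirmation").select("userId").distinct().count()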

Strategy

My strategy to solve this problem was to first identify the explanatory variables and the target variable, and then do some exploratory analysis. The main goal here is to build a machine learning model to predict customer churn, which is a classification problem.

Metric

The metric I used to evaluate the model was the F1-Score, and I think it is the best choice to measure performance in a classification problem with imbalanced data, since it balances precision and recall instead of rewarding the model for simply predicting the majority class.
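
For reference, the F1-Score is the harmonic mean of precision and recall:

F1 = 2 * (precision * recall) / (precision + recall)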

Data exploration

After analysing the dataset to find some insights and understand how the data is distributed, I also had to do some transformations, for instance converting the target variable (“page”) from string to numeric type. I did this by mapping every possible value to 0 except the cancellation confirmation, which became 1, so that 0 means not cancelled and 1 means cancelled; after that, I was able to cast the column to a numeric type.
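
One way to do this in PySpark is with when/otherwise and a window that propagates the flag to every row of a user who cancelled; the column names and the full page value "Cancellation Confirmation" are my assumptions based on the truncated label above.

from pyspark.sql import functions as F
from pyspark.sql import Window

# 1 for the cancellation event, 0 for every other page
cancel_flag = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)

# Mark every row of a user who cancelled at least once
user_window = Window.partitionBy("userId")
df = df.withColumn("churn_event", cancel_flag)
df = df.withColumn("churn", F.max("churn_event").over(user_window))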

I repeated the same process not only with this variable but with all the other categorical ones, and the reason is simple: a machine learning model doesn’t “learn” from words, it “learns” from numbers.
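
For the other categorical columns, Spark ML’s StringIndexer does this mapping from strings to numbers. A sketch for the “level” column (free/paid), repeated for the remaining string columns:

from pyspark.ml.feature import StringIndexer

# Maps each distinct string value of "level" to a numeric index
indexer = StringIndexer(inputCol="level", outputCol="level_idx")
df = indexer.fit(df).transform(df)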

A few steps before building the model, I had to scale the data using MinMaxScaler; the transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
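
In Spark ML, MinMaxScaler works on a single vector column, so the features are assembled first. The feature names below are assumptions for illustration, not the exact ones from my notebook:

from pyspark.ml.feature import VectorAssembler, MinMaxScaler

# Assemble the (already numeric) features into one vector column
assembler = VectorAssembler(inputCols=["level_idx", "gender_idx", "num_songs"],
                            outputCol="features_raw")

# Rescale every feature to the [0, 1] range, as in the formula above
scaler = MinMaxScaler(inputCol="features_raw", outputCol="features")

assembled = assembler.transform(df)
df = scaler.fit(assembled).transform(assembled)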

Modelling and Validation

After the pre-processing step, I finally built the model. In this case I used Logistic Regression, and the final F1-Score was approximately 78.5%.
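
A sketch of this final step with Spark ML, where the split ratio, seed and column names are assumptions:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

train, test = df.randomSplit([0.8, 0.2], seed=42)

# Fit the classifier on the scaled feature vector and the churn label
lr = LogisticRegression(featuresCol="features", labelCol="churn")
model = lr.fit(train)
predictions = model.transform(test)

# F1-Score on the held-out data
evaluator = MulticlassClassificationEvaluator(labelCol="churn",
                                              predictionCol="prediction",
                                              metricName="f1")
print(evaluator.evaluate(predictions))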

That’s not bad, but we could do better with more data, especially more observations of users who cancelled their plan.

Conclusion

I can conclude that Apache Spark is a great framework when you have to work with huge amounts of data, and it’s much faster than Hadoop MapReduce. But that’s not always the case: sometimes you have just a single file with tens of thousands of lines, and it would be simpler and faster to use plain Python and its libraries.
