Data Analysis: Spotify Mega Hits

Series Part 1: Data Acquisition and Cleaning

Billions of listens. Pristinely engineered tracks. A record-label-powered marketing machine every producer craves. But what really makes a song a mega hit? The answer lies in our favorite snake and a few snippets of code.

Image by Author

Bet you’re stoked for this. Let’s get started. First, let’s fire up a Jupyter Notebook and import a few libraries we’ll be using.

Spotipy is a nifty lil Python client for the Spotify Web API that lets us pull track data straight from Spotify. Pandas lets us work on dataframes with ease. As more parts of this series come along, we’ll be working more with pandas and related libraries, so hold tight for now.

Spotify requires you to sign up to use their API. You can learn more about this here. Once you’ve populated the client credentials with your own, you should be good to go. Since we’re analyzing mega hits, I find it serendipitous that there is a playlist entitled ‘Mega Hit Mix’. Let’s store the link to that in a global variable. While we’re at it, let’s also declare a global dataframe to hold data we want to keep around between function calls.
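Here is a minimal sketch of that set-up, not the exact notebook cell. The names `PLAYLIST_LINK`, `track_df` and `build_client` are my own placeholders, and the playlist link is a dummy you would swap for the real one. Spotipy is imported inside the helper only so the cell still runs before the package is installed.

```python
import pandas as pd

# Placeholder: paste the link to the playlist you want to analyze here.
PLAYLIST_LINK = "https://open.spotify.com/playlist/<playlist-id>"

# Global data frame for data we want to keep around between function calls.
track_df = pd.DataFrame()


def build_client(client_id, client_secret):
    """Return a Spotipy client authenticated with app credentials.

    Hypothetical helper: pass in the client ID and secret from your own
    Spotify developer app.
    """
    # Imported here so the rest of the cell works without Spotipy installed.
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    auth = SpotifyClientCredentials(
        client_id=client_id, client_secret=client_secret
    )
    return spotipy.Spotify(auth_manager=auth)
```

Calling `build_client("your-id", "your-secret")` with real credentials should hand back a client you can reuse for every request in the notebook.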

Step 1: Time for our first function! Since we’re all about modular code, let’s write a super small function to pull data. This way we can call the function any time and get a dictionary of the playlist’s tracks.
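A function like that might look roughly like the sketch below. I’m assuming a function name of `get_playlist_tracks` and that `sp` is any authenticated Spotipy client. Unlike the raw dictionary described above, this version follows Spotify’s paging and returns a flat list of playlist items, which is a simplification on my part.

```python
def get_playlist_tracks(sp, playlist_link):
    """Collect every item in a playlist using a Spotipy-style client.

    `sp` is assumed to expose `playlist_items` and `next`, as
    `spotipy.Spotify` does.
    """
    results = sp.playlist_items(playlist_link)
    items = list(results["items"])

    # Spotify returns playlists one page at a time; keep following the
    # "next" link until there are no pages left.
    while results.get("next"):
        results = sp.next(results)
        items.extend(results["items"])

    return items
```

Passing the client in as an argument keeps the function small and makes it easy to try out with a stand-in object before touching the real API.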

Step 2: Hold tight, the next part’s going to get a little lengthy. Our next step is to use the track data pulled in Step 1 to create and populate a data frame with relevant elements. Let’s break it down step by step.

i. The for loop in our function iterates through the track dictionary to pull out four metrics. Let’s use lists to store these.

ii. Spotipy’s audio features function lets us get audio features for each track we just pulled. It requires track URIs as an input, and we have a list for just that!

iii. This part is purely experimental. From my experience producing, I find that most popular songs can be classified into three broad genres by BPM. Let’s make an arbitrary classification while we’re here — this might come in handy as a control for our unsupervised cluster analysis later.

iv. Lastly, we add columns to our function’s data frame with our populated lists.
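Steps i to iv can be sketched as below. Treat the details as assumptions: the four metrics (name, artist, URI, popularity), the BPM cut-offs of 100 and 120, the class labels and the column names are all mine, picked only to make the example concrete. `fetch_features` stands in for a call such as Spotipy’s `audio_features`, which takes a list of track URIs.

```python
import pandas as pd

# Hypothetical BPM cut-offs, chosen only for illustration.
SLOW_MAX_BPM = 100
MID_MAX_BPM = 120


def classify_by_bpm(tempo):
    """Bucket a tempo in BPM into one of three placeholder classes."""
    if tempo < SLOW_MAX_BPM:
        return "slow"
    if tempo <= MID_MAX_BPM:
        return "mid"
    return "fast"


def build_track_frame(items, fetch_features):
    """Turn playlist items into a data frame of track data.

    `items` is a list of playlist entries as returned by the Spotify Web
    API; `fetch_features` takes a list of URIs and returns one feature
    dict per track, in the same order.
    """
    names, artists, uris, popularity = [], [], [], []

    # i. Walk the playlist entries and collect four metrics into lists.
    for item in items:
        track = item["track"]
        names.append(track["name"])
        artists.append(track["artists"][0]["name"])  # first listed artist
        uris.append(track["uri"])
        popularity.append(track["popularity"])

    # ii. Fetch audio features for the URIs we just collected.
    # The Web API caps this at 100 tracks per call, so a longer
    # playlist would need to be sent in chunks.
    features = fetch_features(uris)

    # iii. Arbitrary three-way classification by tempo.
    tempos = [f["tempo"] for f in features]
    bpm_class = [classify_by_bpm(t) for t in tempos]

    # iv. Assemble the populated lists into columns.
    return pd.DataFrame({
        "Name": names,
        "Artist": artists,
        "URI": uris,
        "Popularity": popularity,
        "Loudness": [f["loudness"] for f in features],
        "Duration": [f["duration_ms"] for f in features],
        "Tempo": tempos,
        "BPM Class": bpm_class,
    })
```

With a real client you could pass `sp.audio_features` as `fetch_features`. The remaining 0–1 audio features (danceability, energy and so on) would be added as columns in the same way.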

Boom. We now have a pretty lil dataframe of track data just waiting to be analyzed. Let’s take a look at it.

At first glance, we can see that while our audio features are measured on a scale of 0–1, some important columns like Loudness, Duration and Popularity are on scales of their own (Spotify reports loudness in decibels and popularity from 0 to 100, for example). This might skew our analysis later on, so let’s go ahead and normalize these columns using a simple min-max scaler, which maps each value x to (x − min) / (max − min).
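One way to write that scaler in plain pandas is sketched here. The function name `min_max_scale` is my own, and the numbers in the example frame are made up purely to show the effect.

```python
import pandas as pd


def min_max_scale(df, columns):
    """Return a copy of `df` with the given columns rescaled to 0-1."""
    scaled = df.copy()
    for col in columns:
        lo, hi = scaled[col].min(), scaled[col].max()
        if hi == lo:
            # A constant column has no spread; map it to 0 instead of
            # dividing by zero.
            scaled[col] = 0.0
        else:
            scaled[col] = (scaled[col] - lo) / (hi - lo)
    return scaled


# Made-up values, for illustration only.
example = pd.DataFrame({
    "Loudness": [-10.0, -5.0, 0.0],
    "Duration": [180000, 210000, 240000],
    "Popularity": [50, 75, 100],
})
normalized = min_max_scale(example, ["Loudness", "Duration", "Popularity"])
```

In this toy frame every column ends up as 0.0, 0.5 and 1.0, so all three now sit on the same scale as the 0–1 audio features.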

This is what our normalized columns now look like:

Set-up done and we’re ready to rumble. Stay tuned for the next part, where we’ll start off with data visualization and a correlation analysis.

Studying Financial Mathematics at USC. Catch me producing music, learning data science or surfing in my free time
