How Web Scraping is Used in Apple Music Streaming Data Analysis?

X-Byte Enterprise Crawling
Dec 21, 2021

In this study, we analyze my personal music streaming history from Apple Music, the music and video streaming service from Apple Inc. The dataset used here is my own listening activity on the platform.

We will cover the following topics:

  • Data requests and downloads
  • Cleaning and preparing data
  • Analyzing data and gaining interesting insights from it

Requesting and Downloading Data

Apple will provide you with a copy of your personal data if you request it, for example through its Data and Privacy page (privacy.apple.com).

Data Preparation and Cleaning

  • Import the required libraries.
  • Load the dataset (CSV file).
  • Examine the dataframe’s shape and columns.
  • Look for any missing values.
  • Examine the basic statistics of the columns.

Importing Libraries

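import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px  # current import path; the standalone plotly_express package is a legacy wrapper

# later snippets refer to `colors`; assumed here to alias Plotly's qualitative palettes
colors = px.colors.qualitative

pd.set_option('display.max_columns', None)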

Load the Dataset

The dataframe has the following columns:

Apple Id Number, Apple Music Subscription, Artist Name, Build Version, Client IP Address, Content Name, Content Provider, Content Specific Type, Device Identifier, End Position In Milliseconds, End Reason Type, Event End Timestamp, Event Reason Hint Type, Event Received Timestamp, Event Start Timestamp, Event Type, Feature Name, Genre, Item Type, Media Duration In Milliseconds, Media Type, Metrics Bucket Id, Metrics Client Id, Milliseconds Since Play, Offline, Original Title, Play Duration Milliseconds, Provided Audio Bit Depth, Provided Audio Channel, Provided Audio Sample Rate, Provided Bit Rate, Provided Codec, Provided Playback Format, Source Type, Start Position In Milliseconds, Store Country Name, Targeted Audio Bit Depth, Targeted Audio Channel, Targeted Audio Sample Rate, Targeted Bit Rate, Targeted Codec, Targeted Playback Format, User's Audio Quality, User's Playback Format, UTC Offset In Seconds
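The loading step itself is not shown here, so the following is only a minimal sketch of how the file might be read and inspected. The tiny inline CSV and the helper name load_music_data are made up for illustration; with the real export you would pass the path of the downloaded CSV file instead.

```python
import io

import pandas as pd

# Made-up sample standing in for the exported CSV (assumption, not real data).
SAMPLE_CSV = """Artist Name,Content Name,Play Duration Milliseconds,Original Title
Artist A,Song 1,180000,
Artist B,Song 2,,
Artist A,Song 1,200000,
"""


def load_music_data(source):
    """Read the play-activity CSV from a file path or a file-like object."""
    return pd.read_csv(source)


music_df = load_music_data(io.StringIO(SAMPLE_CSV))

print(music_df.shape)             # (rows, columns)
print(music_df.columns.tolist())  # column names
print(music_df.isnull().sum())    # missing values per column
print(music_df.describe())        # basic statistics of numeric columns
```

These four calls cover the inspection steps listed under Data Preparation and Cleaning: shape, columns, missing values, and basic statistics.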

There are 21,147 streaming records with 45 features in total. To gain insights from the data, our first task is to remove the columns that aren’t needed. Several columns contain only NULL values, so we eliminate those first.

nans = [col for col in music_df.columns if music_df[col].isnull().all()]
print(nans)

['Original Title', 'Provided Audio Bit Depth', 'Provided Audio Channel', 'Provided Audio Sample Rate', 'Provided Bit Rate', 'Provided Codec', 'Provided Playback Format', 'Targeted Audio Bit Depth', 'Targeted Audio Channel', 'Targeted Audio Sample Rate', 'Targeted Bit Rate', 'Targeted Codec', 'Targeted Playback Format', "User's Audio Quality", "User's Playback Format"]

# drop the above columns from the dataframe
music_df.drop(nans, axis=1, inplace=True)
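As an aside, pandas can drop all-null columns in a single call. The sketch below uses a made-up three-column frame rather than the real data; it is an alternative to the list comprehension, not the method used in this article.

```python
import numpy as np
import pandas as pd

# Made-up frame: column "b" is entirely missing, column "c" only partly.
demo_df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],
    "c": ["x", None, "z"],
})

# how="all" drops a column only when every value in it is missing,
# so the partly filled column "c" is kept.
cleaned_df = demo_df.dropna(axis=1, how="all")
print(cleaned_df.columns.tolist())
```

The explicit list has one advantage: it lets you print and review the column names before removing them.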

There are a few columns like “Apple Id Number” and “Build Version” that aren’t really useful, so we’ll remove those as well.

to_delete = ['Apple Id Number', 'Build Version', 'Client IP Address', 'Device Identifier',
             'Metrics Bucket Id', 'Metrics Client Id', 'UTC Offset In Seconds', 'Store Country Name']
music_df.drop(to_delete, axis=1, inplace=True)

From the original 45 columns, we now have 22 columns in our dataframe. The final step is converting the timestamp columns, which are stored as object (string) dtype, to proper datetime values.

music_df['Event End Timestamp'] = pd.to_datetime(music_df['Event End Timestamp'], format='%Y-%m-%dT%H:%M:%S')
music_df['Event Received Timestamp'] = pd.to_datetime(music_df['Event Received Timestamp'], format='%Y-%m-%dT%H:%M:%S')
music_df['Event Start Timestamp'] = pd.to_datetime(music_df['Event Start Timestamp'], format='%Y-%m-%dT%H:%M:%S')

Questions and Answers

1. Who are the Top 10 Favorite Artists?

fig = px.bar(top_10_artist, title="Top 10 favourite artists",
             labels={"index": "Artists", "value": "No. of times song played"},
             color_discrete_sequence=px.colors.qualitative.Set2)
fig.show()
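The plotting call relies on top_10_artist, whose computation is not shown. One plausible way to build it, assuming it is simply a play count per artist, is value_counts on the 'Artist Name' column. The helper name top_n and the sample rows below are made up for illustration.

```python
import pandas as pd


def top_n(df, column, n=10):
    """Return the n most frequent values in `column` with their counts."""
    return df[column].value_counts().head(n)


# Made-up rows standing in for music_df.
demo_df = pd.DataFrame({"Artist Name": ["A", "B", "A", "C", "A", "B"]})

top_artists = top_n(demo_df, "Artist Name", n=2)
print(top_artists)
```

Under the same assumption, the helper would also cover the later questions: top_n(music_df, 'Content Name', 20) for songs, top_n(music_df, 'Content Provider', 10) for labels, and top_n(music_df, 'Genre') for genres.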

2. Which are the Top 20 Songs Played? (Favorite Songs)

fig = px.bar(top_20_songs, title="Top 20 favourite songs",
             labels={"index": "Songs", "value": "No. of times song played"},
             color_discrete_sequence=px.colors.qualitative.Bold)
fig.update_xaxes(tickangle=22)
fig.show()

3. Who are the Top 10 Favorite Content Providers?

fig = px.bar(top_10_labels, title="Top 10 favourite labels",
             labels={"index": "Music Labels", "value": "No. of times song label played"},
             color_discrete_sequence=px.colors.qualitative.Pastel)
fig.update_xaxes(tickangle=25)
fig.show()

To check top tracks from a specific music label provider, we will create a little helper function.

def top_10_song_of_label(label):
    """Show the most-played songs from a particular music label."""
    # keep only the rows for this label, then count plays per song
    label_df = music_df[music_df['Content Provider'] == label]
    top_10_song = label_df['Content Name'].value_counts()[:10]
    print(top_10_song)
    fig = px.bar(top_10_song,
                 labels={"index": "Song Names", "value": "No. of times song played", "variable": "Song name"},
                 title=f"Top songs from {label}")
    fig.show()

It is used like this, for example for the top Warner Music Group songs:

top_10_song_of_label('The Warner Music Group')

Hola (feat. Maluma)                                      82
I Don't Care                                             69
Thinking Out Loud                                        63
Attention                                                62
Perfect                                                  60
1, 2, 3 (feat. Jason Derulo & De La Ghetto)              59
Dirty Sexy Money (feat. Charli XCX & French Montana)     52
Hymn for the Weekend                                     51
Crown                                                    50
10,000 Hours                                             48
Name: Content Name, dtype: int64

Top Songs from T-Series

top_10_song_of_label('Super Cassettes Industries Pvt Limited a.k.a. T-Series')

Ishq Tera              66
Chota Sa Fasana        60
Maahi Ve               59
High Rated Gabru       50
Tu Chale               45
Tera Yaar Hoon Main    45
Befikra                41
Zindagi Do Pal Ki      40
Duniyaa                40
Chalte Chalte          40
Name: Content Name, dtype: int64

4. Which are the Top 10 Songs According to Playtime?

fig = px.bar(top_longest_played[:10],
             labels={"Content Name": "Song Names", "value": "Play Time (in mins)", "variable": "Duration"},
             color_discrete_sequence=colors.G10_r)
fig.show()
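The variable top_longest_played is not defined in the text. A minimal sketch of how it could be derived, assuming it is the summed 'Play Duration Milliseconds' per song converted to minutes, looks like this. The helper name playtime_by_song and the sample rows are made up.

```python
import pandas as pd


def playtime_by_song(df):
    """Total play time per song in minutes, longest first."""
    total_ms = df.groupby("Content Name")["Play Duration Milliseconds"].sum()
    minutes = total_ms / 60_000  # 60,000 ms in one minute
    return minutes.sort_values(ascending=False)


# Made-up rows standing in for music_df.
demo_df = pd.DataFrame({
    "Content Name": ["Song 1", "Song 2", "Song 1", "Song 3"],
    "Play Duration Milliseconds": [120_000, 300_000, 240_000, 60_000],
})

top_longest_played = playtime_by_song(demo_df)
print(top_longest_played[:10])
```

Ranking by summed play time can differ from ranking by play count, because a long track played a few times may outweigh a short track played often.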

5. What is the Usual Reason to End the Song?
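No code accompanies this question, so here is a hedged sketch of one way to answer it from the 'End Reason Type' column: count each reason and express it as a share of all plays. The reason labels below are placeholders, not values taken from the real export.

```python
import pandas as pd

# Placeholder reason labels (assumption); the real column holds Apple's own codes.
demo_df = pd.DataFrame({
    "End Reason Type": ["reason_a", "reason_b", "reason_a", "reason_a"],
})

# normalize=True returns fractions; multiply by 100 for percentages.
end_reasons = demo_df["End Reason Type"].value_counts(normalize=True) * 100
print(end_reasons)
```

The resulting series could be passed to px.bar or px.pie in the same way as the other charts in this section.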

6. Which is Your Most Favorite Genre?

fig = px.bar(top_genre, color_discrete_sequence=colors.T10_r)
fig.show()

7. Which Media Type Do You Prefer Most on Apple Music?

fig = px.pie(music_df, names='Media Type',
             color_discrete_sequence=colors.Dark2,
             title="Most preferable Media Type (eg. Audio/Video)")
fig.show()

8. Do You Prefer Listening to Music Online or Offline?

fig = px.pie(music_df, names="Offline", title="Do you prefer listening to music Offline?")
fig.show()

9. At What Time of Day Do You Prefer to Listen to Music?

fig = px.bar(hours, title="Most active hours (24hr)",
             labels={"value": "count", "Event Start Timestamp": "Timings (hours)"},
             color_discrete_sequence=colors.Prism)
fig.update_xaxes(dtick=1)
fig.show()
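The hours series is not defined in the text. Since the axis label refers to 'Event Start Timestamp', a reasonable guess is that it counts plays per hour of day through the .dt.hour accessor. The sketch below uses made-up timestamps.

```python
import pandas as pd

# Made-up timestamps standing in for music_df['Event Start Timestamp'].
demo_df = pd.DataFrame({
    "Event Start Timestamp": pd.to_datetime([
        "2021-01-05T08:15:00",
        "2021-01-05T08:45:00",
        "2021-02-10T21:30:00",
    ])
})

# Extract the hour (0-23), count plays per hour, and order by hour.
hours = demo_df["Event Start Timestamp"].dt.hour.value_counts().sort_index()
print(hours)
```

Sorting by index keeps the bars in clock order rather than ordered by frequency.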

10. In Which Month Have You Listened to the Most Songs?

m = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']
fig = px.bar(months, title="Most active Months", text=m,
             labels={"value": "count", "Event Start Timestamp": "Months"},
             color_discrete_sequence=colors.Light24)
fig.update_xaxes(dtick=1)
fig.show()

11. In Which Year Have You Listened to the Most Songs on Apple Music?

fig = px.bar(years, title="Most active year",
             labels={"value": "count", "Event Start Timestamp": "Year"},
             color_discrete_sequence=colors.Prism_r)
fig.update_xaxes(dtick=1)
fig.show()

12. Total Time Spent Listening to Music

total_mins = total_time/60000
print("Total minutes spent: {:.2f} mins".format(total_mins))
total_hours = total_mins/60
print("Total hours spent: {:.2f} hours".format(total_hours))

Total minutes spent: 24568.91 mins
Total hours spent: 409.48 hours
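The snippet divides total_time by 60000, which suggests total_time is a sum in milliseconds. Assuming it comes from the 'Play Duration Milliseconds' column, the full conversion chain can be sketched as follows, with made-up durations chosen to add up to exactly one hour.

```python
import pandas as pd

# Made-up durations: 3 min + 4 min + 53 min = 60 min.
demo_df = pd.DataFrame({
    "Play Duration Milliseconds": [180_000, 240_000, 3_180_000],
})

total_time = demo_df["Play Duration Milliseconds"].sum()  # milliseconds
total_mins = total_time / 60_000                          # ms -> minutes
total_hours = total_mins / 60                             # minutes -> hours

print("Total minutes spent: {:.2f} mins".format(total_mins))
print("Total hours spent: {:.2f} hours".format(total_hours))
```

The same arithmetic links the two reported figures: 24568.91 minutes divided by 60 gives 409.48 hours.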

The maximum amount of time available for listening, from the first stream to the last, is:

total_possible_hours = total_possible_time * 24
print("Total possible hours from start to end: {} hours".format(total_possible_hours))

Total possible hours from start to end: 31632 hours
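Because total_possible_time is multiplied by 24, it is presumably a number of days; 31632 hours corresponds to 1318 days. One way it could be obtained, assuming it is the span between the earliest and latest 'Event Start Timestamp', is sketched below with two made-up timestamps.

```python
import pandas as pd

# Made-up first and last events, 10 days and 9.5 hours apart.
ts = pd.to_datetime(pd.Series(["2021-01-01T09:00:00", "2021-01-11T18:30:00"]))

# .days keeps only the whole days of the time span.
total_possible_time = (ts.max() - ts.min()).days
total_possible_hours = total_possible_time * 24

print("Total possible hours from start to end: {} hours".format(total_possible_hours))
```

Note that .days truncates partial days, so the result slightly understates the true span.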

The important question now is how much of my total available time was spent listening to music.

hours_spent_list = np.array([total_hours, total_possible_hours])
hours_spent_list_labels = ["Actual Hours Spent", "Possible Hours"]
fig, ax = plt.subplots(figsize=(12, 6))
ax.pie(hours_spent_list, labels=hours_spent_list_labels, autopct='%1.1f%%',
       explode=[0.2, 0.2], startangle=180, shadow=True)
plt.title("Hours Spent Percentage")
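The share can also be worked out directly from the two figures reported above. This small sketch computes the fraction of available hours spent listening, and for comparison the size of the "actual" slice when both values are drawn as separate pie slices, as in the chart code.

```python
total_hours = 409.48          # reported total listening time
total_possible_hours = 31632  # reported hours between first and last stream

# Share of the available time that was spent listening.
share_of_available = total_hours / total_possible_hours * 100

# Slice size when actual and possible hours are plotted as two separate slices.
share_of_pie = total_hours / (total_hours + total_possible_hours) * 100

print("{:.2f}% of available hours".format(share_of_available))
print("{:.2f}% as a pie slice".format(share_of_pie))
```

The two numbers differ only slightly (about 1.29% versus 1.28%), so the pie still gives a fair picture at one decimal place.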

13. Daily Average Songs Played

total_songs = music_df.shape[0]
print("Daily average of songs played: {:.2f} songs".format(total_songs/total_possible_time))

Daily average of songs played: 16.04 songs


Originally published at https://www.xbyte.io.
