Using Python to scrape the Billboard Hot-100 chart to generate a Spotify playlist

Introduction

In this post I shall go over how I used Python to create a Spotify playlist using the tracks taken from the billboard.com Hot-100 chart and the Spotify Web API.

Overview

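At a high level, the script scrapes the track and artist names from the billboard.com Hot-100 page, then uses the Spotify Web API to create a playlist and add the tracks to it.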

Here is a list of tools/technologies used:

Beautiful Soup

This Python module allowed me to easily extract data out of the billboard.com Hot-100 webpage. I chose this library since the billboard.com RSS feed did not contain enough data to pull the required track information.

I installed it with pip:

pip install beautifulsoup4

HTML page parsing

Before we dig deep into the Python code, it is important to look at the HTML page taken from the billboard.com webpage.

Each track on the Hot-100 chart is wrapped in a parent span tag with the following class.

<span class="chart-element__information">

The example below shows only the 2 main tags used. There are various others, such as weeks on chart, which are not in scope for this post.

<span class="chart-element__information">
<span class="chart-element__information__song text--truncate color--primary">No Guidance</span>
<span class="chart-element__information__artist text--truncate color--secondary">Chris Brown Featuring Drake</span>
....
</span>
<table class="has-fixed-layout"><thead><tr><th>Span class</th><th>Description</th></tr></thead><tbody><tr>
<td>chart-element__information__song text--truncate color--primary</td>
<td>Track name</td>
</tr><tr>
<td>chart-element__information__artist text--truncate color--secondary</td>
<td>Artist name</td>
</tr></tbody></table>

Now that there was a pattern to use to extract the relevant information, the next step was to look at how to extract this data using the Beautiful Soup library.

Here is an example of the code to use the library.

HTML_FILE is the path to a saved copy of the Hot-100 webpage.

You can use the requests library in Python to GET the page and save the text to disk. For testing, I cached the page on disk so that I didn’t request it again on every run.
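A minimal sketch of that caching step might look like the following. The function names, the cache file name and the chart URL used in the usage note are assumptions for illustration, not the code from my script. The downloader is passed in as a parameter so the caching logic can be tried without touching the network.

```python
from pathlib import Path


def fetch_with_requests(url):
    """Download a page with requests (imported lazily, only when a download is needed)."""
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def load_page(url, cache_path, fetch=fetch_with_requests):
    """Return the page text, downloading it only if no cached copy exists."""
    cache = Path(cache_path)
    if cache.exists():
        # Cached copy found: read it instead of requesting the page again.
        return cache.read_text(encoding="utf-8")
    text = fetch(url)
    cache.write_text(text, encoding="utf-8")
    return text
```

Calling load_page("https://www.billboard.com/charts/hot-100", "hot-100.html") would download the page on the first run and read hot-100.html on later runs.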

import logging

from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)

HTML_FILE = "hot-100.html"  # path to the saved page (example value)
ALL_TRACKS = {}  # maps track name -> artist name

# define a class to encapsulate a 'track'
class Track():
    def __init__(self,name,artist):
        self.name=name
        self.artist=artist

def getTop100Tracks():
    try:
        logger.info("Parsing Billboard Hot-100 tracks.")

        # pass the open html file to the BeautifulSoup constructor; features names
        # the parser to use, here html5lib (installed with: pip install html5lib)
        with open(HTML_FILE) as html:
            soup_parse = BeautifulSoup(html,features="html5lib")

        # find all span elements with the class 'chart-element__information'
        spans = soup_parse.find_all('span', {'class':'chart-element__information'})

        # iterate over the span elements extracting the track and artist names
        for s in spans:
            track = Track(
                s.find('span', {'class':'chart-element__information__song'}).text,
                s.find('span', {'class':'chart-element__information__artist'}).text)

            # save track info in a dictionary
            ALL_TRACKS[track.name] = track.artist

    except Exception:
        logger.error("Parsing of html page failed.")
        raise
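The artist text on the chart can credit several artists, as in "Chris Brown Featuring Drake" above, while the Spotify artist search later in this post takes a single artist name. One possible way to reduce a credit to its first-named artist is sketched below. The helper name and the list of connector words are assumptions, and other credit formats on the chart would need their own handling.

```python
import re

# Assumed connector words between credited artists; extend as needed.
_CONNECTORS = re.compile(r"\s+(?:Featuring|Feat\.)\s+", flags=re.IGNORECASE)


def primary_artist(credit):
    """Return the first-named artist from a chart credit string."""
    # Split once at the first connector and keep the text before it.
    return _CONNECTORS.split(credit, maxsplit=1)[0].strip()
```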

Now that the extraction of the data from the webpage is complete, let's take a look at the Spotify Web API.

Spotify Web API

Before using the Web API, please ensure you have completed the following. For more information see: https://developer.spotify.com/documentation/web-api/quick-start/

Set Up Your Account

To use the Web API, start by creating a Spotify user account (Premium or Free). To do that, simply sign up at www.spotify.com.

When you have a user account, go to the Dashboard page at the Spotify Developer website and, if necessary, log in. Accept the latest Developer Terms of Service to complete your account set up.

Register Your Application

Any application can request data from Spotify Web API endpoints and many endpoints are open and will return data without requiring registration. However, if your application seeks access to a user’s personal data (profile, playlists, etc.) it must be registered. Registered applications also get other benefits, like higher rate limits at some endpoints.

You can register your application, even before you have created it.

Authentication

Spotify offers 2 kinds of authorization, depending on what your app needs.

App Authorization: Spotify authorizes your app to access the Spotify Platform (APIs, SDKs and Widgets).

User Authorization: Spotify, as well as the user, grant your app permission to access and/or modify the user’s own data. For information about User Authentication, see User Authentication with OAuth 2.0. Calls to the Spotify Web API require authorization by your application user. To get that authorization, your application generates a call to the Spotify Accounts Service /authorize endpoint, passing along a list of the scopes for which access permission is sought.

Further details can be found at https://developer.spotify.com/documentation/general/guides/authorization-guide/

User Authorization

I used the User Authorization route since I needed access to modify the user’s own data.

Note: when registering your application, make sure you note down the client id, client secret and call back URL provided in the registration form. You will need those details later on.

This authentication flow typically uses the following path.

  1. Request Authorization code
  2. Use Authorization code to request Access and Refresh token
  3. Use Access token to access the Spotify Web API
  4. Use Refresh token to update expired Access token

Step 1 - Request Authorization code

Parameters:

  1. client id (required)
  2. response_type (required) - set to code
  3. call back URL (required)
  4. scope (optional) - a list of the authorization scopes you require access to
  5. state (optional but recommended) - a random string; provides protection against attacks such as cross-site request forgery
  6. show_dialog (optional) - boolean to force the user to approve the app again if they’ve already done so

Some of the authorization scopes are listed below. For a complete list, please see https://developer.spotify.com/documentation/general/guides/scopes/

playlist-read-collaborative

  • Include collaborative playlists when requesting a user’s playlists.

playlist-modify-private

  • Write access to a user’s private playlists.

playlist-modify-public

  • Write access to a user’s public playlists.

playlist-read-private

  • Read access to user’s private playlists.

user-read-private

  • Read access to user’s subscription details (type of user account).

user-library-modify

  • Write/delete access to a user’s “Your Music” library.

Using curl to get an authorization code.

curl -v 'https://accounts.spotify.com/authorize?client_id=<replace with client id>&response_type=code&redirect_uri=<replace with call back url>&scope=<replace with authorization scope>&state=<replace with random string>&show_dialog=false'

When providing a list of authorization scopes, ensure they are separated by a URL-encoded space (%20), e.g. user-read-private%20user-read-email%20playlist-modify-private
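The same authorization URL can also be assembled in Python instead of by hand. This is a sketch using only the standard library; the function name and argument names are mine, and the client id and call back URL are placeholders you would replace with your own.

```python
import secrets
from urllib.parse import quote, urlencode

AUTHORIZE_URL = "https://accounts.spotify.com/authorize"


def build_authorize_url(client_id, redirect_uri, scopes, state=None, show_dialog=False):
    """Return (url, state) for the authorization request."""
    # Generate a random state string when the caller does not supply one.
    state = state or secrets.token_urlsafe(16)
    params = {
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),
        "state": state,
        "show_dialog": "true" if show_dialog else "false",
    }
    # quote_via=quote encodes spaces as %20 rather than the default '+'.
    return AUTHORIZE_URL + "?" + urlencode(params, quote_via=quote), state
```

Keep the returned state value so that it can be compared with the one Spotify sends back on the redirect.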

Running curl with the verbose (-v) flag exposes the location header, which looks similar to the following.

< location: https://accounts.spotify.com/login?continue=https%3A%2F%2Faccounts.spotify.com%2Fauthorize%3Fscope%3Duser-read-private%2Buser-read-email%2Bplaylist-modify-private%26response_type%3Dcode%26redirect_uri%3Dhttps%253A%252F%252Fexample.com%252Fcallback%26state%3D34fFs29kd09%26client_id%3D<client_id>%26show_dialog%3Dfalse

Open the URL in a web-browser.

The user is asked to authorize access within the scopes. The Spotify Accounts service presents details of the scopes for which access is being sought.

The user is redirected back to the specified redirect_uri. After the user accepts, or denies your request, the Spotify Accounts service redirects the user back to your redirect_uri.

Once the app access has been authorized you will see a redirect uri with the authorization code like below.

https://example.com/callback?code=AQBcyrnRH8CnShgs...&state=<this is the state code used in the earlier curl cmd>
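Pulling the code out of that redirect URL by hand is error prone, so here is a small sketch that does it with the standard library and checks the state value at the same time. The function name is an assumption for illustration.

```python
from urllib.parse import parse_qs, urlparse


def extract_authorization_code(redirect_url, expected_state):
    """Return the authorization code from the redirect URL after checking state."""
    query = parse_qs(urlparse(redirect_url).query)
    # The state must match the value sent in the authorization request.
    if query.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch; the response may not belong to this request")
    # If the user denies access, the redirect carries an error instead of a code.
    if "error" in query:
        raise RuntimeError("authorization failed: " + query["error"][0])
    return query["code"][0]
```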

Step 2 - Request an Access token

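Requesting an Access token requires the following.

  1. base64 encoded client id:client secret (sent in the Authorization header)
  2. grant_type set to authorization_code
  3. authorization code
  4. redirect uri (the same one used in step 1)
  5. state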

Here is a curl POST example.

curl -H "Authorization: Basic <base64 client_id:client_secret>" -d grant_type=authorization_code -d 'code=AQAXSYw9unYk.....' -d 'redirect_uri=https://example.com/callback' -d 'state=<your state code>' https://accounts.spotify.com/api/token
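The base64 value in the Authorization header is the string client_id:client_secret encoded as a whole. A short sketch of building that header in Python follows; the function name is mine and the credentials in the usage note are dummy values.

```python
import base64


def basic_auth_header(client_id, client_secret):
    """Build the Basic Authorization header used by the token endpoint."""
    # Join id and secret with a colon, then base64-encode the whole string.
    raw = "{}:{}".format(client_id, client_secret).encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
```

For example, basic_auth_header("abc", "123") returns {"Authorization": "Basic YWJjOjEyMw=="}.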

A successful response looks like this.

{
    "access_token": "NgCXRK...MzYjw",
    "token_type": "Bearer",
    "scope": "user-read-private user-read-email",
    "expires_in": 3600,
    "refresh_token": "NgAagA...Um_SHo"
}
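The expires_in field is the token lifetime in seconds, so it can be turned into an absolute expiry time when the response is received. The sketch below shows one way to do that; the function names, the dictionary layout and the 60 second safety margin are assumptions rather than part of my script.

```python
import json
import time


def parse_token_response(body, now=None):
    """Turn the token endpoint's JSON body into a dict with an absolute expiry time."""
    now = time.time() if now is None else now
    payload = json.loads(body)
    return {
        "access_token": payload["access_token"],
        # A refresh response may not include a new refresh token.
        "refresh_token": payload.get("refresh_token"),
        "expires_at": now + payload["expires_in"],
    }


def is_expired(token, now=None, margin=60):
    """Report whether the token has expired, or will within margin seconds."""
    now = time.time() if now is None else now
    return now >= token["expires_at"] - margin
```

Checking is_expired before each call avoids sending a request that is known to fail.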

Step 3 - Use the Access token

Once you have the <access_token> you can use it to make calls to the Spotify Web API.

Step 4 - Refresh an expired Access token

If the access token has expired, make the following request, which will return a new access token.

curl -H "Authorization: Basic <base 64 client_id:client_secret>" -d grant_type=refresh_token -d refresh_token=NgAagA...Um_SHo https://accounts.spotify.com/api/token

Spotify Web API Endpoints

I’ve used several endpoints as part of the playlist creation. For a complete list see the Spotify Web API documentation.
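The functions below refer to endpoint constants such as SPOTIFY_API_PLAYLIST_TRACKS. As a rough guide, they could be defined along the following lines. The paths are Spotify Web API endpoints, but the exact format strings, placeholder positions and the country value are assumptions for illustration.

```python
# Assumed definitions of the endpoint constants used by the functions below.
SPOTIFY_API_BASE = "https://api.spotify.com/v1"

# {0} = playlist id, {1} = comma-separated track URIs
SPOTIFY_API_PLAYLIST_TRACKS = SPOTIFY_API_BASE + "/playlists/{}/tracks?uris={}"
SPOTIFY_API_PLAYLIST_CURRENT = SPOTIFY_API_BASE + "/me/playlists"
# {0} = Spotify user id
SPOTIFY_API_CREATE_PLAYLIST = SPOTIFY_API_BASE + "/users/{}/playlists"
# {0} = URL-encoded artist name
SPOTIFY_API_SEARCH_ARTIST = SPOTIFY_API_BASE + "/search?q={}&type=artist"
# {0} = artist id; this endpoint needs a market, and the parameter name
# (country here) should be checked against the current API reference
SPOTIFY_API_SEARCH_ARTIST_TOP = SPOTIFY_API_BASE + "/artists/{}/top-tracks?country=US"
SPOTIFY_API_TOKEN = "https://accounts.spotify.com/api/token"
```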

Python Functions

Add tracks to playlist.

API Endpoint used = SPOTIFY_API_PLAYLIST_TRACKS

HTTP method used = POST

Response = 201

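# imports used by all of the functions below
import json
import requests

def addTracksPlaylist(playlistId,trackUids,access_token):
    headers={"Accept" : "application/json", "Content-Type" : "application/json", "Authorization" : "Bearer {}".format(access_token)}

    # TLS certificate verification is left enabled (the requests default)
    response = requests.post(SPOTIFY_API_PLAYLIST_TRACKS.format(playlistId,trackUids),headers=headers)

    if(response.status_code == 201):
        return response.json()

    # raise an explicit exception; a bare 'raise' needs an active exception
    print("[ERROR] " + response.text)
    raise RuntimeError("Adding tracks failed with HTTP status {}".format(response.status_code))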
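The add-tracks endpoint accepts at most 100 items per request, which the Hot-100 only just fits into. The sketch below builds the comma-separated URI strings in batches. It assumes that trackUids is a comma-separated list of spotify:track: URIs and that you start from bare track ids; the helper name is mine.

```python
def track_uri_batches(track_ids, batch_size=100):
    """Yield comma-separated Spotify track URIs, at most batch_size per string."""
    uris = ["spotify:track:{}".format(track_id) for track_id in track_ids]
    for start in range(0, len(uris), batch_size):
        yield ",".join(uris[start:start + batch_size])
```

Each yielded string could then be passed as the trackUids argument of one addTracksPlaylist call.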

Get current user's playlists

API Endpoint used = SPOTIFY_API_PLAYLIST_CURRENT

HTTP method used = GET

Response = JSON 200

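def getPlaylists(access_token):
    headers={"Accept" : "application/json", "Content-Type" : "application/json", "Authorization" : "Bearer {}".format(access_token)}

    # TLS certificate verification is left enabled (the requests default)
    response = requests.get(SPOTIFY_API_PLAYLIST_CURRENT,headers=headers)

    if(response.status_code == 200):
        return response.json()

    body = response.json()
    if('error' in body and 'The access token expired' in body['error']['message']):
        print("[WARNING] The access token expired, requesting refresh.")
        access_token = getNewAccessToken()
        # the with block closes the file, so no explicit close() is needed
        with open('.accesstoken',mode='w') as w:
            w.write(access_token)

        # return the retried result so the caller receives the playlists
        return getPlaylists(access_token)

    # report any other error instead of silently returning None
    print("[ERROR] " + response.text)
    raise RuntimeError("Getting playlists failed with HTTP status {}".format(response.status_code))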

Create new playlist

API Endpoint used = SPOTIFY_API_CREATE_PLAYLIST

HTTP method used = POST

Payload = { name, description, public }

Response = 201

def createPlaylist(uid,name,description,access_token):
    headers={"Accept" : "application/json", "Content-Type" : "application/json", "Authorization" : "Bearer {}".format(access_token)}

    data = {}
    data['name'] = name
    data['description'] = description
    # a real boolean, so json.dumps emits false rather than the string "false"
    data['public'] = False

    # TLS certificate verification is left enabled (the requests default)
    response = requests.post(SPOTIFY_API_CREATE_PLAYLIST.format(uid),data=json.dumps(data),headers=headers)

    if(response.status_code == 201):
        return response.json()['id']

    body = response.json()
    if('error' in body and 'The access token expired' in body['error']['message']):
        print("[WARNING] The access token expired, requesting refresh.")
        access_token = getNewAccessToken()
        with open('.accesstoken',mode='w') as w:
            w.write(access_token)

        # retry only after a token refresh, and return the new playlist id
        return createPlaylist(uid,name,description,access_token)

    # any other error would fail again on retry, so report it
    print("[ERROR] " + response.text)
    raise RuntimeError("Creating playlist failed with HTTP status {}".format(response.status_code))

Search for Artist ID

API Endpoint used = SPOTIFY_API_SEARCH_ARTIST

HTTP method used = GET

HTTP request query requires artist name to search

Response = JSON 200

from urllib.parse import quote

def searchArtistId(artist,access_token):
    headers={"Accept" : "application/json", "Content-Type" : "application/json", "Authorization" : "Bearer {}".format(access_token)}

    # URL-encode the artist name so spaces and '&' do not break the query string
    response = requests.get(SPOTIFY_API_SEARCH_ARTIST.format(quote(artist)),headers=headers)

    body = response.json()
    if(response.status_code == 200 and len(body['artists']['items']) > 0):
        # the first search result is taken as the matching artist
        return body['artists']['items'][0]['id']

    if('error' in body and 'The access token expired' in body['error']['message']):
        print("[WARNING] The access token expired, requesting refresh.")
        access_token = getNewAccessToken()
        with open('.accesstoken',mode='w') as w:
            w.write(access_token)

        # return the retried result so the caller receives the artist id
        return searchArtistId(artist,access_token)

    # no matching artist was found
    return None
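The three functions above repeat the same "refresh the token and try again" logic. One way to factor it out is sketched below. The helper and its callback signatures are hypothetical: request takes an access token and returns a (status code, JSON body) pair, and refresh returns a new access token, for example by calling getNewAccessToken.

```python
def call_with_refresh(request, access_token, refresh, max_retries=1):
    """Run request(token); on an expired-token error, refresh and retry."""
    for attempt in range(max_retries + 1):
        status, body = request(access_token)
        error = body.get("error") if isinstance(body, dict) else None
        message = error.get("message", "") if isinstance(error, dict) else ""
        expired = "The access token expired" in message
        # Stop on success, on any other error, or when retries are used up.
        if not expired or attempt == max_retries:
            return status, body, access_token
        access_token = refresh()
```

The helper returns the token it ended up using, so the caller can save it to disk once rather than inside every API function.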

Get Artist Top Tracks

API Endpoint used = SPOTIFY_API_SEARCH_ARTIST_TOP

HTTP method used = GET

Response = JSON 200

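def getArtistTopTracks(artistId,access_token):
    headers={"Accept" : "application/json", "Content-Type" : "application/json", "Authorization" : "Bearer {}".format(access_token)}

    # TLS certificate verification is left enabled (the requests default)
    response = requests.get(SPOTIFY_API_SEARCH_ARTIST_TOP.format(artistId),headers=headers)

    if(response.status_code == 200):
        return response.json()

    # raise an explicit exception; a bare 'raise' needs an active exception
    print("[ERROR] " + response.text)
    raise RuntimeError("Getting top tracks failed with HTTP status {}".format(response.status_code))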
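The top-tracks response contains a tracks list whose entries carry a name and a uri. One possible way to pick a Hot-100 track out of that response is a case-insensitive name match, sketched below. The helper name and the exact-match rule are assumptions, and titles that differ between the chart and Spotify would not match.

```python
def find_track_uri(top_tracks, track_name):
    """Return the URI of the top track whose name matches track_name, or None."""
    wanted = track_name.strip().lower()
    for track in top_tracks.get("tracks", []):
        # Compare names case-insensitively, ignoring surrounding whitespace.
        if track.get("name", "").strip().lower() == wanted:
            return track.get("uri")
    return None
```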

Refresh Expired Access token

Endpoint used = SPOTIFY_API_TOKEN


Payload = { grant_type, refresh_token }

HTTP method = POST

def getNewAccessToken():
    headers={"Authorization" : "Basic " + SPOTIFY_API_BASE64_CLIENT}
    data={}
    data['grant_type']="refresh_token"
    data['refresh_token']=SPOTIFY_API_REFRESH_TOKEN

    try:
        # TLS certificate verification is left enabled (the requests default)
        response = requests.post(SPOTIFY_API_TOKEN,data=data,headers=headers)
    except requests.RequestException:
        print("[ERROR] Failed to send request to get new access token.")
        raise

    if(response.status_code == 200):
        return response.json()['access_token']

    # raise an explicit exception; a bare 'raise' needs an active exception
    print("[ERROR] Failed to get access token {}".format(response.text))
    raise RuntimeError("Token refresh failed with HTTP status {}".format(response.status_code))
    
Last updated on 26 Oct 2019
Published on 26 Oct 2019