Building a Spotify scraper


Recently I came across spotifycharts.com, an official website where Spotify has been publishing daily and weekly listening trends since 2017 for several regions, including a global one.

Seeing that, I wondered how social media apps that rely heavily on music (think Vine, and more recently TikTok) influence these charts. Especially in the case of TikTok, some musicians' careers seem to have been kickstarted by the platform. Naturally I asked myself whether this would also be reflected in the numbers, and so I decided to run a small exploratory project analyzing the Spotify Charts.

In this post I'll describe how I wrote a scraper to download these charts so we can analyze them offline.

The API

Even though Spotify does not provide an official public API for these charts, it was quite clear that downloading all of them should not be too hard. There is a link on the website to download the chart for a given date and region as a CSV file. Looking at the website's source code, we can see which URL is called when clicking that link:

<a href="/regional/global/daily/2020-10-30/download" class="header-csv" download="">Download to CSV</a>

We can see that both the region (global) and the date are part of the URL, so scripting this should not be hard. From the dropdown menu we can tell that charts are currently available for 66 regions, starting in January 2017. In total we will end up with about 92k CSVs. That's a lot of files to download, and doing it sequentially is definitely not an option. Fortunately Go, my favourite programming language, makes it very easy to design with concurrency in mind.
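Enumerating the dates to request is simple enough. A minimal sketch could look like the following (the helper name chartDates and the exact start date of 2017-01-01 are my assumptions for illustration, not the scraper's actual code):

package main

import (
    "fmt"
    "time"
)

// chartDates returns every day between start and end (inclusive),
// formatted the way the download URL expects, e.g. "2017-01-01".
func chartDates(start, end time.Time) []string {
    var dates []string
    for d := start; !d.After(end); d = d.AddDate(0, 0, 1) {
        dates = append(dates, d.Format("2006-01-02"))
    }
    return dates
}

func main() {
    // Assumed start date; the charts begin in January 2017.
    dates := chartDates(time.Date(2017, 1, 1, 0, 0, 0, 0, time.UTC), time.Now())
    // Roughly 1,400 days as of late 2020, times 66 regions: about 92k files.
    fmt.Printf("%d days x 66 regions = %d files\n", len(dates), len(dates)*66)
}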

Concurrency-First Design

The idea behind this design is to have a number of workers download whole regions concurrently. Each of those workers in turn spawns new workers, each of which downloads a single day of charts for that region.

[...]
maxRegions := make(chan struct{}, 10)
wg := sync.WaitGroup{}
wg.Add(len(dates) * len(spotify.Regions))
for regCode, regName := range spotify.Regions {
    regCode := regCode
    regName := regName
    maxRegions <- struct{}{}

    // Each worker spawned here will download one region ...
    go func() {
        defer func() {
            <-maxRegions
        }()

        maxConcurrency := make(chan struct{}, 50)
        for _, date := range dates {
            maxConcurrency <- struct{}{}

            // ... using new workers that each download one day.
            go func(date string) {
                defer func() {
                    wg.Done()
                    <-maxConcurrency
                }()
[...]

This listing shows the code responsible for spawning the workers, and it is relatively straightforward. First, we range over all known regions (defined in a map called spotify.Regions) and spawn a worker for each region using a goroutine. Then, each of these workers spawns another goroutine for every day.

In Go, the go keyword before a function call executes that function asynchronously. This means that we can iterate through the loop very quickly, because all workers now run in the background. However, the Go runtime does not wait for goroutines to finish on its own, so we use a sync.WaitGroup. We add the total number of expected downloads to the counter up front (wg.Add) and decrement it whenever a goroutine finishes (the deferred wg.Done). At the end we call wg.Wait(), which blocks until every goroutine has called wg.Done().

One thing to keep in mind is that we want to limit the level of concurrency. Spawning a large number of goroutines is not a big problem by itself, but we should remember that we are scraping data from someone else's servers and don't want to overload them with 92k concurrent requests. Realistically, we would probably run into rate limits or saturate our own internet connection long before overloading Spotify, but limiting concurrency is still good practice.

To limit the number of concurrent goroutines, I use buffered channels that act as semaphores. Before spawning a goroutine, we push an empty struct{} into the channel (maxRegions for region workers, maxConcurrency for day workers). Once the channel has reached its capacity, this send blocks until a finishing goroutine removes a struct from the channel in its deferred function. With at most 10 region workers, each running at most 50 day workers, this pattern ensures that we never have more than 500 downloads running at any time.
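Stripped of the scraping specifics, the combined WaitGroup and semaphore pattern looks roughly like this (a minimal, self-contained sketch; the job values and the limit of 3 are made up for illustration):

package main

import (
    "fmt"
    "sync"
)

func main() {
    jobs := []int{1, 2, 3, 4, 5, 6, 7, 8}

    sem := make(chan struct{}, 3) // semaphore: at most 3 workers at once
    var wg sync.WaitGroup
    wg.Add(len(jobs))

    for _, job := range jobs {
        job := job
        sem <- struct{}{} // blocks while 3 workers are already running
        go func() {
            defer func() {
                wg.Done()
                <-sem // free a slot for the next worker
            }()
            fmt.Println("processing job", job)
        }()
    }

    wg.Wait() // block until every worker has called wg.Done()
}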

Saving the files

The heart of the scraper runs inside the workers: the function that actually downloads the CSV and writes it to a file.

[...]
func() error {
    url := fmt.Sprintf("%s/regional/%s/daily/%s/download", baseURL, regCode, date)
    r, err := http.Get(url)
    if err != nil {
        return err
    }
    defer r.Body.Close()
    if r.StatusCode == http.StatusNotFound {
        // If the file is not there, there is no point in retrying.
        return stop{errors.New("not found")}
    }
    if h := r.Header.Get("Content-Type"); h != "text/csv;charset=UTF-8" {
        return errors.New("non CSV data")
    }
    if r.StatusCode != http.StatusOK {
        return errors.New("non 200 code")
    }
    p := path.Join(dataPath, fmt.Sprintf("%s/%s", regCode, date))
    f, err := os.Create(p)
    if err != nil {
        return err
    }
    defer f.Close()
    _, err = io.Copy(f, r.Body)
    if err != nil {
        return err
    }
    return nil
}
[...]

The code here is very straightforward. We download the CSV file, perform a couple of checks and write it to a file by copying the response body. A hard lesson to learn was to check the HTTP response's Content-Type header to make sure we are actually downloading CSV content. Before adding this check, I occasionally downloaded the HTML representation of a chart instead, which quickly ate up several GB of my disk.
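The stop wrapper around the "not found" error hints at the retry logic that lives in the elided parts of the code: errors wrapped in stop signal that retrying is pointless, while anything else is tried again. That logic is not shown in this post, but a minimal sketch of such a wrapper could look like this (the attempt count and the fixed delay are assumptions, not the scraper's actual values):

package main

import (
    "errors"
    "fmt"
    "time"
)

// stop wraps an error to signal that the request should not be retried.
type stop struct{ error }

// retry runs fn up to attempts times, sleeping between tries, and gives
// up immediately when fn returns a stop-wrapped error.
func retry(attempts int, delay time.Duration, fn func() error) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = fn(); err == nil {
            return nil
        }
        if s, ok := err.(stop); ok {
            return s.error
        }
        time.Sleep(delay)
    }
    return err
}

func main() {
    err := retry(3, time.Second, func() error {
        // A permanent failure: returned to the caller after a single try.
        return stop{errors.New("not found")}
    })
    fmt.Println(err) // prints "not found"
}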

Running it

With all the important parts pieced together, and a bit of extra code for some visual progress, we can now run the program:

2020/11/01 17:52:00 1398 days, 66 regions
Peru                 76 / 1398 [>----------------]   9m14s   5 %
Czech Republic      242 / 1398 [==>--------------]   2m32s  17 %
Indonesia           201 / 1398 [=>---------------]    3m3s  14 %
Chile               182 / 1398 [=>---------------]   3m26s  13 %
Slovakia            125 / 1398 [=>---------------]   4m33s   9 %
Thailand            387 / 1398 [====>------------]    1m7s  28 %
Estonia             144 / 1398 [=>---------------]   3m36s  10 %
Turkey              132 / 1398 [=>---------------]   3m58s   9 %
Mexico               67 / 1398 [>----------------]   7m53s   5 %
Hong Kong            38 / 1398 [-----------------]  13m38s   3 %

Within a couple of minutes, it downloads all CSVs into a directory called data, separated by region.
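From there the files are ready for offline analysis. As a quick sanity check, a downloaded chart can be read back with Go's encoding/csv package (the file path below is just an example; adjust it to a region and date you actually downloaded):

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"
)

func main() {
    // Example path following the data/<region>/<date> layout described above.
    f, err := os.Open("data/global/2020-10-30")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    r := csv.NewReader(f)
    r.FieldsPerRecord = -1 // don't assume every row has the same number of fields
    rows, err := r.ReadAll()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("rows:", len(rows))
    if len(rows) > 0 {
        fmt.Println("first row:", rows[0])
    }
}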

Conclusion

The whole codebase is available on GitHub. Feel free to check it out and contribute.

I am by no means a data scientist, but I think this dataset is very interesting. Maybe someone with a background in data science can use it to produce some cool visualizations for r/dataisbeautiful.