What People Sing About?

The goal of this project is to analyze and compare songs that occupy top chart positions in different countries, to discover similarities and differences, and ultimately to answer the question of whether the content of popular songs reflects cultural differences between people around the world. The top-chart data for the following 29 countries are analyzed in this project: Argentina, Australia, Austria, Belgium, Brazil, Bulgaria, Canada, Chile, China, Denmark, Finland, France, Germany, Greece, India, Ireland, Italy, Japan, Netherlands, New Zealand, Norway, Portugal, Russia, Spain, Sweden, Switzerland, UK, Ukraine, and USA.

Dataset

Creating international lyrics dataset
Scraping chart data. We scrape real-time data about the top 20 songs for each country from top40-charts.com. This includes song names, artists, rankings, and dates. The song rankings per country for a sample dataset from 18 Aug 2018 can be found here.
Lyrics search. We search for the lyrics of all songs using Google and Genius Lyrics search. The success rate is around 78%.
Translation to a common language (English). We then translate all the lyrics to English using Google Translate and Google Apps Script.
SQLite database. We insert data about songs, artists, and countries into a database. Then we clean the lyrics, parse them into words, remove stop words, and finally lemmatize the words. The list of keywords per song can be found here, and the list of keywords per country here. The full database schema can be found here, and a sample SQLite database can be downloaded from here.

The steps of the data collection phase are fully described here.
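The last step of the pipeline (cleaning, tokenizing, stop-word removal, lemmatization, and storage in SQLite) can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the table and column names, the tiny stop-word list, and the lookup table that stands in for a real lemmatizer are all assumptions made for the example.

```python
import re
import sqlite3

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "i", "you", "me", "my",
              "in", "on", "to", "of", "is", "it"}

# Toy lookup table standing in for a real lemmatizer (assumption).
LEMMAS = {"dancing": "dance", "danced": "dance",
          "nights": "night", "loves": "love"}


def extract_keywords(lyrics):
    """Lowercase the lyrics, split them into words, drop stop words, map to lemmas."""
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    words = [t.strip("'") for t in tokens]
    return [LEMMAS.get(w, w) for w in words if w and w not in STOP_WORDS]


def build_database(songs):
    """Store songs and their keywords in an in-memory SQLite database.

    `songs` is an iterable of (title, artist, country, chart_rank, lyrics).
    The schema below is hypothetical, not the project's real schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE song (
            id INTEGER PRIMARY KEY,
            title TEXT, artist TEXT, country TEXT, chart_rank INTEGER
        );
        CREATE TABLE keyword (
            song_id INTEGER REFERENCES song(id),
            word TEXT
        );
        """
    )
    for title, artist, country, chart_rank, lyrics in songs:
        cur = conn.execute(
            "INSERT INTO song (title, artist, country, chart_rank) "
            "VALUES (?, ?, ?, ?)",
            (title, artist, country, chart_rank),
        )
        conn.executemany(
            "INSERT INTO keyword (song_id, word) VALUES (?, ?)",
            [(cur.lastrowid, w) for w in extract_keywords(lyrics)],
        )
    conn.commit()
    return conn
```

Keeping one keyword per row makes per-song and per-country keyword lists a matter of a simple join and GROUP BY on the two tables.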

Analytics

Data analysis
We perform a series of experiments:
Top words per country. Here we produce the 10 words that occur most frequently in the lyrics of each country. We then produce a 7-word country lyrics signature, i.e. the seven words with the top tf-idf scores, which are the most helpful in distinguishing one set of documents from another.
Topic modeling. Using LDA, we extract three major themes (topics) from the song lyrics, tentatively name each topic, and assign to each song the topic with the largest score. See the results with top words per topic here. The assignment of a topic to each song can be seen here. As you can see, most songs belong to one of the three major categories: light songs, sad songs, and aggressive songs. We then generate the topic distribution per country.
Clustering countries by content of their lyrics. We use hierarchical clustering to create clusters of countries based on song words. The clustering based on total word counts uses Tanimoto distance, and the clustering based on tf-idf scores uses Pearson correlation.

The steps of the data analytics phase are fully described here.
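To make the clustering step more concrete, here is a small pure-Python sketch of the two distance measures and a naive agglomerative clustering loop. Several details are assumptions, since the text does not specify them: the Tanimoto coefficient is taken in its vector form (dot product over the sum of squared norms minus the dot product), the Pearson distance is taken as one minus the correlation, and average linkage is used to merge clusters.

```python
import math


def tanimoto_distance(a, b):
    """1 - Tanimoto coefficient for two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom if denom else 0.0


def pearson_distance(a, b):
    """1 - Pearson correlation for two equal-length score vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # correlation undefined; treat the pair as uncorrelated
    return 1.0 - cov / math.sqrt(var_a * var_b)


def cluster(vectors, distance):
    """Naive average-linkage agglomerative clustering.

    `vectors` maps a label (e.g. a country) to its word vector.
    Returns the merges in order as (left_labels, right_labels, distance).
    """
    clusters = {(name,): [vec] for name, vec in vectors.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i, left in enumerate(keys):
            for right in keys[i + 1:]:
                pairs = [(u, v) for u in clusters[left] for v in clusters[right]]
                d = sum(distance(u, v) for u, v in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, left, right)
        d, left, right = best
        # Merge the closest pair into one cluster and record the step.
        clusters[left + right] = clusters.pop(left) + clusters.pop(right)
        merges.append((left, right, d))
    return merges
```

The list of merges is the information a dendrogram draws: countries that merge early, at a small distance, have the most similar vocabularies. For 29 countries this cubic-time loop is fast enough; a library routine would be the usual choice for larger inputs.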

Results

Top Words

We created two visualizations to show the most frequent lyrics words in each country. One is based on the total number of words, and the other one is based on the tf-idf score of each word.
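The two rankings behind these visualizations can be illustrated with a short sketch. It treats all keywords of one country as a single document and uses one common tf-idf variant (relative term frequency times the log of inverse document frequency); the exact weighting used in the project is not stated, so this formula and the function names are assumptions.

```python
import math
from collections import Counter


def top_counts(words, n=10):
    """Return the n most frequent words in a list of keywords."""
    return [w for w, _ in Counter(words).most_common(n)]


def tfidf_signature(docs, n=7):
    """Return the n words with the highest tf-idf score for each country.

    `docs` maps a country name to the list of keywords from its songs.
    """
    n_docs = len(docs)
    # Document frequency: in how many countries does each word appear?
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))

    signatures = {}
    for country, words in docs.items():
        counts = Counter(words)
        total = sum(counts.values())
        scores = {
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        signatures[country] = ranked[:n]
    return signatures
```

A word that appears in every country's lyrics gets a score of zero under this weighting, which is why a frequent but universal word can top the count-based list while being absent from the tf-idf signature.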

Top words w/ Counts
Top words w/ tf-idf

Country clusters


Hierarchical Clustering with Tanimoto Distance based on Counts



Hierarchical Clustering with Pearson Distance based on tf-idf Score
Distribution of topics per country


Proportion of light songs (topic 4)



Proportion of sad songs (topic 9)



Proportion of aggressive songs (topic 8)
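The per-country proportions for a given topic could be computed along the following lines, assuming each song already has a list of topic scores from the topic model. The input format and the function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def dominant_topic(scores):
    """Return the index of the topic with the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def topic_share_by_country(songs, topic):
    """Fraction of each country's songs whose dominant topic is `topic`.

    `songs` is an iterable of (country, topic_scores) pairs.
    """
    totals = Counter()
    hits = Counter()
    for country, scores in songs:
        totals[country] += 1
        if dominant_topic(scores) == topic:
            hits[country] += 1
    return {country: hits[country] / totals[country] for country in totals}
```

Because each song is counted once under its single highest-scoring topic, the shares of all topics for a country add up to one.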

Future Work

This research project has a lot of room for future improvement. For example, some song lyrics were not found through Google or Genius, making the datasets for Chinese, Japanese, and Ukrainian songs disproportionately small. Including more lyrics sources could help solve this problem.

We are also missing data from Middle Eastern and African countries, as they are not represented in these charts. It would be interesting to see how popular songs in these countries differ, much as we discovered that songs from Asian countries differ significantly in vocabulary from European and American songs.

Note also that some charts on Top40-Charts.com were last updated in 2015 or 2016, so not all countries have up-to-date charts. We may therefore be comparing songs from different time periods.

The analytics part of this project can be extended to analyze and compare song emotions, use of offensive language, use of personal pronouns, use of metaphors and similes, and improved topic modeling, to name a few.