What Does BTS Sing About?

An attempt at applying topic modelling to understand BTS’s lyrics.

Kaili
Geek Culture



1. Introductions

BTS (방탄소년단) is many things, including

  • A septet that debuted in 2013 under Big Hit Entertainment
  • The best-selling artist in South Korean history
  • 43rd in the Forbes Celebrity 100 (2019) as one of the world’s top-earning celebrities

Much of BTS’s success is credited to what they sing about: they often embed social issues into their lyrics, and they are frequently praised as a more authentic k-pop group.

2. Motivation

“First time with BTS?” — Dope

  • Extension of a side project to collect data on BTS songs
  • Desire to attempt a text analysis project that did not involve the usual, more commonly explored datasets (news, social media, etc.)
  • To better understand BTS’s progression as artists through an understanding of their lyrics

3. Dataset

“Just take ’em take ’em” — Blood Sweat & Tears

  • Self-procured and manually cleaned using data from the lyrics website Genius and BTS’s official webpage
  • Consists of: 18 albums, 225 tracks from 2013 to 2021, with English translated lyrics of each track
  • More details regarding the dataset can be found on the Kaggle page

4. Data Pre-Processing

“snowflakes fall down and get away little by little” — Spring Day

The full code can be found here. This piece of writing keeps code-block embeds to a minimum, and most code steps are not shown; refer to the link for the full code.

Only Keeping Unique Tracks

Tracks that are not included for analysis beyond exploratory purposes are:

  • repackaged tracks (songs previously released but included again in a later album) — duplicates of existing tracks
  • remix tracks — considered duplicates as their lyrics rarely differ
  • shortened tracks that have a full version ( has_full_ver = TRUE ) — considered duplicates as their lyrics are already represented in the full-version tracks
  • “skit” tracks — BTS include snippets of conversations or soundbites, known as “skits”, in their tracklists
  • “notes” tracks — for some albums, Genius also has translations of “notes”, the printed text brochures included with the physical albums; these are removed from the analysis as they are not lyrics
Code Block 1. Data Pre-processing: only keeping unique tracks in the data frame
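Code Block 1 is not reproduced here; a minimal sketch of the filtering step, assuming hypothetical column names such as remix, repackaged, has_full_ver, and track_type (the actual dataset’s columns may differ), could look like:

```python
import pandas as pd

# Hypothetical track metadata — the real dataset's column names may differ
tracks = pd.DataFrame({
    "track_name": ["Danger", "Danger (Mo-Blue-Mix)", "Skit: Circle Room Talk",
                   "Intro: O!RUL8,2? (short ver.)", "Run"],
    "remix":        [False, True,  False, False, False],
    "repackaged":   [False, False, False, False, False],
    "has_full_ver": [False, False, False, True,  False],
    "track_type":   ["song", "song", "skit", "song", "song"],
})

# Keep only unique, lyric-bearing tracks: drop remixes, repackages,
# shortened tracks that have a full version, and skit/notes entries
unique_tracks = tracks[
    ~tracks["remix"]
    & ~tracks["repackaged"]
    & ~tracks["has_full_ver"]
    & ~tracks["track_type"].isin(["skit", "notes"])
].reset_index(drop=True)

print(unique_tracks["track_name"].tolist())  # ['Danger', 'Run']
```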

Normalize Lyrics

  • pre_normalise function — replaces text before the usual text normalization methods (e.g. replaces phrases containing special characters)
  • lyrics functions — methods to deal with various lyrical features (e.g. reducing contractions such as changing ‘dat’ to ‘that’; removing the ‘la la la’ or non-lexical vocables from the lyrics)
  • text functions — methods to perform routine text data cleaning (remove stopwords, lemmatize words, etc.)
  • replacements — self-defined dictionary (stored in a separate file: replacements.py) of replacements to be made in addition to the prescribed text data cleaning
Code Block 2. Data Pre-processing: normalize lyrics
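Code Block 2 is likewise only sketched below. The toy replacements dictionary and vocable pattern here are stand-ins; the real replacements.py and normalization functions are far more extensive:

```python
import re

# Toy stand-ins for the real lyric-cleaning rules
REPLACEMENTS = {"dat": "that", "luv": "love"}
VOCABLES = re.compile(r"\b(?:la|na|oh|yeah)(?:\s+(?:la|na|oh|yeah))+\b", re.I)

def normalise_lyrics(text: str) -> str:
    text = text.lower()
    text = VOCABLES.sub(" ", text)                   # drop 'la la la'-style vocables
    words = re.findall(r"[a-z']+", text)             # simple tokenization
    words = [REPLACEMENTS.get(w, w) for w in words]  # expand lyric contractions
    return " ".join(words)

print(normalise_lyrics("La la la la, I know dat feeling"))  # i know that feeling
```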

5. Exploratory Data Analysis

“did you see my bag?” — Mic Drop

General

Fig 1. Number of albums released yearly (Source: author)
Fig 2. Number of tracks released yearly (Source: author)
Fig 3. Number of tracks per album (Source: author)

Observations

  • Although 2020 saw the highest number of albums released, the numbers of both total and unique tracks released are not significantly higher than in 2016 and 2018, when half as many albums were released — likely because the singles released in 2020 each count as a separate album (Dynamite Day; Dynamite Night)
  • The number of unique tracks follows a short phase of increase before a sharp decrease where total tracks > unique tracks (the decrease usually because an album is a repackage), suggesting a release pattern in which one can expect a repackage album after about 3 albums where total = unique tracks
  • To add to the previous point, leveraging personal knowledge: BTS commonly release a repackaged album as the last album in a series of albums sharing similar prefixes and themes (e.g. Young Forever as the epilogue to the Most Beautiful Moment in Life series; Answer as the epilogue to the Love Yourself series)

Lyrics

Fig 4. Distribution of lyric word count in unique tracks (Source: author)
Fig 5. The average number of words per unique track by album (Source: author)

Observations

  • The number of words in BTS lyrics follows a roughly normal distribution
  • The median number of words is 207, which supports the previous point that the data has a more or less symmetrical distribution
  • The average number of words across the albums follows a generally decreasing trend, which is interesting: early BTS releases are ‘grittier’, for lack of a better word (think: angrier, rap-heavy), while their more recent songs have been more vocal-driven, which may be what this observation reflects
Fig 6. word cloud of the most common words in BTS lyrics (Source: author)
Fig 7. Top 15 most common bigrams in BTS lyrics (Source: author)

Observations

  • A lot of the bigrams are the same word repeated, which is not unexpected given that these are lyrics (e.g. know know, love love, go go, bang bang, run run, want want)
  • Most of the bigrams are natural pairings, so it is unsurprising that the words are often used together (e.g. let us, let go, one day, hip hop, feel like, one two)
  • Most of the bigrams also contain common words, as shown in the word cloud
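For readers curious how such bigram counts are produced, a minimal standard-library sketch (with toy lines in place of the real lyrics):

```python
from collections import Counter

def bigrams(tokens):
    # consecutive word pairs within one track's lyrics
    return list(zip(tokens, tokens[1:]))

# Toy lines standing in for the normalized lyrics of each track
lyrics = [
    "bang bang bang it go".split(),
    "let us go let us go".split(),
]

counts = Counter(pair for track in lyrics for pair in bigrams(track))
print(counts.most_common(3))
```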
Fig 8. Top 10 tracks with the most number of words and unique words (Source: author)
Fig 9. Top 10 tracks with the least number of words and unique words (Source: author)

Observations

  • After normalization, the track ‘Interlude: What Are You Doing Now’ has no words — it is removed from the data used for further analysis
  • All the Cypher tracks (rap tracks) are among the tracks with the most words
  • Intro and outro tracks, predictably, make up the tracks with the least number of words

Lyrics: Word Significance

  • Significance is measured using TF-IDF, which gives a higher weightage to less common (hence more meaningful and “interesting”) words
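As a quick illustration of why TF-IDF surfaces “interesting” words, here is a toy sketch with two hypothetical album vocabularies (not the real data):

```python
import math

# Two toy 'album' vocabularies — not the real lyrics
albums = {
    "Love Yourself: Her": "love love heart dna love".split(),
    "Dark & Wild":        "danger girl heart danger".split(),
}

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)       # how often the term occurs in this album
    df = sum(term in d for d in docs)     # how many albums contain the term
    return tf * math.log(len(docs) / df)  # rare across albums => higher weight

docs = list(albums.values())
her = albums["Love Yourself: Her"]
print(tf_idf("love", her, docs))   # > 0: exclusive to one album
print(tf_idf("heart", her, docs))  # 0.0: appears in every album
```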
Fig 11. Most interesting words in each album’s lyrics (Source: author)

Observations

  • The Love Yourself series lives up to its title: the word “love” is the top word of significance in all but one of its albums
  • In fact, “love” is consistently a relatively significant word
  • Earlier albums surface the word “girl”, which appears to be less significant after the noticeably lonely album “The Most Beautiful Moment in Life pt. 2”

6. Topic Modelling

“try babbling into the mirror, who the heck are you” — Fake Love

Prepare Model Inputs

Code Block 3. Prepare LDA model inputs

Utility Function

  • The function compute_coherence_value will be used to reduce code repetition
Code Block 4. Function to initiate and run gensim’s LDA model

Base Model

Code Block 5. Base Model
  • Output: base model c_v: 0.2907

Model 1: Changing the Number of Topics (n)

Code Block 6. model 1 parameters
  • The range of the number of topics to test is set from 2 to 10, both to reduce training time and to avoid overfitting the model to the data
Fig 12. Plotting model 1’s coherence score
  • For simplicity’s sake, the number of topics (n) that achieves the highest c_v will be used from here on out — n=7
  • Output: model 1 c_v: 0.2940 — which is an improvement from the base model

Model 2: Changing alpha values (a)

Code Block 7. model 2 parameters
Fig 13. Table of each alpha value and coherence score
  • Coherence score is highest when a=0.71
  • Output: model 2 c_v: 0.2932 — which is an improvement from the base model but slightly lower than model 1

Model 3: Changing eta (b) values

Code Block 8. model 3 parameters
Fig 14. Table of each eta value and coherence score
  • Coherence score is highest when b=0.91
  • Output: model 3 c_v: 0.3977 — which is an improvement from the base model and models 1 and 2

Final Model

Code Block 9. Using model 3 as the final LDA model
Fig 15. 15 words associated with each topic extracted by the LDA model

Giving a name to the numerical topics, based on personal knowledge of BTS’s discography and history:

Fig 16. Putting a name to the topics

7. Conclusion

“drink it up. (creature of creation)” — Dionysus

Fig 17. Table showing topics and the number of tracks classified as each topic
  • The topics are not evenly distributed, as shown by the scant counts for the topics ‘kid love’ and ‘party’
  • From a perspective with more human (and fan) judgment, the topics extracted by the model are quite similar to each other
Fig 18. The trend of each topic by album
Fig 19. The composition of each albums’ topics
  • The number of tracks per topic in the different albums makes sense — the high number of ‘dreamy love’ tracks in Skool Luv Affair, the peak in ‘missed love’ in the WINGS album
Fig 20. Visualizing LDA topics with pyLDAvis
  • There are no huge areas of overlap between the different bubbles, which is a good sign that the topics are fairly distinct from each other
  • At the same time, most of the bubbles are not large or prevalent enough for the model to be a good classifier (should one choose to use it that way); this is also reflected in the model’s low c_v score
Final model c_v : 0.3977
  • About the c_v score: this LDA model gave the best c_v score within the time spent tuning, suggesting that, semantically, it produces the most coherent topics

Reflections

  • BTS sing in Korean the majority of the time. The dataset used in the analysis was English-translated, which may have introduced some linguistic loss
  • The dataset includes only the Korean and recent English albums; it does not include Japanese tracks and misses some special albums (e.g. mixtapes, game OST albums) — additional data that might better inform the model in topic extraction
  • The parameters varied to derive the final LDA model (within specific ranges) were limited to the number of topics, alpha, and eta. Experimenting with other model parameters (e.g. number of passes, random state) might lead to a better-performing model
  • LDA might not be the best model to extract topics, other methods (e.g. NMF) might be a better approach
  • Perhaps BTS just refuses to have detectable large groups of similar topics in their discography

References

“I’m curious about everything” — Boy with Luv

Edits: 16 May 2021, added Fig 19 and corrected some spelling and grammar.
