Course Clustering:
Bard College at Simon's Rock
Method: Guided LDA (Latent Dirichlet Allocation)

Thanks to Vikash Singh's semi-supervised guided topic model.

  Step 1: Find topic model

  • TF-IDF (Term Frequency-Inverse Document Frequency)
  • LDA Corpus (standard)
  • Guided LDA
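
A minimal sketch of the first and last items in Step 1, using only the Python standard library. The course descriptions, seed words, and helper names (`tf_idf`, `word2id`, `seed_topics`) are hypothetical and are not taken from the project. The guided model itself is not reproduced here. The sketch only shows one common TF-IDF weighting and how seed words might be mapped to topic indices before being handed to a guided LDA implementation.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        # Term frequency (share of the document) times inverse document frequency.
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical course descriptions, already tokenized.
courses = [
    ["calculus", "limits", "derivatives"],
    ["poetry", "fiction", "writing"],
    ["calculus", "statistics", "probability"],
]
weights = tf_idf(courses)

# Assumed seeding scheme: each seed word's vocabulary id maps to a topic index.
vocab = sorted({term for doc in courses for term in doc})
word2id = {word: i for i, word in enumerate(vocab)}
seed_words = [["calculus", "statistics"], ["poetry", "fiction"]]  # hypothetical seeds
seed_topics = {
    word2id[word]: topic_id
    for topic_id, words in enumerate(seed_words)
    for word in words
}
```

Terms that appear in fewer documents receive a higher weight, so "limits" outweighs "calculus" in the first course above.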

  Step 2: Normalize values (the document-topic values from the model are extremely small floats)

  • Max-min method
  • Multiply by a power of 10 to convert values in the range 0 to 1 into integers
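
The two normalization steps above can be sketched as follows. Min-max scaling maps each value x to (x - min) / (max - min), and the result is then multiplied by a power of 10 and rounded. The function name `min_max_scale`, the `power` parameter, and the sample values are assumptions for illustration. The exponent actually used in the project is not stated.

```python
def min_max_scale(values, power=6):
    """Rescale values to [0, 1] with min-max, then to integers in [0, 10**power]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the min-max ratio is undefined, so return zeros.
        return [0 for _ in values]
    factor = 10 ** power
    return [round((v - lo) / (hi - lo) * factor) for v in values]


# Hypothetical document-topic values of the "extremely small" kind.
topic_weights = [1.2e-7, 3.4e-7, 5.6e-7]
scaled = min_max_scale(topic_weights, power=3)
print(scaled)  # [0, 500, 1000]
```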

  Step 3: K-means clustering

    Elbow Method
    Clustering Distances
  • Pearson correlation (distance measure)
  • Best k: 5
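
A rough sketch of Step 3 under stated assumptions: the distance is taken to be 1 minus the Pearson correlation, centroids are plain component-wise means, and the elbow method is read off the total distance for several values of k. The data, function names, and iteration count are hypothetical. K-means with a non-Euclidean distance is not guaranteed to converge, so this is an illustration and not the project's implementation.

```python
import math
import random


def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for perfectly correlated vectors, 2 for opposite."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # correlation undefined for constant vectors; treat as uncorrelated
    return 1.0 - cov / math.sqrt(var_a * var_b)


def k_means(points, k, distance, iterations=20, seed=0):
    """Return (labels, cost) after a fixed number of assign/update rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid is the component-wise mean; an empty cluster keeps its old one.
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: distance(p, centroids[i])) for p in points]
    cost = sum(distance(p, centroids[label]) for p, label in zip(points, labels))
    return labels, cost


# Hypothetical normalized topic vectors, one per course.
points = [
    [1, 2, 3], [2, 4, 6.5], [1, 3, 5],
    [3, 2, 1], [6, 4, 1], [5, 3, 1.5],
]

# Elbow method: look for the k after which the cost stops dropping sharply.
costs = {k: k_means(points, k, pearson_distance)[1] for k in range(1, 5)}
for k, cost in costs.items():
    print(k, round(cost, 4))
```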


  Results: Documents and Class Clusters

    Documents:

    Word Clouds 10
    Word Clouds 20
    Word Clouds 100 Labeled

    Mixed:

    Word Clouds All 10
    Word Clouds All 20
    Word Clouds All 100

    Class Clusters:

    Distance Matrix

    Classes
    Class Word Clouds 10
    Class Word Clouds 20
    Class Word Clouds 100 Labeled

    Analysis:

    Percentage Matrix