I have read about applications of topic models beyond text, to things like genetic data and image data, but as far as I can tell there hasn’t been much work on topic modeling of musical performance data. I think the performance program data from the New York Philharmonic archive is very amenable to topic modeling. For example, you could think of the concerts as documents and the pieces performed as words. You would then assume a generative model in which each piece is drawn from a topic-specific distribution over pieces, conditional on a latent topic assignment variable, which is itself drawn from a concert-specific distribution over topics.
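As a rough illustration of that generative story, here is a minimal sketch using only the Python standard library. The vocabulary, the two hand-made topics, and the Dirichlet prior on the concert-level topic distribution are assumptions made for the example; they are not taken from the Philharmonic data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_concert(pieces, topic_piece_probs, alpha, n_pieces, rng):
    """Generate one concert program: concerts are documents, pieces are words."""
    # Concert-specific distribution over topics (Dirichlet prior is an assumption).
    theta = sample_dirichlet(alpha, rng)
    program = []
    for _ in range(n_pieces):
        # Latent topic assignment for this slot on the program.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Piece drawn from the topic-specific distribution over pieces.
        piece = rng.choices(pieces, weights=topic_piece_probs[z])[0]
        program.append((z, piece))
    return theta, program

# Toy vocabulary and topics, illustrative values only.
pieces = ["Piece A", "Piece B", "Piece C", "Piece D"]
topic_piece_probs = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0 favors pieces A and B
    [0.05, 0.05, 0.45, 0.45],  # topic 1 favors pieces C and D
]
rng = random.Random(0)
theta, program = generate_concert(pieces, topic_piece_probs, [0.5, 0.5], 4, rng)
```

Fitting a topic model amounts to running this process in reverse: given only the programs, infer the topic-specific piece distributions and the concert-level mixtures.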

Recently I implemented the Topics-Over-Time (TOT) model (Wang & McCallum, 2006) and applied it to New York Philharmonic performances. This model adds an observed timestamp variable for each word, drawn from a topic-specific beta distribution conditional on the latent topic, which allows modeling of the time dynamics of topics. For this project I did not treat pieces as words, but instead used the composers of the pieces. This reduces the sparsity of the problem a bit, since most composers wrote many pieces while many individual pieces are performed very rarely (I haven’t checked this, but it seems reasonable to me). Also, instead of using concerts as documents, I aggregated all the composers performed by the same conductor into one document, so that I could analyze per-conductor topic distributions.
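A minimal sketch of that document construction, under stated assumptions: the record layout `(conductor, composer, year)`, the placeholder names, and the `eps` padding are all hypothetical. The padding is there because the beta density can be degenerate at exactly 0 or 1, so timestamps are rescaled into the open interval (0, 1).

```python
from collections import defaultdict

# Hypothetical performance records: (conductor, composer, year).
performances = [
    ("Conductor A", "Composer X", 1930),
    ("Conductor A", "Composer Y", 1950),
    ("Conductor B", "Composer X", 1980),
    ("Conductor A", "Composer X", 1940),
]

def build_conductor_documents(records, eps=0.01):
    """Group composer tokens by conductor and attach a normalized timestamp."""
    years = [year for _, _, year in records]
    lo, hi = min(years), max(years)
    span = max(hi - lo, 1)  # avoid division by zero if all years coincide
    docs = defaultdict(list)
    for conductor, composer, year in records:
        # Rescale the year into [eps, 1 - eps], strictly inside (0, 1).
        t = eps + (1 - 2 * eps) * (year - lo) / span
        docs[conductor].append((composer, t))
    return dict(docs)

docs = build_conductor_documents(performances)
```

Each document is then a list of (composer, timestamp) tokens, which is the form of input the TOT model expects: one word and one observed timestamp per token.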

I fit a 70-topic model using a collapsed Gibbs sampler. I chose 70 topics on the basis of coherence scores (Mimno, Wallach, Talley, Leenders, & McCallum, 2011) computed for a range of topic counts. Below are four of the inferred topics that I found interesting, with composers ranked in decreasing order of term score.

[Figure: four inferred topics, with composers ranked by term score]

The labels are my interpretation, of course. In Musicals/Pops, we see Stephen Sondheim at the top, who is famous for his musicals. In Opera we see the usual, mainly Italian, suspects like Verdi and Puccini. In American we have Ives and John Philip Sousa, the latter of “The Stars and Stripes Forever” fame. Finally, I’ve included a topic I call Standard, because it’s filled with stalwarts of the orchestral repertoire. This is the bread-and-butter fare that orchestras everywhere return to time and time again.
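For readers unfamiliar with the coherence score used above to choose the number of topics, here is a minimal sketch of the measure from Mimno et al. (2011) as I understand it: for a topic's top-ranked words, sum log((D(v_m, v_l) + 1) / D(v_l)) over all pairs where v_l is ranked above v_m, with D counting the documents that contain the word or word pair. The function name and toy corpus are assumptions for illustration, not the code used for this post.

```python
import math

def topic_coherence(top_words, documents):
    """Coherence of one topic from document co-occurrence counts.

    top_words: the topic's words, most probable first.
    documents: iterable of token lists; every top word must occur somewhere.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / doc_freq(top_words[l]))
    return score

# Toy corpus: "documents" are lists of composer tokens (placeholder names).
documents = [["A", "B"], ["A", "B", "C"], ["C", "D"]]
coherent = topic_coherence(["A", "B"], documents)    # A and B co-occur twice
incoherent = topic_coherence(["A", "D"], documents)  # A and D never co-occur
```

Scores closer to zero indicate that a topic's top words tend to appear in the same documents. One way to pick the number of topics is to average this score over all topics for each candidate count and compare.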

The TOT model assumes a beta distribution over the normalized timestamps of each topic, and estimates the parameters of each distribution using the method of moments. Here are the time distributions for the four topics I discussed.

[Figure: fitted time distributions for the four topics]

The topics are very localized in time; only American covers much more than a few decades. This is not so desirable, since you would expect a topic like Opera to be persistent, and the Standard topic especially should be present throughout the orchestra’s history. Instead, the topics might just correspond to individual conductors, whose careers only last so long. Or there might be some kind of organizational quirk that leads to localized topics: for example, a pops series begun around 1990 would produce the Pops topic.
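The method-of-moments fit mentioned above has a closed form. A minimal sketch, assuming the timestamps assigned to a topic are already normalized to (0, 1) and that the population (biased) variance is used: with sample mean m and variance v, set a = m(m(1 - m)/v - 1) and b = (1 - m)(m(1 - m)/v - 1). The function name is hypothetical.

```python
import statistics

def fit_beta_moments(timestamps):
    """Method-of-moments estimates (a, b) for a beta distribution.

    Requires 0 < variance < mean * (1 - mean); otherwise the estimates
    are undefined or non-positive.
    """
    mean = statistics.fmean(timestamps)
    var = statistics.pvariance(timestamps, mu=mean)
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

# Toy timestamps for the tokens currently assigned to one topic.
a, b = fit_beta_moments([0.2, 0.4, 0.6, 0.8])
```

For these toy timestamps the mean is 0.5 and the variance is 0.05, which gives a = b = 2. A topic whose tokens cluster tightly in time has a small variance and therefore large a and b, which is what produces the sharply peaked curves.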

For each topic you can see which conductors have the highest probability. For the above topics:

[Figure: highest-probability conductors for each of the four topics]

I’m not familiar with most of the names here, but you can see that Aaron Copland has a high probability of conducting American music, which makes sense. Granted, that’s because he mostly conducted his own works, as you can see in my conductor-composer browser here. You can also see that there are some issues with conductor tokenizing: Andre Previn gets listed twice because of what looks like a Unicode issue, and sometimes multiple conductors are listed for one piece performed, as with Claudio Abbado and friends in the American category.
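A ranking like the one above can be sketched as follows. This assumes the ranking is by the weight each conductor's document gives to the topic in the fitted document-topic distribution; the post does not say exactly which probability was used, and the names and numbers below are placeholders.

```python
def top_conductors(theta, topic, n=3):
    """Rank conductors by the weight their document gives to one topic."""
    ranked = sorted(theta.items(), key=lambda item: item[1][topic], reverse=True)
    return [(name, probs[topic]) for name, probs in ranked[:n]]

# Hypothetical per-conductor topic distributions for a 3-topic model.
theta = {
    "Conductor A": [0.7, 0.2, 0.1],
    "Conductor B": [0.1, 0.8, 0.1],
    "Conductor C": [0.3, 0.3, 0.4],
}
top = top_conductors(theta, topic=0, n=2)
```

Because each conductor is one document, duplicate spellings of a name split that conductor's tokens across two documents, which is why cleaning up the conductor field matters before fitting.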

The TOT paper also describes how you can calculate topic distribution mixtures over time:

[Figure: topic mixture over time]

The four topics discussed above are Pops (29), Opera (52), American (45), and Standard (26). There are relatively few topics early on, and then many after about 1930. Partly that’s because the New York Philharmonic didn’t give as many performances in the nineteenth century. But perhaps it also reflects an increase in programming diversity in the twentieth century, due to conductor preferences and/or general musical trends.
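As I understand the TOT paper, the topic mixture at a given time follows from Bayes' rule: P(z | t) is proportional to p(t | z) P(z), where p(t | z) is the topic's fitted beta density and P(z) is the overall share of tokens assigned to the topic. A minimal sketch with made-up beta parameters and topic weights:

```python
import math

def beta_pdf(t, a, b):
    """Beta density at t in (0, 1), computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))

def topic_mixture_at(t, beta_params, topic_weights):
    """P(z | t) for every topic: weight times density, then normalize."""
    unnorm = [w * beta_pdf(t, a, b) for (a, b), w in zip(beta_params, topic_weights)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Two illustrative topics: one concentrated early, one concentrated late.
beta_params = [(2.0, 5.0), (5.0, 2.0)]
topic_weights = [0.5, 0.5]
early = topic_mixture_at(0.2, beta_params, topic_weights)
middle = topic_mixture_at(0.5, beta_params, topic_weights)
```

Evaluating this on a grid of normalized timestamps and stacking the results gives a plot of the kind shown above. Note that the mixture is normalized at every time point, so it shows the relative share of each topic, not the number of performances.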

I hope you enjoyed the pretty plots, but there are quite a few issues with applying TOT to this kind of data. Most importantly, I’m not even using all the data available; ideally I’d also model the pieces played, not just the composers, to allow for topics that favor certain pieces by the same composer. Another issue is that topic modeling does not work so well for short documents (see this question on stats.stackexchange), and concert programs are all very short, usually with fewer than 10 pieces performed. There has been some work on directly modeling word co-occurrence in shorter documents (Yan, Guo, Lan, & Cheng, 2013) that leads to good results, and I think it might be effective for this kind of data.

References

  1. Wang, X., & McCallum, A. (2006). Topics over Time: A non-Markov Continuous-time Model of Topical Trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 424–433). New York, NY, USA: ACM. http://doi.org/10.1145/1150402.1150450
  2. Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing Semantic Coherence in Topic Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 262–272). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=2145432.2145462
  3. Yan, X., Guo, J., Lan, Y., & Cheng, X. (2013). A Biterm Topic Model for Short Texts. In Proceedings of the 22nd International Conference on World Wide Web (pp. 1445–1456). New York, NY, USA: ACM. http://doi.org/10.1145/2488388.2488514