Utkarsh's Notes

Multimodal text prediction model

Note: I am in week 9 of the quarter and I really should be studying to keep my GPA alive, but I had to write this idea out so I can come back to it later, or so someone else might be inspired to work on it.

Like most other university students, I have my AirPods in whenever I'm doing work that doesn't require active interaction with others. Also, like many others my age, I have probably listened to ~10^3 songs(?), a lot of them with lyrics. I'm really into trying to predict events before they occur, and I try to do this with pop songs I've never heard before. Surprisingly, I found that I can often guess phrases, or at least the last few words of a lyric. This made me wonder whether there is any correlation between the chord progressions or notes across different songs and the words used, or whether it's entirely due to the topics in and of themselves. My bet is on the latter (there are only so many words to describe love, breakups, relationships, etc.), but it would be surprising and really interesting to be proven wrong!

I think it would be really interesting to build something like a multimodal transformer model that takes in both the chord progression and the lyrics so far, and outputs text that completes the lyrics while conditioning on the chords, and then compare it against a plain large language model that only sees the lyrics.
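
A minimal sketch of what that could look like, assuming PyTorch and made-up vocabulary sizes and hyperparameters (the class name, embedding dimensions, and token layout here are all placeholders, not a worked-out design): chord tokens go in as a conditioning prefix, and the model is trained to predict the next lyric token.

```python
# Sketch only: chord tokens are prepended to lyric tokens, a causal
# transformer reads the combined sequence, and only the lyric positions
# are scored against next-word targets. All sizes are illustrative.
import torch
import torch.nn as nn

class ChordConditionedLyricModel(nn.Module):
    def __init__(self, lyric_vocab=10_000, chord_vocab=200, d_model=256,
                 n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.lyric_emb = nn.Embedding(lyric_vocab, d_model)
        self.chord_emb = nn.Embedding(chord_vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, lyric_vocab)

    def forward(self, chord_ids, lyric_ids):
        # chord_ids: (batch, n_chords), lyric_ids: (batch, n_words)
        chords = self.chord_emb(chord_ids)
        lyrics = self.lyric_emb(lyric_ids)
        x = torch.cat([chords, lyrics], dim=1)  # chords act as a prefix
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_emb(pos)
        # causal mask so each position only attends to earlier positions
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.decoder(x, mask=mask)
        # return next-token logits for the lyric positions only
        return self.head(h[:, chord_ids.size(1):, :])
```

The comparison experiment would then be: train this on (chords, lyrics) pairs, train a lyrics-only baseline of the same size, and see whether the chord prefix actually lowers next-word perplexity or whether the topics alone carry all the signal.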