This is a continuation of the long-term IMDb data analysis using the Internet Archive. Thanks Internet Archive! You can see part one of this series here. Cheers.
‘Ello everyone. A few months back (or a few days with regards to this website) I tried to solve the BMysTery of the mysterious inflection point in IMDb data. Don’t know what I mean? The short rundown is that a lot of movies seem to have two slopes in their vote counts, one for growth before 2011 and one for after. The previous post explored that and came to a (I think) reasonable conclusion. So what is all this about then? Well, I have a ton of data just lying around and something just kicked up and itched my brain. Time for the long story.
You guys know Material Girls right? Hilary and Haylie Duff vehicle, pretty big deal. Well, every time we do a preview for a movie we generate trajectories for both IMDb rating and votes through time. Usually this results in a scream of “WHY?! Why has the rating of this terrible film gone up over time?!” And typically it was left there, because hey, people have different tastes, and maybe it is just kind of a trait of the data. But then Material Girls!
First, holy moley that 2011 inflection. Even the rating has an inflection! This was a huge red flag for me. Second, the rating jumps 2.5 points! That is patently absurd. Through all of this I couldn’t help but think maybe …. it was related to this recent blog post by fivethirtyeight. But then I was looking through some of my very old programs and stumbled onto a very prescient comment:
#Look at that variance! Awesome, basically regression to the mean.
#Movies are superlative when they come out
#End up regressing both up and down to the mean
So that’s what this (short) entry will look at: The regression to the mean in IMDb ratings. Something I clearly knew about literally 7 months ago then managed to forget pretty much instantaneously … yeah, I’m an idiot.
First start with a plot of all of the rating data I’ve got:
Nonsense. But you can kind of see that things condense as time goes by. But it is all easier if you plot the rating change (over ten years) by the initial rating of the movie. I’ve included a regression and Material Girls is marked out by a blue square:
Nice. Pretty much the entirety of the crazy jump in ratings is explained by regression to the mean. Just look at Material Girls. And funny enough the rating at which it crosses over, 6.0, is kind of the cut off point for bad movies as well, which is fun.
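For anyone who wants to poke at the idea, here is a minimal sketch of that kind of fit in Python. It is not the program I used for the plot, and the data is simulated, not IMDb data: every movie gets a hypothetical "true" quality centered on 6.0 (picked only to echo the crossover above), a noisy early rating, and a much less noisy settled rating. The function name `fit_rating_change` and all the noise levels are assumptions for illustration, and nothing is clipped to the 1 to 10 scale.

```python
import numpy as np

def fit_rating_change(initial, final):
    """Least-squares line for (final - initial) against the initial rating.

    Returns (slope, intercept, crossover), where crossover is the initial
    rating at which the fitted change is zero.
    """
    initial = np.asarray(initial, dtype=float)
    change = np.asarray(final, dtype=float) - initial
    slope, intercept = np.polyfit(initial, change, 1)
    crossover = -intercept / slope
    return slope, intercept, crossover

# Simulated data only (an assumption, not the IMDb data set).
rng = np.random.default_rng(0)
n_movies = 2000
true_quality = rng.normal(6.0, 1.0, size=n_movies)
initial = true_quality + rng.normal(0.0, 1.0, size=n_movies)  # noisy early rating
final = true_quality + rng.normal(0.0, 0.2, size=n_movies)    # settled rating

slope, intercept, crossover = fit_rating_change(initial, final)
print(f"slope={slope:.2f}, crossover={crossover:.2f}")
```

The slope comes out negative even though no simulated movie "gets better" or "gets worse": movies that started low tend to rise and movies that started high tend to fall, purely because the early rating was noisy. The crossover lands near the mean of the simulated population, which is the same shape as the real plot.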
It is interesting, especially looking at the first plot: the rating doesn’t just regress by some exponential, it pretty much follows the voting trajectory. But … yeah, they aren’t that correlated:
The rating can’t move without votes, so I think the way it tracks the vote trajectory through time is just a consequence of that underlying connection. And I think that’ll just about do it for that. The regression is interesting, but at this point probably hard to utilize for good. It could be used in tandem with a vote number trajectory predictor to try and predict vote/rating trajectories into the past. But predicting votes is the rub, and I’ve found that rather difficult.
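The "can’t move without votes" point is easy to see in a toy version. The sketch below assumes the rating is a plain running mean of all votes cast so far, which is a simplification (IMDb describes its published rating as a weighted average), and the function name `running_rating` and the numbers are made up for illustration.

```python
def running_rating(votes_per_period, mean_score_per_period):
    """Cumulative mean rating after each period.

    votes_per_period: number of new votes in each period.
    mean_score_per_period: average score of those new votes.
    Assumes a plain arithmetic mean over all votes (a simplification).
    """
    total_votes = 0
    total_score = 0.0
    ratings = []
    for n_votes, mean_score in zip(votes_per_period, mean_score_per_period):
        total_votes += n_votes
        total_score += n_votes * mean_score
        # No votes yet means no rating at all.
        ratings.append(total_score / total_votes if total_votes else None)
    return ratings

# Hypothetical movie: 100 early votes averaging 3.0, two dead periods,
# then 50 late votes averaging 6.0.
example = running_rating([100, 0, 0, 50], [3.0, 0.0, 0.0, 6.0])
print(example)
```

The rating sits still through the periods with no votes and only shifts when new votes arrive, so any bend in the vote curve shows up as a bend in the rating curve whenever the newer voters score differently from the early ones.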
But I declare this BMysTery closed! It wasn’t that hard, I mean, I apparently knew the answer seven months ago, but yeah, bad movie IMDb ratings tend to go up (and the opposite for good movies) over time. It isn’t people waking up and realizing movies are better than their rating, it is just regression to the mean. And Material Girls probably wasn’t brigaded by guys.