The Inventions In Everything team has covered a lot of unusual patents over the years and is used to seeing really strange things, but even they were dumbfounded after being presented with the following chart. If you read the headline on this article, you already know something remarkable is afoot!
If you know your correlation coefficients, you can see from the statistics presented with the chart that the correlation between the per capita consumption of American cheese in the U.S. and the total number of patents granted in the U.S. appears to be both strong and statistically significant.
What could possibly explain the strength of that correlation? Could the inspiration behind the visions of American inventors be explained by their consumption of American cheese? Then again, maybe it's not the inventors but the patent examiners instead. Could an increased intake of American cheese be making them more productive in issuing patents?
Or maybe, just maybe, someone's playing fast and loose with their data analysis and this whole cheesy correlation is a carefully invented fraud.
As much as we'd love to be able to definitively link the total number of patents issued to the individual consumption of American cheese, and despite the existence of an AI-generated paper proclaiming that such a relationship is real, the truth is it is all a spurious correlation. Tyler Vigen did some serious data dredging to uncover the unlikely relationship. Here, he explains why it works:
Why this works
Data dredging: I have 25,153 variables in my database. I compare all these variables against each other to find ones that randomly match up. That's 632,673,409 correlation calculations! This is called “data dredging.” Instead of starting with a hypothesis and testing it, I instead abused the data to see what correlations shake out. It’s a dangerous way to go about analysis, because any sufficiently large dataset will yield strong correlations completely at random.
Lack of causal connection: There is probably no direct connection between these variables, despite what the AI says above. This is exacerbated by the fact that I used "Years" as the base variable. Lots of things happen in a year that are not related to each other! Most studies would use something like "one person" instead of "one year" to be the "thing" studied.
Observations not independent: For many variables, sequential years are not independent of each other. If a population of people is continuously doing something every day, there is no reason to think they would suddenly change how they are doing that thing on January 1. A simple p-value calculation does not take this into account, so mathematically it appears less probable than it really is.
Confounding variable: 2020 is particularly different from the other years on this graph. Confounding variables (like global pandemics) will cause two variables to look connected when in fact a "sneaky third" variable is influencing both of them behind the scenes.
Y-axis doesn't start at zero: I truncated the Y-axes of the graph above. I also used a line graph, which makes the visual connection stand out more than it deserves. Mathematically what I showed is true, but it is intentionally misleading.
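For readers who want to see the first and third of those points in action, here is a minimal sketch in Python. It is not Vigen's code and it uses no real data: the series are made-up random numbers, and every name in it (`pearson`, `white_noise`, `random_walk`, `dredge`, the 200 series of 20 "years") is our own assumption chosen for illustration. It compares every pair of series twice, once where each year is an independent draw and once where each year builds on the previous one, then reports the strongest correlation found and the share of pairs that look "strong."

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def white_noise(rng, n_years):
    """Independent draws: each year is unrelated to the one before it."""
    return [rng.gauss(0, 1) for _ in range(n_years)]

def random_walk(rng, n_years):
    """Each year is last year's level plus a random step (years not independent)."""
    level, out = 0.0, []
    for _ in range(n_years):
        level += rng.gauss(0, 1)
        out.append(level)
    return out

def dredge(series, threshold=0.8):
    """Compare every pair; return (largest |r|, share of pairs with |r| > threshold)."""
    rs = [abs(pearson(a, b)) for a, b in combinations(series, 2)]
    return max(rs), sum(r > threshold for r in rs) / len(rs)

# Hypothetical sizes, far smaller than the database described in the quote.
N_SERIES, N_YEARS = 200, 20
rng = random.Random(0)  # fixed seed so reruns give the same numbers

noise_best, noise_share = dredge([white_noise(rng, N_YEARS) for _ in range(N_SERIES)])
walk_best, walk_share = dredge([random_walk(rng, N_YEARS) for _ in range(N_SERIES)])

# The quoted calculation count matches every variable paired with every variable.
N_VARIABLES = 25_153
print(f"{N_VARIABLES:,} squared = {N_VARIABLES ** 2:,}")

print(f"pairs compared per run: {N_SERIES * (N_SERIES - 1) // 2:,}")
print(f"independent years: best |r| = {noise_best:.2f}, share above 0.8 = {noise_share:.4f}")
print(f"carry-over years:  best |r| = {walk_best:.2f}, share above 0.8 = {walk_share:.4f}")
```

Nothing in either batch is related to anything else by construction, so any strong match the search turns up is an accident of having looked at thousands of pairs. The second batch should turn up far more of them, which is the point of the "observations not independent" item: when each year carries over from the last, two unrelated series can drift together for a couple of decades much more easily than a simple p-value would suggest.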
Vigen's website is a treasure trove of similarly dubious connections, and of course, he's the author of a book on the topic. In the following video he gives an overview of what can make a correlation spurious:
Still, these kinds of things are fun to think about. If you know any inventors or patent examiners, don't be afraid to send them some good old-fashioned American cheese to snack on so they can keep their productive output up. Just in case it might really work.
References
Tyler Vigen. Spurious correlation #2,196. tylervigen.com. Licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).