Given the amount of data we handle, one of the larger challenges we face is how to visually present it.
A good case in point is our recently updated chart showing the United States' total federal tax receipts and personal income tax receipts from year to year as a percentage share of annual GDP. In addition to this basic information, the chart also presents statistically derived data: the mean of each dataset and some key thresholds based on the standard deviation and normal distribution of the year-to-year data, much as we might show on a statistical control chart:
The problem we run into is that while these thresholds are indicated, it's difficult to tell exactly where they apply. We could, for instance, label the value of each threshold for each dataset directly on the chart, but the added clutter would make the chart more difficult to read.
It occurred to us, though, that the two datasets are related. Since the federal government's personal income tax receipts are a component of its total tax receipts (along with payroll and corporate taxes), we could plot the total tax receipts against the personal income tax receipts.
Doing so would give us a sense of how much of the variation in the federal government's total tax receipts might be explained by the variation in personal income tax receipts, which is only hinted at in our "control" chart showing the two datasets independently. The chart below shows what we found when we created that plot, marked those key statistical thresholds on it, and ran a simple linear regression:
The chart above indicates the ±1σ (plus or minus one standard deviation) limits for each dataset as the horizontal (personal income tax receipts) and vertical (total federal tax receipts) shaded orange regions. These shaded regions are significant because, for normally distributed data, we can expect any point within each dataset to fall within those bounds about 68.3% of the time.
The dashed lines indicate the ±3σ (plus or minus three standard deviations) limits for each dataset, within which we can expect about 99.7% of all observations to fall.
The darker orange box where the two shaded regions overlap is then where we would expect to find most of the data points. Which we do!
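The coverage figures for a normal distribution can be checked directly from the error function. The short sketch below is only an illustration; the function name `coverage_within` is our own and does not come from the original analysis.

```python
import math


def coverage_within(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of its mean: erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))


# Print the share of observations expected inside the 1, 2 and 3 sigma limits.
for k in (1, 2, 3):
    print(f"within ±{k}σ: {coverage_within(k):.2%}")
```

These figures hold only to the extent that the year-to-year data really is normally distributed.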
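To make the overlap check concrete, here is a minimal sketch of how the ±1σ band for each dataset, and the share of points falling inside both bands at once, might be computed. The numbers are made-up placeholders rather than the actual tax receipt data, the function names are hypothetical, and we assume the sample standard deviation is the one in use.

```python
from statistics import mean, stdev


def sigma_band(values, k=1.0):
    """Return the (lower, upper) limits at k sample standard deviations
    around the mean of the values."""
    m = mean(values)
    s = stdev(values)  # assumption: sample (n - 1) standard deviation
    return m - k * s, m + k * s


def share_in_box(xs, ys, k=1.0):
    """Fraction of (x, y) points lying inside both k-sigma bands."""
    x_lo, x_hi = sigma_band(xs, k)
    y_lo, y_hi = sigma_band(ys, k)
    inside = sum(
        1
        for x, y in zip(xs, ys)
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi
    )
    return inside / len(xs)


# Placeholder values (percent of GDP), NOT the real series.
personal = [7.2, 7.8, 8.1, 7.5, 8.4, 9.0, 7.9, 8.0]
total = [16.9, 17.6, 17.9, 17.2, 18.3, 19.1, 17.7, 17.8]

print(sigma_band(personal))
print(sigma_band(total))
print(share_in_box(personal, total))
```

Because the two series move together, the share of points in the overlap box will generally be higher than it would be for two unrelated series.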
Turning to the linear regression portion of the analysis, the resulting coefficient of determination indicates that 82% of the variance in U.S. total federal tax receipts, measured as a percent share of GDP, is explained by the variance observed in just the government's personal income tax receipts. In this case, since we know what other components feed into total federal tax receipts, we can state that the remaining 18% or so of the variation in total federal tax receipts may be attributed to the annual variation in corporate and payroll taxes.
Going back to our "control" chart, we find that while personal income taxes may account for 82% of the variation in total tax collections from year to year, they represent an average of only 44-45% of total federal tax receipts in the period from 1946 through 2009.
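For readers who want to see where a coefficient of determination comes from, the following is a minimal sketch of an ordinary least-squares fit and its R² value. The data points are made-up placeholders, not the actual receipts series, and the function names are our own, so the result will not match the figure from the chart.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    slope, intercept = linear_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Placeholder values (percent of GDP), NOT the real series.
personal = [7.2, 7.8, 8.1, 7.5, 8.4, 9.0, 7.9, 8.0]
total = [16.9, 17.9, 17.6, 17.5, 18.0, 19.1, 17.4, 18.1]

print(f"R^2 = {r_squared(personal, total):.2f}")
```

The share of variance not explained by the fit is simply one minus this value.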
This latter observation confirms that personal income taxes have a disproportionate impact upon the total tax collections of the U.S. federal government.