The Economist's Visualization Errors
The Economist’s Errors and Credit Where Credit is Due
The Economist is serious about their use of data visualization and they have occasionally owned up to errors in their visualizations. They can be deceptive, uninformative, confusing, excessively busy, and present a host of other barriers to clean communication. Their blog post on their errors is great.
I have drawn the following example from a
#tidyTuesday earlier this year that explores this. Here is the link to the setup page.
Your task comes in two parts. The first one is to recreate any of the second through sixth visualizations: Brexit, Dogs, EU Balance, Pensions, or Trade. You should have a look at the original and note the pathologies, but reproduce a nice visual (either their reconceptualization or something useful). One thing to note is that three axis graphics are almost always bad.
For example, the first one; the example is bad because it cuts the axis.
My Re-creation [in Labour red]
I built the whole thing in
equisse with two exceptions. First, reorder here is used to arrange the names in the order that I want to plot them. By default they are alphabetical. Second, I wanted the label geometry which is not an option in esquisse but one treats it like a point with the text as the plotting character. In the aesthetic, it is a label;
geom_text does the same thing without a box but it harder to distinguish in this case. If you want to play, the cheat sheet is here.
library(tidyverse) # call the tidyverse for %>% and ggplot library(ggthemes) # Use the economist theme corbyn <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-16/corbyn.csv") # reorder will allow me to sort the names by the values. # Uncomment to see: # with(corbyn, reorder(political_group,avg_facebook_likes)) # geom_label will allow me to put the numbers on the graph by specifying a label. ggplot(corbyn, aes(x=reorder(political_group,avg_facebook_likes), y=avg_facebook_likes, label=avg_facebook_likes)) + geom_bar(stat="identity", fill="#DC241f") + geom_label(color="#DC241f") + labs(x="Associate of the Labour Party", y="Average Facebook Likes") + theme_economist() + coord_flip()
In the next code chunk, you will need to cut and paste the code for grabbing your particular choice of data.
Now I am going to grab the Brexit data.
brexit <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-16/brexit.csv") summary(brexit)
## date percent_responding_right percent_responding_wrong ## Length:85 Min. :40.00 Min. :41.00 ## Class :character 1st Qu.:42.00 1st Qu.:43.00 ## Mode :character Median :44.00 Median :45.00 ## Mean :43.71 Mean :44.51 ## 3rd Qu.:45.00 3rd Qu.:45.00 ## Max. :47.00 Max. :48.00
# The variable names are too long and terrible. Let me change them. names(brexit) <- c("Date","Right","Wrong") # The dates are also not read as proper dates. Let me fix that. It can also be fixed by coercion in esquisse brexit$Date <- as.Date(brexit$Date, "%d/%m/%y") brexit
## # A tibble: 85 x 3 ## Date Right Wrong ## <date> <dbl> <dbl> ## 1 2016-08-02 46 42 ## 2 2016-08-09 45 44 ## 3 2016-08-17 46 43 ## 4 2016-08-23 45 43 ## 5 2016-08-31 47 44 ## 6 2016-09-14 46 43 ## 7 2016-10-12 45 44 ## 8 2016-10-20 45 44 ## 9 2016-11-15 46 43 ## 10 2016-11-29 44 45 ## # … with 75 more rows
The data are structured side by side and that makes it hard for R [or anything] to draw axes because it needs to know the boundaries for the axes. I need to stack the rights and wrongs into one column with an associated label in another column for Right and Wrong. That is reshaping data; pivoting longer and wider in the language of tidyr.
# My first step is to stack the percent reporting on top of each other to make color and fill work properly. I am going to take wider data [two columns of data] and stack them on top of each other. What am I going to name this new stacked column? "direction" and which columns do I wish to stack? c(Right,Wrong) My.Brexit <- brexit %>% pivot_longer(.,names_to = "direction",cols = c(Right,Wrong)) My.Brexit
## # A tibble: 170 x 3 ## Date direction value ## <date> <chr> <dbl> ## 1 2016-08-02 Right 46 ## 2 2016-08-02 Wrong 42 ## 3 2016-08-09 Right 45 ## 4 2016-08-09 Wrong 44 ## 5 2016-08-17 Right 46 ## 6 2016-08-17 Wrong 43 ## 7 2016-08-23 Right 45 ## 8 2016-08-23 Wrong 43 ## 9 2016-08-31 Right 47 ## 10 2016-08-31 Wrong 44 ## # … with 160 more rows
brexit %>% ggplot(aes(x=Date, y=Right)) + geom_smooth(color="green", size=1) + geom_point(color = "green") + geom_point(aes(x=Date, y=Wrong), color="red") + geom_smooth(aes(x=Date, y=Wrong), color="red") + labs(y="% Saying Right [Green] or Wrong [Red] to UK Leaving EU")
# The only addition beyond esquisse here is the addition of geom_smooth that gives me the smmoth intervals and annotate to write the sides in the graph instead of the legend.. ggp <- My.Brexit %>% ggplot(aes(x = Date, y = value, color=direction, fill = direction)) + geom_smooth(size = 1L) + geom_point(size=1.5) + scale_color_viridis_d(name="Direction", guide = FALSE) + scale_fill_viridis_d(name="Direction", guide = FALSE) + theme_economist_white() + labs(title="Bremorse...", subtitle = "In hindsight, do you think Britain was Right or Wrong to vote to leave the EU?", x="Date", y = "Percent Responding ...") + # add two annotations to the plot instead of a scale annotate("text" , x=as.Date("2018-01-01"), y=47, label="Wrong", size=6, color = "#fde725ff") + annotate("text" , x=as.Date("2018-01-01"), y=41, label="Right", size=6, color = "#440154ff") ggp
That’s pretty cool.
library(ggrepel) # I need to repel the labels. pensions <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-16/pensions.csv") pensions$scaled.65 <- scale(pensions$pop_65_percent) pensions$scaled.GSgdp <- scale(pensions$gov_spend_percent_gdp) pensions$prod.scaled <- pensions$scaled.65*pensions$scaled.GSgdp
Now a picture. I want it simple but complete. At first, I used text but it is not easy to read.
pensions <- pensions %>% mutate(Mean65 = mean(pop_65_percent), MeanSpend = mean(gov_spend_percent_gdp)) pensions %>% ggplot(., aes(x=pop_65_percent, y=gov_spend_percent_gdp, label=country, color=prod.scaled)) + geom_point(alpha=0.1) + geom_hline(yintercept = mean(pensions$MeanSpend)) + geom_vline(xintercept = mean(pensions$Mean65)) + geom_text(size=3, fill="white") + labs(x="Percent of Population over 65", y="Government Pension Spending/GDP") + theme_economist_white() + labs(title="Population over 65 and Pension Spending per GDP", caption = "Coloring by Scaled Covariance") + scale_color_viridis_c(guide=FALSE)
**ggrepel* allows labels that arrow to plot locations and keeps the labels from overlapping. I called the library earlier, now I will use it’s geom called geom_label_repel.
ggplot(pensions) + aes(x=pop_65_percent, y=gov_spend_percent_gdp, label=country, color=prod.scaled) + geom_label_repel(size=3, fill="white") + labs(x="Percent of Population over 65", y="Government Pension Spending/GDP") + geom_point() + geom_hline(yintercept = mean(pensions$MeanSpend)) + geom_vline(xintercept = mean(pensions$Mean65)) + theme_economist_white() + labs(title="Population over 65 and Pension Spending per GDP", caption = "Coloring by Scaled Covariance") + scale_color_viridis_c(guide=FALSE)
That seems to get it. What’s up with Brazil?
dogs <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-16/dogs.csv")
p <- ggplot(dogs) + aes(x=avg_neck, y=avg_weight, size=year) + geom_point() + theme_economist_white() + labs(x="Neck Size", y="Weight", title="The Incredible Shrinking Dog?") p
This is a perfect case for a text label. If they are decent sized, they overlap. I used
geom_text which requires a label [label = year] in the aesthetic mapping.
p <- ggplot(dogs) + aes(x=avg_neck, y=avg_weight, label=year) + geom_text(size=3, color="white") + theme_dark() + labs(x="Neck Size", y="Weight", title="The Incredible Shrinking Dog?") p
library(emojifont) p <- ggplot(dogs) + aes(x=avg_neck, y=avg_weight, size=-year, label = emoji('dog')) + geom_text() + theme_minimal() + labs(x="Neck Size", y="Weight", title="The Incredible Shrinking Dog?", caption = "Sized by Year -- decreasing... Shrinking Dogs?") + guides(size=FALSE) p
eu_balance <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-16/eu_balance.csv") skimr::skim(eu_balance)
|Number of rows||266|
|Number of columns||4|
|Column type frequency:|
Variable type: character
Variable type: numeric
## ## budget current ## 133 133
## ## Austria Belgium Cyprus Estonia Finland France ## 14 14 14 14 14 14 ## Germany Greece Ireland Italy Latvia Lithuania ## 14 14 14 14 14 14 ## Luxembourg Malta Netherlands Portugal Slovakia Slovenia ## 14 14 14 14 14 14 ## Spain ## 14
ggplot(eu_balance) + aes(x = year, y = value, fill = account_type, color=account_type) + geom_line(size = 1L, alpha=0.3) + geom_hline(yintercept = 0, size=1, color="black") + scale_fill_hue() + theme_minimal() + facet_wrap(vars(country), scales = "free_y")
gridExtra allows us to combine multiple plots into one. I build two plots and grid them together in the last line.
library(gridExtra) trade <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-16/trade.csv") trade
## # A tibble: 22 x 3 ## year trade_deficit manufacture_employment ## <dbl> <dbl> <dbl> ## 1 1995 -3.38e10 17244583. ## 2 1996 -3.95e10 17236750 ## 3 1997 -4.97e10 17417833. ## 4 1998 -5.69e10 17560000 ## 5 1999 -6.87e10 17322667. ## 6 2000 -8.38e10 17265250 ## 7 2001 -8.31e10 16440583. ## 8 2002 -1.03e11 15256833. ## 9 2003 -1.24e11 14508500 ## 10 2004 -1.62e11 14314750 ## # … with 12 more rows
p1 <- ggplot(trade, aes(x=year, y=trade_deficit/1000000000)) + geom_area(fill="red") + theme_economist() + labs(y = "US Trade Deficit in Billions of US $") p2 <- ggplot(trade, aes(x=year, y=manufacture_employment/1000000)) + geom_line() + theme_economist() + labs(y = "US Manufacturing Employment in millions") grid.arrange(p1,p2)
That is fairly simple and it works. The trick here was to use a grid to create two plots and stack them.
The Last One: No Solution Provided
The authors of the post give you good advice on how to solve this. I contend that there is a great plot to generate using
esquisse though it may not be just one plot. Have a go. If you wish to, you can peruse the appropriate timelines for
#tidyTuesday and see what people came up with though in many cases, this will be apparent so use appropriate citations; it is easy to do with the examples here in this Markdown.
women_research <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-04-16/women_research.csv") women_research
## # A tibble: 60 x 3 ## country field percent_women ## <chr> <chr> <dbl> ## 1 Japan Health sciences 0.24 ## 2 Chile Health sciences 0.43 ## 3 United Kingdom Health sciences 0.45 ## 4 United States Health sciences 0.46 ## 5 Mexico Health sciences 0.46 ## 6 Denmark Health sciences 0.47 ## 7 EU28 Health sciences 0.48 ## 8 France Health sciences 0.48 ## 9 Canada Health sciences 0.49 ## 10 Australia Health sciences 0.5 ## # … with 50 more rows
# There's a spelling error I want to fix, select the rows and columns to fix and reassign the value women_research[women_research$field=="Women inventores","field"] <- "Women Inventors" # The label for CS/Maths is too long, also women_research[women_research$field=="Computer science, maths","field"] <- "CS/Math"
The data consist of three columns and are a long pivot table. The pivots are country and field with a quantitative indicator of the percent of women in the field. It is a long pivot table characterised by country and field combinations in the rows. To show the data cleanly, let me spread out the fields by widening the pivot table by fields.
# I find it far more intuitive to use pivot_wider but this will always work and is safer on versions of tidyr # women_research %>% pivot_wider(field, percent_women) women_research %>% spread(field, percent_women) %>% knitr::kable()
|country||CS/Math||Engineering||Health sciences||Physical sciences||Women Inventors|
The first way is to use esquisse and a simple grid plot for two categorical axes and add details to it in the boxes.
ggp1 <- ggplot(women_research) + aes(x = field, y = country, fill = percent_women) + geom_tile(size = 1L) + scale_fill_viridis_c(option = "viridis") + labs(x = "Field or Occupation", y = "Country", title = "Publication and Invention by Women Around the World", subtitle = "Percent of Publications or Patents by Women", caption = "source: Economist Visualization Errors", fill = "Percent Women") + theme_minimal(base_size = 10) ggp1 + theme(plot.background = element_rect(colour = "whitesmoke"))
We can also use equisse to build a faceted barplot by dragging the field to facet. X is country, y is percent women, fill is country with facets determined by fields.
women_research %>% ggplot(.) + aes(x = country, fill = country, weight = percent_women) + geom_bar() + scale_fill_hue() + coord_flip() + theme_minimal(base_size = 10) + theme(legend.position = "none") + labs(y = "Percent of Publications/Patents", x = "Country", title = "Publication and Invention by Women Around the World", subtitle = "Percent of Publications or Patents by Women", caption = "source: Economist Visualization Errors") + facet_wrap(vars(field))
Or reverse the above use of field and country where field is the X and the fill, y is still percent_women and facets are now countries. This works quite well if only because there are 12 entities which grid into a 3 x 4 nicely.
women_research %>% ggplot(.) + aes(x = field, fill = field, weight = percent_women) + geom_bar() + scale_fill_hue() + coord_flip() + theme_minimal(base_size = 10) + theme(legend.position = "none") + labs(y = "Percent of Publications/Patents", x = "Field", title = "Publication and Invention by Women Around the World", subtitle = "Percent of Publications or Patents by Women", caption = "source: Economist Visualization Errors") + facet_wrap(~country)
To show something very cool that R can do, I have added one more bit that is turned off. In the following code chunk, I use
eval=FALSE to avoid evaluating the chunk in R. If the FALSE is changed to TRUE, R will need
library(ggiraph) women_research <- women_research %>% group_by(country) %>% mutate(Avg.Women.Percent = mean(percent_women)) %>% ungroup() women_research$PercentWomen <- as.character(round(women_research$percent_women, 3)) p1 <- ggplot(women_research) + aes(x = reorder(country,Avg.Women.Percent), fill = country, weight = percent_women, tooltip = PercentWomen, data_id = country) + geom_bar_interactive() + scale_fill_viridis_d(option = "magma") + coord_flip() + theme_minimal() + labs(y = "Percent of Publications/Patents", x = "Country/Grouping [sorted by avg.]", fill="Country/Grouping", title = "Publication and Invention by Women Around the World", subtitle = "Percent of Publications or Patents by Women", caption = "source: Economist Visualization Errors") + facet_wrap(~field) girafe(code = print(p1))