IV. Visualizing the Distribution of Variables

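For a discrete variable, a bar chart of counts, proportions, or percentages shows the distribution of a single variable, and a heatmap shows the joint distribution of two discrete variables. For a continuous variable, a histogram or a density estimate shows the distribution of a single variable, density estimates of several variables can be drawn together for comparison, and normal QQ plots, box plots, and empirical distribution function plots are also available.

(1) Visualizing a single distribution

The example data set describes the passengers of the Titanic. There were roughly 1,300 passengers on board (not counting crew), and the data set reports the ages of 756 of them. We want to know how many passengers of each age group were on the ship, that is, how many children, young adults, middle-aged people, and seniors.

The relative proportions of the different age groups among the passengers are called the age distribution of the passengers. To obtain it, assign every passenger to the matching age group, then compute the number of passengers in each group and each group's share of the total.

The packages needed for this example are loaded by the file "_common.R", so sourcing that file is enough.

    # run setup script
    source("_common.R")

    # Count passengers in 5-year age bins.
    # Assumption: age_hist is not defined anywhere in the text; this definition
    # follows the pattern used below for the 1-, 3-, and 15-year bins.
    age_hist <- data.frame(
      count = hist(titanic$age, breaks = (0:15) * 5 + .01, plot = FALSE)$counts
    )
    age_hist

The resulting grouping of passengers by age is shown in Table 6-2-2.

Table 6-2-2 Passengers grouped by age

The table can be visualized by drawing filled rectangles (see Figure 6-2-41) whose heights correspond to the counts and whose widths correspond to the widths of the age groups. Such a chart is called a histogram. Note that the width of each bar represents the bin width and the height represents the frequency in that bin. For the visualization to be a valid histogram, all bars must have the same width.

    # add the midpoint of each 5-year bin as the x position
    age_hist <- cbind(age_hist, age = (1:15) * 5 - 2.5)

    h1 <- ggplot(age_hist, aes(x = age, y = count)) +
      geom_col(width = 4.7, fill = "#56B4E9") +
      scale_y_continuous(expand = c(0, 0), breaks = 25 * (0:5)) +
      scale_x_continuous(name = "Age", limits = c(0, 75), expand = c(0, 0)) +
      coord_cartesian(clip = "off") +
      theme_dviz_hgrid() +
      theme(
        axis.line.x = element_blank(),
        plot.margin = margin(3, 7, 3, 1.5)
      )
    h1

Figure 6-2-41 Histogram of passenger ages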
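The grouping step described above (assign each passenger to an age bin, then compute the count and the share of each bin) can be sketched in base R. This is a minimal illustration only: the ages are synthetic values generated with `runif()`, not the Titanic data, and the names `age`, `breaks`, and `age_table` are hypothetical.

```r
# Synthetic ages for illustration only (an assumption; NOT the Titanic data)
set.seed(1)
age <- round(runif(200, min = 0.5, max = 74.5), 1)

# Fifteen left-open, right-closed 5-year bins: (0,5], (5,10], ..., (70,75]
breaks <- seq(0, 75, by = 5)
groups <- cut(age, breaks = breaks)

# Count per bin and share of the total: together they describe the age distribution
bin_counts <- as.vector(table(groups))
age_table <- data.frame(
  age_range  = levels(groups),
  count      = bin_counts,
  proportion = bin_counts / length(age)
)
print(age_table)
```

Because every bin has the same width, the counts can be used directly as bar heights; with unequal bins the heights would have to be rescaled by bin width.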
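Because a histogram is drawn from binned data, its appearance depends on the bin width. Most programs that generate histograms choose a default bin width, but that default may not be the most appropriate one, so it is important to try several bin widths.

For a histogram to reflect the main features of the data accurately, the bin width must be chosen with care. In general, if the bins are too narrow, the histogram becomes overly peaky and visually busy, and the main trends in the data may be obscured. If the bins are too wide, differences within the distribution are smoothed away and smaller features of the data may disappear.

Figure 6-2-42 shows the age distribution of the Titanic passengers with different bin widths. A bin width of 1 year is too small and a bin width of 15 years is too large, whereas bin widths of 3 to 5 years work well.

Figure 6-2-42 Histograms with different bin widths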
    # counts for bin widths of 1, 3, and 15 years; age is the bin midpoint
    age_hist_1 <- data.frame(
      age = (1:75) - 0.5,
      count = hist(titanic$age, breaks = (0:75) + .01, plot = FALSE)$counts
    )
    age_hist_3 <- data.frame(
      age = (1:25) * 3 - 1.5,
      count = hist(titanic$age, breaks = (0:25) * 3 + .01, plot = FALSE)$counts
    )
    age_hist_15 <- data.frame(
      age = (1:5) * 15 - 7.5,
      count = hist(titanic$age, breaks = (0:5) * 15 + .01, plot = FALSE)$counts
    )

    # (a) 1-year bins
    h2 <- ggplot(age_hist_1, aes(x = age, y = count)) +
      geom_col(width = .85, fill = "#56B4E9") +
      scale_y_continuous(expand = c(0, 0), breaks = 10 * (0:5)) +
      scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
      coord_cartesian(clip = "off") +
      theme_dviz_hgrid(12) +
      theme(axis.line.x = element_blank(), plot.margin = margin(3, 1.5, 3, 1.5))

    # (b) 3-year bins
    h3 <- ggplot(age_hist_3, aes(x = age, y = count)) +
      geom_col(width = 2.75, fill = "#56B4E9") +
      scale_y_continuous(expand = c(0, 0), breaks = 25 * (0:5)) +
      scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
      coord_cartesian(clip = "off") +
      theme_dviz_hgrid(12) +
      theme(axis.line.x = element_blank(), plot.margin = margin(3, 1.5, 3, 1.5))

    # (d) 15-year bins
    h4 <- ggplot(age_hist_15, aes(x = age, y = count)) +
      geom_col(width = 14.5, fill = "#56B4E9") +
      scale_y_continuous(expand = c(0, 0), breaks = 100 * (0:4)) +
      scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
      coord_cartesian(clip = "off") +
      theme_dviz_hgrid(12) +
      theme(axis.line.x = element_blank(), plot.margin = margin(3, 1.5, 3, 1.5))

    # 2 x 2 grid; the NULL entries are narrow spacers; panel (c) reuses the 5-year histogram h1
    plot_grid(
      h2, NULL, h3,
      NULL, NULL, NULL,
      h1 + theme_dviz_hgrid(12) +
        theme(axis.line.x = element_blank(), plot.margin = margin(3, 1.5, 3, 1.5)),
      NULL, h4,
      align = 'hv',
      labels = c("a", "", "b", "", "", "", "c", "", "d"),
      rel_widths = c(1, .04, 1),
      rel_heights = c(1, .04, 1)
    )

As the graphics capabilities of computers have improved, histograms have increasingly been replaced by density plots. A density plot uses a continuous curve to visualize the underlying probability distribution of the data, which matches how a probability density is defined in statistics. The curve is not known in advance and has to be estimated from the data; the most commonly used method is kernel density estimation. In kernel density estimation, a continuous curve of small width (the kernel, whose width is controlled by a parameter called the bandwidth) is drawn at the location of each data point, and all of these curves are then added up to obtain the final density estimate. The most widely used kernel is the Gaussian kernel (a Gaussian bell curve); the resulting density estimate is shown in Figure 6-2-43.

Figure 6-2-43 Density plot of the age distribution
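To make the kernel density procedure described above concrete, the following base R sketch places one Gaussian bump on every observation and averages the bumps, then compares the result with R's built-in `density()`. It is an illustration under stated assumptions: the data are synthetic, and the names `x`, `bw`, `grid`, and `kde_manual` are hypothetical.

```r
# Synthetic two-cluster sample for illustration only (NOT the Titanic ages)
set.seed(42)
x  <- c(rnorm(60, mean = 25, sd = 6), rnorm(40, mean = 45, sd = 8))
bw <- 2  # bandwidth = standard deviation of each Gaussian bump

# Evaluation grid extending 3 bandwidths beyond the data
grid <- seq(min(x) - 3 * bw, max(x) + 3 * bw, length.out = 512)

# Kernel density estimate by hand: at each grid point, average the
# Gaussian curves centred on the individual observations
kde_manual <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = bw)))

# Built-in estimate on the same grid (density() uses a fast binned approximation)
d <- density(x, bw = bw, kernel = "gaussian",
             from = min(grid), to = max(grid), n = 512)

# Largest pointwise difference between the two estimates
print(max(abs(kde_manual - d$y)))

# The area under the curve should be close to 1
print(sum(kde_manual) * diff(grid)[1])
```

Averaging rather than summing the bumps keeps the total area at 1, so the curve can be read as a probability density.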
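    ggplot(titanic, aes(x = age)) +
      geom_density_line(
        fill = "#56B4E9", color = darken("#56B4E9", 0.5),
        bw = 2, kernel = "gaussian"
      ) +
      scale_y_continuous(limits = c(0, 0.046), expand = c(0, 0), name = "probability density") +
      scale_x_continuous(name = "age", limits = c(0, 75), expand = c(0, 0)) +
      coord_cartesian(clip = "off") +
      theme_dviz_hgrid() +
      theme(
        axis.line.x = element_blank(),
        plot.margin = margin(3, 7, 3, 1.5)
      )

As with histograms, the exact appearance of a density plot depends on the choice of kernel and bandwidth. The bandwidth parameter behaves much like the bin width of a histogram. If the bandwidth is too small, the density estimate becomes overly peaky and visually busy, and the main trends in the data may be obscured. If the bandwidth is too large, smaller features of the distribution may disappear. The choice of kernel also affects the shape of the density curve, as shown in Figure 6-2-44. For example, a Gaussian kernel tends to produce estimates that look like Gaussian densities, with smooth features and smooth tails. By contrast, a rectangular kernel can produce visible steps in the density curve. In general, the more data points there are in the data set, the less the choice of kernel matters. Density plots are therefore reliable and expressive for large data sets: for a continuous random variable, the larger the sample, the more accurate the plotted curve, whereas for a data set with only a few points the estimate can be badly misleading.

Figure 6-2-44 Kernel density estimates with different bandwidths and kernels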
    # The four panels differ only in bandwidth and kernel, so build them with one helper
    density_panel <- function(bw, kernel) {
      ggplot(titanic, aes(x = age)) +
        geom_density_line(
          fill = "#56B4E9", color = darken("#56B4E9", 0.5),
          bw = bw, kernel = kernel
        ) +
        scale_y_continuous(limits = c(0, 0.046), expand = c(0, 0), name = "density") +
        scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
        coord_cartesian(clip = "off") +
        theme_dviz_hgrid(12) +
        theme(axis.line.x = element_blank(), plot.margin = margin(3, 1.5, 3, 1.5))
    }

    pdens1 <- density_panel(bw = .5, kernel = "gaussian")     # (a) bandwidth too small
    pdens2 <- density_panel(bw = 2,  kernel = "gaussian")     # (b) reasonable bandwidth
    pdens3 <- density_panel(bw = 5,  kernel = "gaussian")     # (c) bandwidth too large
    pdens4 <- density_panel(bw = 2,  kernel = "rectangular")  # (d) rectangular kernel

    plot_grid(
      pdens1, NULL, pdens2,
      NULL, NULL, NULL,
      pdens3, NULL, pdens4,
      align = 'hv',
      labels = c("a", "", "b", "", "", "", "c", "", "d"),
      rel_widths = c(1, .04, 1),
      rel_heights = c(1, .04, 1)
    )
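The effect of the bandwidth can also be checked numerically rather than by eye. The sketch below is a rough illustration on synthetic data (not the Titanic ages): for each bandwidth it records the height of the tallest peak and a simple "wiggliness" score, here defined as the total up-and-down movement of the curve. The function `summarise_density()` and the wiggliness score are hypothetical choices made for this example, not standard diagnostics.

```r
# Synthetic bimodal sample for illustration only
set.seed(7)
age <- c(rnorm(150, mean = 28, sd = 8), rnorm(50, mean = 50, sd = 6))

# Summarise one density estimate: peak height and total vertical movement
summarise_density <- function(bw, kernel = "gaussian") {
  d <- density(age, bw = bw, kernel = kernel, from = 0, to = 80, n = 512)
  data.frame(
    bw         = bw,
    kernel     = kernel,
    peak       = max(d$y),
    wiggliness = sum(abs(diff(d$y)))
  )
}

result <- rbind(
  summarise_density(0.5),                      # narrow bandwidth: spiky curve
  summarise_density(2),                        # moderate bandwidth
  summarise_density(5),                        # wide bandwidth: features smoothed away
  summarise_density(2, kernel = "rectangular") # same bandwidth, different kernel
)
print(result)
```

For the Gaussian kernel, both the peak height and the wiggliness score should fall as the bandwidth grows, which is the numerical counterpart of the curve becoming smoother.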
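(2) Visualizing multiple distributions

In many situations several distributions have to be visualized at the same time. For example, we may want to know the age distributions of the male and female passengers of the Titanic. Age is the only binned variable, and sex splits the passengers into two categories. A common strategy is a stacked histogram, as shown in Figure 6-2-45, in which the bars for female passengers are drawn in a different color on top of the bars for male passengers.

Figure 6-2-45 Stacked histogram of passenger ages by sex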
    # counts per 3-year bin for each sex, reshaped to long format
    gender_counts <- data.frame(
      age = (1:25) * 3 - 1.5,
      male = hist(filter(titanic, sex == "male")$age,
                  breaks = (0:25) * 3 + .01, plot = FALSE)$counts,
      female = hist(filter(titanic, sex == "female")$age,
                    breaks = (0:25) * 3 + .01, plot = FALSE)$counts
    ) %>%
      gather(gender, count, -age)

    gender_counts$gender <- factor(gender_counts$gender, levels = c("female", "male"))

    p_hist_stacked <- ggplot(gender_counts, aes(x = age, y = count, fill = gender)) +
      geom_col(position = "stack") +
      scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
      scale_y_continuous(limits = c(0, 89), expand = c(0, 0), name = "count") +
      scale_fill_manual(values = c("#D55E00", "#0072B2")) +
      coord_cartesian(clip = "off") +
      theme_dviz_hgrid() +
      theme(
        axis.line.x = element_blank(),
        legend.position = c(.9, .87),
        legend.justification = c("right", "top"),
        legend.box.background = element_rect(fill = "white", color = "white"),
        plot.margin = margin(3, 7, 3, 1.5)
      )

    # display the plot
    p_hist_stacked
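Figure 6-2-45 has two problems. First, from the figure alone it is not at all clear where the bars begin: do they start where the color changes, or at zero? For example, there were about 25 female passengers aged 18 to 20, but the figure appears to show about 80. Second, the bars for female passengers do not share a common baseline, so their heights cannot be compared directly with one another. For example, the male passengers were on average older than the female passengers, but this cannot be seen in the figure. One attempt to address this is to let all bars start at zero and make them partially transparent, as shown in Figure 6-2-46.

    p_hist_overlapped <- ggplot(gender_counts, aes(x = age, y = count, fill = gender)) +
      geom_col(position = "identity", alpha = 0.7) +
      scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
      scale_y_continuous(limits = c(0, 56), expand = c(0, 0), name = "count") +
      scale_fill_manual(
        values = c("#D55E00", "#0072B2"),
        guide = guide_legend(reverse = TRUE)
      ) +
      coord_cartesian(clip = "off") +
      theme_dviz_hgrid() +
      theme(
        axis.line.x = element_blank(),
        legend.position = c(.9, .87),
        legend.justification = c("right", "top"),
        legend.box.background = element_rect(fill = "white", color = "white"),
        plot.margin = margin(3, 7, 3, 1.5)
      )

    # mark the figure as an example of what not to do
    stamp_bad(p_hist_overlapped)

Figure 6-2-46 Overlapping histograms with a common baseline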
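The baseline problem can be seen directly in the numbers. The following base R sketch, on synthetic data with hypothetical names (`passengers`, `stacked_top`, `female_only`), computes for each age bin the value a reader sees at the top of a stacked bar and the value the female segment actually represents.

```r
# Synthetic passengers for illustration only (NOT the Titanic data)
set.seed(3)
passengers <- data.frame(
  age = runif(200, min = 1, max = 74),
  sex = rep(c("male", "female"), times = c(120, 80))
)

# Cross-tabulate 3-year age bins against sex
breaks <- seq(0, 75, by = 3)
counts <- table(cut(passengers$age, breaks), passengers$sex)

# Stacked bars: the female segment sits on top of the male segment,
# so its top edge is at male + female, not at the female count
stacked_top <- rowSums(counts)

# Bars drawn from a common zero baseline show the female count itself
female_only <- counts[, "female"]

print(head(data.frame(stacked_top, female_only)))
```

The gap between the two columns is the male count in that bin, which is why reading the top edge of a stacked bar overstates the female count.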
However, Figure 6-2-46 now appears to show three distinct groups rather than two, and it is still not entirely clear where each bar starts and ends. Overlapping histograms do not work well, because a semi-transparent bar drawn on top of another bar does not look like a semi-transparent bar; it looks like a bar drawn in a different color. Overlapping density plots usually do not have this problem, because the continuous density curves help the eye keep the distributions separate. For this data set, however, the age distributions of male and female passengers are nearly identical up to about age 17 and diverge after that, so the result is still not ideal, as shown in Figure 6-2-47.

    titanic2 <- titanic
    titanic2$sex <- factor(titanic2$sex, levels = c("male", "female"))

    ggplot(titanic2, aes(x = age, y = ..count.., fill = sex, color = sex)) +
      geom_density_line(bw = 2, alpha = 0.7) +
      scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
      scale_y_continuous(limits = c(0, 19), expand = c(0, 0), name = "scaled density") +
      scale_fill_manual(values = c("#0072B2", "#D55E00"), name = "gender") +
      scale_color_manual(values = darken(c("#0072B2", "#D55E00"), 0.5), name = "gender") +
      guides(fill = guide_legend(override.aes = list(linetype = 0))) +
      coord_cartesian(clip = "off") +
      theme_dviz_hgrid() +
      theme(
        axis.line.x = element_blank(),
        legend.position = c(.9, .87),
        legend.justification = c("right", "top"),
        legend.box.background = element_rect(fill = "white", color = "white"),
        plot.margin = margin(3, 7, 3, 1.5)
      )

Figure 6-2-47 Overlapping density plots
For this data set, the best solution is to show the age distributions of male and female passengers separately, each as a proportion of the overall age distribution, as in Figure 6-2-48. This figure shows clearly and intuitively that there were far fewer women than men in the 20 to 50 age range on the Titanic.

    ggplot(titanic2, aes(x = age, y = ..count..)) +
      # grey background: all passengers, repeated in every facet
      geom_density_line(
        data = select(titanic, -sex),
        aes(fill = "all passengers"),
        color = "transparent"
      ) +
      # foreground: the passengers of one sex
      geom_density_line(aes(fill = sex), bw = 2, color = "transparent") +
      scale_x_continuous(limits = c(0, 75), name = "passenger age (years)", expand = c(0, 0)) +
      scale_y_continuous(limits = c(0, 26), name = "scaled density", expand = c(0, 0)) +
      scale_fill_manual(
        values = c("#b3b3b3a0", "#D55E00", "#0072B2"),
        breaks = c("all passengers", "male", "female"),
        labels = c("all passengers  ", "males  ", "females"),
        name = NULL,
        guide = guide_legend(direction = "horizontal")
      ) +
      coord_cartesian(clip = "off") +
      facet_wrap(~sex, labeller = labeller(
        sex = function(sex) paste(sex, "passengers"))) +
      theme_dviz_hgrid() +
      theme(
        axis.line.x = element_blank(),
        strip.text = element_text(size = 14, margin = margin(0, 0, 0.2, 0, "cm")),
        legend.position = "bottom",
        legend.justification = "right",
        legend.margin = margin(4.5, 0, 1.5, 0, "pt"),
        legend.spacing.x = grid::unit(4.5, "pt"),
        legend.spacing.y = grid::unit(0, "pt"),
        legend.box.spacing = grid::unit(0, "cm")
      )

Figure 6-2-48 Age distributions of male and female passengers, each shown against the distribution of all passengers
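When exactly two distributions are to be visualized, another option is to make two separate histograms, rotate them by 90°, and let the bars of one histogram point in the opposite direction from the bars of the other, as shown in Figure 6-2-49. This trick is commonly used for age distributions, and the resulting plot is called an age pyramid.

    # male counts are negated so that their bars point the other way
    ggplot(gender_counts,
           aes(x = age, y = ifelse(gender == "male", -1, 1) * count, fill = gender)) +
      geom_col() +
      scale_x_continuous(name = "age (years)", limits = c(0, 75), expand = c(0, 0)) +
      # label both sides of the axis with positive counts
      scale_y_continuous(name = "count", breaks = 20 * (-2:1),
                         labels = c("40", "20", "0", "20")) +
      scale_fill_manual(values = c("#D55E00", "#0072B2"), guide = "none") +
      draw_text(x = 70, y = -39, "male", hjust = 0) +
      draw_text(x = 70, y = 21, "female", hjust = 0) +
      coord_flip() +
      theme_dviz_grid() +
      theme(axis.title.x = element_text(hjust = 0.61))

Figure 6-2-49 Age pyramid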
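The data preparation behind an age pyramid is small enough to sketch in base R. The example below uses synthetic data and hypothetical names (`passengers`, `pyramid`, `axis_ticks`): it negates the counts of one group so its bars point the other way, and derives axis labels with `abs()` so that both sides read as positive counts instead of hard-coding them.

```r
# Synthetic passengers for illustration only (NOT the Titanic data)
set.seed(11)
passengers <- data.frame(
  age = runif(200, min = 1, max = 74),
  sex = rep(c("male", "female"), times = c(130, 70))
)

# Counts per 5-year age bin and sex
breaks <- seq(0, 75, by = 5)
counts <- table(cut(passengers$age, breaks), passengers$sex)

# Male counts are negated: their bars extend to the left of zero
pyramid <- data.frame(
  age_mid = head(breaks, -1) + 2.5,
  male    = -as.vector(counts[, "male"]),
  female  = as.vector(counts[, "female"])
)

# Tick positions span both directions; labels show magnitudes only
axis_ticks  <- pretty(range(pyramid$male, pyramid$female))
axis_labels <- abs(axis_ticks)

print(pyramid)
print(axis_labels)
```

Deriving the labels from the tick positions keeps the axis correct if the counts change, whereas fixed label strings have to be updated by hand.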
This trick no longer works when more than two distributions have to be visualized at the same time. Histograms of many distributions tend to become very confusing, whereas density plots handle this case much better. Figure 6-2-50 uses overlapping density plots to show the butterfat percentage in the milk of four cattle breeds. Because each distribution is a single continuous curve, the breeds can be compared side by side without the curves obscuring one another.

    cows_filtered <- cows %>%
      mutate(breed = as.character(breed)) %>%
      filter(breed != "Canadian")

    # density estimate per breed
    # Note: compute_density() is an internal (unexported) ggplot2 function,
    # so it may change between ggplot2 versions
    cows_dens <- group_by(cows_filtered, breed) %>%
      do(ggplot2:::compute_density(.$butterfat, NULL)) %>%
      rename(butterfat = x)

    # position of each breed's peak, with manual offsets for the text labels
    cows_max <- filter(cows_dens, density == max(density)) %>%
      ungroup() %>%
      mutate(
        hjust = c(0, 0, 0, 0),
        vjust = c(0, 0, 0, 0),
        nudge_x = c(-0.2, -0.2, 0.1, 0.23),
        nudge_y = c(0.03, 0.03, -0.2, -0.06)
      )

    cows_p <- ggplot(cows_dens, aes(x = butterfat, y = density, color = breed, fill = breed)) +
      geom_density_line(stat = "identity") +
      geom_text(
        data = cows_max,
        aes(
          label = breed, hjust = hjust, vjust = vjust, color = breed,
          x = butterfat + nudge_x,
          y = density + nudge_y
        ),
        inherit.aes = FALSE,
        size = 12/.pt
      ) +
      scale_color_manual(
        values = darken(c("#56B4E9", "#E69F00", "#D55E00", "#009E73"), 0.3),
        breaks = c("Ayrshire", "Guernsey", "Holstein-Friesian", "Jersey"),
        guide = "none"
      ) +
      scale_fill_manual(
        values = c("#56B4E950", "#E69F0050", "#D55E0050", "#009E7350"),
        breaks = c("Ayrshire", "Guernsey", "Holstein-Friesian", "Jersey"),
        guide = "none"
      ) +
      scale_x_continuous(
        expand = c(0, 0),
        labels = scales::percent_format(accuracy = 1, scale = 1),
        name = "butterfat contents"
      ) +
      scale_y_continuous(limits = c(0, 1.99), expand = c(0, 0)) +
      coord_cartesian(clip = "off") +
      theme_dviz_hgrid() +
      theme(axis.line.x = element_blank())

    cows_p

Figure 6-2-50 Overlapping density plots of butterfat percentage in the milk of four cattle breeds
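The per-group step in the code above (estimate one density per breed, then find each curve's peak so a label can be placed next to it) can also be written with exported base R functions only. The sketch below is one possible alternative, not the method used for the figure: the data are synthetic, and the breed names "A", "B", "C" and the object names `milk` and `peaks` are hypothetical.

```r
# Synthetic butterfat percentages for three made-up breeds (illustration only)
set.seed(5)
milk <- data.frame(
  breed     = rep(c("A", "B", "C"), each = 50),
  butterfat = c(rnorm(50, mean = 3.5, sd = 0.3),
                rnorm(50, mean = 4.5, sd = 0.3),
                rnorm(50, mean = 5.5, sd = 0.3))
)

# One kernel density estimate per breed; keep only the location of its peak
groups <- split(milk$butterfat, milk$breed)
peaks <- do.call(rbind, lapply(names(groups), function(b) {
  d <- density(groups[[b]])  # default Gaussian kernel and default bandwidth
  i <- which.max(d$y)        # index of the highest point of the curve
  data.frame(breed = b, butterfat = d$x[i], density = d$y[i])
}))

print(peaks)
```

Each row of `peaks` gives a natural anchor position for a direct label, which is what makes a legend unnecessary in a plot of several overlapping densities.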