I am pleased to announce that choroplethr version 1.1.0 is now available on CRAN! (If you are new to the choroplethr package, please see this blog post). There are three features that I would like to make users aware of.
Version 1.1.0 is the first version that is available on CRAN. This means that you can install and load it by typing the following on an R console:
install.packages("choroplethr")
library(choroplethr)
The first line will download and install the package – you only need to type this once. In each session where you want to use the package, you must run the library command.
My previous blog post demonstrated how to use the choroplethr_acs command to create choropleth maps from data from the 2011 American Community Survey (ACS). Let’s look at an example of county level total population, table B01003.
This map is very interesting in and of itself – the distribution of people across the US has many interesting patterns. But it also has limitations: many small but important counties such as New York County (Manhattan) are not visible at all. And because there is a population cluster in the Northeast, the counties there all appear the same color. Each county in this map is assigned one of 9 colors, and each color is assigned to an equal number of counties. This means that we can’t see how counties in the Northeast compare with each other.
Version 1.1 of choroplethr attempts to solve these problems by allowing users to specify a states parameter which defaults to state.abb (a vector that has abbreviations of all 50 states). Here is how the county-level population map looks when we zoom in on the Northeast:
northeast_states = c("CT", "DE", "ME", "MD", "MA", "NH", "NY", "PA", "RI", "VT")
choroplethr_acs("B01003", "county", states=northeast_states)
When choroplethr zooms in, it also recalculates the buckets. This map shows that there is a much larger variation in the Northeastern US than originally appeared on the national map. Hopefully this feature will be useful to users in their own analyses.
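To make the effect of recalculating the buckets concrete, here is a minimal sketch in base R. It is not choroplethr’s actual code: the helper bucket_values is hypothetical, it assumes that equal-count buckets are built from quantiles, and the population figures are invented.

```r
# Hypothetical helper: split values into n buckets holding roughly equal counts,
# using quantiles as break points (an assumption, not choroplethr's internals).
bucket_values <- function(values, n = 9) {
  breaks <- quantile(values, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
  cut(values, breaks = unique(breaks), include.lowest = TRUE, labels = FALSE)
}

# Invented populations: six small counties and three large ones.
pop <- c(a = 1000, b = 2000, c = 3000, d = 4000, e = 5000, f = 6000,
         x = 500000, y = 900000, z = 1600000)

# Buckets computed over all nine counties ("national" view).
national <- bucket_values(pop, n = 3)

# Buckets recomputed over the three large counties only ("zoomed in" view).
regional <- bucket_values(pop[c("x", "y", "z")], n = 3)
```

In the national view the three large counties all land in the top bucket, so they cannot be told apart. Once the buckets are recomputed on just those three counties, each one gets its own bucket.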
By default choroplethr_acs maps data from the 2011 5 year ACS. However, the Census Bureau has been conducting the American Community Survey since 2005. You can see a list of all ACS here. I was originally quite excited by this wealth of data, and was hoping to make animated gifs showing demographic changes from 2005 to 2012. However, Ezra Glenn, the author of the package choroplethr uses to get census data, told me that this is not possible.
Simply put, the data that the Census makes available via their API is only a subset of all its data; choroplethr can only access the 5-year surveys that ended in 2010 or later.
Even though the data is limited, let’s still try to compare the population of the Northeast between 2010 and 2012 (the largest date range that the Census API supports). In order to make the comparisons as valid as possible, let’s set the num_buckets parameter to 1 so that we use a continuous scale in both images. We can then compare the maps side by side:
pop_2010 = choroplethr_acs("B01003", "county", num_buckets=1, endyear=2010, span=5, states=northeast_states)
pop_2012 = choroplethr_acs("B01003", "county", num_buckets=1, endyear=2012, span=5, states=northeast_states)
library(gridExtra)
grid.arrange(pop_2010, pop_2012, nrow=1, ncol=2)
The problem with the above comparison is that the legends are not equal. Therefore comparing the changes in colors between the maps has limited value. The problem is exacerbated if you choose a value of num_buckets greater than 1 (because then you are comparing divisions between populations with different ranges). In the future I hope to create helper functions to support the comparison between choropleths. One idea is to create a new data.frame that is the difference between the data.frames that make each constituent map, and then plot that as either a percent change or an absolute difference. Another idea is to force the same scale onto both choropleths. But since the range of data that is available via the Census API is so small, I’m not sure how high a priority to make this. This leads me to my final point: asking the Census Bureau to make more data available from its API.
I think that if the Census Bureau added historical data to its API it would have tremendous value to researchers and all citizens who want to better understand the changing demographic nature of the US. To help make this happen I created a petition on whitehouse.gov. The text of the entire petition reads:
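The first of these ideas can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the helper value_change is hypothetical and not part of choroplethr, both inputs are assumed to have the region and value columns that choroplethr expects, and the numbers are invented rather than real ACS estimates.

```r
# Hypothetical helper: combine two data.frames that each have `region` and `value`
# columns into one data.frame with the absolute and percent change per region.
value_change <- function(df_start, df_end) {
  merged <- merge(df_start, df_end, by = "region", suffixes = c("_start", "_end"))
  merged$abs_change <- merged$value_end - merged$value_start
  merged$pct_change <- 100 * merged$abs_change / merged$value_start
  merged
}

# Invented values keyed by county FIPS code (not real ACS estimates).
pop_2010_df <- data.frame(region = c("36061", "25025", "42101"),
                          value = c(1000, 2000, 4000),
                          stringsAsFactors = FALSE)
pop_2012_df <- data.frame(region = c("25025", "36061", "42101"),
                          value = c(2100, 1100, 3800),
                          stringsAsFactors = FALSE)

change <- value_change(pop_2010_df, pop_2012_df)
```

Because merge matches rows on region, the two inputs do not need to be in the same order. Either abs_change or pct_change could then be copied into a value column and mapped as a single choropleth with a single legend.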
WE PETITION THE OBAMA ADMINISTRATION TO:
make all Census data available via the Census API.
The US Census Department provides the most comprehensive demographic information about America. Unfortunately, most of this data is difficult for researchers to access.
As an example, here is a list of all American Community Surveys (ACS): http://1.usa.gov/1geFSSj. However, only surveys taken since 2010 are accessible via the Developer API (http://1.usa.gov/1nYk8OU).
This makes it extremely difficult to perform historical demographics analysis. As an example of research that this API enables, please see my choroplethr R package which creates thematic maps (choropleths) of Census data: http://bit.ly/1eZzNWP.
I would like to empower people to view demographic maps of any Census data from any year. That is only possible if the data is made available via the API.
If this is something which you would also like to see, please consider signing the petition here.
If you would like to give feedback on version 1.1.0 of choroplethr, report bugs, request features, or share your own interesting choropleths, please consider posting on the choroplethr Google group.
Recently I was pleased to write a blog post introducing the choroplethr package for R. One of the goals of choroplethr is to easily display data from the Census’ American Community Survey (ACS). To accomplish this, the choroplethr_acs function works with R’s acs package to map results from the Census API. To demonstrate this, I displayed several images including this map showing 2011 per-capita income on a ZIP code basis. (Technically, the Census uses ZIP Code Tabulation Areas, or ZCTAs, and not postal ZIP codes. See this page for details about ZCTAs).
Twitter user @BrashEQLibrium made an interesting comment on this:
— Brash Equilibrium (@BrashEQLibrium) January 22, 2014
If you click through you will see a funny comic that says that many geographic profile maps wind up just being population maps. It wasn’t immediately clear to me whether ZCTA population and income would be correlated, so I decided to investigate.
As a first pass, I simply used the choroplethr_acs function to place maps of population and income side by side:
library(choroplethr)
library(acs)
library(gridExtra) # provides grid.arrange

# create two maps, side by side, of the tables in question
# 2011 ACS table ids can be found here: http://factfinder2.census.gov/faces/help/jsf/pages/metadata.xhtml?lang=en&type=dataset&id=dataset.en.ACS_11_5YR#
incomeTableId = "B19301"
populationTableId = "B01003"
map_income = choroplethr_acs(tableId = incomeTableId, lod = "zip")
map_population = choroplethr_acs(tableId = populationTableId, lod = "zip")
grid.arrange(map_income, map_population, nrow = 1, ncol = 2)
These maps appear to have significant differences. For example, the northern central part of the country appears much darker in the income map than in the population map.
My colleague Chris Vensko recommended creating an animated GIF of these two images on an infinite loop. That way the differences between the images would have a stronger contrast. Here is the resulting image with a five second delay.
The traditional way to explore the relationship between two variables is a scatterplot with a smoothed conditional mean.
library(ggplot2) # ggplot, geom_jitter, geom_smooth
library(scales)  # comma label formatter

# data from 2011 ACS for ZCTAs for income
income.data = acs.fetch(geography=choroplethr:::make_geo("zip"), table.number = incomeTableId, col.names = "pretty")
income.df = choroplethr:::make_df("zip", income.data, 1)
colnames(income.df) = "Income"

# data from 2011 ACS for ZCTAs for population
population.data = acs.fetch(geography=choroplethr:::make_geo("zip"), table.number = populationTableId, col.names = "pretty")
population.df = choroplethr:::make_df("zip", population.data, 1)
colnames(population.df) = "Population"

final_df = merge(income.df, population.df)

# scatterplot with a smoothed conditional mean
ggplot(final_df, aes(Population, Income)) +
  geom_jitter(alpha=1/5) +
  geom_smooth() +
  scale_x_continuous(labels=comma) +
  scale_y_continuous(labels=comma) +
  ggtitle("Relationship between ZIP Code Population and Income\nData from 2011 Census ACS ZCTAs")
Again, there doesn’t seem to be a strong relationship here between ZIP (ZCTA) population and income. The smoothed mean income does, indeed, seem to go up a bit at low populations. But then it goes down.
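One way to put a number on an impression like this is a rank correlation. The sketch below is an illustration only: the data.frame toy is hypothetical, it borrows the Population and Income column names used above, and its values are invented to mimic a curve that rises and then falls. It is not the ACS data.

```r
# Invented values: income rises with population at first, then falls.
toy <- data.frame(Population = c(100, 500, 1000, 5000, 20000, 50000),
                  Income     = c(18000, 24000, 27000, 26000, 23000, 21000))

# Spearman's rank correlation measures monotonic association, so a relationship
# that goes up and then comes back down yields a coefficient close to zero.
rho <- cor(toy$Population, toy$Income, method = "spearman")
```

A coefficient near zero does not mean the two variables are unrelated. It only says that income does not consistently rise or fall as population grows, which is why the scatterplot remains the more informative view.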
The data seem to not bear out the relationship that some people expected between ZIP/ZCTA population and income. I thought about this for a while and have a possible explanation: people might be confusing ZCTA population with whether or not a given ZIP is rural or urban.
I am not an expert in demography, but I seem to remember there being many studies of the demographic differences between rural and urban America. For example, this article claims that there is a large difference in per-capita income between rural and urban counties. Some other references on the demographic differences between rural and urban America are here and here.
But the key point, I believe, is this: ZCTA population counts need not correlate with whether or not a ZIP is in a rural or urban county. ZIP codes can be of varying size, so the same population count in two different ZIP codes can mean two different population densities. ZIP codes are created and maintained by the postal service for the sole purpose of facilitating mail delivery. Here is a quote from my previous article which attempts to explain some of the issues related to using ZCTAs for data analysis:
The highest level of detail that choroplethr supports is the zip code. From both a mapping and demographic standpoint zip codes are problematic. On the one hand, zip codes are useful because they are smaller than counties (so you can get a higher level of detail) and everyone knows which zip code they live in (so they are an intuitive unit for people). On the other hand zip codes are managed by the postal service for the sole purpose of delivering mail. This means that they can change without notice and are not always polygons. For an in depth discussion of these problems see this article from georeference.org; for an overview of zip codes in general see this article on Wikipedia.
Despite these problems the US Census Bureau attempts to capture demographics at the zip level. They have created ZCTAs (ZIP Code Tabulation Areas) which roughly correspond to zip codes. You can learn more about ZCTAs here. Because of these issues choroplethr renders zip code choropleths as scatterplots. It uses the zipcode package, created by Jeffrey Breen, to map each zip code to a longitude and latitude point.
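The distinction between population count and population density made before the quote can be illustrated with a little arithmetic. The two ZCTAs below are hypothetical and their numbers are invented:

```r
# Two hypothetical ZCTAs with the same population but very different land areas.
zctas <- data.frame(zcta          = c("urban", "rural"),
                    population    = c(20000, 20000),
                    area_sq_miles = c(2, 400),
                    stringsAsFactors = FALSE)

# People per square mile: identical counts, very different densities.
zctas$density <- zctas$population / zctas$area_sq_miles
```

Both areas would get the same color on a population map, even though one has 10,000 people per square mile and the other has 50.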
At Trulia we deal with a lot of spatial information: housing markets vary dramatically from one part of the country to another, as do the demographics of each region. Being able to visualize these regional differences helps us to understand them. Choropleth maps are a useful way to visualize this kind of information. In a choropleth, regions are colored based on some metric, such as which presidential candidate a state voted for. I recently created a package in R to facilitate creating choropleths called choroplethr. choroplethr also makes it easy to visualize data from the US Census. You can install it from an R console like this:
# install.packages("devtools")
library(devtools)
# current versions of devtools expect the repository as "username/repo"
install_github("trulia/choroplethr")
library(choroplethr)
The American Community Survey (ACS)
choroplethr was initially created to visualize information from the American Community Survey (ACS). The ACS is an ongoing statistical survey run by the US Census Bureau. Most people are familiar with the decennial census, which asks a handful of questions of all Americans every 10 years. The ACS, by contrast, asks a large number of questions of a sampling of the population every year. You can learn more about the ACS here. An important point to note is that, because the ACS only samples the population, all of the reported numbers are estimates. The results from the ACS are summarized and available as tables. You can see a list of the 2011 5-year ACS tables here. Throughout this blog post we will be using table B19301 which contains information about per capita income in the last 12 months. choroplethr uses the R package acs to get ACS data. The acs package was developed by Ezra Glenn, a lecturer in the Department of Urban Studies and Planning at MIT.
To view a choropleth of an ACS table you simply need to call the function choroplethr_acs and pass it a table number from the 2011 5-year ACS and a level of detail (LOD). Valid LODs are “state”, “county” and “zip”. For example, to see a choropleth of state-level per-capita income type:
choroplethr_acs(tableId="B19301", lod="state")
By default choroplethr divides the lower 48 states into 9 equally sized buckets and colors the buckets using a sequential brewer scale, where darker colors indicate a larger value. Many patterns become immediately apparent when the data is displayed this way. For example, there are clusters of wealth in the northeast and west coasts, as well as the north central part of the country. Additionally, there is a cluster of lower-income states in the southeast. From the legend we can see that the difference between the richest and poorest states is approximately $17,000, and each bucket covers approximately $2,000. choroplethr renders maps with the ggplot2 library.
Things change dramatically when we look at the same dataset at the county level of detail:
choroplethr_acs(tableId="B19301", lod="county")
Many people are not familiar with county-level maps of the continental US and are surprised by both the number of counties (3,076) and their relative size (counties on the west coast tend to be larger than counties on the east coast). As before, choroplethr divides each region into 9 equally sized buckets. This map allows us to look within a state, and see that some states have both extremes of wealth, while some are more consistent. It is instructive to compare and contrast these two maps.
It is worth studying the legend as well. Now the scale has a range of $53,000; moving from the state LOD to the county LOD increased the range of our scale by over 3x. This is a trend that occurs frequently in choropleths: as the level of detail becomes higher the range of the scale increases as well. Another trend is that the buckets at the extremes cover an increasingly large amount. The highest bucket now covers a range of approximately $32,000, which is larger than the entire range covered in the state choropleth. The lowest bucket now covers approximately $9,000.
The highest level of detail that choroplethr supports is the zip code. From both a mapping and demographic standpoint zip codes are problematic. On the one hand, zip codes are useful because they are smaller than counties (so you can get a higher level of detail) and everyone knows which zip code they live in (so they are an intuitive unit for people). On the other hand zip codes are managed by the postal service for the sole purpose of delivering mail. This means that they can change without notice and are not always polygons. For an in depth discussion of these problems see this article from georeference.org; for an overview of zip codes in general see this article on Wikipedia.
Despite these problems the US Census Bureau attempts to capture demographics at the zip level. They have created ZCTAs (ZIP Code Tabulation Areas) which roughly correspond to zip codes. You can learn more about ZCTAs here. Because of these issues choroplethr renders zip code choropleths as scatterplots. It uses the zipcode package, created by Jeffrey Breen, to map each zip code to a longitude and latitude point. To render an estimate of per capita income at the zip code level type this:
choroplethr_acs(tableId="B19301", lod="zip")
The acs package returns 32,481 ZCTAs for this query, so overplotting is a serious issue. That being said, it is still an informative map. For example, many people are surprised by the low number of zip codes in the western part of the US. Also, the color distribution between the county and zipcode maps is roughly analogous. Additionally, the range of the scale has increased dramatically to $375,900. The highest bucket alone accounts for $339,000 of that range.
At the zip LOD outliers and sampling error become a serious issue. For example, it is unlikely that the annual per-capita income in zip 54307 is truly $137. The acs package was developed to make it easy to access not only estimates, but also the statistical uncertainty measurements that accompany these estimates. You can learn more about these features of the acs package here. As an aside, you can learn more about zip code 54307 by simply typing zip 54307 into Google.
By default choroplethr creates a scale by dividing each region into 9 equally sized buckets. This is an example of a discrete scale. For discrete scales you can choose between 2 and 9 equally sized buckets, and each bucket size provides you with different information. For example, using two buckets will show you which regions are above and below the median. Here is how to show which counties have above and below the median income:
choroplethr_acs(tableId="B19301", lod="county", num_buckets=2)
Setting num_buckets to 1 will force a continuous scale:
choroplethr_acs(tableId="B19301", lod="county", num_buckets=1)
What’s notable about this map is that most of the regions appear to be the same color. To understand why, it is useful to view the values as a boxplot:
Most counties have a per capita income in the range of $20,000-$25,000. But there are outliers both over $60,000 and below $6,000. Because a single color range must contain all values, most values are mapped to a similar color.
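The compression can be reproduced with a few invented numbers. The sketch below assumes that a continuous scale maps values to colors linearly between the minimum and the maximum, which may not match exactly what ggplot2 does internally. The incomes are made up to resemble the shape of the boxplot.

```r
# Invented incomes: most values sit between $20,000 and $25,000,
# with one outlier on each side.
income <- c(5000, 20000, 21000, 22000, 23000, 24000, 25000, 65000)

# The same outlier rule a boxplot uses (1.5 times the interquartile range).
outliers <- boxplot.stats(income)$out

# Linearly rescale every value to [0, 1], as a continuous color scale would.
scaled <- (income - min(income)) / (max(income) - min(income))

# Fraction of the color range shared by the six "typical" regions.
typical <- scaled[income >= 20000 & income <= 25000]
range_used <- max(typical) - min(typical)
```

Here the six typical regions share roughly 8% of the color range, while the two outliers stretch the scale to cover the rest.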
All of our examples so far have used the choroplethr_acs function to create choropleths of ACS data. But we can create similar maps of arbitrary data with the choroplethr function. All of the parameters are the same except for the first: instead of a tableId, we pass in a data.frame with one column named region and one column named value. For state level choropleths region can be any common naming of a state (e.g. “California”, “california” or “CA”):
df = data.frame(region=state.abb, value=sample(100, 50))
choroplethr(df, lod="state")
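If you want to standardize state names yourself before building the data.frame, base R’s built-in state.name and state.abb vectors are enough. The helper below, to_state_abb, is a hypothetical illustration and not a function from choroplethr:

```r
# Hypothetical helper: map "California", "california" or "CA" to "CA".
# Returns NA for anything that is not one of the 50 state names or abbreviations.
to_state_abb <- function(region) {
  region  <- trimws(region)
  by_name <- state.abb[match(tolower(region), tolower(state.name))]
  by_abb  <- state.abb[match(toupper(region), state.abb)]
  ifelse(is.na(by_name), by_abb, by_name)
}

normalized <- to_state_abb(c("California", "california", "CA", "New York"))
```

Note that state.name and state.abb cover only the 50 states, so an entry such as the District of Columbia would come back as NA with this sketch.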
For county level choropleths region must be a 4 or 5 digit county FIPS code:
data(county.fips, package="maps")
df = data.frame(region=county.fips$fips, value=sample(100, nrow(county.fips), replace=TRUE))
choroplethr(df, lod="county", num_buckets=2)
For zip level choropleths, region must be a 5 digit zip code:
data(zipcode, package="zipcode", envir=environment())
df = data.frame(region=zipcode$zip, value=sample(100, nrow(zipcode), replace=TRUE))
choroplethr(df, lod="zip", num_buckets=1)
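County FIPS codes are five digits long, but when they are stored as numbers the leading zero of states such as Alabama (01) is dropped, which is where 4 digit codes come from. If you need the canonical 5 character form, for example to join against another dataset, a sketch like the following works. The helper pad_fips is hypothetical and not part of choroplethr.

```r
# Hypothetical helper: turn numeric county FIPS codes into 5-character strings.
pad_fips <- function(fips) {
  sprintf("%05d", as.integer(fips))
}

# Autauga County AL, San Francisco County CA, New York County NY
fips   <- c(1001, 6075, 36061)
padded <- pad_fips(fips)

# The first two characters are the state code, the last three the county code.
state_part  <- substr(padded, 1, 2)
county_part <- substr(padded, 3, 5)
```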
I hope that you found this tour of choroplethr version 1.0.0 useful. In summary, choroplethr seeks to provide a simple interface to create choropleths in R at 3 levels of detail (state, county and zip) and 2 scale types (discrete and continuous). It attempts to work seamlessly with the acs package to create choropleths of US Census data.
Version 1.1.0, which is already under development, will support rendering choropleths for a subset of states, as well as mapping data from any ACS, not just the 2011 5-year survey.
If you have any technical support issues, feature requests or want to share your results, please post at the choroplethr Google group.
Hi, my name is Peter Black and I’m the lead geospatial engineer at Trulia. We’ve been making some interesting maps here at Trulia, displaying crime heatmaps, a commute tool that selects homes within a travel time polygon, and home value estimates down to the parcel level. Today, I’m writing to tell you the whys and hows of our most recent series on natural hazards.
When Hurricane (ok ok, it was an extratropical storm) Sandy slammed into the New Jersey shoreline on October 29th, I watched with horror and tried to stay in contact with my loved ones and friends in harm’s way. Seeing the awful damage that resulted cemented my feeling that I had to incorporate maps of natural hazards, which I knew were readily available from various federal sources, into the Trulia experience. Doing so would open up a new avenue for millions of people to better understand the natural world and the risks they face when they’re making the decision on where to buy a home.
There are many types of natural hazards of course, and we couldn’t possibly put all of them on Trulia. So we chose the five hazards that have caused the most damage in the past few decades. These are: hurricanes, floods, earthquakes, tornadoes, and wildfires. Fortunately there is excellent data available for each hazard, mostly from federal sources. In compiling the data, I noticed some interesting things. For example, why was the Charleston, South Carolina area at risk for earthquakes? As it turns out, there was a magnitude 7.3 earthquake that shook the area in 1886.
There have been tornadoes in my neck of the woods in northern California. Southern New Jersey (the pine barrens) is at risk for a forest fire.
These revelations only made us work harder to create a new audience for this insightful information. We noticed that there wasn’t any really good mashup for all of the historical information around hurricanes and tornadoes. So for each, we took the historical track data along with their attributes, and assigned them to an underlying nested hexagon grid. Once that was accomplished, we classified the data and created a really cool visualization of historical hotspots for each hazard. I stress the word historical intentionally since we have no idea where the next hurricane or tornado will hit. Our intention is solely to show where the storms have hit in the past 60 years or so, when this meteorological data became more reliable and sophisticated due to the advent of technologies like radar (in the 40s) and satellite imagery (in the 60s).
I hope you enjoy the maps. They are pretty informative and provide an interesting tool for homebuyers that can help people make more informed choices. I’d like to thank my excellent team of engineers whose talent and professionalism are truli-amazing, the awesome PR crew we have, as well as the senior management team at Trulia who supported this idea from its inception.
Trulia is at it again, hosting unique and informational events right here in our San Francisco office for employees and the local community alike. We take tech (and socializing) pretty seriously around here by staying up on the latest gadgets, understanding the current market and interacting with fellow data-obsessed techies. Over the last several weeks we have felt honored to host some “trulia-mazing” speakers and to welcome visitors from far and wide. During the UX in Space! vol.2 event Eric Bell, Jess Zak, and Ulrika Andersson joined us to discuss how their designs have solved interesting challenges in varying environments. The Storylines Meetup Group has brought in both Wendy Yu and Trulia’s very own Heather Fernandez to share their personal stories of where they are today and the path they took to get there. Being the data geeks that we are, we were excited to host the Urban Data Challenge Showcase, put on by Young Professionals in Transportation, where participants demoed their submissions and spoke about the challenges, the process, and the findings that came out of their projects. SF Data Mining stopped by our offices to host a Crowdsourcing Meetup, where Edwin Chen spoke about human-powered machine learning, covering use cases, methods for quality control and running your first task.
Take a look at what’s coming up next at our office and grab a spot while you still can. We look forward to seeing you at Trulia shortly!
Trulia HQ has been the host to many tech events here in the Bay Area, welcoming an array of organizations into our office space. Data visualization, tech leadership, UXD, and data mining groups have all spent an evening or two with us. Our unique rooftop event space (aka the “Trulia Penthouse”) is the perfect backdrop, offering beautiful views of the city. Trulia is proud to support the tech community in San Francisco and foster a great environment for idea exchange and learning. Our employees are encouraged to attend these evening events and mingle with fellow SF Techies. One of our last featured events was Pamela Fox’s Story, where Pamela joined us to discuss her experience as a Front End Engineer and her involvement in Girl Develop It. We also hosted speaker Scott Murray, author of Interactive Data Visualization for the Web, and Chris Viau, the force behind @d3visualization, during the Bay Area d3 User Group event. Trulia is closely linked to many groups in the Meetup.com world and we always seem to have an event just around the corner. Check it out…
Upcoming Events @ Trulia:
When we first implemented map markers on our Android apps we were just using an image to highlight each location. Over time we wanted to add the specific price as well. As I started working on an implementation of that feature I was hard pressed to find any articles or examples of what we wanted to accomplish. So I had to start from scratch and figure out a method by myself. The initial implementation is currently in use on our Trulia Rentals Android app. However, there are some deficiencies with that version and some of the code is sub-optimal. In this post I will detail the new method I am working on (which will be ported into all of our Android apps).
With the last major release of our Android app we added in Street View capability on property detail pages. The response we got from users was extremely positive but we couldn’t help but notice a few issues that really degraded a user’s experience.
First, we assumed (incorrectly) that the built-in Street View app would be available on all Android phones. And how did that work out? The stack traces in the Market crash reports told the whole story: some people didn’t have the app installed.
At Trulia we have a set of values we work by with the acronym “IMPACT”. The “C” stands for Customer-obsessed. As Trulia grows we continue to ramp up our dedication to that particular value. We’ve found that the best way to successfully evolve our offerings is to always keep listening to the folks who use our site and mobile products. This is not a novel idea to be sure, but there is a lot more to it than meets the eye. Here are some ways Trulia stays in tune with our site visitors and customers.
July 2011 marked the anniversary of Trulia’s Innovation Days. We are extremely excited about our innovation program and wanted to take this opportunity to share some details and insights into what these Innovation Days entail.
It all started with the recognition that innovation is very important when creating a winning company, and that it can really blossom when actively nurtured and developed. We truly believe innovation is free spirited and should not be bound by strict rules! The program evolved following a simple principle: “Pave sidewalks where people walk”. This statement helps the program to retain its grassroots, unbound and organic characteristics and leverages the power of our teams without over-management.