This guide lists and describes online sources that provide Big Data or Open Data.
Remember to look at the metadata to figure out what kinds of information you can download.
Airbnb: Provides CSV files containing detailed information on Airbnb hosts. Locations are given as longitude/latitude coordinates. Historical data are not provided.
Bikesharing: Websites providing public-use data on bikesharing, including station-to-station trip data.
OpenStreetMap. osmdata is an R package for downloading OpenStreetMap data. The site provides a couple of vignettes on using the package.
Array of Things. The City of Chicago installed modular sensor boxes around the city to collect real-time data on its environment, infrastructure, and activity for research and public use. Other cities have followed.
Zillow. Provides housing price data at the metro, city, and ZIP code levels. An R package is available for downloading Zillow data directly.
Yelp. Yelp put together a public-use dataset specifically for personal and educational purposes, and it has also been used in academic and applied research. You can use the Yelp API, and here is a tutorial, and another, but there are some restrictions: you need to get an access ID and create your own app. Here is another tutorial for a specific R package that uses the Yelp API.
Twitter. Twitter provides access to a sample of its tweets. You’ll need to register for API access. Here are some guides to collecting and managing tweets in R: here, here, and here.
Open data portals
Many city, county, and even state governments maintain open data portals. These portals provide various datasets held and maintained by the public sector. Some of the data are measured at a fine spatial scale, going down to longitude/latitude.
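Point-level files such as these, or the Airbnb CSVs described earlier, usually need a little cleaning before they can be mapped. The following is a minimal Python sketch of that step, not a definitive workflow: it reads CSV text, drops rows with missing or invalid coordinates, and keeps the points that fall inside a bounding box. The sample rows, the column names `latitude` and `longitude`, the bounding-box values, and the helper names `read_points` and `in_bbox` are all hypothetical; check the metadata of the file you actually download.

```python
import csv
import io

# Hypothetical sample rows; a real download will have different columns
# and many more records.
SAMPLE_CSV = """id,name,latitude,longitude
1,Listing A,38.5449,-121.7405
2,Listing B,38.5816,-121.4944
3,Listing C,,-121.7000
4,Listing D,38.5500,-121.7500
"""


def read_points(text, lat_col="latitude", lon_col="longitude"):
    """Parse CSV text and return the rows that have valid numeric coordinates."""
    points = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            lat = float(row[lat_col])
            lon = float(row[lon_col])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with missing or non-numeric coordinates
        # Discard coordinates outside the valid latitude/longitude ranges.
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            points.append({**row, lat_col: lat, lon_col: lon})
    return points


def in_bbox(point, south, west, north, east,
            lat_col="latitude", lon_col="longitude"):
    """Return True if the point lies inside the bounding box (inclusive)."""
    return (south <= point[lat_col] <= north
            and west <= point[lon_col] <= east)


points = read_points(SAMPLE_CSV)

# Hypothetical study area given as south/west/north/east limits in degrees.
study_area = [
    p for p in points
    if in_bbox(p, south=38.52, west=-121.80, north=38.58, east=-121.70)
]

print(len(points), "valid points;", len(study_area), "inside the bounding box")
```

To use this with a real file, replace `SAMPLE_CSV` with the contents of the download (for example, `open(path, newline="").read()`) and pass the coordinate column names listed in the file's metadata.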
There are a couple of sites that maintain open data portal directories, including
Here are links to various open data portals in US cities (updated 01/08/24)
California
Major Cities
Your city/county not listed above? Use Google. It’s your friend.
Looking for data?
Kaggle is a crowd-sourced platform for all things data science. This includes competitions, discussion forums, online tutorials, and most importantly, at least for the purpose of this guide, a repository of big data sources. A lot of these data are not pertinent to this class, but some are; specifically, those with geographic information that allows you to connect data to geographic locations. Check out their datasets here.
Esri provides a repository that many of its members use to store various big and open data all in shapefile format. Check out what’s available here.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Website created and maintained by Noli Brazil