Saturday, November 3, 2018

11 websites to find free public datasets

11 websites to find free, interesting datasets


If you're new to the data space, have recently learned a new skill, or are just trying to build a more robust data science/analyst portfolio, a great way to solidify your skills is to do some mini-projects that focus on them. Below we outline a few places where you can find publicly available data for your next project.

If you're interested in practicing real data scientist and analyst interview questions, feel free to sign up for our email newsletter, where we send a few curated questions per week to help you prepare for interviews at top companies.

FiveThirtyEight is an interactive news and sports site that has some incredible data visualizations (which you should totally check out). They make a lot of their data open to the public, meaning you can download and play with the source data yourself!

Here are some examples:

  • Airline Safety — contains information on accidents from each airline
  • US Weather History — historical weather data for the US.
  • Study Drugs — data on who's taking Adderall in the US.


    BuzzFeed makes the data sets, analysis, libraries, tools, and guides used in its articles available on GitHub. Check them out to learn from some of the best!

    Here are some examples:

  • Federal Surveillance Planes — contains data on planes used for domestic surveillance.
  • Zika Virus — data about the geography of the Zika virus outbreak.
  • Firearm background checks — data on background checks of people attempting to buy firearms.


    Kaggle, recently acquired by Google, is a place where you can learn, practice, and fine-tune your data science/analytics skills. They have tons of data that’s open to the public, and they let users of the platform share code so you can learn best practices within the data space. They also host competitions where you can win real money if you have a top-ranking model!



    Socrata hosts cleaned, open data sets spanning government, business, and education.

    Here are some examples:

  • White House staff salaries — data on what each White House staffer made in 2010.
  • Radiation Analysis — data on which milk products, in which US locations, were radioactive.
  • Workplace fatalities by US state — the number of workplace deaths across the US.


    This GitHub repository hosts a library of awesome public datasets! They are all sorted by category, and each entry links you straight to the hosting website.

    Here are some examples:

  • Global Climate Data — climate information for every country in the world, with historical data that in some cases dates back to 1929.
  • Heart rate time series data — two data series, each containing 1800 evenly spaced measurements of instantaneous heart rate from a single subject.
  • Plane crash database — plane crash data dating from 1929 to now.


    Google lists all of its public data sets on a single page. Google has a cloud hosting service called Google Cloud Platform (GCP), and you can explore these data sets by querying them with a tool called BigQuery. You'll need to sign up for a GCP account, but the first 1 TB of queries you make is free. Be careful not to go over, or you’ll have to pay!

    Here are some examples:

  • US name data set — contains all names from Social Security card applications for births that occurred after 1879.
  • Major League Baseball data — pitch-by-pitch data for Major League Baseball (MLB) games in 2016.
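
To make the "don't go over" advice concrete, here is a minimal sketch of keeping a running tally of scanned bytes against the free allowance. The function names are hypothetical, and the sketch assumes the allowance is 1 TiB (2**40 bytes); check the current BigQuery pricing page for the exact figure and how often it resets. The size estimate for a query would come from BigQuery itself (for example, a dry run reports how many bytes a query would process).

```python
# Assumption: the free allowance is treated as 1 TiB (2**40 bytes).
# This is an illustrative figure, not taken from the article.
FREE_TIER_BYTES = 2**40


def remaining_free_bytes(bytes_scanned_so_far: int) -> int:
    """Return how many bytes can still be scanned before charges would apply."""
    return max(FREE_TIER_BYTES - bytes_scanned_so_far, 0)


def fits_in_free_tier(estimated_query_bytes: int, bytes_scanned_so_far: int) -> bool:
    """True if a query of the estimated size stays within the free allowance."""
    return estimated_query_bytes <= remaining_free_bytes(bytes_scanned_so_far)


if __name__ == "__main__":
    used = 900 * 2**30        # hypothetical: 900 GiB already scanned
    next_query = 200 * 2**30  # hypothetical: estimate for the next query
    print(fits_in_free_tier(next_query, used))  # prints False
```

Checking the estimate before running a query is cheap, and selecting only the columns you need is usually the easiest way to shrink the number of bytes scanned.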


    The University of California, Irvine hosts 440 data sets as a service to the machine learning community. These data sets are nice because most of them are squeaky clean and ready for modeling!

    Here are some examples:

  • Iris data set — the most famous pattern recognition dataset.
  • Wine data set — using chemical analysis to determine the origin of wine.
  • Forest fires — try to predict the burn area of forest fires using this dataset.
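
As a rough illustration of what "ready for modeling" looks like, the sketch below parses a few rows in the layout of the Iris file (no header row, four measurements, then the species label) using only the standard library. The column names and helper functions are labels chosen here for illustration, and the three inline rows stand in for the full downloaded file.

```python
import csv
import io
from collections import defaultdict

# Three rows in the iris.data layout: four measurements, then the species.
# In practice you would read the downloaded file instead of this string.
SAMPLE = """\
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Assumption: these column names are our own labels; the file has no header.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def load_iris(text):
    """Parse iris-style CSV text into a list of dicts with float measurements."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:  # skip blank lines, which can appear at the end of the file
            continue
        *measurements, species = record
        values = [float(x) for x in measurements] + [species]
        rows.append(dict(zip(COLUMNS, values)))
    return rows


def mean_by_species(rows, column):
    """Average one numeric column within each species."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["species"]].append(row[column])
    return {species: sum(vals) / len(vals) for species, vals in grouped.items()}


if __name__ == "__main__":
    rows = load_iris(SAMPLE)
    print(mean_by_species(rows, "petal_length"))
```

With the full file, the same per-species averages give a quick first look at which measurements separate the classes.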


    Data.gov allows you to download and explore data from multiple US government agencies. Data can range from government budgets to climate data. The data is very well documented, so you should have an easy time navigating the sources.
    You can browse the data sets on Data.gov directly, without registering. You can browse by topic area, or search for a specific data set.

    Here are some examples:

  • Food Environment Atlas — contains data on how local food choices affect diet in the
  • School system finances — a survey of the finances of school systems in the US.
  • Chronic disease data — data on chronic disease indicators in areas across the US.
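
Searching for a specific data set can also be scripted. The sketch below assumes the Data.gov catalog exposes a CKAN-style search endpoint at the address shown and that responses nest matches under `result.results`; treat both as assumptions and confirm them against the site's API documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the catalog is served by a CKAN-style API at this address.
API_BASE = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Build a search URL for the given free-text query and result count."""
    return f"{API_BASE}?{urlencode({'q': query, 'rows': rows})}"


def extract_titles(payload):
    """Pull data set titles out of a decoded search response."""
    return [item["title"] for item in payload["result"]["results"]]


def search_titles(query, rows=5):
    """Fetch matching data set titles (requires network access)."""
    with urlopen(build_search_url(query, rows)) as response:
        return extract_titles(json.load(response))


if __name__ == "__main__":
    print(build_search_url("chronic disease", rows=3))
```

Keeping URL construction and response parsing in separate functions makes it possible to check both without touching the network.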


    Academic Torrents is a site geared toward sharing the data sets from scientific papers. It has tons of interesting data sets. You can browse them directly on the site and download any you find interesting!

    Here are some examples:

  • Enron emails — a set of many emails from executives at Enron, a company that famously went bankrupt.
  • Student learning factors — a set of factors that measure and influence student learning.
  • News articles — contains news article attributes and a target variable.


    Quandl is a repository of economic and financial data. Some of the datasets are free, while others are available for purchase.

    Here are some examples:

  • Entrepreneurial activity by race and other factors — contains data from the Kauffman Foundation on entrepreneurs in the US.
  • Chinese macroeconomic data — indicators of Chinese economic health.
  • US Federal Reserve data — US economic indicators, from the Federal Reserve.


    Jeremy Singer-Vine collects awesome data sets across multiple sources. If you're interested in getting data sets straight to your inbox, you should consider signing up for his newsletter.






DataTau published first on DataTau
