Best Public Datasets for Machine Learning and Data Science: Sources and Advice on Choosing One

Catalogs of data portals and aggregators

OpenDataSoft: a map with more than 2,600 data portals

Knoema: home to nearly 3.2 billion time series on 1,040 topics from more than 1,200 sources

  • a search panel on the homepage,
  • the World Data Atlas with datasets clustered by countries, sources, and indicators, as well as other data like commodities’ value changes or country groups, and
  • the Data Bulletin section with the latest releases of new datasets and updates of existing sources.
Data exploration options on Knoema

Government and official data

Data.gov: 261,073 US open government datasets

Various filters are available on data.gov

Eurostat: open data from the EU statistical office

Data navigation tree of Eurostat database

Scientific research datasets

Re3data: 2,000 research data repositories with flexible search

Text and visual modes for subject search on Re3data

Harvard Dataverse: 92,839 datasets by the scientific community for the scientific community

Academic Torrents: 53.52 TB of research data aggregated in one place

The Sloan Digital Sky Survey: 3D maps of the Universe

  • APOGEE-2 — the Milky Way exploration from both hemispheres,
  • eBOSS (including SPIDERS and TDSS) — the observation of galaxies and, in particular, quasars to measure the Universe, and
  • MaNGA (including MaStar) — the mapping of the inner workings of thousands of nearby galaxies.
Image exploration with the SDSS navigation tool

Verified datasets from data science communities

UCI Machine Learning Repository: one of the oldest sources with 488 datasets

data.world: open data community

GitHub: a list of awesome datasets made by the software development community

Kaggle datasets: 25,144 themed datasets on the “Facebook for data people”

Searching for datasets on Kaggle is simple

KDnuggets: a comprehensive list of data repositories on a famous data science website

Reddit: datasets and data requests on a dedicated discussion board

BuzzFeed: datasets and related content by a media company

FiveThirtyEight: datasets from data-driven pieces

Finance and economic datasets

Quandl: alternative financial and economic data

Search filters on Quandl

The International Monetary Fund and the World Bank: international economy stats

Healthcare datasets

World Health Organization: global health records from 194 countries

The Centers for Disease Control and Prevention (CDC): easy data search with an online database

Medicare: data from the US health insurance program

The Healthcare Cost and Utilization Project (HCUP): another source with data on healthcare services

Travel and transportation datasets

Bureau of Transportation Statistics: the US transportation system in over 260 data tables

Looking for datasets on the Bureau of Transportation Statistics website

Federal Highway Administration: US road transportation data

Other sources

Amazon Web Services: free public datasets and paid machine learning tools

Google Public datasets: data analysis with the BigQuery tool in the cloud

Advice on choosing a dataset

AltexSoft Inc

Being a Technology & Solution Consulting company, AltexSoft co-builds technology products to help companies accelerate growth.