Best Public Datasets for Machine Learning and Data Science: Sources and Advice on the Choice

Catalogs of data portals and aggregators

OpenDataSoft: a map with more than 2,600 data portals

Knoema: home to nearly 3.2 billion time series on 1,040 topics from more than 1,200 sources

  • a search panel on the homepage,
  • the World Data Atlas with datasets clustered by countries, sources, indicators, as well as other data like commodities’ value change or country groups, and
  • the Data Bulletin section with the latest releases of new datasets and updates of existing sources.
Data exploration options on Knoema

Government and official data

Data.gov: 261,073 sets of US open government data

Various filters are available on Data.gov

Eurostat: open data from the EU statistical office

Data navigation tree of Eurostat database

Scientific research datasets

Re3data: 2,000 research data repositories with flexible search

Text and visual modes for subject search on Re3data

Harvard Dataverse: 92,839 datasets by the scientific community for the scientific community

Academic Torrents: 53.52 TB of research data aggregated in one place

The Sloan Digital Sky Survey: 3D maps of the Universe

  • APOGEE-2 — the Milky Way exploration from both hemispheres,
  • eBOSS (including SPIDERS and TDSS) — the observation of galaxies and, in particular, quasars to measure the Universe, and
  • MaNGA (including MaStar) — the mapping of the inner workings of thousands of nearby galaxies.
Image exploration with the SDSS navigation tool

Verified datasets from data science communities

UCI Machine Learning Repository: one of the oldest sources with 488 datasets

GitHub: a list of awesome datasets made by the software development community

Kaggle datasets: 25,144 themed datasets on the “Facebook for data people”

Searching for datasets on Kaggle is simple

KDnuggets: a comprehensive list of data repositories on a famous data science website

Reddit: datasets and data requests on a dedicated discussion board

BuzzFeed: datasets and related content by a media company

FiveThirtyEight: datasets from data-driven pieces

Finance and economic datasets

Quandl: alternative financial and economic data

Search filters on Quandl

The International Monetary Fund and the World Bank: international economy stats

Healthcare datasets

World Health Organization: global health records from 194 countries

The Centers for Disease Control and Prevention (CDC): searching for data is easy with an online database

Medicare: data from the US health insurance program

The Healthcare Cost and Utilization Project (HCUP): another source with data on healthcare services

Travel and transportation datasets

Bureau of Transportation Statistics: the US transportation system in over 260 data tables

Looking for datasets on the Bureau of Transportation Statistics website

Federal Highway Administration: US road transportation data

Other sources

Amazon Web Services: free public datasets and paid machine learning tools

Google Public Datasets: data analysis with the BigQuery tool in the cloud

Advice on the dataset choice



AltexSoft Inc

Being a Technology & Solution Consulting company, AltexSoft co-builds technology products to help companies accelerate growth.