Choosing an Open Source Machine Learning Library: TensorFlow, Theano, Torch, scikit-learn, Caffe

From healthcare and security to marketing personalization, despite being at the early stages of development, machine learning has been changing the way we use technology to solve business challenges and everyday tasks. This potential has prompted companies to start looking at machine learning as a relevant opportunity rather than a distant, unattainable virtue.

We’ve already discussed machine learning as a service tools for your ML projects. But now let’s look at free and open source software that allows everyone to board the machine learning train without spending time and resources on infrastructure support.

Why Open Source Machine Learning?

The benefits of using open source tools don’t stop at their availability. Generally, such projects have a vast community of data engineers and data scientists eager to share datasets and pre-trained models. For instance, instead of building image recognition from scratch, you can use classification models trained on the data from ImageNet, or build your own using this dataset. Open source ML tools also let you leverage transfer learning, meaning solving machine learning problems by applying knowledge gained after working with a problem from a related or even distant domain. So, you can transfer some capacities form the model that has learned to recognize cars to the model aimed at trucks recognition.

Depending on the task you’re working with, pre-trained models and open datasets may not be as accurate as custom ones, but they will save a substantial amount of effort and time, and they don’t require you to gather datasets. According to Andrew Ng, former chief scientist at Baidu and professor at Stanford, the concept of reusing open source models and datasets will be the second biggest driver of the commercial ML success after supervised learning.

Comparing GitHub commits and contributors for different open source tools

Among many active and less popular open source tools, we’ve picked five to explore in depth to help you find the one to start you on the road to data science experimentation. Let’s begin.

TensorFlow: Profound and Favored Tool from Google

TensorFlow is a great Python tool for both deep neural networks research and complex mathematical computations, and it can even support reinforcement learning. The uniqueness of TensorFlow also lies in dataflow graphs — structures that consist of nodes (mathematical operations) and edges (numerical arrays or tensors).

https://www.altexsoft.com/wp-content/uploads/2017/08/tensors_flowing-1.gif

Dataflow graphs allow you to create a visual representation of data flow between operations and then execute calculations
Source:
TensorFlow

Datasets and models

The flexibility of TensorFlow is based on the possibility of using it both for research and recurring machine learning tasks. Thus, you can use the low level API called TensorFlow Core. It allows you to have full control over models and train them using your own dataset. But there are also public and official pre-trained models to build higher level APIs on top of TensorFlow Core. Some of the popular models you can apply are MNIST, a traditional dataset helping identify handwritten digits on an image, or Medicare Data, a dataset by Google used to predict charges for medical services among others.

Audience and learning curve

For someone exploring machine learning for the first time, TensorFlow’s variety of functions may be a bit of a struggle. Some even argue that the library doesn’t try to accelerate a machine learning curve, instead making it even steeper. TensorFlow is a low-level library that requires ample code writing and a good understanding of data science specifics to start successfully working with the product. Consequently, it may not be your first choice if your data science team is IT-centric and there: There are simpler alternatives we’ll be discussing.

Use cases

Considering its complexity, the use cases for TensorFlow mostly include solutions by large companies with access to machine learning specialists. For example, the British online supermarket Ocado applied TensorFlow to prioritize emails coming to their contact center and improve demand forecasting. Also, the global insurance company Axa used the library to predict large-loss car incidents involving their clients.

Theano: Mature Library with Extended Possibilities

Datasets and models

There are public models for Theano, but each framework used on top also has plenty of tutorials and pre-trained datasets to choose from. Keras, for instance, stores available models and detailed usage tutorials in its documentation.

Audience and learning curve

If you use Lasagne or Keras as high-level wrappers on top of Theano, again you’ll have a multitude of tutorials and pre-trained datasets at your fingertips. Moreover, Keras is considered one of the easiest libraries to start with at early stages of deep learning exploration.

Since TensorFlow was designed to replace Theano, a big part of its fanbase left. But there are still a lot of advantages that many data scientists find compelling enough keep them with an outdated version. Theano’s simplicity and maturation are serious points to consider when making this choice.

Use cases

Considered an industry standard for deep learning research and development, Theano was originally designed to implement state-of-the-art deep learning algorithms. However, considering that you probably won’t use Theano directly, its numerous uses expand as you use it as foundation for other libraries: digit and image recognition, object localization, and even chatbots.

Torch: Facebook-Backed Framework Powered by Lua Scripting Language

Datasets and models

You can find a list of popular datasets to be loaded for use in Torch on its GitHub cheatsheet page. Moreover, Facebook released an official code for Deep Residual Networks (ResNets) implementation with pre-trained models with instructions for fine-tuning your own datasets.

Audience and learning curve

Regardless of the differences and similarities, the choice will always come down to the language. The market population of experienced Lua engineers will always be smaller than that of Python. However, Lua is significantly easier to read, which is reflected in the simple syntax of Torch. The active Torch contributors swear by Lua, so it’s a framework of choice for both novices and those wishing to expand their toolset.

Use cases

Facebook used Torch to create DeepText, a tool categorizing minute-by-minute text posts shared on the site and providing a more personalized content targeting. Twitter has been able to recommend posts based on algorithmic timeline (instead of reverse chronological order) with the help of Torch.

scikit-learn: Accessible and Robust Framework from the Python Ecosystem

scikit-learn is a high level framework designed for supervised and unsupervised machine learning algorithms. Being one of the components of the Python scientific ecosystem, it’s built on top of NumPy and SciPy libraries, each responsible for lower-level data science tasks. While NumPy sits on Python and deals with numerical computing, the SciPy library covers more specific numerical routines such as optimization and interpolation. Subsequently, scikit-learn was built precisely for machine learning. The relationship between the three along with other tools in the Python ecosystem reflects different levels in the data science field: The higher you go, the more specific the problems you can solve.

The Python NumPy-based ecosystem includes tools for array-oriented computing

Datasets and models

The library already includes a few standard datasets for classification and regression despite their being too small to represent real-life situations. However, the diabetes dataset for measuring disease progression or the iris plants dataset for pattern recognition are good for illustrating how machine learning algorithms in scikit behave. Moreover, the library provides information about loading datasets from external sources, includes sample generators for tasks like multiclass classification and decomposition, and offers recommendations about popular datasets usage.

Audience and learning curve

Despite being a robust library, scikit-learn focuses on ease of use and documentation. Considering its simplicity and numerous well-described examples, it’s an accessible tool for non-experts and neophyte engineers, enabling quick application of machine learning algorithms to data. According to testimonials by software shops AWeber and Yhat, scikit is well-suited for production characterized by limited time and human resources.

Use cases

scikit-learn has been adopted by a plethora of successful brands like Spotify, Evernote, e-commerce giant Birchbox, and Booking.com, for product recommendations and customer service. However, you don’t have to be an expert to explore data science with the library. Thus, a technology and engineering school Télécom ParisTech uses the library for its machine learning courses to allow students to quickly solve interesting problems.

Caffe/Caffe2: Easy to Learn Tool with Abundance of Pre-Trained Models

Despite being criticized for slow development, Caffe’s successor Caffe2 has been eliminating the existing problems of the original technology by adding flexibility, weightlessness, and support for mobile deployment.

Datasets and models

Caffe encourages users to get familiar with datasets provided by both the industry and other users. The team fosters collaboration and links to the most popular datasets that have already been trained with Caffe. One of the biggest benefits of the framework is Model Zoo — a vast reservoir of pre-trained models created by developers and researchers, which allow you to use, or combine a model, or just learn to train a model of your own.

Audience and learning curve

The Caffe team claims that you can skip the learning part and start exploring deep learning using the existing models straightaway. The library is targeted at developers who want to experience deep learning first hand and offers resources that promise to be expanded as the community develops.

Use cases

By using the state-of-the-art Convolutional Neural Networks (CNNs) — deep neural networks successfully applied for visual imagery analysis and even powering vision in self-driving cars — Caffe allowed Facebook to develop its real-time video filtering tool for applying famous artistic styles on videos. Pinterest also used Caffe to expand a visual search function and allow users to discover specific objects on a picture.

When Demand Matches Proposition

Even in our age at the dawn of machine learning, we have such a wide range of opportunities that it’s hard to make a choice. The most important takeaway is that each of these projects has been created for a series of scenarios and your task is to find the ones that fit your approach the best.

Being a Technology & Solution Consulting company, AltexSoft co-builds technology products to help companies accelerate growth.