Overview of the library ecosystem for a complete machine learning workflow



Backstory

First, the TensorFlow library was launched in 2015 as free and open-source software (FOSS). It was originally developed at Google and is still supported by Google engineers.

Second, the PyTorch library was launched in 2016 as FOSS. It was originally developed at Facebook and is still supported by Facebook engineers.

PyTorch gained widespread adoption thanks to its ease of use and its support for dynamic computation graphs, which make prototyping ML models very simple.

| TensorFlow 1.x | PyTorch |
|---|---|
| Computation graph is static (must be defined before being run) | Computation graph is dynamic (can be defined and run as you go) |
| Execute graph within a session with tf.Session | Tightly integrated with Python, like a native library |
| Debugging via tfdbg | Native debugging with standard tools like REPL, IPython, pdb, your IDE, etc. |
| Visualize using TensorBoard | Visualize using matplotlib, seaborn, etc.* |
| Deployed using special library TF Serving | Had to set up REST API with a framework like Flask |
| Parallel training on GPU was relatively hard with tf.device and tf.DeviceSpec | Parallel training on GPU was relatively easy with torch.nn.DataParallel |

*There is talk of building PyTorch support into TensorBoard.
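
To make the "defined and run as you go" idea concrete, here is a minimal sketch of a dynamic graph in PyTorch. The function name repeated_scale and the input values are made up for illustration; the point is only that an ordinary Python loop decides the shape of the graph at run time, and autograd still tracks it.

```python
import torch


def repeated_scale(x, n_steps):
    # Hypothetical helper: a plain Python loop builds the graph as it runs,
    # so the number of operations depends on a runtime argument.
    out = x
    for _ in range(n_steps):
        out = out * 2
    return out.sum()


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Three doublings scale each element by 8 before summing.
loss = repeated_scale(x, n_steps=3)

# Autograd walks back through the graph that was just recorded.
loss.backward()
gradient = x.grad
```

Because the graph is rebuilt on every call, you can step through repeated_scale with pdb or print intermediate tensors like any other Python code.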

TensorFlow was convenient for visualization and deployment, but Pytorch was better at everything else.

Therefore, TensorFlow 2.0

TensorFlow 2.0 was released in September 2019 with an improved programming model. It is not backwards compatible with TensorFlow 1.x.

Also, if you have never learned TensorFlow 1, I don't recommend starting now unless your organization uses it. That is because TensorFlow 2 is much closer to PyTorch than to TensorFlow 1. So if you're familiar with PyTorch, you will have a leg up on learning TensorFlow 2.

| TensorFlow 1.x | TensorFlow 2.x |
|---|---|
| Only static computation graphs | Both dynamic and static supported |
| Heavyweight build-then-run was overkill for simple applications | Eager execution for development, lazy execution for deployment |
| Low-level APIs with multiple high-level APIs available | Tightly integrated with Keras as high-level API |
| tf.Session for hard separation from Python | No sessions, just functions; tf.function decorator for advanced uses |

Lazy execution using a static computation graph is preferred in deployment because it performs better.
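
The "no sessions, just functions" row can be sketched as follows. This is a minimal illustration, not a benchmark: the function name scaled_sum and the inputs are invented for the example. The same Python function runs eagerly when called directly and as a traced graph when wrapped with tf.function.

```python
import tensorflow as tf


def scaled_sum(x, y):
    # Hypothetical example function: called directly, it runs eagerly,
    # one operation at a time, with no session.
    return tf.reduce_sum(x * y)


# Wrapping the function makes TensorFlow trace it into a graph
# the first time it is called with a given input signature.
scaled_sum_graph = tf.function(scaled_sum)

x = tf.constant([1.0, 2.0, 3.0])
y = tf.constant([4.0, 5.0, 6.0])

eager_result = scaled_sum(x, y)        # eager execution
graph_result = scaled_sum_graph(x, y)  # graph execution
```

Both calls return the same value; the difference is only in how the computation is executed.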

Keras in TensorFlow 2

Keras was originally designed as a high-level API capable of running on top of the TensorFlow, CNTK and Theano frameworks.

As both TensorFlow and Keras evolved, Keras became a central part of the tightly connected TensorFlow 2.0 ecosystem. It covers every part of the machine learning workflow in TensorFlow, so today the way to use TensorFlow is to use Keras.
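
As a rough illustration of what "use Keras" means in practice, here is a minimal sketch that defines, compiles, trains and queries a tiny model through tf.keras. The layer sizes, the random data and the toy target (the sum of the features) are all assumptions made up for the example.

```python
import numpy as np
import tensorflow as tf

# A small fully connected model defined with the Keras Sequential API.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Made-up regression data: the target is the sum of the four features.
rng = np.random.default_rng(0)
features = rng.normal(size=(32, 4)).astype("float32")
targets = features.sum(axis=1, keepdims=True)

# Training and inference go through the same model object.
history = model.fit(features, targets, epochs=2, batch_size=8, verbose=0)
predictions = model.predict(features[:3], verbose=0)
```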

API Cleanup in TensorFlow 2.x:

NLTK

First released in 2001, the Natural Language Toolkit (NLTK) is a free, open-source, community-driven project and a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources.

NLTK is recommended for computational linguistics, the broader field that encompasses NLP.
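
To give a feel for the toolkit, here is a minimal sketch that uses only parts of NLTK that need no corpus download: a regex-based tokenizer, the Porter stemmer, a frequency distribution and bigrams. The sample sentence is made up for the example.

```python
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize
from nltk.util import ngrams

text = "The cats are chasing the mice. The mice are running."

# Split on word and punctuation boundaries, then keep lowercase words only.
tokens = [t.lower() for t in wordpunct_tokenize(text) if t.isalpha()]

# Reduce each word to its stem with the Porter algorithm.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Count how often each token occurs and list adjacent word pairs.
freq = FreqDist(tokens)
bigrams = list(ngrams(tokens, 2))
```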

spaCy

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. It’s great for creating parse trees, and it has an excellent tokenizer.

However, it is opinionated. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.

While spaCy can be used to power conversational applications, it’s not designed specifically for chat bots, and only provides the underlying text processing capabilities.
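
The tokenizer mentioned above can be tried without downloading a trained pipeline. The following is a minimal sketch using a blank English pipeline, which contains only the tokenizer; parse trees and named entities would need a trained model, which is not shown here. The sample sentence is made up.

```python
import spacy

# A blank pipeline has the language's tokenization rules but no trained components.
nlp = spacy.blank("en")

doc = nlp("spaCy doesn't need a model download to tokenize text.")

# Each token keeps its text and its character offset in the original string.
tokens = [token.text for token in doc]
offsets = [token.idx for token in doc]
```

Note how the English rules split the contraction into separate tokens instead of splitting naively on whitespace.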

Gensim

The name Gensim is short for “generate similar”. It is a free Python library for similarity detection and topic modelling. It is designed for:

  • Scalable statistical semantics
  • Analyzing plain-text documents for semantic structure
  • Retrieving semantically similar documents

Gensim started off as a collection of various Python scripts by Radim Řehůřek for the Czech Digital Mathematics Library dml.cz in 2008, where it served to generate a short list of the most similar articles to a given article.

Radim also wanted to try these fancy “Latent Semantic Methods”, but he was not content with the amount of computing power they demanded. So he reinvented the wheel, describing clear, efficient and scalable design decisions in his LREC publication in 2010. He made algorithmic scalability of distributional semantics the topic of his PhD thesis.

Gensim is recommended for unsupervised semantic modelling from plain text.

PyText

PyText is a deep-learning-based NLP modelling framework built on PyTorch. Facebook open-sourced it in 2018.

PyText addresses the often-conflicting requirements of enabling rapid experimentation and of serving models at scale. It achieves this by providing simple and extensible interfaces and abstractions for model components, and by using PyTorch’s capabilities of exporting models for inference via the optimized Caffe2 execution engine.

fastText

fastText is a free, open-source, lightweight library from Facebook for efficient text classification and representation learning. Text classification is a core problem in many applications, such as spam detection, sentiment analysis or smart replies.

It transforms text into continuous vectors that can later be used on any language-related task. It comes with a library of pre-trained embeddings. It works on standard, generic hardware. Models can later be reduced in size to fit even on mobile devices.

It only runs on CPU, not GPU. Python is officially supported. There are a few unofficial wrappers for JavaScript, Lua and other languages available on GitHub.

scikit-learn

This is probably the most popular library for beginners and experts alike. It is open-source and commercially usable. My favourite resource from them is the flowchart for choosing the right estimator.

This project was started in 2007 as a Google Summer of Code project by David Cournapeau. Later that year, Matthieu Brucher started working on this project as part of his thesis.

In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA took leadership of the project and made the first public release on 1 February 2010. Since then, several releases have appeared following a roughly 3-month cycle, and a thriving international community has been leading the development.
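
Whichever estimator the flowchart points you to, the usage pattern is the same: fit on training data, then score or predict. Here is a minimal sketch of that pattern; the choice of the bundled iris dataset, a standard scaler and logistic regression is an assumption made for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small dataset that ships with scikit-learn, so nothing is downloaded.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain preprocessing and the estimator so both are fit on training data only.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
clf.fit(X_train, y_train)

# Mean accuracy on the held-out split.
accuracy = clf.score(X_test, y_test)
```

Swapping LogisticRegression for another classifier leaves the rest of the code unchanged, which is a large part of the library's appeal.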

fast.ai

The fast.ai library simplifies training fast and accurate neural nets using modern best practices. It’s based on research into deep learning best practices undertaken at fast.ai, including “out of the box” support for vision, text, tabular, and collab (collaborative filtering) models.

When a project is led by a team of people who each deserve a separate post, you don’t have to wait for my recommendation to make the switch.