Optimizations
carefree-learn not only provides carefree APIs for easier usage, but also implements quite a few optimizations that make training on tabular datasets faster than in other, similar libraries. On this page we'll introduce some of the techniques carefree-learn adopts under the hood, and show how much of a performance boost we've obtained with them.
Categorical Encodings
Encoding categorical features is one of the most important pre-processing steps we need to perform on tabular datasets. As mentioned in Design Principles, carefree-learn sticks to the one_hot encoding and the embedding encoding, which will be introduced in turn in the following sections.
One Hot Encoding
A one_hot encoding basically encodes categorical features as a one-hot numeric array, as defined in sklearn. Suppose we have 5 classes in total; then each value is mapped to a length-5 vector whose only non-zero entry marks the class it belongs to.
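A minimal sketch of this with PyTorch's one_hot utility (the encoder inside carefree-learn may be implemented differently):

```python
import torch
import torch.nn.functional as F

# a categorical column with 5 classes, i.e. values in {0, 1, 2, 3, 4}
x = torch.tensor([0, 3, 1])

# every value becomes a length-5 vector with a single 1
print(F.one_hot(x, num_classes=5).float())
# tensor([[1., 0., 0., 0., 0.],
#         [0., 0., 0., 1., 0.],
#         [0., 1., 0., 0., 0.]])
```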
Notice that this kind of encoding is static, which means it will not change during the training process. In this case, we can cache all the encodings beforehand and access them through indexing, which speeds up the encoding process by roughly 66x.
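The trick itself is simple: the one_hot encoding of class i is just the i-th row of an identity matrix, so we can build that matrix once and replace every later encoding call with an indexing operation. Here is a minimal sketch of the idea (with illustrative helper names, not carefree-learn's actual code):

```python
import torch
import torch.nn.functional as F

num_classes = 5
x = torch.randint(0, num_classes, (100000,))

# on-the-fly version: the one_hot encoding is recomputed at every call
def encode(x: torch.Tensor) -> torch.Tensor:
    return F.one_hot(x, num_classes).float()

# cached version: build the identity matrix once, then simply index into it
cache = torch.eye(num_classes)

def encode_cached(x: torch.Tensor) -> torch.Tensor:
    return cache[x]

assert torch.equal(encode(x), encode_cached(x))
```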
The gap decreases a little when we increase num_classes to a larger number, say 100, but there is still a ~33x boost.
caution
Although caching can boost performance, it comes at the cost of consuming much more memory. A better solution would be to cache sparse tensors instead of dense ones, but PyTorch does not yet support sparsity well enough. See the Sparsity section for more details.
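To give a rough idea of the trade-off, the comparison below uses scipy purely for illustration (the dense cache described above is a PyTorch tensor): a dense cache stores num_classes × num_classes entries, while a sparse one only stores the num_classes non-zero entries plus some index bookkeeping.

```python
import numpy as np
from scipy import sparse

num_classes = 100

# dense cache: num_classes x num_classes float32 entries
dense_cache = np.eye(num_classes, dtype=np.float32)

# sparse cache: only the num_classes non-zero entries are stored
sparse_cache = sparse.identity(num_classes, dtype=np.float32, format="csr")

print(dense_cache.nbytes)        # 40000 bytes
print(sparse_cache.data.nbytes)  # 400 bytes (plus small index arrays)
```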
Embedding
An embedding encoding borrows from Natural Language Processing (NLP), where (sparse) input words are converted into dense embeddings via embedding lookup. It is straightforward to turn categorical features into embeddings with the same lookup technique, but tabular datasets have a different property than NLP corpora: a tabular dataset maintains many embedding tables, because each categorical feature has its own number of values, whereas in NLP only one embedding table needs to be maintained in most cases.
Since embedding is a dynamic encoding that contains trainable parameters, we cannot cache the results beforehand like we did for one_hot. However, we can still optimize it with fast embedding. A fast embedding unifies the embedding dimension across the different categorical features, so a single, unified embedding table is sufficient for the whole embedding process.
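A minimal sketch contrasting the two strategies (module names are illustrative, not carefree-learn's actual implementation):

```python
import torch.nn as nn

num_classes = [2, 3]   # two categorical features with 2 and 3 classes

# vanilla embedding: one table (and one lookup) per feature,
# each possibly with its own embedding dimension
separate = nn.ModuleList([nn.Embedding(2, 4), nn.Embedding(3, 8)])

# fast embedding: unify the embedding dimension so that a single
# (2 + 3)-row table can serve every categorical feature at once
unified_dim = 4
fast = nn.Embedding(sum(num_classes), unified_dim)
```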
There's one more thing we need to take care of when applying fast embedding: we need to increment the values of each categorical feature. Here's a minimal example to illustrate this. Suppose we have two categorical features ($x_1$ and $x_2$) with 2 and 3 classes respectively; then our embedding table will contain 5 rows:

| rows    | belongs to |
|---------|------------|
| 0, 1    | $x_1$      |
| 2, 3, 4 | $x_2$      |

In this table, the first two rows belong to $x_1$, while the last three rows belong to $x_2$. However, as we defined above, $x_1 \in \{0, 1\}$ and $x_2 \in \{0, 1, 2\}$. In order to assign rows 2, 3 and 4 to $x_2$, we need to increment $x_2$ by 2 (which is the number of choices $x_1$ could have). After the increment, we have $x_2 \in \{2, 3, 4\}$, so it can successfully look up its own rows.
Note that the incremented indices are static, so carefree-learn will cache these indices to avoid duplicate calculations when fast embedding is applied.
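Putting the pieces together, a sketch of the (cached) increment followed by a single lookup might look like this, with illustrative variable names:

```python
import torch
import torch.nn as nn

num_classes = [2, 3]                     # classes of x1 and x2

# offsets: each feature is shifted by the total number of classes of
# the features before it -> tensor([0, 2]); this is static, so it only
# needs to be computed (and cached) once
offsets = torch.tensor([0] + num_classes[:-1]).cumsum(0)

# raw values: x1 in {0, 1}, x2 in {0, 1, 2}
x = torch.tensor([[0, 0],
                  [1, 2]])

shifted = x + offsets                    # x2's values now live in {2, 3, 4}

table = nn.Embedding(sum(num_classes), 4)   # 5 rows, unified dim 4
embeddings = table(shifted)                 # shape: (2, 2, 4)
```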
Since the embedding dimensions are unified, fast embedding does reduce flexibility a little, but it speeds up the encoding process by roughly 17.5x.
note
Theoretically, the embedding encoding is nothing more than a one_hot encoding followed by a linear projection, so it should be fast enough if we apply sparse matrix multiplications between the one_hot encodings and a block diagonal embedding lookup table. However, as mentioned in the One Hot Encoding section, PyTorch does not yet support sparsity well enough. See the Sparsity section for more details.
Sparsity
It is easy to see that the one_hot encoding outputs a sparse matrix whose sparsity equals:

$$\text{sparsity} = \frac{\text{num\_classes} - 1}{\text{num\_classes}} = 1 - \frac{1}{\text{num\_classes}}$$
So the sparsity easily exceeds 90% as soon as num_classes is greater than 10, and it is therefore quite natural to think of leveraging sparse data structures to cache these one_hot encodings. What's better, the embedding encoding can be represented as sparse matrix multiplications between the one_hot encodings and a block diagonal embedding lookup table, so THEORETICALLY (🤣) we could reuse the one_hot encodings to obtain the embedding encodings efficiently.
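The equivalence itself is easy to verify with dense tensors; the sketch below (with illustrative shapes) shows that multiplying the one_hot encodings by a block diagonal lookup table reproduces the embedding lookups. With proper sparse support, the one_hot matrix (and hence the matmul) could be stored and executed in sparse form, which is exactly the reuse described above.

```python
import torch
import torch.nn.functional as F

# embedding tables for two features with 2 and 3 classes (dim 4 each)
t1, t2 = torch.randn(2, 4), torch.randn(3, 4)

# block diagonal lookup table of shape (2 + 3, 4 + 4)
block = torch.block_diag(t1, t2)

x = torch.tensor([[1, 2]])   # one sample: x1 = 1, x2 = 2

# concatenated one_hot encodings -> shape (1, 5)
one_hot = torch.cat(
    [F.one_hot(x[:, 0], 2), F.one_hot(x[:, 1], 3)], dim=1
).float()

# (dense) matmul against the block diagonal table ...
via_matmul = one_hot @ block
# ... gives exactly the same result as the embedding lookups
via_lookup = torch.cat([t1[x[:, 0]], t2[x[:, 1]]], dim=1)
assert torch.allclose(via_matmul, via_lookup)
```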
Unfortunately, although scipy supports sparse matrices pretty well, PyTorch does not yet support them well enough. So we'll stick to the dense solutions mentioned above, but will switch to the sparse ones as soon as PyTorch releases some fancy sparsity support!