Optimizations
carefree-learn not only provides carefree APIs for easier usage, but also applies quite a few optimizations to make training on tabular datasets faster than other similar libraries. In this page we'll introduce some of the techniques carefree-learn adopts under the hood, and show how much of a performance boost we've obtained from them.
Categorical Encodings
Encoding categorical features is one of the most important pre-processing steps we need to perform on tabular datasets. As mentioned in Design Principles, carefree-learn sticks to one_hot encoding and embedding encoding, which will be introduced in turn in the following sections.
One Hot Encoding
A one_hot encoding basically encodes categorical features as a one-hot numeric array, as defined in sklearn. Suppose we have 5 classes in total, then each value is encoded as a length-5 vector with a single 1 at the position of that value.
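For instance, here's a minimal sketch with PyTorch's built-in F.one_hot (illustrative only, not carefree-learn's actual implementation):

```python
import torch
import torch.nn.functional as F

# 5 classes in total; each value becomes a length-5 vector with a single 1
labels = torch.tensor([0, 2, 4])
print(F.one_hot(labels, num_classes=5))
# tensor([[1, 0, 0, 0, 0],
#         [0, 0, 1, 0, 0],
#         [0, 0, 0, 0, 1]])
```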
We can figure out that this kind of encoding is static, which means it will not change during the training process. In this case, we can cache all the encodings beforehand and access them through indexing, which speeds up the encoding process by roughly 66x.
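Here's a minimal sketch of the caching idea, assuming an identity-matrix cache whose i-th row is the encoding of class i (the names are illustrative, not carefree-learn's API):

```python
import torch

num_classes = 5
# Cache every possible encoding once: row i of the identity matrix is the
# one_hot encoding of class i
cache = torch.eye(num_classes)

# Encoding a batch now reduces to a single indexing operation
labels = torch.tensor([0, 2, 4, 1])
encoded = cache[labels]  # shape: (4, 5)
```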
The gap narrows a little when we increase num_classes to a larger number (say, 100), but there is still a ~33x boost.
caution
Although caching can boost performance, it comes at the cost of consuming much more memory. A better solution would be to cache sparse tensors instead of dense ones, but PyTorch's sparsity support is not yet good enough. See the Sparsity section for more details.
Embedding
An embedding encoding borrows from Natural Language Processing (NLP), where (sparse) input words are converted into dense embeddings via an embedding look up. It is quite trivial to turn categorical features into embeddings with the same look up technique, but tabular datasets have a different property compared with NLP: a tabular dataset needs to maintain many embedding tables, because it contains multiple categorical features with different numbers of values, while in NLP we only need to maintain one embedding table in most cases.
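In PyTorch terms, the straightforward approach is therefore one nn.Embedding table per categorical feature; the following is a hedged sketch of that idea, not carefree-learn's actual code:

```python
import torch
import torch.nn as nn

# Three categorical features with different numbers of values and
# (possibly) different embedding dimensions -> three separate tables
num_values = [2, 3, 10]
embed_dims = [4, 4, 8]
tables = nn.ModuleList(nn.Embedding(n, d) for n, d in zip(num_values, embed_dims))

x = torch.tensor([[0, 2, 7], [1, 0, 3]])  # a batch of 2 samples
embedded = torch.cat([table(x[:, i]) for i, table in enumerate(tables)], dim=1)
print(embedded.shape)  # torch.Size([2, 16])
```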
Since embedding is a dynamic encoding which contains trainable parameters, we cannot cache it beforehand as we did with one_hot. However, we can still optimize it with fast embedding. A fast embedding basically unifies the embedding dimension of the different categorical features, so one unified embedding table is sufficient for the whole embedding process.
There's one more thing we need to take care of when applying fast embedding: we need to increment the values of each categorical feature. Here's a minimal example to illustrate this. Suppose we have two categorical features ($x_1$ and $x_2$) with 2 and 3 classes respectively, then our embedding table will contain 5 rows. In this table, the first two rows belong to $x_1$, while the last three rows belong to $x_2$. However, as we defined above, $x_1 \in \{0, 1\}$ and $x_2 \in \{0, 1, 2\}$. In order to assign rows $\{2, 3, 4\}$ to $x_2$, we need to increment $x_2$ by $2$ (which is the number of choices $x_1$ could have). After the increment, we have $x_2 \in \{2, 3, 4\}$, so it can successfully look up its own rows.
Note that the incremented indices are static, so carefree-learn will cache these indices to avoid duplicate calculations when fast embedding is applied.
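Here's a hedged sketch of the fast embedding idea described above (a unified dimension, a single table, and cached offsets); the identifiers are illustrative and do not mirror carefree-learn's internals:

```python
import torch
import torch.nn as nn

num_values = [2, 3]   # two categorical features with 2 and 3 classes
unified_dim = 4       # the unified embedding dimension

# A single table with 2 + 3 = 5 rows serves both features
table = nn.Embedding(sum(num_values), unified_dim)

# Static offsets, computed and cached once: feature i is shifted by the
# number of rows occupied by the features before it -> tensor([0, 2])
offsets = torch.cumsum(torch.tensor([0] + num_values[:-1]), dim=0)

x = torch.tensor([[1, 0], [0, 2]])  # raw categorical values
embedded = table(x + offsets)       # shape: (2, 2, 4)
```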
Since the embedding dimensions are unified, fast embedding actually reduces the flexibility a little bit, but it can speed up the encoding process by ~17.5x.
note
Theoretically, an embedding encoding is nothing more than a one_hot encoding followed by a linear projection, so it should be fast enough if we apply sparse matrix multiplications between the one_hot encodings and a block diagonal embedding look up table. However, as mentioned in the One Hot Encoding section, PyTorch's sparsity support is not yet good enough. See the Sparsity section for more details.
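This equivalence is easy to verify; the snippet below is a quick illustration rather than anything carefree-learn ships:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(5, 4)
x = torch.tensor([0, 2, 4])

# embedding look up == one_hot encoding followed by a linear projection
lookup = emb(x)
projected = F.one_hot(x, num_classes=5).float() @ emb.weight
assert torch.allclose(lookup, projected)
```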
Sparsity
It is quite trivial that the one_hot encoding actually outputs a sparse matrix, with sparsity equal to:

$$\text{sparsity} = \frac{\text{num\_classes} - 1}{\text{num\_classes}}$$
So the sparsity easily exceeds 90% once num_classes is greater than 10, and it is therefore quite natural to think of leveraging sparse data structures to cache these one_hot encodings. What's better, the embedding encoding could be represented as a sparse matrix multiplication between the one_hot encodings and a block diagonal embedding look up table, so THEORETICALLY (🤣) we could reuse the one_hot encodings to get the embedding encodings efficiently.
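Here's a hedged sketch of what this sparse route could look like with scipy (again, an illustration rather than something carefree-learn currently does):

```python
import numpy as np
import scipy.sparse as sp

# Two categorical features with 2 and 3 classes, embedding dimension 4 each
rng = np.random.default_rng(0)
tables = [rng.standard_normal((2, 4)), rng.standard_normal((3, 4))]

# Sparse one_hot encodings, concatenated feature-wise -> shape (N, 2 + 3)
x = np.array([[1, 0], [0, 2]])
num_samples = x.shape[0]
one_hots = sp.hstack([
    sp.csr_matrix(
        (np.ones(num_samples), (np.arange(num_samples), x[:, i])),
        shape=(num_samples, n),
    )
    for i, n in enumerate([2, 3])
])

# Block diagonal look up table -> shape (2 + 3, 4 + 4)
block = sp.block_diag(tables)

# One sparse matmul reuses the cached one_hot encodings to get every embedding
embeddings = one_hots @ block  # shape: (2, 8)
```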
Unfortunately, although scipy supports sparse matrices pretty well, PyTorch has not yet supported them well enough. So we'll stick to the dense solutions mentioned above, but will switch to the sparse ones once PyTorch releases some fancy sparsity support!