Introduction
Advantages
Like many similar projects, `carefree-learn` can be treated as a high-level library to help with training neural networks in PyTorch. However, `carefree-learn` does more than that.
- `carefree-learn` once focused on tabular (structured) datasets (`v0.1.x`), and since `v0.2.x`, unstructured datasets (e.g. CV datasets or NLP datasets) are supported with ❤️ as well! (And CV came before NLP because I'm more familiar with it 🤣.)
- `carefree-learn` is highly customizable for developers. We have already wrapped (almost) every single functionality / process into a single module (a Python class), and each of them can be replaced or enhanced, either directly in the source code or from your own local code, with the help of some pre-defined functions provided by `carefree-learn` (see Register Mechanism).
- `carefree-learn` supports easy-to-use saving and loading. By default, everything will be wrapped into a `.zip` file, and the `onnx` format is natively supported (see the sketch after this list)!
- `carefree-learn` supports Distributed Training.
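A minimal sketch of the saving / loading workflow, assuming the v0.1.x-style `cflearn.make` entry point; the exact `save` / `load` signatures may differ in your version:

```python
import cflearn
import numpy as np

x = np.random.random([1000, 10])
y = np.random.random([1000, 1])
m = cflearn.make().fit(x, y)

# by default, everything needed for inference is packed into a single zip file
m.save("model")             # assumed to produce `model.zip`

# ...which can later be restored in one line
m2 = cflearn.load("model")  # assumed counterpart of `save`
```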
Apart from these, `carefree-learn` also has quite a few specific advantages in each area:
Machine Learning 📈
`carefree-learn` provides an end-to-end pipeline for tabular tasks, which AUTOMATICALLY deals with the following (this part is mainly handled by `carefree-data`, though):
- Detection of redundant feature columns which can be excluded (all SAME, all DIFFERENT, etc.).
- Detection of feature column types (whether a feature column is a string column / numerical column / categorical column).
- Imputation of missing values.
- Encoding of string columns and categorical columns (Embedding or One Hot Encoding).
- Pre-processing of numerical columns (Normalize, Min Max, etc.).
- And much more...
`carefree-learn` can help you deal with almost ANY kind of tabular dataset, no matter how dirty and messy it is. It can be trained directly with some numpy arrays, or indirectly with some files located on your machine (as shown in the sketch below). This makes `carefree-learn` stand out from similar projects.
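A minimal sketch of both entry modes, assuming the v0.1.x-style `cflearn.make` entry point (the CSV path is a placeholder; exact signatures may differ in your version):

```python
import cflearn
import numpy as np

# 1) train directly with numpy arrays
x = np.random.random([1000, 10])
y = np.random.randint(0, 2, [1000, 1])
m = cflearn.make().fit(x, y)

# 2) train indirectly with files located on your machine
#    (carefree-data is assumed to parse & clean the raw file for us)
m = cflearn.make().fit("train.csv")
```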
info
From the descriptions above, you might notice that `carefree-learn` is more of a minimal Automatic Machine Learning (AutoML) solution than a pure Machine Learning package.
tip
When we say ANY, it means that `carefree-learn` can even train on one single sample.
For example
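A minimal sketch, assuming the v0.1.x-style `cflearn.make` entry point:

```python
import cflearn
import numpy as np

# one single sample is enough to run the whole pipeline end-to-end
x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[0]])
m = cflearn.make().fit(x, y)
```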
This is especially useful when we need to do unittests or to verify whether our custom modules (e.g. custom pre-processes) are correctly integrated into `carefree-learn`.
For example
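A hypothetical pytest-style check built on the single-sample trick (same API assumption as above):

```python
import numpy as np

def test_pipeline_runs_on_one_sample():
    import cflearn

    x = np.random.random([1, 10])
    y = np.random.random([1, 1])
    # if a custom module were mis-registered, this tiny fit should fail fast
    m = cflearn.make().fit(x, y)
    assert m is not None
```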
There is one more thing we'd like to mention: `carefree-learn` is Pandas-free. The reasons why we excluded Pandas are listed in `carefree-data`.
Computer Vision 🖼️
`carefree-learn` also provides an end-to-end pipeline for computer vision tasks, and:
- Supports native `torchvision` datasets.
  - Currently only `mnist` is supported, but more will be added in the future (if needed)!
- Focuses on the `ImageFolderDataset` for customization, which:
  - Automatically splits the dataset into train & valid.
  - Supports generating labels in parallel, which is very useful when calculating labels is time consuming.
  - See the IFD introduction for more details.
- Supports various kinds of `Callback`s, which can be used for saving intermediate visualizations / results.
  - For instance, `carefree-learn` implements an `ArtifactCallback`, which can dump artifacts to disk elaborately during training.
Configurations
In most cases, `Pipeline` will be the (internal) user interface in `carefree-learn`, which can handle training, evaluating, saving and loading easily. Therefore, configuration in `carefree-learn` is mostly done by sending args and kwargs to the `Pipeline` module.
In order to provide an even better user experience, `carefree-learn` also provides many handy APIs to directly access the corresponding `Pipeline`s or `Model`s. For example, if we want to use the `resnet18` model, we can access it with one line of code:
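A sketch of such a one-liner; the `cflearn.api.resnet18` helper and its `num_classes` argument are assumptions based on the surrounding text, and may differ across versions:

```python
import cflearn

model = cflearn.api.resnet18(num_classes=10)
```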
It's also possible to load pretrained weights by specifying `pretrained=True`:
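Under the same assumptions as above:

```python
model = cflearn.api.resnet18(num_classes=10, pretrained=True)
```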
note
It is worth mentioning that although `carefree-learn` supports very fine-grained configurations (e.g. configuring different optimizers for different groups of parameters, which is a common use case in GANs), it also provides straightforward configurations when the task is not so complicated.
For instance, in GAN tasks, we may need to configure different optimizers for the generator and the discriminator:
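A hypothetical sketch of such a fine-grained configuration; the `optimizer_settings` layout and the scope names below are illustrative placeholders, not the verbatim `carefree-learn` schema:

```python
config = {
    "optimizer_settings": {
        # one optimizer per parameter scope (names are placeholders)
        "g_parameters": {"optimizer": "adam", "optimizer_config": {"lr": 2e-4}},
        "d_parameters": {"optimizer": "sgd", "optimizer_config": {"lr": 1e-4}},
    },
}
```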
But in 'simple' tasks, we may only need to specify a single optimizer shared by all parameters:
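Again a hypothetical sketch with placeholder keys:

```python
config = {"optimizer_name": "adam", "lr": 1e-3}
```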
And if we simply want to run a default configuration, we can leave everything out:
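For instance, with the v0.1.x-style entry point assumed earlier:

```python
import cflearn

# no args / kwargs at all: sensible defaults are picked for us
m = cflearn.make().fit(x, y)
```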
And the rest will be handled by `carefree-learn`.
info
Please refer to the Configurations section for further details.
Components
As shown in the framework at the beginning of this page, `carefree-learn` is mainly constructed with 5 loosely coupled modules:
- `Data Layer`: this part is mainly handled by `carefree-data` & `DataLoader` (from `PyTorch`) for Machine Learning 📈 & Computer Vision 🖼️ tasks respectively.
- `Model`: should implement the core algorithms (basically, it should implement the `forward` method).
- `Inference`: responsible for making inference. It should be able to work w/ or w/o a `Model`, where for the latter case it will use `ONNX` instead.
- `Trainer`: it will train a `Model` with specific training & validation data loaders.
- `Pipeline`: as mentioned above, it serves as the user interface in `carefree-learn`.
note
Please refer to Design Principles for further details.
Data Loading Strategy
The data loading strategy for tabular datasets is very different from the one for unstructured datasets. For instance, it is quite common for a CV dataset to be a bunch of pictures located in a folder, which we read either sequentially or in parallel. Nowadays, almost every famous deep learning framework has its own solution for loading unstructured datasets efficiently, e.g. PyTorch officially implements `DataLoader` to support multi-process loading and other features (which is also adopted by `carefree-learn`).

Although we know that RAM access is (almost) always faster than I/O operations, we still prefer leveraging multiple processes to read files over loading them all into RAM at once, because unstructured datasets are often too large to fit into RAM entirely. However, when it comes to tabular datasets, we prefer to load everything into RAM at the very beginning (see the sketch after this list). The main reasons are listed below:
- Tabular datasets are often quite small and can be loaded into RAM once and for all.
- Network structures for tabular datasets are often much smaller, which means multi-process loading would cause a much heavier relative overhead.
- We need to take Distributed Training into account. If we stuck to multi-process loading, there would be too many threads in the pool, which is not a good practice.
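A minimal sketch of the two strategies in plain PyTorch (the datasets here are illustrative stand-ins, not `carefree-learn` internals):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# unstructured data: keep files on disk & read them with worker processes, e.g.
#   from torchvision.datasets import ImageFolder
#   cv_loader = DataLoader(ImageFolder("images/"), batch_size=64, num_workers=4)

# tabular data: load everything into RAM up front, no worker processes needed
x = torch.from_numpy(np.random.random([1000, 10]).astype(np.float32))
y = torch.from_numpy(np.random.random([1000, 1]).astype(np.float32))
ml_loader = DataLoader(TensorDataset(x, y), batch_size=128, num_workers=0)
```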
Terminologies
In `carefree-learn`, there are some frequently used terminologies, and we will introduce them in this section. If you are confused by some other terminologies in `carefree-learn` when you are using it, feel free to edit this list:
step
One `step` in the training process means that one mini-batch has passed through our model.
epoch
In most deep learning processes, training is structured into epochs. An epoch is one iteration over the entire input data, which consists of several `step`s.
batch_size
It is good practice to slice the data into smaller batches and iterate over these batches during training, and `batch_size` specifies the size of each batch. Be aware that the last batch may be smaller if the total number of samples is not divisible by the `batch_size`.
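For example, 1000 samples with a `batch_size` of 128 yield 8 `step`s per `epoch`:

```python
import math

num_samples, batch_size = 1000, 128
steps_per_epoch = math.ceil(num_samples / batch_size)      # 8
last_batch_size = num_samples % batch_size or batch_size   # 104, smaller than 128
```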
config
A `config` indicates the main part (or, the shared part) of the configuration.
increment_config
An `increment_config` indicates the configurations that you want to update on top of `config`.
tip
This is very useful when you only want to tune a single configuration while tons of other configurations need to stay fixed. In this case, you can set the shared configurations in `config`, and adjust the target configuration in `increment_config`.
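A hypothetical illustration of the relationship (the keys are placeholders, not the verbatim `carefree-learn` schema):

```python
# the shared part, kept fixed across experiments
config = {"batch_size": 128, "num_epoch": 100, "lr": 1e-3}
# the single knob we want to tune in this experiment
increment_config = {"lr": 1e-4}

# conceptually, the final configuration is `config` updated by `increment_config`
final = {**config, **increment_config}  # lr becomes 1e-4, the rest is untouched
```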
forward
A `forward` method is a common method required by (almost) all PyTorch modules.
info
Here is a nice discussion.
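A minimal example of a PyTorch module and its `forward` method:

```python
import torch
import torch.nn as nn

class Linear(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # invoked automatically when we call `module(x)`
        return self.linear(x)

module = Linear(10, 1)
print(module(torch.randn(4, 10)).shape)  # torch.Size([4, 1])
```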
task_type
We use `task_type = "clf"` to indicate a classification task, and `task_type = "reg"` to indicate a regression task.
info
And we'll convert them into `cfdata.tabular.TaskTypes` under the hood.
train, valid & test
- We use the `train` dataset to directly optimize our model (please refer to optimizers for more details).
- We use the `valid` (validation) dataset to monitor our model, and to decide which checkpoint to use / when to perform early stopping.
- We use the `test` dataset to evaluate our model.
metrics
Although `losses` are what we optimize directly during training, `metrics` are what we actually want to optimize (e.g. `acc`, `auc`, `f1-score`, etc.). Sometimes we may want to take multiple `metrics` into consideration, and we may also want to eliminate the fluctuation that comes with mini-batch training by applying EMA on the metrics.
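A minimal sketch of EMA smoothing applied to a metric series (illustrative only, not `carefree-learn`'s internal implementation):

```python
def ema(values, decay=0.9):
    # exponential moving average: damps mini-batch fluctuations
    smoothed, current = [], None
    for v in values:
        current = v if current is None else decay * current + (1.0 - decay) * v
        smoothed.append(current)
    return smoothed

print(ema([0.70, 0.90, 0.65, 0.85]))  # noisy accuracies -> a smoother trend
```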
Of course, `carefree-learn` also supports using `losses` as `metrics` directly 🙂.
tip
Please refer to metrics and see how to customize the behaviour of `metrics` in `carefree-learn`.
optimizers
In PyTorch (and other deep learning frameworks) we have `optimizers` to optimize the parameters of our model. We sometimes need to divide the parameters into several groups and optimize them individually (which is quite common in GANs).
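A sketch of parameter groups in plain PyTorch (standard `torch.optim` usage, not `carefree-learn`-specific):

```python
import torch
import torch.nn as nn

generator = nn.Linear(16, 16)
discriminator = nn.Linear(16, 1)

# either one optimizer per group...
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

# ...or a single optimizer with multiple parameter groups
opt = torch.optim.Adam(
    [
        {"params": generator.parameters(), "lr": 2e-4},
        {"params": discriminator.parameters(), "lr": 1e-4},
    ]
)
```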
tip
Please refer to optimizers and see how to control the behaviour of `optimizers` in `carefree-learn`.