Factorization Machine¶
Difacto is a refined factorization machine (FM) with sparse memory-adaptive constraints.
Given an example \(x \in \mathbb{R}^d\) and an embedding dimension \(k\), FM models the example by
\[f(x) = \langle w, x \rangle + \frac{1}{2}\left(\|V^\top x\|_2^2 - \sum_{i=1}^d x_i^2 \|V_i\|^2_2\right)\]
where \(w \in \mathbb{R}^d\) and \(V \in \mathbb{R}^{d \times k}\) are the models we need to learn. The learning objective function is
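The prediction above is the usual FM trick: the second term equals the sum of all pairwise interactions \(\sum_{i<j} \langle V_i, V_j \rangle x_i x_j\), computed in \(O(dk)\) instead of \(O(d^2 k)\). A minimal numpy sketch (illustrative only, not Difacto's actual code; `V_i` denotes the \(i\)-th row of \(V\)):

```python
import numpy as np

def fm_predict(w, V, x):
    """f(x) = <w, x> + 1/2 * (||V^T x||^2 - sum_i x_i^2 ||V_i||^2)."""
    linear = w @ x
    interaction = 0.5 * (np.sum((V.T @ x) ** 2)
                         - np.sum(x ** 2 * np.sum(V ** 2, axis=1)))
    return linear + interaction
```

One can check that this matches the brute-force pairwise sum \(\langle w, x\rangle + \sum_{i<j} \langle V_i, V_j \rangle x_i x_j\).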
\[\frac{1}{|X|}\sum_{(x,y)} \ell(f(x), y) + \lambda_1 |w|_1 +
\frac{1}{2} \sum_{i=1}^d \left[\lambda_i w_i^2 + \mu_i \|V_i\|_2^2\right]\]
where the first regularizer \(\lambda_1 |w|_1\) induces a sparse \(w\), while the second term is a frequency-adaptive regularization that places larger penalties on more frequent features.
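The objective can be sketched as follows with a logistic loss. The exact frequency scaling (here: per-feature penalties \(\lambda_i\), \(\mu_i\) proportional to occurrence counts) is an assumption for illustration; Difacto's precise choice may differ.

```python
import numpy as np

def objective(w, V, X, y, lambda1, lambda2, mu2):
    """Regularized FM objective with frequency-adaptive penalties (sketch)."""
    counts = np.count_nonzero(X, axis=0)      # feature occurrence counts
    lam = lambda2 * counts                    # per-feature lambda_i (assumed scaling)
    mu = mu2 * counts                         # per-feature mu_i (assumed scaling)
    # FM predictions for all rows of X at once.
    f = X @ w + 0.5 * (np.sum((X @ V) ** 2, axis=1)
                       - (X ** 2) @ np.sum(V ** 2, axis=1))
    loss = np.mean(np.log1p(np.exp(-y * f)))  # logistic loss, y in {-1, +1}
    reg = (lambda1 * np.sum(np.abs(w))
           + 0.5 * (np.sum(lam * w ** 2)
                    + np.sum(mu * np.sum(V ** 2, axis=1))))
    return loss + reg
```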
Furthermore, Difacto adds two heuristic constraints:

- \(V_i = 0\) if \(w_i = 0\). That is, the embedding for feature \(i\) is marked inactive if the corresponding linear term is filtered out by the sparse regularizer. (Disable it with `l1_shrk = false`.)
- \(V_i = 0\) if feature \(i\) occurs fewer times than a threshold. In other words, Difacto does not learn embeddings for tail features. (Specify the threshold via, e.g., `threshold = 10`.)
Training uses asynchronous SGD: \(w\) is updated via FTRL and \(V\) via AdaGrad.
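A minimal single-machine sketch of one such update, assuming a standard FTRL-proximal step for \(w\) and a plain AdaGrad step for \(V\), with the `l1_shrk` constraint applied afterwards. Difacto runs these updates asynchronously across workers and servers, which this sketch does not model.

```python
import numpy as np

class FTRLAdaGrad:
    """One-step sketch: FTRL-proximal for w, AdaGrad for V (illustrative)."""

    def __init__(self, d, k, alpha=0.1, beta=1.0, lambda1=0.1, lambda2=1.0):
        self.alpha, self.beta = alpha, beta
        self.lambda1, self.lambda2 = lambda1, lambda2
        self.z = np.zeros(d)        # FTRL dual state for w
        self.n = np.zeros(d)        # FTRL squared-gradient sums
        self.w = np.zeros(d)
        self.V = np.zeros((d, k))   # zero init keeps the sketch simple
        self.G = np.zeros((d, k))   # AdaGrad squared-gradient sums for V

    def update(self, grad_w, grad_V):
        # FTRL-proximal update for w: soft-threshold z at lambda1.
        sigma = (np.sqrt(self.n + grad_w ** 2) - np.sqrt(self.n)) / self.alpha
        self.z += grad_w - sigma * self.w
        self.n += grad_w ** 2
        active = np.abs(self.z) > self.lambda1
        self.w = np.where(
            active,
            -(self.z - np.sign(self.z) * self.lambda1)
            / ((self.beta + np.sqrt(self.n)) / self.alpha + self.lambda2),
            0.0,
        )
        # AdaGrad update for V, then the l1_shrk constraint: rows whose
        # linear weight was shrunk to zero keep a zero embedding.
        self.G += grad_V ** 2
        self.V -= self.alpha * grad_V / (self.beta + np.sqrt(self.G))
        self.V[~active] = 0.0
```

Note how the FTRL step zeroes out coordinates with small accumulated gradient, and the final line enforces \(V_i = 0\) whenever \(w_i = 0\).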
Configuration¶
The configuration is defined in the protobuf file `config.proto`.
Input & Output¶
Type | Field | Description |
---|---|---|
string | train_data | The training data, can be either a directory or a wildcard filename |
string | val_data | The validation or test data, can be either a directory or a wildcard filename |
string | data_format | the data format; supports libsvm, crb, criteo, adfea, ... |
string | model_out | model output filename |
string | model_in | model input filename |
string | predict_out | the filename for prediction output. if specified, run prediction; otherwise run training |
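A hypothetical configuration fragment using the fields above, written in protobuf text format (the field names come from the table; the values and the exact syntax are illustrative, so check `config.proto` for the authoritative definitions):

```proto
train_data: "data/train/part-*"
val_data: "data/val/part-*"
data_format: "libsvm"
model_out: "model/fm"
```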
Model and Optimization¶
Type | Field | Description |
---|---|---|
float | lambda_l1 | l1 regularizer for \(w\): \(\lambda_1 |w|_1\) |
float | lambda_l2 | l2 regularizer for \(w\): \(\lambda_2 \|w\|_2^2\) |
float | lr_eta | learning rate \(\eta\) (or \(\alpha\)) for \(w\) |
Config.Embedding | embedding | the embedding \(V\) |
int32 | minibatch | the size of minibatch. the smaller, the faster the convergence, but the slower the system performance |
int32 | max_data_pass | the maximal number of data passes |
bool | early_stop | stop earlier if the validation objective is less than prev_obj - min_objv_decr |
Config.Embedding¶
The embedding \(V\). Basic options:
Type | Field | Description |
---|---|---|
int32 | dim | the embedding dimension \(k\) |
int32 | threshold | features with occurrence < threshold have no embedding (\(k=0\)) |
float | lambda_l2 | l2 regularizer for \(V\): \(\lambda_2 \|V_i\|_2^2\) |
advanced:
Type | Field | Description |
---|---|---|
float | init_scale | V is initialized by uniformly random weights in [-init_scale, +init_scale] |
float | dropout | apply dropout on the gradient of \(V\). disabled by default |
float | grad_clipping | project the gradient of \(V\) into \([-c, c]\). disabled by default |
float | grad_normalization | normalize the l2-norm of the gradient of \(V\). disabled by default |
float | lr_eta | learning rate \(\eta\) for \(V\). if not specified, then share the same with \(w\) |
float | lr_beta | learning rate \(\beta\) for \(V\). |
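A hypothetical model section in protobuf text format, combining fields from the tables above (field names are from the tables; the values and the nested-message syntax are illustrative assumptions, so consult `config.proto` for the exact schema):

```proto
lambda_l1: 1.0
lr_eta: 0.01
minibatch: 1000
max_data_pass: 10
embedding {
  dim: 16
  threshold: 10
  lambda_l2: 0.1
}
```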
Advanced Configurations¶
Type | Field | Description |
---|---|---|
int32 | save_iter | save the model every k data passes. default is -1, which only saves the last iteration |
int32 | load_iter | load the model from the k-th iteration. default is -1, which loads the last iteration's model |
bool | local_data | give a worker data only if it can access it. often used when the data has been dispatched to workers' local filesystems |
int32 | num_parts_per_file | virtually partition a file into n parts for better load balancing. default is 10 |
int32 | rand_shuffle | randomly shuffle data for minibatch SGD. a minibatch is randomly picked from rand_shuffle * minibatch examples. default is 10. |
float | neg_sampling | down-sample negative examples in the training data. disabled by default |
bool | prob_predict | if true, output a probability prediction; otherwise output the raw score \(f(x)\) |
float | print_sec | print the progress every n seconds during training. 1 sec by default |
float | lr_beta | learning rate \(\beta\), 1 by default |
float | min_objv_decr | the minimal objective decrease in early stop |
bool | l1_shrk | use or not use the constraint \(V_i = 0\) if \(w_i = 0\). enabled by default |
int32 | num_threads | number of threads used within a worker and a server |
int32 | max_concurrency | the maximal number of concurrent minibatches being processed at the same time for sgd, and the maximal number of concurrent blocks for block CD. 2 by default. |
bool | key_cache | cache the key list on both sender and receiver to reduce communication cost. it may increase memory usage |
bool | msg_compression | compress messages to reduce communication cost. it may increase computation cost. |
int32 | fixed_bytes | convert floating-point numbers into fixed-point integers with n bytes. n can be 1, 2, or 3. 0 means no compression. |