Input Data¶
Wormhole supports various input data sources and formats.
Data Formats¶
Both text and binary formats are supported.
LIBSVM¶
Wormhole supports a more general version of the LIBSVM format. Each example is presented as a text line:
label feature_id[:weight] feature_id[:weight] ... feature_id[:weight]
- label: a float label
- feature_id: an unsigned 64-bit integer feature index. Feature indices are not required to be contiguous.
- weight: the corresponding float weight, which is optional
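The line layout above can be illustrated with a small Python sketch. This is not Wormhole's own parser (which is written in C++). The function name `parse_libsvm_line` is hypothetical, and treating a missing weight as 1.0 is an assumption made for this example only, since the text above says only that the weight is optional.

```python
from typing import List, Tuple


def parse_libsvm_line(line: str) -> Tuple[float, List[Tuple[int, float]]]:
    """Parse one text line of the form: label feature_id[:weight] ..."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")

    # The first token is the float label.
    label = float(tokens[0])

    features = []
    for token in tokens[1:]:
        # Split "feature_id:weight"; sep is "" when no weight is given.
        fid_str, sep, weight_str = token.partition(":")
        fid = int(fid_str)
        # Feature ids must fit in an unsigned 64-bit integer.
        if not 0 <= fid < 2**64:
            raise ValueError(f"feature id out of range: {fid}")
        # Assumption for this sketch: a missing weight defaults to 1.0.
        weight = float(weight_str) if sep else 1.0
        features.append((fid, weight))
    return label, features


print(parse_libsvm_line("1 10:0.5 42 7:2"))
# prints (1.0, [(10, 0.5), (42, 1.0), (7, 2.0)])
```

Note that the feature ids in the example are neither sorted nor contiguous, which the format permits.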
Compressed Row Block (CRB)¶
This is a compressed binary data format. One can use bin/text2crb to convert any supported data format into it.
Customized Format¶
Adding a customized format requires only two steps.
- Define a subclass that implements the function ParseNext of ParserImpl. Examples:
  - Parse the text Criteo CTR dataset: criteo_parser
  - Parse the binary crb format: crb_parser
- Then add the new parser to a reader, for example the minibatch reader.
Data Sources¶
Besides standard filesystems, Wormhole supports the following distributed filesystems.
HDFS¶
To support HDFS, compile with the flag USE_HDFS=1, for example make USE_HDFS=1, or set the flag in config.mk. An example filename of an HDFS file:
hdfs:///user/you/ctr_data/day_0
Amazon S3¶
To support Amazon S3, compile with the flag USE_S3=1. In addition, one needs to set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY properly. For example, add the following two lines to ~/.bashrc (replace the strings with your AWS credentials):
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
An example filename of an S3 file:
s3://ctr-data/day_0
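Before launching a job that reads from S3, it can help to confirm that both variables are actually visible to the process. The following is a minimal Python sketch of such a check. The helper name `missing_s3_vars` is hypothetical and not part of Wormhole. It only tests that the variables are set and non-empty, not that the credentials are valid.

```python
import os
from typing import List, Mapping, Optional

# The two variables named in the text above.
REQUIRED_S3_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_s3_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required S3 variables that are unset or empty."""
    # Default to the current process environment.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_s3_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("S3 credentials are set in the environment.")
```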
Microsoft Azure Blob Storage (Alpha support)¶
To support Azure Blob Storage, compile with the flags USE_AZURE=1 and DEPS_PATH=deps. This requires the Azure C++ Storage SDK (https://github.com/Azure/azure-storage-cpp).
- Install the Azure Storage SDK (TODO: move to make/deps.mk):
sudo apt-get -y install libboost1.54-all-dev libssl-dev cmake libxml++2.6-dev libxml++2.6-doc uuid-dev
cd deps && mkdir -p lib include
git clone https://git.codeplex.com/casablanca
cd casablanca/Release
mkdir build.release
cd build.release
CXX=g++ cmake .. -DCMAKE_BUILD_TYPE=Release
make -j4
cp Binaries/libcpprest* ../../../lib
cp -r ../include/* ../../../include/
cd ../../..
git clone https://github.com/Azure/azure-storage-cpp
cd azure-storage-cpp/Microsoft.WindowsAzure.Storage
mkdir build.release
cd build.release
CASABLANCA_DIR=../../../../casablanca/ CXX=g++ cmake .. -DCMAKE_BUILD_TYPE=Release
make -j4
cp Binaries/libazurestorage* ../../../lib
cp -r ../includes/* ../../../include/
cd ../../../..
One also needs to set the environment variables properly (About Azure storage account):
export AZURE_STORAGE_ACCOUNT=mystorageaccount
export AZURE_STORAGE_ACCESS_KEY=EXAMPLEKEY
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/dmlc-core/deps/lib
- An example filename of an Azure file:
azure://container/agaricus.txt.test
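The example filenames in this section all follow the same scheme://location/path shape. As an illustration, the Python sketch below splits such a name into its parts with the standard library. The function name `describe_data_uri` is hypothetical, and treating a name without a scheme as a local file is an assumption of this example. Wormhole's own filename handling may differ.

```python
from typing import Tuple
from urllib.parse import urlsplit


def describe_data_uri(uri: str) -> Tuple[str, str, str]:
    """Split a data filename into (scheme, bucket_or_container, path)."""
    parts = urlsplit(uri)
    # Assumption: a name without a scheme refers to a local file.
    scheme = parts.scheme or "file"
    return scheme, parts.netloc, parts.path


for name in (
    "hdfs:///user/you/ctr_data/day_0",
    "s3://ctr-data/day_0",
    "azure://container/agaricus.txt.test",
):
    print(describe_data_uri(name))
# prints:
# ('hdfs', '', '/user/you/ctr_data/day_0')
# ('s3', 'ctr-data', '/day_0')
# ('azure', 'container', '/agaricus.txt.test')
```

In the HDFS example the location part is empty because the name uses three slashes, while for S3 and Azure it holds the bucket and container name respectively.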