Define a new file format and implement InputFormat and RecordReader for it #77

beomyeol · 2015-08-19T09:45:50Z

For various datasets, the data is stored in different file formats. For example, the MNIST database is saved in its own file format, and the CIFAR-10 database is stored both as Python pickle files and in its own file format. Supporting all of these file formats is too burdensome.
So, I suggest defining a new file format which our DNN uses to load data from file. In order to support a variety of datasets such as MNIST and ImageNet, we can convert these datasets to our file format and provide them to the DNN.
After defining the new file format, we also need an InputFormat and a RecordReader for it to run our neural network on REEF.


dongjoon-hyun · 2015-08-19T13:59:19Z

I agree with your concerns. Dolphin is a framework, not an application. But, at this point, I have a concern about reinventing the wheel. Why don't you use ND4J serialization? If you agree to use the ND4J format, @swlsw and I will make a converter for it. Let's simply assume the following.

All data has a matrix form using the INDArray interface.
The Training / Cross Validation / Test Data Set will be loaded as an m x n matrix.
- m is the number of instances.
- n is the number of features.
The Label Data Set will be loaded as an m x 1 vector.

By the way, in reality, we use neither the MNIST vector format nor the CIFAR-10 pickle files. We mostly use only the original files, such as JPEG.

For MNIST, the original data is JPEG files. (Black and White)
The folder structure is described in Support Multiple Data Sources #63. (hdfs://data/image/mnist/jpg)
@jsjason , could you download this from SKT cluster into your cluster, if you need?
You can use SKT cluster itself too.
For ImageNet, the original data is also JPEG files. (RGB Color)
Also, you can download them in our cluster (hdfs://data/image/imagenet/tar)
You can connect through VPN. Please get one from @jsjason .

beomyeol · 2015-08-20T00:47:28Z

Thank you for your suggestion, @dongjoon-hyun. I will consider ND4J serialization and discuss this with @jsjason. If it is okay to use it, I will let you know and start implementing it.

I can connect to SKT cluster through VPN. Thanks to @jsjason.

dongjoon-hyun · 2015-08-20T01:04:04Z

Thank you for considering it. By the way, I found the following code in DL4J and ND4J. The file is actually a plain text file delimited by three spaces: "   "

DL4J

        ClassPathResource resource = new ClassPathResource("/mnist2500_X.txt");
        File f = resource.getFile();
        INDArray data = Nd4j.readNumpy(f.getAbsolutePath(),"   ").get(NDArrayIndex.interval(0,100),NDArrayIndex.interval(0,784));

ND4J

    /**
     * Read line via input streams
     *
     * @param filePath the input stream ndarray
     * @param split    the split separator
     * @return the read txt method
     */
    public static INDArray readNumpy(String filePath, String split) throws IOException {
        return readNumpy(new FileInputStream(filePath), split);
    }

    /**
     * Read line via input streams
     *
     * @param filePath the input stream ndarray
     * @return the read txt method
     */
    public static INDArray readNumpy(String filePath) throws IOException {
        return readNumpy(filePath, "\t");
    }

I think we already have a NumPy-compatible read function in ND4J.

beomyeol · 2015-08-20T01:33:32Z

Thank you for letting me know about readNumpy(). But I have a concern. Is it okay to use a plain text file with a delimiter? A plain text file needs more space than a binary file.

In addition, I looked at the code of readNumpy() in the ND4J library. It supports NumPy-compatible plain text files, but not NumPy-compatible binary files such as .npy or .npz.
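A rough sketch of the space concern above, comparing one row written as delimited text with the same row written as raw 4-byte floats. `TextVsBinarySize` is a hypothetical name, and the binary layout here is just an assumption for illustration; it is neither ND4J serialization nor .npy. The actual ratio depends on how many digits each value prints with.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: encodes one row as delimited text or as raw floats. */
final class TextVsBinarySize {

    /** Encodes the row as delimiter-separated decimal text ending in a newline. */
    static byte[] toText(float[] row, String delimiter) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            if (i > 0) {
                sb.append(delimiter);
            }
            sb.append(row[i]);
        }
        sb.append('\n');
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    /** Encodes the row as raw IEEE 754 floats, 4 bytes per value. */
    static byte[] toBinary(float[] row) {
        ByteBuffer buf = ByteBuffer.allocate(Float.BYTES * row.length);
        for (float v : row) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        float[] row = {0.12345678f, 0.5f, 0.25f};
        System.out.println("text bytes:   " + toText(row, ",").length);
        System.out.println("binary bytes: " + toBinary(row).length);
    }
}
```

The binary size is always 4 bytes per value, while the text size grows with the number of printed digits plus one delimiter per value.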

dongjoon-hyun · 2015-08-20T01:45:04Z

Yep. That is right. But I think we can rely on the ND4J layer for that part.
If we design our architecture with an ND4J layer that handles readNumpy, the conversion job for NumPy is a piece of cake. We can implement a converter for NumPy as just a small Python script that opens a .npy file and stores it as .txt. :)

dongjoon-hyun · 2015-08-20T01:51:01Z

By the way, for efficiency, we have to distinguish between the input file format and the internal storage format. The following are my opinions so far.

For the input format, just call readNumpy().
For the internal storage format, just use ND4J serialization.

jsjason · 2015-08-20T02:09:10Z

@dongjoon-hyun When you say 'internal storage format', are you referring to the intermediate and final output data?

jsjason · 2015-08-20T02:23:20Z

One thing I am concerned about is our dependency on ND4J. I don't know much about scientific computing libraries, but is it okay to rely on ND4J this much? We could search for and use a library with a larger community.

beomyeol · 2015-08-20T02:43:41Z

@dongjoon-hyun,
Okay, we can use a plain text format as the input format. I have one more concern about it.
REEF does not support multiple data sources yet, as we discussed in #63. We need to put images and labels into a single text file and take that into account in the file format. I am thinking of the following format.

(image) (delimiter) (label) (newline)
(image) (delimiter) (label) (newline)
...
(image) (delimiter) (label) (newline)

By using readNumpy(), image and label data can be loaded and we can set ',' as the delimiter, for example. Is this format fine to use? If so, I don't think we need custom InputFormat and RecordReader. We can just use TextInputFormat.
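A minimal sketch of parsing one line of the `(image) (delimiter) (label)` layout above, assuming the image is flattened into delimiter-separated float values and the label is the last token on the line. `LabeledLineParser` and `Record` are hypothetical names, not part of REEF, dolphin, or ND4J, and plain `float[]` stands in for `INDArray` to keep the example self-contained.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: parses "(image)(delimiter)(label)" text lines. */
final class LabeledLineParser {

    /** One parsed line: the flattened image values plus its label. */
    static final class Record {
        final float[] features;
        final float label;

        Record(float[] features, float label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Splits a line on a literal delimiter; the last token is taken as the label. */
    static Record parse(String line, String delimiter) {
        String[] tokens = line.trim().split(Pattern.quote(delimiter));
        if (tokens.length < 2) {
            throw new IllegalArgumentException(
                    "expected at least one value and a label: " + line);
        }
        float[] features = new float[tokens.length - 1];
        for (int i = 0; i < features.length; i++) {
            features[i] = Float.parseFloat(tokens[i].trim());
        }
        float label = Float.parseFloat(tokens[tokens.length - 1].trim());
        return new Record(features, label);
    }

    public static void main(String[] args) {
        Record r = parse("0.0,0.5,1.0,7", ",");
        System.out.println(r.features.length + " features, label " + r.label);
    }
}
```

Since every record sits on its own line, a line-oriented reader such as TextInputFormat could hand each line to a parser like this one.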

bgchun · 2015-08-20T05:22:00Z

@beomyeol It'd be nice to resolve #63 eventually. But if #63 takes time, we should address it later since there are other more important issues.

dongjoon-hyun · 2015-08-20T05:37:51Z

@jsjason , by 'internal storage format' I really meant dolphin's own internal format, if one is needed. It's not the output format.

As for the dependency, I always welcome your further research and proposals for a better BLAS library that supports CPU/GPU. :)

dongjoon-hyun · 2015-08-20T05:44:38Z

Ur, @beomyeol , I meant a float matrix for dolphin. Sorry for confusing you. All image/sound/text data will be transformed by me and @swlsw into NumPy matrices for dolphin, so dolphin has no need to care about that. What I described above is the real final application goal. For the dolphin neural network algorithm, you can assume a float matrix as the input and perform mathematical operations only.

dongjoon-hyun · 2015-08-20T05:54:04Z

For the train/test data and labels, you can read them in a similar way to what you described, i.e., as an m x (n + 1) matrix.

m : the number of data instances
n : the number of features
1 : the last column is the label column.
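A minimal sketch of splitting such an m x (n + 1) matrix back into an m x n feature matrix and m labels. `LabeledMatrix` is a hypothetical name, and plain Java arrays are used here instead of `INDArray` only so that the example runs without ND4J.

```java
import java.util.Arrays;

/** Hypothetical helper: splits an m x (n + 1) matrix into features and labels. */
final class LabeledMatrix {

    /** Returns the m x n feature part (every column except the last). */
    static float[][] features(float[][] data) {
        float[][] out = new float[data.length][];
        for (int i = 0; i < data.length; i++) {
            requireLabelColumn(data[i], i);
            out[i] = Arrays.copyOf(data[i], data[i].length - 1);
        }
        return out;
    }

    /** Returns the m labels (the last column of every row). */
    static float[] labels(float[][] data) {
        float[] out = new float[data.length];
        for (int i = 0; i < data.length; i++) {
            requireLabelColumn(data[i], i);
            out[i] = data[i][data[i].length - 1];
        }
        return out;
    }

    /** Rejects rows that cannot hold at least one feature plus the label. */
    private static void requireLabelColumn(float[] row, int index) {
        if (row.length < 2) {
            throw new IllegalArgumentException(
                    "row " + index + " needs at least one feature and a label");
        }
    }

    public static void main(String[] args) {
        float[][] data = {{0.1f, 0.2f, 3f}, {0.4f, 0.5f, 8f}};
        System.out.println(Arrays.deepToString(features(data)));
        System.out.println(Arrays.toString(labels(data)));
    }
}
```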

In addition, dolphin should be able to load a pre-trained model. This is more important. Do you have any ideas for this?

jsjason · 2015-08-20T06:08:35Z

The pre-trained model equals the initial parameter set for the DNN case, right? Unlike the other algorithms, for DNNs we are trying to provide a ParameterInitializer that generates the initial values for edge weights and biases.

beomyeol · 2015-08-20T06:40:37Z

@dongjoon-hyun. I am still a little confused. What is the format of the file that dolphin loads? Is it a NumPy-compatible plain text file format like 'mnist2500_X.txt' in DL4J?

In addition, I have not thought about the pre-trained model yet. We may need a feature for taking a snapshot of a neural network and a feature for reconstructing the neural network from the snapshot. I'd like to discuss this as a separate issue.

dongjoon-hyun · 2015-08-20T07:29:11Z

@jsjason , that's right. ParameterInitializer sounds Good!

dongjoon-hyun · 2015-08-20T07:30:32Z

@beomyeol . For the first question, Yes. For the second question, @jsjason answered in the previous comment.

jsjason · 2015-08-20T07:40:35Z

Thanks, @beomyeol and @dongjoon-hyun. Let's keep this issue open since we'll probably have more discussions when PRs start to come up.

beomyeol · 2015-08-20T07:42:39Z

Thank you, @dongjoon-hyun, for your comment :)

beomyeol changed the title ~~Define new file format and implement InputFormat and RecordReader for it~~ Define a new file format and implement InputFormat and RecordReader for it Aug 19, 2015

dongjoon-hyun added this to the DNN - Data partitioning milestone Aug 25, 2015

beomyeol mentioned this issue Sep 1, 2015

Load the pre-trained neural network model and save the trained model. #104

Open

jsjason added the DNN label Sep 30, 2015
