TFRecord.md is a guide to TensorFlow Records: what they are and how they can be used.
This document highlights the most important information described in the references; I strongly recommend that readers consult the references for more details.
TensorFlow is Google's second-generation machine learning framework for researching and developing AI/ML and DNN applications; it is widely used in both academia and industry. TFRecord is TensorFlow's own binary storage format.
Advantages:
If you are working with large datasets, using a binary format to store and read your data can have a significant impact on the performance of your import pipeline and, consequently, on the training time of your model.
👍 Binary data takes up less space on disk.
👍 Binary data takes less time to copy and can be read much more efficiently from disk.
👍 It is easy to combine multiple datasets, and the format integrates seamlessly with the data import and preprocessing functionality provided by the library.
👍 For datasets that are too large to be stored fully in memory, this is an advantage, as only the data that is required at the time (e.g. a batch) is loaded from disk and then processed.
👍 It is possible to store sequence data.
Disadvantages:
👎 You have to convert your data to this format, and there is little documentation that fully describes how to do so.
A TFRecord file stores your data as a sequence of binary strings. This means that you need to specify the structure of your data before you can write it to a file.
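TensorFlow provides two message types for this purpose:
- tf.train.Example
- tf.train.SequenceExample
Once your data is in one of these structures, you can serialize it and use tf.python_io.TFRecordWriter (tf.io.TFRecordWriter in TensorFlow 2.x) to write it to disk.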
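As a minimal sketch of the "sequence of binary strings" idea, assuming TensorFlow 2.x with eager execution (where the writer is exposed as tf.io.TFRecordWriter), the following writes five raw byte strings and streams them back two at a time. The temporary path and the record contents are made up for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'raw.tfrecord')

# The writer accepts arbitrary byte strings, one per record.
with tf.io.TFRecordWriter(path) as writer:
    for i in range(5):
        writer.write('record-{}'.format(i).encode('utf-8'))

# Records are read lazily from disk and grouped into batches of two,
# so only the current batch has to be held in memory.
dataset = tf.data.TFRecordDataset(path).batch(2)
batches = [batch.numpy().tolist() for batch in dataset]
print(batches)
```

In practice the byte strings are serialized tf.train.Example or tf.train.SequenceExample messages rather than plain text.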
If your dataset consists of features, where each feature is a list of values of the same type, tf.train.Example is the right structure to use.
Consider a movie recommendation application:
| Age | Movie | Movie Ratings | Suggestion | Suggestion Purchased | Purchase Price |
|---|---|---|---|---|---|
| 29 | The Shawshank Redemption | 9.0 | Inception | 1.0 | 9.99 |
| | Fight Club | 9.7 | | | |

We now have a list of features, and all values within a feature have the same type:
- feature Age is an integer
- feature Movie is a string
- feature Movie Ratings is a real number
- feature Suggestion is a string
- feature Suggestion Purchased is a real number
- feature Purchase Price is a real number
We need to create the lists that constitute the features by using:
movie_name_list = tf.train.BytesList(value=[b'The Shawshank Redemption', b'Fight Club'])
movie_rating_list = tf.train.FloatList(value=[9.0, 9.7])
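Python strings need to be converted to bytes (for example with str.encode('utf-8')) before they are stored in a tf.train.BytesList. Each list is then wrapped in a tf.train.Feature: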
movie_names = tf.train.Feature(bytes_list=movie_name_list)
movie_ratings = tf.train.Feature(float_list=movie_rating_list)
Collect all named features in a dictionary and wrap it in tf.train.Features:
movie_dict = {'Movie Names': movie_names, 'Movie Ratings': movie_ratings}
movies = tf.train.Features(feature=movie_dict)
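The snippets above cover only the Movie and Movie Ratings columns. As a minimal sketch, assuming TensorFlow is installed and that the integer Age column maps to tf.train.Int64List, the full row from the table could be encoded as follows. The dictionary keys and variable names are illustrative choices, not taken from the references.

```python
import tensorflow as tf

# One typed list per column of the table; every value in a list has the same type.
age = tf.train.Feature(int64_list=tf.train.Int64List(value=[29]))
movie = tf.train.Feature(bytes_list=tf.train.BytesList(
    value=[b'The Shawshank Redemption', b'Fight Club']))
movie_ratings = tf.train.Feature(
    float_list=tf.train.FloatList(value=[9.0, 9.7]))
# A Python str must be encoded to bytes before it goes into a BytesList.
suggestion = tf.train.Feature(bytes_list=tf.train.BytesList(
    value=['Inception'.encode('utf-8')]))
suggestion_purchased = tf.train.Feature(
    float_list=tf.train.FloatList(value=[1.0]))
purchase_price = tf.train.Feature(
    float_list=tf.train.FloatList(value=[9.99]))

# Illustrative feature names; any unique string keys would work.
all_features = tf.train.Features(feature={
    'Age': age,
    'Movie': movie,
    'Movie Ratings': movie_ratings,
    'Suggestion': suggestion,
    'Suggestion Purchased': suggestion_purchased,
    'Purchase Price': purchase_price,
})
full_example = tf.train.Example(features=all_features)
```

Note that tf.train.FloatList stores 32-bit floats, so values such as 9.7 are kept only approximately.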
Wrap the features in a tf.train.Example and write it to disk by using:
example = tf.train.Example(features=movies)
with tf.python_io.TFRecordWriter('movie_ratings.tfrecord') as writer:
    writer.write(example.SerializeToString())
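To check that the file contains what was intended, it helps to read it back. The following is a sketch of a full round trip, assuming TensorFlow 2.x with eager execution, so it uses tf.io.TFRecordWriter in place of tf.python_io.TFRecordWriter. The temporary path and the choice of tf.io.VarLenFeature for both features are assumptions made for this example.

```python
import os
import tempfile

import tensorflow as tf

# Rebuild the two-feature example from the text.
movies = tf.train.Features(feature={
    'Movie Names': tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[b'The Shawshank Redemption', b'Fight Club'])),
    'Movie Ratings': tf.train.Feature(float_list=tf.train.FloatList(
        value=[9.0, 9.7])),
})
example = tf.train.Example(features=movies)

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'movie_ratings.tfrecord')
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Each feature may hold any number of values, so parse both as variable length.
feature_spec = {
    'Movie Names': tf.io.VarLenFeature(tf.string),
    'Movie Ratings': tf.io.VarLenFeature(tf.float32),
}


def parse(serialized):
    """Decode one serialized tf.train.Example into a dict of sparse tensors."""
    return tf.io.parse_single_example(serialized, feature_spec)


for record in tf.data.TFRecordDataset(path).map(parse):
    names = tf.sparse.to_dense(record['Movie Names'], default_value='').numpy()
    ratings = tf.sparse.to_dense(record['Movie Ratings']).numpy()
    print(names, ratings)
```

The parsing spec has to name the same keys and types that were used when writing; the file itself does not carry a schema.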
If you have features that consist of lists of identically typed data, and maybe some contextual data, tf.train.SequenceExample is the right structure to use. Consider the following data:
Data:

| | Movie 1 | Movie 2 |
|---|---|---|
| Movie Name | The Shawshank Redemption | Fight Club |
| Movie Rating | 4.5 | 5 |
| Actor | Tim Robbins | Brad Pitt |
| | Morgan Freeman | Edward Norton |
| | | Helena Bonham Carter |

Context:

| Locale | Age | Favorites |
|---|---|---|
| "pt_BR" | 19.0 | Majesty Rose |
| | | Savannah Outen |
| | | One Direction |

In this example we have a number of context features (Locale, Age, Favorites) and a list of movie recommendations, each consisting of Movie Name, Movie Rating, and Actor.
The data looks very similar to the previous example, where each feature consisted of a single list and each entry in the list represented the same information for a different movie, for example:
- Movie Rating
But now we also have Actor, which is a list of lists: every movie has its own list of actors, and these lists differ in length. This kind of feature cannot be stored in a tf.train.Example. We need a different structure for this kind of data: tf.train.SequenceExample.
It has two attributes:
- context: the data from the Context table, stored in tf.train.Features
- feature_lists: the data from Movie Name, Movie Rating, and Actor, stored in tf.train.FeatureLists
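A minimal sketch of how the two tables might be encoded, assuming TensorFlow is installed. The helper functions, variable names, and dictionary keys are hypothetical choices for this example; only the tf.train message types come from the library.

```python
import tensorflow as tf


def bytes_feature(values):
    """Wrap a list of Python strings in a bytes Feature (helper for this sketch)."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[v.encode('utf-8') for v in values]))


def float_feature(values):
    """Wrap a list of numbers in a float Feature (helper for this sketch)."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))


# Context: features that describe the record as a whole.
context = tf.train.Features(feature={
    'Locale': bytes_feature(['pt_BR']),
    'Age': float_feature([19.0]),
    'Favorites': bytes_feature(
        ['Majesty Rose', 'Savannah Outen', 'One Direction']),
})

# Sequence data: each FeatureList holds one Feature per movie.
movie_names = tf.train.FeatureList(feature=[
    bytes_feature(['The Shawshank Redemption']),
    bytes_feature(['Fight Club']),
])
movie_ratings = tf.train.FeatureList(feature=[
    float_feature([4.5]),
    float_feature([5.0]),
])
# The per-movie actor lists may differ in length.
actors = tf.train.FeatureList(feature=[
    bytes_feature(['Tim Robbins', 'Morgan Freeman']),
    bytes_feature(['Brad Pitt', 'Edward Norton', 'Helena Bonham Carter']),
])

feature_lists = tf.train.FeatureLists(feature_list={
    'Movie Names': movie_names,
    'Movie Ratings': movie_ratings,
    'Actors': actors,
})

sequence_example = tf.train.SequenceExample(
    context=context, feature_lists=feature_lists)
serialized = sequence_example.SerializeToString()
```

The serialized bytes can be written with the same TFRecordWriter call that is used for a tf.train.Example.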
[1] Tensorflow Records? What they are and how to use them
[2] How to use TFRecord with Datasets and Iterators in Tensorflow with code samples
[3] TensorFlow Tutorial For Beginners
[4] Using TFRecords and tf.Example