
Consume TensorFlowInferSchema from Java Spark 2.4.0 #34

Open
yeshbash opened this issue Sep 4, 2021 · 2 comments
@yeshbash

yeshbash commented Sep 4, 2021

This is not an issue, more a request for guidance. I'd like to use just the TensorFlowInferSchema functionality in my Java Spark job. Below is a sample snippet:

JavaRDD<Example> exampleRdd =
          jsc.parallelize(
              Arrays.asList(
                  Example.newBuilder()
                      .setFeatures(
                          Features.newBuilder()
                              .putFeature(
                                  "Test",
                                  Feature.newBuilder()
                                      .setInt64List(Int64List.newBuilder().addValue(10).build())
                                      .build())
                              .build())
                      .build()));
StructType schema = TensorFlowInferSchema.apply(exampleRdd.rdd(), ???);

However, the apply method expects a second, implicit argument: evidence$1 : scala.reflect.runtime.universe.TypeTag[T].

Below is the decompiled method signature:

def apply[T](rdd : org.apache.spark.rdd.RDD[T])(implicit evidence$1 : scala.reflect.runtime.universe.TypeTag[T]) : org.apache.spark.sql.types.StructType = { /* compiled code */ }

I'm not familiar with Scala, and it appears that it's not possible to create a TypeTag[Example] in Java.
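For what it's worth, scala-reflect does let Java code build a TypeTag at runtime. The helper below is a hypothetical sketch, not part of this project: it assumes scala-reflect is on the classpath and that the target is a plain, non-generic class such as the TFRecord Example proto. It follows the standard TypeTag.apply(Mirror, TypeCreator) pattern from the scala-reflect API:

```java
import scala.reflect.api.JavaUniverse;
import scala.reflect.api.Mirror;
import scala.reflect.api.TypeCreator;
import scala.reflect.api.Types;
import scala.reflect.api.TypeTags;
import scala.reflect.api.Universe;

public final class TypeTagHelper {

    // Builds a TypeTag for a plain (non-generic) Java class via the
    // scala-reflect runtime universe. Hypothetical helper, untested here.
    @SuppressWarnings("unchecked")
    public static <T> TypeTags.TypeTag<T> typeTagOf(final Class<T> clazz) {
        final JavaUniverse u = scala.reflect.runtime.package$.MODULE$.universe();
        final JavaUniverse.JavaMirror mirror = u.runtimeMirror(clazz.getClassLoader());
        return u.TypeTag().apply(mirror, new TypeCreator() {
            @Override
            public <U extends Universe> Types.TypeApi apply(final Mirror<U> m) {
                // Resolve the class symbol in the given mirror and turn it
                // into a Type for the tag.
                return m.staticClass(clazz.getName()).toTypeConstructor();
            }
        });
    }
}
```

A tag built this way could then be passed as the evidence$1 argument, since Scala implicits become ordinary trailing parameters in the compiled signature.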

I'd appreciate it if you could share your thoughts on the following:

  • Is it safe (and possible) to just consume TensorFlowInferSchema in Java without going via spark.read.format("tfrecord").option("recordType", "Example")?
  • What to pass for the TypeTag[T] argument?
  • Is there a Java example for this use case?

Thanks for maintaining this project; I highly appreciate your help!

@junshi15
Contributor

junshi15 commented Sep 4, 2021

TensorFlowInferSchema only infers the schema. Is that what you need?
I am not familiar with the Spark Java API, but can you call a Scala function from Java?
If you want to read/write TFRecord with the Java API, I assume something like this will work:
Dataset<Row> usersDF = spark.read().format("tfrecord").load("examples/src/main/resources/users.tfrecord");
usersDF.select("name", "favorite_color").write().format("tfrecord").save("namesAndFavColors.tfrecord");

@yeshbash
Author

yeshbash commented Sep 4, 2021

Thanks for the example on loading. Yes, that helps read tfrecord files into a DataFrame. However, as you point out, I only need the automatic schema inference, since we already build a JavaRDD<Example>. My understanding is that Java code can call Scala code and vice versa, but I'm not familiar enough with Scala to tell whether TensorFlowInferSchema can be invoked that way.
I was hoping to get some help there.
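Assuming a TypeTag<Example> can be obtained from Java (the scala-reflect runtime universe makes this possible via TypeTag.apply(Mirror, TypeCreator)), the call site would look roughly like the sketch below. The typeTagOf helper is hypothetical, and the exact package of TensorFlowInferSchema depends on the spark-tfrecord version in use:

```java
// Hypothetical helper: builds a scala.reflect TypeTag from a Class via the
// runtime universe; see scala.reflect.api.TypeTags for the underlying API.
scala.reflect.api.TypeTags.TypeTag<Example> tag = typeTagOf(Example.class);

// The implicit evidence$1 parameter becomes an ordinary trailing argument
// when the Scala method is called from Java.
StructType schema = TensorFlowInferSchema.apply(exampleRdd.rdd(), tag);
```

This is a sketch under those assumptions, not a verified invocation; whether the inferred StructType matches what spark.read().format("tfrecord") would produce is exactly the safety question raised above.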

Thanks!
