tfrecord write results in no data but no error #46

Open
dennisobrien opened this issue Feb 20, 2022 · 2 comments

@dennisobrien

Hi -- I am trying to use spark-tfrecord with Spark 3.1.2, but the files written have no data.

  • Spark 3.1.2
  • Python 3.8.10
  • Java 1.8.0
  • Scala 2.12.10

I'm using the latest version available from the maven repo as:

<dependency>
    <groupId>com.linkedin.sparktfrecord</groupId>
    <artifactId>spark-tfrecord_2.12</artifactId>
    <version>0.3.4</version>
</dependency>

Following the pyspark example from the README but simplified further:

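from pyspark.sql.types import FloatType, IntegerType, StringType, StructField, StructType

# `spark` is an existing SparkSession (for example, the one the pyspark shell provides).
path = "/tmp/test-output.tfrecord"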

fields = [
    StructField("a", IntegerType()),
    StructField("b", FloatType()),
    StructField("c", StringType()),
]
schema = StructType(fields)
test_rows = [
    [1, 0.5, 'x'],
    [2, 1.5, 'y'],
    [3, 2.5, 'z'],
]
rdd = spark.sparkContext.parallelize(test_rows)
df = spark.createDataFrame(rdd, schema)
df.show()

Outputs:

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|0.5|  x|
|  2|1.5|  y|
|  3|2.5|  z|
+---+---+---+

Saving the spark dataframe to tfrecord does not throw an error.

path = "/tmp/test-output.tfrecord/"
df.write.mode("overwrite").format("tfrecord").option("recordType", "Example").save(path)

But the directory only has a _SUCCESS flag and a crc file, no data.

ls -la /tmp/test-output.tfrecord/
total 12
drwxr-xr-x.  2 build build 4096 Feb 19 19:00 .
drwxrwxrwx. 11 root  root  4096 Feb 19 19:00 ..
-rw-r--r--.  1 build build    0 Feb 19 19:00 _SUCCESS
-rw-r--r--.  1 build build    8 Feb 19 19:00 ._SUCCESS.crc
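
The same check can be done programmatically by looking for Spark's `part-*` data files in the output directory. This is only a minimal sketch in plain Python, assuming the output sits on a local filesystem; the helper names `list_part_files` and `has_data` are hypothetical and not part of spark-tfrecord.

```python
import os


def list_part_files(output_dir):
    """Return the sorted names of Spark data files (part-*) in output_dir.

    Marker and checksum files such as _SUCCESS and ._SUCCESS.crc are ignored.
    """
    return sorted(
        name
        for name in os.listdir(output_dir)
        if name.startswith("part-")
        and os.path.isfile(os.path.join(output_dir, name))
    )


def has_data(output_dir):
    """Return True if at least one part file exists and is non-empty."""
    return any(
        os.path.getsize(os.path.join(output_dir, name)) > 0
        for name in list_part_files(output_dir)
    )
```

For the directory listed above, `list_part_files` would return an empty list and `has_data` would return `False`, which matches a write that reported success without producing data.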

And of course, trying to read the file fails.

spark.read.format('tfrecord').option('recordType', 'Example').load(path).show()

Error:

AnalysisException: Unable to infer schema for TFRECORD. It must be specified manually.

Let me know if there is more system/config information that could help to debug this.

FWIW, I ran into the exact same situation when testing spark-tensorflow-connector, which I was building from source. I figured something was wrong with my dependencies and thought I would try this project instead.

thanks,
Dennis

@kpfoley

kpfoley commented Aug 12, 2024

I am also running into this same problem, with the same symptom: the write produces no data but raises no error. The write path only has _SUCCESS and ._SUCCESS.crc files. Everything works as expected on a GPU instance, but it fails to write data on a CPU instance.

Here are my details:

Spark: 3.5.0
Java: Zulu 8.78.0.19-CA-linux64
Python: 3.11.0
Scala: 2.12.18
tensorflow: 2.16.1
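
Since the behavior differs between two instances, it may help to capture the same environment details on both and compare them. The sketch below is one possible way to do that in plain Python; the function names and the choice of environment variables (`JAVA_HOME`, `SPARK_HOME`, `HADOOP_HOME`, `PYSPARK_PYTHON`) are assumptions for illustration, not something the thread identifies as the cause.

```python
import os
import platform
import sys


def collect_environment(
    env_vars=("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME", "PYSPARK_PYTHON"),
):
    """Collect basic interpreter, platform, and environment-variable details."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
    }
    for name in env_vars:
        # Record unset variables explicitly so they show up in a comparison.
        info[name] = os.environ.get(name, "<unset>")
    return info


def diff_environments(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Running `collect_environment()` on each instance and passing both results to `diff_environments` narrows the comparison to the settings that actually differ.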

@junshi15
Contributor

@kpfoley
I tried the code above with Spark 3.5.0 and spark-tfrecord_2.12:0.7.0. It worked fine on my MacBook Pro (part files were generated).

pyspark --packages com.linkedin.sparktfrecord:spark-tfrecord_2.12:0.7.0
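
When the session is created from a script rather than the pyspark shell, the same package can usually be requested through Spark's `spark.jars.packages` setting. The following is a sketch under that assumption; `maven_coordinate` and `build_session` are hypothetical helper names, and the setting only takes effect if it is applied before the first SparkSession is started.

```python
def maven_coordinate(group_id, artifact_id, version):
    """Join Maven components into the groupId:artifactId:version form."""
    return f"{group_id}:{artifact_id}:{version}"


def build_session(coordinate, app_name="tfrecord-check"):
    """Create a SparkSession that asks Spark to resolve the given package."""
    # Imported lazily so the coordinate helper works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", coordinate)
        .getOrCreate()
    )


# The coordinate used in the command above.
COORDINATE = maven_coordinate(
    "com.linkedin.sparktfrecord", "spark-tfrecord_2.12", "0.7.0"
)
```

Calling `build_session(COORDINATE)` should then give a session on which `format("tfrecord")` is available, provided the package can be downloaded.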
