
Spark reading a TFRecord file: garbled bytes when parsing tf.train.Feature(bytes_list=tf.train.BytesList(np.array(feature, dtype=np.float32).tobytes())) #48

Open
shaoshuaig opened this issue Apr 26, 2022 · 1 comment

Comments

@shaoshuaig

1. As shown in the screenshot below: when reading the TFRecord data directly with Spark, string-typed fields can be read fine, but fields holding numpy.array data, such as adc_traj, come out garbled.

[screenshot: Spark DataFrame output with a garbled adc_traj column]

2. The cause is that when Spark loads TFRecord data, it converts bytes-typed fields to str by default, and the character encoding used is unclear. adc_traj was stored with tf.train.Feature(bytes_list=tf.train.BytesList(np.array(feature, dtype=np.float32).tobytes())).

3. Parsing the data with np.frombuffer(str_y, dtype=np.float32) still raises an error: ValueError: buffer size must be a multiple of element size
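The error in step 3 is consistent with a lossy bytes-to-str conversion. A minimal stdlib sketch (using `struct` as a stand-in for numpy, and UTF-8 with replacement as an assumed example of a lossy default decoding) shows how such a round trip changes the byte count, so the buffer can no longer be viewed as float32:

```python
import struct

# Pack three float32 values into raw bytes, mimicking
# np.array(feature, dtype=np.float32).tobytes() from the report.
values = [1.0, 2.5, -3.75]
raw = struct.pack(f"<{len(values)}f", *values)
# 3 values * 4 bytes each = 12 bytes, a multiple of the element size.

# Correct round trip: np.frombuffer(raw, dtype=np.float32) would
# succeed; struct.unpack is the numpy-free equivalent.
decoded = struct.unpack(f"<{len(raw) // 4}f", raw)

# A lossy decoding (bytes -> str -> bytes) substitutes replacement
# characters for invalid byte sequences and changes the length, so
# np.frombuffer on the result raises
# "buffer size must be a multiple of element size" or returns garbage.
corrupted = raw.decode("utf-8", errors="replace").encode("utf-8")
print(len(corrupted) == len(raw))  # the byte count no longer matches
```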
@mizhou-in
Contributor

Hello, I think your issue is related to Spark's encoding; can you try the charset option? This Stack Overflow thread may be helpful to you: https://stackoverflow.com/questions/51957742/read-a-bytes-column-in-spark
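For reference, a minimal stdlib sketch (again with `struct` standing in for numpy) of why an ISO-8859-1-style charset is the lossless choice usually suggested in threads like the one above: every byte value 0-255 maps to exactly one code point, so the original buffer can be recovered exactly from the resulting string column.

```python
import struct

# The raw float32 bytes as they were written to the TFRecord.
raw = struct.pack("<3f", 1.0, 2.5, -3.75)

# ISO-8859-1 (latin-1) maps each byte 0-255 to one code point,
# so bytes -> str -> bytes is lossless, unlike UTF-8 with
# replacement characters.
as_str = raw.decode("iso-8859-1")      # what a str column would hold
recovered = as_str.encode("iso-8859-1")
assert recovered == raw

# np.frombuffer(recovered, dtype=np.float32) would now succeed;
# struct.unpack is used here as a numpy-free equivalent.
floats = struct.unpack(f"<{len(recovered) // 4}f", recovered)
```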
