Enclosed InputFormats do not work #83

Open
doublebyte1 opened this issue May 17, 2019 · 6 comments

@doublebyte1

doublebyte1 commented May 17, 2019

I am following the instructions in this tutorial, and I am able to create a table using the
UnenclosedEsriJsonInputFormat.

However, I would like to use the enclosed format.

I have tried these two serdes:

      CREATE TABLE taxi_agg(area BINARY, count DOUBLE)
      ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
      STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

      CREATE TABLE taxi_agg(area BINARY, count DOUBLE)
      ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.GeoJsonSerDe'
      STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedGeoJsonInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Although I am able to create the table and insert data, a select always returns an empty result:
select ST_AsGeoJSON(area), count from taxi_agg;
Changing EnclosedEsriJsonInputFormat to UnenclosedEsriJsonInputFormat, or EnclosedGeoJsonInputFormat to UnenclosedGeoJsonInputFormat, gives correct results.

Not sure if I am doing something wrong, or if there is a problem with the Enclosed Serde.

Version: 2.0.0

@randallwhitman
Contributor

Thanks for reporting this. I assume "version 2.0.0" refers to Spatial Framework for Hadoop. Please let us know the versions of Hive and Hadoop.

@doublebyte1
Author

@randallwhitman Hadoop 2.8.5, Hive 2.3.4

@randallwhitman
Contributor

Thanks for the details. We do not have Hive-2.3.4 or Hadoop-2.8.5 installed, and unfortunately the testing framework does not yet make it easy to paste a sample query into a test (see Esri/spatial-framework-for-hadoop#163). Maybe it will reproduce with another version of Hive or with SparkSql.

@doublebyte1
Author

I can confirm that both issues reproduce on Hadoop 2.8.3 and Hive 2.3.2.

@randallwhitman randallwhitman changed the title from "Enclosed SerDe does not work" to "Enclosed InputFormats do not work" on May 31, 2019
@randallwhitman
Contributor

randallwhitman commented Jun 29, 2019

I took a look at reading Enclosed Esri JSON, using 15 points from the JSON-MR mini-sample, and Hive-2.3.5 read the table data OK.

create external table test15eej(rowid int, shape binary)
row format serde 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
stored as inputformat 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://hdfs:8020/path/to/test15_eej';

hive> select rowid, ST_AsText(shape) from test15eej;
1505    POINT (15 5)
535     POINT (5 35)
2323    POINT (23 23)
3222    POINT (32 22)
3728    POINT (37 28)
2233    POINT (22 33)
2838    POINT (28 38)
3434    POINT (34 34)
6219    POINT (62 19)
7114    POINT (71 14)
7525    POINT (75 25)
6535    POINT (65 35)
5549    POINT (55 49)
6545    POINT (65 45)
4566    POINT (45 66)

I guess that tests only reading, not writing.

@randallwhitman
Contributor

randallwhitman commented Nov 16, 2020

Finally reproduced the reported issue.

create external table test15eej(rowid int, shape binary)
row format serde 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
stored as inputformat 'com.esri.json.hadoop.EnclosedEsriJsonInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'file:///tmp/test15eej';

hive> select rowid, ST_AsText(shape) from write15eej;
OK
Time taken: 0.154 seconds

The output file was in fact unenclosed, as shown by cat /tmp/write15eej/000000_0:

{"attributes":{"rowid":1505},"geometry":{"x":15,"y":5}}
{"attributes":{"rowid":535},"geometry":{"x":5,"y":35}}
{"attributes":{"rowid":2323},"geometry":{"x":23,"y":23}}
{"attributes":{"rowid":3222},"geometry":{"x":32,"y":22}}
{"attributes":{"rowid":3728},"geometry":{"x":37,"y":28}}
{"attributes":{"rowid":2233},"geometry":{"x":22,"y":33}}
{"attributes":{"rowid":2838},"geometry":{"x":28,"y":38}}
{"attributes":{"rowid":3434},"geometry":{"x":34,"y":34}}
{"attributes":{"rowid":6219},"geometry":{"x":62,"y":19}}
{"attributes":{"rowid":7114},"geometry":{"x":71,"y":14}}
{"attributes":{"rowid":7525},"geometry":{"x":75,"y":25}}
{"attributes":{"rowid":6535},"geometry":{"x":65,"y":35}}
{"attributes":{"rowid":5549},"geometry":{"x":55,"y":49}}
{"attributes":{"rowid":6545},"geometry":{"x":65,"y":45}}
{"attributes":{"rowid":4566},"geometry":{"x":45,"y":66}}

create external table alt15uej(rowid int, shape binary)
row format serde 'com.esri.hadoop.hive.serde.EsriJsonSerDe'
stored as inputformat 'com.esri.json.hadoop.UnenclosedEsriJsonInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'file:///tmp/write15eej';

hive> select rowid, ST_AsText(shape) from alt15uej limit 2;
OK
1505    POINT (15 5)
535     POINT (5 35)
Time taken: 0.146 seconds, Fetched: 2 row(s)
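
For anyone checking their own table directory, here is a minimal Python sketch (not part of Spatial Framework for Hadoop) that reports which layout a file's text has. It assumes that "enclosed" means a single JSON document with a top-level "features" array and that "unenclosed" means one feature object per line, as in the output file above. detect_layout is a hypothetical helper name.

```python
import json

def detect_layout(text):
    """Return 'enclosed', 'unenclosed', or 'unknown' for the given JSON text.

    Assumption: 'enclosed' is one JSON document with a top-level "features"
    array; 'unenclosed' is one feature object per non-empty line.
    """
    # First try to read the whole text as a single JSON document.
    try:
        doc = json.loads(text)
    except ValueError:
        doc = None
    if isinstance(doc, dict) and isinstance(doc.get("features"), list):
        return "enclosed"

    # Otherwise try to read each non-empty line as its own feature object.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    try:
        records = [json.loads(ln) for ln in lines]
    except ValueError:
        return "unknown"
    if records and all(isinstance(r, dict) and "geometry" in r for r in records):
        return "unenclosed"
    return "unknown"

# Two records copied from the output file shown above.
sample = (
    '{"attributes":{"rowid":1505},"geometry":{"x":15,"y":5}}\n'
    '{"attributes":{"rowid":535},"geometry":{"x":5,"y":35}}\n'
)
print(detect_layout(sample))  # prints "unenclosed"
```

If this reports "unenclosed" for files under a table declared with EnclosedEsriJsonInputFormat, that mismatch would be consistent with the empty select result seen here.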

With larger data, the output would be expected to span multiple files. In that case, it's not clear how the output as a whole could be enclosed at all. Maybe each file of the collection could be written in Enclosed format?
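
The "each file enclosed" idea can be tried outside Hive by post-processing each output file on its own. The following is a rough sketch only, under the assumption that the Enclosed reader needs nothing more than the records wrapped in a top-level "features" array; this has not been verified against the reader, and enclose is a hypothetical helper, not a library function.

```python
import json

def enclose(unenclosed_text):
    """Wrap line-delimited feature records in one {"features": [...]} document."""
    # Parse one feature per non-empty line (assumed layout of the Hive output file).
    features = [json.loads(line) for line in unenclosed_text.splitlines() if line.strip()]
    # Emit a single compact JSON document with a top-level "features" array.
    return json.dumps({"features": features}, separators=(",", ":"))

# Two records copied from the output file shown earlier in this comment.
part = (
    '{"attributes":{"rowid":1505},"geometry":{"x":15,"y":5}}\n'
    '{"attributes":{"rowid":535},"geometry":{"x":5,"y":35}}\n'
)
print(enclose(part))
```

Applied to every part file separately, this would leave a directory of independently enclosed documents. Whether the Enclosed InputFormats read such a directory correctly is the open question raised above.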
