Friday, September 16, 2016

Messing with the data schema to make it work with Drill (without using Drill's additional functions.)

I must warn that this is not practical - you may not have the access or capacity to modify the schema of the data you want to query in the first place. Unless the data bulk was taken as a dump and queried a million times, there is no performance benefit in doing the below attempt. But for research purposes, why not? :)

I had this multi-dimensional array in my data that was impossible to query with Drill due to its complex data schema. Before the readers pointing me out that it is indeed possible to query complex arrays, what I mean is, it is impossible to query with the same performance level, as we need to use flatten function which ruins the performance and output format. On the other hand, knowing the exact indices of arrays is impractical too.

I have this multi-dimensional array that makes it impossible for me to proceed:
"coordinates":[[ [.. , ..] , [.. , ..] , [.. , ..] ]]

Error message was:
Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST

Fragment 0:0

[Error Id: 5cc520ff-9594-4b9b-998d-20bf8569981b on llovizna:31010] (state=,code=0)


I had to transition the above into the below structure to make it work without any Drill functionality such as flatten.

"coordinates" : [ { "x" : .., "y" : .. }, { "x" : .., "y" : .. }, { "x" : .., "y" : .. } ]

0: jdbc:drill:zk=local> create table dfs.tmp.camic as select * from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/head2.json`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (1,392 seconds)



What I did essentially was to make the multi-dimensional array into a map.

0: jdbc:drill:zk=local> select * from dfs.tmp.camic;
+-----+------+-----------+---------+---------------+-------------+---+---+------------+------+----------+-----------+------------+------------+-------------+
| _id | type | parent_id | randval | creation_date | object_type | x | y | normalized | bbox | geometry | footprint | properties | provenance | submit_date |
+-----+------+-----------+---------+---------------+-------------+---+---+------------+------+----------+-----------+------------+------------+-------------+
| {"$oid":"56a784647b7b51c562"} | Feature | self | 0.3712421875 | 2026-11-16 01:17:13.101 | nucleus | 0.049646965 | 0.435353796 | true | [0.042729646965,0.85353796,0.8105608,0.7145562075] | {"type":"Polygon","coordinates":[{"x":0.04795445442,"y":0.87641187789917},{"x":0.0427805229,"y":0.87187789917}]} | 17.0 | {"scalar_features":[{"ns":"http://u24.bi.rk.eu/v1","nv":[{"name":"Hty","value":242.50489875},{"name":"ty","value":25.0},{"name":"Hty","value":-12.11},{"name":"ee","value":2.11}]}]} | {"image":{"case_id":"TC-2-00-01-01-T2","subject_id":"TC-02-000"},"analysis":{"execution_id":"ta-test","study_id":"tdma:::tue-jan-6-19:17:13-est-2011","source":"computer","computation":"segmentation"},"data_loader":"1.3"} | 2016-01-16 01:17:13.102 |
+-----+------+-----------+---------+---------------+-------------+---+---+------------+------+----------+-----------+------------+------------+-------------+
1 row selected (1,31 seconds)

Now, this works. :P

No comments:

Post a Comment

You are welcome to provide your opinions in the comments. Spam comments and comments with random links will be deleted.