Greenplum JSON Guide
Conceptual Overview
JavaScript Object Notation, better known as JSON, is a popular data encoding format. Twitter uses it to format the tweets delivered via its REST API, and Gnip uses it for most of its premium data sources. As any company trying to enable big science knows, extending the data you are already gathering with this kind of social data is extremely important. This guide explores three different ways to ingest JSON data into a Greenplum environment quickly and efficiently.
JSON Conversion
Since Greenplum does not natively support loading JSON data, the first option we have is to convert the data into a format Greenplum does support before loading it. Most modern scripting languages provide a library for handling JSON. Python, for example, introduced its built-in json module in version 2.6.
It is easy to create a simple script that loads JSON data and then outputs the data in a CSV format Greenplum can load directly. Assume the following JSON data:
{ "id": 1, "type": "home" }
{ "id": 2, "type": "fax" }
{ "id": 3, "type": "cell" }
The following simple Python script (written for Python 2) converts the above JSON data and outputs pipe-delimited data. Note that in practice you may want to further convert or escape the data before printing it:

import sys, json

for line in sys.stdin:
    try:
        j = json.loads(line)
    except ValueError:
        # skip lines that are not valid JSON
        continue
    vals = map(lambda x: str(j[x]), ['id', 'type'])
    print '|'.join(vals).encode('utf-8')
Run the above Python script from the command line to convert your data:
$ python jsonconvert.py < my.json
1|home
2|fax
3|cell
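As noted above, a production script should escape values that happen to contain the delimiter. A minimal Python 3 sketch of the same conversion, using the standard csv module to handle that escaping (the function name and input handling here are illustrative, not from the original article):

```python
import csv
import io
import json
import sys

def convert(lines, out):
    """Convert JSON lines to pipe-delimited output.

    csv.writer quotes any value that contains the delimiter,
    unlike the bare '|'.join in the script above.
    """
    writer = csv.writer(out, delimiter='|', quoting=csv.QUOTE_MINIMAL)
    for line in lines:
        try:
            j = json.loads(line)
        except ValueError:
            continue  # skip malformed lines, as in the original script
        writer.writerow([j['id'], j['type']])

if __name__ == '__main__':
    convert(sys.stdin, sys.stdout)
```

Run the same way as before: `python3 jsonconvert.py < my.json`.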
This delimited output can be easily loaded into Greenplum or PostgreSQL using a standard external table. The following is an example external table definition that points to the data file just created.
CREATE EXTERNAL TABLE json_data_ext (
id int,
type text
) LOCATION (
'gpfdist://localhost:8080/data/json_data.csv'
) FORMAT 'CSV' (
DELIMITER AS '|'
);
In Greenplum, an external table is simply a logical pointer to data that resides in an external source. You must execute a SQL INSERT statement to move the data to a real table inside the database.
INSERT INTO json_data SELECT * FROM json_data_ext;
Web External Table
Converting the data before loading it is simple and can be effective for one-time load jobs, but the process can be made even easier by triggering the data conversion at load time. For this we use Greenplum’s Web External Table feature.
A web external table is very similar to a regular external table, except that it can execute a script of our choosing whenever the table is queried.
CREATE EXTERNAL WEB TABLE json_data_web_ext (
id int,
type text
) EXECUTE 'parse_json.py' ON MASTER
FORMAT 'CSV' (
DELIMITER AS '|'
);
The above assumes that parse_json.py is executable and available in gpadmin’s PATH on the master. Whenever the external web table is accessed, the parse_json.py script is executed and the data it outputs is treated as if it came from a real table. From here we could load it into a real table as in the above example, filter it, or join it to any other data we have in the database.
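The article does not show parse_json.py itself. A minimal sketch of what such a script might look like, assuming the raw JSON sits at a known path on the master (the path and function name here are assumptions, not from the original):

```python
#!/usr/bin/env python3
import json

# Assumption: location of the raw JSON file on the master host.
DATA_PATH = '/data/my.json'

def rows(path):
    """Yield pipe-delimited rows from a file of JSON lines."""
    with open(path) as f:
        for line in f:
            try:
                j = json.loads(line)
            except ValueError:
                continue  # skip lines that are not valid JSON
            yield '%s|%s' % (j['id'], j['type'])

if __name__ == '__main__':
    # gpfdist treats whatever we print on stdout as the table's data.
    for row in rows(DATA_PATH):
        print(row)
```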
SELECT * FROM json_data_web_ext;
 id | type
----+------
  1 | home
  2 | fax
  3 | cell
(3 rows)
Custom Data Formatter
For optimal load speed and flexibility, we can use a Greenplum custom data formatter to load or unload JSON data almost natively. Custom data formatters are user-defined functions in the database; when one is specified in an external table definition, Greenplum calls it at load time to parse the external data.
Some custom data formatters (CDFs) are distributed with Greenplum, but the mechanism is extensible, and new formatters can be created and distributed by the open source community. The Greenplum JSON Formatter is one such available library.
After the CDF is built and installed on your cluster, it can be declared directly in your external table without the need to convert your data first, or to rely on an external script to do the conversion at load time. For example, the sample data we used above could be loaded with the following External Table using a CDF:
CREATE EXTERNAL TABLE json_data_ext (
id int,
type text
) LOCATION (
'gpfdist://localhost:8080/data/sample.json'
) FORMAT 'custom' (
formatter=json_formatter_read
);
It is even capable of handling more complicated JSON structures, such as nested objects, directly. Here is an example of JSON objects which each contain an ‘address’ sub-object:
{ "id": 1, "type": "home", "address": { "city": "Boise", "state": "Idaho" } }
{ "id": 2, "type": "fax", "address": { "city": "San Francisco", "state": "California" } }
{ "id": 3, "type": "cell", "address": { "city": "Chicago", "state": "Illinois" } }
We simply use a ‘.’ in the column name specifier of our external table to indicate a sub-object:
CREATE EXTERNAL TABLE json_data_ext (
id int,
type text,
"address.city" text,
"address.state" text
) LOCATION (
'gpfdist://localhost:8081/data/sample.json'
) FORMAT 'custom' (
formatter=json_formatter_read
);
Now when we select from the external table we see the expected output:
select * from json_data_ext;
id | type | address.city | address.state
----+------+---------------+---------------
1 | home | Boise | Idaho
2 | fax | San Francisco | California
3 | cell | Chicago | Illinois
(3 rows)
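The dotted-column mapping the formatter applies can be approximated in plain Python. This is an illustrative sketch of the idea, not the formatter's actual implementation:

```python
import json

def flatten(obj, prefix=''):
    """Flatten nested dicts into 'parent.child' keys, mirroring the
    dotted column names used in the external table definition."""
    flat = {}
    for key, value in obj.items():
        name = key if not prefix else prefix + '.' + key
        if isinstance(value, dict):
            # Recurse into sub-objects, carrying the dotted prefix along.
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

line = '{"id": 1, "type": "home", "address": {"city": "Boise", "state": "Idaho"}}'
print(flatten(json.loads(line)))
```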
A custom data formatter also allows us to write data from the database to an external source in JSON format. To do this we simply declare our external table as writable when we create it. Assume we have the above data loaded into a ‘phone’ table in our database:
select * from phone;
id | type | address.city | address.state
----+------+---------------+---------------
1 | home | Boise | Idaho
2 | fax | San Francisco | California
3 | cell | Chicago | Illinois
(3 rows)
Now we can declare a Writable External Table like so:
CREATE WRITABLE EXTERNAL TABLE phone_writable_ext (
id int,
type text,
"address.city" text,
"address.state" text
) LOCATION (
'gpfdist://localhost:8081/out/phone.json'
) FORMAT 'custom' (
formatter=json_formatter_write
);
Now we can use a SQL INSERT statement to put our desired data into the writable external table:
insert into phone_writable_ext select * from phone;
INSERT 0 3
When we look at the specified data file we can see that our database data was indeed written out in JSON format. In this way we can feed data from the Greenplum database into another data system which can natively read JSON data.
$ cat phone.json
{"id": 1, "type": "home", "address": {"city": "Boise", "state": "Idaho"}}
{"id": 2, "type": "fax", "address": {"city": "San Francisco", "state": "California"}}
{"id": 3, "type": "cell", "address": {"city": "Chicago", "state": "Illinois"}}
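Going the other way, the write formatter's behavior can be sketched as rebuilding nested objects from the dotted column names. Again, this is an illustrative approximation in Python, not the formatter's actual code:

```python
import json

def unflatten(row):
    """Rebuild nested dicts from 'parent.child' keys, approximating
    what the write formatter does when emitting JSON."""
    nested = {}
    for name, value in row.items():
        parts = name.split('.')
        target = nested
        for part in parts[:-1]:
            # Walk down, creating intermediate objects as needed.
            target = target.setdefault(part, {})
        target[parts[-1]] = value
    return nested

row = {'id': 1, 'type': 'home', 'address.city': 'Boise', 'address.state': 'Idaho'}
print(json.dumps(unflatten(row)))
```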
Conclusion
New data storage formats are frequently being designed and developed. Greenplum allows developers to create Custom Data Formatters to read or write data in the storage format of their choosing. One example is the Greenplum JSON Formatter, which makes it easy to work with JSON data in the database without relying on custom data conversion scripts.