What's the best Hive JSON SerDe

JSON SerDe libraries

In Athena, you can use two SerDe libraries to deserialize JSON data. Deserialization converts the JSON data so that it can be serialized (written out) into a different format such as Parquet or ORC.

SerDe names

Hive-JsonSerDe

Openx-JsonSerDe

Library names

Use one of the following:

org.apache.hive.hcatalog.data.JsonSerDe

org.openx.data.jsonserde.JsonSerDe

Hive JSON SerDe

The Hive JSON SerDe is commonly used to process JSON data such as events. These events are represented as single-line strings of JSON-encoded text, separated by new lines. The Hive JSON SerDe does not allow duplicate keys in map or struct key names.

The following sample DDL statement uses the Hive JSON SerDe to create a table based on sample online advertising data. In the LOCATION clause, replace the Region placeholder in the Amazon S3 path with the identifier of the Region in which you run Athena.
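As a minimal sketch of such a statement: the table name `ad_impressions`, the column list, and the bucket `amzn-s3-demo-bucket` are illustrative assumptions, not the sample data set's actual schema or path, which this text does not show.

```sql
-- Hypothetical table, columns, and S3 path; adjust all three to your own data.
CREATE EXTERNAL TABLE ad_impressions (
    requestbegintime string,
    adid string,
    impressionid string,
    referrer string,
    useragent string,
    ip string
)
PARTITIONED BY (dt string)
-- The Hive JSON SerDe library named earlier in this topic.
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://amzn-s3-demo-bucket/impressions/';
```

Each object under the LOCATION prefix is expected to hold one JSON document per line, with keys that match the column names.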

After you create the table, run MSCK REPAIR TABLE to load the table's partitions so that Athena can query it:
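For a partitioned table with the hypothetical name `ad_impressions` (an assumption for illustration), the statement might look like this:

```sql
-- Scans the table's LOCATION for Hive-style partition folders (for example, dt=.../)
-- and registers them in the catalog. The table name is hypothetical.
MSCK REPAIR TABLE ad_impressions;
```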

OpenX JSON SerDe

In addition to the property that defines the columns in the table, the OpenX JSON SerDe has the following optional properties, which can be useful for handling inconsistencies in data.

ignore.malformed.json

Optional. When set to TRUE, lets you skip malformed JSON syntax. The default is FALSE.
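A minimal sketch of setting this property follows. The table name, columns, and S3 path are hypothetical and chosen only for illustration.

```sql
-- Hypothetical table and path. Rows that are not valid JSON are skipped
-- instead of failing the query.
CREATE EXTERNAL TABLE events_lenient (
    id string,
    payload string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://amzn-s3-demo-bucket/events/';
```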

dots.in.keys

Optional. The default is FALSE. When set to TRUE, the SerDe replaces the periods in key names with underscores. For example, if the JSON dataset has a key name that contains a period, you can use this property to define the corresponding column name in Athena with an underscore in place of the period. By default (without this SerDe), Athena does not allow periods in column names.
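As an illustration, assume a JSON key named `user.id` (a hypothetical key, like the table name and S3 path below). With the property enabled, the column can be declared as `user_id`:

```sql
-- Hypothetical input line: {"user.id": "u-1", "event": "click"}
CREATE EXTERNAL TABLE dotted_keys (
    user_id string,   -- populated from the JSON key "user.id"
    event string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('dots.in.keys' = 'true')
LOCATION 's3://amzn-s3-demo-bucket/dotted/';
```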

case.insensitive

Optional. The default is TRUE. When set to TRUE, the SerDe converts all uppercase letters in column names to lowercase.

To use case-sensitive key names in your data, set case.insensitive to FALSE. Then, for each key that is not already entirely lowercase, specify a mapping from the column name to the property name using the following syntax:
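A sketch of that syntax inside a complete statement, assuming a hypothetical JSON key `userId` that should surface as the column `userid` (table name and S3 path are also assumptions):

```sql
-- Hypothetical input line: {"userId": "u-1", "action": "login"}
CREATE EXTERNAL TABLE case_sensitive_keys (
    userid string,
    action string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
    'case.insensitive' = 'false',
    -- mapping.<column name> = <JSON key as it appears in the data>
    'mapping.userid' = 'userId'
)
LOCATION 's3://amzn-s3-demo-bucket/logins/';
```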

If you have two keys that differ only in letter case, so that they are identical when lowercased, an error like the following can occur:

To fix this, set the case.insensitive property to FALSE and map the keys to different names, as shown in the following example:
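One way this could look, assuming the two colliding keys are `URL` and `Url` (hypothetical keys, as are the column names, table name, and S3 path):

```sql
-- Hypothetical input line: {"URL": "https://example.com/a", "Url": "https://example.com/b"}
CREATE EXTERNAL TABLE mixed_case_keys (
    url1 string,
    url2 string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
    'case.insensitive' = 'false',
    'mapping.url1' = 'URL',   -- column url1 reads the key "URL"
    'mapping.url2' = 'Url'    -- column url2 reads the key "Url"
)
LOCATION 's3://amzn-s3-demo-bucket/urls/';
```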

mapping

Optional. Maps column names to JSON keys that are not identical to the column names. The mapping parameter is useful when the JSON data contains keys that are keywords. For example, if a JSON key name is a keyword, use the following syntax to map the key to a column with a different name:
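For instance, assuming a JSON key named `timestamp` that should be exposed as a column named `ts` (both names, the table name, and the S3 path are illustrative assumptions):

```sql
-- Hypothetical input line: {"timestamp": "2024-01-01T00:00:00Z", "message": "ok"}
CREATE EXTERNAL TABLE keyword_keys (
    ts string,        -- populated from the JSON key "timestamp"
    message string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('mapping.ts' = 'timestamp')
LOCATION 's3://amzn-s3-demo-bucket/messages/';
```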

Like the Hive JSON SerDe, the OpenX JSON SerDe does not allow duplicate keys in map or struct key names.

The following DDL statement uses the OpenX JSON SerDe to create a table based on the same sample online advertising data used in the Hive JSON SerDe sample. In the LOCATION clause, replace the Region placeholder with the identifier of the Region in which you run Athena.
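A minimal sketch of such a statement follows. The table name, column list, and bucket are illustrative assumptions rather than the sample data set's actual schema or path.

```sql
-- Hypothetical table, columns, and S3 path; only the SerDe class differs
-- from an equivalent Hive JSON SerDe table definition.
CREATE EXTERNAL TABLE ad_impressions_openx (
    requestbegintime string,
    adid string,
    impressionid string,
    referrer string,
    useragent string,
    ip string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://amzn-s3-demo-bucket/impressions/';
```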

Example: Deserializing nested JSON data

You can use the JSON SerDes to parse more complex JSON-encoded data. This requires CREATE TABLE statements that use struct and array elements to represent nested structures.

The following example creates an Athena table from JSON data with nested structures. To parse JSON-encoded data in Athena, make sure that each JSON document is on its own line, separated by a new line.

This example assumes JSON-encoded data with the following structure:

The following CREATE TABLE statement uses the Openx-JsonSerDe with the struct and array collection data types to represent the nested collections. Each JSON document is on its own line, and documents are separated from one another by a new line. To avoid errors, the data being queried does not contain duplicate keys in struct or map key names.
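As a sketch under stated assumptions, suppose each line of the data is an order document with a nested customer object and an array of line items. The document shape, table name, column names, and S3 path are all hypothetical.

```sql
-- Hypothetical input (one JSON document per line):
-- {"order_id": "o-1", "customer": {"name": "Ana", "city": "Lisbon"}, "items": [{"sku": "A1", "qty": 2}]}
CREATE EXTERNAL TABLE nested_orders (
    order_id string,
    customer struct<name:string, city:string>,   -- nested JSON object
    items array<struct<sku:string, qty:int>>     -- JSON array of objects
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://amzn-s3-demo-bucket/orders/';
```

Nested fields can then be read with dot notation, and array elements with a 1-based subscript:

```sql
-- Reads a struct field and the first element of the array.
SELECT order_id, customer.name, items[1].sku
FROM nested_orders;
```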

More resources

For more information on working with JSON and nested JSON in Athena, see the following resources:
