Python object serialization and deserialization is a crucial aspect of any non-trivial program. If you save something to a file in Python, if you read a configuration file, or if you respond to an HTTP request, you do object serialization and deserialization.
In one sense, serialization and deserialization are the most boring things in the world. Who cares about all the formats and protocols? You want to persist or stream some Python objects and get them back later intact.
This is a healthy way to look at the world at the conceptual level. But, at the pragmatic level, the serialization scheme, format, or protocol you choose may determine how fast your program runs, how secure it is, how much freedom you have to maintain your state, and how well you're going to interoperate with other systems.
There are so many options because different circumstances call for different solutions. There is no "one size fits all." In this two-part tutorial, I'll:
- go over the pros and cons of the most successful serialization and deserialization schemes
- show how to use them
- provide guidelines for choosing between them when faced with a specific use case
Running Example
We will serialize and deserialize the same Python object graphs using different serializers in the following sections. To avoid repetition, let's define these object graphs here.
Simple Object Graph
The simple object graph is a dictionary that contains a list of integers, a string, a float, a boolean, and a None.
simple = dict(int_list=[1, 2, 3],
              text='string',
              number=3.44,
              boolean=True,
              none=None)
Complex Object Graph
The complex object graph is also a dictionary, but it contains a datetime object and a user-defined class instance that has a self.simple attribute, which is set to the simple object graph.
from datetime import datetime

class A(object):
    def __init__(self, simple):
        self.simple = simple

    def __eq__(self, other):
        if not hasattr(other, 'simple'):
            return False
        return self.simple == other.simple

    def __ne__(self, other):
        if not hasattr(other, 'simple'):
            return True
        return self.simple != other.simple

complex = dict(a=A(simple), when=datetime(2016, 3, 7))
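The A class defines __eq__ so that a reconstructed copy can be compared with the original by value rather than by identity. A minimal sketch of that behavior, using a trimmed copy of A and a hypothetical Plain class added only for contrast (Plain is not part of the running example):

```python
class A(object):
    """Trimmed copy of the A class above: equality is based on `simple`."""
    def __init__(self, simple):
        self.simple = simple

    def __eq__(self, other):
        if not hasattr(other, 'simple'):
            return False
        return self.simple == other.simple


class Plain(object):
    """Hypothetical contrast class without __eq__ (compares by identity)."""
    def __init__(self, simple):
        self.simple = simple


data = dict(int_list=[1, 2, 3], text='string')

# Two distinct instances holding equal content.
same_value = A(data) == A(dict(data))
# Without __eq__, Python falls back to identity comparison.
plain_same = Plain(data) == Plain(dict(data))
print(same_value, plain_same)
```

This is why the later checks of the form deserialized == complex are meaningful for the A instance.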
Pickle
Pickle is a native Python object serialization format. The pickle interface provides four functions: dump(), dumps(), load(), and loads().
- The dump() function serializes to an open file (file-like object).
- The dumps() function serializes to a bytes object.
- The load() function deserializes from an open file-like object.
- The loads() function deserializes from a bytes object.
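The four functions pair up naturally. The following is a minimal sketch, assuming Python 3, that round-trips a restated copy of the simple object graph both ways; io.BytesIO is used here only as a stand-in for a file opened in binary mode.

```python
import io
import pickle

simple = dict(int_list=[1, 2, 3], text='string', number=3.44,
              boolean=True, none=None)

# dumps()/loads(): round trip through an in-memory bytes object.
data = pickle.dumps(simple)
from_bytes = pickle.loads(data)

# dump()/load(): round trip through a file-like object.
buffer = io.BytesIO()
pickle.dump(simple, buffer)
buffer.seek(0)  # rewind before reading back
from_file = pickle.load(buffer)

print(type(data).__name__, from_bytes == simple, from_file == simple)
```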
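Pickle supports several protocols. Protocol 0 is the original textual protocol, which is human-readable; the higher-numbered protocols are binary, which is more efficient but not human-readable (less helpful when debugging). In Python 3, the default protocol is a binary one.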
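To see the difference between the protocols concretely, here is a small sketch, assuming Python 3 and a restated copy of the simple object graph. It prints the protocol numbers your interpreter uses, which vary between Python versions.

```python
import pickle

simple = dict(int_list=[1, 2, 3], text='string', number=3.44,
              boolean=True, none=None)

textual = pickle.dumps(simple, protocol=0)
binary = pickle.dumps(simple, protocol=pickle.HIGHEST_PROTOCOL)

print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
# For this data, the protocol 0 pickle is plain ASCII and can be printed as text.
print(textual.decode('ascii'))
# Protocols 2 and up start with the PROTO opcode b'\x80' and the protocol number.
print(binary[:2])
```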
Here is how you pickle a Python object graph to a bytes object, first with the default protocol and then with the highest available one.
import pickle

print(pickle.dumps(simple))
print(pickle.dumps(simple, protocol=pickle.HIGHEST_PROTOCOL))
The result will be:
b'\x80\x04\x95O\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x08int_list\x94]\x94(K\x01K\x02K\x03e\x8c\x04text\x94\x8c\x06string\x94\x8c\x06number\x94G@\x0b\x85\x1e\xb8Q\xeb\x85\x8c\x07boolean\x94\x88\x8c\x04none\x94Nu.'
b'\x80\x05\x95O\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x08int_list\x94]\x94(K\x01K\x02K\x03e\x8c\x04text\x94\x8c\x06string\x94\x8c\x06number\x94G@\x0b\x85\x1e\xb8Q\xeb\x85\x8c\x07boolean\x94\x88\x8c\x04none\x94Nu.'
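Both results here are binary: under Python 3.8, the default is protocol 4 (the leading \x80\x04) and the highest is protocol 5 (\x80\x05). A binary pickle may look large when printed, but this is an illusion due to its escaped presentation. In the run recorded below, the file written with the textual protocol is 130 bytes, while the binary one is only 85 bytes (exact sizes depend on the Python version and protocol).
First, we dump the object to two files, one with the textual protocol and one with a binary protocol.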
with open('simple1.pkl', 'wb') as f:
    pickle.dump(simple, f, protocol=0)  # textual protocol
with open('simple2.pkl', 'wb') as f:
    pickle.dump(simple, f, protocol=pickle.HIGHEST_PROTOCOL)  # binary protocol
Then, let's examine the file sizes:
ls -la sim*.*

-rw-r--r-- 1 gigi staff 130 Mar 9 02:42 simple1.pkl
-rw-r--r-- 1 gigi staff 85 Mar 9 02:43 simple2.pkl
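File sizes like these depend on the Python version and the protocol, so it is worth measuring on your own interpreter rather than assuming. A minimal sketch, assuming Python 3 and a restated copy of the simple object graph, that reports the pickled size for every protocol available:

```python
import pickle

simple = dict(int_list=[1, 2, 3], text='string', number=3.44,
              boolean=True, none=None)

# Map each supported protocol number to the size of the resulting pickle.
sizes = {protocol: len(pickle.dumps(simple, protocol=protocol))
         for protocol in range(pickle.HIGHEST_PROTOCOL + 1)}

for protocol, size in sizes.items():
    print('protocol', protocol, '->', size, 'bytes')
```

For an object graph this small, the differences between protocols can be modest; they tend to matter more as the data grows.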
Unpickling from a bytes object is as simple as:
# textual protocol 0
x = pickle.loads(b"(dp1\nS'text'\np2\nS'string'\np3\nsS'none'\np4\nNsS'boolean'\np5\nI01\nsS'number'\np6\nF3.4399999999999999\nsS'int_list'\np7\n(lp8\nI1\naI2\naI3\nas.")
assert x == simple

# binary protocol 2
x = pickle.loads(b'\x80\x02}q\x01(U\x04textq\x02U\x06stringq\x03U\x04noneq\x04NU\x07boolean\x88U\x06numberq\x05G@\x0b\x85\x1e\xb8Q\xeb\x85U\x08int_list]q\x06(K\x01K\x02K\x03eu.')
assert x == simple
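Neither loads() call above names a protocol. A minimal sketch, assuming Python 3 and a restated copy of the simple object graph, that round-trips through every protocol the interpreter supports without ever telling loads() which one was used:

```python
import pickle

simple = dict(int_list=[1, 2, 3], text='string', number=3.44,
              boolean=True, none=None)

results = []
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    data = pickle.dumps(simple, protocol=protocol)
    # loads() has no protocol argument; the format is read from the data itself.
    results.append(pickle.loads(data) == simple)

print(all(results))
```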
Note that pickle can figure out the protocol automatically. There is no need to specify a protocol, even for the binary one. Unpickling from a file is just as easy. You just need to provide an open file.
with open('simple1.pkl', 'rb') as f:
    x = pickle.load(f)
assert x == simple

with open('simple2.pkl', 'rb') as f:
    x = pickle.load(f)
assert x == simple
In Python 3, pickle files must be opened in binary mode ('rb'), even when they were written with the textual protocol, because pickle always reads and writes bytes. Let's see how pickle deals with the complex object graph.
pickle.dumps(complex, protocol=0)

# output (recorded under Python 2; Python 3 returns a bytes object and the details differ):
"(dp1\nS'a'\nccopy_reg\n_reconstructor\np2\n(c__main__\nA\np3\nc__builtin__\nobject\np4\nNtRp5\n(dp6\nS'simple'\np7\n(dp8\nS'text'\np9\nS'string'\np10\nsS'none'\np11\nNsS'boolean'\np12\nI01\nsS'number'\np13\nF3.4399999999999999\nsS'int_list'\np14\n(lp15\nI1\naI2\naI3\nassbsS'when'\np16\ncdatetime\ndatetime\np17\n(S'\\x07\\xe0\\x03\\x07\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp18\ns."

pickle.dumps(complex, protocol=pickle.HIGHEST_PROTOCOL)

# output (also recorded under Python 2, where the highest protocol was 2):
'\x80\x02}q\x01(U\x01ac__main__\nA\nq\x02)\x81q\x03}q\x04U\x06simpleq\x05}q\x06(U\x04textq\x07U\x06stringq\x08U\x04noneq\tNU\x07boolean\x88U\x06numberq\nG@\x0b\x85\x1e\xb8Q\xeb\x85U\x08int_list]q\x0b(K\x01K\x02K\x03eusbU\x04whenq\x0ccdatetime\ndatetime\nq\rU\n\x07\xe0\x03\x07\x00\x00\x00\x00\x00\x00\x85Rq\x0eu.'
If we dump this complex object to a file with the textual and binary protocols:
with open('complex1.pkl', 'wb') as f:
    pickle.dump(complex, f, protocol=0)  # textual protocol
with open('complex2.pkl', 'wb') as f:
    pickle.dump(complex, f, protocol=pickle.HIGHEST_PROTOCOL)  # binary protocol
And compare their sizes:
ls -la comp*.*

-rw-r--r-- 1 gigi staff 327 Mar 9 02:58 complex1.pkl
-rw-r--r-- 1 gigi staff 171 Mar 9 02:58 complex2.pkl
We can see that the efficiency of the binary protocol is even greater with complex object graphs.
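Size aside, the main attraction of pickle for richer graphs is that Python types and shared references survive the round trip. A minimal sketch, assuming Python 3; it uses a hypothetical stand-in graph built from standard types (a datetime, a tuple, and a list referenced twice) rather than the complex object graph with the A class:

```python
import pickle
from datetime import datetime

shared = [1, 2, 3]
graph = dict(when=datetime(2016, 3, 7),
             point=(1, 2),
             first=shared,
             second=shared)  # same list object referenced twice

restored = pickle.loads(pickle.dumps(graph))

print(restored == graph)
# The datetime and the tuple come back as their original types.
print(type(restored['when']).__name__, type(restored['point']).__name__)
# The shared list is still a single object after unpickling.
print(restored['first'] is restored['second'])
```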
JSON
JSON (JavaScript Object Notation) has been part of the Python standard library since Python 2.6. I'll consider it a native format at this point. It is a text-based format and is the unofficial king of the web as far as object serialization goes. Its type system naturally models JavaScript, so it is pretty limited.
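To make "pretty limited" concrete, here is a small sketch of what happens to Python types that have no JSON counterpart. The dictionaries are made up for illustration and are not part of the running example.

```python
import json

# Integer keys become strings and tuples become lists on the way through JSON.
original = {1: (1, 2), 'nested': {'ok': True}}
restored = json.loads(json.dumps(original))
print(restored)

# Some built-in types are rejected outright.
try:
    json.dumps({'tags': {1, 2}})
    set_supported = True
except TypeError:
    set_supported = False
print(set_supported)
```

The round trip succeeds, but the result is no longer equal to the original, which is worth keeping in mind whenever data passes through JSON.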
Let's serialize and deserialize the simple and complex object graphs and see what happens. The interface is almost identical to the pickle interface. You have dump(), dumps(), load(), and loads() functions. But there are no protocols to select, and there are many optional arguments to control the process. Let's start simple by dumping the simple object graph without any special arguments:
import json

simple = dict(int_list=[1, 2, 3],
              text='string',
              number=3.44,
              boolean=True,
              none=None)

print(json.dumps(simple))
The output here will be:
{"int_list": [1, 2, 3], "text": "string", "number": 3.44, "boolean": true, "none": null}
The output looks pretty readable, but there is no indentation. For a larger object graph, this can be a problem. Let's indent the output:
print(json.dumps(simple, indent=4))
The result will be:
{
    "int_list": [
        1,
        2,
        3
    ],
    "text": "string",
    "number": 3.44,
    "boolean": true,
    "none": null
}
That looks much better. Let's move on to the complex object graph.
json.dumps(complex)
This will result in a TypeError, as shown below:
Traceback (most recent call last):
  File "serialize.py", line 49, in <module>
    print(json.dumps(complex))
  File "/usr/lib/python3.8/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type A is not JSON serializable
Whoa! That doesn't look good at all. What happened? The error message is that the A object is not JSON serializable. Remember that JSON has a very limited type system, and it can't serialize user-defined classes automatically. The way to address it is to subclass the JSONEncoder class used by the json module and implement the default() method that is called whenever the JSON encoder runs into an object it can't serialize.
The job of the custom encoder is to convert such an object into a Python object graph that the JSON encoder is able to encode. In this case, we have two objects that require special encoding: the datetime object and the A class. The following encoder does the job. Each special object is converted to a dict where the key is the name of the type surrounded by dunders (double underscores). This will be important for decoding.
import json
from datetime import datetime

class CustomEncoder(json.JSONEncoder):
    def default(self, o):
        # datetime objects become an ISO 8601 string (microseconds are dropped)
        if isinstance(o, datetime):
            return {'__datetime__': o.replace(microsecond=0).isoformat()}
        # any other object is represented by its attribute dictionary
        return {'__{}__'.format(o.__class__.__name__): o.__dict__}
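A subclass is not the only way to hook into encoding: json.dumps() also accepts a default callable that plays the same role as the default() method. The sketch below uses a hypothetical helper named encode_special (not part of the original example) that handles only datetime and follows the same dunder-key convention.

```python
import json
from datetime import datetime


def encode_special(o):
    """Hypothetical helper: the datetime branch of CustomEncoder as a plain function."""
    if isinstance(o, datetime):
        return {'__datetime__': o.replace(microsecond=0).isoformat()}
    # Mirror the json module's behavior for anything we do not handle.
    raise TypeError('Object of type {} is not JSON serializable'
                    .format(o.__class__.__name__))


encoded = json.dumps({'when': datetime(2016, 3, 7)}, default=encode_special)
print(encoded)
```

A function is convenient for one-off calls, while a JSONEncoder subclass is easier to reuse and extend.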
Let's try again with our custom encoder:
serialized = json.dumps(complex, indent=4, cls=CustomEncoder)
print(serialized)
The output will be:
{
    "a": {
        "__A__": {
            "simple": {
                "int_list": [
                    1,
                    2,
                    3
                ],
                "text": "string",
                "number": 3.44,
                "boolean": true,
                "none": null
            }
        }
    },
    "when": {
        "__datetime__": "2016-03-07T00:00:00"
    }
}
This is beautiful. The complex object graph was correctly serialized, and the original type information of the components was retained via the keys "__A__" and "__datetime__". If your own data already uses dunder-wrapped keys, you need to develop a different convention to denote special types. Let's decode the complex object graph.
deserialized = json.loads(serialized)

deserialized == complex
# evaluates to False
The deserialization worked (no errors), but it's different from the original complex object graph we serialized. Something is wrong. Let's take a look at the deserialized object graph. I'll use the pprint function of the pprint module for pretty printing.
import json
from pprint import pprint
from serialize import serialized
deserialized = json.loads(serialized)
pprint(deserialized)

# prints:
# {'a': {'__A__': {'simple': {'boolean': True,
#                             'int_list': [1, 2, 3],
#                             'none': None,
#                             'number': 3.44,
#                             'text': 'string'}}},
#  'when': {'__datetime__': '2016-03-07T00:00:00'}}
The json module doesn't know anything about the A class or even the standard datetime object. It just deserializes everything by default to the Python object that matches its type system.
To get back to a rich Python object graph, you need custom decoding. There is no need for a custom decoder subclass. The load() and loads() functions provide the object_hook parameter that lets you provide a custom function to convert dicts to objects.
def decode_object(o):
    if '__A__' in o:
        # A.__init__ requires `simple`, so bypass it and restore the saved attributes
        a = A.__new__(A)
        a.__dict__.update(o['__A__'])
        return a

    elif '__datetime__' in o:
        return datetime.strptime(o['__datetime__'], '%Y-%m-%dT%H:%M:%S')

    return o
Let's decode by passing the decode_object() function as the object_hook argument to loads().
deserialized = json.loads(serialized, object_hook=decode_object)
print(deserialized)
# prints: {'a': <__main__.A object at 0x10d984790>, 'when': datetime.datetime(2016, 3, 7, 0, 0)}

deserialized == complex
# evaluates to True
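For reference, the pieces of the JSON round trip can be put together in one self-contained script. This is a sketch under a few assumptions: it targets Python 3, restates a condensed A class, names the graph complex_graph to avoid shadowing the built-in complex, and rebuilds A by passing the saved attributes to its constructor.

```python
import json
from datetime import datetime


class A(object):
    def __init__(self, simple):
        self.simple = simple

    def __eq__(self, other):
        return hasattr(other, 'simple') and self.simple == other.simple


class CustomEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, datetime):
            return {'__datetime__': o.replace(microsecond=0).isoformat()}
        return {'__{}__'.format(o.__class__.__name__): o.__dict__}


def decode_object(o):
    # Called for every decoded JSON object, innermost first.
    if '__A__' in o:
        return A(**o['__A__'])
    if '__datetime__' in o:
        return datetime.strptime(o['__datetime__'], '%Y-%m-%dT%H:%M:%S')
    return o


simple = dict(int_list=[1, 2, 3], text='string', number=3.44,
              boolean=True, none=None)
complex_graph = dict(a=A(simple), when=datetime(2016, 3, 7))

serialized = json.dumps(complex_graph, cls=CustomEncoder)
deserialized = json.loads(serialized, object_hook=decode_object)
round_trip_ok = deserialized == complex_graph
print(round_trip_ok)
```

Because the encoder drops microseconds, a datetime with a nonzero microsecond component would not compare equal after this round trip.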
Conclusion
In part one of this tutorial, you've learned about the general concept of serialization and deserialization of Python objects and explored the ins and outs of serializing Python objects using Pickle and JSON.
In part two, you'll learn about YAML, performance and security concerns, and a quick review of additional serialization schemes.
This post has been updated with contributions from Esther Vaati. Esther is a software developer and writer for Envato Tuts+.