Introducing Python 3.7's Dataclasses
Python 3.7's dataclasses reduce repetition in your class definitions.
Newcomers to Python often are surprised by how little code is required to accomplish quite a bit. Between powerful built-in data structures that can do much of what you need, comprehensions to take care of many tasks involving iterables, and the lack of getter and setter methods in class definitions, it's no wonder that Python programs tend to be shorter than those in static, compiled languages.
However, this amazement often ends when people start to define classes in Python. True, the class definitions generally will be pretty short. But the __init__ method, which adds attributes to a new object, tends to be rather verbose and repetitive. For example:
class Book(object):
    def __init__(self, title, author, price):
        self.title = title
        self.author = author
        self.price = price
Let's ignore the need for the use of self, which is an outgrowth of the LEGB (local, enclosing, global, builtins) scoping rules in Python and which isn't going away. Let's also note that there is a world of difference between the parameters title, author and price and the attributes self.title, self.author and self.price.
What newcomers often wonder (and in the classes I teach, they often wonder about this out loud) is why you need to make these assignments at all. After all, can't __init__ figure out that the three non-self parameters are meant to be assigned to self as attributes? If Python's so smart, why doesn't it do this for you?
I've given several answers to this question through the years. One is that Python tries to make everything explicit, so you can see what's happening. Having automatic, behind-the-scenes assignment to attributes would violate that principle.
At a certain point, I actually came up with a half-baked solution to
this problem, although I did specifically say that it was un-Pythonic
and thus not a good candidate for a more serious implementation. In a
blog post, "Making Python's __init__ method magical", I proposed that
you could assign parameters to attributes automatically, using a
combination of inheritance and introspection. This was a thought
experiment, not a real proposal. And yet, despite my misgivings and
the skeletal implementation, there was something attractive about not
having to write the same boilerplate __init__ method, with the same
assignment of arguments to attributes.
Fast-forward to 2018. As I write this, Python 3.7 is about to be released, and it turns out that one of the highlights of this new version is "dataclasses", a way to write classes that removes the need to write boilerplate code. The implementation was done in a much different (and better) way than I had proposed, and it includes a great deal of functionality I hadn't even imagined. I expect that for many people, dataclasses will become their preferred way to create Python classes.
So in this article, I review the new dataclasses functionality in Python 3.7. If you're reading this before 3.7 has been released, I suggest downloading and installing it, albeit not as your main, production version of Python, just in case issues arise before the first production release.
Simple Dataclasses
Let's take the class from above:
class Book(object):
    def __init__(self, title, author, price):
        self.title = title
        self.author = author
        self.price = price
Here's how you can translate it into a dataclass:
from dataclasses import dataclass

@dataclass
class Book(object):
    title: str
    author: str
    price: float
If you have any experience with Python, you can recognize the outline of what's going on here, but a whole bunch of things are different.
First is using the dataclass decorator to modify the class definition. Decorators are one of Python's most powerful tools, allowing you to modify functions and classes both when they are defined and when they are called. In this case, the decorator inspects the class definition and then writes __init__ and other methods on the fly, based on that definition.
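One way to see the generated method for yourself is to inspect its signature. The following is a minimal sketch using the standard inspect module; it shows only the parameter names that the decorator derived from the class body, not how the decorator builds the method:

```python
import inspect
from dataclasses import dataclass

@dataclass
class Book(object):
    title: str
    author: str
    price: float

# The decorator wrote __init__ from the annotated names, in declaration order.
params = list(inspect.signature(Book.__init__).parameters)
print(params)  # ['self', 'title', 'author', 'price']
```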
Next, you'll notice that no __init__ has been defined, or any other methods, for that matter. Instead, what is defined is what would appear to be class attributes. But then again, they're not really class attributes, since they lack any values. So what are they doing?
Moreover, there might not be any values associated with these class attributes, but there are types, using the variable-annotation syntax introduced in Python 3.6. Type annotations allow you to tag a variable with a particular type. The annotations aren't enforced by Python at runtime, but they can be used by your editor or by external programs (such as MyPy) to catch type errors in your code. You don't have to stick with the simple built-in types either; you can use the typing module to import a variety of predefined types, including one called Any if you want to allow for anything.
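To make the "not enforced" point concrete, here is a small sketch. The notes field is a hypothetical addition, included only to show Any; the rest mirrors the Book class above:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Book(object):
    title: str
    author: str
    price: float
    notes: Any = None  # hypothetical extra field; Any accepts anything

# No runtime check happens: a string is accepted where float is annotated.
b = Book('MyTitle1', 'AuthorFirst AuthorLast', 'twenty')
print(type(b.price))  # <class 'str'>
```

An external checker such as MyPy would be expected to flag that call, but Python itself runs it without complaint.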
So already you likely can see a few advantages to dataclasses. You don't need to write the boilerplate code in __init__, and type annotations already are included. But aside from clearer, shorter code and the ability to run code checkers, what else do you get?
Well, it turns out that the @dataclass decorator doesn't just create __init__. It creates a number of other methods as well. For example, it defines __eq__, the method that lets you determine if two instances are equal to one another using the == equality operator. It also defines __repr__ to be far more attractive and useful than the existing Python default.
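A short sketch of what the generated __eq__ buys you. PlainBook is a hypothetical hand-written class, included only for contrast:

```python
from dataclasses import dataclass

@dataclass
class Book(object):
    title: str
    author: str
    price: float

class PlainBook(object):  # hypothetical non-dataclass, for contrast
    def __init__(self, title, author, price):
        self.title = title
        self.author = author
        self.price = price

b1 = Book('MyTitle1', 'AuthorFirst AuthorLast', 20)
b2 = Book('MyTitle1', 'AuthorFirst AuthorLast', 20)
p1 = PlainBook('MyTitle1', 'AuthorFirst AuthorLast', 20)
p2 = PlainBook('MyTitle1', 'AuthorFirst AuthorLast', 20)

print(b1 == b2)  # True: the generated __eq__ compares the fields
print(p1 == p2)  # False: the default comparison is by identity
```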
With the above class definition, you thus can say:
b1 = Book('MyTitle1', 'AuthorFirst AuthorLast', 20)
b2 = Book('MyTitle2', 'AuthorFirst AuthorLast', 25)
print(b1)
print(b2)
The output will be:
Book(title='MyTitle1', author='AuthorFirst AuthorLast', price=20)
Book(title='MyTitle2', author='AuthorFirst AuthorLast', price=25)
Note that while the attribute names are specified in the dataclass at the class level, the names actually are stored as attributes on the individual instances. You can see this by exploring the new objects a little bit. For example, if you ask to print vars(b1), you get the following:
{'title': 'MyTitle1', 'author': 'AuthorFirst AuthorLast', 'price': 20}
And if you ask to see the type of b1.title, Python tells you that it's a string. So nothing fancy is being created here, such as a property or a descriptor. Rather, this is just creating a regular old class, albeit with some useful and interesting functionality.
Adding Methods
The name "dataclass" implies that such classes are to be used for data, and only data, without the ability to write methods. And indeed, part of the thinking behind the development of dataclasses was that folks wanted something easier to write than regular Python classes, but with the same easy-to-read syntax as named tuples or dictionaries.
But, that's not the case. You can add methods to a dataclass, just as you would add them to any other class. For example, say you want to get the book author's name as a list of strings, rather than as a single string. This would be useful if you want to alphabetize or display books by the author's last name and then first name.
In a dataclass, you add such a method by...adding the method. In the body of the class, you would write:
    def author_split(self):
        return self.author.split()
In other words, you can create whatever methods you want, using the same syntax that you've used before.
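Putting the pieces together, here is a sketch of the full class with that method, plus one possible way to sort by last name and then first name. The sample books and the sort key are illustrative assumptions, not part of the dataclasses API:

```python
from dataclasses import dataclass

@dataclass
class Book(object):
    title: str
    author: str
    price: float

    def author_split(self):
        return self.author.split()

# Hypothetical sample data.
books = [Book('T1', 'Zed Adams', 20), Book('T2', 'Amy Young', 25)]

# Reversing the split name puts the last name first in the sort key.
by_last_name = sorted(books, key=lambda b: b.author_split()[::-1])
print([b.author for b in by_last_name])  # ['Zed Adams', 'Amy Young']
```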
Optional Functionality
Dataclasses offer a great deal of functionality that can help you modify the default behavior.
First and foremost, you can provide each of your declared attributes with a default value. Doing so makes them optional when you create a new instance. For example, say you want the default book price to be $20. You can say:
@dataclass
class Book(object):
    title: str
    author: str
    price: float = 20
Notice how the syntax reflects the Python 3 syntax for function parameters that have both type annotation and a default value. Just as is the case with function parameter defaults, dataclass attributes with defaults must come after those without defaults.
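As a quick sketch of both points: the default makes the third argument optional, and a field without a default that follows one with a default is rejected when the class is defined. BadBook is a hypothetical class that exists only to trigger the error:

```python
from dataclasses import dataclass

@dataclass
class Book(object):
    title: str
    author: str
    price: float = 20

b = Book('MyTitle1', 'AuthorFirst AuthorLast')
print(b.price)  # 20

ordering_error = None
try:
    @dataclass
    class BadBook(object):  # hypothetical, deliberately wrong
        title: str = 'Untitled'
        author: str  # no default, but follows a field that has one
except TypeError as e:
    ordering_error = e
    print('TypeError:', e)
```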
Rather than declaring a value for a default, you actually can pass a function that is executed (without any arguments) each time a new object is created.
To do this, and to take advantage of a number of other features having to do with dataclass attributes, you must use the field function (from the dataclasses module), which lets you tailor the way the attribute is defined and used.
If you pass a function to the default_factory parameter, that function will be invoked each time a new instance is created without a specified value for that attribute. This is very similar to the way that the defaultdict class works, except that it can be specified for each attribute.
For example, you can give each new book a default random price between $20 and $100 in the following way:
import random
from dataclasses import dataclass, field

def random_price():
    return random.randint(20, 100)

@dataclass
class Book(object):
    title: str
    author: str
    price: float = field(default_factory=random_price)
Note that you cannot set both default_factory and a default value; the whole point is that default_factory lets you run a function and, thus, provides the value dynamically, when the new instance is created.
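Another common use for default_factory, beyond random values, is giving each instance its own mutable container. The sketch below assumes a hypothetical tags field, and it also shows the error raised when both a default and a default_factory are passed to field:

```python
from dataclasses import dataclass, field

@dataclass
class Book(object):
    title: str
    author: str
    tags: list = field(default_factory=list)  # hypothetical field

b1 = Book('MyTitle1', 'AuthorFirst AuthorLast')
b2 = Book('MyTitle2', 'AuthorFirst AuthorLast')
b1.tags.append('python')
print(b1.tags, b2.tags)  # ['python'] []  (each instance got its own list)

both_error = None
try:
    field(default=20, default_factory=list)  # both at once is rejected
except ValueError as e:
    both_error = e
    print('ValueError:', e)
```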
The main thing that the __init__ method in a Python object does is add attributes to the new instance. Indeed, I'd argue that the majority of __init__ methods I've written through the years do little more than assign the parameters to instance attributes. For such objects, the default behavior of dataclasses works just fine.
But in some cases, you'll want to do more than just assign values. Perhaps you want to set up values that aren't dependent on parameters. Perhaps you want to take the parameters and adjust them in some way. Or perhaps you want to do something bigger, such as open a file or make a network connection.
Of course, the whole point of a dataclass is that it takes care of writing __init__ for you. And thus, if you want to do more than just assign the parameters to attributes, you can't do so, at least not in __init__. I mean, you could define __init__ yourself, but that would defeat the purpose of using a dataclass.
For cases like this, dataclasses have another method at their disposal, called __post_init__. If you define __post_init__, it will run after the dataclass-defined __init__. So, you're assured that the attributes have been set, allowing you to adjust or add to them, as necessary.
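Here is a minimal sketch of __post_init__ adjusting and adding attributes. The particular clean-ups, and the author_last attribute, are hypothetical choices made for illustration:

```python
from dataclasses import dataclass

@dataclass
class Book(object):
    title: str
    author: str
    price: float

    def __post_init__(self):
        # Runs after the generated __init__, so the fields already exist.
        self.title = self.title.strip()
        self.price = float(self.price)
        # Hypothetical extra attribute, derived from an existing field.
        self.author_last = self.author.split()[-1]

b = Book('  MyTitle1 ', 'AuthorFirst AuthorLast', 20)
print(b.title, b.price, b.author_last)  # MyTitle1 20.0 AuthorLast
```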
Here's another case that dataclasses handle. Normally, instances of user-created classes are hashable. But in the case of dataclasses, by default they aren't, because the decorator generates __eq__ for a class whose instances can change. This means you can't use such instances as keys in dictionaries or as elements in sets.
You can get around this by declaring your class to be "frozen", making its instances immutable. In other words, an instance of a frozen dataclass is created and then never changes, similar to a named tuple. You can do this by giving a True value to the dataclass decorator's frozen parameter:
>>> @dataclass(frozen=True)
... class Foo(object):
...     x: int
...
>>> f1 = Foo(10)
>>> f1.x = 100
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/dataclasses.py", line 448, in _frozen_setattr
    raise FrozenInstanceError(f'cannot assign to field {name!r}')
dataclasses.FrozenInstanceError: cannot assign to field 'x'
Moreover, now you can run hash on the variable:
>>> hash(f1)
3430012387537
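To round this out, a sketch contrasting a default dataclass with a frozen one, and then using frozen instances as a dictionary key and as set elements. Mutable and inventory are hypothetical names used only for the comparison:

```python
from dataclasses import dataclass

@dataclass
class Mutable(object):  # hypothetical, default (non-frozen) dataclass
    x: int

@dataclass(frozen=True)
class Foo(object):
    x: int

try:
    hash(Mutable(10))
    mutable_hashable = True
except TypeError:
    mutable_hashable = False
print(mutable_hashable)  # False

# Equal frozen instances hash alike, so they work as keys and set members.
inventory = {Foo(10): 'ten'}
print(inventory[Foo(10)])             # ten
print(len({Foo(1), Foo(1), Foo(2)}))  # 2
```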
There are a number of other optional pieces of functionality in dataclasses as well, from indicating how your objects will be compared to choosing which fields will be printed. It's impressive to see just how much thought has gone into the creation of dataclasses. I wouldn't be surprised if in the next few years, most Python classes will be defined as dataclasses, along with whatever customization and additions the user requests.
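As one illustration of those options, the sketch below uses the decorator's order parameter together with the compare and repr parameters of field. The choice to compare books by price alone, and to hide the author when printing, is an assumption made for the example:

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Book(object):
    price: float
    title: str = field(compare=False)               # ignored by ==, <, and so on
    author: str = field(compare=False, repr=False)  # also left out of the repr

cheap = Book(20, 'MyTitle1', 'AuthorFirst AuthorLast')
pricey = Book(25, 'MyTitle2', 'AuthorFirst AuthorLast')

print(cheap < pricey)  # True: only price takes part in comparisons
print(cheap)           # Book(price=20, title='MyTitle1')
```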
Conclusion
Python's classes always have suffered from some repetition, and dataclasses aim to fix that problem. But, dataclasses go beyond macros to provide a toolkit that a large number of Python developers can and should use to improve the readability of their code. The fact that dataclasses integrate so nicely into other modern Python tools and code, such as MyPy, tells me that they're going to become the standard way to create and work with classes in Python very quickly.
Resources
Dataclasses are described most fully in PEP (Python Enhancement Proposal) 557. If Python 3.7 isn't out by the time you read this article, you can go to https://python.org and download a beta copy. Although you shouldn't use it in production, you definitely should feel comfortable trying it out and using it for personal projects.