dry.jpg
The idea of DRY is to reduce the repetition of code.
If DRY means “Don’t Repeat Yourself”… then WET means “Write Every Time”, or “We Enjoy Typing”
Don’t write WET code!
We write DRY code - or we DRY out WET code - through a combination of abstraction and normalization.
The “principle of abstraction” aims to reduce duplication of information (usually code) in a program whenever it is practical to do so:
“Each significant piece of functionality in a program should be implemented in just one place in the source code. Where similar functions are carried out by distinct pieces of code, it is generally beneficial to combine them into one by abstracting out the varying parts.”
Benjamin C. Pierce - Types and Programming Languages
The easiest way to understand abstraction is to see it in action. Here’s an example that you are already familiar with; determining the energy emitted by an object as a function of its temperature:
\(Q = \epsilon \sigma T^4\)
where \(\epsilon\) is an object’s emmissivity, \(\sigma\) is the Stefan-Boltzmann constant, and \(T\) is temperature in degrees Kelvin.
We might write the following code to determine \(Q\):
. . .
But this code is going to get very WET very fast.
. . .
Here’s a DRY version obtained using abstraction:
. . .
We keep our code DRY by using abstraction. In addition to functions, python also provides Classes
as another important way to create abstractions.
Functions and Classes are the subject of this tomorrow’s exercise.
In general, the process of keeping code DRY through successive layers of abstraction is known as re-factoring.
The “Rule of Three” states that you should probably consider refactoring (i.e. adding abstraction) whenever you find your code doing the same thing three times or more.
Normalization is the process of structuring data in order to reduce redundancy and improve integrity.
Some of the key principles of Normalization include:
Primary Key
, which uniquely identifies a record. Usually, in python, this key is called an Index
.Atomic
columns, meaning entries contain a single value. This means no collections should appear as elements within a data table. (i.e. “cells” in structured data should not contain lists!)This form of normalization is easy to obtain, as the idea of an Index
is embedded in almost any Python data structure, and a core component of data structures witin pandas
, which is the most popular data science library in python (coming next week!).
. . .
The idea of atomic columns is that each element in a data structure should contain a unique value. This requirement is harder to obtain and you will sometimes violate it.
. . .
The idea of atomic columns is that each element in a data structure should contain a unique value. This requirement is harder to obtain and you will sometimes violate it.
. . .
The idea of transitive dependencies is the inclusion of multiple associated attributes within the same data structure.
Transitive dependencies make updating data very difficult, but they can be helpful in analyzying data.
So we should only introduce them in data that we will not be editing.
Usually environmental data, and especially timeseries, are rarely modified after creation. So we don’t need to worry as much about these dependencies.
For example, contrast a data record of “temperatures through time” to a data record of “user contacts in a social network”.
The idea of transitive dependencies is the inclusion of multiple associated attributes within the same data structure.
. . .
In general, for data analysis, basic normalization is handled for you.
For read only data with fixed associations, a lack of normalization is manageable.
However, many analyses are easier if you structure your data in ways that are as normalized as possible.
If you are collecting data then it is important to develop an organization structure that is normalized.