Software Engineering for Data Scientists: Outlining and Drafting
Last week we showed how outlining can lead to breakthroughs in research.
Basically, it’s a philosophical tool: a way to drill down on what we mean when we say things and how that domain discovery might guide better questions in our Googling.
Let’s put outlining in the context of drafting. We went back and forth plenty between outlining and journaling, showing how the holes that outlining exposed turned into excellent tasks for research. But we held off on drafting since that would have affected our code screenshots and potentially distracted from our main message.
Outlining has three main synergies with drafting.
Acceptance Test-Driven Development
Outlining pairs naturally with acceptance test-driven development (ATDD). Because we’re building up a scaffolding of what code may go where based on our domain model, the outline should be roughly testable at all times.
We used “pass” a lot in our outline, which makes ATDD impossible: a stubbed-out function returns nothing the tests can check. A better idiom might be to return some faked data, such as a bare-minimum object or a primitive like 0 that should work.
A more advanced technique would be to pass this fake object around but flag it so you can quickly tell what is fake and what is not.
One way to do this is a small wrapper function. The function fails via assertion if a global flag isn’t true; otherwise, it just returns what you passed it.
This function can wrap any return values you’re faking out. This helps in two ways.
You can do a text search for your function name.
You can set your flag to false, and your tests should find every place that still returns fake data.
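The flagged-fake wrapper described above might look something like this minimal sketch. The names `ALLOW_FAKES`, `fake`, `load_prices`, and `total_return` are hypothetical, chosen only for illustration; they are not the names from the original example.

```python
# Hypothetical global flag: flip to False to make every remaining fake fail loudly.
ALLOW_FAKES = True


def fake(value):
    """Mark a return value as faked; fail if fakes are no longer allowed."""
    assert ALLOW_FAKES, f"fake value still in use: {value!r}"
    return value


def load_prices(ticker):
    # Outline stub: return a bare-minimum value instead of `pass`.
    return fake([])


def total_return(ticker):
    prices = load_prices(ticker)  # faked input, unused until the real logic exists
    # Not implemented yet, so return a flagged primitive.
    return fake(0.0)
```

Searching the code base for `fake(` lists every stub that remains, and setting the flag to `False` turns each one into a failing assertion the next time the tests run.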
Contracts afford a fantastic synergy between ATDD and outlining. A valuable metric to ask about during testing is how many assertions fire per test run. The more assertions, the better. However, we don’t want more complex tests. (This is hard to accomplish!)
We want our tests to be relatively simple. Send a known input in with a known state, and get a known output out with a known state. That’s the ideal. We could find more assertions by making a test more complex, with multiple state transitions or multiple sets of inputs and outputs. But those need to be separate tests.
Instead, contract-style programming (precondition assertions, postcondition assertions, and object invariant assertions) allows us to assert more about the known inputs without expanding those inputs. It will enable us to assert more about the outputs without expanding them. Finally, it allows us to assert more about the states without more state transitions!
These assertions are stored in the objects themselves, making them reusable across tests. One assertion may improve the assertion density of dozens of tests! And every additional test written will exercise dozens of assertions for free!
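As a rough sketch of what contract-style assertions can look like in plain Python, consider a hypothetical `DateSpan` value object. The standard library’s `dataclasses` is used here in place of attrs only so the sketch has no dependencies; the class, its fields, and its methods are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateSpan:
    begin: date
    end: date

    def __post_init__(self):
        # Object invariant: checked every time a DateSpan is constructed.
        assert self.begin <= self.end, "begin must not be after end"

    def days(self) -> int:
        result = (self.end - self.begin).days
        # Postcondition: a valid span never has a negative length.
        assert result >= 0
        return result

    def contains(self, day: date) -> bool:
        # Precondition: callers must pass a date.
        assert isinstance(day, date), "day must be a datetime.date"
        return self.begin <= day <= self.end
```

Any test that builds a `DateSpan` or calls one of its methods now fires these assertions without the test itself getting any more complicated.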
If you want more advice for your team on where to test and where to contract, contact us!
As you saw when we worked through our example, relying on the linter is part and parcel of outlining and drafting. The linter will catch all kinds of other issues during outlining--spelling, design smells, and the like. It will be pretty challenging to crush all linter errors too.
For example, Pylint often doesn’t like our attrs classes and says they should have at least two public methods. This is because any object with only one method could probably be a function; any object with no methods could probably just be a named dict or other “dataclass light.” In other words, Pylint doesn’t like how attrs works (at least with my versions of Pylint and attrs).
I’m a fan of this because I don’t like data classes.
That’s not to say they’re wrong, but rather that every data class (a class with only variable members) should probably be considered a place to grow. You may not see these opportunities now, but the thing to look out for is feature envy: code elsewhere that is more interested in a class’s data than in its own.
An easy way to spot this in Python is to mark an attribute as private and see what the linter picks up. You can do this in one of two ways.
Change the attribute’s name in the class itself by adding a leading underscore. See what starts crashing in your acceptance tests or gets red-lined by your linter.
Use a rename refactor that changes the name everywhere. Now look for yellow lines in your linting that warn about access to a private member.
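Here is a small sketch of the first option, under the assumption of a hypothetical `DateSpan` class and a hypothetical outside helper `starts_before`. Only the class is changed, so the helper still uses the old public name.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateSpan:
    _begin: date  # renamed from `begin`; only the class was changed
    end: date


def starts_before(span: DateSpan, other: DateSpan) -> bool:
    # Feature-envy candidate: this function only cares about DateSpan's data.
    # It still uses the old public name, so it now raises AttributeError.
    return span.begin < other.begin
```

With the second option the helper would read `span._begin` instead, and Pylint would typically flag that line with its protected-access warning. Either way, the flagged lines are the places where a method may want to move onto the class.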
Datespan may not be a great use of this as it logically fulfills the “data class” idea. Many objects you can “freeze” will do this. However, let’s see what happens.
So we actually did some other stuff to make this happen:
Added a daily returns object
Added a date span attribute to that object
Assumed we’d do something similar to indicator results
Much of this would probably be replaced with pandas and proper time series now that we know that’s the solution; but I wanted to show this for effect.
So the linter shows we’re accessing a variable that’s no longer there. (We followed option 1 above and renamed begin to _begin.) What are some solutions?
Break encapsulation and just rename to _begin here.
Move this method to Datespan.
In this case, it actually makes sense to move the method. Clearly, it’d be nice to compare datespans directly, which would compare beginning and end in some method on datespan.
Aside: Ironically, attrs already allows us to compare directly, and it will “do the right thing.” So, in this case, we can just compare date_spans and be done; we don’t need to add a new method.
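The direct comparison mentioned in the aside can be sketched as follows. This uses the standard library’s `dataclasses`, which also generates an equality method by default, as a stand-in for attrs; the class and field names are assumptions modeled loosely on the objects described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateSpan:
    begin: date
    end: date


@dataclass(frozen=True)
class DailyReturns:
    date_span: DateSpan
    values: tuple


def same_period(returns: DailyReturns, other: DailyReturns) -> bool:
    # No reaching into begin/end: the generated __eq__ compares field by field.
    return returns.date_span == other.date_span
```

Because the comparison lives on the value object, the calling code no longer needs to know which attributes a date span has.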
Using many data classes and slowly moving methods to make them richer as you discover these methods is an excellent way to grow your code.
Ultimately, what we’re building here are called value objects.
Most of the objects you discover in your outline are likely value objects. That means it’s pretty easy to start them with attrs, set them to frozen, and slowly add methods as you discover them.
Value objects and data classes can often get away with public attributes. There’s no state to protect since there’s no way to change the state. That makes it easier to add module-level functions than to add methods to the class interface. But this is a bad idea: a rich type provides commonly needed functionality as methods.
To identify this commonly needed functionality but not violate YAGNI or KISS, we’re trying to do a kind of “just-in-time” design. We identify the functionality we need to solve our problem, then refactor it where it ought to live, rather than guess what functionality each class needs when we write that class.
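To make the “just-in-time” idea concrete, here is a sketch in which an overlap check, assumed to have started life as a module-level helper, is moved onto a hypothetical `DateSpan` once the problem actually called for it. The method name `overlaps` is illustrative and not from the original example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateSpan:
    begin: date
    end: date

    def overlaps(self, other: "DateSpan") -> bool:
        # Moved here from a module-level helper once the need was real,
        # rather than guessed at when the class was first written.
        return self.begin <= other.end and other.begin <= self.end
```

The class starts as a frozen bag of attributes and only gains a method when solving the problem demands one, which keeps the design within YAGNI.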
That, combined with a little bit of design upfront using our nouns and verbs trick from before, is enough to get us started. The rest of your design skills will grow in time as you practice outlining.
If you’ve got reasonably good “fake” objects and your outline can run, you get the superpower of starting wherever the problem is hardest.
Frequently, just to get to the point of drafting and trying out prototype code, you’ve got to write a lot of boilerplate. This boilerplate tends to be the stuff that needs refactoring later: straight, procedural data transformations that exist only to get things into the “shape” they need to be.
Outlining gives you an excellent first draft of a design that won’t be overbearing or over-engineered. It will allow you to draft more productively. It’s not always perfect, but it will often sit at precisely the same layers your data transformations would have typically sat. After all, switching from one noun to another represents some sort of procedure.
This allows you to get to the meat of your problem much faster. Ultimately, if you run into issues with how to transform your inputs and outputs into the logic you want to test drive, you’ll always find a way around it. But, on the other hand, you may run into issues with the logic itself that require a reworking of the requirements. So it’s beneficial to swiftly get to the part of the problem most likely to give you trouble.
This isn’t hard. Get your high-level acceptance smoke test running on your outline (all data is faked, etc.). Then drop your debugger and stop at the method you want to start prototyping.
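A minimal sketch of that setup, with entirely hypothetical function names, might look like this. Everything is faked, the smoke test only checks that the outline runs end to end, and a comment marks where you would stop the debugger.

```python
def load_prices(ticker):
    # Faked: a bare-minimum value so the outline runs end to end.
    return []


def compute_signal(prices):
    # The hard part. Put a breakpoint() here (or set one in your IDE)
    # and prototype against the faked input.
    return 0.0


def run_strategy(ticker):
    prices = load_prices(ticker)
    return compute_signal(prices)


def test_smoke():
    # Acceptance smoke test: only checks that the outline runs and
    # returns a value of the expected type.
    assert isinstance(run_strategy("ABC"), float)
```

Running `test_smoke` under a debugger drops you straight into `compute_signal` with the faked inputs already in scope.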
The inputs to that method should be basic fake data. You may want to switch to something canned (canned means real, but read from a CSV file or something rather than generated), or you can prototype something with the fake data.
For instance, you may have passed in an empty series because you didn’t want to populate it. However, most of your algorithm should still work with an empty series.
You can still start making calls to scikit-learn, pandas, and other third-party libraries in your drafting. These calls will still benefit from linting and type checking, and you’ll have an initial prototype when you’re done. This may have all been done with just an empty series. However, it will have also been a conversation with the rest of the code, perhaps slightly modifying types or contracts at each layer of your outline. At the end of the prototype, you’ll have an idea of what some sort of canned data should look like--which you can generate then pass in and immediately exercise your prototype.
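As an illustration of prototyping against empty data first and canned data second, here is a sketch that uses a plain Python list as a stand-in for a pandas series, so it has no dependencies. The function `daily_returns` and the canned prices are made up for the example.

```python
def daily_returns(prices):
    """Fractional change between consecutive prices."""
    return [
        (today - yesterday) / yesterday
        for yesterday, today in zip(prices, prices[1:])
    ]


# Prototype against the faked (empty) input first: the algorithm still runs.
assert daily_returns([]) == []

# Then pass in hand-written canned data once you know what shape it needs.
canned = [100.0, 110.0, 99.0]
print(daily_returns(canned))
```

The empty input exercises the types and the control flow, while the canned input is where the actual numbers first get checked.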
How to Improve Design Skills
Design is like anything else; it gets better with practice.
There are some methods, such as outlining in this case. There are some principles, of which we have discussed KISS and YAGNI.
There are some patterns--things that tend to crop up in good designs and things that crop up in bad designs (often called “code smells”). We’ll get into that next week.
Right now, the primary way to become a better designer is to start somewhere, determine whether you were successful, and try to improve. This is the same with any sport, game, or task. Try, then improve.
We’ve got a great way to do this between journaling and outlining. You’re going to give a design a guess. Then, at the end of your Pomodoro, you can reflect on how well your design is helping you discover new requirements, research topics, or drafting opportunities. How well does your design abide by principles like KISS and YAGNI? How well does your design help you understand the problem you’re trying to solve?
When we’re not programming, often it’s nice to go to a whiteboard and model the problem we’re trying to solve--either collaboratively or alone. That’s all we’re doing here. We’re building a model and then using type checking and linting to “check” the model’s correctness on a few simple things.
Design is about human factors, even inside software. How much of a joy to use is the code? How easy is it to understand? How easy is it to extend? Does it help you understand the problem, or is it a problem to understand?
Another OODA loop will occasionally be maintenance and extension of old code. Here you’re going to dive again into your domain model; only this time, it may have been a while. Is it easy to understand for someone with fresh eyes? What documentation is missing? What objects or functions seem clumsy?
All of this feedback will help you outline a bit better each time. You’ll begin to build an intuition for which nouns to reify and which verbs to turn into functions. It will help you start to spot the domain model’s layers and shapes from conversations and from thinking about them. Then you’ll rinse and repeat. With these new designs, what issues are you finding? What’s easy, and what’s hard?
So outlining makes you a better designer, and designing makes you a better outliner. It will just take practice.
Are there other ways to improve? Yes! Exposure to sound design principles and patterns can help your practice. We’ll go over those in our next and final installment on outlining.