Software Engineering for Data Scientists: Outlining and Contract Programming
Last post, we went over how outlining can drive design decisions you didn’t know you had. It can also work its way into the ongoing conversation between problem and solution, helping you ask the “right” questions.
We’ll continue that today but focus on the idea of contracts. Contracts are assertions that live in the code itself rather than in tests. Commonly, they’re put at the beginning of a function (called “preconditions”), at the end (called “post-conditions”), or on both ends of every method of an object (called “invariants”).
We’re at the point with our date object where we’d have to start drafting to get anywhere (ignoring the richly typed years and days for now). So let’s try out what using Python’s datetimes would look like.
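Here is a rough sketch of that outline. The Stock placeholder, the ticker parameter, and the docstring wording are assumptions made for illustration; the one contract is the assertion that the two dates are in order.

```python
import datetime


class Stock:
    """Placeholder for the Stock domain object (assumed; still just an outline)."""


def make_stock(ticker: str, begin: datetime.date, end: datetime.date) -> Stock:
    """Outline a stock observed between two dates.

    ticker is a hypothetical parameter name used for illustration.
    """
    # Precondition: the contract lives in the code, not in a test.
    assert begin <= end, "begin must not be after end"
    # Drafting the body comes later; for now we only return the placeholder.
    return Stock()
```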
In terms of outlining, that’s it; we’re done. We’ll need to draft to use the built-in formatting methods in datetime.date. No additional validation is required since that’s already done in datetime.date.
Pros of using datetime.date:
Comparisons work out of the box
Comes with string formatting out of the box
Comes with validations out of the box
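Those three pros fit in a few lines. This is a small sketch using arbitrary example dates:

```python
import datetime

first = datetime.date(2021, 1, 4)
second = datetime.date(2021, 2, 1)

# Comparisons work out of the box.
is_ordered = first < second

# String formatting works out of the box.
iso_text = first.isoformat()              # ISO 8601: year-month-day
custom_text = first.strftime("%d/%m/%Y")  # day/month/year

# Validation works out of the box: February 30th is rejected.
try:
    datetime.date(2021, 2, 30)
    rejected = False
except ValueError:
    rejected = True
```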
Cons of using datetime.date:
We’ll have to learn datetime.date
It comes with stuff we don’t need
Is that last thing a real con, though? Many people think so.
For some reason, code with additional features is considered “heavy,” while code that just does what you need is considered “light.”
In actuality, there are only two things you need to consider when thinking of something “heavyweight” versus “lightweight.”
Heavy code, in a performance-critical system, is slow code. So measure it; use a profiler. In those cases, some “extra feature” you don’t actually use may end up slowing things down. An example would be boxing and unboxing of primitives in languages that do that sort of thing. However, most code you write will not be on a performance-critical system.
Python’s datetime.date is not slow. It may even be faster than our string implementation. So, from this standpoint, it is not “heavy.”
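If you want to measure rather than guess, something like the sketch below would do. It uses timeit as a quick stand-in for a full profiler, and compare_strings is a hypothetical string-based check invented for this comparison, not the implementation from the earlier post. The numbers will vary by machine, so no result is claimed here.

```python
import datetime
import timeit


def compare_strings(begin: str, end: str) -> bool:
    """Hypothetical string-based check: parse "YYYY-MM-DD" and compare."""
    begin_parts = tuple(int(part) for part in begin.split("-"))
    end_parts = tuple(int(part) for part in end.split("-"))
    return begin_parts <= end_parts


def compare_dates(begin: datetime.date, end: datetime.date) -> bool:
    """The same check using datetime.date's built-in comparison."""
    return begin <= end


date_begin = datetime.date(2021, 1, 4)
date_end = datetime.date(2021, 2, 1)

# Time each approach over the same number of calls.
string_seconds = timeit.timeit(
    lambda: compare_strings("2021-01-04", "2021-02-01"), number=10_000
)
date_seconds = timeit.timeit(
    lambda: compare_dates(date_begin, date_end), number=10_000
)

print(f"strings: {string_seconds:.4f}s, datetime.date: {date_seconds:.4f}s")
```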
Heavy code can also be defined as code that takes a lot of effort to maintain and extend. This is, ironically, what most people mean when they imply something is lightweight versus heavy. Even here, though, they are committing a logical fallacy.
When people write “lightweight wrappers,” they’re doing what they know how to do--simple programming to solve some problem. Looking at “heavyweight libraries” and dependencies makes them uncomfortable. It will mean learning something new.
Learning is nearly always painful. You will inevitably feel stupid.
The only thing that matters, though, is the time spent. Is the time spent learning datetime less than the time spent writing your own dates?
You Gonna Believe Me, or Your Lyin’ Intuition?
Once again, cognitive biases will rear their heads. Building your own will feel salient and at hand, so it will seem easier. The issue is that datetime has already dealt with all the problems of dates and times--which are pretty complex! Writing your own date feels easier only because you have no idea how complicated it is yet.
Using datetime will be orders of magnitude faster in terms of coding effort. Often, system libraries like these will also be more tuned than yours. Don’t expect your own wrappers to be faster than something like datetime.
Thus, prefer existing libraries, standard or third-party, where possible. There are exceptions where it really will take you more time to use them (and not just feel like it will), or where the library hits a performance-critical area of code and you’ve measured it to be slower than an alternative approach. These exceptions tend to be rare.
Play to Your Strengths
Let’s get back to your domain model. You want it to be heavy on the problem you’re trying to solve. Let’s say you manage to create an algorithm that you can use to trade stocks.
Was it because you handled datetimes differently than other people? Almost certainly not.
This is akin to basic business strategy. You want to buy things where you aren’t competing: vendors, partners, etc. You want your core business to be the thing you do that no one else can. Will your business succeed because it did dates differently than other businesses? Almost certainly not. That’s engineering time wasted.
(Think about this next time you dive into a code base. Look at the sheer number of lines written as a “wrapper” that provides no value, just to avoid using a pretty standard API.)
Spending time learning these third-party libraries is also one of the things that separates beginner programmers from experts. Ultimately, experts have more things at hand. They can think of all kinds of parts you aren’t aware of. They think not in basic programming constructs but in third-party libraries, services, patterns, and other idioms they’ve picked up along the way. They think at a higher level of abstraction; they learn these abstractions over time.
So we’re going to go with datetimes. And now that we’ve introduced a basic contract, we can look at what that contract is telling us.
It’s saying that there’s something special about these two dates. Something that we’re not modeling in our domain model (yet). Let’s reify that. Think about this change:
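Here is a minimal sketch of that change, assuming the attrs library. The Stock placeholder, the ticker parameter, the validator name, and the choice to raise ValueError are assumptions made for illustration.

```python
import datetime

import attr


class Stock:
    """Placeholder for the Stock domain object (assumed for illustration)."""


@attr.s(frozen=True, auto_attribs=True)
class DateSpan:
    """Two dates that travel together, with begin never after end."""

    begin: datetime.date
    end: datetime.date = attr.ib()

    @end.validator
    def _check_order(self, attribute, value):
        # Runs when a DateSpan is created; begin is already set at this point.
        if value < self.begin:
            raise ValueError("begin must not be after end")


def make_stock(ticker: str, span: DateSpan) -> Stock:
    """Outline a stock observed over a DateSpan (ticker is hypothetical)."""
    # No ordering check here: a DateSpan is in order by construction.
    return Stock()
```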
Let’s talk about this piece by piece.
First, we changed the signature of make_stock to take a new class: the DateSpan. The DateSpan is just two dates. Now, these dates travel together.
We also moved the precondition to the DateSpan object itself. You can find more information on that syntax at the attrs website, but it basically runs the check whenever a DateSpan object is created.
Finally, we set the DateSpan object to “frozen.” This makes any change to begin or end throw an error. Elsewhere this is called “immutable”; making data unable to change makes a lot of reasoning about code more straightforward.
In our case, it means a precondition becomes an invariant. Preconditions are assertions we run at the beginning of a method or function. Invariants are assertions we “run” at the beginning and end of every method on an object. They’re basically saying, “This is true of the object’s data going in, and it’s true of the object’s data going out.”
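To make that concrete, here is a sketch of what checking an invariant by hand looks like on a mutable object. MutableDateSpan and move_end are hypothetical names used only for this illustration.

```python
import datetime


class MutableDateSpan:
    """Hypothetical mutable span, shown only to illustrate invariant checks."""

    def __init__(self, begin: datetime.date, end: datetime.date) -> None:
        self.begin = begin
        self.end = end
        self._check_invariant()

    def _check_invariant(self) -> None:
        # The invariant: begin never falls after end.
        assert self.begin <= self.end, "begin must not be after end"

    def move_end(self, new_end: datetime.date) -> None:
        self._check_invariant()  # true of the data going in
        self.end = new_end
        self._check_invariant()  # true of the data going out
```

Every method that touches the data needs the same pair of checks, which is the bookkeeping that grows as the class does.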
However, if the object is immutable, you only have to check the relationship once at the object’s construction. Everywhere else in your code, you can rely on that check. This greatly simplifies what you have to consider. In our case, make_stock doesn’t have to ensure that a DateSpan is in order. DateSpans are, by definition, in order.
Immutability? Preconditions? Invariants? I need to model dammit! I don't have time to understand all this. Can't I just reach out to you?
We’re developing a domain model: Indicator, DateSpan, Stock, Performance. We’re encoding new, rich types where they appear. We’re encoding things into contracts and assertions where we can. We’re trying to write out the docstrings to drive our thinking forward.
This has allowed us to ask well-crafted questions--like, what is stock performance?--that, in turn, drive our research forward. In this case, it’s allowed us to refactor our contracts from preconditions to invariants and introduce new domain objects.
Let’s check in with our journal.
We’re chiefly done with outlining the stock domain objects. We ultimately know performance will be based on the stuff from yfinance. We haven’t thought too much further about that.
We still have some outstanding research questions. But because we’ve used our domain model to drive out some concepts, we know these remaining questions primarily focus on performance, classification, and regression.
Let’s Google “stock performance regression.”
Without clicking on any results, I see a few things with the title “returns.” That’s a new noun and not in our domain model.
I Google “what are stock returns.” I get a result from Investopedia saying it’s a combination of dividends and price changes. This is a good point and something we’ve overlooked. Where do we get dividend data? Is it in yfinance?
I see from a NerdWallet result that the average stock market return is 10%. That sounds about right; that percentage is what we’re trying to track.
Oops... It’s about here that I realized, as an author, that all my journaling used the term “returns.” That’s because I’ve thought about this problem before and already know how to frame the design. Just pretend that in the previous journals I used the term “SPY’s performance.”
So we’ve got some good research questions and an idea of how to push our domain model further. We need to think about how to outline “returns.”
I Google “how to calculate returns of a stock.” This gives me a relatively simple formula: ending date’s price minus beginning date’s price over beginning date’s price. There’s some stuff about splits in there, but we just take a note of that for now.
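That formula can be sketched in a few lines. The function name simple_return and the positive-price precondition are assumptions made for illustration, and splits and dividends are deliberately left out.

```python
def simple_return(begin_price: float, end_price: float) -> float:
    """Price return over a span: (end - begin) / begin.

    Ignores dividends and splits, which are only noted for later.
    """
    # Precondition: a non-positive starting price makes the ratio meaningless.
    assert begin_price > 0, "begin_price must be positive"
    return (end_price - begin_price) / begin_price
```

For example, a price that moves from 100 to 110 gives a return of 0.10, or 10%.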
We have our DateSpan object. That’s quite a coincidence; it has a beginning and an end too. So we have a return between a beginning and an end… but that’s just one return. How would we see if our RSI indicator could predict that?
When we Google what a regression or classification problem is, we see people talk about large data sets. A lot of pictures of flowers and what those flowers are. A lot of people’s education and those educations’ predicted salary levels.
The concept of the return feels like it’s on the right track, but we need more returns. Not just one from a single DateSpan.
What if we had a lot of stocks? Then each stock has an RSI and a begin and end date, and we have our data. We know that with regression and classification we need endogenous variables--the thing we want to predict. In this case, we’re starting to settle on returns. And we have exogenous variables: the RSI. That seems about right.
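A sketch of what that data set might look like, one row per stock. The Observation class, its field names, the tickers, and every number here are hypothetical placeholders, not market data.

```python
import datetime
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Observation:
    """One row of data: a stock's RSI and its return (hypothetical shape)."""

    ticker: str
    begin: datetime.date
    end: datetime.date
    rsi: float           # exogenous variable: what we predict from
    stock_return: float  # endogenous variable: what we want to predict


def split_variables(rows: List[Observation]) -> Tuple[List[float], List[float]]:
    """Separate the rows into (exogenous, endogenous) columns."""
    exog = [row.rsi for row in rows]
    endog = [row.stock_return for row in rows]
    return exog, endog


# Placeholder values for illustration only; these are not market data.
begin, end = datetime.date(2021, 1, 4), datetime.date(2021, 2, 1)
rows = [
    Observation("AAA", begin, end, rsi=35.0, stock_return=0.04),
    Observation("BBB", begin, end, rsi=72.0, stock_return=-0.01),
]
exog, endog = split_variables(rows)
```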
But our journal says to predict the returns of SPY, a single index fund.
We’ve got two options here.
The discussion with the code has driven new research (e.g., Googling) and pointed in a direction that was not in our journal. When we first formulated our problem, predicting SPY’s performance intuitively made sense. We wanted to, for instance, know when to buy or sell SPY.
But now, after some domain modeling and research, our code is telling us maybe that’s not a straightforward concept. Perhaps we misunderstood how trading works. Maybe we need to buy and sell different stocks based on the RSI?
You can actually go either way here. You can decide that, no, the original intuition still makes sense. That means we’re missing something in our domain model. Or, you can say the initial requirements don’t make sense, and we need to reframe things.
We’re going to go with the first option. We’re still missing something. People decide to buy a stock one day and sell it on another, and they seem to make money. That’s what we’re trying to model here. So our domain model is still missing something.
For that missing something, tune in next time!