Software Engineering for Data Scientists: Outlining and Research
In our last post, we were trying to decide whether we misunderstood something about our plan to trade a stock based on the RSI.
We needed more data–literally. We had a way to generate a single Performance object for our target symbol, SPY. But to run a regression or classification, we needed many Performance objects.
One way to do this is to get a Performance object across many symbols. That would have been a significant change to our goals. Is there another way? Our intuition told us that people day trade single stocks using the RSI. Surely there’s a way to make a model around that?
We’re going to pursue that intuition. Rather than changing our plan, we’re going to stick with it. We’ll treat this interruption as a sign that we are missing something in our understanding of stocks.
Let’s go back to that Googling we did earlier; maybe there’s something in deeper results.
In my case, the 7th search result uses the term “daily return.” That’s new and different from the other sites, which use the words “total return.”
Aside: You’ll notice that a lot of what we do with outlining is some research, identifying concepts that might be useful to us, and then checking whether those concepts are in our code. Noticing which docstrings we can write a lot about and which we can’t helps us explore the domain and pick the concepts we need to solve our problem.
You may go back and forth a bit before getting some momentum. In our current example, we are 7 search results deep looking for a new concept. Often, different results will cover different ideas. You’ll want to be on the lookout for new words. Other times, you’ll have to reframe your search. Finally, you may need to use your docstrings, contracts, and types in a peer review to generate new questions to keep you going.
Daily returns use the same formula as our total returns but are calculated daily. Well, there are a lot of days! So that means a lot of endogenous daily returns for our regression/classification problem.
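Here is a minimal sketch of that idea. It assumes our total return is the simple return, (end - start) / start, applied to each pair of consecutive closing prices. The function name daily_returns and the prices are hypothetical, not taken from our actual code.

```python
from typing import List


def daily_returns(closes: List[float]) -> List[float]:
    """Return the simple return between each pair of consecutive closes.

    Assumption: "return" means (today - yesterday) / yesterday.
    """
    if len(closes) < 2:
        # Fewer than two prices means there is no day-over-day change.
        return []
    return [
        (today - yesterday) / yesterday
        for yesterday, today in zip(closes, closes[1:])
    ]


closes = [100.0, 110.0, 99.0]  # hypothetical closing prices
returns = daily_returns(closes)  # one fewer value than closes
```

Note the shape: n closing prices give n - 1 daily returns, which matters once we have to line these up against indicator values.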
Remember, this is a lot like dimensional analysis. In fact, a lot of coding is just trying to think through dimensional analysis. To do a regression or classification, we need a “bunch” (a list or array) of data called the endogenous data, and then a “bunch” (a list or array of the same size) of data called the exogenous data.
So it’s not just typing, like dates versus strings, that helps us do this dimensional thinking. It’s also the shape of the data–in this case, a list of something. Finally, it’s also contractual thinking: these two lists must be the same size.
Thinking in this way guides our outlining. We can add contracts and types. For instance, we now know that our performance class may need to change to a daily_returns class. Or perhaps performance generates daily returns.
But we also know that our is_predictive function will change to require a list of data for our exogenous variable (which our indicator class will need to provide) and a list of data for the endogenous variable (which our stock class will need to provide).
We’ll also need to insert a contract on our is_predictive that these two lists align.
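As an outline, that might look something like the sketch below. The parameter names and the use of ValueError are assumptions for illustration, and the body is deliberately left unimplemented because we haven’t chosen a model yet. Only the shape and the contract are stated.

```python
from typing import Sequence


def is_predictive(
    exogenous: Sequence[float],
    endogenous: Sequence[float],
) -> bool:
    """Outline only: say whether the exogenous data predicts the endogenous data.

    exogenous: indicator values (hypothetically supplied by the indicator class).
    endogenous: daily returns (hypothetically supplied by the stock class).
    """
    # Contract: the two lists must line up one-to-one.
    if len(exogenous) != len(endogenous):
        raise ValueError(
            f"expected equal lengths, got {len(exogenous)} and {len(endogenous)}"
        )
    # The regression or classification itself is not decided yet.
    raise NotImplementedError("model not chosen yet")
```

Checking only the lengths is the weakest version of “align.” It says nothing about whether the values belong to the same dates.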
So we discovered quite a few things here. For one, we must decide how our is_predictive deals with datespans. Because we are outlining and know the types each function should take, we know we need a datespan. That raises a question we hadn’t considered until now: is this function time-sensitive?
There are merits to thinking that way; but for now, we’ll note that we need to get the beginning and end of all the stock data. We’ll use it all–all the stuff yfinance has.
The other issue is that we don’t have beginning or end dates on a mere list.
Let’s put a pin in that for now, do some lint clean-up, and get this new iteration up for peer review. It’s been a while.
Per your partner, looking at this naively, it seems that you’re trying to ensure the two lists have the same dates but only checking two dates. So even in your ideal domain model, where you create a new class to represent a list of data with two dates, what the code asserts isn’t actually what you mean to assert.
Your domain model will need something where it has a list for the dates and a list for the data, and the two lists need to be the same size.
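A rough sketch of what your partner is describing might look like this. The class name DatedValues, its method, and the sample numbers are all hypothetical. It is one possible shape for the idea, not the design we settle on.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class DatedValues:
    """Hypothetical container: one value per date, in the same order."""

    dates: List[date]
    values: List[float]

    def __post_init__(self) -> None:
        # Contract from the review: the two lists must be the same size.
        if len(self.dates) != len(self.values):
            raise ValueError("dates and values must be the same length")

    def same_dates_as(self, other: "DatedValues") -> bool:
        # Compares every date, not just the first and the last.
        return self.dates == other.dates


# Made-up indicator values and returns for two trading days.
rsi = DatedValues([date(2021, 1, 4), date(2021, 1, 5)], [55.0, 61.0])
rets = DatedValues([date(2021, 1, 4), date(2021, 1, 5)], [0.01, -0.02])
aligned = rsi.same_dates_as(rets)
```

Even this small class has to carry two contracts (equal sizes within one object, equal dates across two objects), which is a hint that someone has probably solved this already.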
This seems complicated. There’s a lot of alignment you’ll need to do. Let’s see if there’s some pattern, idiom, or third-party library that helps us here.
Googling “data with dates Python” gets us a lot of stuff on the datetime module. We want to emphasize that we’re not just dealing with date data; instead, we’re dealing with a lot of data with dates.
Googling “lots of data with dates Python” gets more datetimes. But here, we see a new word on the first result: Time Series.
Believe It or Not, We’re Done
This has been chiefly a fairy tale. It all seems so obvious in retrospect because I used an unfamiliar domain. None of this is something you could have done in an afternoon.
Moreover, while we did go over options when looking at dates (strings, our own date objects, or Python’s datetime.date), we didn’t emphasize that at basically every step, there are many options. We didn’t show how many inevitable false starts and dead ends there are.
Journaling–and the Mikado method we discussed–helps us deal with these false starts. You need to know when to abandon a path. Outlining helps us make a lot of progress in those Pomodoro sprints. We don’t need to figure out exactly how each step will work, but roughly the shape. This can help us determine whether we’re on the right track without committing to anything.
When we blow away a design that won’t work, we aren’t losing a lot of effort since we won’t reuse that code. We maintain the benefits of thinking about the shape of those types; all the work that went into docstrings and the like will help us in the future.
Outlining leverages code as conversation. Combined with journaling (or even streaming yourself for rubber ducking), we’re asking questions. We are using the computer to tell us where the holes are and using that feedback to ask better questions.
We are trying to ensure that all the concepts we consider are encoded as types. We are also trying to ensure their relationships with one another are encoded. These relationships can be methods or attributes of those types so that we get the benefit of the type checker to help us along our way. Again, we can now move at lightning speed between Pomodoros to help define the shape of the code: what types are inputs, what types are outputs, and what is plugged into what.
The inputs we’re missing, well, that’s a design problem we need to solve. Are we on the right track? Wrong track? Is there a 3rd party library we can drop in here, or do we need to start drafting to get these interfaces aligned?
Outputs we’re dropping. That’s a code smell, but it happens often. We don’t always need everything a third party is providing. It’s still pretty satisfying when everything aligns.
Our docstrings are our eventual documentation and another way to state our solution–except in English. And while we can’t use the type checker on that English, we can continually look for nouns and verbs we’re using and know that those need to be in the code eventually. Those nouns and verbs will have types we need to add.
The problem we chose to solve is a time series regression problem. By knowing those words (time series regression) and knowing about pandas (and probably scikit-learn, which wouldn’t have been hard to find with some Googling), you’ve got the 3rd party libraries you need. By learning about yfinance (and quandl, which also shows up in Google searches), you know where to get the data.
By not simply using the docstrings to help drive the domain model but also experimenting with new keywords you find, you will eventually “get there.”
Over time, you’ll build up a sixth sense for the kinds of keywords you’ll need. You’ll see where a problem is well-trodden and that there are many libraries for it. Lots of libraries mean you’ll need to know lots of interfaces to get your domain model/dimensional analysis working.
So much of code is “knowing what to Google.” And so much of knowing what to Google is Googling as best you can and keeping track of entirely new concepts. Sometimes you’ll get a hint. For example, I had to change my phrasing of “lots of data” quite a bit to obtain time series in the search results. You might also ask a friend or do more basic research. You’ll find “time series” pretty quickly if you just Google how to quantitatively analyze stocks. I wanted to show that you can find it while building a rich domain model along the way.
We’ve seen enough of outlining to see how useful it can be. We’ll talk next week about how it interacts with Drafting and Journaling.