Software Engineering for Data Scientists: Outlining, Types, and Design

John Graham
Jun 23, 2022

Updated: Jun 24, 2022

When we last met on Outlining, we ran into an issue: should our functions take strings or stock objects?

Pushing forward good design discussions like these (rather than delaying them) is a massive benefit of this technique. So let’s dive in!

What should a function expect?

There are two designs we want to consider here. Let’s start with the get_relative_strength_index function. The type says we should expect a stock, but we passed a string earlier. The first design uses a string, while the second uses our stock.

What are the merits of the first design?

Well, there’s some intuitive aspect. After all, it’s how we originally coded things. And if you think about stock symbols a lot, you think of those letter codes; so it’d be nice to pass things around like that.

What are the merits of the second design?

We want our functions to do as little as possible. The string design means the function must:

Do a lookup of the stock from its symbol
Calculate the relative strength index

The second design only does one of those things – the stock lookup has already been done. While it’s hard to know where designs will head, you might also think, “There will be other technical indicators I want to build, and I’ll be able to reuse a stock. But if I stick with strings, I would have to add the stock lookup logic to each one.”
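To make the comparison concrete, here is a minimal sketch of the two candidate signatures. The `Stock` class, the `lookup_stock` helper, the `_KNOWN_SYMBOLS` table, and the `_from_symbol` suffix are hypothetical stand-ins, and the RSI calculation is only a stub.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stock:
    """Hypothetical stand-in for the article's stock type."""
    symbol: str


# Hypothetical lookup table; a real version would query a data source.
_KNOWN_SYMBOLS = {"SBUX": Stock("SBUX")}


def lookup_stock(symbol: str) -> Stock:
    """Resolve a ticker symbol to a Stock."""
    return _KNOWN_SYMBOLS[symbol]


def compute_rsi(stock: Stock) -> float:
    """Placeholder for the real relative strength index calculation."""
    return 50.0  # dummy value so the sketch runs


def get_relative_strength_index_from_symbol(symbol: str) -> float:
    """Design 1: does two jobs, the lookup and the calculation."""
    stock = lookup_stock(symbol)
    return compute_rsi(stock)


def get_relative_strength_index(stock: Stock) -> float:
    """Design 2: the lookup already happened, so only one job remains."""
    return compute_rsi(stock)
```

Any further indicator written in the style of design 2 can reuse the same `Stock` without repeating the lookup.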

The second design also has robustness built-in. If I called

get_relative_strength_index("SBUX")

on the first design, it should work. SBUX is the symbol for Starbucks. But if I made a typo and called

get_relative_strength_index("SSSBUX")

then it won’t work. SSSBUX isn’t a symbol for anything. I’ll either get a nasty and potentially incomprehensible error, or I’ll have to maintain error-checking logic around symbols inside get_relative_strength_index (which can get hairy).
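One way to picture the robustness argument: with the second design, the symbol check lives in a single lookup function, and every indicator downstream receives a stock that is already valid. This is a sketch under assumptions; `Stock`, `lookup_stock`, `UnknownSymbolError`, and the symbol set are hypothetical names, not the code from this series.

```python
from dataclasses import dataclass


class UnknownSymbolError(ValueError):
    """Raised when a ticker symbol cannot be resolved (hypothetical)."""


@dataclass(frozen=True)
class Stock:
    symbol: str


# Hypothetical; stands in for a real symbol lookup.
_VALID_SYMBOLS = frozenset({"SBUX"})


def lookup_stock(symbol: str) -> Stock:
    """The one place where symbol typos are caught, with a readable message."""
    if symbol not in _VALID_SYMBOLS:
        raise UnknownSymbolError(f"{symbol!r} is not a known stock symbol")
    return Stock(symbol)
```

Here `lookup_stock("SSSBUX")` raises `UnknownSymbolError` before any indicator code runs, so the indicators themselves need no symbol checking.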

There’s a final argument for the second design: avoiding primitive obsession, which is a “code smell” (an element of bad design). You rarely want integers, floating-point numbers, strings, or other “primitive types.” You usually wish to have richer types.

You rarely want integers; you want ages or zip codes. You rarely want floats; you want temperatures and longitudes. You rarely want strings; you want names and streets.
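A small sketch of what "richer types" can look like using only the standard library's `typing.NewType`. The names `Age`, `ZipCode`, and `can_vote`, and the threshold of 18, are made up for illustration.

```python
from typing import NewType

# Distinct names for the type checker; at runtime these are still int and str.
Age = NewType("Age", int)
ZipCode = NewType("ZipCode", str)


def can_vote(age: Age) -> bool:
    """Accepts an Age, so a type checker flags a bare int or a ZipCode."""
    return age >= 18  # hypothetical threshold


alice_age = Age(21)
alice_can_vote = can_vote(alice_age)
```

`NewType` only helps when a type checker is run; it adds no runtime validation. Wrapper classes that validate their values go a step further.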

Aside: Libraries like scikit-learn sometimes seem to have avoided this design, allowing strings and not being very strict about the types they accept. This has made those libraries harder to maintain, as a good chunk of scikit-learn code is dedicated to resolving what arguments mean. A different design would have taken advantage of language features to do that.

In fact, let’s go into our code and start talking about all the rich types we want to use.

We discovered a new type. We need an Indicator class!

We have been putting our new classes and functions at the top of the file, while drafting says you should put them in your test function. We put them at the top of the file, in this case, to ensure they’re clearly visible in our screenshots and for emphasis.

We need to add that!

We want to go ahead and add the docstring. Adding docstrings is part of the outlining method. In case that wasn’t clear, we had been adding them to fix linting issues. But docstrings also give us a chance to think about our problem in rich English and discover new nouns and verbs on the fly.

In the docstring, we use another noun that seems to have some domain value, so maybe we need to put it in. Performance.

This time, we must make explicit the relationship between a stock and its performance. Let’s add a bit to stock.

Okay, we did quite a bit of work in that one. First, we added the glorious attrs library. This allows us to make elementary classes easily. To use it, employ the decorator syntax @decorator on the class you want to define.

In the older version I use, you have to set “auto_attribs” to true, then use the regular typing syntax on your variables. (See the documentation on attrs for how to use the most recent versions.)

Caveat: Python has built-in “dataclasses” now covering some of what attrs does. However, attrs also includes things like validation and freezing. These are valuable features for our purposes.

So a stock now has-a performance.
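Roughly, the has-a relationship might look like the sketch below. To keep it dependency-free, the standard library's `dataclass` stands in for attrs here; with the older attrs style described above, the decorator would be `@attr.s(auto_attribs=True)` with the same annotations. The field name `performance` is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Performance:
    """How a stock has performed; its fields are not decided yet at this stage."""


@dataclass
class Stock:
    """A publicly traded stock."""
    performance: Performance  # a stock has-a performance


stock = Stock(performance=Performance())
```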

Let’s go back to drafting for a second. Some linting and typing issues have cropped up.

We try to stop passing the string to the functions and pass a stock instead. But that triggers a new issue: stock expects a performance, not a string. We should probably make the stock symbol a part of stock, too; but we’re still left with the question of where performance comes from.

We put some placeholder default Performance objects in there to get the linter happy, but we know that can’t work. It does, however, clarify some duplication. We can eliminate that.

Let’s update our journal because we’ve learned some things.

Believe it or not, our outlining has helped our quest. From the peer reviews and analysis of the documentation we were writing, we discovered the word “performance” is essential. Let’s iterate the journal more.

We go ahead and type in that query and take the first non-Medium article (because we’ve maxed out the number of Medium articles we can read this month; or maybe I have, and you haven’t).

This post mentions the yfinance library, which returns its data as pandas dataframes. It also uses the term “historical data.” This may be new to you if you don’t know anything about stock trading. If you do, you’d think: of course, that’s what I always meant by performance.

But you didn’t say that anywhere in your docs, and you didn’t mention it in the peer review. By being very explicit about our nouns and verbs, we’re discovering the hidden context we need to solve our problem better.

We now know that performance, in our case, will be looked at in terms of historical data. Maybe the growth in the stock over time. And that meshes well with our earlier research on the relative strength index. That index needs much of this data to be computed.

What does this historical data look like in yfinance? It’s a dataframe with a few columns. Each row corresponds to a date.
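To picture that shape without pulling in pandas, here is a plain-Python stand-in: one record per date, keyed by a date string. The `DailyBar` name and its fields are assumptions based on the usual open, high, low, close, and volume columns, and the numbers are made-up placeholders, not real prices.

```python
from typing import NamedTuple


class DailyBar(NamedTuple):
    """One row of historical data (hypothetical stand-in for a dataframe row)."""
    open: float
    high: float
    low: float
    close: float
    volume: int


# Keyed by date string, mirroring a date index. Values are made-up placeholders.
history = {
    "2022-06-22": DailyBar(open=10.0, high=11.0, low=9.5, close=10.5, volume=1000),
    "2022-06-23": DailyBar(open=10.5, high=12.0, low=10.0, close=11.5, volume=1200),
}

# Pulling out one "column" across all dates.
closes = [bar.close for bar in history.values()]
```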

Design Thinking

Design thinking is about using the tools product designers use in our day-to-day lives. In this case, it’s crucial to bring up the co-evolution of problem understanding and solution, which is a good design thinking principle.

If we decided to – from on high – divine the perfect data structure for “stock performance” from first principles, we’d probably take a long time. And we’d also probably come up with something worse than a dataframe of opens, closes, highs, lows, and volumes with a date index.

Just like code is a back-and-forth conversation, your research into integrating third-party tools is a conversation. Each option is going to affect the ultimate solution you take. And based on our design thinking principles, this is okay. You will get done faster and have a more extensive, more maintainable solution if you find a few third-party libraries that coexist well together and don’t violate the domain model you’re building up.

Back to Outlining

Regarding the back and forth conversation, reading a bit about this yfinance library has changed our journal.

So we need to think about how to handle dates.

The kind of date that yfinance uses is a string, but we’ve already talked ourselves out of using strings for everything. However, if we use something different, we’ll have to translate back and forth.

We could create our own date object that seems to follow the rich typing described above. But is this reinventing the wheel? What’s wrong with Python’s own datetime.date class?

Should we just use that?

Speak the Native Tongue

Maybe we just stay with strings. What would that look like?

First, we’ll need a function to download this stuff and put it in one of our state objects; so let’s add that.

Aside: Some might argue you can just put this logic in Stock’s __init__ method. I prefer keeping initializers simple – no more complex than what the built-in attrs initializer does, if possible. This eases testing since it becomes straightforward to create dummy objects. There are also often multiple ways to build an object – yet you can only have one __init__ method. Rather than overloading __init__ with a lot of argument parsing logic, just have multiple factory functions.
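A sketch of the factory-function idea, with the standard library's `dataclass` standing in for attrs. The names `stock_from_download` and `dummy_stock` are hypothetical, and the download is deliberately left unimplemented.

```python
from dataclasses import dataclass


@dataclass
class Performance:
    """Placeholder for a stock's historical performance."""


@dataclass
class Stock:
    # The generated __init__ only assigns fields; no parsing or downloading.
    symbol: str
    performance: Performance


def stock_from_download(symbol: str) -> Stock:
    """Factory for real use: would fetch performance from a data provider."""
    raise NotImplementedError("download logic is left for drafting")


def dummy_stock(symbol: str = "TEST") -> Stock:
    """Factory for tests: builds a stock without touching the network."""
    return Stock(symbol=symbol, performance=Performance())
```

Tests can call `dummy_stock()` freely, while the expensive path stays in its own clearly named function.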

So we know performance will need a stock symbol and two dates, a begin and an end, per our literature review. Let’s add them here and see what it looks like. We can draft any time and make the calls to yfinance. But since this is an outlining blog, we’ll focus on that for now.

We add the docstrings, and something becomes apparent.

We need to ensure that begin comes before end. This is an example of a contract. Like types, contracts are ways to describe valid data. However, contracts are checked at runtime and enforce other things that need to be true for our program to run correctly. Types are checked when we run the type-checker.

Right now, we can technically check this with a string comparison because of the format. But that emphasizes that we need to get our date strings precisely right for the code to work. The format must be zero-padded: 2022-12-13 for December 13th, 2022, and 2022-02-03 (not 2022-2-3) for February 3rd. If you forget to zero-pad, a date that should be earlier will seem later by sheer string comparison: the string 2022-2-3 sorts after 2022-12-13.
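The pitfall is easy to reproduce with plain string comparison. The dates are arbitrary examples.

```python
# Zero-padded strings in year-month-day order compare chronologically.
padded_feb = "2022-02-03"
padded_dec = "2022-12-13"
padded_ok = padded_feb < padded_dec  # True: "02" sorts before "12"

# The same February date without zero-padding compares the wrong way.
unpadded_feb = "2022-2-3"
unpadded_ok = unpadded_feb < padded_dec  # False: "2" sorts after "1"
```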

Regardless, let’s go ahead and add a contract on the relationship between the dates. This will be true no matter what design we choose.
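A minimal sketch of such a contract, written as a plain runtime check rather than with any particular contracts library. The names `check_date_range` and `download_performance` are assumptions, and the check relies on both dates being zero-padded YYYY-MM-DD strings.

```python
def check_date_range(begin: str, end: str) -> None:
    """Contract: `begin` must come before `end`.

    Assumes both are zero-padded "YYYY-MM-DD" strings, so string
    comparison matches chronological order.
    """
    if not begin < end:
        raise ValueError(f"begin ({begin}) must be before end ({end})")


def download_performance(symbol: str, begin: str, end: str) -> None:
    """Outline only: the yfinance call is left for drafting."""
    check_date_range(begin, end)  # runtime check, unlike a type annotation
```

A type checker would accept `download_performance("SBUX", "2022-06-23", "2022-01-03")` because both arguments are strings; only the runtime contract rejects it.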

So, to recap: Pros of string dates:

They work
Require no translation back and forth to the yfinance API

Cons of string dates:

Format is error-prone

Let’s try using our own date objects now.

The built-in comparisons from attrs should work if we put the attributes in the order we did (year, month, day). Moreover, we don’t have to worry about 0 padding when making these dates. The logic should be correct.
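As a sketch of why the attribute order matters, here is the same idea with the standard library's `dataclass(order=True)` standing in for attrs. Generated comparisons treat the fields like a tuple in definition order. The class name `Date` is an assumption.

```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class Date:
    """Hypothetical date type; field order drives the comparison."""
    year: int
    month: int
    day: int


# Compared as (2022, 2, 3) < (2022, 12, 13), so no zero-padding issue.
february_before_december = Date(2022, 2, 3) < Date(2022, 12, 13)
```

Had the fields been declared as day, month, year, the comparison would look at the day first and give wrong answers.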

However, in the spirit of being richly typed, something should have set off alarm bells: none of day, month, or year is really just an integer. The year comes closest, as 3000 and 0 are both valid years, though they may not pass a sanity check for our use case.

Days, though, as a hard rule, never go above 31, and the actual limit depends on the month. Months are numbers between 1 and 12 (or between 0 and 11 if you 0-index them). Neither is ever negative.

So, we’d probably need to introduce a day, month, and year type and set those conditions as contracts. You’ll often find this pattern: at your lowest level of types, you have wrappers around primitives like floats and integers that also put a layer of runtime protection on them. These contracts may be rigid – like a year can’t be negative – or soft, like “if a temperature above 150 is passed in our health app, something’s obviously wrong.”
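A sketch of what these low-level wrapper types could look like, again using the standard library's `dataclass` in place of attrs validators. The class names and exact bounds are assumptions; the per-month day limit is deliberately left out because it needs the month as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Year:
    value: int

    def __post_init__(self) -> None:
        if self.value < 0:  # rigid contract: no negative years
            raise ValueError(f"year must not be negative, got {self.value}")


@dataclass(frozen=True)
class Month:
    value: int

    def __post_init__(self) -> None:
        if not 1 <= self.value <= 12:  # 1-indexed months
            raise ValueError(f"month must be in 1..12, got {self.value}")


@dataclass(frozen=True)
class Day:
    value: int

    def __post_init__(self) -> None:
        # Hard upper bound only; the real limit depends on the month.
        if not 1 <= self.value <= 31:
            raise ValueError(f"day must be in 1..31, got {self.value}")
```

Even this small sketch is three classes of validation code, which is the kind of cost weighed below.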

Furthermore, we don’t have a way to generate the actual date strings yfinance needs, so we’ll need to add that.
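The missing piece might look something like this sketch. The method name `to_date_string` is hypothetical, and it assumes the zero-padded YYYY-MM-DD format discussed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Date:
    year: int
    month: int
    day: int

    def to_date_string(self) -> str:
        """Render as a zero-padded YYYY-MM-DD string."""
        return f"{self.year:04d}-{self.month:02d}-{self.day:02d}"


formatted = Date(2022, 2, 3).to_date_string()  # "2022-02-03"
```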

Pros of using our dates:

No format errors around 0 padding
More validation than string dates

Cons of our dates:

Needs more work: validation logic and string formatting

Let’s talk about that last one. Agile (or any adaptive process) tells us that when we try to estimate something, we should use our prior measurements as part of that estimation. When we try to guess how long a project takes: we look at similar projects, see how long they took, and use that information in our new guess.

The same can be said of a design. If a design needs more code added the more you look at it, it will probably need still more code in the future as new requirements are discovered. The code is “hot.”

Generally, we want our designs to be “hot” around a unique competitive advantage or something essential to the problem we’re solving. If we’re extending code with new features (like format strings or additional validation) and it isn’t logically tied to the core of our program, we’re not being as efficient as we could be.

This sometimes has to be done as there’s no other way. But design is a lot like strategy: focus on your strengths, partner on your weaknesses. Libraries are your partners in software design.

“What dates are valid dates?” certainly is a question you need to answer to perform financial analysis. But it isn’t as compelling as, “How do I predict the performance of the S&P 500?” Questions like that one are the core of the problem. We’d prefer the resulting domain model (the high-level view of the classes and modules we develop) to focus on introducing many concepts that help tear apart the stock market performance question. We’d also prefer not to have many classes and modules dedicated to tangential issues like date validation. More on this later!

We’ll consider one more approach – using Python’s built-in datetime objects – when we continue in the next post in this series. Stay tuned!
