Software Engineering for Data Scientists: Introduction to Outlining

John Graham
Jun 14, 2022

That's what you've written down. That's the first step of your journaling process.

You're trying out our "journaling" and "drafting" techniques on a little Python script you're writing to help you make investment decisions. You're now trying to see if the relative strength index helps predict whether to buy or sell the SPY index fund.

You are thinking about switching to drafting; but you don't feel like you'd know where to begin. Maybe you start with something like:

Let's do our drafting steps. There are some errors here. Let's add some docstrings, create some dummy functions, fix a spelling error, and rename a function to conform to pylint.

This is good. The linter flags some new warnings, so there is more stuff for us to fix. We'll go ahead and do that.

The remaining errors are unused arguments; you can't fix those without making progress. But you're still not sure what the next step is. However, just this amount of drafting has fed into your journaling.
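To make that state concrete, here is a minimal sketch of what a draft at this stage might look like. Apart from is_predictive, which comes up again later in this post, every name below is a hypothetical stand-in and not the original script. The dummy functions have docstrings and return placeholder values, so about the only thing a linter would still be expected to complain about is that the arguments go unused.

```python
"""Draft: does the relative strength index help predict SPY? (illustrative sketch)"""


def calculate_rsi(prices):
    """Return the relative strength index for a series of prices.

    Dummy implementation: `prices` is not used yet, which is the kind of
    unused-argument warning that stays until real progress is made.
    """
    return 50.0


def is_predictive(ticker):
    """Return True if the RSI helps predict whether to buy or sell `ticker`.

    Dummy implementation: always False until the analysis is written.
    """
    return False


def test_is_predictive():
    """Acceptance-style check: for now, only that we get a yes/no answer."""
    assert isinstance(is_predictive("SPY"), bool)


test_is_predictive()
```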

What now? Well, let's stay in journaling for a bit. I think we can break this apart further.

After some Googling on the first, you get a good reference on calculating the RSI. The second bit of research is more complicated. All you've learned is that there are two kinds of problems -- classification and regression.

You're a little stuck. You think you can start tackling the RSI issue; but the problem doesn't sound interesting right now. You want to figure out this gap in classification and regression. You commit what you have and ask for some peer review to help figure out where you're headed.

You can't find anyone who knows about this stuff, so the peer review feedback is a bit harsh.

These are all excellent questions. You can add the answers to the documents; but no place seems to fit. You don't want to describe what a stock is in the test for the RSI. You don't want to talk about what stock performance is in the is_predictive function.

A useful design principle we're applying here is to picture what our final documentation will look like after running a tool like Sphinx, and to shape our code so that the documentation comes out coherent.

You've done journaling; you've done drafting. You've asked for feedback. You feel stuck on a few different angles. The problem feels too big for drafting; but you can't break it down anymore through journaling. Is the answer just banging your head against the wall of statistics and machine learning until you understand the next step? And none of that will answer your friend's questions.

Enter Outlining

You can actually continue to make progress. The more progress you make in the code, the better your understanding of the problem and the better questions you'll ask.

Code is a conversation. You're trying to describe a problem so concretely that a computer can solve it. That means you've got to understand the problem very well. And if you don't understand the problem very well, then coding can help you understand the problem well. But it doesn't just have to be trial and error!

When drafting and breaking down what you wanted to do earlier, you considered what functions you needed. That's very good.

From a design standpoint -- and this is not always true -- functions are often seen as "verbs." That means if you're doing acceptance test-driven development and breaking things down into functions, you're fighting with one hand tied behind your back. You aren't thinking of what nouns to introduce.

This idea of nouns in design comes from Object-Oriented Analysis and Design (OOA/D).

Whether you start at the top with a requirements document or clean user stories, or just use a trial-and-error approach, the conversation with the code will give you ideas about which nouns you'll need to solve your problem. If you started from the top, pick out some of the nouns from your requirements or user stories and see how well they fit.

If you're working by trial and error, see which nouns come out of the ideal documentation and whether they fit.

What's the ideal documentation? Well, it's the docs you generate by trying to clean up linter errors, plus the docs you'd generate by answering any remaining questions that weren't already answered. Your idea of the "ideal documentation" will improve over time with experience, but you'll always be able to start somewhere with the linter+peer review approach.

As we stated above, you know what you need to describe -- what a stock is, what a stock performance is -- but you don't know where. Let's use that discomfort of "I don't know where this belongs" as a sign that we need to create a place just for that documentation.

For verbs, we create functions. For nouns, we make classes.
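A minimal sketch of what that might look like, assuming the noun is a stock. The class name Stock and the docstring wording are illustrative, not the original code. The class has no behavior yet; its only job is to give the documentation a home.

```python
class Stock:
    """A share of ownership in a company or fund that can be bought and sold.

    Only the parts of a stock that matter for asking "does the RSI help
    predict performance?" belong here.
    """


# The description now lives on the noun itself, where help() or a
# documentation generator can find it.
print(Stock.__doc__)
```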

Such little code, but it unlocks so many questions.

What goes in there?

What is a class, anyway?

You know what a function is -- a way to package reusable behavior. You also understand what a module is, whether you know you do or not. A module is a container for functions, and you get one using the import keyword.

import pandas

makes all the functions in the pandas module available under the pandas name, for instance.

There are also packages, which are bundles of modules. But we're not going to go into those right now...

Imagine a module with a collection of functions that all did helpful things. Maybe it's a linear algebra library. It might have the following functions:

def mat_product(matrix1, matrix2):
    """Multiply two matrices together."""
    ...

def mat_transpose(matrix):
    """Transpose a matrix."""
    ...

def determinant(matrix):
    """Return the determinant of a matrix."""
    ...

Notice how the first argument for each of those functions is a matrix. In fact, the entire module is full of functions that take and perform operations on matrices.

We could say that if you ever have a matrix, you know you can find the transpose and the determinant; and if you have two matrices, you can find the product. Basically, rather than thinking of functions and the arguments they take, you can think in terms of what you have -- and what functions you can call using it.

This can be combined with some syntactic sugar, similar to how you might pull a function out of a module. So, to call mat_product from your matrix module, you might say...

import matrix_math

# Assuming the Matrix type is also defined in matrix_math
x = matrix_math.Matrix(...)
y = matrix_math.Matrix(...)

answer = matrix_math.mat_product(x, y)

To call it in a different syntax -- an "object-oriented" syntax -- it might look like...

x = Matrix(...)
y = Matrix(...)

answer = x.product(y)

# or even

answer = x * y

In other words, modules group functions. Classes group functions too, but they also group data: a class bundles the data together with the functions that operate on that data.
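Here is a minimal sketch of how a class could support both spellings shown above. This Matrix is a toy written for illustration (a list of rows with no validation), not a real linear algebra library. The data lives in self.rows, the functions that operate on it become methods, and the special method `__mul__` is what lets `x * y` work.

```python
class Matrix:
    """A small 2-D matrix stored as a list of rows (illustrative only)."""

    def __init__(self, rows):
        # The data the class groups together with its functions.
        self.rows = [list(row) for row in rows]

    def transpose(self):
        """Return a new Matrix with rows and columns swapped."""
        return Matrix(zip(*self.rows))

    def product(self, other):
        """Return the matrix product of this matrix and `other`."""
        columns = list(zip(*other.rows))
        return Matrix(
            [sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in self.rows
        )

    def __mul__(self, other):
        """Let `x * y` mean the same thing as `x.product(y)`."""
        return self.product(other)


x = Matrix([[1, 2], [3, 4]])
y = Matrix([[0, 1], [1, 0]])

answer = x * y
print(answer.rows)  # [[2, 1], [4, 3]]
```

Multiplying by y swaps the columns of x, which makes the result easy to check by eye.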

Terminology time! An "instantiated" class is called an "object." So, you might define a class in code, like class Matrix. But when you need to use it, you do something like x = Matrix(args...). x, in this case, is an "object" of type "Matrix."

Stay tuned if you're still unsure what problem classes solve. I just wanted to cover some nuts and bolts that many data scientists aren't always exposed to.

Back to Outlining

So we introduced our first noun, a "Stock."

There are a lot of places we can go from here. We can talk about what makes a stock or what a stock does. I want to start with who uses a stock. If we have a better and better idea of who uses a stock, we can get a better and better idea of what a stock does. Then, we'll know what makes a stock, given that we'll need the data to solve the problem of what a stock does.

But, as you'll see, we can do this recursively. The more we discuss who uses a stock, the more we ask stocks to do. And the more we ask stocks to do, the more we'll find out other nouns that exist that we need.

First, let's talk about the wrong way to go about this: you try to encode what a stock is based on your idea of one.

We start thinking about what a stock is and come up with all kinds of details: they're companies, they have financial statements, there are multiple exchanges, there are foreign stocks...

The problem is, we can do this all day. The domain of stock trading is incredibly vast. We need to focus on the part of the domain that matters to our problem. We want to keep our code malleable so that we can add more detail later if needed; but until then, we only want to put the parts of the domain there that we need to solve our problem.

So, when you're adding nouns and verbs -- and writing documentation to think about what it is you're doing -- keep in mind the context of the problem you're solving. Indeed, the part of the domain that matters to our problem is often called the "bounded context."

Who uses stocks?

In modern Python, as in many statically typed languages, we can indicate who uses what and have it checked for us by a program. This is called "typing" -- as we explicitly name the types of our arguments. Languages that check types before you run your program are called statically typed, while languages that figure out the types on the fly are called dynamically typed. Finally, languages like Python that allow you to add types as necessary are called optionally (or gradually) typed.

Types are like -- almost exactly like, in fact -- the "dimensional analysis" you may have done in physics or chemistry. If you need newtons at the end of your calculation, you know that somewhere you must multiply mass by acceleration. And if you need acceleration, you know you'll need meters per second squared.

Types allow the program to check that all the arguments make sense. Functions in a typed language become like little factories that take in specific inputs and promise certain outputs.
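As a sketch of the "little factory" idea, the units example can be written with type hints. The names Kilograms, MetersPerSecondSquared, Newtons, and force are made up for this illustration; typing.NewType is a standard-library helper that gives a plain float a distinct name as far as the type checker is concerned.

```python
from typing import NewType

# Distinct names for plain floats, so a checker can tell the units apart.
Kilograms = NewType("Kilograms", float)
MetersPerSecondSquared = NewType("MetersPerSecondSquared", float)
Newtons = NewType("Newtons", float)


def force(mass: Kilograms, acceleration: MetersPerSecondSquared) -> Newtons:
    """Take a mass and an acceleration; promise a force in newtons."""
    return Newtons(mass * acceleration)


push = force(Kilograms(2.0), MetersPerSecondSquared(9.8))
print(push)

# A type checker such as mypy would be expected to reject the call below,
# because a bare float has not been declared to be Kilograms:
# force(2.0, 9.8)
```

At runtime these are still ordinary floats; the extra names only matter to the checker.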

Let's see where we'd need them without figuring out what stocks do yet or what they're made of.

Notice that we denote the type of an argument with the

argument_name: type

annotation in our functions.

But wait! We found an issue! We're passing in strings where we expect stocks. If you run mypy or PyCharm's checker, you'll get these type errors flagged.
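The mismatch can be sketched as follows. This is an illustration under assumptions, not the original script: Stock is left empty because we haven't decided what a stock does or is made of, and is_predictive has a placeholder body.

```python
class Stock:
    """A stock, as far as this project cares about one (details to come)."""


def is_predictive(stock: Stock) -> bool:
    """Return True if the RSI helps predict this stock's performance.

    Placeholder body: the analysis itself is not written yet.
    """
    return False


# Matches the annotation: we pass a Stock.
is_predictive(Stock())

# This runs fine in plain Python, but a checker such as mypy would be
# expected to flag it: "SPY" is a str, and the annotation asks for a Stock.
is_predictive("SPY")
```

Python itself ignores annotations at runtime, which is why the second call only shows up as a problem once a type checker reads the file.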

This tells us a few things and drives our design forward! Let's denote this in our journal.

(We left the other to-dos and in-progress items out for brevity and focused on what we discovered in thinking about stocks.)

Let's tackle the second TODO: Should functions take strings or stocks?

We'll do this in next week's entry: Outlining, Types, and Design! We've only scratched the surface of this technique, so stay tuned.
