By John Graham

Software Engineering for Data Scientists: The Drafting Technique

For years, I’ve been working on quantitative finance as a side project. I’ve often used it as a testbed for software engineering practices.

It’s given me a unique perspective: I watched as machine learning and AI were first treated as niche computer science topics and then slowly morphed into a new discipline called Data Science.

And while I’ve seen the tooling spring up around data science - most notably, the Notebook - and seen the good in those tools, at no point did that really compete with the workflow I was able to cobble together using my years of software experience. So I never really switched.

(I’m not arguing that no one should have used Notebooks. I just personally never met the threshold to cover my switching costs.)

Now I’m finding that more people are reaching the end of the sigmoid learning curve on Notebooks without really realizing what they’re missing from more traditional engineering methods. So I’m putting together a few posts to describe some tooling and workflows that work for me and emulate the best parts of Notebooks.

I will never claim that this workflow provides the same user experience as Notebooks. I’m not trying to replace them. Instead, it brings many of the benefits of Notebooks to a more traditional software engineering workflow.

Why?

Because many of the benefits of software engineering still appear to be lacking in Notebooks. At least, the best of software engineering is further from Notebooks than the best of data science is from the workflows I’m about to describe.

I also won’t claim to be up to date on Notebooks’ latest and greatest feature offerings or any plugins. Use them if you can get tests, linting, and REPL debugging in your Notebook!

A lot of this content originally appeared on my personal blog.

What are you missing by only using Notebooks?

Folks who just use Notebooks primarily miss the ability to scale. And I don’t mean running a machine learning model across a cluster of AWS EC2 instances.

Large code becomes cumbersome. Large models use a lot of code, so large models become cumbersome too. This scaling issue affects the usability of the code: it becomes brittle, hard to understand, and hard to extend.

Software engineering - or at least what I consider software engineering - is a set of techniques for resisting this ossification of code as it gets bigger.

Software engineering is an overloaded term with a not-so-spectacular history. That’s why I emphasize my particular definition: good tooling and repeatable methodology.

I’ll claim good software engineering practices slow the ossification of code in two ways:

  • Reuse keeps code small

  • Quality and tooling keep code malleable

I could go into the techniques that keep code small and the tooling that keeps it malleable one by one, but I’d rather show you a workflow that works for me.

Again, I want to emphasize that Notebooks can have a place in this workflow. I’m not trying to replace the Notebook. I am showing you a way to get a lot of the ease of traditional practices while also not necessarily leaving the benefits of the Notebook behind. You can decide at the moment which workflow suits your purposes best.

The Drafting Technique

Drafting combines the following software engineering techniques...

  • pseudocode

  • acceptance test-driven development

  • linting

  • auto-formatting

  • automated refactoring

  • debugging-REPL

  • contract-programming

...into a single system.

At first, it can be thought of as a step-by-step system. Once you get the hang of it, it becomes about choosing which technique makes sense for the problem you’re trying to solve.

Write a one-liner test

First, we’ll be writing everything within a single file, whether you’re maintaining a legacy system or doing something greenfield. We will try to put as much into that single file as possible.

This test will be an “acceptance test." We’re starting with the broadest view of our problem. We will use refactoring to pull unit tests out of this comprehensive test. We will let how we need the solution to behave drive our design decisions and refactoring decisions by starting with the test.

This may mean integrating your solution into legacy code comes later. But you’ll be integrating fleshed-out modules and packages at that point that are all well tested.

You’re going to write a single line: a test function whose name starts with test_ and whose only job is to print "hello world."
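Concretely, it can be as small as this (a minimal sketch; the function name is a placeholder, and pytest only cares that it starts with test_):

def test_my_new_feature():
    print("hello world")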

Set up your IDE to run this test; that will be our debugging target.

(By using such a tiny target, it becomes a little easier to set up your tooling and know whether you’re successful. You’ll want to run the test using PyCharm’s test runner under debug settings and be able to stop at a breakpoint inside the test. Then you’ll want to ensure you can get a Python Console at that breakpoint. You can find tutorials and tips for these smaller steps with some Googling.)

I use PyCharm, but an IDE is essential here to simulate a handy feature of Notebooks - getting the diagrams out quickly. They won’t be inline, which is a loss. However, the rapid iteration of REPL debugging allows you to get multiple versions of the diagrams out as you experiment.

In exchange for losing inline diagrams, an IDE like PyCharm can give you:

  • Linting issues beyond pylint or pyflakes, including type checking

  • Automated refactors

Keep your linter on and as hot as possible. Fix linting errors as they appear. Commit often and consider a WIP branch for folks to review as you go.

(Hot here means “leave all the rules on that you can.” Based on my experience with this workflow, it’s easier to turn the tool up to 11, let it find issues for you, and not quibble over the false positives. Just fix them and move on.)

Additionally, I’d set up an auto-formatter like Black. There is no sense in hand-fixing linting issues, like line lengths, that can be fixed automatically. And lines may get really long as you try to keep all your code in a single block.

“Uh… how do I test?”

I’m not going to go into the nuts and bolts of unit testing in this blog as various other resources will give it better coverage (Get it? No? Well, you soon will!) than I could.

However, I will point out a few things specific to testing models.

(A “test,” by the way, is just a use of the “assert” keyword, often followed by X == Y or some other predicate. I use pytest as my testing framework. This tool will go through your code, discover all functions named “test_xxxxx,” run them, and translate the assertions into something very human-readable.)
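For example (made-up numbers), a plain assert is a complete test, and when it fails pytest shows you the values involved:

def test_average_return_is_positive():
    returns = [0.01, 0.02, -0.005]
    assert sum(returns) / len(returns) > 0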

Save your Data!

You do not test on data that changes. This means you may need to can a few CSV versions of your data and save them off in Git as test data. This often also means you will mainly test on small data sets rather than large ones.

Any change to your test data will throw off your models’ performance, requiring you to go back in and double-check that things are still okay.

Often you can rapidly prototype on a large data set and then shrink the data set when you think you’ve got something worthwhile. You’ll have to double-check any failing tests; but at least you only have to do it once, and you know that the code you wrote will handle more data with ease.
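That shrink-and-save step might look something like this (a hedged sketch assuming pandas; the paths and sample size are placeholders):

import pandas as pd

full = pd.read_csv("data/full_history.csv")                  # the big exploratory data set
small = full.sample(n=200, random_state=0)                   # shrink it to something test-sized
small.to_csv("tests/data/small_history.csv", index=False)    # check this file into Git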

Use Random Seeds

Many machine learning and statistical algorithms are stochastic. You’re going to want to make those algorithms deterministic by setting their random seeds to a known value. This will ensure that, run to run, the tests still work.

Factor out the above!

You need to use your models on actual data, especially if you want them to be reusable; they may be used on all sorts of data sets. To support the “save your data” step above, you need to design code that accepts the data as an argument. Your functions should not go out to a known place on disk and find a CSV, query a database, or anything else.

Factor out the code where you get your data, then pass the data in to your modeling code. The modeling code doesn’t need to know where it came from.

This makes testing super easy. You just pass in the test data.
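A minimal sketch of that separation (all names here are hypothetical; the modeling code only ever sees the DataFrame it is handed):

import pandas as pd

def load_production_data(path="data/latest.csv"):
    # Only this function knows where the data lives.
    return pd.read_csv(path)

def fit_model(data: pd.DataFrame):
    # Modeling code takes data as an argument and never touches disk.
    ...

def test_fit_model():
    canned = pd.read_csv("tests/data/small_history.csv")
    fit_model(canned)  # smoke test: just make sure it runs on the canned data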

You’re also going to want to factor out the random seed since you’ll clearly want to be able to turn stochastic behavior on and off. I’d add it as a default arg akin to the following:

import random, time

def foo(data, random_seed=None):
    if random_seed is None:
        random_seed = time.time_ns()  # no seed supplied: fall back to the clock
    random.seed(random_seed)
    ...

This allows your test code to set it to a known value - say 0 or something - and remove stochastic behavior. However, you still get some excellent usability; the users of your models can just leave the random seed out, and it will “do the right thing.”

Start Broad, then Narrow

The broadest test - and, to be honest, the most useful one in my experience - is the “end to end smoke test.” Did your entire codebase run and not return an error?

This test provides a lot of value, especially if your codebase is lint clean (which is part of this method).

It also combines well with contract programming, which boils down to “you don’t have to assert only inside tests,” because all of your contracts will be exercised by this level of testing too.

This is true of the technique in general, as a software engineering technique. However, some caveats apply when narrowing it to data science.

You probably aren’t used to writing tests in Notebooks. All we’re going to do is this: whenever we’ve double-checked that something is sane, we will assert that it’s sane.

Did you just run an ordinary least squares, and one of your regressors has a p-value of less than 3%? And were you expecting (or hoping for) that?

Great - add a line to your code asserting that the same regressor maintains that p-value on all future runs. That’s now one of your tests (or contracts, depending on how you use it).
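In code, that might look like the following (a hedged sketch assuming statsmodels and pandas; the column names, the file path, and the 3% threshold are stand-ins for whatever you actually observed):

import pandas as pd
import statsmodels.api as sm

def test_spread_stays_significant():
    data = pd.read_csv("tests/data/small_history.csv")
    X = sm.add_constant(data[["spread"]])
    result = sm.OLS(data["returns"], X).fit()
    # We saw p < 3% while exploring; pin that down for all future runs.
    assert result.pvalues["spread"] < 0.03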

You can undoubtedly do standard software-style unit tests - pass known values in and check that you get known values out. But often, you’re exploring data. Testing in data science is more of a conversation between you and the data. Whenever you’re running a statistical test or measuring the model’s performance, you’re checking it against what you, as a human, know about the data.

When you’ve got something that you feel is meaningful, capture it with an assertion. And yes, statistical tests can be tests - add them as contracts or tests!

Peer review the tests

If you test as I mentioned above, you’ll find that peer reviews become basically reviews of experiments. You found a p-value of 3% or less. Now because you’re asserting it, you’re telling another human being that you found it. This allows them to quickly gut-check your assertions as part of your methodology.

“Uh… what’s a linter?”

A linter is a tool that runs on a code base and flags things that look suspicious. They use techniques of varying levels of power - from simple pattern recognition to building the actual abstract syntax tree of the code and trying to find logic errors.

A great linter comes built into PyCharm. But linters are like people at a party - the more you can invite without a disagreement, the better! I like to run pylint as well (though it can be a little slow, so I run it right before I’m ready to commit).

PyCharm will do type checking as well. This is another form of static analysis (finding errors in code without needing to run it), while tests are a form of dynamic analysis. We’re not going to leverage it heavily in this blog post, but we’ll cover how you can make types work for you in a future one.

Is this all a bit overwhelming? Do you need a more hands-on approach to becoming a more effective Data Scientist with the Drafting technique? Don’t hesitate to reach out!

Then Pseudocode

Let’s get back to the drafting method.

State your problem and your idealized solution in English, all within the comments of this test. You can use docstrings if you want; but comments may be better as these comments will get spread out.

Putting a single docstring at the top of your test that describes your overall goal for your feature is wise if you can pull off a way to do it briefly.

It’s best to learn the hotkey to commit to your branch and commit as often as possible. Indeed, the end of a pseudocoding session is a good time. But if you need to take breaks as you try to capture as much of your design doc in the code as you can, do so.

While some of these comments may be deleted in time, this commit will serve as a lighthouse you can later return to in order to see the entire thought at once. Others can also use it to understand your thinking when you put the code together.

Rather than merely describing your solution, you’ll want to describe the entire problem and solution. When we do the next step, we’ll need to know how to start putting together test data to represent your problem. Additionally, any time you think anything should be true – really, almost after any line – put a note in your comments. It may be as simple as:

# Sort list
# assert list is sorted
...
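Stretched out a bit, the pseudocode inside the test might look like this (everything here is a hypothetical plan, not working code yet):

def test_my_new_feature():
    """Goal: fit a simple model to canned price data and check that it's sane."""
    # Build a small canned DataFrame of prices
    # Clean it: drop NAs, sort by date
    # assert no NAs remain, assert dates are sorted
    # Fit the model
    # assert the fit converged, assert the key coefficient is significant
    print("hello world")  # placeholder so the test runs while it's still a plan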

Spin Up the Debugger

Put a breakpoint on the first line. Add a "hello world" if you need a line to execute. Then spin up a debugger that can break into a REPL.

Now you need to do - in the REPL - whatever the first logical line says needs to be done. Since we described our whole problem, that first line may be dedicated to putting together some canned data to drive the test forward.

(I just said above not to mix how you load your data and how you use it. But we’re going to solve that problem using refactoring. Right now, just do the “simplest thing that can possibly work.” We want one large function that does everything, from soup to nuts. This will look a lot like Notebook code. We’ll use automated refactoring to break this code into logical chunks.)

Since we’re already in the REPL, it’s easy to interact with our libraries and generate canned data. We can use copy-paste to get this data into the file.
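A debug-console session for that canned data might look roughly like this (hypothetical values; the point is to eyeball the result and then paste the winning expression into the file):

>>> import pandas as pd
>>> canned = pd.DataFrame({"price": [100.0, 101.5, 99.8], "volume": [1200, 900, 1500]})
>>> canned.describe()   # sanity-check it before copying the expression into the test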

To capture that first logical line, you may want to restart your REPL debugger multiple times to clear out local variables and smoke test what you’ve been writing so far.

When you’re done with the first line, copy and paste what you think will work into the actual Python file. Then move on to the following line by moving the breakpoint. The really nice thing about this debugging-REPL technique is that the REPL has all of the state your script has built up so far. You don’t have to remember everything you’ve already instantiated, and you don’t have to do any complicated setup in standalone REPL sessions just to try something out.

This is where we’re trying to simulate a bit of Notebooks. One nice feature in Notebooks is being able to change a few things and re-run a line. Using tests to change things and re-run things is one version of this in the large. In the small, another is staying in the REPL-debugger, just trying out different expressions. We can quickly and rapidly prototype various things. This allows us to explore APIs (with tab completion!) and try multiple things out in a single debugging session.

We don’t get the play buttons, and specific workflows may not be as fine-tuned. But in exchange we get - well - the whole point, actually: well-factored, tested code, prepped for reuse, maintenance, and readability.

REPL sessions are one of the best ways to explore APIs and try multiple things to get a line right, all with far faster cycle time than re-running tests. Additionally, as you add tests, the debugger-REPL will execute them and only get to your breakpoint if all the tests currently pass.

Ideally, use a debugging-REPL with code completion, like IPython’s or JetBrains’.

Refactor as You Go

When it makes sense to make something a method, make it a method. But don’t try to organize code until you start to see the solution emerge. Premature organization will slow you down. You actually want rough code coming out – the ‘simplest thing that can possibly work’ in your debugging-REPL sessions.

You may even want to keep the code rough for a while longer than that – so long as it works – as you go back into pseudocode to flesh out some extended behavior or off-nominal paths.

As things emerge that seem elegant, extract methods and add them to classes and whatnot as you go. You don’t need to do this upfront; you want your solution to guide you to what the code should look like.

When and where to refactor is essentially something driven by experience – at the same time, so is the design you use to solve your problem. But I would not refactor too early, as you want to see your entire solution in one go. Then and only then will specific patterns emerge.

Refactoring with auto-formatting is fantastic. While prototyping and getting things correct, you can write code as dense and nested as you want - the auto-formatter will keep it legible. Then when you refactor it out, it will again be auto-formatted as if it was always where it belonged.

A style-opinionated linter like pylint may also give you hints on where to refactor (method too long, etc.). These can also be nice safety nets for your refactoring efforts after all the obvious logical breaks are there.

Test and Contract as You Go

While you’re putting in statements, one line at a time, there are all kinds of things you can assert at any time. Go nuts; think of as many assertions as you can. A stickler may be able to add one or more assertions per line of behavior actually written.

Are these tests? Are they contracts?

You can decide. Right now, in this primordial ooze, contracts and tests are going to look the same. Contracts may, in fact, be easier, since getting canned data may require a copy and paste; with a contract, you’re able to simply assert that some data stands in a particular relation to other data.

If it’s a contract-like assertion, then that will live inside a method in the future. If it is a test-like assertion, we will be refactoring that out into a test in the future.

“Uh… What’s a contract?”

A contract is just an assertion that you put in your code rather than a test.

Tests use assertions to check things like, “If I give this function this known input, do I get out this known output?”

Contracts refer to uses of assertions where you’re basically saying, “In all cases, this should be true.”

A great example might be a model that assumes its inputs follow some distribution. Run the statistical check that confirms that distribution against the inputs, and assert that it passes. If someone ever misuses that function, they’ll get a nice human-readable error saying the function only works on certain distributions.

Another example - one that I hope gets made obsolete by better type checking - would be matrices. Assert that the shape of your NumPy arrays is correct; or assert that your pandas DataFrames have the right column names (or the right number of columns, or suitable types of data, or no NAs, etc.).
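A sketch of what such contracts can look like, assuming pandas and scipy (the column names, the distribution check, and the thresholds are all hypothetical):

import pandas as pd
from scipy import stats

def score(features: pd.DataFrame):
    # Contracts: these must hold for every caller, not just inside tests.
    assert list(features.columns) == ["alpha", "beta", "gamma"], "unexpected columns"
    assert not features.isna().any().any(), "this model cannot handle NAs"
    # The model assumes roughly normal inputs for "alpha":
    _, p = stats.shapiro(features["alpha"])
    assert p > 0.05, "alpha does not look plausibly normal"
    ...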

Don’t Get Too Far Ahead of Yourself

This technique will feel so fast that you’ll just want to keep coding. You’ll want to keep adding behavior and going back. Try, though, to be disciplined about when to branch things off.

We’ll discuss finding a good rhythm in a future blog post on “journaling.”

If two features genuinely are independent, it may be better to get your branch under peer review, spin up another branch, and go on with the new feature. Otherwise, it will become harder and harder to ‘keep it rough’ and keep it all in your head.

This will be especially apparent when you want to ‘run past the end’ and add more behavior based on the results of your acceptance test (as you would anywhere else in the code using this method).

Try to avoid this. Instead, start a new file with a new test that imports the code from the old file. Run the behavior, get the output, and add new assertions and behavior.

Technique: Grab Test Data Using the Debugger

What you’re initially writing is an acceptance test – it’s a complete walk of the new feature using some expected canned inputs, ensuring you get expected outputs.

Pulling unit tests out of this is relatively easy using the REPL, though, since you can step through the code, inspect a few variables, and copy and paste their state into the code as an assertion. These ‘assertions on the way’ on the state are good refactoring targets to turn into unit tests. And these refactoring targets also help you determine when and where your method boundaries should be as you try to break things up.

Technique: Grab Canned Data Using the Debugger

You may have a longer test that you want to break into a few unit tests, but it uses complex data. You can always manually save this off to a CSV from the debugger, then load the data as part of the test.
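In the debug console, that can be a one-liner (the variable name and path are hypothetical):

>>> intermediate_df.to_csv("tests/data/tricky_case_1.csv", index=False)

The new unit test then starts from pd.read_csv("tests/data/tricky_case_1.csv") instead of re-running whatever slow or fiddly steps produced that frame.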

Technique: Use "Extract Method" to pull unit tests out of your integration test

Using “Extract Method” twice - once on the behavior you want in a method, and a second time on an assertion about the results of that method - will allow the automated refactor to figure out all the dependencies of a specific line of code.

Setting a breakpoint at the top of that ‘test’ when it’s called allows you to use the copy-paste technique to remove these dependencies and rely on canned data.

Extracting an inline test is easy. Simply extract the method, then immediately call it right after.

This makes the generation of unit tests really easy once you already have an acceptance test. Once unit tests are easy to generate, it’s easy to run specific tests and copy tests more quickly while changing certain defaults to test off-nominal behavior.
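As a rough before/after sketch (hypothetical names; the extraction itself is done by the IDE’s refactor, not by hand):

import pandas as pd

def clean_prices(raw: pd.DataFrame) -> pd.DataFrame:
    return raw.dropna()

def check_clean_prices(raw: pd.DataFrame):
    # Extracted assertion; hand it canned data and a test_ name to promote it
    # into a standalone unit test.
    assert clean_prices(raw).isna().sum().sum() == 0

def test_acceptance():
    raw = pd.read_csv("tests/data/raw_prices.csv")
    check_clean_prices(raw)  # extracted, then immediately called where it lived
    ...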

Technique: TODOs embedded in the code, and Groom the Code

Brand new features or directions can be captured as to-dos. We want to stay within the file as much as possible. Keep all of your notes there and at your fingertips, all within context.

As you go, it may make sense to ‘groom’ this implied backlog – while deciding what to do next, try to add more information to a TODO each time you see it. What approach might you take, how can the TODO be broken down, and what questions need to be considered?

As you groom, the easier TODOs get knocked out, while the harder ones get easier and easier, with a fresh mind in-between visits.

TODOs that aren’t worth doing at the time can later be moved into an actual task- or story-management backlog; by then, you’ll have enough context for that to be worthwhile.

Always try to write as much of your thinking as possible into each TODO you make. You won’t be reviewing these things for some time, and simple todos like ‘fix this’ aren’t going to help your future self. In the TODO, describe what’s wrong and, if possible, your ideas for fixing it.

(In a future post on Journaling, we’re going to talk about a better way to try and track commitments and tasks. This technique can still work, though, primarily by catching minor design issues or refactoring ideas, rather than “user story” style tasks.)

Pull Everything Apart

When the code works and you’ve got it well structured, the final step is to pull it apart. Now is the time to decide where everything lives. Package all this code into clean modules and packages that interrelate in obvious ways.

What will come out of this is code that makes the solution look obvious, even though it wasn’t obvious at the time. Code whose behavior seems obvious is very readable; it’s what we strive for when we’re done.

One of the main reasons code ends up unreadable is that we try to put abstractions in place at the beginning that turn out not to be good fits. But because the refactoring overhead for those abstractions is so significant, we end up shoehorning things in anyway.

This makes the code hard to read, as the abstractions aren’t helpful.

Using the drafting method, many of the abstractions that emerge are ones you wish you had known about going in. Because we don’t prematurely abstract or organize, but instead solve our problem one line at a time, with complexity kept in check by many assertions and comments, we’re able to refactor with the complete solution in mind.

Honorable Mention: Vim vs IDEs

A common religious war in software engineering circles is comparing a text editor like Vim with an IDE like PyCharm.

There’s a good reason to be familiar with both, as Vim is installed nearly everywhere - though this “patch in production” narrative may happen in Data Science less.

However, after a decade of sticking to just Vim, I found emulating Vim on PyCharm way easier than emulating PyCharm on Vim.

Can you get a plugin manager to give you good syntax highlighting and linting in Vim? Yeah, that’s actually relatively easy. Can you get automated refactors and a Debugging REPL? You can, at least according to the tools I’ve seen, but I could never get them to work. Maybe things are better now.

I can, however, easily get Vim keybindings (and macros!) in PyCharm.

Making sure you use text-editor keybindings in your IDE also helps you practice for that one time a year you might have to patch something in production, even in data science.

Where to Go from Here?

I glossed over a lot, but you can at least fill in the blanks. Namely, you know enough about testing to read more on testing and be able to apply it to your data science code.

You’ve also learned enough about refactoring to start researching more kinds of refactors that might be useful to you. Design concepts such as “keep it simple, stupid,” “you ain’t gonna need it,” and “don’t repeat yourself” are now a bit easier to apply as you get comfortable with automated refactors, with tests double-checking that your refactors worked.

We’ll be covering two more methods that ensemble with Drafting in the future: Journaling and Outlining. Journaling will help with the requirements and project-management aspects of software engineering, while Outlining will help you with design. All three together will provide many productivity benefits in easily remembered habits.

Don’t think you have to do all bits of Drafting correctly; try each little bit independently and see how to make it work for you. Each tool or technique we discuss has a dose-response effect: each should help you be more productive over time on its own, but together they have a huge “set completion” bonus.

You should also expect to hit the learning curve. You will slow down at first as you consciously try these techniques. Eventually, they will become habits, and you won’t be able to remember coding any other way. Likewise, if you feel rushed for time, don’t worry about not doing Drafting strictly. Software is like anything else in life - you build healthy habits over time by practicing (i.e., consciously trying to do a thing). It is not a domain where constant flagellation and white knuckles are the only way to ‘git gud.’

Practice when you can, and you’ll get better when you have to just wing it. The more you practice refactoring and linting, the easier it will get when time constraints are tight, and the more quickly it will help you go faster!

Do you want training, coaching, or consulting on Drafting for yourself or your team? Contact us!

Or…

  • Click “Subscribe to the Soapbox” below for more!

