Variable Names: Why They’re a Mess and How to Clean Them Up – Built In

Quick, what does the following code do?

Its impossible to tell right? If you were trying to modify or debug this code, youd be at a loss unless you could read the authors mind. Even if you were the author, a few days after writing this code you might not remember what it does because of the unhelpful variable names and use of magic numbers.

Working with data science code, I often see examples like above (or worse): code with variable names such as X, y, xs, x1, x2, tp, tn, clf, reg, xi, yi, iiand numerous unnamed constant values. To put it frankly, data scientists (myself included) are terrible at naming variables.

As Ive grown from writing research-oriented data science code for one-off analyses to production-level code (at Cortex Building Intelligence), Ive had to improve my programming by unlearning practices from data science books, coursesand the lab. There are significant differences between deployable machine learning code and how data scientists learn to program, but well start here by focusing on two common and easily fixable problems:

Unhelpful, confusing or vague variable names

Unnamed magic constant numbers

Both these problems contribute to the disconnect between data science research (or Kaggle projects) and production machine learning systems. Yes, you can get away with them in a Jupyter Notebook that runs once, but when you have mission-critical machine learning pipelines running hundreds of times per day with no errors, you have to write readable and understandable code. Fortunately, there are best practices from software engineering we data scientists can adopt, including the ones well cover in this article.

Note: Im focusing on Python since its by far the most widely used language in industry data science. Some Python-specific naming rules (see here for more details) include:

More From Will KoerhsenThe Poisson Process and Poisson Distribution, Explained

There are three basic ideas to keep in mind when naming variables:

The variable name must describe the information represented by the variable. A variable name should tell you concisely in words what the variable stands for.

Your code will be read more times than it is written. Prioritize how easy your code is to read over than how quick it is to write.

Adopt standard conventions for naming so you can make one global decision in a codebase instead of multiple local decisions.

What does this look like in practice? Lets go through some improvements to variable names.

If youve seen these several hundred times, you know they commonly refer to features and targets in a data science context, but that may not be obvious to other developers reading your code. Instead, use names that describe what these variables represent such as house_features and house_prices.

What does the value represent? It could stand for velocity_mph, customers_served, efficiencyorrevenue_total. A name such as value tells you nothing about the purpose of the variable and just creates confusion.

Even if you are only using a variable as a temporary value store, still give it a meaningful name. Perhaps it is a value where you need to convert the units, so in that case, make it explicit:

If youre using abbreviations like these, make sure you establish them ahead of time. Agree with the rest of your team on common abbreviations and write them down. Then, in code review, make sure to enforce these written standards.

Avoid machine learning-specific abbreviations. These values represent true_positives, true_negatives, false_positivesand false_negatives, so make it explicit. Besides being hard to understand, the shorter variable names can be mistyped. Its too easy to use tp when you meant tn, so write out the whole description.

The above are examples of prioritizing ease of reading code instead of how quickly you can write it. Reading, understanding, testing, modifying and debugging poorly written code takes far longer than well-written code. Overall, trying to write code faster by using shorter variable names will actually increase your programs development and debugging time! If you dont believe me, go back to some code you wrote six months ago and try to modify it. If you find yourself having to decipher your own past code, thats an indication you should be concentrating on better naming conventions.

These are often used for plotting, in which case the values represent x_coordinates and y_coordinates. However, Ive seen these names used for many other tasks, so avoid the confusion by using specific names that describe the purpose of the variables such as times and distances or temperatures and energy_in_kwh.

When Accuracy Isn't Enough...Use Precision and Recall to Evaluate Your Classification Model

Most problems with naming variables stem from:

On the first point, while languages like Fortran did limit the length of variable names (to six characters), modern programming languages have no restrictions so dont feel forced to use contrived abbreviations. Dont use overly long variable names either, but if you have to favor one side, aim for readability.

With regards to the second point, when you write an equation or use a model and this is a point schools forget to emphasize remember the letters or inputs represent real-world values!

We write code to solve real-world problems, and we need to understand the problem our model represents.

Lets see an example that makes both mistakes. Say we have a polynomial equation for finding the price of a house from a model. You may be tempted to write the mathematical formula directly in code:

This is code that looks like it was written by a machine for a machine. While a computer will ultimately run your code, itll be read by humans, so write code intended for humans!

To do this, we need to think not about the formula itself (the how)and consider the real-world objects being modeled (the what). Lets write out the complete equation. This is a good test to see if you understand the model):

If you are having trouble naming your variables, it means you dont know the model or your code well enough. We write code to solve real-world problems, and we need to understand the problem our model represents.

While a computer will ultimately run your code, itllbe read by humans, so write code intended for humans!

Descriptive variable names let you work at a higher level of abstraction than a formula, helping you focus on the problem domain.

One of the important points to remember when naming variables is: consistency counts. Staying consistent with variable names means you spend less time worrying about naming and more time solving the problem. This point is relevant when you add aggregations to variable names.

So youve got the basic idea of using descriptive names, changing xs to distances, e to efficiency and v to velocity. Now, what happens when you take the average of velocity? Should this be average_velocity, velocity_mean, or velocity_average? Following these two rules will resolve this situation:

Decide on common abbreviations: avg for average, max for maximum, std for standard deviation and so on. Make sure all team members agree and write these down. (An alternative is to avoid abbreviating aggregations.)

Put the abbreviation at the end of the name. This puts the most relevant information, the entity described by the variable, at the beginning.

Following these rules, your set of aggregated variables might be velocity_avg, distance_avg, velocity_min, and distance_max. Rule two is a matter of personal choice, and if you disagree, thats fine. Just make sure you consistently apply the rule you choose.

A tricky point comes up when you have a variable representing the number of an item. You might be tempted to use building_num, but does that refer to the total number of buildings, or the specific index of a particular building?

Staying consistent with variable names means you spend less time worrying about naming and more time solving the problem.

To avoid ambiguity, use building_count to refer to the total number of buildings and building_index to refer to a specific building. You can adapt this to other problems such as item_count and item_index. If you dont like count, then item_total is also a better choice than num. This approach resolves ambiguity and maintains the consistency of placing aggregations at the end of names.

For some unfortunate reason, typical loop variables have become i, j, and k. This may be the cause of more errors and frustration than any other practice in data science. Combine uninformative variable names with nested loops (Ive seen loops nested include the use of ii, jj, and even iii) and you have the perfect recipe for unreadable, error-prone code. This may be controversial, but I never use i or any other single letter for loop variables, opting instead for describing what Im iterating over such as

or

This is especially useful when you have nested loops so you dont have to remember if i stands for row or column or if that was j or k. You want to spend your mental resources figuring out how to create the best model, not trying to figure out the specific order of array indexes.

(In Python, if you arent using a loop variable, then use _ as a placeholder. This way, you wont get confused about whether or not the variable is used for indexing.)

All of these rules stick to the principle of prioritizing read-time understandability instead of write-time convenience. Coding is primarily a method for communicating with other programmers, so give your team members some help in making sense of your computer programs.

A magic number is a constant value without a variable name. I see these used for tasks like converting units, changing time intervals or adding an offset:

(These variable names are all bad, by the way!)

Magic numbers are a large source of errors and confusion because:

Only one person, the author, knows what they represent.

Changing the value requires looking up all the locations where it's used and manually typing in the new value.

Instead of using magic numbers in this situation, we can define a function for conversions that accepts the unconverted value and the conversion rate as parameters:

If we use the conversion rate throughout a program in many functions, we could define a named constant in a single location:

(Remember, before we start the project, we should establish with our team that usd = US dollars and aud = Australian dollars. Standards matter!)

Heres another example:

Using a NAMED_CONSTANT defined in a single place makes changing the value easier and more consistent. If the conversion rate changes, you dont need to hunt through your entire codebase to change all the occurrences, because youve defined it in only one location. It also tells anyone reading your code exactly what the constant represents. A function parameter is also an acceptable solution if the name describes what the parameter represents.

As a real-world example of the perils of magic numbers, in college, I worked on a research project with building energy data that initially came in 15-minute intervals. No one gave much thought to the possibility this could change, and we wrote hundreds of functions with the magic number 15 (or 96 for the number of daily observations). This worked fine until we started getting data in five and one-minute intervals. We spent weeks changing all our functions to accept a parameter for the interval, but even so, we were still fighting errors caused by the use of magic numbers for months.

More From Our Data Science ExpertsA Beginner's Guide to Evaluating Classification Models in Python

Real-world data has a habit of changing on you. Conversion rates between currencies fluctuate every minute and hard-coding in specific values means youll have to spend significant time re-writing your code and fixing errors. There is no place for magic in programming, even in data science.

The benefits of adopting standards are that they let you make a single global decision instead of many local ones. Instead of choosing where to put the aggregation every time you name a variable, make one decision at the start of the project, and apply it consistently throughout. The objective is to spend less time on concerns only peripherally related to data science: naming, formatting, style and more time solving important problems (like using machine learning to address climate change).

If you are used to working by yourself, it might be hard to see the benefits of adopting standards. However, even when working alone, you can practice defining your own conventions and using them consistently. Youll still get the benefits of fewer small decisions and its good practice for when you inevitably have to develop on a team. Anytime you have more than one programmer on a project, standards become a must!

Keep Clarifying Your Code5 Ways to Write More Pythonic Code

You might disagree with some of the choices Ive made in this article, and thats fine! Its more important to adopt a consistent set of standards than the exact choice of how many spaces to use or the maximum length of a variable name. The key point is to stop spending so much time on accidental difficulties and instead concentrate on the essential difficulties. (Fred Brooks, author of the software engineering classic The Mythical Man-Month, has an excellent essay on how weve gone from addressing accidental problems in software engineering to concentrating on essential problems).

Now let's go back to the initial code we started with and fix it up.

Well use descriptive variable names and named constants.

Now we can see that this code is normalizing the pixel values in an array and adding a constant offset to create a new array (ignore the inefficiency of the implementation!). When we give this code to our colleagues, they will be able to understand and modify it. Moreover, when we come back to the code to test it and fix our errors, well know precisely what we were doing.

Clarifying your variable names may seem like a dry activity, but if you spend time reading about software engineering, you realize what differentiates the best programmers is the repeated practice of mundane techniques such as using good variable names, keeping routines short, testing every line of code, refactoring, etc. These are the techniques you need to take your code from research or exploration to production-ready and, once there, youll see how exciting it is for your data science models to influence real-life decisions.

This article was originally published on Towards Data Science.

Original post:

Variable Names: Why They're a Mess and How to Clean Them Up - Built In

Related Posts

Comments are closed.