My problem with job titles is that at every organization I’ve ever worked at or with, a given job title has little real value or meaning. All a job title is, is an attempt to label a collection of job duties and responsibilities into something succinct. And in our neck of the woods, there’s a decent assortment of such titles – DB Admin, DB Dev, Dev DBA, DB Engineer, Architect, blah, blah, blah. But as I’m sure others are blogging about today, the actual duties wildly vary from company to company, for identical job titles.
I also argue that they can often be meaningless except from a prestige perspective. Early on in my career (omg, this anecdote was 20 years ago), I worked for one of the largest banks in the world. My title was “Assistant Vice President.” In reality, I was a database developer. And most of the other 100+ people in my division had identical titles as well! That title existed because salary bands were tied to titles at that organization, and IT, being higher paying, had to be on the “president” track. Ridiculous.
What Does [dev\null] Mean to YOU?
One thing I started doing in the course of my career is that when I’d interview, one of my counter-questions would be “what does [job title] mean to you?” followed by “what are the duties, tasks, and day-to-day for [job title] at this organization?”
I remember one place where I was interviewing for an “Architect” role, but the job description read more like one for a T-SQL developer & performance tuning expert. And I was right to ask: they defined “Architect” totally differently than I would have. So we pivoted our conversation to the actual needs, tasks, and day-to-day.
So What Does Matter?
In my opinion, job titles are meaningless. For me, the only things that matter are what set of duties is expected and, most importantly, whether the pay is proper for what is being expected.
“Call me a data janitor for all I care, as long as the paycheck clears the bank.”
Now let’s start focusing on performance characteristics. We can impact the performance of a BACKUP operation by making changes to one or more of the following:
Increase the Number of Backup Buffers
Increase the Size of Backup Buffers
Add more CPU Reader Threads
Add more CPU Writer Threads
Analogy Time!
I like to explain things with silly analogies and today will be no exception. Today, let’s pretend that we are in a shipping warehouse.
Track the flow of widgets from the left to the right, then back to the left, in a clockwise fashion.
The person on the left “reads” the contents of the main container and makes copies of each widget, tossing as many widgets as they can into an available yellow basket. When the basket is full, it is sent to the other employee, who takes the contents of each basket and places it into the truck. They also return the empty basket to the first employee to be refilled.
In this silly analogy, the left employee represents a single CPU Reader Thread. They are copying data pages out of a database file and placing them into a yellow basket, which represents a backup buffer. The stack of empty yellow baskets represents the other backup buffers waiting to be filled in the Free Queue. Once a basket is filled, it is sent over to the Data Queue, which the other employee picks up one by one. That employee takes the contents of the basket and loads it into the truck. This is analogous to a CPU Writer Thread consuming a backup buffer and writing out to the backup target.
Note: The duration of transferring backup buffers from one queue to another is effectively nil. “Transfer” is probably not the best word choice here as I don’t believe any actual data movement occurs.
Backup Buffers
To help accelerate our backup operations, we can make two adjustments to our backup buffers – adding more backup buffers and increasing the size of the backup buffers.
Add More Baskets
First, let’s explore adding additional backup buffers. That’d be like having more than seven baskets to pass widgets back and forth in the diagram above. In a perfect system, both employees would be processing baskets at perfectly equal speeds. But the real world isn’t perfect – one employee may be much faster than the other.
Scenario 1
Let’s pretend the first employee is the slower one. They can fill a basket at a rate of 1 basket every 10 seconds. The second employee can process a basket at a rate of 1 every 5 seconds. So when they begin work, the second employee sits idle waiting for a basket to arrive. They process the received basket quickly, pass it back, then sit and wait for the first employee to finish filling another basket.
Scenario 2
Conversely, if the first employee can fill a basket at a rate of 1 per second, they’ll have sent all 7 available baskets to the other employee after 7 seconds. The second employee still takes 5 seconds to process a filled basket, so once things get going, the first employee has to wait 4 seconds between receiving free baskets. The second employee now has a backlog of baskets to process because they are much slower than the first employee.
Time lapse chart to help illustrate the state of the Reader and Writer threads at different time intervals.
Check out the time lapse chart. One thing that is very clear is that due to the different processing rates, either the Reader or Writer will incur some idle time. Imagine what this might look like if the number of baskets were larger or smaller. And imagine what this might look like if the processing rates were orders of magnitude apart.
Use Bigger Baskets
The other adjustment that can be made is to make each basket bigger. Depending on the rate of filling or processing a basket, making all baskets larger can increase the overall throughput of the process. If we doubled the size of each basket, the time to fill or process one basket would also double. So in the first scenario above, the fill rate would become 1 basket every 20 seconds and the processing rate 1 basket every 10 seconds. And in the second scenario, we’d see a fill rate of 1 basket every 2 seconds and a processing rate of 1 basket every 10 seconds.
Are you seeing a bit of a “so what’s the benefit, why bother?” conundrum here? Hold that thought for a bit.
What Are the T-SQL Parameters?
Skipped over the analogy to the good stuff? Here you go:
BUFFERCOUNT (BC) – Define the number of Backup Buffers to be used
MAXTRANSFERSIZE (MTS) – Define the size of each Backup Buffer
MAXTRANSFERSIZE can range anywhere from 64KB to 4MB, in multiples of 64KB. Frankly, I’m not sure what the allowed range of values is for BUFFERCOUNT. But because of the workflow above, there is a point where having too many Backup Buffers is a waste, because neither side can fill or process them fast enough.
Additionally, there’s a resource price to pay in the form of RAM consumed. Very simply, your BACKUP operation will consume BUFFERCOUNT * MAXTRANSFERSIZE in RAM. So if you run a BACKUP with BUFFERCOUNT = 10 and MAXTRANSFERSIZE = 1MB, your operation will consume 10MB of RAM. But if you crank things up, BUFFERCOUNT = 100 and MAXTRANSFERSIZE = 4MB, that’s now 400MB of RAM for your operation.
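As a sketch, here’s what those two parameters look like in an actual BACKUP statement. The database name and backup path below are placeholders, not from any real system:

```sql
-- Hypothetical example: database name and backup path are placeholders.
-- Memory footprint: BUFFERCOUNT * MAXTRANSFERSIZE = 100 * 4MB = 400MB of RAM.
BACKUP DATABASE [MyBigDatabase]
TO DISK = N'X:\Backups\MyBigDatabase.bak'
WITH
    BUFFERCOUNT = 100,         -- number of Backup Buffers
    MAXTRANSFERSIZE = 4194304; -- bytes per Backup Buffer (4MB; must be a multiple of 64KB)
```

Note that MAXTRANSFERSIZE is specified in bytes, not megabytes, so it’s easy to accidentally ask for far less memory than you intended.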
What about the existing SQL Server Buffer Pool? Backup Buffers are allocated outside of buffer pool memory. If you don’t have enough RAM available, SQL Server will shrink the buffer pool as needed to accommodate your backup buffer requirement. (I plan to talk about overall resource balancing in a future part of this series.)
What’s the Benefit?
So what’s the overall benefit of messing with these two parameters? If one solely leverages these two parameters and no other adjustments, not much really. The main bottleneck your BACKUP operation will experience is due to only having 1 Reader and 1 Writer thread. So adding a few more Backup Buffers can help your overall performance a bit, but as you saw with the time lapse examples, if the processing rates on each side differ, the extra backup buffers don’t really help – they’ll just sit in one of the queues waiting on the CPU threads.
It is far more beneficial to make adjustments to BUFFERCOUNT and MAXTRANSFERSIZE if you also make adjustments to the number of Reader and/or Writer threads for your BACKUP operation. And that’s what I’ll be covering in Part 3 of this series, so stay tuned! Thanks for reading!
Recently, I had the pleasure of delivering a new presentation called How to Accelerate Your Database Backups for MSSQLTips.com. This blog series is intended to be a companion piece, particularly for those who prefer to read content instead of watching a video. And this series is focused exclusively on FULL BACKUP operations. I might talk DIFFs and LOG BACKUPs another time.
For many of us, native BACKUPs on SQL Server “just work.” Many are content to use BACKUP with defaults and as long as errors aren’t thrown, don’t think much beyond that. But not all of us have the luxury of adequate BACKUP maintenance windows, especially when faced with the ever-growing sizes of our databases.
So to speed up our BACKUP operations, we turn to using backup compression, backup file striping, and two parameters: BUFFERCOUNT and MAXTRANSFERSIZE. Many have been blogging about these four things for many years now. But what I find interesting is that very few of those blogs have really explored WHY each brings a performance benefit for BACKUP operations.
That ends today.
But How Do They Work?
In the simplest example, a BACKUP operation has the following players:
a database source volume
a backup output file
a reader thread
a writer thread
a set of backup memory buffers
Note: Notice Database VOLUMES, not Database Data Files. You’ll learn why shortly.
Here is the basic flow of a BACKUP operation:
The Reader Thread will read data from a database file, into an empty Backup Buffer in the Free Queue.
Once the Backup Buffer is full, it will be moved into the Data Queue.
The Reader Thread will start filling up another Backup Buffer in the Free Queue (if available), otherwise it must wait until the Free Queue has an empty Backup Buffer available.
The Writer Thread monitors the Data Queue and once it sees that a filled Backup Buffer has entered the Data Queue, it will transfer the contents to a Backup Device.
Once the Backup Buffer is emptied, it is moved back to the Free Queue to be populated again by the Reader Thread.
When you run a basic BACKUP command without any optional parameters, you will always get 1 reader thread and 1 writer thread. The number of Backup Buffers varies depending on the Backup Device type (DISK, URL, TAPE). The most commonly used is DISK, which will get 7 Backup Buffers, each Buffer being 1MB in size.
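If you’d like to verify those defaults yourself, my understanding is that trace flag 3213 emits BACKUP configuration details (buffer count, transfer size, and more), and trace flag 3605 routes that output to the SQL Server error log. A sketch, with placeholder names – use a non-production instance for this kind of poking around:

```sql
-- Hypothetical example: database name and backup path are placeholders.
DBCC TRACEON (3605, 3213); -- 3213 emits backup internals; 3605 sends them to the error log

BACKUP DATABASE [MyDatabase] TO DISK = N'X:\Backups\MyDatabase.bak';

-- For a default BACKUP TO DISK, the error log should show
-- 7 buffers at a max transfer size of 1024KB.

DBCC TRACEOFF (3605, 3213);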
Low Overhead
It’s interesting to note that BACKUP’s resource utilization is relatively low. A BACKUP to DISK operation with no optional parameters requires 2 CPU threads (1 Reader, 1 Writer) and 7MB of RAM (7 Backup Buffers, 1MB each). My understanding is that this was architected this way intentionally, long ago when production servers were less powerful than my smartphone, to ensure that a BACKUP operation has minimal impact while running.
Give Me More Power!
Because data can only flow through 1 Reader thread, 7 Backup Buffers, and 1 Writer thread, BACKUP can be rather slow. And these days, our databases are terabytes in size, with limited maintenance windows in which to back them up. Thankfully, we have a number of different parameters that allow us to increase the horsepower behind a BACKUP operation, which I’ll cover in Part 2 of this BACKUP Internals series!
For today’s blog post, I want to focus our attention on the first Lightboard video: How Volumes work on FlashArray and a discrete benefit to FlashArray’s architecture. If you’re not yet familiar with that, I strongly suggest you read “Re-Thinking What We’ve Known About Storage” first.
Silly Analogy Time!
As many of you know from my presentations, I pride myself on silly analogies to explain concepts. So here’s a silly one that I hope will help to explain a key benefit of our “software defined” volumes.
My wife Deborah (b|t) and I have a chihuahua mix named Sebastian. And like many dog owners, we take way too many pictures of our dog. So let’s pretend that Deborah and I are both standing next to one another and we each snap photos of the dog on our respective smartphones.
Sebastian wasn’t quite ready for me, but posed perfectly for Deborah (of course)!
We also have a home NAS. Both Deborah and I have our own volume shares on the NAS and each of our smartphones automatically backs up to each of our respective shares. So once on the NAS, each photo would consume X megabytes per photo, right? That’s what we’re all used to.
Software Saves Space!
Now let’s pretend that instead of a regular consumer NAS, I was lucky enough to have a Pure Storage FlashArray as my home NAS. One of its special super powers is that it deduplicates data, not just within a given volume, but across all volumes across the entire array! So for example, if I have 3 different copies of the same SQL Server 2022 ISO file on the NAS, just in different volumes, FlashArray would dedupe that down to one underneath the covers.
But FlashArray’s dedupe is not just done at the full file level – it goes deeper than that!
So going back to our example, when we each upload our respective photos to our individual backup volumes, commonalities will be detected and deduplicated. Sebastian’s head is turned differently and he’s holding his right paw up in one photo but not the other. Otherwise, the two photos are practically identical.
So instead of having to store two different photos in full, it can store the identical elements of the photos once, as a “shared” canvas, plus the distinct differences of each. (If your mind goes to the binary bits and bytes that comprise a digital photo, I’m going to ask you to set that aside and just think about the visual scene that was captured. This is just a silly analogy after all.)
How About Some Numbers?
Let’s say each photo is 10MB, and 85% of the two photos’ canvas is shared while the remaining 15% is unique to each photo. If we had traditional storage, we’d need 20MB to store both photos. But with deduplication technology, we’d only need to store 11.5MB! That breaks down as 8.5MB (shared: photos 1 & 2) + 1.5MB (unique: photo 1) + 1.5MB (unique: photo 2). That’s a huge amount of space savings, gained by consolidating duplicate canvases!
Think At Scale
As I mentioned earlier, Deborah and I take TONS of photos of our dog. Most happen to be around our house. And because Sebastian is a jittery little guy, we’ll take a dozen shots just to try to get one good one out of the batch. And if we had data deduplication capabilities on our home NAS, that’d translate to a huge amount of capacity savings that is otherwise wasted storing redundant canvases.
What About the Real World?
Silly analogies aside, what does this look like in the real world? On Pure Storage FlashArray, a single SQL Server database gets an average of 3.5:1 data reduction (comprised of data deduplication + compression), but that ratio skyrockets as you persist additional copies of a given database (ex: client federated dbs, non-prod copies on staging, QA, etc.). If your databases are just a few hundred gigabytes, you might not care. But once you start getting into the terabyte range, the data reduction savings start to add up FAST.
Wouldn’t you be happy if your 10TB SQL Server database only consumed 2.85TB of actual capacity? I sure would be. Thanks for reading!
For today’s blog post, I want to focus on the subject matter of the first Lightboard video: How Volumes work on FlashArray.
Storage is Simple, Right?
For my many years as a SQL Server database developer and administrator, I always thought rather simplistically of storage. I had a working knowledge of how spinning media worked, and basic SAN & RAID architecture knowledge from a high level. And then flash media came along and I recall learning about its differences and nuances.
But fundamentally, storage still remained a simplistic matter in my mind – it was the physical location to write your data. Frankly, I never thought about how storage and a SAN could offer much much more than simply that.
A Legacy of Spinning Platters
Many of us, myself included, grew up with spinning platters as our primary storage media. Over the years, engineers came up with a variety of creative ways to squeeze out better performance. One progression was to move from a single disk to many disks working together collectively in a SAN. That enabled us to stripe or “parallelize” a given workload across many disks rather than be stuck with the physical performance constraints of a single disk.
Carve It Up
In the above simplified example, we have a SAN with 16 disks. And let’s say that each disk gives us 1,000 IOPS @ 4KB. I have a SQL Server whose workload needs 4,000 IOPS for my data files and 6,000 IOPS for my transaction log. So I would have to create two volumes, each containing the appropriate number of disks from the SAN, to give me the performance characteristics that I require for my workload. (Remember, this is a SIMPLIFIED diagram to illustrate the general point.)
Now imagine being a SAN admin having to juggle hundreds of volumes across dozens of attached servers, each with its own performance demands. Not only is that a huge challenge to keep organized, but it’s highly unlikely that every server will have its performance demands met, given the finite number of disks available. What a headache, right?
But what if we were no longer limited by the constraints presented by spinning platters? Can we approach this differently?
Letting Go Of What We Once Knew
One thing that can be a challenge for many technologists, myself especially, is letting go of old practices. Oftentimes those practices were learned a very hard way, so we want to make sure we never have to go through those rough times again. Even when we’re presented with new technology, we often just stick to the “tried and true” way of doing certain things, because we know it works.
One of those “tried and true” things we can revisit with Pure Storage and FlashArray is the headache of carving up a SAN to get specific performance characteristics for our volumes. When Pure Storage first came to be, they focused solely on all-flash storage. As such, they were not tied to legacy spinning disk paradigms and could dream up new ways of doing things that suited flash storage media.
Abstractions For The Win
On FlashArray, a volume is not a subset or physical allocation of storage media. Instead, a volume on FlashArray is just a collection of pointers to wherever the data wound up landing.
Silly analogy: pretend you’re boarding a plane. On a traditional airline, first class typically boards first and goes to first class, then premium economy passengers board and go to their section, then regular economy boards and goes to its section, and finally basic economy boards and goes to theirs. But if you were on Southwest Airlines, you could choose your own seat: you’d board and simply go wherever you wish (and pretend you report back to an employee which seat you’ve taken). Legacy storage is like that traditional airline, where you (the data) are limited to sitting in your respective seat class, because that’s how the airplane was pre-allocated. But on FlashArray, you’re not limited in that way and can simply sit where you like, because you (the data) have access to sit anywhere.
Another way of describing it that might resonate is that legacy storage assigned disk storage to a volume and whatever data landed on that volume landed on the corresponding assigned disk. On FlashArray, the data can be landed anywhere on the entire array, and the volume that the data was written to simply stores a pointer to wherever the data wound up on the array.
Fundamental Changes Make a Difference
This key fundamental change in how FlashArray stores your data opens up a huge realm of other interesting capabilities that were either not possible or much more difficult to accomplish on spinning platters. I like to describe it as software-enhanced storage, because there are many things being done on the software layer besides just “writing your data to disk.” In fact, we’re not quite writing your raw data to disk… there’s an element of “pre-processing” that takes place. But that’s another blog for another day.
Take 3 Minutes for a Good Laugh
If you want to watch me draw some diagrams on a lightboard that illustrate all of this, then please go watch How Volumes work on FlashArray. It’s only a few minutes long, and listening to me at 2x speed is quite entertaining in and of itself. Just be sure to re-watch it to actually listen to the content, because I guarantee you’ll be laughing your ass off at me chattering at 2x speed. 🙂
In my opinion, 2022 is absolutely a major release with significant enhancements which should make it compelling to upgrade rather than wait for another release down the line. I’m thrilled for the improvements in Intelligent Query Processing, TempDB, and (after the training helped me ‘get it’) Arc-enabled SQL Servers. But that’s not what I want to blog about today.
It’s Often the Little Things
By sheer coincidence, I had the privilege of being invited to a private SQL Server 2022 workshop taught by Bob Ward last week. And through my job, I also had the privilege of doing some testing work around QAT backups and S3 Data Virtualization during the private preview phase last summer. So while I had exposure and access to SQL Server 2022 for much longer than others, there were many things that Microsoft loaded into the 2022 release that I barely skimmed over or knew were even there.
Towards the end of the workshop, Bob presented a slide called Purvi’s List. Purvi Shah is an engineer on the SQL performance team and as Bob said, “spends her time finding ways to make SQL Server and Azure SQL faster.” When Bob put up Purvi’s List, I let out an audible “holy shit,” much to Grant’s amusement.
So what caught me by surprise?
Instant File Initialization (IFI) for Transaction Logs
Okay, that’s cool!
For simplicity’s sake, I’ll just quote the documentation (as written today):
Transaction log files cannot be initialized instantaneously, however, starting with SQL Server 2022 (16.x), instant file initialization can benefit transaction log autogrowth events up to 64 MB. The default auto growth size increment for new databases is 64 MB. Transaction log file autogrowth events larger than 64 MB cannot benefit from instant file initialization.
So yeah, it is limited to 64MB growth size. But another entry on Purvi’s List is that the VLF algorithm has also been improved.
If growth is less than 64 MB, create 4 VLFs that cover the growth size (for example, for 1 MB growth, create 4 VLFs of size 256 KB). … and starting with SQL Server 2022 (16.x) (all editions), this is slightly different. If the growth is less than or equal to 64 MB, the Database Engine creates only 1 VLF to cover the growth size.
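If you’d like to see the new behavior on your own instance, sys.dm_db_log_info (available since SQL Server 2016) returns one row per VLF, so you can count VLFs before and after a growth event. A minimal sketch, run in the context of whichever database you want to inspect:

```sql
-- Count VLFs in the current database's transaction log.
SELECT COUNT(*) AS vlf_count
FROM sys.dm_db_log_info(DB_ID());

-- Trigger (or wait for) a log autogrowth of 64MB or less, then re-run the count.
-- On SQL Server 2022, that growth should add only 1 VLF;
-- on earlier versions, the same growth adds 4 VLFs.
```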
So do I think everyone should now change their Transaction Log autogrow sizes to 64MB? Of course not. But do I think that this kind of small but interesting improvement is still notable and will hopefully be expanded on in a future release to a larger scale? Absolutely!
If you’re like me, Plan Explorer has always been a must-have tool in your performance tuning arsenal. And one of the things that made it so useful was a simple little SSMS Integration that would allow you to right click on an Execution Plan and see “View with […] Plan Explorer.”
Unfortunately, I started hearing reports of that no longer being available in SSMS v19. But I know a thing or two, so was willing to bet 30 minutes of time that I could get it back.
How Do I Get The Integration Back?!?
You’ll need an existing installation of SSMS v18 that already has the Plan Explorer plug-in files.
Navigate to this folder (assuming default install paths):
C:\Program Files (x86)\Microsoft SQL Server Management Studio 18\Common7\IDE\Extensions\SentryOne Plan Explorer SSMS Plugin
Copy the entire SentryOne Plan Explorer SSMS Plugin folder and paste it into the folder below. (The folder might be named slightly differently because… well, you know… )
C:\Program Files (x86)\Microsoft SQL Server Management Studio 19\Common7\IDE\Extensions\
You’ll notice that the only thing that changed between the two paths is the SSMS folder name. At the time of this writing, the PE installer hasn’t been updated in an extremely long time, and while the fix is trivial, who knows if/when it’ll actually be addressed. But until then, copying the files over and restarting SSMS should do the trick.
I’ll admit this up front – there’s a ton of awesome technologies out there that I’ve had my eye on, learned a little bit about, and have hardly touched since. Docker is one of those technologies… along with Grafana. Well conveniently enough for me, Anthony Nocentino just wrote a blog post on Monitoring with the Pure Storage FlashArray OpenMetrics Exporter. And this monitoring solution uses both. And best of all, it’s actually quite easy to implement – even for a clueless rookie like me!
Recap of Andy Stumbling Along…
Ages ago, I had attended an introductory session or two on Docker and read some random blogs about it, but otherwise hadn’t really messed with it beyond a few examples. So I thought I’d take the quick and dirty route: I went into my team lab and installed Docker Desktop for Windows on a random Windows VM of mine. And while I waited… and waited… and waited for the installation to run, let me go on a slight tangent.
Tangent – Where To Run Docker From
TL;DR – Docker on Windows is a lousy experience. I expected it to “run okay for a dev box.” Nope, it was worse than that. Run it from a Linux machine if you can – you’ll be much happier.
Here’s the slightly longer “why.” Underneath the covers, Docker is essentially Linux. So to run it on Windows, you need to have Hyper-V running, essentially adding a virtualization layer. And if you’re silly like me, you’ll do all of this on a Windows machine that’s really a VMware VM… so yay, nested virtualization = mediocre perf!
In my case, after rebooting my VM, Docker failed to start with a lovely “The Virtual Machine Management Service failed to start the virtual machine ‘DockerDesktopVM‘ because one of the Hyper-V components is not running” error. Some quick Google-fu revealed that I had to go into vSphere and on my VM, adjust a CPU setting for Hardware virtualization: Expose hardware assisted virtualization to the guest OS.
Three reboots later, and I finally had Docker for Windows running. Learn from lazy Andy… it would have been faster to just spin up a Linux VM and get Docker installed and running.
Let’s Monitor a Pure FlashArray!
At this point, I could start with Anthony’s Getting Started instructions. That was super easy at least – Anthony outlined everything that I needed to do.
I did encounter another error after I ran ‘docker-compose up --detach’ for the first time: ‘Error response from daemon: user declined directory sharing‘. That one involved another Docker setting about file sharing. Once I changed that, it errored again, because I failed to restart Docker – doh! At least I didn’t have to reboot my VM again?
So finally I ran ‘docker-compose up --detach’ and stuff started appearing in my terminal – yay! I immediately went to the next step of opening a browser and got a browser error. WHUT?!? I thought something was broken, since Docker was clearly “doing” something. But the reality is that Prometheus, Grafana, and the exporter all had to finish their startup work before the dashboard was up and available. Several more minutes later, I had a working Grafana dashboard of my Pure FlashArray – yay!
Take the Time to Try Something New
All of the above took maybe a half hour at most? And a chunk of that was waiting around for stuff to complete, with other time burned resolving the two errors I encountered. So it’s not a huge time investment to stand up something genuinely useful, especially if you don’t have monitoring tools in place already.
But most importantly, this little experience was gratifying. It felt good to try something new again and to be able to stand this up pretty quickly and fairly painlessly. And if you don’t repeat my mistakes above, you can get your own monitoring operational even faster!
New Stars of Data has had a tremendous impact on our technical community, helping to grow new speakers. But the task of finding, encouraging, and mentoring new speakers is always ongoing. To that end, I wanted to share some thoughts for #NewStarNovember.
Let’s Do the Time Warp Again!
Because I got my start speaking through the encouragement of others, growing new speakers became a passion of mine as well. In late 2016, I put my money where my mouth was with this T-SQL Tuesday: Growing New Speakers blog. I challenged prospective new speakers to blog about their ideas and existing speakers to share their experiences and wisdom. And to really drive home how serious I was, I openly offered mentorship to any new speaker who wanted it.
Then and Now
The round-up to that call was tremendous. If nothing else, I would encourage you to revisit that roundup and look over the names of the people who were categorized as New or Novice speakers. And consider where many of them are today!
I don’t know about you, but I think that’s absolutely amazing. We speakers often say things like “speaking opens doors” and other sayings like that. I’d strongly argue that this list of people, looking at them then and now, is solid proof that speaking and presenting pays off exponentially.
Call to Action
Are you thinking about getting into speaking? Or are you relatively new to speaking but want to keep growing that skill? I would like to repeat my offer from 2016 and offer mentorship once again.
If you’ve never presented before and take me up on this challenge, I will take the time to work with you.
• If you want help developing your first presentation, I will help you.
• If you are wary of putting your first PowerPoint together, I will help you.
• If you need ideas on how to write demo scripts, I will help you.
I will do whatever I can to help you begin this journey. But it’s up to you to take that first step. Trust me (and others) when we all say that it’s absolutely worth it!
Welcome to another edition of T-SQL Tuesday! This month’s edition is hosted by Tom Zika (b|t) and Tom wants to know what makes “code production-ready?”
Measure Twice, Cut Once…
I still remember learning, early on in my career, the mantra of “Measure Twice, Cut Once.” It’s an important lesson that I’ve had to be re-taught periodically over my career, sometimes in a rather stressful or painful manner. And in the unforgiving realm of code, where that mantra is almost literal if you think about it, accuracy is paramount. We should always thoroughly test our code before it gets to Production.
No One Develops in Production, Right? RIGHT?
On the other hand, I’ve also had more than a handful of occasions where Production was broken in a BAD WAY and we had to FIX IT NOW. I remember when I worked in the commercial credit card industry, where our post-deployment QA smoke testing would take anywhere from 8-12 hours, and that’s AFTER a minimum 4 hour deployment. We rolled out a big release on Friday night, QA started their work in the wee hours of Saturday morning, and we found a nasty breaking bug that only manifested with Production data in play. Our app devs were trying to figure out how they could fix the code, but even the effort of committing, integrating, and cutting a new build was a multi-hour affair. In the meantime, we database developers were trying to figure out if there was a viable SQL Server/stored procedure workaround that could be implemented to allow us to avoid rolling back everything.
I was the db dev lead on this particular release and my gut told me that there had to be a workaround – I just had to iron it out. I requested 90 minutes of “do not bother me NO MATTER WHAT,” focused, and 90 minutes (and one interruption) later, I had successfully coded a workaround fix. But I also committed a cardinal sin – I developed in Production.
Was that Production Ready? Well, QA did spend a few hours testing and validating it. But was it subject to the full battery of integration and performance tests that our application would typically go through? No… but we had little choice, and in this case, it worked. The funny thing is that the workaround fix, like many things that were never meant to be permanent, became permanent.
Is There a Lesson to be Learned Here?
If I had to share one key takeaway, it is that I believe it is critical for every business to have a true “Prod + 1” environment. This means having a second FULL COPY of Production that’s refreshed on a regular basis, into which “next builds” are installed for testing. Unfortunately, this is not an easy or inexpensive task. Fortunately, there are many more solutions available these days (like one from a certain orange organization I happen to know) that make it far more feasible.