DEV Community: Coolblue
The latest articles on DEV Community by Coolblue (@coolblue).
https://dev.to/coolblue
en
-
How We Reorganised Engineering Teams at Coolblue for Better Ownership and Business Alignment
Aman Agrawal
Wed, 07 Feb 2024 20:28:24 +0000
https://dev.to/coolblue/how-we-reorganised-engineering-teams-at-coolblue-for-better-ownership-and-business-alignment-34gb
<p>In this post, I will share my experiences leveraging Domain Driven Design strategies and Team Topologies to reorganise two product engineering teams in the Purchasing domain at <a href="proxy.php?url=https://www.coolblue.nl/" rel="noopener noreferrer">Coolblue</a> <em>(one of the largest e-commerce companies in the Netherlands)</em>, along business capabilities to improve team autonomy, reduce cognitive load on teams and improve our architecture to better align with our business.</p>
<p><strong>Disclaimer</strong>: I am not an expert in Team Topologies; I have only read the book twice and spoken to one member of the core team behind it. I am always looking to learn more about applying these ideas effectively, and this post is just one of the ways we applied them to our problem space. YMMV!</p>
<h3>
Context
</h3>
<p>The Purchasing domain is one of the largest at Coolblue in terms of the business capabilities we support and the number of engineering teams <em>(4 as of this writing, possibly growing in the future)</em>, and it has one very critical goal: to ensure we have the right kind of stock available to sell in our central warehouse at all times, without over- or under-stocking, and to secure the most favourable vendor agreements to improve the profitability of our purchases. Our primary stakeholders are supply planners and buyers in the product category teams responsible for the various categories of products we sell.</p>
<p>We buy stock for tens of thousands of products to meet growing customer demand, so it's absolutely critical not only that we make good buying decisions (which rely on a lot of data delivered in a timely fashion from across the organisation) but also that we manage pending deliveries and delivered stock efficiently and effectively (which relies on timely and accurate communication with suppliers).</p>
<h3>
Growth of the Purchasing Domain
</h3>
<p>Based on the <a href="proxy.php?url=https://vladikk.com/2018/01/26/revisiting-the-basics-of-ddd/" rel="noopener noreferrer">strategic Domain Driven Design terminology</a>, Purchasing would be categorised as a supporting domain, i.e. Purchasing capabilities are not our core differentiator. The workings of the domain are completely opaque to end customers. Most organisations will have similar purchasing processes and often similar systems <em>(sometimes these systems are bought instead of being built)</em>.</p>
<p>However, over the last 10 years the Purchasing domain has also grown in complexity as we have expanded our business capabilities: data science, EDI integration, supplier performance measurement, stock management, store replenishment, purchasing agreements and rebates, etc. We have come to rely on more accurate and timely data to make critical purchasing decisions; being able to quickly adapt our purchasing strategies during COVID-19 helped us stay on track with our business goals. For the most part we have built our own software, driven by the need to tackle this increased complexity, maintain agility in the face of global disruptions, and integrate with the rest of Coolblue more effectively and efficiently. The following sub-domain map shows a very high-level composition of the Purchasing domain:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2024/02/image.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2024%2F02%2Fimage.png%3Fw%3D1024"></a><br>
<em>High level break down of the Purchasing domain (simplified)</em></p>
<p>For this post, I will be focussing on the Supply sub-domain (shown in blue above) where we redesigned the engineering team organisation.</p>
<h3>
Domain vs Sub-Domain vs Bounded Contexts vs Teams
</h3>
<p>In DDD terminology, a <strong>sub-domain</strong> is a part of the <strong>domain</strong> with a specific, logically related subset of the overall business responsibilities, and it contributes towards the overall success of the domain. A domain can have multiple sub-domains, as you can see in the visual above. A sub-domain is a part of the problem space.</p>
<p>Sometimes it can be a bit difficult to differentiate between a domain and a sub-domain. From my point of view, it's all just domains: if a domain is large and complex enough, we tend to break it down into discrete areas of responsibility and capability called sub-domains. But I don't think this is a hard and fast rule.</p>
<p>A <strong><a href="proxy.php?url=https://martinfowler.com/bliki/BoundedContext.html" rel="noopener noreferrer">bounded context</a></strong> is the one and only place where the solution (often software) to a specific business problem lives; the terminology captured here is consistent in its usage and meaning. It represents an area of applicability of a domain model. E.g. the <em>Supplier Price and Availability</em> context will have software systems that know how to provide supplier prices and stock availability on a day-to-day basis. These terms have an unambiguous meaning in this context. The model that helps solve the problem of prices and stock availability is largely only applicable here and shouldn't be copied into other bounded contexts, because that would duplicate knowledge in multiple places and introduce data inconsistencies, leading to expensive-to-fix bugs. Bounded contexts therefore provide a way to encapsulate the complexities of a business concept while exposing only well-defined interfaces for others to interact with.</p>
<p>In an ideal world each sub-domain will map to exactly one bounded context owned and supported by exactly one team, but in reality multiple bounded contexts can be assigned to a sub-domain and one team might be supporting multiple bounded contexts and often multiple software systems in those contexts.</p>
<p>Here's an illustration of this organisation <em>(names are for illustrative purposes only)</em>:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image-4.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-4.png%3Fw%3D877"></a><br>
<em>An illustration of relationship between domain, sub-domain and bounded contexts (assume one team per sub-domain)</em></p>
<p>I am not going to go into the depths of strategic DDD but <a href="proxy.php?url=https://vladikk.com/2018/01/26/revisiting-the-basics-of-ddd/" rel="noopener noreferrer">here</a> <a href="proxy.php?url=https://github.com/ddd-crew/ddd-starter-modelling-process?tab=readme-ov-file#understand" rel="noopener noreferrer">are</a> some <a href="proxy.php?url=https://medium.com/nick-tune-tech-strategy-blog/domains-subdomain-problem-solution-space-in-ddd-clearly-defined-e0b49c7b586c" rel="noopener noreferrer">excellent</a> places to <a href="proxy.php?url=https://verraes.net/#blog" rel="noopener noreferrer">study</a> it and understand it better. The strategic aspects of DDD are really quite crucial to understand in order to design software systems that align well with business expectations.</p>
<h3>
Old Team Structure
</h3>
<p>Simply put, the Supply sub-domain is primarily responsible for creating and sending appropriate purchase orders to our suppliers for the products we want to buy, and managing their lifecycle to completion. There are of course ancillary stock-administration responsibilities that this sub-domain handles as well, but not all of those have been software-ified…yet.</p>
<p>Historically, we had split the product engineering teams into two (the names of the teams foreshadow the problems we would end up having):</p>
<p><strong>Stock Management 2</strong> : responsible for generating automated replenishment proposals and maintaining pre-purchase settings, and</p>
<p><strong>Stock Management 1</strong> : responsible for everything to do with purchase orders; over time, the responsibilities of maintaining EDI integration and store replenishment also fell on this team.</p>
<p>Though each team had a separate backlog, they shared the same Product Owner, and the responsibilities allocated to the teams grew…"organically". That is to say, the allocation wasn't always based on a team's expertise and responsibility area, but mostly on who had the bandwidth and space available in their backlog to build something. Purely efficiency-focussed (<em>how do we parallelise to get the most work done</em>), not effectiveness-focussed (<em>how do we organise to increase autonomy and expertise, and deliver the best outcomes for the business</em>).</p>
<p>Because of this mindset, Stock Management 2 over time also took on responsibilities that would have better fit Stock Management 1. For example, they built a recommendation system on top of purchase orders, something they had very little knowledge of. They ended up duplicating a lot of purchase order knowledge in this system (they had to, in order to create good recommendations). This also required replicating purchase order data in a different system, which would later create data consistency problems.</p>
<p>As a result, dependencies grew in unstructured and unwanted ways, e.g. a lot of database sharing between the two teams, and complex inter-service dependencies with multi-service hops required to resolve all the data needed for a given use case. The system architecture also grew "organically", with little to no alignment with the business processes it supported, and the accidental complexity increased. Looking at the team names, no one could really tell what either team was responsible for, because their responsibilities were neither well documented nor stable.</p>
<p>We ended up operating in this unstructured way until July 2023.</p>
<h3>
Trigger for Review
</h3>
<p>The trigger to review our team boundaries came in Q1 2023, when we nearly made the mistake of combining the two teams into one single large team with joint scrum ceremonies, along with a proposal to add more process to manage this large team (LeSS). None of it had taken into account the business capabilities the teams supported or the desired-state architecture we wanted. It was clear that no research had been done into how the industry solves this problem, and it was being approached purely from a management-convenience point of view.</p>
<p>Large teams, especially in a context that supports multiple business processes, are a bad idea in many ways (some of these are not unique to large teams):</p>
<ul>
<li>Large teams are expen$ive; you'd often need more seniors on a large team to keep the technical quality high and technical debt low</li>
<li>No real ownership or expertise of anything and no clear boundaries</li>
<li>Team members are treated as feature factories instead of problem solving partners</li>
<li>Output is favoured over outcomes, business value delivered is equated to story points completed</li>
<li>Cognitive load and coordination/communication overhead increases</li>
<li>Meetings become less effective and people tend to tune out <em>(I tend to doodle geometric shapes, it's fun!)</em>
</li>
<li>Product loses direction and vision; it's all about cramming in more features, which fuels the need to make the team bigger. Because of course, more people will make you go faster…NOT!</li>
<li>Often more process is required to "manage" large teams, which kills team motivation and autonomy</li>
</ul>
<p>This achieves the exact opposite of agility, and we saw the following degrading results when we briefly experimented with the large-team idea:</p>
<ul>
<li>Joint sessions were becoming difficult and inefficient to participate in (not everyone can or will join on time) </li>
<li>Often team members walked away with completely different understandings and mental models, which got put into code.</li>
<li>Often there was confusion about who was doing what, which increased the coordination overhead</li>
<li>Given that the two teams had historically been separate, with their own coding and PR standards, there was often friction in resolving these conflicts, which slowed down delivery and reduced inter-team trust.</li>
</ul>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage.png%3Fw%3D1024"></a><br>
<em>Communication overhead grows as number of people in the group increases</em></p>
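<p>The caption above reflects the well-known combinatorics behind this: a group of n people has n(n-1)/2 pairwise communication channels, so the overhead grows quadratically with team size. A quick illustration:</p>

```python
def channels(n: int) -> int:
    """Pairwise communication channels in a group of n people: n*(n-1)/2."""
    return n * (n - 1) // 2

# Doubling a team from 5 to 10 people more than quadruples the channels.
for size in (5, 8, 10, 15):
    print(size, channels(size))  # 5→10, 8→28, 10→45, 15→105
```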
<p>The worst part of all of this is learned helplessness! We become so desensitised to our conditions that we accept the sub-optimal state as our new reality.</p>
<p>So combining teams and adding more process wasn't going to be the solution here, and it most certainly shouldn't be applied without involving the people whose work lives are about to be impacted, i.e. the engineering teams.</p>
<p>These reorganisations should also not be done devoid of any alignment with the business process, because you risk the system architecture either not being fit for purpose or being too complex for the team(s) to handle, with all sorts of assumptions baked into the design.</p>
<h3>
Team Topologies and Domain Driven Design
</h3>
<p>I had a feeling that we needed to take a different approach here, and by this time I had been hearing a lot about <a href="proxy.php?url=https://teamtopologies.com/" rel="noopener noreferrer">Team Topologies</a>, so I bought the <a href="proxy.php?url=https://www.amazon.com/Team-Topologies-Organizing-Business-Technology/dp/1942788819/ref=sr_1_1?crid=2UBW1RHA4KIFI&keywords=Team+Topologies&qid=1703427691&sprefix=team+topologie%2Caps%2C160&sr=8-1" rel="noopener noreferrer">book</a> (highly recommended) and read it cover to cover…twice…to understand its core ideas. A lot of people know about <a href="proxy.php?url=https://martinfowler.com/bliki/ConwaysLaw.html" rel="noopener noreferrer">Conway's Law</a>, but Team Topologies really brings the double-edged nature of Conway's Law into focus. Ignore it at your own peril!</p>
<p>This Comic Agile strip sums up how that realisation dawned on me after reading the book:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/pasted-image-20230928210406.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fpasted-image-20230928210406.png%3Fw%3D1024"></a><br>
<em>Check out more hilarious strips <a href="proxy.php?url=https://www.comicagile.net/" rel="noopener noreferrer">here</a></em></p>
<p>Traditionally, team and domain organisation in most companies has been done by business people far removed from the engineering teams, meaning a critical perspective is missing from those discussions: <em>that of the system architecture</em>. And because team design influences software design, many companies end up shooting themselves in the foot with unwieldy, misaligned software that delivers the opposite of agility. This is exactly why it's crucial to have representation from engineering in these reorganisations. Just because something works doesn't mean it's not broken!</p>
<p>By this time we had also conducted several <a href="proxy.php?url=https://www.eventstorming.com/" rel="noopener noreferrer">event storming</a> sessions for the core Supply sub-domain (for the entire purchase ordering flow) to identify critical domain events, possible bounded contexts and what we want our future state to be. I cannot emphasise enough how important this kind of event storming can be in helping surface complexity, potential boundaries and opportunities to improve the current state.</p>
<p>Putting Team Topologies and strategic DDD together to create deliberate team boundaries was just a no-brainer.</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/core-purchase-ordering-event-storm.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fcore-purchase-ordering-event-storm.png%3Fw%3D1024"></a><br>
<em>Don't worry, you are not meant to read the text; the identified boundaries are more important</em></p>
<p>It's also worth bearing in mind that this wasn't a greenfield operation: we had existing software systems that had to be mapped onto some of the bounded contexts, at least until we could determine their ultimate fate. Some of the bounded contexts had to be drawn around those existing systems to keep their complexity from leaking into other contexts.</p>
<h3>
Brainstorming on New Team Design
</h3>
<p>In May 2023, I got together with our development lead and our domain manager to brainstorm how we could organise our teams not only for efficiency but, crucially, this time also <strong>for effectiveness</strong>.</p>
<p>In these discussions I presented the ideas of Team Topologies and insights from the event storms we had been doing. According to Team Topologies, team organisations can essentially be reduced to the <a href="proxy.php?url=https://teamtopologies.com/key-concepts" rel="noopener noreferrer">following 4 topologies</a>:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image-6.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-6.png%3Fw%3D710"></a><br>
<em>Four fundamental topologies</em></p>
<p>Based on these and my formative understanding, I presented the following team design options:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image-7.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-7.png%3Fw%3D1024"></a><br>
<em>The 2 team model</em></p>
<p>This model makes the Purchase Ordering team (stream aligned) solely responsible for full purchase order lifecycle handling, including the replenishment proposals (which is an automated way to create purchase orders). The Pre Purchase Settings team (platform team) will provide supporting services to the PO team (e.g. supplier connectivity and price & availability services, purchase price administration services, various replenishment settings services etc).</p>
<p>Another model was this:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image-8.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-8.png%3Fw%3D1024"></a><br>
<em>The 3 team model</em></p>
<p>In the 3-team model, I split the replenishment proposals part out of the Purchase Ordering team, added to it the new actionable products capability we were working on, and created another stream-aligned team: the Replenishment Optimisation team. The platform team will now provide supporting services to both stream-aligned teams, and the new optimisation team will essentially provide decision-making insights to the Purchase Ordering team.</p>
<p>In a perfect world, you want to assign one team per bounded context, and as is evident from the event storm we had several contexts. But Team Topologies also warns us to make sure the complexity of the work warrants a dedicated team; otherwise, you risk losing people to low motivation while still bearing the cost of running multiple teams.</p>
<p>Nevertheless, after weighing practical constraints like money, complexity and team motivation, and perhaps <strong>most importantly</strong> the impact of each design on the overall system architecture and what we wanted our desired-state architecture to look like, we settled on the following cut:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image-9.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-9.png%3Fw%3D1024"></a><br>
<em>Final team split</em></p>
<p>Basically, at their core, the Purchase Order Decisions team will own all components that factor into purchasing decision making:</p>
<ul>
<li>Replenishment recommendation generation</li>
<li>Purchase order creation and verification</li>
<li>Actionable product insights</li>
</ul>
<p>And the Purchase Order Management team will own all components involved in managing the lifecycle of submitted purchase orders <em>(I know "management" is a bit of a weasel word, but I am hoping we will find a better name over time)</em>:</p>
<ul>
<li>Purchase order submission</li>
<li>Purchase order lifecycle management/adjustments (manual and system generated)</li>
</ul>
<p>The central idea behind this split is that purchase order verification is a pivotal event in our event storm: once a purchase order is verified, it will always be submitted. Submission is a key pre-condition to managing the pending purchase order lifecycle, and it carries sufficient complexity due to the communication involved with suppliers and our own warehouse management system, so it makes sense for Purchase Order Management to own everything from submission onwards. This also makes them the sole owner of the purchase order database, which breaks the shared-database anti-pattern and relies instead on asynchronous, event-driven communication between the bounded contexts owned by the teams. The benefit is that we can establish clearer communication contracts and expectations without needing to know the internals of another context.</p>
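<p>As a hedged sketch of this asynchronous, event-driven style (the event name and payload are hypothetical, and a real system would use a message broker rather than an in-memory bus): one context publishes a domain event, and the other reacts to it without either side knowing the other's internals.</p>

```python
from typing import Callable
from collections import defaultdict

# Minimal in-memory event bus; in production this would be a message broker.
class EventBus:
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._handlers[event_type]:
            handler(payload)

bus = EventBus()
submitted = []

# The (hypothetical) Management context reacts to the event contract only;
# it knows nothing about how the Decisions context produced it.
bus.subscribe("PurchaseOrderVerified",
              lambda e: submitted.append(e["order_id"]))

# The (hypothetical) Decisions context publishes after verification.
bus.publish("PurchaseOrderVerified", {"order_id": "PO-42", "supplier": "ACME"})
```

<p>The contract between the teams is then just the event name and payload shape, which can be documented and versioned independently of either team's internal model.</p>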
<p>In addition to this, we also identified several supporting capabilities/bounded contexts for which the complexity just wasn't high enough to warrant a separate team entirely, at least for now:</p>
<ul>
<li>Supplier price and availability retrieval</li>
<li>EDI connection management</li>
<li>Despatch advice forwarding</li>
<li>E-mail based supplier communication</li>
</ul>
<p>These capabilities still had to be allocated between the two teams, so based on whether they belong more to the decision-making part or the management part, we made the following allocations:</p>
<ul>
<li>Supplier price and availability retrieval <em>(Purchase Order Decisions, because it's only used whilst creating replenishment recommendations and subsequent purchase orders)</em>
</li>
<li>EDI connection management, despatch advice forwarding <em>(Purchase Order Management, because they already owned these and they definitely didn't make sense as part of the decision-making flows)</em>
</li>
<li>E-mail based supplier communication <em>(Purchase Order Management, because purchase order submission can happen via EDI or via e-mail, so it makes sense for them to own all aspects of submission)</em>
</li>
</ul>
<p>This brought the final design of teams to this:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2024/02/image-2.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2024%2F02%2Fimage-2.png%3Fw%3D1024"></a><br>
<em>Final team cut with bounded contexts owned by each</em></p>
<p>It might seem a bit excessive to assign multiple bounded contexts to a single team, and like I said, in a perfect world I would have one team responsible for only one bounded context. But considering the constraints I mentioned before (cognitive load, complexity of the challenge and the financial cost of setting up many teams), I think this is a pragmatic choice for now. The identified bounded contexts are also not set in stone, so it's entirely possible we might combine some of them into a single bounded context based on conceptual and linguistic cohesion. We might even split some out into dedicated teams, should those bounded contexts grow complex enough to warrant them.</p>
<p>NB: A bounded context might not always mean a single deployment unit (i.e. a service or an application). A single BC can map to one or more related services if the rates of change, fault tolerance requirements and deployment frequencies dictate as much. The single most important thing about BCs is that they encapsulate a single distinct business concept, with a consistent business language and consistent meanings of terms, so it's perfectly plausible that there are good drivers for splitting one BC into multiple deployment units.</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2024/02/image-1.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2024%2F02%2Fimage-1.png%3Fw%3D1024"></a><br>
<em>Some heuristics for determining bounded contexts</em></p>
<h3>
Go Live!
</h3>
<p>In June 2023 we presented this design to both teams and asked for feedback. Both teams could see the value of the split, because it created better ownership boundaries and better focus, and offered an opportunity to reduce the cognitive overhead of communicating within a large team. So in July 2023 we put the new team organisation live, made all the administrative changes (changing team names in the HR systems and Slack channels, assigning the right teams to code repositories based on the allocations, etc.) and got to work in the new set-up.</p>
<h3>
Reflection
</h3>
<p>Whilst this team organisation is definitely the best we've ever had in terms of cleaner ownership boundaries, a relatively appropriate allocation of cognitive load, and a better sense of purpose and autonomy, it's by no means the best <strong>we will ever have</strong>. The most important thing about agility is continuous improvement, and DDD tells us there is no single best model, so it only makes sense to revisit these designs regularly and seize any opportunities for improvement along any of those axes, to keep aligned with the business and deliver value effectively. The organisation and the domain never stay the same; they grow in complexity, so it's crucial for engineering teams to evolve along with them in order to stay efficient and effective, and for the architecture to stay in alignment with the business. I loosely equate teams and organisations to living organisms that self-organise like cellular mitosis; it's the natural order of things.</p>
<p>Of course things are not perfect; both teams still have some degree of functional coupling, i.e. if the model of the purchase order changes fundamentally, or if we need to support new purchase order types, both teams will need to change their systems and coordinate to some extent. This is a trade-off of this team design, but largely the teams are still autonomous and communicate asynchronously for the most part. Any propagation of model changes can be further limited by use of appropriate anti-corruption layers on either side.</p>
<p>One of the other significant benefits of this deliberate reorganisation is that both teams created a north-star roadmap for the desired-state architecture. For a long time, both teams had incurred unwarranted technical complexity in the form of arbitrarily created services with mixed programming paradigms, which were becoming difficult for a small team to maintain. Contract coupling at multiple service integration points made the smallest of changes ripple out to multiple systems, which then had to be changed in a specific order to deploy safely (we've had outages in the past because we forgot to update the contracts consistently).</p>
<p>As part of our new engineering roadmap, we are now reviewing these services with a strategic DDD eye and asking, "what business capability does this service provide?" If the answer is similar for two services and there are none of the benefits of <a href="proxy.php?url=https://martinfowler.com/microservices/" rel="noopener noreferrer">microservices</a> to be gained, those two services will be combined into a single modular monolith. Some services will not make sense in the new organisation, so they will be decommissioned and the communication pathways simplified. We project a potential 40% reduction in the complexity of the overall system landscape from these changes (and hopefully some cost savings as well); at the very least, the complexity will be better contained. But perhaps most importantly, we aim to make the architectural complexity fit the cognitive bandwidth of the teams, ensuring a team can own the flow end to end.</p>
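<p>As a rough sketch of what "combining two services into a modular monolith" can look like (the module names are hypothetical, not our actual services): the capability boundaries survive as in-process modules with explicit interfaces, while the network hop and the versioned wire contract between them disappear.</p>

```python
# Two formerly separate services become modules in one deployable.
# Each module keeps its own interface; only the transport changes
# (an in-process call instead of HTTP), so the boundaries stay explicit.

class PriceModule:
    """Formerly a (hypothetical) standalone price service."""
    def unit_price_cents(self, product_id: str) -> int:
        return {"p1": 500}.get(product_id, 0)

class ProposalModule:
    """Formerly a (hypothetical) replenishment proposal service."""
    def __init__(self, prices: PriceModule) -> None:
        self._prices = prices  # direct dependency, no network contract to version

    def proposal_cost_cents(self, product_id: str, quantity: int) -> int:
        return self._prices.unit_price_cents(product_id) * quantity

# One deployment unit wires the modules together at startup.
app_prices = PriceModule()
app_proposals = ProposalModule(app_prices)
print(app_proposals.proposal_cost_cents("p1", 10))  # 5000
```

<p>The design choice being sketched: keep the module seams crisp enough that, should one capability later warrant independent scaling or deployment, it can be split back out without untangling the code.</p>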
<p>Another thing we will work on next is strengthening our boundaries with dependent teams. Historically, the e-commerce database has been shared with all teams at Coolblue, and this creates challenges (a subject for another post). Going forward we will be improving our web services and events portfolio so that dependents can use our service contracts to communicate with our systems instead of sharing databases. With a better sense of what we do and don't own, I expect these interfaces to become crisper over time.</p>
<p>These kinds of reorganisations can have a long maturity cycle before it becomes clear whether the decisions and team boundaries were the right ones, and organising teams is just the first, though a significant, step. The key is keeping the discussion going and being deliberate about our system design decisions, to ensure that business domains and system design stay in alignment. To that end we will continue investing in Domain Driven Design practices, so business and engineering can collaborate effectively to create systems that better reflect domain expectations whilst keeping complexity low and maintaining acceptably high levels of fault tolerance and autonomy of value delivery.</p>
teamtopologies
domaindrivendesign
architecture
-
Monolith vs Microservices
Aman Agrawal
Sun, 01 Oct 2023 20:49:09 +0000
https://dev.to/coolblue/monolith-vs-microservices-14f6
<p>One of my colleagues shared this <a href="proxy.php?url=https://renegadeotter.com/2023/09/10/death-by-a-thousand-microservices.html">article</a> with me a few days ago, and having read through it (and many others like it before), I felt I needed to provide a <em>hopefully</em> more balanced perspective on this age-old debate, based on my own experiences and learnings. So in this post, that's what I am going to attempt to do.</p>
<h3>
Successful Startup != Microservices
</h3>
<p>The author shares a link to a <a href="proxy.php?url=https://kenkantzer.com/learnings-from-5-years-of-tech-startup-code-audits/">security audit</a> of startup codebases and emphasises point number 2 of that article: that all successful startups kept their code simple and steered clear of microservices until they knew better.</p>
<p>I can see why: microservices are an optimisation pattern an org may need to apply when it scales beyond a certain size. When you are a fledgling startup with limited funding and an uncertain future, microservices are the last thing you should be worried about (and preferably not at all). All the time and money at this stage should be spent on generating value and treating your employees well (the latter is not optional…ever).</p>
<p>Here's an <a href="proxy.php?url=https://www.youtube.com/watch?v=t7iVCIYQbgk&pp=ygUSbW9uem8gbWljcm9zZXJ2aWNl">example</a> of a modern digital bank that went all in with microservices from the get-go. They boast about their 1500+ microservices, which a while ago invited some <a href="proxy.php?url=https://twitter.com/Grady_Booch/status/1190894532977520640">flak on Twitter</a>. From my limited point of view on their context, this looks like insanity, and though they touch on all the challenges that come with this kind of architecture, I can't help but think that somewhere deep down they go, "Wish we hadn't done this so soon!" But I am willing to give them the benefit of the doubt that they did their due diligence whilst evolving into a complex architecture and building a platform to support it.</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/10/image-1.png"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--j9s4FXxf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/10/image-1.png%3Fw%3D586" alt="" width="586" height="654"></a></p>
<h3>
Microservices Make Security Audit Harder
</h3>
<p>If I have 1500+ services potentially written in different languages, spread across hundreds if not thousands of repositories, my job as a security auditor just got exponentially harder! <a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--cbiE9eth--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f629.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--cbiE9eth--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f629.png" alt="😩" width="72" height="72"></a> The author even mentions that in point 7 of his list: <strong>Monorepos are easier to audit</strong>. Monzo's 1500+ Go services are in a monorepo, so that's one down, I guess.</p>
<p>The security attack surface also gets that much wider: can you ensure all 1500+ of your microservices leverage a security-hardened platform and industry best practices in a standard way? Do you even know what those are? What about the dependencies (direct and transitive) each of those services takes on external code?</p>
<p>I think these are probably the most significant drivers for a security professional to gripe about microservices, but the more you distribute, the more standardisation you need on the platform front. You don't want to be reinventing the wheel, especially when it comes to security, so the more sensible defaults you can bake into the platform, the better and easier it might be to audit.</p>
<h3>
Do we <em>really</em> need microservices?
</h3>
<p>I agree with the author that in some cases there can be <em>"a dogma attached to not starting out with microservices on day one — no matter the problem"</em>. Just because someone else (usually multi-billion dollar organisations with a global footprint and tens of thousands of engineers, think FAANG) is doing microservices doesn't mean my 5-person startup also needs microservices.</p>
<p>But I have to add a bit of nuance here: an org doesn't have to reach FAANG scale to realise it needs to rearchitect. If my org is growing in terms of revenue, size and technology investment, then regularly asking the following kinds of questions is part of engineering due diligence:</p>
<ul>
<li>Is the current monolithic architecture with a shared database still the right thing to do?</li>
<li>Are we facing challenges in some areas where our current architecture is impeding our value delivery? If so, what might be some ways to alleviate that pain?</li>
<li>How much longer can this system keep growing at the same pace as the org and still be maintainable and agile?</li>
</ul>
<p>Agile organisations and agile architectures are the ones that can evolve with time and need. The complexity of the architecture should be commensurate with the organisation's growth rate and ambitions. No more, no less.</p>
<h3>
How do web cos grow into microservices?
</h3>
<p>None of the web cos evolved to microservices overnight; it was a long, arduous journey over decades (far longer than most employees' tenure in an organisation, btw). Here's <a href="proxy.php?url=https://www.infoq.com/presentations/shoup-ebay-architectural-principles/">eBay's</a> journey to microservices, here's <a href="proxy.php?url=https://www.infoq.com/presentations/microservices-netflix-industry/">Netflix's</a> and here's <a href="proxy.php?url=https://www.allthingsdistributed.com/2022/11/amazon-1998-distributed-computing-manifesto.html">Amazon's</a>. In all cases you will notice that even though today they are microservice behemoths, they started the thinking and the groundwork many years prior, when they were much smaller than they are today. Amazon, for example, started their thinking back in 1998, a full 25 years ago, which ultimately resulted in the manifesto linked above.</p>
<p>This is a testament to the forward thinking and agility that helped them survive and succeed. If they had waited until they got to today's scale (assuming they ever managed to reach it in the first place) to start decomposing their architecture for growth and evolution, they probably wouldn't have made it.</p>
<p>So just blindly touting "there is nothing wrong with a monolith" or "don't do microservices" without justifying the arguments or clarifying the nuances is no different from someone wanting 1500+ microservices because someone else is doing it.</p>
<h3>
Look at where you are and where you want to be
</h3>
<p>It's also true that many organisations are still monolithic-ish (from a technical pov), for example <a href="proxy.php?url=https://hanselminutes.com/847/engineering-stack-overflow-with-roberta-arcoverde">StackOverflow</a> and <a href="proxy.php?url=https://blog.quastor.org/p/shopify-ensures-consistent-reads">Shopify</a>, and there are probably more. But it's not as if StackOverflow will never entertain the possibility: they have multiple teams responsible for various parts of the site, so if they need to scale and increase the fault tolerance of a specific set of teams, they can always factor services out.</p>
<p>The article also gives the examples of Instagram and Threads, but what it omits is that <a href="proxy.php?url=https://newsletter.pragmaticengineer.com/p/building-the-threads-app">Threads</a> is built on top of Meta's massive platform, which is a collection of different and largely reusable services. Can you imagine the complexity of building something like that from the ground up?</p>
<p>I can be pro-monolith and pro-large shared databases as an organisation as long as I regularly and critically review my architecture to sense signs of troubles and be mature enough to evolve it into a better state.</p>
<h3>
Problems with Distribution
</h3>
<p>Here's where I probably agree somewhat with the author, though I also think these are not problems unique to microservices:</p>
<blockquote>
<p>Say goodbye to DRY</p>
</blockquote>
<p>Somewhat yes, but mostly no! It depends on what is being duplicated and whether it can <em>really</em> be considered duplication. If it's knowledge of a domain concept that's being duplicated, then that's bad and usually an indication of incorrect boundaries. If it's the data <strong>contract</strong> on the provider and consumer ends, that's not really duplication.</p>
<p>This is also not a problem exclusive to service architectures: given a sufficiently large monolithic codebase (and depending on how well it's modularised), I can bet you can duplicate knowledge in a monolith as well, because in a rush to deliver, that's just how engineers behave. Granted, it might be easier to spot and remedy when all the code is in one place than when it's spread across multiple codebases, but then that's what you want to do even in a service architecture, i.e. combine logically related codebases to reduce knowledge duplication. Nothing about a many-service architecture stops you from combining services when you need to.</p>
<p>As a matter of fact, in my teams we're simplifying our many-service architecture into a smaller set of carefully combined services. <strong>Note: services are not going anywhere, they are just getting a little less…micro. We are still working to decompose our shared monolithic e-commerce database by defining ownership boundaries around business capabilities.</strong></p>
<p>When combining services is not really an option, creating packaged libraries for common functionality and pushing them to a central package registry for easier reuse is the next best thing.</p>
<blockquote>
<p>Developer ergonomics will crater</p>
</blockquote>
<p>Yep! For new joiners in a team, even with all the support, guidance and onboarding, knowing the whole landscape can be quite daunting. And yes, over time you build a solid mental model and can find the exact line of code in the exact service on the critical path with a 2-minute GitHub search, but it can be a long time before that happens.</p>
<p>Not to mention the time wasted just trying to get a service that doesn't change often up and running on an engineer's machine, because people forget things they don't look at and the environment changes underneath them.</p>
<p>But once again, having a monolith doesn't make it magically easier, especially if the monolith is sufficiently large. I would still need to make sure all the configuration for all the modules is set up to bring the system up locally, regardless of whether or not I need to touch that part of the system. With separate services, you only pay that cost for the module you need to work on. Of course, a lot depends on how the monolith is designed.</p>
<blockquote>
<p>Integration tests — LOL</p>
</blockquote>
<p>Yeah, kind of! But I would challenge this by saying that meaningful and fast integration testing in any sizeable organisation (think 40 different domains and 500+ engineers) long ago left the building. Integration testing, though useful, shouldn't be the only way we test our code, monolith or microservices, because unless you are building your own payment gateway or geocoding platform, even your monolith will have external dependencies. You can forget about being able to do reliable and fast integration testing.</p>
<p>I would hate to see your testing code if the only tests you ever have are opaque integration tests with complicated dependency setup. How would one even reason about those tests? And if I can't reason about them, I would probably disable or remove them, or they'd get flaky over time, in which case I'd have even less confidence to deploy changes. Having said that, the more dependencies you have (e.g. with microservices) the harder integration testing becomes, but it is equally true that the more integration tests you have, the harder they can be to maintain.</p>
<blockquote>
<p>"observability" is the new "debugging in production"</p>
</blockquote>
<p>Observability as a practice is not restricted to microservices or monoliths. It's just a sensible thing to do to get visibility into what the system is doing and how it is performing <strong>over time</strong>. It is essential for debugging production systems <em>(mono or micro)</em>. You can't step-through debug code in production <em>(though I have done it in the past with the <a href="proxy.php?url=https://learn.microsoft.com/en-us/visualstudio/debugger/remote-debugging?view=vs-2022">Visual Studio Remote Debugging</a> feature; back then it wasn't a nice experience)</em>. Even if you could debug that way, the problem may not always be replicable because the production environment is not 100% predictable, and that's why I rely on logs and metrics to observe the system's performance over time, create a direction for my debugging and understand its rhythm.</p>
<p>No integration test can give you the profile over time that good monitoring does, because integration tests are a snapshot in time. Production is where the software is really tested, so yes I do want good observability in order to understand my system and troubleshoot it effectively.</p>
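<p>To make the idea concrete, here is a minimal sketch of the kind of structured, timestamped log lines that make that over-time profile possible. The helper names and fields are illustrative, not any particular vendor's API:</p>

```typescript
// Minimal structured-logging sketch: each operation emits a JSON line
// with a timestamp and duration, so dashboards can chart behaviour over
// time instead of relying on a debugger. Names here are illustrative.
function logEvent(fields: Record<string, unknown>): string {
  const line = JSON.stringify({ timestamp: new Date().toISOString(), ...fields });
  console.log(line);
  return line;
}

// Wrap any unit of work so its duration is recorded, even on failure.
function timed<T>(operation: string, fn: () => T): T {
  const start = Date.now();
  try {
    return fn();
  } finally {
    logEvent({ operation, durationMs: Date.now() - start });
  }
}
```

<p>Aggregating these lines over days or weeks gives the profile a snapshot-in-time integration test never can.</p>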
<blockquote>
<p>What about just âservicesâ?</p>
</blockquote>
<p>Read on⊠<a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--sdPsxGXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f642.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--sdPsxGXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f642.png" alt="đ" width="72" height="72"></a></p>
<h3>
Services are about org design and business capabilities
</h3>
<p>Just because an org might not be planet scale doesnât mean they canât benefit from decomposing large systems into smaller ones to gain autonomy and resilience.</p>
<p>What an organisation should invest in is identifying how value flows through it and which people are empowered, and have the capability, to make decisions, and then drawing contextual boundaries around those groups. Creating a stable platform that minimises reinventing the wheel is also crucial as the org grows; otherwise the amount of rework and grunt work across multiple services alone will be a drag, and they will be writing about how microservices failed them.</p>
<p>The ideas in the <a href="proxy.php?url=https://teamtopologies.com/key-concepts">Team Topologies</a> book describe this kind of org design, which allows a better implementation of Conway's Law. Domain Driven Design talks about <a href="proxy.php?url=https://www.martinfowler.com/bliki/BoundedContext.html">bounded contexts</a> that create these <em>relatively</em> autonomous zones within an organisation that are loosely coupled from a functional and technical perspective.</p>
<p>Focusing on the flow of value and organisation design should result in sensibly sized services that are driven by domain boundaries instead of technical wet dreaming. Micro, nano, pico or…mega…is irrelevant, because any change in service granularity will (and should) be triggered by changes to the business value flow it delivers, so a service should be as big as it needs to be. Splitting services for its own sake, or combining services for its own sake (because you drank too much of the "nothing wrong with a monolith" kool-aid), is ill-informed cargo-culting. That's a surefire way to the madness the author is talking about.</p>
<h3>
How does one determine value flow and create better boundaries?
</h3>
<p>This needs its own post (or ten) so I will leave a cop-out list of other buzzwords to consider:</p>
<ul>
<li><a href="proxy.php?url=https://en.wikipedia.org/wiki/Value-stream_mapping">Value stream mapping</a></li>
<li>Big Picture Event Storming</li>
<li><a href="proxy.php?url=https://www.infoq.com/articles/ddd-contextmapping/">Context Mapping</a></li>
<li>Domain modeling</li>
</ul>
<p><strong>N.B.</strong> The initial boundaries you draw will probably be wrong, so be prepared to revisit and refactor them. You don't want to stick with ineffective boundaries for too long.</p>
<h3>
In Closing…
</h3>
<ul>
<li>Many-service architecture (I am not calling it microservices anymore) is definitely a scaling and optimisation pattern that shouldn't be applied haphazardly or lightly just because you think it puts you in the cool kid category. It adds complexity because of the many moving parts, increases the failure modes to consider and might even negatively impact system performance</li>
<li>Pay attention to business capabilities and ownership boundaries (i.e. bounded contexts) by identifying flow of value in the org</li>
<li>Create services in correspondence to the bounded contexts and be prepared to redraw the boundaries and rearchitect both ways, that is:
<ul>
<li>If you do this due diligence then you can even design a modular monolith to start with and split when actually needed, and</li>
<li>Armed with those insights you can even combine multiple services into fewer to align better with the contextual boundaries.</li>
</ul>
</li>
<li>Sometimes team reorganisation can cause reallocation of capabilities across portfolios; if you have a scruffy monolith, splitting out services to hand over will be harder than if you already had services.</li>
<li>You cannot have a loosely coupled services architecture if you are still sharing the monolithic database. If you are carving out services from the monolith, also take your data with you. Shared databases start out innocently enough when the org is small and simple, but they are like bear cubs: eventually they get bigger, scarier and toothier, and then they are no fun. Make breaking up the monolithic database a part of your engineering strategy</li>
<li>The organisation needs to have, or be willing to build, a certain level of engineering maturity and leadership to execute a successful many-service architecture evolution on top of a stable platform</li>
<li>A thoughtlessly designed monolith is just as bad as thoughtlessly designed microservices.</li>
</ul>
microservices
domaindrivendesign
boundedcontexts
conwayslaw
-
Our Next(JS) Webshop
Stef van Hooijdonk
Mon, 17 Jul 2023 07:48:00 +0000
https://dev.to/coolblue/our-nextjs-webshop-4nio
https://dev.to/coolblue/our-nextjs-webshop-4nio
<h2>
Ownership continued
</h2>
<p>In recent posts we have shared our views on <a href="proxy.php?url=https://dev.to/coolblue/guided-ownership-422j">ownership</a> and how we use that @ Coolblue to develop our software. We value ownership on a team level as also seen in re-designing / re-engineering our 'backoffice' <a href="proxy.php?url=https://dev.to/coolblue/the-monolith-in-the-room-1947">monolith</a>.</p>
<p>Since we last wrote about our <a href="proxy.php?url=https://dev.to/coolblue/tech-principles-coolblue-1a2k">Tech Principles</a>, we have actually added a Tech Principle to our collection that should help our teams understand ownership when we look at services, events and the connections between them, especially how we want teams to handle these interdependencies in relation to new development and maintenance. But that is something for another post (Soon ™).</p>
<p>This post will focus on our journey towards the "technical design" for our next implementation of our webshop (<a href="proxy.php?url=https://coolblue.nl" rel="noopener noreferrer">https://coolblue.nl</a>). Our current webshop codebase is considered a Monolith. As far as I can see in Github the first commit dates back to June 17 of 2010. Over 13 years ago. </p>
<p>Today we have about 10 to 15 development teams working on this single codebase to make our customers smile. Each one of those teams has a pretty specific goal. And as such have to work together in this single codebase. This makes efforts that touch on large parts of this code base, like an upgrade to the next version of PHP, a <em>shared</em> and a multi team problem. Our <a href="proxy.php?url=https://coolblue-blueprint.com" rel="noopener noreferrer">Design System</a> is tied to this same codebase.</p>
<p>Secondly we mainly use <a href="proxy.php?url=https://www.php.net" rel="noopener noreferrer">PHP</a> and <a href="proxy.php?url=https://twig.symfony.com" rel="noopener noreferrer">Twig</a> for the implementation of our current Webshop. Great technologies, but we want to leverage a more modern set to allow for more fluid user interactions and to keep and <a href="proxy.php?url=https://www.careersatcoolblue.com/tech/development/?query=front" rel="noopener noreferrer">attract talent</a>.</p>
<blockquote>
<p>Time for a change.</p>
</blockquote>
<h2>
From Monolith to ?
</h2>
<p>In the introduction I already mentioned a few of the concerns we have with our current Webshop implementation. Based on those, we decided to investigate how we could address them.</p>
<h3>
Team Scope versus Team Disciplines
</h3>
<p>Technology-wise, we are looking to "downsize" the solutions a development team owns, i.e. reduce their overall size. Size is still non-deterministic; it is a meta-unit composed of multiple factors: complexity, lines of code, number of services, number of classes and more. <br>
We want to move from a monolith towards multiple smaller solutions that match more closely what a team with a goal can manage, build, maintain and enhance to deliver more value for our business.</p>
<h3>
A piece of history
</h3>
<p>About 2 years ago we held our first brainstorm session "Webshop Tech vNext". </p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7o5nhxcpt6sqy0ip3km.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7o5nhxcpt6sqy0ip3km.png" alt="Frontend Futures" width="800" height="103"></a></p>
<p>One key decision we took back then was to start using React as the base for our back office applications and the design system for them.</p>
<p>About 9 months ago, we held another brainstorm session to look at options for our Webshop. At that time we noticed that NextJS, together with React, could be a replacement for our Webshop tech stack. </p>
<p>We also learned about microfrontends. I myself found the <a href="proxy.php?url=https://micro-frontends.org" rel="noopener noreferrer">explanation on this page</a> useful. We were already doing more and more with services and APIs in many of our other solutions, and the analogy made total sense to us, especially in light of our ideas around ownership.</p>
<h3>
Proof of Concept time
</h3>
<p>We set out to test a few topics to learn how these would work for us in a microfrontend world.</p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs63c1wq1xtvcriex8zy.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs63c1wq1xtvcriex8zy.png" alt="microfrontend in our Tech Radar" width="583" height="261"></a></p>
<p>During the Proof of Concept (Q1-Q2 of 2023) we actually implemented a piece of the routing needed for microfrontends in our Coolblue.nl webshop production codebase, visible to and used by our developers only, of course:</p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqkqh7k411dluxs1icyo.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqkqh7k411dluxs1icyo.png" alt="Proof of Concept in Production" width="800" height="353"></a></p>
<p>We found no large blockers and we decided to move ahead.</p>
<h3>
Roadmap it
</h3>
<p>Rebuilding our Webshop is a tremendous effort, which is why we created a plan over the last few weeks and have started on this massive journey.</p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftoadjm5xdy76b6c70yrl.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftoadjm5xdy76b6c70yrl.png" alt="microfrontend Roadmap blurred 07-2023" width="500" height="202"></a></p>
<p>In a year, we hope to share more on our learnings going through this process. But looking at the plan, I am sure that if you are a customer of ours, you will have been served a page or two based on this new implementation. </p>
<h2>
Technologies that help us achieve this
</h2>
<p>If you are reading this post just to learn "What technology is Coolblue using for their webshop?", sorry to have kept you waiting so long.</p>
<ul>
<li>Requests (users) will be routed to the right microfrontend application with <a href="proxy.php?url=https://aws.amazon.com/lambda/edge/" rel="noopener noreferrer">AWS Lambda@Edge</a>
</li>
<li>The Application(s) will be hosted as <a href="proxy.php?url=https://aws.amazon.com/blogs/apn/serverless-containers-are-the-future-of-container-infrastructure/" rel="noopener noreferrer">Serverless Containers</a> with <a href="proxy.php?url=https://aws.amazon.com/fargate/" rel="noopener noreferrer">AWS Fargate</a>
</li>
<li>For our Front End we will use <a href="proxy.php?url=https://react.dev/learn/describing-the-ui" rel="noopener noreferrer">React</a> and <a href="proxy.php?url=https://www.typescriptlang.org" rel="noopener noreferrer">TypeScript</a>
</li>
</li>
<li>To serve our Front End we will rely on <a href="proxy.php?url=https://nextjs.org/learn/foundations/about-nextjs" rel="noopener noreferrer">NextJS</a> </li>
<li>Our background processing and services are built mostly with <a href="proxy.php?url=https://dotnet.microsoft.com/en-us/apps/aspnet/apis" rel="noopener noreferrer">C#</a> or NextJS/TypeScript. When needed, we will consolidate services via <a href="proxy.php?url=https://graphql.org" rel="noopener noreferrer">GraphQL</a>.</li>
</ul>
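<p>The first bullet can be sketched in a few lines of TypeScript: a path-prefix lookup of the kind an edge routing function performs to decide which microfrontend (or the legacy webshop) should serve a request. The route table, origin names and helper below are hypothetical, not Coolblue's actual configuration:</p>

```typescript
// Hypothetical route table mapping URL path prefixes to origins.
const routes: Array<{ prefix: string; origin: string }> = [
  { prefix: "/product", origin: "product-mfe.example.internal" },
  { prefix: "/checkout", origin: "checkout-mfe.example.internal" },
];
const fallbackOrigin = "webshop-monolith.example.internal";

// An edge function (e.g. a Lambda@Edge origin-request handler) could use
// something like this to choose the origin before the request is forwarded.
function resolveOrigin(uri: string): string {
  const match = routes.find((route) => uri.startsWith(route.prefix));
  return match ? match.origin : fallbackOrigin;
}
```

<p>Unmatched paths fall through to the existing monolith, which is what lets pages migrate to the new stack one route at a time.</p>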
nextjs
webdev
react
technology
-
Systems Thinking and Technical Debt
Aman Agrawal
Tue, 04 Apr 2023 07:00:00 +0000
https://dev.to/coolblue/systems-thinking-and-technical-debt-4d0o
https://dev.to/coolblue/systems-thinking-and-technical-debt-4d0o
<p>I repeatedly see business stakeholders and software engineers struggle to see eye to eye on matters of technical debt, despite the fact that both are impacted by it. I attribute this to the fact that the two camps speak different languages, and over the last 15-some-odd years I haven't found a silver bullet that gets 100% alignment. Engineers are driven by:</p>
<ul>
<li>Code complexity, maintainability and understandability</li>
<li>Making architecture more fault tolerant, resilient and quickly recoverable from outages</li>
<li>Keeping up with technological changes/staying on the cutting edge</li>
<li>Innate desire to improve software systems and not letting them rot</li>
</ul>
<p>Business folks are driven by:</p>
<ul>
<li>Investment vs return on that investment</li>
<li>Financial savings/profit</li>
<li>Time to market</li>
<li>Legal liability/other risk</li>
<li>Short term thinking and focus on features as opposed to long term outcomes</li>
</ul>
<p>It's like two people arguing with each other where each speaks a language the other doesn't understand! That's never going to work! This InfoQ <a href="proxy.php?url=https://www.infoq.com/articles/communicating-engineering-work-business/">article</a> looks under the hood of this communication gap between the two parties in more detail and makes some good recommendations; it's worth checking out.</p>
<p>The other problem that hinders alignment is the lack of a holistic understanding of how technical debt affects, or is affected by, business drivers, and of some way to visualise it. You can often sense this lack of understanding when a manager says, <em>"I don't really see the business value in addressing this technical debt; right now we have critical functional work to do, can we do this tech thingy later?"</em>. In this post I will use a simplified <a href="proxy.php?url=https://en.wikipedia.org/wiki/Systems_thinking#:~:text=Systems%20thinking%20is%20a%20way,complex%20contexts%2C%20enabling%20systems%20change.">Systems Thinking</a> modelling language to put technical debt in the larger organisational context, with the hope that it will make some sense to everyone.</p>
<h2>
Using Systems Thinking to Put Technical Debt in Context
</h2>
<p>I am going to take a crack at it by drawing a systems model using digital post-its connected by arrows (what else?). The post-its represent variables that can increase or decrease; a green arrow means a change in one variable results in a corresponding increase in another variable, and (introduced later) a red arrow means a change in one variable triggers a decrease in another.</p>
<p><strong><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--qj0VRxef--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/26a0.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--qj0VRxef--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/26a0.png" alt="⚠" width="72" height="72"></a>DISCLAIMER</strong>: these models are abstractions of real-life systems, so they are not meant to be 100% accurate but a useful approximation to help make sense of the complexities involved and connect them to the other parts of the organisational system.</p>
<p>For these models, I am going to use the following variables:</p>
<ul>
<li>Number of business problems to solve/solved</li>
<li>Amount of business value created (somewhat abstract, but let's say it's a measure of the usefulness of the solutions that help improve business outcomes)</li>
<li>Business success (EBITDA/revenue, new investment and expansions, new customer journeys, number of customers signed up, number of repeat customers, NPS what have you)</li>
<li>Business pressures (slow down in business success metrics creates pressure to do more)</li>
<li>Market forces (pandemic, war, supply chain issues, competitor action, economic turbulence etc)</li>
<li>Internal dynamics (org politics, reorganisation and restructuring, cost cutting, lawsuits, etc.). Along with market forces, this generally tends to push down an organisation's success.</li>
<li>Engineering velocity (roughly speaking, number of value add ideas productionised per cycle)</li>
<li>Engineering compromises (the number of shortcuts we take whilst productionising ideas)</li>
<li>Technical debt (well, I guess I don't need to explain this, or do I? <img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--zCyXRrdx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f642.png" alt="🙂" width="72" height="72">)</li>
<li>Engineer motivation and trust (mostly abstract, but I guess WTFs per minute can be a good metric <img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--fumfYCPq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f609.png" alt="😉" width="72" height="72">. In seriousness though, this erodes over time and can often be sensed when people abruptly leave, stop caring, or become very frustrated and challenging members of the team.)</li>
</ul>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/04/image-3.png"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--15TT3r3p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image-3.png%3Fw%3D389" alt="" width="389" height="204"></a></p>
<p>For the first diagram I am going to assume a perfect world where an organisation keeps going from strength to strength forever, and the engineering velocity keeps growing in tandem as well:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/04/image-2.png"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--2FYhIhXk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image-2.png%3Fw%3D1024" alt="" width="880" height="602"></a><br>
<em>In a perfect world, business success and engineering velocity will continue to increase infinitely</em></p>
<p>Business opportunities generate business problems to be solved. The more problems we solve, the more business value we generate, and the more the business succeeds. This means the pressure to succeed increases in the form of new revenue streams, new opportunities and new value streams, which places ever more demand on engineering velocity, which responds by solving these challenges and generating more business value in turn. The cycle just continues, resulting in an infinitely successful business and infinitely high engineering velocity with no technical debt whatsoever; it's essentially a runaway <em>positive feedback loop</em> in systems thinking terminology. Of course, this is living in Harry Potter land, with no relation to reality whatsoever! So let's descend to reality, shall we?</p>
<p>In her book <a href="proxy.php?url=https://www.amazon.com/Thinking-Systems-Donella-H-Meadows/dp/1603580557/ref=sr_1_2?crid=2G0VRLGPP9CI5&keywords=systems+thinking&qid=1680205424&sprefix=systems+thinking%2Caps%2C219&sr=8-2">Thinking in Systems</a>, Donella Meadows observes that:</p>
<blockquote>
<p>no physical system can grow forever in a finite environment.</p>
<p><cite>Meadows, Donella H.. Thinking in Systems</cite></p>
</blockquote>
<p>This is because an uncontrollably growing system will eventually tend towards instability and crash (the 2008 financial crisis is a glowing example of this runaway positive feedback loop; or, ever tried to bend a thin metal strip back and forth repeatedly until it snaps? Obvious, right?). In light of this constraint, we can see that our model is missing other variables that serve to constrain the system so it doesn't become a victim of its runaway success (or failure). So what would the picture look like with all these variables plugged in?</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/04/image-1.png"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--G88R1SXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image-1.png%3Fw%3D1024" alt="" width="880" height="740"></a><br>
<em>In a more realistic world, we need other variables that constrain the system</em></p>
<p>Suddenly the complexity explodes!</p>
<p>As we solve more business problems, we add more value and the business succeeds more, which increases the pressure for sustained success, because… let's not get complacent, yes? Internal dynamics such as reorganisation and politics, and market forces such as pandemics, competitor actions and societal upheavals like wars and high inflation, push <em>against</em> business success. This creates even more business pressure to succeed, and pressure to increase engineering velocity to reduce time to market and gain competitive advantage.</p>
<p>Up to a certain point the velocity will grow organically, but since no system can grow forever, the high demand on engineering velocity will eventually result in more and more engineering compromises and shortcuts. These in turn increase the accumulated technical debt, which initially boosts velocity, but after enough of these iterations it starts to wear down engineer motivation and trust in the system and the team, as they struggle with past engineering compromises and, in the race to deliver faster, end up adding new compromises and debt on top of the existing ones. This also increases the maintenance cost of the software, and eventually it starts to slow down the engineering velocity. This means fewer business problems get solved: more of the org's investment in engineering goes towards just struggling with the technical debt rather than adding new value. This in turn results in that much less value being created overall, which will eventually start to reduce business success.</p>
<p>If left unchecked (in some cases this does happen), this cycle can also become a runaway feedback loop in the negative direction, where an org's engineering capabilities actively hinder its success rather than enhance it. This erosion of value creation doesn't happen overnight; it can take a long time (often years) to build up, but in the end it's as if the org is paying its engineering teams to actively sabotage it. That's horrifying! But since no system can grow or shrink forever, interventions will eventually be made to salvage the situation, which inevitably leads to <strong>Big Bang Rewrites</strong> of all the "legacy" systems. This creates its own problems (not represented in the diagram): for example, the time and money cost of the rewrite further erodes the business value proposition of the system, because no value is created until the first version of the "new" system goes live, leading to less business success, increasing costs and increasing management pressure to deliver successfully <em>this</em> time around. But since we want to go fast, we take shortcuts and make compromises, which starts the vicious cycle all over again, just in the "new" system.</p>
<p>Can this be considered a smart business strategy?</p>
<p>In systems of comparable complexity, <strong>refactoring</strong> a system gradually towards health and improved design is generally a lower-risk, faster-return investment than <strong>rewriting</strong> it from scratch (though in some cases the opposite is true). This is because much of the original investment (and knowledge) remains valid and can be preserved, and you are not rushed to finish the work for fear of blocking the creation of new value. The old and the new (if refactored carefully) can happily co-exist, with every iteration not only creating new value now but also reducing the technical debt we've accumulated along the way. The old can then eventually be decommissioned.</p>
<p>But I digress a bit. So what's the solution to minimise the runaway negative loop then? We don't want to pack down our business just because there are constraining forces at work, yes? How do we create a harmonious balance between the short-term advantage of debt and the long-term stability and resilience of the system? In finance, the bank's enforcement agents or government penalties will solve that problem for you real quick, but unfortunately for most enterprise software engineering, we don't have that level of "encouragement".</p>
<p>…</p>
<p>How about <strong>Engineering Discipline</strong>? Sounds obvious, but how does it fit in the model? Let's see:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/04/image.png"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--elUDUsn0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image.png%3Fw%3D1024" alt="" width="880" height="717"></a><br>
<em>Introducing a little bit of discipline (top right corner-ish) can bring back some stability over time</em></p>
<p>When the velocity starts to drop and more engineering compromises are made to increase it (see the irony here?), engineering discipline can act as a compensating driver, ensuring that we reduce previous compromises before we add new value each cycle. This gradually increases engineer motivation and trust, as they no longer need to struggle with bad decisions as much, and that in turn increases engineering velocity and value creation over time. The increase in velocity may not come by leaps and bounds, and not right away, but at least it's likely not to fall too low in the face of ever-increasing business pressures, and the system won't spiral into madness.</p>
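<p>To make the loop concrete, here is a deliberately crude toy simulation. The coefficients and the <code>simulate</code> function are made up purely for illustration (they are not taken from the article's diagrams): each cycle, the gap between a fixed delivery target and current velocity produces shortcuts; shortcuts accumulate as debt; debt drags velocity down; and discipline spends a fraction of each cycle's capacity paying debt down first.</p>

```python
def simulate(cycles: int, discipline: float) -> list:
    """Toy causal-loop model. `discipline` is the fraction of each cycle's
    capacity spent reducing past compromises before adding new value."""
    velocity, debt, history = 10.0, 0.0, []
    for _ in range(cycles):
        shortcuts = max(0.0, 12.0 - velocity)       # pressure to hit a fixed target
        debt = max(0.0, debt + shortcuts - discipline * velocity)
        velocity = 10.0 / (1.0 + 0.1 * debt)        # debt drags delivery speed down
        history.append(velocity)
    return history
```

<p>With <code>discipline=0.0</code> the velocity decays cycle after cycle as debt compounds; with a modest <code>discipline=0.3</code> the debt is paid down as fast as it accrues and velocity stays stable, which is exactly the compensating effect described above.</p>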
<p>Like I said before, the model is not perfect (no model is), but I think (and hope) it helps put the complexities involved in perspective for both engineers and business stakeholders, and clarifies the effect that long-term accumulation of technical debt can have on business outcomes.</p>
<p><strong>Engineering Discipline</strong> is critical to bringing the system back into equilibrium, and this is why it's important for engineering teams to take control and ownership of this discipline and be proactively on the lookout for variable changes that tend to push the system towards instability. We don't need permission to do the right thing; we have the engineering expertise and experience to know when and how the right thing should be done, because we also understand the long-term implications of neglect. Though we do need to communicate these implications to the business in a language they understand, to the extent that's feasible and possible.</p>
<p>If you have tried similar tools to communicate the value of addressing technical debt, or if you think this model could be made more convincing or more "correct", please drop a comment! Cheers!</p>
systemsthinking
communication
technicaldebt
models
-
Accessibility
Stef van Hooijdonk
Mon, 05 Dec 2022 09:12:00 +0000
https://dev.to/coolblue/accessibility-4k73
https://dev.to/coolblue/accessibility-4k73
<p>Web accessibility, a.k.a. a11y, refers to the universal ability of different users and devices to access the content and features of a website, regardless of physical or cognitive ability.</p>
<p>In other words, web accessibility ensures that everyone can successfully use a website, including users who are blind, color blind, deaf or hard of hearing, as well as users who have difficulty using their hands or have other disabilities.</p>
<h3>
What does it mean for Coolblue
</h3>
<p>We want to adhere to the <a href="proxy.php?url=https://www.w3.org/TR/WCAG21/" rel="noopener noreferrer">W3Câs Web Content Accessibility Guidelines (WCAG) 2.1</a></p>
<p>You can find a comprehensive checklist in Coolblue design system guidelines (internal).</p>
<h3>
Definition of done
</h3>
<p>Accessibility is considered to be part of your team's definition of done.</p>
technology
development
web
a11y
-
Security
Stef van Hooijdonk
Thu, 01 Dec 2022 09:30:00 +0000
https://dev.to/coolblue/security-4ab6
https://dev.to/coolblue/security-4ab6
<p>With regards to security, it is always better to reuse proven methods than to reinvent the wheel. Therefore these principles are based on the best practices used by the <a href="proxy.php?url=https://infosec.mozilla.org/fundamentals/security_principles.html" rel="noopener noreferrer">Mozilla Foundation</a>. Where applicable, these have been adapted or expanded to align with the other Coolblue Principles and Core Values.</p>
<blockquote>
<p>The "do" and "do not" used in this document are examples of controls or implementations of these principles, but do not represent an exhaustive list of possibilities. When in doubt, verify whether your application, service or product aligns with the goal of the principles.</p>
</blockquote>
<h3>
Least Privilege
</h3>
<p>Do not expose unnecessary services</p>
<p>Goal: Limiting the number of reachable or usable services to the necessary minimum.</p>
<p><strong>Do</strong></p>
<ul>
<li>List all services presented to the network (Internet and Intranets). Justify the presence of each port or service.</li>
</ul>
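<p>As a minimal sketch of the "list all services presented to the network" step, the snippet below probes a host for listening TCP ports using only the standard library. It is an illustration, not a replacement for a proper inventory tool such as <code>ss</code>, <code>nmap</code> or your cloud provider's config auditing; every port it reports should have a justified owner.</p>

```python
import socket

def open_ports(host: str, ports: range, timeout: float = 0.2) -> list:
    """Return the TCP ports in `ports` that accept a connection on `host`.
    Each hit is a service someone must be able to justify."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:  # 0 means the connect succeeded
                found.append(port)
    return found
```

<p>Running this against your own hosts from both the Internet and the intranet makes the difference between "exposed" and "intended" visible at a glance.</p>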
<p><strong>Do not</strong></p>
<ul>
<li>OpenSSH Server (sshd) is running but no users ever login.</li>
<li>A web-application has a web accessible administration interface, but it is not used.</li>
<li>A database server (SQL) allows connections from any machine in the same VLAN, even though only a single machine needs to access it.</li>
<li>The administration login panel of the network switch for the office network is accessible by users of the office network.</li>
</ul>
<h3>
Do not grant or retain permissions that are no longer needed
</h3>
<p>Goal: Expire user access to data or services when users no longer need them.</p>
<p><strong>Do</strong></p>
<ul>
<li>Use role-based access control (allows for easy granular escalation of privileges, only when necessary)</li>
<li>Expire access automatically when unused.</li>
<li>Automatically disable API keys after not having been used for a given period of time and notify the user.</li>
<li>Use different accounts for different role types (admin, developer, user, etc.) when no good role-based access control is available.</li>
<li>Routinely review userâs access permissions to ensure theyâre still needed.</li>
</ul>
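<p>The "expire access automatically when unused" and "disable API keys after a period of inactivity" items above can be sketched as a simple sweep over last-used timestamps. The function name and the 90-day default are illustrative assumptions, not a Coolblue standard:</p>

```python
from datetime import datetime, timedelta

def stale_keys(last_used: dict, now: datetime,
               max_idle: timedelta = timedelta(days=90)) -> list:
    """Return IDs of API keys not used within `max_idle`; a scheduled job
    would disable these and notify their owners."""
    return sorted(k for k, t in last_used.items() if now - t > max_idle)
```

<p>Run on a schedule, this turns "routinely review access" from a calendar reminder into an automated control.</p>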
<p><strong>Do not</strong></p>
<ul>
<li>Grant global root access (e.g. via âsudoâ) for all operation engineers on all systems.</li>
<li>Give access âjust in caseâ.</li>
<li>Retain access to services that you no longer use.</li>
</ul>
<h3>
Defense in Depth
</h3>
<p>Do not allow lateral movement</p>
<p>Goal: Make it difficult or impossible for an attacker to move from one host in the network to another host.</p>
<p><strong>Do</strong></p>
<ul>
<li>Prevent inbound network access to services on a host from clients that do not need access to the service through either host-based firewall rules, network firewall rules/AWS security groups, or both (which is preferred).</li>
<li>Clearly enforce which teams have access to which set of systems.</li>
<li>Alert on network flows being established between different services.</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Allow inbound OpenSSH, RDP connections from any host on any network.</li>
<li>Run unpatched container management services (e.g. Docker) or kernels which allow a user in one container to escape the container and affect other containers on the same host.</li>
</ul>
<h3>
Isolate environments
</h3>
<p>Goal: Separating infrastructure and services from each other in order to limit the impact of a security breach.</p>
<p><strong>Do</strong></p>
<ul>
<li>In cases where two distinct systems are used to govern access or authorization (e.g. AD and Okta), ensure that no single user or role has administrative permissions across both systems.</li>
<li>Use separate sets of credentials for different environments.</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Have system administrators with access to every system/every service.</li>
<li>Establish service users with access to multiple services.</li>
<li>Allow tools to remotely execute code on systems from a centralized location (a single Puppet Master, Ansible Tower, Nagios, etc. instance) across multiple services.</li>
<li>Re-use functionality across services when not required (such as sharing load balancers, databases, etc.)</li>
</ul>
<h3>
Patch Systems
</h3>
<p>Goal: Ensuring systems and software do not retain vulnerabilities as these are discovered in software over time.</p>
<p><strong>Do</strong></p>
<ul>
<li>Establish regular recurring maintenance windows in which to patch software.</li>
<li>Ensure individual systems can be turned off and back on without affecting service availability.</li>
<li>Enable automatic patching where possible.</li>
<li>Check web application libraries and dependencies for vulnerabilities.</li>
</ul>
<h3>
Meet Web Standards
</h3>
<p>Goal: Reduce exposure to web attacks by following the web security standards.</p>
<p><strong>Do</strong></p>
<ul>
<li>Achieve A or higher on Mozilla's Observatory. (A deviation from the Mozilla standards, which require a B+ or higher rating.)</li>
<li>Follow the Web Security Standards.</li>
</ul>
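<p>A large part of an Observatory-style grade comes down to response headers. The sketch below checks a response against an illustrative subset of the headers such scans look for; the constant name and the exact header set are my own choices for the example, not the full Observatory rubric:</p>

```python
# An illustrative subset of the headers Observatory-style scans check for.
REQUIRED_HEADERS = {
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
}

def missing_security_headers(response_headers: dict) -> set:
    """Compare a response's headers (any casing) against the required set."""
    present = {name.title() for name in response_headers}
    return REQUIRED_HEADERS - present
```

<p>Wiring this into a smoke test keeps a deploy from silently dropping a header and downgrading the site's rating.</p>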
<h3>
Guarantee data integrity and confidentiality
</h3>
<p>Goal: Ensuring data confidentiality, integrity, and authenticity is respected throughout its lifecycle.</p>
<p>Details on confidentiality, integrity & availability can be found below under Explanations & Rationales.</p>
<p><strong>Do</strong></p>
<ul>
<li>Use full-disk encryption where available on systems without physical security (laptops and mobile phones).</li>
<li>Encrypt credentials storage databases (Ansible Vault, Credstash, etc.)</li>
<li>Encrypt data in transit with TLS (during transmission).</li>
<li>Also encrypt data in transit inside the internal network.</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Terminate TLS (e.g. with a reverse proxy or load balancer) outside a system and then transmit the data in clear-text across the rest of the network.</li>
<li>Use STARTTLS without also disabling clear-text connections.</li>
</ul>
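<p>For the "encrypt data in transit" items above, Python's standard library already gives a safe starting point. The helper below is a sketch of a client-side TLS context that keeps certificate and hostname verification on and refuses pre-1.2 protocol versions; the function name is mine, the <code>ssl</code> calls are standard:</p>

```python
import ssl

def strict_tls_context() -> ssl.SSLContext:
    """Client-side context: certificate and hostname verification stay on
    (the create_default_context defaults), legacy TLS versions are refused."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

<p>The key point is to never loosen the defaults (e.g. by setting <code>verify_mode = ssl.CERT_NONE</code>) for internal traffic: the "also encrypt inside the internal network" rule means internal hops get the same context as external ones.</p>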
<h3>
Know Thy System
</h3>
<p>Fraud detection and forensics<br>
Goal: Inspect events in real-time in order to alert on suspicious behavior, and store system behavior information in order to retrace actions after a security breach.</p>
<p><strong>Do</strong></p>
<ul>
<li>Audit and log system calls (e.g. with auditd or Windows Audit) made by processes when running in an operating system you control (e.g. not AWS Lambda)</li>
<li>Send logs off the account or system (e.g. AWS CloudTrail, system logs, etc.) outside of the account or system (different AWS account, MozDef, Papertrail, etc.)</li>
<li>Detect and alert on anomalous changes.</li>
</ul>
<h3>
Are you at risk?
</h3>
<p>Goal: Assessing how exposed you are to danger, harm or loss.</p>
<p><strong>Do</strong></p>
<ul>
<li>Run Rapid Risk Assessments (RRA) for your services. </li>
<li>Estimate what would be the impact if your service was compromised.</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Think it will never happen to you.</li>
</ul>
<h3>
Inventory the Landscape
</h3>
<p>Goal: Provide an accurate, maintained catalog, or system of records for all assets.</p>
<p><strong>Do</strong></p>
<ul>
<li>Keep an inventory of services and service owners.</li>
<li>Keep an inventory of machines (e.g. ServiceNow, AWS Config, Infoblox, etc.) which is updated automatically.</li>
<li>Ensure that the inventory contains IP addresses of systems in particular when using IPv6 (which cannot realistically be scanned).</li>
<li>Never rely upon security through obscurity</li>
</ul>
<p><strong>Coolblue addition to Mozilla principles</strong></p>
<h3>
No security by obscurity
</h3>
<p>Goal: To prevent substituting secrecy for real security.</p>
<p><strong>Do</strong></p>
<ul>
<li>Always assume an attacker with perfect knowledge</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Rely on trust as a security measure</li>
</ul>
<h3>
KISS - Keep It Simple and thus Secure
</h3>
<p>Goal: KISS comes from "Keep It Simple, Stupid". You can only secure a system that you completely understand.</p>
<p><strong>Do</strong></p>
<ul>
<li>Keep things simple. Prefer simplicity over a complex and specific architecture.</li>
<li>Ensure others can understand the design.</li>
<li>Use standardized tooling that others already know how to use.</li>
<li>Draw high-level data flow diagrams.</li>
<li>See also Code Clean & Simple.</li>
</ul>
<h2>
Authentication and authorization
</h2>
<h3>
Require two-factor authentication
</h3>
<p>Goal: Require 2FA (or MFA) on all services, internal or external, to prevent attackers from reusing or guessing a single credential such as a password.</p>
<p>MFA (multi-factor authentication, also called 2FA for two factors) is a method of confirming a user's claimed identity by utilising a combination of two different components, such as something you know (a password) and something you have (a phone).</p>
<p><strong>Do</strong></p>
<ul>
<li>Use an SSO (Single Sign-On) solution with MFA.</li>
<li>For services that cannot support SSO, use the service's individual MFA features (e.g. GitHub).</li>
<li>Servers carrying secrets or widespread access (or any other potentially sensitive data) should verify the user's identity end to end, such as by prompting for an additional MFA verification when connecting to the server, even when behind a bastion host.</li>
</ul>
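<p>The "something you have" factor in most MFA apps is a TOTP code. As background, here is a minimal standard-library implementation of RFC 6238 (HMAC-SHA1 over the 30-second time-step counter with dynamic truncation); it is a sketch for understanding, not what any particular SSO product runs:</p>

```python
import hashlib
import hmac
import struct
import time
from typing import Optional

def totp(secret: bytes, for_time: Optional[int] = None,
         step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HMAC-SHA1 over the time-step counter, dynamically
    truncated to a short decimal code."""
    if for_time is None:
        for_time = int(time.time())
    counter = struct.pack(">Q", for_time // step)       # 8-byte big-endian
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                          # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

<p>Because both sides derive the code from a shared secret plus the clock, nothing secret crosses the wire at login time; only the short-lived code does.</p>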
<h3>
Use central identity management (Single Sign-On)
</h3>
<p>Goal: Minimize credential theft and identity mismanagement by limiting the handling of user credentials (passwords, MFA) to a set of dedicated systems.</p>
<p><strong>Do</strong></p>
<ul>
<li>Use an SSO (Single Sign-On) solution that authenticates user credentials on your service's behalf. Within Coolblue we strongly encourage the usage of SAML.</li>
<li>Servers update their user sessions from the SSO systems regularly to ensure the user is still active and valid.</li>
<li>Use authorization (e.g. group membership) data from the SSO system (possibly in addition to your own authorization data).</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Accept, process, transmit or store user credentials (passwords, OTPs, keys, etc.). Let the authentication server handle that data directly.</li>
<li>Use direct LDAP authentication for users.</li>
</ul>
<p>More information about Centralized User Account Management can be found below in the Explanations & Rationales section.</p>
<h3>
Require strong authentication
</h3>
<p>Goal: Use credential-based authentication and user session management to grant access.</p>
<p>More information about Shared Passwords & Password Reuse can be found below in the Explanations & Rationales section.</p>
<p><strong>Do</strong></p>
<ul>
<li>Use credential-based authentication and user session management where the session information is passed by the user. More info.</li>
<li>Use API keys for service authentication.</li>
<li>Prefer using asymmetric API keys with request signing (e.g. x509 client certificates, AWS Signature) over symmetric API keys (e.g. HTTP header) where possible.</li>
<li>Ensure that API keys can be automatically rotated in the case of a data leak.</li>
<li>Use a password manager to store distinct passwords for each service a user accesses.</li>
<li>Use purpose-built credential sharing mechanisms when sharing is required (1password for teams, LastPass, etc.)</li>
</ul>
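<p>The "API keys with request signing" item above can be sketched with a symmetric HMAC signature over the parts of a request that must not change in transit. The message layout (method, path, body joined by newlines) is a simplified stand-in for real schemes like AWS Signature, which canonicalise far more of the request:</p>

```python
import hashlib
import hmac

def sign_request(secret: bytes, method: str, path: str, body: bytes) -> str:
    """Sign the request parts that must not be tampered with."""
    message = b"\n".join([method.encode(), path.encode(), body])
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify_request(secret: bytes, method: str, path: str, body: bytes,
                   signature: str) -> bool:
    """Constant-time comparison, so the check itself leaks no timing info."""
    expected = sign_request(secret, method, path, body)
    return hmac.compare_digest(expected, signature)
```

<p>A tampered body produces a different signature, so the server can reject the request without ever seeing the client's key on the wire.</p>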
<p><strong>Do not</strong></p>
<ul>
<li>Use easy to guess passwords or vendor default passwords.</li>
<li>Send your password to other individuals.</li>
<li>Send shared passwords over email or communication mediums other than purpose-built credential sharing mechanisms.</li>
<li>Use the same password for multiple services.</li>
<li>Trust traffic from a certain network address.</li>
<li>Rely on VLANs or AWS VPCs to indicate requests are safe.</li>
<li>Use IP ACLs as replacement for authentication.</li>
<li>Trust the office network for access to devices.</li>
<li>Use TCP Wrapper for access control.</li>
<li>Use machine API keys for user authentication.</li>
<li>Use user credentials for machine authentication.</li>
<li>Store API keys on devices that are not physically secure (e.g. laptops or mobile phones)</li>
<li>Always verify, never trust</li>
</ul>
<p><strong>Coolblue addition to Mozilla principles</strong></p>
<h3>
Zero Trust
</h3>
<p>Goal: Many security problems are caused by inserting malicious intermediaries into communication paths. Zero trust applies to all actors and IAAA (Identification, Authentication, Authorization and Accountability).</p>
<p><strong>Do</strong></p>
<ul>
<li>Deny by Default</li>
<li>Authenticate every transaction</li>
<li>Use allow lists instead of block lists</li>
</ul>
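<p>"Deny by default" and "use allow lists instead of block lists" boil down to one pattern: authorization passes only for things explicitly listed. A deliberately tiny sketch (the role names are made up for illustration):</p>

```python
ALLOWED_ROLES = {"supply-planner", "buyer"}  # allow list: anything absent is denied

def authorize(role: str) -> bool:
    """Deny by default: unknown or misspelled roles never slip through."""
    return role in ALLOWED_ROLES
```

<p>The inverse (a block list) fails open whenever a new role appears, which is exactly what zero trust forbids.</p>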
<blockquote>
<p>NOTE: The "do" and "do not" used in this document are examples of controls or implementations of the principles, but do not represent an exhaustive list of possibilities. When in doubt, verify whether your application, service or product aligns with the goal of the principles. Suggestions for do's and don'ts can be sent to <a href="proxy.php?url=mailto:[email protected]">[email protected]</a>.</p>
</blockquote>
<h2>
Explanations & Rationales
</h2>
<h3>
Confidentiality
</h3>
<p>The confidentiality of the data depends on the type of data that needs to be protected. Personally Identifiable Information (PII) such as names, addresses or e-mail addresses needs a higher level of protection than publicly available data (e.g. product information). Within Coolblue we use 4 levels of confidentiality:</p>
<p><strong>Secret</strong> data should only be accessible by a very limited number of users. Examples of secret data are admin/root passwords, business strategy plans, API keys and our password databases such as Active Directory or PasswordState.</p>
<p><strong>Confidential</strong> data should only be viewed by authorized persons. Examples of confidential data are Personally Identifiable Information, order history and contracts.</p>
<p><strong>Restricted</strong> data can be freely shared within the company, but not outside it. Examples of restricted data are the VVV, internal processes and demos.</p>
<p><strong>Public</strong> data can be accessed by the whole world. Examples of public data are the product information on our website, marketing ads and our vacancies.</p>
<h3>
Integrity
</h3>
<p>The integrity of data is the assurance of accuracy and consistency of the data over its entire lifecycle. A high level of integrity indicates that changes to this data should be verified and the correctness is very important. A low level of integrity indicates that changes to this data don't matter for the outcome of a process. Within Coolblue we use 3 levels of Integrity:</p>
<p><strong>High Integrity</strong> data should always be subject to a 4-eye principle before a change can be made. If technically feasible, integrity checks (e.g. hashes or checksums) should be implemented to verify the data after a transaction has taken place. Examples of high integrity data are our source code, product pricing and financial reporting data.</p>
<p><strong>Medium Integrity</strong> data should only be changed by an authorized person. An authentication mechanism should be in place before changes to this type of data can be made. Examples of medium integrity data are our processes, Google Drive documents and data on our website.</p>
<p><strong>Low Integrity</strong> data is data for which it doesn't really matter if it is changed by an unauthorized person. Examples are an order list for coffee for your team or other volatile information.</p>
<p>It is important to state that Data Integrity is not the same as Data Quality. Data quality in general refers to whether data is useful. Data integrity, by contrast, refers to whether data is trustworthy.</p>
<h3>
Availability
</h3>
<p>The availability of data is important to ensure timely and reliable access to information and systems. For example, the Coolblue website requires high availability because our customers want to be able to place an order 24/7. Next to that, we want to keep our promise of next-day delivery, so the systems & data needed for these processes also need high availability.</p>
<p>To determine the required availability of data or systems, the following parameters need to be determined:</p>
<ul>
<li><p>Recovery Point Objective (RPO) - the amount of data, as a measure of time, we are willing to lose during a recovery event.</p></li>
<li><p>Maximum Tolerable Downtime (MTD) - the amount of time we can be without the unavailable asset before we have to declare a disaster and put our DR plan into effect.</p></li>
<li><p>Recovery Time Objective (RTO) - the earliest possible time by which we can restore the asset to full functionality, if everything goes as planned and nothing else goes wrong.</p></li>
<li><p>Work Recovery Time (WRT) - the maximum amount of time needed to verify that files, services or processes have been restored correctly and are available for use.</p></li>
</ul>
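<p>These four parameters relate by simple arithmetic: the time to restore (RTO) plus the time to verify the restore (WRT) must fit inside the maximum tolerable downtime (MTD). A one-line check, with the helper name chosen for this example:</p>

```python
from datetime import timedelta

def within_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """Full recovery (restore + verification) must fit inside the MTD."""
    return rto + wrt <= mtd
```

<p>For instance, an RTO of 2 hours and a WRT of 1 hour are only acceptable if the business can tolerate at least 3 hours of downtime.</p>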
<p>Because an image is better than a thousand (or in this case 107) words:</p>
<p>It is important to be very critical when determining these parameters. Does the system really need to be restored within 2 hours, or can business be resumed without the system being fully functional (e.g. in a limited capacity or with manual workarounds)?</p>
<p>Based on the outcome of these parameters, measures should be taken to ensure that in the case of a disaster the determined timelines can be met.</p>
<h3>
Multi-Factor Authentication (MFA)
</h3>
<p>Multi-factor authentication (MFA) is a security system that requires more than one method of authentication from independent categories of credentials to verify the user's identity for a login or other transaction.</p>
<p>Requiring the use of MFA for internet accessible endpoints is encouraged because by requiring not only something the user knows (a knowledge factor like a memorized password) but also something that the user has (a possession factor like a smartcard, yubikey or mobile phone) the field of threat actors that could compromise the account is reduced to actors with physical access to the user.</p>
<p>In cases where the possession factor is digital (a secret stored in your mobile phone) instead of physical (a smartcard or yubikey), the effect of MFA is not to reduce the field of threat actors to only those that have physical access to the user, because a secret can be remotely copied off of a compromised mobile phone. Instead, in this case, the possession factor merely makes it more difficult for the threat actor since they now need to brute force/guess your password and compromise your mobile phone. This is, however, still possible to do entirely from a remote location. In particular, storing both first and second factor on the same device (for example: mobile phone) is strongly discouraged.</p>
<h3>
Shared Passwords & Accounts
</h3>
<p>Shared passwords are passwords and/or accounts that more than one person knows or has access to.</p>
<p>Usage of these types of accounts is discouraged because they make auditing access difficult:</p>
<ul>
<li>Multiple users appear in audit logs as one user, and different users' actions are difficult to differentiate.</li>
<li>The number of audit logs that need to be searched increases.</li>
<li>Correlation of events across different systems is impossible if multiple people are creating event records with a single shared account across multiple systems at the same time.</li>
</ul>
<p>Furthermore, revoking access to a subset of the users of a shared password requires a password change that affects all users.</p>
<h3>
Password Reuse
</h3>
<p>Password reuse is the practice of a single user using the same password across multiple different accounts/sites. This is contrasted with creating a different, distinct password for every account/site. Users often employ hybrid forms of password reuse like:</p>
<ul>
<li>Using the same password for a class of accounts/sites, for example, using one single password for multiple high value financial accounts, but a different single password for multiple low value forums and wikis.</li>
<li>Using a consistent, reproducible method of password generation for each site, for example, every account/site has a password which begins with the same characters and ends with the name of the site ("rosebud0facebook", "rosebud0linkedin")</li>
</ul>
<p>Password reuse is discouraged because:</p>
<p>When a site is compromised by an attacker, the attacker can easily take the user's password that has been reused on other sites and gain access to those other sites. For example, if a user uses the same password on a car forum website as on Facebook, when that car website gets compromised, the attackers can then take over the user's Facebook account.<br>
Unethical administrators of any site where a password is reused may gain access to accounts using the reused password.</p>
<blockquote>
<p>Note that it is dangerous for a user to rely on a site being able to effectively prevent an attacker from obtaining that user's password once an attacker has compromised the site.</p>
</blockquote>
<p>Since it's difficult/impossible for a user to memorize a distinct password for every account/site, a common solution is to use a password manager.</p>
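<p>The core idea of a password manager can be sketched in a few lines of Python: generate a distinct, cryptographically random password per account/site and store it, so no password is ever reused. This is a toy illustration only, not a production password manager (it omits encryption of the vault, for a start):</p>

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate_password(length: int = 20) -> str:
    """Return a cryptographically random password."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

# One distinct password per site: a breach of the car forum
# reveals nothing about the Facebook credential.
vault = {site: generate_password() for site in ("car-forum.example", "facebook.com")}
```

<p>A real password manager additionally encrypts this vault with a single master password, which is then the only password the user has to memorise.</p>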
principles
development
security
-
Surviving Black Friday With Kubernetes
Piotr Zakrzewski
Mon, 28 Nov 2022 16:22:02 +0000
https://dev.to/coolblue/surviving-black-friday-with-kubernetes-2o7o
https://dev.to/coolblue/surviving-black-friday-with-kubernetes-2o7o
<p>One of the most classic use cases for Kubernetes is horizontal auto-scaling: the ability of your system to increase its resources automatically in reaction to higher demand by adding new machines. The need for auto-scaling is not just about reducing the need to accurately project load and the manual work of adjusting hardware. It can also be existential, as in the case of my team at Coolblue (a Dutch web shop present in the Benelux and Germany), when the marketing emails with Black Friday discounts start reaching customers, causing an enormous peak in traffic. For some types of workloads, like optimising delivery routes (a Vehicle Routing Problem with additional constraints, a more complex version of the famous Travelling Salesman Problem), CPU demand grows very fast. It can not only dominate your infrastructure spending with over-provisioned hardware for handling the peak; in the worst-case scenario, without horizontal auto-scaling, you can be surprised by a Black Friday peak bigger than forecasted and experience an outage at the most important time of the year for the business.</p>
<h2>
When do you need auto-scaling?
</h2>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--9ju2ZpXZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ahcsfw4gwylxbk0d1fhy.jpeg" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--9ju2ZpXZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ahcsfw4gwylxbk0d1fhy.jpeg" alt="jaws meme" width="800" height="489"></a></p>
<ul>
<li>Your traffic varies over time </li>
<li>Your traffic is hard to predict or its impact on the CPU and memory is hard to determine</li>
</ul>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--duB_9Fzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lz4p1m7h867qauzc77qr.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--duB_9Fzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lz4p1m7h867qauzc77qr.png" alt="diagram showing annual peaks" width="880" height="230"></a><br>
In the picture above Coolblue fits the right-most chart: Q4 is the busiest season for e-commerce. Sales forecasting provides a good basis for preparation, but may still leave a huge range of possible hardware specs, leading to over-provisioning hardware and, of course, over-paying. Which brings me to the next point.</p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--29_rtCk5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/biyivxxm4qilis61772o.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--29_rtCk5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/biyivxxm4qilis61772o.png" alt="diagram showing daily peaks" width="880" height="260"></a><br>
Load varying through the year is much easier to plan for. A big daily range of load is harder to cope with: just like the annual peak it leads to over-provisioning, but unlike over-provisioning on the scale of a year, you cannot scale down and adapt after each daily peak. Not when your operations are manual, that is. Not to mention that in the scenario with a high daily peak, despite over-provisioning, you are still left vulnerable to an unexpectedly big peak in traffic.</p>
<p>The simplest architecture for most web applications (or their back ends) is a vertically scaled single-host deployment. When you start maxing it out, you bump the specs. That simplicity of course comes at a cost: if it really is just a single host, you need downtime to add more CPU and RAM.</p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--QGtXtr8s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yv7ccw6kzy8ff6ko2qc.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--QGtXtr8s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yv7ccw6kzy8ff6ko2qc.png" alt="Vertical scaling" width="880" height="439"></a> </p>
<h2>
At a certain point there isn't any bigger boat for you
</h2>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--UBRL04vW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b02drz9axn1r5r441n4w.jpeg" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--UBRL04vW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b02drz9axn1r5r441n4w.jpeg" alt="Jaws meme" width="610" height="410"></a> </p>
<p>A more complex architecture that lets you increase capacity without bumping the specs of individual machines (and causing their downtime) is horizontal scaling. In this architecture you have more than one instance of your application, and requests are handled by whichever of those machines is currently available.</p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--SNl-0JWa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhtrh8tsorhnv4x67jdo.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--SNl-0JWa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhtrh8tsorhnv4x67jdo.png" alt="request serving with a stateless application" width="880" height="510"></a></p>
<p>There is a catch: your application must be stateless for this architecture to work this way. If it is stateful, the complexity of routing requests grows further, as you need to ensure the right instance receives its own requests.</p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--XXVLwwEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uaoawt344p4igigviaj1.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--XXVLwwEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uaoawt344p4igigviaj1.png" alt="Diagram presenting stateless application" width="880" height="687"></a></p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--5nCfBgV6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z2j78v9gorqki2ci6zhx.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--5nCfBgV6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z2j78v9gorqki2ci6zhx.png" alt="Diagram presenting stateful application" width="880" height="684"></a></p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--04wuode_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ochobvw5tmbmwnw82b0v.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--04wuode_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ochobvw5tmbmwnw82b0v.png" alt="Diagram presenting requests routed with a load balancer for a stateful application" width="880" height="564"></a></p>
<h2>
Our Delivery Route Optimisation is Stateful and Needs to Scale for Annual and Daily Peaks
</h2>
<p>This is the case for the package delivery route optimisation service at Coolblue, for which my team is responsible. In order to cope with the Black Friday peak we needed to provision a 128-CPU VM which, at the time of writing, is the biggest machine available in GCP in our region that is compatible with our software. We can only go horizontal from here.</p>
<p>Optimising delivery routes, on top of being famously CPU intensive, also leads to stateful architectures, as any optimisation job needs information about all delivery orders, vehicles, shifts and other constraints in memory at every step of the optimisation process. This data is big enough that fetching it from the database on each mutation would reduce system performance, and mutations happen multiple times a second, both as a result of the ongoing optimisation process and of incoming requests to schedule new orders.</p>
<h2>
Are We Ready for Black Friday 2023?
</h2>
<p>This year only one stateless microservice, responsible for calculating travel time between deliveries, runs in Kubernetes. Last year it was the one that gave in during the peak, causing a partial outage; this year it scales by itself under the same circumstances, not only avoiding an outage but also avoiding over-provisioning during quiet times (all our customers are in the same time zone).</p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--VI4Ravjc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpuxiefz3uebkjdzz2wr.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--VI4Ravjc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpuxiefz3uebkjdzz2wr.png" alt="Diagram presenting how hardware adapts to peaks with kubernetes" width="880" height="542"></a></p>
<h2>
How exactly is Kubernetes helping here?
</h2>
<p>For our stateless microservice, the k8s setup is pretty simple and standard. The most important resource types we use:</p>
<ul>
<li>
<code>Deployment</code> to specify the config of the travel-time service itself (env vars, hardware required to run a single instance, etc.)</li>
<li>
<code>Service</code> to specify its internal endpoint where other components can reach it</li>
<li>
<code>HorizontalPodAutoscaler</code> where we specify the minimum and maximum number of replicas (app instances) that can be provisioned and, most importantly, the CPU utilisation threshold that k8s uses to trigger the horizontal scaling</li>
</ul>
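<p>For illustration, a minimal <code>HorizontalPodAutoscaler</code> manifest along these lines could look as follows; the service name, replica bounds and threshold here are made-up examples, not our production values:</p>

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: travel-time-service      # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: travel-time-service    # the Deployment to scale
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

<p>With this in place, k8s adds replicas whenever average CPU utilisation across the pods stays above the target, and scales back down during quiet hours.</p>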
<p>Early next year we will also put our stateful route optimisation application in Kubernetes, removing the ceiling from our capacity and also reducing how much over-provisioning for the daily and annual peaks we need. This work is significantly more complex (due to its stateful nature and custom scaling) and deserves a dedicated article.</p>
kubernetes
operations
devops
gcp
-
The Monolith in the Room
Stef van Hooijdonk
Mon, 21 Nov 2022 09:10:58 +0000
https://dev.to/coolblue/the-monolith-in-the-room-1947
https://dev.to/coolblue/the-monolith-in-the-room-1947
<p>It is very easy to talk about your current systems and code as if they are old and legacy. "That Monolith" is legacy. It is old and the code is written in a way new joiners might not like. Even the coding language or framework might be something not in the top charts at the <a href="proxy.php?url=https://octoverse.github.com/2022/top-programming-languages" rel="noopener noreferrer">Github Octoverse</a> anymore.</p>
<p>But more often than not, these systems, monoliths, are still the moneymaker (âŹâŹ) for your company. The same for us at <a href="proxy.php?url=https://www.coolblue.nl" rel="noopener noreferrer">Coolblue</a>.</p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0aahkr6kfi3antn2mm66.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0aahkr6kfi3antn2mm66.png" alt="An elephant" width="800" height="416"></a></p>
<h2>
The problem
</h2>
<p>Let me explain the problem we believe we have with at least one of our monoliths, based on two concerns.</p>
<h3>
Concern: Ownership
</h3>
<p>As written earlier about our vision to place <a href="proxy.php?url=">Guided Ownership</a> as low as possible in the Development teams, having a monolith that you want to improve or that needs maintenance is counterintuitive.</p>
<blockquote>
<p>Shared ownership tends to lead to no ownership</p>
</blockquote>
<p>We are probably not the first department that sees this issue <a href="proxy.php?url=https://www.platohq.com/resources/when-shared-ownership-no-longer-works-829891624" rel="noopener noreferrer">1</a>. </p>
<p>Some practical issues can also arise when working with multiple teams on a single solution, most of which can be addressed with proper release management, good automation and a mature CI/CD platform.</p>
<h3>
Concern: Technology
</h3>
<p>One of our two monoliths was written almost two decades ago. We used Delphi to write an application to handle all aspects of our business processes. </p>
<p><strong>Application</strong><br>
In itself <a href="proxy.php?url=https://www.embarcadero.com/products/rad-studio" rel="noopener noreferrer">Delphi</a> is not the problem, even though its usage in the market is declining. For us the more pressing reason to actively address this monolith is that the application is written as a desktop/Windows application.</p>
<p><strong>Application design</strong><br>
The biggest concern we have is the lack of clear and separated business logic in the application design. Logic resides in either a button click/screen or in a trigger in the data layer.</p>
<p><strong>Data</strong><br>
We also built a single database and datamodel for this monolith to work on. Here the lack of clear ownership is becoming more and more visible: services created by other teams use data from tables across the schema, making it hard to then innovate and make changes to your own schema.</p>
<h2>
What are we doing about it?
</h2>
<p>I am not going to air our dirty laundry too much here, just setting the scene for why we are moving forward with the following approach.<br>
</p>
<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>// what are we going to do about it?
architecture = architecture.Replace("monolith", "guided ownership");
</code></pre>
</div>
<h3>
Replace?
</h3>
<p>The basis of our approach to solving the described problems is going to be <em>replace</em> by <em>rearchitecting</em>. Every time we want to improve or change a process, we will build a new solution and have it replace part of the monolith.</p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvkuthcda71jaufdtesv.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvkuthcda71jaufdtesv.png" alt="Carving up the Monolith" width="800" height="450"></a></p>
<p>This approach allows us to work piece by piece, carving out features, processes and data as we go, and allows for MVP-like implementations first. The downside is that in many cases part of your logic and data lives in one system (the monolith) and part in a newer replacement system. On the data side, this means we face the constant challenge of keeping our data warehouse accurate and consistent.</p>
<p>We made the following choices to help us do exactly this:</p>
<p><strong>Domain Driven Design</strong>; together with our tech principle <a href="proxy.php?url=https://dev.to/coolblue/design-to-encapsulate-volatility-6g8">design-to-encapsulate-volatility</a>, using a pattern such as <a href="proxy.php?url=https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)" rel="noopener noreferrer">Ports and Adapters</a> allows our code to separate business logic from infrastructure-specific code (e.g. Oracle-specific queries). This lets us carve out the parts we are re-architecting while they still need the monolith's (data) access.</p>
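<p>To make the Ports and Adapters idea concrete, here is a minimal sketch in Python (the names are hypothetical; our actual implementations live in our own core languages). The business logic depends only on a port interface, and any adapter, whether backed by the monolith's Oracle schema or a new data store, can be plugged in behind it:</p>

```python
from typing import Protocol

class StockRepository(Protocol):
    """Port: what the business logic needs, free of infrastructure detail."""
    def quantity_on_hand(self, product_id: str) -> int: ...

def can_promise_delivery(repo: StockRepository, product_id: str, amount: int) -> bool:
    # Pure business logic: it neither knows nor cares whether the data
    # still lives in the monolith or in a re-architected store.
    return repo.quantity_on_hand(product_id) >= amount

class InMemoryStockRepository:
    """Adapter: one interchangeable implementation of the port."""
    def __init__(self, stock: dict[str, int]) -> None:
        self._stock = stock

    def quantity_on_hand(self, product_id: str) -> int:
        return self._stock.get(product_id, 0)
```

<p>Swapping the adapter (say, from an Oracle-backed one to a new service) then requires no change to the business logic, which is exactly what makes carving pieces out of the monolith incremental.</p>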
<p><strong>Events</strong> and <strong>Event Driven Architecture</strong>; to decouple the different processes and their supporting apps and services we are moving towards <a href="proxy.php?url=https://microservices.io/patterns/data/event-driven-architecture.html" rel="noopener noreferrer">Event Driven Architecture</a>. This way we can abstract the data leaving the bounded context of an application from its technical form or technology. This helps to hide the underlying situation of having two data stores during the transition period (the monolith on the one hand, and the new re-architected one on the other) from the systems receiving these events. </p>
<h3>
Kafka? Really?
</h3>
<p>We also see that, with more and more apps and services emitting and relying on events, we will need to support that. We chose to begin with <a href="proxy.php?url=https://kafka.apache.org" rel="noopener noreferrer">Apache Kafka</a> as the platform to facilitate these event streams, allowing both our Data Warehouse to tap into them and teams to rely on Kafka to stream between apps (bounded contexts).</p>
<p>Let me be clear, we are not going to replace <em>all</em> inter-service-communication with events and Kafka. For some processes a batch approach, e.g. via Airflow, is still a valid and great choice. The "it depends" strikes again.</p>
<h3>
Enabling teams with our Pillars of Guided Ownership
</h3>
<p>This brings a new problem to the table for our teams. Setting up producers, topics and consumers in this new Kafka platform. We have to help the teams here by enabling them.</p>
<p>Here are three examples how we have been and will be enabling our teams for this new direction.</p>
<h4>
Enable via Automation and Self-service
</h4>
<p>In a previous <a href="proxy.php?url=https://dev.to/coolblue/guided-ownership-422j">post</a> I already hinted that for 2023 we are looking to invest in and develop the tools, templates and processes that make deploying an app or service, with queues, Kafka topics and BigQuery staging tables alongside the needed AWS compute components, easy and available through self-service.</p>
<h4>
Enable via Skills & Knowledge
</h4>
<p>In the past few months we have deliberately done a few things to help with the knowledge needed. We sent a small delegation to the DDD Europe event and the Event Storming workshop that was held then (summer 2022). This group soaked up the knowledge and went ahead inside our tech organisation to help teams and their respective domains to perform these event storming sessions to learn how to do them and to use the output generated in the design of their next solution. This was a joint effort between Principal Developers, Data Engineering and Development teams.</p>
<h4>
Enable via Building Blocks
</h4>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypltdg6b14ohvx4ibcrr.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypltdg6b14ohvx4ibcrr.png" alt="Transactional outbox flow (cutout)" width="800" height="183"></a></p>
<p>We have also done some research and experiments, and based on that we created a solution block built on the <a href="proxy.php?url=https://microservices.io/patterns/data/transactional-outbox.html" rel="noopener noreferrer">Transactional Outbox pattern</a> in one of our core technologies: dotnet/c#. We will add it to our template for creating new apps and services, so teams can use it right away and see how it integrates into their new solution.</p>
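<p>The essence of the Transactional Outbox pattern, independent of our dotnet/c# building block, is that the business change and the outgoing event are written in the same database transaction, and a separate relay later publishes unpublished events to the broker. A sketch in Python with SQLite (table names and payloads are illustrative, and <code>publish</code> stands in for a real Kafka producer):</p>

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def place_order(order_id: int) -> None:
    # Business row and event row are committed atomically:
    # either both exist or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"order_id": order_id, "status": "placed"})),
        )

def relay_once(publish) -> None:
    # A separate relay process polls the outbox and publishes
    # each unpublished event, then marks it as published.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
```

<p>Because the event row commits atomically with the business row, a crash between the commit and the publish cannot lose the event; the relay simply retries unpublished rows, giving at-least-once delivery.</p>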
<h2>
What about Tomorrow?
</h2>
<p>Going forward, we will have to keep checking whether Kafka is the right fit for us to do more decoupling via events. We chose this technology based on a proof of concept we did almost two years ago.</p>
<p>We also need to keep an eye on how the carving up of this monolith is progressing. Many parts are already separate Employee-Activity-focused applications. But turning a monolith fully off, that is the biggest challenge.</p>
<p>And lastly, whatever we do or change in our technology vision going forward, we need to make sure we always enable the teams. Give them the <a href="proxy.php?url=https://dev.to/coolblue/guided-ownership-422j"><strong>Guided Ownership</strong></a> they need to realise our vision.</p>
architecture
eventdriven
decoupling
ownership
-
Monitoring/Observability
Stef van Hooijdonk
Tue, 15 Nov 2022 10:10:00 +0000
https://dev.to/coolblue/monitoringobservability-5cdj
https://dev.to/coolblue/monitoringobservability-5cdj
<h2>
Complete coverage of all production systems.
</h2>
<p>No system should be active in production (i.e. providing a service to a customer or user) without being monitored.</p>
<p>All monitoring/logging is public, so that everyone in Coolblue has visibility of the vitality of the system and Tech Services can monitor specific aspects without exposing sensitive data.</p>
<p>Monitoring means tracking errors in critical workflows, health of critical dependencies and service KPIs.</p>
<h2>
Observability
</h2>
<p>Each application we build has to be <strong>observable</strong>. That means we need to know when something is wrong, and we need to be able to determine why this is.</p>
<p>To be able to tell when something is going wrong with our solutions we actively monitor them and we put in place alerts for our service level objectives.</p>
<p>To be able to find out why things are going wrong we make sure we have the logs to do so, combined with our monitoring data when needed.</p>
<blockquote>
<p>This monitoring and logging principle describes two parts:</p>
<ul>
<li>The first one being monitoring, the practice that describes methods to gain insight into our applications and stacks.</li>
<li>The other one being logging, the practice that describes methods to register log events and give insight into the complexity of our applications and stacks.</li>
</ul>
</blockquote>
<h2>
Monitoring and Alerts
</h2>
<p>You and your team actively monitor your applications. First you determine the metrics that are relevant for your application to measure. Then you create dashboards and define alerts and service level objectives. Dashboards give insight into the recent and/or current state of your application; you use them to see at a glance what is happening. Alerts, with or without a Service Level Objective tag, help you get notified when thresholds you set are met. The SLO alerts are also acted upon by our Tech Services department.</p>
<h3>
Logging
</h3>
<p>Applications are hard and complex to write and manage. The problems we are solving, the abstractions we create and the implementations we choose are all part of the complexity we are building. To shed light on that complexity we can use the practice of logging.</p>
<p>Logging can bring us additional insight into the operations executed in the application, which can help us understand the sequence of events that might have led to a certain outcome (error or otherwise). It's an investigative tool that, exercised correctly, can help piece together the application behaviour leading up to the outcome, giving developers potentially new insights into the emergent behaviour of their systems.</p>
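<p>As a small illustration of that investigative value (the names and format here are made up, not our actual setup), tagging every log line in a flow with the same correlation id is what lets you piece the sequence of events back together afterwards:</p>

```python
import logging
from io import StringIO

# Capture log output in memory so the example is self-contained;
# in production this handler would ship lines to a log platform.
buffer = StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s cid=%(cid)s %(message)s"))

log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_order(order_id: str, cid: str) -> None:
    # Every line carries the same correlation id, so the full
    # sequence leading up to an outcome can be reconstructed.
    log.info("received order %s", order_id, extra={"cid": cid})
    log.info("reserved stock for order %s", order_id, extra={"cid": cid})

handle_order("A-42", "req-123")
```

<p>Filtering a log aggregator on a single correlation id then yields the whole story of one request, across every operation it touched.</p>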
<h2>
Playbooks
</h2>
<p>In order to work together with those who help us action SLO Alerts when they happen, even outside of your own working day, we agree to have playbooks in place. These playbooks contain information on the SLO Alert itself and the potential underlying issue, and should direct the reader to actions that help resolve the issue. We have a template available for writing these playbooks. Please make sure the playbook is findable via the SLO Alert: the SLO Alert title in our observability platform should match the SLO field in the playbook, and you can add a link to the page in the Slack alert for easy access.</p>
<h2>
PII Data and Sensitive Data
</h2>
<p>We monitor and we log without exposing sensitive company data or PII data on our customers.</p>
<blockquote>
<p>Definition of PII Data: Personally identifiable information (PII) is any data that can be used to identify a specific individual. Social Security numbers, mailing or email addresses, and phone numbers have most commonly been considered PII, but technology has expanded the scope of PII considerably. It can include an IP address, login IDs, social media posts, or digital images. Geolocation and biometric data can also be classified as PII.</p>
</blockquote>
o11y
principles
development
-
Guided Ownership
Stef van Hooijdonk
Mon, 14 Nov 2022 09:12:00 +0000
https://dev.to/coolblue/guided-ownership-422j
https://dev.to/coolblue/guided-ownership-422j
<p>In our Tech department here at Coolblue, we strive to give our development teams the ownership they need to build the solutions our domains, and thus our customers, need. </p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv9wj6r8v7x9ry55y7jn.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv9wj6r8v7x9ry55y7jn.png" alt="Our Pillars of Guided Ownership" width="800" height="450"></a></p>
<p>We enable ownership through (at least) 6 <strong>Pillars</strong>:</p>
<ul>
<li>With clear <strong>Tech Principles</strong>
</li>
<li>With guidance on when to use what through a <strong>radar</strong>, <strong>architecture blocks</strong> and <strong>solution blocks</strong>
</li>
<li>
<strong>Self-service</strong> cloud infrastructure (AWS/GCP)</li>
<li><strong>Observability</strong></li>
<li>Grow the <strong>skills</strong> you need through our <strong>Coolblue University</strong>, our Master classes or other resources you may need</li>
<li>Our <strong>Domain roadmaps</strong>
</li>
</ul>
<p>Let's dive a bit deeper into each of these.</p>
<h3>
Tech Principles
</h3>
<p>Everything we design, build, deploy and run adheres to our Tech principles. I wrote about these principles in a series earlier: <a href="proxy.php?url=https://dev.to/stefvanhooijdonk/series/19139">Our Tech Principles</a>. </p>
<p>Most of these principles have been learned, the hard way. And this is our way of passing that experience along to the teams of the future. </p>
<p>We have made these principles part of our evaluation criteria for those that work in Tech. Making it very clear how important we think it is to work with these at every stage and at every level of development. </p>
<h3>
Radar, Architecture blocks and Solution blocks
</h3>
<p>We want teams to develop solutions freely as much as possible.</p>
<p>We believe that means: give the teams a lot of ownership and freedom to build these solutions. </p>
<p>It also means, we need to enable the teams to do so as much as possible. And by having guidance and reusability we can eliminate some tedious work, making the time and effort a Team spends focus more on the actual value delivering solution.</p>
<p><strong>Reuse</strong><br>
We all know that reusing proven code, for a given pattern or a piece of plumbing, helps you focus on the actual solution and increases your speed of delivery. <br>
That is why we have quite a few reusable items:</p>
<ul>
<li><p><strong>Architecture blocks</strong>; in our industry we have quite a few Design patterns and we have selected those that work well in how we develop our solutions. You can see this for instance in our Tech Principle <a href="proxy.php?url=https://dev.to/stefvanhooijdonk/design-to-encapsulate-volatility-6g8">Encapsulate for volatility</a> and the design pattern Ports and Adapters and the Transactional Outbox pattern.</p></li>
<li><p><strong>Solution blocks</strong>; actual implementations developers can find and use in their solutions. Some are packages/components to reuse, others are templates to jumpstart a new app or service. We also have a Design System for building Customer facing applications, one for building Activity Focussed Employee applications and one for our Email communications. Through these building blocks we also make sure common practices, like performance and security, are addressed with solid implementations to be used and to serve as inspiration.</p></li>
</ul>
<p><em>fyi: Architecture blocks and Solution blocks originate from <a href="proxy.php?url=https://pubs.opengroup.org/architecture/togaf91-doc/arch/" rel="noopener noreferrer">TOGAF 9 - Building Blocks</a></em></p>
<p><strong>Deploy and maintain</strong><br>
That does not mean we think having every team build with different tools and languages is the best way to do so. Every language and every development environment means learning something new, means support from our CI/CD platform and from our cloud platform(s). </p>
<p>It also means when people want to move teams, or solutions move to a different team, we have to deal with the knowledge needed to support, develop and run these solutions over time. <br>
So we adopted a few core languages with an intended use case. Maybe not entirely a Radar, but it does give guidance and focus.</p>
<h3>
Self-service cloud infrastructure
</h3>
<p>Is every team SecDevOps to the full extent? No. Not yet at least, but I want our Development teams to be able to create new secure solutions and run their existing ones by themselves as much as possible.<br>
Not only can you see that through our Principles <a href="proxy.php?url=https://dev.to/coolblue/recovery-over-perfection-kbg">Recovery over Perfection</a>, <a href="proxy.php?url=https://dev.to/coolblue/automation-2mh2">Automation</a> and <a href="proxy.php?url=https://dev.to/coolblue/testing-37p6">Testing</a>, but we also want the Teams to deploy, scale and fix when it suits them. <br>
Our Hosting & Deployment teams develop and maintain the tools and core infrastructure that enable our teams to do so, through automated (Github) repo creation for instance. Want to build something new? Add the desired repo to Github via automation yourself. The same automation makes sure our observability platform is hooked up to the newly created stack and CI/CD pipeline. We also use this to implement proven practices for availability and security in our cloud infrastructure.</p>
<p>If you were to ask any of our 50+ development teams whether they want more self-service, they would say <strong>YES</strong>. Of course they would, it's in our corporate culture to "<a href="proxy.php?url=https://aboutcoolblue.com/en/culture/" rel="noopener noreferrer">just do it</a>". So we keep growing their toolset to do so.</p>
<p>One key investment area for 2023 is exactly this. We want our Solutions to be onboarded with more standard Cloud Infrastructure, and to make sure key data can be shared with other apps (outside of the bounded context), our data warehouse and our analysts. We aim to leverage Events and <strong>Event-Driven Architecture</strong> more and more, and making sure key infrastructure is ready to use will help the teams tremendously. Think standard queues, topics in Kafka, tables in BigQuery and more. </p>
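<p>As a sketch of the event-driven idea: a producer publishes a domain event, and any number of consumers (another bounded context, the data warehouse) react to it. The event bus below is a minimal in-memory stand-in for real infrastructure such as a Kafka topic or a queue; the <code>OrderPlaced</code> event and the consumer are hypothetical.</p>

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical domain event shared outside the bounded context."""
    order_id: str

class EventBus:
    """Minimal in-memory publish/subscribe; a stand-in for Kafka or a queue."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        # Deliver the event to every consumer subscribed to its type.
        for handler in self._handlers[type(event)]:
            handler(event)

bus = EventBus()
warehouse_feed = []  # e.g. a data-warehouse consumer collecting events
bus.subscribe(OrderPlaced, warehouse_feed.append)
bus.publish(OrderPlaced(order_id="A-1001"))
```

<p>The producing team never needs to know who consumes the event, which is what keeps the bounded contexts decoupled.</p>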
<h3>
Observability
</h3>
<h4>
Technical Observability
</h4>
<p>Any Development Team owning a solution wants to know how it is performing and whether it is performing correctly. Observability is a great way of doing that. Our Development Teams rely on metrics, dashboards and alerts to stay on top of their solutions. We believe this is critical for a Team to fully <strong>Own It</strong>: acting on these insights and making sure our Employees and Customers can do what they should. </p>
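<p>As an illustration (not our actual tooling), a team can emit a timing metric around any call so a dashboard or alert can act on it. The <code>metrics</code> list below stands in for a real metrics backend, and the metric name is made up.</p>

```python
import time
from functools import wraps

metrics = []  # stand-in for a real metrics backend (StatsD, Prometheus, ...)

def timed(name):
    """Record how long each call takes, tagged with a metric name."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                metrics.append((name, time.perf_counter() - start))
        return wrapper
    return decorator

@timed("checkout.process_order")  # hypothetical metric name
def process_order(order_id):
    return f"processed {order_id}"

process_order("A-1001")
```

<p>In a real setup the recorded durations would feed the team's dashboards and alert thresholds.</p>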
<h4>
Business Observability
</h4>
<p>Another way we enable teams is by also giving them insights beyond the technical metrics.</p>
<ul>
<li><p>Operating Costs; having insight into the total costs your App/Stack incurs triggers conscious decisions on Right-Sizing the Stacks.</p></li>
<li><p>Business Insights; all our Applications have a <strong>purpose</strong>: it can be processing Orders, finding the right amount of Stock to keep, or sending out the right Product to a Customer. By also having access to these dashboards, the Team with their Product Owner can track whether what is being done is actually Moving the Needle.</p></li>
</ul>
<h3>
Coolblue University
</h3>
<p>Maybe an obvious Pillar, but having the right skills (i.e. technical, leadership or social) will make you better at what you do. It will allow you to work better with your Team and your Stakeholders.</p>
<p>For that reason we have a Coolblue University with over a hundred different trainings available for you to take. Other online resources, IRL events, books and classroom trainings are of course also available. </p>
<p>To help everyone understand our Business and what it does, we have also created Master classes. These are currently presentation-based and help anyone who attends to fully learn our way of working on key topics in the company.</p>
<p>We also share what we have learned via our internal, monthly Tech Demos. These offer a stage for our developers and engineers to share their learnings and insights with the rest of us in tech. By tech, for tech.</p>
<h3>
Domain roadmaps
</h3>
<p>We cannot forget the reason we build Solutions/develop Software. We build it because there is a benefit for our Customers (NPS) or the Company (EBITDA). And sometimes there are more <strong>strategic</strong> reasons to build something. </p>
<p>Each team works with a Roadmap. We evaluate these at least 4 times a year: always be ready to change and adjust to what we have learned. The Market and your context will always be changing, and as such we need to adjust when needed and not fall for sunk-cost reasoning ("we have invested too much to change now"). Agility is crucial for us as a Company, and thus for our development efforts too. This is evident from the inclusion of Flexible in our <a href="proxy.php?url=https://aboutcoolblue.com/en/culture/" rel="noopener noreferrer">Corporate Values</a>.</p>
<h3>
Conclusion
</h3>
<p>This post turned out to be longer than I expected and full of corporate speak I usually try to avoid. But here we are. There are of course more ways we support our teams, but I liked the idea of focusing this post on these six Pillars.</p>
<p>This post is mostly to share and explain how we work and think at Coolblue and how we try to create the environment for <strong>Ownership</strong> for our development teams.</p>
technology
teams
enable
ownership
-
Design to encapsulate volatility
Stef van Hooijdonk
Fri, 04 Nov 2022 08:45:00 +0000
https://dev.to/coolblue/design-to-encapsulate-volatility-6g8
https://dev.to/coolblue/design-to-encapsulate-volatility-6g8
<p>Our strong belief in <strong>Clean Architecture</strong> complements the principle of <strong>Encapsulating Volatility</strong>. If something is likely to change, make sure it is easy to change.</p>
<p>A very retro analogy can be applied to hifi systems:</p>
<p>Some manufacturers sold complete integrated hifi systems. These systems had everything on them. Amp, record player, double cassette deck, radio, CD player, equalizer (a graphic one if you were lucky) etc.</p>
<p>The problem is that as CDs were replaced by next gen technology (like minidisc, ok, like mp3 on storage and then streaming), the whole system was at risk of becoming obsolete, even though the system still served its original purpose.</p>
<p>However, high-end manufacturers stuck to their component model. You could buy an amp separately from a CD player, and speakers, etc. You could even mix and match components from different manufacturers.</p>
<p>So when CDs were replaced by mp3s, and then mp3s by streaming services, you could just add or swap out the relevant components. The CD player, one of the "music suppliers" within the system, was a volatile component.</p>
<p>This view on things goes beyond modularisation. It tells you which modules you should have. If it's volatile, very likely to change, encapsulate it.</p>
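<p>The hifi analogy translates directly to code: hide the volatile part behind a stable interface, so that only the component behind it ever changes. A minimal sketch, with illustrative class names:</p>

```python
from abc import ABC, abstractmethod

class MusicSource(ABC):
    """Stable interface around the volatile part (CD, mp3, streaming)."""
    @abstractmethod
    def play(self, title: str) -> str: ...

class CdPlayer(MusicSource):
    def play(self, title: str) -> str:
        return f"spinning disc: {title}"

class StreamingService(MusicSource):
    def play(self, title: str) -> str:
        return f"streaming: {title}"

class HifiSystem:
    """The amp and speakers: unchanged when the music source is swapped."""
    def __init__(self, source: MusicSource):
        self.source = source

    def listen(self, title: str) -> str:
        return self.source.play(title)
```

<p>Swapping <code>CdPlayer()</code> for <code>StreamingService()</code> needs no change to <code>HifiSystem</code>; the volatility is encapsulated.</p>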
<h2>
Clean architecture
</h2>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6wujv50bkeha4468nzw.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6wujv50bkeha4468nzw.png" alt="Clean architecture" width="800" height="793"></a></p>
<p>Regardless of whether we choose Microservices, SOA, or a monolithic approach, Clean Architecture can and should still be used. All of our systems should be built this way. It emphasises the domain at the heart of our designs (key also to DDD etc).</p>
<p>Dependency inversion is integral (and thus this approach supports good programming practice in general, and some specific SOLID principles). It moves things that are more susceptible to change to the edge instead of the center, making them easier to replace. Whatever we build can be built following this principle (even other architectures/patterns can be built in this manner).</p>
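<p>A small sketch of what this looks like in practice: the domain defines the abstraction it needs, and an implementation lives at the edge where it is easy to replace. The repository and product names here are hypothetical.</p>

```python
from abc import ABC, abstractmethod

class StockRepository(ABC):
    """Port defined by the domain; the domain never sees a database."""
    @abstractmethod
    def level(self, product_id: str) -> int: ...

def needs_reorder(repo: StockRepository, product_id: str, minimum: int) -> bool:
    # Domain rule: reorder when stock drops below the minimum.
    return repo.level(product_id) < minimum

class InMemoryStockRepository(StockRepository):
    """Edge implementation; easy to swap for a database-backed one."""
    def __init__(self, levels: dict):
        self._levels = levels

    def level(self, product_id: str) -> int:
        return self._levels.get(product_id, 0)

repo = InMemoryStockRepository({"tv-42": 3})  # hypothetical product id
```

<p>The arrow of dependency points inward: the edge implements what the centre defines, never the other way around.</p>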
<h2>
SOLID
</h2>
<ul>
<li>S - Single-responsibility principle</li>
<li>O - Open-closed principle</li>
<li>L - Liskov substitution principle</li>
<li>I - Interface segregation principle</li>
<li>D - Dependency inversion principle</li>
</ul>
<p>You adhere to <a href="proxy.php?url=https://en.wikipedia.org/wiki/SOLID" rel="noopener noreferrer">SOLID</a> principles (no exceptions for object-oriented development). These are five of the basic principles of object-oriented design and programming. They should be second nature to every OO developer at Coolblue.</p>
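<p>One of these principles in miniature, interface segregation: clients depend only on the methods they actually use. The protocol and class names below are illustrative, not existing Coolblue code.</p>

```python
from typing import Protocol

class Printable(Protocol):
    """Small, focused interface instead of one fat 'Device' interface."""
    def print_label(self) -> str: ...

class Scannable(Protocol):
    def scan_barcode(self) -> str: ...

class LabelPrinter:
    def print_label(self) -> str:
        return "label"

def ship(printer: Printable) -> str:
    # ship() only needs printing, so it is not coupled to scanning.
    return printer.print_label()
```

<p>A class that can only print still satisfies <code>ship()</code>, because the interface asks for nothing more.</p>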
solid
principles
development
technology
-
Testing
Stef van Hooijdonk
Thu, 27 Oct 2022 08:20:00 +0000
https://dev.to/coolblue/testing-37p6
https://dev.to/coolblue/testing-37p6
<p>The software we create should do what it was intended to do. To be sure that it does, we want our <strong>production code</strong> to be well-covered by <strong>automated tests</strong>. These tests should be runnable with the click of a button, and they should be run automatically before each release.</p>
<p>Although testing, when done right, should ultimately make you more productive, by virtue of having to spend less time fixing problems after you release, it does "slow you down" up front, as compared to writing code without any tests. Most applications or services have a long lifecycle that <strong>warrants having extensive test coverage,</strong> but given the nature of some of the work we do, some applications or features are so trivial, short-lived, or time-critical, that having few or no tests is an acceptable situation.</p>
<p>The metric of "code coverage" is not considered of any value in determining the quality of tests. Consequently, we should not use it to fail builds or block deployments. The reason behind this is that tests can merely trigger your production code without testing the right business concepts. Next to that, some units might not even need to be tested: these units can, for example, only properly be tested using an integration test, or might be too trivial and covered by their consumers.</p>
<p>TDD at Coolblue means:</p>
<h2>
Process
</h2>
<ul>
<li>Write a failing test first, then code until that test passes. Do not write any more production code than is necessary to make the one failing test pass.</li>
<li>Red-green-refactor. Don't forget to refactor, it's the most important part.</li>
<li>Everyone on board. The whole team must adopt TDD; do not adopt it partially.</li>
</ul>
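<p>In miniature, the loop looks like this (the discount rule is a made-up example): write the test first, watch it fail, then write just enough production code to make it pass, then refactor.</p>

```python
# Red: the test exists before the production code it exercises.
def test_bulk_discount():
    assert price_with_discount(quantity=10, unit_price=2.0) == 18.0  # 10% off
    assert price_with_discount(quantity=2, unit_price=2.0) == 4.0    # no discount

# Green: just enough production code to make the failing test pass.
def price_with_discount(quantity: int, unit_price: float) -> float:
    total = quantity * unit_price
    return total * 0.9 if quantity >= 10 else total

test_bulk_discount()
```

<p>The refactor step would then clean up the implementation while the test keeps it honest.</p>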
<h2>
Writing tests
</h2>
<ul>
<li>Consider the inputs, outputs, all possible weaknesses (possible errors) and strengths (successful runs).</li>
<li>Do not overcomplicate your tests. A test should be simple to set up and execute.</li>
<li>Tests should run and pass on any machine/environment. If tests require special environmental setup or fail unexpectedly, then they are not good unit tests.</li>
<li>Make sure your tests clearly reveal their intent. Another developer can look at the test and understand what is expected of the code.</li>
<li>Each test should have a limited scope. If it fails, it should be obvious why it failed. It's important to only assert one logical concept in a single test. That means you can have multiple asserts on the same object, since they will usually cover the same concept.</li>
<li>Keep your tests clean. They are just like your production code.</li>
<li>External dependencies should be replaced with test doubles in unit tests.</li>
</ul>
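<p>A sketch of that last point, replacing an external dependency with a test double (the payment-gateway API shown is hypothetical):</p>

```python
from unittest.mock import Mock

def refund_order(gateway, order_id: str, amount: float) -> bool:
    """Unit under test; 'gateway' is an external dependency in production."""
    response = gateway.refund(order_id, amount)
    return response["status"] == "ok"

# In the unit test, a Mock stands in for the real gateway: no network,
# deterministic behaviour, and we can assert how it was called.
gateway = Mock()
gateway.refund.return_value = {"status": "ok"}
```

<p>The test then exercises only the unit's own logic, while still verifying the interaction with the dependency.</p>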
<h2>
Running tests
</h2>
<ul>
<li>Unit tests should be fast; the entire unit test suite should finish running in under a minute.</li>
<li>Unit tests should just run with zero effort (after installing dependencies).</li>
<li>The ratio of tests at each level of testing should be balanced according to the pyramid model (e.g. 80% unit tests, 15% integration tests, 5% acceptance tests).</li>
<li>Acceptance tests should be divided in suites based on features.</li>
</ul>
<h2>
Measurements of success for teams
</h2>
<ul>
<li>The integration and unit tests are passing before merging.</li>
<li>Acceptance tests are running successfully in the Acceptance environment before code is deployed to production.</li>
<li>Unit/Integration tests run within CI environment on each PR.</li>
<li>Failing tests block deployment. All tests run successfully on a CI environment before code is deployed.</li>
</ul>
<blockquote>
<p>Note: "Coverage", i.e. the percentage of lines covered by a unit or integration test, is not a measure of success, because it says nothing about how well the code has been tested. Do not make a specific level of test coverage a requirement, because it is hard to reach and it will cause people to start writing nonsense tests just to reach the required level of coverage.</p>
</blockquote>
<h2>
Suggested resources
</h2>
<ul>
<li>Read Test Driven Development By Example by Kent Beck</li>
<li>Read Clean Code by Robert C. Martin</li>
<li>Read xUnit Test Patterns: Refactoring Test Code by Gerard Meszaros</li>
<li>Check training videos at <a href="proxy.php?url=https://cleancoders.com" rel="noopener noreferrer">https://cleancoders.com</a>
</li>
<li>Read Working Effectively with Legacy Code by Michael Feathers</li>
</ul>
development
principles
tdd