DEV Community: Coolblue
The latest articles on DEV Community by Coolblue (@coolblue).
https://dev.to/coolblue
en
-
How We Reorganised Engineering Teams at Coolblue for Better Ownership and Business Alignment
Aman Agrawal
Wed, 07 Feb 2024 20:28:24 +0000
https://dev.to/coolblue/how-we-reorganised-engineering-teams-at-coolblue-for-better-ownership-and-business-alignment-34gb
<p>In this post, I will share my experiences leveraging Domain Driven Design strategies and Team Topologies to reorganise two product engineering teams in the Purchasing domain at <a href="proxy.php?url=https://www.coolblue.nl/" rel="noopener noreferrer">Coolblue</a> <em>(one of the largest e-commerce companies in the Netherlands)</em>, along business capabilities to improve team autonomy, reduce cognitive load on teams and improve our architecture to better align with our business.</p>
<p><strong>Disclaimer</strong>: I am not an expert in Team Topologies; I have only read the book twice and spoken to one member of the core team behind it. I am always looking to learn more about applying these ideas effectively, and this post is just one of the ways we applied them to our problem space. YMMV!</p>
<h3>
Context
</h3>
<p>The Purchasing domain is one of the largest at Coolblue in terms of the business capabilities we support and the number of engineering teams <em>(4 as of this writing, possibly growing in the future)</em>, and it has one very critical goal: to ensure we have the right kind of stock available to sell in our central warehouse at all times, without over- or under-stocking, and to secure the most favourable vendor agreements to improve the profitability of our purchases. Our primary stakeholders are supply planners and buyers in the product category teams responsible for the various categories of products we sell.</p>
<p>We buy stock for tens of thousands of products to meet growing customer demand, so it's absolutely critical not only that we make good buying decisions (which rely on a lot of data delivered in a timely fashion from across the organisation) but also that we manage pending deliveries and delivered stock efficiently and effectively (which relies on timely and accurate communication with suppliers).</p>
<h3>
Growth of the Purchasing Domain
</h3>
<p>Based on the <a href="proxy.php?url=https://vladikk.com/2018/01/26/revisiting-the-basics-of-ddd/" rel="noopener noreferrer">strategic Domain Driven Design terminology</a>, Purchasing would be categorised as a supporting domain, i.e. Purchasing capabilities are not our core differentiator. The workings of the domain are completely opaque to end customers. Most organisations will have similar purchasing processes and often similar systems <em>(sometimes these systems are bought instead of being built)</em>.</p>
<p>However, over the last 10 years the Purchasing domain has also grown in complexity as we have expanded our business capabilities: data science, EDI integration, supplier performance measurement, stock management, store replenishment, purchasing agreements and rebates, etc. We have come to rely on more accurate and timely data to make critical purchasing decisions; being able to quickly adapt our purchasing strategies during COVID-19 helped us stay on track with our business goals. For the most part we have built our own software, driven by the need to tackle this increased complexity, maintain agility in the face of global disruptions, and integrate with the rest of Coolblue more effectively and efficiently. The following sub-domain map shows a very high-level composition of the Purchasing domain:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2024/02/image.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2024%2F02%2Fimage.png%3Fw%3D1024"></a><br>
<em>High level break down of the Purchasing domain (simplified)</em></p>
<p>For this post, I will be focussing on the Supply sub-domain (shown in blue above) where we redesigned the engineering team organisation.</p>
<h3>
Domain vs Sub-Domain vs Bounded Contexts vs Teams
</h3>
<p>In DDD terminology, a <strong>sub-domain</strong> is a part of the <strong>domain</strong> with a specific, logically related subset of the overall business responsibilities, and it contributes towards the overall success of the domain. A domain can have multiple sub-domains, as you can see in the visual above. A sub-domain is a part of the problem space.</p>
<p>Sometimes it can be a bit difficult to differentiate between a domain and a sub-domain. From my point of view, it's all just domains: if a domain is large and complex enough, we tend to break it down into discrete areas of responsibility and capability called sub-domains. But I don't think this is a hard and fast rule.</p>
<p>A <strong><a href="proxy.php?url=https://martinfowler.com/bliki/BoundedContext.html" rel="noopener noreferrer">bounded context</a></strong> is the one and only place where the solution (often software) to a specific business problem lives; the terminology captured here is consistent in its usage and meaning. It represents an area of applicability of a domain model. E.g. the <em>Supplier Price and Availability</em> context will have software systems that know how to provide supplier prices and stock availability on a day-to-day basis. These terms have an unambiguous meaning in this context. The model that helps solve the problem of prices and stock availability is largely only applicable here and shouldn't be copied into other bounded contexts, because that would duplicate knowledge in multiple places and introduce data inconsistencies, leading to expensive-to-fix bugs. Bounded contexts therefore provide a way to encapsulate the complexities of a business concept while exposing only well-defined interfaces for others to interact with.</p>
<p>In an ideal world each sub-domain will map to exactly one bounded context owned and supported by exactly one team, but in reality multiple bounded contexts can be assigned to a sub-domain and one team might be supporting multiple bounded contexts and often multiple software systems in those contexts.</p>
<p>Here's an illustration of this organisation <em>(names are for illustrative purposes only)</em>:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image-4.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-4.png%3Fw%3D877"></a><br>
<em>An illustration of relationship between domain, sub-domain and bounded contexts (assume one team per sub-domain)</em></p>
<p>I am not going to go into the depths of strategic DDD but <a href="proxy.php?url=https://vladikk.com/2018/01/26/revisiting-the-basics-of-ddd/" rel="noopener noreferrer">here</a> <a href="proxy.php?url=https://github.com/ddd-crew/ddd-starter-modelling-process?tab=readme-ov-file#understand" rel="noopener noreferrer">are</a> some <a href="proxy.php?url=https://medium.com/nick-tune-tech-strategy-blog/domains-subdomain-problem-solution-space-in-ddd-clearly-defined-e0b49c7b586c" rel="noopener noreferrer">excellent</a> places to <a href="proxy.php?url=https://verraes.net/#blog" rel="noopener noreferrer">study</a> it and understand it better. The strategic aspects of DDD are really quite crucial to understand in order to design software systems that align well with business expectations.</p>
<h3>
Old Team Structure
</h3>
<p>Simply put, the Supply sub-domain is primarily responsible for creating and sending appropriate purchase orders to our suppliers for the products we want to buy, and managing their lifecycle to completion. There are of course ancillary stock-administration responsibilities that this sub-domain handles as well, but not all of those have been software-ified…yet.</p>
<p>Historically, we had split the product engineering teams into two (the names of the teams foreshadow the problems we would end up having):</p>
<p><strong>Stock Management 2</strong> : responsible for generating automated replenishment proposals and maintaining pre-purchase settings, and</p>
<p><strong>Stock Management 1</strong> : responsible for everything to do with purchase orders; over time, the responsibilities of maintaining EDI integration and store replenishment also fell on this team.</p>
<p>Though each team had a separate backlog, they shared the same Product Owner, and the responsibilities allocated to the teams grew…"organically". That is to say, the allocation wasn't always based on a team's expertise and responsibility area, but mostly on who had the bandwidth and space available in their backlog to build something. Purely efficiency-focussed (<em>how do we parallelise to get the most work done</em>), not effectiveness-focussed (<em>how do we organise to increase autonomy and expertise, and deliver the best outcomes for the business</em>).</p>
<p>Because of this mindset, Stock Management 2 over time also took on responsibilities that would have better fit Stock Management 1. For example, they built a recommendation system on top of purchase orders, something they had very little knowledge of. They ended up duplicating a lot of purchase order knowledge in this system (they had to, in order to create good recommendations). This also required replicating purchase order data in a different system, which would later create data consistency problems.</p>
<p>As a result, dependencies grew in unstructured and unwanted ways, e.g. a lot of database sharing between the two teams, and complex inter-service dependencies with multi-service hops required to resolve all the data needed for a given use case. The system architecture also grew "organically", with little to no alignment with the business processes it supported, and the accidental complexity increased. Looking at the team names, no one could really tell what either team was responsible for, because their responsibilities were neither well documented nor stable.</p>
<p>We ended up operating in this unstructured way until July 2023.</p>
<h3>
Trigger for Review
</h3>
<p>The trigger to review our team boundaries came in Q1 2023, when we nearly made the mistake of combining the two teams into one single large team with joint scrum ceremonies, along with a proposal to add more process to manage this large team (LeSS). None of it had taken into account the business capabilities the teams supported or the desired-state architecture we wanted. It was clear that no research had been done into how the industry solves this problem, and it was being approached purely from a management-convenience point of view.</p>
<p>Large teams, especially in a context that supports multiple business processes, are a bad idea in many ways (some of these are not unique to large teams):</p>
<ul>
<li>Large teams are expen$ive; you'd often need more seniors on a large team to keep the technical quality high and technical debt low</li>
<li>No real ownership or expertise of anything and no clear boundaries</li>
<li>Team members are treated as feature factories instead of problem solving partners</li>
<li>Output is favoured over outcomes, business value delivered is equated to story points completed</li>
<li>Cognitive load and coordination/communication overhead increases</li>
<li>Meetings become less effective and people tend to tune out <em>(I tend to doodle geometric shapes, it's fun!)</em>
</li>
<li>Product loses direction and vision; it's all about cramming in more features, which fuels the need to make the team bigger. Because of course, more people will make you go faster…NOT!</li>
<li>Often more process is required to "manage" large teams, which kills team motivation and autonomy</li>
</ul>
<p>This achieves the exact opposite of agility, and we saw the following degrading results when we briefly experimented with the large-team idea:</p>
<ul>
<li>Joint sessions were becoming difficult and inefficient to participate in (not everyone can or will join on time) </li>
<li>Often team members walked away with completely different understandings and mental models, which got put into code.</li>
<li>Often there was confusion about who was doing what, which increased the coordination overhead</li>
<li>Given that the two teams had historically been separate, with their own coding and PR standards, there was often friction in resolving these conflicts, which slowed down delivery and reduced inter-team trust.</li>
</ul>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage.png%3Fw%3D1024"></a><br>
<em>Communication overhead grows as number of people in the group increases</em></p>
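<p>The caption above reflects the well-known combinatorics behind this: a group of n people has n(n-1)/2 pairwise communication channels, so the overhead grows quadratically with team size. A quick illustration:</p>

```python
def channels(n: int) -> int:
    """Pairwise communication channels in a group of n people: n*(n-1)/2."""
    return n * (n - 1) // 2

# Doubling a team from 5 to 10 people more than quadruples the channels.
for size in (5, 8, 10, 15):
    print(size, channels(size))  # 5→10, 8→28, 10→45, 15→105
```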
<p>The worst part of all of this is learned helplessness! We become so desensitised to our conditions that we accept the sub-optimal state as our new reality.</p>
<p>So combining teams and adding more process wasn't going to be the solution here, and it most certainly shouldn't be applied without involving the people whose work lives are about to be impacted, i.e. the engineering teams.</p>
<p>These reorganisations should also not be done devoid of any alignment with the business process, because you risk the system architecture either not being fit for purpose or being too complex for the team(s) to handle, with all sorts of assumptions baked into the design.</p>
<h3>
Team Topologies and Domain Driven Design
</h3>
<p>I had a feeling that we needed to take a different approach here, and by this time I had been hearing a lot about <a href="proxy.php?url=https://teamtopologies.com/" rel="noopener noreferrer">Team Topologies</a>, so I bought the <a href="proxy.php?url=https://www.amazon.com/Team-Topologies-Organizing-Business-Technology/dp/1942788819/ref=sr_1_1?crid=2UBW1RHA4KIFI&keywords=Team+Topologies&qid=1703427691&sprefix=team+topologie%2Caps%2C160&sr=8-1" rel="noopener noreferrer">book</a> (highly recommended) and read it cover to cover…twice…to understand its core ideas. A lot of people know about <a href="proxy.php?url=https://martinfowler.com/bliki/ConwaysLaw.html" rel="noopener noreferrer">Conway's Law</a>, but Team Topologies really brings the double-edged nature of Conway's Law into focus. Ignore it at your own peril!</p>
<p>This Comic Agile strip sums up how that realisation dawned on me after reading the book:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/pasted-image-20230928210406.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fpasted-image-20230928210406.png%3Fw%3D1024"></a><br>
<em>Check out more hilarious strips <a href="proxy.php?url=https://www.comicagile.net/" rel="noopener noreferrer">here</a></em></p>
<p>Traditionally, team and domain organisation in most companies has been done by business people far removed from the engineering teams, meaning a critical perspective is missing from those discussions: <em>that of the system architecture</em>. And because team design influences software design, many companies end up shooting themselves in the foot with unwieldy, misaligned software that delivers the opposite of agility. This is exactly why it's crucial to have representation from engineering in these reorganisations. Just because something works doesn't mean it's not broken!</p>
<p>By this time we had also conducted several <a href="proxy.php?url=https://www.eventstorming.com/" rel="noopener noreferrer">event storming</a> sessions for the core Supply sub-domain (for the entire purchase ordering flow) to identify critical domain events, possible bounded contexts and what we want our future state to be. I cannot emphasise enough how important this kind of event storming can be in helping surface complexity, potential boundaries and opportunities to improve the current state.</p>
<p>Putting Team Topologies and strategic DDD together to create deliberate team boundaries was just a no-brainer.</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/core-purchase-ordering-event-storm.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fcore-purchase-ordering-event-storm.png%3Fw%3D1024"></a><br>
<em>Don't worry, you are not meant to read the text; the identified boundaries are more important</em></p>
<p>It's also worth bearing in mind that this wasn't a greenfield operation: we had existing software systems that had to be mapped onto some of the bounded contexts, at least until we could determine their ultimate fate. Some of the bounded contexts had to be drawn around those existing systems to keep their complexity from leaking into other contexts.</p>
<h3>
Brainstorming on New Team Design
</h3>
<p>In May 2023, I got together with our development lead and our domain manager to brainstorm how we could organise our teams not only for efficiency but, crucially, this time also <strong>for effectiveness</strong>.</p>
<p>In these discussions I presented the ideas of Team Topologies and insights from the event storms we had been doing. According to Team Topologies, team organisations can essentially be reduced to the <a href="proxy.php?url=https://teamtopologies.com/key-concepts" rel="noopener noreferrer">following 4 topologies</a>:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image-6.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-6.png%3Fw%3D710"></a><br>
<em>Four fundamental topologies</em></p>
<p>Based on these and my formative understanding, I presented the following team design options:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image-7.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-7.png%3Fw%3D1024"></a><br>
<em>The 2 team model</em></p>
<p>This model makes the Purchase Ordering team (stream aligned) solely responsible for full purchase order lifecycle handling, including the replenishment proposals (which is an automated way to create purchase orders). The Pre Purchase Settings team (platform team) will provide supporting services to the PO team (e.g. supplier connectivity and price & availability services, purchase price administration services, various replenishment settings services etc).</p>
<p>Another model was this:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image-8.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-8.png%3Fw%3D1024"></a><br>
<em>The 3 team model</em></p>
<p>In the 3-team model, I split the replenishment proposals part out of the Purchase Ordering team, added to it the new actionable products capability we were working on, and created another stream-aligned team: the Replenishment Optimisation team. The platform team will now provide supporting services to both stream-aligned teams, and the new optimisation team will essentially provide decision-making insights to the Purchase Ordering team.</p>
<p>In a perfect world, you want to assign one team per bounded context, and as is evident from the event storm we had several contexts. But Team Topologies also warns us to make sure the complexity of the work warrants a dedicated team; otherwise, you risk losing people to low motivation while still bearing the cost of running multiple teams.</p>
<p>Nevertheless, after weighing practical constraints like money, complexity and team motivation, and perhaps <strong>most importantly</strong> the impact of each design on the overall system architecture and what we wanted our desired-state architecture to look like, we settled on the following cut:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/12/image-9.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2023%2F12%2Fimage-9.png%3Fw%3D1024"></a><br>
<em>Final team split</em></p>
<p>Basically, at their core, the Purchase Order Decisions team will own all components that factor into purchasing decision making:</p>
<ul>
<li>Replenishment recommendation generation</li>
<li>Purchase order creation and verification</li>
<li>Actionable product insights</li>
</ul>
<p>And the Purchase Order Management team will own all components involved in managing the lifecycle of submitted purchase orders <em>(I know "management" is a bit of a weasel word, but I am hoping we will find a better name over time)</em>:</p>
<ul>
<li>Purchase order submission</li>
<li>Purchase order lifecycle management/adjustments (manual and system generated)</li>
</ul>
<p>The central idea behind this split is that purchase order verification is a pivotal event in our event storm: once a purchase order is verified, it will always be submitted. Submission is a key pre-condition to managing the pending purchase order lifecycle, and it carries sufficient complexity due to the communication involved with suppliers and our own warehouse management system, so it makes sense for Purchase Order Management to own everything from submission onwards. This also makes them the sole owner of the purchase order database, which breaks the shared-database anti-pattern and relies instead on asynchronous, event-driven communication between the bounded contexts owned by the teams. The benefit is that we can establish clearer communication contracts and expectations without needing to know the internals of another context.</p>
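<p>As a hedged sketch of this asynchronous, event-driven style (the event name and payload are hypothetical, and a real system would use a message broker rather than an in-memory bus): one context publishes a domain event, and the other reacts to it without either side knowing the other's internals.</p>

```python
from typing import Callable
from collections import defaultdict

# Minimal in-memory event bus; in production this would be a message broker.
class EventBus:
    def __init__(self) -> None:
        self._handlers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._handlers[event_type]:
            handler(payload)

bus = EventBus()
submitted = []

# The (hypothetical) Management context reacts to the event contract only;
# it knows nothing about how the Decisions context produced it.
bus.subscribe("PurchaseOrderVerified",
              lambda e: submitted.append(e["order_id"]))

# The (hypothetical) Decisions context publishes after verification.
bus.publish("PurchaseOrderVerified", {"order_id": "PO-42", "supplier": "ACME"})
```

<p>The contract between the teams is then just the event name and payload shape, which can be documented and versioned independently of either team's internal model.</p>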
<p>In addition to this, we also identified several supporting capabilities/bounded contexts for which the complexity just wasn't high enough to warrant a separate team entirely, at least for now:</p>
<ul>
<li>Supplier price and availability retrieval</li>
<li>EDI connection management</li>
<li>Despatch advice forwarding</li>
<li>E-mail based supplier communication</li>
</ul>
<p>These capabilities still had to be allocated between the two teams, so based on whether they belong more to the decision-making part or the management part, we made the following allocations:</p>
<ul>
<li>Supplier price and availability retrieval <em>(Purchase Order Decisions, because it's only used whilst creating replenishment recommendations and subsequent purchase orders)</em>
</li>
<li>EDI connection management, despatch advice forwarding <em>(Purchase Order Management, because they already owned these and they definitely didn't make sense as part of the decision-making flows)</em>
</li>
<li>E-mail based supplier communication <em>(Purchase Order Management, because purchase order submission can happen via EDI or via e-mail, so it makes sense for them to own all aspects of submission)</em>
</li>
</ul>
<p>This brought the final design of teams to this:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2024/02/image-2.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2024%2F02%2Fimage-2.png%3Fw%3D1024"></a><br>
<em>Final team cut with bounded contexts owned by each</em></p>
<p>It might seem a bit excessive to assign multiple bounded contexts to a single team, and like I said, in a perfect world I would have one team responsible for only one bounded context. But considering the constraints I mentioned before (cognitive load, complexity of the challenge and the financial cost of setting up many teams), I think this is a pragmatic choice for now. The identified bounded contexts are also not set in stone, so it's entirely possible we might combine some of them into a single bounded context based on conceptual and linguistic cohesion. We might even split some out into dedicated teams, should those bounded contexts grow complex enough to warrant them.</p>
<p>NB: A bounded context might not always mean a single deployment unit (i.e. a service or an application). A single BC can map to one or more related services if the rates of change, fault tolerance requirements and deployment frequencies dictate as much. The single most important thing about BCs is that they encapsulate a single distinct business concept, with a consistent business language and consistent meanings of terms, so it's perfectly plausible that there are good drivers for splitting one BC into multiple deployment units.</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2024/02/image-1.png" rel="noopener noreferrer"><img src="proxy.php?url=https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcodequirksnrants.files.wordpress.com%2F2024%2F02%2Fimage-1.png%3Fw%3D1024"></a><br>
<em>Some heuristics for determining bounded contexts</em></p>
<h3>
Go Live!
</h3>
<p>In June 2023 we presented this design to both teams and asked for feedback. Both teams could see the value of the split, because it created better ownership boundaries and better focus, and offered an opportunity to reduce the cognitive overhead of communicating within a large team. So in July 2023 we put the new team organisation live, made all the administrative changes (changing team names in the HR systems and Slack channels, assigning the right teams to code repositories based on the allocations, etc.) and got to work in the new set-up.</p>
<h3>
Reflection
</h3>
<p>Whilst this team organisation is definitely the best we've ever had in terms of cleaner ownership boundaries, a relatively appropriate allocation of cognitive load, and a better sense of purpose and autonomy, it's by no means the best <strong>we will ever have</strong>. The most important thing about agility is continuous improvement, and DDD tells us there is no single best model, so it only makes sense to revisit these designs regularly and seize any opportunities for improvement along any of those axes, to keep aligned with the business and deliver value effectively. The organisation and the domain never stay the same; they grow in complexity, so it's crucial for engineering teams to evolve along with them in order to stay efficient and effective, and for the architecture to stay in alignment with the business. I loosely equate teams and organisations to living organisms that self-organise like cellular mitosis; it's the natural order of things.</p>
<p>Of course things are not perfect; both teams still have some degree of functional coupling, i.e. if the model of the purchase order changes fundamentally, or if we need to support new purchase order types, both teams will need to change their systems and coordinate to some extent. This is a trade-off of this team design, but largely the teams are still autonomous and communicate asynchronously for the most part. Any propagation of model changes can be further limited by use of appropriate anti-corruption layers on either side.</p>
<p>One of the other significant benefits of this deliberate reorganisation is that both teams created a north-star roadmap for the desired-state architecture. For a long time, both teams had incurred unwarranted technical complexity in the form of arbitrarily created services with mixed programming paradigms, which were becoming difficult for a small team to maintain. Contract coupling at multiple service integration points made the smallest of changes ripple out to multiple systems, which then had to be changed in a specific order to deploy safely (we've had outages in the past because we forgot to update the contracts consistently).</p>
<p>As part of our new engineering roadmap, we are now reviewing these services with a strategic DDD eye and asking, "what business capability does this service provide?" If the answer is similar for two services and there are none of the benefits of <a href="proxy.php?url=https://martinfowler.com/microservices/" rel="noopener noreferrer">microservices</a> to be gained, those two services will be combined into a single modular monolith. Some services will not make sense in the new organisation, so they will be decommissioned and the communication pathways simplified. We project a potential 40% reduction in the complexity of the overall system landscape from these changes (and hopefully some cost savings as well); at the very least, the complexity will be better contained. But perhaps most importantly, we aim to make the architectural complexity fit the cognitive bandwidth of the teams, ensuring a team can own the flow end to end.</p>
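<p>As a rough sketch of what "combining two services into a modular monolith" can look like (the module names are hypothetical, not our actual services): the capability boundaries survive as in-process modules with explicit interfaces, while the network hop and the versioned wire contract between them disappear.</p>

```python
# Two formerly separate services become modules in one deployable.
# Each module keeps its own interface; only the transport changes
# (an in-process call instead of HTTP), so the boundaries stay explicit.

class PriceModule:
    """Formerly a (hypothetical) standalone price service."""
    def unit_price_cents(self, product_id: str) -> int:
        return {"p1": 500}.get(product_id, 0)

class ProposalModule:
    """Formerly a (hypothetical) replenishment proposal service."""
    def __init__(self, prices: PriceModule) -> None:
        self._prices = prices  # direct dependency, no network contract to version

    def proposal_cost_cents(self, product_id: str, quantity: int) -> int:
        return self._prices.unit_price_cents(product_id) * quantity

# One deployment unit wires the modules together at startup.
app_prices = PriceModule()
app_proposals = ProposalModule(app_prices)
print(app_proposals.proposal_cost_cents("p1", 10))  # 5000
```

<p>The design choice being sketched: keep the module seams crisp enough that, should one capability later warrant independent scaling or deployment, it can be split back out without untangling the code.</p>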
<p>Another thing we will work on next is strengthening our boundaries with dependent teams. Historically, the e-commerce database has been shared with all teams at Coolblue, and this creates challenges (a subject for another post). Going forward we will be improving our web services and events portfolio so that dependents can use our service contracts to communicate with our systems instead of sharing databases. With a better sense of what we do and don't own, I expect these interfaces to become crisper over time.</p>
<p>These kinds of reorganisations can have a long maturity cycle before it becomes clear whether the decisions and team boundaries were the right ones, and organising teams is just the first, though a significant, step. The key is keeping the discussion going and being deliberate about our system design decisions, to ensure that business domains and system design stay in alignment. To that end we will continue investing in Domain Driven Design practices, so business and engineering can collaborate effectively to create systems that better reflect domain expectations whilst keeping complexity low and maintaining acceptably high levels of fault tolerance and autonomy of value delivery.</p>
teamtopologies
domaindrivendesign
architecture
-
Monolith vs Microservices
Aman Agrawal
Sun, 01 Oct 2023 20:49:09 +0000
https://dev.to/coolblue/monolith-vs-microservices-14f6
<p>One of my colleagues shared this <a href="proxy.php?url=https://renegadeotter.com/2023/09/10/death-by-a-thousand-microservices.html">article</a> with me a few days ago, and having read through it (and many others like it before), I felt I needed to provide a <em>hopefully</em> more balanced perspective on this age-old debate, based on my own experiences and learnings. So in this post, that's what I am going to attempt to do.</p>
<h3>
Successful Startup != Microservices
</h3>
<p>The author shares a link to a <a href="proxy.php?url=https://kenkantzer.com/learnings-from-5-years-of-tech-startup-code-audits/">security audit</a> of startup codebases and emphasises point number 2 of that article: that all successful startups kept their code simple and steered clear of microservices until they knew better.</p>
<p>I can see why: microservices are an optimisation pattern an org may need to apply when it scales beyond a certain size. When you are a fledgling startup with limited funding and an uncertain future, microservices are the last thing you should be worried about (and preferably not at all). All the time and money at this stage should be spent on generating value and treating your employees well (the latter is not optional…ever).</p>
<p>Here's an <a href="proxy.php?url=https://www.youtube.com/watch?v=t7iVCIYQbgk&pp=ygUSbW9uem8gbWljcm9zZXJ2aWNl">example</a> of a modern digital bank that went all in with microservices from the get-go. They boast about their 1500+ microservices, which a while ago invited some <a href="proxy.php?url=https://twitter.com/Grady_Booch/status/1190894532977520640">flak on Twitter</a>. From my limited point of view on their context, this looks like insanity, and though they touch on all the challenges that come with this kind of architecture, I can't help but think that somewhere deep down they go, "Wish we hadn't done this so soon!" But I am willing to give them the benefit of the doubt that they did their due diligence whilst evolving into a complex architecture and building a platform to support it.</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/10/image-1.png"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--j9s4FXxf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://codequirksnrants.files.wordpress.com/2023/10/image-1.png%3Fw%3D586" alt="" width="586" height="654"></a></p>
<h3>
Microservices Make Security Audit Harder
</h3>
<p>If I have 1500+ services potentially written in different languages, spread across hundreds if not thousands of repositories, my job as a security auditor just got exponentially harder! <a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--cbiE9eth--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f629.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--cbiE9eth--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f629.png" alt="😩" width="72" height="72"></a> The author even mentions that in point 7 of his list: <strong>Monorepos are easier to audit</strong>. Monzo's 1500+ Go services are in a monorepo, so that's one down, I guess.</p>
<p>The security attack surface also gets that much wider: can you ensure all 1500+ of your microservices leverage a security-hardened platform and industry best practices in a standard way? Do you even know what those are? What about the dependencies (direct and transitive) each of those services takes on external code?</p>
<p>I think these are probably the most significant drivers for a security professional to gripe about microservices, but the more you distribute, the more standardisation you need on the platform front. You don't want to be reinventing the wheel, especially when it comes to security, so the more sensible defaults you can bake into the platform, the better and easier it might be to audit.</p>
<h3>
Do we <em>really</em> need microservices?
</h3>
<p>I agree with the author that in some cases there can be <em>"a dogma attached to not starting out with microservices on day one — no matter the problem"</em>. Just because someone else (usually multi-billion dollar organisations with a global footprint and tens of thousands of engineers, think FAANG) is doing microservices doesn't mean my 5-person startup also needs microservices.</p>
<p>But I have to add a bit of nuance here: an org doesn't have to reach FAANG scale to realise it needs to rearchitect. If my org is growing in terms of revenue, size and technology investment, then regularly asking the following kinds of questions is part of engineering due diligence:</p>
<ul>
<li>Is the current monolithic architecture with a shared database still the right thing to do?</li>
<li>Are we facing challenges in some areas where our current architecture is impeding our value delivery? If so, what might be some ways to alleviate that pain?</li>
<li>How much longer can this system keep growing at the same pace as the org and still be maintainable and agile?</li>
</ul>
<p>Agile organisations and agile architectures are the ones that can evolve with time and need. The complexity of the architecture should be commensurate with the organisation's growth rate and ambitions. No more, no less.</p>
<h3>
How do web cos grow into microservices?
</h3>
<p>None of the web cos evolved to microservices overnight; it was a long, arduous journey over decades (far longer than most employees' tenure in an organisation, btw). Here's <a href="proxy.php?url=https://www.infoq.com/presentations/shoup-ebay-architectural-principles/">eBay's</a> journey to microservices, here's <a href="proxy.php?url=https://www.infoq.com/presentations/microservices-netflix-industry/">Netflix's</a> and here's <a href="proxy.php?url=https://www.allthingsdistributed.com/2022/11/amazon-1998-distributed-computing-manifesto.html">Amazon's</a>. In all cases you will notice that even though today they are microservice behemoths, they started the thinking and the groundwork many years prior, when they were much smaller than they are today. Amazon, for example, started their thinking back in 1998, a full 25 years ago, which ultimately resulted in the manifesto linked above.</p>
<p>This is a testament to the forward thinking and agility that helped them survive and succeed. If they had waited until they got to today's scale (assuming they ever managed to reach it in the first place) to start decomposing their architecture for growth and evolution, they probably wouldn't have made it.</p>
<p>So just blindly touting "there is nothing wrong with a monolith" or "don't do microservices" without justifying the arguments or clarifying the nuances is no different from someone wanting 1500+ microservices because someone else is doing it.</p>
<h3>
Look at where you are and where you want to be
</h3>
<p>It's also true that many organisations are still monolithic-ish (from a technical pov), for example <a href="proxy.php?url=https://hanselminutes.com/847/engineering-stack-overflow-with-roberta-arcoverde">StackOverflow</a> and <a href="proxy.php?url=https://blog.quastor.org/p/shopify-ensures-consistent-reads">Shopify</a>, and there are probably more. But it's not as if StackOverflow will never entertain the possibility: they have multiple teams responsible for various parts of the site, so if they need to scale and increase the fault tolerance of a specific set of teams, they can always factor services out.</p>
<p>The article also gives the examples of Instagram and Threads, but what it omits is that <a href="proxy.php?url=https://newsletter.pragmaticengineer.com/p/building-the-threads-app">Threads</a> is built on top of Meta's massive platform, which is a collection of different and largely reusable services. Can you imagine the complexity of building something like that from the ground up?</p>
<p>I can be pro-monolith and pro-large shared databases as an organisation as long as I regularly and critically review my architecture to sense signs of troubles and be mature enough to evolve it into a better state.</p>
<h3>
Problems with Distribution
</h3>
<p>Here's where I probably agree somewhat with the author, though I also think these are not problems unique to microservices:</p>
<blockquote>
<p>Say goodbye to DRY</p>
</blockquote>
<p>Somewhat yes, but mostly no! It depends on what is being duplicated and whether it can <em>really</em> be considered duplication. If it's knowledge of a domain concept that's being duplicated, then that's bad and usually an indication of incorrect boundaries. If it's the data <strong>contract</strong> on the provider and consumer ends, that's not really duplication.</p>
<p>This is also not a problem exclusive to service architectures: given a sufficiently large monolithic codebase (and depending on how well it's modularised), I can bet you can duplicate knowledge in a monolith as well, because in a rush to deliver, that's just how engineers behave. Granted, it might be easier to spot and remedy when all the code is in one place than when it's spread across multiple codebases, but then that's what you want to do even in a service architecture, i.e. combine logically related codebases to reduce knowledge duplication. Nothing about a many-service architecture stops you from combining services when you need to.</p>
<p>As a matter of fact, in my teams we're simplifying our many-service architecture into a smaller set of carefully combined services. <strong>Note: services are not going anywhere, they are just getting a little less…micro. We are still working to decompose our shared monolithic e-commerce database by defining ownership boundaries around business capabilities.</strong></p>
<p>When combining services is not really an option, creating packaged libraries for common functionality and pushing them to a central package registry for easier reuse is the next best thing.</p>
<blockquote>
<p>Developer ergonomics will crater</p>
</blockquote>
<p>Yep! For new joiners in a team, even with all the support, guidance and onboarding, knowing the whole landscape can be quite daunting. And yes, over time you build a solid mental model and can find the exact line of code in the exact service on the critical path with a 2-minute GitHub search, but it can be a long time before that happens.</p>
<p>Not to mention the time wasted just trying to get a service that doesn't change often up and running on an engineer's machine, because people forget things they don't look at and the environment changes underneath them.</p>
<p>But once again, having a monolith doesn't make it magically easier, especially if the monolith is sufficiently large. I would still need to make sure all the configuration for all the modules is set up to bring the system up locally, regardless of whether or not I need to touch that part of the system. With separate services, you only pay that cost for the module you need to work on. Of course, a lot depends on how the monolith is designed.</p>
<blockquote>
<p>Integration tests — LOL</p>
</blockquote>
<p>Yeah, kind of! But I would challenge this by saying that meaningful and fast integration testing in any sizeable organisation (think 40 different domains and 500+ engineers) long ago left the building. Integration testing, though useful, shouldn't be the only way we test our code, monolith or microservices, because unless you are building your own payment gateway or geocoding platform, even your monolith will have external dependencies. You can forget about being able to do reliable and fast integration testing.</p>
<p>I would hate to see your testing code if the only tests you ever have are opaque integration tests with complicated dependency setup. How would one even reason about those tests? And if I can't reason about them, I would probably disable or remove them, or they'd get flaky over time, in which case I'd have even less confidence to deploy changes. Having said that, the more dependencies you have (e.g. with microservices) the harder integration testing becomes, but it is equally true that the more integration tests you have, the harder they can be to maintain.</p>
<blockquote>
<p>"observability" is the new "debugging in production"</p>
</blockquote>
<p>Observability as a practice is not restricted to microservices or monoliths. It's just a sensible thing to do to get visibility into what the system is doing and how it is performing <strong>over time</strong>. It is essential for debugging production systems <em>(mono or micro)</em>. You can't step-through debug code in production <em>(though I have done it in the past with the <a href="proxy.php?url=https://learn.microsoft.com/en-us/visualstudio/debugger/remote-debugging?view=vs-2022">Visual Studio Remote Debugging</a> feature; back then it wasn't a nice experience)</em>. Even if you could debug that way, the problem may not always be replicable because the production environment is not 100% predictable, and that's why I rely on logs and metrics to observe the system's performance over time, create a direction for my debugging and understand its rhythm.</p>
<p>No integration test can give you the profile over time that good monitoring does, because integration tests are a snapshot in time. Production is where the software is really tested, so yes I do want good observability in order to understand my system and troubleshoot it effectively.</p>
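<p>To make the idea concrete, here is a minimal sketch of the kind of structured, timestamped log lines that make that over-time profile possible. The helper names and fields are illustrative, not any particular vendor's API:</p>

```typescript
// Minimal structured-logging sketch: each operation emits a JSON line
// with a timestamp and duration, so dashboards can chart behaviour over
// time instead of relying on a debugger. Names here are illustrative.
function logEvent(fields: Record<string, unknown>): string {
  const line = JSON.stringify({ timestamp: new Date().toISOString(), ...fields });
  console.log(line);
  return line;
}

// Wrap any unit of work so its duration is recorded, even on failure.
function timed<T>(operation: string, fn: () => T): T {
  const start = Date.now();
  try {
    return fn();
  } finally {
    logEvent({ operation, durationMs: Date.now() - start });
  }
}
```

<p>Aggregating these lines over days or weeks gives the profile a snapshot-in-time integration test never can.</p>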
<blockquote>
<p>What about just âservicesâ?</p>
</blockquote>
<p>Read on⊠<a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--sdPsxGXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f642.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--sdPsxGXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f642.png" alt="đ" width="72" height="72"></a></p>
<h3>
Services are about org design and business capabilities
</h3>
<p>Just because an org might not be planet scale doesnât mean they canât benefit from decomposing large systems into smaller ones to gain autonomy and resilience.</p>
<p>What an organisation should invest in is identifying how value flows through it and which people are empowered, and have the capability, to make decisions, and then drawing contextual boundaries around those groups. Creating a stable platform that minimises reinventing the wheel is also crucial as the org grows; otherwise the amount of rework and grunt work across multiple services alone will be a drag, and they will be writing about how microservices failed them.</p>
<p>The ideas in the <a href="proxy.php?url=https://teamtopologies.com/key-concepts">Team Topologies</a> book describe this kind of org design, which allows a better implementation of Conway's Law. Domain Driven Design talks about <a href="proxy.php?url=https://www.martinfowler.com/bliki/BoundedContext.html">bounded contexts</a> that create these <em>relatively</em> autonomous zones within an organisation that are loosely coupled from a functional and technical perspective.</p>
<p>Focusing on the flow of value and organisation design should result in sensibly sized services that are driven by domain boundaries instead of technical wet dreaming. Micro, nano, pico or…mega…is irrelevant, because any change in service granularity will (and should) be triggered by changes to the business value flow it delivers, so a service should be as big as it needs to be. Splitting services for its own sake, or combining services for its own sake (because you drank too much of the "nothing wrong with a monolith" kool-aid), is ill-informed cargo-culting. That's a surefire way to the madness the author is talking about.</p>
<h3>
How does one determine value flow and create better boundaries?
</h3>
<p>This needs its own post (or ten) so I will leave a cop-out list of other buzzwords to consider:</p>
<ul>
<li><a href="proxy.php?url=https://en.wikipedia.org/wiki/Value-stream_mapping">Value stream mapping</a></li>
<li>Big Picture Event Storming</li>
<li><a href="proxy.php?url=https://www.infoq.com/articles/ddd-contextmapping/">Context Mapping</a></li>
<li>Domain modeling</li>
</ul>
<p><strong>N.B.</strong> The initial boundaries you draw will probably be wrong, so be prepared to revisit and refactor them. You don't want to stick with ineffective boundaries for too long.</p>
<h3>
In Closing…
</h3>
<ul>
<li>Many-service architecture (I am not calling it microservices anymore) is definitely a scaling and optimisation pattern that shouldn't be applied haphazardly or lightly just because you think it puts you in the cool kid category. It adds complexity because of the many moving parts, increases the failure modes to consider and might even negatively impact system performance</li>
<li>Pay attention to business capabilities and ownership boundaries (i.e. bounded contexts) by identifying flow of value in the org</li>
<li>Create services in correspondence to the bounded contexts and be prepared to redraw the boundaries and rearchitect both ways, that is:
<ul>
<li>If you do this due diligence then you can even design a modular monolith to start with and split when actually needed, and</li>
<li>Armed with those insights you can even combine multiple services into fewer to align better with the contextual boundaries.</li>
</ul>
</li>
<li>Sometimes team reorganisation can cause reallocation of capabilities across portfolios; if you have a scruffy monolith, splitting out services to hand over will be harder than if you already had services.</li>
<li>You cannot have a loosely coupled services architecture if you are still sharing the monolithic database. If you are carving out services from the monolith, also take your data with you. Shared databases start out innocently enough when the org is small and simple, but they are like bear cubs: eventually they get bigger, scarier and toothier, and then they are no fun. Make breaking up the monolithic database a part of your engineering strategy</li>
<li>The organisation needs to have, or be willing to build, a certain level of engineering maturity and leadership to execute a successful many-service architecture evolution on top of a stable platform</li>
<li>A thoughtlessly designed monolith is just as bad as thoughtlessly designed microservices.</li>
</ul>
microservices
domaindrivendesign
boundedcontexts
conwayslaw
-
Our Next(JS) Webshop
Stef van Hooijdonk
Mon, 17 Jul 2023 07:48:00 +0000
https://dev.to/coolblue/our-nextjs-webshop-4nio
https://dev.to/coolblue/our-nextjs-webshop-4nio
<h2>
Ownership continued
</h2>
<p>In recent posts we have shared our views on <a href="proxy.php?url=https://dev.to/coolblue/guided-ownership-422j">ownership</a> and how we use that @ Coolblue to develop our software. We value ownership on a team level as also seen in re-designing / re-engineering our 'backoffice' <a href="proxy.php?url=https://dev.to/coolblue/the-monolith-in-the-room-1947">monolith</a>.</p>
<p>Since we last wrote about our <a href="proxy.php?url=https://dev.to/coolblue/tech-principles-coolblue-1a2k">Tech Principles</a>, we have actually added a Tech Principle to our collection that should help our teams understand ownership when we look at services, events and the connections between them, especially how we want teams to handle these interdependencies in relation to new development and maintenance. But that is something for another post (Soon ™).</p>
<p>This post will focus on our journey towards the "technical design" for our next implementation of our webshop (<a href="proxy.php?url=https://coolblue.nl" rel="noopener noreferrer">https://coolblue.nl</a>). Our current webshop codebase is considered a Monolith. As far as I can see in Github the first commit dates back to June 17 of 2010. Over 13 years ago. </p>
<p>Today we have about 10 to 15 development teams working on this single codebase to make our customers smile. Each one of those teams has a pretty specific goal. And as such have to work together in this single codebase. This makes efforts that touch on large parts of this code base, like an upgrade to the next version of PHP, a <em>shared</em> and a multi team problem. Our <a href="proxy.php?url=https://coolblue-blueprint.com" rel="noopener noreferrer">Design System</a> is tied to this same codebase.</p>
<p>Secondly we mainly use <a href="proxy.php?url=https://www.php.net" rel="noopener noreferrer">PHP</a> and <a href="proxy.php?url=https://twig.symfony.com" rel="noopener noreferrer">Twig</a> for the implementation of our current Webshop. Great technologies, but we want to leverage a more modern set to allow for more fluid user interactions and to keep and <a href="proxy.php?url=https://www.careersatcoolblue.com/tech/development/?query=front" rel="noopener noreferrer">attract talent</a>.</p>
<blockquote>
<p>Time for a change.</p>
</blockquote>
<h2>
From Monolith to ?
</h2>
<p>In the introduction I already mentioned a few of the concerns we have with our current Webshop implementation. Based on those, we decided to investigate how we could address them.</p>
<h3>
Team Scope versus Team Disciplines
</h3>
<p>Technology-wise, we are looking to "downsize" the solutions a development team owns, i.e. reduce their overall size. Size is still non-deterministic; it is a meta-unit composed of multiple factors: complexity, lines of code, number of services, number of classes and more. <br>
We want to move from a monolith towards multiple smaller solutions that match more closely what a team with a goal can manage, build, maintain and enhance to deliver more value for our business.</p>
<h3>
A piece of history
</h3>
<p>About 2 years ago we held our first brainstorm session "Webshop Tech vNext". </p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7o5nhxcpt6sqy0ip3km.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7o5nhxcpt6sqy0ip3km.png" alt="Frontend Futures" width="800" height="103"></a></p>
<p>One key decision we took back then was to start using React as the base for our back office applications and the design system for them.</p>
<p>About 9 months ago, we held another brainstorm session to look at options for our Webshop. At that time we noticed that NextJS, together with React, could be a replacement for our Webshop tech stack. </p>
<p>We also learned about microfrontends. I myself found the <a href="proxy.php?url=https://micro-frontends.org" rel="noopener noreferrer">explanation on this page</a> useful. We were already doing more and more with services and APIs in many of our other solutions, and the analogy made total sense to us, especially in light of our ideas around ownership.</p>
<h3>
Proof of Concept time
</h3>
<p>We set out to test a few topics to learn how these would work for us in a microfrontend world.</p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs63c1wq1xtvcriex8zy.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvs63c1wq1xtvcriex8zy.png" alt="microfrontend in our Tech Radar" width="583" height="261"></a></p>
<p>During the Proof of Concept (Q1-Q2 of 2023) we actually implemented a piece of the routing needed for microfrontends in our Coolblue.nl webshop production codebase, visible to and used by our developers only, of course:</p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqkqh7k411dluxs1icyo.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqkqh7k411dluxs1icyo.png" alt="Proof of Concept in Production" width="800" height="353"></a></p>
<p>We found no large blockers and we decided to move ahead.</p>
<h3>
Roadmap it
</h3>
<p>Rebuilding our Webshop is a tremendous effort, which is why we created a plan over the last few weeks and have started on this massive journey.</p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftoadjm5xdy76b6c70yrl.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftoadjm5xdy76b6c70yrl.png" alt="microfrontend Roadmap blurred 07-2023" width="500" height="202"></a></p>
<p>In a year, we hope to share more on our learnings going through this process. But looking at the plan, I am sure that if you are a customer of ours, you will have been served a page or two based on this new implementation. </p>
<h2>
Technologies that help us achieve this
</h2>
<p>If you are reading this post just to learn "What technology is Coolblue using for their webshop?", sorry to have kept you waiting so long.</p>
<ul>
<li>Requests (users) will be routed to the right microfrontend application with <a href="proxy.php?url=https://aws.amazon.com/lambda/edge/" rel="noopener noreferrer">AWS Lambda@Edge</a>
</li>
<li>The Application(s) will be hosted as <a href="proxy.php?url=https://aws.amazon.com/blogs/apn/serverless-containers-are-the-future-of-container-infrastructure/" rel="noopener noreferrer">Serverless Containers</a> with <a href="proxy.php?url=https://aws.amazon.com/fargate/" rel="noopener noreferrer">AWS Fargate</a>
</li>
<li>For our Front End we will use <a href="proxy.php?url=https://react.dev/learn/describing-the-ui" rel="noopener noreferrer">React</a> and <a href="proxy.php?url=https://www.typescriptlang.org" rel="noopener noreferrer">TypeScript</a>
</li>
</li>
<li>To serve our Front End we will rely on <a href="proxy.php?url=https://nextjs.org/learn/foundations/about-nextjs" rel="noopener noreferrer">NextJS</a> </li>
<li>Our background processing and services are built mostly with <a href="proxy.php?url=https://dotnet.microsoft.com/en-us/apps/aspnet/apis" rel="noopener noreferrer">C#</a> or NextJS/TypeScript. When needed, we will consolidate services via <a href="proxy.php?url=https://graphql.org" rel="noopener noreferrer">GraphQL</a>.</li>
</ul>
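<p>The first bullet can be sketched in a few lines of TypeScript: a path-prefix lookup of the kind an edge routing function performs to decide which microfrontend (or the legacy webshop) should serve a request. The route table, origin names and helper below are hypothetical, not Coolblue's actual configuration:</p>

```typescript
// Hypothetical route table mapping URL path prefixes to origins.
const routes: Array<{ prefix: string; origin: string }> = [
  { prefix: "/product", origin: "product-mfe.example.internal" },
  { prefix: "/checkout", origin: "checkout-mfe.example.internal" },
];
const fallbackOrigin = "webshop-monolith.example.internal";

// An edge function (e.g. a Lambda@Edge origin-request handler) could use
// something like this to choose the origin before the request is forwarded.
function resolveOrigin(uri: string): string {
  const match = routes.find((route) => uri.startsWith(route.prefix));
  return match ? match.origin : fallbackOrigin;
}
```

<p>Unmatched paths fall through to the existing monolith, which is what lets pages migrate to the new stack one route at a time.</p>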
nextjs
webdev
react
technology
-
Systems Thinking and Technical Debt
Aman Agrawal
Tue, 04 Apr 2023 07:00:00 +0000
https://dev.to/coolblue/systems-thinking-and-technical-debt-4d0o
https://dev.to/coolblue/systems-thinking-and-technical-debt-4d0o
<p>I repeatedly see business stakeholders and software engineers struggle to see eye to eye on matters of technical debt, despite the fact that both are impacted by it. I attribute this to the fact that the two camps speak different languages, and over the last 15-some-odd years I haven't found a silver bullet that gets 100% alignment. Engineers are driven by:</p>
<ul>
<li>Code complexity, maintainability and understandability</li>
<li>Making architecture more fault tolerant, resilient and quickly recoverable from outages</li>
<li>Keeping up with technological changes/staying on the cutting edge</li>
<li>Innate desire to improve software systems and not letting them rot</li>
</ul>
<p>Business folks are driven by:</p>
<ul>
<li>Investment vs return on that investment</li>
<li>Financial savings/profit</li>
<li>Time to market</li>
<li>Legal liability/other risk</li>
<li>Short term thinking and focus on features as opposed to long term outcomes</li>
</ul>
<p>It's like two people arguing with each other where each speaks a language the other doesn't understand! That's never going to work! This InfoQ <a href="proxy.php?url=https://www.infoq.com/articles/communicating-engineering-work-business/">article</a> looks under the hood of this communication gap between the two parties in more detail and makes some good recommendations; it's worth checking out.</p>
<p>The other problem that hinders alignment is the lack of a holistic understanding of how technical debt affects, or is affected by, business drivers, and of some way to visualise it. You can often sense this lack of understanding when a manager says, <em>"I don't really see the business value in addressing this technical debt; right now we have critical functional work to do, can we do this tech thingy later?"</em>. In this post I will use a simplified <a href="proxy.php?url=https://en.wikipedia.org/wiki/Systems_thinking#:~:text=Systems%20thinking%20is%20a%20way,complex%20contexts%2C%20enabling%20systems%20change.">Systems Thinking</a> modelling language to put technical debt in the larger organisational context, with the hope that it will make some sense to everyone.</p>
<h2>
Using Systems Thinking to Put Technical Debt in Context
</h2>
<p>I am going to take a crack at it by drawing a systems model using digital post-its connected by arrows (what else?). The post-its represent variables that can increase or decrease; a green arrow means a change in one variable results in a corresponding increase in another variable, and (introduced later) a red arrow means a change in one variable triggers a decrease in another.</p>
<p><strong><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--qj0VRxef--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/26a0.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--qj0VRxef--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/26a0.png" alt="⚠" width="72" height="72"></a>DISCLAIMER</strong>: these models are abstractions of real-life systems, so they are not meant to be 100% accurate but a useful approximation to help make sense of the complexities involved and connect them to the other parts of the organisational system.</p>
<p>For these models, I am going to use the following variables:</p>
<ul>
<li>Number of business problems to solve/solved</li>
<li>Amount of business value created (somewhat abstract, but let's say it's a measure of the usefulness of the solutions that help improve business outcomes)</li>
<li>Business success (EBITDA/revenue, new investment and expansions, new customer journeys, number of customers signed up, number of repeat customers, NPS what have you)</li>
<li>Business pressures (slow down in business success metrics creates pressure to do more)</li>
<li>Market forces (pandemic, war, supply chain issues, competitor action, economic turbulence etc)</li>
<li>Internal dynamics (org politics, reorganisation and restructuring, cost cutting, lawsuits, etc.). Along with market forces, this generally tends to push down an organisation's success.</li>
<li>Engineering velocity (roughly speaking, number of value add ideas productionised per cycle)</li>
<li>Engineering compromises (the number of shortcuts we take whilst productionising ideas)</li>
<li>Technical debt (well, I guess I don't need to explain this, or do I? <img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--zCyXRrdx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f642.png" alt="🙂" width="72" height="72">)</li>
<li>Engineer motivation and trust (mostly abstract, but I guess WTFs per minute can be a good metric <img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--fumfYCPq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://s0.wp.com/wp-content/mu-plugins/wpcom-smileys/twemoji/2/72x72/1f609.png" alt="😉" width="72" height="72">. In seriousness though, this erodes over time and can often be sensed when people abruptly leave, stop caring, or become very frustrated and challenging members of the team.)</li>
</ul>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/04/image-3.png"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--15TT3r3p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image-3.png%3Fw%3D389" alt="" width="389" height="204"></a></p>
<p>For the first diagram I am going to assume a perfect world where an organisation keeps going from strength to strength forever, and the engineering velocity keeps growing in tandem as well:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/04/image-2.png"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--2FYhIhXk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image-2.png%3Fw%3D1024" alt="" width="880" height="602"></a><br>
<em>In a perfect world, business success and engineering velocity will continue to increase infinitely</em></p>
<p>Business opportunities generate business problems to be solved. The more problems we solve, the more business value we generate, and the more the business succeeds. This means the pressure to succeed increases in the form of new revenue streams, new opportunities and new value streams, which places ever more demand on engineering velocity, which responds by solving these challenges and generating more business value in turn. The cycle just continues, resulting in an infinitely successful business and infinitely high engineering velocity with no technical debt whatsoever; it's essentially a runaway <em>positive feedback loop</em> in systems thinking terminology. Of course, this is living in Harry Potter land, with no relation to reality whatsoever! So let's descend to reality, shall we?</p>
<p>In her book <a href="proxy.php?url=https://www.amazon.com/Thinking-Systems-Donella-H-Meadows/dp/1603580557/ref=sr_1_2?crid=2G0VRLGPP9CI5&keywords=systems+thinking&qid=1680205424&sprefix=systems+thinking%2Caps%2C219&sr=8-2">Thinking in Systems</a>, Donella Meadows observes that:</p>
<blockquote>
<p>no physical system can grow forever in a finite environment.</p>
<p><cite>Meadows, Donella H.. Thinking in Systems</cite></p>
</blockquote>
<p>This is because an uncontrollably growing system will eventually tend towards instability and crash (the 2008 financial crisis is a glowing example of this runaway positive feedback loop; or, ever tried to bend a thin metal strip back and forth repeatedly until it snaps? Obvious, right?). In light of this constraint, we can see that our model is missing other variables that serve to constrain the system so it doesn't become a victim of its runaway success (or failure). So what would the picture look like with all these variables plugged in?</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/04/image-1.png"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--G88R1SXx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image-1.png%3Fw%3D1024" alt="" width="880" height="740"></a><br>
<em>In a more realistic world, we need other variables that constrain the system</em></p>
<p>Suddenly the complexity explodes!</p>
<p>As we solve more business problems, we add more value and the business succeeds more, which increases the pressure for sustained success, because… let's not get complacent, yes? Internal dynamics such as reorganisation and politics, and market forces such as pandemics, competitor actions and societal upheavals like wars and high inflation, push <em>against</em> business success. This creates even more business pressure to succeed, and pressure to increase engineering velocity to reduce time to market and gain competitive advantage.</p>
<p>Up to a certain point the velocity will grow organically, but since no system can grow forever, the high demand on engineering velocity will eventually result in more and more engineering compromises and shortcuts. These in turn increase the accumulated technical debt, which initially boosts velocity, but after enough of these iterations it starts to wear down engineer motivation and trust in the system and the team, as they struggle with past engineering compromises and, in the race to deliver faster, end up adding new compromises and debt on top of the existing ones. This also increases the maintenance cost of the software, and eventually it starts to slow down the engineering velocity. This means fewer business problems get solved: more of the org's investment in engineering goes towards just struggling with the technical debt rather than adding new value. This in turn results in that much less value being created overall, which will eventually start to reduce business success.</p>
<p>If left unchecked (in some cases this does happen), this cycle can also become a runaway feedback loop in the negative direction, where an org's engineering capabilities actively hinder its success rather than enhance it. This erosion of value creation doesn't happen overnight; it can take a long time (often years) to build up, but in the end it's as if the org is paying its engineering teams to actively sabotage it. That's horrifying! But since no system can grow or shrink forever, interventions will eventually be made to salvage the situation, which inevitably leads to <strong>Big Bang Rewrites</strong> of all the "legacy" systems. This creates its own problems (not represented in the diagram): for example, the time and money cost of the rewrite further erodes the business value proposition of the system, because no value is created until the first version of the "new" system goes live, leading to less business success, increasing costs and increasing management pressure to deliver successfully <em>this</em> time around. But since we want to go fast, we take shortcuts and make compromises, which starts the vicious cycle all over again, just in the "new" system.</p>
<p>Can this be considered a smart business strategy?</p>
<p>In systems of comparable complexity, <strong>refactoring</strong> a system gradually towards health and improved design is generally a lower-risk, faster-return investment than <strong>rewriting</strong> it from scratch (though in some cases the opposite is true). This is because much of the original investment (and knowledge) remains valid and can be preserved, and you are not rushed to finish the work for fear of blocking the creation of new value. The old and the new (if refactored carefully) can happily co-exist, with every iteration not only creating new value now but also reducing the technical debt we've accumulated along the way. The old can then eventually be decommissioned.</p>
<p>But I digress a bit. So what's the solution to minimise the runaway negative loop then? We don't want to pack down our business just because there are constraining forces at work, yes? How do we create a harmonious balance between the short-term advantage of debt and the long-term stability and resilience of the system? In finance, the bank's enforcement agents or government penalties will solve that problem for you real quick, but unfortunately for most enterprise software engineering, we don't have that level of "encouragement".</p>
<p>…</p>
<p>How about <strong>Engineering Discipline</strong>? Sounds obvious, but how does it fit in the model? Let's see:</p>
<p><a href="proxy.php?url=https://codequirksnrants.files.wordpress.com/2023/04/image.png"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--elUDUsn0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://codequirksnrants.files.wordpress.com/2023/04/image.png%3Fw%3D1024" alt="" width="880" height="717"></a><br>
<em>Introducing a little bit of discipline (top right corner-ish) can bring back some stability over time</em></p>
<p>When the velocity starts to drop and more engineering compromises are made to increase it (see the irony here?), engineering discipline can act as a compensating driver, ensuring that we reduce previous compromises before we add new value each cycle. This gradually increases engineer motivation and trust, as they no longer need to struggle with bad decisions as much, and that in turn increases engineering velocity and value creation over time. The increase in velocity may not come by leaps and bounds, and not right away, but at least it's likely not to fall too low in the face of ever-increasing business pressures, and the system won't spiral into madness.</p>
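<p>To make the loop concrete, here is a deliberately crude toy simulation. The coefficients and the <code>simulate</code> function are made up purely for illustration (they are not taken from the article's diagrams): each cycle, the gap between a fixed delivery target and current velocity produces shortcuts; shortcuts accumulate as debt; debt drags velocity down; and discipline spends a fraction of each cycle's capacity paying debt down first.</p>

```python
def simulate(cycles: int, discipline: float) -> list:
    """Toy causal-loop model. `discipline` is the fraction of each cycle's
    capacity spent reducing past compromises before adding new value."""
    velocity, debt, history = 10.0, 0.0, []
    for _ in range(cycles):
        shortcuts = max(0.0, 12.0 - velocity)       # pressure to hit a fixed target
        debt = max(0.0, debt + shortcuts - discipline * velocity)
        velocity = 10.0 / (1.0 + 0.1 * debt)        # debt drags delivery speed down
        history.append(velocity)
    return history
```

<p>With <code>discipline=0.0</code> the velocity decays cycle after cycle as debt compounds; with a modest <code>discipline=0.3</code> the debt is paid down as fast as it accrues and velocity stays stable, which is exactly the compensating effect described above.</p>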
<p>Like I said before, the model is not perfect (no model is), but I think (and hope) it helps put the complexities involved in perspective for both engineers and business stakeholders, and clarifies the effect that long-term accumulation of technical debt can have on business outcomes.</p>
<p><strong>Engineering Discipline</strong> is critical to bringing the system back into equilibrium, and this is why it's important for engineering teams to take control and ownership of this discipline and be proactively on the lookout for variable changes that tend to push the system towards instability. We don't need permission to do the right thing; we have the engineering expertise and experience to know when and how the right thing should be done, because we also understand the long-term implications of neglect. Though we do need to communicate these implications to the business in a language they understand, to the extent that's feasible and possible.</p>
<p>If you have tried similar tools to communicate the value of addressing technical debt, or if you think this model could be made more convincing or more "correct", please drop a comment! Cheers!</p>
systemsthinking
communication
technicaldebt
models
-
Accessibility
Stef van Hooijdonk
Mon, 05 Dec 2022 09:12:00 +0000
https://dev.to/coolblue/accessibility-4k73
https://dev.to/coolblue/accessibility-4k73
<p>Web accessibility, a.k.a. a11y, refers to the universal ability of different users and devices to access the content and features of a website, regardless of physical or cognitive ability.</p>
<p>In other words, web accessibility ensures that everyone can successfully use a website, including users who are blind, color blind, deaf or hard of hearing, as well as users who have difficulty using their hands or have other disabilities.</p>
<h3>
What does it mean for Coolblue
</h3>
<p>We want to adhere to the <a href="proxy.php?url=https://www.w3.org/TR/WCAG21/" rel="noopener noreferrer">W3Câs Web Content Accessibility Guidelines (WCAG) 2.1</a></p>
<p>You can find a comprehensive checklist in Coolblue design system guidelines (internal).</p>
<h3>
Definition of done
</h3>
<p>Accessibility is considered to be part of your team's definition of done.</p>
technology
development
web
a11y
-
Security
Stef van Hooijdonk
Thu, 01 Dec 2022 09:30:00 +0000
https://dev.to/coolblue/security-4ab6
https://dev.to/coolblue/security-4ab6
<p>With regards to security, it is always better to reuse proven methods than to reinvent the wheel. Therefore these principles are based on the best practices used by the <a href="proxy.php?url=https://infosec.mozilla.org/fundamentals/security_principles.html" rel="noopener noreferrer">Mozilla Foundation</a>. Where applicable, these have been adapted or expanded to align with the other Coolblue Principles and Core Values.</p>
<blockquote>
<p>The "do" and "do not" used in this document are examples of controls or implementations of these principles, but do not represent an exhaustive list of possibilities. When in doubt, verify whether your application, service or product aligns with the goal of the principles.</p>
</blockquote>
<h3>
Least Privilege
</h3>
<p>Do not expose unnecessary services</p>
<p>Goal: Limiting the number of reachable or usable services to the necessary minimum.</p>
<p><strong>Do</strong></p>
<ul>
<li>List all services presented to the network (Internet and Intranets). Justify the presence of each port or service.</li>
</ul>
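<p>As a minimal sketch of the "list all services presented to the network" step, the snippet below probes a host for listening TCP ports using only the standard library. It is an illustration, not a replacement for a proper inventory tool such as <code>ss</code>, <code>nmap</code> or your cloud provider's config auditing; every port it reports should have a justified owner.</p>

```python
import socket

def open_ports(host: str, ports: range, timeout: float = 0.2) -> list:
    """Return the TCP ports in `ports` that accept a connection on `host`.
    Each hit is a service someone must be able to justify."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:  # 0 means the connect succeeded
                found.append(port)
    return found
```

<p>Running this against your own hosts from both the Internet and the intranet makes the difference between "exposed" and "intended" visible at a glance.</p>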
<p><strong>Do not</strong></p>
<ul>
<li>OpenSSH Server (sshd) is running but no users ever login.</li>
<li>A web-application has a web accessible administration interface, but it is not used.</li>
<li>A database server (SQL) allows connections from any machine in the same VLAN, even though only a single machine needs to access it.</li>
<li>The administration login panel of the network switch for the office network is accessible by users of the office network.</li>
</ul>
<h3>
Do not grant or retain permissions that are no longer needed
</h3>
<p>Goal: Expire user access to data or services when users no longer need them.</p>
<p><strong>Do</strong></p>
<ul>
<li>Use role-based access control (allows for easy granular escalation of privileges, only when necessary)</li>
<li>Expire access automatically when unused.</li>
<li>Automatically disable API keys after not having been used for a given period of time and notify the user.</li>
<li>Use different accounts for different role types (admin, developer, user, etc.) when no good role-based access control is available.</li>
<li>Routinely review userâs access permissions to ensure theyâre still needed.</li>
</ul>
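<p>The "expire access automatically when unused" and "disable API keys after a period of inactivity" items above can be sketched as a simple sweep over last-used timestamps. The function name and the 90-day default are illustrative assumptions, not a Coolblue standard:</p>

```python
from datetime import datetime, timedelta

def stale_keys(last_used: dict, now: datetime,
               max_idle: timedelta = timedelta(days=90)) -> list:
    """Return IDs of API keys not used within `max_idle`; a scheduled job
    would disable these and notify their owners."""
    return sorted(k for k, t in last_used.items() if now - t > max_idle)
```

<p>Run on a schedule, this turns "routinely review access" from a calendar reminder into an automated control.</p>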
<p><strong>Do not</strong></p>
<ul>
<li>Grant global root access (e.g. via âsudoâ) for all operation engineers on all systems.</li>
<li>Give access âjust in caseâ.</li>
<li>Retain access to services that you no longer use.</li>
</ul>
<h3>
Defense in Depth
</h3>
<p>Do not allow lateral movement</p>
<p>Goal: Make it difficult or impossible for an attacker to move from one host in the network to another host.</p>
<p><strong>Do</strong></p>
<ul>
<li>Prevent inbound network access to services on a host from clients that do not need access to the service through either host-based firewall rules, network firewall rules/AWS security groups, or both (which is preferred).</li>
<li>Clearly enforce which teams have access to which set of systems.</li>
<li>Alert on network flows being established between different services.</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Allow inbound OpenSSH, RDP connections from any host on any network.</li>
<li>Run unpatched container management services (e.g. Docker) or kernels which allow a user in one container to escape the container and affect other containers on the same host.</li>
</ul>
<h3>
Isolate environments
</h3>
<p>Goal: Separating infrastructure and services from each other in order to limit the impact of a security breach.</p>
<p><strong>Do</strong></p>
<ul>
<li>In cases where two distinct systems are used to govern access or authorization (e.g. AD and Okta), ensure that no single user or role has administrative permissions across both systems.</li>
<li>Use separate sets of credentials for different environments.</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Have system administrators with access to every system/every service.</li>
<li>Establish service users with access to multiple services.</li>
<li>Allow tools to remotely execute code on systems from a centralized location (a single Puppet Master, Ansible Tower, Nagios, etc. instance) across multiple services.</li>
<li>Re-use functionality across services when not required (such as sharing load balancers, databases, etc.)</li>
</ul>
<h3>
Patch Systems
</h3>
<p>Goal: Ensuring systems and software do not retain vulnerabilities as these are discovered in software over time.</p>
<p><strong>Do</strong></p>
<ul>
<li>Establish regular recurring maintenance windows in which to patch software.</li>
<li>Ensure individual systems can be turned off and back on without affecting service availability.</li>
<li>Enable automatic patching where possible.</li>
<li>Check web application libraries and dependencies for vulnerabilities.</li>
</ul>
<h3>
Meet Web Standards
</h3>
<p>Goal: Reduce exposure to web attacks by following the web security standards.</p>
<p><strong>Do</strong></p>
<ul>
<li>Achieve A or higher on Mozilla's Observatory. (A deviation from the Mozilla standards, which require a B+ or higher rating.)</li>
<li>Follow the Web Security Standards.</li>
</ul>
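<p>A large part of an Observatory-style grade comes down to response headers. The sketch below checks a response against an illustrative subset of the headers such scans look for; the constant name and the exact header set are my own choices for the example, not the full Observatory rubric:</p>

```python
# An illustrative subset of the headers Observatory-style scans check for.
REQUIRED_HEADERS = {
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
}

def missing_security_headers(response_headers: dict) -> set:
    """Compare a response's headers (any casing) against the required set."""
    present = {name.title() for name in response_headers}
    return REQUIRED_HEADERS - present
```

<p>Wiring this into a smoke test keeps a deploy from silently dropping a header and downgrading the site's rating.</p>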
<h3>
Guarantee data integrity and confidentiality
</h3>
<p>Goal: Ensuring data confidentiality, integrity, and authenticity is respected throughout its lifecycle.</p>
<p>Details on confidentiality, integrity & availability can be found below under Explanations & Rationales.</p>
<p><strong>Do</strong></p>
<ul>
<li>Use full-disk encryption where available on systems without physical security (laptops and mobile phones).</li>
<li>Encrypt credentials storage databases (Ansible Vault, Credstash, etc.)</li>
<li>Encrypt data in transit with TLS (during transmission).</li>
<li>Also encrypt data in transit inside the internal network.</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Terminate TLS (e.g. with a reverse proxy or load balancer) outside a system and then transmit the data in clear-text across the rest of the network.</li>
<li>Use STARTTLS without also disabling clear-text connections.</li>
</ul>
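<p>For the "encrypt data in transit" items above, Python's standard library already gives a safe starting point. The helper below is a sketch of a client-side TLS context that keeps certificate and hostname verification on and refuses pre-1.2 protocol versions; the function name is mine, the <code>ssl</code> calls are standard:</p>

```python
import ssl

def strict_tls_context() -> ssl.SSLContext:
    """Client-side context: certificate and hostname verification stay on
    (the create_default_context defaults), legacy TLS versions are refused."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

<p>The key point is to never loosen the defaults (e.g. by setting <code>verify_mode = ssl.CERT_NONE</code>) for internal traffic: the "also encrypt inside the internal network" rule means internal hops get the same context as external ones.</p>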
<h3>
Know Thy System
</h3>
<p>Fraud detection and forensics<br>
Goal: Inspect events in real-time in order to alert on suspicious behavior, and store system behavior information in order to retrace actions after a security breach.</p>
<p><strong>Do</strong></p>
<ul>
<li>Audit and log system calls (e.g. with auditd or Windows Audit) made by processes when running in an operating system you control (e.g. not AWS Lambda)</li>
<li>Send logs off the account or system (e.g. AWS CloudTrail, system logs, etc.) outside of the account or system (different AWS account, MozDef, Papertrail, etc.)</li>
<li>Detect and alert on anomalous changes.</li>
</ul>
<h3>
Are you at risk?
</h3>
<p>Goal: Assessing how exposed you are to danger, harm or loss.</p>
<p><strong>Do</strong></p>
<ul>
<li>Run Rapid Risk Assessments (RRA) for your services. </li>
<li>Estimate what would be the impact if your service was compromised.</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Think it will never happen to you.</li>
</ul>
<h3>
Inventory the Landscape
</h3>
<p>Goal: Provide an accurate, maintained catalog, or system of records for all assets.</p>
<p><strong>Do</strong></p>
<ul>
<li>Keep an inventory of services and service owners.</li>
<li>Keep an inventory of machines (e.g. ServiceNow, AWS Config, Infoblox, etc.) which is updated automatically.</li>
<li>Ensure that the inventory contains IP addresses of systems in particular when using IPv6 (which cannot realistically be scanned).</li>
<li>Never rely upon security through obscurity</li>
</ul>
<p><strong>Coolblue addition to Mozilla principles</strong></p>
<h3>
No security by obscurity
</h3>
<p>Goal: To prevent substituting secrecy for real security.</p>
<p><strong>Do</strong></p>
<ul>
<li>Always assume an attacker with perfect knowledge</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Rely on trust as a security measure</li>
</ul>
<h3>
KISS - Keep It Simple and thus Secure
</h3>
<p>Goal: KISS comes from "Keep It Simple, Stupid". You can only secure a system that you completely understand.</p>
<p><strong>Do</strong></p>
<ul>
<li>Keep things simple. Prefer simplicity over a complex and specific architecture.</li>
<li>Ensure others can understand the design.</li>
<li>Use standardized tooling that others already know how to use.</li>
<li>Draw high-level data flow diagrams.</li>
<li>See also Code Clean & Simple.</li>
</ul>
<h2>
Authentication and authorization
</h2>
<h3>
Require two-factor authentication
</h3>
<p>Goal: Require 2FA (or MFA) on all services, internal or external, to prevent attackers from reusing or guessing a single credential such as a password.</p>
<p>MFA (multi-factor authentication, also called 2FA for two factors) is a method of confirming a user's claimed identity by utilising a combination of two different components, such as something you know (a password) and something you have (a phone).</p>
<p><strong>Do</strong></p>
<ul>
<li>Use an SSO (Single Sign-On) solution with MFA.</li>
<li>For services that cannot support SSO, use the service's individual MFA features (e.g. GitHub).</li>
<li>Servers carrying secrets or widespread access (or any other potentially sensitive data) should verify the user's identity end to end, such as by prompting for an additional MFA verification when connecting to the server, even when behind a bastion host.</li>
</ul>
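<p>The "something you have" factor in most MFA apps is a TOTP code. As background, here is a minimal standard-library implementation of RFC 6238 (HMAC-SHA1 over the 30-second time-step counter with dynamic truncation); it is a sketch for understanding, not what any particular SSO product runs:</p>

```python
import hashlib
import hmac
import struct
import time
from typing import Optional

def totp(secret: bytes, for_time: Optional[int] = None,
         step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HMAC-SHA1 over the time-step counter, dynamically
    truncated to a short decimal code."""
    if for_time is None:
        for_time = int(time.time())
    counter = struct.pack(">Q", for_time // step)       # 8-byte big-endian
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                          # dynamic truncation
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

<p>Because both sides derive the code from a shared secret plus the clock, nothing secret crosses the wire at login time; only the short-lived code does.</p>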
<h3>
Use central identity management (Single Sign-On)
</h3>
<p>Goal: Minimize credential theft and identity mismanagement by limiting the handling of user credentials (passwords, MFA) to a set of dedicated systems.</p>
<p><strong>Do</strong></p>
<ul>
<li>Use an SSO (Single Sign-On) solution that authenticates user credentials on your service's behalf. Within Coolblue we strongly encourage the usage of SAML.</li>
<li>Servers update their user sessions from the SSO systems regularly to ensure the user is still active and valid.</li>
<li>Use authorization (e.g. group membership) data from the SSO system (possibly in addition to your own authorization data).</li>
</ul>
<p><strong>Do not</strong></p>
<ul>
<li>Accept, process, transmit or store user credentials (passwords, OTPs, keys, etc.). Let the authentication server handle that data directly.</li>
<li>Use direct LDAP authentication for users.</li>
</ul>
<p>More information about Centralized User Account Management can be found below in the Explanations & Rationales section.</p>
<h3>
Require strong authentication
</h3>
<p>Goal: Use credential-based authentication and user session management to grant access.</p>
<p>More information about Shared Passwords & Password Reuse can be found below in the Explanations & Rationales section.</p>
<p><strong>Do</strong></p>
<ul>
<li>Use credential-based authentication and user session management where the session information is passed by the user. More info.</li>
<li>Use API keys for service authentication.</li>
<li>Prefer using asymmetric API keys with request signing (e.g. x509 client certificates, AWS Signature) over symmetric API keys (e.g. HTTP header) where possible.</li>
<li>Ensure that API keys can be automatically rotated in the case of a data leak.</li>
<li>Use a password manager to store distinct passwords for each service a user accesses.</li>
<li>Use purpose-built credential sharing mechanisms when sharing is required (1password for teams, LastPass, etc.)</li>
</ul>
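<p>The "API keys with request signing" item above can be sketched with a symmetric HMAC signature over the parts of a request that must not change in transit. The message layout (method, path, body joined by newlines) is a simplified stand-in for real schemes like AWS Signature, which canonicalise far more of the request:</p>

```python
import hashlib
import hmac

def sign_request(secret: bytes, method: str, path: str, body: bytes) -> str:
    """Sign the request parts that must not be tampered with."""
    message = b"\n".join([method.encode(), path.encode(), body])
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify_request(secret: bytes, method: str, path: str, body: bytes,
                   signature: str) -> bool:
    """Constant-time comparison, so the check itself leaks no timing info."""
    expected = sign_request(secret, method, path, body)
    return hmac.compare_digest(expected, signature)
```

<p>A tampered body produces a different signature, so the server can reject the request without ever seeing the client's key on the wire.</p>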
<p><strong>Do not</strong></p>
<ul>
<li>Use easy to guess passwords or vendor default passwords.</li>
<li>Send your password to other individuals.</li>
<li>Send shared passwords over email or communication mediums other than purpose-built credential sharing mechanisms.</li>
<li>Use the same password for multiple services.</li>
<li>Trust traffic from a certain network address.</li>
<li>Rely on VLANs or AWS VPCs to indicate requests are safe.</li>
<li>Use IP ACLs as replacement for authentication.</li>
<li>Trust the office network for access to devices.</li>
<li>Use TCP Wrapper for access control.</li>
<li>Use machine API keys for user authentication.</li>
<li>Use user credentials for machine authentication.</li>
<li>Store API keys on devices that are not physically secure (e.g. laptops or mobile phones)</li>
<li>Always verify, never trust</li>
</ul>
<p><strong>Coolblue addition to Mozilla principles</strong></p>
<h3>
Zero Trust
</h3>
<p>Goal: Many security problems are caused by inserting malicious intermediaries into communication paths. Zero trust applies to all actors and IAAA (Identification, Authentication, Authorization and Accountability).</p>
<p><strong>Do</strong></p>
<ul>
<li>Deny by Default</li>
<li>Authenticate every transaction</li>
<li>Use allow lists instead of block lists</li>
</ul>
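<p>"Deny by default" and "use allow lists instead of block lists" boil down to one pattern: authorization passes only for things explicitly listed. A deliberately tiny sketch (the role names are made up for illustration):</p>

```python
ALLOWED_ROLES = {"supply-planner", "buyer"}  # allow list: anything absent is denied

def authorize(role: str) -> bool:
    """Deny by default: unknown or misspelled roles never slip through."""
    return role in ALLOWED_ROLES
```

<p>The inverse (a block list) fails open whenever a new role appears, which is exactly what zero trust forbids.</p>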
<blockquote>
<p>NOTE: The "do" and "do not" used in this document are examples of controls or implementations of the principles, but do not represent an exhaustive list of possibilities. When in doubt, verify whether your application, service or product aligns with the goal of the principles. Suggestions for do's and don'ts can be sent to <a href="proxy.php?url=mailto:[email protected]">[email protected]</a>.</p>
</blockquote>
<h2>
Explanations & Rationales
</h2>
<h3>
Confidentiality
</h3>
<p>The confidentiality of the data depends on the type of data that needs to be protected. Personally Identifiable Information (PII) such as names, addresses or e-mail addresses needs a higher level of protection than publicly available data (e.g. product information). Within Coolblue we use 4 levels of confidentiality:</p>
<p><strong>Secret</strong> data should only be accessible by a very limited number of users. Examples of secret data are admin/root passwords, business strategy plans, API keys and our password databases such as Active Directory or PasswordState.</p>
<p><strong>Confidential</strong> data should only be viewed by authorized persons. Examples of confidential data are Personally Identifiable Information, order history and contracts.</p>
<p><strong>Restricted</strong> data can be freely shared within the company, but not outside it. Examples of restricted data are the VVV, internal processes and demos.</p>
<p><strong>Public</strong> data can be accessed by the whole world. Examples of public data are the product information on our website, marketing ads and our vacancies.</p>
<h3>
Integrity
</h3>
<p>The integrity of data is the assurance of accuracy and consistency of the data over its entire lifecycle. A high level of integrity indicates that changes to this data should be verified and the correctness is very important. A low level of integrity indicates that changes to this data don't matter for the outcome of a process. Within Coolblue we use 3 levels of Integrity:</p>
<p><strong>High Integrity</strong> data should always be subject to a 4-eye principle before a change can be made. If technically feasible, integrity checks (e.g. hashes or checksums) should be implemented to verify the data after a transaction has taken place. Examples of high integrity data are our source code, product pricing and financial reporting data.</p>
<p><strong>Medium Integrity</strong> data should only be changed by an authorized person. An authentication mechanism should be in place before changes to this type of data can be made. Examples of medium integrity data are our processes, Google Drive documents and data on our website.</p>
<p><strong>Low Integrity</strong> data is data for which it doesn't really matter if it is changed by an unauthorized person. Examples are an order list for coffee for your team or other volatile information.</p>
<p>It is important to state that Data Integrity is not the same as Data Quality. Data quality in general refers to whether data is useful. Data integrity, by contrast, refers to whether data is trustworthy.</p>
<h3>
Availability
</h3>
<p>The availability of data is important to ensure timely and reliable access to information and systems. For example, the Coolblue website requires high availability because our customers want to be able to place an order 24/7. Next to that, we want to keep our promise of next-day delivery, so the systems & data needed for these processes also need high availability.</p>
<p>To determine the required availability of data or systems, the following parameters need to be determined:</p>
<ul>
<li><p>Recovery Point Objective (RPO) - the amount of data, as a measure of time, we are willing to lose during a recovery event.</p></li>
<li><p>Maximum Tolerable Downtime (MTD) - the amount of time we can be without the unavailable asset before we have to declare a disaster and put our DR plan into effect.</p></li>
<li><p>Recovery Time Objective (RTO) - the earliest possible time by which we can restore the asset to full functionality, if everything goes as planned and nothing else goes wrong.</p></li>
<li><p>Work Recovery Time (WRT) - the maximum amount of time needed to verify that files, services or processes have been restored correctly and are available for use.</p></li>
</ul>
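<p>These four parameters relate by simple arithmetic: the time to restore (RTO) plus the time to verify the restore (WRT) must fit inside the maximum tolerable downtime (MTD). A one-line check, with the helper name chosen for this example:</p>

```python
from datetime import timedelta

def within_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """Full recovery (restore + verification) must fit inside the MTD."""
    return rto + wrt <= mtd
```

<p>For instance, an RTO of 2 hours and a WRT of 1 hour are only acceptable if the business can tolerate at least 3 hours of downtime.</p>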
<p>Because an image is better than a thousand (or in this case 107) words:</p>
<p>It is important to be very critical when determining these parameters. Does the system really need to be restored within 2 hours, or can business be resumed without the system being fully functional (e.g. in a limited capacity or with manual workarounds)?</p>
<p>Based on the outcome of these parameters, measures should be taken to ensure that in the case of a disaster the determined timelines can be met.</p>
<h3>
Multi-Factor Authentication (MFA)
</h3>
<p>Multi-factor authentication (MFA) is a security system that requires more than one method of authentication from independent categories of credentials to verify the user's identity for a login or other transaction.</p>
<p>Requiring the use of MFA for internet accessible endpoints is encouraged because by requiring not only something the user knows (a knowledge factor like a memorized password) but also something that the user has (a possession factor like a smartcard, yubikey or mobile phone) the field of threat actors that could compromise the account is reduced to actors with physical access to the user.</p>
<p>In cases where the possession factor is digital (a secret stored in your mobile phone) instead of physical (a smartcard or yubikey), the effect of MFA is not to reduce the field of threat actors to only those that have physical access to the user, because a secret can be remotely copied off of a compromised mobile phone. Instead, in this case, the possession factor merely makes it more difficult for the threat actor since they now need to brute force/guess your password and compromise your mobile phone. This is, however, still possible to do entirely from a remote location. In particular, storing both first and second factor on the same device (for example: mobile phone) is strongly discouraged.</p>
<h3>
Shared Passwords & Accounts
</h3>
<p>Shared passwords are passwords and/or accounts that more than one person knows or has access to.</p>
<p>Usage of these types of accounts is discouraged because they make auditing access difficult:</p>
<ul>
<li>Multiple users appear in audit logs as one user, and different users' actions are difficult to differentiate.</li>
<li>The number of audit logs that need to be searched increases.</li>
<li>Correlation of events across different systems is impossible if multiple people are creating event records with a single shared account across multiple systems at the same time.</li>
</ul>
<p>Furthermore, revoking access to a subset of the users of a shared password requires a password change that affects all users.</p>
<h3>
Password Reuse
</h3>
<p>Password reuse is the practice of a single user using the same password across multiple different accounts/sites. This is contrasted with creating a different, distinct password for every account/site. Users often employ hybrid forms of password reuse like:</p>
<ul>
<li>Using the same password for a class of accounts/sites, for example, using one single password for multiple high value financial accounts, but a different single password for multiple low value forums and wikis.</li>
<li>Using a consistent, reproducible method of password generation for each site, for example, every account/site has a password which begins with the same characters and ends with the name of the site ("rosebud0facebook", "rosebud0linkedin")</li>
</ul>
<p>Password reuse is discouraged because:</p>
<p>When a site is compromised by an attacker, the attacker can easily take the user's password that has been reused on other sites and gain access to those other sites. For example, if a user uses the same password on a car forum website as on Facebook, when that car website gets compromised, the attackers can then take over the user's Facebook account.<br>
Unethical administrators of any site where a password is reused may gain access to accounts using the reused password.</p>
<blockquote>
<p>Note that it is dangerous for a user to rely on a site being able to effectively prevent an attacker from obtaining that user's password once an attacker has compromised the site.</p>
</blockquote>
<p>Since it's difficult/impossible for a user to memorize a distinct password for every account/site, a common solution is to use a password manager.</p>
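<p>The core idea of a password manager can be sketched in a few lines of Python: generate a distinct, cryptographically random password per account/site and store it, so no password is ever reused. This is a toy illustration only, not a production password manager (it omits encryption of the vault, for a start):</p>

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate_password(length: int = 20) -> str:
    """Return a cryptographically random password."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

# One distinct password per site: a breach of the car forum
# reveals nothing about the Facebook credential.
vault = {site: generate_password() for site in ("car-forum.example", "facebook.com")}
```

<p>A real password manager additionally encrypts this vault with a single master password, which is then the only password the user has to memorise.</p>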
principles
development
security
-
Surviving Black Friday With Kubernetes
Piotr Zakrzewski
Mon, 28 Nov 2022 16:22:02 +0000
https://dev.to/coolblue/surviving-black-friday-with-kubernetes-2o7o
https://dev.to/coolblue/surviving-black-friday-with-kubernetes-2o7o
<p>One of the most classic use cases for Kubernetes is horizontal auto-scaling: the ability of your system to increase its resources automatically in reaction to higher demand by adding new machines. The need for auto-scaling is not just about reducing the need to accurately project load and the manual work of adjusting hardware. It can also be existential, as in the case of my team at Coolblue (a Dutch web shop present in the Benelux and Germany), when the marketing emails with Black Friday discounts start reaching customers, causing an enormous peak in traffic. For some types of workloads, like optimising delivery routes (a Vehicle Routing Problem with additional constraints, a more complex version of the famous Travelling Salesman Problem), CPU demand grows very fast. It can not only dominate your infrastructure spending with over-provisioned hardware for handling the peak; in the worst-case scenario, without horizontal auto-scaling, you can be surprised by a Black Friday peak bigger than forecasted and experience an outage at the most important time of the year for the business.</p>
<h2>
When do you need auto-scaling?
</h2>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--9ju2ZpXZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ahcsfw4gwylxbk0d1fhy.jpeg" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--9ju2ZpXZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ahcsfw4gwylxbk0d1fhy.jpeg" alt="jaws meme" width="800" height="489"></a></p>
<ul>
<li>Your traffic varies over time </li>
<li>Your traffic is hard to predict or its impact on the CPU and memory is hard to determine</li>
</ul>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--duB_9Fzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lz4p1m7h867qauzc77qr.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--duB_9Fzk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lz4p1m7h867qauzc77qr.png" alt="diagram showing annual peaks" width="880" height="230"></a><br>
In the picture above Coolblue fits the right-most chart: Q4 is the busiest season for e-commerce. Sales forecasting provides a good basis for preparation, but may still leave a huge range of possible hardware specs, leading to over-provisioning hardware and, of course, over-paying. Which brings me to the next point.</p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--29_rtCk5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/biyivxxm4qilis61772o.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--29_rtCk5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/biyivxxm4qilis61772o.png" alt="diagram showing daily peaks" width="880" height="260"></a><br>
Load varying through the year is much easier to plan for. A big daily range of load is harder to cope with: just like the annual peak it leads to over-provisioning, but unlike over-provisioning on the scale of a year, you cannot scale down and adapt after each daily peak. Not when your operations are manual, that is. Not to mention that in the scenario with a high daily peak, despite over-provisioning, you are still left vulnerable to an unexpectedly big peak in traffic.</p>
<p>The simplest architecture for most web applications (or their back ends) is a vertically scaled single-host deployment. When you start maxing it out, you bump the specs. That simplicity of course comes at a cost: if it really is just a single host, you need downtime to add more CPU and RAM.</p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--QGtXtr8s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yv7ccw6kzy8ff6ko2qc.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--QGtXtr8s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yv7ccw6kzy8ff6ko2qc.png" alt="Vertical scaling" width="880" height="439"></a> </p>
<h2>
At a certain point there isn't any bigger boat for you
</h2>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--UBRL04vW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b02drz9axn1r5r441n4w.jpeg" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--UBRL04vW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b02drz9axn1r5r441n4w.jpeg" alt="Jaws meme" width="610" height="410"></a> </p>
<p>A more complex architecture that lets you increase capacity without bumping the specs of individual machines (and causing their downtime) is horizontal scaling. In this architecture you have more than one instance of your application, and requests are handled by whichever of those machines is currently available.</p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--SNl-0JWa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhtrh8tsorhnv4x67jdo.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--SNl-0JWa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhtrh8tsorhnv4x67jdo.png" alt="request serving with a stateless application" width="880" height="510"></a></p>
<p>There is a catch: your application must be stateless for this architecture to work this way. If it is stateful, the complexity of routing requests grows further, as you need to ensure the right instance receives its own requests.</p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--XXVLwwEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uaoawt344p4igigviaj1.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--XXVLwwEe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uaoawt344p4igigviaj1.png" alt="Diagram presenting stateless application" width="880" height="687"></a></p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--5nCfBgV6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z2j78v9gorqki2ci6zhx.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--5nCfBgV6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z2j78v9gorqki2ci6zhx.png" alt="Diagram presenting stateful application" width="880" height="684"></a></p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--04wuode_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ochobvw5tmbmwnw82b0v.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--04wuode_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ochobvw5tmbmwnw82b0v.png" alt="Diagram presenting requests routed with a load balancer for a stateful application" width="880" height="564"></a></p>
<h2>
Our Delivery Route Optimisation is Stateful and Needs to Scale for Annual and Daily Peaks
</h2>
<p>This is the case for the package delivery route optimisation service at Coolblue, for which my team is responsible. In order to cope with the Black Friday peak we needed to provision a 128-CPU VM which, at the time of writing, is the biggest machine available in GCP in our region that is compatible with our software. We can only go horizontal from here.</p>
<p>Optimising delivery routes, on top of being famously CPU intensive, also leads to stateful architectures, as any optimisation job needs information about all delivery orders, vehicles, shifts and other constraints in memory at every step of the optimisation process. This data is big enough that fetching it from the database on each mutation would reduce system performance, and mutations happen multiple times a second, both as a result of the ongoing optimisation process and of incoming requests to schedule new orders.</p>
<h2>
Are We Ready for Black Friday 2023?
</h2>
<p>This year only one stateless microservice, responsible for calculating travel time between deliveries, runs in Kubernetes. Last year it was the one that gave in during the peak, causing a partial outage; this year it scales by itself under the same circumstances, not only avoiding an outage but also avoiding over-provisioning during quiet times (all our customers are in the same time zone).</p>
<p><a href="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--VI4Ravjc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpuxiefz3uebkjdzz2wr.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://res.cloudinary.com/practicaldev/image/fetch/s--VI4Ravjc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lpuxiefz3uebkjdzz2wr.png" alt="Diagram presenting how hardware adapts to peaks with kubernetes" width="880" height="542"></a></p>
<h2>
How exactly is Kubernetes helping here?
</h2>
<p>For our stateless microservice, the k8s setup is pretty simple and standard. The most important resource types we use:</p>
<ul>
<li>
<code>Deployment</code> to specify the config of the travel-time service itself (env vars, hardware required to run a single instance, etc.)</li>
<li>
<code>Service</code> to specify its internal endpoint where other components can reach it</li>
<li>
<code>HorizontalPodAutoscaler</code> where we specify the minimum and maximum number of replicas (app instances) that can be provisioned and, most importantly, the CPU utilisation threshold that k8s uses to trigger the horizontal scaling</li>
</ul>
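<p>For illustration, a minimal <code>HorizontalPodAutoscaler</code> manifest along these lines could look as follows; the service name, replica bounds and threshold here are made-up examples, not our production values:</p>

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: travel-time-service      # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: travel-time-service    # the Deployment to scale
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

<p>With this in place, k8s adds replicas whenever average CPU utilisation across the pods stays above the target, and scales back down during quiet hours.</p>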
<p>Early next year we will also put our stateful route optimisation application in Kubernetes, removing the ceiling from our capacity and also reducing how much over-provisioning for the daily and annual peaks we need. This work is significantly more complex (due to its stateful nature and custom scaling) and deserves a dedicated article.</p>
kubernetes
operations
devops
gcp
-
The Monolith in the Room
Stef van Hooijdonk
Mon, 21 Nov 2022 09:10:58 +0000
https://dev.to/coolblue/the-monolith-in-the-room-1947
https://dev.to/coolblue/the-monolith-in-the-room-1947
<p>It is very easy to talk about your current systems and code as if they are old and legacy. "That Monolith" is legacy. It is old and the code is written in a way new joiners might not like. Even the coding language or framework might be something not in the top charts at the <a href="proxy.php?url=https://octoverse.github.com/2022/top-programming-languages" rel="noopener noreferrer">Github Octoverse</a> anymore.</p>
<p>But more often than not, these systems, monoliths, are still the moneymaker (âŹâŹ) for your company. The same for us at <a href="proxy.php?url=https://www.coolblue.nl" rel="noopener noreferrer">Coolblue</a>.</p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0aahkr6kfi3antn2mm66.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0aahkr6kfi3antn2mm66.png" alt="An elephant" width="800" height="416"></a></p>
<h2>
The problem
</h2>
<p>Let me explain the problem we believe we have with at least one of our monoliths, based on two concerns.</p>
<h3>
Concern: Ownership
</h3>
<p>As written earlier about our vision to place <a href="proxy.php?url=">Guided Ownership</a> as low as possible in the Development teams, having a monolith that you want to improve or that needs maintenance is counterintuitive.</p>
<blockquote>
<p>Shared ownership tends to lead to no ownership</p>
</blockquote>
<p>We are probably not the first department that sees this issue <a href="proxy.php?url=https://www.platohq.com/resources/when-shared-ownership-no-longer-works-829891624" rel="noopener noreferrer">1</a>. </p>
<p>Some practical issues can also arise when working with multiple teams on a single solution, most of which can be addressed with proper release management, good automation and a mature CI/CD platform.</p>
<h3>
Concern: Technology
</h3>
<p>One of our two monoliths was written almost two decades ago. We used Delphi to write an application to handle all aspects of our business processes. </p>
<p><strong>Application</strong><br>
In itself <a href="proxy.php?url=https://www.embarcadero.com/products/rad-studio" rel="noopener noreferrer">Delphi</a> is not the problem, even though its usage in the market is declining. For us the more pressing reason to actively address this monolith is that the application is written as a desktop/Windows application.</p>
<p><strong>Application design</strong><br>
The biggest concern we have is the lack of clear and separated business logic in the application design. Logic resides in either a button click/screen or in a trigger in the data layer.</p>
<p><strong>Data</strong><br>
We also built a single database and datamodel for this monolith to work on. Here the lack of clear ownership is becoming more and more visible: services created by other teams use data from tables across the schema, making it hard to then innovate and make changes to your own schema.</p>
<h2>
What are we doing about it?
</h2>
<p>I am not going to air our dirty laundry too much here, just setting the scene for why we are moving forward with the following approach.<br>
</p>
<div class="highlight js-code-highlight">
<pre class="highlight plaintext"><code>// what are we going to do about it?
architecture = architecture.Replace("monolith", "guided ownership");
</code></pre>
</div>
<h3>
Replace?
</h3>
<p>The basis of our approach to solving the described problems is going to be <em>replace</em> by <em>rearchitecting</em>. Every time we want to improve or change a process, we will build a new solution and have it replace part of the monolith.</p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvkuthcda71jaufdtesv.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkvkuthcda71jaufdtesv.png" alt="Carving up the Monolith" width="800" height="450"></a></p>
<p>This approach allows us to work piece by piece, carving out features, processes and data as we go, and allows for MVP-like implementations first. The downside is that in many cases part of your logic and data lives in one system (the monolith) and part in a newer replacement system. On the data side, this means we face the constant challenge of keeping our data warehouse accurate and consistent.</p>
<p>We made the following choices to help us do exactly this:</p>
<p><strong>Domain Driven Design</strong>; together with our tech principle <a href="proxy.php?url=https://dev.to/coolblue/design-to-encapsulate-volatility-6g8">design-to-encapsulate-volatility</a>, using a pattern such as <a href="proxy.php?url=https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)" rel="noopener noreferrer">Ports and Adapters</a> allows our code to separate business logic from infrastructure-specific code (e.g. Oracle-specific queries). This lets us carve out the parts we are re-architecting while they still need the monolith's (data) access.</p>
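<p>To make the Ports and Adapters idea concrete, here is a minimal sketch in Python (the names are hypothetical; our actual implementations live in our own core languages). The business logic depends only on a port interface, and any adapter, whether backed by the monolith's Oracle schema or a new data store, can be plugged in behind it:</p>

```python
from typing import Protocol

class StockRepository(Protocol):
    """Port: what the business logic needs, free of infrastructure detail."""
    def quantity_on_hand(self, product_id: str) -> int: ...

def can_promise_delivery(repo: StockRepository, product_id: str, amount: int) -> bool:
    # Pure business logic: it neither knows nor cares whether the data
    # still lives in the monolith or in a re-architected store.
    return repo.quantity_on_hand(product_id) >= amount

class InMemoryStockRepository:
    """Adapter: one interchangeable implementation of the port."""
    def __init__(self, stock: dict[str, int]) -> None:
        self._stock = stock

    def quantity_on_hand(self, product_id: str) -> int:
        return self._stock.get(product_id, 0)
```

<p>Swapping the adapter (say, from an Oracle-backed one to a new service) then requires no change to the business logic, which is exactly what makes carving pieces out of the monolith incremental.</p>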
<p><strong>Events</strong> and <strong>Event Driven Architecture</strong>; to decouple the different processes and their supporting apps and services we are moving towards <a href="proxy.php?url=https://microservices.io/patterns/data/event-driven-architecture.html" rel="noopener noreferrer">Event Driven Architecture</a>. This way we can abstract the data leaving the bounded context of an application from its technical form or technology. This helps to hide the underlying situation of having two data stores during the transition period (the monolith on the one hand, and the new re-architected one on the other) from the systems receiving these events. </p>
<h3>
Kafka? Really?
</h3>
<p>We also see that, with more and more apps and services emitting and relying on events, we will need to support that. We chose to begin with <a href="proxy.php?url=https://kafka.apache.org" rel="noopener noreferrer">Apache Kafka</a> as the platform to facilitate these event streams, allowing both our Data Warehouse to tap into them and teams to rely on Kafka to stream between apps (bounded contexts).</p>
<p>Let me be clear, we are not going to replace <em>all</em> inter-service-communication with events and Kafka. For some processes a batch approach, e.g. via Airflow, is still a valid and great choice. The "it depends" strikes again.</p>
<h3>
Enabling teams with our Pillars of Guided Ownership
</h3>
<p>This brings a new problem to the table for our teams. Setting up producers, topics and consumers in this new Kafka platform. We have to help the teams here by enabling them.</p>
<p>Here are three examples how we have been and will be enabling our teams for this new direction.</p>
<h4>
Enable via Automation and Self-service
</h4>
<p>In a previous <a href="proxy.php?url=https://dev.to/coolblue/guided-ownership-422j">post</a> I already hinted that for 2023 we are looking to invest in and develop the tools, templates and processes that make deploying an app or service, with queues, Kafka topics and BigQuery staging tables alongside the needed AWS compute components, easy and available through self-service.</p>
<h4>
Enable via Skills & Knowledge
</h4>
<p>In the past few months we have deliberately done a few things to help with the knowledge needed. We sent a small delegation to the DDD Europe event and the Event Storming workshop that was held then (summer 2022). This group soaked up the knowledge and went ahead inside our tech organisation to help teams and their respective domains to perform these event storming sessions to learn how to do them and to use the output generated in the design of their next solution. This was a joint effort between Principal Developers, Data Engineering and Development teams.</p>
<h4>
Enable via Building Blocks
</h4>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypltdg6b14ohvx4ibcrr.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypltdg6b14ohvx4ibcrr.png" alt="Transactional outbox flow (cutout)" width="800" height="183"></a></p>
<p>We have also done some research and experiments, and based on that we created a solution block built on the <a href="proxy.php?url=https://microservices.io/patterns/data/transactional-outbox.html" rel="noopener noreferrer">Transactional Outbox pattern</a> in one of our core technologies: dotnet/c#. We will add it to our template for creating new apps and services, so teams can use it right away and see how it integrates into their new solution.</p>
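<p>The essence of the Transactional Outbox pattern, independent of our dotnet/c# building block, is that the business change and the outgoing event are written in the same database transaction, and a separate relay later publishes unpublished events to the broker. A sketch in Python with SQLite (table names and payloads are illustrative, and <code>publish</code> stands in for a real Kafka producer):</p>

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, topic TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

def place_order(order_id: int) -> None:
    # Business row and event row are committed atomically:
    # either both exist or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"order_id": order_id, "status": "placed"})),
        )

def relay_once(publish) -> None:
    # A separate relay process polls the outbox and publishes
    # each unpublished event, then marks it as published.
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
```

<p>Because the event row commits atomically with the business row, a crash between the commit and the publish cannot lose the event; the relay simply retries unpublished rows, giving at-least-once delivery.</p>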
<h2>
What about Tomorrow?
</h2>
<p>Going forward, we will have to keep checking whether Kafka is the right fit for us to do more decoupling via events. We chose this technology based on a proof of concept we did almost two years ago.</p>
<p>We also need to keep an eye on how the carving up of this monolith is progressing. Many parts are already separate Employee-Activity-focused applications. But turning a monolith fully off, that is the biggest challenge.</p>
<p>And lastly, whatever we do or change in our technology vision going forward, we need to make sure we always enable the teams. Give them the <a href="proxy.php?url=https://dev.to/coolblue/guided-ownership-422j"><strong>Guided Ownership</strong></a> they need to realise our vision.</p>
architecture
eventdriven
decoupling
ownership
-
Monitoring/Observability
Stef van Hooijdonk
Tue, 15 Nov 2022 10:10:00 +0000
https://dev.to/coolblue/monitoringobservability-5cdj
https://dev.to/coolblue/monitoringobservability-5cdj
<h2>
Complete coverage of all production systems.
</h2>
<p>No system should be active in production (i.e. providing a service to a customer or user) without being monitored.</p>
<p>All monitoring/logging is public, so that everyone in Coolblue has visibility of the vitality of the system and Tech Services can monitor specific aspects without exposing sensitive data.</p>
<p>Monitoring means tracking errors in critical workflows, health of critical dependencies and service KPIs.</p>
<h2>
Observability
</h2>
<p>Each application we build has to be <strong>observable</strong>. That means we need to know when something is wrong, and we need to be able to determine why this is.</p>
<p>To be able to tell when something is going wrong with our solutions we actively monitor them and we put in place alerts for our service level objectives.</p>
<p>To be able to find out why things are going wrong we make sure we have the logs to do so, combined with our monitoring data when needed.</p>
<blockquote>
<p>This monitoring and logging principle describes two parts:</p>
<ul>
<li>The first one being monitoring, the practice that describes methods to gain insight into our applications and stacks.</li>
<li>The other one being logging, the practice that describes methods to register log events and give insight into the complexity of our applications and stacks.</li>
</ul>
</blockquote>
<h2>
Monitoring and Alerts
</h2>
<p>You and your team actively monitor your applications. First you determine the metrics that are relevant for your application to measure. Then you create dashboards and define alerts and service level objectives. Dashboards give insight into the recent and/or current state of your application; you use them to see at a glance what is happening. Alerts, with or without a Service Level Objective tag, help you get notified when thresholds you set are met. The SLO alerts are also acted upon by our Tech Services department.</p>
<h3>
Logging
</h3>
<p>Applications are hard and complex to write and manage. The problems we are solving, the abstractions we create and the implementations we choose are all part of the complexity we are building. To shed light on that complexity we can use the practice of logging.</p>
<p>Logging can bring us additional insight into the operations executed in the application, which can help us understand the sequence of events that might have led to a certain outcome (error or otherwise). It's an investigative tool that, exercised correctly, can help piece together the application behaviour leading up to the outcome, giving developers potentially new insights into the emergent behaviour of their systems.</p>
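<p>As a small illustration of that investigative value (the names and format here are made up, not our actual setup), tagging every log line in a flow with the same correlation id is what lets you piece the sequence of events back together afterwards:</p>

```python
import logging
from io import StringIO

# Capture log output in memory so the example is self-contained;
# in production this handler would ship lines to a log platform.
buffer = StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s cid=%(cid)s %(message)s"))

log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_order(order_id: str, cid: str) -> None:
    # Every line carries the same correlation id, so the full
    # sequence leading up to an outcome can be reconstructed.
    log.info("received order %s", order_id, extra={"cid": cid})
    log.info("reserved stock for order %s", order_id, extra={"cid": cid})

handle_order("A-42", "req-123")
```

<p>Filtering a log aggregator on a single correlation id then yields the whole story of one request, across every operation it touched.</p>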
<h2>
Playbooks
</h2>
<p>In order to work together with those who help us action SLO Alerts when they happen, even outside of your own working day, we agree to have playbooks in place. These playbooks contain information on the SLO Alert itself and the potential underlying issue, and should direct the reader to actions that help resolve the issue. We have a template available for writing these playbooks. Please make sure the playbook is findable via the SLO Alert: the SLO Alert title in our observability platform should match the SLO field in the playbook, and you can add a link to the page in the Slack alert for easy access.</p>
<h2>
PII Data and Sensitive Data
</h2>
<p>We monitor and we log without exposing sensitive company data or PII data on our customers.</p>
<blockquote>
<p>Definition of PII Data: Personally identifiable information (PII) is any data that can be used to identify a specific individual. Social Security numbers, mailing or email addresses, and phone numbers have most commonly been considered PII, but technology has expanded the scope of PII considerably. It can include an IP address, login IDs, social media posts, or digital images. Geolocation and biometric data can also be classified as PII.</p>
</blockquote>
o11y
principles
development
-
Guided Ownership
Stef van Hooijdonk
Mon, 14 Nov 2022 09:12:00 +0000
https://dev.to/coolblue/guided-ownership-422j
https://dev.to/coolblue/guided-ownership-422j
<p>In our Tech department here at Coolblue, we strive to give our development teams the ownership they need to build the solutions our domains, and thus our customers, need. </p>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv9wj6r8v7x9ry55y7jn.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv9wj6r8v7x9ry55y7jn.png" alt="Our Pillars of Guided Ownership" width="800" height="450"></a></p>
<p>We enable ownership through (at least) 6 <strong>Pillars</strong>:</p>
<ul>
<li>With clear <strong>Tech Principles</strong>
</li>
<li>With guidance on when to use what through a <strong>radar</strong>, <strong>architecture blocks</strong> and <strong>solution blocks</strong>
</li>
<li>
<strong>Self-service</strong> cloud infrastructure (AWS/GCP)</li>
<li><strong>Observability</strong></li>
<li>Grow the <strong>skills</strong> you need through our <strong>Coolblue University</strong>, our Master classes or other resources you may need</li>
<li>Our <strong>Domain roadmaps</strong>
</li>
</ul>
<p>Let's dive a bit deeper into each of these.</p>
<h3>
Tech Principles
</h3>
<p>Everything we design, build, deploy and run adheres to our Tech principles. I wrote about these principles in a series earlier: <a href="proxy.php?url=https://dev.to/stefvanhooijdonk/series/19139">Our Tech Principles</a>. </p>
<p>Most of these principles have been learned, the hard way. And this is our way of passing that experience along to the teams of the future. </p>
<p>We have made these principles part of our evaluation criteria for those that work in Tech. Making it very clear how important we think it is to work with these at every stage and at every level of development. </p>
<h3>
Radar, Architecture blocks and Solution blocks
</h3>
<p>We want teams to develop solutions freely as much as possible.</p>
<p>We believe that means: give the teams a lot of ownership and freedom to build these solutions. </p>
<p>It also means, we need to enable the teams to do so as much as possible. And by having guidance and reusability we can eliminate some tedious work, making the time and effort a Team spends focus more on the actual value delivering solution.</p>
<p><strong>Reuse</strong><br>
We all know that reusing proven code, for a given pattern or a piece of plumbing, helps you focus on the actual solution and increases your speed of delivery. <br>
That is why we have quite a few reusable items:</p>
<ul>
<li><p><strong>Architecture blocks</strong>; in our industry we have quite a few Design patterns and we have selected those that work well in how we develop our solutions. You can see this for instance in our Tech Principle <a href="proxy.php?url=https://dev.to/stefvanhooijdonk/design-to-encapsulate-volatility-6g8">Encapsulate for volatility</a> and the design pattern Ports and Adapters and the Transactional Outbox pattern.</p></li>
<li><p><strong>Solution blocks</strong>; actual implementations developers can find and use in their solutions. Some are packages/components to reuse, others are templates to jumpstart a new app or service. We also have a Design System for building Customer facing applications, one for building Activity Focussed Employee applications and one for our Email communications. Through these building blocks we also make sure common practices, like performance and security, are addressed with solid implementations to be used and to serve as inspiration.</p></li>
</ul>
<p><em>fyi: Architecture blocks and Solution blocks originate from <a href="proxy.php?url=https://pubs.opengroup.org/architecture/togaf91-doc/arch/" rel="noopener noreferrer">TOGAF 9 - Building Blocks</a></em></p>
<p><strong>Deploy and maintain</strong><br>
That does not mean we think having every team build with different tools and languages is the best way to do so. Every language and every development environment means learning something new, means support from our CI/CD platform and from our cloud platform(s). </p>
<p>It also means when people want to move teams, or solutions move to a different team, we have to deal with the knowledge needed to support, develop and run these solutions over time. <br>
So we adopted a few core languages with an intended use case. Maybe not entirely a Radar, but it does give guidance and focus.</p>
<h3>
Self-service cloud infrastructure
</h3>
<p>Is every team SecDevOps to the full extent? No. Not yet at least, but I want our Development teams to be able to create new secure solutions and run their existing ones by themselves as much as possible.<br>
Not only can you see that through our Principles <a href="proxy.php?url=https://dev.to/coolblue/recovery-over-perfection-kbg">Recovery over Perfection</a>, <a href="proxy.php?url=https://dev.to/coolblue/automation-2mh2">Automation</a> and <a href="proxy.php?url=https://dev.to/coolblue/testing-37p6">Testing</a>, but we also want the Teams to deploy, scale and fix when it suits them. <br>
Our Hosting & Deployment teams develop and maintain the tools and core infrastructure that enable our teams to do so, through automated (Github) repo creation for instance. Want to build something new? Add the desired repo to Github via automation yourself. The same automation makes sure our observability platform is hooked up to the newly created stack and CI/CD pipeline. We also use this to implement proven practices for availability and security in our cloud infrastructure.</p>
<p>If you were to ask any of our 50+ development teams whether they want more self-service, they would say <strong>YES</strong>. Of course they would, it's in our corporate culture to "<a href="proxy.php?url=https://aboutcoolblue.com/en/culture/" rel="noopener noreferrer">just do it</a>". So we keep growing their toolset to do so.</p>
<p>One key investment area for 2023 is exactly this. We want our Solutions to be onboarded with more standard Cloud Infrastructure, and to make sure key data can be shared with other apps (outside of the bounded context), our data warehouse and our analysts. We aim to leverage Events and <strong>Event-Driven Architecture</strong> more and more, and making sure key infrastructure is ready to use will help the teams tremendously. Think standard queues, topics in Kafka, tables in BigQuery and more. </p>
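<p>As a sketch of the event-driven idea: a producer publishes a domain event, and any number of consumers (another bounded context, the data warehouse) react to it. The event bus below is a minimal in-memory stand-in for real infrastructure such as a Kafka topic or a queue; the <code>OrderPlaced</code> event and the consumer are hypothetical.</p>

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical domain event shared outside the bounded context."""
    order_id: str

class EventBus:
    """Minimal in-memory publish/subscribe; a stand-in for Kafka or a queue."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event):
        # Deliver the event to every consumer subscribed to its type.
        for handler in self._handlers[type(event)]:
            handler(event)

bus = EventBus()
warehouse_feed = []  # e.g. a data-warehouse consumer collecting events
bus.subscribe(OrderPlaced, warehouse_feed.append)
bus.publish(OrderPlaced(order_id="A-1001"))
```

<p>The producing team never needs to know who consumes the event, which is what keeps the bounded contexts decoupled.</p>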
<h3>
Observability
</h3>
<h4>
Technical Observability
</h4>
<p>Any Development Team owning a solution wants to know how it is performing and whether it is performing correctly. Observability is a great way of doing that. Our Development Teams rely on metrics, dashboards and alerts to stay on top of their solutions. We believe this is critical for a Team to fully <strong>Own It</strong>: acting on these insights and making sure our Employees and Customers can do what they should. </p>
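<p>As an illustration (not our actual tooling), a team can emit a timing metric around any call so a dashboard or alert can act on it. The <code>metrics</code> list below stands in for a real metrics backend, and the metric name is made up.</p>

```python
import time
from functools import wraps

metrics = []  # stand-in for a real metrics backend (StatsD, Prometheus, ...)

def timed(name):
    """Record how long each call takes, tagged with a metric name."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                metrics.append((name, time.perf_counter() - start))
        return wrapper
    return decorator

@timed("checkout.process_order")  # hypothetical metric name
def process_order(order_id):
    return f"processed {order_id}"

process_order("A-1001")
```

<p>In a real setup the recorded durations would feed the team's dashboards and alert thresholds.</p>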
<h4>
Business Observability
</h4>
<p>Another way we enable teams is by also giving them insights beyond the technical metrics.</p>
<ul>
<li><p>Operating Costs; having insight into the total costs your App/Stack incurs triggers conscious decisions on Right-Sizing the Stacks.</p></li>
<li><p>Business Insights; all our Applications have a <strong>purpose</strong>: it can be processing Orders, finding the right amount of Stock to keep, or sending out the right Product to a Customer. By also having access to these dashboards, the Team with their Product Owner can track whether what is being done is actually Moving the Needle.</p></li>
</ul>
<h3>
Coolblue University
</h3>
<p>Maybe an obvious Pillar, but having the right skills (i.e. technical, leadership or social) will make you better at what you do. It will allow you to work better with your Team and your Stakeholders.</p>
<p>For that reason we have a Coolblue University with over a hundred different trainings available for you to take. Other online resources, IRL events, books and classroom trainings are of course also available. </p>
<p>To help everyone understand our Business and what it does, we have also created Master classes. These are currently presentation-based and help anyone who attends to fully learn our way of working on key topics in the company.</p>
<p>We also share what we have learned via our internal, monthly Tech Demos. These offer a stage for our developers and engineers to share their learnings and insights with the rest of us in tech. By tech, for tech.</p>
<h3>
Domain roadmaps
</h3>
<p>We cannot forget the reason we build Solutions/develop Software. We build it because there is a benefit for our Customers (NPS) or the Company (EBITDA). And sometimes there are more <strong>strategic</strong> reasons to build something. </p>
<p>Each team works with a Roadmap. We evaluate these at least 4 times a year: always be ready to change and adjust to what we have learned. The Market and your context will always be changing, and as such we need to adjust when needed and not fall for sunk-cost reasoning ("we have invested too much to change now"). Agility is crucial for us as a Company, and thus for our development efforts too. This is evident from the inclusion of Flexible in our <a href="proxy.php?url=https://aboutcoolblue.com/en/culture/" rel="noopener noreferrer">Corporate Values</a>.</p>
<h3>
Conclusion
</h3>
<p>This post turned out to be longer than I expected and full of corporate speak I usually try to avoid. But here we are. There are of course more ways we support our teams, but I liked the idea of focusing this post on these six Pillars.</p>
<p>This post is mostly to share and explain how we work and think at Coolblue and how we try to create the environment for <strong>Ownership</strong> for our development teams.</p>
technology
teams
enable
ownership
-
Design to encapsulate volatility
Stef van Hooijdonk
Fri, 04 Nov 2022 08:45:00 +0000
https://dev.to/coolblue/design-to-encapsulate-volatility-6g8
https://dev.to/coolblue/design-to-encapsulate-volatility-6g8
<p>Our strong belief in <strong>Clean Architecture</strong> complements the principle of <strong>Encapsulating Volatility</strong>. If something is likely to change, make sure it is easy to change.</p>
<p>A very retro analogy can be applied to hifi systems:</p>
<p>Some manufacturers sold complete integrated hifi systems. These systems had everything on them. Amp, record player, double cassette deck, radio, CD player, equalizer (a graphic one if you were lucky) etc.</p>
<p>The problem is that as CDs were replaced by next gen technology (like minidisc, ok, like mp3 on storage and then streaming), the whole system was at risk of becoming obsolete, even though the system still served its original purpose.</p>
<p>However, high-end manufacturers stuck to their component model. You could buy an amp separately from a CD player, and speakers, etc. You could even mix and match components from different manufacturers.</p>
<p>So when CDs were replaced by mp3s, and then mp3s by streaming services, you could just add or swap out the relevant components. The CD player, one of the "music suppliers" within the system, was a volatile component.</p>
<p>This view on things goes beyond modularisation. It tells you which modules you should have. If it's volatile, very likely to change, encapsulate it.</p>
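<p>The hifi analogy translates directly to code: hide the volatile part behind a stable interface, so that only the component behind it ever changes. A minimal sketch, with illustrative class names:</p>

```python
from abc import ABC, abstractmethod

class MusicSource(ABC):
    """Stable interface around the volatile part (CD, mp3, streaming)."""
    @abstractmethod
    def play(self, title: str) -> str: ...

class CdPlayer(MusicSource):
    def play(self, title: str) -> str:
        return f"spinning disc: {title}"

class StreamingService(MusicSource):
    def play(self, title: str) -> str:
        return f"streaming: {title}"

class HifiSystem:
    """The amp and speakers: unchanged when the music source is swapped."""
    def __init__(self, source: MusicSource):
        self.source = source

    def listen(self, title: str) -> str:
        return self.source.play(title)
```

<p>Swapping <code>CdPlayer()</code> for <code>StreamingService()</code> needs no change to <code>HifiSystem</code>; the volatility is encapsulated.</p>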
<h2>
Clean architecture
</h2>
<p><a href="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6wujv50bkeha4468nzw.png" class="article-body-image-wrapper"><img src="proxy.php?url=https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn6wujv50bkeha4468nzw.png" alt="Clean architecture" width="800" height="793"></a></p>
<p>Regardless of whether we choose Microservices, SOA, or a monolithic approach, Clean Architecture can and should still be used. All of our systems should be built this way. It emphasises the domain at the heart of our designs (key also to DDD etc).</p>
<p>Dependency inversion is integral (and thus this approach supports good programming practice in general, and some specific SOLID principles). It moves things that are more susceptible to change to the edge instead of the center, making them easier to replace. Whatever we build can be built following this principle (even other architectures/patterns can be built in this manner).</p>
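<p>A small sketch of what this looks like in practice: the domain defines the abstraction it needs, and an implementation lives at the edge where it is easy to replace. The repository and product names here are hypothetical.</p>

```python
from abc import ABC, abstractmethod

class StockRepository(ABC):
    """Port defined by the domain; the domain never sees a database."""
    @abstractmethod
    def level(self, product_id: str) -> int: ...

def needs_reorder(repo: StockRepository, product_id: str, minimum: int) -> bool:
    # Domain rule: reorder when stock drops below the minimum.
    return repo.level(product_id) < minimum

class InMemoryStockRepository(StockRepository):
    """Edge implementation; easy to swap for a database-backed one."""
    def __init__(self, levels: dict):
        self._levels = levels

    def level(self, product_id: str) -> int:
        return self._levels.get(product_id, 0)

repo = InMemoryStockRepository({"tv-42": 3})  # hypothetical product id
```

<p>The arrow of dependency points inward: the edge implements what the centre defines, never the other way around.</p>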
<h2>
SOLID
</h2>
<ul>
<li>S - Single-responsibility principle</li>
<li>O - Open-closed principle</li>
<li>L - Liskov substitution principle</li>
<li>I - Interface segregation principle</li>
<li>D - Dependency inversion principle</li>
</ul>
<p>You adhere to <a href="proxy.php?url=https://en.wikipedia.org/wiki/SOLID" rel="noopener noreferrer">SOLID</a> principles (no exceptions for object-oriented development). These are five of the basic principles of object-oriented design and programming. They should be second nature to every OO developer at Coolblue.</p>
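<p>One of these principles in miniature, interface segregation: clients depend only on the methods they actually use. The protocol and class names below are illustrative, not existing Coolblue code.</p>

```python
from typing import Protocol

class Printable(Protocol):
    """Small, focused interface instead of one fat 'Device' interface."""
    def print_label(self) -> str: ...

class Scannable(Protocol):
    def scan_barcode(self) -> str: ...

class LabelPrinter:
    def print_label(self) -> str:
        return "label"

def ship(printer: Printable) -> str:
    # ship() only needs printing, so it is not coupled to scanning.
    return printer.print_label()
```

<p>A class that can only print still satisfies <code>ship()</code>, because the interface asks for nothing more.</p>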
solid
principles
development
technology
-
Testing
Stef van Hooijdonk
Thu, 27 Oct 2022 08:20:00 +0000
https://dev.to/coolblue/testing-37p6
https://dev.to/coolblue/testing-37p6
<p>The software we create should do what it was intended to do. To be sure that it does, we want our <strong>production code</strong> to be well-covered by <strong>automated tests</strong>. These tests should be runnable with the click of a button, and they should be run automatically before each release.</p>
<p>Although testing, when done right, should ultimately make you more productive, by virtue of having to spend less time fixing problems after you release, it does "slow you down" up front, as compared to writing code without any tests. Most applications or services have a long lifecycle that <strong>warrants having extensive test coverage,</strong> but given the nature of some of the work we do, some applications or features are so trivial, short-lived, or time-critical, that having few or no tests is an acceptable situation.</p>
<p>The metric of "code coverage" is not considered of any value in determining the quality of tests. Consequently, we should not use it to fail builds or block deployments. The reason behind this is that tests can merely trigger your production code without testing the right business concepts. Next to that, some units might not even need to be tested: these units can, for example, only properly be tested using an integration test, or might be too trivial and covered by their consumers.</p>
<p>TDD at Coolblue means:</p>
<h2>
Process
</h2>
<ul>
<li>Write a failing test first, then code until that test passes. Do not write any more production code than is necessary to make the one failing test pass.</li>
<li>Red-green-refactor. Don't forget to refactor, it's the most important part.</li>
<li>Everyone on board. The whole team must adopt TDD; do not adopt it partially.</li>
</ul>
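<p>In miniature, the loop looks like this (the discount rule is a made-up example): write the test first, watch it fail, then write just enough production code to make it pass, then refactor.</p>

```python
# Red: the test exists before the production code it exercises.
def test_bulk_discount():
    assert price_with_discount(quantity=10, unit_price=2.0) == 18.0  # 10% off
    assert price_with_discount(quantity=2, unit_price=2.0) == 4.0    # no discount

# Green: just enough production code to make the failing test pass.
def price_with_discount(quantity: int, unit_price: float) -> float:
    total = quantity * unit_price
    return total * 0.9 if quantity >= 10 else total

test_bulk_discount()
```

<p>The refactor step would then clean up the implementation while the test keeps it honest.</p>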
<h2>
Writing tests
</h2>
<ul>
<li>Consider the inputs, outputs, all possible weaknesses (possible errors) and strengths (successful runs).</li>
<li>Do not overcomplicate your tests. A test should be simple to set up and execute.</li>
<li>Tests should run and pass on any machine/environment. If tests require special environmental setup or fail unexpectedly, then they are not good unit tests.</li>
<li>Make sure your tests clearly reveal their intent. Another developer can look at the test and understand what is expected of the code.</li>
<li>Each test should have a limited scope. If it fails, it should be obvious why it failed. It's important to only assert one logical concept in a single test. That means you can have multiple asserts on the same object, since they will usually cover the same concept.</li>
<li>Keep your tests clean. They are just like your production code.</li>
<li>External dependencies should be replaced with test doubles in unit tests.</li>
</ul>
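<p>A sketch of that last point, replacing an external dependency with a test double (the payment-gateway API shown is hypothetical):</p>

```python
from unittest.mock import Mock

def refund_order(gateway, order_id: str, amount: float) -> bool:
    """Unit under test; 'gateway' is an external dependency in production."""
    response = gateway.refund(order_id, amount)
    return response["status"] == "ok"

# In the unit test, a Mock stands in for the real gateway: no network,
# deterministic behaviour, and we can assert how it was called.
gateway = Mock()
gateway.refund.return_value = {"status": "ok"}
```

<p>The test then exercises only the unit's own logic, while still verifying the interaction with the dependency.</p>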
<h2>
Running tests
</h2>
<ul>
<li>Unit tests should be fast; the entire unit test suite should finish running in under a minute.</li>
<li>Unit tests should just run with zero effort (after installing dependencies).</li>
<li>The ratio of tests at each level of testing should be balanced according to the pyramid model (e.g. 80% unit tests, 15% integration tests, 5% acceptance tests).</li>
<li>Acceptance tests should be divided in suites based on features.</li>
</ul>
<h2>
Measurements of success for teams
</h2>
<ul>
<li>The integration and unit tests are passing before merging.</li>
<li>Acceptance tests are running successfully in the Acceptance environment before code is deployed to production.</li>
<li>Unit/Integration tests run within CI environment on each PR.</li>
<li>Failing tests block deployment. All tests run successfully on a CI environment before code is deployed.</li>
</ul>
<blockquote>
<p>Note: "Coverage", i.e. the percentage of lines covered by a unit or integration test, is not a measure of success, because it says nothing about how well the code has been tested. Do not make a specific level of test coverage a requirement, because it is hard to reach and it will cause people to start writing nonsense tests just to reach the required level of coverage.</p>
</blockquote>
<h2>
Suggested resources
</h2>
<ul>
<li>Read Test Driven Development By Example by Kent Beck</li>
<li>Read Clean Code by Robert C. Martin</li>
<li>Read xUnit Test Patterns: Refactoring Test Code by Gerard Meszaros</li>
<li>Check training videos at <a href="proxy.php?url=https://cleancoders.com" rel="noopener noreferrer">https://cleancoders.com</a>
</li>
<li>Read Working Effectively with Legacy Code by Michael Feathers</li>
</ul>
development
principles
tdd