engineering.empathy.co · Ghost 6.22 · Tue, 17 Mar 2026 05:57:46 GMT

A first dive into CSS animations

https://engineering.empathy.co/a-first-dive-into-css-animations/ · Thu, 21 Sep 2023 06:45:07 GMT

CSS is one of those things that everyone has a strong opinion about. For some reason, it seems like you are only allowed to love it or hate it.

It’s quite well-known that it can be a struggle to work with. Some people even avoid all frontend development just to stay away from CSS.

But, as I’m starting to grow in my career, I’m becoming a member of the first (and smaller) group – the one that loves CSS.

I learned that CSS is quite easy to use when you learn the ✨ basics ✨ correctly and don’t rush into advanced things.

So, if you are still struggling with it, you may need to consider doing some basic training or even practicing by playing some “silly” games like Flexbox Froggy or CSS Diner. Believe me, you’ll learn to center a div!

Once concepts like cascade, specificity, selectors, flex, grid, and so on stop being a mystery to you, you’ll probably be ready to dive into the cool stuff without your mental health suffering along the way.

The first two or three times, I found myself stuck in front of my newly assigned task that required a transition or an animation, looking at the designs and thinking, “How the hell am I going to do that?!” But after working on a few more, I came to realize that it is all about the mental model.

The Mental Model

Developing an animation starts way before you start writing the code. Your first step should always be to observe the result you want and try to find the different behaviors and the HTML + CSS resources needed to achieve it. You’ll probably need to divide and conquer, in order to see everything clearly.

This way, when the “spinning ball” becomes an “svg or div with a rotate transform” and your “dropdown effect” becomes a “container with a height or max-height transition,” you’ll be ready to go! Of course, if you don’t already, you’ll need to find out what you can and cannot do with CSS.
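For instance, that dropdown effect could be sketched with something like the following (the class names are made up for illustration):

```css
.dropdown__panel {
  max-height: 0;
  overflow: hidden;
  /* Animate between the collapsed and expanded states. */
  transition: max-height 0.3s ease;
}

.dropdown--open .dropdown__panel {
  /* Any value large enough to fit the content works here;
     max-height can be transitioned, while height: auto cannot. */
  max-height: 20rem;
}
```

Toggling the `dropdown--open` class is then all the JavaScript side needs to do; CSS handles the rest.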

Be sure to take a look at concepts like:

  • transform
  • transition
  • animation
  • @keyframes

You’ll need them to do all the fancy things. Once you are aware of what each one can do, you’ll see that, basically, an animation is a series of transformations put together, so you’ll have to observe how each individual element of the animation interacts and which transformations it undergoes.

It might take a while for it to click in your brain and start seeing things this way. In the meantime, let’s take a look at a few examples developed in the Empathy Platform Docs portal.

Example 1: 404 error

A good example of a simple animation is located in the 404 error. Can you guess what is going on and how this animation works? Don’t worry if you don’t see it immediately!

Here’s how it works:

First comes the HTML:

<div class="edoc-error-page__diagram">
  <p class="edoc-error-page__text edoc-error-page__text--404">404</p>
  <p class="edoc-error-page__text edoc-error-page__text--error">ERROR</p>
  <Bubble class="edoc-error-page__bubble edoc-error-page__bubble--green"/>
  <Bubble class="edoc-error-page__bubble edoc-error-page__bubble--blue"/>
  <Bubble class="edoc-error-page__bubble edoc-error-page__bubble--pink"/>
  <Bubble class="edoc-error-page__bubble edoc-error-page__bubble--yellow"/>
  <Bubble class="edoc-error-page__bubble edoc-error-page__bubble--orange"/>
</div>

As you can see, in this case, each of the bubbles is an individual element. “Bubble” is an imported SVG, but this could be done with divs as well!

The second part is to define the animation for each of the bubbles. For this, keyframes come in handy:

@keyframes rotation {
    from {
        transform: rotate(0deg);
    }
    to {
        transform: rotate(359deg);
    }
}

Assigning the animation to each bubble with the animation property makes everything start spinning:

&--blue {
    top: rem(-10px);
    left: 38%;
    animation: rotation 4s infinite linear;
    circle {
        r: 20;
    }
}
&--green {
    top: rem(40px);
    left: 28%;
    transform: rotate(45deg);
    animation: rotation 3.5s infinite reverse linear;
    circle {
        r: 18;
        fill: $color-green;
    }
}

But, as you can see, there are a few more things going on here, so let’s check everything:

  • top and left: Each bubble is positioned with `position: absolute` within the picture.
  • animation properties: Different bubbles are assigned different animation values to change the characteristics of the rotation. Playing with durations, we can make some of them spin more slowly and others faster. Some also include `reverse` to rotate them counterclockwise.
  • circle properties: Here, we modify the SVG characteristics to give the circles different radii and colors.

Too much to begin with? Maybe an even simpler example could help:

Example 2: Holons' No-Results Page

An even simpler animation is located in the no-results page of the Holons Search Experience. Didn’t see anything? Stare at that sad, disappointed Holon for a few seconds… It blinks! Can you guess how this one works?

Here it goes:

Let’s start with the HTML again:

<div class="no-results__image-wrapper">   
  <NoResultsEyesIcon class="no-results__image no-results__image--eyes"/>
  <NoResultsMouthIcon class="no-results__image no-results__image--mouth"   /> 
</div>

The face is separated into two SVGs, one for the eyes and another for the mouth. This allows us to easily animate the eyes while leaving the mouth alone.

Secondly, we have a keyframes animation for the blink:

@keyframes blink {
    45%, 55% {
        transform: scaleY(1);
    }
    50% {
        transform: scaleY(0.1);
    }
}

In this case, instead of modifying the rotation, we are playing with the vertical scale. The eyes get squashed for a brief moment: the blink starts at 45% of the cycle, the eyes are nearly closed at 50%, and they are back open by 55%.

Finally, the animation is given to the eyes SVG:

&--eyes {
    animation-name: blink;
    animation-duration: 6s;
    animation-timing-function: ease;
    animation-iteration-count: infinite;
}
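For reference, those four longhand properties can be collapsed into the animation shorthand:

```css
&--eyes {
    animation: blink 6s ease infinite;
}
```

Both forms behave identically; the longhand just makes each knob more explicit.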

That said, you’ll probably need to struggle with a few animations before you start to understand their inner workings quickly, so don’t despair! You’ll get there!

Remember: It takes time!

Even if you get to the point where you immediately see how everything works and how to build it, it will still take time to implement something beautiful. You’ll probably never be able to create a breathtaking animation in five minutes; it takes time!

It’s very possible that you’ll face a trade-off between the time you can dedicate and how smooth you want the animation to be. Just remember that this is normal; you’re not going to perform a miracle in limited time, so don’t panic and take your time (if you have it)!

Purchase History Search: Indexing with Elasticsearch

https://engineering.empathy.co/purchase-history-search-indexing-with-elasticsearch/ · Thu, 14 Sep 2023 10:57:53 GMT

At Empathy.co, we have been making heavy use of search engines such as Solr and Elasticsearch for years. They are key types of software in the products we build for our platform and our clients. They enable fast and effective storage and retrieval of our clients' catalogue products, simply and with little overhead.

A couple of years ago, one of our clients requested that we speed up the retrieval of their historical purchase data. They had been storing the data without giving much consideration to its retrieval. Since we were already using Elasticsearch for this client, we promptly designed a system for ingesting and finding purchase history data, and thus Purchase History Search was born as a separate project.


The challenge of Purchase History Search was ingesting a huge volume of purchase data in real time and making the data easily searchable. The introduction of the time dimension to the data is a key difference from catalogue search. Moreover, the history of purchases is private data, so creating a privacy-first system was essential.

Elasticsearch works with indices that could be roughly compared to standard database tables. Each index is further divided into shards, which are essentially inverted indices that enable the efficient search of data using the terms contained within it. These shards can live in different nodes (machines) so that the index operations can be distributed and parallelized. A typical Purchase History cluster consists of three or five master nodes that manage the entire cluster, its metadata, and operations, and about six data nodes where the index shards actually reside.


At Empathy.co, for our retail clients, we usually build a new index every time there is a new catalogue using a batch or on-demand process. For Purchase History, we have a continuous stream of purchase documents going into a feed, so the mechanism changes. In this case, we use a different service called the Streaming Indexer.

Ingesting the Data

The Streaming Indexer is a scalable and stateless application that sits between the client's data feed and our Elasticsearch cluster. It fetches documents in real time and indexes them to a specific index. As the data is presented in a time-series style, it is only natural to have indices consist of a subset of the purchases. In our case, we decided to go with a monthly index. To further optimise the data storage and retrieval, it is standard for each document to have some kind of customer ID field. This ID can be employed to always route the document to the same index shard. When a shopper goes through their history, a single shard needs to be hit. It is like directing your search to a single drawer in your wardrobe.
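As a sketch of what this looks like against Elasticsearch's REST API (the monthly index name, document ID, and field names here are hypothetical, not our production schema):

```
# Index a purchase into the monthly index, routed by customer ID
PUT /purchases-2023-09/_doc/order-123?routing=customer-42
{
  "customer_id": "customer-42",
  "timestamp": "2023-09-14T10:57:53Z",
  "items": [{ "product_id": "sku-1", "quantity": 2 }]
}

# Retrieve that customer's purchases; the same routing value means
# only one shard of the index needs to be searched
GET /purchases-2023-09/_search?routing=customer-42
{
  "query": { "term": { "customer_id": "customer-42" } }
}
```

The filter on `customer_id` is still needed, since routing only narrows the search to a shard, not to a single customer.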


This idea of sharding is not only a performance gain, but a good first step towards a privacy-aware and decentralized Purchase History system. Even though Elasticsearch is the centralising element of the whole setup, the idea of separating data based on customers is an element that we would like to explore further. We have been researching solutions like SOLID, whose pods can resemble our idea of shards. Basically, each shard/pod would be a private, fully decentralized repository of a consumer’s data. The consumer would be in total control of it and would be able to manage who has access to what part of their data.


As stated, shards are grouped into indices. The picture below shows the different Elasticsearch components. This distribution of data across indices enables the Elasticsearch cluster to scale out without degrading the service. Multiple Elasticsearch nodes can be run so that querying and indexing are further distributed: each instance works in parallel, and each index can be queried separately. Furthermore, settings can be tweaked for older indices, such as freezing them or making them read-only. A large proportion of our clients' searches retrieve recent purchases, which usually means only one or two of the most recent indices are hit.

Accessing the Data

The second part of our Purchase History setup is the Search API itself. It can search effectively for purchases using a combination of filter and sorting parameters. Depending on the use case, different search parameters can be used. A typical use case is shoppers reviewing their purchase history, looking for details or items to repurchase. Another is our clients’ customer support services, which shoppers contact in the event of mismatches between ordered and received products. The setup can also be integrated with other applications, such as a Nutrition ranking system or a Pickup virtual assistant, which quickly check a shopper’s purchase history to retrieve purchases to be picked up.

The Search API service is also scalable and stateless, but it is more similar to our traditional search service. To make it performant, some tricks are possible, such as using the routing data discussed earlier. Secure access and permissions are simple to set up using OAuth, for example.


As with the rest of our services, Kubernetes is used to orchestrate the deployment of all the Purchase History services. This makes them reliable and enables them to run in any cloud environment as there are no cloud-specific technologies used. Kubernetes simplifies horizontal scaling, depending on the load. For example, if traffic suddenly increases, more Search API pods can be quickly deployed to meet established latency SLAs.

Another consideration is the retention period of the data. Our customers, or even legislation, may dictate that the data be kept no longer than a specific period. Having the monthly index scheme in place allows us to simply remove a full index once all the documents belonging to that month are older than the defined period.
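To make that concrete, here is a minimal sketch (the `purchases-YYYY-MM` naming and the retention value are illustrative, not our production code) of how monthly indices map to a retention window:

```python
from datetime import date

def indices_to_delete(existing, today, retention_months):
    """Return the monthly indices (named 'purchases-YYYY-MM') whose month
    lies entirely outside the retention window and can be dropped whole."""
    # Oldest (year, month) that may still contain data inside the window.
    months = today.year * 12 + (today.month - 1) - retention_months
    cutoff = (months // 12, months % 12 + 1)
    expired = []
    for name in existing:
        year, month = (int(part) for part in name.split("-")[1:3])
        # Only months strictly before the cutoff are guaranteed to contain
        # nothing newer than the retention period.
        if (year, month) < cutoff:
            expired.append(name)
    return expired
```

With a six-month retention and today set to 14 September 2023, `purchases-2023-01` is dropped while `purchases-2023-08` is kept; deleting an index this way is a single cheap operation instead of a document-by-document purge.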

The Future of Purchase History

The next challenge will be to configure Purchase History access in a decentralized and private scenario. This will be done hand-in-hand with merchants, as they will no longer own and totally control their customer base’s Purchase History, but rather have an interface enabling access to each customer’s shards/pods.

From Zero to Beam

https://engineering.empathy.co/from-zero-to-beam/ · Thu, 10 Aug 2023 06:00:35 GMT

Moving from in-house streaming code to a flexible and portable solution with Apache Beam

Long gone are the days when we used to consume data with Apache Spark Streaming, with an overly complicated, cloud-dependent infrastructure that was non-performant when load increased dramatically. Follow us on a journey of stack simplification, learning, and performance improvement within Empathy.co’s data pipeline.

A Blast from the Past

Our data journey began in 2017, building a full-fledged AWS-based pipeline, steadily consuming events from multiple sources, wrapping those into small batches of JSON events, and sending them on to the Spark Streaming consumer.

Here’s a simplified view of what that exact streaming side of things looked like - together with the important bits of infrastructure, such as queues where events would be written:

Simplified streaming diagram of the old solution


Main technologies found in the old stack:

  • SQS: Simple Queue Service, Amazon’s managed message queuing service mainly used to asynchronously send, store, and retrieve multiple messages of various sizes.

    Single events were stored in queue #1, awaiting their processing by the event-wrapper.
  • SNS: Simple Notification Service, Amazon’s web service that makes it easy to set up, operate, and send notifications from the cloud.

    We used it to notify other infrastructure elements that a new batch of events wrapped by event-wrapper was ready to be pulled and consumed.
  • Parquet: An open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

    A data lake was built with all the Parquet files generated by this part of the infrastructure. We then continued to generate analytics from them via Apache Spark batch jobs.

In-house services:

  • Event wrapper: In charge of taking individual events written into the queue by previous services, merging them into a bigger set in a single JSON file, and forwarding them into another queue.
  • Spark consumer: In charge of constantly pulling the queue, 10 messages at a time, and parsing those into Parquet files for later processing by subsequent batch jobs.

There’s more to it than what’s seen in the diagram, but that should give you an idea of how things looked back then. And yes, it was all built on AWS, with little room for portability short of major changes to the whole set of services that composed this part of the pipeline.

Some cons found in this solution:

  • Scalability: Major sales or special events such as Black Friday were a headache when traffic went through the roof, as it was not easy to increase the throughput in the consumer. This resulted in delayed events and visualizations not being updated in time.
  • Portability: Having so many AWS bits underneath made the system hard to migrate into another cloud, and finding alternative equivalent services became a major pain point that needed to be solved urgently.
  • In-house connector: Pulling from SQS in Spark Streaming relied on an in-house connector, kept apart from all the parsing logic that made the data fit the schema defined for raw data.

You may be wondering, “Why not use Beam from the beginning? It was released in 2016!” Well, given the experience the team had with Apache Spark, it made sense to go with the much-proven Spark Streaming rather than making the switch to a technology that was still in its early stages.


Winds of Change

Despite the cons, we managed to improve the old solution’s performance over time, making it a bit more capable of handling higher traffic loads without missing a beat.

However, the portability pain point was still there and wasn’t an issue so long as we didn’t use any other cloud… But given certain requirements, an extension of the components to support Google Cloud Platform was on its way to broaden the offering, and it was the perfect time to explore alternatives and simplify the system for the better! After a thorough analysis, we came to the conclusion that, given our expertise at the time, Apache Beam was the way to go.

Something which used to be overly complicated and really dependent on infrastructure soon looked something like this:

Simplified streaming diagram of the current solution


What is Apache Beam?

According to the Apache Beam website itself:

“The easiest way to do batch and streaming data processing. Write once, run anywhere data processing for mission-critical production workloads.”

In short, it is a unified API for building data processing components, where a single codebase can run on Flink, Spark, and many other systems. Check out the full list of supported runners.

Pick your language of choice and you’re pretty much ready to write code that will easily and transparently translate into a runner. The official docs have everything you need to know!

Some pros that made our decision way easier:

  • Multiple languages: Ability to write code in multiple languages such as Java, Python, and Scala and still get a portable pipeline out of it.
  • Huge range of connectors: Out-of-the-box connectors for Kinesis, PubSub, and Kafka are already provided, along with a large community that develops and maintains them.
  • Ease of development: Writing code in Apache Beam does have a certain learning curve, but once you get used to it, it’s really intuitive.

However, there are some cons worth mentioning also:

  • Performance corner cases: Being able to compile down to multiple runners means the one-framework-for-all approach probably isn’t as performant as a native solution written purely in Flink, for example.
  • Unable to update to the latest connector driver whenever you want: Drivers are embedded within Apache Beam and released with it. Until it has been properly tested by the Beam project, the latest version of a driver library won’t really be usable, which can be a problem for things like vulnerability fixes.

As soon as we decided that Apache Beam would replace our Spark Streaming implementation, we started to dig into what other parts of the infrastructure needed to be changed in order to optimize it. First step: the connector to the queue.

The Pipeline as the Baseline

The base of every Apache Beam job is the Pipeline, which is defined as “a user-constructed graph of transformations that defines the desired data processing operations.”

Pipeline p = Pipeline.create(options);
p.apply("ReadLines", TextIO.read().from(options.getInputFile()));

As soon as you’ve got data flowing through the pipe (that initial TextIO step), you can start modifying the data however you wish.

TextIO will be our connector of choice in this sample. It reads files from a given location, either on the local machine or remote (S3 and GCS buckets, for instance), and gives us a collection of strings; in Beam this is a PCollection, which is defined as “a data set or data stream. The data that a pipeline processes is part of a PCollection.”

As soon as you build that initial and crucial step in your pipeline, you are ready to start adding more stages that operate on the aforementioned PCollection.

The apache/beam GitHub repository has plenty of examples to play with, so take a look! Also, the Programming Guide on the official website has everything you need to get up to speed on developing your very own pipelines. All the concepts mentioned in here (and more!) are covered in detail.

From SQS to Kinesis and Beyond

Initially, we were using a mix of SQS and SNS with S3 buckets to persist the events midway, sending notifications via SNS whenever a new batch of events was ready to be pulled from Spark. This was not ideal by any means and it complicated the overall portability and maintainability of the system, among other things.

We wanted a real-time stream of events from beginning to end, with the ability to replay messages in case we wanted to and remove all the unnecessary complexity of maintaining midway buckets and notifications. The solution, given the climate at the time, was clear: Kinesis.

For those of you who don’t know what Kinesis is, it’s basically AWS’s real-time streaming offering. The benefits, in addition to it being real-time, are that it is highly scalable and fully managed, which removes a lot of operational overhead; this was a decisive factor for us.

The initial development of the new jobs targeted code that would run on both GCP Dataflow and Flink on Kubernetes.


Reliable, Fast and Portable - What Else?

By applying the concepts seen in the documentation above, we built our pipelines - all in a streaming fashion - which save the processed records to both Parquet files and MongoDB collections with great throughput.

It is easily scalable when high-traffic season comes (e.g. Black Friday or sales) and portable across clouds. We hope you enjoyed this brief intro to Apache Beam and the transition from a really cloud-dependent environment to a more agnostic one!



On-Call Engineering Approach

https://engineering.empathy.co/on-call-engineering-approach/ · Tue, 01 Aug 2023 09:45:37 GMT

At Empathy.co, On-Call was scheduled to allow for fast recovery in case of a disaster, errors, or loss of service. A bunch of folks from different teams are part of the On-Call rotation (with escalation policies), which guarantees that there is someone ready to catch any incidents that may occur.


We work with a "you build it, you own it" principle, which promotes each team's autonomy and ownership. Each team defines the operational challenges of their services in an Operational Readiness Review so that they can solve issues without first escalating to other teams – because nobody knows more about a service than the team that owns it.


Empathy.co’s culture enables and empowers software engineers to take ownership to build, run, and operate products and features. We actively participate in the design and implementation of customer solutions and remain responsible for the value and service they provide. This means we are engaged in the whole cycle, from software inception to any service-level management, such as debugging, troubleshooting, request and incident resolution.

Whether your organization already has an On-Call system or is looking to implement one, here's a look at how ours has been implemented and how we recommend setting it up.

Onboarding

Previously, when a new engineer was added to the On-Call rotation, there was not much guidance. To improve the On-Call onboarding, we implemented On-Call shadowing for a kinder, smoother ramp-up to going On-Call, with none of the stress or responsibility for diagnosing and fixing the issue.

💡
Empathy.co encourages everyone, regardless of their position within the company, to spend a week shadowing an engineering team using Better Stack to understand what our product does and how to use it.


The default shadowing schedule excludes weekends and the day the On-Call is transferred from one engineer to the next, resulting in a 4-days-a-week, 24-hours-a-day "shadowing shift." New engineers decide when they want to start shadowing and when they feel ready to join the On-Call rotation. Our expectation is that they begin shadowing sometime during the first three months, and our culture of shared responsibility and blamelessness makes it less daunting to make the switch from shadowing to being on call.

The Need-to-Know Steps

Identify & Log

Since it's important to respond to incidents quickly, identifying and logging must also be done quickly and methodically in order to move on to the next step.

Categorize & Prioritize

It's important to categorize incidents to prevent confusion: for instance, by the number of users affected, the services involved, or the revenue impact. Prioritizing incidents helps the On-Call engineer decide whether or not an incident requires the time and resources of the rest of the team.

Notify the Right People

Assembling the right people at the right time and in the right place is key to ensuring exceptional mean time to resolve (MTTR) times. Therefore, it's necessary to suppress the noise and alert only the right people who can fix the alert.

Troubleshoot

It's vital that first responders like the On-Call engineer be able to troubleshoot on the go.

Our On-Call Proposal

Each team is the expert regarding the workloads they have built. Following the "you build it, you own it" principle, the alerts should be owned by each respective team.

  • Each team has an independent escalation policy.
  • One team can escalate to another team or another backup from their escalation policy.
  • The number of On-Call engineers for each escalation policy should be at least four, to avoid burnout and high-stress load.
  • Escalation policies should be owned by each team. They decide if they want one member in primary, a couple in secondary, notification timings, etc.
  • Alerts notify the team with enough knowledge to solve the issue. Notifying the right people at the right time is key.



Schedule

Currently, our schedules usually start on Monday, but any weekday will do. We recommend avoiding On-Call shift changes on weekends. The shift length is one week, and only one week per month per engineer is recommended in order to avoid burnout.

Expectations

Setting clear expectations is key for maintaining alignment:

  1. During their On-Call shift, engineers should dedicate time to investigating and fixing the root cause of operational problems as a priority.
  2. Picking up new feature work should be a luxury, not an expectation.
  3. After a disruptive night or weekend, On-Call engineers are expected to take a break and have time to recover.

Each escalation policy should account for at least four engineers, and shift changes should happen on a weekly basis. Each On-Call engineer should plan their PTO days in coordination with their team, to ensure proper coverage.

Postmortem

After an incident happens, a Postmortem should be written the following day to detail the occurrence and the solution applied. The main owners of the postmortem are the On-Call engineers who solved the incident, but feel free to add more people to the review. The Postmortem should then be reviewed by Staff Engineering before being communicated to clients and any other stakeholders within the company.

What are the benefits of this proposal?

  • Clear expectations
  • Improve On-Call onboarding
  • Improved team transparency and accountability in handling issues
  • Suppress noise
  • Better service reliability by quickly acting on and resolving alerts; alerting the right people
  • Happier customers, who can contact On-Call engineers for urgent issues at any time and be assured that issues will always be fixed quickly
  • Maintaining the backup and escalation policies means everyone can provide support, if needed


This is the On-Call approach that works for us, but each organization is different. It's important that you create an approach that fits your company's needs, offers reliability for your clients, and balances the workload for your On-Call engineers.

How to Create an X Adapter (and pair it with X Components)

https://engineering.empathy.co/how-to-create-an-x-adapter-and-pair-it-with-x-components/ · Fri, 14 Jul 2023 06:00:02 GMT

When creating a search experience, Interface X offers you a handy library of standalone building blocks to make the development part swift and easy, the X Components. But there are many other useful tools you can find in the Interface X repository.

The X Adapter is one of the most powerful and versatile tools in it, serving as a decoupled communication layer between an endpoint and your components. Combine it with the X Components, and you'll have half the work done already!

But how does the adapter accomplish this? Essentially, the adapter acts as a middleman between the website and the endpoint by handling requests. When creating and configuring an adapter, you typically need to specify an endpoint. However, to actually "adapt" the data, you need more than just an endpoint. This is where mappers come into play.

RequestMapper and ResponseMapper

The mappers are at the core of an X Adapter's usefulness. The adapter's RequestMapper is used to build the request correctly before it is sent to the endpoint, and the ResponseMapper takes care of the response to that request.

We can define the mappers as dictionaries that help the adapter translate from what we have to what we want. This translation is defined using Schemas, which specify how the source object should be transformed into the target object.

For example, let’s say that our components name the search query as query, but the endpoint expects it to be called just q. The adapter will need a RequestMapper that tells it to translate query into q when creating the request. Here's an example of how to define a RequestMapper that does just that:

const requestMapper = schemaMapperFactory({
  q: 'query',
})

Easy enough! But what if the endpoint requires numerous fields? How can you keep track of all the different fields needed for the request to work while building the mapper? Fortunately, schemaMapperFactory implements generic typing using TypeScript. This allows you to specify the types of the source and target request parameter objects, which enables your IDE to give you hints about the available fields.

type SourceParams = { query: string };
type TargetParams = { q: string };

const requestMapper = schemaMapperFactory<SourceParams, TargetParams>({
  q: 'query',
})

Let’s cut to the chase

Mapping a couple of 1:1 parameters is rarely enough to handle real-life situations. Now that we've covered the basics of mappers, let's take a closer look at how they can handle more complex object structures, such as nested objects and calculations.

Returning to the previous example, we essentially tell the mapper that it will receive an object with an entry called query. When it finds that entry, it should put its value in a new object's entry, q. But what if query is not at the root level of the object? What if it's wrapped inside something else?

type SourceParams = {
    searchParams: {
        query: string
    }
};

The good news is that the solution is quite simple: specify the full path to the property, and the mapper will find it. The even better news is that TypeScript hints will still work and suggest the correct full path!

type SourceParams = {
    searchParams: {
        query: string
    }
};
type TargetParams = { q: string };

const requestMapper = schemaMapperFactory<SourceParams, TargetParams>({
  q: 'searchParams.query',
})

Indicating the path to the property, as in these examples, will cover a lot of the situations you might encounter, but what if what you want to do is more complex? What if you need to perform some calculation before passing the value to the target property?

Consider a scenario where you have to handle pagination for a request. The source object has the current page (page) and the number of elements per page  (pageSize), but the endpoint requires the index of the first element to return  (startIndex).

type SourceParams = {
    searchParams: {
        query: string, 
        page: number, 
        pageSize: number
    }
};
type TargetParams = {
    q: string, 
    startIndex: number
};

To assign the correct value to startIndex, you must first multiply page by pageSize. Rather than indicating the path to a parameter, you can pass a function that receives the entire source object and performs this calculation.

const requestMapper = schemaMapperFactory<SourceParams, TargetParams>({
  q: 'searchParams.query',
  startIndex: ({ searchParams }) => searchParams.page * searchParams.pageSize
})
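Since the arithmetic is easy to get wrong, here it is as a tiny standalone sketch (startIndexFor is a hypothetical helper, not part of the X Adapter API), assuming a zero-based page number:

```typescript
// Hypothetical helper (not part of the X Adapter API): the pagination
// calculation used in the mapper above, assuming a zero-based page.
function startIndexFor(page: number, pageSize: number): number {
  return page * pageSize;
}

console.log(startIndexFor(0, 24)); // first page starts at element 0
console.log(startIndexFor(2, 24)); // third page starts at element 48
```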

What about lists?

Moving on to responses, the tools we have are the same, but the problems we'll face are likely to be more complex. When dealing with endpoint responses, we have to handle more complex structures and arrays of elements. The X Adapter was created with the idea of mapping search responses, so it will need a way of handling the mapping of lists of elements.

Taking the following endpoint response as an example:

{
  response: {
    items: [
      {
        identifier: 1,
        name: 'example',
        image: 'image.url.example'
      },
      ...
    ]
  }
}

We define the types for the SourceResponse and the TargetResponse:

type SourceResponse = {
	response: {
		items: {
			identifier: number;
			name: string;
			image: string;
		}[];
	};
}

type TargetResponse = {
	results: {
		id: string;
		description: string;
		images: string[];
	}[];
}

We see here that response.items should be mapped to results and, for each item in the response, the numeric identifier becomes the string id, name becomes description, and image becomes a single-element array in images. Let’s see how it’s done:

const responseMapper = schemaMapperFactory<SourceResponse, TargetResponse>({
  results: {
    $path: 'response.items',
    $subSchema: {
      id: ({ identifier }) => identifier.toString(),
      description: 'name',
      images: ({ image }) => [image]
    }
  }
})

So, what's happening here? We're passing an object in the schema for the results property. This object has two elements: $path and $subSchema. $path is the path to the property in the source response object that will be iterated over to populate the list. $subSchema is the schema that will be applied to each of the elements found there.
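If it helps to demystify what the factory produces, the whole mapping model (path strings, functions, and $path/$subSchema objects) can be sketched in a few lines of plain TypeScript. This is only a toy illustration of the idea, not the library's actual implementation:

```typescript
// Toy schema mapper (illustration only, NOT the X Adapter implementation).
type Mapping = string | ((source: any) => any) | { $path: string; $subSchema: Schema };
type Schema = Record<string, Mapping>;

// Resolve a dotted path like 'response.items' against an object.
function getPath(obj: any, path: string): any {
  return path.split('.').reduce((acc, key) => acc?.[key], obj);
}

// Apply a schema to a source object, producing the target object.
function mapWithSchema(schema: Schema, source: any): any {
  const target: any = {};
  for (const [key, mapping] of Object.entries(schema)) {
    if (typeof mapping === 'string') {
      target[key] = getPath(source, mapping);        // path lookup
    } else if (typeof mapping === 'function') {
      target[key] = mapping(source);                 // computed value
    } else {
      // $path points at an array; $subSchema maps each of its elements.
      target[key] = getPath(source, mapping.$path)
        .map((item: any) => mapWithSchema(mapping.$subSchema, item));
    }
  }
  return target;
}

const mapped = mapWithSchema(
  {
    results: {
      $path: 'response.items',
      $subSchema: {
        id: ({ identifier }: any) => identifier.toString(),
        description: 'name',
        images: ({ image }: any) => [image]
      }
    }
  },
  { response: { items: [{ identifier: 1, name: 'example', image: 'image.url.example' }] } }
);
// mapped is { results: [{ id: '1', description: 'example', images: ['image.url.example'] }] }
```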

X Components and the X Adapter

Now that we have an adapter that can translate between our components and an endpoint, we just need the components. This is where the X Components come in handy. They offer multiple components to easily and quickly create an engaging search experience from the ground up. You can check out this video lesson for guidance on how to create a project with X Components or delve deeper into the documentation.

Using both the X Adapter and X Components, you only need to provide the endpoints. The components are already prepared to receive the information to display the search through the X Adapter.

The X Components require an adapter that bundles together different adapters for different endpoints to cater to the needs of the components, such as Related Tags, Next Queries, and tagging. To make the development process easier, it provides types for each adapter's expected request and response. Additionally, there's a pre-built adapter, the X Adapter-Platform, that you can plug directly into the X Components setup to communicate with Empathy.co endpoints right away. However, you won't always be working against those endpoints.

If you want to create a search experience with the X Components using your own endpoint, you'll most likely start by showing results for the search. To do that, you'll need to create an adapter for the search endpoint.

const myAdapter = {
  search: searchEndpointAdapter
}

Your searchEndpointAdapter will use your endpoint and the mappers to adapt the endpoint requests and responses to what the X Components understand, similar to the previous examples:

import { SearchRequest, SearchResponse } from "@empathyco/x-types";
import { YourSearchRequest, YourSearchResponse } from "your-types";

const requestMapper = schemaMapperFactory<SearchRequest, YourSearchRequest>({
  q: 'query',
  pageIndex: ({ rows, start }) => (rows && start ? rows * start : 0)
})

const responseMapper = schemaMapperFactory<YourSearchResponse, SearchResponse>({
  results: {
    $path: 'response.items',
    $subSchema: {
      id: ({ id }) => id.toString(),
      name: 'name',
      images: ({ image }) => [image],
      modelName: () => 'Result'
    }
  },
  totalResults: 'response.total'
})

const searchEndpointAdapter = endpointAdapterFactory<SearchRequest, SearchResponse>({
  endpoint: 'https://your.endpoint',
  requestMapper,
  responseMapper
});

After you’re finished with the adapter, pass it in the options object to the X Components and the search adapter will start providing the results from your search to the components that need them!

new XInstaller({
  adapter: myAdapter,
  ...
}).init();

Creating a great search experience for your customers doesn't have to be daunting. Empathy.co's X Components and X Adapter make it easy for developers to quickly build a reliable, adaptable search system that caters to the needs of their users. The X Adapter simplifies the process of translating between components and endpoints, while the X Components offer a range of pre-built components that can be customized to fit specific requirements. By utilizing these tools, you can provide a seamless and engaging search experience that keeps your customers coming back for more.

]]>
<![CDATA[Frequent Pattern Mining]]>TL; DR

The field of Frequent Pattern Mining (FPM) encompasses a series of techniques for finding patterns within a dataset.

This article will cover some of those techniques and how they can be used to extract behavioral patterns from anonymous interactions, in the context of an ecommerce site.

Terms and

]]>
https://engineering.empathy.co/frequent-pattern-mining/63e11961d7dc88003dedee0fTue, 11 Jul 2023 06:00:44 GMTTL; DR

The field of Frequent Pattern Mining (FPM) encompasses a series of techniques for finding patterns within a dataset.

This article will cover some of those techniques and how they can be used to extract behavioral patterns from anonymous interactions, in the context of an ecommerce site.

Terms and Concepts

Say that we are observing a piano and noting down in a ledger every note that is played. For clarity, let's concede that our piano is oversimplified and only has seven notes, from A to G.

Our ledger would read something like this: A, D, F, A, C, G, A, C, B

The contents of our ledger are a data series, and each individual note is an element (a.k.a. itemset) of the series.

We know that the ordering of the notes is rather important in a melody. Thus, we can safely assume that our series is indeed a sequence (ordered data series). This is not always the case and it really depends on what information you want to learn from the data. For instance, we might be studying how frequently two particular notes are dominant in a database of melodies. Then, we wouldn't need to guarantee a sequence.

Now imagine that we want to detect riffs. A riff is a short sequence of notes that occurs several times in the structure of a melody. In our nomenclature, a riff is then a subsequence. A subsequence of length 'n' is also called an n-sequence.

For example, take this popular tune. Can you hum along?

The 7-sequence DDCCBBA is found four times. The 3-sequence GDD is found three times.

So this is the goal of FPM put simply: find the subsequences with their frequency, i.e. how many times they are observed.
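Brute-force counting is enough to make this definition concrete. Here is a toy TypeScript sketch (illustrative only, not a mining algorithm) that counts occurrences of a contiguous n-sequence in our ledger:

```typescript
// Toy illustration of the definition above (not an FPM algorithm):
// count how many times a contiguous n-sequence occurs in a melody.
function countOccurrences(melody: string[], riff: string[]): number {
  let count = 0;
  for (let i = 0; i + riff.length <= melody.length; i++) {
    if (riff.every((note, j) => note === melody[i + j])) {
      count++;
    }
  }
  return count;
}

// The ledger from the beginning of the article.
const ledger = ['A', 'D', 'F', 'A', 'C', 'G', 'A', 'C', 'B'];
console.log(countOccurrences(ledger, ['A', 'C'])); // 2 — the 2-sequence AC occurs twice
```

Real FPM algorithms such as PrefixSpan do far better than this quadratic scan, but the quantity they report per pattern is exactly this frequency.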

In order to make things more interesting, let's consider many pianos playing at the same time. We can simultaneously write all the notes played, but we need an additional annotation to be able to tell them apart, e.g., P1 for the first piano, P2 for the second one, and so forth. This is the sid (sequence identifier).

Until now, we have considered every element to be defined by a single piece of data (e.g., 'A'), but that's obviously insufficient. Musical notes have more aspects to them than the tone: duration is also a key piece of data. We can improve our modeling by defining two aspects for our itemsets: tone and duration. For example, {tone: B, duration: 4} is an itemset defined by two items. Note that there is no concept of ordering for the items belonging to an itemset. They all occur at the same time.

Input data model for FPM algorithm

This twist allows for more interesting exercises. We could, for instance, focus our attention on finding a dominant rhythm in the melodies. Or, we could look for recurring patterns of notes combining a particular sequence of tones, followed by any tone with a particular duration.

Modeling Behavior

We will now apply the concepts above for modeling the behavioral trends observed in an ecommerce website.

We take the clickstream as our dataset, i.e., the atomic interactions that shoppers are producing on the website. These interactions include things like performing a search, clicking on a product, etc.

The Session ID is a perfect match to use as the sid, since this concept is widely used across the web for correlation purposes. Note that no other profiling data is needed. Plus, Session ID has desirable properties to it regarding privacy and security:

  • Meaningless
  • Volatile (only used during a session's lifespan)
  • Unique
  • Context-specific (not shared between sites)

The shoppers' interactions are then grouped by Session ID and sorted by timestamp when we observe them. The resulting sessions are our sequences, thus the interactions are itemsets.

Next, we must define the interactions through their items. The first consideration is that we have different kinds of shopper interactions. Thus it's logical to consider this an important item. Let's name it eventType. This may vary between websites, depending on the user experience. Let's initially consider three main types:

  • eventType: query (the shopper performs a query).
  • eventType: click (the shopper clicks on a product card from the results).
  • eventType: add2cart (the shopper adds a product to the shopping cart).

Query

At the very least, a query action is defined by its query terms. We also apply normalization processing to the query terms to improve the homogeneity of input data, e.g., lowercasing and transforming plural to singular words.

Itemset model for a query action

Click / Add2Cart

Click and add2cart are very similar regarding their relevant items, although the latter is usually a stronger signal so it's a good idea to keep them apart. The product ID is their most descriptive item, and the position of this product in the result list may also be relevant as it generally introduces bias in shoppers' behavior.

We also enrich the itemset with attributes from the shop's catalogue, in order to improve the aggregation factor. Consider that it's easier to discover patterns from items with a greater aggregation factor.

Itemset model for a click action
The selection of attributes to include in the itemsets has a big impact on the outcome of the FPM. The choices will depend primarily on the goal of the application. What kind of patterns are you aiming to discover?

Aggregation Factor

When a particular item occurs a lot across itemsets, then we say that it has a big aggregation factor. E.g., you can expect to see a lot of occurrences of the item 'eventType:query'.

The aggregation factor is necessary for discovering patterns, but it can also spoil your FPM with obvious facts. For instance, you could end up with the following top patterns:

  • {eventType:query}, {eventType:query}
  • {eventType:query}, {eventType:query}, {eventType:click}
  • {eventType:query}, {eventType:click}

The patterns based on items like 'eventType:query' are eclipsing other patterns that are probably more interesting to learn about.

A possible optimization in this case is crossing some features together in order to break the strong aggregation factor, thus gaining increased granularity, e.g.:

  • {eventType:query, terms:'cargo jumpsuit'} --> {query:'cargo jumpsuit'}
  • {eventType:click, productId:'12345'} --> {click:'12345'}

Therefore leading to more intriguing patterns like:

  • {query:'cargo jumpsuit'}, {click:'12345'}

In Action

Let's take a look at how to run FPM using Apache Spark's PrefixSpan implementation.

First, we'd typically prepare the data from clickstream as an Apache DataFrame. The key column here is sequence. Note that we have nested arrays (sequences of itemsets) in that column.

+-------+----------------------------------------------------------------------------------------------------+
|session|sequence                                                                                            |
+-------+----------------------------------------------------------------------------------------------------+
|131ABC |[[query:jacket], [click:p3, size:M, dept:casual], [click:p2, size:L, dept:casual]]|
|130ABC |[[query:jacket], [click:p2, size:L, dept:casual], [query:sport], [click:p3, size:M, dept:casual], [add2cart:p3, size:M, dept:casual]]|
|125ABC |[[click:p1, size:40, dept:footwear], [add2cart:p3, size:M, dept:casual], [click:p2, size:M, dept:casual]]|
|123ABC |[[query:sport], [click:p1, size:38, dept:footwear], [click:p3, size:M, dept:casual]]|
|124ABC |[[query:sweater], [click:p2, size:L, dept:casual], [prd:p3, size:M, dept:casual], [add2cart:p3, size:L, dept:casual], [query:shirt], [click:p2, size:L, dept:casual]]|
+-------+----------------------------------------------------------------------------------------------------+

Running FPM on top of this prepared DataFrame is simple:

  def run(df: DataFrame): DataFrame = {
    new PrefixSpan()
      .setMinSupport(minSupport)
      .setMaxPatternLength(maxPatternLength)
      .setMaxLocalProjDBSize(maxLocalProjDBSize)
      .findFrequentSequentialPatterns(df)
  }

  • minSupport: A lower threshold for discarding patterns that are not representative enough in the observed sessions. For a given pattern, support is calculated as num_sessions_present / total_sessions. This value is therefore a relative amount (0-1).
  • maxPatternLength: An upper threshold for pattern length. E.g., if this value is set to 5, then the length of the output patterns would be between 1 and 5.
  • maxLocalProjDBSize: This is an internal value that determines how many resources, in terms of memory, can be dedicated to building the internal representation (DB sequences) when running the algorithm.

It's worth mentioning that fine-tuning minSupport and maxPatternLength critically impacts performance.
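To make the minSupport semantics concrete, here is a toy TypeScript sketch of the support calculation, with sessions flattened to plain event strings rather than the itemsets the real algorithm consumes. A pattern is contained in a session if its events appear in order, not necessarily contiguously:

```typescript
// Toy sketch of the support formula above (num_sessions_present / total_sessions).
// Illustrative only: real PrefixSpan input is sequences of itemsets, not strings.
function containsSubsequence(session: string[], pattern: string[]): boolean {
  let i = 0;
  for (const event of session) {
    if (i < pattern.length && event === pattern[i]) i++;
  }
  return i === pattern.length;
}

function support(sessions: string[][], pattern: string[]): number {
  const present = sessions.filter(s => containsSubsequence(s, pattern)).length;
  return present / sessions.length;
}

// Hypothetical flattened sessions, loosely based on the DataFrame above.
const sessions = [
  ['query:jacket', 'click:p3', 'click:p2'],
  ['query:jacket', 'click:p2', 'query:sport', 'click:p3', 'add2cart:p3'],
  ['click:p1', 'add2cart:p3', 'click:p2'],
  ['query:sport', 'click:p1', 'click:p3']
];
console.log(support(sessions, ['query:jacket', 'click:p3'])); // 0.5 — present in 2 of 4 sessions
```

A pattern is kept only when this value is at least minSupport.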

As output, we get another DataFrame with the results of the analysis.

+------------------------------------------------------------------+----+
|sequence                                                          |freq|
+------------------------------------------------------------------+----+
|[[size:M, dept:casual], [click:p3]]                               |53  |
|[[query:jacket], [click:p3, dept:casual, size:L]]                 |34  |
|[[query:shirt], [size:M, dept:casual], [add2cart:p2]]             |31  |
+------------------------------------------------------------------+----+

Note that many of the resulting itemsets are now "incomplete". They act as templates, showing you the common factors extracted from the itemsets originally observed.

Alternative

Apache Spark also provides another implementation for performing FPM: FP-Growth.

It's interesting to know the differences, as it can also be a valid approach for some of the use cases.

  • Unordered discovery: When extracting patterns, FP-Growth doesn't take the order of the sequences into consideration. This can be handy for detecting elements that are frequently found together, not necessarily with the implicit cause-effect relationship that ordering implies. For instance, "products bought together" is a case of unordered discovery.
  • Simplified data model: The elements from input sequences are not expected as itemsets. They are not nested arrays, but plain values. This means that you can't expect to extract a common factor from composed elements. This may, however, still be valid for your use case.
  • Repeated elements not allowed: The same element can't appear twice in a given sequence. You'll need to filter out those cases before running the algorithm.
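To get a feel for unordered discovery, the following toy TypeScript sketch naively counts "bought together" pairs across transactions. FP-Growth finds the same kind of frequent itemsets without explicitly enumerating every combination; this only illustrates the idea, not the algorithm:

```typescript
// Naive "products bought together" sketch (illustration of the goal of
// FP-Growth, not its algorithm): count transactions containing each pair.
function frequentPairs(transactions: string[][], minCount: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tx of transactions) {
    const items = [...new Set(tx)].sort(); // dedupe; order is irrelevant here
    for (let i = 0; i < items.length; i++) {
      for (let j = i + 1; j < items.length; j++) {
        const key = `${items[i]}+${items[j]}`;
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
    }
  }
  // Keep only pairs seen in at least minCount transactions.
  return new Map([...counts].filter(([, n]) => n >= minCount));
}

const txs = [['p1', 'p2'], ['p2', 'p3'], ['p1', 'p2', 'p3']];
console.log(frequentPairs(txs, 2)); // pairs p1+p2 and p2+p3 each occur in 2 transactions
```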

What Next?

Now that you have discovered the representative patterns in your data, you can interpret this info statistically as a source of predictions and recommendations for your shoppers. Optimizing the most frequent shopper journeys on a website (e.g., by suggesting shortcuts) is another interesting use case.

Patterns can be used to learn trends or association rules, so stay tuned for a post on that topic!

References

Frequent pattern discovery - Wikipedia
Frequent Pattern Mining - Spark 3.3.2 Documentation
]]>
<![CDATA[Progressive Delivery: Argo Rollouts Adoption]]>Introduction

Progressive Delivery is emerging as a worthy successor to Continuous Delivery, by enabling developers to control how new features are launched to end users. Its wide popularity is owed to the demand for faster and more reliable software releases. The increasing emphasis on customer experience has begun to push

]]>
https://engineering.empathy.co/progressive-delivery-argo-rollouts-adoption/6481edd6cf442b00012eb3d3Mon, 12 Jun 2023 06:00:44 GMTIntroduction

Progressive Delivery is emerging as a worthy successor to Continuous Delivery, by enabling developers to control how new features are launched to end users. Its wide popularity is owed to the demand for faster and more reliable software releases. The increasing emphasis on customer experience has begun to push the Continuous Delivery methodology to the wayside. Large enterprises like Netflix, Amazon, and Uber are turning to Progressive Delivery to test and release code in a phased and controlled manner.

In a nutshell, Progressive Delivery empowers developers to roll out code changes to a subset of users and then expand them to all users. The progressive rollout of features is executed through techniques like blue-green deployment, feature flagging, and canary deployments. You can mitigate issues by promoting a version to all users only when you're confident that it is performant and reliable. And if it fails in production, the impact radius is restricted to a subset of users, and the update can be rolled back immediately.


Toolkit for Progressive Delivery

Apart from laying the foundation with Kubernetes, GitOps, and a service mesh, the key piece of the entire puzzle is a purpose-built progressive delivery tool.

Argo Rollouts is a Kubernetes controller and set of CRDs which provide advanced deployment capabilities such as blue-green, canary, canary analysis, experimentation, and progressive delivery features to Kubernetes.

We have also performed a deep analysis of the solution chosen at Empathy.co, outlining the requirements, benefits and a comparison between Argo Rollouts and Flagger.


Progressive Delivery Strategy

Progressive Delivery can be implemented through a number of strategies:

Canary Release

Channel a limited amount of traffic to a new canary service, and then if it passes reliability tests, you can gradually shift all traffic from the old to the new service, and the canary becomes the default version.

Feature Flag

Control the code launch remotely through a toggle-like feature, which enables changes to be rolled back immediately in the event of a failure.

Blue-Green Deployment

Gradually transfer traffic from an existing application (blue) to a newer one (green), while the blue version acts as a backup.

A/B Testing

Expose two different categories of the audience to two different application versions and analyze their performance to decide which is the ideal version.

Which tactic you pick depends on your goals and which you think would fit your workloads best.

Custom Strategy

Although those are the basic methods, Argo Rollouts allows custom analysis to be added and all of its capabilities to be explored, making it possible to create your own strategy for releases. For instance, a custom strategy with the following capabilities could be defined:

  • Scale the canary stack up for testing purposes
  • Set some header-based traffic shaping to the canary while setWeight is still set to zero
  • Begin testing and then, when the tests are OK, start the canary promotion
  • Canary promotion sends traffic gradually
  • Automated rollback in case the SLO Error Budget is burnt out, based on analysis of Prometheus metrics
  • Scale down the old application release after a while, in order to guarantee a faster rollback in case of failures; no need to scale up the previous version
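As an illustration, the first steps of a strategy along those lines could look roughly like the following Rollout manifest. This is a hedged sketch: names such as my-app, error-budget-check, and the image tag are placeholders, and the exact fields should be verified against the Argo Rollouts documentation for your version:

```yaml
# Sketch of a custom canary strategy (placeholder names; verify fields
# against the Argo Rollouts reference before use).
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  strategy:
    canary:
      steps:
        - setWeight: 0          # canary stack is up, but receives no live traffic yet
        - pause: {}             # manual gate: run tests against the canary here
        - setWeight: 20         # begin gradual promotion
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
      analysis:
        templates:
          - templateName: error-budget-check   # hypothetical AnalysisTemplate backed by Prometheus
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:v2   # placeholder image
```

A failing analysis run aborts the promotion and shifts traffic back to the stable ReplicaSet, which is what makes the automated rollback bullet above possible.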

Outline the Metrics You Want to Measure

With Progressive Delivery, you can reduce risk. This is mainly because you continuously test code changes, analyze performance and implement your learnings all in real time. To ensure that this happens seamlessly, it's important to have KPIs against which to measure the success of your release. In our case, the SLO metrics would be a good KPI to choose. Thanks to Pyrra, there is a set of Prometheus Rules to monitor the Error Budget of your application and send an alert in case it is close to being burnt out.

Those metrics can be added as part of the Analysis in Argo Rollouts, so be sure to pay attention to how to explore Prometheus metrics in the Analysis and that your metrics make sense for checking the success of your release.

Check your Pyrra configurations, as they will be propagated as Prometheus Rules and will create some critical alerts for the Error Budget.

Migration: From Deployment to Rollout

There are multiple ways to migrate to Rollout, but here we'll explain the simplest one:

⚠️
When migrating a Deployment which is already serving live production traffic, a Rollout should run next to the Deployment before deleting the Deployment or scaling down the Deployment. Not following this approach might result in downtime. It also allows for the Rollout to be tested before deleting the original Deployment.

During the migration, the Deployment and the Rollout should coexist to avoid downtime. After that, the Deployment resource can be scaled down to zero replicas.

Temporary ingress and service resources will be created to avoid downtime and ensure correct behavior during the migration, because Rollouts introduce a hash label in the selector to route the traffic.

After the migration, the temporary service and ingress can be deleted.

Kubectl Plugin

Argo Rollouts offers a Kubectl plugin to enrich the experience with Rollouts, Experiments, and Analysis from the command line.

Troubleshooting Argo Rollouts

Rollouts

Q: Can I restart a Rollout?

A: Sure, you can restart a Rollout just like you can restart a Deployment. There are multiple ways to do so: using the kubectl plugin, directly from Argo Rollouts UI, or through the Argo CD UI.

Q: Does the Rollout object follow the provided strategy when it is first created?

A: As with Deployments, Rollouts do not follow the strategy parameters on the initial deploy. The controller tries to get the Rollout into a steady state as fast as possible by creating a fully scaled-up ReplicaSet from the provided .spec.template. Once the Rollout has a stable ReplicaSet to transition from, the controller starts using the provided strategy to transition the previous ReplicaSet to the desired ReplicaSet.

Rollbacks

Q: If I use both Argo Rollouts and Argo CD, won't I have an endless loop in the case of a Rollback?

A: No, there is no endless loop. As explained in the previous question, Argo Rollouts doesn't tamper with Git in any way. If you use both Argo projects together, the sequence of events for a Rollback is the following:

  1. Version N runs on the cluster as a Rollout (managed by Argo CD). The Git repository is updated with version N+1 in the Rollout/Deployment manifest.
  2. Argo CD sees the changes in Git and updates the live state in the cluster with the new Rollout object.
  3. Argo Rollouts takes over as it watches for all changes in Rollout Objects. Argo Rollouts is completely oblivious to what is happening in Git. It only cares about what is happening with Rollout objects that are live in the cluster.
  4. Argo Rollouts tries to apply version N+1 with the selected strategy (e.g. blue-green).
  5. Version N+1 fails to deploy for some reason.
  6. Argo Rollouts scales back again (or switches traffic back) to version N in the cluster. No change in Git takes place from Argo Rollouts.
  7. The cluster is running version N and is completely healthy.
  8. The Rollout is marked as "Degraded" both in ArgoCD and Argo Rollouts.
  9. Argo CD syncs take no further action as the Rollout object in Git is exactly the same as in the cluster. They both mention version N+1.

Q: How can I run my own custom tests (e.g. smoke tests) to decide if a Rollback should take place or not?

A: Use a custom Job or Web Analysis.

Analysis

Q: What is the difference between failures and errors?

A: Failures are when the failure condition evaluates to true, or an AnalysisRun without a failure condition evaluates the success condition to false. Errors are when the controller has any kind of issue with taking a measurement (i.e. an invalid Prometheus URL).

For more information, check out the Argo Rollouts FAQ.

Takeaways

Argo Rollouts offers enhancements to the usual Kubernetes deployment strategies, and the main highlights of adoption are:

  • Ability to choose your strategy and select the one that best suits your needs
  • No need to adjust to canary or blue-green, because Argo Rollouts allows you to create a custom strategy
  • Special attention to migration from Deployment to Rollout with some caveats that must be taken into consideration.

]]>
<![CDATA[Why should my team start automating?]]>Maybe you’re still gazing at this title, a bit suspicious and wondering to yourself, “Start automating? What does she mean, we’ve been doing it for ages!” If that’s the case, congratulations, you’re on the right path! Yet, you might find

]]>
https://engineering.empathy.co/why-should-my-team-start-automating/644a21e6c84b58003d8e32bdThu, 27 Apr 2023 13:07:03 GMTMaybe you’re still gazing at this title, a bit suspicious and wondering to yourself, “Start automating? What does she mean, we’ve been doing it for ages!” If that’s the case, congratulations, you’re on the right path! Yet, you might find this article useful to remember the old, dark days and sigh.

But for those who haven’t yet found the right moment to start, or even don’t see the value, I hope this story empowers you to join the bright side of testing and transform the initial question, “Why should my team start automating?” into “Why haven’t we started automating yet?”


Bumps in the road to automation

“It’s too expensive and brings no benefit.”

Correct, automation will imply a big initial investment. It’s necessary to dedicate time and people to find the tool that best fits the framework and programming language your team is using. Of course, your team will also need proper training as to how to use that tool and how to design simple, maintainable tests.


Moreover, it’s not possible to provide an exact ROI calculation, since automation brings test coverage that wouldn’t be possible to achieve and handle manually. Think of a test regression suite for a project with multiple interconnected features. During the first few iterations, it might be possible to manually run all acceptance and regression tests, but the more iterations, the bigger the regression suite would be. How long would it take for that suite to become unmanageable? Consequently, how long would it take for the team to give up on those tests?

“We don’t have a dedicated specialist on the team.”

Well, it’s not strictly necessary. Adopting Agile practices means not only that there is no longer room for a dedicated Quality & Productivity team working on demand at the end of each project release, but also that quality becomes a whole-team effort. The entire team needs to adopt a quality mindset, as everyone is responsible for the quality of the product. Whether you are a developer who has no experience in automating tests or a manual tester without a programming background, never fear. It’s a team responsibility, and there will be colleagues on the team who will be able to help.

“But we don’t know how to automate!”

Take a look at the beginning of the previous section. Do you see that “It’s not strictly necessary” bit? That’s because although the whole team might agree to learn automation skills, it may still be a good idea to bring in a quality expert to coach and support the team, helping to adopt the new practices and making sure those first steps head in the right direction: toward a solid, maintainable and easy-to-understand automation test suite. A Quality & Productivity Engineer will know the team is ready when automation has become an assumed, natural process.

“We don’t have time…”

It’s harder to take on new habits when there are old ones to fall back on. If the team isn’t given sufficient time to get up to speed with automation and, later on, to design and implement the automated tests during each sprint, they will go back to old practices. Automation will be skipped as the lesser of two evils, and some exploratory testing will be done, at most. Best case scenario, the automation tasks are resumed in the following sprint, with the delivered business value reduced. Worst case scenario, deliverables seem to hold up without proper testing, so the team falls for the misbelief that tests are not necessary.

Tests with benefits

At this point, there may still be some skeptical readers thinking that if their software is simple enough, they can still survive without automated tests.

Let’s imagine the following scenario: there’s a small application developed by a team that always has enough time allotted to manually test all the progress made during each iteration. Some bugs are found during the testing phase, and a few others make it to the production environment, yet the situation is not concerning. How could automation benefit the team?

  • Better understand the requirements and build testable code.
  • Bring confidence to the code. While in manual testing, the Quality and Productivity Engineer is the one providing this sense of security, with automated testing, this focus is moved to the tests themselves.
  • Get early feedback, helping developers to troubleshoot quickly. As bugs are spotted sooner, there’s more time to find the root cause, come up with a solution and redesign the code if needed, contributing to more solid code.
  • Eliminate the possibility of human mistakes during execution, since every test is run exactly the same way.
  • Decrease the team's frustration. Automation reduces repetitive tasks; consider that the more code is added, the longer manual testing would take.
  • Increase test coverage. Think about complex scenarios with a wide range of input data to cover. Most likely, only some of those scenarios will be tested if only manual testing is performed.
  • Create living documentation. It’s easy to forget to update static documentation every time a feature changes. Automated tests, however, must be updated whenever functionality changes, so the documentation they provide stays current.
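To make the coverage point concrete, here is a minimal sketch of a table-driven test (Python is used for illustration, and `normalize_query` is a hypothetical function standing in for any unit under test). Each new case is a single line, and the case table doubles as the living documentation mentioned above.

```python
def normalize_query(query: str) -> str:
    """Hypothetical unit under test: trim, collapse whitespace, lowercase."""
    return " ".join(query.split()).lower()

# Each tuple is (raw input, expected output). Running all of these by hand
# on every change would be tedious; here it is one loop.
CASES = [
    ("Shoes", "shoes"),
    ("  red  SHOES  ", "red shoes"),
    ("", ""),
    ("\tRunning\nshoes", "running shoes"),
]

def test_normalize_query() -> None:
    for raw, expected in CASES:
        assert normalize_query(raw) == expected, f"failed for {raw!r}"

test_normalize_query()
```

Adding a newly discovered corner case is just one more tuple, so the coverage grows with the product instead of with the testers' available time.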

Have I brought you to the bright side of automation? Happy testing and, as always, bring on any questions or comments!

]]>
<![CDATA[Creating Component Tests for Spark Applications]]>

One of the main engineering challenges faced by the Empathy.co Data Team is creating robust tests for our Spark applications. Since these applications are constantly evolving, as for any application, we needed a way to ensure changes wouldn’t break the code; a guarantee that the output from

]]>
https://engineering.empathy.co/creating-component-tests-for-spark-applications/641c2d0fee4bd9003d141038Thu, 23 Mar 2023 14:13:19 GMT

One of the main engineering challenges faced by the Empathy.co Data Team is creating robust tests for our Spark applications. Since these applications are constantly evolving, as for any application, we needed a way to ensure changes wouldn’t break the code; a guarantee that the output from our jobs would remain the same when refactored or when the input schema of the data is changed.

The biggest hurdle here was determining how to create component tests that check two key boxes:

  1. The aggregated results are as expected.
  2. The output columns of our DataFrame are the same and the schema remains unbroken.

Our Spark project architecture is based on the Spring Boot framework. We use this framework to handle the arguments we pass to our application via environment variables or command-line arguments. For that reason, all of our jobs follow the same structure, using configuration beans to read the configuration properties. All the jobs implement the run method from the Spring Boot ApplicationRunner interface. So, the final objective is to test the run method of our applications to ensure the aggregation results are returned as expected. Individual methods can also be tested with unit tests, but that is out of scope for this blog post.

Let’s imagine that the structure of one of the Spark Jobs we want to test is the following:

@SpringBootApplication(
  exclude = Array(classOf[MongoAutoConfiguration]),
  scanBasePackages = Array[String]("my.configuration.package")
)
@ConfigurationProperties("job-name")
class MyBatch @Autowired() (
    @BeanProperty var sparkConfig: SparkConfig,
    @BeanProperty var inputConfig: InputConfig
) extends Serializable
  with ApplicationRunner {

  @BeanProperty
  var outputPath: String = _

  override def run(args: ApplicationArguments): Unit = {

    val inputDf = readParquet(
      sparkConfig.sparkSession,
      inputConfig.path,
      inputConfig.getStartDate,
      inputConfig.getEndDate
    )

    val parquetDF = letTheJobPerformItsJob() // aggregation over inputDf omitted

    if (!Option(outputPath).forall(_.isEmpty)) {
      parquetDF.save(
        fullOutputPath, // derived from outputPath
        partitionKeys = Seq(RawSchemaConstants.Yyyy, RawSchemaConstants.Mm)
      )
    }
  }
}

Next, we will be using the Mockito framework to write the test code. Let’s start the test by creating the Spark session:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.mockito.ArgumentCaptor
import org.mockito.ArgumentMatchers.any
import org.mockito.ArgumentMatchers.anyString
import org.mockito.ArgumentMatchers.isNull
import org.mockito.MockitoSugar
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.springframework.boot.DefaultApplicationArguments

class MyBatchTest extends AnyFlatSpec with Matchers with MockitoSugar {

 val spark = SparkSession.builder
   .master("local[*]")
   .getOrCreate()

 import spark.implicits._

}

All of our Spark batch jobs follow the same working order: first, the source DataFrame is read from a bucket. Then, we make some aggregations, and finally, the results are either saved to a different bucket or to a MongoDB collection, or both.

The next step is to define the input DataFrame that would be read from the source bucket during a real execution, which would need to contain the data required to test the different scenarios that could arise (such as corner cases). The content of this DataFrame depends on each test logic.

To make the code more readable, we first define some aliases (SV, SFV) for objects that group the constant values used repeatedly. Also, the code shown here is a reduced version of the DataFrame; for readability, not all the rows present in the source DataFrame of the test are shown. We could define several DataFrames inside different tests to check individual scenarios, and use a larger one to test the overall aggregation with all the scenarios together, but all of them follow this same procedure:

val testData = DataFrameGenerator.generateSampleDataFrame(
 SV.Year,
 SV.Month,
 SV.Day,
 Seq(
   (SV.St1, Sinks.Q, SFV.LangEngScopeDesktop),
   (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
   (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
   [...]
 ).toDF(
   SourceColumnNames.Instance,
   SourceColumnNames.Sink,
   SourceColumnNames.Filters
 )
)

We define an empty map for the output that would be written to MongoDB, in order to mock that method, since we are not interested in writing any results to MongoDB:

val emptyMongoInstanceDFs = Map[WriteConfig, DataFrame]()

After that, we need to create a mock object for the methods that read the source data and write the results, which in our case are located in an implicit class called DataFrameMethods that extends the Spark DataFrame object. We also define a Mockito ArgumentCaptor object that will allow us to capture the result of the Spark job, which is then passed to one of the methods that performs the write operation. This code is written inside the test code:

"My Batch" should "work" in {

 val dfMethodsMock = mock[DataFrameMethods]
 val dataFrameCaptor = ArgumentCaptor.forClass(classOf[DataFrame])

}

Now, we instantiate the job we want to test and create a spy object to mock some methods:

val myBatch = new MyBatch(
 SparkTestConfig.getSparkConfig,
 SparkTestConfig.getInputConfig
)
val spyBatch = spy(myBatch)

The input beans that the batch receives as arguments (Spark session, input config, etc.) can be created as shown below. Our batch receives more input beans, such as the output parameters, but for the sake of simplicity they are omitted from this snippet:

def getSparkConfig: SparkConfig = {
 val sparkConfig = new SparkConfig()
 sparkConfig.useLocalMaster = true
 sparkConfig.construct()
 sparkConfig
}


def getInputConfig: InputConfig = {
 val inputConfig = new InputConfig()
 inputConfig.path = "inputPath"
 inputConfig.timeZone = "UTC"
 inputConfig.date = "2022-01-01"
 inputConfig
}

The next step is to mock the methods that read the source data and write the results. The read method should return the input DataFrame we previously created, and the parquet write method is set to do nothing. The method that generates the documents written to MongoDB is also mocked, to avoid writing results to any database; we only want to capture the argument with the resulting DataFrame when the method is called.

doReturn(testData)
  .when(spyBatch)
  .readParquet(
    any[SparkSession],
    ArgumentMatchers.eq("inputPath"),
    ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-01"), LocalTime.MIN, ZoneId.of("UTC"))),
    ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-02"), LocalTime.MIN, ZoneId.of("UTC")))
  )

doNothing
 .when(dfMethodsMock)
 .save(ArgumentMatchers.eq("outputPath"), ArgumentMatchers.eq(Seq("yyyy", "mm")))

doReturn(emptyMongoInstanceDFs)
 .when(spyBatch)
 .generateInstanceDFs(any[DataFrame], isNull[String], any[DatabaseConfig])

Here comes the most interesting part of the test: the batch code is executed by calling the run method and the resulting DataFrame is captured with the argument captor by using the method that generates the documents written to MongoDB. We mocked this method so it generates an empty collection, but we need to capture the input argument to get the Spark job results. In this case, we are using the method that saves the result into a MongoDB collection, but we could just as well use the one that writes the result into a bucket in parquet format.

spyBatch.run(new DefaultApplicationArguments(""))

verify(spyBatch)
 .generateInstanceDFs(
   dataFrameCaptor.capture(),
   any(),
   any[DatabaseConfig]
 )

val result: DataFrame = dataFrameCaptor.getValue.asInstanceOf[DataFrame].cache()

The last part of the test consists of checking the result. There are several ways to perform this check. One consists of collecting the results and comparing the rows. Although we are working with small DataFrames, we prefer to do this in a distributed way by comparing the DataFrames without collecting the data into the Spark driver.

So, let’s define the expected DataFrame. The key point here is to create the DataFrame with the columns in the same order they are aggregated by the job; otherwise, the check will fail:


val expectedData = Seq(
  (SV.St1, SFV.EmptyFilterStr, 1, 1, 3),
  (SV.St1, SFV.LangEngStr, 1, 2, 1),
  [...]
).toDF(
  FinalColumnNames.Instance,
  FinalColumnNames.Filters,
  FinalColumnNames.QueryCount,
  [...]
)

We check to see that the total number of rows matches and that the DataFrames are equal by subtracting one from the other and checking that the results are empty:

assert(10 === result.count())
assert(expectedData.except(result).isEmpty)

We know that our jobs do not produce duplicated rows (exactly equal rows), so this check is enough to ensure that the expected and resulting DataFrames are the same. With repeated rows, we would need additional checks, such as counting the number of times each row is repeated. We can also subtract the DataFrames in the inverse order and perform the same check, to be fully confident in the results of the test.

assert(result.except(expectedData).isEmpty)
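To see why the `except`-based check needs that caveat about repeated rows, here is a plain-Python sketch of the idea (not Spark code): Spark's `except` behaves like a distinct set difference, so comparing per-row counts is the stricter, multiset view. In Spark this would correspond to grouping by all columns and counting before the subtraction.

```python
from collections import Counter

# Two "DataFrames" as plain lists of rows: same distinct rows,
# but with different multiplicities.
expected = [("st1", 3), ("st1", 3), ("st2", 1)]
result   = [("st1", 3), ("st2", 1), ("st2", 1)]

# A set-style difference (what a distinct `except` does) is empty in
# both directions, so the duplicate mismatch goes unnoticed...
assert set(expected) - set(result) == set()
assert set(result) - set(expected) == set()

# ...while comparing per-row counts (the multiset view) catches it.
assert Counter(expected) != Counter(result)
```

Since our jobs guarantee no duplicated rows, the two `except` checks above remain sufficient for our case.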

To summarize, these are the steps to follow for Spark application component testing:

  1. Mock the beans required by the application and instantiate the Spark session and the Spark application.
  2. Mock the read method to use a DataFrame defined within the test and then write the method so that it does not perform any write operation.
  3. Create a spy for the application object so it can mock the read and write (result) methods.
  4. Create an argument captor to retrieve the result of the Spark application, which is passed via parameter to the write method.
  5. Run the code and capture the resulting DataFrame.
  6. Define the expected DataFrame with the columns defined in the same order as calculated by the Spark application.
  7. Compare the expected result with the DataFrame that was retrieved using the argument captor.

All together, the test code should appear like this:

import com.eb.data.batch.{SampleValues => SV}
import com.eb.data.batch.{SampleFiltersValues => SFV}
import com.eb.data.batch.{SampleSinks => Sinks}
import com.eb.data.batch.config.DatabaseConfig
import com.eb.data.batch.io.df.DataFrame.DataFrameMethods
import com.eb.data.batch.DataFrameGenerator
import com.eb.data.batch.FinalColumnNames
import com.eb.data.batch.SourceColumnNames
import com.eb.data.batch.SparkTestConfig
import com.eb.data.batch.UDFUtils
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.mockito.ArgumentCaptor
import org.mockito.ArgumentMatchers
import org.mockito.ArgumentMatchers.any
import org.mockito.ArgumentMatchers.anyString
import org.mockito.ArgumentMatchers.isNull
import org.mockito.MockitoSugar
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.springframework.boot.DefaultApplicationArguments

import java.time.LocalDate
import java.time.LocalTime
import java.time.ZoneId
import java.time.ZonedDateTime

class MyBatchTest extends AnyFlatSpec with Matchers with MockitoSugar {

 val spark = SparkSession.builder
   .master("local[*]")
   .getOrCreate()

 import spark.implicits._

 val testData = DataFrameGenerator.generateSampleDataFrame(
   SV.Year,
   SV.Month,
   SV.Day,
   Seq(
     (SV.St1, Sinks.Q, SFV.LangEngScopeDesktop),
     (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
     (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
     [...]
   ).toDF(
     SourceColumnNames.Instance,
     SourceColumnNames.Sink,
     SourceColumnNames.Filters
   )
 )

 val emptyMongoInstanceDFs = Map[WriteConfig, DataFrame]()

 "My batch" should "work" in {

   val dfMethodsMock = mock[DataFrameMethods]
   val dataFrameCaptor = ArgumentCaptor.forClass(classOf[DataFrame])

   val myBatch = new MyBatch(
     SparkTestConfig.getSparkConfig,
     SparkTestConfig.getInputConfig
   )
   val spyBatch = spy(myBatch)

    doReturn(testData)
      .when(spyBatch)
      .readParquet(
        any[SparkSession],
        ArgumentMatchers.eq("inputPath"),
        ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-01"), LocalTime.MIN, ZoneId.of("UTC"))),
        ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-02"), LocalTime.MIN, ZoneId.of("UTC")))
      )

    doNothing
      .when(dfMethodsMock)
      .save(ArgumentMatchers.eq("outputPath"), ArgumentMatchers.eq(Seq("yyyy", "mm")))

   doReturn(emptyMongoInstanceDFs)
     .when(spyBatch)
     .generateInstanceDFs(any[DataFrame], isNull[String], any[DatabaseConfig])

   spyBatch.run(new DefaultApplicationArguments(""))

   verify(spyBatch)
     .generateInstanceDFs(
       dataFrameCaptor.capture(),
       any(),
       any[DatabaseConfig]
     )

   val result: DataFrame = dataFrameCaptor.getValue.asInstanceOf[DataFrame].cache()

    val expectedData = Seq(
      (SV.St1, SFV.EmptyFilterStr, 1, 1, 3),
      (SV.St1, SFV.LangEngStr, 1, 2, 1),
      [...]
    ).toDF(
      FinalColumnNames.Instance,
      FinalColumnNames.Filters,
      FinalColumnNames.QueryCount,
      [...]
    )

   assert(10 === result.count())
   assert(expectedData.except(result).isEmpty)
   assert(result.except(expectedData).isEmpty)
 }
}

In a follow-up blog post, we will explore how to improve the efficiency of the Spark session creation and reduce the execution time when performing a battery of tests during the same execution. For now, we hope this serves as a helpful guide to performing component testing in Spark. As always, if you have any questions or comments, please feel free to reach out!

]]>
<![CDATA[Image Generation with AI Models]]>Introduction

In recent years, improvements in results from AI models have come from two different sources.

On one hand, the improvement in hardware capabilities in terms of raw computational power has enabled larger models to be trained. As a result, AI models have been able to get bigger and bigger

]]>
https://engineering.empathy.co/image-generation-with-ai-models/6391e50ae4a60d003d233c34Thu, 08 Dec 2022 13:53:07 GMT

Introduction

In recent years, improvements in results from AI models have come from two different sources.

On one hand, the improvement in hardware capabilities in terms of raw computational power has enabled larger models to be trained. As a result, AI models have been able to get bigger and bigger (i.e. have more and more parameters). This increase in parameters has also made the models more precise, but besides hardware and bigger models there is another component: we need a way to train them. To do so, enormous datasets are required; as the number of parameters grows, the dataset size has to grow accordingly. For example, DALL·E 2, one of the biggest models in the text-to-image space with about 3.5 billion parameters, was trained on a dataset of 650 million image-caption pairs obtained from multiple sources all over the internet. (Dataset creation is a whole other issue, so we will not delve further into that topic.)

On the other hand, there have been architectural improvements in these huge text-to-image models that can’t be achieved by brute force alone. In recent years, with the breakthrough of the so-called diffusion models, the results produced by new AI models have improved enormously even without increasing their parameter counts (e.g. DALL·E 2 has fewer parameters than its previous version, but performs much better).

Even though research and development of new models is continual, 2022 has seen explosive growth in the Image Generation field. Where before there were small improvements on a year-to-year basis, this year has brought great improvements on a monthly, weekly, and even daily basis. In fact, while writing this article, Version 2 of Stable Diffusion was just released on November 24. While this article will not cover the latest version of Stable Diffusion, all ways of working related to this model that will be covered can most likely be applied to the latest version.

Getting Started with Image Generation Models

The first step to begin using image generation for personal use is to decide which model to use. As of now, there are three main models for image generation: Stable Diffusion, DALL·E 2 and Midjourney. They are not the only options; others, such as NovelAI or Jasper Art, exist, but this post will focus on the main three.

The six key factors to ponder when deciding which model to use are laid out in the table below:

| | DALL·E 2 | Midjourney | Stable Diffusion |
| --- | --- | --- | --- |
| Openness of use | Open for anyone | Open for anyone | Open for anyone |
| Pricing | Pricing based on resolution of the generated image | Subscription-based with several tiers; first 25 images free | Free model download and usage; paid, credit-based version if using Dream Studio |
| Ease of use | Easy to create good prompts | Easy to create good prompts | Hard to create good prompts; steep learning curve |
| Community support | Subreddit; few community-based applications | Subreddit and Discord | Subreddit; many community-based applications; tools to find good prompts |
| Content policies | Automatic filters to avoid disturbing or offensive content; the user has full responsibility for the content generated | Rules on Discord to avoid disturbing or offensive content; the user has full responsibility for the content generated | Safety filters on Dream Studio; filters can be disabled when downloading and using the model; the user has full responsibility for the content generated |
| Results obtained | Consistent quality of results, better for enterprise use | Leaning towards more artistic/conceptual generations | Inconsistent due to difficult prompt creation, but quality results can still easily be obtained |

After deciding which model to use, the next step is learning how to use it. DALL·E 2 can be used on the OpenAI website. To use Midjourney, you need to join their Discord server, which requires a Discord account.

[Image: Dream Studio web interface]
Credit: https://medium.com/codex/dream-studio-stable-diffusions-ai-art-web-app-tool-5687e75cc100

Finally, for Stable Diffusion there are three options:

  • Using Dream Studio: This is the official tool, based on the web, released by StabilityAI, the company behind Stable Diffusion.
  • Deploying and using the model on a personal computer: With a GPU with 8+ GB of VRAM, Stable Diffusion can run on your personal computer. You can follow these steps on Reddit to get going.
  • Using WebUI forks developed from the community, running on Google Colab: This is my go-to choice, as it can be run using the free tier of Google Colab and there is no need for a beefy computer. I like this Github repository that has the tooling itself and Google Colab notebooks ready to use.

Regardless of the option chosen for running Stable Diffusion, I recommend watching this YouTube video (in Spanish), where all the concepts are explained in more depth, as well as exploring the Stable Diffusion community subreddit to look for prompts and updates on community tooling. Another place to look for prompts is Lexica Art.

Model Comparison

Now that it is clear what each of these models offers and how to begin using them, it is time for a visual comparison between the images that they generate. As part of this section, a direct comparison using the same subset of prompts is shown below. (Please note that the images shown are not mine nor Empathy.co’s. You can view the originals here.)

[Images: results from DALL·E 2, Midjourney and Stable Diffusion for the same set of prompts]


As shown in the images above, all three models produce high-quality results, though with some clear differences. Stable Diffusion is the most inconsistent style-wise, sometimes producing realistic results, like the snow-covered mountain, and other times painting-like ones, such as the cherry blossom image.

DALL·E 2 and Midjourney are much more consistent in their respective styles, with DALL·E 2 being the most consistent and having a style that is more suited to enterprise use. Midjourney’s style leans more towards conceptual art-like results but the quality is a bit less consistent in comparison with the results from DALL·E 2 (e.g. its Cherry Blossom image vs. the others).

What can these models be used for?

These powerful image generation models present new workflows and applications for art. For example, they can be used to generate icon-like images for programs, generate art for characters and places in tabletop role-playing games, transform video recordings, be included in current workflows for artists and graphic designers, and the list goes on.

One Last Thing…

There is an important topic to consider regarding Image Generation models (and pretty much all AI-related models): the ethical concerns that arise from these new technologies.

The conversation is mounting around how these models were trained. As noted in the Introduction, all of them were trained on huge image datasets obtained from public sources, mainly the internet. The problem is that part of each dataset is formed by images made by artists, including copyrighted ones. That, combined with the fact that these AIs (particularly Stable Diffusion) can be guided toward certain styles, means some generated images bear a clear similarity to specific artists’ styles. Artists from all over the world have raised the issue of plagiarism. Are the images generated by these AIs plagiarised? Is it right to use them? The discussion is ongoing.

Another related concern is that if images can be described and generated in a matter of seconds, artists will have fewer commissions and receive less work. This is understandably concerning for visual artists, and not to be taken lightly. Controversy also arose after an AI-generated image won first place in Colorado State Fair’s fine arts competition. The decision was met with backlash and claims that the winner cheated, and regulations banning AI submissions in those kinds of competitions have been requested.

Privacy is also a matter that needs to be attended to, as a report from Ars Technica claimed that the training dataset used by Stable Diffusion contained hundreds or even thousands of private medical records. While the responsibility of filtering these images from the dataset is not clear, LAION, the company behind the dataset used by Stable Diffusion, says that it does not host these images on its servers and that filtering sensitive information is the responsibility of the companies that create these AI models. However, LAION provides the URLs for the images used in its datasets, so it arguably shares part of the responsibility.

That said, enjoy your journey of creating AI-generated images while respecting the privacy and intellectual property of others!


]]>
<![CDATA[Progressive Delivery in Kubernetes: Analysis]]>Overview

In this post, an analysis of Progressive Delivery options in the Cloud Native landscape will be done to explore how this enhancement can be added in a Kubernetes environment. The most embraced tools from the Cloud Native landscape will be analyzed (Argo Rollouts and Flagger), along with some takeaways

]]>
https://engineering.empathy.co/progressive-delivery-in-kubernetes-analysis/63878ba4ee3d91003d957e08Thu, 01 Dec 2022 09:29:54 GMT

Overview

In this post, we analyze Progressive Delivery options in the Cloud Native landscape to explore how this enhancement can be added to a Kubernetes environment. The most widely adopted tools (Argo Rollouts and Flagger) will be analyzed, along with some takeaways and the selection that best fits Empathy.co’s needs.

Motivation

The native Kubernetes Deployment object supports the Rolling Update strategy, which provides basic guarantees during an update but comes with limitations:

  • Few controls over the speed of the rollout
  • Inability to control traffic flow to the new version
  • Readiness probes are unsuitable for deeper, stress or one-time checks.
  • No ability to check external metrics to verify an update
  • No ability to automatically abort and rollback the update

For the reasons above, in a complex production environment this Rolling Update can be risky: it provides no control over the blast radius, may roll out too aggressively, and offers no rollback automation in case of failure.

Requirements

Although there are multiple tools that offer multiple deployment capabilities, the minimum requirements would be:

  • GitOps approach: The tool chosen should work under GitOps approach, no manual changes required
  • NGINX Ingress Controller compatibility: The goal is to add more deployment alternatives, not to research a different Kubernetes ingress controller
  • Prometheus analysis compatibility: The metrics are saved on Prometheus, the tool should allow using Prometheus queries to perform a measurement.
  • Compatible with multiple Service Mesh options: The chosen tool should not be tied to a specific service mesh. This will allow us to experiment with multiple service meshes in the future.
  • GUI: A UI would be valuable to have; at the very least, some kind of deployment tracing showing what is happening under the hood during the alternative deployments is necessary.

Affected/Related Systems

  • Kubernetes deployment methods
  • Application delivery strategies from Teams

Current Design

Native Kubernetes Deployment Objects:

  • Rolling Update: A Rolling Update slowly replaces the old version with the new version. This is the default strategy of the Deployment object
  • Recreate: Deletes the old version of the application before bringing up the new version. This ensures that two versions of the application never run at the same time, but there is downtime during the deployment.
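For reference, both strategies are configured on the Deployment object itself. A minimal manifest sketch (names and values are illustrative) showing the Rolling Update knobs:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # hypothetical application name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate     # the default; set to "Recreate" to stop all old pods first
    rollingUpdate:
      maxSurge: 25%         # extra pods allowed above the desired replica count
      maxUnavailable: 25%   # pods that may be unavailable during the update
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:v2
```

`maxSurge` and `maxUnavailable` are essentially the only levers the native strategy offers, which illustrates the limitations listed earlier.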

Proposed Design

The aspirational goal is to add extra deployment capabilities to the current Kubernetes cluster and, therefore, increase the agility and confidence of application teams by reducing the risk of outages when deploying new releases.

The main benefits would be:

  • Safer Releases: Reduce the risk of introducing a new software version in production by gradually shifting traffic to the new version while measuring metrics like request success rate and latency.
  • Flexible Traffic Routing: Shift and route traffic between app versions with the possibility of using a service mesh (Linkerd, Istio, Kuma...) or not (Contour, NGINX, Traefik...)
  • Extensible Validation: Extend the application analysis with custom metrics and webhooks for acceptance tests, load tests or any other custom validation.
  • Progressive Delivery: Alternatives deployment strategies:
  • Canary (progressive traffic shifting)
  • A/B Testing (HTTP headers and cookies traffic routing): Called experiments by Argo Rollouts, although Canaries could have specific headers too.
  • Blue/Green (traffic switching and mirroring)

Concepts

Blue/Green


A Blue/Green deployment has both the new and the old version of the application deployed at the same time. During this time, only the old version of the application receives production traffic. This allows the developers to run tests against the new version before switching live traffic over to it.

[Diagram: Blue/Green deployment]

Canary

A Canary deployment exposes a subset of users to the new version of the application while serving the rest of the traffic to the old version. Once the new version is verified as being correct, it can gradually replace the old version. Ingress controllers and service meshes, such as NGINX and Istio, enable more sophisticated traffic shaping patterns for canarying than what is natively available (e.g. achieving very fine-grained traffic splitting, or splitting based on HTTP headers).
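As an example of the weight- and header-based splitting just mentioned, the NGINX Ingress Controller implements canarying through annotations on a second Ingress pointing at the canary Service (host and service names here are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "25"          # 25% of traffic to the canary
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary" # header routing takes precedence over weight
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-canary
                port:
                  number: 80
```

Tools like Argo Rollouts and Flagger automate updating exactly this kind of weight during a progressive rollout.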

[Diagram: Canary deployment shifting 25%/75% of traffic to the new version]

The picture above shows a Canary with two stages (25% and 75% of traffic goes to the new version), but this is just an example. Argo Rollouts allow multiple stages and percentages of traffic to be defined for each use case.
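A staged canary like that could be expressed with the Rollout CRD roughly as follows (names are illustrative; the selector and pod template are written exactly as in a Deployment):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:v2
  strategy:
    canary:
      steps:
        - setWeight: 25            # send 25% of traffic to the new version
        - pause: {duration: 10m}   # hold while metrics are observed
        - setWeight: 75            # then 75%
        - pause: {duration: 10m}
```

Any number of steps can be chained, and pauses without a duration halt the rollout until it is promoted manually.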

Tools

The two leading projects are Argo Rollouts and Flagger. Both are mature and widely used.

Argo Rollouts

Argo Rollouts is a Kubernetes Controller and set of CRDs which provide advanced deployment capabilities such as Blue/Green, Canary, Canary analysis, experimentation and progressive delivery features to Kubernetes. A UI is deployed to see the different Rollouts.

Argo Rollouts supports the two kinds of rollouts described above: Blue/Green and Canary.

Argo Rollouts offers experiments that allow users to have ephemeral runs of one or more ReplicaSets and run AnalysisRuns along those ReplicaSets to confirm everything is running as expected. Some use cases of experiments could be:

  • Deploying two versions of an application for a specific duration to enable the analysis of the application.
  • Using experiments to enable A/B/C testing by launching multiple experiments with a different version of their application for a long duration.
  • Launching a new version of an existing application with different labels to avoid receiving traffic from a Kubernetes service. The user can run tests against the new version before continuing the Rollout.

A/B testing could be performed using Argo Rollouts experiments.

There are several ways to perform analysis to drive progressive delivery:

  • AnalysisRuns are like Jobs, in that they eventually complete; the result of the run determines whether the Rollout's update will continue, abort or pause. AnalysisRuns accept templating, making it easy to parametrize analysis.
  • AnalysisRuns accept multiple data sources, such as:
      • Prometheus: query the application's metrics to detect whether the service's performance degrades during the deployment
      • CloudWatch: query AWS metrics to check that everything is fine during the deployment
      • Web: perform an HTTP request and compare the result against a JSON response
      • Job: execute a custom script that succeeds or fails
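As an illustration, a Prometheus-backed success-rate check could be declared roughly like this (a sketch; the query, threshold and address are made up, but the AnalysisTemplate structure follows the Argo Rollouts CRD):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate                      # hypothetical name
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95          # fail the run below 95% success
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # illustrative address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```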

Traffic Management

Observability

Migration

Pain Points:

  • RBAC & Authentication
  • Non-native integration: Argo Rollouts uses its own Rollout CRD rather than the native Kubernetes Deployment

Flagger

Flagger is part of the Flux family of GitOps tools. Flagger is pretty similar to Argo Rollouts and its main highlights are:

  • Native integration: it watches Deployment resources, so there is no need to manage them through a custom CRD
  • Highly extensible and comes with batteries included: it provides a load-tester to run basic or complex scenarios

When you create a deployment, Flagger generates duplicate resources of your app (including configmaps and secrets). It creates Kubernetes objects with <targetRef.name>-primary and a service endpoint to the primary deployment.

It employs the same concepts about Canary, Blue/Green and A/B Testing as Argo Rollouts does.

Observability

Pain Points:

  • No UI, so no RBAC or authentication is needed, but it's harder to get fast feedback on the current status of the rollouts. Checking the logs or the status of the Canary resources is the only way.
  • No kubectl plugin to check how a deployment is going; it's necessary to tail `kubectl logs -f flagger-controller` or run `kubectl describe` on the Canary resource to check the progress.
  • Documentation could be better.
  • Blue/Green is an adapted Canary (the same as a Canary but with 100% weight)

Questions

What happens if the controller is down?

Argo Rollouts

  • If there are Rollout changes while the Argo Rollouts controller is down, the controller will pick up the latest changes when it recovers; it will not resume from where the Rollout was.
  • If there is no new commit while the controller is down, the controller reconciles the status automatically. If the Rollout is in step 3 when the controller goes down, it will pick up from the same spot once it is back up.

Flagger

  • Like Argo Rollouts, it reconciles well enough.
  • The difference is that it resumes the configured steps, instead of applying the previous changes and then the latest ones.
  • New rollouts/deployments will be blocked while the controller is down, but the pods and HPA will remain up and running, even if it breaks in the middle of a rollout/deployment. Both controllers reconcile automatically after recovery.

What happens with the dashboards? Any changes?

Argo Rollouts

  • Although we don't have a Deployment resource, metrics from deployments won't disappear.

Flagger

  • Deployment resource is there, so no changes are expected.

What happens when a Canary is paused on the GUI or command line? Is the GitOps setup going to override the change?

Argo Rollouts

  • It can be done easily from the GUI or from the kubectl command line; ArgoCD will be notified of the RolloutAbort.
  • It can be retried easily from the GUI or via kubectl commands; ArgoCD will mark the Rollout as in progress.

Flagger

  • It doesn't seem possible to pause the deployment using the command line; the Flagger Tester API needs to be deployed.

What happens when a rollback occurs? What happens with the GitOps setup?

  • Argo Rollouts is integrated with ArgoCD and the progress of the Rollout can be seen from ArgoCD UI.
  • Flagger is not integrated with ArgoCD as seamlessly as Argo Rollouts; a bunch of resources are created and visible in the ArgoCD UI, but there is no feedback on the rollout's progress.

What happens in lower environments with a Canary deployment if there is not enough traffic?

Argo Rollouts

  • Argo Rollouts doesn't currently have a built-in way to run a load test directly but, as a workaround, webhooks can be used to launch a k6 load test, as seen in this issue in their project.
  • The load test has to be controlled externally; specifically, it must be stopped when the Canary reaches the required step.

Flagger

How does Canary traffic management work without a service mesh?

  • In the absence of a traffic routing provider, both options can handle the Canary weights using NGINX capabilities. Besides, both support SMI and offer a broad selection of service mesh integrations, so the choice of service mesh is not a blocker for either tool.
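With the NGINX ingress controller, for example, the Canary weights can be delegated to the controller with a trafficRouting block roughly like this (Argo Rollouts syntax; the ingress name is hypothetical):

```yaml
spec:
  strategy:
    canary:
      trafficRouting:
        nginx:
          stableIngress: demo-app-ingress   # existing ingress routing to the stable service
      steps:
        - setWeight: 25
        - pause: {}
```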

What happens when a configMap or secret used by the Deployment (as volume mounts, environment variables) are changed?

Argo Rollouts

  • There is no support for that in Argo Rollouts, but there is an open issue in their project.
  • A workaround is needed to keep rollout and rollback available when only a ConfigMap changes. The workaround consists of:
      • A random suffix in the ConfigMap name
      • Defining the ConfigMap and the Deployment in the same .yaml to avoid creating multiple random suffixes
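The workaround could look roughly like this (hypothetical names; the ConfigMap and the Rollout live in the same manifest, and the suffix is regenerated only when the ConfigMap content changes):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-c0ffee          # random suffix, regenerated on content changes
data:
  FEATURE_FLAG: "true"
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: example/app:1.0.0
          envFrom:
            - configMapRef:
                name: app-config-c0ffee   # must reference the suffixed name
```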

Flagger

  • Using the Helm annotation trick to automatically roll out deployments when the ConfigMap changes works well enough for a rollout. But for a rollback after the rollout, the same issue as with Deployments and ConfigMaps may appear, because there is only one ConfigMap, not multiple. That means the rollback workaround would have to be done in the same way as with Argo Rollouts.
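The Helm annotation trick mentioned above usually looks like this (a standard Helm pattern; the template path is illustrative): the checksum changes whenever the ConfigMap content does, which forces a new rollout of the Deployment.

```yaml
spec:
  template:
    metadata:
      annotations:
        checksum/config: '{{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}'
```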

To Sum Up

Both tools will help us achieve alternative deployment strategies, though there are some tradeoffs with each:

Argo Rollouts

Pros

  • Great UI, fast feedback
  • Great integration and feedback with ArgoCD, indicating if the Rollout is in progress
  • Easy integration with current Deployment resources
  • Documentation

Cons

  • RBAC and authentication need to be handled for the UI
  • Non-native integration: it uses its own Rollout CRD instead of the native Deployment resource

Flagger

Pros

  • Kubernetes native, doesn't introduce new Kubernetes resources
  • Loadtest integrated

Cons

  • No UI; feedback needs to be gathered through the K8s API
  • Zero feedback from ArgoCD; Flagger integrates better with Flux, based on their documentation
  • Documentation could be better
  • Main differences with Argo Rollouts:
      • Feedback only available through kubectl commands
      • Blue/Green is an adapted Canary (the same as a Canary but with 100% weight, as confirmed after some tests)

At Empathy, the tool we have chosen is Argo Rollouts. It fits our needs pretty well, offers faster feedback, has great integration with ArgoCD, and is open to more complex strategies.

What's next?

  • Choose your fighter and adapt the strategies to your applications. Some apps will likely fit better with a Blue/Green approach and others with a Canary approach.
  • Demo Session in lower environments.
  • Plan migration with Teams.
  • Capabilities could be improved in the future if/when a Service Mesh is added to the Platform.

References

]]>
<![CDATA[Tailwind + Vue No-code Editor]]>
]]>
https://engineering.empathy.co/tailwind-vue-no-code-editor/6368d303134943003dfca6e5
Mon, 07 Nov 2022 15:20:25 GMT

There has been a lot of talk about no-code solutions lately. This movement aims to reach non-developers by offering software development tools that allow them to create and modify applications without writing code. The benefits of no-code tools include speed, accessibility, reduced costs and autonomy.

Thinking about this idea, I wondered how to create a no-code editor for a web application. But, since a tool like this would be huge for a single post, I decided to focus only on the personalization of the styles and themes.

So, I chose to rely on one of the most popular CSS frameworks at the moment: Tailwind. Not because of its usual use, but for all the tools it has in terms of configuration and CSS generation.

The idea is to create a frontend interface which allows the Tailwind configuration to be modified in real time and shows the result styles applied. Then, this customized configuration could be stored and used in the build and deployment process of a hypothetical application.

However, in this article, we are going to focus only on the editor and how to achieve a real-time preview of the Tailwind config changes.

To do so, we are going to create a simple service in Node using ExpressJs. This service will receive the Tailwind configuration from the frontend editor and run PostCSS with the Tailwind plugin to generate the CSS. Finally, the service will return the generated CSS to the editor, which will update the page to show the changes.

We could try to run PostCSS and the Tailwind plugin directly in the browser using Node polyfills, which is how Tailwind Play does it internally. Another option would be to use WebContainers but, for simplicity's sake, we are going to run it in a simple Node service.

Creating the Project

Let’s create a new project called tailwind-editor with Vite running npm create. I’m going to use Vue for the frontend because I’m more comfortable with it and also because it is awesome 😉.

$ npm create vite@latest
✔ Project name: … tailwind-editor
✔ Select a framework: › Vue
✔ Select a variant: › JavaScript

Then, add the dependencies for the service.

$ cd tailwind-editor
$ npm install --save express cors postcss tailwindcss

The resulting package:

{
  "name": "tailwind-editor",
  "private": true,
  "version": "0.0.0",
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "vite build",
    "preview": "vite preview"
  },
  "dependencies": {
    "cors": "^2.8.5",
    "express": "^4.18.1",
    "postcss": "^8.4.17",
    "tailwindcss": "^3.1.8",
    "vue": "^3.2.37"
  },
  "devDependencies": {
    "@vitejs/plugin-vue": "^3.1.0",
    "vite": "^3.1.0"
  }
}

Creating the Tailwind CSS Service

Now, we are going to create the service that will receive the Tailwind config and return the resulting CSS.

Let’s start with the file src/tailwind-as-a-service.js, which contains the Express server with the cors middleware to support cross-origin calls. It listens on port 8080 and returns the text Hello World for any GET request to the root path.

// src/tailwind-as-a-service.js

import express from 'express';
import cors from 'cors';

const app = express()
app.use(cors());
const port = 8080

app.get('/', (req, res) => {
  res.send('Hello World!')
})

app.listen(port, () => {
  console.log(`Tailwind as a service listening on port ${port}`)
})

Running the server in node lets you check the response directly in the browser:

$ node ./src/tailwind-as-a-service.js

Since we are using Vite and ES modules, the minimum Node version required to follow this article is 14.18+.

So far, so good. Now we are going to configure postcss and its Tailwind plugin to return CSS:

// src/tailwind-as-a-service.js
......
import postcss from 'postcss';
import tailwindcss from 'tailwindcss';

......

const defaultCss = `
  @import 'tailwindcss/base';
  @import 'tailwindcss/components';
  @import 'tailwindcss/utilities';
`;

app.get('/', async (req, res) => {
  const configuredTailwind = tailwindcss({
    content: [{ raw: '<div class="bg-red-500">', extension: 'html' }]
  });
  const postcssProcessor = postcss([configuredTailwind]);
  const { css } = await postcssProcessor.process(defaultCss);
  res.send(css);
});

......

We just added the postcss and tailwindcss dependencies. Next, we have to configure the Tailwind plugin for PostCSS with the content option.

This option tells Tailwind to inspect the HTML, JavaScript components, and other files, to look for CSS classes to generate and include its CSS in the final result. It also allows us to write raw HTML inline.

After that, it is time to create a postcssProcessor with the configured Tailwind plugin, which is responsible for parsing the CSS and applying all PostCSS plugins.

Finally, we process a “fake” CSS file with the default base, components and utility styles of Tailwind. This is necessary to make Tailwind generate all necessary CSS.

The result CSS is returned in the response. So if we run the service with node ./src/tailwind-as-a-service.js again, and we request it from the browser, the resulting CSS will be shown:

Here, you can see the base CSS Tailwind provides by default and also the .bg-red-500 class at the end that we are passing as raw HTML to the Tailwind config.

So, we have a service to request and return the CSS, but how do we configure that CSS? Let’s make this service receive parameters and use them to configure the Tailwind plugin:

// src/tailwind-as-a-service.js
......
app.use(express.json()); // needed so req.body is populated from the JSON payload
......

const defaultCss = `
  @import 'tailwindcss/base';
  @import 'tailwindcss/components';
  @import 'tailwindcss/utilities';
`;

app.post('/', async (req, res) => {
  const configuredTailwind = tailwindcss({
    content: [{ raw: req.body.html, extension: 'html' }],
    theme: req.body.theme
  });
  const postcssProcessor = postcss([configuredTailwind]);
  const { css } = await postcssProcessor.process(defaultCss);
  res.send(css);
});

......

Here, we changed the .get method into .post to be able to send parameters in the request body, which is better suited to larger payloads. Note that the express.json() middleware has to be registered on the app so that req.body is populated.

Moreover, we pick the html and theme parameters from the request body and use them to configure Tailwind.

For simplicity, we are using only the theme part of the Tailwind configuration, but this approach allows any part of it to be configured.

Creating the Editor

We are going to define some custom configurations for the Tailwind theme, which users can then modify through an interface. Below are the defined values for a simple example of components: buttons and titles. To keep the scope small, for this example we only allow some colors and font properties to be changed:

// src/custom-tailwind-config.js

export const customTailwindConfig = {
  colors: {
    primary: {
      25: '#cdd3d6',
      50: '#243d48',
      75: '#1b2d36'
    },
    secondary: {
      25: '#bfe1ec',
      50: '#0086b2',
      75: '#006485'
    },
    success: {
      25: '#ecfdf5',
      50: '#10b981',
      75: '#065f46'
    },
    warning: {
      25: '#fffbeb',
      50: '#f59e0b',
      75: '#92400e'
    },
    error: {
      25: '#fef2f2',
      50: '#ef4444',
      75: '#991b1b'
    },
    title: {
      1: '#000',
      2: '#a3a3a3',
      3: '#000',
      4: '#0e7490'
    }
  },
  fontSize: {
    button: '1rem',
    'size-title1': '2rem',
    'size-title2': '1.5rem',
    'size-title3': '1.25rem',
    'size-title4': '1.125rem'
  },
  fontWeight: {
    'weight-button': '400',
    'weight-title1': '700',
    'weight-title2': '700',
    'weight-title3': '400',
    'weight-title4': '400'
  }
};
Keep in mind that we are using the Tailwind theme configuration for the sake of simplicity – it is not the only way to achieve this result. The whole Tailwind configuration could be overridden, including the plugins used and/or their configuration. For example, you could create your own Tailwind plugin and add all your CSS components based on a configuration passed to the plugin. This configuration could then be passed as a parameter to the Tailwind service, as we are doing here with the theme configuration.

The next step is to go to the App.vue component and remove all default content, then add some buttons and titles using the CSS utility classes that Tailwind generates with the theme configuration that was just defined:

// src/App.vue

<template>
  <section class="flex flex-col gap-10 min-w-[200px] m-10">
    <section class="flex flex-col gap-10">
      <button class="w-40 h-8 rounded bg-primary-50 hover:bg-primary-75 text-primary-25 hover:text-primary-25 text-button font-weight-button">
        Button Primary
      </button>
      <button class="w-40 h-8 rounded bg-secondary-50 hover:bg-secondary-75 text-secondary-25 hover:text-secondary-25 text-button font-weight-button">
        Button Secondary
      </button>
      <button class="w-40 h-8 rounded bg-success-50 hover:bg-success-75 text-success-25 hover:text-success-25 text-button font-weight-button">
        Button Success
      </button>
      <button class="w-40 h-8 rounded bg-warning-50 hover:bg-warning-75 text-warning-25 hover:text-warning-25 text-button font-weight-button">
        Button Warning
      </button>
      <button class="w-40 h-8 rounded bg-error-50 hover:bg-error-75 text-error-25 hover:text-error-25 text-button font-weight-button">
        Button Error
      </button>
    </section>
  
    <section class="flex flex-col gap-10 m-10">
      <h1 class="text-title-1 text-size-title1 font-weight-title1">
        Title 1
      </h1>
      <h2 class="text-title-2 text-size-title2 font-weight-title2">
        Title 2
      </h2>
      <h3 class="text-title-3 text-size-title3 font-weight-title3">
        Title 3
      </h3>
      <h4 class="text-title-4 text-size-title4 font-weight-title4">
        Title 4
      </h4>
    </section>
  </section>
</template>

Before running the dev script to serve the Vue project, we need to modify this script in package.json to parallelize its execution with the Tailwind service:

"dev": "node ./src/tailwind-as-a-service.js & vite",

Afterwards, we run the project and visit the locally-served URL:

$ npm run dev

We are using the latest version of Vite, so the default port is 5173:

Finally, we open that URL in the browser and… oops! No styles?! What’s going on? 🤯

The styles are not being applied because we are not using Tailwind directly in our Vue project as we normally would. Instead, we have to call the tailwind-as-a-service endpoint to retrieve the CSS. Okay then, let’s create the function to call the service.

We create a new file fetch-css.js in the src directory:

// src/fetch-css.js

export async function fetchCss(tailwindCustomConfig) {
  return await fetch('http://localhost:8080', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      html: document.body.innerHTML,
      theme: {
        extend: tailwindCustomConfig
      }
    })
  }).then(response => response.text());
}

This async function receives the custom Tailwind config and fetches from the Tailwind service, passing it as the theme: { extend: tailwindCustomConfig } parameter. We pass it inside extend to keep all the default utilities Tailwind provides and only add the new ones we need. We also grab all the current HTML on the page and send it to the service as the html parameter. Tailwind will use this HTML to know which CSS classes to generate and which to skip.

The following step is to create a new component Editor.vue to use that function:

// src/components/Editor.vue

<script setup>
  import { onMounted, ref } from 'vue';
  import { customTailwindConfig } from '../custom-tailwind-config.js';
  import { fetchCss } from '../fetch-css.js';

  const css = ref('');

  async function getCss() {
    css.value = await fetchCss(customTailwindConfig);
  }

  onMounted(getCss);
</script>

<template>
  <component is="style">{{ css }}</component>
</template>

There are many things going on here. Let’s take a closer look:

  • We import the previous customTailwindConfig and the fetchCss function.
  • We add a css ref. If you are not already familiar with it, check out the new Vue Composition API documentation.
  • We create a new function getCss which calls the fetchCss and assigns the returned promise value to the ref value.
  • We use the onMounted Vue lifecycle hook to call the previous function whenever the component is mounted.
  • Finally, we create a dynamic component to attach the CSS to the DOM. This dynamic component will render the CSS inside a <style> tag, when the css ref is updated. That way, whenever we update the css ref value, the styles will be updated.
We use a dynamic component as a workaround because the <style> tag is not allowed inside the <template> tag by the Vue template compiler.

Now, we import and use the Editor.vue component inside the App.vue:

// src/App.vue

<template>

  <div class="flex flex-row">
		......

    <Editor/>
  </div>
</template>

<script setup>
  import Editor from './components/Editor.vue';
</script>


Ready to see the styles? Reload the URL and they will appear:

Finally, the last step is to make the Editor.vue modify the default theme from the Tailwind configuration and request the CSS again from the service to see a live view of the changes.

We are starting with the colors, adding a color picker for each color we want to configure.

// src/components/Editor.vue
<script setup>
  import { onMounted, reactive, ref, watch } from 'vue';
  ......

  const css = ref('');
  const editableCustomConfig = reactive(customTailwindConfig);

  async function getCss() {
    css.value = await fetchCss(editableCustomConfig);
  }

  onMounted(getCss);
  watch(editableCustomConfig, getCss);

</script>

<template>

  <div class="flex flex-col flex-nowrap gap-10 w-1/2  m-10">

    <h2 class="font-bold">Colors</h2>
    <section class="flex flex-row flex-wrap gap-10">

      <div v-for="(color, colorName) in editableCustomConfig.colors"
           style="display: flex; flex-flow: column nowrap;">
        <label v-for="(_, shadeName) in color">{{ colorName }} {{ shadeName }}
          <input type="color" v-model.lazy="color[shadeName]">
        </label>
      </div>

    </section>
  </div>

  <component is="style">{{ css }}</component>
</template>

In this step, we are making several changes to be able to modify the configuration reactively:

In the <script>:

  • First, we create a reactive object with our customTailwindConfig as the initial value.
  • Then we fetch the CSS with this reactive object.
  • Finally, we add a watch to call the getCss function whenever this reactive object changes.

In the <template>:

  • We add two loops v-for to iterate over each color and each shade, binding a color picker to the value.
  • We use the v-model directive with the lazy modifier so that not too many requests are made whenever we move the selector over the color picker.
Notice that we are binding the color shade to the v-model using the color and the shade name, instead of using the v-for variable directly. This is because the variable used to iterate in v-for loops cannot be modified; this workaround lets us modify the value indirectly.

If we run the application again, we can see the color pickers. Now, by changing a color, the component using that color will be updated automatically:

Finally, we configure font size and font weight:

// src/components/Editor.vue
<script setup>
......
</script>

<template>
......

    <h2 class="font-bold">Font Sizes</h2>
    <section class="flex flex-row flex-wrap gap-10">
      <label v-for="(_, sizeName) in editableCustomConfig.fontSize">{{
          sizeName.replace('size-', '')
        }}
        <input type="number"
               step="0.125"
               class="w-14 border border-black text-center"
               :value="editableCustomConfig.fontSize[sizeName].replace('rem','')"
               @input="event=> editableCustomConfig.fontSize[sizeName] = event.target.value + 'rem'">rem
      </label>
    </section>

    <h2 class="font-bold">Font Weight</h2>
    <section class="flex flex-row flex-wrap gap-10">
      <label v-for="(_, weightName) in editableCustomConfig.fontWeight">{{
          weightName.replace('weight-', '')
        }}
        <input type="number"
               step="100"
               min="100"
               max="900"
               class="w-14 border border-black text-center"
               v-model="editableCustomConfig.fontWeight[weightName]">
      </label>
    </section>
  </div>

  <component is="style">{{ css }}</component>
</template>

Here, we are repeating the same principle used for the colors, but with a single v-for for each case. Moreover, for the font size, the rem unit has to be stripped before displaying the value and added back before it is passed to the Tailwind configuration.

Alright, the moment you have been waiting for has arrived! This is what the editor looks like:

This editor is the starting point for creating your own no-code tool that allows you to configure your project and see the changes on the fly. Remember that using theme values is not the only way to make this configurable; you can also use all the Tailwind config options.

Why not CSS custom properties, AKA CSS variables?

Sure, we could achieve exactly the same result by using CSS variables as values in the Tailwind theme configuration. Modifying the values of these variables in the frontend would remove the need for any additional service running the Tailwind process on the fly. Then, after saving these variables and loading them in production, the changes would be deployed.
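That alternative would look roughly like this (a hypothetical sketch, not part of the editor built in this article): the theme values point at CSS custom properties, so updating a variable in the browser restyles components without re-running Tailwind.

```javascript
// Hypothetical Tailwind theme where every color value is a CSS custom property.
const themeWithCssVars = {
  colors: {
    primary: {
      25: 'var(--color-primary-25)',
      50: 'var(--color-primary-50)',
      75: 'var(--color-primary-75)',
    },
  },
};

// In the browser, changing the variable updates every component instantly,
// with no request to a Tailwind service:
//   document.documentElement.style.setProperty('--color-primary-50', '#0086b2');
```

Tailwind only generates the utility classes once; afterwards, the browser resolves the variables at paint time, which is exactly why no extra service is needed in this approach.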

But there’s a reason, promise!

Why

In this example, we are only modifying the theme part of the Tailwind configuration, but there are plenty more options in that configuration.

Imagine you created a Tailwind plugin which adds your own Design Components and CSS utilities. These plugins can have options, too.

So, with the solution laid out here, you can modify all these possible configurations and options and see the results immediately.

Also, you can use the service created to directly save the configuration in your database (by user or customer) to later retrieve it during the deployment process or for any other purpose you need.

Ready to get started? All the working code is available here. Please, feel free to open issues and give feedback. Also, if you find it useful, a star would be much appreciated 😉.

]]>
<![CDATA[Building a Future-proof Developer Documentation Site]]>
]]>
https://engineering.empathy.co/building-a-future-proof-developer-documentation-site/633ae288688e8e003daa7c47
Tue, 18 Oct 2022 06:23:08 GMT

Motivation

As time goes on, lots of things become outdated sooner or later, and documentation is no exception. But when documentation becomes outdated, it automatically loses its essence – to be consulted for valuable information. In modern software development, where Application teams are continuously adding features to their services and Platform teams are continuously evolving the platform and infrastructure to accommodate all the workloads, it is extremely easy for documentation to get out of date in just a couple of weeks.

This is something the Empathy.co Platform Engineering team experienced firsthand when we realized there was an issue with how Development teams learn how to work with the platform provided for them. We noticed that there wasn't a clear path for everyone to follow when using the platform tools. While we had some documentation available that we felt was sufficient, we realized it wasn't as complete or as maintainable as we thought. So, we had to find a solution.

Design Thinking

The following items were identified as the main pain points of the problem:

  • The documentation was spread across various places and different tools.
  • Some documentation was out of date, meaning it was no longer useful and was sometimes confusing to readers.
  • With the fast growth of the company, new colleagues didn't even know where various pieces of documentation were located.
  • The overall process was unmaintainable and not future-proof.

Scope

There are several types of documentation, and not all of them should be handled the same way, nor do they have the same lifecycle. Since not every role at the company has the same skill set – such as fluently writing Markdown documents or working with Git – we concentrated our efforts specifically on technical documentation. That way, the skill set of those interacting with the documentation was more clearly defined.

More specifically, we focused on the documentation about the Internal Development Platform (IDP, for short) that Platform Engineering needs to share with the Application teams, like CI/CD processes, monitoring, logging, etc. Since the nature of the IDP is to help developers by using certain tools and practices, we like to refer to this documentation as Developer Workflow (DW).

Centralization

Aiming to centralize the technical documentation in a single place and tool, we decided to deploy Backstage to the platform. Backstage is a tool developed by Spotify that, among other features, enables the creation of documentation sites out of the Markdown documentation stored in a Git repository, using MkDocs.

Hosting the documentation in Git repositories also provides the benefits of using Git, like versioning, tracking, code reviewing, etc.

Content Up To Date

Unfortunately, there's no tool or magic wand that keeps the documentation up to date. It is a continuous effort, so it should be included in teams' ways of working. From our experience, it is very helpful to take into consideration the time it will take to create or update the documentation when estimating a given task. A task shouldn't be marked as finished if the documentation hasn't been reviewed and updated.

In this specific case, it was hard work reviewing all the documents that needed to be updated, created, or deleted accordingly. It is always better to keep the documentation up to date by making small updates frequently, rather than updating a large amount of content infrequently.

Visibility

As stated above, one of the main pain points of the previous solution was that the stakeholders weren't aware of where to find certain articles within the documentation. Adding Backstage helped centralize all the technical documentation into a single place, but the problem wasn't yet solved. Backstage was still unfamiliar to some people, mostly those who recently joined the company. We also received feedback about the search experience not working as well as we had hoped, so we decided to create a separate site to host only the Developer Workflow documentation and reduce all the surrounding elements that could prevent users from finding the necessary documents.

We didn't want to split the documentation again, so we decided to keep it in Backstage and the standalone website, using the exact same source of truth for both to avoid duplicating the content. This part of the solution will be explored more in-depth, later.

Maintainability

Documentation evolves over time, and as part of that evolution, things are likely to break. A few of the issues that can occur are:

  • Broken hyperlinks
  • Tables of contents not matching page sections
  • Navigation bars showing items that no longer exist or not showing items that should be shown

All of these items were successfully addressed just by using a few MkDocs plugins, which will be explained in more detail in the next section.

Not only does the content have to be maintainable, so does the overall process. In order to account for this in the design, it is key to keep the logic as simple as possible (following the KISS Principle), so teams managing the process can easily update the documentation, keep the process up to date, and evolve the whole solution accordingly. Keeping it as simple as possible also makes it easier to transfer ownership to another team and accelerate the onboarding process of new team members, so they understand the workflow and how to support the solution.

The Solution

Now that the context and requirements are clear, it is time to look at how to implement them into the solution. It mainly consists of a series of agreements in the way of working, in conjunction with a series of best practices and MkDocs plugins.

Documentation Source: Git Repository

Since Backstage aggregates all the MkDocs documentation sites from several Git repositories into a single documentation portal, we decided to treat the Developer Workflow documentation as just another component in Backstage. The repository also contains the Terraform code to build a simple CloudFront + S3 static web hosting setup and a simple GitHub Action to build the DW standalone website. This enabled us to serve the exact same documentation in two places, fed by the same source.

Focus on Maintainability: MkDocs Plugins

All documentation structure, navigation bars, and page-linking related items were addressed using MkDocs theme features and additional MkDocs plugins.

Since Backstage uses a slightly modified version of MkDocs Material Theme, we had to explore what theme features were compatible with Backstage and used the following:

mkdocs.yaml

theme:
  name: material
  features:
    - navigation.sections
    - navigation.expand
    - navigation.indexes
    - navigation.top
  • navigation.sections: Uses the file structure of the docs folder to automatically generate the navigation sidebar. This prevents an out-of-date sidebar, as it is rendered directly from the content.
  • navigation.expand: Automatically expands all the sections from the navigation sidebar. The motivation for this was to give the user a bird's-eye view of all the documents at once, reducing the need to search – except for when looking for more specific content.
  • navigation.indexes: Enables using the section titles as pages, which is great for overview pages, offering a general view of what the section is about.
  • navigation.top: Adds a "Back to top" button when the scroll is not at the very top of the page, which is especially useful when navigating large pages.

In addition to the MkDocs Material theme features, we also used the following MkDocs plugins. Note that to make it work both in Backstage and standalone, the plugins must be installed in Backstage, as well.

Note that the order in which the plugins are listed is the order in which they are executed when building the documentation site. This is important to keep in mind when adding or removing them, as the results may vary.

mkdocs.yaml

plugins:
  - techdocs-core
  - search
  - awesome-pages
  - macros
  - glightbox:
      touchNavigation: false
      loop: false
      effect: zoom
      width: 100%
      height: auto
      zoomable: true
      draggable: false
  - webcontext:
      context: catalog/default/component/developer-workflow
  - alias
  • techdocs-core: A Backstage built-in plugin, it adds the basic documentation site functionality to Backstage. The rest of the plugins and extensions must be compatible with this one in order for all of them to work properly. The plugin also needs to be installed in the standalone version of the DW documentation site to make sure all the features used are compatible with Backstage.
  • search: A built-in MkDocs plugin that adds indexing and search capabilities to the site.
  • awesome-pages: Enables the creation of a .pages file inside a folder to tune folder-specific settings like navigation, sorting, hiding, etc. By default, section names match their folder names; we mostly use the plugin to give sections friendlier titles. For example, a folder called docs/04-ci-cd can be rendered as CI/CD - Build and Deploy just by adding a .pages file with the following content:

.pages

title: CI/CD - Build and Deploy

This makes it possible to use the name of the folder for sorting, but with a more user-friendly name for the sections.

  • macros: A plugin that enables using Jinja2 templating and variables within MkDocs Markdown pages, which is great for automating the rendering of certain blocks within the documentation.

For example, let's say there's a reference to a team email address in several pages of the documentation. If the email address changes for some reason, it would have to be updated on all the pages where it appears. With the MkDocs macros plugin, it can be set as a variable in the extra section inside mkdocs.yaml and be expanded as many times on as many pages as needed.

mkdocs.yaml

extra:
  team_email_address: [email protected]
  slack_handler: @myteam

contact-info.md

Contact us:

-   Email: **{{ team_email_address }}**
-   Slack Handler: **{{ slack_handler }}**

That's just a simple example, but it can be extended to other uses. For example, it could be used to type all the links to the tools of the Internal Development Platform once and then refer to those links from multiple pages. Additionally, it could be used to generate a page with all the links, for faster access.

mkdocs.yaml

extra:
  tool_urls:
    tool_one:
      test: https://tool-one.test.company.org
      stage: https://tool-one.stage.company.org
      prod: https://tool-one.prod.company.org

    tool_two:
      test: https://tool-two.test.company.org
      stage: https://tool-two.stage.company.org
      prod: https://tool-two.prod.company.org

tools.md

{% for tool in (tool_urls | sort) %}
### **{{ tool | replace("_"," ") | upper }}**

{% for environment in (tool_urls[tool] | sort) %}
- **{{ environment | replace("_"," ") | upper }}**: {{ tool_urls[tool][environment] }}
{% endfor %}

{% endfor %}

The rendering of the above would look like this:

  • glightbox: A plugin that adds basic zoomable image functionality, a feature that MkDocs doesn't provide by default.
  • webcontext: A plugin that helps maintain compatibility with Backstage. As Backstage renders a documentation site for each of the registered components or services, it generates links relative to the component base URL ${BACKSTAGE_URL}/catalog/default/component/${COMPONENT_NAME}, causing links between pages to break on one of the two websites. The plugin is configured to match the Backstage context, and is dynamically set to / as part of the automation that deploys the standalone website.
  • alias: A plugin that makes it possible to use aliases for each documentation page. This hardens the internal links between pages, as the links point to an alias or to a relative path in the filesystem, instead of to the fully qualified URL. Using this plugin, when a file is renamed or moved to a different folder, the internal links don't break.
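To illustrate how the alias plugin is typically used (the alias name here is hypothetical, and the exact front-matter keys should be checked against the mkdocs-alias-plugin documentation), a page declares an alias in its YAML front matter:

```markdown
---
alias: ci-cd-overview
---

# CI/CD - Build and Deploy

This page gives an overview of the build and deploy process.
```

Other pages can then link to it with `[[ci-cd-overview]]`; if the file is later renamed or moved to another folder, the link keeps working.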

Lastly, let's look at all of the Markdown extensions used. All of them are included in the techdocs-core plugin. Almost all of them use the default behaviour; we just set them explicitly in the mkdocs.yaml to avoid misbehavior if a default value changes at some point.

mkdocs.yaml

markdown_extensions:
  - admonition
  - pymdownx.highlight:
      # Enables linking to a specific line in a code block.
      anchor_linenums: true 
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - pymdownx.superfences
  - attr_list
  - pymdownx.emoji:
      emoji_index: !!python/name:materialx.emoji.twemoji
      emoji_generator: !!python/name:materialx.emoji.to_svg
  - def_list
  - pymdownx.tasklist
  - pymdownx.details
  - footnotes
Check the full list of PyMdown extensions in the official documentation.

Working Agreements

In order to support the solution with minimal effort, we decided to establish some practices around the Developer Workflow documentation. The following are some of the most important practices that should be checked as part of a code review:

  • All folders must contain an index.md file with an overview of the whole section.
  • Avoid using lots of images. If something can be described with text, it should be. This may feel counterproductive, but keeping images and screenshots up to date in GitOps documentation is a tough task.
  • Diagrams should be in SVG format. SVG files are written as text, can be embedded directly in MkDocs, and can be raw-edited without keeping a separate file that requires exporting to an image format.
  • All documentation pages must have an alias and all links between documentation pages must point to the page alias.
  • All the DW documents must be of interest to the users of the platform. Documents that are more in-depth and intended for Platform Administrators must be hosted in a different repository and served through Backstage.

Preview Documentation Changes in Localhost

When it comes to previewing how rendered Markdown files will look, working with Backstage is pretty difficult. Even though there are various tools and extensions that enable previewing Markdown files from the IDE, they are not capable of rendering with the exact same plugins that MkDocs is going to use when building the site. Working directly with MkDocs, on the other hand, and knowing that all the plugins, features, and extensions are compatible with Backstage, makes it quite simple to render the site on localhost, even with real-time updates. To do so, use the official MkDocs Material Docker image as the base image and add the plugins on top of it.

Dockerfile

FROM squidfunk/mkdocs-material:8.4.2

RUN python -m pip install --upgrade pip \
    && pip install mkdocs-alias-plugin==0.4.0 \
       mkdocs-awesome-pages-plugin==2.8.0 \
       mkdocs-macros-plugin==0.7.0 \
       mkdocs-techdocs-core==1.1.4 \
       mkdocs-webcontext-plugin==0.1.0 \
       mkdocs-glightbox==0.1.7

Docker Build and Run commands

# Docker Build
docker image build -t mkdocs-material-local .

# Docker Run
# Run this command inside the folder containing the mkdocs.yaml
docker container run --rm -it -p 8000:8000 -v ${PWD}:/docs mkdocs-material-local

The next step is to fire a browser up and navigate to http://localhost:8000. The rendered site should appear, and it should be automatically updated as the files are updated.

Deploying to Production

All roads come to an end and, in this case, the road ends when the Developer Workflow documentation site is deployed to a Production environment, where its users can access it. We use a simple CloudFront + S3 setup, with a Lambda@Edge function that authenticates visitors through their Google Workspace login, but that's out of the scope of this post. Since there are lots of different ways to host a static website, just choose the one that you are most comfortable with or that is best for your organization.

In order to also keep the deployment as simple as possible, we just have a GitHub Actions workflow that performs four main steps:

  1. Build the Docker image with MkDocs Material and its plugins.
  2. Tune the mkdocs.yaml file a bit for the standalone deployment; for example, adjusting the webcontext plugin.
  3. Run a Docker container using the just-built MkDocs Material image and invoke the build command, which renders the website and generates the static site in a folder named site.
  4. Copy the contents of the site folder to the static web hosting – in our case, the Amazon S3 bucket. Here, we also create a CloudFront invalidation for the distribution, to prevent users from waiting until the CloudFront cache expires.

Github Workflow: publish-website.yaml

name: Publish Website
on: [push]

jobs:
  cicd:
    runs-on: [self-hosted]
    # Omitted for readability

    steps:
    - name: Checkout
      uses: actions/checkout@v3

    - name: Build docker images
      run: docker image build -t mkdocs-material-custom:local .

    - name: Parse mkdocs.yaml
      run: |
        # Install yq tool
        wget https://github.com/mikefarah/yq/releases/download/v4.21.1/yq_linux_386 -O ./yq && chmod +x ./yq
        # Use default webcontext and disable use_directory_urls to make it work with CloudFront
        ./yq '.plugins[select(. == "webcontext")].[].context = "/" | .use_directory_urls = false' -i mkdocs.yaml

    - name: Build mkdocs site
      run: docker container run -v $PWD:/docs mkdocs-material-custom:local build

    # Omitted for readability

    - name: Deploy
      if: github.ref == 'refs/heads/main'
      run: |
        aws s3 sync site s3://xxxxxx.company.org --delete --cache-control max-age=3600
        aws cloudfront create-invalidation --distribution-id XXXXXXXXXXXXXX --paths '/*'

The following are screenshots of the real result, so you can see how DW documentation looks both as a standalone MkDocs website and when integrated into Backstage.

References

]]>
<![CDATA[Intro to CSS Grid Layout]]>What is Grid Layout?

CSS Grid Layout is a two-dimensional grid system. It is a CSS language mechanism created to place and distribute elements on a page, which is one of the most problematic processes in CSS. It is a standard, which means that you don't need anything

]]>
https://engineering.empathy.co/intro-to-css-grid-layout/633d369c688e8e003daa7f08Thu, 06 Oct 2022 05:30:00 GMTWhat is Grid Layout?

CSS Grid Layout is a two-dimensional grid system. It is a CSS language mechanism created to place and distribute elements on a page, which is one of the most problematic processes in CSS. It is a standard, which means that you don't need anything special for the browser to be able to understand it. There are no limits to using it: wherever you can develop your CSS to define the style, you can use Grid Layout to apply a grid.

Getting Started

To get started you have to define a container element as a grid with display: grid, but before you do that, it is important to be familiar with certain concepts:

  • Grid container: The parent of all the grid items, it is the element on which display: grid is applied.
  • Item: Each child contained in the grid.
  • Grid line: Horizontal or vertical grid cell separator.
  • Grid cell: Minimum unit of the grid.
  • Grid track: Horizontal or vertical band of grid cells.
  • Grid area: Set of grid cells.

The Basics

There are four basic Grid properties you need to know, in order to make use of it easily and simply: display, columns and rows, fr unit, and gap. Let’s take a look at each of them to understand their role.

Display

This property defines how the grid container is positioned relative to its surrounding content. There are two possible values:

The inline-grid value positions the grid inline, to the left or to the right of the surrounding content, while the grid value positions it as a block, above or below the surrounding content.
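A minimal sketch of the two values (the class names are just for illustration):

```css
.block-grid {
  display: grid;        /* the grid container behaves as a block */
}

.inline-grid-example {
  display: inline-grid; /* the grid container flows inline with surrounding content */
}
```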

Columns and Rows

These properties are used to create the grid by sizing the rows and columns.
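A sketch of such a grid (the selector name is chosen for illustration):

```css
.container {
  display: grid;
  grid-template-columns: 200px 100px; /* two columns: 200px and 100px wide */
  grid-template-rows: 50px;           /* rows are 50px tall */
}
```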

In this example, the grid defined will have two columns. The first will have a size of 200px and the second one will measure 100px; both will have a row size of 50px.

Fr unit

Grid has a special unit of dimension, fr (like pixels or percentages) which represents a fraction of the remaining space in the grid. Instead of using pixels in the example, let’s use fr.
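Rewriting the same columns with `fr` could look like this (illustrative selector again):

```css
.container {
  display: grid;
  grid-template-columns: 2fr 1fr; /* the first column takes twice the free space of the second */
  grid-template-rows: 50px;
}
```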

Now, the two columns will occupy all the available space, with the first one occupying double the amount of the second.

Gap

Sometimes, columns and/or rows need to have a gap between them. The column-gap and row-gap properties can be used, or simply condensed into gap, as shown below.
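For example (the gap sizes are illustrative):

```css
.container {
  display: grid;
  grid-template-columns: 2fr 1fr;
  row-gap: 10px;    /* space between rows */
  column-gap: 20px; /* space between columns */
  /* equivalent shorthand: gap: 10px 20px; */
}
```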

Generators

Sometimes, when there are many elements that have to be positioned inside the page, it can be difficult to visualize what they will look like or the best way to do it. That is where generators come in handy. There are lots of tools available, but here is a selection of three useful grid generators (in no particular order):

CSS Grid Generator

This generator is an open source project that is very easy to use – perfect for beginners who are learning how to use Grid.  With CSS Grid Generator, just specify the number of rows, columns, and gaps across rows and columns, and it will provide the proper CSS class ready to copy to your clipboard.

LayoutIt

Like the previous one, LayoutIt is also an open source project. Things to note about this generator are that it allows code to be exported to CodePen and allows for resizing columns and rows using percentages (%), fractions (fr), and pixels (px).

Griddy

From the information available in many articles, Griddy seems to be the generator that is most valued and used by developers, in general. Like the previous tool, it allows different resizing units to be used but is a little more difficult to use than CSS Grid Generator and LayoutIt.

Grid vs Flex

The main difference between Flex and Grid is that Flex uses one dimension, while Grid uses two dimensions. This means that with Flex, the positioning of the elements can be defined on either the horizontal or the vertical axis. Grid, on the other hand, allows the ability to work in two dimensions, meaning it is possible to position the cell both vertically and horizontally.

Highlights of Flex

  • Best choice for aligning the content between elements.
  • Positioning elements horizontally or vertically.
  • Good at positioning the smallest details of a layout.

Highlights of Grid

  • Makes responsive layouts easier to implement.
  • Compatible with all modern browsers.
  • Positioning elements in two dimensions.


Does this mean that Grid is better than Flex?


Nope! Both are good tools, but one will usually be more suitable than the other, depending on the situation. In any case, it is best to learn how to use both and combine them. Now, you have a guide to using Grid, so go ahead and get started!

]]>
<![CDATA[Image & Color Recognition: A Practical Example]]>We’ve all heard about Image Recognition and how it impacts our day-to-day activities. From the impressive automated driving software that some automobile companies have implemented in their vehicles, to the cool smartphone app that tells us what kind of bird we are staring at – most of us

]]>
https://engineering.empathy.co/image-color-recognition-a-practical-example/632b1965d126f3003d5588baTue, 27 Sep 2022 06:30:00 GMT

We’ve all heard about Image Recognition and how it impacts our day-to-day activities. From the impressive automated driving software that some automobile companies have implemented in their vehicles, to the cool smartphone app that tells us what kind of bird we are staring at – most of us know what the concept of Image Recognition is, more or less. Of course, it might seem like a very complex field (and, well, it is actually tricky sometimes), but we can also do some interesting experiments that will show us how this is, indeed, a very promising area to try out.

Given how broad the field is, the experiment that we are going to perform and assess is focused on a smaller area within it: Color Recognition. The goal is to create an efficient process that can extract the dominant colors from any given picture.  

Color Recognition: The Basics

This area is one of the most explored in computer vision, as it is much easier to collect and classify data, and it is pretty straightforward to test results. It is not difficult for an average person who does not experience colorblindness to see an image of a jacket and identify the main colors present on it; nor is it for a computer to do the same, given the appropriate solution.

Of course, nowadays there are already professional solutions on the market that tackle this problem in a very complete way, like the Google Vision AI project or the Amazon Rekognition service. But, in this case, we are going to build our own, homemade Color Recognition system with Python and some added libraries.

The Experiment

First of all, it is important to know what we want to obtain: a system that can take an image, process it, and return a set of the four main colors. Plus, we want to do it in the simplest and most efficient way possible.

Now that we have an overview of Color Recognition and a clear idea of what our objectives are, we can start with the implementation itself. To illustrate the experiment, let's go through a guided, step-by-step example.

Step 1: Process the image to be analyzed

The first step of the experiment is to “read” the image. Use the Numpy and OpenCV libraries to translate the picture into a matrix of data corresponding to the colors in RGB code, assigning each pixel a value between 0 and 255.

    import cv2
    import numpy as np

    # `file` is the input picture, loaded beforehand (e.g. with PIL)
    open_cv_image = np.array(file)
    open_cv_image = open_cv_image[:, :, ::-1].copy()  # RGB -> BGR, OpenCV's native order
    img = cv2.cvtColor(open_cv_image, cv2.COLOR_BGR2RGB)  # back to RGB for processing

The matrix obtained will be similar to this one, which has been shortened for obvious reasons:

Excerpt of the resulting data matrix

Step 2: Resize the picture

Next, it is time to resize the image using the OpenCV library. At a low level, this step means converting the big data matrix into a smaller one. Reducing the number of pixels in the image avoids introducing noise into the color processing phase.

Moreover, the fewer pixels there are in the final image, the less time the color recognition process will take. Of course, this has some drawbacks, the most significant being that having fewer pixels can also reduce the accuracy of the recognition. In order to find the best ratio for the resizing process without compromising that accuracy, iterate: try several sizes, measuring the time taken for the whole process and the quality of the colors obtained at each one.

For this example, several sizes of the pixel matrix were tested (300, 200, 100, 50, 35, 25, and 10). The ideal size for the data matrix turned out to be 35, which is important to keep in mind for the next steps.

Representation of the resizing process

Step 3: Apply K-Means clustering

The analysis of the image and subsequent color classification is performed with a technique called K-Means clustering. K-Means is a non-supervised algorithm that aims to partition n objects into k groups (called clusters). While there are more specific techniques to get a more accurate number of clusters, for the sake of simplicity, they won’t be explored in this article.

Now that the image has been resized to the proper measurements, it is time to use the SKLearn library to get to the most interesting part of the clustering process. In this case, a total of four clusters will be used, because that is the number of dominant colors to be extracted from the picture.

    from sklearn.cluster import KMeans

    # use k-means clustering
    k_means = KMeans(n_clusters=4)  # cluster number (specified by hand)
    k_means.fit(pixels)

Step 4: Extract colors

Once the K-Means algorithm has finished computing, the results have to be extracted using the Numpy library.

    # extract color codes in RGB from the centroids of k-means algorithm
    colors = np.asarray(k_means.cluster_centers_, dtype='uint8')
    hex_colors = rgb_to_hex(colors)
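The `rgb_to_hex` helper used above isn't shown in the article; a minimal sketch (assuming RGB triplets in the 0–255 range) could be:

```python
import numpy as np

def rgb_to_hex(colors):
    """Convert an array of RGB triplets (0-255) into hex color strings."""
    return ['#%02X%02X%02X' % (r, g, b) for r, g, b in colors]

# Centroids similar to the ones K-Means could return for the example image
centroids = np.asarray([[223, 217, 216], [60, 66, 75]], dtype='uint8')
print(rgb_to_hex(centroids))  # ['#DFD9D8', '#3C424B']
```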

After some refactoring, the result of this step is an array containing the four most dominant colors of the image analyzed:

['#DFD9D8', '#3C424B', '#8CA0AA', '#A46C5D']

The only thing left to do is check the color codes! It looks like a very good match between the picture analyzed and the dominant colors extracted:

Final result composition with the image analyzed and its dominant colors

Conclusion

That’s it! With those four simple steps, we have created our own Color Recognition system. Of course, the application and different uses will vary, depending on the project at hand, but there is a clear approach to follow.

Here at Empathy.co, our goal is to create captivating Search experiences for shoppers and Image Recognition can play an important role in doing so. A perfect example of this is the ability to filter by color that a lot of online stores have. The data required for those filters to work comes (usually, but not always) from a system like the one we created in this experiment. There are also more advanced uses for this, like automated tag systems that label products based on their images and assign them to specific categories. This is a very interesting field, for sure, and now you’re ready to dive into it!

]]>