engineering.empathy.co · Ghost 6.22 · Tue, 17 Mar 2026 05:57:46 GMT

A first dive into CSS animations

https://engineering.empathy.co/a-first-dive-into-css-animations/ · Thu, 21 Sep 2023 06:45:07 GMT

CSS is one of those things that everyone has a strong opinion about. For some reason, it seems like you are only allowed to love it or hate it.

It’s quite well-known that it can be a struggle to work with. Some people even avoid all frontend development just to stay away from CSS.

But, as I’m starting to grow in my career, I’m becoming a member of the first (and smaller) group – the one that loves CSS.

I learned that CSS is quite easy to use when you learn the ✨ basics ✨ correctly and don’t rush into advanced things.

So, if you are still struggling with it, you may need to consider doing some basic training or even practicing by playing some “silly” games like Flexbox Froggy or CSS Diner. Believe me, you’ll learn to center a div!

Once concepts like cascade, specificity, selectors, flex, grid, and so on stop being a mystery to you, you’ll probably be ready to dive into the cool stuff without your mental health suffering along the way.

The first two or three times, I found myself stuck in front of my newly assigned task that required a transition or an animation, looking at the designs and thinking, “How the hell am I going to do that?!” But after working on a few more, I came to realize that it is all about the mental model.

The Mental Model

Developing an animation starts way before you start writing the code. Your first step should always be to observe the result you want and try to find the different behaviors and the HTML + CSS resources needed to achieve it. You’ll probably need to divide and conquer, in order to see everything clearly.

This way, when the “spinning ball” becomes an “svg or div with a rotate transform” and your “dropdown effect” becomes a “container with a height or max-height transition,” you’ll be ready to go! Of course, if you don’t already, you’ll need to find out what you can and cannot do with CSS.
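For instance, that dropdown effect could be sketched with something like the following (the class names are made up for illustration):

```css
.dropdown__panel {
  max-height: 0;
  overflow: hidden;
  /* Animate between the collapsed and expanded states. */
  transition: max-height 0.3s ease;
}

.dropdown--open .dropdown__panel {
  /* Any value large enough to fit the content works here;
     max-height can be transitioned, while height: auto cannot. */
  max-height: 20rem;
}
```

Toggling the `dropdown--open` class is then all the JavaScript side needs to do; CSS handles the rest.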

Be sure to take a look at concepts like:

  • transform
  • transition
  • animation
  • @keyframes

You’ll need them to do all the fancy things. Once you are aware of what each one can do, you’ll see that, basically, an animation is a series of transformations put together, so you’ll have to observe how each individual element of the animation interacts and which transformations it undergoes.

It might take a while for it to click in your brain and start seeing things this way. In the meantime, let’s take a look at a few examples developed in the Empathy Platform Docs portal.

Example 1: 404 error

A good example of a simple animation is located in the 404 error. Can you guess what is going on and how this animation works? Don’t worry if you don’t see it immediately!

Here’s how it works:

First comes the HTML:

<div class="edoc-error-page__diagram">
  <p class="edoc-error-page__text edoc-error-page__text--404">404</p>
  <p class="edoc-error-page__text edoc-error-page__text--error">ERROR</p>
  <Bubble class="edoc-error-page__bubble edoc-error-page__bubble--green"/>
  <Bubble class="edoc-error-page__bubble edoc-error-page__bubble--blue"/>
  <Bubble class="edoc-error-page__bubble edoc-error-page__bubble--pink"/>
  <Bubble class="edoc-error-page__bubble edoc-error-page__bubble--yellow"/>
  <Bubble class="edoc-error-page__bubble edoc-error-page__bubble--orange"/>
</div>

As you can see, in this case, each of the bubbles is an individual element. “Bubble” is an imported SVG, but this could be done with divs as well!

The second part is to define the animation for each of the bubbles. For this, keyframes come in handy:

@keyframes rotation {
    from {
        transform: rotate(0deg);
    }
    to {
        transform: rotate(359deg);
    }
}

Assigning the animation to each bubble with the animation property makes everything start spinning:

&--blue {
    top: rem(-10px);
    left: 38%;
    animation: rotation 4s infinite linear;
    circle {
        r: 20;
    }
}
&--green {
    top: rem(40px);
    left: 28%;
    transform: rotate(45deg);
    animation: rotation 3.5s infinite reverse linear;
    circle {
        r: 18;
        fill: $color-green;
    }
}

But, as you can see, there are a few more things going on here, so let’s check everything:

  • top and left: Each bubble is positioned with `position: absolute` within the picture.
  • animation properties: Different bubbles are assigned different animation values to change the characteristics of the rotation. Playing with durations, we can make some of them spin more slowly and others faster. Some also include `reverse` to rotate them counterclockwise.
  • circle properties: Here, we modify the SVG characteristics to give the circles different radii and colors.

Too much to begin with? Maybe an even simpler example could help:

Example 2: Holons' No-Results Page

An even simpler animation is located in the no-results page of the Holons Search Experience. Didn’t see anything? Stare at that sad, disappointed Holon for a few seconds… It blinks! Can you guess how this one works?

Here it goes:

Let’s start with the HTML again:

<div class="no-results__image-wrapper">   
  <NoResultsEyesIcon class="no-results__image no-results__image--eyes"/>
  <NoResultsMouthIcon class="no-results__image no-results__image--mouth"   /> 
</div>

The face is separated into two SVGs, one for the eyes and another for the mouth. This allows us to easily animate the eyes while leaving the mouth alone.

Secondly, we have a keyframes animation for the blink:

@keyframes blink {
    45%, 55% {
        transform: scaleY(1);
    }
    50% {
        transform: scaleY(0.1);
    }
}

In this case, instead of modifying the rotation, we are playing with the vertical scale. The eyes get squashed for a brief moment: the blink starts at 45% of the cycle, the eyes are nearly closed at 50%, and they are back open by 55%.

Finally, the animation is given to the eyes SVG:

&--eyes {
    animation-name: blink;
    animation-duration: 6s;
    animation-timing-function: ease;
    animation-iteration-count: infinite;
}
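For reference, those four longhand properties can be collapsed into the animation shorthand:

```css
&--eyes {
    animation: blink 6s ease infinite;
}
```

Both forms behave identically; the longhand just makes each knob more explicit.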

That said, you’ll probably need to struggle with a few animations before you start to understand their inner workings quickly, so don’t despair! You’ll get there!

Remember: It takes time!

Even if you get to the point where you immediately see how everything works and how to build it, it will still take time to implement something beautiful. You’ll probably never be able to create a breathtaking animation in five minutes; it takes time!

It’s very possible that you’ll face a trade-off between the time you can dedicate and how smooth you want the animation to be. Just remember that this is normal; you’re not going to perform a miracle in limited time, so don’t panic and take your time (if you have it)!

Purchase History Search: Indexing with Elasticsearch

https://engineering.empathy.co/purchase-history-search-indexing-with-elasticsearch/ · Thu, 14 Sep 2023 10:57:53 GMT

At Empathy.co, we have been making heavy use of search engines such as Solr and Elasticsearch for years. They are key types of software in the products we build for our platform and our clients. They enable fast and effective storage and retrieval of our clients' catalogue products, simply and with little overhead.

A couple of years ago, one of our clients requested that we speed up the retrieval of their historical purchase data. They had been storing the data without giving much consideration to its retrieval. Since we were already using Elasticsearch for this client, we promptly designed a system for ingesting and finding purchase history data, and thus Purchase History Search was born as a separate project.


The challenge of Purchase History Search was ingesting a huge volume of purchase data in real time and making the data easily searchable. The introduction of the time dimension to the data is a key difference from catalogue search. Moreover, the history of purchases is private data, so creating a privacy-first system was essential.

Elasticsearch works with indices that could be roughly compared to standard database tables. Each index is further divided into shards, which are essentially inverted indices that enable the efficient search of data using the terms contained within it. These shards can live in different nodes (machines) so that the index operations can be distributed and parallelized. A typical Purchase History cluster consists of three or five master nodes that manage the entire cluster, its metadata, and operations, and about six data nodes where the index shards actually reside.


At Empathy.co, for our retail clients, we usually build a new index every time there is a new catalogue using a batch or on-demand process. For Purchase History, we have a continuous stream of purchase documents going into a feed, so the mechanism changes. In this case, we use a different service called the Streaming Indexer.

Ingesting the Data

The Streaming Indexer is a scalable and stateless application that sits between the client's data feed and our Elasticsearch cluster. It fetches documents in real time and indexes them to a specific index. As the data is presented in a time-series style, it is only natural to have indices consist of a subset of the purchases. In our case, we decided to go with a monthly index. To further optimise the data storage and retrieval, it is standard for each document to have some kind of customer ID field. This ID can be employed to always route the document to the same index shard. When a shopper goes through their history, a single shard needs to be hit. It is like directing your search to a single drawer in your wardrobe.
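As a sketch of what this looks like against Elasticsearch's REST API (the monthly index name, document ID, and field names here are hypothetical, not our production schema):

```
# Index a purchase into the monthly index, routed by customer ID
PUT /purchases-2023-09/_doc/order-123?routing=customer-42
{
  "customer_id": "customer-42",
  "timestamp": "2023-09-14T10:57:53Z",
  "items": [{ "product_id": "sku-1", "quantity": 2 }]
}

# Retrieve that customer's purchases; the same routing value means
# only one shard of the index needs to be searched
GET /purchases-2023-09/_search?routing=customer-42
{
  "query": { "term": { "customer_id": "customer-42" } }
}
```

The filter on `customer_id` is still needed, since routing only narrows the search to a shard, not to a single customer.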


This idea of sharding is not only a performance gain, but a good first step towards a privacy-aware and decentralized Purchase History system. Even though Elasticsearch is the centralising element of the whole setup, the idea of separating data based on customers is an element that we would like to explore further. We have been researching solutions like SOLID, whose pods can resemble our idea of shards. Basically, each shard/pod would be a private, fully decentralized repository of a consumer’s data. The consumer would be in total control of it and would be able to manage who has access to what part of their data.


As stated, shards are grouped into indices. The picture below shows the different Elasticsearch components. This distribution of data across indices enables the Elasticsearch cluster to scale out without degrading the service. Multiple Elasticsearch nodes can be run so that querying and indexing are further distributed: each instance works in parallel, and each index can be queried separately. Furthermore, settings can be tweaked for older indices, such as freezing them or making them read-only. A large proportion of our clients' searches retrieve recent purchases, which usually means only one or two of the most recent indices are hit.

Accessing the Data

The second part of our Purchase History setup is the Search API itself. It can search effectively for purchases using a combination of filter and sorting parameters. Depending on the use case, different search parameters can be used. A typical use case is shoppers reviewing their purchase history, looking for details or items to repurchase. Another is our clients’ customer support services, which shoppers contact in the event of mismatches between ordered and received products. The setup can also be integrated with other applications, such as a Nutrition ranking system or a Pickup virtual assistant, which quickly check a shopper’s purchase history to retrieve purchases to be picked up.

The Search API service is also scalable and stateless, but it is more similar to our traditional search service. To make it performant, some tricks are possible, such as using the routing data discussed earlier. Secure access and permissions are simple to set up using OAuth, for example.


As with the rest of our services, Kubernetes is used to orchestrate the deployment of all the Purchase History services. This makes them reliable and enables them to run in any cloud environment as there are no cloud-specific technologies used. Kubernetes simplifies horizontal scaling, depending on the load. For example, if traffic suddenly increases, more Search API pods can be quickly deployed to meet established latency SLAs.

Another consideration is the retention period of the data. Our customers, or even legislation, may dictate that the data be kept no longer than a specific period. Having the monthly index scheme in place allows us to simply remove a full index once all the documents belonging to that month are older than the defined period.
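To make that concrete, here is a minimal sketch (the `purchases-YYYY-MM` naming and the retention value are illustrative, not our production code) of how monthly indices map to a retention window:

```python
from datetime import date

def indices_to_delete(existing, today, retention_months):
    """Return the monthly indices (named 'purchases-YYYY-MM') whose month
    lies entirely outside the retention window and can be dropped whole."""
    # Oldest (year, month) that may still contain data inside the window.
    months = today.year * 12 + (today.month - 1) - retention_months
    cutoff = (months // 12, months % 12 + 1)
    expired = []
    for name in existing:
        year, month = (int(part) for part in name.split("-")[1:3])
        # Only months strictly before the cutoff are guaranteed to contain
        # nothing newer than the retention period.
        if (year, month) < cutoff:
            expired.append(name)
    return expired
```

With a six-month retention and today set to 14 September 2023, `purchases-2023-01` is dropped while `purchases-2023-08` is kept; deleting an index this way is a single cheap operation instead of a document-by-document purge.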

The Future of Purchase History

The next challenge will be to configure Purchase History access in a decentralized and private scenario. This will be done hand-in-hand with merchants, as they will no longer own and totally control their customer base’s Purchase History, but rather have an interface enabling access to each customer’s shards/pods.

From Zero to Beam

https://engineering.empathy.co/from-zero-to-beam/ · Thu, 10 Aug 2023 06:00:35 GMT

Moving from in-house streaming code to a flexible and portable solution with Apache Beam

Long gone are the days when we used to consume data with Apache Spark Streaming, with an overly complicated, cloud-dependent infrastructure that was non-performant when load increased dramatically. Follow us on a journey of stack simplification, learning, and performance improvement within Empathy.co’s data pipeline.

A Blast from the Past

Our data journey began in 2017, building a full-fledged AWS-based pipeline, steadily consuming events from multiple sources, wrapping those into small batches of JSON events, and sending them on to the Spark Streaming consumer.

Here’s a simplified view of what that exact streaming side of things looked like - together with the important bits of infrastructure, such as queues where events would be written:

Simplified streaming diagram of the old solution


Main technologies found in the old stack:

  • SQS: Simple Queue Service, Amazon’s managed message queuing service mainly used to asynchronously send, store, and retrieve multiple messages of various sizes.

    Single events were stored in queue #1, awaiting their processing by the event-wrapper.
  • SNS: Simple Notification Service, Amazon’s web service that makes it easy to set up, operate, and send notifications from the cloud.

    We used it to notify other infrastructure elements that a new batch of events wrapped by event-wrapper was ready to be pulled and consumed.
  • Parquet: An open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

    A data lake was built with all the Parquet files generated by this part of the infrastructure. We then continued to generate analytics from them via Apache Spark batch jobs.

In-house services:

  • Event wrapper: In charge of taking individual events written into the queue by previous services, merging them into a bigger set in a single JSON file, and forwarding them into another queue.
  • Spark consumer: In charge of constantly pulling the queue, 10 messages at a time, and parsing those into Parquet files for later processing by subsequent batch jobs.

There’s more to it than what’s seen in the diagram, but that should give you an idea of how things looked back then. And yes, it was all built on AWS, with little room for portability short of major changes to the whole set of services that composed this part of the pipeline.

Some cons found in this solution:

  • Scalability: Major sales or special events such as Black Friday were a headache when traffic went through the roof, as it was not easy to increase the throughput in the consumer. This resulted in delayed events and visualizations not being updated in time.
  • Portability: Having so many AWS bits underneath made the system hard to migrate into another cloud, and finding alternative equivalent services became a major pain point that needed to be solved urgently.
  • In-house connector: Pulling from SQS in Spark Streaming relied on an in-house connector, kept apart from all the parsing logic that made the data fit the schema defined for raw data.

You may be wondering, “Why not use Beam from the beginning? It was released in 2016!” Well, given the experience the team had with Apache Spark, it made sense to go with the much-proven Spark Streaming rather than making the switch to a technology that was still in its early stages.


Winds of Change

Despite the cons, we managed to improve the old solution’s performance over time, making it a bit more capable of handling higher traffic loads without missing a beat.

However, the portability pain point was still there and wasn’t an issue so long as we didn’t use any other cloud… But given certain requirements, an extension of the components to support Google Cloud Platform was on its way to broaden the offering, and it was the perfect time to explore alternatives and simplify the system for the better! After a thorough analysis, we came to the conclusion that, given our expertise at the time, Apache Beam was the way to go.

Something which used to be overly complicated and really dependent on infrastructure soon looked something like this:

Simplified streaming diagram of the current solution


What is Apache Beam?

According to the Apache Beam website itself:

“The easiest way to do batch and streaming data processing. Write once, run anywhere data processing for mission-critical production workloads.”

In short, it is a unified API for building data processing components, where a single codebase can run on Flink, Spark, and many other systems. Check out the full list of supported runners.

Pick your language of choice and you’re pretty much ready to write code that will easily and transparently translate into a runner. The official docs have everything you need to know!

Some pros that made our decision way easier:

  • Multiple languages: Ability to write code in multiple languages such as Java, Python, and Scala and still get a portable pipeline out of it.
  • Huge range of connectors: Out-of-the-box connectors for Kinesis, PubSub, and Kafka are already provided, along with a large community that develops and maintains them.
  • Ease of development: Writing code in Apache Beam does have a certain learning curve, but once you get used to it, it’s really intuitive.

However, there are some cons worth mentioning also:

  • Performance corner cases: Being able to compile down to multiple runners means the one-framework-for-all approach probably isn’t as performant as a native solution written purely in Flink, for example.
  • Unable to update to the latest connector driver whenever you want: Drivers are embedded within Apache Beam and released with it. Until it has been properly tested by the Beam project, the latest version of a driver library won’t really be usable, which can be a problem for things like vulnerability fixes.

As soon as we decided that Apache Beam would replace our Spark Streaming implementation, we started to dig into what other parts of the infrastructure needed to be changed in order to optimize it. First step: the connector to the queue.

The Pipeline as the Baseline

The base of every Apache Beam job is the Pipeline, which is defined as “a user-constructed graph of transformations that defines the desired data processing operations.”

Pipeline p = Pipeline.create(options);
p.apply("ReadLines", TextIO.read().from(options.getInputFile()));

As soon as you’ve got data flowing through the pipe (that initial TextIO step), you can start modifying the data however you wish.

TextIO will be our connector of choice in this sample. It reads files from a given location, either on the local machine or remote (S3 and GCS buckets, for instance), and gives us a collection of strings; in Beam this is a PCollection, which is defined as “a data set or data stream. The data that a pipeline processes is part of a PCollection.”

As soon as you build that initial and crucial step in your pipeline, you are ready to start adding more stages that operate on the aforementioned PCollection.

The apache/beam GitHub repository has plenty of examples to play with, so take a look! Also, the Programming Guide on the official website has everything you need to get up to speed on developing your very own pipelines. All the concepts mentioned in here (and more!) are covered in detail.

From SQS to Kinesis and Beyond

Initially, we were using a mix of SQS and SNS with S3 buckets to persist the events midway, sending notifications via SNS whenever a new batch of events was ready to be pulled from Spark. This was not ideal by any means and it complicated the overall portability and maintainability of the system, among other things.

We wanted a real-time stream of events from beginning to end, with the ability to replay messages in case we wanted to and remove all the unnecessary complexity of maintaining midway buckets and notifications. The solution, given the climate at the time, was clear: Kinesis.

For those of you who don’t know what Kinesis is, it’s basically AWS’s real-time streaming offering. The benefits, in addition to it being real-time, are that it is highly scalable and fully managed, which removes a lot of operational overhead; this was a decisive factor for us.

The initial development of the new jobs targeted code that would run on both GCP Dataflow and Flink on Kubernetes.


Reliable, Fast and Portable - What Else?

By applying the concepts seen in the documentation above, we built our pipelines - all in a streaming fashion - which save the processed records to both Parquet files and MongoDB collections with great throughput.

It is easily scalable when high-traffic season comes (e.g. Black Friday or sales) and portable across clouds. We hope you enjoyed this brief intro to Apache Beam and the transition from a really cloud-dependent environment to a more agnostic one!



On-Call Engineering Approach

https://engineering.empathy.co/on-call-engineering-approach/ · Tue, 01 Aug 2023 09:45:37 GMT

At Empathy.co, On-Call was scheduled to allow for fast recovery in case of a disaster, errors, or loss of service. A bunch of folks from different teams are part of the On-Call rotation (with escalation policies), which guarantees that there is someone ready to catch any incidents that may occur.


We work with a "you build it, you own it" principle, which promotes each team's autonomy and ownership. Each team defines the operational challenges of their services in an Operational Readiness Review so that they can solve issues without first escalating to other teams – because nobody knows more about a service than the team that owns it.


Empathy.co’s culture enables and empowers software engineers to take ownership to build, run, and operate products and features. We actively participate in the design and implementation of customer solutions and remain responsible for the value and service they provide. This means we are engaged in the whole cycle, from software inception to any service-level management, such as debugging, troubleshooting, request and incident resolution.

Whether your organization already has an On-Call system or is looking to implement one, here's a look at how ours has been implemented and how we recommend setting it up.

Onboarding

Previously, when a new engineer was added to the On-Call rotation, there was not much guidance. To improve the On-Call onboarding, we implemented On-Call shadowing for a kinder, smoother ramp-up to going On-Call, with none of the stress or responsibility for diagnosing and fixing the issue.

💡
Empathy.co encourages everyone, regardless of their position within the company, to spend a week shadowing an engineering team using Better Stack to understand what our product does and how to use it.


The default shadowing schedule excludes weekends and the day the On-Call is transferred from one engineer to the next, resulting in a 4-days-a-week, 24-hours-a-day "shadowing shift." New engineers decide when they want to start shadowing and when they feel ready to join the On-Call rotation. Our expectation is that they begin shadowing sometime during the first three months, and our culture of shared responsibility and blamelessness makes it less daunting to make the switch from shadowing to being on call.

The Need-to-Know Steps

Identify & Log

Since it's important to respond to incidents quickly, identifying and logging must also be done quickly and methodically in order to move on to the next step.

Categorize & Prioritize

It's important to categorize incidents to prevent confusion: for instance, by the number of users affected, the services involved, or the revenue impact. Prioritizing incidents helps the On-Call engineer decide whether or not an incident requires the time and resources of the rest of the team.

Notify the Right People

Assembling the right people at the right time and in the right place is key to ensuring exceptional mean time to resolve (MTTR) times. Therefore, it's necessary to suppress the noise and alert only the right people who can fix the alert.

Troubleshoot

It's vital that first responders like the On-Call engineer be able to troubleshoot on the go.

Our On-Call Proposal

Each team is the expert regarding the workloads they have built. Following the "you build it, you own it" principle, the alerts should be owned by each respective team.

  • Each team has an independent escalation policy.
  • One team can escalate to another team or another backup from their escalation policy.
  • The number of On-Call engineers for each escalation policy should be at least four, to avoid burnout and high-stress load.
  • Escalation policies should be owned by each team. They decide if they want one member in primary, a couple in secondary, notification timings, etc.
  • Alerts notify the team with enough knowledge to solve the issue. Notifying the right people at the right time is key.



Schedule

Currently, our schedules usually start on Monday, but any weekday will do. We recommend avoiding On-Call shift changes on weekends. The shift length is one week, and only one week per month per engineer is recommended in order to avoid burnout.

Expectations

Setting clear expectations is key for maintaining alignment:

  1. During their On-Call shift, engineers should dedicate time to investigating and fixing the root cause of operational problems as a priority.
  2. Picking up new feature work should be a luxury, not an expectation.
  3. After a disruptive night or weekend, On-Call engineers are expected to take a break and have time to recover.

Each escalation policy should account for at least four engineers, and shift changes should happen on a weekly basis. Each On-Call engineer should plan their PTO days in coordination with their team, to ensure proper coverage.

Postmortem

After an incident happens, a Postmortem should be written the following day to detail the occurrence and the solution applied. The main owners of the postmortem are the On-Call engineers who solved the incident, but feel free to add more people to the review. The Postmortem should then be reviewed by Staff Engineering before being communicated to clients and any other stakeholders within the company.

What are the benefits of this proposal?

  • Clear expectations
  • Improve On-Call onboarding
  • Improved team transparency and accountability in handling issues
  • Suppress noise
  • Better service reliability by quickly acting on and resolving alerts; alerting the right people
  • Happier customers, who can contact On-Call engineers for urgent issues at any time and be assured that issues will always be fixed quickly
  • Maintaining the backup and escalation policies means everyone can provide support, if needed


This is the On-Call approach that works for us, but each organization is different. It's important that you create an approach that fits your company's needs, offers reliability for your clients, and balances the workload for your On-Call engineers.

How to Create an X Adapter (and pair it with X Components)

https://engineering.empathy.co/how-to-create-an-x-adapter-and-pair-it-with-x-components/ · Fri, 14 Jul 2023 06:00:02 GMT

When creating a search experience, Interface X offers you a handy library of standalone building blocks to make the development part swift and easy, the X Components. But there are many other useful tools you can find in the Interface X repository.

The X Adapter is one of the most powerful and versatile tools in it, serving as a decoupled communication layer between an endpoint and your components. Combine it with the X Components, and you'll have half the work done already!

But how does the adapter accomplish this? Essentially, the adapter acts as a middleman between the website and the endpoint by handling requests. When creating and configuring an adapter, you typically need to specify an endpoint. However, to actually "adapt" the data, you need more than just an endpoint. This is where mappers come into play.

RequestMapper and ResponseMapper

The mappers are at the core of an X Adapter's usefulness. The adapter's RequestMapper is used to build the request correctly before it is sent to the endpoint, and the ResponseMapper takes care of the response to that request.

We can define the mappers as dictionaries that help the adapter translate from what we have to what we want. This translation is defined using Schemas, which specify how the source object should be transformed into the target object.

For example, let’s say that our components name the search query as query, but the endpoint expects it to be called just q. The adapter will need a RequestMapper that tells it to translate query into q when creating the request. Here's an example of how to define a RequestMapper that does just that:

const requestMapper = schemaMapperFactory({
  q: 'query',
})

Easy enough! But what if the endpoint requires numerous fields? How can you keep track of all the different fields needed for the request to work while building the mapper? Fortunately, schemaMapperFactory implements generic typing using TypeScript. This allows you to specify the types of the source and target request parameter objects, which enables your IDE to give you hints about the available fields.

type SourceParams = { query: string };
type TargetParams = { q: string };

const requestMapper = schemaMapperFactory<SourceParams, TargetParams>({
  q: 'query',
})

Let’s cut to the chase

Mapping a couple of 1:1 parameters is rarely enough to handle real-life situations. Now that we've covered the basics of mappers, let's take a closer look at how they can handle more complex object structures, such as nested objects and calculations.

Returning to the previous example, we essentially tell the mapper that it will receive an object with an entry called query. When it finds that entry, it should put its value in a new object's entry, q. But what if query is not at the root level of the object? What if it's wrapped inside something else?

type SourceParams = {
    searchParams: {
        query: string
    }
};

The good news is that the solution is quite simple: specify the full path to the property, and the mapper will find it. The even better news is that TypeScript hints will still work and suggest the correct full path!

type SourceParams = {
    searchParams: {
        query: string
    }
};
type TargetParams = { q: string };

const requestMapper = schemaMapperFactory<SourceParams, TargetParams>({
  q: 'searchParams.query',
})

Indicating the path to the property, as in these examples, will cover a lot of the situations you might encounter, but what if what you want to do is more complex? What if you need to perform some calculation before passing the value to the target property?

Consider a scenario where you have to handle pagination for a request. The source object has the current page (page) and the number of elements per page  (pageSize), but the endpoint requires the index of the first element to return  (startIndex).

type SourceParams = {
    searchParams: {
        query: string, 
        page: number, 
        pageSize: number
    }
};
type TargetParams = {
    q: string, 
    startIndex: number
};

To assign the correct value to startIndex, you must first multiply page by pageSize. Rather than indicating the path to a parameter, you can pass a function that receives the entire source object and performs this calculation.

const requestMapper = schemaMapperFactory<SourceParams, TargetParams>({
  q: 'searchParams.query',
  startIndex: ({ searchParams }) => searchParams.page * searchParams.pageSize
})
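Since the arithmetic is easy to get wrong, here it is as a tiny standalone sketch (startIndexFor is a hypothetical helper, not part of the X Adapter API), assuming a zero-based page number:

```typescript
// Hypothetical helper (not part of the X Adapter API): the pagination
// calculation used in the mapper above, assuming a zero-based page.
function startIndexFor(page: number, pageSize: number): number {
  return page * pageSize;
}

console.log(startIndexFor(0, 24)); // first page starts at element 0
console.log(startIndexFor(2, 24)); // third page starts at element 48
```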

What about lists?

Moving on to responses, the tools we have are the same, but the problems we'll face are likely to be more complex. When dealing with endpoint responses, we have to handle more complex structures and arrays of elements. The X Adapter was created with the idea of mapping search responses, so it will need a way of handling the mapping of lists of elements.

Taking the following endpoint response as an example:

{
  response: {
    items: [
      {
        identifier: 1,
        name: 'example',
        image: 'image.url.example'
      },
      ...
    ]
  }
}

We define the types for the SourceResponse and the TargetResponse:

type SourceResponse = {
	response: {
		items: {
			identifier: number;
			name: string;
			image: string;
		}[];
	};
}

type TargetResponse = {
	results: {
		id: string;
		description: string;
		images: string[];
	}[];
}

We see here that response.items should be mapped to results and, for each item in the response, the numeric identifier becomes the string id, name becomes description, and image becomes a single-element array in images. Let’s see how it’s done:

const responseMapper = schemaMapperFactory<SourceResponse, TargetResponse>({
  results: {
    $path: 'response.items',
    $subSchema: {
      id: ({ identifier }) => identifier.toString(),
      description: 'name',
      images: ({ image }) => [image]
    }
  }
})

So, what's happening here? We're passing an object in the schema for the results property. This object has two elements: $path and $subSchema. $path is the path to the property in the source response object that will be iterated over to populate the list. $subSchema is the schema that will be applied to each of the elements found there.
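If it helps to demystify what the factory produces, the whole mapping model (path strings, functions, and $path/$subSchema objects) can be sketched in a few lines of plain TypeScript. This is only a toy illustration of the idea, not the library's actual implementation:

```typescript
// Toy schema mapper (illustration only, NOT the X Adapter implementation).
type Mapping = string | ((source: any) => any) | { $path: string; $subSchema: Schema };
type Schema = Record<string, Mapping>;

// Resolve a dotted path like 'response.items' against an object.
function getPath(obj: any, path: string): any {
  return path.split('.').reduce((acc, key) => acc?.[key], obj);
}

// Apply a schema to a source object, producing the target object.
function mapWithSchema(schema: Schema, source: any): any {
  const target: any = {};
  for (const [key, mapping] of Object.entries(schema)) {
    if (typeof mapping === 'string') {
      target[key] = getPath(source, mapping);        // path lookup
    } else if (typeof mapping === 'function') {
      target[key] = mapping(source);                 // computed value
    } else {
      // $path points at an array; $subSchema maps each of its elements.
      target[key] = getPath(source, mapping.$path)
        .map((item: any) => mapWithSchema(mapping.$subSchema, item));
    }
  }
  return target;
}

const mapped = mapWithSchema(
  {
    results: {
      $path: 'response.items',
      $subSchema: {
        id: ({ identifier }: any) => identifier.toString(),
        description: 'name',
        images: ({ image }: any) => [image]
      }
    }
  },
  { response: { items: [{ identifier: 1, name: 'example', image: 'image.url.example' }] } }
);
// mapped is { results: [{ id: '1', description: 'example', images: ['image.url.example'] }] }
```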

X Components and the X Adapter

Now that we have an adapter that can translate between our components and an endpoint, we just need the components. This is where the X Components come in handy. They offer multiple components to easily and quickly create an engaging search experience from the ground up. You can check out this video lesson for guidance on how to create a project with X Components or delve deeper into the documentation.

Using both the X Adapter and X Components, you only need to provide the endpoints. The components are already prepared to receive the information to display the search through the X Adapter.

The X Components require an adapter that bundles together different adapters for different endpoints to cater to the needs of the components, such as Related Tags, Next Queries, and tagging. To make the development process easier, it provides types for each adapter's expected request and response. Additionally, there's a pre-built adapter, the X Adapter-Platform, that you can plug directly into the X Components setup to communicate with Empathy.co endpoints right away. However, you won't always be working against those endpoints.

If you want to create a search experience with the X Components using your own endpoint, you'll most likely start by showing results for the search. To do that, you'll need to create an adapter for the search endpoint.

const myAdapter = {
  search: searchEndpointAdapter
}

Your searchEndpointAdapter will use your endpoint and the mappers to adapt the endpoint requests and responses to what the X Components understand, similar to the previous examples:

import { SearchRequest, SearchResponse } from "@empathyco/x-types";
import { YourSearchRequest, YourSearchResponse } from "your-types";

const requestMapper = schemaMapperFactory<SearchRequest, YourSearchRequest>({
  q: 'query',
  pageIndex: ({ rows, start }) => (rows && start ? rows * start : 0)
})

const responseMapper = schemaMapperFactory<YourSearchResponse, SearchResponse>({
  results: {
    $path: 'response.items',
    $subSchema: {
      id: ({ id }) => id.toString(),
      name: 'name',
      images: ({ image }) => [image],
      modelName: () => 'Result'
    }
  },
  totalResults: 'response.total'
})

const searchEndpointAdapter = endpointAdapterFactory<SearchRequest, SearchResponse>({
  endpoint: 'https://your.endpoint',
  requestMapper,
  responseMapper
});

After you’re finished with the adapter, pass it in the options object to the X Components and the search adapter will start providing the results from your search to the components that need them!

new XInstaller({
  adapter: myAdapter,
  ...
}).init();

Creating a great search experience for your customers doesn't have to be daunting. Empathy.co's X Components and X Adapter make it easy for developers to quickly build a reliable, adaptable search system that caters to the needs of their users. The X Adapter simplifies the process of translating between components and endpoints, while the X Components offer a range of pre-built components that can be customized to fit specific requirements. By utilizing these tools, you can provide a seamless and engaging search experience that keeps your customers coming back for more.

]]>
<![CDATA[Frequent Pattern Mining]]>TL; DR

The field of Frequent Pattern Mining (FPM) encompasses a series of techniques for finding patterns within a dataset.

This article will cover some of those techniques and how they can be used to extract behavioral patterns from anonymous interactions, in the context of an ecommerce site.

Terms and

]]>
https://engineering.empathy.co/frequent-pattern-mining/63e11961d7dc88003dedee0fTue, 11 Jul 2023 06:00:44 GMTTL; DR

The field of Frequent Pattern Mining (FPM) encompasses a series of techniques for finding patterns within a dataset.

This article will cover some of those techniques and how they can be used to extract behavioral patterns from anonymous interactions, in the context of an ecommerce site.

Terms and Concepts

Say that we are observing a piano and noting down in a ledger every note that is played. For clarity, let's concede that our piano is oversimplified and only has seven notes, from A to G.

Our ledger would read something like this: A, D, F, A, C, G, A, C, B

The contents of our ledger are a data series, and each individual note is an element (a.k.a. itemset) of the series.

We know that the ordering of the notes is rather important in a melody. Thus, we can safely assume that our series is indeed a sequence (ordered data series). This is not always the case and it really depends on what information you want to learn from the data. For instance, we might be studying how frequently two particular notes are dominant in a database of melodies. Then, we wouldn't need to guarantee a sequence.

Now imagine that we want to detect riffs. A riff is a short sequence of notes that occurs several times in the structure of a melody. In our nomenclature, a riff is then a subsequence. A subsequence of length 'n' is also called an n-sequence.

For example, take this popular tune. Can you hum along?

The 7-sequence DDCCBBA is found four times. The 3-sequence GDD is found three times.

So this is the goal of FPM put simply: find the subsequences with their frequency, i.e. how many times they are observed.
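Brute-force counting is enough to make this definition concrete. Here is a toy TypeScript sketch (illustrative only, not a mining algorithm) that counts occurrences of a contiguous n-sequence in our ledger:

```typescript
// Toy illustration of the definition above (not an FPM algorithm):
// count how many times a contiguous n-sequence occurs in a melody.
function countOccurrences(melody: string[], riff: string[]): number {
  let count = 0;
  for (let i = 0; i + riff.length <= melody.length; i++) {
    if (riff.every((note, j) => note === melody[i + j])) {
      count++;
    }
  }
  return count;
}

// The ledger from the beginning of the article.
const ledger = ['A', 'D', 'F', 'A', 'C', 'G', 'A', 'C', 'B'];
console.log(countOccurrences(ledger, ['A', 'C'])); // 2 — the 2-sequence AC occurs twice
```

Real FPM algorithms such as PrefixSpan do far better than this quadratic scan, but the quantity they report per pattern is exactly this frequency.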

In order to make things more interesting, let's consider many pianos playing at the same time. We can simultaneously write all the notes played, but we need an additional annotation to be able to tell them apart, e.g., P1 for the first piano, P2 for the second one, and so forth. This is the sid (sequence identifier).

Until now, we have considered every element to be defined by a single piece of data (e.g., 'A'), but that's obviously insufficient. Musical notes have more aspects to them than the tone: duration is also a key piece of data. We can improve our modeling by defining two aspects for our itemsets: tone and duration. For example, {tone: B, duration: 4} is an itemset defined by two items. Note that there is no concept of ordering for the items belonging to an itemset. They all occur at the same time.

Input data model for FPM algorithm

This twist allows for more interesting exercises. We could, for instance, focus our attention on finding a dominant rhythm in the melodies. Or, we could look for recurring patterns of notes combining a particular sequence of tones, followed by any tone with a particular duration.

Modeling Behavior

We will now apply the concepts above for modeling the behavioral trends observed in an ecommerce website.

We take the clickstream as our dataset, i.e., the atomic interactions that shoppers are producing on the website. These interactions include things like performing a search, clicking on a product, etc.

The Session ID is a perfect match to use as the sid, since this concept is widely used across the web for correlation purposes. Note that no other profiling data is needed. Plus, Session ID has desirable properties to it regarding privacy and security:

  • Meaningless
  • Volatile (only used during a session's lifespan)
  • Unique
  • Context-specific (not shared between sites)

The shoppers' interactions are then grouped by Session ID and sorted by timestamp when we observe them. The resulting sessions are our sequences, thus the interactions are itemsets.

Next, we must define the interactions through their items. The first consideration is that we have different kinds of shopper interactions. Thus it's logical to consider this an important item. Let's name it eventType. This may vary between websites, depending on the user experience. Let's initially consider three main types:

  • eventType: query (the shopper performs a query).
  • eventType: click (the shopper clicks on a product card from the results).
  • eventType: add2cart (the shopper adds a product to the shopping cart).

Query

At the very least, a query action is defined by its query terms. We also apply normalization processing to the query terms to improve the homogeneity of input data, e.g., lowercasing and transforming plural to singular words.

Itemset model for a query action

Click / Add2Cart

Click and add2cart are very similar regarding their relevant items, although the latter is usually a stronger signal so it's a good idea to keep them apart. The product ID is their most descriptive item, and the position of this product in the result list may also be relevant as it generally introduces bias in shoppers' behavior.

We also enrich the itemset with attributes from the shop's catalogue, in order to improve the aggregation factor. Consider that it's easier to discover patterns from items with a greater aggregation factor.

Itemset model for a click action
The selection of attributes to include in the itemsets has a big impact on the outcome of the FPM. The choices will depend primarily on the goal of the application. What kind of patterns are you aiming to discover?

Aggregation Factor

When a particular item occurs a lot across itemsets, then we say that it has a big aggregation factor. E.g., you can expect to see a lot of occurrences of the item 'eventType:query'.

The aggregation factor is necessary for discovering patterns, but it can also spoil your FPM with obvious facts. For instance, you could end up with the following top patterns:

  • {eventType:query}, {eventType:query}
  • {eventType:query}, {eventType:query}, {eventType:click}
  • {eventType:query}, {eventType:click}

The patterns based on items like 'eventType:query' are eclipsing other patterns that are probably more interesting to learn about.

A possible optimization in this case is crossing some features together in order to break the strong aggregation factor, thus gaining increased granularity, e.g.:

  • {eventType:query, terms:'cargo jumpsuit'} --> {query:'cargo jumpsuit'}
  • {eventType:click, productId:'12345'} --> {click:'12345'}

Therefore leading to more intriguing patterns like:

  • {query:'cargo jumpsuit'}, {click:'12345'}

In Action

Let's take a look at how to run FPM using Apache Spark's PrefixSpan implementation.

First, we'd typically prepare the data from clickstream as an Apache DataFrame. The key column here is sequence. Note that we have nested arrays (sequences of itemsets) in that column.

+-------+----------------------------------------------------------------------------------------------------+
|session|sequence                                                                                            |
+-------+----------------------------------------------------------------------------------------------------+
|131ABC |[[query:jacket], [click:p3, size:M, dept:casual], [click:p2, size:L, dept:casual]]|
|130ABC |[[query:jacket], [click:p2, size:L, dept:casual], [query:sport], [click:p3, size:M, dept:casual], [add2cart:p3, size:M, dept:casual]]|
|125ABC |[[click:p1, size:40, dept:footwear], [add2cart:p3, size:M, dept:casual], [click:p2, size:M, dept:casual]]|
|123ABC |[[query:sport], [click:p1, size:38, dept:footwear], [click:p3, size:M, dept:casual]]|
|124ABC |[[query:sweater], [click:p2, size:L, dept:casual], [prd:p3, size:M, dept:casual], [add2cart:p3, size:L, dept:casual], [query:shirt], [click:p2, size:L, dept:casual]]|
+-------+----------------------------------------------------------------------------------------------------+

Running FPM on top of this prepared DataFrame is simple:

  def run(df: DataFrame): DataFrame = {
    new PrefixSpan()
      .setMinSupport(minSupport)
      .setMaxPatternLength(maxPatternLength)
      .setMaxLocalProjDBSize(maxLocalProjDBSize)
      .findFrequentSequentialPatterns(df)
  }

  • minSupport: A lower threshold for discarding patterns that are not representative enough in the observed sessions. For a given pattern, support is calculated as num_sessions_present / total_sessions. This value is therefore a relative amount (0-1).
  • maxPatternLength: An upper threshold for pattern length. E.g., if this value is set to 5, then the length of the output patterns would be between 1 and 5.
  • maxLocalProjDBSize: This is an internal value that determines how many resources, in terms of memory, can be dedicated to building the internal representation (DB sequences) when running the algorithm.

It's worth mentioning that fine-tuning minSupport and maxPatternLength critically impacts performance.
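To make the minSupport semantics concrete, here is a toy TypeScript sketch of the support calculation, with sessions flattened to plain event strings rather than the itemsets the real algorithm consumes. A pattern is contained in a session if its events appear in order, not necessarily contiguously:

```typescript
// Toy sketch of the support formula above (num_sessions_present / total_sessions).
// Illustrative only: real PrefixSpan input is sequences of itemsets, not strings.
function containsSubsequence(session: string[], pattern: string[]): boolean {
  let i = 0;
  for (const event of session) {
    if (i < pattern.length && event === pattern[i]) i++;
  }
  return i === pattern.length;
}

function support(sessions: string[][], pattern: string[]): number {
  const present = sessions.filter(s => containsSubsequence(s, pattern)).length;
  return present / sessions.length;
}

// Hypothetical flattened sessions, loosely based on the DataFrame above.
const sessions = [
  ['query:jacket', 'click:p3', 'click:p2'],
  ['query:jacket', 'click:p2', 'query:sport', 'click:p3', 'add2cart:p3'],
  ['click:p1', 'add2cart:p3', 'click:p2'],
  ['query:sport', 'click:p1', 'click:p3']
];
console.log(support(sessions, ['query:jacket', 'click:p3'])); // 0.5 — present in 2 of 4 sessions
```

A pattern is kept only when this value is at least minSupport.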

As output, we get another DataFrame with the results of the analysis.

+------------------------------------------------------------------+----+
|sequence                                                          |freq|
+------------------------------------------------------------------+----+
|[[size:M, dept:casual], [click:p3]]                               |53  |
|[[query:jacket], [click:p3, dept:casual, size:L]]                 |34  |
|[[query:shirt], [size:M, dept:casual], [add2cart:p2]]             |31  |
+------------------------------------------------------------------+----+

Note that many of the resulting itemsets are now "incomplete". They act as templates, showing you the common factors extracted from the itemsets originally observed.

Alternative

Apache Spark also provides another implementation for performing FPM: FP-Growth.

It's interesting to know the differences, as it can also be a valid approach for some of the use cases.

  • Unordered discovery: When extracting patterns, FP-Growth doesn't take the order of the sequences into consideration. This can be handy for detecting elements that are frequently found together, not necessarily with the implicit cause-effect relationship that ordering implies. For instance, "products bought together" is a case of unordered discovery.
  • Simplified data model: The elements from input sequences are not expected as itemsets. They are not nested arrays, but plain values. This means that you can't expect to extract a common factor from composed elements. This may, however, still be valid for your use case.
  • Repeated elements not allowed: The same element can't appear twice in a given sequence. You'll need to filter out those cases before running the algorithm.
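To get a feel for unordered discovery, the following toy TypeScript sketch naively counts "bought together" pairs across transactions. FP-Growth finds the same kind of frequent itemsets without explicitly enumerating every combination; this only illustrates the idea, not the algorithm:

```typescript
// Naive "products bought together" sketch (illustration of the goal of
// FP-Growth, not its algorithm): count transactions containing each pair.
function frequentPairs(transactions: string[][], minCount: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const tx of transactions) {
    const items = [...new Set(tx)].sort(); // dedupe; order is irrelevant here
    for (let i = 0; i < items.length; i++) {
      for (let j = i + 1; j < items.length; j++) {
        const key = `${items[i]}+${items[j]}`;
        counts.set(key, (counts.get(key) ?? 0) + 1);
      }
    }
  }
  // Keep only pairs seen in at least minCount transactions.
  return new Map([...counts].filter(([, n]) => n >= minCount));
}

const txs = [['p1', 'p2'], ['p2', 'p3'], ['p1', 'p2', 'p3']];
console.log(frequentPairs(txs, 2)); // pairs p1+p2 and p2+p3 each occur in 2 transactions
```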

What Next?

Now that you have discovered the representative patterns in your data, you can interpret this info statistically as a source of predictions and recommendations for your shoppers. Optimizing the most frequent shopper journeys on a website (e.g., by suggesting shortcuts) is another interesting use case.

Patterns can be used to learn trends or association rules, so stay tuned for a post on that topic!

References

Frequent pattern discovery - Wikipedia
Frequent Pattern Mining - Spark 3.3.2 Documentation
]]>
<![CDATA[Progressive Delivery: Argo Rollouts Adoption]]>Introduction

Progressive Delivery is emerging as a worthy successor to Continuous Delivery, by enabling developers to control how new features are launched to end users. Its wide popularity is owed to the demand for faster and more reliable software releases. The increasing emphasis on customer experience has begun to push

]]>
https://engineering.empathy.co/progressive-delivery-argo-rollouts-adoption/6481edd6cf442b00012eb3d3Mon, 12 Jun 2023 06:00:44 GMTIntroduction

Progressive Delivery is emerging as a worthy successor to Continuous Delivery, by enabling developers to control how new features are launched to end users. Its wide popularity is owed to the demand for faster and more reliable software releases. The increasing emphasis on customer experience has begun to push the Continuous Delivery methodology to the wayside. Large enterprises like Netflix, Amazon, and Uber are turning to Progressive Delivery to test and release code in a phased and controlled manner.

In a nutshell, Progressive Delivery empowers developers to roll out code changes to a subset of users and then expand them to all users. The progressive rollout of features is executed through techniques like blue-green deployment, feature flagging, and canary deployments. You can mitigate issues by promoting a version to all users only when you're confident that it is performant and reliable. And if it fails in production, the impact radius is restricted to a subset of users, and the update can be rolled back immediately.


Toolkit for Progressive Delivery

Apart from laying the foundation with Kubernetes, GitOps, and a service mesh, the key piece of the entire puzzle is a purpose-built progressive delivery tool.

Argo Rollouts is a Kubernetes controller and set of CRDs which provide advanced deployment capabilities such as blue-green, canary, canary analysis, experimentation, and progressive delivery features to Kubernetes.

We have also performed a deep analysis of the solution chosen at Empathy.co, outlining the requirements, benefits and a comparison between Argo Rollouts and Flagger.


Progressive Delivery Strategy

Progressive Delivery can be implemented through a number of strategies:

Canary Release

Channel a limited amount of traffic to a new canary service, and then if it passes reliability tests, you can gradually shift all traffic from the old to the new service, and the canary becomes the default version.

Feature Flag

Control the code launch remotely through a toggle-like feature, which enables changes to be rolled back immediately in the event of a failure.

Blue-Green Deployment

Gradually transfer traffic from an existing application (blue) to a newer one (green), while the blue version acts as a backup.

A/B Testing

Expose two different categories of the audience to two different application versions and analyze their performance to decide which is the ideal version.

Which tactic you pick depends on your goals and which you think would fit your workloads best.

Custom Strategy

Although those are the basic methods, Argo Rollouts allows custom analysis to be added and all of its capabilities to be explored, making it possible to create your own strategy for releases. For instance, a custom strategy with the following capabilities could be defined:

  • Scale the canary stack up for testing purposes
  • Set some header-based traffic shaping to the canary while setWeight is still set to zero
  • Begin testing and then, when the tests are OK, start the canary promotion
  • Canary promotion sends traffic gradually
  • Automated rollback in case the SLO Error Budget is burnt out, based on analysis of Prometheus metrics
  • Scale down the old application release after a while, in order to guarantee a faster rollback in case of failures; no need to scale up the previous version
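As an illustration, the first steps of a strategy along those lines could look roughly like the following Rollout manifest. This is a hedged sketch: names such as my-app, error-budget-check, and the image tag are placeholders, and the exact fields should be verified against the Argo Rollouts documentation for your version:

```yaml
# Sketch of a custom canary strategy (placeholder names; verify fields
# against the Argo Rollouts reference before use).
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  strategy:
    canary:
      steps:
        - setWeight: 0          # canary stack is up, but receives no live traffic yet
        - pause: {}             # manual gate: run tests against the canary here
        - setWeight: 20         # begin gradual promotion
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
      analysis:
        templates:
          - templateName: error-budget-check   # hypothetical AnalysisTemplate backed by Prometheus
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:v2   # placeholder image
```

A failing analysis run aborts the promotion and shifts traffic back to the stable ReplicaSet, which is what makes the automated rollback bullet above possible.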

Outline the Metrics You Want to Measure

With Progressive Delivery, you can reduce risk. This is mainly because you continuously test code changes, analyze performance and implement your learnings all in real time. To ensure that this happens seamlessly, it's important to have KPIs against which to measure the success of your release. In our case, the SLO metrics would be a good KPI to choose. Thanks to Pyrra, there is a set of Prometheus Rules to monitor the Error Budget of your application and send an alert in case it is close to being burnt out.

Those metrics can be added as part of the Analysis in Argo Rollouts, so be sure to pay attention to how to explore Prometheus metrics in the Analysis and that your metrics make sense for checking the success of your release.

Check your Pyrra configurations, as they will be propagated as Prometheus Rules and will create some critical alerts for the Error Budget.

Migration: From Deployment to Rollout

There are multiple ways to migrate to Rollout, but here we'll explain the simplest one:

⚠️
When migrating a Deployment which is already serving live production traffic, a Rollout should run next to the Deployment before deleting the Deployment or scaling down the Deployment. Not following this approach might result in downtime. It also allows for the Rollout to be tested before deleting the original Deployment.

During the migration, the Deployment and the Rollout should coexist to avoid downtime. After that, the Deployment resource can be scaled down to zero replicas.

Temporary ingress and service resources will be created to avoid downtime and ensure correct behavior during the migration, because Rollouts introduce a hash label in the selector to route the traffic.

After the migration, the temporary service and ingress can be deleted.

Kubectl Plugin

Argo Rollouts offers a Kubectl plugin to enrich the experience with Rollouts, Experiments, and Analysis from the command line.

Troubleshooting Argo Rollouts

Rollouts

Q: Can I restart a Rollout?

A: Sure, you can restart a Rollout just like you can restart a Deployment. There are multiple ways to do so: using the kubectl plugin, directly from Argo Rollouts UI, or through the Argo CD UI.

Q: Does the Rollout object follow the provided strategy when it is first created?

A: As with Deployments, Rollouts do not follow the strategy parameters on the initial deploy. The controller tries to get the Rollout into a steady state as fast as possible by creating a fully scaled-up ReplicaSet from the provided .spec.template. Once the Rollout has a stable ReplicaSet to transition from, the controller starts using the provided strategy to transition the previous ReplicaSet to the desired ReplicaSet.

Rollbacks

Q: If I use both Argo Rollouts and Argo CD, won't I have an endless loop in the case of a Rollback?

A: No, there is no endless loop. As explained in the previous question, Argo Rollouts doesn't tamper with Git in any way. If you use both Argo projects together, the sequence of events for a Rollback is the following:

  1. Version N runs on the cluster as a Rollout (managed by Argo CD). The Git repository is updated with version N+1 in the Rollout/Deployment manifest.
  2. Argo CD sees the changes in Git and updates the live state in the cluster with the new Rollout object.
  3. Argo Rollouts takes over as it watches for all changes in Rollout Objects. Argo Rollouts is completely oblivious to what is happening in Git. It only cares about what is happening with Rollout objects that are live in the cluster.
  4. Argo Rollouts tries to apply version N+1 with the selected strategy (e.g. blue-green).
  5. Version N+1 fails to deploy for some reason.
  6. Argo Rollouts scales back again (or switches traffic back) to version N in the cluster. No change in Git takes place from Argo Rollouts.
  7. The cluster is running version N and is completely healthy.
  8. The Rollout is marked as "Degraded" both in ArgoCD and Argo Rollouts.
  9. Argo CD syncs take no further action as the Rollout object in Git is exactly the same as in the cluster. They both mention version N+1.

Q: How can I run my own custom tests (e.g. smoke tests) to decide if a Rollback should take place or not?

A: Use a custom Job or Web Analysis.

Analysis

Q: What is the difference between failures and errors?

A: Failures are when the failure condition evaluates to true, or an AnalysisRun without a failure condition evaluates the success condition to false. Errors are when the controller has any kind of issue with taking a measurement (i.e. an invalid Prometheus URL).

For more information, check out the Argo Rollouts FAQ.

Takeaways

Argo Rollouts offers enhancements to the usual Kubernetes deployment strategies, and the main highlights of adoption are:

  • Ability to choose your strategy and select the one that best suits your needs
  • No need to adjust to canary or blue-green, because Argo Rollouts allows you to create a custom strategy
  • Special attention to migration from Deployment to Rollout with some caveats that must be taken into consideration.

]]>
<![CDATA[Why should my team start automating?]]>Maybe you’re still gazing at this title, a bit suspicious and wondering to yourself, “Start automating? What does she mean, we’ve been doing it for ages!” If that’s the case, congratulations, you’re on the right path! Yet, you might find

]]>
https://engineering.empathy.co/why-should-my-team-start-automating/644a21e6c84b58003d8e32bdThu, 27 Apr 2023 13:07:03 GMTMaybe you’re still gazing at this title, a bit suspicious and wondering to yourself, “Start automating? What does she mean, we’ve been doing it for ages!” If that’s the case, congratulations, you’re on the right path! Yet, you might find this article useful to remember the old, dark days and sigh.

But for those who haven’t yet found the right moment to start, or even don’t see the value, I hope this story empowers you to join the bright side of testing and transform the initial question, “Why should my team start automating?” into “Why haven’t we started automating yet?”


Bumps in the road to automation

“It’s too expensive and brings no benefit.”

Correct, automation will imply a big initial investment. It’s necessary to dedicate time and people to find the tool that best fits the framework and programming language your team is using. Of course, your team will also need proper training as to how to use that tool and how to design simple, maintainable tests.


Moreover, it’s not possible to provide an exact ROI calculation, since automation brings test coverage that wouldn’t be possible to achieve and handle manually. Think of a test regression suite for a project with multiple interconnected features. During the first few iterations, it might be possible to manually run all acceptance and regression tests, but the more iterations, the bigger the regression suite would be. How long would it take for that suite to become unmanageable? Consequently, how long would it take for the team to give up on those tests?

“We don’t have a dedicated specialist on the team.”

Well, it’s not strictly necessary. Adopting Agile practices means not only that there is no longer room for a dedicated Quality & Productivity team working on demand at the end of each project release, but also that quality becomes a whole-team effort. The entire team needs to adopt a quality mindset, as everyone is responsible for the quality of the product. Whether you are a developer who has no experience in automating tests or a manual tester without a programming background, never fear. It’s a team responsibility, and there will be colleagues on the team who will be able to help.

“But we don’t know how to automate!”

Take a look at the beginning of the previous section. Do you see that “It’s not strictly necessary” bit? That’s because although the whole team might agree to learn automation skills, it may still be a good idea to bring in a quality expert to coach and support the team, helping to adopt the new practices and making sure those first steps head in the right direction: toward a solid, maintainable and easy-to-understand automation test suite. A Quality & Productivity Engineer will know the team is ready when automation has become an assumed, natural process.

“We don’t have time…”

It’s harder to take on new habits when there are old ones to fall back on. If the team isn’t given sufficient time to get up to speed with automation and, later on, to design and implement the automated tests during each sprint, they will go back to old practices. Automation will be skipped as the lesser of two evils, and some exploratory testing will be done, at most. Best case scenario, the automation tasks are resumed in the following sprint, with the delivered business value reduced. Worst case scenario, deliverables seem to hold up without proper testing, so the team falls for the misbelief that tests are not necessary.

Tests with benefits

At this point, there may still be some skeptical readers thinking that if their software is simple enough, they can still survive without automated tests.

Let’s imagine the following scenario: there’s a small application developed by a team that always has enough time allotted to manually test all the progress made during each iteration. Some bugs are found during the testing phase, and a few others make it to the production environment, yet the situation is not concerning. How could automation benefit the team?

  • Better understand the requirements and build testable code.
  • Bring confidence to the code. While in manual testing, the Quality and Productivity Engineer is the one providing this sense of security, with automated testing, this focus is moved to the tests themselves.
  • Get early feedback, helping developers to troubleshoot quickly. As bugs are spotted sooner, there’s more time to find the root cause, come up with a solution and redesign the code if needed, contributing to more solid code.
  • Eliminate the possibility of human mistakes during execution, since every test is run exactly the same way.
  • Decrease the team's frustration. Automation reduces repetitive tasks; consider that the more code is added, the longer manual testing would take.
  • Increase test coverage. Think about complex scenarios with a wide range of input data to cover. Most likely, only some of those scenarios will be tested if only manual testing is performed.
  • Create living documentation. It’s easy to forget to update static documentation every time a feature changes. Automated tests, however, must be updated whenever functionality changes, so the documentation they provide stays current.
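To make the coverage point concrete, here is a minimal sketch of a table-driven test (Python is used for illustration, and `normalize_query` is a hypothetical function standing in for any unit under test). Each new case is a single line, and the case table doubles as the living documentation mentioned above.

```python
def normalize_query(query: str) -> str:
    """Hypothetical unit under test: trim, collapse whitespace, lowercase."""
    return " ".join(query.split()).lower()

# Each tuple is (raw input, expected output). Running all of these by hand
# on every change would be tedious; here it is one loop.
CASES = [
    ("Shoes", "shoes"),
    ("  red  SHOES  ", "red shoes"),
    ("", ""),
    ("\tRunning\nshoes", "running shoes"),
]

def test_normalize_query() -> None:
    for raw, expected in CASES:
        assert normalize_query(raw) == expected, f"failed for {raw!r}"

test_normalize_query()
```

Adding a newly discovered corner case is just one more tuple, so the coverage grows with the product instead of with the testers' available time.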

Have I brought you to the bright side of automation? Happy testing and, as always, bring on any questions or comments!

]]>
<![CDATA[Creating Component Tests for Spark Applications]]>

One of the main engineering challenges faced by the Empathy.co Data Team is creating robust tests for our Spark applications. Since these applications are constantly evolving, as for any application, we needed a way to ensure changes wouldn’t break the code; a guarantee that the output from

]]>
https://engineering.empathy.co/creating-component-tests-for-spark-applications/641c2d0fee4bd9003d141038Thu, 23 Mar 2023 14:13:19 GMT

One of the main engineering challenges faced by the Empathy.co Data Team is creating robust tests for our Spark applications. Since these applications are constantly evolving, as for any application, we needed a way to ensure changes wouldn’t break the code; a guarantee that the output from our jobs would remain the same when refactored or when the input schema of the data is changed.

The biggest hurdle here was determining how to create component tests that check two key boxes:

  1. The aggregated results are as expected.
  2. The output columns of our DataFrame are the same and the schema remains unbroken.

Our Spark project architecture is based on the Spring Boot framework. We use this framework to handle the arguments we pass to our application via environment variables or command-line arguments. For that reason, all of our jobs follow the same structure, using configuration beans to read the configuration properties. All the jobs implement the run method from the Spring Boot ApplicationRunner interface. So, the final objective is to test the run method of our applications to ensure the aggregation results are returned as expected. Individual methods can also be tested with unit tests, but that is out of scope for this blog post.

Let’s imagine that the structure of one of the Spark Jobs we want to test is the following:

@SpringBootApplication(
  exclude = Array(classOf[MongoAutoConfiguration]),
  scanBasePackages = Array[String]("my.configuration.package")
)
@ConfigurationProperties("job-name")
class MyBatch @Autowired() (
    @BeanProperty var sparkConfig: SparkConfig,
    @BeanProperty var inputConfig: InputConfig
) extends Serializable
  with ApplicationRunner {

  @BeanProperty
  var outputPath: String = _

  override def run(args: ApplicationArguments): Unit = {

    val inputDf = readParquet(
      sparkConfig.sparkSession,
      inputConfig.path,
      inputConfig.getStartDate,
      inputConfig.getEndDate
    )

    val parquetDF = letTheJobPerformItsJob() // aggregation over inputDf omitted

    if (!Option(outputPath).forall(_.isEmpty)) {
      parquetDF.save(
        fullOutputPath, // derived from outputPath
        partitionKeys = Seq(RawSchemaConstants.Yyyy, RawSchemaConstants.Mm)
      )
    }
  }
}

Next, we will be using the Mockito framework to write the test code. Let’s start the test by creating the Spark session:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.mockito.ArgumentCaptor
import org.mockito.ArgumentMatchers.any
import org.mockito.ArgumentMatchers.anyString
import org.mockito.ArgumentMatchers.isNull
import org.mockito.MockitoSugar
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.springframework.boot.DefaultApplicationArguments

class MyBatchTest extends AnyFlatSpec with Matchers with MockitoSugar {

 val spark = SparkSession.builder
   .master("local[*]")
   .getOrCreate()

 import spark.implicits._

}

All of our Spark batch jobs follow the same working order: first, the source DataFrame is read from a bucket. Then, we make some aggregations, and finally, the results are either saved to a different bucket or to a MongoDB collection, or both.

The next step is to define the input DataFrame that would be read from the source bucket during a real execution, which would need to contain the data required to test the different scenarios that could arise (such as corner cases). The content of this DataFrame depends on each test logic.

To make the code more readable, we first define some aliases (SV, SFV) for objects that group the constant values used repeatedly. Also, the code shown here is a reduced version of the DataFrame; for readability, not all the rows present in the source DataFrame of the test are shown. We could define several DataFrames inside different tests to check individual scenarios, and use a larger one to test the overall aggregation with all the scenarios together, but all of them follow this same procedure:

val testData = DataFrameGenerator.generateSampleDataFrame(
 SV.Year,
 SV.Month,
 SV.Day,
 Seq(
   (SV.St1, Sinks.Q, SFV.LangEngScopeDesktop),
   (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
   (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
   [...]
 ).toDF(
   SourceColumnNames.Instance,
   SourceColumnNames.Sink,
   SourceColumnNames.Filters
 )
)

We define an empty map for the output that would be written to MongoDB, in order to mock that method, since we are not interested in writing any results to MongoDB:

val emptyMongoInstanceDFs = Map[WriteConfig, DataFrame]()

After that, we need to create a mock object for the methods that read the source data and write the results, which in our case are located in an implicit class called DataFrameMethods that extends the Spark DataFrame object. We also define a Mockito ArgumentCaptor object that will allow us to capture the result of the Spark job, which is then passed to one of the methods that performs the write operation. This code is written inside the test code:

"My Batch" should "work" in {

 val dfMethodsMock = mock[DataFrameMethods]
 val dataFrameCaptor = ArgumentCaptor.forClass(classOf[DataFrame])

}

Now, we instantiate the job we want to test and create a spy object to mock some methods:

val myBatch = new MyBatch(
 SparkTestConfig.getSparkConfig,
 SparkTestConfig.getInputConfig
)
val spyBatch = spy(myBatch)

The input beans that the batch receives as arguments (Spark session, input config, etc.) can be created as shown below. Our batch receives more input beans, such as the output parameters, but for the sake of simplicity they are omitted from this snippet:

def getSparkConfig: SparkConfig = {
 val sparkConfig = new SparkConfig()
 sparkConfig.useLocalMaster = true
 sparkConfig.construct()
 sparkConfig
}


def getInputConfig: InputConfig = {
 val inputConfig = new InputConfig()
 inputConfig.path = "inputPath"
 inputConfig.timeZone = "UTC"
 inputConfig.date = "2022-01-01"
 inputConfig
}

The next step is to mock the methods that read the source data and write the results. The read method should return the input DataFrame we previously created, and the parquet write method is set to do nothing. The method that generates the documents written to MongoDB is also mocked, to avoid writing results to any database; we only want to capture the argument with the resulting DataFrame when the method is called.

doReturn(testData)
  .when(spyBatch)
  .readParquet(
    any[SparkSession],
    ArgumentMatchers.eq("inputPath"),
    ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-01"), LocalTime.MIN, ZoneId.of("UTC"))),
    ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-02"), LocalTime.MIN, ZoneId.of("UTC")))
  )

doNothing
 .when(dfMethodsMock)
 .save(ArgumentMatchers.eq("outputPath"), ArgumentMatchers.eq(Seq("yyyy", "mm")))

doReturn(emptyMongoInstanceDFs)
 .when(spyBatch)
 .generateInstanceDFs(any[DataFrame], isNull[String], any[DatabaseConfig])

Here comes the most interesting part of the test: the batch code is executed by calling the run method and the resulting DataFrame is captured with the argument captor by using the method that generates the documents written to MongoDB. We mocked this method so it generates an empty collection, but we need to capture the input argument to get the Spark job results. In this case, we are using the method that saves the result into a MongoDB collection, but we could just as well use the one that writes the result into a bucket in parquet format.

spyBatch.run(new DefaultApplicationArguments(""))

verify(spyBatch)
 .generateInstanceDFs(
   dataFrameCaptor.capture(),
   any(),
   any[DatabaseConfig]
 )

val result: DataFrame = dataFrameCaptor.getValue.asInstanceOf[DataFrame].cache()

The last part of the test consists of checking the result. There are several ways to perform this check. One consists of collecting the results and comparing the rows. Although we are working with small DataFrames, we prefer to do this in a distributed way by comparing the DataFrames without collecting the data into the Spark driver.

So, let’s define the expected DataFrame. The key point here is to create the DataFrame with the columns in the same order they are aggregated by the job; otherwise, the check will fail:


val expectedData = Seq(
  (SV.St1, SFV.EmptyFilterStr, 1, 1, 3),
  (SV.St1, SFV.LangEngStr, 1, 2, 1),
  [...]
).toDF(
  FinalColumnNames.Instance,
  FinalColumnNames.Filters,
  FinalColumnNames.QueryCount,
  [...]
)

We check to see that the total number of rows matches and that the DataFrames are equal by subtracting one from the other and checking that the results are empty:

assert(10 === result.count())
assert(expectedData.except(result).isEmpty)

We know that our jobs do not produce duplicated rows (exactly equal rows), so this check is enough to ensure that the expected and resulting DataFrames are the same. With repeated rows, we would need additional checks, such as counting the number of times each row is repeated. We can also subtract the DataFrames in the inverse order and perform the same check, to be fully confident in the results of the test.

assert(result.except(expectedData).isEmpty)
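To see why the `except`-based check needs that caveat about repeated rows, here is a plain-Python sketch of the idea (not Spark code): Spark's `except` behaves like a distinct set difference, so comparing per-row counts is the stricter, multiset view. In Spark this would correspond to grouping by all columns and counting before the subtraction.

```python
from collections import Counter

# Two "DataFrames" as plain lists of rows: same distinct rows,
# but with different multiplicities.
expected = [("st1", 3), ("st1", 3), ("st2", 1)]
result   = [("st1", 3), ("st2", 1), ("st2", 1)]

# A set-style difference (what a distinct `except` does) is empty in
# both directions, so the duplicate mismatch goes unnoticed...
assert set(expected) - set(result) == set()
assert set(result) - set(expected) == set()

# ...while comparing per-row counts (the multiset view) catches it.
assert Counter(expected) != Counter(result)
```

Since our jobs guarantee no duplicated rows, the two `except` checks above remain sufficient for our case.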

To summarize, these are the steps to follow for Spark application component testing:

  1. Mock the beans required by the application and instantiate the Spark session and the Spark application.
  2. Mock the read method to use a DataFrame defined within the test and then write the method so that it does not perform any write operation.
  3. Create a spy for the application object so it can mock the read and write (result) methods.
  4. Create an argument captor to retrieve the result of the Spark application, which is passed via parameter to the write method.
  5. Run the code and capture the resulting DataFrame.
  6. Define the expected DataFrame with the columns defined in the same order as calculated by the Spark application.
  7. Compare the expected result with the DataFrame that was retrieved using the argument captor.

All together, the test code should appear like this:

import com.eb.data.batch.{SampleValues => SV}
import com.eb.data.batch.{SampleFiltersValues => SFV}
import com.eb.data.batch.{SampleSinks => Sinks}
import com.eb.data.batch.config.DatabaseConfig
import com.eb.data.batch.io.df.DataFrame.DataFrameMethods
import com.eb.data.batch.DataFrameGenerator
import com.eb.data.batch.FinalColumnNames
import com.eb.data.batch.SourceColumnNames
import com.eb.data.batch.SparkTestConfig
import com.eb.data.batch.UDFUtils
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.mockito.ArgumentCaptor
import org.mockito.ArgumentMatchers
import org.mockito.ArgumentMatchers.any
import org.mockito.ArgumentMatchers.anyString
import org.mockito.ArgumentMatchers.isNull
import org.mockito.MockitoSugar
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.springframework.boot.DefaultApplicationArguments

import java.time.LocalDate
import java.time.LocalTime
import java.time.ZoneId
import java.time.ZonedDateTime

class MyBatchTest extends AnyFlatSpec with Matchers with MockitoSugar {

 val spark = SparkSession.builder
   .master("local[*]")
   .getOrCreate()

 import spark.implicits._

 val testData = DataFrameGenerator.generateSampleDataFrame(
   SV.Year,
   SV.Month,
   SV.Day,
   Seq(
     (SV.St1, Sinks.Q, SFV.LangEngScopeDesktop),
     (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
     (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
     [...]
   ).toDF(
     SourceColumnNames.Instance,
     SourceColumnNames.Sink,
     SourceColumnNames.Filters
   )
 )

 val emptyMongoInstanceDFs = Map[WriteConfig, DataFrame]()

 "My batch" should "work" in {

   val dfMethodsMock = mock[DataFrameMethods]
   val dataFrameCaptor = ArgumentCaptor.forClass(classOf[DataFrame])

   val myBatch = new MyBatch(
     SparkTestConfig.getSparkConfig,
     SparkTestConfig.getInputConfig
   )
   val spyBatch = spy(myBatch)

    doReturn(testData)
      .when(spyBatch)
      .readParquet(
        any[SparkSession],
        ArgumentMatchers.eq("inputPath"),
        ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-01"), LocalTime.MIN, ZoneId.of("UTC"))),
        ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-02"), LocalTime.MIN, ZoneId.of("UTC")))
      )

    doNothing
      .when(dfMethodsMock)
      .save(ArgumentMatchers.eq("outputPath"), ArgumentMatchers.eq(Seq("yyyy", "mm")))

   doReturn(emptyMongoInstanceDFs)
     .when(spyBatch)
     .generateInstanceDFs(any[DataFrame], isNull[String], any[DatabaseConfig])

   spyBatch.run(new DefaultApplicationArguments(""))

   verify(spyBatch)
     .generateInstanceDFs(
       dataFrameCaptor.capture(),
       any(),
       any[DatabaseConfig]
     )

   val result: DataFrame = dataFrameCaptor.getValue.asInstanceOf[DataFrame].cache()

    val expectedData = Seq(
      (SV.St1, SFV.EmptyFilterStr, 1, 1, 3),
      (SV.St1, SFV.LangEngStr, 1, 2, 1),
      [...]
    ).toDF(
      FinalColumnNames.Instance,
      FinalColumnNames.Filters,
      FinalColumnNames.QueryCount,
      [...]
    )

   assert(10 === result.count())
   assert(expectedData.except(result).isEmpty)
   assert(result.except(expectedData).isEmpty)
 }
}

In a follow-up blog post, we will explore how to improve the efficiency of the Spark session creation and reduce the execution time when performing a battery of tests during the same execution. For now, we hope this serves as a helpful guide to performing component testing in Spark. As always, if you have any questions or comments, please feel free to reach out!

]]>
<![CDATA[Image Generation with AI Models]]>Introduction

In recent years, improvements in results from AI models have come from two different sources.

On one hand, the improvement in hardware capabilities in terms of raw computational power has enabled larger models to be trained. As a result, AI models have been able to get bigger and bigger

]]>
https://engineering.empathy.co/image-generation-with-ai-models/6391e50ae4a60d003d233c34Thu, 08 Dec 2022 13:53:07 GMT

Introduction

In recent years, improvements in results from AI models have come from two different sources.

On one hand, the improvement in hardware capabilities in terms of raw computational power has enabled larger models to be trained. As a result, AI models have been able to get bigger and bigger (i.e. have more and more parameters). This increase in parameters has also made the models more precise, but besides hardware and bigger models there is another component: we need a way to train them. To do so, enormous datasets are required; as the number of parameters grows, the dataset size has to grow accordingly. For example, DALL·E 2, one of the biggest models in the text-to-image space with about 3.5 billion parameters, was trained on a dataset of 650 million image-caption pairs obtained from multiple sources all over the internet. (Dataset creation is a whole other issue, so we will not delve further into that topic.)

On the other hand, there have been architectural improvements in these huge text-to-image models that can’t be achieved by brute force alone. In recent years, with the breakthrough of the so-called diffusion models, the results produced by new AI models have improved enormously even without increasing their parameter counts (e.g. DALL·E 2 has fewer parameters than its previous version, but performs much better).

Even though research and development of new models is continual, 2022 has seen explosive growth in the Image Generation field. Where before there were small improvements on a year-to-year basis, this year has brought great improvements on a monthly, weekly, and even daily basis. In fact, while writing this article, Version 2 of Stable Diffusion was just released on November 24. While this article will not cover the latest version of Stable Diffusion, all ways of working related to this model that will be covered can most likely be applied to the latest version.

Getting Started with Image Generation Models

The first step to begin using image generation for personal use is to decide which model to use. As of now, there are three main models for image generation: Stable Diffusion, DALL·E 2 and Midjourney. They are not the only options; others, such as NovelAI or Jasper Art, exist, but this post will focus on the main three.

The six key factors to ponder when deciding which model to use are laid out in the table below:

| | DALL·E 2 | Midjourney | Stable Diffusion |
| --- | --- | --- | --- |
| Openness of use | Open for anyone | Open for anyone | Open for anyone |
| Pricing | Pricing based on resolution of the generated image | Subscription-based with several tiers; first 25 images free | Free model download and usage; paid, credit-based version if using Dream Studio |
| Ease of use | Easy to create good prompts | Easy to create good prompts | Hard to create good prompts; steep learning curve |
| Community support | Subreddit; few community-based applications | Subreddit and Discord | Subreddit; many community-based applications; tools to find good prompts |
| Content policies | Automatic filters to avoid disturbing or offensive content; the user has full responsibility for the content generated | Rules on Discord to avoid disturbing or offensive content; the user has full responsibility for the content generated | Safety filters on Dream Studio; filters can be disabled when downloading and using the model; the user has full responsibility for the content generated |
| Results obtained | Consistent quality of results, better for enterprise use | Leaning towards more artistic/conceptual generations | Inconsistent due to difficult prompt creation, but quality results can still easily be obtained |

After deciding which model to use, the next step is learning how to use it. DALL·E 2 can be used on the OpenAI website. To use Midjourney, you need to join their Discord server, which requires a Discord account.

[Image: Dream Studio web interface]
Credit: https://medium.com/codex/dream-studio-stable-diffusions-ai-art-web-app-tool-5687e75cc100

Finally, for Stable Diffusion there are three options:

  • Using Dream Studio: This is the official tool, based on the web, released by StabilityAI, the company behind Stable Diffusion.
  • Deploying and using the model on a personal computer: With a GPU with 8+ GB of VRAM, Stable Diffusion can run on your personal computer. You can follow these steps on Reddit to get going.
  • Using WebUI forks developed from the community, running on Google Colab: This is my go-to choice, as it can be run using the free tier of Google Colab and there is no need for a beefy computer. I like this Github repository that has the tooling itself and Google Colab notebooks ready to use.

Regardless of the option chosen for running Stable Diffusion, I recommend watching this YouTube video (in Spanish), where all the concepts are explained in more depth, as well as exploring the Stable Diffusion community subreddit to look for prompts and updates on community tooling. Another place to look for prompts is Lexica Art.

Model Comparison

Now that it is clear what each of these models offers and how to begin using them, it is time for a visual comparison between the images that they generate. As part of this section, a direct comparison using the same subset of prompts is shown below. (Please note that the images shown are not mine nor Empathy.co’s. You can view the originals here.)

[Images: results from DALL·E 2, Midjourney and Stable Diffusion for the same set of prompts]


As shown in the images above, all three models produce high-quality results, though with some clear differences. Stable Diffusion is the most inconsistent style-wise, sometimes producing realistic results, like the snow-covered mountain, and other times painting-like ones, such as the cherry blossom image.

DALL·E 2 and Midjourney are much more consistent in their respective styles, with DALL·E 2 being the most consistent and having a style that is more suited to enterprise use. Midjourney’s style leans more towards conceptual art-like results but the quality is a bit less consistent in comparison with the results from DALL·E 2 (e.g. its Cherry Blossom image vs. the others).

What can these models be used for?

These powerful image generation models present new workflows and applications for art. For example, they can be used to generate icon-like images for programs, generate art for characters and places in tabletop role-playing games, transform video recordings, be included in current workflows for artists and graphic designers, and the list goes on.

One Last Thing…

There is an important topic to consider regarding Image Generation models (and pretty much all AI-related models): the ethical concerns that arise from these new technologies.

The conversation is mounting around how these models were trained. As noted in the Introduction, all of them were trained on huge image datasets obtained from public sources, mainly the internet. The problem is that part of each dataset is formed by images made by artists, including copyrighted ones. That, combined with the fact that these AIs (particularly Stable Diffusion) can be guided toward certain styles, means some generated images bear a clear similarity to specific artists’ styles. Artists from all over the world have raised the issue of plagiarism. Are the images generated by these AIs plagiarised? Is it right to use them? The discussion is ongoing.

Another related concern is that if images can be described and generated in a matter of seconds, artists will have fewer commissions and receive less work. This is understandably concerning for visual artists, and not to be taken lightly. Controversy also arose after an AI-generated image won first place in Colorado State Fair’s fine arts competition. The decision was met with backlash and claims that the winner cheated, and regulations banning AI submissions in those kinds of competitions have been requested.

Privacy is also a matter that needs to be attended to, as a report from Ars Technica claimed that the training dataset used by Stable Diffusion contained hundreds or even thousands of private medical records. While the responsibility of filtering these images from the dataset is not clear, LAION, the company behind the dataset used by Stable Diffusion, says that it does not host these images on its servers and that filtering sensitive information is the responsibility of the companies that create these AI models. However, LAION provides the URLs for the images used in its datasets, so it arguably shares part of the responsibility.

That said, enjoy your journey of creating AI-generated images while respecting the privacy and intellectual property of others!


]]>
<![CDATA[Progressive Delivery in Kubernetes: Analysis]]>Overview

In this post, an analysis of Progressive Delivery options in the Cloud Native landscape will be done to explore how this enhancement can be added in a Kubernetes environment. The most embraced tools from the Cloud Native landscape will be analyzed (Argo Rollouts and Flagger), along with some takeaways

]]>
https://engineering.empathy.co/progressive-delivery-in-kubernetes-analysis/63878ba4ee3d91003d957e08Thu, 01 Dec 2022 09:29:54 GMT

Overview

In this post, we analyze Progressive Delivery options in the Cloud Native landscape to explore how this enhancement can be added to a Kubernetes environment. The most widely adopted tools (Argo Rollouts and Flagger) will be analyzed, along with some takeaways and the selection that best fits Empathy.co’s needs.

Motivation

The native Kubernetes Deployment object supports the Rolling Update strategy, which provides basic guarantees during an update but comes with limitations:

  • Few controls over the speed of the rollout
  • Inability to control traffic flow to the new version
  • Readiness probes are unsuitable for deeper, stress or one-time checks.
  • No ability to check external metrics to verify an update
  • No ability to automatically abort and rollback the update

For the reasons above, in a complex production environment this Rolling Update can be risky: it provides no control over the blast radius, may roll out too aggressively, and offers no rollback automation in case of failure.

Requirements

Although there are multiple tools that offer multiple deployment capabilities, the minimum requirements would be:

  • GitOps approach: The tool chosen should work under GitOps approach, no manual changes required
  • NGINX Ingress Controller compatibility: The goal is to add more deployment alternatives, not to research a different Kubernetes ingress controller
  • Prometheus analysis compatibility: The metrics are saved on Prometheus, the tool should allow using Prometheus queries to perform a measurement.
  • Compatible with multiple Service Mesh options: The chosen tool should not be tied to a specific service mesh. This will allow us to experiment with multiple service meshes in the future.
  • GUI: A UI would be valuable to have; at the very least, some kind of deployment tracing showing what is happening under the hood during the alternative deployments is necessary.

Affected/Related Systems

  • Kubernetes deployment methods
  • Application delivery strategies from Teams

Current Design

Native Kubernetes Deployment Objects:

  • Rolling Update: A Rolling Update slowly replaces the old version with the new version. This is the default strategy of the Deployment object
  • Recreate: Deletes the old version of the application before bringing up the new version. This ensures that two versions of the application never run at the same time, but there is downtime during the deployment.
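For reference, both strategies are configured on the Deployment object itself. A minimal manifest sketch (names and values are illustrative) showing the Rolling Update knobs:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app              # hypothetical application name
spec:
  replicas: 4
  strategy:
    type: RollingUpdate     # the default; set to "Recreate" to stop all old pods first
    rollingUpdate:
      maxSurge: 25%         # extra pods allowed above the desired replica count
      maxUnavailable: 25%   # pods that may be unavailable during the update
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:v2
```

`maxSurge` and `maxUnavailable` are essentially the only levers the native strategy offers, which illustrates the limitations listed earlier.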

Proposed Design

The aspirational goal is to add extra deployment capabilities to the current Kubernetes cluster and, therefore, increase the agility and confidence of application teams by reducing the risk of outages when deploying new releases.

The main benefits would be:

  • Safer Releases: Reduce the risk of introducing a new software version in production by gradually shifting traffic to the new version while measuring metrics like request success rate and latency.
  • Flexible Traffic Routing: Shift and route traffic between app versions with the possibility of using a service mesh (Linkerd, Istio, Kuma...) or not (Contour, NGINX, Traefik...)
  • Extensible Validation: Extend the application analysis with custom metrics and webhooks for acceptance tests, load tests or any other custom validation.
  • Progressive Delivery: Alternatives deployment strategies:
  • Canary (progressive traffic shifting)
  • A/B Testing (HTTP headers and cookies traffic routing): Called experiments by Argo Rollouts, although Canaries could have specific headers too.
  • Blue/Green (traffic switching and mirroring)

Concepts

Blue/Green


A Blue/Green deployment has both the new and the old version of the application deployed at the same time. During this time, only the old version of the application receives production traffic. This allows the developers to run tests against the new version before switching live traffic over to it.

[Diagram: Blue/Green deployment]

Canary

A Canary deployment exposes a subset of users to the new version of the application while serving the rest of the traffic to the old version. Once the new version is verified as being correct, it can gradually replace the old version. Ingress controllers and service meshes, such as NGINX and Istio, enable more sophisticated traffic shaping patterns for canarying than what is natively available (e.g. achieving very fine-grained traffic splitting, or splitting based on HTTP headers).
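As an example of the weight- and header-based splitting just mentioned, the NGINX Ingress Controller implements canarying through annotations on a second Ingress pointing at the canary Service (host and service names here are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "25"          # 25% of traffic to the canary
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary" # header routing takes precedence over weight
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app-canary
                port:
                  number: 80
```

Tools like Argo Rollouts and Flagger automate updating exactly this kind of weight during a progressive rollout.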

[Diagram: Canary deployment shifting 25%/75% of traffic to the new version]

The picture above shows a Canary with two stages (25% and 75% of traffic goes to the new version), but this is just an example. Argo Rollouts allow multiple stages and percentages of traffic to be defined for each use case.
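A staged canary like that could be expressed with the Rollout CRD roughly as follows (names are illustrative; the selector and pod template are written exactly as in a Deployment):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 4
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:v2
  strategy:
    canary:
      steps:
        - setWeight: 25            # send 25% of traffic to the new version
        - pause: {duration: 10m}   # hold while metrics are observed
        - setWeight: 75            # then 75%
        - pause: {duration: 10m}
```

Any number of steps can be chained, and pauses without a duration halt the rollout until it is promoted manually.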

Tools

The two leading projects are Argo Rollouts and Flagger. Both are mature and widely used.

Argo Rollouts

Argo Rollouts is a Kubernetes Controller and set of CRDs which provide advanced deployment capabilities such as Blue/Green, Canary, Canary analysis, experimentation and progressive delivery features to Kubernetes. A UI is deployed to see the different Rollouts.

Argo Rollouts supports the two kinds of rollouts described above: Blue/Green and Canary.

Argo Rollouts offers experiments that allow users to have ephemeral runs of one or more ReplicaSets and run AnalysisRuns along those ReplicaSets to confirm everything is running as expected. Some use cases of experiments could be:

  • Deploying two versions of an application for a specific duration to enable the analysis of the application.
  • Using experiments to enable A/B/C testing by launching multiple experiments with a different version of their application for a long duration.
  • Launching a new version of an existing application with different labels to avoid receiving traffic from a Kubernetes service. The user can run tests against the new version before continuing the Rollout.

A/B testing could be performed using Argo Rollouts experiments.

There are several ways to perform analysis to drive progressive delivery:

  • AnalysisRuns are like Jobs, in that they eventually complete; the result of the run determines whether the Rollout's update will continue, abort or pause. AnalysisRuns accept templating, making it easy to parametrize analysis.
  • AnalysisRuns accept multiple data sources, such as:
      • Prometheus: query the application's metrics to detect whether the service's performance degrades during the deployment
      • CloudWatch: query AWS metrics to check that everything is fine during the deployment
      • Web: perform an HTTP request and compare the result against a JSON response
      • Job: execute a custom script that succeeds or fails
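As an illustration, a Prometheus-backed success-rate check could be declared roughly like this (a sketch; the query, threshold and address are made up, but the AnalysisTemplate structure follows the Argo Rollouts CRD):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate                      # hypothetical name
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95          # fail the run below 95% success
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090   # illustrative address
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",code!~"5.."}[2m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[2m]))
```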

Traffic Management

Observability

Migration

Pain Points:

  • RBAC & Authentication
  • Non-native integration: Argo Rollouts uses its own Rollout CRD rather than the native Kubernetes Deployment

Flagger

Flagger is part of the Flux family of GitOps tools. Flagger is pretty similar to Argo Rollouts and its main highlights are:

  • Native integration: it watches Deployment resources, so there is no need to manage them through a custom CRD
  • Highly extensible and comes with batteries included: it provides a load-tester to run basic or complex scenarios

When you create a deployment, Flagger generates duplicate resources of your app (including configmaps and secrets). It creates Kubernetes objects with <targetRef.name>-primary and a service endpoint to the primary deployment.

It employs the same concepts about Canary, Blue/Green and A/B Testing as Argo Rollouts does.

Observability

Pain Points:

  • No UI, so no RBAC or authentication is needed, but it's harder to get fast feedback on the current status of the rollouts. Checking the logs or the status of the Canary resources is the only way.
  • No kubectl plugin to check how a deployment is going; it's necessary to tail `kubectl logs -f flagger-controller` or run `kubectl describe` on the Canary resource to check the progress.
  • Documentation could be better.
  • Blue/Green is an adapted Canary (the same as a Canary but with 100% weight)

Questions

What happens if the controller is down?

Argo Rollouts

  • If there are Rollout changes while the Argo Rollouts controller is down, the controller will pick up the latest changes when it recovers; it will not resume from where the Rollout was.
  • If there is no new commit while the controller is down, the controller reconciles the status automatically. If the Rollout is in step 3 when the controller goes down, it will pick up from the same spot once it is back up.

Flagger

  • Like Argo Rollouts, it reconciles well enough.
  • The difference is that it resumes the configured steps, instead of applying the previous changes and then the latest ones.
  • New rollouts/deployments will be blocked while the controller is down, but the pods and HPA will remain up and running, even if it breaks in the middle of a rollout/deployment. Both controllers reconcile automatically after recovery.

What happens with the dashboards? Any changes?

Argo Rollouts

  • Although we don't have a Deployment resource, metrics from deployments won't disappear.

Flagger

  • Deployment resource is there, so no changes are expected.

What happens when a Canary is paused on the GUI or command line? Is the GitOps setup going to override the change?

Argo Rollouts

  • It can be done easily from the GUI or from the kubectl command line; ArgoCD will be notified of the RolloutAbort.
  • It can be retried easily from the GUI or via kubectl commands; ArgoCD will mark the Rollout as in progress.

Flagger

  • It doesn't seem possible to pause the deployment using the command line; the Flagger Tester API needs to be deployed.

What happens when a rollback occurs? What happens with the GitOps setup?

  • Argo Rollouts is integrated with ArgoCD and the progress of the Rollout can be seen from ArgoCD UI.
  • Flagger is not integrated with ArgoCD as seamlessly as Argo Rollouts; a bunch of resources are created and visible in the ArgoCD UI, but there is no feedback on the rollout's progress.

What happens in lower environments with a Canary deployment if there is not enough traffic?

Argo Rollouts

  • Argo Rollouts doesn't currently have a built-in way to run a load test directly but, as a workaround, webhooks can be used to launch a k6 load test, as seen in this issue in their project.
  • The load test has to be controlled externally; specifically, it must be stopped when the Canary reaches the required step.

Flagger

How does Canary traffic management work without a service mesh?

  • In the absence of a traffic routing provider, both options can handle the Canary weights using NGINX capabilities. Besides, both support SMI and offer a broad selection of service mesh integrations, so the choice of service mesh is not a blocker for either tool.
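With the NGINX ingress controller, for example, the Canary weights can be delegated to the controller with a trafficRouting block roughly like this (Argo Rollouts syntax; the ingress name is hypothetical):

```yaml
spec:
  strategy:
    canary:
      trafficRouting:
        nginx:
          stableIngress: demo-app-ingress   # existing ingress routing to the stable service
      steps:
        - setWeight: 25
        - pause: {}
```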

What happens when a configMap or secret used by the Deployment (as volume mounts, environment variables) are changed?

Argo Rollouts

  • There is no support for that in Argo Rollouts, but there is an open issue in their project.
  • A workaround is needed to keep rollout and rollback available when only a ConfigMap changes. The workaround consists of:
      • A random suffix in the ConfigMap name
      • Defining the ConfigMap and the Deployment in the same .yaml to avoid creating multiple random suffixes
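The workaround could look roughly like this (hypothetical names; the ConfigMap and the Rollout live in the same manifest, and the suffix is regenerated only when the ConfigMap content changes):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config-c0ffee          # random suffix, regenerated on content changes
data:
  FEATURE_FLAG: "true"
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-app
spec:
  template:
    spec:
      containers:
        - name: app
          image: example/app:1.0.0
          envFrom:
            - configMapRef:
                name: app-config-c0ffee   # must reference the suffixed name
```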

Flagger

  • Using the Helm annotation trick to automatically roll out deployments when the ConfigMap changes works well enough for a rollout. But for a rollback after the rollout, the same issue as with Deployments and ConfigMaps may appear, because there is only one ConfigMap, not multiple. That means the rollback workaround would have to be done in the same way as with Argo Rollouts.
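The Helm annotation trick mentioned above usually looks like this (a standard Helm pattern; the template path is illustrative): the checksum changes whenever the ConfigMap content does, which forces a new rollout of the Deployment.

```yaml
spec:
  template:
    metadata:
      annotations:
        checksum/config: '{{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}'
```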

To Sum Up

Both tools will help us achieve alternative deployment strategies, though there are some tradeoffs with each:

Argo Rollouts

Pros

  • Great UI, fast feedback
  • Great integration and feedback with ArgoCD, indicating if the Rollout is in progress
  • Easy integration with current Deployment resources
  • Documentation

Cons

  • RBAC and authentication need to be handled for the UI
  • Non-native integration: it uses its own Rollout CRD instead of the native Deployment resource

Flagger

Pros

  • Kubernetes native, doesn't introduce new Kubernetes resources
  • Loadtest integrated

Cons

  • No UI; feedback needs to be gathered through the K8s API
  • Zero feedback from ArgoCD; Flagger integrates better with Flux, based on their documentation
  • Documentation could be better
  • Main differences with Argo Rollouts:
      • Feedback only available through kubectl commands
      • Blue/Green is an adapted Canary (the same as a Canary but with 100% weight, as confirmed after some tests)

At Empathy, the tool we have chosen is Argo Rollouts. It fits our needs pretty well, offers faster feedback, has great integration with ArgoCD, and is open to more complex strategies.

What's next?

  • Choose your fighter and adapt the strategies to your applications. Some apps will likely fit better with a Blue/Green approach and others with a Canary approach.
  • Demo Session in lower environments.
  • Plan migration with Teams.
  • Capabilities could be improved in the future if/when a Service Mesh is added to the Platform.

References

]]>
<![CDATA[Tailwind + Vue No-code Editor]]>
]]>
https://engineering.empathy.co/tailwind-vue-no-code-editor/6368d303134943003dfca6e5
Mon, 07 Nov 2022 15:20:25 GMT

There has been a lot of talk about no-code solutions lately. This movement aims to reach non-developers by offering software development tools that allow them to create and modify applications without writing code. The benefits of no-code tools include speed, accessibility, reduced costs and autonomy.

Thinking about this idea, I wondered how to create a no-code editor for a web application. But, since a tool like this would be huge for a single post, I decided to focus only on the personalization of the styles and themes.

So, I chose to rely on one of the most popular CSS frameworks at the moment: Tailwind. Not because of its usual use, but for all the tools it has in terms of configuration and CSS generation.

The idea is to create a frontend interface which allows the Tailwind configuration to be modified in real time and shows the result styles applied. Then, this customized configuration could be stored and used in the build and deployment process of a hypothetical application.

However, in this article, we are going to focus only on the editor and how to achieve a real-time preview of the Tailwind config changes.

To do so, we are going to create a simple service in Node using ExpressJs. This service will receive the Tailwind configuration from the frontend editor and run PostCSS with the Tailwind plugin to generate the CSS. Finally, the service will return the generated CSS to the editor, which will update the page to show the changes.

We could try to run PostCSS and the Tailwind plugin directly in the browser using Node polyfills, which is how Tailwind Play does it internally. Another option would be to use WebContainers but, for simplicity's sake, we are going to run it in a simple Node service.

Creating the Project

Let’s create a new project called tailwind-editor with Vite running npm create. I’m going to use Vue for the frontend because I’m more comfortable with it and also because it is awesome 😉.

$ npm create vite@latest
✔ Project name: … tailwind-editor
✔ Select a framework: › Vue
✔ Select a variant: › JavaScript

Then, add the dependencies for the service.

$ cd tailwind-editor
$ npm install --save express cors postcss tailwindcss

The resulting package:

{
  "name": "tailwind-editor",
  "private": true,
  "version": "0.0.0",
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "vite build",
    "preview": "vite preview"
  },
  "dependencies": {
    "cors": "^2.8.5",
    "express": "^4.18.1",
    "postcss": "^8.4.17",
    "tailwindcss": "^3.1.8",
    "vue": "^3.2.37"
  },
  "devDependencies": {
    "@vitejs/plugin-vue": "^3.1.0",
    "vite": "^3.1.0"
  }
}

Creating the Tailwind CSS Service

Now, we are going to create the service that will receive the Tailwind config and return the resulting CSS.

Let’s start with the file src/tailwind-as-a-service.js, which contains the Express server with the cors middleware to support cross-origin calls. It listens on port 8080 and returns the text Hello World for any GET request to the root path.

// src/tailwind-as-a-service.js

import express from 'express';
import cors from 'cors';

const app = express()
app.use(cors());
const port = 8080

app.get('/', (req, res) => {
  res.send('Hello World!')
})

app.listen(port, () => {
  console.log(`Tailwind as a service listening on port ${port}`)
})

Running the server in node lets you check the response directly in the browser:

$ node ./src/tailwind-as-a-service.js

Since we are using Vite and ES modules, the minimum Node version required to follow this article is 14.18+.

So far, so good. Now we are going to configure postcss and its Tailwind plugin to return CSS:

// src/tailwind-as-a-service.js
......
import postcss from 'postcss';
import tailwindcss from 'tailwindcss';

......

const defaultCss = `
  @import 'tailwindcss/base';
  @import 'tailwindcss/components';
  @import 'tailwindcss/utilities';
`;

app.get('/', async (req, res) => {
  const configuredTailwind = tailwindcss({
    content: [{ raw: '<div class="bg-red-500">', extension: 'html' }]
  });
  const postcssProcessor = postcss([configuredTailwind]);
  const { css } = await postcssProcessor.process(defaultCss);
  res.send(css);
});

......

We just added the postcss and tailwindcss dependencies. Next, we have to configure the Tailwind plugin for PostCSS with the content option.

This option tells Tailwind to inspect the HTML, JavaScript components, and other files, to look for CSS classes to generate and include its CSS in the final result. It also allows us to write raw HTML inline.

After that, it is time to create a postcssProcessor with the configured Tailwind plugin, which is responsible for parsing the CSS and applying all PostCSS plugins.

Finally, we process a “fake” CSS file with the default base, components and utility styles of Tailwind. This is necessary to make Tailwind generate all necessary CSS.

The result CSS is returned in the response. So if we run the service with node ./src/tailwind-as-a-service.js again, and we request it from the browser, the resulting CSS will be shown:

Here, you can see the base CSS Tailwind provides by default and also the .bg-red-500 class at the end that we are passing as raw HTML to the Tailwind config.

So, we have a service to request and return the CSS, but how do we configure that CSS? Let’s make this service receive parameters and use them to configure the Tailwind plugin:

// src/tailwind-as-a-service.js
......
app.use(express.json()); // needed so req.body is populated from the JSON payload
......

const defaultCss = `
  @import 'tailwindcss/base';
  @import 'tailwindcss/components';
  @import 'tailwindcss/utilities';
`;

app.post('/', async (req, res) => {
  const configuredTailwind = tailwindcss({
    content: [{ raw: req.body.html, extension: 'html' }],
    theme: req.body.theme
  });
  const postcssProcessor = postcss([configuredTailwind]);
  const { css } = await postcssProcessor.process(defaultCss);
  res.send(css);
});

......

Here, we changed the .get method into .post to be able to send parameters in the request body, which is better suited to larger payloads. Note that the express.json() middleware has to be registered on the app so that req.body is populated.

Moreover, we pick the html and theme parameters from the request body and use them to configure Tailwind.

For simplicity, we are using only the theme part of the Tailwind configuration, but this approach allows any part of it to be configured.

Creating the Editor

We are going to define some custom configurations for the Tailwind theme, which users can then modify through an interface. Below are the defined values for a simple example of components: buttons and titles. To keep the scope small, for this example we only allow some colors and font properties to be changed:

// src/custom-tailwind-config.js

export const customTailwindConfig = {
  colors: {
    primary: {
      25: '#cdd3d6',
      50: '#243d48',
      75: '#1b2d36'
    },
    secondary: {
      25: '#bfe1ec',
      50: '#0086b2',
      75: '#006485'
    },
    success: {
      25: '#ecfdf5',
      50: '#10b981',
      75: '#065f46'
    },
    warning: {
      25: '#fffbeb',
      50: '#f59e0b',
      75: '#92400e'
    },
    error: {
      25: '#fef2f2',
      50: '#ef4444',
      75: '#991b1b'
    },
    title: {
      1: '#000',
      2: '#a3a3a3',
      3: '#000',
      4: '#0e7490'
    }
  },
  fontSize: {
    button: '1rem',
    'size-title1': '2rem',
    'size-title2': '1.5rem',
    'size-title3': '1.25rem',
    'size-title4': '1.125rem'
  },
  fontWeight: {
    'weight-button': '400',
    'weight-title1': '700',
    'weight-title2': '700',
    'weight-title3': '400',
    'weight-title4': '400'
  }
};
Keep in mind that we are using the Tailwind theme configuration for the sake of simplicity – it is not the only way to achieve this result. The whole Tailwind configuration could be overridden, including the plugins used and/or their configuration. For example, you could create your own Tailwind plugin and add all your CSS components based on a configuration passed to the plugin. This configuration could then be passed as a parameter to the Tailwind service, as we are doing here with the theme configuration.

The next step is to go to the App.vue component and remove all default content, then add some buttons and titles using the CSS utility classes that Tailwind generates with the theme configuration that was just defined:

// src/App.vue

<template>
  <section class="flex flex-col gap-10 min-w-[200px] m-10">
    <section class="flex flex-col gap-10">
      <button class="w-40 h-8 rounded bg-primary-50 hover:bg-primary-75 text-primary-25 hover:text-primary-25 text-button font-weight-button">
        Button Primary
      </button>
      <button class="w-40 h-8 rounded bg-secondary-50 hover:bg-secondary-75 text-secondary-25 hover:text-secondary-25 text-button font-weight-button">
        Button Secondary
      </button>
      <button class="w-40 h-8 rounded bg-success-50 hover:bg-success-75 text-success-25 hover:text-success-25 text-button font-weight-button">
        Button Success
      </button>
      <button class="w-40 h-8 rounded bg-warning-50 hover:bg-warning-75 text-warning-25 hover:text-warning-25 text-button font-weight-button">
        Button Warning
      </button>
      <button class="w-40 h-8 rounded bg-error-50 hover:bg-error-75 text-error-25 hover:text-error-25 text-button font-weight-button">
        Button Error
      </button>
    </section>
  
    <section class="flex flex-col gap-10 m-10">
      <h1 class="text-title-1 text-size-title1 font-weight-title1">
        Title 1
      </h1>
      <h2 class="text-title-2 text-size-title2 font-weight-title2">
        Title 2
      </h2>
      <h3 class="text-title-3 text-size-title3 font-weight-title3">
        Title 3
      </h3>
      <h4 class="text-title-4 text-size-title4 font-weight-title4">
        Title 4
      </h4>
    </section>
  </section>
</template>

Before running the dev script to serve the Vue project, we need to modify this script in package.json to parallelize its execution with the Tailwind service:

"dev": "node ./src/tailwind-as-a-service.js & vite",

Afterwards, we run the project and visit the locally-served URL:

$ npm run dev

We are using the latest version of Vite, so the default port is 5173:

Finally, we open that URL in the browser and… oops! No styles?! What’s going on? 🤯

The styles are not being applied because we are not using Tailwind directly in our Vue project as we normally would. Instead, we have to call the tailwind-as-a-service endpoint to retrieve the CSS. Okay then, let’s create the function to call the service.

We create a new file fetch-css.js in the src directory:

// src/fetch-css.js

export async function fetchCss(tailwindCustomConfig) {
  return await fetch('http://localhost:8080', {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      html: document.body.innerHTML,
      theme: {
        extend: tailwindCustomConfig
      }
    })
  }).then(response => response.text());
}

This async function receives the custom Tailwind config and fetches from the Tailwind service, passing it as the theme: { extend: tailwindCustomConfig } parameter. We pass it inside extend to keep all the default utilities Tailwind provides and only add the new ones we need. We also grab all the current HTML on the page and send it to the service as the html parameter. Tailwind will use this HTML to know which CSS classes to generate and which to skip.

The following step is to create a new component Editor.vue to use that function:

// src/components/Editor.vue

<script setup>
  import { onMounted, ref } from 'vue';
  import { customTailwindConfig } from '../custom-tailwind-config.js';
  import { fetchCss } from '../fetch-css.js';

  const css = ref('');

  async function getCss() {
    css.value = await fetchCss(customTailwindConfig);
  }

  onMounted(getCss);
</script>

<template>
  <component is="style">{{ css }}</component>
</template>

There are many things going on here. Let’s take a closer look:

  • We import the previous customTailwindConfig and the fetchCss function.
  • We add a css ref. If you are not already familiar with it, check out the new Vue Composition API documentation.
  • We create a new function getCss which calls the fetchCss and assigns the returned promise value to the ref value.
  • We use the onMounted Vue lifecycle hook to call the previous function whenever the component is mounted.
  • Finally, we create a dynamic component to attach the CSS to the DOM. This dynamic component will render the CSS inside a <style> tag, when the css ref is updated. That way, whenever we update the css ref value, the styles will be updated.
We use a dynamic component as a workaround because the <style> tag is not allowed inside the <template> tag by the Vue template compiler.

Now, we import and use the Editor.vue component inside the App.vue:

// src/App.vue

<template>

  <div class="flex flex-row">
		......

    <Editor/>
  </div>
</template>

<script setup>
  import Editor from './components/Editor.vue';
</script>


Ready to see the styles? Reload the URL and they will appear:

Finally, the last step is to make the Editor.vue modify the default theme from the Tailwind configuration and request the CSS again from the service to see a live view of the changes.

We are starting with the colors, adding a color picker for each color we want to configure.

// src/components/Editor.vue
<script setup>
  import { onMounted, reactive, ref, watch } from 'vue';
  ......

  const css = ref('');
  const editableCustomConfig = reactive(customTailwindConfig);

  async function getCss() {
    css.value = await fetchCss(editableCustomConfig);
  }

  onMounted(getCss);
  watch(editableCustomConfig, getCss);

</script>

<template>

  <div class="flex flex-col flex-nowrap gap-10 w-1/2  m-10">

    <h2 class="font-bold">Colors</h2>
    <section class="flex flex-row flex-wrap gap-10">

      <div v-for="(color, colorName) in editableCustomConfig.colors"
           style="display: flex; flex-flow: column nowrap;">
        <label v-for="(_, shadeName) in color">{{ colorName }} {{ shadeName }}
          <input type="color" v-model.lazy="color[shadeName]">
        </label>
      </div>

    </section>
  </div>

  <component is="style">{{ css }}</component>
</template>

In this step, we are making several changes to be able to modify the configuration reactively:

In the <script>:

  • First, we create a reactive object with our customTailwindConfig as the initial value.
  • Then we fetch the CSS with this reactive object.
  • Finally, we add a watch to call the getCss function whenever this reactive object changes.

In the <template>:

  • We add two loops v-for to iterate over each color and each shade, binding a color picker to the value.
  • We use the v-model directive with the lazy modifier so that not too many requests are made whenever we move the selector over the color picker.
Notice that we are binding the color shade to the v-model using the color and the shade name, instead of using the v-for variable directly. This is because the variable used to iterate in v-for loops cannot be modified; this workaround lets us modify the value indirectly.

If we run the application again, we can see the color pickers. Now, by changing a color, the component using that color will be updated automatically:

Finally, we configure font size and font weight:

// src/components/Editor.vue
<script setup>
......
</script>

<template>
......

    <h2 class="font-bold">Font Sizes</h2>
    <section class="flex flex-row flex-wrap gap-10">
      <label v-for="(_, sizeName) in editableCustomConfig.fontSize">{{
          sizeName.replace('size-', '')
        }}
        <input type="number"
               step="0.125"
               class="w-14 border border-black text-center"
               :value="editableCustomConfig.fontSize[sizeName].replace('rem','')"
               @input="event=> editableCustomConfig.fontSize[sizeName] = event.target.value + 'rem'">rem
      </label>
    </section>

    <h2 class="font-bold">Font Weight</h2>
    <section class="flex flex-row flex-wrap gap-10">
      <label v-for="(_, weightName) in editableCustomConfig.fontWeight">{{
          weightName.replace('weight-', '')
        }}
        <input type="number"
               step="100"
               min="100"
               max="900"
               class="w-14 border border-black text-center"
               v-model="editableCustomConfig.fontWeight[weightName]">
      </label>
    </section>
  </div>

  <component is="style">{{ css }}</component>
</template>

Here, we are repeating the same principle used for the colors, but with a single v-for for each case. Moreover, for the font size, the rem unit has to be stripped before displaying the value and added back before it is passed to the Tailwind configuration.

Alright, the moment you have been waiting for has arrived! This is what the editor looks like:

This editor is the starting point for creating your own no-code tool that allows you to configure your project and see the changes on the fly. Remember that using theme values is not the only way to make this configurable; you can also use all the Tailwind config options.

Why not CSS custom properties, AKA CSS variables?

Sure, we could achieve exactly the same result by using CSS variables as values in the Tailwind theme configuration. Modifying the values of these variables in the frontend would remove the need for any additional service running the Tailwind process on the fly. Then, after saving these variables and loading them in production, the changes would be deployed.
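That alternative would look roughly like this (a hypothetical sketch, not part of the editor built in this article): the theme values point at CSS custom properties, so updating a variable in the browser restyles components without re-running Tailwind.

```javascript
// Hypothetical Tailwind theme where every color value is a CSS custom property.
const themeWithCssVars = {
  colors: {
    primary: {
      25: 'var(--color-primary-25)',
      50: 'var(--color-primary-50)',
      75: 'var(--color-primary-75)',
    },
  },
};

// In the browser, changing the variable updates every component instantly,
// with no request to a Tailwind service:
//   document.documentElement.style.setProperty('--color-primary-50', '#0086b2');
```

Tailwind only generates the utility classes once; afterwards, the browser resolves the variables at paint time, which is exactly why no extra service is needed in this approach.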

But there’s a reason, promise!

Why

In this example, we are only modifying the theme part of the Tailwind configuration, but there are plenty more options in that configuration.

Imagine you created a Tailwind plugin which adds your own Design Components and CSS utilities. These plugins can have options, too.

So, with the solution laid out here, you can modify all these possible configurations and options and see the results immediately.

Also, you can use the service created to directly save the configuration in your database (by user or customer) to later retrieve it during the deployment process or for any other purpose you need.

Ready to get started? All the working code is available here. Please, feel free to open issues and give feedback. Also, if you find it useful, a star would be much appreciated 😉.

]]>
<![CDATA[Building a Future-proof Developer Documentation Site]]>
]]>
https://engineering.empathy.co/building-a-future-proof-developer-documentation-site/633ae288688e8e003daa7c47
Tue, 18 Oct 2022 06:23:08 GMT

Motivation

As time goes on, lots of things become outdated sooner or later, and documentation is no exception. But when documentation becomes outdated, it automatically loses its essence – to be consulted for valuable information. In modern software development, where Application teams are continuously adding features to their services and Platform teams are continuously evolving the platform and infrastructure to accommodate all the workloads, it is extremely easy for documentation to get out of date in just a couple of weeks.

This is something the Empathy.co Platform Engineering team experienced firsthand when we realized there was an issue with how Development teams learn how to work with the platform provided for them. We noticed that there wasn't a clear path for everyone to follow when using the platform tools. While we had some documentation available that we felt was sufficient, we realized it wasn't as complete or as maintainable as we thought. So, we had to find a solution.

Design Thinking

The following items were identified as the main pain points of the problem:

  • The documentation was spread across various places and different tools.
  • Some documentation was out of date, meaning it was no longer useful and was sometimes confusing to readers.
  • With the fast growth of the company, new colleagues didn't even know where various pieces of documentation were located.
  • The overall process was unmaintainable and not future-proof.

Scope

There are several types of documentation, and not all of them should be handled the same way, nor do they have the same lifecycle. Since not every role at the company has the same skill set – such as fluently writing Markdown documents or working with Git – we concentrated our efforts specifically on technical documentation. That way, the skill set of those interacting with the documentation was more clearly defined.

More specifically, we focused on the documentation about the Internal Development Platform (IDP, for short) that Platform Engineering needs to share with the Application teams, like CI/CD processes, monitoring, logging, etc. Since the nature of the IDP is to help developers by using certain tools and practices, we like to refer to this documentation as Developer Workflow (DW).

Centralization

Aiming to centralize the technical documentation in a single place and tool, we decided to deploy Backstage to the platform. Backstage is a tool developed by Spotify that, among other features, enables the creation of documentation sites out of the Markdown documentation stored in a Git repository, using MkDocs.

Hosting the documentation in Git repositories also provides the benefits of using Git, like versioning, tracking, code reviewing, etc.

Content Up To Date

Unfortunately, there's no tool or magic wand that keeps the documentation up to date. It is a continuous effort, so it should be included in teams' ways of working. From our experience, it is very helpful to take into consideration the time it will take to create or update the documentation when estimating a given task. A task shouldn't be marked as finished if the documentation hasn't been reviewed and updated.

In this specific case, it was hard work reviewing all the documents that needed to be updated, created, or deleted accordingly. It is always better to keep the documentation up to date by making small updates frequently, rather than updating a large amount of content infrequently.

Visibility

As stated above, one of the main pain points of the previous solution was that the stakeholders weren't aware of where to find certain articles within the documentation. Adding Backstage helped centralize all the technical documentation into a single place, but the problem wasn't yet solved. Backstage was still unfamiliar to some people, mostly those who recently joined the company. We also received feedback about the search experience not working as well as we had hoped, so we decided to create a separate site to host only the Developer Workflow documentation and reduce all the surrounding elements that could prevent users from finding the necessary documents.

We didn't want to split the documentation again, so we decided to keep it in Backstage and the standalone website, using the exact same source of truth for both to avoid duplicating the content. This part of the solution will be explored more in-depth, later.

Maintainability

Documentation evolves over time, and as part of that evolution, things are likely to break. A few of the issues that can occur are:

  • Broken hyperlinks
  • Tables of contents not matching page sections
  • Navigation bars showing items that no longer exist or not showing items that should be shown

All of these items were successfully addressed just by using a few MkDocs plugins, which will be explained in more detail in the next section.

Not only does the content have to be maintainable, so does the overall process. In order to account for this in the design, it is key to keep the logic as simple as possible (following the KISS Principle), so teams managing the process can easily update the documentation, keep the process up to date, and evolve the whole solution accordingly. Keeping it as simple as possible also makes it easier to transfer ownership to another team and accelerate the onboarding process of new team members, so they understand the workflow and how to support the solution.

The Solution

Now that the context and requirements are clear, it is time to look at how to implement them into the solution. It mainly consists of a series of agreements in the way of working, in conjunction with a series of best practices and MkDocs plugins.

Documentation Source: Git Repository

Since Backstage aggregates all the MkDocs documentation sites from several Git repositories into a single documentation portal, we decided to treat the Developer Workflow documentation as just another component in Backstage. The repository also contains the Terraform code to build a simple CloudFront + S3 static web hosting setup and a simple GitHub Action to build the DW standalone website. This enabled us to serve the exact same documentation in two places, fed by the same source.

Focus on Maintainability: MkDocs Plugins

All documentation structure, navigation bars, and page-linking related items were addressed using MkDocs theme features and additional MkDocs plugins.

Since Backstage uses a slightly modified version of MkDocs Material Theme, we had to explore what theme features were compatible with Backstage and used the following:

mkdocs.yaml

theme:
  name: material
  features:
    - navigation.sections
    - navigation.expand
    - navigation.indexes
    - navigation.top
  • navigation.sections: Uses the file structure of the docs folder to automatically generate the navigation sidebar. This prevents an out-of-date sidebar, as it is rendered directly from the content.
  • navigation.expand: Automatically expands all the sections from the navigation sidebar. The motivation for this was to give the user a bird's-eye view of all the documents at once, reducing the need to search – except for when looking for more specific content.
  • navigation.indexes: Enables using the section titles as pages, which is great for overview pages, offering a general view of what the section is about.
  • navigation.top: Adds a "Back to top" button when the scroll is not at the very top of the page, which is especially useful when navigating large pages.

In addition to the MkDocs Material theme features, we also used the following MkDocs plugins. Note that to make it work both in Backstage and standalone, the plugins must be installed in Backstage, as well.

Note that the order in which the plugins are listed is the order in which they are executed when building the documentation site. This is important to keep in mind when adding or removing them, as the results may vary.

mkdocs.yaml

plugins:
  - techdocs-core
  - search
  - awesome-pages
  - macros
  - glightbox:
      touchNavigation: false
      loop: false
      effect: zoom
      width: 100%
      height: auto
      zoomable: true
      draggable: false
  - webcontext:
      context: catalog/default/component/developer-workflow
  - alias
  • techdocs-core: A Backstage built-in plugin, it adds the basic documentation site functionality to Backstage. The rest of the plugins and extensions must be compatible with this one in order for all of them to work properly. The plugin also needs to be installed in the standalone version of the DW documentation site to make sure all the features used are compatible with Backstage.
  • search: A built-in MkDocs plugin that adds indexing and search capabilities to the site.
  • awesome-pages: Enables the creation of a .pages file inside a folder to tune folder-specific settings like navigation, sorting, hiding, etc. By default, section names match their folder names; we mostly use the plugin to give sections friendlier titles. For example, a folder called docs/04-ci-cd can be rendered as CI/CD - Build and Deploy just by adding a .pages file with the following content:

.pages

title: CI/CD - Build and Deploy

This makes it possible to use the name of the folder for sorting, but with a more user-friendly name for the sections.

  • macros: A plugin that enables using Jinja2 templating and variables within MkDocs Markdown pages, which is great for automating the rendering of certain blocks within the documentation.

For example, let's say there's a reference to a team email address in several pages of the documentation. If the email address changes for some reason, it would have to be updated on all the pages where it appears. With the MkDocs macros plugin, it can be set as a variable in the extra section inside mkdocs.yaml and be expanded as many times on as many pages as needed.

mkdocs.yaml

extra:
  team_email_address: [email protected]
  slack_handler: @myteam

contact-info.md

Contact us:

-   Email: **{{ team_email_address }}**
-   Slack Handler: **{{ slack_handler }}**

That's just a simple example, but it can be extended to other uses. For example, it could be used to type all the links to the tools of the Internal Development Platform once and then refer to those links from multiple pages. Additionally, it could be used to generate a page with all the links, for faster access.

mkdocs.yaml

extra:
  tool_urls:
    tool_one:
      test: https://tool-one.test.company.org
      stage: https://tool-one.stage.company.org
      prod: https://tool-one.prod.company.org

    tool_two:
      test: https://tool-two.test.company.org
      stage: https://tool-two.stage.company.org
      prod: https://tool-two.prod.company.org

tools.md

{% for tool in (tool_urls | sort) %}
### **{{ tool | replace("_"," ") | upper }}**

{% for environment in (tool_urls[tool] | sort) %}
- **{{ environment | replace("_"," ") | upper }}**: {{ tool_urls[tool][environment] }}
{% endfor %}

{% endfor %}

The rendering of the above would look like this:

  • glightbox: A plugin that adds basic zoomable image functionality, a feature that MkDocs doesn't provide by default.
  • webcontext: A plugin that helps maintain compatibility with Backstage. As Backstage renders a documentation site for each of the registered components or services, it generates links relative to the component base URL ${BACKSTAGE_URL}/catalog/default/component/${COMPONENT_NAME}, causing links between pages to break on one of the two websites. The plugin is configured to match the Backstage context, and is dynamically set to / as part of the automation that deploys the standalone website.
  • alias: A plugin that makes it possible to use aliases for each documentation page. This hardens the internal links between pages, as the links point to an alias or to a relative path in the filesystem, instead of to the fully qualified URL. Using this plugin, when a file is renamed or moved to a different folder, the internal links don't break.
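To illustrate how the alias plugin is typically used (the alias name here is hypothetical, and the exact front-matter keys should be checked against the mkdocs-alias-plugin documentation), a page declares an alias in its YAML front matter:

```markdown
---
alias: ci-cd-overview
---

# CI/CD - Build and Deploy

This page gives an overview of the build and deploy process.
```

Other pages can then link to it with `[[ci-cd-overview]]`; if the file is later renamed or moved to another folder, the link keeps working.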

Lastly, let's look at all of the Markdown extensions used. All of them are included in the techdocs-core plugin. Almost all of them use the default behaviour; we just set them explicitly in the mkdocs.yaml to avoid misbehavior if a default value changes at some point.

mkdocs.yaml

markdown_extensions:
  - admonition
  - pymdownx.highlight:
      # Enables linking to a specific line in a code block.
      anchor_linenums: true 
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - pymdownx.superfences
  - attr_list
  - pymdownx.emoji:
      emoji_index: !!python/name:materialx.emoji.twemoji
      emoji_generator: !!python/name:materialx.emoji.to_svg
  - def_list
  - pymdownx.tasklist
  - pymdownx.details
  - footnotes
Check the full list of PyMdown extensions in the official documentation.

Working Agreements

In order to support the solution with minimal effort, we decided to establish some practices around the Developer Workflow documentation. The following are some of the most important practices that should be checked as part of a code review:

  • All folders must contain an index.md file with an overview of the whole section.
  • Avoid using lots of images. If something can be described with text, it should be. This may feel counterproductive, but keeping images and screenshots up to date in GitOps documentation is a tough task.
  • Diagrams should be in SVG format. SVG files are written as text, can be embedded directly in MkDocs, and can be raw-edited without keeping a separate file that requires exporting to an image format.
  • All documentation pages must have an alias and all links between documentation pages must point to the page alias.
  • All the DW documents must be of interest to the users of the platform. Documents that are more in-depth and intended for Platform Administrators must be hosted in a different repository and served through Backstage.

Preview Documentation Changes in Localhost

When it comes to previewing how rendered Markdown files will look, working with Backstage is pretty difficult. Even though there are various tools and extensions that enable previewing Markdown files from the IDE, they are not capable of rendering with the exact same plugins that MkDocs is going to use when building the site. Working directly with MkDocs, on the other hand, and knowing that all the plugins, features, and extensions are compatible with Backstage, makes it quite simple to render the site on localhost, even with real-time updates. To do so, use the official MkDocs Material Docker image as the base image and add the plugins on top of it.

Dockerfile

FROM squidfunk/mkdocs-material:8.4.2

RUN python -m pip install --upgrade pip \
    && pip install mkdocs-alias-plugin==0.4.0 \
       mkdocs-awesome-pages-plugin==2.8.0 \
       mkdocs-macros-plugin==0.7.0 \
       mkdocs-techdocs-core==1.1.4 \
       mkdocs-webcontext-plugin==0.1.0 \
       mkdocs-glightbox==0.1.7

Docker Build and Run commands

# Docker Build
docker image build -t mkdocs-material-local .

# Docker Run
# Run this command inside the folder containing the mkdocs.yaml
docker container run --rm -it -p 8000:8000 -v ${PWD}:/docs mkdocs-material-local

The next step is to fire a browser up and navigate to http://localhost:8000. The rendered site should appear, and it should be automatically updated as the files are updated.

Deploying to Production

All roads come to an end and, in this case, the road ends when the Developer Workflow documentation site is deployed to a Production environment, where its users can access it. We use a simple CloudFront + S3 setup, with a Lambda@Edge function that authenticates visitors through their Google Workspace login, but that's out of the scope of this post. Since there are lots of different ways to host a static website, just choose the one that you are most comfortable with or that is best for your organization.

In order to also keep the deployment as simple as possible, we just have a GitHub Actions workflow that performs four main steps:

  1. Build the Docker image with MkDocs Material and its plugins.
  2. Tune the mkdocs.yaml file a bit for the standalone deployment; for example, adjusting the webcontext plugin.
  3. Run a Docker container using the just-built MkDocs Material image and invoke the build command, which renders the website and generates the static site in a folder named site.
  4. Copy the contents of the site folder to the static web hosting – in our case, the Amazon S3 bucket. Here, we also create a CloudFront invalidation for the distribution, to prevent users from waiting until the CloudFront cache expires.

Github Workflow: publish-website.yaml

name: Publish Website
on: [push]

jobs:
  cicd:
    runs-on: [self-hosted]
    # Omitted for readability

    steps:
    - name: Checkout
      uses: actions/checkout@v3

    - name: Build docker images
      run: docker image build -t mkdocs-material-custom:local .

    - name: Parse mkdocs.yaml
      run: |
        # Install yq tool
        wget https://github.com/mikefarah/yq/releases/download/v4.21.1/yq_linux_386 -O ./yq && chmod +x ./yq
        # Use default webcontext and disable use_directory_urls to make it work with CloudFront
        ./yq '.plugins[select(. == "webcontext")].[].context = "/" | .use_directory_urls = false' -i mkdocs.yaml

    - name: Build mkdocs site
      run: docker container run -v $PWD:/docs mkdocs-material-custom:local build

    # Omitted for readability

    - name: Deploy
      if: github.ref == 'refs/heads/main'
      run: |
        aws s3 sync site s3://xxxxxx.company.org --delete --cache-control max-age=3600
        aws cloudfront create-invalidation --distribution-id XXXXXXXXXXXXXX --paths '/*'

The following are screenshots of the real result, so you can see how DW documentation looks both as a standalone MkDocs website and when integrated into Backstage.

References

]]>
<![CDATA[Intro to CSS Grid Layout]]>What is Grid Layout?

CSS Grid Layout is a two-dimensional grid system. It is a CSS language mechanism created to place and distribute elements on a page, which is one of the most problematic processes in CSS. It is a standard, which means that you don't need anything

]]>
https://engineering.empathy.co/intro-to-css-grid-layout/633d369c688e8e003daa7f08Thu, 06 Oct 2022 05:30:00 GMTWhat is Grid Layout?

CSS Grid Layout is a two-dimensional grid system. It is a CSS language mechanism created to place and distribute elements on a page, which is one of the most problematic processes in CSS. It is a standard, which means that you don't need anything special for the browser to be able to understand it. There are no limits to using it: wherever you can develop your CSS to define the style, you can use Grid Layout to apply a grid.

Getting Started

To get started you have to define a container element as a grid with display: grid, but before you do that, it is important to be familiar with certain concepts:

  • Grid container: The parent of all the grid items, it is the element on which display: grid is applied.
  • Item: Each child contained in the grid.
  • Grid line: Horizontal or vertical grid cell separator.
  • Grid cell: Minimum unit of the grid.
  • Grid track: Horizontal or vertical band of grid cells.
  • Grid area: Set of grid cells.

The Basics

There are four basic Grid properties you need to know, in order to make use of it easily and simply: display, columns and rows, fr unit, and gap. Let’s take a look at each of them to understand their role.

Display

This property defines how the grid container is positioned relative to its surrounding content. There are two possible values:

The inline-grid value positions the grid inline, to the left or to the right of the surrounding content, while the grid value positions it as a block, above or below the surrounding content.
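A minimal sketch of the two values (the class names are just for illustration):

```css
.block-grid {
  display: grid;        /* the grid container behaves as a block */
}

.inline-grid-example {
  display: inline-grid; /* the grid container flows inline with surrounding content */
}
```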

Columns and Rows

These properties are used to create the grid by sizing the rows and columns.
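A sketch of such a grid (the selector name is chosen for illustration):

```css
.container {
  display: grid;
  grid-template-columns: 200px 100px; /* two columns: 200px and 100px wide */
  grid-template-rows: 50px;           /* rows are 50px tall */
}
```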

In this example, the grid defined will have two columns. The first will have a size of 200px and the second one will measure 100px; both will have a row size of 50px.

Fr unit

Grid has a special unit of dimension, fr (like pixels or percentages) which represents a fraction of the remaining space in the grid. Instead of using pixels in the example, let’s use fr.
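Rewriting the same columns with `fr` could look like this (illustrative selector again):

```css
.container {
  display: grid;
  grid-template-columns: 2fr 1fr; /* the first column takes twice the free space of the second */
  grid-template-rows: 50px;
}
```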

Now, the two columns will occupy all the available space, with the first one occupying double the amount of the second.

Gap

Sometimes, columns and/or rows need to have a gap between them. The column-gap and row-gap properties can be used, or simply condensed into gap, as shown below.
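For example (the gap sizes are illustrative):

```css
.container {
  display: grid;
  grid-template-columns: 2fr 1fr;
  row-gap: 10px;    /* space between rows */
  column-gap: 20px; /* space between columns */
  /* equivalent shorthand: gap: 10px 20px; */
}
```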

Generators

Sometimes, when there are many elements that have to be positioned inside the page, it can be difficult to visualize what they will look like or the best way to do it. That is where generators come in handy. There are lots of tools available, but here is a selection of three useful grid generators (in no particular order):

CSS Grid Generator

This generator is an open source project that is very easy to use – perfect for beginners who are learning how to use Grid.  With CSS Grid Generator, just specify the number of rows, columns, and gaps across rows and columns, and it will provide the proper CSS class ready to copy to your clipboard.

LayoutIt

Like the previous one, LayoutIt is also an open source project. Things to note about this generator are that it allows code to be exported to CodePen and allows for resizing columns and rows using percentages (%), fractions (fr), and pixels (px).

Griddy

From the information available in many articles, Griddy seems to be the generator that is most valued and used by developers, in general. Like the previous tool, it allows different resizing units to be used but is a little more difficult to use than CSS Grid Generator and LayoutIt.

Grid vs Flex

The main difference between Flex and Grid is that Flex uses one dimension, while Grid uses two dimensions. This means that with Flex, the positioning of the elements can be defined on either the horizontal or the vertical axis. Grid, on the other hand, allows the ability to work in two dimensions, meaning it is possible to position the cell both vertically and horizontally.

Highlights of Flex

  • Best choice for aligning the content between elements.
  • Positioning elements horizontally or vertically.
  • Good at positioning the smallest details of a layout.

Highlights of Grid

  • Makes responsive layouts easier to implement.
  • Compatible with all modern browsers.
  • Positioning elements in two dimensions.


Does this mean that Grid is better than Flex?


Nope! Both are good tools, but one will usually be more suitable than the other, depending on the situation. In any case, it is best to learn how to use both and combine them. Now, you have a guide to using Grid, so go ahead and get started!

]]>
<![CDATA[Image & Color Recognition: A Practical Example]]>We’ve all heard about Image Recognition and how it impacts our day-to-day activities. From the impressive automated driving software that some automobile companies have implemented in their vehicles, to the cool smartphone app that tells us what kind of bird we are staring at – most of us

]]>
https://engineering.empathy.co/image-color-recognition-a-practical-example/632b1965d126f3003d5588baTue, 27 Sep 2022 06:30:00 GMT

We’ve all heard about Image Recognition and how it impacts our day-to-day activities. From the impressive automated driving software that some automobile companies have implemented in their vehicles, to the cool smartphone app that tells us what kind of bird we are staring at – most of us know what the concept of Image Recognition is, more or less. Of course, it might seem like a very complex field (and, well, it is actually tricky sometimes), but we can also do some interesting experiments that will show us how this is, indeed, a very promising area to try out.

Given how broad the field is, the experiment that we are going to perform and assess is focused on a smaller area within it: Color Recognition. The goal is to create an efficient process that can extract the dominant colors from any given picture.  

Color Recognition: The Basics

This area is one of the most explored in computer vision, as it is much easier to collect and classify data, and it is pretty straightforward to test results. It is not difficult for an average person who does not experience colorblindness to see an image of a jacket and identify the main colors present on it; nor is it for a computer to do the same, given the appropriate solution.

Of course, nowadays there are already professional solutions on the market that tackle this problem in a very complete way, like the Google Vision AI project or the Amazon Rekognition service. But, in this case, we are going to build our own, homemade Color Recognition system with Python and some added libraries.

The Experiment

First of all, it is important to know what we want to obtain: a system that can take an image, process it, and return a set of the four main colors. Plus, we want to do it in the simplest and most efficient way possible.

Now that we have an overview of Color Recognition and a clear idea of what our objectives are, we can start with the implementation itself. To illustrate the experiment, let's go through a guided, step-by-step example.

Step 1: Process the image to be analyzed

The first step of the experiment is to “read” the image. Use the Numpy and OpenCV libraries to translate the picture into a matrix of data corresponding to the colors in RGB code, assigning each pixel a value between 0 and 255.

    import cv2
    import numpy as np

    # `file` is the input picture, loaded beforehand (e.g. with PIL)
    open_cv_image = np.array(file)
    open_cv_image = open_cv_image[:, :, ::-1].copy()  # RGB -> BGR, OpenCV's native order
    img = cv2.cvtColor(open_cv_image, cv2.COLOR_BGR2RGB)  # back to RGB for processing

The matrix obtained will be similar to this one, which has been shortened for obvious reasons:

Excerpt of the resulting data matrix

Step 2: Resize the picture

Next, it is time to resize the image using the OpenCV library. At a low level, this step means converting the big data matrix into a smaller one. Reducing the number of pixels in the image avoids introducing noise into the color processing phase.

Moreover, the fewer pixels there are in the final image, the less time the color recognition process will take. Of course, this has some drawbacks, the most significant being that having fewer pixels can also reduce the accuracy of the recognition. In order to find the best ratio for the resizing process without compromising that accuracy, iterate: try several sizes, measuring the time taken for the whole process and the quality of the colors obtained at each one.

For this example, several sizes of the pixel matrix were tested (300, 200, 100, 50, 35, 25, and 10). The ideal size for the data matrix turned out to be 35, which is important to keep in mind for the next steps.

Representation of the resizing process

Step 3: Apply K-Means clustering

The analysis of the image and subsequent color classification is performed with a technique called K-Means clustering. K-Means is a non-supervised algorithm that aims to partition n objects into k groups (called clusters). While there are more specific techniques to get a more accurate number of clusters, for the sake of simplicity, they won’t be explored in this article.

Now that the image has been resized to the proper measurements, it is time to use the SKLearn library to get to the most interesting part of the clustering process. In this case, a total of four clusters will be used, because that is the number of dominant colors to be extracted from the picture.

    from sklearn.cluster import KMeans

    # use k-means clustering
    k_means = KMeans(n_clusters=4)  # cluster number (specified by hand)
    k_means.fit(pixels)

Step 4: Extract colors

Once the K-Means algorithm has finished computing, the results have to be extracted using the Numpy library.

    # extract color codes in RGB from the centroids of k-means algorithm
    colors = np.asarray(k_means.cluster_centers_, dtype='uint8')
    hex_colors = rgb_to_hex(colors)
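The `rgb_to_hex` helper used above isn't shown in the article; a minimal sketch (assuming RGB triplets in the 0–255 range) could be:

```python
import numpy as np

def rgb_to_hex(colors):
    """Convert an array of RGB triplets (0-255) into hex color strings."""
    return ['#%02X%02X%02X' % (r, g, b) for r, g, b in colors]

# Centroids similar to the ones K-Means could return for the example image
centroids = np.asarray([[223, 217, 216], [60, 66, 75]], dtype='uint8')
print(rgb_to_hex(centroids))  # ['#DFD9D8', '#3C424B']
```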

After some refactoring, the result of this step is an array containing the four most dominant colors of the image analyzed:

['#DFD9D8', '#3C424B', '#8CA0AA', '#A46C5D']

The only thing left to do is check the color codes! It looks like a very good match between the picture analyzed and the dominant colors extracted:

Final result composition with the image analyzed and its dominant colors

Conclusion

That’s it! With those four simple steps, we have created our own Color Recognition system. Of course, the application and different uses will vary, depending on the project at hand, but there is a clear approach to follow.

Here at Empathy.co, our goal is to create captivating Search experiences for shoppers and Image Recognition can play an important role in doing so. A perfect example of this is the ability to filter by color that a lot of online stores have. The data required for those filters to work comes (usually, but not always) from a system like the one we created in this experiment. There are also more advanced uses for this, like automated tag systems that label products based on their images and assign them to specific categories. This is a very interesting field, for sure, and now you’re ready to dive into it!

]]>