It’s quite well-known that it can be a struggle to work with. Some people even avoid all frontend development just to stay away from CSS.
But, as I’m starting to grow in my career, I’m becoming a member of the first (and smaller) group – the one that loves CSS.
I learned that CSS is quite easy to use when you learn the ✨ basics ✨ correctly and don’t rush into advanced things.
So, if you are still struggling with it, you may need to consider doing some basic training or even practicing by playing some “silly” games like Flexbox Froggy or CSS Diner. Believe me, you’ll learn to center a div!
Once concepts like cascade, specificity, selectors, flex, grid, and so on stop being a mystery to you, you’ll probably be ready to dive into the cool stuff without your mental health suffering along the way.
The first two or three times, I found myself stuck sitting in front of a newly assigned task that required a transition or an animation, looking at the designs and thinking “How the hell am I going to do that?!” But after working on a few more, I came to realize that it is all about the mental model.
Developing an animation starts way before you start writing the code. Your first step should always be to observe the result you want and try to find the different behaviors and the HTML + CSS resources needed to achieve it. You’ll probably need to divide and conquer, in order to see everything clearly.
This way, when the “spinning ball” becomes an “svg or div with a rotate transform” and your “dropdown effect” becomes a “container with a height or max-height transition,” you’ll be ready to go! Of course, if you don’t already, you’ll need to find out what you can and cannot do with CSS.
Be sure to take a look at concepts like:
- transform
- transition
- animation
- @keyframes

You’ll need them to do all the fancy things. Once you are aware of what you can do with them, you’ll see that, basically, what you can do is put together a series of transformations. So you’ll have to observe how each individual element of the animation interacts and what transformations it undergoes.
It might take a while for it to click in your brain and start seeing things this way. In the meantime, let’s take a look at a few examples developed in the Empathy Platform Docs portal.
A good example of a simple animation can be found on the 404 error page. Can you guess what is going on and how this animation works? Don’t worry if you don’t see it immediately!
First comes the HTML:
<div class="edoc-error-page__diagram">
<p class="edoc-error-page__text edoc-error-page__text--404">404</p>
<p class="edoc-error-page__text edoc-error-page__text--error">ERROR</p>
<Bubble class="edoc-error-page__bubble edoc-error-page__bubble--green"/>
<Bubble class="edoc-error-page__bubble edoc-error-page__bubble--blue"/>
<Bubble class="edoc-error-page__bubble edoc-error-page__bubble--pink"/>
<Bubble class="edoc-error-page__bubble edoc-error-page__bubble--yellow"/>
<Bubble class="edoc-error-page__bubble edoc-error-page__bubble--orange"/>
</div>

As you can see, in this case, each of the bubbles is an individual element. “Bubble” is an imported SVG, but this could be done with divs as well!
The second part is to define the animation for each of the balls. For this, keyframes come in handy:
@keyframes rotation {
from {
transform: rotate(0deg);
}
to {
transform: rotate(359deg);
}
}

Assigning the animation to each bubble with the ‘animation’ property makes everything start spinning:
&--blue {
top: rem(-10px);
left: 38%;
animation: rotation 4s infinite linear;
circle {
r: 20;
}
}
&--green {
top: rem(40px);
left: 28%;
transform: rotate(45deg);
animation: rotation 3.5s infinite reverse linear;
circle {
r: 18;
fill: $color-green;
}
}

But, as you can see, there are a few more things going on here, so let’s check everything:
Too much to begin with? Maybe an even simpler example could help:
An even simpler animation is located in the no-results page of the Holons Search Experience. Didn’t see anything? Stare at that sad, disappointed Holon for a few seconds… It blinks! Can you guess how this one works?
Let’s start with the HTML again:
<div class="no-results__image-wrapper">
<NoResultsEyesIcon class="no-results__image no-results__image--eyes"/>
<NoResultsMouthIcon class="no-results__image no-results__image--mouth" />
</div>

The face is separated into two SVGs, one for the eyes and another for the mouth. This allows us to easily animate the eyes while leaving the mouth alone.
Secondly, we have a keyframes animation for the blink:
@keyframes blink {
45%, 55% {
transform: scaleY(1);
}
50% {
transform: scaleY(0.1);
}
}

In this case, instead of rotating, we are playing with the vertical scaling. That way, the eyes narrow for a brief moment (the blink starts at 45% and finishes at 55% of the animation).
Finally, the animation is given to the eyes SVG:
&--eyes {
animation-name: blink;
animation-duration: 6s;
animation-timing-function: ease;
animation-iteration-count: infinite;
}
That said, you’ll probably need to struggle with a few animations before you start to understand their inner workings quickly, so don’t despair! You’ll get there!
Even once you get to the point where you immediately realize how everything works and how to approach it, implementing something beautiful will still take time. You’ll probably never be able to create a breathtaking animation in five minutes; it takes time!
It’s very possible that you’ll end up facing a trade-off between the time you can dedicate to it and how smooth you want the animation to be. Just remember that this is normal; you’re not going to perform a miracle in limited time, so don’t panic and take your time (if you have it)!
At Empathy.co, we have been making heavy use of search engines such as Solr and Elasticsearch for years. They are key types of software in the products we build for our platform and our clients. They enable fast and effective storage and retrieval of our clients' catalogue products, simply and with little overhead.
A couple of years ago, one of our clients requested that we speed up the retrieval of their historical purchase data. They had been storing the data without considering its retrieval much. Since we had already been using Elasticsearch for this client, we promptly designed a system for ingesting and finding purchase history data, thus Purchase History Search as a separate project was born.
The challenge of Purchase History Search was ingesting a huge volume of purchase data in real time and making the data easily searchable. The introduction of the time dimension to the data is a key difference from catalogue search. Moreover, the history of purchases is private data, so creating a privacy-first system was essential.
Elasticsearch works with indices that could be roughly compared to standard database tables. Each index is further divided into shards, which are essentially inverted indices that enable the efficient search of data using the terms contained within it. These shards can live in different nodes (machines) so that the index operations can be distributed and parallelized. A typical Purchase History cluster consists of three or five master nodes that manage the entire cluster, its metadata, and operations, and about six data nodes where the index shards actually reside.
At Empathy.co, for our retail clients, we usually build a new index every time there is a new catalogue using a batch or on-demand process. For Purchase History, we have a continuous stream of purchase documents going into a feed, so the mechanism changes. In this case, we use a different service called the Streaming Indexer.
The Streaming Indexer is a scalable and stateless application that sits between the client's data feed and our Elasticsearch cluster. It fetches documents in real time and indexes them to a specific index. As the data is presented in a time-series style, it is only natural to have indices consist of a subset of the purchases. In our case, we decided to go with a monthly index. To further optimise the data storage and retrieval, it is standard for each document to have some kind of customer ID field. This ID can be employed to always route the document to the same index shard. When a shopper goes through their history, a single shard needs to be hit. It is like directing your search to a single drawer in your wardrobe.
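The routing idea can be sketched in a few lines. This is only an illustration of the principle, not Elasticsearch's actual implementation (internally it hashes the routing value with murmur3); the FNV-1a hash and the names here are ours:

```typescript
// Hash-based routing sketch: the same customer ID always lands on the same shard.
function shardFor(customerId: string, numShards: number): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < customerId.length; i++) {
    hash ^= customerId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime
  }
  // Map the (signed 32-bit) hash into [0, numShards)
  return Math.abs(hash) % numShards;
}

// Every document for a given customer is indexed to, and searched in, one shard:
const shard = shardFor('customer-42', 6);
```

Because the function is deterministic, both the Streaming Indexer (at write time) and the Search API (at query time) can derive the same shard from the customer ID alone.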
This idea of sharding is not only a performance gain, but a good first step towards a privacy-aware and decentralized Purchase History system. Even though Elasticsearch is the centralising element of the whole setup, the idea of separating data based on customers is an element that we would like to explore further. We have been researching solutions like SOLID, whose pods can resemble our idea of shards. Basically, each shard/pod would be a private, fully decentralized repository of a consumer’s data. The consumer would be in total control of it and would be able to manage who has access to what part of their data.

As stated, shards are grouped into indices. The picture below shows the different components regarding Elasticsearch. This index distribution of data enables the Elasticsearch cluster to scale out with no degradation of the service. Multiple instances of Elasticsearch nodes can be run so that the processing of the querying and the indexing is further distributed. Each instance can work in parallel, and each index can be queried separately. Furthermore, settings can be tweaked for older indices, such as freezing them and making them read-only. A large proportion of our clients' searches are for the purpose of retrieving recent purchases, which usually means only one or two of the most recent indices are hit.
The second part of our Purchase History setup is the Search API itself. It is able to search effectively for purchases with a combination of filter and sorting parameters. Depending on the use case, different search parameters can be used. A typical use case is shoppers reviewing their purchase history, looking for details or items to repurchase. Another use case is that of our clients’ customer support services, which shoppers contact in the event of mismatches between ordered and received products. This setup can also be integrated with other applications, such as a Nutrition ranking system or a Pickup virtual assistant that quickly checks shoppers’ purchase history to retrieve the purchases to be picked up.
The Search API service is also scalable and stateless, but it is more similar to our traditional search service. To make it performant, some tricks are possible, such as using the routing data discussed earlier. Secure access and permissions are simple to set up using OAuth, for example.
As with the rest of our services, Kubernetes is used to orchestrate the deployment of all the Purchase History services. This makes them reliable and enables them to run in any cloud environment as there are no cloud-specific technologies used. Kubernetes simplifies horizontal scaling, depending on the load. For example, if traffic suddenly increases, more Search API pods can be quickly deployed to meet established latency SLAs.
Another consideration is the retention period of the data. Our customers, or even legislation, may dictate that the data be kept for no longer than a specific period. Having the monthly indices scheme in place allows us to simply remove a full index when all the documents belonging to that month are older than the defined period.
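Assuming a hypothetical purchases-YYYY-MM naming scheme for the monthly indices, that pruning logic could be sketched like this (the function name and index names are illustrative):

```typescript
// An index is deletable once the end of its month falls outside the retention window.
function indicesToDelete(indices: string[], now: Date, retentionMonths: number): string[] {
  // First day of the oldest month still inside the retention window.
  const cutoff = new Date(now.getFullYear(), now.getMonth() - retentionMonths, 1);
  return indices.filter(name => {
    const [year, month] = name.split('-').slice(-2).map(Number);
    // JS months are 0-based, so new Date(year, month, 1) is the first instant
    // AFTER the index's month, i.e. the end of the month it covers.
    const endOfMonth = new Date(year, month, 1);
    return endOfMonth <= cutoff;
  });
}

const stale = indicesToDelete(
  ['purchases-2023-05', 'purchases-2023-11', 'purchases-2024-01'],
  new Date(2024, 0, 15), // 15 January 2024
  6
);
// With a 6-month retention, only 'purchases-2023-05' is deletable
```

Dropping a whole index this way is far cheaper than deleting individual documents, which is exactly why the monthly scheme pays off here.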
The following challenge will be to configure Purchase History access in a decentralized and private scenario. This will be done hand-in-hand with merchants, as they will no longer own and have total control over their customer base’s Purchase History, but rather an interface to enable access to each customer’s shards/pods.
Long gone are the days when we used to consume data with Apache Spark Streaming, with an overly complicated, cloud-dependent infrastructure that was non-performant when load increased dramatically. Follow us on a journey of stack simplification, learning, and performance improvement within Empathy.co’s data pipeline.
Our data journey began in 2017, building a full-fledged AWS-based pipeline, steadily consuming events from multiple sources, wrapping those into small batches of JSON events, and sending them on to the Spark Streaming consumer.
Here’s a simplified view of what that exact streaming side of things looked like - together with the important bits of infrastructure, such as queues where events would be written:
Main technologies found in the old stack:
In-house services:
There’s more to it than what’s seen in the diagram, but that should give you an idea of how things looked back then - and yes, it was all built on AWS, with little room for portability short of making major changes to the whole set of services that composed this part of the pipeline.
Some cons found in this solution:
You may be wondering, “Why not use Beam from the beginning? It was released in 2016!” Well, given the experience the team had with Apache Spark, it made sense to go with the much-proven Spark Streaming rather than making the switch to a technology that was still in its early stages.
Despite the cons, we managed to improve its performance over time, making it a bit more capable of handling higher loads of traffic without missing a beat.
However, the portability pain point was still there and wasn’t an issue so long as we didn’t use any other cloud… But given certain requirements, an extension of the components to support Google Cloud Platform was on its way to broaden the offering, and it was the perfect time to explore alternatives and simplify the system for the better! After a thorough analysis, we came to the conclusion that, given our expertise at the time, Apache Beam was the way to go.
Something which used to be overly complicated and really dependent on infrastructure soon looked something like this:
According to the Apache Beam website itself:
“The easiest way to do batch and streaming data processing. Write once, run anywhere data processing for mission-critical production workloads.”
In short, it is a unified API to build data processing components which can compile a single codebase into Flink, Spark, and many other systems. Check out the full list of supported runners.
Pick your language of choice and you’re pretty much ready to write code that will easily and transparently translate into a runner. The official docs have everything you need to know!
Some pros that made our decision way easier:
However, there are some cons worth mentioning also:
As soon as we decided that Apache Beam would replace our Spark Streaming implementation, we started to dig into what other parts of the infrastructure needed to be changed in order to optimize it. First step: the connector to the queue.
The base of every Apache Beam job is the Pipeline, which is defined as “a user-constructed graph of transformations that defines the desired data processing operations.”
Pipeline p = Pipeline.create(options);
p.apply("ReadLines", TextIO.read().from(options.getInputFile()));
As soon as you’ve got data flowing through the pipe (that initial TextIO step), you can start modifying the data however you wish.
In this sample, TextIO is our connector of choice: it reads files from a given location, either on the local machine or remote (S3 and GS buckets, for instance), and gives us a collection of strings. In Beam this is called a PCollection, which is “a data set or data stream. The data that a pipeline processes is part of a PCollection.”
As soon as you build that initial and crucial step in your pipeline, you are ready to start adding more stages that operate on the aforementioned PCollection.
The apache/beam GitHub repository has plenty of examples to play with, so take a look! Also, the Programming Guide on the official website has everything you need to get up to speed on developing your very own pipelines. All the concepts mentioned here (and more!) are covered in detail.
Initially, we were using a mix of SQS and SNS with S3 buckets to persist the events midway, sending notifications via SNS whenever a new batch of events was ready to be pulled from Spark. This was not ideal by any means and it complicated the overall portability and maintainability of the system, among other things.
We wanted a real-time stream of events from beginning to end, with the ability to replay messages in case we wanted to and remove all the unnecessary complexity of maintaining midway buckets and notifications. The solution, given the climate at the time, was clear: Kinesis.
For those of you that don’t know what Kinesis is, it's basically AWS’s real-time streaming offering. The benefits, in addition to it being real-time, are that it is highly scalable and fully managed, which removes a lot of operability overhead - this was a decisive factor for us.
The initial development of the latest jobs covered code that would run on both GCP Dataflow and Flink on Kubernetes.
By applying the concepts seen in the documentation above, we built our pipelines - all in a streaming fashion - which save the processed records to both Parquet files and MongoDB collections with great throughput.
The result is easily scalable when the high-traffic season comes (e.g. Black Friday or sales) and portable across clouds. Hope you enjoyed this brief intro to Apache Beam and the transition from a really cloud-dependent environment to a more agnostic one!
At Empathy.co, On-Call was scheduled to allow for fast recovery in case of a disaster, errors, or loss of service. A bunch of folks from different teams are part of the On-Call rotation (with escalation policies), which guarantees that there is someone ready to catch any incidents that may occur.
We work with a "you build it, you own it" principle, which promotes each team's autonomy and ownership. Each team defines the operational challenges of their services in an Operational Readiness Review so that they can solve issues without first escalating to other teams – because nobody knows more about a service than the team that owns it.
Empathy.co’s culture enables and empowers software engineers to take ownership to build, run, and operate products and features. We actively participate in the design and implementation of customer solutions and remain responsible for the value and service they provide. This means we are engaged in the whole cycle, from software inception to any service-level management, such as debugging, troubleshooting, request and incident resolution.
Whether your organization already has an On-Call system or is looking to implement one, here's a look at how ours has been implemented and how we recommend setting it up.
Previously, when a new engineer was added to the On-Call rotation, there was not much guidance. To improve the On-Call onboarding, we implemented On-Call shadowing for a kinder, smoother ramp-up to going On-Call, with none of the stress or responsibility for diagnosing and fixing the issue.
The default shadowing schedule excludes weekends and the day the On-Call is transferred from one engineer to the next, resulting in a 4-days-a-week, 24-hours-a-day "shadowing shift." New engineers decide when they want to start shadowing and when they feel ready to join the On-Call rotation. Our expectation is that they begin shadowing sometime during the first three months, and our culture of shared responsibility and blamelessness makes it less daunting to make the switch from shadowing to being on call.
Since it's important to respond to incidents quickly, identifying and logging must also be done quickly and methodically in order to move on to the next step.
It's important to categorize incidents to prevent confusion. For instance, the number of users affected, affected services, revenue impact, etc. Prioritizing incidents can help the On-Call engineer make a call on whether or not the incident requires the time and resources of the rest of the team.
Assembling the right people at the right time and in the right place is key to ensuring an exceptional mean time to resolve (MTTR). Therefore, it's necessary to suppress the noise and alert only the people who can actually fix the issue.
It's vital that first responders like the On-Call engineer be able to troubleshoot on the go.
Each team is the expert regarding the workloads they have built. Following the "you build it, you own it" principle, the alerts should be owned by each respective team.

Currently, our schedules usually start on Monday, but any weekday will do. We recommend avoiding On-Call shift changes on weekends. The shift length is one week, and only one week per month per engineer is recommended in order to avoid burnout.
Setting clear expectations is key for maintaining alignment:
Each escalation policy should account for at least four engineers, and shift changes should happen on a weekly basis. Each On-Call engineer should plan their PTO days in coordination with their team, to ensure proper coverage.
After an incident happens, a Postmortem should be written the following day to detail the occurrence and the solution applied. The main owners of the Postmortem are the On-Call engineers who solved the incident, but more people can be added to the review. The Postmortem should then be reviewed by Staff Engineering before being communicated to clients and any other stakeholders within the company.
This is the On-Call approach that works for us, but each organization is different. It's important that you create an approach that fits your company's needs, offers reliability for your clients, and balances the workload for your On-Call engineers.
The X Adapter is one of the most powerful and versatile tools available, serving as a decoupled communication layer between an endpoint and your components. Combine it with the X Components, and you'll have half the work done already!
But how does the adapter accomplish this? Essentially, the adapter acts as a middleman between the website and the endpoint by handling requests. When creating and configuring an adapter, you typically need to specify an endpoint. However, to actually "adapt" the data, you need more than just an endpoint. This is where mappers come into play.
The mappers are at the core of the usefulness of an X Adapter. The RequestMapper of the adapter is used to build the endpoint correctly before making the request, and the ResponseMapper takes care of the response to that request.
We can define the mappers as dictionaries that help the adapter translate from what we have to what we want. This translation is defined using Schemas, which specify how the source object should be transformed into the target object.
For example, let’s say that our components name the search query as query, but the endpoint expects it to be called just q. The adapter will need a RequestMapper that tells it to translate query into q when creating the request. Here's an example of how to define a RequestMapper that does just that:
const requestMapper = schemaMapperFactory({
q: 'query',
})

Easy enough! But what if the endpoint requires numerous fields? How can you keep track of all the different fields needed for the request to work while building the mapper? Fortunately, schemaMapperFactory implements generic typing using TypeScript. This allows you to specify the types of the source and target request parameter objects, which enables your IDE to give you hints about the available fields.
type SourceParams = { query: string };
type TargetParams = { q: string };
const requestMapper = schemaMapperFactory<SourceParams, TargetParams>({
q: 'query',
})
Mapping a couple of 1:1 parameters is rarely enough to handle real-life situations. Now that we've covered the basics of mappers, let's take a closer look at how they can handle more complex object structures, such as nested objects and calculations.
Returning to the previous example, we essentially tell the mapper that it will receive an object with an entry called query. When it finds that entry, it should put its value in a new object's entry, q. But what if query is not at the root level of the object? What if it's wrapped inside something else?
type SourceParams = {
searchParams: {
query: string
}
};

The good news is that the solution is quite simple: specify the full path to the property, and the mapper will find it. The even better news is that TypeScript hints will still work and suggest the correct full path!
type SourceParams = {
searchParams: {
query: string
}
};
type TargetParams = { q: string };
const requestMapper = schemaMapperFactory<SourceParams, TargetParams>({
q: 'searchParams.query',
})

Indicating the path to the property like in these examples will cover a lot of the situations you might encounter, but what if there’s some extra complexity in what you want to do? What if you need to do some calculations before passing the value to the target property?
Consider a scenario where you have to handle pagination for a request. The source object has the current page (page) and the number of elements per page (pageSize), but the endpoint requires the index of the first element to return (startIndex).
type SourceParams = {
searchParams: {
query: string,
page: number,
pageSize: number
}
};
type TargetParams = {
q: string,
startIndex: number
};
To assign the correct value to startIndex, you must first multiply page and pageSize. Rather than indicating the path to a parameter, you can pass a function that receives the entire source object and performs this calculation.
const requestMapper = schemaMapperFactory<SourceParams, TargetParams>({
q: 'searchParams.query',
startIndex: ({ searchParams }) => searchParams.page * searchParams.pageSize
})
Moving on to responses, the tools we have are the same, but the problems we'll face are likely to be more complex. When dealing with endpoint responses, we have to handle more complex structures and arrays of elements. The X Adapter was created with the idea of mapping search responses, so it will need a way of handling the mapping of lists of elements.
Taking the following endpoint response as an example:
{
response: {
items: [{
identifier: 1,
name: 'example',
image: 'image.url.example'
},
...]
}
}
We define the types for the SourceResponse and the TargetResponse:
type SourceResponse = {
response: {
items: {
identifier: number;
name: string;
image: string;
}[];
};
}
type TargetResponse = {
results: {
id: string;
description: string;
images: string[];
}[];
}
We see here that response.items should be mapped to results and, for each item in the response, the numeric identifier becomes the string id, name becomes description, and image becomes the single element of the images array. Let’s see how it’s done:
const responseMapper = schemaMapperFactory<SourceResponse, TargetResponse>({
results: {
$path: 'response.items',
$subSchema: {
id: ({ identifier }) => identifier.toString(),
description: 'name',
images: ({ image }) => [image]
}
}
})
So, what's happening here? We're passing an object in the schema for the results property. This object has two elements: $path and $subSchema. $path is the path to the property in the source response object whose list will be iterated to populate the results. $subSchema is the schema that will be applied to each of the elements found there.
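To make the behavior of schemas more concrete, here is a toy sketch of how a schema-driven mapper could work internally. This is not the real @empathyco/x-adapter implementation; the names are ours and the details are simplified, but it illustrates the three kinds of schema values (path strings, functions, and $path/$subSchema objects):

```typescript
type Schema = Record<string, string | ((source: any) => unknown) | { $path: string; $subSchema: any }>;

// Resolve a dot-separated path like 'response.items' against an object.
function getByPath(obj: any, path: string): any {
  return path.split('.').reduce((acc: any, key) => (acc == null ? undefined : acc[key]), obj);
}

function mapWithSchema(source: any, schema: Schema): any {
  const target: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(schema)) {
    if (typeof value === 'string') {
      // A string is a path into the source object.
      target[key] = getByPath(source, value);
    } else if (typeof value === 'function') {
      // A function receives the source object and computes the target value.
      target[key] = value(source);
    } else {
      // { $path, $subSchema }: map every element of the list found at $path.
      const list: any[] = getByPath(source, value.$path) ?? [];
      target[key] = list.map(item => mapWithSchema(item, value.$subSchema));
    }
  }
  return target;
}

const mapped = mapWithSchema(
  { response: { items: [{ identifier: 1, name: 'example', image: 'image.url.example' }] } },
  {
    results: {
      $path: 'response.items',
      $subSchema: {
        id: (item: any) => item.identifier.toString(),
        description: 'name',
        images: (item: any) => [item.image]
      }
    }
  }
);
// mapped.results[0] → { id: '1', description: 'example', images: ['image.url.example'] }
```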
Now that we have an adapter that can translate between our components and an endpoint, we just need the components. This is where the X Components come in handy. They offer multiple components to easily and quickly create an engaging search experience from the ground up. You can check out this video lesson for guidance on how to create a project with X Components or delve deeper into the documentation.
Using both the X Adapter and X Components, you only need to provide the endpoints. The components are already prepared to receive the information to display the search through the X Adapter.
The X Components require an adapter that bundles together different adapters for the different endpoints the components need, such as Related Tags, Next Queries, and tagging. To make the development process easier, types are provided for each adapter's expected request and response. Additionally, there's a pre-built adapter, the X Adapter-Platform, that you can plug directly into the X Components setup to communicate with Empathy.co endpoints right away. However, you won't always be working against Empathy.co endpoints.
If you want to create a search experience with the X Component using your own endpoint, you'll most likely start by showing results for the search. To do that, you'll need to create an adapter for the search endpoint.
const myAdapter = {
search: searchEndpointAdapter
}
Your searchEndpointAdapter will use your endpoint and the mappers to adapt the endpoint requests and responses to what the X Components understand, similar to the previous examples:
import { SearchRequest, SearchResponse } from "@empathyco/x-types";
import { YourSearchRequest, YourSearchResponse } from "your-types";
const requestMapper = schemaMapperFactory<SearchRequest, YourSearchRequest>({
q: 'query',
pageIndex: ({ rows, start }) => (rows && start ? rows * start : 0)
})
const responseMapper = schemaMapperFactory<YourSearchResponse, SearchResponse>({
results: {
$path: 'response.items',
$subSchema: {
id: ({ id }) => id.toString(),
name: 'name',
images: ({ image }) => [image],
modelName: () => 'Result'
}
},
totalResults: 'response.total'
})
const searchEndpointAdapter = endpointAdapterFactory<SearchRequest, SearchResponse>({
endpoint: 'https://your.endpoint',
requestMapper,
responseMapper
});
After you’re finished with the adapter, pass it in the options object to the X Components and the search adapter will start providing the results from your search to the components that need them!
new XInstaller({
adapter: myAdapter,
...
}).init();
Creating a great search experience for your customers doesn't have to be daunting. Empathy.co's X Components and X Adapter make it easy for developers to quickly build a reliable, adaptable search system that caters to the needs of their users. The X Adapter simplifies the process of translating between components and endpoints, while the X Components offer a range of pre-built components that can be customized to fit specific requirements. By utilizing these tools, you can provide a seamless and engaging search experience that keeps your customers coming back for more.
The field of Frequent Pattern Mining (FPM) encompasses a series of techniques for finding patterns within a dataset.
This article will cover some of those techniques and how they can be used to extract behavioral patterns from anonymous interactions, in the context of an ecommerce site.
Say that we are observing a piano and noting down in a ledger every note that is played. For clarity, let's concede that our piano is oversimplified and only has seven notes, from A to G.
Our ledger would read something like this: A, D, F, A, C, G, A, C, B
The contents of our ledger are a data series, and each individual note is an element (a.k.a. itemset) of the series.
We know that the ordering of the notes is rather important in a melody. Thus, we can safely assume that our series is indeed a sequence (ordered data series). This is not always the case and it really depends on what information you want to learn from the data. For instance, we might be studying how frequently two particular notes are dominant in a database of melodies. Then, we wouldn't need to guarantee a sequence.
Now imagine that we want to detect riffs. A riff is a short sequence of notes that occur several times in the structure of a melody. In our nomenclature, a riff is then a subsequence. A subsequence of length 'n' is also called an n-sequence.
For example, take this popular tune. Can you hum along?

So this is the goal of FPM put simply: find the subsequences with their frequency, i.e. how many times they are observed.
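To make this concrete, here is a minimal Python sketch of support counting (the helper name is hypothetical, and for simplicity it only counts contiguous subsequences; full FPM algorithms also consider gapped subsequences):

```python
from collections import Counter

def subsequence_counts(series, n):
    """Count every contiguous n-sequence observed in a data series."""
    return Counter(tuple(series[i:i + n]) for i in range(len(series) - n + 1))

notes = ["A", "D", "F", "A", "C", "G", "A", "C", "B"]
# The 2-sequence ('A', 'C') is the most frequent "riff" in our ledger
print(subsequence_counts(notes, 2).most_common(1))  # [(('A', 'C'), 2)]
```

Running this over a large database of melodies, instead of one short ledger, is essentially what the mining algorithms later in this post do at scale.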
In order to make things more interesting, let's consider many pianos playing at the same time. We can simultaneously write all the notes played, but we need an additional annotation to be able to tell them apart, e.g., P1 for the first piano, P2 for the second one, and so forth. This is the sid (sequence identifier).
Until now, we have considered every element to be defined by a single piece of data (e.g., 'A'), but that's obviously insufficient. Musical notes have more aspects to them than the tone: duration is also a key piece of data. We can improve our modeling by defining two aspects for our itemsets: tone and duration. For example, {tone: B, duration: 4} is an itemset defined by two items. Note that there is no concept of ordering for the items belonging to an itemset. They all occur at the same time.

This twist allows for more interesting exercises. We could, for instance, focus our attention on finding a dominant rhythm in the melodies. Or, we could look for recurring patterns of notes combining a particular sequence of tones, followed by any tone with a particular duration.
We will now apply the concepts above for modeling the behavioral trends observed in an ecommerce website.
We take the clickstream as our dataset, i.e., the atomic interactions that shoppers are producing on the website. These interactions include things like performing a search, clicking on a product, etc.
The Session ID is a perfect match to use as the sid, since this concept is widely used across the web for correlation purposes. Note that no other profiling data is needed. Plus, the Session ID has desirable privacy and security properties.
The shoppers' interactions are then grouped by Session ID and sorted by timestamp when we observe them. The resulting sessions are our sequences, thus the interactions are itemsets.
Next, we must define the interactions through their items. The first consideration is that we have different kinds of shopper interactions, so it's logical to consider this an important item. Let's name it eventType. The set of types may vary between websites, depending on the user experience. Let's initially consider three main types: query, click, and add2cart.
At the very least, a query action is defined by its query terms. We also apply normalization processing to the query terms to improve the homogeneity of input data, e.g., lowercasing and transforming plural to singular words.

Click and add2cart are very similar regarding their relevant items, although the latter is usually a stronger signal so it's a good idea to keep them apart. The product ID is their most descriptive item, and the position of this product in the result list may also be relevant as it generally introduces bias in shoppers' behavior.
We also enrich the itemset with attributes from the shop's catalogue, in order to improve the aggregation factor. Consider that it's easier to discover patterns from items with a greater aggregation factor.
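As an illustration (the field names here are hypothetical), a single session could be modeled as a sequence of itemsets like this:

```python
# One shopper session: an ordered sequence of itemsets, keyed by Session ID.
# Each itemset combines the eventType with descriptive and catalogue items.
session = {
    "sid": "131ABC",
    "sequence": [
        {"eventType": "query", "query": "jacket"},
        {"eventType": "click", "productId": "p3", "position": 2,
         "size": "M", "dept": "casual"},
        {"eventType": "add2cart", "productId": "p3",
         "size": "M", "dept": "casual"},
    ],
}
```

Note that the items within one itemset are unordered, while the itemsets themselves are ordered by timestamp.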

The selection of attributes to include in the itemsets has a big impact on the outcome of the FPM. The choices will depend primarily on the goal of the application. What kind of patterns are you aiming to discover?
When a particular item occurs a lot across itemsets, then we say that it has a big aggregation factor. E.g., you can expect to see a lot of occurrences of the item 'eventType:query'.
The aggregation factor is necessary for discovering patterns, but it can also spoil your FPM with obvious facts, leaving the top patterns dominated by combinations of very common items.
The patterns based on items like 'eventType:query' are eclipsing other patterns that are probably more interesting to learn about.
A possible optimization in this case is crossing some features together in order to break the strong aggregation factor, thus gaining increased granularity, e.g., crossing eventType with the query terms to get items like 'query:jacket'. This leads to more intriguing, finer-grained patterns.
Let's take a look at how to run FPM using Apache Spark's PrefixSpan implementation.
First, we'd typically prepare the clickstream data as a Spark DataFrame. The key column here is sequence. Note that we have nested arrays (sequences of itemsets) in that column.
+-------+---------------------------------------------------------------+
|session|sequence |
+-------+---------------------------------------------------------------+
|131ABC |[[query:jacket], [click:p3, size:M, dept:casual], [click:p2, size:L, dept:casual]] |
|130ABC |[[query:jacket], [click:p2, size:L, dept:casual], [query:sport], [click:p3, size:M, dept:casual], [add2cart:p3, size:M, dept:casual]] |
|125ABC |[[click:p1, size:40, dept:footwear], [add2cart:p3, size:M, dept:casual], [click:p2, size:M, dept:casual]] |
|123ABC |[[query:sport], [click:p1, size:38, dept:footwear], [click:p3, size:M, dept:casual]] |
|124ABC |[[query:sweater], [click:p2, size:L, dept:casual], [prd:p3, size:M, dept:casual], [add2cart:p3, size:L, dept:casual], [query:shirt], [click:p2, size:L, dept:casual]] |
+-------+---------------------------------------------------------------+
Running FPM on top of this prepared DataFrame is simple:
def run(df: DataFrame): DataFrame = {
  new PrefixSpan()
    .setMinSupport(minSupport)
    .setMaxPatternLength(maxPatternLength)
    .setMaxLocalProjDBSize(maxLocalProjDBSize)
    .findFrequentSequentialPatterns(df)
}
The minSupport parameter expresses how often a pattern must appear, computed as num_sessions_present / total_sessions; it is therefore a relative value (0-1). It's worth mentioning that fine-tuning minSupport and maxPatternLength critically impacts performance.
As output, we get another DataFrame with the results of the analysis.
+------------------------------------------------------------------+----+
|sequence |freq|
+------------------------------------------------------------------+----+
|[[size:M, dept:casual], [click:p3]] |53 |
|[[query:jacket], [click:p3, dept:casual, size:L]] |34 |
|[[query:shirt], [size:M, dept:casual], [add2cart:p2]] |31 |
+------------------------------------------------------------------+----+
Note that many of the resulting itemsets are now "incomplete": they act as templates, telling you the common factors extracted from the original itemsets observed.
Apache Spark also provides another implementation for performing FPM: FP-Growth.
It's interesting to know the differences, as FP-Growth can also be a valid approach for some use cases.
Now that you have discovered the representative patterns in your data, you can interpret this info statistically as a source of predictions and recommendations for your shoppers. Optimizing the most frequent shopper journeys on a website (e.g., by suggesting shortcuts) is another interesting use case.
Patterns can be used to learn trends or association rules, so stay tuned for a post on that topic!

Progressive Delivery is emerging as a worthy successor to Continuous Delivery, by enabling developers to control how new features are launched to end users. Its wide popularity is owed to the demand for faster and more reliable software releases. The increasing emphasis on customer experience has begun to push Continuous Delivery methodology by the wayside. Large enterprises like Netflix, Amazon, and Uber are turning to Progressive Delivery to test and release code in a phased and controlled manner.
In a nutshell, Progressive Delivery empowers developers to roll out code changes to a subset of users and then expand them to all users. The progressive rollout of features is executed through techniques like blue-green deployment, feature flagging, and canary deployments. You can mitigate issues that come up by promoting a version to all users only when you’re confident that it is performant and reliable. And if it fails in production, the impact radius is restricted to a subset of users, and the update can be rolled back immediately.
Apart from laying the foundation with Kubernetes, GitOps, and a service mesh, the key piece of the entire puzzle is a purpose-built progressive delivery tool.
Argo Rollouts is a Kubernetes controller and set of CRDs which provide advanced deployment capabilities such as blue-green, canary, canary analysis, experimentation, and progressive delivery features to Kubernetes.
We have also performed a deep analysis of the solution chosen at Empathy.co, outlining the requirements, benefits and a comparison between Argo Rollouts and Flagger.
Progressive Delivery can be implemented through a number of strategies:
Channel a limited amount of traffic to a new canary service, and then if it passes reliability tests, you can gradually shift all traffic from the old to the new service, and the canary becomes the default version.
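As a sketch of what a canary strategy looks like in Argo Rollouts (the resource names, image, and step values here are illustrative, not taken from a real deployment):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:v2
  strategy:
    canary:
      steps:
      - setWeight: 20          # send 20% of traffic to the canary
      - pause: {duration: 10m} # observe before increasing the weight
      - setWeight: 50
      - pause: {duration: 10m}
      # after the last step, the canary is promoted to 100%
```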

Control the code launch remotely through a toggle-like feature, which enables changes to be rolled back immediately in the event of a failure.
Gradually transfer traffic from an existing application (blue) to a newer one (green), while the blue version acts as a backup.

Expose two different categories of the audience to two different application versions and analyze their performance to decide which is the ideal version.

Which tactic you pick depends on your goals and which you think would fit your workloads best.
Although those are the basic methods, Argo Rollouts allows more custom analysis to be added and all the capabilities to be explored, making it possible to create our own strategy during releases. For instance, a custom strategy with the following capabilities could be defined:
For example, an analysis step could run while setWeight is still set to zero, validating the new version before it receives any production traffic.
With Progressive Delivery, you can reduce risk. This is mainly because you continuously test code changes, analyze performance and implement your learnings all in real time. To ensure that this happens seamlessly, it's important to have KPIs against which to measure the success of your release. In our case, the SLO metrics would be a good KPI to choose. Thanks to Pyrra, there is a set of Prometheus Rules to monitor the Error Budget of your application and send an alert in case it is close to being burnt out.
Those metrics can be added as part of the Analysis in Argo Rollouts, so be sure to learn how to query Prometheus metrics in an Analysis, and check that your metrics make sense for validating the success of your release.
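For instance, an AnalysisTemplate querying Prometheus could look like the following sketch (the metric name, address, and threshold are hypothetical; adapt them to the rules generated by your Pyrra configuration):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
spec:
  metrics:
  - name: error-rate
    interval: 1m
    failureLimit: 3
    # succeed while the 5xx error ratio stays under 1%
    successCondition: result[0] < 0.01
    provider:
      prometheus:
        address: http://prometheus.monitoring.svc:9090
        query: |
          sum(rate(http_requests_total{job="my-app",code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="my-app"}[5m]))
```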
Check your Pyrra configurations, as they will be propagated as Prometheus Rules and will create some critical alerts for the Error Budget.
There are multiple ways to migrate a Deployment to a Rollout, but here we'll explain the simplest one: reference the existing Deployment from the Rollout via the workloadRef field. During the migration, the Deployment and the Rollout should coexist to avoid downtime. After that, the Deployment resource can be scaled down to zero replicas.
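A minimal Rollout referencing an existing Deployment might look like this sketch (names and step values are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 3
  # Adopt the pod template from the existing Deployment instead of
  # duplicating it inside the Rollout spec
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  selector:
    matchLabels:
      app: my-app
  strategy:
    canary:
      steps:
      - setWeight: 25
      - pause: {}  # wait for manual promotion
```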
A temporary ingress and service will be put in place to avoid downtime and ensure correct behavior during the migration, because Rollouts introduce a hash label in the selector to route the traffic.

After the migration, the temporary service and ingress can be deleted.
Argo Rollouts offers a Kubectl plugin to enrich the experience with Rollouts, Experiments, and Analysis from the command line.
Q: Can I restart a Rollout?
A: Sure, you can restart a Rollout just like you can restart a Deployment. There are multiple ways to do so: using the kubectl plugin, directly from Argo Rollouts UI, or through the Argo CD UI.
Q: Does the Rollout object follow the provided strategy when it is first created?
A: As with Deployments, Rollouts do not follow the strategy parameters on the initial deploy. The controller tries to get the Rollout into a steady state as fast as possible by creating a fully scaled-up ReplicaSet from the provided .spec.template. Once the Rollout has a stable ReplicaSet to transition from, the controller starts using the provided strategy to transition the previous ReplicaSet to the desired ReplicaSet.
Q: If I use both Argo Rollouts and Argo CD, won't I have an endless loop in the case of a Rollback?
A: No, there is no endless loop. As explained in the previous question, Argo Rollouts doesn't tamper with Git in any way. If you use both Argo projects together, the sequence of events for a Rollback is the following:
Q: How can I run my own custom tests (e.g. smoke tests) to decide if a Rollback should take place or not?
A: Use a custom Job or Web Analysis.
For more information, check out the Argo Rollouts FAQ.
Argo Rollouts offers enhancements to the usual Kubernetes deployment strategies, and the main highlights of adoption are:
But for those who haven’t yet found the right moment to start, or even don’t see the value, I hope this story empowers you to join the bright side of testing and transform the initial question, “Why should my team start automating?” to “Why haven't we started automating yet?”
Correct, automation will imply a big initial investment. It’s necessary to dedicate time and people to find the tool that best fits the framework and programming language your team is using. Of course, your team will also need proper training as to how to use that tool and how to design simple, maintainable tests.
Moreover, it’s not possible to provide an exact ROI calculation, since automation brings test coverage that wouldn’t be possible to achieve and handle manually. Think of a test regression suite for a project with multiple interconnected features. During the first few iterations, it might be possible to manually run all acceptance and regression tests, but the more iterations, the bigger the regression suite would be. How long would it take for that suite to become unmanageable? Consequently, how long would it take for the team to give up on those tests?
Well, it’s not strictly necessary. Adopting Agile practices means not only that there is no longer room for a dedicated Quality & Productivity team working on demand at the end of project releases, but also that quality becomes a whole-team effort. The entire team needs to adopt a quality mindset, as everyone is responsible for the quality of the product. Whether you are a developer with no experience automating tests or a manual tester without a programming background, never fear. It’s a team responsibility, and there will be colleagues on the team who can help.
Take a look at the beginning of the previous section. Do you see that “It’s not strictly necessary” bit? That’s because, although the whole team might agree on learning automation skills, it may still be a good idea to bring in a quality expert to coach and support the team, helping them adopt the new practices and making sure those first steps head in the right direction: toward a solid, maintainable, and easy-to-understand automation test suite. A Quality & Productivity Engineer will know the team is ready when automation has become an assumed, natural process.
It’s harder to take on new habits when there are old ones to fall back on. If the team isn’t given sufficient time to get up to speed with automation and, later on, to design and implement the automated tests during each sprint, they will go back to old practices. Automation will be skipped as a result of choosing the lesser of two evils, and some exploratory testing will be done, at most. In the best-case scenario, the automation tasks are resumed in the following sprint, with the delivered business value reduced. In the worst-case scenario, deliverables appear to hold up without proper testing, so the team falls into the misbelief that automated tests are not necessary.
At this point, there may still be some skeptical readers thinking that if their software is simple enough, they can still survive without automated tests.
Let’s imagine the following scenario: there’s a small application developed by a concrete team that always has enough time allotted to manually test all the progress made during each iteration. Some bugs are found during the testing phase, and few others make it to the production environment, yet the situation is not concerning. How could automation benefit the team?
Have I brought you to the bright side of automation? Happy testing and, as always, bring on any questions or comments!
One of the main engineering challenges faced by the Empathy.co Data Team is creating robust tests for our Spark applications. Since these applications are constantly evolving, as for any application, we needed a way to ensure changes wouldn’t break the code; a guarantee that the output from our jobs would remain the same when refactored or when the input schema of the data is changed.
The biggest hurdle here was determining how to create component tests that check two key boxes:
Our Spark project architecture is based on the Spring Boot framework, which we use to handle the arguments passed to our applications via environment variables or command line arguments. For that reason, all of our jobs follow the same structure, using configuration beans to read the configuration properties. All the jobs implement the run method from the Spring Boot ApplicationRunner interface. So, the final objective is to test the run method of our applications to ensure the aggregation results are returned as expected. Individual methods can also be tested with unit tests, but that is out of the scope of this blog post.
Let’s imagine that the structure of one of the Spark Jobs we want to test is the following:
@SpringBootApplication(
  exclude = Array(classOf[MongoAutoConfiguration]),
  scanBasePackages = Array[String]("my.configuration.package")
)
@ConfigurationProperties("job-name")
class MyBatch @Autowired() (
  @BeanProperty var sparkConfig: SparkConfig,
  @BeanProperty var inputConfig: InputConfig
) extends Serializable
  with ApplicationRunner {

  @BeanProperty
  var outputPath: String = _

  override def run(args: ApplicationArguments): Unit = {
    val inputDf = readParquet(
      sparkConfig.sparkSession,
      inputConfig.path,
      inputConfig.getStartDate,
      inputConfig.getEndDate
    )
    val parquetDF = letTheJobPerformItsJob() // ...
    if (!Option(outputPath).forall(_.isEmpty)) {
      parquetDF.save(
        fullOutputPath,
        partitionKeys = Seq(RawSchemaConstants.Yyyy, RawSchemaConstants.Mm)
      )
    }
  }
}
Next, we will be using the Mockito framework to write the test code. Let’s start the test by creating the Spark session:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.mockito.ArgumentCaptor
import org.mockito.ArgumentMatchers.any
import org.mockito.ArgumentMatchers.anyString
import org.mockito.ArgumentMatchers.isNull
import org.mockito.MockitoSugar
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.springframework.boot.DefaultApplicationArguments
class MyBatchTest extends AnyFlatSpec with Matchers with MockitoSugar {

  val spark = SparkSession.builder
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._
}
All of our Spark batch jobs follow the same working order: first, the source DataFrame is read from a bucket. Then, we make some aggregations, and finally, the results are either saved to a different bucket or to a MongoDB collection, or both.
The next step is to define the input DataFrame that would be read from the source bucket during a real execution, which would need to contain the data required to test the different scenarios that could arise (such as corner cases). The content of this DataFrame depends on each test logic.
To make the code more readable, we first need to define some prefixes/aliases (SV, SFV) to group into objects the constant values that we are using repeatedly. Also, the code shown here is a reduced version of the DataFrame; not all the rows that are present in the source DataFrame of the test are shown for readability purposes. We could define several DataFrames inside different tests to check individual scenarios and use a larger one to test the overall aggregation with all the scenarios together, but all of them will follow this same procedure:
val testData = DataFrameGenerator.generateSampleDataFrame(
  SV.Year,
  SV.Month,
  SV.Day,
  Seq(
    (SV.St1, Sinks.Q, SFV.LangEngScopeDesktop),
    (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
    (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
    [...]
  ).toDF(
    SourceColumnNames.Instance,
    SourceColumnNames.Sink,
    SourceColumnNames.Filters
  )
)
We define an empty map for the output that would be written to MongoDB, in order to mock that method, since we are not interested in writing any results to MongoDB.
val emptyMongoInstanceDFs = Map[WriteConfig, DataFrame]()

After that, we need to create a mock object for the methods that read the source data and write the results, which in our case are located in an implicit class called DataFrameMethods that extends the Spark DataFrame object. We also define a Mockito ArgumentCaptor object that will allow us to capture the result of the Spark job, which is then passed to one of the methods that performs the write operation. This code is written inside the test code:
"My Batch" should "work" in {
val dfMethodsMock = mock[DataFrameMethods]
val dataFrameCaptor = ArgumentCaptor.forClass(classOf[DataFrame])
}Now, we instantiate the job we want to test and create a spy object to mock some methods:
val myBatch = new MyBatch(
  SparkTestConfig.getSparkConfig,
  SparkTestConfig.getInputConfig
)
val spyBatch = spy(myBatch)
The input beans that the batch receives as arguments (Spark session, input config, etc) can be created as shown below. Our batch receives more input beans like the output parameters, but for the sake of simplicity, they are omitted from this snippet:
def getSparkConfig: SparkConfig = {
  val sparkConfig = new SparkConfig()
  sparkConfig.useLocalMaster = true
  sparkConfig.construct()
  sparkConfig
}

def getInputConfig: InputConfig = {
  val inputConfig = new InputConfig()
  inputConfig.path = "inputPath"
  inputConfig.timeZone = "UTC"
  inputConfig.date = "2022-01-01"
  inputConfig
}
The next step is to mock the methods that read the source data and write the results. The read method should return the input DataFrame that we previously created. The parquet write method is set to do nothing. The method that generates the documents written to MongoDB is also mocked so that nothing is written to any database; we only want to capture the argument with the resulting DataFrame when the method is called.
doReturn(testData)
  .when(spyBatch)
  .readParquet(
    any[SparkSession],
    ArgumentMatchers.eq("inputPath"),
    ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-01"), LocalTime.MIN, ZoneId.of("UTC"))),
    ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-02"), LocalTime.MIN, ZoneId.of("UTC")))
  )
doNothing
  .when(dfMethodsMock)
  .save(ArgumentMatchers.eq("outputPath"), ArgumentMatchers.eq(Seq("yyyy", "mm")))
doReturn(emptyMongoInstanceDFs)
  .when(spyBatch)
  .generateInstanceDFs(any[DataFrame], isNull[String], any[DatabaseConfig])
Here comes the most interesting part of the test: the batch code is executed by calling the run method and the resulting DataFrame is captured with the argument captor by using the method that generates the documents written to MongoDB. We mocked this method so it generates an empty collection, but we need to capture the input argument to get the Spark job results. In this case, we are using the method that saves the result into a MongoDB collection, but we could just as well use the one that writes the result into a bucket in parquet format.
spyBatch.run(new DefaultApplicationArguments(""))
verify(spyBatch)
  .generateInstanceDFs(
    dataFrameCaptor.capture(),
    any(),
    any[DatabaseConfig]
  )
val result: DataFrame = dataFrameCaptor.getValue.asInstanceOf[DataFrame].cache()
The last part of the test consists of checking the result. There are several ways to perform this check. One consists of collecting the results and comparing the rows. Although we are working with small DataFrames, we prefer to do this in a distributed way by comparing the DataFrames without collecting the data into the Spark driver.
So, let’s define the expected DataFrame. The key point here is to create the DataFrame with the columns in the same order they are aggregated by the job; otherwise, the check will fail:
val expectedData = Seq(
  (SV.St1, SFV.EmptyFilterStr, 1, 1, 3),
  (SV.St1, SFV.LangEngStr, 1, 2, 1),
  [...]
).toDF(
  FinalColumnNames.Instance,
  FinalColumnNames.Filters,
  FinalColumnNames.QueryCount
)
We check to see that the total number of rows matches and that the DataFrames are equal by subtracting one from the other and checking that the results are empty:
assert(10 === result.count())
assert(expectedData.except(result).isEmpty)

We know that our jobs do not produce duplicated rows (exactly equal rows), so this check is enough to ensure that the expected and resulting DataFrames are the same. With repeated rows, we would need to do more checks, such as the number of times each row is repeated. We can subtract the DataFrames in the inverse order and perform the same check to be fully confident in the results of the test.
assert(result.except(expectedData).isEmpty)

To summarize, these are the steps to follow for Spark application component testing:
All together, the test code should appear like this:
import com.eb.data.batch.{SampleValues => SV}
import com.eb.data.batch.{SampleFiltersValues => SFV}
import com.eb.data.batch.{SampleSinks => Sinks}
import com.eb.data.batch.config.DatabaseConfig
import com.eb.data.batch.io.df.DataFrame.DataFrameMethods
import com.eb.data.batch.DataFrameGenerator
import com.eb.data.batch.FinalColumnNames
import com.eb.data.batch.SourceColumnNames
import com.eb.data.batch.SparkTestConfig
import com.eb.data.batch.UDFUtils
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.mockito.ArgumentCaptor
import org.mockito.ArgumentMatchers.any
import org.mockito.ArgumentMatchers.anyString
import org.mockito.ArgumentMatchers.isNull
import org.mockito.MockitoSugar
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.springframework.boot.DefaultApplicationArguments
import java.time.LocalDate
import java.time.LocalTime
import java.time.ZoneId
import java.time.ZonedDateTime
class MyBatchTest extends AnyFlatSpec with Matchers with MockitoSugar {

  val spark = SparkSession.builder
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  val testData = DataFrameGenerator.generateSampleDataFrame(
    SV.Year,
    SV.Month,
    SV.Day,
    Seq(
      (SV.St1, Sinks.Q, SFV.LangEngScopeDesktop),
      (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
      (SV.St1, Sinks.C, SFV.LangEngScopeDesktop),
      [...]
    ).toDF(
      SourceColumnNames.Instance,
      SourceColumnNames.Sink,
      SourceColumnNames.Filters
    )
  )

  val emptyMongoInstanceDFs = Map[WriteConfig, DataFrame]()

  "My batch" should "work" in {
    val dfMethodsMock = mock[DataFrameMethods]
    val dataFrameCaptor = ArgumentCaptor.forClass(classOf[DataFrame])

    val myBatch = new MyBatch(
      SparkTestConfig.getSparkConfig,
      SparkTestConfig.getInputConfig
    )
    val spyBatch = spy(myBatch)

    doReturn(testData)
      .when(spyBatch)
      .readParquet(
        any[SparkSession],
        ArgumentMatchers.eq("inputPath"),
        ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-01"), LocalTime.MIN, ZoneId.of("UTC"))),
        ArgumentMatchers.eq(ZonedDateTime.of(LocalDate.parse("2022-01-02"), LocalTime.MIN, ZoneId.of("UTC")))
      )
    doNothing
      .when(dfMethodsMock)
      .save(ArgumentMatchers.eq("outputPath"), ArgumentMatchers.eq(Seq("yyyy", "mm")))
    doReturn(emptyMongoInstanceDFs)
      .when(spyBatch)
      .generateInstanceDFs(any[DataFrame], isNull[String], any[DatabaseConfig])

    spyBatch.run(new DefaultApplicationArguments(""))

    verify(spyBatch)
      .generateInstanceDFs(
        dataFrameCaptor.capture(),
        any(),
        any[DatabaseConfig]
      )
    val result: DataFrame = dataFrameCaptor.getValue.asInstanceOf[DataFrame].cache()

    val expectedData = Seq(
      (SV.St1, SFV.EmptyFilterStr, 1, 1, 3),
      (SV.St1, SFV.LangEngStr, 1, 2, 1),
      [...]
    ).toDF(
      FinalColumnNames.Instance,
      FinalColumnNames.Filters,
      FinalColumnNames.QueryCount
    )

    assert(10 === result.count())
    assert(expectedData.except(result).isEmpty)
    assert(result.except(expectedData).isEmpty)
  }
}
In a follow-up blog post, we will explore how to improve the efficiency of the Spark session creation and reduce the execution time when performing a battery of tests during the same execution. For now, we hope this serves as a helpful guide to performing component testing in Spark. As always, if you have any questions or comments, please feel free to reach out!
In recent years, improvements in results from AI models have come from two different sources.
On one hand, the improvement in hardware capabilities in terms of raw computational power has enabled larger models to be trained. As a result, AI models have been able to get bigger and bigger (i.e. have more and more parameters). This increase in parameters has also made these models more precise, but besides hardware and bigger models there is another component: a way to train them. To do so, enormous datasets are required; as parameters increase, dataset size also has to increase accordingly. For example, DALL·E 2, one of the biggest models in the text-to-image model space with about 3.5 billion parameters, was trained on a dataset of 650 million image-caption pairs obtained from multiple sources all over the internet. (Dataset creation is a whole other issue, so we will not delve further into that topic.)
On the other hand, there have been architectural improvements of the results obtained by these huge text-to-image models, which can’t be achieved by brute force alone. In recent years, with the breakthrough of the so-called diffusion models, there has been an enormous improvement in the results produced by new AI models even without increasing the parameters of said models (e.g. DALL·E 2 has fewer parameters compared to its previous version, but performs much better).
Even though research and development of new models is continual, 2022 has seen explosive growth in the Image Generation field. Where before there were small improvements on a year-to-year basis, this year has brought great improvements on a monthly, weekly, and even daily basis. In fact, while writing this article, Version 2 of Stable Diffusion was just released on November 24. While this article will not cover the latest version of Stable Diffusion, all ways of working related to this model that will be covered can most likely be applied to the latest version.
The first step to begin using Image Generation for personal use is to decide which model to use. As of now, there are three main models for image generation: Stable Diffusion, DALL·E 2 and Midjourney. They are not the only options; there are others, like NovelAI or Jasper Art, but this post will focus on the main three.
The five key factors to ponder when deciding which model to use are laid out in the table below:
After deciding which model to use, the next step is learning how to use it. DALL·E 2 can be used on the OpenAI website. To use Midjourney, you need to join their Discord server, which requires a Discord account.
Finally, for Stable Diffusion there are three options:
Regardless of the option chosen for running Stable Diffusion, I recommend watching this YouTube video (in Spanish), where all the concepts are explained in more depth, as well as exploring the subreddit of the Stable Diffusion community to look for prompts and updates on community tooling. Another place to look for prompts is Lexica Art.
Now that it is clear what each of these models offers and how to begin using them, it is time for a visual comparison between the images that they generate. As part of this section, a direct comparison using the same subset of prompts is shown below. (Please note that the images shown are not mine nor Empathy.co’s. You can view the originals here.)
As shown in the images above, all three models provide high-quality results, even though there are some clear differences between them. Stable Diffusion has the most inconsistent results style-wise, sometimes producing realistic results like the snow-covered mountain, while other prompts receive a painting-like result, like the Cherry Blossom image.
DALL·E 2 and Midjourney are much more consistent in their respective styles, with DALL·E 2 being the most consistent and having a style that is more suited to enterprise use. Midjourney’s style leans more towards conceptual art-like results but the quality is a bit less consistent in comparison with the results from DALL·E 2 (e.g. its Cherry Blossom image vs. the others).
These powerful image generation models present new workflows and applications for art. For example, they can be used to generate icon-like images for programs, generate art for characters and places in tabletop role-playing games, transform video recordings, be included in current workflows for artists and graphic designers, and the list goes on.
There is an important topic to consider regarding Image Generation models (and pretty much all AI-related models): the ethical concerns that arise from these new technologies.
The conversation is mounting around how these models were trained. As noted in the Introduction, all of them were trained using huge image datasets, and the images were obtained from public sources, mainly the internet. The problem is that part of the dataset is formed by images made by artists, including copyrighted ones. That, plus the fact that these AIs (particularly Stable Diffusion) can be guided toward certain styles, means some generated images bear a clear similarity to particular artists' styles. Artists from all over the world have brought up the topic of plagiarism. Are the images generated by these AIs plagiarised? Is it right to use these images? The discussion is ongoing.
Another related concern is that if images can be described and generated in a matter of seconds, artists will have fewer commissions and receive less work. This is understandably concerning for visual artists, and not to be taken lightly. Controversy also arose after an AI-generated image won first place in Colorado State Fair’s fine arts competition. The decision was met with backlash and claims that the winner cheated, and regulations banning AI submissions in those kinds of competitions have been requested.
Privacy is also a matter that needs to be attended to, as a report from Ars Technica claimed that the training dataset used by Stable Diffusion contained hundreds or even thousands of private medical records. While the responsibility of filtering these images from the dataset is not clear, LAION, the company behind the dataset used by Stable Diffusion, says that they do not host these images on their servers and that filtering sensitive information is the responsibility of the companies that create these AI models. However, LAION provides the URLs for the images used in their datasets, so they are also responsible in part.
That said, enjoy your journey of creating AI-generated images while respecting the privacy and intellectual property of others!
In this post, an analysis of Progressive Delivery options in the Cloud Native landscape will be done to explore how this enhancement can be added in a Kubernetes environment. The most embraced tools from the Cloud Native landscape will be analyzed (Argo Rollouts and Flagger), along with some takeaways and the selection that best fits Empathy.co’s needs.
The native Kubernetes Deployment Object supports the Rolling Update strategy, which provides basic guarantees during an update but also comes with limitations:
For the reasons above, in a complex production environment, this Rolling Update could be risky because it doesn't provide control over the blast radius, may roll out too aggressively and there is no rollback automation in case of failure.
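For reference, these are the only knobs the native strategy exposes. A minimal sketch of a Deployment using them (names and values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 extra Pod may be created during the update
      maxUnavailable: 0  # never drop below the desired replica count
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:1.1.0
```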
Although there are multiple tools that offer multiple deployment capabilities, the minimum requirements would be:
Native Kubernetes Deployment Objects:
The aspirational goal is to add extra deployment capabilities to the current Kubernetes cluster and, therefore, increase the agility and confidence of application teams by reducing the risk of outages when deploying new releases.
The main benefits would be:
A Blue/Green deployment has both the new and old versions of the application deployed at the same time. During this time, only the old version of the application receives production traffic. This allows the developers to run tests against the new version before switching the live traffic to it.
A Canary deployment exposes a subset of users to the new version of the application while serving the rest of the traffic to the old version. Once the new version is verified as being correct, it can gradually replace the old version. Ingress controllers and service meshes, such as NGINX and Istio, enable more sophisticated traffic shaping patterns for canarying than what is natively available (e.g. achieving very fine-grained traffic splitting, or splitting based on HTTP headers).
The picture above shows a Canary with two stages (25% and 75% of traffic goes to the new version), but this is just an example. Argo Rollouts allows multiple stages and percentages of traffic to be defined for each use case.
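Sketched as an Argo Rollouts manifest, a two-stage Canary like the one described could look roughly like this (a hedged sketch based on the Rollout CRD; analysis and traffic routing are omitted, and all names and durations are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 4
  strategy:
    canary:
      steps:
        - setWeight: 25          # send 25% of traffic to the new version
        - pause: {duration: 10m}
        - setWeight: 75          # then 75%
        - pause: {duration: 10m} # the rollout completes at 100% after the last step
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:1.1.0
```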
The two major projects in this space are Argo Rollouts and Flagger. Both projects are mature and widely used.
Argo Rollouts is a Kubernetes Controller and set of CRDs which provide advanced deployment capabilities such as Blue/Green, Canary, Canary analysis, experimentation and progressive delivery features to Kubernetes. A UI is deployed to see the different Rollouts.
Two kinds of rollouts:
Argo Rollouts offers experiments that allow users to have ephemeral runs of one or more ReplicaSets and run AnalysisRuns along those ReplicaSets to confirm everything is running as expected. Some use cases of experiments could be:
A/B Testing could be performed using Argo Rollouts experiments.
There are several ways to perform analysis to drive progressive delivery.
Traffic Management
Observability
Migration
Pain Points:
Flagger is part of the Flux family of GitOps tools. Flagger is pretty similar to Argo Rollouts and its main highlights are:
When you create a deployment, Flagger generates duplicate resources of your app (including configmaps and secrets). It creates Kubernetes objects with <targetRef.name>-primary and a service endpoint to the primary deployment.
It employs the same concepts about Canary, Blue/Green and A/B Testing as Argo Rollouts does.
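As a rough sketch of how this looks in practice (field names follow the Flagger Canary CRD; all values are illustrative), a Canary resource pointing at an existing Deployment could be:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app        # Flagger generates my-app-primary from this Deployment
  service:
    port: 80
  analysis:
    interval: 1m        # how often metric checks run
    threshold: 5        # failed checks before automatic rollback
    maxWeight: 50       # maximum traffic sent to the canary
    stepWeight: 10      # traffic increase per interval
```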
Observability
Pain Points:
You have to rely on `kubectl logs -f flagger-controller` or on `kubectl describe canary` in order to check the progress.

(A feature-by-feature comparison of Argo Rollouts and Flagger appeared here.)
Both tools will help us achieve alternative deployment strategies, though there are some tradeoffs related to each tool:
(Pros and cons of each tool were listed here.)
At Empathy, the tool we have chosen is Argo Rollouts. It fits our needs pretty well, offers faster feedback, has great integration with ArgoCD, and is open to more complex strategies.
Thinking about this idea, I wondered how to create a no-code editor for a web application. But, since a tool like this would be huge for a single post, I decided to focus only on the personalization of the styles and themes.
So, I chose to rely on one of the most popular CSS frameworks at the moment: Tailwind. Not for its usual use, but for all the tooling it offers around configuration and CSS generation.
The idea is to create a frontend interface which allows the Tailwind configuration to be modified in real time and shows the result styles applied. Then, this customized configuration could be stored and used in the build and deployment process of a hypothetical application.
However, in this article, we are going to focus only on the editor and how to achieve a real-time preview of the Tailwind config changes.
To do so, we are going to create a simple service in Node using ExpressJs. This service will receive the Tailwind configuration from the frontend editor and run PostCSS with the Tailwind plugin to generate the CSS. Finally, the service will return the generated CSS to the editor, which will update the page to show the changes.
We could try to run PostCSS and the Tailwind plugin directly in the browser, making them work with node polyfills; that is how Tailwind Play does it internally. Another option is to use WebContainers, but for simplicity's sake, we are going to run them in a simple node service.
Let’s create a new project called tailwind-editor with Vite running npm create. I’m going to use Vue for the frontend because I’m more comfortable with it and also because it is awesome 😉.
$ dev npm create vite@latest
✔ Project name: … tailwind-editor
✔ Select a framework: › Vue
✔ Select a variant: › JavaScript
Then, add the dependencies for the service.
$ cd tailwind-editor
$ npm install --save express cors postcss tailwindcss
The resulting package:
{
"name": "tailwind-editor",
"private": true,
"version": "0.0.0",
"type": "module",
"scripts": {
"dev": "vite",
"build": "vite build",
"preview": "vite preview"
},
"dependencies": {
"cors": "^2.8.5",
"express": "^4.18.1",
"postcss": "^8.4.17",
"tailwindcss": "^3.1.8",
"vue": "^3.2.37"
},
"devDependencies": {
"@vitejs/plugin-vue": "^3.1.0",
"vite": "^3.1.0"
}
}
Now, we are going to create the service that will receive the Tailwind config and return the resulting CSS.
Let’s start with the file src/tailwind-as-a-service.js, which will contain the Express server with the cors middleware to support cross-origin calls. It listens on port 8080 and returns the text Hello World for any GET request to the root path.
// src/tailwind-as-a-service.js
import express from 'express';
import cors from 'cors';
const app = express()
app.use(cors());
const port = 8080
app.get('/', (req, res) => {
res.send('Hello World!')
})
app.listen(port, () => {
console.log(`Tailwind as a service listening on port ${port}`)
})
Running the server in node lets you check the response directly in the browser:
$ node ./src/tailwind-as-a-service.js
Since we are using Vite and ES modules, the minimum node version required to follow this article is 14.18+.
So far, so good. Now we are going to configure postcss and its Tailwind plugin to return CSS:
// src/tailwind-as-a-service.js
......
import postcss from 'postcss';
import tailwindcss from 'tailwindcss';
......
const defaultCss = `
@import 'tailwindcss/base';
@import 'tailwindcss/components';
@import 'tailwindcss/utilities';
`;
app.get('/', async (req, res) => {
const configuredTailwind = tailwindcss({
content: [{ raw: '<div class="bg-red-500">', extension: 'html' }]
});
const postcssProcessor = postcss([configuredTailwind]);
const { css } = await postcssProcessor.process(defaultCss);
res.send(css);
});
......
We just added the postcss and tailwindcss dependencies. Next, we have to configure the Tailwind plugin for PostCSS with the content option.
This option tells Tailwind to inspect the HTML, JavaScript components, and other files to look for CSS class names, generating and including their CSS in the final result. It also allows us to write raw HTML inline.
After that, it is time to create a postcssProcessor with the configured Tailwind plugin, which is responsible for parsing the CSS and applying all PostCSS plugins.
Finally, we process a “fake” CSS file with the default base, components and utility styles of Tailwind. This is necessary to make Tailwind generate all necessary CSS.
The resulting CSS is returned in the response. So, if we run the service with node ./src/tailwind-as-a-service.js again and request it from the browser, the resulting CSS will be shown:

Here, you can see the base CSS Tailwind provides by default and also the .bg-red-500 class at the end that we are passing as raw HTML to the Tailwind config.
So, we have a service to request and return the CSS, but how do we configure that CSS? Let’s make this service receive parameters and use them to configure the Tailwind plugin:
// src/tailwind-as-a-service.js
......
const defaultCss = `
@import 'tailwindcss/base';
@import 'tailwindcss/components';
@import 'tailwindcss/utilities';
`;
app.use(express.json()); // parse JSON request bodies so req.body is populated
app.post('/', async (req, res) => {
const configuredTailwind = tailwindcss({
content: [{ raw: req.body.html, extension: 'html' }],
theme: req.body.theme
});
const postcssProcessor = postcss([configuredTailwind]);
const { css } = await postcssProcessor.process(defaultCss);
res.send(css);
});
......
Here, we changed the .get method into .post so we can send and receive larger parameters in the request body.
Moreover, we pick html and theme parameters from the request body and use them to configure Tailwind.
For simplicity, we are using only the theme part of the Tailwind configuration, but this approach allows any part of it to be configured.

We are going to define some custom configurations for the Tailwind theme, which users can then modify through an interface. Below are the defined values for a simple example of components: buttons and titles. To keep the scope small, for this example we only allow some colors and font properties to be changed:
// src/custom-tailwind-config.js
export const customTailwindConfig = {
colors: {
primary: {
25: '#cdd3d6',
50: '#243d48',
75: '#1b2d36'
},
secondary: {
25: '#bfe1ec',
50: '#0086b2',
75: '#006485'
},
success: {
25: '#ecfdf5',
50: '#10b981',
75: '#065f46'
},
warning: {
25: '#fffbeb',
50: '#f59e0b',
75: '#92400e'
},
error: {
25: '#fef2f2',
50: '#ef4444',
75: '#991b1b'
},
title: {
1: '#000',
2: '#a3a3a3',
3: '#000',
4: '#0e7490'
}
},
fontSize: {
button: '1rem',
'size-title1': '2rem',
'size-title2': '1.5rem',
'size-title3': '1.25rem',
'size-title4': '1.125rem'
},
fontWeight: {
'weight-button': '400',
'weight-title1': '700',
'weight-title2': '700',
'weight-title3': '400',
'weight-title4': '400'
}
};
Keep in mind that we are using the Tailwind theme configuration for the sake of simplicity – it is not the only way to achieve the same result. The whole Tailwind configuration could be overridden, including the plugins used and/or their configuration. For example, you could create your own Tailwind plugin and add all your CSS components based on a configuration passed to the plugin. This configuration could be passed as a parameter to the Tailwind service, as we are doing here with the theme configuration.
The next step is to go to the App.vue component and remove all default content, then add some buttons and titles using the CSS utility classes that Tailwind generates with the theme configuration that was just defined:
// src/App.vue
<template>
<section class="flex flex-col gap-10 min-w-[200px] m-10">
<section class="flex flex-col gap-10">
<button class="w-40 h-8 rounded bg-primary-50 hover:bg-primary-75 text-primary-25 hover:text-primary-25 text-button font-weight-button">
Button Primary
</button>
<button class="w-40 h-8 rounded bg-secondary-50 hover:bg-secondary-75 text-secondary-25 hover:text-secondary-25 text-button font-weight-button">
Button Secondary
</button>
<button class="w-40 h-8 rounded bg-success-50 hover:bg-success-75 text-success-25 hover:text-success-25 text-button font-weight-button">
Button Success
</button>
<button class="w-40 h-8 rounded bg-warning-50 hover:bg-warning-75 text-warning-25 hover:text-warning-25 text-button font-weight-button">
Button Warning
</button>
<button class="w-40 h-8 rounded bg-error-50 hover:bg-error-75 text-error-25 hover:text-error-25 text-button font-weight-button">
Button Error
</button>
</section>
<section class="flex flex-col gap-10 m-10">
<h1 class="text-title-1 text-size-title1 font-weight-title1">
Title 1
</h1>
<h2 class="text-title-2 text-size-title2 font-weight-title2">
Title 2
</h2>
<h3 class="text-title-3 text-size-title3 font-weight-title3">
Title 3
</h3>
<h4 class="text-title-4 text-size-title4 font-weight-title4">
Title 4
</h4>
</section>
</section>
</template>
Before running the dev script to serve the Vue project, we need to modify this script in package.json to parallelize its execution with the Tailwind service:
"dev": "node ./src/tailwind-as-a-service.js & vite",Afterwards, we run the project and visit the locally-served URL:
$ npm run dev
We are using the latest version of Vite, so the default port is 5173:

Finally, we open that URL in the browser and… oops! No styles?! What’s going on? 🤯
The styles are not being applied because we are not using Tailwind directly in our Vue project as we normally would. Instead, we have to call the tailwind-as-a-service endpoint to retrieve the CSS. Okay then, let’s create the function to call the service.
We create a new file fetch-css.js in the src directory:
// src/fetch-css.js
export async function fetchCss(tailwindCustomConfig) {
return await fetch('http://localhost:8080', {
method: 'POST',
headers: { 'content-type': 'application/json' },
body: JSON.stringify({
html: document.body.innerHTML,
theme: {
extend: tailwindCustomConfig
}
})
}).then(response => response.text());
}
This async function receives the custom Tailwind config and fetches the Tailwind service, passing it as the theme: { extend: tailwindCustomConfig } parameter. We pass it inside extend to keep all the default utilities Tailwind has and only add the new ones we need. We also obtain all the current HTML on the page and send it to the service as the html parameter. Tailwind will use this HTML to know which CSS classes to generate and which to skip.
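To make the extend semantics concrete, here is a conceptual sketch (not Tailwind's actual merging code) of the difference between extending the theme and replacing it:

```javascript
// Conceptual sketch only — illustrates why custom values go under "extend".
const defaultTheme = { colors: { red: '#ef4444' }, fontSize: { base: '1rem' } };
const custom = { colors: { primary: '#243d48' } };

// "extend" keys are merged on top of the defaults, key by key...
const extended = {
  ...defaultTheme,
  colors: { ...defaultTheme.colors, ...custom.colors }
};

// ...whereas putting the same object directly under "theme" replaces the
// whole "colors" section, dropping every default color.
const replaced = { ...defaultTheme, colors: custom.colors };

console.log(Object.keys(extended.colors)); // [ 'red', 'primary' ]
console.log(Object.keys(replaced.colors)); // [ 'primary' ]
```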
The following step is to create a new component Editor.vue to use that function:
// src/components/Editor.vue
<script setup>
import { onMounted, ref } from 'vue';
import { customTailwindConfig } from '../custom-tailwind-config.js';
import { fetchCss } from '../fetch-css.js';
const css = ref('');
async function getCss() {
css.value = await fetchCss(customTailwindConfig);
}
onMounted(getCss);
</script>
<template>
<component is="style">{{ css }}</component>
</template>There are many things going on here. Let’s take a closer look:
- Import the `customTailwindConfig` and the `fetchCss` function.
- Create a `css` ref. If you are not already familiar with it, check out the new Vue Composition API documentation.
- Define an async function `getCss`, which calls `fetchCss` and assigns the returned promise value to the ref value.
- Use the `onMounted` Vue lifecycle hook to call the previous function whenever the component is mounted.
- Render the CSS inside a `<style>` tag, when the `css` ref is updated. That way, whenever we update the `css` ref value, the styles will be updated.

We use a dynamic component as a workaround because, in the Vue template compiler, the `<style>` tag is not allowed inside the `<template>` tag.
Now, we import and use the Editor.vue component inside the App.vue:
// src/App.vue
<template>
<div class="flex flex-row">
......
<Editor/>
</div>
</template>
<script setup>
import Editor from './components/Editor.vue';
</script>
Ready to see the styles? Reload the URL and they will appear:

Finally, the last step is to make the Editor.vue modify the default theme from the Tailwind configuration and request the CSS again from the service to see a live view of the changes.
We are starting with the colors, adding a color picker for each color we want to configure.
// src/components/Editor.vue
<script setup>
import { onMounted, reactive, ref, watch } from 'vue';
......
const css = ref('');
const editableCustomConfig = reactive(customTailwindConfig);
async function getCss() {
css.value = await fetchCss(editableCustomConfig);
}
onMounted(getCss);
watch(editableCustomConfig, getCss);
</script>
<template>
<div class="flex flex-col flex-nowrap gap-10 w-1/2 m-10">
<h2 class="font-bold">Colors</h2>
<section class="flex flex-row flex-wrap gap-10">
<div v-for="(color, colorName) in editableCustomConfig.colors"
style="display: flex; flex-flow: column nowrap;">
<label v-for="(_, shadeName) in color">{{ colorName }} {{ shadeName }}
<input type="color" v-model.lazy="color[shadeName]">
</label>
</div>
</section>
</div>
<component is="style">{{ css }}</component>
</template>
In this step, we are making several changes to be able to modify the configuration reactively:
In the <script>:
- Create a reactive object with `customTailwindConfig` as the initial value.
- Use `watch` to call the `getCss` function whenever this reactive object changes.

In the `<template>`:
- Use `v-for` to iterate over each color and each shade, binding a color picker to the value.
- Use the `v-model` directive with the `lazy` modifier so that not too many requests are made whenever we move the selector over the color picker.

Notice, we are binding the color shade to the `v-model` using the color and the shade name, instead of using the `v-for` variable directly. This is because the variable used to iterate in `v-for` loops cannot be modified; the workaround lets us access the value indirectly.
If we run the application again, we can see the color pickers. Now, by changing a color, the component using that color will be updated automatically:

Finally, we configure font size and font weight:
// src/components/Editor.vue
<script setup>
......
</script>
<template>
......
<h2 class="font-bold">Font Sizes</h2>
<section class="flex flex-row flex-wrap gap-10">
<label v-for="(_, sizeName) in editableCustomConfig.fontSize">{{
sizeName.replace('size-', '')
}}
<input type="number"
step="0.125"
class="w-14 border border-black text-center"
:value="editableCustomConfig.fontSize[sizeName].replace('rem','')"
@input="event=> editableCustomConfig.fontSize[sizeName] = event.target.value + 'rem'">rem
</label>
</section>
<h2 class="font-bold">Font Weight</h2>
<section class="flex flex-row flex-wrap gap-10">
<label v-for="(_, weightName) in editableCustomConfig.fontWeight">{{
weightName.replace('weight-', '')
}}
<input type="number"
step="100"
min="100"
max="900"
class="w-14 border border-black text-center"
v-model="editableCustomConfig.fontWeight[weightName]">
</label>
</section>
</div>
<component is="style">{{ css }}</component>
</template>
Here, we are repeating the same principle used for the colors, but with a single v-for for each case. Moreover, in the case of font size, the rem unit has to be added and removed before the value is passed to the Tailwind configuration.
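The rem handling can be made explicit with two tiny helpers (hypothetical names, shown only for illustration; the inline handlers in the template above do the same thing):

```javascript
// Hypothetical helpers for the rem juggling described above.
function remToNumber(value) {
  // '1.25rem' -> 1.25, so the <input type="number"> can display it
  return Number(value.replace('rem', ''));
}

function numberToRem(value) {
  // 1.5 -> '1.5rem', the format the Tailwind fontSize config expects
  return `${value}rem`;
}

console.log(remToNumber('1.25rem')); // 1.25
console.log(numberToRem(1.5));      // 1.5rem
```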
Alright, the moment you have been waiting for has arrived! This is what the editor looks like:

This editor is the starting point for creating your own no-code tool that allows you to configure your project and see the changes on the fly. Remember that using theme values is not the only way to make this configurable; you can also use all the Tailwind config options.
Sure, we could achieve exactly the same result by using CSS variables as values in the Tailwind theme configuration. Modifying the value of these variables on the frontend would remove the need to use any additional service or to adapt the Tailwind process on the fly. Then, after saving these variables and loading them in production, the changes would be deployed.
But there’s a reason, promise!
In this example, we are only modifying the theme part of the Tailwind configuration, but there are plenty more options in that configuration.
Imagine you created a Tailwind plugin which adds your own Design Components and CSS utilities. These plugins can have options, too.
So, with the solution laid out here, you can modify all these possible configurations and options and see the results immediately.
Also, you can use the service created to directly save the configuration in your database (by user or customer) to later retrieve it during the deployment process or for any other purpose you need.
Ready to get started? All the working code is available here. Please, feel free to open issues and give feedback. Also, if you find it useful, a star would be much appreciated 😉.
As time goes on, lots of things become outdated sooner or later, and documentation is no exception. But when documentation becomes outdated, it automatically loses its essence – to be consulted for valuable information. In modern software development, where Application teams are continuously adding features to their services and Platform teams are continuously evolving the platform and infrastructure to accommodate all the workloads, it is extremely easy for documentation to get out of date in just a couple of weeks.
This is something the Empathy.co Platform Engineering team experienced firsthand when we realized there was an issue with how Development teams learn how to work with the platform provided for them. We noticed that there wasn't a clear path for everyone to follow when using the platform tools. While we had some documentation available that we felt was sufficient, we realized it wasn't as complete or as maintainable as we thought. So, we had to find a solution.
The following items were identified as the main pain points of the problem:
There are several types of documentation, and not all of them should be handled the same way, nor do they have the same lifecycle. Assuming that not all the roles at the company have the same skillset, like fluent writing of Markdown documents and working with Git, we concentrated our efforts specifically on technical documentation. That way, the skillset of those interacting with the documentation was more clearly defined.
More specifically, we focused on the documentation about the Internal Development Platform (IDP, for short) that Platform Engineering needs to share with the Application teams, like CI/CD processes, monitoring, logging, etc. Since the nature of the IDP is to help developers by using certain tools and practices, we like to refer to this documentation as Developer Workflow (DW).
Aiming to centralize the technical documentation in a single place and tool, we decided to deploy Backstage to the platform. Backstage is a tool developed by Spotify that, among other features, enables the creation of documentation sites out of the Markdown documentation stored in a Git repository, using MkDocs.
Hosting the documentation in Git repositories also provides the benefits of using Git, like versioning, tracking, code reviewing, etc.
Unfortunately, there's no tool or magic wand that keeps the documentation up to date. It is a continuous effort, so it should be included in teams' ways of working. From our experience, it is very helpful to take into consideration the time it will take to create or update the documentation when estimating a given task. A task shouldn't be marked as finished if the documentation hasn't been reviewed and updated.
In this specific case, it was hard work reviewing all the documents that needed to be updated, created, or deleted accordingly. It is always better to keep the documentation up to date by making small updates frequently, rather than updating a large amount of content infrequently.
As stated above, one of the main pain points of the previous solution was that the stakeholders weren't aware of where to find certain articles within the documentation. Adding Backstage helped centralize all the technical documentation into a single place, but the problem wasn't yet solved. Backstage was still unfamiliar to some people, mostly those who recently joined the company. We also received feedback about the search experience not working as well as we had hoped, so we decided to create a separate site to host only the Developer Workflow documentation and reduce all the surrounding elements that could prevent users from finding the necessary documents.
We didn't want to split the documentation again, so we decided to keep it in both Backstage and the standalone website, using the exact same source of truth for both to avoid duplicating the content. This part of the solution will be explored more in depth later.
Documentation evolves over time, and as part of that evolution, things are likely to break. A few of the issues that can occur are:
All of these items were successfully addressed just by using a few MkDocs plugins, which will be explained in more detail in the next section.
Not only does the content have to be maintainable, so does the overall process. In order to account for this in the design, it is key to keep the logic as simple as possible (following the KISS Principle), so teams managing the process can easily update the documentation, keep the process up to date, and evolve the whole solution accordingly. Keeping it as simple as possible also makes it easier to transfer ownership to another team and accelerate the onboarding process of new team members, so they understand the workflow and how to support the solution.
Now that the context and requirements are clear, it is time to look at how to implement them into the solution. It mainly consists of a series of agreements in the way of working, in conjunction with a series of best practices and MkDocs plugins.
Since Backstage aggregates all the MkDocs documentation sites from several Git repositories into a single documentation portal, we decided to treat the Developer Workflow documentation as another component in Backstage. This repository also contains the Terraform code to build a simple CloudFront + S3 static web hosting setup and a simple GitHub Action to build the DW standalone website. This enabled us to have the exact same documentation in two places, fed by the same source.
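As an idea of how small that build step can be, here is a hypothetical sketch of such a GitHub Action (the real workflow, bucket name, and distribution ID are not shown in this post):

```yaml
name: build-dw-site
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: pip install mkdocs-material mkdocs-awesome-pages-plugin mkdocs-macros-plugin
      - run: mkdocs build   # outputs the static site into ./site
      - run: aws s3 sync ./site s3://my-dw-bucket --delete
      - run: aws cloudfront create-invalidation --distribution-id MY_DIST_ID --paths "/*"
```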

All documentation structure, navigation bars, and page-linking related items were addressed using MkDocs theme features and additional MkDocs plugins.
Since Backstage uses a slightly modified version of MkDocs Material Theme, we had to explore what theme features were compatible with Backstage and used the following:
mkdocs.yaml
theme:
name: material
features:
- navigation.sections
- navigation.expand
- navigation.indexes
- navigation.top
The docs folder is scanned to automatically generate the navigation sidebar. This prevents an out-of-date sidebar, as it is automatically rendered based on the content.

In addition to the MkDocs Material theme features, we also used the following MkDocs plugins. Note that to make them work both in Backstage and standalone, the plugins must be installed in Backstage, as well.
Note that the order in which the plugins are listed is the order in which they are executed when building the documentation site. This is important to keep in mind when adding or removing them, as the results may vary.
mkdocs.yaml
plugins:
  - techdocs-core
  - search
  - awesome-pages
  - macros
  - glightbox:
      touchNavigation: false
      loop: false
      effect: zoom
      width: 100%
      height: auto
      zoomable: true
      draggable: false
  - webcontext:
      context: catalog/default/component/developer-workflow
  - alias
The awesome-pages plugin allows adding a .pages file inside a folder to tune folder-specific settings like navigation, sorting, and hiding, among other things. We mostly use it to customize section names, which by default match the folder names. For example, a folder called docs/04-ci-cd could be tuned to be rendered as CI/CD - Build and Deploy just by adding a .pages file with the following content:

.pages
title: CI/CD - Build and Deploy
This makes it possible to use the name of the folder for sorting, but with a more user-friendly name for the sections.
For example, let's say there's a reference to a team email address in several pages of the documentation. If the email address changes for some reason, it would have to be updated on all the pages where it appears. With the MkDocs macros plugin, it can be set as a variable in the extra section inside mkdocs.yaml and be expanded as many times on as many pages as needed.
mkdocs.yaml
extra:
  team_email_address: [email protected]
  slack_handler: "@myteam"

contact-info.md

Contact us:
- Email: **{{ team_email_address }}**
- Slack Handler: **{{ slack_handler }}**

That's just a simple example, but it can be extended to other uses. For example, it could be used to type all the links to the tools of the Internal Development Platform once and then refer to those links from multiple pages. Additionally, it could be used to generate a page with all the links, for faster access.
mkdocs.yaml
extra:
  tool_urls:
    tool_one:
      test: https://tool-one.test.company.org
      stage: https://tool-one.stage.company.org
      prod: https://tool-one.prod.company.org
    tool_two:
      test: https://tool-two.test.company.org
      stage: https://tool-two.stage.company.org
      prod: https://tool-two.prod.company.org

tools.md
{% for tool in (tool_urls | sort) %}
### **{{ tool | replace("_"," ") | upper }}**
{% for environment in (tool_urls[tool] | sort) %}
- **{{ environment | replace("_"," ") | upper }}**: {{ tool_urls[tool][environment] }}
{% endfor %}
{% endfor %}

The rendering of the above would look like this:

When served through Backstage, pages live under ${BACKSTAGE_URL}/catalog/default/component/${COMPONENT_NAME}, which would cause links between pages to break on one of the websites. The webcontext plugin is configured to match the Backstage context and is dynamically set to / as part of the automation to deploy the standalone website.

Lastly, let's look at all of the Markdown extensions used. All of them are included in the techdocs-core plugin. Almost all of them use the default behaviour; we just set them explicitly in the mkdocs.yaml to avoid misbehavior if a default value changes at some point.
mkdocs.yaml
markdown_extensions:
  - admonition
  - pymdownx.highlight:
      # Enables linking to a specific line in a code block.
      anchor_linenums: true
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - pymdownx.superfences
  - attr_list
  - pymdownx.emoji:
      emoji_index: !!python/name:materialx.emoji.twemoji
      emoji_generator: !!python/name:materialx.emoji.to_svg
  - def_list
  - pymdownx.tasklist
  - pymdownx.details
  - footnotes
In order to support the solution with minimal effort, we decided to establish some practices around the Developer Workflow documentation. The following are some of the most important practices that should be checked as part of a code review:
Each section folder contains an index.md file with an overview of the whole section.

Working with Backstage is pretty difficult when it comes to previewing how rendered Markdown files will look. Even though there are various tools and extensions that enable previewing Markdown files from the IDE, they are not capable of rendering with the exact same plugins that MkDocs is going to use when building the site. On the other hand, working directly with MkDocs and knowing that all the plugins, features, and extensions are compatible with Backstage, it becomes quite simple to render the site on localhost, even with real-time updates. To do so, the official MkDocs Material Docker image is used as the base image, with the plugins added on top of it.
Dockerfile
FROM squidfunk/mkdocs-material:8.4.2

RUN python -m pip install --upgrade pip \
    && pip install mkdocs-alias-plugin==0.4.0 \
       mkdocs-awesome-pages-plugin==2.8.0 \
       mkdocs-macros-plugin==0.7.0 \
       mkdocs-techdocs-core==1.1.4 \
       mkdocs-webcontext-plugin==0.1.0 \
       mkdocs-glightbox==0.1.7

Docker Build and Run commands
# Docker Build
docker image build -t mkdocs-material-local .
# Docker Run
# Run this command inside the folder containing the mkdocs.yaml
docker container run --rm -it -p 8000:8000 -v ${PWD}:/docs mkdocs-material-local

The next step is to fire up a browser and navigate to http://localhost:8000. The rendered site should appear, and it should be automatically updated as the files are edited.
All roads come to an end and, in this case, the road ends when the Developer Workflow documentation site is deployed to a Production environment, where it can be accessed by the public. We use a simple setup of CloudFront + S3, using a Lambda@Edge function to authenticate access to the site with Google Workspace login, but that's out of the scope of this post. Since there are lots of different ways to host a static website, just choose the one that you are most comfortable with or that is best for your organization.
In order to keep the deployment as simple as possible, we just have a GitHub Actions workflow that performs four main steps:

1. Build the custom MkDocs Docker image.
2. Tweak the mkdocs.yaml file a bit for the standalone deployment; for example, the webcontext plugin.
3. Run the build command, which renders the website and generates the static site in a folder named site.
4. Sync the site folder to the static web hosting. In our case, this is the Amazon S3 bucket. Here, we also create a CloudFront invalidation for the distribution, to prevent users from waiting until the CloudFront cache expires.

GitHub Workflow: publish-website.yaml
name: Publish Website
on: [push]
jobs:
  cicd:
    runs-on: [self-hosted]
    # Omitted for readability
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Build docker images
        run: docker image build -t mkdocs-material-custom:local .
      - name: Parse mkdocs.yaml
        run: |
          # Install yq tool
          wget https://github.com/mikefarah/yq/releases/download/v4.21.1/yq_linux_386 -O ./yq && chmod +x ./yq
          # Use default webcontext and disable use_directory_urls to make it work with CloudFront
          ./yq '.plugins[select(. == "webcontext")].[].context = "/" | .use_directory_urls = false' -i mkdocs.yaml
      - name: Build mkdocs site
        run: docker container run -v $PWD:/docs mkdocs-material-custom:local build
      # Omitted for readability
      - name: Deploy
        if: github.ref == 'refs/heads/main'
        run: |
          aws s3 sync site s3://xxxxxx.company.org --delete --cache-control max-age=3600
          aws cloudfront create-invalidation --distribution-id XXXXXXXXXXXXXX --paths '/*'
The following are screenshots of the real result, so you can see how DW documentation looks both as a standalone MkDocs website and when integrated into Backstage.


CSS Grid Layout is a two-dimensional grid system. It is a CSS language mechanism created to place and distribute elements on a page, which is one of the most problematic processes in CSS. It is a standard, which means that you don't need anything special for the browser to be able to understand it. There are no limits to using it: wherever you can develop your CSS to define the style, you can use Grid Layout to apply a grid.
To get started you have to define a container element as a grid with display: grid, but before you do that, it is important to be familiar with certain concepts: the grid container, grid items, grid lines, grid tracks, grid cells, and grid areas.
There are four basic Grid properties you need to know, in order to make use of it easily and simply: display, columns and rows, fr unit, and gap. Let’s take a look at each of them to understand their role.
This value defines how the grid will be positioned relative to its surrounding content. There are two possible values:
The inline-grid value positions the grid inline, to the left or to the right of adjacent content, while the grid value positions the grid as a block, above or below the content.
These properties are used to create the grid by sizing the rows and columns.
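A minimal sketch of such a definition (the .container class name is illustrative):

```css
.container {
  display: grid;
  grid-template-columns: 200px 100px; /* two columns: 200px and 100px wide */
  grid-template-rows: 50px;           /* each row is 50px tall */
}
```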
In this example, the grid defined will have two columns. The first will have a size of 200px and the second one will measure 100px; both will have a row size of 50px.
Grid has a special sizing unit, fr (alongside pixels and percentages), which represents a fraction of the remaining space in the grid. Instead of using pixels in the example, let’s use fr.
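Swapping the pixel sizes for fr units might look like this (again, the class name is illustrative):

```css
.container {
  display: grid;
  grid-template-columns: 2fr 1fr; /* first column gets twice the free space */
  grid-template-rows: 50px;
}
```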
Now, the two columns will occupy all the available space, with the first one occupying double the amount of the second.
Sometimes, columns and/or rows need to have a gap between them. The column-gap and row-gap properties can be used, or simply condensed into gap, as shown below.
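A sketch of the condensed form (the 10px/20px values are illustrative):

```css
.container {
  display: grid;
  grid-template-columns: 2fr 1fr;
  gap: 10px 20px; /* shorthand for row-gap: 10px; column-gap: 20px; */
}
```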
Sometimes, when there are many elements that have to be positioned inside the page, it can be difficult to visualize what they will look like or the best way to do it. That is where generators come in handy. There are lots of tools available, but here is a selection of three useful grid generators (in no particular order):
This generator is an open source project that is very easy to use – perfect for beginners who are learning how to use Grid. With CSS Grid Generator, just specify the number of rows, columns, and gaps across rows and columns, and it will provide the proper CSS class ready to copy to your clipboard.
Like the previous one, LayoutIt is also an open source project. Things to note about this generator are that it allows code to be exported to CodePen and allows for resizing columns and rows using percentages (%), fractions (fr), and pixels (px).
From the information available in many articles, Griddy seems to be the generator that is most valued and used by developers, in general. Like the previous tool, it allows different resizing units to be used but is a little more difficult to use than CSS Grid Generator and LayoutIt.
The main difference between Flex and Grid is that Flex uses one dimension, while Grid uses two. This means that with Flex, the positioning of the elements can be defined on either the horizontal or the vertical axis. Grid, on the other hand, makes it possible to work in two dimensions, positioning cells both vertically and horizontally.
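The contrast can be sketched side by side (class names are illustrative):

```css
/* Flex: items flow along a single axis, here a row. */
.flex-container {
  display: flex;
  flex-direction: row;
}

/* Grid: items are placed on rows and columns at the same time. */
.grid-container {
  display: grid;
  grid-template-columns: 1fr 1fr;
  grid-template-rows: 1fr 1fr;
}
```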
Does this mean that Grid is better than Flex?
Nope! Both are good tools, but one will usually be more suitable than the other, depending on the situation. In any case, it is best to learn how to use both and combine them. Now, you have a guide to using Grid, so go ahead and get started!
We’ve all heard about Image Recognition and how it impacts our day-to-day activities. From the impressive automated driving software that some automobile companies have implemented in their vehicles, to the cool smartphone app that tells us what kind of bird we are staring at – most of us know what the concept of Image Recognition is, more or less. Of course, it might seem like a very complex field (and, well, it is actually tricky sometimes), but we can also do some interesting experiments that will show us how this is, indeed, a very promising area to try out.
Given how broad the field is, the experiment that we are going to perform and assess is focused on a smaller area within it: Color Recognition. The goal is to create an efficient process that can extract the dominant colors from any given picture.
This area is one of the most explored in computer vision, as it is much easier to collect and classify data, and it is pretty straightforward to test results. It is not difficult for an average person that does not experience colorblindness to see an image of a jacket and identify the main colors that are present on it; nor is it for a computer to do the same, with the appropriate solution.
Of course, nowadays there are already professional solutions on the market that tackle this problem in a very complete way, like the Google Vision AI project or the Amazon Rekognition service. But, in this case, we are going to build our own, homemade Color Recognition system with Python and some added libraries.
First of all, it is important to know what we want to obtain: a system that can take an image, process it, and return a set of the four main colors. Plus, we want to do it in the simplest and most efficient way possible.
Now that we have an overview of Color Recognition and a clear idea of what our objectives are, we can start with the implementation itself. To illustrate the experiment, let’s go through a guided, step-by-step example.
The first step of the experiment is to “read” the image. Use the Numpy and OpenCV libraries to translate the picture into a matrix of data corresponding to the colors in RGB code, assigning each pixel a value between 0 and 255.
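As a minimal sketch of this step, here is a NumPy-only version that uses a tiny synthetic image in place of a real photo. In practice you would load the picture with OpenCV's cv2.imread, which returns the matrix with channels in BGR order:

```python
import numpy as np

# Stand-in for cv2.imread("photo.jpg"): a 4x4 "photo" where every pixel
# is pure blue, stored as uint8 values between 0 and 255 in BGR order.
image_bgr = np.zeros((4, 4, 3), dtype=np.uint8)
image_bgr[..., 0] = 255  # channel 0 is Blue in OpenCV's BGR convention

# Reverse the channel axis to obtain the usual RGB ordering.
image_rgb = image_bgr[..., ::-1]

print(image_rgb.shape)  # (4, 4, 3): height x width x RGB
print(image_rgb[0, 0])  # [  0   0 255] -> blue as an (R, G, B) triple
```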
The matrix obtained will be similar to this one, which has been shortened for obvious reasons:

Next, it is time to resize the image using the OpenCV library. At a low level, this step means converting the big data matrix into a smaller one. Reducing the number of pixels in the image avoids introducing noise into the color processing phase.
Moreover, the fewer pixels there are in the final image, the less time the color recognition process will take. Of course, this has some drawbacks, the most significant one being that having fewer pixels can also affect the accuracy of the recognition. In order to find the best ratio for the resizing process without compromising that accuracy, iterate! Try several sizes, measuring the time taken for the whole process and checking the quality of the results for each one.
For this example, several sizes of the pixel matrix were tried (300, 200, 100, 50, 35, 25, and 10). The ideal size for the data matrix turned out to be 35, which is important to keep in mind for the next steps.
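The real pipeline resizes with OpenCV's cv2.resize; a crude stand-in using plain NumPy striding shows what the step does to the data matrix (the shrink helper is an illustration, not the article's code):

```python
import numpy as np

def shrink(image: np.ndarray, target: int) -> np.ndarray:
    """Crudely downsample by keeping every n-th pixel; in practice,
    cv2.resize(image, (target, target)) interpolates properly."""
    step_y = max(1, image.shape[0] // target)
    step_x = max(1, image.shape[1] // target)
    return image[::step_y, ::step_x]

# A random 300x300 RGB "photo" shrunk towards a 35-pixel-wide matrix.
image = np.random.randint(0, 256, size=(300, 300, 3), dtype=np.uint8)
small = shrink(image, 35)
print(image.shape, "->", small.shape)  # (300, 300, 3) -> (38, 38, 3)
```

Note that the striding stand-in only approximates the target size (38 instead of 35 here); cv2.resize hits it exactly.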
The analysis of the image and subsequent color classification is performed with a technique called K-Means clustering. K-Means is a non-supervised algorithm that aims to partition n objects into k groups (called clusters). While there are more specific techniques to get a more accurate number of clusters, for the sake of simplicity, they won’t be explored in this article.
Now that the image has been resized to the proper measurements, it is time to use the SKLearn library to get to the most interesting part of the clustering process. In this case, a total of four clusters will be used, because that is the number of dominant colors to be extracted from the picture.
Once the K-Means algorithm has finished computing, the results have to be extracted using the Numpy library.
After some refactoring, the result of this step is an array containing the four most dominant colors of the image analyzed:
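Putting the clustering step together, here is a self-contained sketch with scikit-learn, using a synthetic four-color pixel matrix in place of the resized photo (the palette is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the resized 35x35 image: every pixel is one of
# four known colors, flattened to one row per pixel with 3 RGB columns.
palette = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 0]])
rng = np.random.default_rng(0)
pixels = palette[rng.integers(0, 4, size=35 * 35)]

# Four clusters, because four dominant colors are to be extracted.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)

# The cluster centers are the dominant colors, as RGB triples.
dominant = kmeans.cluster_centers_.round().astype(int)
print(dominant)
```

On this toy input, each cluster center lands exactly on one of the four palette colors; on a real photo, the centers are the averaged shades of each dominant color region.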
The only thing left to do is check the color codes! It looks like a very good match between the picture analyzed and the dominant colors extracted:

That’s it! With those four simple steps, we have created our own Color Recognition system. Of course, the application and different uses will vary, depending on the project at hand, but there is a clear approach to follow.
Here at Empathy.co, our goal is to create captivating Search experiences for shoppers and Image Recognition can play an important role in doing so. A perfect example of this is the ability to filter by color that a lot of online stores have. The data required for those filters to work comes (usually, but not always) from a system like the one we created in this experiment. There are also more advanced uses for this, like automated tag systems that label products based on their images and assign them to specific categories. This is a very interesting field, for sure, and now you’re ready to dive into it!
]]>