https://guitton.co/posts Blog 2026-03-21T21:33:44.350Z https://guitton.co Get insights on data and analytics topics to drive stronger performance and innovation. https://guitton.co/favicon.ico Louis Guitton. All Rights Reserved. <![CDATA[A lightweight alternative to Amundsen for your dbt project]]> https://guitton.co/posts/dbt-search 2021-01-17T00:00:00.000Z If you've been using dbt for a little while, chances are your project has more than 50 models. Chances are more than 10 people are building dashboards based on those models.

In the best case, self-service analytics users are coming to you with repeating questions about what model to use when. In the worst case, they are making business decisions using the wrong model.

In this post, I will show you how you can build a lightweight metadata search engine on top of your dbt metadata to answer all these questions. I hope to show you that data governance, data lineage, and data discovery don't need to be complicated topics and that you can get started today on those roadmaps with my lightweight open source solution.

LIVE DEMO: https://dbt-metadata-utils.guitton.co

Data Governance is Ripe

In his recent post The modern data stack: past, present, and future, Tristan Handy - the CEO of Fishtown Analytics (the company behind dbt) - wrote:

Governance is a product area whose time has come. This product category encompasses a broad range of use cases, including discovery of data assets, viewing lineage information, and just generally providing data consumers with the context needed to navigate the sprawling data footprints inside of data-forward organizations. This problem has only been made more painful by the modern data stack to-date, since it has become increasingly easy to ingest, model, and analyze more data.

He later also points out that dbt has its own lightweight governance interface: dbt Docs. It is a great starting point and might be enough for a while. However, as time goes by, your dbt project will outgrow its clothes. The search in dbt Docs is regex-only, and you might find its relevancy going down as the number of models grows. This can become important for Data Analysts building dashboards and looking for the right model, but also for Data Engineers looking to "pull the thread" when debugging a model. Those use cases can be summarised with the two following "Jobs to be done":

Data discovery can solve 2 'Jobs to be Done'
  1. When I want to build a dashboard, but I don’t know which table to use, help me search through the available models, so I can be confident in my conclusions.
  2. When I am debugging a data model, but I don’t know where to start, help me get data engineering context, so I can be faster to a solution.

These days, the solution to those two problems seems to be rolling out "heavyweight" tools like Amundsen. As Paco Nathan writes on p. 115 of the book Data Teams by Jesse Anderson (you can find my review of the book here):

If you look across Uber, Lyft, Netflix, LinkedIn, Stitch Fix, and other firms roughly in that level of maturity, they each have an open source project regarding a knowledge graph of metadata about dataset usage -- Amundsen, Data Hub, Marquez and so on. [...] Once an organization began to leverage those knowledge graphs, they gained much more than just lineage information. They began to recognize the business process pathways from data collection through data management and into revenue bearing use cases.
Amundsen and other heavyweight tools are the go-to solution for data discovery

Those tools come on top of an already complex stack of tools that data teams need to operate. What if we wanted a lightweight solution instead, like dbt Docs?

The Features of Amundsen and other Metadata Engines

In his great Teardown of Data Discovery Platforms, Eugene Yan summarizes really well the features of Amundsen and other metadata engines. He splits them into 3 categories: features to find data, features to understand data, and features to use data.

Architecture of your friendly neighbourhood metadata engine

Its friendly UI with a familiar search UX is one of the key factors behind Amundsen's success. Another is its modular architecture, which is already being reused by other open source metadata projects like whale (previously called metaframe).

We can further split the 3 categories of features into 10 features of varying implementation difficulty. Those features also have varying returns, not represented here.

Taxonomy of 10 features from metadata engines, cost opinions are my own

The key thing to realise is that Lyft might have spent a 15⭐️-cost on Amundsen to assemble all those features. But what if we wanted to build a 3⭐️-cost metadata engine? What features and technologies would you pick?

2021-02-09 Update: The Ground paper from Rise labs

In the seminal paper Ground: A Data Context Service, the RISE Lab outlined those features with much better terminology than I was aware of when first writing this post: the ABCs of metadata.

  • Application - information on how the data should be interpreted
  • Behavior - information on how the data is created and used
  • Change - information on the frequency and types of updates to the data

A Lightweight Alternative to Amundsen

It may well be that feature completeness (everything in one place) is the USP of Amundsen and others, but I want to make the case for a more lightweight approach.

Documentation tools go stale easily, at least when they are not tied to the data modeling code. dbt has proven with dbt Docs that data people do want to document their code (hi team 😁). We were just waiting for a tool simple and integrated enough for a culture of Data Governance to blossom. It reminds me of those DevOps books showing that the solution is not the tooling but the culture (if you're curious, check out The Phoenix Project).

Additionally, dbt sources are a great way to label raw data explicitly. The dbt graph documents data lineage for you at the table level, and I will later leverage that graph to propagate tags with no additional work.
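The "propagate tags" idea is simple enough to sketch with the standard library alone. Here is a toy version; the graph and node names below are made up, and in the real project the edges would come from dbt's manifest.json:

```python
from collections import defaultdict, deque

# Toy dependency graph: node -> downstream children.
edges = {
    "source.jaffle_shop.orders": ["stg_orders"],
    "stg_orders": ["orders"],
    "orders": ["customer_orders"],
}

def propagate_source_tags(edges, sources):
    """Walk the DAG left to right and tag every model with the
    sources it transitively depends on."""
    tags = defaultdict(set)
    for src in sources:
        queue = deque(edges.get(src, []))
        while queue:
            node = queue.popleft()
            if src in tags[node]:
                continue  # already visited from this source
            tags[node].add(src)
            queue.extend(edges.get(node, []))
    return {node: sorted(srcs) for node, srcs in tags.items()}

print(propagate_source_tags(edges, ["source.jaffle_shop.orders"]))
```

Every model downstream of a source inherits its tag with no manual work, which is what makes it such a cheap faceting attribute later on.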

In other words, with schemas, descriptions and data lineage, dbt Docs covers the "Features to Understand" category from the above diagram. So what is missing from dbt Docs to rival Amundsen? Only a way to surface the work that is already happening in your dbt repository. And that is Search.

Algolia market themselves as a 'flexible search platform'

A good search engine will cover the Features to Find category. Fortunately, we don't need to build a search engine. This is where we will use Algolia's free tier, in addition to some static HTML and JS files, to build our lightweight data discovery and metadata engine. Algolia's free tier gives you 10k search requests and 10k records per month. Given that for us 1 record = 1 dbt model and 1 search request = 1 data request from a user, my guess is that the free tier will cover our needs for a while.

Note: if you're worried that Algolia isn't open source, consider using the project typesense.

How do we get at least one feature in the Features to Use category? Well, a dbt project is tracked in version control, so by parsing git's metadata we can, for example, know each model's owner.
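As a sketch of that idea: the "owner = most frequent committer" heuristic below is my own simplification, and the commit tuples are made up; in practice they would come from parsing `git log` output with a git client.

```python
from collections import Counter

# Hypothetical commit log entries (author, touched file), e.g. parsed
# from `git log --name-only --pretty=format:%an`.
commits = [
    ("alice", "models/orders.sql"),
    ("alice", "models/orders.sql"),
    ("bob", "models/orders.sql"),
    ("bob", "models/customers.sql"),
]

def model_owners(commits):
    """Owner of a model = the author with the most commits touching it."""
    per_file = {}
    for author, path in commits:
        per_file.setdefault(path, Counter())[author] += 1
    return {path: counts.most_common(1)[0][0] for path, counts in per_file.items()}

print(model_owners(commits))
```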

More generally, to extend our lightweight metadata engine, we would add metadata sources and develop parsers to collect and organise that metadata, then index it in our search engine. Examples of metadata sources are the dbt manifest, the git history, and the warehouse query logs.
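For the dbt manifest, a parser can be a few lines of Python. The snippet below works on a minimal, hypothetical slice of target/manifest.json; real manifests carry many more keys per node:

```python
import json

# A minimal, hypothetical slice of dbt's target/manifest.json.
manifest = json.loads("""
{
  "nodes": {
    "model.jaffle_shop.orders": {
      "resource_type": "model",
      "name": "orders",
      "description": "One row per order.",
      "tags": ["mart"],
      "depends_on": {"nodes": ["model.jaffle_shop.stg_orders"]}
    },
    "test.jaffle_shop.not_null_orders_id": {"resource_type": "test"}
  }
}
""")

# Keep only models and flatten the fields we want to index.
records = [
    {
        "objectID": key,
        "name": node["name"],
        "description": node["description"],
        "tags": node["tags"],
        "depends_on": node["depends_on"]["nodes"],
    }
    for key, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
]
```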

What does good Search look like?

Search is going to be key if our metadata engine is to rival Amundsen, so let's look at Amundsen's docs. We know from their architecture page that they use ElasticSearch under the hood. And we can also read that we will need a ranking mechanism to order our dbt models by relevancy:

Search for data within your organization by a simple text search. A PageRank-inspired search algorithm recommends results based on names, descriptions, tags, and querying/viewing activity on the table/dashboard. -- Source

A bit further in the docs, we learn that Amundsen has three search indices and that the search bar uses multi-index search against those indices:

the users could search for any random information in the search bar. In the backend, the search system will use the same query term from users and search across three different entities (tables, people, and dashboards) and return the results with the highest ranking. -- Source

We even get examples for searchable attributes for the documents in the tables index:

For Table search, it will search across different fields, including table name, schema name, table or column descriptions, tags and etc -- Source

There's not much point in reverse engineering an open source project any further, so I'll spare you the rest: it also supports search-as-you-type and faceted search (applying filters).

More on search

To build this search capability, you could use different technologies. I attended a talk at Europython 2020 from Paolo Melchiorre advocating for good old PostgreSQL full text search. To my knowledge though, you don't get search-as-you-type with it. This is one of the reasons why people tend to go for ElasticSearch or Algolia. Choosing between those two is then a build-or-buy decision: more engineering resources vs "throwing money" at the serverless Algolia. As we saw though, for our use case the free tier will be enough, so we get the best of both worlds.

That leaves the question of structuring our documents for search. Attributes in searchable documents are one of three types: searchable (i.e. matches your query), faceting (i.e. a filter), or ranking (i.e. a weight).

Keys in searchable documents are 1 of 3 types

Our searchable attributes will be table names and descriptions.

Our faceting attributes will be "tags" on our models: these could be vanilla dbt tags if you have good ones, or materialisation, resource type or any other key from the .yml file. Assuming a conscious curation effort happens when code maintainers place a model in a folder of the dbt codebase, we can use folder names as a faceting attribute too. Lastly, we can use the dbt graph to propagate, from left to right, the sources that models depend on; these will serve as a useful faceting attribute.
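Putting the attribute types together, one record per dbt model might look like this. The field names and values are illustrative, not the exact dbt-metadata-utils schema:

```python
# Hypothetical shape of one search record (1 record = 1 dbt model).
record = {
    "objectID": "model.jaffle_shop.orders",
    # searchable attributes
    "name": "orders",
    "description": "One row per order, with payment amounts.",
    # faceting attributes
    "tags": ["mart"],
    "folder": "marts/core",
    "sources": ["jaffle_shop"],
    # ranking attribute (a graph metric, discussed next)
    "centrality": 0.42,
}

# The split we hand to the search engine's index settings.
searchable = ["name", "description"]
facets = ["tags", "folder", "sources"]
```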

For ranking attributes, we will build metrics that matter to us to prioritise tables for our users. Keep in mind that we started with 2 use cases ('Jobs to be Done'), so each persona could benefit from a different metric. For example, for "dashboard builders", the goal could be to downrank corner-case models so that only "central" models are used. But for "data auditors", the goal might be to prioritise the models that need attention first. In our case, we will focus on the first persona and use a PageRank-like algorithm (degree centrality, as shown in my previous post). This is great at the start of your self-service analytics journey: dashboard builders might not know which tables are good yet, so a good proxy is to look at which models are reused by your dbt committers. Later, you could do like Amundsen and rely on the query logs to boost the models that are used the most.
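Degree centrality itself fits in a few lines. The toy graph below is hypothetical; networkx's degree_centrality would give the same numbers:

```python
# Toy dbt graph as (upstream, downstream) edges.
edges = [
    ("stg_orders", "orders"),
    ("stg_payments", "orders"),
    ("orders", "customer_orders"),
]

def degree_centrality(edges):
    nodes = {n for edge in edges for n in edge}
    degree = {n: 0 for n in nodes}
    for left, right in edges:
        degree[left] += 1
        degree[right] += 1
    # Normalise by the maximum possible degree (n - 1 other nodes).
    return {n: d / (len(nodes) - 1) for n, d in degree.items()}

ranks = degree_centrality(edges)
# "orders" is the most reused model, so it gets the highest boost.
print(max(ranks, key=ranks.get))  # → orders
```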

Putting it together in the dbt-metadata-utils repository

I have assembled a couple of scripts in the (work in progress) repository called dbt-metadata-utils. I will walk through a couple of key parts here, but feel free to check out the full code there, and if you want to use it on your own project, hit me up.

All you will need is:

  • your already existing dbt project in a git repository locally
  • clone dbt-metadata-utils on the same machine as your dbt project
  • create one Algolia account (and API key)
  • create one Algolia app inside that account
  • run the commands laid out below

For the dbt project, we will use one of the example projects listed on the dbt docs: the jaffle_shop codebase.

I had no clue about Jaffles, and then I used dbt

Create an environment file in which you will need to fill in the values from the Algolia dashboard:

ALGOLIA_ADMIN_API_KEY=
ALGOLIA_SEARCH_ONLY_API_KEY=
ALGOLIA_APP_ID=

ALGOLIA_INDEX_NAME=jaffle_shop_nodes

DBT_REPO_LOCAL_PATH=~/workspace/jaffle_shop
DBT_MANIFEST_PATH=~/workspace/jaffle_shop/target/manifest.json

GIT_METADATA_CACHE_PATH=data/git_metadata

And then run the 4 make commands:

$ make install  # best is to install inside a virtual environment
pip install --upgrade pip
pip install -r requirements.txt

$ make update-git-metadata
python -m dbt_metadata_utils.git_metadata
100%|███████████████████████████████████████████| 11/11 [00:00<00:00, 12499.96it/s]

$ make update-index
python -m dbt_metadata_utils.algolia

$ make run
cd dbt-search-app && npm start

> [email protected] start /Users/louis.guitton/workspace/dbt-metadata-utils/dbt-search-app
> parcel index.html --port 3000

Server running at http://localhost:3000
✨  Built in 1.03s.

If you navigate to http://localhost:3000, you should see a UI that looks like this:

Screenshot from dbt-metadata-utils

I didn't dwell on details, but our metadata engine's features are:

  • search-as-you-type by table name, table description, the model's folder in the dbt codebase, or its sources
  • tag propagation via DAG algorithms, using the loader and sources keys from the dbt .yml files
  • faceted search by those tags
  • ranking by degree centrality, boosting dbt models that are in a mart or have a docs description
  • table documents enriched with git metadata parsed from the git repository using the python git client
  • advanced search using dynamic filtering: if you enter a query with a loader (e.g. "airflow payments"), rules will filter documents with loader=airflow

Conclusion

LIVE DEMO: https://dbt-metadata-utils.guitton.co

There you have it! A lightweight data governance tool on top of dbt artifacts and Algolia. I hope this showed you that data governance doesn't need to be a complicated topic, and that by using a knowledge graph of metadata, you can get a head start on your roadmap.

Leave a star on the github project, and let me know your thoughts on twitter. I enjoyed building this project and writing this post because it lies at the intersection of three of my areas of interest: NLP, Analytics and Engineering. I cover those three topics in other places on my blog.

Resources

  1. The modern data stack: past, present, and future | dbt blog
  2. A Jobs to be Done Framework for Startups — JTBD Templates & Examples for Building Products Customers Want | First Round Review
  3. Louis Guitton’s review of Data Teams: A Unified Management Model for Successful Data-Focused Teams | Goodreads
  4. Teardown: What You Need To Know on Data Discovery Platforms
  5. Architecture - Amundsen
  6. dataframehq/whale: 🐳 The stupidly simple data discovery tool.
  7. How to find and organize your data from the command-line | by Robert Yi | Towards Data Science
  8. The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win by Gene Kim | Goodreads
  9. Site Search & Discovery powered by AI | Algolia
  10. typesense/typesense: Fast, typo tolerant, fuzzy search engine for building delightful search experiences ⚡ 🔍
  11. louisguitton/dbt-metadata-utils: Parse dbt artifacts and search dbt models with Algolia
  12. transform/snowflake-dbt · master · GitLab Data / GitLab Data Team · GitLab
]]>
<![CDATA[How to monitor your FastAPI service]]> https://guitton.co/posts/fastapi-monitoring 2020-09-18T00:00:00.000Z How to monitor your FastAPI service

API Monitoring vs API Profiling

Monitoring is essentially collecting data in the background of your application, for the purpose of diagnosing issues, debugging errors, or reporting on the latency of a service.

For example, at the infrastructure level, you can monitor CPU and memory utilization. At the application level, you can monitor errors, code performance or database querying performance. For a more complete introduction to monitoring and why it's necessary, see this excellent post from Full Stack Python.

In this post, we will focus on Application Performance Monitoring (APM) for a FastAPI application.

Error Tracking

In this post, I will not talk about monitoring application errors and warnings. For this purpose, check Sentry, it has great ASGI support and will work out of the box with your FastAPI service.

API Profiling

Profiling is a code best practice that is not specific to web development. From the python docs on profiling, we can read:

the profilers run code and give you a detailed breakdown of execution times, allowing you to identify bottlenecks in your programs. Auditing events provide visibility into runtime behaviors that would otherwise require intrusive debugging or patching.

You can of course apply profiling in the context of a FastAPI application, in which case you might find this timing middleware handy.

However, with this approach the timing data is logged to stdout. You can use it in development to find bottlenecks, but in practice, digging through production logs for latency information is not the most convenient.
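For reference, a minimal timing middleware is not much code. This is a sketch of the idea (not the fastapi-utils implementation linked above), and it stores timings in memory instead of logging them:

```python
import time

class TimingMiddleware:
    """Minimal ASGI middleware that records how long each request takes."""

    def __init__(self, app):
        self.app = app
        self.timings = []  # in a real app, ship these to your APM instead

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)
        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.timings.append((scope["path"], elapsed_ms))
```

Any ASGI app (FastAPI included) can be wrapped with `app = TimingMiddleware(app)`.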

Available Tools for Application Performance Monitoring (APM)

As with all things, there are many options. Some are open source, some are SaaS businesses. Most likely you or your organisation are already using one or more monitoring tools, so I'd suggest starting with the one you know. The tools on the list below don't only do APM, which is what sometimes makes them harder to compare. Example application monitoring tools you might have heard of:

  • New Relic (commercial with parts open source)
  • Datadog (commercial with parts open source)
  • StatsD (open source)
  • Prometheus (open source)
  • OpenTelemetry (open source)

This list is not exhaustive, but let's note OpenTelemetry which is the most recent on this list and is now the de-facto standard for application monitoring metrics.

At this point, choosing a tool doesn't matter, let's rather understand what an APM tool does.

The 4 Steps of Monitoring

The 4 steps of monitoring
  1. It all starts with your application code. You instrument your service with a library corresponding to your app's language (in our case python). This is the monitoring client library.
  2. Then the monitoring client library sends each individual call to the monitoring server daemon over the network (UDP in particular, as opposed to TCP or HTTP).
  3. The monitoring server daemon is listening to monitoring events coming from the applications. It packs the incoming data into batches and regularly sends it to the monitoring backend.
  4. The monitoring backend has usually 2 parts: a data processing application and a visualisation webapp. It turns the stream of monitoring data into human-readable charts and alerts. Examples:
    • app.datadoghq.com
    • one.newrelic.com
The monitoring backend has 2 parts
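Step 2 is worth making concrete. A StatsD-style client is little more than string formatting plus a UDP datagram; the sketch below uses StatsD's wire format, while real client libraries add sampling, buffering and tags:

```python
import socket

def statsd_packet(metric: str, value: int, mtype: str = "c") -> bytes:
    """Serialise one StatsD-style datagram, e.g. b'api.requests:1|c'."""
    return f"{metric}:{value}|{mtype}".encode()

def send_metric(metric: str, value: int = 1, addr=("localhost", 8125)) -> None:
    # Fire-and-forget UDP: no handshake and nothing blocking the request
    # path, which is exactly why step 2 uses UDP rather than TCP or HTTP.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(statsd_packet(metric, value), addr)
```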

The problem with monitoring ASGI webapps

ASGI is a relatively new standard for python web servers. As with every new standard, it will take some time for all tools in the ecosystem to support it.

Given the 4 steps of monitoring laid out above, a problem arises if the monitoring client library doesn't support ASGI. For example, this is the case with NewRelic at the moment (see ASGI - Starlette/Fast API Framework · Issue #5 · newrelic/newrelic-python-agent for more details). I looked at Datadog too and saw that ASGI is also not supported at the moment.

On the open source side of the tools however, OpenTelemetry had great support for ASGI. So I set out to instrument my FastAPI service with OpenTelemetry.

Update - Sep 19th, 2020: There seems to be support for ASGI in ddtrace

Update - Sep 22nd, 2020: There is now an API in the NewRelic agent to support ASGI frameworks, with uvicorn already supported and starlette on the way.

Update - Oct 23rd, 2020: The NewRelic python agent now supports Starlette and FastAPI out of the box.

Instrumenting FastAPI with OpenTelemetry and Jaeger

OpenTelemetry provides a standard for steps 1 (with Instrumentors) and 2 (with Exporters) of the 4 steps above. One of the big advantages of OpenTelemetry is that you can send the events to any monitoring backend (commercial or open source). This is especially awesome because you can use the same instrumentation setup for development, staging and production environments.

Update - May 30th, 2021: Github is now adopting OpenTelemetry

Note that depending on the language you use for your microservice, your mileage may vary. For example, there is no NewRelic OpenTelemetry Exporter in Python yet. But there are OpenTelemetry Exporters for many other languages and backends; see the list here: Registry | OpenTelemetry (filter by language and by type=Exporter).

One of the available backends is Jaeger: open source, end-to-end distributed tracing. (Note that Jaeger is also a monitoring client library that you can instrument your application with, but here that's not the part of interest).

Instrumenting FastAPI with OpenTelemetry and Jaeger

Although it's open source and was really easy to get working, the issue I had with Jaeger is that it doesn't have a data pipeline yet. This means that, in the visualisation webapp, you can browse traces but you cannot see any aggregated charts. Such a backend is on their roadmap though.

Still, Jaeger is my go-to tool for monitoring while in development. See the last part for more details.

Instrumenting FastAPI with OpenTelemetry and Datadog

I couldn't find any open source monitoring backend with a data pipeline that would provide the features I was looking for (latency percentile plots, bar chart of total requests and errors ...).

It became apparent that this is where commercial solutions like NewRelic and Datadog shine. I hence set out to try the OpenTelemetry Datadog exporter.

Instrumenting FastAPI with OpenTelemetry and Datadog

With this approach, you get a fully featured monitoring backend that will allow you to have full observability for your microservice.

The 2 drawbacks are:

  • you need to deploy the Datadog agent yourself (with docker, on Kubernetes, or on whatever environment fits your stack) and this can get a bit involved
  • Datadog being a commercial product, this solution will not be free. You will have to pay extra attention to the pricing of Datadog (especially if you deploy the Datadog agent to Kubernetes 😈).

Example FastAPI instrumentation using OpenTelemetry, Jaeger and Datadog

So how does it look in the code? This is how my application factory looks. If you have any questions, feel free to reach out on twitter or open a github issue. I will not share my instrumentation because it is specific to my application, but imagine that you can define any nested spans and that those traces will be sent the same way to Jaeger or to Datadog. This makes it really fast to iterate on your instrumentation code (e.g. add or remove spans), and even faster to find performance bottlenecks in your code.

"""FastAPI Application factory with OpenTelemetry instrumentation
sent to Jaeger in dev and to DataDog in staging and production."""
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.datadog import DatadogExportSpanProcessor, DatadogSpanExporter
from opentelemetry.exporter.jaeger import JaegerSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchExportSpanProcessor

from my_api.config import generate_settings
from my_api.routers import my_router_a, my_router_b


def get_application() -> FastAPI:
    """Application factory.

    Returns:
        ASGI application to be passed to ASGI server like uvicorn or hypercorn.

    Reference:
    - [FastAPI Middlewares](https://fastapi.tiangolo.com/advanced/middleware/)
    """
    # load application settings
    settings = generate_settings()

    if settings.environment != "development":
        # opentelemetry + datadog for staging or production
        trace.set_tracer_provider(TracerProvider())
        datadog_exporter = DatadogSpanExporter(
            agent_url=settings.dd_trace_agent_url,
            service=settings.dd_service,
            env=settings.environment,
            version=settings.dd_version,
            tags=settings.dd_tags,
        )
        trace.get_tracer_provider().add_span_processor(
          DatadogExportSpanProcessor(datadog_exporter)
        )
    else:
        # opentelemetry + jaeger for development
        # requires jaeger running in a container
        trace.set_tracer_provider(TracerProvider())
        jaeger_exporter = JaegerSpanExporter(
            service_name="my-app", agent_host_name="localhost", agent_port=6831,
        )
        trace.get_tracer_provider().add_span_processor(
            BatchExportSpanProcessor(jaeger_exporter, max_export_batch_size=10)
        )

    application = FastAPI(
        title="My API",
        version="1.0",
        description="Do something awesome, while being monitored.",
    )
    # Add your routers
    application.include_router(my_router_a)
    application.include_router(my_router_b)

    FastAPIInstrumentor.instrument_app(application)
    return application


app = get_application()

Conclusion

I hope that with this post you've learned:

  • the difference between profiling, monitoring, tracking errors
  • the architecture of application monitoring
  • some of the application monitoring tools out there
  • that OpenTelemetry allows you to reuse the same instrumentation setup for all your environments, which shortens the time it takes to find performance bottlenecks in your application

I've used this setup to get a 10x speed up on one multi-lingual NLP fastapi service I built at OneFootball.

Resources

  1. StatsD, What It Is and How It Can Help You | Datadog
  2. Monitoring - Full Stack Python
  3. ASGI | Sentry Documentation
  4. Debugging and Profiling — Python 3.9.0 documentation
  5. Timing Middleware - FastAPI Utilities
  6. APM | New Relic Documentation
  7. APM & Distributed Tracing - Datadog
  8. OpenTelemetry
  9. newrelic/newrelic-python-agent: New Relic Python Agent
  10. DataDog/dd-trace-py: Datadog Python APM Client
  11. open-telemetry/opentelemetry-python: OpenTelemetry Python API and SDK
  12. Registry | OpenTelemetry
  13. Jaeger: open source, end-to-end distributed tracing
  14. Getting Started with OpenTelemetry Python — OpenTelemetry Python documentation
]]>
<![CDATA[Panama Papers Investigation using Entity Resolution and Entity Linking]]> https://guitton.co/posts/entity-resolution-entity-linking 2024-10-20T00:00:00.000Z Panama Papers Investigation using Entity Resolution and Entity Linking

If you’ve worked with a corpus of text, chances are you needed to structure its information specifically for your domain. How can you link the entities mentioned in the articles to a knowledge base you control, which you can enrich and which might evolve depending on your focus?

Imagine you are an investigative journalist sifting through the Panama Papers and you are following a lead: the consortium called “Londex Resources S.A.”. You’re not sure what people, organizations, countries or other articles are connected to that lead. Perhaps one of them can be your next breakthrough?

In this article, we will demonstrate a technical approach that combines Entity Resolution performed with Senzing with Entity Linking performed in spaCy. We show how this can be used to construct a domain-specific Knowledge Graph, e.g. around a lead you’re following, to analyze your corpus with it. We will then show how to close the loop and use the analyzed corpus to update the Knowledge Graph with new leads.

Along with this blog post, we have open-sourced a package to do zero-shot entity linking spacy-lancedb-linker, and released a tutorial for reference erkg-tutorials.

Articles from the ICIJ Offshore Leaks dataset

For this blog post, we will be looking at a set of articles from investigative journalism: the Panama Papers, the Pandora Papers, and Offshore Leaks. Those are cross-border investigations that have made the headlines and were led by ICIJ (the International Consortium of Investigative Journalists).

ICIJ maintains the ICIJ Offshore Leaks dataset, in the form of either a Neo4J database or a set of zipped CSV files. The dataset contains 4 main entity types.

Data schema of the ICIJ Offshore Leaks dataset

Persons or “Officers” are directors, shareholders, and beneficiaries of offshore companies: for example, presidents, royals, members of parliament, their family members, and their closest associates. “Intermediaries” are secrecy brokers like banks or law firms that Officers turn to to optimize their finances. Organizations or “Entities” are shell companies established by secrecy brokers. “Addresses” are countries, world regions, or secret jurisdictions of Officers, Entities or Intermediaries.

ICIJ Offshore Leaks node for Arzu Aliyeva

For example, Offshore Leaks has shown that Arzu Aliyeva, daughter of Ilham Aliyev, president of Azerbaijan, lives in Dubai and is a shareholder and director of Arbor Investments Ltd, registered in the Virgin Islands. This creates a natural graph that connects Arzu Aliyeva to other Officers like Hassan Gozal.

This dataset is commonly used to surface UBO (Ultimate Beneficial Owner) structures or to investigate AML (Anti Money Laundering) scenarios. Prior work shows how to use this data in Neo4j and in Linkurious, and shows typical investigations written with that data. In this blog post, we will instead show how a Senzing-preprocessed version of this dataset can power an Entity Linking use case.

Overview of the high-level architecture

Senzing provides a development library for Principle-Based Entity Resolution based on Entity-Centric Learning. Senzing Founder/CEO Jeff Jonas said: “[we want to help] developers fast-track their entity resolution needs – as understanding who is who and who is related to who is essential – and exceptionally essential in the creation of entity resolved knowledge graphs (ERKG)”. They have previously shown how to extract personally identifiable information (PII) from the ICIJ graph to be used as input into Senzing. After configuring and running Senzing, a JSON export of entity resolution (ER) results can be used to construct or update a Knowledge Graph (KG), called an entity-resolved knowledge graph (ERKG). Pre-computed ER results for ICIJ are shared as a dataset by Senzing in a GCP public bucket (download link).

High-level architecture for this blog post

While other tutorials show how to load the ICIJ Offshore Leaks data into graph databases and resolve entities with Senzing, this tutorial starts with the Senzing export. With a custom Data Engineering pipeline, we ingest the Entity Resolution results into an Approximate Nearest Neighbors (ANN) index stored in LanceDB. We then use that index in a spaCy pipeline to run Entity Linking against a small dataset of scraped ICIJ web articles. The end user can then use the output of the entity linking.
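To give a feel for the entity linking step, here is a toy version of candidate generation using string similarity from the standard library. The alias table and entity ids below are made up; in the real pipeline this lookup is an approximate-nearest-neighbour search over alias embeddings stored in LanceDB, wired into spaCy by spacy-lancedb-linker:

```python
import difflib

# Made-up alias table standing in for the Senzing export:
# resolved entity id -> known aliases.
aliases = {
    "ent-001": ["Arzu Aliyeva", "A. Aliyeva"],
    "ent-002": ["Ilham Aliyev", "President Ilham Aliyev"],
    "ent-003": ["Mossack Fonseca", "Mossack Fonseca & Co."],
}

def link(mention, aliases, cutoff=0.8):
    """Return the entity id whose alias best matches the mention,
    or None if nothing is close enough."""
    alias_to_id = {a: eid for eid, names in aliases.items() for a in names}
    hits = difflib.get_close_matches(mention, list(alias_to_id), n=1, cutoff=cutoff)
    return alias_to_id[hits[0]] if hits else None

# A slightly misspelled mention still resolves to the right entity.
print(link("Mosack Fonseca", aliases))
```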

Dagster lineage graph of the pipeline built for this article

In practice, in louisguitton/erkg-tutorials, we built this data pipeline in Python using an orchestration tool that helps visualise it. The Senzing ER results feed a Senzing pipeline that builds the Entity Linking inputs, which in turn feed a spaCy pipeline. Next, we will see in detail how to use the ERKG to power Entity Linking.

From a Suspicion to Entity Linking

While Senzing is proven to scale to billions of records, the other components don't all scale the same way without performance engineering. Given that ICIJ has 1.5M records and ~5M aliases, we work on a subset to keep this tutorial quick and easy for the reader.

When doing Entity Linking against Wikidata or DBpedia, a subset would likewise be considered so as not to load the entire Knowledge Graph into the entity linking pipeline. Similarly, we query for a subset of the KG, either with query languages like SPARQL or by building custom KGs from smaller files (CSVs or JSONs).

Also, in practice, investigative journalists work off so-called Case Management Systems: they use software to organize and analyze information, get assigned a "lead" (a specific person or company), and only look at the immediate subgraph for that lead.

For those reasons, we start from a text file called data/icij-example/suspicious.txt, where the investigative journalist can seed the system. Let’s say the lead you have to explore is the consortium called “Londex Resources S.A.”, which has ties to the Azerbaijani presidential family: you start by providing a few entity names from the Senzing ERKG that you care about. Here, we start with Arzu Aliyeva (the daughter), Ilham Aliyev (the president), and so on:

Arzu Aliyeva
Ilham Aliyev
Mossack Fonseca
Fazil Mammadov
AtaHolding
FM Management Holding Group S.A. Stand
UF Universe Foundation
Mehriban Aliyeva
Heydar Aliyev
Leyla Aliyeva
AtaHolding Azerbaijan
Financial Management Holding Limited
Hughson Management Inc.

From that, we’re able to filter the ERKG down (using friend-of-friend logic) to fewer than 100 entities of interest. That’s the immediate subgraph for our lead. Starting with this might be enough. If it turns out it isn’t, you can expand the subgraph either by adding seed entities to suspicious.txt or by adding more friends of friends.
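This friend-of-friend expansion can be sketched in a few lines of plain Python; the adjacency dict and the entity names below are illustrative stand-ins, not the actual tutorial code:

```python
def friends_of_friends(adjacency: dict, seeds: set, hops: int = 2) -> set:
    """Expand a set of seed entities to their n-hop neighbourhood in the ERKG."""
    subgraph = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        # Collect all unseen neighbours of the current frontier
        frontier = {n for node in frontier for n in adjacency.get(node, set())} - subgraph
        subgraph |= frontier
    return subgraph


# Toy relationship graph between resolved entities
adjacency = {
    "arzu": {"arbor_investments"},
    "arbor_investments": {"hassan_gozal"},
    "hassan_gozal": {"some_other_company"},
}
subgraph = friends_of_friends(adjacency, {"arzu"}, hops=2)
# "arzu" plus its friends and friends-of-friends
```

Raising `hops` (or adding more seeds) grows the subgraph, which mirrors how the journalist would widen the net when the initial subgraph isn't enough.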

Once we’ve filtered the ERKG, we extract aliases into the aliases.jsonl file, in the format required by the entity linking library we wrote.

{"alias":"Ilham Aliyev","entities":["1342265","1551574"],"probabilities":[0.5,0.5]}
{"alias":"Arzu Aliyeva","entities":["281073","918573","1470056","1722271","1697384","1380470"],"probabilities":[0.1666666667,0.1666666667,0.1666666667,0.1666666667,0.1666666667,0.1666666667]}
{"alias":"Arzu Ilham Qizi Aliyeva","entities":["883102"],"probabilities":[1.0]}
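Records like these can be generated with a small helper that assigns a uniform prior of 1/n over the candidate entities (the helper name is ours, for illustration):

```python
import json


def alias_record(alias: str, entity_ids: list) -> str:
    """Serialize one aliases.jsonl line with a uniform prior over candidate entities."""
    prob = round(1 / len(entity_ids), 10)
    record = {"alias": alias, "entities": entity_ids, "probabilities": [prob] * len(entity_ids)}
    return json.dumps(record)


print(alias_record("Ilham Aliyev", ["1342265", "1551574"]))
# {"alias": "Ilham Aliyev", "entities": ["1342265", "1551574"], "probabilities": [0.5, 0.5]}
```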

We also need to generate entity descriptions from the ERKG to populate the second file required by the entity linking library, entities.jsonl. We generate those descriptions by joining together the structured features available in the ERKG.

{"entity_id": "1342265", "type": "PER", "name": "Ilham Aliyev", "description": "Ilham Aliyev, located at P.O. BOX 17920 JEBEL ALI FREE ZONE DUBAI UAE, in United Arab Emirates"}
{"entity_id": "1697384", "type": "PER", "name": "Arzu Aliyeva", "description": "Arzu Aliyeva, located at APARTMENT NO. 1801 DUBAI MARINA LEREV RESIDENTIAL DUBAI U.A.E., in United Arab Emirates"}
{"entity_id": "1551574", "type": "ORG", "name": "Rosamund International Ltd", "description": "Rosamund International Ltd, located at PORTCULLIS TRUSTNET CHAMBERS P.O. BOX 3444 ROAD TOWN, TORTOLA BRITISH VIRGIN ISLANDS, in British Virgin Islands"}
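A minimal sketch of that joining logic, with field names assumed for illustration, could look like this:

```python
def build_description(name: str, address: str = "", country: str = "") -> str:
    """Join the structured features of a resolved entity into a description string."""
    parts = [name]
    if address:
        parts.append(f"located at {address}")
    if country:
        parts.append(f"in {country}")
    return ", ".join(parts)


build_description(
    "Ilham Aliyev",
    address="P.O. BOX 17920 JEBEL ALI FREE ZONE DUBAI UAE",
    country="United Arab Emirates",
)
# "Ilham Aliyev, located at P.O. BOX 17920 JEBEL ALI FREE ZONE DUBAI UAE, in United Arab Emirates"
```

The point of these descriptions is to give the linker some disambiguating context beyond the bare entity name.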

Introducing spacy-lancedb-linker, a new library for ANN Entity Linking with spacy

With our two artefacts ready, we can move on to Entity Linking, one of the common NLP tasks.

Entity Linking and Discovery

A more formal definition of Entity Linking can be found in the Zshot paper by IBM:

Entity Linking, also known as named entity disambiguation, is the process of identifying and disambiguating mentions of entities in a text, linking them to their corresponding entries in a knowledge base or a dictionary. For example, given "Barack Obama", entity linking would determine that this refers to the specific person with that name (one of the presidents of the United States) and not any other person or concept with the same name. [...] Entity linking can be useful for a variety of natural language processing tasks, such as information extraction, question answering, and text summarization. It helps to provide context and background information about the entities mentioned in the text, which can facilitate a deeper understanding of the content.

Several techniques can be used for entity linking, from deep learning and supervised learning to unsupervised learning approaches. They usually have two stages: candidate creation and candidate ranking. In candidate creation, the goal is to narrow down the vast number of entities to a manageable subset (e.g., tens or hundreds); in candidate ranking, the goal is to rank the candidate entities of each mention according to the probability that they match the given mention.

Example of the two steps required for entity linking: candidate creation and candidate ranking

When it comes to open-source implementations at our disposal, there is of course spaCy’s Entity Linker, although it uses supervised learning and thus requires labels, which is not practical when iterating quickly. There is also IBM’s zshot Linker, which implements 5 deep-learning linkers and is zero-shot; still, the underlying models use deep learning, so they might be slower, and they were trained on labels. We found Microsoft’s spaCy-compatible ANN linker, which uses unsupervised learning: it builds an Approximate Nearest Neighbors (ANN) index computed on the character n-gram TF-IDF representation of all aliases in your KnowledgeBase. This approach was the best fit for our use case. Unfortunately, the project is no longer supported: the last commit is from 2 years ago, and the ANN index it uses (nmslib) was causing setup errors.
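To build intuition for the candidate-creation stage, here is a toy pure-Python stand-in that scores aliases by character-trigram Jaccard overlap; the real library instead uses character n-gram TF-IDF vectors in an ANN index, but the effect is similar — near-matches survive typos:

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of a lowercased string, padded with spaces."""
    padded = f" {text.lower()} "
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}


def candidates(mention: str, kb_aliases: list, top_k: int = 2) -> list:
    """Rank KB aliases by trigram Jaccard similarity to the mention."""
    m = char_ngrams(mention)
    scored = sorted(
        kb_aliases,
        key=lambda alias: len(m & char_ngrams(alias)) / len(m | char_ngrams(alias)),
        reverse=True,
    )
    return scored[:top_k]


aliases = ["Arzu Aliyeva", "Arzu Ilham Qizi Aliyeva", "Ilham Aliyev", "Mossack Fonseca"]
candidates("Arzu Alieva", aliases)  # the misspelled mention still surfaces the close aliases
```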

Inspired by microsoft/spacy-ann-linker, we therefore wrote our own ANN entity linking library, louisguitton/spacy-lancedb-linker, swapping nmslib for LanceDB, an actively maintained ANN index. The result is a simple API that we can use to run unsupervised entity linking in spaCy:

from typing import Iterator

import srsly
from spacy.language import Language
from spacy.tokens import Doc, DocBin
from spacy_lancedb_linker.kb import AnnKnowledgeBase
from spacy_lancedb_linker.linker import AnnLinker  # noqa: imported so the "ann_linker" factory is registered
from spacy_lancedb_linker.types import Alias, Entity


def entity_linking(nlp: Language, spacy_dataset: DocBin) -> Iterator[Doc]:
    # Load the two artefacts generated from the ERKG
    entities = [Entity(**entity) for entity in srsly.read_jsonl("data/icij-example/entities.jsonl")]
    aliases = [Alias(**alias) for alias in srsly.read_jsonl("data/icij-example/aliases.jsonl")]

    # Build the LanceDB-backed knowledge base
    ann_kb = AnnKnowledgeBase(uri="data/sample-lancedb")
    ann_kb.add_entities(entities)
    ann_kb.add_aliases(aliases)

    # Add the linker as the last component of the spaCy pipeline
    ann_linker = nlp.add_pipe("ann_linker", last=True)
    ann_linker.set_kb(ann_kb)

    docs = spacy_dataset.get_docs(nlp.vocab)
    return nlp.pipe(docs)

Combining all the pieces

To recap: we start from Senzing's ERKG for ICIJ, we filter it using the lead to follow in suspicious.txt, we generate the two artifacts that we need for spacy-lancedb-linker, and we can now put together an Entity Linking pipeline. Let’s have a look at the output of the Entity Linking on an ICIJ web article about the Azeri presidential family:

ERKG-powered Entity Linking of an ICIJ article on the Azeri presidential family

The Entity Linking output here can be used for information extraction, or to provide context and background information about the entities mentioned in the text. We can also apply a simple heuristic: if an entity is central to the article but doesn't link to anything in the KB, it might be worth investigating next.

To implement this, we show in the tutorial how to use DerwenAI/pytextrank to rank entities and filter for entities that are not linked. This can form the basis of a human-in-the-loop system where the investigative journalist updates the KB or decides which leads to follow next. In the case of this article, we see that Londex Resources S.A. is mentioned twice and ranks 19th among the most important entities in the article. We can then explore the ICIJ Offshore Leaks dataset to see if that entity is known and linked to others, and if not, decide to investigate it further.
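Stripped of the pytextrank and linker machinery, the heuristic itself is a one-line filter; the mentions and importance scores below are made-up illustrations:

```python
def review_queue(ranked_mentions: list, linked: set) -> list:
    """Keep importance-ranked mentions that the KB could not link."""
    return [(mention, score) for mention, score in ranked_mentions if mention not in linked]


# (mention, importance) pairs, e.g. derived from pytextrank phrase ranks
ranked = [("Arzu Aliyeva", 0.12), ("Londex Resources S.A.", 0.05), ("Baku", 0.03)]
linked_entities = {"Arzu Aliyeva"}  # mentions the ANN linker resolved to KB entities
review_queue(ranked, linked_entities)
# the unlinked-but-important mentions become candidate leads for the journalist
```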

Table of entities up for review by the investigative journalist

We hope this blog post was useful in demonstrating a technical approach that combines Entity Resolution performed with Senzing with Entity Linking performed in spaCy. We showed how this can be used to construct a domain-specific Knowledge Graph, in particular around the Azerbaijan presidential family, and we showed how to analyze a corpus of articles with this pipeline and come up with new leads.

If you’re curious about this approach, check out the reference tutorial at erkg-tutorials and the unsupervised entity linking library we’ve open-sourced spacy-lancedb-linker.

References

  1. https://en.wikipedia.org/wiki/Panama_Papers
  2. https://en.wikipedia.org/wiki/Pandora_Papers
  3. https://en.wikipedia.org/wiki/Offshore_Leaks
  4. https://www.icij.org/about/
  5. https://offshoreleaks.icij.org/pages/database
  6. https://offshoreleaks.icij.org/nodes/78392
  7. https://neo4j.com/blog/analyzing-panama-papers-neo4j/
  8. https://source.opennews.org/articles/people-and-tech-behind-panama-papers/
  9. https://www.theguardian.com/news/2016/apr/03/what-you-need-to-know-about-the-panama-papers
  10. https://senzing.com/about/
  11. https://github.com/Senzing/mapper-icij
  12. https://senzing.com/entity-resolved-knowledge-graphs/
  13. https://storage.googleapis.com/erkg/icij/ICIJ-entity-report-2024-06-21_12-04-57-std.json.zip
  14. https://github.com/louisguitton/spacy-lancedb-linker
  15. https://dagster.io/
  16. https://github.com/louisguitton/erkg-tutorials
  17. https://www.kaseware.com/case-management
  18. Entity Linking and Discovery via Arborescence-based Supervised Clustering https://arxiv.org/pdf/2109.01242
  19. https://arxiv.org/pdf/2307.13497
  20. Reveal the Unknown: Out-of-Knowledge-Base Mention Discovery with Entity Linking https://arxiv.org/pdf/2302.07189
  21. Low-Rank Subspaces for Unsupervised Entity Linking https://arxiv.org/pdf/2104.08737
  22. https://spacy.io/usage/linguistic-features#entity-linking
  23. https://ibm.github.io/zshot/#linker
  24. https://microsoft.github.io/spacy-ann-linker/
  25. https://github.com/nmslib/nmslib
  26. https://github.com/lancedb/lancedb
  27. https://github.com/DerwenAI/pytextrank

]]>
<![CDATA[Graphs and Language]]> https://guitton.co/posts/graphs-and-language 2024-02-15T00:00:00.000Z Graphs and Language

A rising tide lifts all boats, and the recent advances in LLMs are no exception. In this blog post, we will explore how Knowledge Graphs can benefit from LLMs, and vice versa.

Where do Knowledge Graphs fit with Large Language Models?
(Source)

Where do Knowledge Graphs fit with Large Language Models?

In particular, Knowledge Graphs can ground LLMs with facts using Graph RAG, which can be cheaper than Vector RAG. We'll look at a 10-line code example in LlamaIndex and see how easy it is to get started. LLMs, in turn, can help automate KG construction, which has been a bottleneck in the past. And Graphs can provide your Domain Experts with an interface to supervise your AI systems.

Note: this is a written version of a talk I gave at the AI in Production online conference on February 15th, 2024. You can watch the talk here.


A trip down memory lane at Spacy IRL 2019

I've been working with Natural Language Processing for a few years now, and I've seen the rise of Large Language Models. The start of my NLP and Graphs work dates back to 2018, applied to the Sports Media domain when I worked as a Machine Learning Engineer at OneFootball, a football media company from Berlin, Germany.

As a practitioner, I remember that time well because it was a time of great change in the NLP field. We were moving from the era of rule-based systems and word embeddings to the era of deep learning: from LSTMs to models like ELMo and ULMFiT, and then to the transformer architecture. I was one of the lucky few who could attend the spaCy IRL 2019 conference in Berlin. There were corporate training workshops followed by talks about Transformers, conversational AI assistants, and applied NLP in finance and media.

Spacy IRL 2019 keynote by Sebastian Ruder

In his keynote, The missing elements in NLP (spaCy IRL 2019), Yoav Goldberg predicted that the next big development would be enabling non-experts to use NLP. He was right ✅. He thought we would get there through humans writing rules aided by Deep Learning, resulting in transparent and debuggable models. He was wrong ❌. We got there with chat, and we now have less transparent and less debuggable models. We moved further right and down on his chart (see below), to a place deeper than Deep Learning. The jury is still out on whether we can move towards more transparent models that work for non-experts and with little data.

Yoav Goldberg: The missing elements in NLP (spaCy IRL 2019)

In the context of my employer at the time, OneFootball, a football media company publishing in 12 languages for 10 million monthly active users, we used NLP to assist our newsroom and unlock new product features. I built systems to extract entities and relations from football articles, tag the news, and recommend articles to users. I shared some of that work in a previous talk at a Berlin NLP meetup. We had medium-sized data, not a lot, and partial labels in the form of "retags". We also could not pay for much compute, so we had to be creative. It was the realm of Applied NLP.

That's where I stumbled upon the beautiful world of Graphs, specifically the great work from my now friend Paco Nathan with his library pytextrank. Graphs (along with rule-based matchers, weak supervision, and other NLP tricks I applied over the years) helped me work with little annotated data and incorporate declarative knowledge from domain experts while building a system that could be used and maintained by non-experts, with some level of human+machine collaboration. We shipped a much better tagging system and a new recommendation system, and I was hooked.

Today with the rise of LLMs, I see a lot of potential to combine the two worlds of Graphs and LLMs, and I want to share that with you.

1. Fact grounding with Graph RAG

1.1 Fine-tuning vs Retrieval-Augmented Generation

The first place where Graphs and LLMs meet is in the area of fact grounding. LLMs suffer from a few issues like hallucination, knowledge cut-off, bias, and lack of control. To circumvent those issues, people have turned to their available domain data. In particular, two approaches emerged: Fine Tuning and Retrieval-Augmented Generation (RAG).

In his talk LLMs in Production at the AI Conference 3 months ago, Dr. Waleed Kadous, Chief Scientist at Anyscale, sheds some light on navigating the trade-offs between the two approaches. "Fine-tuning is for form, not facts", he says. "RAG is for facts".

Fine-tuning will get easier and cheaper: open-source libraries like OpenAccess-AI-Collective/axolotl and huggingface/trl already simplify the process. But it's still resource-intensive and requires more NLP maturity as a business. RAG, on the other hand, is more accessible.

According to this Hacker News thread from 2 months ago, Ask HN: How do I train a custom LLM/ChatGPT on my documents in Dec 2023?, the vast majority of practitioners are indeed using RAG rather than fine-tuning.

1.2 Vector RAG vs Graph RAG

When people say RAG, they usually mean Vector RAG, which is a retrieval system based on a Vector Database. In their blog post and accompanying notebook tutorial, NebulaGraph introduces an alternative that they call Graph RAG, which is a retrieval system based on a Graph Database (disclaimer: they are a Graph database vendor). They show that the facts retrieved by the RAG system will vary based on the chosen architecture.

They also show in a separate tutorial part of the LlamaIndex docs that Graph RAG is more concise and hence cheaper in terms of tokens than Vector RAG.

1.3 RAG Zoo

To make sense of the different RAG architectures, consider the following diagrams I created:

Differences and similarities of the RAG architectures

In all cases, we ask a question in natural language QNL and we get an answer in natural language ANL. In all cases, there is some kind of Encoding model that extracts structure from the question, coupled with some kind of Generator model ("Answer Gen") that generates the answer.

Vector RAG embeds the query (usually with a smaller model than the LLM; something like FlagEmbeddings or any of the small models at the top of the Hugging Face Embeddings Leaderboard) into a vector embedding vQ. It then retrieves the top-k document chunks closest to vQ from the Vector DB and returns them as vectors and chunks (vj, Cj). Those are passed along with QNL as context to the LLM, which generates the answer ANL.
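At its core, this retrieval step is a cosine top-k search; here is a dependency-free sketch with toy 3-dimensional embeddings (a real system would use an ANN index over high-dimensional vectors, not a linear scan):

```python
import math


def cosine(u: list, v: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k_chunks(v_q: list, index: dict, k: int = 2) -> list:
    """Return the k chunk ids whose embeddings are closest to the query embedding."""
    return sorted(index, key=lambda chunk: cosine(v_q, index[chunk]), reverse=True)[:k]


index = {
    "chunk_about_quill": [0.9, 0.1, 0.0],
    "chunk_about_groot": [0.1, 0.9, 0.0],
    "chunk_about_budget": [0.0, 0.1, 0.9],
}
v_q = [0.8, 0.2, 0.0]  # embedding of the question
top_k_chunks(v_q, index)  # the retrieved chunks become the context passed to the LLM
```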

Graph RAG extracts the keywords ki from the query and retrieves triples from the graph that match the keywords. It then passes the triples (sj, pj, oj) along with QNL to the LLM, which generates the answer ANL.

Structured RAG uses a Generator model (an LLM or a smaller fine-tuned model) to generate a query in the database's query language. It could generate a SQL query for an RDBMS or a Cypher query for a Graph DB. For example, imagine we query an RDBMS: the model generates QSQL, which is then run against the database to retrieve the answer. We note the answer ASQL, but these are the data records returned by running QSQL in the database. The answer ASQL as well as QNL are passed to the LLM to generate ANL.

In the case of Hybrid RAG, the system uses a combination of the above. There are multiple hybridization techniques that go beyond the scope of this blog post. The simple idea is that you pass more context to the LLM for Answer Gen, and you let it use its summarisation strength to generate the answer.
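To make the Graph RAG branch concrete, here is a toy sketch of keyword-based triple retrieval; the triples and the naive keyword extraction are stand-ins for what a graph store and a proper keyword extractor would do:

```python
def graph_retrieve(question: str, triples: list) -> list:
    """Keep the triples whose subject or object shares a word with the question."""
    keywords = {w.strip("?.,").lower() for w in question.split()}
    return [
        (s, p, o)
        for s, p, o in triples
        if keywords & set(s.lower().split()) or keywords & set(o.lower().split())
    ]


triples = [
    ("Peter Quill", "is leader of", "Guardians of the Galaxy"),
    ("Guardians of the Galaxy Vol. 3", "directed by", "James Gunn"),
    ("Rocket", "is member of", "Guardians of the Galaxy"),
]
context = graph_retrieve("Tell me about Peter Quill.", triples)
# the matching triples plus the question are then passed to the LLM for Answer Gen
```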

1.4 Graph RAG implementation in LlamaIndex

And now for the code: with current frameworks, we can build a Graph RAG system in 10 lines of Python.

from llama_index.llms import Ollama
from llama_index import ServiceContext, KnowledgeGraphIndex
from llama_index.retrievers import KGTableRetriever
from llama_index.graph_stores import Neo4jGraphStore
from llama_index.storage.storage_context import StorageContext
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.data_structs.data_structs import KG
from IPython.display import Markdown, display

llm = Ollama(model='mistral', base_url="http://localhost:11434")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en")

graph_store = Neo4jGraphStore(username="neo4j", password="password", url="bolt://localhost:7687", database="neo4j")
storage_context = StorageContext.from_defaults(graph_store=graph_store)

kg_index = KnowledgeGraphIndex(index_struct=KG(index_id="vector"), service_context=service_context, storage_context=storage_context)
graph_rag_retriever = KGTableRetriever(index=kg_index, retriever_mode="keyword")

kg_rag_query_engine = RetrieverQueryEngine.from_args(retriever=graph_rag_retriever, service_context=service_context)

response_graph_rag = kg_rag_query_engine.query("Tell me about Peter Quill.")
display(Markdown(f"<b>{response_graph_rag}</b>"))

This snippet supposes you have Ollama serving the mistral model and a Neo4j database running locally. It also assumes you have a Knowledge Graph in your Neo4j database; if you don't, we'll cover in the next section how to build one.

2. KG construction

2.1 Building a Knowledge Graph

Before conducting inference, you need to index your data either in a Vector DB or a Graph DB.

Indexing architectures for RAG

The equivalent of chunking and embedding documents for Vector RAG is extracting triples for Graph RAG. Triples are of the form (s, p, o) where s is the subject, p is the predicate, and o is the object. Subjects and objects are entities, and predicates are relationships.

There are a few ways to extract triples from text, but the most common is to use a combination of a Named Entity Recogniser (NER) and a Relation Extractor (RE). NER will extract entities like "Peter Quill" and "Guardians of the Galaxy Vol. 3", and RE will extract relationships like "plays role in" and "directed by".

There are fine-tuned models specialised in RE, like REBEL, but people have started using LLMs to extract triples. Here is the default prompt of LlamaIndex for RE:

Some text is provided below. Given the text, extract up to
{max_knowledge_triplets}
knowledge triplets in the form of (subject, predicate, object). Avoid stopwords.
---------------------
Example:
Text: Alice is Bob's mother.
Triplets: (Alice, is mother of, Bob)
Text: Philz is a coffee shop founded in Berkeley in 1982.
Triplets:
(Philz, is, coffee shop)
(Philz, founded in, Berkeley)
(Philz, founded in, 1982)
---------------------
Text: {text}
Triplets:

The issues with this approach are that, first, you have to parse the chat output with regexes, and second, you have no control over the quality of the entities or relationships extracted.
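A simplified version of such regex parsing, applied to the output format shown in the prompt above, might look like this (a sketch of the idea, not LlamaIndex's actual parser):

```python
import re

# Matches "(subject, predicate, object)" with exactly two commas inside parentheses
TRIPLET_RE = re.compile(r"\(([^,]+),([^,]+),([^)]+)\)")


def parse_triplets(llm_output: str) -> list:
    """Extract (subject, predicate, object) tuples from the LLM's raw text output."""
    return [
        tuple(part.strip() for part in match)
        for match in TRIPLET_RE.findall(llm_output)
    ]


raw = """Triplets:
(Philz, is, coffee shop)
(Philz, founded in, Berkeley)
(Philz, founded in, 1982)"""
parse_triplets(raw)
# [('Philz', 'is', 'coffee shop'), ('Philz', 'founded in', 'Berkeley'), ('Philz', 'founded in', '1982')]
```

The brittleness is easy to see: a predicate containing a comma, or an LLM that deviates from the format, silently breaks the extraction.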

2.2 KG construction implementation in LlamaIndex

With LlamaIndex, however, you can build a KG in 10 lines of Python using the following code snippet:

from llama_index.llms import Ollama
from llama_index import ServiceContext, KnowledgeGraphIndex
from llama_index.graph_stores import Neo4jGraphStore
from llama_index.storage.storage_context import StorageContext
from llama_index import download_loader

llm = Ollama(model='mistral', base_url="http://localhost:11434")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en")

graph_store = Neo4jGraphStore(username="neo4j", password="password", url="bolt://localhost:7687", database="neo4j")
storage_context = StorageContext.from_defaults(graph_store=graph_store)

loader = download_loader("WikipediaReader")()
documents = loader.load_data(pages=['Guardians of the Galaxy Vol. 3'], auto_suggest=False)

kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
    max_triplets_per_chunk=5,
    include_embeddings=False,
    kg_triplet_extract_fn=None,
    kg_triple_extract_template=None
)

2.3 Example failure modes of LLM-based KG construction

However, if we have a look at the resulting KG for the movie "Guardians of the Galaxy Vol. 3", we can note a few issues.

Neo4j Bloom screenshot of a KG constructed with a LLM

Here is a table overview of the issues


This is to be compared with the Wikidata graph labelled by humans, which looks like this:

Human-labelled KG in Wikidata generated with metaphacts

2.4 Towards better KG construction

So where do we go from there? KGs are difficult to construct, and they evolve by nature, which challenges existing KG methods to generate new facts and represent unseen knowledge. The paper Unifying Large Language Models and Knowledge Graphs: A Roadmap provides a good overview of the current state of the art and the challenges ahead.

Knowledge graph construction involves creating a structured representation of knowledge within a specific domain. This includes identifying entities and their relationships with each other. The process of knowledge graph construction typically involves multiple stages, including 1) entity discovery, 2) coreference resolution, and 3) relation extraction. Fig 19 presents the general framework of applying LLMs for each stage in KG construction. More recent approaches have explored 4) end-to-end knowledge graph construction, which involves constructing a complete knowledge graph in one step or directly 5) distilling knowledge graphs from LLMs.

Which is summarised in this figure from the paper:

The general framework of LLM-based KG construction

I've seen only a few projects that have tried to tackle this problem: DerwenAI/textgraphs and IBM/zshot.

3. Unlock Experts

3.1 Human vs AI

The final place where Graphs and LLMs meet is Human+Machine collaboration. Who doesn't love a "Human vs AI" story? News headlines about "AGI" or "ChatGPT passing the bar exam" are everywhere.


I would encourage the reader to have a look at this answer from the AI Snake Oil newsletter. They make a good point that models like ChatGPT memorise solutions rather than reason about them, which makes exams a bad way to compare humans with machines.

Going beyond memorisation, there is a whole area of research around what's called Generalization, Reasoning, Planning, and Representation Learning, and graphs can help with that.

3.2 Human + Machine: Visualisation

Rather than against each other, I'm interested in ways Humans and Machines can work together. In particular, how do humans understand and debug black-box models?

One key project that, in my opinion, moved the needle there was the whatlies paper by Vincent Warmerdam (2020). He used UMAP on word embeddings to reveal quality issues in language models, and built a framework for others to audit their embeddings rather than blindly trust them.

Similarly, Graph Databases come with a lot of visualisation tools out of the box. For example, they add context with colour, metadata, and different layout algorithms (force-based, Sankey).

3.3 Human + Machine: Human in the Loop

Finally, how do we address the lack of control of Deep Learning models, and how do we incorporate declarative knowledge from domain experts?

I like the phrase "the proof is in the pudding": the value of a piece of tech must be judged by its results in production. And when we look at production systems, we see that LLMs and Deep Learning models are not used in isolation, but rather within Human-in-the-Loop systems.

In a project and paper from 2 weeks ago, Google has started using language models to help find and fix bugs in its C/C++, Java, and Go code. The results have been encouraging: it has recently started using an LLM based on its Gemini model to “successfully fix 15% of sanitiser bugs discovered during unit tests, resulting in hundreds of bugs patched”. Though a 15% acceptance rate sounds relatively small, it has a big effect at Google scale. The bug pipeline yields better-than-human fixes: “approximately 95% of the commits sent to code owners were accepted without discussion,” Google writes. “This was a higher acceptance rate than human-generated code changes, which often provoke questions and comments”.

The key takeaway here for me has to do with their architecture:

AI-powered patching at Google

They built it with an LLM, but they also combined LLMs with smaller, more specific AI models, and more importantly with a double human filter on top: humans working with machines.

Conclusion

I remember those 2019 days vividly: moving from LSTMs to Transformers, we thought that was Deep Learning. Now, with LLMs, we've reached what I would describe as Abysmal Learning. I like this image because it can mean both "extremely deep" and "profoundly bad".

More than ever, we need more control, more transparency, and ways for humans to work with machines. In this blog post, we've seen here a few ways in which Graphs and LLMs can work together to help with that, and I'm excited to see what the future holds.

Deeper than Deep Learning: Abysmal Learning

Resources

  1. Language, Graphs, and AI in industry - Paco Nathan - Jan, 2024
  2. Graph ML meets Language Models - Paco Nathan - Oct 25, 2023
  3. [2306.08302] Unifying Large Language Models and Knowledge Graphs: A Roadmap
  4. GitHub - RManLuo/Awesome-LLM-KG: Awesome papers about unifying LLMs and KGs - Jun 14, 2023
  5. Evaluating LLMs is a minefield
  6. GPT-4 and professional benchmarks: the wrong answer to the wrong question - AI Snake Oil - Oct 4, 2023
  7. AI-powered patching: the future of automated vulnerability fixes - Google Security - Jan 31, 2024
  8. Graph & Geometric ML in 2024: Where We Are and What’s Next (Part II — Applications) | by Michael Galkin - Jan 16, 2024
  9. [2312.02783] Large Language Models on Graphs: A Comprehensive Survey - Dec 5, 2023
  10. ULTRA: Foundation Models for Knowledge Graph Reasoning | by Michael Galkin | Towards Data Science - Nov 3, 2023
  11. Fine Tuning Is For Form, Not Facts | Anyscale - July 5, 2023
  12. GenAI Stack Walkthrough: Behind the Scenes With Neo4j, LangChain, and Ollama in Docker - Oct 05, 2023
  13. NebulaGraph Launches Industry-First Graph RAG: Retrieval-Augmented Generation with LLM Based on Knowledge Graphs - Sep 6, 2023
  14. RAG Using Unstructured Data & Role of Knowledge Graphs | Kùzu - Jan 15, 2024
  15. Constructing knowledge graphs from text using OpenAI functions | by Tomaz Bratanic - Oct 20, 2023
  16. Knowledge graph from unstructured text | by Noah Mayerhofer | Neo4j Developer Blog - Sep 21, 2023

]]>
<![CDATA[NER models in Argilla]]> https://guitton.co/posts/ner-argilla 2024-05-23T00:00:00.000Z Demo of NER on football news in Argilla

🤝Organizer: Argilla.io
🏠Venue Host: Argilla Event Calendar

📝Agenda:

​Louis Guitton is a great community member, a long-time attendee of the Argilla community meetup and working as a freelancer within the AI space. Within this meetup, he will:

  • ​Recap on Argilla v1.26-28 Span updates
  • ​Recap on the NER task in the wider context of NLP
  • ​Typical NER datasets and how to load them in Argilla
  • ​The different circles of NER in Argilla
    • ​Load a research NER dataset or annotate data
    • ​Add suggestions with a spaCy pipeline (en_core_web_sm vs SpanMarker)
    • ​Add suggestions with Entity Linking
    • ​Add suggestions with a LLM and few-shot learning
    • ​Add suggestions with a foundation model from Hugging Face (NuNER)
    • ​Bonus: nested NER

​We hope to see you all on the 23rd :)

Gabriel and Natalia from the Argilla team

]]>
<![CDATA[Functional Programming for Pandas Data Engineering]]> https://guitton.co/posts/functional-pandas 2024-08-24T00:00:00.000Z pandas-data-pipeline
Functional Programming for Data Engineering Pipelines that use Python Pandas dataframes
(Source)

If you're maintaining a codebase that uses pandas dataframes heavily, you might have felt this pain already: your files are getting longer, and debugging the data transformations is getting slower.

When it comes to Data Engineering, Functional Programming has proven its value already, and I won't rehash that here. If you're not convinced, just have a look at the seminal piece by Maxime Beauchemin (creator of Apache Airflow and Apache Superset): Functional Data Engineering — a modern paradigm for batch data processing.

But of all the Data Engineering and Machine Learning Operations tools, one is at once heavily used and hard to write functionally: pandas dataframes. I will show some more niche ways of writing pandas code that have served me well in previous roles and with previous clients to reduce tech debt and make Data Engineering in pandas more fun.

Functional Programming in Python

For an in-depth look, have a read of Functional Programming in Python: When and How to Use It.

>>> animals = ["ferret", "vole", "dog", "gecko"]
>>> sorted(animals, key=lambda s: -len(s))
['ferret', 'gecko', 'vole', 'dog']

Functional Programming in Pandas

For an intro to the topic, have a read of Method chaining across multiple lines in Python.

Let's use this dataframe as an example:

import pandas as pd

df = pd.DataFrame.from_records([
    {"name": "Alice", "age": 24, "state": "NY", "point": 64},
    {"name": "Bob", "age": 42, "state": "CA", "point": 92},
    {"name": "Charlie", "age": 18, "state": "CA", "point": 70}
])

Bad: entry-level pandas

df["point_ratio"] = df['point'] / 100
df["surrogate_key"] = df["name"] + "-" + df["age"].astype(str) + "-" + df["state"]
df = df.drop(columns='state')
df = df.sort_values('age')
df = df.head(3)

While this still maintains one transformation per line, there are mentions of df everywhere. We are not explicit about the fact that we rely on the transformations happening in the order we wrote them. Also, you can see with the surrogate_key transformation that readability decreases as transformation complexity increases.

Better: pandas functional API

result = (
    df
    .assign(point_ratio=lambda d: d['point'] / 100)
    .assign(surrogate_key=lambda d: d.apply(lambda r: f"{r['name']}-{r['age']}-{r['state']}", axis=1))
    .drop(columns='state')
    .sort_values('age')
    .head(3)
)

Using .assign and parentheses (), we anchor our approach in functional programming. Each transformation is on its own line, and there are no more mentions of df. We are explicit about the order of transformations.

On the other hand, the surrogate_key transformation is hard to write:

  • there are two nested lambda functions
  • we iterate over rows using .apply and axis=1, which adds complexity
  • we rely on unspoken conventions, like naming the pd.DataFrame parameter d and the "Row" parameter r

Because code is read more than it's written, investing the time to write this code is still worth it for teams. But we can do better.

Best: use pandas.DataFrame.itertuples with the functional API

result = (
    df
    .assign(point_ratio=lambda d: d['point'] / 100)
    .assign(surrogate_key=lambda d: [f"{user.name}-{user.age}-{user.state}" for user in d.itertuples(name="User")])
    .drop(columns='state')
    .sort_values('age')
    .head(3)
)

We take the same approach as before, but we tweak the surrogate_key transformation. This time:

  • no nested lambda
  • we iterate over rows using itertuples, which preserves the dtypes of the rows and gives us NamedTuple objects
  • the explicit variable name user replaces the earlier r
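A pattern that pairs well with this chained style, once transformations grow past a line or two, is extracting them into named functions and composing them with .pipe. This is a sketch of my own, not from the original post; the function names are illustrative:

```python
import pandas as pd

df = pd.DataFrame.from_records([
    {"name": "Alice", "age": 24, "state": "NY", "point": 64},
    {"name": "Bob", "age": 42, "state": "CA", "point": 92},
    {"name": "Charlie", "age": 18, "state": "CA", "point": 70},
])

def add_point_ratio(d: pd.DataFrame) -> pd.DataFrame:
    """Scale points down to a 0-1 ratio."""
    return d.assign(point_ratio=d["point"] / 100)

def add_surrogate_key(d: pd.DataFrame) -> pd.DataFrame:
    """Build a name-age-state surrogate key from typed rows."""
    return d.assign(
        surrogate_key=[f"{u.name}-{u.age}-{u.state}" for u in d.itertuples(name="User")]
    )

result = (
    df
    .pipe(add_point_ratio)
    .pipe(add_surrogate_key)
    .drop(columns="state")
    .sort_values("age")
    .head(3)
)
```

Each step now has a name and a docstring, can be unit-tested in isolation, and the pipeline reads top to bottom like a recipe.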

Conclusion

In this short article, I have shown you a new way to write your pandas data pipelines, which you can leverage to write more explicit and maintainable code for Data Engineering.

]]>
<![CDATA[When Natural Language Processing (NLP) meets Football]]> https://guitton.co/posts/tilores-nlp-football 2022-03-18T00:00:00.000Z Data Science of Football
Unstructured: Volume I - The Data Science of Geo-Gaming, Negativity and Football


This is the first event in our new Unstructured speaker series, looking at the intersection of Data Science and Business.

We will have a small group meeting in Berlin, and hopefully a wider audience joining us online. This Meetup event is for the in-person event in Berlin. After the talks we will have some drinks and networking.

If you only want to join remotely, please sign-up here. The Meetup Event sign-up is only meant for in-person attendees (you are still welcome to join the Group to be notified about all events).

We have three expert speakers sharing their experience of applying data science to business problems.

Boyan Angelov is a CTO and data strategist with a decade of experience in a variety of academic and business environments. He's the author of the O'Reilly "Python and R for the Modern Data Scientist" book and currently working on his second book - "Elements of Data Strategy: A Handbook for the Analytics Manager".

In his work, Nassim Taleb extensively covers the concept of Via Negativa: people are much better at understanding the downsides than the upsides. In this talk, I'll explain what not to do when delivering data projects. I'll go through the most common failure scenarios and the factors causing them. And finally, I will provide several remedy recipes to ensure your data projects don't suffer the same fate.

Louis Guitton attended Mines ParisTech PSL from 2012-2016, where he got his MSc in Engineering (with a minor in Econometrics) and perfected the "spaghetti al kettle". An open source contributor and technologist, Louis spoke in May 2021 at The Knowledge Graph Conference, in NYC, about his graph data science work in natural language processing. This is the business-critical technology he has developed for OneFootball in Berlin.

Stefan Berkner is a passionate self-taught software engineer with 15+ years experience in development and architecture. He was previously Lead Software Engineer at a German credit bureau where he was responsible for leading the development of the technology that would go on to become Tilo, where he is the Chief Development Officer.

Searching in databases using geographical data and a given distance can be challenging if the database does not support this natively. Creating a grid on the world drastically reduces the potential search space. Stefan will explain how one of his favourite games, Dyson Sphere Program, influenced him in choosing a grid that is easy to calculate and work with.

(Slides and video recording embedded in the original post)

]]>
<![CDATA[Reviewing my first research paper on EasyChair]]> https://guitton.co/posts/research-paper-review 2024-08-27T00:00:00.000Z EasyChair the platform for reviewing research papers

Overall evaluation -3: (strong reject; on a scale from -3 to +3)

Reviewer's confidence 3: (medium; on a scale from 1 to 5)

Overview

The paper presents a serialisation method for RDF ontologies into flat JSON, along with a Java-based tool called rdf2json. The JSON generator overcomes the circular structures found in graphs by using "mapping paths". The paper then compares the new serializer to existing JSON serializers and presents future work.

Strengths

  • the paper satisfies the relevance criteria of the call: it focuses on practical code pipelines around KG ontologies; in particular, its area for submission seems to fit the topics “Tools for mechanizing building of knowledge graphs” and “Connections to software engineering practices, such as build tools”
  • the paper has the goal of bringing incremental adoption of Semantic Web technologies to software development teams, and shares an open-source library on GitHub.

Weaknesses

  • it is not easy to follow the motivation, overall technical ideas, and main results; it is not clear that they are expressed in a manner that the broader ISWC audience can understand.
  • the paper does not provide enough evidence to justify its key claims and conclusions.
  • the novelty (insights, method) versus the other methods mentioned in the paper is not clear.

Details

1/ Clarity

The area for submission is not clearly articulated in the abstract or introduction. Maybe the author can refer to the call for papers and reuse some of its verbiage, for example “Tools for mechanizing building of knowledge graphs” and “Connections to software engineering practices, such as build tools”.

The open-source library the paper presents is only mentioned in the last sentence; the author could fix this by mentioning the Java project earlier, perhaps even in the title.

Some of the key arguments are not developed. For example, "they deliver information in a graph-like structure instead of a tree structure" or "The model creator uses restricted paths to draw only the relevant branches of the final tree structure".

The structure of the paper is clear, but we might suggest the following tweak: 1-Introduction / 2-RDF2JSON: Usage and examples / 3-Comparison with existing approaches / 4-Roadmap. Anchoring the structure around the open-source library might help the author explain the use cases and benefits of the method.

The introduction doesn't have any figures. An architecture diagram would be welcome, especially given the number of entities involved: Ontologist, RDF, Triple store, Jena API, rdf2json, JSON, Developer/User.

The writing contains grammatical errors, making it hard to follow and review. For example "Simple Person ontology can be seen in Figure 1."

2/ Evidence

The writing contains many general claims with no evidence to back them up, which costs the author the reader's trust. Example phrasings we hope the author can improve in a future version: “trivial to many”, “quite the opposite”, “[developers] prefer”, "Data structures should be modeled by data/domain experts, and not by software developers", "Desired structure by developers", "Ontologies should be created by data experts, not by software developers", "This is very far from what is actually happening", "History repeats itself", "This is certainly not the adoption level the community is looking for".

The bibliography mentions 3 papers, including a well-cited one (1169 citations / 52 highly influential citations) to establish context, but only 1 paper is recent (2023); the rest are more than 10 years old (2012-2013), so it's hard to see how this paper connects to recent publications. The remaining references are not research-related, and even include a private consulting firm's press release. Maybe the author could aim to connect their contribution to more numerous and more recent papers (~10, from 2010 to today).

3/ Novelty


The paper mentions other JSON serialization methods without highlighting what the newly proposed method improves upon. Maybe examples in the form of "before and after" could help the reader understand the novelty. For example: Person ontology with JSON-LD vs Person ontology with RDF2JSON.

The author mentions that "[existing JSON serializers] deliver information in a graph-like structure instead of a tree structure", implicitly implying that the tree structure is better, without explaining how or why.

]]>
<![CDATA[Learn SPARQL in 5 minutes and use it to query WikiData]]> https://guitton.co/posts/wikidata 2020-05-27T00:00:00.000Z Wikidata logo
Learn how to query Wikidata

I'm working on Entity Linking and Knowledge Bases. In that context, exporting a relevant part of Wikidata can be really useful for building surface form dictionaries, co-occurrence probabilities, etc. In order to know which part of Wikidata is relevant to dump, I thought we could query Wikidata (although it seems we can only download the entire dump and filter afterwards).

SPARQL is a language for formulating questions (queries) against knowledge databases, so you can query Wikidata with SPARQL. At first sight, the syntax is not particularly easy, so I've gone through this tutorial.

SPARQL in 5 minutes

  • # is the comment character
  • The SELECT clause lists variables that you want returned (variables start with a question mark)
  • The WHERE clause contains restrictions on them, in the form of SPO triples (subject, predicate, object), e.g. ?fruit hasColor yellow.
  • On Wikidata, items and properties are not identified by human-readable names like “hasColor” (property) or “yellow” (item).
  • Instead, Wikidata items and properties are assigned an identifier, that you need to know beforehand.
    • for items, it's a Q-number, e.g. "yellow" is Q943
    • for properties, it's a P-number, e.g. "hasColor" is P462
  • you can search for the identifiers using search term for items and P:search term for properties
  • but you should rely on autocompletion in query.wikidata.org by pressing Ctrl + Space
  • Finally, you need to include prefix namespaces to query the WDQS (Wikidata Query Service). There are many prefixes, one for each namespace in SPARQL:
    • wd: for items
    • wdt: for properties, pointing to the object
  • to double-check what prefix links to what resource, use https://prefix.cc/
  • to get more than the Wikidata ID as selectable attributes, you need to include them in the WHERE clause using SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }

Putting this all together we get:

SELECT ?fruit ?fruitLabel
WHERE
{
  # fruit hasColor yellow
  ?fruit wdt:P462 wd:Q943
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it

  • You can further filter this down by adding more triple conditions using ; character e.g. you could filter for actual fruits by doing
# fruit instance of or subclass of a fruit
?fruit wdt:P31/wdt:P279* wd:Q3314483;
  • Advanced filters :
    • p: for properties, pointing to the subject
    • ps: for property statement
    • pq: for property qualifier
  • You can abbreviate a lot with the [] syntax
SELECT ?painting ?paintingLabel ?material ?materialLabel
WHERE
{
  # element is a painting
  ?painting wdt:P31/wdt:P279* wd:Q3305213;
  # extract the statement node 'material' (P186)
            p:P186 [
              # get material property statement
              ps:P186 ?material;
              # 'applies to part'(P518) 'painting surface'(Q861259)
              pq:P518 wd:Q861259
            ].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
  • More grammar by example: ORDER BY, LIMIT
SELECT ?country ?countryLabel ?population
WHERE
{
  # instances of sovereign state
  ?country wdt:P31/wdt:P279* wd:Q3624078;
  # hasPopulation populationValue
           wdt:P1082 ?population.
  # filter for english translations
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
# ASC(?something) or DESC(?something)
ORDER BY DESC(?population)
LIMIT 10
  • If you add more variables like population above, the query will filter out countries that don't have a population value. To fix this, use an OPTIONAL clause
SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
  ?book wdt:P50 wd:Q35610.
  OPTIONAL { ?book wdt:P1476 ?title. }
  OPTIONAL { ?book wdt:P110 ?illustrator. }
  OPTIONAL { ?book wdt:P123 ?publisher. }
  OPTIONAL { ?book wdt:P577 ?published. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
SELECT ?person ?personLabel ?age
WHERE
{
  # instance of human
  ?person wdt:P31 wd:Q5;
          wdt:P569 ?born;
          wdt:P570 ?died;
  # died from capital punishment
          wdt:P1196 wd:Q8454.
  BIND(?died - ?born AS ?ageInDays).
  BIND(?ageInDays/365.2425 AS ?ageInYears).
  BIND(FLOOR(?ageInYears) AS ?age).
  FILTER(?age > 90)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

Try it

  • One can select based on a list of items using VALUES
SELECT ?item ?itemLabel ?mother ?motherLabel
WHERE {
  # A. Einstein or J.S. Bach
  VALUES ?item { wd:Q937 wd:Q1339 }
  # mother of
  OPTIONAL { ?item wdt:P25 ?mother. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
  • The Label Service extension automatically generates labels as follows:
    • ?xxxLabel as a shortcut for rdfs:label
    • ?xxxAltLabel as a shortcut for skos:altLabel
    • ?xxxDescription as a shortcut for schema:description

Fun SPARQL queries related to football

Get the 🇳🇱 Dutch nicknames of a team:

# get the dutch nicknames from Bayern München
SELECT ?item ?itemLabel ?itemDescription ?itemAltLabel
WHERE {
  VALUES ?item { wd:Q15789 }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "nl". }
}

Try it

Get the stadium names of the teams that are part of the Big 5:

SELECT ?item ?itemLabel ?venue ?venueLabel ?venueAltLabel
WHERE
{
  ?item wdt:P31/wdt:P279* wd:Q847017;
        wdt:P118 ?league;
        wdt:P115 ?venue.
  # filter for Big 5
  VALUES ?league { wd:Q82595 wd:Q9448 wd:Q13394 wd:Q15804 wd:Q324867 wd:Q206813}.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ?league

Try it

Solutions to the tutorial exercises

Here are my solutions to the exercises in that tutorial.

Chemical elements

Write a query that returns all chemical elements with their element symbol and atomic number, in order of their atomic number.

SELECT ?element ?elementLabel ?symbol ?atomic_number
WHERE
{
  ?element wdt:P31 wd:Q11344;
           wdt:P246 ?symbol ;
           wdt:P1086 ?atomic_number .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ASC(?atomic_number)

Try it

Rivers that flow into the Mississippi

Write a query that returns all rivers that flow directly or indirectly into the Mississippi River.

SELECT ?river ?riverLabel
WHERE
{
  ?river wdt:P31 wd:Q4022;
         wdt:P403/wdt:P403* wd:Q1497 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ASC(?riverLabel)

Try it

References to Le Figaro website

SELECT ?ref ?refURL WHERE {
  ?ref pr:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .
} LIMIT 10

Now that you have developed a SPARQL query, here is the simplest way to programmatically query Wikidata with Python:

Requirements: pandas, requests
"""SPARQL utils."""
from pathlib import Path
from typing import List
from urllib.parse import urlparse

import pandas as pd
import requests


def query_wikidata(sparql_file: str, sparql_columns: List[str]) -> pd.DataFrame:
    """Query Wikidata SPARQL API endpoint."""
    wikidata_api = "https://query.wikidata.org/sparql"
    query = Path(sparql_file).read_text()
    r = requests.get(wikidata_api, params={"format": "json", "query": query})
    data = r.json()
    df = (
        pd.json_normalize(data, record_path=["results", "bindings"])
        .rename(columns={c + ".value": c for c in sparql_columns})[sparql_columns]
        .assign(q_id=lambda d: d.item.apply(lambda u: Path(urlparse(u).path).stem))
    )
    return df
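The last .assign may look cryptic: it derives a q_id column by taking the final path segment of each entity URI. Isolated as a stdlib-only sketch (the helper name is mine):

```python
from pathlib import Path
from urllib.parse import urlparse

def extract_q_id(uri: str) -> str:
    """Turn a Wikidata entity URI into its bare Q-number."""
    # urlparse(...).path -> "/entity/Q943"; Path(...).stem -> "Q943"
    return Path(urlparse(uri).path).stem

extract_q_id("http://www.wikidata.org/entity/Q943")  # "Q943"
```

The same trick works for property URIs, which end in a P-number instead.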
]]>
<![CDATA[Code Reviews: a Cheat Sheet]]> https://guitton.co/posts/code-reviews-cheatsheet 2025-07-02T00:00:00.000Z

Code Review can turn into a weird Ping-Pong game

A code review is a process where someone other than the author(s) of a piece of code examines that code. Code committed to the codebase is the responsibility of both the author and the reviewer.

Done right, PR reviews can be the engine of team and business growth. Done poorly, they can leave the team fatigued and the business questioning their value. This guide shares my experience and best practices for avoiding inefficient and unpleasant code reviews.

1. What to look at

A Code Review Maslow Pyramid of Needs

Code reviews should look at:

  • Design: Is the code well-designed and appropriate for your system?
  • Functionality: Does the code behave as the author likely intended? Is the way the code behaves good for its users?
  • Complexity: Could the code be made simpler? Would another developer be able to easily understand and use this code when they come across it in the future? Beware of over-engineering. The code should solve problems that need to be solved _now_, and not problems that the code author speculates _might_ need to be solved in the future.
  • Tests: Does the code have correct and well-designed automated tests?
  • Naming: Did the developer choose clear names for variables, classes, methods, etc.?
  • Comments: Are the comments clear and useful? Note that comments are most useful when they explain _why_ the code exists.
  • Style: Does the code follow our style guides? Note, in most cases, style nits should be avoided and be enforced entirely by automated tooling. However, some stylistic decisions can be discussed if it impacts readability and complexity.
  • Documentation: Did the developer also update relevant documentation?

All of the above are grounds for a reviewer to request changes in a PR. Consensus should be reached to the best of the abilities of the author(s) and reviewer. However, if consensus cannot be reached between the two parties, the review should be escalated to the technical lead.

2. Code Review Conduct

2.1. Be a great submitter

Provide context with the PR template

YOU are the primary reviewer

  • code review is not a tennis game where "the ball is in your court now". Review your own code with the same level of detail you would apply when reviewing someone else's.
  • make sure the code works
  • don't rely on others to catch your mistakes

Things to think about

  • did I check for reusable code or utility methods? is the code elegant?
  • did I remove debugger statements and prints? is the code readable?
  • is my code secure?
  • is my code maintainable?

Work in progress

We believe in starting a review early so you don’t get too far only to have to rewrite things after someone has made a great suggestion.

Just create a PR, even with only a readme commit (a good rule of thumb: when 30 to 50% of the code is there), and add a clear "[WIP]" tag to the title so that we know it's a work in progress.

The sooner you get feedback, the better: nobody wants to hear at 90% of the way "you need to redo everything".

Ask for review early and expect architectural design comments.

General Guidelines

  • Provide context to the reader = use the PR template
  • Review your own code = if needed, build a sandbox
  • Expect conversation
  • Submit in-progress work = see "Work in progress" above
  • Submit reviews < 500 lines of Python code
  • Use automated tools = see section 3
  • Be responsive
  • Accept defeat

How to allow maintainers to modify your PR

Allowing changes to a pull request branch created from a fork - GitHub Docs

2.2. Be a great reviewer

"Why don't you simply stretch and smile?"
  1. Be kind.
  2. Explain your reasoning.
  3. Balance giving explicit directions with just pointing out problems and letting the developer decide.
  4. Encourage developers to simplify code or add code comments instead of just explaining the complexity to you.

Other details

  • Make sure you are aware of the problem/feature.
  • Don't be rude; be polite.
  • Try to avoid the first person: talk about the PR and the code, not about the author!
  • Give suggestions and make clear why you think your suggestion is better than the current approach.
  • Link to resources: blog posts, Stack Overflow answers...
  • Don't point out just the bad things; give compliments as well.
  • Ask questions instead of giving answers.
  • Don't burn out: try to review max 400 lines of code in one session. Make it part of your daily workflow (use GitHub notifications).
  • Don't use the words "now simply", "easily", "just", "obviously" ...

What to provide feedback on

Code review is not only for experienced developers! Here is what you can provide feedback on:

  • high-level business goals
  • high-level glance at the code and readability check
  • setup: can you run it?
  • technical solutions / architecture design / the actual code

3. Engineering Management

This part is closer to a manifesto than to anything else, but I still find it useful:

  • Universal code reviews: Everyone should review and be reviewed (junior or senior)
  • Ensure consistency:
    • We should agree on a style guide to move away from personal preference (we use Google's style guide for R and Python)
    • Once we agree on the style guide, start automating things with linters (from more painful to less: CI on the code, git pre-commit hooks or IDE setups for each developer)
  • performed by peers and not management (code review is not a performance review)
  • no blame culture
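To make the linter bullet concrete, here is a minimal .pre-commit-config.yaml sketch; the hook choice (black as formatter, flake8 as linter) and the rev pins are illustrative assumptions, not a prescription:

```yaml
repos:
  - repo: https://github.com/psf/black        # auto-formatter
    rev: 24.4.2                               # placeholder: pin to the current release
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/flake8     # linter
    rev: 7.0.0                                # placeholder: pin to the current release
    hooks:
      - id: flake8
```

Each developer runs pre-commit install once, and the hooks then run automatically on every commit.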

4. Python Code Review Checklist

"first make it work, then make it beautiful, then make it fast"

  • Correct
    • Design
      • modularity
      • reusability
    • Functionality
    • configuration management
  • Readable
    • Style: formatter, linter
    • naming
  • Altruist
  • Performance optimised / Complexity
    • immutable data types
    • numpy over for loops

]]>
<![CDATA[How to parse dbt artifacts]]> https://guitton.co/posts/dbt-artifacts 2020-12-20T00:00:00.000Z A lot of artifacts
(Source)

Overview

If you're using dbt, chances are you've noticed that it generates and saves one or more artifacts with every invocation.

In this post, I'll show you how to get started with dbt artifacts, and how to parse them to unlock applications valuable to your team and your use case.

Whether that's just for a fun Friday afternoon learning session, or your first foray into building a Data Governance tool using dbt, I hope you'll find this post useful, and if you do, let me know on Twitter!

When are Artifacts Produced

dbt logo
dbt logo

A word of warning: dbt's current minor version as of writing is v0.18.1, and multiple improvements to artifacts are coming in dbt's next version, v0.19.0, but that doesn't change the content of this post.

dbt has produced artifacts since the release of dbt-docs in v0.11.0. Starting in dbt v0.19.0, we are committing to a stable and sustainable way of versioning, documenting, and validating dbt artifacts.
Ref: https://next.docs.getdbt.com/reference/artifacts/dbt-artifacts/

The artifacts currently generated are JSON files called manifest.json, catalog.json, run_results.json and sources.json. They are used to power the docs website and other dbt features.

Different dbt commands generate different artifacts, so I've summarised that in the table below:

(Table: which artifacts each dbt command produces)

Of course, dbt docs is the command that refreshes the most artifacts (which makes sense, since artifacts were initially introduced to power the docs site). But it's interesting to note that the manifest can also be refreshed by commands other than the usual suspects, dbt run and dbt test.

Available Data in dbt artifacts

Manifest:

Today, dbt uses this file to populate the docs site, and to perform state comparison. Members of the community have used this file to run checks on how many models have descriptions and tests.
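As a toy sketch of such a check, here the inline two-node dict is a made-up stand-in for a real json.load of target/manifest.json:

```python
# Toy check: how many models in manifest.json lack a description?
manifest = {
    "nodes": {
        "model.jaffle_shop.dim_customers": {
            "resource_type": "model", "description": "One row per customer."
        },
        "model.jaffle_shop.stg_customers": {
            "resource_type": "model", "description": ""
        },
        # tests live in "nodes" too, so we filter them out below
        "test.jaffle_shop.unique_dim_customers_id": {
            "resource_type": "test", "description": ""
        },
    }
}

models = {k: v for k, v in manifest["nodes"].items() if v["resource_type"] == "model"}
undocumented = sorted(k for k, v in models.items() if not v["description"])
print(f"{len(undocumented)}/{len(models)} models lack a description")
```

Swap the inline dict for the parsed manifest of your own project and the same two lines give you a documentation-coverage metric.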

Run Results:

In aggregate, many run_results.json can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.

Catalog:

Today, dbt uses this file to populate metadata, such as column types and table statistics, in the docs site.

Sources:

Today, dbt Cloud uses this file to power its Source Freshness visualization.

graph.gpickle:

Stores the networkx representation of the dbt resource DAG.

Parsing Artifacts from the Command Line with `jq`

jq logo
jq logo

To get started with parsing dbt artifacts for your own use case, I suggest using jq, the lightweight and flexible command-line JSON processor. This way, you can try out your ideas and explore the available data without writing much code at first.

jq Cheat sheet:

In particular, you will need to make use of some of the built-in operators like to_entries and map.

Here is a command to grab the materialisation of each model

→ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})'
[
  {
    "node": "model.jaffle_shop.dim_customers",
    "materialized": "table"
  },
  {
    "node": "model.jaffle_shop.stg_customers",
    "materialized": "view"
  }
]

You can then for example store that into a file by piping the output

cat target/manifest.json | jq '.nodes | ...' > my_data_of_interest.json

Parsing Artifacts from Python with `pydantic`

pydantic docs
pydantic docs

Once you get a better idea of what data you need, you might want to develop more custom logic around dbt artifacts. This is where Python shines: you can write a script with the logic you need, and install and import great Python libraries. For instance, you could use networkx to run graph algorithms on your dbt DAG.

You will then need to parse the dbt artifacts in Python. I recommend the great pydantic library: among other things, it lets you parse JSON files with very concise code that keeps the focus on high-level parsing logic.

Here is an example logic to parse manifest.json:

import json
from enum import Enum
from pathlib import Path
from typing import Dict, List, Optional

from pydantic import BaseModel, validator


class DbtResourceType(str, Enum):
    model = 'model'
    analysis = 'analysis'
    test = 'test'
    operation = 'operation'
    seed = 'seed'
    source = 'source'


class DbtMaterializationType(str, Enum):
    table = 'table'
    view = 'view'
    incremental = 'incremental'
    ephemeral = 'ephemeral'
    seed = 'seed'


class NodeDeps(BaseModel):
    nodes: List[str]


class NodeConfig(BaseModel):
    materialized: Optional[DbtMaterializationType]


class Node(BaseModel):
    unique_id: str
    path: Path
    resource_type: DbtResourceType
    description: str
    depends_on: Optional[NodeDeps]
    config: NodeConfig


class Manifest(BaseModel):
    nodes: Dict[str, Node]
    sources: Dict[str, Node]

    @validator('nodes', 'sources')
    def filter(cls, val):
        return {k: v for k, v in val.items() if v.resource_type.value in ('model', 'seed', 'source')}


if __name__ == "__main__":
    with open("target/manifest.json") as fh:
        data = json.load(fh)

    m = Manifest(**data)

Once you've got the Manifest class, you can use it in your custom logic. For example, in our use case from above where we want to check for model materialization, we can do:

>>> m = Manifest(**data)
>>> [{"node": node, "materialized": n.config.materialized.value} for node, n in m.nodes.items()]
[
  {
    "node": "model.jaffle_shop.dim_customers",
    "materialized": "table"
  },
  {
    "node": "model.jaffle_shop.stg_customers",
    "materialized": "view"
  }
]

Example Application 1: Detecting a Change in Materialization

Let's say you want to check that no materialisation has changed before you run dbt run. This is useful because some materialization changes require a --full-refresh. You could achieve the change detection with the following commands:

→ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})' > old_state.json
→ # code change: let's say one model materialization is changed from table to view
→ dbt compile
→ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})' > new_state.json
→ diff old_state.json new_state.json
12c12
<     "materialized": "table"
---
>     "materialized": "view"

Example Application 2: Compute Model Centrality with networkx

Once you've parsed the manifest.json, you have at your disposal the graph of models from your project. You could explore off-the-shelf graph algorithms provided by networkx, and see if any of the insights you get are valuable.

For example, nx.degree_centrality can give you the list of models that are "central" to your project. You can use that e.g. to prioritise maintenance efforts. In the future, you could imagine a dbt docs search that prioritises results based on this metric, as a very simple PageRank proxy.

Once you've written the pydantic code from above, this turns out to be possible in very few lines of code.

import networkx as nx

# ... pydantic code from above for Manifest class

class GraphManifest(Manifest):
    @property
    def node_list(self):
        return list(self.nodes.keys()) + list(self.sources.keys())

    @property
    def edge_list(self):
        # depends_on is Optional, so skip nodes without dependencies
        return [(k, d) for k, v in self.nodes.items() if v.depends_on for d in v.depends_on.nodes]

    def build_graph(self) -> nx.Graph:
        G = nx.Graph()
        G.add_nodes_from(self.node_list)
        G.add_edges_from(self.edge_list)
        return G


if __name__ == "__main__":
    with open("target/manifest.json") as fh:
        data = json.load(fh)

    m = GraphManifest(**data)
    G = m.build_graph()
    nx.degree_centrality(G)
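To make the centrality output actionable, you can rank models by their score. Here is a dependency-free sketch of what nx.degree_centrality computes, run on a toy edge list (the model names are illustrative, not from a real project):

```python
from collections import Counter

# Toy edge list standing in for a dbt project graph
edges = [
    ("model.dim_customers", "model.stg_customers"),
    ("model.dim_customers", "model.stg_orders"),
    ("model.fct_orders", "model.stg_orders"),
]

nodes = {n for edge in edges for n in edge}
degree = Counter(n for edge in edges for n in edge)

# Same normalisation as nx.degree_centrality: degree / (number of nodes - 1)
centrality = {n: degree[n] / (len(nodes) - 1) for n in nodes}

# Rank models by centrality, e.g. to prioritise maintenance efforts
ranking = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Here model.dim_customers and model.stg_orders come out on top: each touches two of the four nodes, so the graph tells you where a refactor or an incident would ripple the furthest.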

Example Application 3: Graph visualisation

Provided you use Python 3.8+, there is another dbt artifact that can be interesting to you: graph.gpickle. Instead of parsing manifest.json and building the graph yourself, you can deserialize the networkx graph built by dbt itself.

All it takes is 2 lines!

That's hard to beat, but note that you will rely on the internal graph definition of dbt and won't be able to customise it. For example, tests will be nodes on your graph now.

import networkx as nx

G = nx.read_gpickle("target/graph.gpickle")

Nevertheless, this can be useful for example for a quick visualisation using pyvis:

from pyvis.network import Network

nt = Network("500px", "1000px", notebook=True)
nt.from_nx(G)
nt.show("nx.html")

References

  1. dbt (data build tool) - Transform data in your warehouse
  2. dbt Artifacts | docs.getdbt.com
  3. dbt Artifacts | next.docs.getdbt.com
  4. dbt Command reference | docs.getdbt.com
  5. jq, a lightweight and flexible command-line JSON processor.
  6. jq Manual | Builtin operators and functions
  7. pydantic docs
  8. NetworkX — NetworkX documentation
  9. networkx.algorithms.centrality.degree_centrality — NetworkX 2.5 documentation
  10. Interactive network visualizations — pyvis 0.1.3.1 documentation

]]>
<![CDATA[(non) Alternatives to dbt]]> https://guitton.co/posts/dbt-alternatives 2024-09-25T00:00:00.000Z Landscape of data transformation tools

🤝Organizer: community members Eva Schreyer and Lucas Silbernagel
🏠Venue Host: Enpal office @ Germany

📝Agenda

  • 6:00 PM | Check in/Registration
  • 6:30 PM | Welcome Remarks & Housekeeping by Enpal
  • 6:45 PM | What's new with dbt? (Stephan Durry, dbt Labs)
  • 7:00 PM | 1. Talk: Streamlining dbt: How to Build a Project Structure that Keeps Your Team on the Same Page (Ekaterina Khrushch, Contentful)
  • 7:30 PM | 2. Talk: The (non) alternatives to dbt (Louis Guitton, Freelance Data Engineer)
  • 8:00 PM | 3. Talk: Tougher Than Berghain Bouncers: Crafting a CI Pipeline for Your dbt Repo That Turns Away the Unworthy (Noel Spencer, Enpal)
  • 8:30 PM | Networking & Reception
Stephan Durry (dbt Labs) on what's new in dbt

]]>
<![CDATA[How To Build and Interpret a Nomogram for Setting Better Running Goals]]> https://guitton.co/posts/running-goals 2023-12-21T00:00:00.000Z Whether you're starting your fitness journey or planning your next running season, you will need to understand where you are, measure your progress and set running goals. I had to do it myself when I started running this summer, and I was lost.

I turned to my Data Science and Engineering background and built a tool called a Nomogram to assist me.

This guide provides a nonstatistical audience with a methodological approach for building, interpreting, and using nomograms to estimate running fitness and set difficult and specific goals. If you do not know what a Nomogram is, don't worry, I will explain it step by step in the rest of the article.

Brief Review of Goal Setting Theory and Discussion on Performance

Although this article deals with setting better goals, this is not a Goal Setting blog post. Setting goals is part of any self-improvement approach, and fitness or running is no exception.

When setting out to set your own goals, it's easy to get lost in the profusion of acronyms and fields in which goal setting is used, for example: Psychology (e.g. WOOP: Wish, Outcome, Obstacle, Plan), Self-help (e.g. SMART: Specific, Measurable, Achievable, Relevant, and Time-Bound) or Business (e.g. OKRs and KPIs: Objectives, Key Results, Key Performance Indicators).

Sometimes, goals are even set for us: by our employer, our doctor, our coach, our family, our friends, our insurance company. For example, my health insurance gives me a few pieces of basic fitness advice:

  • "do 60k steps per week, use an app to track them"
  • "climb stairs"
  • "reduce stress (eustress vs distress)"
  • "take breaks at work, move 2h out of 8h of your workday"

Although I'm no stranger to setting goals, I was lost when I started running this summer, until I rediscovered Locke's theory of goal setting. In 1968, Edwin Locke published a paper called "Toward a Theory of Task Motivation and Incentives" in which he proposed that:

After controlling for ability, goals that are difficult to achieve and specific tend to increase performance far more than easy goals, no goals or telling people to do their best. It therefore follows that the simplest motivational explanation of why some individuals outperform others is that they have different goals.

The first part of this quote is key: "After controlling for ability". The verb control is used in its statistical sense, meaning that the effect of ability is removed from the equation. In other words, we all have different running fitness levels, and we need to control for that when setting goals.

The second part of the quote calls for a disclaimer: by following this approach, we bias ourselves towards performance. There are plenty of other motivations for running, and they are perfectly valid:

It's not all about performance, there are other valid motivations for running

But if for the rest of this post we focus on performance, we also need to realise that performance is the result of many factors. For example, the blog post "Why are you so slow?" uses a statistical model to reveal that running speed for a 200m dash is influenced by 5 factors, of which the weakest link is the limiting factor. In other words, if you want to improve your 200m dash time, you need to improve your height, weight, fast- and slow-twitch muscle mass, cardiovascular conditioning, flexibility and elasticity. The research paper Factors associated with high-level endurance performance goes even further and lists 26 factors that influence endurance running performance. I will spare you the details and leave only a figure from the paper that summarises the factors:

Consensus report on the 26 factors (FENDLE) that influence high-level endurance running performance

I don't know about you, but this is too many factors to be practical. So I started looking for a single numerical estimate of my running fitness that I could use to set goals. It should be easy to measure, easy to understand, and easy to compare to others. Most importantly, it should be tailored to my individual profile.

French Engineering and Nomograms

Nomograms are graphical calculating devices that look like 2D diagrams and allow approximate computations. They are particularly valuable for their ability to reduce a statistical predictive model into a single numerical estimate, perfect for our use case!

The field of nomography was invented in 1884 by the French engineer Philbert Maurice d'Ocagne (1862-1938) and used extensively for many years to provide engineers with fast graphical calculations of complicated formulas to a practical precision.

Historically, they were used and developed in civil engineering. Place yourself in 1843: you are a civil engineer, and you need to calculate the volume of earth to be moved for the construction of a road or a railway. You have a formula, but it is complicated and you have no computer to do the calculation for you. At that time, the French administration would have sent you a graphical table to help with the calculation. Tables turned into nomograms, and just a few years later in 1846, Léon Lalanne, a French engineer from Ecole Polytechnique and Ecole des Ponts, published a nomogram called "Abaque ou compteur universel" in which he explains how to use a nomogram to do all sorts of calculations.

Abacus of the universal calculator by Leon Lalanne 1843

Later, in 1867, Eduard Lill, an Austrian engineer and Captain of Military Engineering, published a nomogram to solve quadratic equations (x² + px + q = 0), showing nomograms were not just a French affair.

Recently, nomograms have been used beyond civil engineering, notably in electrical engineering (e.g. for resistor or inductance sizing), mechanical engineering (e.g. for gear dimensioning), and chemical engineering (e.g. for phase transitions of materials). Today, they are mostly used for educational purposes, their practical usage having been replaced by computers, except in a few domains, e.g. cancer prognosis.

Being a French engineer, I have had the pleasure to study "Abaques" (the French word for nomograms, which would translate to abacuses) in my time. I have in particular been influenced by the nomogram used in optical engineering for the Lensmaker's equation, and by level sets ("abaques de Pouchet" or "lignes de niveaux" in French).

Nomogram used in optical engineering for the Lensmaker's equation

Searching for a Single Numerical Estimate

At this point in my reasoning, equipped with Locke's theory and nomograms, I could summarise my requirements for the running nomogram as follows:

  1. professionals and amateurs share an axis: the graph lets you compare yourself to others and in particular, to professionals
  2. long and short running races share an axis: from sprinting to long endurance, athletes can use their fitness in a wide range of distances, from the 100m dash to 100km ultra trails
  3. leans on a single numerical estimate to summarise running fitness: a rating perhaps, out of 10 or 100, useful to compare myself to my former self or to others

At that point in my running journey, I had been exposed through my Garmin smartwatch to an indicator called VO2max. VO2max is a measure of the maximum volume of oxygen that an athlete can use. It is a good indicator of cardiorespiratory fitness, and it is used by Garmin to estimate your running fitness. There are common protocols to estimate VO2max, such as the Cooper test or the Vameval test (particularly popular in France for football). The idea is to run as fast as you can for a given amount of time (e.g. 6 minutes) and to measure the distance covered. These protocols measure your maximal aerobic speed (MAS), which is related to VO2max. They have their own practicality and precision issues (e.g. you need to know your pace upfront, which is a chicken-and-egg problem).

For my personal use case, VO2max started losing importance because my typical efforts (e.g. a 60min football game, a 2h bike tour, a 10km running race) are much longer than 6 minutes. I started to realise that other indicators were summarising my running fitness better. For example, I noticed that my average pace on a 60min Z2 jog was improving (cf A guide to heart rate training - Runner's World).

I later learned about Critical Speed (CS). Without going into the details, CS is a measure of the maximum speed that an athlete can sustain for a long period of time. It can replace MAS as a surrogate estimate of fitness. You can use the previous link to calculate it or this link. One of its added benefits is that it is very close to the second ventilatory threshold (SV2) which is otherwise costly and impractical to measure (you need lactate and ventilatory tests and a costly physiological assessment).

Example of a physiological assessment performed by Upside Strength on a CrossFitⓇ Competitor.

In particular, the 2020 paper Calculation of Critical Speed from Raw Training Data in Recreational Marathon Runners shows that CS can be calculated from a few personal time trials (e.g. 400m, 800m, 5km) and that it is a good predictor of marathon performance. Moreover, you can visualise the CS in a 2D space where the x-axis is the duration of effort in seconds and the y-axis is the average speed during the effort in km/h. This 2D space will form the basis for our nomogram in the next sections.
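The distance-time model behind this calculation is a straight line, d = CS × t + D', where the slope is the Critical Speed and the intercept D' is the finite anaerobic distance capacity. A minimal sketch of that fit, using made-up time-trial numbers purely for illustration:

```python
# Fit d = CS * t + D' by ordinary least squares over a few time trials.
# The (distance in m, time in s) pairs below are made-up illustrative numbers.
trials = [(400, 70.0), (800, 150.0), (5000, 1200.0)]

n = len(trials)
mean_t = sum(t for _, t in trials) / n
mean_d = sum(d for d, _ in trials) / n

# Slope of the regression line is CS (m/s), intercept is D' (m)
cs = (sum((t - mean_t) * (d - mean_d) for d, t in trials)
      / sum((t - mean_t) ** 2 for _, t in trials))
d_prime = mean_d - cs * mean_t

print(f"CS = {cs * 3.6:.1f} km/h, D' = {d_prime:.0f} m")
```

With these toy trials the fit lands around 14.5 km/h, i.e. the pace this hypothetical runner could in principle sustain for a long effort, plus a reserve of roughly 150 m of "above-CS" running.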

Athletics World Records and the Valencia Marathon

Here is the first version of our nomogram:

Step 1: Running Nomogram with World Records and Valencia Marathon data points

This blog post is not a dataviz tutorial, but let me just say that I built this visualisation with the Python programming language and the Altair visualisation library. The code is available on GitHub.

I have added a few World Records for men and women in a few typical disciplines: 1 Mile, 5km, 10km, Half Marathon and Marathon. Those dots answer the question: "what would a world record athlete do?". They also form the upper bound of our y-axis: above that line, no human has run faster. Note that this line may move up over time (meaning the World Records will improve) due to training optimisation, new technology (shoes), better doping drugs ...

Beyond World Records, it's interesting to look at major athletics events like the Valencia Marathon. Although marathons welcome amateurs, they need to define a lower limit, for logistical and economic reasons (maintaining the brand value of a Valencia Marathon Finisher). This is called the "sweeper car" or "broom wagon" ("voiture balai" in French).

the maximum official time for finishing the race being 5h:30:00, with this time limit not being exceeded under any circumstances.
Tour de France cyclists being swept by the Broom Wagon because they were too slow

The next interesting data points from the Valencia Marathon are the so-called "starting waves". At the start of a running event, the organiser staggers the athletes in waves of people that hopefully run at a similar pace. The main goal of waves is to limit the meandering needed to overtake a slower athlete, which costs the faster athlete energy and time. An indirect benefit of waves is that they give us the organiser's perspective on the expected distribution of runners (if we assume they tried to design waves with comparable numbers of athletes).

The Valencia Marathon Trinidad Alfonso is planning to start the race in nine waves in order to improve the comfort and safety of all the runners, based on the order of the accredited times.

Finally, I've looked at the "sub-elite bib status" which is a special status given to athletes that have run a fast enough time in the past 3 years. They gain access to a special starting wave and a few other privileges.

Sub-elite bib status will apply to athletes who apply with times under 30:00 in a 10k race, 1h06:00 in a half marathon, or 2h20:00 in a marathon run in the last three years

Adding an X-axis grid

Here is the second version of our nomogram:

Step 2: Running Nomogram with running-aware x-axis grid

The idea is to divide the x-axis into disciplines that are relevant to running.

On the professional side, World Athletics divides disciplines like this:

  1. Sprints, Hurdles and Relays: 100m, 110mH, 200m, 300m, 400m, 400mH, 4x100m, 4x200m, 4x400m
  2. Middle Distances (Courses de demi-fond): 600m, 800m, 1000m, 1500m, Mile, 2000m
  3. Long Distances and Steeplechase (Courses de fond): 2000m SC, 3000m, 3000m SC, 2 Miles, 5000m, 10000m
  4. Road Running: 10 km, 15 km, 10 Miles, 20 km, HM, 25 km, 30 km, Marathon, 100 km

On the amateur side, most races organised in my area are 5km, 10km, HM and Marathon, with very few races in other disciplines. Therefore, I have decided to use the following grid: no sprint; the Mile (for Middle Distances); 5km (for Long Distances); 10km, HM and Marathon (for Road Running). If we wanted to include sprinting, we could add a 400m line.

You might note a "distortion" of sorts for longer distances. To counteract this, we could use a logarithmic scale. In particular a Symmetric log scale (symlog), which is particularly useful for plotting data that varies over multiple orders of magnitude but includes zero-valued data, like in this variant:

Step 2 bis: Running Nomogram with symlog scale

In the rest of the blog post, I decided to keep the linear scale, as it makes the x-axis easier to read and puts more emphasis on endurance disciplines as opposed to sprint disciplines.

Adding a Y-axis grid with VDOT

Here is the third version of our nomogram:

Step 3: Running Nomogram with running-aware y-axis grid

The idea is to divide the y-axis into levels that are relevant to running.

"The World's Best Coach" Jack Daniels has proposed a system called VDOT that allows comparing athletes of different levels.

In the 1970s, Daniels and his colleague, Jimmy Gilbert, examined the performances and known VO2max values of elite middle and long distance runners. Although the laboratory determined VO2max values of these runners may have been different, equally performing runners were assigned equal aerobic profiles. Daniels labeled these "pseudoVO2max" or "effective VO2max" values as VDOT values.

With the result of a recent competition, a runner can find his or her VDOT value using a VDOT calculator. This will allow them to determine an "equivalent performance" at a different race distance, as well as recommended training paces.

By looking at the code of the VDOT calculator, I was able to find the equation of the curves of constant VDOT value. I then plotted these "iso-VDOT" curves on the nomogram, using an interval of 5 VDOT points. (Side note: VDOT defines levels internally from level 2 = VDOT 40 to level 9 = VDOT 85, equally spaced 5 points apart, which I've simply extended.)
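For reference, the model behind those curves combines two regression equations: the oxygen cost of running at a velocity v (in m/min), and the fraction of VO2max a runner can sustain for a duration t (in minutes). Here is a sketch using the commonly cited Daniels-Gilbert coefficients (double-check them against the VDOT calculator code before relying on them):

```python
import math

def vdot(distance_m: float, minutes: float) -> float:
    """Daniels-Gilbert "effective VO2max" estimate from a race result."""
    v = distance_m / minutes  # average velocity in m/min
    # Oxygen cost of running at velocity v
    vo2 = -4.60 + 0.182258 * v + 0.000104 * v ** 2
    # Fraction of VO2max sustainable for a duration of `minutes`
    frac = (0.8
            + 0.1894393 * math.exp(-0.012778 * minutes)
            + 0.2989558 * math.exp(-0.1932605 * minutes))
    return vo2 / frac

print(round(vdot(5000, 20.0), 1))  # a 20:00 5k comes out near VDOT 50
```

An iso-VDOT curve is then simply the set of (duration, speed) points for which this function returns the same value.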

We can now interpret those curves. For example, men's world record holders are above VDOT 80 while women's world record holders sit around VDOT 75. The women's 5km world record also seems to underperform the other women's world records, which perhaps calls for a new world record at Paris 2024. The Valencia broom wagon has a VDOT of about 25, while sub-elite athletes seem to have a VDOT a little above 70.

IAAF points as an alternative to VDOT

World Athletics maintains a ranking of athletes based on their performance. They use a points system called IAAF points, similar to ATP points in Tennis. Here is an example of the World Rankings | Women's Marathon:

Example World Athletics Rankings using IAAF points

Do you know if you have scored your first IAAF point yet? Go to this IAAF Scoring Calculator and enter your time for a given distance.

Given that IAAF points are an official ranking, we could have plotted iso-IAAF curves on the nomogram. But after trying that, I felt it was not as clear as the VDOT curves. We can even show that the VDOT curves are a good approximation of the IAAF curves by plotting both on the same graph:

Step 4: Relation between IAAF points and VDOT score

Note: I was able to find the equation for IAAF curves by looking at the code of this PHP library used by the Latvian Athletics Association and this stackexchange answer.

Setting Realistic Goals by looking at French Athletics Open Data

Now that we have a nomogram, we can use it to set a difficult and specific goal. Before that, we need to know what a realistic performance is. Turning to statistics, we can look at the distribution of performances of athletes in a given discipline.

Thankfully, the French Athletics Federation has an Open Data portal. We can crawl the data available at "Les Bilans" for a given discipline, and plot the distribution of performances. Here is the example of Men Half Marathon in 2023:

Step 5: Half Marathon performances for French men in 2023

This looks like a log-normal distribution, and we could certainly model it further, look at percentiles, etc. For this blog post, however, I will simply use a visual interpretation of a realistic performance. "Most people" seem to have a VDOT between 37 and 44. Therefore, aiming for a VDOT of 45 seems like a difficult enough goal for a beginner: it gets ahead of the masses without setting the bar unrealistically high.
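If you did want to go one step beyond visual interpretation, the percentile of a candidate finishing time is easy to compute. A toy sketch with made-up sample times (the real input would be the finishing times crawled from the "Les Bilans" open data):

```python
# Made-up sample of half-marathon finishing times, in minutes
times = [85, 92, 98, 101, 105, 110, 115, 122, 130, 145]
my_goal = 100  # a 1h40 half marathon

# Share of the sample that is slower than the goal time
pct = sum(t > my_goal for t in times) / len(times)
print(f"a {my_goal} min finish beats {pct:.0%} of the sample")
```

Run against the full distribution, this would tell you exactly how "ahead of the masses" a given goal time is.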

Plotting your past performances and your future goals

Here is the last stop on our nomogram journey:

Step 6: Running Nomogram with past performances and future goals

Over the last few months, I have trained and raced a few times. I have added my past performances to the nomogram in orange, and my future goals in blue. I will run my first Half Marathon in Berlin in April 2024, hoping to finish under 1h40, which corresponds to a VDOT of 45.

When you think about it, finishing a first Half Marathon in under 1h40 is ambitious, but looking at the data here, I think it's a good goal: difficult and specific. I already seem to have a VDOT above 45 (although over 1 mile, the shortest distance), and I have a few months to train and improve my fitness further, specifically by working on longer runs and making sure my VDOT score translates to longer distances.

Conclusion

This concludes my guide on how to build and interpret a nomogram to set better running goals. No matter your level in running, exercise physiology or statistics, I hope you have found something of value in this article. Feel free to use the nomogram I have built for your own goal setting, and let me know if you have any feedback.

Resources

  1. Goal setting - Wikipedia
  2. control - Wiktionary, the free dictionary
  3. Why are you so slow? - Probably Overthinking It
  4. Factors associated with high-level endurance performance: An expert consensus derived via the Delphi technique
  5. Nomogram - Wikipedia
  6. Ecole des Ponts
  7. Military engineering - Wikipedia
  8. How To Build and Interpret a Nomogram for Cancer Prognosis | Journal of Clinical Oncology
  9. Lens - Wikipedia
  10. Level set - Wikipedia
  11. A guide to heart rate training - Runner's World
  12. Critical Speed Calculator for Runners - Upside Strength
  13. Calculateur de vitesse critique - Remi Rivet
  14. Calculation of Critical Speed from Raw Training Data in Recreational Marathon Runners - PMC
  15. Vega-Altair: Declarative Visualization in Python — Vega-Altair 5.2.0 documentation
  16. louisguitton/critical-speed-calculator: A better VDOT calculator, to estimate a runner's critical speed, get their training zones, and compare them to prototypical runners.
  17. Regulations 42K 2024 · Valencia Ciudad del Running
  18. Jack Daniels (coach) - Wikipedia
  19. V.O2 Running Calculator
  20. World Rankings | Women's Marathon
  21. IAAF Scoring Calculator
  22. GlaivePro/IaafPoints: PHP library to calculate IAAF scoring points of athletics and IAAF scoring points for combined events.
  23. statistics - How to calculate IAAF points? - Sports Stack Exchange
  24. Base de Données - Fédération Française d'Athlétisme


]]>