If you've been using dbt for a little while, chances are your project has more than 50 models, and that more than 10 people are building dashboards based on those models. In the best case, self-service analytics users come to you with repeating questions about which model to use when. In the worst case, they make business decisions using the wrong model.
In this post, I will show you how you can build a lightweight metadata search engine on top of your dbt metadata to answer all these questions. I hope to show you that data governance, data lineage, and data discovery don't need to be complicated topics and that you can get started today on those roadmaps with my lightweight open source solution.
LIVE DEMO: https://dbt-metadata-utils.guitton.co
In his recent post The modern data stack: past, present, and future, Tristan Handy - the CEO of Fishtown Analytics (the company behind dbt) - wrote:
Governance is a product area whose time has come. This product category encompasses a broad range of use cases, including discovery of data assets, viewing lineage information, and just generally providing data consumers with the context needed to navigate the sprawling data footprints inside of data-forward organizations. This problem has only been made more painful by the modern data stack to-date, since it has become increasingly easy to ingest, model, and analyze more data.
He later also points out that dbt has its own lightweight governance interface: dbt Docs. It is a great starting point and might be enough for a while. However, as time goes by, your dbt project will outgrow its clothes. The search in dbt Docs is regex-only, and you might find its relevancy degrading as the number of models grows. This matters for Data Analysts building dashboards and looking for the right model, but also for Data Engineers looking to "pull the thread" when debugging a model. Those use cases can be summarised with the two following "Jobs to be Done":
These days, the solution to those two problems seems to be rolling out "heavyweight" tools like Amundsen. As Paco Nathan writes on page 115 of the book Data Teams by Jesse Anderson (you can find my review of the book here):
If you look across Uber, Lyft, Netflix, LinkedIn, Stitch Fix, and other firms roughly in that level of maturity, they each have an open source project regarding a knowledge graph of metadata about dataset usage -- Amundsen, Data Hub, Marquez and so on. [...] Once an organization began to leverage those knowledge graphs, they gained much more than just lineage information. They began to recognize the business process pathways from data collection through data management and into revenue bearing use cases.
Those tools come on top of an already complex stack of tools that data teams need to operate. What if we wanted a lightweight solution instead, like dbt Docs?
In his great Teardown of Data Discovery Platforms, Eugene Yan summarizes the features of Amundsen and other metadata engines really well. He splits them into three categories: features to find data, features to understand data, and features to use data.
Its friendly UI with a familiar search UX is one of the key factors behind Amundsen's success. But another one is its modular architecture, which is already being reused by other metadata open source projects like the project whale (previously called metaframe).
We can further split the three categories of features into 10 features of varying implementation difficulty. Those features also have varying returns, not represented here.
The key thing to realise is that Lyft might have spent a 15⭐️-cost on Amundsen to assemble all those features. But what if we wanted to build a 3⭐️-cost metadata engine? What features and technologies would you pick?
2021-02-09 Update: The Ground paper from Rise labs
In the seminal paper Ground: A Data Context Service - RISE Lab, the RISE Lab outlined those features with much better terminology, which I wasn't aware of when I first wrote this post: the ABCs of Metadata
Although it's possible that feature completeness (everything is in one place) is the USP of Amundsen and others, I want to make the case for a more lightweight approach.
Documentation tools go stale easily, at least when they are not tied to the data modeling code. dbt has proven with dbt Docs that data people want to document their code (hi team 😁). We were just waiting for a tool simple and integrated enough for a culture of Data Governance to blossom. It reminds me of those DevOps books showing that the solution is not the tooling but rather the culture (if you're curious, check out The Phoenix Project).
Additionally, dbt sources are a great way to label raw data explicitly. The dbt graph documents data lineage for you at the table level, and I will later leverage that graph to propagate tags with no additional work.
In other words, with schemas, descriptions and data lineage, dbt Docs covers the category
Features to Understand
from the above diagram. So what is missing from dbt Docs to rival Amundsen? Only a way to surface the work that is already happening in your dbt repository. And that is Search.
A good search engine will cover the Features to Find category. Fortunately, we don't need to build a search engine ourselves. This is where we will use Algolia's free tier, in addition to some static HTML and JS files, to build our lightweight data discovery and metadata engine. Algolia's free tier allows 10k search requests and 10k records per month. Given that for us 1 record = 1 dbt model, and 1 search request = 1 data request from a user, my guess is that the free tier will cover our needs for a while.
Note: if you're worried that Algolia isn't open source, consider using the project typesense.
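Indexing the models then boils down to flattening dbt's compiled manifest.json into one record per model. Here is a minimal sketch of that idea; the record fields are illustrative choices, and the keys read from the manifest (`nodes`, `resource_type`, `name`, `description`, `tags`) are part of dbt's manifest schema:

```python
def manifest_to_records(manifest: dict) -> list[dict]:
    """Flatten dbt models from a parsed manifest.json into Algolia-ready records."""
    records = []
    for unique_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue  # skip tests, seeds, snapshots, etc.
        records.append({
            "objectID": unique_id,  # Algolia's required primary key
            "name": node["name"],
            "description": node.get("description", ""),
            "tags": node.get("tags", []),
        })
    return records
```

Pushing the records is then a single call with Algolia's Python client, roughly `client.init_index(index_name).save_objects(manifest_to_records(manifest))`.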
How to get at least one feature in the Features to Use category? Well, a dbt project is tracked in version control, so by parsing git's metadata, we can for example know each model's owner.
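As a sketch of that idea, one cheap proxy for "owner" is the most frequent commit author of a model's file. The helper names below are illustrative, not part of dbt-metadata-utils:

```python
import subprocess
from collections import Counter

def most_frequent_author(git_log_output: str) -> str:
    """Pick the most frequent author from `git log --format=%an` output."""
    authors = [line for line in git_log_output.splitlines() if line.strip()]
    return Counter(authors).most_common(1)[0][0] if authors else "unknown"

def model_owner(repo_path: str, model_file: str) -> str:
    """Guess a model's owner as its most frequent commit author."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--format=%an", "--", model_file],
        capture_output=True, text=True, check=True,
    ).stdout
    return most_frequent_author(log)
```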
More generally, to extend our lightweight metadata engine, we would add metadata sources and develop parsers to collect and organise that metadata. We would then index that metadata in our search engine. Examples of metadata sources are:
- your data warehouse's query logs and system tables (on Redshift: stl_insert, svv_table_info, stl_query, predicate columns)

Search is going to be key if our metadata engine is to rival Amundsen, so let's look at Amundsen's docs. We know from their architecture page that they use ElasticSearch under the hood. And we can also read that we will need a ranking mechanism to order our dbt models by relevancy:
Search for data within your organization by a simple text search. A PageRank-inspired search algorithm recommends results based on names, descriptions, tags, and querying/viewing activity on the table/dashboard. -- Source
A bit further in the docs, we learn that Amundsen has three search indices and that the search bar uses multi-index search against those indices:
the users could search for any random information in the search bar. In the backend, the search system will use the same query term from users and search across three different entities (tables, people, and dashboards) and return the results with the highest ranking. -- Source
We even get examples for searchable attributes for the documents in the tables index:
For Table search, it will search across different fields, including table name, schema name, table or column descriptions, tags and etc -- Source
There's not much point in reverse engineering an open source project much further, so I'll spare you the rest: it also supports search-as-you-type and faceted search (applying filters).
More on search
To build this search capability, you could use different technologies. I attended a talk at Europython 2020 from Paolo Melchiorre advocating for good-old PostgreSQL full text search. To my knowledge though, you don't get search-as-you-type. This is one of the reasons people tend to go for ElasticSearch or Algolia. Choosing between them is then a buy-or-build decision: more engineering resources vs "throwing money" at the serverless Algolia. As we saw though, for our use case the free tier will be enough, so we get the best of both worlds.
That leaves the question of how to structure our documents for search. Attributes in searchable documents are one of three types: searchable attributes (i.e. they match your query), faceting attributes (i.e. filters) or ranking attributes (i.e. weights).
Our searchable attributes will be table names and descriptions.
Our faceting attributes will be "tags" on our models: these could be vanilla dbt tags if you have good ones, or materialisation, resource type or any other key from the .yml file. Assuming there is a conscious curation effort happening from the code maintainers when they place a model in a folder in the dbt codebase, we can hence use folder names as a faceting attribute too. Lastly, we can use the dbt graph to propagate from left to right the source that models depend on; this will serve as a useful faceting attribute.
For ranking attributes, we will build metrics that matter to us to prioritise tables for our users. Keep in mind that we started with two use cases ("Jobs to be Done"), so each persona could benefit from a different metric. For example, for "dashboard builders", the goal could be to downrank corner-case models so that only "central" models are used. But for "data auditors", the goal might be to prioritise the models that need attention first. In our case, we will focus on the first persona, and we will use a PageRank-like algorithm (degree centrality, as shown in my previous post). This is great at the start of your self-service analytics journey: dashboard builders might not know yet which tables are good, so a good proxy is to look at which models are reused by your dbt committers. Later, you could do like Amundsen and rely on query logs to boost the models that are used the most.
I have assembled a couple of scripts in the (work in progress) repository called dbt-metadata-utils. I will walk through a couple of key parts here, but feel free to check out the full code there, and if you want to use it on your own project, hit me up.
All you will need is:
For the dbt project, we will use one of the example projects listed on the dbt docs: the jaffle_shop codebase.
Create an environment file in which you will need to fill in the values from the Algolia dashboard:
ALGOLIA_ADMIN_API_KEY=
ALGOLIA_SEARCH_ONLY_API_KEY=
ALGOLIA_APP_ID=
ALGOLIA_INDEX_NAME=jaffle_shop_nodes
DBT_REPO_LOCAL_PATH=~/workspace/jaffle_shop
DBT_MANIFEST_PATH=~/workspace/jaffle_shop/target/manifest.json
GIT_METADATA_CACHE_PATH=data/git_metadata

And then run the 4 make commands:
$ make install # best is to install inside a virtual environment
pip install --upgrade pip
pip install -r requirements.txt
$ make update-git-metadata
python -m dbt_metadata_utils.git_metadata
100%|███████████████████████████████████████████| 11/11 [00:00<00:00, 12499.96it/s]
$ make update-index
python -m dbt_metadata_utils.algolia
$ make run
cd dbt-search-app && npm start
> [email protected] start /Users/louis.guitton/workspace/dbt-metadata-utils/dbt-search-app
> parcel index.html --port 3000
Server running at https://localhost:3000
✨ Built in 1.03s.

If you navigate to https://localhost:3000, you should see a UI that looks like this:
I didn't dwell on details, but our metadata engine's features are:
LIVE DEMO: https://dbt-metadata-utils.guitton.co
There you have it! A lightweight data governance tool on top of dbt artifacts and Algolia. I hope this showed you that data governance doesn't need to be a complicated topic, and that by using a knowledge graph of metadata, you can get a head start on your roadmap.
Leave a star on the github project, and let me know your thoughts on twitter. I enjoyed building this project and writing this post because it lies at the intersection of three of my areas of interest: NLP, Analytics and Engineering. I cover those three topics in other places on my blog.
Monitoring is essentially collecting data in the background of your application for the purpose of diagnosing issues, debugging errors, or reporting on the latency of a service.
For example, at the infrastructure level, you can monitor CPU and memory utilization; at the application level, you can monitor errors, code performance or database query performance. For a more complete introduction to monitoring and why it's necessary, see this excellent post from Full Stack Python.
In this post, we will focus on Application Performance Monitoring (APM) for a FastAPI application.
In this post, I will not talk about monitoring application errors and warnings. For that, check out Sentry: it has great ASGI support and will work out of the box with your FastAPI service.
Profiling is a code best practice that is not specific to web development. From the Python docs on profiling, we can read:
the profilers run code and give you a detailed breakdown of execution times, allowing you to identify bottlenecks in your programs. Auditing events provide visibility into runtime behaviors that would otherwise require intrusive debugging or patching.
You can of course apply profiling in the context of a FastAPI application. In which case you might find this timing middleware handy.
However, with this approach, the timing data is logged to stdout. You can use it in development to find bottlenecks, but in practice, looking at production logs to get latency information is not the most convenient.
As with all things, there are many options. Some are open source, some are SaaS businesses. Most likely, you or your organisation are already using one or more monitoring tools, so I'd suggest starting with the one you know. The tools on the list below don't only do APM, which is what sometimes makes them harder to understand. Example application monitoring tools you might have heard of:
This list is not exhaustive, but let's note OpenTelemetry which is the most recent on this list and is now the de-facto standard for application monitoring metrics.
At this point, choosing a tool doesn't matter, let's rather understand what an APM tool does.
1. Your application code is instrumented with a monitoring client library. Monitoring client library examples:
2. The monitoring client library sends each individual call to the monitoring server daemon over the network (UDP in particular, as opposed to TCP or HTTP).
3. The monitoring server daemon listens for monitoring events coming from the applications. It packs the incoming data into batches and regularly sends them to the monitoring backend.
4. The monitoring backend usually has 2 parts: a data processing application and a visualisation webapp. It turns the stream of monitoring data into human-readable charts and alerts. Examples:
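Step 2 is why UDP matters: the client fires datagrams and never blocks on the monitoring path. A minimal statsd-style client, as a sketch of the idea (the wire format `name:value|type` is the statsd protocol; the class name is illustrative):

```python
import socket

class UdpMetricsClient:
    """Fire-and-forget metrics over UDP, using the statsd wire format."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def timing(self, metric: str, ms: float) -> None:
        # UDP: no handshake, no retry -- the app never waits on monitoring
        self.sock.sendto(f"{metric}:{ms}|ms".encode(), self.addr)
```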
ASGI is a relatively new standard for python web servers. As with every new standard, it will take some time for all tools in the ecosystem to support it.
Given the 4 steps of monitoring laid out above, a problem arises if the monitoring client library doesn't support ASGI. For example, this is the case with NewRelic at the moment (see ASGI - Starlette/Fast API Framework · Issue #5 · newrelic/newrelic-python-agent for more details). I looked at Datadog too and saw that ASGI is also not supported at the moment.
On the open source side, however, OpenTelemetry has great support for ASGI. So I set out to instrument my FastAPI service with OpenTelemetry.
Update - Sep 19th, 2020: There seems to be support for ASGI in ddtrace
Update - Sep 22nd, 2020: There is now an API in the NewRelic agent to support ASGI frameworks, with uvicorn already supported and starlette on the way.
Update - Oct 23rd, 2020: The NewRelic python agent now supports Starlette and FastAPI out of the box.
OpenTelemetry provides a standard for steps 1 (with Instrumentors) and 2 (with Exporters) of the 4 steps above. One of the big advantages of OpenTelemetry is that you can send the events to any monitoring backend (commercial or open source). This is especially awesome because you can use the same instrumentation setup for development, staging and production environments.
Update - May 30th, 2021: Github is now adopting OpenTelemetry
Note that depending on the language you use for your microservice, your mileage may vary. For example, there is no NewRelic OpenTelemetry Exporter in Python yet. But there are OpenTelemetry Exporters for many others, see the list here: Registry | OpenTelemetry (filter by language and with type=Exporter).
One of the available backends is Jaeger: open source, end-to-end distributed tracing. (Note that Jaeger also provides a monitoring client library that you can instrument your application with, but that's not the part of interest here.)
Although it's open source and easy to get working, the issue I had with Jaeger was that it doesn't have a data pipeline yet. This means that, in the visualisation webapp, you can browse traces but you cannot see any aggregated charts. Such a backend is on their roadmap though.
Still, Jaeger is my goto tool for monitoring while in development. See the last part for more details.
I couldn't find any open source monitoring backend with a data pipeline that would provide the features I was looking for (latency percentile plots, bar chart of total requests and errors ...).
It became apparent that this is where commercial solutions like NewRelic and Datadog shine. I hence set out to try the OpenTelemetry Datadog exporter.
With this approach, you get a fully featured monitoring backend that will allow you to have full observability for your microservice.
The 2 drawbacks are:
So how does it look in the code? This is how my application factory looks. If you have any questions, feel free to reach out on twitter or open a github issue. I will not share my instrumentation code because it is specific to my application, but imagine that you can define any nested spans and that those traces will be sent the same way to Jaeger or to DataDog. This makes it really fast to iterate on your instrumentation code (e.g. add or remove spans), and even faster to find performance bottlenecks in your code.
"""FastAPI Application factory with OpenTelemetry instrumentation
sent to Jaeger in dev and to DataDog in staging and production."""
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.datadog import DatadogExportSpanProcessor, DatadogSpanExporter
from opentelemetry.exporter.jaeger import JaegerSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchExportSpanProcessor
from my_api.config import generate_settings
from my_api.routers import my_router_a, my_router_b
def get_application() -> FastAPI:
    """Application factory.

    Returns:
        ASGI application to be passed to ASGI server like uvicorn or hypercorn.

    Reference:
        - [FastAPI Middlewares](https://fastapi.tiangolo.com/advanced/middleware/)
    """
    # load application settings
    settings = generate_settings()

    if settings.environment != "development":
        # opentelemetry + datadog for staging or production
        trace.set_tracer_provider(TracerProvider())
        datadog_exporter = DatadogSpanExporter(
            agent_url=settings.dd_trace_agent_url,
            service=settings.dd_service,
            env=settings.environment,
            version=settings.dd_version,
            tags=settings.dd_tags,
        )
        trace.get_tracer_provider().add_span_processor(
            DatadogExportSpanProcessor(datadog_exporter)
        )
    else:
        # opentelemetry + jaeger for development
        # requires jaeger running in a container
        trace.set_tracer_provider(TracerProvider())
        jaeger_exporter = JaegerSpanExporter(
            service_name="my-app", agent_host_name="localhost", agent_port=6831,
        )
        trace.get_tracer_provider().add_span_processor(
            BatchExportSpanProcessor(jaeger_exporter, max_export_batch_size=10)
        )

    application = FastAPI(
        title="My API",
        version="1.0",
        description="Do something awesome, while being monitored.",
    )

    # Add your routers
    application.include_router(my_router_a)
    application.include_router(my_router_b)

    FastAPIInstrumentor.instrument_app(application)
    return application

app = get_application()

I hope that with this post you've learned:
I've used this setup to get a 10x speed up on one multi-lingual NLP fastapi service I built at OneFootball.
If you’ve worked with a corpus of text, chances are you needed to structure its information specifically for your domain. How can you link the entities mentioned in the articles to a knowledge base you control, which you can enrich and which might evolve depending on your focus?
Imagine you are an investigative journalist sifting through the Panama Papers and you are following a lead: the consortium called “Londex Resources S.A.”. You’re not sure what people, organizations, countries or other articles are connected to that lead. Perhaps one of them can be your next breakthrough?
In this article, we will demonstrate a technical approach that combines Entity Resolution performed with Senzing with Entity Linking performed in spaCy. We show how this can be used to construct a domain-specific Knowledge Graph, e.g. around a lead you’re following, to analyze your corpus with it. We will then show how to close the loop and use the analyzed corpus to update the Knowledge Graph with new leads.
Along with this blog post, we have open-sourced a package to do zero-shot entity linking spacy-lancedb-linker, and released a tutorial for reference erkg-tutorials.
For this blog post, we will be looking at a set of articles from investigative journalism like the Panama Papers, Pandora Papers, and Offshore Leaks. Those are cross-border investigations that have made the headlines and were led by the ICIJ (International Consortium of Investigative Journalists).
ICIJ maintains the ICIJ Offshore Leaks dataset, in the form of either a Neo4J database or a set of zipped CSV files. The dataset contains 4 main entity types.
Persons or “Officers” are directors, shareholders, and beneficiaries of offshore companies: for example, presidents, royals, members of parliament, their family members, and their closest associates. “Intermediaries” are secrecy brokers like banks or law firms that Officers turn to to optimize their finances. Organizations or “Entities” are shell companies established by secrecy brokers. “Addresses” are countries, world regions, and secret jurisdictions of Officers, Entities or Intermediaries.
For example, Offshore Leaks has shown that Arzu Aliyeva, daughter of Ilham Aliyev, president of Azerbaijan, lives in Dubai and is a shareholder and director of Arbor Investments Ltd, registered in the Virgin Islands. This creates a natural graph that connects Arzu Aliyeva to other Officers like Hassan Gozal.
This dataset is commonly used to show UBO (Ultimate Beneficial Owner) or reveal or investigate AML (Anti Money Laundering) scenarios. Prior work shows how to use this data in Neo4j, in Linkurious and shows typical investigations written with that data. In this blog post, we will rather show how a Senzing-preprocessed version of this dataset can be used to power an Entity Linking use case.
Senzing provides a development library for Principle-Based Entity Resolution based on Entity-Centric Learning. Senzing Founder/CEO Jeff Jonas said: “[we want to help] developers fast-track their entity resolution needs – as understanding who is who and who is related to who is essential – and exceptionally essential in the creation of entity resolved knowledge graphs (ERKG)”. They have previously shown how to extract personally identifiable information (PII) from the ICIJ graph to be used as input into Senzing. After configuring and running Senzing, a JSON export of entity resolution (ER) results can be used to construct or update a Knowledge Graph (KG), called an entity-resolved knowledge graph (ERKG). Pre-computed ER results for ICIJ are shared as a dataset by Senzing in a GCP public bucket (download link).
While other tutorials show the ICIJ Offshore Leaks data loading into graph databases and entity resolution with Senzing, this tutorial starts with the Senzing export. With a custom Data Engineering pipeline, we can ingest Entity Resolution results into an Approximate Nearest Neighbors (ANN) index stored in LanceDB. We can then use that index in a spaCy pipeline to run Entity Linking against a small dataset of scraped ICIJ web articles. The end-user can then use the output of the entity linking.
In practice, in louisguitton/erkg-tutorials we built this data pipeline in Python using an orchestration tool which helps visualise it. The Senzing ER results feed a Senzing pipeline that builds the EL inputs, which feeds a spacy pipeline. Next, we will see in detail how to use the ERKG to power Entity Linking.
While Senzing is proven to scale into billions of records, the rest of these components don't all scale the same way without performance engineering. Given that ICIJ has 1.5M records and ~5M aliases, we draw on a subset to make this tutorial quick and easy for the reader.
When doing Entity Linking against Wikidata or DBPedia, a subset is typically considered so as not to load the entire Knowledge Graph into the entity linking pipeline. Similarly, we query for a subset of the KG using query languages like SPARQL, or by building custom KGs from smaller files (CSVs or JSONs).
Also in practice, investigative journalists work off so-called Case Management Systems. In that workflow they use software to organize and analyze information, they get assigned a "lead" (a specific person or company) and they only look at the immediate subgraph for that lead.
For those reasons, we start from a text file called data/icij-example/suspicious.txt, where the investigative journalist can seed the system. Let’s say the lead you have to explore is the consortium called “Londex Resources S.A.” which has ties with the Azerbaijani presidential family: you start by providing a few entity names from the Senzing ERKG you care about. Here, we start with Arzu Aliyeva the daughter, Ilham Aliyev the president, etc…
Arzu Aliyeva
Ilham Aliyev
Mossack Fonseca
Fazil Mammadov
AtaHolding
FM Management Holding Group S.A. Stand
UF Universe Foundation
Mehriban Aliyeva
Heydar Aliyev
Leyla Aliyeva
AtaHolding Azerbaijan
Financial Management Holding Limited
Hughson Management Inc.

From that, we’re able to filter the ERKG down (using a friend-of-friend logic) to fewer than 100 entities of interest. That’s the immediate subgraph of our lead. Starting with this might be enough. If it turns out it isn’t, you can expand the subgraph either by adding seed entities to suspicious.txt or by adding more friends of friends.
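The friend-of-friend expansion is a short breadth-first traversal. A minimal sketch, assuming the ERKG is available as an undirected edge list of entity names (the function name and data shape are illustrative, not the tutorial's exact code):

```python
def friend_of_friend(edges: list[tuple[str, str]], seeds: set[str], hops: int = 2) -> set[str]:
    """Expand a set of seed entities to their n-hop neighbourhood in the graph."""
    neighbours: dict[str, set[str]] = {}
    for a, b in edges:  # build an undirected adjacency map
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    frontier, seen = set(seeds), set(seeds)
    for _ in range(hops):  # hops=2 gives friends of friends
        frontier = {n for node in frontier for n in neighbours.get(node, set())} - seen
        seen |= frontier
    return seen
```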
Once we’ve filtered the ERKG down, we extract aliases into the aliases.jsonl file in the format required by the entity linking library we wrote.
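The probabilities in that file are simply a uniform prior over all entities sharing an alias. A sketch of how one line could be produced (the helper name is illustrative):

```python
def alias_record(alias: str, entity_ids: list[str]) -> dict:
    """One aliases.jsonl line: a uniform prior over all entities sharing the alias."""
    p = round(1 / len(entity_ids), 10)
    return {"alias": alias, "entities": entity_ids, "probabilities": [p] * len(entity_ids)}
```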
{"alias":"Ilham Aliyev","entities":["1342265","1551574"],"probabilities":[0.5,0.5]}
{"alias":"Arzu Aliyeva","entities":["281073","918573","1470056","1722271","1697384","1380470"],"probabilities":[0.1666666667,0.1666666667,0.1666666667,0.1666666667,0.1666666667,0.1666666667]}
{"alias":"Arzu Ilham Qizi Aliyeva","entities":["883102"],"probabilities":[1.0]}

We also need to generate entity descriptions from the ERKG to populate the second file required by the entity linking library, entities.jsonl. We generate those descriptions by joining together the structured features available in the ERKG.
{"entity_id": "1342265", "type": "PER", "name": "Ilham Aliyev", "description": "Ilham Aliyev, located at P.O. BOX 17920 JEBEL ALI FREE ZONE DUBAI UAE, in United Arab Emirates"}
{"entity_id": "1697384", "type": "PER", "name": "Arzu Aliyeva", "description": "Arzu Aliyeva, located at APARTMENT NO. 1801 DUBAI MARINA LEREV RESIDENTIAL DUBAI U.A.E., in United Arab Emirates"}
{"entity_id": "1551574", "type": "ORG", "name": "Rosamund International Ltd", "description": "Rosamund International Ltd, located at PORTCULLIS TRUSTNET CHAMBERS P.O. BOX 3444 ROAD TOWN, TORTOLA BRITISH VIRGIN ISLANDS, in British Virgin Islands"}

With our two artefacts ready, we can start using entity linking. Entity Linking is one of the common NLP tasks.
A more formal definition of Entity Linking can be found in the Zshot paper by IBM:
Entity Linking, also known as named entity disambiguation, is the process of identifying and disambiguating mentions of entities in a text, linking them to their corresponding entries in a knowledge base or a dictionary. For example, given "Barack Obama", entity linking would determine that this refers to the specific person with that name (one of the presidents of the United States) and not any other person or concept with the same name. [...] Entity linking can be useful for a variety of natural language processing tasks, such as information extraction, question answering, and text summarization. It helps to provide context and background information about the entities mentioned in the text, which can facilitate a deeper understanding of the content.
Several techniques can be used for entity linking, from deep learning and supervised learning to unsupervised learning approaches. They usually have two stages: candidate creation and candidate ranking. In candidate creation, the approach narrows down the vast number of entities into a manageable subset (e.g., tens or hundreds); in candidate ranking, it ranks the candidate entities of each mention according to the probability that they match the given mention.
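To make the candidate-creation stage concrete, here is a dependency-free sketch that shortlists aliases by character-trigram overlap. This is illustrative of the idea only, not any particular library's implementation:

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    text = f"  {text.lower()} "  # pad so prefixes and suffixes form n-grams too
    return {text[i : i + n] for i in range(len(text) - n + 1)}

def candidates(mention: str, aliases: list[str], top_k: int = 5) -> list[str]:
    """Candidate creation: shortlist aliases by character-trigram Jaccard overlap."""
    def score(alias: str) -> float:
        a, b = ngrams(mention), ngrams(alias)
        return len(a & b) / len(a | b)
    return sorted(aliases, key=score, reverse=True)[:top_k]
```

Candidate ranking would then rescore this shortlist, e.g. with the alias-to-entity prior probabilities from aliases.jsonl.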
When it comes to open-source implementations at our disposal, there is of course spaCy’s Entity Linker, although it uses supervised learning and thus requires labels, which is not practical when quickly iterating. There is also IBM’s zshot Linker, which implements 5 deep-learning linkers and is zero-shot; but still, the underlying models use deep learning, thus might be slower, and were trained on labels. We found Microsoft’s spaCy-compatible ANN linker, which uses unsupervised learning, building an Approximate Nearest Neighbors (ANN) index computed on the character n-gram TF-IDF representation of all aliases in your KnowledgeBase. This approach was the most fitting for our use case. Unfortunately, the project is no longer supported: the last commit is from 2 years ago, and the ANN index it used (nmslib) was causing setup errors.
Inspired by microsoft/spacy-ann-linker, we therefore wrote our own ANN entity linking library louisguitton/spacy-lancedb-linker, swapping nmslib for a supported and active ANN index LanceDB. The result is a simple API that we can use to run unsupervised entity linking in spaCy:
from typing import Iterator
import srsly
from spacy.language import Language
from spacy.tokens import Doc, DocBin
from spacy_lancedb_linker.kb import AnnKnowledgeBase
from spacy_lancedb_linker.linker import AnnLinker # noqa
from spacy_lancedb_linker.types import Alias, Entity
def entity_linking(nlp: Language, spacy_dataset: DocBin) -> Iterator[Doc]:
    entities = [Entity(**entity) for entity in srsly.read_jsonl("data/icij-example/entities.jsonl")]
    aliases = [Alias(**alias) for alias in srsly.read_jsonl("data/icij-example/aliases.jsonl")]

    ann_kb = AnnKnowledgeBase(uri="data/sample-lancedb")
    ann_kb.add_entities(entities)
    ann_kb.add_aliases(aliases)

    ann_linker = nlp.add_pipe("ann_linker", last=True)
    ann_linker.set_kb(ann_kb)

    docs = spacy_dataset.get_docs(nlp.vocab)
    return nlp.pipe(docs)

To recap: we start from Senzing's ERKG for ICIJ, we filter it using the lead to follow in suspicious.txt, we generate the two artifacts we need for spacy-lancedb-linker, and we can now put together an Entity Linking pipeline. Let’s have a look at the output of the Entity Linking on an ICIJ web article about the Azeri presidential family:
The Entity Linking here can be used for information extraction, or to provide context and background information about the entities mentioned in the text. We can also use the following simple heuristic: if an entity is not linking to anything in the KB, but is central to the article, maybe it could be worth investigating next.
To implement this, we show in the tutorial how to use DerwenAI/pytextrank to rank entities and filter for entities that did not link. This can form the basis of a human-in-the-loop system where the investigative journalist updates the KB or decides what leads to follow next. In the case of this article, we see that Londex Resources S.A. is mentioned twice and ranked 19th among the most important entities in the article. We can then explore the ICIJ Offshore Leaks dataset to see if that entity is known and linked to others, and if not, decide to investigate it further.
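The heuristic above can be sketched in plain Python. This is a minimal sketch with hypothetical data: the RankedEntity shape, the leads_to_follow helper, and the scores below are ours for illustration; in the tutorial, the ranks come from pytextrank and the KB identifiers from the linker.

```python
from typing import List, NamedTuple, Optional


class RankedEntity(NamedTuple):
    text: str             # surface form found in the article
    rank: float           # importance score, e.g. from pytextrank
    kb_id: Optional[str]  # KB identifier if the linker resolved it, else None


def leads_to_follow(entities: List[RankedEntity], top_k: int = 20) -> List[str]:
    """Return important entities that did NOT link to the KB: candidate leads."""
    # Keep the top_k most important entities, then filter for the unlinked ones
    most_important = sorted(entities, key=lambda e: -e.rank)[:top_k]
    return [e.text for e in most_important if e.kb_id is None]


entities = [
    RankedEntity("Ilham Aliyev", rank=0.31, kb_id="Q192738"),
    RankedEntity("Londex Resources S.A.", rank=0.12, kb_id=None),
    RankedEntity("Baku", rank=0.05, kb_id="Q9248"),
]
print(leads_to_follow(entities))  # ['Londex Resources S.A.']
```

The journalist then reviews this shortlist and either updates the KB or follows the lead.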
We hope this blog post was useful in demonstrating a technical approach that combines Entity Resolution performed with Senzing with Entity Linking performed in spaCy. We showed how this can be used to construct a domain-specific Knowledge Graph, in particular around the Azerbaijan presidential family, and we showed how to analyze a corpus of articles with this pipeline and come up with new leads.
If you’re curious about this approach, check out the reference tutorial at erkg-tutorials and the unsupervised entity linking library we’ve open-sourced spacy-lancedb-linker.
A rising tide lifts all boats, and the recent advances in LLMs are no exception. In this blog post, we will explore how Knowledge Graphs can benefit from LLMs, and vice versa.
Where do Knowledge Graphs fit with Large Language Models?
In particular, Knowledge Graphs can ground LLMs with facts using Graph RAG, which can be cheaper than Vector RAG. We'll look at a 10-line code example in LlamaIndex and see how easy it is to start. LLMs can help build automated KGs, which have been a bottleneck in the past. Graphs can provide your Domain Experts with an interface to supervise your AI systems.
Note: this is a written version of a talk I gave at the AI in Production online conference on February 15th, 2024. You can watch the talk here.
I've been working with Natural Language Processing for a few years now, and I've seen the rise of Large Language Models. The start of my NLP and Graphs work dates back to 2018, applied to the Sports Media domain when I worked as a Machine Learning Engineer at OneFootball, a football media company from Berlin, Germany.
As a practitioner, I remember that time well because it was a time of great change in the NLP field. We were moving from the era of rule-based systems and word embeddings to the era of deep learning: from LSTMs to a slew of models like ELMo or ULMFiT, and then to the transformer architecture. I was one of the lucky few who could attend the spaCy IRL 2019 conference in Berlin. There were corporate training workshops followed by talks about Transformers, conversational AI assistants, and applied NLP in finance and media.
In his keynote, The missing elements in NLP (spaCy IRL 2019), Yoav Goldberg predicted that the next big development would be to enable non-experts to use NLP. He was right ✅. He thought we would get there by humans writing rules aided by Deep Learning, resulting in transparent and debuggable models. He was wrong ❌. We got there with chat, and we now have less transparent and less debuggable models. We moved further right and down on his chart (see below) to a place deeper than Deep Learning. The jury is still out on whether we can move towards more transparent models that work for non-experts and with little data.
In the context of my employer at the time, OneFootball, a football media company publishing in 12 languages for 10 million monthly active users, we used NLP to assist our newsroom and unlock new product features. I built systems to extract entities and relations from football articles, tag the news, and recommend articles to users. I shared some of that work in a previous talk at a Berlin NLP meetup. We had medium data, not a lot. And we had partial labels in the form of "retags". We also could not pay for much compute. So we had to be creative. It was the realm of Applied NLP.
That's where I stumbled upon the beautiful world of Graphs, specifically the great work from my now friend Paco Nathan with his library pytextrank. Graphs (along with rule-based matchers, weak supervision, and other NLP tricks I applied over the years) helped me work with little annotated data and incorporate declarative knowledge from domain experts while building a system that could be used and maintained by non-experts, with some level of human+machine collaboration. We shipped a much better tagging system and a new recommendation system, and I was hooked.
Today with the rise of LLMs, I see a lot of potential to combine the two worlds of Graphs and LLMs, and I want to share that with you.
The first place where Graphs and LLMs meet is in the area of fact grounding. LLMs suffer from a few issues like hallucination, knowledge cut-off, bias, and lack of control. To circumvent those issues, people have turned to their available domain data. In particular, two approaches emerged: Fine Tuning and Retrieval-Augmented Generation (RAG).
In his talk LLMs in Production at the AI Conference 3 months ago, Dr. Waleed Kadous, Chief Scientist at AnyScale, sheds some light on navigating the trade-offs between the two approaches. "Fine-tuning is for form, not facts", he says. "RAG is for facts".
Fine-tuning will get easier and cheaper. Open-source libraries like OpenAccess-AI-Collective/axolotl and huggingface/trl already make this process easier. But, it's still resource-intensive and requires more NLP maturity as a business. RAG is more accessible, on the other hand.
According to this Hacker News thread from 2 months ago, Ask HN: How do I train a custom LLM/ChatGPT on my documents in Dec 2023?, the vast majority of practitioners are indeed using RAG rather than fine-tuning.
When people say RAG, they usually mean Vector RAG, which is a retrieval system based on a Vector Database. In their blog post and accompanying notebook tutorial, NebulaGraph introduces an alternative that they call Graph RAG, which is a retrieval system based on a Graph Database (disclaimer: they are a Graph database vendor). They show that the facts retrieved by the RAG system will vary based on the chosen architecture.
They also show in a separate tutorial part of the LlamaIndex docs that Graph RAG is more concise and hence cheaper in terms of tokens than Vector RAG.
To make sense of the different RAG architectures, consider the following diagrams I created:
In all cases, we ask a question in natural language QNL and we get an answer in natural language ANL. In all cases, there is some kind of Encoding model that extracts structure from the question, coupled with some kind of Generator model ("Answer Gen") that generates the answer.
Vector RAG embeds the query (usually with a smaller model than the LLM; something like FlagEmbeddings or any of the small models at the top of the Hugging Face Embeddings Leaderboard) into a vector embedding vQ. It then retrieves the top-k document chunks from the Vector DB that are closest to vQ and returns those as vectors and chunks (vj, Cj). Those are passed along with QNL as context to the LLM, which generates the answer ANL.
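The Vector RAG retrieval step boils down to a nearest-neighbour search. Here is a minimal sketch with toy 2-dimensional "embeddings" standing in for a real embedding model; the cosine and retrieve_top_k helpers and the sample chunks are hypothetical, for illustration only (a real Vector DB uses an approximate index, not a full sort).

```python
import math
from typing import List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieve_top_k(v_q: List[float], index: List[Tuple[List[float], str]], k: int = 2) -> List[str]:
    """Return the k chunks whose embeddings are closest to the query embedding."""
    scored = sorted(index, key=lambda pair: -cosine(v_q, pair[0]))
    return [chunk for _, chunk in scored[:k]]


# Toy index of (embedding, chunk) pairs
index = [
    ([1.0, 0.1], "Peter Quill is the leader of the Guardians."),
    ([0.9, 0.2], "Quill was abducted from Earth as a child."),
    ([0.1, 1.0], "Philz is a coffee shop founded in Berkeley."),
]
v_q = [1.0, 0.0]  # stand-in embedding of the question "Who is Peter Quill?"
print(retrieve_top_k(v_q, index))
```

The retrieved chunks are then stuffed into the prompt as context for Answer Gen.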
Graph RAG extracts the keywords ki from the query and retrieves triples from the graph that match the keywords. It then passes the triples (sj, pj, oj) along with QNL to the LLM, which generates the answer ANL.
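The Graph RAG retrieval step can be sketched the same way. This is a toy illustration, not the LlamaIndex implementation: in a real system, the keyword extraction is done by the LLM and the triple matching by a graph database; the naive word filter, the GRAPH list, and the helper names below are ours.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# A toy knowledge graph; a real one would live in a graph database
GRAPH: List[Triple] = [
    ("Peter Quill", "is leader of", "Guardians of the Galaxy"),
    ("Peter Quill", "was abducted from", "Earth"),
    ("Guardians of the Galaxy Vol. 3", "was directed by", "James Gunn"),
]


def extract_keywords(question: str) -> Set[str]:
    """Stand-in for the LLM keyword-extraction step: naive word filtering."""
    stopwords = {"tell", "me", "about", "who", "is", "the"}
    return {w.strip("?.,").lower() for w in question.split()} - stopwords


def retrieve_triples(question: str, graph: List[Triple]) -> List[Triple]:
    """Return triples whose subject or object mentions one of the keywords."""
    keywords = extract_keywords(question)
    return [
        (s, p, o)
        for s, p, o in graph
        if any(k in s.lower() or k in o.lower() for k in keywords)
    ]


print(retrieve_triples("Tell me about Peter Quill", GRAPH))
```

Note how the retrieved context is a handful of short triples rather than full document chunks, which is where the token savings over Vector RAG come from.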
Structured RAG uses a Generator model (an LLM or a smaller fine-tuned model) to generate a query in the database's query language. It could generate a SQL query for an RDBMS or a Cypher query for a Graph DB. For example, let's imagine we query an RDBMS: the model will generate QSQL, which is then passed to the database to retrieve the answer. We note the answer ASQL, but these are data records that result from running QSQL in the database. The answer ASQL as well as QNL are passed to the LLM to generate ANL.
In the case of Hybrid RAG, the system uses a combination of the above. There are multiple hybridisation techniques, which are beyond the scope of this blog post. The simple idea is that you pass more context to the LLM for Answer Gen, and you let it use its summarisation strength to generate the answer.
And now for the code, with the current frameworks, we can build a Graph RAG system in 10 lines of python.
from llama_index.llms import Ollama
from llama_index import ServiceContext, KnowledgeGraphIndex
from llama_index.retrievers import KGTableRetriever
from llama_index.graph_stores import Neo4jGraphStore
from llama_index.storage.storage_context import StorageContext
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.data_structs.data_structs import KG
from IPython.display import Markdown, display
llm = Ollama(model='mistral', base_url="http://localhost:11434")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en")
graph_store = Neo4jGraphStore(username="neo4j", password="password", url="bolt://localhost:7687", database="neo4j")
storage_context = StorageContext.from_defaults(graph_store=graph_store)
kg_index = KnowledgeGraphIndex(index_struct=KG(index_id="vector"), service_context=service_context, storage_context=storage_context)
graph_rag_retriever = KGTableRetriever(index=kg_index, retriever_mode="keyword")
kg_rag_query_engine = RetrieverQueryEngine.from_args(retriever=graph_rag_retriever, service_context=service_context)
response_graph_rag = kg_rag_query_engine.query("Tell me about Peter Quill.")
display(Markdown(f"<b>{response_graph_rag}</b>"))

This snippet assumes you have Ollama serving the mistral model and a Neo4j database running locally. It also assumes you have a Knowledge Graph in your Neo4j database; if you don't, we'll cover how to build one in the next section.
Before conducting inference, you need to index your data either in a Vector DB or a Graph DB.
The equivalent of chunking and embedding documents for Vector RAG is extracting triples for Graph RAG. Triples are of the form (s, p, o) where s is the subject, p is the predicate, and o is the object. Subjects and objects are entities, and predicates are relationships.
There are a few ways to extract triples from text, but the most common way is to use a combination of a Named Entity Recogniser (NER) and a Relation Extractor (RE). NER will extract entities like "Peter Quill" and "Guardians of the Galaxy vol 3", and RE will extract relationships like "plays role in" and "directed by".
There are fine-tuned models specialised in RE, like REBEL, but people have started using LLMs to extract triples. Here is the default LlamaIndex prompt for RE:
Some text is provided below. Given the text, extract up to
{max_knowledge_triplets}
knowledge triplets in the form of (subject, predicate, object). Avoid stopwords.
---------------------
Example:
Text: Alice is Bob's mother.
Triplets: (Alice, is mother of, Bob)
Text: Philz is a coffee shop founded in Berkeley in 1982.
Triplets:
(Philz, is, coffee shop)
(Philz, founded in, Berkeley)
(Philz, founded in, 1982)
---------------------
Text: {text}
Triplets:

The issue with this approach is that, first, you have to parse the chat output with regexes, and second, you have no control over the quality of the entities or relationships extracted.
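To make the regex-parsing point concrete, here is a simplified stand-alone sketch of what such a parser boils down to; the parse_triplets helper is ours, not the library's, and real outputs are messier than this happy-path example.

```python
import re
from typing import List, Tuple


def parse_triplets(llm_output: str) -> List[Tuple[str, str, str]]:
    """Parse '(subject, predicate, object)' lines out of a chat completion."""
    pattern = re.compile(r"\(([^,()]+),\s*([^,()]+),\s*([^,()]+)\)")
    return [
        (s.strip(), p.strip(), o.strip())
        for s, p, o in pattern.findall(llm_output)
    ]


llm_output = """Triplets:
(Philz, is, coffee shop)
(Philz, founded in, Berkeley)
(Philz, founded in, 1982)"""
print(parse_triplets(llm_output))
```

Anything the model emits that deviates from the expected format (nested parentheses, commas inside a subject, free-form chatter) silently falls through the regex, which is exactly the fragility being criticised here.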
With LlamaIndex however, you can build a KG in 10 lines of python using the following code snippet:
from llama_index.llms import Ollama
from llama_index import ServiceContext, KnowledgeGraphIndex
from llama_index.graph_stores import Neo4jGraphStore
from llama_index.storage.storage_context import StorageContext
from llama_index import download_loader
llm = Ollama(model='mistral', base_url="http://localhost:11434")
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local:BAAI/bge-small-en")
graph_store = Neo4jGraphStore(username="neo4j", password="password", url="bolt://localhost:7687", database="neo4j")
storage_context = StorageContext.from_defaults(graph_store=graph_store)
loader = download_loader("WikipediaReader")()
documents = loader.load_data(pages=['Guardians of the Galaxy Vol. 3'], auto_suggest=False)
kg_index = KnowledgeGraphIndex.from_documents(
documents,
storage_context=storage_context,
service_context=service_context,
max_triplets_per_chunk=5,
include_embeddings=False,
kg_triplet_extract_fn=None,
kg_triple_extract_template=None
)

However, if we have a look at the resulting KG for the movie "Guardians of the Galaxy vol 3", we can note a few issues.
Here is a table overview of the issues:
This is to be compared with the Wikidata graph labelled by humans, which looks like this:
So where do we go from there? KGs are difficult to construct and evolve by nature, which challenges the existing methods in KGs to generate new facts and represent unseen knowledge. The paper Unifying Large Language Models and Knowledge Graphs: A Roadmap provides a good overview of the current state of the art and the challenges ahead.
Knowledge graph construction involves creating a structured representation of knowledge within a specific domain. This includes identifying entities and their relationships with each other. The process of knowledge graph construction typically involves multiple stages, including 1) entity discovery, 2) coreference resolution, and 3) relation extraction. Fig 19 presents the general framework of applying LLMs for each stage in KG construction. More recent approaches have explored 4) end-to-end knowledge graph construction, which involves constructing a complete knowledge graph in one step or directly 5) distilling knowledge graphs from LLMs.
Which is summarised in this figure from the paper:
I've seen only a few projects that have tried to tackle this problem: DerwenAI/textgraphs and IBM/zshot.
The final place where Graphs and LLMs meet is Human+Machine collaboration. Who doesn't love a "Human vs AI" story? News headlines about "AGI" or "ChatGPT passing the bar exam" are everywhere.
I would encourage the reader to have a look at this answer from the AI Snake Oil newsletter. They make a good point that models like ChatGPT memorise the solutions rather than reason about them, which makes exams a bad way to compare humans with machines.
Going beyond Memorisation, there is a whole area of research around what's called Generalization, Reasoning, Planning, Representation Learning, and graphs can help with that.
Rather than against each other, I'm interested in ways Humans and Machines can work together. In particular, how do humans understand and debug black-box models?
One key project that, in my opinion, moved the needle there was the whatlies paper by Vincent Warmerdam, 2020. He used UMAP on word embeddings to reveal quality issues in language models, and built a framework for others to audit their embeddings rather than blindly trust them.
Similarly, Graph Databases come with a lot of visualisation tools out of the box. For example, they add context with colour, metadata, and different layout algorithms (force-based, Sankey).
Finally, how do we address the lack of control of Deep Learning models, and how do we incorporate declarative knowledge from domain experts?
I like to refer to the phrase "the proof is in the pudding", and by that, I mean that the value of a piece of tech must be judged based on its results in production. And when we look at production systems, we see that LLMs or Deep Learning models are not used in isolation, but rather within Human-in-the-Loop systems.
In a project and paper from 2 weeks ago, Google has started using language models to help it find and spot bugs in its C/C++, Java, and Go code. The results have been encouraging: it has recently started using an LLM based on its Gemini model to “successfully fix 15% of sanitiser bugs discovered during unit tests, resulting in hundreds of bugs patched”. Though the 15% acceptance rate sounds relatively small, it has a big effect at Google-scale. The bug pipeline yields better-than-human fixes - “approximately 95% of the commits sent to code owners were accepted without discussion,” Google writes. “This was a higher acceptance rate than human-generated code changes, which often provoke questions and comments”.
The key takeaway here for me has to do with their architecture:
They built it with an LLM, but they also combined LLMs with smaller, more specific AI models, and more importantly with a double human filter on top: humans working with machines.
I remember those 2019 days vividly, moving from LSTMs to Transformers, and we thought that was Deep Learning. Now, with LLMs, we've reached what I would describe as Abysmal Learning. And I like this image because it can mean both "extremely deep" as well as "profoundly bad".
More than ever, we need more control, more transparency, and ways for humans to work with machines. In this blog post, we've seen a few ways in which Graphs and LLMs can work together to help with that, and I'm excited to see what the future holds.
🤝Organizer: Argilla.io
🏠Venue Host: Argilla Event Calendar
📝Agenda:
Louis Guitton is a great community member, a long-time attendee of the Argilla community meetup, and a freelancer in the AI space. In this meetup, he will:
We hope to see you all on the 23rd :)
If you're maintaining a codebase that uses pandas dataframes heavily, you might have felt this pain already. Your files are getting longer, debugging the data transformations is getting slower.
When it comes to Data Engineering, Functional Programming has proven its value already and I won't come back on this in this post. If you're not convinced, just have a look at the seminal piece by Maxime Beauchemin (creator of Apache Airflow and Apache Superset) Functional Data Engineering — a modern paradigm for batch data processing.
But, of all the Data Engineering and Machine Learning Operations tools, one is at the same time heavily used and harder to adopt functional programming with: pandas dataframes. I will show some niche ways of writing pandas code that have served me well in previous roles and at previous clients, to reduce tech debt and make Data Engineering in pandas more fun.
For an in-depth look, have a read of Functional Programming in Python: When and How to Use It.
>>> animals = ["ferret", "vole", "dog", "gecko"]
>>> sorted(animals, key=lambda s: -len(s))
['ferret', 'gecko', 'vole', 'dog']

For an intro to the topic, have a read of Method chaining across multiple lines in Python.
Let's use this dataframe as an example:
import pandas as pd
df = pd.DataFrame.from_records([
{"name": "Alice", "age": 24, "state": "NY", "point": 64},
{"name": "Bob", "age": 42, "state": "CA", "point": 92},
{"name": "Charlie", "age": 18, "state": "CA", "point": 70}
])

df["point_ratio"] = df['point'] / 100
df["surrogate_key"] = df["name"] + "-" + df["age"].astype(str) + "-" + df["state"]
df = df.drop(columns='state')
df = df.sort_values('age')
df = df.head(3)

While still maintaining one transformation per line, there are mentions of df everywhere. We are not explicit about the fact that we rely on the transformations happening in the order we wrote them. Also, you can see with the surrogate_key transformation that the readability of the code decreases as the transformation complexity increases.
result = (
df
.assign(point_ratio=lambda d: d['point'] / 100)
.assign(surrogate_key=lambda d: d.apply(lambda r: f"{r['name']}-{r['age']}-{r['state']}", axis=1))
.drop(columns='state')
.sort_values('age')
.head(3)
)

Using .assign and parentheses (), we anchor our approach in functional programming. Each transformation is on its own line, and there are no more mentions of df. We are explicit about the transformation order.
On the other hand, the surrogate_key transformation is hard to write:
- it nests lambda functions
- it uses .apply with axis=1, which adds complexity
- it requires naming d the parameter of type pd.DataFrame, and naming r the parameter which is a "Row" of the dataframe

Because code is read more than it's written, investing the time to write this code is still worth it for teams. But we can do better.
pandas.DataFrame.itertuples with the functional API

result = (
df
.assign(point_ratio=lambda d: d['point'] / 100)
.assign(surrogate_key=lambda d: [f"{user.name}-{user.age}-{user.state}" for user in d.itertuples(name="User")])
.drop(columns='state')
.sort_values('age')
.head(3)
)

We take the same approach as before, but we tweak the surrogate_key transformation. This time:
- there is no nested lambda
- we use itertuples, which maintains the dtypes of the rows and gives us NamedTuple objects
- we name the parameter user instead of r

In this short article, I have shown you a new way to write your pandas data pipelines, which you can leverage to write more explicit and maintainable code for Data Engineering.
This is the first event in our new Unstructured speaker series, looking at the intersection of Data Science and Business.
We will have a small group meeting in Berlin, and hopefully a wider audience joining us online. This Meetup event is for the in-person event in Berlin. After the talks we will have some drinks and networking.
If you only want to join remotely, please sign-up here. The Meetup Event sign-up is only meant for in-person attendees (you are still welcome to join the Group to be notified about all events).
We have three expert speakers sharing their experience of applying data science to business problems.
Boyan Angelov is a CTO and data strategist with a decade of experience in a variety of academic and business environments. He's the author of the O'Reilly "Python and R for the Modern Data Scientist" book and currently working on his second book - "Elements of Data Strategy: A Handbook for the Analytics Manager".
In his work, Nassim Taleb extensively covers the concept of Via Negativa: people are much better at understanding the downsides than the upsides. In this talk, I'll explain what not to do in delivering data projects. I'll go through the most common failure scenarios and the factors causing them. And finally, I will provide several remedy recipes to ensure your data projects don't suffer the same fate.
Louis Guitton attended Mines ParisTech PSL from 2012-2016, where he got his MSc in Engineering (with a minor in Econometrics) and perfected the "spaghetti al kettle". An open source contributor and technologist, Louis spoke in May 2021 at The Knowledge Graph Conference, in NYC, about his graph data science work in natural language processing. This is the business-critical technology he has developed for OneFootball in Berlin.
Stefan Berkner is a passionate self-taught software engineer with 15+ years experience in development and architecture. He was previously Lead Software Engineer at a German credit bureau where he was responsible for leading the development of the technology that would go on to become Tilo, where he is the Chief Development Officer.
Searching in databases using geographical data and a given distance can be challenging if the database does not support this natively. Creating a grid on the world drastically reduces the potential search space. Stefan will explain how one of his favourite games, Dyson Sphere Program, influenced him in choosing a grid that is easy to calculate and work with.
Overall evaluation -3: (strong reject; on a scale from -3 to +3)
Reviewer's confidence 3: (medium; on a scale from 1 to 5)
The paper presents a serialisation method for RDF ontologies into flat JSON, along with a Java-based tool called rdf2json. The JSON generator overcomes the circular structures that can be found in graphs with "mapping paths". The paper then compares the new JSON serialiser to existing JSON serialisers, and presents future work.
The area for submission is not clearly articulated in the abstract or introduction. Maybe the author can refer to the call for papers and reuse some of its verbiage. For example, “Tools for mechanizing building of knowledge graphs” and “Connections to software engineering practices, such as build tools”.
The paper presents an open-source library that is only mentioned in the last sentence; maybe the author could fix this by mentioning the Java project earlier, perhaps even in the title.
Some of the key arguments are not developed. For example "they deliver information in a graph-like structure instead of a tree structure" or "The model creator uses restricted paths to draw only the relevant branches of the final tree structure".
The structure of the paper is clear, but we might suggest the following tweak: 1-Introduction / 2-RDF2JSON: Usage and examples / 3-Comparison with existing approaches / 4-Roadmap. Anchoring the structure around the open-source library might help the author explain the use cases and the benefits of the method.
The introduction doesn't have any figures. An architecture diagram would be welcome, especially given the presence of entities such as: Ontologist, RDF, Triple store, Jena API, rdf2json, JSON, Developer/User.
The writing contains grammatical errors, making it hard to follow and review. For example "Simple Person ontology can be seen in Figure 1."
The writing contains a lot of general claims with no evidence to back them up. This undermines the reader's trust. Example phrasings that we hope the author can improve in a future version: “trivial to many”, “quite the opposite”, “[developers] prefer”, "Data structures should be modeled by data/domain experts, and not by software developers", "Desired structure by developers", "Ontologies should be created by data experts, not by software developers", "This is very far from what is actually happening", "History repeats itself", "This is certainly not the adoption level the community is looking for".
The bibliography mentions 3 papers, including a well-cited paper (1169 citations / 52 highly influential citations) to establish context, but only 1 paper is recent (2023); the rest are more than 10 years old (2012-2013), so it’s hard to see how this paper connects to recent publications. The remaining references are not research-related, including even a private consulting firm's press release. Maybe the author could aim to connect their contribution to more numerous and more recent papers (~10 papers from 2010 to today).
The paper mentions other JSON serialisation methods without highlighting what the newly proposed method improves upon. Maybe examples in the form of "before and after" could help the reader understand the novelty. For example: Person ontology with JSON-LD vs Person ontology with RDF2JSON.
The author mentions "[existing JSON serializers] deliver information in a graph-like structure instead of a tree structure" and implicitly implies that the tree structure is better, without explaining how and why.
I'm working on Entity Linking and Knowledge Bases. In that context, exporting a relevant part of Wikidata can be really useful to build surface form dictionaries, co-occurrence probabilities, etc. In order to know which part of Wikidata is relevant to dump, I thought we could query Wikidata (although it seems we can only download the entire dump and filter afterwards).
SPARQL is a language to formulate questions (queries) for knowledge databases. Therefore you can query Wikidata with SPARQL. At first sight, the syntax is not particularly easy and I've gone through this tutorial.
- # is the comment character
- the SELECT clause lists the variables that you want returned (variables start with a question mark)
- the WHERE clause contains restrictions on them, in the form of SPO triples (subject, predicate, object), e.g. ?fruit hasColor yellow
- use Q: search terms for items and P: search terms for properties
- prefix wd: for items
- prefix wdt: for properties, pointing to the object
- SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". } returns labels

Putting this all together we get:
SELECT ?fruit ?fruitLabel
WHERE
{
# fruit hasColor yellow
?fruit wdt:P462 wd:Q943
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

You can chain restrictions with the ; character, e.g. you could filter for actual fruits by doing:

# fruit instance of or subclass of a fruit
?fruit wdt:P31/wdt:P279* wd:Q3314483;

For statement nodes, there are more prefixes:
- p: for properties, pointing to the subject
- ps: for property statement
- pq: for property qualifier
- the [] syntax

SELECT ?painting ?paintingLabel ?material ?materialLabel
WHERE
{
# element is a painting
?painting wdt:P31/wdt:P279* wd:Q3305213;
# extract the statement node 'material' (P186)
p:P186 [
# get material property statement
ps:P186 ?material;
# 'applies to part'(P518) 'painting surface'(Q861259)
pq:P518 wd:Q861259
].
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

ORDER BY, LIMIT

SELECT ?country ?countryLabel ?population
WHERE
{
# instances of sovereign state
?country wdt:P31/wdt:P279* wd:Q3624078;
# hasPopulation populationValue
wdt:P1082 ?population.
# filter for english translations
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
# ASC(?something) or DESC(?something)
ORDER BY DESC(?population)
LIMIT 10

OPTIONAL clause

SELECT ?book ?title ?illustratorLabel ?publisherLabel ?published
WHERE
{
?book wdt:P50 wd:Q35610.
OPTIONAL { ?book wdt:P1476 ?title. }
OPTIONAL { ?book wdt:P110 ?illustrator. }
OPTIONAL { ?book wdt:P123 ?publisher. }
OPTIONAL { ?book wdt:P577 ?published. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

FILTER and BIND, see tutorial section for more details

SELECT ?person ?personLabel ?age
WHERE
{
# instance of human
?person wdt:P31 wd:Q5;
wdt:P569 ?born;
wdt:P570 ?died;
# died from capital punishment
wdt:P1196 wd:Q8454.
BIND(?died - ?born AS ?ageInDays).
BIND(?ageInDays/365.2425 AS ?ageInYears).
BIND(FLOOR(?ageInYears) AS ?age).
FILTER(?age > 90)
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

VALUES

SELECT ?item ?itemLabel ?mother ?motherLabel
WHERE {
# A. Einstein or J.S. Bach
VALUES ?item { wd:Q937 wd:Q1339 }
# mother of
OPTIONAL { ?item wdt:P25 ?mother. }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

The label service gives you shortcuts:
- ?xxxLabel as a shortcut for rdfs:label
- ?xxxAltLabel as a shortcut for skos:altLabel
- ?xxxDescription as a shortcut for schema:description

Get the 🇳🇱 Dutch nicknames of a team:
# get the dutch nicknames from Bayern München
SELECT ?item ?itemLabel ?itemDescription ?itemAltLabel
WHERE {
VALUES ?item { wd:Q15789 }
SERVICE wikibase:label { bd:serviceParam wikibase:language "nl". }
}

Get the stadium names of the teams that are part of the Big 5:
SELECT ?item ?itemLabel ?venue ?venueLabel ?venueAltLabel
WHERE
{
?item wdt:P31/wdt:P279* wd:Q847017;
wdt:P118 ?league;
wdt:P115 ?venue.
# filter for Big 5
VALUES ?league { wd:Q82595 wd:Q9448 wd:Q13394 wd:Q15804 wd:Q324867 wd:Q206813}.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ?league

Here are my solutions to the exercises in that tutorial.
Write a query that returns all chemical elements with their element symbol and atomic number, in order of their atomic number.
SELECT ?element ?elementLabel ?symbol ?atomic_number
WHERE
{
?element wdt:P31 wd:Q11344;
wdt:P246 ?symbol ;
wdt:P1086 ?atomic_number .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ASC(?atomic_number)

Write a query that returns all rivers that flow directly or indirectly into the Mississippi River.
SELECT ?river ?riverLabel
WHERE
{
?river wdt:P31 wd:Q4022;
wdt:P403/wdt:P403* wd:Q1497 .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
ORDER BY ASC(?riverLabel)

SELECT ?ref ?refURL WHERE {
?ref pr:P854 ?refURL .
FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .
} LIMIT 10

Now that you have developed a SPARQL query, here is the simplest way to programmatically query Wikidata with Python:
Dependencies:
- pandas
- requests

"""SPARQL utils."""
from pathlib import Path
from typing import List
from urllib.parse import urlparse
import pandas as pd
import requests
def query_wikidata(sparql_file: str, sparql_columns: List[str]) -> pd.DataFrame:
"""Query Wikidata SPARQL API endpoint."""
wikidata_api = "https://query.wikidata.org/sparql"
query = Path(sparql_file).read_text()
r = requests.get(wikidata_api, params={"format": "json", "query": query})
data = r.json()
df = (
pd.json_normalize(data, record_path=["results", "bindings"])
.rename(columns={c + ".value": c for c in sparql_columns})[sparql_columns]
.assign(q_id=lambda d: d.item.apply(lambda u: Path(urlparse(u).path).stem))
)
return df]]>
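If you want to check the parsing logic without hitting the network, you can feed `pd.json_normalize` a canned response. The payload below is a made-up sample mimicking the shape of the SPARQL JSON results format, not real Wikidata output:

```python
import pandas as pd

# Made-up sample mimicking the SPARQL JSON results format
data = {
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q15789"},
                "itemLabel": {"type": "literal", "value": "Example Team"},
            },
        ]
    }
}

sparql_columns = ["item", "itemLabel"]
# Same flattening logic as query_wikidata: nested dicts become "item.value" etc.
df = pd.json_normalize(data, record_path=["results", "bindings"]).rename(
    columns={c + ".value": c for c in sparql_columns}
)[sparql_columns]
print(df.to_dict("records"))
```

This makes it easy to iterate on the column-renaming logic before wiring in the live endpoint.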
A code review is a process where someone other than the author(s) of a piece of code examines that code. Code committed to the codebase is the responsibility of both the author and the reviewer.
Done right, PR reviews can be the engine of team and business growth. Done poorly, they can leave the team fatigued and the business questioning the value of the process. This guide shares my experience and best practices for avoiding inefficient and unpleasant code reviews.
Code reviews should look at:
All of the above are grounds for a reviewer to request changes in a PR. Consensus should be reached to the best of the abilities of the author(s) and reviewer. However, if consensus cannot be reached between the two parties, the review should be escalated to the technical lead.
Provide context with the PR template
YOU are the primary reviewer
Things to think about
We believe in starting a review early so you don’t get too far only to have to rewrite things after someone has made a great suggestion.
Just create a PR even with a README commit (a good rule of thumb is when 30 to 50% of the code is there), and add a clear "[ WIP ]" tag to the title so that we know it's a work in progress.
The sooner you get feedback, the better: nobody wants to hear at 90% of the way "you need to redo everything".
Ask for review early and expect architectural design comments.
How to allow maintainers to modify your PR
Allowing changes to a pull request branch created from a fork - GitHub Docs
Other details
Code review is not only for experienced developers! Here is what you can provide feedback on:
This part is closer to a manifesto than to anything else, but I still find it useful:
"first make it work, then make it beautiful, then make it fast"
If you're using dbt, chances are you've noticed that it generates and saves one or more artifacts with every invocation.
In this post, I'll show you how to get started with dbt artifacts, and how to parse them to unlock applications valuable to your team and your use case.
Whether that's just for a fun Friday afternoon learning session, or whether that's your first foray at building a Data Governance tool using dbt, I hope you'll find this post useful, and if you do, let me know on twitter!
A word of warning: dbt's current minor version as of writing is v0.18.1 and multiple improvements to artifacts are coming in dbt's next version v0.19.0, but that doesn't change the content of this post.
dbt has produced artifacts since the release of dbt-docs in v0.11.0. Starting in dbt v0.19.0, we are committing to a stable and sustainable way of versioning, documenting, and validating dbt artifacts.
Ref: https://next.docs.getdbt.com/reference/artifacts/dbt-artifacts/
The artifacts currently generated are JSON files called manifest.json, catalog.json, run_results.json and sources.json. They are used to power the docs website and other dbt features.
Different dbt commands generate different artifacts, so I've summarised that in the table below:
Of course, dbt docs is the command that refreshes most artifacts (makes sense, since they were initially introduced to power the docs site). But it's interesting to note that manifest can be refreshed by other commands than the usual suspects dbt run or dbt test too.
Manifest:
Today, dbt uses this file to populate the docs site, and to perform state comparison. Members of the community have used this file to run checks on how many models have descriptions and tests.
Run Results:
In aggregate, many run_results.json can be combined to calculate average model runtime, test failure rates, the number of record changes captured by snapshots, etc.
Catalog:
Today, dbt uses this file to populate metadata, such as column types and table statistics, in the docs site.
Sources:
Today, dbt Cloud uses this file to power its Source Freshness visualization.
graph.gpickle:
Stores the networkx representation of the dbt resource DAG.
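As a sketch of the run_results aggregation idea above, here is how you could average model runtimes across several runs. Note that the `execution_time` and `unique_id` field names are assumptions that differ between dbt versions (older artifacts nest the id under a `node` key), so inspect your own run_results.json first:

```python
import json
from collections import defaultdict
from pathlib import Path


def average_runtimes(run_results_files):
    """Average per-node execution time across several run_results.json files.

    Field names are assumptions; check your artifact's actual schema.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for f in run_results_files:
        data = json.loads(Path(f).read_text())
        for result in data.get("results", []):
            # newer artifacts: top-level unique_id; older: nested under "node"
            node_id = result.get("unique_id") or result["node"]["unique_id"]
            totals[node_id] += result.get("execution_time", 0.0)
            counts[node_id] += 1
    return {node: totals[node] / counts[node] for node in totals}
```

Run it over a directory of archived artifacts to spot models whose runtime is drifting upwards.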
`jq`
To get started with parsing dbt artifacts for your own use case, I suggest using jq, the lightweight and flexible command-line JSON processor. This way, you can try out your ideas and explore the available data without writing much code at first.
jq Cheat sheet:
In particular, you will need to make use of some of the built-in operators like to_entries and map.
Here is a command to grab the materialisation of each model
→ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})'
[
{
"node": "model.jaffle_shop.dim_customers",
"materialized": "table"
},
{
"node": "model.jaffle_shop.stg_customers",
"materialized": "view"
}
]

You can then, for example, store that into a file by piping the output:

cat target/manifest.json | jq '.nodes | ...' > my_data_of_interest.json

`pydantic`
Once you get a better idea of what data you need, you might want to develop more custom logic around dbt artifacts. This is where python shines: you can write a script with the logic you need. You can install and import great python libraries. For instance, you could use networkx to run graph algorithms on your dbt DAG.
You will then need to parse the dbt artifacts in python. I recommend using the great pydantic library: among other things, it allows you to parse JSON files with very concise code that lets you focus on the high-level parsing logic.
Here is an example logic to parse manifest.json:
import json
from enum import Enum
from pathlib import Path
from typing import Dict, List, Optional

from pydantic import BaseModel, validator


class DbtResourceType(str, Enum):
    model = 'model'
    analysis = 'analysis'
    test = 'test'
    operation = 'operation'
    seed = 'seed'
    source = 'source'


class DbtMaterializationType(str, Enum):
    table = 'table'
    view = 'view'
    incremental = 'incremental'
    ephemeral = 'ephemeral'
    seed = 'seed'


class NodeDeps(BaseModel):
    nodes: List[str]


class NodeConfig(BaseModel):
    materialized: Optional[DbtMaterializationType]


class Node(BaseModel):
    unique_id: str
    path: Path
    resource_type: DbtResourceType
    description: str
    depends_on: Optional[NodeDeps]
    config: NodeConfig


class Manifest(BaseModel):
    nodes: Dict[str, Node]
    sources: Dict[str, Node]

    @validator('nodes', 'sources')
    def filter(cls, val):
        return {k: v for k, v in val.items() if v.resource_type.value in ('model', 'seed', 'source')}


if __name__ == "__main__":
    with open("target/manifest.json") as fh:
        data = json.load(fh)
    m = Manifest(**data)

Once you've got the Manifest class, you can use it in your custom logic. For example, in our use case from above where we want to check each model's materialization, we can do:
>>> m = Manifest(**data)
>>> [{"node": node, "materialized": n.config.materialized.value} for node, n in m.nodes.items()]
[
{
"node": "model.jaffle_shop.dim_customers",
"materialized": "table"
},
{
"node": "model.jaffle_shop.stg_customers",
"materialized": "view"
}
]

Let's say you want to check that no materialisation has changed before you run dbt run. This is useful because some materialisation changes require a --full-refresh. You could achieve the change detection with the following commands:
→ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})' > old_state.json
→ # code change: let's say one model materialization is changed from table to view
→ dbt compile
→ cat target/manifest.json | jq '.nodes | to_entries | map({node: .key, materialized: .value.config.materialized})' > new_state.json
→ diff old_state.json new_state.json
12c12
< "materialized": "table"
---
> "materialized": "view"

`networkx`

Once you've parsed the manifest.json, you have at your disposal the graph of models from your project. You could explore off-the-shelf graph algorithms provided by networkx, and see if any of the insights you get are valuable.
For example, nx.degree_centrality can give you the list of models that are "central" to your project. You can use that e.g. to prioritise maintenance efforts. In the future, you could imagine a dbt docs search that prioritises results based on this metric as a very simple PageRank proxy.
Once you've written the pydantic code from above, this turns out to be possible in very few lines of code.
import networkx as nx

# ... pydantic code from above for Manifest class


class GraphManifest(Manifest):
    @property
    def node_list(self):
        return list(self.nodes.keys()) + list(self.sources.keys())

    @property
    def edge_list(self):
        return [(k, d) for k, v in self.nodes.items() for d in v.depends_on.nodes]

    def build_graph(self) -> nx.Graph:
        G = nx.Graph()
        G.add_nodes_from(self.node_list)
        G.add_edges_from(self.edge_list)
        return G


if __name__ == "__main__":
    with open("target/manifest.json") as fh:
        data = json.load(fh)
    m = GraphManifest(**data)
    G = m.build_graph()
    nx.degree_centrality(G)

Provided you use Python 3.8+, there is another dbt artifact that can be interesting to you: graph.gpickle. Instead of parsing manifest.json and building the graph yourself, you can deserialize the networkx graph built by dbt itself.
All it takes is 2 lines!
That's hard to beat, but note that you will rely on the internal graph definition of dbt and won't be able to customise it. For example, tests will be nodes on your graph now.
import networkx as nx
G = nx.read_gpickle("target/graph.gpickle")

Nevertheless, this can be useful, for example for a quick visualisation using pyvis:
from pyvis.network import Network
nt = Network("500px", "1000px", notebook=True)
nt.from_nx(G)
nt.show("nx.html")
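If the test nodes mentioned above add noise to the picture, you can filter them out before plotting. This is a sketch assuming node ids follow dbt's `resource_type.package.name` naming convention:

```python
import networkx as nx


def drop_resource_types(G: nx.Graph, prefixes=("test.",)) -> nx.Graph:
    """Return a copy of the graph without nodes whose id starts with any prefix."""
    keep = [n for n in G.nodes if not str(n).startswith(tuple(prefixes))]
    return G.subgraph(keep).copy()
```

You can then pass the filtered graph to pyvis exactly as before.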
🤝Organizer: community members Eva Schreyer and Lucas Silbernagel
🏠Venue Host: Enpal office @ Germany
📝Agenda
I turned to my Data Science and Engineering background and built a tool called a Nomogram to assist me.
This guide provides a nonstatistical audience with a methodological approach for building, interpreting, and using nomograms to estimate running fitness and set difficult and specific goals. If you do not know what a Nomogram is, don't worry, I will explain it step by step in the rest of the article.
Although this article deals with setting better goals, this is not a Goal Setting blog post. Setting goals is part of any self-improvement approach, and fitness or running is no exception.
When setting out to set your own goals, it's easy to get lost in the profusion of acronyms and fields in which goal setting is used, for example: Psychology (e.g. WOOP: Wish, Outcome, Obstacle, Plan), Self-help (e.g. SMART: Specific, Measurable, Achievable, Relevant, and Time-Bound) or Business (e.g. OKRs and KPIs: Objectives, Key Results, Key Performance Indicators).
Sometimes, goals are even set for us: by our employer, our doctor, our coach, our family, our friends, or our insurance company. For example, my health insurance gives me a few pieces of basic fitness advice:
Although I'm no stranger to setting goals, I was lost when I started running this summer. Until I re-discovered the Locke theory of goal setting. In 1968, Edwin Locke published a paper called "Toward a Theory of Task Motivation and Incentives" in which he proposed that:
After controlling for ability, goals that are difficult to achieve and specific tend to increase performance far more than easy goals, no goals or telling people to do their best. It therefore follows that the simplest motivational explanation of why some individuals outperform others is that they have different goals.
The first part of this quote is key: "After controlling for ability". The verb control is used in its statistical sense, meaning that the effect of ability is removed from the equation. In other words, we all have different running fitness levels, and we need to control for that when setting goals.
The second part of the quote calls for a disclaimer: by following this approach, we bias ourselves towards performance. There are plenty of other motivations for running, and they are perfectly valid:
But if for the rest of this post we focus on performance, we also need to realise that performance is a result of many factors. For example, the blog post "Why are you so slow?" uses a statistical model to reveal that running speed in a 200m dash is influenced by 5 factors, of which the weakest link is the limiting factor. In other words, if you want to improve your 200m dash time, you need to improve your height, weight, fast- and slow-twitch muscle mass, cardiovascular conditioning, flexibility and elasticity. The research paper Factors associated with high-level endurance performance goes even further and lists 26 factors that influence endurance running performance. I will spare you the detail and leave only a figure from the paper that summarises the factors:
I don't know about you, but this is too many factors to be practical. So I started looking for a single numerical estimate of my running fitness that I could use to set goals. It should be easy to measure, easy to understand, and easy to compare to others. Most importantly, it should be tailored to my individual profile.
Nomograms are graphical calculating devices that look like 2D diagrams and allow approximate computations. Nomograms are used in particular for their ability to reduce statistical predictive models into a single numerical estimate, perfect for our use case!
The field of nomography was invented in 1884 by the French engineer Philbert Maurice d'Ocagne (1862-1938) and used extensively for many years to provide engineers with fast graphical calculations of complicated formulas to a practical precision.
Historically, they were used and developed in civil engineering. Place yourself in 1843: you are a civil engineer, and you need to calculate the volume of earth to be moved to allow for the construction of a road or a railway. You have a formula, but it's complicated and you don't have a computer to do the calculation for you. At that time, the French administration would have sent you a graphical table to help you with the calculation. Tables turned into nomograms, and just a few years later in 1846, Léon Lalanne, a French engineer from Ecole Polytechnique and Ecole des Ponts, published a nomogram called "Abaque ou compteur universel" in which he explains how to use a nomogram to do all sorts of calculations.
Later, in 1867, Eduard Lill, an Austrian engineer and Captain of Military Engineering, published a nomogram to solve quadratic equations (x² + px + q = 0), showing nomograms were not just a French affair.
Recently, nomograms have been used beyond civil engineering, especially in the field of electrical engineering (e.g. for resistors or inductance sizing), mechanical engineering (e.g. for gears dimensioning), and chemical engineering (e.g. for phase-transitions of materials). Today, they are mostly used for educational purposes, their practical usage being replaced by computers. Except for a few domains, e.g. cancer prognosis.
Being a French engineer myself, I have had the pleasure of studying "Abaques" (the French word for nomograms, which would translate to abacuses) in my time. I have in particular been influenced by the nomogram used in optical engineering for the Lensmaker's equation, and by level sets ("abaques de Pouchet" or "lignes de niveaux" in French).
At this point in my reasoning, equipped with Locke's theory and nomograms, I could summarise my requirements for the running nomogram as follows:
At that point in my running journey, I had been exposed through my Garmin smart watch to the indicator called VO2max. VO2max is a measure of the maximum volume of oxygen that an athlete can use. It is a good indicator of cardiorespiratory fitness, and it is used by Garmin to estimate your running fitness. There are common protocols to estimate VO2max, such as the Cooper test or the Vameval test (particularly popular in France for football). The idea is to run as fast as you can for a given amount of time (e.g. 6 minutes), and to measure the distance covered. These protocols measure your maximal aerobic speed (MAS) which is related to VO2max. Those protocols have their own practicality and precision issues (e.g. like the fact that you need to know your pace upfront, which is a chicken and egg problem).
For my personal use case, VO2max started losing importance because my typical efforts (e.g. a 60min football game, a 2h bike tour, a 10km running race) are much longer than 6 minutes. I started to realise that other indicators were summarising my running fitness better. For example, I noticed that my average pace on a 60min Z2 jog was improving (cf A guide to heart rate training - Runner's World).
I later learned about Critical Speed (CS). Without going into the details, CS is a measure of the maximum speed that an athlete can sustain for a long period of time. It can replace MAS as a surrogate estimate of fitness. You can use the previous link to calculate it or this link. One of its added benefits is that it is very close to the second ventilatory threshold (SV2) which is otherwise costly and impractical to measure (you need lactate and ventilatory tests and a costly physiological assessment).
In particular, the 2020 paper Calculation of Critical Speed from Raw Training Data in Recreational Marathon Runners shows that CS can be calculated from a few personal time trials (e.g. 400m, 800m, 5km) and that it is a good predictor of marathon performance. Moreover, you can visualise the CS in a 2D space where the x-axis is the duration of effort in seconds and the y-axis is the average speed during the effort in km/h. This 2D space will form the basis for our nomogram in the next sections.
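To make this concrete, the classic two-parameter model behind CS says that total distance is linear in duration, d = D' + CS·t, so a least-squares fit over a few time trials gives CS as the slope. Here is a sketch of that fit; the trial times below are made up for illustration:

```python
# Two-parameter critical speed model: distance = D' + CS * time
# Fit CS (the slope, in m/s) from a few time trials (made-up example data).
trials = [
    (400, 75),     # 400 m in 75 s
    (800, 165),    # 800 m in 2:45
    (5000, 1260),  # 5 km in 21:00
]

# Ordinary least squares on (time, distance) pairs
n = len(trials)
sum_t = sum(t for _, t in trials)
sum_d = sum(d for d, _ in trials)
sum_tt = sum(t * t for _, t in trials)
sum_td = sum(t * d for d, t in trials)

cs = (n * sum_td - sum_t * sum_d) / (n * sum_tt - sum_t ** 2)  # slope = CS in m/s
d_prime = (sum_d - cs * sum_t) / n                             # intercept = D' in m

print(f"CS ≈ {cs * 3.6:.1f} km/h, D' ≈ {d_prime:.0f} m")
```

With these made-up trials, the fit lands near 14 km/h, a plausible CS for a recreational runner; D' is the finite extra distance you can cover above CS before exhaustion.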
Here is the first version of our nomogram:
This blog post is not a dataviz tutorial, but let me just say that I built this visualisation with the python programming language and the Altair visualisation library. The code is available on Github.
I have added a few World Records for men and women, in a few typical disciplines: 1mile, 5km, 10km, Half Marathon and Marathon. Those dots answer the question: "what would a world record athlete do?". They also form the upper bound of our y-axis: above that line, no human has run faster. Note that this line may move up in time (meaning the World Records will improve) due to training optimisation, new technology (shoes), better doping drugs ...
Beyond World Records, it's interesting to look at major athletics events like the Valencia Marathon. Although marathons welcome amateurs, they need to define a lower limit, for logistical and economic reasons (maintaining the brand value of a Valencia Marathon Finisher). This is called the "sweeper car" or "broom wagon" ("voiture balai" in French).
the maximum official time for finishing the race being 5h:30:00, with this time limit not being exceeded under any circumstances.
The next interesting data point from the Valencia Marathon is the so-called "starting waves". At the start of a running event, the organiser staggers the athletes into waves of people that hopefully run at a similar pace. The main goal of waves is to limit the meandering needed to overtake a slower athlete, which costs the faster athlete energy and time. An indirect benefit of waves is that they give us the organiser's perspective on what they think the distribution of runners will be (if we assume they tried to design waves with comparable numbers of athletes).
The Valencia Marathon Trinidad Alfonso is planning to start the race in nine waves in order to improve the comfort and safety of all the runners, based on the order of the accredited times.
Finally, I've looked at the "sub-elite bib status" which is a special status given to athletes that have run a fast enough time in the past 3 years. They gain access to a special starting wave and a few other privileges.
Sub-elite bib status will apply to athletes who apply with times under 30:00 in a 10k race, 1h06:00 in a half marathon, or 2h20:00 in a marathon run in the last three years
Here is the second version of our nomogram:
The idea is to divide the x-axis into disciplines that are relevant to running.
On the professional side, World Athletics divides disciplines like this:
On the amateur side, most races organised in my area are 5km, 10km, HM and Marathon. Very little or none for other disciplines. Therefore, I have decided to use the following grid: no sprint, Mile (for Middle distances), 5km (for Long Distance), 10km, HM, Marathon for Road running. If we wanted to include sprinting, we could use a 400m line.
You might note a "distortion" of sorts for longer distances. To counteract this, we could use a logarithmic scale, in particular a symmetric log scale (symlog), which is useful for plotting data that varies over multiple orders of magnitude but includes zero-valued data, like in this variant:
In the rest of the blog post, I decide to keep the linear scale as it makes the reading of the x-axis easier and puts more emphasis on endurance disciplines as opposed to sprint disciplines.
Here is the third version of our nomogram:
The idea is to divide the y-axis into levels that are relevant to running.
"The World's Best Coach" Jack Daniels has proposed a system called VDOT that makes it possible to compare athletes of different levels.
In the 1970s, Daniels and his colleague, Jimmy Gilbert, examined the performances and known VO2max values of elite middle and long distance runners. Although the laboratory determined VO2max values of these runners may have been different, equally performing runners were assigned equal aerobic profiles. Daniels labeled these "pseudoVO2max" or "effective VO2max" values as VDOT values.
With the result of a recent competition, a runner can find his or her VDOT value using a VDOT calculator. This will allow them to determine an "equivalent performance" at a different race distance, as well as recommended training paces.
By looking at the code from the VDOT calculator, I was able to find the equation of the curves of constant VDOT value. I then plotted the "iso-VDOT" curves on the nomogram, using an interval of 5 VDOT points. (Side note: VDOT internally defines levels from level 2 = VDOT 40 to level 9 = VDOT 85, equally spaced 5 points apart, which I've simply extended.)
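For reference, the iso-VDOT curves come from the Daniels-Gilbert formulas. The sketch below reproduces them as I understand them from the calculator's code; treat the constants as assumptions and cross-check any result against the official VDOT calculator:

```python
import math


def vdot(distance_m: float, time_min: float) -> float:
    """Estimate VDOT from a race result using the Daniels-Gilbert formulas."""
    v = distance_m / time_min  # velocity in m/min
    # Oxygen cost of running at velocity v
    vo2 = -4.60 + 0.182258 * v + 0.000104 * v ** 2
    # Fraction of VO2max sustainable for a race of duration time_min
    pct_max = (0.8
               + 0.1894393 * math.exp(-0.012778 * time_min)
               + 0.2989558 * math.exp(-0.1932605 * time_min))
    return vo2 / pct_max


print(round(vdot(5000, 20.0), 1))  # a 20:00 5 km → 49.8
```

An iso-VDOT curve on the nomogram is then just the set of (duration, speed) pairs for which this function returns the same value.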
We can now interpret those curves. For example, men's world record holders are above VDOT 80 while women's world record holders have a VDOT of 75. It also seems that the women's 5km world record underperforms the other women's world records, which maybe calls for a new world record at Paris 2024. The Valencia broom wagon has a VDOT of about 25, while sub-elite athletes seem to have a VDOT a little above 70.
World Athletics maintains a ranking of athletes based on their performance. They use a points system called IAAF points, similar to ATP points in Tennis. Here is an example of the World Rankings | Women's Marathon:
Do you know if you have scored your first IAAF point yet? Go to this IAAF Scoring Calculator and enter your time for a given distance.
Given that IAAF points are an official ranking, we could have plotted iso-IAAF curves on the nomogram. But after trying that, I felt that it was not as clear as the VDOT curves. We can even show that the VDOT curves are a good approximation of the IAAF curves by plotting both on the same graph:
Note: I was able to find the equation for IAAF curves by looking at the code of this PHP library used by the Latvian Athletics Association and this stackexchange answer.
Now that we have a nomogram, we can use it to set a difficult and specific goal. Before that, we need to know what a realistic performance is. Turning to statistics, we can look at the distribution of performances of athletes in a given discipline.
Thankfully, the French Athletics Federation has an Open Data portal. We can crawl the data available at "Les Bilans" for a given discipline, and plot the distribution of performances. Here is the example of Men Half Marathon in 2023:
This looks like a log-normal distribution, and we could certainly model it further and look at percentiles, etc. However, for this blog post, I will simply use a visual interpretation of a realistic performance. "Most people" seem to have a VDOT between 37 and 44. Therefore, aiming for a VDOT of 45 seems like a difficult enough goal for a beginner to get ahead of the masses, without setting the bar too high and being unrealistic.
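If you prefer numbers to eyeballing, percentiles make the "get ahead of the masses" idea precise. Here is a sketch on simulated data; the log-normal parameters below are made up to roughly mimic the figure, not fitted to the FFA data:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated VDOT sample, roughly centred where "most people" sit in the figure
vdots = rng.lognormal(mean=np.log(40), sigma=0.12, size=10_000)

p50, p75, p90 = np.percentile(vdots, [50, 75, 90])
print(f"median ≈ {p50:.1f}, top quartile starts ≈ {p75:.1f}, top decile ≈ {p90:.1f}")
```

With these made-up parameters, a VDOT of 45 falls between the top quartile and the top decile, which matches the "difficult but realistic" reading above.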
Here is the last stop on our nomogram journey:
Over the last few months, I have trained and raced a few times. I have added my past performances to the nomogram in orange. I have also added my future goals in blue. I will run my first Half Marathon in Berlin in April 2023, hoping to finish under 1h40, giving me a VDOT of 45.
When you think about it, finishing a first Half-Marathon in under 1h40 is ambitious, but looking at the data here, I think it's a good goal: difficult and specific. I seem to already have a VDOT above 45 (although on a 1 mile distance - the shortest), and I have a few months to train and improve my fitness further, specifically working on longer distance runs and making sure my VDOT score translates for bigger distances.
This concludes my guide on how to build and interpret a nomogram to set better running goals. No matter your level in running, exercise physiology or statistics, I hope you have found something of value in this article. Feel free to use the nomogram I have built for your own goal setting, and let me know if you have any feedback.