Lightup Data

Lightup Closes $9 Million Series A Round Led By Andreessen Horowitz and Newlands to Democratize No-Code Data Quality Checks Across Enterprise Operations

Saurabh — Sun, 17 Mar 2024 17:30:00 +0000

Lightup, developers of a modern no-code data quality monitoring platform, announced today that it closed $9 million in Series A funding to further empower organizations to democratize data quality checks for 100% coverage across enterprise operations.

Which do you choose — data quality or data quantity? At McDonald’s, we choose both

Saurabh — Mon, 17 Apr 2023 16:30:16 +0000

Original Article Published on Medium

“Managing data quality is critical to enabling informed business decision making, but McDonald’s scale means massive amounts of data — that’s where our data-quality monitoring tool is helping us deliver consistent and accurate insights.” Keep reading Medium article here.

Introducing Lightup Recommendations for Automated Data Quality Optimization, Beta Preview Coming Soon

Sarabjot Singh — Wed, 27 Aug 2025 07:11:52 +0000

You’ve enabled no-code Auto Metrics, monitors, and even some business-specific custom SQL rules in Lightup — but how do you know what’s missing or what to fine-tune next?

That’s where Lightup Recommendations come in.

We’re excited to announce Lightup Recommendations, a new beta feature coming soon to Lightup that provides intelligent, actionable suggestions to optimize your Data Quality coverage. Think of it as your Data Quality advisor — always scanning your environment for missing checks, monitors, and other Data Quality best practices. Lightup automatically analyzes your configurations and usage, always learning and always recommending and automating the next best action to take.

What Are Lightup Recommendations?

Lightup Recommendations are intelligent suggestions for the next best action to take, designed to help data engineers and analysts automatically identify and resolve gaps in Data Quality coverage. Instead of manually auditing your datasets, Lightup sytematically surfaces the most relevant and helpful recommended actions to take — like creating monitors, enabling schemas, activating data profiles, adding missing metrics, or even fixing configurations for existing monitors.

Imagine this:

You have a metric but no monitor on it? Lightup will recommend and create one for you.
A column has data but no metric yet? Lightup will suggest a metric that makes sense and enable it.
Got a noisy monitor, and not sure what to do? Lightup will suggest how to fix the configuration.
Monitor missing issues? Lightup will provide recommendations for additional training or threshold setting adjustments.
Haven’t configured your tables? Lightup will recommend what configuration to use based on the profiled data.

But it doesn’t stop there. Lightup will even prioritize Recommendations based on how you actually use your data. For example, if you frequently reference a table or column using custom SQL checks, Lightup recognizes its importance and bumps up Recommendations related to it. Simply filter and sort, then activate or ignore each one.

Simply put, Recommendations help ensure your Data Quality coverage in Lightup is fully optimized for your organization’s needs and desired results.

Why Recommendations Matter for Data Teams

Maintaining high Data Quality standards is critical for organizations relying on accurate reporting, real-time applications, and decision-making. But with growing datasets and complex pipelines, gaps in Data Quality can go unnoticed.

Lightup Recommendations help solve that by:

Providing smart recommendations of what actions to take, using data usage patterns and existing configurations.
Saving time by eliminating manual hunts for missing checks, monitors, or misconfigured settings.
Boosting Data Observability by ensuring tables, columns, and metrics are properly covered.
Automatically analyzing monitor performance, suggesting relevant actions to fine-tune results.

The best part? Each Recommendation is actionable with a single click. Activate it, Lightup applies the action for you. Or ignore it — Lightup won’t bug you again about it.

Bringing More Automation and Intelligence to Lightup

What makes Lightup’s Recommendations smart? Lightup uses intelligent analysis to prioritize recommendations based on how your team interacts with your data. For instance, if you’re using a table heavily but haven’t enabled Auto Metrics on it, Lightup will flag it as high priority.

This helps data teams:

Eliminate the guesswork for which metrics, schemas, monitors, and profiles to activate next.
Automate more tasks, enabling faster coverage in Lightup.
Accelerate their Data Quality initiatives.

Simply put, Lightup Recommendations make it easy to see what’s missing across your environment, without manual audits or tedious task prioritization.

Get Early Beta Access

Lightup Recommendations is coming soon in beta preview by invitation only for early adopters to try out and help shape what’s next. We can’t wait to show you how much time (and guesswork) this feature will save your organization!

Join the waitlist to be among the first to test this feature in your Lightup workspace.

Stay tuned for more updates as we continue building new features to automate and scale tasks in Lightup.

Introducing Lightup Data Quality for Unstructured Data

Sarabjot Singh — Thu, 31 Jul 2025 14:36:18 +0000

The AI Era Needs Data Quality for Unstructured Data, Starting With Documents

More than 80% of enterprise data is unstructured¹ — but traditional Data Quality tools are designed to run checks or queries on structured data in databases and data warehouses.

That means many enterprises aren’t leveraging unstructured data in critical documents that power operations, analytics, compliance, and customer experience, such as:

Financial reports with updated numbers.
Product documentation scattered across folders or document repositories.
Customer support knowledge bases that need constant updating.

Even small changes in these documents can introduce inaccuracy, inconsistency, or incompleteness — and in some cases, even PII contamination. Unfortunately, monitoring the quality of document data isn’t possible with traditional Data Quality tools and has been challenging to operationalize at enterprise scale.

That’s where Lightup Data Quality for Unstructured Data comes in.

Why Is Managing Unstructured Data Quality So Challenging?

Documents are a prime example of unstructured data containing enterprise insights and valuable information that can be used to drive business decisions and train AI/LLM models.

Yet, documents are typically manually managed, difficult to monitor for quality, and error-prone — making them problematic for training LLM models or decision-making.

Unstructured data is everywhere, including:

PDFs of financial reports
DOCX files for legal, human resources (HR), or product documentation
Plain text or markdown (.txt, .md) files for notes
Email archives
Internal wikis and knowledge bases
Files stored in cloud repositories like Amazon S3 buckets, Google Drive, OneDrive, or Box

Unstructured data is information that doesn’t conform to a predefined, structured data model or fixed schema and can’t be neatly organized into rows and columns like structured data in databases or data warehouses.

Whereas structured data is easy for machines to read and sort, unstructured data — such as documents, images, text, audio, video — is designed for people to consume and can’t be directly queried or analyzed using standard methods.

Managing unstructured Data Quality for documents is difficult to track at scale due to:

The varying structural differences between files and document types.
The nature of key facts being embedded in free text, not schema-defined fields.
Untracked or unmonitored changes within the content.
Quality issues, like missing information, factual inconsistencies, or exposed sensitive or personally identifiable information (PII).

Yet, unstructured data can hold essential enterprise context and operational knowledge. Financial numbers, compliance clauses, invoices, customer feedback, product information, and more all live in documents. However, regressions or errors of omission in documents often go unnoticed, leading to potential downstream risks and problems.

The Value of Training AI/LLM Models with Unstructured Document Data

As organizations accelerate their AI adoption, they’re realizing that some of the most valuable training data already exists inside their enterprise documents. Unlike data in structured databases, documents often contain rich context, domain-specific information, and operational nuances — exactly the kind of company–specific information that AI models and large language models (LLMs) need to be truly useful and relevant for enterprise AI applications.

Enterprise documents:

Capture the core institutional knowledge with contextual details often missing from databases.
Include everyday language and terminology used by employees, partners, and customers.
Are the primary format for business decisions, compliance, and communication.

3 Enterprise Uses Cases

Training Domain-Specific LLMs: Enterprises fine-tune foundational AI models using internal documents, such as technical manuals, customer support FAQs, and policy documents, to improve accuracy in industry-specific tasks. For example, a healthcare provider might use internal protocols and documentation to train a model that assists with medical coding or claims triaging.

Retrieval-Augmented Generation (RAG) Systems: RAG architectures use enterprise documents as a knowledge base that models can reference at runtime. For example, a model answering a customer question can retrieve content from the latest product documentation or internal wikis to generate a contextually correct, up-to-date response.

Automated Document Intelligence: AI models are increasingly used to extract structured information from contracts, financial reports, or onboarding documents. By applying natural language processing (NLP) to unstructured content, enterprises can automate workflows like risk scoring, revenue forecasting, and compliance checks at scale.

Simply put, maintaining high-quality enterprise documents becomes a prerequisite. If your AI is learning from or referencing documents, you need confidence that the information is accurate, complete, and consistent.

Getting Started with Lightup's Unstructured Data Quality

Getting started in Lightup is as simple as connecting an unstructured data source to Lightup:

Navigate to the Lightup Explorer panel, now with a dedicated tab for unstructured data sources.
Connect an unstructured data source, such as Amazon S3.
Enable the Unstructured Data toggle to indicate that folder contents should be profiled.
Lightup creates a directory tree of files and folders within the S3 bucket.

Supported file types: PDF, .txt, .md (more coming soon)

AI-Powered Document Profiling

After connecting your unstructured data source to Lightup and enabling the Data Profiling toggle for all documents, Lightup will automatically generate an AI-powered summary of facts that includes:

Document metadata (type, length, creation date)
Summary of the content
5 autogenerated questions and answers (Q&A), highlighting salient document facts

Editable Profiles

Review the data profile for the document, and if it needs adjusting, you can:

Click “Regenerate” create a new version of the profile.
Manually edit to add or delete Q&As based on domain knowledge and context.
Save updated profiles for monitoring.

Since AI is non-deterministic by nature, each regenerated profile may offer a different perspective.

Auto Metrics for Documents and Folders

Once a data profile is activated, Lightup enables document- and folder-level Auto Metrics for continuous observability. Purpose-built for monitoring the quality of documents to understand the accuracy and reliability of the content, Lightup provides out-of-the-box coverage for the four primary dimensions of Data Quality for documents that can degrade or become problematic at scale:

Inaccuracy
Inconsistency
Incompleteness
Personally Identifiable Information (PII) Contamination

Document-Level Metrics

Lightup’s document-level metrics support rules, monitors, and alerts, keeping your teams notified as soon as anomalies occur. Each metric can be scheduled to run at regular or custom intervals, such as hourly, daily, weekly, or monthly in a specified time zone.

Inaccuracy: Flags changes in factual data (e.g., revenue changed from $1M to $1.2M).

Inconsistency: Detects contradictions within the document and presents a side-by-side comparison of conflicting information (e.g., document indicates revenue of $100K multiple times, but also mentions a different revenue figure of $120K).

Incompleteness: Identifies missing information from the original Q&As and indicates factual gaps or degradations in complete information over time (e.g., net income figures were included in the original source document report, but aren’t included in subsequent reports).

PII Contamination: Detects and lists instances of personally identifiable information (PII) and provides the count and examples of detected PII fields (e.g., name and date of birth).

Folder-Level Metrics

Inconsistency Across Documents: Analyzes multiple files in a folder for conflicting information and surfaces contradictions between versions or documents with detailed comparisons.

Custom Metrics

When you’re ready to go beyond Auto Metrics, Lightup also supports Custom Metrics to extract domain-specific facts from documents using natural language prompts.

For example, if you want to create a metric to track net income from a financial report, here’s how to do it in Lightup:

Navigate to unstructured data source, then right click to select Create Metric.
Define schema (e.g., Value: Income, Type: Number).
Create a metric using natural language, such as “Extract net income from financial report.”
Schedule metric collection runs to trigger document scans.
Activate monitors to track the output, enable Anomaly Detection, and define preferred channels for alerts, while Lightup automatically tracks incidents in dashboards.

Lightup allows you to monitor facts in documents as easily as you can monitor structured fields in a database or data warehouse.

Anomaly Detection and Alerting

Anomaly Detection tracks trends over time and alerts you if quality signals change. For example, if you typically see 4 – 5 inaccuracies for a particular document and suddenly get 10, Lightup flags that incident and notifies your team.

Role-Based Access Control (RBAC)

Security still applies to unstructured data sources with Lightup’s enterprise-grade RBAC framework:

- Users only see metrics, profiles, or PII Contamination if authorized.
- Sensitive documents are monitored securely for compliance.

Role-based access control ensures that users without permissions to view sensitive data, like PII, won’t be able to access profiles.

Explore Our Open Source Project for Unstructured Data Quality

Since we believe AI-ready documents should be accessible to everyone, we’re happy to share a Python library for assessing unstructured Data Quality, available in GitHub as an open source project.*

Connect an LLM and S3 bucket to your project.
Use our open source code library to evaluate the accuracy and reliability of documents, including PDFs, text files, and markdowns.
Run computational checks: Inconsistency, Incompleteness, Inaccuracy, and PII contamination.

Explore our open source project today, contribute your ideas, and deploy anywhere!

Questions? Reach out at [email protected]

*Unstructured Data Quality with Anomaly Detection, Alerting, RBAC, and other enterprise features is available for customers and trial users only.

1. Mary Shacklett, “Structured vs Unstructured Data: Key Differences,” Datamation, November 3, 2023.

Feature Roundup Webinar: What’s New in Lightup?

Saurabh — Wed, 30 Jul 2025 07:29:35 +0000

As organizations scale their data operations, one truth becomes clear: Data Quality and Observability (DQO) are no longer “nice to have” — they are mission-critical.

At Lightup, we’ve taken a deep dive into the architecture of our platform and discovered more opportunities than we originally imagined. Each layer — from ingestion to visualization — presents unique challenges at scale, and we’ve been systematically upgrading every one of them.

The result? A new wave of features that radically improve both automation and control — enabling faster time-to-value and safer enterprise-wide collaboration.

Why Not Just Write SQL or Use dbt?

We hear this a lot. And if that works for you on a one-off basis at a small scale, great.

The real challenges show up at scale. Scale really compounds the complexity of Data Quality metric authorship, management, and maintenance.

Why? As your data environment grows, complexity grows exponentially at every layer:

Data sources proliferate across dispersed data warehouses, multiple lakehouses, object stores, streaming platforms, and SaaS tools.
Metric computation balloons to billions of rows, making full table scans unaffordable and untenable. These hard-coded metrics sliced by business dimensions (per store, per product, per brand, etc.) become impractical to implement manually.
Anomaly detection monitors get noisy without advanced, dimensional logic.
Incident workflows require cross-team collaboration, demanding tight integration with tools like ServiceNow, Jira, Teams, and PagerDuty.
Remediations and reporting need to be automated and scalable across teams.

That’s where Lightup’s newest features come in.

What’s New in Lightup?

Here’s a preview of some of the latest enhancements in Lightup:

Virtual Tables
Unsupervised Learning for Sliced Metric Monitors
Enhanced reporting with richer email alerts, CSV exports, and customizable dashboards
Zero-Config Auto Metrics
Data Quality for SaaS applications
Kafka supported as a data source for streaming Data Quality checks
Data Lineage
Genie, Lightup’s AI Assistant
LLM-powered duplicate data and PII detection
Unstructured Data Quality

Many of these features already leverage AI and LLMs. And there are many more on the way…

Webinar

We’re excited to share a roundup of our latest feature updates and a product demo in our rescheduled webinar on Wednesday, August 6, 2025, at 10:00 am – 11:00 am PT.

You’ll get a front-row seat to see how Lightup is pushing the boundaries of scalable Data Quality and Observability for enterprises. Plus, we’ll reveal what’s coming soon and what to expect next from Lightup.

Know someone who should attend? Forward this blog to them and we’ll see you there!

Missed the webinar? Catch the replay, now available on-demand.

How to Build Data Quality Dashboards with Lightup Metrics and Your Favorite BI Tools

Saurabh — Thu, 01 May 2025 18:52:44 +0000

If you care about Data Quality reporting for broader transparency at your organization, you need custom dashboards. Whether you’re tracking data freshness, schema changes, or incidents, creating Data Quality dashboards is the best way to report usage, trends, and visually summarize insights by product or department for stakeholders and executive leadership teams.

Lightup provides built-in Data Quality dashboards for at-a-glance insights that matter most to your team.

For organizations that prefer to use their enterprise BI tool, Lightup enables teams to build executive reports and custom dashboards with PowerBI, Tableau, Apache Superset, and more.

Creating Data Quality Dashboards with Lightup Metrics

By design, Lightup focuses on being as flexible as possible, which is why we expose our data model directly in Postgres. That means if your BI tool can connect to Postgres, you’re ready to start building custom Data Quality dashboards for your organization — no extra data modeling required.

Get started by connecting your BI tool to Lightup’s Postgres database, where you’ll get direct access to an analytics view of key Lightup metrics, such as:

Daily incident counts by data source
Failing records count
Number of incidents
Names of metrics and monitors
Status of active monitors
Count by product or department

This isn’t just a view of raw data in Postgres; Lightup exposes a structured, dashboard-ready schema, so you can easily map fields to your BI tools for visual exploration and reporting.

Customizing Data Quality dashboards with Lightup metrics is simple:

Connect your BI tool to Lightup’s Postgres database.
Explore the exposed schema.
Create charts and dashboards based on your business requirements.
Share dashboards with stakeholders for ongoing visibility and decision-making.

Why It Matters

If you’re ready to enhance Data Quality reporting at your organization with customized dashboards, Lightup enables faster time-to-insight by providing:

A live data model in Postgres, ensuring up-to-date Data Quality dashboards.
A ready-to-use schema, no data modeling required.
Flexibility to use your preferred BI tool.

Get Started

Start a risk-free trial of Lightup and build your own custom Data Quality dashboards today.

Questions? Request a free demo or contact us.

Introducing Lightup for Data Lineage: Enabling Data Quality Assurance with Enhanced Traceability and Faster Root Cause Analysis

Sameer Satyam — Tue, 01 Apr 2025 10:18:38 +0000

As organizations continue to modernize their data management and race to implement more AI-driven data products, the need for reliable, accurate, and auditable data is now more critical than ever. Why? Large enterprises that are managing massive amounts of data, running complex pipelines, and working with artificial intelligence (AI) and machine learning (ML) applications rely heavily on the integrity and accuracy of data.

So, how can enterprises maintain data reliability and traceability throughout the data life cycle?

That’s where Data Lineage comes in.

What Is Data Lineage?

Data Lineage is the process of tracking and visualizing the flow of data from its origin or source, through all the processing stages, until it reaches its final form or target destination.

By helping organizations understand the flow of data across its life cycle, Data Lineage provides answers to questions, such as:

Where did the data come from?
What are the downstream dependencies?
What’s the final target destination of the data?

Think of Data Lineage as a map, showing where data originated from (the source), how it’s been changed or transformed (data processing), and where it’s going for consumption (target destination). This allows organizations to keep track of these processes, gaining visibility and traceability into each stage of the data pipeline.

Why Is Data Lineage Critical for Modern Data Management?

Data Lineage plays a crucial role for organizations implementing modern data management systems, especially when it comes to Data Governance and Data Quality.

Here’s why:

- - Traceability: Since Data Lineage provides granular visibility into where data came from and where it’s going, identifying issues like inconsistencies or unexpected changes becomes much easier.
  - Identification of Data Quality Gaps: By tracking data across each stage of its life cycle, Data Lineage can identify systemic gaps in Data Quality coverage. For example, Data Lineage shows which nodes are missing automated validation checks or areas where data isn’t monitored for inconsistencies as data flows through pipelines.
  - Root Cause Analysis: Data Lineage helps teams diagnose the root cause of incidents, even when issues arise that don’t necessarily cause pipeline failure outright. If downstream reports or dashboards show incorrect figures, lineage maps can reveal whether the problem originated from missing fields in the source system or if errors were introduced in later processing phases. This helps accelerate root cause analysis, enabling teams to fix issues at the source to ensure high-quality data flows through the pipeline to reach its final destination.
  - Impact Analysis: Data Lineage enables teams to assess the downstream impact or blast radius of poor-quality data or unexpected schema changes. For example, if rows and columns get dropped from the parent table, Data Lineage indicates the downstream dependencies in child tables. This visibility helps with comprehensive impact analysis, ensuring that gaps in upstream processes are addressed before they cascade through reporting systems used for critical business decisions.

How Data Lineage Works in Practice

What does this look like in practice? Imagine you’re working with raw product data in a Postgres database. You move this data into Snowflake, changing some category names to fit your SQL database schema. At each step, you document what happens to the data: what’s dropped, what’s changed, and the state of the data at each stage.

This metadata trail is important because it captures:

Where the data came from and where it’s going.
What columns or rows were altered or removed.
When and where those changes took place.

Even when an ETL pipeline itself doesn’t fail, Data Lineage can help identify any discrepancies in the data — such as a sudden drop in the number of rows in the product table, despite the raw product data remaining the same.

This level of granular visibility and traceability is critical in complex data environments, since it helps teams quickly identify the relationship between data assets and the exact blast radius of incidents on downstream processes.

Lightup for Data Lineage

You asked, we’re on it.

At Lightup, we understand that to get the most out of Data Lineage, you need to combine it with contextual Data Quality insights. That’s why we’re excited to announce the beta release of Lightup for Data Lineage, designed to make it easier than ever to track and visualize the flow of data with integrated incident status warnings at every phase.

Lightup Data Quality and Lineage go hand in hand for faster, more efficient root cause analysis. You’ll also see any gaps in Data Quality checks, plus the exact downstream blast radius of every incident — leaving nothing to chance.

Whether you’re working with complex data pipelines, ensuring high-quality data for products and services, or maintaining regulatory compliance, Lightup enhances Lineage with the visibility, traceability, and Data Quality insights needed to mitigate risks, accelerate root cause analysis, and deliver trusted data across the enterprise.

Simply put, when Lineage mappings are enriched with Data Quality incident warnings, that becomes an indispensable way to ensure data flows smoothly and remains secure, building trust for data consumers.

We can’t wait to enhance your Lineage with Data Quality insights — sign up to join the waitlist.

Data Quality and Observability for Streaming Data: Introducing Lightup’s New ksqlDB Connector

Sameer Satyam — Sun, 16 Mar 2025 17:32:41 +0000

Data Quality Monitoring for Streaming Data

Monitoring Data Quality for structured data at rest is challenging enough in modern data environments without adding another layer of complexity to it, like real-time streaming data. With data in motion, Data Quality issues are amplified since hidden bad data can cascade quickly, proliferating to more places downstream — likely, undetected.

For example, if incorrect data and invalid schemas go undetected in event-driven streaming platforms like Kafka, downstream applications and supported services in that ecosystem can fail or run at suboptimal performance. Since many organizations use Kafka to run real-time applications for high-volume and high-speed transactions — such as restaurant sales, e-commerce, Internet of Things (IoT), telco systems, streaming video services, or even tracking customer loyalty points — it’s critical to ensure every system is running on high-quality data.

Over the last few years, the need to monitor Data Quality for streaming data in Kafka has dramatically increased, which is why we’re excited to announce the release of Lightup’s beta connector for ksqlDB.

How It Works

Since ksqlDB has a different architecture than relational databases with structured tables, applying Data Quality checks the traditional way to real-time streaming data doesn’t work.

In order to run Data Quality checks on streaming Kafka data, we took a different approach. Lightup connects to ksqlDB to read streaming data from all Topics stored in Kafka clusters.

In Lightup, a “Kafka cluster” is treated as a data source.
Since tables don’t exist in Kafka, the “Topics” in ksqlDB are converted to tables in Lightup.
Since schemas don’t exist in Kafka, Lightup automatically creates a schema, named “default.“

The Lightup connector for ksqlDB enables you to monitor metadata Auto Metrics for Tables, Schemas, and Columns in Kafka “Topics” (handled as Tables in Lightup).

Technical Requirements

ksqlDB must be installed and configured for stream processing on Kafka clusters.
A Kafka schema registry is necessary to get the schema of the values in each Topic.

Lightup Data Quality and Observability for ksqlDB

Monitoring Data Quality in real time is more crucial than ever, especially as organizations increasingly rely on streaming data platforms like Kafka for high-speed, high-volume transactions. With the beta launch of our ksqlDB connector, Lightup enables enterprises to monitor Data Quality for streaming data in dynamic, event-driven ecosystems.

By treating Kafka Topics as tables in Lightup and automatically generating schemas for otherwise schemaless streaming data, Lightup empowers enterprises to monitor Data Quality for real-time data. The result? Enterprise data teams can proactively catch issues before incidents escalate and cascade through downstream systems and workloads.

As the demand for real-time analytics and AI applications continues to grow, ensuring high-quality data for all systems isn’t just important, it’s essential for maintaining optimal performance and data reliability in modern data environments.

Learn more about Lightup’s ksqlDB connector, request a free consultation and demo today.

4 Out of 4 LLMs Tested Got It Wrong: Examining the Impact of Data Inconsistency in Unstructured Data

Sarabjot Singh — Tue, 25 Feb 2025 07:08:51 +0000

What Is Data Inconsistency?

Data inconsistency is data that isn’t standardized or uniform across various data sources, systems, or formats. This often occurs when data is pulled from outdated or incorrect data sources containing conflicting, invalid, or partial information for the same attribute.

Inconsistent data can arise from differences in formats, currencies, or units between sources, typically caused by mistakes like typos, improper data formats, or invalid data entry due to human error or lack of knowledge.

What Is Unstructured Data?

Unstructured data is any type of data without a predefined format or structure and comes in a variety of forms, such as:

Text data (emails, documents, chat messages, social media posts, customer reviews)
Multimedia (images, videos, audio files)
Logs (from servers or applications)
Web pages (HTML content)
IoT or sensor data

Unlike structured data in a database, organized in rows and columns, unstructured data lacks a predetermined model, making it harder to organize, process, and analyze. Though, the latest innovations in artificial intelligence (AI), machine learning (ML), natural language processing (NLP), observability, and advanced analytics are enabling organizations to get more insights and value out of unstructured data, especially for building large language models (LLM) and GenAI applications.

Data Inconsistency in Unstructured Data

Inconsistent information in unstructured data, such as documents, can significantly degrade the accuracy and reliability of Large Language Model (LLM) applications that rely on them for knowledge retrieval and processing. When documents contain typos or contradictory statements, the LLM can’t determine which information is correct. Naturally, discrepancies in document information cause the LLM to provide invalid output or misleading responses.

Testing the Impact of Data Inconsistency

To test the impact of data inconsistency in LLM datasets, imagine a scenario where we need to understand the total revenue of a company based on unstructured data in financial reports.
For this test, we used a mock revenue statement from a fictitious company, Fict.ai:

				
					Fict.ai, a rising player in the artificial intelligence sector, has demonstrated strong financial performance, reporting a revenue of 1.2 million dollars in the latest fiscal year. This achievement highlights the company’s ability to monetize its AI-driven solutions effectively, securing a solid position in a competitive market. With a revenue of 1 million, Fict.ai has managed to attract investors and expand its research and development efforts, ensuring continuous innovation in machine learning and automation.

Despite market fluctuations, Fict.ai has maintained a stable financial trajectory, consistently hitting the 1 million revenue mark, reinforcing its business model’s viability. The company’s sustained growth, reflected in its 1 million-dollar earnings, enables it to explore new markets and diversify its AI offerings. Looking ahead, Fict.ai aims to scale beyond its current 1 million revenue, leveraging its expertise to drive further profitability and technological advancements.

This sample financial report discloses Fict.ai’s revenue incorrectly as $1.2 million in the first sentence, but then states $1 million as the correct revenue number, four times throughout the rest of the document.

Now, imagine this conflicting report is part of the unstructured dataset in the LLM for Fict.ai’s GenAI-powered financial assistant. How would you know if you could trust the response?

Testing Four LLMs

We used Fict.ai’s report to test four state-of-the-art LLMs — GPT-4o, Claude Sonnet, o1-mini, and DeepSeek-R1 — and asked for the revenue amount using the following prompt:

				
					Given the following details about Fict.ai, can you tell me their annual revenue? Just return a numeric value.

How did the LLMs respond?

All four LLMs got it wrong, returning $1,200,000 as the numeric value of Fict.ai’s revenue, overpivoting to the value corresponding to the first mention of revenue.

Test 2: Detecting Inconsistencies

We ran a second test to check if the LLMs could correctly detect the issue by adjusting the prompt before responding.

Edited prompt:

				
					Given the following details about Fict.ai, can you tell me what is their annual revenue? Just return a numeric value. If the revenue  is inconsistent return all values in a list.

By tightening the prompt, the LLMs generated a new response with inconsistencies detected:

				
					The revenue values mentioned are inconsistent: [1.2, 1] (in millions)

The Impact of Data Inconsistency

As this test demonstrates, LLM-based knowledge retrieval can be very sensitive to underlying Data Quality issues, such as inconsistent values in documents. Why does that matter? The impact of LLMs providing inaccurate answers and unreliable output include:

Dissatisfied users
Lower adoption of AI-driven applications
Exposing the business to potential risks of financial losses, legal consequences, or reputational damage

This simplified test case underscores the importance of Data Quality for unstructured data to ensure that LLM applications run on accurate, consistent, and trustworthy information. Moreover, without a systematic way to monitor Data Quality for unstructured data, the effectiveness of AI-driven automation, decision-making, and knowledge retrieval can be severely compromised.

Open Source Data Quality for Unstructured Data

Explore our open source project for Unstructured Data Quality to test it out and add your contributions: https://github.com/lightup-data/lightudq.

Lightup Fireside Chat with Guest Malcolm Hawker: Challenging the Core Foundations of Data to Enable GenAI

Saurabh — Mon, 03 Feb 2025 18:30:58 +0000

Introduction

In our latest Fireside Chat, Malcolm Hawker, CDO of Profisee, joins Lightup CEO and Founder Manu Bansal to unpack a question posed in Malcolm’s LinkedIn post: “Why should data teams be skeptical if they hear a ‘focus on foundations’ is needed to enable GenAI?”

Fireside Chat

For this “poke-the-bear-worthy” fireside chat, challenging some of the core foundations of data, we exceeded 5 minutes. But, it was well worth it to hear Malcolm and Manu engage in a lively conversation with probing questions and surprising answers, covering:

The paradox of operationalizing and getting value out of AI.
The multiple sides of dark data and the implications of lighting it up.
The unspoken truth about AI governance frameworks.

Transcript

Manu: Malcolm, welcome to the show. I read your recent LinkedIn post, and I’m like, I have to come talk to you about this. It’s a spicy hot take, and there’s so much to unpack in what you’re saying there.

I think most people probably that are listening to this already know you, so you probably don’t need an introduction. But for those of you that don’t, Malcolm, maybe you want to introduce yourself quickly, and I’ll just add that you have been in the industry longer than you probably care to admit.

But there’s just so much to draw from your experience. But why don’t you introduce yourself, real quick?

Malcolm: Well, first and foremost, thank you, Manu, for having me, and giving me the opportunity to share my perspectives with your audience.

I’m Malcolm. I’m the CDO of Profisee. We make master data management MDM software. Been in the data and analytics space for about thirty years, been around the block. I’ve worn a lot of different hats, but I’m very active out on LinkedIn. Part of my mission is to share what I know with the kind of a broader data and analytics community and, to help other CDOs and other, you know, senior data and analytics leaders succeed, in their mission to drive transformative value from data. So, thanks for having me today, and I look forward to the chat.

Manu: And poke the bear every once in a while. Right?

Malcolm: Oh, I poke the bear almost every day. Yeah. I love poking the bear. And the reason is because we have a lot of data to show that the status quo is just not working that well.

Manu: Mhmm.

Malcolm: Right? There’s lots of data to show the status quo is holding us back.

So I think we need more people looking at different and more provocative ways of solving old problems. So that’s something, that role, I hold near and dear of chief bear poker.

Manu: Yep. Awesome. And, I’m Manu Bansal. I’m the Founder and CEO of Lightup Data. We are building a Data Quality platform that we’re very proud of, working with large enterprises and really enjoying the journey.

Malcolm, you made a very interesting statement recently, which is challenging the core foundations of data.

And I was just intrigued when you said that because people don’t normally do that. And, I think the way you framed it, it makes perfect sense to me. You’re kind of talking about how data has, especially in the world of LLMs, has really shifted, from what it used to look like in the world of analytics to being much more text oriented and unstructured. And, I think, it’s kind of natural to assume that your data foundations will just carry over, but you seem to be implying they don’t.

And you’re taking a hard stance on that. What exactly are you getting at?

Malcolm: Well, you know, I guess we need to be careful with generalities, right, and with and with sweeping statements.

Sweeping statements get a lot of clicks and they get a lot of attention, but they are sweeping statements. So I should probably preface by saying that this general shift that we have towards unstructured data that is being driven by LLMs is a challenge for most companies, not all. It’s a challenge for most companies. It’s a challenge for most CDOs.

It’s a challenge for most data and analytics foundations because the foundations we’ve been focused on for the last twenty years are the foundations needed to support analytics, rows and columns, and very structured data. Yes. We can have conversations about, you know, ETL versus ELT, lakes versus warehouses. We can have those conversations.

But at a very, very high level, the core foundations that most CDOs and most data leaders have been building, most, not all, but most have been building, is around structured data needed to optimize analytical processes.

The reality of a Gen AI driven world is that LLMs are built on and optimized by text text. Right? So they were built off the Internet. They’re trained off text data.

They are fine tuned using text based data, and they’re optimized by text based data, meaning the prompts that we type into them. So if you’re a CDO out there and you’re saying, yeah. I want my company to use LLMs and to get value from generative AI, so I’m gonna double down on my foundations. You’re doubling down on something that is probably enabling your analytics platform, your click, your Tableau, your Burst, your your Snowflake, you know, a data warehouse, but it’s not enabling a broader use of Gen AI in your organization.

Manu: Can you give me an example of what that foundation or that element might look like that doesn’t actually carry over?

Malcolm: That doesn’t carry over. Well, so let’s take a Data Quality rule, for example. A basic Data Quality rule that says this data may must be present or it must conform to this standard or even the idea of here’s how I determine whether an address is correct or not. An address is actually one that might carry over because an address might be useful to an LLM.

But basic Data Quality rules that are all built around looking at an individual field or an individual record. Right? They’re not looking at it. Your Data Quality rules are typically not looking at full paragraphs of data.

Right? There may be some, for example, maybe in health care or some other uses of Data Quality where they are looking at more of a narrative.

Mhmm. Right? But typically, what we’re looking at is individual fields of data, individual attributes, individual records, and applying Data Quality standards to them, or or making data conform to certain standards. So this is just one example of how okay.

I’ve got a foundation built on these Data Quality rules that is looking to make sure that it’s a, you know, an integer instead of a varchar. Okay. Well, that’s not enough in a world of LLMs. If you’re trying to apply Data Quality in the world of l l LLMs, you would need to look at narratives, stories.

You’d need to look broader. The Air Canada use case is a classic example here where Air Canada got sued because they were putting incorrect information into a chatbot around their bereavement policy. Their policy they used to reimburse people for airline tickets bought to go to the funerals of family members.

And that data was was was incorrect, but it was data that was based on long long form text based data that stated a a given policy that a Data Quality rule that is being used to make sure that the data going into a data warehouse that is used for a downstream analytical process would would would be ill suited to support.

Manu: I see. I see. And so it’s like, that’s interesting because I’m trying to put this into the context of what I was hearing at Gartner data analytics the last time it happened earlier this year. And, you know, it’s like you couldn’t go to a talk that didn’t talk about GenAI, and you couldn’t go to one that didn’t talk about Data Quality. And it was just alarming to see how those two topics would be part of pretty much every single conversation.

And one thing I felt coming out of that event and then follow-up conversations was from a qualitative point of view, first principles point of view, it makes perfect sense that Gen AI is being held back by Data Quality. But then the moment you try to link the two, it starts to become very vague and murky. Right?

But at the same time, like, the fact remains that that’s what we hear CDOs complaining about the most in holding back those initiatives.

I don’t think you’re suggesting we don’t have Data Quality as a primary need for LLMs. Right? You’re not suggesting that. Right? The same problem.

Malcolm: No. No. No. No. Not not at all. Data Quality is more important than it’s ever been before for the very reason stated.

But the problem is you cited Gartner. I was there as well.

And the problem is what I’ve heard is two years of platitudes.

I’ll give you an example.

AI governance.

We need to have ethical practices and we need to have practices around data governance that limit the presence of bias.

What does that, what do they actually mean? Mhmm. What are the rules that, that me as a data practitioner would use to ensure that the data that I was using was ethical?

Manu: Mhmm.

Malcolm: Nobody nobody is talking about what those actual u rules or policies would be because if you need to implement them at scale, if you need to automate them, if you need to write some sort of script against the data that is saying, okay. Is this data ethical? You need to mathematically apply some sort of rules or some sort of conditions to the data. And how do you do that when all that you’re hearing at these conferences is just platitude after platitude? Right? Like and and it’s impossible to argue, okay. Data Quality is important for companies to get the value out of generative AI.

But then when you start, what does that actually mean? Well, what it actually means is that the data that it would be consuming either during training, ninety nine percent of companies aren’t gonna be training their own models. So we’re really kind of talking about either fine tuning or or we’re talking about prompts. Right? Is the data that is going into a prompt accurate?

Is the data that is being used for fine-tuning accuracy? And how do we ensure that? If we’re talking about a paragraph of data that describes, who knows, right, your bereavement policy or maybe your HR policy, what are the processes that you would use to make sure that that data is correct? How are you going to do that?

How are you going to deploy data stewardship resources to make sure that that data is correct? Are you gonna be able to automate it? Nobody’s talking about these things. Like, nobody’s talking about it.

Not to mention the fact that most of the stuff that we govern is actually structured, is highly structured. Stuck rows, columns. It’s not the stuff again, there are outliers. There are companies in the health care space that have been doing OCR optical character recognition.

There are companies in a few spaces where they have been applying governance to less structured data. But for the most part, most companies are just completely ignoring unstructured data. SharePoint servers, PDF files, sitting out on marketing Google drives.

Right? Like, that stuff is getting sucked into LLMs left, right, and center with absolutely zero governance. So you go to Gartner. You hear, oh, well, we need to double down on foundations. We need to focus on Data Quality to enable, you know, LLMs when the fact of the matter is nobody is out there talking about how.

Manu: Yeah. And the and it’s a very interesting take, or or angle that you’re getting into, which is how part of it can we get into the details here a little bit. Right? And it’s like, at some level, you’re saying if you can measure it, then how do you manage it at all? Right?

And then immediately, you start to talk about what should you measure. So before you even get into how should you measure, what should you be measuring? What does Data Quality even mean in this context? And, as much as I have thought about it, I feel like the answer drastically varies depending on which end of the pipeline you’re looking at, and the two ends are obviously the extreme points.

You talked a little bit about what it looks like going into a prompt, and you want the data to be accurate. And the primary kind of measure, at least if I put this in the context of a human review, what I would tell my expert to review for is accuracy of data that’s going into the prompt. Right? And I could imagine accuracy here could also encompass completeness, which then therefore starts to talk about freshness and all those typical attributes of Data Quality that we talked about.

Right?

If policies are not up to date, then they probably are not going to be accurate. Right?

Do you have any ideas on what it would look like at the head end of the pipeline? We kind of alluded to the store of PDFs sitting on a SharePoint, and no one is governing that, and then you suddenly expect elements to be well on them. It’s not going to happen. But if I even forget about the step of feeding this into elements, and I’m just saying I want my PDF store to be ready for one day being fed into LLMs. What should I be tracking today if I’m going to start using it in six months?

Malcolm: That’s a great question. I mean, step number one is just discovery.

Right? Like, just discovery. You gotta figure out what’s out there. I mean, to me, that would be step one.

Like, what is the universe of arguably valuable data that you could use to help inform or optimize or potentially fine tune a language model for a given business problem. Mhmm. Right? So what is the universe of data out there?

What’s out there? Right? I suspect this is a huge challenge. Like, do CDOs even know how much data is sitting in the marketing realm, for example, that is that could be arguably extremely valuable to build models related to customer preferences or buyer behaviors?

Those are with the line more to more traditional ML models. But are there other data sources out there that could be used to create some sort of Copilot for customer service use cases, for example? Right? How much data is sitting out there in customer service FAQs and that type of data that is sitting in those repositories that is largely and can totally ungoverned from the perspective of the CDO.

Maybe it’s governed from some local process where the customer service function is managing their own Data Quality, managing their own rules. Okay. That’s great. But what are those rules?

So step number one would be discovery. Step number two would be to understand what governance is being applied today and who’s responsible for it. What do those processes look like? Right?

How do you ensure things like change management? How do you ensure all the dimensions that many that you just mentioned, whether there’s four, six, or twelve dimensions of Data Quality depending on who you ask. Right? Yeah.

So I mean, to me, that would be kind of step number one. And number two, what’s out there and what are you doing to ensure some idea of governance? And then number three would be to kind of overlay some requirements for AI to say, okay. What’s unique about AI?

What do we need, what do we need to solve for?

Manu: You know, I mean, that in and of itself would be, for most companies, I suspect would be a massive lift. And what I’m seeing out there is most data organizations are just kind of not doing it. Mhmm. And and the usage of LLMs is is happening organically from the bottom up within specific business functions where people are just using off the shelf LLMs to help or or copilots to do things like, you know, like a like a Git copilot to optimize engineering processes where people in marketing are just going using OpenAI to write, you know, customer communications or to write FAQ statements.

So, you know, is that right? Is that wrong? It’s happening. Right? And the CDOs could be, hey, you know, we gotta stop this ungoverned use of this, these processes. Well, it’s not gonna stop. I mean, only gonna only gonna get worse.

And you have, like, just trying to, again, deconstruct the problem statement here a little bit.

I’m hearing kind of two different takes from the community right now. One take is that Data Quality for the construction data is just so poor that you cannot expect elements to do anything useful, at least not to the level that you can actually put in production yet. And all the focus should be on improving the quality of this input data source.

Second kind of, more developer, sentiment seems to be that majority of the gain right now is in playing with heuristics on the on the LLM pipeline side itself, whether you’re doing drag or some variant of it, and how you start to chunk up data, and should it chunk more or less, or should you take neighbors or multiple PDFs or just one? Like, there’s so many kind of, it feels like a black art right now, which, which is not very scientific. But then just because the bar is so low, simple heuristics actually tend to give you a lot of gain. And that’s what’s really the limiting step right now before you start to care about the quality of input data.

Where do you stand on, like, which is a more important problem today?

Malcolm: Those are astute observations and completely accurate.

The first observation, which is this idea that the Data Quality is just so bad I can’t do anything, would I think closely align to CDOs that most likely will be looking for new jobs in the next year.

The fact that you are making this highly deterministic, very binary, it’s either good or it’s bad statement when it comes to anything related to AI is a testament to a lack of knowledge around how AI works.

Right? Because AI is not deterministic. It’s probabilistic. And AI is highly, highly, subject to context.

Right? So what is true in one context may not be necessarily be true in another. It’s all about the context. As data people, we should know this by now.

We should know this by now. But many of us cling to these deterministic mindsets that make us think that data is either all garbage or all good. It’s neither. It may be both at the same time.

Right? It may depend, it depends entirely on the use case. So if you are a data leader out there that is taking that approach of our Data Quality is just so bad. I don’t know what to do.

I can’t use this. Well, that’s the wrong approach because it’s inaccurate. I guarantee you there is a business problem out there right now today that could be solved using an LLM even with the poor and a business state of your, let’s say, your CRM data, which is one that people always pick on. I guarantee you there is business value that can be delivered even using the data as it exists today.

Let’s remember that LLMs are built on the Internet.

This is not necessarily a bastion of Data Quality. It’s not something that is well known for the accuracy of data. Yet somehow, these amazing systems that we’re building are still able to provide meaningful value to us. People are still using them.

Kids are still using them to do their homework day in and day out with varying degrees of success, I would imagine. But the other use case that you talked about is a very practical one, which I’m seeing is happening in most CTO CTO organizations because people are getting fed up with a lack of traction related to AI, and they’re going and trying to solve these problems. Mhmm. More on the CTO side of the house where they are deploying complex rag problems or prop complex rag processes, where they are using vector databases, where they are trying to find ways to add context and insight off of structured datasets using, you know, a graph, right, and others.

So, yes, that is the more practical way to do it. It is the more outcome driven way to do it. Today, what you described is the only way that I know that I can see at the very least in vectors and graphs. The only way that I know that you can add the context necessary to make structured data more consumable by LLMs.

So that’s a more practical approach. The data shows us, this was published in a new Vantage partner survey earlier this summer, that only about five percent of companies are taking that approach are taking that approach that you just described. I hope to see it increase in the coming year, and I expect it will increase in the coming year because there’s a ton of value out there when we find a way to operationalize that highly structured data to have it be consumed by LLMs and during some sort of a fine tuning process or even within more of a complex prompt. So, yeah, that’s the practical way to do it, but not enough companies are focused on it.

Manu: Yeah. I mean, there might actually be an irony hiding there and trying to really clean up the data feeding into LLMs. Right? I mean, the whole promise is you don’t have to do that. I mean, are you just basically then starting to structure your unstructured data that was supposed to be what you avoid in the first place?

Malcolm: This here’s the paradox. Right? Which is the only way that I think that we’re gonna be able to get our data in a state where we can operationalize it and get value at scale is through AI, is to use AI to do it. Mhmm.

Is to use AI to do some of the data prep. Right? Is to use AI to get the data in a more unstructured format, right, to create narratives or to create stories where GenAI based solutions can, can more readily consume it. So there’s a little bit of a paradox there.

There’s a little bit of irony there. But this does start to get into some of the foundations that you were talking about, which is, can you build processes to start injecting stewardship or or more traditional governance processes into something like what we just described? Mhmm. Right?

Where you are doing some data profiling at scale, of of structured data and you were running graphs on it and then you were making some assertions based on the graphs and or or or some of the triples that are based based in that graph, can then you apply more more traditional data stewardship to make sure this makes case and makes makes sense. Yeah. You know, can we improve some of these processes and scale some of these processes? Yes.

I think we can. And we can bring some of those foundations into this new world, but far too many data leaders are taking that first approach, Manu, that you described, which is the, well, it’s junk, and I can’t use it. It’s useless.

Yeah. Yeah. Maybe maybe instead of saying data is junk and I can’t make GenAI work, we should be saying data is junk, and the only way I can actually get value out of it is with Gen AI.

And so the question to use AI to scale AI, the answer is yes.

Because the Pareto principle tells us that eighty percent of our data is probably sitting out there unused, right, unmonetized, un unanalyzed, ungoverned. It’s just it’s out there in the SharePoint servers and PDFs and image files all over the place.

On sitting on word deck word docs on hard drives. I mean, there’s a ton of data out there, and we know intuitively we should know that there’s a ton of value in that data. There’s a ton of risk in that data as well, but we should know there’s a ton of value there. And how do we extract that insight? The only way we’re gonna do it at scale is using AI, ironically.

Manu: Yeah. So it’s like, how do you light up your dark data? And maybe, maybe that’s what the LLM is designed to do in the first place. And since sort of fighting, that dark, ugly data, maybe we should be asking, how do I actually get value out of it? Because that’s what the objective is. That’s what the challenge is.

Malcolm: I love it. And I love you using the phrase dark data because it has multiple definitions in this case. Right? I think what you were referring to is data that is just sitting there, you know, not generating value.

But there’s another definition of dark data that says that it’s data that is just sitting in a data center on a disk somewhere, consuming scarce energy resources in that data center, but we’ll never see the light of day on a report, that will never be on a dashboard, that is never used to, you know, to do anything of meaning.

And depending on who you ask, there’s data all over the place on this, but it’s anywhere from fifty to ninety percent of data is sitting dark. Right? Yeah. Get that data is all sitting on disk somewhere that is consuming energy that needs to be cool, that needs to be powered. Right, to the point where the data center industry, by some estimates, is consuming or or producing more greenhouse gas than the airline industry and the shipping industry combined.

So maybe that is part of the use case here. Right? So maybe if it’s not getting the value from Gen AI, maybe at the very least, maybe it’s doing well by the planet because a lot of this data is just sitting out there consuming scarce resources and just collecting dust.

Manu: Yep. Yep.

Malcolm: So if we could light up that data Yeah. Then there’s it’s a win win. Right? It’s it’s you can you can get some value from it. You can mitigate any business risk to it. And, also, by the way, you could probably do some good by the planet.

Yep. Yeah. I mean, and I think we are still just kind of, it’s almost like we’re unlearning the habits that have been developed before, and we have to get there before we can start to develop new habits. Right? So if you try to really pin down what those new foundations are going to look like and what they should look like, for the world of AI, probably don’t know enough yet.

Manu: I think you’re on point, right, which is, first, let’s agree that we need to unlearn the way we have been doing the infrastructure data. That’s not working. Then we can start to ask what is. Right?

Malcolm: To totally agree. And I I when it comes to frameworks and foundations, I mean, that’s one thing that we do pretty well. Mhmm. Right? Like, when you go to these conferences, you mentioned Gartner before. Every single one of them is talking about AI governance, and most of them will show you an AI governance framework.

It’s how to actually operationalize it. It’s the problem. Nobody’s talking about how to actually do that. So when you talk about this shift of foundations yeah. I mean, I think figuring out, like, the kind of the bits and bobs of each of the, you know, enabling capabilities of each of these frameworks are going to be necessary. I’m not worried about our ability to figure that stuff out. What I’m more worried about is what could brought more broadly be to call a mindset.

The way we think about data, the way we think about our customers, and I use that word intentionally customers. Right? The way we think about our roles.

I think we’ve got a real challenge related to mindset. I gave you one example earlier when I was talking about this very deterministic way that people in the data and analytics space tend to see the world. Right? Data is either all good or all bad. It’s garbage in, which means it’s garbage out. That is a very deterministic, rules driven approach to the world. When in reality, in an AI driven world, it is inherently probabilistic.

Manu: Mhmm.

Malcolm: It is inherently probabilistic. Right? So what is good for one may not be good for another or vice versa.

So take what’s just one small example of a different way of thinking about these old problems.

Right? I use the phrase garbage. We use the phrase garbage to describe our data all the time. What sort of impact is that having long term on how we view our role and how we view the products that we’re building for our customers?

I suspect that it would have a corrosive impact over time. Right? This phrase of garbage in garbage out. I could keep going on here, but I think we need to think that the the start here is to think differently about data, think differently about these problems, think differently about how we approach things, challenge the status quo Yeah.

Because it’s not working.

Manu: Mhmm.

Malcolm: That will lead to all, I suspect in time that could lead to better things down the road.

Yep. And it kind of reminds me of how Columbia University built their walkways on the campus. They just, you know, they’re like, yeah. We can get a planner to plan it all out, but why do that?

Let’s just have people go to classes. And over time, they very quickly picked out the fastest paths going from point a to point b depending on where they needed to route. And you just started to see those marks on the graph, and they said, let’s pave these pathways. Right?

And something another way of saying that another way of saying that so I’ll be giving a presentation in the first few months of this year related to data governance and saying we need to start over.

Yeah. We need to rethink data governance.

And one of the things that I’m recommending is that we try to make a pivot from rules based policies to exception based policies. And that’s basically what you just said, which is you can solve a problem by throwing rules at it and saying, here are the rules.

Right? Or you can let the people walk the class, and the rules will naturally evolve. Right? The paths will naturally evolve. They always do. They always do.

So to some degree, not always, there are compliance and audit and regulatory concerns related to data governance that we need to adhere to. That’s the minimum bar. But I would argue that to go farther than that, we need to be more exceptions driven than more rules driven. And I know that’s really kind of pithy and and and high level, but I think that’s part of the mindset shift that we need to make here.

We need to make it here. There are some, my friend Bob Snyder, would call that perhaps more of a noninvasive approach to data governance. I think that that has some relevancy here. But I would argue that a lot of the data governance is happening, you know, kind of naturally within organizations today, and it is.

Right? Those PDFs and the SharePoint servers that I talked about, they’re not completely ungoverned. There are controls about who can access them. There are controls about who can update a given marketing document.

So there are controls out there. It’s happening. The pathways are being built.

We just need to figure out who’s doing it. What are the rules used to do it today? Are those meeting the needs? And how do we need to change our approach to that? And I suspect more of an exception based approach would be the right one. And, honestly, that’s probably largely where we are outside the CDO organization today anyway.

Well, this is basically suggesting that let people build those applications first, then you wait and watch what practices or what recommendations emerge. Right? We don’t have to force a certain form of data governance or Data Quality yet. And if we try to, we are probably going to do more damage than good. It’s just much better to let people play with this stuff, build some applications, and show some value.

Manu: And like you said, right, maybe we’ll make a few mistakes along the way, but those exceptions will teach us more than trying to sit on a drawing board right now, first creating foundations and then hoping for the applications to lower value. Right? Yeah. Maybe that’s a big takeaway here, which is to let people build, and then we will see what they need as support around it.

Malcolm: Well, because we know again, we’ve got twenty years of data here. We know that when we take a rules driven, control driven approach where we are unable and unwilling to show the value and to quantify the value and prove the value of that approach. Right? If we’re saying you must do this, yet we don’t say, here’s the benefit you’re gonna get from it, and it’s been quantified, it’s been modeled out, it’s been shown, you’re gonna get the benefit of it. If we take that approach of all stick, no carrot, right, using an old using an old metaphor, people are just gonna go do what they’re gonna do anyway.

They’re gonna do what they’re gonna do anyway. We see this every day. And this is why most data governance programs are struggling because it is all rules. It is all stick, and it is no carrot.

Right? And we talk a big game about the value of governance or the importance of governance or how important this is to get the value out of data. But then when our customers ask us, oh, okay. Can you prove it?

Crickets.

So then our customers are right to say, well, I mean, I hear you, but at the same time, I’ve got an SLA to meet. I’ve got a product development deadline. I’ve got customers who I need to support, and I’m gonna go do it. Right? So they are building those pathways. They’re doing what they need to get done.

So I would say, hey. You know, what do we have to lose in changing the way we think about these problems? I don’t think we have a ton to lose. I think we’ve got a lot to gain.

Manu: I think that was a really good, deep dive into you had a lot of ideas that you touched upon in that post, and I think we’re able to get a good comprehensive view of what this is actually getting at and what we should be doing as a data community to to really make this a productive exercise as opposed to trying to discipline people, when they don’t actually want it. Right? So great conversation, Malcolm. Great to have you on the show. Thanks for talking to me, and, see you around.

Malcolm: Thank you so much. I appreciate it.

Manu: Bye now.