Python – Quantum Tunnel https://jrogel.com Random thoughts about data, machine learning and science

Advanced Data Science and Analytics — 2nd Ed has entered production https://jrogel.com/advanced-data-science-and-analytics-2nd-ed-has-entered-production/ Tue, 03 Mar 2026 10:31:37 +0000 https://jrogel.com/?p=26811 I’m very pleased to share that the second edition of “Advanced Data Science and Analytics with Python” has now entered production.

After months of writing, rewriting, refactoring code, updating examples, and expanding chapters, it’s now in the hands of the production team. This is where the invisible craft of publishing begins.

This edition is not a cosmetic refresh; it reflects how our field has evolved.

Writing a second edition is an interesting exercise: you revisit your past self, keep what still holds, and upgrade what no longer does. Data science, machine learning and indeed AI have changed; tooling has shifted; expectations have risen. The book needed to reflect that.

Now comes the quiet phase: production schedules, proofs and fine-tuning figures, before it makes its way into your hands.

Grateful to everyone who has supported the journey so far: readers, reviewers, and colleagues who keep asking the right questions.

From Tokens to Transformers https://jrogel.com/from-tokens-to-transformers/ Fri, 20 Feb 2026 16:06:24 +0000 https://jrogel.com/?p=26794 Rethinking NLP in the Second Edition of Advanced Data Science and Analytics with Python

When I first wrote Advanced Data Science and Analytics with Python, natural language processing (NLP) occupied a niche corner of the data science landscape. Back then, much of the focus in Python revolved around parsing and vectorising text: extracting tokens, counting frequencies, maybe applying a topic model or two. Fast forward a few years, and NLP has become one of the engines driving modern AI, powering everything from search and recommendation to summarisation and chat interfaces.

That shift is at the heart of Chapter 2 in the second edition, where “Speaking Naturally” has been thoroughly reimagined for today’s ecosystem. Instead of stopping at token counts and bag-of-words, this chapter bridges the gap between traditional text processing and the language-rich representations that underlie contemporary AI systems.

From Soup to Semantics

We start where most real text projects begin, with acquisition and cleaning. Python’s Beautiful Soup still plays a starring role for scraping structured text off the web, but the focus now goes beyond parsing tags to extracting meaningful content. Regular expressions, Unicode normalisation and tokenisation are introduced not as academic subjects but as practical tools you’ll reach for every time you ingest text.
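
As a rough illustration of that cleaning path, the sketch below strips tags, normalises Unicode and tokenises with a regular expression. It uses only the standard library, so `html.parser` stands in for Beautiful Soup here; the helper names (`TextExtractor`, `clean_text`, `tokenise`) and the sample HTML are made up for the example and are not taken from the book.

```python
import re
import unicodedata
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_text(html):
    """Strip tags, apply NFKC normalisation and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = unicodedata.normalize("NFKC", " ".join(parser.parts))
    return re.sub(r"\s+", " ", text).strip()


def tokenise(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+(?:['\u2019]\w+)*", text.lower())


# Hypothetical input: an accented word, a ligature and an inline script.
sample = "<p>Caf\u00e9s serve <b>\ufb01ne</b> coffee.</p><script>var x = 1;</script>"
tokens = tokenise(clean_text(sample))
print(tokens)
```

NFKC is one reasonable choice here because it folds compatibility characters such as the "fi" ligature into plain letters; a project that needs to preserve such distinctions might prefer NFC instead.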

Finding Structure in Language

Once you have clean text, the chapter furthers your intuition with topic modelling, an unsupervised way of surfacing latent themes across documents. These techniques remain valuable for exploration, summarisation and even automated labelling in the absence of annotated training data.

Encoding Meaning: Beyond Frequency Counts

The real leap comes with representation learning. Rather than relying on sparse counts, modern NLP encodes text as dense vectors that capture contextual meaning. Word embeddings — and their contextual successors — turn raw text into numbers that machine learning models can reason about. This edition makes that leap accessible, showing how to generate, visualise and use these representations in Python.

Semantic Search with Vector Engines

Building on embeddings, we explore vector similarity search — the backbone of semantic retrieval. Using tools like FAISS, you’ll learn how to retrieve text not based on matching keywords but on meaning, opening the door to advanced search, clustering and recommendation applications.
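
To make the idea concrete, here is a toy nearest-neighbour search over dense vectors using cosine similarity. It is a brute-force stand-in for what an engine such as FAISS does at scale, not FAISS itself; the three-dimensional vectors and document labels are invented for the example, whereas real embeddings would come from a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query_vec, index, top_k=2):
    """Return the top_k (label, score) pairs ranked by cosine similarity."""
    scored = [(label, cosine(query_vec, vec)) for label, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical embeddings: made-up numbers for illustration only.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "returns and exchanges": [0.8, 0.3, 0.1],
    "office opening hours": [0.0, 0.2, 0.9],
}

query = [0.85, 0.2, 0.05]  # pretend embedding of "how do I get my money back?"
results = search(query, index)
print(results)
```

Note that the query shares no keywords with the two documents it retrieves; the match comes purely from the vectors pointing in a similar direction.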

The NLP landscape has moved faster than almost any other area of AI. Transformers, contextual language models and embedding systems have shifted what’s possible — and what’s practical — for practitioners. This chapter is carefully redesigned to reflect that evolution, giving you the grounding you need to work with text data that isn’t just cleaned and counted, but understood.

More soon. Stay tuned.

Forecasting the Future: Time Series, Prophets, and Cross-Validation https://jrogel.com/forecasting-the-future-time-series-prophets-and-cross-validation/ Mon, 19 Jan 2026 14:30:00 +0000 https://jrogel.com/?p=26777 When I wrote about the Jackalope’s return and the second edition of Advanced Data Science and Analytics with Python, I hinted that this wasn’t just a light refresh. It’s a proper evolution. New chapters, new tools, and, perhaps most importantly, a stronger emphasis on how we trust the models we build.

One of the chapters I’ve been spending time with recently dives head-first into forecasting. Not the hand-wavy, crystal-ball-gazing sort (sadly no actual precogs were harmed in the process), but practical, defensible forecasting that you can deploy without fear of your future self cursing your name.

Enter the Prophet

Yes, that Prophet.

Facebook’s (now Meta’s) Prophet framework gets its own dedicated treatment. Not because it’s fashionable, but because it occupies a genuinely interesting space: expressive enough to handle real-world seasonality, trends, and holiday effects, yet accessible enough that you don’t need to disappear into a cave with nothing but state-space equations and a beard.

The chapter walks through:

  • How Prophet decomposes time series into trend, seasonality, and effects you can actually explain to stakeholders
  • When it works beautifully; and when it really, really doesn’t
  • Why it’s often a strong baseline, even if you later graduate to more exotic architectures
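
The additive idea behind that decomposition can be sketched without Prophet at all. The toy below builds a daily series as a linear trend plus a weekly pattern, then recovers both with a centred moving average. This is classical decomposition on synthetic data, not Prophet's fitting procedure, and every number in it is invented for the example.

```python
PERIOD = 7  # weekly seasonality on daily data
weekly = [3.0, 1.0, 0.0, -1.0, -2.0, -2.0, 1.0]  # effects sum to zero

# Synthetic series: linear trend plus a repeating weekly effect.
n_days = 70
series = [10.0 + 0.5 * t + weekly[t % PERIOD] for t in range(n_days)]

# A centred moving average over one full period estimates the trend,
# because the weekly effects cancel out inside each window.
half = PERIOD // 2
trend = {
    t: sum(series[t - half : t + half + 1]) / PERIOD
    for t in range(half, n_days - half)
}

# Averaging the detrended values by weekday estimates the seasonal effect.
buckets = [[] for _ in range(PERIOD)]
for t, g in trend.items():
    buckets[t % PERIOD].append(series[t] - g)
seasonal = [sum(b) / len(b) for b in buckets]

print([round(s, 2) for s in seasonal])
```

Each recovered component has a plain-language reading ("sales grow by about half a unit a day", "the first weekday runs three units above trend"), which is the kind of explanation stakeholders tend to ask for.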

Think of Prophet as the Millennium Falcon of forecasting: not the newest ship in the galaxy, occasionally held together with duct tape, but astonishingly reliable in the right hands.

The Bit Everyone Skips (and Shouldn’t)

Forecasting models are easy to build. Evaluating them properly is where things usually fall apart. So this chapter leans hard into time series cross-validation and forecast evaluation. No random shuffling. No accidental peeking into the future. No Schrödinger’s test set.

We cover:

  • Rolling and expanding windows (and why they matter)
  • Forecast horizons and why “one-step ahead” tells only half the story
  • Metrics that actually align with decision-making, not just leaderboard vanity
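
A minimal sketch of those splitting schemes follows, assuming a series indexed by position. The function `time_series_splits`, its parameters, the toy series and the naive "repeat the last value" forecaster are all hypothetical choices for illustration, not the book's code.

```python
def time_series_splits(n_obs, initial, horizon, step, expanding=True):
    """Yield (train_indices, test_indices) pairs in chronological order.

    expanding=True grows the training window from the start of the series;
    expanding=False slides a fixed-size window of length `initial`.
    """
    cutoff = initial
    while cutoff + horizon <= n_obs:
        start = 0 if expanding else cutoff - initial
        yield list(range(start, cutoff)), list(range(cutoff, cutoff + horizon))
        cutoff += step


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Toy series, invented for the example.
series = [12, 14, 13, 15, 17, 16, 18, 20, 19, 21, 23, 22]

fold_errors = []
for train_idx, test_idx in time_series_splits(len(series), initial=6, horizon=3, step=3):
    last_seen = series[train_idx[-1]]          # naive forecast: repeat last value
    forecast = [last_seen] * len(test_idx)
    actual = [series[i] for i in test_idx]
    fold_errors.append(mean_absolute_error(actual, forecast))

print(fold_errors)
```

Every test window starts strictly after its training window ends, so nothing from the future leaks into fitting. Scoring all three steps of the horizon, rather than just the first, is what exposes how quickly a forecast degrades.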

If you’ve ever had a model that looked flawless in development and then collapsed in production like a soufflé near a subwoofer, this section is for you.

In applied data science, forecasting sits at an awkward crossroads. It’s everywhere — demand planning, operations, finance, healthcare, energy — and yet it’s often treated as a dark art or an afterthought.

This chapter is about demystifying that space. About treating time seriously (literally), respecting causality, and building forecasts you can defend in a meeting without resorting to interpretive dance or “the model felt confident”.

This is just one chapter. Over the coming weeks, I’ll be writing about other additions and revisions in the second edition — from modern modelling techniques to deployment considerations, and a few opinionated takes on where data science education often goes wrong.

If this chapter is about seeing the future, the rest of the book is about making sure you survive it — preferably with clean code, reproducible results, and fewer existential crises.

More soon. 🛸📈

Advanced Data Science and Analytics – 2nd Edition in the making https://jrogel.com/advanced-data-science-and-analytics-2nd-edition-in-the-making/ Tue, 06 May 2025 07:21:16 +0000 https://jrogel.com/?p=26653 Well, it is happening! At the beginning of the year I announced in this blog that the revised second edition of my book “Data Science and Analytics with Python” had been delivered to my publisher and it was off to print. It will be available for pre-order from May 7th. Check it out here.

At the time, I was also in discussions with my publisher about revising the companion volume “Advanced Data Science and Analytics with Python” with the idea not only to keep the two volumes updated in parallel, but more importantly to provide context to a lot of the recent growth in the area of large language models and generative AI. As such, back in March I announced that the second edition to the second volume was approved.

I have now been slowly but surely working on the revisions and, boy! there is a lot to do. I will keep you posted with the revisions.

Building An AI Policy Advisor https://jrogel.com/building-an-ai-policy-advisor/ Fri, 18 Apr 2025 18:47:29 +0000 https://jrogel.com/?p=26639 Building a Smart Compliance Assistant with Gemini and Semantic Search

Compliance documentation is often the intellectual equivalent of a root canal. Necessary? Absolutely. Enjoyable? Let’s just say there’s room for improvement. If you’ve ever had to comb through GDPR policies, data protection agreements, or internal risk documents, you’ll know that extracting meaningful insights quickly is like panning for gold in a sea of jargon.

I have recently completed the 5-day GenAI course with Google and Kaggle, and my capstone project for the course is the AI Policy Advisor: a nimble, document-savvy assistant powered by Google’s Gemini API, semantic embeddings, and a pinch of automation magic.

This blog post outlines how we built a smart assistant capable of analysing compliance policies, surfacing risks, and offering structured summaries. The goal? Free up legal and compliance teams from soul-crushing document review and let AI shoulder the burden. We’ll explore the motivation behind the project, walk through the technical components, and show you how it all fits together in a fast, intelligent pipeline.


The Problem: Compliance Overload

Regulatory and compliance work is text-heavy and time-consuming. GDPR alone has over 88 pages of tightly written regulations. Multiply that by various industry-specific frameworks (HIPAA, ISO 27001, FCA regulations), and you have a documentation nightmare. Often, policy teams need to:

  • Identify clauses related to a specific regulation (e.g., GDPR’s Right to Erasure)
  • Summarise implications for the business
  • Highlight potential risk areas

This is not only laborious but also prone to oversight. In a world increasingly driven by data, the cost of missing a compliance detail can be astronomical.

So I asked: what if we could offload that initial document triage to a smart AI assistant?


The Ingredients: Gemini, Embeddings, and a Smart Pipeline

To build our assistant, we wanted three key ingredients:

  1. Semantic understanding of document content
  2. Structured, human-like responses
  3. A pipeline to orchestrate retrieval, reasoning, and reporting

Let’s break it down.

Step 1: Loading and Preprocessing Documents

We start with a folder full of compliance policies and load them into memory. Each document is treated as a text blob:

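import os

import pandas as pd

def load_documents(folder_path="./corpus"):
    """Read every .txt file in folder_path into a DataFrame, one row per file."""
    docs = []
    for fname in os.listdir(folder_path):
        if fname.endswith(".txt"):
            with open(os.path.join(folder_path, fname), "r", encoding="utf-8") as f:
                docs.append({"filename": fname, "content": f.read()})
    # Naming the columns keeps the frame usable even if the folder is empty.
    return pd.DataFrame(docs, columns=["filename", "content"])

corpus_df = load_documents()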

Each file becomes a row in a DataFrame, complete with filename and content.

Step 2: Semantic Search with Sentence Embeddings

We use the excellent sentence-transformers library to convert each document into a high-dimensional vector using the all-MiniLM-L6-v2 model. This lets us compare documents based on meaning, not keywords.

from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_df["embedding"] = corpus_df["content"].apply(lambda x: embed_model.encode(x))

For a given query like “What are the GDPR risks?”, we compute its embedding and retrieve the top-matching documents:

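from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query, top_k=3):
    # Embed the query as a single-row matrix; cosine_similarity expects 2-D input.
    q_emb = embed_model.encode(query).reshape(1, -1)
    similarities = corpus_df["embedding"].apply(lambda emb: cosine_similarity(q_emb, emb.reshape(1, -1))[0][0])
    # nlargest returns index labels, so select rows with .loc rather than .iloc.
    top_matches = corpus_df.loc[similarities.nlargest(top_k).index]
    return top_matches[["filename", "content"]]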

Step 3: Prompt Engineering and Gemini

Now the fun part: crafting a prompt for Gemini. We tell the model to behave like a compliance assistant and ask it to structure the output.

def prompt_summary(document_text):
    return f"""
    You are a compliance assistant.
    Summarise the following document and identify any potential risks related to GDPR:

    {document_text}

    Provide structured output in the format:
    Summary:
    Risks:
    Recommended Actions:
    """

We pass this to Gemini using Google’s Generative AI client:

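import os

import google.generativeai as genai

# The API key is read from the environment rather than hard-coded.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-pro")

def gemini_response(prompt):
    # Send the prompt to Gemini and return the plain-text reply.
    return model.generate_content(prompt).text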

Gemini returns rich, human-readable output that reads like a professional wrote it (because in a way, one did).

Step 4: The Agent Pipeline

We bundle it all into a single pipeline function:

def agent_pipeline(user_query):
    relevant_docs = semantic_search(user_query)
    combined_text = "\n---\n".join(relevant_docs["content"].tolist())
    prompt = prompt_summary(combined_text)
    return gemini_response(prompt)

And voilà. With one line, you can extract insights from a haystack of compliance docs:

response = agent_pipeline("What risks are covered in the GDPR policy?")
print(response)

Why This Works

This capstone project works because it leverages several key principles of modern AI:

  • Vector-based retrieval gives us relevance beyond keywords.
  • Prompt-based generation allows for flexibility and expressiveness.
  • Structured prompting ensures consistency and downstream usability.
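
To show what downstream usability can look like, the sketch below splits a reply into the three sections the prompt asks for. The function `parse_sections` and the sample reply are hypothetical and hand-written for this example; a real model response may deviate from the requested layout, so the parser simply returns whichever headings it finds.

```python
import re

SECTION_NAMES = ("Summary", "Risks", "Recommended Actions")


def parse_sections(response_text):
    """Split a response into the headed sections requested by the prompt."""
    pattern = r"^\s*(" + "|".join(map(re.escape, SECTION_NAMES)) + r")\s*:"
    matches = list(re.finditer(pattern, response_text, flags=re.MULTILINE))
    sections = {}
    # Each section runs from the end of its heading to the start of the next.
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(response_text)
        sections[current.group(1)] = response_text[current.end():end].strip()
    return sections


# Hypothetical model output, written by hand for the example.
example = """Summary: The policy covers retention of customer data.
Risks:
- No erasure procedure is documented.
Recommended Actions: Define and test an erasure workflow.
"""

parsed = parse_sections(example)
print(parsed["Risks"])
```

Once the reply is a dictionary, each part can be routed separately, for instance logging the risks to a tracker while sending the summary to a report. Headings wrapped in extra markup (bold markers, say) would not match this pattern, so a production version would likely need a more forgiving parser.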

The assistant doesn’t just summarise; it interprets. It spots risks, outlines actions, and adapts to your question.

Imagine asking: “What should we do next based on these risks?” and the assistant scheduling a task or drafting an email. We’re almost there.


Final Remarks

AI isn’t here to replace compliance officers, legal counsel, or policy analysts. But it can absolutely take the grunt work off their plates. With a well-crafted prompt, semantic search, and a generative engine like Gemini, we can build assistants that read, reason, and report at speed.

The AI Policy Advisor capstone project is more than a demo. It’s a proof of concept for how AI can become a colleague—a second set of eyes on dense documentation, a proactive analyst, and a structured summariser that works at the speed of thought.

So the next time you’re staring down a 30-page data policy, maybe let the assistant take the first pass. You’ve earned it.

Here is a video to accompany this post:

“Advanced Data Science and Analytics with Python” – 2nd Ed. Announced https://jrogel.com/advanced-data-science-and-analytics-with-python-2nd-ed-announced/ Fri, 28 Mar 2025 18:35:17 +0000 https://jrogel.com/advanced-data-science-and-analytics-with-python-2nd-ed-announced/ I’d like to let you know that I’ve officially started work on the 2nd edition of my book “Advanced Data Science and Analytics with Python”!

This updated edition will dive deeper into some of the most exciting developments in our field, particularly Large Language Models (LLMs) and Generative AI, alongside advanced principles of analytics, machine learning, and Python-based workflows.

If you’re curious about what the first edition covered, you can check it out here. You can get more info about this and my other books here.

Looking forward to sharing more as the update progresses—suggestions and ideas are always welcome!

The 2nd Edition of Data Science and Analytics with Python is Off to Print https://jrogel.com/the-2nd-edition-of-data-science-and-analytics-with-python-is-off-to-print/ Wed, 29 Jan 2025 18:57:36 +0000 https://jrogel.com/?p=26589 It’s Done! The 2nd Edition of Data Science and Analytics with Python is Off to Print! 📚🐍💻

After months of reviewing, refining, and updating, I’m thrilled to share that the final version of the 2nd Edition of Data Science and Analytics with Python has been officially submitted! 🎉

This edition is packed with fresh insights, updated techniques, and even more hands-on examples to help data enthusiasts, analysts, and engineers make the most of Python’s data ecosystem. From foundational concepts to advanced analytics, I’ve worked hard to ensure this book remains a go-to resource for anyone navigating the world of data science.

Next stop? The printing press! 🚀 Looking forward to seeing it in readers’ hands soon. Stay tuned for release details!

The 2nd Edition of “Data Science and Analytics with Python” is Almost Here! https://jrogel.com/the-2nd-edition-of-data-science-and-analytics-with-python-is-almost-here/ Mon, 27 Jan 2025 17:14:59 +0000 https://jrogel.com/?p=26584

I’m excited to share a major milestone in my journey as an author and educator—I’ve officially completed proofreading the 2nd Edition of Data Science and Analytics with Python!

This updated edition has been a labour of love, carefully revised to incorporate the latest advancements in Python and data science techniques. From new libraries and tools to modern best practices, I’ve worked to ensure this book serves as a comprehensive resource for anyone looking to harness the power of Python in their data-driven journey.

What’s Next?

While proofreading is now complete, I’m taking a final end-to-end pass to ensure the content flows seamlessly and meets the high standards I’ve set for this book. Once this final revision is done, the manuscript will be submitted to the publisher for production.

When Can You Expect It?

The 2nd Edition is expected to hit the shelves in April 2025. Whether you’re a student, a professional, or simply a data enthusiast, I believe this edition will be an invaluable guide in mastering Python for data science and analytics.

Stay tuned for updates as we get closer to the release date!

Let’s continue exploring, learning, and pushing the boundaries of what’s possible with data science and Python.

Data Science and Analytics with Python – 2ed Proofs https://jrogel.com/data-science-and-analytics-with-python-2ed-proofs/ Tue, 07 Jan 2025 14:26:51 +0000 https://jrogel.com/?p=26571

The Journey Continues: Reviewing Proofs for the 2nd Edition of My Book “Data Science and Analytics with Python”

January is shaping up to be an exciting and pivotal month in my writing journey as I dive into the proofs for the 2nd edition of Data Science and Analytics with Python. It’s a moment of both pride and meticulous focus—seeing months of effort take the final shape before reaching readers later this year.

Why a 2nd Edition?

The world of data science is evolving at an astonishing pace. Since the first edition was published, we’ve seen new tools, libraries, and techniques redefine what’s possible. The 2nd edition captures these advancements, with updates ranging from expanded coverage of machine learning techniques to new discussions of topics such as Generative AI. Whether you’re just starting your data science journey or looking to refine your skills, this edition is designed to resonate with today’s data-driven professionals.

The Proofing Process

Reviewing proofs is both rewarding and nerve-wracking. It’s the final opportunity to fine-tune details, spot errors, and ensure clarity in every explanation and example. As someone who values precision in technical writing, I approach this phase with a blend of excitement and diligence. By the end of the month, the proofs should be finalised, marking another significant milestone on the path to publication.

A Glimpse Ahead: Advanced Data Science

Once the proofs for Data Science and Analytics with Python are complete, my focus will shift to the companion volume, Advanced Data Science and Analytics with Python. The 2nd edition of this book promises to take readers even deeper, exploring cutting-edge techniques and real-world applications. I’m particularly excited about updating sections on deep learning, reinforcement learning, and scaling data science pipelines—topics that are more relevant than ever.

What’s Next?

Both books represent my commitment to making data science accessible, practical, and inspiring. They’ve always been about more than just the code—they’re about fostering a mindset of exploration, creativity, and problem-solving.

As I work through the next stages, I look forward to sharing updates, sneak peeks, and reflections. Whether you’re an existing reader or considering delving into these books for the first time, I hope they’ll equip you to navigate the dynamic landscape of data science with confidence and curiosity.

Stay tuned for more updates, and thank you for being part of this journey!

Data Science and Analytics with Python — 2nd Edition — ready! https://jrogel.com/data-science-and-analytics-with-python-2nd-edition-ready/ Tue, 16 Jul 2024 22:00:34 +0000 https://jrogel.com/?p=26485

Exciting News!

I am thrilled to announce that the Second Edition of my book, “Data Science and Analytics with Python”, is now complete!

Seven years ago, when the first edition was published, Artificial Intelligence (AI) and Machine Learning (ML) were just starting to gain traction. Since then, we’ve witnessed an incredible explosion in interest and development in these fields. The book’s reception has been nothing short of amazing, with both business practitioners and universities adopting it as a key resource.

In this new edition, I’ve built on that success by:

  • Expanding several sections to provide deeper insights and more comprehensive coverage.
  • Updating all code examples to reflect the latest advancements in Python libraries and modules.
  • Introducing new sections on cutting-edge topics like Generative AI.
  • Addressing critical issues of explainability, transparency, and fairness in AI.

This edition remains a practical guide for data analysts and budding data scientists, leveraging Python’s powerful ecosystem and the flexibility of Jupyter Notebook.

Stay tuned for more updates and details on the release date! Next, I will be reviewing the companion volume “Advanced Data Science and Analytics with Python”. If you have any comments or suggestions for that, please get in touch.

You can get more info on my books here.
