Python – Coding | Software is hard (https://blog.brakmic.com)

Structured Output with LangChain and Llamafile
https://blog.brakmic.com/structured-output-with-langchain-and-llamafile/
Sun, 22 Jun 2025 16:50:12 +0000

This article shows how to teach Llamafile to handle structured outputs like JSON. If you’re already familiar with LangChain, you’ll know that models from popular providers like OpenAI ship their own implementation of with_structured_output. Using it is straightforward.

All we need is to derive a new class from Pydantic’s BaseModel. The rest happens transparently. You don’t need to teach the LLM anything.

Using Llamafile

This isn’t currently possible with Llamafile, which I’m using in my local environment. In case you don’t know what Llamafile is, here’s a quick rundown before we dive into structured outputs. A Llamafile is an executable LLM that you can run locally. Technically it’s a combination of llama.cpp with the Cosmopolitan Libc, which makes it executable on a wide range of architectures. By default it starts a local LLM instance that you can access in your browser at http://localhost:8080.

I’ll be using Llama-3.2-1B-Instruct-Q8_0.llamafile because I’m writing this on a relatively weak machine (an old MacBook Air from 2018 with 8 GB of RAM). After downloading, just make the file executable with chmod +x. Windows users can also run Llamafiles but need to take care when trying to run files over 4 GB. To run Llamafile as a server, use the --server --nobrowser --unsecure flags.

Now you can test the server by opening http://localhost:8080 in your browser; you should see the Llamafile web UI.

Structured Outputs with Llamafile

Now we can teach Llamafile to produce structured output. Since it lacks a with_structured_output method, we import JsonOutputParser and PromptTemplate from LangChain’s libraries.

First we define an Answer class that represents the JSON output we expect the LLM to return. To make the example more realistic, I’ve added a few extra properties to the Answer class.

Next we provide our new answer type to LangChain’s JsonOutputParser.

To complete the setup we define a PromptTemplate, injecting the parser’s format instructions into it. The final step is chaining the three Runnable implementations: prompt, llm, and parser.

The rest of the code is straightforward:

  • Invoke the chain
  • Print out the answer (either as raw JSON or by using the utility function display_answer)
  • In error cases call prompt and llm only while ignoring the parser

Depending on the LLM you use, the output will vary, but it should come back as JSON matching the Answer schema.

You can find the sources in this repository.

Have fun with Llamafile 🙂

Writing a Keycloak-PKCE Library in C++
https://blog.brakmic.com/writing-a-keycloak-pkce-library-in-c/
Fri, 28 Feb 2025 18:48:58 +0000

In this article, we will be talking about a C++ library for using PKCE with Keycloak. Although I don’t consider myself a great C++ developer, I have worked with the language every now and then. And because I do a lot of work with Keycloak (check my other articles on that), I thought it would be a useful learning exercise to implement from scratch a library that supports PKCE (Proof Key for Code Exchange) for Keycloak.

Later, during the development of this library, I concluded that it would be a waste of a good API if I did not make it available via C. So I wrote a small wrapper that offers a stable ABI to C99-compatible compilers. And because not everyone wants to deal with C++ or C directly, I wrote two additional wrappers for Python and Lua. Both of them use the C API of this library, and I hope they’ll be helpful to others.

I myself don’t intend to use this library for any production purposes, as I wrote it solely to learn more about PKCE and how everything works with Keycloak. But who knows, maybe it could one day power secure communication. In fact, I strongly recommend using PKCE for every client that is incapable of securing its own credentials: mobile and web apps, many desktop applications, or basically anything that runs JavaScript.

But before we dive in, here is the link to the repository. I am still extending the documentation and the test suite, so don’t expect everything to be there for direct consumption. However, the provided demos and example code are all working.

Writing a library in C++, really?

Yes, writing stuff in C++ is anything but pleasant, and I never quite liked the language. However, C++ is a sufficiently low-level language that still offers useful structures like classes, interfaces, inheritance, polymorphism, and other good object-oriented features. Additionally, writing code in C++ requires a much higher “alertness” than, say, TypeScript or C#. C++ code can also later be exposed through highly portable APIs, as I did with the C API for this library. Even the most complex C++ code can be wrapped into an easy-to-reuse C API. Try this with any other programming language. Modern variants of C++ also offer loads of new features that make dealing with structures, algorithms, strings, ranges, etc., much easier than in older versions of the language. All in all, it’s not that bad, and I’d rather call myself a subpar programmer than C++ a bad programming language. Coding in C++ is like going to the gym: you might not like the toil, the salty sweat in your mouth, and the thick atmosphere, but you’ll surely love seeing your body shape improve.

What is PKCE and why do we need it?

Proof Key for Code Exchange is a security extension of the OAuth 2.0 protocol for secure client-side authentication. Those of you who develop client applications like web and mobile clients know the inherent problem these solutions face: they very often cannot securely manage their credentials and other sensitive data. Storing passwords or tokens on the client side is, sooner or later, a recipe for disaster. Protocols like PKCE were invented to prevent interception attacks. Instead of requiring the client to store a password or token locally, PKCE has the client create a verifier (a random string) and a code challenge (a hash of the verifier). The challenge is sent with the initial authorization request, and the server later checks that whoever redeems the authorization code can present the matching verifier.
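The essential mechanics fit in a few lines of Python (a minimal illustration of RFC 7636’s S256 method, not code from the library):

```python
import base64
import hashlib
import os

def make_pkce_pair() -> tuple[str, str]:
    """Create a code_verifier and its S256 code_challenge (RFC 7636)."""
    # 32 random bytes, base64url-encoded without padding -> 43-char verifier.
    verifier = base64.urlsafe_b64encode(os.urandom(32)).rstrip(b"=").decode("ascii")
    # The challenge is the base64url-encoded SHA-256 hash of the verifier.
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return verifier, challenge

verifier, challenge = make_pkce_pair()
```

The client sends the challenge with the authorization request and reveals the verifier only when redeeming the authorization code; the server recomputes the hash and compares.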

The advantages are:

  • Intercepted authorization codes are useless without the verifier
  • No shared secrets are needed
  • Man-in-the-middle attacks are significantly mitigated
  • Replay attacks are significantly mitigated as well

If you are interested in learning how to use PKCE in a practical example with Angular and Keycloak, you can read this article. A more detailed version of the ASCII diagram can be found here.

C++ Library Structure

The C++ library is structured as follows:

  • The project uses CMake for building.
  • The C++ library sources are located in the “lib” subfolder.
  • The “examples” folder contains the C++ demo code that showcases the usage of the library.
  • Other languages and their examples are located in their respective subfolders inside “wrappers”.
  • Several external projects are used as git submodules. They are located in the “external” subfolder.
  • “scripts” contains bash scripts I use to quickly create certificates, set up the C environment, run Lua scripts, and other goodies. Feel free to adapt them to your needs.
  • Under “keycloak” are docker-compose YAML files, a test realm JSON, and the nginx configuration files. This docker deployment uses Postgres as its database and (optionally) Nginx as a proxy server.
  • The “config” folder contains the default library_config JSON, the civetweb config (used by the C demo client), and the default app_config JSON used by the C++ demo client.
  • Inside the “docs” folder are markdown files that describe various parts of the library.
  • Tests are located in the “test” subfolder and currently only contain unit tests. I am still working on integration and e2e tests.
  • Under “certs” are self-signed certificates and CAs. Every language has its own pair of keys/certs.

Core Interfaces

The most important interface is ITokenService, which the KeycloakClient class implements.

It declares three methods:

The first and last of these are just helper methods that return information from the library configuration that gets read upon library instantiation. The real engine of the whole process is the exchange_code(…) method that uses the implementation of the IAuthenticationStrategy interface to communicate with Keycloak.

The methods in this interface are self-describing. The method create_authorization_url generates a new URL the client must visit to authenticate. Usually, a default Keycloak Login Form will be presented, or a themed version of it. This is what the C demo client presents.

The C client generates a URL that, via a GET request, sends certain information to the Keycloak server. Among them are the expected response_type (code), the redirect_uri (which Keycloak will use on success to hand over tokens), the code_challenge from the client, and the scopes the client expects to receive. The first step is to enter your own credentials in Keycloak’s Login Form.
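A rough Python equivalent of that URL construction (the parameter names follow the OAuth 2.0/PKCE specs; the library’s actual create_authorization_url signature may differ):

```python
import secrets
from urllib.parse import urlencode

def create_authorization_url(
    base_url: str,
    client_id: str,
    redirect_uri: str,
    code_challenge: str,
    scope: str = "openid profile",
) -> tuple[str, str]:
    """Build the Keycloak authorization URL and the CSRF state value."""
    state = secrets.token_urlsafe(16)
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "code_challenge": code_challenge,
        "code_challenge_method": "S256",
        "scope": scope,
        "state": state,
    }
    return f"{base_url}?{urlencode(params)}", state
```

For the “TestRealm” setup from this article, base_url would be Keycloak’s standard OIDC endpoint, e.g. https://keycloak.local.com/realms/TestRealm/protocol/openid-connect/auth.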

I have entered “test-user” and “password”, which you can use in the docker-compose deployment of Keycloak that comes with this library. At boot, the deployment imports a realm called “TestRealm” with this user and the OIDC Public Client named “test-client”. You must have an OpenID Connect Public Client configured to be able to use PKCE with the library.

After the successful authentication, Keycloak will call the given redirect_uri to complete the PKCE auth flow. For this to succeed, the demo application must run an embedded server that handles certain routes. This, of course, has nothing to do with PKCE itself but rather with the usability of the various demos provided. To avoid the need for a separate web server, I embedded servers like Crow, CivetWeb, lua-http, and OpenResty. As I am not an expert in any of these servers, I do not claim that the code I wrote is in any way highly performant or convention-compliant for the respective programming languages. But the demo code does work, as we can see in the final result of the C demo application, presented after the successful token exchange: a list of claims from the JWT token.

In the example above, I am using locally available DNS entries like “pkce-client.local.com” and “keycloak.local.com“. The provided self-signed certificates use the same DNS names too. Therefore, if your local setup differs or you are trying it in a different setting, ensure correct DNS names, routes, and certificates. Also, keep in mind that if you are using DevContainers as I do in VSCode, name resolution can become problematic because Docker uses its own DNS.

But now let’s continue talking about the other methods from the IAuthenticationStrategy interface, shall we? The handle_callback method is responsible for verifying the code received from Keycloak. For this check to succeed, PKCE’s StateStore must find the previously created state entry in its internal cache.

If you are now asking yourself where this state object comes from, just go up and check the original authentication URL. Do you see the “state” parameter among others? The state (a random value) gets created and inserted into the StateStore (basically, a small cache) before the authentication starts. Keycloak receives it so that it can later send it back together with the code. Then, of course, the PKCE StateStore checks if the returned state object is still there (the cache deletes older entries automatically), and in a successful case, the code exchange will be allowed to continue. And just to be complete: state objects are used to prevent CSRF attacks.

Ultimately, the KeycloakClient, which implements the ITokenService interface, will be allowed to execute the code exchange with Keycloak. These two interfaces, ITokenService (KeycloakClient) and IAuthenticationStrategy (PKCEStrategy), drive the PKCE auth flow.
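The StateStore idea can be approximated in a few lines of Python (the real implementation is C++ and more elaborate; this sketch only makes the mechanism concrete):

```python
import time

class StateStore:
    """Tiny TTL cache for PKCE state values (CSRF protection)."""

    def __init__(self, ttl_seconds: float = 300.0):
        self._ttl = ttl_seconds
        self._entries: dict[str, float] = {}  # state -> expiry timestamp

    def insert(self, state: str) -> None:
        self._entries[state] = time.monotonic() + self._ttl

    def consume(self, state: str) -> bool:
        """Return True once per valid, unexpired state, then forget it."""
        expires = self._entries.pop(state, None)
        return expires is not None and expires > time.monotonic()
```

A state can be consumed exactly once; a replayed callback with the same state value is rejected.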

Other Interfaces

There are of course other important implementations like the HttpClient, which relies on ASIO for SSL communication, but these are completely decoupled from the PKCE-relevant code and can be easily replaced. I have not invested that much time in implementing proper HTTP/SSL handling, as I was mostly focused on the PKCE side of things. I actually planned to write a strategy interface for HTTP communication so that both HTTP and HTTPS communication, with ASIO or any other networking library, would be possible, but I later decided to postpone it. I was mostly focused on PKCE. However, I will need to change HttpClient later, because without a proper strategy interface, everything boils down to ASIO and SSL, which makes changing code and especially testing it very hard (perhaps impossible?). HttpClient will be the first thing I throw out and reimplement.

Another useful piece of code is the Logger class that internally uses spdlog. However, I am still not completely satisfied, as I don’t want to dictate any kind of logging facility in the library. Maybe this should also be replaced with a stable interface or some kind of factory that creates logging facilities (if needed) for users who are actually interested in logging. But on the other hand, who wants to use std::cout or fmt::print for logging manually?

The other two very important but more or less easily replaceable mechanisms are the configuration classes. I separate them into two groups: library configuration and application configuration. The library configuration should be treated as fixed and modified only when the library’s internals change. But in most cases, I think, you will only need to adapt your own application configuration. The application configuration can be of any format—or it doesn’t even need to exist at all (though I don’t recommend omitting it). The library configuration is always a JSON file named “library_config.json”. It contains the defaults that describe the Keycloak server, the PKCE behavior, and the cookie settings. All these things are “external” to the client in use and therefore should never be dictated by the client.

The application configuration either contains settings that govern the application behavior (like embedded web server settings) or, in some cases, proxy settings that the PKCE library uses to update its internals and establish communication channels with Keycloak. For example, when the web server that processes the callback routes used by Keycloak to hand over tokens sits in a network not directly reachable by Keycloak, a proxy server allows such calls to be forwarded.

In any case, the application configuration must contain the redirect_uri setting, which the library uses to insert into the initial authentication URL leading to the Keycloak Login Form.

Because there are many different applications and use cases, there is actually no “correct” or “default” configuration possible. The only mandatory entry is the redirect_uri, but if your application only needs to set this value, then the usage of a configuration file itself might be questionable. Although the library configuration is by default in JSON format, there is no obligation to follow this rule. You can easily replace the internal mechanism of the ConfigLoader that instantiates the library configuration.

C API

After completing the basic design of the library, I asked myself if it should remain only with the C++ API (and ABI) for future clients. This would make the potential number of clients much lower and the implementation unnecessarily complex. Therefore, I decided to embark on another journey: providing a C API for the original C++ interfaces. It’s been a while (years?) since I wrote any meaningful amount of C code, so I am a bit out of touch with the latest developments in this area. I used C99 because it’s widely supported, and I also didn’t need anything more fancy than <stdbool.h>. I only tested the API under Ubuntu 22.04 and 24.04, so I cannot say how it will behave under other operating systems. I have a Windows box here but sadly no time to check everything. It’s still an experiment, I guess. The API is accompanied by a demo client that you can use to test and maybe also fix the API. The implementation of all wrapper code is located in the internal folder inside the API. The main implementation is in the kc_pkce.cpp file.

And after I wrote the C API, I thought it would be nice if we could test it in a language that is easier to use than C or C++.

Python Client

One of the two implemented users of the C API is the Python client, which uses the FFI declarations from keycloak_pkce.py. The client is fairly simple as it uses the default Python HTTP server, so there is no need to include another piece of software like I had to do with the C and C++ clients. We have everything inside the main() function, so running this demo is straightforward. The only thing you must configure is the path to the C API library. If you are using the CMake build system coming with the library, the binary will be located in build/c/lib/libkc_pkce.so. Set the environment variable KC_PKCE_LIB to point to this file. Then you can run the Python script as usual:
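The essential step is pointing Python’s ctypes at that shared library; the actual function declarations live in keycloak_pkce.py. A minimal sketch, with no real symbols assumed:

```python
import ctypes
import os

# KC_PKCE_LIB must point to the compiled C API shared object.
lib_path = os.environ.get("KC_PKCE_LIB", "build/c/lib/libkc_pkce.so")

if os.path.exists(lib_path):
    kc_pkce = ctypes.CDLL(lib_path)
    # The exported functions and their argtypes/restypes are declared
    # in keycloak_pkce.py; consult that file for the real FFI surface.
```

With the variable set, e.g. KC_PKCE_LIB=build/c/lib/libkc_pkce.so, the demo script runs like any other Python program.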

 

Lua Client

I have never written any serious code with Lua but always wanted to try something “realistic” with it. What better reason than writing an API wrapper, right? But what I didn’t know was that in the Lua world, there are some really amazing things to be found. Tools like OpenResty allow you to write Lua code directly in the nginx.conf and benefit from its performance. This, of course, brought the advantage of not having to search for a web server that could be embedded into the client code. Here, the web server is the client. In Lua, just like with Python, we rely on FFI (foreign function interface) to bridge between C and Lua/Python. Only the execution of this particular code is a bit different because we need to spin up the nginx web server to actually run Lua code. For this, we execute the openresty command from the build/lua folder:

openresty -p . -c nginx.conf -g "daemon off;"

This command will use the current path (-p) as the root and the configuration (-c) in the build/lua folder. The third option (-g) is optional, as it only prevents nginx from becoming a daemon so that you can stop it with CTRL+C. If you are frequently running and stopping openresty, I recommend using the nginx kill-script located here. Otherwise, you’ll sooner or later run into problems with blocked ports and orphaned nginx instances.

There is also a standalone.lua script available that behaves similarly to Python’s standalone.py script. However, unlike Python, it was difficult for me to find a good web server capable of using (self-signed?) certificates. I tried various solutions but failed too many times to run the server. Not sure why. Maybe because most of them were unmaintained for many years. I have no direct experience with the Lua world and don’t want to make any assumptions, but it’s clear that a “default solution,” like the one Python provides (a basic HTTP server), is simply not there. So, I ended up with a solution based on lua-http. More information on how to set up everything, including the needed packages, can be found here. To run the Lua standalone script comfortably, I recommend the run_lua_script.sh from the scripts folder. Otherwise, you’ll need to prepare the execution environment (luajit, etc.) manually.

Conclusion

I intended to write this library only for myself. It was an exercise, a test to see if I really understand PKCE, using a programming language that is unforgiving and hard to grasp. I have already written something similar in TypeScript, which is a much easier language to use. But unlike with the TypeScript version, I am not entirely satisfied with the C++ code that came out. I did not follow good conventions and patterns as well as I should have. I have chained the library to ASIO by default. I have missed too many opportunities to craft useful factories and strategies. I have a full logging library getting compiled by default. And a bunch of other things too. But at least it works. And it has a C API I can play with, and two scripting languages use it already. Not bad, but also not quite good yet. I hope it can serve as a showcase of how the PKCE authorization flow works and which mechanisms are at play (both visible and invisible ones).

Have fun with Keycloak and PKCE!

Using Bosque in JupyterLab
https://blog.brakmic.com/using-bosque-in-jupyterlab/
Fri, 17 Jan 2025 16:17:39 +0000

A few years ago, I discovered an interesting Microsoft Research Project called “BOSQUE” (back then, it was in all caps). The aim was to develop a programming language that promoted the regularized programming paradigm. Just like structured programming and abstract data types in the ’70s freed software developers from the underlying intricacies of the hardware, regularized programming promises to free us from “computational substrates.” Although this article is about my experiences with building extensions and kernels for JupyterLab, I think it’s important to explain the advantages of Bosque and why you should learn it. Since this language is still evolving, the following explanations are based on my own experience and reading of the relevant documents and reports. I strongly encourage anyone interested in Bosque to read available documents and run some of the code available (for example, in JupyterLab with Bosque support).

Going a Step Higher

Dealing with “machine words” (chunks of memory defined by hardware vendors) in the ’70s was just as common as dealing with loops these days. Imagine yourself looking at machine words and writing programs that interpret some of them as “printable text,” “non-printable escape sequences,” “float numbers,” “double floats,” etc. Abstract Data Types freed us from dealing with such vendor-specifics.

The same can be said for functions, loops, and statements in general. Instead of writing programs that dealt with machine addresses, base pointers, stack pointers, jumps, and other hardware specifics, we went a step higher and invented Structured Programming. Our software could now be structured in terms we could think about instead of thinking in accidental hardware intricacies. We, so to speak, added an additional layer of indirection to free ourselves.

Just as we invented loops to eliminate dealing with registers, addresses, and jumps, Bosque is now going a step higher by removing loops and restricting recursion. Instead of handling loop structures, we move to a higher level and focus on intents. When I loop over something, I am actually expressing an intent toward a structure that should be handled in some way, like “iterate over this list and multiply each of its elements by 2.” The structure that performs the looping is not that important. That’s why we don’t need loops in Bosque—because we focus on the programmer’s intents rather than on structures like for, while, do-while, etc.
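The same intent-over-mechanism shift exists in mainstream languages; a Python analogy (not Bosque code) of the “multiply each element by 2” example:

```python
numbers = [1, 2, 3, 4]

# Loop-centric: we spell out *how* to traverse the structure.
doubled_loop = []
for n in numbers:
    doubled_loop.append(n * 2)

# Intent-centric: we state *what* should happen to each element.
doubled_intent = [n * 2 for n in numbers]

assert doubled_loop == doubled_intent == [2, 4, 6, 8]
```

Bosque takes the intent-centric form as the only form: the traversal machinery disappears from the language entirely.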

Mutable State

The same applies to (shared?) mutable state, which is still pervasive in today’s programming. Whenever we take someone else’s code or want to extend our own, one of the first questions is: If I add this one little piece of code, will the rest still be running smoothly? Programming is still not “a practice of monotone reasoning,” as we still need to carry the full burden of accidental complexity in our heads. We freed ourselves from the hardware and gained all these nice structures and later “objects,” but we still must bear the weight of accidental complexities introduced by mutable state. Most often, we simply don’t have enough time or the required skills to digest these complexities mentally. As a result, software development becomes a grueling marathon through the rainforest, just on another level. It seems that the additional layer of indirection we created earlier has introduced its own complexities. That’s why everything in Bosque is immutable by default.

Verification

In Bosque, we can define functions that execute verifications before they start. This eliminates a whole class of errors based on invalid or incomplete data. In most languages, programmers usually employ checks based on if-else statements or various macros. This works, of course, but it feels more like a crutch, as people are using constructs from one area to ensure the functionality of something in a completely different area. In Bosque, such checks have their own semantics and can be reasoned about, with the compiler helping us prevent runtime errors. An additional aspect of this kind of verification is that it can be used for so-called SemVer (semantic versioning) verifications. In regularized programming, we can define the minimum acceptable versions (major, minor, patch) of certain parts of the program, thereby eliminating yet another class of errors. Instead of hunting for reasons why a certain API call failed, we can preemptively eliminate the errors by defining the lowest acceptable API version.
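Bosque gives such checks first-class semantics; in Python we can only approximate them, for example with a hand-rolled decorator (an illustration of the precondition idea, not Bosque syntax):

```python
import functools

def requires(check, message="precondition failed"):
    """Reject calls whose arguments fail the given predicate."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not check(*args, **kwargs):
                raise ValueError(message)
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@requires(lambda n: n >= 0, "n must be non-negative")
def isqrt_floor(n: int) -> int:
    """Integer square root, only defined for non-negative input."""
    return int(n ** 0.5)
```

The difference is that in Bosque the compiler can reason about such conditions statically, while the decorator above only fails at runtime.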

Integrating Bosque into JupyterLab

One thing I want to admit before writing anything about JupyterLab and kernel/extension development: I had never before written anything regarding JupyterLab and the framework beneath it. I never needed it until I started experimenting with the current Bosque compiler environment. Previously, the Bosque compiler could be started with a simple “make-exe” script (defined in package.json). But the current version is a bit more complex, as it’s being done in two steps: first, the compilation to JavaScript (since Bosque compiles to JS), and then the execution of this code by NodeJS. Doing this manually is possible but a bit tedious, as it involves switching from the IDE to the console. Yes, it could be mitigated, and I’ve even created a DevContainer for VSCode that you can find here. But why shouldn’t we have all the nice features of JupyterLab? Web-based development and instant execution of code snippets? JupyterLab is an awesome environment for testing new things, and as Bosque is still a moving target, it would be much better to experiment with it there instead of switching from the IDE to the console and back. So I decided to integrate it into JupyterLab. But I had no clue how to do it. And, honestly, I still doubt that the resulting Kernel and Syntax Colorization Extension, which I am now going to present to you, are really well-crafted. But they work, and they can be extended, for example, when the language syntax changes or a new compiler comes out. If you are interested in improving the code, feel free to open a PR. Thanks in advance!

Writing a Kernel for JupyterLab

JupyterLab is highly extensible and supports various types of extensions. Although I am not sure if I have grasped them all, from my experience with it (that is, the last 10 days or so), JupyterLab can be extended by writing so-called Kernels in Python or extensions in TypeScript. A kernel is a piece of software capable of executing external code, for example, a compiler, and then returning the result to JupyterLab, which then presents it in the UI. This is what basically happens when we press SHIFT+ENTER inside a code cell. The code is sent to the kernel, which then creates a temporary file, feeds it to the respective compiler (or whatever else), and then waits for it to return the result. Ultimately, the result is then returned to JupyterLab and presented to the user. Quite a simple concept, isn’t it? Well, yes in theory, which is mostly gray, while practice is always pitch black. I had significant problems orienting myself at first. Just like any other component, JupyterLab Kernels must follow certain rules to be accepted as kernels.

  • They must be specified in a “kernel spec.”
  • They must be installable by pip (the Python package manager).
  • They must derive from the Kernel base class.
  • They must provide certain information about the language they represent, the MIME types, the file extensions, and the Lexer they use.

I am not going to write a tutorial on these rules and how to implement them, as JupyterLab is already doing excellent work in describing them and also provides many good examples you can reuse in your code. I was quite successful at leveraging the basic code and mostly needed to focus on the extras demanded by Bosque (the aforementioned separation between compilation and execution of code). This is how the Bosque Kernel is structured.

The kernel.json file defines the arguments that JupyterLab will use to launch it.
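A typical kernel spec looks roughly like this (the module name bosque_kernel is illustrative; check the repository for the real file):

```json
{
  "argv": ["python", "-m", "bosque_kernel", "-f", "{connection_file}"],
  "display_name": "Bosque",
  "language": "bosque"
}
```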

This is the bridge between JupyterLab and, in this case, our Bosque environment. Whenever we send Bosque code to be run, the arguments defined here will be used to execute the BosqueKernel class defined in kernel.py. Apart from the Kernel, we also have our own Pygments Lexer, which is located in lexer.py. We need the lexer to tokenize the syntactic rules of the programming language we use. Most importantly, lexers are needed for:

  • Error Messaging by providing syntax-highlighted error messages or tracebacks.
  • Static Analysis through features like syntax checking or code linting on the server side.
  • Documentation Generation by assisting in generating well-formatted documentation or help messages.

One thing should not be confused here, though: Lexers cannot be used for syntax colorization on the frontend. This is what I initially assumed when I ported the syntax definition from the official Bosque repository. Lexers and Kernels, in general, are server-side components.

Bosque Wrapper

As already mentioned, Bosque source code must first be compiled to JavaScript and then run with NodeJS in a separate step. To abstract away these peculiarities (and also to prepare for future changes), I added a small wrapper that handles all this internally. I expect this behavior to change in the future, so the wrapper is already equipped with a currently unused method compile_bosque_future. The entire operation between the Kernel and Wrapper is as follows:

  • The Kernel executes _compile_and_execute, which calls the Wrapper’s own compile_and_execute method.
  • The Wrapper calls both the Bosque compiler and NodeJS, then returns the result back to the Kernel.
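Under the hood, the two-step pipeline amounts to building and running two command lines. A sketch (the command names are placeholders, not the wrapper’s real CLI):

```python
from pathlib import Path

def build_pipeline(
    source: Path,
    bosque_cmd: str = "bosque",
    node_cmd: str = "node",
) -> tuple[list[str], list[str]]:
    """Return the compile and the execute command line for a Bosque source file.

    Command names are illustrative; the real wrapper shells out to the
    actual Bosque compiler scripts.
    """
    js_out = source.with_suffix(".js")          # Bosque compiles to JavaScript
    compile_cmd = [bosque_cmd, str(source), "-o", str(js_out)]
    exec_cmd = [node_cmd, str(js_out)]          # the JS output runs under NodeJS
    return compile_cmd, exec_cmd
```

The Wrapper’s compile_and_execute would then run both lists with subprocess.run and hand NodeJS’s stdout back to the Kernel.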

I am not sure if this is the best way to utilize the Kernel, but it works. I’m not even sure if it’s “pythonic” enough, but again: it works. Feel free to open a PR if you have a better solution. In any case, expect this “duality” to disappear one day when Bosque is able to compile itself. All in all, this is what the Kernel is doing here. To install it, you need to run the usual pip install . inside the kernel folder. This is not necessary if you are building the Dockerfile provided in the BosqueDev-Jupyter project. All you need to do is run your Docker container to access the extended JupyterLab UI.

Just select the “Bosque Notebook,” and the Kernel will start for you. However, the Kernel alone won’t be enough, as you will be missing the nice syntax highlighting we know from other JupyterLab environments. That was the second task to be completed when I started working on the whole thing.

Writing a Syntax Highlighting Extension

Unlike Kernels, frontend extensions in JupyterLab are written in TypeScript. At least, this is what I assume to be the default, but who knows—maybe there are some frameworks that allow writing the extension in Python or another language and then transpiling it into JavaScript. However, I consider this overkill and didn’t even try to find such solutions. Instinctively, I went the TypeScript way. But hold on—before even trying to do anything manually, you should follow the rules laid out by JupyterLab and set up the project correctly. That is, you should run pip install copier in the console to install the project generator. copier uses a predefined extension template that helps you set up the properties of your extension project. The data you enter in the copier dialog will be saved in .copier-answers.yml.

Copier will then create the initial structure for your extension. All sources will go to src, all styles to style, and so forth. It’s just a normal TypeScript project. However, do not try to build and manage it with npm or any other package manager. Instead, a separate package manager based on Yarn will be used: jlpm (JupyterLab package manager). In any case, check the package.json of your project to learn a bit about it. Whenever you compile the project, the results will land in the “lib” folder, and the final extension will be in a subfolder with the same name as your project. In my case, it’s src/jupyterlab_bosque_syntax. Inside this folder, you will see some Python files that are needed for the extension to be properly installed. In general, do not modify these files. Focus on TypeScript only and leave the Python stuff alone. At least, this is what I have learned during my adventures with the development of this extension.

The sources of the extension are located in src:

Generating Parsers with Lezer

Before we focus on the logic behind syntax colorization, I want to mention the importance of the Lezer Parser. Since our code will be running in a code editor based on CodeMirror 6, the use of the Lezer Parser is unavoidable. The code we enter cannot be highlighted without the code editor understanding its structure. So, we need to write a grammar first, which will then be used by the lezer-generator to build a parser for Bosque source code. This parser, of course, is something you should never touch directly. Just let the generator do the work for you. Focus on the grammar (and you can be sure that the current grammar I created is still not complete). The parser code below is a heavily shortened example, as the data in “states,” “stateData,” and “goto” is much longer.

The parser generation is done automatically during the build process. But if you need to parse the grammar manually, just run lezer-generator src/grammar/bosque.grammar -o src/parser/bosque-parser.js.
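For orientation, here is a heavily simplified, hypothetical excerpt of what such a grammar could look like. This is not the actual Bosque grammar, just an illustration of Lezer's grammar notation:

```
@top Program { statement* }

statement { FunctionDecl }

FunctionDecl { kw<"function"> Identifier "(" ")" Block }
Block { "{" statement* "}" }

kw<term> { @specialize[@name={term}]<Identifier, term> }

@tokens {
  Identifier { $[a-zA-Z_] $[a-zA-Z0-9_]* }
  space { $[ \t\n]+ }
}

@skip { space }
```

The `kw<term>` template specializes plain identifiers into keyword tokens, which is what later allows keywords to be mapped to separate Lezer Tags for highlighting.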

Activating the Extension in JupyterLab

JupyterLab, of course, offers an extensible platform that developers can use to add their own components. In our case, we needed to register the extension for both the new language (Bosque) and the new file extension (.bsq).

There are many ways to register an extension, and all of them depend on the use case. Some extensions will be visible, like menus or new widgets. Others, like this one, are for syntax highlighting and do not need to be activated manually. I think one could write entire books on how JupyterLab can be extended. So let’s leave it here, as I don’t consider myself an expert in this matter. The next piece of code that is important to know about is the actual language definition that configures the parser we just generated. This is done in src/bosque-language.ts. The keywords we defined in our grammar are being mapped to Lezer Tags (either standard ones or custom ones we define ourselves). This way, we can group certain keywords and highlight them differently from others. As our grammar changes and grows, we can update the grammar, generate a new parser, and then update the groups.

Parallel to Tags, we also define the HighlightStyles that associate CSS classes with those Tags.

However, during development, I encountered problems regarding the highlighting of elements in a single line of code that I wanted to be colorized differently. I wanted to distinguish function and method modifiers from keywords and names. For example, “public function main()” would be highlighted in three different colors. The same should apply to concept and entity methods.

No matter how precisely I defined and grouped the keywords, they either ended up being highlighted together or had no color at all. After some troubleshooting, I realized that the problem was not “wrong grouping” or “invalid CSS classes,” but the fact that the parser lumps those keywords together. The parser recognizes them separately, of course, but the end result that goes through the highlighting extension is just a single line of code that often cannot be highlighted at all. So, I had to add an “overlay” to process the code lines in chunks.

The Overlay essentially loops over visible “ranges” (this comes from CodeMirror) and extracts text from them. Then, it loops over these strings in search of any known keywords. Whenever it finds one, it assigns a CSS class to it. I know it looks a bit excessive, maybe even superfluous, but I simply wanted to know if there is a way to highlight different keywords in the same line. My initial impression was that I only needed the parser to consume the keywords and that the rest would be “somehow” applied by CodeMirror. But it seems that this wasn’t the correct strategy. However, I am still unsure if this is really a good way of highlighting individual keywords. In any case, if you don’t want or need this kind of highlighting, all you need to do is remove the instantiation of the overlay in src/bosque-language-support.ts, which wires everything up.

The CSS classes used for highlighting are located in style/base.css, which is the recommended file for custom CSS. To change the colors, you basically only need to modify base.css. The rest can remain the same unless you want to change the grammar or introduce new Tags. If you want to experiment with the extension, I recommend using a script that compiles, installs, and then rebuilds JupyterLab in one go. Each time you make a change to the extension, you will need to reinstall it and then rebuild JupyterLab.

I recommend using the prepared development environment that you can clone from here. Just follow the instructions in the README, and you are good to go.

Conclusion

I had no idea what was waiting for me when I started thinking about using Bosque in JupyterLab. I used Jupyter (without the “Lab” suffix) a few years ago and thought it would be “cool” to experiment with an evolving, still not “production-ready” language in an environment designed for rapid prototyping, experimentation, and instant feedback loops. But nobody told me it would involve writing a kernel and an extension in two different languages, using new package managers, and even writing grammars and lexers. Although I am still not sure if the code I wrote follows common conventions and good practices, I surely hope it will be of some use to other developers interested in Bosque. And if some of you even decide to try out Bosque because there is now an easy way to do so, I’ll be more than happy. Thank you so much for trying out Bosque (and your PRs regarding my code are still very welcome).

Have fun with Bosque!

]]>
https://blog.brakmic.com/using-bosque-in-jupyterlab/feed/ 0
Data Science for Losers, Part 6 – Azure ML https://blog.brakmic.com/data-science-for-losers-part-6-azure-ml/ https://blog.brakmic.com/data-science-for-losers-part-6-azure-ml/#comments Sat, 12 Dec 2015 14:13:56 +0000 http://blog.brakmic.com/?p=678

In this article we’ll explore Microsoft’s Azure Machine Learning environment and how to combine Cloud technologies with Python and Jupyter. As you may know, I’ve been using them extensively throughout this article series, so I have a strong opinion on what a Data Science-friendly environment should look like. Of course, there’s nothing wrong with other coding environments or languages, for example R, so your opinion may greatly differ from mine and that’s fine. AzureML also offers very good R support! So, feel free to adapt everything from this article to your needs. And before we begin, a few words about how I came up with the idea to write about Azure and Data Science.

Being a datascientific Loser is OK but what about a Beginner Data Science-Certificate?

I started to write about Data Science because I thought that the best way to learn something is actually to write about it. No matter how little your current knowledge is, the writing itself will surely help to increase it. It may take some time, of course, and there’ll be many obstacles waiting to stop your progress. But if you persist in sharing your knowledge, no matter how small it is, you’ll prevail and gain more confidence. So you can continue your learning journey.

And as I wrote in the original article, our industry, IT, has big problems with naming things. Just take a few buzzwords from the last decade and you’ll know what I mean. However, bad naming is more of an “internal” problem of IT, and people not involved in our business can happily ignore it. Something completely different happens when you start to compare human beings to animals, that is, give people animal names. This is not only misleading but also a dangerous trend, in my opinion. There may be a shortage of qualified Data Scientists (and the fact that amateurs like me write lengthy articles on DS is one of the symptoms), but describing the complexity of DS with animal terms is something we should be worried about. Comparing people to unicorns will sooner or later end with comparisons to pigs, rats and other, less lovely creatures. Nothing can escape duality, and IT, as we all know, is dual to the core: failure and success, 0 and 1, good and bad. The industry is now fervently searching for unicorn data scientists, but sooner or later it’ll also have pig & rat data scientists producing miserable models, incapable algorithms and misleading predictions. Just like we already have coding monkeys writing code that incurs technical debt.

Maybe I’m taking a rather unimportant term too seriously, I don’t know. However, I think that such a powerful and influential industry like IT should think more thoroughly about the message it sends to future talents.

Now, let’s talk about something more positive!  😀

There’s a course at edX called “Data Science and Machine Learning Essentials” provided by Microsoft. It’s a 4-week, practice-oriented course created by Prof. Cynthia Rudin and Dr. Stephen F. Elston. I can’t remember how I found it because I wasn’t actively searching for any datascientific courses, but in the end I registered a seat and, to my surprise, found out that Azure ML supports Jupyter! It wasn’t the mere Python language support that made me enthusiastic about Azure ML but the direct, out-of-the-browser access to a proper Jupyter environment. Currently, Azure ML supports Python and R as its primary languages, but if you know Jupyter you can surely imagine working with Julia, Scala and other languages as well. And this is the reason why I’m dedicating this article to Azure ML. Everything I present here is based on what I’ve learned from the lectures. You can register for free and earn a certificate until March 31st, 2016. I’d recommend buying the Verified Certificate because it supports edX, which is a non-profit organization.

And just to provide evidence that I’m supporting edX here’s my certificate (yes, I’m bragging about myself):

certificate

And just for those of you who’d rather like to learn with R: there’s a free eBook from Dr. Elston called “Data Science in the Cloud with Microsoft Azure Machine Learning and R“.

Now let’s begin with Azure ML.  🙂

Azure ML Registration and Basic Instruments

First, we’ll need a proper account which you can get completely for free at bit.ly/azureml_login

Just click on “Get started” and create a test account.

The environment doesn’t expire after some “testing period”, which is really nice, but you should know that you won’t be running on the fastest possible hardware, so the execution performance of your models won’t be “state-of-the-art”. However, for exploratory tasks and the overall learning experience this free access is more than enough. The start page of Azure ML looks like this:

start_page

You can see my Workspace and a few Experiments in my recent experiments list. The first thing you should learn is Azure ML’s naming of things: an Experiment is basically a Workflow you create to consume, process and manipulate data to get some insights from it. Therefore, an experiment is a collection of interconnected steps you create via drag & drop, and you can also extend them by providing code written in Python, R or SQL (or any combination of them). It’s also possible to create Web Services based on your generated models that can be used by your customers.

This is the toolset from Azure ML where I manage my experiments:

azureml_tools

The left sidebar contains the tools: datasets, models, functions, web services etc. This is the place where you select your preferred models, your scripts (Python/R/SQL) and everything else related to data and its manipulation. In the middle you see the logic of the datascientific workflow, which you create by dragging & dropping elements from the left. On the right you can tweak the settings of the selected Workflow step. Most steps produce outputs that are processed further by subsequent steps. You combine such output-input flows by dragging a line between the little circles of those steps. Many modules have more than one “input circle”, which means they can process more than a single input. For example, the Execute Python Script module can receive two datasets and also a zipped script file containing additional programming logic written in Python.

execute_python_module

Also, modules can have more than just one output. Again, the Execute Python Script module can output new datasets but also create plots:

execute_python_module_plotting

To execute the functionalities provided by these modules you have to right-click on the circle and choose the option from the context menu. Here I’m selecting the Visualize option to get a tabular view of the output dataset.

visualize_dataset

This is how a visualized dataset in Azure ML looks like:

visualized_output_dataset

You gain direct insights into your data: count, mean, median, max, min, as well as the number of missing values and the feature type (Numeric, Categorical). One really nice feature is the direct graphical representation of single columns. Under each column name you see little bar charts which help you recognize the column’s distribution. Another very helpful feature is “compare to” on the right side. You select a column in the list and then compare its values to another column you select in the dropdown on the right.

compare_to_in_azureml

You can also graphically visualize datasets by executing your own Python (or R) scripts. Here’s an example with several scatterplots:

scatterplot_visualize_dataset

The scripts you provide are executed directly by the cloud environment and must define an azureml_main function as their entry point, otherwise Azure ML can’t run them. To properly handle datasets you must import Pandas; for plotting you use matplotlib. If you’re new to Pandas, you may want to read one of the previous articles from this series.

This is what a rudimentary script looks like (the number of input parameters depends on the actual task):

azureml_python_script
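A minimal sketch of such an azureml_main script might look like this. The column name, the threshold, and the exact filtering logic are illustrative assumptions:

```python
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Azure ML passes the connected input datasets as pandas DataFrames.
    # Here we keep only the rows with temperatures above 20 degrees.
    filtered = dataframe1[dataframe1["temperature"] > 20.0]
    # Outputs are handed back as a sequence of DataFrames.
    return filtered,
```

The module feeds whatever you return into its output circle, so the next step in the experiment receives the filtered dataset.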

Take care to name the function correctly. This is what happens if you omit azureml_main:

execute_script_error

Now you’re ready to execute the experiment. In this case it comprises three steps:

  • Getting data
  • Executing 1st Python Script
  • Executing 2nd Python Script

Click on RUN in the status bar and wait for the experiment to complete:

executing_experiment

Each successfully completed step is marked with a green check mark; currently running steps show an hourglass icon. We now go to the second script, which contains our filtering code, and visualize the results. We see that there are no temperatures below 20.0 degrees (look at the Min value indicator on the right).

temp_greater_than_20

Using Jupyter in Azure ML

For me, the most exciting part is the ability to quickly switch to Jupyter and work on the same dataset.

open_in_jupyter_notebook

After a few moments the familiar interface will show up:

jupyter_in_azureml

Now you can work on datasets from Azure and completely ignore the underlying environment. You don’t even need Windows to use Azure ML, of course. This means: just ignore the fact that this blog’s writer is still using Windows. :mrgreen:

jupyter_in_azureml_2

Importing Scripts

The Azure ML tool-set is very intuitive and easy to use. Just type a term into the search field above the modules list and you’ll surely find the right one. You can also very easily pack your own scripts into Zip files and upload them as your own modules. Just write a Python script containing the logic you want to reuse, zip it and upload it to Azure ML. That’s all it takes to make your script a part of your dataset list. You can now use it in your experiments.

Here we define a small math-utils library and subsequently zip it. The numpy library is not directly used by the functions; it serves merely as an example showing that you can reference additional libraries for internal use.

math_util_lib
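Such a utility module might look like this. The function names are made up for illustration; as noted, numpy is imported only to demonstrate that bundled scripts can reference other libraries:

```python
# math_utils.py -- zip this file and upload the archive as a dataset
import numpy as np  # not used directly; shows that extra libraries can be referenced

def add(a, b):
    """Return the sum of two values."""
    return a + b

def multiply(a, b):
    """Return the product of two values."""
    return a * b
```

After zipping and uploading, the functions become callable from an Execute Python Script module under the namespace given by the original file name (here, math_utils).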

We click on NEW in the status bar and choose DATASET from local file.

upload_new_dataset_from_file

In the file dialog we select our zipped file and wait for upload to complete.

upload_new_dataset_from_file_2

Our new utility library will now show up in the list of datasets:

upload_new_dataset_from_file_3

We drag & drop the new module and connect it to another Execute Python Script module in the experiment. The import path of that module is extended accordingly. The namespace of the referenced module is the original file name of the zipped Python script.

using_own_scripts_in_azureml

Here you should take care to connect the utility script to the third of the available input circles of the target module.

connect_zipped_script_bundle

Now you can save and execute the experiment. Please note that I’m providing here just a contrived example to show you how this mechanism works.

Conclusion

Azure ML is an exciting environment for datascientific work, and what I’ve provided here is just a tiny overview of some of its many features. Again, I recommend enrolling in the aforementioned course and learning everything needed to work effectively with Azure ML. Alternatively, you can watch this video from Dr. Elston about Data Science with Cortana Analytics, where he explains in great detail the structure of Azure ML and how to use it effectively.

]]>
https://blog.brakmic.com/data-science-for-losers-part-6-azure-ml/feed/ 1
Data Science for Losers, Part 5 – Spark DataFrames https://blog.brakmic.com/data-science-for-losers-part-5-spark-dataframes/ https://blog.brakmic.com/data-science-for-losers-part-5-spark-dataframes/#comments Tue, 24 Nov 2015 14:14:41 +0000 http://blog.brakmic.com/?p=559

Sometimes, the hardest part in writing is completing the very first sentence. I began to write the “Loser’s articles” because I wanted to learn a few bits about Data Science, Machine Learning, Spark, Flink etc., but as time passed by, the whole thing degenerated into a really chaotic mess. This may be a “creative” chaos, but it’s still way too messy to make any sense to me. I’ve got a few positive comments and also a lot of nice tweets, but quality is not a question of comments or individual tweet frequency. Do these texts properly describe “Data Science”, or at least some parts of it? Maybe they do, but I don’t know for sure.

Whatever, let’s play with Apache Spark’s DataFrames.  😀

The notebook for this article is located here.

Running Apache Spark in Jupyter

Before we start using DataFrames we first have to prepare our environment, which will run in Jupyter (formerly known as “IPython”). After you’ve downloaded and unpacked the Spark package, you’ll find some important Python libraries and scripts inside the python/pyspark directory. These files are used, for example, when you start the PySpark REPL in the console. As you may know, Spark supports Java, Scala, Python and R. The Python-based REPL, called PySpark, offers a nice option to control Spark via Python scripts. Just open the console and type in pyspark to start the REPL. If it doesn’t start, please check your environment variables, especially SPARK_HOME, which must point to the root of your Spark installation. Also take care to put the important sub-directories, like python, scala etc., into your PATH variable so the system can find the scripts.

pyspark_shell

In PySpark you can execute any of the available commands to control your Spark instance, instantiate Jobs, run transformations, actions etc. And the same functionality we want to have inside Jupyter. To achieve this we need to declare a few paths and set some variables inside the Python environment.

jupyter_spark_setup

We get the root of our Spark installation, SPARK_HOME, and insert the needed python path. Then, we point to the Py4J package and finally, we execute the shell.py script to initialize a new Spark instance. Ultimately, the Jupyter output returns information about the newly started Spark executor and the two instances: SparkContext and HiveContext. We’ll soon use them to load data and work with DataFrames.
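The setup boils down to something like the following sketch. The Py4J archive name and the paths are assumptions that depend on your Spark release:

```python
import os

def pyspark_sys_paths(spark_home):
    """Return the entries that must be prepended to sys.path so that
    Jupyter can import pyspark and its bundled Py4J bridge."""
    return [
        os.path.join(spark_home, "python"),
        os.path.join(spark_home, "python", "lib", "py4j-0.9-src.zip"),
    ]

# Typical usage inside a notebook cell:
# import sys
# spark_home = os.environ["SPARK_HOME"]
# sys.path[:0] = pyspark_sys_paths(spark_home)
# exec(open(os.path.join(spark_home, "python", "pyspark", "shell.py")).read())
# # shell.py creates the SparkContext (sc) and the HiveContext (sqlContext)
```

Once shell.py has run, the notebook behaves just like the PySpark REPL, with sc and sqlContext ready to use.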

What are DataFrames?

The grandpa of all modern DataFrames, like those from pandas or Spark, is R’s data.frame. Basically, DataFrames are 2D matrices with a bunch of powerful methods for querying and transforming data. Just imagine an in-memory representation of a columnar dataset, like a database table or an Excel sheet. Everything you can do with such data objects you can do with DataFrames too. You can read a JSON file, for example, and easily create a new DataFrame based on it. You can also save it in a wide variety of formats (JSON, CSV, Excel, Parquet etc.). And for the Spark engine, DataFrames are even more than a transportation format: they define the future API for accessing the Spark engine itself.

dataframes_in_spark_stack

As you may already know, Spark’s architecture is centered around RDDs (resilient distributed datasets), which are type-agnostic. RDDs don’t know much about the original datatypes of the data they distribute over your clusters. Therefore, in situations where you want typed access to your data, you have to deliver your own strategies, and this usually involves a lot of boilerplate code. To avoid such scenarios, and also to deliver a general, library-independent API, DataFrames will serve as the central access point to the underlying Spark libraries (Spark SQL, GraphX, MLlib etc.). Currently, when working on a Spark-based project, it’s not uncommon to deal with a whole “zoo” of incompatible RDDs: a ScalaRDD is not the same as a PythonRDD, for example. But a DataFrame will always remain just a DataFrame, no matter where it came from and which language you used to create it. Also, the processing of DataFrames is equally fast no matter what language you use. Not so with RDDs: a PythonRDD can never be as fast as a ScalaRDD, for example, because Scala compiles directly to bytecode (and Spark is written in Scala), while PythonRDDs first must be converted into compatible structures before they can be compiled into bytecode.

This is what the standard Spark data model built on RDDs looks like:

stadard_spark_data_model

We take a dataset, a log file for example, and let Spark create an RDD with several partitions (physical representations spread over a cluster). Here, each entry is just a line of the whole log file, and we have no direct access to its internal structure. Let’s now check the structure of a DataFrame. In this example we’ll use a JSON file containing 10,000 Reddit comments:

spark_dataframe_model

Unlike RDDs, DataFrames order data in columnar format and maintain access to the underlying data types. Here we see the original field names and can access them directly. In the above example only a few of the available fields were shown, so let’s use some real code to load the JSON file containing the comments and work with it:

  • First, we import the Pandas library for easier accessing of JSON-structures:

import_pandas

  • Then we load the original file, which is not exactly a JSON file. Therefore, we have to adjust it a little before letting Pandas consume it.

load_comments_with_pandas

  • Now we can use the Pandas DataFrame to create a new Spark DataFrame. From now on we can cache it, check its structure, list columns etc.

loading_json_into_dataframe
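The loading steps above can be sketched as follows. The sample records and field names are illustrative, not the full Reddit schema:

```python
import io
import json
import pandas as pd

# The dump is newline-delimited JSON -- one comment object per line --
# rather than a single JSON array, so we parse it line by line.
raw = io.StringIO(
    '{"author": "alice", "body": "hello", "created": 1448000000}\n'
    '{"author": "bob", "body": "world", "created": 1448000060}\n'
)
records = [json.loads(line) for line in raw]
pandas_df = pd.DataFrame(records)

# In the notebook, sqlContext.createDataFrame(pandas_df) would then
# turn the pandas DataFrame into a Spark DataFrame that can be cached,
# inspected (printSchema) and queried.
```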

Here we print the underlying schema of our DataFrame:

print_schema_of_dataframe

It is important to know that Spark can create DataFrames based on any 2D matrix, regardless of whether it’s a DataFrame from some other framework, like Pandas, or even a plain structure. For example, we can load a DataFrame from a Parquet file.

reading_dataframe_from_parquet

We can manually create DataFrames, too:

manually_creating_dataframes

And as with RDDs, DataFrames are lazily evaluated. This means that unless you execute one of the available actions (show, collect, count etc.), none of the previously defined transformations (select, filter etc.) will happen.

Querying DataFrames

If you’ve already worked with DataFrames from other frameworks or languages, then Spark’s DataFrames will feel very familiar. For example, filtering data can be done with SQL-like commands such as:

  • select

select_entry_from_dataframe

  • first

get_first_entry_from_df

  • selecting multiple columns

multiselect_df

  • groupby, filter, join etc.

filtering_df

The advantage of this approach is clear: there’s no need to map raw indices to concrete semantics. You just use field names and work on data according to your domain tasks. For example, we can easily create an alias for a certain column:

aliasing_columns

Here we gave the select function a column object (func.col), which in this case represents the real column “department“. Because we use an object here, we also gain access to some additional methods like “alias”. In the previous examples, where the column was represented by a simple string value (the column name), no additional methods were available.

By using the column object we can also very easily create more complex queries like this grouping/counting example:

grouping_df

The original namespace where the column objects reside is pyspark.sql.functions. For our example we created an alias called func. But there’s no need to work only with Spark DataFrames. We can, for example, execute some complex queries that make use of Spark’s architecture, while the rest of our work can still be done with “standard” tools like Pandas:

convert_to_pandas_df

User Defined Functions with DataFrames

As we may already know, dealing with datasets involves a lot of scrubbing and massaging of data. In our small example with Reddit comments, the column “created” contains a timestamp value which is not very readable. Pandas users surely know about the various datetime/timestamp conversion functions, and in Spark we have a toolset that allows us to define our own functions operating at the column level: User Defined Functions. Many of you may already have used them with Hive and Pig, so I’ll avoid writing too much about the whole theory of UDFs and UDAFs (User Defined Aggregate Functions).

We create a new UDF which takes a single value and its type, and converts it into a readable datetime string using Pandas’ to_datetime. Then we let our Spark DataFrame transform the “created” column by executing the withColumn function, which takes our UDF as its second argument.

user_defined_functions

Alternatively, you could also use lambdas and convert the results to Pandas DataFrames:

user_defined_functions

It’s also possible to register UDFs so they can be used from other Jupyter notebook instances, for example.

registering_udfs
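The core of such a UDF is a plain Python function wrapped around Pandas’ to_datetime; the Spark wiring, sketched in the comments, registers it and applies it column-wise. The function name and the DataFrame variable are assumptions:

```python
import pandas as pd

def to_readable(timestamp):
    """Convert a raw Unix timestamp into a readable datetime string."""
    return str(pd.to_datetime(timestamp, unit="s"))

# Spark side (sketch):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# readable_udf = udf(to_readable, StringType())
# comments_df = comments_df.withColumn("created", readable_udf(comments_df["created"]))
```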

Conclusion

Working with DataFrames opens up a wide range of possibilities not easily available when working with raw RDDs. If you come from Python or R, you’ll adopt Spark’s DataFrames immediately. The most important fact is that DataFrames will constitute the central API for working with Spark, no matter what language or which library from Spark’s ecosystem you may be using. Everything will ultimately end up as a DataFrame.

]]>
https://blog.brakmic.com/data-science-for-losers-part-5-spark-dataframes/feed/ 5
Data Science for Losers, Part 4 – Machine Learning https://blog.brakmic.com/data-science-for-losers-part-4-machine-learning/ https://blog.brakmic.com/data-science-for-losers-part-4-machine-learning/#respond Tue, 27 Oct 2015 18:35:42 +0000 http://blog.brakmic.com/?p=383

It’s been a while since I’ve written an article on Data Science for Losers. A big Sorry to my readers. But I don’t think that many people are reading this blog.  :mrgreen:

Now let’s continue our journey with the next step: Machine Learning. As always the examples will be written in Python and the Jupyter Notebook can be found here. The ML library I’m using is the well-known scikit-learn.

What’s Machine Learning

From my non-scientist perspective I’d define ML as a subset of Artificial Intelligence research that develops self-learning (or self-improving?) algorithms which try to gain knowledge from data and make predictions based on it. However, ML is not reserved for academia or some “enlightened circles”. We use ML every day without being aware of its existence and usefulness. A few examples of ML in the wild: spam filters, speech-recognition software, automatic text analysis, “intelligent game characters”, or the upcoming self-driving cars. All these entities make decisions based on some ML algorithm.

The Three Families of ML-Algorithms

ML research, like any other good scientific discipline, groups its algorithms into different sub-groups:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

I should avoid annoying you with my pseudo-scientific understanding of this whole subject, so I’ll keep the definitions of these three as simple as possible:

  • Supervised Learning is when the expected output is already known and you want your algorithms to learn the relations between your input “signals” and the desired outputs. The simplest possible explanation is: you have labelled data and want your algorithm to learn from it. Afterwards, you give the model some new, unlabeled data and let it figure out what labels the data should get. It goes like this: Hey, Algorithm, look what happens (output) when I push this button (signal). Now learn from this example, and the next time I push a “similar” button you should react accordingly.
  • Unsupervised Learning is when you know nothing about your data, that is: no labels here, and you let your favorite slave, Algorithm, analyze it to hopefully find some interesting patterns. It’s like: Hey Algorithm, I have a ton of messy (that is: unstructured) data. And because you are a computer and have all the time in the world, I’ll let you analyze it. In the meantime I’ll have a pizza, and you’ll hopefully provide me some meaningful structure and/or relations within this data.
  • Reinforcement Learning is in some way related to Supervised Learning because it uses so-called rewards to signal the algorithm how successful its action was. In RL the algorithm, the agent, interacts with the environment by performing actions. The results of these activities are rewards which help the agent learn how well it performed. The difference between SL and RL is that these outcomes are not ultimate truths like the labelled data in SL, but just measurements. Therefore, the agent has to execute a certain number of trial-and-error tasks until it finds its optimum. It’s like: Hey Algorithm, this is the snake pit with pirate gold buried deep down. Now, bring it to me…  😈 

In this article we’ll use algorithms from the Supervised Learning family. And these three families are, of course, not the only grouping in Machine Learning. They’re much more granular and divided into sub-groups. For example, the SL family splits into Regression and Classification, which split further into groups like Support Vector Machines, Naive Bayes, Linear Regression, Random Forests, Neural Networks etc. In general we have linear and non-linear methods to predict an outcome (or many of them). All of these methods are there to make predictions about data based on some other data (the training data).

Training the algorithms (Regression and Classification)

The data we analyze contains features or attributes (ordered as columns) and instances (ordered as rows). Just imagine you’d get a big CSV file with 100 columns and 1,000,000 rows. The first 99 columns define different attributes of this data set, for example some DNA information (genes, molecular structure etc.). And the rows represent different combinations of attributes which produce some outcome in an organism (for example, the risk of getting a certain disease). This result, the risk, is represented as a percentage ranging from 0% to 100% in the 100th column. Now, your task would be to randomly take some of these rows, let’s say 800,000, and train your algorithm on the risks. The remaining 200,000 you keep separate from your training task so you can later use them to test how well your trained model behaves. Of course, the testing data would only contain the 99 attribute columns, because it makes no sense to test your algorithm by giving it the already calculated outcomes (the 100th column). Simply put: you split the original data set into two sets, each representing a different time frame. The training data set represents the past and the testing data set represents the future. You hope that, based on past data, your algorithm will be able to accurately predict the future (your testing data set for now) and all the upcoming data sets (the “real” future data sets). This ML model is based on Regression because we try to predict ordered and continuous values (the percentages).
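The split described above can be sketched in a few lines. This is a minimal example with small synthetic data standing in for the hypothetical 100-column genetics CSV; newer scikit-learn versions keep train_test_split in sklearn.model_selection.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(1000, 99)           # the 99 attribute columns
y = rng.rand(1000) * 100         # the 100th column: risk in percent

# Hold back 20% of the rows as "future" data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)   # (800, 99) (200, 99)
```

The same 80/20 idea scales directly to the 800,000/200,000 split from the text.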

The ML models based on Classification do not calculate values but rather group instances of data sets into some predefined categories or classes, hence the name “classification”. For example, we could take a data set containing information about different trees and try to predict a tree family based on certain attributes. Our algorithm would first learn what belongs to a certain tree family and then try to predict it based on given attributes. Again, we’d split the original data set into two pieces, one for training, the other for testing. Also, the testing part wouldn’t contain the information about the tree family (the class). The algorithm would then read all the attributes, row by row, and calculate the probability that a tree (the whole row) belongs to a certain family (one of the possible entries in the last column).

This is, in essence, what you’re doing when applying ML via Supervised Learning. You have the data, you know its form and the quality of the results, and now you want your algorithm to be trained so you can later use it to make predictions on data that will come later. In effect, you prepare your algorithm for the future based on data from the past. Your ultimate goal is to create a model that is able to generalize to future outcomes based on what it has learned by analyzing past data.

But, again, this article series is for Losers, so we don’t care much about everything and focus on a few things we barely understand.  😳

A Classification Primer with scikit-learn and flowers

I’m pretty sure that most of you already know the Iris data set. This data collection is very old and serves as one of the standard examples when teaching people how to classify data, by providing a data set containing attributes of Iris flowers. In short, the Iris data set contains 150 samples (also called “observations”, “instances”, “records” or “examples”) of three different flower types (Iris setosa, virginica and versicolor). The instances are separated into four columns (or “attributes”, “features”, “inputs” or “independent variables”) describing sepal length & width, and petal length & width. Now the algorithm we’re going to use must correctly predict one of the possible outcomes (also called “targets”, “responses” or “dependent variables”): setosa, virginica or versicolor, represented by the numbers 0, 1 and 2. Therefore: we have to classify the samples.

First we import scikit-learn and load the Iris data set. The Iris data set can either be loaded directly from scikit-learn or from a CSV file from the same directory where the notebook is located.

load_iris_dataset
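The code behind that screenshot looks roughly like this, assuming the bundled loader is used (the CSV route works just as well):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4)
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
```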

Scikit offers many different data sets. Just list them all by using dir(sklearn.datasets). All of them contain some general methods and properties:

iris_description

Here we see that our toy data set comprises 150 instances (50 for each flower type), four attributes which describe every single flower, and also the units used to measure those values (centimeters in this case).
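The listing and the description can be reproduced like this; every bundled loader in the module starts with the prefix load_:

```python
import sklearn.datasets

# List the bundled data set loaders
loaders = [name for name in dir(sklearn.datasets) if name.startswith('load_')]
print(loaders)

iris = sklearn.datasets.load_iris()
print(iris.DESCR[:300])    # the beginning of the data set description
```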

We can look into the raw data from Iris data set by using its data-property:

iris_data

The respective targets (the classes we’ll try to predict) are also easily accessible:

iris_target

We see the distribution of the three categories (classes) of flower types, ranging from 0 to 2. The categories are represented as numerical values because scikit-learn expects them to be numerical.
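A quick look at both properties confirms the layout, the four measurements per row and the perfectly balanced classes:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data[:3])              # the first three rows of raw measurements
print(iris.target)                # 150 class labels: 0, 1 and 2
print(np.bincount(iris.target))   # [50 50 50], 50 samples per class
```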

As we see here the target array contains a grouped set of available categories. And this fact is very important, because to create a proper training set we first must ensure that it contains a sound combination of all categories. In our case it wouldn’t be very intelligent to take only the first 60 entries, for example. The third flower category (numbered 2) wouldn’t be represented at all, while the second one (numbered 1) wouldn’t be represented in a similar proportion to the original data set. Training our algorithm on such data would certainly lead to biased predictions. Therefore we must ensure that all categories are correctly represented in the training data set. Also, we should never use the whole data set for training the model, because that would lead to overfitting. This means that our model would perfectly learn the structure of our training data but would not be able to correctly analyze future data. Or, as you may often hear: the model would rather learn noise than signal.
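Both pitfalls are easy to demonstrate. The sketch below shows why naive slicing is dangerous, and how train_test_split (which shuffles by default; the stratify parameter additionally preserves the class ratios) avoids it:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# Naive slicing: the first 60 rows contain no class 2 at all
print(np.bincount(iris.target[:60], minlength=3))   # [50 10  0]

# A shuffled, stratified split keeps all classes proportionally represented
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2,
    stratify=iris.target, random_state=0)
print(np.bincount(y_tr))                            # [40 40 40]
```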

Also, our decision to use Classification or Regression should not be based solely on the shape of the numerical values. Just because we see an ordered set of 0s, 1s and 2s doesn’t always mean we’re looking at categories. Every time we want to develop a predictive model we have to learn about the data set first, before we let the machine do the rest.

We gain additional information about our data set by exploring these properties (of course, in the wild we’d have to collect these names from different sources, like column names, documentation etc.):

iris_information_about_samples_and_features

It’s worth knowing that scikit-learn uses NumPy arrays to manage its data. These are, of course, much faster than traditional Python lists.

By using the excellent matplotlib package we can easily generate visual representations. Here we use petal length and width as x and y values to create a scatter plot.

iris_classification_1

From X to y

By convention the samples are located in a structure called X and their respective targets in y (note the lower-case y!). Both are NumPy arrays, while the Iris data set itself is of type sklearn.datasets.base.Bunch. The reason we capitalize X is its structure: it’s a matrix, while y is a vector. Hence, our work is based on a very simple rule: use X to predict y. Therefore, the learning process must be similar: give it X to learn about y. In this article we’ll use the k-Nearest Neighbor classifier to predict the labels of the Iris samples located in the test data set. kNN is just one of many available algorithms and it’s advisable to use more than one algorithm. Luckily, the scikit-learn package makes it very easy to switch between them. Just import a different model and fit it to the same data set. There’s basically just one API to learn, regardless of the currently used algorithm.

knn_classifier
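The uniform API is easy to see in code. A sketch: both estimators below expose the same fit/predict interface, so swapping algorithms is a one-line change (the n_neighbors value is just a common default, not taken from the screenshot):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

model = KNeighborsClassifier(n_neighbors=5)
# model = LogisticRegression()   # a drop-in replacement with the same API

print(hasattr(model, 'fit'), hasattr(model, 'predict'))   # True True
```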

As already mentioned it’s very important to split data sets in such a way that no biased predictions occur. One of the possible methods is by using functions like train_test_split. Here we split our data into four sets, two matrices and two vectors:

train_test_split_datasets

id stands for “iris data” (matrices) while il means “iris label” (vectors). We now use the training parts (which represent “data from the past”) to train our model.

fitting_model_to_data
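Put together, the split-and-fit step looks roughly like this, reusing the id/il naming from above (the exact parameter values are assumptions, not taken from the notebook):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# id = "iris data" (matrices), il = "iris label" (vectors)
id_train, id_test, il_train, il_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(id_train, il_train)                 # train on "data from the past"
print(knn.score(id_test, il_test))          # accuracy on the held-out 20%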

We can use our trained model to generate probabilistic predictions. Here we use some randomly typed numbers, which represent the four features, and let the model calculate percentages.

According to our trained model the chance of being setosa is at 20%, versicolor at 60%, and virginica at 20%.

probabilistic_predictions
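A probabilistic prediction can be sketched as below. The four sample values are made up for illustration (they are not the numbers from the screenshot); with k=5 the probabilities are always multiples of 0.2, which is how a 20/60/20 split like the one above can arise.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)

# Four made-up measurements: sepal length/width, petal length/width (cm)
sample = np.array([[5.9, 3.0, 4.2, 1.5]])
probs = knn.predict_proba(sample)
print(probs)   # one probability per class; the row sums to 1
```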

Let’s test our trained model on some “future” data. In this case it’s the part we’ve got from the train_test_split function.

testing_model

We see the predictions for each of the unlabeled samples. These are classes our model “thinks” the flowers belong to. Now let’s see what the real results are.

real_results

Comparing the two with the naked eye, we could say that the predictions were “OK” and mostly accurate. Here are the cross-validation scores:

cross_validation_scores
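Such scores come from cross_val_score, which re-splits the data several times and evaluates the model on each fold; a minimal sketch (fold count chosen arbitrarily):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# 5-fold cross-validation: five train/test splits, five accuracy values
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                         iris.data, iris.target, cv=5)
print(scores, scores.mean())
```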

We see that our model is very accurate! Not bad for Losers like us. But this is mostly due to two simple facts: the Iris data set is only a “toy” data set and kNN is one of the simplest available algorithms.

Conclusion

I hope I could provide some “valuable” information regarding ML and how we can utilize algorithms from scikit-learn. This article should serve as a humble beginning of a series of articles on ML in general. In my spare time I’ll try to provide more examples with different algorithms (and will not use the iris data set and kNN again, I promise).

 

]]>
https://blog.brakmic.com/data-science-for-losers-part-4-machine-learning/feed/ 0
Data Science for Losers, Part 3 – Scala & Apache Spark https://blog.brakmic.com/data-science-for-losers-part-3-scala-apache-spark/ https://blog.brakmic.com/data-science-for-losers-part-3-scala-apache-spark/#comments Thu, 15 Oct 2015 19:10:45 +0000 http://blog.brakmic.com/?p=281

I’ve already mentioned Apache Spark and my irrational plan to somehow integrate it into this series, but unfortunately the previous articles were a complete mess, so it had to be postponed. And now, finally, this blog entry is completely dedicated to Apache Spark, with examples in Scala and Python.

The notebook for this article can be found here.

Apache Spark Definition

By its own definition Spark is a fast, general engine for large-scale data processing. Well, someone would say: but we already have Hadoop, so why should we use Spark? Such a question I’d answer with the remark that Hadoop is EJB reinvented and that we need something more flexible, more general, more expandable and…much faster than MapReduce. Spark handles both batch and streaming processing at a very fast rate. Compared with Hadoop, its in-memory tasks run 100 times faster, and 10 times faster on disk. One of the weak points of Hadoop’s infrastructure is the delivery of new data, for example in real-time analysis. We live in times where 90% of the world’s current data was generated in the last two years, so we surely can’t wait for too long. And we also want to quickly become productive, in a programming language we prefer, without writing economically worthless ceremonial code. Just look at Hadoop’s Map/Reduce definitions and you’ll know what I’m talking about.

public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
 

Here’s an example in Python which we’ll later use in Jupyter:

moby_dick_rdd.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
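To make the flatMap → map → reduceByKey semantics concrete without a Spark cluster, here is a pure-Python equivalent of the same word-count pipeline (the two sample lines are made up):

```python
lines = ["Call me Ishmael", "me and the whale"]

# flatMap: split every line into words and flatten the result
words = [word for line in lines for word in line.split(" ")]
# map: pair each word with the count 1
pairs = [(word, 1) for word in words]
# reduceByKey: merge the counts per word (Spark does this per partition, in parallel)
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)   # {'Call': 1, 'me': 2, 'Ishmael': 1, 'and': 1, 'the': 1, 'whale': 1}
```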

But a few things remain the same: the underlying file system, HDFS, is not going away, because Spark is only a processing engine and doesn’t dictate what your distributed file system has to be. Spark is compatible with Hadoop, so you don’t have to recreate everything from scratch when transferring your old MapReduce logic to Spark’s ecosystem. The same applies to your YARN scheduler and Mesos cluster manager. Spark brings its own scheduler too: the Standalone Scheduler.

Spark Architecture

If you look at Spark’s architecture you’ll recognize that there’s a core component which lays the foundation for hosting a very diverse set of functionalities:

the_spark_stack

Spark Core covers the basic functionality like scheduling, fault recovery, memory management and interactions with the storage system. And as we’ll see later in this article, here is the place where the most important abstraction, the RDD (resilient distributed dataset), is defined, as well as the APIs to work with it.

Spark SQL replaces the previous API for accessing structured data, Shark. It supports queries via SQL and Apache HQL (Hive Query Language). This API allows accessing different types of structured data storages like databases, JSON files, Parquet, Hive Tables etc.

Spark Streaming is the engine for processing live data streams. Its functionality is similar to Apache Storm. It can consume data from Twitter, HDFS, Apache Kafka and ZeroMQ.

MLlib is a library designed for doing machine learning tasks. It comes with different ML algorithms like classification, regression, clustering, filtering etc.

GraphX is a library for processing and manipulating graphs, for example Facebook “friends”, Twitter “followers” etc. It comes with a set of specialized algorithms for processing such data, like PageRank, triangle counting, label propagation etc.

Spark Definition

For our purposes we define Spark as a distributed execution environment for large-scale computations. Spark itself is built with Scala and runs on JVM but supports different languages and environments. It also contains a nice web-interface to control the status of distributed tasks. Here we see a screenshot showing the status of a Python job. Later we’ll learn how to use Spark with Jupyter.

apache_spark_webconsole

Resilient Distributed Data sets

Take almost any written Spark tutorial or video and sooner or later you’ll approach RDDs. This TLA (three-letter acronym) describes the building block of Spark’s architecture and is all you need to know to be able to use any of the aforementioned functionalities. RDDs are immutable distributed collections of objects. With “objects” we mean anything you can use (documents, videos, sounds, blobs…whatever). They’re immutable because an RDD can’t be changed once it has been created. They’re distributed because Spark splits RDDs into many smaller parts which can be distributed and computed over many different computing nodes. Simply imagine you have a cluster comprising 20 machines and now you want to do some Natural Language Processing on a very big chunk of scientific articles. Spark would take this “blob”, create an RDD out of it, split it internally into many smaller parts and send them to all available nodes over the network. The next time you start a computation on this RDD (because all you see is still a single object) Spark would take your NLP logic and send it to each and every participating node to do some computation on its part of the original data. After all of them have produced their partial computations you’d receive one final result. This is the very reason why RDDs have to be immutable and, of course, distributed. They must be immutable so that no manipulation of smaller parts can happen, and they have to be distributed because we’re interested in fast calculations done in parallel over many nodes. To create an RDD we can use different programming languages, like Scala, Java, Python or R. In this article we’ll use both Scala and Python to create a Spark Context, configure it and load some text to play with for a bit. But before we can use Spark we have to configure it. Here we’ll use it as a single instance, for testing purposes. In the “real world” you’d usually have a master instance controlling other nodes.

Installing and Configuring Apache Spark

  • I’m using Windows and there are some additional steps during the installation which I’ll explain as we proceed. First, go to the Apache Spark homepage and download v1.5.1 pre-built for Hadoop 2.6 and later.

apache_spark_download

  • Unpack the tarball (under Windows you have to use a packer like 7-Zip because Windows doesn’t recognize tarballs)
  • Apache Spark doesn’t contain a precompiled version of Hadoop for Windows. Therefore you have to build it yourself. Or download a precompiled version 2.6 from here.
  • Insert new environment variables HADOOP_HOME and HADOOP_CONF_DIR pointing to the root of your Hadoop install dir and the config dir within your Hadoop root, respectively. This is how it looks on my machine:

hadoop_vars

  • Also insert SPARK_HOME pointing to the root-dir of your Spark installation.
  • Spark is built with Scala and runs on JVM, therefore we need Scala and JVM installed on our machines. Additionally we install Scala’s Simple Build Tool SBT.

Using Apache Spark from Command Line

To start Spark we simply use a shell-script from its bin-directory.  Type in spark-shell and soon a bunch of messages will start to flow in.

apache_spark_start

After a few seconds you’ll see a prompt like this:

apache_spark_prompt

A scala-based REPL (read-evaluate-print-loop) is awaiting commands. A default SparkContext is available through the variable sc. Just type in sc. and press TAB to expand all its properties and methods.

apache_spark_sparkcontext

We’ll now use the SparkContext to read a text file and put it into a new variable, or to be more Scala-ish: into a val. In Scala there’s a difference between mutable variables declared with the keyword var and immutable “variables” declared with val (“val” stands for value, which implies immutability).

scala_read_text

We see a lot of messages which are not only related to reading a file. There are “ensureFreeSpace” calls, broadcasts, some “pieces” are mentioned etc. At the very end of the stream we get the message that an RDD has been created and its name is moby_dick. Well done! This is our first RDD. RDDs are named by their origin: ScalaRDDs, PythonRDDs etc.

Now let’s see what the Spark WebConsole says about it. Go to http://localhost:4040 and search for a textFile job (the name of the job could differ on your machine).

scala_show_jobs_in_webconsole

Um, well, there are no jobs at all! Am I kidding you? Yes, I am! But for your own good. Trust me, because Apache Spark is lazy! Spark is simply doing nothing because there’s no reason to read this file yet. But why? How can Spark decide when and what to do with some piece of data?

OK, just imagine we’re not reading Herman Melville’s Moby Dick but some extremely large text corpus. For example, gigabytes of Log-Files provided by Apache Kafka which you want to analyze line by line. So, you give Spark this log-file’s path and then…it blocks everything because it’s reading it as a whole….and reading it….and….you wait….and your memory goes down…and there’s even less memory available for your filtering logic…and you still have to wait….OutOfMemoryException...well, that’s not nice!

Apache Spark is not only lazy, it also offers two types of operations to work with RDDs: transformations and actions. Transformations are operations that construct new data from previous data. There is no change to the original data, as we already know, because RDDs are immutable. This is nothing new to you if you call yourself a “functional programmer”, or simply love Haskell and laugh at other coders’ problems with shared mutable state. But I’m not going to annoy you with my superficial knowledge about Category Theory, Monoids & Monads. :mrgreen:

An action in Spark is technically a computation which produces a value that can either be returned to the caller (the driver program) or persisted to external storage. When Spark executes an action, a computation runs and a plain data type is returned, but not an RDD. When a transformation “happens”, an RDD is returned, but no computation is kicked off and no other kind of data type is generated.

In our example above nothing has happened because there was no action in sight. Let’s add a simple foreach action which will go over all words of Moby Dick and print them out in the console. Prepare yourself for a quick stream of unreadable data and also check the output in the WebConsole while the job is being executed.

scala_foreach

Our WebConsole now shows a completed job and its metadata:

scala_jobinfo

If you click on the small dots in the visualization box you’ll get a more detailed information about different stages of the job.

scala_jobinfo_details_stage

Here you can see the different tasks belonging to the job, their localities (which of your nodes executed them), memory consumption and if they were successful or not. Because we’re using a single-node Spark instance all these values are not of much value to us but imagine processing large chunks of data on a cluster. Then you have to have such detailed information.

Caching Data in Apache Spark

Data doesn’t have to be located on persistent storages only. You can also cache it in memory and Spark will provide you all the tools needed to advance your processing logic. We can instruct Spark to cache data by using the cache method of the RDD.

spark_caching_rdd

If we now go to our WebConsole and click the Storage-Link we’d see…well, I assume you already knew it: there’ll be nothing waiting for us. Why? The answer is simple: because a cache method is a Transformation (and it’s a transformation because it returns an RDD), so nothing will be executed by Spark unless we call some action on it. The principle is simple: when I have to return an RDD then I’m a transformation and because I’m a transformation I simply do nothing because I’m lazy. If I return something that’s not an RDD then I’m an action. And because I’m an action I’m everything else but not lazy. This is the mantra of Apache Spark.

Now let’s split all the words by white space so we can count all the words in Moby Dick. Again, we do a transformation first (flatMap) and then we execute the count-Action.

spark_flatMap_transformation

We let Spark count the words for us…this is the moment when everything kicks in…caching the RDD, splitting into single words, and counting them.

spark_count_action

When we now check the Storage information in the WebConsole the page output looks differently:

spark_storage_info

Especially, the second table is interesting. There we encounter a new term: Partitions. What are partitions? Partitions are essentially physical parts of your original “documents” distributed over the nodes belonging to your Spark instance. An RDD is a logical representation of an entity that is physically separated and replicated over different machines. You don’t work with partitions directly because Spark uses them for management purposes only. What you see are RDDs only, and also DataFrames, which represent a newer and more sophisticated API to manage data under Spark. But in this article we’ll only use RDDs. Maybe in some following post we’ll also talk about DataFrames because they map perfectly to Pandas and R DataFrames. The most important difference between Spark’s and “other” DataFrames is the way of execution, which is lazy in Spark’s context, while Python and R load their DataFrames eagerly.

What you should also know is that for Spark an RDD is just an object without any additional, more specialized information. An RDD has to be maximally general and therefore it’s not always very “pleasant” to work with them. Usually problem domains influence the usage of tools and prescribe at least in part what kind of data has to be used, what its structure should look like, the way we query the data etc. In the field of Data Analysis people often deal with more or less structured data like Tables, Matrices, Pivots etc. And this is the reason why using DataFrames in Spark is much more productive than fiddling around with raw RDDs: they come preformatted as columnar structures ready to be queried like any other data storage by using well-known methods.

Connecting with Apache Spark from Jupyter

The examples above are based on Scala, a language I didn’t introduce properly, which is anything but “nice”. However, as this article series will (hopefully) be continued, I’ll try to find a few spots to give some examples in Scala. I’m not sure how interesting a mini-tutorial on Scala would be (or if these articles are in any way useful to anyone) but, being a Loser’s article series, one can only expect a chaotic structure, misleading examples and obviously incomplete explanations. 😳

However, let’s see how the Python shell communicates with Apache Spark. The easiest way is just by executing another shell script from Spark’s bin directory: pyspark. This brings up a command prompt similar to the one from Scala.

pyspark_console

Here we could execute our Python code directly and play the same game as with Scala. But the title of this paragraph contained the term Jupyter and Jupyter is not a console-only tool. So, how to communicate with Spark via Jupyter? Actually, there are many different ways and I’ve found some kernels with complete environments dedicated to Spark but there are also different requirements regarding Jupyter versions (some want only 3.0, others want IPython only and not Jupyter etc.). Therefore I’ll show the simplest and surely the most primitive method to create a SparkContext and send it Python commands from Jupyter.

  • Put the full path to the py4j package into your PYTHONPATH. py4j is located under your Spark’s python-dir.

py4j

  • Then execute the jupyter notebook command from your notebook’s directory (I use Jupyter v4.0)
  • Import SparkConf and SparkContext from pyspark
  • Create a new “local” SparkConf and give it a name
  • Instantiate a new SparkContext by giving it this “local” config
  • After executing the above command you’ll see a bunch of messages flowing into your console
  • Now you can send commands to the new SparkContext-instance

creating_sparkcontext_in_jupyter

The console output shows Spark logs after the context has been initialized:

jupyter_and_spark

Now we can send commands to our new Spark instance:

jupyter_code_for_reading_text

Here we’re using the textFile-Transformation to create a new RDD which we’ll send through our transformation pipeline:

  • Create a flatMap consisting of all words from the book separated by white space
  • map these words to tuples consisting of word-names and a value of 1
  • Use reduceByKey on all tuples to create unique word groups by accumulating their numeric values (simply spoken: count & group words)
  • And finally, the foreach-action triggers some visible “action” by applying the function output to each element in the list of tuples.

pyspark_console_output

The corresponding job information will show up in the Spark WebConsole:

executing_job_from_jupyter

 

Conclusion

I hope this article could explain a few parts of the Apache Spark structure and its abstractions. There’s, of course, much, much more to be explained and I didn’t even touch most of the components (Spark SQL, Streaming, MLlib etc.). Maybe in some later article, because I want to dedicate the next article to scikit-learn and general Machine Learning concepts. But after we’ve learned a bit about ML we surely could use a bit of Apache Spark to execute some nice ML code. What do you think? Just leave a comment. Thank you.  🙂

Regards,

Harris

 

]]>
https://blog.brakmic.com/data-science-for-losers-part-3-scala-apache-spark/feed/ 8
Data Science for Losers, Part 2 – Addendum https://blog.brakmic.com/data-science-for-losers-part-2-addendum/ https://blog.brakmic.com/data-science-for-losers-part-2-addendum/#comments Tue, 13 Oct 2015 13:43:41 +0000 http://blog.brakmic.com/?p=252

This should have been the third part of the Loser’s article series but as you may know I’m trying very hard to keep the overall quality as low as possible. This, of course, implies missing parts, misleading explanations, irrational examples and an awkward English syntax (it’s actually German syntax covered by English-like semantics 😳 ). And that’s why we now have to go through this addendum and not the real Part Three about using Apache Spark with IPython.

The notebook can be found here.

So, let’s talk about a few features from Pandas I’ve forgot to mention in the last two articles.

Playing SQL with DataFrames

Pandas is wonderful because of so many things and one of the most prominent features is the capability of doing SQL-like selections, creating filters and groups without resorting to any SQL-dialect. Here’s how we chain multiple boolean selectors into one query on a DataFrame:

selecting_with_booleans

Grouping entries is equally simple:

grouping_with_pandas
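The two screenshots above can be approximated with a tiny stand-in for the Northwind Orders table (the column names and values below are made up for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    'EmployeeID':  [1, 1, 2, 2, 3],
    'Freight':     [32.4, 11.6, 65.8, 41.3, 3.3],
    'ShipCountry': ['Germany', 'France', 'Germany', 'Brazil', 'France'],
})

# Chained boolean selectors; note the parentheses around each condition
heavy_german = orders[(orders.ShipCountry == 'Germany') & (orders.Freight > 30)]
print(heavy_german)

# A SQL-like GROUP BY without any SQL dialect
print(orders.groupby('EmployeeID')['Freight'].sum())
```

The parentheses matter: & binds tighter than ==, so omitting them raises an error.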

And grouped data can be iterated over by using either iterkeys, itervalues or iteritems

using_iterators

In this example we use an iterator as a filter. The iterator provides us with employee IDs which are used to select certain parts of the Employees-DataFrame:

using_iterators_2

Using lambdas for more concise code is possible, too:

using_lambdas

Here we iterate over the values and map group IDs from the GroupedOrders-DataFrame to the index-based locations in the OrderDetails-DataFrame. We can also pass certain aggregate functions to the groups.

An example with sum.

using_aggregate_functions

But we are not constrained to only one function per call. We can define dictionary objects mapping columns to functions. This is how we calculate means of unit prices and sums of quantities.

using_multiple_aggregators
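A sketch of that dictionary-based aggregation, again with made-up stand-in data for the OrderDetails table:

```python
import pandas as pd

order_details = pd.DataFrame({
    'OrderID':   [1, 1, 2, 2],
    'UnitPrice': [10.0, 20.0, 5.0, 15.0],
    'Quantity':  [3, 1, 4, 2],
})

# Map each column to its own aggregate function
result = order_details.groupby('OrderID').agg(
    {'UnitPrice': 'mean', 'Quantity': 'sum'})
print(result)
```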

A DataFrame can be reindexed very easily. We can also define fill values to be applied whenever a value is missing. This looks way better than columns of NaN entries throughout the table.

reindexing
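A minimal sketch of reindexing with a fill value (the index labels here are made up):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
# Entries introduced by the new index get the fill value instead of NaN
r = s.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)
print(r)
```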

If we omit certain indices pandas will adapt accordingly. If we later bring back an index entry the hidden values will instantly show up. Just leave a few values out and experiment a little bit.

Creating Panels

Series and DataFrames are the two most known structures in Pandas. But there’s another one: Panel. Panels can hold three-dimensional structures which comprise:

  • an item axis, whose entries are DataFrames
  • a major axis, whose entries are DataFrame rows
  • a minor axis, whose entries are DataFrame columns

These DataFrames are related to each other, and Panels maintain the correspondence between the index and column values. Here we define a Panel containing three DataFrames:

3d_structures_with_panels

We see that the first DataFrame owns a full view of the available structure. The other two can only access certain parts of it, and when we iterate over the items of the Panel, i.e. its DataFrames, the output differs accordingly:

iterating_over_items_of_panel

We see that the first output slot, belonging to first_DF, shows all of the selected rows and columns. The other two slots only show some of the available values across the rows and columns:

  • Slot 2 has no access to rows 2 and 4 because in the definition above the second_DF only partially accesses indices 0,1 and 3. Its column rand1 is empty.
  • Slot 3 has no access to rows 0, 1 and 3 because of its definition, third_DF. The columns rand0, rand1 and rand3 remain completely empty and only rows 3 and 4 under column rand3 contain values.

3D Panels are complex and can’t be easily displayed on screen. However, they are very powerful and allow us to control relations that need more than two dimensions, for example time-frame-dependent data, like historical stock prices in finance.

Conclusion

Of course, there are almost countless options, functionalities and other constructs in Pandas and NumPy, and I can only hope we'll see more of them while I'm writing these articles. But as promised, the next “episode” will be devoted to Apache Spark and its usage via IPython (Jupyter).

Maybe I should write a few words about Scala? What do you think?

I would appreciate your feedback (not only regarding Spark & Scala but in general).

Data Science for Losers, Part 2 (Fri, 09 Oct 2015)
https://blog.brakmic.com/data-science-for-losers-part-2/

In the first article we’ve learned a bit about Data Science for Losers. The most important message, in my opinion, is that patterns are everywhere but many of them can’t be immediately recognized. This is one of the reasons why we’re digging deep holes in our databases, data warehouses and other silos. In this article we’ll use a few more methods from Pandas’ DataFrames and generate plots. We’ll also create pivot tables and query an MS SQL database via ODBC. SQLAlchemy will be our helper here, and we’ll see that even Losers like us can easily merge and filter SQL tables without touching SQL syntax. No matter the task, you always need a powerful tool set in the first place, like the Anaconda Distribution we’ll be using here. Our data sources will be things like JSON files containing reddit comments or SQL databases like Northwind. Many ’90s kids used Northwind to learn SQL.  😎

The notebook from this article can be found here.

Analyzing Reddit comments

Our first analysis will use 10,000 comments from a reddit backup, more specifically 10,000 comments from August 2015. There are many more gigabytes of reddit backups available, so feel free to download entries from other time periods. If you prefer Torrent, here’s the link. You can also query these data sets via your browser using Google BigQuery. In the end you should have a JSON file containing many entries which we’ll try to load into a DataFrame. But as we’ve already learned, most data is not in the format we expect it to be. Sooner or later we’ll have to structurally modify our data. Here’s a snippet of our reddit JSON:

reddit_json

Well, this isn’t proper JSON and Pandas would immediately reject it. Let’s try to load it anyway, just to see what happens.

error_with_original_reddit_json

Well, not a good impression. Pandas sees some trailing data, which by itself is misleading enough. But even when a machine complains for the wrong reason, why should we expect our data to be correct anyway? We have to accept that these entries only seemingly look like a JSON structure but don’t actually follow the JSON standard. A collection of JSON objects, that is, reddit comments, has to be inside a JSON array (square brackets, []) and the objects must be separated by commas. Do we see any commas in the document? Yes, but only inside the entries themselves, not between them. And where’s the array comprising all reddit comments? No square brackets in sight. So what are we going to do now? Convert the raw data into a readable format, of course. In this case we’ll use a simple Python snippet to read the comments one by one, put commas in between, remove any newlines and put the comments into a new array.

clean_up_reddit_comments
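The clean-up step might look like the following sketch; the exact file layout (one JSON object per line) is an assumption here:

```python
# A sketch of the clean-up step described above: read one JSON object per
# line and emit a proper JSON array string.
def to_json_array(path):
    with open(path, encoding="utf-8") as f:
        # Strip newlines, drop empty lines, keep one object per list entry.
        objects = [line.strip() for line in f if line.strip()]
    # Join the single objects with commas and wrap them in square brackets.
    return "[" + ",".join(objects) + "]"
```

The returned string is valid JSON and can be handed to `json.loads` or written back to disk for Pandas.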

Now we can load all comments into a DataFrame, but we also have to give Pandas a little more info than usual, because it doesn’t know in advance what the internal JSON structure looks like. Is it a flat array containing only single entries? A dictionary? Or something like objects with properties? Therefore it isn’t enough to just type pd.read_json and expect it to load the unknown structure somehow. Instead we have to type a command like this:

load_json_in_dataframe

Pandas can load JSON either as Series or as DataFrames, depending on the internal structure of the document. In our case we want to generate tables and therefore a DataFrame. For this we need a proper orientation of our data, so we instruct Pandas to use the column orientation. We also want a certain column containing timestamps to be parsed as date entries. Finally, we check the basic properties of our new DataFrame:

testing_json_dataframe
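A small load-and-inspect sketch; the field names are invented stand-ins for the reddit columns, and the right orient (plus convert_dates for the timestamp column) depends on how your cleaned file is laid out:

```python
import pandas as pd
from io import StringIO

# A tiny stand-in for the cleaned reddit array; the field names are assumptions.
raw = '[{"author": "a", "subreddit": "python", "score": 3},' \
      ' {"author": "b", "subreddit": "datascience", "score": 7}]'

# For a plain list of records the "records" orientation fits.
df = pd.read_json(StringIO(raw), orient="records")
print(df.shape)   # (rows, columns)
print(df.dtypes)  # inferred data types
print(df.head())  # first few entries
```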

Now let’s create a horizontal bar-plot showing the highest rated comments for all available sub-reddits:

highest_rated_comments_plot

Nice how quickly we got from unreadable JSON to a plot! And now for something completely different: pivot tables.

first_pivot_table

Nice, but what’s a pivot table by definition? Honestly, we could fill at least a book about it, but let it be said that a pivot table is an automatic data summarization, grouping, filtering and counting tool. Pivot tables visualize data and automatically adjust themselves according to given rules, dimensions and filters. For example, we can decide that it should show the controversiality column of every comment.

first_pivot_table_2

What happens here is that for the two given indices, author & subr, there will be a column controversiality. But this selection is not very useful because our index starts with author. There are many authors and far fewer sub-reddits. Let’s change the order and look at the outcome:

first_pivot_table_3

This now is something completely different, because our authors are grouped by subreddits. It makes a lot more sense, as we can expect many authors to visit several sub-reddits, not just one. Also, the many zero values are not there because the JSON file contained them but because we’ve used the option fill_value=0. Without it the pivot would be filled with NaNs. The last parameter, margins=True, calculates the “totals”. There are many more options, of course. For example, we can instruct Pandas to apply certain aggregate functions to every field or only some of them. Here we want Pandas to calculate sums:

first_pivot_table_4

And suddenly our pivot table looks a lot nicer. No more useless zeroes all along the way. We can add more than one aggregate function. It’s even possible to create a dictionary of aggregate functions: the column is the key entry in the dictionary and the function its corresponding value. Here we let Pandas calculate mean values:

first_pivot_table_5
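A sketch of such a pivot with a dictionary of aggregate functions; the comment data here is invented but mirrors the columns used above:

```python
import pandas as pd

# Toy comment data mirroring the columns used above; the values are invented.
comments = pd.DataFrame({
    "subr": ["python", "python", "datascience"],
    "author": ["a", "b", "a"],
    "controversiality": [1, 0, 1],
    "score": [3, 7, 5],
})

# A dictionary maps each value column to its own aggregate function;
# fill_value=0 replaces the NaNs that missing combinations would produce.
pivot = pd.pivot_table(
    comments,
    index=["subr", "author"],
    aggfunc={"controversiality": "sum", "score": "mean"},
    fill_value=0,
)
print(pivot)
```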

OK, the outcome is pretty silly, but you get the point. It’s a Loser’s way to higher knowledge, anyway. :mrgreen:

A pivot table can be used like any other data source, too. You can execute queries against it and do all the nice filtering stuff as if it were a database.

first_pivot_table_6
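Since the pivot is still just a DataFrame, querying it works out of the box; again the data is invented for illustration:

```python
import pandas as pd

comments = pd.DataFrame({
    "subr": ["python", "python", "datascience"],
    "author": ["a", "b", "a"],
    "score": [3, 7, 5],
})

pivot = pd.pivot_table(comments, index="subr", values="score", aggfunc="sum")

# A pivot table is a plain DataFrame, so query() works on it directly.
popular = pivot.query("score > 6")
print(popular)
```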

Querying Databases with SQLAlchemy and pyODBC

Again, SQLAlchemy surely deserves a book or two, so I’m not going to talk too much about it; it simply doesn’t fit into a simple tutorial like this. In our case all we have to know is that SQLAlchemy is a powerful Python package that lets us query any kind of database without touching its vendor’s specifics. For example, here I’m using an MS SQL Server (just to annoy most of you) and access it via ODBC (yes, still annoying you!). The ODBC setup can be a little daunting, so let me quickly explain how to do it under Windows:

  • Open ODBC Data Sources (just type in ODBC in Windows Search)
  • In the new window click on Add and then select “SQL Server Native Client”

select_sql_server_native_client

  • Click on Finish and then add your data source information: a new name, a description and the full server name like HOSTNAME\SQL-SERVER-NAME

add_new_datasource

  • Click on Next and type in your user/password in the next window (it depends on how you access your MSSQL, integrated or SQL-Auth)

add_access_keys

  • In the next window activate the checkbox “Change the default database to” and select NORTHWIND (download it from here and restore it in MSSQL Manager if you don’t have it already running in your db-server)

select_database

  • Now just click on Next and then on Finish. You can optionally test the settings in the last window presented.

OK, hopefully it wasn’t too annoying.  🙄

We now have to configure SQLAlchemy to talk to our new ODBC resource:

sqlalchemy_settings

Here we see many other “standard” imports, like those for Pandas, NumPy etc. But at the bottom we additionally configure SQLAlchemy. The pyODBC package is needed too, because SQLAlchemy uses it to create an ODBC context.

SQLAlchemy offers the method create_engine, which we feed with a special connection string containing the name of the database driver, in this case mssql+pyodbc, the database catalog information and its access tokens. The result of this call is a reference to an engine object maintaining a connection to the given database catalog. This is what the structure of the NORTHWIND catalog looks like:

northwind
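The engine setup described above might be sketched as follows; note that "user", "secret" and "northwind_dsn" are placeholders for your own credentials and the ODBC data-source name configured earlier, not values from this article:

```python
from sqlalchemy import create_engine

# Connection string layout: dialect+driver://user:password@dsn
# "user", "secret" and "northwind_dsn" are placeholders for your own
# credentials and the ODBC data source configured above.
engine = create_engine("mssql+pyodbc://user:secret@northwind_dsn")
```

This is configuration only; nothing is sent to the server until the engine is actually used in a query.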

First, we let SQLAlchemy access certain tables from the catalog:

sqlalchemy_accessing_tables

We access tables with SQLAlchemy by providing their respective names. The second parameter, engine, maintains the physical connection to the catalog. The resulting references are just normal DataFrames, which is of course an important advantage of Pandas: no matter what type the original data source is, in the end everything ends up in a DataFrame, or a Series if we’re not maintaining multidimensional structures.

everything_is_a_dataframe

We can now execute methods that resemble behaviors known from ordinary databases. For example, Table Joins via merge:

merging_dataframes

Here we instruct Pandas to merge two tables, using certain primary keys from both when combining their rows into a new table. The parameter how instructs Pandas to use an inner join, which means it will only combine rows that belong to both tables. Therefore we’ll not receive any NaN rows. In some cases those could be desirable, though; then use the alternative options left, right or outer.
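The inner join above can be sketched with two tiny stand-ins for the Northwind Orders and Customers tables (the rows are invented):

```python
import pandas as pd

# Simplified stand-ins for the Northwind Orders and Customers tables.
orders = pd.DataFrame({
    "OrderID": [1, 2, 3],
    "CustomerID": ["ALFKI", "ANATR", "XXXXX"],
})
customers = pd.DataFrame({
    "CustomerID": ["ALFKI", "ANATR"],
    "CompanyName": ["Alfreds Futterkiste", "Ana Trujillo"],
})

# An inner join keeps only rows whose CustomerID appears in both tables,
# so the unmatched order "XXXXX" is dropped and no NaN rows appear.
joined = pd.merge(orders, customers, on="CustomerID", how="inner")
print(joined)
```

Swapping how="inner" for "left", "right" or "outer" would keep the unmatched rows and fill their missing fields with NaN.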

Pivots with Tables from SQLAlchemy

And of course it’s possible to generate the same pivot tables with data that came from SQLAlchemy. It’s nothing but DataFrames all the way down. OK, not absolutely all the way down, because there are also Series and NumPy arrays etc., but this is a little too much knowledge for Losers like us. Maybe in some later articles.

pivots_with_tables_from_sqlalchemy

Here we let an aggregate sum function from NumPy be executed only on Quantity fields. Armed with some knowledge from the previous pivot example, it shouldn’t be that hard to create your own pivots. And of course, we can query pivot tables, too.  😎

querying_a_pivot

Data Science for Losers (Wed, 07 Oct 2015)
https://blog.brakmic.com/data-science-for-losers/

 

Anaconda Installation

To do serious statistics with Python one should use a proper distribution like the one provided by Continuum Analytics. Of course, a manual installation of all the needed packages (Pandas, NumPy, Matplotlib etc.) is possible, but beware of the complexities and convoluted package dependencies. In this article we’ll use the Anaconda Distribution. The installation under Windows is straightforward, but avoid multiple parallel Python installations (for example, Python 3 and Python 2 side by side). It’s best to let Anaconda’s Python binary be your standard Python interpreter. After the installation you should run these commands:

conda update conda
conda update --all

“conda” is the package manager of Anaconda and takes care of downloading and installing all the needed packages in your distribution.

conda

After installing the Anaconda Distribution you can download this article’s sources from GitHub. Inside the directory you’ll find a “notebook”. Notebooks are special files for the interactive environment called IPython (or Jupyter). The newer name Jupyter alludes to the fact that newer versions can interpret multiple languages: Julia, Python and R. That is: JuPyteR. More info on Jupyter can be found here.

Running IPython

On the console type ipython notebook and you’ll see a web server being started and automatically assigned a port. A new browser window will open and present the contents of the directory IPython has been started from.

Console output

jupyter_1

Jupyter Main Page

jupyter_2

Via the browser you can load existing notebooks, upload them, or create new ones using the button on the right.

ipython_3

Of course, you can load and manipulate many other file types, but a typical workflow starts with a click on an ipynb file. In this article I’m using my own Twitter statistics for the last three months; the whole logic is in Twitter Analysis.ipynb.

Using Twitter Statistics

Twitter offers a statistics service for its users which makes it possible to download CSV-formatted data containing many interesting entries. Although it’s only possible to download entries with a maximum range of 28 days, one can easily concatenate multiple CSV files via Pandas. But first we have to look inside a typical notebook document and play with it for a while.

ipython_4

We see a window-like structure with menus, but the most interesting part is that a notebook acts like any other application window. By heavily utilizing browser and JavaScript-based technologies, IPython notebooks offer powerful functionality like immediate execution of code (just press SHIFT+Enter), creating new edit cells (press B for “below” or A for “above”), deleting them (press D twice) and many other options described in the help menu. The gray area in the above screenshot is basically an editor like IDLE and accepts any command or even Markdown (just change the option “Code” to “Markdown”). And of course it supports code completion  😀

code_completion

Pandas, Matplotlib and Seaborn

To describe Pandas one would need a few books. The same applies to Matplotlib and Seaborn. But because I’m writing an article for Losers like me, I feel no obligation to describe everything at once or in great detail. Instead, I’ll focus on a few very simple tasks which are part of any serious data analysis (this article, however, is surely not a serious data analysis):

  • Collecting Data
  • Checking, Adjusting and Cleaning Data
  • Data Analysis (in the broadest sense of the meaning)

First we collect data using the most primitive yet ubiquitous method: we download a few CSV files containing monthly user data from Twitter Analytics.

export_data

We do this a few times for different ranges. Later we’ll concatenate them into one big Pandas DataFrame. But before that we have to load all the needed libraries. This is done in the first code area with several import statements. We import Pandas, Matplotlib and NumPy. The two special statements with a % sign in front are so-called “magic commands”. In this case we instruct the environment to render graphics inside the current window. Without these commands the generated graphics would be shown in a separate browser pop-up, which can quickly become an annoyance.

import_packages

Next we download the statistics data from Twitter. Afterwards, we instruct Pandas to load the CSV files by giving it their respective paths. The return values of these operations are new DataFrame objects which resemble Excel sheets. A DataFrame comprises array-like structures called Series, one per column. Technically, these Series are based on NumPy’s arrays, which are known to be very fast. But for Losers like us this is (still) not that important. More important is the question: what to do next with all these DataFrames?

Well, let’s take one of them to present a few “standard moves” a data scientist makes when touching unknown data.

How many columns and rows are inside?

get_shape

What are the column names?

get_columns

Which data types?

get_data_types

Let the DataFrame describe itself by providing mean values, standard deviations etc.

get_description

It’s also recommended to use the head() and tail() methods to read the first and last few entries. This serves the purpose of quickly checking whether all data was properly transferred into memory. Often one finds some additional entries at the bottom of a file (just load any Excel file and you’ll know what I’m talking about).

get_first_few_entries

Concatenating Data for further processing

After having checked the data and learned a little about it, we want to combine all the available DataFrames into one. This can be done easily with the append() method. We’ll also export the concatenated result to a new CSV file so that in future we don’t have to repeat the concatenation process again and again. Note the parameter ignore_index, which instructs Pandas to discard the original index entries that repeat across the different files sharing the same structure. Without this option the duplicate indices would cause the integrity check to fail. We also check for any integrity errors.

concatenating_dataframes
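The append() method has since been removed from Pandas (in 2.0); a sketch of the same concatenation with its replacement, pd.concat, using invented monthly data:

```python
import pandas as pd

# Two monthly exports sharing the same columns; the values are invented.
july = pd.DataFrame({"Tweet id": [1, 2], "retweets": [0, 3]})
august = pd.DataFrame({"Tweet id": [3, 4], "retweets": [5, 1]})

# ignore_index rebuilds a fresh 0..n-1 index; without it the per-file
# indices 0,1,0,1 would collide during an integrity check.
combined = pd.concat([july, august], ignore_index=True)
combined.to_csv("all_tweets.csv", index=False)
print(combined)
```

The exported CSV can then be loaded directly in later sessions instead of repeating the concatenation.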

Using only interesting parts

More data is better than less data? It depends. In our case we don’t need all the columns Twitter provides, so we decide to cut out a certain part which we’ll use throughout the rest of this article.

cut_out_data_parts_

Here we slice the DataFrame by giving it an array of column names. The returned value is a new DataFrame containing only the columns with corresponding names.
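A minimal sketch of that slicing step; the column names are invented stand-ins for the real Twitter export:

```python
import pandas as pd

# Invented Twitter-export columns; real exports contain many more.
tweets = pd.DataFrame({
    "Tweet id": [1, 2],
    "time": ["2015-09-01 12:00 +0000", "2015-09-02 13:30 +0000"],
    "impressions": [120, 80],
    "retweets": [3, 0],
})

# Passing a list of column names returns a new DataFrame with only those columns.
subset = tweets[["time", "retweets", "impressions"]]
print(subset.columns.tolist())
```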

Adjusting and cleaning data

Often, data is not in the expected format. In our case we have an important column named “time” which represents time values, but not the way Pandas wants them. The time-zone flag “Z” is missing, and instead a weird “+0000” value is appended to each time entry. We now have to clean up our data.

clean_up_data

Here we use a list comprehension to iterate over the time entries and replace the “ +0000” part of their contents with “Z”. Afterwards, we change the data type of all rows in the column “time” to “datetime64[ns]”. In both cases we use the slicing features of the Pandas library. In the first command we use the loc() indexer to select rows/columns by labels. There’s also an alternative way of selecting via positions with iloc(). In both statements we select all rows by using the colon operator without giving any begin or end ranges.
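A sketch of that clean-up, with invented sample timestamps; pd.to_datetime stands in for the astype("datetime64[ns]") call, since it handles zone-flagged strings more robustly:

```python
import pandas as pd

tweets = pd.DataFrame({"time": ["2015-09-01 12:00 +0000", "2015-09-02 13:30 +0000"]})

# Replace the trailing " +0000" with the "Z" zone flag, selecting all rows
# via loc and the bare colon, as described above.
tweets.loc[:, "time"] = [t.replace(" +0000", "Z") for t in tweets["time"]]

# Convert the cleaned strings to real timestamps.
tweets["time"] = pd.to_datetime(tweets["time"], utc=True)
print(tweets.dtypes)
```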

Filtering data

So, our data is now clean and formatted the way we want it before doing any analysis. Next, we select data according to our criteria (or those of our customers, pointy-haired bosses etc.).

filtering_data

Here we’re interested in tweets with at least three retweets. Such tweets we consider “successful”. Of course, the meaning of “successful” could lead to an endless discussion; the adjective serves only as an example of how complex and multi-layered an analysis task can become. Very often your customers, bosses, colleagues etc. will approach you with vague questions you first have to distill and “sharpen” before doing any serious work.

This seemingly simple statement shows one of the powerful Pandas features: we can select data directly in the index field by providing comparisons or even boolean algebra. The result of the operation is a new DataFrame with complete rows whose “retweets” field contains values greater than or equal to 3. It’s like using SELECT * FROM Tweets WHERE retweets >= 3.
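The boolean selection can be sketched like this, with invented tweet data:

```python
import pandas as pd

tweets = pd.DataFrame({"text": ["a", "b", "c"], "retweets": [0, 3, 7]})

# The expression inside the brackets produces a boolean mask; only rows
# where the mask is True survive, like a SQL WHERE clause.
successful = tweets[tweets["retweets"] >= 3]
print(successful)
```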

Visualizing data with Seaborn

creating_barplot.png_

Finally, we want to visualize our analysis results. There are many libraries for plotting data points. In this case we’re using the Seaborn package, which itself utilizes Matplotlib. Later we’ll create an alternative graph using Matplotlib only. Our current graph should visualize the distribution of our successful tweets over time. For this we use a Seaborn-generated bar plot which expects values for the X and Y axes. Additionally, we rotate the time values on the X axis by 90 degrees. Finally, we plot the bar chart. Depending on your resolution, a slight modification of the figure properties could be needed.

Visualizing data with Matplotlib

Here we use Matplotlib directly, providing it two Pandas Series: time and retweets. The result is a dashed-line graph. Of course, there are many more options within Matplotlib’s powerful methods; this graph is just a very simple example.

alternative_plot_2
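A sketch of such a dashed-line plot with invented time/retweet values; the Agg backend just renders off-screen so it also works without a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen instead of opening a window
import matplotlib.pyplot as plt
import pandas as pd

# Invented time/retweet values standing in for the cleaned Twitter data.
tweets = pd.DataFrame({
    "time": pd.to_datetime(["2015-09-01", "2015-09-05", "2015-09-09"]),
    "retweets": [3, 7, 4],
})

fig, ax = plt.subplots()
ax.plot(tweets["time"], tweets["retweets"], "r--")  # red dashed line graph
ax.set_xlabel("time")
ax.set_ylabel("retweets")
fig.autofmt_xdate()  # tilt the time labels so they stay readable
fig.savefig("retweets.png")
```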
