<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Chaosmail Blog</title>
    <description>Big Data and Artificial Intelligence Expert at Microsoft, Tech Author and Speaker</description>
    <link>https://chaosmail.github.io//</link>
    <atom:link href="https://chaosmail.github.io//feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Thu, 06 Jun 2019 10:18:03 +0000</pubDate>
    <lastBuildDate>Thu, 06 Jun 2019 10:18:03 +0000</lastBuildDate>
    <generator>Jekyll v3.8.5</generator>
    
      <item>
        <title>Getting Started with Microsoft SQL 2019 Big Data clusters</title>
<description>&lt;p&gt;Microsoft’s latest &lt;a href=&quot;https://www.microsoft.com/en-us/sql-server/sql-server-2019&quot;&gt;SQL Server 2019&lt;/a&gt; (preview) comes in a new flavor: the SQL Server 2019 Big Data cluster (BDC). There are a couple of cool things about the BDC version: 
(1) it runs on &lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt;
(2) it integrates a sharded SQL engine
(3) it integrates &lt;a href=&quot;https://hadoop.apache.org/docs/current1/hdfs_design.html#Introduction&quot;&gt;HDFS&lt;/a&gt; (a distributed file storage)
(4) it integrates &lt;a href=&quot;http://spark.apache.org/&quot;&gt;Spark&lt;/a&gt; (a distributed compute engine)
(5) and both services Spark and HDFS run behind an &lt;a href=&quot;https://knox.apache.org/&quot;&gt;Apache Knox&lt;/a&gt; Gateway (HTTPS application gateway for Hadoop).&lt;/p&gt;

&lt;p&gt;On top of that, using Polybase you can connect to many different external data sources such as MongoDB, Oracle, Teradata, SAP HANA, and more. This makes SQL Server 2019 Big Data cluster (BDC) a scalable, performant, and maintainable platform that serves as SQL engine, Data Warehouse, Data Lake, and Data Science workbench alike - in the cloud as well as on-premises. In this blog post I want to give you a quick-start tutorial on &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15&quot;&gt;SQL 2019 Big Data clusters (BDC)&lt;/a&gt; and show you how to set one up on Azure Kubernetes Services (AKS), upload some data to HDFS, and access the data from SQL and Spark.&lt;/p&gt;

&lt;h2 id=&quot;sql-server-2019-big-data-cluster-bdc&quot;&gt;SQL Server 2019 Big Data cluster (BDC)&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15&quot;&gt;SQL Server 2019 Big Data cluster (BDC)&lt;/a&gt; is one of the most exciting pieces of technology I have seen in a long time. Here is why.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/SQL-Server-2019-big-data-cluster.png&quot; alt=&quot;SQL Server 2019 for Big Data architecture Source: [microsoft.com/sqlserver](https://cloudblogs.microsoft.com/sqlserver/2018/09/25/introducing-microsoft-sql-server-2019-big-data-clusters/)&quot; title=&quot;SQL Server 2019 for Big Data architecture&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;kubernetes&quot;&gt;Kubernetes&lt;/h3&gt;
&lt;p&gt;SQL Server 2019 builds on a new abstraction layer called the &lt;em&gt;Platform Abstraction Layer&lt;/em&gt; (PAL) which lets you run SQL Server on multiple platforms and environments, such as Windows, Linux, and containers. Taking this one step further, we can run SQL Server clusters entirely within Kubernetes - either locally (e.g. on &lt;a href=&quot;https://kubernetes.io/docs/setup/minikube/&quot;&gt;Minikube&lt;/a&gt;), on on-premises clusters, or in the cloud (e.g. on Azure Kubernetes Services). All data is persisted using &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/concept-data-persistence?view=sqlallproducts-allversions&quot;&gt;&lt;em&gt;Persistent Volumes&lt;/em&gt;&lt;/a&gt;. To facilitate operations, there is a new &lt;code class=&quot;highlighter-rouge&quot;&gt;mssqlctl&lt;/code&gt; command to scaffold, configure, and scale SQL Server 2019 clusters in Kubernetes.&lt;/p&gt;
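&lt;p&gt;As a quick sketch of what this looks like in practice - assuming &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; is configured against such a cluster, and using an illustrative cluster name - you can inspect the deployed SQL Server pods like any other Kubernetes workload:&lt;/p&gt;

```sh
# Each BDC component (master instance, compute pool, storage pool, controller)
# runs as a set of pods in a Kubernetes namespace named after the cluster.
CLUSTER_NAME="mssql-cluster"   # illustrative name - use your own cluster name

# List the BDC pods; the guard makes this a no-op where kubectl is missing.
if command -v kubectl >/dev/null; then
  kubectl get pods -n "$CLUSTER_NAME" -o wide || echo "cluster not reachable"
fi
```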

&lt;h3 id=&quot;sql-master-instance&quot;&gt;SQL Master Instance&lt;/h3&gt;
&lt;p&gt;If you deploy SQL Server 2019 as a cluster in &lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt;, it comes with a SQL &lt;em&gt;Master Instance&lt;/em&gt; and multiple SQL engine compute and storage shards. The great thing about the Master Instance is that it is just a normal SQL instance - you can use all existing tooling, code, etc. and interact with the SQL Server cluster as if it were a single DB instance. If you stream data to the cluster, you can stream it directly to the SQL shards without going through the Master Instance, which gives you optimal throughput performance.&lt;/p&gt;
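&lt;p&gt;For example, connecting with &lt;code class=&quot;highlighter-rouge&quot;&gt;sqlcmd&lt;/code&gt; works just like against a stand-alone instance. The endpoint below is a placeholder, and port 31433 is an assumption based on the CTP defaults:&lt;/p&gt;

```sh
# Illustrative endpoint - the external IP comes from `kubectl get svc`;
# 31433 was the default port of the master instance in the CTP releases.
MASTER_ENDPOINT="13.94.132.100,31433"
SA_PASSWORD="MySQLBigData2019"

# Query the master instance exactly like any stand-alone SQL Server
# (guarded so the snippet is a no-op where sqlcmd is not installed).
if command -v sqlcmd >/dev/null; then
  sqlcmd -S "$MASTER_ENDPOINT" -U sa -P "$SA_PASSWORD" \
    -Q "SELECT @@VERSION" || echo "master instance not reachable"
fi
```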

&lt;h3 id=&quot;polybase&quot;&gt;Polybase&lt;/h3&gt;
&lt;p&gt;You might know &lt;em&gt;Polybase&lt;/em&gt; from SQL Server 2016 as a service that lets you connect to flat HDFS data sources. With SQL Server 2019, you can now also connect to relational data sources (e.g. Oracle, Teradata, SAP HANA) and NoSQL data sources (e.g. MongoDB, Cosmos DB) using Polybase and &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/polybase/data-virtualization?view=sqlallproducts-allversions&quot;&gt;external tables&lt;/a&gt; - both with &lt;a href=&quot;https://blogs.msdn.microsoft.com/sql_server_team/predicate-pushdown-and-why-should-i-care/&quot;&gt;predicate pushdown filters&lt;/a&gt;. It’s a fantastic feature that turns your SQL Server 2019 cluster into your central data hub.&lt;/p&gt;
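&lt;p&gt;To give a rough idea, here is a hedged sketch of what such an external table could look like - the endpoint, connection string, credential, and all table and column names are made up for illustration; see the external tables documentation linked above for the exact syntax:&lt;/p&gt;

```sh
# Hypothetical example: expose a MongoDB collection as an external table.
# Endpoint, credential, database, and column names are all placeholders.
MASTER_ENDPOINT="13.94.132.100,31433"
EXTERNAL_TABLE_SQL="
CREATE EXTERNAL DATA SOURCE MyMongo
  WITH (LOCATION = 'mongodb://mongo-server:27017', CREDENTIAL = MongoCred);
CREATE EXTERNAL TABLE dbo.ext_orders (order_id INT, amount FLOAT)
  WITH (LOCATION = 'salesdb.orders', DATA_SOURCE = MyMongo);
SELECT TOP 10 * FROM dbo.ext_orders;"

# Run the statements against the master instance (no-op without sqlcmd).
if command -v sqlcmd >/dev/null; then
  sqlcmd -S "$MASTER_ENDPOINT" -U sa -P "***" \
    -Q "$EXTERNAL_TABLE_SQL" || echo "master instance not reachable"
fi
```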

&lt;h3 id=&quot;hdfs&quot;&gt;HDFS&lt;/h3&gt;
&lt;p&gt;Now comes the fun part. When you deploy a SQL Server 2019 BDC, you also deploy a &lt;a href=&quot;https://hadoop.apache.org/docs/current1/hdfs_design.html#Introduction&quot;&gt;&lt;em&gt;Hadoop Distributed Filesystem&lt;/em&gt; (HDFS)&lt;/a&gt; within Kubernetes. With the &lt;a href=&quot;https://www.microsoft.com/en-us/research/project/tiered-storage/&quot;&gt;tiered storage feature in HDFS&lt;/a&gt; you can also mount existing HDFS clusters into the integrated SQL Server 2019 HDFS. Using the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/polybase/configure-scale-out-groups-windows?view=sqlallproducts-allversions&quot;&gt;integrated Polybase scale-out groups&lt;/a&gt; you can efficiently access this distributed data from SQL with &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/polybase/data-virtualization-csv?view=sqlallproducts-allversions&quot;&gt;external tables&lt;/a&gt;. If you install SQL Server 2019 as a BDC, all configuration of these services is done automatically, even pass-through authentication. These features allow your SQL Server 2019 cluster to become the central data store for both structured relational data and massive volumes of flat, unstructured data.&lt;/p&gt;
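&lt;p&gt;Since the HDFS endpoint sits behind the Knox gateway, the integrated HDFS can be reached over the standard WebHDFS REST API. A minimal sketch, assuming the CTP default port 30443, the default Knox gateway path, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;root&lt;/code&gt; user - all of these are placeholders to replace with your cluster’s values:&lt;/p&gt;

```sh
# Illustrative values: the gateway IP comes from `kubectl get svc`;
# port 30443 and the /gateway/default/webhdfs path follow the CTP defaults.
KNOX_ENDPOINT="https://13.94.132.101:30443/gateway/default/webhdfs/v1"
KNOX_PASSWORD="MySQLBigData2019"

# List the HDFS root via the WebHDFS REST API; -k accepts the gateway's
# self-signed certificate (no-op where curl is not installed).
if command -v curl >/dev/null; then
  curl -sk -u "root:$KNOX_PASSWORD" --connect-timeout 3 \
    "$KNOX_ENDPOINT/?op=LISTSTATUS" || echo "gateway not reachable"
fi
```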

&lt;h3 id=&quot;spark&quot;&gt;Spark&lt;/h3&gt;
&lt;p&gt;And it’s getting better. The SQL Server 2019 BDC also includes a &lt;a href=&quot;http://spark.apache.org/&quot;&gt;&lt;em&gt;Spark&lt;/em&gt;&lt;/a&gt; runtime co-located with the HDFS data pools. For me - coming from a Big Data background - this is huge! It means you can take advantage of all Spark features (SparkSQL, DataFrames, MLlib for machine learning, GraphX for graph processing, Structured Streaming for stream processing, and much more) directly within your SQL cluster. Now your SQL 2019 cluster can also serve your data scientists and data engineers as a central Big Data hub. Thanks to the integration of &lt;a href=&quot;https://livy.incubator.apache.org/&quot;&gt;Apache Livy&lt;/a&gt; (a REST gateway for Spark) you can use this functionality with your existing tooling, such as Jupyter or Zeppelin notebooks, out of the box.&lt;/p&gt;
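&lt;p&gt;Because Livy is a plain REST API, you can also submit Spark jobs with nothing but &lt;code class=&quot;highlighter-rouge&quot;&gt;curl&lt;/code&gt;. The following is a hedged sketch based on Livy’s generic &lt;code class=&quot;highlighter-rouge&quot;&gt;/batches&lt;/code&gt; endpoint - the gateway address, port, path, and jar location are placeholders:&lt;/p&gt;

```sh
# Illustrative sketch: submit a Spark batch job through the Knox-proxied
# Livy REST API. Endpoint, port, and the jar path are placeholders.
LIVY_ENDPOINT="https://13.94.132.101:30443/gateway/default/livy/v1/batches"
KNOX_PASSWORD="MySQLBigData2019"

# POST a batch definition as JSON (no-op where curl is not installed).
if command -v curl >/dev/null; then
  curl -sk -u "root:$KNOX_PASSWORD" --connect-timeout 3 \
    -H "Content-Type: application/json" \
    -X POST "$LIVY_ENDPOINT" \
    -d '{"file": "/jar/my-spark-job.jar", "className": "MyJob"}' || echo "gateway not reachable"
fi
```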

&lt;h3 id=&quot;much-more--knox-grafana-ssis-report-server-etc&quot;&gt;Much More … (Knox, Grafana, SSIS, Report Server, etc.)&lt;/h3&gt;
&lt;p&gt;Once everything runs in Kubernetes, you can add many more services to the cluster and manage, operate, and scale them together. The Spark and HDFS endpoints are exposed through an &lt;a href=&quot;https://knox.apache.org/&quot;&gt;Apache Knox Gateway&lt;/a&gt; (an HTTPS application gateway for Hadoop) and can therefore be integrated with many other existing services (e.g. processes writing to HDFS). SQL Server 2019 BDC ships with an integrated &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/cluster-admin-portal?view=sqlallproducts-allversions&quot;&gt;Cluster Configuration portal&lt;/a&gt; and a Grafana dashboard for monitoring all relevant service metrics.&lt;/p&gt;

&lt;p&gt;Deploying other co-located services to the same Kubernetes cluster becomes quite easy. Services such as Integration Services, Analysis Services or Report Server can simply be deployed and scaled to the same SQL Server 2019 cluster as additional Kubernetes pods.&lt;/p&gt;

&lt;p&gt;Another cool feature of SQL Server 2019 worth mentioning: alongside Python and R, it will also support User Defined Functions (UDFs) written in Java. Niels Berglund has many examples in his &lt;a href=&quot;http://www.nielsberglund.com/s2k19_ext_framework_java/&quot;&gt;blog post series&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;installation&quot;&gt;Installation&lt;/h2&gt;

&lt;p&gt;Currently, SQL Server 2019 and SQL Server 2019 Big Data cluster (BDC) are still in private preview. Hence, you need to apply for the &lt;a href=&quot;https://sqlservervnexteap.azurewebsites.net/&quot;&gt;Early Adoption Program&lt;/a&gt;, which will grant you access to Microsoft’s private registry and SQL Server 2019 images. You are also assigned a buddy (a PM on the SQL Server 2019 team) and granted access to a private Teams channel. So if you want to try it out today, you should definitely sign up!&lt;/p&gt;

&lt;p&gt;In this section we will go through the prerequisites and installation process as documented in the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sql-server-ver15&quot;&gt;SQL Server 2019 installation guidelines&lt;/a&gt; for Big Data analytics. In the documentation, you will find a link to a &lt;a href=&quot;https://github.com/Microsoft/sql-server-samples/blob/master/samples/features/sql-big-data-cluster/deployment/aks/deploy-sql-big-data-aks.py&quot;&gt;Python script&lt;/a&gt; that allows you to spin up SQL 2019 on Azure Kubernetes Services (AKS).&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If you want to install SQL Server 2019 BDC on your on-premise Kubernetes cluster, you can follow the steps in &lt;a href=&quot;https://chrisadkin.io/2018/12/18/building-a-kubernetes-cluster-for-sql-server-2019-big-data-clusters-part-1-hyper-v-virtual-machine-creation/&quot;&gt;Christopher Adkin’s Blog&lt;/a&gt;. You can find an official deployment guide for BDC on &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-on-minikube?view=sql-server-ver15&quot;&gt;Minikube in the Microsoft docs&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;prerequisites-kubernetes-and-mssql-clients&quot;&gt;Prerequisites: Kubernetes and MSSQL clients&lt;/h3&gt;

&lt;p&gt;To &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sql-server-ver15&quot;&gt;deploy a SQL Server 2019 Big Data cluster (BDC)&lt;/a&gt; on Azure Kubernetes Services (AKS), you need the following tools installed. For this tutorial, I installed all these tools on Ubuntu 18.04 LTS on &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/wsl/install-win10&quot;&gt;WSL&lt;/a&gt; (Windows Subsystem for Linux).&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest&quot;&gt;Azure CLI (install latest)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/tasks/tools/install-kubectl/&quot;&gt;kubectl&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-install-mssqlctl?view=sqlallproducts-allversions&quot;&gt;mssqlctl&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sqlservervnexteap.azurewebsites.net/&quot;&gt;SQL Server 2019 Early Adoption Program&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To avoid any problems with Kubernetes APIs, it’s best to install the same &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; version as the Kubernetes version on AKS. In the SQL Server 2019 docs, the version &lt;code class=&quot;highlighter-rouge&quot;&gt;1.12.6&lt;/code&gt; is recommended. Hence, in this case we also &lt;a href=&quot;https://kubernetes.io/docs/tasks/tools/install-kubectl/&quot;&gt;install&lt;/a&gt; the Kubernetes &lt;code class=&quot;highlighter-rouge&quot;&gt;1.12.6&lt;/code&gt; client.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get install &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; apt-transport-https
curl &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; https://packages.cloud.google.com/apt/doc/apt-key.gpg | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-key add -
&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;deb https://apt.kubernetes.io/ kubernetes-xenial main&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;tee &lt;span class=&quot;nt&quot;&gt;-a&lt;/span&gt; /etc/apt/sources.list.d/kubernetes.list
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get install &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;kubectl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1.12.6-00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;mssqlctl&lt;/code&gt; tool is a handy command-line utility that allows you to create and manage SQL Server 2019 Big Data cluster installations. You can &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-install-mssqlctl?view=sqlallproducts-allversions&quot;&gt;install it&lt;/a&gt; using pip with the following command:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pip3 install &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt;  https://private-repo.microsoft.com/python/ctp-2.4/mssqlctl/requirements.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Before you continue, make sure that both the &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;mssqlctl&lt;/code&gt; commands are available. If they are not, you may need to restart your current bash session.&lt;/p&gt;
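&lt;p&gt;A small sanity check for this (just a convenience, not part of the official docs):&lt;/p&gt;

```sh
# Quick check that both command-line tools are on the PATH.
for tool in kubectl mssqlctl; do
  if command -v "$tool" >/dev/null; then
    echo "$tool: found"
  else
    echo "$tool: missing - restart your bash session or re-install"
  fi
done
```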

&lt;h3 id=&quot;prerequisites-azure-data-studio&quot;&gt;Prerequisites: Azure Data Studio&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/azure-data-studio/what-is?view=sqlallproducts-allversions&quot;&gt;Azure Data Studio&lt;/a&gt; is a cross-platform management tool for Microsoft databases. It’s like SQL Server Management Studio built on top of the popular VS Code editor engine: a rich T-SQL editor with IntelliSense and plugin support. Currently, it’s the easiest way to connect to the different SQL Server 2019 endpoints (SQL, HDFS, and Spark). To do so, you need to &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/azure-data-studio/download?view=sqlallproducts-allversions&quot;&gt;install Data Studio&lt;/a&gt; and the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-server-2019-extension?view=sqlallproducts-allversions&quot;&gt;SQL Server 2019 extension&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The following screenshot (Source: &lt;a href=&quot;https://github.com/Microsoft/azuredatastudio&quot;&gt;Microsoft/azuredatastudio&lt;/a&gt;) shows an overview of Azure Data Studio and its capabilities.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/data-studio-overview.jpg&quot; alt=&quot;Azure Data Studio overview&quot; title=&quot;Azure Data Studio overview&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Azure Data Studio also supports Jupyter-style notebooks for T-SQL and Spark. The following screenshot shows Data Studio with the notebooks extension.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/data-studio.png&quot; alt=&quot;Azure Data Studio with SQL Server 2019 extension&quot; title=&quot;Azure Data Studio with SQL Server 2019 extension&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;install-sql-server-2019-bdc-on-azure-kubernetes-services-aks&quot;&gt;Install SQL Server 2019 BDC on Azure Kubernetes Services (AKS)&lt;/h3&gt;

&lt;p&gt;In this section, we will follow the steps from the &lt;a href=&quot;https://github.com/Microsoft/sql-server-samples/blob/master/samples/features/sql-big-data-cluster/deployment/aks/deploy-sql-big-data-aks.py&quot;&gt;installation script&lt;/a&gt; to install SQL Server 2019 for Big Data on AKS. I will give you a bit more detail and explanation about the executed steps. If you just want to install SQL Server 2019 for Big Data, you can simply run the script directly.&lt;/p&gt;

&lt;p&gt;First, we start by setting all required parameters for the installation process. For the Docker username and password, please use the credentials you received after registering for the &lt;a href=&quot;https://sqlservervnexteap.azurewebsites.net/&quot;&gt;SQL Server Early Adoption&lt;/a&gt; program. They give you access to Microsoft’s private registry with the latest SQL Server 2019 images.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Provide your Azure subscription ID&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;SUBSCRIPTION_ID&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide Azure resource group name to be created&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;GROUP_NAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;demos.sql2019&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Provide Azure region&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;AZURE_REGION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;westeurope&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide VM size for the AKS cluster&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;VM_SIZE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Standard_L4s&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide number of worker nodes for AKS cluster&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;AKS_NODE_COUNT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;3&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide supported Kubernetes version&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;KUBERNETES_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1.12.7&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# This is both Kubernetes cluster name and SQL Big Data cluster name&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide name of AKS cluster and SQL big data cluster&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sqlbigdata&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide username to be used for Controller user&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CONTROLLER_USERNAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;admin&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# This password will be use for Controller user, Knox user and SQL Server Master SA accounts&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide password to be used for Controller user, Knox user and SQL Server Master SA accounts&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;MySQLBigData2019&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CONTROLLER_PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PASSWORD&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;MSSQL_SA_PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PASSWORD&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;KNOX_PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PASSWORD&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Private Microsoft registry&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_REGISTRY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;private-repo.microsoft.com&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_REPOSITORY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;mssql-private-preview&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# if brave choose &quot;latest&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_IMAGE_TAG&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ctp2.4&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_IMAGE_POLICY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;IfNotPresent&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Provide your Docker username and email&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_USERNAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_EMAIL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide your Docker password&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_PRIVATE_REGISTRY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# aks | minikube | kubernetes&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CLUSTER_PLATFORM&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;aks&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;ACCEPT_EULA&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Y&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;STORAGE_SIZE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;10Gi&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;First, let’s create a new resource group.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az group create &lt;span class=&quot;nt&quot;&gt;--subscription&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$SUBSCRIPTION_ID&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--location&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AZURE_REGION&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$GROUP_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, we can go ahead and create the AKS cluster.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az aks create &lt;span class=&quot;nt&quot;&gt;--subscription&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$SUBSCRIPTION_ID&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--location&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AZURE_REGION&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--resource-group&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$GROUP_NAME&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--generate-ssh-keys&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--node-vm-size&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$VM_SIZE&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--node-count&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AKS_NODE_COUNT&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--kubernetes-version&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$KUBERNETES_VERSION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If this command fails, the selected Kubernetes version might not be supported in your region. You can check which versions are supported using the following command.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az aks get-versions &lt;span class=&quot;nt&quot;&gt;--location&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AZURE_REGION&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Please note: if the aks command fails to create the Service Principal in your Azure Active Directory (as can happen, for example, for Microsoft employees), you can create the principal manually beforehand:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az ad sp create-for-rbac &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--skip-assignment&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;appId&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;displayName&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;http://***&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;password&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;tenant&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# assign appId and password values&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ SP_APP_ID&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;app_id&amp;gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ SP_PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;password&amp;gt;&quot;&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az aks create &lt;span class=&quot;nt&quot;&gt;--subscription&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$SUBSCRIPTION_ID&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--location&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AZURE_REGION&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--resource-group&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$GROUP_NAME&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--generate-ssh-keys&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--node-vm-size&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$VM_SIZE&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--node-count&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AKS_NODE_COUNT&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--kubernetes-version&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$KUBERNETES_VERSION&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--service-principal&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$SP_APP_ID&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--client-secret&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$SP_PASSWORD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the next step, we retrieve the credentials for the cluster. This will register the credentials in the &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; config.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az aks get-credentials &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--resource-group&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$GROUP_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--admin&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--overwrite-existing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
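&lt;p&gt;As a quick sanity check (not part of the original script), you can verify that &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; now points at the new cluster:&lt;/p&gt;

```sh
# The active kubectl context should now be the new AKS cluster,
# reporting 3 Ready agent nodes (matching AKS_NODE_COUNT above).
EXPECTED_NODES="3"

if command -v kubectl >/dev/null; then
  kubectl config current-context || echo "no kubectl context configured"
  kubectl get nodes || echo "cluster not reachable"
fi
```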

&lt;p&gt;In order to access the Kubernetes dashboard, we also need to create a role binding. I took this line from &lt;a href=&quot;https://pascalnaber.wordpress.com/2018/06/17/access-dashboard-on-aks-with-rbac-enabled/&quot;&gt;Pascal Naber’s blog post&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl create clusterrolebinding kubernetes-dashboard &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; kube-system &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--clusterrole&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;cluster-admin &lt;span class=&quot;nt&quot;&gt;--serviceaccount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;kube-system:kubernetes-dashboard
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next, we can open the Kubernetes dashboard for the newly created AKS cluster and check that everything looks fine. To do so, we forward the required ports to localhost.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az aks browse &lt;span class=&quot;nt&quot;&gt;--resource-group&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$GROUP_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Kubernetes dashboard should now be available via &lt;a href=&quot;http://localhost:8001&quot;&gt;http://localhost:8001&lt;/a&gt;. I recommend opening it and taking a look at your newly created cluster.&lt;/p&gt;

&lt;p&gt;Finally, we can deploy SQL Server 2019 BDC on the Kubernetes cluster using the &lt;code class=&quot;highlighter-rouge&quot;&gt;mssqlctl&lt;/code&gt; command-line utility.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;mssqlctl cluster create &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Great, that was it! You are now ready to get started. The following figure shows the Kubernetes dashboard with an installed instance of SQL Server 2019 BDC. You can see the Storage, Data and Compute pools as well as the SQL Master instance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/kubernetes.png&quot; alt=&quot;Kubernetes dashboard for SQL Server 2019 BDC&quot; title=&quot;Kubernetes dashboard for SQL Server 2019&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;querying-sql-server-2019-bdc&quot;&gt;Querying SQL Server 2019 BDC&lt;/h2&gt;

&lt;p&gt;For this section, we will use Azure Data Studio with the SQL Server 2019 extension, which lets us connect to both the SQL Server endpoint and the Knox endpoint for HDFS and Spark.&lt;/p&gt;

&lt;h3 id=&quot;working-with-hdfs&quot;&gt;Working with HDFS&lt;/h3&gt;

&lt;p&gt;First, we will put some data into the Big Data cluster. Let’s retrieve the IP address and port of the Knox endpoint for HDFS.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get service service-security-lb &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;custom-columns&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;IP:.status.loadBalancer.ingress[0].ip,PORT:.spec.ports[0].port&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From this IP address and port, we can build the WebHDFS URL and use any HDFS client to connect to the file system.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;https://&amp;lt;service-security-lb service external IP address&amp;gt;:30433/gateway/default/webhdfs/v1/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
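&lt;p&gt;As a quick sketch of what happens behind this URL: WebHDFS is a plain REST API, so any HTTP client works. The snippet below is an illustration, not official tooling - the gateway IP, the credentials and the &lt;code class=&quot;highlighter-rouge&quot;&gt;requests&lt;/code&gt; package are assumptions:&lt;/p&gt;

```python
# Sketch: building a WebHDFS REST URL behind the Knox gateway.
# The gateway IP, port and credentials below are placeholder assumptions.

def webhdfs_url(gateway_ip, path, op, port=30433):
    """Build the WebHDFS URL for an operation such as LISTSTATUS or OPEN."""
    return (f"https://{gateway_ip}:{port}/gateway/default/webhdfs/v1"
            f"{path}?op={op}")

url = webhdfs_url("10.0.0.1", "/", "LISTSTATUS")
print(url)

# Issuing the actual request (hypothetical credentials, self-signed cert):
# import requests
# r = requests.get(url, auth=("root", "PASSWORD"), verify=False)
# print(r.json())
```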

&lt;p&gt;You can follow the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/data-ingestion-curl?view=sqlallproducts-allversions&quot;&gt;guidelines in the Microsoft docs&lt;/a&gt; using &lt;code class=&quot;highlighter-rouge&quot;&gt;curl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can also use the integrated HDFS explorer in Data Studio. To do so, you must &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/connect-to-big-data-cluster?view=sqlallproducts-allversions&quot;&gt;create a new connection in Data Studio&lt;/a&gt; and select &lt;code class=&quot;highlighter-rouge&quot;&gt;SQL Server Big Data Cluster&lt;/code&gt;. I recommend using the user &lt;code class=&quot;highlighter-rouge&quot;&gt;root&lt;/code&gt; in order to have read/write access in all directories. The configuration should look similar to the following picture.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/config-spark.png&quot; alt=&quot;Configure HDFS in Data Studio&quot; title=&quot;Configure HDFS in Data Studio&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once added, you should see the server and the HDFS directories in Data Studio.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/connect-data-services-node.png&quot; alt=&quot;HDFS in SQL Server 2019 (Source: docs.microsoft.com)&quot; title=&quot;HDFS in SQL Server 2019&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In order to query the HDFS data from SQL, you can configure external tables with the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/polybase/data-virtualization-csv?toc=%2fsql%2fbig-data-cluster%2ftoc.json&amp;amp;view=sql-server-ver15&quot;&gt;external table wizard&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;working-with-sql&quot;&gt;Working with SQL&lt;/h3&gt;

&lt;p&gt;To work with SQL in SQL Server 2019 BDC, we can simply connect to the SQL Server &lt;em&gt;Master Instance&lt;/em&gt;. This instance is a standard SQL Server engine running behind a load balancer on Kubernetes. You can also use your familiar tools, such as SQL Server Management Studio, to connect and interact with the SQL Server instance.&lt;/p&gt;

&lt;p&gt;To connect to the SQL Server &lt;em&gt;Master Instance&lt;/em&gt; from outside the cluster, we need to provide the &lt;strong&gt;external&lt;/strong&gt; IP address of the master instance. You can find it in the Kubernetes Dashboard, in the Big Data cluster namespace under services, as the external endpoint of the service &lt;code class=&quot;highlighter-rouge&quot;&gt;endpoint-master-pool&lt;/code&gt;. Alternatively, you can print the external IP address and port using the following command:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get service endpoint-master-pool &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;custom-columns&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;IP:.status.loadBalancer.ingress[0].ip,PORT:.spec.ports[0].port&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As mentioned, this is a standard SQL Server engine, just like in SQL Server 2016/2017. Hence, you can connect to the SQL Server endpoint using standard tooling such as SQL Server Management Studio or Azure Data Studio. Since we will also use Data Studio for Spark notebooks and HDFS, we will connect using Azure Data Studio.&lt;/p&gt;

&lt;p&gt;Create a new connection and select connection type &lt;code class=&quot;highlighter-rouge&quot;&gt;Microsoft SQL Server&lt;/code&gt;. Use the username &lt;code class=&quot;highlighter-rouge&quot;&gt;sa&lt;/code&gt; and the password you set in the setup script.&lt;/p&gt;
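&lt;p&gt;For completeness, you can also connect from plain Python. The following is a minimal sketch, assuming the &lt;code class=&quot;highlighter-rouge&quot;&gt;pyodbc&lt;/code&gt; package, a placeholder external IP and port, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;sa&lt;/code&gt; password from the setup script:&lt;/p&gt;

```python
# Sketch: assembling a connection string for the SQL Master Instance.
# Host, port and password are placeholder assumptions.

def mssql_conn_str(host, database, user, password):
    """Assemble an ODBC connection string for the master instance."""
    return (f"DRIVER={{ODBC Driver 17 for SQL Server}};"
            f"SERVER={host};DATABASE={database};UID={user};PWD={password}")

conn_str = mssql_conn_str("52.166.0.1,31433", "master", "sa", "PASSWORD")
print(conn_str)

# With pyodbc installed (hypothetical):
# import pyodbc
# with pyodbc.connect(conn_str) as conn:
#     print(conn.execute("SELECT @@VERSION").fetchval())
```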

&lt;p&gt;The following screenshot shows a query over an external table storing data in HDFS on the same cluster.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/sql.png&quot; alt=&quot;External table in SQL Server 2019&quot; title=&quot;External table in SQL Server 2019&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Using Polybase, you can also set up external tables over many other relational data sources, such as Oracle and SAP Hana. With the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/polybase/data-virtualization?toc=%2fsql%2fbig-data-cluster%2ftoc.json&amp;amp;view=sql-server-ver15&quot;&gt;external table wizard&lt;/a&gt; in Data Studio, this connection is easy to set up.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;You can find many more demos for SQL Server 2019 on &lt;a href=&quot;https://github.com/Microsoft/bobsql&quot;&gt;Bob Ward’s Github repository&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;working-with-spark&quot;&gt;Working with Spark&lt;/h3&gt;

&lt;p&gt;To work with Spark in SQL Server 2019 BDC, we can leverage the notebook capabilities of Data Studio. Once connected to the Big Data cluster, we will see options to create Spark notebooks for this instance.&lt;/p&gt;

&lt;p&gt;In the current version, the credentials from Spark are not yet passed to the SQL engine automatically. Hence, we have to supply a username and password along with the local database host to build the JDBC connection string. Below is a simple PySpark script to connect to the SQL Server database from within Spark.&lt;/p&gt;

&lt;p&gt;To connect to the SQL Server &lt;em&gt;Master Instance&lt;/em&gt; from Spark, i.e. from within the cluster, we need to provide the &lt;strong&gt;local&lt;/strong&gt; IP address of the master instance. You can find it in the Kubernetes Dashboard, in the Big Data cluster namespace under services, as the Cluster IP property of the service &lt;code class=&quot;highlighter-rouge&quot;&gt;endpoint-master-pool&lt;/code&gt;. Alternatively, you can print the internal IP address using the following command:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get service endpoint-master-pool &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;custom-columns&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;IP:.spec.clusterIP,PORT:.spec.ports[0].port&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;host&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&amp;lt;local_ip_address&amp;gt;:31334&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;database&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;demos&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;sa&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;password&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&amp;lt;sa_password&amp;gt;&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;table&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;dbo.NYCTaxiTrips&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;jdbc_url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;jdbc:sqlserver://&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s;database=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s;user=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s;password=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;database&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;jdbc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
       &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;url&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jdbc_url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
       &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dbtable&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
       &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The following screenshot shows the above query executed on my SQL Server 2019 BDC instance on the NYC Taxi Trips dataset.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/spark.png&quot; alt=&quot;Spark accessing SQL in SQL Server 2019&quot; title=&quot;Spark accessing SQL in SQL Server 2019&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you need to &lt;a href=&quot;https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/&quot;&gt;install additional Python packages&lt;/a&gt; on the cluster nodes or &lt;a href=&quot;https://becominghuman.ai/setting-up-a-scalable-data-exploration-environment-with-spark-and-jupyter-lab-22dbe7046269&quot;&gt;configure the Spark environment&lt;/a&gt;, you can use the Jupyter magic commands.&lt;/p&gt;

&lt;p&gt;I am sure you can see why this is really cool, right? You can easily run your Spark ETL, pre-processing and Machine Learning pipelines on data both stored in SQL and HDFS or any external sources.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;You can find many more Big Data samples on &lt;a href=&quot;https://github.com/Microsoft/sqlworkshops/tree/master/sqlserver2019bigdataclusters&quot;&gt;Buck Woody’s Github repository&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;SQL Server 2019 Big Data cluster (BDC) combines SQL Server, HDFS and Spark into one single cluster running on Kubernetes, either locally, on-premises or in the cloud. Using Polybase, one can connect multiple services - such as relational databases and NoSQL databases, or files in HDFS - as external tables. This allows you to have a single cluster for all your SQL and Spark workloads as well as for storing massive datasets.&lt;/p&gt;

&lt;p&gt;To set up SQL Server 2019 BDC, one needs to sign up for the SQL Server Early Adoption program and install &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;mssqlctl&lt;/code&gt; on the local machine. The cluster can be created using the Python &lt;a href=&quot;https://github.com/Microsoft/sql-server-samples/blob/master/samples/features/sql-big-data-cluster/deployment/aks/deploy-sql-big-data-aks.py&quot;&gt;installation script&lt;/a&gt;. Make sure to clean up your credentials and to set up the role binding required to access your Kubernetes cluster in the cloud.&lt;/p&gt;

&lt;p&gt;Once the cluster is created, one can use Azure Data Studio to manage both SQL Server and HDFS. On top, Data Studio provides Jupyter-like notebooks to run Spark on the SQL Server 2019 cluster.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.microsoft.com/en-us/sql-server/sql-server-2019&quot;&gt;SQL Server 2019&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15&quot;&gt;SQL Server 2019 Big Data cluster (BDC)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://hadoop.apache.org/docs/current1/hdfs_design.html#Introduction&quot;&gt;Hadoop Distributed Filesystem (HDFS)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://knox.apache.org/&quot;&gt;Apache Knox&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sqlservervnexteap.azurewebsites.net/&quot;&gt;SQL Server 2019 Early Adoption Program&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sql-server-ver15&quot;&gt;SQL Server 2019 BDC installation guide&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/Microsoft/sql-server-samples/blob/master/samples/features/sql-big-data-cluster/deployment/aks/deploy-sql-big-data-aks.py&quot;&gt;SQL Server 2019 BDC installation script&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/tasks/tools/install-kubectl/&quot;&gt;Kubectl installation guide&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-install-mssqlctl?view=sqlallproducts-allversions&quot;&gt;Mssqlctl installation guide&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/azure-data-studio/what-is?view=sqlallproducts-allversions&quot;&gt;Azure Data Studio&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://microsoft.github.io/sqlworkshops/&quot;&gt;SQL Workshops&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;Thanks to &lt;a href=&quot;https://www.linkedin.com/in/kaijisse-w-85304ba0/&quot;&gt;Kaijisse Waaijer&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;updates&quot;&gt;Updates&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;April 01, 2019: Update article to &lt;a href=&quot;https://cloudblogs.microsoft.com/sqlserver/2019/03/27/sql-server-2019-community-technology-preview-2-4-is-now-available/&quot;&gt;CTP 2.4&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Tue, 26 Feb 2019 14:30:00 +0000</pubDate>
        <link>https://chaosmail.github.io//hadoop/2019/02/26/getting-started-with-sql2019/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//hadoop/2019/02/26/getting-started-with-sql2019/</guid>
        
        <category>bigdata</category>
        
        <category>hadoop</category>
        
        <category>spark</category>
        
        <category>sql2019</category>
        
        
        
      </item>
    
      <item>
        <title>Introduction to BigData, Hadoop and Spark </title>
        <description>&lt;p&gt;Everyone is speaking about &lt;em&gt;Big Data&lt;/em&gt; and &lt;em&gt;Data Lakes&lt;/em&gt; these days. Many IT professionals see &lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; as &lt;em&gt;the&lt;/em&gt; solution to &lt;em&gt;every&lt;/em&gt; problem. At the same time, &lt;a href=&quot;https://hadoop.apache.org/&quot;&gt;Apache Hadoop&lt;/a&gt; has been around for &lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Hadoop&quot;&gt;more than 10 years&lt;/a&gt; and won’t go away anytime soon. In this blog post I want to give a brief introduction to Big Data, demystify some of the main concepts such as Map Reduce, and highlight the similarities and differences between Hadoop and Spark.&lt;/p&gt;

&lt;h2 id=&quot;big-data&quot;&gt;Big Data&lt;/h2&gt;

&lt;p&gt;You hear about Big Data everywhere. But what does it actually mean, what can we do with it, and when is a Big Data system required? We will try to answer these questions in this section.&lt;/p&gt;

&lt;h3 id=&quot;what-is-big-data&quot;&gt;What is Big Data&lt;/h3&gt;

&lt;p&gt;What is &lt;em&gt;Big Data&lt;/em&gt;? Have you ever heard of the popular definition of Big Data with the 3 &lt;strong&gt;V&lt;/strong&gt;s? This definition is very common and can be found in many text books and &lt;a href=&quot;https://en.wikipedia.org/wiki/Big_data&quot;&gt;Wikipedia&lt;/a&gt;. It suggests that your data is &lt;em&gt;Big&lt;/em&gt; data when one (or all - depending on the definition) of the following criteria are fulfilled:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;V&lt;/strong&gt;olume&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;V&lt;/strong&gt;elocity&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;V&lt;/strong&gt;ariety&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I find this definition concise and understandable, but a bit imprecise, which is probably intentional. Here is a more practical definition of what the 3 &lt;strong&gt;V&lt;/strong&gt;s stand for, based on my own experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume&lt;/strong&gt; describes a large &lt;em&gt;amount of data&lt;/em&gt; you want to store, process or analyze. If we are speaking in terms of hundreds of GBs, TBs or PBs, then we are speaking about Big Data. An important aspect to consider is data growth. As a rule of thumb: if your data is growing by multiple GBs per day, you are probably dealing with Big Data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Velocity&lt;/strong&gt; means a high &lt;em&gt;data throughput&lt;/em&gt; to be stored, processed and/or analyzed; often a large amount of data over a short period of time. When we are processing thousands to millions of records per second, then we are most likely speaking of Big Data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Variety&lt;/strong&gt; stands for the large amount of &lt;em&gt;different data types and formats&lt;/em&gt; that can be stored, processed or analyzed. This means one aims to process any kind of data, be it binary, text, structured, unstructured, compressed, uncompressed, nested, flat, etc. However, &lt;em&gt;variety&lt;/em&gt; is rather a consequence of Big Data as all data is eventually stored on a distributed file system and so one has to care about different optimized file formats for different use-cases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Big Data&lt;/em&gt; systems are built to handle data of high volume, velocity and variety. Apache Hadoop and Apache Spark are popular Big Data frameworks for large-scale distributed processing. We will learn the similarities and differences in the following sections.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Please note that other definitions vary slightly and you will find 4 or even more &lt;strong&gt;V&lt;/strong&gt;s, such as &lt;em&gt;Veracity&lt;/em&gt; for example. &lt;em&gt;Veracity&lt;/em&gt; refers to the trustworthiness of the data and hence describes how useful your data actually is. While these extended definitions are relevant for Big Data, they don’t necessarily apply &lt;em&gt;only&lt;/em&gt; to Big Data systems or require &lt;em&gt;Big Data&lt;/em&gt; systems in my opinion.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;what-can-we-do-with-big-data&quot;&gt;What can we do with Big Data&lt;/h3&gt;

&lt;p&gt;Big Data systems allow you to load, process and store data of high volume, velocity and variety. However, I prefer to see it the other way around: once you need to load, process and store data of high volume, velocity and variety, you need a Big Data system.&lt;/p&gt;

&lt;p&gt;A great example is Google’s search index. In a very simplified way, one has to count occurrences of keywords and phrases in websites and store each keyword and its corresponding list of occurring websites as an inverted index. As you might imagine, both the amount of content and the inverted index probably don’t fit into the memory of a single machine.&lt;/p&gt;

&lt;p&gt;Another classic example is Twitter’s trending hashtags (the word count example). One has to count the occurrences of a keyword in all tweets, maybe even weighted over time and aggregated per geographic region. As you might imagine, doing this on billions of messages probably exceeds the capabilities of a single machine.&lt;/p&gt;

&lt;p&gt;Hence, both use-cases require scalable distributed systems to handle the load and process the data efficiently. There is one more thing that both use-cases have in common: they aggregate/group data by a key, e.g. the occurrences per keyword and the counts per hashtag. This is one of the key requirements for a Big Data system.&lt;/p&gt;
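&lt;p&gt;The group-by-key idea can be sketched in a few lines of plain Python: hashing the key decides which machine a record lands on, so every machine can count its share independently before the partial results are merged (the tweets and the number of machines below are made up):&lt;/p&gt;

```python
from collections import Counter

# Toy data and cluster size - invented for illustration.
tweets = ["#bigdata rocks", "#spark and #bigdata", "#spark streaming"]
n_machines = 3

# Extract the keys (hashtags) from each record.
hashtags = [w for t in tweets for w in t.split() if w.startswith("#")]

# Partition by key: the same hashtag always lands on the same machine.
partitions = [[] for _ in range(n_machines)]
for tag in hashtags:
    partitions[hash(tag) % n_machines].append(tag)

# Each machine counts its partition; the partial counts are then merged.
totals = Counter()
for part in partitions:
    totals.update(Counter(part))

print(totals["#bigdata"], totals["#spark"])  # 2 2
```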

&lt;p&gt;Hence, most Big Data workloads can be categorized into two groups: &lt;em&gt;Batch Processing&lt;/em&gt; and &lt;em&gt;Stream Processing&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Batch Processing
    &lt;ul&gt;
      &lt;li&gt;Transformation, Join and Aggregation&lt;/li&gt;
      &lt;li&gt;Analytics: (historical) Analytics, Prediction and Modeling&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Stream Processing
    &lt;ul&gt;
      &lt;li&gt;Transformation, Join and (temporal) Aggregation&lt;/li&gt;
      &lt;li&gt;Analytics: (real-time) Analytics, inference with prediction models&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;what-is-big-data-analytics&quot;&gt;What is Big Data Analytics&lt;/h3&gt;

&lt;p&gt;Big Data systems are used to store and process massive amounts of data, mostly via batch and stream processing. They are often used for analytics. To clarify which data is processed, I like to distinguish between 3 use-cases of increasing difficulty in Big Data analytics. The use-cases are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;(Historical) Analysis&lt;/li&gt;
  &lt;li&gt;Prediction&lt;/li&gt;
  &lt;li&gt;Modeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In classical &lt;strong&gt;Analytics&lt;/strong&gt;, we analyze historical/observed data. In Big Data, we analyze massive amounts of such data. A typical question to answer with analytics could be &lt;em&gt;how to compute the number of visitors of the previous season based on all bookings of said season?&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Prediction&lt;/strong&gt;, we analyze the past to build a model that can predict the future. In more general terms, one fits a model on a set of training data to use it for inferring any unknown/unseen observation. We often use statistical methods (such as Generalized Linear Models, Logistic Regression, etc.) as well as Machine Learning (SVM, Gradient Boosted Trees, Deep Learning, etc.) techniques to build these models. A typical question to answer with prediction could be &lt;em&gt;how to forecast the number of visitors for the following season based on all bookings of previous seasons?&lt;/em&gt;.&lt;/p&gt;
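&lt;p&gt;A toy version of this forecast question fits in a few lines: fit a least-squares line to made-up visitor numbers from past seasons and extrapolate one season ahead. The numbers are invented and deliberately linear, so the model is the simplest one possible:&lt;/p&gt;

```python
# Sketch: forecasting next season's visitors with a least-squares line.
# The booking numbers are invented for illustration.
seasons = [1, 2, 3, 4]
visitors = [1000, 1200, 1400, 1600]

n = len(seasons)
mean_x = sum(seasons) / n
mean_y = sum(visitors) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(seasons, visitors))
         / sum((x - mean_x) ** 2 for x in seasons))
intercept = mean_y - slope * mean_x

forecast = slope * 5 + intercept  # extrapolate to season 5
print(forecast)  # 1800.0
```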

&lt;p&gt;&lt;strong&gt;Modeling&lt;/strong&gt; builds on both analytics and prediction capabilities. In &lt;em&gt;Modeling&lt;/em&gt;, the aim is to analyze the past and build a model that can predict different possible futures depending on the model parameters. These models are often more complicated than a simple statistical or Machine Learning model and take into account multiple state variables and parameters that can be modified. A typical question to answer with modeling could be &lt;em&gt;how to forecast the number of visitors for the following season, if the winter were two weeks shorter, based on all bookings of previous seasons plus additional data sources (weather data, etc.)?&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;hadoop-hdfs-yarn-and-mapreduce&quot;&gt;Hadoop: HDFS, Yarn and MapReduce&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://hadoop.apache.org/&quot;&gt;Apache Hadoop&lt;/a&gt; is a framework for storing and processing massive amounts of data on commodity hardware. It is a collection of services that sit together in the &lt;a href=&quot;https://github.com/apache/hadoop&quot;&gt;Hadoop repository&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;HDFS: a distributed file system&lt;/li&gt;
  &lt;li&gt;MapReduce: a framework for distributed processing&lt;/li&gt;
  &lt;li&gt;Yarn: a cluster resource manager&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;HDFS&lt;/strong&gt; (Hadoop Distributed File System) is a distributed file system that stores and replicates data in blocks across multiple nodes in a cluster. HDFS is the open-source implementation of the &lt;a href=&quot;https://ai.google/research/pubs/pub51&quot;&gt;Google File System (GFS)&lt;/a&gt; paper published by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung at Google in 2003.&lt;/p&gt;

&lt;p&gt;HDFS consists of a &lt;em&gt;name node&lt;/em&gt; and multiple &lt;em&gt;data nodes&lt;/em&gt;. The &lt;em&gt;name node&lt;/em&gt; holds the references to the data blocks and takes care of the file system and metadata operations, whereas the &lt;em&gt;data nodes&lt;/em&gt; store the data blocks on their local file systems.&lt;/p&gt;
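&lt;p&gt;This division of labour can be illustrated with a toy model: the name node only keeps the block-to-node mapping, while the data nodes hold the actual bytes. The block size and replication factor below mirror common HDFS defaults (128 MB and 3), but the placement logic is a deliberately naive sketch, not the real HDFS algorithm:&lt;/p&gt;

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # common HDFS default block size
REPLICATION = 3                 # common HDFS default replication factor
data_nodes = ["dn1", "dn2", "dn3", "dn4"]

def plan_blocks(file_size):
    """Toy name-node logic: split a file into blocks, pick replica nodes."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    rotation = itertools.cycle(data_nodes)
    return {block_id: [next(rotation) for _ in range(REPLICATION)]
            for block_id in range(n_blocks)}

plan = plan_blocks(300 * 1024 * 1024)  # a 300 MB file
print(len(plan), len(plan[0]))  # 3 blocks, 3 replicas each
```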

&lt;p&gt;&lt;strong&gt;MapReduce&lt;/strong&gt; is a high-level framework for distributed processing of large data sets that abstracts developers’ code into &lt;em&gt;Map&lt;/em&gt; (transformation) and &lt;em&gt;Reduce&lt;/em&gt; (aggregation) operations. By doing so, the code can automatically run in parallel on a distributed system. MapReduce is an open-source implementation of the &lt;a href=&quot;https://ai.google/research/pubs/pub62&quot;&gt;MapReduce: Simplified Data Processing on Large Clusters&lt;/a&gt; paper published by Jeff Dean and Sanjay Ghemawat at Google in 2004.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Due to the same name of the paper and the open-source implementation, the term &lt;em&gt;MapReduce&lt;/em&gt; can lead to confusion between the original concept and the framework.&lt;/p&gt;
&lt;/blockquote&gt;
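&lt;p&gt;The programming model from the paper can be illustrated with the canonical word-count job, written as explicit map, shuffle and reduce phases. This is a plain-Python sketch of the concept, not the Hadoop API:&lt;/p&gt;

```python
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit a (key, 1) pair for every word in every input record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all emitted values by key (done by the framework in Hadoop).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate the values for each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts["the"], counts["fox"])  # 3 2
```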

&lt;p&gt;Similar to both Google papers, HDFS and MapReduce were designed and developed to function as one single framework for distributed processing of large data sets. MapReduce takes advantage of data replication in HDFS by moving computations to the same physical machine where the data is stored. In Hadoop 1, the MapReduce services ran directly on the &lt;em&gt;data nodes&lt;/em&gt; without any resource manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Yarn&lt;/strong&gt; (an acronym for &lt;em&gt;Yet Another Resource Negotiator&lt;/em&gt;) is a distributed resource manager and job scheduler for managing cluster resources (CPUs, RAM, GPUs, etc.) and for scheduling and running distributed jobs on a Hadoop cluster. It was introduced in Hadoop 2 to decouple the MapReduce engine from the cluster resource management, allowing more services to run on top of Hadoop. Instead of starting services on each of the nodes individually, one can submit a service to Yarn, which takes care of the resource negotiation, distribution of the service to all requested nodes, execution of the service, log collection, and so on.&lt;/p&gt;

&lt;p&gt;Yarn consists of a &lt;em&gt;resource manager&lt;/em&gt; service to negotiate cluster resources and multiple &lt;em&gt;node manager&lt;/em&gt; services that manage the execution of processes on each of the nodes.&lt;/p&gt;

&lt;p&gt;Although not managed in the same repository as Apache Hadoop, I often like to mention Apache Zookeeper as another integral building block of Hadoop. &lt;strong&gt;Apache Zookeeper&lt;/strong&gt; is a distributed, transactional, in-memory key-value store with strong synchronization guarantees. Many Hadoop services use Zookeeper for storing dynamic configuration (available nodes per partition, current master, etc.), leader election, synchronization, and much more.&lt;/p&gt;

&lt;p&gt;Nowadays, there are many other services related to or included in the Hadoop stack. Here is a (small) list of distributed services that usually run on top of Hadoop:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Batch Processing
    &lt;ul&gt;
      &lt;li&gt;Hive&lt;/li&gt;
      &lt;li&gt;Pig&lt;/li&gt;
      &lt;li&gt;MapReduce&lt;/li&gt;
      &lt;li&gt;Tez&lt;/li&gt;
      &lt;li&gt;Druid&lt;/li&gt;
      &lt;li&gt;Impala&lt;/li&gt;
      &lt;li&gt;Spark&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Stream processing
    &lt;ul&gt;
      &lt;li&gt;Storm&lt;/li&gt;
      &lt;li&gt;Flink&lt;/li&gt;
      &lt;li&gt;Spark Streaming&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Data Storage
    &lt;ul&gt;
      &lt;li&gt;HDFS (File Store)&lt;/li&gt;
      &lt;li&gt;HBase (NoSql)&lt;/li&gt;
      &lt;li&gt;Cassandra (NoSQL)&lt;/li&gt;
      &lt;li&gt;Accumulo (NoSQL)&lt;/li&gt;
      &lt;li&gt;Kafka (Log Store)&lt;/li&gt;
      &lt;li&gt;Solr (Inverted Document Index)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of these services run on top of Hadoop because they utilize one or more of its components. Typical examples of reused components are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;HDFS as distributed storage (used in Hive, HBase, etc.)&lt;/li&gt;
  &lt;li&gt;Yarn as resource manager (used in Hive, Spark, Storm, etc.)&lt;/li&gt;
  &lt;li&gt;Zookeeper for synchronization and leader election (used in Hive, Kafka, HBase, etc.)&lt;/li&gt;
  &lt;li&gt;Hive Metastore as a meta data storage (used in Spark, Impala, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;spark-the-evolution-of-mapreduce&quot;&gt;Spark: The Evolution of MapReduce&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; got &lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Spark&quot;&gt;popular in 2014&lt;/a&gt; as a fast general-purpose compute framework for distributed processing which claimed to be more than 100 times faster than the traditional MapReduce implementation. It provides high-level operations for working with distributed data sets, which are optimized and executed in memory on the cluster nodes. Spark runs on top of multiple resource managers such as Yarn or Mesos.&lt;/p&gt;

&lt;p&gt;Conceptually, Spark’s execution engine is similar to the other distributed processing frameworks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://ai.google/research/pubs/pub62&quot;&gt;MapReduce: Simplified Data Processing on Large Clusters&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://web.eecs.umich.edu/~mosharaf/Readings/Tez.pdf&quot;&gt;Tez: A Unifying Framework for Modeling and Building Data Processing Applications&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf&quot;&gt;Impala: A Modern, Open-Source SQL Engine for Hadoop&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf&quot;&gt;Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Apache Tez&lt;/em&gt; (Tez is Hindi for &lt;em&gt;speed&lt;/em&gt;) is &lt;a href=&quot;https://de.hortonworks.com/blog/introducing-tez-faster-hadoop-processing/&quot;&gt;a faster MapReduce engine&lt;/a&gt; based on Apache Yarn that optimizes complex execution graphs into single jobs to avoid intermediate writes to HDFS. It is the default execution engine powering Apache Pig and Apache Hive (a large scale data warehouse solution on top of Hadoop) on the Hortonworks Hadoop distribution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Apache Impala&lt;/em&gt; is a &lt;a href=&quot;https://impala.apache.org/overview.html&quot;&gt;Big Data SQL engine&lt;/a&gt; on top of HDFS, HBase and Hive (Metastore) with its own specialized distributed query engine. It is the default engine for Apache Hive on the Cloudera and MapR Hadoop distributions.&lt;/p&gt;

&lt;p&gt;As we can see in the following figure, the main difference between the newer processing frameworks (Tez, Impala, and Spark) and the traditional MapReduce engine (left) is that they avoid writing intermediate results to HDFS and heavily optimize the execution graph. Another optimization strategy is processing/caching data in the memory of the local nodes across the execution graph.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/hadoop/mr_tez_spark.png&quot; alt=&quot;MapReduce vs. Tez/Impala/Spark&quot; title=&quot;MapReduce vs. Tez/Impala/Spark&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Traditional MapReduce (left) vs. Tez/Impala/Spark optimized engines (right) (Source: &lt;a href=&quot;https://de.hortonworks.com/blog/introducing-tez-faster-hadoop-processing/&quot;&gt;hortonworks.com&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;What sets &lt;em&gt;Apache Spark&lt;/em&gt; apart from the other frameworks are its in-memory processing engine as well as the rich set of included libraries (GraphX for graph processing, MLlib for Machine Learning, Spark Streaming for mini-batch streaming, and Spark SQL) and SDKs (Scala, Python, Java, and R). Please note that these libraries are for &lt;em&gt;distributed&lt;/em&gt; processing, so you get distributed graph processing, distributed machine learning, etc. out of the box.&lt;/p&gt;

&lt;p&gt;The amazing performance of Spark’s in-memory engine comes with a trade-off. Tuning and operating Spark pipelines with varying amounts of data requires a &lt;a href=&quot;https://spark.apache.org/docs/latest/tuning.html&quot;&gt;lot of manual tuning of configurations&lt;/a&gt;, digging through log files, and reading books, articles, and blog posts. And since the execution parallelism can be modified in a fine-grained way, one has to configure/set the number of tasks per JVM, the number of JVMs per worker, and the number of workers as well as all the memory settings (heap, shuffle, and storage) for these executors and the driver.&lt;/p&gt;
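&lt;p&gt;As a sketch, such a manual configuration might look like the following &lt;code class=&quot;highlighter-rouge&quot;&gt;spark-submit&lt;/code&gt; invocation. The flags are real Spark options, but the values (and the job name) are hypothetical placeholders that must be tuned per workload and cluster:&lt;/p&gt;

```sh
# Hypothetical resource settings for illustration only.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.memory.fraction=0.6 \
  my_job.py
```

&lt;p&gt;Getting these numbers wrong typically shows up as out-of-memory errors, excessive shuffle spills, or idle cluster capacity.&lt;/p&gt;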

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;We speak about Big Data when we deal with large volumes (&amp;gt; 10s of GB), high velocity (&amp;gt; 10,000s of records/second) or large variety (binary, text, unstructured, compressed, etc.) of data. We use Big Data systems to store and process massive data sets - either as batch (processing partitions of data) or as stream (processing single records). In Big Data analytics, we usually differentiate between historical analytics, prediction, and modeling.&lt;/p&gt;

&lt;p&gt;Apache Hadoop is a collection of services for large-scale distributed storage and processing, mainly HDFS (a distributed filesystem), MapReduce (a processing framework), Apache Yarn (a cluster resource manager), and Apache Zookeeper (a fast distributed key-value storage). Many other services such as Hive, HBase, etc. run on top of Hadoop.&lt;/p&gt;

&lt;p&gt;Apache Spark is a fast (claimed to be up to 100 times faster than traditional MapReduce) distributed in-memory processing engine with high-level APIs, libraries for distributed graph processing and machine learning, and SDKs for Scala, Java, Python and R. It also has support for SQL and streaming.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Big_data&quot;&gt;Big Data - Wikipedia&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Hadoop&quot;&gt;Apache Hadoop - Wikipedia&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Spark&quot;&gt;Apache Spark - Wikipedia&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark - Website&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://hadoop.apache.org/&quot;&gt;Apache Hadoop - Website&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/apache/hadoop&quot;&gt;Apache Hadoop - Github Repository&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://impala.apache.org/overview.html&quot;&gt;Apache Impala - Website (Overview)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://ai.google/research/pubs/pub51&quot;&gt;The Google File System (2003)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://ai.google/research/pubs/pub62&quot;&gt;MapReduce: Simplified Data Processing on Large Clusters (2004)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://de.hortonworks.com/blog/introducing-tez-faster-hadoop-processing/&quot;&gt;Introducing Tez - Hortonworks Blog&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf&quot;&gt;Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2011)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://web.eecs.umich.edu/~mosharaf/Readings/Tez.pdf&quot;&gt;Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications (2015)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf&quot;&gt;Impala: A Modern, Open-Source SQL Engine for Hadoop (2015)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;Thanks to &lt;a href=&quot;https://www.linkedin.com/in/emil-jorgensen&quot;&gt;Emil Jorgensen&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/bryanminnock&quot;&gt;Bryan Minnock&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
        <pubDate>Thu, 31 Jan 2019 16:00:00 +0000</pubDate>
        <link>https://chaosmail.github.io//hadoop/2019/01/31/intro-to-bigdata-hadoop-and-spark/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//hadoop/2019/01/31/intro-to-bigdata-hadoop-and-spark/</guid>
        
        <category>bigdata</category>
        
        <category>hadoop</category>
        
        <category>spark</category>
        
        
        <category>hadoop</category>
        
      </item>
    
      <item>
        <title>Bing Maps API - Rest Locations (Geocoding)</title>
        <description>&lt;p&gt;The &lt;a href=&quot;https://www.microsoft.com/en-us/maps/choose-your-bing-maps-api&quot;&gt;Bing Maps APIs&lt;/a&gt; provide a &lt;a href=&quot;https://msdn.microsoft.com/en-us/library/ff701711.aspx&quot;&gt;REST API for Geocoding&lt;/a&gt;, hence finding a location based on a text input. This text input can be either a structured address or an (unstructured) search query. In this blog post we will see how to request a developer key and use the REST API for finding the coordinates to a query using the Bing Maps Location API.&lt;/p&gt;

&lt;h2 id=&quot;request-an-api-key&quot;&gt;Request an API key&lt;/h2&gt;

&lt;p&gt;If you already use Azure or have an Azure account, then you can request an API key directly from within Azure. To do so, log into your Azure account and go to the Marketplace. Search for &lt;em&gt;Bing Maps API for Enterprise&lt;/em&gt;, select your pricing model, and add it to your Azure account. Tier level 1 is free and grants you 10K requests/month. You can find more pricing information in the &lt;a href=&quot;https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bingmaps.mapapis?tab=PlansAndPrice&quot;&gt;product documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you don’t have an Azure account then you can request an API key from the &lt;a href=&quot;https://www.bingmapsportal.com/&quot;&gt;Bing Maps Portal&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;locations-api-rest&quot;&gt;Locations API (Rest)&lt;/h2&gt;

&lt;p&gt;There are two types of queries available in the Locations API: structured and unstructured queries. In this example we will use an unstructured query to query the API with a search term. However, you can also find examples using structured queries in the &lt;a href=&quot;https://msdn.microsoft.com/en-us/library/ff701711.aspx&quot;&gt;Bing Maps Location API&lt;/a&gt; documentation.&lt;/p&gt;

&lt;p&gt;Let’s find the geolocation of Howth, an Irish village north-east of Dublin. To run this, you have to replace the &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;accesskey&amp;gt;&lt;/code&gt; placeholder with your own access key in the following snippet.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl &quot;http://dev.virtualearth.net/REST/v1/Locations?query&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;Howth+Dublin&amp;amp;include&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;queryParse&amp;amp;key&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&amp;lt;accesskey&amp;gt;&amp;amp;output&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;json&quot;

&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;authenticationResultCode&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;ValidCredentials&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;brandLogoUri&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;http://dev.virtualearth.net/Branding/logo_powered_by.png&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;copyright&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Copyright © 2018 Microsoft and its suppliers. All rights reserved. This API cannot be accessed and the content and any results may not be used, reproduced or transmitted in any manner without express written permission from Microsoft Corporation.&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;resourceSets&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;estimatedTotal&quot;&lt;/span&gt;: 2,
      &lt;span class=&quot;s2&quot;&gt;&quot;resources&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;s2&quot;&gt;&quot;__type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Location:http://schemas.microsoft.com/search/local/ws/rest/v1&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;bbox&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            53.3509009854156,
            &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.12213139880203,
            53.4088417514008,
            &lt;span class=&quot;nt&quot;&gt;-5&lt;/span&gt;.99270815503098
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth, Ireland&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;point&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Point&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;coordinates&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
              53.3798713684082,
              &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.0574197769165
            &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;address&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;adminDistrict&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Dublin&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;countryRegion&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Ireland&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;formattedAddress&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth, Ireland&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;locality&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth&quot;&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;confidence&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;High&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;entityType&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;PopulatedPlace&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;geocodePoints&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Point&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;coordinates&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
                53.3798713684082,
                &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.0574197769165
              &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;calculationMethod&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Rooftop&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;usageTypes&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
                &lt;span class=&quot;s2&quot;&gt;&quot;Display&quot;&lt;/span&gt;
              &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;matchCodes&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;Ambiguous&quot;&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;queryParseValues&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;property&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Locality&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;howth&quot;&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;property&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;AdminDistrict&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;dublin&quot;&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
        &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;s2&quot;&gt;&quot;__type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Location:http://schemas.microsoft.com/search/local/ws/rest/v1&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;bbox&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            53.3849579306227,
            &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.08322532103279,
            53.392683365764,
            &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.06595509125969
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth, Ireland&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;point&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Point&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;coordinates&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
              53.3888206481934,
              &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.07459020614624
            &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;address&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;adminDistrict&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Dublin&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;countryRegion&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Ireland&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;formattedAddress&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth, Ireland&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;locality&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth&quot;&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;confidence&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Medium&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;entityType&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;RailwayStation&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;geocodePoints&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Point&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;coordinates&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
                53.3888206481934,
                &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.07459020614624
              &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;calculationMethod&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Rooftop&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;usageTypes&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
                &lt;span class=&quot;s2&quot;&gt;&quot;Display&quot;&lt;/span&gt;
              &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;matchCodes&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;Ambiguous&quot;&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;queryParseValues&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;property&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Landmark&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;howth&quot;&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;property&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;AdminDistrict&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;dublin&quot;&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;statusCode&quot;&lt;/span&gt;: 200,
  &lt;span class=&quot;s2&quot;&gt;&quot;statusDescription&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;OK&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;traceId&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;40fdef3af24f4784ace0cce7eac525d8|DB40060332|7.7.0.0|Ref A: 4E51EFC390E74430B0E957794757DF59 Ref B: DB3EDGE1021 Ref C: 2018-07-31T13:59:46Z&quot;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the response above we see that the API returns the geolocation, a bounding box and other useful geo information, such as the country region and administrative district, as a JSON body.&lt;/p&gt;
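&lt;p&gt;If you prefer scripting the request, the same URL can be assembled with the Python standard library. This is a sketch only; the key below is a placeholder and no network call is made:&lt;/p&gt;

```python
# Build the Locations API request URL with stdlib tools only.
# ACCESS_KEY_PLACEHOLDER is a placeholder; substitute your own key.
from urllib.parse import urlencode

params = {
    "query": "Howth Dublin",
    "include": "queryParse",
    "output": "json",
    "key": "ACCESS_KEY_PLACEHOLDER",
}
url = "http://dev.virtualearth.net/REST/v1/Locations?" + urlencode(params)
# With a real key, fetch the JSON via urllib.request.urlopen(url).
print(url)
```

&lt;p&gt;Encoding the parameters this way also avoids the shell-quoting pitfalls of passing a raw query string to curl.&lt;/p&gt;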

&lt;blockquote&gt;
  &lt;p&gt;With the Bing Maps API, one can also perform Batch operations. You can find more information about Batch operations on the &lt;a href=&quot;https://blogs.bing.com/maps/2010/08/31/batch-geocoding-and-batch-reverse-geocoding-with-bing-maps/&quot;&gt;Bing Blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.microsoft.com/en-us/maps/choose-your-bing-maps-api&quot;&gt;Bing Maps API Overview&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://msdn.microsoft.com/en-us/library/ff701711.aspx&quot;&gt;Bing Maps API - Documentation of Locations API&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bingmaps.mapapis?tab=PlansAndPrice&quot;&gt;Bing Maps API - Pricing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.bingmapsportal.com/&quot;&gt;Bing Maps Portal&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Tue, 31 Jul 2018 17:00:00 +0000</pubDate>
        <link>https://chaosmail.github.io//geoprocessing/2018/07/31/bing-maps-api-rest-locations/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//geoprocessing/2018/07/31/bing-maps-api-rest-locations/</guid>
        
        <category>bing</category>
        
        <category>maps</category>
        
        <category>api</category>
        
        <category>geocoding</category>
        
        
        <category>geoprocessing</category>
        
      </item>
    
      <item>
        <title>MS Hack 2018 - Smart Outlook</title>
        <description>&lt;p&gt;In the MS Hack 2018 global Microsoft Hackathon, I joined the &lt;em&gt;Smart Outlook&lt;/em&gt; project for 3 days in order to develop a few ideas to make your daily work in Outlook more productive. We came up with a few ideas, mainly about enhancing folders, categories, and focus inbox as well as notifications about new emails. We build a &lt;a href=&quot;https://mshack2018.z13.web.core.windows.net&quot;&gt;mock UI&lt;/a&gt; to visualize the idea of our solution; the code is available on &lt;a href=&quot;https://github.com/chaosmail/mshack-2018-smart-outlook&quot;&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;idea&quot;&gt;Idea&lt;/h2&gt;

&lt;p&gt;Employees spend 20 hours/week on email. Common patterns include manually moving emails into folders, manually maintaining rules, maintaining a single-folder inbox, and keeping tons of unread emails in the inbox. All of these cost time.&lt;/p&gt;

&lt;p&gt;Categorizing emails is not straightforward. The following tools are available to categorize emails:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Custom Folders&lt;/li&gt;
  &lt;li&gt;Categories&lt;/li&gt;
  &lt;li&gt;Focus Inbox&lt;/li&gt;
  &lt;li&gt;Junk Folder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both &lt;em&gt;Custom Folders&lt;/em&gt; and &lt;em&gt;Categories&lt;/em&gt; are filled manually or with manually-created rules whereas &lt;em&gt;Junk Folder&lt;/em&gt; and &lt;em&gt;Focus Inbox&lt;/em&gt; are filled automatically. We receive an email notification for every email received in the inbox.&lt;/p&gt;

&lt;h2 id=&quot;solution&quot;&gt;Solution&lt;/h2&gt;

&lt;p&gt;To make Outlook more efficient, we want to merge &lt;em&gt;Custom Folders&lt;/em&gt; and &lt;em&gt;Focus Inbox (Tabs)&lt;/em&gt; into &lt;em&gt;Smart Tabs&lt;/em&gt;. &lt;em&gt;Smart Tabs&lt;/em&gt; can be created within hierarchies like traditional folders. Emails can be dragged into &lt;em&gt;Smart Tabs&lt;/em&gt; like traditional folders, and each email can only belong to one single &lt;em&gt;Smart Tab&lt;/em&gt;. However, unlike traditional folders, the number of unread emails is added up with all parent tabs. Based on the manual categorization of emails into &lt;em&gt;Smart Tabs&lt;/em&gt;, a Machine Learning algorithm trains/fine-tunes a model based on content/subject semantics as well as email and organizational metadata and moves new emails to the most likely tab (or the inbox). During setup, common &lt;em&gt;Smart Tabs&lt;/em&gt; are created for you, such as &lt;em&gt;Team&lt;/em&gt;, &lt;em&gt;Notifications&lt;/em&gt;, &lt;em&gt;News&lt;/em&gt;, &lt;em&gt;Org Updates&lt;/em&gt;, &lt;em&gt;Customers&lt;/em&gt;, &lt;em&gt;External&lt;/em&gt;, etc.&lt;/p&gt;
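&lt;p&gt;As a purely hypothetical sketch (not the model we actually trained, which used learned semantic features), routing emails to tabs can be illustrated with simple keyword scores:&lt;/p&gt;

```python
# Illustrative only: route an email subject to the best-matching
# Smart Tab by counting keyword overlaps. The tab names and keyword
# sets below are made-up examples, not real product behavior.
TAB_KEYWORDS = {
    "Notifications": {"alert", "automated", "noreply"},
    "Team": {"standup", "sprint", "retro"},
    "News": {"newsletter", "digest", "weekly"},
}

def route_email(subject, default="Inbox"):
    words = set(subject.lower().split())
    # Score each tab by keyword overlap and pick the best match.
    scores = {tab: len(words.intersection(kw))
              for tab, kw in TAB_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else default

print(route_email("Weekly digest newsletter"))  # News
print(route_email("Lunch tomorrow?"))           # Inbox
```

&lt;p&gt;A trained model would replace the hand-picked keyword sets with weights learned from the user’s past filing behavior.&lt;/p&gt;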

&lt;p&gt;To focus on only relevant notifications, any of your &lt;em&gt;Smart Tabs&lt;/em&gt; can be pinned to the top. You will only receive notifications from new emails in your pinned tabs. We merged the UI of the Outlook web client with the UI of the Edge browser to create this tab UI. Here is a screenshot of our mock UI for &lt;em&gt;Smart Tabs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/projects/smart-outlook-mock-ui.png&quot; alt=&quot;Smart Tabs UI&quot; title=&quot;Smart Tabs UI&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A blog post about deploying a static website to Azure can be found in the &lt;a href=&quot;https://azure.microsoft.com/en-us/blog/azure-storage-static-web-hosting-public-preview/&quot;&gt;Azure Blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;execution&quot;&gt;Execution&lt;/h2&gt;

&lt;p&gt;We collaborated with 1 team in London and 2 teams in India who had similar ideas. Here is what we implemented at the hackathon:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Mock UI
    &lt;ul&gt;
      &lt;li&gt;Smart Tabs: A merge between classic folders and Focus/Other tabs. Mails should be classified automatically (based on previous user interactions) and put into those smart tabs. These tabs also appear on the left, the same way folders did.&lt;/li&gt;
      &lt;li&gt;Focus Mode: Smart tabs can be pinned as tabs to the top. Then the user receives notifications solely on the current open tabs.&lt;/li&gt;
      &lt;li&gt;UI: We tried to merge functionality of Outlook.com Web client with the simple/clean UI of Edge into a single tab-based email UI&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;ML API Endpoint
    &lt;ul&gt;
      &lt;li&gt;Labeled a custom email dataset&lt;/li&gt;
      &lt;li&gt;Trained simple ML model with semantic features in Python&lt;/li&gt;
      &lt;li&gt;Email classification based on content&lt;/li&gt;
      &lt;li&gt;Deployment to Azure VM&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Outlook Plugin
    &lt;ul&gt;
      &lt;li&gt;Create &lt;em&gt;Smart Tabs&lt;/em&gt; programmatically (as folders)&lt;/li&gt;
      &lt;li&gt;Classify incoming messages and move to &lt;em&gt;Smart Tabs&lt;/em&gt; (as folders)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;future-work&quot;&gt;Future Work&lt;/h2&gt;

&lt;p&gt;It was great fun to hack on Outlook, to try making Outlook more efficient, and to propose a new UI that lets you focus on what is important. We see great potential in email clients becoming more intelligent, productive and efficient. In this three-day hackathon we validated a few ideas and showed in a proof of concept that they could potentially find their way into the Outlook Windows or web client.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://mshack2018.z13.web.core.windows.net&quot;&gt;Smart Outlook - Demo&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/chaosmail/mshack-2018-smart-outlook&quot;&gt;Smart Outlook - Code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Mon, 30 Jul 2018 17:00:00 +0000</pubDate>
        <link>https://chaosmail.github.io//hackathon/2018/07/30/ms-hack-2018-smart-outlook/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//hackathon/2018/07/30/ms-hack-2018-smart-outlook/</guid>
        
        <category>hackathon</category>
        
        <category>office</category>
        
        
        
      </item>
    
      <item>
        <title>Resolving _HOST in kerberized HDP Sandbox</title>
<description>&lt;p&gt;Since HDP 2.5, Hortonworks provides its &lt;a href=&quot;https://de.hortonworks.com/products/sandbox/&quot;&gt;HDP Sandbox&lt;/a&gt; as a Docker container inside a Virtual Machine ISO. For many developers working with multiple VM HDP Sandboxes this is not optimal, as every connection has to be tunneled through the VM host into the Docker container. That’s why we are building our own custom Sandbox. However, when building a kerberized Hadoop installation, it is a bit tricky to configure the hostname such that Kerberos principals resolve the &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; variable properly.&lt;/p&gt;

&lt;h2 id=&quot;ambaris-autogenerated-kerberos-principals&quot;&gt;Ambari’s autogenerated Kerberos principals&lt;/h2&gt;

&lt;p&gt;With Ambari and Ambari Blueprints, automated Hadoop cluster installations are quite convenient; one can simply describe all components and configurations in a Blueprint file. When it comes to Kerberos, Ambari automatically takes care of creating all principals and keytabs. However, I experienced a strange Kerberos authentication error: Ambari resolved the &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; variable to &lt;code class=&quot;highlighter-rouge&quot;&gt;localhost&lt;/code&gt; in all principals despite the hostname &lt;code class=&quot;highlighter-rouge&quot;&gt;sandbox.chaosmail.at&lt;/code&gt; being set in &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hostname&lt;/code&gt;. Hence, the Kerberos principals were not valid.&lt;/p&gt;

&lt;p&gt;It turned out that the error occurred because &lt;code class=&quot;highlighter-rouge&quot;&gt;127.0.0.1 sandbox.chaosmail.at&lt;/code&gt; was placed in the last line of &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt; instead of the first. Here is how I debugged the error.&lt;/p&gt;

&lt;h2 id=&quot;diving-into-ambari-source-code&quot;&gt;Diving into Ambari source code&lt;/h2&gt;

&lt;p&gt;If we dive into the &lt;a href=&quot;https://github.com/apache/ambari&quot;&gt;Ambari source code on Github&lt;/a&gt; and search for &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt;, we quickly find the following &lt;a href=&quot;https://github.com/apache/ambari/blob/d5cbe1940552c1ac6ed142b0d36bc84f45ba3c7f/ambari-server/src/main/java/org/apache/ambari/server/serveraction/kerberos/KerberosServerAction.java#L531&quot;&gt;code snippet&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KerberosIdentityDataFileReader&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;HOSTNAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KerberosHelper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;AMBARI_SERVER_HOST_NAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;equals&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hostname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Replace KerberosHelper.AMBARI_SERVER_HOST_NAME with the actual hostname where the Ambari&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// server is... this host&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;hostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StageUtils&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getHostName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Evaluate the principal &quot;pattern&quot; found in the record to generate the &quot;evaluated principal&quot;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// by replacing the _HOST and _REALM variables.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;evaluatedPrincipal&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;principal&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;_HOST&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hostname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;_REALM&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;defaultRealm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Bingo, that’s the place where the &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; variable gets resolved. In our case, with the Ambari server running on the host in question, the variable is replaced by the return value of the &lt;code class=&quot;highlighter-rouge&quot;&gt;StageUtils.getHostName()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;Let’s find this function in the &lt;a href=&quot;https://github.com/apache/ambari/blob/d5cbe1940552c1ac6ed142b0d36bc84f45ba3c7f/ambari-server/src/main/java/org/apache/ambari/server/utils/StageUtils.java#L108&quot;&gt;source code&lt;/a&gt; and look at the relevant line.&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;server_hostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;InetAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getLocalHost&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getCanonicalHostName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toLowerCase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now we know that the &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; variable in a Kerberos principal is replaced with the output of the &lt;code class=&quot;highlighter-rouge&quot;&gt;getCanonicalHostName()&lt;/code&gt; function (implemented in the standard library class &lt;code class=&quot;highlighter-rouge&quot;&gt;java.net.InetAddress&lt;/code&gt;) when Ambari autogenerates principals.&lt;/p&gt;

&lt;h2 id=&quot;testing-the-hostname&quot;&gt;Testing the hostname&lt;/h2&gt;

&lt;p&gt;Let’s put the pieces together and write a small Java program that prints the hostname using the &lt;code class=&quot;highlighter-rouge&quot;&gt;getCanonicalHostName()&lt;/code&gt; function.&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.net.InetAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.net.UnknownHostException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PrintHostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

  &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;server_hostname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;server_hostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;InetAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getLocalHost&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getCanonicalHostName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toLowerCase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UnknownHostException&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Could not find canonical hostname&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;server_hostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;server_hostname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can compile and run the program with the following commands:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;javac PrintHostname.java
java PrintHostname
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Finally, we can test the two versions of &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1   sandbox.chaosmail.at
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the above hosts file, the PrintHostname program prints &lt;code class=&quot;highlighter-rouge&quot;&gt;localhost&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;127.0.0.1   sandbox.chaosmail.at
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the above hosts file, the PrintHostname program prints &lt;code class=&quot;highlighter-rouge&quot;&gt;sandbox.chaosmail.at&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;resolving-_host-in-kerberos-principals&quot;&gt;Resolving &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; in Kerberos principals&lt;/h2&gt;

&lt;p&gt;Finally, we can be sure that &lt;code class=&quot;highlighter-rouge&quot;&gt;getCanonicalHostName()&lt;/code&gt; returns &lt;code class=&quot;highlighter-rouge&quot;&gt;sandbox.chaosmail.at&lt;/code&gt; and hence that the &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; variable will resolve to &lt;code class=&quot;highlighter-rouge&quot;&gt;sandbox.chaosmail.at&lt;/code&gt;. This means that all principals generated by Ambari will contain the proper hostname and therefore be valid.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://de.hortonworks.com/products/sandbox/&quot;&gt;Hortonworks HDP 2.5&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/apache/ambari&quot;&gt;Ambari Github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Sun, 05 Mar 2017 00:30:00 +0000</pubDate>
        <link>https://chaosmail.github.io//hadoop/2017/03/05/Resolving-host-in-kerberized-hdp-sandbox/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//hadoop/2017/03/05/Resolving-host-in-kerberized-hdp-sandbox/</guid>
        
        <category>hdp</category>
        
        <category>hadoop</category>
        
        <category>kerberos</category>
        
        
        
      </item>
    
      <item>
        <title>Intro to Deep Learning for Computer Vision</title>
<description>&lt;p&gt;The field of Deep Learning (DL) is growing rapidly and has been surpassing traditional approaches to machine learning and pattern recognition since 2012 by 10%-20% in accuracy. This blog post gives an introduction to DL and its applications in computer vision, with a focus on understanding state-of-the-art architectures such as AlexNet, GoogLeNet, VGG and ResNet, and methodologies such as classification, localization, segmentation, detection and recognition. It is based on a &lt;a href=&quot;http://www.slideshare.net/ChristophKrner/intro-to-deep-learning-for-computer-vision&quot;&gt;presentation&lt;/a&gt; that I held in the Seminar of Computer Graphics at Vienna University of Technology.&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;The first steps towards artificial intelligence and machine learning were made in 1958 by Rosenblatt, a psychologist who developed a simplified mathematical model mimicking the behavior of neurons in the human brain - the &lt;a href=&quot;#Rosenblatt57&quot;&gt;&lt;em&gt;Perceptron&lt;/em&gt;&lt;/a&gt;. Given a set of training data, it approximates a linear function by iteratively updating its weights according to the output of a simple FeedForward (FF) pass. By combining multiple Perceptrons (also called &lt;em&gt;Neurons&lt;/em&gt; or &lt;em&gt;Units&lt;/em&gt;) into a network with one input and one output layer, Rosenblatt built a system that could correctly classify simple geometric shapes. However, such a network is unable to approximate the nonlinear XOR function and does not allow &lt;em&gt;hidden&lt;/em&gt; layers due to its simple update rule. These limitations, as well as hardware restrictions, led to vanishing public interest during the following decades.&lt;/p&gt;

&lt;p&gt;The second important novelty for Artificial Neural Networks (ANNs) was the formal introduction of &lt;em&gt;BackPropagation&lt;/em&gt; (BP) in the early 1960s, a concept for computing a derivative for all neurons in a network using the chain rule and propagating the error from the output layer back through the different layers to the input layer. This method makes it possible to optimize the weights of a network with optimization techniques such as stochastic gradient descent. It was first applied to NNs by &lt;a href=&quot;#Werbos74&quot;&gt;Werbos&lt;/a&gt; in 1974 but, due to the lack of interest in NNs, only published in 1982. It finally attracted the interest of researchers such as LeCun et al. and &lt;a href=&quot;#Rumelhart86&quot;&gt;Rumelhart et al.&lt;/a&gt;, who further improved the concept of BP for supervised learning in NNs in 1986. In 1989, Hornik et al. proved that NNs with a single hidden layer can approximate any continuous function and hence also solve the XOR problem.&lt;/p&gt;

&lt;p&gt;In a first practical application in 1989, LeCun demonstrated the recognition of handwritten ZIP codes using a convolution operation instead of a hidden layer, together with a subsampling operator (also called &lt;em&gt;Pooling&lt;/em&gt;). Thanks to these convolutions, a single neuron in the second layer can learn what a complete layer learns in a non-convolutional network, which leads to much more efficient training due to the reduced number of parameters: a layer with 256 3x3 filters has 2,304 parameters, whereas a fully connected layer of 256 units following another layer of 256 units has 65,536 parameters. Convolutional layers and pooling layers together build the basis for &lt;em&gt;Convolutional Neural Networks&lt;/em&gt; (CNNs).&lt;/p&gt;

&lt;p&gt;NNs advanced in many domains, such as unsupervised learning with AutoEncoders (AEs) and Self-Organizing Maps (SOMs), as well as reinforcement learning, especially in control systems and robotics. New models such as Belief Networks, Time-Delay Neural Networks (TDNNs) for audio processing and Recurrent Neural Networks (RNNs) for speech recognition were implemented. However, in the late 1980s, multi-layer NNs were still difficult to train with BP due to the vanishing or exploding gradient problem. In the 1990s, new methods such as Support Vector Machines (SVMs) and Random Forests (RFs) were considered better suited for supervised learning than NNs due to their much simpler mathematical constructs.&lt;/p&gt;

&lt;p&gt;Almost two decades later, Hinton et al. showed that &lt;em&gt;Deep Neural Networks&lt;/em&gt; (DNNs) can be trained if the weights are initialized better than randomly, and rebranded the domain of multi-layer NNs as &lt;em&gt;Deep Learning&lt;/em&gt; (DL). Leveraging the parallelization power of GPUs yielded a millionfold speedup in training compared to the 1980s and a factor-of-70 speedup compared to common CPUs in 2007. In 2011, DNNs broke a 10-year-old state-of-the-art record in speech recognition, trained on 1000 times more data than was available in the 1980s. &lt;a href=&quot;#Glorot10&quot;&gt;Glorot&lt;/a&gt;, LeCun and Hinton further studied the necessity of weight initialization and proposed a much simpler activation function &lt;script type=&quot;math/tex&quot;&gt;f(x) = max(0, x)&lt;/script&gt; - the so-called Rectified Linear Unit (&lt;em&gt;ReLU&lt;/em&gt;) - for more stable BP.&lt;/p&gt;

&lt;p&gt;Since 2012, DNNs have been winning the classification, detection, localization and segmentation tasks of the ImageNet competitions (&lt;a href=&quot;#Schmidhuber14&quot;&gt;Schmidhuber&lt;/a&gt;) and have outperformed all methods based on hand-engineered features with almost &lt;script type=&quot;math/tex&quot;&gt;10&lt;/script&gt;% higher accuracy (&lt;a href=&quot;#Krizhevsky12&quot;&gt;Krizhevsky&lt;/a&gt;). The winning model of the 2015 ImageNet competition is &lt;a href=&quot;#He15&quot;&gt;ResNet-152 from Microsoft&lt;/a&gt;, a DNN with residual mappings and 152 layers, which achieved on average &lt;script type=&quot;math/tex&quot;&gt;16.5&lt;/script&gt;% higher accuracy than the runner-up and &lt;a href=&quot;#He15b&quot;&gt;surpassed human accuracy in classification&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;from-neural-networks-to-deep-learning&quot;&gt;From Neural Networks to Deep Learning&lt;/h2&gt;

&lt;p&gt;In Deep Learning, deeply nested Convolutional Neural Networks with more than a million parameters are trained on more than a million training samples using BP and optimization. Models are tuned via the filter arrangement and dimensions (&lt;a href=&quot;#Simonyan14&quot;&gt;VGGNet&lt;/a&gt;, etc.), module structures that reduce the number of parameters (&lt;a href=&quot;#Lin13&quot;&gt;Network in Network&lt;/a&gt;, &lt;a href=&quot;#Szegedy14&quot;&gt;GoogLeNet&lt;/a&gt;, etc.), techniques for preventing overfitting (&lt;a href=&quot;#Ioffe15&quot;&gt;batch normalization&lt;/a&gt;, &lt;a href=&quot;#Srivastava14&quot;&gt;Dropout&lt;/a&gt;, etc.), new initializations and non-linearities (&lt;a href=&quot;#Saxe13&quot;&gt;Xavier initialization&lt;/a&gt;, Leaky ReLU, etc.), better optimization techniques (&lt;a href=&quot;#Tieleman13&quot;&gt;RMSProp&lt;/a&gt;, etc.), new layer types and connections (&lt;a href=&quot;#He15b&quot;&gt;ResNet&lt;/a&gt;, etc.), as well as compositions of different structures (image captioning, deconvolution for visualization, etc.).&lt;/p&gt;

&lt;h3 id=&quot;neural-networks&quot;&gt;Neural Networks&lt;/h3&gt;

&lt;p&gt;Neural Networks date back to Rosenblatt’s work on the &lt;a href=&quot;#Rosenblatt57&quot;&gt;Perceptron&lt;/a&gt; in the late 1950s, which is one of the building blocks of modern DNNs. These networks are constructed out of hidden layers (fully connected layers) with multiple Perceptrons per layer.&lt;/p&gt;

&lt;h4 id=&quot;the-perceptron---gated-linear-regression&quot;&gt;The Perceptron - Gated Linear Regression&lt;/h4&gt;

&lt;p&gt;The Perceptron aims to recreate the behavior of a neuron in the human brain by combining different inputs into a weighted sum and triggering a signal once a threshold is reached. The more important an input is, the higher its weight. The subsequent figure shows the inputs &lt;script type=&quot;math/tex&quot;&gt;\vec{x}={x_1, x_2, x_3, ...}&lt;/script&gt;, the weight &lt;script type=&quot;math/tex&quot;&gt;w_i&lt;/script&gt; for each input and a bias term &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt;; the threshold (or non-linearity) is modeled as a step function &lt;script type=&quot;math/tex&quot;&gt;h&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/perceptron.png&quot; alt=&quot;Perceptron&quot; title=&quot;Perceptron&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Perceptron (Source: &lt;a href=&quot;https://github.com/cdipaolo/goml/tree/master/perceptron&quot;&gt;github.com/cdipaolo&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The output of the Perceptron can be written as&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\vec{y} = h(W \cdot \vec{x} + \theta)&lt;/script&gt;

&lt;p&gt;and resembles a simple linear regression with a threshold gate. While the Perceptron can be used as a building block for a universal function approximator, the discontinuity of the step function leads to problems when computing the gradient.&lt;/p&gt;
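
&lt;p&gt;As a rough sketch (the input values and weights below are made up for illustration), the FeedForward pass of a single Perceptron fits in a few lines of Java:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;double[] x = {1.0, 0.5, -0.3};  // inputs
double[] w = {0.4, -0.2, 0.7};  // weights
double theta = 0.1;             // bias term

// weighted sum of all inputs plus the bias
double sum = theta;
for (int i = 0; i &lt; x.length; i++) {
  sum += w[i] * x[i];
}

// step function h: trigger a signal once the threshold is reached
int y = (sum &gt;= 0) ? 1 : 0;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;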

&lt;h4 id=&quot;non-linearities-activation-functions&quot;&gt;Non-linearities (Activation Functions)&lt;/h4&gt;

&lt;p&gt;Because of this discontinuity in the step function, other non-linear activation functions have been proposed and successfully used in NNs.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Rectified Linear Unit&lt;/em&gt; (ReLU) is the most widely used non-linearity in DNNs and is computed as &lt;script type=&quot;math/tex&quot;&gt;y = max(0, x)&lt;/script&gt;. While not differentiable at 0, the ReLU function greatly improves the learning process of the network because the total gradient is simply passed through to the input whenever the unit is active.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Leaky Rectified Linear Unit&lt;/em&gt; (Leaky ReLU) addresses the fact that the original ReLU outputs 0 for negative inputs and hence does not propagate any gradient there (a crucial consideration for initialization); it uses a small non-zero slope for negative inputs instead.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Sigmoid&lt;/em&gt; non-linearities are commonly used in the final output layer of regression networks, as the outputs are bounded between &lt;script type=&quot;math/tex&quot;&gt;0&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;1&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Softmax&lt;/em&gt; non-linearities are commonly used in the final output layer of classification networks, as the outputs sum to &lt;script type=&quot;math/tex&quot;&gt;1&lt;/script&gt; - similar to a probability distribution.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Hyperbolic tangent&lt;/em&gt; (tanh) non-linearities are often used as a replacement for the sigmoid function; their outputs are centered around &lt;script type=&quot;math/tex&quot;&gt;0&lt;/script&gt; and bounded between &lt;script type=&quot;math/tex&quot;&gt;-1&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;1&lt;/script&gt;, which often leads to slightly better training results.&lt;/li&gt;
&lt;/ul&gt;
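
&lt;p&gt;These non-linearities are essentially one-liners; here is a minimal sketch in Java (Softmax is shown for a whole vector, as it normalizes across all outputs):&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;static double relu(double x)      { return Math.max(0, x); }
static double leakyRelu(double x) { return x &gt; 0 ? x : 0.01 * x; }  // small slope for x below 0
static double sigmoid(double x)   { return 1.0 / (1.0 + Math.exp(-x)); }
static double tanh(double x)      { return Math.tanh(x); }

// Softmax turns a vector of activations into values that sum to 1
static double[] softmax(double[] x) {
  double sum = 0;
  double[] y = new double[x.length];
  for (double v : x) { sum += Math.exp(v); }
  for (int i = 0; i &lt; x.length; i++) { y[i] = Math.exp(x[i]) / sum; }
  return y;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;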

&lt;h3 id=&quot;deep-neural-networks&quot;&gt;Deep Neural Networks&lt;/h3&gt;

&lt;p&gt;DL models are constructed mostly out of &lt;em&gt;Convolutional&lt;/em&gt; (Conv) and &lt;em&gt;Pooling&lt;/em&gt; (Pool) layers as they have been used by &lt;a href=&quot;#LeCun90&quot;&gt;LeCun&lt;/a&gt; in the late 1980s as shown in the following figure.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/lenet.png&quot; alt=&quot;Architecture of LeNet&quot; title=&quot;Architecture of LeNet&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Architecture of &lt;a href=&quot;#LeCun90&quot;&gt;LeNet&lt;/a&gt;&lt;/p&gt;

&lt;h4 id=&quot;convolutions&quot;&gt;Convolutions&lt;/h4&gt;

&lt;p&gt;A Conv layer consists of spatial filters that are convolved along the spatial dimensions and summed up along the depth dimension of the input volume. In general, one starts with a large filter size (e.g. 11x11) and a low depth (e.g. 32) and then reduces the spatial filter dimensions (e.g. to 3x3) while increasing the depth (e.g. to 256) throughout the network. Due to weight sharing, Conv layers are much more efficient than fully connected layers. A Conv layer has &lt;script type=&quot;math/tex&quot;&gt;w \cdot h \cdot d \cdot n_f&lt;/script&gt; parameters without bias (&lt;script type=&quot;math/tex&quot;&gt;w&lt;/script&gt; .. width of the filter, &lt;script type=&quot;math/tex&quot;&gt;h&lt;/script&gt; .. height of the filter, &lt;script type=&quot;math/tex&quot;&gt;d&lt;/script&gt; .. depth of the filter, &lt;script type=&quot;math/tex&quot;&gt;n_f&lt;/script&gt; .. number of filters) that need to be learned during training.&lt;/p&gt;
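
&lt;p&gt;With this formula we can verify the parameter counts mentioned in the introduction (256 3x3 filters on a depth-1 input versus a fully connected layer of 256 units following 256 units):&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Conv layer: w * h * d * n_f parameters (without bias)
int convParams = 3 * 3 * 1 * 256;  // = 2304

// Fully connected layer: outputs of the previous layer * neurons
int fcParams = 256 * 256;          // = 65536
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;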

&lt;h4 id=&quot;pooling&quot;&gt;Pooling&lt;/h4&gt;

&lt;p&gt;Conv layers are often followed by a Pool layer in order to reduce the spatial dimension of the volume for the next filter - this is the equivalent of a subsampling operation. The pooling operation itself has no learnable parameters.&lt;/p&gt;

&lt;p&gt;Most of the time, &lt;script type=&quot;math/tex&quot;&gt;max&lt;/script&gt; pooling layers are used in DL models due to their easy gradient computation: during BP, the gradient only flows towards the single max activation, which can be computed very efficiently. A few other architectures use &lt;script type=&quot;math/tex&quot;&gt;avg&lt;/script&gt; pooling, mostly at the end of the network before the fully connected layers, without a noticeable difference in performance.&lt;/p&gt;

&lt;h4 id=&quot;normalization&quot;&gt;Normalization&lt;/h4&gt;

&lt;p&gt;In modern (post-sigmoid) DNNs, &lt;em&gt;Normalization&lt;/em&gt; is necessary for stable gradients throughout the network. Due to the unbounded behavior of the ReLU activations (&lt;script type=&quot;math/tex&quot;&gt;y = max(0, x)&lt;/script&gt;), filter responses have to be normalized. Usually this is done per batch using &lt;a href=&quot;#Ioffe15&quot;&gt;batch normalization&lt;/a&gt; or locally using a Local Response Normalization layer.&lt;/p&gt;

&lt;h4 id=&quot;fully-connected-layer&quot;&gt;Fully Connected Layer&lt;/h4&gt;

&lt;p&gt;The FC layer works exactly as described in the previous section - it connects every output of the previous layer to each of its neurons. Usually, FC layers are used at the end of the network to combine all spatially distributed activations of the previous Conv layers. The FC layers hold the highest number of parameters in the model (almost 90%), namely &lt;script type=&quot;math/tex&quot;&gt;n_i \cdot n_n&lt;/script&gt;, where &lt;script type=&quot;math/tex&quot;&gt;n_i&lt;/script&gt; is the number of outputs of the previous layer and &lt;script type=&quot;math/tex&quot;&gt;n_n&lt;/script&gt; is the number of neurons; most computing time, however, is spent in the early Conv layers.&lt;/p&gt;

&lt;h3 id=&quot;final-output-layer&quot;&gt;Final Output Layer&lt;/h3&gt;

&lt;p&gt;The final output layer of a DNN plays a crucial role for the task of the whole network. Common choices are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Classification&lt;/em&gt;: Softmax layer, computes values &lt;script type=&quot;math/tex&quot;&gt;y_{i} \in [0, 1]&lt;/script&gt; such that &lt;script type=&quot;math/tex&quot;&gt;\sum_i y_{i} = 1&lt;/script&gt; - each output can be interpreted as the probability that the input belongs to a certain class&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Regression&lt;/em&gt;: Sigmoid layer, predicts values &lt;script type=&quot;math/tex&quot;&gt;y_{ij} \in [0, 1]&lt;/script&gt; for an output with &lt;script type=&quot;math/tex&quot;&gt;j&lt;/script&gt; dimensions&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Regression and classification&lt;/em&gt;: Both tasks can be combined by connecting two output layers and outputting both values at once. This is used for object detection with a fixed number of objects, e.g. one regression output per class&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Encoder&lt;/em&gt;: Stop at the fully connected layer and use it as a feature space for clustering, SVMs, post-processing, etc.&lt;/li&gt;
&lt;/ul&gt;
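&lt;p&gt;The softmax computation mentioned above can be sketched in a few lines; subtracting the maximum is a standard numerical-stability trick, and the toy scores are made up:&lt;/p&gt;

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to shifting all inputs by a constant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
# probs sums to 1 and each entry lies in [0, 1]
```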

&lt;p&gt;After defining a final output layer, one also needs to define a loss function for the given task. Picking the right loss is crucial for training a DNN; common choices are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Classification&lt;/em&gt;: Cross-entropy, computes the cross-entropy between the output of the network and the ground truth label and can be used for binary and categorical outputs (via one-hot encoding)&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Regression&lt;/em&gt;: Squared error and mean squared error are common choices for regression problems&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Segmentation&lt;/em&gt;: Intersection over union is a loss function well suited for comparing overlapping regions of an image and a prediction - however, it is not well suited for DNNs because it returns 0 for non-overlapping regions; MSE is a better choice&lt;/li&gt;
&lt;/ul&gt;
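&lt;p&gt;A minimal sketch of the two loss functions named above, with made-up predictions and a one-hot label:&lt;/p&gt;

```python
import numpy as np

def cross_entropy(p, y_onehot, eps=1e-12):
    # p: predicted class probabilities (e.g. a softmax output),
    # y_onehot: one-hot encoded ground-truth label.
    return -np.sum(y_onehot * np.log(p + eps))

def mse(pred, target):
    # Mean squared error for regression outputs
    return np.mean((pred - target) ** 2)

p = np.array([0.7, 0.2, 0.1])
y = np.array([1.0, 0.0, 0.0])   # correct class is class 0
loss = cross_entropy(p, y)      # -log(0.7), about 0.357
```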

&lt;p&gt;The loss function can also be extended with a regularization term to constrain the parameters of the DNN. Both &lt;em&gt;L1&lt;/em&gt; and &lt;em&gt;L2&lt;/em&gt; regularization on the filter matrices &lt;script type=&quot;math/tex&quot;&gt;W_i&lt;/script&gt; are commonly used.&lt;/p&gt;
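&lt;p&gt;The L2 regularization term can be sketched as a simple additive penalty over all filter matrices; the toy matrices and the strength &lt;code&gt;lam&lt;/code&gt; below are illustrative values, not recommendations:&lt;/p&gt;

```python
import numpy as np

def l2_penalty(weights, lam=1e-4):
    # Sum of squared entries over all filter matrices W_i,
    # scaled by the regularization strength lam; added to the task loss.
    return lam * sum(np.sum(W ** 2) for W in weights)

W1 = np.ones((2, 2))   # toy filter matrices
W2 = np.ones((3, 1))
penalty = l2_penalty([W1, W2], lam=0.5)  # 0.5 * (4 + 3) = 3.5
```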

&lt;h2 id=&quot;deep-learning-architectures&quot;&gt;Deep Learning Architectures&lt;/h2&gt;

&lt;p&gt;This section describes state-of-the-art DNN architectures, common parameterizations and structures in DL.&lt;/p&gt;

&lt;h3 id=&quot;convolutional-neural-networks-cnn&quot;&gt;Convolutional Neural Networks (CNN)&lt;/h3&gt;

&lt;p&gt;A CNN is a neural network model that contains (multiple) convolutional layers (each followed by a non-linear activation function) and additional pooling layers in the earlier part of the network.&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;convolution layer&lt;/em&gt; extracts image features by convolving the input with multiple filters. It contains a set of 2-dimensional filters that are stacked into a 3-dimensional volume, where each filter is applied to a volume constructed from all filter responses of the previous layer. If one considers the RGB channels of a 256x256 input image as a 256x256x3 volume, a 5x5x3 filter is applied to a 5x5 2-dimensional region of the image and summed up across all 3 color channels. If the first layer after the RGB volume consists of 48 filters, it is represented as a volume of 5x5x3x48 weight parameters and 48 bias parameters. Using a convolution operation on the input volume and the filter volumes, the filter responses (so-called activations) form an output volume with the dimensions 252x252x48 (using stride 1 and no padding). By padding the input layer with 0s, one can keep the spatial dimensions of the activations constant throughout the layers.&lt;/p&gt;

&lt;p&gt;Each convolution layer is followed by a nonlinear &lt;em&gt;activation function&lt;/em&gt; (in DL mostly ReLU layers are used) which results in an activation with the same dimensions as the output volume of the previous convolution layer.&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;pooling layer&lt;/em&gt; subsamples the previous layer and outputs a volume of the same depth but reduced spatial dimensions. Similar to a Gaussian pyramid, pooling helps filters activate on features in the image at a different scale. Using a 2x2 max-pooling filter with stride 2 (the filter is shifted by 2 pixels on every iteration), a 252x252x48 activation volume (from a 5x5 convolution on the 256x256 input with stride 1 and no padding) is reduced to 126x126x48. In general, pooling layers are used at the beginning of the network and at the end to better control the dimensions of the activations right before the fully-connected layers.&lt;/p&gt;
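&lt;p&gt;The spatial dimensions in the example above follow from the standard output-size formula &lt;script type=&quot;math/tex&quot;&gt;\lfloor (n - f + 2p) / s \rfloor + 1&lt;/script&gt;, sketched here for the 256x256 case:&lt;/p&gt;

```python
def conv_out(n, f, stride=1, pad=0):
    # Spatial output size of a convolution or pooling layer:
    # floor((n - f + 2*pad) / stride) + 1
    return (n - f + 2 * pad) // stride + 1

# 256x256 input, 5x5 filters, stride 1, no padding
after_conv = conv_out(256, 5)            # 252
# 2x2 max pooling with stride 2
after_pool = conv_out(after_conv, 2, 2)  # 126
# Padding with 2 zeros keeps the spatial size constant for a 5x5 filter
same = conv_out(256, 5, 1, 2)            # 256
```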

&lt;h3 id=&quot;alexnet-2012&quot;&gt;AlexNet (2012)&lt;/h3&gt;

&lt;p&gt;The following figure shows the architecture of &lt;a href=&quot;#Krizhevsky12&quot;&gt;AlexNet&lt;/a&gt;, the winning model for classification in the ImageNet competition in 2012. As shown in the figure, it consists of 2 parallel sets of layers with both convolution and max pooling layers. The model was arranged like this because 2 graphic cards were used in parallel for training (training AlexNet on the ImageNet dataset took 2 weeks on this setup). The filters start with spatial dimensions of 5x5 and a depth of 55 (1,375 weights without bias) and go to a 3x3x192 filter volume (1,728 weights without bias) towards the end of the network; hence, the filter size decreases while the filter depth increases per layer. The end of the network consists of 2 fully connected layers with 2048 nodes each (4,194,304 weights without bias) and an output layer with 1000 nodes (2,048,000 weights without bias), according to the 1000 classes in the ImageNet dataset. Hence, most memory is used for the weights in the final fully connected layers; the whole model requires about 240MB of memory in total. The fully connected layers at the end are needed to set the spatial filter responses in relation to each other for the resulting class prediction.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/alexnet.png&quot; alt=&quot;Architecture of AlexNet&quot; title=&quot;Architecture of AlexNet&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Architecture of &lt;a href=&quot;#Krizhevsky12&quot;&gt;AlexNet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The winner of the ImageNet classification task in 2013 was a tuned version of AlexNet using different initialization and optimization.&lt;/p&gt;

&lt;h3 id=&quot;vggnet-2014&quot;&gt;VGGNet (2014)&lt;/h3&gt;

&lt;p&gt;The VGGNet model was placed second in the ImageNet competition in 2014 by using only 3x3 filters throughout the whole network. Many different layer depths were tested, with the resulting insight that &lt;a href=&quot;#Simonyan14&quot;&gt;&lt;em&gt;deeper is always better&lt;/em&gt;&lt;/a&gt;. This statement holds only as long as the deeper model can still be trained without an exploding or vanishing gradient throughout the network.&lt;/p&gt;

&lt;p&gt;As we can see in the next figure, VGG-19 achieves very good results on the ImageNet 2012 classification dataset while having a poor accuracy-to-parameter ratio. Storing only the parameters of VGG-19 requires more than 500MB of memory.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/top1.png&quot; alt=&quot;Accuracy&quot; title=&quot;Accuracy&quot; class=&quot;image-col-2&quot; /&gt;
&lt;img src=&quot;/images/deep-learning/top1-param.png&quot; alt=&quot;Accuracy per parameter&quot; title=&quot;Accuracy per parameter&quot; class=&quot;image-col-2&quot; /&gt;&lt;/p&gt;

&lt;p&gt;DL models Top-1 classification accuracy vs. accuracy per parameter (Source: &lt;a href=&quot;#Canziani16&quot;&gt;Canziani&lt;/a&gt;)&lt;/p&gt;

&lt;h3 id=&quot;googlenet-2014&quot;&gt;GoogLeNet (2014)&lt;/h3&gt;

&lt;p&gt;The winning model in the ImageNet competition in 2014 was &lt;a href=&quot;#Szegedy14&quot;&gt;GoogLeNet&lt;/a&gt;, which is even deeper than the previously discussed VGGNet. However, it uses only one tenth of the number of parameters of AlexNet, thanks to an architecture built from 9 parallel modules, the inception modules. As shown in the next figure, this module uses 1x1 convolutions (so-called bottleneck convolutions) to sum over the depth dimension of the previous layers while keeping the spatial dimensions of the volume. These 1x1 convolutions allow one to control/reduce the depth dimension, which greatly reduces the number of parameters by removing the redundancy of correlated filters. They also make it possible to learn a set of features across 1x1, 3x3 and 5x5 spatial dimensions in parallel, mixed with a pooling of the original volume.&lt;/p&gt;
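&lt;p&gt;A back-of-the-envelope sketch of the parameter saving from a 1x1 bottleneck; the depths and filter counts below are hypothetical, not the actual GoogLeNet configuration:&lt;/p&gt;

```python
def conv_params(f, in_depth, n_filters):
    # Weight count of a conv layer with f x f filters (biases ignored)
    return f * f * in_depth * n_filters

# Hypothetical branch: a 5x5 convolution directly on a 256-deep input
direct = conv_params(5, 256, 64)                               # 409600 weights
# Same branch with a 1x1 bottleneck first reducing the depth to 32
bottleneck = conv_params(1, 256, 32) + conv_params(5, 32, 64)  # 59392 weights
```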

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/inception.png&quot; alt=&quot;GoogLeNet&quot; title=&quot;GoogLeNet&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Inception module (Source &lt;a href=&quot;http://vision.princeton.edu/courses/COS598/2015sp/slides/GoogLeNet/2014-10-17_dlrg.pdf&quot;&gt;COS598 (Princeton) lecture slides&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/alexnetCustom.png&quot; alt=&quot;AlexNet&quot; title=&quot;AlexNet&quot; class=&quot;image-col-3&quot; /&gt;
&lt;img src=&quot;/images/deep-learning/vggCustom.png&quot; alt=&quot;VGG&quot; title=&quot;VGG&quot; class=&quot;image-col-3&quot; /&gt;
&lt;img src=&quot;/images/deep-learning/googlenetCustom.png&quot; alt=&quot;GoogLeNet&quot; title=&quot;GoogLeNet&quot; class=&quot;image-col-3&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Deep Learning models 2012-2014 (Source: &lt;a href=&quot;https://chaosmail.github.io/caffejs/models.html&quot;&gt;CaffeJS&lt;/a&gt;)&lt;/p&gt;

&lt;h3 id=&quot;deep-residual-networks-2015&quot;&gt;Deep Residual Networks (2015)&lt;/h3&gt;

&lt;p&gt;Microsoft’s residual network ResNet, the winner of ImageNet 2015 and the deepest network so far with 152 layers, achieved a top-5 classification error of 4.9% (which is slightly better than human accuracy, Source: Karpathy). By introducing residual connections (skip connections) between an input and the filter activation (as shown in the following figure), the network can learn incremental changes instead of a completely new behavior. This concept is similar to the parallel inception modules, except that only one filter learns the incremental change between an input volume and its activation.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/residual.png&quot; alt=&quot;Residual connections&quot; title=&quot;Residual connections&quot; class=&quot;image-col-1&quot; /&gt;
Residual connections (Source: &lt;a href=&quot;http://stats.stackexchange.com/questions/56950/neural-network-with-skip-layer-connections&quot;&gt;stackexchange.com&lt;/a&gt;)&lt;/p&gt;
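&lt;p&gt;The residual idea can be sketched in one line: the block outputs the input plus a learned residual function, so a branch that outputs zeros reduces the block to the identity. This is a minimal illustration, not the full ResNet block with convolutions and batch normalization:&lt;/p&gt;

```python
import numpy as np

def residual_block(x, f):
    # The branch f only has to learn the residual f(x); the identity
    # path carries x through unchanged (shapes must match).
    return x + f(x)

x = np.array([1.0, 2.0, 3.0])
# If the residual branch outputs zeros, the block is exactly the identity
y = residual_block(x, lambda v: np.zeros_like(v))
```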

&lt;h2 id=&quot;applications-in-computer-vision&quot;&gt;Applications in Computer Vision&lt;/h2&gt;

&lt;p&gt;DNNs perform 10%-20% better in accuracy than methods using hand-engineered feature extraction in most computer vision tasks, given enough training images and computational resources (&lt;a href=&quot;#Szegedy13&quot;&gt;Szegedy&lt;/a&gt;, &lt;a href=&quot;#Long14&quot;&gt;Long&lt;/a&gt;). As shown in the next figure, the DL applications in computer vision differ in the number of objects in the image and in the output of the network. This section describes these applications and the DL approaches commonly used for them.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/localizationVsDetection.png&quot; alt=&quot;Difference between Classification, Localization, Detection and Segmentation&quot; title=&quot;Difference between Classification, Localization, Detection and Segmentation&quot; class=&quot;image-col-1&quot; /&gt;
Difference between Classification, Localization, Detection and Segmentation (Source: CS231n (Stanford) lecture slides)&lt;/p&gt;

&lt;h3 id=&quot;classification&quot;&gt;Classification&lt;/h3&gt;

&lt;p&gt;In a &lt;em&gt;classification&lt;/em&gt; task, a model has to predict the correct label of an image based on its contextual information, which is extracted from the pixel values. Usually, one image belongs to a single class. Given an input image, the NN computes a probability value for each class; the most likely class is the resulting class for the image. An example is shown in the following figure for 8 samples from the ImageNet dataset, where the 5 most likely classes (blue bars) are displayed per input image and the correct label is shown as a red bar (if it appears in these 5 classes). CNNs show a performance &lt;a href=&quot;#Krizhevsky12&quot;&gt;increase of 20%&lt;/a&gt; to &lt;a href=&quot;#Ciregan12&quot;&gt;34%&lt;/a&gt; in comparison to traditional models.&lt;/p&gt;

&lt;p&gt;DNNs for classification commonly use the “standard” architecture denoted as &lt;script type=&quot;math/tex&quot;&gt;IN \to [[CONV\to ReLU] \cdot N \to POOL?] \cdot M \to [FC \to ReLU] \cdot K \to FC&lt;/script&gt;, where input &lt;script type=&quot;math/tex&quot;&gt;IN&lt;/script&gt; is a single image and output &lt;script type=&quot;math/tex&quot;&gt;FC&lt;/script&gt; is a fully connected layer with &lt;script type=&quot;math/tex&quot;&gt;m&lt;/script&gt; units (one per class). The output of each unit in the last layer corresponds to the probability that the image belongs to the class &lt;script type=&quot;math/tex&quot;&gt;y&lt;/script&gt;. We can see that the network from 1989 already had a very similar structure (with the slight difference of using hyperbolic tangent non-linearities).&lt;/p&gt;
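&lt;p&gt;The layer pattern above can be expanded mechanically; this small helper is only an illustration of the notation, not a trainable network:&lt;/p&gt;

```python
def cnn_pattern(N=2, M=3, K=2, pool=True):
    # Expand IN -> [[CONV->RELU]*N -> POOL?]*M -> [FC->RELU]*K -> FC
    layers = ["IN"]
    for _ in range(M):
        layers += ["CONV", "RELU"] * N
        if pool:
            layers.append("POOL")
    layers += ["FC", "RELU"] * K
    layers.append("FC")
    return layers

print(" -> ".join(cnn_pattern(N=1, M=2, K=1)))
# IN -> CONV -> RELU -> POOL -> CONV -> RELU -> POOL -> FC -> RELU -> FC
```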

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/classification.png&quot; alt=&quot;Image Classification using the ImageNet dataset&quot; title=&quot;Image Classification using the ImageNet dataset&quot; class=&quot;image-col-1&quot; /&gt;
Image Classification using the ImageNet dataset (Source: &lt;a href=&quot;#Krizhevsky12&quot;&gt;Krizhevsky&lt;/a&gt;)&lt;/p&gt;

&lt;h3 id=&quot;localization&quot;&gt;Localization&lt;/h3&gt;

&lt;p&gt;In a localization task, the model needs to identify the position of an object in a sample image. Many tasks such as localization and detection can be reduced to a simple binary classification task. By sliding a window over the sample image, one can determine the probability of detecting an object at the current position of the window. After trying all possible window positions, the position of the object can be computed from the output. Due to the sliding window, the dimensions of the input images don’t have to be fixed. Despite its simplicity, this method requires &lt;script type=&quot;math/tex&quot;&gt;O(n^2)&lt;/script&gt; iterations and a complex network setup and hence is not used for localization in practice.&lt;/p&gt;

&lt;p&gt;In the simplest case with only one object in the image, the localization task can also be solved by predicting its bounding box coordinates. This is a regression problem and can be solved with a similar architecture as classification. DNNs for bounding box localization are mostly used for cropping raw input images as a preprocessing step for other applications (such as classification).&lt;/p&gt;

&lt;p&gt;A common architecture of DNNs to solve a localization task as regression uses a fully connected layer with &lt;script type=&quot;math/tex&quot;&gt;4&lt;/script&gt; units (the bounding box can be identified with the coordinates of its left-top corner, width and height) in the last layer followed by a sigmoid layer. The bounding box regression can be used in combination with classification, such that bounding box coordinates can be learned for each class individually. Both methods can be combined in one DNN with one classification and one regression output.&lt;/p&gt;
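&lt;p&gt;A toy sketch of such a 4-unit regression head: a linear layer followed by a sigmoid, so each coordinate lands in [0, 1] relative to the image dimensions. The feature vector and weights here are random placeholders, not trained values:&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bbox_head(features, W, b):
    # 4-unit output (x, y, width, height), squashed into (0, 1) by a
    # sigmoid so the box is expressed relative to the image size.
    return sigmoid(features @ W + b)

rng = np.random.default_rng(1)
feat = rng.random(16)                    # placeholder feature vector
W, b = rng.random((16, 4)), np.zeros(4)  # placeholder head parameters
box = bbox_head(feat, W, b)              # 4 values in (0, 1)
```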

&lt;p&gt;The bounding box approach can be applied when the exact number of objects in the image is known and the dimensions of the input images are fixed. The localization precision is limited by the rectangular geometry of the bounding box; however, other shapes such as skewed rectangles, circles, or parallelograms can also be used.&lt;/p&gt;

&lt;h3 id=&quot;object-detection-and-image-recognition&quot;&gt;Object Detection and Image Recognition&lt;/h3&gt;

&lt;p&gt;To understand the context of images, one has to not only classify an image, but also find multiple different object instances in an image and estimate their positions. This application of classifying and localizing multiple objects in an image is referred to as &lt;em&gt;Object Detection&lt;/em&gt;. Using DNNs for object detection has the advantage that they can implicitly learn complex object representations instead of manually deriving them from a kinematic model, such as a Deformable Part-based Model (DPM).&lt;/p&gt;

&lt;p&gt;A common approach to implement object detection is a binary mask regression for each object type combined with predicting the bounding box coordinates, which results in at least 1 model per object type. This method works on both the complete image and image patches as input, as long as the size of the input tensor is fixed. While this approach is simple and yields on average a 0.15% improvement in precision compared to &lt;a href=&quot;#Szegedy13&quot;&gt;traditional DPMs&lt;/a&gt;, it requires one DNN per object type. Another difficulty is overlapping objects of the same type, as these objects are not separable in the binary mask.&lt;/p&gt;

&lt;p&gt;The term &lt;em&gt;Image Recognition&lt;/em&gt; is used interchangeably with object detection, as the ability to detect and localize objects of different types (&lt;a href=&quot;#He15&quot;&gt;in a hierarchical structure&lt;/a&gt;) equals the ability to understand the &lt;a href=&quot;#Sermanet14&quot;&gt;current scene&lt;/a&gt; (see the following figure).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/recognition.png&quot; alt=&quot;Image Recognition through Object Detection&quot; title=&quot;Image Recognition through Object Detection&quot; class=&quot;image-col-1&quot; /&gt;
Image Recognition through Object Detection (Source: &lt;a href=&quot;#He15&quot;&gt;He&lt;/a&gt;)&lt;/p&gt;

&lt;h3 id=&quot;segmentation&quot;&gt;Segmentation&lt;/h3&gt;

&lt;p&gt;The application of &lt;em&gt;Segmentation&lt;/em&gt; is to partition parts of an image with pixel precision, hence predicting the corresponding segment for each pixel. Thus, the network needs to predict a class value for each input pixel. Convolutions and pooling both reduce the spatial dimensions of the activations throughout the network, which leads to the problem that the resulting activation is smaller than the input volume. Therefore, DNNs for segmentation need to implement an upscaling strategy in order to predict per pixel.&lt;/p&gt;

&lt;p&gt;To train a DNN for segmentation one can either input the whole image to the network and use a segmented image as ground truth (binary mask for foreground/background segmentation or &lt;a href=&quot;#Long14&quot;&gt;pixel mask&lt;/a&gt; displayed in the subsequent figure) or follow a patch based approach. Using pixelwise CNNs, one can achieve up to &lt;a href=&quot;#Long14&quot;&gt;20% relative improvement in accuracy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Similar to localization, segmentation can be turned into a classification problem when using image patches of a fixed size. Instead of sliding a window over all possible locations, one can extract patches only from salient regions or distinctive segments. This approach has the advantage that multiple patches can be stacked together as channels in the input layer to provide the network with positive and negative (or neutral) samples at the same time. This pairwise training can correct unbalanced class distributions and &lt;a href=&quot;#Long14&quot;&gt;optimize gradient computation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/segmentationMask.png&quot; alt=&quot;Image Segmentation using a Pixel Mask&quot; title=&quot;Image Segmentation using a Pixel Mask&quot; class=&quot;image-col-1&quot; /&gt;
Image Segmentation using a Pixel Mask (Source: &lt;a href=&quot;#Long14&quot;&gt;Long&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;As we can see in the previous figure, this upscaling process can be a gigantic fully-connected layer or an inverted DNN structure (called up-convolution or &lt;a href=&quot;#Noh15&quot;&gt;deconvolution structure&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/segmentationRaw.png&quot; alt=&quot;raw image&quot; title=&quot;raw image&quot; class=&quot;image-col-3&quot; /&gt;
&lt;img src=&quot;/images/deep-learning/segmentationSemantic.png&quot; alt=&quot;semantic segmentation&quot; title=&quot;semantic segmentation&quot; class=&quot;image-col-3&quot; /&gt;
&lt;img src=&quot;/images/deep-learning/segmentationInstance.png&quot; alt=&quot;instance segmentation&quot; title=&quot;instance segmentation&quot; class=&quot;image-col-3&quot; /&gt;
Semantic vs. Instance Segmentation&lt;/p&gt;

&lt;p&gt;The previous figure shows the 2 conceptual approaches to segmentation: semantic segmentation (middle image), where everything belonging to a matching class should be segmented, vs. instance segmentation (right image), where every instance of a class should be segmented separately.&lt;/p&gt;

&lt;p&gt;In semantic segmentation, multi-scale architectures are used that upsample outputs and combine the results with traditional &lt;a href=&quot;#Farabet13&quot;&gt;bottom-up segmentations&lt;/a&gt;. In a different approach, the complete upscaling process can be learned via an &lt;a href=&quot;#Long14&quot;&gt;inverted DNN structure&lt;/a&gt;. By adding skip connections from lower layers to higher upsampled levels, one can also learn incremental structures and perform local refinement on the downsampled image.&lt;/p&gt;

&lt;p&gt;Instance segmentation is often referred to as &lt;a href=&quot;#Hariharan14&quot;&gt;&lt;em&gt;simultaneous detection and segmentation&lt;/em&gt;&lt;/a&gt; and is a very challenging task. Most commonly, window functions are used to apply classification and segmentation to all possible sets of input patches. This wastes computational resources on regions that don’t contain any objects. Hence, so-called region proposal techniques have been introduced to optimize the selection of regions that can then be used as patch inputs for the segmentation. Fast R-CNN is a method using the &lt;a href=&quot;#Dai15&quot;&gt;ResNet architecture&lt;/a&gt; and implements a pipeline similar to object detection.&lt;/p&gt;

&lt;h3 id=&quot;image-encoding&quot;&gt;Image Encoding&lt;/h3&gt;

&lt;p&gt;A very common task of DNNs is image encoding, hence transforming an image from its original representation into a lower-dimensional feature space. This task is often performed implicitly, because the last fully-connected layer of each DNN learns this encoding automatically while being trained on a specific supervised task. At the end of the training, the last fully-connected layer contains a fixed-sized, low-dimensional numeric representation of the input image that can be used as input for conventional machine learning approaches such as SVMs, linear regression, etc.&lt;/p&gt;
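&lt;p&gt;A toy sketch of using the last fully-connected activation as an image encoding; the plain matrix multiplications stand in for the real conv stages, and all shapes and values are made up:&lt;/p&gt;

```python
import numpy as np

def forward_to_fc(x, stage_weights, fc_weight):
    # Hypothetical toy forward pass: after the earlier stages, the
    # activation of the last fully-connected layer is the image encoding.
    a = x
    for W in stage_weights:
        a = np.maximum(0, a @ W)     # linear stage + ReLU (stand-in for conv)
    return a @ fc_weight             # low-dimensional feature vector

rng = np.random.default_rng(0)
x = rng.random(64)                             # flattened "image"
stages = [rng.random((64, 32)), rng.random((32, 16))]
fc = rng.random((16, 8))
code = forward_to_fc(x, stages, fc)            # 8-dimensional encoding
# `code` can now feed an SVM, linear regression, clustering, etc.
```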

&lt;p&gt;Using up-convolutional structures, one can also implement unsupervised auto-encoding networks. However, due to the implicit learning through classification and the high computational complexity of up-convolutional networks, image encoders are mostly obtained by training on a supervised task such as classification.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Using the correct DNN architecture for an application of DL in computer vision requires knowledge of the specific problem and its limitations, such as a fixed number of objects in localization or fixed image dimensions in pixelwise segmentation. DL architectures consistently outperform classical hand-engineered approaches by 10% to 20% in accuracy, given enough training samples and computing power. However, in many DL applications such as detection and recognition, more than one model is required to predict all possible object types; training multiple models multiplies the training time and computation cost accordingly.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://archive.org/details/cs231n-CNNs&quot;&gt;CS231n (Stanford) Video Lectures&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://vision.stanford.edu/teaching/cs231n/&quot;&gt;CS231n (Stanford) Lecture Notes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/1409.4842v1.pdf&quot;&gt;Going Deeper with Convolutions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/1512.00567v3.pdf&quot;&gt;Rethinking the Inception Architecture for Computer Vision&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.slideshare.net/ChristophKrner/intro-to-deep-learning-for-computer-vision&quot;&gt;Slides from my talk at the University&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Rosenblatt57&quot;&gt;F. Rosenblatt, “The perceptron, a perceiving and recognizing automaton Project Para”, in &lt;em&gt;Cornell Aeronautical Laboratory&lt;/em&gt;, 1957.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Werbos74&quot;&gt;P. Werbos, “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences”, in &lt;em&gt;PhD thesis, Harvard University&lt;/em&gt;, Cambridge, 1974.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Rumelhart86&quot;&gt;D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors”, in &lt;em&gt;Nature&lt;/em&gt;, vol. 323,  pp. 533-536, 1986.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;LeCun90&quot;&gt;Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network”, in &lt;em&gt;Advances in Neural Information Processing Systems (NIPS 1989)&lt;/em&gt;, D. Touretzky (Ed.), vol. 2,  pp. 533-536, 1990.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Schmidhuber14&quot;&gt;J. Schmidhuber, “Deep Learning in Neural Networks: An Overview”, in &lt;em&gt;CoRR&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Goodfellow16&quot;&gt;I. Goodfellow, Y. Bengio and A. Courville, “Deep Learning”, &lt;em&gt;in preparation for MIT Press&lt;/em&gt;, 2016.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Glorot10&quot;&gt;X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks”, in &lt;em&gt;Proceedings of the International Conference on Artificial Intelligence and Statistics&lt;/em&gt;, 2010.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;He15&quot;&gt;K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;He15b&quot;&gt;K. He, X. Zhang, S. Ren and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Krizhevsky12&quot;&gt;A. Krizhevsky, I. Sutskever and G. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, in &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, 2012.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Radford15&quot;&gt;A. Radford, L. Metz and S. Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Smirnov14&quot;&gt;E. Smirnov, D. Timoshenko and S. Andrianov, “Comparison of Regularization Methods for ImageNet Classification with Deep Convolutional Neural Networks”, in &lt;em&gt;AASRI Procedia&lt;/em&gt;, vol. 6, pp. 89-94, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Szegedy14&quot;&gt;C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, “Going Deeper with Convolutions”, &lt;em&gt;CoRR&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Pajdla14&quot;&gt;M. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks”, in &lt;em&gt;COMPUTER VISION – ECCV 2014&lt;/em&gt;, 1st ed., D. Fleet, T. Pajdla, B. Schiele and T. Tuytelaars, Ed. 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Zhou12&quot;&gt;S. Zhou, Q. Chen and X. Wang, “Convolutional Deep Networks for Visual Data Classification”, in &lt;em&gt;Neural Process Letters&lt;/em&gt;, vol. 38, no. 1, pp. 17-27, 2012.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Kingma15&quot;&gt;D.P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization”, in &lt;em&gt;The International Conference on Learning Representations (ICLR)&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Henaff16&quot;&gt;M. Henaff, A. Szlam, and Y. LeCun, “Orthogonal RNNs and Long-Memory Tasks”, &lt;em&gt;CoRR&lt;/em&gt;, 2016.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Szegedy13&quot;&gt;C. Szegedy, A. Toshev, and D. Erhan, “Deep Neural Networks for Object Detection”, in &lt;em&gt;Advances in Neural Information Processing Systems 26&lt;/em&gt;, C. Burges (Ed.), pp. 2553-2561, 2013.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Simonyan14&quot;&gt;K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, &lt;em&gt;CoRR&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Sermanet14&quot;&gt;P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks”, in &lt;em&gt;ICLR&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Long14&quot;&gt;J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation”, &lt;em&gt;CoRR&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Lin13&quot;&gt;M. Lin, Q. Chen and S. Yan, “Network In Network”, &lt;em&gt;CoRR&lt;/em&gt;, 2013.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Ioffe15&quot;&gt;S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Srivastava14&quot;&gt;N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, in &lt;em&gt;Journal of Machine Learning Research&lt;/em&gt;, vol. 15, pp. 1929-1958, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Li16&quot;&gt;F. Li, A. Karpathy, and J. Johnson, “Spatial Localization and Detection”, in &lt;em&gt;CS231n: Convolutional Neural Networks for Visual Recognition&lt;/em&gt;, lecture slides, p. 8, 2016.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Ciregan12&quot;&gt;D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification”, in &lt;em&gt;IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012&lt;/em&gt;, pp. 3642-3649, 2012.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Farabet13&quot;&gt;C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning Hierarchical Features for Scene Labeling”, in &lt;em&gt;IEEE Transactions on Pattern Analysis and Machine Intelligence&lt;/em&gt;, vol. 35, no. 8, pp. 1915-1929, 2013.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Noh15&quot;&gt;H. Noh, S. Hong, and B. Han, “Learning Deconvolution Network for Semantic Segmentation”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Hariharan14&quot;&gt;B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Simultaneous Detection and Segmentation”, in &lt;em&gt;European Conference on Computer Vision (ECCV)&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Dai15&quot;&gt;J. Dai, K. He, and J. Sun, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Saxe13&quot;&gt;A. M. Saxe, J. L. McClelland and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”, &lt;em&gt;CoRR&lt;/em&gt;, 2013.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Tieleman13&quot;&gt;T. Tieleman and G. Hinton, “RMSprop Gradient Optimization”, in &lt;em&gt;Neural Networks for Machine Learning&lt;/em&gt;, lecture slides, p. 29, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Canziani16&quot;&gt;A. Canziani, A. Paszke and E. Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, &lt;em&gt;CoRR&lt;/em&gt;, 2016.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Sat, 22 Oct 2016 21:30:00 +0000</pubDate>
        <link>https://chaosmail.github.io//deeplearning/2016/10/22/intro-to-deep-learning-for-computer-vision/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//deeplearning/2016/10/22/intro-to-deep-learning-for-computer-vision/</guid>
        
        <category>deeplearning</category>
        
        <category>vision</category>
        
        
        
      </item>
    
      <item>
        <title>Data-driven Visualizations</title>
        <description>&lt;p&gt;I gave a talk about data-driven visualizations and D3.js at the last &lt;a href=&quot;http://www.meetup.com/de/viennajs/&quot;&gt;ViennaJS October Meetup&lt;/a&gt; (28.10.2015). Here are the links to the &lt;a href=&quot;https://community.leapmotion.com/t/tip-ubuntu-systemd-and-leapd/2118&quot;&gt;talk&lt;/a&gt; and the &lt;a href=&quot;https://docs.google.com/presentation/d/1-7xsVq5fNi5Z3PwQO6JF2cBpXYNz1eaKx6EqhRU4AA0&quot;&gt;slides&lt;/a&gt;.&lt;/p&gt;

</description>
        <pubDate>Tue, 24 Nov 2015 19:00:00 +0000</pubDate>
        <link>https://chaosmail.github.io//web/graphics/2015/11/24/data-driven-visualizations/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//web/graphics/2015/11/24/data-driven-visualizations/</guid>
        
        <category>talk</category>
        
        <category>d3js</category>
        
        
        <category>web</category>
        
        <category>graphics</category>
        
      </item>
    
      <item>
        <title>Installing Visual Studio Code on Ubuntu</title>
        <description>&lt;p&gt;&lt;a href=&quot;https://code.visualstudio.com/&quot;&gt;Visual Studio Code&lt;/a&gt; is an open source multi-platform IDE for web development (especially JavaScript and TypeScript) - enough reasons for me to check it out.&lt;/p&gt;

&lt;h2 id=&quot;installing&quot;&gt;Installing&lt;/h2&gt;

&lt;p&gt;Download the latest version from the &lt;a href=&quot;https://code.visualstudio.com/&quot;&gt;Visual Studio Code&lt;/a&gt; website (I found the 64-bit version on the &lt;a href=&quot;https://code.visualstudio.com/Docs/supporting/howtoupdate&quot;&gt;update page&lt;/a&gt;) and unzip it.&lt;/p&gt;

&lt;p&gt;Then move it to the &lt;em&gt;/opt&lt;/em&gt; directory and create a symbolic link.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;mv VSCode-linux-x64 /opt/VSCode
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;ln &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; /opt/VSCode/Code /usr/local/bin/code&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You are done; just run &lt;code class=&quot;highlighter-rouge&quot;&gt;code&lt;/code&gt; from your terminal!&lt;/p&gt;

&lt;h2 id=&quot;creating-a-desktop-icon&quot;&gt;Creating a Desktop Icon&lt;/h2&gt;

&lt;p&gt;Create a desktop icon by creating a &lt;em&gt;VSCode.desktop&lt;/em&gt; file&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gedit /usr/share/applications/VSCode.desktop&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;with the following content&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-ini&quot; data-lang=&quot;ini&quot;&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env xdg-open
&lt;/span&gt;
&lt;span class=&quot;nn&quot;&gt;[Desktop Entry]&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Version&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1.0&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Application&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Terminal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;false&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Exec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/opt/VSCode/Code&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;VSCode&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Icon&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/opt/VSCode/resources/app/vso.png&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Categories&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Development&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now you can find VSCode in your start menu.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://code.visualstudio.com/&quot;&gt;Visual Studio Code Website&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://code.visualstudio.com/Docs/supporting/howtoupdate&quot;&gt;Visual Studio Code Updates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://askubuntu.com/questions/616075/how-to-install-visual-studio-code-on-ubuntu&quot;&gt;Visual Studio Code on Stackoverflow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Tue, 29 Sep 2015 11:00:00 +0000</pubDate>
        <link>https://chaosmail.github.io//web/development/2015/09/29/installing-vscode/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//web/development/2015/09/29/installing-vscode/</guid>
        
        <category>vscode</category>
        
        
        <category>web</category>
        
        <category>development</category>
        
      </item>
    
      <item>
        <title>Swiss Web Audio Group: 1st Meetup</title>
        <description>&lt;p&gt;In the first official meetup of the &lt;em&gt;Swiss Web Audio Group&lt;/em&gt; (SWAG) we talked about Sound.io, the Sequencer, and Music JSON.&lt;/p&gt;

&lt;h2 id=&quot;attendees&quot;&gt;Attendees&lt;/h2&gt;

&lt;p&gt;Stefano, Chris, Stephan&lt;/p&gt;

&lt;h2 id=&quot;soundio&quot;&gt;Sound.io&lt;/h2&gt;

&lt;p&gt;Encapsulate the components and make them modular; maybe expose them via npm. People should be able to use the modules to create cool stuff.&lt;/p&gt;

&lt;h3 id=&quot;todos&quot;&gt;Todos:&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Make the lib Open Source&lt;/li&gt;
  &lt;li&gt;GitHub Bug tracker&lt;/li&gt;
  &lt;li&gt;Work on the getting started document for the modules&lt;/li&gt;
  &lt;li&gt;Using npm for the modules&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;sequencer&quot;&gt;Sequencer&lt;/h2&gt;

&lt;h3 id=&quot;todos-1&quot;&gt;Todos:&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Share on social media&lt;/li&gt;
  &lt;li&gt;Embed the sequencer in a page&lt;/li&gt;
  &lt;li&gt;Include the keyboard MIDI player&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;music-json&quot;&gt;Music JSON&lt;/h2&gt;

&lt;p&gt;An interchangeable, readable format for note data on the web.&lt;/p&gt;
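To make the idea concrete, here is a sketch of what such note data could look like - the field layout ([time, "note", pitch, velocity, duration]) is assumed for illustration and may differ from the actual Music JSON spec:

```javascript
// Hypothetical note data in a Music JSON-like shape.
// The event layout [time, "note", pitch, velocity, duration]
// is assumed for illustration; the real spec may differ.
const sequence = {
  name: "Example riff",
  events: [
    [0,   "note", 64, 0.8, 0.5],
    [0.5, "note", 67, 0.8, 0.5],
    [1.0, "note", 71, 0.8, 1.0]
  ]
};

// A *.mid-to-Music-JSON converter would emit structures like this.
console.log(sequence.events.length);
```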

&lt;h3 id=&quot;todos-2&quot;&gt;Todos:&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Implement a converter that converts a *.mid file into Music JSON&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Mon, 24 Aug 2015 21:30:00 +0000</pubDate>
        <link>https://chaosmail.github.io//web/audio/2015/08/24/swag/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//web/audio/2015/08/24/swag/</guid>
        
        <category>soundio</category>
        
        <category>webaudio</category>
        
        <category>midi</category>
        
        
        <category>web</category>
        
        <category>audio</category>
        
      </item>
    
      <item>
        <title>Sound.io Introduction</title>
        <description>&lt;p&gt;The &lt;a href=&quot;https://github.com/soundio/soundio&quot;&gt;sound.io&lt;/a&gt; core library implements a graph object model for audio - we are calling this the &lt;em&gt;audio graph&lt;/em&gt;. The audio graph can be constructed out of a collection of &lt;em&gt;&lt;a href=&quot;https://github.com/soundio/audio-object&quot;&gt;audio-objects&lt;/a&gt;&lt;/em&gt; and &lt;em&gt;connections&lt;/em&gt; that link the audio-objects together.&lt;/p&gt;

&lt;h2 id=&quot;audio-object&quot;&gt;audio-object&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/soundio/audio-object&quot;&gt;audio object&lt;/a&gt; is a wrapper on the &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API&quot;&gt;Web Audio API&lt;/a&gt;. These objects are the building blocks of an audio graph.&lt;/p&gt;

&lt;h2 id=&quot;soundio-object-template&quot;&gt;soundio-object-template&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/soundio/soundio-object-template&quot;&gt;soundio-object-template repository&lt;/a&gt; contains a plugin template to build custom plugins. The &lt;em&gt;audio&lt;/em&gt; variable in the constructor contains a reference to the &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/AudioContext&quot;&gt;Audio Context&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;midi&quot;&gt;MIDI&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/soundio/midi&quot;&gt;MIDI&lt;/a&gt; abstracts the &lt;a href=&quot;http://www.w3.org/TR/webmidi/&quot;&gt;Web MIDI API&lt;/a&gt;. It contains a &lt;em&gt;normalize&lt;/em&gt; function that converts MIDI format to &lt;a href=&quot;https://github.com/soundio/music-json&quot;&gt;Music JSON&lt;/a&gt; which is used internally.&lt;/p&gt;
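As an illustration of the kind of mapping involved (a sketch only - the names and output shape are assumed here, not the library's actual &lt;em&gt;normalize&lt;/em&gt; implementation), a raw Web MIDI note message can be turned into an event tuple like this:

```javascript
// Illustrative only: convert a raw Web MIDI note-on/note-off message
// [status | channel, note, velocity] into a simple event tuple.
// The actual soundio/midi normalize() may differ in naming and shape.
function normalize(data, time) {
  const status = data[0] & 0xf0;  // strip the channel nibble
  const channel = data[0] & 0x0f;
  if (status === 0x90 && data[2] > 0) {
    return [time, "noteon", data[1], data[2] / 127, channel];
  }
  if (status === 0x80 || (status === 0x90 && data[2] === 0)) {
    // note-on with velocity 0 is conventionally a note-off
    return [time, "noteoff", data[1], 0, channel];
  }
  return null; // other message types not handled in this sketch
}

console.log(normalize([0x90, 64, 127], 0)); // [0, "noteon", 64, 1, 0]
```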

&lt;h2 id=&quot;clock&quot;&gt;Clock&lt;/h2&gt;

&lt;p&gt;Maps the Web Audio absolute time to clock time using rates.&lt;/p&gt;
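While the rate stays constant this is a simple linear mapping (an illustrative sketch under that assumption, not the clock library's API; a real clock has to integrate over rate changes):

```javascript
// Illustrative: map absolute AudioContext time (seconds) to clock
// time (beats) at a fixed rate. With rate changes, a real clock
// sums the contribution of each constant-rate segment instead.
function beatAtTime(time, startTime, rate /* beats per second */) {
  return (time - startTime) * rate;
}

console.log(beatAtTime(12.5, 10, 2)); // 5
```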

&lt;h2 id=&quot;sequence&quot;&gt;Sequence&lt;/h2&gt;

&lt;p&gt;Maps the clock time to the Sequence time.&lt;/p&gt;

&lt;h3 id=&quot;sampler&quot;&gt;Sampler&lt;/h3&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-js&quot; data-lang=&quot;js&quot;&gt;&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;soundio&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Soundio&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;sampler&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;soundio&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;objects&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'sample'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// soundio.objects = [&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//  {id: 0, type: 'sample'}&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//]&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;sampler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;trigger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'noteon'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;127&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;sample-map&quot;&gt;Sample map&lt;/h3&gt;

&lt;p&gt;A &lt;a href=&quot;https://github.com/soundio/soundio/blob/master/js/soundio.sample.js#L12&quot;&gt;sample map&lt;/a&gt; maps notes across the keyboard and the velocity range.&lt;/p&gt;
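One way to picture this (region layout assumed for illustration - soundio's actual data structure differs in detail): each sample covers a note range and a velocity range, and the map picks the region containing the played note:

```javascript
// Illustrative sample-map lookup: each region claims a note range
// and a velocity range; find the region that contains the note.
// Region fields (url, noteRange, velocityRange) are assumed names.
const regions = [
  { url: "kick-soft.wav", noteRange: [36, 36], velocityRange: [0, 0.5] },
  { url: "kick-hard.wav", noteRange: [36, 36], velocityRange: [0.5, 1] }
];

function findRegion(regions, note, velocity) {
  return regions.find(r =>
    note >= r.noteRange[0] && note <= r.noteRange[1] &&
    velocity >= r.velocityRange[0] && velocity <= r.velocityRange[1]
  );
}

console.log(findRegion(regions, 36, 0.9).url); // "kick-hard.wav"
```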

&lt;h2 id=&quot;music-json&quot;&gt;Music JSON&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/soundio/music-json&quot;&gt;Music JSON&lt;/a&gt; is an interchangeable MIDI-like format for the web.&lt;/p&gt;

&lt;h2 id=&quot;scribe&quot;&gt;Scribe&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/soundio/scribe&quot;&gt;Scribe&lt;/a&gt; is a parser for &lt;a href=&quot;https://github.com/soundio/music-json&quot;&gt;Music JSON&lt;/a&gt; to create lead sheets in SVG.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References:&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/soundio&quot;&gt;sound.io&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/soundio/blob/master/js/soundio.sample.js#L435&quot;&gt;soundio sample&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/soundio/blob/master/js/soundio.sample.js#L12&quot;&gt;soundio sample map&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/soundio-object-template&quot;&gt;sound.io object-template&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/audio-object&quot;&gt;audio object&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/clock&quot;&gt;clock&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/midi&quot;&gt;MIDI&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/scribe&quot;&gt;Scribe&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.w3.org/TR/webmidi/&quot;&gt;Web MIDI API&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API&quot;&gt;Web Audio API&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/AudioContext&quot;&gt;Web Audio API - Audio Context&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Mon, 24 Aug 2015 19:30:00 +0000</pubDate>
        <link>https://chaosmail.github.io//web/audio/2015/08/24/soundio-intro/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//web/audio/2015/08/24/soundio-intro/</guid>
        
        <category>soundio</category>
        
        <category>webaudio</category>
        
        <category>midi</category>
        
        
        <category>web</category>
        
        <category>audio</category>
        
      </item>
    
  </channel>
</rss>
