<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Chaosmail Blog</title>
    <description>Big Data and Artificial Intelligence Expert at Microsoft, Tech Author and Speaker</description>
    <link>https://chaosmail.github.io//</link>
    <atom:link href="https://chaosmail.github.io//feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Thu, 06 Jun 2019 10:18:03 +0000</pubDate>
    <lastBuildDate>Thu, 06 Jun 2019 10:18:03 +0000</lastBuildDate>
    <generator>Jekyll v3.8.5</generator>
    
      <item>
        <title>Getting Started with Microsoft SQL 2019 Big Data clusters</title>
<description>&lt;p&gt;Microsoft’s latest &lt;a href=&quot;https://www.microsoft.com/en-us/sql-server/sql-server-2019&quot;&gt;SQL Server 2019&lt;/a&gt; (preview) comes in a new flavor: the SQL Server 2019 Big Data cluster (BDC). There are a couple of cool things about the BDC version: 
(1) it runs on &lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt;
(2) it integrates a sharded SQL engine
(3) it integrates &lt;a href=&quot;https://hadoop.apache.org/docs/current1/hdfs_design.html#Introduction&quot;&gt;HDFS&lt;/a&gt; (a distributed file storage)
(4) it integrates &lt;a href=&quot;http://spark.apache.org/&quot;&gt;Spark&lt;/a&gt; (a distributed compute engine)
(5) and both services Spark and HDFS run behind an &lt;a href=&quot;https://knox.apache.org/&quot;&gt;Apache Knox&lt;/a&gt; Gateway (HTTPS application gateway for Hadoop).&lt;/p&gt;

&lt;p&gt;On top of that, using Polybase you can connect to many different external data sources such as MongoDB, Oracle, Teradata, SAP HANA, and more. This makes SQL Server 2019 Big Data cluster (BDC) a scalable, performant, and maintainable platform that serves as SQL engine, Data Warehouse, Data Lake, and Data Science workbench alike - in the cloud as well as on-premises. In this blog post I want to give you a quick-start tutorial on &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15&quot;&gt;SQL 2019 Big Data clusters (BDC)&lt;/a&gt; and show you how to set one up on Azure Kubernetes Services (AKS), upload some data to HDFS, and access the data from SQL and Spark.&lt;/p&gt;

&lt;h2 id=&quot;sql-server-2019-big-data-cluster-bdc&quot;&gt;SQL Server 2019 Big Data cluster (BDC)&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15&quot;&gt;SQL Server 2019 Big Data cluster (BDC)&lt;/a&gt; is one of the most exciting pieces of technology I have seen in a long time. Here is why.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/SQL-Server-2019-big-data-cluster.png&quot; alt=&quot;SQL Server 2019 for Big Data architecture Source: [microsoft.com/sqlserver](https://cloudblogs.microsoft.com/sqlserver/2018/09/25/introducing-microsoft-sql-server-2019-big-data-clusters/)&quot; title=&quot;SQL Server 2019 for Big Data architecture&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;kubernetes&quot;&gt;Kubernetes&lt;/h3&gt;
&lt;p&gt;SQL Server 2019 builds on a new abstraction layer called the &lt;em&gt;Platform Abstraction Layer&lt;/em&gt; (PAL) which lets you run SQL Server on multiple platforms and environments, such as Windows, Linux, and containers. Taking this one step further, we can run SQL Server clusters entirely within Kubernetes - either locally (e.g. on &lt;a href=&quot;https://kubernetes.io/docs/setup/minikube/&quot;&gt;Minikube&lt;/a&gt;), on on-premises clusters, or in the cloud (e.g. on Azure Kubernetes Services). All data is persisted using &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/concept-data-persistence?view=sqlallproducts-allversions&quot;&gt;&lt;em&gt;Persistent Volumes&lt;/em&gt;&lt;/a&gt;. To facilitate operations, there is a new &lt;code class=&quot;highlighter-rouge&quot;&gt;mssqlctl&lt;/code&gt; command to scaffold, configure, and scale SQL Server 2019 clusters in Kubernetes.&lt;/p&gt;
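&lt;p&gt;As a quick sketch of what this looks like in practice - assuming &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; is configured against such a cluster, and using an illustrative cluster name - you can inspect the deployed SQL Server pods like any other Kubernetes workload:&lt;/p&gt;

```sh
# Each BDC component (master instance, compute pool, storage pool, controller)
# runs as a set of pods in a Kubernetes namespace named after the cluster.
CLUSTER_NAME="mssql-cluster"   # illustrative name - use your own cluster name

# List the BDC pods; the guard makes this a no-op where kubectl is missing.
if command -v kubectl >/dev/null; then
  kubectl get pods -n "$CLUSTER_NAME" -o wide || echo "cluster not reachable"
fi
```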

&lt;h3 id=&quot;sql-master-instance&quot;&gt;SQL Master Instance&lt;/h3&gt;
&lt;p&gt;If you deploy SQL Server 2019 as a cluster in &lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt;, it comes with a SQL &lt;em&gt;Master Instance&lt;/em&gt; and multiple SQL engine compute and storage shards. The great thing about the Master Instance is that it is just a normal SQL instance - you can use all existing tooling, code, etc. and interact with the SQL Server cluster as if it were a single DB instance. If you stream data to the cluster, you can stream it directly to the SQL shards without going through the Master Instance, which gives you optimal throughput performance.&lt;/p&gt;
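&lt;p&gt;For example, connecting with &lt;code class=&quot;highlighter-rouge&quot;&gt;sqlcmd&lt;/code&gt; works just like against a stand-alone instance. The endpoint below is a placeholder, and port 31433 is an assumption based on the CTP defaults:&lt;/p&gt;

```sh
# Illustrative endpoint - the external IP comes from `kubectl get svc`;
# 31433 was the default port of the master instance in the CTP releases.
MASTER_ENDPOINT="13.94.132.100,31433"
SA_PASSWORD="MySQLBigData2019"

# Query the master instance exactly like any stand-alone SQL Server
# (guarded so the snippet is a no-op where sqlcmd is not installed).
if command -v sqlcmd >/dev/null; then
  sqlcmd -S "$MASTER_ENDPOINT" -U sa -P "$SA_PASSWORD" \
    -Q "SELECT @@VERSION" || echo "master instance not reachable"
fi
```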

&lt;h3 id=&quot;polybase&quot;&gt;Polybase&lt;/h3&gt;
&lt;p&gt;You might know &lt;em&gt;Polybase&lt;/em&gt; from SQL Server 2016 as a service that lets you connect to flat HDFS data sources. With SQL Server 2019, you can now also connect to relational data sources (e.g. Oracle, Teradata, SAP HANA) and NoSQL data sources (e.g. MongoDB, Cosmos DB) using Polybase and &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/polybase/data-virtualization?view=sqlallproducts-allversions&quot;&gt;external tables&lt;/a&gt; - both with &lt;a href=&quot;https://blogs.msdn.microsoft.com/sql_server_team/predicate-pushdown-and-why-should-i-care/&quot;&gt;predicate pushdown filters&lt;/a&gt;. It’s a fantastic feature that turns your SQL Server 2019 cluster into your central data hub.&lt;/p&gt;
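&lt;p&gt;To give a rough idea, here is a hedged sketch of what such an external table could look like - the endpoint, connection string, credential, and all table and column names are made up for illustration; see the external tables documentation linked above for the exact syntax:&lt;/p&gt;

```sh
# Hypothetical example: expose a MongoDB collection as an external table.
# Endpoint, credential, database, and column names are all placeholders.
MASTER_ENDPOINT="13.94.132.100,31433"
EXTERNAL_TABLE_SQL="
CREATE EXTERNAL DATA SOURCE MyMongo
  WITH (LOCATION = 'mongodb://mongo-server:27017', CREDENTIAL = MongoCred);
CREATE EXTERNAL TABLE dbo.ext_orders (order_id INT, amount FLOAT)
  WITH (LOCATION = 'salesdb.orders', DATA_SOURCE = MyMongo);
SELECT TOP 10 * FROM dbo.ext_orders;"

# Run the statements against the master instance (no-op without sqlcmd).
if command -v sqlcmd >/dev/null; then
  sqlcmd -S "$MASTER_ENDPOINT" -U sa -P "***" \
    -Q "$EXTERNAL_TABLE_SQL" || echo "master instance not reachable"
fi
```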

&lt;h3 id=&quot;hdfs&quot;&gt;HDFS&lt;/h3&gt;
&lt;p&gt;Now comes the fun part. When you deploy a SQL Server 2019 BDC, you also deploy a &lt;a href=&quot;https://hadoop.apache.org/docs/current1/hdfs_design.html#Introduction&quot;&gt;&lt;em&gt;Hadoop Distributed Filesystem&lt;/em&gt; (HDFS)&lt;/a&gt; within Kubernetes. With the &lt;a href=&quot;https://www.microsoft.com/en-us/research/project/tiered-storage/&quot;&gt;tiered storage feature in HDFS&lt;/a&gt; you can also mount existing HDFS clusters into the integrated SQL Server 2019 HDFS. Using the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/polybase/configure-scale-out-groups-windows?view=sqlallproducts-allversions&quot;&gt;integrated Polybase scale-out groups&lt;/a&gt; you can efficiently access this distributed data from SQL with &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/polybase/data-virtualization-csv?view=sqlallproducts-allversions&quot;&gt;external tables&lt;/a&gt;. If you install SQL Server 2019 as a BDC, all configuration of these services is done automatically, even pass-through authentication. These features allow your SQL Server 2019 cluster to become the central data store for both structured relational data and massive volumes of flat, unstructured data.&lt;/p&gt;
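&lt;p&gt;Since the HDFS endpoint sits behind the Knox gateway, the integrated HDFS can be reached over the standard WebHDFS REST API. A minimal sketch, assuming the CTP default port 30443, the default Knox gateway path, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;root&lt;/code&gt; user - all of these are placeholders to replace with your cluster’s values:&lt;/p&gt;

```sh
# Illustrative values: the gateway IP comes from `kubectl get svc`;
# port 30443 and the /gateway/default/webhdfs path follow the CTP defaults.
KNOX_ENDPOINT="https://13.94.132.101:30443/gateway/default/webhdfs/v1"
KNOX_PASSWORD="MySQLBigData2019"

# List the HDFS root via the WebHDFS REST API; -k accepts the gateway's
# self-signed certificate (no-op where curl is not installed).
if command -v curl >/dev/null; then
  curl -sk -u "root:$KNOX_PASSWORD" --connect-timeout 3 \
    "$KNOX_ENDPOINT/?op=LISTSTATUS" || echo "gateway not reachable"
fi
```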

&lt;h3 id=&quot;spark&quot;&gt;Spark&lt;/h3&gt;
&lt;p&gt;And it’s getting better. The SQL Server 2019 BDC also includes a &lt;a href=&quot;http://spark.apache.org/&quot;&gt;&lt;em&gt;Spark&lt;/em&gt;&lt;/a&gt; runtime co-located with the HDFS data pools. For me - coming from a Big Data background - this is huge! It means you can take advantage of all Spark features (SparkSQL, DataFrames, MLlib for machine learning, GraphX for graph processing, Structured Streaming for stream processing, and much more) directly within your SQL cluster. Now your SQL 2019 cluster can also serve your data scientists and data engineers as a central Big Data hub. Thanks to the integration of &lt;a href=&quot;https://livy.incubator.apache.org/&quot;&gt;Apache Livy&lt;/a&gt; (a REST gateway for Spark) you can use this functionality with your existing tooling, such as Jupyter or Zeppelin notebooks, out of the box.&lt;/p&gt;
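&lt;p&gt;Because Livy is a plain REST API, you can also submit Spark jobs with nothing but &lt;code class=&quot;highlighter-rouge&quot;&gt;curl&lt;/code&gt;. The following is a hedged sketch based on Livy’s generic &lt;code class=&quot;highlighter-rouge&quot;&gt;/batches&lt;/code&gt; endpoint - the gateway address, port, path, and jar location are placeholders:&lt;/p&gt;

```sh
# Illustrative sketch: submit a Spark batch job through the Knox-proxied
# Livy REST API. Endpoint, port, and the jar path are placeholders.
LIVY_ENDPOINT="https://13.94.132.101:30443/gateway/default/livy/v1/batches"
KNOX_PASSWORD="MySQLBigData2019"

# POST a batch definition as JSON (no-op where curl is not installed).
if command -v curl >/dev/null; then
  curl -sk -u "root:$KNOX_PASSWORD" --connect-timeout 3 \
    -H "Content-Type: application/json" \
    -X POST "$LIVY_ENDPOINT" \
    -d '{"file": "/jar/my-spark-job.jar", "className": "MyJob"}' || echo "gateway not reachable"
fi
```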

&lt;h3 id=&quot;much-more--knox-grafana-ssis-report-server-etc&quot;&gt;Much More … (Knox, Grafana, SSIS, Report Server, etc.)&lt;/h3&gt;
&lt;p&gt;Once everything runs in Kubernetes, you can add many more services to the cluster and manage, operate, and scale them together. The Spark and HDFS endpoints are exposed through an &lt;a href=&quot;https://knox.apache.org/&quot;&gt;Apache Knox Gateway&lt;/a&gt; (an HTTPS application gateway for Hadoop) and can therefore be integrated with many other existing services (e.g. processes writing to HDFS). SQL Server 2019 BDC ships with an integrated &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/cluster-admin-portal?view=sqlallproducts-allversions&quot;&gt;Cluster Configuration portal&lt;/a&gt; and a Grafana dashboard for monitoring all relevant service metrics.&lt;/p&gt;

&lt;p&gt;Deploying other co-located services to the same Kubernetes cluster becomes quite easy. Services such as Integration Services, Analysis Services or Report Server can simply be deployed and scaled to the same SQL Server 2019 cluster as additional Kubernetes pods.&lt;/p&gt;

&lt;p&gt;Another cool feature of SQL Server 2019 worth mentioning: alongside Python and R, it will also support User Defined Functions (UDFs) written in Java. Niels Berglund has many examples in his &lt;a href=&quot;http://www.nielsberglund.com/s2k19_ext_framework_java/&quot;&gt;blog post series&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;installation&quot;&gt;Installation&lt;/h2&gt;

&lt;p&gt;Currently, SQL Server 2019 and SQL Server 2019 Big Data cluster (BDC) are still in private preview. Hence, you need to apply for the &lt;a href=&quot;https://sqlservervnexteap.azurewebsites.net/&quot;&gt;Early Adoption Program&lt;/a&gt;, which will grant you access to Microsoft’s private registry and SQL Server 2019 images. You are also assigned a buddy (a PM on the SQL Server 2019 team) and granted access to a private Teams channel. So if you want to try it out today, you should definitely sign up!&lt;/p&gt;

&lt;p&gt;In this section we will go through the prerequisites and installation process as documented in the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sql-server-ver15&quot;&gt;SQL Server 2019 installation guidelines&lt;/a&gt; for Big Data analytics. In the documentation, you will find a link to a &lt;a href=&quot;https://github.com/Microsoft/sql-server-samples/blob/master/samples/features/sql-big-data-cluster/deployment/aks/deploy-sql-big-data-aks.py&quot;&gt;Python script&lt;/a&gt; that allows you to spin up SQL 2019 on Azure Kubernetes Services (AKS).&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If you want to install SQL Server 2019 BDC on your on-premise Kubernetes cluster, you can follow the steps in &lt;a href=&quot;https://chrisadkin.io/2018/12/18/building-a-kubernetes-cluster-for-sql-server-2019-big-data-clusters-part-1-hyper-v-virtual-machine-creation/&quot;&gt;Christopher Adkin’s Blog&lt;/a&gt;. You can find an official deployment guide for BDC on &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-on-minikube?view=sql-server-ver15&quot;&gt;Minikube in the Microsoft docs&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;prerequisites-kubernetes-and-mssql-clients&quot;&gt;Prerequisites: Kubernetes and MSSQL clients&lt;/h3&gt;

&lt;p&gt;To &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sql-server-ver15&quot;&gt;deploy a SQL Server 2019 Big Data cluster (BDC)&lt;/a&gt; on Azure Kubernetes Services (AKS), you need the following tools installed. For this tutorial, I installed all these tools on Ubuntu 18.04 LTS on &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/wsl/install-win10&quot;&gt;WSL&lt;/a&gt; (Windows Subsystem for Linux).&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest&quot;&gt;Azure CLI (install latest)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/tasks/tools/install-kubectl/&quot;&gt;kubectl&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-install-mssqlctl?view=sqlallproducts-allversions&quot;&gt;mssqlctl&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sqlservervnexteap.azurewebsites.net/&quot;&gt;SQL Server 2019 Early Adoption Program&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To avoid any problems with Kubernetes APIs, it’s best to install the same &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; version as the Kubernetes version on AKS. In the SQL Server 2019 docs, the version &lt;code class=&quot;highlighter-rouge&quot;&gt;1.12.6&lt;/code&gt; is recommended. Hence, in this case we also &lt;a href=&quot;https://kubernetes.io/docs/tasks/tools/install-kubectl/&quot;&gt;install&lt;/a&gt; the Kubernetes &lt;code class=&quot;highlighter-rouge&quot;&gt;1.12.6&lt;/code&gt; client.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get install &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; apt-transport-https
curl &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; https://packages.cloud.google.com/apt/doc/apt-key.gpg | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-key add -
&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;deb https://apt.kubernetes.io/ kubernetes-xenial main&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;tee &lt;span class=&quot;nt&quot;&gt;-a&lt;/span&gt; /etc/apt/sources.list.d/kubernetes.list
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get install &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;kubectl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1.12.6-00
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;mssqlctl&lt;/code&gt; tool is a handy command-line utility that allows you to create and manage SQL Server 2019 Big Data cluster installations. You can &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-install-mssqlctl?view=sqlallproducts-allversions&quot;&gt;install it&lt;/a&gt; using pip with the following command:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;pip3 install &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt;  https://private-repo.microsoft.com/python/ctp-2.4/mssqlctl/requirements.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Before you continue, make sure that both the &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;mssqlctl&lt;/code&gt; commands are available. If they are not, you may need to restart your current bash session.&lt;/p&gt;
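&lt;p&gt;A small sanity check for this (just a convenience, not part of the official docs):&lt;/p&gt;

```sh
# Quick check that both command-line tools are on the PATH.
for tool in kubectl mssqlctl; do
  if command -v "$tool" >/dev/null; then
    echo "$tool: found"
  else
    echo "$tool: missing - restart your bash session or re-install"
  fi
done
```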

&lt;h3 id=&quot;prerequisites-azure-data-studio&quot;&gt;Prerequisites: Azure Data Studio&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/azure-data-studio/what-is?view=sqlallproducts-allversions&quot;&gt;Azure Data Studio&lt;/a&gt; is a cross-platform management tool for Microsoft databases. It’s like SQL Server Management Studio built on top of the popular VS Code editor engine: a rich T-SQL editor with IntelliSense and plugin support. Currently, it’s the easiest way to connect to the different SQL Server 2019 endpoints (SQL, HDFS, and Spark). To do so, you need to &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/azure-data-studio/download?view=sqlallproducts-allversions&quot;&gt;install Data Studio&lt;/a&gt; and the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/azure-data-studio/sql-server-2019-extension?view=sqlallproducts-allversions&quot;&gt;SQL Server 2019 extension&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The following screenshot (Source: &lt;a href=&quot;https://github.com/Microsoft/azuredatastudio&quot;&gt;Microsoft/azuredatastudio&lt;/a&gt;) shows an overview of Azure Data Studio and its capabilities.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/data-studio-overview.jpg&quot; alt=&quot;Azure Data Studio overview&quot; title=&quot;Azure Data Studio overview&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Azure Data Studio also supports Jupyter-style notebooks for T-SQL and Spark. The following screenshot shows Data Studio with the notebooks extension.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/data-studio.png&quot; alt=&quot;Azure Data Studio with SQL Server 2019 extension&quot; title=&quot;Azure Data Studio with SQL Server 2019 extension&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;install-sql-server-2019-bdc-on-azure-kubernetes-services-aks&quot;&gt;Install SQL Server 2019 BDC on Azure Kubernetes Services (AKS)&lt;/h3&gt;

&lt;p&gt;In this section, we will follow the steps from the &lt;a href=&quot;https://github.com/Microsoft/sql-server-samples/blob/master/samples/features/sql-big-data-cluster/deployment/aks/deploy-sql-big-data-aks.py&quot;&gt;installation script&lt;/a&gt; to install SQL Server 2019 for Big Data on AKS. I will give you a bit more detail and explanation about the executed steps. If you just want to install SQL Server 2019 for Big Data, you can simply run the script directly.&lt;/p&gt;

&lt;p&gt;First, we start by setting all required parameters for the installation process. For the Docker username and password, please use the credentials you received after registering for the &lt;a href=&quot;https://sqlservervnexteap.azurewebsites.net/&quot;&gt;SQL Server Early Adoption&lt;/a&gt; program. They give you access to Microsoft’s private registry with the latest SQL Server 2019 images.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Provide your Azure subscription ID&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;SUBSCRIPTION_ID&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide Azure resource group name to be created&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;GROUP_NAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;demos.sql2019&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Provide Azure region&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;AZURE_REGION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;westeurope&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide VM size for the AKS cluster&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;VM_SIZE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Standard_L4s&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide number of worker nodes for AKS cluster&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;AKS_NODE_COUNT&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;3&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide supported Kubernetes version&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;KUBERNETES_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1.12.7&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# This is both Kubernetes cluster name and SQL Big Data cluster name&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide name of AKS cluster and SQL big data cluster&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;sqlbigdata&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide username to be used for Controller user&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CONTROLLER_USERNAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;admin&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# This password will be use for Controller user, Knox user and SQL Server Master SA accounts&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide password to be used for Controller user, Knox user and SQL Server Master SA accounts&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;MySQLBigData2019&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CONTROLLER_PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PASSWORD&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;MSSQL_SA_PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PASSWORD&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;KNOX_PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$PASSWORD&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Private Microsoft registry&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_REGISTRY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;private-repo.microsoft.com&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_REPOSITORY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;mssql-private-preview&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# if brave choose &quot;latest&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_IMAGE_TAG&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ctp2.4&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_IMAGE_POLICY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;IfNotPresent&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Provide your Docker username and email&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_USERNAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_EMAIL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Provide your Docker password&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DOCKER_PRIVATE_REGISTRY&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# aks | minikube | kubernetes&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;CLUSTER_PLATFORM&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;aks&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;ACCEPT_EULA&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Y&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;STORAGE_SIZE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;10Gi&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;First, let’s create a new resource group.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az group create &lt;span class=&quot;nt&quot;&gt;--subscription&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$SUBSCRIPTION_ID&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--location&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AZURE_REGION&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$GROUP_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, we can go ahead and create the AKS cluster.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az aks create &lt;span class=&quot;nt&quot;&gt;--subscription&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$SUBSCRIPTION_ID&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--location&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AZURE_REGION&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--resource-group&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$GROUP_NAME&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--generate-ssh-keys&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--node-vm-size&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$VM_SIZE&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--node-count&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AKS_NODE_COUNT&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--kubernetes-version&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$KUBERNETES_VERSION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If this command fails, the selected Kubernetes version might not be supported in your region. You can check which versions are supported using the following command.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az aks get-versions &lt;span class=&quot;nt&quot;&gt;--location&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AZURE_REGION&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Please note: if the aks command fails to create the Service Principal in your Azure Active Directory (as can happen, for example, for Microsoft employees), you can create the principal manually beforehand:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az ad sp create-for-rbac &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--skip-assignment&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;appId&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;displayName&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;http://***&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;password&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;tenant&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;***&quot;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# assign appId and password values&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ SP_APP_ID&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;app_id&amp;gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ SP_PASSWORD&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&amp;lt;password&amp;gt;&quot;&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az aks create &lt;span class=&quot;nt&quot;&gt;--subscription&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$SUBSCRIPTION_ID&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--location&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AZURE_REGION&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--resource-group&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$GROUP_NAME&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--generate-ssh-keys&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--node-vm-size&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$VM_SIZE&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--node-count&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$AKS_NODE_COUNT&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--kubernetes-version&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$KUBERNETES_VERSION&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--service-principal&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$SP_APP_ID&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--client-secret&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$SP_PASSWORD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the next step, we retrieve the credentials for the cluster. This will register the credentials in the &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; config.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az aks get-credentials &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--resource-group&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$GROUP_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--admin&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--overwrite-existing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
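&lt;p&gt;As a quick sanity check (not part of the original script), you can verify that &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; now points at the new cluster:&lt;/p&gt;

```sh
# The active kubectl context should now be the new AKS cluster,
# reporting 3 Ready agent nodes (matching AKS_NODE_COUNT above).
EXPECTED_NODES="3"

if command -v kubectl >/dev/null; then
  kubectl config current-context || echo "no kubectl context configured"
  kubectl get nodes || echo "cluster not reachable"
fi
```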

&lt;p&gt;In order to access the Kubernetes dashboard, we also need to create a role binding. I took this line from &lt;a href=&quot;https://pascalnaber.wordpress.com/2018/06/17/access-dashboard-on-aks-with-rbac-enabled/&quot;&gt;Pascal Naber’s blog post&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl create clusterrolebinding kubernetes-dashboard &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; kube-system &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--clusterrole&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;cluster-admin &lt;span class=&quot;nt&quot;&gt;--serviceaccount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;kube-system:kubernetes-dashboard
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next, we can open the Kubernetes dashboard for the newly created AKS cluster and check that everything looks fine. To do so, we forward the required ports to localhost.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;az aks browse &lt;span class=&quot;nt&quot;&gt;--resource-group&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$GROUP_NAME&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The Kubernetes dashboard should now be available via &lt;a href=&quot;http://localhost:8001&quot;&gt;http://localhost:8001&lt;/a&gt;. I recommend opening it and taking a look at your newly created cluster.&lt;/p&gt;

&lt;p&gt;Finally, we can deploy SQL Server 2019 BDC on the Kubernetes cluster using the &lt;code class=&quot;highlighter-rouge&quot;&gt;mssqlctl&lt;/code&gt; command-line utility.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;mssqlctl cluster create &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Great, that was it! You are now ready to get started. The following figure shows the Kubernetes dashboard with an installed instance of SQL Server 2019 BDC. You can see the Storage, Data and Compute pools as well as the SQL Master instance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/kubernetes.png&quot; alt=&quot;Kubernetes dashboard for SQL Server 2019 BDC&quot; title=&quot;Kubernetes dashboard for SQL Server 2019&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;querying-sql-server-2019-bdc&quot;&gt;Querying SQL Server 2019 BDC&lt;/h2&gt;

&lt;p&gt;For this section, we will use Azure Data Studio with the SQL Server 2019 extension, which lets us connect to both the SQL Server endpoint and the Knox endpoint for HDFS and Spark.&lt;/p&gt;

&lt;h3 id=&quot;working-with-hdfs&quot;&gt;Working with HDFS&lt;/h3&gt;

&lt;p&gt;First, we will put some data into the Big Data cluster. Let’s retrieve the IP address and port of the Knox endpoint for HDFS.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get service service-security-lb &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;custom-columns&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;IP:.status.loadBalancer.ingress[0].ip,PORT:.spec.ports[0].port&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From this IP address and port, we can build the WebHDFS URL and use any HDFS client to connect to the file system.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;https://&amp;lt;service-security-lb service external IP address&amp;gt;:30433/gateway/default/webhdfs/v1/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
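&lt;p&gt;As a quick sketch of what happens behind this URL: WebHDFS is a plain REST API, so any HTTP client works. The snippet below is an illustration, not official tooling - the gateway IP, the credentials and the &lt;code class=&quot;highlighter-rouge&quot;&gt;requests&lt;/code&gt; package are assumptions:&lt;/p&gt;

```python
# Sketch: building a WebHDFS REST URL behind the Knox gateway.
# The gateway IP, port and credentials below are placeholder assumptions.

def webhdfs_url(gateway_ip, path, op, port=30433):
    """Build the WebHDFS URL for an operation such as LISTSTATUS or OPEN."""
    return (f"https://{gateway_ip}:{port}/gateway/default/webhdfs/v1"
            f"{path}?op={op}")

url = webhdfs_url("10.0.0.1", "/", "LISTSTATUS")
print(url)

# Issuing the actual request (hypothetical credentials, self-signed cert):
# import requests
# r = requests.get(url, auth=("root", "PASSWORD"), verify=False)
# print(r.json())
```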

&lt;p&gt;You can follow the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/data-ingestion-curl?view=sqlallproducts-allversions&quot;&gt;guidelines in the Microsoft docs&lt;/a&gt; using &lt;code class=&quot;highlighter-rouge&quot;&gt;curl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can also use the integrated HDFS explorer in Data Studio. To do so, you must &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/connect-to-big-data-cluster?view=sqlallproducts-allversions&quot;&gt;create a new connection in Data Studio&lt;/a&gt; and select &lt;code class=&quot;highlighter-rouge&quot;&gt;SQL Server Big Data Cluster&lt;/code&gt;. I recommend using the user &lt;code class=&quot;highlighter-rouge&quot;&gt;root&lt;/code&gt; in order to have read/write access in all directories. The configuration should look similar to the following picture.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/config-spark.png&quot; alt=&quot;Configure HDFS in Data Studio&quot; title=&quot;Configure HDFS in Data Studio&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Once added, you should see the server and the HDFS directories in Data Studio.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/connect-data-services-node.png&quot; alt=&quot;HDFS in SQL Server 2019 (Source: docs.microsoft.com)&quot; title=&quot;HDFS in SQL Server 2019&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In order to query the HDFS data from SQL, you can configure external tables with the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/polybase/data-virtualization-csv?toc=%2fsql%2fbig-data-cluster%2ftoc.json&amp;amp;view=sql-server-ver15&quot;&gt;external table wizard&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;working-with-sql&quot;&gt;Working with SQL&lt;/h3&gt;

&lt;p&gt;To work with SQL in SQL Server 2019 BDC, we can simply connect to the SQL Server &lt;em&gt;Master Instance&lt;/em&gt;. This instance is a standard SQL Server engine running behind a load balancer on Kubernetes. You can also use your familiar tools, such as SQL Server Management Studio, to connect and interact with the SQL Server instance.&lt;/p&gt;

&lt;p&gt;To connect to the SQL Server &lt;em&gt;Master Instance&lt;/em&gt; from outside the cluster, we need to provide the &lt;strong&gt;external&lt;/strong&gt; IP address of the master instance. You can find it in the Kubernetes Dashboard, in the Big Data cluster namespace under services, as the external endpoint of the service &lt;code class=&quot;highlighter-rouge&quot;&gt;endpoint-master-pool&lt;/code&gt;. Alternatively, you can print the external IP address and port using the following command:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get service endpoint-master-pool &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;custom-columns&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;IP:.status.loadBalancer.ingress[0].ip,PORT:.spec.ports[0].port&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As mentioned, this is a standard SQL Server engine, just like in SQL Server 2016/2017. Hence, you can connect to the SQL Server endpoint using standard tooling such as SQL Server Management Studio or Azure Data Studio. Since we will also use Data Studio for Spark notebooks and HDFS, we will connect using Azure Data Studio.&lt;/p&gt;

&lt;p&gt;Create a new connection and select connection type &lt;code class=&quot;highlighter-rouge&quot;&gt;Microsoft SQL Server&lt;/code&gt;. Use the username &lt;code class=&quot;highlighter-rouge&quot;&gt;sa&lt;/code&gt; and the password you set in the setup script.&lt;/p&gt;
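&lt;p&gt;For completeness, you can also connect from plain Python. The following is a minimal sketch, assuming the &lt;code class=&quot;highlighter-rouge&quot;&gt;pyodbc&lt;/code&gt; package, a placeholder external IP and port, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;sa&lt;/code&gt; password from the setup script:&lt;/p&gt;

```python
# Sketch: assembling a connection string for the SQL Master Instance.
# Host, port and password are placeholder assumptions.

def mssql_conn_str(host, database, user, password):
    """Assemble an ODBC connection string for the master instance."""
    return (f"DRIVER={{ODBC Driver 17 for SQL Server}};"
            f"SERVER={host};DATABASE={database};UID={user};PWD={password}")

conn_str = mssql_conn_str("52.166.0.1,31433", "master", "sa", "PASSWORD")
print(conn_str)

# With pyodbc installed (hypothetical):
# import pyodbc
# with pyodbc.connect(conn_str) as conn:
#     print(conn.execute("SELECT @@VERSION").fetchval())
```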

&lt;p&gt;The following screenshot shows a query over an external table storing data in HDFS on the same cluster.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/sql.png&quot; alt=&quot;External table in SQL Server 2019&quot; title=&quot;External table in SQL Server 2019&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Using Polybase, you can also set up external tables over many other relational data sources, such as Oracle and SAP Hana. With the &lt;a href=&quot;https://docs.microsoft.com/en-us/sql/relational-databases/polybase/data-virtualization?toc=%2fsql%2fbig-data-cluster%2ftoc.json&amp;amp;view=sql-server-ver15&quot;&gt;external table wizard&lt;/a&gt; in Data Studio, this connection is easy to set up.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;You can find many more demos for SQL Server 2019 on &lt;a href=&quot;https://github.com/Microsoft/bobsql&quot;&gt;Bob Ward’s Github repository&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;working-with-spark&quot;&gt;Working with Spark&lt;/h3&gt;

&lt;p&gt;To work with Spark in SQL Server 2019 BDC, we can leverage the notebook capabilities of Data Studio. Once connected to the Big Data cluster, we will see options to create Spark notebooks for this instance.&lt;/p&gt;

&lt;p&gt;In the current version, the credentials from Spark are not yet passed to the SQL engine automatically. Hence, we have to supply a username and password along with the local database host to build the JDBC connection string. Below is a simple PySpark script to connect to the SQL Server database from within Spark.&lt;/p&gt;

&lt;p&gt;To connect to the SQL Server &lt;em&gt;Master Instance&lt;/em&gt; from Spark, i.e. from within the cluster, we need to provide the &lt;strong&gt;local&lt;/strong&gt; IP address of the master instance. You can find it in the Kubernetes Dashboard, in the Big Data cluster namespace under services, as the Cluster IP property of the service &lt;code class=&quot;highlighter-rouge&quot;&gt;endpoint-master-pool&lt;/code&gt;. Alternatively, you can print the internal IP address using the following command:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;kubectl get service endpoint-master-pool &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;custom-columns&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;IP:.spec.clusterIP,PORT:.spec.ports[0].port&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$CLUSTER_NAME&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;host&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&amp;lt;local_ip_address&amp;gt;:31334&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;database&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;demos&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;user&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;sa&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;password&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;&amp;lt;sa_password&amp;gt;&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;table&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;dbo.NYCTaxiTrips&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;jdbc_url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;jdbc:sqlserver://&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s;database=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s;user=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s;password=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;s&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;database&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;password&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;spark&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;jdbc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
       &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;url&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jdbc_url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
       &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;option&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dbtable&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; \
       &lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The following screenshot shows the above query executed on my SQL Server 2019 BDC instance on the NYC Taxi Trips dataset.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/sql2019/spark.png&quot; alt=&quot;Spark accessing SQL in SQL Server 2019&quot; title=&quot;Spark accessing SQL in SQL Server 2019&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you need to &lt;a href=&quot;https://jakevdp.github.io/blog/2017/12/05/installing-python-packages-from-jupyter/&quot;&gt;install additional Python packages&lt;/a&gt; on the cluster nodes or &lt;a href=&quot;https://becominghuman.ai/setting-up-a-scalable-data-exploration-environment-with-spark-and-jupyter-lab-22dbe7046269&quot;&gt;configure the Spark environment&lt;/a&gt;, you can use the Jupyter magic commands.&lt;/p&gt;

&lt;p&gt;I am sure you can see why this is really cool, right? You can easily run your Spark ETL, pre-processing and Machine Learning pipelines on data both stored in SQL and HDFS or any external sources.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;You can find many more Big Data samples on &lt;a href=&quot;https://github.com/Microsoft/sqlworkshops/tree/master/sqlserver2019bigdataclusters&quot;&gt;Buck Woody’s Github repository&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;SQL Server 2019 Big Data cluster (BDC) combines SQL Server, HDFS and Spark into one single cluster running on Kubernetes, either locally, on-premises or in the cloud. Using Polybase, one can connect multiple services - such as relational databases and NoSQL databases, or files in HDFS - as external tables. This allows you to have a single cluster for all your SQL and Spark workloads as well as for storing massive datasets.&lt;/p&gt;

&lt;p&gt;To set up SQL Server 2019 BDC, one needs to sign up for the SQL Server Early Adoption program and install &lt;code class=&quot;highlighter-rouge&quot;&gt;kubectl&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;mssqlctl&lt;/code&gt; on the local machine. The cluster can be created using the Python &lt;a href=&quot;https://github.com/Microsoft/sql-server-samples/blob/master/samples/features/sql-big-data-cluster/deployment/aks/deploy-sql-big-data-aks.py&quot;&gt;installation script&lt;/a&gt;. Make sure to clean up your credentials and to set up the role binding required to access your Kubernetes cluster in the cloud.&lt;/p&gt;

&lt;p&gt;Once the cluster is created, one can use Azure Data Studio to manage both SQL Server and HDFS. On top, Data Studio provides Jupyter-like notebooks to run Spark on the SQL Server 2019 cluster.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.microsoft.com/en-us/sql-server/sql-server-2019&quot;&gt;SQL Server 2019&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-server-ver15&quot;&gt;SQL Server 2019 Big Data cluster (BDC)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/&quot;&gt;Kubernetes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://hadoop.apache.org/docs/current1/hdfs_design.html#Introduction&quot;&gt;Hadoop Distributed Filesystem (HDFS)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://knox.apache.org/&quot;&gt;Apache Knox&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sqlservervnexteap.azurewebsites.net/&quot;&gt;SQL Server 2019 Early Adoption Program&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/quickstart-big-data-cluster-deploy?view=sql-server-ver15&quot;&gt;SQL Server 2019 BDC installation guide&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/Microsoft/sql-server-samples/blob/master/samples/features/sql-big-data-cluster/deployment/aks/deploy-sql-big-data-aks.py&quot;&gt;SQL Server 2019 BDC installation script&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://kubernetes.io/docs/tasks/tools/install-kubectl/&quot;&gt;Kubectl installation guide&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-install-mssqlctl?view=sqlallproducts-allversions&quot;&gt;Mssqlctl installation guide&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/sql/azure-data-studio/what-is?view=sqlallproducts-allversions&quot;&gt;Azure Data Studio&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://microsoft.github.io/sqlworkshops/&quot;&gt;SQL Workshops&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;Thanks to &lt;a href=&quot;https://www.linkedin.com/in/kaijisse-w-85304ba0/&quot;&gt;Kaijisse Waaijer&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;updates&quot;&gt;Updates&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;April 01, 2019: Update article to &lt;a href=&quot;https://cloudblogs.microsoft.com/sqlserver/2019/03/27/sql-server-2019-community-technology-preview-2-4-is-now-available/&quot;&gt;CTP 2.4&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Tue, 26 Feb 2019 14:30:00 +0000</pubDate>
        <link>https://chaosmail.github.io//hadoop/2019/02/26/getting-started-with-sql2019/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//hadoop/2019/02/26/getting-started-with-sql2019/</guid>
        
        <category>bigdata</category>
        
        <category>hadoop</category>
        
        <category>spark</category>
        
        <category>sql2019</category>
        
        
        
      </item>
    
      <item>
        <title>Introduction to BigData, Hadoop and Spark </title>
        <description>&lt;p&gt;Everyone is speaking about &lt;em&gt;Big Data&lt;/em&gt; and &lt;em&gt;Data Lakes&lt;/em&gt; these days. Many IT professionals see &lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; as &lt;em&gt;the&lt;/em&gt; solution to &lt;em&gt;every&lt;/em&gt; problem. At the same time, &lt;a href=&quot;https://hadoop.apache.org/&quot;&gt;Apache Hadoop&lt;/a&gt; has been around for &lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Hadoop&quot;&gt;more than 10 years&lt;/a&gt; and won’t go away anytime soon. In this blog post I want to give a brief introduction to Big Data, demystify some of the main concepts such as Map Reduce, and highlight the similarities and differences between Hadoop and Spark.&lt;/p&gt;

&lt;h2 id=&quot;big-data&quot;&gt;Big Data&lt;/h2&gt;

&lt;p&gt;You hear about Big Data everywhere. But what does it actually mean, what can we do with it, and when is a Big Data system required? We will try to answer these questions in this section.&lt;/p&gt;

&lt;h3 id=&quot;what-is-big-data&quot;&gt;What is Big Data&lt;/h3&gt;

&lt;p&gt;What is &lt;em&gt;Big Data&lt;/em&gt;? Have you ever heard of the popular definition of Big Data with the 3 &lt;strong&gt;V&lt;/strong&gt;s? This definition is very common and can be found in many text books and &lt;a href=&quot;https://en.wikipedia.org/wiki/Big_data&quot;&gt;Wikipedia&lt;/a&gt;. It suggests that your data is &lt;em&gt;Big&lt;/em&gt; data when one (or all - depending on the definition) of the following criteria are fulfilled:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;V&lt;/strong&gt;olume&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;V&lt;/strong&gt;elocity&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;V&lt;/strong&gt;ariety&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I find this definition concise and understandable, but a bit imprecise, which is probably intentional. Here is a more practical definition of what the 3 &lt;strong&gt;V&lt;/strong&gt;s stand for, based on my own experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume&lt;/strong&gt; describes a large &lt;em&gt;amount of data&lt;/em&gt; you want to store, process or analyze. If we are speaking in terms of hundreds of GBs, TBs or PBs, then we are speaking about Big Data. An important aspect to consider is data growth. As a rule of thumb: if your data is growing by multiple GBs per day, you are probably dealing with Big Data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Velocity&lt;/strong&gt; means a high &lt;em&gt;data throughput&lt;/em&gt; to be stored, processed and/or analyzed; often a large amount of data over a short period of time. When we are processing thousands to millions of records per second, then we are most likely speaking of Big Data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Variety&lt;/strong&gt; stands for the large amount of &lt;em&gt;different data types and formats&lt;/em&gt; that can be stored, processed or analyzed. This means one aims to process any kind of data, be it binary, text, structured, unstructured, compressed, uncompressed, nested, flat, etc. However, &lt;em&gt;variety&lt;/em&gt; is rather a consequence of Big Data as all data is eventually stored on a distributed file system and so one has to care about different optimized file formats for different use-cases.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Big Data&lt;/em&gt; systems are built to handle data of high volume, velocity and variety. Apache Hadoop and Apache Spark are popular Big Data frameworks for large-scale distributed processing. We will learn the similarities and differences in the following sections.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Please note that other definitions vary slightly and you will find 4 or even more &lt;strong&gt;V&lt;/strong&gt;s, such as &lt;em&gt;Veracity&lt;/em&gt; for example. &lt;em&gt;Veracity&lt;/em&gt; refers to the trustworthiness of the data and hence describes how useful your data actually is. While these extended definitions are relevant for Big Data, they don’t necessarily apply &lt;em&gt;only&lt;/em&gt; to Big Data systems or require &lt;em&gt;Big Data&lt;/em&gt; systems in my opinion.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;what-can-we-do-with-big-data&quot;&gt;What can we do with Big Data&lt;/h3&gt;

&lt;p&gt;Big Data systems allow you to load, process and store data of high volume, velocity and variety. However, I prefer to see it the other way around: once you need to load, process and store data of high volume, velocity and variety, you need a Big Data system.&lt;/p&gt;

&lt;p&gt;A great example is Google’s search index. In a very simplified way, one has to count occurrences of keywords and phrases in websites and store each keyword and its corresponding list of occurring websites as an inverted index. As you might imagine, both the amount of content and the inverted index probably don’t fit into the memory of a single machine.&lt;/p&gt;

&lt;p&gt;Another classic example is Twitter’s trending hashtags (the word count example). One has to count the occurrences of a keyword in all tweets, maybe even weighted over time and aggregated per geographic region. As you might imagine, doing this on billions of messages probably exceeds the capabilities of a single machine.&lt;/p&gt;

&lt;p&gt;Hence, both use-cases require scalable distributed systems to handle the load and process the data efficiently. There is one more thing that both use-cases have in common: they aggregate/group data by a key, e.g. the occurrences per keyword and the counts per hashtag. This is one of the key requirements for a Big Data system.&lt;/p&gt;
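&lt;p&gt;The group-by-key idea can be sketched in a few lines of plain Python: hashing the key decides which machine a record lands on, so every machine can count its share independently before the partial results are merged (the tweets and the number of machines below are made up):&lt;/p&gt;

```python
from collections import Counter

# Toy data and cluster size - invented for illustration.
tweets = ["#bigdata rocks", "#spark and #bigdata", "#spark streaming"]
n_machines = 3

# Extract the keys (hashtags) from each record.
hashtags = [w for t in tweets for w in t.split() if w.startswith("#")]

# Partition by key: the same hashtag always lands on the same machine.
partitions = [[] for _ in range(n_machines)]
for tag in hashtags:
    partitions[hash(tag) % n_machines].append(tag)

# Each machine counts its partition; the partial counts are then merged.
totals = Counter()
for part in partitions:
    totals.update(Counter(part))

print(totals["#bigdata"], totals["#spark"])  # 2 2
```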

&lt;p&gt;Hence, most Big Data workloads can be categorized into two groups: &lt;em&gt;Batch Processing&lt;/em&gt; and &lt;em&gt;Stream Processing&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Batch Processing
    &lt;ul&gt;
      &lt;li&gt;Transformation, Join and Aggregation&lt;/li&gt;
      &lt;li&gt;Analytics: (historical) Analytics, Prediction and Modeling&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Stream Processing
    &lt;ul&gt;
      &lt;li&gt;Transformation, Join and (temporal) Aggregation&lt;/li&gt;
      &lt;li&gt;Analytics: (real-time) Analytics, inference with prediction models&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;what-is-big-data-analytics&quot;&gt;What is Big Data Analytics&lt;/h3&gt;

&lt;p&gt;Big Data systems are used to store and process massive amounts of data, mostly via batch and stream processing. They are often used for analytics. To clarify which data is processed, I like to distinguish between 3 use-cases of increasing difficulty in Big Data analytics. The use-cases are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;(Historical) Analysis&lt;/li&gt;
  &lt;li&gt;Prediction&lt;/li&gt;
  &lt;li&gt;Modeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In classical &lt;strong&gt;Analytics&lt;/strong&gt;, we analyze historical/observed data. In Big Data, we analyze massive amounts of such data. A typical question to answer with analytics could be &lt;em&gt;how to compute the number of visitors of the previous season based on all bookings of said season?&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;Prediction&lt;/strong&gt;, we analyze the past to build a model that can predict the future. In more general terms, one fits a model on a set of training data to use it for inferring any unknown/unseen observation. We often use statistical methods (such as Generalized Linear Models, Logistic Regression, etc.) as well as Machine Learning (SVM, Gradient Boosted Trees, Deep Learning, etc.) techniques to build these models. A typical question to answer with prediction could be &lt;em&gt;how to forecast the number of visitors for the following season based on all bookings of previous seasons?&lt;/em&gt;.&lt;/p&gt;
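&lt;p&gt;A toy version of this forecast question fits in a few lines: fit a least-squares line to made-up visitor numbers from past seasons and extrapolate one season ahead. The numbers are invented and deliberately linear, so the model is the simplest one possible:&lt;/p&gt;

```python
# Sketch: forecasting next season's visitors with a least-squares line.
# The booking numbers are invented for illustration.
seasons = [1, 2, 3, 4]
visitors = [1000, 1200, 1400, 1600]

n = len(seasons)
mean_x = sum(seasons) / n
mean_y = sum(visitors) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(seasons, visitors))
         / sum((x - mean_x) ** 2 for x in seasons))
intercept = mean_y - slope * mean_x

forecast = slope * 5 + intercept  # extrapolate to season 5
print(forecast)  # 1800.0
```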

&lt;p&gt;&lt;strong&gt;Modeling&lt;/strong&gt; builds on both analytics and prediction capabilities. In &lt;em&gt;Modeling&lt;/em&gt;, the aim is to analyze the past and build a model that can predict different possible futures depending on the model parameters. These models are often more complicated than a simple statistical or Machine Learning model and take into account multiple state variables and parameters that can be modified. A typical question to answer with modeling could be &lt;em&gt;how to forecast the number of visitors for the following season, if the winter were two weeks shorter, based on all bookings of previous seasons plus additional data sources (weather data, etc.)?&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;hadoop-hdfs-yarn-and-mapreduce&quot;&gt;Hadoop: HDFS, Yarn and MapReduce&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://hadoop.apache.org/&quot;&gt;Apache Hadoop&lt;/a&gt; is a framework for storing and processing massive amounts of data on commodity hardware. It is a collection of services that sit together in the &lt;a href=&quot;https://github.com/apache/hadoop&quot;&gt;Hadoop repository&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;HDFS: a distributed file system&lt;/li&gt;
  &lt;li&gt;MapReduce: a framework for distributed processing&lt;/li&gt;
  &lt;li&gt;Yarn: a cluster resource manager&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;HDFS&lt;/strong&gt; (Hadoop Distributed File System) is a distributed file system that stores and replicates data in blocks across multiple nodes in a cluster. HDFS is the open-source implementation of the &lt;a href=&quot;https://ai.google/research/pubs/pub51&quot;&gt;Google File System (GFS)&lt;/a&gt; paper published by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung at Google in 2003.&lt;/p&gt;

&lt;p&gt;HDFS consists of a &lt;em&gt;name node&lt;/em&gt; and multiple &lt;em&gt;data nodes&lt;/em&gt;. The &lt;em&gt;name node&lt;/em&gt; holds the references to the data blocks and takes care of the file system and metadata operations, whereas the &lt;em&gt;data nodes&lt;/em&gt; store the data blocks on their local file systems.&lt;/p&gt;
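&lt;p&gt;This division of labour can be illustrated with a toy model: the name node only keeps the block-to-node mapping, while the data nodes hold the actual bytes. The block size and replication factor below mirror common HDFS defaults (128 MB and 3), but the placement logic is a deliberately naive sketch, not the real HDFS algorithm:&lt;/p&gt;

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # common HDFS default block size
REPLICATION = 3                 # common HDFS default replication factor
data_nodes = ["dn1", "dn2", "dn3", "dn4"]

def plan_blocks(file_size):
    """Toy name-node logic: split a file into blocks, pick replica nodes."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    rotation = itertools.cycle(data_nodes)
    return {block_id: [next(rotation) for _ in range(REPLICATION)]
            for block_id in range(n_blocks)}

plan = plan_blocks(300 * 1024 * 1024)  # a 300 MB file
print(len(plan), len(plan[0]))  # 3 blocks, 3 replicas each
```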

&lt;p&gt;&lt;strong&gt;MapReduce&lt;/strong&gt; is a high-level framework for distributed processing of large data sets that abstracts developers’ code into &lt;em&gt;Map&lt;/em&gt; (transformation) and &lt;em&gt;Reduce&lt;/em&gt; (aggregation) operations. By doing so, the code can automatically run in parallel on a distributed system. MapReduce is an open-source implementation of the &lt;a href=&quot;https://ai.google/research/pubs/pub62&quot;&gt;MapReduce: Simplified Data Processing on Large Clusters&lt;/a&gt; paper published by Jeff Dean and Sanjay Ghemawat at Google in 2004.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Due to the same name of the paper and the open-source implementation, the term &lt;em&gt;MapReduce&lt;/em&gt; can lead to confusion between the original concept and the framework.&lt;/p&gt;
&lt;/blockquote&gt;
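&lt;p&gt;The programming model from the paper can be illustrated with the canonical word-count job, written as explicit map, shuffle and reduce phases. This is a plain-Python sketch of the concept, not the Hadoop API:&lt;/p&gt;

```python
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit a (key, 1) pair for every word in every input record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group all emitted values by key (done by the framework in Hadoop).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate the values for each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts["the"], counts["fox"])  # 3 2
```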

&lt;p&gt;Similar to both Google papers, HDFS and MapReduce were designed and developed to function as one single framework for distributed processing of large data sets. MapReduce takes advantage of data replication in HDFS by moving computations to the same physical machine where the data is stored. In Hadoop 1, the MapReduce services ran directly on the &lt;em&gt;data nodes&lt;/em&gt; without any resource manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Yarn&lt;/strong&gt; (an acronym for &lt;em&gt;Yet Another Resource Negotiator&lt;/em&gt;) is a distributed resource manager and job scheduler for managing cluster resources (CPUs, RAM, GPUs, etc.) and for scheduling and running distributed jobs on a Hadoop cluster. It was introduced in Hadoop 2 to decouple the MapReduce engine from the cluster resource management, allowing more services to run on top of Hadoop. Instead of starting services on each of the nodes individually, one can submit a service to Yarn, which takes care of the resource negotiation, distribution of the service to all requested nodes, execution of the service, log collection, and so on.&lt;/p&gt;

&lt;p&gt;Yarn consists of a &lt;em&gt;resource manager&lt;/em&gt; service to negotiate cluster resources and multiple &lt;em&gt;node manager&lt;/em&gt; services that manage the execution of processes on each of the nodes.&lt;/p&gt;

&lt;p&gt;Although not managed in the same repository as Apache Hadoop, I often like to mention Apache Zookeeper as another integral building block of Hadoop. &lt;strong&gt;Apache Zookeeper&lt;/strong&gt; is a distributed, transactional, in-memory key-value store with strong synchronization guarantees. Many Hadoop services use Zookeeper for storing dynamic configuration (available nodes per partition, current master, etc.), leader election, synchronization, and much more.&lt;/p&gt;

&lt;p&gt;Nowadays, there are many other services related to or included in the Hadoop stack. Here is a (small) list of distributed services that usually run on top of Hadoop:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Batch Processing
    &lt;ul&gt;
      &lt;li&gt;Hive&lt;/li&gt;
      &lt;li&gt;Pig&lt;/li&gt;
      &lt;li&gt;MapReduce&lt;/li&gt;
      &lt;li&gt;Tez&lt;/li&gt;
      &lt;li&gt;Druid&lt;/li&gt;
      &lt;li&gt;Impala&lt;/li&gt;
      &lt;li&gt;Spark&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Stream processing
    &lt;ul&gt;
      &lt;li&gt;Storm&lt;/li&gt;
      &lt;li&gt;Flink&lt;/li&gt;
      &lt;li&gt;Spark Streaming&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Data Storage
    &lt;ul&gt;
      &lt;li&gt;HDFS (File Store)&lt;/li&gt;
      &lt;li&gt;HBase (NoSql)&lt;/li&gt;
      &lt;li&gt;Cassandra (NoSQL)&lt;/li&gt;
      &lt;li&gt;Accumulo (NoSQL)&lt;/li&gt;
      &lt;li&gt;Kafka (Log Store)&lt;/li&gt;
      &lt;li&gt;Solr (Inverted Document Index)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of these services run on top of Hadoop because they utilize one or more of its components. Typical examples of reused components are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;HDFS as distributed storage (used in Hive, HBase, etc.)&lt;/li&gt;
  &lt;li&gt;Yarn as resource manager (used in Hive, Spark, Storm, etc.)&lt;/li&gt;
  &lt;li&gt;Zookeeper for synchronization and leader election (used in Hive, Kafka, HBase, etc.)&lt;/li&gt;
  &lt;li&gt;Hive Metastore as a meta data storage (used in Spark, Impala, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;spark-the-evolution-of-mapreduce&quot;&gt;Spark: The Evolution of MapReduce&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark&lt;/a&gt; got &lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Spark&quot;&gt;popular in 2014&lt;/a&gt; as a fast general-purpose compute framework for distributed processing which claimed to be more than 100 times faster than the traditional MapReduce implementation. It provides high-level operations for working with distributed data sets, which are optimized and executed in memory on the cluster nodes. Spark runs on top of multiple resource managers such as Yarn or Mesos.&lt;/p&gt;

&lt;p&gt;Conceptually, Spark’s execution engine is similar to the other distributed processing frameworks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://ai.google/research/pubs/pub62&quot;&gt;MapReduce: Simplified Data Processing on Large Clusters&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://web.eecs.umich.edu/~mosharaf/Readings/Tez.pdf&quot;&gt;Tez: A Unifying Framework for Modeling and Building Data Processing Applications&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf&quot;&gt;Impala: A Modern, Open-Source SQL Engine for Hadoop&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf&quot;&gt;Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Apache Tez&lt;/em&gt; (Tez is Hindi for &lt;em&gt;speed&lt;/em&gt;) is &lt;a href=&quot;https://de.hortonworks.com/blog/introducing-tez-faster-hadoop-processing/&quot;&gt;a faster MapReduce engine&lt;/a&gt; based on Apache Yarn that optimizes complex execution graphs into single jobs to avoid intermediate writes to HDFS. It is the default execution engine powering Apache Pig and Apache Hive (a large scale data warehouse solution on top of Hadoop) on the Hortonworks Hadoop distribution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Apache Impala&lt;/em&gt; is a &lt;a href=&quot;https://impala.apache.org/overview.html&quot;&gt;Big Data SQL engine&lt;/a&gt; on top of HDFS, HBase and Hive (Metastore) with its own specialized distributed query engine. It is the default engine for Apache Hive on the Cloudera and MapR Hadoop distributions.&lt;/p&gt;

&lt;p&gt;As we can see in the following figure, the main difference between the newer processing frameworks (Tez, Impala, and Spark) and the traditional MapReduce engine (left) is that they avoid writing intermediate results to HDFS and heavily optimize the execution graph. Another optimization strategy is processing/caching data in the memory of the local nodes across the execution graph.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/hadoop/mr_tez_spark.png&quot; alt=&quot;MapReduce vs. Tez/Impala/Spark&quot; title=&quot;MapReduce vs. Tez/Impala/Spark&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Traditional MapReduce (left) vs. Tez/Impala/Spark optimized engines (right) (Source: &lt;a href=&quot;https://de.hortonworks.com/blog/introducing-tez-faster-hadoop-processing/&quot;&gt;hortonworks.com&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;What sets &lt;em&gt;Apache Spark&lt;/em&gt; apart from the other frameworks are its in-memory processing engine as well as the rich set of included libraries (GraphX for graph processing, MLlib for Machine Learning, Spark Streaming for mini-batch streaming, and Spark SQL) and SDKs (Scala, Python, Java, and R). Please note that these libraries are for &lt;em&gt;distributed&lt;/em&gt; processing, so you get distributed graph processing, distributed machine learning, etc. out of the box.&lt;/p&gt;

&lt;p&gt;The amazing performance of Spark’s in-memory engine comes with a trade-off. Tuning and operating Spark pipelines with varying amounts of data requires a &lt;a href=&quot;https://spark.apache.org/docs/latest/tuning.html&quot;&gt;lot of manual tuning of configurations&lt;/a&gt;, digging through log files, and reading books, articles, and blog posts. And since the execution parallelism can be modified in a fine-grained way, one has to configure/set the number of tasks per JVM, the number of JVMs per worker, and the number of workers as well as all the memory settings (heap, shuffle, and storage) for these executors and the driver.&lt;/p&gt;
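&lt;p&gt;As a sketch, such a manual configuration might look like the following &lt;code class=&quot;highlighter-rouge&quot;&gt;spark-submit&lt;/code&gt; invocation. The flags are real Spark options, but the values (and the job name) are hypothetical placeholders that must be tuned per workload and cluster:&lt;/p&gt;

```sh
# Hypothetical resource settings for illustration only.
spark-submit \
  --master yarn \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --driver-memory 4g \
  --conf spark.memory.fraction=0.6 \
  my_job.py
```

&lt;p&gt;Getting these numbers wrong typically shows up as out-of-memory errors, excessive shuffle spills, or idle cluster capacity.&lt;/p&gt;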

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;We speak about Big Data when we deal with large volumes (&amp;gt; 10s of GB), high velocity (&amp;gt; 10,000s of records/second) or large variety (binary, text, unstructured, compressed, etc.) of data. We use Big Data systems to store and process massive data sets - either as batch (processing partitions of data) or as stream (processing single records). In Big Data analytics, we usually differentiate between historical analytics, prediction, and modeling.&lt;/p&gt;

&lt;p&gt;Apache Hadoop is a collection of services for large-scale distributed storage and processing, mainly HDFS (a distributed filesystem), MapReduce (a processing framework), Apache Yarn (a cluster resource manager), and Apache Zookeeper (a fast distributed key-value storage). Many other services such as Hive, HBase, etc. run on top of Hadoop.&lt;/p&gt;

&lt;p&gt;Apache Spark is a fast (claimed to be up to 100 times faster than traditional MapReduce) distributed in-memory processing engine with high-level APIs, libraries for distributed graph processing and machine learning, and SDKs for Scala, Java, Python and R. It also has support for SQL and streaming.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Big_data&quot;&gt;Big Data - Wikipedia&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Hadoop&quot;&gt;Apache Hadoop - Wikipedia&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_Spark&quot;&gt;Apache Spark - Wikipedia&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://spark.apache.org/&quot;&gt;Apache Spark - Website&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://hadoop.apache.org/&quot;&gt;Apache Hadoop - Website&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/apache/hadoop&quot;&gt;Apache Hadoop - Github Repository&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://impala.apache.org/overview.html&quot;&gt;Apache Impala - Website (Overview)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://ai.google/research/pubs/pub51&quot;&gt;The Google File System (2003)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://ai.google/research/pubs/pub62&quot;&gt;MapReduce: Simplified Data Processing on Large Clusters (2004)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://de.hortonworks.com/blog/introducing-tez-faster-hadoop-processing/&quot;&gt;Introducing Tez - Hortonworks Blog&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf&quot;&gt;Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (2011)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://web.eecs.umich.edu/~mosharaf/Readings/Tez.pdf&quot;&gt;Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications (2015)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://cidrdb.org/cidr2015/Papers/CIDR15_Paper28.pdf&quot;&gt;Impala: A Modern, Open-Source SQL Engine for Hadoop (2015)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;Thanks to &lt;a href=&quot;https://www.linkedin.com/in/emil-jorgensen&quot;&gt;Emil Jorgensen&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/bryanminnock&quot;&gt;Bryan Minnock&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
        <pubDate>Thu, 31 Jan 2019 16:00:00 +0000</pubDate>
        <link>https://chaosmail.github.io//hadoop/2019/01/31/intro-to-bigdata-hadoop-and-spark/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//hadoop/2019/01/31/intro-to-bigdata-hadoop-and-spark/</guid>
        
        <category>bigdata</category>
        
        <category>hadoop</category>
        
        <category>spark</category>
        
        
        <category>hadoop</category>
        
      </item>
    
      <item>
        <title>Bing Maps API - Rest Locations (Geocoding)</title>
        <description>&lt;p&gt;The &lt;a href=&quot;https://www.microsoft.com/en-us/maps/choose-your-bing-maps-api&quot;&gt;Bing Maps APIs&lt;/a&gt; provide a &lt;a href=&quot;https://msdn.microsoft.com/en-us/library/ff701711.aspx&quot;&gt;REST API for Geocoding&lt;/a&gt;, hence finding a location based on a text input. This text input can be either a structured address or an (unstructured) search query. In this blog post we will see how to request a developer key and use the REST API for finding the coordinates to a query using the Bing Maps Location API.&lt;/p&gt;

&lt;h2 id=&quot;request-an-api-key&quot;&gt;Request an API key&lt;/h2&gt;

&lt;p&gt;If you already use Azure or have an Azure account, then you can request an API key directly from within Azure. To do so, log into your Azure account and go to the Marketplace. Search for &lt;em&gt;Bing Maps API for Enterprise&lt;/em&gt;, select your pricing model, and add it to your Azure account. Tier level 1 is free and grants you 10K requests/month. You can find more pricing information in the &lt;a href=&quot;https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bingmaps.mapapis?tab=PlansAndPrice&quot;&gt;product documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you don’t have an Azure account then you can request an API key from the &lt;a href=&quot;https://www.bingmapsportal.com/&quot;&gt;Bing Maps Portal&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;locations-api-rest&quot;&gt;Locations API (Rest)&lt;/h2&gt;

&lt;p&gt;There are two types of queries available in the Locations API: structured and unstructured queries. In this example we will use an unstructured query to query the API with a search term. However, you can also find examples using structured queries in the &lt;a href=&quot;https://msdn.microsoft.com/en-us/library/ff701711.aspx&quot;&gt;Bing Maps Location API&lt;/a&gt; documentation.&lt;/p&gt;

&lt;p&gt;Let’s find the geolocation of Howth, an Irish village north-east of Dublin. To run this, you have to replace the &lt;code class=&quot;highlighter-rouge&quot;&gt;&amp;lt;accesskey&amp;gt;&lt;/code&gt; placeholder with your own access key in the following snippet.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl &quot;http://dev.virtualearth.net/REST/v1/Locations?query&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;Howth+Dublin&amp;amp;include&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;queryParse&amp;amp;key&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&amp;lt;accesskey&amp;gt;&amp;amp;output&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;json&quot;

&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;s2&quot;&gt;&quot;authenticationResultCode&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;ValidCredentials&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;brandLogoUri&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;http://dev.virtualearth.net/Branding/logo_powered_by.png&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;copyright&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Copyright © 2018 Microsoft and its suppliers. All rights reserved. This API cannot be accessed and the content and any results may not be used, reproduced or transmitted in any manner without express written permission from Microsoft Corporation.&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;resourceSets&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;s2&quot;&gt;&quot;estimatedTotal&quot;&lt;/span&gt;: 2,
      &lt;span class=&quot;s2&quot;&gt;&quot;resources&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;s2&quot;&gt;&quot;__type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Location:http://schemas.microsoft.com/search/local/ws/rest/v1&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;bbox&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            53.3509009854156,
            &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.12213139880203,
            53.4088417514008,
            &lt;span class=&quot;nt&quot;&gt;-5&lt;/span&gt;.99270815503098
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth, Ireland&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;point&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Point&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;coordinates&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
              53.3798713684082,
              &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.0574197769165
            &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;address&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;adminDistrict&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Dublin&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;countryRegion&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Ireland&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;formattedAddress&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth, Ireland&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;locality&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth&quot;&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;confidence&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;High&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;entityType&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;PopulatedPlace&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;geocodePoints&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Point&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;coordinates&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
                53.3798713684082,
                &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.0574197769165
              &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;calculationMethod&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Rooftop&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;usageTypes&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
                &lt;span class=&quot;s2&quot;&gt;&quot;Display&quot;&lt;/span&gt;
              &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;matchCodes&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;Ambiguous&quot;&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;queryParseValues&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;property&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Locality&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;howth&quot;&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;property&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;AdminDistrict&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;dublin&quot;&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
        &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
          &lt;span class=&quot;s2&quot;&gt;&quot;__type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Location:http://schemas.microsoft.com/search/local/ws/rest/v1&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;bbox&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            53.3849579306227,
            &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.08322532103279,
            53.392683365764,
            &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.06595509125969
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth, Ireland&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;point&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Point&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;coordinates&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
              53.3888206481934,
              &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.07459020614624
            &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;address&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;adminDistrict&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Dublin&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;countryRegion&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Ireland&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;formattedAddress&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth, Ireland&quot;&lt;/span&gt;,
            &lt;span class=&quot;s2&quot;&gt;&quot;locality&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Howth&quot;&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;confidence&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Medium&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;entityType&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;RailwayStation&quot;&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;geocodePoints&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Point&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;coordinates&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
                53.3888206481934,
                &lt;span class=&quot;nt&quot;&gt;-6&lt;/span&gt;.07459020614624
              &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;calculationMethod&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Rooftop&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;usageTypes&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
                &lt;span class=&quot;s2&quot;&gt;&quot;Display&quot;&lt;/span&gt;
              &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;matchCodes&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;s2&quot;&gt;&quot;Ambiguous&quot;&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
          &lt;span class=&quot;s2&quot;&gt;&quot;queryParseValues&quot;&lt;/span&gt;: &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;property&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;Landmark&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;howth&quot;&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;,
            &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
              &lt;span class=&quot;s2&quot;&gt;&quot;property&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;AdminDistrict&quot;&lt;/span&gt;,
              &lt;span class=&quot;s2&quot;&gt;&quot;value&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;dublin&quot;&lt;/span&gt;
            &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
          &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;statusCode&quot;&lt;/span&gt;: 200,
  &lt;span class=&quot;s2&quot;&gt;&quot;statusDescription&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;OK&quot;&lt;/span&gt;,
  &lt;span class=&quot;s2&quot;&gt;&quot;traceId&quot;&lt;/span&gt;: &lt;span class=&quot;s2&quot;&gt;&quot;40fdef3af24f4784ace0cce7eac525d8|DB40060332|7.7.0.0|Ref A: 4E51EFC390E74430B0E957794757DF59 Ref B: DB3EDGE1021 Ref C: 2018-07-31T13:59:46Z&quot;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the response above we see that the API returns the geolocation, a bounding box and other useful geo information, such as the country region and administrative district, as a JSON body.&lt;/p&gt;
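&lt;p&gt;If you prefer scripting the request, the same URL can be assembled with the Python standard library. This is a sketch only; the key below is a placeholder and no network call is made:&lt;/p&gt;

```python
# Build the Locations API request URL with stdlib tools only.
# ACCESS_KEY_PLACEHOLDER is a placeholder; substitute your own key.
from urllib.parse import urlencode

params = {
    "query": "Howth Dublin",
    "include": "queryParse",
    "output": "json",
    "key": "ACCESS_KEY_PLACEHOLDER",
}
url = "http://dev.virtualearth.net/REST/v1/Locations?" + urlencode(params)
# With a real key, fetch the JSON via urllib.request.urlopen(url).
print(url)
```

&lt;p&gt;Encoding the parameters this way also avoids the shell-quoting pitfalls of passing a raw query string to curl.&lt;/p&gt;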

&lt;blockquote&gt;
  &lt;p&gt;With the Bing Maps API, one can also perform Batch operations. You can find more information about Batch operations on the &lt;a href=&quot;https://blogs.bing.com/maps/2010/08/31/batch-geocoding-and-batch-reverse-geocoding-with-bing-maps/&quot;&gt;Bing Blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.microsoft.com/en-us/maps/choose-your-bing-maps-api&quot;&gt;Bing Maps API Overview&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://msdn.microsoft.com/en-us/library/ff701711.aspx&quot;&gt;Bing Maps API - Documentation of Locations API&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bingmaps.mapapis?tab=PlansAndPrice&quot;&gt;Bing Maps API - Pricing&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.bingmapsportal.com/&quot;&gt;Bing Maps Portal&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Tue, 31 Jul 2018 17:00:00 +0000</pubDate>
        <link>https://chaosmail.github.io//geoprocessing/2018/07/31/bing-maps-api-rest-locations/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//geoprocessing/2018/07/31/bing-maps-api-rest-locations/</guid>
        
        <category>bing</category>
        
        <category>maps</category>
        
        <category>api</category>
        
        <category>geocoding</category>
        
        
        <category>geoprocessing</category>
        
      </item>
    
      <item>
        <title>MS Hack 2018 - Smart Outlook</title>
        <description>&lt;p&gt;In the MS Hack 2018 global Microsoft Hackathon, I joined the &lt;em&gt;Smart Outlook&lt;/em&gt; project for 3 days in order to develop a few ideas to make your daily work in Outlook more productive. We came up with a few ideas, mainly about enhancing folders, categories, and focus inbox as well as notifications about new emails. We build a &lt;a href=&quot;https://mshack2018.z13.web.core.windows.net&quot;&gt;mock UI&lt;/a&gt; to visualize the idea of our solution; the code is available on &lt;a href=&quot;https://github.com/chaosmail/mshack-2018-smart-outlook&quot;&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;idea&quot;&gt;Idea&lt;/h2&gt;

&lt;p&gt;Employees spend 20 hours/week on email. Common patterns include manually moving emails into folders, manually maintaining rules, maintaining a single-folder inbox, and keeping tons of unread emails in the inbox. All of these cost time.&lt;/p&gt;

&lt;p&gt;Categorizing emails is not straightforward. The following tools are available to categorize emails:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Custom Folders&lt;/li&gt;
  &lt;li&gt;Categories&lt;/li&gt;
  &lt;li&gt;Focus Inbox&lt;/li&gt;
  &lt;li&gt;Junk Folder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both &lt;em&gt;Custom Folders&lt;/em&gt; and &lt;em&gt;Categories&lt;/em&gt; are filled manually or with manually-created rules whereas &lt;em&gt;Junk Folder&lt;/em&gt; and &lt;em&gt;Focus Inbox&lt;/em&gt; are filled automatically. We receive an email notification for every email received in the inbox.&lt;/p&gt;

&lt;h2 id=&quot;solution&quot;&gt;Solution&lt;/h2&gt;

&lt;p&gt;To make Outlook more efficient, we want to merge &lt;em&gt;Custom Folders&lt;/em&gt; and &lt;em&gt;Focus Inbox (Tabs)&lt;/em&gt; into &lt;em&gt;Smart Tabs&lt;/em&gt;. &lt;em&gt;Smart Tabs&lt;/em&gt; can be created within hierarchies like traditional folders. Emails can be dragged into &lt;em&gt;Smart Tabs&lt;/em&gt; like traditional folders, and each email can only belong to one single &lt;em&gt;Smart Tab&lt;/em&gt;. However, unlike traditional folders, the number of unread emails is added up with all parent tabs. Based on the manual categorization of emails into &lt;em&gt;Smart Tabs&lt;/em&gt;, a Machine Learning algorithm trains/fine-tunes a model based on content/subject semantics as well as email and organizational metadata and moves new emails to the most likely tab (or the inbox). During setup, common &lt;em&gt;Smart Tabs&lt;/em&gt; are created for you, such as &lt;em&gt;Team&lt;/em&gt;, &lt;em&gt;Notifications&lt;/em&gt;, &lt;em&gt;News&lt;/em&gt;, &lt;em&gt;Org Updates&lt;/em&gt;, &lt;em&gt;Customers&lt;/em&gt;, &lt;em&gt;External&lt;/em&gt;, etc.&lt;/p&gt;
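&lt;p&gt;As a purely hypothetical sketch (not the model we actually trained, which used learned semantic features), routing emails to tabs can be illustrated with simple keyword scores:&lt;/p&gt;

```python
# Illustrative only: route an email subject to the best-matching
# Smart Tab by counting keyword overlaps. The tab names and keyword
# sets below are made-up examples, not real product behavior.
TAB_KEYWORDS = {
    "Notifications": {"alert", "automated", "noreply"},
    "Team": {"standup", "sprint", "retro"},
    "News": {"newsletter", "digest", "weekly"},
}

def route_email(subject, default="Inbox"):
    words = set(subject.lower().split())
    # Score each tab by keyword overlap and pick the best match.
    scores = {tab: len(words.intersection(kw))
              for tab, kw in TAB_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else default

print(route_email("Weekly digest newsletter"))  # News
print(route_email("Lunch tomorrow?"))           # Inbox
```

&lt;p&gt;A trained model would replace the hand-picked keyword sets with weights learned from the user’s past filing behavior.&lt;/p&gt;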

&lt;p&gt;To focus on only relevant notifications, any of your &lt;em&gt;Smart Tabs&lt;/em&gt; can be pinned to the top. You will only receive notifications from new emails in your pinned tabs. We merged the UI of the Outlook web client with the UI of the Edge browser to create this tab UI. Here is a screenshot of our mock UI for &lt;em&gt;Smart Tabs&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/projects/smart-outlook-mock-ui.png&quot; alt=&quot;Smart Tabs UI&quot; title=&quot;Smart Tabs UI&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A blog post about deploying a static website to Azure can be found in the &lt;a href=&quot;https://azure.microsoft.com/en-us/blog/azure-storage-static-web-hosting-public-preview/&quot;&gt;Azure Blog&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;execution&quot;&gt;Execution&lt;/h2&gt;

&lt;p&gt;We collaborated with 1 team in London and 2 teams in India who had similar ideas. Here is what we implemented at the hackathon:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Mock UI
    &lt;ul&gt;
      &lt;li&gt;Smart Tabs: A merge between classic folders and Focus/Other tabs. Mails should be classified automatically (based on previous user interactions) and put into those smart tabs. These tabs also appear on the left, the same way folders did.&lt;/li&gt;
      &lt;li&gt;Focus Mode: Smart tabs can be pinned as tabs to the top. Then the user receives notifications solely on the current open tabs.&lt;/li&gt;
      &lt;li&gt;UI: We tried to merge functionality of Outlook.com Web client with the simple/clean UI of Edge into a single tab-based email UI&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;ML API Endpoint
    &lt;ul&gt;
      &lt;li&gt;Labeled a custom email dataset&lt;/li&gt;
      &lt;li&gt;Trained simple ML model with semantic features in Python&lt;/li&gt;
      &lt;li&gt;Email classification based on content&lt;/li&gt;
      &lt;li&gt;Deployment to Azure VM&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Outlook Plugin
    &lt;ul&gt;
      &lt;li&gt;Create &lt;em&gt;Smart Tabs&lt;/em&gt; programmatically (as folders)&lt;/li&gt;
      &lt;li&gt;Classify incoming messages and move to &lt;em&gt;Smart Tabs&lt;/em&gt; (as folders)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;future-work&quot;&gt;Future Work&lt;/h2&gt;

&lt;p&gt;It was great fun to hack on Outlook, to try making Outlook more efficient, and to propose a new UI that lets you focus on what is important. We see great potential in email clients becoming more intelligent, productive and efficient. In this three-day hackathon we validated a few ideas and showed in a proof of concept that they could potentially find their way into the Outlook Windows or web client.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://mshack2018.z13.web.core.windows.net&quot;&gt;Smart Outlook - Demo&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/chaosmail/mshack-2018-smart-outlook&quot;&gt;Smart Outlook - Code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Mon, 30 Jul 2018 17:00:00 +0000</pubDate>
        <link>https://chaosmail.github.io//hackathon/2018/07/30/ms-hack-2018-smart-outlook/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//hackathon/2018/07/30/ms-hack-2018-smart-outlook/</guid>
        
        <category>hackathon</category>
        
        <category>office</category>
        
        
        
      </item>
    
      <item>
        <title>Resolving _HOST in kerberized HDP Sandbox</title>
<description>&lt;p&gt;Since HDP 2.5, Hortonworks provides its &lt;a href=&quot;https://de.hortonworks.com/products/sandbox/&quot;&gt;HDP Sandbox&lt;/a&gt; as a Docker container inside a Virtual Machine ISO. For many developers working with multiple VM HDP Sandboxes this is not optimal, as every connection has to be tunneled through the VM host into the Docker container. That’s why we are building our own custom Sandbox. However, when building a kerberized Hadoop installation, it is a bit tricky to configure the hostname such that Kerberos principals resolve the &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; variable properly.&lt;/p&gt;

&lt;h2 id=&quot;ambaris-autogenerated-kerberos-principals&quot;&gt;Ambari’s autogenerated Kerberos principals&lt;/h2&gt;

&lt;p&gt;With Ambari and Ambari Blueprints, automated Hadoop cluster installations are quite convenient; one can simply describe all components and configurations in a Blueprint file. When it comes to Kerberos, Ambari automatically takes care of creating all principals and keytabs. However, I experienced a strange Kerberos authentication error: Ambari resolved the &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; variable to &lt;code class=&quot;highlighter-rouge&quot;&gt;localhost&lt;/code&gt; in all principals despite the hostname &lt;code class=&quot;highlighter-rouge&quot;&gt;sandbox.chaosmail.at&lt;/code&gt; being set in &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hostname&lt;/code&gt;. Hence, the Kerberos principals were not valid.&lt;/p&gt;

&lt;p&gt;It turned out that the error occurred because &lt;code class=&quot;highlighter-rouge&quot;&gt;127.0.0.1 sandbox.chaosmail.at&lt;/code&gt; was placed in the last line of &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt; instead of the first. Here is how I debugged the error.&lt;/p&gt;

&lt;h2 id=&quot;diving-into-ambari-source-code&quot;&gt;Diving into Ambari source code&lt;/h2&gt;

&lt;p&gt;If we dive into the &lt;a href=&quot;https://github.com/apache/ambari&quot;&gt;Ambari source code on Github&lt;/a&gt; and search for &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt;, we quickly find the following &lt;a href=&quot;https://github.com/apache/ambari/blob/d5cbe1940552c1ac6ed142b0d36bc84f45ba3c7f/ambari-server/src/main/java/org/apache/ambari/server/serveraction/kerberos/KerberosServerAction.java#L531&quot;&gt;code snippet&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KerberosIdentityDataFileReader&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;HOSTNAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KerberosHelper&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;AMBARI_SERVER_HOST_NAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;equals&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;hostname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Replace KerberosHelper.AMBARI_SERVER_HOST_NAME with the actual hostname where the Ambari&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// server is... this host&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;hostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;StageUtils&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getHostName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Evaluate the principal &quot;pattern&quot; found in the record to generate the &quot;evaluated principal&quot;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// by replacing the _HOST and _REALM variables.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;evaluatedPrincipal&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;principal&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;_HOST&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hostname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;replace&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;_REALM&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;defaultRealm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Bingo, that’s the place where the &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; variable gets resolved. In our case, with the Ambari server running on the host in question, the variable is replaced by the return value of the &lt;code class=&quot;highlighter-rouge&quot;&gt;StageUtils.getHostName()&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;Let’s find this function in the &lt;a href=&quot;https://github.com/apache/ambari/blob/d5cbe1940552c1ac6ed142b0d36bc84f45ba3c7f/ambari-server/src/main/java/org/apache/ambari/server/utils/StageUtils.java#L108&quot;&gt;source code&lt;/a&gt; and look at the relevant line.&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;server_hostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;InetAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getLocalHost&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getCanonicalHostName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toLowerCase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now we know that the &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; variable in a Kerberos principal is replaced with the output of the &lt;code class=&quot;highlighter-rouge&quot;&gt;getCanonicalHostName()&lt;/code&gt; function (implemented in the standard library class &lt;code class=&quot;highlighter-rouge&quot;&gt;java.net.InetAddress&lt;/code&gt;) when Ambari autogenerates principals.&lt;/p&gt;

&lt;h2 id=&quot;testing-the-hostname&quot;&gt;Testing the hostname&lt;/h2&gt;

&lt;p&gt;Let’s put the pieces together and write a small Java program that prints the hostname using the &lt;code class=&quot;highlighter-rouge&quot;&gt;getCanonicalHostName()&lt;/code&gt; function.&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.net.InetAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;java.net.UnknownHostException&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PrintHostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;

  &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;server_hostname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;server_hostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;InetAddress&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getLocalHost&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;getCanonicalHostName&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;toLowerCase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;();&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;catch&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UnknownHostException&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Could not find canonical hostname&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
      &lt;span class=&quot;n&quot;&gt;server_hostname&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;localhost&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;System&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;println&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;server_hostname&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can compile and run the program with the following commands:&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;javac PrintHostname.java
java PrintHostname
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Finally, we can test the two versions of &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1   sandbox.chaosmail.at
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the above hosts file, the PrintHostname program prints &lt;code class=&quot;highlighter-rouge&quot;&gt;localhost&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;127.0.0.1   sandbox.chaosmail.at
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the above hosts file, the PrintHostname program prints &lt;code class=&quot;highlighter-rouge&quot;&gt;sandbox.chaosmail.at&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;resolving-_host-in-kerberos-principals&quot;&gt;Resolving &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; in Kerberos principals&lt;/h2&gt;

&lt;p&gt;Finally, we can be sure that &lt;code class=&quot;highlighter-rouge&quot;&gt;getCanonicalHostName()&lt;/code&gt; returns &lt;code class=&quot;highlighter-rouge&quot;&gt;sandbox.chaosmail.at&lt;/code&gt; and hence that the &lt;code class=&quot;highlighter-rouge&quot;&gt;_HOST&lt;/code&gt; variable will resolve to &lt;code class=&quot;highlighter-rouge&quot;&gt;sandbox.chaosmail.at&lt;/code&gt;. This means that all principals generated by Ambari will contain the proper hostname and therefore be valid.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://de.hortonworks.com/products/sandbox/&quot;&gt;Hortonworks HDP 2.5&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/apache/ambari&quot;&gt;Ambari Github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Sun, 05 Mar 2017 00:30:00 +0000</pubDate>
        <link>https://chaosmail.github.io//hadoop/2017/03/05/Resolving-host-in-kerberized-hdp-sandbox/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//hadoop/2017/03/05/Resolving-host-in-kerberized-hdp-sandbox/</guid>
        
        <category>hdp</category>
        
        <category>hadoop</category>
        
        <category>kerberos</category>
        
        
        
      </item>
    
      <item>
        <title>Intro to Deep Learning for Computer Vision</title>
<description>&lt;p&gt;The field of Deep Learning (DL) is growing rapidly and has been surpassing traditional approaches to machine learning and pattern recognition since 2012 by 10%-20% in accuracy. This blog post gives an introduction to DL and its applications in computer vision, with a focus on understanding state-of-the-art architectures such as AlexNet, GoogLeNet, VGG and ResNet, and methodologies such as classification, localization, segmentation, detection and recognition. It is based on a &lt;a href=&quot;http://www.slideshare.net/ChristophKrner/intro-to-deep-learning-for-computer-vision&quot;&gt;presentation&lt;/a&gt; that I held in the Seminar of Computer Graphics at Vienna University of Technology.&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;The first steps towards artificial intelligence and machine learning were made in 1958 by Rosenblatt, a psychologist who developed a simplified mathematical model mimicking the behavior of neurons in the human brain - the &lt;a href=&quot;#Rosenblatt57&quot;&gt;&lt;em&gt;Perceptron&lt;/em&gt;&lt;/a&gt;. Given a set of training data, it approximates a linear function by iteratively updating its weights according to the output of a simple FeedForward (FF) pass. By combining multiple Perceptrons (also called &lt;em&gt;Neurons&lt;/em&gt; or &lt;em&gt;Units&lt;/em&gt;) into a network with one input and one output layer, Rosenblatt built a system that could correctly classify simple geometric shapes. However, such a network is unable to approximate the nonlinear XOR function and does not allow &lt;em&gt;hidden&lt;/em&gt; layers due to its simple update rule. These limitations, as well as hardware restrictions, led to vanishing public interest during the following decades.&lt;/p&gt;

&lt;p&gt;The second important novelty for Artificial Neural Networks (ANNs) was the formal introduction of &lt;em&gt;BackPropagation&lt;/em&gt; (BP) in the early 1960s, a concept for computing a derivative for all neurons in a network using the chain rule and propagating the error from the output layer back through the different layers to the input layer. This method makes it possible to optimize the weights of a network with optimization techniques such as stochastic gradient descent. It was first applied to NNs by &lt;a href=&quot;#Werbos74&quot;&gt;Werbos&lt;/a&gt; in 1974 but, due to the lack of interest in NNs, only published in 1982. It finally attracted the interest of researchers such as LeCun et al. and &lt;a href=&quot;#Rumelhart86&quot;&gt;Rumelhart et al.&lt;/a&gt;, who further improved the concept of BP for supervised learning in NNs in 1986. In 1989, Hornik et al. proved that NNs with a single hidden layer can approximate any continuous function and hence also solve the XOR problem.&lt;/p&gt;

&lt;p&gt;In a first practical application in 1989, LeCun demonstrated the recognition of handwritten ZIP codes using a convolution operation instead of a hidden layer, together with a subsampling operator (also called &lt;em&gt;Pooling&lt;/em&gt;). Thanks to these convolutions, a single neuron in the second layer can learn what a complete layer learns in a non-convolutional network, which leads to much more efficient training due to the reduced number of parameters: a layer with 256 3x3 filters has 2,304 parameters, whereas a fully connected layer of 256 units following another layer of 256 units has 65,536 parameters. Convolutional layers and pooling layers together build the basis for &lt;em&gt;Convolutional Neural Networks&lt;/em&gt; (CNNs).&lt;/p&gt;

&lt;p&gt;NNs advanced in many domains, such as unsupervised learning with AutoEncoders (AEs) and Self-Organizing Maps (SOMs), as well as reinforcement learning, especially in control systems and robotics. New models such as Belief Networks, Time-Delay Neural Networks (TDNNs) for audio processing and Recurrent Neural Networks (RNNs) for speech recognition were implemented. However, in the late 1980s, multi-layer NNs were still difficult to train with BP due to the vanishing or exploding gradient problem. In the 1990s, new methods such as Support Vector Machines (SVMs) and Random Forests (RFs) were considered better suited for supervised learning than NNs due to their much simpler mathematical constructs.&lt;/p&gt;

&lt;p&gt;Almost two decades later, Hinton et al. showed that &lt;em&gt;Deep Neural Networks&lt;/em&gt; (DNNs) can be trained if the weights are initialized better than randomly, and rebranded the domain of multi-layer NNs as &lt;em&gt;Deep Learning&lt;/em&gt; (DL). Leveraging the parallelization power of GPUs yielded a millionfold speedup in training compared to the 1980s and a factor-of-70 speedup compared to common CPUs in 2007. In 2011, DNNs broke a 10-year-old state-of-the-art record in speech recognition, trained on 1000 times more data than was available in the 1980s. &lt;a href=&quot;#Glorot10&quot;&gt;Glorot&lt;/a&gt;, LeCun and Hinton further studied the necessity of weight initialization and proposed a much simpler activation function &lt;script type=&quot;math/tex&quot;&gt;f(x) = max(0, x)&lt;/script&gt; - the so-called Rectified Linear Unit (&lt;em&gt;ReLU&lt;/em&gt;) - for more stable BP.&lt;/p&gt;

&lt;p&gt;Since 2012, DNNs have been winning the classification, detection, localization and segmentation tasks of the ImageNet competitions (&lt;a href=&quot;#Schmidhuber14&quot;&gt;Schmidhuber&lt;/a&gt;) and have outperformed all methods based on hand-engineered features with almost &lt;script type=&quot;math/tex&quot;&gt;10&lt;/script&gt;% higher accuracy (&lt;a href=&quot;#Krizhevsky12&quot;&gt;Krizhevsky&lt;/a&gt;). The winning model of the 2015 ImageNet competition is &lt;a href=&quot;#He15&quot;&gt;ResNet-152 from Microsoft&lt;/a&gt;, a DNN with residual mappings and 152 layers, which achieved on average &lt;script type=&quot;math/tex&quot;&gt;16.5&lt;/script&gt;% higher accuracy than the runner-up and &lt;a href=&quot;#He15b&quot;&gt;surpassed human accuracy in classification&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;from-neural-networks-to-deep-learning&quot;&gt;From Neural Networks to Deep Learning&lt;/h2&gt;

&lt;p&gt;In Deep Learning, deeply nested Convolutional Neural Networks with more than a million parameters are trained on more than a million training samples using BP and optimization. Models are tuned via the filter arrangement and dimensions (&lt;a href=&quot;#Simonyan14&quot;&gt;VGGNet&lt;/a&gt;, etc.), module structures that reduce the number of parameters (&lt;a href=&quot;#Lin13&quot;&gt;Network in Network&lt;/a&gt;, &lt;a href=&quot;#Szegedy14&quot;&gt;GoogLeNet&lt;/a&gt;, etc.), techniques for preventing overfitting (&lt;a href=&quot;#Ioffe15&quot;&gt;batch normalization&lt;/a&gt;, &lt;a href=&quot;#Srivastava14&quot;&gt;Dropout&lt;/a&gt;, etc.), new initializations and non-linearities (&lt;a href=&quot;#Saxe13&quot;&gt;Xavier initialization&lt;/a&gt;, Leaky ReLU, etc.), better optimization techniques (&lt;a href=&quot;#Tieleman13&quot;&gt;RMSProp&lt;/a&gt;, etc.), new layer types and connections (&lt;a href=&quot;#He15b&quot;&gt;ResNet&lt;/a&gt;, etc.), as well as compositions of different structures (image captioning, deconvolution for visualization, etc.).&lt;/p&gt;

&lt;h3 id=&quot;neural-networks&quot;&gt;Neural Networks&lt;/h3&gt;

&lt;p&gt;Neural Networks date back to Rosenblatt’s work on the &lt;a href=&quot;#Rosenblatt57&quot;&gt;Perceptron&lt;/a&gt; in the late 1950s, which is one of the building blocks of modern DNNs. These networks are constructed out of hidden layers (fully connected layers) with multiple Perceptrons per layer.&lt;/p&gt;

&lt;h4 id=&quot;the-perceptron---gated-linear-regression&quot;&gt;The Perceptron - Gated Linear Regression&lt;/h4&gt;

&lt;p&gt;The Perceptron aims to recreate the behavior of a neuron in the human brain by combining different inputs into a weighted sum and triggering a signal once a threshold is reached. The more important an input is, the higher its weight. The subsequent figure shows the inputs &lt;script type=&quot;math/tex&quot;&gt;\vec{x}={x_1, x_2, x_3, ...}&lt;/script&gt;, the weight &lt;script type=&quot;math/tex&quot;&gt;w_i&lt;/script&gt; for each input and a bias term &lt;script type=&quot;math/tex&quot;&gt;\theta&lt;/script&gt;; the threshold (or non-linearity) is modeled as a step function &lt;script type=&quot;math/tex&quot;&gt;h&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/perceptron.png&quot; alt=&quot;Perceptron&quot; title=&quot;Perceptron&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Perceptron (Source: &lt;a href=&quot;https://github.com/cdipaolo/goml/tree/master/perceptron&quot;&gt;github.com/cdipaolo&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The output of the Perceptron can be written as&lt;/p&gt;

&lt;script type=&quot;math/tex; mode=display&quot;&gt;\vec{y} = h(W \cdot \vec{x} + \theta)&lt;/script&gt;

&lt;p&gt;and resembles a simple linear regression with a threshold gate. While the Perceptron can be used as a building block for a universal function approximator, the discontinuity of the step function leads to problems when computing the gradient.&lt;/p&gt;
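
&lt;p&gt;As a rough sketch (the input values and weights below are made up for illustration), the FeedForward pass of a single Perceptron fits in a few lines of Java:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;double[] x = {1.0, 0.5, -0.3};  // inputs
double[] w = {0.4, -0.2, 0.7};  // weights
double theta = 0.1;             // bias term

// weighted sum of all inputs plus the bias
double sum = theta;
for (int i = 0; i &lt; x.length; i++) {
  sum += w[i] * x[i];
}

// step function h: trigger a signal once the threshold is reached
int y = (sum &gt;= 0) ? 1 : 0;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;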

&lt;h4 id=&quot;non-linearities-activation-functions&quot;&gt;Non-linearities (Activation Functions)&lt;/h4&gt;

&lt;p&gt;Because of this discontinuity in the step function, other non-linear activation functions have been proposed and successfully used in NNs.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Rectified Linear Unit&lt;/em&gt; (ReLU) is the most widely used non-linearity in DNNs and is computed as &lt;script type=&quot;math/tex&quot;&gt;y = max(0, x)&lt;/script&gt;. While not differentiable at 0, the ReLU function greatly improves the learning process of the network because the total gradient is simply passed through to the input whenever the unit is active.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Leaky Rectified Linear Unit&lt;/em&gt; (Leaky ReLU) addresses the fact that the original ReLU outputs 0 for negative inputs and hence does not propagate any gradient there (a crucial consideration for initialization); it uses a small non-zero slope for negative inputs instead.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Sigmoid&lt;/em&gt; non-linearities are commonly used in the final output layer of regression networks, as the outputs are bounded between &lt;script type=&quot;math/tex&quot;&gt;0&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;1&lt;/script&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Softmax&lt;/em&gt; non-linearities are commonly used in the final output layer of classification networks, as the outputs sum to &lt;script type=&quot;math/tex&quot;&gt;1&lt;/script&gt; - similar to a probability distribution.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Hyperbolic tangent&lt;/em&gt; (tanh) non-linearities are often used as a replacement for the sigmoid function; their outputs are centered around &lt;script type=&quot;math/tex&quot;&gt;0&lt;/script&gt; and bounded between &lt;script type=&quot;math/tex&quot;&gt;-1&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;1&lt;/script&gt;, which often leads to slightly better training results.&lt;/li&gt;
&lt;/ul&gt;
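
&lt;p&gt;These non-linearities are essentially one-liners; here is a minimal sketch in Java (Softmax is shown for a whole vector, as it normalizes across all outputs):&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;static double relu(double x)      { return Math.max(0, x); }
static double leakyRelu(double x) { return x &gt; 0 ? x : 0.01 * x; }  // small slope for x below 0
static double sigmoid(double x)   { return 1.0 / (1.0 + Math.exp(-x)); }
static double tanh(double x)      { return Math.tanh(x); }

// Softmax turns a vector of activations into values that sum to 1
static double[] softmax(double[] x) {
  double sum = 0;
  double[] y = new double[x.length];
  for (double v : x) { sum += Math.exp(v); }
  for (int i = 0; i &lt; x.length; i++) { y[i] = Math.exp(x[i]) / sum; }
  return y;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;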

&lt;h3 id=&quot;deep-neural-networks&quot;&gt;Deep Neural Networks&lt;/h3&gt;

&lt;p&gt;DL models are constructed mostly out of &lt;em&gt;Convolutional&lt;/em&gt; (Conv) and &lt;em&gt;Pooling&lt;/em&gt; (Pool) layers as they have been used by &lt;a href=&quot;#LeCun90&quot;&gt;LeCun&lt;/a&gt; in the late 1980s as shown in the following figure.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/lenet.png&quot; alt=&quot;Architecture of LeNet&quot; title=&quot;Architecture of LeNet&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Architecture of &lt;a href=&quot;#LeCun90&quot;&gt;LeNet&lt;/a&gt;&lt;/p&gt;

&lt;h4 id=&quot;convolutions&quot;&gt;Convolutions&lt;/h4&gt;

&lt;p&gt;A Conv layer consists of spatial filters that are convolved along the spatial dimensions and summed up along the depth dimension of the input volume. In general, one starts with a large filter size (e.g. 11x11) and a low depth (e.g. 32) and then reduces the spatial filter dimensions (e.g. to 3x3) while increasing the depth (e.g. to 256) throughout the network. Due to weight sharing, Conv layers are much more efficient than fully connected layers. A Conv layer has &lt;script type=&quot;math/tex&quot;&gt;w \cdot h \cdot d \cdot n_f&lt;/script&gt; parameters without bias (&lt;script type=&quot;math/tex&quot;&gt;w&lt;/script&gt; .. width of the filter, &lt;script type=&quot;math/tex&quot;&gt;h&lt;/script&gt; .. height of the filter, &lt;script type=&quot;math/tex&quot;&gt;d&lt;/script&gt; .. depth of the filter, &lt;script type=&quot;math/tex&quot;&gt;n_f&lt;/script&gt; .. number of filters) that need to be learned during training.&lt;/p&gt;
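
&lt;p&gt;With this formula we can verify the parameter counts mentioned in the introduction (256 3x3 filters on a depth-1 input versus a fully connected layer of 256 units following 256 units):&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Conv layer: w * h * d * n_f parameters (without bias)
int convParams = 3 * 3 * 1 * 256;  // = 2304

// Fully connected layer: outputs of the previous layer * neurons
int fcParams = 256 * 256;          // = 65536
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;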

&lt;h4 id=&quot;pooling&quot;&gt;Pooling&lt;/h4&gt;

&lt;p&gt;Conv layers are often followed by a Pool layer in order to reduce the spatial dimension of the volume for the next filter - this is the equivalent of a subsampling operation. The pooling operation itself has no learnable parameters.&lt;/p&gt;

&lt;p&gt;Most of the time, &lt;script type=&quot;math/tex&quot;&gt;max&lt;/script&gt; pooling layers are used in DL models due to their easy gradient computation: during BP, the gradient only flows towards the single max activation, which can be computed very efficiently. A few other architectures use &lt;script type=&quot;math/tex&quot;&gt;avg&lt;/script&gt; pooling, mostly at the end of the network before the fully connected layers, without a noticeable difference in performance.&lt;/p&gt;

&lt;h4 id=&quot;normalization&quot;&gt;Normalization&lt;/h4&gt;

&lt;p&gt;In modern (post-sigmoid) DNNs, &lt;em&gt;Normalization&lt;/em&gt; is necessary for stable gradients throughout the network. Due to the unbounded behavior of the ReLU activations (&lt;script type=&quot;math/tex&quot;&gt;y = max(0, x)&lt;/script&gt;), filter responses have to be normalized. Usually this is done per batch using &lt;a href=&quot;#Ioffe15&quot;&gt;batch normalization&lt;/a&gt; or locally using a Local Response Normalization layer.&lt;/p&gt;

&lt;h4 id=&quot;fully-connected-layer&quot;&gt;Fully Connected Layer&lt;/h4&gt;

&lt;p&gt;The FC layer works exactly as described in the previous section - it connects every output of the previous layer to each of its neurons. Usually, FC layers are used at the end of the network to combine all spatially distributed activations of the previous Conv layers. The FC layers hold the highest number of parameters in the model (almost 90%), namely &lt;script type=&quot;math/tex&quot;&gt;n_i \cdot n_n&lt;/script&gt;, where &lt;script type=&quot;math/tex&quot;&gt;n_i&lt;/script&gt; is the number of outputs of the previous layer and &lt;script type=&quot;math/tex&quot;&gt;n_n&lt;/script&gt; is the number of neurons; most computing time, however, is spent in the early Conv layers.&lt;/p&gt;

&lt;h3 id=&quot;final-output-layer&quot;&gt;Final Output Layer&lt;/h3&gt;

&lt;p&gt;The final output layer of a DNN plays a crucial role for the task of the whole network. Common choices are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Classification&lt;/em&gt;: Softmax layer, computes values &lt;script type=&quot;math/tex&quot;&gt;y_{i} \in [0, 1]&lt;/script&gt; such that &lt;script type=&quot;math/tex&quot;&gt;\sum_i y_{i} = 1&lt;/script&gt; - each output can be interpreted as the probability that the input belongs to a certain class&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Regression&lt;/em&gt;: Sigmoid layer, predicts values &lt;script type=&quot;math/tex&quot;&gt;y_{ij} \in [0, 1]&lt;/script&gt; for an output with &lt;script type=&quot;math/tex&quot;&gt;j&lt;/script&gt; dimensions&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Regression and classification&lt;/em&gt;: Both tasks can be combined by connecting two output layers and outputting both values at once. This is used for object detection with a fixed number of objects, e.g. one regression output per class&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Encoder&lt;/em&gt;: Stop at the fully connected layer and use it as a feature space for clustering, SVMs, post-processing, etc.&lt;/li&gt;
&lt;/ul&gt;
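&lt;p&gt;The softmax computation mentioned above can be sketched in a few lines; subtracting the maximum is a standard numerical-stability trick, and the toy scores are made up:&lt;/p&gt;

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to shifting all inputs by a constant.
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
# probs sums to 1 and each entry lies in [0, 1]
```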

&lt;p&gt;After defining a final output layer, one also needs to define a loss function for the given task. Picking the right loss is crucial for training a DNN; common choices are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Classification&lt;/em&gt;: Cross-entropy, computes the cross-entropy between the output of the network and the ground truth label and can be used for binary and categorical outputs (via one-hot encoding)&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Regression&lt;/em&gt;: Squared error and mean squared error are common choices for regression problems&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Segmentation&lt;/em&gt;: Intersection over union is a loss function well suited for comparing overlapping regions of an image and a prediction - however, it is not well suited for DNNs because it returns 0 for non-overlapping regions; MSE is a better choice&lt;/li&gt;
&lt;/ul&gt;
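&lt;p&gt;A minimal sketch of the two loss functions named above, with made-up predictions and a one-hot label:&lt;/p&gt;

```python
import numpy as np

def cross_entropy(p, y_onehot, eps=1e-12):
    # p: predicted class probabilities (e.g. a softmax output),
    # y_onehot: one-hot encoded ground-truth label.
    return -np.sum(y_onehot * np.log(p + eps))

def mse(pred, target):
    # Mean squared error for regression outputs
    return np.mean((pred - target) ** 2)

p = np.array([0.7, 0.2, 0.1])
y = np.array([1.0, 0.0, 0.0])   # correct class is class 0
loss = cross_entropy(p, y)      # -log(0.7), about 0.357
```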

&lt;p&gt;The loss function can also be extended with a regularization term to constrain the parameters of the DNN. Both &lt;em&gt;L1&lt;/em&gt; and &lt;em&gt;L2&lt;/em&gt; regularization on the filter matrices &lt;script type=&quot;math/tex&quot;&gt;W_i&lt;/script&gt; are commonly used.&lt;/p&gt;
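&lt;p&gt;The L2 regularization term can be sketched as a simple additive penalty over all filter matrices; the toy matrices and the strength &lt;code&gt;lam&lt;/code&gt; below are illustrative values, not recommendations:&lt;/p&gt;

```python
import numpy as np

def l2_penalty(weights, lam=1e-4):
    # Sum of squared entries over all filter matrices W_i,
    # scaled by the regularization strength lam; added to the task loss.
    return lam * sum(np.sum(W ** 2) for W in weights)

W1 = np.ones((2, 2))   # toy filter matrices
W2 = np.ones((3, 1))
penalty = l2_penalty([W1, W2], lam=0.5)  # 0.5 * (4 + 3) = 3.5
```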

&lt;h2 id=&quot;deep-learning-architectures&quot;&gt;Deep Learning Architectures&lt;/h2&gt;

&lt;p&gt;This section describes state-of-the-art DNN architectures, common parameterizations and structures in DL.&lt;/p&gt;

&lt;h3 id=&quot;convolutional-neural-networks-cnn&quot;&gt;Convolutional Neural Networks (CNN)&lt;/h3&gt;

&lt;p&gt;A CNN is a neural network model that contains (multiple) convolutional layers (each followed by a non-linear activation function) and additional pooling layers in the earlier part of the network.&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;convolution layer&lt;/em&gt; extracts image features by convolving the input with multiple filters. It contains a set of 2-dimensional filters that are stacked into a 3-dimensional volume, where each filter is applied to a volume constructed from all filter responses of the previous layer. If one considers the RGB channels of a 256x256 input image as a 256x256x3 volume, a 5x5x3 filter is applied to a 5x5 2-dimensional region of the image and summed up across all 3 color channels. If the first layer after the RGB volume consists of 48 filters, it is represented as a volume of 5x5x3x48 weight parameters and 48 bias parameters. Using a convolution operation on the input volume and the filter volumes, the filter responses (so-called activations) form an output volume with the dimensions 252x252x48 (using stride 1 and no padding). By padding the input layer with 0s, one can keep the spatial dimensions of the activations constant throughout the layers.&lt;/p&gt;

&lt;p&gt;Each convolution layer is followed by a nonlinear &lt;em&gt;activation function&lt;/em&gt; (in DL mostly ReLU layers are used) which results in an activation with the same dimensions as the output volume of the previous convolution layer.&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;pooling layer&lt;/em&gt; subsamples the previous layer and outputs a volume of the same depth but reduced spatial dimensions. Similar to a Gaussian pyramid, pooling helps filters activate on features in the image at a different scale. Using a 2x2 max-pooling filter with stride 2 (the filter is shifted by 2 pixels on every iteration), a 252x252x48 activation volume (from a 5x5 convolution on the 256x256 input with stride 1 and no padding) is reduced to 126x126x48. In general, pooling layers are used at the beginning of the network and at the end to better control the dimensions of the activations right before the fully-connected layers.&lt;/p&gt;
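&lt;p&gt;The spatial dimensions in the example above follow from the standard output-size formula &lt;script type=&quot;math/tex&quot;&gt;\lfloor (n - f + 2p) / s \rfloor + 1&lt;/script&gt;, sketched here for the 256x256 case:&lt;/p&gt;

```python
def conv_out(n, f, stride=1, pad=0):
    # Spatial output size of a convolution or pooling layer:
    # floor((n - f + 2*pad) / stride) + 1
    return (n - f + 2 * pad) // stride + 1

# 256x256 input, 5x5 filters, stride 1, no padding
after_conv = conv_out(256, 5)            # 252
# 2x2 max pooling with stride 2
after_pool = conv_out(after_conv, 2, 2)  # 126
# Padding with 2 zeros keeps the spatial size constant for a 5x5 filter
same = conv_out(256, 5, 1, 2)            # 256
```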

&lt;h3 id=&quot;alexnet-2012&quot;&gt;AlexNet (2012)&lt;/h3&gt;

&lt;p&gt;The following figure shows the architecture of &lt;a href=&quot;#Krizhevsky12&quot;&gt;AlexNet&lt;/a&gt;, the winning model for classification in the ImageNet competition in 2012. As shown in the figure, it consists of 2 parallel sets of layers with both convolution and max pooling layers. The model was arranged like this because 2 graphic cards were used in parallel for training (training AlexNet on the ImageNet dataset took 2 weeks on this setup). The filters start with spatial dimensions of 5x5 and a depth of 55 (1,375 weights without bias) and go to a 3x3x192 filter volume (1,728 weights without bias) towards the end of the network; hence, the filter size decreases while the filter depth increases per layer. The end of the network consists of 2 fully connected layers with 2048 nodes each (4,194,304 weights without bias) and an output layer with 1000 nodes (2,048,000 weights without bias), according to the 1000 classes in the ImageNet dataset. Hence, most memory is used for the weights in the final fully connected layers; the whole model requires about 240MB of memory in total. The fully connected layers at the end are needed to set the spatial filter responses in relation to each other for the resulting class prediction.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/alexnet.png&quot; alt=&quot;Architecture of AlexNet&quot; title=&quot;Architecture of AlexNet&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Architecture of &lt;a href=&quot;#Krizhevsky12&quot;&gt;AlexNet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The winner of the ImageNet classification task in 2013 was a tuned version of AlexNet using different initialization and optimization.&lt;/p&gt;

&lt;h3 id=&quot;vggnet-2014&quot;&gt;VGGNet (2014)&lt;/h3&gt;

&lt;p&gt;The VGGNet model was placed second in the ImageNet competition in 2014 by using only 3x3 filters throughout the whole network. Many different layer depths were tested, with the resulting insight that &lt;a href=&quot;#Simonyan14&quot;&gt;&lt;em&gt;deeper is always better&lt;/em&gt;&lt;/a&gt;. This statement holds only as long as the deeper model can still be trained without an exploding or vanishing gradient throughout the network.&lt;/p&gt;

&lt;p&gt;As we can see in the next figure, VGG-19 achieves very good results on the ImageNet 2012 classification dataset while having a poor accuracy-to-parameter ratio. Storing only the parameters of VGG-19 requires more than 500MB of memory.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/top1.png&quot; alt=&quot;Accuracy&quot; title=&quot;Accuracy&quot; class=&quot;image-col-2&quot; /&gt;
&lt;img src=&quot;/images/deep-learning/top1-param.png&quot; alt=&quot;Accuracy per parameter&quot; title=&quot;Accuracy per parameter&quot; class=&quot;image-col-2&quot; /&gt;&lt;/p&gt;

&lt;p&gt;DL models Top-1 classification accuracy vs. accuracy per parameter (Source: &lt;a href=&quot;#Canziani16&quot;&gt;Canziani&lt;/a&gt;)&lt;/p&gt;

&lt;h3 id=&quot;googlenet-2014&quot;&gt;GoogLeNet (2014)&lt;/h3&gt;

&lt;p&gt;The winning model in the ImageNet competition in 2014 was &lt;a href=&quot;#Szegedy14&quot;&gt;GoogLeNet&lt;/a&gt;, which is even deeper than the previously discussed VGGNet. However, it uses only one tenth of the number of parameters of AlexNet, thanks to an architecture built from 9 parallel modules, the inception modules. As shown in the next figure, this module uses 1x1 convolutions (so-called bottleneck convolutions) to sum over the depth dimension of the previous layers while keeping the spatial dimensions of the volume. These 1x1 convolutions allow one to control/reduce the depth dimension, which greatly reduces the number of parameters by removing the redundancy of correlated filters. They also make it possible to learn a set of features across 1x1, 3x3 and 5x5 spatial dimensions in parallel, mixed with a pooling of the original volume.&lt;/p&gt;
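&lt;p&gt;A back-of-the-envelope sketch of the parameter saving from a 1x1 bottleneck; the depths and filter counts below are hypothetical, not the actual GoogLeNet configuration:&lt;/p&gt;

```python
def conv_params(f, in_depth, n_filters):
    # Weight count of a conv layer with f x f filters (biases ignored)
    return f * f * in_depth * n_filters

# Hypothetical branch: a 5x5 convolution directly on a 256-deep input
direct = conv_params(5, 256, 64)                               # 409600 weights
# Same branch with a 1x1 bottleneck first reducing the depth to 32
bottleneck = conv_params(1, 256, 32) + conv_params(5, 32, 64)  # 59392 weights
```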

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/inception.png&quot; alt=&quot;GoogLeNet&quot; title=&quot;GoogLeNet&quot; class=&quot;image-col-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Inception module (Source &lt;a href=&quot;http://vision.princeton.edu/courses/COS598/2015sp/slides/GoogLeNet/2014-10-17_dlrg.pdf&quot;&gt;COS598 (Princeton) lecture slides&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/alexnetCustom.png&quot; alt=&quot;AlexNet&quot; title=&quot;AlexNet&quot; class=&quot;image-col-3&quot; /&gt;
&lt;img src=&quot;/images/deep-learning/vggCustom.png&quot; alt=&quot;VGG&quot; title=&quot;VGG&quot; class=&quot;image-col-3&quot; /&gt;
&lt;img src=&quot;/images/deep-learning/googlenetCustom.png&quot; alt=&quot;GoogLeNet&quot; title=&quot;GoogLeNet&quot; class=&quot;image-col-3&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Deep Learning models 2012-2014 (Source: &lt;a href=&quot;https://chaosmail.github.io/caffejs/models.html&quot;&gt;CaffeJS&lt;/a&gt;)&lt;/p&gt;

&lt;h3 id=&quot;deep-residual-networks-2015&quot;&gt;Deep Residual Networks (2015)&lt;/h3&gt;

&lt;p&gt;Microsoft’s residual network ResNet, the winner of ImageNet 2015 and the deepest network so far with 152 layers, achieved a top-5 classification error of 4.9% (which is slightly better than human accuracy, Source: Karpathy). By introducing residual connections (skip connections) between an input and the filter activation (as shown in the following figure), the network can learn incremental changes instead of a completely new behavior. This concept is similar to the parallel inception modules, except that only one filter learns the incremental change between an input volume and its activation.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/residual.png&quot; alt=&quot;Residual connections&quot; title=&quot;Residual connections&quot; class=&quot;image-col-1&quot; /&gt;
Residual connections (Source: &lt;a href=&quot;http://stats.stackexchange.com/questions/56950/neural-network-with-skip-layer-connections&quot;&gt;stackexchange.com&lt;/a&gt;)&lt;/p&gt;
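&lt;p&gt;The residual idea can be sketched in one line: the block outputs the input plus a learned residual function, so a branch that outputs zeros reduces the block to the identity. This is a minimal illustration, not the full ResNet block with convolutions and batch normalization:&lt;/p&gt;

```python
import numpy as np

def residual_block(x, f):
    # The branch f only has to learn the residual f(x); the identity
    # path carries x through unchanged (shapes must match).
    return x + f(x)

x = np.array([1.0, 2.0, 3.0])
# If the residual branch outputs zeros, the block is exactly the identity
y = residual_block(x, lambda v: np.zeros_like(v))
```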

&lt;h2 id=&quot;applications-in-computer-vision&quot;&gt;Applications in Computer Vision&lt;/h2&gt;

&lt;p&gt;DNNs perform 10%-20% better in accuracy than methods using hand-engineered feature extraction in most computer vision tasks, given enough training images and computational resources (&lt;a href=&quot;#Szegedy13&quot;&gt;Szegedy&lt;/a&gt;, &lt;a href=&quot;#Long14&quot;&gt;Long&lt;/a&gt;). As shown in the next figure, the DL applications in computer vision differ in the number of objects in the image and in the output of the network. This section describes these applications and the DL approaches commonly used for them.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/localizationVsDetection.png&quot; alt=&quot;Difference between Classification, Localization, Detection and Segmentation&quot; title=&quot;Difference between Classification, Localization, Detection and Segmentation&quot; class=&quot;image-col-1&quot; /&gt;
Difference between Classification, Localization, Detection and Segmentation (Source: CS231n (Stanford) lecture slides)&lt;/p&gt;

&lt;h3 id=&quot;classification&quot;&gt;Classification&lt;/h3&gt;

&lt;p&gt;In a &lt;em&gt;classification&lt;/em&gt; task, a model has to predict the correct label of an image based on its contextual information, which is extracted from the pixel values. Usually, one image belongs to a single class. Given an input image, the NN computes a probability value for each class; the most likely class is the resulting class for the image. An example is shown in the following figure for 8 samples from the ImageNet dataset, where the 5 most likely classes (blue bars) are displayed per input image and the correct label is shown as a red bar (if it appears in these 5 classes). CNNs show a performance &lt;a href=&quot;#Krizhevsky12&quot;&gt;increase of 20%&lt;/a&gt; to &lt;a href=&quot;#Ciregan12&quot;&gt;34%&lt;/a&gt; in comparison to traditional models.&lt;/p&gt;

&lt;p&gt;DNNs for classification commonly use the “standard” architecture denoted as &lt;script type=&quot;math/tex&quot;&gt;IN \to [[CONV\to ReLU] \cdot N \to POOL?] \cdot M \to [FC \to ReLU] \cdot K \to FC&lt;/script&gt;, where input &lt;script type=&quot;math/tex&quot;&gt;IN&lt;/script&gt; is a single image and output &lt;script type=&quot;math/tex&quot;&gt;FC&lt;/script&gt; is a fully connected layer with &lt;script type=&quot;math/tex&quot;&gt;m&lt;/script&gt; units (one per class). The output of each unit in the last layer corresponds to the probability that the image belongs to the class &lt;script type=&quot;math/tex&quot;&gt;y&lt;/script&gt;. We can see that the network from 1989 already had a very similar structure (with the slight difference of using hyperbolic tangent non-linearities).&lt;/p&gt;
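&lt;p&gt;The layer pattern above can be expanded mechanically; this small helper is only an illustration of the notation, not a trainable network:&lt;/p&gt;

```python
def cnn_pattern(N=2, M=3, K=2, pool=True):
    # Expand IN -> [[CONV->RELU]*N -> POOL?]*M -> [FC->RELU]*K -> FC
    layers = ["IN"]
    for _ in range(M):
        layers += ["CONV", "RELU"] * N
        if pool:
            layers.append("POOL")
    layers += ["FC", "RELU"] * K
    layers.append("FC")
    return layers

print(" -> ".join(cnn_pattern(N=1, M=2, K=1)))
# IN -> CONV -> RELU -> POOL -> CONV -> RELU -> POOL -> FC -> RELU -> FC
```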

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/classification.png&quot; alt=&quot;Image Classification using the ImageNet dataset&quot; title=&quot;Image Classification using the ImageNet dataset&quot; class=&quot;image-col-1&quot; /&gt;
Image Classification using the ImageNet dataset (Source: &lt;a href=&quot;#Krizhevsky12&quot;&gt;Krizhevsky&lt;/a&gt;)&lt;/p&gt;

&lt;h3 id=&quot;localization&quot;&gt;Localization&lt;/h3&gt;

&lt;p&gt;In a localization task, the model needs to identify the position of an object in a sample image. Many tasks such as localization and detection can be reduced to a simple binary classification task. By sliding a window over the sample image, one can determine the probability of detecting an object at the current position of the window. After trying all possible window positions, the position of the object can be computed from the output. Due to the sliding window, the dimensions of the input images don’t have to be fixed. Despite its simplicity, this method requires &lt;script type=&quot;math/tex&quot;&gt;O(n^2)&lt;/script&gt; iterations and a complex network setup and hence is not used for localization in practice.&lt;/p&gt;

&lt;p&gt;In the simplest case with only one object in the image, the localization task can also be solved by predicting its bounding box coordinates. This is a regression problem and can be solved with a similar architecture as classification. DNNs for bounding box localization are mostly used for cropping raw input images as a preprocessing step for other applications (such as classification).&lt;/p&gt;

&lt;p&gt;A common architecture of DNNs to solve a localization task as regression uses a fully connected layer with &lt;script type=&quot;math/tex&quot;&gt;4&lt;/script&gt; units (the bounding box can be identified with the coordinates of its left-top corner, width and height) in the last layer followed by a sigmoid layer. The bounding box regression can be used in combination with classification, such that bounding box coordinates can be learned for each class individually. Both methods can be combined in one DNN with one classification and one regression output.&lt;/p&gt;
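&lt;p&gt;A toy sketch of such a 4-unit regression head: a linear layer followed by a sigmoid, so each coordinate lands in [0, 1] relative to the image dimensions. The feature vector and weights here are random placeholders, not trained values:&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bbox_head(features, W, b):
    # 4-unit output (x, y, width, height), squashed into (0, 1) by a
    # sigmoid so the box is expressed relative to the image size.
    return sigmoid(features @ W + b)

rng = np.random.default_rng(1)
feat = rng.random(16)                    # placeholder feature vector
W, b = rng.random((16, 4)), np.zeros(4)  # placeholder head parameters
box = bbox_head(feat, W, b)              # 4 values in (0, 1)
```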

&lt;p&gt;The bounding box approach can be applied when the exact number of objects in the image is known and the dimensions of the input images are fixed. The localization precision is limited by the rectangular geometry of the bounding box; however, other shapes such as skewed rectangles, circles, or parallelograms can also be used.&lt;/p&gt;

&lt;h3 id=&quot;object-detection-and-image-recognition&quot;&gt;Object Detection and Image Recognition&lt;/h3&gt;

&lt;p&gt;To understand the context of images, one has to not only classify an image, but also find multiple different object instances in an image and estimate their positions. This application of classifying and localizing multiple objects in an image is referred to as &lt;em&gt;Object Detection&lt;/em&gt;. Using DNNs for object detection has the advantage that they can implicitly learn complex object representations instead of manually deriving them from a kinematic model, such as a Deformable Part-based Model (DPM).&lt;/p&gt;

&lt;p&gt;A common approach to implement object detection is a binary mask regression for each object type combined with predicting the bounding box coordinates, which results in at least 1 model per object type. This method works on both the complete image and image patches as input, as long as the size of the input tensor is fixed. While this approach is simple and yields on average a 0.15% improvement in precision compared to &lt;a href=&quot;#Szegedy13&quot;&gt;traditional DPMs&lt;/a&gt;, it requires one DNN per object type. Another difficulty is overlapping objects of the same type, as these objects are not separable in the binary mask.&lt;/p&gt;

&lt;p&gt;The term &lt;em&gt;Image Recognition&lt;/em&gt; is used interchangeably with object detection, as the ability to detect and localize objects of different types (&lt;a href=&quot;#He15&quot;&gt;in a hierarchical structure&lt;/a&gt;) equals the ability to understand the &lt;a href=&quot;#Sermanet14&quot;&gt;current scene&lt;/a&gt; (see the following figure).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/recognition.png&quot; alt=&quot;Image Recognition through Object Detection&quot; title=&quot;Image Recognition through Object Detection&quot; class=&quot;image-col-1&quot; /&gt;
Image Recognition through Object Detection (Source: &lt;a href=&quot;#He15&quot;&gt;He&lt;/a&gt;)&lt;/p&gt;

&lt;h3 id=&quot;segmentation&quot;&gt;Segmentation&lt;/h3&gt;

&lt;p&gt;The application of &lt;em&gt;Segmentation&lt;/em&gt; is to partition parts of an image with pixel precision, hence predicting the corresponding segment for each pixel. Thus, the network needs to predict a class value for each input pixel. Convolutions and pooling both reduce the spatial dimensions of the activations throughout the network, which leads to the problem that the resulting activation is smaller than the input volume. Therefore, DNNs for segmentation need to implement an upscaling strategy in order to predict per pixel.&lt;/p&gt;

&lt;p&gt;To train a DNN for segmentation one can either input the whole image to the network and use a segmented image as ground truth (binary mask for foreground/background segmentation or &lt;a href=&quot;#Long14&quot;&gt;pixel mask&lt;/a&gt; displayed in the subsequent figure) or follow a patch based approach. Using pixelwise CNNs, one can achieve up to &lt;a href=&quot;#Long14&quot;&gt;20% relative improvement in accuracy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Similar to localization, segmentation can be turned into a classification problem when using image patches of a fixed size. Instead of sliding a window over all possible locations, one can extract patches only from salient regions or distinctive segments. This approach has the advantage that multiple patches can be stacked together as channels in the input layer to provide the network with positive and negative (or neutral) samples at the same time. This pairwise training can correct unbalanced class distributions and &lt;a href=&quot;#Long14&quot;&gt;optimize gradient computation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/segmentationMask.png&quot; alt=&quot;Image Segmentation using a Pixel Mask&quot; title=&quot;Image Segmentation using a Pixel Mask&quot; class=&quot;image-col-1&quot; /&gt;
Image Segmentation using a Pixel Mask (Source: &lt;a href=&quot;#Long14&quot;&gt;Long&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;As we can see in the previous figure, this upscaling process can be a gigantic fully-connected layer or an inverted DNN structure (called up-convolution or &lt;a href=&quot;#Noh15&quot;&gt;deconvolution structure&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/deep-learning/segmentationRaw.png&quot; alt=&quot;raw image&quot; title=&quot;raw image&quot; class=&quot;image-col-3&quot; /&gt;
&lt;img src=&quot;/images/deep-learning/segmentationSemantic.png&quot; alt=&quot;semantic segmentation&quot; title=&quot;semantic segmentation&quot; class=&quot;image-col-3&quot; /&gt;
&lt;img src=&quot;/images/deep-learning/segmentationInstance.png&quot; alt=&quot;instance segmentation&quot; title=&quot;instance segmentation&quot; class=&quot;image-col-3&quot; /&gt;
Semantic vs. Instance Segmentation&lt;/p&gt;

&lt;p&gt;The previous figure shows the 2 conceptual approaches to segmentation: semantic segmentation (middle image), where everything belonging to a matching class should be segmented, vs. instance segmentation (right image), where every instance of a class should be segmented separately.&lt;/p&gt;

&lt;p&gt;In semantic segmentation, multi-scale architectures are used that upsample outputs and combine the results with traditional &lt;a href=&quot;#Farabet13&quot;&gt;bottom-up segmentations&lt;/a&gt;. In a different approach, the complete upscaling process can be learned via an &lt;a href=&quot;#Long14&quot;&gt;inverted DNN structure&lt;/a&gt;. By adding skip connections from lower layers to higher upsampled levels, one can also learn incremental structures and perform local refinement on the downsampled image.&lt;/p&gt;

&lt;p&gt;Instance segmentation is often referred to as &lt;a href=&quot;#Hariharan14&quot;&gt;&lt;em&gt;simultaneous detection and segmentation&lt;/em&gt;&lt;/a&gt; and is a very challenging task. Most commonly, window functions are used to apply classification and segmentation to all possible sets of input patches. This wastes computational resources on regions that don’t contain any objects. Hence, so-called region proposal techniques have been introduced to optimize the selection of regions that can then be used as patch inputs for the segmentation. Fast R-CNN is a method using the &lt;a href=&quot;#Dai15&quot;&gt;ResNet architecture&lt;/a&gt; and implements a pipeline similar to object detection.&lt;/p&gt;

&lt;h3 id=&quot;image-encoding&quot;&gt;Image Encoding&lt;/h3&gt;

&lt;p&gt;A very common task of DNNs is image encoding, hence transforming an image from its original representation into a lower-dimensional feature space. This task is often performed implicitly, because the last fully-connected layer of each DNN learns this encoding automatically while being trained on a specific supervised task. At the end of the training, the last fully-connected layer contains a fixed-sized, low-dimensional numeric representation of the input image that can be used as input for conventional machine learning approaches such as SVMs, linear regression, etc.&lt;/p&gt;
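&lt;p&gt;A toy sketch of using the last fully-connected activation as an image encoding; the plain matrix multiplications stand in for the real conv stages, and all shapes and values are made up:&lt;/p&gt;

```python
import numpy as np

def forward_to_fc(x, stage_weights, fc_weight):
    # Hypothetical toy forward pass: after the earlier stages, the
    # activation of the last fully-connected layer is the image encoding.
    a = x
    for W in stage_weights:
        a = np.maximum(0, a @ W)     # linear stage + ReLU (stand-in for conv)
    return a @ fc_weight             # low-dimensional feature vector

rng = np.random.default_rng(0)
x = rng.random(64)                             # flattened "image"
stages = [rng.random((64, 32)), rng.random((32, 16))]
fc = rng.random((16, 8))
code = forward_to_fc(x, stages, fc)            # 8-dimensional encoding
# `code` can now feed an SVM, linear regression, clustering, etc.
```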

&lt;p&gt;Using up-convolutional structures, one can also implement unsupervised auto-encoding networks. However, due to the implicit learning through classification and the high computational complexity of up-convolutional networks, image encoders are mostly obtained by training on a supervised task such as classification.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Using the correct DNN architecture for an application of DL in computer vision requires knowledge of the specific problem and its limitations, such as a fixed number of objects in localization or fixed image dimensions in pixelwise segmentation. DL architectures consistently outperform classical hand-engineered approaches by 10% to 20% in accuracy, given enough training samples and computing power. However, in many DL applications such as detection and recognition, more than one model is required to predict all possible object types; training multiple models multiplies the training time and computation cost accordingly.&lt;/p&gt;

&lt;h2 id=&quot;resources&quot;&gt;Resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://archive.org/details/cs231n-CNNs&quot;&gt;CS231n (Stanford) Video Lectures&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://vision.stanford.edu/teaching/cs231n/&quot;&gt;CS231n (Stanford) Lecture Notes&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/1409.4842v1.pdf&quot;&gt;Going Deeper with Convolutions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://arxiv.org/pdf/1512.00567v3.pdf&quot;&gt;Rethinking the Inception Architecture for Computer Vision&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.slideshare.net/ChristophKrner/intro-to-deep-learning-for-computer-vision&quot;&gt;Slides from my talk at the University&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Rosenblatt57&quot;&gt;F. Rosenblatt, “The perceptron, a perceiving and recognizing automaton Project Para”, in &lt;em&gt;Cornell Aeronautical Laboratory&lt;/em&gt;, 1957.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Werbos74&quot;&gt;P. Werbos, “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences”, in &lt;em&gt;PhD thesis, Harvard University&lt;/em&gt;, Cambridge, 1974.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Rumelhart86&quot;&gt;D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors”, in &lt;em&gt;Nature&lt;/em&gt;, vol. 323,  pp. 533-536, 1986.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;LeCun90&quot;&gt;Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network”, in &lt;em&gt;Advances in Neural Information Processing Systems (NIPS 1989)&lt;/em&gt;, D. Touretzky (Ed.), vol. 2,  pp. 533-536, 1990.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Schmidhuber14&quot;&gt;J. Schmidhuber, “Deep Learning in Neural Networks: An Overview”, in &lt;em&gt;CoRR&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Goodfellow16&quot;&gt;I. Goodfellow, Y. Bengio and A. Courville, “Deep Learning”, &lt;em&gt;in preparation for MIT Press&lt;/em&gt;, 2016.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Glorot10&quot;&gt;X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks”, in &lt;em&gt;Proceedings of the International Conference on Artificial Intelligence and Statistics&lt;/em&gt;, 2010.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;He15&quot;&gt;K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;He15b&quot;&gt;K. He, X. Zhang, S. Ren and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Krizhevsky12&quot;&gt;A. Krizhevsky, I. Sutskever and G. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, in &lt;em&gt;Advances in Neural Information Processing Systems&lt;/em&gt;, 2012.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Radford15&quot;&gt;A. Radford, L. Metz and S. Chintala, “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Smirnov14&quot;&gt;E. Smirnov, D. Timoshenko and S. Andrianov, “Comparison of Regularization Methods for ImageNet Classification with Deep Convolutional Neural Networks”, in &lt;em&gt;AASRI Procedia&lt;/em&gt;, vol. 6, pp. 89-94, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Szegedy14&quot;&gt;C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, “Going Deeper with Convolutions”, &lt;em&gt;CoRR&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Pajdla14&quot;&gt;M. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks”, in &lt;em&gt;COMPUTER VISION – ECCV 2014&lt;/em&gt;, 1st ed., D. Fleet, T. Pajdla, B. Schiele and T. Tuytelaars, Ed. 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Zhou12&quot;&gt;S. Zhou, Q. Chen and X. Wang, “Convolutional Deep Networks for Visual Data Classification”, in &lt;em&gt;Neural Process Letters&lt;/em&gt;, vol. 38, no. 1, pp. 17-27, 2012.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Kingma15&quot;&gt;D.P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization”, in &lt;em&gt;The International Conference on Learning Representations (ICLR)&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Henaff16&quot;&gt;M. Henaff, A. Szlam, and Y. LeCun, “Orthogonal RNNs and Long-Memory Tasks”, &lt;em&gt;CoRR&lt;/em&gt;, 2016.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Szegedy13&quot;&gt;C. Szegedy, A. Toshev, and D. Erhan, “Deep Neural Networks for Object Detection”, in &lt;em&gt;Advances in Neural Information Processing Systems 26&lt;/em&gt;, C. Burges (Ed.), pp. 2553-2561, 2013.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Simonyan14&quot;&gt;K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, &lt;em&gt;CoRR&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Sermanet14&quot;&gt;P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks”, in &lt;em&gt;ICLR&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Long14&quot;&gt;J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation”, &lt;em&gt;CoRR&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Lin13&quot;&gt;M. Lin, Q. Chen and S. Yan, “Network In Network”, &lt;em&gt;CoRR&lt;/em&gt;, 2013.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Ioffe15&quot;&gt;S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Srivastava14&quot;&gt;N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, in &lt;em&gt;Journal of Machine Learning Research&lt;/em&gt;, vol. 15, pp. 1929-1958, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Li16&quot;&gt;F. Li, A. Karpathy, and J. Johnson, “Spatial Localization and Detection”, in &lt;em&gt;CS231n: Convolutional Neural Networks for Visual Recognition&lt;/em&gt;, lecture slides, p. 8, 2016.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Ciregan12&quot;&gt;D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification”, in &lt;em&gt;IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012&lt;/em&gt;, pp. 3642-3649, 2012.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Farabet13&quot;&gt;C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning Hierarchical Features for Scene Labeling”, in &lt;em&gt;IEEE Transactions on Pattern Analysis and Machine Intelligence&lt;/em&gt;, vol. 35, no. 8, pp. 1915-1929, 2013.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Noh15&quot;&gt;H. Noh, S. Hong, and B. Han, “Learning Deconvolution Network for Semantic Segmentation”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Hariharan14&quot;&gt;B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Simultaneous Detection and Segmentation”, in &lt;em&gt;European Conference on Computer Vision (ECCV)&lt;/em&gt;, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Dai15&quot;&gt;J. Dai, K. He, and J. Sun, “Instance-aware Semantic Segmentation via Multi-task Network Cascades”, &lt;em&gt;CoRR&lt;/em&gt;, 2015.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Saxe13&quot;&gt;A. M. Saxe, J. L. McClelland and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”, &lt;em&gt;CoRR&lt;/em&gt;, 2013.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Tieleman13&quot;&gt;T. Tieleman and G. Hinton, “RMSprop Gradient Optimization”, in &lt;em&gt;Neural Networks for Machine Learning&lt;/em&gt;, lecture slides, p. 29, 2014.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;span id=&quot;Canziani16&quot;&gt;A. Canziani, A. Paszke and E. Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications”, &lt;em&gt;CoRR&lt;/em&gt;, 2016.&lt;/span&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Sat, 22 Oct 2016 21:30:00 +0000</pubDate>
        <link>https://chaosmail.github.io//deeplearning/2016/10/22/intro-to-deep-learning-for-computer-vision/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//deeplearning/2016/10/22/intro-to-deep-learning-for-computer-vision/</guid>
        
        <category>deeplearning</category>
        
        <category>vision</category>
        
        
        
      </item>
    
      <item>
        <title>Data-driven Visualizations</title>
        <description>&lt;p&gt;I gave a talk about data-driven visualizations and D3.js at the last &lt;a href=&quot;http://www.meetup.com/de/viennajs/&quot;&gt;ViennaJS October Meetup&lt;/a&gt; (28.10.2015). Here are the links to the &lt;a href=&quot;https://community.leapmotion.com/t/tip-ubuntu-systemd-and-leapd/2118&quot;&gt;talk&lt;/a&gt; and the &lt;a href=&quot;https://docs.google.com/presentation/d/1-7xsVq5fNi5Z3PwQO6JF2cBpXYNz1eaKx6EqhRU4AA0&quot;&gt;slides&lt;/a&gt;.&lt;/p&gt;

</description>
        <pubDate>Tue, 24 Nov 2015 19:00:00 +0000</pubDate>
        <link>https://chaosmail.github.io//web/graphics/2015/11/24/data-driven-visualizations/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//web/graphics/2015/11/24/data-driven-visualizations/</guid>
        
        <category>talk</category>
        
        <category>d3js</category>
        
        
        <category>web</category>
        
        <category>graphics</category>
        
      </item>
    
      <item>
        <title>Installing Visual Studio Code on Ubuntu</title>
        <description>&lt;p&gt;&lt;a href=&quot;https://code.visualstudio.com/&quot;&gt;Visual Studio Code&lt;/a&gt; is an open source multi-platform IDE for web development (especially JavaScript and TypeScript) - enough reasons for me to check it out.&lt;/p&gt;

&lt;h2 id=&quot;installing&quot;&gt;Installing&lt;/h2&gt;

&lt;p&gt;Download the latest version from the &lt;a href=&quot;https://code.visualstudio.com/&quot;&gt;Visual Studio Code&lt;/a&gt; website (I found the 64-bit version on the &lt;a href=&quot;https://code.visualstudio.com/Docs/supporting/howtoupdate&quot;&gt;update page&lt;/a&gt;) and unzip it.&lt;/p&gt;

&lt;p&gt;Then move it to the &lt;em&gt;/opt&lt;/em&gt; directory and create a symbolic link.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;mv VSCode-linux-x64 /opt/VSCode
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;ln &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; /opt/VSCode/Code /usr/local/bin/code&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You are done; just run &lt;code class=&quot;highlighter-rouge&quot;&gt;code&lt;/code&gt; from your terminal!&lt;/p&gt;

&lt;h2 id=&quot;creating-a-desktop-icon&quot;&gt;Creating a Desktop Icon&lt;/h2&gt;

&lt;p&gt;Create a desktop icon by creating a &lt;em&gt;VSCode.desktop&lt;/em&gt; file&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gedit /usr/share/applications/VSCode.desktop&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;with the following content&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-ini&quot; data-lang=&quot;ini&quot;&gt;&lt;span class=&quot;c&quot;&gt;#!/usr/bin/env xdg-open
&lt;/span&gt;
&lt;span class=&quot;nn&quot;&gt;[Desktop Entry]&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Version&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1.0&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Application&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Terminal&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;false&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Exec&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/opt/VSCode/Code&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;VSCode&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Icon&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;/opt/VSCode/resources/app/vso.png&lt;/span&gt;
&lt;span class=&quot;py&quot;&gt;Categories&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;Development&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now you can find VSCode in your start menu.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://code.visualstudio.com/&quot;&gt;Visual Studio Code Website&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://code.visualstudio.com/Docs/supporting/howtoupdate&quot;&gt;Visual Studio Code Updates&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://askubuntu.com/questions/616075/how-to-install-visual-studio-code-on-ubuntu&quot;&gt;Visual Studio Code on Stackoverflow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Tue, 29 Sep 2015 11:00:00 +0000</pubDate>
        <link>https://chaosmail.github.io//web/development/2015/09/29/installing-vscode/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//web/development/2015/09/29/installing-vscode/</guid>
        
        <category>vscode</category>
        
        
        <category>web</category>
        
        <category>development</category>
        
      </item>
    
      <item>
        <title>Swiss Web Audio Group: 1st Meetup</title>
        <description>&lt;p&gt;In the first official meetup of the &lt;em&gt;Swiss Web Audio Group&lt;/em&gt; (SWAG) we talked about Sound.io, the Sequencer, and Music JSON.&lt;/p&gt;

&lt;h2 id=&quot;attendees&quot;&gt;Attendees&lt;/h2&gt;

&lt;p&gt;Stefano, Chris, Stephan&lt;/p&gt;

&lt;h2 id=&quot;soundio&quot;&gt;Sound.io&lt;/h2&gt;

&lt;p&gt;Encapsulate the components and make them modular; maybe expose them via npm. People should be able to use the modules to create cool stuff.&lt;/p&gt;

&lt;h3 id=&quot;todos&quot;&gt;Todos:&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Make the lib Open Source&lt;/li&gt;
  &lt;li&gt;GitHub Bug tracker&lt;/li&gt;
  &lt;li&gt;Work on the getting started document for the modules&lt;/li&gt;
  &lt;li&gt;Using npm for the modules&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;sequencer&quot;&gt;Sequencer&lt;/h2&gt;

&lt;h3 id=&quot;todos-1&quot;&gt;Todos:&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Share on social media&lt;/li&gt;
  &lt;li&gt;Embed the sequencer in a page&lt;/li&gt;
  &lt;li&gt;Include the keyboard MIDI player&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;music-json&quot;&gt;Music JSON&lt;/h2&gt;

&lt;p&gt;An interchangeable, readable format for note data on the web.&lt;/p&gt;
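To make the idea concrete, here is a sketch of what such note data could look like - the field layout ([time, "note", pitch, velocity, duration]) is assumed for illustration and may differ from the actual Music JSON spec:

```javascript
// Hypothetical note data in a Music JSON-like shape.
// The event layout [time, "note", pitch, velocity, duration]
// is assumed for illustration; the real spec may differ.
const sequence = {
  name: "Example riff",
  events: [
    [0,   "note", 64, 0.8, 0.5],
    [0.5, "note", 67, 0.8, 0.5],
    [1.0, "note", 71, 0.8, 1.0]
  ]
};

// A *.mid-to-Music-JSON converter would emit structures like this.
console.log(sequence.events.length);
```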

&lt;h3 id=&quot;todos-2&quot;&gt;Todos:&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Implement a converter that converts a *.mid file into Music JSON&lt;/li&gt;
&lt;/ul&gt;
</description>
        <pubDate>Mon, 24 Aug 2015 21:30:00 +0000</pubDate>
        <link>https://chaosmail.github.io//web/audio/2015/08/24/swag/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//web/audio/2015/08/24/swag/</guid>
        
        <category>soundio</category>
        
        <category>webaudio</category>
        
        <category>midi</category>
        
        
        <category>web</category>
        
        <category>audio</category>
        
      </item>
    
      <item>
        <title>Sound.io Introduction</title>
        <description>&lt;p&gt;The &lt;a href=&quot;https://github.com/soundio/soundio&quot;&gt;sound.io&lt;/a&gt; core library implements a graph object model for audio - we are calling this the &lt;em&gt;audio graph&lt;/em&gt;. The audio graph can be constructed out of a collection of &lt;em&gt;&lt;a href=&quot;https://github.com/soundio/audio-object&quot;&gt;audio-objects&lt;/a&gt;&lt;/em&gt; and &lt;em&gt;connections&lt;/em&gt; that link the audio-objects together.&lt;/p&gt;

&lt;h2 id=&quot;audio-object&quot;&gt;audio-object&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/soundio/audio-object&quot;&gt;audio object&lt;/a&gt; is a wrapper on the &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API&quot;&gt;Web Audio API&lt;/a&gt;. These objects are the building blocks of an audio graph.&lt;/p&gt;

&lt;h2 id=&quot;soundio-object-template&quot;&gt;soundio-object-template&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/soundio/soundio-object-template&quot;&gt;soundio-object-template repository&lt;/a&gt; contains a plugin template to build custom plugins. The &lt;em&gt;audio&lt;/em&gt; variable in the constructor contains a reference to the &lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/AudioContext&quot;&gt;Audio Context&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;midi&quot;&gt;MIDI&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/soundio/midi&quot;&gt;MIDI&lt;/a&gt; abstracts the &lt;a href=&quot;http://www.w3.org/TR/webmidi/&quot;&gt;Web MIDI API&lt;/a&gt;. It contains a &lt;em&gt;normalize&lt;/em&gt; function that converts MIDI format to &lt;a href=&quot;https://github.com/soundio/music-json&quot;&gt;Music JSON&lt;/a&gt; which is used internally.&lt;/p&gt;
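As an illustration of the kind of mapping involved (a sketch only - the names and output shape are assumed here, not the library's actual &lt;em&gt;normalize&lt;/em&gt; implementation), a raw Web MIDI note message can be turned into an event tuple like this:

```javascript
// Illustrative only: convert a raw Web MIDI note-on/note-off message
// [status | channel, note, velocity] into a simple event tuple.
// The actual soundio/midi normalize() may differ in naming and shape.
function normalize(data, time) {
  const status = data[0] & 0xf0;  // strip the channel nibble
  const channel = data[0] & 0x0f;
  if (status === 0x90 && data[2] > 0) {
    return [time, "noteon", data[1], data[2] / 127, channel];
  }
  if (status === 0x80 || (status === 0x90 && data[2] === 0)) {
    // note-on with velocity 0 is conventionally a note-off
    return [time, "noteoff", data[1], 0, channel];
  }
  return null; // other message types not handled in this sketch
}

console.log(normalize([0x90, 64, 127], 0)); // [0, "noteon", 64, 1, 0]
```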

&lt;h2 id=&quot;clock&quot;&gt;Clock&lt;/h2&gt;

&lt;p&gt;Maps the Web Audio absolute time to clock time using rates.&lt;/p&gt;
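While the rate stays constant this is a simple linear mapping (an illustrative sketch under that assumption, not the clock library's API; a real clock has to integrate over rate changes):

```javascript
// Illustrative: map absolute AudioContext time (seconds) to clock
// time (beats) at a fixed rate. With rate changes, a real clock
// sums the contribution of each constant-rate segment instead.
function beatAtTime(time, startTime, rate /* beats per second */) {
  return (time - startTime) * rate;
}

console.log(beatAtTime(12.5, 10, 2)); // 5
```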

&lt;h2 id=&quot;sequence&quot;&gt;Sequence&lt;/h2&gt;

&lt;p&gt;Maps the clock time to the Sequence time.&lt;/p&gt;

&lt;h3 id=&quot;sampler&quot;&gt;Sampler&lt;/h3&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-js&quot; data-lang=&quot;js&quot;&gt;&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;soundio&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;Soundio&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;sampler&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;soundio&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;objects&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s1&quot;&gt;'sample'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// soundio.objects = [&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//  {id: 0, type: 'sample'}&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;//]&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;sampler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;trigger&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'noteon'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;127&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;sample-map&quot;&gt;Sample map&lt;/h3&gt;

&lt;p&gt;A &lt;a href=&quot;https://github.com/soundio/soundio/blob/master/js/soundio.sample.js#L12&quot;&gt;sample map&lt;/a&gt; maps notes across the keyboard and the velocity range.&lt;/p&gt;
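One way to picture this (region layout assumed for illustration - soundio's actual data structure differs in detail): each sample covers a note range and a velocity range, and the map picks the region containing the played note:

```javascript
// Illustrative sample-map lookup: each region claims a note range
// and a velocity range; find the region that contains the note.
// Region fields (url, noteRange, velocityRange) are assumed names.
const regions = [
  { url: "kick-soft.wav", noteRange: [36, 36], velocityRange: [0, 0.5] },
  { url: "kick-hard.wav", noteRange: [36, 36], velocityRange: [0.5, 1] }
];

function findRegion(regions, note, velocity) {
  return regions.find(r =>
    note >= r.noteRange[0] && note <= r.noteRange[1] &&
    velocity >= r.velocityRange[0] && velocity <= r.velocityRange[1]
  );
}

console.log(findRegion(regions, 36, 0.9).url); // "kick-hard.wav"
```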

&lt;h2 id=&quot;music-json&quot;&gt;Music JSON&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/soundio/music-json&quot;&gt;Music JSON&lt;/a&gt; is an interchangeable MIDI-like format for the web.&lt;/p&gt;

&lt;h2 id=&quot;scribe&quot;&gt;Scribe&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/soundio/scribe&quot;&gt;Scribe&lt;/a&gt; is a parser for &lt;a href=&quot;https://github.com/soundio/music-json&quot;&gt;Music JSON&lt;/a&gt; to create lead sheets in SVG.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References:&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/soundio&quot;&gt;sound.io&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/soundio/blob/master/js/soundio.sample.js#L435&quot;&gt;soundio sample&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/soundio/blob/master/js/soundio.sample.js#L12&quot;&gt;soundio sample map&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/soundio-object-template&quot;&gt;sound.io object-template&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/audio-object&quot;&gt;audio object&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/clock&quot;&gt;clock&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/midi&quot;&gt;MIDI&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/soundio/scribe&quot;&gt;Scribe&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.w3.org/TR/webmidi/&quot;&gt;Web MIDI API&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API&quot;&gt;Web Audio API&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://developer.mozilla.org/en-US/docs/Web/API/AudioContext&quot;&gt;Web Audio API - Audio Context&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
        <pubDate>Mon, 24 Aug 2015 19:30:00 +0000</pubDate>
        <link>https://chaosmail.github.io//web/audio/2015/08/24/soundio-intro/</link>
        <guid isPermaLink="true">https://chaosmail.github.io//web/audio/2015/08/24/soundio-intro/</guid>
        
        <category>soundio</category>
        
        <category>webaudio</category>
        
        <category>midi</category>
        
        
        <category>web</category>
        
        <category>audio</category>
        
      </item>
    
  </channel>
</rss>
