TangTalk - Tech Blog

Azure Defender for Key Vault just released!

2020-09-22T17:30:26.000Z

As the service owner, I am super excited to share that Azure Defender for Key Vault is now generally available!

Working closely with the Azure Security Center and Azure Key Vault teams to launch this service was a true One Microsoft experience. Personally, I also grew a lot by working through machine learning algorithm improvements, infrastructure refactoring, BCDR and privacy policy compliance, cost reduction, monthly business reviews (MBR), and customer feedback investigations.

It is challenging and inspiring work that gets me up every morning.

What is Azure Defender for Key Vault

https://docs.microsoft.com/en-us/azure/security-center/defender-for-key-vault-introduction

Customers use Azure Key Vault to store the most sensitive information in their Azure environment: keys, passwords, secrets, and certificates for all of their Azure resources. By obtaining this data, attackers may be able to move laterally and breach other resources in the customer's Azure environment.

Azure Defender for Key Vault is a cloud-native threat protection suite with broad coverage. It gives customers an additional layer of protection for the secrets stored in Key Vault by helping the SOC team detect suspicious activity in their key vaults and protect the entire Azure environment.

How to Enable Azure Defender for Key Vault

Enable it from Azure Key Vault
In Key Vault’s Security page, click “try it for the first 30 days”
Enable it from Azure Security Center
https://docs.microsoft.com/en-us/azure/security-center/security-center-pricing#enable-azure-defender
1. From Security Center’s main menu, select Pricing & settings.
2. Select the subscription that you want to upgrade.
3. Select Azure Defender on to upgrade.
4. Select Save.
Below is the pricing page for an example subscription. You’ll notice that each plan in Azure Defender is priced separately and can be individually set to on or off. Make sure it is on for Azure Key Vault.

Azure Defender for Key Vault Alerts

https://review.docs.microsoft.com/en-us/azure/security-center/alerts-reference?branch=master#alerts-azurekv

Current Status

We have just reached GA, and we already have:

30G Azure Key Vault logs processed per month
1.2M Azure Key Vaults protected
63K Azure subscriptions protected

We expect these numbers to rise dramatically in the coming months.

General Availability Announcement at Ignite 2020

Azure Defender for Key Vault is generally available: https://docs.microsoft.com/en-us/azure/security-center/release-notes#azure-defender-for-key-vault-is-generally-available

What’s new in Azure Key Vault: https://techcommunity.microsoft.com/t5/video-hub/azure-key-vault-what-s-new/m-p/1698834

Introducing Azure Defender: https://myignite.microsoft.com/sessions/764ff397-97ff-4841-ad62-493f1da51d40

What’s new in Azure Security Center: https://myignite.microsoft.com/sessions/d40bd0a5-485e-455d-ac28-882b85de8dfb

Azure Certifications and Exams

2020-08-02T14:33:56.000Z


Last month, I passed two Azure exams and earned the Azure Data Engineer Certification.

I will use the Data Engineer certification as an example to explain how to prepare and what the Azure certification exams look like.

Overview

Microsoft made big changes to its Azure certifications at the Ignite conference in 2018. The new certifications are role-based and more practical, and each has a narrower focus.

Currently, there are 9 role-based Azure certifications. Earning each one requires passing one or two certification exams. The current certifications and their required exams are listed below:

Associate
- Data Engineer (DP-200, DP-201)
- Data Scientist (DP-100)
- Administrator (AZ-104)
- Security Engineer (AZ-500)
- Database Administrator (DP-300)
- AI Engineer (AI-100)
- Developer (AZ-204)
Expert
- Solution Architect (AZ-303, AZ-304)
- DevOps Engineer (AZ-400)

Which certification is right for me?

Should you go with AWS, Azure, or Google Cloud? Whether you are looking to move into a high-paying cloud career or just want to validate your existing cloud skills, Azure certifications are a great choice. Besides the benefits they bring to your career, all Azure certifications can be earned remotely, which is great when working from home during COVID-19.

Once you have decided which cloud provider is right for you, the next step is to figure out which certification path fits. In my experience, since Microsoft's Azure certifications are role-based, it is best to pick by job title and then go through the skills measured for each certification.

Certification Learning Path

Data Engineer Certification: validates the skills and knowledge to design and implement the management, monitoring, security, and privacy of data using the full stack of Azure data services to satisfy business needs.

Job role: Data Engineer

Prerequisites: None

Required exams ($165 each): DP-200, DP-201

Skills measured: skills outline

Implement data storage solutions
Manage and develop data processing
Monitor and optimize data solutions
Design Azure data storage solutions
Design data processing solutions
Design for data security and compliance

What do the Azure certification online exams look like?

All exams are delivered by Pearson VUE. Exam appointments appear on the dashboard after registration.

All Azure role-based certification exams can be taken online, but there are extra policies that need to be followed.

Before the exam:

Perform a system test.
Prepare 2 government-issued personal IDs.
Prepare your phone for identity verification and room scans. The exam proctor might also contact you if there is an issue during the exam. Make sure your mobile phone is outside the immediate testing space but within extended arm's reach, with the screen visible.
Prepare a work area:
- Additional monitors must be unplugged and turned away from you.
- Additional computers must be turned off, and their monitors must be dark.
- The area must be clear of all materials, including the following items, which are not allowed within arm's reach: books, notepads, Post-it notes, typed notes/papers, and writing instruments such as pens, markers, whiteboards, or pencils.

Start the exam:

Log in 15 minutes early to start the check-in process.
Choose Start a previously scheduled online proctored exam on the dashboard.
Select the exam under Purchased Online Exams.
Select Begin exam, proceed through the self-check-in process, and wait for a proctor to connect with you.

During the exam:

No breaks/eating/drinking
No personal belongings
No exam assistance
Facial comparison technology is used to verify your identity during the testing process.
The proctor continuously monitors you by video and audio; your face, voice, the physical room where you are seated, and your location during exam delivery are recorded.

After exam:

When your exam ends, you should see your result (pass or fail) immediately, before exiting the exam app.
Your score report will be available on the dashboard after several hours.
You will receive an email about claiming your exams and certifications. Click the link to claim your badge on Acclaim, and then you can share it anywhere.

Reschedule policy: at least 6 business days prior to your appointment. 12.5% reschedule fee.

Cancelation policy: at least 24 hours prior to your appointment.

Exam question types

The exam may contain several question types: active screen, best answer, build list, case studies, drag and drop, hot area, multiple choice, repeated answer choices, short answer, labs, mark review, and review screen.

For security reasons, exam formats and exact question types are not disclosed before the exam. You can view all question samples here to prepare for the exam.

Learning materials

Spark stateful streaming processing is stuck in StateStoreSave stage!

2020-07-07T20:46:59.000Z


A stateful Structured Streaming job suddenly got stuck at the first micro-batch. Here are my notes on the issue, how to debug a Spark stateful streaming job, and how I fixed it.

Stateful Structured Streaming Processing Job

 ## specify data source, read data from Azure Event Hub
(spark.readStream.format("eventhubs") 
 .options(**self.config.ehConfig)
 .load()  
 ## use watermark to control state size
 .withWatermark(processTimeCol, waterMarkTime)
 ## transformations. 
 .withColumn(eventTimeCol, col(eventTimeCol).cast('timestamp'))
 ## aggregation by event time windows
 .groupBy(col(key), window(eventTimeCol, "15 mins"))
 .agg()
 ...
 .select(*(cols+['windowStart', 'firstEventTime', 'lastEventTime', 'count']))
 ## specify data sink, write transformed output to Azure blob storage
 .writeStream
 .format("parquet")
 .option('path', outputPath)
 .outputMode("append")
 ## Processing details--Trigger: when to process data
 .trigger(processingTime="2 seconds")  
 ## Processing details--Checkpoint: for tracking the progress of the query
 .option("checkpointLocation", checkpointPath)
 .start())

Spark SQL converts a batch-like query into a series of incremental execution plans that operate on new micro-batches of data.
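To make the micro-batch idea concrete, here is a minimal pure-Python sketch of an incremental, windowed count with a watermark. It is only an illustration of the concept, not Spark's implementation: the names `new_state`, `window_start`, and `process_micro_batch` are made up for this example, and the watermark handling is simplified (real Spark applies the watermark computed from the previous batch and drops late rows).

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60  # same 15-minute tumbling window as the query above


def new_state():
    """Running state kept between micro-batches (hypothetical structure)."""
    return {"counts": defaultdict(int), "max_event_ts": 0}


def window_start(event_ts):
    """Start (epoch seconds) of the tumbling window that contains event_ts."""
    return event_ts - (event_ts % WINDOW_SECONDS)


def process_micro_batch(state, batch, watermark_delay):
    """Fold one micro-batch of (key, event_ts) pairs into the running state.

    Returns the windows whose end is at or before the watermark; in append
    mode these are the rows that would be written to the sink.
    """
    for key, event_ts in batch:
        state["counts"][(key, window_start(event_ts))] += 1
        state["max_event_ts"] = max(state["max_event_ts"], event_ts)

    watermark = state["max_event_ts"] - watermark_delay
    finalized = {}
    for key, start in list(state["counts"]):
        if start + WINDOW_SECONDS <= watermark:
            # The window is closed: emit it and drop it from the state.
            finalized[(key, start)] = state["counts"].pop((key, start))
    return finalized


state = new_state()
first = process_micro_batch(state, [("a", 10), ("a", 20), ("b", 30)], 600)
second = process_micro_batch(state, [("a", 2000)], 600)
```

The first batch emits nothing because no window has closed yet; the second batch pushes the watermark past the first window, so its counts are emitted and removed from the state. The state that remains is what a stage like StateStoreSave has to persist.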

Environment

This streaming job runs on Databricks clusters and is triggered by an Azure Data Factory pipeline.

The data source is Azure Event Hub, and the job stores the aggregated output in Azure Blob Storage mounted on Databricks.

The Databricks instance uses VNet injection and has an NSG associated with the VNet.

The Databricks cluster version is 5.5 LTS, which uses Scala 2.11, Spark 2.4.3, and Python 3.

We also use PySpark 2.4.4 for this streaming job.

Issue Symptom

This streaming job is scheduled to run for 4 hours at a time. When the current job stops, the next one starts, so the maximum number of concurrent jobs is 1.

It had been running well. Then, suddenly, each newly started streaming job appeared to get stuck at the first micro-batch: the job hangs at batch=0 and eventually fails with a timeout.

We have 3 regions, and this issue hit every region, one after another, within a week.

If you check the checkpoint folder:

and compare it with a normal checkpoint:

The difference is obvious: there are no commits in the checkpoint path, which means the streaming job did not finish processing even a single micro-batch.
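The same check can be scripted. The sketch below assumes the standard Structured Streaming checkpoint layout, where the `offsets` and `commits` subfolders contain one file per batch named by its batch id, and it assumes the checkpoint is reachable through a local path (for example a `/dbfs/...` path on Databricks). The function name `checkpoint_progress` is hypothetical.

```python
from pathlib import Path


def checkpoint_progress(checkpoint_dir):
    """Summarize how many batches were planned and committed in a checkpoint."""
    root = Path(checkpoint_dir)

    def batch_ids(subdir):
        folder = root / subdir
        if not folder.is_dir():
            return []
        # Batch files are named by integer batch id; skip temp and hidden files.
        return sorted(int(p.name) for p in folder.iterdir() if p.name.isdigit())

    offsets = batch_ids("offsets")
    commits = batch_ids("commits")
    return {
        "planned_batches": len(offsets),
        "committed_batches": len(commits),
        # An offset file without any commit matches the symptom described above.
        "stuck_on_first_batch": bool(offsets) and not commits,
    }
```

Calling `checkpoint_progress(checkpointPath)` on the stuck job would report at least one planned batch and zero committed batches.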

Possible Causes

This issue is tough to debug because no error message shows up, only several misleading warnings in the executor's error logs.

A streaming job can get stuck for several common reasons, such as:

The total size of state per worker is too large, which leads to higher snapshotting overhead and JVM GC pauses.
The number of shuffle partitions is too high, so the cost of writing state to HDFS increases, which causes higher latency.
NSG rules added to the Databricks VNet might block some ports and thus affect worker-to-worker communication.
Databricks mounted blobs have expired, and the storage connection string and Databricks access token need to be rotated.
The Databricks cluster version is deprecated and no longer supported.
The Spark .metadata directory is corrupted, so the metadata needs to be deleted and recreated by the pipeline. In that case, however, micro-batches still complete; the job just does nothing while processing.

None of these fixes worked this time. After a struggle, we found the root cause:

The checkpoint folder in Azure Blob Storage contains too many files, which slows down reads and writes.

I compared the DAG visualization with that of a normal job. The StateStoreSave stage takes much longer (16 hours) than in a normal run (21 seconds). StateStoreSave is the stage in which Spark stores the current streaming state in the checkpoint, so the issue lies in the checkpoint. More info for StateStoreSave can be found here

From the detailed stage information, we can see:

number of total state rows is not the concern, so we cannot solve the issue by reducing the watermark threshold or recreating the checkpoint.
memory used by state total is lower than in a normal run, so it is not a JVM GC pause issue.
time to update total is the pain point. It takes longer even though fewer state rows are updated, which points to the write speed of blob storage.
In the checkpoint folder, we had stored around 17 million checkpoint files for each region. After I deleted the whole checkpoint folder and restarted the streaming job, the issue was fixed. I am not 100% sure about the reason. One possible explanation is that Azure Blob Storage does not support a hierarchical namespace and only mimics a hierarchical directory structure by using slashes in blob names.
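To illustrate what "mimicking directories with slashes" means, here is a toy in-memory model of a flat namespace. It is not the Azure SDK and it does not prove the hypothesis above; the dictionary `blobs` and the function `rename_directory_flat` are made up for this sketch. The point is only that, without real directory objects, a directory-level operation such as a rename turns into one operation per blob under the prefix.

```python
def rename_directory_flat(blobs, src, dst):
    """Rename a virtual directory in a flat namespace; return blobs touched.

    `blobs` maps full blob names (with slashes) to their content.
    """
    src_prefix = src.rstrip("/") + "/"
    dst_prefix = dst.rstrip("/") + "/"
    touched = 0
    for name in [n for n in blobs if n.startswith(src_prefix)]:
        # No directory object exists, so each blob is moved under a new name.
        blobs[dst_prefix + name[len(src_prefix):]] = blobs.pop(name)
        touched += 1
    return touched


blobs = {
    "chk/state/0/1.delta": b"",
    "chk/state/0/2.delta": b"",
    "out/part-0": b"",
}
moved = rename_directory_flat(blobs, "chk/state", "chk/old")
```

In a store with a hierarchical namespace, the same rename can be a single metadata operation on the directory, which is why the number of files matters so much more in the flat case.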

Solutions

Solution 1 - migrate Azure blob storage to Azure Data Lake Gen2 which supports hierarchical namespace.

Solution 2 - delete checkpoint folder and decrease retention period.

Step 1: Stop the current streaming job.

Step 2: Delete the .metadata directory and the checkpoint folder.

Step 3: Add a failover mechanism so that the streaming job resumes from where it stopped at the last successful data persistence.

Here is my failover mechanism for when no checkpoint is found, so that we can delete the checkpoint folder without losing state.

import datetime as dt
import json

try:
    # If the checkpoint exists, continue to use it without refreshing.
    config.dbutils.fs.ls(config.checkpointPath)
    print('Continue streaming job with checkpoint path, %s' %
          config.checkpointPath)
except Exception:
    # Remove the _spark_metadata folder when starting with a new checkpoint.
    config.dbutils.fs.rm(config.outputPath + '_spark_metadata', True)
    print('removed _spark_metadata for last checkpoint')
    print('New streaming job checkpoint path is %s' %
          config.checkpointPath)
    # Set the stream start time to catch up from where the job stopped
    # at the last data persistence.
    timeKey = 'windowStart'
    try:
        # Output partition folders are named ".../windowStart=<epoch seconds>/".
        ts = [int(p.path.split(timeKey + "=")[1][:-1])
              for p in config.dbutils.fs.ls(config.outputPath) if timeKey in p.path]
        if ts:
            # defualtLookbackTime is assumed to be defined elsewhere in the job.
            lookbackTs = int(dt.datetime.now().replace(minute=0, second=0, microsecond=0).timestamp()) - defualtLookbackTime
            # Start from the later of the newest 15-min output window and
            # the lookback limit (one day back from the current time).
            streamStartTime = max(max(ts), lookbackTs)
            # Add 15 min so the last persisted window is not processed again.
            streamStartTime = dt.datetime.fromtimestamp(
                streamStartTime) + dt.timedelta(minutes=15)
            streamStartTime = streamStartTime.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
            # Create the Event Hubs starting position.
            startingEventPosition = {
                "offset": None,
                "seqNo": -1,
                "enqueuedTime": streamStartTime,
                "isInclusive": True
            }
            config.ehConfig["eventhubs.startingPosition"] = json.dumps(
                startingEventPosition)
    except Exception:
        # Fall back to the default starting position if the output cannot be read.
        pass

Step 4: Restart the streaming job.

Step 5: Add a retention policy to the checkpoint folder to shorten the lifetime of checkpoint files.
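The resume-point calculation in the failover step can also be written as a pure function, which makes it easy to test without a cluster. This is a sketch under assumptions, not the job's actual code: `resume_start_time` and its parameters are hypothetical names, partition folders are assumed to look like `.../windowStart=<epoch seconds>/`, and unlike the snippet above it formats the time explicitly in UTC.

```python
import datetime as dt


def resume_start_time(output_paths, now_ts, lookback_seconds,
                      time_key="windowStart", window_minutes=15):
    """Return the enqueued-time string to resume from, or None if no output exists."""
    stamps = []
    for path in output_paths:
        if time_key + "=" in path:
            # Take the epoch seconds after "windowStart=" and drop the trailing slash.
            stamps.append(int(path.split(time_key + "=")[1].rstrip("/")))
    if not stamps:
        return None
    # Never look back further than the lookback limit.
    start_ts = max(max(stamps), now_ts - lookback_seconds)
    # Skip the window that was already persisted.
    start = (dt.datetime.fromtimestamp(start_ts, tz=dt.timezone.utc)
             + dt.timedelta(minutes=window_minutes))
    return start.strftime("%Y-%m-%dT%H:%M:%S.%fZ")


paths = [
    "/mnt/out/windowStart=1594150200/",
    "/mnt/out/windowStart=1594151100/",
    "/mnt/out/_spark_metadata/",
]
print(resume_start_time(paths, now_ts=1594160000, lookback_seconds=86400))
# prints 2020-07-07T20:00:00.000000Z
```

Keeping this logic separate from `dbutils` calls means the edge cases (no output yet, very old output) can be checked before the job is restarted.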

Databricks Migration Guide

2020-06-24T22:18:09.000Z

Databricks migration steps

When you need to migrate from an old Databricks workspace to a new one, all of the files, jobs, clusters, configurations, and dependencies have to move. This is time-consuming, and it is easy to miss some parts. I have documented the detailed migration steps and written several scripts to migrate folders, clusters, and jobs automatically.
In this chapter, I will show you how to migrate Databricks.

0. Prepare all scripts

Navigate to https://github.com/xinyeah/Azure-Databricks-migration-tutorial, fork this repository, and download all needed scripts.

1. Install databricks-cli

pip3 install databricks-cli

2. Set up authentication for two profiles

Set up authentication for two profiles, one for the old Databricks and one for the new. The CLI authenticates with a personal access token.

2.1 Generate a personal access token.

Here is a step-by-step guide to generate it.

2.2 Copy the generated token and store it as a secret in Azure Key Vault.

On the Key Vault properties pages, select Secrets.
Click on Generate/Import.
On the Create a secret screen choose the following values:

Upload options: Manual.
Name:
Value: paste the generated token here
Leave the other values to their defaults. Click Create.

2.3 Set up profiles

In this case, the profile primary is for the old Databricks, and the profile secondary is for the new one.

databricks configure --profile primary --token
databricks configure --profile secondary --token

Each time you set up a profile, you need to provide the Databricks host URL and the personal access token generated previously.

2.4 Validate the profile

databricks fs ls --absolute --profile primary
databricks fs ls --absolute --profile secondary
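Before running the migration scripts, it can help to confirm that both profiles were actually written. The databricks-cli stores profiles in an INI-style file (`~/.databrickscfg`) with `host` and `token` keys per section. The helper below is a small sketch, with the hypothetical name `missing_profiles`, that takes the file content as text and reports which required profiles are absent or incomplete.

```python
import configparser


def missing_profiles(cfg_text, required=("primary", "secondary")):
    """Return the required profile names that are absent or lack host/token."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    missing = []
    for name in required:
        if not parser.has_section(name):
            missing.append(name)
        elif not (parser.get(name, "host", fallback="")
                  and parser.get(name, "token", fallback="")):
            missing.append(name)
    return missing


# Placeholder values, not real hosts or tokens.
example = """
[primary]
host = https://old-workspace.example
token = placeholder-token

[secondary]
host = https://new-workspace.example
"""
print(missing_profiles(example))  # prints ['secondary']
```

In practice you would pass in the content of `~/.databrickscfg`, for example `pathlib.Path.home().joinpath(".databrickscfg").read_text()`.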

Here are the DBFS root locations from the [docs](https://docs.microsoft.com/en-us/azure/databricks/data/databricks-file-system)![image-20200624113619104](https://raw.githubusercontent.com/xinyeah/xinyeah.github.io/master/images/image-20200624113619104.png)

3. Migrate Azure Active Directory users

3.1 Navigate to the old Databricks UI, expand Account in the right corner, then click Admin Console. You can get a list of users as admin in this Databricks.

3.2 Navigate to the new Databricks portal, and click Add User under the Users tab of the Admin Console to add admins.

4. Migrate the workspace folders and notebooks

Solution 1
Put migrate-folders.py in a separate folder (it will export files into this folder), and then run the script to migrate folders and notebooks. Libraries are not included by this script; Step 5 shows how to migrate libraries.
Remember to replace the profile variables in this script with your own profile names:

EXPORT_PROFILE = "primary"
IMPORT_PROFILE = "secondary"

Solution 2
Also, you can do it manually: Export as DBC file and then import.

5. Migrate libraries

There is no external API for libraries, so all libraries need to be reinstalled in the new Databricks manually.

5.1 List all libraries in the old Databricks.

5.2 Install all libraries.

Maven libraries:

PyPI libraries:

6. Migrate the cluster configuration

Run migrate-cluster.py to migrate all interactive clusters. This script skips all clusters that were created by jobs.
Remember to replace the profile variables in this script with your own profile names:

EXPORT_PROFILE = "primary"
IMPORT_PROFILE = "secondary"
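The "skip job clusters" rule can be expressed as a simple filter over the cluster list. The sketch below is not the code from migrate-cluster.py; it assumes each cluster is a dictionary as returned by the Clusters API, where the `cluster_source` field is `"JOB"` for clusters created by the job scheduler. The function name `interactive_clusters` is made up for this example.

```python
def interactive_clusters(clusters):
    """Keep only clusters that were not created by the job scheduler."""
    return [c for c in clusters if c.get("cluster_source") != "JOB"]


# Hypothetical sample data shaped like a Clusters API listing.
sample = [
    {"cluster_name": "etl-dev", "cluster_source": "UI"},
    {"cluster_name": "job-42-run-1", "cluster_source": "JOB"},
    {"cluster_name": "shared", "cluster_source": "API"},
]
names = [c["cluster_name"] for c in interactive_clusters(sample)]
```

Job clusters are recreated automatically from the job definition, so only the interactive ones need to be copied over.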

7. Migrate the jobs configuration

Run migrate-job.py to migrate all jobs. Schedule information is removed so that jobs do not start before the proper cutover.
Remember to replace the profile variables in this script with your own profile names:

EXPORT_PROFILE = "primary"
IMPORT_PROFILE = "secondary"
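Removing the schedule before import can be sketched as follows. Again, this is an illustration rather than the code in migrate-job.py: it assumes job settings are a dictionary in the Jobs API shape, where the cron trigger lives under the `schedule` key, and `without_schedule` is a hypothetical helper name.

```python
import copy


def without_schedule(job_settings):
    """Return a copy of the job settings with the cron schedule removed."""
    settings = copy.deepcopy(job_settings)
    settings.pop("schedule", None)
    return settings


# Hypothetical job settings for illustration.
job = {
    "name": "nightly-aggregation",
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
    "max_concurrent_runs": 1,
}
migrated = without_schedule(job)
```

Working on a deep copy keeps the exported settings intact, so the original schedule can be restored on the new workspace at cutover time.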

8. Migrate Azure Key Vaults secret scopes

There are two types of secret scope: Azure Key Vault-backed and Databricks-backed.
Creating an Azure Key Vault-backed secret scope is supported only in the Azure Databricks UI. You cannot create a scope using the Secrets CLI or API.

List all secret scopes:

databricks secrets list-scopes --profile primary

Generate key vault-backed secret scope:

Go to https://#secrets/createScope. This URL is case sensitive; scope in createScope must be uppercase.
Enter the name of the secret scope. Secret scope names are case insensitive.
These properties are available from the Properties tab of an Azure Key Vault in your Azure portal.
Click the Create button.

9. Migrate Azure blob storage and Azure Data Lake Storage mounts

There is no external API for this, so all storage has to be remounted manually.

9.1 List all mount points in old Databricks using `notebook`.

dbutils.fs.mounts()

9.2 Remount all blob storage following the official docs using `notebook`.

dbutils.fs.mount(
  source = "wasbs://@.blob.core.windows.net",
  mount_point = "/mnt/",
  extra_configs = {"":dbutils.secrets.get(scope = "", key = "")})

where

is a DBFS path representing where the Blob storage container or a folder inside the container (specified in source) will be mounted in DBFS.
can be either fs.azure.account.key..blob.core.windows.net or fs.azure.sas...blob.core.windows.net
dbutils.secrets.get(scope = "", key = "") gets the key that has been stored as a secret in a secret scope.

10. Migrate cluster init scripts

Copy all cluster initialization scripts to new Databricks using DBFS CLI.

# Primary to local
dbfs cp -r dbfs:/databricks/init ./old-ws-init-scripts --profile primary

# Local to secondary workspace
dbfs cp -r old-ws-init-scripts dbfs:/databricks/init --profile secondary

11. ADF config

For Databricks jobs scheduled by Azure Data Factory, navigate to the Azure Data Factory UI. Create a new Databricks linked service that points to the new Databricks, using the personal access token generated in step 2.

Reference

https://docs.microsoft.com/en-us/azure/databricks/administration-guide/cloud-configurations/azure/vnet-inject
https://docs.microsoft.com/en-us/azure/azure-databricks/howto-regional-disaster-recovery#detailed-migration-steps
https://docs.microsoft.com/en-us/azure/databricks/dev-tools/cli/
https://docs.microsoft.com/en-us/azure/databricks/security/secrets/secret-scopes

Always flush Application Insights

2020-06-22T21:50:20.000Z

What is Application Insights?

Application Insights is one of the Azure monitoring solutions. It monitors the availability, performance, and usage of your web application.

How it works

Before you can use Application Insights, you need to install an Application Insights SDK (instrumentation package) in your app. This instrumentation monitors your app and sends the telemetry data to an Azure Application Insights resource identified by an instrumentation key (a unique GUID).

It supports Java, C#, Node.js, Python, and so on.

Flush data

The official docs say the SDK sends data at fixed intervals (typically 30 seconds) or whenever the buffer is full (typically 500 items).

However, in my experience, the data is not sent if you don't flush.

Here is a C# code example that flushes the telemetry.

// Set up some properties and metrics:
var properties = new Dictionary<string, string>
    {{"game", currentGame.Name}, {"difficulty", currentGame.Difficulty}};
var metrics = new Dictionary<string, double>
    {{"Score", currentGame.Score}, {"Opponents", currentGame.OpponentCount}};

// Send the event:
telemetry.TrackEvent("WinGame", properties, metrics);
// Flush the buffer
telemetry.Flush();
// Allow some time for flushing before shutdown.
System.Threading.Thread.Sleep(5000);
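Why an explicit flush matters is easier to see with a toy model of a buffered sender. The class below is not the Application Insights SDK; `BufferedTelemetry` and its `send` callback are hypothetical, and the timer-based sending is left out on purpose. It only models the "send when the buffer is full or when flushed" behavior described above.

```python
class BufferedTelemetry:
    """Toy telemetry client that batches events before sending them."""

    def __init__(self, send, max_items=500):
        self._send = send            # callable that receives a list of events
        self._max_items = max_items
        self._buffer = []

    def track_event(self, name, properties=None):
        self._buffer.append({"name": name, "properties": properties or {}})
        if len(self._buffer) >= self._max_items:
            self.flush()

    def flush(self):
        """Send whatever is buffered right now."""
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()


sent_batches = []
telemetry = BufferedTelemetry(sent_batches.append, max_items=3)
telemetry.track_event("WinGame", {"game": "demo"})
pending_before_flush = len(sent_batches)   # nothing has been sent yet
telemetry.flush()
```

If the process exits right after `track_event`, the event is still sitting in the buffer. A short-lived job that ends before the interval elapses or the buffer fills would lose it unless it flushes first, which is consistent with the behavior I observed.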

Reference

https://docs.microsoft.com/en-us/azure/azure-monitor/app/api-custom-events-metrics

https://docs.microsoft.com/en-us/azure/azure-monitor/app/app-insights-overview

Java serialization is a bitch!

2020-06-21T21:54:05.000Z

Concept

Object serialization is the process of converting an object into a stream of bytes so that it can be stored or transmitted between machines.

The reverse process, deserialization, uses the byte stream to recreate the object.

Issue for Java Serialization

The main concern with Java serialization is security. The so-called Java deserialization vulnerability affects all apps that receive serialized Java objects, and attackers can use it to gain complete remote control of an app service. The attack surface is so big that even if you adhere to all best practices, your app may still be vulnerable.

What’s the vulnerability?

Many apps that accept serialized byte streams do not validate untrusted input before deserialization. Attackers can insert malicious code into the byte stream and have it execute in the app. They can also easily mount a denial-of-service attack with a stream that takes forever to deserialize, which is called a deserialization bomb.
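The same class of problem exists outside Java. As an analogy only, the sketch below uses Python's `pickle`, which also lets the byte stream decide which callable runs during deserialization. The `Payload` class is a stand-in for attacker-crafted input, and the expression it evaluates is deliberately harmless. Parsing JSON, by contrast, can only produce plain data types.

```python
import json
import pickle


class Payload:
    """Stands in for an attacker-crafted object."""

    def __reduce__(self):
        # pickle will call eval("6 * 7") while loading; a real attacker
        # would put a dangerous expression or callable here instead.
        return (eval, ("6 * 7",))


untrusted_bytes = pickle.dumps(Payload())
result = pickle.loads(untrusted_bytes)   # code chosen by the byte stream runs here
print(result)  # prints 42

# JSON parsing only ever yields dicts, lists, strings, numbers, booleans, and None.
data = json.loads('{"score": 42, "name": "demo"}')
```

Note that `pickle.loads` did not return a `Payload` object at all: it returned whatever the embedded call produced. This is why deserializing untrusted bytes with a general-purpose object serializer is dangerous.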

Solutions

The best way to avoid Java serialization vulnerability is never to use Java serialization!

Other mechanisms for converting between objects and byte sequences avoid the Java serialization vulnerability, such as JSON and Protocol Buffers (protobuf).

JSON vs Protocol Buffers

I summarize the differences between them:


| JSON | Protocol Buffers |
| --- | --- |
| human-readable | not human-readable, but it provides pbtxt for readability |
| text-based | binary |
| no schema needed | offers schemas to enforce appropriate usage |
| | simpler, faster, smaller in size |

But both are good serialization mechanisms:

They are simpler than Java serialization.
They do not support automatic serialization or deserialization.
They support only a few primitive and array data types, which avoids the deserialization issue.

Reference

Effective Java, Third Edition

https://www.darkreading.com/informationweek-home/why-the-java-deserialization-bug-is-a-big-deal/d/d-id/1323237

Deploy Spark .NET app on Databricks

2020-06-19T10:35:53.000Z


I struggled to deploy a Spark .NET app on Databricks scheduled by an Azure Data Factory pipeline. Here are my notes on the solution I finally worked out.

This chapter walks you step by step through creating a Spark .NET app and deploying it either directly on Databricks or scheduled by an Azure Data Factory pipeline.

Prepare a Spark .NET application

This doc teaches you how to run a Spark .NET app using .NET Core. If you are familiar with .NET, the process can be simplified to:

Prepare environment.
1.1 Install the following dependencies: .NET, Java, compression software, Apache Spark, .NET for Apache Spark, WinUtils.
1.2 Set the DOTNET_WORKER_DIR environment variable.
1.3 Verify you have all dependencies: you are good if dotnet, java, mvn, and spark-shell all run successfully from the command line.

Code a demo app to count words.

using Microsoft.Spark.Sql;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Spark session.
            SparkSession spark = SparkSession
                .Builder()
                .AppName("word_count_sample")
                .GetOrCreate();

            // Create initial DataFrame.
            DataFrame dataFrame = spark.Read().Text("input.txt");

            // Count words.
            DataFrame words = dataFrame
                .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                .Select(Functions.Explode(Functions.Col("words"))
                .Alias("word"))
                .GroupBy("word")
                .Count()
                .OrderBy(Functions.Col("count").Desc());

            // Show results.
            words.Show();

            // Stop Spark session.
            spark.Stop();
        }
    }
}

Build your app.
dotnet build

Locally submit your app to run on Apache Spark.

spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master local \
microsoft-spark-2.4.x-.jar \
dotnet HelloSpark.dll

If it is successful, you can see the word count data written on the console.
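To sanity-check the console output, it helps to know what the pipeline should compute. The snippet below reproduces the same logic (split on a single space, count per word, order by count descending) in plain Python on in-memory lines. It is only a reference for the expected result, not Spark code, and `word_count` is a hypothetical helper name.

```python
from collections import Counter


def word_count(lines):
    """Count words split on a single space, most frequent first."""
    counts = Counter(word for line in lines for word in line.split(" "))
    return counts.most_common()


print(word_count(["hello world", "hello spark"]))
# prints [('hello', 2), ('world', 1), ('spark', 1)]
```

The order among words with the same count may differ in Spark, since the query only sorts by the count column.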

Prepare dependencies on Databricks

Download Microsoft.Spark.Worker which helps Apache Spark execute your app.
Download install-worker.sh which copies .NET for Apache Spark dependencies into your cluster’s nodes.
Download db-init.sh which installs dependencies on your Databricks cluster.

Publish your Spark .NET app.

dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.16.04-x64

Compress the app files published in the previous step: navigate to mySparkApp/bin/Release/netcoreapp3.1/ubuntu.16.04-x64 and compress the publish folder into a zip file.

Upload files to DBFS.

databricks fs cp db-init.sh dbfs:/spark-dotnet/db-init.sh
databricks fs cp install-worker.sh dbfs:/spark-dotnet/install-worker.sh
databricks fs cp Microsoft.Spark.Worker.netcoreapp3.1.linux-x64-0.6.0.tar.gz dbfs:/spark-dotnet/Microsoft.Spark.Worker.netcoreapp2.1.linux-x64-0.6.0.tar.gz

cd mySparkApp
databricks fs cp input.txt dbfs:/input.txt

cd mySparkApp\bin\Release\netcoreapp3.1\ubuntu.16.04-x64
databricks fs cp mySparkApp.zip dbfs:/spark-dotnet/publish.zip
databricks fs cp microsoft-spark-2.4.x-0.6.0.jar dbfs:/spark-dotnet/microsoft-spark-2.4.x-0.6.0.jar

Then all the dependencies are ready. We can deploy it on Databricks.

How to deploy

We can run .NET for Apache Spark apps on Databricks, but not in the way we usually run Python or Scala jobs. For Python or Scala jobs, we can simply start a Notebook task. For a Spark .NET job, we need to use the “spark-submit” or “Jar” task.

Scheduled by Azure Data Factory pipeline

Deploy using Set Jar

Generate a Databricks access token for Azure Data Factory to use.
1.1 In the Databricks workspace, select your user profile in the upper right, and select User Settings.
1.2 Select Generate New Token under the Access Tokens tab.
1.3 Save the access token for later use in creating a Databricks linked service. For security, it is usually stored in Azure Key Vault.
Navigate to the Pipelines page in Azure Data Factory, create a new pipeline, search for Databricks activities, and drag the Jar activity onto the canvas.
In the Jar activity Demo, update the paths and settings as needed. The Databricks linked service should be created with the access token generated on Databricks earlier. Remember to add the init script in the cluster settings.
Check the Jar settings. The main class name is org.apache.spark.deploy.dotnet.DotnetRunner. The parameters are passed to the main class: the first two must be your app's publish.zip and your app name, and the rest are whatever your app needs. Under Append libraries, add microsoft-spark-2.4.x-0.10.0.jar on DBFS.
Click Debug to run a test of the current pipeline.
Save the newly added pipeline by clicking Publish all.

Directly on databricks

1. Deploy using Spark-submit

Navigate to the Databricks workspace and create a job. Select spark-submit as the task type and set the job parameters.
When configuring the cluster, add the init script located on DBFS (Databricks File System).
Select Run Now to test the job. Once the job’s cluster is created, your Spark job will be submitted.

2. Deploy using Set Jar

We can also use the Jar task to deploy on Databricks. The settings should be the same as for the job triggered by Azure Data Factory.

Reference

https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/databricks-deploy-methods

https://docs.microsoft.com/en-us/azure/data-factory/solution-template-databricks-notebook

https://docs.microsoft.com/en-us/azure/data-factory/transform-data-databricks-jar

https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started

https://dotnet.microsoft.com/learn/data/spark-tutorial/intro

A detailed guide to building a personal site with Hexo + Github Pages + Travis CI

2020-06-07T21:27:46.000Z

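Tech stack: Github Pages + Hexo + Travis CI

The main reason: no money. This is a completely free combination.

Among the many hosting options, I ended up choosing Github Pages, mainly because I am familiar with Github's version control and because Github integrates well with other platforms. The officially recommended static site generator is Jekyll, and you can pick a Jekyll theme in the Github Pages section of the repository's Settings page.

Hexo is also a static site generator, built on Node.js. With Hexo you can write posts in Markdown without worrying much about layout and formatting. Hexo is also fairly mature and has many stable, good-looking themes.

Travis CI is a continuous integration platform. It can watch for code changes on a specific branch of a repo and automatically trigger builds and tests, which supports the best practice of frequently merging small pieces of code. With automatic deployment, you are no longer tied to a development machine and can publish posts without setting up an environment.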

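Set up the environment

Install and configure Github
Install Node.js and npm
npm (Node Package Manager) is a tool for developing and sharing JavaScript code. Download the latest version of Node.js from https://nodejs.org/en/download/; it includes npm.
Open Command Prompt and verify that Node.js and npm were installed successfully.
$ node -v
v12.17.0

$ npm -v
6.14.4

Install Hexo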

$ npm install -g hexo-cli

$ hexo -v
hexo-cli: 3.1.0
os: Windows_NT 10.0.18363 win32 x64
node: 12.17.0
...

Install hexo-deployer-git
$ npm install hexo-deployer-git --save

Initialize the Hexo + Github Pages project

Initialize .github.io as a Hexo project

$ mkdir .github.io

$ cd .github.io

$ hexo init
INFO  Copying data to ~/***/.github.io
INFO  You are almost done! Don't forget to run 'npm install' before you start blogging with Hexo!

$ npm install

$ git init

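The directory after initialization looks like this:
.
├── _config.yml # site configuration file
├── package.json # basic application info and dependencies
├── scaffolds # template folder; templates used to pre-fill new posts
├── source # markdown and html files here are processed and written to the public folder
| ├── _drafts # new drafts are saved here
| └── _posts # new posts are saved here
└── themes # theme folder; static pages are generated according to the theme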
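
As a quick sanity check, the top-level entries of the tree above can be verified with a few lines of shell. This is a minimal sketch under stated assumptions: `check_hexo_layout` is a hypothetical helper (Hexo has no such command), and it looks only at the five top-level names from the tree, not at their contents.

```shell
# Check that a directory contains the top-level entries of a fresh Hexo project.
# `check_hexo_layout` is a hypothetical helper, not a Hexo command.
check_hexo_layout() {
    root=$1
    status=0
    for entry in _config.yml package.json scaffolds source themes; do
        if [ ! -e "$root/$entry" ]; then
            printf 'missing: %s\n' "$root/$entry" >&2
            status=1
        fi
    done
    return "$status"
}

# Typical use from inside the project directory:
#   check_hexo_layout . && echo "layout looks complete"
```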
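Create a public repository named <YourName>.github.io on Github, where YourName must match your Github username.
To avoid errors, do not initialize the repository with a README, license, or gitignore file.
Check the Github Pages settings under the repository's Settings and you have your own site: https://<YourName>.github.io. For a personal site, only the master branch can be set as the publishing source.
Copy the repository's URL.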

Add the Github repository as the remote (origin) of the local repository

$ git remote add origin remote-repository-URL
# Sets the new remote
$ git remote -v
# Verifies the new remote URL

Following the documentation, edit the site settings in the _config.yml file.

Run the following commands to check the result

$ hexo clean
$ hexo generate
$ hexo server
INFO  Hexo is running at http://0.0.0.0:4000/. Press Ctrl+C to stop.

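Add a blog theme

Fork the hexo-theme-next project into your own account.

Run the following commands to pull the forked repository into a local submodule

cd <YourName>.github.io
git submodule add https://github.com/<YourName>/hexo-theme-next.git themes/next

After running this command, a .gitmodules file is generated in the project root with the following content:

[submodule "themes/next"]
    path = themes/next
    url = https://github.com/sugartxy/hexo-theme-next

After customizing the theme, first check in the submodule. In the themes/next directory, run in order:
cd themes/next
git add .
git commit -m "update config file"
git push origin master
Switch to the project root, open the site configuration file (<YourName>.github.io/_config.yml), and change the theme field so that the theme takes effect.
theme: next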
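
Before starting the server, it can help to confirm which theme the site configuration actually names. The sketch below is an illustration only: `site_theme` is a hypothetical helper, and it handles just the simple one-line `theme: name` form at the start of a line, not general YAML (a quoted value would be printed with its quotes).

```shell
# Print the value of the top-level `theme:` key in a Hexo _config.yml.
# `site_theme` is a hypothetical helper; it is not a YAML parser and only
# understands an unindented `theme: name` line, optionally followed by a comment.
site_theme() {
    sed -n 's/^theme:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1" | head -n 1
}

# Typical use from the project root:
#   [ "$(site_theme _config.yml)" = "next" ] || echo "theme is not set to next"
```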

Run the following command to check the result

$ hexo server
INFO  Hexo is running at http://0.0.0.0:4000/. Press Ctrl+C to stop.

From the project root, check the code in to the project repository:

cd <YourName>.github.io
git add .
git commit -m "add submodule"
git push origin master

Generate a post and deploy

1. Run the following command to create a new post

$ hexo new post <title>
INFO  Created: ~/<YourName>/<YourName>.github.io/source/_posts/<title>.md

Write the post content in the newly created markdown file.

2. If anything under themes/next has changed, check in the changes from the themes/next path to the repo you just forked.

3. From the <YourName>.github.io project path, check in the changes to the master branch of the <YourName>.github.io repo.

4. Deploy locally. Once Travis CI is configured as described later, this step can be skipped.

4.1 Edit the deployment fields in the configuration file _config.yml

deploy:
  type: git
  repository: https://git@github.com/<YourName>/<YourName>.github.io.git
  branch: master

4.2 Run the following commands to deploy the site. When hexo deploy runs, Hexo pushes the files and directories in the public directory to the remote repository and branch specified in _config.yml, completely overwriting whatever is already on that branch.

$ hexo clean # remove the cache file (db.json) and the generated static files (public)
$ hexo generate # generate static files
$ hexo deploy # deploy the site

Automated deployment with Travis CI

Travis CI is free for open-source repositories. All you need is a Github account with at least one project; add a .travis.yml file to the project and you can use Travis CI. The Hexo documentation (https://hexo.io/zh-cn/docs/github-pages) explains in detail how to deploy Hexo to Github Pages automatically with Travis CI. Only the following changes are needed:

1. Edit the .travis.yml file

sudo: required
language: node_js
node_js:
  - 10 # use nodejs v10 LTS

branches:
  only:
    - master # build master branch only


# Start: Build Lifecycle
install:
  - npm install -g hexo-cli
  - npm install
  - npm install hexo-deployer-git --save
  # set the git commit user name and email
  - git config user.name "<YourName>"
  - git config user.email "<YourEmail>"
  # replace the gh_token string in the _config.yml in the same directory with the variable configured in the Travis dashboard. Note that this sed command uses double quotes; single quotes do not work!
  - sed -i "s/gh_token/${GH_TOKEN}/g" ./_config.yml

script:
  - hexo clean
  - hexo generate # generate static files

after_success: # runs only if the previous steps succeeded
  - hexo deploy

# End: Build LifeCycle

2. Edit the _config.yml file

deploy:
  type: git
  # the gh_token below is replaced by the sed command in .travis.yml
  repo: https://gh_token@github.com/<YourName>/<YourName>.github.io.git
  branch: master

With this in place, every blog update only requires checking in the Markdown file to the master branch, and the site is deployed automatically. The deployment status is visible on the Travis CI website.

Other issues

1. Add a comment system: gitalk

Reference: https://www.standbyside.com/2018/12/04/add-comment-function-to-next/

1.1 Go to github (https://github.com/settings/applications/new) and register a new OAuth application

Once it is created, the application gets a Client ID and a Client Secret.

1.2 Create a repository with the same name in your own github

From then on, each post maps to an issue in this repository, and the post's comments and likes are recorded in that issue.

1.3 The Next theme v7.6.0 already integrates gitalk; just edit the comments-related properties in the theme's _config.yml

comments:
    # Available values: tabs | buttons
    style: tabs
    # Choose a comment system to be displayed by default.
    # Available values: changyan | disqus | disqusjs | gitalk | livere | valine
    active: gitalk
    # Setting `true` means remembering the comment system selected by the visitor.
    storage: true
    # Lazyload all comment systems.
    lazyload: false
    # Modify texts or order for any navs, here are some examples.
    nav:
        #disqus:
        #  text: Load Disqus
        #  order: -1
        #gitalk:
        #  order: -2

gitalk:
    enable: true # enable gitalk
    github_id: # your github username
    repo: # name of the repository you just created; the name only, not the full URL
    client_id: # your Client ID
    client_secret: # your Client Secret
    admin_user: # contact person; the page shows a prompt to contact ** to initialize comments
    distraction_free_mode: true  # Facebook-like distraction free mode
    # Gitalk's display language depends on user's browser or system environment
    # If you want everyone visiting your site to see a uniform language, you can set a force language value
    # Available values: en | es-ES | fr | ru | zh-CN | zh-TW
    language:

2. Local images do not display

Reference: https://merrier.wang/20190111/image-skills-in-hexo.html

2.1 Create an images folder under the path <YourName>.github.io/source and put all images in it.

2.2 Reference an image from Markdown like this:

![](/images/image_name.jpg)

References

Next tutorial (Chinese): https://theme-next.iissnan.com/getting-started.html#description-setting

Hexo tutorial (Chinese): https://hexo.io/zh-cn/docs/

Github Pages tutorial (Chinese): https://help.github.com/cn/github/working-with-github-pages

Travis official documentation: https://docs.travis-ci.com/user/tutorial/
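
A note on the sed step in the Travis CI configuration: the comment there says double quotes are required and single quotes do not work. The sketch below illustrates why, under a few assumptions: it runs on a throwaway copy of a deploy section, `dummy-token-123` is a made-up placeholder rather than a real token, and it writes to new files instead of using `sed -i` so that it behaves the same with GNU and BSD sed.

```shell
# Work on a throwaway copy so no real _config.yml is touched.
workdir=$(mktemp -d)
cat > "$workdir/_config.yml" <<'EOF'
deploy:
  type: git
  repo: https://gh_token@github.com/<YourName>/<YourName>.github.io.git
  branch: master
EOF

# Placeholder value for illustration only. On Travis CI the variable comes
# from the repository settings and should never be committed.
GH_TOKEN=dummy-token-123

# Double quotes: the shell expands ${GH_TOKEN} before sed sees the expression.
sed "s/gh_token/${GH_TOKEN}/g" "$workdir/_config.yml" > "$workdir/double.yml"

# Single quotes: sed receives the literal text ${GH_TOKEN} as the replacement.
sed 's/gh_token/${GH_TOKEN}/g' "$workdir/_config.yml" > "$workdir/single.yml"

# Compare the two results.
grep 'repo:' "$workdir/double.yml"
grep 'repo:' "$workdir/single.yml"
```

With double quotes the repo URL carries the token value; with single quotes it carries the unexpanded text `${GH_TOKEN}`, which Github would reject. The substitution also assumes the token contains no `/` or `&`, since those characters are special inside a sed `s/…/…/` expression.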
</article>
</main></body></html>
