Bhanuchander Udhayakumar - Data Science | Deep Learning | Senior Software Engineer

Simple Steps To Publish A Python Package With Best Practices (2021-05-01)

Contents

Overview

In this post, I share my observations and best practices for publishing a Python package. The four simple steps below will help you follow a systematic approach and make your publishing work easy.

4 steps to publish python package


Step 1 : Finalize your python package

Yes, this is also a step, and the first one to be done. Here you need to check several points, including some post-development tasks:

  • Check whether all the features on your checklist are covered.
  • Find and remove any dead code left in the current version.
  • Update the README and other notes if present.
  • Add TODOs or future-plan comments, as mentioned in your README.
  • Review your package once again from end to end; you will surely find something to improve.

Step 2 : Add test scripts

Adding test scripts makes the package more robust and helps you find additional cases to cover. The pytest package offers a convenient way to run your test cases. Whenever you add a new feature or make changes, try to add a corresponding test case.
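For instance, a minimal pytest case might look like the sketch below; the greet function is a hypothetical stand-in for whatever your package actually exports, not real code from any package:

```python
# test_greet.py -- a minimal pytest sketch; greet() is a hypothetical
# stand-in for a function your package would actually export.

def greet(name):
    return "Hello, " + name + "!"

def test_greet():
    # pytest auto-discovers functions named test_* in files named test_*.py
    assert greet("PyPI") == "Hello, PyPI!"
```

Running pytest from the project root discovers and executes this test automatically.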

PyTest


Step 3 : Release the distribution to PyPI

I suggest reading the referenced post: How to Publish an Open-Source Python Package to PyPI. However, the summarized points are given here as steps:

  • Register at PyPI via https://pypi.org/account/register/ if you have not already.
  • Configure the package using setup.py, which contains all the information about your package.
  • Build your project and check the distribution using the twine package. The commands are,
python setup.py sdist bdist_wheel
twine check dist/*
  • To publish your package to PyPI, use the command,
twine upload dist/*
  • Other packaging tools are also mentioned in that post; if you wish, have a look at Poetry, Flit and Cookiecutter.
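For reference, a minimal setup.py might be sketched as below; every metadata value here is a placeholder of my own, not taken from any real package:

```python
# setup.py -- a minimal sketch with placeholder metadata
from setuptools import setup, find_packages

setup(
    name="example-package",      # placeholder: the name your package gets on PyPI
    version="0.1.0",             # bump this before every release
    description="A short one-line summary of the package",
    packages=find_packages(),    # auto-discover importable packages
    python_requires=">=3.7",
)
```

With this file in place, `python setup.py sdist bdist_wheel` produces the artifacts that `twine check dist/*` validates.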

Step 4 : Use GitHub and CI Tools

This step explains how GitHub and Travis CI are used when publishing Python packages.

  • Travis CI is used to verify the build status and test results of the latest commit.
  • The GitHub workflows facility is used to publish the package to PyPI.

I have added both of these to my GitHub repo patch_antenna. Let us take this repo as the example for the further discussion.

Travis CI

Travis CI is integrated with a GitHub repo by adding a .travis.yml file to that repo. The steps needed to test my package are listed in that YAML file, shown below.

.travis.yml

sudo: false
language: python
python:
  - "3.7"
before_install:
  - pip3 install scipy
  - python3 setup.py install
script: pytest
notifications:
  email: false

Once this is added to your GitHub repo, you need to enable Travis CI under GitHub's authorized applications and configure which repos Travis CI should use at https://travis-ci.com/.

GitHub workflows

GitHub offers a workflows facility for many kinds of automation. I am using this facility to publish my Python package to PyPI whenever a release is created, just as we did with Travis CI.

To use the facility, a YAML file .github/workflows/python-publish.yml is created in the same repo.

python-publish.yml

name: Upload Python Package
on:
  release:
    types: [created]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python
      uses: actions/setup-python@v2
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Build and publish
      env:
        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*

The username and password that twine asks for can be added as secrets under the repository's Settings > Secrets, as shown in the image below.

Screenshot git secrets

Once you have done all of the above, follow these best practices to publish your package:

  • Make your regular commits with test cases and verify the build status using Travis CI.
  • Keep the README and TODO notes in your repo updated to track your goals.
  • Once the features are done and the build passes, make a release on GitHub with the respective version and release notes. (Also don't forget to bump the version in setup.py.)
  • GitHub workflows will then release the new package version, as specified in the setup.py file, and you can check the workflow results on the Actions page.
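As a small hedged aid of my own (not from the referenced post), you can sanity-check that the version string you are about to put into setup.py and the GitHub release tag is plain dot-separated numbers before tagging:

```python
import re

def looks_like_version(tag):
    # loose check: one or more dot-separated numeric parts, e.g. "1.0.2";
    # rejects anything else, such as a "v" prefix or a suffix
    return re.fullmatch(r"\d+(\.\d+)*", tag) is not None
```

Keeping the release tag and the setup.py version in the same plain numeric form avoids surprises when the workflow uploads the build.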

References

Reality Of One Shot Learning For Face Recognition (2020-01-30)

Contents

  1. Introduction
  2. Setup details
  3. Results
  4. Conclusion
  5. References

Introduction

The goal of this post is to share my experience with One Shot Learning, which is normally used when we have a small training data set for face recognition. After testing various implementations shared on GitHub and in the posts listed in the references, I wrote this post to show my observations and collected results.

One shot learning :

It is a classification / categorization / similarity-identification technique used when only a small training data set is available for computer vision tasks such as object detection, face recognition and handwriting recognition. Computer vision models are normally very large deep neural networks, which are hard to train and require substantial resources and training data. But real-world problems often do not have that much data for training such large models. One shot learning is the recommended solution for this kind of specific problem.

siamese_base

Sample structure - One Shot Learning - source : reference_1



Here two pre-trained convolutional networks are followed by a custom layer and a sigmoid-activated final layer that learns the similarity between the two image inputs. The workflow is:

  • The pre-trained networks generate feature vectors from their final layers.
  • The custom layer joining the two pre-trained models computes the Euclidean distance between the generated feature vectors.
  • The sigmoid activation at the end maps the distance value to the target label.
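The distance-plus-sigmoid idea above can be sketched in a few lines of plain Python. This is a simplified illustration with hand-made vectors, not the actual network code; in the real model, the weight and bias of the final layer are learned during training:

```python
import math

def euclidean_distance(a, b):
    # L2 distance between two feature vectors of equal length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b, weight=1.0, bias=0.0):
    # sigmoid over the distance: identical vectors score 0.5 (with zero bias),
    # and the score falls toward 0 as the vectors move apart
    d = euclidean_distance(a, b)
    return 1.0 / (1.0 + math.exp(weight * d + bias))
```

With a learned weight and bias, a threshold on this score decides whether two inputs belong to the same class.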

Setup details

The pre-trained model Keras-VGG-Face-ResNet-50 is trained again to learn the faces in our custom data. The reason for choosing ResNet50 was discussed in the evaluation of face authentication. A custom final layer followed by a sigmoid activation function was implemented on tensor layers to work with the computed Euclidean distance.

SiameseNet



Results

The Siamese network's test accuracy and real-time scores were not up to the expectations set by the theory. I observed this across various data sets, and while varying the data size (kept low per class) and the number of training epochs. Without increasing the per-class training data size, I could not find any improvement in the test-set accuracy score or the real-time accuracy.

The results are shown below,

Applying Low Epochs size : 5
Low_epoch_Loss Low_epoch_Accuracy
Applying Low Epochs size : 50
Low_epoch_Loss Low_epoch_Accuracy



After increasing the number of epochs, the model seemed fine on the cross-validation test data. But when this Siamese model was trained, saved and loaded for a real-time test, it could even score 0% accuracy.

$ Loaded model accuracy : 0 %

The point is that a Siamese network for face authentication with the one shot learning technique discussed here is not reliable, in my observations, or maybe my implementation is wrong (if so, please correct me). Contrary to what the theory says, the Siamese network with a transfer-learned deep neural network could not learn from very little data (4-5 images per class; some even claim 1 image per class, which I could not reproduce), even when a high-performing transfer-learned model was loaded.

Conclusion

One shot learning with a Siamese network may work well with simple convolutional neural networks having only a few layers. That kind of architecture fits similarity-detection tasks such as handwriting recognition and shape-similarity scoring.

If we increase the size of the convolutional network, the learning phase requires more system resources and time, so continuous / online learning is difficult in these situations. Please correct me if anything here is wrong.

References


Thanks to the sources:
  • One shot learning - Siamese
  • Keras VGG-Face
  • One shot learning - Machine Learning Mastery
  • One shot learning - Wikipedia

HttpsURLConnection SSLSocketFactory Configuration With SSL Certificates (2019-09-05)

Contents

  1. Objective
  2. Important Points
  3. Simple Steps
  4. Certificates and Managers
  5. Using CLI
  6. References

Objective

The ultimate aim is to get an HttpsURLConnection working with SSL certificates. After a long struggle, I came up with this post, which should be useful for those who need this and have all, or some, of the files listed below.

  • SomeThing-CA.crt
  • SomeThing.crt
  • SomeThing.key
  • SomeThing.jks
  • SomeThing.p12 or SomeThing.pfx
  • SomeThing.pem
  • SomeThing.csr

These certificate file formats often confuse novices. This post explains a basic and easy way to connect to an HTTPS URL with a small workaround.

Important Points

  • Ultimately you only need SomeThing-CA.crt, SomeThing.crt and SomeThing.key.
  • The other formats are combined or converted forms of these three basic files.
  • SomeThing-CA.crt needs to be added to the trust store of the user machine.
  • The other two (SomeThing.crt and SomeThing.key) need to be added to the key store.

Simple Steps

Step 1 : Create a .p12 or .pfx file from the .key and .crt files using the openssl command. The created file SomeThing.pfx normally contains all three files inside it in binary form.

Step 2 : Create a KeyStore instance of type JKS for the TrustManager and load the certificate SomeThing-CA.crt into the trust store.

Step 3 : Create a KeyStore in the instance of PKCS12 for the KeyManager and load the Certificate SomeThing.pfx in to the key store.

Step 4 : Load these two managers (TrustManager, KeyManager) into the SSLContext and get the SocketFactory for the HttpsURLConnection.

Step 5 : Now we can open the HttpsURLConnection and set the SSLSocketFactory generated from the certificates.

These steps are now implemented in normal Java / Groovy code to get the SSLSocketFactory.

Step 1 : openssl command to export as pfx file

openssl pkcs12 -export -in SomeThing.crt -inkey SomeThing.key -out SomeThing.pfx -certfile  SomeThing-CA.crt

Step 2 : Getting Trust Manager

String caCertPath = "/path/to/SomeThing-CA.crt";
String certPath = "/path/to/SomeThing.pfx"; // the PKCS12 bundle created in Step 1
String passWord = "changeit";

private Certificate getCertificate (String path) throws Exception
{
    CertificateFactory cf = CertificateFactory.getInstance("X.509");
    InputStream caInput = new FileInputStream(new File(path));
    Certificate c = cf.generateCertificate(caInput);
    caInput.close();
    return c;
}
private TrustManager [] getTrustManagers() throws Exception
{
    KeyStore tKeyStore = KeyStore.getInstance("JKS");
    tKeyStore.load (null, null);
    tKeyStore.setCertificateEntry ("CA-Cert", getCertificate (caCertPath));
    TrustManagerFactory tmf = TrustManagerFactory.getInstance (TrustManagerFactory.getDefaultAlgorithm());
    tmf.init (tKeyStore);
    return tmf.getTrustManagers();
}

Step 3 : Getting Key Manager

private KeyManager [] getKeyManagers() throws Exception
{
    KeyStore keyStore = KeyStore.getInstance("PKCS12");
    keyStore.load (new FileInputStream(new File(certPath)), passWord.toCharArray());
    KeyManagerFactory kmf = KeyManagerFactory.getInstance (KeyManagerFactory.getDefaultAlgorithm());
    kmf.init (keyStore, passWord.toCharArray());
    return kmf.getKeyManagers();
}

Step 4 : Getting SocketFactory

SSLSocketFactory getSocketFactory(){
    SSLContext sslContext = SSLContext.getInstance("TLS");
    sslContext.init(getKeyManagers(), getTrustManagers(), new SecureRandom());
    return sslContext.getSocketFactory();
}

Step 5 : Creating HttpsURLConnection

HttpsURLConnection connection = (HttpsURLConnection) new URL(u).openConnection();
connection.setSSLSocketFactory(getSocketFactory());

Certificates and Managers

Formats of certificate files shown below,

.crt .cer .der .p12 .jks .pfx .pem .key

Among these formats, .crt and .key are the primary ones, from which all the others can be generated using the openssl command.

The types of managers used are,

  • Trust Manager / Store
  • Key Manager / Store

For the files added to the key store, the related CA certificate needs to be added to the trust store. Generally $JAVA_HOME/jre/lib/security/cacerts is the JRE's default trust store. You can add a certificate from the command line as shown below:

keytool -importcert -file SomeThing.crt -keystore $JAVA_HOME/jre/lib/security/cacerts -storepass changeit

If you skip this or did not add the CA certificate to the trust store, you will get the exception:

sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.

Alternatively, setting the values via system properties can solve the issue, but this is not recommended.

The related system properties to assign are:

System.setProperty("javax.net.debug", "ssl")
System.setProperty("javax.net.ssl.trustStore", "$JAVA_HOME/jre/lib/security/cacerts")
System.setProperty("javax.net.ssl.trustStorePassword", "changeit") // commonly used password
System.setProperty("javax.net.ssl.trustStoreType", "JKS")
System.setProperty("javax.net.ssl.keyStoreType", "PKCS12")
System.setProperty("javax.net.ssl.keyStore", "/path/to/SomeThing.pfx")
System.setProperty("javax.net.ssl.keyStorePassword", "password_created_for_pfx")

This initialization needs to be done before calling openConnection in the Java code.

Using CLI

The same connection can be made with a normal curl command, as shown below:

curl -k https://url/path --cacert /path/to/SomeThing-CA.crt --cert /path/to/SomeThing.crt  --key /path/to/SomeThing.key  

or, even without the trust-store certificate, we can connect with only the .crt and .key files in curl (the -k flag skips CA verification):

curl -k https://url/path --cert /path/to/SomeThing.crt  --key /path/to/SomeThing.key  

References


Thanks to the references:
  • Certificate Authority
  • java-client-certificates-over-https-ssl
  • SSL Converter

Docker Containers In 5 Minutes (2019-04-05)

Contents

Overview

This post explains the steps involved in containerizing an application with Docker, and also gives an overview of related areas.

Containers

A container is a running instance of a Docker image. Containers are an abstraction at the application layer that packages the code and dependencies together, shipping the application along with a run-time environment.

Containers run on the Docker engine almost like VMs, but not exactly. They have many merits in production, and containerization is currently booming in the IT industry; for example, containerizing an application with Docker or Kubernetes provides enormous capabilities.


Simply put, we release our application as an image that runs in a bounded, isolated OS-like environment, instead of releasing it without its environment and dependencies. This greatly reduces the environment and dependency problems we often face.

Overall we just follow these three steps:

  Step    File           Details
  Build   Dockerfile     Packaging the application with required dependencies and custom files
  Ship    Docker image   Releasing it as an image file globally using a Docker registry
  Run     Container      Running containers from the image, which act as your application in an isolated, VM-like environment

At a very basic level, to understand containers, I am just going to containerize the simple file writer shown below:

filename = "log.txt"
with open(filename, 'w') as myfile:
    myfile.write("Hi Here we are ..!" + '\n')

This code (saved as test.py) just creates a file named log.txt and writes content to it. Let's containerize it and see what happens!

Creating Dockerfile

A Dockerfile is somewhat like a Makefile; it is used to build the Docker image. The file normally contains three types of instructions:

  • Fundamental instruction
  • Configuration instruction
  • Execution instruction

For our scenario, the Dockerfile looks like this:

FROM python:2
COPY test.py /usr/
WORKDIR /usr/
RUN python test.py
CMD bash

Description :

  • The FROM command determines the base image. You can use scratch if you have a fully cross-compiled system.
  • The COPY command copies your local file into the container's file system.
  • The WORKDIR command changes the current directory for subsequent command executions.
  • The RUN command executes a command inside the container at build time.
  • CMD is the container's launch command; the container lives as long as it runs.

Building Docker image

Go to the build directory and make sure the Dockerfile is in your current directory. To build the Docker image, use the command:

docker build -t testimage .

Sending build context to Docker daemon  3.072kB
Step 1/5 : FROM python:2
 ---> 92c086fc9702
Step 2/5 : COPY test.py /usr/
 ---> Using cache
 ---> 1b9aa6ce04cd
Step 3/5 : WORKDIR /usr/
 ---> Using cache
 ---> 6dcef3ef8785
Step 4/5 : RUN python test.py
 ---> Using cache
 ---> 6edf4708cf6c
Step 5/5 : CMD bash
 ---> Using cache
 ---> 21303e2891d6
Successfully built 21303e2891d6
Successfully tagged testimage:latest

This command creates an image with the name and version you provided. When you build:

  • If the base image is not available locally, it will be pulled from the Docker Hub registry.
  • Docker creates a layer for each command in the Dockerfile and caches it.
  • If anything changes after a build, new layers are created for the changed steps.
  • You can't use capital letters in an image name.

You can check the created image by listing available images using the command,

docker image ls

or you can also specify the image name:

docker image ls testimage

REPOSITORY   TAG      IMAGE ID       CREATED      SIZE
testimage    latest   21303e2891d6   3 days ago   914MB

Run as Container

Method 1 :

We can run this image as a container in two steps:

docker container create --name testcontainer testimage
docker container start testcontainer

Method 2 :

Or it can be done in a single step:

docker run -itd --name testcontainer testimage

The flags -itd mean interactive, tty and detach; see the help for more details. We can check the container status using the command:

docker container ls

CONTAINER ID   IMAGE       COMMAND             CREATED         STATUS         PORTS   NAMES
c706bd3fe268   testimage   "/bin/sh -c bash"   5 seconds ago   Up 3 seconds           testcontainer

We can enter the container using the exec facility with a bash command, like shown below:

docker exec -it testcontainer bash 

Inside the container environment,

root@c706bd3fe268:/usr# ls
bin  games  include  lib  local  log.txt  sbin  share  src  test.py

Here we can see the file log.txt, which was generated during the image build (by the RUN python test.py step).

Common facilities

I always suggest checking the official documentation for commands, but here I have shared some often-used commands that I found useful for a quick view. :)

Some Basic commands,

  • Docker inspect - Inspecting the information about the custom object.
  • Docker rm - Removing the object provided in arg.
  • Docker info - Printing the docker information.
  • Docker rename - All objects can be renamed.
  • Docker login - Registry Login
  • Docker logout - Registry Logout

For Images,

  • Docker image
  • Docker search - Search images in customized docker registry.
  • Docker pull - Pulling images from registry.
  • Docker push - Pushing images to the customized registry.
  • Docker save - Save the image as compressed file.
  • Docker load - Load the compressed file as an image.
  • Docker rmi - Removing images.
  • Docker run - Create + Start the container in single command.

For containers,

  • Docker container
  • Docker ps - Process status for current running processes.
  • Docker start - Initiating the container.
  • Docker restart - Restarting the already started container.
  • Docker cp - Copy the files from the local directory to the container path.
  • Docker port - Port mapping on running container.
  • Docker pause
  • Docker unpause

Some Important facilities such as,

  • Docker volume - Mounting the local directory as a volume object to the container.
  • Docker network - Network configuration customizing for bridge, vlan and overlay networks.
  • Docker stats - Statistics on the container information on the machine environment.
  • Docker top - Normal Top command for quick view.
  • Docker secret
  • Docker plugin

Other facilities

Docker has many more awesome facilities such as,

  • Docker compose

    Composing or managing multiple containers by a single configuration file docker-compose.yaml.

  • Docker swarm

    Swarm mode is for managing the docker clusters using manager and worker scenario.

  • Docker Service

    It is used in swarm mode for deploying the application as a service with some facilities such as rollback, scale and update.

  • Docker Stack

    Docker stack is also in swarm mode for managing collection of multiple services.

These facilities used for High availability, distributed processing, scaling and zero down time deployment (almost).

Some Useful

I have also created a single-page guide to help you get going with Docker quickly. It consists of a simple 4-step guide to dockerization.

I have also shared the docker-overview slides, which explain the concepts behind this technology. Have a look to understand it quickly and move on. :)

References

Machine Learning Resources Collection (2019-04-05)

Contents

Update 1 : 11/01/2024 Posts section updated.

Official Docs

Collections

Courses

Posts

GitHub

Books

Docker Containers Most Asked Questions And Answers (2019-03-20)

This post tries to cover rarely shared yet highly demanded techniques in containerization with Docker and Kubernetes. You can enhance this post by contributing to the Docker-tutorial GitHub repo; I have added this kind of material under ## Quick Details in each chapter's readme.md.

What are the Dockerfile Instructions ?

Fundamental Instructions

  • FROM Sets the Base Image for subsequent instructions.
  • ARG Defines a build-time variable.
  • MAINTAINER (Deprecated - use LABEL instead) Sets the Author field of the generated images.
  • ENV Sets environment variable.
  • LABEL Apply key/value metadata to your images, containers, or daemons.

Configuration Instructions

  • RUN Execute any commands in a new layer on top of the current image and commit the results.
  • ADD Copies new files, directories or remote file to container. Invalidates caches. Avoid ADD and use COPY instead.
  • COPY Copies new files or directories to the container. Note that this copies as root, so you have to chown manually regardless of your USER / WORKDIR setting, the same as with ADD.
  • VOLUME Creates a mount point for externally mounted volumes or other containers.
  • USER Sets the user name for following RUN / CMD / ENTRYPOINT commands.
  • WORKDIR Sets the working directory.

Execution Instructions

  • CMD Provide defaults for an executing container.
  • EXPOSE Informs Docker that the container listens on the specified network ports at runtime. NOTE: does not actually make ports accessible.
  • ONBUILD Adds a trigger instruction when the image is used as the base for another build.
  • STOPSIGNAL Sets the system call signal that will be sent to the container to exit.
  • ENTRYPOINT Configures a container that will run as an executable.
How to update port mapping on a running container ?

  • Stop and commit the container as an image.
  • Start the snapshot image of the container with the new port mapping.

docker stop containerName
docker commit containerName tempImageName
docker run -p 8080:8080 --name newContainerName -td tempImageName

Reference : An answer on Stack Overflow by Fujimoto Youichi

How to connect and disconnect containers in docker bridge network ?

Follow these simple steps to connect and disconnect multiple containers with a bridge network.

  • To create a bridge network with default info
docker network create --driver bridge testbridge
  • Let's connect containers to this network
docker network connect testbridge containerOne
docker network connect testbridge containerTwo
  • Check the network for containers
docker network inspect testbridge

"Containers": {
    "2475796b7bb161fafd661eb9e1f23233104ca57915dd88a3fc33aa6dc9d73700": {
        "Name": "containerOne",
        "EndpointID": "45319bd6ce083bf7e7d3015750e35f7644d4a2d3e5db8c27153c613958ab43d2",
        "MacAddress": "02:42:ac:1e:00:03",
        "IPv4Address": "172.30.0.3/16",
        "IPv6Address": ""
    },
    "ae7ca4ac1e4aaa2bab4d53e24f76afa1f83de620d1ce7d244e03cb8707a8448b": {
        "Name": "containerTwo",
        "EndpointID": "166b2c5eea217c9baeeb906c5d83b04d1c1bab93e46ab01bbf6e94fc21c47c81",
        "MacAddress": "02:42:ac:1e:00:02",
        "IPv4Address": "172.30.0.2/16",
        "IPv6Address": ""
    }
},
...

You can verify the containers' network membership from the generated JSON.

How to share the images locally ?

Follow these two reliable steps to share an image as a .tar file:

  • Save that image to as a .tar file.
  • Load the image from .tar file

docker save --output imageName.tar imageName
docker load --input imageName.tar

What is dangling image ?

Dangling images are created when you build a new version of an image without renaming it or updating the version tag; the old image becomes a dangling image, as shown below.

  • List all dangling images
$ docker images -f dangling=true
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
<none>              <none>              3f4ae2ddf543        4 days ago          1.37GB
<none>              <none>              b4c8cecab3bc        8 days ago          655MB
  • To remove all dangling images, Run this command,

$ docker rmi $(docker images -f dangling=true -q)

Reference : What is a dangling image (Stack Overflow)

How to import images inside kubernetes cluster in simple way ?

It can also be done by configuring a registry, but I found this approach simpler. First, go to the build directory and make sure the Dockerfile is there:

-currentdirectory
       |---  Dockerfile 
       |---  Other-Project-Files
  • Start minikube

minikube start

Set the Docker environment via eval in the current shell

eval $(minikube docker-env)

Now build your image:

docker build -t imageName:version .

  • Access the image by changing the image pull policy

kubectl run hello-foo --image=foo:0.0.1 --image-pull-policy=Never

Or it can be set inside the YAML config file like shown below:

- image: imageName:latest
  name: imageName
  imagePullPolicy: Never

References:
  • Kubernetes official docs
  • How to use local docker images with Minikube? (Stack Overflow)

How Deployment used over Pod and ReplicaSet in kubernetes and Why ?

The need for a ReplicaSet:

  • A Pod basically contains one or more containers.
  • The Pod is the most basic entity in Kubernetes.
  • A ReplicaSet is like a manager of Pods which ensures the Pods stay running. It sees each Pod as a replica (also like a job), hence the name ReplicaSet (a set of replicas, i.e. a set of Pods).

Running a Pod alone is risky because:

  • When the machine crashes or something similar happens, the Pod is deleted.
  • That's why ReplicaSets are used to guarantee the Pod's lifecycle.

A Deployment has these merits:

  • It can update the replicas with zero downtime.
  • The Deployment controller contains deployment objects which can create new replicas, remove old replicas (freeing their resources), and roll over to the updated ones.
Machine Learning With Groovy Tutorial (2019-02-20)

Contents

  1. Introduction
  2. Libraries
  3. Steps
    1. Source Code
    2. Run Algorithm
  4. Results
  5. References

Introduction


This is a very simple tutorial post on doing machine learning in Groovy. It covers clustering algorithms such as:

  • DBSCAN - Density-Based Spatial Clustering of Applications with Noise
  • KMeans++
  • FuzzyKMeans
  • Multi-KMeans++

These algorithms differ in their motivation and methodology. A simple, good summary of them is given in the Commons Math3 Java documentation.

View the Document here.

Libraries


Steps


The source code is also shared on GitHub; you can find it at the link above. Here I share it again with some explanation.

Source Code

class ClusterWork
{
    List<DoublePoint> points = new ArrayList<DoublePoint>()
    Map<DoublePoint, List<String>> pointMap = [:]

    ClusterWork(Map table)
    {
        table.each{ k,v ->
            DoublePoint dArr = new DoublePoint(v)
            points.add(dArr)
            if (!(dArr in pointMap.keySet()))
                pointMap[dArr] = []
            pointMap[dArr].add(k)
        }
    }
    List<ClusterDetail> dbscan (double d, int i)
    {
        DBSCANClusterer DBScan = new DBSCANClusterer(d, i)
        collectDetails DBScan.cluster(this.points)
    }
    List<ClusterDetail> fuzzykmean (int k, double fuzziness)
    {
        FuzzyKMeansClusterer fKMean = new FuzzyKMeansClusterer(k, fuzziness)
        collectDetails fKMean.cluster(this.points)
    }
    List<ClusterDetail> multiplekmean (int k, int trials)
    {
        MultiKMeansPlusPlusClusterer mkppc = new MultiKMeansPlusPlusClusterer(new KMeansPlusPlusClusterer(k), trials)
        collectDetails mkppc.cluster(this.points)
    }

    List<ClusterDetail> kmean (int k)
    {
        KMeansPlusPlusClusterer kMean = new KMeansPlusPlusClusterer(k)
        collectDetails kMean.cluster(this.points)
    }
    private List<ClusterDetail> collectDetails(def clusters)
    {
        List<ClusterDetail> ret = []
        clusters.eachWithIndex{ c, ci ->
            c.getPoints().each { pnt ->
                DoublePoint pt = pnt as DoublePoint
                ret.add new ClusterDetail (ci + 1 as Integer, pt, this.pointMap[pt])
            }
        }
        ret
    }
}
class ClusterDetail
{
    int cluster
    DoublePoint point
    List<String> labels
    ClusterDetail(int no, DoublePoint pt, List<String> labs)
    {
        cluster = no; point= pt; labels = labs
    }
}
Run Algorithm

Running the algorithms has multiple steps, as shown below:

  • Step 1. Import all required
  • Step 2. Read a csv file for input data
  • Step 3. Converting it into required data type (Here as Map)
  • Step 4. Construct the source code class
  • Step 5. Run all algorithms above listed

Step 1 :

@Grab('com.xlson.groovycsv:groovycsv:1.1')
@Grab(group='org.apache.commons', module='commons-math3', version='3.6.1')
import org.apache.commons.math3.ml.clustering.DBSCANClusterer
import org.apache.commons.math3.ml.clustering.DoublePoint
import org.apache.commons.math3.ml.clustering.FuzzyKMeansClusterer
import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer
import org.apache.commons.math3.ml.clustering.MultiKMeansPlusPlusClusterer
import static com.xlson.groovycsv.CsvParser.parseCsv
// All imported

Step 2 :

//Read the csv input data
df = new FileReader('data.csv')

Step 3 :

Map<String, double[]> dfMap = [:]
for (line in parseCsv (df))
{
    double [] point= [line.temp.toDouble(), line.humidity.toDouble()]
    dfMap[line.city] = point
}
// Map dfMap formed.

Step 4 :

// Construct the cluster work using our Map
ClusterWork clusterWork = new ClusterWork (dfMap)

// Simple print closure implementation
def showClosure = {detail ->
println "Cluster : " + detail.cluster + " Point : " + detail.point + " Label : "+ detail.labels
}

Step 5 :

// Running All algorithms accordingly
println 'DBSCAN'
clusterWork.dbscan(6, 0).each(showClosure)
println '-----------'

println 'Kmean'
clusterWork.kmean( 5).each(showClosure)
println '-----------'

println 'FuzzyKMean'
clusterWork.fuzzykmean(5, 300).each(showClosure)
println '-----------'

println 'MultipleKMean'
clusterWork.multiplekmean(5, 5).each(showClosure)
println '-----------'

Results


Here I have attached the sample output of the DBSCAN algorithm.

DBSCAN
Cluster : 1 Point : [284.624954535, 76.0] Label : [Vancouver]
Cluster : 1 Point : [282.100480976, 80.0] Label : [Portland]
Cluster : 1 Point : [281.78244858, 80.0] Label : [Seattle]
Cluster : 1 Point : [286.213142193, 71.0] Label : [Saint Louis]
Cluster : 1 Point : [283.994444444, 76.0] Label : [Indianapolis]
Cluster : 1 Point : [284.278140131, 75.0] Label : [Detroit]
Cluster : 1 Point : [286.276495879, 81.0] Label : [Toronto]
Cluster : 1 Point : [290.07866575, 70.0] Label : [Kansas City]
Cluster : 1 Point : [287.009165955, 66.0] Label : [Minneapolis]
Cluster : 1 Point : [284.300133393, 70.0] Label : [Chicago]
Cluster : 1 Point : [285.85044048, 70.0] Label : [Philadelphia]
Cluster : 1 Point : [287.277251086, 68.0] Label : [Boston]
Cluster : 1 Point : [291.553209206, 81.0] Label : [San Diego]
Cluster : 1 Point : [284.59253007, 62.0] Label : [Denver]
Cluster : 1 Point : [289.89855969, 86.0] Label : [Dallas]
Cluster : 1 Point : [289.446243412, 87.0] Label : [San Francisco]
Cluster : 1 Point : [291.857503395, 88.0] Label : [Los Angeles]
Cluster : 1 Point : [288.650991196, 87.0] Label : [Charlotte]
Cluster : 1 Point : [289.373344722, 92.0] Label : [San Antonio]
Cluster : 1 Point : [288.371111111, 92.0] Label : [Houston]
Cluster : 1 Point : [285.860929124, 91.0] Label : [Montreal]
Cluster : 1 Point : [294.064062959, 94.0] Label : [Atlanta]
Cluster : 1 Point : [281.151870096, 93.0] Label : [Pittsburgh]
Cluster : 2 Point : [293.381212832, 21.0] Label : [Las Vegas]
Cluster : 2 Point : [296.654466164, 23.0] Label : [Phoenix]
Cluster : 3 Point : [285.313345004, 49.0] Label : [Albuquerque]
Cluster : 4 Point : [287.48791359, 99.0] Label : [Nashville]
Cluster : 5 Point : [298.393960613, 87.0] Label : [Jacksonville]
Cluster : 5 Point : [299.800641223, 82.0] Label : [Miami]
Cluster : 6 Point : [288.406203155, 57.0] Label : [New York]
Cluster : 7 Point : [307.145199718, 51.0] Label : [Beersheba]
Cluster : 7 Point : [304.4, 51.0] Label : [Haifa, Nahariyya]
Cluster : 7 Point : [303.5, 50.0] Label : [Jerusalem]
Cluster : 8 Point : [304.238014609, 62.0] Label : [Tel Aviv District]
Cluster : 9 Point : [310.327307692, 22.0] Label : [Eilat]

The JFreeChart visualizations of all the algorithms, with cluster details, are attached accordingly. JFreeChart requires additional dependencies such as,

  • JCommon
  • JFreeChart

A scatter plot is used to visualize the clusters. The resulting plots are,


DBSCAN


KMean


Fuzzy KMean


Multiple KMean

References


Thanks to the sources,
  1. Machine Learning with Math3
  2. JFreeChart Doc
  3. Tutorials Point - JFreeChart

Machine Learning Basic Questions And Answers (2019-01-16)
http://bhachauk.github.io/Machine-Learning-Basic-Questions-And-Answers

What is Machine Learning and Why are we using it ?


It may seem a funny question to ask in this technological era. Anyway, let's start with it. Machine learning means computers learn by themselves, without any explicit programming. To achieve this, we need to align the process flow around points such as,

  • Data preparation: The data we prepare is the key factor for machine learning. If we feed in meaningless data, it will hurt the model.
  • Model selection: Model selection is as important as data preparation; the model must match the data. For example, we can't use a model meant for continuous data on nominal data.
  • Other environment configurations: The environment should support the algorithm we run, covering things like,
    • libraries
    • GPU or RAM configuration

Why ? Normal statistical methods such as correlations could be used to do the work, but the key point is that we let the machine learning model perform the various statistical analyses and return summarized, model-based outputs.

What is a Dimensionality Reduction Technique and Why do we need it ?

As I have observed from many open-source posts, dimensionality reduction is mainly used to compress the actual data into a lower, logical dimension by removing unwanted variance elements from the various dimensions. Most of the time it is used to reduce memory consumption and model runtime by removing unwanted data.

It can cover the topics like,

  • Features Filtering
  • Component Analysis
Feature Filtering

  • Feature filtering can also be applied manually, so be careful about what you are doing.
  • Some of the common filters are:
    • High-null-count filter
    • Low-correlation filter
    • Feature importance analysis and filtering
    • Low-variance (or constant-value) filter
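As a small illustration of the last filter above, here is a low-variance filter sketched in NumPy; the threshold and toy matrix are arbitrary assumptions for demonstration.

```python
import numpy as np

def low_variance_filter(X, threshold=0.01):
    # Keep only the columns of X whose variance exceeds the threshold
    variances = X.var(axis=0)
    keep = variances > threshold
    return X[:, keep], keep

# Toy matrix: the middle column is constant, so it carries no information
X = np.array([[1.0, 5.0, 0.2],
              [2.0, 5.0, 0.9],
              [3.0, 5.0, 0.4]])
X_reduced, kept = low_variance_filter(X)
```

The constant middle column is dropped, leaving a 3 x 2 matrix.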
Component Analysis

In these techniques, the actual data is converted into components which carry its characteristics. It is like extracting the most important information, what the data is actually trying to convey to you. Some of the methods are,

  • PCA (Principal Component Analysis) for Linear
  • t-SNE (t-distributed Stochastic Neighbour Embedding) for Non linear
  • UMAP (Uniform Manifold Approximation and Projection)
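To make the component-analysis idea concrete, here is a minimal PCA sketch using only NumPy's SVD (not the scikit-learn API); the random toy data is a made-up assumption.

```python
import numpy as np

def pca(X, n_components):
    # PCA assumes centered data; the principal axes are the right-singular vectors
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # scores in component space

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))   # 100 samples, 5 dimensions
Z = pca(X, n_components=2)      # compressed to 2 dimensions
```

The resulting component scores are uncorrelated with each other, which is exactly the "removing unwanted variance" idea described above.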

Reference : - The Ultimate Guide to 12 Dimensionality Reduction Techniques (with Python codes) by Analytics Vidhya


What are all the best metrics to evaluate the machine learning model and explanation ?


Accuracy is the best-known metric, but it is not the only important one. Besides it, some others are Precision, Recall, F1 Score and ROC.


  • Accuracy
\[Accuracy = {Total Correct Predictions \over Total Data Size}\]
  • Precision
\[Precision = {TruePositive \over Predicted True Count}\]
  • Recall
\[Recall = {TruePositive \over Actual True Count}\]

Precision and Recall are related to the Confusion Matrix values TP, TN, FP and FN. The confusion matrix looks like,

Confusion Matrix
(Actual \ Predicted) Positive Negative
Positive TP FN
Negative FP TN


There may be some confusion between precision and recall :P. To quickly grasp the difference, remember that precision is about the cost of predicting non-true observations as true (false positives). For example,

It is like telling a non-infected patient that he (or) she was infected.

Recall, in turn, is about missing true observations (false negatives).

It is like telling an infected patient that he (or) she was not infected.

From the above two examples, you can see that both are important, and we need a metric that balances them. So what about the F1 Score ?

  • F1 Score
\[F1 Score = {2 * {Precision * Recall \over Precision + Recall}}\]
  • ROC Curve (Receiver Operating Characteristics) and AUC (Area Under ROC Curve)

    This curve is constructed from the parameters,

    • True Positive Rate
    • False Positive Rate
\[TPR = {TP \over {TP + FN}}\] \[FPR = {FP \over {FP + TN}}\]
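All the formulas above can be checked with a small helper that derives every metric from the four confusion-matrix counts; the counts used below are made up for illustration.

```python
def metrics(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of predicted positives, how many were right
    recall    = tp / (tp + fn)   # of actual positives, how many we found
    f1        = 2 * precision * recall / (precision + recall)
    tpr       = recall           # true positive rate (ROC y-axis)
    fpr       = fp / (fp + tn)   # false positive rate (ROC x-axis)
    return accuracy, precision, recall, f1, tpr, fpr

# Hypothetical counts: 80 TP, 20 FP, 20 FN, 80 TN
acc, prec, rec, f1, tpr, fpr = metrics(tp=80, fp=20, fn=20, tn=80)
```

With these balanced counts every metric works out to 0.8 except FPR, which is 0.2.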


ROC Curve


Image source in Ref

Reference : - Machine Learning - Towards data science - ROC and AUC by Google Crash course


What is Regularization in machine learning ?

Regularization is an important technique, as it shapes how the learning proceeds. The word itself suggests keeping the training general rather than letting it converge too tightly on the current training data. It is a technique to protect the model from over-fitting. It can be summarized in the lines below,

  • It controls the learning by introducing changes into the traditional workflow or math functions of the model.
  • It limits the level of non-linearity and the number of iterations, which decides the learning level and provides better test accuracy.
  • It fits the model before it gets over-fitted, which we could call apt-fitting.

Well-known-techniques :

  • L1 Regularization (can make weights exactly zero, compressing the model to attain apt-fitting)
  • L2 Regularization (shrinks the weights close to, but not exactly, zero)
  • Dropout (temporarily removes random units during training, since co-adapted weights cause over-fitting)
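The difference between L1 and L2 can be seen directly in the penalty terms they add to the loss; here is a bare-bones sketch (the lambda value is an arbitrary assumption, not a recommended default).

```python
def l1_penalty(weights, lam=0.01):
    # L1 adds lam * sum(|w|); its gradient pushes small weights exactly to zero
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam=0.01):
    # L2 adds lam * sum(w^2); it shrinks weights toward (not exactly to) zero
    return lam * sum(w * w for w in weights)

w = [0.5, -1.0, 2.0]  # hypothetical weight vector
```

Note how L2 penalizes the large weight (2.0) much more heavily than L1 does, which is why L2 discourages any single weight from dominating.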

Reference : - Regularization post from towards-data-science - Fundamentals of Regularization Techniques by analyticsvidhya - L1 and L2 Regularization - Keras Regularizers - Regularization in Machine Learning - Regularization for simplicity by Google

What is KFold in python and why we need to use it ?


K-Fold is a technique used to evaluate the model, like a test-train split. It internally splits the training data into k groups (commonly 5 or 10). Each group in turn serves as the test data while all the remaining groups are used for training. This technique is commonly known as Cross Validation. It produces a result for each group, so you can see how your machine learning model behaves on the various groups. If your model produces highly varying accuracies, it has over-fitted to some of the classes, which is why it fails to detect the poorly recognized ones. The overall objective is to find the model that works well for all classes in the input data.
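The splitting logic behind K-Fold can be sketched in pure Python (scikit-learn's `KFold` does this plus shuffling and other options); this is an illustration of the idea, not the library implementation.

```python
def kfold_indices(n_samples, k=10):
    # Yield (train_indices, test_indices) for each of the k folds
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        test = indices[start:stop]                 # this fold is the test set
        train = indices[:start] + indices[stop:]   # everything else is training
        yield train, test

folds = list(kfold_indices(10, k=5))  # 10 samples, 5 folds of 2
```

Every sample appears in exactly one test fold, so each observation gets evaluated once.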

Reference : - KFold Doc by scikit - Cross Validation - K-Fold Cross validation


Why we need to use epochs in training ?


The goal of using epochs is to reduce the error rate of the model. In machine learning it really helps improve model accuracy by working toward a good fit. This can be observed with a history plot. We run epochs until the metric saturates at some point, to avoid over-fitting and unnecessary runs. So, in practice, we can't know in advance how many epochs are needed for good results; manual observation is required to find the right value. In deep learning, the sample code would be like,

history = model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, verbose=0)

Reference : - Display Deep Learning Model Training History in Keras


What is seed and why we need to use it ?


This is because of the stochastic nature of the model. Every time you run it, the model starts from randomly initialized values. To get the same results from a machine learning model, we can generate the same random values every time. This is the syntax to set the random seed in the numpy package.

from numpy.random import seed
seed(7)

So you need to be aware of which random package your model uses.

Reference : - Reproducible Results by machine learning mastery

How to determine the number of hidden layers required ?

This question arises often and is also rather complicated. I hope the notes and references below will help you understand why I say that.

  • The number of hidden layers equals one.
  • The number of neurons in that layer is the mean of the neurons in the input and output layers.
  • It is observed that most problems can be solved using the above two rules.
  • If you build a very deep and wide network, you may end up memorizing, or computing at a hidden layer near the input, the results you want at the output layer.
  • Obviously that takes too much (unneeded) computation time and resources.
  • Literally, all you want is generalisation, which is the more important thing to consider.
  • Increasing the depth and width of a network can cause over-fitting, as is also observed in practice.
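The mean rule of thumb above is trivially a one-liner; treat it as a starting point for experimentation, not a law.

```python
def hidden_neurons(n_inputs, n_outputs):
    # Rule of thumb from the notes above: mean of input and output layer sizes
    return (n_inputs + n_outputs) // 2

n = hidden_neurons(10, 2)  # e.g. 10 input features, 2 output classes
```

For a 10-input, 2-output network this suggests a single hidden layer of 6 neurons.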

Reference : - Stack overflow discussion - 5.2 Capacity Over-fitting and Under-fitting - Stack Exchange Question and Answer 1 - Stack Exchange Question and Answer 2

How to replace the layers in NN ? (Collecting)



Due to the nature of this post, you can always find updates here.

Learn Javascript Basics In 30 Minutes (2018-12-21)
http://bhachauk.github.io/Learn-JavaScript-Basics-In-30-Minutes

Contents

Introduction


  • JavaScript is a web language which adds dynamism to the web page.
  • Brendan Eich created JS at Netscape (he later co-founded Mozilla).
  • Designed to be lightweight and to avoid showing errors to the user, using intelligent guessing instead.
  • ECMA (European Computer Manufacturers Association) standardizes JS.

Scripts


Note : Use the browser console to see outputs wherever a script prints them.

Alerts

<head>
<script>
	alert("Hello world !");
</script>
</head>
Variable

<head>
<script>
	var info = "Hi by variable";
	info += "\nSo Hello world"
	alert(info);
</script>
</head>

Note : Type Conversion

  • “4” * “3” evaluates to the number 12 (both strings are coerced to numbers)
  • “4” + “2” evaluates to the string “42” (+ performs concatenation here)
REPL

REPL - Read Eval Print Loop

<head>
<script>
	var info = "Hi by variable";
	info += "\nSo Hello world"
	alert(info);
	console.log(info);
</script>
</head>
DataTypes

Types :

  • Primitive
    • Number
    • String
    • Boolean
  • Object
    • Objects
    • Functions
  • NULL
<head>
<script>
	var x;
	var p = "Now x is : "
	console.log(p + typeof x);

	// datatype : number
	x = 5;
	console.log(p + typeof x);
	
	// datatype : null (special)
	x = null;
	console.log(p + typeof x);

	// datatype : string
	x = "Hi by variable";
	console.log(p + typeof x);

	// Using comments like this line
	// checking for Booleans data type
	x = false;
	var y;
	function chekBool(z){
		if(z){
			console.log(z + ' is True Type.')
		}else{
			console.log(z + ' is False Type.')
		}
	}
	chekBool(x);
	chekBool(y);
	y = 'String';
	chekBool(y);
</script>
</head>

Note :

  • Each char in String uses 16 Bits for storage.
  • Each number uses 64 Bit floating point type.
Operators

Operators
Arithmetic    + - * / %
Conditional   < <= > >=
Incre / Decre ++ --
If Statement

<head>
<script>
	function isEven(z){
		if(z%2 == 0){
			alert(z + ' is Even.');
		}else{
			alert(z + ' is Odd.');
		}
	}
	var x = "0";
	isEven(x);
</script>
</head>
Switch

<head>
<script>
	function isEven(z){
        switch(z % 2){
            case 0: return true;
            default: return false;
        }
	}
	var x = "0";
	isEven(x);
</script>
</head>
Loops

  1. For Loop
  2. While Loop

Using break and continue also explained.

<head>
<script>
    var i=0;
    while (i <= 10)
    {
    	i++;
    	if (i == 5){break;}
    	if (i == 3){continue;}
        console.log('While Loop Iteration : '+i)
    }
    
    for (var j=1; j<=5 ;j++){
        console.log('For Loop Iteration : '+j);
    }
</script>
</head>
Scripts Note:

We can use an external script file inside the HTML to organize the code and keep the build healthier.

//shiv script for IE browsers older than version 9.
<script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
// Local js
<script src="path/to/your/all.js"></script> 
Object

# In Browser Console:

var x = new Object();
x.name = "chitti"
x.version = "2.0"
console.log(x);

Output : {name: “chitti”, version: “2.0”}

JSON

Normally JSON Object would be like,

{
  "root": {
    "binaries": {
        "0": {
            "val": "0"
        },
        "1": {
            "val": 1
        }
    },
    "ternaries": [0, 1, 2]
  }
}

There are some rules, such as:

  • A number can't be a key for the object; keys must be strings.

At first it may look like a complex structure. Play with JSON for a while to become familiar with it.
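The keys-must-be-strings rule is a JSON rule, not a JavaScript one, so it can be verified with Python's standard json module, which silently converts integer dict keys to strings when serializing:

```python
import json

# Integer dict keys are coerced to strings so the output is valid JSON
serialized = json.dumps({1: "one", 0: "zero"})
round_tripped = json.loads(serialized)  # keys come back as strings
```

After the round trip, the keys are the strings "1" and "0", not the numbers 1 and 0.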

List Operations

Types :

  • Basics
    • Sort
  • Comprehension
    • Filter
    • Map
    • Reduce
var x = [1,2];
x.unshift(0)
// x = [0,1,2]
x.shift()
// x = [1,2]
x.push(3)
// x = [1,2,3]
x.pop()
// x = [1,2]
x = [1,2,3,4,5,6]
var s = x.slice(3,7)
//s = [4, 5, 6]
x.splice(3)
// returns [4, 5, 6]; x is now [1, 2, 3]
Sort :

var x = [1, 2, 5, 6, 10, 2, 7];
x.sort()
// Output : [1, 10, 2, 2, 5, 6, 7]

function sort_asc(a,b){ return (a-b)}
function sort_desc(a,b){ return (b-a)}

x.sort(sort_asc);
// Output : [1, 2, 2, 5, 6, 7, 10]
x.sort(sort_desc);
// Output : [10, 7, 6, 5, 2, 2, 1]
Filter :

var x = [1, 2, 5, 6, 10, 2, 7];
x.filter(function(x){return x % 2 == 0})

x.filter(function v(val, id, li){ console.log(val, id, li)})

[2, 6, 10, 2]
1 0 [1, 2, 5, 6, 10, 2, 7]
2 1 [1, 2, 5, 6, 10, 2, 7]
5 2 [1, 2, 5, 6, 10, 2, 7]
6 3 [1, 2, 5, 6, 10, 2, 7]
10 4 [1, 2, 5, 6, 10, 2, 7]
2 5 [1, 2, 5, 6, 10, 2, 7]
7 6 [1, 2, 5, 6, 10, 2, 7]

Map :

var x = [1, 2, 5, 6, 10, 2, 7];
x.map(function(x){return x / 2})

[0.5, 1, 2.5, 3, 5, 1, 3.5]

Reduce :

var x = [1, 2, 5, 6, 10, 2, 7];
x.reduce(function(prev, curr){console.log(prev, curr);prev = curr; return prev},0)

0 1
1 2
2 5
5 6
6 10
10 2
2 7

Functions


  • A function can be used as an argument to another function.
  • A function can also be returned from another function.
content :
Higher Order

function is_even(x){ 
    return x % 2 == 0
}

function a(filter){ 
    var li = [1,2,3,4]; 
    for (var x of li){
        if(filter(x)){ 
            console.log(x)
        }
    }
}

a(is_even)

2 4

Anonymous

var li = [1,2,3,4]; 
li.forEach(
        function (i){ console.log(i)}
        )

1 2 3 4

Nested

function hypotenuse (a, b){
    function sq(x) { return x*x }
    return Math.sqrt(sq(a)+sq(b))
}
Closure

function stepiter (start, step){
    return function (){
        var x = start;
        start += step;
        return x;
    }
}
var iter = stepiter (3,6);
iter()
iter()
iter()

3 9 15

Argument Parse

  • Converting Multiple arguments into Array
function mul_to_arr(){
    return Array.prototype.slice.call(arguments)
}
mul_to_arr(1,2,3,4)

[1,2,3,4]

OOP In JS


This

var square = { side : 4, area : function (){ return (this.side * this.side)}, perimeter : function() { return 4 * this.side;}}
square.area()
square.perimeter()

var square = { side : 4, area : (this.side * this.side), perimeter : (4 * this.side)}
square.area
square.perimeter

16 16 NaN NaN

Constructor and prototype

function Square(x){ this.side = x; this.area = function (){ return this.side * this.side};}
var x = new Square(3)
console.log(x.area())
Square.prototype.perimeter = function() { return (4 * this.side)}
console.log(x.perimeter())

9 12

Others :

  • SetTimeOut
setTimeout(function() {console.log("Delay time 5s over. The line printed.")},5000)

Delay time 5s over. The line printed.

DOM - Document Object Model


The DOM model represents a document with a logical tree like XML and HTML.

document
|__ html
    |__ head
    |   |__ . . .
    |__ body
        |__ . . .

Some methods are :

  • document.getElementById()
  • document.getElementsByClassName()
  • document.getElementsByName()
  • document.getElementsByTagName()

For example we can change the document by Java script like shown below,

document.getElementById("testElem").style.display = "block";

AJAX


AJAX - Asynchronous Java Script and XML

Without reloading the web page, we can communicate with the server using AJAX and update the page. For example, clicking a button can trigger a function() that does the rest.

var xhttp = new XMLHttpRequest();
xhttp.onreadystatechange = function() {
    if (this.readyState == 4 && this.status == 200) {
      console.log(this.responseText);
    }
  };
xhttp.open("GET", "additional.html", true);
xhttp.send()
How To Mine Interesting Informations From Nominal Data (2018-10-09)
http://bhachauk.github.io/How-To-Mine-Interesting-Informations-From-Nominal-Data

Contents

  1. Introduction
  2. Association Rule Mining
  3. Association Rule Mining on Titanic Data
  4. Algorithm Evaluation
  5. References

Introduction


In the real world we deal with various types of data, for example dates, currency, stock rates, categories and ranks. These are not all the same data type, and it is not easy to associate them all in a single line of information. There are a lot of methods in Data Mining to extract associations or information from such complex data. Some methods are,

  • Classification
  • Estimation
  • Prediction
  • Affinity Grouping or Association Rules
  • Clustering
  • Anomaly Detection

In this post, I try to explain the data mining process on a nominal data set.
The technique for extracting interesting information from nominal (categorical) data is Association Rule Mining.

Association Rules Mining

Algorithms:


  • Apriori
  • FP Growth

Parameters:


  1. Support
    • Ratio of the observation count of a particular object to the total count.
    • In other words, the percentage of an object's strength in the total strength.
    • Range [0 - 1]
    \[Support(B) = {Observations containing (B) \over Total Observations}\]
  2. Confidence
    • How much confidence an association has with its pair.
    • Range [0 - 1]
    \[Confidence(A→B) = {Observations containing both (A and B) \over Observations containing A}\]
  3. Lift
    • How much more likely the items occur together than individually.
    • Range [0 - inf]
    • If lift > 1, it is an interesting scenario to consider.
    \[Lift(A→B) = {Confidence (A→B) \over Support (B)}\]
  4. Leverage
    • Range [-1, 1]
    • If leverage = 0, both are independent.
    \[L (A → B) = {S (A→B)} - {S (A) * S (B)}\]
  5. Conviction
    • A metric for the dependency of the consequent on the premise.
    • Range [0 - inf]
    • If conviction = 1, the items are independent.
    • High confidence with low consequent support gives high conviction, meaning the consequent depends strongly on the antecedent.
    \[C (A → B) = {1 - S (B) \over 1 - Confidence (A → B)}\]
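The metrics above can be computed by hand on a toy transaction list; this sketch is illustrative only (it is not the mlxtend implementation used later, and the basket data is made up).

```python
def support(transactions, items):
    # Fraction of transactions containing every item in `items`
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, a, b):
    # Of the transactions containing a, how many also contain b
    return support(transactions, set(a) | set(b)) / support(transactions, a)

def lift(transactions, a, b):
    # Confidence relative to how often b occurs on its own
    return confidence(transactions, a, b) / support(transactions, b)

# Hypothetical market-basket data
T = [{'bread', 'milk'},
     {'bread', 'butter'},
     {'bread', 'milk', 'butter'},
     {'milk'}]
```

For this basket, bread → milk has lift below 1 (8/9), so despite decent confidence the pair is not an interesting association.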

Association Rule Mining on Titanic Data


Ready Up


  • Algorithm : Apriori
  • Language : Python 2.7.15
  • Data Set : Titanic Data From Kaggle

Import Packages


import matplotlib.pyplot as plt
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

import warnings
warnings.filterwarnings("ignore")
import seaborn as sns

Loading Data-set


titanic = pd.read_csv('train.csv')
nominal_cols = ['Embarked','Pclass','Age', 'Survived', 'Sex']
cat_cols = ['Embarked','Pclass','Age', 'Survived', 'Title']
titanic['Title'] = titanic.Name.str.extract('\, ([A-Z][^ ]*\.)',expand=False)
titanic['Title'].fillna('Title_UK', inplace=True)
titanic['Embarked'].fillna('Unknown',inplace=True)
titanic['Age'].fillna(0, inplace=True)
# Replacing Binary with String
rep = {0: "Dead", 1: "Survived"}
titanic.replace({'Survived' : rep}, inplace=True)

Binning Age Column


## Binning Method to categorize the Continous Variables
def binning(col, cut_points, labels=None):
  minval = col.min()
  maxval = col.max()
  break_points = [minval] + cut_points + [maxval]
  if not labels:
    labels = range(len(cut_points)+1)
  colBin = pd.cut(col,bins=break_points,labels=labels,include_lowest=True)
  return colBin

cut_points = [1, 10, 20, 50 ]
labels = ["Unknown", "Child", "Teen", "Adult", "Old"]
titanic['Age'] = binning(titanic['Age'], cut_points, labels)
in_titanic = titanic[nominal_cols]
cat_titanic = titanic[cat_cols]

The Age column is converted from numeric to categorical using the binning method; its categories are ["Unknown", "Child", "Teen", "Adult", "Old"]. It is also ensured that all columns contain only nominal data. The data set is separated into two types. They are,

  • Gender Data
  • Title Data

Gender Data


in_titanic.head()


Embarked Pclass Age Survived Sex
0 S 3 Adult Dead male
1 C 1 Adult Survived female
2 S 3 Adult Survived female
3 S 1 Adult Survived female
4 S 3 Adult Dead male

Title Data


cat_titanic.head()


Embarked Pclass Age Survived Title
0 S 3 Adult Dead Mr.
1 C 1 Adult Survived Mrs.
2 S 3 Adult Survived Miss.
3 S 1 Adult Survived Mrs.
4 S 3 Adult Dead Mr.

Data Visualization with Plots


for x in ['Embarked', 'Pclass','Age', 'Sex', 'Title']:
    sns.set(style="whitegrid")
    ax = sns.countplot(y=x, hue="Survived", data=titanic)
    plt.ylabel(x)
    plt.title('Survival Plot')
    plt.show()

png

png

png

png

png

Analysis - Methodology


  1. Gender Wise
  2. Title Wise

The title is also a keyword that reveals a person's gender. Analysing both fields together would therefore produce results with a 100% association between the two fields.

Example:


  • (Mr.) always associated with Male.
  • (Mrs.) always associated with Female.

Putting these two fields together does not make sense, so the analysis is split into two types.


Gender Analysis


dataset = []
for i in range(0, in_titanic.shape[0]-1):
    dataset.append([str(in_titanic.values[i,j]) for j in range(0, in_titanic.shape[1])])
# dataset = in_titanic.to_xarray()

oht = TransactionEncoder()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
print df.head()

Output:


       1      2      3  Adult      C  Child   Dead    Old      Q      S  \
0  False  False   True   True  False  False   True  False  False   True
1   True  False  False   True   True  False  False  False  False  False
2  False  False   True   True  False  False  False  False  False   True
3   True  False  False   True  False  False  False  False  False   True
4  False  False   True   True  False  False   True  False  False   True

   Survived   Teen  Unknown  female   male
0     False  False    False   False   True
1      True  False    False    True  False
2      True  False    False    True  False
3      True  False    False    True  False
4     False  False    False   False   True

All Nominal Values


print oht.columns_

Output:


[‘1’, ‘2’, ‘3’, ‘Adult’, ‘C’, ‘Child’, ‘Dead’, ‘Old’, ‘Q’, ‘S’, ‘Survived’, ‘Teen’, ‘Unknown’, ‘female’, ‘male’]

Implementing Apriori Algorithm:


output = apriori(df, min_support=0.2, use_colnames=oht.columns_)
print output.head()

    support  itemsets
0  0.242697       (1)
1  0.206742       (2)
2  0.550562       (3)
3  0.528090   (Adult)
4  0.615730    (Dead)

Rules Configuration


config = [
    ('antecedent support', 0.7),
    ('support', 0.5),
    ('confidence', 0.8),
    ('conviction', 3)
]

for metric_type, th in config:
    rules = association_rules(output, metric=metric_type, min_threshold=th)
    if rules.empty:
        print 'Empty Data Frame For Metric Type : ',metric_type,' on Threshold : ',th
        continue
    print rules.columns.values
    print '-------------------------------------'
    print 'Configuration : ', metric_type, ' : ', th
    print '-------------------------------------'
    print (rules)

    support=rules.as_matrix(columns=['support'])
    confidence=rules.as_matrix(columns=['confidence'])

    plt.scatter(support, confidence, edgecolors='red')
    plt.xlabel('support')
    plt.ylabel('confidence')
    plt.title(metric_type+' : '+str(th))
    plt.show()

Output : Config 1: antecedent support = 0.7

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  antecedent support  :  0.7
-------------------------------------
  antecedents                consequents  antecedent support  \
0         (S)                     (male)            0.723596
1         (S)              (Adult, Dead)            0.723596
2         (S)  (female, Adult, Survived)            0.723596
3         (S)               (male, Dead)            0.723596
...

png


Output : Config 2: support = 0.5

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  support  :  0.5
-------------------------------------
  antecedents consequents  antecedent support  consequent support   support  \
0      (male)      (Dead)            0.647191            0.615730  0.524719
1      (Dead)      (male)            0.615730            0.647191  0.524719

   confidence      lift  leverage  conviction
0    0.810764  1.316752  0.126224    2.030636
1    0.852190  1.316752  0.126224    2.386905

png


Output : Config 3: confidence: 0.8

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  confidence  :  0.8
-------------------------------------
     antecedents consequents  antecedent support  \
0    (1, female)  (Survived)            0.105618
1  (Adult, Dead)         (S)            0.319101
2      (2, male)      (Dead)            0.121348
3      (2, Dead)      (male)            0.108989
...

png


Output : Config 4: conviction: 3

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  conviction  :  3
-------------------------------------
   antecedents consequents  antecedent support  consequent support   support  \
0  (1, female)  (Survived)            0.105618            0.384270  0.102247
1    (2, Dead)      (male)            0.108989            0.647191  0.102247
...

png


Gender Result


Interesting Information: Gender Analysis


  • Persons with Sex: female and Pclass: 1 have 96.80 % confidence of Survived: True
  • Persons with Pclass: 2 and Survived: False have 93.81 % confidence of Sex: male

Common Information:


  • Persons with Survived: False and Age: Unknown have 81.88 % confidence of Pclass: 3
  • Persons with Age: Adult and Pclass: 2 have 90.2 % confidence of Embarked: S
  • Persons with Survived: False, Age: Adult and Pclass: 3 have 86.36 % confidence of Embarked: S

Title Analysis


dataset = []
in_titanic=cat_titanic
for i in range(0, in_titanic.shape[0]-1):
    dataset.append([str(in_titanic.values[i,j]) for j in range(0, in_titanic.shape[1])])
# dataset = in_titanic.to_xarray()

oht = TransactionEncoder()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
print df.head()

Output:


       1      2      3  Adult      C  Capt.  Child   Col.   Dead   Don.  \
0  False  False   True   True  False  False  False  False   True  False
1   True  False  False   True   True  False  False  False  False  False
2  False  False   True   True  False  False  False  False  False  False
3   True  False  False   True  False  False  False  False  False  False
....

[5 rows x 30 columns]

All Nominal values:


print oht.columns_

Output:


['1', '2', '3', 'Adult', 'C', 'Capt.', 'Child', 'Col.', 'Dead', 'Don.', 'Dr.', 'Jonkheer.', 'Lady.', 'Major.', 'Master.', 'Miss.', 'Mlle.', 'Mme.', 'Mr.', 'Mrs.', 'Ms.', 'Old', 'Q', 'Rev.', 'S', 'Sir.', 'Survived', 'Teen', 'Title_UK', 'Unknown']

Implementing Apriori Algorithm:


output = apriori(df, min_support=0.2, use_colnames=oht.columns_)
print output.head()

    support  itemsets
0  0.242697       (1)
1  0.206742       (2)
2  0.550562       (3)
3  0.528090   (Adult)
4  0.615730    (Dead)


Rules Configuration


config = [
    ('antecedent support', 0.7),
    ('confidence', 0.8),
    ('conviction', 3)
]

for metric_type, th in config:
    rules = association_rules(output, metric=metric_type, min_threshold=th)
    if rules.empty:
        print 'Empty Data Frame For Metric Type : ',metric_type,' on Threshold : ',th
        continue
    print rules.columns.values
    print '-------------------------------------'
    print 'Configuration : ', metric_type, ' : ', th
    print '-------------------------------------'
    print (rules)

    support=rules.as_matrix(columns=['support'])
    confidence=rules.as_matrix(columns=['confidence'])

    plt.scatter(support, confidence, edgecolors='red')
    plt.xlabel('support')
    plt.ylabel('confidence')
    plt.title(metric_type+' : '+str(th))
    plt.show()

Output : Config 1: antecedent support = 0.7

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  antecedent support  :  0.7
-------------------------------------
  antecedents    consequents  antecedent support  consequent support  \
0         (S)  (Adult, Dead)            0.723596            0.319101
1         (S)          (Mr.)            0.723596            0.579775
2         (S)         (Dead)            0.723596            0.615730
3         (S)        (Adult)            0.723596            0.528090
...

png


Output : Config 2: confidence: 0.8

Empty Data Frame For Metric Type :  support  on Threshold :  0.5
['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  confidence  :  0.8
-------------------------------------
     antecedents consequents  antecedent support  consequent support  \
0  (Adult, Dead)         (S)            0.319101            0.723596
1       (3, Mr.)      (Dead)            0.357303            0.615730
2       (S, Mr.)      (Dead)            0.446067            0.615730
3   (Mr., Adult)         (S)            0.328090            0.723596
...

(Scatter plot: support vs. confidence for confidence : 0.8)

Output : Config 3: conviction: 3

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration : conviction : 3
-------------------------------------
   antecedents consequents  antecedent support  consequent support   support \
0     (3, Mr.)      (Dead)            0.357303             0.61573  0.316854
1  (S, Mr., 3)      (Dead)            0.275281             0.61573  0.244944

   confidence      lift  leverage  conviction
0    0.886792  1.440229  0.096851    3.394382
1    0.889796  1.445107  0.075445    3.486891

(Scatter plot: support vs. confidence for conviction : 3)


Title Result


Interesting Information - Title Analysis:

  • Passengers with Title : Mr., Class : 3 and Embarked : S fall under Survived : Dead with 88.9796 % confidence.
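The rule metrics reported by association_rules can be recomputed by hand from the supports. Taking the (3, Mr.) -> (Dead) row of the conviction output above, a quick sanity check of the standard formulas (confidence = support / antecedent support, lift = confidence / consequent support, leverage = support - antecedent support * consequent support, conviction = (1 - consequent support) / (1 - confidence)):

```python
# Values for the rule (3, Mr.) -> (Dead) from the table above
antecedent_support = 0.357303
consequent_support = 0.615730
support = 0.316854

confidence = support / antecedent_support                       # ~0.8868
lift = confidence / consequent_support                          # ~1.4402
leverage = support - antecedent_support * consequent_support    # ~0.0969
conviction = (1 - consequent_support) / (1 - confidence)        # ~3.394

print(confidence, lift, leverage, conviction)
```

The results match the tabulated 0.886792, 1.440229, 0.096851 and 3.394382 up to rounding.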

How to filter? - A Simple Demo


rules[rules['confidence']==rules['confidence'].min()]
rules[rules['confidence']==rules['confidence'].max()]

Output Tables:


  antecedents consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
8      (True)    (female)             0.38427            0.352809  0.261798    0.681287  1.931035  0.126224    2.030636


    antecedents consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
12  (1, female)      (True)            0.105618             0.38427  0.102247    0.968085  2.519286  0.061661   19.292884



rules = association_rules (output, metric='support', min_threshold=0.1)
rules[rules['confidence'] == rules['confidence'].min()]
rules[rules['confidence'] == rules['confidence'].max()]

Output Tables:


    antecedents            consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
274         (S)  (True, Adult, female)            0.723596             0.14382  0.103371    0.142857  0.993304 -0.000697    0.998876


    antecedents consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
55  (1, female)      (True)            0.105618             0.38427  0.102247    0.968085  2.519286  0.061661   19.292884
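Beyond min/max lookups, boolean masks can be combined with & to filter on several metrics at once. A minimal sketch on a toy rules table with the same columns (the rows are invented for illustration, not the actual Titanic output):

```python
import pandas as pd

# Toy stand-in for an association_rules() result
rules = pd.DataFrame({
    'antecedents': [('S',), ('1', 'female'), ('3', 'Mr.')],
    'consequents': [('Dead',), ('True',), ('Dead',)],
    'confidence':  [0.142857, 0.968085, 0.886792],
    'lift':        [0.993304, 2.519286, 1.440229],
})

# Keep only rules that are both confident and well above independence
strong = rules[(rules['confidence'] > 0.8) & (rules['lift'] > 1.5)]
print(strong)
```

Here only the (1, female) -> (True) row survives: the third rule is confident enough but its lift falls below 1.5.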

Algorithm Evaluation


Use this Python script to evaluate the algorithms Apriori and FP Growth.

The evaluation output would be like,

For Data Matrix : 891 x 5
Number of Individuals : 15
Apriori      : 0.872148990631
FP-Algorithm : 0.0637619495392
--------------------------
For Data Matrix : 17999 x 5
Number of Individuals : 25
Apriori      : 0.493063926697
FP-Algorithm : 0.621915102005
--------------------------
For Data Matrix : 35998 x 5
Number of Individuals : 25
Apriori      : 0.990983963013
FP-Algorithm : 1.18582415581
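The evaluation boils down to timing each algorithm on the same data. A minimal sketch of such a harness using time.perf_counter; the two lambdas are hypothetical stand-ins, to be replaced with the real Apriori and FP Growth calls:

```python
import time

def evaluate(algorithms, data):
    """Run each algorithm once on the same data and record seconds taken."""
    timings = {}
    for name, fn in algorithms.items():
        start = time.perf_counter()
        fn(data)
        timings[name] = time.perf_counter() - start
    return timings

# Hypothetical stand-ins; swap in the actual mining calls for a real benchmark
data = list(range(100000))
timings = evaluate({
    'Apriori': lambda d: sorted(d),
    'FP-Algorithm': lambda d: max(d),
}, data)

for name, seconds in timings.items():
    print('%s : %s' % (name, seconds))
```

Timing a single run like this is noisy; for stable numbers, repeat each call several times and take the best or the mean.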

Conclusion:


Apriori and FP Growth differ in how they read the data: FP Growth is generally considered more efficient than Apriori on larger datasets because it scans the data only twice. In my experiments, however, both behaved similarly and consumed almost the same time for a given dataset; the outcome can vary with the data size and the number of nominal values. In any case, before adopting either algorithm, run the Algorithm-Evaluation script mentioned above to find the one that suits your work.

Also published in Kaggle.

References:


Thanks to the Sources,
- Apriori
- FP Growth
- Association Rule Mining Via Apriori Algorithm in python
- Mining Frequent Items using apriori algorithm
- Finding Frequent Patterns - Efficient - Apriori - Python 3.6
- Data mining with apriori

]]>