Hive, Machine Learning, Sqoop, Statistics, Tableau

Box plot – data distribution

The line in the middle is the median value of the data. The median is used over the mean because it is more robust to outlier values. The first quartile is essentially the 25th percentile; i.e., 25% of the points in the data fall below that value. The third quartile is the 75th percentile; i.e., 75% of the points in the data fall below that value. The min and max values represent the lower and upper ends of our data range.

A box plot perfectly illustrates what we can do with basic statistical features:

  • When the box plot is short, it implies that many of your data points are similar, since the values fall within a small range
  • When the box plot is tall, it implies that many of your data points are quite different, since the values are spread over a wide range
  • If the median value is closer to the bottom, then we know that most of the data has lower values. If the median value is closer to the top, then we know that most of the data has higher values. Basically, if the median line is not in the middle of the box, it is an indication of skewed data.
  • Are the whiskers very long? That means your data has a high standard deviation and variance, i.e., the values are spread out and highly varying. If you have long whiskers on one side of the box but not the other, then your data may be highly varying in only one direction.
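The quartile values behind a box plot can be computed directly. Below is a minimal Scala sketch using the nearest-rank method (the object and helper names are mine, not from the post; real statistics libraries usually interpolate between ranks):

```scala
object BoxPlotStats {
  // Approximate the p-th percentile by indexing into the sorted data
  // (nearest-rank method: take the smallest value with at least p% of
  // the data at or below it).
  def percentile(data: Seq[Double], p: Double): Double = {
    val sorted = data.sorted
    val idx = math.ceil(p / 100.0 * sorted.length).toInt - 1
    sorted(math.max(idx, 0))
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)
    val q1     = percentile(data, 25)   // first quartile: 25% of points below
    val median = percentile(data, 50)   // the line in the middle of the box
    val q3     = percentile(data, 75)   // third quartile: 75% of points below
    println(s"min=${data.min} q1=$q1 median=$median q3=$q3 max=${data.max}")
    // prints: min=1.0 q1=2.0 median=4.0 q3=6.0 max=8.0
  }
}
```

The five numbers printed are exactly the five features the box plot draws: the box spans q1 to q3, the middle line is the median, and the whiskers reach toward min and max.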
Hive, Interview, Scala, Sqoop

Interview questions – MNC

1) Explain your project architecture.
2) How does your project use streaming data?
3) How does Kafka handle the data-ingestion operation?
4) In what format will the result set be stored? In a List(1,2,3,4,5), find the values greater than 2.
5) How will you find the number of partitions in an RDD?
6) How do you split the values and load a file into an RDD?
7) What is the difference between val and lazy val?
8) How does the DAG work in Spark/Scala?
9) What is the difference between SQLContext and HiveContext?
10) What about SparkSession?
11) If a Spark job fails, where will you find the logs and how will you rectify the issue?
12) What is the difference between narrow and wide transformations in Spark?
13) What errors have you faced in Hive and how did you rectify them?
14) How do you remove duplicates in Hive without using the distinct or rank functions?
15) What is the difference between managed and external tables, and how is the metadata stored? Can the metastore be stored in HDFS?
16) What are the different types of joins used in Hive?
17) What file formats are used in your project, and which one would you choose for a given scenario?
18) How will you store data in an external table in Avro format?
19) How does version control work in GitHub? How will you merge files in GitHub?
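Two of the Scala questions above (4 and 7) can be answered with a short, runnable sketch:

```scala
object ScalaAnswers {
  def main(args: Array[String]): Unit = {
    // Q4: find the values greater than 2 in a list
    val result = List(1, 2, 3, 4, 5).filter(_ > 2)
    println(result)          // List(3, 4, 5)

    // Q7: a val is evaluated once, at the point of definition;
    // a lazy val is evaluated once, on first access, then cached.
    var evaluations = 0
    lazy val expensive = { evaluations += 1; 42 }
    println(evaluations)     // 0 -- body not evaluated yet
    println(expensive)       // 42 -- evaluated on first access
    println(expensive)       // 42 -- cached, not re-evaluated
    println(evaluations)     // 1
  }
}
```

The `evaluations` counter makes the val/lazy val distinction observable: the block runs zero times before the first access and exactly once no matter how many times the lazy val is read afterwards.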

Hive, Sqoop

Hive Partition and Bucketing

Github link

Hive Dynamic partition:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create temporary table temp_us (txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

load data local inpath '/home/hduser/txns1' into table temp_us;

create table txnrecordspar1(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, spendby STRING) partitioned by(state STRING) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

insert into table txnrecordspar1 partition(state) select txnno,txndate,custno,amount,category,product,city,spendby,state from temp_us;

show partitions txnrecordspar1;

hdfs dfs -ls /user/hive/warehouse/inceptez.db/txnrecordspar1/
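Conceptually, dynamic partitioning routes each row to a directory named after its partition-column value, so the `hdfs dfs -ls` above shows one `state=...` subdirectory per distinct state. That routing can be sketched in plain Scala (the case class and its fields are illustrative, trimmed down from the table above):

```scala
object DynamicPartitionSketch {
  // Simplified row: only the columns needed to show the routing.
  case class Txn(txnno: Int, city: String, state: String)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Txn(1, "Chennai",   "TN"),
      Txn(2, "Bangalore", "KA"),
      Txn(3, "Madurai",   "TN")
    )
    // Hive derives one partition directory per distinct value of the
    // partition column, e.g. .../txnrecordspar1/state=TN/
    val partitions = rows.groupBy(_.state)
    partitions.foreach { case (state, txns) =>
      println(s"state=$state -> ${txns.size} row(s)")
    }
  }
}
```

This is why the partition column (`state`) is listed last in the select of the insert statement and is not part of the target table's regular column list: Hive peels it off each row to decide the destination directory.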

Hive Bucketing:

set hive.enforce.bucketing=true;

create temporary table temp_us (txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

load data local inpath '/home/hduser/txns1' into table temp_us;

create table txnrecordsbuck(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, spendby STRING) partitioned by(state STRING) clustered by (city) into 5 buckets row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

set hive.enforce.bucketing=true;

insert into table txnrecordsbuck partition(state) select txnno,txndate,custno,amount,category,product,city,spendby,state from temp_us;
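Bucketing assigns each row to one of the 5 buckets by hashing the clustering column modulo the bucket count. The sketch below illustrates that rule in Scala; note that Java's `String.hashCode` stands in for Hive's internal hash function here, so the actual bucket numbers Hive produces may differ:

```scala
object BucketingSketch {
  // Bucket assignment rule: hash(clustering column) mod numBuckets.
  // The double-mod keeps the result non-negative for negative hash codes.
  def bucketFor(city: String, numBuckets: Int): Int =
    ((city.hashCode % numBuckets) + numBuckets) % numBuckets

  def main(args: Array[String]): Unit = {
    val cities = Seq("Chennai", "Bangalore", "Hyderabad", "Madurai")
    cities.foreach { c =>
      println(s"$c -> bucket ${bucketFor(c, 5)}")
    }
  }
}
```

Because the assignment is deterministic, all rows with the same city always land in the same bucket file, which is what makes bucketed map-side joins and efficient sampling possible.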

Sqoop incremental load with example of job creation in sqoop:

CREATE TABLE customer (custid INT,firstname VARCHAR(20),lastname VARCHAR(20),city varchar(50),age int,createdt date,transactamt int );
insert into customer values(1,'Arun','Kumar','chennai',33,'2017-09-20',100000);
insert into customer values(2,'srini','vasan','chennai',33,'2017-09-21',10000);
insert into customer values(3,'vasu','devan','banglore',39,'2017-09-23',90000);
insert into customer values(4,'mohamed','imran','hyderabad',33,'2017-09-24',1000);
insert into customer values(5,'arun','basker','chennai',23,'2017-09-20',200000);
insert into customer values(6,'ramesh','babu','manglore',39,'2017-09-21',100000);
Job creation: sqoop job --create sqoopimp -- import --connect jdbc:mysql://localhost/custdb --username root --password root --table customer -m 1 --target-dir /user/hduser/sqoopdir --incremental append --check-column custid --last-value 0

Simple sqoop import: sqoop import --connect jdbc:mysql://localhost/custdb --username root --password root --table customer -m 1 --target-dir /user/hduser/sqoopdir

Manually giving the last value: sqoop import --connect jdbc:mysql://localhost/custdb --username root --password root --table customer -m 1 --target-dir /user/hduser/sqoopdir --incremental append --check-column custid --last-value 6

checking the directory:
hdfs dfs -ls /user/hduser/sqoopdir/

checking values:
hdfs dfs -cat /user/hduser/sqoopdir/*
creating an external table in Hive pointing to the Hadoop directory:

create external table customer(custid INT, firstname STRING, lastname STRING, city STRING, age INT, createdt STRING, transactamt int) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile location '/user/hduser/sqoopdir';

verifying the records:
select * from customer;

inserting incremental records in MySQL:
insert into customer values(7,'ramesh','babu','manglore',39,'2017-09-21',100000);
insert into customer values(8,'ramesh','babu','manglore',39,'2017-09-21',100000);

Running the Sqoop job:
sqoop job –exec sqoopimp;
verifying the records:
select * from customer;
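The incremental-append behaviour driven by `--check-column` and `--last-value` can be sketched in Scala: each run takes only the rows whose check column exceeds the stored last value, then advances that value for the next run. The `Customer` case class below is a simplified stand-in for the MySQL table; only the filter/last-value logic is modelled, not the JDBC transfer itself:

```scala
object IncrementalAppendSketch {
  // Simplified row: only the check column and one payload field.
  case class Customer(custid: Int, firstname: String)

  // One incremental run: keep rows with custid > lastValue,
  // and return the new last value to store for the next run.
  def incrementalImport(source: Seq[Customer], lastValue: Int): (Seq[Customer], Int) = {
    val newRows = source.filter(_.custid > lastValue)
    val newLast = if (newRows.isEmpty) lastValue else newRows.map(_.custid).max
    (newRows, newLast)
  }

  def main(args: Array[String]): Unit = {
    var table = (1 to 6).map(i => Customer(i, s"cust$i"))
    // First execution: like the job created with --last-value 0
    val (firstRun, last1) = incrementalImport(table, 0)
    println(s"first run imported ${firstRun.size} rows, last value $last1")

    // New rows arrive in MySQL (custid 7 and 8), then the job runs again
    table ++= Seq(Customer(7, "cust7"), Customer(8, "cust8"))
    val (secondRun, last2) = incrementalImport(table, last1)
    println(s"second run imported ${secondRun.size} rows, last value $last2")
  }
}
```

This is why `sqoop job --exec sqoopimp` picks up only customers 7 and 8 after the two inserts above: the saved job persists the last value (6) from the first run and resumes from there.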