Timbakto

March 2, 2014

Big Data Tool- Hive: A simple overview

In this blog I introduce some basics of the Hive data warehouse system. The main goal is to understand Hive and then start working with it. When we talk about Hive, the first question that comes up is:

What is Hive?

Hive is a data warehousing package/infrastructure built on top of Hadoop. It provides an SQL-like dialect, called Hive Query Language (HQL), for querying data stored in a Hadoop cluster. Like all SQL dialects in widespread use, HQL doesn't fully conform to any particular revision of the SQL standard. It is perhaps closest to MySQL's dialect, but with significant differences: Hive offers no support for row-level inserts, updates, and deletes, and it doesn't support transactions, so we can't compare it with a traditional RDBMS. Hive adds extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs. It is well suited for batch processing of data: log processing, text mining, document indexing, customer-facing business intelligence, predictive modeling, hypothesis testing, etc.
It is not designed for online transaction processing and does not offer real-time queries.
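As a quick illustration, a basic HQL session might look like the sketch below (the `logs` table, its columns, and the file path are hypothetical, for illustration only):

```sql
-- Create a table over tab-delimited log files.
CREATE TABLE logs (ts STRING, level STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Load a file from HDFS into the table
-- (a batch append, not a row-level insert).
LOAD DATA INPATH '/data/app.log' INTO TABLE logs;

-- A familiar SQL-style query, executed as a
-- MapReduce job on the cluster under the hood.
SELECT level, COUNT(*) FROM logs GROUP BY level;
```

Note that `LOAD DATA` simply moves files into Hive's storage location; this bulk, file-at-a-time style is exactly the batch-processing model described above.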

Newer versions of Hive are working to provide insert, update, and delete functionality with full ACID support. You can check it here: Adding ACID to Apache Hive, or see the JIRA below for more detail.
HIVE-5317 - Implement insert, update, and delete in Hive with full ACID support
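Once this lands (it shipped in Hive 0.14, per the JIRA), row-level updates and deletes work only on bucketed ORC tables marked transactional. A sketch of what that looks like, assuming transactions are enabled on the cluster (the `users` table here is hypothetical):

```sql
-- ACID operations require a bucketed, ORC-backed,
-- transactional table.
CREATE TABLE users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Row-level operations that classic Hive could not do.
UPDATE users SET name = 'Bob' WHERE id = 1;
DELETE FROM users WHERE id = 2;
```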

 

September 16, 2012

Complex Event Processing Tools

Hi... Here I am talking about some of the most useful event processing tools, basically log processing tools for distributed or cloud environments. I have collected them from various sources and list them here with short descriptions. I hope this will be useful in finding the best tools for log and data collection from multiple sources.

1. Flume:

It is an Apache project, used for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple architecture and works on the basis of streaming data flows. It collects data from various sources and delivers it to Hadoop's HDFS.
There are three basic components in Flume:
a) Agent - lives on the source machine from where we need to collect data or logs.
b) Collector - agents sink data to a collector, which finally writes it to HDFS.
c) Master - keeps the configuration of all agents and collectors and manages them.
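For reference, the newer Flume NG (1.x) replaces this agent/collector/master model with sources, channels, and sinks, all configured in a single properties file. A minimal sketch (agent name, component names, and paths are all hypothetical) might look like:

```properties
# Name the components of this agent (names are arbitrary).
agent.sources = tail-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

# Source: tail a local log file.
agent.sources.tail-src.type = exec
agent.sources.tail-src.command = tail -F /var/log/app.log
agent.sources.tail-src.channels = mem-ch

# Channel: buffer events in memory between source and sink.
agent.channels.mem-ch.type = memory

# Sink: write the events to HDFS.
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/logs
agent.sinks.hdfs-sink.channel = mem-ch
```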

Please visit Wikipedia and http://archive.cloudera.com/cdh/3/flume/UserGuide/ for more information.

2. Scribe:

Scribe is an open source project from Facebook, used as a log aggregation framework. It has a simple API and is simple to use. A scribe server runs on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups.

We can get more details about it from here:
https://github.com/facebook/scribe/wiki

3. Kafka:

Developed by LinkedIn, Kafka is a distributed publish-subscribe messaging system, basically used here for log collection.

 


Structured Unstructured and Semi Structured data

Structured Data: Data that resides in fixed fields within a record or file. It is identifiable because it is organized in a structure, and it can be searched by data type within the content.

Unstructured Data: Data that doesn't have any fixed fields for records. The record structure and size can change at any moment.