DataCloud Blog

Cloud, IoT, Big data tools, Data analytics techniques and tutorials

Author: Peter Kortvelyesi

Hadoop

Hadoop and its supporting open-source tools for big data analysis

Configure Apache Kylin with ODBC to work with MS PowerBI

PowerBI and Kylin – reporting from Hadoop via ODBC. This article discusses how to set up…

Evaluation of Apache Kylin 1.5.4.1 with HDP 2.5, performance comparison with Hive

Apache Kylin is a data cube solution on top of Hadoop providing an ODBC interface…

Performance test of Pig vs Hive with code examples

Performance testing of high-level Hadoop query languages with example scripts. Analysis of NOAA weather data:…

Create a Hadoop Cluster easily by using PXE boot, Kickstart, Puppet and Ambari to auto-deploy nodes

This tutorial showcases the unattended, automatic installation of multiple CentOS 6.5 x86_64 Hadoop…

Security of a Hadoop cluster

Hadoop is not only a data processing solution but also a data warehouse solution. When we are…

A multi-tiered Big Data warehouse & processing facility

The article details an example setup for a multi-tiered data warehouse and processing facility using…

Environment setup for big data analytics

This article covers basic tools and technologies to use when taking the first steps on…

Sqoop – Hadoop to/from relational DB data migration

Sqoop is an efficient data transfer tool between Hadoop and structured datastores, such as relational…

Storm – Real-time data processing

Storm real-time data processor: while Hadoop is mainly used for batch processing of data, Storm…

Mahout – Data Mining, Machine Learning

Mahout is a machine learning library that can be used on top of Hadoop HDFS. It…

Ambari – Hadoop management

Ambari simplifies Hadoop management by providing an easy-to-use provisioning and monitoring interface for…

Hue – Web UI for Hadoop

Hue is an easy-to-use, user-friendly web UI for Hadoop, featuring File browser…

Spark – Cyclic, high-performance data processing on top of Hadoop

Spark is a high-performance, cyclic data-flow, in-memory computing platform that proves to be a lot…

ZooKeeper – Hadoop coordination and configuration management

ZooKeeper provides coordination, configuration management, naming, synchronization and group services for large Hadoop clusters. ZooKeeper itself…

HBase – Hadoop storage for tables

HBase is a big data store for tables that need random read/write access. Billions of…

Pig – Hadoop Query Language

Pig is a platform for large dataset analysis, consisting of a language called Pig Latin.…

Hive – Hadoop Query Language

Hive is used to query and manage large datasets on a Hadoop cluster using an SQL-like…

MapReduce – Hadoop’s essential concept

MapReduce is a programming model used for processing large datasets with Hadoop. Map: to filter…

Hadoop big data framework – Hadoop virtual machines

Hadoop is an open-source framework for processing large amounts of data across clusters of computers…