Hadoop & Hive Integration Tutorial

Hadoop and Hive with GAUSS

Welcome to the Hadoop and Hive tutorial using GAUSS 16+. This example involves acquiring movie ratings data and determining the average rating each user gave. The tutorial assumes that Hadoop and Hive are both installed and configured. All tests were performed under Ubuntu 14.04.1 with Hadoop 2.7.2 and Hive 1.2.1. References to hduser throughout refer to the local account used for testing.

Data Preparation

These instructions assume your Hadoop cluster is running on Linux. All data preparation commands should be run on the machine where Hadoop is installed (i.e., the machine hosting HDFS).
  • Fetch the sample data:
    $ wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
    $ unzip ml-100k.zip
    Archive:  ml-100k.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
            0  2015-03-31 14:33   ml-100k/
          716  2000-07-19 14:09   ml-100k/allbut.pl
          643  2000-07-19 14:09   ml-100k/mku.sh
         6403  2015-03-31 14:33   ml-100k/README
      1979173  2000-07-19 14:09   ml-100k/u.data
          202  2000-07-19 14:09   ml-100k/u.genre
           36  2000-07-19 14:09   ml-100k/u.info
       236344  2000-07-19 14:09   ml-100k/u.item
          193  2000-07-19 14:09   ml-100k/u.occupation
        22628  2000-07-19 14:09   ml-100k/u.user
      1586544  2001-03-08 11:33   ml-100k/u1.base
       392629  2001-03-08 11:32   ml-100k/u1.test
      1583948  2001-03-08 11:33   ml-100k/u2.base
       395225  2001-03-08 11:33   ml-100k/u2.test
      1582546  2001-03-08 11:33   ml-100k/u3.base
       396627  2001-03-08 11:33   ml-100k/u3.test
      1581878  2001-03-08 11:33   ml-100k/u4.base
       397295  2001-03-08 11:33   ml-100k/u4.test
      1581776  2001-03-08 11:34   ml-100k/u5.base
       397397  2001-03-08 11:33   ml-100k/u5.test
      1792501  2001-03-08 11:34   ml-100k/ua.base
       186672  2001-03-08 11:34   ml-100k/ua.test
      1792476  2001-03-08 11:34   ml-100k/ub.base
       186697  2001-03-08 11:34   ml-100k/ub.test
    ---------                     -------
     16100549                     24 files
  • Place the example data file, u.data, directly into HDFS (this copy is used for the non-HiveQL example):
    $ hadoop fs -put ml-100k/u.data
  • Preview the first 20 lines of the data file we just placed in HDFS:
    $ hadoop fs -cat u.data | head -n20
    userid  movieid rating  unixtime   # (Headers added for clarity)
    196	242	3	881250949
    186	302	3	891717742
    22	377	1	878887116
    244	51	2	880606923
    166	346	1	886397596
    298	474	4	884182806
    115	265	2	881171488
    253	465	5	891628467
    305	451	3	886324817
    6	86	3	883603013
    62	257	2	879372434
    286	1014	5	879781125
    200	222	5	876042340
    210	40	3	891035994
    224	29	3	888104457
    303	785	3	879485318
    122	387	5	879270459
    194	274	2	879539794
    291	1042	4	874834944
    234	1184	2	892079237
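  • (Optional) Confirm the full record count before loading the data into Hive; the ml-100k dataset contains 100,000 ratings, so this should report 100000 lines:
    $ hadoop fs -cat u.data | wc -l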
  • Start the Beeline command-line interface (CLI) for Hive to perform the remaining data preparation steps (we omit specifying 'default' as the database since it is selected automatically):
    $ beeline -u jdbc:hive2://localhost:10000 -n hduser
    • Create a table in Hive to hold our example data:
      0: jdbc:hive2://localhost:10000> CREATE TABLE u_data (
        userid INT,
        movieid INT,
        rating INT,
        unixtime STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;
    • Import our example data into the Hive table we just created:
      0: jdbc:hive2://localhost:10000> LOAD DATA LOCAL INPATH 'ml-100k/u.data' OVERWRITE INTO TABLE u_data;
    • Verify the data was imported by printing the number of records in our table:
      0: jdbc:hive2://localhost:10000> SELECT COUNT(*) FROM u_data;
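    • (Optional) Spot-check that the tab-delimited fields were parsed into the correct columns:
      0: jdbc:hive2://localhost:10000> SELECT * FROM u_data LIMIT 5;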
We now have identical data loaded into a Hive table (u_data) and as a file on HDFS (u.data), and can start working with it.

Example 1 - Local Hadoop batch job

Since the data in u.data is tab-delimited and we don't want the movieid or unixtime columns, our mapper will remove those fields.
GAUSS Mapper (mapcolumns.gss)
//End immediately if no input was supplied on stdin
if eof(__STDIN);end;endif;

//Read all rows of columns 1:3
stats = csvReadM(__STDIN,1,1|3,"\t");

//Format print output as tab-separated
format /rdt;

//Print the first and third columns
print stats[.,1]~stats[.,3];
The reducer will pull in the mapped data, compute the per-user averages, and print the results.
GAUSS Reducer (reducecolumns.gss)
//End immediately if no input was supplied on stdin
if eof(__STDIN);end;endif;

//Read in the first 2 columns
stats = csvReadM(__STDIN, 1, 1|2, "\t");

all_ids = stats[.,1];
user_ids = unique(all_ids);

//Create matrix to hold user id's and average rating
final = user_ids ~ zeros(rows(user_ids), 1);

for i(1, rows(user_ids), 1);
    //Select i'th unique user id
    uid = user_ids[i];
    //Select all ratings by the i'th unique user
    user_ratings = selif(stats[.,2], all_ids .== uid);
    //Calculate average of user's ratings
    final[i,2] = meanc(user_ratings);
endfor;

format /rdt;
print final;
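Before submitting anything to Hadoop, you can sanity-check both scripts locally by piping a sample of the raw data through them. This is only a rough local test (it skips the sort-and-partition step Hadoop performs between map and reduce, which this particular reducer does not depend on since it reads all of its input at once) and assumes the GAUSS paths used throughout this tutorial:
$ head -n 1000 ml-100k/u.data | /home/hduser/gauss16/tgauss -b /home/hduser/gauss16/mapcolumns.gss | /home/hduser/gauss16/tgauss -b /home/hduser/gauss16/reducecolumns.gss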
With the proper scripts in place (ensure the following command has the appropriate paths for the Hadoop jar and GAUSS scripts), we can run the job on the cluster.
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-x.x.x.jar -input u.data -output GAUSSExample1 -mapper '/home/hduser/gauss16/tgauss -b /home/hduser/gauss16/mapcolumns.gss' -reducer '/home/hduser/gauss16/tgauss -b /home/hduser/gauss16/reducecolumns.gss' -numReduceTasks 2
Check the output directory for the results:
$ hadoop fs -ls GAUSSExample1
Found 3 items
-rw-r--r--   1 hduser supergroup          0 2015-12-10 15:38 GAUSSExample1/_SUCCESS
-rw-r--r--   1 hduser supergroup      16522 2015-12-10 15:38 GAUSSExample1/part-00000
-rw-r--r--   1 hduser supergroup      16522 2015-12-10 15:38 GAUSSExample1/part-00001
$ hadoop fs -cat GAUSSExample1/part-00000 | head -n20
	
      1.00000000	      3.61029412	
      3.00000000	      2.79629630	
      5.00000000	      2.87428571	
      7.00000000	      3.96526055	
      9.00000000	      4.27272727	
     10.00000000	      4.20652174	
     12.00000000	      4.39215686	
     14.00000000	      4.09183673	
     16.00000000	      4.32857143	
     18.00000000	      3.88086643	
     21.00000000	      2.67039106	
     23.00000000	      3.63576159	
     25.00000000	      4.05128205	
     27.00000000	      3.24000000	
     29.00000000	      3.64705882	
     30.00000000	      3.76744186	
     32.00000000	      3.31707317	
     34.00000000	      4.05000000	
     36.00000000	      3.80000000
You can merge the output into a single file after the job is complete:
$ hadoop fs -getmerge GAUSSExample1 /local/output/file

Example 2 - Using HiveQL to perform a remote MapReduce job

  1. This example requires either the HortonWorks Hive ODBC Driver or the Microsoft Hive ODBC Driver.
    Note: The driver architecture (32-bit or 64-bit) you select must match your GAUSS installation.
  2. Ensure you can connect successfully via the ODBC Data Source Administrator. An excellent resource for this process is page 7 of the HortonWorks ODBC Hive User Guide.
  3. Run the following file in GAUSS to perform MapReduce using HiveQL:
GAUSS Program (db_hive.gss)
new;

trap 1;
id = dbAddDatabase("ODBC");
trap 0;
if scalmiss(id);
    print "Driver not found.";
    end;
endif;

gauss_home = "/home/hduser/gauss16";
driver = "HortonWorks Hive ODBC Driver";
//driver = "Microsoft Hive ODBC Driver"; // If using Microsoft Hive ODBC Driver
user = ""; // leave blank if not using
password = ""; // leave blank if not using
host = "192.168.0.25";
port = 10000;
use_db = "default";
use_ssl = 1;

// Choose correct authentication type based on provided values
if strlen(user) > 0 or use_ssl;
    if strlen(password) > 0;
        if use_ssl;
            auth_mech = "4";
        else;
            auth_mech = "3";
        endif;
        opts = "AuthMech="$+auth_mech$+";UID="$+user$+";PWD="$+password$+";";
    else;
        opts = "AuthMech=2;UID="$+user$+";";
    endif;
else;
    opts = "AuthMech=0;"; // Specify authentication (None)
endif;

dbSetDatabaseName(id, "DRIVER={"$+driver$+"};HOST="$+host$+";PORT="$+ntos(port)$+";HiveServerType=2;Schema="$+use_db$+";"$+opts);
dbOpen(id);
// Set number of reduction tasks
rid = dbExecQuery(id, "set mapred.reduce.tasks=2;");
// Process query
qid = dbExecQuery(id, "from (select userid,rating from u_data distribute by userid) t1 insert overwrite directory 'GAUSSExample2' reduce userid,rating using '"$+gauss_home$+"/tgauss -b "$+gauss_home$+"/reducecolumns.gss';");
dbClose(id);
Verify the job ran successfully and inspect the output:
$ hadoop fs -ls GAUSSExample2
Found 2 items
-rwxr-xr-x   1 hduser supergroup      16489 2015-12-10 15:33 GAUSSExample2/000000_0
-rwxr-xr-x   1 hduser supergroup      16524 2015-12-10 15:33 GAUSSExample2/000001_0
$ hadoop fs -cat GAUSSExample2/000000_0 | head -n20
\N
      2.00000000      3.70967742	
      4.00000000      4.33333333	
      6.00000000      3.63507109	
      8.00000000      3.79661017	
     10.00000000      4.20652174	
     12.00000000      4.39215686	
     14.00000000      4.09183673	
     16.00000000      4.32857143	
     18.00000000      3.88086643
     20.00000000      3.10416667	
     22.00000000      3.35156250	
     24.00000000      4.32352941	
     26.00000000      2.94392523	
     28.00000000      3.72151899	
     30.00000000      3.76744186	
     32.00000000      3.31707317	
     34.00000000      4.05000000	
     36.00000000      3.80000000	
     38.00000000      3.71900826
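As in Example 1, you can merge the per-reducer output files into a single local file once the job completes:
$ hadoop fs -getmerge GAUSSExample2 /local/output/file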
The possibilities for MapReduce with GAUSS are endless. The level of integration can be modified to suit your needs, such as only using GAUSS for the Reduce phase.
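For example, here is a minimal sketch of a reduce-only variant of Example 1, replacing the GAUSS mapper with the standard Unix cut command to extract the userid and rating columns (the GAUSSExample3 output directory name is just a placeholder; adjust the jar and GAUSS paths as in Example 1):
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-x.x.x.jar -input u.data -output GAUSSExample3 -mapper 'cut -f 1,3' -reducer '/home/hduser/gauss16/tgauss -b /home/hduser/gauss16/reducecolumns.gss' -numReduceTasks 2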
