Hadoop

Hadoop is an Apache Software Foundation project that provides two important things:

  1. A distributed filesystem called HDFS (Hadoop Distributed File System)
  2. A framework and API for building and running MapReduce jobs

HDFS

HDFS is structured similarly to a regular Unix filesystem, except that data storage is distributed across several machines. It is not intended as a replacement for a regular filesystem, but rather as a filesystem-like layer for large distributed systems to use. It has built-in mechanisms to handle machine outages, and it is optimized for throughput rather than latency.

There are two and a half types of machine in an HDFS cluster:

  • Datanode – where HDFS actually stores the data; there are usually quite a few of these.
  • Namenode – the 'master' machine. It controls all the metadata for the cluster, e.g. which blocks make up a file and which datanodes those blocks are stored on.
  • Secondary Namenode – this is NOT a backup namenode, but a separate service that keeps a copy of both the edit logs and the filesystem image, merging them periodically to keep the size reasonable.
    • it is being deprecated in favor of the backup node and the checkpoint node, but the functionality remains similar (if not the same)

[Diagram: HDFS architecture]

Data can be accessed using either the Java API, or the Hadoop command line client. Many operations are similar to their Unix counterparts.

Here are some simple examples:

list files in the root directory

hadoop fs -ls /

list files in my home directory

hadoop fs -ls ./

cat a file (decompressing if needed)

hadoop fs -text ./file.txt.gz

upload and retrieve a file

hadoop fs -put ./localfile.txt /home/vishnu/remotefile.txt

hadoop fs -get /home/vishnu/remotefile.txt ./local/file/path/file.txt

Note that HDFS is optimized differently than a regular file system. It is designed for non-realtime applications demanding high throughput instead of online applications demanding low latency. For example, files cannot be modified once written, and the latency of reads/writes is really bad by filesystem standards. On the flip side, throughput scales fairly linearly with the number of datanodes in a cluster, so it can handle workloads no single machine would ever be able to.

HDFS also has a bunch of unique features that make it ideal for distributed systems:

  • Failure tolerant – data can be duplicated across multiple datanodes to protect against machine failures. The industry standard seems to be a replication factor of 3 (everything is stored on three machines).
  • Scalability – data transfers happen directly with the datanodes, so your read/write capacity scales fairly well with the number of datanodes.
  • Space – need more disk space? Just add more datanodes and re-balance (see the examples just after this list).
  • Industry standard – lots of other distributed applications build on top of HDFS (HBase, MapReduce).
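
As a quick illustration of the replication and re-balancing features (the path below is just a placeholder), the command line client can inspect and adjust them:

show which datanodes hold each block of a file

hadoop fsck /some/path -files -blocks -locations

change the replication factor of a path (and wait for it to take effect)

hadoop fs -setrep -w 3 /some/path

spread blocks evenly across datanodes after adding new ones

hadoop balancer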
HDFS Resources

For more information about the design of HDFS, you should read through the Apache documentation page. In particular, the streaming and data access section has some really simple and informative diagrams of how data reads and writes actually happen.

MapReduce

The second fundamental part of Hadoop is the MapReduce layer. This is made up of two sub components:

  • An API for writing MapReduce workflows in Java.
  • A set of services for managing the execution of these workflows.

The Map and Reduce APIs

The basic premise is this:

  1. Map tasks perform a transformation.
  2. Reduce tasks perform an aggregation.

In Scala, a simplified version of a MapReduce job might look like this:

def map(lineNumber: Long, sentence: String) = {
  val words = sentence.split(" ")
  words.foreach { word =>
    output(word, 1)      // emit a (word, 1) pair for every word in the line
  }
}


def reduce(word: String, counts: Iterable[Long]) = {
  var total = 0L
  counts.foreach { count =>
    total += count
  }
  output(word, total)    // emit (word, total number of occurrences)
}

Notice that the output of a map or reduce task is always a KEY, VALUE pair: each call to output() emits exactly one key and one value. The input to a reduce is KEY, ITERABLE[VALUE]. Reduce is called exactly once for each key output by the map phase, and the ITERABLE[VALUE] is the set of all values output by the map phase for that key.

So if you had map tasks that output

map1: key: foo, value: 1
map2: key: foo, value: 32

Your reducer would receive:

key: foo, values: [1, 32]

Counterintuitively, one of the most important parts of a MapReduce job is what happens between map and reduce: there are three other stages, partitioning, sorting, and grouping. In the default configuration, the goal of these intermediate steps is to ensure that the values for each key are grouped together, ready for the reduce() function. APIs are also provided if you want to tweak how these stages work (for example, to perform a secondary sort).
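
As a rough local analogy of that flow (this is not how Hadoop executes a job; it just mirrors the data movement for the word-count example above, with input.txt as a placeholder file):

# "map": emit one word per line (each word is a key with an implicit count of 1)
tr -s '[:space:]' '\n' < input.txt > mapped.txt

# "partition/sort/group": sorting brings identical keys together
sort mapped.txt > grouped.txt

# "reduce": aggregate the grouped values for each key
uniq -c grouped.txt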

Here’s a diagram of the full workflow to try and demonstrate how these pieces all fit together, but really at this stage it’s more important to understand how map and reduce interact rather than understanding all the specifics of how that is implemented.

[Diagram: MapReduce workflow]

What's really powerful about this API is that there is no dependency between any two tasks of the same type. To do its job, a map() task does not need to know about the other map tasks, and similarly a single reduce() task has all the context it needs to aggregate any particular key; it does not share any state with the other reduce tasks.

Taken as a whole, this design means that the stages of the pipeline can be easily distributed to an arbitrary number of machines. Workflows requiring massive datasets can be easily distributed across hundreds of machines because there are no inherent dependencies between the tasks requiring them to be on the same machine.

MapReduce API Resources

If you want to learn more about MapReduce (generally, and within Hadoop) I recommend you read the Google MapReduce paper, the Apache MapReduce documentation, or maybe even the hadoop book. Performing a web search for MapReduce tutorials also offers a lot of useful information.

To make things more interesting, many projects have been built on top of the MapReduce API to ease the development of MapReduce workflows. For example, Hive lets you write SQL to query data on HDFS instead of writing Java.
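
As a rough illustration (the table name here is made up, and it assumes the table has already been defined over files in HDFS), a word count like the one above could be expressed as a single Hive query run from the shell:

hive -e "SELECT word, COUNT(*) FROM words GROUP BY word;"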

The Hadoop Services for Executing MapReduce Jobs

Hadoop MapReduce comes with two primary services for scheduling and running MapReduce jobs. They are the Job Tracker (JT) and the Task Tracker (TT). Broadly speaking the JT is the master and is in charge of allocating tasks to task trackers and scheduling these tasks globally. A TT is in charge of running the Map and Reduce tasks themselves.

When running, each TT registers itself with the JT and reports the number of 'map' and 'reduce' slots it has available; the JT keeps a central registry of these across all TTs and allocates them to jobs as required. When a task is completed, the TT re-registers that slot with the JT and the process repeats.

Many things can go wrong in a big distributed system, so these services have some clever tricks to ensure that your job finishes successfully:

  • Automatic retries – if a task fails, it is retried N times (usually 3) on different task trackers.
  • Data locality optimizations – if you co-locate a TT with a HDFS Datanode (which you should) it will take advantage of data locality to make reading the data faster
  • Blacklisting a bad TT – if the JT detects that a TT has too many failed tasks, it will blacklist it. No tasks will then be scheduled on this task tracker.
  • Speculative Execution – the JT can schedule the same task to run on several machines at the same time, just in case some machines are slower than others. When one version finishes, the others are killed.

Here's a simple diagram of a typical deployment with TTs deployed alongside datanodes. [Diagram: Hadoop infrastructure]

MapReduce Service Resources

For more reading on the JobTracker and TaskTracker, check out Wikipedia or the Hadoop book. I find the Apache documentation pretty confusing when just trying to understand these things at a high level, so again, doing a web search can be pretty useful.

Cluster

A cluster is a group of computers connected via a network. A Hadoop cluster is likewise a number of systems connected together to provide distributed computing. Hadoop uses a master-slave architecture.

Components required in the cluster

NameNode

The namenode is the master server of the cluster. It does not store file data itself, but it knows which blocks make up each file and which datanodes those blocks are stored on, so it can direct clients to the data and reassemble files. The namenode maintains two key structures: the FsImage (filesystem image) and the edit log.

Features

  1. Highly memory intensive.
  2. Keeping it safe and isolated is essential.
  3. Manages the filesystem namespace.

DataNodes

Datanodes are the worker (child) nodes attached to the master node.

Features:

  1. Each datanode has a configuration file that makes it available to the cluster, and it keeps track of the storage capacity available on that node (e.g. 5 out of 10 units free).
  2. Datanodes are independent of each other; they do not point to any other datanode.
  3. Each datanode manages the storage attached to its node.
  4. A cluster typically has many datanodes.

Job Tracker

  1. Schedules and assigns tasks to the different datanodes.

Its workflow:

  1. Takes the request.
  2. Assigns the task.
  3. Validates the requested work.
  4. Checks whether all the datanodes are working properly.
  5. If not, reschedules the tasks.

Task Tracker

The job tracker and task trackers work in a master-slave model. Every datanode runs a task tracker, which performs whatever tasks the job tracker assigns to it.

Secondary Name Node

The secondary namenode is not a redundant namenode; it periodically performs checkpointing and housekeeping tasks.

Types of Hadoop Installations

  1. Standalone (local) mode: used to run Hadoop directly on your local machine. By default Hadoop is configured to run in this mode. It is used for debugging purposes.
  2. Pseudo-distributed mode: used to simulate a multi-node installation on a single node. We can use a single server instead of installing Hadoop on several servers.
  3. Fully distributed mode: in this mode Hadoop is installed on all the servers that are part of the cluster. One machine is designated as the NameNode and another as the JobTracker; the rest act as DataNodes and TaskTrackers (see the sketch after this list).
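
As a rough sketch of what changes in fully distributed mode (the hostnames below are placeholders), fs.default.name and mapred.job.tracker point at the master machine instead of localhost, and the conf/masters and conf/slaves files list the participating machines:

# conf/slaves – one worker (DataNode + TaskTracker) per line
slave1.example.com
slave2.example.com

# conf/masters – in Hadoop 1.x this file lists where the SecondaryNameNode runs
master.example.com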

How to Make a Single-Node Hadoop Cluster

A single-node cluster is a cluster where all the Hadoop daemons run on a single machine. The setup can be described in several steps.

Prerequisites

OS Requirements

Hadoop is meant to be deployed on Unix-like platforms, which include operating systems such as Linux and Mac OS X. Larger Hadoop production deployments mostly run on CentOS, Red Hat, and similar distributions.

GNU/Linux is used as the development and production platform; Hadoop has been demonstrated on Linux clusters with more than 4,000 nodes.

Win32 can be used as a development platform, but it is not used as a production platform. For developing a cluster on Windows, we need Cygwin.

Since Ubuntu is a common Linux distribution with an interface familiar to Windows users, we'll describe the details of Hadoop deployment on Ubuntu. It is best to use the latest stable version of the OS; this document uses Ubuntu Linux 12.04.1 LTS, 64-bit.

Software Required

  • Java JDK

The recommended and tested versions of Java are listed below; you can choose any of the following:

  • JDK 1.6.0_20
  • JDK 1.6.0_21
  • JDK 1.6.0_24
  • JDK 1.6.0_26
  • JDK 1.6.0_28
  • JDK 1.6.0_31

*Source: Apache Software Foundation wiki. Test results announced by Cloudera, MapR, Hortonworks.

  • SSH must be installed.
  • SSHD must be running.

This is used by the Hadoop scripts to manage remote Hadoop daemons.

  • Download the latest stable version of Hadoop.

Here we are using Hadoop 1.0.3.

Now we have a Linux machine and the required software, so we can start the setup. Open a terminal and follow the steps described below.

Step 1

Checking whether the OS is 64-bit or 32-bit:

>$ uname -a

If it reports a 64-bit system, then all the software (Java, SSH) must be 64-bit. If it reports 32-bit, use the 32-bit versions. This is very important.

Step 2

Installing Java

For setting up Hadoop we need Java; Sun Java 1.6 is recommended.

To check whether Java is already installed:

>$ java -version


This will show the details about Java if it is already installed. If it is not there, we have to install it.

Download a stable version of Java as described above. The downloaded file may be a .bin file or a .tar file.

For installing a .bin file, go to the directory containing the binary file.

>$ sudo chmod u+x <filename>.bin

>$ ./<filename>.bin


If it is a tar ball

>$ sudo chmod u+x <filename>.tar

>$ sudo tar xzf <filename>.tar


Then set JAVA_HOME in the $HOME/.bashrc file. To edit the .bashrc file:

>$ sudo nano $HOME/.bashrc
# Set Java Home
export JAVA_HOME=<path from root to that java directory>
export PATH=$PATH:$JAVA_HOME/bin

Now close the terminal, re-open it, and check whether the Java installation is correct:

>$ java -version

This will show the version details if Java is installed correctly.

Now we are ready with Java installed.

Step 3

Adding a user for using Hadoop

We have to create a separate user account for running Hadoop. This is recommended because it isolates the Hadoop installation from other software and other users on the same machine.

>$ sudo addgroup hadoop
>$ sudo adduser --ingroup hadoop user

Here we created a user "user" in a group "hadoop".

Step 4

In the following steps, if you are not able to run sudo as this user, add the user to the sudoers file. For that:

>$ sudo nano /etc/sudoers

Then add the following line:

user ALL=(ALL) ALL

This will give the user root privileges.

If you are not interested in granting root privileges directly, instead make sure the line below is present (it allows members of the sudo group to execute any command) and add the user to the sudo group:

# Allow members of group sudo to execute any command
%sudo   ALL=(ALL:ALL) ALL

Step 5

Installing SSH server.

Hadoop requires SSH access to manage the nodes.

For a multi-node cluster, that means the remote machines as well as the local machine; in a single-node cluster, SSH is needed to access localhost for the user "user".

If an SSH server is not installed, install it before going further. Download the correct version (64-bit or 32-bit) of openssh-server; here we are using a 64-bit OS, so I downloaded the 64-bit openssh-server.

The download link is

http://www.ubuntuupdates.org/package/core/precise/main/base/openssh-server

The downloaded file may be a .deb file.

To install a .deb file:

>$ sudo chmod u+x <filename>.deb
>$ sudo dpkg -i <filename>.deb

This will install the .deb file.

Step 6

Configuring SSH

Now we have SSH up and running.

As the first step, we have to generate an SSH key for the user

user@ubuntu:~$ su - user
user@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Created directory '/home/user/.ssh'.
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
9d:47:ab:d7:22:54:f0:f9:b9:3b:64:93:12:75:81:27 user@ubuntu
The key's randomart image is:
[........]
user@ubuntu:~$

We need the key to be usable without our interaction, so we create the RSA key pair with an empty passphrase (the second line above). If a passphrase were set, we would have to enter it every time Hadoop interacts with its nodes, which is not desirable, so we leave it empty.

The next step is to enable SSH access to our local machine with the key created in the previous step.

user@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

The last step is to test the SSH setup by connecting to our local machine as "user". This step is necessary to save our local machine's host key fingerprint to the user's known_hosts file.

user@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 76:d7:61:86:ea:86:8f:31:89:9f:68:b0:75:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Ubuntu 12.04.1
...
user@ubuntu:~$

Step 7

Disabling IPv6

There is no use in enabling IPv6 on our Ubuntu box because we are not connected to any IPv6 network, so we can disable it. The performance impact may vary.

For disabling IPv6 on Ubuntu, open the file /etc/sysctl.conf:

>$ sudo nano /etc/sysctl.conf

Add the following lines to the end of this file:

# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Reboot the machine to make the changes take effect.

To check whether IPv6 is enabled or not, we can use the following command:

>$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

If the value is ‘0’ , IPv6 is enabled.

If it is ‘1’ , IPv6 is disabled.

We need the value to be ‘1’.

The prerequisites for installing Hadoop are now in place, so we can start the Hadoop installation.

Step 8

Hadoop Installation

Here I am using Hadoop version 1.0.3, so we are using that tar ball.

We create a directory named 'utilities' in the user's home directory. In practice you can choose any directory, but it helps to keep a good, uniform directory structure for the installation; this pays off when you deal with multi-node clusters.

>$ cd utilities
>$ sudo tar -xvf hadoop-1.0.3.tar.gz
>$ sudo chown -R user:hadoop hadoop-1.0.3

Here the second line extracts the tar ball, and the third line gives ownership of hadoop-1.0.3 to the user.

Step 9

Setting HADOOP_HOME in $HOME/.bashrc

Add the following lines to the .bashrc file:

# Set Hadoop_Home
export HADOOP_HOME=/home/user/utilities/hadoop-1.0.3
# Adding bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Note: if you edit the $HOME/.bashrc file, only the user doing this gets the benefit. To make the change apply globally to all users, go to the /etc/bash.bashrc file and make the same changes; then JAVA_HOME and HADOOP_HOME will be available to all users. Do the same when setting up Java.
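
For example, a minimal sketch of the global version, using the same paths assumed elsewhere in this guide:

>$ sudo nano /etc/bash.bashrc

# Set Java and Hadoop homes for every user
export JAVA_HOME=<path from root to that java directory>
export HADOOP_HOME=/home/user/utilities/hadoop-1.0.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin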

Step 10

Configuring Hadoop

In Hadoop we can find three configuration files: core-site.xml, mapred-site.xml, and hdfs-site.xml.

If we open these files, the only thing we can see is an empty configuration tag: <configuration></configuration>

What actually happens behind the curtain is that Hadoop assumes default values for a lot of properties. If we want to override them, we edit these configuration files.

The default values are available in three files: core-default.xml, mapred-default.xml, and hdfs-default.xml. These are available in the locations utilities/hadoop-1.0.3/src/core, utilities/hadoop-1.0.3/src/mapred, and utilities/hadoop-1.0.3/src/hdfs. If we open these files, we can see all the default properties.
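
For example, to see the shipped default for a property such as hadoop.tmp.dir (using the directory layout assumed in this guide):

grep -A 2 'hadoop.tmp.dir' utilities/hadoop-1.0.3/src/core/core-default.xml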
Setting JAVA_HOME for Hadoop directly

Open the hadoop-env.sh file; you can see a JAVA_HOME entry with a path. The location of the file is hadoop-1.0.3/conf/hadoop-env.sh. Edit that JAVA_HOME and give the correct path where Java is installed.

>$ sudo nano hadoop-1.0.3/conf/hadoop-env.sh

# The Java implementation to use
export JAVA_HOME=<path from root to java directory>

Editing the Configuration Files

All these files are present in the directory hadoop-1.0.3/conf/

Here we configure the directory where Hadoop stores its data files, the network ports it listens on, and so on. By default, Hadoop stores its local filesystem data and HDFS data under hadoop.tmp.dir. Here we use the directory /app/hadoop/tmp for these temporary directories.

For that, create the directory and set its ownership and permissions for the user:

>$ sudo mkdir -p /app/hadoop/tmp
>$ sudo chown user:hadoop /app/hadoop/tmp
>$ sudo chmod 750 /app/hadoop/tmp

The first line creates the directory structure, the second line gives ownership of that directory to the user, and the third line sets the rwx permissions. Setting the ownership and permissions is very important; if you forget this, you will run into exceptions while formatting the namenode.

1. core-site.xml

Open the core-site.xml file; you can see empty configuration tags. Add the following lines between the configuration tags:

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>
A base for other temporary directories.
</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system.</description>
</property>
2. mapred-site.xml

In mapred-site.xml, add the following between the configuration tags:

<property>
<name>mapred.job.tracker</name>
 <value>localhost:9001</value>
 <description> The host and port that the MapReduce job tracker runs </description>
</property>
3. hdfs-site.xml

In hdfs-site.xml, add the following between the configuration tags:

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication</description>
</property>

Here we are giving replication as 1, because we have only one machine.

We can increase this as the number of nodes increases.

Step 11

Formatting the Hadoop Distributed File System via the NameNode

The first step in starting our Hadoop installation is to format the distributed file system. This should be done before first use. Be careful: do not format an already-running cluster, because all of its data will be lost.

user@ubuntu:~$ $HADOOP_HOME/bin/hadoop namenode -format

The output will look like this

09/10/12 12:52:54 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/127.0.1.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.3 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
09/10/12 12:52:54 INFO namenode.FSNamesystem: fsOwner=user,hadoop
09/10/12 12:52:54 INFO namenode.FSNamesystem: supergroup=supergroup
09/10/12 12:52:54 INFO namenode.FSNamesystem: isPermissionEnabled=true
09/10/12 12:52:54 INFO common.Storage: Image file of size 96 saved in 0 seconds.
09/10/12 12:52:54 INFO common.Storage: Storage directory .../hadoop-user/dfs/name has been successfully formatted.
09/10/12 12:52:54 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/

Step 12

Starting Our Single-Node Cluster

Here we have only one node, so all the Hadoop daemons run on a single machine and we can start them all with a single shell script.

user@ubuntu:~$ $HADOOP_HOME/bin/start-all.sh

This will start up all the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker) on our machine.

The output when we run this is shown below.

user@ubuntu:/home/user/utilities/hadoop-1.0.3$ bin/start-all.sh
starting namenode, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-namenode-ubuntu.out
localhost: starting datanode, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-secondarynamenode-ubuntu.out
starting jobtracker, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-tasktracker-ubuntu.out
user@ubuntu$

You can check the processes running on the machine by using jps.

user@ubuntu:/home/user/utilities/hadoop-1.0.3$ jps
1127 TaskTracker
2339 JobTracker
1943 DataNode
2098 SecondaryNameNode
2378 Jps
1455 NameNode

Note: if jps is not working, you can use another Linux command:

ps -ef | grep user

You can also check for each daemon individually:

ps -ef | grep <daemon name>     (e.g. namenode)

Step 13

Stopping Our Single-Node Cluster

To stop all the daemons running on the machine, run the command:

>$ stop-all.sh

The output will be like this

user@ubuntu:~/utilities/hadoop-1.0.3$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
user@ubuntu:~/utilities/hadoop-1.0.3$

Then check with jps

>$ jps
2378 Jps

Step 14

Testing the Setup

Now the installation part is complete. The next step is to test the installed setup. Restart the Hadoop cluster using start-all.sh.

Checking with HDFS

  1. Make a directory in HDFS:

    hadoop fs -mkdir /user/user/trial

    If it succeeds, list the created directory:

    hadoop fs -ls /user/user

    The output will be like this:

    drwxr-xr-x   - user supergroup          0 2012-10-10 18:08 /user/user/trial

    If you get output like this, HDFS is working fine.

  2. Copy a file from the local Linux filesystem:

    hadoop fs -copyFromLocal utilities/hadoop-1.0.3/conf/core-site.xml /user/user/trial/

    Check for the file in HDFS:

    hadoop fs -ls /user/user/trial/
    -rw-r--r--   1 user supergroup        557 2012-10-10 18:20 /user/user/trial/core-site.xml

    If the output looks like this, it succeeded.

    Checking with a MapReduce Job

    MapReduce example jars ship with Hadoop itself, so we can use one of those; there is no need to import anything else. To check MapReduce, we can run a wordcount MapReduce job.

    Go to $HADOOP_HOME and then run:

    >$ hadoop jar hadoop-examples-1.0.3.jar

    The output will be like this:

    An example program must be given as the first argument.
    Valid program names are:
    aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
    aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
    dbcount: An example job that count the pageview counts from a database.
    grep: A map/reduce program that counts the matches of a regex in the input.
    join: A job that effects a join over sorted, equally partitioned datasets
    multifilewc: A job that counts words from several files.
    pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
    pi: A map/reduce program that estimates Pi using monte-carlo method.
    randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
    randomwriter: A map/reduce program that writes 10GB of random data per node.
    secondarysort: An example defining a secondary sort to the reduce.
    sleep: A job that sleeps at each map and reduce task.
    sort: A map/reduce program that sorts the data written by the random writer.
    sudoku: A sudoku solver.
    teragen: Generate data for the terasort
    terasort: Run the terasort
    teravalidate: Checking results of terasort
    wordcount: A map/reduce program that counts the words in the input files.

    The programs shown above are contained inside that jar; we can choose any of them. Here we are going to run the wordcount program. The input file is the file we already copied from the local filesystem to HDFS.

    Run the following command to execute the wordcount:

    >$ hadoop jar hadoop-examples-1.0.3.jar wordcount /user/user/trial/core-site.xml /user/user/trial/output/

    The output will be like this:
    12/10/10 18:42:30 INFO input.FileInputFormat: Total input paths to process : 1
    12/10/10 18:42:30 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    12/10/10 18:42:30 WARN snappy.LoadSnappy: Snappy native library not loaded
    12/10/10 18:42:31 INFO mapred.JobClient: Running job: job_201210041646_0003
    12/10/10 18:42:32 INFO mapred.JobClient:  map 0% reduce 0%
    12/10/10 18:42:46 INFO mapred.JobClient:  map 100% reduce 0%
    12/10/10 18:42:58 INFO mapred.JobClient:  map 100% reduce 100%
    12/10/10 18:43:03 INFO mapred.JobClient: Job complete: job_201210041646_0003
    12/10/10 18:43:03 INFO mapred.JobClient: Counters: 29
    12/10/10 18:43:03 INFO mapred.JobClient:   Job Counters
    12/10/10 18:43:03 INFO mapred.JobClient:     Launched reduce tasks=1
    12/10/10 18:43:03 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12386
    12/10/10 18:43:03 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    12/10/10 18:43:03 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    12/10/10 18:43:03 INFO mapred.JobClient:     Launched map tasks=1
    12/10/10 18:43:03 INFO mapred.JobClient:     Data-local map tasks=1
    12/10/10 18:43:03 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10083
    12/10/10 18:43:03 INFO mapred.JobClient:   File Output Format Counters
    12/10/10 18:43:03 INFO mapred.JobClient:     Bytes Written=617
    12/10/10 18:43:03 INFO mapred.JobClient:   FileSystemCounters
    12/10/10 18:43:03 INFO mapred.JobClient:     FILE_BYTES_READ=803
    12/10/10 18:43:03 INFO mapred.JobClient:     HDFS_BYTES_READ=688
    12/10/10 18:43:03 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=44801
    12/10/10 18:43:03 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=617
    12/10/10 18:43:03 INFO mapred.JobClient:   File Input Format Counters
    12/10/10 18:43:03 INFO mapred.JobClient:     Bytes Read=557
    12/10/10 18:43:03 INFO mapred.JobClient:   Map-Reduce Framework
    12/10/10 18:43:03 INFO mapred.JobClient:     Map output materialized bytes=803
    12/10/10 18:43:03 INFO mapred.JobClient:     Map input records=18
    12/10/10 18:43:03 INFO mapred.JobClient:     Reduce shuffle bytes=803
    12/10/10 18:43:03 INFO mapred.JobClient:     Spilled Records=90
    12/10/10 18:43:03 INFO mapred.JobClient:     Map output bytes=746
    12/10/10 18:43:03 INFO mapred.JobClient:     CPU time spent (ms)=3320
    12/10/10 18:43:03 INFO mapred.JobClient:     Total committed heap usage (bytes)=233635840
    12/10/10 18:43:03 INFO mapred.JobClient:     Combine input records=48
    12/10/10 18:43:03 INFO mapred.JobClient:     SPLIT_RAW_BYTES=131
    12/10/10 18:43:03 INFO mapred.JobClient:     Reduce input records=45
    12/10/10 18:43:03 INFO mapred.JobClient:     Reduce input groups=45
    12/10/10 18:43:03 INFO mapred.JobClient:     Combine output records=45
    12/10/10 18:43:03 INFO mapred.JobClient:     Physical memory (bytes) snapshot=261115904
    12/10/10 18:43:03 INFO mapred.JobClient:     Reduce output records=45
    12/10/10 18:43:03 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2876592128
    12/10/10 18:43:03 INFO mapred.JobClient:     Map output records=48
    user@ubuntu:~/utilities/hadoop-1.0.3$

    If the program executed successfully, the output will be in the /user/user/trial/output/part-r-00000 file in HDFS.

    Check the output:

    >$ hadoop fs -cat /user/user/trial/output/part-r-00000

    If the word counts are printed, our installation also works with MapReduce. Thus we have verified our installation, and our single-node Hadoop cluster is ready.

How to create a private npm.js repository

I want to create a mirror of the npm repository to mitigate periods of npm outages and to speed things up a little bit, so here’s how I did it!

Installing CouchDB

 Install the required packages:

 sudo apt-get install build-essential autoconf automake libtool erlang libicu-dev libmozjs-dev libcurl4-openssl-dev 

Download CouchDB 1.5:

wget http://mirror.cc.columbia.edu/pub/software/apache/couchdb/source/1.5.0/apache-couchdb-1.5.0.tar.gz

Extract it:

tar xfv apache-couchdb-1.5.0.tar.gz

Compile:

  cd apache-couchdb-1.5.0

 ./configure

  make

  make install

To check:

   $ couchdb
  Apache CouchDB 1.5.0 (LogLevel=info) is starting.
  Apache CouchDB has started. Time to relax.
  [info] [<0.32.0>] Apache CouchDB has started on http://127.0.0.1:5984/

   $ curl -X GET http://localhost:5984
   [info] [<0.361.0>] 127.0.0.1 – - GET / 200

Adding CouchDB to the init.d scripts

     $ sudo adduser --disabled-login --disabled-password --no-create-home couchdb

     Adding user `couchdb’ …
     Adding new group `couchdb’ (1001) …
     Adding new user `couchdb’ (1001) with group `couchdb’ …
     Not creating home directory `/home/couchdb’.
     Changing the user information for couchdb Enter the new value, or press ENTER for the default
     Full Name []: CouchDB Admin
     Room Number []:  
     Work Phone []:
     Home Phone []:
    Other []:
    Is the information correct? [Y/n] Y

User added, now permissions:

    sudo chown -R couchdb:couchdb /usr/local/var/{log,lib,run}/couchdb

    sudo chown -R couchdb:couchdb /usr/local/etc/couchdb/local.ini

    $ sudo vim /usr/local/etc/couchdb/default.ini
    [httpd]
    secure_rewrites = false

    sudo ln -s /usr/local/etc/init.d/couchdb /etc/init.d
    sudo update-rc.d couchdb defaults
    sudo /etc/init.d/couchdb start

    

Replicating npmjs.org
Now to replicate the npm registry.

curl -X POST http://127.0.0.1:5984/_replicate -d '{"source":"http://isaacs.iriscouch.com/registry/", "target":"registry", "create_target":true}' -H "Content-Type: application/json"
What this means is that once this completes, your local replica will be an exact copy of the npm registry. However, to ensure we keep receiving all updates, we add a "continuous":true parameter to the JSON string in our POST request; this utilises CouchDB's _changes API and will pull in new changes whenever that API is notified.

curl -X POST http://127.0.0.1:5984/_replicate -d '{"source":"http://isaacs.iriscouch.com/registry/", "target":"registry", "continuous":true, "create_target":true}' -H "Content-Type: application/json"
We are now replicating continuously from npmjs.org to our private CouchDB instance! If you ever want to stop these replications, you can easily do so by running the same command as before with a "cancel":true parameter added to the JSON POST data.

curl -X POST http://127.0.0.1:5984/_replicate -d '{"source":"http://isaacs.iriscouch.com/registry/", "target":"registry", "continuous":true, "create_target":true, "cancel":true}' -H "Content-Type: application/json"
And we’re almost done! All we need to do now is to set up our own version of the npmjs.org registry 

Getting npm to work with our replicated CouchDB
Most of the steps can be found on isaacs' GitHub, in the npmjs.org git repository's README.

Of course we need to have nodejs and git installed for this:

git clone git://github.com/isaacs/npmjs.org.git
cd npmjs.org
sudo npm install -g couchapp
npm install couchapp
npm install semver
couchapp push registry/app.js http://localhost:5984/registry
couchapp push www/app.js http://localhost:5984/registry
Boom, we now have a working npm repository. To test it we can run the following commands:

npm --registry http://localhost:5984/registry/_design/scratch/_rewrite login
npm --registry http://localhost:5984/registry/_design/scratch/_rewrite search
If you are getting results then everything is fine.

To run on your own subdomain, modify the [vhosts] section in /usr/local/etc/couchdb/local.ini: uncomment the example, edit it, and restart CouchDB.

$ vim /usr/local/etc/couchdb/local.ini
[vhosts]
example.com = /registry/_design/scratch/_rewrite
And while we are at it, we will lock down the application and prevent unauthorised users from deleting our data.

$ vim /usr/local/etc/couchdb/local.ini
[admins]
admin = password
$ sudo /etc/init.d/couchdb restart
Start using your version of npm with the client!
Straight from the npmjs.org README, just replace <registryurl> with your registry's URL, for example:

http://localhost:5984/registry/_design/app/_rewrite
You can point the npm client at the registry by putting this in your ~/.npmrc file:

registry = <registryurl>
You can also set the npm registry config property like:

npm config set registry <registryurl>
Or you can simply override the registry config on each call:

npm --registry <registryurl> install <packagename>

Git Mirror – If GitHub Goes Down

Git has a built-in server for sharing git repositories, which can serve the repositories in a working directory. For example:

Step 1: Clone the repo from GitHub into our home directory (e.g. /home/vishnu):

git clone git@github.com:vishnu/vishnu.git

Step 2: Run git daemon:
git daemon --base-path=/home/vishnu --export-all --enable=receive-pack
(I created a supervisor to run this)

root@vishnu-machine:~# vim /etc/supervisor/conf.d/git_mirror_update.conf
[program:git_mirror_update]
command=/usr/local/bin/git daemon --base-path=/home/vishnu --export-all --enable=receive-pack
process_name=%(program_name)s
directory=/home/vishnu
autostart=true
autorestart=true
stopsignal=QUIT
redirect_stderr=true
stdout_logfile=/var/log/supervisor/%(program_name)s.log
stdout_logfile_maxbytes=1MB
stdout_logfile_backups=5
stdout_capture_maxbytes=1MB

root@vishnu-machine:~# supervisorctl status
git_mirror_update RUNNING pid 3041, uptime 8:18:46

 

Step 3: Clone the repo we are now serving, from the local daemon, to another location on our system:

git clone git://127.0.0.1/vishnu

Step 4: Create a mirror clone from GitHub:

git clone --mirror git@github.com:vishnu/vishnu.git

Step 5:

cd vishnu.git

Step 6: Push the mirror to the local daemon; this will update all the remote branches:

git push --mirror git://127.0.0.1/vishnu

In deploy.rb, use:

set :repository, "git://127.0.0.1/vishnu"

For reference: https://help.github.com/articles/working-when-github-goes-down

Script to take a DB backup to S3 by passing the database name and host name

#!/bin/sh
DB=""
HOST=""
printhelp() {

echo "

Usage: sh mysql.sh [OPTION]...
-db, --database The database name which you need to back up

-host, --hostname The hostname

-h, --help Display this help
"
}
while [ "$1" != "" ]; do
case "$1" in
-db | --database ) DB=$2; shift 2 ;;
-host | --hostname ) HOST=$2; shift 2 ;;
-h | --help ) echo "$(printhelp)"; exit ;;
* ) echo "Unknown option: $1"; exit 1 ;;
esac
done

#### BEGIN CONFIGURATION ####
NOWDATE=`date +%Y-%m-%d`
LASTDATE=$(date +%Y-%m-%d --date='1 week ago')

# set dump directory variables
SRCDIR='/tmp/s3backups'
DESTDIR='test'
BUCKET='vishnu-test'

# database access details
USER='root'
#### END CONFIGURATION ####

# make the temp directory if it doesn't exist and cd into it
mkdir -p $SRCDIR
cd $SRCDIR

# dump the selected database and upload it to s3
#for DB in $(mysql -h$HOST -u$USER -e 'show databases' | grep -Ev 'mysql|information_schema|performance_schema|Database')
#do
mysqldump -h$HOST -u$USER --single-transaction $DB > $DB.sql
tar -czPf $DB.tar.gz $DB.sql
mv $DB.tar.gz $NOWDATE-$DB.tar.gz
/usr/bin/s3cmd put $SRCDIR/$NOWDATE-$DB.tar.gz s3://$BUCKET/$DESTDIR/
#done

# delete old backups from s3
/usr/bin/s3cmd del --recursive s3://$BUCKET/$DESTDIR/$LASTDATE-$DB.tar.gz

# remove all files in our source directory
cd
rm -rf $SRCDIR/*

Save this script and run it. For example:

sh mysql.sh -db <database name> -host <hostname>

Ignore MySQL tables using regex when using the mysqldump command

Hi guys,

We can ignore database tables while taking a mysqldump using the command below:

mysqldump -u username -p database --ignore-table=database.table1 --ignore-table=database.table2 > database.sql

However, if we want to ignore many more tables we need to add each one manually, and it's a real pain in the A$$.

Here I am showing how to take the mysqldump using a regex to ignore MySQL tables.

For example, say I want to ignore the tables whose names start with:

1) alice_token
2) import_
3) sales_order_item_shipment_tracking
4) sales_order_item_status_history
5) stock_import
6) ums

root@vishnu-machine# cd /var/lib/mysql/<database>

root@vishnu-machine:/var/lib/mysql/<database># echo '[mysqldump]' > mydump.cnf

root@vishnu-machine:/var/lib/mysql/<database># mysql -NBe "select concat('ignore-table=', table_schema, '.', table_name) from information_schema.tables where table_name REGEXP '^(alice_token|import_|sales_order_item_shipment_tracking|sales_order_item_status_history|stock_import|ums)';" >> mydump.cnf

Note that the second command appends (>>) so that the [mysqldump] group header is not overwritten.

Then verify the file mydump.cnf:

root@vishnu-machine:/var/lib/mysql/<database># cat mydump.cnf

ignore-table=<database>.alice_token

ignore-table=<database>.import_token

ignore-table=<database>.import_sales

ignore-table=<database>.sales_order_item_shipment_tracking

ignore-table=<database>.sales_order_item_status_history

ignore-table=<database>.stock_import

ignore-table=<database>.ums

You can see the above entries in the file mydump.cnf.

root@vishnu-machine:/var/lib/mysql/<database># mysqldump --defaults-file=mydump.cnf -u root -p <database> > vishnu.sql

And there it is: a new dump without the tables mentioned above.

 

PhantomJS and Load Speed test for websites

PhantomJS is an optimal solution for

1) Headless Website Testing

2) Screen Capture

3) Page Automation

4) Network Monitoring

Please check this link  http://phantomjs.org/

TO INSTALL PHANTOMJS

cd /usr/local/share
wget https://phantomjs.googlecode.com/files/phantomjs-1.9.1-linux-x86_64.tar.bz2
tar xjvf phantomjs-1.9.1-linux-x86_64.tar.bz2
ln -s /usr/local/share/phantomjs-1.9.1-linux-x86_64/bin/phantomjs /usr/local/share/phantomjs
ln -s /usr/local/share/phantomjs-1.9.1-linux-x86_64/bin/phantomjs /usr/local/bin/phantomjs
ln -s /usr/local/share/phantomjs-1.9.1-linux-x86_64/bin/phantomjs /usr/bin/phantomjs
# Additional package
apt-get install fontconfig

Then go to the location

root@machine# cd /usr/local/share/phantomjs-1.9.1-linux-x86_64/examples

root@machine# vim loadspeed.js

ADD OUR EDITED SCRIPT TO GET THE PAGE LOAD SPEED

var args = require('system').args;
    if(args[1] === undefined){
        console.log('you should provide an url');
        phantom.exit();
    }

var namshiUrl = args[1];

var page = require('webpage').create(),
    system = require('system'),
    t,address;

if (system.args.length === 1) {
    console.log('Usage: loadspeed.js ' + namshiUrl);
    phantom.exit(1);
} else {
    t = Date.now();
    address = system.args[1];
    page.open(address, function (status) {
      if (status !== 'success') {
            console.log('FAIL to load the address');
        } else {
            t = Date.now() - t;
            console.log('Page title is ' + page.evaluate(function () {
                return document.title;
            }));
            console.log('Loading time ' + t + ' msec');
        }
        phantom.exit();
    });
}

THEN TEST IT

root@vishnu-machine:~# for weburl in http://google.com http://yahoo.com ; do phantomjs /usr/local/share/phantomjs-1.9.1-linux-x86_64/examples/loadspeed.js ${weburl}; done

THEN, TO AUTOMATE THIS, WRITE A SCRIPT

#!/bin/bash
true=0
for weburl in http://google.com http://yahoo.com http://hotmail.com ; do /usr/local/bin/phantomjs /usr/local/share/phantomjs-1.9.1-linux-x86_64/examples/loadspeed.js ${weburl}; done  > abc.txt
cat abc.txt | awk '{print $3}' | sed -n "2~2 p" > abc2.txt
for i in `cat abc2.txt`
do
if [ $i -gt 12000 ]
then
    true=1
fi
done
if [ $true -eq 1 ]
then
   mail -s loadtime mailadddress@domain.com < abc.txt
fi

 

Centralizing Logs with Lumberjack, Logstash, and Elasticsearch

Elasticsearch

Elasticsearch is a RESTful search server based on Apache Lucene. It's easily scalable and uses push replication in distributed environments. But really, the Logstash/Elasticsearch/Kibana integration is great! If it wasn't for how well these applications integrate with one another, I probably would have looked into putting my logs into SOLR (which I'm more familiar with).

As of Logstash 1.1.9, the Elasticsearch output plugin works with the 0.20.2 version of Elasticsearch, so even though Elasticsearch currently has a newer version available, we’re going to use 0.20.2. (If you really need to use the most current version of Elasticsearch, you can use the elasticsearch_http output plugin to make Logstash interface with Elasticsearch’s REST API. If you’re using the elasticsearch output plugin, your versions must match between Logstash and Elasticsearch.)

First, you’ll need to download a headless runtime environment. We run Squeeze around here, so we’re using apt-get install default-jre-headless

Then, you’ll need to download the .deb package.

 

Download and install Elasticsearch
wget http://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.20.2.deb
dpkg -i elasticsearch-0.20.2.deb

 

You should see that Elasticsearch is now running on your server. Stop it for now with /etc/init.d/elasticsearch stop – we might want to set up some configuration first. For example, you're probably going to be generating a lot of data. And even if you're not, I highly recommend using external block storage if you're on cloud hosting. That way, your data stays when your instance dies.

On some hosts, you may need to run modprobe acpiphp before attaching the block storage device for the first time. By the way, keep in mind that you will generate a LOT of I/O requests. I do NOT recommend setting this up at places like Amazon or HP Cloud where you get billed per 1,000,000 requests. That 1,000,000 may sound like a lot of requests, but if you’re stashing 100GB of logs per month, don’t be surprised when your I/O bill is more than $3,000. Just a heads up.

Next, you’ll want to edit the configuration in /etc/elasticsearch/elasticsearch.yml. If you’re on cloud hosting and running more than one node, odds are multicast won’t work so you’ll have to specify the IPs of your Elasticsearch nodes. You’ll also need to make sure that the data directory has been properly specified. By the way, Rackspace does support multicast using Cloud Networks. If you’ve got a multicast-enabled network, then all that you’ll need to do for your nodes to communicate with each other is ensure they are configured with the same cluster name.
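
For example, a minimal sketch of the kind of settings involved (the data path and IP addresses are placeholders, not values from this setup; the cluster name matches the one used in the logstash.conf later in this post):

# additions to /etc/elasticsearch/elasticsearch.yml (hypothetical values)
cluster.name: my-logstash-cluster
path.data: /mnt/blockstorage/elasticsearch/data
# on networks without multicast, list the other nodes explicitly
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.0.0.11", "10.0.0.12"]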

After configuring Elasticsearch, you can go ahead and start it. If you’re running more than one node, I recommend waiting a few minutes between starting each one (or check the health of the cluster first).

Logstash

Once again, you’ll need a JRE to run this.

First, you’ll want to generate self-signed SSL on logstash server:

 

Generate SSL
openssl req -x509 -newkey rsa:2048 -keyout /etc/ssl/logstash.key -out /etc/ssl/logstash.pub -nodes -days 365

 

You will need the public key for your Lumberjack shipper.

Next, you’ll need to get the Logstash jar.

 

Download/Install Logstash
mkdir /opt/logstash
wget https://logstash.objects.dreamhost.com/release/logstash-1.1.9-monolithic.jar -O /opt/logstash/logstash.jar

 

You’ll need to put together your logstash.conf file according to the documentation. Be sure that you have an output filter for Elasticsearch if you’re going to store your logs there. If you put it in /etc/logstash/logstash.conf, you can use the following init script to run it. Otherwise, you’ll need to run /usr/bin/java -jar /opt/logstash/logstash.jar agent -f <path-to-your.conf> -l <path-to-where-you-want-the.log>

 

Logstash Init Script
#! /bin/sh
 
### BEGIN INIT INFO
# Provides:          logstash
# Required-Start:    $remote_fs $syslog
# Required-Stop:     $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start daemon at boot time
# Description:       Enable service provided by daemon.
### END INIT INFO
 
. /lib/lsb/init-functions
 
name="logstash"
logstash_bin="/usr/bin/java -- -jar /opt/logstash/logstash.jar"
logstash_conf="/etc/logstash/logstash.conf"
logstash_log="/var/log/logstash.log"
pid_file="/var/run/$name.pid"
 
start () {
        command="${logstash_bin} agent -f $logstash_conf --log ${logstash_log}"
 
        log_daemon_msg "Starting $name" "$name"
        if start-stop-daemon --start --quiet --oknodo --pidfile "$pid_file" -b -m --exec $command; then
                log_end_msg 0
        else
                log_end_msg 1
        fi
}
 
stop () {
        log_daemon_msg "Stopping $name" "$name"
        start-stop-daemon --stop --quiet --oknodo --pidfile "$pid_file"
}
 
status () {
        status_of_proc -p $pid_file "" "$name"
}
 
case $1 in
        start)
                if status; then exit 0; fi
                start
                ;;
        stop)
                stop
                ;;
        reload)
                stop
                start
                ;;
        restart)
                stop
                start
                ;;
        status)
                status && exit 0 || exit $?
                ;;
        *)
                echo "Usage: $0 {start|stop|restart|reload|status}"
                exit 1
                ;;
esac
 
exit 0

 

By the way, any inputs in your conf file that are accepting data from a Lumberjack shipper need to use the Lumberjack input filter. While it may be tempting to receive logs over UDP due to reduce overhead, I highly recommend using TCP. UDP is great if you don’t care if you actually receive it or not, but we want to receive all of our logs. TCP is a better choice in this case for better accounting since it will attempt to resend data that wasn’t acknowledged to have been received.

Also, here is a sample config for /etc/logstash/logstash.conf:

Logstash Conf
input {
  lumberjack {
    port => 5000
    type => syslog
    ssl_certificate => "/etc/ssl/logstash.pub"
    ssl_key => "/etc/ssl/logstash.key"
  }
  lumberjack {
    type => "nginx_access"
    port => 5001
    ssl_certificate => "/etc/ssl/logstash.pub"
    ssl_key => "/etc/ssl/logstash.key"
  }
  lumberjack {
    type => "varnishncsa"
    port => "5002"
    ssl_certificate => "/etc/ssl/logstash.pub"
    ssl_key => "/etc/ssl/logstash.key"
  }
}
 
filter {
  # Lumberjack sends custom fields. We're going to use those for multi-user
  # Kibana access control.
  mutate {
    add_tag => [ "%{customer}" ]
  }
  mutate {
    remove => [ "customer" ]
  }
 
  # strip the syslog PRI part and create facility and severity fields.
  # the original syslog message is saved in field %{syslog_raw_message}.
  # the extracted PRI is available in the %{syslog_pri} field.
  #
  # You get %{syslog_facility_code} and %{syslog_severity_code} fields.
  # You also get %{syslog_facility} and %{syslog_severity} fields if the
  # use_labels option is set True (the default) on syslog_pri filter.
  grok {
    type => "syslog"
    pattern => [ "<%{POSINT:syslog_pri}>%{SPACE}%{GREEDYDATA:message_remainder}" ]
    add_tag => "got_syslog_pri"
    add_field => [ "syslog_raw_message", "%{@message}" ]
  }
  syslog_pri {
    type => "syslog"
    tags => [ "got_syslog_pri" ]
  }
  mutate {
    type => "syslog"
    tags => [ "got_syslog_pri" ]
    replace => [ "@message", "%{message_remainder}" ]
  }
  mutate {
    # XXX must not be combined with replacement which uses same field
    type => "syslog"
    tags => [ "got_syslog_pri" ]
    remove => [ "message_remainder" ]
  }
 
  # strip the syslog timestamp and force event timestamp to be the same.
  # the original string is saved in field %{syslog_timestamp}.
  # the original logstash input timestamp is saved in field %{received_at}.
  grok {
    type => "syslog"
    pattern => [ "%{SYSLOGTIMESTAMP:syslog_timestamp}%{SPACE}%{GREEDYDATA:message_remainder}" ]
    add_tag => "got_syslog_timestamp"
    add_field => [ "received_at", "%{@timestamp}" ]
  }
  mutate {
    type => "syslog"
    tags => [ "got_syslog_timestamp" ]
    replace => [ "@message", "%{message_remainder}" ]
  }
  mutate {
    # XXX must not be combined with replacement which uses same field
    type => "syslog"
    tags => [ "got_syslog_timestamp" ]
    remove => [ "message_remainder" ]
  }
  date {
    type => "syslog"
    tags => [ "got_syslog_timestamp" ]
    # season to taste for your own syslog format(s)
    syslog_timestamp => [ "MMM  d HH:mm:ss", "MMM dd HH:mm:ss", "ISO8601" ]
  }
 
  # strip the host field from the syslog line.
  # the extracted host field becomes the logstash %{@source_host} metadata
  # and is also available in the filed %{syslog_hostname}.
  # the original logstash source_host is saved in field %{logstash_source}.
  grok {
    type => "syslog"
    pattern => [ "%{SYSLOGHOST:syslog_hostname}%{SPACE}%{GREEDYDATA:message_remainder}" ]
    add_tag => [ "got_syslog_host", "%{syslog_hostname}" ]
    add_field => [ "logstash_source", "%{@source_host}" ]
  }
  mutate {
    type => "syslog"
    tags => [ "got_syslog_host" ]
    replace => [ "@source_host", "%{syslog_hostname}" ]
    replace => [ "@message", "%{message_remainder}" ]
  }
  mutate {
    # XXX must not be combined with replacement which uses same field
    type => "syslog"
    tags => [ "got_syslog_host" ]
    remove => [ "message_remainder" ]
  }
 
  grok {
    type => "nginx_access"
    pattern => "%{COMBINEDAPACHELOG}"
  }
  grok {
    type => "varnishncsa"
    pattern => "%{COMBINEDAPACHELOG}"
  }
}
 
output {
#  stdout { debug => true debug_format => "json" }
 
  elasticsearch {
    cluster => "my-logstash-cluster"
  }
}

By the way, you’ll notice that a lot of the filters have “type” set, as well as the inputs. This is so you can limit which type a filter is acting on. Additionally, you can add multiple outputs that send only certain types to different destinations. For example, you might want to send Varnish and Nginx logs to statsd in addition to Elasticsearch, or maybe you want to send certain events to PagerDuty, or perhaps even to IRC. There are currently over 45 outputs that come with Logstash, and if what you need isn’t there you can always write your own. By the way, Lumberjack isn’t the only input; it’s just the one we’re focusing on here.

Also, the above logstash.conf assumes that you have Grok installed and that you’re running on a multicast-enabled network. Also, don’t get discouraged waiting for Logstash to start. It typically takes 70-90 seconds in my experience, although I’ve seen it take up to two minutes to start up and fully connect to Elasticsearch. I banged my head against the wall my first day with Logstash trying to figure out what was wrong – and then I went and got another cup of coffee and returned back to see it finally running.

Start Shipping with Lumberjack

I know, you’re probably thinking one of two things at this point.

  1. Logstash can ship logs, too.
  2. Rsyslog can ship logs and ships with most distros.

And you’re right. But when you’re shipping a lot of logs, especially on a system with limited resources, you need a lightweight solution that does nothing but ship. Enter Lumberjack.

Remember, Logstash is Java. Running as a shipper, it's probably going to be using a significant part of your memory. While that might be acceptable on a 32GB server, if your deployment consists mostly of 512MB or 1GB cloud servers for web nodes, then you're giving up a significant chunk of resources that could be put to better use for other tasks.

The next option might be to ship with Rsyslog. But even then, I’ve seen it grow to over 60MB of memory usage, and I’ve heard people report over 100MB. For shipping logs.

Both of those scenarios are unacceptable to me. Lumberjack uses a whopping 64K in my experience. (For reference, the busiest server I have at the moment generates >1M entries per day from the Nginx logs.)

First, you’ll need to clone the repo, and then it’s just the usual “make”. To make my life a little easier, though, I like to use a neat little Rubygem called “fpm”: running “make deb” after “make” gives me a Debian package that I can reuse over and over again.

Install Lumberjack
apt-get install rubygems
gem install fpm
export PATH=$PATH:/var/lib/gems/1.8/bin
git clone https://github.com/jordansissel/lumberjack.git
cd lumberjack
make
make deb
dpkg -i lumberjack_0.0.8_amd64.deb

So now that we have Lumberjack, how do we run it? Basically, you just need to tell Lumberjack what files to ship to Logstash. The basic command is “/opt/lumberjack/bin/lumberjack --host your.logstash.host --port port-for-these-logs --ssl-ca-path /etc/ssl/logstash.pub /path/to/your.log /path/to/your/other.log”. You can also pass custom fields using “--field key=value”. (The custom fields can be useful in a multi-user Kibana environment.)
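
As a concrete example, shipping a couple of Nginx logs with a custom field might look like the following. The IP, port, and log paths here are just illustrative placeholders; point them at your own indexer and files:

/opt/lumberjack/bin/lumberjack \
  --host 172.16.0.52 \
  --port 5001 \
  --ssl-ca-path /etc/ssl/logstash.pub \
  --field customer=clientname \
  /var/log/nginx/access.log /var/log/nginx/error.log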

To make my life easier, I run the following init script:

Lumberjack Init.d Script
#! /bin/sh
### BEGIN INIT INFO
# Provides:          lumberjack-nginx
# Required-Start:    $remote_fs $syslog
# Required-Stop:     $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Start daemon at boot time
# Description:       Enable service provided by daemon.
### END INIT INFO
 
. /lib/lsb/init-functions
 
name="lumberjack-nginx"
 
# Read configuration variable file if it is present
[ -r /etc/default/$name ] && . /etc/default/$name
 
lumberjack_bin="/opt/lumberjack/bin/lumberjack.sh"
pid_file="/var/run/$name.pid"
cwd=`pwd`
 
start () {
        command="${lumberjack_bin}"
 
        if start-stop-daemon --start --quiet --oknodo --pidfile "$pid_file" -b -m -N 19 --exec $command -- $LUMBERJACK_OPTIONS; then
                log_end_msg 0
        else
                log_end_msg 1
        fi
}
 
stop () {
        start-stop-daemon --stop --quiet --oknodo --pidfile "$pid_file"
}
 
status () {
        status_of_proc -p $pid_file "" "$name"
}
 
case $1 in
        start)
                if status; then exit 0; fi
                echo -n "Starting $name: "
                start
                echo "$name."
                ;;
        stop)
                echo -n "Stopping $name: "
                stop
                echo "$name."
                ;;
        restart)
                echo -n "Restarting $name: "
                stop
                sleep 1
                start
                echo "$name."
                ;;
        status)
                status && exit 0 || exit $?
                ;;
        *)
                echo "Usage: $0 {start|stop|restart|status}"
                exit 1
                ;;
esac
 
exit 0
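
To wire that up on a Debian/Ubuntu box, the steps are roughly the following (paths assumed; adjust to taste):

cp lumberjack-nginx /etc/init.d/lumberjack-nginx
chmod +x /etc/init.d/lumberjack-nginx
update-rc.d lumberjack-nginx defaults
service lumberjack-nginx start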

You might have noticed it’s called “lumberjack-nginx”. That’s because I run a separate daemon for each log format I ship: Nginx, MySQL, Varnish, Syslog, etc. The options are defined in /etc/default/lumberjack-nginx (or whatever you’re shipping).

Lumberjack Options File
LUMBERJACK_OPTIONS="--field customer=clientname --host 172.16.0.52 --port 5001 --ssl-ca-path /etc/ssl/logstash.pub /var/log/nginx/access.log"

Bonus: Kibana

Kibana is awesome. No really. It’s a really great front-end for Elasticsearch and makes analyzing your logs much, much more enjoyable. It’s super simple to install, too.

First, you’ll need to clone the repo: git clone --branch=kibana-ruby https://github.com/rashidkpc/Kibana.git

Install Kibana
git clone --branch=kibana-ruby https://github.com/rashidkpc/Kibana.git /opt/kibana
apt-get install rubygems libcurl4-openssl-dev
export PATH=$PATH:/var/lib/gems/1.8/bin
cd /opt/kibana
bundle install

By the way, there’s also a “kibana-ruby-auth” branch for multiuser installs. It seems to work pretty well in my testing, although there have been several bugs in the interface. As of my last checkout, there’s an issue with the “Home” button linking to “index.html”, which needs to be changed to “/”. I’m not up to speed on Ruby, so I don’t know what the “right” fixes are, but I get by.
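
If you want to try it, the checkout is the same as before, just with the other branch (install path assumed):

git clone --branch=kibana-ruby-auth https://github.com/rashidkpc/Kibana.git /opt/kibana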

You’ll also need to install the following packages:

apt-get install rubygems libcurl4-openssl-dev

If you want to play with the multiuser install (not recommended for production at this time), you’ll also need the libpam0g-dev package.

After that, you just need to make some minor changes to KibanaConfig.rb (such as pointing it to the right Elasticsearch server and making sure Kibana will listen on the right ports) and run “ruby kibana.rb”, or use this handy init script:

Kibana Init Script
#!/bin/bash
 
# Author - Matt Mencel - 2012
# Borrowed heavily from Patrick McKenzie's example
#
# kibana-daemon       Startup script for the Kibana Web UI running on sinatra
# chkconfig: 2345 20 80
#
## Copy to /etc/init.d/kibana and make it executable
## Add to system startup through chkconfig (CentOS/RHEL) or Upstart (Ubuntu)
KIBANA_PATH="/opt/kibana"
 
ruby1.8 $KIBANA_PATH/kibana-daemon.rb $1
RETVAL=$?
 
exit $RETVAL
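
A rough install sequence for that script on Debian/Ubuntu might look like this (CentOS/RHEL would use chkconfig instead of update-rc.d; paths assumed):

cp kibana /etc/init.d/kibana
chmod +x /etc/init.d/kibana
update-rc.d kibana defaults
/etc/init.d/kibana start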

So… what does this have to do with Drupal?

It will make your Drupal logging life easier.

No, really. I’m sure that most people have their Drupal site send all log messages to the database, and that’s fine if you have a low-traffic site. But what about when you need to actually search for something, such as the messages associated with a particular issue your users are sporadically experiencing? If you’ve ever tried to do that through the DB log, you know how painful it can be.

Why not take advantage of another built-in feature of Drupal? Drupal can send its log messages to syslog via the core Syslog module, and if you want Drupal to log to its own dedicated file, there’s a contrib module for that. With Drupal logging to the filesystem, you take a load off of the database and you can easily ship your logs elsewhere. They can be indexed in Elasticsearch for easy search and analysis through Kibana, or sent to several other places such as Graphite, email, IRC, PagerDuty, etc. You get to decide how and where your logs are processed.
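
If you go the Syslog-module route, Drupal 7’s default syslog format is a pipe-delimited string (base URL, timestamp, type, IP, request URI, referer, uid, link, message), so a filter along these lines could split it into searchable fields. This is only a sketch: it assumes the default format and that your Drupal events arrive with the syslog type from the config above.

filter {
  grok {
    type => "syslog"
    # assumes Drupal's default syslog_format; adjust if you've customized it
    pattern => [ "%{DATA:drupal_base_url}\|%{NUMBER:drupal_timestamp}\|%{DATA:drupal_type}\|%{DATA:drupal_ip}\|%{DATA:drupal_request_uri}\|%{DATA:drupal_referer}\|%{NUMBER:drupal_uid}\|%{DATA:drupal_link}\|%{GREEDYDATA:drupal_message}" ]
    add_tag => [ "drupal" ]
  }
}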

And if you have the spare RAM and aggressive log rotation, you could just send your logs to tmpfs and avoid disk load altogether. After all, you are shipping them out to be stored elsewhere.

Even if you can’t log to a file, Logstash can help. Logstash has an input for Drupal’s DBLog. All you have to do is tell it what database to connect to and it handles the rest. It only pulls the log messages that it hasn’t received previously. This can be useful if you don’t want to deal with configuring Drupal to log to Syslog or another file, or if you’re one of those crazy types that runs multiple web heads without configuration management to ensure that everyone is shipping file-based logs properly and consistently.
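
Here’s roughly what that input looks like. I’m writing the option names from memory, so double-check them against the plugin docs for your Logstash version; the site name and connection string are obviously made up:

input {
  drupal_dblog {
    type => "drupal"
    # hypothetical site name and MySQL connection string
    databases => [ "mysite", "mysql://drupal:password@127.0.0.1/drupal" ]
    # how often to poll for new watchdog rows (minutes, if memory serves)
    interval => 10
  }
}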

One situation where I’ve found shipping Drupal logs useful is debugging. A good Drupal module logs errors (and optional debugging messages) through watchdog. If you just leave those watchdog entries in DBLog or syslog, you’ve got a tough time ahead of you when searching for issues. If you ship them through Logstash, you can have them indexed in Elasticsearch and search through the messages, have them graphed by statsd to see whether there’s a trend in when your users hit errors, or have your developers nagged in IRC.

Another useful case is tracking logins and user signups. You can easily build charts showing when users are signing up or how often they log in, as well as figure out whether someone is setting up multiple fake accounts.