Word Cloud Generator

I made a Word Cloud Generator at work for the use of our Analysts. It’s not as fully featured as Wordle but doesn’t use a Java Applet on the client side and lets you download the word cloud as an image.

It has NOT been tested on ANY version of IE and it looks terrible, but if you need to generate a word cloud image, give our Word Cloud Generator a shot.


JMapReduce: Easy MapReduce with Hadoop on the JVM

JMapReduce is a Mandy-inspired library that lets you quickly write and run Hadoop MapReduce scripts in JRuby. The main difference between JMapReduce and Mandy is that JMapReduce runs in a JVM using JRuby, whereas Mandy runs in Ruby using Hadoop’s Streaming API. I wrote JMapReduce because I needed Mandy-like MapReduce scripts that could also make use of Java libraries, and therefore needed something that runs in the JVM.

For a quick introduction to terms like Hadoop, MapReduce and Mandy, I would recommend reading my colleague Paul Ingles’s blog post MapReduce with Hadoop and Ruby from early 2010.

Here is a word count example:
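
As a rough illustration of the map and reduce steps in a word count, here is a plain Ruby sketch. It does not use JMapReduce's DSL and does not run on Hadoop; the names `map_words` and `reduce_counts` are hypothetical and exist only for this example.

```ruby
# Plain-Ruby sketch of the two word-count phases; this is NOT the JMapReduce DSL.
# map_words and reduce_counts are hypothetical names used only for illustration.

# Map phase: emit a [word, 1] pair for every word in a line of input.
def map_words(line)
  line.split.map { |word| [word.downcase, 1] }
end

# Reduce phase: sum the emitted counts for each distinct word.
def reduce_counts(pairs)
  pairs.group_by { |word, _| word }
       .map { |word, group| [word, group.sum { |_, count| count }] }
       .to_h
end

lines  = ["the quick brown fox", "the lazy dog"]
pairs  = lines.flat_map { |line| map_words(line) }
counts = reduce_counts(pairs)
puts counts.inspect
```

In a real job the grouping by key happens in Hadoop's shuffle step between the mapper and the reducer; here `group_by` stands in for it.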

You can also chain MapReduce jobs like so:
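
To give a feel for what chaining means, the plain Ruby sketch below feeds the output of one job into the next. It is not JMapReduce's DSL; `run_job`, `count_job` and `histogram_job` are hypothetical names, and the in-memory grouping only imitates what Hadoop does between map and reduce.

```ruby
# Plain-Ruby sketch of chaining two jobs; this is NOT the JMapReduce DSL.
# run_job, count_job and histogram_job are hypothetical names.

# Run one job: map every input pair, group emitted values by key,
# then reduce each group into zero or more output pairs.
def run_job(input, mapper, reducer)
  emitted = input.flat_map { |key, value| mapper.call(key, value) }
  grouped = emitted.group_by(&:first).transform_values { |kvs| kvs.map(&:last) }
  grouped.flat_map { |key, values| reducer.call(key, values) }
end

# Job 1: count how often each word occurs.
count_job = {
  mapper:  ->(_offset, line) { line.split.map { |word| [word, 1] } },
  reducer: ->(word, ones) { [[word, ones.sum]] }
}

# Job 2: group words by their count. The last job emits Strings so that
# the final output is readable text.
histogram_job = {
  mapper:  ->(word, count) { [[count, word]] },
  reducer: ->(count, words) { [[count.to_s, words.sort.join(",")]] }
}

input = [[0, "a b a"], [1, "c b a"]]
first = run_job(input, count_job[:mapper], count_job[:reducer])
final = run_job(first, histogram_job[:mapper], histogram_job[:reducer])
final.each { |count, words| puts "#{count}\t#{words}" }
```

The intermediate pairs in `first` carry Integers as values, while only the last stage converts everything to Strings.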

Mappers and Reducers can emit Integers, Floats, Strings, Arrays and Hashes, but the very last emit of the very last job should be a String; otherwise you will see binary data in your eventual result.

Visit the main page for more information and examples.

Tags: jruby hadoop

Hive Thrift client

Assumptions made: you know what Hive is, you know what Thrift is and you know how to install and start the Hive Thrift server.

To query the Hive server from a language that doesn’t run on the Java Virtual Machine, and therefore can’t easily make use of the JDBC drivers, you will need to generate a Thrift client library in your chosen language to talk to the Thrift server (as long as Thrift has a generator for your language, of course).

I couldn’t find much documentation online on exactly how to go about this, therefore I thought I’d document the steps I took to generate the Ruby thrift client for Hive, in case someone somewhere wanted to do the same:

thrift --gen rb -I service/include metastore/if/hive_metastore.thrift
thrift --gen rb -I service/include -I . service/if/hive_service.thrift
thrift --gen rb service/include/thrift/fb303/if/fb303.thrift
thrift --gen rb serde/if/serde.thrift
thrift --gen rb ql/if/queryplan.thrift
thrift --gen rb service/include/thrift/if/reflection_limited.thrift
  • Your thrift client will be in the folder: PATH_TO_HIVE_SOURCE_CODE/gen-rb (again replace rb in the generated folder name with your language)
  • You can now copy the generated code around and start using the library to connect with your Hive server

    In Ruby you would use the generated code as follows:
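
A minimal sketch of such usage might look like the following. It rests on several assumptions: the `thrift` gem is installed, the generated `gen-rb` folder is reachable, the generator produced a `thrift_hive.rb` file defining `ThriftHive::Client`, the server listens on port 10000, and result rows come back as tab-separated Strings. The names `run_hive_query` and `split_rows` are hypothetical helpers, not part of the generated code.

```ruby
# Sketch only, under the assumptions stated above.
# run_hive_query and split_rows are hypothetical helper names.

# Assumption: each result row arrives as one tab-separated String.
def split_rows(rows)
  rows.map { |row| row.split("\t") }
end

def run_hive_query(query, host: "localhost", port: 10_000, gen_dir: "gen-rb")
  require "thrift"                 # the Thrift runtime gem
  $LOAD_PATH.unshift(gen_dir)      # folder produced by `thrift --gen rb`
  require "thrift_hive"            # assumed name of the generated service file

  socket    = Thrift::Socket.new(host, port)
  transport = Thrift::BufferedTransport.new(socket)
  protocol  = Thrift::BinaryProtocol.new(transport)
  client    = ThriftHive::Client.new(protocol)

  transport.open
  begin
    client.execute(query)          # run the query on the Hive server
    split_rows(client.fetchAll)    # fetch every row and split into columns
  ensure
    transport.close                # always release the connection
  end
end

# Example call (needs a running Hive Thrift server):
#   run_hive_query("SHOW TABLES").each { |row| puts row.join(", ") }
```

The requires sit inside the method so that the file can be loaded, and `split_rows` exercised, without a Hive server or the generated code being present.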

    Alternatively, you could use our Ruby Thrift client library: we’ve written a thin layer on top of the generated Thrift code to make things a little easier. Feel free to browse and download the code here: http://github.com/forward/rbhive


    Ultimate nginx config for Phusion Passenger

    Main nginx.conf:

    worker_processes  1;
    events {
        worker_connections  1024;
    }

    http {
      include       mime.types;
      default_type  application/octet-stream;

      sendfile        on;
      keepalive_timeout  65;

      gzip  on;
      gzip_http_version 1.0;
      gzip_comp_level 2;
      gzip_buffers 16 8k;
      gzip_proxied any;
      gzip_min_length 360;
      gzip_types      text/plain text/html text/css application/x-javascript text/xml application/xml application/xml+rss text/javascript;
      proxy_set_header    Accept-Encoding  "";

      passenger_root /usr/local/lib/ruby/gems/1.8/gems/passenger-2.2.10;
      passenger_ruby /usr/local/bin/ruby;

      include /path/to/your/app/config/nginx.conf;
      client_max_body_size 10M;
      client_body_buffer_size 128k;
    }

    Your application’s nginx.conf:

    server {
      listen 80;
      server_name website.com;
      rewrite ^/(.*) https://www.website.com/$1 permanent;
    }

    server {
      listen 443;
      server_name www.website.com;

      root   /path/to/your/public/folder;
      passenger_enabled on;

      access_log  /path/to/your/log/folder/nginx_access.log;
      error_log  /path/to/your/log/folder/nginx_error.log;

      ssl         on;
      ssl_certificate      /etc/ssl/certs/your-website.crt;
      ssl_certificate_key  /etc/ssl/private/your-website.key;

      location ~* ^.+\.(jpg|jpeg|gif|css|png|js|ico)$ {
        root  /path/to/your/public/folder;
        expires max;
      }
    }


    Thoughts on team management

    I am currently in a software development team that works very differently from other teams I have been on. This has allowed us to deliver project after project in a timely and efficient manner, despite having been together for only a year.

    We’ve been able to achieve this after receiving blessings from the Hindu God Ganesha … err, actually no. It mostly comes down to how our team is managed, or not managed, as will become clear. To use a silly analogy: my high street grinds to a halt every morning due to traffic, and the only mornings it has flowed freely are the ones when the traffic lights stopped working. There’s nothing to stop the cars, so no queues build up. Yes, it’s a dangerous scenario, and you hope that most drivers that morning are competent and alert. Management gives order and control to how projects progress, but over-management can slow things down.

    When you have competent and skillful ‘drivers’ in your team, minimal red tape can really let them flow. There is no real hierarchy in our team of 7; everyone is a team lead and everyone is experienced enough to be trusted. And let me be clear, we don’t have people in the team that don’t exist elsewhere. We all got hired because the company thought we were competent engineers, and they proved it by putting their trust in us. Our manager is the person who understands the business inside out and is in charge of giving and discussing the business requirements. We tackle business problems in small chunks, and how we solve those problems is left up to us, the engineers. Whether we write a framework or a dirty script, whether we write tests or not, whether we write the solution in Ruby or Clojure, whether we pair on the problem or re-write solutions from the ground up is our call.

    We don’t bank on any golden hammer to solve problems; we’ve used Ruby, JRuby, Clojure, Java, MapReduce using Hadoop, Hive and others I’m probably forgetting, to solve problems in the most appropriate way. It keeps things exciting and interesting, and the whole company is now reaping the benefits of all the work. But none of this would have happened if we had not been allowed to explore and do what we thought was best, instead of having old plans, procedures, approved technologies, etc. in place to chain down the creative minds of the people who work here.

    Some other IT teams in my company still have traditional structures, because not all scenarios will fit the working styles my team employs. And this post is in no way suggesting that management is a bad thing; in fact, good management is essential to the success of any team, including ours. Good management also involves knowing when you’re getting in the way and when to have faith in the people you hired. From experience I can say that working in a rigid or tightly controlled environment makes for a mundane and less fun place to work, and the best employees will look to avoid just those places.

    So far the way we are working has been all well and good, and I’m sure problems resulting from the way we do things will surface in the months or years to come, because I know life isn’t that nice (f**k you life). But I’m not sure they will be big enough to outweigh the benefits to the business. Oh, and by the way, my company has roughly 200 employees, so it’s not a multi-national yet, and what I’ve just said is probably an unrealistic idea in a multi-national (or is it?), but anyway, that’s my two cents.

    Tags: management