COVID-19 Simulation

The Importance of Social Isolation

Coronavirus has got me shook. I’ve been staying up late for the past several days reading about the coronavirus. I’m trying to think of ways I can use my not-a-real-dr skills to help spread awareness.

Twitter is a great place to stay updated, and I found information there that only made it to mainstream media 2-3 days later – by which point it's "old" news, and the reality is much worse. For instance, as of this writing on March 14, 2020, we likely have about 10,000 to 20,000 cases in the United States. But the mainstream media is still reporting the 2-3k numbers that are officially confirmed (perhaps they will try to correct for the underreported numbers soon).

This morning, I woke up to a really interesting tweet.

It inspired me to do something similar. I’m not an expert, but I wanted to see the general effects of social distancing.

Suppose there is a population of 100 people (represented in blue), and they all wear face masks. The CDC has suggested face masks are not effective, but we can optimistically assume they make it "harder" to get infected. For instance, we won't be able to touch our mouths or noses (but still our eyes and ears). This is what it could look like, where the red color means a person is infected.

We can see that eventually, everyone gets infected. Assume the bottom x-axis represents some measure of time.

Now imagine if everyone interacted with each other with reckless abandon. Two people hug, shake hands, and immediately rub their eyes. We might see something like the following:

Here, the number of infected people increases much faster, and everyone still gets infected. Compared to the face mask version, everyone in the reckless-abandon scenario is infected by the time things really start to ramp up for the people wearing face masks.

Now let’s look at the effects of social isolation. Suppose that only 30% of a population practiced social distancing or social isolation, and were vigilant to always wash their hands. We don’t even have to quarantine people who are sick – just keep 30% of people working from home or socially isolated. This is what it could look like:

Unlike before, we see that some of the blue people still remain at the end – and it takes a longer time for most of the population to get infected.

Now let’s assume that a majority, say 60%, of the population practices social isolation or social distancing. Over the same amount of time as the previous simulations, not only will the entire population never become all red, but there are a lot of blue people around by the end! And this is still assuming that the red people infect others with reckless abandon!

In reality, we hope the results are more towards the last graph. Other things we don't consider are that the red people will get better with time, and that those who are sick will self-quarantine. If we account for these two factors, we will have even more blue people.
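Out of curiosity, here is a back-of-the-envelope Python sketch of the kind of toy simulation described above. All of the parameters – contact probability, number of steps – are made-up values for illustration, not epidemiology:

```python
import random

def simulate(population=100, isolating_fraction=0.0,
             contact_prob=0.1, steps=100, seed=42):
    """Toy simulation: one infected person to start, no recovery.
    People who isolate neither catch nor spread the infection.
    Returns how many people are infected after `steps` time steps."""
    rng = random.Random(seed)
    infected = [False] * population
    infected[0] = True  # patient zero
    # The first `isolating_fraction` of the population (minus patient zero)
    # stays home.
    isolating = [0 < i < isolating_fraction * population
                 for i in range(population)]
    for _ in range(steps):
        for i in range(population):
            if infected[i] or isolating[i]:
                continue
            j = rng.randrange(population)  # one random encounter per step
            if infected[j] and not isolating[j] and rng.random() < contact_prob:
                infected[i] = True
    return sum(infected)

print(simulate(isolating_fraction=0.0))   # reckless abandon
print(simulate(isolating_fraction=0.6))   # 60% social isolation
```

With no isolation, essentially the whole population ends up red; raising the isolating fraction leaves more blue people at the end, just like the graphs above.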

So let’s do social distancing as much as we can, and we will not only flatten the curve, but we can smash it!

Special thanks to Ali Almossawi for showing me how to convert a sequence of image files to gif.

Bayes Theorem from Scratch

Whenever you hear a poorly descriptive name like "Bayes Theorem", the last thing you want to do is learn what it is. Probability is one of those topics that we encounter all the time but that is poorly taught in high school (I'm looking at you, Personal Finance). I'm here to show you it's a simple idea, and it really should've been named The Easiest-to-Understand Theorem in Probability. After reading this, you'll always be able to derive it from scratch, and you won't be scared of calculating probabilities anymore. The one thing to keep in mind is that learning the notation and terminology takes more effort than understanding the concept itself.

Let’s start with the most basic concept you already know – the probability of something happening.

It’s easy to visualize things when we have examples, so let’s take my favorite active NBA player today – Klay Thompson. Let’s say Klay takes 10 shots and makes 6 of them. Given this set of data, his probability of making a shot is 6 out of 10 (or 60%).

In this scenario, there are only two outcomes – he either makes a shot or misses a shot. Both outcomes need to add up to 100% since they represent all the possibilities that can happen.

That means his probability of missing a shot is 40%, or 4/10.

Let’s visualize this:

Let's call Klay a "node", and the possibilities coming from this node "branches". There is a 6/10 chance to follow the "Makes" branch, and a 4/10 chance to follow the "Misses" branch. All branches stemming from a node need to sum to 1, since the branches represent all the possible outcomes. In reality, there can be more than 2 branches – in fact, as many branches as there are possible outcomes! In physics, we jokingly call this kind of simplification a "spherical cow" (the joke is that nothing in the real world is a perfect sphere, yet we solve all our problems as if it were). So here, to keep things simple, we stick with just 2 branches.

Klay is feeling a little lonely, so let’s introduce his Splash Brother – another NBA player named Steph Curry.

Steph Curry is a great shooter – he breaks shooting records with no regard for human life. As a main contributor, he is on the basketball court (known as being "on the floor") most of the time. Let's say that at any given moment, there is a ¾ chance that Steph is on the floor.

When he’s on the basketball court, players tend to be more aware of him. Since defenders are occupied with Steph, Klay is more open and tends to make more of his shots. Klay can make shots without Steph on the floor, as we showed above, but he tends to make more of them when Steph is on the floor. When Steph is on the floor, Klay makes 80% (8/10) of his shots, and therefore misses 20% (2/10) of them. Let’s visualize all of this in a tree:

Just to recap, when Steph is on the floor, Klay makes 8/10 of his shots. When Steph is off the floor, Klay makes 6/10 of his shots. There is a ¾ chance that Steph is on the floor at any given moment. As a reminder, all branches stemming from a node sum to 1.

Now we can answer all sorts of questions with the above graph.

For instance, what is the probability that Klay misses a shot when Steph is off the floor? (Answer: 4/10)

What is the probability that Klay makes a shot when Steph is off the floor? (Answer: 6/10)

These kinds of probabilities are called “conditional probabilities”, because some “event” needs to happen before we answer the “probability” of something happening. In the 2 examples above, that “event” is whether or not Steph is on the floor, and the “probability” that we’re asking is whether Klay makes his shot.

We can represent a conditional probability like this:

P(k = makes \:|\: s = on\:the\:floor)

That is read: "What is the probability that Klay makes his shot, given that Steph is on the floor". A shorter version is this:

P(k \:|\: s)

Which is another way of saying "What is the probability of k given s". The event we're asking about comes first, the "|" translates to "given", and the last parameter is the event that must take place. Don't worry if that doesn't make sense – you get used to it the more you see it. It's just notation that mathematicians have agreed upon.

Another way to look at conditional probabilities is using a Venn diagram.

We notice from the diagram that "Steph being on the floor" is the "given" part of the conditional probability – it is everything inside the red circle.

Using this Venn diagram, we can come up with an equation for

P(k \:|\: s)
Where k = Klay makes a shot, and s = Steph is on the floor. In other words, again, it is the probability that Klay makes a shot while Steph is on the floor.

We know that all the possible outcomes of the probability must include "Steph being on the floor". This is represented by the red circle above. We see that Klay making a shot AND Steph being on the floor is represented by the overlap of the red and blue circles, or the area that is shaded in purple.

The purple area is also known as the “intersection between Klay making a shot and Steph being on the floor”. We can represent it in this notation:

 P(k \cap s)

This is read as the probability of “Klay making a shot and Steph being on the floor”, or “the probability of k intersect s, where k = Klay making a shot, and s = Steph being on the floor”.

It is easy to see, then, that the probability of Klay making a shot given that Steph is on the floor is equal to the probability of Klay making a shot AND Steph being on the floor, over the probability of Steph being on the floor. Mathematically, we can say:

P(k|s)=\cfrac{P(k \: \cap \: s)}{P(s)}

Where k = Klay makes a shot, and s = Steph is on the floor.

Let’s suppose we want to answer the question, “what is the probability that Klay makes a shot AND Steph is on the floor ?”

Rearranging the equation, we get:

P(k \cap s) = P(k|s) \cdot P(s)

As you can see, it’s as simple as multiplying “the probability of Klay making a shot given Steph is on the floor” by “the probability that Steph is on the floor”.

Looking at the tree diagram from earlier, let's answer the question, "what is the probability that Klay makes a shot AND Steph is on the floor":

  • The probability of Klay making a shot given Steph is on the floor = 8/10
  • The probability that Steph is on the floor = ¾.
  • Multiplying these two together gives (8/10) * (3/4) = 24/40, or 3/5

So there is a 3/5 chance that Klay makes a shot and Steph is on the floor.
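We can check this joint-probability arithmetic with exact fractions in Python (a quick sanity check, not part of the original post):

```python
from fractions import Fraction

p_k_given_s = Fraction(8, 10)  # P(k|s): Klay makes it, given Steph is on the floor
p_s = Fraction(3, 4)           # P(s): Steph is on the floor
p_k_and_s = p_k_given_s * p_s  # P(k ∩ s) = P(k|s) · P(s)
print(p_k_and_s)  # 3/5
```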

What if we wanted to answer the question: what is the probability that Steph is on the floor, given Klay makes a shot? This is the reverse of the conditional probability we had earlier. We can represent this probability like this:

P(s \:|\: k)
We know that this must equal the probability of Steph being on the floor and Klay making a shot divided by the probability of Klay making a shot. In other words,

P(s|k) = \cfrac{P(s \: \cap \: k)}{P(k)}

How do we find the probability that Klay makes a shot? We need to find the probability that Klay makes a shot and Steph is on the floor, and add it to the probability that Klay makes a shot and Steph is not on the floor. We already found the probability that Klay makes the shot and Steph is on the floor – which is 3/5.

To find the probability that Klay makes a shot and Steph is not on the floor, we can do a similar calculation using the tree diagram. Let's also introduce one final notation. The apostrophe next to a variable denotes the complement of that event – the event not happening. For instance, if s means that Steph is on the floor, then s' means that Steph is not on the floor.

P(s')

This can be read as "the probability of Steph not being on the floor", which, as the tree diagram says, is ¼.

Therefore, the probability of Klay making a shot and Steph not being on the floor is:

P(k \cap s') = P(s') \cdot P(k|s')

To get the probability of Klay making a shot, we add this probability together with the probability of Klay making a shot and Steph is on the floor:

P(k) = P(k \cap s) + P(k \cap s') = P(k|s) \cdot P(s) + P(k|s') \cdot P(s')

Therefore, the probability of Steph being on the floor given that Klay has made a shot is:

P(s|k) = \cfrac{P(s\:\cap\:k)}{P(k|s) \cdot P(s) + P(k|s') \cdot P(s')}

Furthermore, we can see that:

P(s \cap k) = P (k \cap s)

Where we know that the right hand side is really:

P(k \cap s) = P(k|s) \cdot P(s)

Finally, we can rewrite the probability of Steph being on the floor given that Klay has made a shot:

P(s|k) = \cfrac{P(k|s) \cdot P(s)}{P(k|s) \cdot P(s) + P(k|s') \cdot P(s')}

And now, we’re done!

Wait a sec, you protest. Where did Bayes theorem come in ?

We actually derived it much earlier, and then went on to show the “extended” form when the events are binary (here, our events are binary because either Steph is on the floor or he isn’t, and either Klay makes a shot or he doesn’t).

The whole point of Bayes theorem is to get from

P(k \:|\: s)

to

P(s \:|\: k)

– that is, to reverse a conditional probability.

We showed this earlier:

P(s|k) = \cfrac{P(s \cap k)}{P(k)}

Which could be simplified to:

P(s|k) = \cfrac{P(k|s) \cdot P(s)}{P(k)}

That is Bayes theorem. When the events are binary, then we can show what P(k) really is:

P(s|k) = \cfrac{P(k|s) \cdot P(s)}{P(k|s) \cdot P(s) + P(k|s') \cdot P(s')}

That wasn’t so bad, was it? See, I told you the hardest part was learning the notation!

Let’s actually find this using the tree diagram!

P(s|k) = \cfrac{8/10 \cdot 3/4}{3/4 \cdot 8/10 + 1/4 \cdot 6/10} = \cfrac{3/5}{3/5 + 3/20} = \cfrac{3/5}{15/20} = 4/5

There’s a 4/5, or 80% chance that Steph is on the floor when Klay made a shot!
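The whole derivation can be checked with exact fractions in Python (a quick verification sketch, not from the original post):

```python
from fractions import Fraction

p_s = Fraction(3, 4)               # P(s): Steph is on the floor
p_k_given_s = Fraction(8, 10)      # P(k|s): Klay makes it, Steph on the floor
p_k_given_s_not = Fraction(6, 10)  # P(k|s'): Klay makes it, Steph off the floor

# Total probability: P(k) = P(k|s)·P(s) + P(k|s')·P(s')
p_k = p_k_given_s * p_s + p_k_given_s_not * (1 - p_s)

# Bayes theorem: P(s|k) = P(k|s)·P(s) / P(k)
p_s_given_k = p_k_given_s * p_s / p_k

print(p_s_given_k)  # 4/5
```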

Left Shift is Multiplication

The left-shift operator is the same as multiplying by a power of two. So, if I = 3, then I << 1 is 6. How does that work? Let's look at the number 12, and at what happens when we shift it left once:

2^3 + 2^2 = 8 + 4 = 12

2^4 + 2^3 = 16 + 8 = 24

Indeed, the shift multiplied the original value by 2. What happened here?

Well, let's multiply the first line by 2:

2 * (2^3 + 2^2) = 2^4 + 2^3

Now it is clear that left-shifting means you're multiplying by two. If you left-shift again, you're multiplying the original by 2^2, and so on.

If you want to multiply by a number that is not a power of two, say 17, then it would be something like this:

2^4 + 1 = 16 + 1

So you left-shift 4 times, and add the original number.

N * (2^4 + 1) = N(16 + 1) = (N << 4) + N
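The steps above can be sketched in a few lines of Python (the `times_17` helper is mine, for illustration):

```python
# Left-shifting by k multiplies by 2**k.
assert 3 << 1 == 6        # I = 3, I << 1 = 6
assert 12 << 1 == 24      # (2**3 + 2**2) doubled is 2**4 + 2**3
assert 12 << 2 == 48      # shifting twice multiplies by 2**2

def times_17(n):
    # 17 = 2**4 + 1: left-shift 4 times, then add the original number
    return (n << 4) + n

print(times_17(3))  # 51
```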


Converting Binary to Integer

I recently came across this example in my “Learning Python” book:

It converts a binary string to an integer. The ord function is interesting, but I was more interested in the algorithm itself. The algorithm says to look at the digits one by one, and each time through the loop, multiply the current value by 2 and add the next digit's integer value. In this case, '1101' would end up printing 13, which is correct – 13 in binary is 1101.
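Since the book's listing didn't survive in this copy, here is a sketch of what that loop looks like (my reconstruction, not the book's exact code):

```python
def binary_to_int(s):
    # Each time through the loop: multiply the running value by 2,
    # then add the next digit's integer value (via ord arithmetic).
    i = 0
    for c in s:
        i = i * 2 + (ord(c) - ord('0'))
    return i

print(binary_to_int('1101'))  # 13
```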

But then I struggled to think of how this would work. My naive solution would be to create a powers of two table, work my way from left to right, and get the highest value I can get without going over. For instance, 2^4 is 16, so I have to use 2^3, which is 8. Thus there is a bit in the 2^3 spot. Next I try 2^2, which is 4, and 8 + 4 is less than 13, so I can put a bit in the 2^2 spot. Then I try 2^1 which is 2, and 8 + 4 + 2 is 14, which is larger than 13, so there is a 0 in the 2^1 spot. And finally I need 2^0, as that is 1, and 8 + 4 + 1 = 13.

Since that was my solution, I would try to convert it to code – because oftentimes how you think about a solution is how you write it in code.
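That thought process – a powers-of-two table, working from left to right – converts fairly directly into code. Here is a sketch (my own, not from the book) that builds the binary representation of a number this way:

```python
def to_binary_greedy(n):
    # Find the largest power of two that fits, then walk down the
    # powers-of-two table, marking a 1 bit whenever we can subtract.
    if n == 0:
        return '0'
    power = 1
    while power * 2 <= n:
        power *= 2
    bits = ''
    while power >= 1:
        if n >= power:
            bits += '1'
            n -= power
        else:
            bits += '0'
        power //= 2
    return bits

print(to_binary_greedy(13))  # 1101
```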

However, that was not what the book had. I now know what the book did.

Any integer can be represented in the form:

N = x_n q^n + x_{n-1} q^{n-1} + \cdots + x_1 q^1 + x_0 q^0

Here N is the integer, and the x_i are the digits, whose possible values depend on the base. So if it's base 2 (binary), each digit is either 0 or 1. If it's base 10, each digit is 0-9. q^n is the base to the nth power. For binary, it's going to be 2^n.

13 can be represented like this:

13 = 1 \cdot 2^3 + 1 \cdot 2^2 + 0 \cdot 2^1 + 1 \cdot 2^0

Now we can factor the 2 out, and keep factoring out the inner 2 until we can't anymore:

13 = 2 \cdot (1 \cdot 2^2 + 1 \cdot 2 + 0) + 1 = 2 \cdot (2 \cdot (1 \cdot 2 + 1) + 0) + 1

We end up with the bottom form, which makes it easier to see why the original algorithm works. As a reminder, the original algorithm says to multiply the current value by 2, and then add the next digit. That is exactly what evaluating from the innermost parentheses outward does: (1) 1 \cdot 2 + 1 = 3, (2) 3 \cdot 2 + 0 = 6, and (3) 6 \cdot 2 + 1 = 13.

So it all makes sense now! Yay!

Linked Lists to the Rescue – Part II – Queues

In Part I, we looked at an implementation of Stacks using Linked Lists. In this post, we’ll look at Linked Lists with Queues.

As a review of the Linked Lists of Stacks, here’s a drawing of the implementation of Stacks using the linked-list nodes:


As you can see, the latest node points to the previous node, which points to the node previous to that, etc…

Whenever we add a new node to the Stack, we make sure we point it back to the previous node. Once we "pop" the stack, we return the value of the "current" node and follow its pointer back to the previous node, which then becomes the new "current" node. The old "current" node is now orphaned, and left for the garbage collector. The main thing to note is that the pointers go backwards (towards the left) as we add more nodes to the Stack.

In my implementation of a Queue, the pointers are going to go in the opposite direction! Check it:


The pointers are now going to the right. What's going on? In a queue, recall that it's "FIFO", or "First In, First Out". We start out with a single node as our first item. When we add a node, the previous node points to our new node.

In order to "pop" (we call it "dequeue" in this case), we have to remember our very first node on the left, so we assign a variable to keep track of it. After that first node is returned, we follow its pointer to the next node, and make this next node the "first" node. The old first node is then orphaned, and gets cleaned up by the garbage collector.

In addition to remembering our first node, we also have to keep track of our most recent node, in order to add the future nodes to it. Let’s talk with code now:

The “@q” variable is our “first” node, and the “@current” variable is the latest node.
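The Ruby listing didn't survive in this copy, so here is a minimal sketch of the same idea in Python, with `first` and `last` standing in for `@q` and `@current`:

```python
class Node:
    def __init__(self, item):
        self.item = item
        self.next = None

class Queue:
    def __init__(self):
        self.first = None   # plays the role of @q: the oldest node
        self.last = None    # plays the role of @current: the newest node

    def enqueue(self, item):
        node = Node(item)
        if self.last:
            self.last.next = node   # previous node points to the new one
        else:
            self.first = node       # queue was empty
        self.last = node

    def dequeue(self):
        node = self.first
        if node is None:
            return None
        self.first = node.next      # follow the pointer to the next node
        if self.first is None:
            self.last = None        # queue ran empty
        return node.item
```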

To summarize, creating Stacks and Queues with linked lists has some advantages:

  • We don’t need to keep track of the size of the array, and the memory is dependent strictly on how big our list is
  • As another point related to size – the bigger the size, the longer it takes to do certain operations, such as resizing. For instance, downsizing a large array takes at least as many operations as the size of the array to resize, since we’re copying the values. In linked lists, the time is independent of the size
  • It's easier to think about conceptually than the resizing (more of an opinion)
  • Linked lists are so cool!

    Linked Lists to the Rescue – Part I – Stacks

    As one might have imagined from the many hints and questions from the previous post, linked lists can be very useful when dealing with Stacks and Queues.

    To define a linked list, let’s start with the basic element of a linked list, called a node. Here is my silly illustration of a node:


    The left box represents the "value" of the node, and the right box represents where to look for the next node. That's it!

    You can put nodes in a list like this:


    The leftmost node has a value, and it also tells us where to go for the next node, which is the node in the center. The center node has a value and tells us to look at the right node next. Nodes chained together like this are called a "linked list."

    Let’s see a node represented in Ruby!

    In this “node class”, there are two variables: “item” and “next”.

    The “Item” can hold anything you want: Strings, Numbers, Integers, Arrays, Hashes, other Ruby classes, etc… the “list” goes on (pardon my intended pun). The “next” is like a pointer, in that it shows us where to look for the “next” node.

    To see it in action, here is one implementation of a Stack using Linked Lists:

    We need our node class, so the first line requires it. Before saying anything else, see how much shorter the code is already! This has to do with not needing to remember the array size. If all you need to remember is where to go next, you don't need to remember to resize your data structure once you hit size limits. The linked list shrinks and grows on command, and "orphaned" nodes are cleaned up by the garbage collector. It is more pleasant to read (and implement) the stack class using linked lists.

    The stack is initialized by creating a single node. As we add items to the stack, we tell our new node to point to our previous node. Unlike the array implementation where we had to keep track of all elements in the stack, all we ever have to know at any point in time is our latest node when we use the linked list implementation. When we remove items, we follow our latest node to the previous node, and tell the previous node to become the “current” node.
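    The Ruby listings didn't survive in this copy; here is a minimal sketch of the node class and the linked-list stack in Python (the names are mine, not the original post's):

```python
class Node:
    def __init__(self, item, next_node=None):
        self.item = item          # the "value" box
        self.next = next_node     # the pointer box

class Stack:
    def __init__(self):
        self.current = None       # all we ever track is the latest node

    def push(self, item):
        # The new node points back to the previous "current" node.
        self.current = Node(item, self.current)

    def pop(self):
        node = self.current
        if node is None:
            return None
        self.current = node.next  # previous node becomes "current"
        return node.item          # old node is left for the GC
```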

    In Part II, let’s see how to implement Queues with Linked Lists!

    A World Without Linked Lists

    To get a sense of how important Linked Lists are, it might be useful to imagine a world without Linked Lists.

    Suppose we want to build a Stack. A Stack is a collection of objects in a sequence, where the last object added is also the first object removed. The common term is “Last in, first out”, or “LIFO.”

    A commonly used real-world example for a Stack is a stack of pancakes. When pancakes are placed on top of each other, the last pancake added is the one on the very top. This very top pancake is also the first to be eaten.

    How would we implement a Stack without using Linked Lists ?

    Let’s say we are allowed to use an Array to implement our Stack. How big should our array be at the start ?

    By its definition, a Stack implies that its size varies with the addition or removal of objects. What if our Stack outgrows our initial Array size ?

    If we are able to solve the problem of our Stack outgrowing our Array size, what then if our Stack eventually shrinks significantly? Would we have to hold a large number of empty elements in our Array?

    With some thought, we are able to implement our Stack using just an Array. Here is one implementation:

    The main takeaway is that we double the array size whenever we reach our limit, and we shrink to a quarter whenever half of the array is empty (we shrink to a quarter instead of a half so that we aren’t teetering on the edge of doubling or shrinking the array with a single add or drop in the stack). In expanding or reducing the array, we make use of the “resize” method, which makes a copy of the entire array, temporarily doubling the space cost.
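    The original Ruby listing is missing from this copy; the doubling-and-shrinking scheme described above can be sketched in Python like so (halving the capacity once only a quarter of it is in use):

```python
class ArrayStack:
    """Stack on a fixed-size array: double when full, halve the
    capacity when only a quarter of the slots are in use."""
    def __init__(self):
        self.items = [None] * 2
        self.n = 0                       # number of live elements

    def _resize(self, capacity):
        new_items = [None] * capacity
        for i in range(self.n):          # copying costs O(n)
            new_items[i] = self.items[i]
        self.items = new_items

    def push(self, item):
        if self.n == len(self.items):
            self._resize(2 * len(self.items))
        self.items[self.n] = item
        self.n += 1

    def pop(self):
        if self.n == 0:
            return None
        self.n -= 1
        item = self.items[self.n]
        self.items[self.n] = None        # let go of the reference
        if 0 < self.n <= len(self.items) // 4:
            self._resize(len(self.items) // 2)
        return item
```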

    Let’s look at another collection of objects – a queue. A Queue is “First in, first out”, or “FIFO”. A queue in the real world could be people lining up to see a movie. The first person in line is also the first person to get into the theater.

    We run across the same sizing questions as the Stack – and can handle them similarly.

    In the Stack implementation, the last element of the Array was always the element being popped, and the element being pushed was always added at the end of the Array. For a Queue, we need to keep the position of the “head” and “tail” of the queue. If the beginning element of the array is removed when we “dequeue”, what do we do with the empty elements at the beginning of the array as we continue to dequeue ?

    Here is one implementation of a Queue using Arrays:

    We have to be more careful with the head and tail positions of the array. Whenever we resize our array, or whenever our head position reaches our tail position, we shift our head position back to the beginning of the array. The length of our queue is determined by our tail position less our head position. Aside from these differences, our sizing decisions are handled similarly to our Stack implementation.
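    The Ruby listing is missing here too; a Python sketch of the head/tail bookkeeping described above might look like this (live elements are shifted back to the start of the array on resize):

```python
class ArrayQueue:
    """Queue on an array; head and tail indices track the front and back."""
    def __init__(self):
        self.items = [None] * 2
        self.head = 0      # index of the first (oldest) element
        self.tail = 0      # index one past the last element

    def _resize(self, capacity):
        new_items = [None] * capacity
        n = self.tail - self.head
        for i in range(n):               # shift live elements to the front
            new_items[i] = self.items[self.head + i]
        self.items = new_items
        self.head, self.tail = 0, n

    def enqueue(self, item):
        if self.tail == len(self.items):
            self._resize(2 * (self.tail - self.head) or 2)
        self.items[self.tail] = item
        self.tail += 1

    def dequeue(self):
        if self.head == self.tail:       # queue is empty
            return None
        item = self.items[self.head]
        self.items[self.head] = None
        self.head += 1
        if self.head == self.tail:       # queue ran empty: reset indices
            self.head = self.tail = 0
        return item
```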

    Boy – that’s a lot of resizing and keeping track of positions! A world without Linked Lists is tough!
    Can Linked Lists make things easier ? What are Linked Lists ?

    Connecting Storm and Kafka

    Previously, we set up Kafka and we’ve also set up Storm.

    Now, let’s connect the two!

    By now, we assume that you are able to get the ExclamationTopology running from our previous guide. In this guide, we are going to use Kafka as the Spout for the ExclamationTopology.

    I usually check out the binary version of Storm to run Storm itself, but I like to develop on the source version of Storm. To that end, let’s grab the source version of Storm and extract it:

    $ wget

    $ tar -zvxf apache-storm-0.9.2-incubating-src.tar.gz

    You can see that a Kafka-connector project has already been written for us, in external/storm-kafka.

    To understand how to leverage this project, let's make Kafka the Spout for ExclamationTopology.

    First, copy the ExclamationTopology to the storm-kafka project, since that's where all the KafkaSpout classes are (there are other ways of doing this, such as bundling storm-kafka into a jar, and using those libraries instead).

    $ cp apache-storm-0.9.2-incubating-src/examples/storm-starter/src/jvm/storm/starter/ apache-storm-0.9.2-incubating-src/external/storm-kafka/src/jvm/storm/kafka

    Now add the following imports into ExclamationTopology:

    import storm.kafka.SpoutConfig;
    import backtype.storm.spout.SchemeAsMultiScheme;

    It should look something like this:


    Remember also to change the package declaration at the top of the file to storm.kafka (not storm.starter), as seen in the screenshot.

    Then use this to setup the Kafka Spout, assuming you have created a “test” topic as shown in our Kafka guide.

    String zkConnString = "localhost:2181";
    String topicName = "test";

    BrokerHosts hosts = new ZkHosts(zkConnString);
    SpoutConfig spoutConfig = new SpoutConfig(hosts, topicName, "/" + topicName, "discovery");
    spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
    KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);

    builder.setSpout("kafka", kafkaSpout);

    Here is a screenshot:


    Then add the following lines to the pom.xml file currently in that directory. These have to go within the “plugins” bracket:

    Also, remove the "scope" field in the org.apache.kafka dependency in the pom.xml file under storm-kafka.

    Then, compile the package, and submit it to a Storm production cluster.

    If you remember from our tutorial on Kafka, you can open up a Producer command-line terminal and start typing messages into Kafka. Go ahead and do this.

    On the other side, you should see Storm adding exclamations to everything you type.

    Congratulations, you have connected Storm with Kafka!

    Setting up Storm and Running Your First Topology

    This guide will setup Storm on a single Ubuntu instance, and show you how to run a simple Word Count Topology. This guide assumes no experience with Storm.

    Storm was created by Nathan Marz, and is designed for real-time event processing, and improves on some of Hadoop’s distributed design. Storm provides excellent documentation, which you are highly encouraged to go through. If you’re pressed for time, though, the following guide gets you started with running a simple real-time event processor (this is called a topology, but I assume you haven’t read any documentation and just want to get the thing up and running. Though this puts you at a conceptual disadvantage, there’s nothing like getting your hands dirty right away).

    Setting up Storm

    First grab version 0.9.2 of Storm (already compiled version)

    $ wget

    Extract the files:

    $ tar -zxvf apache-storm-0.9.2-incubating.tar.gz

    Grab the latest version of maven:

    $ sudo apt-get install maven

    If you are on a Mac:

    $ brew install maven

    also set the JAVA_HOME path in Mac through the ~/.bash_profile file:

    $ export JAVA_HOME=$(/usr/libexec/java_home)

    Check the maven version to see that it installed correctly:

    $ mvn -version


    If you checked out the src version of Storm, you can build and install the Storm jars locally with $ mvn clean install -DskipTests=true (this requires a pom.xml file). You don't need to do this if you downloaded the compiled version as this guide has shown. However, it's worth noting now, because you'll use the same command to compile your projects if you modify any of the source code.

    Instead of building jars for the Storm project (since we’ve checked out the compiled version), let’s build the jar file for the storm-starter example project. First go into the storm-starter project within the apache-storm-0.9.2-incubating/examples folder:

    $ cd apache-storm-0.9.2-incubating/examples/storm-starter

    Now compile and build the jar files:

    $ mvn clean install -DskipTests=true

    It should take a few minutes, and you’ll see a lot of output. At the end of the output, you should see:


    You are now ready to run your first Storm job (called a “topology”). We are going to run the topology located in apache-storm-0.9.2-incubating/examples/storm-starter/src/jvm/storm/starter/

    Let’s run the topology first, and then go briefly into the details of what is happening.

    In the storm-starter directory, issue:

    $ mvn compile exec:java -Dstorm.topology=storm.starter.WordCountTopology

    The whole process will take about 50 seconds, and you will see a whole bunch of output. The main thing to look for in this output is something like this:


    It should occur near the middle of all the output being shown. The end of the output should show a bunch of shutdown messages, along with a success message like this:


    Congratulations! You have run your first Storm topology!

    Storm has a local mode (called “Local Cluster”) and a production-cluster mode. The local mode is helpful for development and debugging, while the production-cluster mode is intended to run in a production environment. You just submitted the WordCountTopology in local mode.

    Let’s take a look at what you just did.

    A Storm topology consists of “Spouts” and “Bolts”. You can think of Spouts as obtaining the data, and Bolts as transforming the data. In a topology, you typically have one or more Bolts stemming from one Spout. The “data” in a Storm topology is called a “Tuple”.


    In the WordCountTopology, the Spout used is the RandomSentenceSpout:


    RandomSentenceSpout is located at apache-storm-0.9.2-incubating/examples/storm-starter/src/jvm/storm/starter/spout/

    If you take a peek at this file, you can see the sentences being used:


    That explains our output in our example – the words being “emitted” are taken from these sentences. The nextTuple method is common in all Spouts, and determines what you want the Spout to do. As you can see, it is a method that is commonly overridden.

    Let’s now take a look at the Bolts in


    These bolts are classes defined within the same file. By their names, we can guess what they do. Let's take a look at "SplitSentence":


    It looks like it is calling a python script. Doing a little digging, we find this script located in apache-storm-0.9.2-incubating/examples/storm-starter/multilang/resources/

    We've just stumbled upon a cool thing that Storm can do – bolts are allowed to be language-agnostic! This means we can write our logic in any language we please. In this case, we are splitting sentences with Python! Here is the logic:


    As you can see, it’s splitting the sentences by a single space, and “emitting” each word in that sentence.
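    The script's contents didn't survive in this copy. Stripped of Storm's multilang plumbing, its core logic amounts to this (a sketch of the idea, not the actual file):

```python
def split_sentence(sentence):
    # Split on a single space; each word becomes its own emitted tuple.
    return sentence.split(" ")

print(split_sentence("the cow jumped over the moon"))
```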

    So our first bolt “SplitSentence” is actually a python script that splits the sentences into words. Let’s take a look at our second bolt, “WordCount”, which is defined in


    As you can see, a HashMap called “counts” is created, which stores the counts of each word going through.

    This is the basic and fundamental template of a Storm topology. All other topologies you see are just different variations on this.

    Just for completeness, let’s take a look at the rest of WordCountTopology:


    As you might guess based on the names of the variables, the rest of the file is used for configuration information.

    conf.setDebug controls the verbosity of the output. The block of code within the “if” statement is configuration for production, and the block of code in the “else” statement is for local mode (which is what we just saw). The topology being submitted is called “word-count”, and we’ve asked the job to run for 10 seconds.

    In the meantime, as a “homework” assignment, you are encouraged to get the working, located in examples/storm-starter/src/jvm/

    If you are feeling ambitious, try modifying the input Spout, and see how things change. However, you will need to download the source version and build storm-core from scratch, as that spout is part of storm-core. Remember to issue the compile command at the top storm level after each modification of the code:

    $ mvn clean install -DskipTests=true

    Deploying Storm on Production

    Package the project for use on a Storm cluster. For instance, in storm-starter, do:

    $ mvn package

    The package should be in:

    target/storm-starter-0.9.2-incubating-jar-with-dependencies.jar
    You can check out the binary version of storm (as we did above), and use the “storm” command from there. You can also add the bin path to $PATH.
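    For example, assuming the binary release was extracted under your home directory (adjust the path to wherever you put it):

```shell
# make the "storm" command available from anywhere
export PATH="$HOME/apache-storm-0.9.2-incubating/bin:$PATH"
```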

    Read this to fill in the storm.yaml part.

    I only modified the following in storm.yaml:

    storm.zookeeper.servers:
      - "localhost"
    #  - "server2"
    nimbus.host: "localhost"

    Then start nimbus, supervisor, and UI:

    $ storm nimbus
    $ storm supervisor
    $ storm ui
    (localhost:8080 by default)

    Then, from the machine that has the jar-with-dependencies jar, submit it:

    $ storm jar /Users/haroldnguyen/workspace/storm-tutorial/apache-storm-0.9.2-incubating-src/examples/storm-starter/target/storm-starter-0.9.2-incubating-jar-with-dependencies.jar storm.starter.ExclamationTopology production-topology-1

    The logs are located in the binary version of storm/logs.

    The Storm UI is pretty neat, giving you a summary of running topologies and a visualization:

    If you still have trouble, please read through this excellent documentation on Running Topologies on a Production Cluster.


    Congratulations – you went from knowing nothing about Storm to running a Word Count topology! The world is really your oyster!

    In our next post, we’ll see how to connect Storm with Kafka!

    Setting up Zookeeper and Kafka

    This was tested on an m1.medium instance on AWS on Ubuntu 12.04.

    Installing Zookeeper

    Zookeeper is needed for Kafka. To quote the Zookeeper Apache project page,

    Zookeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

    In other words, we need Zookeeper to maintain configuration for Kafka nodes in a multi-node environment. In this tutorial, we’ll only be setting up a single Kafka node instance, but Kafka still relies on Zookeeper for configuration information.

    We can start off by grabbing Zookeeper:

    $ wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz

    Extract the files:

    $ tar -zxvf zookeeper-3.4.6.tar.gz

    Change to the zookeeper/conf directory:

    $ cd zookeeper-3.4.6/conf

    In this directory, create the zoo.cfg file from the provided template configuration file named zoo_sample.cfg:

    $ cp zoo_sample.cfg zoo.cfg

    Open up zoo.cfg and change the dataDir variable to point to /var/lib/zookeeper. (I used the vi editor to make these changes).
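    After the edit, the relevant lines of zoo.cfg look like this (tickTime and clientPort are the zoo_sample.cfg defaults):

```
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
```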


    Since this directory doesn’t exist yet, create it:

    $ sudo mkdir /var/lib/zookeeper

    Install Java on a fresh Ubuntu 12.04:

    $ sudo apt-get update
    $ sudo apt-get install openjdk-7-jdk

    In the zookeeper-3.4.6 directory, start Zookeeper:

    $ sudo bin/zkServer.sh start

    You should see a message similar to:


    You can test that you can connect to it with the bundled Java client:

    $ bin/zkCli.sh -server 127.0.0.1:2181


    Great! Now you have Zookeeper running.

    Installing Kafka

    We’ll set up using kafka_2.9.2-0.8.1.1 (the Kafka 0.8.1.1 build for Scala 2.9.2), since at the time of this writing, Storm-0.9.2 (which we will be using) plays nicer with this version of Kafka.

    First grab Kafka, and then extract it:

    $ wget https://archive.apache.org/dist/kafka/0.8.1.1/kafka_2.9.2-0.8.1.1.tgz
    $ tar xvzf kafka_2.9.2-0.8.1.1.tgz

    Change to the kafka_2.9.2-0.8.1.1 directory:

    $ cd kafka_2.9.2-0.8.1.1

    Here, change the log.dirs property in the config/server.properties file to /var/lib/kafka-logs:


    Create that directory:

    $ sudo mkdir /var/lib/kafka-logs

    In the config/server.properties file again, point the advertised.host.name property to the public DNS (the address that others can connect to) of your instance:
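    With both edits in place, the relevant lines of config/server.properties look something like this (the DNS name below is a placeholder; substitute your instance’s public DNS):

```
log.dirs=/var/lib/kafka-logs
advertised.host.name=ec2-XX-XXX-XX-XX.compute-1.amazonaws.com
```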

    In the kafka_2.9.2-0.8.1.1 directory, you can start the Kafka server:

    $ sudo bin/kafka-server-start.sh config/server.properties &

    Or, start it as a daemon:

    $ sudo bin/kafka-server-start.sh -daemon config/server.properties

    You can stop the server anytime:

    $ sudo bin/kafka-server-stop.sh

    Create a “topic” named test with a single partition and one replica:

    $ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

    See it at the command line:

    $ bin/kafka-topics.sh --list --zookeeper localhost:2181


    You can also start inputting messages at the command line, and read the output in another terminal. First start the producer on the topic test:

    $ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

    Type a message and press ‘enter’. For instance, the message can be “hello”.


    Open another terminal, and in the kafka_2.9.2-0.8.1.1 directory, start the consumer:

    $ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

    This will start consuming all messages from the beginning of the topic, but you can also set an offset. Messages survive for 7 days by default (log.retention.hours=168 in the sample config), but retention can be tuned to your needs.

    You should see each message given from the producer:


    Talking to Kafka with Python

    Instead of the command line, maybe we’d like to push messages to Kafka programmatically, say, from a Python script.

    First, get ‘pip’ so that you can install Python packages.

    $ sudo apt-get install python-pip

    Now install the kafka-python package:

    $ sudo pip install kafka-python

    While leaving your consumer terminal open, run the following Python script:
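    The script itself did not survive the formatting here, but with the pre-1.0 kafka-python API it would look something like this sketch (the broker address and the “Hello from Python!” message are assumptions; the topic test matches the one created above):

```python
# send_to_kafka.py: a sketch, assuming a broker on localhost:9092 and the
# pre-1.0 kafka-python API (KafkaClient / SimpleProducer)

def build_message(text):
    # Kafka 0.8 messages are raw bytes, so encode explicitly
    return text.encode("utf-8")

def main():
    # imported here so build_message can be reused without kafka-python installed
    from kafka import KafkaClient, SimpleProducer

    client = KafkaClient("localhost:9092")
    producer = SimpleProducer(client)
    # publish to the "test" topic created earlier
    producer.send_messages("test", build_message("Hello from Python!"))

if __name__ == "__main__":
    main()
```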

    You should see the message show up in the consumer:


    You’ve now successfully pushed messages to Kafka using a Python script. There are many more things you can do with the kafka-python package, and you are encouraged to look through the documentation on the kafka-python package Github page.


    In this tutorial, you set up Zookeeper, Kafka, and used a Python script to push messages to Kafka. If you’ve seen any errors in this tutorial, please feel free to leave comments in the section below.