About benchmarking Hadoop systems and scalability

I've been reading a lot of benchmarks for Hadoop technologies lately, and benchmarks published by software vendors should always be taken with a grain of salt: every new version or product is somehow 10-100 times faster than the competition.


IBM did a benchmark comparing Big SQL 3.0 against Cloudera Impala and Hortonworks Hive. The results were very good, and I'm not claiming they are false. But one claim caught my attention.

There were charts about scalability when the same 99 test queries were run with 1 user vs. 4 users. The 4-concurrent-user test took only 1.8 times longer than the 1-user test, and this was presented as a sign of good scalability.

Blistering Fast SQL Access to Hadoop using IBM BigInsights 3.0 with Big SQL 3.0

Of particular note is the fact that 4 concurrent query streams (and therefore 4 times more queries) only takes 1.8x longer than a single query stream at 30TB. Once again highlighting Big SQL’s impressive multi-user scalability – this time at 30TB.


Serialized vs. parallelized execution

With parallel systems it is essential that execution is spread as evenly across all the nodes as possible. If in the test system (17 nodes in IBM's test) only one node is used at 100% while the others sit idle, the execution is effectively serialized.

The execution is also serialized if it uses multiple nodes but only one node at a time. I've seen software that should be multi-threaded but in fact behaves single-threaded: it uses many processors, but only one at a time.



Here is an example of a 20-second query that uses all 4 nodes, but only one node at a time. Execution on the next node does not start before the earlier node has finished. The execution is serialized.



Here is a perfectly parallelized execution. It uses the same 20 seconds of CPU as the previous one, but now it uses all the nodes in parallel, and the whole execution lasts only 5 seconds. Normally this kind of execution exists only in computer labs. It has not been seen in the wild. If you see one, contact the local authorities.
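The arithmetic behind the two pictures is simple enough to sketch. The numbers below are my toy example, not from IBM's paper:

```python
# Toy numbers: a query that needs 20 seconds of total CPU work on 4 nodes.
total_cpu = 20.0
nodes = 4

# Serialized: nodes run one after another, so wall-clock time = total work.
serialized_wall_clock = total_cpu          # 20.0 seconds

# Perfectly parallelized: all nodes work at once, wall-clock = work / nodes.
parallelized_wall_clock = total_cpu / nodes  # 5.0 seconds

print(serialized_wall_clock, parallelized_wall_clock)
```

Both executions burn the same 20 CPU-seconds; only the wall-clock time differs.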

What can be parallelized?

 


In a Hadoop system the primary resources that can be used in parallel are CPU, memory, disk and network. Each of these resources is finite and can be overloaded. Execution can also be parallel in the sense that one thread uses CPU, another uses disk and a third uses network, without affecting each other's performance.

When a resource is overloaded it might simply make every execution linearly slower, but a resource can also be so overloaded that it cannot serve any more requests at all, which is what happened to Impala and Hive in the test when memory ran out.

How parallel was Hadoop's execution in the test?


If the test query is evenly parallelized, then every node and every resource is used at 100%. If you then run 4 concurrent tests, the minimum time the execution can take is four times longer. If 4 concurrent tests take less time than that, the execution wasn't evenly parallelized in the first place. Remember that this applies to every resource in every node: every CPU should be used at 100% and every disk should be used at 100%.

If I were running a benchmark, I would of course select the number of concurrent users that gives the results that appear to show the best parallel execution.

In any execution there are parts that are serialized and parts that are parallelized. Let's try to calculate how much serialized execution there is in this Hadoop test, using a highly simplified model.

S is the execution time of the serialized part and P is the execution time of the parallelized part. single_time is the test execution time for the single-user test, and parallel_time is the time of one execution in the concurrent-user test. threads is the number of concurrent users and nodes the number of nodes.

Let's assume in our model that the serialized parts always get one whole node, because that would be the case if we want to show results that appear as the best parallel execution. The parallel part has to divide the nodes between the concurrent users.

single_time = S + P/nodes
parallel_time = S + threads*P/nodes
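The two formulas can be written as a small Python sketch (function and variable names are mine). As a sanity check I plug in S = 11/15 ≈ 0.7333 and P = 68/15 ≈ 4.5333, the exact forms of the values the model produces later in the post:

```python
def single_time(S, P, nodes):
    # One user: serial part runs on one node, parallel part spreads over all nodes.
    return S + P / nodes

def parallel_time(S, P, nodes, threads):
    # Several users: each run's parallel part has to share the nodes with the others.
    return S + threads * P / nodes

S, P = 11 / 15, 68 / 15
print(single_time(S, P, 17))       # ~1.0
print(parallel_time(S, P, 17, 4))  # ~1.8
```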

Let's do some math:
single_time = S + P/nodes
single_time - S = P/nodes
nodes(single_time - S) = P

parallel_time = S + threads*nodes(single_time - S)/nodes
parallel_time = S + threads*(single_time - S)
parallel_time = S + threads*single_time - threads*S
parallel_time - threads*single_time = S - threads*S
parallel_time - threads*single_time = (1 - threads)*S
(parallel_time - threads*single_time)/(1 - threads) = S
S = (parallel_time - threads*single_time)/(1 - threads)

P = nodes(single_time - (parallel_time - threads*single_time)/(1 - threads))
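The solved formulas translate directly into code (names mine):

```python
def solve_S(single_time, parallel_time, threads):
    # S = (parallel_time - threads*single_time) / (1 - threads)
    return (parallel_time - threads * single_time) / (1 - threads)

def solve_P(single_time, parallel_time, threads, nodes):
    # P = nodes * (single_time - S)
    return nodes * (single_time - solve_S(single_time, parallel_time, threads))

print(solve_S(1.0, 1.8, 4))      # ~0.7333
print(solve_P(1.0, 1.8, 4, 17))  # ~4.5333
```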

In IBM's test the 4-user test took 1.8 times longer than the single-user test, so:
single_time = 1
parallel_time = 1.8
threads = 4
nodes = 17
=>
S = 0.733...
P = 4.533...
Whole execution time S+P = 5.266...
S% (serialized part percentage) = 13.9%


We can calculate how these figures change if we add or subtract nodes:

Nodes        S        P      S+P      S%
    3   0.7333   0.8000   1.5333   47.8%
    4   0.7333   1.0667   1.8000   40.7%
    5   0.7333   1.3333   2.0667   35.5%
    6   0.7333   1.6000   2.3333   31.4%
    7   0.7333   1.8667   2.6000   28.2%
    8   0.7333   2.1333   2.8667   25.6%
    9   0.7333   2.4000   3.1333   23.4%
   10   0.7333   2.6667   3.4000   21.6%
   11   0.7333   2.9333   3.6667   20.0%
   12   0.7333   3.2000   3.9333   18.6%
   13   0.7333   3.4667   4.2000   17.5%
   14   0.7333   3.7333   4.4667   16.4%
   15   0.7333   4.0000   4.7333   15.5%
   16   0.7333   4.2667   5.0000   14.7%
   17   0.7333   4.5333   5.2667   13.9%
   18   0.7333   4.8000   5.5333   13.3%
   19   0.7333   5.0667   5.8000   12.6%
   20   0.7333   5.3333   6.0667   12.1%
   21   0.7333   5.6000   6.3333   11.6%
   22   0.7333   5.8667   6.6000   11.1%
   23   0.7333   6.1333   6.8667   10.7%
   24   0.7333   6.4000   7.1333   10.3%
   25   0.7333   6.6667   7.4000    9.9%
   26   0.7333   6.9333   7.6667    9.6%
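The table can be regenerated with a short script built on the same model (variable names mine):

```python
single_t, parallel_t, threads = 1.0, 1.8, 4
S = (parallel_t - threads * single_t) / (1 - threads)  # serialized time, ~0.7333
per_node = single_t - S                                # P/nodes, ~0.2667

print(f"{'Nodes':>5} {'S':>8} {'P':>8} {'S+P':>8} {'S%':>7}")
for nodes in range(3, 27):
    P = nodes * per_node
    print(f"{nodes:>5} {S:8.4f} {P:8.4f} {S + P:8.4f} {100 * S / (S + P):6.1f}%")
```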

We can see that the serialized part stays the same, so the number of nodes does not affect it. The test had at least 17 nodes, and as I said earlier the actual "node" count is higher, because it should really be the number of individual resources (CPU and disk in each node). But as the calculation shows, S% does not change that much when the node count grows. (And remember this is only a simplified model.)

The parallelized execution time P is the summed execution time over all the nodes used, so the wall-clock time of the parallelized part is actually a constant 0.266... I'm not saying it does not matter how many nodes you use in your test. I'm saying that to get these kinds of results (one thread 1x, 4 threads 1.8x) it does not matter how many nodes the test had. It only matters how the execution was parallelized.

We can also see that with 17 nodes the serialized part takes 73% of the single-user wall-clock execution time and the parallelized part 27%.

How does it scale then? 


Let's take a simple example where the original test had only 4 nodes.

This is how a single thread executes:



First it executes the parallel part on 4 nodes and then continues with the serialized part on one node. Reality is of course much more complex. As you can see, when the serialized part starts, three nodes are idle.

Here is what happens when 4 threads are run on a 4-node system.




The serialized part executes just as in the single-thread test, but the parallelized part has to divide the nodes between all the threads. The 4-thread execution is then about 1.8 times slower than the single-thread execution, and no nodes are idle. This is why the 4-thread execution is only 1.8 times slower: in the single-thread execution there are idle resources because it is not perfectly parallelized.

What happens when we add a 5th thread?


We start to see normal scaling behaviour: there are no idle nodes left to utilize.
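Under the simplified model, each extra thread adds one parallel share (P/nodes ≈ 0.2667) of wall-clock time, so the per-run slowdown grows linearly. This is my extrapolation, and it is optimistic: once the threads outnumber what the nodes can absorb, the serialized parts also start queuing for nodes, so real slowdown will be worse than the line below suggests.

```python
S = 0.7333            # serialized wall-clock time from the model above
per_thread = 1.0 - S  # extra wall-clock time each additional thread adds (P/nodes)

for threads in range(1, 9):
    slowdown = S + threads * per_thread
    print(f"{threads} threads -> {slowdown:.2f}x a single run")
```

For 4 threads this gives the observed 1.8x; for 5 threads the model predicts about 2.07x.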

Conclusion


I don't have the test results of the individual queries or any data about resource consumption, so I had to build a mathematical model of what is happening. There might also be other explanations, such as disk caches, because all 4 threads execute the queries in the same order (although the paper does not actually say how the queries were run).

But I'm sure serialization is one of the reasons. Serialized parts in executions are a bad thing, because if you add more nodes to get faster executions, only the parallelized part gets faster. If it's true that 73% of the execution time in the test was serialized, that is the part you will never get rid of by simply adding more nodes. This is why it's good to do proper benchmarking when you build your Hadoop system.
