Thursday, 21 January 2016

Multiplication table using matrix in spark

In this post, I am sharing a code snippet to demonstrate multiplication table in spark.

In addition, it also demonstrates the following:

  • How to create a custom Spark RDD
  • How to create an Iterator/Generator
  • How to work with matrices in spark

Assumption :

  • You are using JDK 8.
  • You added spark and spark-mllib libraries to the classpath. I assume you use something like Maven and add these dependencies.



import org.apache.spark.Partition;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
import org.apache.spark.rdd.RDD;
import scala.collection.AbstractIterator;
import scala.collection.Iterator;
import scala.collection.mutable.ArrayBuffer;
import scala.reflect.ClassManifestFactory$;

/**
 * This class illustrates multiplication table on spark!
 * @author  Thamme Gowda N
 *
 */
public class Main {

    /**
     * This RDD has numbers in it.
     * This RDD can be used for illustrating math operations on Spark
     */
    public static class SequenceRDD extends RDD<Long>{

        private long start;
        private long end;
        private Partition[] partitions;

        public SequenceRDD(SparkContext _sc, long start, long end) {
            super(_sc, new ArrayBuffer<> (),
                    ClassManifestFactory$.MODULE$.fromClass(Long.class));
            this.start = start;
            this.end = end;
            this.partitions = new Partition[]{
                    () -> 0 //just one part at index 0
            };
        }

        @Override
        public Iterator<Long> compute(Partition split, TaskContext ctx) {
            //we have only one part, so this is it
            return  new SequenceIterator(start, end);
        }

        @Override
        public Partition[] getPartitions() {
            return partitions;
        }
    }

    /**
     * This iterator yields numbers so we can build an RDD on it
     */
    private static class SequenceIterator
            extends AbstractIterator<Long>
            implements java.util.Iterator<Long> {

        private long nextStart;
        private long end;

        /**
         * Number generator for [start, end]
         * @param start the start
         * @param end the end, inclusive
         */
        public SequenceIterator(long start, long end) {
            this.nextStart = start;
            this.end = end;
            assert end >= start : "Invalid Args";
        }

        @Override
        public boolean hasNext(){
            return nextStart <= end;
        }

        @Override
        public Long next() {
            return nextStart++;
        }
    }

    public static void main(String[] args) {
        long n = 1_000;
        String outpath = "multiplication-matrix";

        long st = System.currentTimeMillis();
        SparkConf conf = new SparkConf()
                .setAppName("Large Matrix")
                .setMaster("local[2]");
        JavaSparkContext ctx = new JavaSparkContext(conf);
        JavaRDD<Long> rdd = new SequenceRDD(ctx.sc(), 0, n)
                .toJavaRDD()
                .cache();

        JavaPairRDD<Long, Long> pairs = rdd.cartesian(rdd);
        JavaRDD<MatrixEntry> entries = pairs.map(tup ->
                new MatrixEntry(tup._1(), tup._2(), tup._1() * tup._2()));
        CoordinateMatrix matrix = new CoordinateMatrix(entries.rdd());
        matrix.toIndexedRowMatrix()
                .rows()
                .saveAsTextFile(outpath);

        System.out.printf("n=%d\t outpath=%s\nTime taken : %dms\n", n,
                outpath, System.currentTimeMillis() - st);
        ctx.stop();
    }
}

Tuesday, 19 January 2016

Comparing cubic time complexity algorithm with a fourth-power complexity

   I thought of comparing n3 and n4 running times to feel what it means to have a slow algorithm.

Let us suppose we have an algorithm with cubic runtime complexity:

 int n = 2500;
 int[] arr = new int[n];
 System.out.println("Start");
 long st = System.currentTimeMillis();
 for (int i = 0; < n; i++) {
   for (int j = 0; j < n; j++) {
     for (int k = 0; k < n; k++) {
       arr[k]++;
    }
   } 
 }
 System.out.printf("%dms\nDone!", System.currentTimeMillis() - st);

For n = 2500, it completed in 1.827 seconds on a modern i5 CPU :
 Start  
 1827ms  
 Done!  

Now, an algorithm of power 4 time complexity,
 int n = 2500;
 int[] arr = new int[n];
 System.out.println("Start");
 long st = System.currentTimeMillis();
 for (int i = 0; i < n; i++) { 
   for (int j = 0; j < n; j++) {
     for (int k = 0; k < n; k++) {
       for (int l = 0; l < n; l++) {
         arr[l]++;
      }  
     }  
   }  
 }  
 System.out.printf("%dms\nDone!", System.currentTimeMillis() - st);

With just one additional loop, for the same n=2500, it took:

Well, it didn't complete... I waited 15 minutes, yet no answer!!
EDIT: It completed after 47 minutes!
 Start  
 2863936ms  
 Done! 


Why is this comparison important to me?
  This comparison became interesting to me because I have implemented a Tree Edit Distance algorithm which has a fourth-power time complexity of the input size (number of nodes in the trees). I used this algorithm on HTML DOM trees which have around 1000 to 2500 HTML elements.

Summary 
 Thus, an algorithm with power 4 is not an acceptable solution for tree edit distance for web pages with more than 1000 HTML elements.
I am seeking a better algorithm.



Digit grouping for very long numbers in Java

When we need to write a Billion on a sheet of paper we don't just write 1000000000, because it is hard to see if it is exactly a billion or  hundred million in the very first glance without counting the digits. Instead, grouping the digits is common practice so it can be written as 1,000,000,000. It is evident that grouping the digits improves the readability of numbers.

    However, all these days I never saw anywhere grouping the numeric literals in the software code.  Recently I was surprised to see that Java language supports digit groups by using the underscore ('_') characters as group separators. For example, the following is a perfectly valid way to store a decimal Billion in java.

int n = 1_000_000_000;