TG's Blog: 2016

Monday, 2 May 2016

Moved - to Wordpress

Dear Blogger,

If someone come to visit me here, please let them know that I moved to a new place at WordPress.com. Please forward any letters/packages you may receive to my new home: https://thammegowda.wordpress.com/

Thanks and Good bye.

-
TG

Monday, 4 April 2016

Maven project with scala and Java sources

Some of the best tech I learned these days are scala, spark, spark-MLlib and graphx.
My work with spark forced me to learn scala even though I was very comfortable with Java8. I admit I had attempted to learn scala earlier but gave up with the prejudice that Java8 caught up with scala in terms of richness of language features. Actually now I figured out that Scala is 10x improvement over java (< 1.8). Java language with version 8 upgrade caught upto 4-5x and still missing bunch of rich features of scala language.

That is being said, i wanted to document how to setup a maven project which can build both the scala and java source code. (Do I need to say that we can call scala routines from java and viceversa?). This is very useful because we can just add the scala plugin to existing maven java projects and start to make use of scala without any extra effort.

mkdir my-project
cd my-project
mkdir -p src/main/java src/main/scala src/main/resources/ \
                src/test/java src/test/scala/ src/test/resource
emacs/edit pom.xml  # and copy paste the below content to

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    
    <groupId>com.example</groupId>
    <artifactId>test-project</artifactId>
    <version>1.0-SNAPSHOT</version>
    
    <packaging>jar</packaging>

    <name>Test Project</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <scala.version>2.11.8</scala.version>
        <exec.mainClass>JavaMain</exec.mainClass>
    </properties>

    <dependencies>
      <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
      </dependency>
      <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
      </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                  <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions> 
            </plugin>
        </plugins>
    </build>
</project>

Let's put a sample Scala code

emacs src/main/scala/ScalaMain.scala # paste the below conents

object ScalaMain{
     def hello(){
           println("Hello there from scala")
     }
}

Lets call scala code from java :

emacs src/main/java/JavaMain.java  # paste the below contents

public class JavaMain {
     public static void main(String[] args){
          System.out.println("Hello From Java");
     ScalaMain.hello();
     }
}

Lets Compile and run

mvn compile exec:java  # the class is already specified in pom.xml as exec.mainClass

[INFO] Scanning for projects...
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building Test Project 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
.........
[INFO] 
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ test-project ---
Hello From Java
Hello there from scala
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.239s
[INFO] Finished at: Mon Apr 04 00:42:25 PDT 2016
[INFO] Final Memory: 14M/216M
[INFO] ------------------------------------------------------------------------

Summary :
I just documented the pom.xml configuration for a maven project which can compile both scala and java sources.

Thursday, 21 January 2016

Multiplication table using matrix in spark

In this post, I am sharing a code snippet to demonstrate multiplication table in spark.

In addition, it also demonstrates the following:

How to create a custom Spark RDD
How to create an Iterator/Generator
How to work with matrices in spark

Assumption :

You are using JDK 8.
You added spark and spark-mllib libraries to the classpath. I assume you use something like Maven and add these dependencies.

import org.apache.spark.Partition;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
import org.apache.spark.rdd.RDD;
import scala.collection.AbstractIterator;
import scala.collection.Iterator;
import scala.collection.mutable.ArrayBuffer;
import scala.reflect.ClassManifestFactory$;

/**
 * This class illustrates multiplication table on spark!
 * @author  Thamme Gowda N
 *
 */
public class Main {

    /**
     * This RDD has numbers in it.
     * This RDD can be used for illustrating math operations on Spark
     */
    public static class SequenceRDD extends RDD<Long>{

        private long start;
        private long end;
        private Partition[] partitions;

        public SequenceRDD(SparkContext _sc, long start, long end) {
            super(_sc, new ArrayBuffer<> (),
                    ClassManifestFactory$.MODULE$.fromClass(Long.class));
            this.start = start;
            this.end = end;
            this.partitions = new Partition[]{
                    () -> 0 //just one part at index 0
            };
        }

        @Override
        public Iterator<Long> compute(Partition split, TaskContext ctx) {
            //we have only one part, so this is it
            return  new SequenceIterator(start, end);
        }

        @Override
        public Partition[] getPartitions() {
            return partitions;
        }
    }

    /**
     * This iterator yields numbers so we can build an RDD on it
     */
    private static class SequenceIterator
            extends AbstractIterator<Long>
            implements java.util.Iterator<Long> {

        private long nextStart;
        private long end;

        /**
         * Number generator for [start, end]
         * @param start the start
         * @param end the end, inclusive
         */
        public SequenceIterator(long start, long end) {
            this.nextStart = start;
            this.end = end;
            assert end >= start : "Invalid Args";
        }

        @Override
        public boolean hasNext(){
            return nextStart <= end;
        }

        @Override
        public Long next() {
            return nextStart++;
        }
    }

    public static void main(String[] args) {
        long n = 1_000;
        String outpath = "multiplication-matrix";

        long st = System.currentTimeMillis();
        SparkConf conf = new SparkConf()
                .setAppName("Large Matrix")
                .setMaster("local[2]");
        JavaSparkContext ctx = new JavaSparkContext(conf);
        JavaRDD<Long> rdd = new SequenceRDD(ctx.sc(), 0, n)
                .toJavaRDD()
                .cache();

        JavaPairRDD<Long, Long> pairs = rdd.cartesian(rdd);
        JavaRDD<MatrixEntry> entries = pairs.map(tup ->
                new MatrixEntry(tup._1(), tup._2(), tup._1() * tup._2()));
        CoordinateMatrix matrix = new CoordinateMatrix(entries.rdd());
        matrix.toIndexedRowMatrix()
                .rows()
                .saveAsTextFile(outpath);

        System.out.printf("n=%d\t outpath=%s\nTime taken : %dms\n", n,
                outpath, System.currentTimeMillis() - st);
        ctx.stop();
    }
}

Tuesday, 19 January 2016

Comparing cubic time complexity algorithm with a fourth-power complexity

I thought of comparing n³ and n⁴ running times to feel what it means to have a slow algorithm.

Let us suppose we have an algorithm with cubic runtime complexity:

 int n = 2500;
 int[] arr = new int[n];
 System.out.println("Start");
 long st = System.currentTimeMillis();
 for (int i = 0; < n; i++) {
   for (int j = 0; j < n; j++) {
     for (int k = 0; k < n; k++) {
       arr[k]++;
    }
   } 
 }
 System.out.printf("%dms\nDone!", System.currentTimeMillis() - st);

For n = 2500, it completed in 1.827 seconds on a modern i5 CPU :

 Start  
 1827ms  
 Done!

Now, an algorithm of power 4 time complexity,

 int n = 2500;
 int[] arr = new int[n];
 System.out.println("Start");
 long st = System.currentTimeMillis();
 for (int i = 0; i < n; i++) { 
   for (int j = 0; j < n; j++) {
     for (int k = 0; k < n; k++) {
       for (int l = 0; l < n; l++) {
         arr[l]++;
      }  
     }  
   }  
 }  
 System.out.printf("%dms\nDone!", System.currentTimeMillis() - st);

With just one additional loop, for the same n=2500, it took:

Well, it didn't complete... I waited 15 minutes, yet no answer!!
EDIT: It completed after 47 minutes!

 Start  
 2863936ms  
 Done!

Why is this comparison important to me?
This comparison became interesting to me because I have implemented a Tree Edit Distance algorithm which has a fourth-power time complexity of the input size (number of nodes in the trees). I used this algorithm on HTML DOM trees which have around 1000 to 2500 HTML elements.

Summary
Thus, an algorithm with power 4 is not an acceptable solution for tree edit distance for web pages with more than 1000 HTML elements.
I am seeking a better algorithm.

Digit grouping for very long numbers in Java

When we need to write a Billion on a sheet of paper we don't just write 1000000000, because it is hard to see if it is exactly a billion or hundred million in the very first glance without counting the digits. Instead, grouping the digits is common practice so it can be written as 1,000,000,000. It is evident that grouping the digits improves the readability of numbers.

However, all these days I never saw anywhere grouping the numeric literals in the software code. Recently I was surprised to see that Java language supports digit groups by using the underscore ('_') characters as group separators. For example, the following is a perfectly valid way to store a decimal Billion in java.

int n = 1_000_000_000;