Minborg

Wednesday, May 23, 2018

Making Pivot Tables with Java Streams from Databases

Raw data from database rows and tables gives human readers little insight. We are much more likely to see patterns if the data is aggregated in some way before it is presented to us. A pivot table is a specific form of aggregation in which we apply operations such as sorting, averaging, or summing, often combined with grouping of column values.

In this article, I will show how you can compute pivot tables of data from a database in pure Java without writing a single line of SQL. You can easily reuse and modify the examples in this article to fit your own specific needs.

In the examples below, I have used open-source Speedment, which is a Java Stream ORM, and the open-source Sakila film database content for MySQL. Speedment works for any major relational database type such as MySQL, PostgreSQL, Oracle, MariaDB, Microsoft SQL Server, DB2, AS400 and more.

Pivoting

I will construct a Map of Actor objects and, for each Actor, a corresponding Map that counts, per film rating, the films that the Actor has appeared in. Expressed verbally, a pivot entry for a specific Actor might look like this:

“John Doe participated in 9 films that were rated ‘PG-13’ and 4 films that were rated ‘R’”.

We are going to compute pivot values for all actors in the database. The Sakila database has three tables of interest for this particular application:

1) “film” containing all the films and how the films are rated (e.g. “PG-13”, “R”, etc.).
2) “actor” containing (made up) actors (e.g. “MICHAEL BOLGER”, “LAURA BRODY”, etc.).
3) “film_actor” which links films and actors together in a many-to-many relation.

The first part of the solution involves joining these three tables together. Joins are created using Speedment’s JoinComponent which can be obtained like this:

// Visit https://github.com/speedment/speedment
// to see how a Speedment app is created. It is easy!
Speedment app = …;

JoinComponent joinComponent = app.getOrThrow(JoinComponent.class);

Once we have the JoinComponent, we can start defining Join relations that we need to compute our pivot table:
Join<Tuple3<FilmActor, Film, Actor>> join = joinComponent
        .from(FilmActorManager.IDENTIFIER)
        .innerJoinOn(Film.FILM_ID).equal(FilmActor.FILM_ID)
        .innerJoinOn(Actor.ACTOR_ID).equal(FilmActor.ACTOR_ID)
        .build(Tuples::of);
The build() method takes the method reference Tuples::of, which resolves to a factory method that takes three entities of type FilmActor, Film and Actor and creates a compound immutable Tuple3 comprising those entities. Tuples are built into Speedment.

Armed with our Join object, we can now create our pivot Map using a standard Java Stream obtained from the Join object:

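// Assumes: import static java.util.stream.Collectors.counting;
//          import static java.util.stream.Collectors.groupingBy;
Map<Actor, Map<String, Long>> pivot = join.stream()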
    .collect(
        groupingBy(
            // Applies Actor as a first classifier
            Tuple3::get2,
            groupingBy(
                // Applies rating as second level classifier
                tu -> tu.get1().getRating().get(),
                counting() // Counts the elements 
                )
            )
        );
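
The two-level collector above can also be tried without a database. The following is a minimal sketch that runs the same groupingBy/counting combination on a few in-memory rows. The Row class and the sample names and ratings are made up for illustration and are not part of Speedment or Sakila.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

class PivotSketch {

    // Hypothetical stand-in for one joined row (actor name + film rating)
    static final class Row {
        private final String actor;
        private final String rating;

        Row(String actor, String rating) {
            this.actor = actor;
            this.rating = rating;
        }

        String actor() { return actor; }
        String rating() { return rating; }
    }

    // First level groups by actor, second level counts rows per rating
    static Map<String, Map<String, Long>> pivot(List<Row> rows) {
        return rows.stream()
            .collect(groupingBy(Row::actor, groupingBy(Row::rating, counting())));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("JOHN DOE", "PG-13"),
            new Row("JOHN DOE", "R"),
            new Row("JOHN DOE", "PG-13"),
            new Row("JANE ROE", "G")
        );
        Map<String, Map<String, Long>> pivot = pivot(rows);
        System.out.println(pivot.get("JOHN DOE").get("PG-13")); // 2
        System.out.println(pivot.get("JANE ROE").get("G"));     // 1
    }
}
```

Each row contributes exactly one count to the inner Map of its actor, which is the same thing the Speedment stream does for each joined tuple.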

Now that the pivot Map has been computed, we can print its content like this:
// pivot keys: Actor, values: Map<String, Long>
pivot.forEach((k, v) -> { 
    System.out.format(
        "%22s  %5s %n",
        k.getFirstName() + " " + k.getLastName(),
        v
    );
});
This will produce the following output:
        MICHAEL BOLGER  {PG-13=9, R=3, NC-17=6, PG=4, G=8} 
           LAURA BRODY  {PG-13=8, R=3, NC-17=6, PG=6, G=3} 
     CAMERON ZELLWEGER  {PG-13=8, R=2, NC-17=3, PG=15, G=5}
...
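
Note that the Map returned by groupingBy() makes no ordering guarantee, so the lines may come out in a different order than shown here. If a stable order matters, one option is to pass TreeMap::new as the map factory on both levels. The sketch below does this on made-up in-memory data with plain String keys. With Actor keys, a TreeMap would additionally need a Comparator, which is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

class SortedPivotSketch {

    // Each String[] is a hypothetical {actorName, rating} pair
    static Map<String, Map<String, Long>> sortedPivot(List<String[]> rows) {
        return rows.stream()
            .collect(groupingBy(
                row -> row[0],
                TreeMap::new,         // Outer keys sorted by actor name
                groupingBy(
                    row -> row[1],
                    TreeMap::new,     // Inner keys sorted by rating
                    counting()
                )
            ));
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"MICHAEL BOLGER", "PG"},
            new String[] {"LAURA BRODY", "R"},
            new String[] {"MICHAEL BOLGER", "G"},
            new String[] {"LAURA BRODY", "R"}
        );
        sortedPivot(rows).forEach((k, v) ->
            System.out.format("%22s  %5s %n", k, v)
        );
    }
}
```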


Mission completed! In the code above, the method Tuple3::get2 will retrieve the third element from the tuple (an Actor) whereas the method tu.get1() will retrieve the second element from the tuple (a Film).

Speedment will render SQL code automatically from Java and convert the result to a Java Stream. If we enable Stream logging, we can see exactly how the SQL was rendered:
SELECT 
    A.`actor_id`,A.`film_id`,A.`last_update`, 
    B.`film_id`,B.`title`,B.`description`,
    B.`release_year`,B.`language_id`,B.`original_language_id`,
    B.`rental_duration`,B.`rental_rate`,B.`length`,
    B.`replacement_cost`,B.`rating`,B.`special_features`,
    B.`last_update`, C.`actor_id`,C.`first_name`,
    C.`last_name`,C.`last_update`
FROM 
    `sakila`.`film_actor` AS A
INNER JOIN 
    `sakila`.`film` AS B ON (B.`film_id` = A.`film_id`) 
INNER JOIN 
    `sakila`.`actor` AS C ON (C.`actor_id` = A.`actor_id`)

Joins with Custom Tuples

As we noticed in the example above, we have no real use for the FilmActor object in the Stream since it is only used to link Film and Actor entities together during the Join phase. Also, the generic Tuple3 has general get0(), get1() and get2() methods that say nothing about what they contain.

All this can be fixed by defining our own custom “tuple” called ActorRating like this:

private static class ActorRating {
    private final Actor actor;
    private final String rating;

    public ActorRating(FilmActor fa, Film film, Actor actor) {
        // fa is not used. See below why
        this.actor = actor;
        this.rating = film.getRating().get();
    }

    public Actor actor() {
        return actor;
    }

    public String rating() {
        return rating;
    }

}


When Join objects are built using the build() method, we can provide a custom constructor that we want to apply to the incoming entities from the database. This is a feature that we are going to use, as shown below:
Join<ActorRating> join = joinComponent
    .from(FilmActorManager.IDENTIFIER)
    .innerJoinOn(Film.FILM_ID).equal(FilmActor.FILM_ID)
    .innerJoinOn(Actor.ACTOR_ID).equal(FilmActor.ACTOR_ID)
    .build(ActorRating::new); // Use a custom constructor

Map<Actor, Map<String, Long>> pivot = join.stream()
    .collect(
        groupingBy(
            ActorRating::actor,
            groupingBy(
                ActorRating::rating,
                counting()
            )
        )
    );
In this example, we provided a class with a constructor (the method reference ActorRating::new gets resolved to new ActorRating(fa, film, actor)) that simply discards the linking FilmActor object. The class also gives its properties better names, which makes the code more readable. The solution with the custom ActorRating class produces exactly the same output as the first example, but the code looks much nicer. In most cases, I think writing a custom tuple is worth the extra effort over using generic Tuples.
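
To illustrate how a constructor reference with three parameters gets resolved, here is a small self-contained sketch. The entity classes are empty stand-ins for the generated Speedment entities, and the TriFunction interface declared here is hypothetical: I am only assuming that build() accepts some three-argument functional interface of this shape.

```java
import java.util.Optional;

class ConstructorRefSketch {

    // Minimal hypothetical stand-ins for the generated entities
    static final class FilmActor { }

    static final class Film {
        private final String rating;
        Film(String rating) { this.rating = rating; }
        Optional<String> getRating() { return Optional.ofNullable(rating); }
    }

    static final class Actor {
        private final String name;
        Actor(String name) { this.name = name; }
        String getName() { return name; }
    }

    // Hypothetical three-argument functional interface
    @FunctionalInterface
    interface TriFunction<A, B, C, R> {
        R apply(A a, B b, C c);
    }

    static final class ActorRating {
        private final Actor actor;
        private final String rating;

        ActorRating(FilmActor fa, Film film, Actor actor) {
            // fa only links film and actor, so it is dropped here
            this.actor = actor;
            this.rating = film.getRating().get();
        }

        Actor actor() { return actor; }
        String rating() { return rating; }
    }

    public static void main(String[] args) {
        // The constructor reference matches the (FilmActor, Film, Actor) shape
        TriFunction<FilmActor, Film, Actor, ActorRating> constructor = ActorRating::new;
        ActorRating ar = constructor.apply(
            new FilmActor(), new Film("PG-13"), new Actor("JOHN DOE"));
        System.out.println(ar.actor().getName() + " " + ar.rating()); // JOHN DOE PG-13
    }
}
```

The argument order of the constructor has to follow the order in which the tables were added to the join, which is why FilmActor comes first even though it is not stored.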

Using Parallel Pivoting

One cool thing with Speedment is that it supports the Stream method parallel() out-of-the-box. So, if you have a server with many CPU cores, you can take advantage of all of them when running database queries and joins. This is what parallel pivoting would look like:

Map<Actor, Map<String, Long>> pivot = join.stream()
    .parallel()  // Make our Stream parallel
    .collect(
        groupingBy(
            ActorRating::actor,
            groupingBy(
                ActorRating::rating,
                counting()
            )
        )
    );
We only have to add a single line of code to get parallel aggregation. The default parallel split strategy kicks in when we reach 1024 elements. Thus, parallel pivoting will only take place on tables or joins larger than this. It should be noted that the Sakila database only contains 1000 films, so we would have to run the code on a bigger database to actually be able to benefit from parallelism.
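
As a sanity check that parallel collection does not change the result, the following sketch compares a sequential and a parallel pivot over generated in-memory rows. It uses plain JDK collection streams, so it does not reproduce Speedment's split strategy or its 1024-element threshold. The row layout and the ACTOR_n names are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.IntStream;
import java.util.stream.Stream;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;
import static java.util.stream.Collectors.toList;

class ParallelPivotSketch {

    static final String[] RATINGS = {"G", "PG", "PG-13", "R", "NC-17"};

    // Builds n hypothetical {actorName, rating} rows in a deterministic pattern
    static List<String[]> rows(int n) {
        return IntStream.range(0, n)
            .mapToObj(i -> new String[] {"ACTOR_" + (i % 7), RATINGS[i % RATINGS.length]})
            .collect(toList());
    }

    // Same two-level collector, run either sequentially or in parallel
    static Map<String, Map<String, Long>> pivot(List<String[]> rows, boolean parallel) {
        Stream<String[]> stream = parallel ? rows.parallelStream() : rows.stream();
        return stream.collect(
            groupingBy(row -> row[0], groupingBy(row -> row[1], counting()))
        );
    }

    public static void main(String[] args) {
        List<String[]> rows = rows(10_000);
        Map<String, Map<String, Long>> sequential = pivot(rows, false);
        Map<String, Map<String, Long>> parallel = pivot(rows, true);
        System.out.println(sequential.equals(parallel)); // true
    }
}
```

For parallel streams with many distinct keys, the JDK also offers Collectors.groupingByConcurrent(), which avoids merging partial Maps. Whether it pays off depends on the data and is best measured.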

Take it for a Spin!

In this article, we have shown how you can compute pivot data from a database in Java without writing a single line of SQL code. Visit Speedment open-source on GitHub to learn more.

Read more about other features in the User's Guide.
