Modern Database Access with Scala and Slick

Over the last few weeks I have been writing my first Scala application – a RESTful service backed by a SQL database for persistent storage. After some googling around for the ‘Scala way’ to interact with a SQL database, I came across a modern database access library for Scala called Slick.

Slick is a really clever library. Not only does it provide you with a way to perform operations on database tables by writing familiar Scala code – in the same way that you would perform operations on Scala collections – but it also lets you do this in a type-safe way. Effectively, Slick compiles Scala code to SQL. Consider a simple SQL query which selects a row by ID:
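Something like this, say – the exact table and column names here are just illustrative:

SELECT username FROM users WHERE user_id = 10;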

With Slick, we can use the familiar filter and map methods to write this query – it’s just like we are using the Scala collections API:
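A sketch of how this might look, assuming a table model roughly like the one below (the driver import and column names are illustrative):

import slick.driver.H2Driver.api._

// Minimal table model: a users table with userId and username columns
class Users(tag: Tag) extends Table[(Int, String)](tag, "users") {
  def userId   = column[Int]("user_id", O.PrimaryKey, O.AutoInc)
  def username = column[String]("username")
  def *        = (userId, username)
}
val users = TableQuery[Users]

// The query itself – just like filtering and mapping a Scala collection
val byId = users.filter(_.userId === 10).map(_.username)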

There is a key difference – if users were a normal Scala collection, then we would probably observe that _.userId has type Int, the literal 10 has type Int and _.username has type String. But here, users is not a Scala collection, it is an instance of a Slick table query. And we don’t actually have Ints and Strings to use in our filters and maps, since we are building a query that has not yet been executed. So what types are we actually using?

To achieve type safety, behind the scenes Slick is ‘lifting’ types into its own parameterised type, called Rep. From our table model, Slick knows that the userId column stores integers, so the type of _.userId is Rep[Int]. This can be read as ‘representation of Int’. Since the === function on Rep[Int] expects another Rep[Int], an implicit conversion is used to ‘lift’ the plain literal 10 into a Rep[Int]. If we try to compare _.userId with something that cannot be converted into a Rep[Int], the compiler will complain. This is how Slick provides a type-safe DSL for writing database queries. Similarly, the username column has type Rep[String], so if we run this query we will get back a result of type String.

This really demonstrates the power and extensibility of the Scala language – achieving the same thing in most other languages would require changing the language itself. Something similar has been done before by Microsoft, with LINQ for C#. However, Microsoft didn’t achieve LINQ by using the C# language itself; they had to build the functionality into the C# compiler.

While this feature of Slick really stands out, there are many other reasons to use the library:

  • Support for the most popular SQL databases
  • Handling of sessions and transactions. You can plug in a connection pool implementation such as HikariCP
  • Protection against SQL injection. When using plain SQL, all parameters are converted to SQL bind/substitution variables
  • An asynchronous API for performing database operations

The asynchronous API was introduced recently, in Slick 3, and has seen Slick move away from a blocking API towards one which leverages Scala futures. This was perfect for my application, a RESTful web service written on top of Spray – a toolkit for building services based on REST/HTTP. A Future[Response] can be returned to my request handling logic, and Spray’s in-built marshaller for futures delays marshalling until the response becomes available, at which point it can be marshalled by another thread.
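As a rough sketch, reusing the users table from earlier (the database config key is a made-up example, not from my actual service):

import scala.concurrent.Future

// Obtain a database handle from configuration (hypothetical config key)
val db = Database.forConfig("mydb")

// db.run executes the action asynchronously and returns a Future
val username: Future[Option[String]] =
  db.run(users.filter(_.userId === 10).map(_.username).result.headOption)

// In a Spray route, complete(username) hands the Future to Spray's in-built
// futures marshaller, which marshals the value once it becomes available.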

As you might expect, with Slick you can build more complex queries or sequences of database operations using Scala’s for-yield construct. It turns out there are a few ways to do this – you can build queries by composing in either Query, DBIOAction or Future itself and at first I wasn’t sure when to use one over the other.

Composing in Query
When you compose in Query, Slick generates just one SQL statement. For this reason, this approach should be used wherever possible. In the example below, we perform an (implicit) join between a users and purchases table to find all purchases for a specific user.
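A sketch, adding a purchases table alongside the users table and db handle from earlier (again, the names are illustrative):

// A purchases table with a userId column referencing users
class Purchases(tag: Tag) extends Table[(Int, Int, BigDecimal)](tag, "purchases") {
  def purchaseId = column[Int]("purchase_id", O.PrimaryKey, O.AutoInc)
  def userId     = column[Int]("user_id")
  def amount     = column[BigDecimal]("amount")
  def *          = (purchaseId, userId, amount)
}
val purchases = TableQuery[Purchases]

// Composing in Query: the whole for comprehension becomes one SQL statement
val purchasesForUser = for {
  u <- users if u.username === "bob"
  p <- purchases if p.userId === u.userId
} yield p

val result: Future[Seq[(Int, Int, BigDecimal)]] = db.run(purchasesForUser.result)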

Composing in DBIOAction
In Slick, a DBIOAction corresponds to a single database operation. Composing in DBIOAction results in multiple SQL statements, but just one database connection is used to execute them. This is particularly good if you have a sequence of dependent database operations that you want to execute transactionally. You can compose in DBIOAction and use Slick’s .transactionally method to execute the queries in a single transaction. This means that if any one of the queries fails, all prior operations will be rolled back – leaving the database in a consistent state. In the example below, we insert a user and then use the ID returned by this insert operation to insert a sequence of purchases, all in a single transaction.
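A sketch of how this could look, reusing the tables and db handle from the earlier sketches (the helper name is made up):

import scala.concurrent.ExecutionContext.Implicits.global

// Compose in DBIOAction: insert a user, use the generated ID to insert the
// user's purchases, then run both inserts in a single transaction
def insertUserWithPurchases(name: String, amounts: Seq[BigDecimal]): Future[Option[Int]] = {
  val actions = for {
    newId <- (users.map(_.username) returning users.map(_.userId)) += name
    count <- purchases.map(p => (p.userId, p.amount)) ++= amounts.map(a => (newId, a))
  } yield count
  db.run(actions.transactionally)
}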

Composing in Future
With this approach, Slick obtains a connection from the connection pool for each statement, so potentially multiple connections are used. Therefore, composing in Future should only be done if absolutely necessary. In the example below, we perform the same inserts as above. Note that here, it’s not possible to execute the two operations transactionally and a different database connection may be used for each operation. The only real reason to compose in Future would be in the unlikely case that you need to perform some non-Slick related operation in-between the database operations.
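The same two inserts composed in Future instead (again just a sketch):

import scala.concurrent.ExecutionContext.Implicits.global

// Compose in Future: each db.run may use a different connection from the
// pool, so the two inserts cannot be wrapped in a single transaction
def insertUserThenPurchases(name: String, amounts: Seq[BigDecimal]): Future[Option[Int]] =
  for {
    newId <- db.run((users.map(_.username) returning users.map(_.userId)) += name)
    // any non-Slick work would have to happen here, between the two operations
    count <- db.run(purchases.map(p => (p.userId, p.amount)) ++= amounts.map(a => (newId, a)))
  } yield count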

In conclusion: composing in Query results in a single SQL statement and composing in DBIOAction uses a single connection, so for optimal execution prefer Query over DBIOAction over Future.

Like all good things, Slick has some limitations – perhaps the biggest drawback is the lack of control you have over how Slick generates SQL queries from your Scala code. There are cases where Slick fails to generate optimal SQL, and as a result query performance will be worse than if you’d just written the SQL yourself. So when query performance is critical, there may be times when it is better to avoid Slick’s SQL generation functionality. Running Slick in debug mode lets you see the SQL that is being executed, so if you hit a performance issue it is easy to find out whether Slick is the cause. Another limitation I came across is that some queries are not currently expressible with Slick’s Scala abstractions. For example, this fairly common SQL query cannot be expressed in the current release of Slick:

UPDATE table SET field = field + 123

This example is an exception, as the vast majority of common SQL queries can be expressed very easily. The library is still undergoing rapid development, so issues like these are becoming less and less common. In cases where it is not possible to express the operation you want to perform using Slick’s abstractions, you can always fall back to Slick’s plain SQL API.
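For example, the update above could be expressed with the plain SQL sqlu interpolator (a sketch, reusing the db handle from earlier):

// Fall back to plain SQL for the increment that the lifted DSL cannot express
val incrementField = sqlu"UPDATE table SET field = field + 123"
db.run(incrementField)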

Writing a Query Parser for Elasticsearch

I was first exposed to parsers when studying compilers at university – I built something that could take a program written in a fictional programming language as input and produce runnable assembly code as output. It was interesting and fun, but I doubted I would ever use the knowledge in industry. When the need came up for a ‘language’ that non-technical people could use to construct Elasticsearch queries, I realised that a custom built parser would be the ideal solution.

We were using Elasticsearch to power our sites’ search functionality and product list pages. The list pages would be created by non-technical site managers, and to avoid the inflexibility of manually specifying a list of product IDs, we wanted them to be able to write a search query instead, so that the list page would be built dynamically. For example, a list page for men’s shampoos and conditioners which have been reduced to half price or less might be represented by the below ‘list definition’, which describes how to query the product index:

content_gender:male AND content_category:(shampoo OR conditioner) AND price_discount_multiplier: [0 TO 0.5]

Elasticsearch actually has a query type, called the query string query, which uses a built-in query parser to build queries from a simple language with very similar structure to the above. However, we had some very specific requirements, such as being able to elevate certain products under certain conditions (i.e. enabling the list to have some manual ordering which may override the ordering of the results returned by the search engine). So, it became clear that we would need to invent our own query parser.

There are a number of frameworks available for building parsers. Two popular choices are ANTLR and JavaCC (Java Compiler Compiler). ANTLR has been around for a long time and provides the most features – it is especially appropriate for more complex tasks such as building an actual program compiler. JavaCC is simpler and, as its name implies, targets the Java language and provides the most familiarity for Java developers. Since the systems powering our websites are written in Java, JavaCC seemed like the right choice.

The first step in building a parser is to write a grammar. A grammar defines the syntax of our language and is used for syntax analysis; in our case this involves going through a list definition, ensuring that it syntactically makes sense and simultaneously building a data structure which we can later use to generate an Elasticsearch query. For our language, the data structure that we want to build is called an abstract syntax tree (AST) – a tree representation of our syntax. For example, the syntax tree for the list definition content_gender:male AND content_category:(shampoo OR conditioner) would look like this:

[Figure: An example of a generated syntax tree]
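In textual form, and going by the grammar below, the tree for that list definition would look roughly like this:

Query
└── LogicalAnd
    ├── QueryElement
    │   ├── Field: content_gender
    │   └── Value: male
    └── QueryElement
        ├── Field: content_category
        ├── Value: shampoo
        └── Value: conditioner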

To build this tree, we can use JJTree, an extension for JavaCC which will build the AST data structure for us, and will enable us to describe how the tree should be built within our grammar. The grammar itself is fairly straightforward to write, once you get used to the syntax. Firstly, we define all the tokens (i.e. strings) which we want our parser to be able to consume:

TOKEN :
{
    < AND : "AND" >
  | < OR : "OR" >
  | < LBRACKET : "(" >
  | < RBRACKET : ")" >
  | < COLON : ":" >
  | < WORD : (["a"-"z", "*", "_"])+ >
... etc
} 

Next, we define the parsing rules, which describe how the parser should consume these tokens and how the AST should be built. JJTree extends the grammar syntax so that we can use #NodeName to indicate that a node should be added to the syntax tree whenever the parser consumes the corresponding tokens. Below is a simplified example of what the rules will look like. These rules support a language which consists of the AND and OR logical operators and grouping with brackets. Note that the rules are recursive: a query consists of sub queries, which can consist of more queries. We can also extract tokens such as field names and values and store them in the generated syntax tree. This is done by assigning the token value to a field defined within the node.

 ASTQuery query() #Query:
{
  Token t;
} 
{
  subQuery()
  (
    < AND > subQuery() #LogicalAnd(2)
  | < OR > subQuery() #LogicalOr(2)
  )*
  {
	return jjtThis;
  }
}

void subQuery() #void:
{
  boolean not = false;
} 
{
  < LBRACKET >query()< RBRACKET >
  | queryElement()
}

void queryElement() #QueryElement:
{}
{
  (
  LOOKAHEAD(2) field() < COLON > ( value() | valueList() )
  | vstring()
  )
}

void field() #Field:
{
  Token t;
}
{
	t = < WORD >
	{
		jjtThis.value = t.image;
	}
}

void valueList() #void:
{}
{
	< LBRACKET > value() (< OR > value())* < RBRACKET >
}

void value() #Value:
{
  Token t;
}
{
  (
    < QUOTE >t=< STRING >< ENDQUOTE >
    | t=< WORD >
  )
  {
    jjtThis.value = t.image;
  } 
}

Having defined the rules, we can use JavaCC to generate classes for each node and compile our parser, which will do all the work needed to convert list definitions written in our language into syntax trees. Now comes the interesting part – turning these syntax trees into Elasticsearch queries which can be run against our index. To do this, we need to traverse the abstract syntax tree and construct a query. For traversal, JJTree provides support for the visitor design pattern. For building the query, we can use the Elasticsearch Java API.

To use the visitor pattern, we must ensure that the JJTree option VISITOR=true is set. This flag causes JJTree to generate a visitor interface which we can implement, and also adds a jjtAccept() method to the generated node classes. We can then write a visitor class which implements the generated interface. For each node type, we implement a visit method which describes how to behave when we encounter that type of node. For example, when we reach a LogicalAnd node, we can use the Elasticsearch Java API to create an AndFilterBuilder and populate it with filters retrieved from the child nodes. We can retrieve these filters by calling the accept methods on the node’s children. At each level in the tree, we return an Elasticsearch query or filter object to the caller, which can be composed into other queries further up the tree.

Here’s an example of what the visit methods could look like for the Query (root), LogicalAnd and QueryElement nodes:
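A sketch of what those methods might look like – the visitor class name is made up, the node classes and visitor interface are what JJTree would generate from the grammar above, and the filter classes come from the Elasticsearch 1.x Java API:

import org.elasticsearch.index.query.FilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilders;

public class ElasticsearchQueryVisitor implements QueryParserVisitor {

    // Root node: build the filter from the child and wrap it in a query
    public Object visit(ASTQuery node, Object data) {
        FilterBuilder filter = (FilterBuilder) node.jjtGetChild(0).jjtAccept(this, data);
        return QueryBuilders.filteredQuery(QueryBuilders.matchAllQuery(), filter);
    }

    // AND node: combine the filters built from the two child sub queries
    public Object visit(ASTLogicalAnd node, Object data) {
        FilterBuilder left = (FilterBuilder) node.jjtGetChild(0).jjtAccept(this, data);
        FilterBuilder right = (FilterBuilder) node.jjtGetChild(1).jjtAccept(this, data);
        return FilterBuilders.andFilter(left, right);
    }

    // field:value element: read the token values stored in the Field and Value
    // children and build a term filter (a value list would become an or filter
    // over the remaining children)
    public Object visit(ASTQueryElement node, Object data) {
        String field = (String) ((SimpleNode) node.jjtGetChild(0)).jjtGetValue();
        String value = (String) ((SimpleNode) node.jjtGetChild(1)).jjtGetValue();
        return FilterBuilders.termFilter(field, value);
    }

    // ... plus visit methods for the remaining node types (LogicalOr, Field, Value, etc.)
}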

To test this, we can just use the parser to generate a tree and pass an instance of our visitor class to the root node to generate a query.
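Assuming the generated parser class is called QueryParser, that might look something like this (error handling omitted):

// Parse a list definition and build the corresponding Elasticsearch query
QueryParser parser = new QueryParser(new StringReader(
        "content_gender:male AND content_category:(shampoo OR conditioner)"));
ASTQuery root = parser.query();
Object esQuery = root.jjtAccept(new ElasticsearchQueryVisitor(), null);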

That’s it! We now have a parser that can take input written in our custom-built query definition language, turn valid syntax into a syntax tree and finally build an Elasticsearch query.