
Introduction to Apache Drill

Here we are with a new tool for data processing. This time it’s Apache Drill.

Say hello to a miner’s best friend

After our last dive into Terraform on Azure:

Terraform Part I

Terraform Part II

and Nix:

Introduction to Nix

we mix things up a little and step out of the DevOps game for a bit. Today we will focus on something totally different, more on the data analytics side - Apache Drill. As you might already expect, since we are covering another project under the Apache umbrella, things are about to get a bit interesting and a whole lot useful.

So what is Drill in the first place? In one simple sentence - it is the Holy Grail of the Data Miner (or Data Scientist, Data Analyst, or whatever you want to call it). It allows querying all the major NoSQL databases and big-data stores (think HDFS derivatives, MongoDB, S3, etc.) using plain SQL syntax, from a unified service, with only a marginal amount of configuration.

Sounds too good to be true? Here are some pros and cons:

Pros:

  • very easy to set up
  • no additional query language is necessary
  • works with most popular providers
  • variety of interfaces for query execution (shell, web UI, ODBC/JDBC, C++ API)
  • able to join data from multiple providers
  • memory-based (think more Spark, less MapReduce)
  • optimized for performance (vectorization of values, runtime compilation, no row materialization)
  • can be run from the command line or set up as a clustered service (with Zookeeper or YARN)
  • it’s possible to create custom functions to enhance queries
  • ODBC/JDBC driver allows connecting to Drill from a multitude of 3rd party analytical tools

Cons:

  • parallelization of work can be a bit uneven
  • you still have to take into account the lower layers (providers) when trying to optimize queries
  • yet another layer, yet another level of debug necessary (although it comes with some nice helpers)
  • most likely you will still have to keep your Data Engineers around (sorry!)

Now, if you are still interested, let’s jump a bit deeper.

Architecture

As we mentioned before, Apache Drill can run in embedded mode or be deployed as a cluster. In both cases, we end up with one Drillbit agent running per physical/logical machine. This is the entry point for all our interactions. Once we issue a query, one of the Drillbits becomes the Foreman and is responsible for its execution. Similar to solutions like Spark, everything is kept in memory. Another common trait is that the query is first converted into a logical plan, passed through a query optimizer, and then turned into a physical plan.

The plan is then parallelized in a manner resembling the fork/join pattern, which produces a DAG representation with one fragment serving as the root that divides the work, aggregates the results, and sends them back to the client. This type of organization is fairly common, battle-proven, and relatively easy to debug. Unfortunately, it can lead to some bottlenecks, since each fragment maps to a blocking thread and we may be talking to different data stores with varying schemas and response times. All in all, this is more of a general problem with data lakes than with Apache Drill in particular.
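If you want to peek at those plans yourself, Drill exposes them through plain SQL (a small sketch; cp.`employee.json` is one of the sample data sets bundled on Drill’s classpath, and any query would do):

-- physical plan, i.e. what the Foreman will actually execute
EXPLAIN PLAN FOR SELECT * FROM cp.`employee.json` LIMIT 5;

-- logical plan only, before the physical planning step
EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR SELECT * FROM cp.`employee.json` LIMIT 5;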

Anyhow, those are the details that might help us when trying to debug a plan, but for the most part, we will be only interested in interacting with this architecture from a top-level perspective. In this sense, we have to ask ourselves - what are the components that allow us to translate the queries to concrete storage providers?

To provide the functionality that it does, Drill utilizes a set of plugins that serve as connectors to different NoSQL databases. Those are self-contained Java JARs that we place on the classpath and back with some additional, largely plugin-specific configuration. We can enable/disable them from the Web UI as well. By default, two special plugins are enabled - cp and dfs. The former allows us to read JAR files from the classpath (which can contain data), and the latter is the basic filesystem connector. dfs is primarily used to query the local filesystem, but it can be configured to point to HDFS or S3 as well.
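Once Drill is running (see the Setup section below), a quick way to see which plugins and workspaces your Drillbit can resolve is to list the schemas directly from SQL (a small sketch - the exact output depends on which plugins you have enabled):

SHOW SCHEMAS;
-- typically lists entries such as cp.default, dfs.root, dfs.tmp, sys and
-- INFORMATION_SCHEMA, plus one entry per enabled plugin workspace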

Setup

Installation is pretty straightforward. For Linux or OS X just download and extract this tarball. For Windows, you might want to follow those instructions.

In any case, you will need the Java Development Kit (version 8 or newer) installed, JAVA_HOME defined, and PATH pointing to your JDK binaries. If you have not touched Java for some time, you might want to check out the Oracle documentation on the specifics.

Once Drill is installed, you just need to cd into the bin folder of the extracted package and run ./drill-embedded. This will drop you into a standalone instance of Drill that allows you to interact with your local system.
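As a quick sanity check that the embedded Drillbit is actually up, you can query one of the built-in system tables (a small sketch; sys.version simply reports the build you are running):

SELECT * FROM sys.version;
-- returns a single row with the Drill version and build details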

First steps

Drill comes packaged with a few data sets in the sample-data directory. Let’s try querying one of those. Please run SELECT * FROM dfs.`[path to directory]/apache-drill-1.19.0/sample-data/region.parquet`;. You should get something like this:

+-------------+-------------+----------------------+
| R_REGIONKEY |   R_NAME    |      R_COMMENT       |
+-------------+-------------+----------------------+
| 0           | AFRICA      | lar deposits. blithe |
| 1           | AMERICA     | hs use ironic, even  |
| 2           | ASIA        | ges. thinly even pin |
| 3           | EUROPE      | ly final courts cajo |
| 4           | MIDDLE EAST | uickly special accou |
+-------------+-------------+----------------------+

We can see that the structure of Drill commands contains a bit more information than vanilla SQL. The dfs in this case points to the plugin that we are using. If we are going to use dfs a lot, we can run USE dfs;. Now, when you re-run the previous query, you can skip the dfs. part and just execute SELECT * FROM `[path to directory]/apache-drill-1.19.0/sample-data/region.parquet`;.

It’s a fun little feature, but why does it work in the first place? The trick is that Drill defines several workspaces in each plugin. Within those, we can point to files/schemas on the filesystem, which are treated just like tables. When running USE dfs; we narrow down the namespace from which tables are resolved. Each plugin has one default workspace that is used when we do not specify anything else in the query. In the case of the hbase and hive plugins, those would point to the metastore. The dfs plugin behaves a bit differently and uses the root of our local filesystem as the workspace!

There are a few interesting things to unpack here. First of all, there is a kind of duality between our local filesystem and a SQL schema. Drill translates whatever is described by the plugin into an abstraction that is easily understood by anybody who has ever written a plain SQL query. Second of all, the metadata from the Parquet file was read and interpreted on the fly, and the file was treated as a relational table. What else can we interact with using dfs? Could we perform a similar operation on a CSV file?
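By the way, a workspace can also tell us what it sees on disk, which is handy when exploring a new data source. A small sketch (the path is a placeholder, just like in the queries above):

SHOW FILES IN dfs.`[path to directory]/apache-drill-1.19.0/sample-data`;
-- lists the files and directories visible at that location, including region.parquet

With that out of the way, back to the CSV question.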

Let’s try. Please paste this content into a file called [path to directory]/apache-drill-1.19.0/sample-data/region.csv:

id;name;description
0;AFRICA;lar deposits. blithe
1;AMERICA;hs use ironic, even
2;ASIA;ges. thinly even pin
3;EUROPE;ly final courts cajo
4;MIDDLE EAST;uickly special accou

Now, please run SELECT * FROM `[path to directory]/apache-drill-1.19.0/sample-data/region.csv`;. That did not turn out as expected, did it? This is what we got in our case:

+----------------------------------------+
|                columns                 |
+----------------------------------------+
| ["id;name;description"]                |
| ["0;AFRICA;lar deposits. blithe"]      |
| ["1;AMERICA;hs use ironic"," even"]    |
| ["2;ASIA;ges. thinly even pin"]        |
| ["3;EUROPE;ly final courts cajo"]      |
| ["4;MIDDLE EAST;uickly special accou"] |
+----------------------------------------+

We imported our data, but it seems that the header ended up as a regular row and our delimiter was not recognized. We could use this table as it is by addressing columns[0], thanks to the support for complex data types, but that’s not very elegant. How can we fix that?
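Before we fix it properly, this is roughly what the raw access would look like (a minimal sketch; columns is the single array column Drill exposes for delimited text files without an extracted header, and the path is a placeholder):

SELECT columns[0] FROM `[path to directory]/apache-drill-1.19.0/sample-data/region.csv`;
-- returns the first element of the columns array for each row - with the default ','
-- delimiter that is most of the original line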

Configuration

First, let’s take a brief look at the Drill Web UI. With drill-embedded still running, go to http://localhost:8047/. You should see a list of currently running Drillbits. Since we are working locally, there will obviously be just one of those.
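The same list is also available from SQL, in case you prefer the shell over the browser (a small sketch):

SELECT * FROM sys.drillbits;
-- in embedded mode this returns a single row describing the local Drillbit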

Next, go to the Storage section. Here we can see all the plugins for providers available to our Drillbit. Under Enabled Storage Plugins you should see dfs listed. Please click on Update. Now we are looking at the configuration of the plugin. Each plugin can be customized using either the Web UI, configuration files, or the REST API. Please find the formats section and then the csv entry. You might want to copy-paste it to an external editor for convenience and create a backup copy as well. Let’s change this entry:

"csv": {
  "type": "text",
  "extensions": [
    "csv"
  ]
},

to:

"csv": {
  "type": "text",
  "extensions": [
    "csv"
  ],
  "extractHeader": true,
  "delimiter": ";"
},

Save the config with the Update button and try running the previous query again. You can either use the command line as before or switch to the Query section. What you should get as a result is something like this:

+----+-------------+----------------------+
| id |    name     |     description      |
+----+-------------+----------------------+
| 0  | AFRICA      | lar deposits. blithe |
| 1  | AMERICA     | hs use ironic, even  |
| 2  | ASIA        | ges. thinly even pin |
| 3  | EUROPE      | ly final courts cajo |
| 4  | MIDDLE EAST | uickly special accou |
+----+-------------+----------------------+
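As a side note, if you would rather not touch the global plugin configuration, Drill can also take format options per query via a table function. A sketch under the assumption that the option names match the text format plugin fields (they may differ slightly between Drill versions):

SELECT * FROM table(dfs.`[path to directory]/apache-drill-1.19.0/sample-data/region.csv`
    (type => 'text', fieldDelimiter => ';', extractHeader => true));
-- overrides the csv format settings for this one query only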

Working with files and schemas

Drill makes a distinction between formats with a strong and a weak schema. In the case of the former, when we run CREATE VIEW, type compatibility and column existence are checked against the schema. For the latter, CREATE VIEW always succeeds.

Strong schema sources are:

  • views
  • Hive tables
  • HBase column families
  • text (a bit of an overstatement to call this strong, but nevertheless true)

Among the weak schema sources we count:

  • JSON
  • MongoDB
  • HBase column qualifiers
  • Parquet

The basic input formats available on any Drill deployment are:

  • Avro
  • CSV, TSV and PSV
  • Parquet
  • Hadoop Sequence
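To make the strong vs weak distinction a bit more tangible, here is a minimal CREATE VIEW sketch on top of the CSV file from before (the workspace, view name, and path are placeholders; for a weak schema source such as JSON, the same statement would succeed even if the referenced column did not exist):

CREATE VIEW dfs.tmp.`region_names` AS
SELECT `name` FROM `[path to directory]/apache-drill-1.19.0/sample-data/region.csv`;
-- views have to live in a writable workspace, hence dfs.tmp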

Let’s play a bit more with the file that we created before and create a new table from it. Please run ALTER SESSION SET `store.format`='parquet';, followed by:

CREATE TABLE `[path to directory]/apache-drill-1.19.0/sample-data/region2/` AS
SELECT
    CAST(`id` AS INT) `ID`,
    `name` as `NAME`,
    SUBSTR(`description`,1, 3) `SHORT_DESCRIPTION`
FROM `[path to directory]/apache-drill-1.19.0/sample-data/region.csv`;

What you will get is an error of the form: Error: VALIDATION ERROR: Unable to create or drop objects. Schema [dfs] is immutable. The problem is that we’ve previously run USE dfs;, which is a shortcut for USE dfs.default; or USE dfs.root;. If you go to the dfs plugin configuration, you will see something like this:

"workspaces": {
  "tmp": {
    "location": "/tmp",
    "writable": true,
    "defaultInputFormat": null,
    "allowAccessOutsideWorkspace": false
  },
  "root": {
    "location": "/",
    "writable": false,
    "defaultInputFormat": null,
    "allowAccessOutsideWorkspace": false
  }
},

Our root workspace is read-only by default. We can either change the writable flag, switch to the tmp workspace, or create a new workspace. Let’s go for the last option. Please create a folder named [path to directory]/apache-drill-1.19.0/sample-data/workwork/ and add the following workspace configuration in the Web UI:

"workwork": {
  "location": "/[path to directory]/apache-drill-1.19.0/sample-data/workwork/",
  "writable": true,
  "defaultInputFormat": null,
  "allowAccessOutsideWorkspace": false
}

Finally, we will change the query a bit and re-run it:

CREATE TABLE dfs.workwork.`/region2/` AS
SELECT
    CAST(`id` AS INT) `ID`,
    `name` as `NAME`,
    SUBSTR(`description`,1, 3) `SHORT_DESCRIPTION`
FROM `[path to directory]/apache-drill-1.19.0/sample-data/region.csv`;

Notice that now we have to specify the full schema name in at least one case, since we cannot USE both workspaces at once, and that we provide the location starting from the root where our workspace is mounted.
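If we planned to stay inside the new workspace, we could also narrow the namespace down to it, just like we did with dfs earlier (a small sketch; region2 is the table we have just created):

USE dfs.workwork;
SELECT * FROM `region2`;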

Let’s quickly run SELECT * FROM dfs.workwork.`/region2/`; just to make sure that we get:

+----+-------------+-------------------+
| ID |    NAME     | SHORT_DESCRIPTION |
+----+-------------+-------------------+
| 0  | AFRICA      | lar               |
| 1  | AMERICA     | hs                |
| 2  | ASIA        | ges               |
| 3  | EUROPE      | ly                |
| 4  | MIDDLE EAST | uic               |
+----+-------------+-------------------+

Now, for a grand finale, please run the following:

SELECT R2.`ID`, R.`R_REGIONKEY`, R2.`SHORT_DESCRIPTION`
FROM dfs.workwork.`/region2/` AS R2
INNER JOIN `[path to directory]/apache-drill-1.19.0/sample-data/region.parquet` AS R
ON R2.`ID`=R.`R_REGIONKEY`;

We probably won’t win the annual International SQL Championship, but we will certainly win something like this:

+----+-------------+-------------------+
| ID | R_REGIONKEY | SHORT_DESCRIPTION |
+----+-------------+-------------------+
| 0  | 0           | lar               |
| 1  | 1           | hs                |
| 2  | 2           | ges               |
| 3  | 3           | ly                |
| 4  | 4           | uic               |
+----+-------------+-------------------+

Let’s keep this last query somewhere close, because now we will try to find out what happens during execution and how to do some simple debugging.

Debugging a plan

Inspecting a job in Drill is pretty straightforward. Just go to the Web UI and browse to Profiles. Our last query should be at the top of the list. Please click on the query, which will take you to a detailed view.

Here we can see four tabs - Query, Physical Plan, Visualized Plan, and Edit Query. If you go to the third one, you will see a nice graph of how Drill has divided the execution. First, we will have our root fragment, then some projections, and finally a hash join of two table scans.

Let’s scroll down a bit and look at Operator Profiles > Overview. Each of the operators that has been run is listed along with some useful metrics like % Query Time. It looks like our join was quite costly in the scope of our query, which is nothing too surprising.

If you switch to Physical Plan, you can see a somewhat more obscure view of the same plan, but there is at least one thing worth noticing - we have access to the cumulative cost of each operator with regard to different metrics.

The last thing that we recommend checking out is the Edit Query tab, which not only allows us to edit the query but also to cancel it in case something takes longer than expected.

Closing words

I hope we piqued your interest and you will give Apache Drill a chance. It’s a very powerful tool that remains surprisingly simple and approachable at the same time. For the next post in our series, we chose to cover Apache Druid, which is a real-time analytics database. Since Drill comes out of the box with a storage plugin dedicated to Druid, we will do a short throwback to show how to integrate those two pieces of technology. Thank you for your attention and please come back to check out our new material!