An Inside Look at the Future of Predictive Energy Analytics
We leverage a lot of data to help our users manage energy more intelligently. To be specific, we digest around 12 terabytes of data every day. That data gets transformed into everything from portfolio-wide performance reports to a persistent monitoring tool that tracks performance after a retrocommissioning project.
One of the latest advanced analytics features we’ve developed with all this data—and one that we’re particularly proud of—is a day-ahead cost prediction tool. We’re able to tell our users, with a high degree of accuracy, how much they’re going to spend on energy tomorrow.
Predicting the Future
Calculating a building’s energy cost down to the dollar before it happens requires a set of highly advanced calculations based on weather forecasts, the building’s past energy consumption, and a host of other factors. We process all of this data using R, a statistical computing language. Those familiar with R will already know that, out of the box, it runs calculations one at a time (sequentially).
This generally isn’t an issue, but we have 15,000 sites whose consumption we need to predict for tomorrow, and with the amount of data we’re processing, each calculation takes about a minute. At that rate, we’d need more than ten days to do work that has to finish overnight so users can act on the information first thing in the morning.
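The back-of-the-envelope arithmetic behind that ten-day figure is straightforward (a quick sketch; the one-minute-per-site estimate is the measurement described above):

```python
# Sequential runtime estimate: 15,000 sites at ~1 minute per site.
sites = 15_000
minutes_per_site = 1

total_minutes = sites * minutes_per_site
total_days = total_minutes / (60 * 24)  # 1,440 minutes in a day

print(f"{total_days:.1f} days")  # roughly 10.4 days of sequential work
```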
The problem isn’t just running the R processes in time; it’s also feeding those processes the data they need in time. Rewriting the code in a faster language is possible, but it means writing and testing new libraries on top of testing our analysis code, all of which is heavily time-intensive.
As far as we know, no one’s published a solution to this limitation in R yet. So we went digging.
That Creative Spark
We knew Spark, a data processing framework with an emphasis on massive scalability, held the potential to solve our processing bottleneck, but its machine learning libraries lacked the advanced statistical tooling R offered.
Rather than choosing between the two, we decided to use Spark as a foundation and have it run our R processes in parallel. This gave us the infrastructure to feed the necessary data into R, and the breadth to run the 15,000 processes in parallel across dozens of cloud-based servers.
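The key property that makes this work is that each site's forecast is an independent calculation, so the batch is embarrassingly parallel. Our actual system uses Spark to fan the work out across servers; as a minimal, dependency-free sketch of the same idea, here is the fan-out expressed with Python's `multiprocessing` (the site IDs and the placeholder "model" are hypothetical stand-ins for the real R jobs):

```python
from multiprocessing import Pool

def predict_site_cost(site_id):
    """Stand-in for one per-site job. In the real pipeline this step
    hands the site's data to an R process and collects its forecast."""
    return site_id, float(site_id) * 2.0  # placeholder "prediction"

if __name__ == "__main__":
    site_ids = range(1, 101)  # 15,000 in production
    # Each site is independent, so workers never need to coordinate.
    with Pool(processes=4) as pool:
        results = dict(pool.map(predict_site_cost, site_ids))
    print(len(results))  # one forecast per site, computed in parallel
```

Spark plays the same role as the `Pool` here, but across many machines rather than local processes, and with the input data already distributed alongside the workers.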
With a small amount of code rewriting to distribute the data (using HDFS and Google’s Protocol Buffers), we sped up calculations by two to three orders of magnitude: the old system worked through 10–100 sites a day, while the new system processes 10,000–100,000 sites on the same cloud computing infrastructure.
Better still, because we rewrote the code using protobufs, our new R code is smaller than the old code even though it’s much faster. Protobufs let data move across different systems in a typed, compact binary format, preserving numeric values more faithfully than text formats like JSON, which we were previously using.
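The accuracy point comes down to encoding: a protobuf `double` field is always the exact 8-byte IEEE 754 value, whereas JSON serializers render numbers as decimal strings, which can lose bits depending on the implementation. Protocol Buffers require generated message classes, so as a dependency-free sketch of the underlying idea, here is the same kind of typed, fixed-width binary round trip using Python's `struct` module (the per-site record layout is hypothetical):

```python
import struct

# Hypothetical per-site record: (site_id, kwh_consumed, forecast_cost).
# A protobuf message would declare these as typed fields; this struct
# format shows the same idea: a fixed binary layout, not decimal strings.
RECORD = struct.Struct("<Idd")  # uint32 followed by two IEEE 754 doubles

record = (4217, 1234.5678, 148.25)
blob = RECORD.pack(*record)           # exactly 20 bytes, exact bit patterns
assert RECORD.unpack(blob) == record  # round-trips with no precision loss
print(len(blob))
```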
What It Means
Our approach enables data scientists to leverage R’s massive statistical libraries within an easily scalable application (Spark) plugged into a production analytics pipeline.
This approach also defines a strict interface for data scientists to work against, letting them focus on where they have the most impact: the analytic solution. It benefits their engineering colleagues too, by essentially making the analytics a black box; the engineer only has to worry about the strictly defined inputs and outputs of the analytic solution.
Possibilities for Using Spark and R Together
The technique we use to massively scale calculations in R and Spark can be applied to many of the data science problems being tackled today.
Say you wanted to forecast, an hour ahead, what the price of an Uber ride will be in your area. You could collect data using Uber’s API and do some exploration in R. Using the forecast R package, you should be able to make a reasonably good estimate of the price for a small sample of locations.
Now say you wanted to predict costs in every neighborhood of every major city an hour from now. You could pretty easily set up an application in Spark (using Scala or really any JVM language) that collects data from Uber’s API at regular intervals for every location you’re interested in. The application serves the required predictor data into the same R analysis you developed before. The analysis does its magic and sends the results back to your application, where you can happily do what you want with them.
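The shape of that pipeline can be sketched in a few lines. Everything here is hypothetical: the neighborhood names, the price history (which the real application would accumulate from periodic Uber API polls), and the trailing-mean "model", which stands in for the R forecast step:

```python
from statistics import mean

def hour_ahead_forecast(prices, window=3):
    """Stand-in for the R analysis: in the real pipeline this series
    would be handed to the forecast package; here a trailing mean of
    the most recent observations plays the role of the model."""
    return mean(prices[-window:])

# Price history per neighborhood, as the Spark app might accumulate it.
history = {
    "mission": [11.0, 12.5, 13.0, 12.0],
    "williamsburg": [18.0, 17.5, 19.0, 20.5],
}

forecasts = {hood: hour_ahead_forecast(p) for hood, p in history.items()}
print(forecasts)
```

Because each neighborhood's forecast is independent, this is the same embarrassingly parallel shape as the energy predictions above, and Spark can fan it out the same way.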