Jekyll2025-12-17T01:01:48+00:00https://dev.socrata.com/SODA DevelopersUpdates, examples, and significant changes to http://dev.socrata.comSocrata Developer Programhttps://dev.socrata.comA Move to Main Branch2021-10-13T00:00:00+00:002021-10-13T00:00:00+00:00https://dev.socrata.com/blog/2021/10/13/a-move-to-main-branch<p>About a year ago, GitHub began changing the default branch name for new repositories from <code class="highlighter-rouge">master</code> to <code class="highlighter-rouge">main</code> (<a href="https://github.com/github/renaming">read more</a>). In the spirit of that movement, our public GitHub repositories will also be renamed. This will affect anyone who previously forked or cloned a repository locally. This post describes remediation steps if this applies to you. There will also be notifications on GitHub about this change with additional steps.</p> <p>To move your local repository over to the main branch, please execute the following commands in the terminal:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git branch <span class="nt">-m</span> master main
git fetch origin
git branch <span class="nt">-u</span> origin/main main
git remote set-head origin <span class="nt">-a</span>
</code></pre></div></div> <p>Once this is done, your local <code class="highlighter-rouge">main</code> branch should be tracking <code class="highlighter-rouge">main</code> on origin.</p>Peter MooreAbout a year ago, GitHub began changing the default branch name for new repositories from master to main (read more). In the spirit of that movement, our public GitHub repositories will also be renamed. This will affect anyone who previously forked or cloned a repository locally. This post describes remediation steps if this applies to you. 
There will also be notifications on GitHub about this change with additional steps.Time Series Analysis with Jupyter Notebooks and Socrata2019-10-07T00:00:00+00:002019-10-07T00:00:00+00:00https://dev.socrata.com/blog/2019/10/07/time-series-analysis-with-jupyter-notebooks-and-socrata<h1 id="time-series-analysis-with-jupyter-notebooks-and-socrata">Time Series Analysis with Jupyter Notebooks and Socrata</h1> <p>Time series analysis and time series forecasting are common data analysis tasks that can help organizations with capacity planning, goal setting, and anomaly detection. There are an increasing number of freely available tools that are bringing advanced modeling techniques to people with basic programming skills, techniques that were previously only accessible to those with advanced degrees in statistics. This is particularly significant among our customers – government agencies – where resources are constrained and data-aware employees are at a premium. In this blog post, I would like to show you how you can use just a few of these tools. We will start with a dataset downloaded using the Socrata API and loaded into a data frame in a Python Jupyter notebook. Then we will do some data wrangling to prepare our data for analysis, do some plotting, and finally use the <a href="https://facebook.github.io/prophet/">Prophet library</a> to make a forecast based on our data.</p> <p>A time series is an ordered sequence of observations where each observation is made at some point in time. Time series data occur across many domains. In any domain in which we make measurements over time, we can expect to find time series. Government is no exception. For the purpose of this blog post, we focus on our home city of Seattle. 
Specifically, we will use the City of Seattle’s <a href="https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr">Building Permits dataset</a>.</p> <h3 id="getting-started">Getting Started</h3> <p>In the interest of brevity, this post assumes that you are comfortable writing and executing Python code. Further, it assumes that you have set up a virtual environment, and that you have installed a bunch of dependencies, including Jupyter. Finally, it assumes that you have already downloaded the <a href="https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr">City of Seattle Building Permits dataset</a> into a <a href="https://pandas.pydata.org/">Pandas</a> <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">DataFrame</a> named <code class="highlighter-rouge">seattle_permits_df</code>. The entire notebook is available for download <a href="https://dev.socrata.com/files/20191007.socrata-time-series-prophet.ipynb">here</a>. If you need help getting your data to this point, you can follow the first two steps in <a href="https://dev.socrata.com/blog/2016/02/01/pandas-and-jupyter-notebook.html">this blog post</a>.</p> <h3 id="exploring-our-data">Exploring Our Data</h3> <p>The first thing you’ll want to do when working with a new dataset is explore it. 
Here are a few ways you might do that:</p> <ul> <li> <p>get a sense of how many rows your dataset contains</p> </li> <li> <p>get a list of the different columns and the types of data that they store</p> </li> <li> <p>plot your data</p> </li> </ul> <p>We’ll do all of the above.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">seattle_permits_df</span><span class="p">))</span> <span class="n">seattle_permits_df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> </code></pre></div></div> <p>The output of the first line tells us that we have just shy of 130,000 rows in our dataset. The <code class="highlighter-rouge">head</code> command prints the first N=5 rows of our dataset. This gives us a sense of what columns exist, and a quick sense of some of the values in the dataset. But there’s an even better way to determine the top values for a particular column – the <code class="highlighter-rouge">value_counts</code> method.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">seattle_permits_df</span><span class="p">[</span><span class="s">"applieddate"</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">dropna</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> </code></pre></div></div> <h3 id="data-wrangling">Data Wrangling</h3> <p>The value counts make it clear that a lot of the values in the “applieddate” column are missing or null. 
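<p>To see why <code class="highlighter-rouge">dropna=False</code> matters here, consider a toy series standing in for the <code class="highlighter-rouge">applieddate</code> column (the values are made up for illustration). By default, <code class="highlighter-rouge">value_counts</code> silently drops nulls; with <code class="highlighter-rouge">dropna=False</code>, they show up as their own bucket.</p>

```python
import pandas as pd

# A toy stand-in for the applieddate column (hypothetical values)
dates = pd.Series(["2019-01-02", None, "2019-01-02", None, None])

print(dates.value_counts().to_dict())    # nulls dropped: {'2019-01-02': 2}
print(dates.value_counts(dropna=False))  # NaN appears as its own bucket, with count 3
```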
There are a variety of ways you can <a href="https://en.wikipedia.org/wiki/Imputation_(statistics)">handle missing data</a>, but removing incomplete rows is the simplest, so it’s what we’ll do here. In the next cell, we’ll remove rows with null dates. We’ll also filter down our dataset to just the columns we’re interested in to reduce the amount of extraneous information in this analysis.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Remove rows where `applieddate` is null</span> <span class="n">seattle_permits_df</span> <span class="o">=</span> <span class="n">seattle_permits_df</span><span class="p">[</span><span class="n">seattle_permits_df</span><span class="p">[</span><span class="s">"applieddate"</span><span class="p">]</span><span class="o">.</span><span class="n">notnull</span><span class="p">()]</span> <span class="c"># Keep only the `applieddate` column and reset the index so it stays sequential</span> <span class="n">seattle_permits_df</span> <span class="o">=</span> <span class="n">seattle_permits_df</span><span class="p">[[</span><span class="s">"applieddate"</span><span class="p">]]</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c"># Show the first 10 rows</span> <span class="n">seattle_permits_df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> </code></pre></div></div> <p>At this point, each row in our dataset corresponds to a permit application and the only column we’ve preserved is the date of the application. The task of forecasting the number of permit applications is not really interesting (or reliable) at the granularity of a single day. Predicting at the granularity of a week might be interesting, but let’s start by grouping by month. 
To get some date-time functionality from Python, we’ll convert our date column to a <code class="highlighter-rouge">datetime</code> type.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datetime</span> <span class="c"># Convert applieddate to datetime</span> <span class="n">fixed_dates_df</span> <span class="o">=</span> <span class="n">seattle_permits_df</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span> <span class="n">fixed_dates_df</span><span class="p">[</span><span class="s">"applieddate"</span><span class="p">]</span> <span class="o">=</span> <span class="n">fixed_dates_df</span><span class="p">[</span><span class="s">"applieddate"</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">)</span> <span class="n">fixed_dates_df</span> <span class="o">=</span> <span class="n">fixed_dates_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="n">fixed_dates_df</span><span class="p">[</span><span class="s">"applieddate"</span><span class="p">])</span> <span class="c"># Group by month</span> <span class="n">grouped</span> <span class="o">=</span> <span class="n">fixed_dates_df</span><span class="o">.</span><span class="n">resample</span><span class="p">(</span><span class="s">"M"</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="n">data_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"count"</span><span class="p">:</span> <span class="n">grouped</span><span class="o">.</span><span class="n">values</span><span class="o">.</span><span class="n">flatten</span><span 
class="p">()},</span> <span class="n">index</span><span class="o">=</span><span class="n">grouped</span><span class="o">.</span><span class="n">index</span><span class="p">)</span> <span class="n">data_df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> </code></pre></div></div> <h3 id="plotting-our-data">Plotting our Data</h3> <p>Our <code class="highlighter-rouge">data_df</code> <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">dataframe</a> consists of two columns: a date time index corresponding to each month since the City of Seattle started reporting permit applications, and a count that corresponds to the number of permit applications received during that month. And now we’re ready to plot our time series.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="kn">from</span> <span class="nn">pandas.plotting</span> <span class="kn">import</span> <span class="n">register_matplotlib_converters</span> <span class="n">register_matplotlib_converters</span><span class="p">()</span> <span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="s">"ggplot"</span><span class="p">)</span> <span class="n">data_df</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">"purple"</span><span class="p">)</span> </code></pre></div></div> <p><img src="/img/20191007.ts_plot_1.png" alt="Time Series Plot 1" /></p> <p>Plotting our time series reveals something interesting that would have been hard to notice earlier. 
Notice how the number of applications in 2005 and before looks suspiciously low. This certainly appears to be a data problem. Let’s remove all data from before 2006, since bad data will impact the accuracy of our model. Let’s also remove data from after October of this year, since October is incomplete (at the time of this writing).</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">is_between_2006_and_now</span><span class="p">(</span><span class="n">date</span><span class="p">):</span> <span class="k">return</span> <span class="n">date</span> <span class="o">&gt;</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2006</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="ow">and</span> <span class="n">date</span> <span class="o">&lt;</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2019</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="n">data_df</span> <span class="o">=</span> <span class="n">data_df</span><span class="p">[</span><span class="n">data_df</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">to_series</span><span class="p">()</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">is_between_2006_and_now</span><span class="p">)]</span> <span class="n">data_df</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">"purple"</span><span class="p">)</span> </code></pre></div></div> <p><img src="/img/20191007.ts_plot_2.png" alt="Time 
Series Plot 2" /></p> <p>After removing data, our new plot makes two things pretty clear. Firstly, there are some clear trends in the time series – for example, an increase between 2009 and 2016, followed by a leveling off of permit applications. Secondly, there is a cyclic nature to the time series, which is indicative of there being <a href="https://en.wikipedia.org/wiki/Seasonality">seasonal variation</a> in permit applications.</p> <h3 id="time-series-decomposition">Time Series Decomposition</h3> <p>To better understand the seasonal nature of our data, we can decompose our time series into components. The first step in decomposing our time series is determining whether our underlying stochastic process should be modeled with an additive or multiplicative decomposition. One heuristic here is if the magnitude of the seasonal fluctuations changes significantly over time, then use a multiplicative model. Otherwise, use an additive model. In our case, the magnitude of the seasonal fluctuations appears to be relatively consistent over time. 
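<p>One quick way to sanity-check that heuristic numerically (a toy illustration on synthetic data, not a formal statistical test) is to detrend the series and compare the spread of the seasonal residual in an early window against a late one. A ratio near 1 suggests the seasonal swing is stable, favoring an additive model:</p>

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: a linear trend plus a fixed-amplitude seasonal swing
idx = pd.date_range("2006-01-01", periods=144, freq="MS")
t = np.arange(144)
y = 200 + 2 * t + 30 * np.sin(2 * np.pi * t / 12)
series = pd.Series(y, index=idx)

# Remove the trend with a 12-month centered rolling mean, then compare
# the spread of the seasonal residual in the first and second halves
detrended = series - series.rolling(12, center=True).mean()
early = detrended.iloc[:72].std()
late = detrended.iloc[72:].std()
print(round(float(early / late), 2))  # close to 1.0, so an additive model is reasonable
```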
We can formalize the additive decomposition as follows:</p> <p>\( y_t = S_t + T_t + R_t \)</p> <p>where \( y_t \) is our data (counts of permit applications), \( S_t \) is our seasonal component, \( T_t \) is our trend component, and \( R_t \) is whatever is left over (the remainder).</p> <p>We will use a function in the <a href="https://www.statsmodels.org/stable/index.html">statsmodels</a> module to perform this decomposition for us, but we could compute it ourselves using a technique known as <a href="https://people.duke.edu/~rnau/411diff.htm">differencing</a>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">statsmodels.tsa.seasonal</span> <span class="kn">import</span> <span class="n">seasonal_decompose</span> <span class="n">result</span> <span class="o">=</span> <span class="n">seasonal_decompose</span><span class="p">(</span><span class="n">data_df</span><span class="p">)</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span> </code></pre></div></div> <p><img src="/img/20191007.ts_decompose.png" alt="Seasonal Decomposition" /></p> <p>The <code class="highlighter-rouge">seasonal_decompose</code> method generates this handy plot for us. And this plot helps highlight a few interesting things about our data. Firstly, it appears as though there has been an overall growth in permit applications in Seattle since 2009. That growth followed a steep decline in permit applications that appears to have begun at the end of 2007 or early 2008. The housing market across the country was impacted by the <a href="https://en.wikipedia.org/wiki/Subprime_mortgage_crisis">subprime mortgage crisis</a> at this time, and Seattle appears to have been no exception. 
Secondly, we notice that the peak season for permit applications is the late spring, with applications tapering off significantly at the end of the year. (It is generally accepted that the warmer months are the busiest months for construction, and the data seem to reflect this as well.)</p> <p>This decomposition gives us a great overall picture of the data, but we’d like to use the historical data to forecast future building permit applications. We’ll use Prophet to help us do that.</p> <h3 id="how-prophet-works">How Prophet works</h3> <h4 id="the-basics">The basics</h4> <p>Prophet is a module that enables time-series forecasting. The motivations for Prophet’s design decisions are outlined <a href="https://research.fb.com/blog/2017/02/prophet-forecasting-at-scale/">here</a>. Prophet uses an additive decomposable time series model very much like what we showed above:</p> <p>\( y_t = g(t) + s(t) + h(t) + \epsilon_t \)</p> <p>In a Prophet model, there are three main components:</p> <ol> <li>a trend function \( g(t) \)</li> <li>a seasonality function \( s(t) \)</li> <li>a holidays function \( h(t) \)</li> </ol> <p>\( \epsilon_t \) is an error term, but we won’t talk about it in any more depth.</p> <p>The introduction of holidays is one unique aspect of Prophet that makes it both powerful and configurable. Let’s dive into each component to get a better idea of how Prophet works its magic. 
If you’re just interested in the magic, skip ahead to <a href="#prophet-in-action">“Prophet in Action”</a>.</p> <h4 id="trend--gt-">Trend \( g(t) \)</h4> <p>Prophet exposes two options for the trend component: a <a href="https://www.khanacademy.org/science/biology/ecology/population-growth-and-regulation/a/exponential-logistic-growth">logistic growth function</a>, or alternatively, a simpler piecewise linear growth function (both of which are parameterized by a growth rate \( k \)).</p> <p>The trend component incorporates a notion of changepoints – another aspect that makes Prophet unique. The motivation for changepoints is that domain experts in a particular time series will know, in advance, about dates that they expect to impact the trend. You can imagine that if our job is to forecast adoption of a product, we may have advance knowledge about release dates and other important dates that will have an impact on product adoption. Prophet allows us to pass an input vector of real numbers that correspond to the change in the growth rate at those times of interest. We won’t leverage this feature of the model here, but it’s a neat feature that gives domain experts a straightforward way to incorporate prior knowledge into their forecasts.</p> <h4 id="seasonality--st-">Seasonality \( s(t) \)</h4> <p>The seasonality component is modeled using <a href="http://mathworld.wolfram.com/FourierSeries.html">a Fourier series</a>. Fourier series are used to approximate periodic functions as an infinite series of sines and cosines.</p> <p>\( s(t) = \sum_{n=1}^{N} (a_n \cos { \frac { (2\pi nt ) } P } + b_n \sin { \frac { (2\pi nt ) } P }) \)</p> <p>The \( P \) parameter corresponds to the period of our seasonality; in our case, the seasonality is yearly, so \( P = 365 \). The choice of the parameter \( N \) can be thought of as a way of increasing the sensitivity of our seasonality model. 
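<p>To make this concrete, here is a small NumPy sketch (an illustration of the design matrix described above, not Prophet’s internal code) that builds the \( 2N \) Fourier features for a vector of times:</p>

```python
import numpy as np

def fourier_features(t, period=365.25, order=10):
    """Build the 2N-column seasonal design matrix X(t): for each
    n = 1..N, one cosine column and one sine column with period P."""
    t = np.asarray(t, dtype=float)
    cols = []
    for n in range(1, order + 1):
        cols.append(np.cos(2 * np.pi * n * t / period))
        cols.append(np.sin(2 * np.pi * n * t / period))
    return np.column_stack(cols)

# Two years of daily time points with N = 10 gives a (730, 20) matrix
X = fourier_features(np.arange(730), order=10)
print(X.shape)
```

<p>Fitting \( s(t) = \beta X(t) \) then reduces to learning the \( 2N \) coefficients in \( \beta \).</p>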
As we increase \( N \), we allow the model to capture more seasonal changes, but with the potential downside of <a href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a>, potentially decreasing the model’s ability to generalize to future data.</p> <p>In matrix form, assuming \( N \) = 10 (a reasonable default according to the Prophet documentation), we have a seasonality vector that looks as follows:</p> <p>\( X(t) = [\cos { \frac { 2 \pi (1) t } P }, …, \sin { \frac { 2 \pi (10) t } P }] \)</p> <p>\( s(t) = \beta X(t) \)</p> <p>\( \beta \) is a vector of length \( 2N \) of parameters that we’ll learn in the <code class="highlighter-rouge">fit</code> step. More on that below.</p> <h4 id="holidays--ht-">Holidays \( h(t) \)</h4> <p>The last component is the holiday component. If we pass a list of holidays to the model, for each holiday \( i \) we let \( D_i \) be the set of past and future dates for that holiday. Those holidays are incorporated as vectors of indicator functions (i.e., for each time \( t \) in our dataset, there is a 1 for each holiday occurring on that day, and zeroes elsewhere). These vectors should be very sparse.</p> <p>\( h(t) = [1(t \in D_1), …, 1(t \in D_L)] \)</p> <h4 id="calculation">Calculation</h4> <p>Once we’ve encoded our data in a matrix, where each row corresponds to one of the times \( t \) in our dataset, we need to <em>estimate</em> the parameters of our model. Prophet uses the <a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">L-BFGS algorithm</a> to <em>fit</em> the model. This is the learning step in machine learning, but it’s referred to as “fitting” because we’re trying to define the function whose curve best fits the observed data. Typically, we do this by identifying an objective function that we want to optimize.</p> <p>If you’re not familiar with optimization functions, think back to your calculus days, when you found a function’s optima. 
The goal was to find the inputs that produced our function’s minimum or maximum output values. You did this by taking the derivative of the function, setting it equal to zero, and solving for the inputs that satisfied that equation. In this case, the function we’re optimizing is the <a href="https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation">maximum a posteriori</a> objective, which amounts to finding the set of parameters \( \theta \) that are most likely <em>given</em> the observed data.</p> <h3 id="prophet-in-action">Prophet In Action</h3> <p>Now let’s see if we can forecast permit applications for the remainder of 2019 using Prophet. The first step is to train our forecasting model.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fbprophet</span> <span class="kn">import</span> <span class="n">Prophet</span> <span class="n">model</span> <span class="o">=</span> <span class="n">Prophet</span><span class="p">()</span> <span class="n">train_df</span> <span class="o">=</span> <span class="n">data_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"count"</span><span class="p">:</span><span class="s">'y'</span><span class="p">})</span> <span class="n">train_df</span><span class="p">[</span><span class="s">"ds"</span><span class="p">]</span> <span class="o">=</span> <span class="n">train_df</span><span class="o">.</span><span class="n">index</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_df</span><span class="p">)</span> </code></pre></div></div> <p>Easy enough. 
Now, let’s try to do some forecasting:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">plotting</span><span class="o">.</span><span class="n">register_matplotlib_converters</span><span class="p">()</span> <span class="c"># We want to forecast over the next 5 months</span> <span class="n">future</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">make_future_dataframe</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s">'M'</span><span class="p">,</span> <span class="n">include_history</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">forecast</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">future</span><span class="p">)</span> <span class="n">model</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">forecast</span><span class="p">)</span> </code></pre></div></div> <p><img src="/img/20191007.prophet_forecast.png" alt="Prophet Forecast" /></p> <p>Neat! Prophet forecasts that we will see a continuation of the downward trend in building permit applications that seems to have begun in 2016.</p> <p>There are a couple of interesting things about what Prophet gives us:</p> <ul> <li>Prophet generates uncertainty intervals for us (also known as <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence intervals</a>). It looks like there’s a lot of uncertainty in this forecasting model, so we shouldn’t rely too heavily on it.</li> <li>The plotted forecast includes our actual data points, as well as the forecast on the future (for which we don’t yet have any observed data). 
This allows us to see where our actual observed data lie outside of our uncertainty level.</li> </ul> <p>There are a lot of assumptions baked into the defaults that we’re using here. If we were experts on this data, we could go in and experiment with tuning some of these parameters. We might also want to add additional variables to our model to improve our forecasts. For example, we would intuitively expect that there are lots of external variables that would impact the amount of new construction taking place in a city like Seattle. Are companies like Amazon and Google continuing to rapidly grow their presence in the city or are they expanding elsewhere? How is transportation changing the city and how might that impact development in previously underdeveloped neighborhoods?</p> <p>If you’re interested in comparing multiple models after adding variables, you can do so using some additional functions included in the <code class="highlighter-rouge">fbprophet.diagnostics</code> package, such as the <code class="highlighter-rouge">cross_validation</code> and <code class="highlighter-rouge">performance_metrics</code> functions. Take a look at the <a href="https://facebook.github.io/prophet/docs/quick_start.html">Prophet API documentation</a> for more information about these functions – it’s super helpful.</p> <h3 id="takeaways">Takeaways</h3> <p>Thanks to some very powerful open source tools like Prophet, advanced statistical analysis is becoming available to people with only basic statistical and scripting skills. One key thing to remember is that there is a high degree of uncertainty in our forecast. Keeping that uncertainty in mind is incredibly important, particularly in government, where forecasts can have significant impacts on the citizens that our governments serve. 
Analyses like this one have the potential to be valuable tools not only for the Seattle Department of Construction &amp; Inspections – who may be able to make important resourcing decisions based on their expected applications in the coming year – but also for agencies across government with similar concerns.</p>rlvoyerTime Series Analysis with Jupyter Notebooks and SocrataContinual Improvement : CI / CD at Tyler Technologies, Data &amp; Insights Division2019-09-26T00:00:00+00:002019-09-26T00:00:00+00:00https://dev.socrata.com/build/and/deployment/2019/09/26/continual-improvement---ci---cd-at-tyler-technologies--data---insights-division<h1 id="continual-improvement-cicd-at-socraer-tyler-technologies-data--insights-division">Continual Improvement: CI/CD at <s>Socra</s>..er Tyler Technologies, Data &amp; Insights Division</h1> <h2 id="jenkins-in-the-closet">Jenkins in the Closet</h2> <p>Our Engineering organization has a long and storied history with <a href="https://semaphoreci.com/blog/cicd-pipeline">CI / CD</a> in our division. It starts well before I arrived at Socrata in 2014 (during the antediluvian period aka 2013) when it was decided that <em>‘Hey, what we could really use is some regular end-to-end testing’</em>. Contractors were hired and soon this testing was underway at Socrata, a scrappy startup running on gumption and a dream in the heart of Seattle’s Pike Place neighborhood. The testing took the form of a series of automated suites based on the <a href="https://en.wikipedia.org/wiki/Cucumber_(software)">Cucumber</a> framework and a <a href="https://jenkins.io">Jenkins</a> instance running on a server under a desk in the Lead Contractor’s apartment. Things were pretty lean back then.</p> <p>Soon end-to-end testing became a much more important part of the process that we relied upon for shipping our code. We brought the test server into our office and shoved it into an unobtrusive broom closet in the corner. 
Our test infrastructure grew to include a <a href="https://en.wikipedia.org/wiki/Blade_server">blade server</a> for Windows VMs and an Ubuntu 12.04 server that hosted the shiny new Jenkins server and its associated Linux-based testing. Things were better, faster and more reliable, but they were still pretty slapdash.</p> <p>We chugged along this way for most of a year. We shipped software based on those Cucumber test results and almost all was well. But like the flow of time, software and its infrastructure are never constant and it was decided that now was the time to completely overhaul our infrastructure. It was time we cast off from Azure and our physical data-centers where Engineers must change disks and speak soft assurances to temperamental machinery. It’s 2014 now, almost 2015 really. It’s time for <a href="https://aws.amazon.com/what-is-aws/">AWS</a>.</p> <h2 id="jenkins-in-the-cloud">Jenkins in the Cloud</h2> <p>We began a great quest to move the entire business to AWS and Jenkins was in the vanguard of that effort. We began converting Jenkins infrastructure by leveraging the following important elements:</p> <ul> <li><a href="https://www.chef.io/">Chef</a> provisioned host configurations</li> <li><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html">AMI</a> images</li> <li><a href="https://aws.amazon.com/iam/">IAM</a> roles and permissions</li> </ul> <p>First we created a Chef cookbook for our Jenkins server and began provisioning it with all the build toolchains and plugins that existed on the Jenkins in the closet. It was a hodgepodge of Ruby and Scala, Java and Python, Docker and Rails. Then we migrated our jobs, one by one, from our Jenkins in the closet to our Jenkins in the cloud, making sure that each worked.</p> <p>We completed our move to the cloud and wound up with two AWS-based Jenkins servers on <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html">EC2</a> instances. Why two? 
Well, it turns out our growing testing and build processes, put together on a single node, quickly overwhelmed the poor server and revealed a particularly nasty network driver <a href="https://bugs.launchpad.net/cloud-images/+bug/1510315">bug</a> present in AWS <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html#AvailableInstanceTypes">M4</a> EC2 instances. Under high load, the driver would, on occasion, get overloaded, fall over, and not be able to get back up and running until you rebooted the server. As our test frequency increased, this became more and more of a problem. To avoid it, we split the load between a build server (handling all build-type jobs) and a test server (handling all automated testing jobs).</p> <p>Even with our jobs split between two servers, the load continued to grow and our poor servers soon became overloaded again during peak development times. The servers were so overloaded that they would fail and need to be stopped and restarted in the AWS console several times a week. Yuck! Later, as a temporary solution, we were able to compile a new network driver into the kernel that didn’t have this issue. It was only when M5 EC2 instances (which use a different network card and thus a different driver) became available that we were able to resolve the network load issue permanently.</p> <p>Even after solving this issue, we still needed to scale our Jenkins servers as the daily load continued to grow. We could make them bigger (truly humongous, in fact), but there is an upper bound (in both capacity and cost), and there are also fringe resource collision issues as the parallelism of tests grows on a single host. For example, several of our test suites drive <a href="https://blog.logrocket.com/introduction-to-headless-browser-testing-44b82310b27c/">headless browsers for UI testing</a>.
We found that the driver that runs those browsers starts to show performance issues when it must manage more than a few dozen active connections simultaneously. This drove up our test times during periods of heavy load. In addition, heavily parallelized scenarios revealed some process shutdown issues that would essentially start orphaning driver threads over time, requiring administrative intervention or a reboot.</p> <p>Obviously we couldn’t scale up. Our best option was to scale out. Enter <a href="https://wiki.jenkins.io/display/JENKINS/Distributed+builds">Jenkins Workers</a>, stage left.</p> <h2 id="go-forth-my-children-and-scale">Go Forth My Children And Scale</h2> <p>Jenkins is probably the most pervasive CI / CD solution in the world. Hundreds of thousands of deployments and millions of users make a pretty big community. And since the customers of Jenkins are, by and large, developers, they are able to give back in a virtuous cycle to the Jenkins ecosystem. The Jenkins community is huge and active, and Jenkins encourages this by being highly extensible through the plugin capabilities that the core system provides.</p> <p><a href="https://plugins.jenkins.io/">Plugins</a> are the lifeblood of the Jenkins system. There are thousands of these plugins maintained by community members, and they provide immeasurable value to the community. One of these plugins is the <a href="https://plugins.jenkins.io/ec2">EC2 plugin</a>. This plugin allows Jenkins to interact with AWS to manage worker EC2 nodes and send them work in a dynamic way. This enables Jenkins to scale out dynamically and provide nearly limitless execution capability. The limiting factor when using this plugin really comes down to a question of budget.</p> <p>So we decided to solve our scalability problems by converting our jobs to run on workers. Easy Peasy.
Well, not exactly.</p> <p>First, we created a Chef recipe for the workers which provisioned them with each of the toolchains and software tools we use for building and testing our projects. A partial list includes:</p> <ul> <li><a href="https://www.scala-lang.org/">Scala</a>/<a href="https://www.scala-sbt.org/">SBT</a></li> <li><a href="https://aws.amazon.com/cli/">AWSCLI</a></li> <li><a href="https://opensource.com/resources/what-docker">Docker</a></li> <li><a href="https://elixir-lang.org/">Elixir</a></li> <li><a href="https://golang.org/">Go</a></li> <li><a href="https://openjdk.java.net/">Java</a></li> <li><a href="https://nodejs.org/en/">Node</a></li> <li><a href="https://www.packer.io/">Packer</a></li> <li><a href="https://www.postgresql.org/">Postgres</a></li> <li><a href="https://www.python.org/">Python</a></li> <li><a href="https://www.r-project.org/">R</a></li> <li><a href="https://www.ruby-lang.org/en/">Ruby</a></li> <li><a href="https://www.rust-lang.org/">Rust</a></li> <li><a href="https://www.seleniumhq.org/">Selenium</a></li> </ul> <p>With this Chef recipe created, we were able to use the Packer tool to build an AWS image (AMI) from this cookbook. Now we have a provisioned worker ready to work, right? Not yet. We still need to give this worker permissions to interact with the systems we use internally as we build our projects.
To do that, we configured the worker to interact with (among others) the following third-party services:</p> <ul> <li><a href="https://www.atlassian.com/git/tutorials/what-is-git">Git</a>/<a href="https://techcrunch.com/2012/07/14/what-exactly-is-github-anyway/">GitHub</a></li> <li><a href="https://aws.amazon.com/ecr/">AWS Elastic Container Registry</a></li> <li><a href="https://aws.amazon.com/s3/">AWS S3 Buckets</a></li> <li><a href="https://devops.stackexchange.com/questions/1898/what-is-an-artifactory">Artifactory Repositories</a></li> </ul> <p>Complicating the worker’s interaction with these services is the requirement that we not store credentials on the AMI at rest. Our solution for credential management on the Jenkins server node isn’t available to the built AMIs, so we ended up using a homegrown solution to interact with the encrypted AWS KMS secure storage service from anywhere. We can use this KMS wrapper to pull files to the worker at boot time from within the AWS cloud. We leveraged the boot script capability of the worker configuration and made it pull down and apply the security credentials and configuration files at boot time. These workers are short-lived, so the credentials are effectively ephemeral. Problem solved.</p> <p>So, after adequately provisioning our worker AMI, we now have two recipes – one for the server and one for the workers – that we will use to scale Jenkins.</p> <p>Now, after verifying the workers with some testing, it was time to start migrating our build jobs to the workers. This was a voyage of discovery. We found significant configuration requirements that weren’t documented when we initially created the worker. But we used each discovery to formally capture these requirements and place them in the worker recipe such that it handles the requirements of each job and is repeatable. This migration process also helped us realize when we were solving a problem more than once or in diverse ways.
Going through this process allowed us to streamline the jobs to be more homogeneous. By forcing our jobs to work on disposable hosts, we encourage job owners to make their jobs self-sufficient. It was a long process taking months (in fact, it continues to this day), but we pretty quickly began to see the benefits of this work. By tackling our most resource-intensive jobs first, we were able to see immediate impacts on the server; loads dropped, and server-memory-related test failures became almost non-existent. Tests run in relative isolation no longer needed to contend with other jobs for shared local resources. The server was no longer the bottleneck. Here are some graphs that show the effects. See if you can spot when the workers really got going.</p> <h4 id="system-load-averages-january-2019---september-2019"><em>System Load Averages (January 2019 - September 2019)</em></h4> <p><img src="/img/2019-09-jenkins-build-load.png" alt="System Load Averages (January 2019 - September 2019)" height="100%" width="100%" /></p> <h4 id="system-memory-utilization-averages-january-2019---september-2019"><em>System Memory Utilization Averages (January 2019 - September 2019)</em></h4> <p><img src="/img/2019-09-jenkins-build-memory.png" alt="System Memory Utilization Averages (January 2019 - September 2019)" height="100%" width="100%" /></p> <p>After a few months of effort, most of the jobs we considered “heavy hitters” had been moved to workers. But we had this other server – the test server. What should we do about that?</p> <p>Well, again, we made sure that the worker was provisioned with all the tools it needed for the tests to run on it properly. Then we began moving jobs from the test server directly to workers. No additional load hit the server, and our worker utilization continued to grow.
Soon, we were in a position where we could turn off the Jenkins test server and truly have only one Jenkins server running all builds, deploys, and tests.</p> <p>These days, the build server is becoming a job scheduler. Through configuration of the EC2 plugin, we can define pretty robust behavior for our workers. We can set how many workers are allowed, how long they live, what types of EC2 nodes to use, how many jobs they can execute simultaneously and which jobs, etc…</p> <p>We anticipate that in the next few months, as we finish migrating all jobs to workers, we will be able to start reducing the EC2 host size of the server itself (it’s currently a C5.4xlarge, costing us quite a bit each month). We will start by cutting its size in half, profiling the newly sized node, and then progressing to smaller nodes as seems reasonable. The hope is that it will truly become just a job scheduler. Job execution will live on the scalable, disposable workers. We aren’t done yet with this work (lots of jobs still to convert), but we have a much happier development team and a much relieved ops team. Achievement unlocked!</p>JoeNunnelleyContinual Improvement: CI/CD at Socra..er Tyler Technologies, Data &amp; Insights DivisionWelcome (back) to our blog!2019-08-14T00:00:00+00:002019-08-14T00:00:00+00:00https://dev.socrata.com/blog/2019/08/14/welcome--back--to-our-blog-<p>Hey there! I’m Helena — I joined Tyler Technologies, Data and Insights Division (fka Socrata) about a year and a half ago. I’m a software engineer on the Performance team (as in the <a href="https://www.tylertech.com/products/socrata/performance-optimization">Socrata Performance Optimization</a> product, not page performance). Over my tenure, I’ve been continuously impressed by all the cool things my peers do — both my coworkers and the customers that build on top of our products. I’ve also been continuously surprised that we don’t showcase our technical work anywhere.
Since one of our core values is “Celebrate success together”, I’m here to do just that!</p> <h2 id="introducing-devsocratacomblog">Introducing dev.socrata.com/blog</h2> <p>After a nearly 2-year hiatus, I’d like to re-introduce <a href="https://dev.socrata.com/blog/">dev.socrata.com/blog</a>! 🎉 As part of our re-launch, we’re also doing a little bit of re-focusing. In the past, this blog has been primarily a place for Socrata Open Data API announcements and technical how-to’s. Going forward, we’ll also be mixing in some behind-the-scenes cuts — how the team behind Socrata’s products builds the good stuff (and how we learn from the bad stuff).</p> <p>Here are some topics you can look forward to in the coming months:</p> <ul> <li> <p><strong>Time Series Analysis with Jupyter Notebooks and Socrata</strong><br /> <em>Robert Voyer (Software Engineering Manager)</em></p> <p>Learn how to download the Seattle Building Permits dataset from the Socrata API, and do a time series analysis using open source data science tools in Python.<br /></p> </li> <li> <p><strong><em>Informatics</em> and Dogfood</strong><br /> <em>Andrew Deming (Software Support Team Lead) and Ryan Hall (Data Analyst)</em></p> <p>Informatics is how our employees use our own data and our own product on a daily basis. A grassroots initiative from the start, Informatics had several auxiliary goals, like onboarding new staff with the product and securely sharing data with internal and external stakeholders.<br /></p> </li> <li> <p><strong>Jenkins Workers</strong><br /> <em>Joe Nunnelley (Senior Automation Engineer)</em></p> <p>Tyler’s Data and Insights Division uses Jenkins to execute automated jobs that support testing, builds, and deploys. 
Learn how we introduced CI/CD to our engineering process and how we started using Jenkins Workers to make this infrastructure more dynamic.</p> </li> </ul> <h2 id="up-next">Up next</h2> <p>I’ll be facilitating this blog going forward, which means I’ll be doing the wrangling, but not the writing. Expect to hear from a variety of people and roles about all of the awesome work they do.</p> <hr /> <p>PS: We’re hiring! If you’re interested in learning more about our work, then check out our <a href="https://app.jobvite.com/j?bj=or8b4fwy&amp;s=devblog">jobs page</a>.</p>helenaswHey there! I’m Helena — I joined Tyler Technologies, Data and Insights Division (fka Socrata) about a year and a half ago. I’m a software engineer on the Performance team (as in the Socrata Performance Optimization product, not page performance). Over my tenure, I’ve been continuously impressed by all the cool things my peers do — both my coworkers and the customers that build on top of our products. I’ve also been continuously surprised that we don’t showcase our technical work anywhere. Since one of our core values is “Celebrate success together”, I’m here to do just that!Elixir in production, an open data tale2017-08-28T00:00:00+00:002017-08-28T00:00:00+00:00https://dev.socrata.com/blog/2017/08/28/elixir-in-production<h1 id="elixir-in-production-an-open-data-tale">Elixir in production, an open data tale</h1> <h2 id="the-problem">The problem</h2> <p>The job of the Data Pipeline team is to build and maintain software to seamlessly get data into our platform.</p> <p>Socrata has had, for a long time, a wizard which would parse certain files, provide a few formatting options, and allow the user to import the data into the Socrata platform.</p> <p>This process had a few major issues. It was an all or nothing deal - either your data had issues and it would fail to import, or it was perfectly clean. 
It also didn’t provide any feedback on what the status of anything was; you just saw an indefinite spinner until you didn’t anymore. To make matters worse, if there was an issue, you often got an error message that didn’t tell you how to fix your source data.</p> <p>When we set out to make that experience better we had one major goal: be transparent about what’s happening with the data. Every step has information that is potentially actionable, and we should surface that information as soon as we know it. We wanted to front-load all of the actionable information, so the user can upload their file, make their changes and walk away before the file is even done uploading. We also want to provide a quick retry cycle if the user uploads something and realizes it’s wrong. This allows them to go back to the data owner or source and fix it quickly.</p> <p><img src="/img/posts/2017-08-28 - upload-preview.gif" alt="Uploading a file" width="800px" /></p> <p><em>Uploading this 10gb/28 million row file gives you a preview and the ability to start interacting and transforming the data before it is uploaded</em></p> <p>We also had an internal goal, which was to run our service(s) sustainably with a relatively small team (about 4 backend engineers, who also have other jobs to do). Our engineering team had been adopting a microservices model, which, despite all the Medium thinkpieces extolling the virtues, had failed to deliver us to engineering nirvana as we had hoped. With a small number of human engineers, and a large number of services, context switching between them became challenging. 
Moreover, due to the small size of our engineering organization, we had no dedicated team working on tooling, which led to duplicated effort across teams who were all chartered to deliver customer value, not engineering value.</p> <h2 id="what-we-built">What we built</h2> <p>Given the UX problems, the engineering problems, and the goal, we settled on Elixir and Phoenix as the tools to make this thing work. There are plenty of other posts that describe why Elixir is interesting, but in short, it was the only tool that would allow us to accomplish the real time feedback we wanted in a single package. Elixir and Erlang* also provide primitives for building and running distributed systems that can’t be beat (at the moment), and given that we were going to be doing computation across the whole cluster in parallel, it seemed like the right tool for the job.</p> <p>The core of the data pipeline service is really an interpreter, which interprets the same language used for querying data, called SoQL (Socrata Query Language). SoQL looks a lot like SQL, but it’s simplified for the use case we see a lot at Socrata. We also needed an API around said interpreter, for accepting data and allowing the user to manipulate it, which is where Phoenix came in.</p> <p>In our data pipeline service, we implement a different set of functions that are required to transform data. 
An example would be the following:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">geocode</span><span class="p">(</span><span class="n">address</span><span class="p">,</span> <span class="n">city</span><span class="p">,</span> <span class="k">state</span><span class="p">,</span> <span class="n">zip</span><span class="p">)</span> </code></pre></div></div> <p>This, given columns in your source file called <code class="highlighter-rouge">address</code>, <code class="highlighter-rouge">city</code>, <code class="highlighter-rouge">state</code>, and <code class="highlighter-rouge">zip</code>, will geocode the values and make a new column, which can then be imported alongside the rest of your data.</p> <p>Obviously faster is better, so all the execution happens with as much parallelism as we can get out of the cluster. This is where Elixir really shines. Coordinating all that state across the cluster would have been tricky, but in Elixir, it’s trivial to assign work to different nodes in the cluster. With a lot of parallelism, we can do slow transforms that may do IO to other services (like geocoding) and still get reasonable performance. It also gives us the ability to meet whatever service level we want by scaling the cluster up or down.</p> <p>For a 28-million-row dataset, running a simple string concatenation expression takes an amount of time roughly inversely proportional to the cluster size:</p> <table> <thead> <tr> <th>Cluster Size</th> <th>Time spent evaluating</th> </tr> </thead> <tbody> <tr> <td>1 node</td> <td>39.871s</td> </tr> <tr> <td>3 nodes</td> <td>17.539s</td> </tr> <tr> <td>5 nodes</td> <td>11.953s</td> </tr> </tbody> </table> <h2 id="results">Results</h2> <p>We ended up with a system that handles the workloads we wanted with minimal drama. We’ve been running the system in production for several months now, and haven’t had issues that were related to our tools, which is about as much as you can ask for.
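For context, those timings work out to a roughly 2.3x speedup on 3 nodes and 3.3x on 5 nodes, so scaling is strong but sub-linear (coordination and IO keep it below perfect). A quick sketch of that arithmetic (JavaScript here purely for illustration; the numbers are the ones from the table):

```javascript
// Parallel speedup from the evaluation times in the table:
// speedup = single-node time / cluster time.
function speedup(baselineSeconds, clusterSeconds) {
  return baselineSeconds / clusterSeconds;
}

var s3 = speedup(39.871, 17.539); // ~2.27x on 3 nodes
var s5 = speedup(39.871, 11.953); // ~3.34x on 5 nodes

// Efficiency = speedup / node count; overhead keeps it below 1.0.
var e5 = s5 / 5; // ~0.67
```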
One of the most impressive aspects of Elixir (and Erlang) is the tooling for analyzing a running system. We’ve shipped plenty of bugs out into production, but a combination of the Erlang Observer, the remote IEx REPL, distributed tracing, and debugging has allowed us to track them down quickly. These tools are indispensable, and once you have them, it’s exceedingly difficult to go back to a world without them.</p> <p>Elixir as a language and Erlang as a platform have their pros and cons. Elixir is an extremely simple language, and our team was able to ramp up on it quickly. The tooling in the Elixir ecosystem is simple, well documented, and fits together well. Coming from a language like Java or Ruby, there will be some struggling to understand the Erlang/OTP programming model, but ultimately it simplifies fault tolerance, concurrency, and distribution into a small set of primitives which can be composed to make a reliable system. The language and VM are no silver bullet for reliability, but they encourage the developer to think about the common problems in building a distributed application.</p> <p>One issue we ran into was that our team was used to the static typing provided by Scala, and leaving that behind has required some adjustment, for some more than others. This might be a non-starter for some teams, but may not be a big deal to others. It undeniably makes refactoring more difficult and requires that we have a more thorough test suite, which has a high overhead. We experimented with Dialyzer, but found that it was too noisy to be usable.</p> <p>Ultimately though, we’ve accomplished the goals we set out to accomplish, and more importantly we’re working at a pace which is sustainable. The most positive thing to say about Elixir and the tooling is that we don’t really think about it much.
The amount of time we spend talking about, thinking about, and wrestling with tooling (coming from a world of microservices) has been seriously reduced, which leaves much more time to focus on what actually matters, which is building the product that our users use every day.</p> <p>*Elixir is a language which compiles to Erlang AST and runs on the Erlang Virtual Machine, BEAM. Elixir has an identical programming model to Erlang, but with a different syntax, standard library, and tooling.</p>rozapElixir in production, an open data taleCreating a monthly calendar with FullCalendar.io2017-03-30T00:00:00+00:002017-03-30T00:00:00+00:00https://dev.socrata.com/blog/2017/03/30/creating-a-monthly-calendar-with-fullcalendar-io<p>Recently I was helping a customer of ours with an interesting problem: they have a Socrata dataset full of events, in this case <a href="https://data.oregon.gov/dataset/Oregon-Public-Meetings/gs36-7t8m">public meetings</a>, and they wanted a flexible way of displaying them within a monthly calendar embedded within their website.</p> <p>A colleague of mine recommended the MIT-licensed <a href="https://fullcalendar.io/">FullCalendar</a> project, and it worked out wonderfully. This example will demonstrate how you can combine the power and flexibility of Socrata’s APIs with open source software, and quickly build out a monthly calendar visualization for your dataset that looks like the one below:</p> <div id="calendar"></div> <h3 id="prerequisites">Prerequisites</h3> <p>There are a couple of prerequisites for this example:</p> <ol> <li><a href="https://jquery.com/">jQuery</a> - An insanely popular JavaScript framework that FullCalendar requires to work.
You’re probably already using it even if you don’t know it.</li> <li><a href="https://momentjs.com/">Moment.js</a> - A great JavaScript library for parsing and manipulating dates.</li> <li><a href="https://fullcalendar.io/">FullCalendar</a> - The actual FullCalendar library.</li> </ol> <p>I recommend following the <a href="https://fullcalendar.io/docs/usage/">FullCalendar “Basic Usage” doc</a> to start off. All three libraries must be loaded, in that order, before your code can run.</p> <h2 id="step-0-create-your-soql-query">Step 0: Create your SoQL query</h2> <p>Starting from the <a href="https://dev.socrata.com/foundry/data.oregon.gov/yid5-c4eq">API docs for our source dataset</a>, we’re going to craft a SoQL query that does the following:</p> <ul> <li>Uses a <code class="highlighter-rouge">$where</code> clause to pull the last 31 days of events, so we can always see all of the current month’s events</li> <li>Filters to return only events for <code class="highlighter-rouge">Portland</code></li> <li>Uses <code class="highlighter-rouge">$order</code> to sort them by date</li> </ul> <p>The full query will look like the following, but we’ll need to fill in the correct bounding date later on:</p> <div class="tryit-link"> <code>The TryIt macro has been disabled until further notice while we upgrade this site to SODA3.</code> </div> <h2 id="step-1-query-our-api-for-events">Step 1: Query our API for events</h2> <p>In this step, we’ll use jQuery’s <a href="https://api.jquery.com/jquery.ajax/#jQuery-ajax-settings"><code class="highlighter-rouge">$.ajax(...)</code></a> utility function to fetch our records from the API.</p> <p>We’ll pass in the <code class="highlighter-rouge">url</code> of our API endpoint, a <code class="highlighter-rouge">method</code> of <code class="highlighter-rouge">GET</code>, and a <code class="highlighter-rouge">datatype</code> of <code class="highlighter-rouge">json</code>.
For our <code class="highlighter-rouge">data</code>, we can use the broken out parameter pairs of our SoQL query. We also use Moment.js’s <a href="https://momentjs.com/docs/#/manipulating/subtract/"><code class="highlighter-rouge">subtract(...)</code></a> and <a href="https://momentjs.com/docs/#/displaying/format/"><code class="highlighter-rouge">format(...)</code></a> functions to generate a date string for 31 days ago.</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nx">$</span><span class="p">(</span><span class="nb">document</span><span class="p">).</span><span class="nx">ready</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="c1">// Fetch our events</span> <span class="nx">$</span><span class="p">.</span><span class="nx">ajax</span><span class="p">({</span> <span class="na">url</span><span class="p">:</span> <span class="s2">"https://data.oregon.gov/resource/yid5-c4eq.json"</span><span class="p">,</span> <span class="na">method</span><span class="p">:</span> <span class="s2">"GET"</span><span class="p">,</span> <span class="na">datatype</span><span class="p">:</span> <span class="s2">"json"</span><span class="p">,</span> <span class="na">data</span><span class="p">:</span> <span class="p">{</span> <span class="s2">"$where"</span> <span class="p">:</span> <span class="s2">"start_date_time &gt; '"</span> <span class="o">+</span> <span class="nx">moment</span><span class="p">().</span><span class="nx">subtract</span><span class="p">(</span><span class="mi">31</span><span class="p">,</span> <span class="s1">'days'</span><span class="p">).</span><span class="nx">format</span><span class="p">(</span><span class="s2">"YYYY-MM-DDT00:00:00"</span><span class="p">)</span> <span class="o">+</span> <span class="s2">"'"</span><span class="p">,</span> <span class="s2">"city"</span> <span class="p">:</span> <span 
class="s2">"Portland"</span><span class="p">,</span> <span class="s2">"$order"</span> <span class="p">:</span> <span class="s2">"start_date_time DESC"</span> <span class="p">}</span> <span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">response</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// TODO: Handle our response</span> <span class="p">});</span> <span class="p">});</span></code></pre></figure> <h2 id="step-2-handle-our-response-and-create-event-objects">Step 2: Handle our response and create Event Objects</h2> <p>Next we’ll take each of the events in the response from our API call, and create FullCalendar <a href="https://fullcalendar.io/docs/event_data/Event_Object/">Event Object</a>s for each of them. At a minimum, we’ll need start and end dates for them, as well as a title. If we have a URL, that will make the event clickable.</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="p">...</span> <span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">response</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Parse our events into an event object for FullCalendar</span> <span class="kd">var</span> <span class="nx">events</span> <span class="o">=</span> <span class="p">[];</span> <span class="nx">$</span><span class="p">.</span><span class="nx">each</span><span class="p">(</span><span class="nx">response</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="nx">e</span><span class="p">)</span> <span class="p">{</span> <span class="nx">events</span><span class="p">.</span><span class="nx">push</span><span class="p">({</span> <span class="na">start</span><span 
class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">start_date_time</span><span class="p">,</span> <span class="na">end</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">end_date_time</span><span class="p">,</span> <span class="na">title</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">meeting_title</span><span class="p">,</span> <span class="na">url</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">web_link</span> <span class="p">});</span> <span class="p">});</span> <span class="c1">// TODO: Initialize calendar</span> <span class="p">});</span> <span class="p">});</span></code></pre></figure> <h3 id="step-3-initialize-our-calendar">Step 3: Initialize our Calendar</h3> <p>This is the simplest part. We pass in our new collection of events to the FullCalendar initialization function, targeting the <code class="highlighter-rouge">#calendar</code> div. This is also where you could use <a href="https://fullcalendar.io/docs/mouse/eventClick/"><code class="highlighter-rouge">eventClick(...)</code></a> to change what happens when you click on an event:</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="p">...</span> <span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">response</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="nx">$</span><span class="p">(</span><span class="s1">'#calendar'</span><span class="p">).</span><span class="nx">fullCalendar</span><span class="p">({</span> <span class="na">events</span><span class="p">:</span> <span class="nx">events</span> <span class="p">});</span> <span class="p">});</span> <span class="p">});</span></code></pre></figure> <p>That’s it! 
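One optional tweak: an <code class="highlighter-rouge">eventClick</code> callback can make each meeting open in a new tab instead of navigating away from the calendar page. Here is a minimal sketch, assuming the Event Objects built above; the <code class="highlighter-rouge">makeEventClickHandler</code> helper is hypothetical (not part of FullCalendar), and in FullCalendar v3 returning <code class="highlighter-rouge">false</code> from <code class="highlighter-rouge">eventClick</code> suppresses the default behavior of following the event's <code class="highlighter-rouge">url</code>:

```javascript
// Hypothetical helper that builds an eventClick callback for FullCalendar v3.
// `open` is injected (e.g. window.open) so the click logic can be exercised
// outside a browser.
function makeEventClickHandler(open) {
  return function (calEvent) {
    if (calEvent.url) {
      open(calEvent.url, "_blank"); // open the meeting link in a new tab
      return false; // prevent FullCalendar's default navigation to the url
    }
    // No url on this event: fall through to FullCalendar's default behavior.
  };
}

// Wiring it into the initialization from Step 3 would look like:
//   $('#calendar').fullCalendar({
//     events: events,
//     eventClick: makeEventClickHandler(window.open)
//   });
```

Injecting <code class="highlighter-rouge">window.open</code> rather than calling it directly is just a convenience that keeps the handler testable outside the page.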
We’ll pull all the pieces together in one last block to show all of the code at once, but that should be enough to help you build a basic calendar visualization!</p> <h3 id="pulling-it-all-together">Pulling it all together</h3> <p>Here’s all the code as one block, including all of the HTML to make it a standalone page:</p> <figure class="highlight"><pre><code class="language-html" data-lang="html"><span class="cp">&lt;!DOCTYPE html&gt;</span> <span class="nt">&lt;html&gt;</span> <span class="nt">&lt;head&gt;</span> <span class="c">&lt;!-- JS Dependencies --&gt;</span> <span class="nt">&lt;script </span><span class="na">data-require=</span><span class="s">"jquery@*"</span> <span class="na">data-semver=</span><span class="s">"3.1.1"</span> <span class="na">src=</span><span class="s">"https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="nt">&lt;script </span><span class="na">data-require=</span><span class="s">"moment.js@*"</span> <span class="na">data-semver=</span><span class="s">"2.14.1"</span> <span class="na">src=</span><span class="s">"https://npmcdn.com/moment@2.14.1"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="nt">&lt;script </span><span class="na">src=</span><span class="s">"//cdnjs.cloudflare.com/ajax/libs/fullcalendar/3.3.0/fullcalendar.min.js"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="c">&lt;!-- CSS Styles --&gt;</span> <span class="nt">&lt;link</span> <span class="na">rel=</span><span class="s">"stylesheet"</span> <span class="na">href=</span><span class="s">"//cdnjs.cloudflare.com/ajax/libs/fullcalendar/3.3.0/fullcalendar.min.css"</span> <span class="nt">/&gt;</span> <span class="nt">&lt;/head&gt;</span> <span class="nt">&lt;body&gt;</span> <span class="nt">&lt;div</span> <span class="na">id=</span><span class="s">"calendar"</span><span class="nt">&gt;&lt;/div&gt;</span> <span class="nt">&lt;script </span><span class="na">type=</span><span
class="s">"text/javascript"</span><span class="nt">&gt;</span> <span class="nx">$</span><span class="p">(</span><span class="nb">document</span><span class="p">).</span><span class="nx">ready</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="c1">// Fetch our events</span> <span class="nx">$</span><span class="p">.</span><span class="nx">ajax</span><span class="p">({</span> <span class="na">url</span><span class="p">:</span> <span class="s2">"https://data.oregon.gov/resource/yid5-c4eq.json"</span><span class="p">,</span> <span class="na">method</span><span class="p">:</span> <span class="s2">"GET"</span><span class="p">,</span> <span class="na">datatype</span><span class="p">:</span> <span class="s2">"json"</span><span class="p">,</span> <span class="na">data</span><span class="p">:</span> <span class="p">{</span> <span class="s2">"$where"</span> <span class="p">:</span> <span class="s2">"start_date_time &gt; '"</span> <span class="o">+</span> <span class="nx">moment</span><span class="p">().</span><span class="nx">subtract</span><span class="p">(</span><span class="mi">31</span><span class="p">,</span> <span class="s1">'days'</span><span class="p">).</span><span class="nx">format</span><span class="p">(</span><span class="s2">"YYYY-MM-DDT00:00:00"</span><span class="p">)</span> <span class="o">+</span> <span class="s2">"'"</span><span class="p">,</span> <span class="s2">"city"</span> <span class="p">:</span> <span class="s2">"Portland"</span><span class="p">,</span> <span class="s2">"$order"</span> <span class="p">:</span> <span class="s2">"start_date_time DESC"</span> <span class="p">}</span> <span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">response</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Parse our events into an event object for FullCalendar</span> 
<span class="kd">var</span> <span class="nx">events</span> <span class="o">=</span> <span class="p">[];</span> <span class="nx">$</span><span class="p">.</span><span class="nx">each</span><span class="p">(</span><span class="nx">response</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="nx">e</span><span class="p">)</span> <span class="p">{</span> <span class="nx">events</span><span class="p">.</span><span class="nx">push</span><span class="p">({</span> <span class="na">start</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">start_date_time</span><span class="p">,</span> <span class="na">end</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">end_date_time</span><span class="p">,</span> <span class="na">title</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">meeting_title</span><span class="p">,</span> <span class="na">url</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">web_link</span> <span class="p">});</span> <span class="p">});</span> <span class="nx">$</span><span class="p">(</span><span class="s1">'#calendar'</span><span class="p">).</span><span class="nx">fullCalendar</span><span class="p">({</span> <span class="na">events</span><span class="p">:</span> <span class="nx">events</span> <span class="p">});</span> <span class="p">});</span> <span class="p">});</span> <span class="nt">&lt;/script&gt;</span> <span class="nt">&lt;/body&gt;</span> <span class="nt">&lt;/html&gt;</span></code></pre></figure>chrismetcalfRecently I was helping a customer of ours with an interesting problem: they have a Socrata dataset full of events, in this case public meetings, and they wanted a flexible way of displaying them within a monthly calendar embedded within their website.Conditional 
notifications with Huginn2017-03-29T00:00:00+00:002017-03-29T00:00:00+00:00https://dev.socrata.com/blog/2017/03/29/conditional-notifications-with-huginn<p>As more and more open datasets approach the point where they’re receiving “real time” updates, the topic of how to receive push or <a href="https://en.wikipedia.org/wiki/Webhook">webhook</a> notifications when a dataset is updated and matches certain conditions comes up more and more often. For example, you might want to be notified when there are crimes in your neighborhood, when your local government releases a new dataset, or when the current number of outstanding pot hole requests goes over a certain threshold.</p> <p>Currently (or at least as of when this article is posted), Socrata Publica doesn’t support such notifications, but with a few open source tools, you can create incredibly powerful workflows that alert you based on changes in open data (and do the rest of your bidding)!</p> <p>We’re going to use an open source tool called <a href="https://github.com/cantino/huginn">Huginn</a> to create our custom workflows. If you’re familiar with <a href="https://ifttt.com/">IFTTT</a>, Huginn will seem conceptually similar - it allows you to set up triggers and actions that occur based on them. However, it is <em>far</em> more powerful - Huginn workflows can branch, have conditionals, make API calls, and even execute arbitrary JavaScript.</p> <p>This tutorial will walk you through a simple scenario: My commute each morning takes me across the <a href="https://en.wikipedia.org/wiki/Aurora_Bridge">Aurora Bridge</a> in Seattle, a high-level bridge that is prone to icing when it gets cold enough in the winter.
The City of Seattle has recently published <a href="https://dev.socrata.com/foundry/data.seattle.gov/ivtm-938t">real-time road sensor readings</a> that include road temperature readings.</p> <h3 id="prerequisites">Prerequisites</h3> <p>This tutorial assumes a couple of things:</p> <ol> <li>You have Huginn installed and running somewhere, or you have access to a running instance. You can run it locally on your own hardware, or you can run it in the cloud. I found their <a href="https://github.com/cantino/huginn#heroku">one-click Heroku option</a> to be the quickest, and that’s how I developed this tutorial.</li> <li>You have a <a href="https://www.twilio.com/">Twilio</a> account and you’ve followed their tutorial to set up a phone number for sending texts. You don’t have to use Twilio for notifications - I actually use Slack for most of mine - and Huginn provides agents to push notifications via a number of different mechanisms.</li> </ol> <h2 id="step-0-author-our-soql-query">Step 0: Author our SoQL query</h2> <p>Starting from <a href="https://dev.socrata.com/foundry/data.seattle.gov/ivtm-938t">our source dataset</a>, I want to take the average of the last five road temperature readings and determine if they are below freezing. That will smooth out any momentary drops in the road temperature. 
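</p> <p>Since the interactive TryIt examples on this page are currently disabled, here is a rough sketch of the shape of the chained query we are about to build. Treat it as pseudocode for the request, not a copy-paste recipe; the precise sub-query syntax is described in the SoQL documentation linked below and may differ slightly from this form:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT roadsurfacetemperature
 WHERE stationname = 'AuroraBridge'
 ORDER BY datetime DESC
 LIMIT 5
|&gt; SELECT AVG(roadsurfacetemperature) AS rolling_average
</code></pre></div></div> <p>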
So, I want to:</p> <ul> <li>Start with the API endpoint: <code class="highlighter-rouge">https://data.seattle.gov/resource/ivtm-938t.json</code></li> <li><code class="highlighter-rouge">$where</code> filter to only get the readings for the Aurora Bridge: <code class="highlighter-rouge">stationname = 'AuroraBridge'</code></li> <li><code class="highlighter-rouge">$order</code> the results from latest to oldest: <code class="highlighter-rouge">datetime DESC</code></li> <li><code class="highlighter-rouge">$limit</code> myself to only <code class="highlighter-rouge">5</code> results</li> <li>Use <code class="highlighter-rouge">$select</code> to aggregate the results with an <code class="highlighter-rouge">AVG(roadsurfacetemperature)</code></li> </ul> <p>That last bit is a bit tricky, since it needs to be applied <em>after</em> all the other work is done. Don’t fret, because we have a SoQL feature that helps with that. Using the <a href="/docs/queries/query.html#sub-queries">sub-query functionality of <code class="highlighter-rouge">$query</code></a>, we can chain our aggregation after the rest of our query.</p> <p>The full query looks like the following, and outputs a single value representing the average of the last 5 sensor readings:</p> <div class="tryit-link"> <code>The TryIt macro has been disabled until future notice while we upgrade this site to SODA3.</code> </div> <h2 id="step-1-call-the-api-via-a-website-agent">Step 1: Call the API via a “Website Agent”</h2> <p>Our first step in Huginn will be to create a “<a href="https://github.com/cantino/huginn/wiki/Agent-configuration-examples#websiteagents">Website Agent</a>” to make our API call and turn the result into an event for our workflow:</p> <figure class="figure pull-right"> <a href="/img/posts/2017-03-28 - website agent.png" data-featherlight="image"> <img class="figure-image img-fluid rounded" src="/img/posts/2017-03-28 - website agent.thumb.png" alt="Completed Website Agent" /> <figcaption
class="figure-caption">Completed Website Agent &raquo;</figcaption> </a> </figure> <ol> <li>Within Huginn, select “Agents” and click “New Agent” to start the process of creating a new agent.</li> <li>In the “Type” dropdown, select “Website Agent”</li> <li>Fill out your new agent with the following details. If I don’t say what to fill out, accept the default: <ul> <li><code class="highlighter-rouge">Name</code>: Name your agent. I called mine “Fetch Aurora Bridge rolling average surface temperature”</li> <li><code class="highlighter-rouge">Schedule</code>: Choose how often you want your agent to check the API. I chose “5 min”</li> <li><code class="highlighter-rouge">Sources</code>: Leave blank</li> <li><code class="highlighter-rouge">Propagate Immediately</code>: You can leave this unchecked, but I’m impatient and check it</li> <li><code class="highlighter-rouge">Receivers</code>: Leave blank for now</li> <li><code class="highlighter-rouge">Options</code>: Copy the details from the screenshot. For <code class="highlighter-rouge">url</code>, use the full URL for your query from above.</li> </ul> </li> <li>Click “Save” when you’re done.
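<p>For reference, the <code class="highlighter-rouge">Options</code> blob is a JSON object. A rough sketch of what it contains is below; the keys are standard Website Agent options, the <code class="highlighter-rouge">url</code> value is your full query URL from Step 0 (elided here), and the JSONPath in <code class="highlighter-rouge">extract</code> is illustrative, so adjust it to match your query’s output:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "expected_update_period_in_days": "1",
  "url": "https://data.seattle.gov/resource/ivtm-938t.json?$query=...",
  "type": "json",
  "mode": "on_change",
  "extract": {
    "rolling_average": { "path": "$.[*].rolling_average" }
  }
}
</code></pre></div></div>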
When completed, your agent configuration should look like the screenshot to the right.</li> </ol> <h2 id="step-2-determine-whether-or-not-you-want-to-pass-on-an-alert">Step 2: Determine whether or not you want to pass on an alert</h2> <p>In this next step, we’ll use a “Trigger Agent” to conditionally pass on the events generated by our Website Agent and turn them into alerts to be messaged about.</p> <figure class="figure pull-right"> <a href="/img/posts/2017-03-28 - trigger agent.png" data-featherlight="image"> <img class="figure-image img-fluid rounded" src="/img/posts/2017-03-28 - trigger agent.thumb.png" alt="Completed Trigger Agent" /> <figcaption class="figure-caption">Completed Trigger Agent &raquo;</figcaption> </a> </figure> <ol> <li>Select “Agents” and then “New Agent” to start the process of creating a new agent.</li> <li>In the “Type” dropdown, select “Trigger Agent”.</li> <li>Fill out your new agent with the following details: <ul> <li><code class="highlighter-rouge">Name</code>: Name your agent. I named mine “Is the bridge freezing?”</li> <li><code class="highlighter-rouge">Sources</code>: Select the agent you created in Step 1</li> <li><code class="highlighter-rouge">Propagate Immediately</code>: I’m impatient, so I check this box. If you leave it unchecked, you’ll need to wait for Huginn to pass on your events with each check it does, and it may take several minutes to be notified.</li> <li><code class="highlighter-rouge">Receivers</code>: Leave this blank for now</li> </ul> </li> <li>In the <code class="highlighter-rouge">Options</code> section of your agent configuration, match the details from the screenshot to the right. Most importantly: <ul> <li><code class="highlighter-rouge">type</code>: The type of check to perform. 
We want to see if our value is less than or equal to freezing, so we use <code class="highlighter-rouge">field&lt;=value</code></li> <li><code class="highlighter-rouge">path</code>: The JSON output by your SoQL query and extracted by your Website Agent, in my case <code class="highlighter-rouge">rolling_average</code></li> <li><code class="highlighter-rouge">value</code>: The value to check against, <code class="highlighter-rouge">32</code></li> <li><code class="highlighter-rouge">message</code>: The message we want to format and pass on to the next step. It’s a Liquid-templated string, and we used <code class="highlighter-rouge">Watch out! Temperature has reached {{rolling_average}} degrees and the bridge may be icy!</code></li> </ul> </li> <li>Click “Save” when you’re done and your configuration matches the screenshot to the right.</li> </ol> <h2 id="step-3-send-our-text-message-with-twilio">Step 3: Send our text message with Twilio</h2> <p>This is where the rubber hits the road! We’ll be setting up a “Twilio Agent” to send us a text message via <a href="https://www.twilio.com/">Twilio</a> when the above criteria are met.</p> <div class="alert alert-info "><p><em>Heads Up!</em> Twilio is a paid service, and if you want to send actual text messages, you’ll need to add a credit card to your account.
If you just want to try things out, you can use your <a href="https://www.twilio.com/docs/api/rest/test-credentials">test credentials</a>, but the workflow won’t send actual alerts.</p> </div> <p>Follow the steps below to set up your Twilio Agent:</p> <figure class="figure pull-right"> <a href="/img/posts/2017-03-28 - twilio agent.png" data-featherlight="image"> <img class="figure-image img-fluid rounded" src="/img/posts/2017-03-28 - twilio agent.thumb.png" alt="Completed Twilio Agent" /> <figcaption class="figure-caption">Completed Twilio Agent &raquo;</figcaption> </a> </figure> <ol> <li>Select “Agents” and then “New Agent” to start the process of creating a new agent</li> <li>In the “Type” dropdown, select “Twilio Agent”</li> <li>Fill out your new agent with the following details: <ul> <li><code class="highlighter-rouge">Name</code>: Name your agent. I called mine “Text message me an alert”</li> <li><code class="highlighter-rouge">Sources</code>: Select the agent you created in Step 2</li> <li><code class="highlighter-rouge">Propagate Immediately</code>: You can leave this unchecked, but I’m impatient and check it</li> </ul> </li> <li>Under <code class="highlighter-rouge">Options</code>, make yours look similar to the screenshot, filling in the details below based on your credentials in Twilio: <ul> <li><code class="highlighter-rouge">account_sid</code> and <code class="highlighter-rouge">auth_token</code>: Your account SID and secret auth token from your Twilio account details</li> <li><code class="highlighter-rouge">sender_cell</code>: The phone number Twilio is configured to send from</li> <li><code class="highlighter-rouge">receiver_cell</code>: The cell phone number you want the text message to go to</li> <li><code class="highlighter-rouge">receive_text</code>: Must be set to <code class="highlighter-rouge">true</code> to have the agent send a text message</li> <li><code class="highlighter-rouge">receive_call</code>: Must be set to <code 
class="highlighter-rouge">false</code></li> <li><code class="highlighter-rouge">expected_receive_period_in_days</code>: Set this to however often you expect this agent to receive a “bridge frozen” event from its source. The agent will wait this long before setting a flag to note that it might be broken. I set mine to <code class="highlighter-rouge">180</code>, which might not be long enough in Seattle.</li> </ul> </li> <li>Click “Save” when you’re done. When completed, your agent configuration should look like the screenshot to the right.</li> </ol> <h2 id="step-4-testing">Step 4: Testing!</h2> <p>At this point you have a few options to test things out:</p> <ol> <li>Wait until Seattle drops below freezing. As it is almost April, this may not happen for a while.</li> <li>Adjust the <code class="highlighter-rouge">value</code> in your Trigger Agent to trigger at a much higher temperature.</li> <li>Create an agent that will inject a fake event into the workflow. This is the option we’ll use.</li> </ol> <p>To create an agent that can emit fake events:</p> <figure class="figure pull-right"> <a href="/img/posts/2017-03-28 - manual agent.png" data-featherlight="image"> <img class="figure-image img-fluid rounded" src="/img/posts/2017-03-28 - manual agent.thumb.png" alt="Manual Event Agent" /> <figcaption class="figure-caption">Manual Event Agent &raquo;</figcaption> </a> </figure> <ol> <li>Select “Agents” and then “New Agent” to start the process of creating a new agent</li> <li>In the “Type” dropdown, select “Manual Event Agent”</li> <li>Fill out your new agent with the following details: <ul> <li><code class="highlighter-rouge">Name</code>: Name your agent. I called mine “Let’s pretend it’s freezing!”</li> <li><code class="highlighter-rouge">Receivers</code>: Select the agent you created in Step 2</li> </ul> </li> <li>Click “Save” when you’re done.
When completed, your agent configuration should look like the screenshot to the right.</li> </ol> <p>Next, use your Manual Event Agent to fake an event that makes it look like the temperature dropped below freezing:</p> <figure class="figure pull-right"> <a href="/img/posts/2017-03-28 - emit event.png" data-featherlight="image"> <img class="figure-image img-fluid rounded" src="/img/posts/2017-03-28 - emit event.thumb.png" alt="Emit Manual Event" /> <figcaption class="figure-caption">Emit Manual Event &raquo;</figcaption> </a> </figure> <ol> <li>Click on your Manual Event Agent within the agent listing</li> <li>Within the event payload, click the “+” button to create a single key/value pair: <ul> <li>For the key (on the left), use our variable name: <code class="highlighter-rouge">rolling_average</code></li> <li>For the value, use something below freezing, like <code class="highlighter-rouge">-150</code></li> </ul> </li> <li>Click “Submit” to emit your event into the system.</li> </ol> <p>Your fake event will then propagate through the system, and if everything is configured properly, you’ll get a text message to your device!</p> <p><img src="/img/posts/2017-03-28 - text message.png" alt="It worked!" /></p> <p>By now, hopefully your mind is buzzing with ideas of how you can use Huginn to monitor and alert based on open data! Please let us know what you come up with!</p> <h2 id="by-the-way">By the way…</h2> <p>After sleeping on this post, I actually realized we could skip the “Trigger Agent” by modifying our SoQL query slightly.
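</p> <p>The idea is to push the threshold check into the query itself, so the API only returns a row when the average is at or below freezing. A rough sketch of what that chained query could look like (the exact sub-query syntax may differ slightly from this form):</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT roadsurfacetemperature
 WHERE stationname = 'AuroraBridge'
 ORDER BY datetime DESC
 LIMIT 5
|&gt; SELECT AVG(roadsurfacetemperature) AS rolling_average
   HAVING rolling_average &lt;= 32
</code></pre></div></div> <p>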
By using the <a href="/docs/queries/having.html"><code class="highlighter-rouge">$having</code></a> filter on our aggregation, we can make it only return a temperature value when it’s less than our specified threshold:</p> <div class="tryit-link"> <code>The TryIt macro has been disabled until future notice while we upgrade this site to SODA3.</code> </div> <p>Don’t be surprised if the query above doesn’t output any records when you click on it; that’s the point! It’ll only return a <code class="highlighter-rouge">rolling_average</code> when it’s 32 degrees or less. This would allow us to connect our Website Agent directly to our Twilio Agent, simplifying our workflow! However, it’s also harder to test, and we’d need to format our message either with aggregation in our <code class="highlighter-rouge">SELECT</code> or with an additional “Liquid Output Agent” before the Twilio Agent. That exercise is left up to the reader!</p>chrismetcalfValidate Your Data with FME2017-02-02T00:00:00+00:002017-02-02T00:00:00+00:00https://dev.socrata.com/blog/2017/02/02/validate-your-data-with-fme-<p>This post describes how to use an FME workspace to validate your data and highlight what data cleansing needs to be undertaken before putting it into Socrata. FME is a powerful tool for not only data automation, but data analysis. Here we will use it to do a ‘health check’ on our data to understand what <a href="https://support.socrata.com/hc/en-us/articles/202950008-Import-Warning-and-Errors">errors, warnings</a> and roadblocks it may cause.
To better understand the rules for importing data, check out this <a href="https://knowledge.safe.com/articles/715/working-with-date-and-time-attributes-tutorial.html">support article</a> that discusses importing your data into Socrata.</p> <h2 id="prerequisites">Prerequisites:</h2> <ul> <li>An installed copy of FME Desktop 2016.1 - the latest version can be downloaded <a href="https://www.safe.com/support/support-resources/fme-downloads/">here</a></li> <li>Limited experience with FME; for a warmup, see this previous post: <a href="https://dev.socrata.com/blog/2014/10/09/fme-socrata-writer.html">Using the FME Socrata Writer</a></li> <li>An installed copy of Microsoft Excel</li> <li>Read access to a dataset of your choice that will be ingressed to Socrata</li> <li>Publishing rights to a Socrata domain - you will need this if you plan to publish your cleansed data to Socrata. It is not required for the data cleanse operation, but is a good test of data cleanliness to publish it.</li> </ul> <h2 id="contents">Contents:</h2> <ol> <li>Getting to know data validation rules for Socrata</li> <li>Creating your schema map</li> <li>Running the FME validation workspace</li> <li>Interpreting your results</li> </ol> <h2 id="data-validation-for-socrata">Data Validation for Socrata</h2> <p>Data must be properly formatted for each data type to be ingressed into Socrata. If the data is not formatted, it may not load properly, or in some cases not at all. If you are using FME to publish your data, the workflow will fail if your data is not formatted. Data types are used to tell a common story by allowing each data type to be filtered, displayed and analyzed in its own way.</p> <p>This post and the FME workspace will be focusing on the following data types:</p> <ul> <li>Calendar date</li> <li>Checkbox</li> <li>Number</li> <li>Percent</li> <li>Text</li> </ul> <p>These data types are the most commonly used and have a high potential for error, but are also very easy to fix!
This workspace can be used and reused as a tool to assess a variety of datasets that you plan to ingress to Socrata. The output file from the workspace will guide you to where errors may lie within your dataset and give you an idea of how many, if any, errors need to be fixed. As a user, you will need to know:</p> <ul> <li>What your data should look like</li> <li>If the rule failures are problems with the data</li> <li>If action should be taken to change the data</li> </ul> <h2 id="creating-your-schema-map">Creating Your Schema Map</h2> <p>In order for FME to validate your data against the correct rules, each attribute (or column) must be assigned a data type that Socrata can read. From here each feature (or row) can be broken apart so every cell is matched to the correct set of rules. To do this, we must establish a schema for your dataset; a schema is what tells the program how to read and organize your data. You will be building your schema in Excel using a template file, which will be read by your FME workspace - this is much easier than assigning each attribute a data type within FME.</p> <ol> <li>Open the Schema Map Template in Excel, rename and save your Schema Map.</li> <li>Open the dataset to validate in Excel. <em>Note:</em> you may have to extract your dataset from another source like a database; it is best practice to export as a <code class="highlighter-rouge">.csv</code>.</li> <li> <p>Highlight all attribute names on the sheet and Copy.</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Schema_Map1.png" alt="Schema_Map1" /></p> </li> <li> <p>In your schema map workbook, on the Schema_Map sheet, in cell A2 Paste Special, Transpose. This will create a list of your attributes to validate.</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Schema_Map2.png" alt="Schema_Map2" /></p> </li> <li>Use the drop down menu in cell B2 to select the Socrata data type for the attribute in cell A2. Repeat this for all attributes listed in column A.
Note: as a best practice, ZIP codes and unique ID fields (employee/invoice IDs) should be set to text to avoid errors such as dropped leading zeros (northeastern US ZIP codes) and stripped dashes in ZIP+4 codes.</li> <li>Save your changes.</li> </ol> <h2 id="running-the-fme-validation-workspace">Running the FME Validation Workspace</h2> <p>This workspace is intended for a range of users; it is heavily annotated to inform users how things are working and shows all connections/transformers.</p> <ol> <li> <p>Open the FME validation workspace, rename and save your validation workspace. Note your workspace may look different based on operating system and FME version. These screenshots come from FME 2016.1.3.0 - Build 16709 - WIN64, used in Windows 10 Pro.</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace1.png" alt="Workspace1" /></p> </li> <li> <p>Update Published Parameters by double clicking on each parameter in the Navigator pane</p> <ul> <li> <p><code class="highlighter-rouge">Entity</code>: Your organization/entity name (used only for output file naming purposes)</p> </li> <li> <p><code class="highlighter-rouge">Dataset_Name</code>: Populate this with your dataset, general text or dataset identifier are allowed (used only for output file naming purposes)</p> </li> <li> <p><code class="highlighter-rouge">Dataset_Path</code>: The full file path of the dataset to validate, including extension</p> </li> <li> <p><code class="highlighter-rouge">Output_Folder</code>: Folder where you want the output <code class="highlighter-rouge">.xlsx</code>; you must include the final “/” at the end of the path</p> </li> </ul> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace2.png" alt="Workspace2" /></p> </li> <li> <p>Add your schema map to the workspace by updating the AttributeValueMapper transformer</p> <ol> <li> <p>Open the <code class="highlighter-rouge">AttributeValueMapper</code> dialogue box and click Import</p> <p><img
src="/img/FME_Data_Cleanse/2017-01-31_Workspace3.png" alt="Workspace3" /></p> </li> <li> <p>Select the format of your dataset to validate and its file path, then click next</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace4.png" alt="Workspace4" /></p> </li> <li> <p>Change the Import Mode to Attribute Values. Select the Feature Type by clicking the box next to <code class="highlighter-rouge">Schema_Map</code> to point the Import Wizard to the correct Excel sheet, then click Next</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace5.png" alt="Workspace5" /></p> </li> <li> <p>Select the attributes for the source and destination fields. The “Source Value” should come from <code class="highlighter-rouge">SourceAttributeName</code> and the destination should come from <code class="highlighter-rouge">DestinationDataType</code>. Click “Import”</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace6.png" alt="Workspace6" /></p> </li> <li> <p>FME should have imported your dataset’s attribute names and data types. Click OK to complete the import process.</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace7.png" alt="Workspace7" /></p> </li> </ol> </li> <li> <p>Add inspectors where you want to monitor specific errors.</p> <ul> <li> <p>If there are specific errors you want to track more closely, you can insert <code class="highlighter-rouge">Inspectors</code> on <code class="highlighter-rouge">AttributeValidators</code> or <code class="highlighter-rouge">Testers</code>. If you want to track a specific validation rule that you’re curious about, adding an inspector midway through the workspace will show which specific cells do not pass validation rules. 
<code class="highlighter-rouge">Inspectors</code> can be added by right clicking on an outgoing port (green triangle pointing right) of a transformer and clicking “Connect Inspector.”</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace8.png" alt="Workspace8" /></p> </li> </ul> </li> <li> <p>Run the workspace</p> <ul> <li> <p>Click the green “play” button in the toolbar.</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace9.png" alt="Workspace9" /></p> </li> </ul> </li> <li> <p>Examine your results and assess your data</p> <ul> <li>An <code class="highlighter-rouge">.xlsx</code> will be output to the folder you specified, with a name including your entity, the dataset validated and a timestamp of when the workspace was run.</li> </ul> </li> </ol> <h2 id="interpreting-your-results">Interpreting your results</h2> <p>With the Excel output quantifying potential errors to fix, use your favorite platform to cleanse the data. To better understand each of the validation rules, check out this <a href="https://opendata.socrata.com/dataset/Data-Validation-Failures/9wr8-8fe8">table</a>. If you cannot see the errors in your dataset that the workspace is reporting, open your dataset in the FME Data Inspector. If you are using Excel to view your dataset, it may be automatically formatting the data and hiding the errors.</p> <p>Not all validation failures are errors in the dataset; it is up to the user to decide if these should be ignored or if they should be changed. For example, if negative numbers are found in a dataset, the user needs to decide if they can be allowed or amended.</p> <p>The workspace can be rerun to validate a cleansed dataset as many times as you like.
If you change your schema, you will need to update the schema map and workspace to read any changes (revisit the Creating Your Schema Map or Running the FME Validation Workspace sections of this page).</p> <h3 id="dates">Dates:</h3> <p>Safe has created a quick <a href="https://knowledge.safe.com/articles/715/working-with-date-and-time-attributes-tutorial.html">tutorial</a> to better understand how dates are handled in FME. It’s recommended to learn about how dates are read, especially by the <a href="http://docs.safe.com/fme/2016.1/html/FME_Desktop_Documentation/FME_Transformers/Transformers/dateformatter.htm">DateFormatter</a>. This tutorial will give you enough knowledge to make you dangerous, but some experimentation will help you understand what’s happening under the hood.</p> <p><strong>WARNING:</strong> The DateFormatter coerces your data from one string into another, some of the calculations it does in the process may change your data. Review this <a href="https://opendata.socrata.com/dataset/FME-Date-Formatter-Data-Coercion/7bdf-9qsu">cheat sheet</a> to see how DateFormatter coerces common errors in date fields to properly formatted dates. There may be consequences when you set the Source Date Format parameter to be “Unknown - Automatic Detection.” You can decrease errors by properly populating the Source Date Format parameter, see the DateFormatter help link above. If your date is in a format that has day before month, then you must input an expected format.</p> <h4 id="pro-tips-for-dates">Pro Tips for dates:</h4> <p>To avoid letting DateFormatter create problems, you can format your date to YYYYMMDD before it is read into FME. 
For example, you can use the following expression in SQL to help - if that is how your data is stored.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT CONVERT(VARCHAR(10),getdate(),120) </code></pre></div></div> <p>When choosing a date format to publish on Socrata, the Socrata writer in FME has two different date formats to write. The data type <code class="highlighter-rouge">calendar_date</code> should be used if there is no time zone included in the data. If time zone data is not essential, it is recommended not to include it.</p> <p>Some common date formats converted to FME date formats:</p> <ul> <li><code class="highlighter-rouge">DD.MM.YYYY</code> → <code class="highlighter-rouge">%d.%m.%Y</code></li> <li><code class="highlighter-rouge">MM/DD/YYYY</code> → <code class="highlighter-rouge">%m/%d/%Y</code></li> <li><code class="highlighter-rouge">MM-DD-YY</code> → <code class="highlighter-rouge">%m"-"%d"-"%y</code></li> <li><code class="highlighter-rouge">YYYY-MM-DD['T']HH:mm:ssZ</code> (ISO8601 with timezone) → <code class="highlighter-rouge">%Y-%m-%dT%H:%M:%S%Z</code></li> </ul> <p>Other links to better understand date formatting:</p> <ul> <li><a href="https://msdn.microsoft.com/en-us/library/ms187928.aspx">https://msdn.microsoft.com/en-us/library/ms187928.aspx</a></li> <li><a href="http://socrata.github.io/datasync/resources/using-map-fields-dialog.html">http://socrata.github.io/datasync/resources/using-map-fields-dialog.html</a></li> <li><a href="http://socrata.github.io/datasync/resources/control-config.html#datetime-formatting">http://socrata.github.io/datasync/resources/control-config.html#datetime-formatting</a></li> </ul> <h3 id="leading-zeros">Leading Zeros:</h3> <p>Leading zeros are zeros at the beginning of your data e.g. 0001234. They can cause problems when importing into Socrata as a number. 
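</p> <p>To see why this matters, here is a quick illustration in plain JavaScript (any language behaves similarly): coercing a zero-padded string to a number silently drops the leading zeros, and they can only be restored if you already know the intended width.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Coercing a zero-padded string to a number loses the leading zero.
const asNumber = Number("01234");                    // 1234, not "01234"

// One possible repair, but only if you know the intended width (5 for US ZIPs).
const repaired = String(asNumber).padStart(5, "0");  // "01234"
</code></pre></div></div> <p>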
Not all leading zeros are errors or poor data; for example, some zip codes, vendor IDs, or phone numbers start with a zero that must be preserved. In these cases, you may consider switching the data type to text. For advice on how to clean up your leading zeros, here’s a <a href="https://support.socrata.com/hc/en-us/articles/206108587-Tips-and-Tricks-How-to-handle-leading-zeros-when-publishing-a-dataset">support article</a> to help.</p>cdesistoThis post describes how to use an FME workspace to validate your data and highlight what data cleansing needs to be undertaken before putting it into Socrata. FME is a powerful tool for not only data automation, but data analysis. Here we will use it to do a ‘health check’ on our data to understand what errors, warnings and roadblocks it may cause. To better understand the rules for importing data, check out this support article that discusses importing your data into Socrata.Visualizing data using the Google Calendar Chart2017-01-03T00:00:00+00:002017-01-03T00:00:00+00:00https://dev.socrata.com/blog/2017/01/03/visualizing-data-using-google-calendar-chart<div id="calendar_basic" style="float:center; width:1000px"><!-- This space intentionally left blank --></div> <p>This example shows how to pull data from a Socrata Dataset (in this case, the <a href="https://dev.socrata.com/foundry/data.cityofchicago.org/6zsd-86xi">City of Chicago crime records</a>) with the Google <a href="https://developers.google.com/chart/interactive/docs/gallery/calendar">“Calendar Chart”</a> visualization. As a bonus, we will then embed that chart into a <a href="https://socrata.com/solutions/publica-open-data-cloud/">Socrata Perspectives page</a>.</p> <p>The <a href="https://developers.google.com/chart/">Google Charts library</a> provides a number of different chart types for visualization that can be leveraged using the SODA API.
The “Calendar Chart” is useful when you have incident-level data that you would like to visualize by daily density over the course of a year.</p> <h2 id="prerequisites">Prerequisites</h2> <p>There are a few prerequisites before starting with this example:</p> <ol> <li>Most obviously, you’ll need to work with data in a Socrata dataset containing time series data that can be aggregated at a daily level. If you’re looking for a dataset to work with, we recommend you explore the <a href="https://www.opendatanetwork.com">Open Data Network</a>, where you can find a full catalog of datasets from our awesome customers.</li> <li>You’ll need some basic familiarity with JavaScript before starting. If you’ve never worked with JavaScript before, we recommend <a href="https://www.codecademy.com/learn/javascript">this course from Codecademy</a>.</li> <li>We’ll also be making use of <a href="https://jquery.com/">jQuery</a> to simplify some of our development tasks.</li> </ol> <div class="alert alert-info"><p>Check out all of the different chart types available through the <a href="https://developers.google.com/chart/interactive/docs/gallery">Google Charts library</a>. </p></div> <h2 id="craft-your-soql-query">Craft your SoQL query</h2> <p>The Calendar Chart requires, at a minimum, two fields: a date and a numeric value.
So we’ll use the SoQL <a href="/docs/queries/select.html"><code class="highlighter-rouge">$select</code></a> and <a href="/docs/queries/group.html"><code class="highlighter-rouge">$group</code></a> parameters to aggregate our dataset into daily roll-ups. This results in a SoQL query that looks like the following:</p> <div class="tryit-link"> <code>The TryIt macro has been disabled until future notice while we upgrade this site to SODA3.</code> </div> <p>The results will be aggregated like the following:</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="p">[</span> <span class="p">{</span> <span class="s2">"count"</span><span class="p">:</span> <span class="s2">"762"</span><span class="p">,</span> <span class="s2">"day"</span><span class="p">:</span> <span class="s2">"2016-09-04T00:00:00.000"</span> <span class="p">},</span> <span class="p">{</span> <span class="s2">"count"</span><span class="p">:</span> <span class="s2">"842"</span><span class="p">,</span> <span class="s2">"day"</span><span class="p">:</span> <span class="s2">"2014-07-20T00:00:00.000"</span> <span class="p">},</span> <span class="p">...</span> <span class="p">]</span></code></pre></figure> <h2 id="fetch-data-using-jquery">Fetch data using jQuery</h2> <p>We’ll define a <code class="highlighter-rouge">fetchValues</code> function that uses the <a href="https://api.jquery.com/jquery.get/"><code class="highlighter-rouge">jQuery.get(...)</code></a> function to fetch data from the SODA API, transforms it into an array of JavaScript <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date">Date</a> objects and counts, and returns it for handling:</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">var</span> <span class="nx">fetchValues</span> <span class="o">=</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span
class="k">return</span> <span class="nx">$</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span> <span class="s1">'https://data.cityofchicago.org/resource/6zsd-86xi.json'</span><span class="p">,</span> <span class="p">{</span> <span class="s1">'$select'</span> <span class="p">:</span> <span class="s1">'date_trunc_ymd(date) as day, count(*)'</span><span class="p">,</span> <span class="s1">'$where'</span> <span class="p">:</span> <span class="s2">"date &gt; '2014-01-01'"</span><span class="p">,</span> <span class="s1">'$group'</span> <span class="p">:</span> <span class="s1">'day'</span> <span class="p">}</span> <span class="p">).</span><span class="nx">pipe</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">res</span><span class="p">)</span> <span class="p">{</span> <span class="kd">var</span> <span class="nx">ary</span> <span class="o">=</span> <span class="p">[]</span> <span class="nx">$</span><span class="p">.</span><span class="nx">each</span><span class="p">(</span><span class="nx">res</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="nx">rec</span><span class="p">)</span> <span class="p">{</span> <span class="nx">ary</span><span class="p">.</span><span class="nx">push</span><span class="p">([</span><span class="k">new</span> <span class="nb">Date</span><span class="p">(</span><span class="nx">rec</span><span class="p">.</span><span class="nx">day</span><span class="p">.</span><span class="nx">replace</span><span class="p">(</span><span class="s2">"T00:00:00"</span><span class="p">,</span> <span class="s2">"T12:00:00"</span><span class="p">)),</span> <span class="nb">parseInt</span><span class="p">(</span><span class="nx">rec</span><span class="p">.</span><span class="nx">count</span><span class="p">)]);</span> <span class="p">});</span> <span class="k">return</span> <span 
class="nx">ary</span><span class="p">;</span> <span class="p">});</span> <span class="p">};</span></code></pre></figure> <h2 id="visualize-the-data-with-google-charts">Visualize the data with Google Charts</h2> <p>Once we’ve got our data from the SODA API, we’ll plumb it into the Google Calendar Chart library to visualize the actual data. We do this in our <code class="highlighter-rouge">drawChart</code> function:</p> <ol> <li>First we initialize our <code class="highlighter-rouge">DataTable</code> and add two columns - one for our date and another for our value.</li> <li>Then we initialize our <code class="highlighter-rouge">Calendar</code>, feeding it our target element by ID, <code class="highlighter-rouge">calendar_basic</code>.</li> <li>Finally, we draw our chart, feeding it configuration via our <code class="highlighter-rouge">options</code> object.</li> </ol> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">var</span> <span class="nx">drawChart</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">ary</span><span class="p">)</span> <span class="p">{</span> <span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">google</span><span class="p">.</span><span class="nx">visualization</span><span class="p">.</span><span class="nx">DataTable</span><span class="p">();</span> <span class="nx">data</span><span class="p">.</span><span class="nx">addColumn</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="s1">'date'</span><span class="p">,</span> <span class="na">id</span><span class="p">:</span> <span class="s1">'Date'</span> <span class="p">});</span> <span class="nx">data</span><span class="p">.</span><span class="nx">addColumn</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span 
class="s1">'number'</span><span class="p">,</span> <span class="na">id</span><span class="p">:</span> <span class="s1">'count'</span> <span class="p">});</span> <span class="nx">data</span><span class="p">.</span><span class="nx">addRows</span><span class="p">(</span><span class="nx">ary</span><span class="p">);</span> <span class="kd">var</span> <span class="nx">chart</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">google</span><span class="p">.</span><span class="nx">visualization</span><span class="p">.</span><span class="nx">Calendar</span><span class="p">(</span><span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="s1">'calendar_basic'</span><span class="p">));</span> <span class="kd">var</span> <span class="nx">options</span> <span class="o">=</span> <span class="p">{</span> <span class="na">title</span><span class="p">:</span> <span class="s2">"City of Chicago Police Incidents Over Time"</span><span class="p">,</span> <span class="na">height</span><span class="p">:</span> <span class="mi">500</span><span class="p">,</span> <span class="p">};</span> <span class="nx">chart</span><span class="p">.</span><span class="nx">draw</span><span class="p">(</span><span class="nx">data</span><span class="p">,</span> <span class="nx">options</span><span class="p">);</span> <span class="p">};</span></code></pre></figure> <p>Finally, we tie things all together by having the Google Charts library call our function when it loads:</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nx">google</span><span class="p">.</span><span class="nx">charts</span><span class="p">.</span><span class="nx">setOnLoadCallback</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="nx">fetchValues</span><span class="p">().</span><span class="nx">done</span><span 
class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="p">{</span> <span class="nx">drawChart</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span> <span class="p">});</span> <span class="p">});</span></code></pre></figure> <h2 id="bonus-embed-your-visualization-in-socrata-perspectives">BONUS: Embed your visualization in Socrata Perspectives</h2> <div class="alert alert-info"><p>To get access to a <a href="https://socrata.com/solutions/publica-open-data-cloud/">Socrata Perspectives page</a>, you'll need to work for one of our awesome customers. Maybe your local government is hiring!</p></div> <p>Once you’ve created your visualization, you can use Perspectives’ support for embedded content to embed it into a new story. To do so, first you’ll need to craft a very simple HTML page, like the following, which loads your visualization. Make sure you include in that page the <code class="highlighter-rouge">script</code> tags that load your dependencies, in this case both jQuery and the Google Charts library.</p> <figure class="highlight"><pre><code class="language-html" data-lang="html"><span class="nt">&lt;html&gt;</span> <span class="nt">&lt;head&gt;</span> <span class="nt">&lt;script </span><span class="na">type=</span><span class="s">"text/javascript"</span> <span class="na">src=</span><span class="s">"https://www.gstatic.com/charts/loader.js"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="nt">&lt;script </span><span class="na">type=</span><span class="s">"text/javascript"</span> <span class="na">src=</span><span class="s">"https://www.google.com/jsapi"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="nt">&lt;script </span><span class="na">src=</span><span class="s">"https://code.jquery.com/jquery-3.1.1.min.js"</span> <span class="na">integrity=</span><span
class="s">"sha256-hVVnYaiADRTO2PzUGmuLJr8BLUSjGIZsDYGmIJLv2b8="</span> <span class="na">crossorigin=</span><span class="s">"anonymous"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="nt">&lt;/head&gt;</span> <span class="nt">&lt;body&gt;</span> <span class="nt">&lt;div</span> <span class="na">id=</span><span class="s">"calendar_basic"</span> <span class="na">style=</span><span class="s">"width: 1000px; height: 350px;"</span><span class="nt">&gt;&lt;/div&gt;</span> <span class="nt">&lt;script </span><span class="na">type=</span><span class="s">"text/javascript"</span><span class="nt">&gt;</span> <span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="c1">// Initialize the charting library</span> <span class="nx">google</span><span class="p">.</span><span class="nx">charts</span><span class="p">.</span><span class="nx">load</span><span class="p">(</span><span class="s2">"current"</span><span class="p">,</span> <span class="p">{</span> <span class="na">packages</span><span class="p">:[</span><span class="s2">"calendar"</span><span class="p">]</span> <span class="p">});</span> <span class="kd">var</span> <span class="nx">fetchValues</span> <span class="o">=</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="nx">$</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span> <span class="s1">'https://data.cityofchicago.org/resource/6zsd-86xi.json'</span><span class="p">,</span> <span class="p">{</span> <span class="s1">'$select'</span> <span class="p">:</span> <span class="s1">'date_trunc_ymd(date) as day, count(*)'</span><span class="p">,</span> <span class="s1">'$where'</span> <span class="p">:</span> <span class="s2">"date &gt; '2014-01-01'"</span><span class="p">,</span> <span class="s1">'$group'</span> <span class="p">:</span> <span class="s1">'day'</span> <span class="p">}</span> <span 
class="p">).</span><span class="nx">pipe</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">res</span><span class="p">)</span> <span class="p">{</span> <span class="kd">var</span> <span class="nx">ary</span> <span class="o">=</span> <span class="p">[]</span> <span class="nx">$</span><span class="p">.</span><span class="nx">each</span><span class="p">(</span><span class="nx">res</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="nx">rec</span><span class="p">)</span> <span class="p">{</span> <span class="nx">ary</span><span class="p">.</span><span class="nx">push</span><span class="p">([</span><span class="k">new</span> <span class="nb">Date</span><span class="p">(</span><span class="nx">rec</span><span class="p">.</span><span class="nx">day</span><span class="p">.</span><span class="nx">replace</span><span class="p">(</span><span class="s2">"T00:00:00"</span><span class="p">,</span> <span class="s2">"T12:00:00"</span><span class="p">)),</span> <span class="nb">parseInt</span><span class="p">(</span><span class="nx">rec</span><span class="p">.</span><span class="nx">count</span><span class="p">)]);</span> <span class="p">});</span> <span class="k">return</span> <span class="nx">ary</span><span class="p">;</span> <span class="p">});</span> <span class="p">};</span> <span class="kd">var</span> <span class="nx">drawChart</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">ary</span><span class="p">)</span> <span class="p">{</span> <span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">google</span><span class="p">.</span><span class="nx">visualization</span><span class="p">.</span><span class="nx">DataTable</span><span class="p">();</span> <span class="nx">data</span><span class="p">.</span><span 
class="nx">addColumn</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="s1">'date'</span><span class="p">,</span> <span class="na">id</span><span class="p">:</span> <span class="s1">'Date'</span> <span class="p">});</span> <span class="nx">data</span><span class="p">.</span><span class="nx">addColumn</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="s1">'number'</span><span class="p">,</span> <span class="na">id</span><span class="p">:</span> <span class="s1">'count'</span> <span class="p">});</span> <span class="nx">data</span><span class="p">.</span><span class="nx">addRows</span><span class="p">(</span><span class="nx">ary</span><span class="p">);</span> <span class="kd">var</span> <span class="nx">chart</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">google</span><span class="p">.</span><span class="nx">visualization</span><span class="p">.</span><span class="nx">Calendar</span><span class="p">(</span><span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="s1">'calendar_basic'</span><span class="p">));</span> <span class="kd">var</span> <span class="nx">options</span> <span class="o">=</span> <span class="p">{</span> <span class="na">title</span><span class="p">:</span> <span class="s2">"City of Chicago Police Incidents Over Time"</span><span class="p">,</span> <span class="na">height</span><span class="p">:</span> <span class="mi">500</span><span class="p">,</span> <span class="p">};</span> <span class="nx">chart</span><span class="p">.</span><span class="nx">draw</span><span class="p">(</span><span class="nx">data</span><span class="p">,</span> <span class="nx">options</span><span class="p">);</span> <span class="p">};</span> <span class="nx">google</span><span class="p">.</span><span class="nx">charts</span><span class="p">.</span><span 
class="nx">setOnLoadCallback</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="nx">fetchValues</span><span class="p">().</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="p">{</span> <span class="nx">drawChart</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span> <span class="p">});</span> <span class="p">});</span> <span class="p">})();</span> <span class="nt">&lt;/script&gt;</span> <span class="nt">&lt;/body&gt;</span> <span class="nt">&lt;/html&gt;</span></code></pre></figure> <p>Then, to add it as a content block in your story:</p> <ol> <li>When editing your story, click “Add Content” to bring up the palette, and drag in a new content block.</li> <li>Click “Insert” and then “HTML Embed”.</li> <li>Where it says “Paste or type HTML code”, paste in the entire contents of your HTML snippet and click “Insert”.</li> </ol> <p>That’s it!
Click below to see what this looks like.</p> <iframe src="https://evergreen.data.socrata.com/stories/s/City-of-Chicago-Crimes-2001-Present-Story/d4y4-b8nv/tile" style="width:600px;height:345px;background-color:transparent;overflow:hidden;" scrolling="no" frameborder="0"></iframe> <script type="text/javascript"> (function() { // Initialize the charting library google.charts.load("current", { packages:["calendar"] }); var fetchValues = function() { return $.get( 'https://data.cityofchicago.org/resource/6zsd-86xi.json', { '$select' : 'date_trunc_ymd(date) as day, count(*)', '$where' : "date > '2014-01-01'", '$group' : 'day' } ).pipe(function(res) { var ary = [] $.each(res, function(idx, rec) { ary.push([new Date(rec.day.replace("T00:00:00", "T12:00:00")), parseInt(rec.count)]); }); return ary; }); }; var drawChart = function(ary) { var data = new google.visualization.DataTable(); data.addColumn({ type: 'date', id: 'Date' }); data.addColumn({ type: 'number', id: 'count' }); data.addRows(ary); var chart = new google.visualization.Calendar(document.getElementById('calendar_basic')); var options = { title: "City of Chicago Police Incidents Over Time", height: 500, }; chart.draw(data, options); }; google.charts.setOnLoadCallback(function() { fetchValues().done(function(data) { drawChart(data); }); }); })(); </script>stuaganoScrubbing data with Python2017-01-03T00:00:00+00:002017-01-03T00:00:00+00:00https://dev.socrata.com/blog/2017/01/03/scrubbing-data-with-python<p>There’s an awesome Python package called <a href="https://scrubadub.readthedocs.io/en/stable/">Scrubadub</a> that can help you remove personally identifiable information from text data.
This is a great step to take before publishing a dataset that may contain <a href="https://en.wikipedia.org/wiki/Personally_identifiable_information">PII</a>, in order to prevent inadvertent disclosure.</p> <p>In this example, we’ll clean up some CSV data using Scrubadub, in order to prep it for loading into Socrata:</p> <ol> <li>First we’ll load a local CSV into a dataframe with <a href="https://pypi.python.org/pypi/pandas/0.19.1/#downloads">Pandas</a>,</li> <li>Then we’ll remove names using Scrubadub,</li> <li>And finally write it to a CSV that can be loaded using <a href="https://socrata.github.io/datasync">DataSync</a>.</li> </ol> <h2 id="prerequisites">Prerequisites</h2> <p>Before you start, make sure you have the following installed on your machine:</p> <ol> <li><a href="https://www.python.org/">Python</a></li> <li><a href="https://pypi.python.org/pypi/pandas/0.19.1/#downloads">Pandas</a></li> <li><a href="https://scrubadub.readthedocs.io/en/stable/">Scrubadub</a></li> <li><a href="https://socrata.github.io/datasync/">Socrata DataSync</a></li> </ol> <h2 id="loading-your-csv-with-pandas">Loading your CSV with Pandas</h2> <p>Create a dataframe from your local CSV file with Pandas:</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'~/Dallas_Police_Officer-Involved_Shootings.csv'</span><span class="p">)</span> <span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span></code></pre></figure> <p><img src="/img/with-officer-name.png" alt="With Officer Names" /></p> <h2 id="remove-names-using-scrubadub">Remove names
using Scrubadub</h2> <p>Scrubadub is a simple package that will look for names and other identifying information, like email addresses, SSNs, and phone numbers.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">scrubadub</span> <span class="n">scrub</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">scrubadub</span><span class="o">.</span><span class="n">clean</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s">'utf-8'</span><span class="p">),</span> <span class="n">replace_with</span><span class="o">=</span><span class="s">'identifier'</span><span class="p">)</span> <span class="n">df</span><span class="p">[</span><span class="s">'Officer(s)'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Officer(s)'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">scrub</span><span class="p">)</span></code></pre></figure> <p><img src="/img/without-officer-name.png" alt="Without Officer Names" /></p> <div class="alert alert-warning"><p>Data cleansing is a <em>serious topic</em> and you should always work with your privacy or policy officers within your organization to make sure you are taking the correct steps to protect privacy.</p></div> <h2 id="write-cleansed-data-back-to-csv">Write cleansed data back to CSV</h2> <p>Finally, we’ll write our cleansed records back out to CSV:</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"~/Dallas_Police_Officer-Involved_Shootings.csv"</span><span class="p">,</span> <span 
class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span></code></pre></figure> <p>Once you’re done, the cleaned data file can be used to update a dataset via DataSync. For more information, see its <a href="https://socrata.github.io/datasync/">detailed documentation</a>.</p>stuaganoThere’s an awesome Python package called Scrubadub that can help you remove personally identifiable information from text data. This is a great step to take before publishing a dataset that may contain PII, in order to prevent inadvertent disclosure.
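<p>The load, scrub, and write flow above can be sketched end to end with only the Python standard library. This is a hedged sketch, not the post’s actual pipeline: a trivial placeholder scrubber stands in for <code class="highlighter-rouge">scrubadub.clean(..., replace_with='identifier')</code>, and the column name and rows are made up:</p>

```python
import csv
import io

# Placeholder for scrubadub.clean(text, replace_with='identifier');
# a real pipeline would call the library instead.
def scrub(text):
    return "{{NAME}}"

# Hypothetical input; in practice this would be an open CSV file handle.
src = io.StringIO("Case Number,Officer(s)\n1,Jane Doe\n2,John Roe\n")
out = io.StringIO()

reader = csv.DictReader(src)
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
for row in reader:
    # Scrub only the column that contains names.
    row["Officer(s)"] = scrub(row["Officer(s)"])
    writer.writerow(row)

print(out.getvalue())
```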