Jekyll2025-12-17T01:01:48+00:00https://dev.socrata.com/SODA DevelopersUpdates, examples, and significant changes to http://dev.socrata.comSocrata Developer Programhttps://dev.socrata.comA Move to Main Branch2021-10-13T00:00:00+00:002021-10-13T00:00:00+00:00https://dev.socrata.com/blog/2021/10/13/a-move-to-main-branch<p>About a year ago, GitHub began changing the default branch name for new repositories from <code class="highlighter-rouge">master</code> to <code class="highlighter-rouge">main</code> (<a href="https://github.com/github/renaming">read more</a>). In the spirit of that movement, our public GitHub repositories will also be renamed. This will affect anyone who previously forked or cloned a repository locally. This post describes remediation steps if this applies to you. There will also be notifications on GitHub about this change with additional steps.</p> <p>To move your local repository over to the main branch, please execute the following commands in the terminal:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git branch <span class="nt">-m</span> master main
git fetch origin
git branch <span class="nt">-u</span> origin/main main
git remote set-head origin <span class="nt">-a</span>
</code></pre></div></div> <p>Once this is done, your local <code class="highlighter-rouge">main</code> branch should be tracking <code class="highlighter-rouge">main</code> on origin.</p>Peter MooreAbout a year ago, GitHub began changing the default branch name for new repositories from master to main (read more). In the spirit of that movement, our public GitHub repositories will also be renamed. This will affect anyone who previously forked or cloned a repository locally. This post describes remediation steps if this applies to you. 
There will also be notifications on GitHub about this change with additional steps.Time Series Analysis with Jupyter Notebooks and Socrata2019-10-07T00:00:00+00:002019-10-07T00:00:00+00:00https://dev.socrata.com/blog/2019/10/07/time-series-analysis-with-jupyter-notebooks-and-socrata<h1 id="time-series-analysis-with-jupyter-notebooks-and-socrata">Time Series Analysis with Jupyter Notebooks and Socrata</h1> <p>Time series analysis and time series forecasting are common data analysis tasks that can help organizations with capacity planning, goal setting, and anomaly detection. There are an increasing number of freely available tools that are bringing advanced modeling techniques to people with basic programming skills, techniques that were previously only accessible to those with advanced degrees in statistics. This is particularly significant among our customers – government agencies – where resources are constrained and data-aware employees are at a premium. In this blog post, I would like to show you how you can use just a few of these tools. We will start with a dataset downloaded using the Socrata API and loaded into a data frame in a Python Jupyter notebook. Then we will do some data wrangling to prepare our data for analysis, do some plotting, and finally use the <a href="https://facebook.github.io/prophet/">Prophet library</a> to make a forecast based on our data.</p> <p>A time series is an ordered sequence of observations where each observation is made at some point in time. Time series data occur across many domains. In any domain in which we make measurements over time, we can expect to find time series. Government is no exception. For the purpose of this blog post, we focus on our home city of Seattle. 
Specifically, we will use the City of Seattle’s <a href="https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr">Building Permits dataset</a>.</p> <h3 id="getting-started">Getting Started</h3> <p>In the interest of brevity, this post assumes that you are comfortable writing and executing Python code. Further, it assumes that you have set up a virtual environment, and that you have installed a bunch of dependencies, including Jupyter. Finally, it assumes that you have already downloaded the <a href="https://data.seattle.gov/Permitting/Building-Permits/76t5-zqzr">City of Seattle Building Permits dataset</a> into a <a href="https://pandas.pydata.org/">Pandas</a> <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">DataFrame</a> named <code class="highlighter-rouge">seattle_permits_df</code>. The entire notebook is available for download <a href="https://dev.socrata.com/files/20191007.socrata-time-series-prophet.ipynb">here</a>. If you need help getting your data to this point, you can follow the first two steps in <a href="https://dev.socrata.com/blog/2016/02/01/pandas-and-jupyter-notebook.html">this blog post</a>.</p> <h3 id="exploring-our-data">Exploring Our Data</h3> <p>The first thing you’ll want to do when working with a new dataset is explore it. 
Here are a few ways you might do that:</p> <ul> <li> <p>get a sense of how many rows your dataset contains</p> </li> <li> <p>get a list of the different columns and the types of data that they store</p> </li> <li> <p>plot your data</p> </li> </ul> <p>We’ll do all of the above.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">seattle_permits_df</span><span class="p">))</span> <span class="n">seattle_permits_df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> </code></pre></div></div> <p>The output of the first line tells us that we have just shy of 130,000 rows in our dataset. The <code class="highlighter-rouge">head</code> command prints the first N=5 rows of our dataset. This gives us a sense of what columns exist, and a quick sense of some of the values in the dataset. But there’s an even better way to determine the top values for a particular column – the <code class="highlighter-rouge">value_counts</code> method.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">seattle_permits_df</span><span class="p">[</span><span class="s">"applieddate"</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">dropna</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> </code></pre></div></div> <h3 id="data-wrangling">Data Wrangling</h3> <p>The value counts make it clear that a lot of the values in the “applieddate” column are missing or null. 
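<p>To see why <code class="highlighter-rouge">dropna=False</code> matters here, consider a toy series standing in for the <code class="highlighter-rouge">applieddate</code> column (the values are made up for illustration). By default, <code class="highlighter-rouge">value_counts</code> silently drops nulls; with <code class="highlighter-rouge">dropna=False</code>, they show up as their own bucket.</p>

```python
import pandas as pd

# A toy stand-in for the applieddate column (hypothetical values)
dates = pd.Series(["2019-01-02", None, "2019-01-02", None, None])

print(dates.value_counts().to_dict())    # nulls dropped: {'2019-01-02': 2}
print(dates.value_counts(dropna=False))  # NaN appears as its own bucket, with count 3
```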
There are a variety of ways you can <a href="https://en.wikipedia.org/wiki/Imputation_(statistics)">handle missing data</a>, but removing incomplete rows is the simplest, so it’s what we’ll do here. In the next cell, we’ll remove rows with null dates. We’ll also filter down our dataset to just the columns we’re interested in to reduce the amount of extraneous information in this analysis.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Remove rows where `applieddate` is null</span> <span class="n">seattle_permits_df</span> <span class="o">=</span> <span class="n">seattle_permits_df</span><span class="p">[</span><span class="n">seattle_permits_df</span><span class="p">[</span><span class="s">"applieddate"</span><span class="p">]</span><span class="o">.</span><span class="n">notnull</span><span class="p">()]</span> <span class="c"># Keep only the `applieddate` column and reset the index so it stays sequential</span> <span class="n">seattle_permits_df</span> <span class="o">=</span> <span class="n">seattle_permits_df</span><span class="p">[[</span><span class="s">"applieddate"</span><span class="p">]]</span><span class="o">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c"># Show the first 10 rows</span> <span class="n">seattle_permits_df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> </code></pre></div></div> <p>At this point, each row in our dataset corresponds to a permit application and the only column we’ve preserved is the date of the application. The task of forecasting the number of permit applications is not really interesting (or reliable) at the granularity of a single day. Predicting at the granularity of a week might be interesting, but let’s start by grouping by month. 
To get some date-time functionality from Python, we’ll convert our date column to a <code class="highlighter-rouge">datetime</code> type.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">datetime</span> <span class="c"># Convert applieddate to datetime</span> <span class="n">fixed_dates_df</span> <span class="o">=</span> <span class="n">seattle_permits_df</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span> <span class="n">fixed_dates_df</span><span class="p">[</span><span class="s">"applieddate"</span><span class="p">]</span> <span class="o">=</span> <span class="n">fixed_dates_df</span><span class="p">[</span><span class="s">"applieddate"</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">)</span> <span class="n">fixed_dates_df</span> <span class="o">=</span> <span class="n">fixed_dates_df</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="n">fixed_dates_df</span><span class="p">[</span><span class="s">"applieddate"</span><span class="p">])</span> <span class="c"># Group by month</span> <span class="n">grouped</span> <span class="o">=</span> <span class="n">fixed_dates_df</span><span class="o">.</span><span class="n">resample</span><span class="p">(</span><span class="s">"M"</span><span class="p">)</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="n">data_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s">"count"</span><span class="p">:</span> <span class="n">grouped</span><span class="o">.</span><span class="n">values</span><span class="o">.</span><span class="n">flatten</span><span 
class="p">()},</span> <span class="n">index</span><span class="o">=</span><span class="n">grouped</span><span class="o">.</span><span class="n">index</span><span class="p">)</span> <span class="n">data_df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span> </code></pre></div></div> <h3 id="plotting-our-data">Plotting our Data</h3> <p>Our <code class="highlighter-rouge">data_df</code> <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html">dataframe</a> consists of two columns: a date time index corresponding to each month since the City of Seattle started reporting permit applications, and a count that corresponds to the number of permit applications received during that month. And now we’re ready to plot our time series.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span> <span class="kn">from</span> <span class="nn">pandas.plotting</span> <span class="kn">import</span> <span class="n">register_matplotlib_converters</span> <span class="n">register_matplotlib_converters</span><span class="p">()</span> <span class="n">plt</span><span class="o">.</span><span class="n">style</span><span class="o">.</span><span class="n">use</span><span class="p">(</span><span class="s">"ggplot"</span><span class="p">)</span> <span class="n">data_df</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">"purple"</span><span class="p">)</span> </code></pre></div></div> <p><img src="/img/20191007.ts_plot_1.png" alt="Time Series Plot 1" /></p> <p>Plotting our time series reveals something interesting that would have been hard to notice earlier. 
Notice how the number of applications in 2005 and before looks suspiciously low. This certainly appears to be a data problem. Let’s remove all data from before 2006, since bad data will impact the accuracy of our model. Let’s also remove data from after October of this year, since October is incomplete (at the time of this writing).</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">is_between_2006_and_now</span><span class="p">(</span><span class="n">date</span><span class="p">):</span> <span class="k">return</span> <span class="n">date</span> <span class="o">&gt;</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2006</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="ow">and</span> <span class="n">date</span> <span class="o">&lt;</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2019</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="n">data_df</span> <span class="o">=</span> <span class="n">data_df</span><span class="p">[</span><span class="n">data_df</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">to_series</span><span class="p">()</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">is_between_2006_and_now</span><span class="p">)]</span> <span class="n">data_df</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">color</span><span class="o">=</span><span class="s">"purple"</span><span class="p">)</span> </code></pre></div></div> <p><img src="/img/20191007.ts_plot_2.png" alt="Time 
Series Plot 2" /></p> <p>After removing data, our new plot makes two things pretty clear. Firstly, there are some clear trends in the time series – for example, an increase between 2009 and 2016, followed by a leveling off of permit applications. Secondly, there is a cyclic nature to the time series, which is indicative of there being <a href="https://en.wikipedia.org/wiki/Seasonality">seasonal variation</a> in permit applications.</p> <h3 id="time-series-decomposition">Time Series Decomposition</h3> <p>To better understand the seasonal nature of our data, we can decompose our time series into components. The first step in decomposing our time series is determining whether our underlying stochastic process should be modeled with an additive or multiplicative decomposition. One heuristic here is if the magnitude of the seasonal fluctuations changes significantly over time, then use a multiplicative model. Otherwise, use an additive model. In our case, the magnitude of the seasonal fluctuations appears to be relatively consistent over time. 
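<p>One quick way to sanity-check that heuristic numerically (a toy illustration on synthetic data, not a formal statistical test) is to detrend the series and compare the spread of the seasonal residual in an early window against a late one. A ratio near 1 suggests the seasonal swing is stable, favoring an additive model:</p>

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: a linear trend plus a fixed-amplitude seasonal swing
idx = pd.date_range("2006-01-01", periods=144, freq="MS")
t = np.arange(144)
y = 200 + 2 * t + 30 * np.sin(2 * np.pi * t / 12)
series = pd.Series(y, index=idx)

# Remove the trend with a 12-month centered rolling mean, then compare
# the spread of the seasonal residual in the first and second halves
detrended = series - series.rolling(12, center=True).mean()
early = detrended.iloc[:72].std()
late = detrended.iloc[72:].std()
print(round(float(early / late), 2))  # close to 1.0, so an additive model is reasonable
```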
We can formalize the additive decomposition as follows:</p> <p>\( y_t = S_t + T_t + R_t \)</p> <p>where \( y_t \) is our data (counts of permit applications), \( S_t \) is our seasonal component, \( T_t \) is our trend component, and \( R_t \) is whatever is left over (the remainder).</p> <p>We will use a function in the <a href="https://www.statsmodels.org/stable/index.html">statsmodels</a> module to perform this decomposition for us, but we could compute it ourselves using a technique known as <a href="https://people.duke.edu/~rnau/411diff.htm">differencing</a>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">statsmodels.tsa.seasonal</span> <span class="kn">import</span> <span class="n">seasonal_decompose</span> <span class="n">result</span> <span class="o">=</span> <span class="n">seasonal_decompose</span><span class="p">(</span><span class="n">data_df</span><span class="p">)</span> <span class="n">fig</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span> </code></pre></div></div> <p><img src="/img/20191007.ts_decompose.png" alt="Seasonal Decomposition" /></p> <p>The <code class="highlighter-rouge">seasonal_decompose</code> method generates this handy plot for us. And this plot helps highlight a few interesting things about our data. Firstly, it appears as though there has been an overall growth in permit applications in Seattle since 2009. That growth followed a steep decline in permit applications that appears to have begun at the end of 2007 or early 2008. The housing market across the country was impacted by the <a href="https://en.wikipedia.org/wiki/Subprime_mortgage_crisis">subprime mortgage crisis</a> at this time, and Seattle appears to have been no exception. 
Secondly, we notice that the peak season for permit applications is the late spring, with applications tapering off significantly at the end of the year. (It is generally accepted that the warmer months are the busiest months for construction, and the data seem to reflect this as well.)</p> <p>This decomposition gives us a great overall picture of the data, but we’d like to use the historical data to forecast future building permit applications. We’ll use Prophet to help us do that.</p> <h3 id="how-prophet-works">How Prophet works</h3> <h4 id="the-basics">The basics</h4> <p>Prophet is a module that enables time-series forecasting. The motivations for Prophet’s design decisions are outlined <a href="https://research.fb.com/blog/2017/02/prophet-forecasting-at-scale/">here</a>. Prophet uses an additive decomposable time series model very much like what we showed above:</p> <p>\( y_t = g(t) + s(t) + h(t) + \epsilon_t \)</p> <p>In a Prophet model, there are three main components:</p> <ol> <li>a trend function \( g(t) \)</li> <li>a seasonality function \( s(t) \)</li> <li>a holidays function \( h(t) \)</li> </ol> <p>\( \epsilon_t \) is an error term, but we won’t talk about it in any more depth.</p> <p>The introduction of holidays is one unique aspect of Prophet that makes it both powerful and configurable. Let’s dive into each component to get a better idea of how Prophet works its magic. 
If you’re just interested in the magic, skip ahead to <a href="#prophet-in-action">“Prophet in Action”</a>.</p> <h4 id="trend--gt-">Trend \( g(t) \)</h4> <p>Prophet exposes two options for the trend component: a <a href="https://www.khanacademy.org/science/biology/ecology/population-growth-and-regulation/a/exponential-logistic-growth">logistic growth function</a>, or alternatively, a simpler piecewise linear growth function (both of which are parameterized by a growth rate \( k \)).</p> <p>The trend component incorporates a notion of changepoints – another aspect that makes Prophet unique. The motivation for changepoints is that domain experts in a particular time series will know, in advance, about dates that they expect to impact the trend. You can imagine that if our job is to forecast adoption of a product, we may have advance knowledge about release dates and other important dates that will have an impact on product adoption. Prophet allows us to pass an input vector of real numbers that correspond to the change in the growth rate at those times of interest. We won’t leverage this feature of the model here, but it’s a neat feature that gives domain experts a straightforward way to incorporate prior knowledge into their forecasts.</p> <h4 id="seasonality--st-">Seasonality \( s(t) \)</h4> <p>The seasonality component is modeled using <a href="http://mathworld.wolfram.com/FourierSeries.html">a Fourier series</a>. Fourier series are used to approximate periodic functions as an infinite series of sines and cosines.</p> <p>\( s(t) = \sum_{n=1}^{N} (a_n \cos { \frac { (2\pi nt ) } P } + b_n \sin { \frac { (2\pi nt ) } P }) \)</p> <p>The \( P \) parameter corresponds to the period of our seasonality; in our case, the seasonality is yearly, so \( P = 365 \). The choice of the parameter \( N \) can be thought of as a way of increasing the sensitivity of our seasonality model. 
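<p>To make this concrete, here is a small NumPy sketch (an illustration of the design matrix described above, not Prophet’s internal code) that builds the \( 2N \) Fourier features for a vector of times:</p>

```python
import numpy as np

def fourier_features(t, period=365.25, order=10):
    """Build the 2N-column seasonal design matrix X(t): for each
    n = 1..N, one cosine column and one sine column with period P."""
    t = np.asarray(t, dtype=float)
    cols = []
    for n in range(1, order + 1):
        cols.append(np.cos(2 * np.pi * n * t / period))
        cols.append(np.sin(2 * np.pi * n * t / period))
    return np.column_stack(cols)

# Two years of daily time points with N = 10 gives a (730, 20) matrix
X = fourier_features(np.arange(730), order=10)
print(X.shape)
```

<p>Fitting \( s(t) = \beta X(t) \) then reduces to learning the \( 2N \) coefficients in \( \beta \).</p>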
As we increase \( N \), we allow the model to capture more seasonal changes, but with the potential downside of <a href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a>, potentially decreasing the model’s ability to generalize to future data.</p> <p>In matrix form, assuming \( N \) = 10 (a reasonable default according to the Prophet documentation), we have a seasonality vector that looks as follows:</p> <p>\( X(t) = [\cos { \frac { 2 \pi (1) t } P }, …, \sin { \frac { 2 \pi (10) t } P }] \)</p> <p>\( s(t) = \beta X(t) \)</p> <p>\( \beta \) is a vector of length \( 2N \) of parameters that we’ll learn in the <code class="highlighter-rouge">fit</code> step. More on that below.</p> <h4 id="holidays--ht-">Holidays \( h(t) \)</h4> <p>The last component is the holiday component. If we pass a list of holidays to the model, for each holiday \( i \) we let \( D_i \) be the set of past and future dates for that holiday. Those holidays are incorporated as vectors of indicator functions (i.e., for each time \( t \) in our dataset, there is a 1 for each holiday occurring on that day, and zeroes elsewhere). These vectors should be very sparse.</p> <p>\( h(t) = [1(t \in D_1), …, 1(t \in D_L)] \)</p> <h4 id="calculation">Calculation</h4> <p>Once we’ve encoded our data in a matrix, where each row corresponds to one of the times \( t \) in our dataset, we need to <em>estimate</em> the parameters of our model. Prophet uses the <a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">L-BFGS algorithm</a> to <em>fit</em> the model. This is the learning step in machine learning, but it’s referred to as “fitting” because we’re trying to define the function whose curve best fits the observed data. Typically, we do this by identifying an objective function that we want to optimize.</p> <p>If you’re not familiar with optimization functions, think back to your calculus days, when you found a function’s optima. 
The goal was to find the inputs that produced our function’s minimum or maximum output values. You did this by taking the derivative of the function, setting it equal to zero, and solving for the inputs that satisfied that equation. In this case, the function we’re optimizing is the <a href="https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation">maximum a posteriori</a> objective, which amounts to finding the set of parameters \( \theta \) that are most likely <em>given</em> the observed data.</p> <h3 id="prophet-in-action">Prophet In Action</h3> <p>Now let’s see if we can forecast permit applications for the remainder of 2019 using Prophet. The first step is to train our forecasting model.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fbprophet</span> <span class="kn">import</span> <span class="n">Prophet</span> <span class="n">model</span> <span class="o">=</span> <span class="n">Prophet</span><span class="p">()</span> <span class="n">train_df</span> <span class="o">=</span> <span class="n">data_df</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">"count"</span><span class="p">:</span><span class="s">'y'</span><span class="p">})</span> <span class="n">train_df</span><span class="p">[</span><span class="s">"ds"</span><span class="p">]</span> <span class="o">=</span> <span class="n">train_df</span><span class="o">.</span><span class="n">index</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">train_df</span><span class="p">)</span> </code></pre></div></div> <p>Easy enough. 
Now, let’s try to do some forecasting:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pd</span><span class="o">.</span><span class="n">plotting</span><span class="o">.</span><span class="n">register_matplotlib_converters</span><span class="p">()</span> <span class="c"># We want to forecast over the next 5 months</span> <span class="n">future</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">make_future_dataframe</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s">'M'</span><span class="p">,</span> <span class="n">include_history</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">forecast</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">future</span><span class="p">)</span> <span class="n">model</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">forecast</span><span class="p">)</span> </code></pre></div></div> <p><img src="/img/20191007.prophet_forecast.png" alt="Prophet Forecast" /></p> <p>Neat! Prophet forecasts that we will see a continuation of the downward trend in building permit applications that seems to have begun in 2016.</p> <p>There are a couple of interesting things about what Prophet gives us:</p> <ul> <li>Prophet generates uncertainty intervals for us (also known as <a href="https://en.wikipedia.org/wiki/Confidence_interval">confidence intervals</a>). It looks like there’s a lot of uncertainty in this forecasting model, so we shouldn’t rely too heavily on it.</li> <li>The plotted forecast includes our actual data points, as well as the forecast on the future (for which we don’t yet have any observed data). 
This allows us to see where our actual observed data lie outside of our uncertainty level.</li> </ul> <p>There are a lot of assumptions baked into the defaults that we’re using here. If we were experts on this data, we could go in and experiment with tuning some of these parameters. We might also want to add additional variables to our model to improve our forecasts. For example, we would intuitively expect that there are lots of external variables that would impact the amount of new construction taking place in a city like Seattle. Are companies like Amazon and Google continuing to rapidly grow their presence in the city or are they expanding elsewhere? How is transportation changing the city and how might that impact development in previously underdeveloped neighborhoods?</p> <p>If you’re interested in comparing multiple models after adding variables, you can do so using some additional functions included in the <code class="highlighter-rouge">fbprophet.diagnostics</code> package, such as the <code class="highlighter-rouge">cross_validation</code> and <code class="highlighter-rouge">performance_metrics</code> functions. Take a look at the <a href="https://facebook.github.io/prophet/docs/quick_start.html">Prophet API documentation</a> for more information about these functions – it’s super helpful.</p> <h3 id="takeaways">Takeaways</h3> <p>Thanks to some very powerful open source tools like Prophet, advanced statistical analysis is becoming available to people with only basic statistical and scripting skills. One key thing to remember is that there is a high degree of uncertainty in our forecast. Keeping that uncertainty in mind is incredibly important, particularly in government, where forecasts can have significant impacts on the citizens that our governments serve. 
Analyses like this one have the potential to be valuable tools not only for the Seattle Department of Construction &amp; Inspections – who may be able to make important resourcing decisions based on their expected applications in the coming year – but also for agencies across government with similar concerns.</p>rlvoyerTime Series Analysis with Jupyter Notebooks and SocrataContinual Improvement : CI / CD at Tyler Technologies, Data &amp; Insights Division2019-09-26T00:00:00+00:002019-09-26T00:00:00+00:00https://dev.socrata.com/build/and/deployment/2019/09/26/continual-improvement---ci---cd-at-tyler-technologies--data---insights-division<h1 id="continual-improvement-cicd-at-socraer-tyler-technologies-data--insights-division">Continual Improvement: CI/CD at <s>Socra</s>..er Tyler Technologies, Data &amp; Insights Division</h1> <h2 id="jenkins-in-the-closet">Jenkins in the Closet</h2> <p>Our Engineering organization has a long and storied history with <a href="https://semaphoreci.com/blog/cicd-pipeline">CI / CD</a> in our division. It starts well before I arrived at Socrata in 2014 (during the antediluvian period aka 2013) when it was decided that <em>‘Hey, what we could really use is some regular end-to-end testing’</em>. Contractors were hired and soon this testing was underway at Socrata, a scrappy startup running on gumption and a dream in the heart of Seattle’s Pike Place neighborhood. The testing took the form of a series of automated suites based on the <a href="https://en.wikipedia.org/wiki/Cucumber_(software)">Cucumber</a> framework and a <a href="https://jenkins.io">Jenkins</a> instance running on a server under a desk in the Lead Contractor’s apartment. Things were pretty lean back then.</p> <p>Soon end-to-end testing became a much more important part of the process that we relied upon for shipping our code. We brought the test server into our office and shoved it into an unobtrusive broom closet in the corner. 
Our test infrastructure grew to include a <a href="https://en.wikipedia.org/wiki/Blade_server">blade server</a> for Windows VMs and an Ubuntu 12.04 server that hosted the shiny new Jenkins server and its associated Linux-based testing. Things were better, faster and more reliable, but they were still pretty slapdash.</p> <p>We chugged along this way for most of a year. We shipped software based on those Cucumber test results and almost all was well. But like the flow of time, software and its infrastructure are never constant and it was decided that now was the time to completely overhaul our infrastructure. It was time we cast off from Azure and our physical data-centers where Engineers must change disks and speak soft assurances to temperamental machinery. It’s 2014 now, almost 2015 really. It’s time for <a href="https://aws.amazon.com/what-is-aws/">AWS</a>.</p> <h2 id="jenkins-in-the-cloud">Jenkins in the Cloud</h2> <p>We began a great quest to move the entire business to AWS and Jenkins was in the vanguard of that effort. We began converting Jenkins infrastructure by leveraging the following important elements:</p> <ul> <li><a href="https://www.chef.io/">Chef</a> provisioned host configurations</li> <li><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html">AMI</a> images</li> <li><a href="https://aws.amazon.com/iam/">IAM</a> roles and permissions</li> </ul> <p>First we created a Chef cookbook for our Jenkins server and began provisioning it with all the build toolchains and plugins that existed on the Jenkins in the closet. It was a hodgepodge of Ruby and Scala, Java and Python, Docker and Rails. Then we migrated our jobs, one by one, from our Jenkins in the closet to our Jenkins in the cloud, making sure that each worked.</p> <p>We completed our move to the cloud and wound up with two AWS-based Jenkins servers on <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html">EC2</a> instances. Why two? 
Well, it turns out our growing testing and build processes, put together on a single node, quickly overwhelmed the poor server and revealed a particularly nasty network driver <a href="https://bugs.launchpad.net/cloud-images/+bug/1510315">bug</a> present in AWS <a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html#AvailableInstanceTypes">M4</a> EC2 instances. Under high load, the driver would, on occasion, get overloaded, fall over, and not be able to get back up and running until you rebooted the server. As our test frequency increased, this became more and more of a problem. To avoid it, we split the load between a build server (handling all build-type jobs) and a test server (handling all automated testing jobs).</p> <p>Even with our jobs split between two servers, the load continued to grow and our poor servers soon became overloaded again during peak development times. The servers were so overloaded that they would fail and need to be stopped and restarted in the AWS console several times a week. Yuck! Later, as a temporary solution, we were able to compile a new network driver into the kernel that didn’t have this issue. It was only when M5 EC2 instances (which use a different network card and thus a different driver) became available that we were able to resolve the network load issue permanently.</p> <p>Even after solving this issue, we still needed to scale our Jenkins servers as the daily load continued to grow. We could make them bigger (truly humongous, in fact), but there is an upper bound (in both capacity and cost), and there are also fringe resource collision issues as the parallelism of tests grows on a single host. For example, several of our test suites drive <a href="https://blog.logrocket.com/introduction-to-headless-browser-testing-44b82310b27c/">headless browsers for UI testing</a>.
We found that the driver that runs those browsers starts to show performance issues when it must manage more than a few dozen active connections simultaneously. This drove up our test times during periods of heavy load. In addition, heavily parallelized scenarios revealed some process shutdown issues that would essentially start orphaning driver threads over time, requiring administrative intervention or a reboot.</p> <p>Obviously we couldn’t scale up. Our best option was to scale out. Enter <a href="https://wiki.jenkins.io/display/JENKINS/Distributed+builds">Jenkins Workers</a>, stage left.</p> <h2 id="go-forth-my-children-and-scale">Go Forth My Children And Scale</h2> <p>Jenkins is probably the most pervasive CI / CD solution in the world. Hundreds of thousands of deployments and millions of users make a pretty big community. And since the customers of Jenkins are, by and large, developers, they are able to give back in a virtuous cycle to the Jenkins ecosystem. The Jenkins community is huge and active, and Jenkins encourages this by being highly extensible through the plugin capabilities that the core system provides.</p> <p><a href="https://plugins.jenkins.io/">Plugins</a> are the lifeblood of the Jenkins system. There are thousands of these plugins maintained by community members, and they provide immeasurable value to the community. One of these plugins is the <a href="https://plugins.jenkins.io/ec2">EC2 plugin</a>. This plugin allows Jenkins to interact with AWS to manage worker EC2 nodes and send them work in a dynamic way. This enables Jenkins to scale out dynamically and provide nearly limitless execution capability. The limiting factor when using this plugin really comes down to a question of budget.</p> <p>So we decided to solve our scalability problems by converting our jobs to run on workers. Easy Peasy.
Well, not exactly.</p> <p>First, we created a Chef recipe for the workers which provisioned them with each of the toolchains and software tools we use for building and testing our projects. A partial list includes:</p> <ul> <li><a href="https://www.scala-lang.org/">Scala</a>/<a href="https://www.scala-sbt.org/">SBT</a></li> <li><a href="https://aws.amazon.com/cli/">AWSCLI</a></li> <li><a href="https://opensource.com/resources/what-docker">Docker</a></li> <li><a href="https://elixir-lang.org/">Elixir</a></li> <li><a href="https://golang.org/">Go</a></li> <li><a href="https://openjdk.java.net/">Java</a></li> <li><a href="https://nodejs.org/en/">Node</a></li> <li><a href="https://www.packer.io/">Packer</a></li> <li><a href="https://www.postgresql.org/">Postgres</a></li> <li><a href="https://www.python.org/">Python</a></li> <li><a href="https://www.r-project.org/">R</a></li> <li><a href="https://www.ruby-lang.org/en/">Ruby</a></li> <li><a href="https://www.rust-lang.org/">Rust</a></li> <li><a href="https://www.seleniumhq.org/">Selenium</a></li> </ul> <p>With this Chef recipe created, we were able to use the Packer tool to build an AWS image (AMI) from this cookbook. Now we have a provisioned worker ready to work, right? Not yet. We still need to give this worker permissions to interact with the systems we use internally as we build our projects.
To do that, we configured the worker to interact with (among others) the following third-party services:</p> <ul> <li><a href="https://www.atlassian.com/git/tutorials/what-is-git">Git</a>/<a href="https://techcrunch.com/2012/07/14/what-exactly-is-github-anyway/">GitHub</a></li> <li><a href="https://aws.amazon.com/ecr/">AWS Elastic Container Registry</a></li> <li><a href="https://aws.amazon.com/s3/">AWS S3 Buckets</a></li> <li><a href="https://devops.stackexchange.com/questions/1898/what-is-an-artifactory">Artifactory Repositories</a></li> </ul> <p>Complicating the worker’s interaction with these services is the requirement that we not store credentials on the AMI at rest. Our solution for credential management on the Jenkins server node isn’t available to the built AMIs, so we ended up using a homegrown solution to interact with the encrypted AWS KMS secure storage service from anywhere. We can use this KMS wrapper to pull files to the worker at boot time from within the AWS cloud. We leveraged the boot script capability of the worker configuration and made it pull down and apply the security credentials and configuration files at boot time. These workers are short-lived, so the credentials are effectively ephemeral. Problem solved.</p> <p>So, after adequately provisioning our worker AMI, we now have two recipes – one for the server and one for the workers – that we will use to scale Jenkins.</p> <p>Now, after verifying the workers with some testing, it was time to start migrating our build jobs to the workers. This was a voyage of discovery. We found significant configuration requirements that weren’t documented when we initially created the worker. But we used each discovery to formally capture these requirements and place them in the worker recipe such that it handles the requirements of each job and is repeatable. This migration process also helped us realize when we were solving a problem more than once or in diverse ways.
Going through this process allowed us to streamline the jobs to be more homogeneous. By forcing our jobs to work on disposable hosts, we encourage job owners to make their jobs self-sufficient. It was a long process taking months (in fact, it continues to this day), but we pretty quickly began to see the benefits of this work. By tackling our most resource-intensive jobs first, we were able to see immediate impacts on the server; loads dropped, and server-memory-related test failures became almost non-existent. Tests run in relative isolation no longer needed to contend with other jobs for shared local resources. The server was no longer the bottleneck. Here are some graphs that show the effects. See if you can spot when the workers really got going.</p> <h4 id="system-load-averages-january-2019---september-2019"><em>System Load Averages (January 2019 - September 2019)</em></h4> <p><img src="/img/2019-09-jenkins-build-load.png" alt="System Load Averages (January 2019 - September 2019)" height="100%" width="100%" /></p> <h4 id="system-memory-utilization-averages-january-2019---september-2019"><em>System Memory Utilization Averages (January 2019 - September 2019)</em></h4> <p><img src="/img/2019-09-jenkins-build-memory.png" alt="System Memory Utilization Averages (January 2019 - September 2019)" height="100%" width="100%" /></p> <p>After a few months of effort, most of the jobs we considered “heavy hitters” had been moved to workers. But we had this other server – the test server. What should we do about that?</p> <p>Well, again, we made sure that the worker was provisioned with all the tools it needed for the tests to run on it properly. Then we began moving jobs from the test server directly to workers. No additional load hit the server, and our worker utilization continued to grow.
Soon, we were in a position where we could turn off the Jenkins test server and truly have only one Jenkins server running all builds, deploys, and tests.</p> <p>These days, the build server is becoming a job scheduler. Through configuration of the EC2 plugin, we can define pretty robust behavior for our workers. We can set how many workers are allowed, how long they live, what types of EC2 nodes to use, how many jobs they can execute simultaneously and which jobs, etc…</p> <p>We anticipate that in the next few months, as we finish migrating all jobs to workers, we will be able to start reducing the EC2 host size of the server itself (it’s currently a C5.4xlarge, costing us quite a bit each month). We will start by cutting its size in half, profiling the newly sized node, and then progressing to smaller nodes as seems reasonable. The hope is that it will truly become just a job scheduler. Job execution will live on the scalable, disposable workers. We aren’t done yet with this work (lots of jobs still to convert), but we have a much happier development team and a much relieved ops team. Achievement unlocked!</p>JoeNunnelleyContinual Improvement: CI/CD at Socra..er Tyler Technologies, Data &amp; Insights DivisionWelcome (back) to our blog!2019-08-14T00:00:00+00:002019-08-14T00:00:00+00:00https://dev.socrata.com/blog/2019/08/14/welcome--back--to-our-blog-<p>Hey there! I’m Helena — I joined Tyler Technologies, Data and Insights Division (fka Socrata) about a year and a half ago. I’m a software engineer on the Performance team (as in the <a href="https://www.tylertech.com/products/socrata/performance-optimization">Socrata Performance Optimization</a> product, not page performance). Over my tenure, I’ve been continuously impressed by all the cool things my peers do — both my coworkers and the customers that build on top of our products. I’ve also been continuously surprised that we don’t showcase our technical work anywhere.
Since one of our core values is “Celebrate success together”, I’m here to do just that!</p> <h2 id="introducing-devsocratacomblog">Introducing dev.socrata.com/blog</h2> <p>After a nearly 2-year hiatus, I’d like to re-introduce <a href="https://dev.socrata.com/blog/">dev.socrata.com/blog</a>! 🎉 As part of our re-launch, we’re also doing a little bit of re-focusing. In the past, this blog has been primarily a place for Socrata Open Data API announcements and technical how-to’s. Going forward, we’ll also be mixing in some behind-the-scenes cuts — how the team behind Socrata’s products builds the good stuff (and how we learn from the bad stuff).</p> <p>Here are some topics you can look forward to in the coming months:</p> <ul> <li> <p><strong>Time Series Analysis with Jupyter Notebooks and Socrata</strong><br /> <em>Robert Voyer (Software Engineering Manager)</em></p> <p>Learn how to download the Seattle Building Permits dataset from the Socrata API, and do a time series analysis using open source data science tools in Python.<br /></p> </li> <li> <p><strong><em>Informatics</em> and Dogfood</strong><br /> <em>Andrew Deming (Software Support Team Lead) and Ryan Hall (Data Analyst)</em></p> <p>Informatics is how our employees use our own data and our own product on a daily basis. A grassroots initiative from the start, Informatics had several auxiliary goals, like onboarding new staff with the product and securely sharing data with internal and external stakeholders.<br /></p> </li> <li> <p><strong>Jenkins Workers</strong><br /> <em>Joe Nunnelley (Senior Automation Engineer)</em></p> <p>Tyler’s Data and Insights Division uses Jenkins to execute automated jobs that support testing, builds, and deploys. 
Learn how we introduced CI/CD to our engineering process and how we started using Jenkins Workers to make this infrastructure more dynamic.</p> </li> </ul> <h2 id="up-next">Up next</h2> <p>I’ll be facilitating this blog going forward, which means I’ll be doing the wrangling, but not the writing. Expect to hear from a variety of people and roles about all of the awesome work they do.</p> <hr /> <p>PS: We’re hiring! If you’re interested in learning more about our work, then check out our <a href="https://app.jobvite.com/j?bj=or8b4fwy&amp;s=devblog">jobs page</a>.</p>helenaswHey there! I’m Helena — I joined Tyler Technologies, Data and Insights Division (fka Socrata) about a year and a half ago. I’m a software engineer on the Performance team (as in the Socrata Performance Optimization product, not page performance). Over my tenure, I’ve been continuously impressed by all the cool things my peers do — both my coworkers and the customers that build on top of our products. I’ve also been continuously surprised that we don’t showcase our technical work anywhere. Since one of our core values is “Celebrate success together”, I’m here to do just that!Elixir in production, an open data tale2017-08-28T00:00:00+00:002017-08-28T00:00:00+00:00https://dev.socrata.com/blog/2017/08/28/elixir-in-production<h1 id="elixir-in-production-an-open-data-tale">Elixir in production, an open data tale</h1> <h2 id="the-problem">The problem</h2> <p>The job of the Data Pipeline team is to build and maintain software to seamlessly get data into our platform.</p> <p>Socrata has had, for a long time, a wizard which would parse certain files, provide a few formatting options, and allow the user to import the data into the Socrata platform.</p> <p>This process had a few major issues. It was an all or nothing deal - either your data had issues and it would fail to import, or it was perfectly clean. 
It also didn’t provide any feedback on what the status of anything was; you just saw an indefinite spinner until you didn’t anymore. To make matters worse, if there was an issue, you often got an error message that didn’t tell you how to fix your source data.</p> <p>When we set out to make that experience better we had one major goal: be transparent about what’s happening with the data. Every step has information that is potentially actionable, and we should surface that information as soon as we know it. We wanted to front-load all of the actionable information, so the user can upload their file, make their changes and walk away before the file is even done uploading. We also want to provide a quick retry cycle if the user uploads something and realizes it’s wrong. This allows them to go back to the data owner or source and fix it quickly.</p> <p><img src="/img/posts/2017-08-28 - upload-preview.gif" alt="Uploading a file" width="800px" /></p> <p><em>Uploading this 10gb/28 million row file gives you a preview and the ability to start interacting and transforming the data before it is uploaded</em></p> <p>We also had an internal goal, which was to run our service(s) sustainably with a relatively small team (about 4 backend engineers, who also have other jobs to do). Our engineering team had been adopting a microservices model, which, despite all the Medium thinkpieces extolling the virtues, had failed to deliver us to engineering nirvana as we had hoped. With a small number of human engineers, and a large number of services, context switching between them became challenging. 
Moreover, due to the small size of our engineering organization, we had no dedicated team working on tooling, which led to duplicated effort across teams who were all chartered to deliver customer value, not engineering value.</p> <h2 id="what-we-built">What we built</h2> <p>Given the UX problems, the engineering problems, and the goal, we settled on Elixir and Phoenix as the tools to make this thing work. There are plenty of other posts that describe why Elixir is interesting, but in short, it was the only tool that would allow us to accomplish the real time feedback we wanted in a single package. Elixir and Erlang* also provide primitives for building and running distributed systems that can’t be beat (at the moment), and given that we were going to be doing computation across the whole cluster in parallel, it seemed like the right tool for the job.</p> <p>The core of the data pipeline service is really an interpreter, which interprets the same language used for querying data, called SoQL (Socrata Query Language). SoQL looks a lot like SQL, but it’s simplified for the use case we see a lot at Socrata. We also needed an API around said interpreter, for accepting data and allowing the user to manipulate it, which is where Phoenix came in.</p> <p>In our data pipeline service, we implement a different set of functions that are required to transform data. 
An example would be the following:</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">geocode</span><span class="p">(</span><span class="n">address</span><span class="p">,</span> <span class="n">city</span><span class="p">,</span> <span class="k">state</span><span class="p">,</span> <span class="n">zip</span><span class="p">)</span> </code></pre></div></div> <p>This, given columns in your source file called <code class="highlighter-rouge">address</code>, <code class="highlighter-rouge">city</code>, <code class="highlighter-rouge">state</code>, and <code class="highlighter-rouge">zip</code>, will geocode the values and make a new column, which can then be imported alongside the rest of your data.</p> <p>Obviously faster is better, so all the execution happens with as much parallelism as we can get out of the cluster. This is where Elixir really shines. Coordinating all that state across the cluster would have been tricky, but in Elixir, it’s trivial to assign work to different nodes in the cluster. With a lot of parallelism, we can do slow transforms that may do IO to other services (like geocoding) and still get reasonable performance. It also gives us the ability to meet whatever service level we want by scaling the cluster up or down.</p> <p>For a 28-million-row dataset, running a simple string concatenation expression takes an amount of time roughly inversely proportional to the cluster size:</p> <table> <thead> <tr> <th>Cluster Size</th> <th>Time spent evaluating</th> </tr> </thead> <tbody> <tr> <td>1 node</td> <td>39.871s</td> </tr> <tr> <td>3 nodes</td> <td>17.539s</td> </tr> <tr> <td>5 nodes</td> <td>11.953s</td> </tr> </tbody> </table> <h2 id="results">Results</h2> <p>We ended up with a system that handles the workloads we wanted with minimal drama. We’ve been running the system in production for several months now, and haven’t had issues that were related to our tools, which is about as much as you can ask for.
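For context, those timings work out to a roughly 2.3x speedup on 3 nodes and 3.3x on 5 nodes, so scaling is strong but sub-linear (coordination and IO keep it below perfect). A quick sketch of that arithmetic (JavaScript here purely for illustration; the numbers are the ones from the table):

```javascript
// Parallel speedup from the evaluation times in the table:
// speedup = single-node time / cluster time.
function speedup(baselineSeconds, clusterSeconds) {
  return baselineSeconds / clusterSeconds;
}

var s3 = speedup(39.871, 17.539); // ~2.27x on 3 nodes
var s5 = speedup(39.871, 11.953); // ~3.34x on 5 nodes

// Efficiency = speedup / node count; overhead keeps it below 1.0.
var e5 = s5 / 5; // ~0.67
```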
One of the most impressive aspects of Elixir (and Erlang) is the tooling for analyzing a running system. We’ve shipped plenty of bugs out into production, but a combination of the Erlang Observer, the remote IEx REPL, distributed tracing, and debugging has allowed us to track them down quickly. These tools are indispensable, and once you have them, it’s exceedingly difficult to go back to a world without them.</p> <p>Elixir as a language and Erlang as a platform have their pros and cons. Elixir is an extremely simple language, and our team was able to ramp up on it quickly. The tooling in the Elixir ecosystem is simple, well documented, and fits together well. Coming from a language like Java or Ruby, there will be some struggling to understand the Erlang/OTP programming model, but ultimately it simplifies fault tolerance, concurrency, and distribution into a small set of primitives which can be composed to make a reliable system. The language and VM are no silver bullet for reliability, but they encourage the developer to think about the common problems in building a distributed application.</p> <p>One issue we ran into was that our team was used to the static typing provided by Scala, and leaving that behind has required some adjustment, for some more than others. This might be a non-starter for some teams, but may not be a big deal to others. It undeniably makes refactoring more difficult and requires that we have a more thorough test suite, which has a high overhead. We experimented with Dialyzer, but found that it was too noisy to be usable.</p> <p>Ultimately though, we’ve accomplished the goals we set out to accomplish, and more importantly we’re working at a pace which is sustainable. The most positive thing to say about Elixir and the tooling is that we don’t really think about it much.
The amount of time we spend talking about, thinking about, and wrestling with tooling (coming from a world of microservices) has been seriously reduced, which leaves much more time to focus on what actually matters, which is building the product that our users use every day.</p> <p>*Elixir is a language which compiles to Erlang AST and runs on the Erlang Virtual Machine, BEAM. Elixir has an identical programming model to Erlang, but with a different syntax, standard library, and tooling.</p>rozapElixir in production, an open data taleCreating a monthly calendar with FullCalendar.io2017-03-30T00:00:00+00:002017-03-30T00:00:00+00:00https://dev.socrata.com/blog/2017/03/30/creating-a-monthly-calendar-with-fullcalendar-io<p>Recently I was helping a customer of ours with an interesting problem: they have a Socrata dataset full of events, in this case <a href="https://data.oregon.gov/dataset/Oregon-Public-Meetings/gs36-7t8m">public meetings</a>, and they wanted a flexible way of displaying them within a monthly calendar embedded within their website.</p> <p>A colleague of mine recommended the MIT-licensed <a href="https://fullcalendar.io/">FullCalendar</a> project, and it worked out wonderfully. This example will demonstrate how you can combine the power and flexibility of Socrata’s APIs with open source software, and quickly build out a monthly calendar visualization for your dataset that looks like the one below:</p> <div id="calendar"></div> <h3 id="prerequisites">Prerequisites</h3> <p>There are a couple of prerequisites for this example:</p> <ol> <li><a href="https://jquery.com/">jQuery</a> - An insanely popular JavaScript framework that FullCalendar requires to work.
You’re probably already using it even if you don’t know it.</li> <li><a href="https://momentjs.com/">Moment.js</a> - A great JavaScript library for parsing and manipulating dates.</li> <li><a href="https://fullcalendar.io/">FullCalendar</a> - The actual FullCalendar library.</li> </ol> <p>I recommend following the <a href="https://fullcalendar.io/docs/usage/">FullCalendar “Basic Usage” doc</a> to start off. All three libraries must be loaded, in that order, before your code can run.</p> <h2 id="step-0-create-your-soql-query">Step 0: Create your SoQL query</h2> <p>Starting from the <a href="https://dev.socrata.com/foundry/data.oregon.gov/yid5-c4eq">API docs for our source dataset</a>, we’re going to craft a SoQL query that does the following:</p> <ul> <li>Uses a <code class="highlighter-rouge">$where</code> clause to pull the last 31 days of events, so we can always see all of the current month’s events</li> <li>Filters to return only events for <code class="highlighter-rouge">Portland</code></li> <li>Uses <code class="highlighter-rouge">$order</code> to sort them by date</li> </ul> <p>The full query will look like the following, but we’ll need to fill in the correct bounding date later on:</p> <div class="tryit-link"> <code>The TryIt macro has been disabled until further notice while we upgrade this site to SODA3.</code> </div> <h2 id="step-1-query-our-api-for-events">Step 1: Query our API for events</h2> <p>In this step, we’ll use jQuery’s <a href="https://api.jquery.com/jquery.ajax/#jQuery-ajax-settings"><code class="highlighter-rouge">$.ajax(...)</code></a> utility function to fetch our records from the API.</p> <p>We’ll pass in the <code class="highlighter-rouge">url</code> of our API endpoint, a <code class="highlighter-rouge">method</code> of <code class="highlighter-rouge">GET</code>, and a <code class="highlighter-rouge">datatype</code> of <code class="highlighter-rouge">json</code>.
For our <code class="highlighter-rouge">data</code>, we can use the broken out parameter pairs of our SoQL query. We also use Moment.js’s <a href="https://momentjs.com/docs/#/manipulating/subtract/"><code class="highlighter-rouge">subtract(...)</code></a> and <a href="https://momentjs.com/docs/#/displaying/format/"><code class="highlighter-rouge">format(...)</code></a> functions to generate a date string for 31 days ago.</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nx">$</span><span class="p">(</span><span class="nb">document</span><span class="p">).</span><span class="nx">ready</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="c1">// Fetch our events</span> <span class="nx">$</span><span class="p">.</span><span class="nx">ajax</span><span class="p">({</span> <span class="na">url</span><span class="p">:</span> <span class="s2">"https://data.oregon.gov/resource/yid5-c4eq.json"</span><span class="p">,</span> <span class="na">method</span><span class="p">:</span> <span class="s2">"GET"</span><span class="p">,</span> <span class="na">datatype</span><span class="p">:</span> <span class="s2">"json"</span><span class="p">,</span> <span class="na">data</span><span class="p">:</span> <span class="p">{</span> <span class="s2">"$where"</span> <span class="p">:</span> <span class="s2">"start_date_time &gt; '"</span> <span class="o">+</span> <span class="nx">moment</span><span class="p">().</span><span class="nx">subtract</span><span class="p">(</span><span class="mi">31</span><span class="p">,</span> <span class="s1">'days'</span><span class="p">).</span><span class="nx">format</span><span class="p">(</span><span class="s2">"YYYY-MM-DDT00:00:00"</span><span class="p">)</span> <span class="o">+</span> <span class="s2">"'"</span><span class="p">,</span> <span class="s2">"city"</span> <span class="p">:</span> <span 
class="s2">"Portland"</span><span class="p">,</span> <span class="s2">"$order"</span> <span class="p">:</span> <span class="s2">"start_date_time DESC"</span> <span class="p">}</span> <span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">response</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// TODO: Handle our response</span> <span class="p">});</span> <span class="p">});</span></code></pre></figure> <h2 id="step-2-handle-our-response-and-create-event-objects">Step 2: Handle our response and create Event Objects</h2> <p>Next we’ll take each of the events in the response from our API call, and create FullCalendar <a href="https://fullcalendar.io/docs/event_data/Event_Object/">Event Object</a>s for each of them. At a minimum, we’ll need start and end dates for them, as well as a title. If we have a URL, that will make the event clickable.</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="p">...</span> <span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">response</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Parse our events into an event object for FullCalendar</span> <span class="kd">var</span> <span class="nx">events</span> <span class="o">=</span> <span class="p">[];</span> <span class="nx">$</span><span class="p">.</span><span class="nx">each</span><span class="p">(</span><span class="nx">response</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="nx">e</span><span class="p">)</span> <span class="p">{</span> <span class="nx">events</span><span class="p">.</span><span class="nx">push</span><span class="p">({</span> <span class="na">start</span><span 
class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">start_date_time</span><span class="p">,</span> <span class="na">end</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">end_date_time</span><span class="p">,</span> <span class="na">title</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">meeting_title</span><span class="p">,</span> <span class="na">url</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">web_link</span> <span class="p">});</span> <span class="p">});</span> <span class="c1">// TODO: Initialize calendar</span> <span class="p">});</span> <span class="p">});</span></code></pre></figure> <h3 id="step-3-initialize-our-calendar">Step 3: Initialize our Calendar</h3> <p>This is the simplest part. We pass in our new collection of events to the FullCalendar initialization function, targeting the <code class="highlighter-rouge">#calendar</code> div. This is also where you could use <a href="https://fullcalendar.io/docs/mouse/eventClick/"><code class="highlighter-rouge">eventClick(...)</code></a> to change what happens when you click on an event:</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="p">...</span> <span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">response</span><span class="p">)</span> <span class="p">{</span> <span class="p">...</span> <span class="nx">$</span><span class="p">(</span><span class="s1">'#calendar'</span><span class="p">).</span><span class="nx">fullCalendar</span><span class="p">({</span> <span class="na">events</span><span class="p">:</span> <span class="nx">events</span> <span class="p">});</span> <span class="p">});</span> <span class="p">});</span></code></pre></figure> <p>That’s it! 
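One optional tweak: an <code class="highlighter-rouge">eventClick</code> callback can make each meeting open in a new tab instead of navigating away from the calendar page. Here is a minimal sketch, assuming the Event Objects built above; the <code class="highlighter-rouge">makeEventClickHandler</code> helper is hypothetical (not part of FullCalendar), and in FullCalendar v3 returning <code class="highlighter-rouge">false</code> from <code class="highlighter-rouge">eventClick</code> suppresses the default behavior of following the event's <code class="highlighter-rouge">url</code>:

```javascript
// Hypothetical helper that builds an eventClick callback for FullCalendar v3.
// `open` is injected (e.g. window.open) so the click logic can be exercised
// outside a browser.
function makeEventClickHandler(open) {
  return function (calEvent) {
    if (calEvent.url) {
      open(calEvent.url, "_blank"); // open the meeting link in a new tab
      return false; // prevent FullCalendar's default navigation to the url
    }
    // No url on this event: fall through to FullCalendar's default behavior.
  };
}

// Wiring it into the initialization from Step 3 would look like:
//   $('#calendar').fullCalendar({
//     events: events,
//     eventClick: makeEventClickHandler(window.open)
//   });
```

Injecting <code class="highlighter-rouge">window.open</code> rather than calling it directly is just a convenience that keeps the handler testable outside the page.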
We’ll pull all the pieces together in one last block to show all of the code at once, but that should be enough to help you build a basic calendar visualization!</p> <h3 id="pulling-it-all-together">Pulling it all together</h3> <p>Here’s all the code as one block, including all of the HTML to make it a standalone page:</p> <figure class="highlight"><pre><code class="language-html" data-lang="html"><span class="cp">&lt;!DOCTYPE html&gt;</span> <span class="nt">&lt;html&gt;</span> <span class="nt">&lt;head&gt;</span> <span class="c">&lt;!-- JS Dependencies --&gt;</span> <span class="nt">&lt;script </span><span class="na">data-require=</span><span class="s">"jquery@*"</span> <span class="na">data-semver=</span><span class="s">"3.1.1"</span> <span class="na">src=</span><span class="s">"https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="nt">&lt;script </span><span class="na">data-require=</span><span class="s">"moment.js@*"</span> <span class="na">data-semver=</span><span class="s">"2.14.1"</span> <span class="na">src=</span><span class="s">"https://npmcdn.com/moment@2.14.1"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="nt">&lt;script </span><span class="na">src=</span><span class="s">"//cdnjs.cloudflare.com/ajax/libs/fullcalendar/3.3.0/fullcalendar.min.js"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="c">&lt;!-- CSS Styles --&gt;</span> <span class="nt">&lt;link</span> <span class="na">rel=</span><span class="s">"stylesheet"</span> <span class="na">href=</span><span class="s">"//cdnjs.cloudflare.com/ajax/libs/fullcalendar/3.3.0/fullcalendar.min.css"</span> <span class="nt">/&gt;</span> <span class="nt">&lt;/head&gt;</span> <span class="nt">&lt;body&gt;</span> <span class="nt">&lt;div</span> <span class="na">id=</span><span class="s">"calendar"</span><span class="nt">&gt;&lt;/div&gt;</span> <span class="nt">&lt;script </span><span class="na">type=</span><span
class="s">"text/javascript"</span><span class="nt">&gt;</span> <span class="nx">$</span><span class="p">(</span><span class="nb">document</span><span class="p">).</span><span class="nx">ready</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="c1">// Fetch our events</span> <span class="nx">$</span><span class="p">.</span><span class="nx">ajax</span><span class="p">({</span> <span class="na">url</span><span class="p">:</span> <span class="s2">"https://data.oregon.gov/resource/yid5-c4eq.json"</span><span class="p">,</span> <span class="na">method</span><span class="p">:</span> <span class="s2">"GET"</span><span class="p">,</span> <span class="na">datatype</span><span class="p">:</span> <span class="s2">"json"</span><span class="p">,</span> <span class="na">data</span><span class="p">:</span> <span class="p">{</span> <span class="s2">"$where"</span> <span class="p">:</span> <span class="s2">"start_date_time &gt; '"</span> <span class="o">+</span> <span class="nx">moment</span><span class="p">().</span><span class="nx">subtract</span><span class="p">(</span><span class="mi">31</span><span class="p">,</span> <span class="s1">'days'</span><span class="p">).</span><span class="nx">format</span><span class="p">(</span><span class="s2">"YYYY-MM-DDT00:00:00"</span><span class="p">)</span> <span class="o">+</span> <span class="s2">"'"</span><span class="p">,</span> <span class="s2">"city"</span> <span class="p">:</span> <span class="s2">"Portland"</span><span class="p">,</span> <span class="s2">"$order"</span> <span class="p">:</span> <span class="s2">"start_date_time DESC"</span> <span class="p">}</span> <span class="p">}).</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">response</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// Parse our events into an event object for FullCalendar</span> 
<span class="kd">var</span> <span class="nx">events</span> <span class="o">=</span> <span class="p">[];</span> <span class="nx">$</span><span class="p">.</span><span class="nx">each</span><span class="p">(</span><span class="nx">response</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="nx">e</span><span class="p">)</span> <span class="p">{</span> <span class="nx">events</span><span class="p">.</span><span class="nx">push</span><span class="p">({</span> <span class="na">start</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">start_date_time</span><span class="p">,</span> <span class="na">end</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">end_date_time</span><span class="p">,</span> <span class="na">title</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">meeting_title</span><span class="p">,</span> <span class="na">url</span><span class="p">:</span> <span class="nx">e</span><span class="p">.</span><span class="nx">web_link</span> <span class="p">});</span> <span class="p">});</span> <span class="nx">$</span><span class="p">(</span><span class="s1">'#calendar'</span><span class="p">).</span><span class="nx">fullCalendar</span><span class="p">({</span> <span class="na">events</span><span class="p">:</span> <span class="nx">events</span> <span class="p">});</span> <span class="p">});</span> <span class="p">});</span> <span class="nt">&lt;/script&gt;</span> <span class="nt">&lt;/body&gt;</span> <span class="nt">&lt;/html&gt;</span></code></pre></figure>chrismetcalfRecently I was helping a customer of ours with an interesting problem: they have a Socrata dataset full of events, in this case public meetings, and they wanted a flexible way of displaying them within a monthly calendar embedded within their website.Conditional 
notifications with Huginn2017-03-29T00:00:00+00:002017-03-29T00:00:00+00:00https://dev.socrata.com/blog/2017/03/29/conditional-notifications-with-huginn<p>As more and more open datasets approach the point where they’re receiving “real time” updates, the topic of how to receive push or <a href="https://en.wikipedia.org/wiki/Webhook">webhook</a> notifications when a dataset is updated and matches certain conditions comes up more and more often. For example, you might want to be notified when there are crimes in your neighborhood, when your local government releases a new dataset, or when the current number of outstanding pot hole requests goes over a certain threshold.</p> <p>Currently (or at least as of when this article is posted), Socrata Publica doesn’t support such notifications, but with a few open source tools, you can create incredibly powerful workflows that alert you based on changes in open data (and do the rest of your bidding)!</p> <p>We’re going to use an open source tool called <a href="https://github.com/cantino/huginn">Huginn</a> to create our custom workflows. If you’re familiar with <a href="https://ifttt.com/">IFTTT</a>, Huginn will seem conceptually similar - it allows you to set up triggers and actions that occur based on them. However, it is <em>far</em> more powerful - Huginn workflows can branch, have conditionals, make API calls, and even execute arbitrary JavaScript.</p> <p>This tutorial will walk you through a simple scenario: My commute each morning takes me across the <a href="https://en.wikipedia.org/wiki/Aurora_Bridge">Aurora Bridge</a> in Seattle, a high-level bridge that is prone to icing when it gets cold enough in the winter.
The City of Seattle has recently published <a href="https://dev.socrata.com/foundry/data.seattle.gov/ivtm-938t">real-time road sensor readings</a> that include road temperature readings.</p> <h3 id="prerequisites">Prerequisites</h3> <p>This tutorial assumes a couple of things:</p> <ol> <li>You have Huginn installed and running somewhere, or you have access to a running instance. You can run it locally on your own hardware, or you can run it in the cloud. I found their <a href="https://github.com/cantino/huginn#heroku">one-click Heroku option</a> to be the quickest, and that’s how I developed this tutorial.</li> <li>You have a <a href="https://www.twilio.com/">Twilio</a> account and you’ve followed their tutorial to set up a phone number for sending texts. You don’t have to use Twilio for notifications - I actually use Slack for most of mine - and Huginn provides agents to push notifications via a number of different mechanisms.</li> </ol> <h2 id="step-0-author-our-soql-query">Step 0: Author our SoQL query</h2> <p>Starting from <a href="https://dev.socrata.com/foundry/data.seattle.gov/ivtm-938t">our source dataset</a>, I want to take the average of the last five road temperature readings and determine if they are below freezing. That will smooth out any momentary drops in the road temperature. 
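</p> <p>Since the interactive TryIt examples on this page are currently disabled, here is a rough sketch of the shape of the chained query we are about to build. Treat it as pseudocode for the request, not a copy-paste recipe; the precise sub-query syntax is described in the SoQL documentation linked below and may differ slightly from this form:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT roadsurfacetemperature
 WHERE stationname = 'AuroraBridge'
 ORDER BY datetime DESC
 LIMIT 5
|&gt; SELECT AVG(roadsurfacetemperature) AS rolling_average
</code></pre></div></div> <p>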
So, I want to:</p> <ul> <li>Start with the API endpoint: <code class="highlighter-rouge">https://data.seattle.gov/resource/ivtm-938t.json</code></li> <li><code class="highlighter-rouge">$where</code> filter to only get the readings for the Aurora Bridge: <code class="highlighter-rouge">stationname = 'AuroraBridge'</code></li> <li><code class="highlighter-rouge">$order</code> the results from latest to oldest: <code class="highlighter-rouge">datetime DESC</code></li> <li><code class="highlighter-rouge">$limit</code> myself to only <code class="highlighter-rouge">5</code> results</li> <li>Use <code class="highlighter-rouge">$select</code> to aggregate the results with an <code class="highlighter-rouge">AVG(roadsurfacetemperature)</code></li> </ul> <p>That last bit is a bit tricky, since it needs to be applied <em>after</em> all the other work is done. Don’t fret, because we have a SoQL feature that helps with that. Using the <a href="/docs/queries/query.html#sub-queries">sub-query functionality of <code class="highlighter-rouge">$query</code></a>, we can chain our aggregation after the rest of our query.</p> <p>The full query looks like the following, and outputs a single value representing the average of the last 5 sensor readings:</p> <div class="tryit-link"> <code>The TryIt macro has been disabled until future notice while we upgrade this site to SODA3.</code> </div> <h2 id="step-1-call-the-api-via-a-website-agent">Step 1: Call the API via a “Website Agent”</h2> <p>Our first step in Huginn will be to create a “<a href="https://github.com/cantino/huginn/wiki/Agent-configuration-examples#websiteagents">Website Agent</a>” to make our API call and turn the result into an event for our workflow:</p> <figure class="figure pull-right"> <a href="/img/posts/2017-03-28 - website agent.png" data-featherlight="image"> <img class="figure-image img-fluid rounded" src="/img/posts/2017-03-28 - website agent.thumb.png" alt="Completed Website Agent" /> <figcaption
class="figure-caption">Completed Website Agent &raquo;</figcaption> </a> </figure> <ol> <li>Within Huginn, select “Agents” and click “New Agent” to start the process of creating a new agent.</li> <li>In the “Type” dropdown, select “Website Agent”</li> <li>Fill out your new agent with the following details. If I don’t say what to fill out, accept the default: <ul> <li><code class="highlighter-rouge">Name</code>: Name your agent. I called mine “Fetch Aurora Bridge rolling average surface temperature”</li> <li><code class="highlighter-rouge">Schedule</code>: Choose how often you want your agent to check the API. I chose “5 min”</li> <li><code class="highlighter-rouge">Sources</code>: Leave blank</li> <li><code class="highlighter-rouge">Propagate Immediately</code>: You can leave this unchecked, but I’m impatient and check it</li> <li><code class="highlighter-rouge">Receivers</code>: Leave blank for now</li> <li><code class="highlighter-rouge">Options</code>: Copy the details from the screenshot. For <code class="highlighter-rouge">url</code>, use the full URL for your query from above.</li> </ul> </li> <li>Click “Save” when you’re done.
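<p>For reference, the <code class="highlighter-rouge">Options</code> blob is a JSON object. A rough sketch of what it contains is below; the keys are standard Website Agent options, the <code class="highlighter-rouge">url</code> value is your full query URL from Step 0 (elided here), and the JSONPath in <code class="highlighter-rouge">extract</code> is illustrative, so adjust it to match your query’s output:</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{
  "expected_update_period_in_days": "1",
  "url": "https://data.seattle.gov/resource/ivtm-938t.json?$query=...",
  "type": "json",
  "mode": "on_change",
  "extract": {
    "rolling_average": { "path": "$.[*].rolling_average" }
  }
}
</code></pre></div></div>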
When completed, your agent configuration should look like the screenshot to the right.</li> </ol> <h2 id="step-2-determine-whether-or-not-you-want-to-pass-on-an-alert">Step 2: Determine whether or not you want to pass on an alert</h2> <p>In this next step, we’ll use a “Trigger Agent” to conditionally pass on the events generated by our Website Agent and turn them into alerts to be messaged about.</p> <figure class="figure pull-right"> <a href="/img/posts/2017-03-28 - trigger agent.png" data-featherlight="image"> <img class="figure-image img-fluid rounded" src="/img/posts/2017-03-28 - trigger agent.thumb.png" alt="Completed Trigger Agent" /> <figcaption class="figure-caption">Completed Trigger Agent &raquo;</figcaption> </a> </figure> <ol> <li>Select “Agents” and then “New Agent” to start the process of creating a new agent.</li> <li>In the “Type” dropdown, select “Trigger Agent”.</li> <li>Fill out your new agent with the following details: <ul> <li><code class="highlighter-rouge">Name</code>: Name your agent. I named mine “Is the bridge freezing?”</li> <li><code class="highlighter-rouge">Sources</code>: Select the agent you created in Step 1</li> <li><code class="highlighter-rouge">Propagate Immediately</code>: I’m impatient, so I check this box. If you leave it unchecked, you’ll need to wait for Huginn to pass on your events with each check it does, and it may take several minutes to be notified.</li> <li><code class="highlighter-rouge">Receivers</code>: Leave this blank for now</li> </ul> </li> <li>In the <code class="highlighter-rouge">Options</code> section of your agent configuration, match the details from the screenshot to the right. Most importantly: <ul> <li><code class="highlighter-rouge">type</code>: The type of check to perform. 
We want to see if our value is less than or equal to freezing, so we use <code class="highlighter-rouge">field&lt;=value</code></li> <li><code class="highlighter-rouge">path</code>: The JSON output by your SoQL query and extracted by your Website Agent, in my case <code class="highlighter-rouge">rolling_average</code></li> <li><code class="highlighter-rouge">value</code>: The value to check against, <code class="highlighter-rouge">32</code></li> <li><code class="highlighter-rouge">message</code>: The message we want to format and pass on to the next step. It’s a Liquid-templated string, and we used <code class="highlighter-rouge">Watch out! Temperature has reached {{rolling_average}} degrees and the bridge may be icy!</code></li> </ul> </li> <li>Click “Save” when you’re done and your configuration matches the screenshot to the right.</li> </ol> <h2 id="step-3-send-our-text-message-with-twilio">Step 3: Send our text message with Twilio</h2> <p>This is where the rubber hits the road! We’ll be setting up a “Twilio Agent” to send us a text message via <a href="https://www.twilio.com/">Twilio</a> when the above criteria are met.</p> <div class="alert alert-info "><p><em>Heads Up!</em> Twilio is a paid service, and if you want to send actual text messages, you’ll need to add a credit card to your account.
If you just want to try things out, you can use your <a href="https://www.twilio.com/docs/api/rest/test-credentials">test credentials</a>, but the workflow won’t send actual alerts.</p> </div> <p>Follow the steps below to set up your Twilio Agent:</p> <figure class="figure pull-right"> <a href="/img/posts/2017-03-28 - twilio agent.png" data-featherlight="image"> <img class="figure-image img-fluid rounded" src="/img/posts/2017-03-28 - twilio agent.thumb.png" alt="Completed Twilio Agent" /> <figcaption class="figure-caption">Completed Twilio Agent &raquo;</figcaption> </a> </figure> <ol> <li>Select “Agents” and then “New Agent” to start the process of creating a new agent</li> <li>In the “Type” dropdown, select “Twilio Agent”</li> <li>Fill out your new agent with the following details: <ul> <li><code class="highlighter-rouge">Name</code>: Name your agent. I called mine “Text message me an alert”</li> <li><code class="highlighter-rouge">Sources</code>: Select the agent you created in Step 2</li> <li><code class="highlighter-rouge">Propagate Immediately</code>: You can leave this unchecked, but I’m impatient and check it</li> </ul> </li> <li>Under <code class="highlighter-rouge">Options</code>, make yours look similar to the screenshot, filling in the details below based on your credentials in Twilio: <ul> <li><code class="highlighter-rouge">account_sid</code> and <code class="highlighter-rouge">auth_token</code>: Your account SID and secret auth token from your Twilio account details</li> <li><code class="highlighter-rouge">sender_cell</code>: The phone number Twilio is configured to send from</li> <li><code class="highlighter-rouge">receiver_cell</code>: The cell phone number you want the text message to go to</li> <li><code class="highlighter-rouge">receive_text</code>: Must be set to <code class="highlighter-rouge">true</code> to have the agent send a text message</li> <li><code class="highlighter-rouge">receive_call</code>: Must be set to <code 
class="highlighter-rouge">false</code></li> <li><code class="highlighter-rouge">expected_receive_period_in_days</code>: Set this to however often you expect this agent to receive a “bridge frozen” event from its source. The agent will wait this long before setting a flag to note that it might be broken. I set mine to <code class="highlighter-rouge">180</code>, which might not be long enough in Seattle.</li> </ul> </li> <li>Click “Save” when you’re done. When completed, your agent configuration should look like the screenshot to the right.</li> </ol> <h2 id="step-4-testing">Step 4: Testing!</h2> <p>At this point you have a few options to test things out:</p> <ol> <li>Wait until Seattle drops below freezing. As it is almost April, this may not happen for a while.</li> <li>Adjust the <code class="highlighter-rouge">value</code> in your Trigger Agent to trigger at a much higher temperature.</li> <li>Create an agent that will inject a fake event into the workflow. This is the option we’ll use.</li> </ol> <p>To create an agent that can emit fake events:</p> <figure class="figure pull-right"> <a href="/img/posts/2017-03-28 - manual agent.png" data-featherlight="image"> <img class="figure-image img-fluid rounded" src="/img/posts/2017-03-28 - manual agent.thumb.png" alt="Manual Event Agent" /> <figcaption class="figure-caption">Manual Event Agent &raquo;</figcaption> </a> </figure> <ol> <li>Select “Agents” and then “New Agent” to start the process of creating a new agent</li> <li>In the “Type” dropdown, select “Manual Event Agent”</li> <li>Fill out your new agent with the following details: <ul> <li><code class="highlighter-rouge">Name</code>: Name your agent. I called mine “Let’s pretend it’s freezing!”</li> <li><code class="highlighter-rouge">Receivers</code>: Select the agent you created in Step 2</li> </ul> </li> <li>Click “Save” when you’re done.
When completed, your agent configuration should look like the screenshot to the right.</li> </ol> <p>Next, use your Manual Event Agent to fake an event that makes it look like the temperature dropped below freezing:</p> <figure class="figure pull-right"> <a href="/img/posts/2017-03-28 - emit event.png" data-featherlight="image"> <img class="figure-image img-fluid rounded" src="/img/posts/2017-03-28 - emit event.thumb.png" alt="Emit Manual Event" /> <figcaption class="figure-caption">Emit Manual Event &raquo;</figcaption> </a> </figure> <ol> <li>Click on your Manual Event Agent within the agent listing</li> <li>Within the event payload, click the “+” button to create a single key/value pair: <ul> <li>For the key (on the left), use our variable name: <code class="highlighter-rouge">rolling_average</code></li> <li>For the value, use something below freezing, like <code class="highlighter-rouge">-150</code></li> </ul> </li> <li>Click “Submit” to emit your event into the system.</li> </ol> <p>Your fake event will then propagate through the system, and if everything is configured properly, you’ll get a text message to your device!</p> <p><img src="/img/posts/2017-03-28 - text message.png" alt="It worked!" /></p> <p>By now, hopefully your mind is buzzing with ideas of how you can use Huginn to monitor and alert based on open data! Please let us know what you come up with!</p> <h2 id="by-the-way">By the way…</h2> <p>After sleeping on this post, I actually realized we could skip the “Trigger Agent” by modifying our SoQL query slightly.
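</p> <p>The idea is to push the threshold check into the query itself, so the API only returns a row when the average is at or below freezing. A rough sketch of what that chained query could look like (the exact sub-query syntax may differ slightly from this form):</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT roadsurfacetemperature
 WHERE stationname = 'AuroraBridge'
 ORDER BY datetime DESC
 LIMIT 5
|&gt; SELECT AVG(roadsurfacetemperature) AS rolling_average
   HAVING rolling_average &lt;= 32
</code></pre></div></div> <p>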
By using the <a href="/docs/queries/having.html"><code class="highlighter-rouge">$having</code></a> filter on our aggregation, we can make it only return a temperature value when it’s less than our specified threshold:</p> <div class="tryit-link"> <code>The TryIt macro has been disabled until future notice while we upgrade this site to SODA3.</code> </div> <p>Don’t be surprised if the query above doesn’t output any records when you click on it; that’s the point! It’ll only return a <code class="highlighter-rouge">rolling_average</code> when it’s 32 degrees or less. This would allow us to connect our Website Agent directly to our Twilio Agent, simplifying our workflow! However, it’s also harder to test, and we’d need to format our message either with aggregation in our <code class="highlighter-rouge">SELECT</code> or with an additional “Liquid Output Agent” before the Twilio Agent. That exercise is left up to the reader!</p>chrismetcalfValidate Your Data with FME2017-02-02T00:00:00+00:002017-02-02T00:00:00+00:00https://dev.socrata.com/blog/2017/02/02/validate-your-data-with-fme-<p>This post describes how to use an FME workspace to validate your data and highlight what data cleansing needs to be undertaken before putting it into Socrata. FME is a powerful tool for not only data automation, but data analysis. Here we will use it to do a ‘health check’ on our data to understand what <a href="https://support.socrata.com/hc/en-us/articles/202950008-Import-Warning-and-Errors">errors, warnings</a> and roadblocks it may cause.
To better understand the rules for importing data, check out this <a href="https://knowledge.safe.com/articles/715/working-with-date-and-time-attributes-tutorial.html">support article</a> that discusses importing your data into Socrata.</p> <h2 id="prerequisites">Prerequisites:</h2> <ul> <li>An installed copy of FME Desktop 2016.1 - the latest version can be downloaded <a href="https://www.safe.com/support/support-resources/fme-downloads/">here</a></li> <li>Limited experience with FME; for a warmup, see this previous post: <a href="https://dev.socrata.com/blog/2014/10/09/fme-socrata-writer.html">Using the FME Socrata Writer</a></li> <li>An installed copy of Microsoft Excel</li> <li>Read access to a dataset of your choice that will be ingressed to Socrata</li> <li>Publishing rights to a Socrata domain - you will need this if you plan to publish your cleansed data to Socrata. It is not required for the data cleanse operation, but is a good test of data cleanliness to publish it.</li> </ul> <h2 id="contents">Contents:</h2> <ol> <li>Getting to know data validation rules for Socrata</li> <li>Creating your schema map</li> <li>Running the FME validation workspace</li> <li>Interpreting your results</li> </ol> <h2 id="data-validation-for-socrata">Data Validation for Socrata</h2> <p>Data must be properly formatted for each data type to be ingressed into Socrata. If the data is not formatted, it may not load properly, or in some cases not at all. If you are using FME to publish your data, the workflow will fail if your data is not formatted. Data types are used to tell a common story by allowing each data type to be filtered, displayed and analyzed in its own way.</p> <p>This post and the FME workspace will be focusing on the following data types:</p> <ul> <li>Calendar date</li> <li>Checkbox</li> <li>Number</li> <li>Percent</li> <li>Text</li> </ul> <p>These data types are the most commonly used and have a high potential for error, but are also very easy to fix!
This workspace can be used and reused as a tool to assess a variety of datasets that you plan to ingress to Socrata. The output file from the workspace will guide you to where errors may lie within your dataset and give you an idea of how many, if any, errors need to be fixed. As a user, you will need to know:</p> <ul> <li>What your data should look like</li> <li>If the rule failures are problems with the data</li> <li>If action should be taken to change the data</li> </ul> <h2 id="creating-your-schema-map">Creating Your Schema Map</h2> <p>In order for FME to validate your data against the correct rules, each attribute (or column) must be assigned a data type that Socrata can read. From here each feature (or row) can be broken apart so every cell is matched to the correct set of rules. To do this, we must establish a schema for your dataset; a schema is what tells the program how to read and organize your data. You will be building your schema in Excel using a template file, which will be read by your FME workspace - this is much easier than assigning each attribute a data type within FME.</p> <ol> <li>Open the Schema Map Template in Excel, rename and save your Schema Map.</li> <li>Open the dataset to validate in Excel. <em>Note:</em> you may have to extract your dataset from another source like a database; it is best practice to export as a <code class="highlighter-rouge">.csv</code>.</li> <li> <p>Highlight all attribute names on the sheet and Copy.</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Schema_Map1.png" alt="Schema_Map1" /></p> </li> <li> <p>In your schema map workbook, on the Schema_Map sheet, in cell A2 Paste Special, Transpose. This will create a list of your attributes to validate.</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Schema_Map2.png" alt="Schema_Map2" /></p> </li> <li>Use the drop down menu in cell B2 to select the Socrata data type for the attribute in cell A2. Repeat this for all attributes listed in column A.
Note: as a best practice, ZIP codes and unique ID fields (employee/invoice IDs) should be set to text to avoid errors such as dropped leading zeros (northeastern US ZIP codes) and stripped dashes in ZIP+4 codes.</li> <li>Save your changes.</li> </ol> <h2 id="running-the-fme-validation-workspace">Running the FME Validation Workspace</h2> <p>This workspace is intended for a range of users; it is heavily annotated to inform users how things are working and shows all connections/transformers.</p> <ol> <li> <p>Open the FME validation workspace, rename and save your validation workspace. Note your workspace may look different based on operating system and FME version. These screenshots come from FME 2016.1.3.0 - Build 16709 - WIN64, used in Windows 10 Pro.</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace1.png" alt="Workspace1" /></p> </li> <li> <p>Update Published Parameters by double clicking on each parameter in the Navigator pane</p> <ul> <li> <p><code class="highlighter-rouge">Entity</code>: Your organization/entity name (used only for output file naming purposes)</p> </li> <li> <p><code class="highlighter-rouge">Dataset_Name</code>: Populate this with your dataset, general text or dataset identifier are allowed (used only for output file naming purposes)</p> </li> <li> <p><code class="highlighter-rouge">Dataset_Path</code>: The full file path of the dataset to validate, including extension</p> </li> <li> <p><code class="highlighter-rouge">Output_Folder</code>: Folder where you want the output <code class="highlighter-rouge">.xlsx</code>; you must include the final “/” at the end of the path</p> </li> </ul> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace2.png" alt="Workspace2" /></p> </li> <li> <p>Add your schema map to the workspace by updating the AttributeValueMapper transformer</p> <ol> <li> <p>Open the <code class="highlighter-rouge">AttributeValueMapper</code> dialogue box and click Import</p> <p><img
src="/img/FME_Data_Cleanse/2017-01-31_Workspace3.png" alt="Workspace3" /></p> </li> <li> <p>Select the format of your dataset to validate and its file path, then click next</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace4.png" alt="Workspace4" /></p> </li> <li> <p>Change the Import Mode to Attribute Values. Select the Feature Type by clicking the box next to <code class="highlighter-rouge">Schema_Map</code> to point the Import Wizard to the correct Excel sheet, then click Next</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace5.png" alt="Workspace5" /></p> </li> <li> <p>Select the attributes for the source and destination fields. The “Source Value” should come from <code class="highlighter-rouge">SourceAttributeName</code> and the destination should come from <code class="highlighter-rouge">DestinationDataType</code>. Click “Import”</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace6.png" alt="Workspace6" /></p> </li> <li> <p>FME should have imported your dataset’s attribute names and data types. Click OK to complete the import process.</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace7.png" alt="Workspace7" /></p> </li> </ol> </li> <li> <p>Add inspectors where you want to monitor specific errors.</p> <ul> <li> <p>If there are specific errors you want to track more closely, you can insert <code class="highlighter-rouge">Inspectors</code> on <code class="highlighter-rouge">AttributeValidators</code> or <code class="highlighter-rouge">Testers</code>. If you want to track a specific validation rule that you’re curious about, adding an inspector midway through the workspace will show which specific cells do not pass validation rules. 
<code class="highlighter-rouge">Inspectors</code> can be added by right clicking on an outgoing port (green triangle pointing right) of a transformer and clicking “Connect Inspector.”</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace8.png" alt="Workspace8" /></p> </li> </ul> </li> <li> <p>Run the workspace</p> <ul> <li> <p>Click the green “play” button in the toolbar.</p> <p><img src="/img/FME_Data_Cleanse/2017-01-31_Workspace9.png" alt="Workspace9" /></p> </li> </ul> </li> <li> <p>Examine your results and assess your data</p> <ul> <li>An <code class="highlighter-rouge">.xlsx</code> will be output to the folder you specified, with a name including your entity, the dataset validated and a timestamp of when the workspace was run.</li> </ul> </li> </ol> <h2 id="interpreting-your-results">Interpreting your results</h2> <p>With the Excel output quantifying potential errors to fix, use your favorite platform to cleanse the data. To better understand each of the validation rules, check out this <a href="https://opendata.socrata.com/dataset/Data-Validation-Failures/9wr8-8fe8">table</a>. If you cannot see the errors in your dataset that the workspace is reporting, open your dataset in the FME Data Inspector. If you are using Excel to view your dataset, it may be automatically formatting the data and hiding the errors.</p> <p>Not all validation failures are errors in the dataset; it is up to the user to decide if these should be ignored or if they should be changed. For example, if negative numbers are found in a dataset, the user needs to decide if they can be allowed or amended.</p> <p>The workspace can be rerun to validate a cleansed dataset as many times as you like.
If you change your schema, you will need to update the schema map and workspace to read any changes (revisit the Creating Your Schema Map or Running the FME Validation Workspace sections of this page).</p> <h3 id="dates">Dates:</h3> <p>Safe has created a quick <a href="https://knowledge.safe.com/articles/715/working-with-date-and-time-attributes-tutorial.html">tutorial</a> to better understand how dates are handled in FME. It’s recommended to learn about how dates are read, especially by the <a href="http://docs.safe.com/fme/2016.1/html/FME_Desktop_Documentation/FME_Transformers/Transformers/dateformatter.htm">DateFormatter</a>. This tutorial will give you enough knowledge to make you dangerous, but some experimentation will help you understand what’s happening under the hood.</p> <p><strong>WARNING:</strong> The DateFormatter coerces your data from one string into another, some of the calculations it does in the process may change your data. Review this <a href="https://opendata.socrata.com/dataset/FME-Date-Formatter-Data-Coercion/7bdf-9qsu">cheat sheet</a> to see how DateFormatter coerces common errors in date fields to properly formatted dates. There may be consequences when you set the Source Date Format parameter to be “Unknown - Automatic Detection.” You can decrease errors by properly populating the Source Date Format parameter, see the DateFormatter help link above. If your date is in a format that has day before month, then you must input an expected format.</p> <h4 id="pro-tips-for-dates">Pro Tips for dates:</h4> <p>To avoid letting DateFormatter create problems, you can format your date to YYYYMMDD before it is read into FME. 
For example, you can use the following expression in SQL to help - if that is how your data is stored.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT CONVERT(VARCHAR(10),getdate(),120) </code></pre></div></div> <p>When choosing a date format to publish on Socrata, the Socrata writer in FME has two different date formats to write. The data type <code class="highlighter-rouge">calendar_date</code> should be used if there is no time zone included in the data. If time zone data is not essential, it is recommended not to include it.</p> <p>Some common date formats converted to FME date formats:</p> <ul> <li><code class="highlighter-rouge">DD.MM.YYYY</code> → <code class="highlighter-rouge">%d.%m.%Y</code></li> <li><code class="highlighter-rouge">MM/DD/YYYY</code> → <code class="highlighter-rouge">%m/%d/%Y</code></li> <li><code class="highlighter-rouge">MM-DD-YY</code> → <code class="highlighter-rouge">%m"-"%d"-"%y</code></li> <li><code class="highlighter-rouge">YYYY-MM-DD['T']HH:mm:ssZ</code> (ISO8601 with timezone) → <code class="highlighter-rouge">%Y-%m-%dT%H:%M:%S%Z</code></li> </ul> <p>Other links to better understand date formatting:</p> <ul> <li><a href="https://msdn.microsoft.com/en-us/library/ms187928.aspx">https://msdn.microsoft.com/en-us/library/ms187928.aspx</a></li> <li><a href="http://socrata.github.io/datasync/resources/using-map-fields-dialog.html">http://socrata.github.io/datasync/resources/using-map-fields-dialog.html</a></li> <li><a href="http://socrata.github.io/datasync/resources/control-config.html#datetime-formatting">http://socrata.github.io/datasync/resources/control-config.html#datetime-formatting</a></li> </ul> <h3 id="leading-zeros">Leading Zeros:</h3> <p>Leading zeros are zeros at the beginning of your data e.g. 0001234. They can cause problems when importing into Socrata as a number. 
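</p> <p>To see why this matters, here is a quick illustration in plain JavaScript (any language behaves similarly): coercing a zero-padded string to a number silently drops the leading zeros, and they can only be restored if you already know the intended width.</p> <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Coercing a zero-padded string to a number loses the leading zero.
const asNumber = Number("01234");                    // 1234, not "01234"

// One possible repair, but only if you know the intended width (5 for US ZIPs).
const repaired = String(asNumber).padStart(5, "0");  // "01234"
</code></pre></div></div> <p>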
Not all leading zeros are errors or poor data; for example, some zip codes, vendor IDs, or phone numbers start with a zero that must be preserved. In these cases, you may consider switching the data type to text. For advice on how to clean up your leading zeros, here’s a <a href="https://support.socrata.com/hc/en-us/articles/206108587-Tips-and-Tricks-How-to-handle-leading-zeros-when-publishing-a-dataset">support article</a> to help.</p>cdesistoThis post describes how to use an FME workspace to validate your data and highlight what data cleansing needs to be undertaken before putting it into Socrata. FME is a powerful tool for not only data automation, but data analysis. Here we will use it to do a ‘health check’ on our data to understand what errors, warnings and roadblocks it may cause. To better understand the rules for importing data, check out this support article that discusses importing your data into Socrata.Visualizing data using the Google Calendar Chart2017-01-03T00:00:00+00:002017-01-03T00:00:00+00:00https://dev.socrata.com/blog/2017/01/03/visualizing-data-using-google-calendar-chart<div id="calendar_basic" style="float:center; width:1000px"><!-- This space intentionally left blank --></div> <p>This example shows how to pull data from a Socrata Dataset (in this case, the <a href="https://dev.socrata.com/foundry/data.cityofchicago.org/6zsd-86xi">City of Chicago crime records</a>) with the Google <a href="https://developers.google.com/chart/interactive/docs/gallery/calendar">“Calendar Chart”</a> visualization. As a bonus, we will then embed that chart into a <a href="https://socrata.com/solutions/publica-open-data-cloud/">Socrata Perspectives page</a>.</p> <p>The <a href="https://developers.google.com/chart/">Google Charts library</a> provides a number of different chart types for visualization that can be leveraged using the SODA API.
The “Calendar Chart” is useful when you have incident-level data that you would like to visualize by daily density over the course of a year.</p> <h2 id="prerequisites">Prerequisites</h2> <p>There are a few prerequisites before starting with this example:</p> <ol> <li>Most obviously, you’ll need to work with data in a Socrata dataset containing time series data that can be aggregated at a daily level. If you’re looking for a dataset to work with, we recommend you explore the <a href="https://www.opendatanetwork.com">Open Data Network</a>, where you can find a full catalog of datasets from our awesome customers.</li> <li>You’ll need some basic familiarity with JavaScript before starting. If you’ve never worked with JavaScript before, we recommend <a href="https://www.codecademy.com/learn/javascript">this course from Codecademy</a>.</li> <li>We’ll also be making use of <a href="https://jquery.com/">jQuery</a> to simplify some of our development tasks.</li> </ol> <div class="alert alert-info"><p>Check out all of the different chart types available through the <a href="https://developers.google.com/chart/interactive/docs/gallery">Google Charts library</a>. </p></div> <h2 id="craft-your-soql-query">Craft your SoQL query</h2> <p>The Calendar Chart requires, at a minimum, two fields: a date and a numeric value.
So we’ll use the SoQL <a href="/docs/queries/select.html"><code class="highlighter-rouge">$select</code></a> and <a href="/docs/queries/group.html"><code class="highlighter-rouge">$group</code></a> parameters to aggregate our dataset into daily roll-ups. This results in a SoQL query that looks like the following:</p> <div class="tryit-link"> <code>The TryIt macro has been disabled until future notice while we upgrade this site to SODA3.</code> </div> <p>The results will be aggregated like the following:</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="p">[</span> <span class="p">{</span> <span class="s2">"count"</span><span class="p">:</span> <span class="s2">"762"</span><span class="p">,</span> <span class="s2">"day"</span><span class="p">:</span> <span class="s2">"2016-09-04T00:00:00.000"</span> <span class="p">},</span> <span class="p">{</span> <span class="s2">"count"</span><span class="p">:</span> <span class="s2">"842"</span><span class="p">,</span> <span class="s2">"day"</span><span class="p">:</span> <span class="s2">"2014-07-20T00:00:00.000"</span> <span class="p">},</span> <span class="p">...</span> <span class="p">]</span></code></pre></figure> <h2 id="fetch-data-using-jquery">Fetch data using jQuery</h2> <p>We’ll define a <code class="highlighter-rouge">fetchValues</code> function that uses the <a href="https://api.jquery.com/jquery.get/"><code class="highlighter-rouge">jQuery.get(...)</code></a> function to fetch data from the SODA API, transforms it into an array of JavaScript <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date">Date</a> objects and counts, and returns it for handling:</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">var</span> <span class="nx">fetchValues</span> <span class="o">=</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span
class="k">return</span> <span class="nx">$</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span> <span class="s1">'https://data.cityofchicago.org/resource/6zsd-86xi.json'</span><span class="p">,</span> <span class="p">{</span> <span class="s1">'$select'</span> <span class="p">:</span> <span class="s1">'date_trunc_ymd(date) as day, count(*)'</span><span class="p">,</span> <span class="s1">'$where'</span> <span class="p">:</span> <span class="s2">"date &gt; '2014-01-01'"</span><span class="p">,</span> <span class="s1">'$group'</span> <span class="p">:</span> <span class="s1">'day'</span> <span class="p">}</span> <span class="p">).</span><span class="nx">pipe</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">res</span><span class="p">)</span> <span class="p">{</span> <span class="kd">var</span> <span class="nx">ary</span> <span class="o">=</span> <span class="p">[]</span> <span class="nx">$</span><span class="p">.</span><span class="nx">each</span><span class="p">(</span><span class="nx">res</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="nx">rec</span><span class="p">)</span> <span class="p">{</span> <span class="nx">ary</span><span class="p">.</span><span class="nx">push</span><span class="p">([</span><span class="k">new</span> <span class="nb">Date</span><span class="p">(</span><span class="nx">rec</span><span class="p">.</span><span class="nx">day</span><span class="p">.</span><span class="nx">replace</span><span class="p">(</span><span class="s2">"T00:00:00"</span><span class="p">,</span> <span class="s2">"T12:00:00"</span><span class="p">)),</span> <span class="nb">parseInt</span><span class="p">(</span><span class="nx">rec</span><span class="p">.</span><span class="nx">count</span><span class="p">)]);</span> <span class="p">});</span> <span class="k">return</span> <span 
class="nx">ary</span><span class="p">;</span> <span class="p">});</span> <span class="p">};</span></code></pre></figure> <h2 id="visualize-the-data-with-google-charts">Visualize the data with Google Charts</h2> <p>Once we’ve got our data from the SODA API, we’ll plumb it into the Google Calendar Chart library to visualize the actual data. We do this in our <code class="highlighter-rouge">drawChart</code> function:</p> <ol> <li>First we initialize our <code class="highlighter-rouge">DataTable</code> and add two columns - one for our date and another for our value.</li> <li>Then we initialize our <code class="highlighter-rouge">Calendar</code>, feeding it our target element by ID, <code class="highlighter-rouge">calendar_basic</code>.</li> <li>Finally, we draw our chart, feeding it configuration via our <code class="highlighter-rouge">options</code> object.</li> </ol> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="kd">var</span> <span class="nx">drawChart</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">ary</span><span class="p">)</span> <span class="p">{</span> <span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">google</span><span class="p">.</span><span class="nx">visualization</span><span class="p">.</span><span class="nx">DataTable</span><span class="p">();</span> <span class="nx">data</span><span class="p">.</span><span class="nx">addColumn</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="s1">'date'</span><span class="p">,</span> <span class="na">id</span><span class="p">:</span> <span class="s1">'Date'</span> <span class="p">});</span> <span class="nx">data</span><span class="p">.</span><span class="nx">addColumn</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span 
class="s1">'number'</span><span class="p">,</span> <span class="na">id</span><span class="p">:</span> <span class="s1">'count'</span> <span class="p">});</span> <span class="nx">data</span><span class="p">.</span><span class="nx">addRows</span><span class="p">(</span><span class="nx">ary</span><span class="p">);</span> <span class="kd">var</span> <span class="nx">chart</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">google</span><span class="p">.</span><span class="nx">visualization</span><span class="p">.</span><span class="nx">Calendar</span><span class="p">(</span><span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="s1">'calendar_basic'</span><span class="p">));</span> <span class="kd">var</span> <span class="nx">options</span> <span class="o">=</span> <span class="p">{</span> <span class="na">title</span><span class="p">:</span> <span class="s2">"City of Chicago Police Incidents Over Time"</span><span class="p">,</span> <span class="na">height</span><span class="p">:</span> <span class="mi">500</span><span class="p">,</span> <span class="p">};</span> <span class="nx">chart</span><span class="p">.</span><span class="nx">draw</span><span class="p">(</span><span class="nx">data</span><span class="p">,</span> <span class="nx">options</span><span class="p">);</span> <span class="p">};</span></code></pre></figure> <p>Finally, we tie things all together by having the Google Charts library call our function when it loads:</p> <figure class="highlight"><pre><code class="language-javascript" data-lang="javascript"><span class="nx">google</span><span class="p">.</span><span class="nx">charts</span><span class="p">.</span><span class="nx">setOnLoadCallback</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="nx">fetchValues</span><span class="p">().</span><span class="nx">done</span><span 
class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="p">{</span> <span class="nx">drawChart</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span> <span class="p">});</span> <span class="p">});</span></code></pre></figure> <h2 id="bonus-embed-your-visualization-in-socrata-perspectives">BONUS: Embed your visualization in Socrata Perspectives</h2> <div class="alert alert-info"><p>To get access to a <a href="https://socrata.com/solutions/publica-open-data-cloud/">Socrata Perspectives page</a>, you'll need to work for one of our awesome customers. Maybe your local government is hiring!</p></div> <p>Once you’ve created your visualization, you can use Perspectives’ support for embedded content to embed it into a new story. To do so, first you’ll need to craft a very simple HTML page, like the following, which loads your visualization. Make sure you include in that page the <code class="highlighter-rouge">script</code> tags that load your dependencies, in this case both jQuery and the Google Charts library.</p> <figure class="highlight"><pre><code class="language-html" data-lang="html"><span class="nt">&lt;html&gt;</span> <span class="nt">&lt;head&gt;</span> <span class="nt">&lt;script </span><span class="na">type=</span><span class="s">"text/javascript"</span> <span class="na">src=</span><span class="s">"https://www.gstatic.com/charts/loader.js"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="nt">&lt;script </span><span class="na">type=</span><span class="s">"text/javascript"</span> <span class="na">src=</span><span class="s">"https://www.google.com/jsapi"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="nt">&lt;script </span><span class="na">src=</span><span class="s">"https://code.jquery.com/jquery-3.1.1.min.js"</span> <span class="na">integrity=</span><span
class="s">"sha256-hVVnYaiADRTO2PzUGmuLJr8BLUSjGIZsDYGmIJLv2b8="</span> <span class="na">crossorigin=</span><span class="s">"anonymous"</span><span class="nt">&gt;&lt;/script&gt;</span> <span class="nt">&lt;/head&gt;</span> <span class="nt">&lt;body&gt;</span> <span class="nt">&lt;div</span> <span class="na">id=</span><span class="s">"calendar_basic"</span> <span class="na">style=</span><span class="s">"width: 1000px; height: 350px;"</span><span class="nt">&gt;&lt;/div&gt;</span> <span class="nt">&lt;script </span><span class="na">type=</span><span class="s">"text/javascript"</span><span class="nt">&gt;</span> <span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="c1">// Initialize the charting library</span> <span class="nx">google</span><span class="p">.</span><span class="nx">charts</span><span class="p">.</span><span class="nx">load</span><span class="p">(</span><span class="s2">"current"</span><span class="p">,</span> <span class="p">{</span> <span class="na">packages</span><span class="p">:[</span><span class="s2">"calendar"</span><span class="p">]</span> <span class="p">});</span> <span class="kd">var</span> <span class="nx">fetchValues</span> <span class="o">=</span> <span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="k">return</span> <span class="nx">$</span><span class="p">.</span><span class="kd">get</span><span class="p">(</span> <span class="s1">'https://data.cityofchicago.org/resource/6zsd-86xi.json'</span><span class="p">,</span> <span class="p">{</span> <span class="s1">'$select'</span> <span class="p">:</span> <span class="s1">'date_trunc_ymd(date) as day, count(*)'</span><span class="p">,</span> <span class="s1">'$where'</span> <span class="p">:</span> <span class="s2">"date &gt; '2014-01-01'"</span><span class="p">,</span> <span class="s1">'$group'</span> <span class="p">:</span> <span class="s1">'day'</span> <span class="p">}</span> <span 
class="p">).</span><span class="nx">pipe</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">res</span><span class="p">)</span> <span class="p">{</span> <span class="kd">var</span> <span class="nx">ary</span> <span class="o">=</span> <span class="p">[]</span> <span class="nx">$</span><span class="p">.</span><span class="nx">each</span><span class="p">(</span><span class="nx">res</span><span class="p">,</span> <span class="kd">function</span><span class="p">(</span><span class="nx">idx</span><span class="p">,</span> <span class="nx">rec</span><span class="p">)</span> <span class="p">{</span> <span class="nx">ary</span><span class="p">.</span><span class="nx">push</span><span class="p">([</span><span class="k">new</span> <span class="nb">Date</span><span class="p">(</span><span class="nx">rec</span><span class="p">.</span><span class="nx">day</span><span class="p">.</span><span class="nx">replace</span><span class="p">(</span><span class="s2">"T00:00:00"</span><span class="p">,</span> <span class="s2">"T12:00:00"</span><span class="p">)),</span> <span class="nb">parseInt</span><span class="p">(</span><span class="nx">rec</span><span class="p">.</span><span class="nx">count</span><span class="p">)]);</span> <span class="p">});</span> <span class="k">return</span> <span class="nx">ary</span><span class="p">;</span> <span class="p">});</span> <span class="p">};</span> <span class="kd">var</span> <span class="nx">drawChart</span> <span class="o">=</span> <span class="kd">function</span><span class="p">(</span><span class="nx">ary</span><span class="p">)</span> <span class="p">{</span> <span class="kd">var</span> <span class="nx">data</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">google</span><span class="p">.</span><span class="nx">visualization</span><span class="p">.</span><span class="nx">DataTable</span><span class="p">();</span> <span class="nx">data</span><span class="p">.</span><span 
class="nx">addColumn</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="s1">'date'</span><span class="p">,</span> <span class="na">id</span><span class="p">:</span> <span class="s1">'Date'</span> <span class="p">});</span> <span class="nx">data</span><span class="p">.</span><span class="nx">addColumn</span><span class="p">({</span> <span class="na">type</span><span class="p">:</span> <span class="s1">'number'</span><span class="p">,</span> <span class="na">id</span><span class="p">:</span> <span class="s1">'count'</span> <span class="p">});</span> <span class="nx">data</span><span class="p">.</span><span class="nx">addRows</span><span class="p">(</span><span class="nx">ary</span><span class="p">);</span> <span class="kd">var</span> <span class="nx">chart</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">google</span><span class="p">.</span><span class="nx">visualization</span><span class="p">.</span><span class="nx">Calendar</span><span class="p">(</span><span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="s1">'calendar_basic'</span><span class="p">));</span> <span class="kd">var</span> <span class="nx">options</span> <span class="o">=</span> <span class="p">{</span> <span class="na">title</span><span class="p">:</span> <span class="s2">"City of Chicago Police Incidents Over Time"</span><span class="p">,</span> <span class="na">height</span><span class="p">:</span> <span class="mi">500</span><span class="p">,</span> <span class="p">};</span> <span class="nx">chart</span><span class="p">.</span><span class="nx">draw</span><span class="p">(</span><span class="nx">data</span><span class="p">,</span> <span class="nx">options</span><span class="p">);</span> <span class="p">};</span> <span class="nx">google</span><span class="p">.</span><span class="nx">charts</span><span class="p">.</span><span 
class="nx">setOnLoadCallback</span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span> <span class="nx">fetchValues</span><span class="p">().</span><span class="nx">done</span><span class="p">(</span><span class="kd">function</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="p">{</span> <span class="nx">drawChart</span><span class="p">(</span><span class="nx">data</span><span class="p">);</span> <span class="p">});</span> <span class="p">});</span> <span class="p">})();</span> <span class="nt">&lt;/script&gt;</span> <span class="nt">&lt;/body&gt;</span> <span class="nt">&lt;/html&gt;</span></code></pre></figure> <p>Then, to add it as a content block in your story:</p> <ol> <li>When editing your story, click “Add Content” to bring up the palette, and drag in a new content block.</li> <li>Click “Insert” and then “HTML Embed”.</li> <li>Where it says “Paste or type HTML code”, paste in the entire contents of your HTML snippet and click “Insert”.</li> </ol> <p>That’s it!
Click below to see what this looks like.</p> <iframe src="https://evergreen.data.socrata.com/stories/s/City-of-Chicago-Crimes-2001-Present-Story/d4y4-b8nv/tile" style="width:600px;height:345px;background-color:transparent;overflow:hidden;" scrolling="no" frameborder="0"></iframe> <script type="text/javascript"> (function() { // Initialize the charting library google.charts.load("current", { packages:["calendar"] }); var fetchValues = function() { return $.get( 'https://data.cityofchicago.org/resource/6zsd-86xi.json', { '$select' : 'date_trunc_ymd(date) as day, count(*)', '$where' : "date > '2014-01-01'", '$group' : 'day' } ).pipe(function(res) { var ary = [] $.each(res, function(idx, rec) { ary.push([new Date(rec.day.replace("T00:00:00", "T12:00:00")), parseInt(rec.count)]); }); return ary; }); }; var drawChart = function(ary) { var data = new google.visualization.DataTable(); data.addColumn({ type: 'date', id: 'Date' }); data.addColumn({ type: 'number', id: 'count' }); data.addRows(ary); var chart = new google.visualization.Calendar(document.getElementById('calendar_basic')); var options = { title: "City of Chicago Police Incidents Over Time", height: 500, }; chart.draw(data, options); }; google.charts.setOnLoadCallback(function() { fetchValues().done(function(data) { drawChart(data); }); }); })(); </script>stuaganoScrubbing data with Python2017-01-03T00:00:00+00:002017-01-03T00:00:00+00:00https://dev.socrata.com/blog/2017/01/03/scrubbing-data-with-python<p>There’s an awesome Python package called <a href="https://scrubadub.readthedocs.io/en/stable/">Scrubadub</a> that can help you remove personally identifiable information from text data.
This is a great step to take before publishing a dataset that may contain <a href="https://en.wikipedia.org/wiki/Personally_identifiable_information">PII</a>, in order to prevent inadvertent disclosure.</p> <p>In this example, we’ll clean up some CSV data using Scrubadub, in order to prep it for loading into Socrata:</p> <ol> <li>First we’ll load a local CSV into a dataframe with <a href="https://pypi.python.org/pypi/pandas/0.19.1/#downloads">Pandas</a>,</li> <li>Then we’ll remove names using Scrubadub,</li> <li>And finally write it to a CSV that can be loaded using <a href="https://socrata.github.io/datasync">DataSync</a>.</li> </ol> <h2 id="prerequisites">Prerequisites</h2> <p>Before you start, make sure you have the following installed on your machine:</p> <ol> <li><a href="https://www.python.org/">Python</a></li> <li><a href="https://pypi.python.org/pypi/pandas/0.19.1/#downloads">Pandas</a></li> <li><a href="https://scrubadub.readthedocs.io/en/stable/">Scrubadub</a></li> <li><a href="https://socrata.github.io/datasync/">Socrata DataSync</a></li> </ol> <h2 id="loading-your-csv-with-pandas">Loading your CSV with Pandas</h2> <p>Create a dataframe from your local CSV file with Pandas:</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span> <span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'~/Dallas_Police_Officer-Involved_Shootings.csv'</span><span class="p">)</span> <span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span></code></pre></figure> <p><img src="/img/with-officer-name.png" alt="With Officer Names" /></p> <h2 id="remove-names-using-scrubadub">Remove names
using Scrubadub</h2> <p>Scrubadub is a simple package that will look for names and other identifying information, like email addresses, SSNs, and phone numbers.</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">scrubadub</span> <span class="n">scrub</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">scrubadub</span><span class="o">.</span><span class="n">clean</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s">'utf-8'</span><span class="p">),</span> <span class="n">replace_with</span><span class="o">=</span><span class="s">'identifier'</span><span class="p">)</span> <span class="n">df</span><span class="p">[</span><span class="s">'Officer(s)'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'Officer(s)'</span><span class="p">]</span><span class="o">.</span><span class="nb">apply</span><span class="p">(</span><span class="n">scrub</span><span class="p">)</span></code></pre></figure> <p><img src="/img/without-officer-name.png" alt="Without Officer Names" /></p> <div class="alert alert-warning"><p>Data cleansing is a <em>serious topic</em> and you should always work with your privacy or policy officers within your organization to make sure you are taking the correct steps to protect privacy.</p></div> <h2 id="write-cleansed-data-back-to-csv">Write cleansed data back to CSV</h2> <p>Finally, we’ll write our cleansed records back out to CSV:</p> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"~/Dallas_Police_Officer-Involved_Shootings.csv"</span><span class="p">,</span> <span 
class="n">encoding</span><span class="o">=</span><span class="s">'utf-8'</span><span class="p">,</span> <span class="n">index</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span></code></pre></figure> <p>Once you’re done, the cleaned data file can be used to update a dataset via DataSync. For more information, see its <a href="https://socrata.github.io/datasync/">detailed documentation</a>.</p>stuaganoThere’s an awesome Python package called Scrubadub that can help you remove personally identifiable information from text data. This is a great step to take before publishing a dataset that may contain PII, in order to prevent inadvertent disclosure.
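<p>The load, scrub, and write flow above can be sketched end to end with only the Python standard library. This is a hedged sketch, not the post’s actual pipeline: a trivial placeholder scrubber stands in for <code class="highlighter-rouge">scrubadub.clean(..., replace_with='identifier')</code>, and the column name and rows are made up:</p>

```python
import csv
import io

# Placeholder for scrubadub.clean(text, replace_with='identifier');
# a real pipeline would call the library instead.
def scrub(text):
    return "{{NAME}}"

# Hypothetical input; in practice this would be an open CSV file handle.
src = io.StringIO("Case Number,Officer(s)\n1,Jane Doe\n2,John Roe\n")
out = io.StringIO()

reader = csv.DictReader(src)
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
for row in reader:
    # Scrub only the column that contains names.
    row["Officer(s)"] = scrub(row["Officer(s)"])
    writer.writerow(row)

print(out.getvalue())
```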