Mark Lapierre

Using GitLab CI/CD for Automated Let’s Encrypt Certificate Deployment

2018-11-17T10:20:00-05:00

I previously wrote about setting up this blog on GitLab Pages¹. Back then I set up HTTPS manually. This post is about how I’ve automated that process using GitLab CI/CD and Justin Aiken’s jekyll-gitlab-letsencrypt plugin for Jekyll.

Last time I used certbot-auto (it was called letsencrypt-auto at the time) to manually obtain and verify my Let’s Encrypt certificate. If you have shell access to your webserver you can use certbot to automatically update your certificate. But I don’t have shell access; GitLab Pages are for static sites. However, all the steps involved in obtaining, installing, and verifying a Let’s Encrypt certificate can be automated as long as you have access to some kind of shell environment, and can publish the static results of the process.

That’s where GitLab CI/CD comes in.

It’s possible to write your own shell script and add it to your .gitlab-ci.yml file, but thankfully other people have already done that work. In this case, in the form of jekyll-gitlab-letsencrypt.

Configuring it involved adding it to my Gemfile:

group :jekyll_plugins do
  gem 'jekyll-gitlab-letsencrypt'
end

And then updating my _config.yml:

gitlab-letsencrypt:
  # Gitlab settings:
  gitlab_repo: 'mlapierre/mlapierre.gitlab.io' # Namespaced repository identifier

  # Domain settings:
  email: '[email protected]' # Let's Encrypt email address
  domain: 'marklapierre.net'   # Domain that the cert will be issued for

  # Jekyll settings:
  base_path: '/'  # Where you want the file to go
  pretty_url: true  # Add a "/" on the end of the URL
  filename: '_pages/letsencrypt_challenge.html'

  # Delay settings:
  initial_delay: 180 # How long to wait for Gitlab CI to push your changes before it starts checking

And then adding a job to my .gitlab-ci.yml:

renew-letsencrypt:
  stage: build
  script:
    - bundle install
    - bundle exec jekyll letsencrypt
  only:
    - schedules

Items under only restrict when the renew-letsencrypt job runs. With schedules listed there, the job runs only in scheduled pipelines.

And here is a scheduled pipeline that runs once a week:

Let’s Encrypt certificates expire after 90 days but you can make the request to renew them as frequently as you like (within limits that most people wouldn’t reach).

For the plugin to work it needs push access to my blog’s repository on GitLab, and for that it needs a personal access token. If you want to use jekyll-gitlab-letsencrypt locally and will never publish your _config.yml you can provide your personal access token via:

gitlab-letsencrypt:
  personal_access_token: 'ENTER SECRET HERE'

But if you use GitLab CI/CD you can provide it more securely via an environment variable, GITLAB_TOKEN. You can do that via Settings -> CI/CD -> Variables:

When the scheduled job executes, the plugin takes care of publishing the challenge file that the Let’s Encrypt service verifies, and then it uses the GitLab API to update my certificate in my GitLab Pages settings:

$ bundle exec jekyll letsencrypt
Configuration file: /builds/mlapierre/mlapierre.gitlab.io/_config.yml
Registering [email protected] to https://acme-v01.api.letsencrypt.org/...
Pushing file to Gitlab
Commiting challenge file as _pages/letsencrypt_challenge.html
Done Commiting! Check https://gitlab.com/mlapierre/mlapierre.gitlab.io/commits/master
Going to check http://marklapierre.net/.well-known/acme-challenge/d7D2-lvQqQJSolI42L9RSaOjQYBQrBSQsFWdzYELLJM/ for the challenge to be present...
Waiting 180 seconds before we start checking for challenge..
Got response code 200, file is present!
Requesting verification...
Challenge status = pending
Challenge is valid!
Updating domain marklapierre.net pages setting with new certificates..
           Success!
All finished!  Don't forget to `git pull` in order to bring your local repo up to date with changes this plugin made.
Job succeeded
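
The wait-then-check step in that log can be sketched roughly as follows. This is only an illustration in Python, not the plugin's actual (Ruby) code: initial_delay mirrors the setting in my _config.yml, while retry_delay, max_attempts, and the function names are assumptions made up for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Fetch `url` and return its HTTP status code (0 if unreachable)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except urllib.error.URLError:
        return 0


def wait_for_challenge(
    fetch_status: Callable[[str], int],
    url: str,
    initial_delay: float = 180,   # same idea as initial_delay in _config.yml
    retry_delay: float = 15,      # hypothetical value for this sketch
    max_attempts: int = 30,       # hypothetical value for this sketch
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once `url` answers with HTTP 200, False if we give up."""
    sleep(initial_delay)  # give the Pages pipeline time to publish the file
    for attempt in range(max_attempts):
        if fetch_status(url) == 200:
            return True
        if attempt < max_attempts - 1:
            sleep(retry_delay)
    return False
```

Calling wait_for_challenge(http_status, challenge_url) would block until the challenge file is published or the attempts run out. Passing the fetch and sleep functions in as arguments keeps the loop easy to try out without touching the network.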

And now I don’t have to bother manually updating my certificate. Many thanks to Justin and everyone else who did all the hard work!

1. In the interest of transparency, I should note that since writing that post I've started working for GitLab. More on that in another post.

Pros and Cons of Quarantined Tests

2018-06-06T16:10:00-04:00

Flaky tests, i.e., those that only fail sometimes, are the bane of any end-to-end automated test suite.

Another type of problem test is one that fails every time but which tests something that is deemed not important enough to fix right now. If you have to ignore some of the failed tests sooner or later you’re going to ignore one that you should have paid attention to. Or worse, you might decide to ignore them all because clearly no-one is fixing the bugs.

If a test is broken, fixing it should always be the first course of action, if possible. But what if some other task has a higher priority? If you’re confident that the problem is the test and not the software being tested, it might be reasonable to allow the test to keep failing, at least temporarily.

When you frequently ignore some failing tests, the whole suite is at risk of being seen as unreliable. A common way to prevent that is to quarantine the flaky/failing tests. Quarantine in this context refers to isolating the troublesome tests from the rest of the test suite. Not for fear of contagion, except in the sense of the negative impact they can have on the perception of the rest of the tests.

I think I first came across the concept in an article by Martin Fowler. It’s a great read on the topic of flaky tests and how to identify and resolve the causes of their flakiness. This post isn’t about how to fix them so check out that article if you’re after that kind of info.

More recently, an article on the Google Testing Blog mentioned the same technique for dealing with the same types of troublesome tests.

Even though quarantining tests can be a good temporary solution, if you don’t fix the tests (or the bugs) you can end up in the situation I mentioned before; a few failing tests create the impression that the entire suite is unreliable, enough so that you might consider them a death sentence.

My team and I try to avoid that death sentence in a few ways:

Report quarantined test results separately from the rest of the test suite.

That way everyone can see the results of the reliable tests and know that a failure there is something that should be looked at immediately. We don’t have to try to identify the “true” failures amongst the flaky ones.

Tag quarantined tests with a reason they’re quarantined.

So flaky tests get tagged as such. Failing tests that aren’t going to get fixed for a while get reported and tagged with the issue number. Comments can be added if the tag isn’t sufficient. This isn’t enough to rescue a quarantined test from oblivion, but it can help avoid the potential problem of losing track of why a test was quarantined.

Schedule a regular review of quarantined tests.

If it’s not scheduled it’s not likely to happen. Failing tests can be assigned to someone to fix if priorities change, and time can be invested in fixing a flaky test if we decide it’s more important than we first thought.

Delete the test.

If any test stays in quarantine for a long time it would be worthwhile rethinking the value the test provides. Maybe it turns out that unit tests, or even exploratory tests, provide enough coverage. Or the test might cover a part of the software that rarely changes, or which doesn’t get much use. In that case if there is a regression it’s not a big deal. We might @Ignore the test and leave a comment explaining why—instead of deleting it—if it seems likely someone might decide to write the test again.
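
The first two practices (separate reporting, and tagging with a reason) could look something like the sketch below. It is not our actual tooling; the test names, the issue number, and the structure of the registry are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Quarantine:
    reason: str                  # e.g. "flaky" or "known failure"
    issue: Optional[str] = None  # tracker reference, if there is one
    note: str = ""               # free-form comment when the tag isn't enough


# Hypothetical registry: test name -> why it is quarantined.
QUARANTINED: Dict[str, Quarantine] = {
    "test_checkout_timeout": Quarantine("flaky"),
    "test_legacy_export": Quarantine("known failure", issue="#1234"),
}


def split_report(results: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Split failures into (reliable, quarantined) given name -> passed."""
    reliable: List[str] = []
    quarantined: List[str] = []
    for name, passed in sorted(results.items()):
        if passed:
            continue
        if name in QUARANTINED:
            entry = QUARANTINED[name]
            tag = entry.reason + (f" {entry.issue}" if entry.issue else "")
            quarantined.append(f"{name} [{tag}]")
        else:
            reliable.append(name)
    return reliable, quarantined
```

Anything in the first list needs attention straight away; the second list is what gets looked at during the scheduled review.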

How do you deal with flaky or failing tests that don’t get fixed quickly?

Inattentional Blindness and Scripted Tests

2018-05-08T17:40:00-04:00

My previous workplace was a large organisation in which many testers were employed to evaluate the quality of the software we developed and maintained. We had a collection of scripted test cases that testers would follow step-by-step. That worked reasonably well, although I was employed to help automate our processes, including testing, which contributed significantly to the quality of our software.

When I started working at my current workplace I found similarly detailed scripted test cases. Part of my responsibilities included manual testing, so I thought I could test the way I was familiar with, and how the rest of my colleagues tested—just follow the test cases. It hasn’t worked well. We find bugs, for sure, but as I’ve grown in experience I’ve found more and more problems with the software that had been there for a long time, through many versions of the software and through many executions of test cases that should have revealed them.

There are at least a few things that I think explain why we, including my past self, failed to identify problems:

Out-of-date test cases. Change happens constantly and we have too many tests with too much detail for our small QA team to keep up with.
Treating test cases like an instruction manual. It’s relatively easy for an experienced tester to follow the steps of a scripted test case to the letter, assuming the steps are accurate. That was our standard practice. But it’s even easier to miss out on opportunities to reveal bugs if you do that.
Overwhelming detail. Many test cases are so long, verbose, and complicated that it’s very easy to miss important details in the steps and expected results, especially when you’re under pressure to get the job done quickly.
Unnecessarily specific detail. Often a test case instructs the tester to use a particular element of the UI in a particular way. E.g., “enter a value in the Account text field and click the Validate button at the bottom of the panel.” That sort of specificity means the other fields are likely to be ignored, along with all the other ways that validation could be triggered. And that problem is in addition to making it hard to keep the test case up to date (because sooner or later, that Validate button is going to move, or be removed entirely).

The last three points have something in particular in common. They all trigger a cognitive phenomenon called inattentional blindness. It’s something that we all experience, whether we’re aware of it or not¹. A well-known demonstration of the phenomenon comes from a psychological study and you can perform the experiment from the study yourself by watching a video and following the instructions at the start:

If you haven’t done so already, I strongly recommend you do the experiment before you read on. This is something you really only get to experience once, although there are variations of it. The underlying effect, though, is something you’ll likely experience again and again in real life.

The study and others like it find that half of the time on average, people fail to notice the unexpected element. They’re asked to perform a task and they’re focused so intently on it that they fail to perceive something they’re looking right at. It’s one of the reasons using your phone while driving is so dangerous, even hands-free; if your attention is on the text/app/call it’s not on the road.

That kind of inattentional blindness is exactly what can happen when you follow a scripted test. You focus on the steps you have to follow and the results you have to check for and you fail to notice anything else. The software can behave in unexpected ways, but you might miss it if you’re only paying attention to what the test case says should happen. Even if you’re looking right at the problem. Missing the unexpected becomes more likely the longer you’re doing the task and the more anxious you are about completing it quickly.

I’m certainly not the first to draw the link between inattentional blindness and scripted tests. Michael Bolton, for one, has mentioned it a few times on his blog. But it’s particularly relevant to me now as I try to improve our testing practices. Whatever changes we make, we need to be aware of the potential for inattentional blindness.

For now it’s clear to me that the scripted, highly-detailed test cases we’re used to at my workplace are getting in the way of us improving the quality of our software. Part of the solution is a more exploratory approach to testing. One suggestion is that instead of following a test case step-by-step, you could:

glance over the test case; try to discern the task that’s being modeled or the information that’s being sought; then try to fulfill the task without referring to the test case. That puts the tester in command to try the task and stumble over problems in getting to the goal. Since the end user is not going to be following the test case either, those problems are likely to be bugs.

For those interested, there is more information about inattentional blindness on Scholarpedia, including references to the original research publications.

How do you avoid inattentional blindness while you’re testing? I haven’t set up a commenting system here yet but this article is cross-posted to dev.to and you’re welcome to comment there.

1. Technically, we're never aware of it. That's the inattentional part.

From GitHub to GitLab pages

2018-05-01T19:10:00-04:00

I started this site on GitHub Pages because:

I was already using GitHub
I didn’t want to have to deal with hosting
It’s free

But I’d been meaning to set up my own domain for a while¹ and although GitHub Pages supports custom domains it didn’t seem to support HTTPS on custom domains. And there’s no way I’m setting up a web site in 2018 without HTTPS.

It turns out that support for HTTPS with custom domains is gradually being rolled out now, and it’s possible to set it up yourself. But there’s no mention of this in the documentation, and word from GitHub Support doesn’t suggest it’ll be official any time soon. And it hasn’t been rolled out to my account yet.

So I looked into GitLab as an alternative. I liked what I found. Not only is there extensive documentation, but there’s also an official tutorial.

I won’t say it was easy, but unless something has gone terribly wrong you’re reading this over a secure connection certified by Let’s Encrypt.

I followed the instructions to add my domain to my GitLab Pages settings, and to configure the DNS records for my domain. There are a few helpful links to how-to pages for specific hosts, but none for my registration service. Fortunately, the instructions provided were sufficient.

Before I delved into HTTPS I wanted to make sure the domain setup was behaving as expected. It was not. When I tried to open a link to my site I got a 404 back. The DNS record was directing the request to the correct server, but the server wasn’t associating the request with my GitLab Pages repository. It was then that I realised the project name still ended in github.com, not gitlab.com. When I imported my GitHub repository, the import function on GitLab copied the original repo’s name, ignoring the new name it allowed me to enter. No matter, I thought, there’s an option to change the name. So I did that. But still my site threw up a 404. It seems that I’d only changed the name of my project, i.e., the display name. I had to hunt down the setting to change the name of the repository itself (Settings > Advanced Settings > Rename repository). But once I’d updated that I could finally access my site via marklapierre.net.

And, finally, the HTTPS configuration. I followed the tutorial and ran the letsencrypt-auto CLI to set up a challenge response that would confirm for Let’s Encrypt that I control marklapierre.net. Unfortunately, at some point after the instructions were written GitLab Pages and Jekyll stopped accepting permalinks with dots in them. Fortunately, someone pointed this out in the comments on the tutorial along with the solution—end the permalink with / not .html.

So now I have a secure site on my own domain. Huzzah!

[Update: on the same day I published this, GitHub announced support for HTTPS on custom domains. Nice.]

1. Someone else got marklapierre.com. Last update in 2014? C'mon Mark, you're killing me!

Science and software testing

2018-04-22T19:31:00-04:00

Software testing, particularly manual software testing, is sometimes thought of as nothing more than following a script to confirm that the software does what it was designed to do. From that perspective, testing might seem like a boring and relatively mindless task. And to be honest, that is the traditional view of testing as part of the Waterfall method of software development in large organisations. Division of labour meant that there were some people who did nothing but follow scripts someone else had written, and report bugs that someone else would fix.

Science, on the other hand, is undeniably interesting and challenging. So if you share the impression that software testing is boring, you might be surprised to know that I find both engaging and worth spending my time and effort on¹. Having worked as a software tester, and having studied a scientific field (cognitive science), I’ve noticed some similarities that help explain why I’m drawn to both pursuits despite their apparent lack of similarity.

Science can be defined as:

“The intellectual and practical activity encompassing the systematic study of the structure and behaviour of the physical and natural world through observation and experiment.”

That doesn’t seem to describe following testing scripts at all. Even if you swap “the physical and natural world” for “the software under test”, and even if you include the task of writing scripts. But if you consider the entire process of software testing you’ll see similarities emerge. For one thing, test scripts have to be written based on something, and in today’s world of agile software development, that something is usually not requirements handed down from designers, but rather requirements explored, developed, and refined iteratively. Observation and experiment are a big part of that iterative process. This is especially the case when working on existing software that doesn’t have good documentation: how else could you figure out how the software works except through observation and experiment? Even if you have access to the code, it’s unlikely you could read the code and know exactly how the software will behave. And there isn’t always someone else around to ask.

The reality of software testing is a lot more than following a script. A more complete definition of testing is that:

“Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.”

When defined that way, it’s much clearer how testing and science are similar. Questioning, study, modeling, observation, and inference are all core aspects of science and testing.

In testing, we question whether the software does what we expect it to do. We question whether it does what customers want it to do and in the way they want. We question whether a code change has unintended effects. We study how the software behaves under various conditions. We construct models of how we believe the software performs, even if they’re only mental models. We observe how the software responds to input. And ultimately we make inferences about the quality of the software.

Another similarity between science and software testing is that neither process truly has an end. There is always more to discover through science, even at the end of a project that has produced significant insights. And there is always more to learn about any but the simplest software. In the case of science and testing, it’s not meaningful to think of the entire process as having a goal, but it is necessary to define a reasonable milestone as the completion of a project. We don’t finish testing when there are no bugs, because that will never happen, but we can consider testing complete when the software behaves well under a reasonable range of scenarios.

Science is often described as trying to prove things². That is not the aim of science, nor is it how science works. Science is, in part, a way of trying to better understand the world. And science is the knowledge produced by that process. The scientific method involves making a hypothesis and then gathering evidence and analysing data to draw conclusions about whether the hypothesis is supported. It’s possible to find evidence that rules out a hypothesis, but it’s not possible to find evidence that a particular hypothesis is the only explanation for the data. This is because other hypotheses might explain that evidence just as well, including hypotheses that no-one has come up with yet. But after carefully analysing the results of many experiments a clearer understanding can begin to emerge (in the form of a theory). In that way you can think of science as showing what doesn’t work until there’s a reasonably solid explanation left. It’s not about being right; it’s about being less wrong.

Similarly, testing isn’t about proving that the software is bug free; it’s about providing evidence that you can use the software without any significant issues, so that what’s left is reasonably solid. It’s also not about proving that the software does exactly what the customer wants, but it is about helping to iteratively improve the customer’s satisfaction with the software. This is an important part of software testing that’s sometimes forgotten—the aim isn’t solely to find bugs, but also to find unexpected, unusual, or confusing behaviour.

On the other hand, there are plenty of ways in which science and testing are different. But I’ll leave that for another post.

1. Not so much manual testing specifically, but a comprehensive approach to testing that includes exploratory testing and automation.

2. Do a search for "science proves". It's enough to make a scientist or philosopher or mathematician cry.

We’re neither rock stars nor impostors

2017-11-23T17:36:00-05:00

Recently, Rach Smith raised some important points about how we tend to talk about impostor syndrome:

it minimizes the impact that this experience has on people that really do suffer from it.

we’re labelling what should be considered positive personality traits - humility, an acceptance that we can’t be right all the time, a desire to know more, as a “syndrome” that we need to “deal with”, “get over” or “get past”.

If you haven’t read her post yet I highly recommend you do. The issue came up again during Rach’s chat with Dave on Developer on Fire.

I can’t truly say I’ve experienced impostor syndrome, although I suspect that’s mostly because I’ve often been in small teams where everyone was similarly skilled. For example, I was once one of two novice web developers in a product development team. We really didn’t know what we were doing. I did feel unqualified, but since there was no one more experienced to compare myself against I didn’t feel like an impostor. But I did suffer from low self-confidence and a huge pile of self-doubt. Fortunately, experience and education have helped me come to grips with the limits of my knowledge and ability. I’m sure that self-awareness has contributed to better performance independently of any increase in my skills.

It all got me thinking about my experience with how jobs are advertised and how interviews are conducted, about the pressure to elevate one’s technical skills, about the growing awareness of the importance of “soft” skills, and about the rock star culture that’s promoted in some parts of the industry.

Rach noted that even highly successful senior developers sometimes experience self-doubt and the awareness of gaps in their knowledge. This is something that is all too often missing from discussions about preparing for interviews, especially for highly sought-after positions. We’re always told to prepare extensively (good advice), and to project confidence (sure, projecting a lack of confidence is understandably unhelpful), but the highest quality advice also points out the importance of awareness of the limits of one’s skills and knowledge so that they can be appropriately managed. Much of the advice I remember from my early days suggested I should do my best to cover up my weaknesses. I don’t believe that did anything but lead to feelings of insecurity and inevitably falling apart when the limits of my knowledge were revealed. Later, I received much better advice; to be able to say “I don’t know,” and then to work through the problem aloud, asking questions to fill in the gaps until I do have enough understanding to give a reasonable answer. And isn’t that more or less how we work each day? If anyone actually had the supreme skills and confidence we’re naively advised to portray during interviews, I’m pretty sure they wouldn’t find the job challenging or interesting enough (and would likely inflict their arrogance and the consequences of their boredom on the rest of us).

Another topic missing from good career advice, fortunately less common these days, is the importance of soft skills. As Rach noted, “the most accomplished developers [have] constant awareness of the ‘gap’ in their knowledge and willingness to work towards closing it.” That sort of awareness is as important a soft skill as general social and communication skills. It’s a key part of metacognition. The people I’ve experienced most joy in working with are those who freely admit their limitations and strive daily towards eliminating them. That effort shows in their contributions at work that go above and beyond the explicit requirements of their role. Among the worst people to work with are those who do the minimum work required, without any awareness of the opportunities for improvement that pass them by every day. Even worse are those who perform at a similar level while believing that they are in fact contributing much more and at a much greater degree of competence¹. The latter type of person is unlikely to experience anything that might be called “impostor syndrome”, although if anyone were truly an impostor, it would be them.

Beyond a growing understanding of the importance of interpersonal soft skills, there are many other non-technical skills that make a solid team member. For example, the O*NET database shows active learning towards the top of a list of skills seen as important for a programmer². And yet typical hiring practices overwhelmingly reflect the prioritisation of immediate technical skills. I’m confident that’s a big part of the reason “rock star” developers are those seen as having the greatest skills rather than being most able to learn or improve. And yet the former doesn’t imply the latter, especially if those great skills lie in one highly specific domain; you can learn to do one thing really well without being able to generalise that skill, nor does it mean you possess other distinct but important skills. Other downsides of specialisation are a topic for another post.

Similarly, the poor attitudes and bad behaviours of some workers are accepted because of their technical skills, despite the negative impact they have on the people around them. I suspect this might be a subtle influence on feeling like an impostor; we provide a perverse incentive for people to behave in ways that no reasonable person wants to. Our industry favours those who promote themselves as the best coder, the most knowledgeable developer, the ideal technical candidate, and we (at least implicitly) discourage people from embracing their range of skills and their ability to improve.

1. The Dunning-Kruger effect in effect, so to speak.

2. Although communication skills are apparently the #1 requirement in computing-related job ads, other soft skills and transferable technical skills are far less frequently mentioned.

Summer of Data Science 2017 - Final Update

2017-09-09T16:58:00-04:00

Ok, so it’s not summer any more. My defence is that I did this work during summer but I’m only writing about it now.

To recap, I’d been working on a smart filter; a system to predict articles I’d like based on articles I’d previously found interesting. I’m calling it my rss-thingy / smart filter / information assistant¹. I’m tempted to call it “theia”, short for “the information assistant” and a play on “the AI”, but it sounds too much like a Siri rip-off. Which it’s not.

Aaaanyway, I’d collected 660 interesting articles and 801 that I didn’t find interesting–fewer than expected, but I had to get rid of some that were too short or weren’t articles (e.g., lists of links, or github repositories). There was also a bit of manual work to make sure none of the ‘misses’ were actually ‘hits’. I.e., I didn’t want interesting articles to turn up as misses, so I skimmed through all the misses to make sure they weren’t coincidentally interesting (there were a few). The hits and misses then went into separate folders, ready to be loaded by scikit-learn.

I used scikit-learn to vectorise the documents as a tf-idf matrix, and then trained a linear support vector machine and a naive bayes classifier. Both showed reasonable precision and recall upon my first attempt, but tests on new articles showed that the classifier tended to categorise articles as misses, even if I did find them interesting. This is not particularly surprising; most articles I’m exposed to are not particularly interesting, and such simple models trained on a relatively small dataset are unlikely to be exceptionally accurate in identifying them. I spent a little time tuning the models without getting very far and decided to take a step sideways before going further.
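
For anyone curious what that looks like, here is a minimal sketch of the same kind of setup. It is not the code from my experiment: the tiny in-memory corpus stands in for the folders of hits and misses, the default model settings are assumptions, and scoring new articles with decision_function or predict_proba is just one possible way to get a ranking instead of a yes/no answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the real corpus (hypothetical examples).
hits = [
    "new brain research explains how attention shapes memory",
    "automated testing strategies for flaky end-to-end suites",
    "a data science guide to tuning support vector machines",
]
misses = [
    "celebrity gossip roundup from the weekend awards show",
    "ten holiday shopping deals you should not miss",
    "local sports team wins the season opener",
]
texts = hits + misses
labels = [1] * len(hits) + [0] * len(misses)  # 1 = interesting, 0 = not


def train(classifier):
    """Fit a tf-idf vectoriser plus the given classifier on the toy corpus."""
    model = make_pipeline(TfidfVectorizer(stop_words="english"), classifier)
    model.fit(texts, labels)
    return model


svm_model = train(LinearSVC())
nb_model = train(MultinomialNB())

new_articles = [
    "research on memory and attention in the brain",
    "weekend shopping deals and celebrity news",
]

# Scores rather than labels: higher means "more likely to be interesting".
svm_scores = svm_model.decision_function(new_articles)
nb_scores = nb_model.predict_proba(new_articles)[:, 1]
ranked = sorted(zip(svm_scores, new_articles), key=lambda pair: pair[0], reverse=True)
```

With a corpus this small the scores mean very little; the point is only the shape of the pipeline.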

Eventually I’ll want to group potentially interesting articles, so I wrote up a quick topic analysis of the articles I liked, comparing non-negative matrix factorization with latent dirichlet allocation. They did a reasonable job of identifying common themes, including brain research, health research, science, technology, politics, testing, and, of course, data science.

You can see the code for this informal experiment on GitHub.

In my next experiment (now, not SoDS18!) I plan to refine the predictions by paying more attention to cleaning and pre-processing the data. And I need to brush up on tuning these models. I’ll also use the trained models to make ranked predictions rather than simple binary classifications. The dataset will be a little bigger now at around 800 interesting articles, and a few thousand not-so-interesting.

1. Given all the trouble I have naming things, I'm really glad I haven't had to do any cache-invalidation yet.

Summer of Data Science 2017 - Update 1

2017-06-28T17:11:12-04:00

My dataset/corpus is coming together.

It was relatively easy to create a set of text files from the articles I’d saved to Evernote. It’s taking more time to collect a set of articles that I didn’t find interesting enough to save. I’ll make that easier in the future by automatically saving all the articles that pass through my feed reader, but for now I’m grabbing copies from CommonCrawl. This saves me the trouble of crawling dozens of different websites, but I still have to search the CommonCrawl index to find articles among everything else in the index from each site.

I created a list of all the sites I’d saved at least one article from, then I downloaded the CommonCrawl index records for each site from the last two years. Next I filtered the records to include only pages that were likely to be articles (e.g., no ‘about’ or ‘contact’ pages, etc.). I took a random sample of up to 100 of the records remaining for each site and downloaded the WARC records, and then extracted and saved each article’s text. I’ll make all the code available once I’ve polished it a little.
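
Until then, the filter-and-sample step can be sketched like this. It is a simplified stand-in for what I actually ran: it works on plain URLs rather than real CommonCrawl index records, and the list of path segments treated as "not an article" is an assumption for the example.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlparse

# Hypothetical heuristics: path segments that suggest a page is not an article.
NON_ARTICLE_SEGMENTS = {"about", "contact", "tag", "tags", "category", "author", "search"}


def looks_like_article(url: str) -> bool:
    """Rough guess at whether a URL points to an article."""
    path = urlparse(url).path.strip("/")
    if not path:
        return False  # a site's home page is not an article
    segments = path.lower().split("/")
    return not any(segment in NON_ARTICLE_SEGMENTS for segment in segments)


def sample_per_site(urls: Iterable[str], limit: int = 100, seed: int = 0) -> Dict[str, List[str]]:
    """Keep likely articles and take a random sample of up to `limit` per site."""
    by_site: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        if looks_like_article(url):
            by_site[urlparse(url).netloc].append(url)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return {
        site: rng.sample(found, min(limit, len(found)))
        for site, found in by_site.items()
    }
```

The sampled URLs are what would then be looked up and downloaded as WARC records.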

The next step will be to explore the dataset a little before diving into topic analysis.

Summer of Data Science 2017

2017-06-14T17:45:17-04:00

Goal: To launch* my learn’ed system for coping with the information firehose

I heard about the Summer of Data Science 2017 recently and decided to join in. I like learning by doing so I chose a personal project as the focus of my efforts to enhance my data science skills.

For the past forever I’ve been stopping and starting one side-project in particular. It’s a system that searches and filters many sources of information to provide me with articles/papers/web pages relevant to my interests. It will use NLP and machine learning to model my interests and to predict whether I’m likely to find a new article worthwhile. Like a recommender system but just for me, because I’m selfish. Something like Winds. The idea is to collect all the articles I read/skim/ignore via an RSS reader, and tag those I find interesting. And to build up a Zotero collection of papers of several degrees of quality and interest. Those tagged and untagged articles and papers will comprise my datasets. There is a lot more to this project, but that’s the core of it.

My first (mis)step was to begin building an RSS reader that could automatically gather data on my reading habits that I could use to infer interest based on my behaviour; whether I clicked a link to the full article, how long I spent reading an article, whether I shared it, etc. Recently I decided that was not the best use of my time, as it would be much easier to start with explicitly tagged articles–I can start gathering those without creating a new tool. So I’m doing that by saving interesting articles to Evernote. Today I have just under 900. I can use CommonCrawl to get all the articles I didn’t find interesting on the relevant sites (i.e., the articles that would have appeared in my RSS reader, but which I didn’t save).

There are many things I’ll need to do before I’m done, but all of those depend on having a dataset I can analyse. So my next step will be to turn those Evernote notes and other articles into a dataset suitable for consumption by NLP tools. Given the tools available for transforming text-based datasets from one format to another, I’m not going to spend much time choosing a particular format. I’ll start with a set of plain-text copies of each article and associated metadata, and take it from there.

I’ve been less consistent in gathering research papers. I’ve been saving the best papers I’ve stumbled across, but I could do much better by approaching it as a research project, i.e., do a literature review. That’s a huge task so I’ll focus on analysing web articles first.

*I was going to write "complete" but really, it'll always be changing and will probably never be complete. But ready for automated capture and analysis? Sure, I can make that happen.

Hello, World!

2017-06-13T18:05:00-04:00

Hello, World!

For a programmer this is a mandatory thing. No apologies.
