Tech Leadership Weekly, Issue 3

A weekly dose of management, process, and leadership.


Marcus Blankenship
The Real Work of (Software) Management

As managers, instead of producing code, we produce cohesive, productive teams. Our actions need to be force multipliers, and our focus is on delivering value as a team. It can be hard to transition away from writing code, but as a leader, it’s important that you do. Marcus offers some excellent advice on what you need to focus on as a manager.

Read Time: 10 minutes


Dwaine Smith
Mentoring Junior Developers

In my experience, hiring junior developers is easy (compared, of course, to hiring senior developers), but growing and developing junior developers is the real challenge. Dwaine Smith offers some great advice on this point. Respect, challenging opportunities, education, and feedback are important for developing and shaping the more junior members of your team.

Read Time: 5 minutes


Bernard Golden (CIO)
4 principles that will shape the future of IT

At this point, most companies leverage software as a competitive advantage, even if they don’t call themselves software companies. Amazon is a software company that sells pretty much everything, Uber is a software company that provides on-demand transportation, and Netflix is a software company that provides video rentals and streaming movies. Companies are rapidly evolving into software companies serving a particular niche. The practices and processes we currently use on our software teams will continue to expand into the larger organization. The future is bright.

Read Time: 10 minutes


Jeremiah Dillon (Fast Company)
Read This Google Email About Time Management Strategy

Just because we’re not makers doesn’t mean we shouldn’t think like makers. Just as a developer needs solid blocks of uninterrupted time to work effectively, managers need uninterrupted time to tackle tasks like planning, scheduling, writing, brainstorming, and relationship building. This post is a great reminder that as managers, we shouldn’t schedule our work in 30-minute blocks.

Read Time: 5 minutes


This content originally appeared in the Tech Leadership Weekly Newsletter. Sign up for a weekly dose of management, process, and leadership delivered to your inbox every Wednesday at Tech Leadership Weekly.

Posted in Agile, Management, Process, Project Management, tech leadership weekly | Tagged | Leave a comment

Tech Leadership Weekly, Issue 2

A weekly dose of management, process, and leadership.  Issue 2.


Joel Spolsky
The Identity Management Method

As Joel points out, there are a couple of approaches to management. Getting your team to intrinsically understand and embrace the goals of the organization is hard, but it is the most effective way to lead, especially a technical team. For contrast, I recommend reading the ‘Command and Control Management’ and ‘Econ 101 Management’ posts as well; they are referenced in the article.

Reading Time: 5 minutes (15 minutes for all three posts)


First Round
This Product Prioritization System Nabbed Pandora 70 Million Monthly Users with Just 40 Engineers

The business landscape around us changes quickly, especially for those organizations in the technical space. This post provides a great perspective on how to prioritize, and then act on marketplace opportunities in a timely fashion.

Reading Time: 10 minutes


Zach Holman
How to Deploy Software

Unless releasing code is fantastically boring, you’re probably doing it wrong. A rapid ‘release and iterate’ cycle allows organizations to maximize opportunities. Streamlining the deployment process can offer a high return on investment. It results in better code, fewer production bugs, and lets a development team move smart and fast. This is a long post, but offers a ton of valuable insight.

Reading Time: 20 minutes


This content originally appeared in the Tech Leadership Weekly Newsletter. Sign up for a weekly dose of management, process, and leadership delivered to your inbox every Wednesday at Tech Leadership Weekly.

Posted in Agile, Management, Process, Project Management, tech leadership weekly | Tagged | Leave a comment

Tech Leadership Weekly, Issue 1

A weekly dose of management, process, and leadership.
Issue 1, March 2, 2016


Marcus Blankship
The Case for Weekly Meetings, Why an Old-School Schedule Gets Leading-Edge Results

A great case for the value of weekly check-in meetings with each of your people. Management is a two-way street. As a manager, it’s important that you listen as well as lead, and weekly meetings are an important part of achieving this. They provide time for your team members to communicate concerns, and for you to provide direction, clarification, and guidance.

Reading time: 3 minutes.


Fast Company
Former Googler Lets Us In On The Surprising Secret To Being A Good Boss

An interesting take on the radical level of candor it can take to become a good boss. Brutal honesty like this is going to be hard for most people to deliver. As a leader, it’s critical to set clear expectations for those on your team. People need feedback to understand what they are doing well and what needs to change. Above all, caring for and understanding your people will make you a better leader.

Reading time: 3 minutes.


Chad Fowler
Killing the Crunch Mode Antipattern

A herculean effort to hit a deadline is really a planning, communication, and management problem. It takes a huge toll on your people and ultimately leads to lower-quality software. Chad Fowler, one of the great technical thought leaders, shares some insight into the business downsides of crunch mode, its causes, and how both leaders and team members can attack those causes before it’s too late.

Reading time: 5 minutes.

Sign up for a weekly dose of management, process, and leadership delivered to your inbox at Tech Leadership Weekly.

Posted in Agile, Management, Process, Project Management, tech leadership weekly | Tagged | Leave a comment

Including Gem Rake Tasks in Sinatra

I learned today that Sinatra doesn’t automatically load Rake tasks from included gems (Rails has Railties, which make it easy for a gem builder to add Rake tasks from a gem into your Rails project). Some searching of the interwebs turned up an old gem (Alltasks), which unfortunately didn’t work but offered some hints. The solution turned out to be pretty straightforward. Add the following to the Rakefile in your Sinatra application:

# Rakefile

require 'bundler'
Bundler.setup
Bundler.load.specs.each do |spec|
  spec.load_paths.each do |load_path|
    Dir.glob("#{load_path}/**/*.rake").each do |rake_file|
      load rake_file
    end
  end
end

The above code will load Bundler, step through all the gems you’ve included in your Gemfile, and load any Rake files found within any of your gems. Running rake -T should display tasks from gems you are using in your project.

Posted in Gems, ruby, Sinatra | Leave a comment

Heroku DevOps with Heroku Builder

Heroku Builder can be leveraged in two ways to improve your development and deploy workflow on Heroku: Config Variable management, and full app setup and management. As it’s far more likely you have an existing application, let’s start by looking at Config Variable management.

Before we begin, let’s install Heroku Builder. Add it to your Gemfile:

gem 'heroku_builder'

And install:

$ bundle install

Now, generate a configuration file (it will be placed in config/heroku.yml):

$ rake builder:init

Open up the config/heroku.yml file, and take a look. The contents will look something like this:

staging:
  app:
    name: my-heroku-app-name-staging
    git_branch: staging
  config_vars: []
  addons: []
  resources:
    web:
      count: 1
      type: Free

production:
  app:
    name: my-heroku-app-name
    git_branch: master
  config_vars: []
  addons: []
  resources:
    web:
      count: 1
      type: Free

As we’re only managing Config Variables, remove addons and resources. You should end up with something like:

staging:
  app:
    name: my-heroku-app-name-staging
    git_branch: staging
  config_vars: []

production:
  app:
    name: my-heroku-app-name
    git_branch: master
  config_vars: []

Update the staging app name and production app name to be that of your staging and production Heroku applications. If you don’t use ‘staging’ and ‘master’ for your staging and production branches, you should also update the git branches to match yours. If the staging application is called ‘foo-staging’ and the production application is called ‘foo’, your config/heroku.yml file should look something like this:

staging:
  app:
    name: foo-staging
    git_branch: staging
  config_vars: []

production:
  app:
    name: foo
    git_branch: master
  config_vars: []

Now let’s look at how you might use Heroku Builder to implement a feature flag. As a silly example, let’s look at using a feature flag to redirect to the index view rather than the show view after an object is created:

class BarsController < ApplicationController
  ...
  def create
    @bar = Bar.new(bar_params)

    if @bar.save
      redirect_to redirect_path
    else
      render :new, :status => :unprocessable_entity
    end
  end

  ...

  private

  def redirect_path
    if ENV.fetch('REDIRECT_TO_INDEX', 'false').match(/true|on/)
      bars_path
    else
      bar_path(@bar)
    end
  end
end

So now we have our feature completed, tests (not included here) written, and the code reviewed. Let’s make sure our feature flag is set in config/heroku.yml:

staging:
  app:
    name: foo-staging
    git_branch: staging
  config_vars:
    - REDIRECT_TO_INDEX: on

production:
  app:
    name: foo
    git_branch: master
  config_vars:
    - REDIRECT_TO_INDEX: on

The advantage of keeping environment configuration in source code is twofold. First, it allows configuration changes to be reviewed as part of the pull request development flow. Second, you can couple configuration changes with the code that requires them. When you deploy your code, you also deploy the configuration that code requires.

Assuming this code has been merged into our staging branch, let’s deploy it!

$ rake builder:staging:apply

This will set the Heroku Config Variables and deploy code from the head of your staging branch to the foo-staging app on Heroku. Be aware that prior to pushing code to Heroku (using git push), Heroku Builder will pull changes from the remote branch you’ve defined in the config file.

When it’s ready, deploying to production follows a similar pattern:

$ rake builder:production:apply

We can still do a bit of cleanup on our config/heroku.yml file to DRY it up. Let’s use YAML node anchors to define common configuration:

config_defaults: &config_defaults
  REDIRECT_TO_INDEX: on
  SUPER_SECRET_KEY: <%= ENV['LOCAL_SECRET_KEY'] %>

staging:
  app:
    name: foo-staging
    git_branch: staging
  config_vars:
    <<: *config_defaults

production:
  app:
    name: foo
    git_branch: master
  config_vars:
    <<: *config_defaults

Now, as our list of Config Variables grows, we don’t need to update them under each environment; we can set the global ones under config_defaults and optionally override them in a particular environment. ERB tags are evaluated in the YAML file, so secret credentials can be stored locally without checking them into source control.
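
For example, with the secret exported in your local shell (the value below is just a placeholder), a staging deploy might look like:

$ LOCAL_SECRET_KEY=your-secret-value rake builder:staging:apply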

Hopefully this has given you a little insight into Heroku Builder and how you can use it to include project configuration in your project code base. In the next post, we’ll look at using Heroku Builder to create a multi-environment application on Heroku.

Posted in deployment, dev ops, Heroku, ruby | Leave a comment

Scheduling Heroku Free Node Downtime

My current company takes advantage of Heroku’s free nodes for running non-production environment applications. Recently we ran into an issue where nodes were not reliably sleeping for the required six hours. I put together a small Rake file we use with the Heroku Scheduler to ensure apps sleep during the night.

Setup:

First off, add the Heroku Platform API gem to your Gemfile:

gem 'platform-api'

Install:

$ bundle install

Now create a Rake file:

# lib/tasks/heroku.rake

require 'platform-api'

namespace :heroku do
  def conn
    @conn ||= PlatformAPI.connect_oauth(ENV['HEROKU_API_KEY'])
  end

  def scale_to(count)
    %w{web worker}.each do |type|
      conn.formation.update(ENV['HEROKU_APP_NAME'], type, 'quantity' => count)
    end 
  end

  desc 'shut down nodes for mandatory downtime'
  task :shutdown do
    scale_to(0)
  end

  desc 'start nodes back up'
  task :startup do
    scale_to(1)
  end
end

You’ll need to set a couple of environment variables locally and in each Heroku application environment you want to bring up and down:

  • HEROKU_APP_NAME – The name of the application as listed in Heroku.
  • HEROKU_API_KEY – Your Heroku API key. You can generate a key with the following command:
$ heroku auth:token

 

Check your code in and deploy it to Heroku. Next, log into Heroku.

Configure Scheduler:
If you don’t have the Heroku Scheduler Add-on enabled, add it to the application.

Shutdown: Click the Scheduler Add-on, and click the ‘Add new job’ button to add a shutdown task:

$ rake heroku:shutdown

Select ‘Daily’ for the frequency. Select the time (UTC) that you’d like your application to go offline. Click the ‘Save’ button to save the job.

Startup: In the Scheduler view, click the ‘Add new job’ button to add a startup task:

$ rake heroku:startup

Select ‘Daily’ for the frequency. Select the time (UTC) that you’d like your application to come back online (Heroku free nodes require six hours of downtime). Click the ‘Save’ button to save the job.

That’s it. Now you can run your non-production application environments on free nodes and ensure they’re available when you need them to be.

Posted in dev ops, Heroku, ruby, sys admin | Leave a comment

Planning a Sprint

Before starting a sprint, it’s important to decide how much work can fit into the upcoming sprint. As a team, we’ve chosen to use hours as the measure of work. We start by prioritizing work in the backlog, ensuring the big tasks important to the next release are near the top of the backlog.

All stories, tasks, and bugs you’re planning to address in the coming sprint should have hour estimates associated with them. I recommend estimating in the pessimistic but realistic range. We always estimate as a team to ensure a diverse set of skills and experiences are taken into account. It’s helpful to provide hour estimates for a bit more than a sprint’s worth of stories to allow some movement of smaller tasks in and out of a sprint.

Once you have estimates for tasks, work backwards from the release date. We release every two weeks, going to production on Wednesday. This means we need to release to staging on Monday for regression testing. A Monday push to staging means code freeze needs to be the Friday before. As a new sprint technically starts on a Wednesday with the release to production, we have eight working days in our sprints (the other two are technically QA and bug fixes, but usually we start work on the next sprint). We plan for 30-32 working hours a week. Meetings are not counted as part of working hours. Look at any planned unavailability on your team. If someone has scheduled time off, sprint hours should be reduced accordingly.

Calculate the available hours (3 full-time devs x 8 days x 6 working hours = 144 hours). Only schedule 144 hours of effort in the upcoming sprint.
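
For what it’s worth, here’s a tiny Ruby sketch of that math, using the illustrative numbers above:

# sprint_capacity.rb
devs          = 3   # full-time developers
working_days  = 8   # working days in the sprint
hours_per_day = 6   # focused working hours per day (meetings excluded)
time_off      = 0   # subtract planned unavailability, in hours

capacity = (devs * working_days * hours_per_day) - time_off
puts "Schedule no more than #{capacity} hours of estimated work"  # => 144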

I find it helpful to have team members track their time on tasks. Estimating is hard, and we can use the actual vs. estimated time to talk about what we missed in our initial estimates. The goal is always to get better as a team.

Repeat, communicate, and improve. Good process takes time. Actively working to evolve is the first step.

Posted in Agile, Management, Process, Project Management | Leave a comment

A Provisioning Solution

Question:
Design end to end application stack provisioning solution (architecture interview):

  • Consider:
    • Hardware v Cloud differences (one solution to control them all)
    • OS provisioning
    • Security
    • Golden image VS Bootstrap V other patterns
    • Bare Metal implications
    • Cloud implications
    • Cohesive application stack deployment
    • Configuration management
    • Custom Code application provisioning
  • Must be complete stack
  • Modular
  • Cannot be store bought only solution (no blade logic)
  • Address diff challenges in solution tool chain

My Solution:
End to End Environment Provisioner

This tool will need to accomplish a couple of key objectives:

  • Should be easy to use (simple flow to provision resources and deploy code)
  • Should support a hybrid cloud: a mix of data centers and cloud providers (AWS)
  • Should not get in the way of engineers trying to ship code

The Provisioner stack I’m proposing will mix open source tools and custom code. It’s a web-based application with an asynchronous job queue. It will leverage Packer, Chef, Fog (a generic cloud provider interface), and SSHKit (a remote SSH tool).

Before I launch into the details of what the various parts of the application do, I want to provide a high-level overview of the pieces that need to come together to provision a full custom application.

First, we need to choose a standard operating system. For this exercise, I’m going to choose Ubuntu 14.04.1 (64 bit). Automation is all about standardization, and having one OS reduces variation in securing and managing servers at the OS level.

A web application requires a number of different resource types (application server, load balancer, Postgres, Redis, Memcached, Elastic Search, etc.). Other than the application servers, which need to be built to run the application they are intended for, these resource types should be built as generic instances. Even the application server itself should be built as generically as possible. At my current company, we have Ruby applications using Ruby 1.8.7, 1.9.3, and 2.1.3 and Java applications using Java 6 and 7. We’ve standardized our applications so that all Java servers run Tomcat, and all Ruby servers run Passenger.

Golden images can cause a great deal of trouble if they are not properly managed. If they are properly built and managed, they can provide a lightning fast means of provisioning and/or scaling. I’m going to use golden images, but tightly control how they are built, tested, and managed. To build images, I’m going to use Chef (Chef is easily interchangeable with Ansible, Puppet, Salt, Bash, etc.). A base cookbook should be used for installing and configuring libraries present across all servers. Application specific cookbooks should be used for configuring the core server application (Postgres/Redis/Tomcat/etc.), and resource specific cookbooks should be used for applying the unique set of application configuration specific to Backcountry’s needs. These cookbooks should be developed locally using Vagrant. Server Spec tests need to be written to ensure the end result works as desired. The development of cookbooks should mirror the process of application development, with tests passing through a CI server before changes are merged into the master branch. Packer will be used to build images, which allows us to simultaneously build, tag, and push images for AWS, Open Stack, and Virtualbox. Server Spec tests are run before images are finalized. Failing tests mean images are not created, ensuring that only valid golden images are available for future provisioning.
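
As a rough illustration of the testing step, a Server Spec check for a base image might look something like this (the package and port checked here are hypothetical examples, not part of the original design):

require 'serverspec'
set :backend, :exec

# Base libraries installed and running on every image.
describe package('ntp') do
  it { should be_installed }
end

describe service('ntp') do
  it { should be_enabled }
  it { should be_running }
end

# SSH must be available so the Provisioner can finish configuration later.
describe port(22) do
  it { should be_listening }
end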

The Provisioner application acts as an information broker and manager. It’s responsible for starting and stopping, provisioning and destroying, configuring and managing applications and environments. It keeps track of what servers are currently running and has SSH access to those machines.

In the context of a provisioner, an application consists of a number of different resource types (ex. load balancer, Java 7 application server, Postgres database) and environment specific settings (staging/production/experimental/etc.). Environment specific settings include information like application domain, database connection credentials, external service API keys, and resource counts (two load balancers and 10 application servers in production vs one load balancer and one application server in staging). Separating credentials by environment also allows tighter control over who has access to each environment’s credentials, where that is desired.

After the Provisioner application provisions a server and it has come online, the resource application cookbook is run again, setting the configuration relevant to that resource for that application in the desired environment. For a load balancer, this might include adding the available backend nodes. For an application server, this might include setting the environment variables to configure that application. Configuration needs to happen in a particular order. Databases need to come up and be configured prior to application servers, which in turn need to be configured before being added to load balancers.

An application cannot be considered complete without its code. I like to separate code deploys from server provisioning. Code changes regularly, while the context that code runs in generally does not. When code requires a context change (an upgrade from Java 6 to 7), new resources should be provisioned, then moved through and tested in each of the environments. These major changes are not all that common in comparison to shipping code changes. I believe code should be shipped as a package, with all dependencies, if possible. This is easily done in Java applications with artifacts. For scripting languages, resetting the code against a particular Git tag and then running a dependency resolution tool (like Bundler for Ruby applications) is an alternative. The final step of provisioning the resources for an application is to deploy that application’s code to the application servers.

Now that I’ve provided an overview of functionality, I’ll dive into the implementation. The functionality will be wrapped up in a web application. Given my comfort with Ruby, it will be a Rails app (although it could easily be written in a variety of languages). The site will have users and groups to restrict access. All actions in the system are logged, providing a history of who, when, and what. The bulk of the heavy lifting will take place in jobs. I will use Postgres for the RDB, with column-level encryption for secure credentials and other private information. Redis will be used to back the job queue. All configuration data stored in Postgres will be versioned (image JSON and environment settings).

Packer templates will be part of the project and be as generic as possible. Creating a new image means grabbing the JSON and Packer template for that particular image and shelling out to Packer to generate the image. Successful images will be added to the database and be selectable when provisioning resources. The image will also be tagged when it’s pushed remotely to Open Stack and AWS.
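
A minimal sketch of that shell-out from the Rails app, assuming the template and variables files have already been written to disk (the file names are hypothetical):

# Build an image by shelling out to Packer; returns true if the build succeeded.
def build_image(template_path, var_file_path)
  system('packer', 'build', '-var-file', var_file_path, template_path)
end

build_image('packer/base-ubuntu.json', 'packer/vars/staging.json')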

Creating a new application involves naming the application and selecting the desired resources. Creating a new application will also create a default environment (staging?). Required information (ex. resource counts) will be added to the environment configuration. The job to provision an application will launch the server(s) based on the image for that resource (using Fog). Once all servers are present and registered, a second job will be run to configure and deploy code. This job will leverage SSHKit. The resource cookbook will be installed locally on each resource, and Chef run again with the configuration information from the Provisioner application specific to that particular application and environment. Once the application is configured, the application code will be deployed and the application will be available for use.
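
To make that concrete, here is a rough sketch of those two jobs using Fog and SSHKit (the image ID, instance type, user, and Chef run list are placeholders, not values from the original design):

require 'fog'
require 'sshkit'
require 'sshkit/dsl'
include SSHKit::DSL

# Job 1: launch an application server from a golden image via Fog.
compute = Fog::Compute.new(
  provider:              'AWS',
  aws_access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
  aws_secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
)
server = compute.servers.create(image_id: 'ami-xxxxxxxx', flavor_id: 't2.small')
server.wait_for { ready? }

# Job 2: once the server is up, run the resource cookbook with the
# application- and environment-specific configuration via SSHKit.
on ["deploy@#{server.public_ip_address}"] do
  execute :sudo, 'chef-client', '--local-mode', '--runlist', "'recipe[app_server]'"
end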

The application will be deployable through the Provisioner application or through a RESTful API. Exposing deployment functionality through an API allows CI servers to deploy changes when tests pass. At my current company, our CI server deploys straight to backstage, but staging and production releases are handled manually.

I’ve glossed over how this will work in managed data centers. It’s no small part, but Open Stack should be rolled out prior to work being done on the Provisioner. Open Stack provides generic access to compute, storage, and network, and allows the data center to be API-driven, as AWS is.

In addition, all AWS-deployed applications should be deployed into a VPC. If they require access to resources in the data center, secure tunnels can be created between the two networks. Particularly for applications running in AWS, VPC ingress should be restricted by location (ex. port 22 accessible only from the offices and from the Provisioner, internal service traffic over SSL, and 443 open only to those applications requiring access). The Provisioner server itself should be accessible only on internal networks.

One challenge of implementing this is rolling out Open Stack in the existing data centers. It’s a strategic investment, and should provide better utilization of existing hardware for the company as a whole in the future. Open Stack largely removes the local vs cloud discrepancies. Another of the biggest challenges to implementing a new system is the simple fact that it’s not the way people currently work. Automation requires removing unique snowflakes. There will need to be a lot of discussion around what is installed on these servers and how, to ensure they can be reliably replicated. Additional tooling and process may need to be implemented to help teams move from their current state to this proposed state.

In summary, this should be a web-based application, heavily utilizing a job queue. This queue will drive golden image creation through Chef, Packer, and Server Spec. Provisioning and scaling will leverage Fog to provision instances on either Open Stack or AWS (or both). SSHKit will be leveraged with Chef to finalize configuration and wire up resources within an application. Code and configuration will be deployed through SSHKit and Chef. The end result is a system that allows for the rapid provisioning and deployment of applications across a hybrid cloud environment.

Posted in capistrano, deployment, dev ops, sys admin | Leave a comment

A Monitoring Solution

Question:
Design an end to end monitoring solution (Systems Interview):

  • Covering:
    • availability (what is core)
    • capacity
    • transaction monitoring
    • synthetic transaction monitoring
    • application performance monitoring
    • operational analytics data for business transactions
    • passive monitoring

    how these differ from one another

  • Be able to explain and define this solution why each part is needed in ecommerce and explain tradeoffs
  • Be prepared to defend the architecture
  • Consider best practices vs bleeding edge v industry standard
  • Go deep into technical detail about implementation of your design

My Solution:

To begin, let’s look at the different aspects of a system that need to be monitored and why they are important:

Availability Monitoring ensures a resource is accessible and responsive. A resource can be anything from a data center or an application to individual servers and services (databases). Resource availability is likely the most critical monitoring; the failure of a part of the system requires immediate action to resolve. Availability monitoring is critical as it ensures your key systems are up and available to end users.

Capacity Monitoring ensures a system as a whole has ample resources in which to run. Capacity monitoring should look at two different levels of the system: the server level and the resource/application level. At the server level, this means metrics like available disk space, free memory, CPU load, disk IO, and network traffic. Spikes in any of these metrics are likely to lead to decreased performance at the server and resource/application level. The second level that’s important to monitor is the resource/application level. This includes clustered systems like databases, Elastic Search, Memcached, Riak, etc., and multi-node applications (load balanced application nodes). Capacity monitoring of multi-server resources/applications should focus on throughput, average response time, and particularly, slow responses. Monitoring slow responses helps to identify uncommon requests that have the potential to tie up the system, and can lead to failures of dependent systems. Effective capacity monitoring allows for proactive action. Actions might include bringing up additional servers to help with higher load, or optimizing slow actions to reduce loads on a system. Capacity monitoring is also important for predicting total system costs.

Transaction Monitoring looks at the velocity and response time of key business transactions, generally across multiple systems. An example might be the checkout process on xxxxxxxxx.com. An online checkout process generally touches a number of different systems. The inventory system might be checked to ensure all cart items are currently available, the shipping address might need to be verified, a hold will need to be applied to the credit card for the purchase amount, the warehouse pick system will need the order, and the ERP will need to record the transaction. Monitoring transaction response time and velocity allows for the identification of business critical issues. An unexpected slowdown in checkout velocity is indicative of a problem in the system. Monitoring response time can help identify bottlenecks in the current system. Tracking a transaction through the system can help identify which part of the system might be responsible for a workflow issue. Transaction monitoring is critical to ensuring key business value actions are happening as expected. A failure in checkout could potentially have a huge cost to the business in terms of lost revenue.

Synthetic Transaction Monitoring involves periodically performing key business transactions with an automated tool to measure end user response times.  Examples of synthetic transactions include running a Selenium script through the process of selecting and adding an item to the cart, or checking out.  Synthetic transaction monitoring helps identify issues that may be seen by an end user. End user response time has a noticeable impact on conversion and user happiness. Identifying slow experiences allows a team to be proactive in addressing end user performance issues.

Application Performance Monitoring provides insight into the response times of various parts of an application. This can include high-level metrics, like the total response time of requests. Ideally, very small incremental steps of processing a request should be recorded, allowing an engineer to identify the particular point of slowness in the software. Transactions like database or external service calls are very important to monitor, as they identify needed database optimizations or potential points of failure. Being able to examine response time at a method level allows for the understanding of performance bottlenecks. Capturing metrics at such a granular level can provide incredible visibility into the workings of an application, but will quickly lead to huge amounts of data. In high volume systems, the best approach may be to sample a small set of transactions in a production setting. Application Performance Monitoring is important in improving the quality and efficiency of each particular application.

Operational Analytics monitors tasks related to operational efficiency. These might include metrics like the time to pick and pack an order, the time between order placed and order shipped, or the response time of customer inquiries. These metrics help identify inefficiencies and bottlenecks in the current operational process.

Passive Monitoring examines network traffic flowing between different parts of the system. As passive monitoring tracks real traffic, it can provide the ability to “replay” the moments leading up to an issue, examining the TCP traffic.  Passive monitoring can also provide insight into the volume of communication within the larger system, helping to identify communication bottlenecks. As with application performance monitoring, passive monitoring can generate huge volumes of data. Sampling may help reduce the volume of data while still offering insight into the flow of communication within the larger system.  Passive monitoring can be important in figuring out why an issue occurred.  Being able to replay the events leading up to a problem can help engineers to replicate and resolve issues.

Solution: The elements of an effective monitoring system can be broken down into three main categories: Producers, Collectors, and Consumers. Producers are all the systems that send data as part of monitoring. The Collector is the system that collects and aggregates the data. Consumers are services that subscribe to the Collector and process or act on the data they are interested in. Let’s look at these individually:

Producers generate data. They can take a number of different forms. It might be a daemon process running on a server that sends system metrics every few seconds. It could be an application that sends the execution time of a particular controller action. It might be a service that pings for the uptime of other services, or the results of automated Selenium tests. Producer data should be in JSON form and follow a standard data schema relevant to its intention. Producers send data to the Collector via TCP. Each producer can be written in the language that makes the most sense given its purpose and location.
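
As a minimal sketch of a producer in Ruby (the ruby-kafka client, broker address, topic name, and schema label are my assumptions, not part of the original design):

require 'json'
require 'socket'
require 'time'
require 'kafka'  # ruby-kafka gem (assumed client library)

# Connect to the Collector (a Kafka cluster in this design).
kafka = Kafka.new(['collector-1.internal:9092'])

# Build an event that follows a simple, versioned schema.
event = {
  schema:    'heartbeat.v1',
  host:      Socket.gethostname,
  timestamp: Time.now.utc.iso8601
}

# Publish the event for interested Consumers to pick up.
kafka.deliver_message(event.to_json, topic: 'monitoring.heartbeat')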

Collectors collect and aggregate data.  Kafka seems like the right tool for the job at this scale. It would be run as a cluster to ensure availability and durability.  The Collector provides a unified platform for Producers to publish to and Consumers to consume from.

Consumers act on the data they are interested in from the Collector. One of the challenges around monitoring and analytics is the way in which data needs to be presented. Consumers should be tailored to their intended audience. An executive dashboard might only watch for key Operational Analytics such as warehouse efficiency or daily revenue numbers. This system could periodically import all related events, aggregate them, and store those aggregated results locally for display. Another Consumer might bulk upload all the collected data to S3 for future processing with Hadoop. Yet another consumer could monitor a particular application in real time, merging capacity, transaction, synthetic transaction, and application monitoring to provide a detailed picture of how an application is running.

This Producer/Collector/Consumer approach provides nearly infinite options on the Producer and Consumer sides. Applications can be tailored to the needs of their end users and utilize the technologies that make the most sense for the given problem. The only real requirement for producers and consumers is an understanding of the common data schema for a particular message type. Changes in data schemas can be mitigated through versioning.

I’ll now give some examples of Producers and Consumers for each of the above monitoring focuses.

For Availability Monitoring, a monitoring agent should be installed on each machine and run periodic health checks. This agent could be written in a number of different languages (Python, Go, etc.). A Consumer would watch for registered servers, and generate an alert if a server hadn’t checked in recently enough or a health check failed. Some external monitoring should also be in place to ensure connectivity between networks.

For Capacity Monitoring, a monitoring agent should be installed on each machine and generate periodic stats to be sent to the Collector. These might include load (uptime), memory (free -m), and available disk space (df -h). Key metrics should be sent to the Collector. A Consumer could display these metrics in graph form (d3.js), or alert if a threshold is passed. With the high volume of near-live data, a key-value store like Riak would be an option for storing and searching on data points. This data would not need a long life cycle, and could be aged off after a week or two.
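
A rough sketch of the metrics such an agent might gather before publishing them to the Collector (the command parsing below is illustrative and assumes a Linux host):

# Gather basic capacity metrics by shelling out to standard tools.
def capacity_snapshot
  {
    load_avg:    `uptime`[/load averages?: ([\d.]+)/, 1].to_f,  # 1-minute load
    free_mem_mb: `free -m`.lines[1].split[3].to_i,              # free memory in MB
    disk_avail:  `df -h /`.lines[1].split[3]                    # space available on /
  }
end

# The resulting hash would be serialized to JSON and published to the
# Collector, as in the producer sketch above.
p capacity_snapshot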

For Transaction Monitoring, the Producer would likely be part of the application, working the same way logging does in an application. It is important to include the time of the transaction as part of the event message. Transaction data seems very ripe for a variety of consumers. The business might be interested in metrics around checkout, order size, order velocity, and product interest. Engineering might need to see a more detailed log of slow transactions, or view them in the context of more detailed application performance metrics.

For Operational Analytics, the application responsible for the particular event should send the data itself (ex. an order shipped event sent when the order is marked as shipped). A consumer might consist of a series of graphs showing counts in comparison with historical data for management review.

For Passive Monitoring, agents could be installed to listen on mirrored ports and send all communication to the collector.  This data could be archived in S3 and processed through Hadoop.

One of the challenges with aggressive monitoring is the amount of data generated. In most cases, the majority of that data is not particularly useful, but it becomes critical when issues occur in the system. In this implementation, I would be particularly concerned about over-engineering a complete solution. I would focus initially on writing simple, configurable, reliable producers for server-level monitoring, and building libraries to ensure engineers could log events and actions from within application code. It’s better to collect more data than you consume than to be missing data when you need it.

On the Consumer side, I’d focus on critical areas with no good visibility. If there is no existing monitoring, availability monitoring and capacity monitoring are key starting points. Consumer applications should be single purpose and strive to be as simple as possible. In-memory or key/value stores can and should be leveraged where appropriate. Each consumer should handle aging or warehousing data as appropriate for its function. Consumers should also be written with APIs, to allow other consumers to access relevant information.

A challenge with monitoring is how much information to display and how to display it. It’s important that consumers are designed to answer the questions being asked, not just to show users lots of information. Keeping consumers small, modular, and focused will be key. Monitoring should be considered an ongoing project. With the increased visibility comes the opportunity to ask more complex questions.

Posted in dev ops, monitoring, sys admin | Leave a comment

Git Push Deployment for Big Commerce

One of the challenges we had working on a Big Commerce implementation was integrating it efficiently into our team’s workflow. At Gazelle, GitHub is a key part of how we move code from engineer to production. As is common, we branch and use pull requests as an opportunity for peer review. We wanted to ensure we could take the same approach with the Big Commerce templates, and that individuals didn’t override each other’s work.

Code:
To get started, I created a private repo in our GitHub account. Next, I exported the templates folder into the new project. I chose to only export changed files to reduce the project size. Our GitHub project has two branches: dev and master. The dev branch is tied to the staging site. The master branch is tied to the production site. It was important we have a place to review changes internally before those changes move into production.

Continuous Integration:
We use CircleCI to run our tests. Circle is great. It’s dead simple to set up, very customizable, and they let you run tests in parallel. There are no tests to run in our Big Commerce project; it’s just HTML, JavaScript, CSS, images, and fonts. I was particularly interested in the git commit hook as a mechanism to trigger branch deployment. The following is particular to CircleCI but, I imagine, could easily be adapted to a variety of CI alternatives. Below is the circle.yml file used in this project:

general:
  branches:
    only:
     - master
     - dev
test:
  override:
    - ruby ./run.rb

A quick overview of what’s going on.

general:
  branches:
    only:
     - master
     - dev

This sets up circle to only run on commits made to the master and dev branches. This lets us work on branches and review those changes before the changes find their way into the staging or production environments.

test:
  override:
    - ruby ./run.rb

We’ve overridden the default test call. Instead of running rspec or unit tests, we’ll run run.rb. run.rb handles the actual deployment. I’ll go into this file in more detail next.

Deployment:
Here is an overview of the script to handle the actual code deployment. I’ll step through the parts below.

require 'rubygems'
require 'bundler/setup'
require 'uri'
require 'curb'
require 'net/dav'
require 'dotenv'

Dotenv.load if defined? Dotenv

def master?
  ENV['CIRCLE_BRANCH'] == 'master'
end

def env_name
  (master?) ? 'PRODUCTION' : 'STAGING'
end

def url
  ENV["#{env_name}_URL"]
end

def password
  ENV["#{env_name}_PASSWORD"]
end

def create_directory(conn, path, arr)
  @current_folders ||= {}
  create_path = (path.empty?) ? arr.shift : "#{path}/#{arr.shift}"
  unless @current_folders.has_key?(create_path)
    conn.mkdir(create_path) unless conn.exists?(create_path)
    @current_folders[create_path] = 1
  end
  create_directory(conn, create_path, arr) if arr.length > 0
end

def create_path_if_missing(conn, file)
  create_directory(conn, '', File.dirname(file).split(/\//))
end

def changed_files
  all_files = Dir.glob('template/**/*.*')
  return all_files if !ENV['CIRCLE_COMPARE_URL'] || ENV['CIRCLE_COMPARE_URL'].empty?
  puts ENV['CIRCLE_COMPARE_URL']

  range = ENV['CIRCLE_COMPARE_URL'].split('/').last

  `git diff --name-only #{range}`.split("\n").reject{|file| file !~ /^template\//}
end

Net::DAV.start(URI("https://#{url}/dav/")) do |dav|
  dav.credentials(ENV['USERNAME'], password)

  puts "push to: #{url}"
  changed_files.each do |file|
    if File.exists?(file)
      create_path_if_missing(dav, file)
      dav.put_string(file, File.open(file, 'r').read) 
      puts "add/update #{file}"
    else
      dav.delete(file) if dav.exists?(file)
      puts "remove: #{file}"
    end
  end
end

Now, a quick run-through of what this code is doing. We need to include a couple of Ruby libraries: rubygems (so we can add gems) and URI. Circle uses Bundler to resolve gem dependencies; bundler/setup loads the cached gems into this script’s scope. The net_dav gem is going to handle the WebDAV connection. It’s a bit light on documentation, but the code is easy to read. I’ve included the curb gem as well to take advantage of the speed of curl. The dotenv gem simplifies the environment variable setup while developing locally.
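
The original post doesn’t show the Gemfile, but based on the requires above it presumably looks something like this (gem names are my best guess):

# Gemfile
source 'https://rubygems.org'

gem 'curb'     # libcurl bindings, used for faster transfers
gem 'net_dav'  # WebDAV client (require 'net/dav')
gem 'dotenv'   # loads local environment variables during development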

Some simple helper methods to figure out where to deploy the code and what passwords to use:

def master?
  ENV['CIRCLE_BRANCH'] == 'master'
end

def env_name
  (master?) ? 'PRODUCTION' : 'STAGING'
end

def url
  ENV["#{env_name}_URL"]
end

def password
  ENV["#{env_name}_PASSWORD"]
end

Unfortunately, net_dav doesn’t have anything like mkdir -p to recursively create directories, so we need to make sure directories exist (and can be created efficiently) before we try to upload a file to that location:

def create_directory(conn, path, arr)
  @current_folders ||= {}
  create_path = (path.empty?) ? arr.shift : "#{path}/#{arr.shift}"
  unless @current_folders.has_key?(create_path)
    conn.mkdir(create_path) unless conn.exists?(create_path)
    @current_folders[create_path] = 1
  end
  create_directory(conn, create_path, arr) if arr.length > 0
end

def create_path_if_missing(conn, file)
  create_directory(conn, '', File.dirname(file).split(/\//))
end

We keep track of confirmed directories for the obvious performance benefits. Now we’re getting to the good stuff.

def changed_files
  all_files = Dir.glob('template/**/*.*')
  return all_files if !ENV['CIRCLE_COMPARE_URL'] || ENV['CIRCLE_COMPARE_URL'].empty?
  puts ENV['CIRCLE_COMPARE_URL']

  range = ENV['CIRCLE_COMPARE_URL'].split('/').last

  `git diff --name-only #{range}`.split("\n").reject{|file| file !~ /^template\//}
end

This method takes advantage of GitHub’s comparison view to only act on the files that have changed. Most of the changes we’re pushing are small, and we want this deployment to be as near real time as possible.

And that brings us to the final block of code:

Net::DAV.start(URI("https://#{url}/dav/")) do |dav|
  dav.credentials(ENV['USERNAME'], password)

  puts "push to: #{url}"
  changed_files.each do |file|
    if File.exists?(file)
      create_path_if_missing(dav, file)
      dav.put_string(file, File.open(file, 'r').read) 
      puts "add/update #{file}"
    else
      dav.delete(file) if dav.exists?(file)
      puts "remove: #{file}"
    end
  end
end

Here we’re connecting to our Big Commerce account and uploading (or removing) the files added, changed, or deleted as part of the commit.

That’s it. Sixty-four lines of code and the power of CircleCI let us deploy our Big Commerce templates with:

git push origin dev
Posted in big commerce, ruby | Tagged , , | Leave a comment