At some point, you’re going to want to deploy your system. You should have a good idea of your system’s final destination early on in the project. If you’re building a site that has no server-side code, aim to use Federalist. If you’re going to host server-side code, aim to deploy to cloud.gov. You can also deploy to TTS-managed infrastructure as a service (IaaS) directly, but your life will be harder. For GSA systems, see comparison of hosting options.
Whichever option you choose, you should start deploying to a production-like environment from early on in the development process.
Note that sending traffic from the internet to your local machine for any testing purposes is not permitted. To enable testing, you can request sandbox accounts on either cloud.gov or AWS.
- The more your system looks like other TTS systems, the better
- This allows TTS to more easily share people, patterns, code, and services across projects
- The more you can offload (to your hosting provider, frameworks, etc.), the better
- This will lower your operational and compliance burden
- Below, “internal” projects mean “things built by and for TTS”, i.e. “not for a partner agency”. If you’re building for a partner agency to own long term, you will want to factor in considerations for their environment.
- If an option isn’t listed below, you probably can’t use it for deploying TTS projects. This includes:
- GitHub Pages (why)
- Heroku and other platform services
- Your personal AWS account
TTS uses AWS as the underlying IaaS, but spending effort at the IaaS level is not the best use of your team’s time. TTS has invested in developing cloud.gov to provide for the most common infrastructure needs. cloud.gov uses Cloud Foundry – an open source Platform-as-a-Service (PaaS) – as a team-friendly abstraction above AWS, encapsulating good practice cloud hosting without having to worry about a lot of the details. For most of the products that TTS develops, deploying onto cloud.gov will:
- Minimize ATO compliance overhead (which is quite hefty) and reduce security concerns
- Reduce TTS’s overhead for handling infrastructure billing, since it is fully self-service
- Make it easier for teams to ensure high availability/scalability
As a result, cloud.gov significantly reduces the portion of your team’s capacity that you need to dedicate to operational concerns. For this reason, when making infrastructure decisions, opt to use cloud.gov for your deployment whenever possible. Only resort to using AWS directly for infrastructure pieces that are impossible to achieve through cloud.gov, or that rely on new AWS services not yet available in cloud.gov.
Comprehensive documentation for cloud.gov is available.
Cloud.gov has a FedRAMP JAB Provisional ATO at the Moderate level.
The following can be found on cloud.gov’s FedRAMP page:
- FedRAMP package
- System Security Plan
- Control Implementation Summary
- Customer Responsibility Matrix
Infrastructure as a service (IaaS)
Amazon Web Services (AWS)
If you do want to use AWS directly, see the AWS page.
Microsoft Azure and Google Cloud Platform (GCP)
See outstanding issue.
FISMA High systems
There are some specific cases where the product is categorized “FISMA High”. This would usually only happen due to your product handling extremely sensitive information or being critical to normal government function. AWS GovCloud has received a FedRAMP JAB Provisional ATO at the High level.
Note however that when partner agencies assert that of course their product will be FISMA High, TTS often finds upon examination that a product should really be judged FISMA Moderate or FISMA Low… So don’t discard cloud.gov or AWS as options before probing that point carefully!
See cloud.gov page on deploying static sites.
- Internal: Likely free, but start by checking with #cloud-gov-business with your use case.
- External: see the pricing page
Sandbox accounts - both cloud.gov and AWS - are available to all TTS staff for non-production use. Things to bear in mind about sandbox accounts:
- Sandbox accounts should be used for testing and demonstration purposes. Nobody outside the federal government should be given access details for systems running in the sandbox unless authentication is in place. Exposing systems to the public without authentication requires an ATO.
- Sandbox accounts must be used when you are sending internet traffic to a non-production system: tools such as localtunnel are strictly forbidden since they can allow your laptop to be compromised.
- No sensitive or personally identifiable information (PII) should be stored in sandbox accounts.
- Any system that becomes publicly routable (ex: for testing) must have a robots.txt configuration that prevents indexing by all search engine robots.
- The sandbox is for testing and demonstration purposes only. Nobody outside the federal government should be given access details for systems running in the sandbox unless authentication is in place.
- No sensitive information can be stored in the sandbox accounts.
- Creating resources that will cost more than $500 per month requires prior agreement from the Tech Portfolio team.
- All resources must be tagged with a Project tag. Resources without this tag can be deleted at any time.
- Any website that is publicly routable for more than one day must have a robots.txt configuration that prevents indexing by search engines.
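As a minimal sketch of the robots.txt requirement above, the following Python holds the conventional "disallow everything" policy and uses the standard library's `urllib.robotparser` to confirm that it blocks crawlers from the site root. The function name, the sample crawler names, and the `example.gov` URL are illustrative assumptions, not part of any TTS tooling.

```python
from urllib.robotparser import RobotFileParser

# Conventional policy asking every crawler to stay away from every path.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"


def blocks_all_crawlers(robots_txt: str) -> bool:
    """Return True if the policy disallows the site root for sample crawlers."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Sample user agents only; "*" stands in for any crawler without its own rule.
    agents = ["*", "Googlebot", "Bingbot"]
    # example.gov is a placeholder URL, used only to test the root path.
    return not any(parser.can_fetch(agent, "https://example.gov/") for agent in agents)


if __name__ == "__main__":
    print(blocks_all_crawlers(ROBOTS_TXT))
```

A check like this could run in a test suite so that a sandbox site is never published with a permissive robots.txt by accident.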
Cloud.gov sandbox accounts
Information on cloud.gov sandboxes is available in the Getting Started section of the cloud.gov documentation.
Cloud Service Provider (CSP) sandbox accounts
Anyone in TTS can get access to any of the three (3) Cloud Service Provider (CSP) sandbox accounts that we currently have contractual access to:
- Google Cloud Platform (GCP)
- Amazon Web Services (AWS)
- Microsoft Azure (Azure)
Sandbox users have power user access, which means they have full privileges to all services except for Identity and Access Management (IAM). Program account requests will be issued
Once you complete the form above, you will be contacted by a member of the TTS Tech Portfolio for the exchange of credentials. Once you’ve completed the form, you can reach out to them directly by email at firstname.lastname@example.org or in Slack at #tts-tech-portfolio to ask any questions or inquire about the status of a request.
Important notes for Cloud Service Provider (CSP) users
There are a few special notes on using any “Infrastructure as a Service” in the Federal context.
Other people’s money
The federal government cannot pay one penny more than it is authorized to spend. There is no retroactive justification for spends. When government exceeds these limits, a report and explanation is required to the GSA Administrator, General Counsel, and Congress. So tracking costs is a big deal.
However we recognize that it’s important to provide compute resources for TTS folks to be able to experiment. Thus sandbox users can spend up to $500 per month without explicit permission from Infrastructure. This money counts towards our operating costs, which are ultimately indirectly billed to customers in the form of increased rates.
Thus in order to keep our rates low, it’s extremely important to bill infrastructure costs, including non-production costs, to agency partners wherever possible. If the work you are doing is in support of a project which has an inter-agency agreement (IAA), you must register your system with #infrastructure, including the Tock project code and the infrastructure tag you will be using, and tag any AWS resources accordingly so we can bill these costs to our partner agencies.
These are things like your AWS password, secret API key, and the mobile device that generates your multi-factor authentication token. You are wholly and solely responsible for safeguarding them, and are responsible if they are released to non-authorized parties.
In particular, your AWS credentials, like all other credentials and secrets, must never be checked in to version control. If you check them in by mistake, please treat this as a security incident.
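One way to reduce the risk of committing credentials is to scan text for strings that look like AWS access key IDs before it reaches version control. The sketch below is a heuristic only: it assumes long-term access key IDs start with `AKIA` followed by 16 uppercase letters or digits, it does not detect secret keys or other kinds of credentials, and the function name is hypothetical. Dedicated secret-scanning tools are more thorough.

```python
import re

# Heuristic: long-term AWS access key IDs start with "AKIA" followed by
# 16 uppercase letters or digits. A match is a warning sign, not proof.
ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_access_key_ids(text: str) -> list[str]:
    """Return every substring that looks like an AWS access key ID."""
    return ACCESS_KEY_ID.findall(text)


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = "aws_access_key_id = AKIAIOSFODNN7EXAMPLE\nregion = us-east-1\n"
    print(find_access_key_ids(sample))
```

A function like this could be called from a pre-commit hook on the staged file contents, failing the commit when the returned list is not empty.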
If you are unfamiliar with how to protect these credentials, please consult with TTS Infrastructure. We’re working on getting additional tools to help make this easy for everyone.
Amazon Web Services
At TTS, we use Amazon Web Services (AWS) as our infrastructure as a service (IaaS). We have separate AWS accounts for our production systems and sandboxes for development and testing. If you’re used to developing locally, you should feel empowered to do everything you’d like in an AWS sandbox account. You’re free to develop purely locally as long as you’d like, but if you want to get a system online, AWS and cloud.gov are your only options, of which cloud.gov is preferred.
In particular, you cannot send traffic from the internet to your local machine - you must use a sandbox account for this purpose.
TTS has opinions on how you should manage your infrastructure with AWS. For more information on how TTS manages its infrastructure, see the AWS Management Guide.
If you are familiar with running virtual machines on your own computer, through Parallels, VirtualBox, or VMWare, AWS operates on the same principles but on a truly massive scale. Pretty much everything in AWS can be orchestrated via the AWS API & command-line interface.
The core service of AWS is the Elastic Compute Cloud (EC2). These are virtual machines just like on your computer, but hosted in the AWS environment.
If you want very basic and cheap object storage, AWS provides the Simple Storage Service (S3).
These are just the concepts necessary for initial on-boarding. AWS has an extensive list of other services.
Building systems that will be deployed directly to AWS
Although cloud.gov is strongly preferred as the production environment for the systems we build, there are some systems that will need to run on AWS. See the GSA approval status and caveats for using different AWS services.
In order to ensure systems deployed to AWS are robust and reliable, and to ensure the integrity of information stored in AWS, we impose some additional restrictions on systems deployed to the TTS production AWS environment.
Anyone in TTS can get access to the AWS sandbox account. However only the TTS infrastructure team has login credentials to our production TTS account, and they are only used for debugging and incident management purposes. All systems are deployed using a continuous delivery service from scripts stored in version control, and registered with #infrastructure.
- All configuration of your production environment must be performed using Terraform scripts checked into version control.
- There will be no “back channel” access to AWS resources for systems deployed into production. Any routine activities such as data management, import / export / archiving, must be performed through your system.
Auto scale groups
In order to ensure that systems remain available even in the face of hardware failures within AWS leading to VMs being terminated, all EC2 instances must be launched within an auto-scaling group from an AMI.
To ensure logical partitioning of systems running within the TTS production environment, every system must be hosted within its own virtual private cloud (VPC). Network security settings are set at the VPC level, including which ports and IP addresses EC2 instances can use to communicate with each other and back out to the internet.
Occasionally, out-of-date documentation from third parties and Amazon itself may reference EC2 Classic. We at TTS do not support this environment.
Regardless of what your system does, we enforce HTTPS Everywhere.
Approved services for production use
Not all AWS services are approved by GSA IT for production use. GSA IT maintains a current list of approved services (note: only visible to GSA employees and contractors).
Operating system (OS) baseline
We use a pre-hardened version of Ubuntu as our baseline OS for all EC2 instances in AWS. These are created using the FISMA Ready project on GitHub. In AWS, there are Amazon Machine Images (AMIs) in each AWS Region with these controls already implemented. You should always launch new instances from this baseline. You can find them by searching for the most recent AMI with the name FISMA Ready Baseline Ubuntu (TIMESTAMP - Packer), where TIMESTAMP will be a timestamp value.
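The "most recent AMI with this name" lookup can be sketched as a small selection function. This example assumes the images are dictionaries with `Name`, `CreationDate`, and `ImageId` keys, the shape used for entries in an EC2 image listing, and that `CreationDate` is an ISO 8601 UTC string. The function name is hypothetical, and any image IDs or timestamps you pass in for testing are made up.

```python
from typing import Optional

# Every baseline image name starts with this prefix; the timestamp follows it.
BASELINE_PREFIX = "FISMA Ready Baseline Ubuntu ("


def latest_baseline_ami(images: list[dict]) -> Optional[dict]:
    """Pick the most recently created image whose name matches the baseline."""
    candidates = [
        image for image in images
        if image.get("Name", "").startswith(BASELINE_PREFIX)
    ]
    # Assumes CreationDate is an ISO 8601 UTC string in one consistent format,
    # so comparing the strings orders the images by creation time.
    return max(candidates, key=lambda image: image["CreationDate"], default=None)
```

The function returns `None` when no baseline image is present, which lets the caller fail loudly instead of launching from an unhardened image.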
Other people’s information
Any system in AWS might have the public’s information (as opposed to public data) at any time. Some systems use stronger measures to help protect the information if it is sensitive. For example, MyUSA uses row-level encryption. If you are unsure of the sensitivity of the data you’re going to be handling, consult with TTS Infrastructure first.
Use common sense when handling this information. Unless you have permission and need to in order to do your job:
- Don’t release information
- Don’t share information
- Don’t view information
Regardless of your own norms around privacy, always assume the owner of that data has the most conservative requirements unless they have taken express action, either through a communication or the system itself, telling you otherwise. Take particular care in protecting sensitive personally identifiable information (PII).
In order to make sure we are protecting the integrity of the public systems, you have no expectation of privacy on any federal system. Everything you do on these systems is subject to monitoring and auditing.
Tagging resources in AWS is essential for identifying and tracking deployed resources. Tags make it easier to attribute costs for billing and to determine whether a resource belongs to a particular environment (e.g. production). See the sandbox environment section to see how tagged resources enable lifecycle management of resources in AWS.
At a minimum, an AWS resource must have a Project tag defined with enough information to be able to identify a project that the AWS resource is associated with.
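A tagging rule like this is easy to check automatically. The sketch below assumes each resource is a dictionary with a `Tags` list of `{"Key": ..., "Value": ...}` pairs, which is the shape AWS APIs commonly use for tags. The `ResourceId` key and the function name are hypothetical and would need adapting to whatever inventory you feed in.

```python
def untagged_resources(resources: list[dict], required_tag: str = "Project") -> list[str]:
    """Return the IDs of resources whose required tag is missing or blank."""
    missing = []
    for resource in resources:
        # Flatten the AWS-style tag list into a plain dict for easy lookup.
        tags = {tag["Key"]: tag["Value"] for tag in resource.get("Tags", [])}
        if not tags.get(required_tag, "").strip():
            missing.append(resource["ResourceId"])  # hypothetical ID field
    return missing
```

Running a check like this on a schedule would surface untagged resources before they are cleaned up or become hard to attribute for billing.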
Creating new accounts
- Forecast the spending for the next 6-12 months.
- If you expect the spend across your accounts to increase by more than a few percent, the contract may need to be modified. Post in #admins-iaas if this is the case.
- Create an issue
Use Federalist for publishing static sites. See the Federalist homepage for more information.
- Within TTS: Likely free, but check with #federalist on Slack with your use case.
- External to TTS: Check out the Federalist website for pricing.
If you are publishing a new site through Federalist and it’s not connecting to any APIs or third-party services beyond public API calls from the browser (i.e. it’s a simple static site), the site is considered part of the Federalist system, so it does not require its own ATO (source). Note: Technically, static site builders are just adding a collection of pages to an existing system. Therefore, from an ATO perspective, “sites” created through Federalist remain within Federalist’s security boundary, and thus its ATO.
To make a new Federalist site public (and covered under the ATO), see the launch checklist.
How to check if a site is on Federalist
- Open a Terminal
curl -Is https://<site>.gov | grep -i x-server
If it outputs x-server: Federalist, it’s a Federalist site. Otherwise, it’s not.
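The same header check can be scripted, for example to audit a list of sites. This is a sketch using only the Python standard library. `fetch_headers` sends a HEAD request much like `curl -I`, and `is_federalist` compares the header name and value case-insensitively, which is a slightly looser match than the exact output described above. Both function names are illustrative.

```python
from urllib.request import Request, urlopen


def is_federalist(headers) -> bool:
    """Return True if the response headers carry 'x-server: Federalist'."""
    for name, value in headers.items():
        if name.lower() == "x-server" and value.strip().lower() == "federalist":
            return True
    return False


def fetch_headers(url: str, timeout: float = 10.0):
    """Send a HEAD request (like `curl -I`) and return the response headers."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers
```

Usage would look like `is_federalist(fetch_headers("https://<site>.gov"))`, with the placeholder replaced by the site you want to check.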
For information on how HTTPS and HSTS compensate for an absence of DNSSEC for HTTP-based services, see:
Good Production Practices
Below is a list of “good” production ops practices, which product and tech leads should consider early in their development and review as part of any major launch. Items in bold are considered must-haves.
We will be adding more documentation about how to achieve these within TTS’ infrastructure soon, but docs.cloud.gov is a good place to start. It includes a guide to production-ready apps on cloud.gov with tips about how to implement relevant practices.
- All volatile data storage is on redundant infrastructure
- Periodic snapshots of volatile data storage are happening
- Ideally, point-in-time recovery is possible
- Recovery is documented in a testable procedure
- Tests of the recovery path are part of the continuous deployment pipeline
- Can push a new version with a single command
- More than one person is able to do it
- Blue-green deployment
- Automated schema updates
- Snapshot/rollback of volatile data is incorporated in the process
- Deployment only includes production-necessary files
- Secrets are retrieved securely (eg via credential service rather than setting environment variables)
- Download, build, and configuration limited to staging, not runtime
- Pin dependencies
- Service-level targets are documented
- Clear entry point for complaints
- Clear escalation for handling infrastructure vs application vs api problems
- Support queue is public
- Resources are appropriately tagged
- Someone is alerted, somehow, if a monitor test is failing
- Flexible targets (for vacation, by component, etc)
- Alerts triggered based on “out of the norm” thresholds
- Flapping status does not result in excess/bouncing alerts
- A status page is available to all users and downstream services
- The status page is hosted off-infrastructure
- The status page shows any planned and all previous outages
- Users can subscribe to notices
- In-person discussion/audit around launch and major changes
- Third-party services are approved to hold the data being sent to them
- Automated pen-testing in a staging environment as part of continuous deployment
- Automated vuln-scanning in production environment that is fed with newly-discovered vulns
- Enable HTTPS for everything
- Redirect http to https (automatic with cloud.gov and federalist)
- Periodic tests of in-scope components in a staging environment as part of continuous deployment pipeline
- Upstream components are known to be load-tested up to max foreseeable pressure
- Planning around launch, significant news, and seasonal deadlines
- Analysis of similar service traffic in steady state
- Ideally app-relevant elastic response to scale up as needed and back down to control costs
- Each component has at least two instances at all times
- Each component horizontally scalable with more instances
- Must-be-vertical components do not pressure their hosts in even elevated traffic condition
- Ideally must-be-vertical components do not share hosts
- Instances are distributed across availability zones
- No in-app dependencies on the number/distribution of upstream instances
- Upstream is similarly resilient (multiple instances in multiple zones)
- Expected exposure for alpha/beta/blue-green environments is enforced
- Exposure is controlled via configurable non-bespoke proxy (eg not the app)
- A/B cohorts/affinity supported
- If using cloud.gov, obtain through the CDN broker.
- If using Federalist, they are set up automatically.
- If using TTS-managed infrastructure as a service (IaaS), there are a few options:
- If using another agency’s infrastructure, consult their IT department.
There are several kinds of monitoring that you will need to have in place for any application:
- Uptime/Downtime: Is the app available?
- Errors: Is the app generating errors at an unacceptable rate?
- Performance: Even if the app is functional, is it unusably slow?
Monitoring is only useful if the relevant people are alerted when something goes wrong, and then only if those individuals…
- consider these alerts worth investigating
- have sufficient access and understanding to at least triage and escalate an alert, if not fix it
- have a clear escalation path
It will likely take some tweaking of the thresholds to get the signal-to-noise ratio right. Plan to have monitoring active for several weeks before the go-live date to give the team time to spot problems, practice response and tune the alert conditions.
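One common way to improve the signal-to-noise ratio is to alert only after several consecutive failed checks, so a single blip or a flapping monitor does not page anyone. The sketch below illustrates that idea. The function name and the default of three failures are assumptions to tune, not TTS policy.

```python
def should_alert(results: list[bool], failures_needed: int = 3) -> bool:
    """Alert only when the most recent `failures_needed` checks all failed.

    `results` is ordered oldest to newest; True means the check passed.
    """
    if failures_needed < 1:
        raise ValueError("failures_needed must be at least 1")
    if len(results) < failures_needed:
        return False  # not enough history yet to decide
    return not any(results[-failures_needed:])
```

Raising `failures_needed` trades slower detection for fewer false alarms, which is the kind of tuning the weeks before go-live are meant for.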
Your DevOps Team
At present we don’t have a dedicated first-line support team across TTS. Projects need to coordinate their own DevOps teams for alert response. Teams will need:
- Reachability: Alerts should go directly to their devices, not just to Slack.
- Escalation path: Team members should know how to at least start dealing with alerts. Here’s a great example from College Scorecard. (Thanks, @abisker!)
- Direct access to monitoring systems: Make sure everyone has a working login on whichever monitoring systems you pick, and has at least a little experience navigating them.
- Clear expectations of uptime & availability: At present, TTS staff work 40 hour weeks and there is no requirement to be available in off hours. In practice, people want to make sure their stuff works, and many will jump online to fix things if they see a problem over the weekend. But there should be no expectation of this. Furthermore, this understanding must be established with project partners. Projects that need greater support coverage should arrange dedicated on-call staff elsewhere.
Errors & Performance Problems
For a non-static site, you will want to know if exceptions are being thrown within your application. TTS uses New Relic.
- For New Relic access, open an issue in the Infrastructure repo to get an account set up for your project.
For custom events, DAP and/or New Relic can be used.
Ask #analytics if you have questions.
Error & performance monitors can trigger alerts on a number of different conditions, including:
- Error counts (total or percentage)
- Apdex score (a responsiveness statistic)
- Response time
- Custom metric (which can be sent to monitors for logging using the monitor’s client library)
All of the above can be set with thresholds for given time periods; for example, alerting if more than 2% of transactions in any five-minute period return errors.
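To make the example threshold concrete, here is a sketch of the "more than 2% of transactions in a five-minute period return errors" rule. It evaluates only the window ending at `now`, so a monitor would call it repeatedly as time advances. The function name and the `(timestamp, is_error)` input shape are assumptions for illustration.

```python
from datetime import datetime, timedelta


def error_rate_exceeded(transactions, now, window=timedelta(minutes=5), threshold=0.02):
    """Return True if more than `threshold` of transactions in the window failed.

    `transactions` is an iterable of (timestamp, is_error) pairs.
    """
    # Keep only the transactions that fall inside the window ending at `now`.
    recent = [is_error for timestamp, is_error in transactions
              if now - window <= timestamp <= now]
    if not recent:
        return False  # no traffic in the window, so nothing to alert on
    return sum(recent) / len(recent) > threshold
```

Monitoring services usually implement this kind of rule for you; the sketch is only meant to show what the condition computes.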
We recommend creating a mixture of alert conditions during development and tuning them based on the current performance of the app. You may have an Apdex target of 0.9, but if the app is regularly scoring lower then it’s counter-productive to keep that as an alert threshold: you’ll just fill the alerts with noise that can’t be dealt with quickly. The work to meet that performance should be managed at the project level.
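For readers unfamiliar with Apdex, the score is commonly defined from a target response time T: requests at or under T count as satisfied, requests up to 4T count as tolerating (at half weight), and anything slower counts as frustrated. The sketch below computes it that way. The function name is illustrative, and your monitoring tool's exact definition may differ.

```python
def apdex(response_times: list[float], target: float) -> float:
    """Compute an Apdex score for response times and a target T (same units)."""
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= target)
    tolerating = sum(1 for t in response_times if target < t <= 4 * target)
    # Frustrated requests (slower than 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(response_times)
```

Computing the score over recent production traffic is one way to pick an alert threshold the app can actually meet today.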
Once you’ve created alert conditions, ensure that they’re actually working. It helps if you have errors or performance problems that you can trigger on demand; if the production environment is already live to the public then you might need to push a test branch to staging and try your conditions there. Also, when testing conditions, make sure to limit their notifications to only go to you, or you’ll need to warn everyone in advance.
You will want to know if your site goes down. Options (as of 1/20):
- Uptrends - GSA systems can request to have an account set up for their endpoints by submitting a Generic Request via GSA’s Servicedesk. Optionally, a public dashboard can be set up by the GSA Uptrends Administrator upon request. https://www.uptrends.com/support/academy/public-status-pages/configuration
- Statuspage - TBD
- New Relic Synthetics - (Here’s a walkthrough for setting up a simple ping with Synthetics, testing it, and connecting it to notification channels.) To use this service, you will need to consult with #acquisitions in Slack in order to apply funds to make a call on TTS’s existing New Relic procurement.
Projects can supplement their uptime/ping services with a status page by embedding it as an <iframe></iframe> on their own subdomain. This gives customers one place to go for information about the system’s status, how the team is responding to an outage, and announcements of degraded service or maintenance periods.
Static site (JAMstack) alternatives: manage the domain and build yourself, using a JAMstack/static status page such as https://github.com/netlify/netlify-statuskit or https://github.com/cstate/cstate.
Deploy it to Cloud.gov
Deploy it with cf push <app-name>
Deploy it to Federalist, or just host it in your app or in an S3 bucket (or similar).
Open Source alternatives (self-hosted):
Ways to alert DevOps & project team members:
- Slack, though you may not want all errors going to the project’s main Slack channel. (See the section below on grouping notification channels.)
- ~SMS, which is only available through certain services~ Note: no GSA approved SMS options currently exist. Use Slack on mobile instead.
- Push Notifications, for which team members need to have the mobile app installed and registered.
- Email, which in practice isn’t as useful since most people aren’t immediately alerted by it.
Grouping Notification Channels
New Relic (and possibly other monitoring tools) allows you to group notification targets - that is, individuals and Slack channels. This makes it easier to ensure that different kinds of alerts only go to team members who can act on them.
Good production practices
- Must-have: User-representative tests (eg can access service, can perform a critical operation) running regularly. Both of the downtime monitors mentioned above can be scripted to perform and verify multi-step transactions.
- Tests of sub-components also running regularly. Monitoring at the sub-component level will make it significantly easier to diagnose higher-level problems.
- Historical graph (e.g. uptime)
- Tests are run frequently
- Tests are reported with low latency
- Behavior vs stated service-level targets is tracked
- Dev team regularly reviews errors caught by monitors for triage and fixing (even if they didn’t set off alerts)
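Tracking behavior against a stated service-level target can start as simply as computing the share of passing checks over a period. The sketch below does that. The function names and the 99.9 default are illustrative assumptions, not a TTS target.

```python
def uptime_percent(check_results: list[bool]) -> float:
    """Return the percentage of monitor checks that passed (True = passed)."""
    if not check_results:
        raise ValueError("need at least one check result")
    return 100.0 * sum(check_results) / len(check_results)


def meets_target(check_results: list[bool], target_percent: float = 99.9) -> bool:
    """Compare observed uptime against a documented service-level target."""
    return uptime_percent(check_results) >= target_percent
```

Plotting `uptime_percent` per day or week gives the historical graph mentioned in the list above.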
The practice of “pinning dependencies” refers to making explicit the versions of software your application depends on (defining the dependencies of new software libraries is outside the scope of this document). Dependency pinning takes different forms in different frameworks, but the high-level idea is to “freeze” dependencies so that deployments are repeatable. Without this, we run the risk of executing different software whenever servers are restaged, a new team-member joins the project, or between development and production environments. In addition to repeatability, pinning dependencies allows automatic notification of vulnerable dependencies via static analysis.
As such, all deployed applications should be pinning their library (and where possible: language, OS, etc.) versions. Let’s look at how to implement this in different languages.
No action is necessary for dependencies to be pinned. This is because the Gemfile.lock should be committed to the repo in development, causing it to be deployed along with the source code:
. . . the Gemfile.lock makes your application a single package of both your own code and the third-party code it ran the last time you know for sure that everything worked.
Package locking is built into npm >= 5. As you npm install packages, they’ll be added to your package.json file and exact versions of all libraries (including dependencies of dependencies) will be referenced in the package-lock.json file. Both the package.json and lock file should be committed to the project repo.
The npm ci command, available since npm 5.7 and in every npm 6.x release, clears out node_modules and installs the exact dependency tree defined in package-lock.json. This is now the preferred method of ensuring dependencies are pinned in CI/CD. npm 6 or greater is the default from Node.js 10.3.0.
Be sure to use an up-to-date npm 5.x client, as the lockfile behavior was buggy in early versions. Use at least npm 5.4.2. Running npm install with no arguments will install the versions of libraries defined in the lock file.
npm < 5
With npm < 5, you may imitate some of the above behavior by creating a “shrinkwrap” file. As you install packages, use npm install --save to update package.json. After making changes, run npm shrinkwrap to generate an npm-shrinkwrap.json file, which references the versions of all the currently installed packages. Running npm install with no arguments will inspect that file and install the versions it defines. Both the package.json and shrinkwrap file should be committed to the project repo.
If you are using yarn to manage your node dependencies, you will automatically have dependency pinning due to the yarn.lock file that yarn produces and uses. yarn.lock should be committed to your repository:
All yarn.lock files should be checked into source control (e.g. git or mercurial). This allows Yarn to install the same exact dependency tree across all machines, whether it be your coworker’s laptop or a CI server.
pipenv install django
# or, with stricter version bounds
pipenv install django~=2.0.4
This will generate a Pipfile containing a loose Django definition and a Pipfile.lock referencing an exact Django version as well as all its dependencies. Users need only run pipenv install with no arguments to synchronize the latest libraries.
Pipenv can also export a requirements.txt file for tools that need one:
pipenv lock -r > requirements.txt
If Pipenv isn’t available, we can imitate some of its functionality by using pip directly. We’ll create a requirements.in file, specifying un-pinned dependencies, and install it via
pip install -r requirements.in
Then, we can “freeze” our libraries, generating a list of the exact versions of not only our immediate dependencies but their dependencies, by using:
pip freeze > requirements.txt
Be sure to run this command in an activated virtualenv to avoid freezing system-wide dependencies.
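A quick way to confirm that a frozen requirements.txt really is pinned is to look for lines without an exact `==` version. The sketch below is a simple heuristic with a hypothetical function name: it ignores comments and pip option lines, and it does not understand URL-based requirements or environment markers.

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned with an exact `==` version."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned
```

Running this in CI against requirements.txt and failing on a non-empty result would catch a dependency that was added by hand without a version.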
pip-tools provides a more automated method of managing this flow.
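Across the ecosystems above, the common thread is that a manifest file should be committed together with its lock file. As a rough sketch, the following check reports manifests in a project directory that have no accompanying lock file. The mapping covers only the files named in this section, and the function name is hypothetical.

```python
from pathlib import Path

# Manifest file -> lock files that would pin it (any one of them is enough).
LOCKFILES = {
    "Gemfile": ["Gemfile.lock"],
    "package.json": ["package-lock.json", "npm-shrinkwrap.json", "yarn.lock"],
    "Pipfile": ["Pipfile.lock"],
}


def missing_lockfiles(project_dir: str) -> list[str]:
    """Return manifests in `project_dir` that have no accompanying lock file."""
    root = Path(project_dir)
    missing = []
    for manifest, locks in LOCKFILES.items():
        has_manifest = (root / manifest).exists()
        has_lock = any((root / lock).exists() for lock in locks)
        if has_manifest and not has_lock:
            missing.append(manifest)
    return missing
```

This only checks that lock files exist on disk; confirming that they are committed and up to date is left to the package manager and version control.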
What to log
Things you are required to log:
- Successful and unsuccessful account logon events
- Account management events
- Object access
- Examples: reading database records or files on disk
- Policy change
- Privilege functions
- Process tracking
- System events
- For Web applications:
- All administrator activity
- Authentication checks
- Authorization checks
- Data deletions
- Data access
- Data changes
- Permission changes
Do not log sensitive information.
- It’s important that the events are traceable back to the user that performed them (if possible), and when, so include things like:
- The user ID
- Timestamps, standardized in UTC
- Make sure the right logging is done in production (outside of debug/development mode)
- If not using cloud.gov, here are some things to think about:
- Logs are captured to durable storage before rotation
- Logs with sensitive data are only available to appropriate people
- Logs can be browsed/drilled with low-latency (e.g. grepping not necessary)
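To illustrate traceable events, here is a minimal sketch of a structured audit record that carries the acting user's ID and a UTC timestamp. The field names, the `audit` logger name, and the sample values are assumptions. The record deliberately holds identifiers only, not the content of the data touched, in keeping with the rule against logging sensitive information.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("audit")  # hypothetical logger name


def audit_record(user_id: str, action: str, target: str) -> str:
    """Build one JSON log line with the acting user and a UTC timestamp."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),  # always UTC
        "user_id": user_id,
        "action": action,
        "target": target,  # an identifier only, never the record contents
    })


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    logger.info(audit_record("user-123", "data.delete", "record/42"))
```

Emitting one JSON object per line keeps the events easy to search and filter once the logs reach durable storage.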
To decide whether a site needs to be decommissioned, use the following decision trees. Is the site required by law/policy?
When taking down a production system, create an issue (preview). Feel free to add/remove tasks as appropriate, add a username after each task to assign it, and/or make corresponding items in your issue tracker.