Pulumi at NearMe: Embracing True Infrastructure as Code

Introduction

Infrastructure as Code (IaC) has revolutionized how organizations manage their cloud infrastructure, and at NearMe, our journey led us to choose Pulumi as our primary IaC tool. This article explores how Pulumi's approach to infrastructure management has transformed our Platform Engineering practices, enhanced developer productivity, and improved our infrastructure reliability. We'll dive into the advantages of using real programming languages for infrastructure definitions and compare Pulumi with other tools such as Terraform and AWS CDK. Pulumi's blend of imperative and declarative programming has given us greater flexibility, but it also has drawbacks: because there are often multiple ways to achieve the same outcome, code complexity can grow and sometimes overwhelm developers. By sharing our practical experiences and insights, we aim to help those undecided between Pulumi, Terraform, and AWS CDK choose the right IaC framework for their specific needs.

About Pulumi

Infrastructure as Code

Infrastructure management has evolved from manual configuration through click-ops to script-based automation, and finally to declarative Infrastructure as Code. This progression reflects the growing need for:

Reproducibility and consistency across environments
Version control and change tracking
Automated testing and validation
Scalable infrastructure management
Reduced human error through automation

Pulumi's Features

Pulumi offered several compelling advantages that aligned with our needs:

Programming Language Support

Instead of learning a domain-specific language, our developers can use familiar programming languages like TypeScript.

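import * as aws from '@pulumi/aws'
import * as pulumi from '@pulumi/pulumi'

// lambdaArchive, lambdaRole, and parent are defined elsewhere in the program.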

const lambdaFunction = new aws.lambda.Function(
  'lambda-function',
  {
    code: new pulumi.asset.FileArchive(lambdaArchive.outputPath),
    sourceCodeHash: lambdaArchive.outputBase64sha256,
    handler: 'function.handler',
    environment: {
      variables: {
        /* ... */
      },
    },
    role: lambdaRole.arn,
    runtime: aws.lambda.Runtime.NodeJS20dX,
    loggingConfig: {
      logFormat: 'Text',
    },
  },
  { parent },
)

Type Safety and IDE Support

Built-in type checking
Inline documentation
Code completion

For instance, when working with Kubernetes Custom Resources like Argo Workflows, we can generate type definitions from their JSON Schema using json-schema-to-typescript from npm.

import * as k8s from '@pulumi/kubernetes'
import * as pulumi from '@pulumi/pulumi'
import { Primitive } from 'zod'
import { IoArgoprojWorkflowV1Alpha11 } from './__generated__/argoWorkflows'

type DeepPulumiInputType<T> = T extends Primitive
  ? pulumi.Input<T>
  : T extends object | any[]
  ? {
      [k in keyof T]: DeepPulumiInputType<T[k]>
    }
  : T

type ArgoWorkFlowsTemplateCrdSpecInput = DeepPulumiInputType<IoArgoprojWorkflowV1Alpha11>

new k8s.apiextensions.CustomResource('argo-workflow', {
  apiVersion: 'argoproj.io/v1alpha1',
  kind: 'WorkflowTemplate',
  spec: {
    entrypoint: 'main',
    /* Type-safe properties */
  } satisfies ArgoWorkFlowsTemplateCrdSpecInput,
})

Testing Capabilities

Pulumi lets us write tests with familiar testing frameworks and validate infrastructure configurations before deployment. In addition, policy packs (Pulumi's policy-as-code feature, CrossGuard) are enforced during both the preview and deployment stages, providing immediate feedback and ensuring compliance with organizational policies.

For example, the following policy pack ensures that DynamoDB tables do not have a write capacity greater than or equal to 64:

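import * as aws from '@pulumi/aws'
import * as policy from '@pulumi/policy'

new policy.PolicyPack('dynamo-db-testing', {
  policies: [
    {
      // The rule caps write capacity, so the name reflects a maximum rather than a minimum.
      name: 'maximum-dynamodb-write-capacity',
      description: 'DynamoDB tables must have a write capacity below 64.',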
      enforcementLevel: 'mandatory',
      validateStack: async ({ resources }, reportViolation) => {
        const dynamoDbTables = resources
          .filter((r) => r.isType(aws.dynamodb.Table))
          .map((tb) => tb.asType(aws.dynamodb.Table))

        dynamoDbTables.forEach((table) => {
          if ((table?.writeCapacity || 0) >= 64) {
            reportViolation(`Unwanted write capacity ${table?.writeCapacity} for table ${table?.name}`)
          }
        })
      },
    },
  ],
})
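
The capacity rule in the policy above is plain logic, so it can also be factored out and unit-tested without a cloud account. The following is a minimal sketch of that idea; the `TableLike` shape and the `findWriteCapacityViolations` helper are hypothetical names introduced for illustration and are not part of Pulumi's API.

```typescript
// Hypothetical, simplified view of the table properties the policy inspects.
interface TableLike {
  name: string
  writeCapacity?: number
}

// Threshold taken from the policy above: capacities of 64 or more are rejected.
const WRITE_CAPACITY_LIMIT = 64

// Returns one message per offending table; an empty array means the check passes.
function findWriteCapacityViolations(tables: TableLike[]): string[] {
  return tables
    .filter((t) => (t.writeCapacity ?? 0) >= WRITE_CAPACITY_LIMIT)
    .map((t) => `Unwanted write capacity ${t.writeCapacity} for table ${t.name}`)
}

const violations = findWriteCapacityViolations([
  { name: 'orders', writeCapacity: 5 },
  { name: 'events', writeCapacity: 64 },
  { name: 'sessions' }, // no provisioned write capacity set
])
console.log(violations)
```

Inside a policy pack, the same helper could be called from `validateStack` and each returned message passed to `reportViolation`.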

Multi-Cloud Support

Pulumi provides robust multi-cloud support, allowing you to define and manage infrastructure across various cloud providers such as AWS, Azure, Google Cloud, and more using the same programming languages and tools. This flexibility enables organizations to adopt a consistent infrastructure-as-code approach in heterogeneous cloud environments, facilitating easier migration and hybrid deployments. In contrast, AWS CDK is tailored specifically for AWS services. While it offers deep integration and rich features within the AWS ecosystem, it lacks native support for other cloud platforms, limiting its applicability in multi-cloud scenarios.

Adopting Existing Resources

Pulumi excels at adopting existing cloud resources into your infrastructure codebase. With a single pulumi import CLI command, you can generate the necessary Pulumi code to manage resources that were created outside of Pulumi, such as those provisioned manually or by other tools. This capability streamlines the process of bringing unmanaged resources under version control and consistent management practices. AWS CDK and Terraform, on the other hand, have traditionally not generated code from existing resources out of the box: a plain terraform import only records the resource in state, and the matching configuration still has to be written by hand (newer Terraform releases can generate configuration for import blocks, and CDK has added its own import and migrate commands). While you can reference and manage existing AWS resources in CDK or Terraform, it often requires manual code writing and additional effort to integrate them into your new setup.

Comparisons

vs Terraform

Terraform, a popular IaC tool by HashiCorp, uses the declarative HashiCorp Configuration Language (HCL). In Terraform, you define the desired state of your infrastructure, and Terraform ensures your cloud resources align with that state. Its strength lies in simplicity and a rich provider ecosystem for multi-cloud support. Here's an example of Terraform code for provisioning an AWS EC2 instance:

provider "aws" {
  region = "us-west-2"
}

resource "aws_instance" "example" {
  ami           = "ami-12345678"
  instance_type = "t2.micro"

  tags = {
    Name = "ExampleInstance"
  }
}

This snippet defines an EC2 instance with a specific AMI, instance type, and tags. Terraform’s declarative approach focuses on what the infrastructure should look like rather than how to create it, making it easier for teams to adopt without programming expertise.

Key Differences:

Ease of Use: Terraform’s declarative syntax is easier for non-developers, while Pulumi’s programming language approach is more flexible for developers.
Extensibility: Pulumi enables complex logic with programming constructs, while Terraform requires workarounds for advanced scenarios.
Ecosystem: Terraform has a larger ecosystem and Pulumi is still catching up.
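
To illustrate the extensibility point, here is a minimal sketch of how ordinary functions, conditionals, and loops can derive per-environment resource settings. The `Environment` type, the `instanceSettings` helper, and the instance types chosen are assumptions made for this example only.

```typescript
type Environment = 'dev' | 'staging' | 'prod'

interface InstanceSettings {
  name: string
  instanceType: string
  tags: Record<string, string>
}

// Ordinary conditionals and loops decide the shape of each resource.
function instanceSettings(env: Environment, replicas: number): InstanceSettings[] {
  // Assumed sizing rule for this sketch: production gets a larger instance.
  const instanceType = env === 'prod' ? 't3.large' : 't3.micro'
  return Array.from({ length: replicas }, (_, i) => ({
    name: `web-${env}-${i}`,
    instanceType,
    tags: { Environment: env, Index: String(i) },
  }))
}

const prodInstances = instanceSettings('prod', 2)
console.log(prodInstances.map((s) => s.name))
```

In a Pulumi program, each element would then be passed to a resource constructor such as `aws.ec2.Instance`. Terraform covers simpler cases of this pattern with `count` and `for_each`.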

Feature	Pulumi	Terraform
OSS License	Yes, Apache License 2.0	No, Business Source License 1.1
Language	General-purpose programming languages	HCL (Domain-specific)
Testing	Native testing frameworks	Limited testing capabilities
Abstraction	Object-oriented/Functional programming	Module system
Dynamic Provider Support	Yes	No
Learning Curve	Familiar for developers	New DSL to learn
Pricing - 1st paid plan	$0.37/resource/month	$0.10/resource/month
Self-hosting	Available	Available

vs AWS CDK

AWS CDK is purpose-built for AWS and allows developers to define AWS infrastructure using languages like TypeScript, Python, Java, and C#. It provides a high-level abstraction for AWS services, enabling users to define resources with reusable constructs. Here’s an AWS CDK example in TypeScript to create an S3 bucket:

import * as cdk from 'aws-cdk-lib'
import { Bucket } from 'aws-cdk-lib/aws-s3'

const app = new cdk.App()
const stack = new cdk.Stack(app, 'MyStack')

new Bucket(stack, 'MyBucket', {
  versioned: true,
})

This code creates an S3 bucket with versioning enabled using TypeScript. AWS CDK integrates deeply with AWS, offering a rich library of constructs tailored to AWS services. However, its AWS focus limits its ability to manage multi-cloud environments effectively.

Key Differences Between Pulumi and AWS CDK

Multi-Cloud Support: Pulumi offers extensive support for multiple cloud providers, making it an excellent choice for hybrid or multi-cloud strategies. In contrast, AWS CDK focuses more on AWS-specific use cases.
Dual Knowledge Requirement: AWS CDK functions as a source-to-source compiler, translating its high-level code into CloudFormation templates. As a result, users must understand both AWS CDK and CloudFormation to effectively debug and manage their infrastructure, increasing the learning curve.
Dependency Deadlock Issues: Managing dependencies between stacks in AWS CDK can lead to deployment challenges. For example, removing an exported value from a stack that is still referenced by another stack can result in deployment failures. Resolving such issues often requires intricate workarounds, such as introducing temporary fake outputs—an approach that can be tedious and error-prone.

Given these differences, Pulumi offers a more flexible, general-purpose platform for managing infrastructure across diverse cloud environments, while AWS CDK remains a strong choice for teams focused solely on AWS.

Practice

Canary Deployment Example

In this example, we will explore a canary deployment process that takes advantage of Pulumi's unique strengths. This process consists of four stages: Original, Start Canary, Switch, and Next Stable. The desired state at each stage is determined by a combination of environment variables and outputs from the previous stack. Below, we detail each stage of the process.

Process Overview

Original

In the original state, a set of listener rules with priorities that are multiples of 3 routes requests to Component 1. The stack's output at this stage includes a list of version 1 commit hashes and identifies the stable component as Component 1.

Start Canary

Configure the Pulumi program with STAGE = 'Start Canary' and version 2 commit hashes. This deploys a new stack for version 2 and sets up listener rules with priorities that are multiples of 2, directing traffic to version 2. These canary rules require special headers, ensuring only specific requests reach the new version. The stack outputs remain unchanged, maintaining stability while preparing for the canary deployment.

Switch

After manual validation, update the listener rules to transition traffic:

Component 1 (the previous stable stack) now uses listener rules with priorities that are multiples of 1 and requires a special header for access.
Component 2 (the previous canary stack) now uses listener rules with priorities that are multiples of 4, routing public requests to it.

This shifts traffic from the stable stack to the canary stack.

Next Stable

Finally, destroy the previously stable stack and revert listener rules to their original configurations. Update the stack outputs with version 2 commit hashes, designating Component 2 as the new stable component.
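
The four stages above can be read as a pure function from the target stage and the previous stack outputs to a desired state. The sketch below models only that decision logic. All type and function names are hypothetical, listener rule priorities are deliberately left out, and the real program derives actual AWS and Kubernetes resources from this kind of state.

```typescript
type Stage = 'original' | 'start canary' | 'switch' | 'next stable'
type Component = 'component1' | 'component2'

interface StackOutputs {
  stable: Component
  images: string[] // commit hashes of the stable version
}

interface DesiredState {
  deployed: Component[]
  publicTarget: Component // receives ordinary requests
  headerOnlyTarget?: Component // reachable only with the special header
  outputs: StackOutputs
}

const otherComponent = (c: Component): Component =>
  c === 'component1' ? 'component2' : 'component1'

function planStage(stage: Stage, previous: StackOutputs, newImages: string[]): DesiredState {
  const stable = previous.stable
  const canary = otherComponent(stable)
  switch (stage) {
    case 'original':
      return { deployed: [stable], publicTarget: stable, outputs: previous }
    case 'start canary':
      // The new version is deployed but only reachable with the header.
      return { deployed: [stable, canary], publicTarget: stable, headerOnlyTarget: canary, outputs: previous }
    case 'switch':
      // Public traffic moves to the canary; the old stable stays reachable by header.
      return { deployed: [stable, canary], publicTarget: canary, headerOnlyTarget: stable, outputs: previous }
    case 'next stable':
      // The old stack is destroyed and the outputs record the new stable version.
      return { deployed: [canary], publicTarget: canary, outputs: { stable: canary, images: newImages } }
  }
}

const previousOutputs: StackOutputs = { stable: 'component1', images: ['v1-hash'] }
console.log(planStage('switch', previousOutputs, ['v2-hash']))
```

Because the outputs only change in the last stage, rerunning any earlier stage with the same inputs yields the same desired state.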

Look Back

Deterministic Stage Completion: Pulumi's ability to directly manage both Kubernetes and AWS resources ensures that each stage of the deployment process completes reliably and deterministically. Unlike ingress rule annotations—which may experience delays due to operator processing—Pulumi controls the resources directly, guaranteeing that transitions occur exactly as intended. Additionally, errors surface in the code itself rather than in controller logs, making them easier to track and debug.
Risk Mitigation for Premature Exits: If the Next Stable stage exits prematurely, rerunning it is risky because the stack outputs might already indicate that Component 2 is stable, even if it wasn't fully deployed or Component 1 wasn't completely deleted. To mitigate this risk, we make the stack outputs dependent on an HTTP request. By incorporating a health check that confirms the successful deployment and operation of Component 2 before updating the stack outputs, we ensure the system only proceeds when the new component is fully functional. This dependency prevents the stack from prematurely marking Component 2 as stable and avoids potential downtime from failed re-deployments.

const healthChecks = pulumi.all(stackOutputIds).apply(async () =>
  // health checks are performed on non-global dns
  pulumi.runtime.isDryRun() || envs.NM_USE_GLOBAL_DNS === 'true'
    ? []
    : // check the url health and timeout if necessary
      checkBasicWebApp({
        routings: routingOutputs,
        newAppImageTagEnvs,
        headers: {
          ...(NM_TARGET_STAGE === 'start canary' && {
            [toCanaryHeader.name]: toCanaryHeader.value,
          }),
        },
      }),
)

// stack outputs
const stageInfo = healthChecks.apply((checks) =>
  pulumi.all(checks).apply(() => {
    const stable = variantConstruction.find((c) => c.variantName === 'stable')
    const canary = variantConstruction.find((c) => c.variantName === 'canary')
    if (!stable || !canary) throw new Error('Error constructing components.')
    return {
      stable: stable.component,
      canary: canary.component,
      images: stable.imgTags,
    }
  }),
)
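
The gating idea in the code above, where outputs resolve only after a health check passes, can be sketched independently of Pulumi. The helper names `waitUntilHealthy` and `gatedOutputs` and the retry parameters are assumptions for illustration; the probe is injected so that no real HTTP endpoint is needed, and our actual `checkBasicWebApp` helper is not shown here.

```typescript
// A probe resolves to true when the target is healthy.
type Probe = () => Promise<boolean>

// Calls the probe up to `attempts` times, waiting `delayMs` between tries.
async function waitUntilHealthy(probe: Probe, attempts: number, delayMs: number): Promise<void> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    if (await probe()) return
    if (attempt < attempts) {
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
  throw new Error(`Health check failed after ${attempts} attempts`)
}

// The outputs are returned only after the probe succeeds. On failure the
// promise rejects, so the caller never records the new outputs.
async function gatedOutputs<T>(probe: Probe, outputs: T, attempts = 5, delayMs = 1000): Promise<T> {
  await waitUntilHealthy(probe, attempts, delayMs)
  return outputs
}
```

In a Pulumi program, the probe would typically be an HTTP request against the newly deployed component, and the gated value would be what the stack exports.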

While Pulumi offers the flexibility to configure deployments dynamically, it does introduce additional complexity, which may require engineers to invest more time in understanding and maintaining the codebase. At NearMe, we have chosen Pulumi for its powerful capabilities in handling our specific needs, despite the steeper learning curve.

Pulumi Tips

Below are some Pulumi tips we would like to share:

Utilize pulumi.runtime.isDryRun()

During the first deployment, certain resource properties might not be available until the resource is actually created. This is particularly common when dealing with:

Generated resource IDs
Dynamically assigned IPs
Kubernetes assigned node ports

const nodePort = pulumi
  .all([service.spec.ports[0], service.metadata.name, service.metadata.namespace])
  .apply(([p, n, ns]) => {
    // nodePort will be generated by Kubernetes
    if (pulumi.runtime.isDryRun() && !p.nodePort) return '30000'
    const port = p.nodePort
    if (typeof port !== 'number') {
      const msg = `Unable to find node port in service. Make sure the passed service, ${n}.${ns}, is of type NodePort. Port object: ${JSON.stringify(
        p,
      )}`
      pulumi.log.error(msg)
      throw new Error(msg)
    }
    return port.toString()
  })
const targetGroup = new aws.lb.TargetGroup('target-group', {
  port: nodePort.apply((p) => parseInt(p, 10)),
  healthCheck: {
    port: nodePort,
  },
  /* ... */
})
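
The placeholder logic can also be isolated in a small pure helper, which makes the preview and deployment behaviors easy to test separately. This is a sketch under assumptions: `resolveNodePort` is a hypothetical name, and the placeholder reuses the 30000 value from the snippet above, which is the start of Kubernetes' default NodePort range.

```typescript
// Placeholder used only during preview, when Kubernetes has not assigned a port yet.
const PREVIEW_PLACEHOLDER_PORT = 30000

function resolveNodePort(nodePort: number | undefined, isPreview: boolean): number {
  // A real value always wins, even during preview.
  if (typeof nodePort === 'number') return nodePort
  // During preview the port may legitimately be unknown.
  if (isPreview) return PREVIEW_PLACEHOLDER_PORT
  // During a real deployment a missing port is an error.
  throw new Error('Unable to find node port in service. Is the service of type NodePort?')
}

console.log(resolveNodePort(undefined, true))
console.log(resolveNodePort(31234, false))
```

In a Pulumi program, the second argument would be `pulumi.runtime.isDryRun()` and the helper would be called inside `apply`.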

Avoid default providers

Relying on default providers, which pick up ambient settings such as AWS environment variables or the current kubeconfig context, can cause problems outside CI. For example, Kubernetes resources might accidentally be deployed to a local Minikube context. We recommend passing providers explicitly and disabling default providers, as shown below.

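import * as k8s from '@pulumi/kubernetes'
import { LocalWorkspace } from '@pulumi/pulumi/automation'
import { homedir } from 'os'

// createOrSelectStack returns a Stack handle for the inline program below.
const localWorkspace = await LocalWorkspace.createOrSelectStack(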
  {
    projectName: 'example',
    stackName: 'local',
    program: async () => {
      /* Kubernetes Provider */
      const provider = new k8s.Provider('provider', {
        context: 'minikube', // or pass config from EKS resource
      })
      new k8s.core.v1.Namespace(
        'example',
        {
          metadata: {
            name: 'example',
          },
        },
        { provider },
      )
    },
  },
  {
    envVars: {
      PULUMI_CONFIG_PASSPHRASE: 'passphrase',
      PULUMI_BACKEND_URL: `file://${homedir()}`,
    },
  },
)
/* Disable default providers */
await localWorkspace.setConfig('pulumi:disable-default-providers', {
  value: JSON.stringify(['*']),
})

await localWorkspace.up({
  onOutput(out) {
    console.log(out)
  },
})

Perspectives

Leveraging Pulumi's ability to manage both cloud infrastructure and Kubernetes resources, we can create an automated system for spinning up isolated ephemeral environments. This enables developers to test their changes in production-like environments before merging their code.
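
One small building block for such ephemeral environments is deriving a safe, unique name per branch or pull request that can serve as a namespace name. The sketch below is hypothetical: the `ephemeralName` helper and the `pr` prefix are assumptions, and the prefix itself is assumed to be a valid label. It relies on the Kubernetes rule that namespace names are DNS-1123 labels.

```typescript
// Kubernetes namespace names must be valid DNS-1123 labels: lowercase
// alphanumerics or '-', at most 63 characters, alphanumeric at both ends.
const MAX_LABEL_LENGTH = 63

function ephemeralName(branch: string, prefix = 'pr'): string {
  const slug = branch
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything else into single dashes
    .replace(/^-+|-+$/g, '') // trim leading and trailing dashes
  // Truncate, then trim again in case the cut left a trailing dash.
  return `${prefix}-${slug}`.slice(0, MAX_LABEL_LENGTH).replace(/-+$/, '')
}

console.log(ephemeralName('feature/Add_Login'))
```

The resulting name could be reused for the namespace, the stack, and any per-environment DNS label so that all three are easy to correlate and clean up.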

Conclusion

Pulumi has revolutionized infrastructure management at NearMe, empowering our Platform Engineering team to use familiar programming languages, enhance type safety, and implement robust testing practices. While it introduces some complexity, Pulumi excels at addressing highly dynamic requirements in our ever-evolving environment. If you’re interested in solving challenging problems and shaping the future of infrastructure at NearMe, we’re hiring! Check out the link below to join our team.

Recruit Information

Author: Cyan Chen

NearMe Tech Blog

This is NearMe's tech blog.