Options for Extracting Data from TeamForm

This document outlines the options available for extracting data from TeamForm.


Overview

TeamForm offers two mechanisms for extracting data:

- A batch-created dataset written to S3 (Options 1A, 1B, 1C), with broad coverage of the TeamForm reporting tables.

- The Public API (Option 2), for real-time data needs, focused on core people, tag, and team data fields.

You may choose to adopt either or both paths depending on your use case.


Option 1A: S3 + SFTP

How it works: TeamForm runs the extract and writes files to an S3 bucket. You download the files via SFTP using SSH key authentication, then load them into your data lake (e.g. via internal stage + COPY, or external stage if you first copy files into your own S3).

| Pros | Cons |
|---|---|
| No AWS account required on your side for access | Requires SFTP client and key management |
| Works with any downstream stack (not only AWS) | You must run a separate step to move/load files into your data lake |
| Familiar protocol for many enterprises | Optional: if files are PGP-encrypted, you must decrypt before load |
| Can restrict access by IP if required | |

Customer setup

  1. Provide SSH public key(s)
    TeamForm will create SFTP user(s) and associate your public key(s). You’ll receive the SFTP host and username.

  2. Optional: IP allow list
    If you use a fixed egress IP for SFTP, we can restrict SFTP access to that IP.

  3. Connect and download
    Use any SFTP client (e.g. sftp, WinSCP, or an orchestration tool) to connect and download the extract files from the provided path.

  4. Load into downtream system

    • Upload files to a stage (internal or S3 external stage in your account), then use COPY INTO … FROM @stage to load tables, or

    • If you first copy files into your own S3, create an external stage on that bucket and run COPY from there.

Note: SFTP is currently tied to the integration upload bucket. If the extract is written to a different bucket, TeamForm may configure an additional output (e.g. copy) so that the same files are available under the SFTP-accessible path. Confirm the exact S3 path and SFTP path with your TeamForm contact.
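
The connect-and-download step can be scripted with the OpenSSH sftp client in batch mode. The sketch below only builds the batch file and the command line; it does not connect anywhere. The host, username, key path, and remote directory are placeholders (TeamForm supplies the real values), and the `*.parquet` pattern assumes unencrypted Parquet files.

```python
import tempfile
from pathlib import Path


def build_sftp_download(host, user, key_path, remote_dir, local_dir):
    """Return (command, batch_file_path) for a non-interactive sftp download."""
    batch_lines = [
        f"lcd {local_dir}",   # local directory that receives the files (must exist)
        f"cd {remote_dir}",   # remote path confirmed with your TeamForm contact
        "get *.parquet",      # assumption: unencrypted Parquet extract files
        "bye",
    ]
    batch = tempfile.NamedTemporaryFile("w", suffix=".sftp", delete=False)
    batch.write("\n".join(batch_lines) + "\n")
    batch.close()
    # -i selects the SSH private key, -b runs the commands in the batch file.
    command = ["sftp", "-i", str(key_path), "-b", batch.name, f"{user}@{host}"]
    return command, Path(batch.name)
```

The returned command can be passed to `subprocess.run(command, check=True)` from a scheduler or orchestration tool; the load into a stage remains a separate step.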


Option 1B: AWS IAM – Pull (you read from our bucket)

How it works: TeamForm writes extract files to a TeamForm-owned S3 bucket in our AWS account. You assume an IAM role we create (trusting your AWS principal) and read objects from that bucket. You can then load into your data lake from your own S3 or via a data lake external stage that uses that assumed role.

| Pros | Cons |
|---|---|
| No inbound access from TeamForm into your account | You must have an AWS identity (role or user) to assume our role |
| You control when and how often you read | Data lives in our account until you copy it |
| Fits well with data lake external stages and storage integrations | You need to configure data lake to use the assumed-role credentials (e.g. storage integration) |
| Optional IP restriction on the role for extra security | |

Customer setup

  1. Provide your AWS role ARN
    Give TeamForm the ARN of the IAM role (or user) that will be allowed to assume our “remote reader” role (e.g. arn:aws:iam::YOUR_ACCOUNT:role/YourDatalakeDataIngestionRole).

  2. Optional: IP allow list
    If you want to restrict access by source IP, provide the CIDR(s). We will add a condition on the role’s policy so that only requests from those ranges can use it.

  3. Assume our role and read
    From your side (e.g. EC2, Lambda, or Datalake storage integration):

    • Assume the TeamForm-provided role (e.g. data-extractor-batch-remote-role-<tenantId>).

    • Use the temporary credentials to read from the bucket and prefix we give you (e.g. s3://teamform-reports-data-extract-<env>-<tenantId>/<prefix>/).

  4. Load into Datalake

    • Option A: Use a Datalake storage integration that assumes the TeamForm role (if supported in your Datalake/AWS setup), and create an external stage on our bucket. Then COPY INTO … FROM @external_stage.

    • Option B: Use a job in your account (e.g. Lambda, ECS) that assumes the role, reads from our bucket, and writes to your S3; then point Datalake at your S3.

TeamForm will provide: bucket name, optional prefix, role ARN to assume, and region. If objects are KMS-encrypted, the role we create will have permission to use the relevant KMS key.
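
As an illustration of the assume-and-read flow, here is a minimal sketch that assumes the role and lists the object keys under the prefix. The STS and S3 clients are passed in so the sketch runs without AWS access; with boto3 they would typically be `boto3.client("sts")` and a factory such as `lambda **kw: boto3.client("s3", **kw)`. The function name and the session name are illustrative, not part of TeamForm's setup.

```python
def list_extract_keys(sts_client, make_s3_client, role_arn, bucket, prefix=""):
    """Assume the reader role, then list every object key under the prefix."""
    creds = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="teamform-extract-reader",  # arbitrary label, an assumption
    )["Credentials"]
    # Build an S3 client from the temporary credentials returned by STS.
    s3 = make_s3_client(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    keys, token = [], None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if token:
            kwargs["ContinuationToken"] = token
        page = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        token = page["NextContinuationToken"]
```

Each returned key can then be downloaded or copied into your own S3 (Option B above). The temporary credentials expire, so a long-running job should re-assume the role as needed.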


Option 1C: AWS IAM – Push (we write to your bucket)

How it works: TeamForm’s Batch job assumes an IAM role in your AWS account and writes extract files directly into your S3 bucket. You then load from that bucket into Datalake (e.g. external stage + COPY).

| Pros | Cons |
|---|---|
| Data lands in your account; no pull step | You must create a role and bucket and allow our account to assume the role |
| Simple Datalake integration: external stage on your bucket | Requires AWS and some IAM setup on your side |
| You control retention, lifecycle, and access in your bucket | Cross-account and optional KMS setup to configure |
| Can target a different region via configuration | |

Customer setup

  1. Create an S3 bucket
    In the account and region where you want the extract (e.g. same region as Datalake or a dedicated data-lake account). Note the bucket name and, if different from our region, the region.

  2. Create an IAM role for TeamForm to assume

    • Create a role that only the TeamForm AWS account can assume (trust policy: Principal: { AWS: "arn:aws:iam::TEAMFORM_ACCOUNT_ID:root" } or a specific role ARN we provide).

    • Attach a policy that allows:

      • s3:PutObject, s3:GetObject, s3:DeleteObject on arn:aws:s3:::YOUR_BUCKET_NAME/* (and optionally s3:ListBucket if we need it).

    • If you use a customer-managed KMS key for the bucket, grant the role kms:Decrypt, kms:GenerateDataKey on that key.

  3. Provide TeamForm

    • Role ARN (e.g. arn:aws:iam::YOUR_ACCOUNT:role/TeamFormDataExtractWriteRole).

    • Bucket name.

    • Optional: object prefix (e.g. teamform/extract/), target region if different from our default.

  4. Optional: PGP encryption
    If you want files encrypted at rest in your bucket, provide a public PGP key. TeamForm will encrypt the Parquet files with it before uploading. You decrypt in your pipeline before loading into Datalake (or use a tool that supports PGP in the load step).

  5. Load into Datalake
    Create an external stage (and storage integration if needed) on your bucket and run COPY INTO … FROM @stage for the extract files. If files are PGP-encrypted, add a decrypt step before or during load.

TeamForm will configure the Batch job with your role ARN and bucket (and optional prefix/region) so that each run writes directly to your bucket.
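
The two policy documents from step 2 can be sketched as plain JSON. The helper below is illustrative only: the function name and parameters are hypothetical, and the principal ARN, bucket name, and KMS key ARN are placeholders to be replaced with the values agreed with TeamForm.

```python
import json


def build_push_role_policies(teamform_principal_arn, bucket, kms_key_arn=None, allow_list=False):
    """Return (trust_policy, permissions_policy) as plain dicts."""
    trust = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": teamform_principal_arn},
            "Action": "sts:AssumeRole",
        }],
    }
    statements = [{
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }]
    if allow_list:
        # s3:ListBucket applies to the bucket itself, not to the objects in it.
        statements.append({
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": f"arn:aws:s3:::{bucket}",
        })
    if kms_key_arn:
        # Only needed when the bucket uses a customer-managed KMS key.
        statements.append({
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": kms_key_arn,
        })
    return trust, {"Version": "2012-10-17", "Statement": statements}
```

Printing either dict with `json.dumps(policy, indent=2)` gives a document that can be pasted into the IAM console or an infrastructure-as-code template.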


Option 2: Public API

How it works: TeamForm exposes a REST-style Public API (POST endpoints, JSON request/response) secured by OAuth2 client credentials. You call the API to query people, teams, memberships, tags, objectives, and workspaces with filters, pagination, and optional point-in-time (asOfDate). You then ETL the responses into Datalake (e.g. scheduled jobs that call the API and load into tables). This option is not a bulk dump: it is request/response, so building a full data lake copy requires many calls and your own orchestration.

| Pros | Cons |
|---|---|
| Near real-time data (no wait for batch schedule) | Not designed for bulk export; you must paginate and orchestrate |
| Fine-grained queries (filter by team, person, date, etc.) | Rate limits apply; large datasets need many requests |
| No file transfer or AWS setup required | Only a subset of data is available (see comparison below) |
| Point-in-time (asOfDate) supported per request | JSON format; you own transforming and loading into Datalake |
| Good for incremental syncs or small/medium datasets | Several extract-only datasets have no API equivalent |

Customer setup

  1. Obtain API credentials
    Create a Public API credential (Auth0 machine-to-machine) via TeamForm (GraphQL mutation createPublicAPICredential, or self-serve if enabled). You receive a client ID and client secret; use them to obtain a JWT with audience https://api.teamform.co/<tenant-id>/api.

  2. Call the API
    Base URL: https://api.teamform.co/<tenant-id>/api (or a regional endpoint, e.g. api-euw2.teamform.co). All listed endpoints are POST; send workspaceId and optional asOfDate (ISO 8601) in the body. Use the OpenAPI spec (/getReference or /getSpecification) for exact request/response schemas.

  3. Paginate and sync
    Endpoints such as searchPeople and searchTeams support size and page. Implement a sync job that pages through results, then loads into Datalake (e.g. merge into staging tables).

  4. Optional: IP allow list
    If your tenant has Public API IP filtering enabled, ensure your egress IPs are allow-listed.
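
For step 1, the token request follows the standard OAuth2 client-credentials shape used by Auth0. The sketch below only builds the request and does not send it. The token URL is an assumption (Auth0 token endpoints normally look like `https://<auth0-domain>/oauth/token`; confirm the actual domain with TeamForm), and the audience is read from step 1 as `https://api.teamform.co/<tenant-id>/api`.

```python
import json
import urllib.request


def build_token_request(token_url, client_id, client_secret, tenant_id):
    """Build (but do not send) an OAuth2 client-credentials token request."""
    body = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Audience as read from step 1; confirm the exact value for your tenant.
        "audience": f"https://api.teamform.co/{tenant_id}/api",
    }
    return urllib.request.Request(
        token_url,  # assumption: the Auth0 token endpoint supplied by TeamForm
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` should return a JSON body whose `access_token` field is the JWT to pass as `Authorization: Bearer <token>` on API calls.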

Note: The Public API does not expose all datasets that the batch extract provides. See Data availability and format comparison below for what is available via API vs extract and the main gaps.
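
The paginate-and-sync step might look like the sketch below. The `fetch_page` callable is hypothetical: it stands for whatever function POSTs the body to an endpoint such as searchPeople and returns the list of records from the response. The response envelope and whether page numbering starts at 0 or 1 are assumptions here; check the OpenAPI spec and adjust `first_page` accordingly.

```python
def sync_all_pages(fetch_page, workspace_id, size=100, first_page=0, as_of_date=None):
    """Yield every record, paging until a short or empty page is returned."""
    page = first_page
    while True:
        body = {"workspaceId": workspace_id, "size": size, "page": page}
        if as_of_date is not None:
            body["asOfDate"] = as_of_date  # ISO 8601 string
        records = fetch_page(body)
        yield from records
        # A page shorter than the requested size is treated as the last one.
        if len(records) < size:
            return
        page += 1
```

Because it is a generator, the caller can stream records into a staging table without holding the full dataset in memory.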


Data format and scope

  • Format: Parquet (default). Filenames and optional date-based prefixes are configurable (e.g. <source>_<date>.parquet or .parquet.pgp when encrypted).

  • Datasets: The extract can include many entity types, e.g. people, teams, memberships, allocations, baselines, objectives, tags, comments, planning, and others. The exact list and names are configurable; default list is documented in the data extract Batch job (e.g. in teamform-api: DATA_EXTRACTOR_FILE_LIST / FILE_LIST in the Batch job).

  • Source code: Extract logic and scheduling live in teamform-api (Batch job definition, output configs, SFTP, IAM); the reporting data and pipeline that produce the source Parquet are in teamform-reporting.
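
Since filenames are configurable, a loader usually needs to recover the dataset name and date from each file. The sketch below assumes the example pattern <source>_<date>.parquet with an ISO date (YYYY-MM-DD) and an optional .pgp suffix; the date format is an assumption, so adjust the pattern to your tenant's configuration.

```python
import re
from datetime import date
from typing import NamedTuple


class ExtractFile(NamedTuple):
    source: str
    extract_date: date
    encrypted: bool


# Assumed pattern: <source>_<YYYY-MM-DD>.parquet with an optional .pgp suffix.
_PATTERN = re.compile(
    r"^(?P<source>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.parquet(?P<pgp>\.pgp)?$"
)


def parse_extract_filename(name):
    """Return an ExtractFile, or None if the name does not match the pattern."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return ExtractFile(
        source=match["source"],  # may itself contain underscores
        extract_date=date.fromisoformat(match["date"]),
        encrypted=match["pgp"] is not None,
    )
```

The `encrypted` flag tells the pipeline whether a PGP decrypt step is needed before the file is loaded.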


Data availability and format comparison (API vs batch extract)

Choosing between the Public API and the batch extract (Options 1A–1C) depends on how much data you need, how it’s shaped, and how you want to load it. Below is a concise comparison and gap summary.

What the Public API exposes

The Public API is documented in OpenAPI form (e.g. /getSpecification, /getReference). In summary it offers:

| Area | Endpoints (examples) | Notes |
|---|---|---|
| Workspaces | getWorkspaces | List workspaces (e.g. for picking workspaceId) |
| Teams | searchTeams, getTeams, getTeamAssociations, getTeamsAssociations, getTeamMemberships, getTeamsMemberships, getTeamTags, getTeamsTags, getTeamTypes, updateTeamLinks | Search/filter, pagination, point-in-time |
| People | searchPeople, getPeople, getPersonMemberships, getPeopleMemberships, getPersonTags, getPeopleTags | Search/filter, pagination, point-in-time; attributes via unlisted getPersonAttributes |
| Tags | getTags, getTagTypes, getAppliedTags, searchTags, searchAppliedTags | Tag definitions and applied tags |
| Objectives | getObjectives | Objectives (e.g. OKRs) |
| Memberships | searchMemberships | Memberships with filters |

  • Format: JSON request/response. Pagination via size and page (e.g. max 100 per page). Point-in-time via asOfDate in the request body where supported.

  • Auth: Bearer JWT (OAuth2 client credentials). Optional IP allow list per tenant.

What the batch extract includes (default file list)

The extract writes Parquet files (one per dataset) from the reporting data store. The default FILE_LIST includes many more “tables” than the API exposes, for example:

  • Available in both (but different shape): people, teams, memberships, tags (and related: tag_attributes, tag_people, tag_teams, entity_tag_attributes, entity_tags_today), objectives (objectives, team_objectives), associations, people_attributes.

  • Extract only (no Public API equivalent): projects, allocations, baseline_allocations, baseline_people, baseline_teams, comments, divisions, forecast, individual_allocations, memberships_journal, memberships_schedule, people_to_teams, person_history, planning, team_attributes, team_history.

So: the API is a subset of the data model, focused on current-state and search-oriented access. The extract is a full snapshot of the reporting schema, including history, baselines, planning, and allocations.

Main gaps when using the Public API for a data lake

  1. No bulk snapshot – You must call the API multiple times (paginated) to assemble the dataset. There is no single “dump all people” or “dump all teams” endpoint that returns the same scope as one Parquet file. Pulling only the changes since an “as of” date can avoid the need to fetch the full dataset each time (see point 5).

  2. Missing datasets – comments, planning, forecast, history tables and similar are not available via the Public API. For those, you need the batch extract.

  3. Format and shape – API returns nested JSON (e.g. team with memberships); extract returns flat or normalized Parquet tables. Field names and structure can differ; you need a mapping layer if you mix API and extract in the same lake.

  4. Rate limits – The API is rate-limited (e.g. per API key). Large or frequent bulk syncs can hit limits; the batch extract is better for “full refresh” of large datasets.

  5. History and point-in-time – The API supports asOfDate on individual requests. The extract gives you full table snapshots (and optionally dated paths); for historical series you may still need the extract (e.g. team_history, person_history).
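
To illustrate the mapping layer mentioned in point 3, the sketch below splits one nested team record into a flat team row and separate membership rows. The field names (`id`, `memberships`, `personId`, `fte`) are hypothetical; the real names come from the OpenAPI spec and from the extract's Parquet schema.

```python
def flatten_team(team):
    """Split one nested team record into a team row and membership rows."""
    # Everything except the nested list becomes the flat team row.
    team_row = {key: value for key, value in team.items() if key != "memberships"}
    # Each nested membership becomes its own row, keyed back to the team.
    membership_rows = [
        {"team_id": team["id"], **membership}
        for membership in team.get("memberships", [])
    ]
    return team_row, membership_rows
```

The same pattern applies to any nested collection in an API response: lift the child list into its own table and carry the parent key along.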

When to use which

  • Use the Public API when you need near real-time, incremental, or filtered access to people, teams, memberships, tags, and objectives, and your volume is manageable (pagination and rate limits are acceptable). Good for syncs into Datalake that don’t require planning and forecast data.

  • Use the batch extract (Options 1A–1C) when you need full reporting-data snapshots, including planning, forecasting, comments, and history tables. Best for a full data lake load or when the API doesn’t expose the entities you need.


Summary

| Option | Who moves data? | Where data lives | Best when |
|---|---|---|---|
| S3 + SFTP | You pull via SFTP | Our S3 (SFTP-accessible path) | You don’t use AWS or prefer SFTP |
| IAM Pull | You pull via assume role | Our S3 | You use AWS and want to read from us |
| IAM Push | We push to your bucket | Your S3 | You use AWS and want data in your account |
| Public API | You call API, then load | Your pipeline / Datalake | Near real-time, subset of data, no file transfer |

For Datalake:

  • SFTP: Download from SFTP → upload to Datalake stage (or your S3) → COPY into tables.

  • Pull: Assume our role → read from our bucket → either stage in your S3 and use Datalake external stage, or use a storage integration to our bucket if your Datalake setup supports it.

  • Push: Data lands in your S3 → create external stage on that bucket → COPY into Datalake tables (with optional PGP decrypt in the pipeline).

  • Public API: Call the API (paginated), write JSON to stage or tables (e.g. variant columns), or ETL into relational tables. Best for the subset of data the API exposes; use extract for full scope and for data not available via API.
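
For the Public API path, one common way to "write JSON to stage" is newline-delimited JSON, with one record per line. The helper below is a minimal sketch; the function name is illustrative, and whether your Datalake expects NDJSON or another layout is an assumption to verify.

```python
import json


def write_ndjson(records, path):
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            # Compact separators keep each record on a single short line.
            handle.write(json.dumps(record, ensure_ascii=False, separators=(",", ":")) + "\n")
            count += 1
    return count
```

The resulting file can be uploaded to a stage and loaded into a variant-style column, or parsed into relational tables in a later step.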

If you tell us your preference (SFTP vs Pull vs Push vs API) and whether you use AWS and PGP, we can confirm the exact setup steps and config keys (e.g. DATA_EXTRACTOR_OUTPUT_CONFIGS, DATA_EXTRACTOR_REMOTE_ROLE, SFTP host, bucket names, role ARNs, or API credentials) for your tenant.