Cloud Storage and Data Warehouse Destinations

How it works

Most destinations stream data in real time. Our data warehouse and storage destinations instead send data to your storage buckets in bulk at regular, 10-minute intervals. When we load data, we insert and update events, people, and groups in JSON, CSV, or Parquet files that we upload to your storage bucket. You can then ingest those files into the data warehouse or database of your choice.

These destinations only create new files in your storage bucket; they never overwrite or append to an existing file. This means you can safely delete files from your storage bucket after you ingest them into their ultimate destination: your data warehouse or database.

Exported files

Our data warehouse and cloud storage destinations generate Parquet, JSON, or CSV files that we load into a storage bucket you specify. The files we generate depend on the Actions you enable. By default, we generate files for all our standard source calls: identify, group, page, screen, track, and alias.

Files are named in the format workspaceName-dataType-syncNumber. For example, if your workspace is called production, files for track calls will be called production_tracks_1 where _1 is the first file.

Sync frequency

Unlike other destinations where we send data in real time, these kinds of destinations attempt to send data to your storage bucket every 10 minutes—though actual sync intervals and processing times may vary. When syncing large data sets, or when you have a high volume of concurrent sync operations, it can take a little longer to process and export data.

Each sync file contains data from the previous sync interval. For example, if the last sync occurred at 12:00 PM, the next sync will only send data from 12:00 PM to 12:09:59 PM.
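
The window arithmetic above can be sketched in a few lines of Python. This is only an illustration: the names `SYNC_INTERVAL`, `next_sync_window`, and `in_window` are hypothetical, the 10-minute interval is the nominal value (actual intervals may vary), and it assumes events are assigned to a window by the time they are received.

```python
from datetime import datetime, timedelta, timezone

# Nominal interval from the text above; real sync timing may vary.
SYNC_INTERVAL = timedelta(minutes=10)


def next_sync_window(last_sync: datetime) -> tuple[datetime, datetime]:
    """Return the half-open window [start, end) covered by the next sync."""
    return last_sync, last_sync + SYNC_INTERVAL


def in_window(received_at: datetime, window: tuple[datetime, datetime]) -> bool:
    """True if an event received at `received_at` falls inside the window."""
    start, end = window
    return start <= received_at < end


# Hypothetical example: the last sync happened at 12:00 PM UTC.
last_sync = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window = next_sync_window(last_sync)
```

With a half-open window, an event received at 12:09:59 belongs to this sync, while one received at exactly 12:10:00 belongs to the next.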

Handling objects and arrays in CSV and Parquet files

Our source libraries pass nested objects and arrays in source calls as properties and traits, but CSVs and Parquet files don’t have a concept of objects or arrays. So we stringify or flatten properties and traits in CSVs and Parquet files to preserve source data without significantly manipulating it.

{
  "received_at": "2019-08-24T14:15:22Z",
  "id": "a7280cfea0f6d",
  "user_id": "97980cfea0067",
  "anonymous_id": "d19b0cfeb606a",
  "sent_at": "2019-08-24T14:15:22Z",
  "traits": {
    "name": "Cool Person",
    "email": "cool.person@example.com",
    "likes_baseball": true
  },
  "context": {
    ...
  }
}

In a CSV file, the same call looks like this:

received_at,id,user_id,anonymous_id,sent_at,traits,context
2019-08-24T14:15:22Z,a7280cfea0f6d,97980cfea0067,d19b0cfeb606a,2019-08-24T14:15:22Z,"{\"name\": \"Cool Person\", \"email\": \"cool.person@example.com\", \"likes_baseball\": true}", "{...}"
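
When you ingest CSV files, you will typically want to turn the stringified columns back into objects. The following is a minimal sketch using only Python's standard library. It writes its own CSV so that it is self-contained, and `stringify_nested` is a hypothetical helper, not part of any Customer.io library. The quoting and escaping in your exported files may differ from what Python's `csv` module produces, so adjust the reader options to match your files.

```python
import csv
import io
import json

# A trimmed-down version of the call shown above.
event = {
    "received_at": "2019-08-24T14:15:22Z",
    "id": "a7280cfea0f6d",
    "user_id": "97980cfea0067",
    "traits": {"name": "Cool Person", "likes_baseball": True},
}


def stringify_nested(row: dict) -> dict:
    """Serialize dict and list values to JSON strings; leave scalars untouched."""
    return {
        key: json.dumps(value) if isinstance(value, (dict, list)) else value
        for key, value in row.items()
    }


# Write one CSV row with the nested traits object stringified.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(event))
writer.writeheader()
writer.writerow(stringify_nested(event))

# Read the row back and recover the original object from the traits column.
buffer.seek(0)
parsed = next(csv.DictReader(buffer))
traits = json.loads(parsed["traits"])
```

After the round trip, `parsed["traits"]` is a plain string and `traits` is a dictionary again, so you can load it into a JSON-typed column or expand it into separate columns.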

Schemas

When we load data into your storage buckets, we create files that match the shape of your source data. Note that we flatten or stringify nested objects and arrays according to the rules above.

Identifies schema

Identifies files contain identify calls made from your sources. The context and traits in the schema below are objects in JSON. In CSV and parquet files, these columns contain stringified objects.

  • anonymous_id string
    A unique substitute for a User ID in cases when you don’t have an absolutely unique identifier. Our libraries generate this value automatically to help you track people before they sign up, log in, provide their email, etc.
  • context
    A dictionary of context about a source call/event, like the user’s IP address or locale. Context is automatically collected by our source libraries.
    • active boolean

      Whether a user is active.

      This is usually used when you send an .identify() call to update the traits independently of when you’ve “last seen” a user.

    • channel string
      The channel the event originated from.

      Accepted values: browser, server, mobile

    • ip string
      The user’s IP address. This isn’t captured by our libraries, but by our servers when we receive client-side events (like from our JavaScript source).
    • locale string
      The locale string for the current user, e.g. en-US.
    • userAgent string
      The user agent of the device making the request
    • campaign
      • content string
      • medium string
        The type of traffic a person/event originates from, like email, or referral.
      • name string
        The campaign name.
      • source string
        The source of traffic—like the name of your email list, Facebook, Google, etc.
      • term string
        The keyword term(s) a user came from.
      • Additional UTM Parameters* string
    • page
      • keywords array of [ strings ]
        A list/array of keywords describing the page’s content. The keywords are likely the same as, or similar to, the keywords you would find in an HTML meta tag for SEO purposes. This property is mainly used by content publishers that rely heavily on pageview tracking. This isn’t automatically collected.
      • name string
        The name of the page. Reserved for future use.
      • path string
        The path portion of the page’s URL. Equivalent to the canonical path which defaults to location.pathname from the DOM API.
      • referrer string
        The previous page’s full URL. Equivalent to document.referrer from the DOM API.
      • search string
        The query string portion of the page’s URL. Equivalent to location.search from the DOM API.
      • title string
        The page’s title. Equivalent to document.title from the DOM API.
      • url string
        A page’s full URL. We first look for the canonical URL. If the canonical URL is not provided, we’ll use location.href from the DOM API.
  • id string
    A unique identifier for a Data Pipelines event, ensuring that each individual event is unique.
  • received_at string  (date-time)
    The ISO-8601 timestamp when Data Pipelines receives an event.
  • sent_at string  (date-time)
    The ISO-8601 timestamp when a library sends an event to Data Pipelines.
  • traits
    • createdAt string  (date-time)
      We recommend that you pass date-time values as ISO 8601 date-time strings. We convert this value to fit destinations where appropriate.
    • email string
      A person’s email address. In some cases, you can pass an empty userId and we’ll use this value to identify a person.
    • Additional Traits* any type
      Traits that you want to set on a person. These can take any JSON shape.
  • user_id string
    The unique identifier for a person. This value should be unique across systems, so you recognize the same person in your sources and destinations.

Groups schema

Groups files contain group calls made from your sources. If your integration outputs CSV or parquet files, the context and traits columns contain stringified objects.

  • anonymous_id string
    A unique substitute for a User ID in cases when you don’t have an absolutely unique identifier. Our libraries generate this value automatically to help you track people before they sign up, log in, provide their email, etc.
  • group_id string
    ID of the group
  • id string
    A unique identifier for a Data Pipelines event, ensuring that each individual event is unique.
  • objectTypeId string

    If you use Customer.io Journeys as a destination, this value is the type of group/object your group belongs to; object type IDs are stringified integers. If you don’t include this value, we assume the object type ID is 1. See objects in Customer.io Journeys for more information.

    You can include this value as objectTypeId at the top level of your payload or as object_type_id in the traits object.

  • received_at string  (date-time)
    The ISO-8601 timestamp when Data Pipelines receives an event.
  • sent_at string  (date-time)
    The ISO-8601 timestamp when a library sends an event to Data Pipelines.
  • traits
    • Additional Traits* any type
      Traits can have any name, like `account_name` or `total_employees`. These can take any JSON shape.
  • user_id string
    The unique identifier for a person. This value should be unique across systems, so you recognize the same person in your sources and destinations.

Page schema

Pages files contain entries for the page calls your sources send to Customer.io. If your integration outputs CSV or Parquet files, the context and properties columns contain stringified objects. If your integration outputs JSON files, the context and properties columns contain objects.

  • anonymous_id string
    A unique substitute for a User ID in cases when you don’t have an absolutely unique identifier. Our libraries generate this value automatically to help you track people before they sign up, log in, provide their email, etc.
  • context
    A dictionary of context about a source call/event, like the user’s IP address or locale. Context is automatically collected by our source libraries.
    • active boolean

      Whether a user is active.

      This is usually used when you send an .identify() call to update the traits independently of when you’ve “last seen” a user.

    • channel string
      The channel the event originated from.

      Accepted values: browser, server, mobile

    • ip string
      The user’s IP address. This isn’t captured by our libraries, but by our servers when we receive client-side events (like from our JavaScript source).
    • locale string
      The locale string for the current user, e.g. en-US.
    • userAgent string
      The user agent of the device making the request
    • campaign
      • content string
      • medium string
        The type of traffic a person/event originates from, like email, or referral.
      • name string
        The campaign name.
      • source string
        The source of traffic—like the name of your email list, Facebook, Google, etc.
      • term string
        The keyword term(s) a user came from.
      • Additional UTM Parameters* string
    • page
      • keywords array of [ strings ]
        A list/array of keywords describing the page’s content. The keywords are likely the same as, or similar to, the keywords you would find in an HTML meta tag for SEO purposes. This property is mainly used by content publishers that rely heavily on pageview tracking. This isn’t automatically collected.
      • name string
        The name of the page. Reserved for future use.
      • path string
        The path portion of the page’s URL. Equivalent to the canonical path which defaults to location.pathname from the DOM API.
      • referrer string
        The previous page’s full URL. Equivalent to document.referrer from the DOM API.
      • search string
        The query string portion of the page’s URL. Equivalent to location.search from the DOM API.
      • title string
        The page’s title. Equivalent to document.title from the DOM API.
      • url string
        A page’s full URL. We first look for the canonical URL. If the canonical URL is not provided, we’ll use location.href from the DOM API.
  • id string
    A unique identifier for a Data Pipelines event, ensuring that each individual event is unique.
  • properties
    • category string
      The category of the page. This might be useful if you have single-page routes or a flattened URL structure.
    • path string
      The path of the page. This defaults to location.pathname, but can be overridden.
    • referrer string
      The referrer of the page, if applicable. This defaults to document.referrer, but can be overridden.
    • search string
      The search query in the URL, if present. This defaults to location.search, but can be overridden.
    • title string
      The title of the page. This defaults to document.title, but can be overridden.
    • url string
      The URL of the page. This defaults to a canonical url if available, and falls back to document.location.href.
    • Page Properties* any type
  • received_at string  (date-time)
    The ISO-8601 timestamp when Data Pipelines receives an event.
  • sent_at string  (date-time)
    The ISO-8601 timestamp when a library sends an event to Data Pipelines.
  • user_id string
    The unique identifier for a person. This value should be unique across systems, so you recognize the same person in your sources and destinations.

Screen schema

Screens files contain entries for the screen calls your sources send to Customer.io. If your integration outputs CSV or parquet files, the context and properties columns contain stringified objects. If your integration outputs JSON files, the context and properties columns contain objects.

  • anonymous_id string
    A unique substitute for a User ID in cases when you don’t have an absolutely unique identifier. Our libraries generate this value automatically to help you track people before they sign up, log in, provide their email, etc.
  • context
    A dictionary of context about a source call/event, like the user’s IP address or locale. Context is automatically collected by our source libraries.
    • active boolean

      Whether a user is active.

      This is usually used when you send an .identify() call to update the traits independently of when you’ve “last seen” a user.

    • channel string
      The channel the event originated from.

      Accepted values: browser, server, mobile

    • ip string
      The user’s IP address. This isn’t captured by our libraries, but by our servers when we receive client-side events (like from our JavaScript source).
    • locale string
      The locale string for the current user, e.g. en-US.
    • userAgent string
      The user agent of the device making the request
    • campaign
      • content string
      • medium string
        The type of traffic a person/event originates from, like email, or referral.
      • name string
        The campaign name.
      • source string
        The source of traffic—like the name of your email list, Facebook, Google, etc.
      • term string
        The keyword term(s) a user came from.
      • Additional UTM Parameters* string
    • page
      • keywords array of [ strings ]
        A list/array of keywords describing the page’s content. The keywords are likely the same as, or similar to, the keywords you would find in an HTML meta tag for SEO purposes. This property is mainly used by content publishers that rely heavily on pageview tracking. This isn’t automatically collected.
      • name string
        The name of the page. Reserved for future use.
      • path string
        The path portion of the page’s URL. Equivalent to the canonical path which defaults to location.pathname from the DOM API.
      • referrer string
        The previous page’s full URL. Equivalent to document.referrer from the DOM API.
      • search string
        The query string portion of the page’s URL. Equivalent to location.search from the DOM API.
      • title string
        The page’s title. Equivalent to document.title from the DOM API.
      • url string
        A page’s full URL. We first look for the canonical URL. If the canonical URL is not provided, we’ll use location.href from the DOM API.
  • id string
    A unique identifier for a Data Pipelines event, ensuring that each individual event is unique.
  • properties
    • Additional event properties* any type
      Properties that you sent in the event. These can take any JSON shape.
  • received_at string  (date-time)
    The ISO-8601 timestamp when Data Pipelines receives an event.
  • sent_at string  (date-time)
    The ISO-8601 timestamp when a library sends an event to Data Pipelines.
  • user_id string
    The unique identifier for a person. This value should be unique across systems, so you recognize the same person in your sources and destinations.

Track schema

Tracks files contain entries for the track calls you send to Customer.io. They show information about the events your users perform.

If your integration outputs CSV or parquet files, the context and properties columns contain stringified objects. If your integration outputs JSON files, the context and properties columns contain objects.

  • anonymous_id string
    A unique substitute for a User ID in cases when you don’t have an absolutely unique identifier. Our libraries generate this value automatically to help you track people before they sign up, log in, provide their email, etc.
  • context
    A dictionary of context about a source call/event, like the user’s IP address or locale. Context is automatically collected by our source libraries.
    • active boolean

      Whether a user is active.

      This is usually used when you send an .identify() call to update the traits independently of when you’ve “last seen” a user.

    • channel string
      The channel the event originated from.

      Accepted values: browser, server, mobile

    • ip string
      The user’s IP address. This isn’t captured by our libraries, but by our servers when we receive client-side events (like from our JavaScript source).
    • locale string
      The locale string for the current user, e.g. en-US.
    • userAgent string
      The user agent of the device making the request
    • campaign
      • content string
      • medium string
        The type of traffic a person/event originates from, like email, or referral.
      • name string
        The campaign name.
      • source string
        The source of traffic—like the name of your email list, Facebook, Google, etc.
      • term string
        The keyword term(s) a user came from.
      • Additional UTM Parameters* string
    • page
      • keywords array of [ strings ]
        A list/array of keywords describing the page’s content. The keywords are likely the same as, or similar to, the keywords you would find in an HTML meta tag for SEO purposes. This property is mainly used by content publishers that rely heavily on pageview tracking. This isn’t automatically collected.
      • name string
        The name of the page. Reserved for future use.
      • path string
        The path portion of the page’s URL. Equivalent to the canonical path which defaults to location.pathname from the DOM API.
      • referrer string
        The previous page’s full URL. Equivalent to document.referrer from the DOM API.
      • search string
        The query string portion of the page’s URL. Equivalent to location.search from the DOM API.
      • title string
        The page’s title. Equivalent to document.title from the DOM API.
      • url string
        A page’s full URL. We first look for the canonical URL. If the canonical URL is not provided, we’ll use location.href from the DOM API.
  • event string
    The slug of the event name, mapping to an event-specific table.
  • event_text string
    The name of the event.
  • id string
    A unique identifier for a Data Pipelines event, ensuring that each individual event is unique.
  • properties
    • Event Properties* any type
  • received_at string  (date-time)
    The ISO-8601 timestamp when Data Pipelines receives an event.
  • sent_at string  (date-time)
    The ISO-8601 timestamp when a library sends an event to Data Pipelines.
  • user_id string
    The unique identifier for a person. This value should be unique across systems, so you recognize the same person in your sources and destinations.

Alias schema

The Alias schema contains entries for the alias calls you send to Customer.io. It shows information about the users you merge, with each entry showing a user’s new user_id and their previous_id.

  • id string
    A unique identifier for a Data Pipelines event, ensuring that each individual event is unique.
  • previous_id string
    The userId that you want to merge into the canonical profile.
  • received_at string  (date-time)
    The ISO-8601 timestamp when Data Pipelines receives an event.
  • sent_at string  (date-time)
    The ISO-8601 timestamp when a library sends an event to Data Pipelines.
  • user_id string
    The unique identifier for a person. This value should be unique across systems, so you recognize the same person in your sources and destinations.
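
If you want to apply these merges downstream, one possible approach is to build a lookup from each previous_id to the user_id it was ultimately merged into. The sketch below is an assumption about how you might do this in your own ingestion code; `build_canonical_map` is a hypothetical helper, and it assumes you pass alias rows in received_at order so that later merges win.

```python
def build_canonical_map(alias_rows):
    """Map each previous_id to the user_id it was merged into, following chains."""
    # Later rows overwrite earlier ones for the same previous_id.
    direct = {row["previous_id"]: row["user_id"] for row in alias_rows}

    def resolve(user_id):
        # Follow previous_id -> user_id links until we reach an ID that was
        # never merged again; `seen` guards against accidental cycles.
        seen = set()
        while user_id in direct and user_id not in seen:
            seen.add(user_id)
            user_id = direct[user_id]
        return user_id

    return {previous: resolve(previous) for previous in direct}


# Hypothetical alias rows: anon-1 was merged into user-a, then user-a into user-b.
alias_rows = [
    {"previous_id": "anon-1", "user_id": "user-a"},
    {"previous_id": "user-a", "user_id": "user-b"},
]
canonical = build_canonical_map(alias_rows)
```

Here both `anon-1` and `user-a` resolve to `user-b`, so rows from other files that carry either ID can be attributed to the same person.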

Timestamps

We associate four timestamps with every source call: timestamp, original_timestamp, sent_at and received_at. All four timestamps pass through to your warehouse, and it may help to understand the purpose of each.

In general, you should use timestamp when you query for historical events and received_at for all other queries based on time.

timestamp is the UTC-converted timestamp set by the Customer.io library. If you import historical events using a server-side library, this is the timestamp you’ll want to reference in your queries.

original_timestamp is the original timestamp set by the source library when the event/source call is created. This timestamp can be affected by device clock skew. You can override this value by manually passing a timestamp in your source calls, which we map to the original_timestamp. Generally, this timestamp should be ignored in favor of the timestamp column.

sent_at is a UTC timestamp set when source libraries send calls to Customer.io. This timestamp can also be affected by device clock skew.

received_at is a UTC timestamp set by the Customer.io source API when we receive a payload from a source library. All tables use received_at as the sort key.

 Use received_at for queries based on time

The sent_at timestamp relies on a client’s device clock being accurate, which can be unreliable.
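
As a rough illustration of the guidance above, the following Python sketch orders rows by received_at and measures how far each row's sent_at drifts from it. The rows are made-up sample data, and `parse_ts` is a hypothetical helper; it rewrites the trailing `Z` because `datetime.fromisoformat` only accepts that suffix in newer Python versions.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, rewriting a trailing 'Z' as a UTC offset."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# Made-up sample rows; row "b" comes from a device whose clock runs fast.
rows = [
    {"id": "b", "sent_at": "2019-08-24T14:20:00Z", "received_at": "2019-08-24T14:15:25Z"},
    {"id": "a", "sent_at": "2019-08-24T14:15:22Z", "received_at": "2019-08-24T14:15:23Z"},
]

# Order by the server-side timestamp, which does not depend on device clocks.
ordered = sorted(rows, key=lambda row: parse_ts(row["received_at"]))

# Positive values mean the device clock was ahead of the server when it sent the call.
skew_seconds = {
    row["id"]: (parse_ts(row["sent_at"]) - parse_ts(row["received_at"])).total_seconds()
    for row in rows
}
```

Sorting by sent_at would have kept row "b" after row "a" here only by coincidence; a device clock that runs slow could just as easily reorder events, which is why received_at is the safer key.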

id

Each row in your database has an id which is equivalent to the messageId that our source libraries pass in source calls. This is a unique identifier associated with the row.
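
Because id is unique per source call, you can use it to guard against loading the same row twice, for example if your own ingestion job retries a file. This is a minimal sketch under that assumption; `dedupe_by_id` is a hypothetical helper.

```python
def dedupe_by_id(rows):
    """Keep the first row seen for each id, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] in seen:
            continue  # Already loaded this source call; skip the repeat.
        seen.add(row["id"])
        unique.append(row)
    return unique


# Made-up rows where the first call appears twice.
rows = [
    {"id": "a7280cfea0f6d", "event": "signed_up"},
    {"id": "b1190cfea0f6e", "event": "logged_in"},
    {"id": "a7280cfea0f6d", "event": "signed_up"},
]
unique_rows = dedupe_by_id(rows)
```

In a warehouse, the equivalent is usually a uniqueness check or merge on the id column during load.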

Sort Key

All tables use received_at as the sort key. Amazon Redshift stores your data on disk in sorted order according to the sort key. The Redshift query optimizer uses sort order when it determines optimal query plans.
