Warehouse Tables

Objects from the main Benchling database are periodically synced to the warehouse to make them available for queries. It may take time (usually a few minutes, up to 24 hours) for data to appear in the warehouse.

If you need additional help for the Benchling Insights application please visit https://help.benchling.com/hc/en-us/sections/8506377929869-Insights.

📘

Why isn't a table appearing in my warehouse?

Warehouse tables only appear once there are rows to create for that table! That means if you create a new schema, it won't appear in the warehouse until you've also created an object under that schema.

$raw Tables

There are usually two tables within the warehouse for a single schema in Benchling: one with and without the suffix $raw after the schema name. Generally speaking, the views are "cleaned up" versions of the raw data (e.g. archived items are hidden, only results from reviewed notebook entries are shown, etc). Deciding which table version to use depends on your specific use case.

For most users, we recommend using the "cleaned up" views, particularly if you are just getting started. These views are designed to present only the most useful data for the most common warehouse use cases, but might suffer from performance issues in some particularly complex edge cases. Users that are using particularly complex queries or who need optimal performance, we recommend using the unprocessed $raw table versions instead.

Some important considerations when using $raw tables:

  • $raw tables also include various "other" columns starting with "_"; these columns exist for internal use, and shouldn't be used in any query.
  • The _pkey column does not represent the primary key for a particular table. All tables will have an initial id column which represents that particular table's primary key.
  • The acl_resource_id column exists for internal use, and should not be used in any query.
  • The stability and semantics of these columns do not fall under our Stability guidelines, and may change without warning.

Joins

The Benchling warehouse, like other relational databases, stores data across multiple tables. But what happens when you want to have a single table representing rows from several tables at once? This is where SQL joins can help.

There are already a lot of great tutorials elsewhere on using joins in SQL. Here we'll just introduce a couple of examples relevant to interacting with data across multiple Benchling warehouse tables.

Simple joins

Notebook entry metadata is stored in the entry warehouse table, but an entry is related to other concepts (and tables) in Benchling, for instance:

  • a Folder and Project that the entry is stored in
  • a User (or App) that created the entry
  • (optional) a schema that the entry was registered in

The example below shows how to use a series of joins between the entry table and folder, project, principal (this table includes both users and Benchling Apps, as either can author an entry) and schema tables. We'll just return the name from these joined tables, but you could return other fields by including them in the SELECT statement.

SELECT
    e.id, 
    e.archived$, 
    e.archive_purpose$, 
    e.created_at, 
    e.modified_at, 
    e.name as entry_name, 
    e.display_id,
    pr.name as creator_name,
    f.name as folder_name, 
    p.name as project_name, 
    s.name as schema_name
FROM entry e
LEFT JOIN folder f on e.folder_id = f.id
LEFT JOIN project p on e.source_id = p.id
LEFT JOIN principal pr on e.creator_id = pr.id
LEFT JOIN schema s on e.schema_id = s.id

What if we also want to list out the names of all the authors for each notebook entry? One way to do this is to create a Common Table Expression (CTE) where we join the author and principal tables and combine the authors names together using string_agg. We'll call this CTE author_names:

WITH author_names AS (
    SELECT 
        a.entry_id, 
        string_agg(pr.name, ', ' ORDER BY pr.name) AS author_names
    FROM author a
    JOIN principal pr ON a.user_id = pr.id
    GROUP BY a.entry_id
)
SELECT
    e.id, 
    e.archived$, 
    e.archive_purpose$, 
    e.created_at, 
    e.modified_at, 
    e.name AS entry_name, 
    e.display_id, 
    creator.name AS creator_name,
    f.name AS folder_name, 
    p.name AS project_name, 
    s.name AS schema_name,
    an.author_names
FROM entry e
LEFT JOIN folder f ON e.folder_id = f.id
LEFT JOIN project p ON e.source_id = p.id
LEFT JOIN principal creator ON e.creator_id = creator.id
LEFT JOIN schema s ON e.schema_id = s.id
LEFT JOIN author_names an ON e.id = an.entry_id

Advanced joins

Recursive queries are a useful way to summarize hierarchies, where the data forms a tree or graph-like structure. Some example of where a recursive query may come in handy in Benchling are:

  • summarizing Benchling Inventory locations (for example: Plate -> Shelf -> Freezer -> Lab -> Building)
  • summarizing folder locations of Benchling objects - which we'll explore in more detail in this section

To create a list of Notebook entries and their corresponding locations, start by defining a new recursive CTE - called folder_path here. Folder names and IDs are in the folder warehouse table:

WITH RECURSIVE folder_path AS (
  SELECT 
    id, 
    name AS path
  FROM folder
  WHERE parent_folder_id is null
  UNION ALL
  SELECT
    f.id, concat_ws(' / ', fp.path, f.name)
  FROM
    folder AS f
  JOIN folder_path AS fp ON f.parent_folder_id = fp.id
),

Now that we have a list of folders and their corresponding paths, create a second CTE (called entries) by joining the folder_path CTE to the entry table to give the complete list of Notebook Entries and folder locations.

entries as (
  SELECT e.id, concat_ws(' / ', path, e.name) AS folder_path
  FROM folder_path AS fp
  JOIN entry e ON e.folder_id = fp.id
)
SELECT * from entries

For more details on recursive queries, refer to the PostgreSQL documentation here.

Metadata Tables

In the Insights product, Benchling currently limits the access to many Postgres metadata tables found in information_schema and pg_catalog . The following are the metadata tables currently available to users for convenience:

information_schema.check_constraints
information_schema.columns
information_schema.constraint_column_usage
information_schema.key_column_usage
information_schema.referential_constraints
information_schema.routines
information_schema.tables
information_schema.table_constraints
information_schema.views
pg_catalog.pg_attrdef
pg_catalog.pg_attribute
pg_catalog.pg_event_trigger
pg_catalog.pg_class
pg_catalog.pg_extension
pg_catalog.pg_index
pg_catalog.pg_namespace
pg_catalog.pg_proc
pg_catalog.pg_stats
pg_catalog.pg_tables
pg_catalog.pg_timezone_names
pg_catalog.pg_type
pg_catalog.pg_views