Connect to data source

We'll set up our data source by copying the Formula 1 data into Snowflake tables from a public S3 bucket that dbt Labs hosts.

  1. When a new Snowflake account is created, there should be a preconfigured warehouse in your account named COMPUTE_WH.

  2. If for any reason your account doesn’t have this warehouse, you can create one using the following script:

    create or replace warehouse COMPUTE_WH with warehouse_size=XSMALL;
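
    The one-liner above is all this lab needs. If you'd also like the warehouse to pause itself when idle so it doesn't consume credits between steps, a sketch using Snowflake's optional warehouse properties (the 60-second timeout is an arbitrary choice) looks like this:

    -- same warehouse, but it suspends itself when idle and resumes on the next query
    create or replace warehouse COMPUTE_WH with
        warehouse_size = 'XSMALL'
        auto_suspend = 60           -- suspend after 60 seconds of inactivity
        auto_resume = true          -- wake up automatically when a query arrives
        initially_suspended = true; -- start out suspended rather than running
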
  3. Rename the worksheet to data setup script since we will be placing code in this worksheet to ingest the Formula 1 data. Make sure you are still logged in as the ACCOUNTADMIN and select the COMPUTE_WH warehouse.

    Rename worksheet and select warehouse
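
    If you prefer to set that context in SQL rather than through the UI dropdowns, the same role and warehouse can be selected directly in the worksheet:

    -- set the active role and warehouse for this session
    use role ACCOUNTADMIN;
    use warehouse COMPUTE_WH;
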
  4. Copy the following code into the main body of the Snowflake worksheet. You can also find this setup script under the setup folder in the Git repository. The script is long since it brings in all of the data we'll need today!

    -- create and define our formula1 database
    create or replace database formula1;
    use database formula1;
    create or replace schema raw;
    use schema raw;

    -- define our file format for reading in the csvs
    create or replace file format csvformat
    type = csv
    field_delimiter = ','
    field_optionally_enclosed_by = '"'
    skip_header = 1;

    -- create a stage referencing the public s3 bucket that holds our data
    create or replace stage formula1_stage
    file_format = csvformat
    url = 's3://formula1-dbt-cloud-python-demo/formula1-kaggle-data/';

    -- load in the 8 tables we need for our demo
    -- we are first creating the table then copying our data in from s3
    -- think of this as an empty container or shell that we are then filling
    create or replace table formula1.raw.circuits (
    CIRCUITID NUMBER(38,0),
    CIRCUITREF VARCHAR(16777216),
    NAME VARCHAR(16777216),
    LOCATION VARCHAR(16777216),
    COUNTRY VARCHAR(16777216),
    LAT FLOAT,
    LNG FLOAT,
    ALT NUMBER(38,0),
    URL VARCHAR(16777216)
    );
    -- copy our data from public s3 bucket into our tables
    copy into circuits
    from @formula1_stage/circuits.csv
    on_error='continue';

    create or replace table formula1.raw.constructors (
    CONSTRUCTORID NUMBER(38,0),
    CONSTRUCTORREF VARCHAR(16777216),
    NAME VARCHAR(16777216),
    NATIONALITY VARCHAR(16777216),
    URL VARCHAR(16777216)
    );
    copy into constructors
    from @formula1_stage/constructors.csv
    on_error='continue';

    create or replace table formula1.raw.drivers (
    DRIVERID NUMBER(38,0),
    DRIVERREF VARCHAR(16777216),
    NUMBER VARCHAR(16777216),
    CODE VARCHAR(16777216),
    FORENAME VARCHAR(16777216),
    SURNAME VARCHAR(16777216),
    DOB DATE,
    NATIONALITY VARCHAR(16777216),
    URL VARCHAR(16777216)
    );
    copy into drivers
    from @formula1_stage/drivers.csv
    on_error='continue';

    create or replace table formula1.raw.lap_times (
    RACEID NUMBER(38,0),
    DRIVERID NUMBER(38,0),
    LAP NUMBER(38,0),
    POSITION FLOAT,
    TIME VARCHAR(16777216),
    MILLISECONDS NUMBER(38,0)
    );
    copy into lap_times
    from @formula1_stage/lap_times.csv
    on_error='continue';

    create or replace table formula1.raw.pit_stops (
    RACEID NUMBER(38,0),
    DRIVERID NUMBER(38,0),
    STOP NUMBER(38,0),
    LAP NUMBER(38,0),
    TIME VARCHAR(16777216),
    DURATION VARCHAR(16777216),
    MILLISECONDS NUMBER(38,0)
    );
    copy into pit_stops
    from @formula1_stage/pit_stops.csv
    on_error='continue';

    create or replace table formula1.raw.races (
    RACEID NUMBER(38,0),
    YEAR NUMBER(38,0),
    ROUND NUMBER(38,0),
    CIRCUITID NUMBER(38,0),
    NAME VARCHAR(16777216),
    DATE DATE,
    TIME VARCHAR(16777216),
    URL VARCHAR(16777216),
    FP1_DATE VARCHAR(16777216),
    FP1_TIME VARCHAR(16777216),
    FP2_DATE VARCHAR(16777216),
    FP2_TIME VARCHAR(16777216),
    FP3_DATE VARCHAR(16777216),
    FP3_TIME VARCHAR(16777216),
    QUALI_DATE VARCHAR(16777216),
    QUALI_TIME VARCHAR(16777216),
    SPRINT_DATE VARCHAR(16777216),
    SPRINT_TIME VARCHAR(16777216)
    );
    copy into races
    from @formula1_stage/races.csv
    on_error='continue';

    create or replace table formula1.raw.results (
    RESULTID NUMBER(38,0),
    RACEID NUMBER(38,0),
    DRIVERID NUMBER(38,0),
    CONSTRUCTORID NUMBER(38,0),
    NUMBER NUMBER(38,0),
    GRID NUMBER(38,0),
    POSITION FLOAT,
    POSITIONTEXT VARCHAR(16777216),
    POSITIONORDER NUMBER(38,0),
    POINTS NUMBER(38,0),
    LAPS NUMBER(38,0),
    TIME VARCHAR(16777216),
    MILLISECONDS NUMBER(38,0),
    FASTESTLAP NUMBER(38,0),
    RANK NUMBER(38,0),
    FASTESTLAPTIME VARCHAR(16777216),
    FASTESTLAPSPEED FLOAT,
    STATUSID NUMBER(38,0)
    );
    copy into results
    from @formula1_stage/results.csv
    on_error='continue';

    create or replace table formula1.raw.status (
    STATUSID NUMBER(38,0),
    STATUS VARCHAR(16777216)
    );
    copy into status
    from @formula1_stage/status.csv
    on_error='continue';

  5. Ensure all the commands are selected before running the query; an easy way to do this is to use Ctrl-a to highlight all of the code in the worksheet. Select run (blue triangle icon). Notice how the dot next to your COMPUTE_WH turns from gray to green as you run the query. The status table is the last of the 8 tables to load.

    Load data from S3 bucket
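
    Once the script has run, an optional sanity check (assuming you're still in the same worksheet, so the formula1.raw context applies) is to list the files Snowflake can see in the external stage:

    -- show the csv files the stage resolves to in the public s3 bucket
    list @formula1_stage;
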
  6. Let’s unpack that long query into its component parts. We ran it to load our 8 Formula 1 tables from a public S3 bucket. To do this, we:

    • Created a new database called formula1 and a schema called raw to hold our raw (untransformed) data.
    • Defined the file format for our CSV files. Importantly, here we use a parameter called field_optionally_enclosed_by since the string columns in our Formula 1 CSV files use quotes. Quotes are used around string values to avoid parsing issues where commas (,) and newlines (\n) inside data values could otherwise cause loading errors.
    • Created a stage to locate the data we are going to load. Snowflake stages are locations where data files are stored. Stages are used to both load and unload data to and from Snowflake. Here we are using an external stage, which references an S3 bucket.
    • Created the tables our data will be copied into. These are empty tables with just column names and data types. Think of each as an empty container that the data will then fill.
    • Used the copy into statement for each of our tables. We reference the staged location we created, and on loading errors we continue loading the rest of the data. You should not see data loading errors, but if you do, those rows will be skipped and Snowflake will tell you which rows caused errors (one way to inspect them is sketched below).
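
    If you do need to inspect rows that were skipped by on_error='continue', one option is Snowflake's validate table function, which returns the errors from a given COPY job. A minimal sketch for the circuits table, assuming you run it in the same session as the load:

    -- show rows that failed during the most recent copy into this table
    select * from table(validate(formula1.raw.circuits, job_id => '_last'));
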
  7. Now let's take a look at some of our cool Formula 1 data we just loaded up!

    1. Create a new worksheet by selecting the + then New Worksheet.

      Create new worksheet to query data
    2. Navigate to Database > Formula1 > RAW > Tables.

    3. Query the data using the following code. There are only 76 rows in the circuits table, so we don’t need to worry about limiting the amount of data we query.

      select * from formula1.raw.circuits
    4. Run the query. From here on out, we’ll use the keyboard shortcuts Command-Enter or Control-Enter to run queries and won’t explicitly call out this step.

    5. Review the query results; you should see information about Formula 1 circuits, starting with Albert Park in Australia!

    6. Finally, ensure you have all 8 tables, starting with CIRCUITS and ending with STATUS (a quick check is sketched below). Now we are ready to connect to dbt Cloud!

      Query circuits data
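
      If you'd rather verify from SQL than by clicking through the sidebar, asking Snowflake to list the schema's tables works too:

      -- list every table loaded into the raw schema; expect 8 rows
      show tables in schema formula1.raw;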