How the MongoDB Connector for BI Creates SQL Schema from Sampling

The MongoDB Connector for BI (BIC or BI Connector) is a tool which provides a read-only SQL interface to query MongoDB data. The BI Connector sits between client applications and a MongoDB cluster. Client applications send SQL queries to the BI Connector where they are transformed into aggregation pipelines equivalent to the original SQL query. The results are presented back to the client as SQL tables, columns, and rows.

BIC overview

In order for this work, document fields need to be mapped to a SQL schema. By default, this schema is generated when the BI Connector initially connects to a MongoDB cluster. By default, BIC randomly samples documents from each database collection. Top-level document data is mapped as a table with a name matching the collection name. Embedded documents are mapped differently depending on the data contained within embedded fields. If fields contains a single document, SQL columns will be appended to the top level data table. The naming convention is [top level field].[embedded document field]. If, however, an embedded field contains an array of documents an additional SQL table is presented. The naming convention for the resulting table is [collection name]_[field name]. Columns and rows correspond to the data within in each document in the array.

Let’s take a look at some Airbnb data for the Seattle area. We have a document structure representing individual listings with top level data and embedded arrays of documents for both ratings and calendar. The BI Connector will represent this document using 3 tables: sea, sea_calendar, and sea_ratings:

mysql> show tables;
+------------------+
| Tables_in_airbnb |
+------------------+
| sea              |
| sea_calendar     |
| sea_ratings      |
+------------------+
3 rows in set (0.01 sec)
{ 
    "_id" : ObjectId("5b9a6d75f68257c5a994558d"), 
    "id" : NumberInt(2318), 
    "listing_url" : "https://www.airbnb.com/rooms/2318", 
    "scrape_id" : NumberLong(20180816121209), 
    "last_scraped" : "2018-08-16", 
    "name" : "Casa Madrona - Urban Oasis, 1 block from the Park!", 
  
...

 "calendar" : [
        {
            "_id" : ObjectId("5b91447d5cf83e0e55659cb1"), 
            "listing_id" : NumberInt(2318), 
            "date" : "2019-08-15", 
            "available" : "t", 
            "price" : 296.0
        }, 
        {
            "_id" : ObjectId("5b91447d5cf83e0e55659cb3"), 
            "listing_id" : NumberInt(2318), 
            "date" : "2019-08-14", 
            "available" : "t", 
            "price" : 296.0
        }, 

...

   "ratings" : [
        {
            "_id" : ObjectId("5b91452d5cf83e0e55962e81"), 
            "listing_id" : NumberInt(4291), 
            "id" : NumberInt(5739075), 
            "date" : "2013-07-14", 
            "reviewer_id" : NumberInt(7078957), 
            "reviewer_name" : "Kathy", 
        }, 
        {
            "_id" : ObjectId("5b91452d5cf83e0e55962e82"), 
            "listing_id" : NumberInt(4291), 
            "id" : NumberInt(6283627), 
            "date" : "2013-08-06", 
            "reviewer_id" : NumberInt(7443181), 
            "reviewer_name" : "Louisa", 
        }, 
	]

...

    "neighborhoods" : {
        "_id" : ObjectId("5b91452c5cf83e0e55962dc8"), 
        "neighbourhood_group" : "Other neighborhoods", 
        "neighbourhood" : "Roosevelt"
    } 
}

This document also includes a field, neighborhoods, that contains a single document, rather than an array of documents. The data here is appended to the top-level in the table sea as separate columns: neighborhoods._id, neighborhoods.neighbourhood, and neighborhoods.neighbourhood_group.