SQL Schema Format
On this page
The support for SQL format queries is available as a Beta feature. The feature and the corresponding documentation may change at any time during the Beta stage.
Overview
To query data using SQL, Atlas Data Lake needs to be aware of the schema for that data. Data Lake automatically generates a JSON schema for all new collections and views. To learn more about auto-generated schemas, see Query with SQL.
By default, Data Lake samples data from a single document in your collection or view to generate a JSON schema. If your collection or view contains polymorphic data, you can provide a larger sampling size to Data Lake when manually generating the schema.
Data Lake maps JSON schemas to relational schemas. MongoDB's flexible schema model allows a given field to contain data of multiple types, while relational databases restrict columns to a single data type. The following sections describe the fields supported in the JSON schema, the BSON types that are supported in a relational schema, and how Data Lake resolves conflicts for polymorphic fields when mapped to relational schema.
JSON Schema Format
The schema for a collection is a document with two fields:
jsonSchema
and version
.
"schema" : { "version" : NumberLong(1), "jsonSchema" : {} }
The version
field represents the version of the schema format used
by the document and the value is always 1
. The jsonSchema
field
is a document that describes the schema of the namespace.
Supported JSON Schema Fields
Data Lake supports the following JSON schema fields:
bsonType
items
properties
You can provide a single document or an array of documents for the
items
field. When you retrieve the schema, the items
field
shows the form that you used for setting the schema.
To learn more about these fields, see JSON Schema Keywords.
Supported BSON Types
Data Lake only supports the following BSON types when mapping JSON schema to relational schema:
array
binData
bool
date
decimal
double
int
long
null
object
objectId
string
Other types are ignored in the relational schema. Fields with composite types, such as objects and arrays, are handled specially.
Object Fields
Object fields are flattened such that each nested field maps to its
own column in the relational schema. For example, consider the
following eg
collection:
{ _id: 0, a: true, b: "bar", c: { x: 1.0, y: 2.0 } } { _id: 1, a: true, b: "baz", c: { x: 10.0, y: 20.0 } }
For the above collection, Data Lake generates the following JSON schema:
{ bsonType: "object", properties: { _id: { bsonType: "int" }, a: { bsonType: "bool" }, b: { bsonType: "string" }, c: { bsonType: "object", properties: { x: { bsonType: "double" }, y: { bsonType: "double" } } } } }
In the relational schema, the nested fields c.x
and c.y
are flattened into their own columns. The following
table is a representation of the above schema:
_id | a | b | c.x | c.y |
---|---|---|---|---|
0 | true | bar | 1.0 | 2.0 |
1 | true | baz | 10.0 | 20.0 |
Array Fields
Array fields are mapped to their own tables. For example,
consider the following eg
collection:
{ _id: 0, a: ["k", "l", "m"] } { _id: 1, a: ["x", "y", "z"] }
For the above collection, Data Lake generates the following JSON schema:
{ bsonType: "object", properties: { _id: { bsonType: "int" }, a: { bsonType: "array", items: { bsonType: "string" } } } }
In the relational schema, the array field a
is mapped to its
own table. The array tables include a column suffixed with _idx
to
track the index of the row's value in the array. The following tables
are representations of the above schema:
Table: eg
id |
---|
0 |
1 |
Table: eg_a
_id | a_idx | a |
---|---|---|
0 | 0 | k |
0 | 1 | l |
0 | 2 | m |
1 | 0 | x |
1 | 1 | y |
1 | 2 | z |
Type Conversion Conflicts
MongoDB's flexible schema model allows a given field to contain data of multiple types, while relational databases restrict columns to a single data type. When Data Lake maps the JSON schema to relational schema, type conflicts can occur if a field is polymorphic. There are two main categories of type conversion conflicts that might occur when there are multiple data types:
- Conflicts between scalar types
- Conflicts involving composite types like documents and arrays
Scalar-Scalar Conflicts
When a scalar-scalar confict occurs, Data Lake uses the following lattice to determine the least upper bound of the two different data types:

If a given field contains:
int
,double
, andstring
types, the least upper bound isstring
.int64
anddouble
types, the least upper bound isdecimal128
.
When a scalar-scalar conflict occurs, Data Lake takes the least upper bound type and uses that as the single type in the relational schema.
Composite Conflicts
Composite conflicts are type conversion conflicts involving a document or an array. The following sections describe how Data Lake resolves both types of composite conflicts.
Document Conflicts
When a conflict occurs involving a document, Data Lake displays the fields
of the document type as separate columns using dot notation. For
example, consider a collection named conflict
that contains the
following documents:
{ _id: 0, a: "foo" } // "a" is scalar (string) { _id: 1, a: { a: "bar", b: "baz" } } // "a" is composite (document)
Data Lake detects a scalar-composite conflict and generates the schema. The following table is a representation of that schema:
Table: conflict
_id | a | a.a | a.b |
---|---|---|---|
0 | "foo" | NULL | NULL |
1 | NULL | "bar" | "baz" |
When a conflict occurs involving a document, Data Lake inserts a column in the table and lists the different sampled types in the column.
Array Conflicts
When a conflict occurs involving an array, Data Lake creates a new
table for the conflict field with a column for the foreign key,
array index, and value. Data Lake unwinds the array into multiple rows
in a new table and fills in the columns accordingly. For example, the
conflict
collection contains the following documents:
{ _id: 0, a: "foo" } // "a" is scalar (string) { _id: 1, a: ["bar", "baz"] } // "a" is composite (array)
The following two tables are representations of the relational schema:
Table: conflict
_id |
---|
0 |
1 |
Table: conflict_a
_id | a_idx | a |
---|---|---|
0 | NULL | "foo" |
1 | 0 | "bar" |
1 | 1 | "baz" |