DEFINE INDEX
statement
Just like in other databases, SurrealDB uses indexes to help optimize query performance. An index can consist of one or more fields in a table and can enforce a uniqueness constraint. If you don't intend for your index to have a uniqueness constraint, then the fields you select for your index should have a high degree of cardinality, meaning that there is a high amount of diversity between the data in the indexed table records.
Requirements
- You must be authenticated as a root or namespace user before you can use the
DEFINE INDEX
statement. - You must select your namespace and database before you can use the
DEFINE INDEX
statement.
Statement syntax
SurrealQL SyntaxDEFINE INDEX [ IF NOT EXISTS ] @name ON [ TABLE ] @table [ FIELDS | COLUMNS ] @fields
[ UNIQUE
| SEARCH ANALYZER @analyzer [ BM25 [(@k1, @b)] ] [ HIGHLIGHTS ]
| MTREE DIMENSION @dimension [ TYPE @type ] [ DIST @distance ] [ CAPACITY @capacity]
| HNSW DIMENSION @dimension [ TYPE @type ] [DIST @distance] [ EFC @efc ] [ M @m ]
]
[ COMMENT @string ]
Index Types
SurrealDB offers a range of indexing capabilities designed to optimize data retrieval and search efficiency.
Unique Index
Ensures each value in the index is unique. A unique index helps enforce uniqueness across records by preventing duplicate entries in fields such as user IDs, email addresses, and other unique identifiers.
Let's create a unique index for the email address field on a user table.
-- Makes sure that the email address in the user table is always unique
DEFINE INDEX userEmailIndex ON TABLE user COLUMNS email UNIQUE;
The created index can be tested using the INFO
statement.
INFO FOR TABLE user;
The INFO
statement will help you understand what indexes are defined in your TABLE
.
{
"events": {},
"fields": {},
"indexes": {
"userEmailIndex": "DEFINE INDEX userEmailIndex ON user FIELDS email UNIQUE"
},
"lives": {},
"tables": {}
}
As we defined a UNIQUE
index on the email
column, a duplicate entry for that column or field will throw an error.
-- Create a user record and set an email ID.
CREATE user:1 SET email = 'test@surrealdb.com';
Response[
{
"email": "test@surrealdb.com",
"id": "user:1"
}
]
Creating another record with the same email ID will throw an error.
-- Create another user record and set the same email ID.
CREATE user:2 SET email = 'test@surrealdb.com';
ResponseDatabase index `userEmailIndex` already contains 'test@surrealdb.com',
with record `user:1`
To set the same email for user:2
, the original record must be deleted
DELETE user:1;
CREATE user:2 SET email = 'test@surrealdb.com'
[
{
"email": "test@surrealdb.com",
"id": "user:2"
}
]
Non-Unique Index
This allows for the indexing of attributes that may have non-unique values, facilitating efficient data retrieval. Non-unique indexes help index frequently appearing data in queries that do not require uniqueness, such as categorization tags or status indicators.
Let's create a non-unique index for an age field on a user table.
-- optimise queries looking for users of a given age
DEFINE INDEX userAgeIndex ON TABLE user COLUMNS age;
Composite Index
The composite index spans multiple fields and columns of a table. Similar to a single field index, composite indexes can also be UNIQUE
-- Create an index on the account and email fields of the user table
DEFINE INDEX test ON user FIELDS account, email;
Full-Text Search Index
Enables efficient searching through textual data, supporting advanced text-matching features like proximity searches and keyword highlighting.
The Full-Text search index helps implement comprehensive search functionalities in applications, such as searching through articles, product descriptions, and user-generated content.
Let's create a full-text search index for a name
field on a user
table.
-- Allow full-text search queries on the name of the user
DEFINE INDEX userNameIndex ON TABLE user COLUMNS name SEARCH ANALYZER ascii BM25 HIGHLIGHTS;
SEARCH
: By using theSEARCH
keyword, you enable full-text search on the specified column.ANALYZER ascii
: Uses a custom analyzer calledascii
for tokenizing and analyzing the text.BM25
: Ranking algorithm used for relevance scoring.HIGHLIGHTS
: Allows keyword highlighting in search results output when using thesearch::highlight
function
Vector Search Indexes
Vector search indexes in SurrealDB support efficient k-nearest neighbors (kNN) and Approximate Nearest Neighbor (ANN) operations, which are pivotal in performing similarity searches within complex, high-dimensional datasets and data types.
Types
When defining a vector index with MTREE or HNSW, you can define the types the vector will be stored in. The TYPE
clause is optional and can be used to specify the data type of the vector. SurrealDB supports the following types:
F64
| F32
| I64
| I32
| I16
F64
: Represents 64-bit floating-point numbers (double precision floating-point numbers).F32
: Represents 32-bit floating-point numbers (single precision floating-point numbers).I64
: Represents 64-bit signed integers.I32
: Represents 32-bit signed integers.I16
: Represents 16-bit signed integers.
Note: In SurrealDB the default type for vectors isF32
.
For example, to define a vector index with 64-bit signed integers, you can use the following query:
DEFINE INDEX idx_mtree_embedding ON Document FIELDS items.embedding MTREE DIMENSION 4 TYPE I64;
M-Tree index Since 1.3.0
The M-Tree index is suitable for the task of finding exact nearest neighbors based on a distance metric (like Euclidean distance). Mtree currently supports Euclidean, Cosine, Manhattan and Minkowski distance functions.
Note: When no function is specified it chooses Euclidean for the default distance function.
Capacity
The CAPACITY
clause is used to specify the maximum number of records that can be stored in the index. This is useful when you want to limit the number of records stored in the index to optimize performance by default the capacity is set to 40.
Example usage: Define a M-Tree index on a table with MANHATTAN and COSINE distance
For example, consider an index made for records with 4 dimensional vectors using the MANHATTAN and COSINE distance function:
DEFINE INDEX idx_mtree_embedding_manhattan ON Document FIELDS items.embedding MTREE DIMENSION 4 DIST MANHATTAN;
DEFINE INDEX idx_mtree_embedding_cosine ON Document FIELDS items.embedding MTREE DIMENSION 4 DIST COSINE;
Because the Document
table has a unique index on the items.embedding
field, you can use the index to perform vector searches to find records based on the distance from a given point. Any vector dimensions that don't match 4 dimensions will throw an error. 'Incorrect vector dimension (3). Expected a vector of 4 dimension.'
Example usage: Perform a vector search using an M-Tree index with just a Distance
Another example is to create an M-tree index on a table with 3-dimensional vectors.
CREATE pts:1 SET point = [1,2,3];
CREATE pts:2 SET point = [4,5,6];
CREATE pts:3 SET point = [8,9,10];
To define an M-tree index on the point
field of the pts
table, you can use the query below:
DEFINE INDEX mt_pt ON pts FIELDS point MTREE DIMENSION 3;
In the above example, the MTREE DIMENSION
clause specifies the dimension of the vector. The point
field is a 3-dimensional vector array, so we set the dimension to 3
. If this is sucessfully created, you can use the index to perform vector searches to find records based on the distance from a given point.
HNSW (Hierarchical Navigable Small World) Since 1.5.0
This method uses a graph-based approach to efficiently navigate and search in high-dimensional spaces. While it is an approximate technique, it offers a high-performance balance between speed and accuracy, making it ideal for very large datasets.
Note: Keep in mind the in-memory nature of HNSW when considering system resource allocation.
In the example above, you may notice the EFC
and M
parameters. These are optional to your query but are parameters of the HNSW algorithm and can be used to tune the index for better performance. At a glance
-
M (Max Connections per Element): Defines the maximum number of bi-directional links (neighbors) per node in each layer of the graph, except for the lowest layer. This parameter controls the connectivity and overall structure of the network. Higher values of MM generally improve search accuracy but increase memory usage and construction time.
-
EFC (EF construction): Stands for "exploration factor during construction." This parameter determines the size of the dynamic list for the nearest neighbor candidates during the graph construction phase. A larger efConstruction value leads to a more thorough construction, improving the quality and accuracy of the search but increasing construction time. The default value is 150.
-
M0 (Max Connections in the Lowest Layer): Similar to M, but specifically for the bottom layer (the base layer) of the graph. This layer contains the actual data points. M0 is often set to twice the value of M to enhance search performance and connectivity at the base layer, at the cost of increased memory usage.
-
LM (Multiplier for Level Generation): Used to determine the maximum level ll for a new element during its insertion into the hierarchical structure. It is used in the formula l←⌊−ln(unif(0..1))⋅mL⌋, where unif(0..1) is a uniform random variable between 0 and 1. This parameter influences the distribution of elements across different levels, impacting the overall balance and efficiency of the search structure.
Note: You can only provide TYPE, M, and EFC. SurrealDB automatically computes M0 and LM with the most appropriate value. If not specified, M AND EFC are set to 12 and 150, respectively.
Brute Force method
The Brute Force method is suitable for tasks with smaller datasets or when the highest accuracy is required. Brute Force currently supports Euclidean, Cosine, Manhattan and Minkowski distance functions.
In the example below, the query searches for points closest to the vector [2,3,4,5]
and uses vector functions to calculate the distance between two points, indicated by <|2|>
.
Verifying Index Utilization in Queries
The EXPLAIN
clause from SurrealQL helps you understand the execution plan of the query and provides transparency around index utilization.
SELECT * FROM user WHERE email='test@surrealdb.com' EXPLAIN FULL;
It also reveals details about which operation
was used by the query planner and how many records matched the search criteria.
[
{
"detail": {
"plan": {
"index": "userEmailIndex",
"operator": "=",
"value": "test@surrealdb.com"
},
"table": "user"
},
"operation": "Iterate Index"
},
{
"detail": {
"count": 1
},
"operation": "Fetch"
}
]
Using IF NOT EXISTS
clause Since 1.3.0
The IF NOT EXISTS
clause can be used to define a index only if it does not already exist. If the index already exists, the DEFINE INDEX
statement will return an error.
-- Create a INDEX if it does not already exist
DEFINE INDEX IF NOT EXISTS example ON example FIELDS example;
Performance Implications
When defining indexes, it's essential to consider the fields most frequently queried or used to optimize performance.
Indexes may improve the performance of SurrealQL statements. This may not be noticeable with small tables but it can be significant for large tables; especially when the indexed fields are used in the WHERE
clause of a SELECT
statement.
Indexes can also impact the performance of write operations (INSERT, UPDATE, DELETE) since the index needs to be updated accordingly. Therefore, it's essential to balance the need for read performance with write performance.