Although I work for VAST Data, these notes are my own personal notes and are not authoritative. They may be wrong.
VAST implements vector database functionality as part of its structured interface, VAST DataBase. This vector query capability falls under the VAST DataBase (and VAST DataEngine?) branding.
This vector database is actually implemented in two parts:
- Vectors are stored as a vector column type within VAST DataBase
- Vectors are searched using a VAST SDK which provides an ADBC driver that offloads queries to a VAST Query Engine running inside the VAST cluster.
Writes/Inserts
Vectors are inserted by building a PyArrow table (pyarrow.table) and passing it to the VAST table handle's insert() method.[1] For example,
import datetime
import pyarrow
# columns is a pyarrow.schema; table is the VAST table handle
arrow_table = pyarrow.table(
    schema=columns,
    data=[
        [  # index column of int64
            1,
            2,
            3,
        ],
        [  # vector column with five-element lists of float32
            [0.732, 0.914, 0.059, 0.427, 0.106],
            [0.839, 0.245, 0.601, 0.913, 0.758],
            [0.317, 0.564, 0.129, 0.987, 0.405],
        ],
        [  # timestamp column of type timestamp
            datetime.datetime(2013, 2, 17, 8, 6),
            datetime.datetime(2016, 11, 3, 19, 42),
            datetime.datetime(2019, 7, 28, 14, 15),
        ],
    ])
# insert the PyArrow table through the VAST table handle
table.insert(arrow_table)
The flow is as follows:
- vastdb.connect instantiates a session
- The session spawns a transaction context
- The transaction retrieves/creates a bucket object which scopes the table
- The bucket provides access to a schema namespace
- The schema is used to retrieve/create a table handle. PyArrow is used to define the schema if a new table is being created.
- PyArrow tables (pyarrow.table) are constructed with the payload and inserted via the VAST table handle’s insert() method
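The steps above can be sketched as a small helper. This is only a sketch under assumptions, not an authoritative use of the SDK: it assumes the object returned by vastdb.connect exposes transaction(), that the transaction exposes bucket(), that the bucket exposes schema() and create_schema(), and that the schema exposes table() and create_table(), with a fail_if_missing flag on the lookups as I understand the vastdb Python SDK. The helper name insert_rows and its parameter names are made up for illustration.

```python
def insert_rows(session, bucket_name, schema_name, table_name, columns, arrow_table):
    """Insert a pyarrow.Table, creating the schema and table if they are missing.

    session is assumed to be what vastdb.connect(...) returns; columns is a
    pyarrow.Schema and arrow_table is a pyarrow.Table that matches it.
    """
    # The session spawns a transaction context (assumed to commit on clean exit).
    with session.transaction() as tx:
        # The bucket scopes the table.
        bucket = tx.bucket(bucket_name)

        # Retrieve the schema namespace, or create it (fail_if_missing is an assumption).
        schema = bucket.schema(schema_name, fail_if_missing=False)
        if schema is None:
            schema = bucket.create_schema(schema_name)

        # Retrieve the table handle, or create it from the PyArrow schema.
        table = schema.table(table_name, fail_if_missing=False)
        if table is None:
            table = schema.create_table(table_name, columns)

        # Hand the PyArrow payload to the VAST table handle.
        table.insert(arrow_table)
```

In use, the session would come from something like vastdb.connect(endpoint=..., access=..., secret=...), with the endpoint and credentials specific to the cluster.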
Reads/Selects
Vector similarity is queried via the SQL dialect exposed by VAST’s special Arrow Database Connectivity (ADBC) driver.[2] A vector query looks something like this:[1]
SELECT * FROM table
WHERE some_column > some_criteria
ORDER BY
    array_distance(vector_column, [0.123, 0.456, 0.789]::FLOAT[3])
LIMIT 2
The following similarity functions are provided and offloaded to the VAST query engine:
- Cosine distance (array_cosine_distance)
- Euclidean distance (array_distance)
- Negative inner product (array_negative_inner_product)
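For reference, here is what these three functions compute, written as plain Python. This is a sketch that assumes VAST follows the conventional definitions (the same function names exist in DuckDB with these meanings); I have not verified VAST's exact behavior, for example for zero-length vectors.

```python
import math

def array_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def array_cosine_distance(a, b):
    """Cosine distance, assumed to be 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def array_negative_inner_product(a, b):
    """Negated dot product, so that a larger inner product sorts first."""
    return -sum(x * y for x, y in zip(a, b))
```

All three are oriented so that smaller values mean "more similar", which is why they can be used directly in an ascending ORDER BY.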
By integrating these similarity functions into SQL, queries can include both similarity and categorical or bounded criteria.
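To make the semantics of such a combined query concrete, here is a brute-force emulation in plain Python of "filter, order by Euclidean distance, limit". It only illustrates what the query returns and says nothing about how the VAST query engine actually executes it; the function name top_k_filtered, the row layout, and the sample data are all made up.

```python
import math

def top_k_filtered(rows, query, k, predicate):
    """Return the k rows passing predicate whose vectors are closest to query."""
    # WHERE some_column > some_criteria
    candidates = [row for row in rows if predicate(row)]
    # ORDER BY array_distance(vector_column, query)
    candidates.sort(key=lambda row: math.dist(row["vec"], query))
    # LIMIT k
    return candidates[:k]

rows = [
    {"id": 1, "year": 2013, "vec": [0.0, 0.0]},
    {"id": 2, "year": 2016, "vec": [1.0, 1.0]},
    {"id": 3, "year": 2019, "vec": [0.1, 0.1]},
    {"id": 4, "year": 2021, "vec": [5.0, 5.0]},
]

nearest = top_k_filtered(rows, query=[0.0, 0.0], k=2,
                         predicate=lambda row: row["year"] > 2015)
print([row["id"] for row in nearest])  # prints [3, 2]
```

Row 1 is the closest vector overall but is excluded by the year filter, which is the point of combining the two kinds of criteria in one query.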