INTRODUCTION
A columnar data engine for Node.js. Zero dependencies. Built on SharedArrayBuffer, Worker Threads, and TypedArrays.
Cervid-data reads your entire dataset into a SharedArrayBuffer, one contiguous block of RAM. Worker Threads receive a reference to that buffer, not a copy. Each worker processes its own partition in parallel, writing results back to shared memory.
Numeric columns are stored as Float64Array views over the shared buffer. This means aggregations, filters, and feature engineering operate directly on raw memory: no object allocation, no garbage collection pressure.
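To make the memory model concrete, here is a minimal sketch using only standard JavaScript globals. It is not Cervid's internal code: the two-column layout and the names `fare` and `tip` are assumptions chosen for illustration. It shows two Float64Array views over one SharedArrayBuffer, a partition taken as a sub-view, and an aggregation that loops over raw numbers.

```javascript
// Illustrative only: a hypothetical two-column layout in one shared block.
const rowCount = 4;
const bytesPerColumn = rowCount * Float64Array.BYTES_PER_ELEMENT;
const shared = new SharedArrayBuffer(bytesPerColumn * 2);

// Each column is a view over a region of the same memory (no copy).
const fare = new Float64Array(shared, 0, rowCount);
const tip = new Float64Array(shared, bytesPerColumn, rowCount);

fare.set([10, 20, 30, 40]);
tip.set([1, 4, 3, 8]);

// A partition is just a sub-view. A worker would receive `shared` plus its
// [start, end) range and build the same kind of view on its side.
const secondHalf = fare.subarray(2, 4);
secondHalf[0] = 35; // visible through `fare` too, because memory is shared

// Aggregation is a plain loop over raw numbers: no per-row objects.
function sum(column) {
  let total = 0;
  for (let i = 0; i < column.length; i++) total += column[i];
  return total;
}

const totalFare = sum(fare); // 10 + 20 + 35 + 40 = 105
const totalTip = sum(tip); // 1 + 4 + 3 + 8 = 16
```

Because a sub-view shares the buffer with its parent, handing a partition to a worker costs a reference and two offsets rather than a copy of the rows.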
Cervid-data uses only Node.js built-in modules: fs, worker_threads, os, and path. No npm install is required beyond the package itself.
Cervid: Static entry point. Use Cervid.read() to load CSV or JSON files into a DataFrame.
DataFrame: The main data structure. Holds columnar data and exposes all transformation, aggregation, and export methods.
Column: A linked reference to a single column that enables chained arithmetic ops.
Series: The internal columnar primitive. Each column inside a DataFrame is a Series backed by a TypedArray.
INSTALLATION
Cervid-data requires Node.js 18+ for SharedArrayBuffer and Worker Threads support.
    npm i @cervid/data

    import { Cervid } from '@cervid/data'
    import { DataFrame } from '@cervid/data'
    import { Series, Column } from '@cervid/data'

The imports use ES module syntax, so make sure your package.json has "type": "module" or use .mjs extensions.
SharedArrayBuffer requires cross-origin isolation headers only when used in a browser context. In Node.js (server-side), no extra configuration is needed.
QUICKSTART
A complete example loading, cleaning, transforming, and aggregating 7.8M rows.
    import { Cervid } from '@cervid/data'

    // 1. Load CSV (uses SharedArrayBuffer + Worker Threads)
    const df = await Cervid.read('trips.csv')
    console.log(df.info())
    // { rowCount: 7832546, columnCount: 19, memoryUsage: '...' }

    // 2. Select only the columns you need (frees RAM)
    const slim = df.select([
      'fare_amount', 'tip_amount', 'trip_distance',
      'passenger_count', 'tpep_pickup_datetime'
    ])

    // 3. Clean: remove invalid rows
    const clean = slim.filter(
      ['fare_amount', 'trip_distance', 'passenger_count'],
      (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
    )

    // 4. Feature engineering, vectorized over TypedArrays
    clean.with_columns([
      { name: 'tip_pct', inputs: ['tip_amount', 'fare_amount'], formula: (tip, fare) => (tip / fare) * 100 },
      { name: 'revenue_per_mile', inputs: ['fare_amount', 'trip_distance'], formula: (fare, dist) => fare / dist }
    ])

    // 5. Extract hour from timestamp
    clean.col('tpep_pickup_datetime').to_datetime().extract_hour(0)

    // 6. GroupBy: best hour by tip percentage
    const byHour = clean
      .groupBy('tpep_pickup_datetime_hour', { tip_pct: 'mean' })
      .sort('tip_pct', false)
    byHour.show(5)

    // 7. Export
    await clean.toCSV('output/clean_trips.csv')
Cervid.READ()
Universal entry point for loading CSV and JSON files into a DataFrame.
Cervid.read() detects the file format by extension and routes to the appropriate engine. Files ending in .csv use the Nitro engine (SharedArrayBuffer + Workers). Files ending in .json use the JSON engine with automatic flattening.
| Parameter | Type | Default | Description |
|---|---|---|---|
| filePath | string | — | Path to the file |
| options.workers | number | os.cpus().length | Number of Worker Threads |
| options.indexerCapacity | number | 10_000_000 | Max rows for column buffers |
| options.useOffsets | boolean | true | Store byte offsets for string columns |
| options.type | 'csv' \| 'json' | auto | Force a specific format |
    // Each call below is an independent example
    const df = await Cervid.read('data.csv')
    const dfJson = await Cervid.read('data.json')
    const dfFourWorkers = await Cervid.read('data.csv', { workers: 4 })
The JSON engine automatically detects the root array (e.g. "prizes" in the Nobel dataset). It recursively flattens nested objects into columns and expands nested arrays into multiple rows, with each row inheriting its parent's fields.
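The flattening rule can be sketched in plain JavaScript. This is an illustration under stated assumptions, not the JSON engine's implementation: `flattenRecord` and `isPlainObject` are hypothetical helpers, the sample values are placeholders, and joining nested names with an underscore is inferred from column names such as laureates_id.

```javascript
// Hypothetical helper: flattens one record into one or more flat rows.
// Nested objects become prefix_key columns. Nested arrays of objects
// multiply the rows, and each new row inherits the parent's fields.
function flattenRecord(record, prefix = '') {
  let rows = [{}];
  for (const [key, value] of Object.entries(record)) {
    const name = prefix ? `${prefix}_${key}` : key;
    const isObjectArray =
      Array.isArray(value) && value.length > 0 && value.every(isPlainObject);
    if (isObjectArray || isPlainObject(value)) {
      const children = isObjectArray
        ? value.flatMap((item) => flattenRecord(item, name))
        : flattenRecord(value, name);
      // Combine every existing row with every child row.
      rows = rows.flatMap((row) => children.map((child) => ({ ...row, ...child })));
    } else {
      // Scalars (and arrays of primitives) are copied onto every row.
      for (const row of rows) row[name] = value;
    }
  }
  return rows;
}

function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

// Placeholder data shaped like the Nobel example.
const input = {
  prizes: [
    {
      year: '2023',
      laureates: [
        { id: '1', firstname: 'A' },
        { id: '2', firstname: 'B' },
      ],
    },
  ],
};
const rows = input.prizes.flatMap((prize) => flattenRecord(prize));
// rows has two entries, each with year, laureates_id, laureates_firstname
```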
    // Input: { prizes: [{ year: "2023", laureates: [{ id, name }] }] }
    const df = await Cervid.read('nobel.json')
    // Each laureate becomes its own row, with year inherited
    // Columns: year, laureates_id, laureates_firstname, ...

CORE & DISPLAY
DataFrame constructor, static builders, and display methods.
The DataFrame constructor creates a new DataFrame. You typically get DataFrames from Cervid.read(), but you can construct one manually.
| Property | Type | Description |
|---|---|---|
| columns | Record<string, TypedArray> | Column data — use Float64Array for numerics |
| rowCount | number | Total number of rows |
| headers | string[] | Column names in order |
DataFrame.fromObjects() converts a plain JS array of objects into a DataFrame. Numeric values are stored as Float64Array automatically.

    const df = DataFrame.fromObjects([
      { name: 'Alice', score: 95 },
      { name: 'Bob', score: 82 },
    ])

Constructs a DataFrame directly from an array of numeric objects, initializing SharedArrayBuffers automatically.
Constructs a DataFrame from a SharedDef object, instantiating TypedArray views directly over a SharedArrayBuffer without copying memory. Used internally by the Worker Threads engine.
Assigns new columns or overwrites existing ones. Accepts an object of columns or another DataFrame. Mutates in-place. Throws an error if the row counts do not match.
Returns the underlying raw TypedArray or array data for a given column name.
Prints the first n rows as a console.table. String values longer than 20 characters are truncated with "...".
Returns a summary with row count, column count, column names, and estimated memory usage.
Example:

    const info = df.info()
    // { rowCount: 7832546, columnCount: 19, columns: [...], memoryUsage: '1139.45 MB' }

Prints descriptive statistics for all numeric columns: count, mean, min, 25%, 50%, 75%, max. Non-numeric columns are skipped.
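As a rough illustration of what these statistics involve, the sketch below computes them for one numeric column. `describeColumn` is a hypothetical helper, and the quantile method (linear interpolation between the two nearest ranks) and the skipping of NaN values are assumptions; the library may compute its percentiles differently.

```javascript
// Hypothetical helper: summary statistics for one numeric column.
// Assumption: quantiles interpolate linearly between neighbouring ranks.
function describeColumn(values) {
  // Copy, drop NaN, and sort numerically (typed arrays sort by value).
  const sorted = Float64Array.from(values)
    .filter((v) => !Number.isNaN(v))
    .sort();
  const count = sorted.length;
  if (count === 0) return { count: 0 };

  let sum = 0;
  for (let i = 0; i < count; i++) sum += sorted[i];

  const quantile = (q) => {
    const pos = (count - 1) * q;
    const lo = Math.floor(pos);
    const hi = Math.ceil(pos);
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
  };

  return {
    count,
    mean: sum / count,
    min: sorted[0],
    '25%': quantile(0.25),
    '50%': quantile(0.5),
    '75%': quantile(0.75),
    max: sorted[count - 1],
  };
}

const stats = describeColumn(Float64Array.of(4, 1, 3, 2, NaN));
// { count: 4, mean: 2.5, min: 1, '25%': 1.75, '50%': 2.5, '75%': 3.25, max: 4 }
```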
Returns a quick statistical summary of a single numeric column: count, sum, mean, min, and max.
TRANSFORMATION
Methods for creating new columns, selecting, renaming, casting, and reshaping data.
Vectorized feature engineering. Applies formulas row-by-row using direct TypedArray access, with optimized fast paths for 1, 2, and 4 inputs. Returns this for chaining.
| ColSpec property | Type | Description |
|---|---|---|
| name | string | Name of the new column to create |
| inputs | string[] | Column names fed into the formula |
| formula | Function | (...values: number[]) => number |
    df.with_columns([
      {
        name: 'revenue_per_mile',
        inputs: ['total_amount', 'trip_distance'],
        formula: (amount, dist) => dist > 0 ? amount / dist : 0
      },
      {
        name: 'speed_mph',
        inputs: ['trip_distance', 'duration_hours'],
        formula: (dist, dur) => dur > 0 ? dist / dur : 0
      }
    ])

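For readers curious how a column spec can be evaluated over TypedArrays, here is a minimal sketch. `computeColumn` is a hypothetical stand-in, not the library's code, and it only shows specialised loops for 1 and 2 inputs plus a general path; the point is that a fixed-arity loop avoids building an argument list for every row.

```javascript
// Hypothetical stand-in for the vectorized step: applies `spec.formula`
// row by row and writes the results into a fresh Float64Array.
function computeColumn(columns, rowCount, spec) {
  const out = new Float64Array(rowCount);
  const inputs = spec.inputs.map((name) => columns[name]);
  const f = spec.formula;

  if (inputs.length === 1) {
    const [a] = inputs;
    for (let i = 0; i < rowCount; i++) out[i] = f(a[i]);
  } else if (inputs.length === 2) {
    // Fast path: values are passed directly, no per-row argument array.
    const [a, b] = inputs;
    for (let i = 0; i < rowCount; i++) out[i] = f(a[i], b[i]);
  } else {
    // General path: reuse one scratch array across all rows.
    const args = new Array(inputs.length);
    for (let i = 0; i < rowCount; i++) {
      for (let j = 0; j < inputs.length; j++) args[j] = inputs[j][i];
      out[i] = f(...args);
    }
  }
  return out;
}

const columns = {
  total_amount: Float64Array.of(20, 15, 9),
  trip_distance: Float64Array.of(4, 0, 3),
};
const revenuePerMile = computeColumn(columns, 3, {
  name: 'revenue_per_mile',
  inputs: ['total_amount', 'trip_distance'],
  formula: (amount, dist) => (dist > 0 ? amount / dist : 0),
});
// revenuePerMile holds 5, 0, 3
```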
Returns this. New columns are stored as Float64Array.
Returns a new DataFrame with only the specified columns. Essential for freeing RAM: drop unused columns as early as possible.
const slim = df.select(['fare_amount', 'tip_amount', 'trip_distance'])
Renames columns without copying data. Returns a new DataFrame with updated headers.
const df2 = df.rename({ tpep_pickup_datetime: 'pickup', PULocationID: 'zone' })
Forces a type conversion on a column. 'float' and 'int' both produce Float64Array. 'string' produces a regular JS Array. Mutates in-place.
Computes a running cumulative sum over a column. Creates a new column named {columnName}_cumsum. Mutates in-place.
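A running sum over a typed column is a single pass. The sketch below is illustrative only: `cumulativeSum` is a hypothetical helper, and the plain `columns` object stands in for a DataFrame's column storage.

```javascript
// Hypothetical helper: running total of a numeric column.
function cumulativeSum(values) {
  const out = new Float64Array(values.length);
  let running = 0;
  for (let i = 0; i < values.length; i++) {
    running += values[i];
    out[i] = running;
  }
  return out;
}

// Stand-in for column storage; the new column follows the
// {columnName}_cumsum naming described above.
const columns = { fare_amount: Float64Array.of(5, 10, 2.5) };
columns.fare_amount_cumsum = cumulativeSum(columns.fare_amount);
// fare_amount_cumsum holds 5, 15, 17.5
```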
Applies a StringIndexer to encode a string column as numeric IDs. Creates a new column named {input}_indexed and stores the indexer in metadata.indexers for later decoding.
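The general technique of string indexing can be sketched as follows. `SimpleStringIndexer` is a hypothetical class written for this example; the real StringIndexer's methods and ID assignment order may differ.

```javascript
// Hypothetical minimal indexer: maps each distinct label to a small
// integer ID and keeps the reverse mapping for decoding.
class SimpleStringIndexer {
  constructor() {
    this.idByLabel = new Map();
    this.labels = [];
  }

  // Returns the existing ID for a label, or assigns the next free one.
  encode(label) {
    let id = this.idByLabel.get(label);
    if (id === undefined) {
      id = this.labels.length;
      this.idByLabel.set(label, id);
      this.labels.push(label);
    }
    return id;
  }

  // Translates an ID back to its original label.
  decode(id) {
    return this.labels[id];
  }
}

const paymentType = ['card', 'cash', 'card', 'voucher'];
const indexer = new SimpleStringIndexer();
const paymentTypeIndexed = Float64Array.from(paymentType, (label) =>
  indexer.encode(label)
);
// paymentTypeIndexed holds 0, 1, 0, 2 and indexer.decode(2) is 'voucher'
```

Keeping the indexer next to the encoded column is what allows numeric results to be translated back into readable labels later.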
FILTERING
Methods for selecting subsets of rows based on conditions, position, or null values.
Returns a new DataFrame containing only rows where the predicate returns true. The predicate receives the values of the listed columns for each row.

    const valid = df.filter(
      ['fare_amount', 'trip_distance', 'passenger_count'],
      (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
    )

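One way to implement a predicate filter over columnar data is in two passes: collect the indices of surviving rows, then gather every column at those indices. The sketch below assumes all columns are numeric; `filterRows` is a hypothetical helper, not the library's implementation.

```javascript
// Hypothetical helper: filters numeric columns by a row predicate.
function filterRows(columns, rowCount, inputNames, predicate) {
  const inputs = inputNames.map((name) => columns[name]);
  const args = new Array(inputs.length);
  const keep = [];

  // Pass 1: evaluate the predicate using only the listed columns.
  for (let i = 0; i < rowCount; i++) {
    for (let j = 0; j < inputs.length; j++) args[j] = inputs[j][i];
    if (predicate(...args)) keep.push(i);
  }

  // Pass 2: gather every column (listed or not) at the kept indices.
  const result = {};
  for (const [name, source] of Object.entries(columns)) {
    const target = new Float64Array(keep.length);
    for (let k = 0; k < keep.length; k++) target[k] = source[keep[k]];
    result[name] = target;
  }
  return { columns: result, rowCount: keep.length };
}

const trips = {
  fare_amount: Float64Array.of(12, -3, 8, 20),
  trip_distance: Float64Array.of(2, 1, 0, 5),
  tip_amount: Float64Array.of(1, 2, 3, 4),
};
const valid = filterRows(
  trips,
  4,
  ['fare_amount', 'trip_distance'],
  (fare, dist) => fare > 0 && dist > 0
);
// valid.rowCount is 2; valid.columns.tip_amount holds 1, 4
```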
Returns a new DataFrame with the first n rows.
Returns a new DataFrame with the last n rows.
Removes rows containing null, undefined, or NaN. how: 'any' drops the row if any column has a null. how: 'all' drops the row only if all columns are null.
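The difference between the two modes is easiest to see on a small example. In this sketch, `isMissing` and `rowsToKeep` are hypothetical helpers that return the indices of the rows that would survive; they only illustrate the 'any' versus 'all' rule.

```javascript
// A value counts as missing if it is null, undefined, or NaN.
const isMissing = (v) => v === null || v === undefined || Number.isNaN(v);

// Hypothetical helper: indices of the rows that are NOT dropped.
function rowsToKeep(columns, rowCount, how = 'any') {
  const cols = Object.values(columns);
  const keep = [];
  for (let i = 0; i < rowCount; i++) {
    let missing = 0;
    for (const col of cols) {
      if (isMissing(col[i])) missing++;
    }
    // 'any': drop on the first missing value. 'all': drop only when
    // every column is missing in this row.
    const drop = how === 'all' ? missing === cols.length : missing > 0;
    if (!drop) keep.push(i);
  }
  return keep;
}

const sample = {
  score: [1, NaN, NaN],
  label: ['x', 'y', null],
};
const keptAny = rowsToKeep(sample, 3, 'any'); // [0]
const keptAll = rowsToKeep(sample, 3, 'all'); // [0, 1]
```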
Replaces all null, undefined, or NaN values in all columns with the specified value. Mutates in-place.
Filters rows where the string column matches the regex pattern. Case-insensitive.
Example:

    const result = df.str_contains('product_name', 'wireless')

AGGREGATION
GroupBy, sorting, and scalar aggregation methods.
Groups rows by a column and applies aggregation functions. Supports sum, mean, count, max, min. Pass a string for a single op or an array for multiple; output columns will be named {col}_{op}.

    const byHour = df.groupBy('hour', { tip_pct: 'mean', fare_amount: 'sum' })

Multiple ops:

    const byZone = df.groupBy('zone_id', {
      fare_amount: ['sum', 'mean', 'count'],
      tip_amount: ['sum', 'max']
    })
    // Columns: fare_amount_sum, fare_amount_mean, fare_amount_count, ...

O(n) aggregation using a JavaScript Map. Ideal for unbounded or string categories. Returns a Map with sum and count.
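A Map-based aggregation of this kind can be written in a few lines. The following is a sketch of the general technique; `aggregateByKey` is a hypothetical name and the exact shape of the returned entries is an assumption.

```javascript
// Hypothetical helper: one pass over the rows, one Map entry per key.
function aggregateByKey(keys, values) {
  const groups = new Map();
  for (let i = 0; i < keys.length; i++) {
    const entry = groups.get(keys[i]);
    if (entry) {
      entry.sum += values[i];
      entry.count += 1;
    } else {
      groups.set(keys[i], { sum: values[i], count: 1 });
    }
  }
  return groups;
}

const vendor = ['acme', 'globex', 'acme'];
const amount = Float64Array.of(10, 4, 6);
const byVendor = aggregateByKey(vendor, amount);
// byVendor.get('acme') is { sum: 16, count: 2 }
```

A Map accepts any key type and grows as new categories appear, which is why it suits string or open-ended keys.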
O(n) groupBy for bounded integer keys. Uses Uint32Array as a direct lookup: no Map, no hashing, no allocations. Returns an array of { group, avg } sorted descending.
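When keys are small non-negative integers, the key itself can serve as an array index. The sketch below illustrates that idea with a Uint32Array of counts and a Float64Array of sums; `averageByBoundedKey` is a hypothetical helper, and skipping out-of-range keys is an assumption about edge-case handling.

```javascript
// Hypothetical helper: keys must be integers in [0, maxKey).
function averageByBoundedKey(keys, values, maxKey) {
  const counts = new Uint32Array(maxKey);
  const sums = new Float64Array(maxKey);

  // The key is the slot: no hashing and no per-group objects in the loop.
  for (let i = 0; i < keys.length; i++) {
    const k = keys[i];
    if (!Number.isInteger(k) || k < 0 || k >= maxKey) continue;
    counts[k] += 1;
    sums[k] += values[i];
  }

  // Emit only the groups that were seen, highest average first.
  const result = [];
  for (let k = 0; k < maxKey; k++) {
    if (counts[k] > 0) result.push({ group: k, avg: sums[k] / counts[k] });
  }
  return result.sort((a, b) => b.avg - a.avg);
}

const zoneId = Float64Array.of(1, 2, 1, 2, 5);
const fareAmount = Float64Array.of(10, 1, 20, 3, 7);
const byZoneAvg = averageByBoundedKey(zoneId, fareAmount, 6);
// [{ group: 1, avg: 15 }, { group: 5, avg: 7 }, { group: 2, avg: 2 }]
```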
Alias for groupByRange(colName, targetCol, 300). Preset for NYC Taxi zone IDs.
Index sort — builds an index array, sorts by target column values, then reorders all columns in one pass. Returns a new DataFrame.
Example: top 10 most profitable trips

    const top10 = df.sort('revenue_per_mile', false).head(10)

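An index sort can be sketched as follows for numeric columns. `sortByColumn` is a hypothetical helper; treating the second argument as an ascending flag mirrors the calls shown in this document, where false yields descending order.

```javascript
// Hypothetical helper: sorts row indices, then reorders each column once.
function sortByColumn(columns, rowCount, by, ascending = true) {
  const key = columns[by];

  // Build the identity permutation 0..rowCount-1.
  const order = new Uint32Array(rowCount);
  for (let i = 0; i < rowCount; i++) order[i] = i;

  // Sort indices by the key column instead of moving whole rows.
  order.sort((a, b) => (ascending ? key[a] - key[b] : key[b] - key[a]));

  // Apply the permutation to every column in a single pass each.
  const sorted = {};
  for (const [name, source] of Object.entries(columns)) {
    const target = new Float64Array(rowCount);
    for (let i = 0; i < rowCount; i++) target[i] = source[order[i]];
    sorted[name] = target;
  }
  return sorted;
}

const unsorted = {
  trip_id: Float64Array.of(0, 1, 2),
  revenue_per_mile: Float64Array.of(2, 9, 5),
};
const ranked = sortByColumn(unsorted, 3, 'revenue_per_mile', false);
// ranked.revenue_per_mile holds 9, 5, 2; ranked.trip_id holds 1, 2, 0
```

Sorting indices means the comparison work touches one column only, and the remaining columns are each copied exactly once.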
| Method | Returns | Description |
|---|---|---|
| sum(col) | number | Sum of all values in a column |
| mean(col) | number | Arithmetic mean — returns 0 if rowCount is 0 |
| max(col) | number | Maximum value |
| min(col) | number | Minimum value |
    df.sum('fare_amount')     // 48291043.21
    df.mean('tip_pct')        // 14.82
    df.max('trip_distance')   // 189.4
INSPECTION
Methods for exploring unique values, frequencies, and joining DataFrames.
Returns an array of unique values in a column using a Set.
Returns the count of unique values. Faster than unique().length.
Returns a frequency table sorted from most to least common.
Example:

    const freq = df.value_counts('payment_type')
    // [{ value: 1, count: 5821034 }, { value: 2, count: 1823456 }]

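A frequency table of this shape can be built with a Map and one sort. `valueCounts` below is a hypothetical helper shown only to illustrate the technique; the sample data is made up.

```javascript
// Hypothetical helper: counts occurrences, most common value first.
function valueCounts(values) {
  const counts = new Map();
  for (const v of values) counts.set(v, (counts.get(v) || 0) + 1);
  return Array.from(counts, ([value, count]) => ({ value, count })).sort(
    (a, b) => b.count - a.count
  );
}

const paymentCodes = Float64Array.of(1, 2, 1, 1, 2, 3);
const frequencies = valueCounts(paymentCodes);
// [{ value: 1, count: 3 }, { value: 2, count: 2 }, { value: 3, count: 1 }]
```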
Hash join on a common column. inner returns only matching rows. left keeps all left rows, filling unmatched right columns with null.
const enriched = trips.join(zones, 'zone_id', 'left')
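The build-and-probe structure of a hash join can be sketched over plain row objects. `hashJoin` is a hypothetical helper and the rows are placeholder data; the library works on columnar storage, so this only illustrates the inner versus left semantics.

```javascript
// Hypothetical helper: joins two arrays of row objects on `key`.
function hashJoin(leftRows, rightRows, key, how = 'inner') {
  // Build phase: index the right side by join key.
  const index = new Map();
  for (const row of rightRows) {
    if (!index.has(row[key])) index.set(row[key], []);
    index.get(row[key]).push(row);
  }

  // Right-side columns to null-fill when a left row has no match.
  const rightColumns =
    rightRows.length > 0
      ? Object.keys(rightRows[0]).filter((name) => name !== key)
      : [];

  // Probe phase: look up each left row in the index.
  const out = [];
  for (const left of leftRows) {
    const matches = index.get(left[key]);
    if (matches) {
      for (const right of matches) out.push({ ...left, ...right });
    } else if (how === 'left') {
      const filler = Object.fromEntries(rightColumns.map((name) => [name, null]));
      out.push({ ...left, ...filler });
    }
  }
  return out;
}

const tripRows = [
  { zone_id: 1, fare: 10 },
  { zone_id: 9, fare: 5 },
];
const zoneRows = [{ zone_id: 1, zone_name: 'Zone A' }];

const innerJoined = hashJoin(tripRows, zoneRows, 'zone_id', 'inner');
// [{ zone_id: 1, fare: 10, zone_name: 'Zone A' }]
const leftJoined = hashJoin(tripRows, zoneRows, 'zone_id', 'left');
// second row: { zone_id: 9, fare: 5, zone_name: null }
```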
EXPORT
Write DataFrames to disk in CSV, JSON, TXT, or plain JS arrays.
Exports the DataFrame directly to disk as a CSV file via a streaming write. Floats are written with 4 decimal places. Validates the .csv extension.
await df.toCSV('output/results.csv')
Exports to a JSON file. Validates the .json extension.
Exports to a plain text file. Validates the .txt extension.
Converts the DataFrame back to a plain JS array of row objects. Useful for interoperability.
COLUMN & SERIES
Low-level column operations and the internal Series primitive.
Returns a Column instance linked to the underlying TypedArray. Enables chained arithmetic that mutates the column in-place.

    df.col('tpep_pickup_datetime')
      .to_datetime()
      .extract_hour(0)
    // Creates new column: tpep_pickup_datetime_hour

Example: arithmetic between columns

    df.col('total_amount')
      .sub(df.col('tip_amount'))
      .div(1.08)

| Method | Accepts | Description |
|---|---|---|
| add(value) | number \| Column | Addition in-place |
| sub(value) | number \| Column | Subtraction in-place |
| mul(value) | number \| Column | Multiplication in-place |
| div(value) | number \| Column | Division in-place; guards against division by zero |
| to_datetime() | — | Converts string timestamps to ms since epoch (Date.getTime) |
| extract_hour(offsetSeconds) | number | Extracts hour 0–23 from ms timestamp. Creates {name}_hour column |
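The two datetime steps can be approximated with the built-in Date object. In this sketch, `toEpochMs` and `extractHour` are hypothetical helpers, and reading the hour in UTC after applying the offset is an assumption; the library may resolve time zones differently.

```javascript
// Hypothetical helper: parses timestamp strings to ms since epoch.
function toEpochMs(timestamps) {
  return Float64Array.from(timestamps, (text) => new Date(text).getTime());
}

// Hypothetical helper: hour of day (0 to 23) after shifting by an offset.
// Assumption: the hour is read in UTC once the offset has been applied.
function extractHour(epochMs, offsetSeconds = 0) {
  const out = new Float64Array(epochMs.length);
  for (let i = 0; i < epochMs.length; i++) {
    const shifted = epochMs[i] + offsetSeconds * 1000;
    out[i] = new Date(shifted).getUTCHours();
  }
  return out;
}

const pickupMs = toEpochMs(['2024-01-01T13:45:00Z', '2024-01-01T23:10:00Z']);
const hourUtc = extractHour(pickupMs, 0); // 13, 23
const hourPlusOne = extractHour(pickupMs, 3600); // 14, 0
```

Storing timestamps as numbers keeps the column in a Float64Array, so the hour extraction stays a numeric loop rather than repeated string parsing.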
The internal columnar primitive. Each column inside a DataFrame is backed by a Series. Use Series.fromRawBuffer() to reconstruct from raw buffer data. The optional indexer enables transparent numeric-ID to string translation via .get(index).
| Method | Returns | Description |
|---|---|---|
| get(index) | any | Value at index — decodes via indexer if present |
| slice(start, end) | Series | Returns a slice preserving the indexer reference |
| Series.fromRawBuffer() | Series | Static factory from raw buffer + metadata |
| Series.formatResults() | object | Formats aggregation results with a .show() method |