0 dependencies
3.44 s to load 7.8M rows
SharedArrayBuffer (SAB)
MIT License
How it works

Cervid-data reads your entire dataset into a SharedArrayBuffer — one contiguous block of RAM. Worker Threads receive a reference to that buffer, not a copy. Each worker processes its own partition in parallel, writing results back to shared memory.
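The pattern described above can be sketched with Node.js core modules alone. This is a minimal illustration under stated assumptions, not Cervid-data's actual engine: the function name parallelSum, the inline worker source, and the one-result-slot-per-worker layout are all hypothetical choices made for this example.

```javascript
// Sketch only: partitioned parallel sum over one SharedArrayBuffer.
// parallelSum and the result-slot layout are hypothetical, not library API.
async function parallelSum(values, workerCount = 2) {
  const { Worker } = await import('node:worker_threads');

  // One contiguous block of shared memory, viewed as 64-bit floats.
  const sab = new SharedArrayBuffer(values.length * Float64Array.BYTES_PER_ELEMENT);
  new Float64Array(sab).set(values);

  // One result slot per worker, also in shared memory.
  const resultsSab = new SharedArrayBuffer(workerCount * Float64Array.BYTES_PER_ELEMENT);

  // Worker body, evaluated as a CommonJS script via { eval: true }.
  const workerSource = `
    const { workerData } = require('node:worker_threads');
    const { sab, resultsSab, start, end, slot } = workerData;
    const data = new Float64Array(sab);          // a view over shared memory, not a copy
    const results = new Float64Array(resultsSab);
    let sum = 0;
    for (let i = start; i < end; i++) sum += data[i];
    results[slot] = sum;                         // write the partial result back
  `;

  const chunk = Math.ceil(values.length / workerCount);
  const jobs = [];
  for (let slot = 0; slot < workerCount; slot++) {
    const start = Math.min(slot * chunk, values.length);
    const end = Math.min(start + chunk, values.length);
    jobs.push(new Promise((resolve, reject) => {
      const worker = new Worker(workerSource, {
        eval: true,
        workerData: { sab, resultsSab, start, end, slot },
      });
      worker.on('error', reject);
      worker.on('exit', (code) => (code === 0 ? resolve() : reject(new Error(`exit ${code}`))));
    }));
  }
  await Promise.all(jobs);

  // Combine the per-worker partial sums on the main thread.
  return new Float64Array(resultsSab).reduce((a, b) => a + b, 0);
}
```

Passing a SharedArrayBuffer through workerData shares the memory rather than cloning it, which is why each worker can read its partition and write its result without any data being duplicated.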

Numeric columns are stored as Float64Array views over the shared buffer. This means aggregations, filters, and feature engineering operate directly on raw memory — no object allocation, no garbage collection pressure.
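As a rough illustration of what "views over the shared buffer" means, the sketch below packs two numeric columns into one SharedArrayBuffer and aggregates one of them with a plain loop. The helper names allocateColumns and sumColumn, and the back-to-back column layout, are assumptions for this example rather than the library's internal layout.

```javascript
// Sketch only: numeric columns as Float64Array views over one shared block.
function allocateColumns(names, rowCount) {
  const bytesPerColumn = rowCount * Float64Array.BYTES_PER_ELEMENT;
  const sab = new SharedArrayBuffer(names.length * bytesPerColumn);
  const columns = {};
  names.forEach((name, i) => {
    // Each view addresses its own region of the same buffer; nothing is copied.
    columns[name] = new Float64Array(sab, i * bytesPerColumn, rowCount);
  });
  return { sab, columns };
}

// Aggregation reads raw memory directly: no per-row objects are created.
function sumColumn(column) {
  let total = 0;
  for (let i = 0; i < column.length; i++) total += column[i];
  return total;
}

const { sab, columns } = allocateColumns(['fare', 'tip'], 3);
columns.fare.set([10, 20, 30]);
columns.tip.set([1, 2, 3]);
console.log(sumColumn(columns.fare)); // 60
```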

Cervid-data uses only Node.js core modules: fs, worker_threads, os, path. No npm install required beyond the package itself.
Core classes

Cervid — Static entry point. Use Cervid.read() to load CSV or JSON files into a DataFrame.

DataFrame — The main data structure. Holds columnar data and exposes all transformation, aggregation, and export methods.

Column — A linked reference to a single column that enables chained arithmetic ops.

Series — The internal columnar primitive. Each column inside a DataFrame is a Series backed by a TypedArray.

install
npm i @cervid/data
Import
ESM (recommended)
import { Cervid } from '@cervid/data'
import { DataFrame } from '@cervid/data'
import { Series, Column } from '@cervid/data'
Cervid-data is ESM only. Make sure your package.json has "type": "module" or use .mjs extensions.
Node.js version

Cervid-data requires Node.js 16.4 or later. In Node.js (server-side), SharedArrayBuffer needs no extra configuration. In a browser context, SharedArrayBuffer is only available when the page is served with cross-origin isolation headers.

complete example
import { Cervid } from '@cervid/data'
// 1. Load CSV: uses SharedArrayBuffer + Worker Threads
const df = await Cervid.read('trips.csv')
console.log(df.info())
// { rowCount: 7832546, columnCount: 19, memoryUsage: '...' }
// 2. Select only the columns you need (frees RAM)
const slim = df.select([
  'fare_amount', 'tip_amount', 'trip_distance',
  'passenger_count', 'tpep_pickup_datetime'
])
// 3. Clean: remove invalid rows
const clean = slim.filter(
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
)
// 4. Feature engineering: vectorized over TypedArrays
clean.with_columns([
  { name: 'tip_pct', inputs: ['tip_amount', 'fare_amount'],
    formula: (tip, fare) => (tip / fare) * 100 },
  { name: 'revenue_per_mile', inputs: ['fare_amount', 'trip_distance'],
    formula: (fare, dist) => fare / dist }
])
// 5. Extract hour from timestamp
clean.col('tpep_pickup_datetime').to_datetime().extract_hour(0)
// 6. GroupBy: best hour by tip percentage
const byHour = clean.groupBy('tpep_pickup_datetime_hour', { tip_pct: 'mean' }).sort('tip_pct', false)
byHour.show(5)
// 7. Export
await clean.toCSV('output/clean_trips.csv')
static async Cervid.read(filePath: string, options?: ReadOptions) → Promise<DataFrame>
static

Detects file format by extension and routes to the appropriate engine. .csv uses the Nitro engine (SharedArrayBuffer + Workers). .json uses the JSON engine with auto flattening.

Parameter                 Type             Default            Description
filePath                  string           (none)             Path to the file
options.workers           number           os.cpus().length   Number of Worker Threads
options.indexerCapacity   number           10_000_000         Max rows for column buffers
options.useOffsets        boolean          true               Store byte offsets for string columns
options.type              'csv' | 'json'   auto               Force a specific format
examples
// Three independent calls; distinct names avoid redeclaring the same const.
const dfCsv = await Cervid.read('data.csv')
const dfJson = await Cervid.read('data.json')
const dfTuned = await Cervid.read('data.csv', { workers: 4 })
The CSV engine reads the entire file into a single SharedArrayBuffer, then each Worker Thread receives a reference — not a copy. No data is duplicated in RAM.
JSON Engine
static async Cervid._readJSON(filePath: string) → Promise<DataFrame>
static

Automatically detects the root array (e.g. "prizes" in Nobel dataset). Recursively flattens nested objects into columns. Expands nested arrays into multiple rows, inheriting parent fields.

nested json example
// Input: { prizes: [{ year: "2023", laureates: [{id,name}] }] }
const df = await Cervid.read('nobel.json')
// Each laureate becomes its own row, with year inherited
// Columns: year, laureates_id, laureates_firstname, ...
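The flattening behaviour described above can be approximated in a few lines. The sketch below is an illustration under assumptions, not the library's JSON engine: findRootArray and flattenRecord are hypothetical helpers, the underscore-joined column names follow the laureates_id pattern shown in the example, and the sample names are placeholders.

```javascript
// Sketch only: root-array detection plus recursive flattening.
const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);

// Use the document itself if it is an array, otherwise the first
// top-level property whose value is an array (e.g. "prizes").
function findRootArray(json) {
  if (Array.isArray(json)) return json;
  for (const value of Object.values(json)) {
    if (Array.isArray(value)) return value;
  }
  return [json];
}

// Returns one or more flat rows for a single record.
function flattenRecord(record, prefix = '') {
  let rows = [{}];
  for (const [key, value] of Object.entries(record)) {
    const name = prefix ? `${prefix}_${key}` : key;
    if (isPlainObject(value)) {
      // Nested object: its fields become prefixed columns.
      const children = flattenRecord(value, name);
      rows = rows.flatMap((row) => children.map((child) => ({ ...row, ...child })));
    } else if (Array.isArray(value) && value.length > 0 && value.every(isPlainObject)) {
      // Nested array of objects: one output row per element, parent fields inherited.
      const children = value.flatMap((item) => flattenRecord(item, name));
      rows = rows.flatMap((row) => children.map((child) => ({ ...row, ...child })));
    } else {
      // Scalar (or non-object array): stored as-is on every row.
      for (const row of rows) row[name] = value;
    }
  }
  return rows;
}

// Placeholder data shaped like the example above.
const doc = {
  prizes: [
    { year: '2023', laureates: [{ id: '1', firstname: 'Ada' }, { id: '2', firstname: 'Bo' }] },
  ],
};
const flat = findRootArray(doc).flatMap((prize) => flattenRecord(prize));
console.log(flat);
```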
new DataFrame(config: DataFrameConfig)
instance

Creates a new DataFrame. You typically get DataFrames from Cervid.read(), but you can construct one manually.

Property   Type                         Description
columns    Record<string, TypedArray>   Column data; use Float64Array for numerics
rowCount   number                       Total number of rows
headers    string[]                     Column names in order
static DataFrame.fromObjects(data: object[]) → DataFrame
static

Converts a plain JS array of objects into a DataFrame. Numeric values are stored as Float64Array automatically.

example
const df = DataFrame.fromObjects([
  { name: 'Alice', score: 95 },
  { name: 'Bob', score: 82 },
])
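To make the row-to-column conversion concrete, here is a small sketch of how an array of objects might be pivoted into columnar form. columnsFromObjects is a hypothetical helper, and the rule "a column is numeric only if every value is a number" is an assumption of this example; the library's own type detection may differ.

```javascript
// Sketch only: pivot row objects into columns.
function columnsFromObjects(rows) {
  const headers = rows.length > 0 ? Object.keys(rows[0]) : [];
  const columns = {};
  for (const header of headers) {
    const isNumeric = rows.every((row) => typeof row[header] === 'number');
    columns[header] = isNumeric
      ? Float64Array.from(rows, (row) => row[header]) // numeric: typed array
      : rows.map((row) => row[header]);               // otherwise: plain JS array
  }
  return { columns, headers, rowCount: rows.length };
}

const frame = columnsFromObjects([
  { name: 'Alice', score: 95 },
  { name: 'Bob', score: 82 },
]);
console.log(frame.columns.score); // Float64Array(2) [ 95, 82 ]
```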
static DataFrame.fromArray(data: Record<string, number>[]) → DataFrame
static

Constructs a DataFrame directly from an array of numeric objects, initializing SharedArrayBuffers automatically.

static DataFrame.fromShared(def: SharedDef) → DataFrame
static

Constructs a DataFrame from a SharedDef object, instantiating TypedArray views directly over a SharedArrayBuffer without copying memory. Used internally by the Worker Threads engine.

assign(data: Record<string, ColumnData> | DataFrame) → this
instance

Assigns new columns or overwrites existing ones. Accepts an object of columns or another DataFrame. Mutates in-place. Throws an error if the row counts do not match.

getCol(name: string) → ColumnData | null
instance

Returns the underlying raw TypedArray or array data for a given column name.

show(n?: number = 5) → void
instance

Prints the first n rows as a console.table. String values longer than 20 characters are truncated with "...".

info() → DataFrameInfo
instance

Returns a summary with row count, column count, column names, and estimated memory usage.

example
const info = df.info()
// { rowCount: 7832546, columnCount: 19, columns: [...], memoryUsage: '1139.45 MB' }
describe() → void
instance

Prints descriptive statistics for all numeric columns: count, mean, min, 25%, 50%, 75%, max. Non-numeric columns are skipped.

stats(colName: string) → object | null
instance

Returns a quick statistical summary of a single numeric column: count, sum, mean, min, and max.

with_columns(specs: ColSpec[]) → DataFrame
instance

Vectorized feature engineering. Applies formulas row-by-row using direct TypedArray access. Optimized fast paths for 1, 2, and 4 inputs. Returns this for chaining.

ColSpec property   Type       Description
name               string     Name of the new column to create
inputs             string[]   Column names fed into the formula
formula            Function   (...values: number[]) => number
example
df.with_columns([
  { name: 'revenue_per_mile', inputs: ['total_amount', 'trip_distance'],
    formula: (amount, dist) => dist > 0 ? amount / dist : 0 },
  { name: 'speed_mph', inputs: ['trip_distance', 'duration_hours'],
    formula: (dist, dur) => dur > 0 ? dist / dur : 0 }
])
with_columns mutates in-place and returns this. New columns are stored as Float64Array.
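One plausible way to apply a ColSpec over typed arrays is shown below. It is a sketch under assumptions: withColumn is a hypothetical helper, and only the 1-input and 2-input fast paths are written out, with a generic fallback for any other arity.

```javascript
// Sketch only: apply one { name, inputs, formula } spec over typed arrays.
function withColumn(columns, rowCount, { name, inputs, formula }) {
  const cols = inputs.map((input) => columns[input]);
  const out = new Float64Array(rowCount);
  if (cols.length === 1) {
    // Fast path: no argument array is built per row.
    const [a] = cols;
    for (let i = 0; i < rowCount; i++) out[i] = formula(a[i]);
  } else if (cols.length === 2) {
    const [a, b] = cols;
    for (let i = 0; i < rowCount; i++) out[i] = formula(a[i], b[i]);
  } else {
    // Generic path: reuse one argument array across rows.
    const args = new Array(cols.length);
    for (let i = 0; i < rowCount; i++) {
      for (let j = 0; j < cols.length; j++) args[j] = cols[j][i];
      out[i] = formula(...args);
    }
  }
  columns[name] = out; // mutate in place, as with_columns is documented to do
  return columns;
}

const columns = {
  tip_amount: Float64Array.of(1, 5),
  fare_amount: Float64Array.of(4, 10),
};
withColumn(columns, 2, {
  name: 'tip_pct',
  inputs: ['tip_amount', 'fare_amount'],
  formula: (tip, fare) => (tip / fare) * 100,
});
console.log(Array.from(columns.tip_pct)); // [ 25, 50 ]
```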
select(columnNames: string[]) → DataFrame
instance

Returns a new DataFrame with only the specified columns. Essential for freeing RAM — drop unused columns as early as possible.

const slim = df.select(['fare_amount', 'tip_amount', 'trip_distance'])
rename(mapping: Record<string, string>) → DataFrame
instance

Renames columns without copying data. Returns a new DataFrame with updated headers.

const df2 = df.rename({ tpep_pickup_datetime: 'pickup', PULocationID: 'zone' })
cast(columnName: string, type: 'float' | 'int' | 'string') → DataFrame
instance

Forces a type conversion on a column. 'float' and 'int' both produce Float64Array. 'string' produces a regular JS Array. Mutates in-place.

cumsum(columnName: string) → DataFrame
instance

Computes a running cumulative sum over a column. Creates a new column named {columnName}_cumsum. Mutates in-place.

with_label(specs: LabelSpec[]) → DataFrame
instance

Applies a StringIndexer to encode a string column as numeric IDs. Creates a new column named {input}_indexed and stores the indexer in metadata.indexers for later decoding.
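The encoding step can be pictured as follows. This is a generic string-indexing sketch, not the library's StringIndexer: indexStrings is a hypothetical helper, and assigning IDs in order of first appearance is an assumption of this example.

```javascript
// Sketch only: encode strings as numeric IDs and keep the labels for decoding.
function indexStrings(values) {
  const idByLabel = new Map();
  const labels = [];                       // labels[id] decodes an ID back to its string
  const ids = new Float64Array(values.length);
  for (let i = 0; i < values.length; i++) {
    let id = idByLabel.get(values[i]);
    if (id === undefined) {
      id = labels.length;                  // next unused ID, in order of first appearance
      idByLabel.set(values[i], id);
      labels.push(values[i]);
    }
    ids[i] = id;
  }
  return { ids, labels };
}

const { ids, labels } = indexStrings(['cash', 'card', 'cash']);
console.log(Array.from(ids), labels); // [ 0, 1, 0 ] [ 'cash', 'card' ]
```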

filter(inputs: string[], predicate: Function) → DataFrame
instance

Returns a new DataFrame containing only rows where the predicate returns true. The predicate receives the values of the listed columns for each row.

example
const valid = df.filter(
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
)
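Conceptually, a columnar filter first decides which row indices survive and then gathers those rows from every column. The sketch below illustrates that two-step idea; filterRows is a hypothetical helper, not the library's implementation.

```javascript
// Sketch only: evaluate the predicate per row, then gather surviving rows.
function filterRows(columns, rowCount, inputs, predicate) {
  const inputCols = inputs.map((name) => columns[name]);
  const keep = [];
  for (let i = 0; i < rowCount; i++) {
    if (predicate(...inputCols.map((col) => col[i]))) keep.push(i);
  }
  const out = {};
  for (const [name, col] of Object.entries(columns)) {
    out[name] = ArrayBuffer.isView(col)
      ? Float64Array.from(keep, (i) => col[i]) // numeric column stays a typed array
      : keep.map((i) => col[i]);               // string column stays a plain array
  }
  return { columns: out, rowCount: keep.length };
}

const input = {
  fare: Float64Array.of(12, -1, 8),
  dist: Float64Array.of(3, 2, 0),
  vendor: ['a', 'b', 'c'],
};
const valid = filterRows(input, 3, ['fare', 'dist'], (fare, dist) => fare > 0 && dist > 0);
console.log(valid.rowCount, valid.columns.vendor); // 1 [ 'a' ]
```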
head(n?: number = 5) → DataFrame
instance

Returns a new DataFrame with the first n rows.

tail(n?: number = 5) → DataFrame
instance

Returns a new DataFrame with the last n rows.

dropNA(options?: { how?: 'any' | 'all' }) → DataFrame
instance

Removes rows containing null, undefined, or NaN. how: 'any' drops the row if any column has a null. how: 'all' drops only if all columns are null.
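The difference between the two modes is easiest to see on a tiny example. The sketch below only computes which row indices would be kept; keptRowIndices is a hypothetical helper, and defaulting to 'any' is an assumption of this example, since the default is not stated above.

```javascript
// Sketch only: 'any' vs 'all' semantics for dropping rows with missing values.
const isMissing = (v) => v === null || v === undefined || Number.isNaN(v);

function keptRowIndices(columns, rowCount, how = 'any') {
  const cols = Object.values(columns);
  const keep = [];
  for (let i = 0; i < rowCount; i++) {
    let missing = 0;
    for (const col of cols) if (isMissing(col[i])) missing++;
    const drop = how === 'all' ? missing === cols.length : missing > 0;
    if (!drop) keep.push(i);
  }
  return keep;
}

const sample = {
  a: Float64Array.of(1, NaN, NaN),
  b: ['x', null, 'z'],
};
console.log(keptRowIndices(sample, 3, 'any')); // [ 0 ]
console.log(keptRowIndices(sample, 3, 'all')); // [ 0, 2 ]
```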

fillna(value: number | string) → DataFrame
instance

Replaces all null, undefined, or NaN values in all columns with the specified value. Mutates in-place.

str_contains(columnName: string, pattern: string) → DataFrame
instance

Filters rows where the string column matches the regex pattern. Case-insensitive.

example
const result = df.str_contains('product_name', 'wireless')
groupBy(groupCol: string, aggs: AggSpec) → DataFrame
instance

Groups rows by a column and applies aggregation functions. Supports sum, mean, count, max, min. Pass a string for a single op or an array for multiple — output columns will be named {col}_{op}.

single op
const byHour = df.groupBy('hour', { tip_pct: 'mean', fare_amount: 'sum' })
multiple ops
const byZone = df.groupBy('zone_id', {
  fare_amount: ['sum', 'mean', 'count'],
  tip_amount: ['sum', 'max']
})
// Columns: fare_amount_sum, fare_amount_mean, fare_amount_count, ...
groupByCategory(groupCol: string, valueCol: string) → Map<string | number, GroupAccumulator>
instance

O(n) aggregation using a JavaScript Map. Ideal for unbounded or string categories. Returns a Map with sum and count.
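A single-pass Map aggregation of this kind can be sketched as below. accumulateByKey is a hypothetical name, and the { sum, count } accumulator shape is inferred from the description above rather than taken from the library source.

```javascript
// Sketch only: one pass over the rows, one accumulator per distinct key.
function accumulateByKey(keys, values) {
  const groups = new Map();
  for (let i = 0; i < keys.length; i++) {
    let acc = groups.get(keys[i]);
    if (acc === undefined) {
      acc = { sum: 0, count: 0 };
      groups.set(keys[i], acc);
    }
    acc.sum += values[i];
    acc.count += 1;
  }
  return groups;
}

const totals = accumulateByKey(['card', 'cash', 'card'], Float64Array.of(10, 4, 6));
console.log(totals.get('card')); // { sum: 16, count: 2 }
```

Because keys are looked up in a Map, this works for strings and for numeric keys with no known upper bound.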

groupByRange(colName: string, targetCol: string, maxRange: number) → Result[]
instance

O(n) groupBy for bounded integer keys. Uses Uint32Array as a direct lookup — no Map, no hashing, no allocations. Returns array of { group, avg } sorted descending.

This is the fastest aggregation in Cervid-data. Use it when group keys are bounded integers (e.g. zone IDs, hour 0–23).
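The direct-lookup idea can be illustrated as follows. This sketch makes several assumptions that the text above does not confirm: averageByBoundedKey is a hypothetical name, sums are kept in a Float64Array alongside Uint32Array counts, keys are valid in the range 0 to maxRange - 1, out-of-range keys are skipped, and the result is sorted by avg.

```javascript
// Sketch only: the group key is used directly as an array index.
function averageByBoundedKey(keys, values, maxRange) {
  const sums = new Float64Array(maxRange);
  const counts = new Uint32Array(maxRange);
  for (let i = 0; i < keys.length; i++) {
    const key = keys[i];
    if (Number.isInteger(key) && key >= 0 && key < maxRange) {
      sums[key] += values[i];
      counts[key] += 1;
    }
  }
  const result = [];
  for (let group = 0; group < maxRange; group++) {
    if (counts[group] > 0) result.push({ group, avg: sums[group] / counts[group] });
  }
  return result.sort((a, b) => b.avg - a.avg); // highest average first
}

const byHour = averageByBoundedKey(
  Float64Array.of(1, 2, 1, 23),   // hour of day
  Float64Array.of(10, 30, 20, 5), // value to average
  24
);
console.log(byHour);
// [ { group: 2, avg: 30 }, { group: 1, avg: 15 }, { group: 23, avg: 5 } ]
```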
groupByID(colName: string, targetCol: string) → Result[]
instance

Alias for groupByRange(colName, targetCol, 300). Preset for NYC Taxi zone IDs.

sort(columnName: string, ascending?: boolean = true) → DataFrame
instance

Index sort — builds an index array, sorts by target column values, then reorders all columns in one pass. Returns a new DataFrame.

example: top 10 most profitable trips
const top10 = df.sort('revenue_per_mile', false).head(10)
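An index sort of the kind described above can be sketched like this. sortByColumn is a hypothetical helper; the use of a Uint32Array for the index and the gather step are assumptions of this example, not a description of the library's internals.

```javascript
// Sketch only: sort an index array by one column, then reorder every column.
function sortByColumn(columns, rowCount, by, ascending = true) {
  const order = new Uint32Array(rowCount);
  for (let i = 0; i < rowCount; i++) order[i] = i;

  const key = columns[by];
  order.sort((a, b) => (ascending ? key[a] - key[b] : key[b] - key[a]));

  // Gather each column once, following the sorted index.
  const out = {};
  for (const [name, col] of Object.entries(columns)) {
    out[name] = ArrayBuffer.isView(col)
      ? Float64Array.from(order, (i) => col[i])
      : Array.from(order, (i) => col[i]);
  }
  return out;
}

const sorted = sortByColumn(
  { revenue_per_mile: Float64Array.of(3, 9, 1), trip: ['a', 'b', 'c'] },
  3,
  'revenue_per_mile',
  false
);
console.log(Array.from(sorted.revenue_per_mile), sorted.trip); // [ 9, 3, 1 ] [ 'b', 'a', 'c' ]
```

Sorting the index rather than the rows means the comparison touches only one column, and every other column is moved exactly once.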
Scalar Aggregations
Method      Returns   Description
sum(col)    number    Sum of all values in a column
mean(col)   number    Arithmetic mean; returns 0 if rowCount is 0
max(col)    number    Maximum value
min(col)    number    Minimum value
example
df.sum('fare_amount')   // 48291043.21
df.mean('tip_pct')      // 14.82
df.max('trip_distance') // 189.4
unique(columnName: string) → any[]
instance

Returns an array of unique values in a column using a Set.

nunique(columnName: string) → number
instance

Returns the count of unique values. Faster than unique().length.

value_counts(columnName: string) → { value, count }[]
instance

Returns a frequency table sorted from most to least common.

example
const freq = df.value_counts('payment_type')
// [{ value: 1, count: 5821034 }, { value: 2, count: 1823456 }]
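A frequency table with this output shape can be produced by counting into a Map and sorting the entries. valueCounts below is a hypothetical stand-alone helper written for illustration, not the library method itself.

```javascript
// Sketch only: count occurrences, then sort from most to least common.
function valueCounts(values) {
  const counts = new Map();
  for (const value of values) counts.set(value, (counts.get(value) || 0) + 1);
  return Array.from(counts, ([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count);
}

const table = valueCounts(Float64Array.of(1, 2, 1, 1, 2, 3));
console.log(table);
// [ { value: 1, count: 3 }, { value: 2, count: 2 }, { value: 3, count: 1 } ]
```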
join(other: DataFrame, on: string, how?: 'inner' | 'left' = 'inner') → DataFrame
instance

Hash join on a common column. inner returns only matching rows. left keeps all left rows, filling unmatched right columns with null.

example
const enriched = trips.join(zones, 'zone_id', 'left')
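For readers unfamiliar with hash joins, the sketch below shows the general technique on plain row objects: build a lookup on the right table, then probe it once per left row. hashJoin is a hypothetical helper and the sample rows are placeholders; the library operates on columnar data, so its implementation will differ in detail.

```javascript
// Sketch only: hash join over arrays of row objects.
function hashJoin(left, right, on, how = 'inner') {
  // Build phase: index the right rows by join key.
  const index = new Map();
  for (const row of right) {
    if (!index.has(row[on])) index.set(row[on], []);
    index.get(row[on]).push(row);
  }
  const rightCols = right.length > 0 ? Object.keys(right[0]).filter((c) => c !== on) : [];

  // Probe phase: one lookup per left row.
  const out = [];
  for (const leftRow of left) {
    const matches = index.get(leftRow[on]);
    if (matches) {
      for (const rightRow of matches) out.push({ ...leftRow, ...rightRow });
    } else if (how === 'left') {
      // Keep the unmatched left row and fill the right-hand columns with null.
      const filler = {};
      for (const col of rightCols) filler[col] = null;
      out.push({ ...leftRow, ...filler });
    }
  }
  return out;
}

const tripRows = [{ zone_id: 1, fare: 10 }, { zone_id: 9, fare: 5 }];
const zoneRows = [{ zone_id: 1, zone_name: 'Zone A' }];
console.log(hashJoin(tripRows, zoneRows, 'zone_id', 'left'));
// [ { zone_id: 1, fare: 10, zone_name: 'Zone A' }, { zone_id: 9, fare: 5, zone_name: null } ]
```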
write(filePath: string) → Promise<void>
instance

Streams the DataFrame directly to disk as a CSV file.

toCSV(outputPath: string, options?: object) → Promise<void>
instance

Exports the DataFrame to a CSV file via streaming write. Floats are written with 4 decimal places. Validates the .csv extension.

example
await df.toCSV('output/results.csv')
toJSON(outputPath: string) → Promise<void>
instance

Exports to a JSON file. Validates .json extension.

toTXT(outputPath: string) → Promise<void>
instance

Exports to a plain text file. Validates .txt extension.

toArray() → object[]
instance

Converts the DataFrame back to a plain JS array of row objects. Useful for interoperability.

All export methods validate the file extension and throw if it doesn't match. Pass the path with the extension explicitly.
col(name: string) → Column
instance

Returns a Column instance linked to the underlying TypedArray. Enables chained arithmetic that mutates the column in-place.

example: timestamp parsing
df.col('tpep_pickup_datetime')
  .to_datetime()
  .extract_hour(0)
// Creates new column: tpep_pickup_datetime_hour
example: arithmetic between columns
df.col('total_amount')
  .sub(df.col('tip_amount'))
  .div(1.08)
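The chaining works because each operation mutates the underlying array and returns the same object. The class below is a stand-in written for illustration only: ColumnSketch is a hypothetical name, and writing 0 when the divisor is 0 is an assumed form of the divide-by-zero guard, which is not specified here.

```javascript
// Sketch only: in-place arithmetic that returns `this` so calls can be chained.
class ColumnSketch {
  constructor(data) {
    this.data = data; // a Float64Array that is mutated in place
  }

  _apply(value, op) {
    const other = value instanceof ColumnSketch ? value.data : null;
    for (let i = 0; i < this.data.length; i++) {
      this.data[i] = op(this.data[i], other ? other[i] : value);
    }
    return this;
  }

  add(value) { return this._apply(value, (a, b) => a + b); }
  sub(value) { return this._apply(value, (a, b) => a - b); }
  mul(value) { return this._apply(value, (a, b) => a * b); }
  // Assumed guard: a zero divisor yields 0 instead of Infinity or NaN.
  div(value) { return this._apply(value, (a, b) => (b === 0 ? 0 : a / b)); }
}

const total = new ColumnSketch(Float64Array.of(110, 60));
const tip = new ColumnSketch(Float64Array.of(2, 6));
total.sub(tip).div(2);
console.log(Array.from(total.data)); // [ 54, 27 ]
```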
Column methods
Method                        Accepts           Description
add(value)                    number | Column   Addition in-place
sub(value)                    number | Column   Subtraction in-place
mul(value)                    number | Column   Multiplication in-place
div(value)                    number | Column   Division in-place; guards against divide by zero
to_datetime()                 (none)            Converts string timestamps to ms since epoch (Date.getTime)
extract_hour(offsetSeconds)   number            Extracts hour 0–23 from ms timestamp; creates {name}_hour column
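The two timestamp steps can be approximated with plain arithmetic on epoch milliseconds. This is a sketch under assumptions: toEpochMs and extractHour are hypothetical helpers, the hour is computed in UTC, and offsetSeconds is treated as a shift added before the hour is taken. How the library interprets time zones is not stated here.

```javascript
// Sketch only: string timestamps -> epoch ms -> hour of day (0-23).
function toEpochMs(strings) {
  return Float64Array.from(strings, (s) => Date.parse(s));
}

function extractHour(msValues, offsetSeconds = 0) {
  const out = new Float64Array(msValues.length);
  for (let i = 0; i < msValues.length; i++) {
    const seconds = Math.floor(msValues[i] / 1000) + offsetSeconds;
    // Double modulo keeps the result non-negative for negative offsets.
    const secondsOfDay = ((seconds % 86400) + 86400) % 86400;
    out[i] = Math.floor(secondsOfDay / 3600);
  }
  return out;
}

const ms = toEpochMs(['2023-01-01T08:15:00Z', '2023-01-01T23:59:59Z']);
console.log(Array.from(extractHour(ms, 0)));         // [ 8, 23 ]
console.log(Array.from(extractHour(ms, -5 * 3600))); // [ 3, 18 ]
```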
Series
new Series(name, data: TypedArray, type: string, indexer?, mask?)
instance

The internal columnar primitive. Each column inside a DataFrame is backed by a Series. Use Series.fromRawBuffer() to reconstruct from raw buffer data. The optional indexer enables transparent numeric-ID to string translation via .get(index).

Series is the internal primitive — most users work with DataFrame and Column methods directly.
Method                   Returns   Description
get(index)               any       Value at index; decodes via indexer if present
slice(start, end)        Series    Returns a slice preserving the indexer reference
Series.fromRawBuffer()   Series    Static factory from raw buffer + metadata
Series.formatResults()   object    Formats aggregation results with a .show() method