@cervid/data Documentation

Dependencies: 0
Load 7.8M rows: 3.44s
Memory model: SharedArrayBuffer (SAB)
License: MIT

How it works

Cervid-data reads your entire dataset into a SharedArrayBuffer — one contiguous block of RAM. Worker Threads receive a reference to that buffer, not a copy. Each worker processes its own partition in parallel, writing results back to shared memory.

Numeric columns are stored as Float64Array views over the shared buffer. This means aggregations, filters, and feature engineering operate directly on raw memory — no object allocation, no garbage collection pressure.
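
The memory layout described above can be sketched with standard JavaScript globals only. This is a minimal illustration, not Cervid-data's internal code: the column names, the `partition` helper, and the plain loop standing in for Worker Threads are all assumptions made for the example.

```javascript
// Hypothetical layout: two numeric columns packed into one SharedArrayBuffer.
const rowCount = 8;
const bytesPerColumn = rowCount * Float64Array.BYTES_PER_ELEMENT;
const shared = new SharedArrayBuffer(bytesPerColumn * 2);

// Each column is a Float64Array view over the same buffer (no copy).
const fare = new Float64Array(shared, 0, rowCount);
const tip = new Float64Array(shared, bytesPerColumn, rowCount);
for (let i = 0; i < rowCount; i++) {
  fare[i] = 10 + i;
  tip[i] = 1 + i * 0.5;
}

// Split [0, total) into contiguous ranges, one per worker (illustrative helper).
function partition(total, workers) {
  const size = Math.ceil(total / workers);
  const ranges = [];
  for (let start = 0; start < total; start += size) {
    ranges.push([start, Math.min(start + size, total)]);
  }
  return ranges;
}

// In a real setup each range would be handled by a Worker Thread that received
// `shared` by reference; here a plain loop stands in for the workers.
const partialSums = partition(rowCount, 4).map(([start, end]) => {
  const view = new Float64Array(shared, 0, rowCount); // a second view, same memory
  let sum = 0;
  for (let i = start; i < end; i++) sum += view[i];
  return sum;
});
const totalFare = partialSums.reduce((a, b) => a + b, 0);
```

Because every view points at the same bytes, a value written through one view is immediately visible through the others, which is what makes passing a reference instead of a copy possible.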

Cervid-data uses only Node.js core modules: fs, worker_threads, os, path. No npm install required beyond the package itself.

Core classes

Cervid — Static entry point. Use Cervid.read() to load CSV or JSON files into a DataFrame.

DataFrame — The main data structure. Holds columnar data and exposes all transformation, aggregation, and export methods.

Column — A linked reference to a single column that enables chained arithmetic ops.

Series — The internal columnar primitive. Each column inside a DataFrame is a Series backed by a TypedArray.

npm install

npm i @cervid/data

Import

ESM (recommended)

import { Cervid } from '@cervid/data'
import { DataFrame } from '@cervid/data'
import { Series, Column } from '@cervid/data'

Cervid-data is ESM only. Make sure your package.json has "type": "module" or use .mjs extensions.

Node.js version

Cervid-data relies on SharedArrayBuffer and requires Node.js 16.4+. In Node.js (server-side), no extra configuration is needed. Cross-origin isolation headers are only required when SharedArrayBuffer is used in a browser context.

complete example

import { Cervid } from '@cervid/data'

// 1. Load CSV: uses SharedArrayBuffer + Worker Threads
const df = await Cervid.read('trips.csv')
console.log(df.info())
// { rowCount: 7832546, columnCount: 19, memoryUsage: '...' }

// 2. Select only the columns you need (frees RAM)
const slim = df.select([
  'fare_amount', 'tip_amount', 'trip_distance',
  'passenger_count', 'tpep_pickup_datetime'
])

// 3. Clean: remove invalid rows
const clean = slim.filter(
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
)

// 4. Feature engineering: vectorized over TypedArrays
clean.with_columns([
  {
    name: 'tip_pct',
    inputs: ['tip_amount', 'fare_amount'],
    formula: (tip, fare) => (tip / fare) * 100
  },
  {
    name: 'revenue_per_mile',
    inputs: ['fare_amount', 'trip_distance'],
    formula: (fare, dist) => fare / dist
  }
])

// 5. Extract hour from timestamp
clean.col('tpep_pickup_datetime').to_datetime().extract_hour(0)

// 6. GroupBy: best hour by tip percentage
const byHour = clean
  .groupBy('tpep_pickup_datetime_hour', { tip_pct: 'mean' })
  .sort('tip_pct', false)
byHour.show(5)

// 7. Export
await clean.toCSV('output/clean_trips.csv')

static async Cervid.read(filePath: string, options?: ReadOptions) → Promise<DataFrame>

static

Detects file format by extension and routes to the appropriate engine. .csv uses the Nitro engine (SharedArrayBuffer + Workers). .json uses the JSON engine with auto flattening.

Parameter	Type	Default	Description
filePath	string	—	Path to the file
options.workers	number	os.cpus().length	Number of Worker Threads
options.indexerCapacity	number	10_000_000	Max rows for column buffers
options.useOffsets	boolean	true	Store byte offsets for string columns
options.type	'csv' | 'json'	auto	Force a specific format

examples

const df = await Cervid.read('data.csv')
const df = await Cervid.read('data.json')
const df = await Cervid.read('data.csv', { workers: 4 })

The CSV engine reads the entire file into a single SharedArrayBuffer, then each Worker Thread receives a reference — not a copy. No data is duplicated in RAM.

JSON Engine

static async Cervid._readJSON(filePath: string) → Promise<DataFrame>

static

Automatically detects the root array (e.g. "prizes" in Nobel dataset). Recursively flattens nested objects into columns. Expands nested arrays into multiple rows, inheriting parent fields.

nested json example

// Input: { prizes: [{ year: "2023", laureates: [{id,name}] }] }
const df = await Cervid.read('nobel.json')
// Each laureate becomes its own row, with year inherited
// Columns: year, laureates_id, laureates_firstname, ...
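
The flattening behaviour described above can be approximated in a few lines. This is a sketch under stated assumptions, not the library's JSON engine: `flattenRecord` and `findRootArray` are hypothetical helpers, root detection is assumed to mean "first property that holds an array", and the sample document is made up for illustration.

```javascript
// Illustrative flattener: nested objects become prefixed columns,
// nested arrays of objects become extra rows that inherit parent fields.
function flattenRecord(record, prefix = '') {
  let rows = [{}];
  for (const [key, value] of Object.entries(record)) {
    const name = prefix ? `${prefix}_${key}` : key;
    if (Array.isArray(value) && value.every(v => v && typeof v === 'object')) {
      // Expand: one output row per (existing row, array element) pair.
      const expanded = value.flatMap(v => flattenRecord(v, name));
      if (expanded.length > 0) {
        rows = rows.flatMap(row => expanded.map(extra => ({ ...row, ...extra })));
      }
    } else if (value && typeof value === 'object' && !Array.isArray(value)) {
      // Nested object: recurse with the key as column prefix.
      const nested = flattenRecord(value, name);
      rows = rows.flatMap(row => nested.map(extra => ({ ...row, ...extra })));
    } else {
      // Scalar (or array of scalars): store as-is on every current row.
      for (const row of rows) row[name] = value;
    }
  }
  return rows;
}

// Assumed root detection: the document itself, or its first array-valued property.
function findRootArray(doc) {
  if (Array.isArray(doc)) return doc;
  return Object.values(doc).find(Array.isArray) ?? [];
}

// Made-up sample shaped like the Nobel example above.
const doc = {
  prizes: [
    { year: '2023', laureates: [{ id: '1', firstname: 'A' }, { id: '2', firstname: 'B' }] },
    { year: '2022', laureates: [{ id: '3', firstname: 'C' }] },
  ],
};
const rows = findRootArray(doc).flatMap(record => flattenRecord(record));
```

With this input, `rows` holds three flat objects, each carrying `year`, `laureates_id`, and `laureates_firstname`.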

new DataFrame(config: DataFrameConfig)

instance

Creates a new DataFrame. You typically get DataFrames from Cervid.read(), but you can construct one manually.

Property	Type	Description
columns	Record<string, TypedArray>	Column data — use Float64Array for numerics
rowCount	number	Total number of rows
headers	string[]	Column names in order

static DataFrame.fromObjects(data: object[]) → DataFrame

static

Converts a plain JS array of objects into a DataFrame. Numeric values are stored as Float64Array automatically.

example

const df = DataFrame.fromObjects([
  { name: 'Alice', score: 95 },
  { name: 'Bob', score: 82 },
])

static DataFrame.fromArray(data: Record<string, number>[]) → DataFrame

static

Constructs a DataFrame directly from an array of objects whose values are all numeric, initializing SharedArrayBuffers automatically.

static DataFrame.fromShared(def: SharedDef) → DataFrame

static

Constructs a DataFrame from a SharedDef object, instantiating TypedArray views directly over a SharedArrayBuffer without copying memory. Used internally by the Worker Threads engine.

assign(data: Record<string, ColumnData> | DataFrame) → this

instance

Assigns new columns or overwrites existing ones. Accepts an object of columns or another DataFrame. Mutates in-place. Throws an error if the row counts do not match.

getCol(name: string) → ColumnData | null

instance

Returns the underlying raw TypedArray or array data for a given column name.

show(n?: number = 5) → void

instance

Prints the first n rows using console.table. String values longer than 20 characters are truncated with "...".

info() → DataFrameInfo

instance

Returns a summary with row count, column count, column names, and estimated memory usage.

example

const info = df.info()
// { rowCount: 7832546, columnCount: 19, columns: [...], memoryUsage: '1139.45 MB' }

describe() → void

instance

Prints descriptive statistics for all numeric columns: count, mean, min, 25%, 50%, 75%, max. Non-numeric columns are skipped.

stats(colName: string) → object | null

instance

Returns a quick statistical summary of a single numeric column: count, sum, mean, min, and max.

with_columns(specs: ColSpec[]) → DataFrame

instance

Vectorized feature engineering. Applies formulas row-by-row using direct TypedArray access. Optimized fast paths for 1, 2, and 4 inputs. Returns this for chaining.

ColSpec property	Type	Description
name	string	Name of the new column to create
inputs	string[]	Column names fed into the formula
formula	Function	(...values: number[]) => number

example

df.with_columns([
  {
    name: 'revenue_per_mile',
    inputs: ['total_amount', 'trip_distance'],
    formula: (amount, dist) => dist > 0 ? amount / dist : 0
  },
  {
    name: 'speed_mph',
    inputs: ['trip_distance', 'duration_hours'],
    formula: (dist, dur) => dur > 0 ? dist / dur : 0
  }
])

with_columns mutates in-place and returns this. New columns are stored as Float64Array.
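
To illustrate what "row-by-row over TypedArrays with fast paths" can look like, here is a minimal sketch. It is not the library's implementation: `applyFormula` is a hypothetical helper, and only the one-input and two-input fast paths are shown before falling back to a generic loop.

```javascript
// Sketch of a row-wise formula applied over Float64Array inputs.
// `applyFormula` is a hypothetical helper, not part of the Cervid-data API.
function applyFormula(columns, rowCount, { inputs, formula }) {
  const out = new Float64Array(rowCount);
  const cols = inputs.map(name => columns[name]);
  if (cols.length === 1) {
    const [a] = cols;
    for (let i = 0; i < rowCount; i++) out[i] = formula(a[i]);
  } else if (cols.length === 2) {
    // Fast path: no per-row argument array is built.
    const [a, b] = cols;
    for (let i = 0; i < rowCount; i++) out[i] = formula(a[i], b[i]);
  } else {
    // Generic path: reuse one argument buffer across rows.
    const args = new Array(cols.length);
    for (let i = 0; i < rowCount; i++) {
      for (let c = 0; c < cols.length; c++) args[c] = cols[c][i];
      out[i] = formula(...args);
    }
  }
  return out;
}

const columns = {
  total_amount: Float64Array.of(20, 30, 12),
  trip_distance: Float64Array.of(4, 0, 3),
};
const revenuePerMile = applyFormula(columns, 3, {
  inputs: ['total_amount', 'trip_distance'],
  formula: (amount, dist) => (dist > 0 ? amount / dist : 0),
});
```

The output is a fresh Float64Array, matching the statement that new columns are stored as Float64Array.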

select(columnNames: string[]) → DataFrame

instance

Returns a new DataFrame with only the specified columns. Essential for freeing RAM — drop unused columns as early as possible.

const slim = df.select(['fare_amount', 'tip_amount', 'trip_distance'])

rename(mapping: Record<string, string>) → DataFrame

instance

Renames columns without copying data. Returns a new DataFrame with updated headers.

const df2 = df.rename({ tpep_pickup_datetime: 'pickup', PULocationID: 'zone' })

cast(columnName: string, type: 'float' | 'int' | 'string') → DataFrame

instance

Forces a type conversion on a column. 'float' and 'int' both produce Float64Array. 'string' produces a regular JS Array. Mutates in-place.

cumsum(columnName: string) → DataFrame

instance

Computes a running cumulative sum over a column. Creates a new column named {columnName}_cumsum. Mutates in-place.
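
A running cumulative sum over a typed array is a single pass. The helper below is a hypothetical standalone function that returns a new array; how the library stores the `{columnName}_cumsum` column is not shown here.

```javascript
// Running total: out[i] = values[0] + ... + values[i].
function cumsum(values) {
  const out = new Float64Array(values.length);
  let running = 0;
  for (let i = 0; i < values.length; i++) {
    running += values[i];
    out[i] = running;
  }
  return out;
}

const fareCumsum = cumsum(Float64Array.of(5, 7.5, 2.5, 10));
```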

with_label(specs: LabelSpec[]) → DataFrame

instance

Applies a StringIndexer to encode a string column as numeric IDs. Creates a new column named {input}_indexed and stores the indexer in metadata.indexers for later decoding.
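
String indexing of this kind usually means assigning each distinct label the next free integer ID and keeping the mapping for decoding. The class below borrows the name from the text but its methods (`encode`, `decode`) and internals are assumptions, not the library's actual interface.

```javascript
// Minimal label encoder: first-seen order determines the numeric ID.
class StringIndexer {
  constructor() {
    this.idByLabel = new Map(); // label -> id
    this.labels = [];           // id -> label
  }
  encode(label) {
    let id = this.idByLabel.get(label);
    if (id === undefined) {
      id = this.labels.length;
      this.idByLabel.set(label, id);
      this.labels.push(label);
    }
    return id;
  }
  decode(id) {
    return this.labels[id];
  }
}

// Made-up sample column.
const payment = ['card', 'cash', 'card', 'voucher', 'cash'];
const indexer = new StringIndexer();
const paymentIndexed = Float64Array.from(payment, label => indexer.encode(label));
```

Keeping the indexer alongside the encoded column is what allows numeric IDs to be translated back to strings later.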

filter(inputs: string[], predicate: Function) → DataFrame

instance

Returns a new DataFrame containing only rows where the predicate returns true. The predicate receives the values of the listed columns for each row.

example

const valid = df.filter(
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
)
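
One way to implement a predicate filter over columnar data is to collect the surviving row indices first and then copy every column once. The sketch below assumes all columns are Float64Array; `filterRows` is a hypothetical helper and the sample values are made up.

```javascript
// Two-pass filter: (1) find rows that pass, (2) gather them from every column.
function filterRows(columns, rowCount, inputs, predicate) {
  const cols = inputs.map(name => columns[name]);
  const keep = [];
  const args = new Array(cols.length);
  for (let i = 0; i < rowCount; i++) {
    for (let c = 0; c < cols.length; c++) args[c] = cols[c][i];
    if (predicate(...args)) keep.push(i);
  }
  const result = {};
  for (const [name, data] of Object.entries(columns)) {
    const out = new Float64Array(keep.length);
    for (let k = 0; k < keep.length; k++) out[k] = data[keep[k]];
    result[name] = out;
  }
  return { columns: result, rowCount: keep.length };
}

const trips = {
  fare_amount: Float64Array.of(12, -3, 8, 20),
  trip_distance: Float64Array.of(2, 1, 0, 5),
  passenger_count: Float64Array.of(1, 1, 2, 3),
};
const valid = filterRows(
  trips, 4,
  ['fare_amount', 'trip_distance', 'passenger_count'],
  (fare, dist, pax) => fare > 0 && dist > 0 && pax > 0
);
```

The original columns are left untouched, consistent with filter returning a new DataFrame.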

head(n?: number = 5) → DataFrame

instance

Returns a new DataFrame with the first n rows.

tail(n?: number = 5) → DataFrame

instance

Returns a new DataFrame with the last n rows.

dropNA(options?: { how?: 'any' | 'all' }) → DataFrame

instance

Removes rows containing null, undefined, or NaN. how: 'any' drops the row if any column has a null. how: 'all' drops only if all columns are null.
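
The difference between 'any' and 'all' can be shown on plain row objects. This is an illustrative sketch only: the library works on columns rather than row objects, and `dropNA` here is a hypothetical free function that takes `how` as a required argument.

```javascript
// A value counts as missing if it is null, undefined, or NaN.
const isMissing = v => v === null || v === undefined || Number.isNaN(v);

function dropNA(rows, how) {
  return rows.filter(row => {
    const values = Object.values(row);
    const missing = values.filter(isMissing).length;
    // 'all': drop only fully-missing rows; 'any': drop rows with at least one gap.
    return how === 'all' ? missing < values.length : missing === 0;
  });
}

const sample = [
  { a: 1, b: 2 },
  { a: NaN, b: 3 },
  { a: null, b: undefined },
];
const keptAny = dropNA(sample, 'any'); // keeps only the first row
const keptAll = dropNA(sample, 'all'); // keeps the first two rows
```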

fillna(value: number | string) → DataFrame

instance

Replaces all null, undefined, or NaN values in all columns with the specified value. Mutates in-place.

str_contains(columnName: string, pattern: string) → DataFrame

instance

Filters rows where the string column matches the regex pattern. Case-insensitive.

example

const result = df.str_contains('product_name', 'wireless')

groupBy(groupCol: string, aggs: AggSpec) → DataFrame

instance

Groups rows by a column and applies aggregation functions. Supports sum, mean, count, max, min. Pass a string for a single op or an array for multiple — output columns will be named {col}_{op}.

single op

const byHour = df.groupBy('hour', { tip_pct: 'mean', fare_amount: 'sum' })

multiple ops

const byZone = df.groupBy('zone_id', {
  fare_amount: ['sum', 'mean', 'count'],
  tip_amount: ['sum', 'max']
})
// Columns: fare_amount_sum, fare_amount_mean, fare_amount_count, ...

groupByCategory(groupCol: string, valueCol: string) → Map<string | number, GroupAccumulator>

instance

O(n) aggregation using a JavaScript Map. Ideal for unbounded or string categories. Returns a Map with sum and count.
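
A Map-based single-pass aggregation of this kind might look like the following. The function is a standalone sketch that takes the two columns as arrays; the accumulator shape `{ sum, count }` follows the description above, everything else is assumed.

```javascript
// One pass over the data; each distinct key gets a { sum, count } accumulator.
function groupByCategory(keys, values) {
  const groups = new Map();
  for (let i = 0; i < keys.length; i++) {
    let acc = groups.get(keys[i]);
    if (acc === undefined) {
      acc = { sum: 0, count: 0 };
      groups.set(keys[i], acc);
    }
    acc.sum += values[i];
    acc.count += 1;
  }
  return groups;
}

const byPayment = groupByCategory(
  ['card', 'cash', 'card'],
  Float64Array.of(2, 1, 4)
);
const cardMean = byPayment.get('card').sum / byPayment.get('card').count;
```

Storing sum and count rather than the mean keeps the accumulators mergeable, so partial results from several partitions could be combined afterwards.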

groupByRange(colName: string, targetCol: string, maxRange: number) → Result[]

instance

O(n) groupBy for bounded integer keys. Uses Uint32Array as a direct lookup — no Map, no hashing, no allocations. Returns an array of { group, avg } sorted descending.

This is the fastest aggregation in Cervid-data. Use it when group keys are bounded integers (e.g. zone IDs, hour 0–23).
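
The direct-lookup idea can be sketched as follows: when keys are integers in [0, maxRange), the key itself is the array index, so no hashing is needed. This is an assumption-laden illustration, not the library's code: the use of a Float64Array for sums next to the Uint32Array of counts, the skipping of out-of-range keys, and sorting by `avg` are all choices made for the example.

```javascript
// Direct-address aggregation for integer keys in [0, maxRange).
function groupByRange(keys, values, maxRange) {
  const sums = new Float64Array(maxRange);
  const counts = new Uint32Array(maxRange);
  for (let i = 0; i < keys.length; i++) {
    const k = keys[i];
    if (k >= 0 && k < maxRange) { // keys outside the range are skipped (assumption)
      sums[k] += values[i];
      counts[k] += 1;
    }
  }
  const result = [];
  for (let g = 0; g < maxRange; g++) {
    if (counts[g] > 0) result.push({ group: g, avg: sums[g] / counts[g] });
  }
  // Sorted by average, highest first (assumed sort key).
  return result.sort((a, b) => b.avg - a.avg);
}

const byZone = groupByRange(
  Float64Array.of(3, 1, 3, 0, 1),   // zone IDs
  Float64Array.of(10, 4, 20, 7, 6), // fares
  4
);
```

The hot loop touches only two preallocated typed arrays, which is why this approach suits small bounded key spaces such as hours 0 to 23.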

groupByID(colName: string, targetCol: string) → Result[]

instance

Alias for groupByRange(colName, targetCol, 300). Preset for NYC Taxi zone IDs.

sort(columnName: string, ascending?: boolean = true) → DataFrame

instance

Index sort — builds an index array, sorts by target column values, then reorders all columns in one pass. Returns a new DataFrame.

example — top 10 most profitable trips

const top10 = df.sort('revenue_per_mile', false).head(10)
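
An index sort as described above sorts row positions rather than the data itself, then gathers each column once. The sketch below assumes Float64Array columns; `sortByColumn` is a hypothetical helper and NaN handling is left out.

```javascript
// Sort an index array by the key column, then reorder every column with it.
function sortByColumn(columns, rowCount, columnName, ascending = true) {
  const key = columns[columnName];
  const order = new Uint32Array(rowCount);
  for (let i = 0; i < rowCount; i++) order[i] = i;
  order.sort((a, b) => (ascending ? key[a] - key[b] : key[b] - key[a]));

  const sorted = {};
  for (const [name, data] of Object.entries(columns)) {
    const out = new Float64Array(rowCount);
    for (let i = 0; i < rowCount; i++) out[i] = data[order[i]];
    sorted[name] = out;
  }
  return sorted;
}

const unsorted = {
  revenue_per_mile: Float64Array.of(2.5, 9, 4),
  fare_amount: Float64Array.of(10, 18, 12),
};
const byRevenueDesc = sortByColumn(unsorted, 3, 'revenue_per_mile', false);
```

Rows stay aligned across columns because every column is reordered with the same index array.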

Scalar Aggregations

Method	Returns	Description
sum(col)	number	Sum of all values in a column
mean(col)	number	Arithmetic mean — returns 0 if rowCount is 0
max(col)	number	Maximum value
min(col)	number	Minimum value

example

df.sum('fare_amount')    // 48291043.21
df.mean('tip_pct')       // 14.82
df.max('trip_distance')  // 189.4

unique(columnName: string) → any[]

instance

Returns an array of unique values in a column using a Set.

nunique(columnName: string) → number

instance

Returns the count of unique values. Faster than unique().length.

value_counts(columnName: string) → { value, count }[]

instance

Returns a frequency table sorted from most to least common.

example

const freq = df.value_counts('payment_type')
// [{ value: 1, count: 5821034 }, { value: 2, count: 1823456 }]
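
A frequency table with this output shape can be built with a Map and one sort. `valueCounts` below is a hypothetical standalone function, shown only to make the result format concrete.

```javascript
// Count occurrences, then order by count, most common first.
function valueCounts(values) {
  const counts = new Map();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  return [...counts]
    .map(([value, count]) => ({ value, count }))
    .sort((a, b) => b.count - a.count);
}

const paymentFreq = valueCounts([1, 2, 1, 1, 2, 4]);
```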

join(other: DataFrame, on: string, how?: 'inner' | 'left' = 'inner') → DataFrame

instance

Hash join on a common column. inner returns only matching rows. left keeps all left rows, filling unmatched right columns with null.

example

const enriched = trips.join(zones, 'zone_id', 'left')
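
The inner and left semantics of a hash join can be sketched on arrays of row objects. This is a simplified illustration: the library operates on columnar DataFrames, `hashJoin` is a hypothetical helper, and the sample rows are made up.

```javascript
// Build a hash index on the right side, then probe it with each left row.
function hashJoin(left, right, on, how = 'inner') {
  const index = new Map();
  for (const row of right) {
    const bucket = index.get(row[on]);
    if (bucket) bucket.push(row);
    else index.set(row[on], [row]);
  }
  const rightCols = right.length ? Object.keys(right[0]).filter(k => k !== on) : [];
  const out = [];
  for (const row of left) {
    const matches = index.get(row[on]);
    if (matches) {
      for (const m of matches) out.push({ ...row, ...m });
    } else if (how === 'left') {
      // Unmatched left row: keep it and fill the right-hand columns with null.
      const filler = Object.fromEntries(rightCols.map(k => [k, null]));
      out.push({ ...row, ...filler });
    }
  }
  return out;
}

const tripRows = [
  { trip: 1, zone_id: 10 },
  { trip: 2, zone_id: 99 },
  { trip: 3, zone_id: 11 },
];
const zoneRows = [
  { zone_id: 10, borough: 'A' },
  { zone_id: 11, borough: 'B' },
];
const innerJoined = hashJoin(tripRows, zoneRows, 'zone_id', 'inner');
const leftJoined = hashJoin(tripRows, zoneRows, 'zone_id', 'left');
```

Building the index on one side and probing with the other keeps the cost roughly linear in the combined row count.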

write(filePath: string) → Promise<void>

instance

Streams the DataFrame directly to disk as a CSV file.

toCSV(outputPath: string, options?: object) → Promise<void>

instance

Exports the DataFrame to a CSV file via streaming write. Floats are written with 4 decimal places. Validates the .csv extension.

example

await df.toCSV('output/results.csv')

toJSON(outputPath: string) → Promise<void>

instance

Exports to a JSON file. Validates .json extension.

toTXT(outputPath: string) → Promise<void>

instance

Exports to a plain text file. Validates .txt extension.

toArray() → object[]

instance

Converts the DataFrame back to a plain JS array of row objects. Useful for interoperability.

The file export methods toCSV, toJSON, and toTXT validate the file extension and throw if it doesn't match. Pass the path with the extension explicitly.

col(name: string) → Column

instance

Returns a Column instance linked to the underlying TypedArray. Enables chained arithmetic that mutates the column in-place.

example — timestamp parsing

df.col('tpep_pickup_datetime')
  .to_datetime()
  .extract_hour(0)
// Creates new column: tpep_pickup_datetime_hour

example — arithmetic between columns

df.col('total_amount')
  .sub(df.col('tip_amount'))
  .div(1.08)

Column methods

Method	Accepts	Description
add(value)	number | Column	Addition in-place
sub(value)	number | Column	Subtraction in-place
mul(value)	number | Column	Multiplication in-place
div(value)	number | Column	Division in-place, guards against divide by zero
to_datetime()	—	Converts string timestamps to ms since epoch (Date.getTime)
extract_hour(offsetSeconds)	number	Extracts hour 0–23 from ms timestamp. Creates {name}_hour column
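
Extracting an hour from a millisecond timestamp is plain integer arithmetic. The sketch below assumes the hour is computed in UTC and that `offsetSeconds` is simply added before the calculation, for example to shift into a fixed local offset; whether the library behaves exactly this way is not confirmed by the text above. `extractHour` is a hypothetical helper.

```javascript
const MS_PER_HOUR = 3_600_000;

// Hour of day (0..23) for each ms-since-epoch value, after applying an offset.
function extractHour(timestampsMs, offsetSeconds = 0) {
  const out = new Float64Array(timestampsMs.length);
  for (let i = 0; i < timestampsMs.length; i++) {
    const shifted = timestampsMs[i] + offsetSeconds * 1000;
    // Double modulo keeps the result in 0..23 for negative values too.
    out[i] = ((Math.floor(shifted / MS_PER_HOUR) % 24) + 24) % 24;
  }
  return out;
}

const pickupMs = Float64Array.of(
  Date.parse('2023-01-01T00:15:00Z'),
  Date.parse('2023-01-01T17:45:00Z'),
  Date.parse('2023-01-01T23:59:59Z')
);
const hoursUtc = extractHour(pickupMs, 0);            // 0, 17, 23
const hoursMinus5 = extractHour(pickupMs, -5 * 3600); // 19, 12, 18
```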

Series

new Series(name, data: TypedArray, type: string, indexer?, mask?)

instance

The internal columnar primitive. Each column inside a DataFrame is backed by a Series. Use Series.fromRawBuffer() to reconstruct from raw buffer data. The optional indexer enables transparent numeric-ID to string translation via .get(index).

Series is the internal primitive — most users work with DataFrame and Column methods directly.

Method	Returns	Description
get(index)	any	Value at index — decodes via indexer if present
slice(start, end)	Series	Returns a slice preserving the indexer reference
Series.fromRawBuffer()	Series	Static factory from raw buffer + metadata
Series.formatResults()	object	Formats aggregation results with a .show() method
