Datum — Statistical Autopsy

Drop your dataset

CSV or TSV — any shape, any size. Datum will figure it out.

↑ drag & drop or click to browse

or try a sample

What Is This?

Datum is a statistical autopsy engine. Drop any CSV and it performs a full diagnostic — profiling every column, discovering distributions, computing correlations, flagging outliers, and generating a human-readable narrative. Every computation runs in your browser; nothing leaves your machine.

⊘ Zero AI · Zero Server · Pure Statistics

Methods Used

Descriptive Statistics

Mean, median, standard deviation, min/max, quartiles (Q1/Q3), interquartile range. Computed in a single pass using Welford's online algorithm for numerical stability.

Distribution Detection

Skewness (Fisher's method) and kurtosis classify each column: normal, right-skewed, left-skewed, uniform, or bimodal. Histogram binning via Sturges' rule: k = ⌈log₂(n) + 1⌉.

Pearson Correlation

Pairwise correlation coefficients for all numeric column pairs: r = Σ(xᵢ−x̄)(yᵢ−ȳ) / √(Σ(xᵢ−x̄)² · Σ(yᵢ−ȳ)²). Rendered as a heatmap with diverging color scale.

Outlier Detection

IQR method: any value below Q1 − 1.5×IQR or above Q3 + 1.5×IQR is flagged. Reports count, direction (high/low), and severity as multiples of IQR beyond the fence.

Cardinality Analysis

For categorical columns: unique value count, mode, frequency distribution, and entropy H = −Σ pᵢ log₂(pᵢ) to measure information content.

Data Quality Score

Composite metric from completeness (non-null ratio), consistency (type uniformity), and uniqueness. Weighted: 50% completeness, 30% consistency, 20% uniqueness.

Narrative Generation

The narrative is template-driven, not AI-generated. Datum uses conditional logic over computed statistics to select sentence fragments and compose paragraphs. For example: if skewness > 1.0, it writes "heavily right-skewed"; if the strongest correlation exceeds |0.7|, it highlights the pair. The writing is deterministic — the same dataset always produces the same report.

Why No AI?

This is a deliberate constraint. Every number in the report is mathematically verifiable. There are no hallucinations, no probabilistic summaries, no model weights. The formulas are classical statistics — the same math used in R, SciPy, and academic papers for decades. The constraint proves that meaningful data storytelling doesn't require LLMs — it requires thoughtful computation and good writing templates.

Lineage