Research vertical · health-AI

EgorGenom

Rare-disease genomics within the health-AI research vertical. This is research held to clinical-grade standards, not a commercial product.

What it is

EgorGenom is a parent-driven whole-genome sequencing (WGS) re-analysis for a child with a suspected genetic disease, run in parallel with the clinical team. It is, in effect, a home bioinformatics lab built to the same standard a clinical genomics service would use — assembled to give one child the depth of scrutiny that a high-throughput pipeline rarely affords a single case.

The stack

This is not a hobbyist setup: it is the industry-standard, open-source genomics toolchain that clinical labs and national sequencing centres run, assembled end to end and pinned for reproducibility.

Layer	Tools
Orchestration	`nf-core/sarek` 3.5.1 and `nf-core/raredisease` (Nextflow) — two independent pipelines for cross-validation.
Alignment	BWA-MEM2 (independent re-alignment from raw FASTQ; ~18 GB CRAM).
SNV / indel calling	DeepVariant, GATK4 HaplotypeCaller, Strelka2, bcftools — multi-caller consensus, not a single caller.
Structural / CNV	Manta, CNVkit, depth-based CNV screen.
Repeat expansions (STR)	ExpansionHunter + REViewer.
Pseudogene / paralog	Gauchian (GBA1), Paraphase — regions standard pipelines silently miss.
Pharmacogenomics · mtDNA	PharmCAT; dedicated mitochondrial-DNA analysis.
Annotation & pathogenicity	VEP 112 with ClinVar, SpliceAI, CADD, REVEL, AlphaMissense, LOFTEE; gnomAD v4.1 population frequencies.
Triage	A 4-tier clinical triage with automated ACMG-criteria support.
Evidence synthesis	Live PubMed / ClinVar / OMIM, bioRxiv, Consensus, ClinicalTrials, ChEMBL (via the Bio-Research MCP layer).
Infrastructure	WSL2 Ubuntu 24.04, Docker (containers pinned by digest), SQLite metadata. ~500 GB disk, 32 GB RAM, optional RTX 3070 GPU for DeepVariant.
Reproducibility	End-to-end runbook (FASTQ → clinical report), SHA256 data manifests, FASTQ provenance, and pre-registered, immutable numeric thresholds.
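
The SHA256 data manifests in the Reproducibility row can be illustrated with a short sketch. This is not the project's actual script: the function names (`sha256_of`, `write_manifest`, `verify_manifest`) and the manifest layout are assumptions chosen for illustration. The layout (`<hex digest>`, two spaces, relative path) mirrors what the `sha256sum` utility prints, so a manifest in this form could also be checked with standard tools.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-GB FASTQ/CRAM files are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir: Path, manifest: Path) -> int:
    """Write one '<digest>  <relative path>' line per file, sorted for stable diffs."""
    files = sorted(
        p for p in data_dir.rglob("*")
        if p.is_file() and p.resolve() != manifest.resolve()  # never hash the manifest itself
    )
    lines = [f"{sha256_of(p)}  {p.relative_to(data_dir).as_posix()}" for p in files]
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def verify_manifest(data_dir: Path, manifest: Path) -> list[str]:
    """Return relative paths that are missing or whose digest no longer matches."""
    mismatched = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel = line.split("  ", 1)
        target = data_dir / rel
        if not target.is_file() or sha256_of(target) != expected:
            mismatched.append(rel)
    return mismatched
```

Writing the manifest once, when the raw FASTQ files arrive, and re-verifying before each pipeline run is one way to make sure every re-analysis starts from byte-identical input.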

Scope & depth of work

The analysis ran in phases: a literature audit and a blinded six-expert review panel (clinical geneticist, bioinformatician, paediatric neurologist, disease-specific and statistical-genetics roles); a targeted multi-caller consensus; a full WGS re-call from raw reads (~21.5 h of compute); a gap-closure pass for pseudogenes, repeat expansions and CNVs; and a clinical hand-off package — 82 documents, ~109,000 words, bilingual (EN + RU), with methods, biomarker decision trees and phased recommendations written for the treating physicians.

Headline result: seven independent caller-and-pipeline combinations agreed on every candidate variant, so no candidate depended on a single caller or pipeline. The project's own documentation calls the calling layer "bulletproof".
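
A minimal sketch of what checking agreement across callers can look like, assuming each caller's output has already been normalized (left-aligned, one ALT allele per record, for example with `bcftools norm`). The function names and the toy variants below are hypothetical and are not taken from the project.

```python
from collections import defaultdict

# A variant is keyed by (chromosome, 1-based position, REF allele, ALT allele).
# Assumption: inputs are already normalized; without that, the same indel can be
# written differently by different callers and would fail to match here.
VariantKey = tuple[str, int, str, str]


def parse_vcf_keys(vcf_text: str) -> set[VariantKey]:
    """Extract variant keys from the first five VCF columns, skipping header lines."""
    keys: set[VariantKey] = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        # Naive fallback for multi-allelic records: one key per ALT allele.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def caller_support(calls_by_caller: dict[str, set[VariantKey]]) -> dict[VariantKey, set[str]]:
    """Map each variant to the set of callers that reported it."""
    support: dict[VariantKey, set[str]] = defaultdict(set)
    for caller, keys in calls_by_caller.items():
        for key in keys:
            support[key].add(caller)
    return dict(support)


def unanimous(support: dict[VariantKey, set[str]], n_callers: int) -> set[VariantKey]:
    """Variants reported by every caller."""
    return {key for key, callers in support.items() if len(callers) == n_callers}


# Toy example with made-up variants and three hypothetical call sets.
calls = {
    "caller_a": parse_vcf_keys("#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr2\t200\t.\tC\tT\n"),
    "caller_b": parse_vcf_keys("chr1\t100\t.\tA\tG\nchr2\t200\t.\tC\tT\n"),
    "caller_c": parse_vcf_keys("chr1\t100\t.\tA\tG\n"),
}
support = caller_support(calls)
agreed = unanimous(support, n_callers=len(calls))
disputed = set(support) - agreed  # called by some callers but not all
```

In this toy data `chr1:100 A>G` is unanimous and `chr2:200 C>T` is disputed. Keeping the per-variant caller set, rather than only a count, makes it easy to see which caller missed a disputed call.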

Why a parent-run lab can go deeper

A state diagnostic service processes many patients in a throughput pipeline and gives each case a bounded number of passes. A parent-driven analysis is not on that clock — it can spend far more iterations on one child: re-call the genome from raw reads, close the gaps a standard pipeline misses (pseudogenes, repeat expansions, copy-number variants), and re-triage every candidate against the newest database releases. More callers, more passes, more scrutiny — for a single patient.

What this is equivalent to

The software is free and open-source; what a comparable engagement costs elsewhere is expert time. A clinical WGS re-analysis and interpretation of this depth — independent re-calling, pseudogene resolution, multi-database triage, ACMG curation and a written clinical hand-off — is the kind of work a diagnostic lab or bioinformatics consultancy would typically bill in the thousands to tens of thousands of dollars per case, on top of the sequencing itself. The compute is modest (one workstation, or a few hundred dollars of cloud per genome); the value is in the depth and the number of passes. (Figures are an order-of-magnitude estimate, for context only.)

Finding the variant is only half the work

Across those seven caller-and-pipeline combinations the calling layer was unanimous — the same candidates were found every time. Every meaningful disagreement with the original analysis was on the interpretation layer: disease mechanism, how well a gene fits the child's phenotype, and how the literature and clinical databases should be weighed. That is precisely why this is run alongside — not instead of — the clinical team. A confirmed variant is a starting point; turning it into a diagnosis is a clinical judgement made together with the treating physicians.

Why the repository is private

The repository is private out of clinical sensitivity — it contains identifiable patient data. A de-identification plan governs any future public release of the generic pipeline and methodology, with raw data deposited only to controlled-access archives. We treat patient privacy as a hard constraint, not a preference.

Where Google Cloud fits (planned)

This vertical is where we expect to scale on Google Cloud next. Everything listed here is planned or in evaluation, not yet in production:

Vertex AI — phenotype-to-gene scoring models.
BigQuery — querying population-frequency datasets (gnomAD) at scale.
Cloud Storage — de-identified genomic data and long-read archives.
Document AI — parsing clinical reports and lab PDFs.
