About Our Data: Methodology & Sources

Transparency is central to NameAlmanac. This page explains exactly where our data comes from, how we process it, what calculations we perform, and what limitations exist. Every number on this site traces back to the steps described below.

Primary Data Source: The SSA National Baby Names Dataset

Every name, birth count, and ranking on NameAlmanac is derived from one upstream dataset: the Social Security Administration's (SSA) National Baby Names dataset.

The SSA publishes this data as part of its statutory transparency obligations. It is the most comprehensive public record of given names in the United States, drawing on Social Security card applications for births from 1880 onward. No other publicly available dataset approaches its scope or completeness for American naming patterns.

Official source: Social Security Administration — Beyond the Top 1000 Names

We do not supplement the SSA data with any other source. We do not use crowdsourced data, user submissions, hospital records, or third-party estimates. This single-source approach ensures consistency and auditability: any number on NameAlmanac can be verified against the SSA's own published files.

How the SSA Collects Name Data

Understanding the SSA's data collection process is important for interpreting the data correctly:

  1. Birth certificate registration. When a child is born in the United States, the birth is registered with the state's vital records office. Parents provide the child's given name on the birth certificate.
  2. Social Security card application. Parents apply for a Social Security Number (SSN) for their child, typically within the first year of life. The application (Form SS-5) records the child's given name, sex, and date of birth. Since 1987, obtaining an SSN has been required to claim a child as a tax dependent, making coverage near-universal for modern records.
  3. SSA tabulation. The SSA's Office of the Chief Actuary tabulates name-sex-year combinations from all SSN applications and publishes them annually as the National Baby Names dataset. Names are recorded exactly as they appear on applications, preserving spelling variations (e.g., "Caitlin", "Kaitlyn", and "Catelynn" are tracked as separate names).
  4. Privacy filtering. Before publication, the SSA removes any name-sex-year combination with fewer than 5 occurrences to prevent identification of specific individuals.

Data Coverage & Vintage

Our current database covers births from 1880 through 2024 — 145 consecutive years of recorded naming history. The dataset includes:

  • 104,800+ unique name spellings (about 116,500 counting each name separately by sex)
  • 2,140,000+ name-year records (individual name entries across all years)
  • Male and female birth records tracked separately
  • National data plus state-level breakdowns (state data available from 1910)

The dataset reflects the complete SSA publication as of the most recent annual release. The data vintage is displayed in the footer of the site and on individual name pages.

Our Data Extraction Pipeline

We process the raw SSA data through a multi-stage pipeline designed for accuracy and reproducibility:

  1. Download. We retrieve the SSA's national and state-level data files directly from ssa.gov. The national dataset consists of one text file per year (e.g., yob2024.txt), each containing comma-separated records of name, sex, and birth count.
  2. Parse. Each annual file is parsed to extract (name, sex, count) tuples. We validate that every record contains exactly three fields, that counts are positive integers, and that sex codes are limited to "M" and "F".
  3. Aggregate. Individual year records are combined to compute cumulative statistics: total births across all years, peak year and peak count, first and last year of appearance, and decade-level summaries.
  4. Rank. For each year and gender, names are ranked by birth count in descending order. Ties are broken alphabetically (A before Z). This produces the annual popularity ranking displayed on name pages and year pages.
  5. Compute derived metrics. We calculate percentage-of-births (name's count divided by total births for that gender in that year), trend direction indicators, and decade-over-decade changes.
  6. Index and load. The processed data is loaded into a search-optimized SQLite database with indexes on name, year, rank, and state for fast lookups across all dimensions.
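
The parse and aggregate steps above can be sketched roughly as follows. This is a minimal illustration, not our production code: the function names parse_year_file and aggregate are hypothetical, and the sample counts are made up rather than taken from SSA files.

```python
import csv
import io


def parse_year_file(text, year):
    """Parse the text of one yobYYYY.txt-style file into (name, sex, year, count) tuples."""
    records = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"{year} line {line_no}: expected 3 fields, got {len(row)}")
        name, sex, raw_count = row
        if sex not in ("M", "F"):
            raise ValueError(f"{year} line {line_no}: unexpected sex code {sex!r}")
        if not (raw_count.isascii() and raw_count.isdigit()) or int(raw_count) <= 0:
            raise ValueError(f"{year} line {line_no}: count must be a positive integer")
        records.append((name, sex, year, int(raw_count)))
    return records


def aggregate(records):
    """Compute cumulative statistics for each (name, sex) pair."""
    stats = {}
    for name, sex, year, count in records:
        s = stats.setdefault((name, sex), {
            "total": 0, "peak_year": year, "peak_count": 0,
            "first_year": year, "last_year": year, "by_decade": {},
        })
        s["total"] += count
        if count > s["peak_count"]:  # on equal counts, the first year seen is kept
            s["peak_year"], s["peak_count"] = year, count
        s["first_year"] = min(s["first_year"], year)
        s["last_year"] = max(s["last_year"], year)
        decade = year // 10 * 10
        s["by_decade"][decade] = s["by_decade"].get(decade, 0) + count
    return stats


# Made-up counts for illustration only; these are not real SSA figures.
sample_2023 = "Olivia,F,100\nLiam,M,90\n"
sample_2024 = "Olivia,F,120\nLiam,M,80\n"
records = parse_year_file(sample_2023, 2023) + parse_year_file(sample_2024, 2024)
stats = aggregate(records)
print(stats[("Olivia", "F")]["total"], stats[("Olivia", "F")]["peak_year"])
```

Raising on the first malformed record is one possible design; a pipeline could equally collect warnings and continue.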

This entire pipeline is automated and versioned. Each database build produces a build log that records the SSA data vintage used, row counts at each stage, and any validation warnings.
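
As a rough illustration of the load step and the build log, the sketch below writes parsed records into SQLite and stores the data vintage and row count alongside them. The table names (name_year, build_info), index names, and the build_database function are assumptions made for this example, not our actual schema, and only the name and year indexes are shown.

```python
import sqlite3


def build_database(records, vintage, path=":memory:"):
    """Load (name, sex, year, count) tuples into SQLite and record build metadata."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE name_year ("
        " name TEXT NOT NULL, sex TEXT NOT NULL,"
        " year INTEGER NOT NULL, count INTEGER NOT NULL,"
        " PRIMARY KEY (name, sex, year))"
    )
    conn.execute("CREATE TABLE build_info (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
    conn.executemany("INSERT INTO name_year VALUES (?, ?, ?, ?)", records)
    # Indexes for lookups by name and by year (hypothetical names).
    conn.execute("CREATE INDEX idx_name ON name_year (name)")
    conn.execute("CREATE INDEX idx_year_sex_count ON name_year (year, sex, count DESC)")
    loaded = conn.execute("SELECT COUNT(*) FROM name_year").fetchone()[0]
    conn.executemany(
        "INSERT INTO build_info VALUES (?, ?)",
        [("ssa_vintage", str(vintage)), ("rows_loaded", str(loaded))],
    )
    conn.commit()
    return conn


# Made-up records for illustration only.
conn = build_database([("Olivia", "F", 2024, 120), ("Liam", "M", 2024, 80)], vintage=2024)
print(dict(conn.execute("SELECT key, value FROM build_info")))
```

The composite primary key also acts as a safeguard: loading the same name-sex-year combination twice raises an integrity error instead of silently double-counting.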

Ranking Methodology

Popularity rankings are the most-referenced feature on NameAlmanac. Here is exactly how they work:

  • Scope: Rankings are calculated per year, per gender. A name's rank among boys in 2024 is independent of its rank among girls in 2024.
  • Metric: Rank is determined by raw birth count. The name given to the most babies of that gender in that year is rank 1.
  • Tie-breaking: When two or more names have identical birth counts, they are ranked alphabetically (A before Z).
  • Range: Rankings extend to every name in the dataset for that year. There is no artificial cutoff at the top 100 or top 1,000 — if a name appears in the SSA data (meaning it had at least 5 occurrences), it has a rank.
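
The rules above can be sketched in a few lines. The function name rank_names is hypothetical and the counts are made up. The sketch also assumes that tied names receive consecutive ranks in alphabetical order rather than sharing one rank, and that plain string comparison is a sufficient alphabetical order for the names involved.

```python
def rank_names(counts):
    """Rank the names of one year and one gender.

    counts maps name -> birth count. Higher counts rank first;
    equal counts are ordered alphabetically (A before Z).
    """
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {name: rank for rank, (name, _count) in enumerate(ordered, start=1)}


# Made-up counts for illustration only.
ranks = rank_names({"Mia": 10, "Zoe": 12, "Ava": 10})
print(ranks)  # {'Zoe': 1, 'Ava': 2, 'Mia': 3}
```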

Importantly, a rank of 500 in 1900 represents a very different absolute number of births than rank 500 in 2020, because the total number of births has changed dramatically. For meaningful cross-era comparisons, use the percentage-of-births metric instead of raw rank.

Trend & Percentage Calculations

Trend charts on individual name pages show two views of popularity over time:

  • Birth count: The raw number of babies given this name in each year. This is the most intuitive measure, but it is affected by how many babies were born overall: a name with 5,000 births in 2020 represents a smaller share of that year's births than 5,000 births did in 1920.
  • Percentage of births: The name's birth count divided by the total births for that gender in that year, expressed as a percentage. This normalizes for changes in the total number of births and is the better metric for comparing a name's true cultural popularity across different eras.
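
A small sketch of the percentage calculation follows. The function name percent_of_births is hypothetical, and the yearly totals in the example are round made-up numbers chosen only to show why the same count can mean different shares.

```python
def percent_of_births(name_count, total_births):
    """Return a name's share of one gender's births in one year, as a percentage."""
    if total_births <= 0:
        raise ValueError("total_births must be positive")
    return 100.0 * name_count / total_births


# Made-up totals for illustration only: the same 5,000 births is a
# larger share when the year's total is smaller.
print(percent_of_births(5000, 1_000_000))  # 0.5
print(percent_of_births(5000, 2_000_000))  # 0.25
```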

Trend direction indicators (rising, falling, stable) are based on comparing the most recent 5-year average against the preceding 5-year average, with thresholds to avoid flagging noise as meaningful trends.
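
One way such a comparison could be written is sketched below. The 10% threshold, the function name trend_direction, the use of raw counts rather than percentages, and the choice to treat missing (suppressed) years as zero are all assumptions made for this example; they are not the site's actual parameters.

```python
def trend_direction(counts_by_year, latest_year, window=5, threshold=0.10):
    """Compare the latest `window` years against the `window` years before them.

    counts_by_year maps year -> birth count. Years absent from the mapping
    are treated as 0 here, which is a simplifying assumption.
    """
    recent_years = range(latest_year - window + 1, latest_year + 1)
    previous_years = range(latest_year - 2 * window + 1, latest_year - window + 1)
    recent_avg = sum(counts_by_year.get(y, 0) for y in recent_years) / window
    previous_avg = sum(counts_by_year.get(y, 0) for y in previous_years) / window
    if previous_avg == 0:
        return "rising" if recent_avg > 0 else "stable"
    change = (recent_avg - previous_avg) / previous_avg
    if change > threshold:
        return "rising"
    if change < -threshold:
        return "falling"
    return "stable"


# Made-up counts for illustration only: 100 per year, then 150 per year.
counts = {year: 100 for year in range(2015, 2020)}
counts.update({year: 150 for year in range(2020, 2025)})
print(trend_direction(counts, 2024))  # rising
```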

State-Level Data

The SSA publishes state-level name data separately from the national dataset. State records are available from 1910 onward (30 years later than national data), covering all 50 states plus the District of Columbia and U.S. territories.

State-level rankings follow the same methodology as national rankings but are scoped to births within that state. The same 5-occurrence privacy threshold applies at the state level, meaning more names are suppressed in smaller states. A name visible in the national data may not appear in state-level data for less-populated states.

Privacy Threshold & Suppressed Records

The SSA applies a minimum threshold of 5 occurrences before including a name-sex-year combination in their public dataset. This is a privacy protection measure to prevent identification of individuals with extremely rare names.

The practical effects are:

  • Very rare names (1-4 occurrences in a year) do not appear in our database for that year
  • A name may appear in some years but not others if its frequency dropped below 5
  • The sum of all published birth counts for a year is slightly less than the total the SSA has on record for that year, because suppressed records are excluded
  • State-level data has more suppression than national data, because the same name may exceed 5 nationally but not within a single state
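
The toy example below illustrates the last two effects. It mimics the SSA's threshold on invented numbers; NameAlmanac itself only ever sees the already-filtered output. The function name published and the state counts are hypothetical.

```python
PRIVACY_THRESHOLD = 5  # minimum occurrences for a record to be published


def published(counts):
    """Keep only the entries that meet the privacy threshold."""
    return {key: n for key, n in counts.items() if n >= PRIVACY_THRESHOLD}


# Invented true counts for one name, sex, and year, broken down by state.
true_by_state = {"CA": 6, "VT": 3, "WY": 2}
national_total = sum(true_by_state.values())  # 11, so the national record is published

visible_states = published(true_by_state)
print(visible_states)                                # {'CA': 6}
print(national_total - sum(visible_states.values())) # 5 births missing at state level
```

The name is visible nationally with 11 births, yet only one state record survives, so state-level counts cannot be summed to recover the national figure.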

NameAlmanac does not attempt to estimate or fill in suppressed records. Where data does not exist in the SSA files, we show nothing.

Known Limitations

No dataset is perfect. We believe transparency about limitations is as important as the data itself. Known limitations include:

  • SSN coverage gap (pre-1937). The Social Security Act was signed in 1935, and the first Social Security numbers were issued in late 1936. Records for births from 1880 through 1936 are therefore reconstructed from retrospective SSN applications (people who applied for Social Security cards later in life and provided their birth year). These historical records are less complete than modern data, particularly for populations less likely to have applied for Social Security.
  • Privacy suppression. Names with fewer than 5 occurrences in a year are excluded, creating a slight undercount that disproportionately affects rare and newly emerging names.
  • U.S. only. This data covers the United States only. Naming trends in other countries are not represented.
  • Spelling-sensitive. The SSA records names as spelled on applications. "Catherine", "Katherine", "Kathryn", and "Katharine" are all tracked as separate names. We do not merge spelling variants, because doing so would introduce editorial judgment about which names are "the same".
  • Legal given names only. The SSA records the legal given name on the Social Security application. Nicknames, middle names used as first names, and informal names are not captured unless they appear on the SS-5 form.
  • Non-SSN populations excluded. Individuals who never applied for a Social Security number (some immigrants, non-citizens) are not represented. For modern records (post-1987), this gap is very small due to the tax-dependent SSN requirement.

Update Schedule

The SSA publishes new annual data once per year, typically in May following the reference year (e.g., 2024 birth data was released in spring 2025). We update NameAlmanac within a few weeks of each new release.

Between annual releases, the underlying data does not change. The SSA does not revise previously published years. If you notice the data vintage has not updated following an SSA release, please let us know.

Data Integrity & Validation

We take several steps to ensure the data you see on NameAlmanac faithfully represents the SSA's published records:

  • Source verification. Data files are downloaded directly from ssa.gov. We do not use third-party mirrors or redistributions.
  • Automated validation. Our build pipeline checks that year ranges are continuous, that birth count totals match expected magnitudes, and that no records were lost during processing.
  • No interpolation or estimation. Where the SSA has no data (suppressed records, years before 1880), we show nothing. We never fill gaps with estimates.
  • Reproducible builds. Our processing pipeline is deterministic — running it again on the same SSA source files produces an identical database.
  • Version tracking. Each database build is tagged with the SSA data vintage it was built from, enabling traceability from any displayed number back to the source file.
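
Checks of this kind can be sketched as follows. The function names (missing_years, rows_lost, fingerprint) are hypothetical, and hashing a sorted export with SHA-256 is just one way to confirm that two builds from the same source files match; it is offered as an illustration rather than a description of our build tooling.

```python
import hashlib


def missing_years(years_present, first_year, last_year):
    """Return the years in [first_year, last_year] that have no data."""
    return sorted(set(range(first_year, last_year + 1)) - set(years_present))


def rows_lost(rows_parsed, rows_loaded):
    """Return how many rows disappeared between parsing and loading."""
    return rows_parsed - rows_loaded


def fingerprint(records):
    """Hash records in a fixed order so identical inputs give identical digests."""
    digest = hashlib.sha256()
    for name, sex, year, count in sorted(records):
        digest.update(f"{name},{sex},{year},{count}\n".encode("utf-8"))
    return digest.hexdigest()


# Made-up records for illustration only.
build_a = [("Olivia", "F", 2024, 120), ("Liam", "M", 2024, 80)]
build_b = list(reversed(build_a))  # same data, different input order

print(missing_years([1880, 1881, 1883], 1880, 1883))  # [1882]
print(fingerprint(build_a) == fingerprint(build_b))   # True
```

Sorting before hashing makes the fingerprint independent of file read order, which is one common source of accidental non-determinism in build pipelines.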

Corrections & Feedback

If you believe any data on NameAlmanac is inaccurate, or if you have questions about our methodology, please contact us with the specific page URL and the concern. We investigate all data accuracy reports and publish corrections when warranted.

For questions about the underlying SSA dataset itself (e.g., why a name is missing, how the SSA handles name changes), the SSA provides additional documentation at ssa.gov/oact/babynames/background.html.