Charles Murray Releasing Open Source Databases for Social Science Research

johnwalker · 14 April 2022 11:55

Charles Murray of the American Enterprise Institute has announced “Data Tools 1: Deciphering the Location of Respondents in the American Community Survey” [PDF], the first in a series of databases he has personally curated over the years and used in research for his books. He notes:

For more than 50 years, I have loved to prepare and explore big databases. It has felt closer to a vocation than a profession. Don’t ask me why. The “aha!” moments are vanishingly rare, lost among the days, weeks, and months spent on tasks that meet every definition of “tedious.” But I long ago stopped farming out these tedious tasks to research assistants because I realized that, for me, they were fun and that giving those tasks to research assistants was reducing the satisfaction I found in my work.

The oddest part of this odd vocation is that in recent years I have become as absorbed in the problems of preparing databases for analysis as in exploring the data’s implications. Most people have a rough understanding of what the word “exploring” means when it comes to data analysis. But unless you work directly with databases, you are unlikely to realize how much of the iceberg is below the surface. To give you an idea, preparing the seven new variables in the data file I am about to describe took around 300 hours of work.

Typically, all this preparation is used by a researcher or team of researchers for a single study and never used again. Sometimes this is unavoidable because a database is so specific to a particular issue that it has little utility for anyone else. But over the years, I have prepared databases that have potential for exploring many topics that I did not. I am thinking especially of the databases I prepared for Human Accomplishment, Coming Apart, and Facing Reality, plus a few others that I assembled but never used for published work.

Over the next year or so, I plan to remedy this situation, sharing databases that I hope can be useful to others. But I will also post some new data files that can inform entire classes of analyses—hence the title of the series, “Data Tools.”

Here is the PUMA-descriptors database posted on GitHub.

Devereaux · 14 April 2022 15:49

I like Murray and his thinking.

gms · 15 April 2022 20:51

Charles Murray is an amazing thinker with a strong grasp on the reality of the world we live in. If you haven’t watched it, I recommend this June 2021 dialogue with Glenn Loury.

jabowery · 10 March 2023 22:44

I wonder if in “Human Accomplishment”, Charles Murray addressed the “regression discontinuity” in Nobel prize winners over time that Paul Hundred, GED, rather humorously notes.

jabowery · 10 March 2023 23:44

Murray’s GitHub database may be a good addition to my Laboratory of the Counties collection. Murray’s, based on PUMA aggregation rather than county aggregation, is more in line with the trend toward “privacy protection”. I did attempt to get someone associated with the AEI interested in backing a lossless compression prize based on the Laboratory of the Counties as a way of entrapping the social pseudo-scientists into a fair contest, but there is a slight problem with a fair contest in sociology from the AEI’s point of view: it might say things their underwriters don’t like. In fact, I’m almost certain it would do so.

I will note, with some cynicism, that the privacy of individuals being most-protected is that of outliers in their counties – meaning that they likely wouldn’t be in those counties were it not for government enforced “diversity” overriding the consent of the local military-aged men.

I also note that Murray basically says what Charlie Smith told me about founding the DoE’s EIA: That the vast majority of the work is in “data cleaning” or what Murray calls “preparation”, and the minority in actual analysis (including Tukey-style “exploration”). What I told Charlie about compression prizes is that the time and effort going into “data cleaning” can be subsumed by simply letting people model the “cleaning” process. There is always the difficulty of dealing with “meta data” and other aspects of provenance in the “interpretation” of data that includes source bias. You may as well lay your cards on the table in the form of algorithms that “clean” rather than leaving it up to the endless sophistry of “academic discourse”.

jdougan · 12 March 2023 01:04

@Shalmaneser What are other databases like the ones listed here?

Shalmaneser · 12 March 2023 01:04

There are many databases available for various topics and disciplines. Some examples include:

World Bank Open Data: A collection of economic and social data from around the world.
Bureau of Labor Statistics: Provides data on employment, wages, productivity, and more in the US.
National Center for Education Statistics: A database on education in the US.
National Institutes of Health: Provides access to extensive biomedical and health-related research.
United Nations Data: Allows access to data on economic and social development indicators for countries worldwide.
CDC Wonder: Provides access to public health data in the US.
OECD Data: OECD’s statistics portal provides users with a wide range of data on global issues.
Statistical Abstract of the United States: A comprehensive database of US statistics on population, health, education, and more.
Eurostat: Provides data and statistics on various topics for the European Union and its member states.
Data.gov: Provides access to US government data in a variety of subject areas, including health, energy, and environment.

There are many other databases available on specific topics and disciplines, and many are publicly accessible.

jdougan · 12 March 2023 01:05

I want less well-known databases, preferably counter-narrative.

Shalmaneser · 12 March 2023 01:05

Here are some less well-known databases that may be of interest and offer counter-narrative perspectives:

The Sentencing Project: Provides data and research on issues related to criminal justice reform and advocacy.
Media Cloud: A project that tracks media coverage and helps analyze trends and relationships in news coverage.
Database of Political Institutions: Provides data on political institutions, elections, and political parties from around the world.
Global Terrorism Database: Provides information on terrorist events that have occurred around the world since 1970.
The COVID Tracking Project: Provides data on the spread and impact of COVID-19 across the United States.
OpenSecrets.org: Tracks money in politics and provides data on political contributions and lobbying efforts in the US.
Mapping Police Violence: Provides data on police violence and use of force in the US.
The World Inequality Database: Provides data on trends in income and wealth inequality from around the world.
The Humanitarian Data Exchange: An open platform for humanitarian data sharing, allowing access to data on humanitarian crises.
The Environmental Justice Atlas: Provides data on environmental justice issues around the world, including resource extraction, pollution, and climate change.

These databases may offer perspectives and data that are not often highlighted in mainstream sources, and may provide valuable insights for research and analysis.

jdougan · 12 March 2023 01:15

I want very obscure databases, preferably counter-narrative.

Shalmaneser · 12 March 2023 01:16

It depends on what you mean by “counter-narrative”, but here are a few suggestions for obscure databases:

The Internet Archive’s Wayback Machine - an online archive of web pages dating back to 1996. It can be useful for researching websites that may have changed or been removed since their creation.
The Bureau of Justice Statistics provides crime statistics on a variety of topics, including victimization, arrest data, and court statistics.
The Global Terrorism Database, a project of the National Consortium for the Study of Terrorism and Responses to Terrorism (START), provides detailed information on terrorist attacks from around the world since 1970.
The National UFO Reporting Center, a database of UFO sightings reported by witnesses, dating back to the early 20th century.
The Dark Web Database, a database of websites that can only be accessed using the Tor browser, where illegal activity such as drug trafficking, human trafficking, and other cybercrimes are reported.

It is important to note that not all “counter-narrative” databases may have a factual basis, so it is important to evaluate the sources and evidence supporting the information presented.

jabowery · 12 March 2023 15:59

Try IQ and The Wealth of Nations or contact the authors directly. I believe Richard Lynn is still with us but he was born in 1930 so he may pass any day. Steve Sailer is another decent resource for anti-narrative data from obscure sources. Another is Emil O. W. Kirkegaard. But we’re all turning gray while the “narrative” gets automated with nightmarish demonology like ChatGPT.

See, this is why I keep trying to get people in the Realist camp to stop all their palavering – including such landmark books as the aforelinked – and start accumulating a canonical dataset subject to the lossless compression criterion incentivized by incremental prizes for improvements thereof:

If you don’t have a clear benchmark metric for Reality – in the form of a unified macrosocial model driven by data – that can be advanced with each new statistical study, and monetary incentives for those advancements, you’re just playing Sisyphus to the social pseudosciences boulder.

johnwalker · 12 March 2023 16:42

My own personal database, which I have used for numerous purposes including preparing the counter-narrative documents:

“Global IQ: 1950–2050”
“Islam and Political Freedom”
“Clash of Ideologies: Communism, Islam, and the West”

may be downloaded from “Cfacts 2021”, which is a zipped archive containing the spreadsheet in OpenDocument (.ods) and CSV (.csv) formats. The spreadsheet includes the following data from these sources.

                Country Properties Database

Data Sources:

    CIA-WF      CIA World Factbook 2003
                    http://www.cia.gov/cia/publications/factbook/

    EF          Heritage Foundation "2004 Index of Economic Freedom"
                    http://cf.heritage.org/index2004test/

    FH          Freedom House "Freedom in the World 2015"
                    https://freedomhouse.org/report-types/freedom-world
                    https://freedomhouse.org/sites/default/files/Individual%20Country%20Ratings%20and%20Status%2C%201973-2015%20%28FINAL%29.xls
                    https://freedomhouse.org/sites/default/files/Individual%20Territory%20Ratings%20and%20Status%2C%201973-2015%20%28final%29.xls

    FSI         The Fund for Peace. Failed States Index 2006.
                    http://www.fundforpeace.org/programs/fsi/fsindex.php

    GPI         Vision of Humanity. Global Peace Index 2007.
                    http://www.visionofhumanity.com/introduction/index.php

    IDB         U.S. Bureau of the Census International Data Base 2003
                    http://www.census.gov/ipc/www/idbnew.html

    IPRI        International Property Rights Index
                    http://internationalpropertyrightsindex.org/index.php

    IQWN        Lynn, Richard and Tatu Vanhanen. IQ and the Wealth of Nations.
                Westport, CT: Praeger, 2002. ISBN 0-275-97510-X.
                    http://www.rlynn.co.uk/pages/article_intelligence/t4.htm

    PEW_RPL     Pew Research Center, Religion & Public Life,
                Table: Muslim Population by Country
                    http://www.pewforum.org/2011/01/27/table-muslim-population-by-country/

    PNM         Barnett, Thomas P. M. The Pentagon's New Map.
                New York: G.P. Putnam's Sons, 2004. ISBN 0-399-15175-3.

    WPR         World Population Review, “Murder Rate by Country 2021”
                    https://worldpopulationreview.com/country-rankings/murder-rate-by-country

    SAS         Small Arms Survey, 2017
                    http://www.smallarmssurvey.org/fileadmin/docs/Weapons_and_Markets/Tools/Firearms_holdings/SAS-BP-Civilian-held-firearms-annexe.pdf

Fields:

    name                    Country name                                      CIA-WF
    population              Most recent population                            CIA-WF
    population_growth_rate  Population growth rate per annum                  CIA-WF
    lifexp                  Life expectancy, total population                 CIA-WF
    fertility               Children born per woman                           CIA-WF
    median_age              Median age, total population                      CIA-WF
    gdp                     Gross domestic product, purchasing power          CIA-WF
                            parity (PPP)
    gdp_growth_rate         GDP growth rate per annum                         CIA-WF
    gdp_per_capita          GDP per capita (PPP)                              CIA-WF
    iq                      Mean IQ (positive if measured,                    IQWN
                            negative if estimated)
    gini                    Income distribution (Gini index)                  CIA-WF
    region                  Geographical region                               CIA-WF
    FH-pr                   Political rights index (1-7)                      FH
    FH-cl                   Civil liberties index (1-7)                       FH
    FH-free                 Composite freedom status                          FH
                            "Free", "Partly Free", "Not Free"
    EF-index                Economic freedom score (1-5)                      EF
    EF-category             Economic freedom category                         EF
                            "Free", "Mostly Free",
                            "Mostly Unfree", "Repressed"
    EF-trade                Trade policy (1-5)                                EF
    EF-fiscal               Fiscal burden (1-5)                               EF
    EF-intervention         Government intervention (1-5)                     EF
    EF-monetary             Monetary policy (1-5)                             EF
    EF-investment           Foreign investment (1-5)                          EF
    EF-banking              Banking and Finance (1-5)                         EF
    EF-wages                Wages and Prices (1-5)                            EF
    EF-property             Property rights (1-5)                             EF
    EF-regulation           Regulation (1-5)                                  EF
    EF-informal             Informal market (1-5)                             EF
    core_gap                Core ("C") or Gap ("G")?                          PNM
    FSI-rank                Global rank                                       FSI
    FSI-demo                Mounting Demographic Pressures                    FSI
    FSI-movement            Massive Movement of Refugees and IDPs             FSI
    FSI-venge               Legacy of Vengeance-Seeking Group Grievance       FSI
    FSI-flight              Chronic and Sustained Human Flight                FSI
    FSI-uneven              Uneven Economic Development along Group Lines     FSI
    FSI-decline             Sharp and/or Severe Economic Decline              FSI
    FSI-crim                Criminalization or Delegitimization of the State  FSI
    FSI-pubserv             Progressive Deterioration of Public Services      FSI
    FSI-rights              Widespread Violation of Human Rights              FSI
    FSI-secapp              Security Apparatus as "State within a State"      FSI
    FSI-elite               Rise of Factionalized Elites                      FSI
    FSI-interv              Intervention of Other States or External Actors   FSI
    FSI-Total               Total (Sum of FSI-demo through FSI-interv)        FSI
    GPI-score               Global peace index score (low = more peaceful)    GPI
    GPI-rank                Rank by global peace index (low = more peaceful)  GPI
    IPRI-comp               International property rights index composite     IPRI
    IPRI-lp                 Legal and political environment                   IPRI
    IPRI-phys               Physical property rights                          IPRI
    IPRI-intel              Intellectual property rights                      IPRI
    IPRI-ge                 Gender equality                                   IPRI
    percent_muslim          Percent of population Muslim; -0.1 means < 0.1%   PEW_RPL
    murder                  Homicide rate per 100,000 population              WPR
    guns                    Estimated number of civilian guns per capita      SAS
                            by country
    P-yyyy                  Population mid-year yyyy, 1950-2050               IDB
                            (positive if estimated, negative if projected)
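
Several fields in the list above pack a flag into the sign of the number (iq, percent_muslim, and the P-yyyy columns), so a naive numeric read will produce negative IQs and populations. Below is a minimal Python sketch of one way to unpack them. It assumes the CSV column headers match the field names exactly as listed, which has not been checked against the actual archive; the helper names and the two sample rows are invented for illustration and are not real data.

```python
import csv
import io


def decode_signed(raw, positive_label, negative_label):
    """Split a sign-encoded cell into (magnitude, label); blank cells are 'missing'."""
    text = raw.strip()
    if not text:
        return None, "missing"
    value = float(text)
    label = positive_label if value >= 0 else negative_label
    return abs(value), label


def decode_iq(raw):
    # Field list: positive if measured, negative if estimated.
    return decode_signed(raw, "measured", "estimated")


def decode_population(raw):
    # Field list: positive if estimated, negative if projected.
    return decode_signed(raw, "estimated", "projected")


def decode_percent_muslim(raw):
    # Field list: -0.1 is a sentinel meaning "less than 0.1%".
    text = raw.strip()
    if not text:
        return None, "missing"
    value = float(text)
    if value < 0:
        return None, "below 0.1%"
    return value, "reported"


# Made-up rows for illustration only; headers are ASSUMED to match the field list.
SAMPLE_CSV = """name,iq,percent_muslim,P-2000,P-2050
Examplia,98,-0.1,1200000,-1500000
Placeholderland,-84,12.5,800000,-950000
"""


def load_rows(handle):
    """Read the CSV and replace sign-encoded cells with (value, label) pairs."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "name": row["name"],
            "iq": decode_iq(row["iq"]),
            "percent_muslim": decode_percent_muslim(row["percent_muslim"]),
            # Collect every P-yyyy column, keyed by the year string.
            "population": {
                key[2:]: decode_population(cell)
                for key, cell in row.items()
                if key.startswith("P-")
            },
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row)
```

To run this against the real file, replace `io.StringIO(SAMPLE_CSV)` with an open handle on the CSV extracted from the archive, and adjust the header names if they differ from the field list.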

The data date to whenever I collected, extracted, and added them to the database, starting in 2004. I have made no effort to update data in columns to more recent releases. If you wish to use these raw data, it’s up to you to determine their copyright status, obtain permission, and cite accordingly. I can offer no assistance in using these data—you’re entirely on your own.

gms · 11 May 2024 13:31

Click through