Steve Ruggles
IPUMS - Integrated Public Use Microdata Series

Overview

Q:
Steve, can you explain the origins of "IPUMS," the Integrated Public Use Microdata Series?

A: Well, the IPUMS really had its beginnings with the Census Bureau public use microdata files. What the Census Bureau found in 1960, after the 1960 census, was that they were getting all kinds of requests from academic researchers, mostly demographers and whatnot, but a few economists, for specialized tabulations to look at their particular research problems. They were getting a little overwhelmed with these, and somebody in the Census Bureau (we have been unable to track down just whose idea it was) got the bright idea of pulling out a one-in-a-thousand sample, stripping off the names and addresses, putting it on punch cards, and sending it around to various universities around the country so that researchers could make their own tabulations. This was just in the earliest period when academic researchers had access to computers. It was a terrific success. It was really the first time that academic researchers had access to this kind of data, and they all of a sudden discovered we can not only make tables, we can do regressions. So it became very successful. Then in 1970, they came out with a vastly larger sample, and in conjunction with the 1970 census, the Census Bureau went back and expanded the 1960 sample from a one-in-a-thousand density to one in a hundred. And they also included about 6% of the population in 1970. So you have a really big sample for 1970. And most importantly they coded the 1960 data exactly the same way as they had in 1970. So all these studies started coming out that used two census years. That was again really important in all areas of the social sciences; it became an important resource. That was the situation when two different people, Sam Preston at the University of Washington and Hal Winsborough at the University of Wisconsin, independently came up with the idea of extending the series backwards.
Winsborough got a big grant to go back to the original microfilm and do large samples of the 1940 and 1950 censuses, taken directly from the original enumerators' manuscripts that they used when they were walking around from house to house. And Sam Preston did the same thing for the 1900 census. So all of a sudden there was this series. Then of course when 1980 came along, that census came out as well. So that was the situation when we started getting into this business. We decided, Russ Menard and I, that we really ought to have samples for the 19th century. So we submitted a proposal to do a sample of the 1880 census. And that was accepted and we did a large sample of 1880. And then it occurred to me that we could do it all. We could go right back to the earliest level census, which was 1850, and go right up to 2000 and fill in the whole thing. And we started applying for money to do that, and we also applied for funds to integrate the samples, because other than 1960 and 1970 every single one of these samples had a completely different coding scheme and different documentation. There was a stack this high of code books for all the different census years that you had to plow through if you wanted to use more than one census year. And our goal, our hidden agenda, was to try to get people to do history and to look at change over time. So we decided to make the censuses compatible with one another, and we did that. That has been really successful. And then what really made it take off was putting everything online and developing an automatic data extraction system so that anybody can pull out any census year's variables and construct new variables and so on if they have access to the Internet. And now, the IPUMS has become very widely used.
We made it available in a real usable version first in 1995, and since then there have been 45 articles, 9 dissertations, and 3 books done, which, considering this is only 1999, is a pretty rapid turnaround. We are now distributing 166 MB of data an hour, 24 hours a day, 7 days a week. We have distributed a total of 3 terabytes of data so far. It's been really amazing, the response that we've had to it.

Theory

Q:
Could you explain the origin of your interest in the African-American family structure and how you got into this project?

A: Well, the study I did on African-American families is really a part of a broader project where I am trying to describe and explain changes in the American family from the middle of the 19th century to the present. One of the big questions in that study is why it is that black families differ so much from white families and how that came to be. The history of it... there has just been more controversy about the black family than about anything else, in my field certainly. And the most vituperative kinds of acrimonious debates. It really has been stunning. It's been a topic where, of all the things I've written, that article was picked up by USA Today and other national media, like 15 different newspapers around the country, and it's really not the most important thing I've done, I don't think. It's just a topic that somehow resonates. It really has its origins in the early 20th century, when a number of scholars, E. Franklin Frazier and others, observed that there were differences in the families of blacks and whites. And in the early '60s, Daniel Patrick Moynihan wrote an internal memo when he was at the White House as an assistant secretary, which argued that the growth of single parenthood among blacks was becoming a national crisis, that it was really hurting the African-American community and the government ought to do something about it. That was met with very sharp attacks from the left, as a lot of African-Americans and a lot of scholars thought that his criticism of... what he was doing was criticizing black families. He used this term "pathology," which a lot of people found offensive.
In the late 1960s and the early 1970s, as historians were just beginning to look into issues of family structure, there were about 8 or 9 articles and one very, very big fat book written by Herbert Gutman, 8 or 9 articles by other historians, that all made the argument that essentially black families were really very much like white families in the late 19th and the early 20th century. And, by implication, that the differences in family structure between blacks and whites were pretty recent. Now, this was supposed to be a criticism of Moynihan. But in fact, Moynihan was saying this was a new problem. I never understood why the historians thought that saying that black families and white families were just alike in the late 19th and early 20th century was a critique of Moynihan. At any rate, what I did was simply to take this series of data and do some basic descriptions of the differences between white and black family structures from 1880 up to 1980 or 1990. I don't remember. Basically I found that for most of that period, black children were twice as likely as white children to live with fewer than two parents, and that there wasn't really much change until 1970 or so, until after Moynihan. So I argued that the differences between black and white family structures were in fact very long-standing. And that you could ascribe them to cultural differences or the disruptions of slavery or various other things. Or in fact, as I now think, which I didn't think at the time when I wrote that, it really is economic differences that are fundamentally responsible for the race difference in family structure.

Q: You begin with the question of how blacks and whites differ in family structure. But then when you move to the more difficult question "Why do they differ the way they do?", you have to introduce other variables. How do you address that problem?

A: The question that I am trying to answer is to what extent economic factors can account for the race difference in family structure between blacks and whites. The problem is that if black children aren't living with their fathers, you don't know what the economic status of that absent father is. We can only identify the socioeconomic status of fathers that are present in the household. What the problem is, then, is how do we figure out the relationship between economic status and living arrangements. What I have done for whites is I have looked at narrow geographical areas, as narrow as I could get them (counties, metropolitan areas, and sometimes states) and tried to figure out for whites what the relationship is between the proportion in poverty, or the proportion with terrible jobs, or various other measures of low socioeconomic status, and the proportion of single parents. And I have done that decade by decade from the 19th century to the late 20th century. The relationship, the geographical relationship, is very close. You can't do this with blacks because there aren't enough of them in our samples to be able to study this at that fine geographical level. But what you can do is take that regression and plug in the characteristics of blacks. You take the geographical relationship between low economic status and single parenthood for whites, and then you say, what would be the predicted family structure of blacks if blacks had that same relationship as whites do? So you can see how much higher single parenthood would be among blacks if blacks had the same relationship as whites do to the socioeconomic factors. And the economic gap between whites and blacks, particularly in the 19th century, is so huge that it easily appears to account for the race difference in family structure.
Indeed, it may turn out in that period that the real question is "How did so many black families manage to hang together given the absolutely dismal economic circumstances they were in?" We have a variable in 1870, which was right after the end of the Civil War of course, on property ownership. We have real property and personal property. The race difference in property ownership is effectively that blacks didn't have any and whites all had some. The best predictor we have of single parenthood in 1870 is whether or not you live in a place where everybody has property. I am coming around to the view that really it may be more economic. It doesn't work, however, when you get past 1960. There are, I think, real differences in family structure that have emerged in the recent period that cannot be entirely, easily ascribed to economic differences. But nonetheless the economic differences certainly account for much or most of the difference between....

Q: Could you explain the way in which you simulated black family structure based on evidence about the nature of white family structure and economic conditions?

A: You just take the means for the blacks in each year and plug them into the regressions and figure out what the predicted percentage residing with single parents would be. Or, in the case of logistic regressions, the predicted probability of a black person with the mean characteristics of blacks being a single parent.
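
As a rough illustration of the procedure described above, the following sketch fits a one-variable regression on area-level figures for a reference group and then plugs in another group's mean. All numbers, names, and the single-predictor setup are made up for illustration; they are assumptions, not the data or the models used in the study.

```python
import math

# Hypothetical area-level data for the reference group:
# (share in poverty, share of children living with a single parent).
reference_areas = [(0.10, 0.08), (0.20, 0.13), (0.30, 0.18), (0.40, 0.23)]


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_linear(intercept, slope, group_mean):
    """Predicted share with a single parent at another group's mean."""
    return intercept + slope * group_mean


def predict_logistic(intercept, slope, group_mean):
    """Predicted probability when the coefficients come from a logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * group_mean)))


intercept, slope = fit_line(reference_areas)

# Hypothetical mean poverty share for the comparison group.
comparison_group_mean = 0.60
predicted_share = predict_linear(intercept, slope, comparison_group_mean)
print(f"predicted share: {predicted_share:.2f}")  # approximately 0.33
```

The predicted share can then be set beside the share actually observed for the comparison group; the smaller the remaining gap, the more of the difference the economic measure accounts for under these assumptions.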

Design

Q:
Please say something about the nature of the data used in previous studies and how your study improves on the data series used to study the African-American family.

A: A lot of it was just a matter of consistency. Before we started making these large samples, historical samples of the nation as a whole, historians in the late 1960s and early 1970s had done a lot of research using census samples from particular communities, almost always urban places, occasionally rural places, because they were interested in urban history I guess. They could make some generalizations about family structures in those particular places, but they never used the same classification methods, or the same exact measures, as the US Census Bureau for the recent period. So it was impossible to make a direct comparison, even for those particular areas, between the late 19th century and the recent period. So what we can do now is make generalizations about the country as a whole, or particular sections of the country, or small population subgroups, that are really very, very consistent from the middle of the 19th century to the late 20th century. Of course we've got less information available for the earlier period, as the earlier censuses don't have nice things like income. But they have some other nice things. There is change in census questions and in coverage. But we can for the first time really look at the long run. All I did in the paper was just the obvious thing, which was to do some basic descriptive measures of change in family structure over the very long run. Very often the obvious thing needs doing.

Q: What are the difficulties of using census data on a question like this over such a long period of time?

A:
It is in the definition of what households are. Is this what you mean? For example, in the 19th century up through 1930, it was the group of people who ate together, that shared a common table, or were employees, servants that were actually working for them. After that you distinguished a separate household because of cooking facilities: every unit that had its own cooking facilities. Now it's based on whether you have a separate entrance; i.e., you don't have to walk through somebody else's living quarters. There are little differences. That probably doesn't make any difference when you are talking about how many kids are living with their parents. It is a little bit of a concern when you are studying boarding and lodging, whether or not this would've been counted as a separate household in one period but not in another. But it doesn't affect family stuff. They are going to be in the same household in all periods. Then there is another little problem. In this study I started in 1880, which was the year they started asking the relationship of each individual in the household to some reference person, like the head of household: head, wife, son, daughter and so on. That makes it easy to sort out the family relationships, although it's complicated when you have a grandchild and two married daughters. Sometimes you need to do some fancy stuff to figure out which daughter that grandchild goes with and that sort of thing. But before 1880, you don't have family relationships; all you have is age, sex, surname, and sequence in the household. But even there you can figure out most family relationships without too much trouble. There isn't that much of a problem with comparability. I guess that you get better quality as time goes on. The estimates are more precise… everything gets more reliable. Again, when we are looking at big picture stuff it doesn't matter.

Q: What kind of data would you like to add to the beginning of the time series? The ASR piece begins with the 1880 census.

A: If I was doing it now I would go back to 1850 or 1870, because that was the earliest data we have on the black population.

Q: What data are available for the period before the Civil War especially on black families?

A: We have slave schedules. We are entering slave schedules for 1860 right now. But they are essentially useless for family structure. The southerners were so concerned about anything that could be used to highlight the conditions of slavery that they were successful in sharply limiting the questions that were asked of slaves. So we have age, and we have first name, of course many slaves only had first names, and in no particular order, plantation. It can't really be used for family structure.

Measurement

Q:
It appears that the social context that shapes our interpretation of census data changed over time. Could you say a little about the problems of using data under those circumstances?

A: Yeah, even now I think that single mothers residing in households with their mothers effectively, it's about something like 25-30%. It's substantial. And it's going down all the time. It's big enough that even now female headship is not a good way to get at this problem. If you are interested in whether there is a dad around for the kids, if that's the policy issue, you aren't going to get at it by the percentage of households that are headed by women.

Q:
How is a social scientist to know that the social context is influencing what you are capturing with your measure?

A: I don't know. In the case of family structure, and it's probably true with other measures as well, but family structure is something that you can measure in an almost infinite number of ways. You've got a configuration of people living together. You can look at any aspect of it. There is no classification scheme you can use to classify families into all possible types and get the meaning out of it. I think that you just have to use sound demographic principles and try to figure out, in all cases, what the question is that you are really trying to ask and what the population at risk of this behavior is. What numerator am I interested in and what denominator am I really interested in, and narrow it down to that. And don't just go by what's easy to tabulate. Nowadays you can manipulate the data any way you want. A lot of what we're doing is based on the fact that 25-30 years ago it was very, very hard to do manipulations of these data. These are hierarchical files. A hierarchical file is one where you've got a household record followed by a series of person records. You know the relationships among everybody in the person records and you're constructing a new variable which is describing some aspect of the group of people that is living together. So you are dealing with multiple records, constructing new variables. That used to be hard to do. It's not that hard anymore, but we're kind of stuck in the old ways of doing things that were developed when basically the computer technology was a lot less advanced. When you are dealing with something like survey data, where you've got one observation per person and you are asking each individual a whole bunch of questions, you don't run into these types of issues because it's kind of straightforward. You know what the unit of observation is, it's the respondent. You know how to measure things, that's the questions you have on your survey. But this is a little bit trickier I think.
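
To make the idea of a hierarchical file concrete, here is a minimal sketch that reads household records followed by person records and builds a rectangular, one-row-per-person table with constructed household-level variables. The comma-separated layout, the "H"/"P" record-type codes, and every field name are hypothetical, chosen only for illustration; they are not the actual IPUMS file format.

```python
# Hypothetical layout: each line starts with a record type,
# "H" for a household record or "P" for a person record.
RAW_LINES = [
    "H,1001,MN",
    "P,1,head,45",
    "P,2,wife,43",
    "P,3,son,12",
    "H,1002,WI",
    "P,1,head,30",
]


def read_households(lines):
    """Group each run of person records under the household record before it."""
    households = []
    for line in lines:
        fields = line.split(",")
        if fields[0] == "H":
            households.append(
                {"hh_id": fields[1], "state": fields[2], "persons": []}
            )
        elif fields[0] == "P":
            if not households:
                raise ValueError("person record before any household record")
            households[-1]["persons"].append(
                {"position": int(fields[1]), "relate": fields[2], "age": int(fields[3])}
            )
    return households


def rectangularize(households):
    """Emit one row per person, carrying household fields and new group-level variables."""
    rows = []
    for hh in households:
        hh_size = len(hh["persons"])
        n_under_18 = sum(1 for p in hh["persons"] if p["age"] < 18)
        for person in hh["persons"]:
            rows.append(
                {
                    "hh_id": hh["hh_id"],
                    "state": hh["state"],
                    **person,
                    "hh_size": hh_size,
                    "n_under_18": n_under_18,
                }
            )
    return rows


rows = rectangularize(read_households(RAW_LINES))
for row in rows:
    print(row)
```

The constructed variables (here `hh_size` and `n_under_18`) describe the group of people living together, which is exactly the kind of measure that requires looking across multiple records at once.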

Q:
How do you move from the hierarchical data structure in which the IPUMS data is first found to the rectangular, normal data matrix that you use for the analysis?

A: One of the things that makes the IPUMS really popular is it has the helper variables that make it easy for you to construct your own measures of family and household composition. In this case, it's got variables in it that are pointer variables to tell you the location within the household of each individual's own mother and their own father. It is contained on each individual record. So that, for example, you have a household that is head, wife, son… on the son there would be a thing saying that the father is in location 1 and the mother is in location 2. So you always know whether they have mothers and fathers present. So in the case of the IPUMS, all you do with this kind of measure, it's easier than any kind of measure really, you select the kids: how many of them have a mother present, how many of them a father present, how many don't have either? Because if the pointer is set to zero that means that there's no parent in the household. So that's possible because we've done the work to make those pointers.
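
A small sketch of the tabulation just described, using the head, wife, son household from the answer plus one made-up single-mother household. The field names `mom_loc` and `pop_loc` and the age cutoff are hypothetical stand-ins for the pointer variables and the definition of a child; zero means no parent of that kind is present.

```python
# Person records for two hypothetical households. Pointers give the
# position of the person's own mother and father within the household.
two_parent_household = [
    {"position": 1, "relate": "head", "age": 45, "mom_loc": 0, "pop_loc": 0},
    {"position": 2, "relate": "wife", "age": 43, "mom_loc": 0, "pop_loc": 0},
    {"position": 3, "relate": "son", "age": 12, "mom_loc": 2, "pop_loc": 1},
]
single_mother_household = [
    {"position": 1, "relate": "head", "age": 30, "mom_loc": 0, "pop_loc": 0},
    {"position": 2, "relate": "daughter", "age": 6, "mom_loc": 1, "pop_loc": 0},
]


def classify_child(person):
    """Classify living arrangements from the two parental pointers."""
    has_mother = person["mom_loc"] > 0
    has_father = person["pop_loc"] > 0
    if has_mother and has_father:
        return "both parents"
    if has_mother:
        return "mother only"
    if has_father:
        return "father only"
    return "neither parent"


def tabulate_children(persons, max_age=17):
    """Select the children and count them by parental presence."""
    counts = {}
    for person in persons:
        if person["age"] <= max_age:
            category = classify_child(person)
            counts[category] = counts.get(category, 0) + 1
    return counts


counts = tabulate_children(two_parent_household + single_mother_household)
print(counts)
```

Because the unit being counted is the child rather than the household, a child living with a single mother in a grandparent's household would still be counted as "mother only" under this scheme.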

Q:
How are the characteristics of the family or the parents associated with the child which is the unit of analysis for this paper?

A: If you are interested in parental income, you use those link variables. In a statistical package like SPSS or SAS or Stata (any one of those will do it very easily), you can use those pointer variables to link characteristics of the parents to the kids' records. You make a new variable for the child, which is something like father's income or mother's income or whatever it might be.
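
The linking step can be sketched in a few lines. This is an illustration in Python rather than in one of the packages named above, and the field names (`pop_loc`, `mom_loc`, `income`) and values are hypothetical.

```python
# One hypothetical household; pointers refer to positions within it.
household = [
    {"position": 1, "relate": "head", "income": 900, "mom_loc": 0, "pop_loc": 0},
    {"position": 2, "relate": "wife", "income": 300, "mom_loc": 0, "pop_loc": 0},
    {"position": 3, "relate": "son", "income": 0, "mom_loc": 2, "pop_loc": 1},
]


def attach_parent_value(persons, pointer, source, target):
    """Copy a parent's characteristic onto each person's own record.

    `pointer` names the pointer variable to follow, `source` is the
    parent's field to copy, and `target` is the new variable created
    on the person record (None when no such parent is present).
    """
    by_position = {p["position"]: p for p in persons}
    for person in persons:
        parent = by_position.get(person[pointer])  # position 0 never matches
        person[target] = parent[source] if parent is not None else None
    return persons


attach_parent_value(household, "pop_loc", "income", "father_income")
attach_parent_value(household, "mom_loc", "income", "mother_income")

son = household[2]
print(son["father_income"], son["mother_income"])
```

The lookup is done within a single household, so the same function would be applied household by household across a full file.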

Data Analysis

Q:
Steve, could you explain the importance of the new technologies for analyzing data for your research?

A: Well, the technology change is just amazing. Just in the period we've been working on it. When we submitted, in 1992, the proposal to do the IPUMS, we included in there $25,000 to pay for the magnetic tapes to ship the whole thing to ICPSR once we were done. We had enough money to have tapes that would hold one copy of the raw data and one copy of the finished product. The database is 25GB, which is substantial if you are talking about those 9-track tapes. Of course we never used a single tape. We didn't have to buy one of those; we saved $25,000, which turned into an RA-ship. We've sent all the data thousands of times over, around the world, over the Internet. Of course in 1960 the reason they had a one-in-a-thousand sample, well, a one-in-a-thousand sample was 180,000 records, which is what, 90 boxes of punch cards? 90 boxes are this big. You couldn't get too much bigger than that and still have it feasible to analyze the stuff. So the real technological innovation between '60 and '70 was the development of the high-density magnetic tapes, which allowed one to fit the whole 1970 sample on just something like 150 magnetic tapes, even though it was a 6% sample instead of one in a thousand of the population. And obviously, even as recently as 1990, the notion of running through an entire one of the recent large samples, the 5% sample for the country as a whole, people thought you couldn't do it. That there just weren't the computing resources; it would be vastly too expensive to pull out all of the American Indians from the country as a whole. Of course what you did was you selected the states where you would be most likely to find American Indians; you wouldn't look at the whole country because it would cost you thousands of dollars. Now you can do it on a PC. It's effectively free. So the ability of researchers to handle large data sets is absolutely brand new. I mean it's the last three, four years where it's become effectively free.

Q: Where did the label "King of Quant" come from?

A: Wired magazine did an article. They entitled it the King of Quant. That was fun.

Q: Where does your work on demography and quantitative analysis fit into the discipline of history?

A: Well it's really, it's been kind of sad for us. When I entered the field of history... I had a really tough time deciding whether I was going to go into history or whether I would go into demography. I had been taking a lot of demography and I was really interested in demography. But I really didn't like sociology all that much, quite frankly. You usually have to do demography in a sociology department and I just didn't really want to do that. But I eventually decided to go to the history department. At that time it was exciting in history departments. At that time people were doing quantitative history all over the place; history of the family was this booming field. It was a very exciting time, but very shortly after I arrived in graduate school, things changed radically. Quantitative analysis in history became extremely unfashionable. Particularly in the mid-80s. Almost nobody continued to do it except for myself and my colleague Russ Menard. And my other colleague Bob McCaa. We were a very lonely group for a long time. Now it's beginning to come back and I am very encouraged by the fact that some of my students are getting jobs and whatnot. Still the audience for my work is primarily not in my own field, and it's still a little bit isolated. It is also very unusual for history departments to have large amounts of grant money as well. But it's a little bit difficult to run projects that are millions of dollars a year out of the history department because the infrastructure isn't there. They aren't used to it, nobody has done it before and they don't want to do grant administration. That is another little bit of a difficulty.

Interpretation

Q:
Why did you decide to publish your piece in the American Sociological Review?

A: Well, I don't know, I thought it might have some interest among sociologists.

Q: But here you are an historian. Would it have been most natural to submit a piece to a historical journal?

A:
Well, at the same time I did submit one to the American Historical Review and that was one that would have more interest for historians. I do publish... the outlets I publish most in are probably more demographic than historical. The ASR one is the only one that I have done in a sociology journal that's not primarily demographic in focus. But I think that in recent years, historians have not been as interested in my field as demographers have. And my biggest audience is demographers.

Q: What kind of reaction did you get to the publication of the study in the American Sociological Review?

A:
It did get a lot of press. It was first picked up by this reporter at the Star Tribune. So there was this big front-page article in the Star Tribune about how black families have always been different from white families. And from there, USA Today picked it up and it got on the AP wire service, and so it was in seventeen papers around the country. Articles about this, all more or less the same article. Moynihan got a hold of the paper and wrote me a long letter claiming that he had been completely misrepresented over the years and he didn't think I was quite as bad, but still I was completely wrong and nobody understood him.

Q: In what way did Moynihan feel that he was misunderstood in his previous work?

A: He said that he never said that there was a tangle of pathology and that quote was taken out of context. He really never meant that there was anything pathological about the black family. I could certainly see... he did get a little more... a lot of the criticism has certainly been unfair. But on the other hand.... Well, I don't know. He is a little bit trying to rewrite history.

Q: Could you say something about the future of your project? What kind of data would you like to add to IPUMS?

A: The biggest weakness of our database for the study of the United States is the fact that it ends in 1990. That is after all now 9 years old. For lots of policy questions and current problems, well, it's historical. The first thing that we want to do is bring things up to date by incorporating the 2000 census. And after 2000, the Census Bureau is abandoning the long form of the census. The census will no longer include detailed questions on income, housing conditions, etc. They are replacing it with the American Community Survey, which will be a survey of 3% of Americans every year. And so we want to incorporate or tag the American Community Survey onto our database so as to make it always up to date. The other thing we want to do is add in data from the Current Population Surveys that are available monthly from 1964 to the present. I think those things will give the IPUMS a much more contemporary focus and make it much more useful for really studying more pressing problems. Right now it's the best way to do long-term change. It's the only way to study long-term change. It would be nice if you could go a little bit closer to the present. The other thing that we're exploring is international data. There is lots of microdata around the world that's sitting on 7-track and 9-track tapes in computer rooms of central statistical offices all over the world, and basically becoming unreadable. And we want to rescue that before it does become unreadable. See if we can... the first thing is to get it safely put onto CDs and stuck under those sandstone cliffs under the Mississippi River so that they would be preserved for future generations. But then to begin to try to incorporate these data from all over the world into the same kind of consistent format that we have used for the American data. So that people can begin to do cross-national studies as well as cross-temporal studies.
