


Internet-Based Research in the Social Science of Religion

by

William Sims Bainbridge

For a decade, social scientists have been aware that much religion-oriented communication takes place on Internet (Hadden and Cowan 2000). During that time, the amount of activity online has increased greatly, and the forms of Internet usage have diversified seemingly without end. Scientists have also discovered new ways to extract data from websites and other Internet-based systems that can benefit researchers interested in religion, even when those systems are not explicitly religious. No longer is the task merely studying the innovative ways people can use Internet for religious purposes. It is now also possible to use Internet-derived data to develop and test general theories of religious behavior that apply offline as well as online.

This paper will describe Internet-based research methods that are cutting-edge, meet reasonable tests of validity and reliability, and are sufficiently practical that students can use them for graduate papers and dissertations at the same time that their professors are preparing professional publications based on them.  The emphasis will be on quantitative methods, but some qualitative methods will also be mentioned, in part to place the quantitative techniques in a wider methodological context, as well as to identify directions in which innovations might be developed.  At the outset, we can identify seven general principles:

1. Internet based research can employ traditional techniques of social-science research, and can adapt those methods in fresh ways.

2. Entirely new valid methodological approaches can also be developed, sometimes with only the most tenuous or metaphoric relations to earlier methods.

3. To maximize both innovativeness and efficiency, collaborations between social scientists and computer scientists are often necessary.

4. Even when working collaboratively with computer scientists, a social scientist needs to develop significant expertise in managing Internet data, including even some programming knowledge, but this is actually not difficult to achieve.

5. Working with existing data collected from Internet, or with new data collected by an innovative online system, will require the social scientist to pay more attention to issues of data management than is common in more traditional contexts.

6. The best results will come from studies that carefully but aggressively address methodological and theoretical issues together, realizing that the most important challenges and opportunities require deep thinking about both, and that insights from one can inform the other.

7. Internet-related technologies and their social applications are in constant flux, so researchers should be looking for new possibilities, and the examples offered here are meant to inspire rather than constrain scientific creativity.

Collaborations between social scientists and computer or information scientists will require both sides to gain appreciation of the other's point of view. Social scientists in particular will need to realize that many of the very best computer scientists conceptualize science very differently, particularly without the same kind of dedication to theory and zeal in comparing competing theoretical positions that social scientists love. One example will suffice, an excellent recent computer science article about religion and information technology: "Re-Placing Faith: Reconsidering the Secular-Religious Use Divide in the United States and Kenya" by Susan P. Wyche, Paul M. Aoki, and Rebecca E. Grinter.

Before we even consider the topic, it is important to note that this is a conference paper, given at CHI 2008 in Florence, Italy. Conferences play an almost totally different role in computer science from the role they play in social science, and CHI is the most prestigious and influential scientific gathering on the relationships between human beings and information technology. It is the annual conference of SIGCHI, the Association for Computing Machinery's special interest group on human factors in computing. Giving a paper at CHI is like getting one published in Social Forces for a sociologist, but the publication is immediate, rather than waiting a year or two as with social science paper journals. A social scientist who wants to collaborate with computer scientists will need to adapt to the rough and rapid, but still seriously reviewed, publication system in computer science.

Another characteristic of this article that requires some adjustment on the part of social scientists is that it seems to have a very practical focus, rather than being motivated by the desire to test abstract theory. Noting the continuing and perhaps increasing significance of religion, and the possibility that secular populations make greater use of information technology, the researchers have carried out a series of studies to understand how information technologies could be better designed to serve the distinctive needs of highly religious people, indeed to serve some of their religious needs (Wyche et al. 2006, 2009a, 2009b). For example, in this study the researchers discovered that religious people often want to remember points that were made in an especially inspirational Sunday church sermon, and so they developed a note-taking system using mobile phone technology to help them accomplish this in a versatile, convenient, and cost effective manner.

A third characteristic of the study is that the investment in varied aspects of the methodology has a very different balance from what we would expect to see in a professional social scientific study. The research team collected data in both Atlanta, Georgia, and Nairobi, Kenya, at great effort, but did so through somewhat unstructured interviews and ethnographic observation with small numbers of individuals. This is standard in the field of human-computer interaction research. The goal is to understand in depth what can be learned from people who act as key native informants and who invest much of their own effort in the study, but without any concern over what fraction of the general population these people represent. Their function is to inspire innovation among the computer scientists, who design new technology through a sort of collaboration with their research subjects.

In the case of this fine study, the result is a contribution not only to knowledge, but even more importantly to the existing store of design ideas from which technologists may draw, and a contribution to the people of faith who will use future information technology designed to serve religious purposes. For computer scientists, theory tends to mean one of two things. First of all, it refers to mathematical theory, typically concerning algorithms and methods of calculation. The criterion of good theory by this definition is that it guides calculations that are both swift and accurate. Second, theory in the human-centered computing area really refers to design principles to guide the creation of new technologies to serve specified human needs. In this case, the computer scientists draw intelligently upon some social science of religion concepts, and they accomplish good ethnography of Kenyan religious and community culture, but in the service of future technologies to benefit religious people, rather than to frame abstract theories about religion.

One more feature of this study deserves mention as background for the present paper, namely that it studies a broader range of information technology than the term "Internet" covers. The people in Atlanta used Internet, but those in Nairobi used cellphones and text messaging over the phones. Technically, Internet refers to a data communication network that uses the TCP/IP protocol, but much of what you can access through Internet is not really native to it and may originally use other technologies. The World Wide Web is a subset of the billions of files reachable over Internet, those formatted with the Hypertext Markup Language (HTML), and within the Web there are many files belonging to the Deep Web that cannot be accessed by search engines because they are behind password protection or other barriers. Just as the Web is a subset of Internet, Internet is a subset of The Net, which comprises all forms of electronic communication. Already barriers are breaking down between traditional electronic media, and the distinctions between radio and podcasts, television and YouTube, telephone and Skype are historical anachronisms. Thus, while this paper will emphasize data that can indeed be accessed over the current Internet, the reader should be alert to the fact that realities and definitions are changing rapidly, and all modes of electronic communication are currently converging.

Here we shall emphasize the usual social-scientific concerns with theory and methods, more than technological results, but remain mindful of the somewhat different priorities of the computer scientists who provide us with the needed technologies. We shall consider different kinds of Internet-based research under six rough headings, arranged from work that is most similar to traditional quantitative social-scientific methodologies, to work that is least similar but still connects directly to the kinds of theories that social scientists have addressed for many years. We begin with online questionnaires, which draw upon a century of survey research traditions, then turn to recommender systems, which are very new but similar in many respects to questionnaires. Geographic data analysis also has a century-old tradition in the social sciences, but new sources of georeferenced data can be found online today. Although everybody is familiar with search engines like Google and Alta Vista, they can be used in a number of unexpected ways to collect data for analysis, and more advanced natural language processing methods build naturally on familiar features of search engines. New areas where old theories can be applied include cultures inside virtual worlds.

 

1. Online Questionnaires

Computers have been used to administer questionnaires for many years, but mass administration online directly to respondents waited until the World Wide Web gained popularity in the mid-1990s. Perhaps the most important traditional application before then was in computer-assisted telephone interviewing. As pioneered by the US Census in 1790, and perhaps much earlier around the beginning of the Common Era when Caesar Augustus sent agents to count the population of the Roman Empire so it could be taxed, interviewers had long asked standardized questions verbally, writing down the responses themselves rather than requiring the respondent to do it. I have seen rough estimates that perhaps ten percent of the adults in the Roman Empire could read and write, so most could not have filled out a paper questionnaire, but we should be mindful of the fact that some people in modern societies cannot do so either, and each technology excludes at least some potential respondents.

Using a computer to do telephone interviewing has several advantages, some of which transfer to online questionnaires. The interviewer reads the questions from the screen, and enters the response with a single keypress or mouse click, or in some cases typing in the word or phrase the respondent speaks. The computer automatically moves to the next question, saving the effort of manually turning a page, and it can jump to contingent questions that might confuse the interviewer and would often confuse respondents if the questionnaire were on paper. A common example is questions about religious affiliation. Are you Catholic, Protestant, Jewish, Other, or None? People who selected "Protestant" are then often asked to define exactly which Protestant denomination they belong to, something one would not bother asking a Roman Catholic.
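To make such branching concrete, here is a minimal sketch of skip logic in Python; the wording and response options are merely illustrative, not drawn from any particular instrument.

# Minimal sketch of contingent-question ("skip") logic; the wording and
# options are illustrative only.
AFFILIATIONS = ["Catholic", "Protestant", "Jewish", "Other", "None"]

def ask_choice(prompt, options):
    """Show numbered options and return the label the respondent picks."""
    for number, label in enumerate(options, start=1):
        print(f"{number}. {label}")
    return options[int(input(prompt + " ")) - 1]

def religious_affiliation_module():
    answers = {"affiliation": ask_choice("Religious preference?", AFFILIATIONS)}
    # The follow-up item appears only for Protestants; a Roman Catholic
    # respondent is routed past it automatically.
    if answers["affiliation"] == "Protestant":
        answers["denomination"] = input("Which Protestant denomination? ")
    return answers

if __name__ == "__main__":
    print(religious_affiliation_module())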

Among the greatest advantages of computer-assisted interviewing is that it skips the often laborious process of entering responses from a paper questionnaire into a computer for analysis. It should be recalled that while computer administration may be relatively new, computer analysis is quite old, arguably dating from Hollerith's work on the 1900 census and even earlier (Bainbridge 2004c).

As Donald Dillman (2002) has noted, Internet-based questionnaire surveys are now one of the important research options, and their disadvantages are somewhat reduced by the increasing difficulty of  getting good samples for telephone surveys.  The chief issue for Internet-based questionnaires is that their data will not be representative of the population as a whole, both because many people – especially in some subgroups – will not have Internet access, and because many people will refuse to answer a questionnaire online when invited to do so.

However, I would argue that conceptualizing online questionnaires in terms of traditional survey research is too limiting.  As I understand the term, survey is not synonymous with questionnaire.  Rather it refers to an attempt to collect new data that are representative of the population of interest.  Conceivably, a survey could be done without asking any questions, for example visiting a random sample of rural homes to visually determine what fraction of them had indoor flush toilets. For at least two reasons, sociologists and political scientists had gotten in the habit of assuming that every proper questionnaire needed to be administered to a random sample.

The first reason was descriptive.  If the goal is to describe a population, then a census is methodologically the best approach, but cost concerns often rule that out.  A simple random sample, if it is large enough, should accurately represent the population.  Furthermore, if nonresponse bias is also random, then it is possible to use statistical techniques to estimate the sampling errors.  Unfortunately, nonresponse biases are not random, and increasing fractions of the population refuse to be surveyed, or simply cannot easily be located.  Face-to-face administration tends to get the highest response rate, but is exceedingly costly.  Thus, a national questionnaire like the General Social Survey will use a cluster sampling technique to minimize interviewer travel costs, and its documentation advises against uncritical application of tests of statistical significance that assume simple random samples.  In his textbooks in research methodology, Earl Babbie (2004) has been advising students that tests of statistical significance are not really appropriate in sociological research, a controversial point but one that clearly highlights the issue.

Descriptive accuracy primarily serves journalistic, political, and policy purposes, rather than scientific ones concerned with discovering and testing general theories.  Polling research earned the social sciences much prestige in the wider world, by offering insights and advice that were credibly based on rigorous, scientific methodology.  Journalists want to be able to say what is happening to "people" in their society, or to the society in general.  Politicians want to know what the electorate thinks about the issues of the day, so they require a random sample of voters – or of those mythical beasts, the "likely voters."  Policy makers similarly need to know what is happening to "the American family" or "the average citizen." 

When the General Social Survey was launched back in 1972, it was an expression of the Social Indicators Movement that hoped to use the GSS to monitor conditions in the United States so that policy makers could adjust government regulations and programs for maximum benefit.  Using questionnaire surveys as social indicators to guide government policy assumes a lot about the way governments and particular political parties actually function, and for most of the years since the birth of the GSS, sociological surveys were simply not a significant part of US government decision-making.

The second reason why representative samples are preferable is related to the fact that social statistics tend to assume simple random samples, but goes a bit deeper than that.  The hope is that simple random samples minimize the possibility that the correlation between two variables is the spurious result of other variables, or that the lack of a correlation results from a real relationship that is masked by some unmeasured suppressor variable.  This is a debatable point, but in practice I suggest that many social scientists take this idea for granted without even noticing it.  Consider a random sample of the United States. Typically, as in the case of the GSS, the sample leaves out "institutionalized" populations, children, Americans living abroad (or in the armed forces), and perhaps the underclass and undocumented immigrants. But even if you could get a true random sample of Americans, you would not have a random sample of human beings. Americans are 5 percent of the world, and the current world is perhaps 5 percent of all the humans who have ever lived. Thus there is a huge selection bias, and crucially for this point, that bias may correlate with variables of interest.

Rather than relying upon a random sample to limit spuriousness and suppression, which it may not really do very well anyway, a better choice is replication. There are really two functionally related ways to accomplish this. External replication means giving a questionnaire to members of very different groups, to see if the results carry over from one to another. Internal replication uses subsamples, or statistical techniques controlling for additional variables, to accomplish the same thing within a single dataset. Under favorable conditions, both of these can be accomplished with online questionnaires, if effort is invested to get a very large and diverse set of respondents, affording many opportunities for internal replication, and if one is prepared to replicate key findings by some other method. Perhaps the most famous pre-Internet example of external questionnaire replication is the Glock and Stark (1966) study that initially surveyed Northern California church members in a very limited geographic area, then subsequently replicated key findings with a national sample.
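As a minimal sketch of internal replication, the following Python fragment recomputes the same correlation within subgroups of a single dataset; the column names and toy values are invented purely to show the mechanics, not drawn from any study cited here.

# Internal replication sketch: recompute one correlation within subgroups
# of a single dataset. Column names and values are invented for illustration.
import pandas as pd

def correlation_by_group(df, var_x, var_y, group_col):
    """Pearson correlation of var_x and var_y within each subgroup."""
    return {group: sub[var_x].corr(sub[var_y])
            for group, sub in df.groupby(group_col)}

toy = pd.DataFrame({
    "attendance": [1, 3, 5, 2, 4, 5, 0, 2],
    "belief":     [2, 3, 5, 1, 4, 4, 1, 3],
    "region":     ["east", "east", "east", "east",
                   "west", "west", "west", "west"],
})
print(correlation_by_group(toy, "attendance", "belief", "region"))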

While there were good reasons for giving high priority to sampling with pre-Internet questionnaires, this inescapably gave lower priorities to other values, notably item quality and topic coverage. In the 1950s and 1960s, much more effort was invested in item-creation than today, especially in development of multi-item and often multi-factor measurement scales. An expensive national survey often cannot afford to include many items on a single topic, and the ones that are included need to be intelligible to everybody. Thus, they are written to a "lowest common denominator" standard, rather than reflecting the complexity and nuance of theoretical debates in the social sciences of religion.

If the aim is to study a small subgroup of the population, such as atheists, then one will need either a huge sample, or a carefully targeted one, each of which might be achieved over Internet (Bainbridge 2005). Cost considerations and the fact that the average person has no opinion on many of the topics of interest to social science also militate against research on a wide range of topics that are relevant only to subgroups within the population. This is especially worrisome when the research concerns social and cultural change, because many new phenomena will be unknown to the majority of respondents in a random sample of the general population. Online questionnaires can address these issues in a number of ways, beginning with where the items come from in the first place.

At a first approximation, the material for questionnaire items can come from two very different sources: (1) existing theory expressed in the publications of social scientists, or (2) the experiences, beliefs, and behavior of the non-scientists we wish to study. In general, I do not favor survey researchers writing items out of their own imaginations, as they sit in their academic armchairs, but I advocate going through a serious process of discovery beyond the boundaries of their own personal experience. My favorite classic example of items derived from existing theory is the Mach Scale developed out of the works of Italian political theorist Niccolò Machiavelli (Christie and Geis 1970). A number of statements were derived directly from Machiavelli's publications, then augmented with a few others that expressed ideas that were in his works but not stated so simply. A large collection of these items was administered to college students, in a lengthy iterative process, and then statistical techniques were used to develop a high-reliability 20-item scale, containing a couple of subscales. This Mach Scale was then used in a wide variety of studies with different populations, which had the effect of determining its generalizability beyond the original student population.

Classical scale-construction work like this in personality and social psychology inspired me to launch an Internet-based project in 1997, called the Question Factory. I posted a number of online questionnaires consisting of open-ended items, asking people to express their views on some topic. One asked, "Imagine the future and try to predict how the world will change over the next century. Think about everyday life as well as major changes in society, culture, and technology." After successful preliminary work with The Question Factory, this item was included in the pioneering web-based questionnaire, Survey2000, organized by sociologist James Witte and sponsored by the National Geographic Society (Witte et al. 2000). Approximately 20,000 respondents gave thoughtful written responses to this item, from which I was able to cull 2,000 distinct predictions, 100 of them about religion (Bainbridge 2003, 2004b, 2004d).

A much more recent example, not directly about religion though it easily could have been, is part of a doctoral dissertation about World of Warcraft (WoW) by a British student named Jane Barnett (Barnett et al. in press). The focus was on how people conceptualized anger, and the behaviors that made them angry, in this online virtual world. Barnett began, using online forums and email rather than an online open-ended questionnaire, by eliciting examples of in-WoW scenarios that had made 33 thoughtful respondents angry, and she edited and combined these to produce a battery of 93 provisional items. Hundreds of other respondents rated them in terms of how angry these behaviors would make them feel, and an iterative process employing factor analysis and scale reliability measures reduced them to a 28-item scale with four subscales. One finding that might be relevant to the social science of religion is that people become angry at other people's negative behavior, regardless of whether that behavior was intended to harm. This reminds us that the moral codes promulgated by religions may not directly relate to the cognitive and emotional processes that determine people's senses of anger or appreciation.

Once one has questionnaire items, one needs respondents. One of the factors that made Survey2000 a success was the fact that it was sponsored by the National Geographic Society, and the NGS publicized the questionnaire on its website and in its main magazine. About 50,000 people completed the questionnaire, most in the United States and Canada, but with at least 100 respondents from each of 33 other nations.

A year later, the NGS helped publicize Survey2001, which actually consisted of separate questionnaires for adults and children, and the adult questionnaire was administered online in four languages. Readers of National Geographic magazine have diverse interests, but they are probably far more aware of environmental and global issues than the average person. Thus many of the topic areas were salient for most respondents, even though they were not a random sample. Many items were organized in topical modules, and each respondent was given one at random. After completing it, the respondent was given the choice of doing another one, also selected by the computer at random. Again, this process trades representativeness of the sample against salience of the items for the respondent, but analysis of the data showed great diversity of opinion among respondents to any module. Given the very large number of respondents overall, each module obtained many responses, and the article on the New Age I published in Journal for the Scientific Study of Religion (Bainbridge 2004) was based on fully 3,909 English-speaking respondents to the module I included in Survey2001.

Teen-age respondents to the youth questionnaire in Survey2001 were recruited in two very different ways. First, many were recruited off the National Geographic website. Second, others filled out the questionnaire as a school assignment connected with Geography Awareness Week. Teachers were recruited so that two classes did the questionnaire in each US state and Canadian province. The fact that these two methods obtained very different kinds of respondents permitted internal replication, and in one study I compared gender correlations with 1,191 respondents in each group (Bainbridge 2002).

Inviting respondents is not the same thing as motivating them, and motivational factors will vary depending on the nature of the population and the topic of the research. A study by Dmitri Williams and his collaborators (Huh and Williams in press) is a marvelous example of how motivation and salience can combine with opportunities to collect additional data online to supplement a questionnaire. His study is part of a massive effort focused on the virtual world (or online multiplayer role-playing game) EverQuest II. The Sony company, which created EverQuest II, provided access to the raw data on its computer servers, documenting millions of social and economic interactions between the avatars of the users. A random sample of players was then sent an invitation to complete an online questionnaire, and offered a highly valuable virtual object as payment, achieving a very high response rate. The questionnaire included a well-developed battery of items about motivations for being in EverQuest II, as well as objective questions about the respondent such as his or her gender. It was then possible to connect the questionnaire responses to the characteristics and behavior of the avatars, for example comparing the gender of the person and his or her avatar, and comparing the degree of aggressiveness across both the real and virtual genders.

Another study that shows how online methodological innovations can achieve scientific gains was done in Japan and published in American Journal of Political Science (Horiuchi et al. 2007). This study combined a questionnaire with a randomized assignment experiment, and employed analytical innovations as well. One of the issues in the 2004 election to the upper house of the Japanese legislature was pension reform. Three questionnaires were used at different stages in the process: respondent screening, pre-election attitudes, and post-election attitudes. The sample was randomly assigned to one of three groups: (1) those asked to visit the website of one of the two main political parties, (2) those asked to visit the websites of both parties, and (3) those not asked to visit any website and not given the pre-election questionnaire. Of course, the main comparisons concerned responses to the post-election questionnaire. Random assignment to the treatment groups and the control group is of course a traditional method used by experimentalists to get around biases introduced by non-random samples of respondents. This study underscores the tremendous possibilities for methodological innovation, building on traditional methods, which Internet offers.
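The random assignment itself is mechanically simple, as this sketch suggests; the condition labels merely paraphrase the three groups described above, and the respondent identifiers are hypothetical.

# Sketch of randomly assigning respondents to experimental conditions;
# condition labels paraphrase the three groups described in the text.
import random

CONDITIONS = ["visit_one_party_site", "visit_both_party_sites", "control"]

def assign_conditions(respondent_ids, seed=42):
    """Shuffle respondents and deal them into roughly equal groups."""
    rng = random.Random(seed)          # fixed seed makes the assignment reproducible
    ids = list(respondent_ids)
    rng.shuffle(ids)
    return {rid: CONDITIONS[i % len(CONDITIONS)] for i, rid in enumerate(ids)}

print(assign_conditions(range(1, 13)))   # twelve hypothetical respondent IDs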

2. Recommender Systems

A vast amount of information about modern culture lies latent in the databases of commercial websites in what are usually called recommender systems (Resnick and Varian 1997; Basu et al. 1998) but also sometimes referred to as collaborative filtering systems (Goldberg et al. 1992; Canny 2002). With the growth of online merchandising, websites have invested heavily in recommender systems of many kinds that advertise to a user products the merchant thinks that particular individual might want to buy. A vast scientific literature now exists concerning recommender systems, but essentially all of it is oriented toward making predictions of customer preferences, rather than exploring how these systems could be used as social science research tools (Herlocker et al. 2004). The most obvious way to use recommender systems to do social science research on religion is to examine what religious movies or books cluster together, based on preference correlations across large numbers of cases, employing statistical techniques almost identical to the ones we have been using for decades with questionnaire data.

In some cases, such as the one for the Netflix movie rental company, the system actually uses a simple questionnaire. People who rented movies are invited to rate them on the website, using a five-step scale. Then the system uses statistical methods to predict which other movies the individual might want to rent, based both on that individual's expressed preferences, and the preferences of other people whose preference patterns are similar. The Internet Movie DataBase is not a rental company, but it also encourages people to rate movies, using a ten-point preference scale. We will use some data from these two sources to illustrate typical research procedures, admittedly on a much smaller scale than a real research project would use.

The Internet Movie DataBase has a category called "based on the Bible," including 10 theatrical-release films that were rated on a scale from 1 to 10 by at least 1,000 persons.[i] Of these, seven are also in the NetFlix database, and are listed here in Figure 1. The IMDB data are available for anyone to see on its website, whereas the NetFlix figures come from analysis of the raw data, which were distributed to anyone who wished to register as a contestant in the first NetFlix contest, designed to see if anyone could create a better algorithm for predicting people's preferences. The contest data consisted of 17,770 separate text files representing an equal number of movies, and some effort was required to get these data in shape for analysis.

Figure 1: Seven Bible-Related Movies in Two Recommender Systems

|Movie |IMDB Raters |IMDB Mean |NetFlix Raters |NetFlix Mean |
|The Ten Commandments (1956) |18,481 |7.9 |20,910 |3.9 |
|The Last Temptation of Christ (1988) |18,628 |7.5 |12,739 |3.4 |
|The Prince of Egypt (1998) |21,568 |6.8 |16,664 |3.7 |
|Jonah: A VeggieTales Movie (2002) |1,585 |6.4 |7,775 |3.6 |
|The Greatest Story Ever Told (1965) |2,976 |6.3 |3,180 |3.6 |
|The Bible: In the Beginning… (1966) |1,179 |5.7 |955 |3.3 |
|Left Behind (2000) |3,816 |4.6 |4,646 |3.3 |

A quick look at the work preparing the NetFlix data can illustrate the need for data management skills on the part of researchers. Each of the text files contained a long series of short lines, each one representing the response by one person. Here are the first five lines of the file for The Ten Commandments:

577397,3,2005-07-05

1527030,1,2005-07-07

2480084,5,2005-07-13

891353,3,2005-07-14

1718816,4,2005-07-15

The first number is an ID code representing the respondent; this is crucial, because it allows the researcher to combine the data for different films rated by the same person. The total number of respondents in the dataset is 400,000, but the ID numbers go considerably higher, one of the little details of which the researcher needs to be aware when preparing to assemble the dataset. The second, one-digit number, between the two commas, is the actual preference rating for that respondent and film, a number from 1 (did not like) to 5 (liked very much). The last part of each line is the date on which the person rated the film. The file for The Ten Commandments has fully 20,910 such lines of data.

Simply put, there are two ways to combine the necessary datafiles: (1) do it manually, using whatever standard tools one is already familiar with, or (2) write a computer program specially designed for the particular project. I use both methods, and generally find that I need to do a little manual work before I really understand what features need to be coded into a program that will do the "heavy lifting" for me.

For example, using an ordinary word processor and spreadsheet, I manually combined the data for the first three very popular films: The Ten Commandments, The Last Temptation of Christ, and The Prince of Egypt. The first two films are live-action epics depicting portions of the Old Testament and New Testament, respectively. The Prince of Egypt is a cartoon remake of The Ten Commandments, even adopting the same debatable assumption that the pharaoh Moses dealt with was Ramses the Great. The two movies about Moses treat the subject reverently, whereas The Last Temptation of Christ was a very controversial film, based on a controversial novel by Nikos Kazantzakis, as its Wikipedia page explains: "Like the novel, the film depicts the life of Jesus Christ, and its central thesis is that Jesus, while free from sin, was still subject to every form of temptation that humans face, including fear, doubt, depression, reluctance and lust. This results in the book and film depicting Christ being tempted by imagining himself engaged in sexual activities, a notion that has caused outrage from some Christians."[ii]

Thus these three films nicely illustrate ways in which works of popular culture may differ along various dimensions. The word processor was used to replace the commas with tabs, so that the data would automatically go into the correct columns when loaded in the spreadsheet. Then a good deal of manipulation – the equivalent of programming by putting IF-THEN statements into spreadsheet cells and doing several sortings – was required to get the data in shape for analysis both in the spreadsheet itself and after transfer to the SPSS statistical analysis software. For larger numbers of films, one would want to invest the effort to write a program that could combine hundreds of files automatically.
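Here is a minimal sketch, in Python, of such a merging program; it assumes each film's ratings sit in a separate text file of ID,rating,date lines like the excerpt shown earlier, and the file names are hypothetical placeholders.

# Minimal sketch of merging several NetFlix-contest-style rating files
# into one table keyed by respondent ID; file names are placeholders.
import csv

FILES = {
    "ten_commandments": "mv_ten_commandments.txt",
    "last_temptation":  "mv_last_temptation.txt",
    "prince_of_egypt":  "mv_prince_of_egypt.txt",
}

def load_ratings(path):
    """Read lines like '577397,3,2005-07-05' into {respondent_id: rating}."""
    ratings = {}
    with open(path, newline="") as f:
        for respondent_id, rating, _date in csv.reader(f):
            ratings[int(respondent_id)] = int(rating)
    return ratings

def merge(files):
    """Build {respondent_id: {film: rating}} across all the films."""
    merged = {}
    for film, path in files.items():
        for respondent_id, rating in load_ratings(path).items():
            merged.setdefault(respondent_id, {})[film] = rating
    return merged

if __name__ == "__main__":
    table = merge(FILES)
    print(len(table), "distinct respondents rated at least one of the films")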

Of the total 42,572 respondents, 35,617 rated only one of these three movies, 6,169 rated two, and 786 rated all three. This suggests researchers will need to deal with challenges of missing data, but that whenever Internet provides very large numbers of cases for statistical analysis, a sufficient number will connect any two variables. For the 4,240 people who rated both movies about Moses, the films correlated significantly (r = 0.33). Just 2,634 people rated both Ten Commandments and Last Temptation of Christ, and the correlation was only 0.02. A total of 1,653 rated Last Temptation of Christ and Prince of Egypt, with a preference correlation of only 0.05. A recent publication, using a slightly different subset of the NetFlix data, found a solid positive correlation (0.31) between Ten Commandments and the reverent 2004 film, The Passion of the Christ (Bainbridge 2007b).

The fact that many people rated both Moses films, but fewer rated either of them with the controversial film about Jesus, suggests that there is a second way to code preference data – not in terms of which scale rating was given, but whether a film was rated at all. I recoded the ratings so that 1 represented any rating and 0 represented no rating. This analysis produced three negative correlations, suggesting that the three films had significantly different audiences. The two Moses films had a moderate negative correlation (-0.23), and the two live action films had a somewhat larger one (-0.37). But there was a huge negative correlation between Prince of Egypt and Last Temptation of Christ (-0.60), probably because the former is a cartoon feature which families may have watched with their children, whereas the latter is decidedly an adult film.

This recoding eliminated the very concept of missing data, so the correlations were based on fully 42,572 cases. Although these correlations were calculated in a reasonable manner, quite suitable for comparison purposes, it should be pointed out that the calculation did not include any of the roughly 357,000 people in the dataset who did not rate any of the three films, something one might need to consider doing for different research purposes.
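Both coding schemes can be expressed compactly, building on the merged dictionary from the sketch above; this fragment uses the statistics module of Python 3.10 or later and assumes each film was rated by at least a few people, so the correlations are defined.

# Sketch of the two coding schemes described above, applied to the merged
# {respondent: {film: rating}} dictionary from the earlier sketch.
from itertools import combinations
from statistics import correlation      # Pearson's r, Python 3.10+

def pairwise_correlations(table, films):
    respondents = list(table)
    results = {}
    for film_a, film_b in combinations(films, 2):
        # (1) Preference correlation among people who rated both films.
        both = [r for r in respondents if film_a in table[r] and film_b in table[r]]
        pref_r = correlation([table[r][film_a] for r in both],
                             [table[r][film_b] for r in both])
        # (2) "Rated at all" correlation: recode to 1/0 for every respondent,
        #     which eliminates missing data entirely.
        seen_r = correlation([1 if film_a in table[r] else 0 for r in respondents],
                             [1 if film_b in table[r] else 0 for r in respondents])
        results[(film_a, film_b)] = (pref_r, seen_r)
    return results

# Example call, using the hypothetical film keys defined earlier:
# pairwise_correlations(table, ["ten_commandments", "last_temptation", "prince_of_egypt"])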

Researchers who want to make use of recommender systems to chart cultural trends should realize that people's preferences for cultural products like movies are only partly determined by their ostensible topics. Also important for films are the featured actor, the year the film was made, and what might be called the mood, style, or emotional tone of the picture. An excellent example is what results when the 1959 movie Ben-Hur is entered into MovieLens, a motion picture recommender system created for research purposes by GroupLens Research at the University of Minnesota.[iii] The ten most similar movies, as reflected in correlations between people's preferences, are:

Ben-Hur: A Tale of the Christ (1925)

Spartacus (1960)

Ten Commandments, The (1956)

Great Escape, The (1963)

Patton (1970)

Bridge on the River Kwai, The (1957)

Seven Days in May (1964)

Longest Day, The (1962)

Fail-Safe (1964)

Magnificent Seven, The (1960)

The first of these is the silent film based on the same novel as the 1959 movie. Like Ben-Hur, Spartacus depicted the Roman Empire and was released just a year later; however, the ideological content of Spartacus was not Judeo-Christian but class politics. Ten Commandments, like Ben-Hur, was oriented toward the Bible and starred the same actor, Charlton Heston. The other films date from roughly the same period as the target film, concern human conflict, and tend either to have noble main characters or at least to raise issues about nobility of character. One could say these are all serious action pictures with strong plots, either set in historical settings, or in the case of the Cold War related movies, Seven Days in May and Fail-Safe, historical from today's perspective. All have famous main actors. Thus, the religious dimension of Ben-Hur is only one of the factors that makes it correlate with other films in people's expressed preferences.

Movies are a convenient example, but many kinds of products are covered by recommender systems, and others include items with religious significance. The online bookseller Amazon.com bases its recommender system on actual book-buying behavior, rather than preferences expressed on a questionnaire scale. Amazon's internal data would be excellent for research purposes, but what is available online is not very detailed and is useful chiefly for examples. On July 21, 2009, Amazon.com categorized 1,865 items in a general Religion and Spirituality category, with these three heading the best seller ranking:

The Family: The Secret Fundamentalism at the Heart of American Power by Jeff Sharlet

The Secret by Rhonda Byrne

The Biology of Belief: Unleashing the Power of Consciousness, Matter, & Miracles by Bruce H. Lipton

According to its web page, customers who bought The Family also bought Crazy for God: How I Grew Up as One of the Elect, Helped Found the Religious Right, and Lived to Take All (or Almost All) of It Back by Frank Schaeffer, and four secular books that were critical of contemporary American culture. Apparently, one popular current theme is conspiracy theories of American politics, some of which involve religion.

Customers who bought The Secret also bought three related products by the same author, plus Law of Attraction: The Science of Attracting More of What You Want and Less of What You Don't by Michael J. Losier and You Can Heal Your Life by Louise Hay which carries the motto, "What we think about ourselves becomes the truth for us..." Customers who bought The Biology of Belief also bought two self-control inspirational books by Dr. Wayne W. Dyer, Excuses Begone! and No Excuses!, and two mind control books by Lynne McTaggart, The Intention Experiment: Using Your Thoughts to Change Your Life and the World and The Field Updated Ed: The Quest for the Secret Force of the Universe. They also bought The Divine Matrix: Bridging Time, Space, Miracles, and Belief by Gregg Braden. These examples remind one of The Power of Positive Thinking by Dr. Norman Vincent Peale, and customers who bought that classic book also bought classic self-help books by Dale Carnegie. Thus, a second popular category of "Religion and Spirituality" books covers self-control books that vary in the extent to which they employ religious rather than psychological or pseudoscientific metaphors.

Amazon.com does carry many conventionally religious books, but these examples show how a recommender system can be used to explore ongoing developments in the surrounding culture that relate to religion without necessarily corresponding with traditional definitions.

3. Geographic Data Analysis

This approach applies traditional quantitative methods of social ecology to new kinds of data already available on the Web but little exploited so far. Social scientists have long compared geographically-based religion-related variables to develop and test theories. Perhaps the most familiar classic work is Emile Durkheim's 1897 book Suicide, which compared rates of self-murder between Protestant and Catholic areas of Europe. Less familiar, but at least available in English, was Henry Morselli's 1882 book on the same topic, which was the source of many of Durkheim's numbers but less ambitious theoretically. However, the real classic in this tradition is almost totally unknown, Adolph Heinrich Gotthilf Wagner's 1864 book, Die Gesetzmässigkeit in den Scheinbar Willkürlichen Menschlichen Handlungen vom Standpunkte der Statistik, which has never been translated. In my view Wagner's book is by far the most admirable of the three, not merely for being earlier, but precisely because it is more cautious than Durkheim in asserting theoretical explanations and does not, like Durkheim, leave out statistics that inconveniently contradict the theory.

Given the century and a half tradition of geographic statistics on religion, what Internet chiefly contributes is access to a large number of new measures, or more convenient access to data that have been available before. In the early 1980s, I counted classified telephone book listings for astrologers and new religious movements in both the United States and Canada (Stark and Bainbridge 1985). While some effort is required to assign them to the correct geographic units, the chief challenge thirty years ago was finding the phone books in the first place. I located many in my university library, others in a city's public library, and in a few cases I hired a student to call information operators in small cities and ask them politely to check their own local phonebook. For a study of the 22 metropolitan statistical areas in Canada, I actually obtained my own personal collection of all the paper phonebooks.

Online telephone directories greatly simplify this work, although they do not remove all the hand labor. First, one must compare online telephone directories to identify the most complete one. Typically, one must then work manually state by state in the US, entering the desired search term or scanning all the listings for churches, because it is hard to write a computer webcrawler program to do this automatically. For a recent tabulation of astrologers by state, I found that the most accurate method was to paste each page of astrology listings into a word processing document, then edit it with a combination of manual labor and search-and-replace commands, before porting the text into a spreadsheet (Bainbridge 2007a: 117, 254). Then more work was required to format the data, often simply because different listings had different numbers of lines of data, and to find duplicate listings that needed to be removed. Some sense of the magnitude of this work is reflected in the final total of unique listings, which was 3,859, and three work days were required to prepare the data manually for computer analysis.
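As a small illustration of this clean-up stage, the following sketch normalizes pasted listings, drops duplicates, and counts listings per state; the one-listing-per-line "Name; City; ST" format is a hypothetical simplification of what a real directory paste looks like.

# Sketch of cleaning pasted directory listings: normalize whitespace,
# drop duplicate listings, and count unique listings per state.
# The "Name; City; ST" line format is a hypothetical simplification.
import re
from collections import Counter

def tabulate_listings(raw_lines):
    seen = set()
    per_state = Counter()
    for line in raw_lines:
        line = re.sub(r"\s+", " ", line).strip()        # collapse stray whitespace
        if not line:
            continue
        key = line.lower()                              # crude duplicate detection
        if key in seen:
            continue
        seen.add(key)
        state = line.rsplit(";", 1)[-1].strip().upper() # last field is the state code
        per_state[state] += 1
    return per_state

sample = [
    "Starlight Astrology; Seattle; WA",
    "Starlight  Astrology; Seattle; WA",   # duplicate with stray spacing
    "Madame Zora; Boston; MA",
]
print(tabulate_listings(sample))           # Counter({'WA': 1, 'MA': 1})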

Often, a religious denomination or movement lists its centers, clergy, or even members on a website that may be used in the same manner to generate geographic rates. Figure 2 shows five measures I developed from such websites.

Figure 2: New Religion Indicators per 100,000

|Geographic Regions of the US |Scientology Websites |TM Centers |3HO Teachers |Yoga Serve Teachers |Yoga Alliance Teachers |
|New England |2.12 |0.22 |0.48 |6.29 |8.17 |
|Middle Atlantic |1.55 |0.03 |0.22 |2.06 |6.04 |
|East North Central |1.19 |0.04 |0.08 |0.82 |3.20 |
|West North Central |1.30 |0.07 |0.11 |0.74 |2.18 |
|South Atlantic |4.01 |0.05 |0.19 |1.14 |4.44 |
|East South Central |0.37 |0.02 |0.03 |0.47 |1.18 |
|West South Central |0.88 |0.03 |0.19 |0.52 |2.11 |
|Mountain |2.98 |0.07 |0.69 |1.33 |6.17 |
|Pacific |9.60 |0.10 |0.45 |0.87 |4.13 |
|USA |3.26 |0.06 |0.25 |1.30 |4.10 |

In 1998, the Church of Scientology launched 15,693 personal web pages in 11 languages for members in 45 nations. Of the total, 8,762 or 55.8 percent were residents of the United States, and they are tabulated by the nine divisions of the nation in Figure 2. The remaining columns tabulate data for four Asian-oriented religious or spiritual movements, beginning with rates based on 178 Transcendental Meditation centers in the United States in 2006. In the same year, the website of the International Kundalini Yoga Teachers Association, the successor to the Healthy-Happy-Holy Organization (3HO) of Yogi Bhajan, listed 747 3HO yoga teachers. A website called Yogaserve listed 3,847 teachers of yoga in the US who have chosen to register, and the website of the Yoga Alliance listed fully 12,166 teachers.

Such data are very useful to test or develop theories about the socio-cultural environments that are hospitable for new religious movements (Stark and Bainbridge 1985). In general, western areas of the United States have high rates of geographic migration, low rates of membership in conventional religious organizations, and probably as a consequence have high rates of new religious movements. However, in Figure 2 as in earlier data, New England has somewhat high rates, despite having church-member rates comparable to other eastern regions. Among the theories that could be tested about why this is the case are three: (1) new religious movements are attracted by the high density of elite educational institutions, (2) for historical reasons New England is weak in religious sects which would provide an alternative to mainstream denominations, or (3) something about the socially conscious (e.g. liberal) culture of New England. Like some earlier data, the table also suggests that the South Atlantic region may be increasingly open to some kinds of spiritual movements, possibly in retirement communities in Florida, or secular communities in Florida and around Atlanta and the District of Columbia. Of course, data on any one new religious movement may reflect its own unique regional history, and the geographic location of its headquarters, so the availability of data about numerous groups over Internet is a great benefit for researchers.

For the kinds of things counted in the above table, it makes perfect sense to use the total populations of the geographic area to produce rates. In some cases, one might want to use some subset of the population, such as adults or elderly people, as the divisor. In other cases, one might need to use a completely different kind of variable for the divisor in a rate. For example one might divide the number of churches belonging to one denomination by the number of churches belonging to all denominations. The first column of the table is based on websites belonging to the Church of Scientology, but established for individual members, so population is a good divisor. However, for rates with other kinds of websites in the numerator, one might need websites in the denominator as well.
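The arithmetic behind such rates is simple, as the following fragment illustrates; the count of 8,762 Scientology member pages comes from the discussion of Figure 2 above, while the population figure used as a divisor here is only an assumed round number, so the result is approximate.

# Sketch of turning raw counts into rates per 100,000, as in Figure 2.
def rate_per_100k(count, divisor):
    """Rate per 100,000 units of whatever the divisor counts."""
    return 100_000 * count / divisor

assumed_us_population_1998 = 270_000_000    # assumption, not from the text
print(round(rate_per_100k(8_762, assumed_us_population_1998), 2))   # about 3.25; Figure 2 reports 3.26

# The same function serves other divisors, for example churches of one
# denomination per 100,000 churches of all denominations.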

For example, one could compare all the web pages hosted by the governments of US states, to see what fraction of them in each state contained a religion-related word like "church." At one time, one could get decent geographically-based counts from searching websites in each of the fifty US state domains, because originally the .us domain was limited to governments. Thus, one could enter "church site:ma.us" into Google to get all the Massachusetts government web pages registered in the .us domain that had the word "church" on them. More recently, the .us domain was opened up, so that citizens and non-governmental organizations can use these domains, and the implications for social science are not yet clear.
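A small sketch of how the state-by-state queries might be generated follows; the hit counts would still have to be read off the search engine by hand, or through whatever interface its terms of service permit, and only the first few state codes are listed here.

# Generate the domain-restricted queries described above, one per state;
# only a few state codes are listed, purely for illustration.
US_STATE_CODES = ["al", "ak", "az", "ar", "ca", "co", "ct", "de", "fl", "ga"]

def state_queries(term):
    """Return a 'term site:xx.us' query string for each state code."""
    return {code.upper(): f"{term} site:{code}.us" for code in US_STATE_CODES}

for state, query in state_queries("church").items():
    print(state, "->", query)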

When basing rates on ratios of websites, one should be alert to the possibility that relationships will be non-linear, because for example very small-population states may need pages covering a wide range of topics, almost as wide as for large-population states. The basic lesson is that one must become familiar with one's data, and think carefully about what social process produced the cases, in order to know what the statistics actually measure.

Some researchers may want to invest in developing cooperative relationships with corporations that have access to geographically-based data through their online business. For example, Google offers businesses a complex service called Google Analytics, which can produce maps and tables of the numbers of people accessing a given web page from different geographic locations.[iv] In many cases, a company's website provides geographical data but in an inconvenient form, and thus working with the company to obtain the data directly could be much more efficient. For example, I just entered the word "Christ" into the eBay website and discovered 9,397 items for sale whose descriptions contained the word "Christ." For each, I could manually look at the advertisement to see geographically where the item was, but doing so for all of them would be exceedingly tedious.

4. Search Engines

Among the most heavily used online services – and one of the most useful for social scientists in often unexpected ways – are search engines like Google. Although some details of each search engine are kept secret by the company offering it, they are based on principles from the cognitive and social sciences, as well as on computer science. Thus, social scientists of religion would do well to learn as much as they can about their research potential, and this section of the current essay can only scratch the surface. A good starting point for readers who want to learn more is the classic book Finding Out About by Richard Belew (2000).

When the World Wide Web was launched in the early 1990s, creators of web pages were encouraged to put keywords describing the page in a hidden area of the HTML code that could be searched but would not be visible to the average user. Unfortunately, people very quickly gamed the system, putting popular but irrelevant terms in the code. In addition, as the Web grew – now with over a trillion pages – it became impossible to search it in realtime. Commercial search engines index the Web by sending crawler programs out across it looking for new pages. They categorize web pages in terms of the words in the part of the code visible to users, but for many searches the number of pages containing the search term is enormous. I just this moment searched for "God," and Google gave me 469,000,000 web pages on which to find Him!

One response, exemplified by the Alta Vista search engine, was to allow the user to do the Boolean searches preferred by librarians. Currently, Alta Vista allows the user to fill in any of four different text fields: all of these words, this exact phrase, any of these words, and none of these words. When I just now searched for "God," Alta Vista estimated it could find 1,450,000,000 pages containing this word. When I told it to search for "God" but only on pages that did not contain "Jesus" or "Christ," the estimated number of hits declined to 1,160,000,000. Clearly, this is still too large a number of pages for me to visit in this life. Therefore, modern search engines need to augment the traditional search for keywords with some method for prioritizing the pages. As it happens, the first hit Alta Vista gave me in this more restrictive search was a Wikipedia web page listing names of God in Judaism, clearly a very appropriate page given my search terms. Google's solution to the prioritization problem was PageRank, an algorithm based on links between web pages, measuring what fraction of other relevant pages link to the page in question, thus a measure of its popularity for people interested in the topic of the search (Brin and Page 1998).
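Google's production system is proprietary and far more elaborate, but the published idea behind PageRank can be conveyed by a toy power-iteration sketch over an invented four-page link graph; this illustrates the principle, not Google's actual code.

# Toy power-iteration sketch of the PageRank idea (Brin and Page 1998);
# the four-page link structure is invented purely for illustration.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if not outlinks:                  # a page with no links shares its rank equally
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(toy_web))    # pages with more incoming links earn higher rank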

Most users of search engines seem unaware of the special ways in which they can be used, both the different ways in which searches can be framed, and the potential uses of the results of a search. An example of how both kinds of awareness can be useful to the researcher is the possibility of exploiting the ability of several search engines to limit searches to specified Internet domains. Googling "God site:edu" gives you 4,350,000 pages that contain the word "God" which are in the ".edu" domain reserved primarily for US educational institutions. Googling "God site:gov" gives you the 826,000 US government pages that refer to God. "God site:nih.gov" gives you the 8,470 pages mentioning God on the immense website of the National Institutes of Health. Given that different Internet domains represent different provinces of culture and society, comparing across domains can be useful for social scientists.

When I did the research for Figure 3 in 2006 (Bainbridge 2007a: 153, 257), Google estimated that 173,000,000 pages contained the word "God." Of these, 11,900,000 were in the .edu domain, and 82,200,000 were in the .com domain. The ratio of these two numbers (.edu/.com) is 0.145 or 14.5 percent. This is a measure of how educational versus how commercial the concept is, but only if compared with the ratios for other terms. Similarly, the ratio of .gov to .net pages, 9.4 percent, is a measure of how governmentally official the concept is. Note that the word "church" has higher ratios, reflecting the fact that churches are important educational and civic institutions, as well as religious ones. In contrast, words relating to agnosticism and atheism are relatively rare in official institutions of modern society, despite all the debates about secularization.
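The ratios in Figure 3 involve nothing more than division, as this fragment shows, using the 2006 counts for "God" reported above.

# Compute the .edu/.com ratio described above from page counts.
def domain_ratio(pages_in_domain_a, pages_in_domain_b):
    """Ratio used in Figure 3, e.g. .edu pages divided by .com pages."""
    return pages_in_domain_a / pages_in_domain_b

# Counts of pages containing "God" in 2006, as reported in the text.
print(round(domain_ratio(11_900_000, 82_200_000), 3))   # 0.145, i.e. 14.5 percent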

Another useful search trick is to seek the web pages that link to a particular other web page. For example, www.thearda.com is the home page of The Association of Religion Data Archives, a prominent online digital library. Googling "link:www.thearda.com" returns 728 hits, including a list of religion-related websites on the website of Paul Brians of Washington State University.[v]

Figure 3: Google Estimated Frequency of Words on Web Pages

| Words |Pages Containing the Word (thousands): All Domains, .edu, … |Ratios |

