Wednesday, July 14, 2010

Fake Your Data like a Machine

You may have heard the news about a small polling company called Research 2000, which recently landed in hot water for allegedly faking polling data during the 2008 election. The Daily Kos uncovered the story based on tips from a few statistical wizards who spotted some abnormalities in the data.

To break it down, in addition to the Daily Kos, Research 2000 provided polling services for a large number of local television and newspaper affiliates. Well, “provided” in the sense that they will likely not be providing said research services much longer. Research 2000 president Del Ali quickly and unsurprisingly shot back against the charges of faking data, writing in a statement:

“Every charge against my company and myself are pure lies, plain and simple and the motives as to why Kos is doing it will be revealed in the legal process and not before that. I will share one little minor reason that Kos is doing this and it pertains to the fact they owe us a significant sum of monies that is in the six figure category and payment was on June 15, 2010.”
Of course, the fact that he won’t publicly release his data (likely because it doesn’t exist) does hurt his credibility a little.

But this brings us to the larger and more important issue – when you fake your survey data, don’t do it like a human being. See, the world has both a randomness and an order to it that our feeble minds can’t quite grasp. And when we try to randomize things, we do it in a much too orderly fashion.

The thing that brought down Research 2000 is that their data was much too “clean.” It didn’t have the error associated with it that one would expect from a random survey. For example, when you fake your data, don’t make the two numbers in every breakdown (say, male and female) both even or both odd. It’s best to mix it up a little.

Bad fake data when ALL the male/female comparisons are both either even or odd
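
To see why matching parity is suspicious, here is a minimal Python sketch. It assumes (my assumption, not a figure from the Daily Kos analysis) that two independently rounded percentages should be both even or both odd about half the time. The numbers in `pairs` are invented for illustration and are not Research 2000 data.

```python
from math import comb


def parity_matches(pairs):
    """Count pairs whose two values are both even or both odd."""
    return sum(1 for a, b in pairs if a % 2 == b % 2)


def chance_of_at_least(matches, total, p=0.5):
    """Binomial probability of seeing `matches` or more parity matches in `total` pairs."""
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(matches, total + 1)
    )


# Hypothetical male/female percentages, made up for this example.
pairs = [(42, 38), (51, 47), (33, 29), (60, 56), (45, 41), (27, 23), (58, 54), (36, 32)]

matches = parity_matches(pairs)
probability = chance_of_at_least(matches, len(pairs))
print(f"{matches} of {len(pairs)} pairs match in parity; chance = {probability}")
```

With these made-up numbers all 8 pairs match, which would happen by chance only about 1 time in 256. Stretch that over hundreds of pairs and "clean" starts to look like "fake."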

Also, when you fake your data, you most likely want to make sure it is normally distributed, because that’s how the world operates. See, this Gallup poll demonstrates what a normal distribution looks like. It’s what’s referred to as the bell curve.

This Research 2000 poll, on the other hand, demonstrates that humans don’t like the number 0 when faking data.

There were a number of other problems with the data that have been well documented on the Daily Kos website. It all adds up to a damning set of evidence indicating that Research 2000 faked some serious-ass data. Hundreds of thousands of dollars’ worth of data. And didn’t do it particularly well.

The fact that the company didn’t have a mailing address and operated out of a Kinko’s probably should have been enough of an indication that something wasn’t right.


Monday, July 12, 2010

Introduction to Internet Surveys

The internet represents the fastest growing and most exciting way to conduct survey research. If you’ve been online for more than a couple minutes, there is a 100% chance that you’ve either signed up to participate in an online survey or have at least been invited to take an online survey (blame those annoying pop-up surveys).

Internet surveys have grown exponentially compared to other modes of surveys because they offer distinct advantages: they 1) are quick and easy to create and implement, and 2) usually don’t cost a lot of money. The cost of an internet survey ranges from almost nothing (if you have the sample and program the survey yourself) to moderately expensive but still cheaper than a telephone, mail or in-person survey (if you have to buy sample and outsource the development of the online survey).

Internet surveys are especially good in situations where the respondents are known to you and have an interest in the subject, such as employee, membership, or customer surveys. They can also be used for hard-to-find respondents, since online sampling firms have established “panels” of respondents whose characteristics are stored in a database and who have agreed in advance to participate in surveys.

Of course, like all research approaches, Internet surveys have their weaknesses. Response rates to internet surveys are typically very low because they are usually very easy to ignore, which means that 1) you’ll need a lot of sample to complete even a small number of surveys, and 2) there is more chance that your survey results are not representative of the larger population you are targeting. In addition, there is the ‘professional survey taker’ phenomenon, where people sign up for many online surveys for the sole purpose of making money. Just check out their rules.
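
The "you'll need a lot of sample" point is simple arithmetic, sketched below in Python. The function name and the response rates are hypothetical, chosen only to show how quickly the required number of invitations grows as the response rate drops.

```python
import math


def invitations_needed(target_completes, response_rate_pct):
    """Estimate how many invitations to send to reach a target number of completes.

    response_rate_pct is the expected response rate as a percentage (e.g. 5 for 5%).
    """
    if not 0 < response_rate_pct <= 100:
        raise ValueError("response_rate_pct must be greater than 0 and at most 100")
    return math.ceil(target_completes * 100 / response_rate_pct)


# Illustrative response rates only; real rates vary widely by audience and topic.
for rate in (25, 10, 5, 2):
    print(f"{rate}% response rate: {invitations_needed(400, rate)} invitations for 400 completes")
```

This only covers the first weakness. A bigger sample does nothing for the second one: if the people who respond differ from those who don't, the results can still be unrepresentative.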

Internet surveys certainly have an important place in the market research toolbox (or is it a shed?) – but be aware that they have some tradeoffs compared to other modes of data collection. In addition, online surveys have become much more pervasive in the past few years (how many websites now have surveys that pop up when you log onto them?), meaning that they are becoming more of an annoyance as well as more likely to be ignored.


Friday, July 9, 2010

Quantitative Analytical Techniques

Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, and presentation of masses of numerical data. When a measurement is calculated for an entire population, say the average age, it’s called a parameter. When we calculate the same measurement across a sample, again the average age, we call it a statistic. Since people make entire careers out of the study of statistics, the point of this post is to present a bird’s-eye overview and brief description of common terms you’ll hear in conversations about quantitative analysis.
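
The parameter versus statistic distinction can be sketched in a few lines of Python. The population below is simulated with made-up ages (an assumption for illustration), so the exact numbers mean nothing; the point is that the sample figure only approximates the population figure.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated population of 10,000 ages (invented data).
population_ages = [random.randint(18, 90) for _ in range(10_000)]

# Parameter: calculated across the entire population.
parameter = statistics.mean(population_ages)

# Statistic: calculated across a random sample of 200 people.
sample_ages = random.sample(population_ages, 200)
statistic = statistics.mean(sample_ages)

print(f"Population average age (parameter): {parameter:.2f}")
print(f"Sample average age (statistic):     {statistic:.2f}")
```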

When discussing statistics, researchers usually talk about the data in terms of “variables.” A variable is a characteristic that may assume more than one set of values (age, income, birth place can all have more than one value). A variable can either be nominal, ordinal, interval, or ratio in its scale. Nominal variables are also referred to as categorical variables because they represent categories of responses. The color of a car would be represented by a categorical value (for example, black, red, or silver). Categorical variables have no set order, meaning that a black car is not necessarily any better than a silver car.

The level of satisfaction with one’s car on a 1 to 10 scale is an example of an ordinal variable (where a 10 is a better score than a 5 and a 5 is better than a 1). Ordinal variables have a clear, set order, but they still represent categories of responses. Interval and ratio variables are numerical variables whose numbers have direct meaning. The age of a car would be a ratio variable because it can be measured precisely and at equal intervals (in hours, years, or decades) and has a true zero.

Variables can also be discrete or continuous. Continuous variables, such as time, have an infinite number of possible values, while discrete variables, such as a satisfaction scale, have a finite (in this case 10) number of possible values.
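
One practical consequence of a variable's scale is which summary makes sense for it. The Python sketch below uses invented car data and a common rule of thumb (mode for nominal, median for ordinal, mean for ratio). Treat it as an illustration rather than a hard rule.

```python
import statistics

# Invented example data for three variable types.
car_colors = ["black", "red", "silver", "black"]  # nominal: categories, no order
satisfaction = [7, 9, 4, 10, 7]                   # ordinal: ordered 1-10 ratings
car_age_years = [2.5, 11.0, 6.0, 0.5]             # ratio: equal intervals, true zero

most_common_color = statistics.mode(car_colors)       # counting is all we can do
middle_satisfaction = statistics.median(satisfaction)  # order is meaningful
average_age = statistics.mean(car_age_years)           # arithmetic is meaningful

print(most_common_color, middle_satisfaction, average_age)
```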

Descriptive statistics are simple portrayals of what the variables show. They are summaries of the frequency of the different values (like percentages); the central tendency (mean, median or mode); and the dispersion (like the range and the standard deviation). Cross tabs (short for cross tabulations) are popular for displaying the joint distribution of two or more variables. They are usually presented in a matrix called a contingency table. In a cross tab table, each cell gives the number of respondents that gave a particular combination of responses.
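
Here is a small Python sketch of both ideas, using only the standard library. The scores and responses are invented, and the variable names are mine.

```python
from collections import Counter
import statistics

# Invented satisfaction scores on a 1-10 scale.
scores = [7, 9, 4, 10, 7, 8, 6, 7]

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "range": max(scores) - min(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
}
print(summary)

# Invented (gender, answer) responses for a cross tab.
responses = [
    ("male", "satisfied"), ("male", "dissatisfied"), ("female", "satisfied"),
    ("female", "satisfied"), ("male", "satisfied"), ("female", "dissatisfied"),
]
crosstab = Counter(responses)  # each key is one cell of the contingency table

rows = ["male", "female"]
cols = ["satisfied", "dissatisfied"]
print(f"{'':10}" + "".join(f"{c:>14}" for c in cols))
for r in rows:
    print(f"{r:10}" + "".join(f"{crosstab[(r, c)]:>14}" for c in cols))
```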

Measures of association summarize the relationship between two variables (correlation and regression, for instance). Two variables are associated when information about one can help us predict information about the other. A variety of techniques to measure association are available, each better suited to different classes of variables. When analyzing data, most statisticians use multivariate analysis where the effects of many variables are considered.
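
As one concrete measure of association, the sketch below computes the Pearson correlation coefficient by hand in Python. It suits two numerical variables; other variable types call for other measures. The data is invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Invented data: years as a customer vs. satisfaction score.
years_as_customer = [1, 2, 3, 5, 8]
satisfaction = [4, 5, 7, 8, 9]
print(f"r = {pearson_r(years_as_customer, satisfaction):.3f}")
```

A value near 1 or -1 means one variable predicts the other well; a value near 0 means it tells us little.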

Tests of statistical significance are used to determine how sure we can feel about the associations found in the data -- Could it just be chance? Can we infer that the result can be generalized to the study population? Confidence intervals, chi-square tests and t-tests are the most common tools for this. The level of significance is the probability of concluding that there is a difference between two groups when there actually is none.
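
As a rough illustration, here is a chi-square test on a 2x2 cross tab, written by hand in Python. It assumes a plain Pearson chi-square with no continuity correction, and it uses the fact that with one degree of freedom the p-value can be computed from `math.erfc`. The counts are invented.

```python
import math


def chi_square_2x2(table):
    """Return (chi-square statistic, p-value) for a 2x2 table of counts.

    Plain Pearson chi-square with 1 degree of freedom, no continuity correction.
    """
    (a, b), (c, d) = table
    total = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: rows are two groups, columns are satisfied / dissatisfied.
chi2, p_value = chi_square_2x2([[30, 10], [10, 30]])
print(f"chi-square = {chi2:.2f}, p = {p_value:.6f}")
```

If the p-value falls below the level of significance you picked in advance (0.05 is a common choice), the difference between the groups is unlikely to be chance alone.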

Measures of association can be used in very sophisticated ways. Conjoint analysis can be used to determine trade-offs customers are willing to make among product or service attributes. In addition to understanding current preferences, this technique allows modeling of the impact of the introduction of new factors on preferences.

Discrete choice analysis models selection of a product or concept with many attributes from a set of products or concepts. In essence, it models how people make decisions in the real world. For example, one could test products with varying combinations of features to assess which consumers prefer. As with conjoint analysis, discrete choice analysis allows modeling of the impact of the introduction of a new product or concept on factors such as market share.

Cluster analysis identifies population segments using groups of variables. This provides information to better understand and communicate with customers, or help you understand your place in the marketplace. In general, whenever one needs to classify a mass of information into manageable and meaningful results, cluster analysis is a very useful technique.
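
To make the idea concrete, here is a bare-bones k-means sketch in Python. K-means is only one of many clustering methods, and this version uses a single variable (invented annual spend) for brevity, where real segmentation would use several. The starting centers are also assumptions.

```python
def k_means_1d(values, centers, iterations=20):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its old center.
        centers = [
            sum(group) / len(group) if group else centers[i]
            for i, group in enumerate(clusters)
        ]
    return centers, clusters


# Invented annual spend per customer, with two rough starting guesses.
annual_spend = [120, 150, 130, 900, 950, 1000]
centers, clusters = k_means_1d(annual_spend, centers=[100, 1000])
print(centers)
print(clusters)
```

Nobody told the code which customers were "low spenders" or "high spenders"; the two segments emerge from the data.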

Discriminant analysis is used to define which variables best differentiate between predefined groups. The key difference is that discriminant analysis relies on previously defined groups whereas cluster analysis uses the data to discover these groups.

Factor analysis finds the underlying construct behind answers to a series of questions. In other words, factor analysis is designed to classify variables. For clients, it simplifies the interpretation of answers to many questions to a few “factors” that seem to drive answers to all questions. It can be used to determine the key factors that drive aspects like satisfaction, image or customer retention. In addition, factor analysis is used when designing surveys. Often complex concepts (like “leadership”) need to be turned into a group of concrete questions in order to query meaningfully.

Regression Analysis (linear, non-linear and logistic) is widely used for forecasting. It estimates the effects of one or more variables on another. The objective of regression analysis is to understand the relationship between one or more independent (predictor) variables and a dependent (criterion) variable. This allows forecasting or estimation of the change in a dependent variable based on the change in an independent variable.
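
The simplest case, a straight line with one predictor fitted by ordinary least squares, can be sketched in Python as follows. The ad spend and sales figures are invented and deliberately fall on a perfect line, so the result is easy to check by eye.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: ad spend (independent) and sales (dependent).
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, sales)
forecast = intercept + slope * 6  # forecast sales at an ad spend of 6
print(f"sales = {intercept:.1f} + {slope:.1f} * ad_spend; forecast at 6 = {forecast:.1f}")
```

The slope is the estimated change in the dependent variable for each one-unit change in the independent variable.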


Monday, July 5, 2010

Qualitative Analytical Techniques

Qualitative Research employs special techniques that allow researchers to observe the ways respondents analyze and synthesize information. When attempting to understand people’s perceptions, beliefs, and behaviors, it is important to solicit information in a way that is meaningful to them. Rigorous sampling and advanced quantitative analysis will not remove bias introduced by a researcher who is subconsciously imposing his/her viewpoint on the study population. The following are a few methods that can be used to collect qualitative data.

Observational techniques are those where a researcher simply observes human behavior or actions first hand. This technique is useful to gain a full understanding of the context in which the behavior is taking place and also when people are unwilling or unable to verbalize the topic being evaluated.

Collecting nonverbal data requires such fineness of observational detail that special training is required to use the terminology and notational systems. Examples of nonverbal study include: Kinesics (observing detail of bodily movement) and Proxemics (social symbolic uses of space). A research study that includes analysis of nonverbal data is usually videotaped for repeated viewing (like focus group tapes).

Listing, Selecting and Sorting – Asking respondents to generate and/or sort lists, and make selections, is a way to explore their taxonomic systems (the way they organize information).

Projective Techniques are based on the understanding that people naturally project their beneath-the-surface perceptions, beliefs and personal themes in their verbal responses and behavioral styles. Many different tests and games have been devised and tested over time to help probe beneath the surface. Familiar examples include: Sentence Completion Tests, Thematic Apperception Tests, Personalization, and Role Playing.

Interviews or Discussion – These are good at collecting information that can be verbalized easily. Common settings include focus groups and in-depth interviews.



Friday, July 2, 2010

Self-Administered vs. Interviewer-Administered Surveys

One of the important factors to think about when designing or conducting a survey is who will be filling it out. A self-administered survey is really any survey where the respondent, rather than an interviewer, fills out the questionnaire. Examples of this include:
  • Intercepting a customer in a store – but having them fill out the questionnaire
  • Employee feedback surveys where they fill out surveys anonymously
  • Administering a survey to a group such as students in schools
  • Consumer product testing on the internet

And the list goes on and on. They can be conducted via a number of different modes, such as online, mail, IVR, or in-person. Self-administration simply means that the respondent (rather than an interviewer) takes on the burden of filling out a survey. This has more consequences than you probably think, some good and some bad.

On the plus side, since interviewer time is not required to complete a self-administered survey, they are usually cheaper to conduct. Surveys can also be completed more quickly when respondents fill out their own surveys rather than waiting for a limited number of interviewers. In addition, respondents are more likely to be truthful and provide information to sensitive topics when they can fill out a survey themselves and are assured of its confidentiality. If you have a survey on a sensitive topic, self-administration is likely the way to go.

On the negative side, self-administration reduces response rates, increases drop-offs and will likely increase non-response bias (because of the lower response rates). Respondents will find it easier to ignore these types of surveys and you will have to send out more surveys to get the same number of completes compared to an interviewer-administered survey.

The lack of an interviewer or researcher during a self-administered survey means that questions are more likely to be misinterpreted or an inappropriate answer given. If a respondent does not understand something, they won’t be able to get an explanation during the survey. Therefore, it is important that self-administered surveys are designed clearly and are easy for respondents to understand right from the start. Interviewers can provide feedback on a poorly designed survey or difficult-to-understand questions while the survey is being conducted, but you probably won’t know that respondents can’t understand a self-administered survey until you analyze the data and see that it’s junk.

No probing can be done by an interviewer, so self-administered surveys provide less depth for the information collected. This is fine if you are conducting a simple satisfaction or feedback survey, but might cause a problem if you wanted more detailed information from customers, employees, etc.

Despite the drawbacks, the cost and time advantages of self-administered surveys make them a commonly used method, especially when you have a simple survey to conduct.
