Data Analysis for Social Scientists

Get the grades you deserve with our step-by-step guide to your MITx MicroMasters Statistics and Data Science course: Data Analysis for Social Scientists (MITxT 14.310x)

Module 10: Endogeneity, Instrumental Variables, Experimental Design, and Data Visualization

In this part of the problem set, we are going to replicate part of the results of Joshua Angrist and William Evans' article "Children and Their Parents' Labor Supply: Evidence from Exogenous Variation in Family Size." Here is the abstract of the study:

Research on the labor-supply consequences of childbearing is complicated by the endogeneity of fertility. This study uses parental preferences for a mixed sibling-sex composition to construct instrumental variables (IV) estimates of the effect of childbearing on labor supply. IV estimates for women are significant but smaller than ordinary least-squares estimates. The IV are also smaller for more educated women and show no impact of family size on husbands' labor supply. A comparison of estimates using sibling-sex composition and twins instruments implies that the impact of a third child disappears when the child reaches age 13. (JEL J13, J22)

The purpose of this exercise is to study how fertility affects female labor supply. To do this, we compare female labor supply in households with two children against households with three children. Since fertility decisions are endogenous, we use two sets of instruments: whether the second pregnancy was a multiple pregnancy (twins), and the sex composition of the first two children. The latter instrument is the one proposed by Angrist & Evans (1998). Intuitively, parents are more likely to have a third child when the first two are of the same sex. Assuming that whether the first two children are of the same sex is as good as random, we can use this variable as an instrument for the number of children in the household.
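The instrumental-variables logic above can be sketched with simulated data. The sketch is in Python rather than the course's R, and every number in it (the true effect of -5, the coefficients, the sample size) is a made-up assumption, not a value from Angrist and Evans. With a single binary instrument, the IV estimate is the reduced-form difference in the outcome divided by the first-stage difference in fertility.

```python
import random
from statistics import fmean

random.seed(0)
TRUE_EFFECT = -5.0  # hypothetical effect of a third child on weekly hours worked

same_sex, third_child, hours = [], [], []
for _ in range(200_000):
    z = 1 if random.random() < 0.5 else 0   # instrument: first two children same sex
    u = random.gauss(0, 1)                  # unobserved trait affecting fertility AND work
    v = random.gauss(0, 1)                  # other fertility shocks
    d = 1 if 0.4 * z + u + v > 0.7 else 0   # endogenous choice: have a third child
    y = 30 + TRUE_EFFECT * d + 3.0 * u + random.gauss(0, 1)
    same_sex.append(z)
    third_child.append(d)
    hours.append(y)


def diff_in_means(outcome, group):
    """Mean of outcome where group == 1 minus mean where group == 0."""
    ones = [o for o, g in zip(outcome, group) if g == 1]
    zeros = [o for o, g in zip(outcome, group) if g == 0]
    return fmean(ones) - fmean(zeros)


# Naive comparison of three-child vs. two-child households (biased by u).
naive_estimate = diff_in_means(hours, third_child)

# IV (Wald) estimate: reduced form divided by first stage.
first_stage = diff_in_means(third_child, same_sex)
reduced_form = diff_in_means(hours, same_sex)
iv_estimate = reduced_form / first_stage

print(f"naive: {naive_estimate:.2f}  IV: {iv_estimate:.2f}  truth: {TRUE_EFFECT}")
```

In this simulation the naive comparison is contaminated by the unobserved trait, while the IV ratio should land close to the assumed true effect because the instrument is generated independently of that trait.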

Use the command summary to summarize the variables in the data. Using your output, fill in the following information:

Module 9: Practical Issues in Running Regressions and Omitted Variable Bias

Difference in differences is a statistical tool widely used by empirical economists. In this problem, we are going to replicate the results of David Card and Alan Krueger's "Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania." The accompanying data (fastfood.csv) was used to study the effects of an increase in the minimum wage on employment. Here is the abstract of the study:
On April 1, 1992, New Jersey's minimum wage rose from $4.25 to $5.05 per hour. To evaluate the impact of the law we surveyed 410 fast-food restaurants in New Jersey and eastern Pennsylvania before and after the rise. Comparisons of employment growth at stores in New Jersey and Pennsylvania (where the minimum wage was constant) provide simple estimates of the effect of the higher minimum wage. We also compare employment changes at stores in New Jersey that were initially paying high wages (above $5) to the changes at lower-wage stores. We find no indication that the rise in the minimum wage reduced employment.

Load the data into R and estimate linear models that test whether fast-food restaurants in NJ and Pennsylvania differed, before the change in the minimum wage, in the number of full-time employees and in the starting wage.

What is the average difference between fast-food restaurants located in NJ and Pennsylvania in terms of the number of full-time employees (before the change in minimum wage)?
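One way to see why a linear model answers this question: when the only regressor is a 0/1 state dummy, the OLS slope equals the difference in group means and the intercept equals the mean of the group coded 0. Here is a minimal sketch in Python rather than the course's R; the employee counts are hypothetical and are not taken from fastfood.csv.

```python
from statistics import fmean


def ols_slope_intercept(x, y):
    """Simple OLS of y on x: returns (slope, intercept)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical pre-period data: 1 = New Jersey, 0 = Pennsylvania.
nj = [1, 1, 1, 1, 0, 0, 0]
full_time = [8.0, 10.0, 7.5, 9.5, 11.0, 9.0, 10.0]

slope, intercept = ols_slope_intercept(nj, full_time)

nj_mean = fmean(f for f, s in zip(full_time, nj) if s == 1)
pa_mean = fmean(f for f, s in zip(full_time, nj) if s == 0)

print(slope, nj_mean - pa_mean)   # the two numbers coincide
print(intercept, pa_mean)         # the intercept is the Pennsylvania mean
```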

Module 8: Single and Multivariate Linear Models

For the following questions, you will need the data set nlsw88.csv. The data contains information on labor market outcomes for a representative sample of women in the US. It contains the following variables: the logarithm of the wage (lwage), total years of schooling (yrs_school), total experience in the labor market (ttl_experience), and a dummy variable that indicates whether the woman is black. Since we are going to work with this data throughout this homework, please load it into R using the command read.csv.
As a first step, we are interested in estimating the following linear model:
$$\log(\text{wage}_i) = \beta_0 + \beta_1\,\text{yrs\_school}_i + \varepsilon_i$$
Estimate this equation by OLS using the command lm. Please go to the documentation in R to understand the syntax of the command. Based on your results, answer the following questions:
According to this model, what is the estimate of β_1?
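For reference, with a single regressor the OLS estimates have a closed form. Writing x_i for years of schooling and y_i for the log wage (notation introduced here for brevity), the standard formulas are:

```latex
\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

Because the dependent variable is in logs, the slope estimate is read as an approximate proportional change: one more year of schooling is associated with a wage roughly 100 times the slope estimate percent higher, an approximation that works best when the slope is small.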

Module 7: Causality, Analyzing Randomized Experiments, & Nonparametric Regression

The following problems are based on the paper:
Duflo, Esther, Rema Hanna, and Stephen P. Ryan. 2012. "Incentives Work: Getting Teachers to Come to School." American Economic Review, 102(4): 1241-78.
In this experiment, the researchers set out to test whether providing teachers with cameras to take photos to prove their attendance could be effective in reducing teacher absenteeism. First, read the abstract of the paper using the link above. You can refer back to the paper as necessary.
Note: The dataset used to generate the Lecture 15 slides relating to this paper is slightly different than the dataset we have provided, so do not be alarmed if your answers are slightly different! 
To complete this exercise, we are providing you with the code. The code has some missing parts that you have to fill in before you can run it. The dataset that you will need is teachers_final.csv.
Let’s start by thinking through how Fisher’s ideas can be applied to evaluate this program in this context.

Suppose that after the treatment has been assigned and the experiment has been carried out, the researcher has the following data. The variable open corresponds to the fraction of days that the school was open when random visits were made.

For Questions 2-4, we will look at these 8 schools found in teachers_final.csv:
The table has two columns, treatment and open. For treatment, 0 indicates no treatment and 1 indicates treatment.

treatment   open
0           .462
1           .731
0           .571
0           .923
0           .333
1           .750
1           .893
1           .692

Assume that we define as our statistic the absolute difference in means by treatment status.
To help you compute the test statistic for the observed data, we have provided you with the R code to load in this table and generate different permutations, although it is missing some parts that you will need to fill in. We make use of the package perm, specifically the function chooseMatrix. Be sure to look up the documentation to make sure you understand what it is doing.

For this observed data, what would be the value of our statistic?
We recommend you compute this test statistic on your own and then check your answer using the code provided.
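As a cross-check on the hand calculation, the Fisher-style procedure can be sketched as follows. This is a Python sketch, not the course's R code, and it uses itertools.combinations in place of the perm package. It computes the absolute difference in means for the observed assignment and then for every way of choosing 4 treated schools out of the 8. The names open_frac, observed_stat, and p_value are illustrative.

```python
from itertools import combinations
from statistics import fmean

# The eight schools from the table: treatment status and fraction of days open.
treatment = [0, 1, 0, 0, 0, 1, 1, 1]
open_frac = [0.462, 0.731, 0.571, 0.923, 0.333, 0.750, 0.893, 0.692]


def abs_diff_in_means(treated_idx):
    """Absolute difference in mean `open` between treated and control schools."""
    treated = [open_frac[i] for i in treated_idx]
    control = [open_frac[i] for i in range(len(open_frac)) if i not in treated_idx]
    return abs(fmean(treated) - fmean(control))


# Statistic for the assignment that actually occurred.
observed_idx = tuple(i for i, t in enumerate(treatment) if t == 1)
observed_stat = abs_diff_in_means(observed_idx)

# Statistic under every possible assignment of 4 treated schools out of 8.
perm_stats = [abs_diff_in_means(idx) for idx in combinations(range(8), 4)]

# Exact p-value: share of assignments at least as extreme as the observed one
# (small tolerance guards against floating-point ties).
p_value = sum(s >= observed_stat - 1e-12 for s in perm_stats) / len(perm_stats)

print(observed_stat, len(perm_stats), p_value)
```

Enumerating all assignments is feasible here only because the sample is tiny; with more schools one would typically sample permutations at random instead.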

Module 6: Assessing and Deriving Estimators - Confidence Intervals, and Hypothesis Testing

Suppose that $X_i \overset{\text{i.i.d.}}{\sim} U[0,\theta]$. You want to build a 90% confidence interval for $\theta$. To do so, you will need an estimator for $\theta$ and you will need to know the estimator's distribution. Let's consider $\hat{\theta} = \frac{n+1}{n} X_{(n)}$. (Remember that $X_{(n)}$ is the $n$th order statistic.) This estimator is a variant of the MLE: we have taken the $n$th order statistic, which is the MLE, and multiplied it by $\frac{n+1}{n}$ to remove its bias. Its PDF is $\frac{n^{n+1}}{(n+1)^n} \frac{x^{n-1}}{\theta^n}$ for $x \in \left[0, \frac{n+1}{n}\theta\right]$ and 0 otherwise.

Let $a$ be a function of $n$ and $\theta$ such that 5% of the distribution of $\hat{\theta}$ is to the left of $a$.
$a$ is then given by
$$a = \sqrt[n]{A}\,\frac{B}{C}\,\theta$$
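The expression for a follows from integrating the PDF stated above to get the CDF of the estimator and setting it equal to 0.05. A short derivation, using only the PDF given in the problem:

```latex
F_{\hat{\theta}}(a)
  = \int_0^a \frac{n^{n+1}}{(n+1)^n}\,\frac{x^{n-1}}{\theta^n}\,dx
  = \left(\frac{n\,a}{(n+1)\,\theta}\right)^{n},
  \qquad 0 \le a \le \frac{n+1}{n}\,\theta .

F_{\hat{\theta}}(a) = 0.05
  \;\Longrightarrow\;
  a = \sqrt[n]{0.05}\;\frac{n+1}{n}\;\theta .
```

The upper cutoff, with 5% of the distribution to its right, is obtained the same way by setting the CDF equal to 0.95.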

Module 5: Special Distributions, the Sample Mean, the Central Limit Theorem

A manufacturer receives a shipment of 100 parts from a vendor. The shipment will be unacceptable if more than five of the parts are defective. The manufacturer is going to randomly select K parts from the shipment for inspection, and the shipment will be accepted if no defective parts are found in the sample.

Now suppose that the manufacturer decides to accept the shipment if there is at most one defective part in the sample. How large does K have to be to ensure that the probability that the manufacturer accepts an unacceptable shipment is less than 0.1? As above, a shipment is unacceptable if there are more than 5 defective parts.
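Sampling without replacement from a finite shipment is a hypergeometric problem, so the acceptance probability can be computed exactly with binomial coefficients. The sketch below, in Python rather than the course's R, assumes the worst unacceptable case is a shipment with exactly 6 defective parts, since shipments with more defectives are even less likely to be accepted. The function names are illustrative.

```python
from math import comb


def accept_prob(k, defectives, lot=100, max_bad=1):
    """P(at most max_bad defective parts in a sample of k drawn without replacement)."""
    good = lot - defectives
    ways = sum(
        comb(defectives, j) * comb(good, k - j)   # j defective and k - j good parts
        for j in range(min(max_bad, k) + 1)
    )
    return ways / comb(lot, k)


def smallest_k(threshold=0.1, defectives=6, lot=100, max_bad=1):
    """Smallest sample size whose acceptance probability falls below threshold."""
    for k in range(1, lot + 1):
        if accept_prob(k, defectives, lot, max_bad) < threshold:
            return k
    return None


k_needed = smallest_k()
print(k_needed, accept_prob(k_needed, 6))
```

Setting max_bad=0 gives the first version of the problem, in which the shipment is accepted only if the sample contains no defective parts.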


Module 4: Functions and Moments of Random Variables & Intro to Regressions

For each of the following expressions, find E[X].

Module 3: Describing Data, Joint and Conditional Distributions of Random Variables

To calculate summary statistics for a group of variables, there are a few different commands. The command mean() is just one example of the different options available. Now, we ask you to go through the R documentation and explore some of the other commands by yourself.
If you want to store the output as values in your dataset, or if you want to do something more complicated (e.g., generate these by group, or use one of the dplyr summary functions such as n_distinct()), you can use any of the basic summary functions, as well as others, in combination with mutate() and summarise() to generate variables in your dataset containing summary values.
Now that you’ve learned how to look at and generate summary statistics, answer the following questions.

What are the sample mean and standard deviation of the adolescent fertility rate in 2000?

Module 2: Fundamentals of Probability, Random Variables, Joint Distributions + Collecting Data

Let the conditional probability we computed (1.9%) serve as the new prior. Compute the new probability that she has the virus (new posterior) based on her receiving a second positive test. Please use 1.9% as the prior.
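The sequential update can be sketched with Bayes' rule, treating the first posterior as the prior for the second test. This is a Python sketch rather than the course's R. The excerpt does not state the test's accuracy, so SENSITIVITY and FALSE_POSITIVE_RATE below are hypothetical placeholders to be replaced with the values from the problem. The sketch also assumes the two test results are independent given her true status.

```python
def posterior(prior, p_pos_given_virus, p_pos_given_healthy):
    """P(virus | positive test) by Bayes' rule."""
    numerator = p_pos_given_virus * prior
    evidence = numerator + p_pos_given_healthy * (1 - prior)
    return numerator / evidence


PRIOR = 0.019               # posterior from the first test, as stated in the problem
SENSITIVITY = 0.95          # hypothetical P(positive | virus); not from the source
FALSE_POSITIVE_RATE = 0.05  # hypothetical P(positive | no virus); not from the source

new_posterior = posterior(PRIOR, SENSITIVITY, FALSE_POSITIVE_RATE)
print(new_posterior)
```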

Module 1: Introduction to the Course

If you run the following code in R, what does the object my_sqrt contain?
z <- c(pi, 205, 149, -2)
y <- c(z, 555, z)
y <- 2 * y + 760
my_sqrt <- sqrt(y - 1)


theexamhelper is more than a tutor and an answer key for your exams and studies; it has value beyond scoring well in your MITx work.
In fact, if you want to earn the credential, avoid wasting your money on the way to admission into MIT's SCM program, and have the best learning experience possible, you need to use theexamhelper to its full potential. That applies to the course materials as well as the supplemental materials, including theexamhelper's Solution Key, which has explanations and solutions.
What Are the Benefits of Using theexamhelper's Solution Key?

There are six main benefits from following this process for completing and reviewing your work.
  • Enhanced Understanding of the Concepts Covered
  • Improved Self-teaching Skills
  • Advanced Progress Tracking
  • Get high scores for your exams
  • Become a Super Learner
  • Get admitted into MIT's Master of Applied Science in Supply Chain Management
