*****************************************
* Cluster Analysis Paper ****************
* MMP DATA PREP, 2009 *******************
* Date created: October 4, 2009 *********
* Last update: December 21, 2009 12pm ***
*****************************************

* NOTE - The code here is the same as mmp_cluster_data9_final.do ********************
* The only difference is that we are keeping migrants and non-migrants in ***********
* the sample on the survey year. We are essentially comparing those who've migrated *
* at least once to those who have never migrated. ***********************************
* We also correct the computation of hh migrants in the sample **********************
*************************************************************************************

* IMPORTANT DECISIONS *
*************************

* 1. SAMPLE - We start with the 124 communities and eliminate the 4 interviewed in 1982 (Kandel's
* recommendation). We also exclude the sample surveyed in the U.S. (Munshi 2000) because it is non-random.
* I am not sure about the latter - WE COULD INCLUDE THE U.S. SAMPLE ALTHOUGH IT IS NOT RANDOM

* 		a. MIGRANTS - To study migrants alone, we use PERS data (instead of LIFE) because it contains 
*		information on all household members, and their first moves. (LIFE contains information on all
* 		moves made by household heads alone.)

*		b. MIGRANTS AND NON-MIGRANTS - To study migrants and non-migrants (i.e., to use clustering to 
*		discriminate between them), we use LIFE data, and include all migration moves made by hh heads.

*				i. First strategy could be Gary King's idea - discover clusters using data from the first
*				few periods, then apply them to the rest of the data. Then discover clusters using data from
*				the last period, and apply retrospectively to earlier periods. Show the different results.

*				ii. Second, similar to Nagin et al., we can study migration from a life-course perspective,
*				and discover different trajectories (i.e., non-migrants, those who migrate once, and those
*				who migrate repeatedly). Again, the LIFE data would be of interest.

* 2. HOUSEHOLD WEALTH - We use the data set house_wealth.dta generated in mmp_hhwealth_data124.do.

* 3. REMITTANCE DATA - We create a restricted data set with household heads on their last trips (to use
* the remittance information).

clear
set mem 2000m
set matsize 5000
set more off

cd "/Users/fgarip/Documents/Data/MMP Data"
* cd "/Volumes/s-vol$/Home/Faculty/fgarip/Documents/Data/MMP Data"
* cd "M:\Documents\Data\MMP Data\" 

* Sort Data To Be Merged *
**************************

use "Clustering\usjanwage.dta", clear
sort year
save "Clustering\usjanwage.dta", replace

use "Clustering\init_data.dta", clear
keep commun hhnum persnum room lnland business
ren room room2
ren lnland lnland2
ren business business2
sort commun hhnum persnum
save "Clustering\dataid_temp.dta", replace

**********************************
***** (1) NATIONAL-YEAR DATA *****
**********************************

use "Data\natlyear.dta", clear
sort year

* Get the US wage data from BLS - Massey and Espinosa data in natlyear.dta is not
* reliable. It is not clear if it is measured in real or nominal terms. So, I use data
* on the average hourly earnings of production workers (nominal) and convert to real 
* dollars ($US in 2000)

* Fill in the missing years' data (2007 and 2008)

replace exchrate = 10.928 if year==2007
replace exchrate = 11.156 if year==2008
replace const00 = 0.83051 if year==2007
replace const00 = 0.79980 if year==2008
		
merge year using "Clustering\usjanwage.dta"
keep if _merge==3
drop _merge

replace usjanwage = usjanwage*const00
lab var usjanwage "Average hourly wages (prod workers) in the U.S. ($ in 2000)"

 * Generate a constant peso indicator for Mexico

gen const00_mx = 1
gsort -year
replace const00_mx = const00_mx[_n-1]*(1+infrate[_n-1]/100) if year<2000
gsort year
replace const00_mx = const00_mx[_n-1]/(1+infrate[_n-1]/100) if year>2000
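
	* Sketch (not part of the original computation): const00_mx is chained outward from the
	* base year 2000, where it equals 1. Assuming infrate is annual inflation in percent
	* (the /100 above suggests so), the recursion is
	*     year < 2000:  const00_mx[t] = const00_mx[t+1] * (1 + infrate[t+1]/100)
	*     year > 2000:  const00_mx[t] = const00_mx[t-1] / (1 + infrate[t-1]/100)
	* Listing a few years around 2000 shows whether the chain looks sensible.

list year infrate const00_mx if inrange(year, 1997, 2003), clean noobs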

	* Other National Indicators *
	* These variables are typically available until year 2002.

replace lwhrs = . if lwhrs==8888
replace probapp = . if probapp==8888
replace visaaccs = . if visaaccs==8888
replace insbudgt = . if insbudgt==8888
replace bpebudgt = . if bpebudgt==8888
replace mexrlint = . if mexrlint==8888
replace mxunemp = . if mxunemp==8888 | mxunemp==9999
replace usunemp = . if usunemp==8888
replace totusemp = . if totusemp==8888 
replace dfinvest = . if dfinvest==8888
replace exports = . if exports==8888
replace imports = . if imports==8888
replace tradebal = . if tradebal==8888

	* Convert exports, imports, trade bal to real dollars

replace exports = exports*const00
replace imports = imports*const00
replace tradebal = tradebal*const00

gen lnexports = ln(exports)
gen lnimports = ln(imports)

lab var exports "exports to Mexico in real 2000 US$"
lab var imports "imports from Mexico in real 2000 US$"

lab var lnexports "log of exports in real 2000 US$"
lab var lnimports "log of imports in real 2000 US$"

	* Trade growth (Import + Export)

gen trade = exports + imports
lab var trade "trade in real 2000 US$"

gen lntrade = ln(trade)
lab var lntrade "log of trade in real 2000 US$"

gen tradegr = .
sort year
replace tradegr = (trade[_n]-trade[_n-1])/trade[_n-1] if _n>1
lab var tradegr "trade (exp + imp) growth - annual"

gen tradegr_ma = tradegr
sort year
replace tradegr_ma = (tradegr[_n-2] + tradegr[_n-1] + tradegr[_n])/3 if _n>2
lab var tradegr_ma "trade (exp + imp) growth - mov av (3yr)"

	* Convert INS and BPE budget to real dollars

replace insbudgt = insbudgt*const00
lab var insbudgt "ins budget in real 2000 US$"

replace bpebudgt = bpebudgt*const00
lab var bpebudgt "border patrol enforcement budget in real 2000 US$"

gen lnins = ln(insbudgt)
gen lnbpe = ln(bpebudgt)
gen lnlw = ln(lwhrs)
lab var lnins "log of ins budget in real 2000 US$"
lab var lnbpe "log of border patrol enforcement budget in real 2000 US$"
lab var lnlw "log of line watch hours"

	* Convert FDI to real dollars (I am not sure if it is in real or nominal dollars)

replace dfinvest = dfinvest*const00
gen dfigr = .
sort year
replace dfigr = (dfinvest[_n]-dfinvest[_n-1])/dfinvest[_n-1] if _n>1
lab var dfigr "FDI growth annual"

	* Compute Moving Average (in 3 year window) for FDI to Mexico

gen dfigr_ma = dfigr
sort year
replace dfigr_ma = (dfigr[_n-2] + dfigr[_n-1] + dfigr[_n])/3 if _n>2
lab var dfigr_ma "FDI growth - mov av (3yr)"

	* Compute US employment growth

gen usempgr = .
sort year
replace usempgr = (totusemp[_n] - totusemp[_n-1])/totusemp[_n-1] if _n>1
lab var usempgr "U.S. employment growth annual"
	
	* Compute Moving Average (in 3 year window) for US employment growth
	
gen usempgr_ma = usempgr
sort year
replace usempgr_ma = (usempgr[_n-2] + usempgr[_n-1] + usempgr[_n])/3 if _n>2
lab var usempgr_ma "U.S. employment growth - mov av (3yr)"

save "Clustering\natlyear_temp.dta", replace

*****************************
* (2) HOUSEHOLD WEALTH DATA *
*****************************
use "Clustering\house_wealth.dta", clear
sort commun hhnum year

	* Compute commun inequality by land & property *
	************************************************

bys commun year: gen ssize = _N

gsort commun year -troom
bys commun year: gen rank = _n
by commun year: egen mnroom = mean(troom)
gen wrank = rank*troom
by commun year: egen swrank = sum(wrank)
gen rgini = (ssize+1)/(ssize-1)
replace rgini = rgini - 2/(ssize*(ssize-1)*mnroom)*swrank
drop rank wrank swrank
lab var rgini "rooms gini by community-year"

gsort commun year -vland
bys commun year: gen rank = _n
by commun year: egen mnland = mean(vland)
gen wrank = rank*vland
by commun year: egen swrank = sum(wrank)
gen lgini = (ssize+1)/(ssize-1)
replace lgini = lgini - 2/(ssize*(ssize-1)*mnland)*swrank
drop rank wrank swrank ssize
lab var lgini "land value gini by community-year"
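
	* Sketch (illustration only, on made-up data): the formula used above for rgini and lgini,
	*     G = (n+1)/(n-1) - 2/(n*(n-1)*mean) * sum(rank*x),  rank 1 = largest value,
	* is the Gini coefficient with the small-sample correction n/(n-1). The toy values and
	* the names x, n, mn, sw, g are hypothetical. For x = 1, 2, 3, 4 the formula gives 1/3.
	* preserve/restore leaves the household wealth data untouched.

preserve
clear
input x
1
2
3
4
end
gen n = _N
gsort -x
gen rank = _n
egen mn = mean(x)
gen wrank = rank*x
egen sw = total(wrank)
gen g = (n+1)/(n-1) - 2/(n*(n-1)*mn)*sw
display "toy Gini = " %6.4f g[1]
restore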

	* Relative deprivation by land & property *
	*******************************************

	* prop is the proportion of households with MORE property than the index household.
	* It should be the same for all households with the same amount of property,
	* i.e., ties are treated the same in ranking.

sort commun year troom
gen prop = .
by commun year: replace prop = (_N-_n)/_N 

gsort commun year troom -prop
by commun year troom: replace prop = prop[_N]

	* mnex is the mean excess in rooms, i.e., the mean number of rooms among
	* households with more rooms, minus the household's own rooms

gsort commun year -troom
gen mnex = troom
by commun year: replace mnex = troom[_n-1] if _n==2
by commun year: replace mnex = troom[_n-1]*1/(_n-1) + mnex[_n-1]*(_n-2)/(_n-1) if _n>2

gsort commun year troom mnex
by commun year troom: replace mnex = mnex[_N] 

by commun year: replace mnex = mnex-troom
gen rdroom = prop*mnex
lab var rdroom "hh rel depr wrt room in commun-year [0,1]"

drop mnex prop


sort commun year lnvland
gen prop = .
by commun year: replace prop = (_N-_n)/_N 

gsort commun year lnvland -prop
by commun year lnvland: replace prop = prop[_N]

gsort commun year -lnvland
gen mnex = lnvland
by commun year: replace mnex = lnvland[_n-1] if _n==2
by commun year: replace mnex = lnvland[_n-1]*1/(_n-1) + mnex[_n-1]*(_n-2)/(_n-1) if _n>2

gsort commun year lnvland mnex
by commun year lnvland: replace mnex = mnex[_N] 

by commun year: replace mnex = mnex-lnvland
gen rdland = prop*mnex
lab var rdland "hh rel depr wrt land value in commun-year [0,1]"

drop mnex prop
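
	* Sketch (illustration only, on made-up data): relative deprivation as computed above is
	* (share of households with strictly more) x (mean among those households - own value).
	* The toy values and variable names are hypothetical, and ties are resolved here with
	* bysort ... (var) rather than gsort, which should be equivalent. For x = 5, 2, 2, 1 the
	* result is 0 for x==5, 0.75 for x==2 (1/4 * (5-2)) and 1.5 for x==1 (3/4 * (3-1)).

preserve
clear
input x
5
2
2
1
end
sort x
gen prop = (_N-_n)/_N
bysort x (prop): replace prop = prop[1]
gsort -x
gen mnex = x
replace mnex = x[_n-1] if _n==2
replace mnex = x[_n-1]*1/(_n-1) + mnex[_n-1]*(_n-2)/(_n-1) if _n>2
bysort x (mnex): replace mnex = mnex[_N]
gen rd = prop*(mnex - x)
list x prop mnex rd, clean noobs
restore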
sort commun hhnum yearsave "Clustering\house_wealth_temp.dta", replace


***********************
* (3) REMITTANCE DATA *
***********************

* This information is only available in mig124.dta for household 
* heads during the last trip. We use it as an outcome variable.

use "Data\mig124.dta", clear

* As Durand et al. (1996) argue, migrants may either send monthly remittances (most U.S. settlers
* do this), or they may bring back their savings on their return (most temporary migrants do this).

gen year = usyrl
replace year = . if year==9999
replace remit = . if remit==9999
gen remitfl = remit
replace remitfl = 1 if remit>0 & remit~=.
lab var remitfl "sent remittances? (0/1)"

	* Both remittances and savings are measured monthly
	* Keep them as is

	* Monthly savings are recorded in 'savings' variable. We do not know
	* if these savings were all brought to Mexico. So, we use 'savretrn' variable
	* which records total savings brought to Mexico upon return. We divide it by
	* the U.S. last visit duration, and compute monthly values.
	
	replace usdurl = . if usdurl==9999
	replace savings = . if savings==9999
	
	replace savretrn = . if savretrn==9999
	gen savm = savretrn/usdurl

	gen savfl = savm
	replace savfl = 1 if savm>0 & savm~=.

	lab var savm "average monthly savings brought to Mexico"
	lab var savfl "brought savings to Mexico? (0/1)"
	
	
	* Add remittances and savings, compute logs
	* Use rowtotal() which treats missing as 0, then replace with missing only if both are missing.

	egen remsavm = rowtotal(remit savm)
	replace remsavm = . if remit==. & savm==.

	* IMPORTANT NOTE - remsav variable is converted to constant 2000 US$ below
	**************************************************************************
	
gen remsavfl = .
replace remsavfl = 1 if remitfl==1 | savfl==1
replace remsavfl = 0 if remitfl==0 & savfl==0
replace remsavfl = 0 if remitfl==0 & savfl==.
replace remsavfl = 0 if remitfl==. & savfl==0
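
	* Sketch (optional check, not in the original): tabulate the two component flags and the
	* combined flag, including missing values, to confirm that remsavfl is 1 when either flag
	* is 1, 0 when neither is 1 and at least one is observed, and missing when both are missing.

tab remitfl savfl, missing
tab remsavfl, missing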

	* Illegal crossing information *
	********************************
	
	* Note a migrant may have crossed illegally several times within a year. Our goal is to record
	* whether the migrant crossed illegally in his or her first trip to the U.S. Check for the year of the
	* second or third attempt only if the prior attempts were unsuccessful (i.e., crsyes#==2)
	
	* First determine the undocumented migrants on their first trip
	
	gen undoc = 0
	replace undoc = 1 if crsyr1==usyr1 & crsyr1<8888
	replace undoc = 1 if crsyr2==usyr1 & crsyr2<8888 & crsyes1==2
	replace undoc = 1 if crsyr3==usyr1 & crsyr3<8888 & crsyes1==2 & crsyes2==2
	lab var undoc "Migrant (hh head) undocumented on first trip?"
	
	* Used coyote? (All migrants that attempt to cross illegally do so by their third trip.)
	
	gen coyote = 0 if undoc==1
	replace coyote = 1 if undoc==1 & (crscoy1==1 | crscoy2==1 | crscoy3==1)
	lab var coyote "Undocumented migrant (hh head) used coyote on first trip?"

	* How did s/he cross the border?
	
	gen crosshow = .
	replace crosshow = crshow1 if undoc==1 & crshow1<8888
	replace crosshow = crshow2 if undoc==1 & crshow2<8888 & crsyes1==2
	replace crosshow = crshow3 if undoc==1 & crshow3<8888 & crsyes1==2 & crsyes2==2
	
	gen crossfamfr = .
	replace crossfamfr = 1 if crosshow>=2 & crosshow<=4
	replace crossfamfr = 0 if crosshow==1 | crosshow==5
	lab var crossfamfr "Undoc migrant (hh head) crossed with family or friends (or alone or w strangers)?"
	
	* Where did s/he cross the border?
	
	gen cstate = .
	replace cstate = crsst1 if undoc==1 & crsst1<1111
	replace cstate = crsst2 if undoc==1 & crsst2<1111 & crsyes1==2
	replace cstate = crsst3 if undoc==1 & crsst3<1111 & crsyes1==2 & crsyes2==2
	
	gen crosstij = 0 if cstate>2 & cstate~=.
	replace crosstij = 1 if cstate==2
	lab var crosstij "Undoc migrant (hh head) crossed @ Tijuana?"
		
	gen persnum = 1

	* There is one individual who is observed twice. Drop it.

	bys commun hhnum persnum: drop if _n==2

keep commun hhnum persnum year remit remitfl savretrn savm savfl remsavfl remsavm undoc coyote crossfamfr crosstij 
sort commun hhnum persnum
save "Clustering\mig124_temp.dta", replace


*****************************
* (4) CONTEXTUAL INDICATORS *
*****************************

	* (a) Rainfall *
	****************
	
	* These data are collected at the community-level by Jessica Roman-Salazar.
	* Missing observations (about half) are completed by state-level indicators.

	use "Clustering\all_rain_red2.dta", clear
	sort commun 
	merge commun using "Clustering\commun_statenum.dta"
	keep if _merge==3
	drop _merge
	
	sort statenum
	merge statenum using "Data\environs.dta", keep(annual*)
	keep if _merge==3
	drop _merge
	
	bys commun year: keep if _n==1

	rename rain crain

	* Compute annual rainfall by state from environs data
	* Generate a time-specific rainfall variable	gen rain = .	forvalues i=41(1)99	{		replace rain = annual`i' if year==19`i'	}		forvalues i=0(1)5	{		replace rain = annual0`i' if year==200`i'	}	* 2006-2008 values for rainfall are missing - assume they are equal to the 2005 value.	sort commun year	by commun: replace rain = rain[_n-1] if year>=2006		lab var rain "annual rainfall in centimeters"	
	
	* Replace the missing and incomplete values of community-level rainfall with 
	* state-level data
	
	replace crain = rain if crain==. | incomplete==1
	
	* There are 9 observations with very high values of rainfall. Set these to the state average
	replace crain = rain if crain>2500
	
	replace crain = crain/100
	lab var crain "annual rainfall to community in meters"
	
	* Create lagged values of rainfall

	gen crain1 = crain
	sort commun year
	by commun: replace crain1 = crain[_n-1] if _n>1
	lab var crain1 "annual rainfall to community in meters (t-1)"

	gen crain2 = crain1
	sort commun year
	by commun: replace crain2 = crain1[_n-1] if _n>1
	lab var crain2 "annual rainfall to community in meters (t-2)"

	gen crain3 = crain2
	sort commun year
	by commun: replace crain3 = crain2[_n-1] if _n>1
	lab var crain3 "annual rainfall to community in meters (t-3)"

	gen recent_crain = (crain1 + crain2 + crain3)/3
	lab var recent_crain "average rainfall to community in past 3 years (t-1 to t-3)"

	sort commun year
	save "Clustering\rain_temp.dta", replace

	* (b) Distance *
	****************
	
	use "Clustering\commun_distance.dta", clear
	
	sort commun
	gen km = distance/10000
	lab var km "10,000km to U.S. border"
	sort commun
	
	save "Clustering\distance_temp.dta", replace


	* (c) Community-level Indicators *
	**********************************

	use "Data\commun124.dta", clear

	* Use only the community indicators that are available over time (note - most
	* community indicators apply only to the survey year, which is different than 
	* the year of first migration for most migrants).

	keep commun pratio* lfp* agri* manu* serv* self* ltmin* minx* metrocat compop* manprdct hctirlnd ///
		 hctrnlnd polcat  yrprim yrsecon yrbank1 yrpave yrpavehw ejido dtaward1 dtawardl awardsz hctejido
		 
	sort commun

	save "Clustering\commun124_temp.dta", replace


* SAMPLE I - with PERS DATA *
*****************************
use "Data\pers124.dta", clear
keep commun hhnum persnum hhmemshp surveypl surveyyr relhead sex age marstat edyrs occ hhincome ldowage ///
	usborn usyr1 usyrl usdur1 usdurl usdoc1 usdocl usstate1 usstatel usplace1 usplacel usmar1 usmarl ///
	usocc1 usoccl uswage1 uswagel usby1 usbyl uscurtrp ustrips usexp legyrapp legyrrec legspon ///
	cityrapp cityrrec doyr1 dodur1 dostate1 doplace1 doocc1 doyrl dodurl dostatel doplacel dooccl ///
	dowagel dobyl docurtrp dotrips

	* Initial Sample Selection *
	****************************
	
	* Compute the sample size in each community

bys commun: gen csize = _N	

	* Keep the migrants and the nonmigrants - all on survey year
	* We are essentially comparing those who've migrated at least
	* once to those who have never migrated.

gen mig = 0
replace mig = 1 if usyr1 < 8888	

replace usyr1 = . if usyr1==8888 | usyr1==9999
gen year = surveyyr

	* Drop individuals whose relationship to hh head is unknown

drop if relhead==8888 | relhead==9999
	
	* Compute the ages of migrants at the time of their first migration. 
		
replace age = . if age==9999 | age==8888
gen agem = age - (surveyyr-usyr1) 
lab var agem "age at the time of first migration"
	
	* Keep only the individuals who were between 15 and 65 years old at the time of first migration.
	* This is to make sure that moves are not associational (children or elderly migrating with other hh members).
	* Non-migrants have a missing agem and are retained.
	
drop if (agem<15 | agem>65) & agem~=.

gen educ = edyrs
replace educ = . if educ==8888 | educ==9999
		
		* There are 10 observations with the same commun hhnum persnum - drop them
		
		bys commun hhnum persnum: keep if _n==1
		
* MERGE *
*********

sort commun 
merge commun using "Clustering\commun124_temp.dta"
drop _merge

sort commun year
merge commun year using "Clustering\rain_temp.dta", keep(crain* recent_crain rain)
drop if _merge==2
drop _merge

sort commun
merge commun using "Clustering\distance_temp.dta", keep(km)
drop  if _merge==2
drop _merge

sort year 
merge year using "Clustering\natlyear_temp.dta"
drop if _merge==2
drop _merge
 
sort commun hhnum yearmerge commun hhnum year using "Clustering/house_wealth_temp.dta"drop if _merge==2drop _merge

sort commun hhnum persnum
merge commun hhnum persnum using "Clustering/mig124_temp.dta"
drop if _merge==2 
* Note that these observations are those outside the time frame 1965-2005
drop _merge
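
	* Sketch (optional check, not in the original): after the merges each individual should
	* still appear exactly once. capture noisily reports a problem (duplicate or missing ids)
	* without stopping the do-file. The counts show the migrant / non-migrant split.

capture noisily isid commun hhnum persnum
count if mig==1
count if mig==0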

* VARIABLES OF INTEREST *
*************************

	* Demographic Characteristics *
	*******************************
	
replace sex = 0 if sex==2

gen educcat = 0
replace educcat = 1 if educ>=6 & educ<9
replace educcat = 2 if educ>=9 & educ<12
replace educcat = 3 if educ>=12 & educ<16
replace educcat = 4 if educ>=16

lab define educlab 0 "less than pri" 1 "pri" 2 "some sec" 3 "secondary" 4 "adv"
lab val educcat educlab

	* Occupations in Origin and Destination *
	*****************************************

	* NOTE - Origin occupation is measured in the survey year
	
rename occ o
rename usocc1 u

gen     mexocc = .
replace mexocc = 1 if o<=99
replace mexocc = 2 if o>=110 & o<=129
replace mexocc = 3 if o>=130 & o<=219
replace mexocc = 4 if o>=410 & o<=419
replace mexocc = 5 if o>=510 & o<=549
replace mexocc = 6 if o>=550 & o<=839

gen     usocc = .
replace usocc = 1 if u<=99
replace usocc = 2 if u>=110 & u<=129
replace usocc = 3 if u>=130 & u<=219
replace usocc = 4 if u>=410 & u<=419
replace usocc = 5 if u>=510 & u<=549
replace usocc = 6 if u>=550 & u<=839
rename o occ
drop u

lab var mexocc "Occupation in Mexico in Survey year"
lab var usocc  "Occupation in the US"
lab define occlab 1 "unemployed" 2 "prof/tech" 3 "educ, arts, admin" ///
			4 "agriculture" 5 "manufacturing" 6 "service"
lab val mexocc occlab
lab val usocc  occlab

gen mx_none  = (mexocc==1)
gen mx_agri  = (mexocc==4)
gen mx_manuf = (mexocc==5)
gen mx_serv  = (mexocc==6)
gen mx_oth   = (mexocc==2 | mexocc==3)

gen us_none  = (usocc==1)
gen us_agri  = (usocc==4)
gen us_manuf = (usocc==5)
gen us_serv  = (usocc==6)
gen us_oth   = (usocc==2 | usocc==3)

	* Domestic Migration Experience *
	*********************************

gen mxmig = 0
replace mxmig = 1 if year>=doyr1 & doyr1~=8888 & doyr1~=9999
lab var mxmig "Individual migrated within Mexico?"

	* We can know the total number of mexican trips only if the last mexican trip
	* took place before the first U.S. migration trip
	
gen mxtrip = .
replace mxtrip = dotrips if year>=doyrl & doyrl~=8888 & doyrl~=9999 & dotrips~=8888 & dotrips~=9999
lab var mxtrip "No of domestic trips (if last trip took place b4 first U.S. trip)"


	* Prior migrants in the household *
	***********************************

	* For each year, compute the total number of migrants from a household.
	* This variable, miginy, should equal the total for only one migrant from each
	* year so that the cumulative sums for household over time are not inflated.
	* See the example of commun==3 & hhnum==166

	* list commun hhnum year persnum mig if commun==3 & hhnum==166

	* Create a dummy for migration until last year (i.e., exclude individuals whose
	* first migration is the survey year) - We do this to be consistent with the prior
	* definition of hhmig - number of migrants until last year

gen mig_temp = 0
replace mig_temp = 1 if year >= usyr1+1

bys commun hhnum: egen hhmig = total(mig_temp) 

	* Exclude the index individual

replace hhmig = hhmig - mig_temp

	* list commun hhnum year persnum mig hhmig if commun==3 & hhnum==166

lab var hhmig "ever migrants from hh (excl. ind) prior to survey year"

	* Prior LEGAL migrants in the household *
	*****************************************

	* Flag individuals who have been legalized until last year

gen leg = 0
replace leg = 1 if year >= legyrrec+1
replace leg = 1 if year >= cityrrec+1

bys commun hhnum: egen nhhleg = total(leg)

	* Exclude the index individual

replace nhhleg = nhhleg - leg
lab var nhhleg "number of hh mems legalized prior to survey year"
	
gen hhleg = nhhleg>0
lab var hhleg "any hh mems legalized prior to survey year"


	* Compute the total number of not-legalized migrants in hh *
	************************************************************
	
gen nhhnleg = 0
replace nhhnleg = hhmig - nhhleg
	
		* For about 200 observations we have a negative number - possibly measurement error
		* Correct
		
		replace nhhnleg = 0 if nhhnleg < 0
		
gen hhnleg = nhhnleg>0
lab var hhnleg "any hh migs not legalized prior to survey year"
lab var nhhnleg  "number of hh migrants not legalized prior survey year"
		
	
	* Prior migrants in the community sample *
	******************************************

sort commun year
by commun: egen cmig = total(mig_temp)

gen pcmig = cmig/csize
lab var pcmig "proportion of migrants in community sample up to t-1"
	
	
	* Household Wealth *
	********************

* Lag wealth variables - Land, Property, Business *

gen tlandlag = tlandgen vlandlag = vlandgen lnvlandlag = lnvlandgen tproplag = tpropgen troomlag = troomgen lntroomlag = lntroomgen tbuslag = tbus	sort commun hhnum yearby commun hhnum: replace tlandlag = tland[_n-1] if _n>1by commun hhnum: replace vlandlag = vland[_n-1] if _n>1by commun hhnum: replace lnvlandlag = lnvland[_n-1] if _n>1by commun hhnum: replace tproplag = tprop[_n-1] if _n>1by commun hhnum: replace troomlag = troom[_n-1] if _n>1by commun hhnum: replace lntroomlag = lntroom[_n-1] if _n>1by commun hhnum: replace tbuslag = tbus[_n-1] if _n>1


	* Migration Prevalence in Community *
	*************************************

gen prev = .
replace prev = pratio50 if year>=1950 & year<1960
replace prev = pratio60 if year>=1960 & year<1970
replace prev = pratio70 if year>=1970 & year<1980
replace prev = pratio80 if year>=1980 & year<1990
replace prev = pratio90 if year>=1990 & year<2000
replace prev = pratio00 if year>=2000
lab var prev "Mig prevalence in community in decade"

	* Community Indicators *
	************************

gen prisch = 0
replace prisch = 1 if year>yrprim
lab var prisch "pri school in commun in year?"

gen secsch = 0
replace secsch = 1 if year>yrsecon
lab var secsch "sec school in commun in year?"

gen bank = 0
replace bank = 1 if year>yrbank1
lab var bank "any bank in community in year?"

gen road = 0
replace road = 1 if year>yrpave
lab var road "paved roads in community in year?"

gen roadhw = 0
replace roadhw = 1 if year>yrpavehw
lab var roadhw "paved road from commun to highway in year?"

* Indicator for ejido - a program that gives land - usually taken as an 
* incentive to migrate to obtain financing to work the land (new economics)

* Ejido indicator applies to the survey year - to determine whether ejido was
* established at the year of first migration for each individual, we use 'dtaward1'
* variable. This is missing for about 4000 observations. If ejido = 1 for these observations,
* assume ejido existed through the period the community is observed (the assumption is that,
* if informants cannot recall it, it was many years ago.) If ejido = 9999 (missing), assume
* ejido was NOT established.

gen ejidoy = 0
replace ejidoy = 1 if year>=dtaward1 & dtaward1~=8888 & dtaward1~=9999
replace ejidoy = 1 if ejido==1 & dtaward1==9999 
lab var ejidoy "ejido in community-year?"

	* Community Economic Indicators *
	*********************************

gen compop = .
replace compop = compop60 if year>=1960 & year<1970
replace compop = compop70 if year>=1970 & year<1980
replace compop = compop80 if year>=1980 & year<1990
replace compop = compop90 if year>=1990 & year<2000
replace compop = compop00 if year>=2000
lab var compop "population of community 50-00"

gen lncompop = ln(compop)
lab var lncompop "log of population of community 50-00"

lab def metrolab 1 "metropolitan" 2 "smaller urban" 3 "town" 4 "rancho"
lab val metrocat metrolab

gen minx2 = .
replace minx2 = minx270 if year>=1965 & year<1980 & minx270~=8888
replace minx2 = minx280 if year>=1980 & year<1990 & minx280~=8888 & minx280~=9999
replace minx2 = minx290 if year>=1990 & year<2000 
replace minx2 = minx200 if year>=2000 
lab var minx2 "prop. lf earning 2x min wage 70-00"

gen self = .
replace self = self60 if year>=1960 & year<1970 & self60~=8888
replace self = self70 if year>=1970 & year<1980 & self70~=8888
replace self = self80 if year>=1980 & year<1990 & self80~=8888
replace self = self90 if year>=1990 & year<2000 & self90~=8888
replace self = self00 if year>=2000
lab var self "prop. lf self-employed 50-00"

gen manuf = .
replace manuf = manuf60 if year>=1960 & year<1970 & manuf60~=8888
replace manuf = manuf70 if year>=1970 & year<1980 & manuf70~=8888 
replace manuf = manuf80 if year>=1980 & year<1990 & manuf80~=8888
replace manuf = manuf90 if year>=1990 & year<2000
replace manuf = manuf00 if year>=2000
lab var manuf "prop. lf in manuf 50-00 females"

gen manum = .
replace manum = manum60 if year>=1960 & year<1970 & manum60~=8888
replace manum = manum70 if year>=1970 & year<1980 & manum70~=8888 
replace manum = manum80 if year>=1980 & year<1990 & manum80~=8888
replace manum = manum90 if year>=1990 & year<2000
replace manum = manum00 if year>=2000
lab var manum "prop. lf in manuf 50-00 males"

gen manu = (manuf + manum)/2
lab var manu "prop. lf in manuf 50-00"

gen manu_lag = .
replace manu_lag = (manum50 + manuf50)/2 if year>=1960 & year<1970 & manum50~=8888 & manuf50~=8888
replace manu_lag = (manum60 + manuf60)/2 if year>=1970 & year<1980 & manum60~=8888 & manuf60~=8888
replace manu_lag = (manum70 + manuf70)/2 if year>=1980 & year<1990 & manum70~=8888 & manuf70~=8888
replace manu_lag = (manum80 + manuf80)/2 if year>=1990 & year<2000 & manum80~=8888 & manuf80~=8888
replace manu_lag = (manum90 + manuf90)/2 if year>=2000
lab var manu_lag "prop. lf in manuf - lagged by a decade"

gen dmanu = (manu - manu_lag)/manu_lag*100
lab var dmanu "change in the prop. of lf in manuf in the last decade"

gen agrif = .
replace agrif = agrif60 if year>=1960 & year<1970 & agrif60~=8888 
replace agrif = agrif70 if year>=1970 & year<1980 & agrif70~=8888
replace agrif = agrif80 if year>=1980 & year<1990 & agrif80~=8888 
replace agrif = agrif90 if year>=1990 & year<2000 
replace agrif = agrif00 if year>=2000 
lab var agrif "prop. lf in agriculture 50-00 females"


gen agrim = .
replace agrim = agrim60 if year>=1960 & year<1970 & agrim60~=8888 
replace agrim = agrim70 if year>=1970 & year<1980 & agrim70~=8888
replace agrim = agrim80 if year>=1980 & year<1990 & agrim80~=8888 
replace agrim = agrim90 if year>=1990 & year<2000 
replace agrim = agrim00 if year>=2000 
lab var agrim "prop. lf in agriculture 50-00 males"

gen agri = (agrif + agrim)/2
lab var agri "prop. lf in agriculture 50-00"


gen agri_lag = .
replace agri_lag = (agrim50 + agrif50)/2 if year>=1960 & year<1970 & agrim50~=8888 & agrif50~=8888 & agrim50~=9999 & agrif50~=9999
replace agri_lag = (agrim60 + agrif60)/2 if year>=1970 & year<1980 & agrim60~=8888 & agrif60~=8888
replace agri_lag = (agrim70 + agrif70)/2 if year>=1980 & year<1990 & agrim70~=8888 & agrif70~=8888
replace agri_lag = (agrim80 + agrif80)/2 if year>=1990 & year<2000 & agrim80~=8888 & agrif80~=8888
replace agri_lag = (agrim90 + agrif90)/2 if year>=2000 
lab var agri_lag "prop. lf in agriculture - lagged by a decade"

gen dagri = (agri - agri_lag)/agri_lag*100
lab var dagri "change in the prop. of lf in agriculture in the last decade"

gen ltmin = .
replace ltmin = ltmin70 if year<1980 & ltmin70~=8888 & ltmin70~=9999
replace ltmin = ltmin80 if year>=1980 & year<1990 & ltmin80~=8888 & ltmin80~=9999
replace ltmin = ltmin90 if year>=1990 & year<2000 & ltmin90~=8888 & ltmin90~=9999
replace ltmin = ltmin00 if year>=2000 & ltmin00~=8888 & ltmin00~=9999
lab var ltmin "prop. lf w/ less than min wage"

gen lfpf = .
replace lfpf = lfpf60 if year>=1960 & year<1970 & lfpf60~=8888
replace lfpf = lfpf70 if year>=1970 & year<1980 & lfpf70~=8888
replace lfpf = lfpf80 if year>=1980 & year<1990 & lfpf80~=8888
replace lfpf = lfpf90 if year>=1990 & year<2000 & lfpf90~=8888
replace lfpf = lfpf00 if year>=2000 & lfpf00~=8888
lab var lfpf "lf participation rate - females"

gen lfpm = .
replace lfpm = lfpm60 if year>=1960 & year<1970 & lfpm60~=8888
replace lfpm = lfpm70 if year>=1970 & year<1980 & lfpm70~=8888
replace lfpm = lfpm80 if year>=1980 & year<1990 & lfpm80~=8888
replace lfpm = lfpm90 if year>=1990 & year<2000 & lfpm90~=8888
replace lfpm = lfpm00 if year>=2000 & lfpm00~=8888
lab var lfpm "lf participation rate - males"

gen lfp = (lfpf + lfpm)/2
lab var lfp "lf participation rate"

gen lfp_lag = .
replace lfp_lag = (lfpf50 + lfpm50)/2 if year>=1960 & year<1970 & lfpf50~=8888 & lfpm50~=8888
replace lfp_lag = (lfpf60 + lfpm60)/2 if year>=1970 & year<1980 & lfpf60~=8888 & lfpm60~=8888
replace lfp_lag = (lfpf70 + lfpm70)/2 if year>=1980 & year<1990 & lfpf70~=8888 & lfpm70~=8888
replace lfp_lag = (lfpf80 + lfpm80)/2 if year>=1990 & year<2000 & lfpf80~=8888 & lfpm80~=8888
replace lfp_lag = (lfpf90 + lfpm90)/2 if year>=2000 & lfpf90~=8888 & lfpm90~=8888
lab var lfp_lag "lf participation rate - lagged by a decade"

gen dlfp = (lfp-lfp_lag)/lfp_lag*100
lab var dlfp "change in lf participation rate in the last decade"

gen lnmanp = ln(manprdct)
lab var lnmanp "log of annual value of manufacturing in municipio"


	* Mexican Economic Indicators *
	*******************************

replace usavwage  = . if usavwage==8888
replace mxunemp   = . if mxunemp==9999
replace infrate = infrate/100
	
	* Mexican Minimum Wage *
	************************

replace mxminwag  = . if mxminwag==8888

* IMP NOTE - In year 1993, three zeros were taken out of the Mexican peso.

replace mxminwag = mxminwag/1000 if year<1993

* Convert to constant pesos

replace mxminwag = mxminwag*const00_mx

* Convert to U.S. dollars - note you need to use the exchange rate in 2000
* since the values are in 2000 pesos
* Note the codebook reports the exchange rate as USD/Peso but in 
* fact it is Peso/USD

replace mxminwag = mxminwag/9.572

* Convert to hourly wages (currently daily) - the trends observed over time are consistent
* with other studies

replace mxminwag = mxminwag/8
lab var mxminwag "Mexican hourly wages in constant 2000 pesos converted to U.S.$"
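
* Worked example of the conversion chain above, as a rough sketch. The inputs
* (a nominal daily wage of 13,330 old pesos and a deflator of 2.5) are
* hypothetical, chosen only to illustrate the arithmetic; 9.572 and 8 are the
* exchange rate and hours per day used above.
* old pesos -> new pesos -> 2000 pesos -> 2000 U.S.$ per day -> per hour
display 13330/1000 * 2.5 / 9.572 / 8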

	* U.S. Average Wage *
	*********************
* I use usjanwage (from BLS) rather than usavwage provided by Massey and Espinosa. It is not
* clear whether the latter is in real or nominal $.

gen wratio = usjanwage/mxminwag
lab var wratio "Hourly wage ratio (US/Mexico)"

	* OUTCOME VARIABLES *
	*********************
	* These are the outcomes that cluster membership should predict.

	* Remittances *
	***************
	* Note - remittance information is only available for household heads on their
	* last trip - here, we have merged remittance data based on individual id not year.
	* Therefore, we have remittance information for all hh heads on their last trip, although
	* they are observed on their first trip for the purposes of this study. We need to include
	* remittance information only for individuals who are on their first AND last observed trip.
	
	replace remit = . if usyr1~=usyrl
	replace savm = . if usyr1~=usyrl
	replace remsavm = . if usyr1~=usyrl
	replace remsavfl = . if usyr1~=usyrl

gen remc = remit*const00
gen savmc = savm*const00
gen remsavmc = remsavm*const00

gen logremc    = ln(remc+1)
gen logsavmc    = ln(savmc+1)
gen logremsavm   = ln(remsavm+1)
gen logremsavmc  = ln(remsavmc+1)

lab var remc       "heads only - monthly remittance during (first and) last trip (2000 US$)"
lab var savmc      "heads only - monthly savings during (first and) last trip (2000 US$)"
lab var remsavm    "heads only - monthly remittances and savings during (first and) last trip"
lab var remsavmc   "heads only - monthly remittances and savings during (first and) last trip (2000 US$)"

lab var logremc  "heads only - logged monthly remittances during (first and) last trip (2000 US$)"
lab var logsavmc "heads only - logged monthly savings during (first and) last trip (2000 US$)"
lab var logremsavm    "heads only - logged monthly remittances and savings during (first and) last trip"
lab var logremsavmc   "heads only - logged monthly remittances and savings during (first and) last trip (2000 US$)"

	* U.S. Wages *
	**************

* Note that uswage1=8888 means migrant is not employed for wages. S/he may have his or her own 
* business. For now I am setting the wages of such individuals to missing. Alternatively, we can assume
* the wages to be zero.

replace uswage1 = . if uswage1==8888 | uswage1==9999
replace usby1   = . if usby1==8888 | usby1==9999

* Correct for potential measurement errors (e.g. migrants making $100 or more hourly - 4 obs).
* There are possible inconsistencies in weekly or monthly measures (e.g. migrants making $2.5 weekly),
* but few observations. We ignore those for now.

replace uswage1 = . if uswage1>=100 & usby1==1

* Compute yearly wages using the frequency of wage information

gen uswageyr = .
replace uswageyr = uswage1 * 40 * 52 if usby1==1
replace uswageyr = uswage1 * 5  * 52 if usby1==2
replace uswageyr = uswage1 * 52 if usby1==3
replace uswageyr = uswage1 * 26 if usby1==4
replace uswageyr = uswage1 * 12 if usby1==5
replace uswageyr = uswage1 if usby1==6
lab var uswageyr "Estimated yearly wages of migrant in 2000 U.S.$ (using uswage1)"
	* Convert to constant dollars
	replace uswageyr = uswageyr*const00
	* tabstat uswageyr, by(year) stat(p25 p50 p75)
gen uswagem = uswageyr/12
lab var uswagem "Estimated monthly wages of migrant in 2000 U.S.$ (using uswage1)"
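
* One way to check the annualization above (a sketch, not part of the original
* workflow): compare annual wages across reporting units. If the multipliers
* are right, the medians should be of similar magnitude across usby1 groups.
tabstat uswageyr, by(usby1) stat(n p25 p50 p75)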

gen loguswageyr = ln(uswageyr+1)
gen loguswagem  = ln(uswagem+1) 

	* U.S. Destination *
	********************

rename usstate1  s
gen usdiv     = 1 if s==107 | s==120 | s==122 | s==130 | s==140 | s==146
replace usdiv = 2 if s==131 | s==133 | s==139
replace usdiv = 3 if s==115 | s==114 | s==123 | s==136 | s==150
replace usdiv = 4 if s==116 | s==117 | s==124 | s==126 | s==128 | s==135 | s==142
replace usdiv = 5 if s==108 | s==109 | s==110 | s==111 | s==121 | s==134 | s==141 | s==147 | s==149
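* Division 6 = AL (101), KY (118), MS (125), TN (143), assuming the alphabetical state
* coding implied by 105=CA, 114=IL, 144=TX; verify against the codebook.
replace usdiv = 6 if s==101 | s==118 | s==125 | s==143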
replace usdiv = 7 if s==104 | s==119 | s==137 | s==144
replace usdiv = 8 if s==103 | s==106 | s==113 | s==132 | s==127 | s==145 | s==129 | s==151
replace usdiv = 9 if s==102 | s==105 | s==112 | s==138 | s==148

gen usreg     = 1 if usdiv==1 | usdiv==2
replace usreg = 2 if usdiv==3 | usdiv==4
replace usreg = 3 if usdiv==5 | usdiv==6 | usdiv==7
replace usreg = 4 if usdiv==8 | usdiv==9

rename s usstate1
lab define usdivlab  1 "new england" 2 "middle atlantic" 3 "east north central" ///
			   4 "west north central" 5 "south atlantic" 6 "east south central" ///
			   7 "west south central" 8 "mountain" 9 "pacific"
lab define usreglab  1 "northeast" 2 "midwest" 3 "south" 4 "west"
lab var usdiv "Mig is in US Division"
lab var usreg "Mig is in US Region"
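
* Possible coverage check (a sketch; it assumes U.S. states are coded 101-151,
* which is inferred from 105 = California, 114 = Illinois and 144 = Texas
* in this file, not confirmed against the codebook): count U.S. state codes that
* were not assigned to any division. A nonzero count points to a gap in the lists above.
count if usdiv==. & usstate1>=101 & usstate1<=151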

* MATLAB cannot read-in labeled values
*lab val usdiv usdivlab
*lab val usreg usreglab 

gen calif = 0 if usdiv~=.
gen texas = 0 if usdiv~=.
gen illin = 0 if usdiv~=.
gen othdest = 0 if usdiv~=.

replace calif = 1 if usstate1==105
replace texas = 1 if usstate1==144
replace illin = 1 if usstate1==114
* Restrict to obs with a known destination so that missing destinations stay missing
replace othdest = 1 if usstate1~=105 & usstate1~=144 & usstate1~=114 & usdiv~=.


	* Migration Patterns *
	**********************

gen ttrip = ustrips
replace ttrip = . if ttrip==9999
lab var ttrip "total number of US trips"

gen texp = usexp
replace texp = . if texp==9999
lab var texp "total months of US experience"

gen logtexp = ln(texp)
lab var logtexp "log of total months of US experience"

* Missing ttrip would evaluate as >=2 in Stata, so leave repmig missing in that case
gen repmig = (ttrip>=2) if ttrip<.
lab var repmig "Individual migrated again?"

gen resid = .
replace resid = 0 if legyrrec==8888
replace resid = 1 if legyrrec<8888
lab var resid "received legal residency (before or after first mig)"

gen legmig = .
replace legmig = 0 if undoc==1
replace legmig = 1 if undoc==0
replace legmig = 1 if legyrrec<=usyr1 & usyr1<8888 & legyrrec<8888
lab var legmig "Migrant had legal docs (hh heads only) or residency on first trip?"

		* LIST of OUTCOME VARS: undoc coyote crossfamfr crosstij remsavfl remsavmc logremsavmc
		* us_none us_agri us_manuf us_serv us_oth uswagem loguswagem usdiv usreg calif texas illin othdest
		* ttrip texp logtexp repmig resid legmig


******************************
******* FINAL SAMPLE *********
******************************

* Options: (1) Keep only the individuals who were interviewed in Mexico (NOT necessary for now)
*		   (2) Keep only the individuals who are members of the households they were interviewed
* in. Note that the non-members are the children of the hh heads who have moved out of the household.
* About 30% of such individuals are migrants (10K cases). Dropping them leads to significant information
* loss. The downside of including them is that hh wealth may not apply to those individuals. It is reasonable
* to assume that at the year of their first migration, these individuals were residing in the household.

* drop if surveypl==2
* keep if hhmemshp==2
* drop if hhmemshp>=8888


gen hhmem = 0
replace hhmem = 1 if hhmemshp==2

drop rain usempgr tradegr ejido
ren ejidoy ejido
ren mx_none ocnone
ren mx_agri ocagri
ren mx_manuf ocmanuf
ren mx_serv ocserv
ren troomlag room
ren lntroomlag lnroom
ren vlandlag land
ren lnvlandlag lnland
ren tbuslag business
ren mxminwag  mxwage
ren usjanwage uswage
ren compop pop
ren lncompop lnpop
ren recent_crain rain
ren infrate inf
ren minx2 min2
ren lwhrs lw
ren probapp app
ren visaaccs visa
ren insbudgt ins
ren bpebudgt bpe
ren mexrlint mxint
ren totusemp usemp
ren dfinvest fdi
ren dfigr_ma fdigr
ren usempgr_ma usempgr
ren tradegr_ma tradegr

gen bus = business>0 if business<.
gen metro = metrocat
gen head  = relhead==1
gen met   = metrocat<=2

* Drop all missing observations

* IMPORTANT NOTE - If we drop observations with missing lw, app, visa, mxint, etc. as in mmp_cluster_data9_final.do,
* we end up with more missing observations (i.e., fewer migrants) than in init_data.dta used in clustering. The reason
* is that we keep migrants on the survey year in this data set. So, when we drop observations with missing
* contextual variables (mostly from 2002-2005) we lose more migrants. (In the other data set, migrants are observed
* on their first trip, which is typically in much earlier years with no missing contextual information.)
********************************************************************************************************************

* To make sure we have the exact same sample, we merge id's from init_data.dta, and drop the rest of the migrants,
* while keeping all of the non-migrants.

* The goal is to have 17,049 migrants exactly.

sort commun hhnum persnum
merge commun hhnum persnum using "Clustering\dataid_temp.dta"

drop if mig==1 & _merge~=3
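
* Quick check against the target above (a sketch, not part of the original
* workflow): report how many migrants remain after the merge.
count if mig==1
display "migrants retained: " r(N) " (target: 17,049)"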

	* For 159 migrant obs, wealth measures are missing here but available in init_data.
	* Take the values from that data set.
	* codebook room lnland business educ agri self lnpop if mig==1

	replace room = room2 if mig==1 & room==.
	replace lnland = lnland2 if mig==1 & lnland==.
	replace business = business2 if mig==1 & business==.
	drop room2 lnland2 business2

	* For non-migrants drop all missing observations

drop if mig==0 & (educ==. | sex==9999 | room==. | lnland==. | business==. | agri==. | self==. | lnpop==.)

save "Clustering\init_data_mig_nonmig.dta", replace

ren prisch pri
ren secsch sec

outsheet relhead age sex head educ pri sec mxmig ocagri ocmanuf ocserv room lnroom land lnland business ///
	hhmig nhhleg nhhnleg hhnleg hhleg pcmig prev agrim self ltmin met commun hhnum persnum mig ///
	using "Clustering\mig_nonmig_data.raw", replace