**************************************
* Life History Data Combined - Step1 *
**************************************
* Date created: Dec 13, 2007
* Last update: Dec 30, 2010 12pm
* This code is identical to DataStep1-3.do in "Used in All".
* The only difference is that we save the data from 51 villages (instead of 22)
* at the end.
*****************************************************************
* This code merges the life history data with hh and village indicators using three
* waves of the data: 1984, 1994 and 2000. (I use only the original data, so that the temporary
* data set can be regenerated in the future.)
*
* I am trying to recreate the combined data set lh9400_recoded.dta generated by Chang.
* I have used this data set previously for the remittances and development papers. It seems that
* Chang included all the villages in the data, including villages for which the migrants
* were not followed up. In the original code, after merging all the data files, I kept only the 22
* villages; this version saves 51 villages (see above).
*
* Also, Chang's data have 363K observations, which is too high a number. If we add up all
* the observations in life94, mlife94, life00 and mlife00, we have 365K cases. Some cases (at least 30K)
* overlap in the 1994 and 2000 data, so 363K probably involves double-counting of certain cases. In my data,
* I have 313K obs, which reduces to about 189K when only the migrant follow-up villages are kept.
*
* IMPORTANT NOTES:
*
* 1. You need to merge each data set separately to personid.dta; otherwise you will have missing values
* for the unique nrpid identifier (e.g., if you first merge life94.dta to personid.dta, and then
* merge the resulting data set to mlife94.dta, the 'nrpid' unique identifier will be missing for
* migrants).
*
* 2. When sorting data sets before merging them, always specify the stable option. Otherwise, the
* ordering of observations that have equal values of the sorting variables is not reproducible. In that case, we
* obtain a different number of cases (with given specifications) at each run.
*
* 3. In the mlife00 and mindiv00 files households are identified uniquely by "mid00 migtype" and individuals
* by "mid00 migtype mcep00". Note hhid00 is recorded only for student migrants, and is missing o/w. We need
* to use "mid00 migtype" as the hh identifier for migrants, which connects to the mfam00 file (the household
* roster for migrant families).
*
* 4. Similarly, for 1994 data, "mid" is the hh identifier for migrants, and connects mindiv94 to mfam94 migrant hh
* roster. For non-migrant hhs, we use "hhid94" which connects indiv94 to hh94 hh roster.
*
* 5. We need to merge each individual-level data separately to the corresponding life history, and then
* check for inconsistencies in merging life94 to life00; mlife94 to mlife00; and finally life9400 to mlife9400.
*
* 6. REMITTANCES - Remittance information is available in indiv94 and indiv00 files as well as in mremit94 and
* mindiv00. The former information is related to household members' account of remittances, while the latter
* is the migrants' account. In prior versions of the remittance paper, we have relied on household members'
* accounts, now we begin using the information provided by migrants instead. (In order to get household members'
* account of remittances, we need to somehow merge indiv94 to life94 - but since the latter only contains
* life histories of nonmigrants, when we do merge the two files, all the remittance variables are missing.
* Maybe we should try to merge the file indiv94 to the final version of the data file, where migrants
* and non-migrants are combined.)
*
* (a) To obtain remittances reported by migrants, we need to combine mlife94 to mremit94, and similarly mlife00
* to mindiv00. We can use mid mcep8 in 1994, and mid00 migtype mcep00 in the 2000 file for that purpose.
*
*
* 7. IMPORTANT DISTINCTION BETWEEN ORIGIN AND DESTINATION FILES - Note that the migrant life history files
* connect back to indiv, hh, and commun level files (in the origin) through a combination of village id+
* ban lek ti (government issued hh id) + cep number. Mindiv and mfam files do not contain information on
* ORIGIN households but destination HOUSEHOLDS - for now we are not interested in destination households.
* So, we should not be using mindiv94, mfam94, mindiv00 and mfam00 to get household characteristics.
* We should use mremit94 and mindiv00 to get remittance information (as reported by migrants) only!
*
* 8. Another issue is the code 2 individuals. (See the Data Files Chart.doc and code2inds.pdf) These individuals are
* those from the previous data collection round who answered question 1.1 as away from the hh (code 2) in this round. In such cases,
* interviewers listed the destination hh of the code 2 individual. So, code 2 individuals are listed on 2 hh rosters,
* origin and destination households. In the 1994 survey, there are 2,516 code 2 individuals, and in 2000 there are
* 2,511 cases.
*
* (a) To solve this problem, in the 1994 hh roster (indiv94), we need to drop the cases where an individual
* is verified to be living in another hh (that is, if dhhid94~=99999998(missing)|99999996(in temple)). If dhhid94==
* 9999999 then the new destination is not known so we can drop the individual from the hh as well. If dhhid94==new hh id,
* then the new destination is known, ind lives in that hh, and is recorded in that hh's roster, so we should drop
* him/her from the origin hh roster. (In sum, drop if dhhid94~=99999998(missing)|99999996(in temple).)
*
* (b) In the 2000 data, the problem is dealt with slightly differently. Code 2 individuals that are tracked in the new
* households have nonmissing dhhid00. Those in temple have nonmissing dhhid00 and dcep00 of 77. We should drop the cases
* for which new households are known or missing (dcep00=99) since they are clearly not living in the origin hh.
* drop if dhhid00~=. & dcep00~=77 (that is keep those individuals in temple).
*
* (c) Note that we are starting our data merger with life history data (see the logic described below). For each life
* history data set, we begin by merging the personid.dta and updating the hh and individual level identifiers. Since in the
* personid.dta, for example, hhid94 reflects the 'new' household for code 2 individuals (while ohhid94 shows the origin hh in the
* prior time period), we obtain the most up-to-date residence information. Then, merging life history files to hh files we will naturally
* drop the second observations for the code 2 individuals. (We need to check if we are losing exactly 2,516 individuals in 1994 and 2,511
* in 2000 to be sure.)
*
* 9. Study Migrant Flag: An important concept in the migrant life history files is that of the study migrant. When the interviewers
* found migrants in destination, they questioned other migrants in their household who do not necessarily come from the Nang Rong
* villages. There is an indicator, stmigfl, in both mlife94 and mlife00. We should only keep the observations for which the flag equals 1.
* (The existence of non-study migrants is actually why we lose a large number of cases when we try to combine life history data to household
* rosters in origin villages.)
clear
set mem 500m
*cd "N:\Temp Stata Files\"
cd "/Users/fgarip/Desktop/Combined LH 51 Villages/Temp Stata Files/"
***************************************
** 0. Sorting Data Sets to Be Merged **
***************************************
* NOTE - merge command has a 'sort' option, but it assumes the 'unique'
* option, namely that the variable used in the merge uniquely identifies
* all observations in the master and using data - For us, this is not the
* case most of the time, as we are merging hh or vill level data to ind level
* data. So, instead of using the automatic option, we are manually sorting
* all data sets.
use "hh00.dta", clear
sort hhid00, stable
save "hh00.dta", replace
use "hh94.dta", clear
sort hhid94, stable
save "hh94.dta", replace
use "hh84.dta", clear
sort vill84 house84, stable
save "hh84.dta", replace
use "life00.dta", clear
sort hhid00 cep00, stable
save "life00.dta", replace
use "indiv00.dta", clear
sort hhid00 cep00, stable
save "indiv00.dta", replace
use "indiv94.dta", clear
sort hhid94 cep94, stable
save "indiv94.dta", replace
use "mindiv94.dta", clear
sort mid mcep8, stable
save "mindiv94.dta", replace
use "personid.dta", clear
keep if mid00~="" & mcep00~=""
sort mid00 migtype mcep00, stable
save "Temp\personid00m.dta", replace
use "mindiv00.dta", clear
* Drop missing values of mcep00
drop if mcep00==""
sort mid00 migtype mcep00, stable
save "Temp\mindiv00t.dta", replace
***********************
* 1. Clean life94.dta *
***********************
use "personid.dta", clear
keep if hhid94~=. & cep94~=""
sort hhid94 cep94, stable
save "Temp\personid94.dta", replace
use "life94.dta", clear
* Merge with personid.dta to get hh and vill identifiers
* NOTE - vill84 + house84 uniquely identify hhs.
* hhid00 + cep00 uniquely identify individuals.
*********************************************************
sort hhid94 cep94, stable
merge hhid94 cep94 using "Temp\personid94.dta", ///
keep(house84 cep84 hhid94 cep94 mcep8 mid hhid00 cep00 mid00 mcep00 nrpid vill84) update
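* A hedged check, not part of the original code: because the merge above uses the
* update option, matched records can be coded _merge==4 (missing master values
* updated) or _merge==5 (master and using disagree) in addition to _merge==3.
* Tabulating _merge shows how many matched records a filter on _merge==3 would discard.
tab _merge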
* keep records iff master and using data match
keep if _merge==3
drop _merge
ren q5_1 age
ren q5_1_1 ageinyear
ren q5_6ed educ
ren q5_7j1_1 occ1
ren q5_5pl1 p1
ren q5_4ch1 locch1
gen year = 1994 - (age-ageinyear)
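* Worked example of the year formula above (illustrative numbers, not from the data; this
* assumes age is the respondent's age at the 1994 interview and ageinyear is the age in a
* given life history row): a respondent aged 30 in 1994 has the row for age 25 dated
* 1994 - (30 - 25) = 1989.
display 1994 - (30 - 25)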
drop age
ren ageinyear age
replace sex=0 if sex==2
replace sex=. if sex==9
destring locch1, g(ch1)
gen gavebirth = . if sex==1
replace gavebirth=0 if sex==0
replace gavebirth=1 if ch1<999999997 & sex==0
lab var gavebirth "gave birth this year?"
* Check if nrpid year uniquely identify individuals, drop o/w (no cases dropped)
sort year nrpid, stable
by year nrpid: gen i = _n
drop if i>1
drop i
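* A minimal check, not part of the original code: after the duplicate drop above,
* duplicates report should show no surplus observations for the nrpid-year pair.
duplicates report nrpid year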
* Check if age is increasing linearly over time (for 88 cases it is not; compute the values using 1994 age.)
gen agefl = 0
sort nrpid year, stable
by nrpid: replace agefl=1 if age~=age[_N]-(year[_N]-year)
tab agefl
drop agefl
* Recode the place variable - Note, we need to know which provinces belong to the OTH, BMA and ESB
* categories. Use the Province Appendix and the following groupings (taken from Chang's listing).
* I also checked it against other lists of provinces by region; it seems mostly consistent. In the NEP
* def, there are three inconsistencies: 44 seems to belong to the north (very north on the map), while
* 69 and 74 seem to be in the central region (74 is close to NE on the map). I move all these to the OTH category.
* Note the 2000 data use 2 digit codes, while 1994 data use 4-digit province codes (40 + 2-digit code)
*
* BK = In addition to its other def, province 01
* BMA = 19,24,28,59,61,
* OTH = 02,03,05,07,10,12,13-18,22,23,25,26,29-41,44,46,48,50-53,57,58,60,62-66,68,69,71,72,74,75
* ESB = 08,
* NEP = 04,06,11,20,42,43,45,47,54-56,67,70,73,76
* CB = in addition to other def, 09,
* KR = in addition to other def, 21,
* RY = in addition to other def, 49,
* BR = in addition to other def, 27,
destring p1, g(p)
drop p1
gen p1 = .
lab def dest 0 "00:NR" 1 "01:BR" 2 "02:KR" 3 "03:NEP" 4 "04:CB" 5 "05:RY" 6 "06:ESB" 7 "07:BK" ///
8 "08:BMA" 9 "09:OTH" 10 "10:INTL" 99 "99:Missing"
* Recode Place Variable - Life94 *
**********************************
replace p1 = 0 if p>=10000 & p<=30000
replace p1 = 1 if p==54027
replace p1 = 2 if p==64021
replace p1 = 4 if p==84009
replace p1 = 5 if p==74049
replace p1 = 7 if p==44001
replace p1 = 10 if p>=5001 & p<=5024
replace p1 = 99 if p==99999
replace p = p-94000 if p>=94000 & p<=94100
replace p1 = 3 if (p==4 | p==6 | p==11 | p==20 | p==42 | p==43 | p==45 | p==47 | p==54 | p==55 | p==56 | p==67 | p==70 | p==73 | p==76)
replace p1 = 6 if p==08
replace p1 = 8 if (p==19 | p==24 | p==28 | p==59 | p==61)
replace p1 = 9 if (p==2 | p==3 | p==5 | p==7 | p==10 | p==12 | p==22 | p==23 | p==25 | p==26 | p==44 | p==46 | p==48 | p==57 | p==58 | p==60 ///
| p==68 | p==69 | p==71 | p==72 | p==74 | p==75 | (p>=13 & p<=18) | (p>=29 & p<=41) | (p>=50 & p<=53) | (p>=62 & p<=66))
replace p1 = 7 if p==1
replace p1 = 4 if p==9
replace p1 = 2 if p==21
replace p1 = 5 if p==49
replace p1 = 1 if p==27
lab val p1 dest
* Recode Occupation *
*********************
lab def occ 1 "1:Stu/NoWork" 2 "2:Farmer" 3 "3:Factory" 4 "4:Constrctn" 5 "5:Service" 6 "6:Other" 9 "9:Missing"
replace occ1 = 9 if occ1==7
lab val occ1 occ
* Education, Marital Status, Children *
***************************************
sort hhid94 cep94, stable
merge hhid94 cep94 using "indiv94.dta", keep(q7 q16 q21 q22) update
drop if _merge==2
drop _merge
gen married = 0
replace married=. if q7==99
replace married=1 if age>=q7
replace educ = q16 if (educ==. | educ>=94) & q16~=. & q16<95 & year==1994
gen noch = .
replace noch = q21 if q21<98 & sex==0
lab var noch "(women only) children ever born"
gen livch = .
replace livch = q22 if q22<98 & sex==0
lab var livch "(women only) living children"
drop q7 q16 q21 q22
* Migration & Remittances Sent or Received by Individual *
**********************************************************
* Note - there are no migrants among life94 observations.
* So migration and remittance variables are all missing.
* We need to merge mlife94 file to mremit94 to get migrants'
* information. (The information on indiv94 is household members'
* account of remittances, while the information on mremit94 is
* the migrant's account.)
gen l94 = 1
keep hhid94 cep94 nrpid vill84 vill94 age sex educ occ1 p1 gavebirth year migvill l94 ///
married noch livch house84 cep84 mcep8 mid hhid00 cep00 mid00 mcep00
sort nrpid year, stable
save "Temp\life94t.dta", replace
************************
* 2. Clean mlife94.dta *
************************
* Unique identifier for mlife94.dta is mid mcep8
use "personid.dta", clear
keep if mid~=. & mcep8~=""
sort mid mcep8, stable
save "Temp\personid94m.dta", replace
use "mlife94.dta", clear
sort mid mcep8, stable
merge mid mcep8 using "Temp\personid94m.dta", keep(nrpid vill84 hhid94 cep94 house84 cep84) update
keep if _merge==3
drop _merge
* Drop non-study migrants (those that were interviewed as part of a migrant's hh, but do not come from
* the Nang Rong villages). Note there are cases where stmigfl==0 but hhid94 is not missing. These cases
* are likely to be miscoded as non-study migrants. So, we only drop the cases with stmigfl==0 and hhid94=.
drop if stmigfl==0 & hhid94==.
ren q10_1 agem
ren sex10 sexm
ren q10_1_1 ageinyearm
ren q10_6ed educm
ren q10_7j11 occ1m
ren q10_5pl1 p1m
ren q10_4ch1 locch1m
ren q10_4ch2 locch2m
gen year = 1994 - (agem-ageinyearm)
drop agem
rename ageinyearm agem
replace sexm=0 if sexm==2
destring locch1m, g(ch1m)
gen gavebirthm = . if sexm==1
replace gavebirthm=0 if sexm==0
replace gavebirthm=1 if ch1m<999997 & sexm==0
lab var gavebirthm "gave birth this year?"
* Check if nrpid year uniquely identify individuals, drop o/w (just 1 case is dropped)
sort year nrpid, stable
by year nrpid: gen i = _n
drop if i>1
drop i
* Check if age is increasing linearly over time (for 13 cases it is not - single nrpid)
* Change the cases with inconsistent age values.
gen agefl = 0
sort nrpid year, stable
by nrpid: replace agefl=1 if agem~=agem[_N]-(year[_N]-year)
tab agefl
codebook nrpid if agefl==1
*list nrpid year agem if nrpid=="028698"
replace agem=27 if year==1994 & nrpid=="028698"
drop agefl
destring p1m, g(p)
drop p1m
gen p1m = .
lab def dest 0 "00:NR" 1 "01:BR" 2 "02:KR" 3 "03:NEP" 4 "04:CB" 5 "05:RY" 6 "06:ESB" 7 "07:BK" ///
8 "08:BMA" 9 "09:OTH" 10 "10:INTL" 99 "99:Missing"
* Recode Place Variable - Mlife94 *
***********************************
replace p1m = 0 if p>=110000 & p<120000
replace p1m = 1 if p>=100000 & p<110000
replace p1m = 2 if p==64021
replace p1m = 4 if p==84009
replace p1m = 5 if p==74049
replace p1m = 7 if p==44001
replace p1m = 10 if p>=30000 & p<40000
replace p1m = 99 if p==999999
replace p = p-94000 if p>=94000 & p<94100
replace p1m= 3 if (p==4 | p==6 | p==11 | p==20 | p==42 | p==43 | p==45 | p==47 | p==54 | p==55 | p==56 | p==67 | p==70 | p==73 | p==76)
replace p1m= 6 if p==08
replace p1m= 8 if (p==19 | p==24 | p==28 | p==59 | p==61)
replace p1m= 9 if (p==2 | p==3 | p==5 | p==7 | p==10 | p==12 | p==22 | p==23 | p==25 | p==26 | p==44 | p==46 | p==48 | p==57 | p==58 | p==60 ///
| p==68 | p==69 | p==71 | p==72 | p==74 | p==75 | (p>=13 & p<=18) | (p>=29 & p<=41) | (p>=50 & p<=53) | (p>=62 & p<=66))
replace p1m= 7 if p==1
replace p1m= 4 if p==9
replace p1m= 2 if p==21
replace p1m= 5 if p==49
replace p1m= 1 if p==27
lab val p1m dest
* Recode Occupation *
*********************
lab def occ 1 "1:Stu/NoWork" 2 "2:Farmer" 3 "3:Factory" 4 "4:Constrctn" 5 "5:Service" 6 "6:Other" 9 "9:Missing"
* Assume occ1m= 10, 11, 15 are agricultural jobs. (It is not clear from the occupation appendix, but codes
* starting w 1 seem to come from agricultural occupations.)
* Collapse the monk/soldier (8) category into the student/no work category
replace occ1m=2 if occ1m==10 | occ1m==11 | occ1m==15
replace occ1m=1 if occ1m==8
replace occ1m=9 if occ1m==99
replace occ1m=9 if occ1m==7
lab val occ1m occ
* Education, Marital Status, Children *
***************************************
sort mid mcep8, stable
merge mid mcep8 using "mindiv94.dta", keep(q8_7 q8_16 q8_21 q8_22) update
drop if _merge==2
drop _merge
gen marriedm = 0
replace marriedm=. if q8_7==99
replace marriedm=1 if agem>=q8_7
replace educm = q8_16 if (educm==. | educm>=94) & q8_16~=. & q8_16<95 & year==1994
gen nochm = .
replace nochm = q8_21 if q8_21<98 & sexm==0
lab var nochm "(women only) children ever born"
gen livchm = .
replace livchm = q8_22 if q8_22<98 & sexm==0
lab var livchm "(women only) living children"
drop q8_7 q8_16 q8_21 q8_22
save "Temp\mlife94t.dta", replace
* Remittances Sent or Received by Individual *
**********************************************
* We merge mlife94 to mremit94 to obtain "migrants' account" of remittances. A migrant may send money/goods to any of the
* individuals from his/her origin household in 1994. In particular, the migrant is asked about the groups of hh members that
* are listed in Question 1.1 of the household roster (i.e., asking where the person is) as 1 (in this hh), 2 (another hh in the village)
* or 3 (outside the village). A row is recorded for each case, for example if there are two individuals in the household with code 3,
* and they are at different places, then two rows are recorded, one for each. If, on the other hand, three individuals live in the same hh,
* they are recorded on the same row as q9_7_c1, q9_7_c2, and q9_7_c3 respectively. So, a migrant identified by mid mcep8 may have multiple
* observations in the data. We do the following steps:
* (a) Get q1.1 from the indiv94.dta. Use hhid94 to match migrant to his household, and the cep numbers of the hh members as recorded
* in q9_7_c*. Note, we first need to make the data into long format where each cep# is shown as a row.
* (b) We then drop the rows with missing recipient id (rcep=998) as these imply no recipients.
* (c) We identify the row that corresponds to the origin hh members still living there (q1.1=1), and drop the other observations.
* We are only interested in the remittances mig sends to (or receives from) the household members still living there.
* The 2000 remittances data actually record 3 households to which migrants can send remittances. For the sake of consistency with 2000,
* we will restrict our analysis to the migrant's current household.
use "indiv94.dta", clear
rename cep94 rcep
sort hhid94 rcep, stable
save "Temp\indiv94r.dta", replace
use "mremit94.dta", clear
sort mid mcep8, stable
merge mid mcep8 using "Temp\personid94m.dta", keep(nrpid vill84 hhid94 cep94 house84 cep84) update
keep if _merge==3
drop _merge
* Reshape the data so that there is one observation for each cep number q9_7_c1 to q9_7_c9
rename q9_7_c1 rcep1
rename q9_7_c2 rcep2
rename q9_7_c3 rcep3
rename q9_7_c4 rcep4
rename q9_7_c5 rcep5
rename q9_7_c6 rcep6
rename q9_7_c7 rcep7
rename q9_7_c8 rcep8
rename q9_7_c9 rcep9
* Generate a unique id for reshaping (mid mcep8 does not uniquely identify migrants, as a mig can send remittances
* to multiple cep no.s from the origin hh).
gen id=_n
reshape long rcep,i(id) j(rcepno)
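* A minimal check, not part of the original code: reshape long creates one row per id for
* each of the suffixes 1-9, so every value of rcepno should have the same frequency
* (the number of observations before reshaping).
tab rcepno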
* drop all the observations with missing rcep number - it means no recipients are identified.
drop if rcep=="998"
* Get the q1 for recipient cep numbers (rcep) from the indiv94.dta.
sort hhid94 rcep, stable
merge hhid94 rcep using "Temp\indiv94r.dta", keep(q1) update
drop if _merge==2
drop _merge
* Keep only the individuals currently living in the hh (that is, q1==1)
* Assume that q1==. is actually 1 (o/w we lose 42 unique mids, and 99 observations).
keep if q1==1 | q1==.
** For each mid mcep8, compute the amount of remittances received. Note each row represents a person
** in the hh, and ideally, since the migrant is remitting to the hh as a whole, the rows should be equal
** for a migrant. They usually are. But in some cases there are inconsistencies. In those cases, we will
** take the most frequently occurring value, i.e. the mode.
** q9_7_21 identifies if mig sends money to hh in the last 12 months.
replace q9_7_21 = . if q9_7_21==9
replace q9_7_21 = 0 if q9_7_21==2
bys mid mcep8: egen rm = mode(q9_7_21), max
lab var rm "mig sends money to hh? (mig's account)"
** q9_7_22 identifies number of times mig sent money
** q9_7_23 identifies the total amount of money mig sent
replace q9_7_22 = . if q9_7_22>=98
replace q9_7_23 = . if q9_7_23>=8
* Amount sent is coded as follows: (1) <1K baht - (2) 1K-3K baht - (3) 3K-5K baht - (4) 5K-10K baht
* (5) 10K-20K baht - (6) >20K baht. Take the average of the interval, and 30K as max and estimate
* the amount sent. Then sum these amount up across observations for the same individual.
replace q9_7_23 = 500 if q9_7_23==1
replace q9_7_23 = 2000 if q9_7_23==2
replace q9_7_23 = 4000 if q9_7_23==3
replace q9_7_23 = 7500 if q9_7_23==4
replace q9_7_23 = 15000 if q9_7_23==5
replace q9_7_23 = 30000 if q9_7_23==6
bys mid mcep8: egen rt = mode(q9_7_22), max
bys mid mcep8: egen ra = mode(q9_7_23), max
* There are 27 obs where rt>0 and rm==0, correct.
replace rm = 1 if rt>0 & rt~=. & (rm==0 | rm==.)
replace rt = 0 if rm==0 & rt==.
replace ra = 0 if rm==0 & ra==.
lab var rt "times mig sent money to hh (mig's account)"
lab var ra "amount of money mig sent to hh (mig's account)"
** q9_7_241 to q9_7_245 identify if mig sends goods (clothing, food, hh appliances,
** electric appliances, vehicles) to hh.
replace q9_7_241 = . if q9_7_241==9 | q9_7_241==2
replace q9_7_242 = . if q9_7_242==9 | q9_7_242==2
replace q9_7_243 = . if q9_7_243==9 | q9_7_243==2
replace q9_7_244 = . if q9_7_244==9 | q9_7_244==2
replace q9_7_245 = . if q9_7_245==9 | q9_7_245==2
egen q9_7_24 = rsum(q9_7_241 q9_7_242 q9_7_243 q9_7_244 q9_7_245)
bys mid mcep8: egen rg = mode(q9_7_24), max
replace rg = 1 if rg>=1
lab var rg "mig sends goods to hh (mig's account)?"
gen rmg = 1 if rm==1 | rg==1
replace rmg = 0 if rm==0 & rg==0
lab var rmg "mig sends money or goods to hh (mig's account)?"
** q9_7_11 identifies if hh sends money to mig in the last 12 months
** q9_7_12 identifies the number of times hh sent money to mig
** q9_7_13 identifies the amount of money hh sent mig
** q9_7_141-q9_7_145 identify if hh sends goods (clothing, food, hh appliances,
** electric appliances, vehicles) to mig.
replace q9_7_11 = . if q9_7_11==9
replace q9_7_11 = 0 if q9_7_11==2
bys mid mcep8: egen hrm = mode(q9_7_11), max
* Exclude missing values here: in Stata, missing (.) compares as larger than any number.
replace hrm=1 if hrm>=1 & hrm<.
lab var hrm "mig receives money from hh? (mig's account)"
replace q9_7_12 = . if q9_7_12>=98
replace q9_7_13 = . if q9_7_13>=8
replace q9_7_13 = 500 if q9_7_13==1
* NOTE: code 2 is set to 1500 here but to 2000 for q9_7_23 above; check the codebook intervals.
replace q9_7_13 = 1500 if q9_7_13==2
replace q9_7_13 = 4000 if q9_7_13==3
replace q9_7_13 = 7500 if q9_7_13==4
replace q9_7_13 = 15000 if q9_7_13==5
replace q9_7_13 = 30000 if q9_7_13==6
bys mid mcep8: egen hrt = mode(q9_7_12), max
bys mid mcep8: egen hra = mode(q9_7_13), max
* There are 62 obs where hrt>0 and hrm==0, correct.
replace hrm = 1 if hrt>0 & hrt~=. & (hrm==0 | hrm==.)
replace hrt = 0 if hrm==0 & hrt==.
replace hra = 0 if hrm==0 & hra==.
lab var hrt "times mig received money from hh (mig's account)"
lab var hra "amount of money mig received from hh (mig's account)"
replace q9_7_141 = . if q9_7_141==9 | q9_7_141==2
replace q9_7_142 = . if q9_7_142==9 | q9_7_142==2
replace q9_7_143 = . if q9_7_143==9 | q9_7_143==2
replace q9_7_144 = . if q9_7_144==9 | q9_7_144==2
replace q9_7_145 = . if q9_7_145==9 | q9_7_145==2
egen q9_7_14 = rsum(q9_7_141 q9_7_142 q9_7_143 q9_7_144 q9_7_145)
bys mid mcep8: egen hrg = mode(q9_7_14), max
replace hrg = 1 if hrg>=1
lab var hrg "mig receives goods from hh (mig's account)?"
gen hrmg = 1 if hrm==1 | hrg==1
replace hrmg = 0 if hrm==0 & hrg==0
lab var hrmg "mig receives money or goods from hh (mig's account)?"
bys mid mcep8: keep if _n==1
keep mid mcep8 nrpid rm rt ra rg rmg hrm hrt hra hrg hrmg
sort nrpid, stable
save "Temp\mremit94t.dta", replace
* Merge mremit94 to mlife94 *
*****************************
use "Temp\mlife94t.dta", clear
sort nrpid, stable
merge nrpid using "Temp\mremit94t.dta"
drop if _merge==2
drop _merge
* IMPORTANT NOTE - Remittance question is not asked of migrants if the whole
* household from 1984 has moved. We can check if all the missing remittance
* observations actually come from hhs that have moved. We need to get the
* hhmovefl from mindiv94.
sort mid mcep8, stable
merge mid mcep8 using "mindiv94.dta", keep(hhmovefl hhid94)
drop if _merge==2
drop _merge
* Almost all of the remit variables are missing if hhmovefl==1
* But hhmovefl alone does not account for all the missing observations.
replace rm = . if year~=1994
replace rt = . if year~=1994
replace ra = . if year~=1994
replace rg = . if year~=1994
replace rmg = . if year~=1994
replace hrm = . if year~=1994
replace hrt = . if year~=1994
replace hra = . if year~=1994
replace hrg = . if year~=1994
replace hrmg = . if year~=1994
gen m94=1
keep mid mcep8 hhid94 cep94 nrpid vill84 vill94 year agem educm sexm occ1m p1m gavebirthm migtype stmigfl m94 marriedm nochm livchm ///
rm rt ra rg rmg hrm hrt hra hrg hrmg
sort nrpid year, stable
save "Temp\mlife94t.dta", replace
***********************
* 3. Clean life00.dta *
***********************
use "indiv00.dta", clear
sort hhid00 cep00, stable
save "indiv00.dta", replace
use "personid.dta", clear
keep if hhid00~="" & cep00~=""
sort hhid00 cep00, stable
save "Temp\personid00.dta", replace
use "life00.dta", clear
sort hhid00 cep00, stable
merge hhid00 cep00 using "Temp\personid00.dta", ///
keep(house84 cep84 hhid94 cep94 mcep8 mid hhid00 cep00 mid00 mcep00 nrpid vill84) update
* keep records iff master and using data match
keep if _merge==3
drop _merge
* merge with indiv00.dta to get sex
sort hhid00 cep00, stable
merge hhid00 cep00 using "indiv00.dta",keep(x4 hhid94) update
keep if _merge==3
drop _merge
ren x4 sex0
replace sex0 = 0 if sex0==2
ren x5_1age age0
ren x5_6ed educ0
ren x5_7j1_1 occ10
ren x5_5r1 p10
ren x5_4chld chinyr0
gen year = beyear-543
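* Worked example of the conversion above (assuming beyear is recorded in the Thai
* Buddhist Era, which runs 543 years ahead of the Gregorian calendar): B.E. 2543
* corresponds to 2543 - 543 = 2000.
display 2543 - 543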
gen gavebirth0 = . if sex0==1
replace gavebirth0=0 if sex0==0
replace gavebirth0=1 if chinyr0~=. & chinyr0~=9 & sex0==0
lab var gavebirth0 "gave birth this year?"
* Check if nrpid year uniquely identify individuals, drop o/w (no cases dropped)
sort year nrpid, stable
by year nrpid: gen i = _n
drop if i>1
drop i
* Check if age is increasing linearly over time (for 27 cases it is not - 2 unique nrpids)
* Change the cases with inconsistent age values.
gen agefl = 0
sort nrpid year, stable
by nrpid: replace agefl=1 if age0~=age0[_N]-(year[_N]-year)
tab agefl
codebook nrpid if agefl==1
*list nrpid year age0 if nrpid=="011614"
*list nrpid year age0 if nrpid=="013231"
replace age0=13 if year==1981 & nrpid=="011614"
drop if year<1981 & nrpid=="011614"
replace age0 = age0-1 if year<=1978 & nrpid=="013231"
drop agefl
destring p10, g(p)
drop p10
gen p10 = .
lab def dest 0 "00:NR" 1 "01:BR" 2 "02:KR" 3 "03:NEP" 4 "04:CB" 5 "05:RY" 6 "06:ESB" 7 "07:BK" ///
8 "08:BMA" 9 "09:OTH" 10 "10:INTL" 99 "99:Missing"
* Recode Place Variable - Life00 *
**********************************
replace p10 = 0 if p>=2000000 & p<4000000
replace p10 = 1 if p>=4000000 & p<4990000
replace p10 = 2 if p==6000000
replace p10 = 4 if p==8000000
replace p10 = 5 if p==7000000
replace p10 = 7 if p==5000000
replace p10 = 10 if p>=10000000 & p<=11000000
replace p10 = 99 if p==99999999
replace p = p-9000000 if p>=9000000 & p<9800000
replace p = p/10000 if p>10000
replace p10 = 3 if (p==4 | p==6 | p==11 | p==20 | p==42 | p==43 | p==45 | p==47 | p==54 | p==55 | p==56 | p==67 | p==70 | p==73 | p==76)
replace p10 = 6 if p==08
replace p10 = 8 if (p==19 | p==24 | p==28 | p==59 | p==61)
replace p10 = 9 if (p==2 | p==3 | p==5 | p==7 | p==10 | p==12 | p==22 | p==23 | p==25 | p==26 | p==44 | p==46 | p==48 | p==57 | p==58 | p==60 ///
| p==68 | p==69 | p==71 | p==72 | p==74 | p==75 | (p>=13 & p<=18) | (p>=29 & p<=41) | (p>=50 & p<=53) | (p>=62 & p<=66))
replace p10 = 7 if p==1
replace p10 = 4 if p==9
replace p10 = 2 if p==21
replace p10 = 5 if p==49
replace p10 = 1 if p==27
replace p10 = 99 if p10==.
lab val p10 dest
* Recode Occupation *
*********************
lab def occ 1 "1:Stu/NoWork" 2 "2:Farmer" 3 "3:Factory" 4 "4:Constrctn" 5 "5:Service" 6 "6:Other" 9 "9:Missing"
replace occ10 = 1 if occ10==1 | occ10==2 | occ10==8
replace occ10 = occ10-1 if occ10>=3 & occ10<=7
lab val occ10 occ
* Education, Marital Status, Children *
***************************************
sort hhid00 cep00, stable
merge hhid00 cep00 using "indiv00.dta", keep(x16 x20 x22 x23) update
drop if _merge==2
drop _merge
gen married0 = 0
replace married0=. if x16==99
replace married0=1 if age0>=x16
replace educ0 = x20 if (educ0==. | educ0>=94) & x20~=. & x20<94 & year==2000
gen noch0 = .
replace noch0 = x22 if x22<98 & sex0==0
lab var noch0 "(women only) children ever born"
gen livch0 = .
replace livch0 = x23 if x23<98 & sex0==0
lab var livch0 "(women only) living children"
drop x16 x20 x22 x23
gen l00=1
keep hhid00 cep00 nrpid vill00 age0 sex0 educ0 occ10 p10 gavebirth0 year migvill l00 ///
married0 noch0 livch0 house84 cep84 hhid94 cep94 mcep8 mid mid00 mcep00 vill84
sort nrpid year, stable
save "Temp\life00t.dta", replace
************************
* 4. Clean mlife00.dta *
************************
* Unique identifier for mlife00.dta is mid00 migtype mcep00
* Note that in personid.dta there are also identifiers named mid00b and mcep00b,
* but these are nonmissing for only 14 observations, so we ignore them.
use "mlife00.dta", clear
sort mid00 migtype mcep00, stable
merge mid00 migtype mcep00 using "Temp\personid00m.dta", ///
keep(house84 cep84 hhid94 cep94 mcep8 mid hhid00 cep00 mid00 mcep00 nrpid vill84) update
keep if _merge==3
drop _merge
* Drop non-study migrants (those that were interviewed as part of a migrant's hh, but do not come from
* the Nang Rong villages). Note there are 322 cases where stmigfl==0 but hhid00 is not missing. These cases
* are likely to be miscoded as non-study migrants. So, we only drop the cases with stmigfl==0 and hhid00=""
drop if stmigfl==0 & hhid00==""
* merge with mindiv00.dta to get sex
sort mid00 migtype mcep00, stable
merge mid00 migtype mcep00 using "Temp\mindiv00t.dta",keep(x8_4) update
keep if _merge==3
drop _merge
ren x8_4 sexm0
replace sexm0 = 0 if sexm0==2
* Get the missing values (2345) from the soldier, monk or chinyr variables.
replace sexm0 = 1 if x11_2~=. | x11_3~=.
replace sexm0 = 0 if x11_4ch~=.
* For 64 unique nrpid's sexm0 is missing!
ren x11_1age agem0
ren x11_6ed educm0
ren x11_7j11 occ1m0
ren x11_5r1 p1m0
ren x11_4ch chinyrm0
ren stmigfl stmigfl0
ren migtype migtype0
gen year = beyear-543
gen gavebirthm0 = . if sexm0==1 | sexm0==.
replace gavebirthm0=0 if sexm0==0
replace gavebirthm0=1 if chinyrm0~=. & chinyrm0~=9 & sexm0==0
lab var gavebirthm0 "gave birth this year?"
* Check if nrpid year uniquely identify individuals, drop o/w (no cases are dropped)
sort year nrpid, stable
by year nrpid: gen i = _n
drop if i>1
drop i
* Check if age is increasing linearly over time (it is for all the cases)
gen agefl = 0
sort nrpid year, stable
by nrpid: replace agefl=1 if agem~=agem[_N]-(year[_N]-year)
tab agefl
drop agefl
destring p1m0, g(p)
drop p1m0
gen p1m0 = .
lab def dest 0 "00:NR" 1 "01:BR" 2 "02:KR" 3 "03:NEP" 4 "04:CB" 5 "05:RY" 6 "06:ESB" 7 "07:BK" ///
8 "08:BMA" 9 "09:OTH" 10 "10:INTL" 99 "99:Missing"
* Recode Place Variable - Mlife00 *
***********************************
* Note - From Chang's notes 702 is NR!
replace p1m0 = 0 if ((p>=8000000 & p<9000000) | p==702)
replace p1m0 = 1 if (p==701| (p>=703 & p<800))
replace p1m0 = 2 if p==3
replace p1m0 = 4 if p==5
replace p1m0 = 5 if p==4
replace p1m0 = 7 if p==2
replace p1m0 = 10 if p>=900 & p<1000
replace p1m0 = 99 if p==9999999
* Note p takes on values of 2-5 for bkk, cb, ry, etc. We already considered these, so set p to .
replace p = . if p>=2 & p<=5
* Provinces are coded both as 6+prov# (mlife00 codebook) and 9+4digit prov code (mlife94 codebook)
replace p = p-600 if p>=600 & p<700
replace p1m0 = 3 if (p==4 | p==6 | p==11 | p==20 | p==42 | p==43 | p==45 | p==47 | p==54 | p==55 | p==56 | p==67 | p==70 | p==73 | p==76)
replace p1m0 = 6 if p==08
replace p1m0 = 8 if (p==19 | p==24 | p==28 | p==59 | p==61)
replace p1m0 = 9 if (p==2 | p==3 | p==5 | p==7 | p==10 | p==12 | p==22 | p==23 | p==25 | p==26 | p==44 | p==46 | p==48 | p==57 | p==58 | p==60 ///
| p==68 | p==69 | p==71 | p==72 | p==74 | p==75 | (p>=13 & p<=18) | (p>=29 & p<=41) | (p>=50 & p<=53) | (p>=62 & p<=66))
replace p1m0 = 7 if p==1
replace p1m0 = 4 if p==9
replace p1m0 = 2 if p==21
replace p1m0 = 5 if p==49
replace p1m0 = 1 if p==27
replace p1m0 = 99 if p1m0==.
lab val p1m0 dest
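* Optional check (sketch): every recoded destination should now carry one of the codes defined in
* the dest label. A failed assertion is reported but does not stop the run.
capture noisily assert inlist(p1m0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 99)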
* Recode Occupation *
*********************
lab def occ 1 "1:Stu/NoWork" 2 "2:Farmer" 3 "3:Factory" 4 "4:Constrctn" 5 "5:Service" 6 "6:Other" 9 "9:Missing"
replace occ1m0 = 1 if occ1m0==1 | occ1m0==2 | occ1m0==8
replace occ1m0 = occ1m0-1 if occ1m0>=3 & occ1m0<=7
lab val occ1m0 occ
* Education, Marital Status, Children *
***************************************
ren migtype0 migtype
sort mid00 migtype mcep00, stable
merge mid00 migtype mcep00 using "Temp\mindiv00t.dta", keep(x8_8 x8_12 x8_17_1 x8_17_2) update
drop if _merge==2
drop _merge
gen marriedm0 = 0
replace marriedm0=. if x8_8==99
replace marriedm0=1 if agem0>=x8_8
replace educm0 = x8_12 if (educm0==. | educm0>=94) & x8_12~=. & x8_12<94 & year==2000
gen nochm0 = .
replace nochm0 = x8_17_1 if x8_17_1<98 & sexm0==0
lab var nochm0 "(women only) children ever born"
gen livchm0 = .
replace livchm0 = x8_17_2 if x8_17_2<98 & sexm0==0
lab var livchm0 "(women only) living children"
drop x8_8 x8_12 x8_17_1 x8_17_2
save "Temp\mlife00t.dta", replace
* Remittances Sent or Received by Individual *
**********************************************
* We need to merge mlife00 to mindiv00 to get migrant-reported remittance information.
* Note in the 2000 data collection, remittance information was collected from individual
* study migrants as well as migrant's household in destination (which could include spouses,
* children, parents, or unrelated people, as long as there are 1-2 of them). The former remittance
* information is stored in mindiv00, while the latter is stored in mfam00. For the sake of consistency
* with 1994 wave, we are interested in individual remittances, so use the information from mindiv00.
* Note that there are many missing values for the remittances variables in the individual case.
use "Temp\mlife00t.dta", clear
sort mid00 migtype mcep00, stable
merge mid00 migtype mcep00 using "Temp\mindiv00t.dta", keep(x9_14* x9_15*)
drop if _merge==2
drop _merge
* A migrant may send money to three designated households in the origin, whose id's are stored in
* x9_15hh1, x9_15hh2, x9_15hh3. As in the case of 1994 data, we are only interested in the remittances
* a migrant sends to his or her origin household. So we flag the observations for which hhid00
* matches one of the above hhids.
gen fl = 1 if x9_15hh1==hhid00
* There is only one unique migrant with a non-missing remittance variable (x9_15_12) for the second household.
* The id of this hh (x9_15hh2) does not match any hhid00. Therefore we focus only on x9_15hh1, and on
* remittances to the first household, from now on.
drop x9_15hh2 x9_15_12 x9_15402 x9_15_22 x9_15_32 x9_14_12 x9_14402 x9_14_22 x9_14_32
* There are 160 unique individual migrants with missing receiving hh ids (x9_15hh1). Since only one migrant
* sends remittances to a hh other than his/her own (i.e., x9_15hh2~="" & x9_15_12~=.), we can safely assume
* that those with missing hh ids are remitting to their origin hhs. (Note the question specifically asks
* if the migrant sent remittances to a household in his/her village of origin. So, the question does not
* encompass out-of-village remittances where hhids would naturally be missing, increasing the
* plausibility of our assumption.)
replace fl = 1 if x9_15hh1=="" & x9_15~=.
* As a result, we have 759 unique individuals for whom we have remittance information. (codebook nrpid if fl==1)
** x9_15 identifies if mig sent money or goods to hhs (*could be more than one*) in the last 12 months.
** NOTE - This variable is key in determining the values of the following variables, as send money? (x9_15_11)
** is not asked if a migrant said no to x9_15. Therefore, we set the money remittance indicator to 0 if
** x9_15==2 (no); similarly, the remittance frequency and amount variables are set to 0.
gen rmg0 = . if x9_15==.
replace rmg0 = 0 if x9_15==2
replace rmg0 = 1 if x9_15==1
lab var rmg0 "mig sends money/goods to hh? (mig's account)"
** x9_15_11 and x9_15_12 identify if mig sends money to hh#1 and hh#2 resp. in the last 12 months.
** Note x9_15_12 is missing for all but one migrant, for whom the receiving hh id does not match any in our records.
** So we ignore this variable.
gen rm0 = . if x9_15_11==.
replace rm0 = 0 if x9_15_11==2
* Note - Sent money? qn is asked of inds. who have said yes to sent money/goods?
replace rm0 = 0 if rmg0==0
replace rm0 = 1 if x9_15_11==1
lab var rm0 "mig sends money to hh? (mig's account)"
* There are 21 obs where rm0==1 and rmg0==0, correct these.
replace rmg0 = 1 if rm0==1
** x9_15401 identifies if mig sends goods to hh#1 in the last 12 months.
* NOTE - x9_15401 asks a negative qn - "Sent no goods to hh?" (1: yes, 2:no)
gen rg0 = . if x9_15401==.
replace rg0 = 0 if x9_15401==1
* Note - Sent goods? qn is asked of inds. who have said yes to sent money/goods?
replace rg0 = 0 if rmg0==0
replace rg0 = 1 if x9_15401==2
lab var rg0 "mig sends goods to hh? (mig's account)"
** x9_15_21 identifies the number of times mig sent money to hh#1.
gen rt0 = x9_15_21
replace rt0 = 0 if rm0==0
** x9_15_31 identifies the amount of money mig sent to hh#1 in the last 12 months.
* Amount sent is coded as follows: (1) <1K baht - (2) 1K-3K baht - (3) 3K-5K baht - (4) 5K-10K baht
* (5) 10K-20K baht - (6) 20K-40K baht - (7) >40K. Assign each interval a representative value (roughly its
* midpoint; note that 1K-3K is set to 1,500 and the open-ended top category to 50,000) to estimate
* the amount sent. Then sum these amounts across observations for the same individual.
replace x9_15_31 = 500 if x9_15_31==1
replace x9_15_31 = 1500 if x9_15_31==2
replace x9_15_31 = 4000 if x9_15_31==3
replace x9_15_31 = 7500 if x9_15_31==4
replace x9_15_31 = 15000 if x9_15_31==5
replace x9_15_31 = 30000 if x9_15_31==6
replace x9_15_31 = 50000 if x9_15_31==7
replace x9_15_31 = . if x9_15_31==9
gen ra0 = x9_15_31
replace ra0 = . if x9_15_31==.
replace ra0 = 0 if rm0==0
lab var rt0 "times mig sent money to hh (mig's account)"
lab var ra0 "amount of money mig sent to hh (mig's account)"
* There are 18 obs for which rt0==25 and ra0==0. Correct these.
replace rt0 = 0 if ra0==0
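* Optional checks (sketch): after the recodes above, the amount should take only the assigned
* representative values, and non-remitters should have zero frequency and amount. The first check
* assumes x9_15_31 has no codes other than 1-7 and 9; failures are reported without stopping the run.
capture noisily assert inlist(ra0, 0, 500, 1500, 4000, 7500, 15000, 30000, 50000) | missing(ra0)
capture noisily assert rt0==0 & ra0==0 if rm0==0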
** x9_14 identifies if any hh sends money or goods to mig in the last 12 months
** x9_14_11 identifies if hh sends money to mig in the last 12 months
** x9_14401 identifies if hh sends goods to mig in the last 12 months
** x9_14_21 identifies the number of times hh sends money to mig in the last 12 months
** x9_14_31 identifies the amount of money hh sends to mig in the last 12 months
gen hrmg0 = . if x9_14==.
replace hrmg0 = 0 if x9_14==2
replace hrmg0 = 1 if x9_14==1
gen hrm0 = . if x9_14_11==.
replace hrm0 = 0 if x9_14_11==2
* Note - Sent money? qn is asked of inds. who have said yes to sent money/goods?
replace hrm0 = 0 if hrmg0==0
replace hrm0 = 1 if x9_14_11==1
* NOTE - x9_14401 asks a negative qn - "Sent no goods to mig?" (1: yes, 2:no)
gen hrg0 = . if x9_14401==.
replace hrg0 = 0 if x9_14401==1
* Note - Sent goods? qn is asked of inds. who have said yes to sent money/goods?
replace hrg0 = 0 if hrmg0==0
replace hrg0 = 1 if x9_14401==2
gen hrt0 = x9_14_21
replace hrt0 = 0 if hrm0==0
replace x9_14_31 = 500 if x9_14_31==1
replace x9_14_31 = 1500 if x9_14_31==2
replace x9_14_31 = 4000 if x9_14_31==3
replace x9_14_31 = 7500 if x9_14_31==4
replace x9_14_31 = 15000 if x9_14_31==5
replace x9_14_31 = 30000 if x9_14_31==6
replace x9_14_31 = 50000 if x9_14_31==7
replace x9_14_31 = . if x9_14_31==9
gen hra0 = x9_14_31
replace hra0 = 0 if hrm0==0
lab var hrmg0 "mig receives money or goods from hh? (mig's account)"
lab var hrm0 "mig receives money from hh? (mig's account)"
lab var hrg0 "mig receives goods from hh? (mig's account)"
lab var hrt0 "times mig received money from hh (mig's account)"
lab var hra0 "amount of money mig received from hh (mig's account)"
replace rm0 = . if year~=2000
replace rt0 = . if year~=2000
replace ra0 = . if year~=2000
replace rg0 = . if year~=2000
replace rmg0 = . if year~=2000
replace hrm0 = . if year~=2000
replace hrt0 = . if year~=2000
replace hra0 = . if year~=2000
replace hrg0 = . if year~=2000
replace hrmg0 = . if year~=2000
ren migtype migtype0
gen m00 = 1
keep house84 cep84 hhid94 cep94 mcep8 mid hhid00 cep00 mcep00 mid00 mcep00 nrpid vill84 vill00 agem0 sexm0 educm0 ///
occ1m0 p1m0 gavebirthm0 year stmigfl0 migtype0 m00 marriedm0 nochm0 livchm0 rm0 rt0 ra0 rg0 rmg0 ///
hrm0 hrt0 hra0 hrg0 hrmg0
sort nrpid year, stable
save "Temp\mlife00t.dta", replace
*************************************
* 5. Merge life94.dta to life00.dta *
*************************************
use "Temp\life94t.dta", clear
merge nrpid year using "Temp\life00t.dta", update
* Note - for 35,250 cases _merge==3, meaning that the observations are common to both 1994 and 2000 waves.
* We need to take the values from life94 for years prior to 1994, and from life00 for years after 1994.
* For each variable, we also need to do consistency checks in transitions from 1994 to 1995.
* AGE *
*******
* Replace all the values for observations from the using (life00) data set.
* 121,727 changes made. Note keep the 1994 values for age for observations
* that are common to both data sets.
replace age = age0 if _merge==2
* Take the missing values in 94 data from 00 data - no changes made.
replace age = age0 if age==.
* Establish checks if age is increasing linearly over time
* Typically, there is an inconsistency in the transition year 1995, where the observation comes from
* life00, but indicates the same age as in 1994. Note that this is possible depending on the
* month of interviewing in 1994 and 2000 (i.e., if the month is before someone's birth month, s/he
* will be recorded as one year younger).
* Take the 1994 age as the baseline, and compute retrospectively and prospectively.
* Note 1994 age is more reliable than 2000 age, using the latter leads to negative
* age values.
gen agefl = 0
sort nrpid year, stable
bys nrpid: gen aget = age if year==1994
by nrpid: egen age94 = total(aget)
by nrpid: replace agefl=1 if age~=age94-(1994-year) & age94~=0
by nrpid: replace age = age94-(1994-year) if agefl>=1
drop agefl* age0 aget age94
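* Optional check (sketch): within each nrpid, age should now advance one year per calendar year.
* Individuals without a 1994 observation (age94==0 above) are not rebuilt, so this may flag them.
* The sort is repeated here only so that the check does not depend on the current ordering.
sort nrpid year, stable
capture noisily by nrpid: assert age==age[_n-1]+(year-year[_n-1]) if _n>1 & age<. & age[_n-1]<.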
* EDUCATION *
*************
* Replace all the values for observations from the using (life00) data set.
* 20513 changes made.
replace educ = educ0 if _merge==2
* Take the missing values in 94 data from 00 data - no changes made.
replace educ = educ0 if educ==.
gen educ2 = educ
* Missing values - identify the first observation (before missing observations start) and the last observation
* (after the missing observations). In other words, first and last note the beginning and end of the missing
* interval.
gen first = 0
gen last = 0
sort nrpid year
by nrpid: replace first=_n if educ~=. & educ<94 & (educ[_n+1]==. | educ[_n+1]>=94) & _n<_N
by nrpid: replace last=_n if educ~=. & educ<94 & (educ[_n-1]==. | educ[_n-1]>=94) & _n>1 & first==0
by nrpid: gen educ_f = educ if first~=0
by nrpid: gen educ_l = educ if last~=0
sort nrpid first
by nrpid: replace first=first[_N]
by nrpid: replace educ_f=educ_f[_N]
sort nrpid last
by nrpid: replace last=last[_N]
by nrpid: replace educ_l=educ_l[_N]
* First consider the case where there are missing observations between two non-missing educ values.
* We need to interpolate the values in between.
* Consider the case where the first and last nonmissing obs are identical. Fill in the values
* in between with the same value.
sort nrpid year
by nrpid: replace educ=educ_f if first~=0 & last~=0 & educ_f==educ_l & _n>first & _n<last
* ... educ_f>educ_l & _n>first & _n<=last & educ_f~=. & educ_l~=.
* Consider the case where last==0 and first>0: we need to fill the values after the first prospectively. Note
* first and last refer to the beginning and ending observations of the missing interval. If last==0, there is
* no end to the missing values, so we set them all to the beginning value of the interval.
* codebook nrpid if last==0 & first>0
by nrpid: replace educ=educ_f if first~=. & first>0 & last==0 & _n>first
* Now, again consider the case where first>0 and last==0, and first is in the middle of missing values. We already
* filled in the missing values after first, now we need to fill the ones before it retrospectively. See the note below
* on age-education structure.
* codebook nrpid if last==0 & first>0 & (educ==. | educ>=94)
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if first>0 & last==0 & _n>(_N-first+1) & age[_n-1]>educ[_n-1]+6 & (educ==.|educ>=94)
by nrpid: replace educ = educ[_n-1]-1 if first>0 & last==0 & _n>(_N-first+1) & age[_n-1]==educ[_n-1]+6 & (educ==.|educ>=94)
* Now, consider the case where first==0 and last>0: we need to fill the values retrospectively before the last
* observation. Note, we need to check the age of the ind before we fill any preceding missing values. For instance,
* if an ind is 40 and has 6 years of education, we cannot decrease it to 0 by the age 34. Instead, we need to copy
* the number 6 back to the age 12. By contrast, if a 14 yr old has 8 years of education, and educ for the ages 12
* and 13 are missing, then we should have educ=7 at age 13, and educ=6 at age 12. The rule coded below is to copy
* the next year's educ if next year's age exceeds next year's educ by more than 6, and to reduce educ by one per
* year if age[next year]==educ[next year]+6. Note observations are sorted in descending order of year,
* so [_n-1] indexes the next year's observation.
* codebook nrpid if last>0 & first==0
* codebook nrpid if last>0 & first==0 & age>20 & educ>10 & educ~=. & educ<94
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if first==0 & last>0 & _n>1 & age[_n-1]>educ[_n-1]+6
by nrpid: replace educ = educ[_n-1]-1 if first==0 & last>0 & _n>1 & age[_n-1]==educ[_n-1]+6
* Now, we need to consider the case where first>0 and last>0, and since we already filled the gap in the interval
* between first and last above, we need to fill the missing values preceding first retrospectively. The above discussion
* applies here as well.
* codebook nrpid if last>0 & first>0 & (educ==. | educ>=94)
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if first>0 & last>0 & _n>(_N-first+1) & age[_n-1]>educ[_n-1]+6 & (educ==.|educ>=94)
by nrpid: replace educ = educ[_n-1]-1 if first>0 & last>0 & _n>(_N-first+1) & age[_n-1]==educ[_n-1]+6 & (educ==.|educ>=94)
* Now, we need to consider the case where first>0 and last>0, and since we already filled the gap in the interval
* between first and last above, we need to fill the missing values after the last prospectively.
* codebook nrpid if last>0 & first>0 & (educ==. | educ>=94)
sort nrpid year
by nrpid: replace educ=educ[_n-1] if (educ==. | educ>=94) & educ[_n-1]~=. & educ[_n-1]<94 & first>0 & last>0 & _n>1
* There are only two specific cases with missing values in between. (You find them by running the code below, and
* typing the commented out 'codebook' command.) Correct for these cases as follows:
gsort nrpid -year
by nrpid: replace educ=educ[_n-1] if (educ==. | educ>=94) & educ[_n-1]~=. & educ[_n-1]<94 & nrpid=="016817"
by nrpid: replace educ=educ[_n-1]-1 if (educ==. | educ>=94) & educ[_n-1]~=. & educ[_n-1]<94 & nrpid=="034557"
* Now, we should have no missing observations except for cases where all values for an nrpid are missing.
sort nrpid year
gen miss = 0
replace miss = 1 if educ==. | educ>=94
bys nrpid: egen misst=total(miss)
bys nrpid: gen N=_N
*codebook nrpid if misst<N & misst>0
* NOTE - We have recovered 98% of observations (tab misst). Now, all we need to do is to see if education is increasing
* linearly.
drop N miss misst
* Check if educ is increasing linearly
gen edfl=1
sort nrpid year, stable
* Unflag all the observations for which educ is stable, or increasing linearly
by nrpid: replace edfl = 0 if educ==educ[_n-1] & _n>1
by nrpid: replace edfl = 0 if educ==educ[_n+1] & _n<_N
by nrpid: replace edfl = 0 if educ==educ[_n-1]+1 & _n>1
by nrpid: replace edfl = 0 if educ==educ[_n+1]-1 & _n<_N
replace edfl = 0 if educ>=94 | educ==.
by nrpid: replace edfl = 1 if educ-educ[_n-1]>1 & educ<94 & educ~=. & educ[_n-1]<94 & educ[_n-1]~=. & _n>1
by nrpid: replace edfl = 1 if educ-educ[_n+1]>1 & educ<94 & educ~=. & educ[_n+1]<94 & educ[_n+1]~=. & _n<_N
by nrpid: egen edflt = total(edfl)
*codebook nrpid if edflt>0
*list nrpid year age educ educ2 edfl edflt if edflt>=1
* NOTE - For only 130 unique nrpid's we have non-linearly increasing education. A total of 2,431 observations.
* We might as well drop them. Take the final observation as the right one, and make changes accordingly
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if edflt>=1 & _n>1 & age[_n-1]>educ[_n-1]+6
by nrpid: replace educ = educ[_n-1]-1 if edflt>=1 & _n>1 & age[_n-1]==educ[_n-1]+6
* If the last value seems not right, e.g., 10 years of educ at age 14, then take the first observation as
* the right one
gsort nrpid year
* by nrpid: replace educ = educ[_n-1]+1 if edflt>=1 & _n>1 & age[_n]< ...
* ... >noch, correct.
replace noch = livch if livch>noch
drop noch0 livch0 married0
drop _merge
sort nrpid year, stable
save "Temp\life9400t.dta", replace
***************************************
* 6. Merge mlife94.dta and mlife00.dta *
***************************************
use "Temp\mlife94t.dta", clear
merge nrpid year using "Temp\mlife00t.dta", update
* Note - for 7,758 cases _merge==3, meaning that the values in master and using data set are common.
* AGE *
*******
* Replace all the values for observations from the using (mlife00) data set.
* 29,685 changes made.
replace agem = agem0 if _merge==2
* Take the missing values in 94 data from 00 data - no changes made.
replace agem = agem0 if agem==.
* Establish checks if age is increasing linearly over time
* Typically, there is an inconsistency in the transition year 1995, where the observation comes from
* mlife00, but indicates the same age as in 1994. Note that this is possible depending on the
* month of interviewing in 1994 and 2000 (i.e., if the month is before someone's birth month, s/he
* will be recorded as one year younger).
* Take the 1994 age as the baseline, and compute retrospectively and prospectively.
* Note 1994 age is more reliable than 2000 age, using the latter leads to negative
* age values.
gen agefl=0
sort nrpid year, stable
bys nrpid: gen aget = agem if year==1994
bys nrpid (aget): gen age94 = aget[1] if aget[1]~=.
* There are two cases with very inconsistent agem values that lead to negative ages when the code below
* is run. Drop them.
drop if nrpid=="041046" | nrpid=="050310"
drop if nrpid=="031490" & year==1970
by nrpid: replace agefl=1 if agem~=age94-(1994-year) & age94~=.
by nrpid: replace agem = age94-(1994-year) if agefl>=1
drop agem0 agefl aget age94
* EDUCATION *
*************
ren educm educ
ren educm0 educ0
* Replace all the values for observations from the using (mlife00) data set.
* 7362 changes made.
replace educ = educ0 if _merge==2
* Take the missing values in 94 data from 00 data - no changes made.
replace educ = educ0 if educ==.
gen educ2 = educ
* Missing values - identify the first observation (before missing observations start) and the last observation
* (after the missing observations). In other words, first and last note the beginning and end of the missing
* interval.
gen first = 0
gen last = 0
sort nrpid year
by nrpid: replace first=_n if educ~=. & educ<94 & (educ[_n+1]==. | educ[_n+1]>=94) & _n<_N
by nrpid: replace last=_n if educ~=. & educ<94 & (educ[_n-1]==. | educ[_n-1]>=94) & _n>1 & first==0
by nrpid: gen educ_f = educ if first~=0
by nrpid: gen educ_l = educ if last~=0
sort nrpid first
by nrpid: replace first=first[_N]
by nrpid: replace educ_f=educ_f[_N]
sort nrpid last
by nrpid: replace last=last[_N]
by nrpid: replace educ_l=educ_l[_N]
* First consider the case where there are missing observations between two non-missing educ values.
* We need to interpolate the values in between.
* Consider the case where the first and last nonmissing obs are identical. Fill in the values
* in between with the same value.
sort nrpid year
by nrpid: replace educ=educ_f if first~=0 & last~=0 & educ_f==educ_l & _n>first & _n<last
* ... educ_f>educ_l & _n>first & _n<=last & educ_f~=. & educ_l~=.
* Consider the case where last==0 and first>0: we need to fill the values after the first prospectively. Note
* first and last refer to the beginning and ending observations of the missing interval. If last==0, there is
* no end to the missing values, so we set them all to the beginning value of the interval.
* codebook nrpid if last==0 & first>0
by nrpid: replace educ=educ_f if first~=. & first>0 & last==0 & _n>first
* Now, again consider the case where first>0 and last==0, and first is in the middle of missing values. We already
* filled in the missing values after first, now we need to fill the ones before it retrospectively. See the note below
* on age-education structure.
* codebook nrpid if last==0 & first>0 & (educ==. | educ>=94)
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if first>0 & last==0 & _n>(_N-first+1) & age[_n-1]>educ[_n-1]+6 & (educ==.|educ>=94)
by nrpid: replace educ = educ[_n-1]-1 if first>0 & last==0 & _n>(_N-first+1) & age[_n-1]==educ[_n-1]+6 & (educ==.|educ>=94)
* Now, consider the case where first==0 and last>0: we need to fill the values retrospectively before the last
* observation. Note, we need to check the age of the ind before we fill any preceding missing values. For instance,
* if an ind is 40 and has 6 years of education, we cannot decrease it to 0 by the age 34. Instead, we need to copy
* the number 6 back to the age 12. By contrast, if a 14 yr old has 8 years of education, and educ for the ages 12
* and 13 are missing, then we should have educ=7 at age 13, and educ=6 at age 12. The rule coded below is to copy
* the next year's educ if next year's age exceeds next year's educ by more than 6, and to reduce educ by one per
* year if age[next year]==educ[next year]+6. Note observations are sorted in descending order of year,
* so [_n-1] indexes the next year's observation.
* codebook nrpid if last>0 & first==0
* codebook nrpid if last>0 & first==0 & age>20 & educ>10 & educ~=. & educ<94
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if first==0 & last>0 & _n>1 & age[_n-1]>educ[_n-1]+6
by nrpid: replace educ = educ[_n-1]-1 if first==0 & last>0 & _n>1 & age[_n-1]==educ[_n-1]+6
* Now, we need to consider the case where first>0 and last>0, and since we already filled the gap in the interval
* between first and last above, we need to fill the missing values preceding first retrospectively. The above discussion
* applies here as well.
* codebook nrpid if last>0 & first>0 & (educ==. | educ>=94)
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if first>0 & last>0 & _n>(_N-first+1) & age[_n-1]>educ[_n-1]+6 & (educ==.|educ>=94)
by nrpid: replace educ = educ[_n-1]-1 if first>0 & last>0 & _n>(_N-first+1) & age[_n-1]==educ[_n-1]+6 & (educ==.|educ>=94)
* Now, we need to consider the case where first>0 and last>0, and since we already filled the gap in the interval
* between first and last above, we need to fill the missing values after the last prospectively.
* codebook nrpid if last>0 & first>0 & (educ==. | educ>=94)
sort nrpid year
by nrpid: replace educ=educ[_n-1] if (educ==. | educ>=94) & educ[_n-1]~=. & educ[_n-1]<94 & first>0 & last>0 & _n>1
* There are only four specific cases with missing values in between. (You find them by running the code below, and
* typing the commented out 'codebook' command.) Correct for these cases as follows:
gsort nrpid -year
by nrpid: replace educ=educ[_n-1]-1 if (educ==. | educ>=94) & educ[_n-1]~=. & educ[_n-1]<94 & ///
(nrpid=="030338"| nrpid=="033046" | nrpid=="034109" | nrpid=="049332")
* Now, we should have no missing observations except for cases where all values for an nrpid are missing.
sort nrpid year
gen miss = 0
replace miss = 1 if educ==. | educ>=94
bys nrpid: egen misst=total(miss)
bys nrpid: gen N=_N
*codebook nrpid if misst<N & misst>0
* NOTE - We have recovered 93% of observations (tab misst). Now, all we need to do is to see if education is increasing
* linearly.
drop N miss misst
* Check if educ is increasing linearly
gen edfl=1
sort nrpid year, stable
* Unflag all the observations for which educ is stable, or increasing linearly
by nrpid: replace edfl = 0 if educ==educ[_n-1] & _n>1
by nrpid: replace edfl = 0 if educ==educ[_n+1] & _n<_N
by nrpid: replace edfl = 0 if educ==educ[_n-1]+1 & _n>1
by nrpid: replace edfl = 0 if educ==educ[_n+1]-1 & _n<_N
replace edfl = 0 if educ>=94 | educ==.
by nrpid: replace edfl = 1 if educ-educ[_n-1]>1 & educ<94 & educ~=. & educ[_n-1]<94 & educ[_n-1]~=. & _n>1
by nrpid: replace edfl = 1 if educ-educ[_n+1]>1 & educ<94 & educ~=. & educ[_n+1]<94 & educ[_n+1]~=. & _n<_N
by nrpid: egen edflt = total(edfl)
*codebook nrpid if edflt>0
*list nrpid year age educ educ2 edfl edflt if edflt>=1
* NOTE - For only 62 unique nrpid's we have non-linearly increasing education. A total of 921 observations.
* Take the final observation as the right one, and make changes accordingly
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if edflt>=1 & _n>1 & age[_n-1]>educ[_n-1]+6
by nrpid: replace educ = educ[_n-1]-1 if edflt>=1 & _n>1 & age[_n-1]==educ[_n-1]+6
* If the last value seems not right, e.g., 10 years of educ at age 14, then take the first observation as
* the right one
gsort nrpid year
* by nrpid: replace educ = educ[_n-1]+1 if edflt>=1 & _n>1 & age[_n]< ...
* ... >noch, correct.
replace nochm = livchm if livchm>nochm
drop nochm0 livchm0 marriedm0
* Remittances Sent or Received by Individual *
**********************************************
replace rm = rm0 if _merge==2
replace rg = rg0 if _merge==2
replace rmg = rmg0 if _merge==2
replace ra = ra0 if _merge==2
replace rt = rt0 if _merge==2
replace hrm = hrm0 if _merge==2
replace hrg = hrg0 if _merge==2
replace hrmg = hrmg0 if _merge==2
replace hra = hra0 if _merge==2
replace hrt = hrt0 if _merge==2
drop _merge rm0 rg0 rmg0 ra0 rt0 hrm0 hrg0 hrmg0 hra0 hrt0
sort nrpid year, stable
save "Temp\mlife9400t.dta", replace
*******************************************
* 7. Merge life9400.dta and mlife9400.dta *
*******************************************
use "Temp\life9400t.dta", clear
* Note - when life9400 is merged with mlife9400, there are 8,280 obs for which the using data replace
* the missing values of the master data (due to the update option; since only the id variables, such as vill84, are
* common to both data sets, i.e., have the same names, agreement is assessed on those variables only). In addition,
* there are 318 obs for which master and using are in agreement, and 9 obs for which they are not. We take the
* master data values for the latter.
* Note _merge==3|_merge==4|_merge==5 refer to obs that are common to both data sets. (8,607 in total)
* THESE ARE THE OBS FOR WHICH WE NEED TO MAKE A DECISION - WHICH DATA SET (MIG OR HH) IS MORE RELEVANT?
merge nrpid year using "Temp\mlife9400t.dta", update
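* Optional check (sketch): tabulate _merge to reproduce the counts quoted above. With the update
* option, 3 = match where master and using agree, 4 = match where missing master values were
* updated, and 5 = match with conflicting nonmissing values (master values are kept).
tab _merge
count if inlist(_merge, 3, 4, 5)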
* AGE *
*******
replace age = agem if _merge==2
* Take the 1994 age as the baseline, and compute retrospectively and prospectively.
* Note 1994 age is more reliable than 2000 age, using the latter leads to negative
* age values.
gen agefl = 0
* Find the inconsistent values for age
sort nrpid year, stable
bys nrpid: gen aget = age if year==1994
bys nrpid (aget): gen age94 = aget[1] if aget[1]~=.
by nrpid: replace agefl=1 if age~=age94-(1994-year) & age94~=.
by nrpid: replace age = age94-(1994-year) if agefl>=1
drop agefl* agem aget age94
* EDUCATION *
*************
* Replace all the values for observations from the using (mlife9400) data set.
* 41,511 changes made.
replace educ = educm if _merge==2
* Take the missing values in the hh data from the migrant data - 143 changes made.
replace educ = educm if educ==. & educm~=. & educm<94
gen educ2 = educ
* There is 1 obs with negative educ - drop it.
drop if educ<0
* Missing values - identify the first observation (before missing observations start) and the last observation
* (after the missing observations). In other words, first and last note the beginning and end of the missing
* interval.
gen first = 0
gen last = 0
sort nrpid year
by nrpid: replace first=_n if educ~=. & educ<94 & (educ[_n+1]==. | educ[_n+1]>=94) & _n<_N
by nrpid: replace last=_n if educ~=. & educ<94 & (educ[_n-1]==. | educ[_n-1]>=94) & _n>1 & first==0
by nrpid: gen educ_f = educ if first~=0
by nrpid: gen educ_l = educ if last~=0
sort nrpid first
by nrpid: replace first=first[_N]
by nrpid: replace educ_f=educ_f[_N]
sort nrpid last
by nrpid: replace last=last[_N]
by nrpid: replace educ_l=educ_l[_N]
* First consider the case where there are missing observations between two non-missing educ values.
* We need to interpolate the values in between. NOTE - NO CHANGES MADE - SHOWING THAT OUR PRIOR
* CORRECTIONS WERE SUCCESSFUL
* Consider the case where the first and last nonmissing obs are identical. Fill in the values
* in between with the same value.
sort nrpid year
by nrpid: replace educ=educ_f if first~=0 & last~=0 & educ_f==educ_l & _n>first & _n<last
* ... educ_f>educ_l & _n>first & _n<=last & educ_f~=. & educ_l~=.
* Consider the case where last==0 and first>0: we need to fill the values after the first prospectively. Note
* first and last refer to the beginning and ending observations of the missing interval. If last==0, there is
* no end to the missing values, so we set them all to the beginning value of the interval. (350 CHANGES MADE)
* codebook nrpid if last==0 & first>0
by nrpid: replace educ=educ_f if first~=. & first>0 & last==0 & _n>first
* Now, again consider the case where first>0 and last==0, and first is in the middle of missing values. We already
* filled in the missing values after first, now we need to fill the ones before it retrospectively. See the note below
* on age-education structure. (NO CHANGES MADE)
* codebook nrpid if last==0 & first>0 & (educ==. | educ>=94)
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if first>0 & last==0 & _n>(_N-first+1) & age[_n-1]>educ[_n-1]+6 & (educ==.|educ>=94)
by nrpid: replace educ = educ[_n-1]-1 if first>0 & last==0 & _n>(_N-first+1) & age[_n-1]==educ[_n-1]+6 & (educ==.|educ>=94)
* Now, consider the case where first==0 and last>0: we need to fill the values retrospectively before the last
* observation. Note, we need to check the age of the ind before we fill any preceding missing values. For instance,
* if an ind is 40 and has 6 years of education, we cannot decrease it to 0 by the age 34. Instead, we need to copy
* the number 6 back to the age 12. By contrast, if a 14 yr old has 8 years of education, and educ for the ages 12
* and 13 are missing, then we should have educ=7 at age 13, and educ=6 at age 12. The rule coded below is to copy
* the next year's educ if next year's age exceeds next year's educ by more than 6, and to reduce educ by one per
* year if age[next year]==educ[next year]+6. Note observations are sorted in descending order of year,
* so [_n-1] indexes the next year's observation. (498+15 CHANGES MADE)
* codebook nrpid if last>0 & first==0
* codebook nrpid if last>0 & first==0 & age>20 & educ>10 & educ~=. & educ<94
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if first==0 & last>0 & _n>1 & age[_n-1]>educ[_n-1]+6
by nrpid: replace educ = educ[_n-1]-1 if first==0 & last>0 & _n>1 & age[_n-1]==educ[_n-1]+6
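* Illustrative sketch, not part of the original code: the two replace statements above applied to a
* made-up person observed at ages 13-15 with educ recorded only at age 15. The toy values and the
* choice first=0, last=1 are assumptions for illustration. preserve/restore set the real data aside
* and bring them back unchanged.
preserve
clear
input nrpid year age educ first last
1 1990 13 . 0 1
1 1991 14 . 0 1
1 1992 15 8 0 1
end
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if first==0 & last>0 & _n>1 & age[_n-1]>educ[_n-1]+6
by nrpid: replace educ = educ[_n-1]-1 if first==0 & last>0 & _n>1 & age[_n-1]==educ[_n-1]+6
sort nrpid year
* replace works through the observations in order, so each row sees the value just filled in the row before it
list year age educ
restore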
* Now, we need to consider the case where first>0 and last>0, and since we already filled the gap in the interval
* between first and last above, we need to fill the missing values preceding first retrospectively. The above discussion
* applies here as well. (31 CHANGES MADE)
* codebook nrpid if last>0 & first>0 & (educ==. | educ>=94)
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if first>0 & last>0 & _n>(_N-first+1) & age[_n-1]>educ[_n-1]+6 & (educ==.|educ>=94)
by nrpid: replace educ = educ[_n-1]-1 if first>0 & last>0 & _n>(_N-first+1) & age[_n-1]==educ[_n-1]+6 & (educ==.|educ>=94)
* Now, we need to consider the case where first>0 and last>0, and since we already filled the gap in the interval
* between first and last above, we need to fill the missing values after the last prospectively. (138 CHANGES MADE)
* codebook nrpid if last>0 & first>0 & (educ==. | educ>=94)
sort nrpid year
by nrpid: replace educ=educ[_n-1] if (educ==. | educ>=94) & educ[_n-1]~=. & educ[_n-1]<94 & first>0 & last>0 & _n>1
* Now, we should have no missing observations except for cases where all values for an nrpid are missing.
sort nrpid year
gen miss = 0
replace miss = 1 if educ==. | educ>=94
bys nrpid: egen misst=total(miss)
bys nrpid: gen N=_N
*codebook nrpid if misst<N & misst>0
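* Optional check, added as a sketch (not in the original code): count the person-years of nrpid's whose
* educ is missing in some but not all years. count only reports a number and leaves the data unchanged.
count if misst>0 & misst<N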
drop N miss misst
* Check if educ is increasing linearly
gen edfl=1
sort nrpid year, stable
* Unflag all the observations for which educ is stable, or increasing linearly
by nrpid: replace edfl = 0 if educ==educ[_n-1] & _n>1
by nrpid: replace edfl = 0 if educ==educ[_n+1] & _n<_N
by nrpid: replace edfl = 0 if educ==educ[_n-1]+1 & _n>1
by nrpid: replace edfl = 0 if educ==educ[_n+1]-1 & _n<_N
replace edfl = 0 if educ>=94 | educ==.
by nrpid: replace edfl = 1 if educ-educ[_n-1]>1 & educ<=94 & educ~=. & educ[_n-1]<=94 & educ[_n-1]~=. & _n>1
by nrpid: replace edfl = 1 if educ-educ[_n+1]>1 & educ<=94 & educ~=. & educ[_n+1]<=94 & educ[_n+1]~=. & _n<_N
by nrpid: egen edflt = total(edfl)
*codebook nrpid if edflt>0
*list nrpid year age educ educ2 edfl edflt if edflt>=1
* NOTE - For only 96 unique nrpid's we have non-linearly increasing education. A total of 1766 observations.
* Take the final observation as the right one, and make changes accordingly
gsort nrpid -year
by nrpid: replace educ = educ[_n-1] if edflt>=1 & _n>1 & age[_n-1]>educ[_n-1]+6
by nrpid: replace educ = educ[_n-1]-1 if edflt>=1 & _n>1 & age[_n-1]==educ[_n-1]+6
* If the last value does not seem right, e.g., 10 years of educ at age 14, then take the first observation as
* the right one
gsort nrpid year
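* by nrpid: replace educ = educ[_n-1]+1 if edflt>=1 & _n>1 & age[_n]<...
* NOTE - The rest of the condition above and the code that followed it, up to the start of the p1
* corrections, are truncated in this copy of the file. The next statement is the first surviving p1 line.
replace p1 = p1m if _merge>=3 & l94==1 & m94==1 & p1m~=. & p1m~=99 & p1~=p1m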
replace p1 = p1m if _merge>=3 & l00==1 & m00==1 & p1m~=. & p1m~=99 & p1~=p1m
* Where 94 data is from life, and 00 data is from mlife, keep mlife for ALL years
replace p1 = p1m if _merge>=3 & l94==1 & m00==1 & p1m~=. & p1m~=99 & p1~=p1m
drop p1m
* OCCUPATION *
**************
* Replace all the values for observations from the using (life00) data set.
* 43,789 changes made.
replace occ1 = occ1m if _merge==2
replace occ1 = 9 if occ1==.
* Note - for 8,577 cases, we have observations from both 1994 and 2000 data sets. These observations
* come from any combination of m94, l94, m00, and l00 files. Decide as below:
* Where both life and mlife obs in same year, keep the obs from mlife (more likely to be accurate about mig.)
replace occ1 = occ1m if _merge>=3 & l94==1 & m94==1 & occ1m~=. & occ1m~=99 & occ1~=occ1m
replace occ1 = occ1m if _merge>=3 & l00==1 & m00==1 & occ1m~=. & occ1m~=99 & occ1~=occ1m
* Where 94 data is from life, and 00 data is from mlife, keep mlife for ALL years
replace occ1 = occ1m if _merge>=3 & l94==1 & m00==1 & occ1m~=. & occ1m~=99 & occ1~=occ1m
drop occ1m
* SEX & GAVEBIRTH *
*******************
* Replace all the values for observations from the using (life00) data set.
replace sex = sexm if _merge==2
replace sex = sexm if sex==.
* Note - for 8,577 cases, we have observations from both 1994 and 2000 data sets. These observations
* come from any combination of m94, l94, m00, and l00 files. Decide as below:
* When the two data files disagree on sex, keep the most recent observation.
replace sex = sexm if _merge>=3 & l94==1 & m00==1 & sexm~=.& sex~=sexm
replace gavebirth = gavebirthm if _merge==2
replace gavebirth = 0 if gavebirth==. & sex==0
* When the two data files disagree on gavebirth, keep the most recent observation.
replace gavebirth = gavebirthm if _merge>=3 & l94==1 & m00==1 & gavebirthm~=.
drop sexm gavebirthm
* MARITAL STATUS, CHILDREN *
*****************************
* Replace all the values for observations from the using (mlife00) data set.
* 43,697 changes made.
replace married = marriedm if _merge==2
replace noch = nochm if _merge==2
replace livch = livchm if _merge==2
replace married = marriedm if married==.
replace noch = nochm if noch==.
replace livch = livchm if livch==.
drop nochm livchm marriedm _merge
sort nrpid year
save "Temp\combt.dta", replace
*******************************************
* 8. Recode the combined lh life9400t.dta *
*******************************************
* We do not have a hhid84. vill84 + house84 uniquely identify households
* (house84 is the house number within each village). We need to get hhid84
* from hh94.dta.
* First, try to obtain the missing id variables (esp. the links between mid and hhid94,
* and between mid00 migtype and hhid00).
use "Temp\combt.dta", clear
* First corrections using the personid.dta - 11,790 missing values are recovered (_merge==4), and 1,311 inconsistencies
* (i.e., _merge==5) corrected by specifying the replace option.
sort nrpid, stable
merge nrpid using "personid.dta", keep(vill84 house84 hhid94 hhid00 mid mcep8 mid00 migtype mcep00) update replace
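* Optional check, added as a sketch (not in the original code): tabulate _merge before it is dropped.
* With the update option, _merge==4 marks missing master values filled from the using data and
* _merge==5 marks nonmissing values that disagreed (overwritten here because replace is specified).
tab _merge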
drop if _merge==2
drop _merge
sort mid00 migtype mcep00, stable
merge mid00 migtype mcep00 using "Temp\mindiv00t.dta", keep(hhid00) update
drop if _merge==2
drop _merge
* Get the missing hhid's from mindiv files.
sort mid mcep8, stable
merge mid mcep8 using "mindiv94.dta", keep(hhid94) update
replace hhid94 =. if hhid94==99999998
drop if _merge==2
drop _merge
sort hhid94, stable
merge hhid94 using "hh94.dta", keep(hhid94 hhid84) update
drop if _merge==2
drop _merge
* (a) Complete the missing village id's
* Note - 84 villages were split to form 94 villages, which were again split to
* form 00 villages. So essentially, there is a unique 94 and 84 id for 00 villages,
* and a unique 84 id for 94 villages but not vice versa.
* NOTE - Sort command places missing values first for string data (like vill00), but
* places nonmissing values first for integer or real data.
sort nrpid vill00
by nrpid: replace vill00 = vill00[_N] if vill00[_N]~=""
sort nrpid vill94
by nrpid: replace vill94 = vill94[_N] if vill94[_N]~=""
sort nrpid vill84
by nrpid: replace vill84 = vill84[_N] if vill84[_N]~=""
sort vill00 vill94 vill84
by vill00: replace vill94 = vill94[_N] if vill94=="" & vill00~=""
by vill00: replace vill84 = vill84[_N] if vill84=="" & vill00~=""
* We end up with non-missing vill94 id's, use them to fill in the missing vill84 and vill00 ids.
sort vill94 vill84
by vill94: replace vill84 = vill84[_N] if vill84=="" & vill94~=""
sort vill94 vill00
by vill94: replace vill00 = vill00[_N] if vill00=="" & vill94~=""
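* Optional check, added as a sketch (not in the original code): person-years still without a
* village id in at least one wave after the fills in (a). The data are left unchanged.
count if vill84=="" | vill94=="" | vill00==""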
* (b) Complete the missing hh id's
* NOTE - Sort command places missing values first for string data (like vill00), but
* places nonmissing values first for integer or real data. house84 and hhid00 are string vars,
* but hhid94 is recoded as long. To avoid confusion first convert it to string, then destring it.
*
* IMPORTANT NOTE - We can use hhid00 and hhid94 to complete hhid84, but not vice versa. Because
* 2000 households may be split from one 84 hh, 00 id can be used to identify 84 id, but 84 id does
* not necessarily correspond to a single 00 hh.
tostring hhid94, replace
replace hhid94 = "" if hhid94=="."
sort nrpid hhid00
by nrpid: replace hhid00 = hhid00[_N] if hhid00[_N]~=""
sort nrpid hhid94
by nrpid: replace hhid94 = hhid94[_N] if hhid94[_N]~=""
sort nrpid hhid84
by nrpid: replace hhid84 = hhid84[_N] if hhid84[_N]~=""
sort nrpid vill84 house84
by nrpid vill84: replace house84 = house84[_N] if house84[_N]~=""
sort hhid00 hhid94
by hhid00: replace hhid94 = hhid94[_N] if hhid94=="" & hhid00~=""
sort hhid00 hhid84
by hhid00: replace hhid84 = hhid84[_N] if hhid84=="" & hhid00~=""
sort hhid00 vill84 house84
by hhid00 vill84: replace house84 = house84[_N] if house84=="" & hhid00~="" & vill84~=""
sort hhid94 hhid84
by hhid94: replace hhid84 = hhid84[_N] if hhid84=="" & hhid94~=""
sort hhid94 vill84 house84
by hhid94 vill84: replace house84 = house84[_N] if house84=="" & hhid94~="" & vill84~=""
sort hhid84 vill84 house84
by hhid84 vill84: replace house84 = house84[_N] if house84=="" & hhid84~="" & vill84~=""
sort vill84 house84 hhid84
by vill84 house84: replace hhid84 = hhid84[_N] if hhid84=="" & house84~="" & vill84~=""
* (c) Complete the missing MIG hh id's
* NOTE - Sort command places missing values first for string data (like vill00), but
* places nonmissing values first for integer or real data. mid00 is a string var,
* but mid(94) is recoded as long. To avoid confusion first convert it to string, then destring it.
tostring mid, replace
replace mid = "" if mid=="."
sort nrpid mid00
by nrpid: replace mid00 = mid00[_N] if mid00[_N]~=""
sort nrpid mid
by nrpid: replace mid = mid[_N] if mid[_N]~=""
* Migrant households identified by mid00 migtype in 2000 (and only by mid in 1994) may have
* several migrants. Our goal here is: If one migrant is known to be associated with a hhid00,
* another migrant living in the same destination hh, is also likely to be associated with the
* same origin hh. -- WE CANNOT MAKE THAT ASSUMPTION, IGNORE THE CODE BELOW
* bys mid00 migtype: gen n=_N if mid00~=""
* bys mid00 migtype hhid00: gen nh=_N if mid00~=""
* sum n nh
* sort mid00 migtype mcep00 hhid00
* by mid00 migtype mcep00: replace hhid00 = hhid00[_N] if mid00~="" & migtype~="" & hhid00==""
* sort mid hhid94
* by mid: replace hhid94 = hhid94[_N] if mid~="" & hhid94==""
destring mid, replace
* (d) Determine the migrant villages (note migvill indicator may be missing for some observations
* which actually come from migrant villages - check) Need to use 2000 id here since it comes from
* the smallest unit (in terms of village size).
gsort vill00 -migvill
by vill00: replace migvill=migvill[1]
save "Temp\comb_lh.dta", replace