Twitter Poststratification

This repository provides an R implementation of the poststratification models described in the Web Conference (WWW) 2019 paper "Demographic Inference and Representative Population Estimates from Multilingual Social Media Data".

About

To correct for sampling biases when social media data is used in downstream applications, this implementation provides interpretable multilevel regression methods that predict the population of a group (e.g., a country or NUTS3 region) from non-representative social media counts. The poststratification models were evaluated on publicly available ground-truth census data and on joint population counts inferred from Twitter with the [M3-inference tool](https://github.com/euagendas/m3inference).

Using the best-performing poststratification model, we provide code to compute the inclusion probability, i.e., the probability that an individual from a group with given demographics (e.g., age: 18-29, gender: Female) is on a given social media platform.

Preprocessing

The following Python notebooks preprocess the dataset required by the R debiasing code:

nuts3_data_prep.ipynb: preprocessing for the case where M3 inference distinguishes between organizational and non-organizational Twitter accounts
nuts3_data_prep.orgs-as-humans.ipynb: preprocessing for the case where M3 inference treats organizational accounts as human/personal accounts

Debiasing models

The debiasing models, each followed by its R formula, are:

N ∼ M is our base model that uses only the total population counts from the census (N) and Twitter (M).

formular <- 'census ~ twitter + (twitter+0|country)'
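For readers less familiar with lme4-style formula syntax, the base formula can be read as a fixed intercept and slope plus a country-specific slope on the Twitter count, with no country-specific intercept. The equation below is a sketch of that reading, assuming a Gaussian linear mixed model (the fitting call is not shown in this README); r indexes regions, c[r] is the country of region r, and the symbols α, β, b, ε and σ are notation introduced here for illustration.

```latex
N_r = \alpha + \left(\beta + b_{c[r]}\right) M_r + \varepsilon_r,
\qquad b_c \sim \mathcal{N}\!\left(0, \sigma_b^2\right),
\qquad \varepsilon_r \sim \mathcal{N}\!\left(0, \sigma^2\right)
```

The term `(twitter+0|country)` corresponds to the random slope b; the `+0` suppresses the random intercept that would otherwise be added per country.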

N ∼ \sum_g M(g) uses gender marginal counts only (i.e., the total counts of males and females, not broken down by age).

formular <- 'census ~ gender_F + gender_M + (0+gender_F |country) + (0+gender_M |country)'

N ∼ \sum_a M(a) uses age marginal counts only.

formular <- 'census ~ age_0017 + age_1829 + age_3039 + age_4099 + (0+age_0017 |country) + (0+age_1829 |country) + (0+age_3039 |country) + (0+age_4099 |country)'
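The two marginal models above take per-region columns such as gender_F and age_1829. As a minimal sketch of how such columns relate to joint (age, gender) counts, the Python snippet below sums a joint histogram over the other dimension. The counts are made up, and the function name `marginal_counts` is hypothetical; the actual preprocessing lives in the notebooks and may differ.

```python
from collections import defaultdict

# Hypothetical joint Twitter counts for one region, keyed by (age bin, gender).
# The bin labels mirror the column names used in the formulas (age_0017, gender_F, ...).
joint_counts = {
    ("0017", "F"): 12, ("0017", "M"): 15,
    ("1829", "F"): 40, ("1829", "M"): 38,
    ("3039", "F"): 25, ("3039", "M"): 30,
    ("4099", "F"): 18, ("4099", "M"): 22,
}


def marginal_counts(joint):
    """Collapse joint (age, gender) counts into age and gender marginals."""
    row = defaultdict(int)
    for (age, gender), count in joint.items():
        row[f"age_{age}"] += count        # sum over gender
        row[f"gender_{gender}"] += count  # sum over age
    return dict(row)


row = marginal_counts(joint_counts)
print(row)
```

Both sets of marginals sum to the same regional total, which is the single predictor used by the base model.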

N ∼ \sum_{a,g} M(a, g) uses the joint histograms inferred from Twitter but only the total population values from the census.

log N(a, g) ∼ log M(a, g) + a + g uses the joint histograms inferred from Twitter and the joint histograms from the census.

formular <- 'census ~ twitter + age+gender + (0+twitter |country) + (0+age+gender|country)'

Evaluation and Cross validation

We evaluate the debiasing models using mean absolute percentage error (MAPE) in the following leave-one-group-out cross-validation settings:

leave one region out: debias_twitter_leave_one_region_CV.R
leave one country out (i.e., leave out all regions from a given country): debias_twitter_leave_one_country_CV.R
leave one stratum out (e.g., leave out only females aged 18-29): debias_twitter_leave_one_stratum_CV.R
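The evaluation loop behind these settings can be sketched in a few lines. The Python snippet below is an illustration only, not a port of the R scripts: the rows are made-up data, `leave_one_group_out` and `fit_ratio` are hypothetical helpers, and a single census-to-Twitter ratio stands in for the mixed-effects models. Changing `key` from "country" to "region" switches between the first two settings.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def leave_one_group_out(rows, key):
    """Yield (held_out_group, train_rows, test_rows) for each distinct value of `key`."""
    for group in sorted({row[key] for row in rows}):
        train = [r for r in rows if r[key] != group]
        test = [r for r in rows if r[key] == group]
        yield group, train, test


def fit_ratio(train):
    """Placeholder model: one census/twitter ratio (stands in for the mixed models)."""
    return sum(r["census"] for r in train) / sum(r["twitter"] for r in train)


# Made-up example rows; real inputs come from the preprocessing notebooks.
rows = [
    {"region": "R1", "country": "A", "census": 1000, "twitter": 50},
    {"region": "R2", "country": "A", "census": 2000, "twitter": 90},
    {"region": "R3", "country": "B", "census": 1500, "twitter": 80},
    {"region": "R4", "country": "B", "census": 3000, "twitter": 160},
]

actual, predicted = [], []
for held_out, train, test in leave_one_group_out(rows, key="country"):
    ratio = fit_ratio(train)  # fit only on the groups that were not held out
    for r in test:
        actual.append(r["census"])
        predicted.append(ratio * r["twitter"])

print(f"leave-one-country-out MAPE: {mape(actual, predicted):.1f}%")
```

Pooling the held-out predictions before computing the error, as done here, is one design choice; averaging per-fold errors is another, and the R scripts may use either.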

Impact of treating organizational accounts as humans

For the leave-one-region-out cross-validation, we provide two implementations:

Treating organizational accounts as humans: debias_twitter_leave_one_region_CV_org_as_human.R
Ignoring organizational account counts: debias_twitter_leave_one_region_CV.R

Computing inclusion probability

To compute the inclusion probabilities for each NUTS3 region following Equation 2 in the paper, run compute_inclusion_probabilities_by_region.R.
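As a rough intuition for what an inclusion probability measures, the Python sketch below divides a stratum's Twitter count by its census count. This ratio is an assumption made for illustration; it is not taken from Equation 2, which defines the exact quantity the R script computes. The counts and the function name `inclusion_probability` are hypothetical.

```python
def inclusion_probability(twitter_count, census_count):
    """Share of a stratum's census population seen on the platform (assumed estimator)."""
    if census_count <= 0:
        raise ValueError("census_count must be positive")
    return twitter_count / census_count


# Hypothetical counts for one region, keyed by (age bin, gender).
twitter = {("1829", "F"): 40, ("1829", "M"): 38}
census = {("1829", "F"): 8000, ("1829", "M"): 9500}

probs = {s: inclusion_probability(twitter[s], census[s]) for s in twitter}
for stratum, p in sorted(probs.items()):
    print(stratum, f"{p:.4f}")
```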

Citation

Please cite our WWW 2019 paper if you use these scripts in your project.

@inproceedings{wang2019demographic,
  title={Demographic Inference and Representative Population Estimates from Multilingual Social Media Data},
  author={Wang, Zijian and Hale, Scott A. and Adelani, David and Grabowicz, Przemyslaw A. and Hartmann, Timo and Fl{\"o}ck, Fabian and Jurgens, David},
  booktitle={Proceedings of the 2019 World Wide Web Conference},
  year={2019},
  organization={ACM}
}

License

This source code is licensed under the GNU Affero General Public License, which allows for non-commercial re-use of this software. For commercial inquiries, please contact us directly. Please see the LICENSE file in the root directory of this source tree for details.
