Unpacking immigration collocations

[This article was first published on R – Gradient Metrics, and kindly contributed to R-bloggers.]

On our road to detecting metaphors, we got stuck on a simple problem: compound nouns.

Take the phrase:

series of immigration policy changes

“Series” modifies “changes”, and those changes concern “immigration policy”, which is a compound noun.

“Series of changes” is not what we would consider metaphorical usage, but our detector would label “series of immigration” as potentially metaphorical, given its strangeness. Identifying compound nouns, and identifying which specific word is being modified (and is thus the “target” concept in the metaphor), is critical to improving performance.
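To make the target-finding step concrete, here is a minimal sketch of one possible heuristic. It is not the detector described in this post. It assumes part-of-speech tags are already available (they are typed in by hand here; a real pipeline would take them from a tagger), and it relies on English noun compounds usually being right-headed, so the last noun in the run after “of” is treated as the word being modified. The function name `split_of_phrase` and the tag labels are hypothetical.

```python
def split_of_phrase(tagged):
    """Split a 'MODIFIER of NOUN NOUN ... NOUN' phrase into its parts.

    `tagged` is a list of (word, pos) pairs. The POS labels are assumed
    to come from some upstream tagger; here they are supplied by hand.
    Returns None when no 'X of <noun run>' pattern is found.
    """
    for i, (word, _pos) in enumerate(tagged):
        if word.lower() != "of" or i == 0:
            continue
        # Collect the run of consecutive nouns that follows "of".
        run = []
        j = i + 1
        while j < len(tagged) and tagged[j][1] == "NOUN":
            run.append(tagged[j][0])
            j += 1
        if run:
            return {
                "modifier": tagged[i - 1][0],  # word before "of"
                "head": run[-1],               # last noun: the target
                "compound": run[:-1],          # nouns qualifying the head
            }
    return None


# Hand-tagged version of the example phrase from the text.
phrase = [
    ("series", "NOUN"),
    ("of", "ADP"),
    ("immigration", "NOUN"),
    ("policy", "NOUN"),
    ("changes", "NOUN"),
]
parts = split_of_phrase(phrase)
print(parts)
```

With this split, a detector would evaluate the pair “series” and “changes” rather than “series” and “immigration”, while keeping “immigration policy” as context.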

But we realized we didn’t want to throw this extra information out. Enter collocations:

a sequence of words or terms that co-occur more often than would be expected by chance
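One common way to make “more often than would be expected by chance” concrete is pointwise mutual information (PMI), which compares how often a word pair is observed with how often it would appear if the two words were independent. The sketch below is only an illustration on a made-up token list. It is not necessarily the measure used for the results in this post, and the function name `bigram_pmi` is hypothetical.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair.

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ).
    A positive score means the pair occurs more often than independence
    would predict; a negative score means less often.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = len(tokens) - 1

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_pairs
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy token list, invented purely for illustration.
tokens = (
    "immigration reform is stalled while immigration reform talks "
    "continue and reform of visa rules is debated"
).split()

scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(score, 2))
```

On a corpus this small the scores are unstable, and PMI is known to favour rare pairs, so real pipelines usually apply a minimum frequency cutoff before ranking.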

Using the same corpus that we’ve been using (which contains news articles, social media posts, and TV transcripts), we calculated the most prominent collocations containing “immigrant”, “immigration”, and “migration”.

prefix         suffix       prefix frequency  suffix frequency  co-occurrence
chain          immigration             16271             24547           8548
undocumented   immigrant               23145            127600          13645
migration      crisis                  24547             30606           1980
illegal        immigrant               61791            127600          18759
comprehensive  immigration              9550            261664           4930
migration      visa                    24547             23667           1041
immigration    custom                 261664             16135           6669
unaccompanied  immigrant                7945            127600           1547
immigration    reform                 261664             42233          16839
testify        immigrant               13444            127600           2288
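The three counts in each row are enough to compute a simple association score. As a hedged illustration, the sketch below ranks the rows by the Dice coefficient, 2 × co-occurrence / (prefix frequency + suffix frequency). Dice is our choice for this example only, and the post does not say which measure produced the ranking above. The counts and word labels are copied from the table as printed.

```python
# Rows copied from the table above:
# (prefix, suffix, prefix_frequency, suffix_frequency, co_occurrence)
ROWS = [
    ("chain", "immigration", 16271, 24547, 8548),
    ("undocumented", "immigrant", 23145, 127600, 13645),
    ("migration", "crisis", 24547, 30606, 1980),
    ("illegal", "immigrant", 61791, 127600, 18759),
    ("comprehensive", "immigration", 9550, 261664, 4930),
    ("migration", "visa", 24547, 23667, 1041),
    ("immigration", "custom", 261664, 16135, 6669),
    ("unaccompanied", "immigrant", 7945, 127600, 1547),
    ("immigration", "reform", 261664, 42233, 16839),
    ("testify", "immigrant", 13444, 127600, 2288),
]


def dice(prefix_freq, suffix_freq, co_occurrence):
    """Dice coefficient: 0 when the words never co-occur, 1 when they
    only ever appear together."""
    return 2 * co_occurrence / (prefix_freq + suffix_freq)


# Highest-scoring pairs first.
ranked = sorted(
    ((dice(pf, sf, co), prefix, suffix) for prefix, suffix, pf, sf, co in ROWS),
    reverse=True,
)

for score, prefix, suffix in ranked:
    print(f"{prefix} {suffix}: {score:.3f}")
```

Dice needs only the per-row counts, so it can be computed without the total corpus size, which is not reported here. Measures such as PMI would need that total as well.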

In graphical form, here’s what the information looks like:

The blue lines indicate that the word is a prefix to one of the key source words (e.g. “chain migration”, “unaccompanied immigrant”), and the green lines indicate that it is a suffix (e.g. “immigration policy”, “immigration reform”).

What stands out to us is how little the collocations overlap across the three words (except “illegal”, which is strongly associated with all three). This is especially surprising for “migration” and “immigration”, which are both abstract nouns.
