Friday, April 13, 2018

Targeted Content

You have probably heard of, or even suspected firsthand, the famous conspiracy theory that the Facebook app listens to your phone's microphone in order to better target ads that match your current interests. I've had the funniest experience with that myself: a friend in the cosmetics business told me about this conspiracy, and in the same conversation she mentioned that an advertising agent had called her to offer advertising for her business. Later that day, I got a Facebook ad "advertise your cosmetics business". What the heck? What are the odds of that? And I don't even have the Facebook app installed, just Facebook Messenger.

Although Mark Zuckerberg denied this conspiracy theory in his Senate hearing, I doubt that people will stop believing it as long as the ads algorithm keeps surprising them. Choosing to believe Zuckerberg that they don't listen to our microphones (yet, I suspect), I'm still pretty confident that they, as well as other companies, are using our written content (emails, social media posts, search queries).

Most people are alarmed by these suspicions from the privacy aspect: what data does this company hold about me? how do they use it? who do they share it with? This post will not be about that. Instead, this post will be about the technical aspect, which is what interests me most as an NLP researcher. If we assume that our apps constantly listen to us and that our written content is monitored and analyzed, what does it say about the text understanding capabilities of these companies?

Oh, and expect no answers. This post is all about questions and conspiracy theories!

What is personalized content?
Personalized content doesn't have to come in the form of an ad. It can take the form of recommendations (products to buy based on previous purchases, songs to listen to, as in this post). It can be relevant professional content from LinkedIn, discounts on services you've previously consumed, cheap flights to your planned destinations, and so on. Some of this will be a direct result of the preferences and settings you defined on the website. For example, I've registered on several websites to get updates on concerts of my favorite bands, and I get healthy vegetarian recipes from Yummly. Some of this content will be based on inferences that the system makes, assuming that certain content is relevant for you. Here is one example:

In that case I was amazed by the accuracy of the Quora digest emails I was getting. Specifically, I had a conversation with my husband about the confidence it takes to admit you don't know something, and he mentioned he likes to say something more helpful than "I don't know" to someone who needs help. The next day, I got a personally-tailored Quora digest email that contained an answer to the question "Could you say something nice instead of 'I don't know'?". It wasn't under any of the topics that I follow (computer science related topics and parakeets).

In what follows I will exemplify most of my points using ads.

What we think these algorithms do
OK, so in my case, I have to try to put my knowledge about the limitations of this technology and my skepticism aside for a second and think like the average person. In that case, I think that:
  • If the ad is about a topic that I discussed in a spoken conversation, then there must be a recorder, and a speech-to-text component that converts the speech into written text.
  • Which language did I speak or write in when this happened? If this happened in more than one language, it's possible that the company has different algorithms (or at least different trained models of the same algorithm) for each language.
  • Written content and transcribed speech are processed to match with the available content/ads.
  • In some cases, it seems that even simple keyword matching leads to nice results. E.g., if you mentioned a vacation in Thailand you will be matched with ads containing the words vacation and Thailand (I will let you know if I get any such ads after writing this post...). It takes no text understanding capabilities to do so, it only requires recognizing that a bunch of words said in the same sentence (or in a short period of time) also appear in some ad. If you insist, it may work with information retrieval (IR) algorithms to recognize the most important words.
  • In other cases, it seems that a deeper understanding of the meaning of my queries and conversations is required in order to match it to the relevant content. A good example is the Quora digest example from above. Based on IR algorithms, searching for common words like I, don't, know, helpful, nice, say, something will not get you as far as searching for more rare content words like vacation and Thailand. So it must be that the algorithm has built some meaning representation to our conversation, and compared it with the one of that Quora answer, which was phrased with slightly different words. On top of everything, our conversation was in Hebrew, so it must have a universal multi-lingual meaning representation mechanism. 
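To make the simple keyword-matching scenario from the list above concrete, here is a minimal sketch. Everything in it is made up for illustration: the stop-word list, the two ads and the `match_ads` function are my own assumptions, not a description of what any company actually runs.

```python
import string

# Hypothetical stop-word list and ad inventory, made up for this example.
STOP_WORDS = {"a", "an", "the", "in", "to", "i", "on", "for", "of", "your", "is", "and", "am"}
ADS = {
    "thai_deal": "Book your dream vacation in Thailand",
    "cosmetics": "Advertise your cosmetics business",
}

def content_words(text):
    """Lowercase the text, strip punctuation and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    return {w for w in text.lower().translate(table).split() if w not in STOP_WORDS}

def match_ads(utterance, ads=ADS, min_overlap=2):
    """Return the ids of ads sharing at least `min_overlap` content words with the utterance."""
    words = content_words(utterance)
    return [ad_id for ad_id, ad_text in ads.items()
            if len(words & content_words(ad_text)) >= min_overlap]

print(match_ads("I am planning a vacation in Thailand"))  # ['thai_deal']
print(match_ads("I don't know"))                          # []
```

Note how the second call returns nothing: an utterance made only of common words has no rare content words to match on, which is exactly why the Quora example would need more than keyword overlap.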

Alternative explanations
Skepticism returns. I can believe that my speech is recorded and transcribed fairly accurately to text when I speak English. It's a bit harder to believe when it happens in other languages (e.g. Hebrew in my case), but I can still find it somewhat reasonable: automatic speech recognition (ASR), although it isn't perfect, works reasonably well. It's the text understanding component I'm much, much more skeptical about. Despite the constant progress, and although popular media makes it seem like AI is solved and computers completely understand human language, I know it definitely isn't the case yet. So what other explanations can there be for the targeted content we see?

By Chance. None of this actually happens and we're just imagining. Well, OK, not none of this, but in some cases, it's really just chance.

One of the reasons that we're not easily convinced by this "by chance" argument is that we generally tend to pay attention only to the true-positive cases ("hits") in which we talked about something and immediately got an ad about it. It's much harder to notice the "misses": an ad that seems off (false positive) or all the things that we discussed and got no ads about (false negative).

At the end of the day, we're all just ordinary people who share many common interests. Advertisers may reach us because they try to reach a large audience and we happen to fall under the very broad categories they target (e.g. age group). It could be that by chance we see ads exactly for the product or service we need now.

Other Means. Technically speaking, rather than understanding text, it's much easier to consider other parameters such as your location, your declared interests (e.g. pages you've liked on Facebook, search results you clicked on in Google), your age, gender, marital status, and more. If you didn't provide one or more of these details, no worries! Your friends have, and it's likely you share some of these details with them!

Here is one good example:
I keep getting baby and pregnancy ads on Facebook. I'm a married woman in her 30s; both pieces of information are available in my Facebook profile, and that alone is enough to assume this topic is relevant for me (personally, it is not, but the percentage of women like me is too small to care about the error rate, and I totally accept that). Add to this that many of my Facebook friends are people my age who are members of parenting groups, have liked pages of baby-related stuff, etc. I can't ever make this stop, but I guess it will stop naturally when I'm in my late forties.

I'd like to finish with an anecdote about how unsophisticated targeted content can sometimes be, to the point where you rub your eyes in disbelief and say "how stupid can these algorithms be?". A few days ago I wrote to someone in an email "I'll be in Seattle on May 30". Minutes later, I got an email from with the title "Vered, Seattle has some last-minute deals!". That would have been smart, had I not already used to book a hotel room in Seattle for exactly these dates.

I may be way off, and it may be that these companies have killer AI abilities that are a well-kept secret. In that case, some of my readers who work for these companies must be giggling now. To paraphrase Joseph Heller (or whoever said it first), just because you're paranoid doesn't mean they're not after you, but hey, there's no way their technology is good enough to do what you think it does, so some of it is just pure chance. Not as catchy as the original quote, I know.

Tuesday, January 2, 2018

Fun with lyrics

This post stems from a (very boring) casual thought I had about a year ago: "Hmm... I wonder whether there is more rain in British songs?", which later generalized into "Is there any correlation between song lyrics and the weather in the country of origin of the artists?". I spent an entire weekend writing code to scrape lyrics from the web, and then life got in the way and I never finished this (uninteresting) project.

Since I already have a very large corpus of lyrics,1 I figured: why not combine two of my loves -- text analysis and music -- into one blog post? So in this post I will show you some fun analyses that people commonly do with lyrics.

Word Clouds
Word clouds provide a nice illustration of the frequency of word occurrences. Given a text, the word cloud contains the k most common words in the text, where more frequent words appear larger and in the center of the cloud. In this case, I chose an artist and created a word cloud from the lyrics of all the songs of that artist. I lowercased all the words, removed punctuation, stop words (very common function words like "and" and "the"), and the word "chorus". I used worditout to draw the word clouds. Here are a few examples (click on the links to enlarge):

Left to right: word clouds for the lyrics of Red Hot Chili Peppers, Morrissey, and Eminem.
A few interesting, though expected, observations: Red Hot Chili Peppers often sing about love, and Morrissey mostly moans. When he doesn't moan ("Oh"), he sings about serious topics such as war, the world and life. Eminem curses a lot. Funnily, since I kept the words in their inflected form, we get multiple variations of the F word in his word cloud.
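The counting behind a word cloud is simple enough to sketch. The snippet below follows the preprocessing described above (lowercasing, removing punctuation, stop words and the word "chorus"), but the stop-word list, the toy lyrics and the `top_k_words` function are assumptions made for this example, not my original code.

```python
import string
from collections import Counter

# A tiny, hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"and", "the", "a", "i", "you", "to", "of", "in", "it", "my"}

def top_k_words(lyrics, k=2, extra_ignored=("chorus",)):
    """Return the k most frequent words after lowercasing and filtering."""
    table = str.maketrans("", "", string.punctuation)
    tokens = lyrics.lower().translate(table).split()
    ignored = STOP_WORDS | set(extra_ignored)
    counts = Counter(t for t in tokens if t not in ignored)
    return counts.most_common(k)

# Made-up lyrics, not taken from any real song.
toy_lyrics = "Oh, the rain! Oh, the pain and the rain. [Chorus] Oh, love."
print(top_k_words(toy_lyrics))  # [('oh', 3), ('rain', 2)]
```

The resulting (word, count) pairs are what a drawing tool would then turn into font sizes.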

Now that we see which words are common in each artist's lyrics, we can take it a step further and try to visualize the topics that they sing about. There are many ways to do that, and we'll do it simply by visualizing their word embeddings using t-SNE, a technique for projecting high-dimensional vectors to 2-dimensional space. The underlying assumption of word embeddings is that words with similar meanings or those that belong to the same topics would have similar vectors. This should also be reflected in their 2-dimensional visualization.

To give the lyrics some context and demonstrate how they relate to all the possible topics in the world, I took the words from the lyrics and visualized their vectors along with the 2,500 most common words in English, highlighting words from the lyrics in red. Here is the result for Morrissey:

You'd have to scroll through the graph and look for clusters of red dots, then try to figure out what their common theme is. For example, I've found adjectives describing negative feelings (unhappy, sad, tired, weary, ...), words related to love (hearts, lonely, love, hug, kiss), body parts (body, arms, hands, head), and people (young, children, nephew, girl, boy, woman, ...).

And here is the result for Muse:

Here I see positive emotions (love, dream, fate), negative emotions (sorrow, shame, greed, apathy, bitterness), evil stuff (daemons, evil, exorcise, sins) and war-related words (war, struggle, fighting, revolt).

[Some technical details for my technical readers: I took the first 2500 words from this list of 10k most common words in English. For the lyrics, I considered the 500 most common words which are adjectives, nouns or verbs. I drew the t-SNE graph using this script, and used the pre-trained 50d GloVe word embeddings].

Generating New Songs
As the word clouds may suggest, each artist has a specific style which is reflected in the word choice and topic of their songs. We can train a model that captures this specific style, mimics this artist and generates new songs that would look like they've been written by this artist.

Unfortunately, for better-quality results, you need a large amount of training data, so forget about generating new songs of artists who tragically died after releasing only a few records (e.g. 1, 2, 3) or of your favorite indie bands that have relatively few songs (e.g. 1, 2, 3, 4, 5, 6, 7, 8). We'll stick with the more mainstream bands and try to generate new songs by Muse, Weezer, and Red Hot Chili Peppers.

For that purpose, we are going to learn an artist-specific language model. I've written an elaborate post about language models in the context of machine translation; in short, language models estimate the probability of a certain text in the language (e.g. English, or a more specific domain, like Twitter data or Muse lyrics). Each word in the text depends on the previous words, so in an English language model, for instance, the probability of "she doesn't" is larger than that of "she don't" (although, this may not be the case for English rap songs language models!). Language models can be used to compute the probability of an existing text, but they can also be used to generate new texts by sampling words from the distribution. We're going to use them for generation.
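To make the "she doesn't" vs. "she don't" example concrete, here is a minimal sketch of a count-based bigram language model. It is a simplified stand-in and not the neural model trained for this post; the toy corpus and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts

def next_word_prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

# Made-up corpus in which "she doesn't" is more frequent than "she don't".
toy_corpus = [
    "she doesn't know",
    "she doesn't care",
    "she doesn't mind",
    "she don't care",
]
model = train_bigram_model(toy_corpus)
print(next_word_prob(model, "she", "doesn't"))  # 0.75
print(next_word_prob(model, "she", "don't"))    # 0.25
```

Train the same counts on a corpus where "she don't" dominates, and the ranking flips, which is the point of learning an artist-specific model.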

As opposed to the language models in my blog post, we will train a neural language model. These are explained very clearly in Andrej Karpathy's blog post "The Unreasonable Effectiveness of Recurrent Neural Networks". In short, a recurrent neural network (RNN) is a model that receives as input a sequence (e.g. of words / characters) and outputs vectors representing each subsequence (i.e. the first item, the first two items, ..., the entire sequence). These vectors can then be used by other machine learning models, e.g. for classification.

In the context of language models, the RNN learns to model the probability distribution of the next item in the sequence (e.g. the next word in the song). During training, the model goes over the entire text corpus (e.g. all the lyrics of a specific artist) and tries to predict the next item (word). If the predicted next item is incorrect, i.e. different from the actual next item, the model adjusts itself, until it is accurate enough. At test time, once the model parameters are settled, you can use it to generate new texts by sampling from the distribution of possible items (words) and constantly sampling new words conditioned on the already-sampled ones. The result should look similar to the original text corpus it was trained on. Very often, generated sequences will be actual texts from the corpus (and then you've just trained a parrot... Thanks Don Patrick for the great metaphor, I'm constantly quoting you on this!).

[Some technical details for my technical readers: I trained a word-level LSTM using DyNet, largely based on the char-level RNN example. My code is available here.]
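The generation step, sampling each word from the predicted distribution and then conditioning on what was already sampled, can be sketched as follows. This is not the DyNet code: the hand-written `NEXT_WORD` table is a hypothetical stand-in for the distributions a trained model would output, and it conditions only on the previous word, whereas an RNN conditions on the entire prefix.

```python
import random

# Hypothetical next-word distributions, standing in for a trained model's predictions.
NEXT_WORD = {
    "<s>": {"i": 0.6, "you": 0.4},
    "i": {"wish": 0.5, "feel": 0.5},
    "you": {"are": 1.0},
    "wish": {"</s>": 1.0},
    "feel": {"safe": 1.0},
    "are": {"</s>": 1.0},
    "safe": {"</s>": 1.0},
}

def sample_line(next_word, rng, max_len=10):
    """Sample one line word by word until the end-of-line symbol is drawn."""
    line, prev = [], "<s>"
    for _ in range(max_len):
        words = list(next_word[prev])
        weights = [next_word[prev][w] for w in words]
        prev = rng.choices(words, weights=weights, k=1)[0]
        if prev == "</s>":
            break
        line.append(prev)
    return " ".join(line)

rng = random.Random(0)  # fixed seed so that runs are repeatable
for _ in range(3):
    print(sample_line(NEXT_WORD, rng))
```

With such a tiny table every sampled line is one of "i wish", "i feel safe" or "you are", which is the parrot effect in miniature.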

So, let's take a look at the results! After training each model, I sampled a single song. I sampled each sentence separately, so subsequent sentences are not expected to be related to each other. I enforced the song structure by forcing a line break after every 5 lines. Here is the new "Weezer" song:

let me see the joy
holding on to what they give,
turn it, turn it,
i'd bury diamonds
you're just smile

excuse my manners if i make a scene
we're just visiting
i'm still afloat
and i'm lost without your love
why are all american girls so rough?

i'm a robot
and kick you out of sight
and if you're up all night
i cried for you, you were the blast

i'm just meant to be your latest tragedy
why are all american girls so rough?
you are.
how man is this in the world
i feel safe

oo-ee-oo i look just like buddy holly
wish that they would teach me when our critics
i don't want your love
chicks are trying to freak

Some sentences are clearly copied from existing songs ("oo-ee-oo I look just like Buddy Holly") but others are brand new. Overall it feels like a Weezer song to me!

Moving on to the new Muse song:

than you could ever give
and i want you now
i wish i could
and make a fuss
like the evil in your veins
you are

(your time is now)
our hopes and expectations
we don't belong here
i won't let you bury it
i wish i could

they will pull us down
in your world
now i just was to name
with who knows who
i'm growing tired of fighting

in my sleep
loneliness be over
vast human and material resources
you're unsustainable

is it enough
killed by drones
and our time is running out
you and i must fight to survive

This one is a bit disappointing, because the only reason it feels like a real Muse song is that it's a "summary of Muse songs" created by copying whole sentences from their songs. My intuition is that the amount of training data was too small, leading to "overfitting" (the training data is regenerated perfectly). This calls for an action by Muse to release more albums!

And the highlight is this new Red Hot Chili Peppers song:

when i find my peace of mind,
that i could find the fireflies
to close a right today
that i slept

you say the is least my love
start jumping and that sherri meet?
funky crime funky crime
just a mirror for the sun
i wrote a letter to you

i've been here before
stuck in the muck of the pond
to be afraid
play your hand and glory

well, i'm gonna ride a sabertooth horse
let's play
mother angel in your hand
take a star in a telegram
upon the places beyond

today loves smile for me
part of my scenery
i'll play all night
i am not wide

Wow... this looks nothing like every Red Hot Chili Peppers song ever. It doesn't even contain the word California! Maybe I should've trained the model for a few more iterations. It is pretty cool, though, that most sentences are new, and they make sense at least like the actual lyrics by RHCP make sense.

Now that we've got the data, we can finally answer the sleep-depriving question: "is there a correlation between the occurrence of rain-related words in lyrics and the country of origin of the artist?". For the lyrics that I scraped from the web I also kept the artists' countries. For these countries I also looked up the annual precipitation statistics. I then looked for the occurrence of any of the following words in lyrics: rain, raining, rained, rains, storm, stormy, cloud, cloudy, drizzle, flood. I computed the percentage of "rain" songs per country (out of all of the songs by artists in this country). The hypothesis was that artists from countries with a high average of annual precipitation are more likely to sing about it.
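The per-country percentage can be computed along these lines. This is a sketch under assumptions: the `(country, lyrics)` input format, the tokenization and the toy songs are made up for the example, and only the word list comes from the text above.

```python
import re
from collections import defaultdict

# Word list from the text above.
RAIN_WORDS = {"rain", "raining", "rained", "rains", "storm", "stormy",
              "cloud", "cloudy", "drizzle", "flood"}

def rain_song_percentage(songs):
    """songs: iterable of (country, lyrics). Returns {country: % of songs with a rain word}."""
    totals, rainy = defaultdict(int), defaultdict(int)
    for country, lyrics in songs:
        tokens = set(re.findall(r"[a-z']+", lyrics.lower()))
        totals[country] += 1
        if tokens & RAIN_WORDS:  # a song counts once, no matter how many rain words it has
            rainy[country] += 1
    return {c: 100.0 * rainy[c] / totals[c] for c in totals}

# Made-up songs, not from the actual corpus.
toy_songs = [
    ("UK", "the rain falls on the grey town"),
    ("UK", "i walk alone tonight"),
    ("Israel", "sun on the sea"),
    ("Israel", "dancing in the street"),
]
print(rain_song_percentage(toy_songs))  # {'UK': 50.0, 'Israel': 0.0}
```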

The percentage of songs mentioning rain in each country compared with average annual precipitation.
I was wrong. There was no correlation. It is also possible that this was a failed experiment because the number of songs for some countries was too small to draw any meaningful statistical conclusions.
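For readers who want to run this kind of check themselves, here is one way to measure correlation: the Pearson coefficient, written out in plain Python. I'm using it here as an illustrative choice, and the numbers below are invented for the example; they are not the data from this post.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented values: annual precipitation (mm) and percentage of "rain" songs per country.
precipitation_mm = [400, 800, 1200, 1600]
rain_song_pct = [3.0, 2.0, 3.5, 2.5]
print(pearson(precipitation_mm, rain_song_pct))
```

A value near 1 or -1 would indicate a strong linear relation, and a value near 0 none. With only a handful of countries, or few songs per country, even a large coefficient would not mean much.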

Can we answer more interesting questions regarding lyrics? For example, this every Red Hot Chili Peppers song ever claims that all they ever sing about is California, but this wasn't reflected in the word cloud, nor in the generated song, meaning that this specific word was not very frequent in the corpus. However, if we only check which US states were mentioned in the songs, would California be more frequent?2 And if we make this question more general, do artists tend to sing more about their countries of origin, and do some places get more attention regardless of where the artists are originally from?

This time I focused on American artists, and took the lyrics of the first 200 artists from each state, checking for mentions of any states. I created a 51x51 table in which the columns represent the mentions and the rows represent the artists' state of origin. Rather than displaying this messy table, I plotted a heatmap where the lighter colors represent higher values (and 0 values are colored black).3
Mention of states in lyrics by artists' state of origin. Columns: states mentioned in lyrics. Rows: states of origin.
Here's how to interpret this heatmap: light values on the diagonal are pretty common, meaning that it's common for artists to sing about their states of origin. Two columns have light values across many rows: California and New York. Those are states which are common in lyrics, regardless of the artist origin.
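The table behind the heatmap can be built as in the sketch below. It is an illustration under assumptions: it uses three states instead of all 51 rows and columns, made-up lyrics, and a hypothetical `mention_matrix` function. As in my statistics, every mention is counted, whether it comes from one song or from many.

```python
import re

STATES = ["California", "New York", "Texas"]  # three states for brevity

def mention_matrix(songs, states=STATES):
    """songs: iterable of (origin_state, lyrics).
    Rows: artists' state of origin. Columns: states mentioned in the lyrics."""
    index = {s: i for i, s in enumerate(states)}
    matrix = [[0] * len(states) for _ in states]
    patterns = {s: re.compile(r"\b" + re.escape(s.lower()) + r"\b") for s in states}
    for origin, lyrics in songs:
        text = lyrics.lower()
        for state, pattern in patterns.items():
            matrix[index[origin]][index[state]] += len(pattern.findall(text))
    return matrix

# Made-up lyrics.
toy_songs = [
    ("California", "california dreaming of california"),
    ("Texas", "leaving texas for new york"),
    ("New York", "california sun"),
]
for state, row in zip(STATES, mention_matrix(toy_songs)):
    print(state, row)
```

In this toy table the California column is non-zero for two different rows, which is the "light column" pattern described above, and the diagonal entries for California and Texas correspond to artists singing about their own state.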

Notice that the states are sorted alphabetically, so it's difficult to answer the question whether artists tend to sing about states in their proximity. A better visualization would be if we could place these statistics on a map. We can, and I used the Google Maps API to do so! Click on a state from the list and you'll see the states that sing about it visualized on a map.

I think I can see a pattern of states singing about their neighbors (this kind of visualization was helpful for someone like me who doesn't know much about US geography...).

Sentiment Analysis
Many words in Morrissey's word cloud are notably negative: kill, hate, die, leave, gone, etc. This is no surprise, as anyone who's been listening to Morrissey or to the Smiths knows that most of their songs are gloomy; according to this study, some of the gloomiest among UK artists.

This negativity can be "proved" computationally, using software for sentiment analysis. Sentiment analysis takes a text and determines its sentiment: either negative/positive, or a range of sentiments. Traditional models used to look at the words that appear in the text independently and score the sentence according to the individual words' sentiment, recognizing "good" and "bad" words. For example, "I am happy today" would be considered positive thanks to the positivity of the word happy (and the neutrality of the other words). Today's models are mostly based on neural networks, and sometimes they also take into account the structure of the sentence (which should be helpful in recognizing that "I am not happy today" is negative). The Stanford Sentiment Analysis system is an example of such a model.
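The traditional word-by-word approach is easy to sketch, and so is its weakness. The mini lexicon and the `lexicon_sentiment` function below are assumptions made for this example, not any particular system.

```python
# Hypothetical mini sentiment lexicon; real lexicons contain thousands of scored words.
LEXICON = {"happy": 1, "good": 1, "love": 1, "sad": -1, "bad": -1, "hate": -1}

def lexicon_sentiment(sentence):
    """Sum the scores of the individual words, ignoring word order and negation."""
    tokens = [t.strip(".,!?\"'") for t in sentence.lower().split()]
    score = sum(LEXICON.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for s in ["I am happy today", "I am not happy today", "you were good in your time"]:
    print(s, "->", lexicon_sentiment(s))
```

All three sentences come out as positive: the model is right about the first one, blind to the negation in the second, and misses the insult in the third, because it only ever sees isolated words.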

I was planning to compute the sentiment of all the lyrics of Morrissey vs. all the lyrics of a presumably more cheerful artist (e.g. Queen, David Bowie), but I found that most analyzers I tried did pretty badly at recognizing the sentiment of lyrics. To be fair, they are usually trained on movie/restaurant reviews, and lyrics are often more sophisticated (as proof: we've had a human disagreement on the sentiment of several Morrissey lines at home...). Here are some examples from the Stanford Sentiment Analysis demo:

A positive sentence from David Bowie. Sounds fun.

A negative sentence from Muse. A bit less fun.

Finally, this last example is a subtle insult (at least in my interpretation) from Morrissey: "you were good in your time", interpreted simply as a positive saying by the model. This was a difficult one!

1 In this post I use the lyrics I downloaded (315,357 songs) along with two lyrics corpora from Kaggle: from Sergey Kuznetsov (57,650 songs) and from Gyanendra Mishra (380,000 songs). I was planning to share the code for scraping the lyrics from the web, but when I finally started writing this post, I found out that the website I had been using had changed and scraping lyrics with my code no longer works.

2 It is very, very frequent in general, so the prior probability of the occurrence of California in songs is high, not just the conditional probability given that it's a RHCP song. I never realized how common it is until I came back from California last summer and tried to fill the void by creating and constantly listening to this America playlist (biased towards songs about California).  

3 One note about the statistics in this post: they are inaccurate. Some states have just a few artists, mentions are counted equally whether they come from one song or from many songs, I didn't normalize the statistics by the size of each state, I didn't check for mentions of cities, etc.

Monday, October 30, 2017


One of the problems with teaching computers to understand natural language is that much of the meaning in what people say is actually hidden in what they don't say. As humans, we trivially interpret the meaning of ambiguous words, written or spoken, according to their context. For example, this blog post is published in a blog that largely discusses natural language processing, so if I write "NLP", you'd know I refer to natural language processing rather than to neuro-linguistic programming. If I told you that the blog post doesn't fit into a tweet because it's too long, you'd know that the blog post is too long and not that the tweet is too long. You would infer that even without having any knowledge about Twitter's character limit, because it just doesn't make sense otherwise. Unfortunately, the common sense and world knowledge that come so easily to us are not trivial to teach to machines. In this post, I will present a few cases in which ambiguity is a challenge in NLP, along with common ways in which we try to overcome it.

Polysemous words, providing material for dad jokes since... ever.
Lexical Ambiguity
Lexical ambiguity can occur when a word is polysemous, i.e. has more than one meaning, and the sentence in which it is contained can be interpreted differently depending on its correct sense.

For example, the word bank has two meanings: either a financial institution or the land alongside a river. When we read a sentence with the word bank, we understand which sense of bank the text refers to according to the context:

(1) Police seek person who robbed bank in downtown Reading.
(2) The faster-moving surface water travels along the concave bank.

In these example sentences, "robbed" indicates the first sense while "water" and "concave" indicate the second.

Existing Solutions for Lexical Ambiguity
Word embeddings are great, but they conflate all the different senses of a word into one vector. Since word embeddings are learned from the occurrences of a word in a text corpus, the word embedding for bank is learned from its occurrences in both senses, and will be affected by neighbors related to the first sense (money, ATM, union) as well as the second (river, west, water, etc.). The resulting vector is very likely to tend towards the more common sense of bank, as can be seen in this demo: see how all the nearest words to bank are related to its financial sense.

Word Sense Disambiguation (WSD) is an NLP task aimed at disambiguating a word in context. Given a list of potential word senses for each word, the correct sense of the word in the given context is determined. Similar to the way humans disambiguate words, WSD systems also rely on the surrounding context. A simple way to do so, in a machine-learning based solution (i.e. learning from examples), is to represent a word-in-context as the average of its context word vectors ("bag-of-words"). In the example above, we get for the first occurrence of bank:

feature_vector(bank) = 1/8 (vector(police) + vector(seek) + vector(person) + vector(who) + vector(robbed) + vector(in) + vector(downtown) + vector(reading))

and for the second:

feature_vector(bank) = 1/9 (vector(the) + vector(faster) + vector(moving) + vector(surface) + vector(water) + vector(travels) + vector(along) + vector(the) + vector(concave))
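The averaging is straightforward to write down. In the sketch below the 2-dimensional vectors are invented so that the effect is visible by eye (real embeddings have tens or hundreds of dimensions), words without a vector are treated as zero vectors, and `context_vector` is a hypothetical helper.

```python
# Made-up 2-dimensional "embeddings": first dimension ~ money/crime, second ~ nature.
VECTORS = {
    "police": (0.9, 0.0), "robbed": (1.0, 0.1), "downtown": (0.6, 0.1),
    "water": (0.0, 1.0), "surface": (0.1, 0.7), "concave": (0.1, 0.5),
}

def context_vector(sentence, target, vectors=VECTORS):
    """Average the vectors of all context words (every token except the target).
    Words without a vector count as zero vectors."""
    context = [w for w in sentence.lower().split() if w != target]
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for w in context:
        for i, x in enumerate(vectors.get(w, (0.0,) * dim)):
            total[i] += x
    return [x / len(context) for x in total]

s1 = "police seek person who robbed bank in downtown reading"
s2 = "the faster moving surface water travels along the concave bank"
print(context_vector(s1, "bank"))  # larger in the first dimension
print(context_vector(s2, "bank"))  # larger in the second dimension
```

A classifier trained on such feature vectors can then learn which region of the space goes with which sense of bank.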

Can Google expand the acronym "ACL" correctly for me?
While many words in English are polysemous, things turn absolutely chaotic with acronyms. Acronyms are highly polysemous, some having dozens of different expansions. To make things even more complicated, as opposed to regular words, whose various senses are recorded in dictionaries and taxonomies like WordNet, acronyms are often domain-specific and not commonly known.

Take for example a Google search for "ACL 2017". I get results both for the Annual Meeting of the Association for Computational Linguistics (which is what I was searching for) and for the Austin City Limits festival. I have no idea whether this happens because (a) these are the two most relevant/popular expansions of "ACL" lately or the only ones that go with "2017"; or (b) Google successfully disambiguated my query, showing the NLP conference first, while also leaving the music festival ranked lower in the search results, since it knows I also like music festivals. Probably (a) :)

Existing Solutions for Acronym Expansion
Expanding acronyms is considered a different task from WSD, because there is no inventory of potential expansions for each acronym. Given enough context (e.g. "2017" is a context word for the acronym ACL), it is possible to find texts that contain the expansion. This can be done either by searching for a pattern (e.g. "Association for Computational Linguistics (ACL)") or by considering all the word sequences that start with these initials, and deciding on the correct one using rules or a machine-learning based solution.
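The pattern-based option can be sketched as follows. This is a simplified illustration rather than an existing system: `find_expansions` is a hypothetical helper that looks for "(ACL)" in a text and walks backwards over the preceding words until their capitalized initials spell the acronym.

```python
import re

def find_expansions(acronym, text):
    """Find 'Long Form (ACRONYM)' patterns whose capitalized initials spell the acronym."""
    expansions = []
    for match in re.finditer(re.escape(f"({acronym})"), text):
        words = text[:match.start()].split()
        candidate, initials = [], ""
        for word in reversed(words):
            candidate.insert(0, word)
            if word[0].isupper():  # lowercase words like "for" are kept but add no initial
                initials = word[0] + initials
            if len(initials) == len(acronym):
                break
        if initials == acronym:
            expansions.append(" ".join(candidate))
    return expansions

text = ("I attended the Association for Computational Linguistics (ACL) meeting, "
        "then the Austin City Limits (ACL) festival.")
print(find_expansions("ACL", text))
# ['Association for Computational Linguistics', 'Austin City Limits']
```

Finding the candidate expansions is the easy part. Choosing between them for a given query still requires looking at the context, using rules or a learned model.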

Syntactic Ambiguity
No beginner NLP class is complete without at least one of the following example sentences:
  1. They ate pizza with anchovies
  2. I shot an elephant wearing my pajamas
  3. Time flies like an arrow
Common to all these examples is that each has multiple interpretations, which differ in the underlying syntax of the sentence. Let's go over the examples.

The first sentence "They ate pizza with anchovies", can be interpreted as (i) "they ate pizza and the pizza had anchovies on it", which is the more likely interpretation, illustrated on the left side of the image below. This sentence has at least two more crazy interpretations: (ii) they ate pizza using anchovies (instead of using utensils, or eating with their hands), as in the right side of the image below, and (iii) they ate pizza and their anchovy friends ate pizza with them.

Visual illustration of the interpretations of the sentence "They ate pizza with anchovies".
Image taken from
The first interpretation considers "with anchovies" as describing the pizza, while the other two consider it as describing the eating action. In the output of a syntactic parser, the interpretations will differ by the tree structure, as illustrated below.

Possible syntactic trees for the sentence "They ate pizza with anchovies", using displacy.

Although this is a classic example, both the spaCy and the Stanford CoreNLP demos got it wrong. The difficulty is that syntactically speaking, both trees are likely. Humans know to prefer the first one based on the semantics of the words, and using their knowledge that an anchovy is something that you eat rather than something you eat with. Machines don't come with this knowledge.

A similar parser decision is crucial in the second sentence, and just in case you haven't managed to find the funny interpretations yet: "I shot an elephant wearing my pajamas" has two ambiguities. First, does shoot mean taking a photo of, or pointing a gun at? (a lexical ambiguity). But more importantly, who's wearing the pajamas? That depends on whether wearing is attached to shot (meaning that I wore the pajamas while shooting) or to elephant (meaning that the elephant miraculously managed to squeeze into my pajamas). This entire scene, regardless of the interpretation, is very unlikely, and please don't kill elephants, even if they're stretching your pajamas.

The third sentence is just plain weird, but it also has multiple interpretations, which you can read about here.

Existing Solutions for Syntactic Ambiguity
In the past, parsers were based on deterministic grammar rules (e.g. a noun and a modifier create a noun-phrase) rather than on machine learning. A possible solution to the ambiguity issue was to add different rules for different words. For more details, you can read my answer to Natural Language Processing: What does it mean to lexicalize PCFGs? on Quora.

Today, as in other NLP tasks, parsers are mostly based on neural networks. In addition to other information, the word embeddings of the words in the sentence are used for deciding on the correct output. So potentially, such a parser may learn that "eat * with [y]" yields the output on the left of the image if y is edible (its embedding is similar to the embeddings of other edible things), and the one on the right otherwise.
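To make the intuition concrete, here is a minimal sketch of the "is y edible?" decision. It is not how any real parser works: the three-dimensional vectors, the list of edible examples and the threshold are all invented for this illustration, and a neural parser would learn such a preference implicitly rather than through an explicit rule.

```python
import math

# Hypothetical 3-dimensional "embeddings", invented for this illustration.
# The dimensions loosely stand for (edible, tool, person).
TOY_VECTORS = {
    "anchovies": (0.9, 0.1, 0.0),
    "mushrooms": (0.8, 0.0, 0.1),
    "cheese":    (0.9, 0.0, 0.0),
    "fork":      (0.1, 0.9, 0.0),
    "friends":   (0.0, 0.1, 0.9),
}
# Hypothetical examples of things known to be edible.
EDIBLE_EXAMPLES = ["mushrooms", "cheese"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attach_with_phrase(y, threshold=0.8):
    """Decide where "with y" attaches in "eat * with y".

    Returns "noun" (the phrase describes the food) if y is close to known
    edible things in the toy vector space, and "verb" otherwise.
    """
    best = max(cosine(TOY_VECTORS[y], TOY_VECTORS[e]) for e in EDIBLE_EXAMPLES)
    return "noun" if best >= threshold else "verb"


print(attach_with_phrase("anchovies"))  # noun
print(attach_with_phrase("fork"))       # verb
print(attach_with_phrase("friends"))    # verb
```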

Coreference Ambiguity
Very often a text mentions an entity (someone/something), and then refers to it again, possibly in a different sentence, using another word. Take these two paragraphs from a news article as an example:

The various entities participating in the article were marked in different colors.

I marked various entities that participate in the article in different colors. I grouped together different mentions of the same entities, including pronouns ("he" as referring to "that son of a bitch"; excuse my language, I'm just quoting Trump) and different descriptions ("Donald Trump", "the president"). To do that, I had to use my common sense (the he must refer to that son of a bitch who disrespected the flag, definitely not to the president or the NFL owners, right?) and my world knowledge (Trump is the president). Again, any task that requires world knowledge and reasoning is difficult for machines.

Existing Solutions for Coreference Resolution
Coreference resolution systems group mentions that refer to the same entity in the text. They go over each mention (e.g. the president), and either link it to an existing group containing previous mentions of the same entity ([Donald Trump, the president]), or start a new entity cluster ([the president]). Systems differ from each other, but in general, given a pair of mentions (e.g. Donald Trump, the president), they extract features referring either to each single mention (e.g. part-of-speech, word vector) or to the pair (e.g. gender/number agreement, etc.), and decide whether these mentions refer to the same entity.
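The linking loop described above can be sketched roughly as follows. This is a toy under stated assumptions, not an actual coreference system: the Mention fields, the compatible rule and the KNOWN_ALIASES table are hypothetical stand-ins for the learned pairwise classifier and for world knowledge, and pronouns are left out on purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str
    gender: str  # "m", "f", or "n" (neutral or unknown)
    number: str  # "sg" or "pl"


# Hypothetical world-knowledge table; a real system has to learn this
# or look it up somewhere.
KNOWN_ALIASES = {frozenset({"donald trump", "the president"})}


def compatible(a, b):
    """Toy pairwise decision: do mentions a and b refer to the same entity?"""
    if a.number != b.number:
        return False
    if "n" not in (a.gender, b.gender) and a.gender != b.gender:
        return False
    x, y = a.text.lower(), b.text.lower()
    return x == y or frozenset({x, y}) in KNOWN_ALIASES


def resolve(mentions):
    """Go over the mentions in order; link each one to the first existing
    cluster that contains a compatible mention, or start a new cluster."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if any(compatible(mention, previous) for previous in cluster):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


MENTIONS = [
    Mention("Donald Trump", "m", "sg"),
    Mention("NFL owners", "n", "pl"),
    Mention("the president", "m", "sg"),
]
for cluster in resolve(MENTIONS):
    print([m.text for m in cluster])
```

The interesting (and hard) part is everything hidden inside compatible: here it is a lookup table, while in practice it is exactly where common sense and world knowledge are needed.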

Note that mentions can be proper-names (Donald Trump), common nouns (the president) and pronouns (he); identifying coreference between pairs of mentions from each type requires different abilities and knowledge. For example, proper-name + common noun may require world knowledge (Donald Trump is the president), while pairs of common nouns can sometimes be solved with semantic similarity (e.g. synonyms like owner and holder). Pronouns can sometimes be matched to their antecedent (original mention) based on proximity and linguistic cues such as gender and number agreement, but very often there is more than one possible option for matching.

A nice example of solving coreference ambiguity is the Winograd Schema challenge, which I first heard about from this post in the Artificial Detective blog. In this contest, computer programs are given a sentence with two nouns and an ambiguous pronoun, and they need to answer which noun the pronoun refers to, as in the following example:

The trophy would not fit in the brown suitcase because it was too big. What was too big?
Answer 0: the trophy
Answer 1: the suitcase

Answering such questions requires, yes, you guessed correctly - commonsense and world knowledge. In the given example, the computer must reason that for the first object to fit into the second, the first object must be smaller than the second, so if the trophy could not fit into the suitcase, the trophy must be too big. Conversely, had the question read small instead of big, the answer would have been "the suitcase".

Noun Compounds
Words are usually considered as the basic unit of a language, and many NLP applications use word embeddings to represent the words in the text. Word embeddings do a pretty decent job in capturing the semantics of a single word, and sometimes also its syntactic and morphological properties. The problem starts when we want to capture the semantics of a multi-word expression (or a sentence, or a document). The embedding of a word, for example dog, is learned from its occurrences in a large text corpus; the more common a word is, the more occurrences there are, and the higher the quality of the learned word embedding would be (it would be located "correctly" in the vector space near things that are similar to dog). A bigram like hot dog is already much less frequent, even less frequent is hot dog bun, and so on. The conclusion is clear - we can't learn embeddings for multi-word expressions the same way we do for single words.

The alternative is to try to somehow combine the word embeddings of the single words in the expression into a meaningful representation. Although there are many approaches for this task, there is no one-size-fits-all solution for this problem; a multi-word expression is not simply the sum of its single word meanings (hot dog is an extreme counter-example!).

One example out of many would be noun-compounds. A noun-compound is a noun that is made up of two or more words, which usually consists of the head (main) noun and its modifiers, e.g. video conference, pumpkin spice latte, and paper clip. The use of noun-compounds in English is very common, but most noun-compounds don't appear frequently in text corpora. As humans, we can usually interpret the meaning of a new noun-compound if we know the words it is composed of; for example, even though I've never heard of watermelon soup, I can easily infer that it is a soup made of watermelon.

Similarly, if I want my software to have a nice vector representation of watermelon soup, there is no way I can base it on the corpus occurrences of watermelon soup -- it would be too rare. However, I used my commonsense to build a representation of watermelon soup in my head -- how would my software know that there is a made of relation between watermelon and soup? This relation can be one out of many, for example: video conference (means), paper clip (purpose), etc. Note that the relation is implicit, so there is no immediate way for the machine to know what's the correct relation between the head and the modifier.1  To complicate things a bit further, many noun-compounds are non-compositional, i.e. the meaning of the compound is not a straightforward combination of the meaning of its words, as in hot dog, baby sitting, and banana hammock.

Existing Solutions for Noun-compound Interpretation
Automatic methods for interpreting the relation between the head and the modifier of noun-compounds have largely been divided into two approaches:

(1) machine-learning methods, i.e. hand-labeling a bunch of noun-compounds to a set of pre-defined relations (e.g. part of, made of, means, purpose...), and learning to predict the relation for unseen noun-compounds. The features are either related to each single word (head/modifier), such as their word vectors or lexical properties from WordNet, or to the noun-compound itself and its corpus occurrences. Some methods also try to learn a vector representation for a noun-compound in the form of applying a function to the word embeddings of its single words (e.g. vector(olive oil) = function(vector(olive), vector(oil))).

(2) finding joint occurrences of the nouns in a text corpus, some of which would explicitly describe the relation between the head and the modifier. For example "oil made of olives".
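As a rough illustration of approach (2), the sketch below searches a tiny corpus for sentences in which the head and the modifier are connected by an explicit phrase. The patterns, the relation labels attached to them and the three sentences are all made up for this example, and plurals are handled by a naive optional "s"; real systems use far larger corpora and learned or curated patterns.

```python
import re

# Hypothetical patterns: each maps a relation label to a template that
# connects the head noun to its modifier with an explicit phrase.
PATTERNS = {
    "made of": r"\b{head}s? (?:is |are )?made (?:of|from) {mod}s?\b",
    "purpose": r"\b{head}s? (?:used )?for (?:holding )?{mod}s?\b",
}


def find_relations(modifier, head, corpus):
    """Count, per relation, the sentences that explicitly link head and modifier."""
    found = {}
    for relation, template in PATTERNS.items():
        regex = re.compile(
            template.format(head=re.escape(head), mod=re.escape(modifier)),
            re.IGNORECASE,
        )
        hits = sum(1 for sentence in corpus if regex.search(sentence))
        if hits:
            found[relation] = hits
    return found


# Invented sentences standing in for a text corpus.
CORPUS = [
    "Oil made from olives is common in Mediterranean cooking.",
    "She bought a clip for holding papers together.",
    "The olives were harvested in the fall.",
]

print(find_relations("olive", "oil", CORPUS))   # {'made of': 1}
print(find_relations("paper", "clip", CORPUS))  # {'purpose': 1}
```

The weakness is visible even in this toy: a compound like watermelon soup gets no relation at all unless someone happened to write the explicit paraphrase somewhere in the corpus.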

While there has been a lot of work in this area, success on this task is still mediocre. A recent paper suggested that current methods succeed mostly due to predicting the relation based solely on the head or on the modifier - for example, most noun-compounds with the head "oil" hold the made of relation (olive oil, coconut oil, avocado oil, ...). While this guess can be pretty accurate most of the time, it may cause funny mistakes as in the meme below.


For the sake of simplicity, I focused on two-word noun-compounds, but noun-compounds with more than two words have an additional ambiguity - a syntactic ambiguity - what are the head-modifier relations in the compound? It is often referred to as bracketing. Without getting into too many details, consider the example of hot dog bun from before. It should be interpreted as [[hot dog][bun]] rather than [hot [dog bun]].

More to read?
Yeah, I know it was a long post, but there is so much more ambiguity in language that I haven't discussed. Here is another selected topic, in case you're looking for more to read. We all speak a second language called emoji, which is full of ambiguity. Here are some interesting articles about it: Emoji could cause confusion, trouble in the workplace, The real meaning of all those emoji in Twitter handles, Learning the language of emoji, and Why emojis may be the best thing to happen to language in the digital age. For the older people among us (and in the context of emoji, I consider myself old too, so no offence anyone), if you're not sure about the meaning of an emoji, why don't you check emojipedia first, just to make sure you're not accidentally using phallic symbols in your grocery list?

1 In this very interesting paper by Preslav Nakov there is a nice observation: a noun-compound is a "compression device" that allows saying more with fewer words.

Wednesday, August 9, 2017


One of the things that make natural language processing so difficult is language variability: there are multiple ways to express the same idea/meaning. I mentioned it several times in this blog, since it is a true challenge for any application that aims to interact with humans. You may program it to understand common things or questions that a human may have, but if the human decides to deviate from the script and phrase it slightly differently, the program is helpless. If you want a good example, take your favorite personal assistant (Google assistant, Siri, Alexa, etc.) and ask it a question you know it can answer, but this time use a different phrase. Here is mine:

Both questions I asked have roughly the same meaning, yet, Google answers the first perfectly but fails to answer the second, backing off to showing search results. In fact, I just gave you a "free" example of another difficult problem in NLP which is ambiguity. It seems that Google interpreted showers as "meteor showers" rather than as a light rain.

One way to deal with the language variability difficulty is to construct a huge dictionary that contains groups or pairs of texts with roughly the same meaning: paraphrases. Then, given a new question, applications like the assistant can search the dictionary for any question they were programmed to answer which has the same meaning. Of course, this is a naive idea, given that language is infinite and one can always form a new sentence that has never been said before. But it's a good start, and it may help in developing algorithms that can associate a new unseen text with an existing dictionary entry (i.e. generalizing).

Several approaches have been used to construct such dictionaries, and in this post I will present some of the simple-but-smart approaches. 

Translation-based paraphrasing
The idea behind this approach is super clever and simple: suppose we are interested in collecting paraphrases in English. If two English texts are translated to the same text in a foreign language, then they are likely paraphrases of each other. Here is an example:

The English texts on the left are translated into the same Italian text on the right, implying that they have the same meaning.
This approach goes back as far as 2001. The most prominent resource constructed with this approach is the paraphrase database (PPDB). It is a resource containing hundreds of millions of text pairs with roughly the same meanings. Using the online demo, I looked up paraphrases of "nice to meet you", yielding a bunch of friendly variants that may be of use for conference small talk:

it was nice meeting you
it was nice talking to you
nice to see you
hey, you guys
it's nice to meet you
very nice to meet you
nice to see you
i'm pleased to meet you
it's nice to meet you
how are you
i'm delighted
it's been a pleasure

Paraphrases of "nice to meet you", from PPDB.

In practice, all these texts appear as paraphrases of "nice to meet you" in the resource, with different scores (to what extent is this text a paraphrase of "nice to meet you"?). These texts were found to be translated to the same text in a single or in multiple foreign languages, and their scores correspond to the translation scores (as explained here), along with other heuristics.2  

While this approach provides a ton of very useful paraphrases, as you can guess, it also introduces errors, as every automatic method does. One type of error occurs when the foreign word has more than one sense, each translating into a different, unrelated English word. For example, the Spanish word estacion has two meanings: station and season. When given a Spanish sentence that contains this word, it is translated (hopefully) to the correct English word according to the context. This paraphrase approach, however, does not look at the original sentences in which these words occur, but only at the phrase table -- a huge table of English phrases and their Spanish translations without their original contexts. It therefore has no way at this point to tell that stop and station refer to the same meaning of estacion, and are therefore paraphrases, while season and station are translations of two different senses of estacion.
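Here is a minimal sketch of the pivoting idea, and of how the sense-conflation error falls out of it. The phrase table below is a hypothetical five-row toy (a real one has millions of entries with translation scores), and the function simply pairs up every two English phrases that share a foreign translation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical phrase table: (English phrase, Spanish phrase) pairs,
# stored without the sentences they were extracted from.
PHRASE_TABLE = [
    ("station", "estacion"),
    ("stop", "estacion"),
    ("season", "estacion"),
    ("tired", "cansado"),
    ("fatigued", "cansado"),
]


def pivot_paraphrases(phrase_table):
    """Pair up English phrases that translate to the same foreign phrase."""
    by_foreign = defaultdict(set)
    for english, foreign in phrase_table:
        by_foreign[foreign].add(english)

    pairs = set()
    for english_phrases in by_foreign.values():
        for a, b in combinations(sorted(english_phrases), 2):
            pairs.add((a, b))
    return pairs


for pair in sorted(pivot_paraphrases(PHRASE_TABLE)):
    print(pair)
```

The output contains the good pairs (station, stop) and (fatigued, tired), but also (season, station) and (season, stop): with no context available, nothing in the table distinguishes the two senses of the pivot word.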

Even without making such a horrible mistake of considering two texts as paraphrases when they are not related at all, paraphrasing is not well-defined, and the paraphrase relation encompasses many different relations. For example, looking for paraphrases of the word tired in PPDB, you will get equivalent phrases like fatigued, more specific phrases like overtired/exhausted, and related but not-quite-the-same phrases like bored. This may occur when the translator likes being creative and does not remain completely faithful to the original sentence, but also when the target language does not contain an exact translation for a word, defaulting to a slightly more specific or more general word. While this phenomenon is not specific to this approach but is shared by all the paraphrasing approaches (for different reasons), it has been studied by the PPDB people, who did an interesting analysis of the different semantic relations the resource captures.

The following approaches focus on paraphrasing predicates. A predicate is a text describing an action or a relation between one or more entities/arguments, very often containing a verb. For example: John ate an apple or Amazon acquired Whole Foods. Predicate paraphrases are pairs of predicate templates -- i.e. predicates whose arguments were replaced by placeholders -- that would have roughly the same meaning given an assignment to their arguments. For example, [a]0 acquired [a]1 and [a]0 bought [a]1 are paraphrases given the assignment [a]0 = Amazon and [a]1 = Whole Foods.1  Most approaches focus on binary predicates (predicates with two arguments).

Argument-distribution paraphrasing
This approach relies on a simple assumption: if two predicates have the same meaning, they should normally appear with the same arguments. Here is an example:

In this example, the [a]0 slots in both predicates are expected to contain names of companies that acquired other companies while the [a]1 slot is expected to contain acquired companies. 

The DIRT method represents each predicate as two vectors: (1) the distribution of words that appeared in its [a]0 argument slot, and (2) the distribution of words that appeared in its [a]1 argument slot. For example, the [a]0 vectors of the predicates in the example will have positive/high values for names of people and names of companies that acquired other companies, and low values for other (small) companies and other unrelated words (cat, cookie, ...). To measure the similarity between two predicates, the two vector pairs ([a]0 in each predicate and [a]1 in each predicate) are compared using vector similarity measures (e.g. cosine similarity), and a final score averages the per-slot similarities.
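A simplified sketch of this computation is given below. It follows the description in the paragraph above rather than the original DIRT formulation: it uses raw counts, cosine similarity and an arithmetic mean over the two slots, whereas the actual method may weight and combine things differently. The observations are invented.

```python
import math
from collections import Counter

# Hypothetical (predicate template, [a]0 filler, [a]1 filler) observations.
OBSERVATIONS = [
    ("[a]0 acquired [a]1", "Amazon", "Whole Foods"),
    ("[a]0 acquired [a]1", "Google", "YouTube"),
    ("[a]0 bought [a]1", "Amazon", "Whole Foods"),
    ("[a]0 bought [a]1", "Google", "YouTube"),
    ("[a]0 ate [a]1", "John", "apple"),
]


def slot_vectors(predicate, observations):
    """Count which words filled each argument slot of the given predicate."""
    slot0, slot1 = Counter(), Counter()
    for pred, arg0, arg1 in observations:
        if pred == predicate:
            slot0[arg0] += 1
            slot1[arg1] += 1
    return slot0, slot1


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def predicate_similarity(p, q, observations):
    """Average the similarity of the [a]0 slots and of the [a]1 slots."""
    p0, p1 = slot_vectors(p, observations)
    q0, q1 = slot_vectors(q, observations)
    return (cosine(p0, q0) + cosine(p1, q1)) / 2


print(predicate_similarity("[a]0 acquired [a]1", "[a]0 bought [a]1", OBSERVATIONS))
print(predicate_similarity("[a]0 acquired [a]1", "[a]0 ate [a]1", OBSERVATIONS))
```

On this toy data the first pair scores (up to floating-point rounding) 1 and the second scores 0.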

Now, while it is true that predicates with the same meaning often share arguments, it is definitely not true that predicates that share a fair amount of their argument instantiations are always paraphrases. A simple counterexample would be of predicates with opposite meanings, that often tend to appear with similar arguments: for instance, "[stock] rise to [30]" and "[stock] fall to [30]" or "[a]0 acquired [a]1" and "[a]0 sold [a]1" with any [a]0 that once bought an [a]1 and then sold it.

Following this approach, other methods were suggested, such as capturing a directional inference relation between predicates (e.g. [a]0 shot [a]1 => [a]0 killed [a]1 but not vice versa), releasing a huge resource of such predicate pairs (see the paper); and a method to predict whether one predicate entails the other, given a specific context (see the paper). 

Event-based paraphrases
Another good source for paraphrases is multiple descriptions of the same news event, as various news reporters are likely to choose different words to describe the same event. To automatically group news headlines discussing the same story, it is common to group them according to the publication date and word overlap. Here is an example of some headlines describing the acquisition of Whole Foods by Amazon:

We can stop here and say that all these headlines are sentential paraphrases. However, going a step further, if we've already observed in the past Google to acquire YouTube / Google is buying YouTube as sentential paraphrases (and many other similar paraphrases), we can generalize and say that [a]0 to acquire [a]1 and [a]0 is buying [a]1 are predicate paraphrases.
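The headline-grouping step (same publication date plus word overlap) might look like the sketch below. Jaccard overlap, the 0.3 threshold and the greedy comparison against the first headline of each group are assumptions made for this example, and the headlines and dates are illustrative rather than real data.

```python
def tokens(headline):
    """Lowercased set of words, with surrounding punctuation stripped."""
    return {w.strip(".,:;!?\"'").lower() for w in headline.split()} - {""}


def jaccard(a, b):
    """Word overlap between two token sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0


def group_headlines(headlines, threshold=0.3):
    """Greedily group (date, text) headlines that share a date and
    overlap enough with the first headline of an existing group."""
    groups = []
    for date, text in headlines:
        for group in groups:
            group_date, group_text = group[0]
            if group_date == date and jaccard(tokens(text), tokens(group_text)) >= threshold:
                group.append((date, text))
                break
        else:
            groups.append([(date, text)])
    return groups


# Illustrative headlines; the dates are placeholders.
HEADLINES = [
    ("2017-06-16", "Amazon to acquire Whole Foods"),
    ("2017-06-16", "Amazon is buying Whole Foods"),
    ("2017-06-16", "Heat wave hits Europe"),
    ("2017-06-17", "Amazon to acquire Whole Foods"),
]

for group in group_headlines(HEADLINES):
    print([text for _, text in group])
```

Replacing the shared words in a group (here Amazon and Whole Foods) with placeholders is then what turns a pair of headlines into a candidate pair of predicate templates.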

Early works relying on this approach are 1, 2, followed by some more complex methods like 3. We recently harvested such paraphrases from Twitter, assuming that tweets with links to news web sites that were published on the same day are likely to describe the same news events. If you're interested in more details, here are the paper, the poster and the resource.

This approach is potentially more accurate than the argument-distribution approach. The latter assumes that predicates that often occur with the same arguments are paraphrases, while the former considers predicates with the same arguments as paraphrases only if it believes that they discuss the same event.

What does the future hold? Neural paraphrasing methods, of course. I won't go into technical details (I feel that there are enough "neural network for dummies" blog posts out there, and I'm by no means an expert on that topic). The idea is to build a model that reads a sequence of words and then generates a different sequence of words that has the same meaning. If it sounds like inexplicable magic, it is mostly because even the researchers working on this task can at most make educated guesses on why something works well or not. In any case, if this ever ends up working well, it will be much better than the resources we have today, since it will be capable of providing paraphrases / judging correctness of paraphrases for new texts that were never observed before.

1 Of course, given a different choice of arguments, these predicates will not be considered as paraphrases. For example, Mary acquired a skill is not a paraphrase of Mary bought a skill. The discussed approaches consider predicate-pairs as paraphrases, if there exists an argument assignment (/context) under which these predicates are paraphrases.   
2 See also more recent work on translation-based paraphrasing.  

Wednesday, March 1, 2017

Women in STEM*

This is a special post for International Women's Day (March 8th). Every year I find myself enthusiastically conveying my thoughts about the topic to the people around me, so I thought I might as well share them with a broader audience. As always, this post presents my very limited knowledge/interpretation of a broadly discussed and studied topic. It may be a bit off topic for this blog, so if you're only interested in computational stuff, you can focus on section 3.

1. The Problem
Even though we are half of the population, women are quite poorly represented in STEM:

USA: the percentage of computing occupations held by women has been declining since 1991, when it reached a high of 36%. The current rate is 25%. [2016, here]

OECD member countries: While women account for more than half of university graduates in scientific fields in several OECD countries, they account for only 25% to 35% of researchers in most OECD countries. [2006, here]

2. The Causes (and possible solutions)

2.1 Cognitive Differences
There is a common conception that female abilities in math are biologically inferior to those of males. Many highly cited psychology papers show otherwise, for example:

"Stereotypes that girls and women lack mathematical ability persist, despite mounting evidence of gender similarities in math achievement." [1].

"...provides evidence that mathematical and scientific reasoning develop from a set of biologically based cognitive capacities that males and females share. These capacities lead men and women to develop equal talent for mathematics and science." [2]


    In addition, if cognitive differences were so prominent, there wouldn't be so many women graduating in scientific fields. It seems that the problem lies in occupational gender segregation, which may be explained by any one of the following:

    2.2 Family Life
    Here are some references from studies conducted about occupational gender segregation:

    "In some math-intensive fields, women with children are penalized in promotion rates." [3]
      "[...] despite the women's movement and more efforts in society to open occupational doors to traditional male-jobs for women, concerns about balancing career and family, together with lower value for science-related domains, continue to steer young women away from occupations in traditionally male-dominated fields, where their abilities and ambitions may lie." [4]

      "women may “prefer” those [jobs] with flexible hours in order to allow time for childcare, and may also “prefer” occupations which are relatively easy to interrupt for a period of time to bear or rear children." [5] (the quotation marks are later explained, indicating that this is not a personal preference but rather influenced by learned cultural and social values).

      I'd like to focus the discussion now on my local point of view of the situation in Israel, since I suspect that it is the most prominent cause of the problem here. I would be very interested in getting comments regarding what it is like in other countries.


      According to the Central Bureau of Statistics, in 2014, 48.9% of the workers in Israel were women (and 51.1% were men). The average salary was 7,439 NIS for women and 11,114 for men. Wait, what?... let me introduce another (crucial) factor.

      While the fertility rate has decreased in all other OECD member countries, in Israel it remained stable for the last decade, with an average of 3.7 children per family. On a personal note, as a married woman without children, I can tell you that it is definitely an issue, and "when are you planning to have children already?" is considered a perfectly valid question here, even from strangers (and my friends with 1 or 2 children often get "when do you plan to have the 2nd/3rd child?").

      Paid maternity leave is 14 weeks with a possibility (used by anyone who can afford it) to extend it to 3 more unpaid months. Officially, any one of the parents can take maternity leave, but in practice, since this law was introduced in 1998, only roughly 0.4% of the parents who took maternity leave were fathers. 

      Here is the number connecting the dots, and explaining the salary gap: in 2014, the average number of work hours per week was 45.2 for men and 36.7 for women. The culture in Israel is torn between the traditional family roles (mother as a main parent) and the modern opportunities for women. Most women I know have a career in the morning, and a second job in the afternoon with the kids. With a hard constraint of leaving work before 16:00 to pick up the kids, in a demanding market like the one in Israel, it is much harder for a woman to get promoted. This makes the high-tech industry, in which the working hours are known to be long, a male-dominated environment. Indeed, in 2015, only 36.2% of the high-tech workers in Israel were women.

      This situation is doubly troubling: on the one hand, it is difficult for women who do choose demanding careers. They have to juggle between home and work in a way that men are never required to. On the other hand, we are steered from childhood toward feminine occupations that are less demanding in working hours.

      Don't get me wrong, I'm not here to judge. Being a feminist doesn't entail that the woman must have a career while the man has to stay at home with the children. Each couple can decide on their division of labor as they wish. It's the social expectations and cultural bias that I'm against. I've seen this happening time after time: the man and the woman both study and build up their careers, they live in equality, and then the birth of their first child, and specifically maternity leave, is the slippery slope after which equality is a fantasy. 

      To make a long story short, I think it is not women the market is against, but mothers. When I say "against" I include allegedly good ideas such as allowing a mother to leave work at 16:00. While I'm not against leaving work at 16:00 (modern slavery is a topic for another discussion...), I don't see why this "privilege" should be reserved only for mothers. In my humble opinion, it would benefit mothers, fathers, children and the market if men and women could each get 3 days a week to leave work as "early" as 16:00. It wouldn't hurt if both men and women had the right to take parental leave together, developing their parenthood as a shared job. This situation will never change unless the market overcomes ancient social rules and stops treating parenthood as a job for women.

      2.3 Male-dominated Working Environments 
      Following from the previous point, tech workplaces (everywhere) are dominated by men, so that even women who choose to work in this industry might feel uncomfortable in their workplaces. Luckily for me, I can't attest to this from my own experience: I've never been treated differently as a woman, and have never felt threatened or uncomfortable in situations in which I was the only woman. This article exemplifies some of the things that other women experienced:

      "Many [women] will say that their voice is not heard, they are interrupted or ignored in meetings; that much work takes place on the golf course, at football matches and other male-dominated events; that progress is not based on merit and women have to do better than men to succeed, and that questions are raised in selection processes about whether a woman “is tough enough”."

        I've only become aware of these problems recently, so I guess it is both a good sign (that it might not be too common, or at least that not all women experience that), but also a bad sign (that many women still suffer from it and there's not enough awareness). This interesting essay written by Margaret Mitchell suggests some practical steps to make women feel more comfortable in their workplaces.

        Of course, things get much worse when you consider sexual harassment in workplaces. I know that awareness of the subject is very high today, that an employer's duty to prevent sexual harassment is statutory in many countries, and that many big companies require new employees to undergo sexual harassment prevention training. While this surely mitigates the problem, it is still too common, with a disturbing story just from the last week (and many other stories untold). As with every other law, there will always be people breaking it, but it is the employers' duty to investigate any reported case and handle it even at the cost of losing a valuable worker.

        2.4 Gender Stereotypes 
        Stereotypes persist simply because it's so difficult to change reality: even if some of the reasons why women were previously less likely to work in these industries are no longer relevant, girls will still be less oriented to working in these fields, since the fields are considered unsuitable for them.


        An interesting illustration was provided in this work, where 26 girls (around 4 years old) were shown different Barbie dolls and asked whether they believed women could do masculine jobs. When the Barbie dolls were dressed in "regular" outfits, many of them replied negatively, but after being shown a Barbie dressed up in a masculine outfit (firefighter, astronaut, etc.), the girls believed that they too could do non-stereotypical jobs.

        This is the vicious circle that people are trying to break by encouraging young girls to study scientific subjects and supporting women already working in these fields. Specifically, by organizing women-only conferences, offering scholarships for women, and making sure that there is a female representative in any professional group (e.g. panel, committee, etc). While I understand the rationale behind changing the gender distribution, I often feel uncomfortable with these solutions. I'll give an example.

        Let's say I submitted a paper to the main conference in my field, and that paper was rejected. Then somebody tells me "there's a women-only workshop, why don't you submit your paper there?". If I submit my paper there and it gets accepted, how can I overcome the feeling of "my paper wasn't good enough for a men's conference, but for a woman's paper it was sufficient"?

        For the same reason, I'm uncomfortable with affirmative action. If I'm a woman applying for a job somewhere and I find out that they prefer women, I might assume that there was a man who was more talented/adequate than me but they settled for me because I was a woman. If that's true, it is also unfair for that man. In general, I want my work to be judged solely based on its quality, preferably without taking gender into consideration, for better and for worse.

        I know I'm presenting a naive approach and that in practice, gender plays a role, even if subconsciously. I also don't really have a better solution for that, but I do hope that if we take care of all the other reasons I discussed, this distribution will eventually change naturally. 

        3. Statistics and Bias
        Last year there was an interesting paper [6], followed by a lengthy discussion, about gender stereotypes in word embeddings. Word embeddings are trained with the objective of capturing meaning through co-occurrence statistics. In other words, words that often occur next to the same neighboring words in a text corpus are optimized to be close together in the vector space. Word embeddings have proved to be extremely useful for many downstream NLP applications.

        The problem that this paper presented was that these word embeddings capture also "bad" statistics, for example gender stereotypes with regard to professions. For instance, word embeddings have a nice property of capturing analogies like "man:king :: woman:queen", but these analogies contain also gender stereotypes like "father:doctor :: mother:nurse", "man:computer programmer :: woman:homemaker", and "he:she :: pilot:flight attendant".

        Why this is happening is pretty obvious - word embeddings are not trained to capture "truth" but only statistics. If most nurses are women, they would occur in the corpus next to words that are more likely to occur with feminine words than with masculine words, resulting in higher similarity between nurse and woman than nurse and man. In other words, if the input corpus reflects stereotypes and biases of society, so will the word embeddings.
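To see how analogy arithmetic surfaces such stereotypes, here is a toy sketch. The four-dimensional vectors are hand-built for this illustration and are not taken from any trained model; in particular, the small gender component I gave to doctor and nurse is an assumption standing in for what corpus statistics would induce.

```python
import math

# Hand-built toy vectors, invented for illustration only.
# Dimensions: (royalty, gender, medical, parent); positive gender = masculine.
VECTORS = {
    "man":    (0.0,  1.0, 0.0, 0.0),
    "woman":  (0.0, -1.0, 0.0, 0.0),
    "king":   (1.0,  1.0, 0.0, 0.0),
    "queen":  (1.0, -1.0, 0.0, 0.0),
    "father": (0.0,  1.0, 0.0, 1.0),
    "mother": (0.0, -1.0, 0.0, 1.0),
    # The gender component on the professions mimics a corpus bias.
    "doctor": (0.0,  0.5, 3.0, 0.0),
    "nurse":  (0.0, -0.5, 3.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(a, b, c):
    """Solve a:b :: c:? by returning the word closest to b - a + c,
    excluding the three query words themselves."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    )
    candidates = (w for w in VECTORS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("man", "king", "woman"))        # queen
print(analogy("father", "doctor", "mother"))  # nurse
```

The same arithmetic that gives the "nice" queen answer also gives the stereotypical nurse answer: the method has no notion of which regularities are desirable, only of which ones are present in the vectors.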

        So why is this a problem, anyway? Don't we want word embeddings to capture the statistics of the real world, even the kind of statistics we don't like? If something should be bothering us, it is the bias in society, rather than the bias these word embeddings merely capture. Or in other words:

        I like this tweet because I was wondering just the same when I first heard about this work. The key concern about bias in word embeddings is that these vectors are commonly used in applications, and this might inadvertently amplify unwanted stereotypes. The example in the paper mentions web search aided by word embeddings. The scenario described is of an employer looking for an intern in computer science by searching for terms related to computer science, and the authors suggest that a LinkedIn page of a male researcher might be ranked higher in the results than that of a female researcher, since computer science terms are closer in the vector space to male names than to female names (because of the current bias). In this scenario, and in many other possible scenarios, the word embeddings are not just passively recording the gender bias, but might actively contribute to it!
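
        Here is a rough sketch of how that search scenario could play out. Everything in it is an assumption for illustration: the 2-dimensional vectors are hand-made rather than trained, the names and pages are hypothetical, and representing a page by the average of its word vectors is just one simple choice, not the system described in the paper.

```python
import math

# Hand-made toy vectors (NOT trained embeddings).
# Axis 0: gender lean (positive = male), axis 1: computer science.
toy = {
    "programming": (0.3, 1.0),  # slight male lean, mimicking a biased corpus
    "software":    (0.2, 1.0),
    "john":        (1.0, 0.0),
    "mary":        (-1.0, 0.0),
}

def average(words):
    """Represent a bag of words by the mean of its word vectors."""
    vecs = [toy[w] for w in words]
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

query = average(["programming", "software"])

# Two hypothetical pages that differ only in the first name.
pages = {
    "John's page": ["john", "software", "programming"],
    "Mary's page": ["mary", "software", "programming"],
}

ranked = sorted(pages,
                key=lambda p: cosine(query, average(pages[p])),
                reverse=True)
print(ranked)
```

        The two pages have exactly the same computer science content, but John's page ends up first, only because the query terms lean slightly towards the male side of the toy space.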

        Hal Daumé III wrote a blog post called Language Bias and Black Sheep about the topic, and suggested that the problem goes even deeper, since corpus co-occurrences don't always capture real-world co-occurrences, but rather statistics of things that are talked about more often:

        "Which leads us to the "black sheep problem." We like to think that language is a reflection of underlying truth, and so if a word embedding (or whatever) is extracted from language, then it reflects some underlying truth about the world. The problem is that even in the simplest cases, this is super false."

        Prior to reading this paper (and the discussion and blog posts that followed it), I never realized that we are more than just passive observers of data; the work we do can actually help mitigate biases or inadvertently contribute to them. I think we should all keep this in mind and try to see in our next work whether it can have any positive or negative effect on that matter -- just like we try to avoid overfitting, cherry-picking, and annoying reviewer 2.

        [1] Cross-national patterns of gender differences in mathematics: A meta-analysis. Else-Quest, Nicole M.; Hyde, Janet Shibley; Linn, Marcia C. Psychological Bulletin, Vol 136(1), Jan 2010, 103-127.
        [2] Sex Differences in Intrinsic Aptitude for Mathematics and Science?: A Critical Review. Spelke, Elizabeth S. American Psychologist, Vol 60(9), Dec 2005, 950-958.
        [3] Women's underrepresentation in science: Sociocultural and biological considerations. Ceci, Stephen J.; Williams, Wendy M.; Barnett, Susan M. Psychological Bulletin, Vol 135(2), Mar 2009, 218-261. 
        [4] Why don't they want a male-dominated job? An investigation of young women who changed their occupational aspirations. Pamela M. Frome, Corinne J. Alfeld, Jacquelynne S. Eccles, and Bonnie L. Barber. Educational Research and Evaluation, Vol. 12, Iss. 4, 2006.
        [5] Women, Gender and Work: What Is Equality and How Do We Get There? Loutfi, Martha Fetherolf. International Labour Office, 1828 L. Street, NW, Washington, DC 20036, 2001.
        [6] Quantifying and Reducing Stereotypes in Word Embeddings. Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016 ICML Workshop on #Data4Good: Machine Learning in Social Good Applications.

        *STEM = science, technology, engineering and mathematics