Some languages, such as Hebrew and Arabic, are not gender neutral, and a brief look at a text can often reveal whether the author is male or female. For example, when a woman writes “I am going” in Hebrew or Arabic, the verb “going” takes a gender-specific form. In a previous post I mentioned that the book “To Kill a Mockingbird” is told from the point of view of a young girl named Scout. As the story unfolds, Scout’s character evolves from a boyish tomboy into a girl who wears dresses and reveals her femininity. When reading the story in English, Scout’s gender is not apparent at first and is revealed later in the book, as she discovers her femininity. This evolution of Scout’s character is not possible in languages such as Hebrew and Arabic, where Scout’s gender is clear within the first few pages, and the reading experience is significantly different.
A study by Prof. Moshe Koppel of Bar-Ilan University examined thousands of English texts, using artificial intelligence and machine learning methods to search for a wide range of patterns. Each text was labeled with relevant information about its author, for example whether the author was male or female, and the study found recurring patterns of systematic differences between the genders. Following the machine learning phase, the study tested the ability of the trained system to classify the gender of the author of a new, unfamiliar text.
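The train-then-classify workflow described above can be sketched in a few lines. The post does not say which algorithm the study used, so this example swaps in a generic multinomial Naive Bayes classifier with add-one smoothing. The class name, method names and the toy labels are all hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Generic Naive Bayes sketch; NOT the algorithm used in the study."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocabulary = set()

    def train(self, labeled_texts):
        """Learn word frequencies from (text, label) pairs."""
        for text, label in labeled_texts:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)

    def classify(self, text):
        """Return the most probable label for a new text, or None if untrained."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior: share of training texts carrying this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                # Add-one smoothing so unseen words do not zero out the score.
                count = self.word_counts[label][token] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In practice the training set would be thousands of texts labeled with the author attribute of interest, and accuracy would be measured on texts the classifier has never seen, as the study did.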
The average person might speculate that text written by males would make more use of words such as “I” and “me”, suggesting that men are full of themselves. However, the study found quite the opposite.
The word “I”, like personal pronouns in general (such as “me” and “she”), is used more by women than by men. In fact, it was found to be one of the most significant characteristics of female writing. Other words frequently used by women are “no”, “not”, “for” and “with”. Words frequently used by men are “the”, “those”, “these” and “that”.
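To make the idea of word-frequency features concrete, here is a minimal sketch that measures how often the marker words listed above appear in a text. The two word sets come from this post. The function names and the idea of reporting a simple rate per group are assumptions for illustration, not the study's actual feature set.

```python
import re
from collections import Counter

# Marker words quoted in this post; the grouping is illustrative only.
FEMALE_MARKERS = {"i", "me", "she", "no", "not", "for", "with"}
MALE_MARKERS = {"the", "those", "these", "that"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def marker_rates(text):
    """Return the share of tokens that fall into each marker group."""
    tokens = tokenize(text)
    if not tokens:
        return {"female_markers": 0.0, "male_markers": 0.0}
    counts = Counter(tokens)
    female = sum(counts[word] for word in FEMALE_MARKERS)
    male = sum(counts[word] for word in MALE_MARKERS)
    return {
        "female_markers": female / len(tokens),
        "male_markers": male / len(tokens),
    }
```

Rates like these say nothing about a single sentence. They only become informative as inputs to a statistical model trained on many labeled texts.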
This study belongs to the field of computer science and is essentially quantitative: it showed an 80% success rate in identifying the author’s gender in English by analyzing the text alone. The study even identified cases where a male intentionally pretended to write in a feminine style. These features are analyzed automatically by the computer program, without human intervention. The researchers then wrote an article about this research and submitted it to a scientific journal. In the article they added a speculative but plausible explanation of why women more often use “I”, “me”, “you”, etc., while men more often use words such as “these” and “that”. The explanation was that women seem more interested in working and interacting with people and living things, while men are more interested in working with objects and physical processes, supporting the “People versus Things” distinction made by Susan Pinker. The article was rejected because its speculative part was considered politically incorrect. It was then revised by removing the speculative part, leaving only the mathematical, quantitative, computer science data, and in that form it was published.
This research software, which analyzes features of English text, can in many cases identify the author’s native language from the typical errors made by people who use English as a second language, and can also identify whether a given text was written by a single person or by several people. The software is also able to identify whether the text is positive or negative and whether the characters or products described in the text are portrayed sympathetically or not.
When the researchers were asked whether they could analyze biblical text and attempt to identify whether it was written in a male or female style (is God male or female?), they said that male/female writing identification is limited to twentieth-century English texts, for which there is a clear comparative reference of texts written by males and females. However, the algorithm is able to analyze writing style and divide a text according to different styles or authors. This analysis was tested on the Bible and did produce some interesting findings.
When using Siri or OK Google or various machine translation tools, the apps use computational linguistics tools. The field of computational linguistics has developed greatly in recent years and demonstrates impressive abilities. The business potential of computational linguistics apps is enormous, and indeed, technology giants and many academic institutions invest greatly in this technology.
Computers do not “understand” language as humans do, and in order to “teach” them a complex and expansive natural language one must provide the computer with tools to analyze and process the language. These are complex machine learning tools that include statistical methods and diverse rule-based algorithms.
The challenge in processing natural languages is huge because natural languages are rich, diverse, constantly changing, and ambiguous. Sometimes even humans have difficulty understanding what is being said due to a lack of context or familiarity with a particular jargon of the speaker.
Another challenge in natural language processing is that most of the natural language information available in the world is in a form that is not computer-friendly. This form is called “unstructured”, as opposed to “structured” information, which exists in spreadsheets and databases. In order to process inherently “unstructured” content using Big Data algorithms, it is necessary to formalize the information. For this, tools such as Text Mining or Text Analytics are used to extract meaningful, quantifiable information from natural language text. Natural Language Processing, or NLP for short, refers to the discipline within computer science and artificial intelligence that deals with processing natural languages. To simplify, Text Mining deals with extracting information for NLP analysis, which then detects a range of quantitative patterns within the extracted text.
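As a rough illustration of what “formalizing” unstructured text can mean, the sketch below turns free-form strings into spreadsheet-like rows of document id, term and count. The function name `term_table` and the row layout are assumptions made for this example. Real text mining tools extract far richer information.

```python
import re
from collections import Counter


def term_table(documents):
    """Turn raw strings into structured rows of (doc_id, term, count)."""
    rows = []
    for doc_id, text in enumerate(documents):
        # Count each lowercase alphabetic word in this document.
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for term, count in sorted(counts.items()):
            rows.append((doc_id, term, count))
    return rows
```

Once text is in this tabular form, it can be loaded into a database or spreadsheet and analyzed with the same tools used for any other structured data.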
One of the applications of NLP is called Sentiment Analysis (also called Opinion Mining), which extracts and quantifies information about the subjective mental state of the speaker or text writer. This application is common in analyzing customer feedback or, in the social network world, for the purpose of characterizing users and providing them with information according to their mental state.
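One of the simplest ways to quantify sentiment is to count words from a positive list and a negative list. The sketch below does exactly that. The two tiny word lists and the scoring formula are assumptions for illustration only, and production systems typically rely on much larger lexicons or trained models.

```python
import re

# Tiny illustrative lexicons; real systems use far larger resources.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment_score(text):
    """Return a score in [-1.0, 1.0]; 0.0 means neutral or no known words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    total = positive + negative
    if total == 0:
        return 0.0
    return (positive - negative) / total
```

A word-counting approach like this misses negation (“not good”) and sarcasm, which is part of why sentiment analysis remains a hard problem.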
Other common applications in the field of NLP are chatbots and voice recognition apps used by systems such as Siri, Google Assistant and other advanced audio systems. Of course, machine translation apps and advertisement matching systems are also based on NLP technology, which has been evolving rapidly over the years.
NLP is divided into two main parts. The first is analysis, or “understanding”, whose job is to produce formal data structures in a computer-readable form, mapping everything that can be mapped. The second is construction, or “generation”, which creates natural language text from the formal data structures in the learning system.
There has also been dramatic development in tools that support human translation, notably what is referred to as Translation Memory, which reuses repetitive text already translated by human translators. Recent related improvements include the ability to automatically correct human-translated text even when it only partially matches previously translated text, the ability to point out conflicts and inconsistencies in human translations, and the ability to “guess” the correct position of tags in translated text that are used for visual formatting, such as bold text and hyperlinks. Translation service providers use a range of technologies to streamline the translation process and to improve the consistency and professionalism of the translated content. The revolution in language processing technology, even if it has not yet completely replaced the need for human translators, certainly provides new and interesting tools for improving the translation process.
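The core of a Translation Memory lookup is a “fuzzy match”: finding the stored source segment most similar to the new one and offering its translation. The sketch below uses `SequenceMatcher` from Python’s standard `difflib` module as the similarity measure. The function name `best_match`, the dictionary layout and the 0.75 threshold are assumptions for illustration, and commercial tools use their own scoring methods.

```python
from difflib import SequenceMatcher


def best_match(segment, memory, threshold=0.75):
    """Return (source, translation, score) for the closest stored segment.

    `memory` maps previously translated source segments to their
    translations. Returns None when nothing scores at or above `threshold`.
    """
    best = None
    for source, translation in memory.items():
        # ratio() is 1.0 for identical strings and lower for dissimilar ones.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[2]:
            best = (source, translation, score)
    if best is not None and best[2] >= threshold:
        return best
    return None
```

For example, with a memory containing “Click the Save button.”, a new segment such as “Click the Cancel button.” would be expected to come back as a close match, so the translator only needs to change one word instead of translating the sentence from scratch.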