Damegender gender detection with NLTK and Scikit, by David Arroyo - PyBCN Meetup May 2019

Introduction

Sometimes it seems impossible that, in our society, women do not have
the right to be called by their own names. In developed countries we
say that women must be given visibility and a voice, because behind
every important man there is a great woman, so she must have the place
in history that she deserves. Yet in some countries, for example
Afghanistan, many women do not have names; they use nicknames, which
erases their identities.

Tools to detect gender from the names

A tool to detect gender from a name takes a name as input and returns
a sex (male or female) as output, possibly depending on the national
and/or cultural context.
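As a minimal sketch of that input/output contract, the function below
looks a name up in a table of counts and can optionally restrict the
lookup to one country. Everything here is an assumption made for
illustration: the function name guess_gender, the TOY_COUNTS table and
its numbers, and the extra "unknown" answer for names that are missing
or tied. It is not Damegender's own code or data.

```python
from typing import Optional

# Hypothetical toy counts: name -> country code -> (males, females).
# These numbers are invented for the example; they are not real statistics.
TOY_COUNTS = {
    "andrea": {"es": (10, 990), "it": (950, 50)},
    "david": {"es": (1000, 1), "it": (900, 2)},
}


def guess_gender(name: str, country: Optional[str] = None) -> str:
    """Return 'male', 'female' or 'unknown' for a first name."""
    per_country = TOY_COUNTS.get(name.strip().lower())
    if per_country is None:
        return "unknown"
    if country is not None:
        if country not in per_country:
            return "unknown"
        rows = [per_country[country]]
    else:
        # No context given: add up the counts of every known country.
        rows = list(per_country.values())
    males = sum(m for m, _ in rows)
    females = sum(f for _, f in rows)
    if males == females:
        return "unknown"
    return "male" if males > females else "female"


print(guess_gender("Andrea", "es"))
print(guess_gender("Andrea", "it"))
```

The two calls show why the context matters: with these toy counts the
same name gets a different answer in each country.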

Until Damegender started, many of the gender detection tools on the
market with very good accuracy were proprietary software, accessible
only by paying a fee for an API. GenderGuesser (known earlier as
sexmachine or gender.c) does bring a wide, international dataset, but
it has not been maintained for several years.

The main reason to use Damegender rather than GenderGuesser is its
collection of open datasets provided by states. Other features include
guessing from nicknames and scripts to use and compare these tools.

Applications

In these times of COVID-19 and Tinder, when relationships between
people are becoming distant and ephemeral, I think anyone may find it
fun to count males and females in different communities (mailing
lists, software repositories, instant messaging channels, ...). The
feminist movement knows how to form groups of many women in solidarity
to compete with men, and so run experiments about gender equality. In
this sense, tools such as Perceval provide good plugins to interact
with these kinds of communities.
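Counting males and females in a community could look roughly like the
sketch below. It assumes the author strings (in the "Name <email>"
form used by Git) have already been fetched by some other tool;
Perceval itself is not called here. The TOY_GENDERS table, the example
authors and the helper names are hypothetical.

```python
from collections import Counter

# Hypothetical lookup table; a real run would use a full names dataset.
TOY_GENDERS = {"ada": "female", "grace": "female", "linus": "male"}


def first_name(author: str) -> str:
    """Extract the first name from 'Ada Lovelace <ada@example.org>'."""
    display = author.split("<", 1)[0].strip()
    return display.split()[0].lower() if display else ""


def count_genders(authors):
    """Count each distinct author once and tally the guessed genders."""
    tally = Counter()
    for author in set(authors):
        tally[TOY_GENDERS.get(first_name(author), "unknown")] += 1
    return tally


# Invented author strings, with one duplicate and one unrecognised name.
authors = [
    "Ada Lovelace <ada@example.org>",
    "Ada Lovelace <ada@example.org>",
    "Linus Torvalds <linus@example.org>",
    "Grace Hopper <grace@example.org>",
    "zz9 <bot@example.org>",
]
tally = count_genders(authors)
print(tally["female"], tally["male"], tally["unknown"])
```

Removing duplicates first matters because one person usually appears
many times in a mailing list or a commit log.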


Benchmarking

From October to December 2019 we ran experiments that gave the
following results:

Name             Accuracy   Precision   F1 score   Recall
Genderapi        0.969      0.972       0.964      1.0
Genderize        0.927      0.976       0.965      1.0
Namsor           0.867      0.973       0.924      1.0
Nameapi          0.83       0.974       0.905      1.0
Gender Guesser   0.774      0.985       0.872      1.0

We repeated these tests with Machine Learning algorithms, using the
datasets from INE.es (Spain), the USA, the United Kingdom and Uruguay,
and reached results good enough to take third position in the market.

Name            Accuracy   Precision   F1 score   Recall
SVC             0.879      0.972       0.972      1.0
Random Forest   0.862      0.902       0.902      1.0
NLTK (Bayes)    0.862      0.902       0.902      1.0
MultinomialNB   0.782      0.791       0.791      1.0
Tree            0.764      0.821       0.796      1.0
SGD             0.709      0.943       0.815      1.0
GaussianNB      0.709      0.968       0.887      1.0
BernoulliNB     0.699      0.965       0.816      1.0
AdaBoost        0.698      0.965       0.815      1.0
MLP             0.677      0.819       0.755      1.0
Average         0.765      0.906       0.845      1.0
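To give an idea of what a Bayes classifier over names does, here is a
small Naive Bayes written in plain Python. It is only a sketch under
stated assumptions, not the NLTK or Scikit code used for the table:
the two features (last letter and last two letters of the name), the
Laplace smoothing and the tiny training list are all choices made for
this illustration, and the class and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set; a real run would use a national dataset.
TRAIN = [
    ("maria", "female"), ("laura", "female"), ("ana", "female"),
    ("elena", "female"), ("sofia", "female"),
    ("pedro", "male"), ("pablo", "male"), ("carlos", "male"),
    ("marcos", "male"), ("antonio", "male"),
]


def features(name):
    """Describe a name by its last letter and its last two letters."""
    name = name.lower()
    return {"last": name[-1:], "last2": name[-2:]}


class NaiveBayes:
    def fit(self, examples):
        self.label_counts = Counter(label for _, label in examples)
        # counts[label][feature name][feature value] -> occurrences
        self.counts = defaultdict(lambda: defaultdict(Counter))
        self.values = defaultdict(set)
        for name, label in examples:
            for key, value in features(name).items():
                self.counts[label][key][value] += 1
                self.values[key].add(value)
        return self

    def predict(self, name):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # Log prior plus the log likelihood of each feature.
            score = math.log(n / total)
            for key, value in features(name).items():
                seen = self.counts[label][key][value]
                # Laplace smoothing, with one extra slot for unseen values.
                vocab = len(self.values[key]) + 1
                score += math.log((seen + 1) / (n + vocab))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(TRAIN)
print(model.predict("lucia"), model.predict("mateo"))
```

Neither "lucia" nor "mateo" is in the training list; the classifier
decides from the name endings it has seen.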

In the last month, we received 3000 downloads from PyPI.

Beyond first names: surnames and national origins

The dataset from the United States of America provides statistics
about surnames and races, while the dataset from Spain provides simple
statistics relating surnames to national origins.

$ python3 surnameincountries.py Arroyo
In Spain (Instituto Nacional de Estadística) the surname ARROYO is present with people of another countries:
+ Côte d'Ivoire
+ Cuba
+ Dominican Republic
+ France
+ United Kingdom of Great Britain
+ Guatemala
+ Italy
+ United States of America
+ Uruguay

Conclusions

The future of names and gender lies with free data and Free Software.
In Damegender, we are looking for women to collaborate in making this
tool useful to the feminist movement. Thanks to Geek Feminist for the
opportunity to disseminate this work.
