As our first project at the Metis Data Science Bootcamp, we were assigned a fictional client, WomenTechWomenYes, and asked to optimize the placement of their street teams in order to maximize attendance at their gala.
The Client
WomenTechWomenYes (WTWY) has an annual gala at the beginning of the summer each year. As we are new and inclusive organization, we try to do double duty with the gala both to fill our event space with individuals passionate about increasing the participation of women in technology, and to concurrently build awareness and reach.
To this end we place street teams at entrances to subway stations. The street teams collect email addresses and those who sign up are sent free tickets to our gala.
Where we’d like to solicit your engagement is […] to help us optimize the placement of our street teams, such that we can gather the most signatures, ideally from those who will attend the gala and contribute to our cause.
WomenTechWomenYes Client Email
The Process
As suggested, we looked at data from the MTA in New York to find the subway stations with the highest number of entries, since those stations would have the most traffic and therefore the best chance of collecting signatures.
We also looked at US Census Data to determine which boroughs of New York had populations who would be most likely to contribute to the cause.
MTA Data
In analyzing the MTA data, we decided to focus on the number of entries to each station. Given more time, we would have widened the analysis to include both entries and exits, but with only four days to complete the work, we limited ourselves to entries.
The first thing we did was standardize our column names and remove any duplicate entries that existed for each turnstile.
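A minimal sketch of this cleanup step, assuming pandas. The column names are modeled on the public MTA turnstile files and the sample rows are made up for illustration; the helper names `standardize_columns` and `drop_duplicate_readings` are hypothetical, not the code from the project.

```python
import pandas as pd

# Tiny made-up sample shaped like an MTA turnstile export (assumed layout).
raw = pd.DataFrame({
    "C/A": ["A002", "A002", "A002"],
    "UNIT": ["R051", "R051", "R051"],
    "SCP": ["02-00-00", "02-00-00", "02-00-00"],
    "STATION": ["59 ST", "59 ST", "59 ST"],
    "DATE": ["05/01/2017", "05/01/2017", "05/01/2017"],
    "TIME": ["00:00:00", "04:00:00", "04:00:00"],  # last reading is repeated
    "ENTRIES": [100, 130, 130],
    "EXITS   ": [50, 60, 60],  # header with stray trailing whitespace
})


def standardize_columns(df):
    """Return a copy with stripped, lower-case, slash-free column names."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip().str.lower().str.replace("/", "_", regex=False)
    )
    return out


def drop_duplicate_readings(df):
    """Keep a single row per turnstile and timestamp."""
    keys = ["c_a", "unit", "scp", "station", "date", "time"]
    return df.drop_duplicates(subset=keys, keep="first").reset_index(drop=True)


clean = drop_duplicate_readings(standardize_columns(raw))
print(clean)
```

Here a turnstile is identified by the combination of control area, unit, and device address; which copy of a duplicated reading to keep is a judgment call, and this sketch simply keeps the first.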
As the data we worked from listed lifetime (cumulative) entries for each turnstile, we had to take the difference between consecutive readings to find the number of entries for each block of time. After that, we found that some of the data fell outside any plausible range, including some stations with a negative number of entries per day (impossible) and some with over a million per day (not technically impossible, but incredibly unlikely). We removed all of the negative values as well as any that were more than 3 standard deviations from the mean so that our data wouldn’t be skewed by outliers.
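The differencing and outlier filter described above could look roughly like this in pandas. The column names (`turnstile`, `timestamp`, `entries`) and the sample counter values are assumptions for illustration, and the sketch assumes the mean and standard deviation are computed after the negative values are dropped, which the write-up does not specify.

```python
import pandas as pd

# Made-up cumulative counter: mostly +100 per reading, with one implausible
# jump and one counter reset that shows up as a negative difference.
increments = [100] * 20
increments[7] = 1_000_000
increments[13] = -500_000
readings = pd.DataFrame({
    "turnstile": ["T1"] * 20,
    "timestamp": pd.date_range(
        "2017-05-01", periods=20, freq=pd.Timedelta(hours=4)
    ),
    "entries": pd.Series(increments).cumsum(),
})


def interval_entries(df):
    """Turn cumulative counters into entries per block of time."""
    out = df.sort_values(["turnstile", "timestamp"]).copy()
    out["interval_entries"] = out.groupby("turnstile")["entries"].diff()
    # The first reading of each turnstile has no predecessor, so drop it.
    return out.dropna(subset=["interval_entries"])


def remove_outliers(df, n_std=3.0):
    """Drop negative counts, then counts more than n_std deviations from the mean."""
    non_negative = df[df["interval_entries"] >= 0]
    mean = non_negative["interval_entries"].mean()
    std = non_negative["interval_entries"].std()
    within = (non_negative["interval_entries"] - mean).abs() <= n_std * std
    return non_negative[within]


cleaned = remove_outliers(interval_entries(readings))
print(cleaned["interval_entries"].describe())
```

One caveat with a 3-standard-deviation rule is that a single huge value inflates the standard deviation itself, so on very small samples nothing can ever exceed the threshold; it works better on the full dataset than on a handful of rows.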
When plotting this, we ended up with this mess:
From that, we pulled the top ten stations to get:
We also used the top ten station data to find the times of highest activity:
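The station ranking and time-of-day breakdown could be computed along these lines. This is a sketch under assumptions: the frame already holds cleaned per-interval entries, the `hour` column marks when each reading was recorded, and the station names and counts are placeholders rather than MTA figures.

```python
import pandas as pd

# Placeholder per-interval entries; not real MTA numbers.
entries = pd.DataFrame({
    "station": ["A", "A", "B", "B", "C", "C"],
    "hour": [8, 18, 8, 18, 8, 18],
    "interval_entries": [500, 900, 300, 400, 50, 20],
})


def top_stations(df, n=10):
    """Total entries per station, largest first, limited to n stations."""
    return df.groupby("station")["interval_entries"].sum().nlargest(n)


def busiest_hours(df, stations):
    """Total entries per hour across the given stations, busiest first."""
    subset = df[df["station"].isin(stations)]
    totals = subset.groupby("hour")["interval_entries"].sum()
    return totals.sort_values(ascending=False)


top = top_stations(entries, n=2)
hours = busiest_hours(entries, top.index)
print(top)
print(hours)
```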
We could have simply stopped here and recommended the top ten stations at 6pm. But this is where we decided to add…
Census Data
Once we had our MTA data sorted out and easily identifiable, we pulled in data from the US Census to help determine where to place the street teams by borough. Since the census data was pretty messy, a lot of time went into cleaning this up, including transposing it in order to have our borough values represented by rows instead of columns.
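As an illustration of the transposing step, here is a small pandas sketch. It assumes the census export has one row per fact and one column per borough, with numbers stored as formatted strings; the `Fact` column name, the borough labels, and all values are placeholders, not census figures.

```python
import pandas as pd

# Placeholder table: facts as rows, boroughs as columns (assumed layout).
quickfacts = pd.DataFrame({
    "Fact": ["Population", "Median household income"],
    "Borough A": ["1,000", "$50,000"],
    "Borough B": ["2,500", "$40,000"],
})


def tidy_census(df):
    """Transpose so each borough is a row, then parse the numeric strings."""
    out = df.set_index("Fact").T
    out.index.name = "borough"
    out.columns.name = None
    # Strip currency, thousands, and percent symbols before converting.
    return out.apply(
        lambda col: pd.to_numeric(col.str.replace(r"[$,%]", "", regex=True))
    )


boroughs = tidy_census(quickfacts)
print(boroughs)
```

With boroughs as rows, each criterion becomes an ordinary column that can be sorted or compared directly.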
Once we had our data in a readable format, we looked at which boroughs had a higher concentration of the following things:
- Women per Square Mile
- Female-Owned Firms per Square Mile
  - We decided to focus on these two areas as locations with more female traffic as well as more businesses with female owners would be more likely to be interested in the mission of WTWY to increase the participation of women in technology.
- Median Annual Income
  - Areas with a higher median annual income were targeted as higher-income people are more likely to donate to the cause.
- Homes with Broadband
  - Areas with more broadband-equipped homes were targeted as broadband usage is often associated with a higher-tech base of people, which is what we were targeting.
Unsurprisingly, we found that Manhattan had the highest concentration on all of our criteria, so we decided that WTWY should focus its primary efforts on Manhattan. However, in the spirit of inclusiveness mentioned in the initial client email, we also decided to deploy teams to Queens, which was third in income and second in broadband usage, as well as Brooklyn, which was second in both women and female-owned firms per square mile.
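One way to compare boroughs across several criteria at once is to convert counts to per-square-mile densities and rank each column. The sketch below does that and averages the ranks. The averaging is an assumption added for illustration (the analysis above compared the criteria individually), and the borough labels, column names, and numbers are placeholders.

```python
import pandas as pd

# Placeholder inputs; not census figures.
boroughs = pd.DataFrame(
    {
        "women": [400, 900, 300],
        "female_owned_firms": [40, 60, 10],
        "land_area_sq_mi": [10, 30, 20],
        "median_income": [70, 50, 60],
        "broadband_share": [0.9, 0.8, 0.7],
    },
    index=["A", "B", "C"],
)


def rank_boroughs(df):
    """Score each borough on the four criteria and sort by mean rank."""
    scores = pd.DataFrame(index=df.index)
    scores["women_per_sq_mi"] = df["women"] / df["land_area_sq_mi"]
    scores["firms_per_sq_mi"] = df["female_owned_firms"] / df["land_area_sq_mi"]
    scores["median_income"] = df["median_income"]
    scores["broadband_share"] = df["broadband_share"]
    # Rank 1 is the highest value on each criterion.
    ranks = scores.rank(ascending=False)
    scores["mean_rank"] = ranks.mean(axis=1)
    return scores.sort_values("mean_rank")


ranked = rank_boroughs(boroughs)
print(ranked)
```

Dividing by land area matters here: in the placeholder data, borough B has the most women in absolute terms but a lower density than borough A.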
The Recommendation
We advised WTWY to place their street teams in the following stations:
- Manhattan
  - 34th St.–Penn Station
  - 23rd St.
  - Grand Central Station
  - 34th St.–Herald Square
  - Union Square
- Queens
  - Flushing–Main Street
  - Jackson Heights–Roosevelt Avenue
- Brooklyn
  - Atlantic Avenue–Barclays Center
To optimize the traffic, we recommended deploying the teams between 5 and 9 p.m. Tuesday through Friday.
Further Study
With more time, the analysis could be extended to include exits in addition to entries to better target certain stations. We could also analyze each individual station to craft more specific day and time recommendations.