Clustering Neighborhoods in Allahabad
Introduction to the Problem
We would try to implement the similar problem we have been taught and discussed in the course itself. We would try to find out that how similar or dissimilar two areas of a city are considering some specific features. For our case we are going to consider Allahabad, It was not easy to find the Allahabad datasets but still, we managed to collect it.
Solution
Here I will convert addresses to their corresponding latitude and longitude values. I will use the Foursquare API to explore neighborhoods in Allahabad. I will use the explore function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. I will use the k-means clustering algorithm to complete this task. Finally, I will use the Folium library to visualize the neighborhoods in Allahabad City and their emerging clusters
Way to the Solution
Download and Explore Dataset
Explore Neighborhoods in Allahabad India
Analyze Each Neighborhood
Cluster Neighborhoods
Examine Clusters
Installing all the required dependencies
!pip install geocoder
Import each and every required library and package
- BeautifulSoup and requests for scraping the data
- Pandas and numpy for making structure and preprocessing of the data
- Geopy for getting the long and lats of the places
- Folium for maps and more information
- Matplotlib for visualization
- Sklearn for KMeans model
import requests
from bs4 import BeautifulSoup
import pandas as pd
from geopy.geocoders import Nominatim
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
from sklearn.cluster import KMeans
Scrapping of the data from the Wikipedia page https://en.m.wikipedia.org/wiki/Neighbourhoods_of_Allahabad¶
After doing the proper inspection of the page I got to know that the the names are stored under li tags inside a division of class 'mw-parser-output'.
data = requests.get("https://en.m.wikipedia.org/wiki/Neighbourhoods_of_Allahabad").text
soup = BeautifulSoup(data, 'html.parser')
neighborhoodList = []
for row in soup.find_all("div", class_= "mw-parser-output")[0].findAll("ul")[0].findAll("li"):
neighborhoodList.append(row.text.split()[1])
kl_df = pd.DataFrame({"Neighborhood": neighborhoodList})
kl_df.head()
print(len(kl_df))
Geo-location coordinates generation of the places
geolocator = Nominatim(user_agent="courcera_capston")
new_list = []
def get_latlng(neighborhood):
global new_list
location = geolocator.geocode('{}, Prayagraj, India'.format(neighborhood))
try:
loc = (location.latitude, location.longitude)
new_list.append(neighborhoodList)
return loc
except:
pass
coords = [get_latlng(neighborhood) for neighborhood in kl_df["Neighborhood"].tolist() if get_latlng(neighborhood) != None]
Get the location of the city Allahabad and combining them to the location data frame.
address = 'Prayagraj, India'
geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Hyderabad, India {}, {}.'.format(latitude, longitude))
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])
kl_df['Latitude'] = df_coords['Latitude']
kl_df['Longitude'] = df_coords['Longitude']
kl_df.dropna(inplace=True)
print(kl_df.shape)
Plot the data-points of the data-frame on the map using Folium
map_kl = folium.Map(location=[latitude, longitude], zoom_start=11)
for lat, lng, neighborhood in zip(kl_df['Latitude'], kl_df['Longitude'], kl_df['Neighborhood']):
label = '{}'.format(neighborhood)
label = folium.Popup(label, parse_html=True)
folium.CircleMarker([lat, lng],radius=5,popup=label,color='blue',fill=True,fill_color='#3186cc',fill_opacity=0.7).add_to(map_kl)
map_kl
Connecting to the foursquare api to get more info about the locations
CLIENT_ID = 'JH54IDPYRYILFWBGNXRIB2UXSNYGDGUJVHKPROH44R0TLGII'
CLIENT_SECRET = '1C0YP3ZVJP3ZS3VOQEWAUP4DJM5TBBBHMTIFUTCEAGYZQKBM'
VERSION = '20180605'
radius = 2000
LIMIT = 100
venues = []
for lat, long, neighborhood in zip(kl_df['Latitude'], kl_df['Longitude'], kl_df['Neighborhood']):
url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(CLIENT_ID,CLIENT_SECRET,VERSION,lat,long,radius,LIMIT)
results = requests.get(url).json()["response"]['groups'][0]['items']
for venue in results:
venues.append((neighborhood,lat,long,venue['venue']['name'],
venue['venue']['location']['lat'],venue['venue']['location'] ['lng'],venue['venue']['categories'][0]['name']))
venues_df = pd.DataFrame(venues)
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']
print(venues_df.shape)
venues_df.head()
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))
venues_df['VenueCategory'].unique()
# One hot encoding of the l
kl_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")
# Adding neighborhood column back to dataframe
kl_onehot['Neighborhoods'] = venues_df['Neighborhood']
# Moving neighbourhood column to the first column
fixed_columns = [kl_onehot.columns[-1]] + list(kl_onehot.columns[:-1])
kl_onehot = kl_onehot[fixed_columns]
print(kl_onehot.head())
kl_grouped=kl_onehot.groupby(["Neighborhoods"]).sum().reset_index()
print(kl_grouped.shape)
kl_grouped.head()
# Creating a dataframe for Shopping Mall data only
kl_mall = kl_grouped[["Neighborhoods","Shopping Mall"]]
kclusters = 2
kl_clustering = kl_mall.drop(["Neighborhoods"], 1)
# Run k-means clustering algorithm
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(kl_clustering)
# Checking cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
# Creating a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
kl_merged = kl_mall.copy()
# Add the clustering labels
kl_merged["Cluster Labels"] = kmeans.labels_
kl_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
kl_merged.head(10)
# Adding latitude and longitude values to the existing dataframe
kl_merged['Latitude'] = kl_df['Latitude']
kl_merged['Longitude'] = kl_df['Longitude']
# Sorting the results by Cluster Labels
kl_merged.sort_values(["Cluster Labels"], inplace=True)
kl_merged
# Creating the map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)
# Setting color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kl_merged['Latitude'], kl_merged['Longitude'], kl_merged['Neighborhood'], kl_merged['Cluster Labels']):
label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
folium.CircleMarker([lat,lon],radius=5,popup=label,color=rainbow[cluster-1],fill=True,fill_color=rainbow[cluster-1],fill_opacity=0.7).add_to(map_clusters)
map_clusters
print(len(kl_merged.loc[kl_merged['Cluster Labels'] == 0]))
print(len(kl_merged.loc[kl_merged['Cluster Labels'] == 1]))
Keep it up bro
ReplyDelete