Data Science FAQ | https://www.kaggle.com/rounakbanik/data-science-faq
Novice to Grandmaster | https://www.kaggle.com/ash316/novice-to-grandmaster
- Written as a review of the more complex pieces of code
- The full Jupyter notebook is under "More" (더보기)
Q8. What is the average salary of a data scientist?
mcq['CompensationAmount'] = mcq[
    'CompensationAmount'].str.replace(',','')
mcq['CompensationAmount'] = mcq[
    'CompensationAmount'].str.replace('-','')
# Load the exchange rates used for currency conversion
rates = pd.read_csv('data/conversionRates.csv')
rates.drop('Unnamed: 0',axis=1,inplace=True)
salary = mcq[
    ['CompensationAmount','CompensationCurrency',
     'GenderSelect',
     'Country',
     'CurrentJobTitleSelect']].dropna()
salary = salary.merge(rates,left_on='CompensationCurrency',
                      right_on='originCountry', how='left')
salary['Salary'] = pd.to_numeric(
    salary['CompensationAmount']) * salary['exchangeRate']
salary.head()
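- salary = salary.merge(rates, left_on='CompensationCurrency', right_on='originCountry', how='left'): merges rates into salary. Rows are matched where 'CompensationCurrency' in salary (the left frame) equals 'originCountry' in rates (the right frame). Because how='left', every row of salary is kept and the columns of rates are appended to its right.
- pd.to_numeric(): converts values stored as strings into numeric values.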
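The two notes above can be tried on a small stand-in. The sketch below uses made-up toy frames (not the survey data, and the rates are illustrative only) to show that how='left' keeps every row of salary, leaving NaN where no rate matches, and that pd.to_numeric turns numeric strings into numbers that can be multiplied.

```python
import pandas as pd

# Toy stand-ins for the survey frames; all values are made up for illustration.
salary = pd.DataFrame({
    'CompensationAmount': ['250000', '80000', '1000'],
    'CompensationCurrency': ['USD', 'AUD', 'XXX'],  # 'XXX' has no matching rate
})
rates = pd.DataFrame({
    'originCountry': ['USD', 'AUD'],
    'exchangeRate': [1.0, 0.5],  # illustrative rates, not real ones
})

# Left join: every row of `salary` is kept; unmatched currencies get NaN.
merged = salary.merge(rates, left_on='CompensationCurrency',
                      right_on='originCountry', how='left')

# to_numeric parses the numeric strings so they can be multiplied by the rate.
merged['Salary'] = pd.to_numeric(merged['CompensationAmount']) * merged['exchangeRate']
print(merged)
```

The row with the unknown currency survives the merge but ends up with a missing Salary, which is why later cells call dropna() before summarizing.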
More
kaggle_DS_FAQ_3
In [33]:
# Render plots inline in the notebook
%matplotlib inline
# Import the standard Python Scientific Libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress Deprecation and Incorrect Usage Warnings
import warnings
warnings.filterwarnings('ignore')
question = pd.read_csv("data/schema.csv")
# Load the multiple-choice survey responses with pandas
mcq = pd.read_csv('data/multipleChoiceResponses.csv', encoding="ISO-8859-1", low_memory=False)
mcq.shape
Out[33]:
(16716, 228)
In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# Widen the notebook container to fit the window
Q6. Are there any blogs, podcasts, courses, and so on worth recommending?
In [2]:
mcq['BlogsPodcastsNewslettersSelect'] = mcq[
'BlogsPodcastsNewslettersSelect'
].astype('str').apply(lambda x: x.split(','))
mcq['BlogsPodcastsNewslettersSelect'].head()
Out[2]:
0    [Becoming a Data Scientist Podcast, Data Machi...
1    [Becoming a Data Scientist Podcast, Siraj Rava...
2    [FastML Blog, No Free Hunch Blog, Talking Mach...
3                                     [KDnuggets Blog]
4    [Data Machina Newsletter, Jack's Import AI New...
Name: BlogsPodcastsNewslettersSelect, dtype: object
In [3]:
s = mcq.apply(lambda x: pd.Series(x['BlogsPodcastsNewslettersSelect']),
axis=1).stack().reset_index(level=1, drop=True)
s.name = 'platforms'
s.head()
Out[3]:
0    Becoming a Data Scientist Podcast
0              Data Machina Newsletter
0             O'Reilly Data Newsletter
0         Partially Derivative Podcast
0           R Bloggers Blog Aggregator
Name: platforms, dtype: object
In [4]:
s = s[s != 'nan'].value_counts().head(20)
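As an alternative to the apply/stack pattern used in the cells above, pandas 0.25 and later provide Series.explode, which gives each list element its own row. This is a minimal sketch on made-up answers, not the notebook's own method. Splitting with .str.split keeps missing answers as NaN instead of the string 'nan', so they can be dropped with dropna().

```python
import pandas as pd

# Toy stand-in for one multi-select column; the answers are made up.
col = pd.Series(['Coursera,edX', None, 'Coursera', 'Coursera,DataCamp,edX'])

# Split each answer into a list, then give every list element its own row.
# dropna() removes respondents who skipped the question.
platforms = col.str.split(',').explode().dropna()

# Count how often each option was selected.
counts = platforms.value_counts()
print(counts)
```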
In [5]:
plt.figure(figsize=(6,8))
plt.title("Most Popular Blogs and Podcasts")
sns.barplot(y=s.index, x=s)
Out[5]:
<AxesSubplot:title={'center':'Most Popular Blogs and Podcasts'}, xlabel='platforms'>
In [6]:
mcq['CoursePlatformSelect'] = mcq[
'CoursePlatformSelect'].astype(
'str').apply(lambda x: x.split(','))
mcq['CoursePlatformSelect'].head()
Out[6]:
0              [nan]
1              [nan]
2    [Coursera, edX]
3              [nan]
4              [nan]
Name: CoursePlatformSelect, dtype: object
In [7]:
t = mcq.apply(lambda x: pd.Series(x['CoursePlatformSelect']),
axis=1).stack().reset_index(level=1, drop=True)
t.name = 'courses'
t.head(20)
Out[7]:
0          nan
1          nan
2     Coursera
2          edX
3          nan
4          nan
5          nan
6          nan
7     Coursera
8          nan
9          nan
10    Coursera
11         nan
12    Coursera
12    DataCamp
12         edX
13         nan
14         nan
15         nan
16         nan
Name: courses, dtype: object
In [8]:
t = t[t != 'nan'].value_counts()
In [9]:
plt.title("Most Popular Course Platforms")
sns.barplot(y=t.index, x=t)
Out[9]:
<AxesSubplot:title={'center':'Most Popular Course Platforms'}, xlabel='courses'>
Q7. Which skills are considered most important for a data science job?
In [10]:
job_features = [
x for x in mcq.columns if x.find(
'JobSkillImportance') != -1
and x.find('JobSkillImportanceOther') == -1]
job_features
Out[10]:
['JobSkillImportanceBigData', 'JobSkillImportanceDegree', 'JobSkillImportanceStats', 'JobSkillImportanceEnterpriseTools', 'JobSkillImportancePython', 'JobSkillImportanceR', 'JobSkillImportanceSQL', 'JobSkillImportanceKaggleRanking', 'JobSkillImportanceMOOC', 'JobSkillImportanceVisualizations']
In [11]:
jdf = {}
for feature in job_features:
    a = mcq[feature].value_counts()
    a = a/a.sum()
    jdf[feature[len('JobSkillImportance'):]] = a
jdf
Out[11]:
{'BigData': Nice to have    0.574065
 Necessary       0.379929
 Unnecessary     0.046006
 Name: JobSkillImportanceBigData, dtype: float64,
 'Degree': Nice to have    0.598107
 Necessary       0.279867
 Unnecessary     0.122026
 Name: JobSkillImportanceDegree, dtype: float64,
 'Stats': Necessary       0.513889
 Nice to have    0.457576
 Unnecessary     0.028535
 Name: JobSkillImportanceStats, dtype: float64,
 'EnterpriseTools': Nice to have    0.564970
 Unnecessary     0.290200
 Necessary       0.144829
 Name: JobSkillImportanceEnterpriseTools, dtype: float64,
 'Python': Necessary       0.645994
 Nice to have    0.327214
 Unnecessary     0.026792
 Name: JobSkillImportancePython, dtype: float64,
 'R': Nice to have    0.513945
 Necessary       0.414807
 Unnecessary     0.071247
 Name: JobSkillImportanceR, dtype: float64,
 'SQL': Nice to have    0.491778
 Necessary       0.434224
 Unnecessary     0.073998
 Name: JobSkillImportanceSQL, dtype: float64,
 'KaggleRanking': Nice to have    0.677261
 Unnecessary     0.203876
 Necessary       0.118863
 Name: JobSkillImportanceKaggleRanking, dtype: float64,
 'MOOC': Nice to have    0.606994
 Unnecessary     0.285752
 Necessary       0.107255
 Name: JobSkillImportanceMOOC, dtype: float64,
 'Visualizations': Nice to have    0.490820
 Necessary       0.455392
 Unnecessary     0.053788
 Name: JobSkillImportanceVisualizations, dtype: float64}
In [12]:
jdf = pd.DataFrame(jdf).transpose()
jdf
Out[12]:
Skill | Necessary | Nice to have | Unnecessary |
---|---|---|---|
BigData | 0.379929 | 0.574065 | 0.046006 |
Degree | 0.279867 | 0.598107 | 0.122026 |
Stats | 0.513889 | 0.457576 | 0.028535 |
EnterpriseTools | 0.144829 | 0.564970 | 0.290200 |
Python | 0.645994 | 0.327214 | 0.026792 |
R | 0.414807 | 0.513945 | 0.071247 |
SQL | 0.434224 | 0.491778 | 0.073998 |
KaggleRanking | 0.118863 | 0.677261 | 0.203876 |
MOOC | 0.107255 | 0.606994 | 0.285752 |
Visualizations | 0.455392 | 0.490820 | 0.053788 |
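The share table above can also be built with value_counts(normalize=True), which returns the same fractions as dividing the counts by their sum. The following is a minimal sketch on a made-up two-column frame. The column names mimic the survey's JobSkillImportance* naming, but the answers are invented for illustration.

```python
import pandas as pd

# Toy stand-in for two JobSkillImportance* columns; the answers are made up.
df = pd.DataFrame({
    'JobSkillImportancePython': ['Necessary', 'Necessary', 'Nice to have', None],
    'JobSkillImportanceR': ['Nice to have', 'Necessary', 'Unnecessary', 'Nice to have'],
})

prefix = 'JobSkillImportance'
# normalize=True returns shares that sum to 1, equivalent to a / a.sum().
# Missing answers are excluded from the counts by default.
shares = {
    col[len(prefix):]: df[col].value_counts(normalize=True)
    for col in df.columns
}

# One row per skill, one column per answer; answers nobody chose become NaN.
table = pd.DataFrame(shares).transpose()
print(table)
```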
In [13]:
plt.figure(figsize=(10,6))
sns.heatmap(jdf.sort_values("Necessary",
ascending=False), annot=True)
Out[13]:
<AxesSubplot:>
In [14]:
jdf.plot(kind='bar', figsize=(12,6),
title="Skill Importance in Data Science Jobs")
plt.xticks(rotation=60, ha='right')
Out[14]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), [Text(0, 0, 'BigData'), Text(1, 0, 'Degree'), Text(2, 0, 'Stats'), Text(3, 0, 'EnterpriseTools'), Text(4, 0, 'Python'), Text(5, 0, 'R'), Text(6, 0, 'SQL'), Text(7, 0, 'KaggleRanking'), Text(8, 0, 'MOOC'), Text(9, 0, 'Visualizations')])
Q8. What is the average salary of a data scientist?
In [15]:
mcq[mcq['CompensationAmount'].notnull()].shape
Out[15]:
(5224, 228)
In [16]:
mcq['CompensationAmount'] = mcq[
    'CompensationAmount'].str.replace(',','')
mcq['CompensationAmount'] = mcq[
    'CompensationAmount'].str.replace('-','')
# Load the exchange rates used for currency conversion
rates = pd.read_csv('data/conversionRates.csv')
rates.drop('Unnamed: 0',axis=1,inplace=True)
salary = mcq[
    ['CompensationAmount','CompensationCurrency',
     'GenderSelect',
     'Country',
     'CurrentJobTitleSelect']].dropna()
salary = salary.merge(rates,left_on='CompensationCurrency',
                      right_on='originCountry', how='left')
salary['Salary'] = pd.to_numeric(
    salary['CompensationAmount']) * salary['exchangeRate']
salary.head()
Out[16]:
Index | CompensationAmount | CompensationCurrency | GenderSelect | Country | CurrentJobTitleSelect | originCountry | exchangeRate | Salary |
---|---|---|---|---|---|---|---|---|
0 | 250000 | USD | Male | United States | Operations Research Practitioner | USD | 1.000000 | 250000.0 |
1 | 80000 | AUD | Female | Australia | Business Analyst | AUD | 0.802310 | 64184.8 |
2 | 1200000 | RUB | Male | Russia | Software Developer/Software Engineer | RUB | 0.017402 | 20882.4 |
3 | 95000 | INR | Male | India | Data Scientist | INR | 0.015620 | 1483.9 |
4 | 1100000 | TWD | Male | Taiwan | Software Developer/Software Engineer | TWD | 0.033304 | 36634.4 |
In [17]:
print('Maximum Salary is USD $',
salary['Salary'].dropna().astype(int).max())
print('Minimum Salary is USD $',
salary['Salary'].dropna().astype(int).min())
print('Median Salary is USD $',
salary['Salary'].dropna().astype(int).median())
Maximum Salary is USD $ 208999999
Minimum Salary is USD $ -2147483648
Median Salary is USD $ 53812.0
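The minimum printed above, -2147483648, is exactly the smallest 32-bit signed integer. This suggests that astype(int) overflowed on at least one very large converted value rather than that a respondent reported that salary. A minimal sketch on made-up numbers shows the same statistics computed on the float values without any integer cast. This is a suggested variation, not what the notebook ran.

```python
import pandas as pd

# Made-up converted salaries, including one absurdly large entry and a missing one.
salary_usd = pd.Series([53812.0, 20882.4, 250000.0, 2.8e10, None])

# Keep float64 values; no cast to a fixed-width integer, so nothing can wrap around.
clean = salary_usd.dropna()

print('Maximum Salary is USD $', clean.max())
print('Minimum Salary is USD $', clean.min())
print('Median Salary is USD $', clean.median())
```

Implausible entries still distort the maximum, which is why the median is the more useful summary here.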
In [18]:
plt.subplots(figsize=(15,8))
salary=salary[salary['Salary']<500000]
sns.distplot(salary['Salary'])
plt.axvline(salary['Salary'].median(), linestyle='dashed')
plt.title('Salary Distribution',size=15)
Out[18]:
Text(0.5, 1.0, 'Salary Distribution')
In [19]:
plt.subplots(figsize=(8,12))
sal_coun = salary.groupby(
'Country')['Salary'].median().sort_values(
ascending=False)[:30].to_frame()
sns.barplot(x='Salary',
            y=sal_coun.index,
            data=sal_coun,
            palette='RdYlGn')
plt.axvline(salary['Salary'].median(), linestyle='dashed')
plt.title('Highest Salary Paying Countries')
Out[19]:
Text(0.5, 1.0, 'Highest Salary Paying Countries')
In [20]:
plt.subplots(figsize=(8,4))
sns.boxplot(y='GenderSelect',x='Salary', data=salary)
Out[20]:
<AxesSubplot:xlabel='Salary', ylabel='GenderSelect'>
In [21]:
salary_korea = salary.loc[(salary['Country']=='South Korea')]
plt.subplots(figsize=(8,4))
sns.boxplot(y='GenderSelect',x='Salary',data=salary_korea)
Out[21]:
<AxesSubplot:xlabel='Salary', ylabel='GenderSelect'>
In [22]:
salary_korea.shape
Out[22]:
(26, 8)
In [23]:
salary_korea[salary_korea['GenderSelect'] == 'Female']
Out[23]:
Index | CompensationAmount | CompensationCurrency | GenderSelect | Country | CurrentJobTitleSelect | originCountry | exchangeRate | Salary |
---|---|---|---|---|---|---|---|---|
479 | 30000 | KRW | Female | South Korea | Data Analyst | KRW | 0.000886 | 26.58 |
2903 | 800000 | KRW | Female | South Korea | Researcher | KRW | 0.000886 | 708.80 |
4063 | 60000000 | KRW | Female | South Korea | Researcher | KRW | 0.000886 | 53160.00 |
In [24]:
salary_korea_male = salary_korea[
salary_korea['GenderSelect']== 'Male']
salary_korea_male['Salary'].describe()
Out[24]:
count        23.000000
mean      43540.617217
std       37800.608484
min           0.886000
25%       17500.000000
50%       37212.000000
75%       59238.000000
max      177200.000000
Name: Salary, dtype: float64
In [25]:
salary_korea_male
Out[25]:
Index | CompensationAmount | CompensationCurrency | GenderSelect | Country | CurrentJobTitleSelect | originCountry | exchangeRate | Salary |
---|---|---|---|---|---|---|---|---|
85 | 40000000 | KRW | Male | South Korea | Business Analyst | KRW | 0.000886 | 35440.000 |
147 | 80000 | USD | Male | South Korea | Researcher | USD | 1.000000 | 80000.000 |
314 | 60000 | USD | Male | South Korea | Business Analyst | USD | 1.000000 | 60000.000 |
333 | 60000000 | KRW | Male | South Korea | Researcher | KRW | 0.000886 | 53160.000 |
562 | 50000000 | KRW | Male | South Korea | Researcher | KRW | 0.000886 | 44300.000 |
769 | 42000000 | KRW | Male | South Korea | Software Developer/Software Engineer | KRW | 0.000886 | 37212.000 |
799 | 1000 | KRW | Male | South Korea | Machine Learning Engineer | KRW | 0.000886 | 0.886 |
1060 | 75000000 | KRW | Male | South Korea | Scientist/Researcher | KRW | 0.000886 | 66450.000 |
1360 | 30000000 | KRW | Male | South Korea | Statistician | KRW | 0.000886 | 26580.000 |
1568 | 90000 | SGD | Male | South Korea | Computer Scientist | SGD | 0.742589 | 66833.010 |
1576 | 10800000 | KRW | Male | South Korea | Data Scientist | KRW | 0.000886 | 9568.800 |
1905 | 20000 | USD | Male | South Korea | Researcher | USD | 1.000000 | 20000.000 |
1945 | 50000 | KRW | Male | South Korea | Machine Learning Engineer | KRW | 0.000886 | 44.300 |
1949 | 80000000 | KRW | Male | South Korea | Software Developer/Software Engineer | KRW | 0.000886 | 70880.000 |
2322 | 200000000 | KRW | Male | South Korea | Other | KRW | 0.000886 | 177200.000 |
2334 | 60000000 | KRW | Male | South Korea | Machine Learning Engineer | KRW | 0.000886 | 53160.000 |
2557 | 7200000 | KRW | Male | South Korea | Researcher | KRW | 0.000886 | 6379.200 |
2924 | 15000 | USD | Male | South Korea | Researcher | USD | 1.000000 | 15000.000 |
3394 | 66000000 | KRW | Male | South Korea | Programmer | KRW | 0.000886 | 58476.000 |
3832 | 30000000 | KRW | Male | South Korea | Data Scientist | KRW | 0.000886 | 26580.000 |
3979 | 35000000 | KRW | Male | South Korea | Researcher | KRW | 0.000886 | 31010.000 |
4300 | 60000000 | KRW | Male | South Korea | Scientist/Researcher | KRW | 0.000886 | 53160.000 |
4366 | 10000 | USD | Male | South Korea | Data Scientist | USD | 1.000000 | 10000.000 |
Q9. Where do you get data for personal projects or learning?
In [26]:
mcq['PublicDatasetsSelect'] = mcq[
'PublicDatasetsSelect'].astype('str').apply(
lambda x: x.split(',')
)
In [27]:
q = mcq.apply(
lambda x: pd.Series(x['PublicDatasetsSelect']),
axis=1).stack().reset_index(level=1, drop=True)
q.name = 'courses'
In [28]:
q = q[q != 'nan'].value_counts()
In [29]:
pd.DataFrame(q)
Out[29]:
Dataset source | courses |
---|---|
Dataset aggregator/platform (i.e. Socrata/Kaggle Datasets/data.world/etc.) | 6843 |
Google Search | 3600 |
University/Non-profit research group websites | 2873 |
I collect my own data (e.g. web-scraping) | 2560 |
GitHub | 2400 |
Government website | 2079 |
Other | 399 |
In [30]:
plt.title("Most Popular Dataset Platforms")
sns.barplot(y=q.index, x=q)
Out[30]:
<AxesSubplot:title={'center':'Most Popular Dataset Platforms'}, xlabel='courses'>
In [31]:
# Read the free-form responses.
ff = pd.read_csv('data/freeformResponses.csv',
encoding="ISO-8859-1", low_memory=False)
ff.shape
Out[31]:
(16716, 62)
In [34]:
# Look up the question text and who it was asked of
qc = question.loc[question[
'Column'].str.contains('PersonalProjectsChallengeFreeForm')]
print(qc.shape)
qc.Question.values[0]
(1, 3)
Out[34]:
'What is your biggest challenge with the public datasets you find for personal projects?'
What is the biggest challenge when working with public datasets in personal projects?
In [35]:
ppcff = ff[
'PersonalProjectsChallengeFreeForm'].value_counts().head(15)
ppcff.name = 'Response count'
pd.DataFrame(ppcff)
Out[35]:
Response | Response count |
---|---|
None | 23 |
Cleaning the data | 20 |
Cleaning | 20 |
Dirty data | 16 |
Data Cleaning | 14 |
none | 13 |
dirty data | 10 |
Data cleaning | 10 |
- | 9 |
Size | 9 |
Incomplete data | 8 |
Missing data | 8 |
cleaning | 8 |
Lack of documentation | 7 |
Dirty | 6 |
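Several rows in the table above differ only in capitalization ("Cleaning" / "cleaning", "Dirty data" / "dirty data"), so the raw counts understate how common each answer is. A minimal sketch, on made-up answers, shows lowercasing and trimming the text before counting. This normalization step is a suggestion and is not part of the original notebook.

```python
import pandas as pd

# Made-up free-form answers that differ only in case or surrounding spaces.
answers = pd.Series(['Cleaning', 'cleaning', ' Data cleaning', 'Data Cleaning',
                     'Dirty data', 'dirty data', None])

# Drop missing answers, then trim and lowercase so identical answers group together.
normalized = answers.dropna().str.strip().str.lower()

counts = normalized.value_counts()
print(counts)
```

This does not merge answers that differ in wording, such as "cleaning" and "data cleaning". Grouping those would need a manual mapping.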
Q11. Which task takes the most time in data science work?
In [37]:
time_features = [
x for x in mcq.columns if x.find('Time') != -1][4:10]
In [38]:
tdf = {}
for feature in time_features:
    tdf[feature[len('Time'):]] = mcq[feature].mean()
tdf = pd.Series(tdf)
print(tdf)
print()
plt.pie(tdf, labels=tdf.index,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.title("Percentage of Time Spent on Each DS Job")
plt.show()
GatheringData      36.144754
ModelBuilding      21.268066
Production         10.806372
Visualizing        13.869372
FindingInsights    13.094776
OtherSelect         2.396247
dtype: float64