Data Science FAQ | https://www.kaggle.com/rounakbanik/data-science-faq
Novice to Grandmaster | https://www.kaggle.com/ash316/novice-to-grandmaster
- Written as a review of the more complex code
- The Jupyter notebook I worked through is under "See more"
Q12. Does a degree matter for becoming a data scientist?
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.figure_factory as fig_fact

# The five most common answers to the UniversityImportance question
top_uni = mcq['UniversityImportance'].value_counts().head(5)
top_uni_dist = []
for uni in top_uni.index:
    # Ages of the respondents who gave this answer, skipping missing ages
    top_uni_dist.append(
        mcq[(mcq['Age'].notnull()) &
            (mcq['UniversityImportance'] == uni)]['Age'])

group_labels = top_uni.index
fig = fig_fact.create_distplot(top_uni_dist, group_labels)
py.iplot(fig, filename='University Importance by Age')
- pip install plotly
- For each answer uni, takes the rows where the Age column is not null and the UniversityImportance column equals uni, and appends their Age values to top_uni_dist
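The filtering step described above can be tried on a small, made-up DataFrame. This is only a sketch: the toy rows and the helper name `ages_by_answer` are assumptions for illustration, while the column names `Age` and `UniversityImportance` come from the survey data.

```python
import pandas as pd

# Toy stand-in for the survey responses (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [25, None, 31, 40, 22],
    "UniversityImportance": [
        "Very important", "Very important", "Important",
        "Very important", "Important",
    ],
})


def ages_by_answer(df, answers):
    """Return one Series of non-null ages per answer, in the given order."""
    groups = []
    for answer in answers:
        # Keep rows where Age is present AND the answer matches.
        mask = df["Age"].notnull() & (df["UniversityImportance"] == answer)
        groups.append(df.loc[mask, "Age"])
    return groups


# Most common answers first, as value_counts() sorts by frequency.
top_answers = toy["UniversityImportance"].value_counts().head(5).index
age_groups = ages_by_answer(toy, top_answers)
for answer, ages in zip(top_answers, age_groups):
    print(answer, ages.tolist())
```

Each element of `age_groups` lines up with the label at the same position in `top_answers`, which is the pairing that `create_distplot` expects from its data list and `group_labels`.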
See more
kaggle_DS_FAQ_4
In [1]:
# Render plots inside the notebook
%matplotlib inline
# Import the standard Python Scientific Libraries
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
# Suppress Deprecation and Incorrect Usage Warnings
import warnings
warnings.filterwarnings('ignore')
question = pd.read_csv("data/schema.csv")
# Load the multiple-choice survey responses with pandas
mcq = pd.read_csv('data/multipleChoiceResponses.csv', encoding="ISO-8859-1", low_memory=False)
mcq.shape
Out[1]:
(16716, 228)
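The `encoding="ISO-8859-1"` argument in the cell above matters when a file contains bytes that are not valid UTF-8. Whether that is the reason it is used for this survey file is an assumption; the sketch below only demonstrates the general behavior on a tiny in-memory CSV with made-up contents.

```python
import io

import pandas as pd

# A tiny CSV encoded as ISO-8859-1 (Latin-1); "é" is the single byte 0xE9 here.
raw = "Name,Country\nRené,France\n".encode("ISO-8859-1")

# Decoding with the matching encoding recovers the text.
df = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1")
print(df.loc[0, "Name"])

# The same bytes are not valid UTF-8, so the default decoder rejects them.
try:
    pd.read_csv(io.BytesIO(raw))
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```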
In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# Widen the notebook cells to fit the window
Q11. What factors matter most when looking for a data science job?
In [2]:
# Look up the question text and who was asked
qc = question.loc[question['Column'].str.contains('JobFactor')]
print(qc.shape)
qc.Question.values
(16, 3)
Out[2]:
array(['How are you assessing potential job opportunities? - Opportunities for professional development', 'How are you assessing potential job opportunities? - The compensation and benefits offered', "How are you assessing potential job opportunities? - The office environment I'd be working in", "How are you assessing potential job opportunities? - The languages, frameworks, and other technologies I'd be working with", "How are you assessing potential job opportunities? - The amount of time I'd have to spend commuting", 'How are you assessing potential job opportunities? - How projects are managed at the company or organization', 'How are you assessing potential job opportunities? - The experience level called for in the job description', "How are you assessing potential job opportunities? - The specific department or team I'd be working on", "How are you assessing potential job opportunities? - The specific role or job title I'd be applying for", 'How are you assessing potential job opportunities? - The financial performance or funding status of the company or organization', "How are you assessing potential job opportunities? - How widely used or impactful the product or service I'd be working on is", 'How are you assessing potential job opportunities? - The opportunity to work from home/remotely', "How are you assessing potential job opportunities? - The industry that I'd be working in", "How are you assessing potential job opportunities? - The reputations of the company's senior leaders", 'How are you assessing potential job opportunities? - The diversity of the company or organization', 'How are you assessing potential job opportunities? - Opportunity to publish my results'], dtype=object)
In [3]:
mcq = pd.read_csv('data/multipleChoiceResponses.csv', encoding="ISO-8859-1", low_memory=False)
job_factors = [x for x in mcq.columns if 'JobFactor' in x]
In [4]:
jfdf = {}
for feature in job_factors:
    # Share of each answer among the non-null responses for this factor
    a = mcq[feature].value_counts()
    a = a/a.sum()
    jfdf[feature[len('JobFactor'):]] = a
jfdf = pd.DataFrame(jfdf).transpose()
plt.figure(figsize=(6,10))
plt.xticks(rotation=60, ha='right')
sns.heatmap(jfdf.sort_values('Very Important',
ascending=False), annot=True)
Out[4]:
<AxesSubplot:>
In [5]:
jfdf.plot(kind='bar', figsize=(18,6),
title="Things to look for while considering Data Science Jobs")
plt.xticks(rotation=60, ha='right')
plt.show()
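The proportion table behind the heatmap and the bar chart can be reproduced on toy data. This is a sketch under assumptions: the two column names and all answers below are made up, and `value_counts(normalize=True)` is used as a shorter equivalent of dividing the counts by their sum.

```python
import pandas as pd

# Toy stand-in for two JobFactor columns (answers are made up for illustration).
toy = pd.DataFrame({
    "JobFactorSalary": ["Very Important", "Very Important", "Not important", None],
    "JobFactorCommute": ["Somewhat important", "Very Important", None, None],
})

shares = {}
for column in toy.columns:
    # normalize=True divides each count by the number of non-null answers,
    # which gives the same result as a / a.sum().
    shares[column[len("JobFactor"):]] = toy[column].value_counts(normalize=True)

# One row per factor, one column per answer; answers nobody gave become NaN.
table = pd.DataFrame(shares).transpose()
print(table.sort_values("Very Important", ascending=False))
```

Because every row is normalized by its own number of answers, factors answered by different numbers of respondents stay comparable in the heatmap.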
Q12. Does a degree matter for becoming a data scientist?
In [6]:
sns.countplot(y='UniversityImportance', data=mcq)
Out[6]:
<AxesSubplot:xlabel='count', ylabel='UniversityImportance'>
In [8]:
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.figure_factory as fig_fact

# The five most common answers to the UniversityImportance question
top_uni = mcq['UniversityImportance'].value_counts().head(5)
top_uni_dist = []
for uni in top_uni.index:
    # Ages of the respondents who gave this answer, skipping missing ages
    top_uni_dist.append(
        mcq[(mcq['Age'].notnull()) &
            (mcq['UniversityImportance'] == uni)]['Age'])

group_labels = top_uni.index
fig = fig_fact.create_distplot(top_uni_dist, group_labels)
py.iplot(fig, filename='University Importance by Age')
Q13. Where should you start learning data science?
In [9]:
mcq[mcq['FirstTrainingSelect'].notnull()].shape
Out[9]:
(14712, 228)
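In the output above, 14712 of the 16716 rows have an answer for FirstTrainingSelect, roughly 88%. The same check can be written more directly with `Series.count()`, shown here on made-up rows; the toy values and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the survey (values are made up for illustration).
toy = pd.DataFrame({
    "FirstTrainingSelect": [
        "Online courses", None, "University courses", "Online courses",
    ],
    "Age": [25, 31, None, 40],
})

# Row count of the filtered frame, as in the cell above.
answered_rows = toy[toy["FirstTrainingSelect"].notnull()].shape[0]

# Series.count() skips missing values, so it gives the same number directly.
answered = toy["FirstTrainingSelect"].count()

# Share of respondents who answered the question.
response_rate = answered / len(toy)
print(answered_rows, answered, response_rate)
```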
In [10]:
sns.countplot(y='FirstTrainingSelect', data=mcq)
Out[10]:
<AxesSubplot:xlabel='count', ylabel='FirstTrainingSelect'>
Q14. What matters most on a data scientist's resume?
In [11]:
sns.countplot(y='ProveKnowledgeSelect', data=mcq)
Out[11]:
<AxesSubplot:xlabel='count', ylabel='ProveKnowledgeSelect'>
Q15. Do you need math to use machine learning algorithms?
In [12]:
# Look up the question text and who was asked
qc = question.loc[question['Column'].str.contains('AlgorithmUnderstandingLevel')]
qc
Out[12]:
| | Column | Question | Asked |
---|---|---|---|
227 | AlgorithmUnderstandingLevel | At which level do you understand the mathemati... | CodingWorker |
In [13]:
mcq[mcq['AlgorithmUnderstandingLevel'].notnull()].shape
Out[13]:
(7410, 228)
In [14]:
sns.countplot(y='AlgorithmUnderstandingLevel', data=mcq)
Out[14]:
<AxesSubplot:xlabel='count', ylabel='AlgorithmUnderstandingLevel'>
Q16. Where should you look for a job?
In [15]:
# Look up the question text and who was asked
question.loc[question['Column'].str.contains(
    'JobSearchResource|EmployerSearchMethod')]
Out[15]:
| | Column | Question | Asked |
---|---|---|---|
108 | EmployerSearchMethod | How did you find your current job? - Selected ... | CodingWorker-NC |
109 | EmployerSearchMethodOtherFreeForm | How did you find your current job? - Some othe... | CodingWorker-NC |
271 | JobSearchResource | Which resource has been the best for finding d... | Learners |
272 | JobSearchResourceFreeForm | Which resource has been the best for finding d... | Learners |
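The lookup above returns four rows because `str.contains` treats its pattern as a regular expression by default, so `|` means "either one". A minimal sketch on a toy schema follows; the column names are the ones from the output above plus `Age`, and the frame itself is an assumption standing in for schema.csv.

```python
import pandas as pd

# Toy stand-in for schema.csv, holding only the Column field.
schema = pd.DataFrame({
    "Column": [
        "EmployerSearchMethod",
        "EmployerSearchMethodOtherFreeForm",
        "JobSearchResource",
        "JobSearchResourceFreeForm",
        "Age",
    ],
})

# With the default regex=True, "|" matches either alternative as a substring.
pattern = "JobSearchResource|EmployerSearchMethod"
matched = schema.loc[schema["Column"].str.contains(pattern), "Column"].tolist()
print(matched)

# With regex=False the pattern is one literal string, so nothing matches here.
literal = schema["Column"].str.contains(pattern, regex=False).sum()
print(literal)
```

Substring matching is also why the two FreeForm columns come back alongside the two main ones.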
In [16]:
plt.title("Best Places to look for a Data Science Job")
sns.countplot(y='JobSearchResource', data=mcq)
Out[16]:
<AxesSubplot:title={'center':'Best Places to look for a Data Science Job'}, xlabel='count', ylabel='JobSearchResource'>
In [17]:
plt.title("Top Places to get Data Science Jobs")
sns.countplot(y='EmployerSearchMethod', data=mcq)
Out[17]:
<AxesSubplot:title={'center':'Top Places to get Data Science Jobs'}, xlabel='count', ylabel='EmployerSearchMethod'>
What about Korean respondents?
In [18]:
korea = mcq.loc[(mcq['Country']=='South Korea')]
plt.title("Best Places to look for a Data Science Job")
sns.countplot(y='JobSearchResource', data=korea)
Out[18]:
<AxesSubplot:title={'center':'Best Places to look for a Data Science Job'}, xlabel='count', ylabel='JobSearchResource'>
In [19]:
plt.title("Top Places to get Data Science Jobs")
sns.countplot(y='EmployerSearchMethod', data=korea)
Out[19]:
<AxesSubplot:title={'center':'Top Places to get Data Science Jobs'}, xlabel='count', ylabel='EmployerSearchMethod'>
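Count plots for a country subset are hard to compare with the full survey because the subset is much smaller. One option is to compare shares instead of counts. The sketch below uses made-up rows and placeholder answer labels ("Resource A", "Resource B"), so both the data and the resulting numbers are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the survey (rows and answer labels are made up).
toy = pd.DataFrame({
    "Country": [
        "South Korea", "South Korea", "India", "India", "India", "United States",
    ],
    "JobSearchResource": [
        "Resource A", "Resource B", "Resource B", "Resource B", "Resource A", None,
    ],
})

korea = toy.loc[toy["Country"] == "South Korea"]

# Shares of each answer, overall and within the subset, side by side.
comparison = pd.DataFrame({
    "all": toy["JobSearchResource"].value_counts(normalize=True),
    "korea": korea["JobSearchResource"].value_counts(normalize=True),
})
print(comparison)
```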