Data science, also known as data-driven decision-making, is an
interdisciplinary field that uses scientific methods, processes, and systems to
extract knowledge from data in various forms and to make decisions based on that
knowledge. A data scientist should be evaluated not only on knowledge of
machine learning but also on expertise in statistics.
I will start from the very basics of data science and then slowly move to the
expert level. So let’s get started.
We’ve broken the interview questions for data scientists
into six different categories: statistics, programming, modeling, behavior,
culture, and problem-solving.
Statistics
Programming
General
Big Data
Python
R
SQL
Modeling
Behavioral
Culture Fit
Problem-Solving
Programming
To test your programming skills, employers will typically
ask two kinds of data science interview questions: how you
would solve programming problems in theory, without writing out the code, and
whiteboarding exercises for you to code on the spot.
For the latter type of question, we will provide a few examples below, but if
you’re looking for in-depth practice solving coding challenges, visit HackerRank.
It follows a “learn by doing” philosophy, with challenges organized around core
concepts commonly tested during interviews.
General
With which programming languages and environments are you
most comfortable working?
What are some pros and cons about your favorite statistical
software?
Tell me about an original algorithm you’ve created.
Describe a data science project in which you worked with a
substantial programming component. What did you learn from that experience?
Do you contribute to any open-source projects?
How would you clean a data set in (insert language here)?
Tell me about the coding you did during your last project.
Big Data
What are two main components of the Hadoop framework?
The Hadoop Distributed File System (HDFS) for storage and MapReduce for
processing. YARN, which handles resource management, is usually listed as a
third core component.
Explain how MapReduce works as simply as possible.
“MapReduce is a programming model that enables distributed
processing of large data sets on compute clusters of commodity hardware. Hadoop
MapReduce first performs mapping which involves splitting a large file into
pieces to make another set of data.” The reduce step then combines the
intermediate results that share a key into a smaller set of output values.
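The map and reduce steps can be sketched on a single machine as a word count. This is a minimal illustration in plain Python, not Hadoop's API: the function names and the input chunks are hypothetical, and a real cluster would run the map calls in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input, already split into chunks as a cluster would do.
chunks = ["big data big clusters", "data beats opinions"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```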
How would you sort a large list of numbers?
Say you’re given a large data set. What would be your plan
for dealing with outliers?
How about missing values? How about transformations?
Python or R – Which one would you prefer for text analytics?
A common answer is Python, because its pandas library
provides easy-to-use data structures and high-performance
data analysis tools.
What is Linear Regression?
Linear regression is a statistical technique where the score
of a variable Y is predicted from the score of a second variable X. X is
referred to as the predictor variable and Y as the criterion variable.
What is Interpolation and Extrapolation?
Interpolation is estimating a value that lies between two known values in a
list of values. Extrapolation is approximating a value by extending a known
set of values or facts beyond its observed range.
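The difference can be shown with a straight line through two known points. This is a minimal sketch assuming linear behavior between and beyond the points; the helper name and the data are hypothetical.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Hypothetical known points: (x, y) pairs sorted by x.
known = [(1, 10.0), (2, 20.0), (4, 40.0)]

# Interpolation: x = 3 lies between the known points x = 2 and x = 4.
interpolated = linear_estimate(*known[1], *known[2], 3)

# Extrapolation: x = 6 lies beyond the last known point, so the line
# through the last two points is extended past the observed range.
extrapolated = linear_estimate(*known[1], *known[2], 6)

print(interpolated, extrapolated)
```

Extrapolated values are generally less trustworthy, since nothing in the data confirms that the trend continues outside the observed range.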
What is power analysis?
An experimental design technique for determining the sample size needed to
detect an effect of a given size with a given degree of confidence.
What is K-means? How can you select K for K-means?
What is Collaborative filtering?
The filtering process used by most recommender
systems to find patterns or information by combining the viewpoints of many
users, various data sources, and multiple agents.
How can you make data normal using a Box-Cox transformation?
What is the difference between Supervised Learning and
Unsupervised Learning?
If an algorithm learns from labeled training data, so
that the learned mapping can be applied to unseen test data, it is referred to as
Supervised Learning. Classification is an example of Supervised Learning. If
the algorithm has no response variable (labels) to learn from and must find
structure in the data on its own, it is referred to as Unsupervised Learning.
Clustering is an example of Unsupervised Learning.
Explain the use of Combinatorics in data science.
Why is vectorization considered a powerful method for optimizing numerical code?
What is the goal of A/B Testing?
It is statistical hypothesis testing for a randomized
experiment with two variants, A and B. The goal of A/B Testing is to identify
which changes to a web page maximize or increase an outcome of interest.
An example is comparing the click-through rate of two versions of a banner
ad.
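One common way to analyze such an experiment is a two-proportion z-test on the click-through rates. The sketch below assumes large samples (so the normal approximation holds) and uses made-up click counts; the function name is hypothetical.

```python
import math

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

# Hypothetical counts: banner A vs. banner B, 10,000 views each.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between A and B is unlikely to be due to chance alone.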
What is an Eigenvalue and Eigenvector?
Eigenvectors are used for understanding linear
transformations. In data analysis, we usually calculate the eigenvectors for a
correlation or covariance matrix. Eigenvectors are the directions along which a
particular linear transformation acts by flipping, compressing or stretching.
The eigenvalue is the strength of the transformation in the
direction of the eigenvector, that is, the factor by which the stretching or
compression occurs.
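As an illustration, the largest eigenvalue and its eigenvector can be approximated with power iteration. This is a minimal sketch in plain Python, assuming a small symmetric matrix and a starting vector that is not orthogonal to the dominant eigenvector; the matrix is a made-up covariance matrix, and in practice a library routine would normally be used.

```python
import math

def power_iteration(matrix, steps=100):
    """Approximate the dominant eigenvalue and eigenvector of a square matrix."""
    n = len(matrix)
    vector = [1.0] + [0.0] * (n - 1)  # arbitrary non-zero starting direction
    for _ in range(steps):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(value * value for value in product))
        vector = [value / norm for value in product]
    # Rayleigh quotient v^T A v gives the eigenvalue for a unit-length v.
    transformed = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * t for v, t in zip(vector, transformed))
    return eigenvalue, vector

# Hypothetical 2x2 covariance matrix of two positively correlated variables.
covariance = [[2.0, 1.0], [1.0, 2.0]]
eigenvalue, eigenvector = power_iteration(covariance)
print(eigenvalue, eigenvector)
```

Here the eigenvector points along the diagonal direction in which the two variables vary together, and the eigenvalue measures how much variance lies along it.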
What is Gradient Descent?
How can outlier values be treated?
Outlier values can be identified by using univariate or any
other graphical analysis method. If there are only a few outlier values,
they can be assessed individually, but for a large number of outliers the values
can be substituted with either the 99th or the 1st percentile values. Not all
extreme values are outlier values. The most common ways to treat outlier
values are:
To change the value and bring it within a range.
To just remove the value.
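Substituting extreme values with the 1st and 99th percentile values can be sketched with the standard library. The function name and the data are hypothetical, and `statistics.quantiles` is used with its default interpolation method, which is only one of several percentile definitions.

```python
import statistics

def cap_outliers(values):
    """Replace values outside the 1st-99th percentile range with those bounds."""
    cut_points = statistics.quantiles(values, n=100)  # 99 percentile cut points
    low, high = cut_points[0], cut_points[-1]
    capped = [min(max(value, low), high) for value in values]
    return capped, low, high

# Hypothetical data: the integers 1..199 plus one extreme value.
data = list(range(1, 200)) + [10_000]
capped, low, high = cap_outliers(data)
print(low, high, max(capped))
```

Capping keeps every row in the data set, whereas removing the value shrinks the sample; which is preferable depends on whether the extreme value is an error or a genuine observation.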
What is Regularization and what kind of problems does
regularization solve?
What is multicollinearity and how can you overcome it?
What is the curse of dimensionality?
How do you decide whether your linear regression model fits the data?
What is the difference between squared error and absolute
error?
What is Machine Learning?
The simplest way to answer this question is: we give the
data and an equation to the machine, and ask the machine to look at the data and
identify the coefficient values in the equation.
For example, for the linear regression y=mx+c, we give the
data for the variables x and y, and the machine learns the values of m and c
from the data.
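For the y=mx+c example, "learning" m and c can be sketched with the ordinary least-squares formulas. The function name and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of m and c in y = m*x + c."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    m = covariance / variance
    c = mean_y - m * mean_x
    return m, c

# Hypothetical data generated from y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, c = fit_line(xs, ys)
print(m, c)
```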
How are confidence intervals constructed and how will you interpret them?
How will you explain logistic regression to an economist,
a physician scientist, and a biologist?
How can you overcome Overfitting?
Differentiate between wide and tall data formats.
Is Naïve Bayes bad? If yes, in what respects?
How would you develop a model to identify plagiarism?
How will you define the number of clusters in a clustering
algorithm?
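Though the clustering algorithm is not specified, this
question will mostly be asked in reference to K-Means clustering, where “K”
defines the number of clusters. The objective of clustering is to group similar
entities in a way that the entities within a group are similar to each other
but the groups are different from each other. A common way to choose K is the
elbow method: compute the within-cluster sum of squares for a range of K values
and pick the K beyond which it stops dropping sharply.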
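One widely used heuristic for picking K, the elbow method, compares the within-cluster sum of squares (WSS) across several values of K. The sketch below is a toy one-dimensional K-Means with a deterministic start, written only for illustration; the function name, the data, and the initialization scheme are assumptions, and real projects would normally use a library implementation.

```python
import statistics

def kmeans_1d(points, k, steps=20):
    """Run Lloyd's algorithm on 1-D data; return the within-cluster sum of squares."""
    ordered = sorted(points)
    # Deterministic start: spread the k initial centers evenly over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(steps):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [statistics.fmean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sum((p - centers[i]) ** 2 for i, c in enumerate(clusters) for p in c)

# Hypothetical data with three visible groups around 1, 5, and 9.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
wss = {k: kmeans_1d(points, k) for k in range(1, 5)}
for k, value in wss.items():
    print(k, round(value, 2))
```

The WSS falls steeply up to K = 3 and barely changes afterwards, so the "elbow" suggests three clusters for this data.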
SQL
Often, SQL questions are case-based, meaning that an
employer will task you with solving an SQL problem in order to test your skills
from a practical standpoint. For example, you could be given a table and asked
to extract relevant data, then filter and order the data as you see fit, and
finally report your findings. If you do not feel ready to do this in an
interview setting, Mode Analytics has a delightful introduction to using SQL
that will teach you these commands through an interactive SQL environment.
What is the purpose of the group functions in SQL?
Give some examples of group functions.
Group functions are necessary to get summary statistics of a
data set. COUNT, MAX, MIN, AVG, and SUM are all group functions. DISTINCT is
not a group function itself, but it can be combined with one, as in
COUNT(DISTINCT column).
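Group functions can be tried out with Python's built-in sqlite3 module. The table and its rows below are hypothetical; the query itself is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10.0), ("ann", 30.0), ("bob", 5.0)],
)

# One summary row per customer: count, total, average, smallest, largest.
summary = conn.execute(
    """
    SELECT customer, COUNT(*), SUM(amount), AVG(amount), MIN(amount), MAX(amount)
    FROM orders
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()
print(summary)
```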
Tell me the difference between an inner join, left
join/right join, and union.
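In a Venn diagram, the inner join returns only the rows that have a
match in both tables. A left join returns every row from the left table, with
NULLs in the right table's columns where there is no match; a right join is the
opposite of a left join; and a full join returns all of the rows from both
tables. A union, by contrast, stacks the results of two queries with the same
columns instead of matching rows across tables.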
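The difference between an inner join and a left join shows up as soon as one table has rows without a match. This sketch uses Python's built-in sqlite3 module with two hypothetical tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, item TEXT);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (1, 'book'), (3, 'lamp');
    """
)

# Inner join: only customers that have a matching order.
inner_rows = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "INNER JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()

# Left join: every customer; item is NULL (None) where no order matches.
left_rows = conn.execute(
    "SELECT c.name, o.item FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.id"
).fetchall()
print(inner_rows, left_rows)
```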
What does UNION do? What is the difference between UNION and
UNION ALL?
“UNION removes duplicate records (where all columns in the
results are the same), UNION ALL does not.”
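A quick way to see the difference is to run both operators on two small tables that share one value. The tables here are hypothetical and the example uses Python's built-in sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
    """
)

# UNION drops the duplicate 2; UNION ALL keeps both copies.
union_rows = conn.execute(
    "SELECT x FROM a UNION SELECT x FROM b ORDER BY x"
).fetchall()
union_all_rows = conn.execute(
    "SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x"
).fetchall()
print(union_rows, union_all_rows)
```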
What is the difference between SQL and MySQL or SQL Server?
“SQL stands for Structured Query Language. It’s a standard
language for accessing and manipulating databases. MySQL is a database
management system, like SQL Server, Oracle, Informix, Postgres, etc.”
If a table contains duplicate rows, does a query result
display the duplicate values by default?
How can you eliminate duplicate rows from a query result?
Yes. One way to eliminate duplicate rows is to use the
DISTINCT keyword in the SELECT clause.
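Both behaviors can be checked with Python's built-in sqlite3 module; the table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE visits (city TEXT);
    INSERT INTO visits VALUES ('paris'), ('paris'), ('rome');
    """
)

# A plain SELECT returns duplicates; SELECT DISTINCT removes them.
all_rows = conn.execute("SELECT city FROM visits ORDER BY city").fetchall()
distinct_rows = conn.execute(
    "SELECT DISTINCT city FROM visits ORDER BY city"
).fetchall()
print(all_rows, distinct_rows)
```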