Data science Interview Questions in 2019 for Freshers

Data science, also known as data-driven decision making, is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge from data in various forms and to make decisions based on that knowledge. A data scientist should be evaluated not only on knowledge of machine learning but also on expertise in statistics. I will start from the basics of data science and then move gradually to the expert level. So let’s get started.

We’ve broken the interview questions for data scientists into six different categories: statistics, programming, modeling, behavior, culture, and problem-solving.

Statistics

Programming

General

Big Data

Python

SQL

Modeling

Behavioral

Culture Fit

Problem-Solving

Programming

To test your programming skills, employers will typically include two kinds of data science interview questions: they’ll ask how you would solve programming problems in theory without writing out the code, and they will also offer whiteboarding exercises for you to code on the spot. For the latter type of question, we provide a few examples below, but if you’re looking for in-depth practice solving coding challenges, visit HackerRank. It follows a “learn by doing” philosophy, with challenges organized around core concepts commonly tested during interviews.

General

With which programming languages and environments are you most comfortable working?

What are some pros and cons about your favorite statistical software?

Tell me about an original algorithm you’ve created.

Describe a data science project in which you worked with a substantial programming component. What did you learn from that experience?

Do you contribute to any open-source projects?

How would you clean a data set in (insert language here)?

Tell me about the coding you did during your last project.

Big Data

What are the main components of the Hadoop framework?

The Hadoop Distributed File System (HDFS) for storage, MapReduce for processing, and YARN for resource management.

Explain how MapReduce works as simply as possible.

“MapReduce is a programming model that enables distributed processing of large data sets on compute clusters of commodity hardware. Hadoop MapReduce first performs mapping, which involves splitting a large file into pieces to make another set of data.” A reduce step then combines those intermediate pieces into the final result.
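
The idea can be sketched in a single Python process with a word count. This is only an illustration, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase) and the sample chunks are made up for the example, and a real cluster would run the map and reduce work on different machines.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input, already split into chunks the way a cluster would split a file.
chunks = ["big data is big", "data is everywhere"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```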

How would you sort a large list of numbers?

Say you’re given a large data set. What would be your plan for dealing with outliers?

How about missing values? How about transformations?

Python or R – Which one would you prefer for text analytics?

A strong answer is Python, because its pandas library provides easy-to-use data structures and high-performance data analysis tools.

What is Linear Regression?

Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.

What is Interpolation and Extrapolation?

Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value by extending a known set of values or facts beyond the known range.
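
A small sketch of the difference, assuming the simplest case of a straight line through two known points. The helper name linear_estimate and the sample points are invented for the illustration; real data may call for other interpolation methods.

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Hypothetical known points: (1, 10) and (3, 30).
interpolated = linear_estimate(1, 10, 3, 30, 2)  # x = 2 lies between the known points
extrapolated = linear_estimate(1, 10, 3, 30, 5)  # x = 5 lies beyond the known points
print(interpolated, extrapolated)
```

The same formula serves both cases; only the position of x relative to the known points changes, and extrapolated values are generally less trustworthy the further x moves from the data.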

What is power analysis?

An experimental design technique for determining the sample size required to detect an effect of a given size with a given degree of confidence.
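
Power analysis is most often run to find the sample size needed to detect an effect. The sketch below assumes a two-sided comparison of two group means and uses the normal approximation, so it can come out slightly lower than a t-test based calculation. The function name and default values are illustrative choices.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group sample size for comparing two means.

    effect_size is the standardized difference (difference in means divided
    by the common standard deviation). Uses the normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n_medium = sample_size_per_group(0.5)  # a "medium" standardized effect
n_small = sample_size_per_group(0.2)   # a smaller effect needs many more samples
print(n_medium, n_small)
```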

What is K-means? How can you select K for K-means?

What is Collaborative filtering?

Collaborative filtering is the process used by most recommender systems to find patterns or information by combining viewpoints, data sources, and multiple agents.

How can you make data normal using the Box-Cox transformation?

What is the difference between Supervised Learning and Unsupervised Learning?

If an algorithm learns from labeled training data so that the knowledge can be applied to the test data, it is referred to as supervised learning. Classification is an example of supervised learning. If the algorithm is given no response variable or labels to learn from and must find structure in the data itself, it is referred to as unsupervised learning. Clustering is an example of unsupervised learning.

Explain the use of Combinatorics in data science.

Why is vectorization considered a powerful method for optimizing numerical code?

What is the goal of A/B Testing?

A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify which changes to a web page maximize or increase an outcome of interest. An example is comparing the click-through rate of two versions of a banner ad.
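
One common way to compare click-through rates is a two-proportion z-test. The sketch below is one such test, not the only valid choice; the function name and the click and view counts are made up for the example.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference between two click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: banner A vs. banner B, 10,000 views each.
z_score, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z_score:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference in click-through rate is unlikely to be due to chance alone, under the assumptions of the test.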

What is an Eigenvalue and Eigenvector?

Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
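
For a concrete feel, the sketch below computes the eigenvalues and eigenvectors of a symmetric 2x2 matrix in closed form and checks that A v = λ v. The matrix values are a made-up covariance matrix and the function name is illustrative; for larger matrices a numerical library would normally be used.

```python
import math

def eigen_symmetric_2x2(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], assuming b != 0."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    pairs = []
    for eigenvalue in (mean + radius, mean - radius):
        # (b, eigenvalue - a) solves (A - eigenvalue * I) v = 0 when b != 0.
        vx, vy = b, eigenvalue - a
        norm = math.hypot(vx, vy)
        pairs.append((eigenvalue, (vx / norm, vy / norm)))
    return pairs

# Hypothetical covariance matrix [[2, 1], [1, 2]].
a, b, d = 2.0, 1.0, 2.0
eigenpairs = eigen_symmetric_2x2(a, b, d)
for eigenvalue, (vx, vy) in eigenpairs:
    # Multiplying by the matrix only stretches the eigenvector by the eigenvalue.
    transformed = (a * vx + b * vy, b * vx + d * vy)
    print(eigenvalue, (vx, vy), transformed)
```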

What is Gradient Descent?

How can outlier values be treated?

Outlier values can be identified with univariate analysis or another graphical analysis method. If there are only a few outliers, they can be assessed individually; if there are many, the values can be substituted with either the 99th or the 1st percentile value. Not all extreme values are outliers. The most common ways to treat outlier values are:

To change the value and bring it within a range.

To just remove the value.
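
Bringing values within a range can be sketched as capping at the 1st and 99th percentiles. The helper names and the sample data are invented for the illustration, and the nearest-rank percentile used here is only one of several percentile conventions.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Replace values outside the chosen percentiles with the percentile values."""
    ordered = sorted(values)
    low = percentile(ordered, lower_pct)
    high = percentile(ordered, upper_pct)
    return [min(max(v, low), high) for v in values]

# Hypothetical data: the numbers 1..100 plus one extreme value.
data = list(range(1, 101)) + [1000]
capped = cap_outliers(data)
print(min(capped), max(capped))
```

Capping keeps every row in the data set, whereas removing the value shrinks it; which is preferable depends on whether the extreme value is an error or a genuine observation.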

What is Regularization and what kind of problems does regularization solve?

What is multicollinearity and how can you overcome it?

What is the curse of dimensionality?

How do you decide whether your linear regression model fits the data?

What is the difference between squared error and absolute error?

What is Machine Learning?

The simplest way to answer this question: we give the machine the data and the form of an equation, and ask it to work out the coefficient values in that equation from the data.

For example, for the linear regression y = mx + c, we give the data for the variables x and y, and the machine learns the values of m and c from the data.
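
A minimal sketch of that learning step, using ordinary least squares for a single predictor. The function name fit_line and the sample data are made up for the example; the data was generated from y = 2x + 1 so the recovered coefficients are easy to check.

```python
def fit_line(xs, ys):
    """Least-squares estimates of m and c in y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Hypothetical data generated from y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, c = fit_line(xs, ys)
print(m, c)
```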

How are confidence intervals constructed and how will you interpret them?

How will you explain logistic regression to an economist, a physical scientist, and a biologist?

How can you overcome Overfitting?

Differentiate between wide and tall data formats.

Is Naïve Bayes bad? If yes, in what respects?

How would you develop a model to identify plagiarism?

How will you define the number of clusters in a clustering algorithm?

Though the Clustering Algorithm is not specified, this question will mostly be asked in reference to K-Means clustering where “K” defines the number of clusters. The objective of clustering is to group similar entities in a way that the entities within a group are similar to each other but the groups are different from each other.
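
One common heuristic for picking K, not named in the text above, is the elbow method: run K-means for several values of K and look for the point where the within-cluster sum of squares stops dropping sharply. The sketch below is a tiny one-dimensional K-means written for illustration only; the function name, the deterministic starting centroids, and the sample points are all assumptions of the example.

```python
def kmeans_1d_inertia(points, k, iterations=20):
    """Run a tiny 1-D K-means and return the within-cluster sum of squares."""
    pts = sorted(points)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return sum(min((p - c) ** 2 for c in centroids) for p in pts)

# Hypothetical data with three visible groups around 1, 5, and 9.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
inertia_by_k = {k: kmeans_1d_inertia(points, k) for k in range(1, 6)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```

On this data the drop in the sum of squares is large up to K = 3 and small afterwards, which is the "elbow" that suggests three clusters.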

SQL

Often, SQL questions are case-based, meaning that an employer will task you with solving an SQL problem in order to test your skills from a practical standpoint. For example, you could be given a table and asked to extract relevant data, then filter and order the data as you see fit, and finally report your findings. If you do not feel ready to do this in an interview setting, Mode Analytics has a delightful introduction to using SQL that will teach you these commands through an interactive SQL environment.

What is the purpose of the group functions in SQL?

Give some examples of group functions.

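Group functions are necessary to get summary statistics of a data set. COUNT, MAX, MIN, AVG, and SUM are all group functions. DISTINCT is a keyword rather than a group function, although it can be used inside one, as in COUNT(DISTINCT column).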
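
A short sketch of group functions in action, run against an in-memory SQLite database from Python. The orders table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10.0), ("ann", 30.0), ("bob", 5.0)],  # hypothetical sample rows
)

# One summary row per customer, built from the group functions.
summary = conn.execute(
    "SELECT customer, COUNT(*), MIN(amount), MAX(amount), AVG(amount), SUM(amount) "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
for row in summary:
    print(row)
```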

Tell me the difference between an inner join, left join/right join, and union.

In Venn diagram terms, an inner join returns only the rows that have a match in both tables. A left join returns every row from the left table, with NULLs in the right table’s columns where there is no match, and a right join is the opposite of a left join. A full join returns all of the data from both tables combined.
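
The difference between an inner join and a left join can be seen on two small tables. The sketch uses an in-memory SQLite database from Python; the employees and departments tables and their rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
    INSERT INTO employees VALUES ('ann', 1), ('bob', 2), ('cy', NULL);
    INSERT INTO departments VALUES (1, 'data'), (3, 'ops');
""")

# Inner join: only employees whose dept_id has a match in departments.
inner_rows = conn.execute(
    "SELECT e.name, d.dept_name FROM employees e "
    "INNER JOIN departments d ON e.dept_id = d.dept_id ORDER BY e.name"
).fetchall()

# Left join: every employee, with NULL (None in Python) where there is no match.
left_rows = conn.execute(
    "SELECT e.name, d.dept_name FROM employees e "
    "LEFT JOIN departments d ON e.dept_id = d.dept_id ORDER BY e.name"
).fetchall()

print(inner_rows)
print(left_rows)
```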

What does UNION do? What is the difference between UNION and UNION ALL?

“UNION removes duplicate records (where all columns in the results are the same), UNION ALL does not.”
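
A quick check of that behavior with an in-memory SQLite database from Python. The table names q1 and q2 and the city values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE q1 (city TEXT);
    CREATE TABLE q2 (city TEXT);
    INSERT INTO q1 VALUES ('Pune'), ('Delhi');
    INSERT INTO q2 VALUES ('Delhi'), ('Mumbai');
""")

# UNION drops the duplicate 'Delhi'; UNION ALL keeps both copies.
union_rows = conn.execute(
    "SELECT city FROM q1 UNION SELECT city FROM q2 ORDER BY city"
).fetchall()
union_all_rows = conn.execute(
    "SELECT city FROM q1 UNION ALL SELECT city FROM q2 ORDER BY city"
).fetchall()

print(union_rows)
print(union_all_rows)
```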

What is the difference between SQL and MySQL or SQL Server?

“SQL stands for Structured Query Language. It’s a standard language for accessing and manipulating databases. MySQL is a database management system, like SQL Server, Oracle, Informix, Postgres, etc.”

If a table contains duplicate rows, does a query result display the duplicate values by default?

How can you eliminate duplicate rows from a query result?

Yes. One way to eliminate duplicate rows is to use the DISTINCT keyword.
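
A brief sketch of both behaviors, using an in-memory SQLite database from Python. The visits table and its rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?)",
    [("ann",), ("ann",), ("bob",)],  # hypothetical rows with one duplicate
)

# By default the duplicate row is returned as well.
all_rows = conn.execute("SELECT visitor FROM visits ORDER BY visitor").fetchall()
# DISTINCT collapses identical rows into one.
distinct_rows = conn.execute(
    "SELECT DISTINCT visitor FROM visits ORDER BY visitor"
).fetchall()

print(all_rows)
print(distinct_rows)
```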

Knowing which interview questions to prepare for is just one part of the interview process. Learn step by step everything you need to know, not only to land an interview but also to ace the data science interview, at onlineitguru.com.
