DataX Practice Questions (V1)

Dive into practice questions

Question 1

A data scientist built a model to predict machine failures that occur infrequently but result in large amounts of downtime for users. When the company conducts preventive checks on machines that are not expected to fail soon, the company's labor cost increases. The additional scheduled maintenance sessions also result in lower customer satisfaction scores and discourage many customers from purchasing newer versions of the machine or renewing warranty contracts. Which of the following is the main objective in this scenario?

A. Labor cost

B. Downtime

C. Client retention

D. Service quality

Question 2

A data scientist is looking at a distribution of a continuous variable but is unable to draw any conclusions from a histogram of the data. Which of the following should the data scientist do to get a better visual summary?

A. Increase the number of bins.

B. Lower the range of values.

C. Collect additional data.

D. Add high-contrast coloring.

Question 3

Which of the following is the best method to handle data imbalance?

A. PCA

B. SMOTE

C. Binomial logistic regression

D. DBSCAN

Question 4

A data scientist is building a model that predicts electricity usage for consumers based on the size of the residence in square feet, age in years, and type (e.g., apartment, single-family residence, or boat). Which of the following is the most appropriate way to handle outliers?

A. Selecting a non-linear model

B. Dropping the outlier values

C. Normalizing the data set

D. Performing hyperparameter tuning

Question 5

A data scientist is trying to predict customer churn. After conducting a literature review, the data scientist identifies several potential models that were successful in similar contexts. Which of the following is the most appropriate next step in selecting a model design for iteration?

A. Implement all models that were identified in the literature review and choose the one with the highest accuracy on the test set.

B. Select the most recently published model from the literature review since it is likely the most current and effective.

C. Develop a baseline model and repeatedly implement and compare more complex models, considering interpretability requirements.

D. Choose the model with the most citations in the literature review, as it is the most reliable and accepted.

Question 6

A data science team has a data set consisting of engagements that sales personnel had with customers. Each engagement includes the set of actions taken by the salesperson and whether the customer made a purchase. Which of the following is the best way to find correlations between different sets of actions and customer purchasing behavior?

A. KNN

B. Cluster analysis

C. Feature importance chart

D. Association rules

Question 7

A data scientist is analyzing house prices and observes the following distribution:

Price range        Number of houses
100,000–200,000    50
200,001–300,000    30
300,001–400,000    15
400,001–500,000    3
500,001–600,000    1
600,001–700,000    1

Which of the following techniques should the data scientist apply to make the data more normally distributed?

A. Box-Cox transformation

B. Principal component analysis

C. Min-max scaling

D. One-hot encoding

Question 8

A data scientist completes an ML project predicting customer churn and wants to document the process for future use. Which of the following should the data scientist include in the documentation?

A. The customer purchase history used in the model

B. The model's performance metrics on the test set

C. The source code of the model implementation

D. The description of each feature used in the model

Answer key

Question 1: C (Client retention)
Question 2: A (Increase the number of bins)
Question 3: B (SMOTE)
Question 4: C (Normalizing the data set)
Question 5: C (Develop a baseline model and repeatedly implement and compare more complex models, considering interpretability requirements)
Question 6: D (Association rules)
Question 7: A (Box-Cox transformation)
Question 8: D (The description of each feature used in the model)
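
Worked example for Question 7

The keyed answer to Question 7 can be checked numerically. The sketch below is illustrative only and rests on assumptions that are not in the question: each house is placed at the midpoint of its price bin (the question gives only binned counts), prices are expressed in units of 100,000, and the Box-Cox parameter lambda is chosen by a coarse grid search over [-3, 3] on the usual normal profile log-likelihood. The helper names (`box_cox`, `log_likelihood`, `skewness`) are hypothetical and written only for this example.

```python
import math
from statistics import fmean

# (bin midpoint in units of 100,000, number of houses) from Question 7.
# ASSUMPTION: every house sits at its bin midpoint; the real prices are unknown.
BINS = [(1.5, 50), (2.5, 30), (3.5, 15), (4.5, 3), (5.5, 1), (6.5, 1)]
prices = [mid for mid, count in BINS for _ in range(count)]


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def log_likelihood(values, lam):
    """Normal profile log-likelihood of lambda (additive constants dropped)."""
    y = box_cox(values, lam)
    mean = fmean(y)
    var = fmean([(t - mean) ** 2 for t in y])
    n = len(values)
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)


def skewness(values):
    """Population skewness: third central moment over variance**1.5."""
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    m3 = fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Coarse grid search for lambda in [-3, 3] with step 0.01 (an assumed range).
grid = [i / 100 for i in range(-300, 301)]
best_lambda = max(grid, key=lambda lam: log_likelihood(prices, lam))
transformed = box_cox(prices, best_lambda)

print("lambda chosen by grid search:", best_lambda)
print("skewness before:", skewness(prices))
print("skewness after: ", skewness(transformed))
```

The printed skewness should be noticeably closer to zero after the transform, which is what "more normally distributed" means for right-skewed data like this. In practice, a library routine such as `scipy.stats.boxcox` would typically replace the hand-written grid search.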