
Compiled some answers to ML interview boilerplate questions (1)


I've been preparing for interviews recently and compiled some answers to ML interview boilerplate questions (the questions come from the three boilerplate threads on this forum). It may read better if you copy it into a markdown file. I'll post the remaining parts later. Hoping for some rice (forum points)!

# ML Basics (1)
1. What do overfitting and underfitting mean?
   - Underfitting: the machine learning model is too simple to capture the underlying patterns in the data, and performs poorly on both the training data and new, unseen data
     - training and validation error are both high
     - fixes: use a more complex model, add more features
   - Overfitting: the machine learning model is too complex and starts to memorize the training data instead of learning generalizable patterns.
     - training error significantly lower than the validation error / the model performs poorly on new data (see the diagnostic sketch at the end of this section)
     - reduce the complexity of the model
     - regularize the model (L1/L2 regularization, dropout; use cross-validation to choose the best model)
     - collect more training data
2. What is the bias/variance trade-off?
   - Bias: the difference between the predicted value and the expectation of the real data. Bias arises when the model oversimplifies the underlying patterns in the data and makes strong assumptions. This can lead to underfitting, where the model fails to capture the true relationships between the features and the target variable.
   - Variance: measures how spread out the predicted values are around their expected value. A model with high variance is sensitive to the specific data points and memorizes noise or outliers. This can lead to overfitting.
   - Trade-off: low-variance models tend to be less complex, with simple structure, which can lead to high bias. Low-bias models tend to be more complex, with flexible underlying structure, which can lead to high variance. Decreasing one component often increases the other, so achieving low bias and low variance simultaneously is challenging. The goal is to find the right balance between bias and variance for optimal model performance.
3. What are the common ways to prevent overfitting?
   - reduce the complexity of the model
   - regularize the model (L1/L2 regularization, dropout; use cross-validation to choose the best model)
   - early stopping
   - collect more training data
   - data augmentation
4. Given a set of ground truths and 2 models, how can you be confident that one model is better than the other? (Model selection)
    - Evaluation metrics
    - Cross-validation (split the data into multiple folds, train each model on the same folds, test on the held-out folds, and compare average performance; see the comparison sketch after this list)
    - Hypothesis testing / A/B testing
    - Domain expertise
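
Diagnostic sketch for Q1/Q3: a minimal, hypothetical example (synthetic data; scikit-learn assumed available) showing how comparing training and validation error exposes underfitting and overfitting as model complexity grows.

```python
# Sketch: diagnose under/overfitting by comparing train vs. validation error.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # noisy non-linear target
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # too simple / about right / too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    va = mean_squared_error(y_val, model.predict(X_val))
    # underfitting: both errors high; overfitting: train error << validation error
    print(f"degree={degree:2d}  train MSE={tr:.3f}  val MSE={va:.3f}")
```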
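Model-comparison sketch for Q4: a hedged example, assuming scikit-learn and SciPy; the dataset, the two model choices, and the 5-fold setup are illustrative. It scores both models on identical cross-validation folds and applies a paired t-test to the fold scores (a rough check with only 5 folds, not a definitive verdict).

```python
# Sketch: compare two models on identical CV folds, then a paired t-test.
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # same folds for both models

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)  # paired over folds
print(f"A={scores_a.mean():.3f}  B={scores_b.mean():.3f}  p={p_value:.3f}")
```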
## Regression
1. What are the basic assumptions of linear regression?
    - Linearity: there is a linear relationship between the independent variables (X) and the dependent variable (y)
    - Independence: Independence assumes that there is no relationship or correlation between the errors (residuals) of different observations.
    - Normality: The residuals of the linear regression model are assumed to be normally distributed.
    - Homoscedasticity: Homoscedasticity assumes that the variability of the errors (residuals) is constant across all levels of the independent variables.
    - No Multicollinearity between features
2. What happens when we have correlated variables, and how do we handle it?
    - Consequences of correlated variables: unstable coefficient estimates, unreliable significance tests, and difficulty interpreting the individual contributions of the correlated variables (see the VIF sketch after this list)
    - Fixes: feature selection, ridge regression, PCA…
3. Explain the regression coefficients
    Coefficients represent the change in the dependent variable associated with a one-unit change in the corresponding independent variable, while holding other variables constant. It is essential to note that the interpretation of regression coefficients should be done with caution and within the context of the specific regression model and dataset.
4. What is the relationship between minimizing squared error and maximizing the likelihood?
    - In linear regression, when the assumption of Gaussian errors holds, minimizing the squared error is equivalent to maximizing the likelihood of the observed data. This connection arises because the squared error can be derived from the likelihood function assuming Gaussian errors.
    - In cases where the assumption of Gaussian errors is not appropriate, such as when dealing with non-Gaussian or heteroscedastic errors, the relationship between minimizing squared error and maximizing likelihood might not hold.
5. How could you minimize the inter-correlation between variables with Linear Regression?
    - Feature Selection
    - PCA
    - Ridge Regression
    - Feature Engineering
6. If the relationship between y and x is not linear, can linear regression solve it?
    - Simple linear regression may not accurately capture the underlying relationship.
    - Fixes:
        - interaction terms
        - piecewise linear regression
        - non-linear regression
7. Why use interaction variables?
   - Capture non-additive effects (see the interaction sketch after this list)
   - Improved model fit
   - Context-specific relationships
   - Avoiding omitted-variable bias
   - Enhanced interpretability
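
Collinearity sketch for Q2/Q5: a hypothetical example (statsmodels assumed available; the synthetic features are illustrative) that builds two nearly collinear predictors and flags them with pairwise correlations and variance inflation factors (VIF).

```python
# Sketch: detect correlated predictors with correlations and VIF.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

print(np.corrcoef(X, rowvar=False).round(2))  # pairwise correlations

Xc = sm.add_constant(X)  # add intercept column before computing VIFs
for i in range(1, Xc.shape[1]):
    # rule of thumb: VIF well above 10 signals problematic collinearity
    print(f"VIF x{i}: {variance_inflation_factor(Xc, i):.1f}")
```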
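Interaction sketch for Q7: a hedged example on synthetic data (the coefficients 2, 3, 5 are arbitrary assumptions) showing that a plain linear model misses a non-additive x1*x2 effect, while adding the explicit interaction feature recovers it.

```python
# Sketch: an interaction term lets a linear model capture a non-additive effect.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
y = 2 * x1 + 3 * x2 + 5 * x1 * x2 + rng.normal(scale=0.1, size=1000)

X_plain = np.column_stack([x1, x2])
X_inter = np.column_stack([x1, x2, x1 * x2])  # explicit interaction feature

print(LinearRegression().fit(X_plain, y).score(X_plain, y))  # poor R^2
print(LinearRegression().fit(X_inter, y).score(X_inter, y))  # close to 1.0
```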
## Regularization
1. L1 vs L2 **regularization**: which is which, and what is the difference?
    - Add a term of L1 norm of the parameters in the loss function (sum of absolute values)
    - Add a term of L2 norm of the parameters in the loss function ($||\beta||_2 = (\sum \beta_i^2)^{1/2}$)
2. Lasso Regression
   - Least Absolute Shrinkage and Selection Operator
   - Introduces an additional penalty term based on the absolute values of the coefficients, L1 norm of the coefficients
   - objective: find the value of the coefficients that minimize the sum of the squared differences between the predicted values and the actual values, while also minimizing the L1 regularization term
   - $L=|| \hat{y} - y ||_2^2 + \lambda || \beta ||_1$,
   - where $\hat{y} = f_{\beta}(x)$
   - Lasso regression can shrink the coefficients towards zero; when $\lambda$ is sufficiently large, some coefficients are driven exactly to zero, which is useful for feature selection (see the sparsity sketch after this list)
3. Ridge Regression
   - Linear Regression with L2 Regularization
   - $L = ||\hat{y} - y||_2^2 + \lambda||\beta||_2^2$
   - Higher values of $\lambda$ result in more aggressive shrinkage of the coefficient estimates.
4. Why does L1 give sparser solutions than L2?
   - The L1 norm has corners at zero, while the L2 norm is smooth and continuously differentiable
   - The L1 penalty creates diamond-shaped constraint regions in the coefficient space, centered at the origin. The optimization therefore tends to land on a vertex of the diamond, driving some coefficients exactly to zero and producing sparsity. The L2 constraint region is a ball, so the optimum usually lands at a point where all coefficients are non-zero.
5. Why does regularization work?
    Regularization works by introducing a penalty term into the objective function of a machine learning model. This penalty term encourages the model to have certain desirable properties, such as simplicity, sparsity, or smoothness (it adds constraints on the coefficients), which reduces variance.
6. Why does regularization use L1 and L2, rather than L3, L4, ...?
    - Mathematical properties: the L1 and L2 norms have well-studied mathematical properties that make them particularly useful for regularization, aligning with the goals of reducing model complexity, handling multicollinearity, and identifying important features
    - Computational simplicity: higher-order norms introduce additional computational complexity without providing significant advantages over the L1 and L2 norms
    - Interpretability
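
Sparsity sketch for Q2–Q4: a hedged comparison (scikit-learn assumed; the dataset and alpha value are illustrative) showing Lasso zeroing out uninformative coefficients while Ridge merely shrinks them.

```python
# Sketch: L1 (Lasso) produces exact zeros; L2 (Ridge) only shrinks.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso zero coefs:", int(np.sum(lasso.coef_ == 0)))  # several exact zeros
print("ridge zero coefs:", int(np.sum(ridge.coef_ == 0)))  # typically none
```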
## Metrics
1. precision and recall, trade-off
    - Precision measures how many of the **positively predicted instances are actually true positives**:
        - true positives / (true positives + false positives)
        - Precision focuses on the quality of the positive predictions; high precision means **a low number of false positives**.
    - Recall measures how many of the actual positive instances are correctly identified:
        - true positives / (true positives + false negatives)
        - Recall emphasizes the completeness of the positive predictions; high recall means a **low number of false negatives.**
    - Trade-off:
        - improving one metric often decreases the other
        - high precision, low recall: the model is tuned to prioritize precision and is more conservative in predicting positive instances. This results in a **low number of false positives but may miss some true positives, lowering recall.**
        - low precision, high recall: the model is more liberal in predicting positive instances, which yields a **high number of true positives but may also generate more false positives, reducing precision.**
        - Which balance to choose depends on the consequences of false positives and false negatives,
        - and on the desired balance between avoiding misclassification errors and capturing all relevant positive instances.
        - It is important to consider precision and recall together and select the appropriate balance (see the metrics sketch after this list)
2. Which metrics should be used when labels are imbalanced?
    - Precision and recall
    - F1-score (the harmonic mean of precision and recall) provides a balanced evaluation metric for imbalanced datasets
    - Area Under the Precision-Recall Curve (AUPRC): useful when the focus is on the positive class; robust to class imbalance
    - Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). The ROC curve plots the true positive rate (recall) against the false positive rate at different classification thresholds. The AUC is the area under the ROC curve; it is a widely used metric that quantifies the model's discriminative power and is suitable for imbalanced datasets
3. Which metric should be chosen for a classification problem, and why?
    - Understand the problem: identify the importance of correctly classifying each class and whether there is class imbalance in the dataset
    - Define evaluation goals: do false positives and false negatives have different impacts? Decide whether the emphasis is on overall accuracy, precision or recall, or a balanced trade-off
    - Account for class imbalance
    - Use domain knowledge
    - Consider multiple metrics
4. confusion matrix
    A confusion matrix is a table that summarizes the performance of a classification model by showing the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for each class. By examining the values in the confusion matrix, you can gain insights into different performance aspects of the classification model, such as accuracy, precision, recall, and F1-score.
5. true positive rate, false positive rate, ROC (for binary classification)
    - **True Positive Rate (TPR) or Sensitivity or Recall**: The TPR measures the proportion of actual positive instances that are correctly classified as positive by the classifier. It is calculated as the ratio of true positives (TP) to the sum of true positives and false negatives (FN). **TPR = TP / (TP + FN) = TP / (all actual positive examples).** The TPR indicates the classifier's ability to correctly identify positive instances from the actual positive class; a higher TPR suggests better sensitivity or recall.
    - **False Positive Rate (FPR):** The FPR measures the proportion of actual negative instances that are incorrectly classified as positive by the classifier. It is calculated as the ratio of false positives (FP) to the sum of false positives and true negatives (TN). **FPR = FP / (FP + TN) = FP / (all actual negative examples).**
    - **Receiver Operating Characteristic (ROC) Curve**: plots the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds
6. How do you interpret AUC? (for binary classification)
    - The AUC represents the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds.
    - It represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance (see the AUC sketch after this list).
    - The AUC value ranges from 0 to 1. A model with an AUC of 0.5 performs no better than random guessing, as the ROC curve coincides with the diagonal line connecting (0,0) and (1,1). A perfect classifier achieves an AUC of 1, as it can perfectly separate positive and negative instances.
7. Ranking metrics (reference implementations are sketched after this list)
    - **Mean reciprocal rank (MRR)**: This metric measures the quality of the model by considering the rank of the first relevant item in each output list produced by the model, and then averaging them.
        - $$MRR = \frac{1}{m} \sum_{i=1}^m \frac{1}{\text{rank}_i}$$
        - shortcoming: it only considers the first relevant item and ignores the other relevant items in the list, so it does not measure the precision or ranking quality of the whole ranked list.
    - **Recall@k:** This metric measures **the ratio between the number of relevant items** **in the output list** and **the total number of relevant items available in the entire dataset**. The formula is
        - $$\text{recall@}k = \frac{\text{number of relevant items among the top $k$ items in the output list}}{\text{total number of relevant items}}$$
        - measures what fraction of all relevant items the model managed to include in the output list
        - shortcoming: in some systems, the total number of relevant items can be very high, which hurts recall because the denominator is very large. For example, if we want to find images close to a query image of a dog, the database may contain millions of dog images. The goal is not to return every dog image but to retrieve a handful of the most similar ones.
    - **Precision@k:** measures the **proportion** of **relevant items among the top k items in the output list**. The formula is:
        - $$\text{precision@}k = \frac{\text{number of relevant items among the top $k$ items in the output list}}{k}$$
        - measures **how precise the output lists are**, but **it doesn't consider the ranking quality**.
    - **Average Precision (AP):** averages precision@k over the positions k at which a relevant item appears. AP is high if more relevant items are located at the top of the list.
    - **mAP**: first computes the average precision (AP) for each output list, and then averages the AP values.
        - mAP is designed for binary relevances; in other words, it works well when each item is either relevant or irrelevant. For continuous relevance scores, nDCG is a better choice.
    - **Normalized discounted cumulative gain (nDCG)**
        - DCG calculates the cumulative gain of items in a list by summing up the relevance score of each item, discounted by its position:
        - $$\text{DCG}_p = \sum_{i=1}^p \frac{rel_i}{\log_2(i+1)}$$
        - **nDCG divides the DCG by the DCG of an ideal ranking (IDCG). The formula is:**
        - $$\text{nDCG}_p = \frac{\text{DCG}_p}{\text{IDCG}_p}$$
        - Its primary shortcoming is that **deriving ground-truth relevance scores is not always possible**.
8. Recommender System Metrics
    - **Precision@k: the proportion of relevant content among the top k recommended items**
    - MRR: focuses on the rank of the first relevant item in the list; suitable in systems where only one relevant item is expected
    - mAP: the mean of AP over all recommendation lists; measures the ranking quality of the recommended items. mAP works only when the relevance scores are binary (if items are either relevant or irrelevant, mAP is a good fit)
    - nDCG: for when the relevance score between a user and an item is non-binary (mAP handles the relevant-vs-irrelevant case; nDCG handles the how-relevant case)
    - Diversity: measures how dissimilar the recommended videos are to each other. This metric is important to track, as users are more interested in diversified videos. To measure diversity, we calculate the average pairwise similarity (e.g., cosine similarity or dot product) between videos in the list. A low average pairwise similarity score indicates the list is diverse.

    mAP, MRR, and nDCG are commonly used to measure ranking quality.
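
Metrics sketch for Q1/Q4: a minimal, hypothetical example using scikit-learn to pull the confusion-matrix counts and the precision/recall/F1 scores from toy labels (the labels are illustrative assumptions).

```python
# Sketch: confusion matrix and the precision/recall/F1 trio on toy labels.
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print("precision:", precision_score(y_true, y_pred))  # tp / (tp + fp)
print("recall:   ", recall_score(y_true, y_pred))     # tp / (tp + fn)
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of the two
```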
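AUC sketch for Q6: a hedged demonstration on synthetic scores (the score-generating scheme is an assumption) that `roc_auc_score` matches the pairwise-ranking interpretation, i.e. the probability that a random positive outscores a random negative, with ties counted as half.

```python
# Sketch: AUC == P(score of a random positive > score of a random negative).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)
y_score = y_true + rng.normal(scale=1.0, size=2000)  # positives score higher on average

auc = roc_auc_score(y_true, y_score)

pos, neg = y_score[y_true == 1], y_score[y_true == 0]
diff = pos[:, None] - neg[None, :]                   # all positive-negative pairs
rank_prob = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)
print(f"roc_auc_score={auc:.4f}  pairwise estimate={rank_prob:.4f}")  # they match
```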
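Ranking-metrics sketch for Q7: hedged, from-scratch implementations of the formulas above (MRR, precision@k, recall@k, DCG/nDCG); the toy relevance lists are illustrative assumptions.

```python
# Sketch: ranking metrics implemented directly from their formulas.
import numpy as np

def mrr(ranked_lists):
    """Mean reciprocal rank; each list holds binary relevance in rank order."""
    recip = []
    for rels in ranked_lists:
        hits = np.flatnonzero(rels)
        recip.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(recip))

def precision_at_k(rels, k):
    return float(np.sum(rels[:k]) / k)

def recall_at_k(rels, k, total_relevant):
    return float(np.sum(rels[:k]) / total_relevant)

def dcg(scores):
    scores = np.asarray(scores, dtype=float)
    # rel_i / log2(i + 1) with positions i = 1..p
    return float(np.sum(scores / np.log2(np.arange(2, scores.size + 2))))

def ndcg(scores):
    ideal = dcg(sorted(scores, reverse=True))  # DCG of the ideal reordering
    return dcg(scores) / ideal if ideal > 0 else 0.0

rels = np.array([0, 1, 0, 1, 1])  # binary relevance of one ranked list
print(mrr([rels]))                                       # first hit at rank 2 -> 0.5
print(precision_at_k(rels, 3), recall_at_k(rels, 3, 4))  # 1/3 and 1/4
print(round(ndcg([3, 2, 3, 0, 1]), 4))                   # graded relevance example
```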

Update (2023-06-06 03:17 +08:00):

Link to part 2: https://www.1point3acres.com/bbs...98257&page=1&extra=#pid18627800

Update (2023-06-06 11:51 +08:00):

The link still seems to be under review...


Elvira (OP) | 2023-6-6 02:56:18
Felomeng wrote on 2023-6-5 05:56:
It's odd that usually 90% of candidates can't answer questions this basic in interviews.

They are indeed all quite basic, but a lot of them just never come up at work. I knew all of these before I graduated; honestly, after a few years of working I've forgotten most of them.

Take these home and patch your gaps, everyone!!!

奚仲 | 2023-6-5 23:43:26
These all look basic, but put them in a real problem scenario, pick seven or eight at random, and you can screen out more than 50% of junior DS/RS candidates who made it past the HR screen with decent-looking resumes.

Based on a few dozen data points, I'd say.

赤赤菌 | 2023-6-11 03:21:00
Felomeng wrote on 2023-6-5 09:56:
It's odd that usually 90% of candidates can't answer questions this basic in interviews.

Totally normal. Say you work on clustering algorithms every day and go into an interview without any prep, and the team you're interviewing with does classification and quizzes you on AUC, confusion matrices, and the like: you may well feel you ought to know them, yet find the concepts have gone fuzzy.
That's why when I interview people I ask about the projects on their resume, rather than quizzing them on the few concepts I know best. Everyone has a bachelor's degree these days; if it's something a candidate can recall within two minutes of looking at it, I won't use it to screen them out.

Felomeng | 2023-6-5 21:56:28
It's odd that usually 90% of candidates can't answer questions this basic in interviews.

Nice, looking forward to the next part.

Elvira (OP) | 2023-6-6 02:57:30
奚仲 wrote on 2023-6-5 07:43:
These all look basic, but put them in a real problem scenario, pick seven or eight at random, and you can screen out more than 50% of junior DS/RS candidates who made it past the HR screen with ...

Yeah, true. Though I've actually almost never been asked these. Maybe I just haven't done enough interviews, haha.

buxiang996 | 2023-6-6 03:02:46
Looking forward to the follow-up!

ybl330 | 2023-6-6 04:35:55
Thanks for sharing~ but the link to part 2 doesn't seem to open.

Elvira (OP) | 2023-6-6 05:15:45
ybl330 wrote on 2023-6-5 12:35:
Thanks for sharing~ but the link to part 2 doesn't seem to open.

It's probably still under review.


Mubing540 | 2023-6-11 02:45:36
奚仲 wrote on 2023-6-5 08:43:
These all look basic, but put them in a real problem scenario, pick seven or eight at random, and you can screen out more than 50% of junior DS/RS candidates who made it past the HR screen with ...

What would a non-basic question look like? Examples, please.
