A detailed account of a walmartlabs interview question from back then. - Statistics board - 未名 (MITBBS) archive


s****e  Posts: 1180  1
This was their second-round interview, I believe. They mainly asked regression questions:

We use regression to analyze a data set with two explanatory variables, and it works fine. If we change the data set and increase the number of variables to 5 or 6, will the regression method still work? I.e., is the regression method robust to the data set? How do you know the regression method is robust?

We have many explanatory variables, such as search keywords like "Christmas lights" from Google AD, and we want to use these variables to predict profit. Profit is a continuous variable; what model shall we use? The explanatory variable matrix is very sparse; in this situation, what model shall we use? How about categorizing the explanatory variables first? Then what's next?

Take a look, everyone. Interviews in the US, especially technical ones, do test real material. If you are a PhD who has spent a few years on research, it is best to set aside a few months before interviewing to go back over what you learned earlier. What they ask is not as hard as PhD research, but it is still worth reviewing. I have also interviewed people in industry; before an interview I basically pick some problems I ran into at work and ask about those. I have set fairly hard interview questions. Some quite strong candidates failed my interview and later went on to big companies. So if you fail one or two interviews, don't lose heart. The offer will come. Keep at it.

s*r  Posts: 2757  2
How would one answer this: "the explanatory variable matrix is very sparse, under this situation, what model shall we use? how about we categorize the explanatory variables first?" In my field we have two ways of handling data like this: one is collapsing variables, the other is kernel regression.

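s******t  Posts: 71  3
I feel that questions like this are hard to answer if you have never dealt with them... If I wanted to study this on my own, what kind of book would cover it?
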
m********t  Posts: 94  4
Good question. How should the first part be answered? Regression with 5 or 6 variables should not be much of a problem, so what is the point being tested? Something like regularization or PCA?

And the second? This one is a bit hard, and the response is continuous as well, so I can only think of regression. If asked on the spot I could probably only improvise something: group the variables, handling the high-frequency ones individually and clustering or segmenting the low-frequency ones.

[In reply to s****e]

m********t  Posts: 94  5
Impressive... I searched a bit but could not make sense of it. Could you roughly explain these two methods?

[In reply to s*r]

s*r  Posts: 2757  6
These are methods from genetics for handling sequencing variants; they probably have little to do with what you work on.

[In reply to m********t]

a*z  Posts: 294  7
Could the experts weigh in? Great question. Many thanks.

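m********t  Posts: 94  8
I looked it up a little. collapse is a Stata command, right? In essence it is just aggregation? I don't understand what kernel regression really is, but is the idea that the regression has no fixed functional form and is simply fit to the data? Could you expand on this a bit? Actually I am more curious about how collapsing is done concretely.

[In reply to s*r]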
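Neither term is spelled out in the thread. As a minimal sketch, assuming "collapsing" means merging rarely seen indicator variables into one aggregate count, and "kernel regression" means the Nadaraya-Watson estimator (a locally weighted average with no fixed functional form), the two ideas might look like this in plain Python. The function names, the `__rare__` label, and the `min_count` threshold are illustrative choices, not something the posters specified.

```python
import math
from collections import Counter


def collapse_rare(rows, min_count):
    """Merge indicators seen in fewer than min_count rows into one '__rare__' count.

    rows: list of sets; each set holds the feature names present in that row.
    Returns a list of dicts mapping feature name -> value.
    """
    freq = Counter(name for row in rows for name in row)
    collapsed = []
    for row in rows:
        out = {}
        for name in row:
            if freq[name] >= min_count:
                out[name] = 1  # frequent feature keeps its own column
            else:
                out["__rare__"] = out.get("__rare__", 0) + 1  # burden-style count
        collapsed.append(out)
    return collapsed


def nadaraya_watson(x_train, y_train, x0, bandwidth):
    """Kernel regression estimate at x0: Gaussian-weighted average of y_train."""
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in x_train]
    return sum(w * y for w, y in zip(weights, y_train)) / sum(weights)


rows = [{"christmas lights"}, {"christmas lights", "led strip"}, {"xmas tree"}]
print(collapse_rare(rows, min_count=2))
print(nadaraya_watson([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], x0=1.0, bandwidth=1.0))
```

Collapsing shrinks the number of columns at the cost of treating all rare features as interchangeable; the kernel estimator avoids choosing a functional form but needs a bandwidth, which is usually tuned on held-out data.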

l******n  Posts: 9344  9
The first question is unclear: what does "still work" mean? That needs to be put back to the person asking. Generally, with more variables you can in theory explain more of the variance, but performance also depends on the properties of the measure you are looking at. If the question is whether the model is robust: the two-variable model being robust means that extreme values of those two variables do not strongly affect the result, but there is no guarantee that the other variables behave the same way. You can look at the specific form of your estimator (robust estimators basically drop outliers or give them very little weight) and at how the new variables affect it. Or you can fit on different data yourself and look at how the coefficients change to decide whether it is robust.

For the second, the simplest idea is to transform the explanatory variable matrix into a block-diagonal one. This is basically a linear transformation, and there are many ways to do it. You can also view it as making the variables blockwise orthogonal after the transformation, which makes the later computation much simpler. PCA, clustering and the like all amount to much the same thing.

[In reply to s****e]

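The suggestion in post 9 of fitting on different data and watching how the coefficients move can be sketched as a bootstrap: resample the rows, refit, and report the range of each coefficient. This is only one way to read that advice. The helper names, the number of resamples, and the use of ordinary least squares through the normal equations are assumptions made for illustration.

```python
import random


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def bootstrap_coef_ranges(X, y, n_boot=200, seed=0):
    """Refit on resampled rows; return (min, max) of each coefficient."""
    rng = random.Random(seed)
    n = len(X)
    fits = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        fits.append(ols([X[i] for i in idx], [y[i] for i in idx]))
    return [(min(f[j] for f in fits), max(f[j] for f in fits))
            for j in range(len(X[0]))]


# Intercept column plus one predictor; y = 1 + 2x + noise.
noise = random.Random(42)
X = [[1.0, float(i)] for i in range(30)]
y = [1.0 + 2.0 * i + noise.gauss(0.0, 1.0) for i in range(30)]
print(bootstrap_coef_ranges(X, y))
```

Wide ranges, or ranges that change sign, hint that the fit is sensitive to which rows happen to be in the sample. A resample that produces a singular design would fail here with a division by zero; a production version would need to guard against that.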
m********t  Posts: 94  10
I am an outsider griping here, but the second question is surely not as simple as a matrix transformation. If these are search terms, the scale is not hundreds or thousands but tens or hundreds of thousands. I don't know whether ordinary regression can still handle a matrix of that size. Also, can PCA scale to something this large? I thought of clustering too, but that means splitting the variables into groups and handling each group; I never thought of solving everything in one go.

[In reply to l******n]

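O********9  Posts: 59  11
Let me try to answer from a machine learning perspective.

Question 1: if the regression model is used for prediction, then the prediction error first decreases and then increases as the number of input variables grows. In other words, more input variables is not always better; there is an optimum. That optimum usually depends on the problem itself and can be found by cross validation. This is the performance side. On the robustness side, the added variables may be highly correlated with the existing ones. In the extreme case the input matrix does not have full column rank, so the regression model has infinitely many solutions (the collinearity problem). Even when the input matrix has full column rank, if the condition number of the Gram matrix is too large, the confidence intervals of the regression coefficients will be very wide and the model will not be robust. The remedy is to remove the correlation in the input matrix before fitting, through feature selection or PCA.

Question 2: it depends on how you encode "Christmas lights". You can define a variable x with x = 1 if the keyword appears and x = 0 otherwise. Models suited to this kind of discrete variable include, but are not limited to, naive Bayes, regression trees, and any tree-based model. Or you can define x as the frequency of the keyword; then you can use linear regression, logistic regression, and so on.

"variable matrix is very sparse" is not well defined. If the interviewer means that there are many input variables but most are unrelated to the output, then sparsity-promoting techniques are worth considering, including lp-norm regularization (p <= 1) or Bayesian models with sparsity-promoting priors. Problems of this kind have fast solvers and are suitable for large-scale data.
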
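Two points from post 11 lend themselves to a small sketch: measuring collinearity through the condition number of the Gram matrix, and the p = 1 case of lp-norm regularization (the lasso), which sets irrelevant coefficients exactly to zero. The code below is an illustration under simplifying assumptions: the condition number uses the closed-form eigenvalues of a 2 by 2 Gram matrix, so it only covers two predictors, and the lasso is fit by plain coordinate descent on the objective (1/(2n))·||y − Xb||² + lam·||b||₁. All function names are hypothetical.

```python
import math


def gram_condition_number(x1, x2):
    """Condition number of the 2x2 Gram matrix of columns x1 and x2."""
    a = sum(v * v for v in x1)
    d = sum(v * v for v in x2)
    b = sum(u * v for u, v in zip(x1, x2))
    mid = (a + d) / 2.0
    half_gap = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    lam_min, lam_max = mid - half_gap, mid + half_gap
    # A zero smallest eigenvalue means the columns are exactly collinear.
    return math.inf if lam_min <= 0.0 else lam_max / lam_min


def soft_threshold(v, lam):
    """Shrink v toward zero by lam; values within [-lam, lam] become 0."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Lasso coefficients by cyclic coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    resid = list(y)  # residual y - X @ coef, kept up to date
    for _ in range(n_sweeps):
        for j in range(p):
            z = sum(row[j] ** 2 for row in X) / n
            if z == 0.0:
                continue  # all-zero column carries no information
            rho = sum(row[j] * (r + row[j] * coef[j])
                      for row, r in zip(X, resid)) / n
            new = soft_threshold(rho, lam) / z
            delta = new - coef[j]
            if delta != 0.0:
                resid = [r - row[j] * delta for row, r in zip(X, resid)]
                coef[j] = new
    return coef


X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [3.0, 3.0, -3.0, -3.0]  # depends on the first column only
print(gram_condition_number([r[0] for r in X], [r[1] for r in X]))
print(lasso_cd(X, y, lam=0.5))
```

In this toy design the two columns are orthogonal, so the condition number is 1 and the lasso shrinks the relevant coefficient while zeroing the irrelevant one. The penalty strength `lam` would normally be chosen by cross validation, as the post suggests for the number of variables.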
l******n  Posts: 9344  12
That is the idea; the details have to be worked out case by case. Actually a matrix with millions of dimensions can be handled too, though of course that takes a supercomputer.

[In reply to m********t]
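How much hardware a keyword matrix of that size needs depends heavily on how it is stored. As a rough sketch, assuming each observation contains only a handful of keywords, the matrix can be kept as one small dictionary per row so that memory and arithmetic scale with the number of nonzero entries rather than with rows times columns. The representation and function names below are illustrative assumptions; nothing in the thread says how the interviewer's data was stored.

```python
def build_sparse_rows(keyword_lists):
    """Assign each distinct keyword a column index; store rows as {column: 1.0}."""
    vocab = {}
    rows = []
    for keywords in keyword_lists:
        row = {}
        for kw in keywords:
            col = vocab.setdefault(kw, len(vocab))  # new keyword gets next index
            row[col] = 1.0
        rows.append(row)
    return vocab, rows


def predict(rows, coef):
    """Compute X @ coef touching only the nonzero entries of each row."""
    return [sum(v * coef[c] for c, v in row.items()) for row in rows]


def xt_times(rows, vec, n_cols):
    """Compute X' @ vec, again visiting only nonzero entries."""
    out = [0.0] * n_cols
    for row, r in zip(rows, vec):
        for c, v in row.items():
            out[c] += v * r
    return out


vocab, rows = build_sparse_rows(
    [["christmas lights"], ["led strip", "christmas lights"], ["xmas tree"]]
)
print(vocab)
print(predict(rows, [1.0, 2.0, 3.0]))
print(xt_times(rows, [1.0, 1.0, 1.0], len(vocab)))
```

These two products are the building blocks of iterative solvers such as gradient or coordinate descent, which is one reason sparse regression problems with very many columns can often be fit without ever forming a dense matrix.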
