Ularsing t1_isvms80 wrote on October 19, 2022 at 12:55 AM

I feel compelled to add here that Pandas has an absolutely dogshit API plagued by breaking changes and bastardizations of R code. It's the best package for what it does, but it leaves a lot to be desired. Trying to prioritize Pandas knowledge reads like someone trying to hire based on their omniscient understanding of the field that they gained from their coding bootcamp.

marr75 t1_isxougd wrote on October 19, 2022 at 1:45 PM

I read a good blog post from a guy talking about how modern IDEs encourage you to learn really weird "motions" (using PyCharm's refactor, codegen, and code completion mid-stream, for example). He wasn't saying it was bad per se, just that we should all remember the point isn't to be "good" at the IDE, it's to solve problems with the code.

I feel the same about pandas. If anything, the skill to focus on is vectorizing your operations. That's the biggest readability and performance improvement and it's portable to dplyr, polars, etc.
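To make "vectorizing" concrete, here's a minimal sketch. The frame and its column names (`price`, `qty`) are made up for illustration; the point is just the contrast between looping over rows and operating on whole columns.

```python
import pandas as pd

# Hypothetical data: the column names are assumptions for this example.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Row-by-row version: works, but is slow and verbose on large frames.
totals_loop = [row.price * row.qty for row in df.itertuples(index=False)]

# Vectorized version: a single expression over whole columns.
totals_vec = df["price"] * df["qty"]
```

Both give the same numbers; the second form is the habit that tends to carry over to dplyr, polars, and similar tools.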

chief167 t1_istrptp wrote on October 18, 2022 at 5:19 PM

As someone who sometimes has to hire people, perhaps this is the issue:

Imagine how difficult it is for big companies to get an MLOps framework going, with all the red tape and scattered IT systems. It was very painful where I work. In the end we got something working using a Python platform that really needs you to use pandas and sklearn type interfaces.

Let's hypothetically say you are a great data scientist using R, or SAS or MATLAB or ... If I don't have a lot of options I'd hire you and put you on a training program for our framework. But if I have multiple decent candidates, and some don't require retraining, yeah I'm gonna pick one of them. I am not spending 2 months trying to get compliance and cybersec to approve your docker container with R code in it, if I can have a similar model in our pre-approved workflow.

SkinnyJoshPeck t1_isu0ui7 wrote on October 18, 2022 at 6:19 PM

I hear ya; I think the point is less about proficiency and more about mastery -- in my case, I was marked down heavily since I didn't use iloc. Something like

df[df.col < 10]
vs
df[df.iloc[:, 0] < 10]

because I guess it makes it more clear to the reader, and it protects the code from explicit column names; the fact that I didn't use it made me seem like I didn't know pandas well.

to your point, though, I see the importance in the infrastructure. In this case, it was for an ml scientist role where I wouldn't actually be doing any of the MLOps, just designing and tuning the models.

phb07jm t1_isudd5v wrote on October 18, 2022 at 7:39 PM

Can someone please explain why the second is preferable? I would always do the first because it's more likely that the position of a column will change than the name.
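A quick sketch of that concern, using a made-up two-column frame (the names `col` and `other` are assumptions, not from the interview): both filters agree while `col` is the first column, and only the name-based one survives a reorder.

```python
import pandas as pd

# Hypothetical frame: "col" happens to be the first column.
df = pd.DataFrame({"col": [5, 15, 8], "other": [100, 200, 300]})

# While "col" sits at position 0, both filters select the same rows.
by_name = df[df["col"] < 10]
by_position = df[df.iloc[:, 0] < 10]

# Same data with the columns reordered, e.g. after an upstream change.
reordered = df[["other", "col"]]

# The name-based filter still targets "col".
by_name_after = reordered[reordered["col"] < 10]

# The positional filter now silently tests "other" instead.
by_position_after = reordered[reordered.iloc[:, 0] < 10]
```

After the reorder the positional version raises no error; it just filters on the wrong column and comes back empty here.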

monkeyunited t1_isufl8u wrote on October 18, 2022 at 7:53 PM

Yep, you’re right.

It’s not preferred.

silvershadow t1_isurezr wrote on October 18, 2022 at 9:08 PM

Change the iloc to a loc and then I would maybe see the argument.

Assigning through a single .loc or .iloc call is guaranteed to act on the original data frame, while chained [] indexing can in some cases act on a copy. Pandas makes no promises on what you get.

So depending on what the full expression was, the criticism of using [] indexing could make sense. You’d need to see the full context of what OP was writing though.

From the sounds of what they wrote though, this is not the thinking the interviewer was following.
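For reference, a minimal sketch of the case where .loc does matter, which is assignment rather than plain filtering. The frame and the `flag` column are made up for illustration.

```python
import pandas as pd

# Hypothetical frame; "flag" is an assumed column used only for this example.
df = pd.DataFrame({"col": [5, 15, 8], "flag": [False, False, False]})

# One indexing call with a row mask and a column label:
# the assignment lands in df itself.
df.loc[df["col"] < 10, "flag"] = True

# Chained form, left as a comment on purpose:
#   df[df["col"] < 10]["flag"] = True
# The first [] builds an intermediate object, so the write may never reach df.
```

For a read-only filter like the one OP described, none of this applies.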

chief167 t1_isuculx wrote on October 18, 2022 at 7:36 PM

Ok yeah well that's stupid. Because I am actually in favour of column names instead of indexes. Indexes are a pain in the ass when your incoming dataframe changes, it creates an implicit dependency.

But your last line is my point. You shouldn't be concerned about MLOps stuff, but if your model is already in the right framework, it saves soooo much time

monkeyunited t1_isufjc7 wrote on October 18, 2022 at 7:53 PM

That’s dumb and violates the “explicit is better than implicit” rule.

[deleted] t1_isyguhg wrote on October 19, 2022 at 4:55 PM

[deleted]

AutumnStar t1_isuy0el wrote on October 18, 2022 at 9:53 PM

I agree with the gist of your comment, but FYI, model selection matters a lot for many different reasons. You have no idea how many people I’ve interviewed who just want to use Neural Nets or XGBoost every time. Or people who couldn’t tell me any advantages/disadvantages for any algorithm.

I tend to look for people who can critically think well. That’s the hardest skill to find in any DS. They should have some experience and competence in coding, obviously, but realistically almost everything else can be taught more easily.

[deleted] t1_isxh39q wrote on October 19, 2022 at 12:42 PM