Got hired as a data steward, feel like I'm incredibly lagging behind

Today, I finally got access to their databases. I was assigned the task of checking some P&L reports from finance. I can't help but feel that my skills are a bit inadequate. First, I've never worked with databases this huge and this complex. Their sheer scope and complexity make me feel a bit overwhelmed; it's difficult to muddle through the nomenclature and figure out what stems from what.

Everyone at a new company feels this way to some extent, even if they have prior work experience. You won't internalize the company's data landscape until you start working on projects or questions that come to you. "Go check these P&L reports in the database" is a classic task given to somebody who just got their database access. A manager gives you this task because they don't WANT to overwhelm you. It shouldn't surprise you or anybody at your company that you've never seen such huge and complex databases. No course in SQL or any other data domain can FULLY prepare you for a real company's data. This is all natural and should be expected.

One more thing: you said you feel your skills are inadequate, but I strongly disagree. Your SKILLS are likely fully adequate. Your current KNOWLEDGE is what's lacking. But you, your manager, and everyone else should be fully aware that you just got your data access TODAY. Nobody can possibly expect you to understand your company's data on Day 1 of having access to it. And the good news is that in the process of expanding your KNOWLEDGE you will use your SKILLS, so you will LEARN how to APPLY your SKILLS to your KNOWLEDGE.

All I had previously done was some exercises of various complexity. I know how to use window functions and how to write fairly simple, non-recursive CTEs, but I've never used COALESCE, 1=1, or other tricks like these. As I was learning, I used PostgreSQL and pgAdmin 4, but the guys I work for use MS SQL. Granted, the differences are not that big, but MS SQL Management Studio is something I'm not used to. I hoped there would be more Python-ic things like pandas and NumPy, but it feels like I'll be muddling through cumbersome SQL queries and tables trying to find where this or that discrepancy stems from.

Okay:

-COALESCE(): Takes a list of arguments and outputs the first non-NULL argument in the list. You can use it in various ways, but the way I've used it in ~90% of cases is when a column can either have a value or be NULL and I want the NULL to be 0, for instance COALESCE(transaction_amount, 0). For my use cases this situation comes up when left joining, since rows with no match on the right side come back as NULL. You can also use it with multiple columns, for instance COALESCE(col1, col2, col3) AS col4. As before, it will output the first non-NULL value among the arguments, checked left to right.

-1=1: This one's a little trickier and more of a stylistic question. The way I've always seen it used is essentially in place of a CROSS JOIN: you have table A and table B and you want every row of A paired with every row of B. For this task I would just use a CROSS JOIN, but some people do a LEFT JOIN or INNER JOIN ON 1=1 instead. Because 1=1 is always true, every pair of rows satisfies the join condition. And on occasion you'll run into a query where someone less experienced added 1=1 to a LEFT JOIN that already had real join columns, where the 1=1 was completely unneeded.
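
If you want to poke at both of these yourself, here's a minimal sketch. It is not anything from a real company schema: the tables (customers, transactions, periods) and their columns are made up for illustration, and it runs the SQL through Python's built-in sqlite3 module as a stand-in for MS SQL so you can try it anywhere. COALESCE, LEFT JOIN, CROSS JOIN, and ON 1=1 are standard SQL, so the queries should carry over to SQL Server, but treat that as an assumption and check it against your own tables.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MS SQL.
# All table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE transactions (customer_id INTEGER, transaction_amount REAL);
CREATE TABLE periods (period TEXT);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO transactions VALUES (1, 25.0);
INSERT INTO periods VALUES ('2024-01'), ('2024-02');
""")

# COALESCE after a LEFT JOIN: Bob has no transaction row, so his
# transaction_amount would come back as NULL; COALESCE turns it into 0.
coalesce_rows = cur.execute("""
    SELECT c.name, COALESCE(t.transaction_amount, 0) AS transaction_amount
    FROM customers AS c
    LEFT JOIN transactions AS t ON t.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

# COALESCE with several arguments: the first non-NULL one wins.
first_non_null = cur.execute("SELECT COALESCE(NULL, 'b', 'c')").fetchone()[0]

# Every customer paired with every period, written as a CROSS JOIN...
cross_rows = cur.execute("""
    SELECT c.name, p.period
    FROM customers AS c
    CROSS JOIN periods AS p
    ORDER BY c.customer_id, p.period
""").fetchall()

# ...and the same result written as INNER JOIN ... ON 1=1.
one_eq_one_rows = cur.execute("""
    SELECT c.name, p.period
    FROM customers AS c
    INNER JOIN periods AS p ON 1=1
    ORDER BY c.customer_id, p.period
""").fetchall()

print(coalesce_rows)
print(first_non_null)
print(cross_rows == one_eq_one_rows, len(cross_rows))

conn.close()
```

With 2 customers and 2 periods, both join versions give 4 rows, which is why I'd just write CROSS JOIN and make the intent obvious.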

See? You can learn those sorts of things in like 5 minutes but it takes EXPERIENCE to know when to actually use them. And experience takes time...and you literally just started. So stop beating yourself up over it. It's just practice.

As far as Python goes, I feel that preferences on SQL vs. pandas/other Python tools strongly depend on how somebody started their career. My first real job was actually using MSSQL Studio, so I've always had a preference towards SQL, but certain things are easier in pandas. You may find that after working on real-world problems using SQL you start to prefer it as well. You may also find that with a little more time and understanding of how to get shit done at your company, there may be opportunities to use Python that you're not aware of yet.

/r/datascience Thread