Data science, also known as data-driven science, is an interdisciplinary field of scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured, similar to data mining.
Data scientist is a broad term that can refer to a number of types of careers. Generally, a data scientist analyzes data to learn about scientific processes. Some job titles in data science include data analyst, data engineer, computer and information research scientist, operations research analyst, and computer systems analyst.
Data scientists work in a variety of industries, ranging from tech to medicine to government agencies. Because the title is so broad, the qualifications for a job in data science vary. However, there are certain skills employers look for in almost every data scientist. Data scientists need statistical, analytical, and reporting skills.
Here’s a detailed list of the most important data scientist skills, as well as a longer list of even more related skills.
Companies want to see that you’re a (data-driven) problem solver. That is, at some point during your interview process, you’ll probably be asked about some high-level problem – for example, about a test the company may want to run or a data-driven product it may want to develop. It’s important to think about which things matter and which don’t. How should you, as the data scientist, interact with the engineers and product managers? What methods should you use? When do approximations make sense?
Perhaps the most important skill for a data scientist is to be able to analyze information. Data scientists have to look at, and make sense of, large swaths of data. They have to be able to see patterns and trends in the data, and explain those patterns. All of this takes strong analytical skills.
Being a good data scientist also means being creative. Firstly, you have to use creativity to spot trends in data. Secondly, you need to make connections between data that might seem unrelated. This takes a lot of creative thinking. Finally, you need to explain this data in ways that are clear to the executives at your company. This often requires creative analogies and explanations.
Data scientists not only have to analyze data, but they also have to explain that data to others. They must be able to communicate data to people, explain the importance of patterns in the data, and suggest solutions. This involves explaining complex technical issues in a way that is easy to understand. Often, communicating data requires visual, oral, and written communication skills.
While soft skills like analysis, creativity, and communication are important, hard skills are also critical to the job. A data scientist needs math skills, particularly in multivariable calculus and linear algebra.
Data scientists require basic computer skills, but programming skills are particularly important. Being able to code is critical to almost any data scientist position. Knowledge of languages such as Java, R, Python, or SQL is important.
Along with the basic skill sets, the core data science competencies you should develop are:
Knowledge of Basic Tools: No matter what type of company you’re interviewing for, you’re likely going to be expected to know how to use the tools of the trade. This means a statistical programming language, like R or Python, and a database querying language like SQL.
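To make the "R or Python plus SQL" pairing concrete, here is a minimal sketch that runs a SQL aggregation from Python using the standard-library sqlite3 module. The `orders` table, its columns, and the sample rows are hypothetical and exist only for illustration; a real job would query whatever database the company actually uses.

```python
import sqlite3


def revenue_by_region(rows):
    """Load (region, amount) rows into an in-memory table and total them with SQL."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema, made up for this example.
        conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) AS total "
            "FROM orders GROUP BY region ORDER BY total DESC"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample data.
sample_orders = [("east", 120.0), ("west", 80.0), ("east", 30.0), ("west", 45.5)]
totals = revenue_by_region(sample_orders)
print(totals)  # [('east', 150.0), ('west', 125.5)]
```

The point is less the specific query than the workflow: pull and aggregate with SQL, then continue the analysis in the programming language.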
Must Understand Basic Statistics: At least a basic understanding of statistics is vital as a data scientist. An interviewer once told me that many of the people he interviewed couldn’t even provide the correct definition of a p-value. You should be familiar with statistical tests, distributions, maximum likelihood estimators, etc. Think back to your basic stats class! As with machine learning, one of the more important aspects of your statistics knowledge will be understanding when different techniques are (or aren’t) a valid approach. Statistics is important at all company types, but especially at data-driven companies where the product is not data-focused and product stakeholders will depend on your help to make decisions and to design and evaluate experiments.
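Since the p-value came up: it is the probability, assuming the null hypothesis is true, of seeing a result at least as extreme as the one actually observed. One way to build intuition is an exact permutation test, sketched below for the difference in means between two small groups. This is an illustrative sketch, not a recommendation for any particular test; the function name and the sample numbers are made up, and enumerating every split is only practical for small samples.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_diff(sum_a):
        # Mean of the "A" split minus mean of the remaining values.
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))
    extreme = 0
    count = 0
    # Under the null hypothesis the labels are exchangeable, so every way of
    # choosing which values belong to group A is equally likely.
    for indices in combinations(range(len(pooled)), n_a):
        count += 1
        split_sum = sum(pooled[i] for i in indices)
        # Small tolerance so float rounding does not drop exact ties.
        if abs(mean_diff(split_sum)) >= observed - 1e-12:
            extreme += 1
    return extreme / count


p = permutation_p_value([1, 2, 3], [4, 5, 6])
print(p)  # 0.1
```

Here only 2 of the 20 possible splits produce a difference as large as the observed one, so the p-value is 2/20 = 0.1.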
Be Familiar with Machine Learning: If you’re at a large company with huge amounts of data, or working at a company where the product itself is especially data-driven, you’ll likely want to be familiar with machine learning methods. This can mean things like k-nearest neighbors, random forests, and ensemble methods: all of the machine learning buzzwords. It’s true that a lot of these techniques can be implemented using R or Python libraries, so it’s not necessarily a dealbreaker if you’re not the world’s leading expert on how the algorithms work. More important is to understand the broad strokes and to really understand when it is appropriate to use different techniques.
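As an example of what "the broad strokes" look like, k-nearest neighbors fits in a few lines: find the k training points closest to the query and take a majority vote over their labels. The sketch below uses only the Python standard library (math.dist requires Python 3.8 or later); the function name and toy data are made up, and in practice you would more likely reach for a library implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each label with its Euclidean distance to the query, nearest first.
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5)))  # a
print(knn_predict(points, labels, (5.5, 5.5)))  # b
```

Knowing this much is enough to reason about when the method fits: it needs a meaningful distance between points, and every prediction scans the whole training set.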
Techniques of Multivariable Calculus and Linear Algebra: You may in fact be asked to derive some of the machine learning or statistics results you employ elsewhere in your interview. Even if you’re not, your interviewer may ask you some basic multivariable calculus or linear algebra questions, since they form the basis of a lot of these techniques. You may wonder why a data scientist would need to understand this stuff if there are a bunch of out-of-the-box implementations in sklearn or R. The answer is that at a certain point, it can become worth it for a data science team to build out their own implementations in-house. Understanding these concepts is most important at companies where the product is defined by the data and small improvements in predictive performance or algorithm optimization can lead to huge wins for the company.
“Data scientist” is often used as a blanket title to describe jobs that are drastically different.
Data Munging: Oftentimes, the data you’re analyzing is going to be messy and difficult to work with. Because of this, it’s really important to know how to deal with imperfections in data. Some examples of data imperfections include missing values, inconsistent string formatting (e.g., ‘New York’ versus ‘new york’ versus ‘ny’), and inconsistent date formatting (‘2017-01-01’ vs. ‘01/01/2017’, Unix time vs. timestamps, etc.). This will be most important at small companies where you’re an early data hire, or at data-driven companies where the product is not data-related (particularly because the latter have often grown quickly with not much attention to data cleanliness), but this skill is important for everyone to have.
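The imperfections listed above can be handled with a couple of small normalizing functions. The following is a minimal sketch using only the Python standard library. The alias table and the list of accepted date formats are assumptions made for this example, and it assumes ‘01/01/2017’ is month-first, which real data may not guarantee.

```python
from datetime import datetime

# Hypothetical alias table; a real one would come from inspecting the data.
CITY_ALIASES = {"ny": "New York", "nyc": "New York", "new york": "New York"}

# Assumed input formats: ISO dates and month-first slashed dates.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def clean_city(raw):
    """Return a canonical city name, or None for a missing value."""
    if raw is None or not raw.strip():
        return None
    text = raw.strip()
    return CITY_ALIASES.get(text.lower(), text.title())


def clean_date(raw):
    """Return an ISO 'YYYY-MM-DD' string, or None if the value can't be parsed."""
    if raw is None or not raw.strip():
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # Try the next known format.
    return None


print([clean_city(c) for c in ["New York", "new york", " ny ", ""]])
# ['New York', 'New York', 'New York', None]
print([clean_date(d) for d in ["2017-01-01", "01/01/2017", "not a date"]])
# ['2017-01-01', '2017-01-01', None]
```

Returning None for anything unparseable keeps missing and malformed values explicit instead of silently guessing at them.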
Data Visualization & Communication: Visualizing and communicating data is incredibly important, especially at young companies that are making data-driven decisions for the first time or at companies where data scientists are viewed as people who help others make data-driven decisions. When it comes to communicating, this means describing your findings, or the way techniques work, to both technical and non-technical audiences. Visualization-wise, it can be immensely helpful to be familiar with data visualization tools like ggplot and d3.js. It is important to be familiar not just with the tools necessary to visualize data, but also with the principles behind visually encoding data and communicating information.
Background of Software Engineering: If you’re interviewing at a smaller company and are one of the first data science hires, it can be important to have a strong software engineering background. You’ll be responsible for handling a lot of data logging, and potentially the development of data-driven products.
Data science is still nascent and ill-defined as a field. Getting a job is as much about finding a company whose needs match your skills as it is developing those skills. This writing is based on my own firsthand experiences – I’d love to hear if you’ve had similar (or contrasting) experiences during your own process.