In this fast-paced and increasingly global consumer market, data is everything. Businesses that learn to wield their data effectively make headway over competitors who don’t. Data science as a profession is uniquely situated to capitalize upon this trend by providing businesses with cutting-edge tools and techniques.
In this article we will discuss the definition of data science, big data trends, how data science has changed algorithms for the better, and the symbiotic relationship between customer insights and data science. Lastly, we’ll discuss how Instnt leverages data science to protect businesses from fraud in a variety of industries and contexts.
What Is Data Science?
A successful data scientist will have a theoretical basis in statistics and probability, the computer science knowledge to design and code efficient programs, and the business sense to know which types of patterns to home in on. From there, the data scientist may choose to focus on subtopics such as big data, machine learning, deep learning, natural language processing, computer vision, or data visualization, to name a few.
I’ve heard data scientists jokingly describe themselves as “jack of all trades and master of some”. Data scientists come from a diversity of backgrounds and experiences, so some more closely resemble software developers, while others resemble statisticians or finance quants. However, all data scientists will have some basis in math, computer science and business.
The Data Science Process
The data science process, also known as the data science lifecycle, encompasses all the tasks a data scientist is expected to do from start to finish. This includes data collection, cleaning, exploratory data analysis, model building and model deployment.
Every step is crucial to properly studying customer data. Data should be collected from a reputable source with minimal missing values. Even after a source is decided upon, the data itself is often messy and unstructured. The goal of data cleaning is then to mold the data into good enough shape to be analyzed without changing its underlying meaning.
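To make the cleaning step concrete, here is a minimal sketch in plain Python. The records, field names and cleaning rules are all invented for illustration; a real project would choose its own rules and would likely use a dedicated data library. The sketch trims whitespace, normalizes capitalization, converts text to numbers, drops rows with a missing value and skips repeated customer IDs.

```python
from typing import Optional

# Hypothetical raw customer records; the field names are illustrative only.
raw_records = [
    {"customer_id": "001", "city": " Boston ", "spend": "120.50"},
    {"customer_id": "002", "city": "boston", "spend": ""},  # missing spend
    {"customer_id": "003", "city": "CHICAGO", "spend": "80"},
    {"customer_id": "001", "city": " Boston ", "spend": "120.50"},  # repeat
]


def clean_record(record: dict) -> Optional[dict]:
    """Return a normalized copy of the record, or None if spend is missing."""
    spend_text = record["spend"].strip()
    if not spend_text:
        return None  # drop the row rather than guess a value
    return {
        "customer_id": record["customer_id"].strip(),
        "city": record["city"].strip().title(),
        "spend": float(spend_text),
    }


def clean_dataset(records: list) -> list:
    """Normalize records, drop incomplete rows and skip repeated customer IDs."""
    seen_ids = set()
    cleaned = []
    for record in records:
        normalized = clean_record(record)
        if normalized is None or normalized["customer_id"] in seen_ids:
            continue
        seen_ids.add(normalized["customer_id"])
        cleaned.append(normalized)
    return cleaned


cleaned_records = clean_dataset(raw_records)
```

Note that the sketch drops incomplete rows instead of filling them in. That is one way to avoid changing the meaning of the data, but it is a judgment call that depends on how much data can be spared.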
After collection and cleaning comes exploratory data analysis (EDA). At this step, data scientists begin studying the data for patterns that might be of interest in a business context. This could be relationships between items, such as finding that putting peanut butter and jelly together on the shelf increases sales of both, or it could be relationships between customers, such as finding that customers who share certain traits are likely to buy generic brands and shop for the same types of products at similar times of day.
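One simple way to look for item relationships like the peanut butter and jelly example is to count how often pairs of products show up in the same basket. The sketch below assumes a handful of made-up baskets; the data and the `pair_counts` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly"},
    {"bread", "milk"},
    {"peanut butter", "jelly", "milk"},
    {"milk", "eggs"},
]


def pair_counts(all_baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in all_baskets:
        # Sorting gives every pair a single, consistent key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
```

In this toy data, jelly and peanut butter appear together in three of the five baskets, more often than any other pair. A raw count like this only hints at a relationship; whether shelving the items together actually increases sales is something a business would still need to test.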
Finally, we have model building and deployment. Once the data is in working order and patterns of interest have been identified, the data scientist can then use those patterns to build a model. Often this model is predictive, which is the goal of machine learning. By creating and deploying models, data scientists aim to predict the future habits of consumers, given what patterns are already known about them.
Big Data Trends
According to Forbes, there are 2.5 quintillion bytes of data produced each day and 90% of all the world’s data has been generated in the last few years. Companies are incentivized to collect all kinds of data, all the time; it then becomes the data scientist, data analyst or data engineer’s job to analyze this data and formulate business recommendations. Fortunately, there are many recent big data trends that have made the field a little more welcoming.
Edge computing shifts the storage and processing load of that data from the large network to the individual device. This may sound like it would slow your phone down, but it actually reduces latency because there is no single central system trying to process everyone’s data at the same time. This makes edge computing a win-win for both users and companies.
While it may not be for everyone, the rise of the cloud brings more than a few benefits. Scalability, a perpetual problem that held up progress in years past, is now virtually a non-issue thanks to cloud services. Similarly, the cloud is a cost-effective option for small and medium-sized businesses that would rather pay as they go than buy and run expensive servers (and even more servers as they grow). Highly regulated sectors such as government are becoming more cost-effective by making use of a hybrid cloud, keeping sensitive information safe while enjoying the benefits of third-party cloud computing.
Whereas traditional data warehouses hold structured data that might need processing in order to “fit”, a data lake can hold both unstructured and structured data simultaneously without issue. This can be a solid alternative when a company has exponentially increased data collection but only plans to analyze small portions at a time.
Results-Driven Data Science Algorithms
Before data could be collected and analyzed at scale, the best that researchers could often do was to create theoretical frameworks. These frameworks, while beautiful mathematically, often relied on many assumptions that do not always hold true. For example, economists often make the assumption that every person acts rationally at all times, but if you’ve ever been in love, spent more money than you should have or eaten something just because you already paid for it, then that assumption is broken.
The goal of a data science team is to create algorithms that are both mathematically sound and true to real life. This is especially true for machine learning, where the number-one goal is to make the most accurate and precise prediction possible.
All that being said, data science would not be useful without the mathematics underneath it. Algorithms are math formulas brought to life; understanding the mathematics is what allows the data scientist to tweak and interpret model results. Data scientists often rely heavily on linear algebra (matrix multiplication in particular), calculus and probability in their work.
Computer Science Efficiency
Without computer science, data scientists would probably be stuck in the 70s, drawing regression curves by hand like many statisticians and economists of the time. With the constant improvements made to computational power and cloud technologies in recent years, data scientists now have the ability to analyze extremely large data sets in a fraction of the time. Additionally, data scientists can capitalize upon the data structures, programming best practices, code optimization and parallel computing principles made popular by computer science.
Lastly, data scientists can improve upon previous algorithms not just by incorporating computer science but also by applying domain knowledge. Unlike statisticians of the 70s, data scientists have access to huge quantities of data and the computational power to study it. Further, data scientists in industry are incentivized to optimize results over theoretical correctness. If an assumption doesn’t hold true for a certain domain, a data science team can discard it. Domain expertise allows the data scientist to creatively tweak old algorithms to fit the business problem at hand.
Customer Data and Data Science: The Perfect Match
There are many tools for analyzing customer data with data science. Below we will discuss the two main types of learning, supervised and unsupervised, and how each is used to gain insights about consumers.
Supervised learning is when the data we are using has a filled-in y variable, also called a label. This y could be the sale price of a house, the shell color of a crab, or a simple Y/N on whether a child grew up to attend college. This type of data is incredibly useful for prediction because it allows us to infer the future by learning from the past.
Supervised learning includes many of the techniques most commonly associated with data science: logistic regression, support vector machines, k-nearest neighbors, decision trees, and neural networks, for example. Some of these techniques can also be adapted for unsupervised learning.
In a business context, running a supervised model could mean using past purchase data to predict how busy the Christmas season will be next year, predicting the final sale price of a house based on neighborhood data, or predicting whether a customer will buy mascara in July given that they purchase it about every 6 months.
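As a small illustration of supervised learning, the sketch below uses k-nearest neighbors, one of the techniques named above, to predict a Y/N label. The training data, the two features and the `predict_knn` helper are all hypothetical; the point is only to show how labeled past observations are used to label a new one.

```python
import math
from collections import Counter

# Hypothetical labeled data:
# (store visits per month, average basket in dollars) -> bought again? (Y/N)
training_data = [
    ((1.0, 20.0), "N"),
    ((2.0, 25.0), "N"),
    ((1.5, 22.0), "N"),
    ((8.0, 60.0), "Y"),
    ((9.0, 65.0), "Y"),
    ((7.5, 58.0), "Y"),
]


def predict_knn(features, labeled_points, k=3):
    """Predict a label by majority vote among the k closest training points."""
    by_distance = sorted(
        labeled_points,
        key=lambda item: math.dist(features, item[0]),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


frequent_shopper = predict_knn((8.5, 62.0), training_data)
occasional_shopper = predict_knn((1.2, 21.0), training_data)
```

Here the frequent shopper is predicted as "Y" and the occasional shopper as "N", because each new customer sits next to training examples with that label. The sketch uses raw distances for simplicity; in practice, features measured on different scales are usually rescaled first so that one of them does not dominate the distance.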
Alternatively, the data we are working with often does not have labels at all. In this scenario, we focus on what individual observations have in common with each other instead of focusing on predicting the habits of a single observation. One common example of unsupervised learning would be k-means clustering.
In k-means clustering, each observation’s attributes are examined in order to place it into one of k groups, where each group is summarized by the mean (or center) of its members; hence the name, k-means. To give an example, if we have a mixed pool of 100 blue crabs and 100 red crabs, while we don’t know the color of each crab (no labels, remember?), we can still identify clusters of crabs based on their other attributes. This might be the temperature each type prefers, what food they hunt, or what areas of the ocean they are found in. We can then experiment with the value of k and find that k=2 provides the best results.
While we weren’t told there were two groups of crabs in the sample, we can infer it by looking at the relationships between the subjects in the group. The same goes for customers; while we don’t necessarily know the specific y variable, we can group customers based on similar attributes and infer information about each group without having to have labels for them.
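The crab example can be sketched with a bare-bones version of k-means in plain Python. Everything here is an assumption made for illustration: the six crabs, their two attributes (preferred temperature and depth), and the choice to seed the centers with the first k points. Production implementations typically use smarter initialization and a tested library.

```python
import math

# Hypothetical unlabeled crabs: (preferred water temperature, typical depth).
crabs = [
    (10.0, 30.0),
    (24.0, 5.0),
    (11.0, 28.0),
    (25.0, 6.0),
    (9.5, 32.0),
    (23.5, 4.0),
]


def mean_point(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(values) / len(points) for values in zip(*points))


def k_means(points, k, max_iterations=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Simplification: seed the centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids


def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        math.dist(point, centroids[label]) ** 2
        for point, label in zip(points, labels)
    )


# Try several values of k and record how tightly each one fits the data.
inertia_by_k = {}
for k in (1, 2, 3):
    labels_k, centroids_k = k_means(crabs, k)
    inertia_by_k[k] = inertia(crabs, labels_k, centroids_k)

two_group_labels, two_group_centroids = k_means(crabs, 2)
```

With this toy data, moving from k=1 to k=2 cuts the within-group spread dramatically, while moving from k=2 to k=3 helps only a little. That sharp drop followed by a plateau is one common, informal signal that two groups describe the data well.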
Fraud Detection with Data Science
Fraud detection, a subset of anomaly detection, is one of the most common applications of data science. In particular, machine learning techniques such as random forest have frequently proven successful for fraud detection, and credit card fraud detection is one of the most popular problems on Kaggle to date. Because cases of fraud are not always discovered, and because fraud happens in real time, fraud detection is usually studied as a case of unsupervised or semi-supervised learning. Deep neural networks, including the AI model created by Instnt, are well suited to detecting fraud due to their learning capacity and scalability.
In these types of cases, called rare event classification, it’s likely that fewer than 1% (or even 0.01%) of total observations are fraudulent, a condition which data scientists call class imbalance. When this occurs, it can be tricky to accurately pick out positive cases without also producing an exceedingly high false positive rate. Deep learning models can identify exceedingly complex relationships that a traditional machine learning model cannot, through networks that loosely resemble the way humans think. They also scale well with data volume, meaning that more data generally equates to more learning opportunities. Using a deep learning model can be good when the problem requires a high level of detail, such as the metaphorical needle-in-a-haystack search required by rare event classification problems.
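To see why class imbalance is tricky, it helps to compare accuracy with precision and recall on a made-up example. The sketch below assumes 1,000 transactions with 10 frauds and two hypothetical predictors: one that never flags anything, and one that catches 8 frauds at the cost of 20 false alarms. The numbers are invented and do not describe any real system.

```python
def evaluate(actual, predicted):
    """Return accuracy, precision and recall for boolean fraud labels."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a and p)
    false_pos = sum(1 for a, p in pairs if not a and p)
    false_neg = sum(1 for a, p in pairs if a and not p)
    true_neg = sum(1 for a, p in pairs if not a and not p)
    flagged = true_pos + false_pos
    fraud_total = true_pos + false_neg
    return {
        "accuracy": (true_pos + true_neg) / len(pairs),
        # Treat precision and recall as 0.0 when their denominator is zero.
        "precision": true_pos / flagged if flagged else 0.0,
        "recall": true_pos / fraud_total if fraud_total else 0.0,
    }


# Hypothetical data: 1,000 transactions, of which 10 (1%) are fraudulent.
actual = [True] * 10 + [False] * 990

# Baseline that never flags a transaction.
never_flag = [False] * 1000

# Hypothetical detector: catches 8 of the 10 frauds, raises 20 false alarms.
detector = [True] * 8 + [False] * 2 + [True] * 20 + [False] * 970

baseline_scores = evaluate(actual, never_flag)
detector_scores = evaluate(actual, detector)
```

The do-nothing baseline scores 99% accuracy while catching no fraud at all, whereas the detector has slightly lower accuracy (97.8%) but finds 80% of the fraud. This is why rare event problems are usually judged on precision and recall rather than on accuracy alone.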
Data Science with Instnt Technology
Data science has managed to commandeer a number of tools and techniques from other fields to carve out a domain all of its own. From big data to supervised and unsupervised methods to fraud detection, there are many ways data scientists can add value to a business in any industry. Instnt leverages data science and deep learning technologies to perform compliance checks and monitor fraud risk in real time when a user signs up for your platform. Further, Instnt seamlessly integrates with your website or web app. Calculate your expected ROI and get started with Instnt today.