
Introduction to Machine Learning


Facebook's face recognition is a good example of machine learning. If you post a group picture of ten people, and all of them are Facebook members with their own profile pictures, face recognition will suggest tagging specific friends, even though lighting, poses, and other factors make the people look different.

Machine Learning Categorization
  • Text categorization, e.g. spam filtering
  • Fraud detection, e.g. credit card fraud
  • Machine vision, e.g. image processing/face detection
  • Natural language processing, e.g. spoken language understanding
  • Market segmentation, e.g. predicting whether a customer will respond to a promotion
  • Bioinformatics, e.g. a pharmacy trying to understand whether an antibiotic will treat a cold
Types of Machine Learning
  • Supervised learning: The algorithm has training data with a known expected output. Example: credit card fraud detector.
  • Unsupervised learning: The algorithm identifies patterns in the data without being told the expected outcome. Example: anomaly detection.
  • Reinforcement learning: The algorithm learns from interactions with the environment. It uses trial and error and memorizes strategies for further improvement. Example: chess program.

Note: Supervised learning is best suited for the finance domain.

Types of Machine Learning Algorithms
  • Classification: Given the data, the answer is one of a fixed set of categories.
  • Regression: Used to predict a number (a continuous numeric value).
  • Anomaly detection: Analyzes patterns to flag unusual cases, e.g. credit card fraud detection.
  • Clustering: Used when we need to discover structure; forms groups to interpret the data.
  • Reinforcement: Used when a decision needs to be made based on past experience and the environment.
Linear Classification

Classification and regression are two types of machine learning problems, categorized by the desired output. We should first identify which of the two our problem falls into in order to find a suitable algorithm.


Regression: we model the relationship between a continuous input variable X and a continuous target variable T.

For example, suppose you want to predict whether the market will go up or down, but the desired output is a price series. Mid-price, volume, and continuous volatility are used to produce the target variable, a price series.


Classification: the input variable X may still be continuous, but the target variable T is discrete.

  • T=1 if assigned to C1.
  • T=0 if assigned to C2.

For example, suppose you want to predict whether the market will go up or down. Mid-price, volume, and continuous volatility are used to produce a discrete target variable for the price. In this case, if the price goes up, the label is +1; if the price goes down, it is 0 (or -1).
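The discrete labeling described above can be sketched in NumPy (the price values here are made-up illustration data, not from any real market):

```python
import numpy as np

# Hypothetical mid-price series (illustration only)
prices = np.array([100.0, 101.5, 101.2, 102.0, 101.0, 103.5])

# Day-over-day price change
moves = np.diff(prices)

# Discrete target variable: 1 if the price went up, 0 if it went down
labels = np.where(moves > 0, 1, 0)
print(labels)  # [1 0 1 0 1]
```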

Regression vs. Classification
  • Classification is more advisable when what we do with the prediction matters more than the exact predicted value.
  • Long-term trading organizations tend to use classification.
  • Regression is advisable when the cost of a few mistakes does not outweigh being right most of the time.
  • Most high-frequency trading organizations use regression.
Steps of Machine Learning

No matter which machine learning model you use, these four steps are fairly common:

  • Prepare the data: Get the raw data and structure it.
  • Train the model: Use the data and train the model.
  • Test the model: Test the model with some test data; do the model fitting and test it again. Repeat to get the best model.
  • Deploy the model: Once satisfied with the model, deploy it to use.
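As a rough sketch of these four steps, assuming scikit-learn and its bundled iris data purely for illustration (neither is named in the text, and the model choice is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Prepare the data: get the raw data and structure it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Train the model
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Test the model; tune and test again until the fit is acceptable
accuracy = model.score(X_test, y_test)

# 4. Deploy the model once satisfied (e.g. persist it with joblib.dump)
print("test accuracy:", round(accuracy, 2))
```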
Collateral Management Scenario

Consider collateral management, where the collateral is a security (say, a house) that needs to be priced.

Here, if we have 70,000 records about houses, we split the data into three parts:

  1. Training set (60%): train the model
  2. Cross-validation set (20%): tune the model
  3. Test set (20%): test the model

If this doesn’t give you a satisfactory result, adjust the model (a process known as model fitting).
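The 60/20/20 split above can be sketched with NumPy index shuffling (a minimal sketch; only the 70K record count comes from the text):

```python
import numpy as np

n = 70_000                          # number of house records
rng = np.random.default_rng(0)
indices = rng.permutation(n)        # shuffle before splitting

train_end = int(0.6 * n)            # 60% training
cv_end = int(0.8 * n)               # next 20% cross-validation

train_idx = indices[:train_end]
cv_idx = indices[train_end:cv_end]
test_idx = indices[cv_end:]         # remaining 20% test

print(len(train_idx), len(cv_idx), len(test_idx))  # 42000 14000 14000
```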

New Apple iPhones such as the iPhone 11 Pro and 11 Pro Max are disappearing from retail stores, this time due to supply issues following the coronavirus outbreak and lower footfall.

According to KeyBanc Capital Markets analyst John Vinh, Apple’s “iPhone sell-through was adversely impacted by supply issues” due to coronavirus, “particularly on the Pro/Max models and by lower foot traffic in outbreak areas.”

The stores have been running out of iPhones for some weeks now and they have no idea when the new stock will arrive, reports Seeking Alpha.

Apple shares were down nearly 5.5 percent pre-market to $260 on Thursday.

According to a NY Post report last week, wireless retailers have either run out of stock or are running low on iPhone 11 and iPhone 11 Pro models.

“Employees at numerous retail locations around Manhattan contacted by The Post uniformly told the same story of low stock and infrequent shipments,” said the report.

“We got a shipment and it didn’t have any iPhones in it, just flip phones and Samsungs,” a Verizon store employee on the Upper West Side in New York was quoted as saying.

There is, however, cautious optimism that the worst of the “outbreak in China” is now past.

Factories are beginning to ramp up production slowly, though many are still below normal capacity at this time of year.

Apple supplier Foxconn said it is running at about half its normal low-season capacity — this equates to about 25 percent of full capacity, according to Counterpoint Research.

While factories are anxious to ramp up production, they are also being careful that labour-intensive work does not rekindle viral outbreaks.

Rumours have also been circulating for several weeks that Apple will delay the launch of the yet-to-be-named lower-cost iPhone.

This is to be the successor of the iPhone SE, built around the same form-factor as the iPhone 8 series (SE 2 or iPhone 9).

“Problems with the launch were initially thought to be because initial volume ramps could be delayed due to Foxconn’s inability to start production,” said Peter Richardson, Research Director, Counterpoint.

Travel restrictions on Apple’s engineers flying to China to supervise pre-production testing might also be a factor.

And if all these were not problematic enough, just holding a launch event at this time is difficult.

We all know that Pandas and NumPy are amazing and play a crucial role in our day-to-day analysis. Without Pandas and NumPy, we would be left deserted in this huge world of data analytics and science. Today, I am going to share 12 amazing Pandas and NumPy functions that will make your life and analysis much easier.

Let’s start with NumPy:

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
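A quick illustration of the broadcasting mentioned above: arrays of different shapes are stretched to a common shape automatically, without explicit loops.

```python
import numpy as np

col = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])        # shape (4,)

# Broadcasting stretches both operands to a common (3, 4) shape
result = col + row
print(result)
# [[ 1  2  3  4]
#  [11 12 13 14]
#  [21 22 23 24]]
```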

1. argpartition()

NumPy has this amazing function that can find the indices of the N largest values. The output is the indices of the N largest values, and we can then sort the values if needed.

# assuming an input array such as:
x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])

index_val = np.argpartition(x, -4)[-4:]
index_val
array([1, 8, 2, 0], dtype=int64)

np.sort(x[index_val])
array([10, 12, 12, 16])

2. allclose()

allclose() is used to check whether two arrays are element-wise equal within a tolerance, returning a boolean. It will return False if items in the two arrays are not equal within the tolerance. It is a great way to check whether two arrays are similar, which can actually be difficult to implement manually.

array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])

# with a tolerance of 0.1, it should return False:
np.allclose(array1, array2, 0.1)
False

# with a tolerance of 0.2, it should return True:
np.allclose(array1, array2, 0.2)
True

3. clip()

Clip() is used to keep values in an array within an interval. Sometimes, we need to keep the values within an upper and lower limit. For the mentioned purpose, we can make use of NumPy’s clip(). Given an interval, values outside the interval are clipped to the interval edges.

x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])

np.clip(x, 2, 5)
array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])

4. extract()

extract(), as the name goes, is used to extract specific elements from an array based on a condition. With extract(), we can also combine conditions using & (and) and | (or).

# Random integers
array = np.random.randint(20, size=12)
array
array([ 0,  1,  8, 19, 16, 18, 10, 11,  2, 13, 14,  3])

# Divide by 2 and check if remainder is 1
cond = np.mod(array, 2) == 1
cond
array([False,  True, False,  True, False, False, False,  True, False,  True, False,  True])

# Use extract to get the values
np.extract(cond, array)
array([ 1, 19, 11, 13,  3])

# Apply condition on extract directly
np.extract(((array < 3) | (array > 15)), array)
array([ 0,  1, 19, 16, 18,  2])

5. where()

where() is used to return elements from an array that satisfy a certain condition: it returns the index positions of values that meet the condition. It is similar to the WHERE clause in SQL, as I'll demonstrate in the examples below.

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])

# Where y is greater than 5, returns index positions
np.where(y > 5)
(array([2, 3, 5, 7, 8], dtype=int64),)

# The first value replaces elements that match the condition,
# the second replaces those that do not
np.where(y > 5, "Hit", "Miss")
array(['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit'], dtype='<U4')

6. percentile()

Percentile() is used to compute the nth percentile of the array elements along the specified axis.

a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])
print("50th Percentile of a, axis = 0 : ", np.percentile(a, 50, axis=0))
50th Percentile of a, axis = 0 :  6.0

b = np.array([[10, 7, 4], [3, 2, 1]])
print("30th Percentile of b, axis = 0 : ", np.percentile(b, 30, axis=0))
30th Percentile of b, axis = 0 :  [5.1 3.5 1.9]

Let me know if you’ve used these before and how they helped you. Let’s move on to the amazing Pandas.


pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time-series data both easy and intuitive.

pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time-series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational/statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure.

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.
1. read_csv(nrows=n)

You might already be aware of the read_csv function. But most of us still make the mistake of reading an entire .csv file even when it is not required. Consider a situation where we are unaware of the columns and the data present in a 10 GB .csv file; reading the whole file would not be a smart decision, because it would waste memory and take a lot of time. Instead, we can import just a few rows from the .csv file and then proceed as per our need.

import io
import requests
import pandas as pd

# I am using this online data set just to make things easier for you guys
url = "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv"
s = requests.get(url).content

# read only first 10 rows
df = pd.read_csv(io.StringIO(s.decode('utf-8')), nrows=10, index_col=0)

2. map()

The map() function maps the values of a Series according to an input correspondence. It is used for substituting each value in a Series with another value, which may be derived from a function, a dict, or a Series.

# create a dataframe
dframe = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                      index=['India', 'USA', 'China', 'Russia'])

# compute a formatted string from each floating point value in frame
changefn = lambda x: '%.2f' % x

# Make changes element-wise
dframe['d'].map(changefn)

3. apply()

apply() allows users to pass a function and apply it to every single value of a Pandas Series.

# max minus min lambda fn
fn = lambda x: x.max() - x.min()

# Apply this on dframe that we've just created above
dframe.apply(fn)

4. isin()

isin() is used to filter data frames. It helps in selecting rows that have a particular (or multiple) value in a particular column. It is the most useful function I’ve come across.

# Using the dataframe we created for read_csv
filter1 = df["value"].isin([112])
filter2 = df["time"].isin([1949.000000])

df[filter1 & filter2]

5. copy()

copy() is used to create a copy of a Pandas object. When you assign a data frame to another data frame, its value changes when you make changes in the other one. To prevent this issue, we can make use of copy().

# creating sample series
data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])

# Assigning issue that we face
data1 = data
# Change a value (any change works as an example)
data1[0] = 'USA'
# Also changes the value in the old series
data

# To prevent that, we use copy()
# creating copy of series
new = data.copy()
# assigning new values
new[1] = 'Changed value'
# printing data
new
data

6. select_dtypes()

The select_dtypes() function returns a subset of the data frame’s columns based on the column dtypes. Its parameters can be set to include all columns having a specific data type, or to exclude all columns that have a specific data type.

# We'll use the same dataframe that we used for read_csv
framex = df.select_dtypes(include="float64")

# Returns only the time column



7. pivot_table()

The most amazing and useful function of pandas is pivot_table. If you hesitate to use groupby and want to extend its functionality, you can very well use pivot_table. If you’re aware of how a pivot table works in Excel, then this might be a piece of cake for you. Levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.

# Create a sample dataframe
school = pd.DataFrame({'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'], 
      'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'], 
      'C': [26, 22, 20, 23, 24]})
# Lets create a pivot table to segregate students based on age and course
table = pd.pivot_table(school, values ='A', index =['B', 'C'], 
                         columns =['B'], aggfunc = np.sum, fill_value="Not Available") 

Do let me know down below in the comments if you guys have come across or used any other amazing functions. I would love to know more about them.
