Pattern Recognition

Pattern Recognition Prof. Christian Bauckhage

outline lecture 15 (a.k.a. the Christmas lecture)

how to get a job interview for a senior data scientist position?
a problem from a 2015 job advertisement

the opportunity

Welcome to . . . , one of the most exciting places for data scientists in Europe. Our department develops and applies highly scalable algorithms to recognize patterns and trends in data from every corner of the . . . process. Current live projects include product recommendation for our website, fraud detection for orders, optimizing the mix of channels for marketing and much more. We create statistical models, apply state-of-the-art algorithms to our gigantic database, visualize the results and get the impact, all while keeping our focus on increasing customer satisfaction and creating shareholder value. If you have a good idea, you will have the tools to calculate it, the data to test it and millions of customers in 14 countries to benefit from the results. Bold projects, detailed analysis, quick turnaround and fun at work: that is data science at . . . .


what we are looking for

When you see a problem, you latch on and cannot let go until it is solved. If this describes you, check out the teaser problem below. You have a Master or PhD in Machine Learning, Computer Science, Mathematics, Statistics or another quantitative field and having published in highly ranked journals would be a plus. 4 or more years experience in the analysis of real data with the tools of Machine Learning, Data Mining, Time Series Analysis and Computational Statistics. You should be able to, for example, clearly explain the differences, advantages and disadvantages of different classification, clustering and regression algorithms. Solid programming ideally in Python and logical thinking skills that enable you to solve problems . You can translate the business problems from different departments into quantitative terms and present the results in an easily understandable manner. Deep skepticism of, plus the ability to identify and debunk, buzzword mumbo jumbo, false assumptions and analysis.


the problem

Our Data Intelligence Team is searching for a new top analyst. We already know of an excellent candidate with top analytical and programming skills. Unfortunately, we don’t know her exact whereabouts but we only have some vague information where she might be. Can you tell us where to best send our recruiters and plot an easy to read map of your solution for them? This is what we could extract from independent sources . . .

the problem

The candidate is likely to be close to the river Spree. The probability at any point is given by a Gaussian function of its shortest distance to the river. The function peaks at zero and has 95% of its total integral within ±2730 m.

A probability distribution centered around the Brandenburg Gate also informs us of the candidate’s location. The distributions radial profile is log-normal with a mean of 4700 m and a mode of 3877 m in every direction. A satellite offers further information: with 95% probability she is located within 2400 m distance of the satellites path (assuming a normal probability distribution). Please make use of the information in the file http://bit.ly/19fdgVa.

additional information (1)

format of your results
Please provide the GPS coordinates of the next Top Analyst as part of your application. In addition, you can send us your code and some visualizations.

coordinates
Earth radius: 6371 km
Brandenburg Gate GPS coordinates: 52.516288, 13.377689
Satellite path is a great circle path between the coordinates
  52.590117, 13.39915
  52.437385, 13.553989

additional information (2)

coordinates
River Spree can be approximated as piecewise linear between the following coordinates:
  52.529198, 13.274099
  52.531835, 13.29234
  52.522116, 13.298541
  52.520569, 13.317349
  52.524877, 13.322434
  52.522788, 13.329
  52.517056, 13.332075
  52.522514, 13.340743
  52.517239, 13.356665
  52.523063, 13.372158
  . . .

additional information (3)

tip for conversion of coordinates
You can (but don't have to) use the following simple projection for getting GPS coordinates into an orthogonal coordinate system. The projection is reasonably accurate for the Berlin area. The result is an x, y coordinate system with the origin (0, 0) at the South-West corner of the area we are interested in. The x axis corresponds to East-West and is given in kilometers. The y axis corresponds to North-South and is also given in kilometers.

South-West corner of the area we are interested in:
  SWlat = 52.464011 (Latitude)
  SWlon = 13.274099 (Longitude)

The x and y coordinates of a GPS coordinate P with (Plat, Plon) can be calculated using:
  Px = (Plon − SWlon) · cos(SWlat · π/180) · 111.323
  Py = (Plat − SWlat) · 111.323

what gives?

the solution

notation

in the following, let
  x ⇔ a location in Berlin
  R ⇔ information related to the river Spree
  G ⇔ information related to the Brandenburg Gate
  S ⇔ information related to the path of the satellite

observe

we seem to be dealing with an estimation problem

$$x^* = \operatorname*{argmax}_x \; p(x \mid G, R, S) = \operatorname*{argmax}_x \; \frac{p(G, R, S \mid x)\, p(x)}{p(G, R, S)}$$

question

can we safely assume that

$$p(x \mid G, R, S) \propto p(x \mid G) \cdot p(x \mid R) \cdot p(x \mid S) \; ?$$

observe

for the prior p(x), we may assume that it is constant, say

$$c = \frac{1}{A_{\text{Berlin}}}$$

for the joint probability p(G, R, S), we may assume

$$p(G, R, S) = p(G)\, p(R)\, p(S)$$

and accordingly

$$p(G, R, S \mid x) = p(G \mid x)\, p(R \mid x)\, p(S \mid x)$$

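therefore

$$\begin{aligned}
p(x \mid G, R, S) &= \frac{p(G, R, S \mid x)\, p(x)}{p(G, R, S)} \\
&= \frac{p(G \mid x)\, p(R \mid x)\, p(S \mid x)\, c}{p(G)\, p(R)\, p(S)} \\
&= \frac{p(x \mid G)\, p(G) \; p(x \mid R)\, p(R) \; p(x \mid S)\, p(S) \; c}{c\, p(G) \; c\, p(R) \; c\, p(S)} \\
&\propto p(x \mid G) \cdot p(x \mid R) \cdot p(x \mid S) \qquad \checkmark
\end{aligned}$$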
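The proportionality can be checked numerically on a small discrete example. The following is only a sketch under stated assumptions: the five candidate locations and all likelihood values are made up for illustration, the prior is constant, and the three sources are treated as conditionally independent given x.

```python
import numpy as np

# Hypothetical toy setup: 5 candidate locations and made-up likelihood
# values for the fixed observations G, R, S (not taken from the lecture).
lik_G = np.array([0.10, 0.40, 0.20, 0.25, 0.05])   # p(G | x)
lik_R = np.array([0.30, 0.30, 0.10, 0.20, 0.10])   # p(R | x)
lik_S = np.array([0.05, 0.25, 0.40, 0.20, 0.10])   # p(S | x)
prior = np.full(5, 1.0 / 5)                         # constant prior p(x) = c

def normalize(v):
    """Scale a non-negative vector so that it sums to one."""
    return v / v.sum()

# joint posterior p(x | G, R, S) under conditional independence
post_joint = normalize(lik_G * lik_R * lik_S * prior)

# single-source posteriors p(x | G), p(x | R), p(x | S)
post_G = normalize(lik_G * prior)
post_R = normalize(lik_R * prior)
post_S = normalize(lik_S * prior)

# their renormalized product coincides with the joint posterior
post_prod = normalize(post_G * post_R * post_S)

same_posterior = np.allclose(post_joint, post_prod)
same_argmax = np.argmax(post_joint) == np.argmax(post_prod)
```

Since only the location of the maximum matters, the missing normalization constant never has to be computed.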

observe

the form of the likelihoods suggests determining the solution numerically

so, let's write properly vectorized numpy / scipy code that solves the problem, i.e. determines

$$x^* = \operatorname*{argmax}_x \; p(x \mid G) \cdot p(x \mid R) \cdot p(x \mid S)$$

required imports

import numpy as np
import numpy.linalg as la
import scipy.spatial as spt
import scipy.interpolate as ipl
import matplotlib.pyplot as plt

available geographic data

# GPS coordinates (latitude,longitude)
# river Spree
Rll = np.genfromtxt('SpreeGPS.csv', delimiter=',').T
# Brandenburg Gate
Gll = np.array([52.516288, 13.377689])
# points on satellite path
Sll = np.array([[52.590117, 52.437385],
                [13.39915 , 13.553989]])

conversion between GPS and map coordinates

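let φ ≡ latitude and λ ≡ longitude, and let (O_φ, O_λ) be the origin of the map, then

$$x = (\lambda - O_\lambda) \cdot \cos\!\left(\frac{O_\varphi\, \pi}{180}\right) \cdot 111.323$$

$$y = (\varphi - O_\varphi) \cdot 111.323$$

and vice versa

$$\lambda = \frac{x}{\cos\!\left(\frac{O_\varphi\, \pi}{180}\right) \cdot 111.323} + O_\lambda$$

$$\varphi = \frac{y}{111.323} + O_\varphi$$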

conversion between GPS and map coordinates

# NOTE: if ll is a column matrix, then
# ll[0] is 1st row and ll[1] is 2nd row
def ll2xy(ll, O=np.array([52.464011, 13.274099])):
    x = (ll[1] - O[1]) * np.cos(O[0]*np.pi/180) * 111.323
    y = (ll[0] - O[0]) * 111.323
    return np.vstack((x, y))

def xy2ll(xy, O=np.array([52.464011, 13.274099])):
    lg = xy[0] / (np.cos(O[0]*np.pi/180) * 111.323) + O[1]
    lt = xy[1] / 111.323 + O[0]
    return np.vstack((lt, lg))

Rxy = ll2xy(Rll)
Gxy = ll2xy(Gll)
Sxy = ll2xy(Sll)
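A quick round trip is a cheap way to gain confidence in the two conversion functions. The sketch below repeats them in self-contained form and converts the Brandenburg Gate to map coordinates and back; the names KM_PER_DEG, gate_ll, gate_xy and gate_back are illustrative and do not appear in the lecture code.

```python
import numpy as np

O = np.array([52.464011, 13.274099])   # south-west corner (lat, lon) from the job ad
KM_PER_DEG = 111.323                    # kilometers per degree, as in the job ad

def ll2xy(ll, O=O):
    """GPS (lat, lon) -> map coordinates (x, y) in km."""
    x = (ll[1] - O[1]) * np.cos(O[0] * np.pi / 180) * KM_PER_DEG
    y = (ll[0] - O[0]) * KM_PER_DEG
    return np.vstack((x, y))

def xy2ll(xy, O=O):
    """Map coordinates (x, y) in km -> GPS (lat, lon)."""
    lg = xy[0] / (np.cos(O[0] * np.pi / 180) * KM_PER_DEG) + O[1]
    lt = xy[1] / KM_PER_DEG + O[0]
    return np.vstack((lt, lg))

gate_ll = np.array([52.516288, 13.377689])   # Brandenburg Gate
gate_xy = ll2xy(gate_ll)                     # column vector (x, y) in km
gate_back = xy2ll(gate_xy)                   # should reproduce gate_ll

round_trip_ok = np.allclose(gate_back.ravel(), gate_ll)
```

The Gate should land inside the 16 km by 8 km map area used later, which gives a second plausibility check.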

plotting what we have so far

plt.plot(Gxy[0], Gxy[1], 'gs')
plt.plot(Rxy[0], Rxy[1], 'b-')
plt.plot(Sxy[0], Sxy[1], 'r-')
plt.show()

[figure: map showing the river Spree, the Brandenburg Gate and the satellite path; x axis from 0 to 16, y axis from 0 to 8]

observe

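the fact that

$$p(x \mid R) \sim \mathcal{N}\bigl(d(x, R) \mid 0, \sigma_R\bigr)$$

$$p(x \mid S) \sim \mathcal{N}\bigl(d(x, S) \mid 0, \sigma_S\bigr)$$

$$p(x \mid G) \sim \mathcal{LN}\bigl(d(x, G) \mid \mu_G, \sigma_G\bigr)$$

suggests computing distance maps . . .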

distance maps

[figure: four map panels, each showing the river Spree, the Brandenburg Gate and the satellite path; x axis from 0 to 16, y axis from 0 to 8]

create arrays of x and y coordinates

xlen = 16
ylen = 8
nx = 512
ny = int(nx * (1. * ylen) / xlen)
x = np.linspace(0, xlen, nx)
y = np.linspace(ylen, 0, ny)
xs, ys = np.meshgrid(x, y)

we obtain

print(xs)
[[0. 0.03131115 0.06262231 ..., 15.96868885 16.]
 [0. 0.03131115 0.06262231 ..., 15.96868885 16.]
 ...,
 [0. 0.03131115 0.06262231 ..., 15.96868885 16.]]

print(ys)
[[8. 8. ..., 8. 8.]
 [7.96862745 7.96862745 ..., 7.96862745 7.96862745]
 ...,
 [0. 0. ..., 0. 0.]]

computing d(x, G)

computing d(x, G) means computing Euclidean distances between many points x and a single point g

def GDistMap(G, xs, ys):
    return np.sqrt((xs-G[0])**2 + (ys-G[1])**2)

DG = GDistMap(Gxy, xs, ys)
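The broadcasting behind this distance map is easier to inspect on a tiny grid. The sketch below uses a hypothetical 3 x 4 grid and a made-up reference point g, and it also shows np.hypot as an equivalent formulation of the same distance; neither the grid nor g comes from the lecture.

```python
import numpy as np

# small hypothetical 3 x 4 grid, not the 256 x 512 grid of the lecture
x = np.linspace(0.0, 3.0, 4)
y = np.linspace(2.0, 0.0, 3)
xs, ys = np.meshgrid(x, y)

g = np.array([3.0, 0.0])                             # hypothetical reference point

# both expressions broadcast the scalar coordinates of g over the grid
d_sqrt = np.sqrt((xs - g[0])**2 + (ys - g[1])**2)    # same formula as GDistMap
d_hypot = np.hypot(xs - g[0], ys - g[1])             # equivalent formulation
```

Because g is a grid point here, the entry in the lower right corner of the map is exactly zero.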

plotting what we have so far

plt.plot(Gxy[0], Gxy[1], 'gs')
plt.plot(Rxy[0], Rxy[1], 'b-')
plt.plot(Sxy[0], Sxy[1], 'r-')
plt.imshow(DG, extent=(0, xlen, 0, ylen))
plt.show()

[figure: distance map DG shown as an image together with the river Spree, the Brandenburg Gate and the satellite path; x axis from 0 to 16, y axis from 0 to 8]

computing d(x, S)

computing d(x, S) means computing Euclidean distances between many points x and a line determined by w

[figure: the line through the points s1 and s2 with direction w, a point x, the vector x − s2, its projection w⟨w, x − s2⟩ onto the line, the closest point x_c on the line, and the distance d(x, S)]

computing d(x, S)

def SDistMap(S, xs, ys):
    ny, nx = xs.shape
    w = S[:,-1] - S[:,0]
    w = w/la.norm(w)
    X = np.vstack((xs.flatten(), ys.flatten())).T - S[:,0]
    p = np.dot(X, w)
    Y = np.outer(p,w) - X
    d = np.sqrt(np.sum(Y**2, axis=1))
    return d.reshape(ny,nx)

DS = SDistMap(Sxy, xs, ys)
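The projection step can be verified on a case where the distances are known by hand. The following sketch applies the same construction to the line y = x; the function name line_distance and the three test points are illustrative. Note that, as in SDistMap, the distance is measured to the infinite line and not to the segment between the two points.

```python
import numpy as np

def line_distance(X, s1, s2):
    """Distances of the rows of X to the infinite line through s1 and s2."""
    w = (s2 - s1) / np.linalg.norm(s2 - s1)   # unit direction of the line
    V = X - s1                                # vectors from s1 to each point
    p = V @ w                                 # signed lengths of the projections
    residual = V - np.outer(p, w)             # components orthogonal to the line
    return np.linalg.norm(residual, axis=1)

# hypothetical test data: the line y = x and three points
s1 = np.array([0.0, 0.0])
s2 = np.array([2.0, 2.0])
X = np.array([[0.0, 2.0],    # distance sqrt(2)
              [1.0, 1.0],    # lies on the line
              [3.0, 0.0]])   # distance 3 / sqrt(2)
d = line_distance(X, s1, s2)
```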

computing d(x, R)

computing d(x, R) means computing Euclidean distances between many points x and a sequence of line segments
we will represent this sequence of line segments using a linear interpolation function y = f(x)
computing d(x, R) then amounts to computing Euclidean distances between two sets of many points

computing d(x, R)

def RDistMap(R, xs, ys):
    ny, nx = xs.shape
    # create an interpolation function using points in R
    # and create array of x values and compute y values
    f = ipl.interp1d(R[0], R[1], 'linear')
    x = np.linspace(R[0].min(), R[0].max(), 3*nx)
    y = f(x)

    # collect x and y values in row matrix Q
    Q = np.vstack((x,y)).T
    X = np.vstack((xs.flatten(), ys.flatten())).T
    D = spt.distance.cdist(Q, X, 'euclidean')
    d = np.min(D, axis=0)
    return d.reshape(ny,nx)

DR = RDistMap(Rxy, xs, ys)
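With 3 · 512 sample points and 512 · 256 grid points, the full distance matrix has about 2 · 10^8 float64 entries, which is roughly 1.6 GB. If memory is a concern, a nearest-neighbor query on a k-d tree returns the same minimum distances without building that matrix. This is a swapped-in alternative, not the approach of the lecture; the function name nearest_sample_distance and the toy data are illustrative.

```python
import numpy as np
import scipy.spatial as spt

def nearest_sample_distance(Q, xs, ys):
    """Distance from every grid point to its nearest row of Q (n x 2)."""
    X = np.column_stack((xs.ravel(), ys.ravel()))
    d, _ = spt.cKDTree(Q).query(X)          # nearest-neighbor distances
    return d.reshape(xs.shape)

# hypothetical polyline samples on y = 0 and a small grid for a sanity check
Q = np.column_stack((np.linspace(0.0, 4.0, 41), np.zeros(41)))
xs, ys = np.meshgrid(np.linspace(0.0, 4.0, 5), np.linspace(2.0, 0.0, 3))

d_tree = nearest_sample_distance(Q, xs, ys)

# brute-force reference, as in RDistMap
X = np.column_stack((xs.ravel(), ys.ravel()))
d_brute = spt.distance.cdist(Q, X, 'euclidean').min(axis=0).reshape(xs.shape)
```

Both variants still measure the distance to the sampled points on the river rather than to the line segments themselves, so the accuracy depends on how densely the polyline is sampled.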

again: distance maps

[figure: four map panels, each showing the river Spree, the Brandenburg Gate and the satellite path; x axis from 0 to 16, y axis from 0 to 8]

computing p(x | G)

we are informed that

$$p(x \mid G) \sim \mathcal{LN}\bigl(d(x, G) \mid \mu_G, \sigma_G\bigr)$$

but know neither µ_G nor σ_G explicitly

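the log-normal distribution

for x ∈ ℝ, x > 0

$$\mathcal{LN}(x \mid \mu, \sigma) = \frac{1}{x\, \sigma \sqrt{2\pi}}\; e^{-\frac{1}{2} \frac{(\ln x - \mu)^2}{\sigma^2}}$$

with mean and mode

$$E[x] = e^{\mu + \frac{\sigma^2}{2}} \qquad x_0 = e^{\mu - \sigma^2}$$

[figure: log-normal density over x with the mode x_0 and the mean E[x] marked]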

assignment

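convince yourself that

$$E[x] = \int_0^\infty x\, \frac{1}{x\, \sigma \sqrt{2\pi}}\; e^{-\frac{1}{2} \frac{(\ln x - \mu)^2}{\sigma^2}}\, dx = e^{\mu + \frac{\sigma^2}{2}}$$

and that

$$x_0 = \operatorname*{argmax}_x \; \frac{1}{x\, \sigma \sqrt{2\pi}}\; e^{-\frac{1}{2} \frac{(\ln x - \mu)^2}{\sigma^2}} = e^{\mu - \sigma^2}$$

computing p(x | G)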

we are informed that

$$4.700 = e^{\mu + \frac{\sigma^2}{2}} \quad\Leftrightarrow\quad \ln(4.700) = \mu + \frac{\sigma^2}{2}$$

$$3.877 = e^{\mu - \sigma^2} \quad\Leftrightarrow\quad \ln(3.877) = \mu - \sigma^2$$

which leads to

$$\sigma^2 = \frac{2}{3} \bigl( \ln(4.700) - \ln(3.877) \bigr)$$

$$\mu = \ln(3.877) + \sigma^2$$

computing p(x | G)

def lognormal_pdf(x, mu, sig):
    num = np.exp(-0.5 * ((np.log(x)-mu) / sig)**2)
    den = x * sig * np.sqrt(2*np.pi)
    return num / den

siG = np.sqrt(2./3. * (np.log(4.700) - np.log(3.877)))
muG = np.log(3.877) + siG**2
pXG = lognormal_pdf(DG, muG, siG)
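The two parameters can be checked by plugging them back into the formulas for mean and mode. The sketch below does that and, as an additional cross-check, uses scipy.stats.lognorm, which parametrizes the log-normal with s = σ and scale = e^µ; the variable names are illustrative, and distances are taken in km as in the lecture code.

```python
import numpy as np
from scipy import stats

mean_km, mode_km = 4.700, 3.877            # given in the job ad, converted to km

sig2 = 2.0 / 3.0 * (np.log(mean_km) - np.log(mode_km))
mu = np.log(mode_km) + sig2

mean_check = np.exp(mu + sig2 / 2.0)       # should reproduce the mean
mode_check = np.exp(mu - sig2)             # should reproduce the mode

# scipy parametrizes the log-normal with s = sigma and scale = exp(mu)
dist = stats.lognorm(s=np.sqrt(sig2), scale=np.exp(mu))
mean_scipy = dist.mean()
```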

computing p(x | R)

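we are informed that

$$p(x \mid R) \sim \mathcal{N}\bigl(d(x, R) \mid 0, \sigma_R\bigr)$$

such that

$$\int_{-2.73}^{2.73} \mathcal{N}(x \mid 0, \sigma_R)\, dx = 0.95$$

but we do not know σ_R

[figure: normal density centered at µ = 0 with the interval from −2.73 to 2.73 marked]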

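observe

standard normal

$$\varphi(x) = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2} x^2} \qquad \Phi(x) = \int_{-\infty}^{x} \varphi(y)\, dy$$

general case

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\; e^{-\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2}} \qquad F(x) = \Phi\!\left(\frac{x - \mu}{\sigma}\right)$$

[figure: density ϕ(x) with the area Φ(x) to the left of x and the area 1 − Φ(x) to the right of x]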

observe

if x ∼ f(x) and z = (x − µ)/σ, then z ∼ ϕ(z)

moreover

$$\int_{-z}^{z} \varphi(y)\, dy = 1 - 2 \bigl( 1 - \Phi(z) \bigr) = 2\, \Phi(z) - 1 = 0.95$$

implies that

$$\Phi(z) = \frac{1.95}{2} = 0.975$$

observe

Φ and its inverse Φ⁻¹ are special functions of extreme importance in statistics
they are both well tabulated and we know that

$$\Phi(z) = 0.975 \quad\Leftrightarrow\quad z \approx 1.96$$

for our problem, where z = x/σ_R, we therefore find

$$\sigma_R = \frac{x}{z} = \frac{2.73}{1.96} \approx 1.39$$

computing p(x | R) and p(x | S)

def normal_pdf(x, mu, sig):
    num = np.exp(-0.5 * ((x-mu)/sig)**2)
    den = sig * np.sqrt(2*np.pi)
    return num / den

siR = 2.73 / 1.96
pXR = normal_pdf(DR, 0., siR)
siS = 2.4 / 1.96
pXS = normal_pdf(DS, 0., siS)
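Instead of looking up z ≈ 1.96 in a table, the quantile can be obtained from scipy.stats.norm.ppf, which evaluates Φ⁻¹. The sketch below recomputes both standard deviations this way and confirms that 95% of the mass of the river distribution falls within ±2.73 km; the variable names are illustrative.

```python
from scipy import stats

z = stats.norm.ppf(0.975)          # inverse of Phi at 0.975, about 1.96
sigma_R = 2.73 / z                 # river: 95% within +/- 2.73 km
sigma_S = 2.40 / z                 # satellite: 95% within +/- 2.4 km

# check: probability mass of N(0, sigma_R) within +/- 2.73
mass_R = (stats.norm.cdf(2.73, scale=sigma_R)
          - stats.norm.cdf(-2.73, scale=sigma_R))
```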

conditional probabilities p(x | G), p(x | R), p(x | S)

[figure: four map panels, each showing the river Spree, the Brandenburg Gate and the satellite path; x axis from 0 to 16, y axis from 0 to 8]

therefore, p(x | G, R, S) ∝ p(x | G) · p(x | R) · p(x | S)

[figure: map showing the river Spree, the Brandenburg Gate, the satellite path and the candidate location; x axis from 0 to 16, y axis from 0 to 8]

computing map and GPS coordinates of candidate

pXGRS = pXG * pXR * pXS
k = np.argmax(pXGRS)
i, j = divmod(k, nx)
xy = np.vstack((x[j], y[i]))
ll = xy2ll(xy)

print(xy)
[[ 12.2739726 ]
 [ 5.30196078]]

print(ll)
[[ 52.51163782]
 [ 13.45506536]]
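The step from the flat index k to the row and column indices can also be written with np.unravel_index, which takes the array shape into account directly. The sketch below compares both variants on a hypothetical 3 x 4 array that stands in for the posterior map; the array values are made up.

```python
import numpy as np

# hypothetical 3 x 4 array standing in for the posterior map
P = np.array([[0.1, 0.2, 0.1, 0.0],
              [0.3, 0.9, 0.2, 0.1],
              [0.0, 0.4, 0.5, 0.2]])

k = np.argmax(P)                          # index into the flattened array
i_div, j_div = divmod(k, P.shape[1])      # row, column via divmod
i, j = np.unravel_index(k, P.shape)       # same result, shape-aware
```

Both variants rely on the row-major layout that numpy uses by default, so the row index selects the y value and the column index selects the x value.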

final result

it appears as if the next top analyst could be found in Grünbergerstr. 46, 10245 Berlin

note

we did not use a single for loop
we did not discuss pretty plotting
however, if we were to present our result(s) to management, outstanding visualization would have to be among our top priorities