Dimensionality reduction (Loginom Wiki). Methods of reducing the dimensionality of data in analysis


In multivariate statistical analysis, each object is described by a vector whose dimensionality is arbitrary (but the same for all objects). A person, however, can directly perceive only numerical data or points on a plane. Analyzing clusters of points in three-dimensional space is already much harder, and direct perception of data of higher dimensionality is impossible. It is therefore natural to want to pass from a multivariate sample to data of low dimensionality, so that they "can be looked at".

Besides the desire for visual clarity, there are other motives for reducing dimensionality. Factors on which the variable of interest to the researcher does not depend only hinder statistical analysis. First, resources are spent on collecting information about them. Second, as can be shown, their inclusion in the analysis worsens the properties of statistical procedures (in particular, it inflates the variance of estimates of parameters and characteristics of distributions). It is therefore desirable to get rid of such factors.

In connection with dimensionality reduction, consider the example of using regression analysis to forecast sales volume discussed in Section 3.2.3. First, in that example the number of independent variables was reduced from 17 to 12. Second, it proved possible to construct a new factor, a linear function of the 12 remaining factors, which forecasts sales volume better than any other linear combination of the factors. One can therefore say that the dimensionality of the problem was ultimately reduced from 18 to 2: one independent factor remained (the linear combination specified in Section 3.2.3) and one dependent factor, sales volume.

In the analysis of multivariate data one usually considers not one but a whole set of problems, in particular choosing the independent and dependent variables in different ways. The dimensionality reduction problem can therefore be stated as follows: given a multivariate sample, pass to a collection of vectors of lower dimensionality while preserving the structure of the original data as far as possible and, if possible, without losing the information contained in the data. The task is made concrete within each specific dimensionality reduction method.

The principal component method is one of the most frequently used dimensionality reduction methods. Its basic idea is to successively identify the directions in which the data have the greatest spread. Let the sample consist of vectors identically distributed with the vector X = (x(1), x(2), …, x(n)). Consider the linear combinations

Y(λ(1), λ(2), …, λ(n)) = λ(1)x(1) + λ(2)x(2) + … + λ(n)x(n),

where λ²(1) + λ²(2) + … + λ²(n) = 1. Here the vector λ = (λ(1), λ(2), …, λ(n)) lies on the unit sphere in n-dimensional space.

In the principal component method one first finds the direction of maximum spread, i.e. the λ at which the variance of the random variable Y(λ) = Y(λ(1), λ(2), …, λ(n)) attains its maximum. This λ defines the first principal component, and the value Y(λ) is the projection of the random vector X onto the axis of the first principal component.

Then, in the language of linear algebra, one considers the hyperplane in n-dimensional space that is orthogonal to the first principal component and projects all the sample elements onto it. The dimensionality of this hyperplane is one less than that of the original space.

On this hyperplane the procedure is repeated: the direction of greatest spread is found, i.e. the second principal component. Then a hyperplane orthogonal to the first two principal components is considered; its dimensionality is two less than that of the original space. The iterations continue in the same way.

From the standpoint of linear algebra, this amounts to constructing a new basis of the n-dimensional space whose unit vectors are the principal components.

The variance accounted for by each successive principal component is smaller than for the preceding one. One usually stops when it falls below a threshold set in advance. If k principal components are selected, this means that the n-dimensional space has been replaced by a k-dimensional one, i.e. the dimensionality has been reduced from n to k with practically no distortion of the structure of the original data.

For visual analysis of data, projections of the original vectors onto the plane of the first two principal components are often used. As a rule, the structure of the data is clearly visible: compact clusters of objects and isolated outlying vectors stand out.
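As a hedged illustration of this kind of visualization (not taken from the article itself), the Python sketch below projects a synthetic multivariate sample onto the plane of its first two principal components with scikit-learn; the data and names are invented for the example.

```python
# Minimal sketch: project a multivariate sample onto the plane of the
# first two principal components for visual inspection (synthetic data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 objects described by 10 features
X[:, 0] += 3 * X[:, 1]                  # introduce some correlated structure

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                # coordinates in the plane of the first two components

print("explained variance ratios:", pca.explained_variance_ratio_)
# Z[:, 0] and Z[:, 1] can now be drawn as a scatter plot to look for
# compact clusters of objects and isolated outlying vectors.
```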

The principal component method is one of the methods of factor analysis. What the various factor-analysis algorithms have in common is that they all pass to a new basis of the original n-dimensional space. The notion of "factor loading" is important: it describes the role of an original factor (variable) in forming a given vector of the new basis.

The new idea that factor analysis adds to the principal component method is that, on the basis of the loadings, the factors are split into groups. Factors that have a similar effect on the elements of the new basis are combined into one group. From each group it is then recommended to keep a single representative. Instead of choosing a representative, one can compute a new factor that is central to the group in question. Dimensionality is reduced by passing to the system of factors that represent the groups; the remaining factors are discarded.

The procedure described can be carried out not only by means of factor analysis. What is involved is essentially cluster analysis of features (factors, variables). Various cluster-analysis algorithms can be used to split the features into groups; it suffices to introduce a distance (proximity measure, dissimilarity index) between features. Let X and Y be two features. The dissimilarity d(X, Y) between them can be measured with the help of sample correlation coefficients:

d1(X, Y) = 1 - |rn(X, Y)|,   d2(X, Y) = 1 - |ρn(X, Y)|,

where rn(X, Y) is the sample Pearson linear correlation coefficient and ρn(X, Y) is the sample Spearman rank correlation coefficient.
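A minimal Python sketch of these two dissimilarity measures, assuming SciPy is available and using two synthetic features (the names x and y are illustrative):

```python
# Compute the feature dissimilarities d1 and d2 from sample Pearson and
# Spearman correlation coefficients (synthetic, correlated features).
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(1)
x = rng.normal(size=100)                          # feature X
y = 0.7 * x + rng.normal(scale=0.5, size=100)     # feature Y, correlated with X

r_n, _ = pearsonr(x, y)                           # sample linear correlation
rho_n, _ = spearmanr(x, y)                        # sample rank correlation

d1 = 1 - abs(r_n)                                 # dissimilarity via Pearson correlation
d2 = 1 - abs(rho_n)                               # dissimilarity via Spearman correlation
print("d1 =", d1, "d2 =", d2)
```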

Multidimensional scaling. A large class of multidimensional scaling methods is based on the use of distances (proximity measures, dissimilarity indices) d(X, Y) between features X and Y. The basic idea of this class of methods is to represent each object by a point in a geometric space (usually of dimension 1, 2 or 3) whose coordinates are the values of hidden (latent) factors that together describe the object adequately. The relations between objects are thereby replaced by relations between the points that represent them; in particular, data on the similarity or dissimilarity of objects are represented by the distances between the corresponding points.

In practice a number of different multidimensional scaling models are used. All of them face the problem of estimating the true dimensionality of the factor space. Let us consider this problem using the example of processing data on the similarity of objects by means of metric scaling.

Let there be n objects O(1), O(2), …, O(n), and for each pair of objects O(i), O(j) let a measure of their similarity s(i, j) be given, with s(i, j) = s(j, i). How the numbers s(i, j) were obtained does not matter for the description of the algorithm: they may have come from direct measurement, from expert judgement, from computation over a set of descriptive characteristics, or in some other way.

In Euclidean space the n objects under consideration can be represented by a configuration of n points, the proximity of the representative points being described by the Euclidean distances d(i, j) between them. The degree of correspondence between the set of objects and the set of points representing them is determined by comparing the similarity matrix ||s(i, j)|| with the distance matrix ||d(i, j)||. The metric similarity functional has the form

S = Σ |s(i, j) - d(i, j)|,

where the sum is taken over all pairs of objects. The geometric configuration is to be chosen so that the functional S attains its smallest value.
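As an illustration only (not the algorithm used in the works discussed here), the sketch below fits a two-dimensional configuration with scikit-learn's metric MDS, which minimizes a closely related squared-stress criterion rather than S itself, and then evaluates the functional S for the resulting configuration; the proximity matrix is synthetic.

```python
# Fit a 2-D configuration of point-representatives and evaluate the
# metric similarity functional S = sum |s(i, j) - d(i, j)| over pairs.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(2)
objects = rng.normal(size=(20, 5))               # hypothetical objects in 5-D
s = squareform(pdist(objects))                   # proximities s(i, j) (here: true distances)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(s)                    # 2-D point-representatives

d = squareform(pdist(coords))                    # distances between the representatives
S = np.abs(s - d)[np.triu_indices_from(s, k=1)].sum()
print("S =", S)
```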

Remark. In non-metric scaling, instead of the proximity measures themselves, only the ordering on the set of proximity measures and the ordering on the set of distances between the representative points are considered. Instead of the functional S, analogues of the Spearman and Kendall rank correlation coefficients are used. In other words, non-metric scaling assumes only that the proximity measures are given on an ordinal scale.

Let the Euclidean space have dimensionality m, and consider α(m), the minimum mean squared error, where the minimum is taken over all possible configurations of n points in m-dimensional Euclidean space. It can be shown that this minimum is attained at some configuration. As m grows, the value of α(m) changes monotonically (more precisely, it does not increase), and for m > n - 1 it equals 0 (if s(i, j) is a metric). To maximize the possibilities of meaningful interpretation one would like to work in a space of the smallest possible dimensionality, yet the dimensionality must be chosen so that the points represent the objects without large distortions. The question arises: how should the dimensionality of the space, i.e. the natural number m, be chosen optimally?

Within a deterministic analysis of the data there is no well-founded answer to this question. Hence one has to study the behaviour of α(m) within certain probabilistic models. If the proximity measures s(i, j) are random variables whose distribution depends on a "true dimensionality" m0 (and possibly on some other parameters), then one can pose the problem of estimating m0 in the classical mathematical-statistical style, look for consistent estimators, and so on.

Let us construct probabilistic models. Assume that the objects are points in a Euclidean space of dimensionality k, where k is fairly large. That the "true dimensionality" equals m0 means that all these points lie on a hyperplane of dimensionality m0. Assume for definiteness that the set of points under consideration is a sample from a circular normal distribution with variance σ²(0). This means that the objects O(1), O(2), …, O(n) are mutually independent random vectors, each of the form ζ(1)e(1) + ζ(2)e(2) + … + ζ(m0)e(m0), where e(1), e(2), …, e(m0) is an orthonormal basis of the subspace of dimensionality m0 in which the points lie, and ζ(1), ζ(2), …, ζ(m0) are mutually independent one-dimensional normal random variables with zero mean and variance σ²(0).

Let us consider two models for generating the proximity measures s(i, j). In the first of them, the s(i, j) differ from the Euclidean distances between the points because the points are known only with distortions. Let c(1), c(2), …, c(n) be the points under consideration. Then

s(i, j) = d(c(i) + ε(i), c(j) + ε(j)), i, j = 1, 2, …, n,

where d is the Euclidean distance between points in k-dimensional space and the vectors ε(1), ε(2), …, ε(n) form a sample from the circular normal distribution in k-dimensional space with zero mean and covariance matrix σ²(1)I, where I is the identity matrix. In other words, ε(i) = η(1)e(1) + η(2)e(2) + … + η(k)e(k), where e(1), e(2), …, e(k) is an orthonormal basis of the k-dimensional space and {η(i, t), i = 1, 2, …, n, t = 1, 2, …, k} is a collection of mutually independent identically distributed one-dimensional random variables with zero mean and variance σ²(1).

In the second model the distortions are imposed directly on the distances themselves:

s(i, j) = d(c(i), c(j)) + ε(i, j), i, j = 1, 2, …, n, i ≠ j,

where {ε(i, j), i, j = 1, 2, …, n} are mutually independent normal random variables with zero mean and variance σ²(1).

It has been shown in earlier work that, for both formulations of the model, the minimum mean squared error α(m) tends, as n → ∞, to

f(m) = f1(m) + σ²(1)(k - m), m = 1, 2, …, k.

Thus the function f(m) is linear on the intervals [1, m0] and [m0, k], and on the first interval it decreases faster than on the second. It follows that the statistic

m* = Arg max (α(m+1) - 2α(m) + α(m-1))

is a consistent estimator of the true dimensionality m0.

Thus the probabilistic theory yields a definite recommendation: use m* as the estimate of the dimensionality of the factor space. Notably, a similar recommendation was formulated, as a heuristic, by one of the founders of multidimensional scaling, J. Kruskal, who proceeded from his practical experience with multidimensional scaling and from computational experiments. The probabilistic theory made it possible to put this heuristic recommendation on a rigorous footing.
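A small sketch of this selection rule, with a synthetic sequence of α(m) values standing in for the errors that would be obtained by fitting configurations of each candidate dimensionality:

```python
# Estimate the "true" dimensionality m0 from the sequence alpha[m],
# m = 1..k, by locating the kink where the decrease switches from fast
# to slow, i.e. the largest second difference alpha[m+1] - 2*alpha[m] + alpha[m-1].
import numpy as np

k, m0_true = 10, 4
m = np.arange(1, k + 1)
alpha = np.where(m <= m0_true,
                 5.0 * (m0_true - m) + 1.0 * (k - m),   # fast decrease up to m0
                 1.0 * (k - m))                          # slow decrease afterwards
alpha = alpha + np.random.default_rng(3).normal(scale=0.1, size=k)  # sampling noise

second_diff = alpha[2:] - 2 * alpha[1:-1] + alpha[:-2]
m_star = m[1:-1][np.argmax(second_diff)]                 # dimensionality at the sharpest kink
print("estimated dimensionality m* =", m_star)
```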


Keywords

MATHEMATICS / APPLIED STATISTICS / MATHEMATICAL STATISTICS / GROWTH POINTS / PRINCIPAL COMPONENT ANALYSIS / FACTOR ANALYSIS / MULTIDIMENSIONAL SCALING / ESTIMATION OF DATA DIMENSION / ESTIMATION OF MODEL DIMENSION

Abstract of a scientific article on mathematics. Authors: Alexander Ivanovich Orlov, Evgeny Veniaminovich Lutsenko.

One of the "growth points" of applied statistics is the set of methods for reducing the dimensionality of statistical data. They are increasingly used in the analysis of data in specific applied research, for example in sociology. This article examines the most promising methods of dimensionality reduction. The principal component method is one of the most frequently used. For visual analysis of data, projections of the original vectors onto the plane of the first two principal components are often used; as a rule the structure of the data is then clearly visible, with compact clusters of objects and isolated outlying vectors standing out. The principal component method is one of the methods of factor analysis. The new idea that factor analysis adds to the principal component method is that, on the basis of the loadings, the factors are split into groups: factors that have a similar effect on the elements of the new basis are combined into one group, and from each group a single representative is kept. Instead of choosing a representative, one can compute a new factor that is central to the group in question; dimensionality is reduced by passing to the system of factors that represent the groups, and the remaining factors are discarded. A large class of multidimensional scaling methods is based on the use of distances (proximity measures, dissimilarity indices) between features. The basic idea of this class of methods is to represent each object by a point in a geometric space (usually of dimension 1, 2 or 3) whose coordinates are the values of hidden (latent) factors that together describe the object adequately. As an example of the application of probabilistic-statistical modelling and the results of statistics of non-numerical data, the consistency of an estimator of data dimensionality in multidimensional scaling, earlier proposed by Kruskal on heuristic grounds, is justified. A number of works on estimating the dimensionality of models (in regression analysis and in the theory of classification) are reviewed. Some information is also given on the dimensionality reduction algorithms used in automated system-cognitive analysis.

Similar topics of scientific works on mathematics. Authors: Alexander Ivanovich Orlov, Evgeny Veniaminovich Lutsenko

  • Mathematical methods in sociology over forty-five years

  • The diversity of objects of non-numerical nature

  • Parameter estimation: one-step estimates are preferable to maximum likelihood estimates

  • Applied statistics: prospects for development

    2016 / Orlov Alexander Ivanovich
  • The state and prospects of development of applied and theoretical statistics

    2016 / Orlov Alexander Ivanovich
  • The relationship between limit theorems and the Monte Carlo method

    2015 / Orlov Alexander Ivanovich
  • On the development of statistics of objects of non-numerical nature

    2013 / Orlov Alexander Ivanovich
  • Growth points of statistical methods

    2014 / Orlov Alexander Ivanovich
  • On new promising mathematical tools for controlling

    2015 / Orlov Alexander Ivanovich
  • Distances in spaces of statistical data

    2014 / Orlov Alexander Ivanovich

One of the "growth periods" applied statistics is methods of reducing the dimension of statistical data. The stench is friendly in the analysis of data in specific applied research, as well as sociology. We investigate the most promising methods to reduce the dimensionality. Overhead components is one of the most overhead methods to get down to changing dimensionality. For Visual Analysis of Data, there are projects of original vectors on the plan of the first two main components. Sound like structures - clearly marked, highly illuminated compact clusters of objects and okremoly spread vectors. The main components are one factor analysis method. The new idea is factor analysis in comparison with method principal components is that, based on loads, factors breaks up into groups. In one group of factors, the new factor is combined with similar impact on elements of the new basis. Such a group is the sum of one representative. Actual hour, we see that vikonati vibirkovu assessment, as a new factor, which is central to the group in nutrition. A change in dimension occurs during the transition to the factors system, i.e. representatives of groups. Other factory є discarded. Proximity measures, indicators of differences between features and extensive class are based methods of multidimensional scaling . The main idea of ​​the category of methods is to give an object, like a geometric space item (calculate dimension 1, 2, or 3), like coordinates є the values ​​of the external (latent) factors, like to name the object. As an application of probabilistic and statistical modeling and results of statistics of nonnumerical data, we justify the consistency of estimators of dimension of the data in multidimensional scaling , which are proposed previously by Kruskal from heuristic considerations. The stench was ranked in the last row of dimensions of models (in regression analysis and in theory of classification). We also give some information about algorithms for reducing the dimensionality in the automated system-cognitive analysis

The text of the scientific work on the topic "Methods of reducing the dimension of the space of statistical data"

UDC 519.2:005.521:633.1:004.8

01.00.00 Physical and mathematical sciences

METHODS OF REDUCING THE DIMENSION OF THE SPACE OF STATISTICAL DATA

Orlov Alexander Ivanovich
Dr.Sci.Econ., Dr.Sci.Tech., Cand.Phys.-Math.Sci., professor
RSCI SPIN-code: 4342-4994
Bauman Moscow State Technical University, Russia, 105005, Moscow, 2-ya Baumanskaya st., 5, [email protected]

Lutsenko Evgeny Veniaminovich
Dr.Sci.Econ., Cand.Tech.Sci., professor
RSCI SPIN-code: 9523-7101
Kuban State Agrarian University, Krasnodar, Russia, [email protected]

One of the "growth points" of applied statistics is the set of methods for reducing the dimensionality of statistical data. They are increasingly used in the analysis of data in specific applied research, for example in sociology. This article examines the most promising methods of dimensionality reduction. The principal component method is one of the most frequently used. For visual analysis of data, projections of the original vectors onto the plane of the first two principal components are often used; as a rule the structure of the data is then clearly visible, with compact clusters of objects and isolated outlying vectors standing out. The principal component method is one of the methods of factor analysis. The new idea that factor analysis adds to the principal component method is that, on the basis of the loadings, the factors are split into groups: factors that have a similar effect on the elements of the new basis are combined into one group, and from each group a single representative is kept. Instead of choosing a representative, one can compute a new factor that is central to the group in question; dimensionality is reduced by passing to the system of factors that represent the groups, and the remaining factors are discarded. A large class of multidimensional scaling methods is based on the use of distances (proximity measures, dissimilarity indices) between features. The basic idea of this class of methods is to represent each object by a point in a geometric space (usually of dimension 1, 2 or 3) whose coordinates are the values of hidden (latent) factors that together describe the object adequately. As an example of the application of probabilistic-statistical modelling and the results of statistics of non-numerical data, we justify the consistency of an estimator of data dimensionality in multidimensional scaling that was earlier proposed by Kruskal on heuristic grounds. A number of works on estimating the dimensionality of models (in regression analysis and in the theory of classification) are reviewed. Some information is also given on the dimensionality reduction algorithms used in automated system-cognitive analysis.

Keywords: MATHEMATICS, APPLIED STATISTICS, MATHEMATICAL STATISTICS, GROWTH POINTS, PRINCIPAL COMPONENT ANALYSIS, FACTOR ANALYSIS, MULTIDIMENSIONAL SCALING, ESTIMATION OF DATA DIMENSION, ESTIMATION OF MODEL DIMENSION

1. Introduction

As already noted, one of the "growth points" of applied statistics is the set of methods for reducing the dimensionality of the space of statistical data. They are increasingly used in the analysis of data in specific applied research, for example in sociology. Let us examine the most promising methods of dimensionality reduction. As an example of the application of probabilistic-statistical modelling and the results of statistics of non-numerical data, we justify the consistency of an estimator of dimensionality in multidimensional scaling that was earlier proposed by Kruskal on heuristic grounds.

In multivariate statistical analysis, each object is described by a vector whose dimensionality is arbitrary (but the same for all objects). A person, however, can directly perceive only numerical data or points on a plane. Analyzing clusters of points in three-dimensional space is already much harder, and direct perception of data of higher dimensionality is impossible. It is therefore natural to want to pass from a multivariate sample to data of low dimensionality, so that they "can be looked at". For example, a marketer may want to know how many distinct types of consumer behaviour there are (i.e. how many market segments it makes sense to distinguish) and which consumers, with which properties, belong to each of them.

Besides the desire for visual clarity, there are other motives for reducing dimensionality. Factors on which the variable of interest to the researcher does not depend only hinder statistical analysis. First, financial, time and personnel resources are spent on collecting information about them. Second, as can be shown, their inclusion in the analysis worsens the properties of statistical procedures (in particular, it inflates the variance of estimates of parameters and characteristics of distributions). It is therefore desirable to get rid of such factors.

In the analysis of multivariate data one usually considers not one but a whole set of problems, in particular choosing the independent and dependent variables in different ways. The dimensionality reduction problem can therefore be stated as follows: given a multivariate sample, pass to a collection of vectors of lower dimensionality while preserving the structure of the original data as far as possible and, if possible, without losing the information contained in the data. The task is made concrete within each specific dimensionality reduction method.

2. Method of principal components

This is one of the most frequently used dimensionality reduction methods. Its basic idea is to successively identify the directions in which the data have the greatest spread. Let the sample consist of vectors identically distributed with the vector X = (x(1), x(2), …, x(n)). Consider the linear combinations

Y(λ(1), λ(2), …, λ(n)) = λ(1)x(1) + λ(2)x(2) + … + λ(n)x(n),

where λ²(1) + λ²(2) + … + λ²(n) = 1. Here the vector λ = (λ(1), λ(2), …, λ(n)) lies on the unit sphere in n-dimensional space.

In the principal component method one first finds the direction of maximum spread, i.e. the λ at which the variance of the random variable Y(λ) = Y(λ(1), λ(2), …, λ(n)) attains its maximum. This λ defines the first principal component, and the value Y(λ) is the projection of the random vector X onto the axis of the first principal component.

Then, in the language of linear algebra, one considers the hyperplane in n-dimensional space that is orthogonal to the first principal component and projects all the sample elements onto it. The dimensionality of this hyperplane is one less than that of the original space.

On this hyperplane the procedure is repeated: the direction of greatest spread is found, i.e. the second principal component. Then a hyperplane orthogonal to the first two principal components is considered; its dimensionality is two less than that of the original space. The iterations continue in the same way.

From the standpoint of linear algebra, this amounts to constructing a new basis of the n-dimensional space whose unit vectors are the principal components.

The variance accounted for by each successive principal component is smaller than for the preceding one. One usually stops when it falls below a threshold set in advance. If k principal components are selected, this means that the n-dimensional space has been replaced by a k-dimensional one, i.e. the dimensionality has been reduced from n to k with practically no distortion of the structure of the original data.

For visual analysis of data, projections of the original vectors onto the plane of the first two principal components are often used. As a rule, the structure of the data is clearly visible: compact clusters of objects and isolated outlying vectors stand out.

3. Factor analysis

The principal component method is one of the methods of factor analysis. What the various factor-analysis algorithms have in common is that they all pass to a new basis of the original n-dimensional space. The notion of "factor loading" is important: it describes the role of an original factor (variable) in forming a given vector of the new basis.

The new idea that factor analysis adds to the principal component method is that, on the basis of the loadings, the factors are split into groups. Factors that have a similar effect on the elements of the new basis are combined into one group. From each group it is then recommended to keep a single representative. Instead of choosing a representative, one can compute a new factor that is central to the group in question. Dimensionality is reduced by passing to the system of factors that represent the groups; the remaining factors are discarded.

The procedure described can be carried out not only by means of factor analysis. What is involved is essentially cluster analysis of features (factors, variables). Various cluster-analysis algorithms can be used to split the features into groups; it suffices to introduce a distance (proximity measure, dissimilarity index) between features. Let X and Y be two features. The dissimilarity d(X, Y) between them can be measured with the help of sample correlation coefficients:

d1(X, Y) = 1 - |rn(X, Y)|,   d2(X, Y) = 1 - |ρn(X, Y)|,

where rn(X, Y) is the sample Pearson linear correlation coefficient and ρn(X, Y) is the sample Spearman rank correlation coefficient.

4. Multidimensional scaling

A large class of multidimensional scaling methods is based on the use of distances (proximity measures, dissimilarity indices) d(X, Y) between features X and Y. The basic idea of this class of methods is to represent each object by a point in a geometric space (usually of dimension 1, 2 or 3) whose coordinates are the values of hidden (latent) factors that together describe the object adequately. The relations between objects are thereby replaced by relations between the points that represent them; in particular, data on the similarity or dissimilarity of objects are represented by the distances between the corresponding points.

5. The problem of estimating the dimensionality of the factor space

In the practical analysis of sociological data a number of different multidimensional scaling models are used. All of them face the problem of estimating the true dimensionality of the factor space. Let us consider this problem using the example of processing data on the similarity of objects by means of metric scaling.

Let there be n objects O(1), O(2), …, O(n), and for each pair of objects O(i), O(j) let a measure of their similarity s(i, j) be given, with s(i, j) = s(j, i). How the numbers s(i, j) were obtained does not matter for the description of the algorithm: they may have come from direct measurement, from expert judgement, from computation over a set of descriptive characteristics, or in some other way.

In Euclidean space the n objects under consideration can be represented by a configuration of n points, the proximity of the representative points being described by the Euclidean distances d(i, j) between them. The degree of correspondence between the set of objects and the set of points representing them is determined by comparing the similarity matrix ||s(i, j)|| with the distance matrix ||d(i, j)||. The metric similarity functional has the form

S = Σ |s(i, j) - d(i, j)|,

where the sum is taken over all pairs of objects. The geometric configuration is to be chosen so that the functional S attains its smallest value.

Remark. In non-metric scaling, instead of the proximity measures themselves, only the ordering on the set of proximity measures and the ordering on the set of distances between the representative points are considered. Instead of the functional S, analogues of the Spearman and Kendall rank correlation coefficients are used. In other words, non-metric scaling assumes only that the proximity measures are given on an ordinal scale.

Let the Euclidean space have dimensionality m, and consider α(m), the minimum mean squared error, where the minimum is taken over all possible configurations of n points in m-dimensional Euclidean space. It can be shown that this minimum is attained at some configuration. As m grows, the value of α(m) changes monotonically (more precisely, it does not increase), and for m > n - 1 it equals 0 (if s(i, j) is a metric). To maximize the possibilities of meaningful interpretation one would like to work in a space of the smallest possible dimensionality, yet the dimensionality must be chosen so that the points represent the objects without large distortions. The question arises: how should the dimensionality of the space, i.e. the natural number m, be chosen optimally?

6. Models and methods for estimating data dimensionality

Within a deterministic analysis of the data there is no well-founded answer to this question. Hence one has to study the behaviour of α(m) within certain probabilistic models. If the proximity measures s(i, j) are random variables whose distribution depends on a "true dimensionality" m0 (and possibly on some other parameters), then one can pose the problem of estimating m0 in the classical mathematical-statistical style, look for consistent estimators, and so on.

Let us construct probabilistic models. Assume that the objects are points in a Euclidean space of dimensionality k, where k is fairly large. That the "true dimensionality" equals m0 means that all these points lie on a hyperplane of dimensionality m0. Assume for definiteness that the set of points under consideration is a sample from a circular normal distribution with variance σ²(0). This means that the objects O(1), O(2), …, O(n) are mutually independent random vectors of the form

ζ(1)e(1) + ζ(2)e(2) + … + ζ(m0)e(m0),

where e(1), e(2), …, e(m0) is an orthonormal basis of the subspace of dimensionality m0 in which the points lie, and ζ(1), ζ(2), …, ζ(m0) are mutually independent one-dimensional normal random variables with zero mean and variance σ²(0).

Let us consider two models for generating the proximity measures s(i, j). In the first of them, the s(i, j) differ from the Euclidean distances between the points because the points are known only with distortions. Let c(1), c(2), …, c(n) be the points under consideration. Then

s(i, j) = d(c(i) + ε(i), c(j) + ε(j)), i, j = 1, 2, …, n,

where d is the Euclidean distance between points in k-dimensional space and the vectors ε(1), ε(2), …, ε(n) form a sample from the circular normal distribution in k-dimensional space with zero mean and covariance matrix σ²(1)I, where I is the identity matrix. In other words, ε(i) = η(1)e(1) + η(2)e(2) + … + η(k)e(k), where e(1), e(2), …, e(k) is an orthonormal basis of the k-dimensional space and {η(i, t), i = 1, 2, …, n, t = 1, 2, …, k} is a collection of mutually independent identically distributed one-dimensional random variables with zero mean and variance σ²(1).

In the second model the distortions are imposed directly on the distances themselves:

s(i, j) = d(c(i), c(j)) + ε(i, j), i, j = 1, 2, …, n, i ≠ j,

where {ε(i, j), i, j = 1, 2, …, n} are mutually independent normal random variables with zero mean and variance σ²(1).

It has been shown in earlier work that, for both formulations of the model, the minimum mean squared error α(m) tends, as n → ∞, to

f(m) = f1(m) + σ²(1)(k - m), m = 1, 2, …, k.

Thus the function f(m) is linear on the intervals [1, m0] and [m0, k], and on the first interval it decreases faster than on the second. It follows that the statistic

m* = Arg max (α(m+1) - 2α(m) + α(m-1))

is a consistent estimator of the true dimensionality m0.

Thus the probabilistic theory yields a definite recommendation: use m* as the estimate of the dimensionality of the factor space. Notably, a similar recommendation was formulated, as a heuristic, by one of the founders of multidimensional scaling, J. Kruskal, who proceeded from his practical experience with multidimensional scaling and from computational experiments. The probabilistic theory made it possible to put this heuristic recommendation on a rigorous footing.

7. Evaluation of the dimensionality of the model

If, in the statement of an applied problem, one can specify a family of models of increasing dimensionality, for example when estimating the degree of a polynomial, then it is natural to introduce the term "dimensionality of the model" (this notion is in many ways analogous to the notion of data dimensionality used in multidimensional scaling). The author of this article has produced a number of works on estimating the dimensionality of a model, which it is appropriate to compare with the works on estimating data dimensionality considered above.

The first such work was carried out by the author of this article during a research stay in France in 1976. It studied one estimator of model dimensionality in regression, namely an estimator of the degree of a polynomial under the assumption that the dependence is described by a polynomial. This estimator was already known in the literature, but it was later mistakenly attributed to the author of this article, who had only studied its properties, establishing in particular that it is not consistent, and finding its limiting geometric distribution. Other, consistent estimators of the dimensionality of a regression model were then proposed and studied; this cycle of work was completed by a paper containing a number of clarifications.
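The specific estimators studied in those works are not reproduced here. Purely as a generic illustration of choosing the degree of a regression polynomial (and not the estimator discussed in the paper), the sketch below picks a degree by cross-validation on synthetic data:

```python
# Pick the degree of a regression polynomial by 5-fold cross-validation
# (illustrative only; the data and the true degree are synthetic).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=120)
y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.3, size=x.size)   # true degree 3
X = x.reshape(-1, 1)

degrees = range(1, 8)
scores = [cross_val_score(make_pipeline(PolynomialFeatures(degree=d), LinearRegression()),
                          X, y, cv=5, scoring="neg_mean_squared_error").mean()
          for d in degrees]

best_degree = list(degrees)[int(np.argmax(scores))]
print("degree chosen by cross-validation:", best_degree)
```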

The most recent publication on this topic includes a discussion of results, obtained by the Monte Carlo method, on the behaviour of the corresponding limit theorems.

Analogous approaches to estimating the dimensionality of a model in the problem of mixture separation (a part of classification theory) are considered in a separate article.

The estimators of data dimensionality in multidimensional scaling considered above were studied in earlier works of the author, in which the limiting behaviour of the characteristics of the principal component method was established with the help of the asymptotic theory of the behaviour of solutions of extremal statistical problems.

8. Dimensionality reduction algorithms in automated system-cognitive analysis

In automated system-cognitive analysis (ASC-analysis), proposed and implemented in the "Eidos" system, yet another dimensionality reduction method is used. It is described in sections 4.2, "Description of the algorithms of the basic cognitive operations of system analysis (BCOSA)", and 4.3, "Detailed BCOSA (ASC-analysis) algorithms". Below is a short description of two of the algorithms, BCOSA-4.1 and BCOSA-4.2.

BCOSA-4.1, "Abstraction of factors (reducing the dimensionality of the semantic space of factors)".

Using the method of successive approximations (an iterative algorithm), and subject to specified boundary conditions, the dimensionality of the attribute space is reduced without a significant decrease in its volume. The criterion for stopping the iterative process is that one of the boundary conditions is reached.

BCOSA-4.2, "Abstraction of classes (reducing the dimensionality of the semantic space of classes)".

Using the method of successive approximations (an iterative algorithm), and subject to specified boundary conditions, the dimensionality of the space of classes is reduced without a significant decrease in its volume. The criterion for stopping the iterative process is that one of the boundary conditions is reached.

All the algorithms actually implemented in the "Eidos" system, in the version that existed at the time that work was prepared (the 2002 version), are listed here: http://lc.kubagro.ru/aidos/aidos02/4.3.htm

The essence of the algorithms is as follows.

1. The amount of information that the values of the factors carry about the object's transition into the states corresponding to the classes is calculated.

2. The value of each factor value for differentiating objects by class is calculated. This value is the variability of the informativeness of the factor's values (there are many quantitative measures of variability: the mean deviation from the mean, the standard deviation, and others). In other words, if a factor value on average carries little information about whether or not an object belongs to the classes, that value is of little worth; if it carries a lot of such information, it is valuable.

3. The value of the descriptive scales for differentiating objects by class is calculated. In the works of E.V. Lutsenko this is currently computed as the average of the values of the scale's gradations.

4. Pareto optimization of the factor values and of the descriptive scales is then carried out (a rough sketch of this kind of cut-off is given after this list):

The factor values (gradations of the descriptive scales) are ranked in order of decreasing value, and the least valuable ones, those lying to the right of the point where a 45° tangent touches the Pareto curve, are removed from the model;

The factors (descriptive scales) are ranked in order of decreasing value, and the least valuable ones, those lying to the right of the point where a 45° tangent touches the Pareto curve, are removed from the model.
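A rough sketch of such a Pareto-style cut-off follows; the value scores are invented, and the 45° rule is approximated by keeping items while the normalized cumulative-value curve still rises more steeply than the 45° line, which is only an approximation of the BCOSA procedure.

```python
# Rank factor values by their "value" score, build the normalized
# cumulative (Pareto) curve, and cut where its slope falls below 45 degrees.
import numpy as np

value = np.array([9.0, 7.5, 6.0, 3.0, 1.5, 1.0, 0.6, 0.3, 0.1])   # hypothetical scores
order = np.argsort(value)[::-1]                     # rank by decreasing value
cum = np.cumsum(value[order]) / value.sum()         # normalized Pareto curve (y-axis)
x = np.arange(1, len(value) + 1) / len(value)       # normalized rank (x-axis)

slopes = np.diff(np.concatenate(([0.0], cum))) / np.diff(np.concatenate(([0.0], x)))
keep = order[slopes >= 1.0]                         # keep while steeper than the 45-degree line
print("factor values kept (by index):", keep)
```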

As a result, the dimensionality of the space built on the descriptive scales is reduced by removing scales that correlate with one another; in essence, this is an orthonormalization of the space in an information metric.

This process can be repeated, i.e. it can be iterative; in the new version of the "Eidos" system the iterations are launched manually.

Similarly, the information space of classes is orthonormalized.

The scales and their gradations can be numerical (in which case interval values are processed) or textual (ordinal or even nominal).

Thus, with the help of the BCOSA (ASC-analysis) algorithms, the dimensionality of the space is reduced as much as possible with minimal loss of information.

A number of other dimensionality reduction algorithms have been developed for the analysis of statistical data in applied statistics. Describing the full variety of such algorithms is beyond the scope of this article.

Literature

1. Orlov A.I. Growth points of statistical methods // Polythematic network electronic scientific journal of the Kuban State Agrarian University. 2014. No. 103. P. 136-162.

2. Kruskal J. The relationship between multidimensional scaling and cluster analysis // Classification and cluster. Moscow: Mir, 1980. P. 20-41.

3. Kruskal J.B., Wish M. Multidimensional scaling // Sage University paper series: Quantitative applications in the social sciences. 1978. No. 11.

4. Harman G. Modern factor analysis. Moscow: Statistika, 1972. 489 p.

5. Orlov A.I. Notes on the theory of classification // Sociology: methodology, methods, mathematical models. 1991. No. 2. P. 28-50.

6. Orlov A.I. Basic results of the mathematical theory of classification // Polythematic network electronic scientific journal of the Kuban State Agrarian University. 2015. No. 110. P. 219-239.

7. Orlov A.I. Mathematical methods of the theory of classification // Polythematic network electronic scientific journal of the Kuban State Agrarian University. 2014. No. 95. P. 23-45.

8. Terekhina A.Yu. Data analysis by multidimensional scaling methods. Moscow: Nauka, 1986. 168 p.

9. Perekrest V.T. Nonlinear typological analysis of socio-economic information: Mathematical and computational methods. Leningrad: Nauka, 1983. 176 p.

10. Tyurin Yu.N., Litvak B.G., Orlov A.I., Satarov G.A., Shmerling D.S. Analysis of non-numerical information. Moscow: Scientific Council of the USSR Academy of Sciences on the complex problem "Cybernetics", 1981. 80 p.

11. Orlov A.I. A general view of the statistics of objects of non-numerical nature // Analysis of non-numerical information in sociological research. Moscow: Nauka, 1985. P. 58-92.

12. Orlov A.I. The limiting distribution of one estimate of the number of basis functions in regression // Applied multivariate statistical analysis. Scholarly notes on statistics, vol. 33. Moscow: Nauka, 1978. P. 380-381.

13. Orlov A.I. Estimation of the dimensionality of a model in regression // Algorithmic and software support of applied statistical analysis. Scholarly notes on statistics, vol. 36. Moscow: Nauka, 1980. P. 92-99.

14. Orlov A.I. Asymptotics of some estimates of the dimensionality of a model in regression // Applied statistics. Scholarly notes on statistics, vol. 45. Moscow: Nauka, 1983. P. 260-265.

15. Orlov A.I. On the estimation of a regression polynomial // Zavodskaya laboratoriya. Diagnostika materialov. 1994. Vol. 60. No. 5. P. 43-47.

16. Orlov A.I. Some probabilistic questions of the theory of classification // Applied statistics. Scholarly notes on statistics, vol. 45. Moscow: Nauka, 1983. P. 166-179.

17. Orlov A.I. On the development of the statistics of objects of nonnumerical nature // Design of Experiments and Data Analysis: New Trends and Results. Moscow: ANTAL, 1993. P. 52-90.

18. Orlov A.I. Dimensionality reduction methods // Appendix 1 to: Tolstova Yu.N. Fundamentals of multidimensional scaling: A textbook for universities. Moscow: KDU Publishing House, 2006. 160 p.

19. Orlov A.I. Asymptotics of solutions of extremal statistical problems // Analysis of non-numerical data in system research. Collected works, issue 10. Moscow: All-Union Scientific Research Institute for System Studies, 1982. P. 4-12.

20. Orlov A.I. Organizational-economic modelling: textbook: in 3 parts. Part 1: Non-numerical statistics. Moscow: Bauman MSTU Publishing House, 2009. 541 p.

21. Lutsenko E.V. Automated system-cognitive analysis in the management of active objects (a system theory of information and its application in the study of economic, socio-psychological, technological and organizational-technical systems): Monograph (scientific publication). Krasnodar: KubSAU, 2002. 605 p. http://elibrary.ru/item.asp?id=18632909


Data reduction

In analytical technologies, data reduction refers to the processes of transforming data into the form most convenient for analysis and interpretation. It is usually achieved by reducing the volume of the data, cutting the number of features used, and generalizing the values of those features.

Analyzed data are often insufficient when they poorly reflect the dependencies and regularities of the business processes under study. The reasons may be too few observations or the absence of features that capture the essential properties of the objects. In that case data enrichment is applied.

Dimensionality reduction is applied in the opposite case, when the data are excessive. Redundancy arises when the analysis task could be solved with the same level of efficiency and accuracy using data of lower dimensionality. Reducing dimensionality makes it possible to cut the time and computational cost of solving the task, and to make the data and the results of their analysis more interpretable and understandable for the user.

Reducing the number of records (observations) is applied when a solution of comparable quality can be obtained from a sample of smaller size, thereby cutting the computational and time costs. This is especially relevant for algorithms that scale poorly, where even a modest reduction in the number of records yields a substantial gain in computation time.

Reducing the number of features makes sense when the information needed to solve the task well is contained in some subset of the features, so there is no need to use all of them. This is especially relevant for correlated features. For example, the features "Age" and "Work experience" essentially carry the same information, so one of them can be excluded (a small sketch of this is given below).
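A small illustrative sketch (synthetic data, pandas assumed) that drops one feature from each strongly correlated pair, in the spirit of the "Age" / "Work experience" example; the 0.95 threshold is arbitrary:

```python
# Drop one feature from each pair whose absolute correlation exceeds a threshold.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
age = rng.integers(20, 60, size=200)
df = pd.DataFrame({
    "age": age,
    "work_experience": age - 18 + rng.integers(0, 3, size=200),   # almost a function of age
    "income": rng.normal(50_000, 10_000, size=200),
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = df.drop(columns=to_drop)
print("dropped:", to_drop)          # expected: ['work_experience']
```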

The most effective means of reducing the number of features are factor analysis and the principal component method.

Reducing the variability of feature values makes sense when, for example, the precision with which the data are represented is excessive and real values can be replaced by rounded ones without degrading the quality of the model. This reduces the amount of memory occupied by the data and the computational cost (a small sketch is given below).
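A minimal sketch of this kind of precision reduction, assuming pandas; the column name, rounding and target dtype are invented for the example:

```python
# Lower the storage precision of a numeric column when full float64
# precision is not needed: round the values and downcast the dtype.
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.default_rng(6).uniform(0, 1000, size=1_000_000)})
print("before:", df["price"].memory_usage(deep=True), "bytes")

df["price"] = df["price"].round(2).astype("float32")   # keep 2 decimal places, 32-bit storage
print("after: ", df["price"].memory_usage(deep=True), "bytes")
```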

The data set obtained as a result of reduction should retain, as far as possible, the information needed to solve the task with the specified accuracy, and the time and computational resources spent on reducing the data should not devalue the benefits obtained from it.

An analytical model built on a reduced data set should be easier to process, implement and understand than a model built on the original set.

The decision on which dimensionality reduction method to use is made on the basis of a priori knowledge about the specifics of the task and the expected results, as well as the limited time and computational resources.

Machine learning is nothing other than a field of study that allows computers to "learn" like people do, without the need for explicit programming.

What is predictive modeling: predictive modeling is a process that allows us to forecast outcomes on the basis of a set of predictors. These predictors are essentially features that come into play when deciding the final result, i.e. the outcome of the model.

What is dimensionality reduction?

In machine-learning classification problems there are often too many factors on the basis of which the final classification is made. These factors are essentially variables called features. The more features there are, the harder it becomes to visualize the training set and then work with it. Moreover, many of these features are correlated with one another and hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

Why is dimensionality reduction important for machine learning and predictive modeling?

An intuitive example of dimensionality reduction can be discussed through a simple e-mail classification problem, where we need to determine whether or not an e-mail is spam. This can involve a large number of features, such as whether the e-mail has a subject line, the content of the e-mail, whether it uses a template, and so on. Some of these features, however, can be collapsed into a single underlying characteristic, since they are correlated with one another to a high degree. Hence we can reduce the number of features in such problems. A 3-D classification problem is hard to visualize, whereas a 2-D one can be mapped to a simple two-dimensional space and a 1-D problem to a simple line. This concept can be illustrated by splitting a 3-D feature space into two 2-D feature spaces and then, if the features turn out to be correlated, reducing their number even further.

Components of dimensionality reduction

There are two components of dimensionality reduction:

  • Feature selection: here we try to find a subset of the original set of variables (features), i.e. a smaller subset that can be used to model the problem. It usually involves three approaches:
    1. filter
    2. wrapper
    3. embedded
  • Feature extraction: this reduces data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions (both components are contrasted in the sketch after this list).
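The following sketch contrasts the two components on a standard scikit-learn dataset: a filter-style feature selection keeps the ten best original features according to a univariate score, while PCA extracts ten new features; the dataset and the number ten are arbitrary choices for the illustration.

```python
# Feature selection (filter) versus feature extraction (PCA).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)              # 30 original features

X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)   # keep 10 original features
X_extracted = PCA(n_components=10).fit_transform(X)                        # derive 10 new features

print(X.shape, X_selected.shape, X_extracted.shape)      # (569, 30) (569, 10) (569, 10)
```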

Methods of dimensionality reduction

The various methods used for dimensionality reduction include:

  • Principal component analysis (PCA)
  • Linear discriminant analysis (LDA)
  • Generalized discriminant analysis (GDA)

Dimensionality reduction can be linear or non-linear, depending on the method used. The main linear method, called principal component analysis, or PCA, is discussed below.

Principal component analysis

This method was introduced by Karl Pearson. It works on the condition that, when data from a higher-dimensional space are mapped to data in a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximal.

It involves the following steps:

  • Construct the covariance matrix of the data.
  • Compute the eigenvectors and eigenvalues of this matrix.
  • The eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data.

Hence we are left with a smaller number of eigenvectors, and some information may have been lost in the process; but the most important variance should be retained by the remaining eigenvectors.
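A direct NumPy rendering of the steps just listed, on synthetic data (an illustration, not code from the original article):

```python
# PCA by hand: covariance matrix, eigendecomposition, projection onto
# the eigenvectors with the largest eigenvalues.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 6))
X[:, 3] = 0.8 * X[:, 0] + 0.2 * X[:, 3]          # make some features correlated

Xc = X - X.mean(axis=0)                          # center the data
cov = np.cov(Xc, rowvar=False)                   # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues/eigenvectors (ascending order)

order = np.argsort(eigvals)[::-1]                # sort by decreasing eigenvalue
W = eigvecs[:, order[:2]]                        # eigenvectors with the largest eigenvalues
Z = Xc @ W                                       # data projected onto 2 principal components

explained = eigvals[order[:2]].sum() / eigvals.sum()
print("share of variance retained:", explained)
```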

Advantages of dimensionality reduction

  • It helps compress the data and therefore reduces the storage space required.
  • It reduces computation time.
  • It also helps remove redundant features, if there are any.

Disadvantages of dimensionality reduction

  • It may lead to some loss of data.
  • PCA tends to find linear correlations between variables, which is sometimes undesirable.
  • PCA fails in cases where the mean and covariance are not enough to characterize the dataset.
  • We may not know how many principal components to keep; in practice, rules of thumb are applied.

This article was contributed by Ananya Uberoi.
