Big data Archive - agile Companies

What is data mining?

Dr. Dominic Lindner — Sun, 21 Mar 2021 17:29:38 +0000

The term data mining often comes up in connection with the storage and management of information in big data. Many companies use data mining as a tool: it is the systematic application of computer-based methods to find patterns, trends and relationships within large databases.

It builds on findings from computer science, statistics and mathematics. The analyses aim to uncover connections, patterns and trends in the data and to make them usable.

Data mining runs automatically, which saves both cost and time. Companies can use the results to make strategic decisions or solve problems more easily.

Functions

Companies usually pursue several goals with data mining. To achieve them, it has to perform a variety of tasks.

These include:

Classification: Objects are assigned to predefined classes.

Segmentation: Objects with similar features are combined into groups.

Forecast: Unknown or future feature values are predicted.

Dependency analysis: Connections and relationships between the features of objects are identified.

Deviation analysis: Objects are identified whose characteristics do not follow the dependencies that hold for other objects.
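
As a rough illustration of the last task, deviation analysis, the following sketch flags values that lie far away from the rest of a small data set. The sample data, the function name `find_deviations` and the threshold of two standard deviations are assumptions chosen for illustration; real data mining tools use more robust methods.

```python
from statistics import mean, pstdev

def find_deviations(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily order totals; 950 is the obvious outlier.
orders = [102, 98, 105, 97, 101, 950, 99, 103]
print(find_deviations(orders))
```

A single extreme value also inflates the standard deviation, so this simple approach only works when deviations are rare.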

Significance for big data areas

Although big data often serves as the setting for data mining, the two are not inherently linked. Data mining only describes the analysis of data sets for the characteristics and relationships of individual objects. It is often applied to large databases, as in big data, but it can be used on smaller ones as well.

Nonetheless, it is found far more often in the context of big data, where it uses the technical foundation to extract information from existing data effectively. In addition to artificial intelligence, it relies on statistical algorithms. This brings more structure and transparency and delivers more relevant results, especially for large and often confusing databases.

Who can benefit from data mining?

Data mining is already used in practice in many areas because it offers great potential. It is currently widespread in finance, marketing and medicine, and even serves as a tool for police analysis. Banks and insurance companies, for example, use it for improved customer service and risk analyses. It can also be used to analyze customers' buying behavior, which makes it very popular with online shops.

Image source: pixabay.com

The post "What is data mining?" first appeared on agile Companies.

How does a data warehouse work?

Dr. Dominic Lindner — Sun, 21 Mar 2021 17:29:38 +0000

Big data requires powerful platforms that can efficiently store large amounts of data. Such a platform is called a data warehouse. It also allows the information it contains to be analyzed according to certain patterns.

Data warehousing process

The data warehousing process, which is often used to describe how a data warehouse works, comprises four main steps for managing the data in the warehouse and evaluating it.

The 4-stage analysis process of a data warehouse

Acquisition of data from the source system
Loading the data
Backup of the data
Analysis and evaluation of the stored data

This is how a data warehouse is structured

A data warehouse, like a real building, is a construct made up of several elements. The foundation is an operational database that contains a large amount of information. On top of the foundation sits the so-called staging area, which has the task of pre-sorting the information. Only after ETL processes, which extract, transform and load the data according to a predetermined structure, does the information finally reach the data warehouse. This allows the data to be accessed separately from the operational data stores. Finally, the information can be retrieved with special data access tools. This is possible at different levels, the so-called data marts.
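
The path from source system through staging and ETL into the warehouse can be sketched in a few lines. This is only a minimal illustration under assumptions: the sample rows, the table name `sales` and the three helper functions are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

# Hypothetical rows from an operational source system (raw, inconsistent).
source_rows = [
    {"customer": " Alice ", "amount": "19.90", "country": "de"},
    {"customer": "Bob", "amount": "5.00", "country": "DE"},
    {"customer": "Carol", "amount": "n/a", "country": "at"},  # bad record
]

def extract(rows):
    """Copy the raw rows into a staging list without changing them."""
    return list(rows)

def transform(staged):
    """Clean the staged rows and drop records that cannot be parsed."""
    cleaned = []
    for row in staged:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows with an unusable amount
        cleaned.append((row["customer"].strip(), amount, row["country"].upper()))
    return cleaned

def load(conn, rows):
    """Write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source_rows)))
total_by_country = warehouse.execute(
    "SELECT country, SUM(amount) FROM sales GROUP BY country"
).fetchall()
print(total_by_country)
```

The record with the unusable amount never reaches the warehouse, which reflects the point that the analysis is only as good as the data that is loaded.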

In order to obtain an even better structure with large amounts of data, so-called OLAP databases can also be used. These enable the consolidation of information from different areas and can efficiently map relationships and hierarchies.

However, it should be noted that every data warehouse is only as high-quality as the data on which it is based. Poor data quality or incomplete data stocks can lead to considerable problems in the analysis processes.

Data warehouse tasks

In the context of big data, it is now essential for companies to have an overview of the mass of information in order to be able to efficiently evaluate the stored data. For this reason, a data warehouse usually has four important tasks.

Central collection of all data: Data is consolidated at one collection point.
Sorting of the data: Separation into analytical and unprocessed data sets in order to obtain unadulterated results.
Data integration: Combination of data from different sources and formats into a model that can be evaluated.
Long-term storage of the data: The data is kept as a history for specific queries and time-related analyses.

Advantages and disadvantages

A data warehouse is used by many companies as a helpful tool when it comes to storing large amounts of data. In addition to numerous advantages, there are also some disadvantages when using it.

Advantages

Powerful function for storing large amounts of data
Special tools for the individual areas
Data quality management

Disadvantages

Sometimes long loading times (especially with increasing volumes of data)
Unstructured data cannot be processed (especially video and audio)
No possibility of real-time streaming

Big data with Hadoop

Dr. Dominic Lindner — Sun, 21 Mar 2021 17:29:37 +0000

Processing big data often poses great difficulties for companies. To counter this problem, many organizations use software frameworks. One of them is Hadoop, which is written in Java.

What is Hadoop

The Java-based software framework Hadoop can most easily be imagined as a kind of shell that can be tailored to a wide variety of architectures and run on a wide variety of hardware, which acts as its workers.

The framework was invented by Doug Cutting and had become a top-level project of the Apache Software Foundation by 2008. Cutting developed it for better management of distributed and scalable systems. It is based on Google's MapReduce algorithm, which Hadoop uses to run computations over large amounts of data on distributed but networked computers.

Hadoop is also popular because Apache makes it available to everyone free of charge as open source software and because it is written in the well-known Java programming language.

What role does Hadoop play in big data?

Hadoop can process large amounts of data of any kind both in a structured way and quickly, which makes it an attractive tool for many companies. In particular, the ability to process data from different sources and with different structures in parallel on a cluster is a great benefit, especially for organizations working in business intelligence.

With the help of Hadoop it is also possible to solve complex computing tasks in the petabyte range efficiently and, on this basis, to develop new corporate strategies, gather basic information for important decisions or considerably simplify an organization's reporting.

Structure

Hadoop is made up of several building blocks which, combined, provide all the basic functions of the software framework. The four central components are:

Hadoop Common
Hadoop Distributed File System (HDFS)
MapReduce algorithm
Yet Another Resource Negotiator (YARN)

Hadoop Common provides the basic functions and tools, such as the Java archive files, and thus serves as the basis for all other components. It is connected to the other elements via interfaces with defined access rights.

The Hadoop Distributed File System is used to store the individual data stocks on different systems. According to the manufacturer, the HDFS is able to manage data in the hundreds of millions.

Hadoop is powered by Google’s MapReduce algorithm. This enables the software framework to distribute complex computing tasks to various systems, which then process them in parallel. This can enormously increase the speed of data processing.

The MapReduce algorithm is supplemented by Yet Another Resource Negotiator. YARN manages the resources of a cluster by assigning them to the individual tasks.

How it works

As already mentioned, Hadoop is largely based on Google’s MapReduce algorithm. Central tasks are also handled by the HDFS file system, which distributes the data across the individual nodes of the cluster. The MapReduce algorithm, in turn, splits the processing of the data so that it can run in parallel on all nodes. Hadoop then merges the partial results into one overall result.
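
The idea of splitting the work and merging the partial results can be shown with the classic word count. The sketch below is plain Python and runs sequentially on one machine; it does not use Hadoop's API, and the function names and sample text are assumptions for illustration. In a real cluster, each chunk would be mapped on a different node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

# Each chunk stands in for a data block stored on a different node.
chunks = ["big data needs big clusters", "data lakes store raw data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts["data"], counts["big"])  # prints: 3 2
```

Because every chunk is mapped independently, the map step can be distributed freely; only the grouping step needs to bring matching keys together.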

Hadoop distributes the data across the cluster on its own. Each cluster has a single master (one computer node), while the other nodes are subordinate to it in slave mode. The slaves store the data, while the master is responsible for replication and thus makes the data available on several nodes. Because it knows the exact location of every data block at all times, the master protects efficiently against data loss. It also supervises the individual nodes: if a node is unavailable for a long period of time, the master automatically accesses a copy of its data blocks, replicates them and stores them again.
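
How a master can keep track of block locations and restore lost copies is sketched below. This is a toy model, not Hadoop code: the class `Master`, the node names and the replication factor of 2 are assumptions made for the example.

```python
class Master:
    """Toy master node that tracks which nodes hold a copy of each block."""

    def __init__(self, nodes, replication=2):
        self.nodes = set(nodes)
        self.replication = replication
        self.locations = {}  # block id -> set of node names

    def store(self, block_id):
        """Place a new block on `replication` nodes (alphabetical, for determinism)."""
        targets = sorted(self.nodes)[: self.replication]
        self.locations[block_id] = set(targets)

    def node_failed(self, node):
        """Remove a failed node and re-replicate the blocks it held."""
        self.nodes.discard(node)
        for holders in self.locations.values():
            holders.discard(node)
            spare = sorted(self.nodes - holders)
            while len(holders) < self.replication and spare:
                holders.add(spare.pop(0))

master = Master(["node-a", "node-b", "node-c"])
master.store("block-1")
master.node_failed("node-a")
print(sorted(master.locations["block-1"]))  # prints: ['node-b', 'node-c']
```

After the failure of `node-a`, the block is again available on two nodes, which is the behavior described above.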

What is data mart and data lineage?

Dr. Dominic Lindner — Sun, 21 Mar 2021 17:29:36 +0000

In the age of digitization and big data, a lot revolves around one thing: data. Terms such as data mart and data lineage come up regularly. It is not always clear exactly what these technical terms mean, which is why this article provides a brief overview.

What is a data mart?

Data marts are a kind of collection point for user-specific data. Data is extracted from large data sets and made accessible in isolation to certain user groups. Data marts thus form a subset of a data warehouse and help to make certain data available to users more quickly and with less effort. This saves not only time but also costs.

Data Mart vs Data Warehouse

Both data marts and data warehouses are used to store and manage data records until they are used. Data warehouses specialize in organizing all of a company's data, while data marts only serve as collection points for the data of individual departments. They are a tool that isolates certain data records and makes them available separately to the respective business function.

Types

A basic distinction is made between three categories of data marts.

Dependent

Dependent data marts are always directly related to an enterprise data warehouse because they are developed according to the top-down principle. Data is first combined at a central collection point, and certain data records are then extracted and distributed to their intended data marts.

Independent

An independent data mart, on the other hand, is not linked to a data warehouse and forms an autonomous system. Data is obtained from an organization's internal and external data sources instead of from the collection point of the data warehouse and is then distributed specifically to the individual data marts. This type of data mart is much easier to implement and is particularly helpful when pursuing short-term business goals.

Hybrid

Hybrid data marts combine dependent and independent data marts in one system by obtaining data from a data warehouse as well as from a company's internal and external sources. This combines the advantages of both methods and creates a complex but clear system.
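
The dependent variant is the easiest to picture in code: a department-specific subset is extracted from a central warehouse table. The following sketch uses an in-memory SQLite database; the table names `warehouse_sales` and `mart_marketing` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse table holding data for the whole company.
conn.execute(
    "CREATE TABLE warehouse_sales (department TEXT, product TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("marketing", "campaign-a", 1200.0),
        ("finance", "loan-x", 5000.0),
        ("marketing", "campaign-b", 800.0),
    ],
)

# The data mart isolates only the records one department needs.
conn.execute(
    "CREATE TABLE mart_marketing AS "
    "SELECT product, revenue FROM warehouse_sales WHERE department = 'marketing'"
)

mart_rows = conn.execute(
    "SELECT product, revenue FROM mart_marketing ORDER BY product"
).fetchall()
print(mart_rows)
```

Queries against the small mart table no longer touch the full warehouse, which is where the gain in access time comes from.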

Benefits

Because of their function as an accelerator when accessing special data sets, data marts offer many advantages.

Minimizing the time it takes to acquire certain data
Ready for use much faster than an enterprise data warehouse
Data marts require comparatively little specialist knowledge for implementation
Inexpensive alternative to an enterprise data warehouse
Data marts help improve the performance of a data warehouse because they can obtain data with less effort
Thanks to the data mart, KPIs are easier to monitor
Data marts support data maintenance by assigning data records to specific departments, which in turn can monitor them independently

What is data lineage?

Data lineage is concerned with the origin of data, which is why the term is often used as a synonym for data provenance. Its task is to record changes and optimizations of data as well as the development of their elements in a history. It tracks a data record on its journey from creation through adjustments to its final destination and documents the associated properties along the way. Simply put, data lineage can be thought of as a kind of biography of a data set.
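
The "biography" idea can be sketched as a record that documents every change applied to it. The class `TrackedRecord`, its fields and the sample value are assumptions for illustration; real lineage tools capture this information at the level of whole pipelines rather than single values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TrackedRecord:
    """A value together with the history of every change applied to it."""
    value: object
    history: list = field(default_factory=list)

    def apply(self, step_name, func):
        """Apply a transformation and document it in the lineage history."""
        before = self.value
        self.value = func(self.value)
        self.history.append({
            "step": step_name,
            "before": before,
            "after": self.value,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return self

record = TrackedRecord(" 42.50 eur ")
record.apply("strip whitespace", str.strip)
record.apply("drop currency", lambda s: s.replace(" eur", ""))
record.apply("to float", float)

for entry in record.history:
    print(entry["step"], "->", repr(entry["after"]))
```

The history makes it possible to explain later how the final value came about and which raw input it was derived from.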

Benefits

With its function, data lineage offers many advantages for the user.

Data can be fully monitored at any time
Increased transparency about the development and history of data sets
The quality of the data is retained
Helpful when it comes to confidential data that needs to be protected
Companies can use data lineage to more easily comply with data-based standards and regulations

What is a data lake?

Dr. Dominic Lindner — Sun, 21 Mar 2021 17:29:36 +0000

Big data analyses usually require a large data store that captures and collects all information in its raw state. In size, this data store resembles a real lake, which is why the technical term “data lake” has become established for it. This article explains exactly what is behind it.

Definition

As a large data store, the data lake manages the entire mass of data in its original form, i.e. in its raw format. It collects information from a wide variety of sources. It makes no difference to the data lake whether the data has a structure or not, and no prior validation or reformatting of the data is required. A data lake can manage more than just number- or text-based data: it can also store media such as images and videos.

What appears to be a chaotic collection of data nevertheless follows a system. Even though the data lake receives all information in its raw state, it structures the data as soon as it is needed and, if necessary, also restructures it.
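
Storing everything raw and applying structure only when the data is needed is often called schema-on-read. A minimal sketch, assuming a list of JSON strings as the "lake" and a hypothetical reader function `read_payments`:

```python
import json

# The "lake": raw payloads are stored exactly as they arrive, without validation.
raw_lake = [
    '{"type": "payment", "amount": "12.50", "account": "A1"}',
    '{"type": "sensor", "temp_c": 21.3}',
    '{"type": "payment", "amount": "7.25", "account": "A2", "note": "refund"}',
]

def read_payments(lake):
    """Apply a structure only at read time and only to the records needed."""
    for payload in lake:
        record = json.loads(payload)
        if record.get("type") == "payment":
            yield {"account": record["account"], "amount": float(record["amount"])}

payments = list(read_payments(raw_lake))
print(payments)
```

The sensor reading stays in the lake untouched; a different reader could later impose a different structure on the same raw data.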

Use of a data lake

The many different ways of using the information collected in a data lake, such as flexible analyses, make this large data store extremely attractive. However, some prerequisites must be met in order to use the system optimally.

The most important basic function of the data lake is to collect and manage data from a wide variety of sources. By gathering all data in one place, data silos are avoided and information is available more quickly. Given the large amount of data, however, even a single storage location does not guarantee problem-free data management. Data lakes therefore need common frameworks as well as records of the data they contain in order to bring more structure into the mass of information.

To meet security and data protection requirements, additional access controls must be implemented and the information must be encrypted. At the same time, a data lake should always provide a way to back up and restore data.

Advantages and disadvantages

The use of a data lake is particularly useful when large amounts of data are repeatedly generated that have to be managed. At the same time, however, such a large collection of information can also pose a number of hurdles.

Advantages

Fast and uncomplicated data storage in raw format
Low requirements with regard to computing power
Provides the basis for detailed and content-rich analyses
Many possibilities for evaluating data, since all data is collected without prior sorting
Big data analytics can be a competitive advantage

Disadvantages

High requirements in terms of data protection and security
Need for a complex data protection system
Access rights must be implemented in advance and users must be checked regularly

Conclusion

As you can see, a data lake is a real asset, especially for companies with large volumes of data. When used optimally, it enables real competitive advantages thanks to in-depth big data analyses. At the same time, sufficient data protection must be ensured for this amount of data, which sometimes makes the use of a data lake very complex.

Case study: building a data lake for the use of big data

Dr. Dominic Lindner — Sun, 21 Mar 2021 17:29:35 +0000

Big data is an important topic, as a study by Bitkom shows. In 2018, the association surveyed over 600 companies on trending topics and found the following results:

57 percent are planning investments in big data or are already implementing them
The five top topics are big data (57%), Industry 4.0 (39%), 3D printing (38%), robotics (36%) and VR (25%)
But: New concepts and possibilities such as artificial intelligence and blockchain have only rarely been used so far

Reading tip: What is big data

Big data is being implemented only hesitantly

According to the study, the potential of big data is only being used hesitantly. The reasons given are data protection requirements (63%), the technical implementation (54%) and a lack of specialists (42%).

Reading tip: What is a data scientist

I am currently working on the technical implementation. In order to use big data in a meaningful way, numerous technical prerequisites have to be created. I would like to give a practical example that can serve as an impetus for practice.

Practical example: building a data lake

Through my job, I am often involved in customer projects that aim to set up big data architectures. From the majority of these projects I have derived a kind of blueprint, which I would like to present here. To make the example clearer, I have written it up as a case study.

Initial situation

The customer is a fictitious large bank. From different sources, for example databases and data streams, the system copies data in raw form into a staging area. A data stream is a continuous flow of data records whose end usually cannot be foreseen in advance, e.g. transfers at a bank or payments to an account, the heart rate monitor in a hospital or the temperature measurements of a weather station. Staging ensures that the raw data is saved in its current form with a time stamp. The advantage is that the data is still available if the external data source is lost.
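
Staging a data stream with a time stamp can be sketched as follows. The generator `transfer_stream`, the record fields and the in-memory list used as staging area are assumptions for illustration; the actual system in the case study is not shown in this article.

```python
from datetime import datetime, timezone

def transfer_stream():
    """Hypothetical data stream; a real stream would have no known end."""
    yield {"from": "A1", "to": "B7", "amount": 250.0}
    yield {"from": "C3", "to": "A1", "amount": 19.99}

def stage(record, staging_area):
    """Store the record unchanged, wrapped with the time it was received."""
    staging_area.append({
        "received_at": datetime.now(timezone.utc).isoformat(),
        "raw": dict(record),  # copy, so later changes at the source do not leak in
    })

staging_area = []
for transfer in transfer_stream():
    stage(transfer, staging_area)

print(len(staging_area), staging_area[0]["raw"]["amount"])
```

Because the raw record is copied rather than referenced, the staged version survives even if the source changes or disappears.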

The staging area saves the data on different hard drives using a secure and redundant storage format. The data is then converted into a uniform format and saved again in the norming area. With the help of SQL queries, data is exported in CSV format and saved in SharePoint. Numerous external IT consultants use this to prepare weekly reports in PowerPoint and Excel. The reports are stored in folders on SharePoint together with the source data.

Initial situation of the customer: Reports are generated by consultants in a data warehouse using SQL.

Summary of the architecture:

Data comes from different source systems
At the end of the day, the data is processed, standardized and recorded
A wide variety of queries are made
Reports are generated weekly by external service providers (MS Office and Sharepoint as storage location)

Now we come into play. The customer asked my team and me to design a new architecture. The reasons for this were:

The reports require a lot of effort
Little flexibility (especially for ad hoc requests)
No versioned raw data
High costs for external service providers
Requirements of BCBS 239 (principles for the effective aggregation of risk data and risk reporting) and MaRisk can no longer be met

Target situation

We then started to redesign the customer’s architecture. In the first step, we revised the loading processes. On the one hand, we loaded new data into our lake every night through batch loading processes. On the other hand, the data streams were loaded continuously.

New architecture of the customer – A data lake provides various evaluations throughout the company and enables numerous new potentials

Sources

All of this data is saved on hard drives and given a version stamp. This concept is called data lake and metadata management. A data lake is a very large data store, in effect an oversized hard drive, that accepts data from a wide variety of sources in its raw format.

Data lake

The data lake helped us to carry out so-called data lineage. Think of it this way: Data lineage is like a patient record with important information about when data was created, the current data age and changes. The presentation is usually very clear in the form of a diagram.

Advantage: In reports we were able to prove exactly on which data basis we produced them at time X. With the help of data lineage, we were then able to track the change in the data.
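
Proving which data a report was based on at time X requires that every loaded version is kept together with its load time. A minimal sketch of such a lookup, assuming ISO-formatted timestamps and a hypothetical class `VersionedDataset`:

```python
from bisect import bisect_right

class VersionedDataset:
    """Keeps every version of a data set together with its load timestamp."""

    def __init__(self):
        self._timestamps = []  # sorted ISO timestamps
        self._versions = []    # data as it looked at that timestamp

    def load(self, timestamp, data):
        """Append a new version; timestamps must arrive in ascending order."""
        if self._timestamps and timestamp <= self._timestamps[-1]:
            raise ValueError("timestamps must be strictly increasing")
        self._timestamps.append(timestamp)
        self._versions.append(data)

    def as_of(self, timestamp):
        """Return the version that was current at the given time, or None."""
        index = bisect_right(self._timestamps, timestamp)
        return self._versions[index - 1] if index else None

risk_data = VersionedDataset()
risk_data.load("2021-03-01T02:00", {"exposure": 1.2})
risk_data.load("2021-03-02T02:00", {"exposure": 1.5})
print(risk_data.as_of("2021-03-01T15:30"))  # prints: {'exposure': 1.2}
```

The string comparison only works because all timestamps share the same ISO format; mixed formats would need real datetime objects.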

Our data lake was based on Hadoop, with modern monitoring by Prometheus, Grafana and Icinga 2, and we set up high-availability clusters for this purpose. The Hadoop Distributed File System (HDFS) is a distributed file system; combined with the MapReduce algorithm, Hadoop can split complex and computationally intensive tasks into many small parts that run on several computers. This means that evaluations based on the raw data are possible at runtime.

By the way: because of the MaRisk and BCBS 239 guidelines for banks, we loaded risk data into a separate cluster that could only be accessed by authorized persons. There were concerns that combining data in the lake would allow too many cross-references to be drawn, so we stored certain data separately for security reasons.

Next, we wanted to prepare the raw data from the lake as well. First, we copied the data for standardized reports, which are used every week or requested by certain departments, to separate hard drives as data marts. This allowed us to guarantee access control and improve performance. We thus had fixed (permanent) and volatile (project-based) data marts.

Limitation: I realize that data marts are a data warehouse concept, but in our context they were really helpful because we cannot rely 100% on the data lake.

With the help of new algorithms (the customer’s internal algorithms), we normalized the data at runtime or for the data marts. We also used software to try to gain new insights from the data by forming clusters with artificial intelligence models. For example, we had data correlated or examined for deviations. This was done by a specialized AI service provider. Another element is the data catalog. The data catalog is a catalog of metadata; it shows the presentation rules for all data and the relationships between the various data.

Note: The data catalog has the important function that no relationship can be established between certain personal risk data without access to the catalog.

Evaluation

Now we come to the evaluation logic. We wanted the various stakeholders in the company to be able to send requests to a query server using just three standards. The requests are distributed sensibly by a load balancer and protected by access control. The three standards are:

SQL
Hive (SQL-like Hadoop compatible language)
Tableau (software tool)
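
The combination of access control and load balancing in front of the query server can be sketched roughly as follows. The class `QueryGateway`, the server names, the user roles and the round-robin strategy are assumptions for illustration and do not describe the customer's actual setup.

```python
from itertools import cycle

class QueryGateway:
    """Toy front end: checks permissions, then spreads requests over backends."""

    ALLOWED = {"sql", "hive", "tableau"}  # the three standards named above

    def __init__(self, backends, permissions):
        self._backends = cycle(backends)  # simple round-robin
        self._permissions = permissions   # user -> set of allowed interfaces

    def submit(self, user, interface, query):
        if interface not in self.ALLOWED:
            raise ValueError(f"unsupported interface: {interface}")
        if interface not in self._permissions.get(user, set()):
            raise PermissionError(f"{user} may not use {interface}")
        backend = next(self._backends)
        return f"{backend} <- [{interface}] {query}"

gateway = QueryGateway(
    backends=["server-1", "server-2"],
    permissions={"controlling": {"sql"}, "data-scientist": {"sql", "hive", "tableau"}},
)
print(gateway.submit("controlling", "sql", "SELECT 1"))
print(gateway.submit("data-scientist", "hive", "SELECT 2"))
```

Checking permissions before a backend is chosen means that rejected requests never consume capacity on the query servers.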

Reporting

Now we come to the actual report generation for the end user. We have our three groups in the company for this purpose. These are:

classic controlling,
the specialist departments (and project managers) and
the data scientists.

All three roles can send requests to our query server. There is a suitable type of evaluation for each role:

Fixed and automated reports for controlling,
Interactive and customizable dashboards for the specialist departments and
Exploratory reports in Tableau for the data scientists.

In summary, there are the standardized automatic reports, which we could view in CSV or Excel, as well as interactive dashboards for individual real-time reports using our own software. Furthermore, a team of data scientists had the goal of using Tableau to gain new knowledge from data (exploratory reports).

Summary of the new architecture :

Raw data is loaded into the data storage (Hadoop cluster)
MapReduce algorithm intelligently distributes the evaluation
Evaluations are made directly from the raw data layer
Transformation always only at runtime
Data catalog for storing data relationships
Modeling through AI
Query Server allows various queries in various languages
Data lineage for versioning the data
Automated reports and interactive / exploratory dashboards through in-house development

Conclusion

Big data can change companies significantly and is right at the top of the agenda. The main obstacles, however, are the technical implementation and the preparation of the data. Classical concepts are no longer capable of preparing such data, while legal requirements put companies under pressure to store data in a targeted manner and deliver up-to-date reports.

In this case study, I have given an example of a technical implementation that can serve as an impetus for practice. I built a data lake and used various well-known concepts such as Hadoop. My experience shows that implementing these concepts correctly can help to realize the potential of big data. What matters is drawing meaningful reports and insights from the data.

Reading tip: Big data benefits

Image source: Business photo created by mindandi – www.freepik.com

Chatbots in companies – these are the possibilities

Dr. Dominic Lindner — Sun, 21 Mar 2021 17:29:00 +0000

Chatbots are on everyone’s lips and could significantly change the world of work in companies. For example, chatbots could completely revolutionize customer service. On the other hand, this could also lead to massive job losses and make numerous job profiles, such as the call center agent, unnecessary. In this article I would like to shed light on the current and future possibilities of chatbots.

Klarmobil chatbot

The first thing I did was look at the Klarmobil chatbot. According to the mobile operator, 1000 customers use the chatbot every week and ask almost 6 questions. In addition to the classic customer advisor, the bot answers a set of standard questions similar to the FAQ.

WetterOnline chatbot

I found another great use case at WetterOnline. There I can give the bot special instructions, for example “show me the weather on my mobile phone every morning” or “show me the weather in Berlin”. I can also simply ask the bot about the weather. Very convenient!

H&M chatbot

The H&M bot is available in the Kik messenger. There you can ask specific questions about an outfit, such as what goes with your style or your clothes. The bot also asks me questions like: How old are you and what style do you prefer?

Maggie chatbot

I find Maggie’s bot very entertaining. It gives me recipes and ideas for my diet on request. I can name ingredients or express special wishes, and the bot will find something that suits me.

Telekom HR chatbot

Telekom has the bot Katy, which stands for career at T-Systems. The bot supports applicants in their search for information about corporate culture, entry opportunities and employer benefits. It also offers tips and tricks for finding the right position.

Chatbots – these are the possibilities

Chatbots now offer great added value for users. That surprised me, because a few years ago bots were hardly mature and could not hold a conversation. Companies hope to provide 24/7 customer support, lower costs and improve service times.

According to Gartner, the total investment in chatbots is to increase from 680 million to 2.4 billion dollars (keyword robotic processing). Most chatbots are currently limited to simple use cases and offer an addition to the company’s core business and customer service. However, this could change one day. So it remains exciting!

Image source: Pixabay

Big Data Risks – A Question of Implementation!

Dr. Dominic Lindner — Sun, 21 Mar 2021 17:27:47 +0000

“Big data creates mixed feelings for many people. The economic opportunities are obvious. But the possibilities of abuse are also evident” (Computerwoche). Big data is certainly more than just hype and brings numerous new opportunities with it. However, there are also many risks, which are discussed in this article.

Reading tip: What is big data

Big Data Risks: Monitoring

Computerwoche gives an example from the search for a perpetrator on the autobahn: “The investigators had installed cameras on seven relevant sections of the autobahn. These read in the license plates of all passing automobiles, including those of the vehicles being shot at. In April 2013, the police received reports of gunfire on trucks again within five days, a total of six.”

Of course, this massive storage of data allows a high level of surveillance. The verdict of Computerwoche: “The evaluation of massive amounts of data has an ambivalent character.”

Another example is the storage of large amounts of wearable data by health insurance companies. The health of the individual becomes transparent, and it is possible to see how often someone exercises and moves. This creates a permanent feeling of surveillance.

Big data risks: Sensitive data requires special protection

“Dealing with large amounts of data, especially data about internal company relationships, poses technical challenges for companies and IT departments. The balance between accessibility and security must be struck precisely,” as the BigDatablog puts it. Sensitive data must therefore be specially secured, not only against abuse but also against manipulation.

Big data risks: manipulation

Another risk of big data is its manipulative use. For example, big data can be misused to influence voters in elections. What matters, therefore, is the sensible and ethical use of the large amount of data.
The magazine Propaganda show warns: “Critical contemporaries have always warned, but only now is it slowly becoming apparent – all the more powerfully – how data collections can be, and already are being, used to manipulate, specifically and unnoticed, people who previously ‘thoughtlessly’ made their thoughts available to third parties.”
The magazine goes on to argue that “the data is today not primarily used to analyze political moods and then to make laws and policy in line with the citizens’ known opinions and interests, but rather to manipulate the surveyed moods and opinions with targeted measures in the interests of the elites.”

Big data risks: uselessness

A final risk is that large amounts of data are stored and even evaluated, but you then sit in front of the results and think: Hmm? What do we do with this now? What does it tell us? When I was still on the road as an external consultant, I experienced this very often. We evaluated and thought about a lot of things, but hardly ever came up with a meaningful idea of what it could mean.
Harvard Business Manager gives a similar warning under the heading “Ready and open to experiment”: “Managers and analysts must be able to apply scientific methods in their business area. They need to know what reasonable working hypotheses look like. They also need to understand the basics of experiments and how they are set up.”

Conclusion: Big data but with care!

Given the risks mentioned, it is important to use big data sensibly. This certainly requires guidelines to prevent abuse and manipulation on the one hand and further training to prevent uselessness on the other, because there must also be a data culture in the company.
According to Harvard Business Manager, great efforts in further training are ultimately needed so that big data leads to more added value. It is about promoting a data-oriented mindset and an analytical culture in the company and introducing new technologies.
Tip: Book suggestions on big data
Of course, not everything is a risk. Big data can offer numerous opportunities, such as well-founded forecasts for decisions and new products as well as an individual approach to customers. So read my follow-up article on this next week.
Reading tip: Opportunities of big data

Opportunities and risks of big data (own illustration)


Image source: Designed by Freepik


Tips on the quantitative survey and evaluation method

Dr. Dominic Lindner — Sun, 21 Mar 2021 17:27:47 +0000

In my research, I collected data not only through group discussions but also through a quantitative survey. The survey served to collect data quickly, which I could then evaluate together with the participants in the round tables. In this article, I’ll give you some tips on how to do this. The following sections are taken from my doctoral thesis.

Advantages and pilot test

The advantage is that a large number of participants can be reached quickly. The links to the questionnaire can be distributed specifically to participants you already know and, through their recommendations, within their networks. In contrast to direct methods such as telephone interviews, the layout and design of the questionnaire are very important, as the researcher cannot provide explanations (Porst 2014) and a question can easily be misinterpreted. For this reason, my questionnaires were always tested with five pilot participants, who then sent direct feedback to the research team.

Define target group

Every survey has a specific target group, which you should define. On the one hand, you should narrow down the professional group; on the other hand, you should justify in the thesis why you are surveying it. The target group also determines whether you should conduct the survey online or offline. For example, I wanted to survey disciplinary managers because I wanted to investigate leadership behavior in virtual teams. That was perfectly possible online. In another study on the digital workplace, my target group was small businesses. Most of them could hardly be reached via e-mail or the Internet, which is why we surveyed them offline. So define and justify the target group precisely before the survey.
Tip: Don’t forget to delimit your methodology cleanly.

Formulation of the questions

The exact content of the questions depends, of course, on your research question. However, you should conceptualize the nature of each question precisely. You derive a variable (measured variable) from each question. This can be, for example, the respondent’s turnover, team size, preferred agile method, type of company and much more. The important thing is: every question creates a variable. From the variables you can then derive confirmed hypotheses. If you already have hypotheses before you start your work, the questions should of course be adapted to them.
If not, you can derive theses from your survey. For example: from a team size of 12 people upward, the surveyed managers prefer Kanban; or managers from SMEs prefer Kanban while those from corporations prefer Scrum. The wording of the questions depends on the selected evaluation method. I’ll tell you something about this below. In a thesis, 5-10 questions are usually sufficient.
Reading tip: Prepare theses
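
To make the idea of “every question creates a variable” a little more tangible, here is a minimal sketch in Python. The responses, the field names (team_size, company_type, preferred_method) and the helper preference_by_group are all invented for illustration; they are not taken from my actual studies. The sketch simply counts the preferred method per group, which is one way to spot candidate theses like the two mentioned above.

```python
from collections import Counter

# Hypothetical responses: one dict per participant, one key per question (variable).
responses = [
    {"team_size": 5, "company_type": "SME", "preferred_method": "Scrum"},
    {"team_size": 14, "company_type": "SME", "preferred_method": "Kanban"},
    {"team_size": 12, "company_type": "corporation", "preferred_method": "Kanban"},
    {"team_size": 8, "company_type": "corporation", "preferred_method": "Scrum"},
    {"team_size": 20, "company_type": "SME", "preferred_method": "Kanban"},
    {"team_size": 6, "company_type": "SME", "preferred_method": "Kanban"},
]


def preference_by_group(rows, group_of):
    """Count preferred methods per group; group_of maps a row to a group label."""
    counts = {}
    for row in rows:
        counts.setdefault(group_of(row), Counter())[row["preferred_method"]] += 1
    return counts


# Thesis candidate 1: teams of 12 or more people prefer Kanban.
by_size = preference_by_group(
    responses, lambda r: "12 or more" if r["team_size"] >= 12 else "under 12"
)
# Thesis candidate 2: SMEs and corporations prefer different methods.
by_type = preference_by_group(responses, lambda r: r["company_type"])

for label, table in (("team size", by_size), ("company type", by_type)):
    for group, methods in sorted(table.items()):
        print(label, group, dict(methods))
```

With real data, counts like these are only a starting point: whether a difference is more than chance still has to be checked with one of the evaluation methods described further down.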

Excursus: online vs. offline

There are two possible types of survey: online and offline. You can print out the survey and carry it out offline. This has the advantage that you can reach participants whom you would otherwise not reach and that you can control the distribution very closely. For example, a questionnaire can be distributed specifically to managers or project managers. In one study, I wanted to survey exactly the same companies that my co-author had surveyed in 2016. We therefore distributed the printed questionnaires specifically to them. Nobody else was supposed to take the survey. Limitation: Of course you can also protect an online survey with access codes, but this is never 100% secure.
In an online survey you can reach a large number of participants very quickly, but you can hardly control how the link to the survey is distributed, since participants can pass it on. You should therefore formulate a screening question at the start. In my survey of executives, for example, the first question asked how many employees the respondent currently leads. If the answer was that they lead no one, the questionnaire ended immediately. This is how you avoid invalid answers: someone who does not lead can hardly say what managers think. In the same way, only project managers should answer a survey among project managers.
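
Survey tools usually handle such a filter question for you, but the logic is simple enough to sketch. The following Python snippet is only an illustration under assumptions: the field name employees_led and the sample submissions are invented, and a missing answer is treated as “does not lead anyone”.

```python
def passes_screening(answer):
    """Return True if the respondent leads at least one employee."""
    return answer.get("employees_led", 0) > 0


# Hypothetical raw submissions; "employees_led" is the screening question.
submissions = [
    {"id": 1, "employees_led": 8},
    {"id": 2, "employees_led": 0},   # does not lead anyone: screened out
    {"id": 3, "employees_led": 25},
    {"id": 4},                       # screening question left empty: screened out
]

valid = [s for s in submissions if passes_screening(s)]
screened_out = [s["id"] for s in submissions if not passes_screening(s)]

print("valid ids:", [s["id"] for s in valid])
print("screened out ids:", screened_out)
```

The same check is useful again after the export, in case a questionnaire slipped through without the filter.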

Construction and process

My online questionnaires are divided into small, thematically separated blocks. They use common answer types such as drop-down menus, lists, checkboxes and radio buttons, and I take care that there are not too many questions on one page, as too many questions can reduce the participants’ concentration (Kuckartz et al. 2009) and too many different topic blocks can cause information overload. The questionnaire was therefore designed to take 15 minutes, which also turned out to be an acceptable length in the pilot tests.
I always schedule the surveys to run for 4 weeks and aim for a minimum of 60 valid responses. The data is then exported from the questionnaire software and first checked for validity in a tool. After unsatisfactory questionnaires have been sorted out, the remaining ones are evaluated using SPSS so that the data can be visualized.
Reading tip: Book by Kuckartz

How many people should I interview?

This is a good question, and in general, the more people you ask, the more meaningful the results. My tip is to look at the results and calculate how much they have changed over the last 5 participants surveyed. If there is no change, the results can be assumed to be stable.
To document this, add the following to your work: 25 participants were interviewed. According to Wilde and Hess (2006), the saturation criterion of a research method is reached if, after a certain number of participants, no significant new knowledge is gained in a further iteration. After measuring the last 5 participants, no significant changes could be observed in the survey.
In short: if I survey more people, the result will hardly change, e.g. 80% prefer agile over traditional IT methods. Even if you interviewed 40 more participants, this should not change under normal circumstances.
Overall, however, I can say as a guideline that 15-30 people were interviewed in my bachelor thesis.
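
The “last 5 participants” check can be written down in a few lines. This is a minimal sketch, not a formal saturation test: the function names, the sample answers and above all the tolerance of 5 percentage points are my own assumptions for illustration. It compares the share of an answer before and after the last five participants.

```python
def share(answers, value):
    """Fraction of answers equal to value."""
    return sum(1 for a in answers if a == value) / len(answers)


def is_saturated(answers, value, last_n=5, tolerance=0.05):
    """Compare the share before and after the last `last_n` participants.

    The tolerance of 5 percentage points is an illustrative assumption.
    """
    if len(answers) <= last_n:
        return False  # too few answers to compare
    before = share(answers[:-last_n], value)
    after = share(answers, value)
    return abs(after - before) <= tolerance


# Hypothetical answers in the order they arrived (25 participants).
answers = ["agile"] * 16 + ["traditional"] * 4 + ["agile"] * 4 + ["traditional"]

print("share before the last 5:", share(answers[:-5], "agile"))
print("share after all 25:", share(answers, "agile"))
print("saturated:", is_saturated(answers, "agile"))
```

If the share still moves noticeably with every new batch of participants, that is a sign to keep collecting.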

Quantitative survey – evaluation

Once the simple evaluation is complete, the data is exported from the questionnaire software. I have always examined the difference between SMEs and corporations. I specifically wanted to find out whether, for example, special insights would emerge for leadership in SMEs. In the SPSS analysis tool, the data is split into SME and non-SME responses, and the responses per group are examined for a significant difference. The check for statistical deviations uses the customary significance level of α = 5%, specified in advance (Kuckartz et al. 2013). From this it can be derived whether a survey variable applies uniformly to all companies or whether there is a specific deviation for SMEs.
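
For readers without SPSS, the group comparison can be illustrated with a small Python sketch. Please note the assumptions: I use a Pearson chi-square test on a 2x2 table as a stand-in, which is not necessarily the procedure SPSS applies to your data, and the counts below are invented. Only the significance level of 5% comes from the text.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for a 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With one degree of freedom the p-value
    equals erfc(sqrt(statistic / 2)). No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one answer")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return statistic, math.erfc(math.sqrt(statistic / 2))


ALPHA = 0.05  # significance level named in the text

# Hypothetical counts: rows = SME / non-SME, columns = yes / no answers.
sme_yes, sme_no = 24, 6
non_sme_yes, non_sme_no = 15, 15

statistic, p_value = chi_square_2x2(sme_yes, sme_no, non_sme_yes, non_sme_no)
print(f"chi2 = {statistic:.2f}, p = {p_value:.4f}")
print("significant difference" if p_value < ALPHA else "no significant difference")
```

A p-value below α suggests that the variable does not apply uniformly to both groups. With very small counts per cell, an exact test would be the safer choice.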

Reading tip: Book for evaluation

Four possibilities for evaluation

There are four well-known options for evaluating large amounts of data. Of course, there are also many other methods, such as difference analysis, conjoint analysis, neural networks and discriminant analysis. However, I will only describe these four procedures, as I see them most often in theses at my university.

Method	Explanation	Example
Significance analysis	Determine whether the answers of groups deviate from each other	How do SMEs invest compared to corporate groups?
Regression analysis	Determine the dependency of one variable on another	From what amount does the marketing budget influence the sales figures of a B2B SME?
Correlation analysis	Determine how strongly two variables are related	What is the current relationship between employee satisfaction and home office?
Cluster analysis	Derive groupings from the answers	Which generations prefer to found startups?

Significance analysis

As already explained, a significance analysis describes how far an observation deviates from the null hypothesis. In my example, the two groups are SMEs and non-SMEs. This is worthwhile as soon as you survey two or more groups. There may be differences between SMEs and non-SMEs, for example in the number of home office days. The method makes sense as soon as you want to compare something or work out differences, or, as in my case, examine SMEs specifically.
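
For a numeric variable such as home office days, one simple way to illustrate the idea is a permutation test. This is a sketch under assumptions, not the test I ran in SPSS: the data is invented, and permutation_test is a hypothetical helper. It shuffles the group labels many times and counts how often a difference at least as large as the observed one arises by chance.

```python
import random


def permutation_test(group_a, group_b, rounds=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and the share of random regroupings
    whose absolute difference is at least as large (the p-value estimate).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    size_a, size_b = len(group_a), len(group_b)
    observed = sum(group_a) / size_a - sum(group_b) / size_b
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = sum(pooled[:size_a]) / size_a - sum(pooled[size_a:]) / size_b
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / rounds


# Hypothetical home office days per week for ten respondents per group.
sme = [1, 2, 1, 0, 2, 1, 1, 2, 0, 1]
non_sme = [3, 2, 4, 3, 2, 3, 4, 2, 3, 3]

observed, p_value = permutation_test(sme, non_sme)
print(f"observed difference in means: {observed:.2f}")
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

The null hypothesis here is that the group label makes no difference. If almost no random regrouping reproduces the observed gap, that hypothesis becomes hard to defend.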

Regression analysis

Here you try to map the dependency of a dependent variable on one or more independent variables. For example, a CEO would like to know how much money he has to invest in advertising for something to change in the company, such as sales figures or the number of new customers. To do this, you create so-called scatter plots and see whether there is a connection between the selected variable and the others. In contrast to the next method, cause and effect are examined in detail here. This enables predictions, which is useful when your research question calls for them.
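
The advertising example can be sketched as a simple linear regression with one independent variable. The figures below are invented, and fit_line is a hypothetical helper that applies ordinary least squares; a real analysis would also check how well the line actually fits the data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising budget and sales, both in thousand euros.
budget = [10, 20, 30, 40, 50]
sales = [120, 135, 160, 170, 195]

slope, intercept = fit_line(budget, sales)
predicted = intercept + slope * 60  # forecast for a budget of 60

print(f"sales = {intercept:.1f} + {slope:.2f} * budget")
print(f"forecast for a budget of 60: {predicted:.1f}")
```

The slope answers the CEO’s question directly: it estimates how much sales change for each additional unit of advertising budget, at least within the range of the observed data.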

Correlation Analysis

Here you look at the relationship between two variables, that is, whether they are related. For example, you can say whether employee satisfaction and days in the home office are related: does satisfaction increase or decrease with more home office days? This makes sense when your research examines an influence on something. In contrast to regression, no cause and effect is determined here, only how strongly two variables move together. So here you are examining the connection in the here and now.
Reading tip: Study by me with correlation analysis
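
As a small illustration of the home office example, the following sketch computes the Pearson correlation coefficient by hand. The data and the 1-10 satisfaction scale are assumptions for illustration. The coefficient lies between -1 and 1: values near 1 mean both variables rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is no linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a variable without variation has no correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical answers: home office days per week and satisfaction (1-10).
home_office_days = [0, 1, 2, 3, 4, 5]
satisfaction = [5, 6, 6, 8, 7, 9]

r = pearson_r(home_office_days, satisfaction)
print(f"r = {r:.2f}")
```

Even a high coefficient only says that the two variables move together in this sample; it does not say which one influences the other.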

Cluster analysis

With a cluster analysis, you can identify similarities in large groups and summarize them. Customer group analyses are a great example: a marketing manager looks at which customer groups shop in his online shop. In a thesis, you summarize similar answers in groups. The result is groups such as the types of people who found startups. In short, cluster analysis makes it possible to divide the data into groups.
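
To show what “dividing the data into groups” can look like, here is a minimal k-means sketch on a single variable. Everything in it is an assumption for illustration: the founder ages are invented, the three starting centers are chosen by hand, and k-means is only one of several clustering procedures.

```python
def assign(values, centers):
    """Put every value into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    return clusters


def k_means_1d(values, centers, max_rounds=100):
    """Plain k-means on one numeric variable, starting from given centers."""
    centers = list(centers)
    for _ in range(max_rounds):
        clusters = assign(values, centers)
        new_centers = [
            sum(c) / len(c) if c else centers[i]  # an empty cluster keeps its center
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # centers no longer move
        centers = new_centers
    return centers, assign(values, centers)


# Hypothetical ages of startup founders from a survey.
founder_ages = [22, 24, 25, 27, 38, 41, 43, 45, 58, 61]

centers, clusters = k_means_1d(founder_ages, centers=[20, 40, 60])
for center, members in zip(centers, clusters):
    print(f"center {center:.1f}: {members}")
```

The number of groups and the starting centers have to be chosen in advance, which is why you should always check whether the resulting groups make sense in terms of content.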

Conclusion: Tips on the quantitative survey method

The method was very well suited to my research and focuses on data collection. I have used it offline as well as online. Preparation and evaluation take a long time, which is why every survey must be properly prepared and planned. It is also important to interview at least 50 people, since significance tests in particular only make sense from about 30 people per group, i.e. a total of 60 participants when comparing two groups. My tips should give you an initial orientation on the methodology. Be sure to take a look at my other book tips too!
Sources used:

Kuckartz, U., Rädiker, S., Ebert, T., & Schehl, J. (2013): Statistics: An understandable introduction, Wiesbaden: VS Verlag für Sozialwissenschaften.
Kuckartz, U., Ebert, T., Rädiker, S., & Stefer, C. (2009): Evaluation online: Internet-based survey in practice, 1st edition, Wiesbaden: VS Verlag für Sozialwissenschaften.
Porst, R. (2014): Questionnaire – A work book, 1st edition, Wiesbaden: VS Verlag für Sozialwissenschaften.
Image source: Designed by Freepik


What does a data scientist do?

Dr. Dominic Lindner — Sun, 21 Mar 2021 17:27:47 +0000

The amount of data continues to grow. In fact, data is now said to be the new oil. At the same time, a new job profile is emerging: the title of data scientist appears more and more often. The SAS portal says: “Anyone who knows how strategically important knowledge can be drawn from large amounts of data, and can also convey this, has a key position in the company as a consultant for top management.”
But if you look at the job advertisements, you will find a wide variety of descriptions and ask yourself: What does a data scientist do? This article is intended to provide some answers.
Reading tip: What is big data

What should a data scientist be able to do?

If you look at the job advertisements, a data scientist should usually be able to do the following:

Analytical talent
Expertise
Communication
Urge to research
Coordination talent

On the one hand, the data scientist must recognize relationships in large amounts of data and be able to analyze them. He should also have business and domain knowledge in order to understand the problems of the specialist departments. His communication skills allow him to talk with them in depth and stay in constant contact. His curiosity also allows him to solve difficult problems and work with hypotheses. But he is also a project manager and has to expand, manage and maintain the database.

Skills and tasks of a data scientist (own illustration)

What does a data scientist do?

There are many names for the new job profiles. Computerwoche has already broken some of them down and defines them as follows:

The (Big) Data Engineer is the master of the data supply.
The management scientist is the mediator between the departmental worlds.
The data scientist provides answers to analytical questions based on data.
The data steward is responsible for monitoring data quality and integrity.

The data engineer is responsible for merging the data and knows where the data is and how it is merged. The management scientist then analyzes this data and defines the actual problems. On the data scientist, Computerwoche says: “The main task of the data scientist is to generate answers to analytical questions from data – with the help of analytical methods from the areas of statistics, machine learning or operations research.” The data steward monitors the incoming data and ensures that it is technically correct.
So in the end there are numerous job descriptions, and these terms are not always used consistently or clearly separated. This breakdown should therefore only serve as an orientation. I don’t think that every company separates the roles so strictly; often everything is summarized under the umbrella of the data scientist or data analyst. The following figure shows the fields of application and answers the question: What does a data scientist do?

Tasks of the data scientist in a process chain (own illustration)

Conclusion: what does a data scientist do?

There are numerous terms and answers to the question of what a data scientist does, but in the end a general picture emerges. On the one hand, the data scientist is a scientist who solves business problems with the help of data; on the other hand, he is a project manager who manages the database in the company.
I hope this article has shed some light on the matter, and I look forward to your comments on what the data scientists are doing in your environment. I have already worked with a few during my consulting time and can also see what my big data colleagues at the chair are researching. I also learned something from them and used a quantitative data analysis in my doctorate.
Tip: Book suggestions on big data

Image source: Designed by Freepik
