research problems with data

Towards Data Science

Dr. Sunil Kumar Vuppala

Jun 27, 2020

Top 20 Latest Research Problems in Big Data and Data Science

Problem statements in 5 categories, research methodology and research labs to follow.

Even though big data is in the mainstream of operations as of 2020, there are still open issues and challenges that researchers can address. Some of these issues overlap with the data science field. This article covers 20 of the latest interesting research problems at the intersection of big data and data science, based on my personal experience (with due respect to the intellectual property of my organizations) and the latest trends in these domains [1,2]. The problems are grouped into 5 categories, namely:

Core big data area to handle the scale

Handling noise and uncertainty in the data

Security and privacy aspects

Data engineering

Intersection of big data and data science

The article also covers a research methodology for solving the specified problems and the top research labs working in these areas.

I encourage researchers to solve applied research problems, which will have more impact on society at large. The reason to stress this point is that we are hardly analyzing 1% of the available data, while we generate terabytes of data every day. These problems are not specific to one domain and can be applied across domains.

Let me first introduce the 8 V’s of big data (based on an interesting article from Elena), namely Volume, Value, Veracity, Visualization, Variety, Velocity, Viscosity, and Virality. If we look closely at the questions on the individual V’s in Fig 1, they raise interesting points for researchers. Even though they are business questions, there are underlying research problems. For instance, 02-Value: “Can you find it when you most need it?” calls for analyzing the available data and giving context-sensitive answers when needed.

Having understood the 8 V’s of big data, let us look into the details of the research problems to be addressed. General big data research topics [3] are along the lines of:

Next, let me cover some specific research problems across the five categories listed above. The problems related to the core big data area of handling the scale:-

1. Scalable architectures for parallel data processing:

Environments such as Hadoop or Spark are used for offline or online processing of data. The industry is looking for scalable architectures to carry out parallel processing of big data. There has been a lot of progress in recent years; however, there is still huge potential to improve performance.

2. Handling real-time video analytics in a distributed cloud:

With increased internet accessibility, even in developing countries, video has become a common medium of data exchange. Telecom infrastructure, operators, Internet of Things (IoT) deployments, and CCTV all play a role here. Can existing systems be enhanced to deliver lower latency and higher accuracy? Once real-time video data is available, how can it be transferred to the cloud, and how can it be processed efficiently both at the edge and in a distributed cloud?

3. Efficient graph processing at scale:

Social media analytics is one area that demands efficient graph processing. The role of graph databases in big data analytics is covered extensively in the reference article [4]. Efficient graph processing at a large scale is still a fascinating problem to work on.

The research problems to handle noise and uncertainty in the data:-

4. Identify fake news in near real-time:

Handling fake news in real time and at scale is a pressing issue, because fake news spreads like a virus, in bursts. The data may come from Twitter, fake URLs, or WhatsApp. Sometimes the content appears to come from an authenticated source but is still fake, which makes the problem more interesting to solve.

5. Dimensionality reduction approaches for large scale data:

One can extend the existing approaches to dimensionality reduction to handle large scale data, or propose new approaches. This also includes visualization aspects. One can start from existing open-source contributions and contribute back to open source.

6. Training / inference in noisy environments and with incomplete data:

Sometimes one may not get the complete distribution of the input data, or data may be lost due to a noisy environment. Can the data be augmented in a meaningful way by oversampling, for example with the Synthetic Minority Oversampling Technique (SMOTE), or by using Generative Adversarial Networks (GANs)? Can the augmentation help improve performance? How to train and infer under these conditions is the challenge to be addressed.
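To make the oversampling idea concrete, here is a minimal sketch of SMOTE-style interpolation in plain Python. It is an illustration under stated assumptions, not the reference SMOTE implementation: the function name `smote_like`, the toy points, and the choice of `k` are all hypothetical.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist2(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Toy minority class (hypothetical values)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
```

Every synthetic point lies on a line segment between two real minority samples, so the augmentation stays inside the region the minority class already occupies. Whether that actually improves model performance is the open question raised above.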

7. Handling uncertainty in big data processing:

There are multiple ways to handle uncertainty in big data processing [5]. Sub-topics include how to learn from low-veracity, incomplete, or imprecise training data, and how to handle uncertainty with unlabeled data when the volume is high. One can try active learning, distributed learning, deep learning, and fuzzy logic theory to solve these sets of problems.

The research problems in the security and privacy [6] area:-

8. Anomaly Detection in Very Large Scale Systems:

Anomaly detection is a standard problem, but it is not trivial at large scale and in real time. Application domains include health care, telecom, and finance.
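As a small illustration of why the real-time setting matters, the sketch below flags outliers in a stream with a running z-score, updating the mean and variance one value at a time (Welford's method) rather than storing the full history. The class name, threshold, warm-up length, and readings are assumptions chosen for the example.

```python
import math

class RunningZScore:
    """Flag values far from the running mean, using Welford's online update."""

    def __init__(self, threshold=3.0, warmup=5):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations
        self.threshold = threshold
        self.warmup = warmup   # observations needed before flagging starts

    def update(self, x):
        """Return True if x looks anomalous, then fold x into the statistics."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford update: constant memory, one pass over the stream
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly

detector = RunningZScore()
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 25.0, 10.0]  # hypothetical sensor values
flags = [detector.update(r) for r in readings]
```

Only the reading of 25.0 is flagged here. Note that this sketch folds flagged values back into the statistics; a production system would probably exclude or down-weight them, and would need to shard this state across many machines, which is where the research difficulty lies.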

9. Effective anonymization of sensitive fields in large scale systems:

Let me take an example from healthcare systems. A chest X-ray image may carry Personal Health Record (PHR) details. How can one anonymize the sensitive fields to preserve privacy in a large scale system in near real-time? The same question applies to other fields where privacy must be preserved.
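For structured metadata, one common starting point is to replace sensitive values with keyed hashes. The sketch below uses Python's standard `hmac` and `hashlib` modules; the field names, key, and record are hypothetical. Strictly speaking this is pseudonymization rather than full anonymization, and it does not address text burned into the image pixels.

```python
import hashlib
import hmac

SENSITIVE_FIELDS = {"patient_name", "patient_id"}   # hypothetical field names

def pseudonymize(record, secret_key):
    """Replace sensitive values with a keyed hash; leave other fields as-is."""
    out = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            digest = hmac.new(secret_key, str(value).encode("utf-8"),
                              hashlib.sha256).hexdigest()
            out[field] = digest[:16]   # truncated token, for readability only
        else:
            out[field] = value
    return out

key = b"demo-key-do-not-use-in-production"
record = {"patient_name": "Jane Doe", "patient_id": "P-001", "finding": "normal"}
safe = pseudonymize(record, key)
```

Because the hash is keyed and deterministic, the same patient maps to the same token across records, which keeps joins possible without exposing the original value to anyone who lacks the key.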

10. Secure federated learning with real-world applications:

Federated learning enables model training on decentralized data. It can be adopted where data cannot be shared due to regulatory or privacy constraints: models are built locally and then the models, rather than the data, are shared across boundaries. Can we make federated learning work at scale and secure it with standard software- and hardware-level security? That is the next challenge to be addressed. Interested researchers can find further information from the RISELab at UC Berkeley.
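The "train locally, share only the model" loop can be sketched with federated averaging on a one-parameter model. This is a toy under stated assumptions: the clients, data, learning rate, and function names are invented for illustration, and the security layer the problem statement asks about (secure aggregation, encryption, attestation) is deliberately left out.

```python
def local_update(weights, data, lr=0.1, epochs=20):
    """Fit y = w * x on one client's private data by gradient descent."""
    w = weights
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(global_w, clients, rounds=5):
    """Each round: clients train locally, then the server averages their
    weights in proportion to how many samples each client holds."""
    for _ in range(rounds):
        updates = [(local_update(global_w, data), len(data)) for data in clients]
        total = sum(n for _, n in updates)
        global_w = sum(w * n for w, n in updates) / total
    return global_w

# Two hypothetical clients whose data both follow y = 3x; raw data never leaves them.
clients = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]]
w = federated_average(0.0, clients)
```

The server only ever sees the scalar weight returned by each client, never the `(x, y)` pairs. With these toy values the shared weight converges close to 3.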

11. Scalable privacy preservation on big data:

Privacy preservation for large scale data is a challenging research problem, as applications range from text and images to videos. Differences in country- and region-level privacy regulations make the problem even more challenging.

The research problems related to data engineering aspects:-

12. Lightweight Big Data analytics as a Service:

Offering everything as a service, as in Software as a Service (SaaS), is a trend in the industry. Can we work towards providing lightweight big data analytics as a service?

13. Auto conversion of algorithms to MapReduce problems:

MapReduce is a well-known programming model in big data. It is not just a pair of map and reduce functions; it also provides scalability and fault tolerance to applications. However, not many algorithms support MapReduce directly. Can we build a library that automatically converts standard algorithms to MapReduce form?
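For readers new to the model, the three phases can be sketched on a single machine with the classic word count. The function names and documents are illustrative; a real framework would run the map and reduce phases in parallel on many nodes and handle failures, which this sketch does not attempt.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

documents = ["big data big ideas", "data science"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

Word count fits the model because each document can be mapped independently and addition is associative. Automatic conversion is hard precisely because many standard algorithms do not decompose this cleanly.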

14. Automated Deployment of Spark Clusters:

A lot of progress has been made in the use of Spark clusters in recent times, but they are not completely ready for automated deployment. This is yet another challenging problem to explore further.

The research problems at the intersection of big data with data science:-

15. Approaches to make the models learn with fewer data samples:

In the last 10 years, the complexity of deep learning models has increased with the availability of more data and compute power. Some researchers proudly claim that they solved a complex problem with hundreds of layers in deep learning. For instance, image segmentation may need a 100-layer network. However, the recent trend asks whether the same problem can be solved with a smaller amount of relevant data and with less complexity. The motivation is to run models on edge devices, not only in cloud environments with GPUs/TPUs. For instance, deep learning models trained on big data might need to be deployed on CCTV cameras or drones for real-time use. This is fundamentally changing the approach to solving complex problems. You may work on challenging problems in this sub-topic.

16. Neural Machine Translation to Local languages:

One can use Google Translate for neural machine translation (NMT). However, a lot of research is under way in local universities, with government support, on neural machine translation for local languages. The latest advances in Bidirectional Encoder Representations from Transformers (BERT) are changing the way these problems are solved. One can collaborate with those efforts to solve real-world problems.

17. Handling Data and Model drift for real-world applications:

Do we need to run the model on inference data if we know that the data pattern is changing and the performance of the model will drop? Can we identify the drift in the data distribution even before passing the data to the model? If one can identify the drift, why pass the data to the model for inference and waste compute power? This is a compelling research problem to solve at scale in the real world. Active learning and online learning are some of the approaches to the model drift problem.
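One simple way to check for drift before inference is to compare an incoming batch against a reference sample from training, for example with the two-sample Kolmogorov-Smirnov statistic. The sketch below is a minimal pure-Python version for a single feature; the threshold of 0.5 and the sample values are arbitrary assumptions, and a real system would calibrate the threshold and handle many features.

```python
from bisect import bisect_right

def ks_statistic(reference, incoming):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref, inc = sorted(reference), sorted(incoming)
    gap = 0.0
    for x in ref + inc:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_inc = bisect_right(inc, x) / len(inc)
        gap = max(gap, abs(cdf_ref - cdf_inc))
    return gap

def should_run_inference(reference, incoming, threshold=0.5):
    """Skip inference when the incoming batch has drifted from the reference."""
    return ks_statistic(reference, incoming) <= threshold

# Hypothetical values of one feature
training_sample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
similar_batch = [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85]
shifted_batch = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8]
```

With these values, `similar_batch` passes the check and `shifted_batch` is held back, so no model compute is spent on data the model was never trained for.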

18. Handling interpretability of deep learning models in real-time applications:

Explainable AI is a recent buzzword. Interpretability is a subset of explainability. Machine learning and deep learning models no longer have to be treated as black boxes. A few models, such as decision trees, are interpretable by design. However, as complexity increases, the base model itself may not be useful for interpreting the results. We may then need to depend on post-hoc explanation methods such as Local Interpretable Model-agnostic Explanations (LIME) or SHapley Additive exPlanations (SHAP). These can help decision-makers justify the results produced, for instance the rejection of a loan application or the classification of a chest X-ray as COVID-19 positive. Can the interpretable models handle large scale real-time applications?
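The idea behind Shapley-based explanations can be shown by computing exact Shapley values for a tiny model. This sketch is not the SHAP library's API: the scoring function, feature values, and all-zero baseline are hypothetical. It also enumerates every feature ordering, which is only feasible for a handful of features and hints at why real-time use at scale is hard.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley attribution: average each feature's marginal contribution
    over every order in which features are switched from baseline to instance."""
    n = len(instance)
    contrib = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = instance[i]
            now = model(current)
            contrib[i] += now - prev
            prev = now
    return [c / len(orders) for c in contrib]

def loan_score(features):
    """Hypothetical scoring function, used only for illustration."""
    income, debt, years_employed = features
    return 2.0 * income - 3.0 * debt + 0.5 * income * years_employed

applicant = [4.0, 2.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(loan_score, applicant, baseline)
```

The attributions sum to the difference between the applicant's score and the baseline score, and the interaction between income and years employed is split equally between those two features.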

19. Building context-sensitive large scale systems:

Building a large scale context-sensitive system is the latest trend. There are some open-source efforts to start from. However, it requires a lot of effort to collect the right set of data and to build context-sensitive systems that improve search capability. You can choose a research problem in this topic if you have a background in search, knowledge graphs, and Natural Language Processing (NLP). This is applicable across domains.

20. Building large scale generative based conversational systems (Chatbot frameworks):

One specific area gaining momentum is building conversational systems such as Q&A and generative chatbot systems. A lot of chatbot frameworks are available. Making them generative and producing summaries of real-time conversations are still challenging problems. The complexity of the problem increases as the scale increases. A lot of research is going on in this area. It requires a good understanding of Natural Language Processing and of the latest advances, such as Bidirectional Encoder Representations from Transformers (BERT), to expand the scope of what conversational systems can solve at scale.

Research Methodology:

I hope you can frame specific problems from the topics highlighted above, using your domain and technical expertise. Let me recommend a methodology for solving any of these problems. Some points may look obvious to researchers; however, let me cover them in the interest of a larger audience:

Identify your core strengths, whether in theory, implementation, tools, security, or a specific domain. You can acquire other new skills while doing the research. Identifying the right research problem with suitable data is like reaching 50% of the milestone. The problem may overlap with other technology areas such as the Internet of Things (IoT), Artificial Intelligence (AI), and Cloud. Your passion for research will determine how far you can go in solving the problem. The trend is towards interdisciplinary research problems across departments, so one may choose a specific domain in which to apply big data and data science skills.

Literature survey: I strongly recommend following only authenticated publications such as IEEE, ACM, Springer, Elsevier, and ScienceDirect. Do not fall into the trap of “International journal …” titles that publish without peer review. At the same time, do not limit the literature survey to IEEE/ACM papers. A lot of interesting papers are available on arxiv.org and paperswithcode. One needs to follow the top research labs in industry and academia for the shortlisted topic. That gives the latest research updates and helps to identify the gaps to fill.

Lab ecosystem: Create a good lab environment to carry out strong research. This can be your research lab with professors, post-docs, Ph.D. scholars, and master's and bachelor's students in an academic setup, or senior and junior researchers in an industry setup. Having the right partnership is the key to collaboration, and you may try virtual groups as well. A good ecosystem boosts the results, as members can challenge each other's approaches to improve the results further.

Publish at the right avenues: As mentioned in the literature survey, publish research papers in the right forum, where you will receive peer reviews from experts around the world. You may face obstacles in this process in the form of rejections. However, as long as you receive constructive feedback, you should be thankful to the anonymous reviewers. You may also see an opportunity to patent the ideas if the approach is novel, non-obvious, and inventive. The recent trend is to open source the code while publishing the paper. If your institution permits open sourcing, you may upload the relevant code to GitHub with appropriate licensing terms and conditions.

Top Research labs to follow:

Some of these research areas are active in the top research centers around the world. I request you to follow them and identify further gaps to continue the work. Here are some of the top research centers around the world to follow in big data + data science area:

RISE Lab at the University of California, Berkeley, USA

Doctoral Research Centre in Data Science, The University of Edinburgh, United Kingdom

Data Science Institute, Columbia University, USA

The Institute of Data-Intensive Engineering and Science, Johns Hopkins University, USA

Facebook Data Science research

Big Data Institute, University of Oxford, United Kingdom

Center for Big Data Analytics, The University of Texas at Austin, USA

Center for data science and big data analytics, Oakland University, USA

Institute for Machine Learning, ETH Zurich, Switzerland

The Alan Turing Institute, United Kingdom

IISc Computational and Data Sciences Research

Data Lab, Carnegie Mellon University, USA

If you wish to continue your learning in big data, here are my recommendations:

Coursera Big Data Specialization

Big data course from the University of California San Diego

Top 10 books based on your need can be picked up from the summary article in Analytics India Magazine.

Data Challenges:

In the process of solving the real-world problems, one may come across these challenges related to data:

Conclusion:

In this article, I briefly introduced the big data research issues in general and listed the top 20 latest research problems in big data and data science in 2020. These problems are further divided and presented in 5 categories so that researchers can pick a problem based on their interests and skill set. This list is by no means exhaustive. However, I hope these inputs can excite some of you to solve the real problems in big data and data science. I covered these points, along with some background on big data, in a webinar for your reference [7]. You may also refer to my other article, which lists the problems to solve with data science amid Covid-19 [8]. Let us come together to build a better world with technology.

References:

[1] https://www.gartner.com/en/newsroom/press-releases/2019-10-02-gartner-reveals-five-major-trends-shaping-the-evoluti

[2] https://www.forbes.com/sites/louiscolumbus/2019/09/25/whats-new-in-gartners-hype-cycle-for-ai-2019/#d3edc37547bb

[3] https://arxiv.org/ftp/arxiv/papers/1705/1705.04928.pdf

[4] https://www.xenonstack.com/insights/graph-databases-big-data/

[5] https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0206-3

[6] https://www.rd-alliance.org/group/big-data-ig-data-security-and-trust-wg/wiki/big-data-security-issues-challenges-tech-concerns

[7] https://www.youtube.com/watch?v=maZonSZorGI

[8] https://medium.com/@sunil.vuppala/ds4covid-19-what-problems-to-solve-with-data-science-amid-covid-19-a997ebaadaa6

Choose the right research problem and apply your skills to solve it. All the very best. Please share your feedback in the comments section. Feel free to add if you come across further topics in this area.

Dr. Sunil is a Director of Data Science at Ericsson, with 16+ years of experience in ML/DL, IoT, and analytics; inventor, speaker, thought leader. Top Data Scientist in India.

Data Collection | Definition, Methods & Examples

Published on June 5, 2020 by Pritha Bhandari. Revised on November 30, 2022.

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem.

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

To collect high-quality data that is relevant to your purposes, follow these four steps.

Table of contents

Step 1: Define the aim of your research

Step 2: Choose your data collection method

Step 3: Plan your data collection procedures

Step 4: Collect the data

Frequently asked questions about data collection

Step 1: Define the aim of your research

Before you start the process of data collection, you need to identify exactly what you want to achieve. You can start by writing a problem statement: what is the practical or scientific issue that you want to address and why does it matter?

Next, formulate one or more research questions that precisely define what you want to find out. Depending on your research questions, you might need to collect quantitative or qualitative data:

If your aim is to test a hypothesis, measure something precisely, or gain large-scale statistical insights, collect quantitative data. If your aim is to explore ideas, understand experiences, or gain detailed insights into a specific context, collect qualitative data. If you have several aims, you can use a mixed methods approach that collects both types of data.

Step 2: Choose your data collection method

Based on the data you want to collect, decide which method is best suited for your research.

Carefully consider what method you will use to gather data that helps you directly answer your research questions.

Step 3: Plan your data collection procedures

When you know which method(s) you are using, you need to plan exactly how you will implement them. What procedures will you follow to make accurate observations or measurements of the variables you are interested in?

For instance, if you’re conducting surveys or interviews, decide what form the questions will take; if you’re conducting an experiment, make decisions about your experimental design (e.g., determine inclusion and exclusion criteria).

Operationalization

Sometimes your variables can be measured directly: for example, you can collect data on the average age of employees simply by asking for dates of birth. However, often you’ll be interested in collecting data on more abstract concepts or variables that can’t be directly observed.

Operationalization means turning abstract conceptual ideas into measurable observations. When planning how you will collect data, you need to translate the conceptual definition of what you want to study into the operational definition of what you will actually measure.

You may need to develop a sampling plan to obtain data systematically. This involves defining a population, the group you want to draw conclusions about, and a sample, the group you will actually collect data from.

Your sampling method will determine how you recruit participants or obtain measurements for your study. To decide on a sampling method you will need to consider factors like the required sample size, accessibility of the sample, and timeframe of the data collection.

Standardizing procedures

If multiple researchers are involved, write a detailed manual to standardize data collection procedures in your study.

This means laying out specific step-by-step instructions so that everyone in your research team collects data in a consistent way, for example by conducting experiments under the same conditions and using objective criteria to record and categorize observations. This helps you avoid common research biases like omitted variable bias or information bias.

This helps ensure the reliability of your data, and you can also use it to replicate the study in the future.

Creating a data management plan

Before beginning data collection, you should also decide how you will organize and store your data.

Step 4: Collect the data

Finally, you can implement your chosen methods to measure or observe the variables you are interested in.

The closed-ended questions ask participants to rate their manager’s leadership skills on scales from 1–5. The data produced is numerical and can be statistically analyzed for averages and patterns.

To ensure that high quality data is recorded in a systematic way, here are some best practices:

Frequently asked questions about data collection

Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organizations.

When conducting research, collecting original data has significant advantages:

However, there are also some drawbacks: data collection can be time-consuming, labor-intensive and expensive. In some cases, it’s more efficient to use secondary data that has already been collected by someone else, but the data might be less reliable.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to systematically measure variables and test hypotheses. Qualitative methods allow you to explore concepts and experiences in more detail.

Reliability and validity are both about how well a method measures something:

If you are doing experimental research, you also have to consider the internal and external validity of your experiment.

Operationalization means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioral avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data, it’s important to consider how you will operationalize the variables that you want to measure.

In mixed methods research, you use both qualitative and quantitative data collection and analysis methods to answer your research question.

Bhandari, P. (2022, November 30). Data Collection | Definition, Methods & Examples. Scribbr. Retrieved February 28, 2023, from https://www.scribbr.com/methodology/data-collection/

Ten Research Challenge Areas in Data Science

To drive progress in the field of data science, we propose 10 challenge areas for the research community to pursue. Since data science is broad, with methods drawing from computer science, statistics, and other disciplines, and with applications appearing in all sectors, these challenge areas speak to the breadth of issues spanning science, technology, and society. We preface our enumeration with meta-questions about whether data science is a discipline. We then describe each of the 10 challenge areas. The goal of this article is to start a discussion on what could constitute a basis for a research agenda in data science, while recognizing that the field of data science is still evolving.

Keywords: artificial intelligence, causal reasoning, computing systems, data life cycle, deep learning, ethics, machine learning, privacy, trustworthiness

Although data science builds on knowledge from computer science, engineering, mathematics, statistics, and other disciplines, data science is a unique field with many mysteries to unlock: fundamental scientific questions and pressing problems of societal importance.

In this article we enumerate 10 areas of research in which to make progress to advance the field of data science. Our goal is to start a discussion on what could constitute a basis for a research agenda in data science, while recognizing that the field of data science is still evolving.

Before we plunge into this enumeration, we preface our discussion by raising, but not answering, a meta-question: Is data science a discipline? Answering this meta-question is still under lively debate, including within the pages of this journal. Herein, we suggest additional meta-questions to help frame the debate.

Is Data Science a Discipline?

Data science is a field of study: one can get a degree in data science, get a job as a data scientist, and get funded to do data science research. But is data science a discipline, that is, a branch of knowledge? If not yet, will it evolve to be one, distinct from other disciplines? Here are a few meta-questions on whether data science is a discipline.

Are there driving deep question(s) in data science? If so, what are they? Each scientific discipline (usually) has one or more ‘deep’ questions that drive its research agenda: What is the origin of the universe (astrophysics)? What is the origin of life (biology)? What is computable (computer science)? Does data science inherit its deep questions from all its constituent disciplines or does it have its own unique ones?

What is the role of the domain in the field of data science? Many academics have argued (Wing et al., 2018) that data science is unique in that it is not just about methods, but also about the use of those methods in the context of a domain—the domain of the data being collected and analyzed; the domain in which, from this data, a question is to be answered. Is the inclusion of a domain inherent in defining the field of data science? Other methods-based disciplines, such as computer science, mathematics, and statistics, are used in the context of other domains, and are correspondingly inspired by problems from these domains. Can one study data science, as we do in computer science, mathematics, and statistics, without studying it in the context of a domain? Is the (more integral?) way a domain is included in the study of data science unique to data science?

What makes data science data science? Is there a problem unique to data science that one can convincingly argue would not be addressed or asked by any of its constituent disciplines, for example, computer science or statistics? When should a set of methods, analyses, or results be considered data science, and not just methods, analyses, or results in computer science or statistics (or mathematics, etc.)? Or should all methods, analyses, and results in all these disciplines be considered part of data science?

Data science as a field of study is still too new to have definitive answers to all these meta-questions. Their answers will likely evolve over time, as the field matures and as members of the contributing established disciplines share scholarship and perspectives from their respective disciplines. We encourage the data science community to ponder and debate these meta-questions, as we make progress on more concrete scientific and societal challenges raised by the preponderance of data, data science methods, and applications of data science.

Ten Research Areas

So, let’s ask an easier question, one that also underlies any field of study: What are the research challenge areas that drive the study of data science? Here is a list of 10. They are not in any priority order, and some of them are related to each other. They are phrased as challenge areas, not challenge questions; each area suggests many questions. They are not necessarily the ‘top 10’ but they are a good 10 to start the community discussing what a broad research agenda for data science might look like. Given our discussion above, they unsurprisingly overlap with challenges found in computer science, statistics (Berger et al., 2019), social sciences, and so on. Given the author’s background, they are posed from the perspective of a computer scientist. The list begins, roughly speaking, with challenges relevant to science, then to technology, and then to society.

1. Scientific Understanding of Learning, Especially Deep Learning Algorithms

As much as we admire the astonishing successes of deep learning, we still lack a scientific understanding of why deep learning works so well, though we are making headway (Arora et al., 2018; Balestriero & Baraniuk, 2018). We do not understand the mathematical properties of deep learning algorithms or of the models they produce. We do not know how to explain why a deep learning model produces one result and not another. We do not understand how robust or fragile models are to perturbations to input data distributions. We do not understand how to verify that deep learning will perform the intended task well on new input data. We do not know how to characterize or measure the uncertainty of a model’s results. We do not know deep learning’s fundamental computational limits (Thompson et al., 2020); at what point does more data and more compute not help? Deep learning is an example of where experimentation in a field is far ahead of any kind of complete theoretical understanding. And, it is not the only example in learning: random forests (Biau & Scornet, 2015) and high-dimensional sparse statistics (Johnstone & Titterington, 2009) enjoy widespread applicability on large-scale data, where gaps remain between their performance in practice and what theory can explain.

2. Causal Reasoning

Machine learning is a powerful tool to find patterns and to examine associations and correlations, particularly in large data sets. While the adoption of machine learning has opened many fruitful areas of research in economics, social science, public health, and medicine, these fields require methods that move beyond correlational analyses and can tackle causal questions. A rich and growing area of current study is revisiting causal inference in the presence of large amounts of data. Economists are devising new methods that incorporate the wealth of data now available into their mainstay causal reasoning techniques, for example, the use of instrumental variables; these new methods make causal inference estimation more efficient and flexible (Athey, 2016; Taddy, 2019). Data scientists are beginning to explore multiple causal inference, not just to overcome some of the strong assumptions of univariate causal inference, but because most real-world observations are due to multiple factors that interact with each other (Wang & Blei, 2019). Inspired by natural experiments used in economics and the social sciences, as more government agency and commercial data becomes publicly available, data scientists are using synthetic control for novel applications in public health, retail, and sports (Abadie et al., 2010; Amjad et al. 2019).
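To make the instrumental-variables idea mentioned above concrete, here is a minimal sketch on simulated data. It uses the classical single-instrument ratio estimator, not the newer machine-learning-based methods cited above, and every variable name and coefficient is an assumption chosen for illustration. An unobserved confounder `u` drives both the treatment `x` and the outcome `y`, so a plain regression slope is biased, while the instrument `z` (which affects `y` only through `x`) recovers a value close to the true effect.

```python
import random

def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)

rng = random.Random(1)   # fixed seed so the sketch is reproducible
n = 20_000
true_effect = 2.0        # hypothetical causal effect of x on y

z = [rng.gauss(0, 1) for _ in range(n)]   # instrument
u = [rng.gauss(0, 1) for _ in range(n)]   # unobserved confounder
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_effect * xi + 3.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

# Correlational estimate: regression slope of y on x, biased upward by u.
ols_estimate = covariance(x, y) / covariance(x, x)

# Instrumental-variable estimate: ratio of covariances with the instrument.
iv_estimate = covariance(z, y) / covariance(z, x)

print(f"OLS slope: {ols_estimate:.2f}, IV estimate: {iv_estimate:.2f}")
```

In this simulated setup the regression slope lands near 3 while the instrumental-variable estimate lands near the true value of 2, which is the gap between association and causation that the paragraph describes.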

3. Precious Data

Data can be precious for one of three reasons: the data set is expensive to collect; the data set contains a rare event (low signal-to-noise ratio); or the data set is artisanal—small, task-specific, and/or targets a limited audience. A good example of expensive data comes from large, one-off, expensive scientific instruments, for example, the Large Synoptic Survey Telescope, the Large Hadron Collider, and the IceCube Neutrino Detector at the South Pole. A good example of rare event data is data from sensors on physical infrastructure, such as bridges and tunnels; sensors produce a lot of raw data, but the disastrous event they are used to predict is (thankfully) rare. Rare data can also be expensive to collect. A good example of artisanal data is the tens of millions of court judgments that China has released online to the public since 2014 (Liebman et al., 2017) or the two-plus-million U.S. government declassified documents collected by Columbia’s History Lab (Connelly et al., 2019). For each of these different kinds of precious data, we need new data science methods and algorithms, taking into consideration the domain and the intended uses and users of the data.

4. Multiple, Heterogeneous Data Sources

For some problems, we can collect lots of data from different data sources to improve our models and to increase knowledge. For example, to predict the effectiveness of a specific cancer treatment for a human, we might build a model based on 2-D cell lines from mice, more expensive 3-D cell lines from mice, and the costly DNA sequence of the cancer cells extracted from the human. As another example, multiscale, spatiotemporal climate models simulate the interactions among multiple physical systems, each represented by disparate data sources drawn from sensing: the ocean, the atmosphere, the land, the biosphere, and humans. Many of these data sources might be precious data (see Challenge no. 3). State-of-the-art data science methods cannot as yet handle combining multiple, heterogeneous sources of data to build a single, accurate model. Bounding the uncertainty of a data model is exacerbated when built from multiple, possibly unrelated data sources. More pragmatically, standardization of data types and data formats could reduce undesired or unnecessary heterogeneity. Focused research in combining multiple sources of data will provide extraordinary impact.

5. Inferring From Noisy and/or Incomplete Data.

The real world is messy and we often do not have complete information about every data point. Yet, data scientists want to build models from such data to do prediction and inference. This long-standing problem in statistics comes to the fore as: (1) the volume of data, especially about people, that we can generate and collect grows unboundedly; (2) the means of generating and collecting data is not under our control, for example, data from mobile phone and web apps vary, by design, across different users and across different populations; and (3) many sectors, from finance to retail to transportation, embrace the desire to do real-time personalization. A great example of a novel formulation of this problem is the planned use of differential privacy for Census 2020 data (Abowd, 2018; Hawes, 2020), where noise is deliberately added to a query result, to maintain the privacy of individuals participating in the census. Handling ‘deliberate’ noise is particularly important for researchers working with small geographic areas such as census blocks, since the added noise can make the data uninformative at those levels of aggregation. How then can social scientists, who for decades have been drawing inferences from census data, make inferences on this ‘noisy’ data and how do they combine their past inferences with these new ones? Machine learning’s ability to better separate noise from signal can improve the efficiency and accuracy of those inferences.
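The effect of deliberate noise on small areas can be illustrated with a short sketch. It uses the textbook Laplace mechanism for a counting query, which is an assumption made for illustration and not necessarily the mechanism the Census Bureau deploys; the privacy budget `epsilon` and both population counts are hypothetical.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)   # fixed seed so the sketch is reproducible
epsilon = 0.5            # hypothetical privacy budget
block_count = 12         # hypothetical small census block
county_count = 120_000   # hypothetical county

errors = [abs(noisy_count(block_count, epsilon, rng) - block_count)
          for _ in range(10_000)]
mean_abs_error = sum(errors) / len(errors)

# The noise scale does not depend on the size of the count, so the same
# absolute error is a much larger fraction of a small count.
relative_error_block = mean_abs_error / block_count
relative_error_county = mean_abs_error / county_count

print(f"mean absolute error: {mean_abs_error:.2f}")
print(f"relative error, block: {relative_error_block:.1%}")
print(f"relative error, county: {relative_error_county:.4%}")
```

Because the expected absolute error equals the noise scale (here 1/epsilon = 2) regardless of the true count, the same noise that is negligible for a county can be a sizable share of a block-level count.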

6. Trustworthy AI

We have seen rapid deployment of systems using artificial intelligence and machine learning in critical domains such as autonomous vehicles, criminal justice, health care, hiring, housing, human resource management, law enforcement, and public safety, where decisions taken by AI agents directly impact human lives. Consequently, there is an increasing concern if these decisions can be trusted to be correct, fair, ethical (see Challenge no. 10), interpretable, private (see Challenge no. 9), reliable, robust, safe, and secure, especially under adversarial attacks. Many of these properties borrow from a long history of research on Trustworthy Computing (National Research Council, 1999), but AI raises the ante (Wing, 2020): reasoning about a machine learning model seems to be inseparable from reasoning about the available data used to build it and the unseen data on which it is to be used; and these models are inherently probabilistic. One approach to building trust is through providing explanations of the outcomes of a machine learned model (Adadi & Berrada, 2018; Chen et al., 2018; Murdoch et al., 2019; Turek, 2016). If we can interpret the outcome in a meaningful way, then the end user can better trust the model. Another approach is through formal methods, where one strives to prove once and for all a model satisfies a certain property. New trust properties yield new tradeoffs for machine learned models, for example, privacy versus accuracy; robustness versus efficiency; fairness versus robustness. There are multiple technical audiences for trustworthy models: model developers, model users (human and machine), and model customers; as well as more general audiences: consumers, policymakers, regulators, the media, and the public.

7. Computing Systems for Data-Intensive Applications

Traditional designs of computing systems have focused on computational speed and power: the more cycles, the faster the application can run. Today, the primary focus of applications, especially in the sciences (e.g., astronomy, biology, climate science, materials science), is data. Novel special-purpose processors, for example, GPUs, FPGAs, TPUs, are now commonly found in large data centers. Domain-specific accelerators, including those designed for deep learning, show orders of magnitude performance gains over general-purpose computers (Dally et al., 2020). Even with all these data and all this fast and flexible computational power, it can still take weeks to build accurate predictive models; however, applications, whether from science or industry, want real-time predictions. Distributing data, computing, and models helps with scale and reliability (and privacy), but then runs up against the fundamental limit of the speed of light and practical limits of network bandwidth and latency. Also, data-hungry and compute-hungry algorithms, for example, deep learning, are energy hogs (Strubell et al., 2019). Not only should we consider space and time, but energy consumption, in our performance metrics. In short, we need to rethink computer systems design from first principles, with data (not compute) the focus. New computing systems designs need to consider: heterogeneous processing, efficient layout of massive amounts of data for fast access, communication and network capability, energy efficiency, and the target domain, application, or even task.

8. Automating Front-End Stages of the Data Life Cycle

While the excitement in data science is due largely to the successes of machine learning, and more specifically deep learning, before we get to use machine learning algorithms, we need to prepare the data for analysis. The early stages in the data life cycle (Wing, 2019) are still labor intensive and tedious. Data scientists, drawing on both computational and statistical tools, need to devise automated methods that address data collection, data cleaning, and data wrangling, without losing other desired properties, for example, accuracy, precision, and robustness, of the end model. One example of emerging work in this area is the Data Analysis Baseline Library (Mueller, 2019), which provides a framework to simplify and automate data cleaning, visualization, model building, and model interpretation. The Snorkel project addresses the tedious task of data labeling (Ratner et al., 2018). Trifacta, a university spin-out company, addresses data wrangling (Trifacta, 2020). Complementing these needs, commercial services already support later stages in the data life cycle, in particular, automating construction of machine learning models, for example, Cloud AutoML (Google, 2020) and Azure Machine Learning (Microsoft, 2020).

9. Privacy

For many applications, the more data we have, the better the model we can build. One way to get more data is to share data, for example, multiple parties pool their individual data sets to build collectively a better model than any one party can build. However, in many cases, due to regulation or privacy concerns, we need to preserve the confidentiality of each party’s data set. An example of this scenario is in building a model to predict whether someone has a disease or not. If multiple hospitals could share their patient records, we could build a better predictive model; but due to Health Insurance Portability and Accountability Act (HIPAA, 1996) privacy regulations, hospitals cannot share these records. We are only now exploring practical and scalable ways, using cryptographic and statistical methods, for multiple parties to share data, models, and/or model outcomes while preserving the privacy of each party’s data set. Industry and government are already exploiting techniques and concepts, for example, secure multiparty computation, homomorphic encryption, zero-knowledge proofs, differential privacy, and secure enclaves, as elements of point solutions to point problems (Abowd, 2018; Ion et al., 2017; Kamara et al., 2014). We can also apply these methods to the simpler scenario where a single entity’s data must be kept private prior to analysis.
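As one illustration of secure multiparty computation, the sketch below uses additive secret sharing so that several parties can learn the sum of their private values without revealing any individual value. It is a toy, single-process simulation under strong assumptions (honest parties, no network, a hypothetical public modulus), not any of the protocols cited above; the function names and hospital counts are invented for the example.

```python
import secrets

MODULUS = 2**61 - 1  # hypothetical public modulus, larger than any possible total

def make_shares(value: int, n_parties: int) -> list[int]:
    """Split `value` into n random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]

def secure_sum(private_values: list[int]) -> int:
    """Simulate parties jointly computing the sum of their private values."""
    n = len(private_values)
    # all_shares[i][j] is the share that party i sends to party j.
    all_shares = [make_shares(v, n) for v in private_values]
    # Each party j adds the shares it received and publishes only that partial sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    # Combining the published partial sums reveals the total and nothing else.
    return sum(partial_sums) % MODULUS

hospital_case_counts = [120, 75, 310]  # hypothetical private inputs
total = secure_sum(hospital_case_counts)
print(total)
```

Each individual share is uniformly random on its own, so a party that sees only the shares sent to it learns nothing about another party's input; only the combined total is revealed.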

10. Ethics

Data science raises new ethical issues. They can be framed along three axes: (1) the ethics of data: how data are generated, recorded, and shared; (2) the ethics of algorithms: how artificial intelligence, machine learning, and robots interpret data; and (3) the ethics of practices: devising responsible innovation and professional codes to guide this emerging science (Floridi & Taddeo, 2016) and to define institutional review board (IRB) criteria and processes specific for data (Wing et al., 2018). The ethical principles expressed in the Belmont Report (Belmont Report, 1979) and the Menlo Report (Dittrich & Kenneally, 2011) give us a starting point for identifying new ethical issues data science technology raises. The ethical principle of Respect for Persons suggests that people should always be informed when they are talking with a chatbot. The ethical principle of Beneficence requires a risk/benefit analysis on the decision a self-driving car makes on whom not to harm. The ethical principle of Justice requires us to ensure the fairness of risk assessment tools in the court system and automated decision systems used in hiring. These new ethical issues correspondingly raise new scientific challenges for the data science community, for example, how to detect and eliminate racial, gender, socioeconomic, or other biases in machine learning models.
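Detecting bias first requires a measurable definition of it. The sketch below computes one simple and widely used criterion, the gap in selection rates between groups (often called demographic parity). It is only one of several competing fairness definitions and is not taken from the works cited above; the decisions and group labels are made up for illustration.

```python
from collections import defaultdict

def selection_rates(decisions, groups):
    """Fraction of positive decisions (1 = selected) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        positives[group] += decision
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_gap(decisions, groups):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())

# Hypothetical model outputs for eight applicants in two groups.
decisions = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(decisions, groups)
gap = demographic_parity_gap(decisions, groups)
print(rates, gap)
```

A gap of zero means every group is selected at the same rate; a large gap flags a disparity worth investigating, though it does not by itself establish the cause or the remedy.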

Closing Remarks

As many universities and colleges are creating new data science schools, institutes, centers, and so on (Wing et al., 2018), it is worth reflecting on data science as a field. Will data science as an area of research and education evolve into being its own discipline or be a field that cuts across all other disciplines? One could argue that computer science, mathematics, and statistics share this commonality: they are each their own discipline, but they each can be applied to (almost) every other discipline.

What will data science be in 10 or 50 years? The answer to this question is in the hands of the next-generation researchers and educators. To advance and study data science will take a commitment to learn the vocabulary, methods, and tools from multiple, traditionally siloed disciplines. Integrating and applying this knowledge takes patience, but can be exhilarating. To today’s undergraduates, graduate students, postdoctoral fellows, and early-career faculty and researchers: Through the data science research problems you choose to tackle, you will shape this field!

Acknowledgments

I would like to thank Cliff Stein, Gerad Torats-Espinosa, Max Topaz, and Richard Witten for their feedback on earlier renditions of this paper. Many thanks to all Columbia Data Science faculty who have helped me formulate and discuss these 10 (and other) challenges during our Fall 2019 DSI retreat. Final thanks to the anonymous reviewers whose comments helped me sharpen and enhance many of my points.

Disclosure Statement

Jeannette M. Wing has no financial or non-financial disclosures to share for this article.

Abadie, A., Diamond, A., & Hainmüller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's Tobacco Control Program. Journal of the American Statistical Association, 105 (490), 493–505. https://doi.org/10.1198/jasa.2009.ap08746

Abowd, J. M. (2018). The U.S. Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (p. 2867). Association for Computing Machinery. https://doi.org/10.1145/3219819.3226070

Adadi, A., & Berrada, M. (2018). Peeking Inside the Black-Box: A survey on explainable artificial intelligence (XAI). IEEE Access, 6, 52138–52160. https://doi.org/10.1109/ACCESS.2018.2870052

Amjad, M., Misra, V., Shah, D., & Shen, D. (2019). mRSC: Multi-dimensional Robust synthetic control. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3 (2), Article 37, 1–27. Association for Computing Machinery. https://doi.org/10.1145/3341617.3326152

Arora, S., Ge, R., Neyshabur, B., & Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. In Proceedings of Machine Learning Research: Vol. 80. Proceedings of the 35th International Conference on Machine Learning (pp. 254–263). http://proceedings.mlr.press/v80/arora18b.html

Athey, S. (2016). Susan Athey on how economists can use machine learning to improve policy. Stanford Institute for Economic Policy Research.

Balestriero, R., & Baraniuk, R. G. (2018). A spline theory of deep networks. In Proceedings of Machine Learning Research: Vol. 80. Proceedings of the 35th International Conference on Machine Learning (pp. 374–383). http://proceedings.mlr.press/v80/balestriero18b.html

Belmont Report. (1979). The Belmont report: Ethical principles and guidelines for the protection of human subjects of research. U.S. Department of Health, Education, and Welfare.

Berger, J., He, X., Madigan, C., Murphy, S., Yu, B., & Wellner, J. (2019). Statistics at a crossroad: Who is for the challenge? NSF workshop report. National Science Foundation. https://hub.ki/groups/statscrossroad

Biau, G., & Scornet, E. (2015). A random forest guided tour. TEST, 25, 197–227. https://doi.org/10.1007/s11749-016-0481-7

Chen, C., Lin, K., Rudin, C., Shaposhnik, Y., Wang, S., & Wang, T. (2018). An interpretable model with globally consistent explanations for credit risk. In NIPS 2018 Workshop on Challenges and Opportunities for AI in Financial Services: The Impact of Fairness, Explainability, Accuracy, and Privacy. https://doi.org/10.48550/arXiv.1811.12615

Connelly, M., Madigan, D., Jervis, R., Spirling, A., & Hicks, R. (2019). The History Lab. http://history-lab.org/

Dally, W. J., Turakhia, Y., & Han, S. (2020). Domain-specific accelerators. Communications of the ACM, 63 (7), 48–57. https://doi.org/10.1145/3361682

Dittrich, D., & Kenneally, E. (2011). The Menlo report: Ethical principles guiding information and communication technology research. IEEE Security and Privacy, 10 (2), Article 6173001. https://doi.org/10.1109/MSP.2012.52

Floridi, L., & Taddeo, M. (2016). What is data ethics? Philosophical Transactions of the Royal Society A, 374 (2083), Article 20160360. https://doi.org/10.1098/rsta.2016.0360

Google. (2020). Cloud AutoML. https://cloud.google.com/automl/

Hawes, M. B. (2020). Implementing differential privacy: Seven lessons from the 2020 United States Census. Harvard Data Science Review, 2 (2). https://doi.org/10.1162/99608f92.353c6f99

HIPAA (1996), Health Insurance Portability and Accountability Act, US Congress, Pub.L. 104–191, 110 Stat. 1936, enacted August 21, 1996.

Ion, M., Kreuter, B., Nergiz, E., Patel, S., Saxena, S., Seth, K., Shanahan, D., & Yung, M. (2017). Private intersection-sum protocol with applications to attributing aggregate ad conversions. Cryptology ePrint Archive, Report 2017/738. https://eprint.iacr.org/2017/738

Johnstone, I. M., & Titterington, D. M. (2009). Statistical challenges of high-dimensional data. Philosophical transactions: Series A, Mathematical, Physical, and Engineering Sciences, 367 (1906), 4237–4253. https://doi.org/10.1098/rsta.2009.0159

Kamara, S., Mohassel, P., Raykova, M., & Sadeghian, S. (2014). Scaling private set intersection to billion element sets. In N. Christin, & R. Safavi-Naini (Eds.), Financial cryptography and data security (pp. 195–215). Springer. https://doi.org/10.1007/978-3-662-45472-5_13

Liebman, B. L., Roberts, M., Stern, R. E., & Wang, A. (2017). Mass digitization of Chinese court decisions: How to use text as data in the field of Chinese law. UC San Diego School of Global Policy and Strategy, 21st Century China Center Research Paper No. 2017-01; Columbia Public Law Research Paper No. 14–551. https://scholarship.law.columbia.edu/faculty_scholarship/2039

Microsoft. (2020). What is automated machine learning (AutoML)? https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml

Mueller, A. (2019). Data analysis baseline library. GitHub. https://libraries.io/github/amueller/dabl

Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences of the United States of America, 116 (44), 22071–22080. https://doi.org/10.1073/pnas.1900654116

National Research Council. (1999). Trust in cyberspace. National Academies Press. https://doi.org/10.17226/6161

Ratner, A., Bach, S., Ehrenberg, H., Fries, J., Wu, S., & Ré, C. (2018). Snorkel: Rapid training data creation with weak supervision. Proceedings of the VLDB Endowment, 11 (3), 269–282. https://doi.org/10.14778/3157794.3157797

Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3645–3650). http://doi.org/10.18653/v1/P19-1355

Taddy, M. (2019). Business data science: Combining machine learning and economics to optimize, automate, and accelerate business decisions. McGraw Hill.

Thompson, N. C., Greenewald, K., Lee, K., & Manso, G. F. (2020). The computational limits of deep learning. arXiv. https://doi.org/10.48550/arXiv.2007.05558

Trifacta. (2020). https://www.trifacta.com/

Turek, M. (2016). Defense Advanced Research Projects Agency, Explainable AI Program. https://www.darpa.mil/program/explainable-artificial-intelligence

Wang, Y., & Blei, D. M. (2019). The blessings of multiple causes. Journal of the American Statistical Association, 114 (528), 1574–1596. https://doi.org/10.1080/01621459.2019.1686987

Wing, J. M. (2019). The data life cycle. Harvard Data Science Review, 1 (1). https://doi.org/10.1162/99608f92.e26845b4

Wing, J. M. (2020). Trustworthy AI. arXiv. https://doi.org/10.48550/arXiv.2002.06276

Wing, J. M., Janeja, V. P., Kloefkorn, T., & Erickson, L. C. (2018). Data Science Leadership Summit. Workshop Report. National Science Foundation. https://dl.acm.org/citation.cfm?id=3293458

©2020 Jeannette M. Wing. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.


Northeastern University Graduate Programs

The Biggest Data Analytics Challenges of 2022


The year 2019 saw some exciting changes in both the amount of data generated globally and the various applications of that data across industries. This continued evolution of big data and the analytics industry has led to impactful new technology, business practices, and careers for those in the field. However, the rapid and consistent level of advancement has also brought with it a new set of challenges that will come to define the sector in 2022.

Below, we explore what those challenges are, their potential to impact the future of data, and how aspiring data professionals can find a lucrative career in analytics as this field continues to evolve.


Top Data Analytics Challenges in 2022

1. The Need for More Trained Professionals

Research shows that, as of 2021, humans generated a total of 79 zettabytes of data. That total is expected to keep growing as the number of streams, posts, searches, texts, and more rises each day, and the increase isn’t expected to plateau anytime soon. By 2025, it is predicted that humans will create an astounding 463 exabytes of data daily.

From artificial intelligence to supply chain management, applications of this incredible amount of data are limitless, if there are enough professionals trained to handle it.

Thomas Goulding, professor for Northeastern’s Master of Professional Studies in Analytics program, says that the biggest analytics challenge of 2022 will be a lack of qualified data analysts with the tools and training needed to work with this massive amount of information.

“The big challenge we will face in analytics will be having enough qualified professionals to support the industry need,” Goulding says. “The demand for analytic talent is really far outstripping the ability of the education system to produce it.”

This shortage is due to a myriad of factors within the industry. One possible explanation Goulding offers is that many universities in America are seeing an increase in international students studying data while domestic numbers decline. While the increase in students of any citizenship studying data is beneficial during this global shortage, these students are often choosing to return to their home countries after graduation, widening the gap between qualified professionals and available roles in the U.S. What’s more, these graduates are often restricted from employment in certain U.S. organizations, further exacerbating the shortage of domestic graduates in this industry.


“There are literally thousands of openings [in analytics] that are currently not going to be filled anytime in the near future,” Goulding continues, and his estimation isn’t far off; Burning Glass Labor Insight reports that there are approximately 394,715 graduate-level jobs available for those with the proper advanced training, an increase of over 32 percent in the last two years alone. Without individuals to take on these open roles, however, Goulding emphasizes that advancements in research and the applications of analytics will come to an unnecessary standstill.

Facing This Challenge in 2022

In order to bridge this talent gap, Goulding recommends anyone with an interest in data consider honing their skills and increasing their career potential with a graduate degree in analytics from a top university like Northeastern.

An advanced degree provides aspiring analysts with the tools they need to be successful in this evolving field. From practical skills such as programming and statistics to professional or “soft skills” like communication and presentational abilities, graduates leave with the tools and experiences they need to not only fill one of the many open roles in this industry, but to thrive in it.

2. Bridging the Gap Between Executives and Predictive Data

Organizations across sectors use data to inform their decisions every day. Retailers, for example, might determine how much of a product to stock based on past sales of that item. Similarly, insurance companies can use collected information on a client’s past experiences to determine whether or not it’s in their best interest to cover them. Even healthcare organizations might use data in this way to track a patient’s entire medication list in an effort not to prescribe contraindicated doses.

Yet, while all these organizations have embraced this use of data, they are only scratching the surface of its potential. Many organizational decision-makers are unaware of analytical advances that allow them to make predictive rather than descriptive use of their collected data, and those that are aware might not have the technical understanding to fully appreciate the potential of this change.

This is the second-largest challenge that Goulding sees facing the analytics industry: a gap between the new ways and speed at which massive data can be processed to inform decision-making, and the level of data expertise needed to inform those decision-makers.

“There’s a culture change that has to happen in corporate America, where data must now become an increasingly strategic ally to corporate decision-making,” Goulding says. “Executives at the senior-level really need to understand the strategic value of their data so that they’re willing to trust in the judgments that come from their analytics teams.”

Digging In Deeper: This process of making predictions about the future based on an established set of data from the past is known as predictive analysis. Through the use of various statistical modeling tools, analysts can utilize the massive amounts of data that have been collected over the last decade to make informed decisions about what’s to come. This is an incredibly valuable tool for businesses hoping to stay ahead of changing trends.

Goulding sees a few paths to overcoming this challenge and making predictive data a part of the decision-making process at the executive level. First, he believes that “getting adequate professionals trained” in these new forms of analytics will help to start shedding light on the potential of these practices.

From there, he believes it will be up to the data analysts within larger organizations to find ways of demonstrating the strategic value of predictive methods to senior leadership. By showing executives how data can help them “answer board-level questions with analytics,” analysts will be able to exemplify the true power of data. These questions may include:

An analyst who can help answer these questions with data will be able to demonstrate to executives the true strategic value of these tools. However, while most analysts may have the technical training to carry out the analysis, the most valuable will have the communication, presentation, and data visualization skills needed to effectively share the value of their analysis with leadership teams. This, Goulding identifies, is a significant opportunity for young professionals entering the industry.

For this reason, programs like Northeastern’s master’s in analytics have incorporated courses in data visualization, presentation, and communicating with data alongside technical ones to develop analysts who are ready to tackle this challenge head-on.

3. Data That Isn’t Harmonious

While the increase in available daily data is positively impacting many aspects of data analytics, there are some downsides to the increased quantity. For example, Goulding explains that while the data we’re collecting is extremely valuable once it has been properly processed, it is not easy to manage in its raw form.

“The data that we have isn’t what I might call harmonious,” he says. “We have so many diverse sources of data being generated in so many different formats that it’s not easily integrated. Getting all that data together and into a single format that can be easily rationalized is going to be a major challenge now and for the foreseeable future.”

While the act of rationalizing data is nothing new to a data analyst, it is the amount of time, energy, and resources that businesses will need to put toward this process that is the new factor in the coming decade.

Today, “90 percent of an analyst’s time is working with data to get it integrated and harmonized in a way that’s useful,” Goulding says, and since this work is only going to increase the more data humans produce, future analysts need to ensure they have the necessary skills and training to handle it.

Analytics training programs that are abreast of challenges like these in the coming decade have developed their curriculum to embrace this type of large-scale integration early on.

In Northeastern’s Master of Professional Studies in Analytics program, for example, students are able to practice working with large-scale data sets from corporate partners and government research organizations. As a result, they learn to “overcome [any] challenges they encounter as if they were a professional within that company, and work to actually answer a question of real value to an employer,” Goulding says.

This type of real-world experience is vital for analysts hoping to hone these essential skills and prepare for the realities of the data analytics field in 2022.

Interested in landing a role in analytics? Learn how to break into the industry with our custom e-book, or explore how a Master of Professional Studies in Analytics from Northeastern can help set you on a path toward success in this ever-evolving field.


About Ashley DiFranza

Qualitative Study

Qualitative research is a type of research that explores and provides deeper insights into real-world problems. Instead of collecting numerical data points or intervening or introducing treatments, as in quantitative research, qualitative research helps generate hypotheses as well as further investigate and understand quantitative data. Qualitative research gathers participants' experiences, perceptions, and behavior. It answers the hows and whys instead of how many or how much. It could be structured as a stand-alone study, purely relying on qualitative data, or it could be part of mixed-methods research that combines qualitative and quantitative data. This review introduces the readers to some basic concepts, definitions, terminology, and application of qualitative research.

Qualitative research, at its core, asks open-ended questions whose answers are not easily put into numbers, such as ‘how’ and ‘why’. Due to the open-ended nature of the research questions at hand, qualitative research design is often not linear in the same way quantitative design is. One of the strengths of qualitative research is its ability to explain processes and patterns of human behavior that can be difficult to quantify. Phenomena such as experiences, attitudes, and behaviors can be difficult to accurately capture quantitatively, whereas a qualitative approach allows participants themselves to explain how, why, or what they were thinking, feeling, and experiencing at a certain time or during an event of interest. Quantifying qualitative data certainly is possible, but at its core, qualitative research looks for themes and patterns that can be difficult to quantify. It is important to ensure that the context and narrative of qualitative work are not lost by trying to quantify something that is not meant to be quantified.

However, while qualitative research is sometimes placed in opposition to quantitative research, as if the two approaches and the philosophical paradigms associated with each ‘compete’ against each other, qualitative and quantitative work are neither opposites nor mutually exclusive. For instance, qualitative research can help expand and deepen understanding of data or results obtained from quantitative analysis. For example, say a quantitative analysis has determined that there is a correlation between length of stay and level of patient satisfaction. Why does this correlation exist? Qualitative methods can be used to explore that question. This dual-focus scenario shows one way in which qualitative and quantitative research could be integrated.

Examples of Qualitative Research Approaches

Ethnography

Ethnography as a research design has its origins in social and cultural anthropology, and involves the researcher being directly immersed in the participant’s environment. Through this immersion, the ethnographer can use a variety of data collection techniques with the aim of being able to produce a comprehensive account of the social phenomena that occurred during the research period. That is to say, the researcher’s aim with ethnography is to immerse themselves into the research population and come out of it with accounts of actions, behaviors, events, etc. through the eyes of someone involved in the population. Direct involvement of the researcher with the target population is one benefit of ethnographic research because it can then be possible to find data that is otherwise very difficult to extract and record.

Grounded Theory

Grounded Theory is the “generation of a theoretical model through the experience of observing a study population and developing a comparative analysis of their speech and behavior.” As opposed to quantitative research which is deductive and tests or verifies an existing theory, grounded theory research is inductive and therefore lends itself to research that is aiming to study social interactions or experiences. In essence, Grounded Theory’s goal is to explain for example how and why an event occurs or how and why people might behave a certain way. Through observing the population, a researcher using the Grounded Theory approach can then develop a theory to explain the phenomena of interest.

Phenomenology

Phenomenology is defined as the “study of the meaning of phenomena or the study of the particular”. At first glance, it might seem that Grounded Theory and Phenomenology are quite similar, but upon careful examination, the differences can be seen. At its core, phenomenology looks to investigate experiences from the perspective of the individual. Phenomenology is essentially looking into the ‘lived experiences’ of the participants and aims to examine how and why participants behaved a certain way, from their perspective. Herein lies one of the main differences between Grounded Theory and Phenomenology. Grounded Theory aims to develop a theory for social phenomena through an examination of various data sources, whereas Phenomenology focuses on describing and explaining an event or phenomenon from the perspective of those who have experienced it.

Narrative Research

One of qualitative research’s strengths lies in its ability to tell a story, often from the perspective of those directly involved in it. Reporting on qualitative research involves including details and descriptions of the setting involved and quotes from participants. This detail is called ‘thick’ or ‘rich’ description and is a strength of qualitative research. Narrative research is rife with the possibilities of ‘thick’ description as this approach weaves together a sequence of events, usually from just one or two individuals, in the hopes of creating a cohesive story, or narrative. While it might seem like a waste of time to focus on such a specific, individual level, understanding one or two people’s narratives for an event or phenomenon can help to inform researchers about the influences that helped shape that narrative. The tension or conflict of differing narratives can be “opportunities for innovation”.

Research Paradigm

Research paradigms are the assumptions, norms, and standards that underpin different approaches to research. Essentially, a research paradigm is the ‘worldview’ that informs research. It is valuable for researchers, both qualitative and quantitative, to understand what paradigm they are working within, because understanding the theoretical basis of research paradigms allows researchers to understand the strengths and weaknesses of the approach being used and adjust accordingly. Different paradigms have different ontologies and epistemologies. Ontology is defined as the “assumptions about the nature of reality” whereas epistemology is defined as the “assumptions about the nature of knowledge” that inform the work researchers do. It is important to understand the ontological and epistemological foundations of the research paradigm researchers are working within to allow for a full understanding of the approach being used and the assumptions that underpin the approach as a whole. Further, it is crucial that researchers understand their own ontological and epistemological assumptions about the world in general because their assumptions about the world will necessarily impact how they interact with research. A discussion of the research paradigm is not complete without describing positivist, postpositivist, and constructivist philosophies.

Positivist vs Postpositivist

To further understand qualitative research, we need to discuss positivist and postpositivist frameworks. Positivism is a philosophy that the scientific method can and should be applied to social as well as natural sciences. Essentially, positivist thinking insists that the social sciences should use natural science methods in their research. This stems from the positivist ontology that there is an objective reality that exists fully independent of our perception of the world as individuals. Quantitative research is rooted in positivist philosophy, which can be seen in the value it places on concepts such as causality, generalizability, and replicability.

Conversely, postpositivists argue that social reality can never be one hundred percent explained but it could be approximated. Indeed, qualitative researchers have been insisting that there are “fundamental limits to the extent to which the methods and procedures of the natural sciences could be applied to the social world” and therefore postpositivist philosophy is often associated with qualitative research. An example of positivist versus postpositivist values in research might be that positivist philosophies value hypothesis-testing, whereas postpositivist philosophies value the ability to formulate a substantive theory.

Constructivist

Constructivism is a subcategory of postpositivism. Most researchers invested in postpositivist research are constructivist as well, meaning they think there is no objective external reality that exists but rather that reality is constructed. Constructivism is a theoretical lens that emphasizes the dynamic nature of our world. “Constructivism contends that individuals’ views are directly influenced by their experiences, and it is these individual experiences and views that shape their perspective of reality.” Essentially, constructivist thought focuses on how ‘reality’ is not a fixed certainty and how experiences, interactions, and backgrounds give people a unique view of the world. Constructivism contends, unlike positivist views, that there is not necessarily an ‘objective’ reality we all experience. This is the ‘relativist’ ontological view that reality and the world we live in are dynamic and socially constructed. Therefore, qualitative scientific knowledge can be inductive as well as deductive.

So why is it important to understand the differences in assumptions that different philosophies and approaches to research have? Fundamentally, the assumptions underpinning the research tools a researcher selects provide an overall base for the assumptions the rest of the research will have and can even change the role of the researcher themselves. For example, is the researcher an ‘objective’ observer such as in positivist quantitative work? Or is the researcher an active participant in the research itself, as in postpositivist qualitative work? Understanding the philosophical base of the research undertaken allows researchers to fully understand the implications of their work and their role within the research, as well as reflect on their own positionality and bias as it pertains to the research they are conducting.

Data Sampling

The better the sample represents the intended study population, the more likely the researcher is to encompass the varying factors at play. The following are examples of participant sampling and selection:

Purposive sampling: selection based on the researcher’s rationale in terms of being the most informative.

Criterion sampling: selection based on pre-identified factors.

Convenience sampling: selection based on availability.

Snowball sampling: selection by referral from other participants or people who know potential participants.

Extreme case sampling: targeted selection of rare cases.

Typical case sampling: selection based on regular or average participants.
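Several of the sampling strategies listed above can be expressed as simple selection rules. The Python sketch below is illustrative only: the participant pool, its fields, and the referral map are hypothetical, and purposive sampling is left out because it depends on the researcher's judgment rather than on a mechanical rule.

```python
# Hypothetical participant pool; every field is invented for illustration.
participants = [
    {"id": 1, "age": 15, "smoker": True, "available": True},
    {"id": 2, "age": 16, "smoker": False, "available": False},
    {"id": 3, "age": 17, "smoker": True, "available": True},
    {"id": 4, "age": 14, "smoker": False, "available": True},
    {"id": 5, "age": 19, "smoker": True, "available": False},
]

# Hypothetical referral map: participant id -> ids of people they refer.
referrals = {1: [3], 3: [5], 5: []}


def criterion_sample(pool, predicate):
    """Criterion sampling: keep everyone who meets a pre-identified factor."""
    return [p for p in pool if predicate(p)]


def convenience_sample(pool, n):
    """Convenience sampling: take the first n participants who are available."""
    return [p for p in pool if p["available"]][:n]


def snowball_sample(seed_ids, referral_map, waves):
    """Snowball sampling: start from seeds and follow referrals for `waves` rounds."""
    selected = list(seed_ids)
    frontier = list(seed_ids)
    for _ in range(waves):
        next_frontier = []
        for pid in frontier:
            for referred in referral_map.get(pid, []):
                if referred not in selected:
                    selected.append(referred)
                    next_frontier.append(referred)
        frontier = next_frontier
    return selected


def _distance_from_mean(pool, key):
    """Return a function giving each participant's distance from the pool mean of `key`."""
    mean = sum(p[key] for p in pool) / len(pool)
    return lambda p: abs(p[key] - mean)


def extreme_case_sample(pool, key, n):
    """Extreme case sampling: the n participants furthest from the mean of `key`."""
    return sorted(pool, key=_distance_from_mean(pool, key), reverse=True)[:n]


def typical_case_sample(pool, key, n):
    """Typical case sampling: the n participants closest to the mean of `key`."""
    return sorted(pool, key=_distance_from_mean(pool, key))[:n]


smokers = criterion_sample(participants, lambda p: p["smoker"])
easy_to_reach = convenience_sample(participants, 2)
referred_chain = snowball_sample([1], referrals, waves=2)
oldest_outlier = extreme_case_sample(participants, "age", 1)
most_typical = typical_case_sample(participants, "age", 1)
```

Treating "extreme" and "typical" as distance from the mean of a single numeric field is a simplifying assumption; in practice a researcher may define rarity or typicality on qualitative grounds.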

Data Collection and Analysis

Qualitative research uses several techniques including interviews, focus groups, and observation. [1] [2] [3] Interviews may be unstructured, with open-ended questions on a topic, where the interviewer adapts to the responses. Structured interviews have a predetermined number of questions that every participant is asked. An interview is usually one-on-one and is appropriate for sensitive topics or topics needing an in-depth exploration. Focus groups are often held with 8-12 target participants and are used when group dynamics and collective views on a topic are desired. Researchers can be a participant-observer to share the experiences of the subject, or a non-participant or detached observer.

While quantitative research design prescribes a controlled environment for data collection, qualitative data collection may be in a central location or in the environment of the participants, depending on the study goals and design. Qualitative research can generate a large amount of data. The data is transcribed and may then be coded manually or with the use of Computer Assisted Qualitative Data Analysis Software (CAQDAS) such as ATLAS.ti or NVivo.

After the coding process, qualitative research results could be in various formats. It could be a synthesis and interpretation presented with excerpts from the data. Results also could be in the form of themes and theory or model development.
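As a rough illustration of how coded transcripts can be summarized into candidate themes, the Python sketch below tallies how many participants each code appears for. The participant labels and codes are hypothetical, and this is not how ATLAS.ti or NVivo work internally; it only shows the kind of tally a researcher might make after manual coding.

```python
from collections import Counter

# Hypothetical coded excerpts: (participant, code) pairs assigned by hand.
coded_excerpts = [
    ("P1", "peer pressure"),
    ("P1", "cost"),
    ("P2", "peer pressure"),
    ("P3", "health concerns"),
    ("P3", "peer pressure"),
    ("P3", "peer pressure"),  # the same code applied twice to one participant
]

# Count each code once per participant so a single talkative
# participant does not inflate a code's apparent prevalence.
participants_per_code = Counter(code for _, code in set(coded_excerpts))

# Codes raised by the most participants are candidates for themes.
candidate_themes = participants_per_code.most_common()
```

A tally like this only points to where themes may lie; the excerpts behind each code still need to be read and interpreted in context.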

Dissemination

To standardize and facilitate the dissemination of qualitative research outcomes, the healthcare team can use two reporting standards. The Consolidated Criteria for Reporting Qualitative Research or COREQ is a 32-item checklist for interviews and focus groups. The Standards for Reporting Qualitative Research (SRQR) is a checklist covering a wider range of qualitative research.

Examples of Application

Many times a research question will start with qualitative research. The qualitative research will help generate the research hypothesis, which can be tested with quantitative methods. After the data is collected and analyzed with quantitative methods, a set of qualitative methods can be used to dive deeper into the data for a better understanding of what the numbers truly mean and their implications. The qualitative methods can then help clarify the quantitative data and also help refine the hypothesis for future research. Furthermore, with qualitative research, researchers can explore subjects that are poorly studied with quantitative methods. These include opinions, individuals' actions, and social science research.

A good qualitative study design starts with a goal or objective. This should be clearly defined or stated. The target population needs to be specified. A method for obtaining information from the study population must be carefully detailed to ensure there are no omissions of part of the target population. A proper collection method should be selected which will help obtain the desired information without overly limiting the collected data because many times, the information sought is not well compartmentalized or obtained. Finally, the design should ensure adequate methods for analyzing the data. An example may help better clarify some of the various aspects of qualitative research.

A researcher wants to decrease the number of teenagers who smoke in their community. The researcher could begin by asking current teen smokers why they started smoking through structured or unstructured interviews (qualitative research). The researcher can also get together a group of current teenage smokers and conduct a focus group to help brainstorm factors that may have prevented them from starting to smoke (qualitative research).

In this example, the researcher has used qualitative research methods (interviews and focus groups) to generate a list of ideas about both why teens start to smoke and which factors may have prevented them from starting to smoke. Next, the researcher compiles this data. The researcher found that, hypothetically, peer pressure, health issues, cost, being considered “cool,” and rebellious behavior all might increase or decrease the likelihood of teens starting to smoke.

The researcher creates a survey asking teen participants to rank how important each of the above factors is in either starting smoking (for current smokers) or not smoking (for current non-smokers). This survey provides specific numbers (ranked importance of each factor) and is thus a quantitative research tool.
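The ranking step can be sketched in a few lines of Python. The responses below are hypothetical, and averaging ranks across one group of respondents is an assumed aggregation method; the example itself does not say how the rankings are combined, and a real analysis would treat current smokers and non-smokers separately.

```python
from statistics import mean

# Hypothetical responses: each participant ranks the factors
# from 1 (most important) to 5 (least important).
responses = [
    {"peer pressure": 1, "health issues": 2, "cost": 3, "being cool": 4, "rebellion": 5},
    {"peer pressure": 2, "health issues": 1, "cost": 4, "being cool": 3, "rebellion": 5},
    {"peer pressure": 1, "health issues": 3, "cost": 2, "being cool": 5, "rebellion": 4},
]

# Average the rank each factor received; a lower mean rank means more important.
factors = list(responses[0])
mean_rank = {factor: mean(r[factor] for r in responses) for factor in factors}

# Order factors from most to least important overall.
ranked_factors = sorted(mean_rank, key=mean_rank.get)
top_two = ranked_factors[:2]
```

With these made-up responses, `top_two` holds the one or two highest-ranked factors on which follow-up work could focus.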

The researcher can use the results of the survey to focus efforts on the one or two highest-ranked factors. Let us say the researcher found that health was the major factor that keeps teens from starting to smoke, and peer pressure was the major factor that contributed to teens starting to smoke. The researcher can go back to qualitative research methods to dive deeper into each of these for more information. The researcher wants to focus on how to keep teens from starting to smoke, so they focus on the peer pressure aspect.

The researcher can conduct interviews and/or focus groups (qualitative research) about what types and forms of peer pressure are commonly encountered, where the peer pressure comes from, and where smoking first starts. The researcher hypothetically finds that peer pressure often occurs after school at the local teen hangouts, mostly the local park. The researcher also hypothetically finds that peer pressure comes from older, current smokers who provide the cigarettes.

The researcher could further explore this finding through observation at the local teen hangouts (qualitative research), taking notes regarding who is smoking, who is not, and what observable factors are at play for peer pressure to smoke. The researcher finds a local park where many local teenagers hang out and sees that a shady, overgrown area of the park is where the smokers tend to gather. The researcher notes that the smoking teenagers buy their cigarettes from a local convenience store adjacent to the park, where the clerk does not check identification before selling cigarettes. These observations fall under qualitative research.

If the researcher returns to the park and counts how many individuals smoke in each region of the park, this numerical data would be quantitative research. Based on the researcher's efforts thus far, they conclude that local teen smoking and teenagers who start to smoke may decrease if there are fewer overgrown areas of the park and the local convenience store does not sell cigarettes to underage individuals.

The researcher could try to have the parks department reassess the shady areas to make them less conducive to smoking, or identify how to limit the sale of cigarettes to underage individuals by the convenience store. The researcher would then cycle back to qualitative methods, asking the at-risk population about their perceptions of the changes and what factors are still at play, as well as to quantitative research that includes teen smoking rates in the community and the incidence of new teen smokers, among others.

Copyright © 2022, StatPearls Publishing LLC.

Research Method

Research Problem – Types, Example and Guide

Research Problem

A Research Problem is a specific issue or concern that a researcher would like to investigate. It is the beginning phase of any project. It is important to define the research problem clearly, as it will provide guidance and focus for the rest of the research project. The research problem should be something that can be addressed through research, and it should be narrow enough to be manageable. Once the research problem has been identified, the researcher can begin to develop a research question, which will guide the rest of the project.

How to Define a Research Problem

The first step in solving a research problem is to define it. This may seem like a simple task, but it can be quite difficult. There are a few things to keep in mind when defining a research problem.

Components of a Research Problem

There are four main components of a research problem:

Research Problem Examples

Here are some examples of research problems that have been studied in the past:

Why is the research problem important?

A research problem is the main organizing principle guiding the analysis of your paper. The problem under investigation offers us an occasion for writing and a focus that governs what we want to say. It represents the core subject matter of scholarly communication and the means by which we arrive at other topics of conversation and the discoveries that result from our work.

A good research problem should have the following qualities:

It should be interesting to you and other scholars, important for the field, complex enough to merit a detailed answer, manageable given your available resources, and feasible within the time frame you have for completing your project.

Your job in writing a research paper is to clarify the problem until it can be adequately addressed in a paper of some length. In this way, formulating a research problem becomes a gateway activity to successful research.

What is the purpose of a Research problem statement?

The purpose of a research problem statement is to help focus your research. By providing a clear and concise statement of the problem, you can more easily identify relevant research and develop hypotheses. Additionally, a well-crafted problem statement will help reviewers understand your proposed research and provide feedback on its potential contribution.

About the author

Muhammad Hassan

I am Muhammad Hassan, a Researcher, Academic Writer, Web Developer, and Android App Developer. I have worked in various industries and have gained a wealth of knowledge and experience. In my spare time, I enjoy writing blog posts and articles on a variety of Academic topics. I also like to stay up-to-date with the latest trends in the IT industry to share my knowledge with others through my writing.
